Lorccan February 2016

sed to match pattern across a newline

Here's my input:

<array>
    <string>extra1</string>
    <string>extra2</string>
    <string>Yellow
5</string>

Note: there's a space and newline between "Yellow" and "5"

I am piping this to sed:

| sed -n 's#.*<string>\(.*\)</string>#\1#p'

and I am getting the output:

extra1
extra2

I know that, because sed strips the newline from the end of each input line, the newline is not there to be matched - so that accounts for the result. I have read articles on adding the next line from the buffer, but I can't work out what I need to use in the pattern match to get this to work.

The output I want is:

extra1
extra2
Yellow 5

(In case it makes a difference, I am using a Mac, so I need this to work with - I think - the FreeBSD variant of sed.)

Of course, if another tool is better for what I want to achieve I am open to suggestions! Thanks!

Answers


Cyrus February 2016

Close your array tag and try this with xmlstarlet and GNU sed:

xmlstarlet sel -t -v "//array/string" input.xml | sed '/ $/{:a;N;s/\n//;ta}'

Output:

extra1
extra2
Yellow 5
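
For anyone without GNU sed: the same N-and-join loop can be written with one -e per command, which BSD sed (the one shipped with macOS) should also accept. This is only a sketch, run on the raw XML rather than on xmlstarlet output. The sample input is rebuilt with printf so the trailing space after "Yellow" is visible, and the variable names are made up for illustration.

```shell
# Sample input from the question, with the array tag closed.
input=$(printf '<array>\n    <string>extra1</string>\n    <string>extra2</string>\n    <string>Yellow \n5</string>\n</array>\n')

# While a line opens <string> without closing it, append the next line (N),
# delete the embedded newline, and branch back to the label to re-check.
# Once the element is complete, the final s command prints its contents.
result=$(printf '%s\n' "$input" | sed -n \
    -e ':a' \
    -e '/<string>[^<]*$/{' \
    -e 'N' \
    -e 's/\n//' \
    -e 'ba' \
    -e '}' \
    -e 's#.*<string>\(.*\)</string>#\1#p')

printf '%s\n' "$result"
```

Like the original, this relies on the space that already sits before the line break, so the joined value comes out as "Yellow 5".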


anubhava February 2016

perl is available on OSX by default so you can use:

perl -0ne 's#<string>([^<]*)</string>#sub{$x=$1;$x=~tr/\n/ /;print $x."\n";}->()#eg' file.xml
extra1
extra2
Yellow 5

Alternatively you can install gnu-awk using Homebrew and use:

awk -v RS= -v FPAT='<string>([^<]*)</string>' '{for(i=1; i<=NF; i++) {
   gsub(/<\/?string>/, "", $i); gsub(/\n/, " ", $i); print $i}}' file.xml
extra1
extra2
Yellow 5


Ed Morton February 2016

Any time you start talking about "buffers" or "hold space" or sed constructs other than s, g, and p (with -n) you're simply using the wrong tool. All of that stuff for sed became obsolete in the mid-1970s when awk was invented so just use awk. Here's one way with GNU awk for multi-char RS:

$ awk -v RS='</?string>' '!(NR%2){gsub(/\n/," "); print}' file
extra1
extra2
Yellow 5

The above just prints whatever's between <string> and </string> after converting any newlines to blank chars.

With other awks one way would be:

$ cat tst.awk
{ rec = (rec=="" ? "" : rec " ") $0 }
END {
    split(rec,f,"</?string>")
    for (i=2;i in f;i+=2) {
        print f[i]
    }
}

$ awk -f tst.awk file
extra1
extra2
Yellow 5
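
To see why the loop in tst.awk starts at index 2 and steps by 2: splitting on the opening and closing tags leaves the text outside the tags in the odd-numbered elements and the tag contents in the even-numbered ones. A small sketch with a made-up one-line sample (the sample string and variable names are only for illustration):

```shell
# Hypothetical one-line sample. Splitting on <string> or </string>
# alternates between text outside the tags (odd) and inside them (even).
sample='<array><string>extra1</string><string>extra2</string></array>'

result=$(printf '%s\n' "$sample" | awk '{
    n = split($0, f, "</?string>")
    for (i = 1; i <= n; i++) printf "%d:%s\n", i, f[i]
}')

printf '%s\n' "$result"
```

Elements 2 and 4 hold extra1 and extra2; element 3 is the empty text between one closing tag and the next opening tag.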


Walter A February 2016

Join the lines and tear them apart:

tr -d "\n" < file| grep -o "<string>[^<]*</string>"|sed 's/<string>\(.*\)<\/string>/\1/'
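
Broken into stages, the pipeline above could look roughly like this. It is a sketch only: the input is rebuilt with printf so the space before the line break is explicit, and the intermediate variable names are made up.

```shell
# Sample input from the question, with the array tag closed.
input=$(printf '<array>\n    <string>extra1</string>\n    <string>extra2</string>\n    <string>Yellow \n5</string>\n</array>\n')

# Stage 1: tr deletes every newline, so the whole document becomes one line.
joined=$(printf '%s\n' "$input" | tr -d '\n')

# Stage 2: grep -o prints each <string>...</string> match on its own line.
elements=$(printf '%s\n' "$joined" | grep -o '<string>[^<]*</string>')

# Stage 3: sed keeps only the text between the tags.
result=$(printf '%s\n' "$elements" | sed 's/<string>\(.*\)<\/string>/\1/')

printf '%s\n' "$result"
```

Because tr deletes the newline instead of replacing it, "Yellow 5" keeps a single space only thanks to the space that was already in front of the line break.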


SaxDaddy February 2016

You can approach this problem with xmllint. I modified your example slightly so that you can see what's going on.

test.xml

<array>
  <string1>extra1</string1>
  <string2>extra2</string2>
  <string3>Yellow
5</string3>
</array>

Since you want the string with the line break, I made this value unique. Now use xmllint and sed to get your results:

[saxdaddy ~]$  x="$(xmllint --xpath "/array/string3" test.xml | sed '/^\/ >/d' | sed 's/<[^>]*.//g')"
[saxdaddy ~]$  echo $x
Yellow 5

xmllint's --xpath option selects the matching node from the XML, and sed then strips out the opening and closing tags. The "trick" is to quote the command substitution when assigning the variable and then leave the variable unquoted when echoing it, so that the shell's word splitting collapses the line break into a single space.
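
The quoting effect can be seen without xmllint at all. A minimal sketch, assuming a POSIX-style shell such as bash with the default IFS (the variable names are just for illustration):

```shell
# A value containing a space followed by a newline, as in the question.
x=$(printf 'Yellow \n5')

# Unquoted: the shell splits $x on whitespace and echo joins the words
# with single spaces, so the line break disappears.
unquoted=$(echo $x)

# Quoted: the value is passed through untouched, line break included.
quoted=$(echo "$x")

printf '%s\n' "$unquoted"
printf '%s\n' "$quoted"
```

The first printf shows the value on one line, the second on two. Note that unquoted expansion also collapses any other runs of whitespace and expands glob characters, so it only suits simple values like this one.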

If your target tag is not unique in the file path, then you can craft a for loop to look for $'\n' (a line break) and set that to your variable.

Post Status

Asked in February 2016
Viewed 1,407 times
Voted 5
Answered 5 times
