Bash command (sed) to remove an XML node

I'm trying to create a Bash command which deletes a particular node in XML if it contains a string starting with a particular few characters.
For example,
if my XML is like this:
<X>
<Y> abc... </y>
<Y> trf... </y>
<Y> abc... </y>
</X>
then I have to remove all Y nodes whose values start with the string abc...
In the end, only the following should remain:
<X>
<Y> trf... </y>
</X>
While searching I found that the 'sed' command does something similar with the help of regular expressions. I tried to read various similar questions on this site and some tutorials, but I'm getting overwhelmed.
I know I'm asking for a kind of spoon-feeding, but please suggest whether this can be done easily, as I have just the next few hours available before I have to start the associated activity!
Also, is there an easy tutorial on 'sed'? Whatever I have found on the net so far makes learning and understanding it a little complex.
Thanks!

If you are open to using awk, the following command prints the lines that do not contain <Y> abc:
awk '!/<Y> abc/' xml
<X>
<Y> trf... </y>
</X>
or
awk ' /<X>/,/<\/X>/ {if($0 !~ "<Y> abc") print $0}' xml

$ cat ip.txt
<X>
<Y> abc... </y>
<Y> trf... </y>
<Y> abc... </y>
</X>
$ sed '/<Y> abc/d' ip.txt
<X>
<Y> trf... </y>
</X>
/<Y> abc/ matches a line containing the pattern <Y> abc
the d command deletes the matching line
You can also use grep -v '<Y> abc' ip.txt
Further reading:
https://stackoverflow.com/tags/sed/info
How can I replace a string in a file(s)?
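The patterns above match `<Y> abc` anywhere on a line and expect exactly one space after the tag. As a minimal sketch, assuming one node per line as in the question's sample, the match can be anchored to the start of the line and made tolerant of indentation. The function name `remove_y_nodes` is hypothetical, and the prefix is assumed to contain no regex metacharacters and no "/".

```shell
# Hypothetical helper (not from the answers above): delete every <Y> line
# whose value starts with the given prefix.
# Assumes one node per line and a prefix free of regex metacharacters and "/".
remove_y_nodes() {
  # $1 = prefix, $2 = file
  sed "/^[[:space:]]*<Y>[[:space:]]*$1/d" "$2"
}

# Demo on the question's sample plus one indented node without padding.
demo=$(mktemp)
printf '%s\n' '<X>' '<Y> abc... </y>' '<Y> trf... </y>' '  <Y>abcdef</y>' '</X>' > "$demo"
remove_y_nodes abc "$demo"
rm -f "$demo"
```

For XML that is not laid out one node per line, a real XML tool is the safer choice; this only refines the line-based approach.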

Related

Get tag value of the nth occurrence of a tag in XML using sed

My XML:
<?xml version="1.0" encoding="UTF-8" ?>
<Attributes>
<Attribute>123</Attribute>
<Attribute>959595</Attribute>
<Attribute>1233</Attribute>
<Attribute>jiji</Attribute>
</Attributes>
I need to get the tag value of the second occurrence of the Attribute tag, i.e. 959595, using sed.
I used the command
sed -n ':a;$!{N;ba};s#\(<Attribute\)\(.*\)\(</Attribute>\)#\1#2#\2#p' file
My intent was: pattern one, second occurrence, pattern two, value. It doesn't work.
I don't know whether my approach is correct or not; please correct my command.
The proper way to do this is:
$ xmllint --xpath '/Attributes/Attribute[2]/text()' file.xml
NOTES
xmllint comes with libxml2.
The '2' selects the second matching element.
sed -n '/<Attributes>/,\#</Attributes># {
/<Attribute>/ {
H;g
s#.*<Attribute>\(.*\)</Attribute>.*#\1#
t found
}
b
:found
p;q
}' YourFile
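Assuming, as in your sample, that there is only one Attributes block to find, this sed stops at the first match: as written, it prints the first Attribute value (123 in the sample) and quits. (If the XML content looks exactly like your sample, the /<Attributes>/,\#</Attributes># range selection is not needed.)
This is the POSIX version, so use --posix with GNU sed.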
This sed prints all Attribute entries from the Attributes block, then takes the second entry and removes the tags:
sed -n '/<Attributes>/,\#</Attributes>#{/<Attribute>/p}' attrib.txt | sed -n '2p' | sed 's#</Attribute>##;s/<Attribute>//'
Output:
959595
Another way, without pipes, is to do everything in a single sed call. This goes to the second entry, strips the Attribute tags, and then quits:
sed -n '/<Attributes>/,\#</Attributes>#{/<Attribute>/{n;s#.*<Attribute>\(.*\)</Attribute>.*#\1#;p;q};}' attrib.txt
Or, if the number of Attribute entries changes, you can make it a bit more intuitive by extracting all values and then using sed to print the entry you want:
sed -n '/<Attributes>/,\#</Attributes>#{/<Attribute>/{s#</Attribute>##;s#<Attribute>##;p}}' attrib.txt | sed -n '2p'
You can change the 2 at the end to whichever Attribute entry you want to display, or take multiple values with sed -n '2p;3p' or sed -n '1,2p'.
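The pipe-free variant above hard-codes the second entry. A counter makes the position a parameter; here is a minimal sketch in awk, assuming one Attribute element per line as in the sample. The function name `nth_attribute` is hypothetical.

```shell
# Hypothetical helper: print the value of the n-th <Attribute> line.
# Assumes one <Attribute>...</Attribute> element per line, as in the sample.
nth_attribute() {
  # $1 = occurrence number (1-based), $2 = file
  awk -v n="$1" '
    /<Attribute>/ {                    # "<Attributes>" does not match this
      if (++count == n) {
        sub(/.*<Attribute>/, "")       # strip everything up to the open tag
        sub(/<\/Attribute>.*/, "")     # strip the close tag and the rest
        print
        exit
      }
    }' "$2"
}

demo=$(mktemp)
printf '%s\n' '<?xml version="1.0" encoding="UTF-8" ?>' '<Attributes>' \
  '<Attribute>123</Attribute>' '<Attribute>959595</Attribute>' \
  '<Attribute>1233</Attribute>' '<Attribute>jiji</Attribute>' '</Attributes>' > "$demo"
nth_attribute 2 "$demo"
rm -f "$demo"
```

If the requested occurrence does not exist, nothing is printed. Like the sed answers, this is line-based and is no substitute for an XPath query on arbitrary XML.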
I would also go the xmllint XPath way. However, there seem to be two versions in circulation: the man page at https://linux.die.net/man/1/xmllint lists no --xpath option, only one called --pattern.
Following that documentation, your call would then be
$ xmllint --pattern '/Attributes/Attribute[2]/text()' file.xml
I recommend checking your local man page to see which one applies. Note that --pattern is documented as exercising the pattern recognition engine of the streaming reader, so it may not behave as a drop-in replacement for --xpath.

sed - insert lines when text found / not found

I have an issue with sed; I need to accomplish two things with a CSV file:
in front of each line that does not start with UNES, I need to add the tag "BF2;"
at the start of the file (after UNES if present), I need to add the tag "UNH;"
Example (no UNES;)
50000024;IE15;041111;113901;verstuurd;Aangift;
50000024;IE15;041111;113901;verstuurd;Aangifte;
50000024;IE15;041111;113901;verstuurd;Aangifte;
Example (with UNES;)
UNES;
50000024;IE15;041111;113901;verstuurd;Aangift;
50000024;IE15;041111;113901;verstuurd;Aangifte;
50000024;IE15;041111;113901;verstuurd;Aangifte;
so far I have this:
sed -e 's/^\([^"UNES"]\)/BF2;\1/' | sed '/UNES/ a\UNH;'
This works as long as a UNES; tag is present. I can't figure out how to insert the UNH; when UNES is not present!
Any help is much appreciated.
Sample output:
UNES;
UNH;
BF2;50000024;IE15;041111;113901;verstuurd;Aangifte;
BF2;50000024;IE15;041111;113901;verstuurd;Aangifte;
BF2;50000024;IE15;041111;113901;verstuurd;Aangifte;
Here's how you could do it using awk:
awk 'NR==1 {if(f=/^UNES;/)print; print "UNH;"} !f{print "BF2;" $0} {f=0}' file
On the first line, if /^UNES;/ matches, print the line and set the flag f; then always print "UNH;". The second rule, !f{...}, prefixes the line with "BF2;" unless f is set, so the UNES; line itself is skipped while every other line gets the prefix. The last rule resets f to 0 after each line, so all further lines have "BF2;" added to the start.
Testing it out:
$ cat file
UNES;
50000024;IE15;041111;113901;verstuurd;Aangift;
50000024;IE15;041111;113901;verstuurd;Aangifte;
50000024;IE15;041111;113901;verstuurd;Aangifte;
$ awk 'NR==1 {if(f=/^UNES;/)print; print "UNH;"} !f{print "BF2;" $0} {f=0}' file
UNES;
UNH;
BF2;50000024;IE15;041111;113901;verstuurd;Aangift;
BF2;50000024;IE15;041111;113901;verstuurd;Aangifte;
BF2;50000024;IE15;041111;113901;verstuurd;Aangifte;
$ cat file2
50000024;IE15;041111;113901;verstuurd;Aangift;
50000024;IE15;041111;113901;verstuurd;Aangifte;
50000024;IE15;041111;113901;verstuurd;Aangifte;
$ awk 'NR==1 {if(f=/^UNES;/)print; print "UNH;"} !f{print "BF2;" $0} {f=0}' file2
UNH;
BF2;50000024;IE15;041111;113901;verstuurd;Aangift;
BF2;50000024;IE15;041111;113901;verstuurd;Aangifte;
BF2;50000024;IE15;041111;113901;verstuurd;Aangifte;
You can use this sed command:
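sed '/^UNES;$/{a\
UNH;
n};s/^/BF2;/;' file.txt
details:
/^UNES;$/a\
UNH; appends a new line after UNES; when UNES; is the whole line.
n prints the current line and replaces the pattern space with the next line.
Note that this adds UNH; only when a UNES; line is present.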
Try this; it works for me:
sed '/^UNES;$/{a\
UNH;
n};s/^[0-9]*/BF2;&/;'
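The question asks for UNH; in both cases, with or without a leading UNES; line. One way to cover both in a single sed script is to treat line 1 specially; the following is a sketch under the assumption that UNES; can only appear on the first line. The function name `tag_lines` is hypothetical.

```shell
# Hypothetical helper: add "UNH;" after a leading "UNES;" line (or at the very
# top when there is none) and prefix every other line with "BF2;".
# Assumes UNES; can only appear as the first line of the file.
# The sed script is kept flush-left so the inserted text has no indentation.
tag_lines() {
  sed '1{
/^UNES;$/!i\
UNH;
/^UNES;$/{
a\
UNH;
b
}
}
s/^/BF2;/' "$1"
}

demo=$(mktemp)
printf '%s\n' 'UNES;' '50000024;IE15;041111;113901;verstuurd;Aangift;' > "$demo"
tag_lines "$demo"
printf '%s\n' '50000024;IE15;041111;113901;verstuurd;Aangift;' > "$demo"
tag_lines "$demo"
rm -f "$demo"
```

On line 1, `i\` emits UNH; before a non-UNES line, while `a\` queues it after a UNES; line and `b` skips the BF2; substitution for that line.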

extract a substring of 11 characters from a line using sed, awk or perl

I have a file with many lines. Each line contains one of these substrings:
whatever_blablablalsfjlsdjf;asdfjlds;f/watch?v=yPrg-JN50sw&amp,whatever_blabla
or
whatever_blablabla"/watch?v=yPrg-JN50sw&amp" class=whatever_blablablavwhate
I want to extract a substring such as "yPrg-JN50sw" above.
The matching pattern is
the 11 characters after the string "/watch?v=".
How can I extract the substring?
Ideally with a sed or awk one-liner;
if not, a one-line Perl script is also OK.
You can do
grep -oP '(?<=/watch\?v=).{11}'
if your grep knows Perl regex, or
sed 's/.*\/watch?v=\(.\{11\}\).*/\1/g'
$ cat file
/watch?v=yPrg-JN50sw&amp
"/watch?v=yPrg-JN50sw&amp" class=
$
$ awk 'match($0,/\/watch\?v=/) { print substr($0,RSTART+RLENGTH,11) }' file
yPrg-JN50sw
yPrg-JN50sw
Just with the shell's parameter expansion, extract the 11 chars after "watch?v=":
while IFS= read -r line; do
tmp=${line##*watch?v=}
echo "${tmp:0:11}"
done < filename
You could use sed to remove the extraneous information:
sed 's/[^=]\+=//; s/&.*$//' file
Or with awk and sensible field separators:
awk -F '[=&]' '{print $2}' file
Contents of file:
cat <<EOF > file
/watch?v=yPrg-JN50sw&amp
"/watch?v=yPrg-JN50sw&amp" class=
EOF
Output:
yPrg-JN50sw
yPrg-JN50sw
Edit accommodating new requirements mentioned in the comments
cat <<EOF > file
<div id="" yt-grid-box "><div class="yt-lockup-thumbnail"><a href="/watch?v=0_NfNAL3Ffc" class="ux-thumb-wrap yt-uix-sessionlink yt-uix-contextlink contains-addto result-item-thumb" data-sessionlink="ved=CAMQwBs%3D&ei=CPTsy8bhqLMCFRR0fAodowXbww%3D%3D"><span class="video-thumb ux-thumb yt-thumb-default-185 "><span class="yt-thumb-clip"><span class="yt-thumb-clip-inner"><img src="//i1.ytimg.com/vi/0_NfNAL3Ffc/mqdefault.jpg" alt="Miniature" width="185" ><span class="vertical-align"></span></span></span></span><span class="video-time">5:15</span>
EOF
Use awk with sensible record separator:
awk -v RS='[=&"]' '/watch/ { getline; print }' file
Note, you should use a proper XML parser for this sort of task.
grep --perl-regexp --only-matching --regexp="(?<=/watch\\?v=)([^&]{0,11})"
Assuming your lines have exactly the format you quoted, this should work.
awk '{print substr($0,10,11)}'
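Edit: From the comment on another answer, I gather your lines are much longer and more complicated than this, in which case something more comprehensive is needed (the character class includes "-" and "_" because \w alone would stop at the hyphen in yPrg-JN50sw):
gawk '{if(match($0, "/watch\\?v=([-_[:alnum:]]+)",a)) print a[1]}'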
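For lines of arbitrary length, the same idea can be written as a sed substitution that enforces the 11-character length from the question. This is a sketch; the character class `[A-Za-z0-9_-]` is an assumption based on the two IDs shown in this thread, and `extract_video_ids` is a hypothetical name.

```shell
# Hypothetical helper: print the 11-character ID that follows "/watch?v=".
# Assumption: IDs consist of letters, digits, "_" and "-".
# Reads the named files, or stdin when none are given.
extract_video_ids() {
  # -n with the p flag prints only lines where the substitution succeeded.
  # The leading .* is greedy, so the last /watch?v= on a line wins.
  sed -nE 's#.*/watch\?v=([A-Za-z0-9_-]{11}).*#\1#p' "$@"
}

printf '%s\n' \
  'whatever;f/watch?v=yPrg-JN50sw&amp,whatever' \
  'no video link on this line' \
  '<a href="/watch?v=0_NfNAL3Ffc" class="ux-thumb-wrap">' | extract_video_ids
```

Lines without a match are dropped rather than echoed unchanged, which is the main difference from the plain `sed 's/.../\1/'` form.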

sed/awk : match a pattern and return everything between the end of the pattern and a semicolon

I have a line:
<random junk>TYPE=snp;<more random junk>
and I need to return everything between the end of TYPE= and the ; (in this case snp, but it could be any of a number of text strings).
I tried various sed / awk solutions but I can't seem to get it working. I have the feeling this is a simple problem, so sorry about that.
This seems to work:
sed 's/.*TYPE=\(.*\);.*/\1/'
EDIT:
Ah, so there can be semicolons in the random junk. Try this:
sed 's/.*TYPE=\([^;]*\);.*/\1/'
requires GNU grep:
grep -Po '(?<=TYPE=)[^;]+'
meaning: preceded by "TYPE=", find some non-semicolon characters
One way using GNU sed:
sed -r 's/.*TYPE=([^;]+).*/\1/' file.txt
Since you also tagged this awk:
$ text='<random junk>TYPE=snp;<more random junk>'
$ echo "$text" | awk -FTYPE= '{sub(/;.*/,"",$2); print $2}'
snp
$ text='foo=bar;baz=fnu;TYPE=snp;XAI=0;XAM=0'
$ echo "$text" | awk -FTYPE= '{sub(/;.*/,"",$2); print $2}'
snp
(Only using the variable to keep the lines from wrapping.)
Or, to parse this as set of variable=value pairs rather than just a string of text:
$ echo "$text" | awk -vRS=";" -F= '$1=="TYPE" {print $2}'
snp
You can also do this in pure bash, if you want:
$ t="red=blue;TYPE=snp;XAI=0.0037843;XAM=0.0170293;XAS=0.013245;XRI=0;XRM=0"
$ t=${t#*TYPE=}
$ t=${t%%;*}
$ echo "$t"
snp
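The key=value approach generalizes to any key. A minimal sketch follows, assuming the record is a clean list of semicolon-separated key=value pairs (no leading junk before the first key) and that values contain no "=". The function name `get_field` is hypothetical.

```shell
# Hypothetical helper: print the value stored under a key in a
# semicolon-separated list of key=value pairs.
# Assumes clean key=value pairs and values that contain no "=".
get_field() {
  # $1 = key, $2 = record; printf '%s' avoids a trailing newline that would
  # otherwise end up inside the last value.
  printf '%s' "$2" | awk -v key="$1" -v RS=';' -F= '$1 == key { print $2; exit }'
}

get_field TYPE 'foo=bar;baz=fnu;TYPE=snp;XAI=0;XAM=0'
```

Because the comparison is on the whole key, a field named SUBTYPE would not be mistaken for TYPE, which the plain `.*TYPE=` patterns cannot guarantee.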

help using command line to extract snippets of data on stdout

I would like the option of extracting the following string/data:
/work/foo/processed/25
/work/foo/processed/myproxy
/work/foo/processed/sample
=or=
25
myproxy
sample
But it would help if I see both.
From this output using cut or perl or anything else that would work:
Found 3 items
drwxr-xr-x - foo_hd foo_users 0 2011-03-16 18:46 /work/foo/processed/25
drwxr-xr-x - foo_hd foo_users 0 2011-04-05 07:10 /work/foo/processed/myproxy
drwxr-x--- - foo_hd testcont 0 2011-04-08 07:19 /work/foo/processed/sample
Doing a cut -d" " -f6 will get me foo_users, testcont. I tried increasing the field to higher values and I'm just not able to get what I want.
I'm not sure if cut is good for this or something like perl?
The base directories will remain static /work/foo/processed.
Also, I need the first line, "Found N items", removed. Thanks.
You can do a substitution from the beginning of the line to the first occurrence of / (non-greedily):
$ your_command | ruby -ne 'print $_.sub(/.*?\/(.*)/,"/\\1") if /\//'
/work/foo/processed/25
/work/foo/processed/myproxy
/work/foo/processed/sample
Or you can find a unique separator (field delimiter) to split on. For example, the time portion is unique, so you can split on that and take the last element (the second one).
$ ruby -ne 'print $_.split(/\s+\d+:\d+\s+/)[-1] if /\//' file
/work/foo/processed/25
/work/foo/processed/myproxy
/work/foo/processed/sample
With awk,
$ awk -F"[0-9][0-9]:[0-9][0-9]" '/\//{print $NF}' file
/work/foo/processed/25
/work/foo/processed/myproxy
/work/foo/processed/sample
perl -lanF"\s+" -e 'print $F[-1] unless /^Found/' file
Here is an explanation of the command-line switches used:
-l: remove line break from each line of input, then add one back on print
-a: auto-split each line of input into the @F array
-n: loop through each line of input
-F: the regexp pattern to use for the auto-split (with -a)
-e: the perl code to execute (for each line of input if using -n or -p)
If you want to just output the last portion of your directory path, and the basedir is always '/work/foo/processed', I would do this:
perl -nle 'print $1 if m|/work/foo/processed/(\S+)|' file
Try this:
<Your Command> | grep -P -o '[\/\.\w]+$'
Or, if the directory '/work/foo/processed' is always static:
<Your Command> | grep -P -o '\/work\/foo\/processed\/.+$'
-o : Show only the part of a matching line that matches PATTERN.
-P : Interpret PATTERN as a Perl regular expression.
In the first example, the last word of each input line is matched
(the word may also contain dots, so file names like 'text_file1.txt' are matched too).
Of course, you can change the pattern to suit your requirements.
If you know the columns will be the same, and you always list the full path name, you could try something like:
ls -l | cut -c79-
which prints everything from the 79th character to the end of the line. That might work in this exact case, but I think it would be better to find the basename of the last field. You could easily do this in awk or perl. Respond if this is not what you want and I'll add the awk and perl versions.
Take the output of your ls command and pipe it to awk:
your_command | awk -F'/' '{print $NF}'
your_command | perl -pe 's#.*/##'
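Since the question asks to see both the full path and its last component, and to drop the "Found N items" header, the pieces above can be combined in one awk filter. This is a sketch that assumes paths contain no whitespace, so the path is always the last field. The function name `list_paths` is hypothetical.

```shell
# Hypothetical helper: read the listing on stdin and print
# "<full path><TAB><last path component>" for every entry.
# Assumes paths contain no whitespace, so the path is the last field.
list_paths() {
  awk '!/^Found [0-9]+ items/ && NF {
    n = split($NF, parts, "/")      # parts[n] is the basename
    print $NF "\t" parts[n]
  }'
}

printf '%s\n' \
  'Found 3 items' \
  'drwxr-xr-x - foo_hd foo_users 0 2011-03-16 18:46 /work/foo/processed/25' \
  'drwxr-xr-x - foo_hd foo_users 0 2011-04-05 07:10 /work/foo/processed/myproxy' \
  'drwxr-x--- - foo_hd testcont 0 2011-04-08 07:19 /work/foo/processed/sample' | list_paths
```

Piping the result through `cut -f1` or `cut -f2` then yields either list on its own.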