Extracting values from an XML file using awk/sed

I have a text stream like this:
<device nid="05023CA70900" id="1" fblock="-1" type="switch" name="Appliance Home" brand="Google" active="false" energy_lo="427" />
<device nid="0501C1D82300" id="2" fblock="-1" type="switch" name="TELEVISION Home" brand="Google" active="pending" energy_lo="3272" />
from which I would like output like:
05023CA70900##1##-1##switch##Appliance Home##Google##false##427
0501C1D82300##2##-1##switch##TELEVISION Home##Google##pending##3272
There are many lines in the input, and not all of them are in this format.
How can I achieve this using awk or sed?

Following awk should work:
awk -F '"' '$1 == "<device nid=" { printf("%s##%s##%s##%s##%s##%s##%s##%s\n",
$2, $4, $6, $8, $10, $12, $14, $16)}' file
PS: It is not always the best approach to parse XML using awk/sed.
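As a quick sanity check, the command above can be exercised against a one-line sample (devices.txt is a hypothetical file name):

```shell
# Build a one-line sample input (hypothetical name: devices.txt)
cat > devices.txt <<'EOF'
<device nid="05023CA70900" id="1" fblock="-1" type="switch" name="Appliance Home" brand="Google" active="false" energy_lo="427" />
EOF

# Split on double quotes: the even-numbered fields are the attribute values
awk -F '"' '$1 == "<device nid=" {
    printf("%s##%s##%s##%s##%s##%s##%s##%s\n", $2, $4, $6, $8, $10, $12, $14, $16)
}' devices.txt
```

Splitting on `"` makes quoted values (including ones containing spaces, like name="Appliance Home") land in single fields, which is why this works where whitespace-splitting would not.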

It's very simple in Perl, so why not use Perl?
perl -lne 'push @a,/"([^"]*)"/g;print join "##",@a;undef @a' your_file
(Note the @ sigils; a # would start a comment. Matching [^"]* rather than [\S]* keeps attribute values that contain spaces, such as name="Appliance Home".)
Sample tested:
> cat temp
<device nid="05023CA70900" id="1" fblock="-1" type="switch" name="Appliance Home" brand="Google" active="false" energy_lo="427" />
<device nid="0501C1D82300" id="2" fblock="-1" type="switch" name="TELEVISION Home" brand="Google" active="pending" energy_lo="3272" />
> perl -lne 'push @a,/"([^"]*)"/g;print join "##",@a;undef @a' temp
05023CA70900##1##-1##switch##Appliance Home##Google##false##427
0501C1D82300##2##-1##switch##TELEVISION Home##Google##pending##3272
>

awk -F\" -v OFS="##" '/^<device nid=/ { print $2, $4, $6, $8, $10, $12, $14, $16 }' file
or more generally:
awk -F\" '/^<device nid=/ {for (i=2;i<=NF;i+=2) printf "%s%s",(i==2?"":"##"),$i; print ""}' file
To address your question in your comment: If you could have a tab in front of <device nid:
awk -F\" '/^\t?<device nid=/ ...'
If you meant something else, update your question and provide more representative input.

Related

Linux sed remove two patterns

I hope you're having a great day.
I want to remove two patterns: the parts that contain the word images in a text that I have.
In the file test1 I have this:
APP:Server1:files APP:Server2:images APP:Server3:misc APP:Server4:xml APP:Server5:json APP:Server6:stats APP:Server7:graphs APP:Server8:images-v2
I need to remove APP:Server2:images and APP:Server8:images-v2 ... I want this output:
APP:Server1:files APP:Server3:misc APP:Server4:xml APP:Server5:json APP:Server6:stats APP:Server7:graphs
I'm trying this:
cat test1 | sed 's/ .*images.* / /g'
You need to make sure that your wildcards do not allow spaces:
cat data | sed 's/ [^ ]*image[^ ]* / /g'
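One wrinkle: an item at the very end of the line has no trailing space, so the pattern above leaves images-v2 behind. A variant that also matches at end-of-line (a sketch, using sed -E so the group is an ERE alternation):

```shell
# Recreate the sample input file test1 from the question
printf '%s\n' 'APP:Server1:files APP:Server2:images APP:Server3:misc APP:Server4:xml APP:Server5:json APP:Server6:stats APP:Server7:graphs APP:Server8:images-v2' > test1

# ( |$) lets each match end either at a space or at end-of-line
sed -E 's/ [^ ]*images[^ ]*( |$)/\1/g' test1
```

The captured separator is put back, so a mid-line removal keeps exactly one space while an end-of-line removal keeps nothing.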
This should work for you:
sed 's/\w\{1,\}:Server[28]:\w\{1,\} //g'
\w matches word characters (letters, digits, _)
\{1,\} matches one or more of the preceding item (\w)
[28] matches either the digit 2 or 8 (inside brackets a | is literal, so [2|8] would also match |)
Note: \w does not match -, and the last item has no trailing space, so this removes APP:Server2:images but leaves APP:Server8:images-v2; the field-per-line version below handles both.
cat test.file
APP:Server1:files APP:Server2:images APP:Server3:misc APP:Server4:xml APP:Server5:json APP:Server6:stats APP:Server7:graphs APP:Server8:images-v2
The command below removes the matching items and leaves blank lines in their place:
tr ' ' '\n' < test.file |sed 's/\w\{1,\}:Server[28]:\w\{1,\}.*$//'
APP:Server1:files
APP:Server3:misc
APP:Server4:xml
APP:Server5:json
APP:Server6:stats
APP:Server7:graphs
To remove the blank lines, add a second expression to the sed command and paste the pieces back together:
tr ' ' '\n' < test.file |sed 's/\w\{1,\}:Server[28]:\w\{1,\}.*$//;/^$/d'|paste -sd ' ' -
APP:Server1:files APP:Server3:misc APP:Server4:xml APP:Server5:json APP:Server6:stats APP:Server7:graphs
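The whole pipeline can be checked end to end (a sketch; \w is a GNU sed extension):

```shell
# Recreate test.file with the sample line
printf '%s\n' 'APP:Server1:files APP:Server2:images APP:Server3:misc APP:Server4:xml APP:Server5:json APP:Server6:stats APP:Server7:graphs APP:Server8:images-v2' > test.file

# one item per line -> blank out Server2/Server8 lines -> drop blanks -> rejoin
tr ' ' '\n' < test.file \
  | sed 's/\w\{1,\}:Server[28]:\w\{1,\}.*$//;/^$/d' \
  | paste -sd ' ' -
```

The trailing .*$ is what catches the -v2 suffix that \w alone cannot match.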
GNU awk alternative:
awk 'BEGIN { RS="APP:" } $0=="" { next } { split($0,map,":");if (map[2] ~ /images/ ) { next } OFS=RS;printf " %s%s",OFS,$0 }'
Set the record separator to "APP:" and process the text in between as separate records. If a record is blank, skip it. Split the record into the array map on ":" and check whether the second element contains images; if it does, skip to the next record, otherwise print the record along with the record separator.

How to grep and replace multiple elements in multiple files on OS X

I use this grep command line on OS X.
grep -E 'Title|Amount|AwardID|FirstName|LastName' *.xml and the result is here:
<Title>ABC System</Title>
<Amount>50000</Amount>
<AwardID>1000</AwardID>
<FirstName>Name</FirstName>
<LastName>Thanks</LastName>
and now I've tried to use sed to replace the strings and finish the job, but it doesn't work.
What options should I use?
sed -i "" 's/Title//g'
Results as a txt file:
ABC System, 50000, 1000, Name, Thanks
Update
I can do it separately.
$ grep -E 'AwardID|AwardAmount|FirstName|LastName' 1433501.xml > test
$ sed -E '/AwardID|AwardAmount|FirstName|LastName/s/.*>([^<]+)<.*/\1/' test
43856
1433501
Faisal
Hossain
$ sed -E '/AwardID|AwardAmount|FirstName|LastName/s/.*>([^<]+)<.*/\1/' test | paste -sd',' -
43856,1433501,Faisal,Hossain
but when I change xxx.xml to *.xml, I need a newline between the results for each file. What should I put?
Update
AwardTable
xml sel -t -v //AwardID -o , -v //AwardAmount -nl *.xml > AwardTable.csv
InvestigatorTable
xml sel -t -v //AwardID -m '//Investigator[RoleCode = "Principal Investigator"]' -o , -v FirstName -o , -v LastName -b -o [PI] -m '//Investigator[RoleCode = "Co-Principal Investigator"]' -o , -v FirstName -o , -v LastName -b -o [CoPI] -nl *.xml
How should I get data for InvestigatorTable? How can I have following formats?
ID, Firstname, Lastname, Role
12345, FirstName, LastName, PI
12345, FirstName, LastName, Co-PI
12345, FirstName, LastName, Former-PI
xml sel -t -v //AwardID -o , -v //AwardAmount -m '//Investigator[RoleCode = "Principal Investigator"]' -o , -v FirstName -o , -v LastName -o [PI] -b -m '//Investigator[RoleCode = "Former Principal Investigator"]' -o , -v FirstName -o , -v LastName -o [FoPI] -b -m '//Investigator[RoleCode = "Co-Principal Investigator"]' -o , -v FirstName -o , -v LastName -o [CoPI] -b -nl *.xml
I can get like this
1417948,93147,M. Lee,Allison[PI],Jennifer,Arrigo[CoPI],Cynthia,Chandler[CoPI],Kerstin,Lehnert[CoPI]
1417966,574209,Robb,Lindgren[PI]
1418062,253000,Julia,Coonrod[PI],Gary,Harrison[FoPI]
I can do it manually now, but please help me automate it.
Update
Please help me get results with this structure:
AwardID, FirstName, LastName, Role
Here is another way to do it:
sed -nE '/Title|Amount|AwardID|FirstName|LastName/s/.*>([^<]+)<.*/\1/p' *.xml | paste -sd',' -
With your sample data, it gave the following output:
$ sed -nE '/Title|Amount|AwardID|FirstName|LastName/s/.*>([^<]+)<.*/\1/p' xmlfile | paste -sd',' -
Collaborative Research: Using the Rurutu hotspot to evaluate mantle motion and absolute plate motion models,137715,1433097,Jasper,Konter
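With the five sample lines from the question, the same pipeline can be reproduced like this (reusing the xmlfile name from the run above):

```shell
# Recreate the sample tag lines from the question
cat > xmlfile <<'EOF'
<Title>ABC System</Title>
<Amount>50000</Amount>
<AwardID>1000</AwardID>
<FirstName>Name</FirstName>
<LastName>Thanks</LastName>
EOF

# keep matching lines, strip the surrounding tags, join with commas
sed -nE '/Title|Amount|AwardID|FirstName|LastName/s/.*>([^<]+)<.*/\1/p' xmlfile | paste -sd',' -
```

The greedy .*> eats everything up to the last > before the text, so only the element content survives.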
awk would do it:
awk -v ORS=", " -F '[<>]' '
/Title|Amount|AwardID|FirstName|LastName/ {print $3}
END {printf "\b\b \n"}
' << EOF
<Title>ABC System</Title>
<Amount>50000</Amount>
<AwardID>1000</AwardID>
<FirstName>Name</FirstName>
<LastName>Thanks</LastName>
EOF
ABC System, 50000, 1000, Name, Thanks
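Note the \b backspaces only erase the trailing separator on a terminal; redirected to a file they become literal control characters. A variant that accumulates the values in a variable avoids the trick entirely (a sketch; sample.xml is a stand-in name):

```shell
# Same sample tags as in the heredoc above
cat > sample.xml <<'EOF'
<Title>ABC System</Title>
<Amount>50000</Amount>
<AwardID>1000</AwardID>
<FirstName>Name</FirstName>
<LastName>Thanks</LastName>
EOF

# accumulate matching values, emit one clean line at the end
awk -F '[<>]' '
/Title|Amount|AwardID|FirstName|LastName/ { line = line (line ? ", " : "") $3 }
END { print line }
' sample.xml
```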
With multiple files, I assume you want a newline for each file. GNU awk v4 has an extension: ENDFILE
gawk -v ORS=", " -F '[<>]' '
/Title|Amount|AwardID|FirstName|LastName/ {print $3}
ENDFILE {printf "\b\b \n"}
' *.xml
otherwise it's a bit more work:
awk -v ORS=", " -F '[<>]' '
/Title|Amount|AwardID|FirstName|LastName/ {print $3}
FNR == 1 && FILENAME != ARGV[1] {printf "\b\b \n"}
END {printf "\b\b \n"}
' *.xml
For robustness, you should be using an XML parser or XSLT transformation.
Given your sample xml files, here's a solution using xmlstarlet, an xml processing tool I like:
xmlstarlet sel -t -v //AwardTitle -o , -v //AwardAmount -o , -v //AwardID -m //Investigator -o , -v FirstName -o , -v LastName -b -nl 1419538.xml 1424234.xml
IBDR: Workshop on Successful Approaches for Development and Dissemination of Instrumentation for Biological Research - May 1-2, 2014; Rosslyn, VA,49990,1419538,Sameer,Sonkusale,Valencia,Koomson,Eduardo,Rosa-Molinar
RAPID: Role of Physical, Chemical and Diffusion Properties of 4-Methyl-cyclohexane methanol in Remediating Contaminated Water and Water Pipes,49999,1424234,Daniel,Gallagher,Andrea,Dietrich,Paolo,Scardina
If you want to use another XSLT tool, here's the generated stylesheet:
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:exslt="http://exslt.org/common" version="1.0" extension-element-prefixes="exslt">
<xsl:output omit-xml-declaration="yes" indent="no"/>
<xsl:template match="/">
<xsl:call-template name="value-of-template">
<xsl:with-param name="select" select="//AwardTitle"/>
</xsl:call-template>
<xsl:text>,</xsl:text>
<xsl:call-template name="value-of-template">
<xsl:with-param name="select" select="//AwardAmount"/>
</xsl:call-template>
<xsl:text>,</xsl:text>
<xsl:call-template name="value-of-template">
<xsl:with-param name="select" select="//AwardID"/>
</xsl:call-template>
<xsl:for-each select="//Investigator">
<xsl:text>,</xsl:text>
<xsl:call-template name="value-of-template">
<xsl:with-param name="select" select="FirstName"/>
</xsl:call-template>
<xsl:text>,</xsl:text>
<xsl:call-template name="value-of-template">
<xsl:with-param name="select" select="LastName"/>
</xsl:call-template>
</xsl:for-each>
<xsl:value-of select="'
'"/>
</xsl:template>
<xsl:template name="value-of-template">
<xsl:param name="select"/>
<xsl:value-of select="$select"/>
<xsl:for-each select="exslt:node-set($select)[position()>1]">
<xsl:value-of select="'
'"/>
<xsl:value-of select="."/>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The schema is not great. Specifically, it's not flexible: what if there are more than five investigators? Something like this pair of tables would be simpler:
Award table: id, title, amount
AwardInvestigators table: award_id, firstname, lastname, role
BTW, I read the question more carefully and have amended my xmlstarlet command a bit to ensure the Principal Investigator's name comes first:
xmlstarlet sel -t \
-v //AwardID -o , -v //AwardAmount \
-m '//Investigator[RoleCode = "Principal Investigator"]' -o , -v FirstName -o , -v LastName -b \
-m '//Investigator[RoleCode = "Co-Principal Investigator"]' -o , -v FirstName -o , -v LastName -b \
-nl \
*.xml

Convert multiple columns to rows in a comma-delimited CSV

I want to convert multiple columns to rows, e.g.
Input:
A,0,10,12,14,16,2,
B,10,10P
Output:
A,0,0
A,10,10
A,12,12
A,14,14
A,16,16
A,2,2
B,10,10
B,10P,10P
I tried the following, but I'm not sure how to repeat the first column for each value:
awk '{FS=",";OFS="\n"}{print $1, $2, $3, $4, $5}' filename
This can be a way:
$ awk 'BEGIN{FS=OFS=","}{for (i=2;i<=NF; i++) print $1, $i, $i}' file
A,0,0
A,10,10
A,12,12
A,14,14
A,16,16
A,2,2
B,10,10
B,10P,10P
Explanation
BEGIN{FS=OFS=","} set input and output field separator as comma.
for (i=2;i<=NF; i++) print $1, $i, $i loop through all fields from 2nd, printing the first field plus the i-th twice.
Note that your attempt awk '{FS=",";OFS="\n"}{print $1, $2, $3, $4, $5}' filename sets FS and OFS on every line (and an FS assigned in the main block only takes effect from the next record, since the current line is already split); it's better to do it once, in the BEGIN{} block.
This might work for you (GNU sed):
sed -r 's/^([^,]+)(,[^,]+)/&\2\n\1/;//P;D' file
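How it works: the substitution turns A,0,10,... into A,0,0 followed by a newline and A,10,...; P prints up to the newline, and D restarts the cycle with the remainder. A quick check (GNU sed, for -r and \n in the replacement):

```shell
# Recreate the sample input
printf '%s\n' 'A,0,10,12,14,16,2,' 'B,10,10P' > file

# duplicate the second field, print the first line, loop on the rest
sed -r 's/^([^,]+)(,[^,]+)/&\2\n\1/;//P;D' file
```

The trailing empty field in A,...,2, is silently dropped because the substitution needs at least one character after the comma.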

sed command issue with backreference numbers exceeding 9

I need to generate a file.sql file from a file.csv, so I use this command :
cat file.csv |sed "s/\(.*\),\(.*\)/insert into table(value1, value2)
values\('\1','\2'\);/g" > file.sql
It works perfectly, but when the backreference number exceeds 9 (for example \10, \11, etc.) sed takes only the first digit (\1 in this case) and treats the remaining digits as literal text.
I want to know if I missed something or if there is another way to do it.
Thank you !
EDIT :
The not working example :
My file.csv looks like
2013-04-01 04:00:52,2,37,74,40233964,3860,0,0,4878,174,3,0,0,3598,27.00,27
What I get
insert into table
val1,val2,val3,val4,val5,val6,val7,val8,val9,val10,val11,val12,val13,val14,val15,val16
values
('2013-04-01 07:39:43',
2,37,74,36526530,3877,0,0,6080,
2013-04-01 07:39:430,2013-04-01 07:39:431,
2013-04-01 07:39:432,2013-04-01 07:39:433,
2013-04-01 07:39:434,2013-04-01 07:39:435,
2013-04-01 07:39:436);
After the ninth element I get the first one (followed by a literal digit) instead of the 10th, 11th, etc.
As far as I know, sed supports at most 9 backreferences; \10 is parsed as \1 followed by a literal 0. You are better off using perl or awk for this.
Here is how you'd do in awk:
$ cat csv
2013-04-01 04:00:52,2,37,74,40233964,3860,0,0,4878,174,3,0,0,3598,27.00,27
$ awk 'BEGIN{FS=OFS=","}{print "insert into table values (\x27"$1"\x27",$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16 ");"}' csv
insert into table values ('2013-04-01 04:00:52',2,37,74,40233964,3860,0,0,4878,174,3,0,0,3598,27.00,27);
This is how you can do in perl:
$ perl -ple 's/([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+)/insert into table values (\x27$1\x27,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16);/' csv
insert into table values ('2013-04-01 04:00:52',2,37,74,40233964,3860,0,0,4878,174,3,0,0,3598,27.00,27);
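Since only the first field needs quoting in this sample, you can also sidestep high-numbered backreferences in sed entirely with just two groups (a sketch):

```shell
# Recreate the sample CSV line
printf '%s\n' '2013-04-01 04:00:52,2,37,74,40233964,3860,0,0,4878,174,3,0,0,3598,27.00,27' > file.csv

# group 1 = first field (quoted), group 2 = everything after the first comma
sed "s/^\([^,]*\),\(.*\)/insert into table values ('\1',\2);/" file.csv
```

This avoids the 9-backreference limit because the remaining fields are carried through as one block.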
Try an awk script (based on @JS웃's solution):
script.awk
#!/usr/bin/awk -f
# before looping the file
BEGIN{
FS="," # input separator
OFS=FS # output separator
q="\047" # single quote as a variable
}
# on each line (no pattern)
{
  printf "insert into table values (%s%s%s", q, $1, q
  # append the remaining fields, comma-separated, then close the statement
  # (printf, not print, so no newline is inserted mid-statement)
  for (i = 2; i <= NF; i++) printf "%s%s", OFS, $i
  printf ");\n"
}
Run with
awk -f script.awk file.csv
One-liner
awk 'BEGIN{OFS=FS=","; q="\047" } { printf "insert into table values (" q $1 q "," $2","$3","$4","$5","$6","$7","$8","$9","$10","$11","$12","$13","$14","$15","$16 ");\n" }' file.csv

Using sed / awk to process a file in stanza format

I have a file in stanza format. Example of the file are as below.
id_1:
id=241
pgrp=staff
groups=staff
home=/home/id_1
shell=/usr/bin/ks
id_2:
id=242
pgrp=staff
groups=staff
home=/home/id_2
shell=/usr/bin/ks
How do I use sed or awk to process it and return only the id name, id and groups in a single line and tab delimited format? e.g.:
id_1 241 staff
id_2 242 staff
with awk:
BEGIN { FS="=" }
$1 ~ /^id_/   { sub(/:$/, "", $1); printf("%s", $1) }
$1 == "id"    { printf("\t%s", $2) }
$1 ~ /groups/ { printf("\t%s\n", $2) }
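For an end-to-end check, rules along these lines can be saved to a file and run against sample data (prog.awk and data.txt are stand-in names; the sub() strips the trailing colon from the id_N: header line):

```shell
# Sample stanza input
cat > data.txt <<'EOF'
id_1:
id=241
pgrp=staff
groups=staff
home=/home/id_1
shell=/usr/bin/ks
id_2:
id=242
pgrp=staff
groups=staff
home=/home/id_2
shell=/usr/bin/ks
EOF

# The program (hypothetical file name: prog.awk)
cat > prog.awk <<'EOF'
BEGIN { FS = "=" }
$1 ~ /^id_/   { sub(/:$/, "", $1); printf("%s", $1) }
$1 == "id"    { printf("\t%s", $2) }
$1 ~ /groups/ { printf("\t%s\n", $2) }
EOF

awk -f prog.awk data.txt
```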
Here is an awk solution:
translate.awk
#!/usr/bin/awk -f
{
if(match($1, /[^=]:[ ]*$/)){
id_=$1
sub(/:/,"",id_)
}
if(match($1,/id=/)){
split($1,p,"=")
id=p[2]
}
if(match($1,/groups=/)){
split($1,p,"=")
print id_ "\t" id "\t" p[2]
}
}
Execute it either by:
chmod +x translate.awk
./translate.awk data.txt
or
awk -f translate.awk data.txt
For completeness, here comes a shortened version:
#!/usr/bin/awk -f
$1 ~ /[^=]:[ ]*$/ { sub(/:$/, "", $1); printf "%s\t", $1; FS="=" }
$1 == "id"        { printf "%s\t", $2 }
$1 == "groups"    { print $2 }
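A self-contained check of this shortened approach (short.awk and stanza.txt are stand-in names; comparing $1 to "id" exactly keeps the id_N: header line from also matching the second rule):

```shell
# Sample stanza input (stand-in name: stanza.txt)
cat > stanza.txt <<'EOF'
id_1:
id=241
pgrp=staff
groups=staff
home=/home/id_1
shell=/usr/bin/ks
id_2:
id=242
pgrp=staff
groups=staff
home=/home/id_2
shell=/usr/bin/ks
EOF

# Switching FS inside a rule only affects subsequent lines,
# which is exactly what we want after the id_N: header
cat > short.awk <<'EOF'
$1 ~ /[^=]:[ ]*$/ { sub(/:$/, "", $1); printf "%s\t", $1; FS = "=" }
$1 == "id"        { printf "%s\t", $2 }
$1 == "groups"    { print $2 }
EOF

awk -f short.awk stanza.txt
```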
sed 'N;N;N;N;N;y/=\n/ /' data.txt | awk -v OFS='\t' '{sub(/:$/,"",$1); print $1,$3,$7}'
Here is the one-liner approach by setting RS:
awk 'NR>1{print "id_"++i,$3,$7}' RS='id_[0-9]+:' FS='[=\n]' OFS='\t' file
id_1 241 staff
id_2 242 staff
Requires GNU awk and assumes the IDs are in increasing order starting at 1.
If the ordering of the ID's is arbitrary:
awk '!/shell/&&NR>1{gsub(/:/,"",$1);print "id_"$1,$3,$5}' RS='id_' FS='[=\n]' OFS='\t' file
id_1 241 staff
id_2 242 staff
awk -F"=" '/id_/{split($0,a,":");}/id=/{i=$2}/groups/{printf a[1]"\t"i"\t"$2"\n"}' your_file
tested below:
> cat temp
id_1:
id=241
pgrp=staff
groups=staff
home=/home/id_1
shell=/usr/bin/ks
id_2:
id=242
pgrp=staff
groups=staff
home=/home/id_2
shell=/usr/bin/ks
> awk -F"=" '/id_/{split($0,a,":");}/id=/{i=$2}/groups/{printf a[1]"\t"i"\t"$2"\n"}' temp
id_1 241 staff
id_2 242 staff
This might work for you (GNU sed):
sed -rn '/^[^ :]+:/{N;N;N;s/:.*id=(\S+).*groups=(\S+).*/\t\1\t\2/p}' file
Look for a line holding an id then get the next 3 lines and re-arrange the output.
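A standalone check of this approach (GNU sed, which provides \S and \t; stanzas.txt is a stand-in name):

```shell
# Sample stanza input (stand-in name: stanzas.txt)
cat > stanzas.txt <<'EOF'
id_1:
id=241
pgrp=staff
groups=staff
home=/home/id_1
shell=/usr/bin/ks
id_2:
id=242
pgrp=staff
groups=staff
home=/home/id_2
shell=/usr/bin/ks
EOF

# N;N;N pulls the id, pgrp and groups lines into the pattern space;
# in sed, . matches the embedded newlines, so one s/// spans all four lines
sed -rn '/^[^ :]+:/{N;N;N;s/:.*id=(\S+).*groups=(\S+).*/\t\1\t\2/p}' stanzas.txt
```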