I am trying to consolidate an email list, but I want to uniq (or uniq -i -u) by the email address only, not the entire line, so that there are no duplicates.
list 1:
Company A <companya#companya.com>
Company B <companyb#companyb.com>
Company C <companyc#companyc.com>
list 2:
firstname lastname <firstname#gmail.com>
Fake Person <companyb#companyb.com>
Joe lastnanme <joe#gmail.com>
The current output is:
Company A <companya#companya.com>
Company B <companyb#companyb.com>
Company C <companyc#companyc.com>
firstname lastname <firstname#gmail.com>
Fake Person <companyb#companyb.com>
Joe lastnanme <joe#gmail.com>
The desired output would be:
Company A <companya#companya.com>
Company B <companyb#companyb.com>
Company C <companyc#companyc.com>
firstname lastname <firstname#gmail.com>
Joe lastnanme <joe#gmail.com>
(as companyb#companyb.com is listed in both)
How can I do that?
Given your file format,
$ awk -F'[<>]' '!a[$2]++' files
will keep only the first line seen for each address between the angle brackets. Or, if there is no content after the email address, you don't even need to unwrap the angle brackets:
$ awk '!a[$NF]++' files
The same can be done with sort as well:
$ sort -t'<' -k2,2 -u files
A side effect is that the output will be sorted, which may or may not be what you want.
N.B. Both alternatives assume that angle brackets don't appear anywhere other than around the email addresses.
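For example, with the two lists saved as files named list1 and list2 (hypothetical names), the first variant keeps only the first occurrence of each address:
$ awk -F'[<>]' '!a[$2]++' list1 list2
Company A <companya#companya.com>
Company B <companyb#companyb.com>
Company C <companyc#companyc.com>
firstname lastname <firstname#gmail.com>
Joe lastnanme <joe#gmail.com>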
Here is one in awk:
$ awk '
match($0,/[a-z0-9.]+#[a-z.]+/) { # look for emailish string *
a[substr($0,RSTART,RLENGTH)]=$0 # and hash the record using the address as key
}
END { # after all are processed
for(i in a) # output them in no particular order
print a[i]
}' file2 file1 # switch order to see how it affects output
Output
Company A <companya#companya.com>
Company B <companyb#companyb.com>
Company C <companyc#companyc.com>
Joe lastnanme <joe#gmail.com>
firstname lastname <firstname#gmail.com>
The script looks for a very simple emailish string (* see the regex in the script and tune it to your liking) which it uses as the key for hashing the whole record; the last instance wins, as earlier ones are overwritten.
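If you'd rather keep the first occurrence instead of the last, a minimal variation (same assumptions about the address format; file1 and file2 stand for your two lists) would be:
$ awk '
match($0,/[a-z0-9.]+#[a-z.]+/) {   # same simple emailish match
  k=substr($0,RSTART,RLENGTH)      # extracted address
  if (!(k in a)) a[k]=$0           # keep only the first record seen for this address
}
END {
  for(i in a)                      # still in no particular order
    print a[i]
}' file1 file2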
uniq has an -f option to ignore a number of blank-delimited fields, so we can sort on the third field and then ignore the first two:
$ sort -k 3,3 infile | uniq -f 2
Company A <companya#companya.com>
Company B <companyb#companyb.com>
Company C <companyc#companyc.com>
firstname lastname <firstname#gmail.com>
Joe lastnanme <joe#gmail.com>
However, this isn't very robust: it breaks as soon as there aren't exactly two fields before the email address, since the sorting will be on the wrong field and uniq will compare the wrong fields.
Check karakfa's answer to see how uniq isn't even required here.
Alternatively, just checking for uniqueness of the last field:
awk '!e[$NF] {print; ++e[$NF]}' infile
or even shorter, stealing from karakfa, awk '!e[$NF]++' infile
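Since you mentioned uniq -i, note that these variants compare the address case-sensitively; a sketch that folds case before comparing (still assuming the address is the last field) would be:
awk '!e[tolower($NF)]++' infile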
Could you please try the following.
awk '
{
match($0,/<.*>/)
val=substr($0,RSTART,RLENGTH)
}
FNR==NR{
a[val]=$0
print
next
}
!(val in a)
' list1 list2
Explanation: here is a detailed explanation of the above code.
awk ' ##Starting awk program here.
{ ##Starting BLOCK which will be executed for both of the Input_files.
match($0,/<.*>/) ##Using the match function of awk with a regex that matches everything from < to >.
val=substr($0,RSTART,RLENGTH) ##Creating a variable named val whose value is the substring of the current line starting at RSTART with length RLENGTH, i.e. the matched string.
} ##Closing above BLOCK here.
FNR==NR{ ##Checking condition FNR==NR, which is TRUE only while the 1st Input_file, list1, is being read.
a[val]=$0 ##Creating an array named a whose index is val and whose value is the current line.
print ##Printing the current line.
next ##next will skip all further statements from here.
}
!(val in a) ##If variable val is NOT present in array a, print the current line (awk's default action for a true condition).
' list1 list2 ##Mentioning Input_file names here.
Output will be as follows.
Company A <companya#companya.com>
Company B <companyb#companyb.com>
Company C <companyc#companyc.com>
firstname lastname <firstname#gmail.com>
Joe lastnanme <joe#gmail.com>
Perhaps I don't understand the question, but you can try this awk:
awk 'NR!=FNR && $3 in a{next}{a[$3]}1' list1 list2
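For reference, on the two sample lists this prints the desired output (assuming every name is exactly two words, so the address always lands in $3):
$ awk 'NR!=FNR && $3 in a{next}{a[$3]}1' list1 list2
Company A <companya#companya.com>
Company B <companyb#companyb.com>
Company C <companyc#companyc.com>
firstname lastname <firstname#gmail.com>
Joe lastnanme <joe#gmail.com>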
I have hundreds of PDF files which I need to parse and insert into MySQL tables. I have converted one PDF file to text with pdftotext using the -layout option. The data is voter information in the following format:
1 TES1065268 2 TES1306415 3 AP281900579616
Elector's Name: DINESH ALAMPELLY Elector's Name: DHURGA PRASAD E Elector's Name: KADARI JANGAIAH
Father's Name: SRINIWASULU Father's Name: BALAIAH E Father's Name: RAMAIAH
ALAMPALLY
House No: --- House No: 00 House No: 1-1
Age: 23 Sex: Male Age: 24 Sex: Male Age: 71 Sex: Male
4 HCJ4116364 5 AP281900579174 6 AP281900582129
Elector's Name: Kadari Venkataiah Elector's Name: KADARI RAAM SWAMI Elector's Name: Kadari Lakshmamma
Father's Name: Jangaiah Father's Name: JANGAIAH Husband's Name: Ramasvami
House No: 1-1 House No: 1-1 House No: 1-1
Age: 31 Sex: Male Age: 40 Sex: Male Age: 36 Sex: Female
. . .
. . .
. . .
. . .
I need to export this data into a MySQL table named "voters". Or would it be easier to first convert it to JSON, since the data is already colon-separated?
I have tried using sed, tr, column, and fold, but was unable to reach a solution. Please help :)
This might work for you (GNU sed):
Divide the file into 3, one for each column:
sed -rn -e 's/^(.{46})(.{52})/\1\n\2\n/;h;s/\n.*//w col1' -e 'g;s/.*\n(.*)\n.*/\1/w col2' -e 'g;s/.*\n//w col3' file
Collapse each record to a comma separated line:
sed -ri.bak 'N;N;N;N;s/^\s*(\S+)\s/\1,/;s/\n/,/g;s/\s*,[^:]*:\s*/,/g;s/\s*Sex:\s*(\S+)\s*/,\1/' col{1,2,3}
Interleave records in correct sequence using paste:
paste -d'\n' col{1,2,3} >csvFile
If you want headers use:
sed 'N;N;N;N;s/Sex:/\n&/;s/\n/,/g;s/^[^,]*/Rowid,Key/;s/:[^,]*//g;q' col1.bak >headers
sed -i.bak '1e cat headers' csvFile
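Once csvFile (with its header row) exists, one way to get it into MySQL is LOAD DATA; this is only a sketch, and the database name, user, and column layout of the voters table are assumptions:
# voterdb and user are placeholders; the voters table must already exist with matching columns
mysql --local-infile=1 -u user -p voterdb -e "
  LOAD DATA LOCAL INFILE 'csvFile'
  INTO TABLE voters
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  IGNORE 1 LINES;"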
This is how I would go about it:
Use grep (or any other command) to pick out the voter ids (e.g. 1 TES1065268; the leading number 1 should be removed, which can be done later). A rough sketch of this step follows below.
a) To make this happen, append a keyword "voterid" to all the lines which contain voter ids, then use grep to extract all these ids and print them in another file as a column rather than a row.
Use grep (or any other command) to match fields like Elector's Name:, Father's Name:, etc., take the corresponding values, and print them in subsequent columns beside the voterid column in the new file.
This way we get neat column-based data. But in some places in the text file the name value is split across two lines. How should that be handled?
Could someone kindly provide additional input on this approach?
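A rough sketch of the id-extraction step (the file name voters.txt and the id pattern are assumptions based on the sample above):
# ids look like TES1065268 or AP281900579616; -o prints each match on its own line,
# so the leading serial numbers are dropped automatically
grep -oE '[A-Z]{2,3}[0-9]{6,}' voters.txt > ids.txt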
I need to generate a file.sql file from a file.csv, so I use this command:
cat file.csv |sed "s/\(.*\),\(.*\)/insert into table(value1, value2)
values\('\1','\2'\);/g" > file.sql
It works perfectly, but when the backreference number exceeds 9 (for example \10, \11, etc.) sed only takes the first digit into account (so \10 is treated as \1 followed by a literal 0) and ignores the rest.
I want to know if I missed something or if there is another way to do it.
Thank you!
EDIT:
The failing example:
My file.csv looks like
2013-04-01 04:00:52,2,37,74,40233964,3860,0,0,4878,174,3,0,0,3598,27.00,27
What I get:
insert into table
val1,val2,val3,val4,val5,val6,val7,val8,val9,val10,val11,val12,val13,val14,val15,val16
values
('2013-04-01 07:39:43',
2,37,74,36526530,3877,0,0,6080,
2013-04-01 07:39:430,2013-04-01 07:39:431,
2013-04-01 07:39:432,2013-04-01 07:39:433,
2013-04-01 07:39:434,2013-04-01 07:39:435,
2013-04-01 07:39:436);
After the ninth element I get the first one again instead of the 10th, 11th, etc.
As far as I know, sed has a limitation of supporting only 9 backreferences. That limit might have been lifted in newer versions (though I'm not sure). You are better off using perl or awk for this.
Here is how you'd do it in awk:
$ cat csv
2013-04-01 04:00:52,2,37,74,40233964,3860,0,0,4878,174,3,0,0,3598,27.00,27
$ awk 'BEGIN{FS=OFS=","}{print "insert into table values (\x27"$1"\x27",$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16 ");"}' csv
insert into table values ('2013-04-01 04:00:52',2,37,74,40233964,3860,0,0,4878,174,3,0,0,3598,27.00,27);
This is how you can do it in perl:
$ perl -ple 's/([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+)/insert into table values (\x27$1\x27,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16);/' csv
insert into table values ('2013-04-01 04:00:52',2,37,74,40233964,3860,0,0,4878,174,3,0,0,3598,27.00,27);
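If spelling out all sixteen capture groups feels tedious, an alternative perl sketch (same idea, just splitting on commas instead of capturing) would be:
$ perl -F, -lane 'print "insert into table values (\x27$F[0]\x27," . join(",", @F[1..$#F]) . ");"' csv
insert into table values ('2013-04-01 04:00:52',2,37,74,40233964,3860,0,0,4878,174,3,0,0,3598,27.00,27);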
Try an awk script (based on @JS웃's solution):
script.awk
#!/usr/bin/awk -f
# before looping the file
BEGIN{
FS="," # input separator
OFS=FS # output separator
q="\047" # single quote as a variable
}
# on each line (no pattern)
{
printf "insert into table values ("
print q $1 q ", "
print $2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16
print ");"
}
Run with
awk -f script.awk file.csv
One-liner
awk 'BEGIN{OFS=FS=","; q="\047" } { print "insert into table values (" q $1 q ", " $2","$3","$4","$5","$6","$7","$8","$9","$10","$11","$12","$13","$14","$15","$16 ");" }' file.csv
I have some data (separated by semicolon) with close to 240 rows in a text file temp1.
temp2.txt stores 204 rows of data (separated by semicolon).
I want to:
Sort the data in both files by field1, i.e. the first data field in every row.
Compare the data in both files and redirect the rows that are not equal into separate files.
Sample data:
temp1.txt
1000xyz400100xyzA00680xyz0;19722.83;19565.7;157.13;11;2.74;11.00
1000xyz400100xyzA00682xyz0;7210.68;4111.53;3099.15;216.95;1.21;216.94
1000xyz430200xyzA00651xyz0;146.70;0.00;0.00;0.00;0.00;0.00
temp2.txt
1000xyz400100xyzA00680xyz0;19722.83;19565.7;157.13;11;2.74;11.00
1000xyz400100xyzA00682xyz0;7210.68;4111.53;3099.15;216.95;1.21;216.94
The sort command I'm using:
sort -k1,1 temp1 -o temp1.tmp
sort -k1,1 temp2 -o temp2.tmp
I'd appreciate it if someone could show me how to redirect only the missing/mis-matching rows into two separate files for analysis.
Try
cat temp1 temp2 | sort -k1,1 -o tmp
# mis-matching/missing rows:
uniq -u tmp
# matching rows:
uniq -d tmp
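With the sample data above (and the sorted concatenation written to tmp as shown), that gives:
$ uniq -u tmp
1000xyz430200xyzA00651xyz0;146.70;0.00;0.00;0.00;0.00;0.00
$ uniq -d tmp
1000xyz400100xyzA00680xyz0;19722.83;19565.7;157.13;11;2.74;11.00
1000xyz400100xyzA00682xyz0;7210.68;4111.53;3099.15;216.95;1.21;216.94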
You want the difference as described at http://www.pixelbeat.org/cmdline.html#sets
sort -t';' -k1,1 temp1 temp1 temp2 | uniq -u > only_in_temp2
sort -t';' -k1,1 temp1 temp2 temp2 | uniq -u > only_in_temp1
Notes:
Use join rather than uniq, as shown at the link above, if you want to compare only particular fields (a sketch follows these notes)
If the first field is fixed width then you don't need the -t';' -k1,1 params above
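As a sketch of the join-based variant from the first note (comparing only the first field, and assuming ; never occurs inside that field), -v prints the lines of the given file that have no match in the other:
sort -t';' -k1,1 temp1 > temp1.srt
sort -t';' -k1,1 temp2 > temp2.srt
join -t';' -v1 temp1.srt temp2.srt > only_in_temp1
join -t';' -v2 temp1.srt temp2.srt > only_in_temp2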
Look at the comm command.
Using gawk, and outputting the lines in file1 that are not in file2:
awk -F";" 'FNR==NR{ a[$1]=$0;next }
( ! ( $1 in a) ) { print $0 > "afile.txt" }' file2 file1
Interchange the order of file2 and file1 to output the lines in file2 that are not in file1.
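For reference, the swapped invocation would look like this (bfile.txt is just an assumed name for the second output file):
awk -F";" 'FNR==NR{ a[$1]=$0;next }
( ! ( $1 in a) ) { print $0 > "bfile.txt" }' file1 file2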