Compare semicolon separated data in 2 files using shell script - perl

I have some data (separated by semicolon) with close to 240 rows in a text file temp1.
temp2.txt stores 204 rows of data (separated by semicolon).
I want to:
Sort the data in both files by field1, i.e. the first data field in every row.
Compare the data in both files and redirect the rows that are not equal into separate files.
Sample data:
temp1.txt
1000xyz400100xyzA00680xyz0;19722.83;19565.7;157.13;11;2.74;11.00
1000xyz400100xyzA00682xyz0;7210.68;4111.53;3099.15;216.95;1.21;216.94
1000xyz430200xyzA00651xyz0;146.70;0.00;0.00;0.00;0.00;0.00
temp2.txt
1000xyz400100xyzA00680xyz0;19722.83;19565.7;157.13;11;2.74;11.00
1000xyz400100xyzA00682xyz0;7210.68;4111.53;3099.15;216.95;1.21;216.94
The sort command I'm using:
sort -k1,1 temp1 -o temp1.tmp
sort -k1,1 temp2 -o temp2.tmp
I'd appreciate if someone could show me how to redirect only the missing/mis-matching rows into two separate files for analysis.

Try
cat temp1 temp2 | sort -k1,1 -o tmp
# mis-matching/missing rows:
uniq -u tmp
# matching rows:
uniq -d tmp

You want the difference as described at http://www.pixelbeat.org/cmdline.html#sets
sort -t';' -k1,1 temp1 temp1 temp2 | uniq -u > only_in_temp2
sort -t';' -k1,1 temp1 temp2 temp2 | uniq -u > only_in_temp1
Notes:
Use join rather than uniq, as shown at the link above if you want to compare only particular fields
If the first field is fixed width then you don't need the -t';' -k1,1 params above

Look at the comm command.
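A minimal sketch of the comm approach, assuming a row counts as matching only when the whole line is identical in both files. The helper name only_in_first is made up for illustration. comm expects sorted input, so both files are sorted first under the same locale that comm runs in:

```shell
# only_in_first A B: print the lines of file A that do not appear in file B.
# (Hypothetical helper name; it only wraps standard sort and comm.)
only_in_first() {
    tmpdir=$(mktemp -d) || return 1
    # Sort both inputs with the same collation that comm will use.
    LC_ALL=C sort "$1" > "$tmpdir/a"
    LC_ALL=C sort "$2" > "$tmpdir/b"
    # -2 and -3 suppress "only in B" and "in both", leaving lines unique to A.
    LC_ALL=C comm -23 "$tmpdir/a" "$tmpdir/b"
    rm -rf "$tmpdir"
}
```

With the files from the question, this could be called as only_in_first temp1 temp2 > only_in_temp1 and only_in_first temp2 temp1 > only_in_temp2.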

Using gawk, this outputs the lines in file1 whose first field does not appear in file2:
awk -F";" 'FNR==NR{ a[$1]=$0;next }
( ! ( $1 in a) ) { print $0 > "afile.txt" }' file2 file1
Swap the order of file2 and file1 to output the lines in file2 that are not in file1.
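Note that the awk snippet above compares only the first field, so a row whose key exists in both files but whose values differ is not reported. A sketch that also catches such mismatches, assuming the same semicolon-separated layout and unique keys within each file (the function name mismatched_rows is illustrative only):

```shell
# mismatched_rows A B: print rows of A whose key is missing from B,
# or whose full line differs from the row with the same key in B.
mismatched_rows() {
    awk -F';' '
        FNR == NR { row[$1] = $0; next }   # first file on the command line (B): remember rows by key
        !($1 in row) || row[$1] != $0      # second file (A): key missing or row differs
    ' "$2" "$1"
}
```

As with the original snippet, this assumes the file read first (B) is not empty; otherwise FNR==NR also holds for the second file.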

I am trying to filter records on a date field in "YYYYMMDD" format using awk. The source and target files are comma-separated, with a header row.

awk 'BEGIN {FS = ","};(NR>=2){($2 > "20210331");}' test1.csv > test.csv
File test1.csv:
Col1,Col2,Col3
A,20210101,JohnA
B,20210101,JohnB
G,20210501,JohnG
C,20210108,JohnC
D,20210202,JohnD
E,20210331,JohnE
F,20210401,JohnF
H,20210715,JohnH
Expected output:
Col1,Col2,Col3
G,20210501,JohnG
F,20210401,JohnF
H,20210715,JohnH
You can simply treat the dates in your shown samples like integers and compare them. In order to print the header, you need a separate condition.
awk 'BEGIN{FS=OFS=","} FNR==1{print;next} 20210331<$2' Input_file
I prefer the shorter code below:
awk 'BEGIN{FS = OFS = ","}(FNR == 1) || ($2 > 20210331)' test1.csv
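The original attempt prints nothing because the comparison sits inside an action block, where its result is simply discarded; used as a pattern, the same comparison selects which lines get printed. A small sketch that passes the cutoff date in as a variable instead of hard-coding it (the function name rows_after and the variable name cutoff are assumptions for illustration):

```shell
# rows_after CUTOFF FILE: keep the header plus the rows whose second
# column (YYYYMMDD) is later than CUTOFF.
rows_after() {
    # Adding 0 forces a numeric comparison on both sides.
    awk -F, -v cutoff="$1" 'FNR == 1 || $2 + 0 > cutoff + 0' "$2"
}
```

For the sample file this would be called as rows_after 20210331 test1.csv > test.csv.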

Remove any text between two parameters not working properly

I need to remove any data between , and ( and the "," along with it.
I'm currently using the below command.
sed -i '/,/,/(/{//!d;s/ ,$//}' test1.txt
cat test1.txt
CREATE SET TABLE EDW_EXTRC_TAB.AVER_MED_CLM_HDR_EXTRC
,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
(
EXTRC_RUN_ID INTEGER NOT NULL,
Current Output
CREATE SET TABLE EDW_EXTRC_TAB.AVER_MED_CLM_HDR_EXTRC
,NO FALLBACK (
EXTRC_RUN_ID INTEGER NOT NULL,
Expected Output:
CREATE SET TABLE EDW_EXTRC_TAB.AVER_MED_CLM_HDR_EXTRC
(
EXTRC_RUN_ID INTEGER NOT NULL,
What is wrong here ?
Any suggestions?
Thanks in advance.
Two approaches:
-- GNU sed approach:
sed -z 's/,[^(]*//' test1.txt
-- GNU awk approach:
awk -v RS= '{ sub(/,[^(]+/,"",$0) }1' test1.txt
The output:
CREATE SET TABLE EDW_EXTRC_TAB.AVER_MED_CLM_HDR_EXTRC
(
EXTRC_RUN_ID INTEGER NOT NULL,
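If GNU sed's -z option is not available, a line-based sketch in plain awk may be enough. It assumes, as in the sample, that the option list starts on a line beginning with "," and that the column list opens on a line beginning with "("; unlike the commands above it does not handle a comma in the middle of a line. The function name strip_table_options is made up here:

```shell
# strip_table_options FILE: drop the lines from the first one starting with ","
# up to (but not including) the first line starting with "(".
strip_table_options() {
    awk '
        state == 0 && /^,/  { state = 1 }   # option list begins: start skipping
        state == 1 && /^\(/ { state = 2 }   # column list begins: stop skipping for good
        state != 1                          # print everything outside the option list
    ' "$1"
}
```

The third state keeps later column lines that happen to start with a comma from being skipped.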

Complex sed Command for Insert Command

I have a bunch of php files, which have many insert commands.
In each query, I want to insert a column variable admin_id = '$admin_id',
i.e., if the query is
insert into users (ch_id, num_value) values ('2', '100')
the query should be converted to
insert into users (admin_id, ch_id, num_value) values ($admin_id, '2', '100')
To do this, I have executed the following command
sed -i 's/\(insert.*into.*\) (\(.*values\)/\1 (admin_id, \2/' *.php
and
sed -i "s/\(insert.*into.*\) values (/\1 values ('\$admin_id', /" *.php
The above worked, but I still have problems with queries of other shapes, for example ones with no where clause, i.e.,
insert into abctable (id,no)
to
insert into abctable (admin_id, id, no)
and
insert into abctable select $column from $tableperiod
to
insert into abctable select $column from $tableperiod where admin_id='$admin_id'
and
insert into abctable select $column from $tableperiod where abc != 'xyz'
to
insert into abctable select $column from $tableperiod where admin_id = '$admin_id' and abc != 'xyz'
How can I insert admin_id in these queries as well?
The queries in php files are executed by passing the query to the function in the following way:
execute_query("insert * from $table order by username");
I can find the queries that still need to be modified by executing
grep 'execute_query' *| grep insert| grep -v admin_id > stillleft.txt
I have solved it by using the following command
sed -e "s/\(query.*insert.*select.*where\)/& admin_id='\$admin_id' and /g" -e t \
-e "s/\(query.*insert.*select.*\)\")/\1 where admin_id='\$admin_id'\")/g" -e t \
-e "s/\(query.*insert.*\)(\(.*\)values (/\1(admin_id, \2values ('\$admin_id', /g" -e t \
-e "s/\(query.*insert.*(\)/& admin_id, /g" \
-i *.php
I'm not sure my testcases are right, but I think this could help you:
I changed the first statement, because I think it's easier and it matches the first and the second command of YOUR sed
sed -i 's/\(insert into .* (\)\(.*) values (\)\(.*\)) /\1admin_id, \2\$admin_id, \3/' *.php
The second (the first you are looking for) should work with the following
sed -i 's/\(insert into .* (\)\(.*) \)/\1admin_id, \2/' *.php
And the last two should work with this:
sed -i "s/\(insert into \w* select \$column from \$tableperiod\)/\1 where admin_id='\$admin_id'/" *.php
I hope this works for you. If not, please send a little more test data; I tested the commands with the text of your question as input.
If you use multiple sed commands, you'll traverse the complete file each time. You can do it in a single pass. Assuming an input file infile that looks like this:
insert into users (ch_id, num_value) values ('2', '100')
insert into abctable (id, no)
insert into abctable select $column from $tableperiod
insert into abctable select $column from $tableperiod where abc != 'xyz'
we can use the following sed script sedscr
/^insert into/ {
    s/\(([^)]*)\)(.*)\(([^)]*)\)/(admin_id, \1)\2($admin_id, \3)/
    s/^([^(]+)\(([^)]*)\)$/\1(admin_id, \2)/
    /\(.*\)/! {
        /where/s/$/ and admin_id ='$admin_id'/
        /where/!s/$/ where admin_id='$admin_id'/
    }
}
It does the following:
if a line starts with insert into, then
    for lines with two pairs of parentheses, insert admin_id in the first one and $admin_id in the second one
    for lines with one pair of parentheses at the end, insert admin_id
    if there are no parentheses, then
        if there is a "where" clause, append and admin_id ='$admin_id'
        else append where admin_id='$admin_id'
This can be called as follows:
$ sed -rf sedscr infile
insert into users (admin_id, ch_id, num_value) values ($admin_id, '2', '100')
insert into abctable (admin_id, id, no)
insert into abctable select $column from $tableperiod where admin_id='$admin_id'
insert into abctable select $column from $tableperiod where abc != 'xyz' and admin_id ='$admin_id'
If you can't use extended regular expressions (-r), the quoting of parentheses has to be inverted (every \( becomes ( and vice versa) and the + has to be replaced by \{1,\}.
Regexes such as \(([^)]*)\) look cumbersome, but they stand for "between literal parentheses, capture zero or more characters that are not a closing parenthesis", which keeps the match from running past the first closing parenthesis, much like a non-greedy match would.
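For reference, here is a sketch of what the script might look like after that translation to basic regular expressions, wrapped in a shell function so it can be run without -r. The function name add_admin_id and the temporary script file are assumptions for illustration; the sed commands are a mechanical rewrite of sedscr:

```shell
# add_admin_id FILE: BRE translation of sedscr, so no -r/-E flag is needed.
add_admin_id() {
    script=$(mktemp) || return 1
    # Quoted heredoc delimiter: the shell leaves $admin_id and backslashes alone.
    cat > "$script" <<'EOF'
/^insert into/ {
    s/(\([^)]*\))\(.*\)(\([^)]*\))/(admin_id, \1)\2($admin_id, \3)/
    s/^\([^(]\{1,\}\)(\([^)]*\))$/\1(admin_id, \2)/
    /(.*)/! {
        /where/s/$/ and admin_id ='$admin_id'/
        /where/!s/$/ where admin_id='$admin_id'/
    }
}
EOF
    sed -f "$script" "$1"
    rm -f "$script"
}
```

Calling add_admin_id infile should print the same four lines as the -r version.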

SED command Issue with values exceeding 9

I need to generate a file.sql file from a file.csv, so I use this command :
cat file.csv | sed "s/\(.*\),\(.*\)/insert into table(value1, value2) values\('\1','\2'\);/g" > file.sql
It works perfectly, but when the values exceed 9 (for example for \10, \11 etc...) it takes consideration of only the first number (which is \1 in this case) and ignores the rest.
I want to know if I missed something or if there is another way to do it.
Thank you !
EDIT :
The not working example :
My file.csv looks like
2013-04-01 04:00:52,2,37,74,40233964,3860,0,0,4878,174,3,0,0,3598,27.00,27
What I get
insert into table
val1,val2,val3,val4,val5,val6,val7,val8,val9,val10,val11,val12,val13,val14,val15,val16
values
('2013-04-01 07:39:43',
2,37,74,36526530,3877,0,0,6080,
2013-04-01 07:39:430,2013-04-01 07:39:431,
2013-04-01 07:39:432,2013-04-01 07:39:433,
2013-04-01 07:39:434,2013-04-01 07:39:435,
2013-04-01 07:39:436);
After the ninth element I get the first one instead of the 10th,11th etc...
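As far as I know, sed supports only nine back-references (\1 through \9, which is all POSIX specifies), so \10 is read as \1 followed by a literal 0. That matches your output, where the first field reappears with 0, 1, 2, ... appended. You are better off using perl or awk for this.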
Here is how you'd do in awk:
$ cat csv
2013-04-01 04:00:52,2,37,74,40233964,3860,0,0,4878,174,3,0,0,3598,27.00,27
$ awk 'BEGIN{FS=OFS=","}{print "insert into table values (\x27"$1"\x27",$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16 ");"}' csv
insert into table values ('2013-04-01 04:00:52',2,37,74,40233964,3860,0,0,4878,174,3,0,0,3598,27.00,27);
This is how you can do in perl:
$ perl -ple 's/([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+)/insert into table values (\x27$1\x27,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16);/' csv
insert into table values ('2013-04-01 04:00:52',2,37,74,40233964,3860,0,0,4878,174,3,0,0,3598,27.00,27);
Try an awk script (based on @JS웃's solution):
script.awk
#!/usr/bin/awk -f
# before looping over the file
BEGIN {
    FS = ","      # input separator
    OFS = FS      # output separator
    q = "\047"    # single quote as a variable
}
# on each line (no pattern)
{
    # printf adds no newline, so each statement stays on one line
    printf "insert into table values (%s%s%s, ", q, $1, q
    print $2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16 ");"
}
Run with
awk -f script.awk file.csv
One-liner
awk 'BEGIN{OFS=FS=","; q="\047" } { print "insert into table values (" q $1 q ", " $2","$3","$4","$5","$6","$7","$8","$9","$10","$11","$12","$13","$14","$15","$16 ");" }' file.csv
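Listing $2 through $16 by hand ties the command to exactly 16 columns. A sketch that loops over the fields instead, assuming (as in the sample) that only the first field needs quoting and that no field contains a comma or a quote; the function name csv_to_insert is made up here:

```shell
# csv_to_insert FILE: emit one INSERT statement per CSV line,
# quoting only the first field.
csv_to_insert() {
    awk -F, -v q="'" '{
        printf "insert into table values (%s%s%s", q, $1, q
        for (i = 2; i <= NF; i++)      # works for any number of columns
            printf ",%s", $i
        print ");"
    }' "$1"
}
```

Usage would be csv_to_insert file.csv > file.sql.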

need help removing time from a csv file

I'm trying to process a CSV file to make it easier to sort, and I need to remove the time and the dash from it. The file has entries like this:
James,07/20/2009-14:40:11
Steve,08/06/2006-02:34:37
John,11/03/2008-12:12:34
and parse it into this:
James,07/20/2009
Steve,08/06/2006
John,11/03/2008
I'm guessing sed is the right tool for this job?
Thanks for your help.
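In Python, you can sort on the full timestamp directly, without stripping the time first:
import csv
import datetime

# Read all rows; newline="" lets the csv module handle line endings itself.
with open("someFile.csv", newline="") as src:
    rows = list(csv.reader(src))

def byDateTime(aRow):
    # Parse the second column, e.g. "07/20/2009-14:40:11", into a datetime.
    return datetime.datetime.strptime(aRow[1], "%m/%d/%Y-%H:%M:%S")

rows.sort(key=byDateTime)

with open("sortedFile.csv", "w", newline="") as dst:
    csv.writer(dst).writerows(rows)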
cut -d '-' -f 1 file
Edit after comment:
sed 's/-[0-9][0-9]:[0-9][0-9]:[0-9][0-9]//g' file
just use awk
awk -F"," '{ split($2,_,"-"); print $1,_[1] }' OFS="," file
Yes, I think sed is the right tool for the job:
sed 's/-[:0-9]*$//' file
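Since the goal is to make sorting easier, here is a sketch that strips the time and then sorts by date in one pipeline. It assumes every date is exactly MM/DD/YYYY in the second column, so the year sits at characters 7-10 of that field; the function name strip_time_and_sort is illustrative:

```shell
# strip_time_and_sort FILE: drop the "-HH:MM:SS" suffix, then sort by date.
strip_time_and_sort() {
    sed 's/-[:0-9]*$//' "$1" |
        sort -t, -k2.7,2.10n -k2.1,2.2n -k2.4,2.5n   # year, then month, then day
}
```

Each -k option picks a character range inside field 2 and compares it numerically.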