I have some text files. I need to do the subtraction between second and fourth columns in each file. The subtracted values should print to the original files as fifth column. How can I do this with awk or sed?
HII 62.0 HII 35.1
MEE 21.3 MEE 21.3
GLL 42.3 GLL 18.5
ASS 105.9 ASS 105.9
RRG 65.6
GLL 48.3
SES 83.5
Desired output
HII 62.0 HII 35.1 26.9
MEE 21.3 MEE 21.3 0
GLL 42.3 GLL 18.5 23.8
ASS 105.9 ASS 105.9 0
RRG 65.6
GLL 48.3
SES 83.5
If the third and fourth columns are blank, no need to subtract.
awk 'NF == 2 { print }
NF == 4 { print $0, $2 - $4 }'
That could all be fitted onto one line, but it is clearer what it is doing when it is spread over two lines.
If you want more control over the format, you can use printf() instead of just print.
After sanitizing trailing spaces in the data, it produces:
HII 62.0 HII 35.1 26.9
MEE 21.3 MEE 21.3 0
GLL 42.3 GLL 18.5 23.8
ASS 105.9 ASS 105.9 0
RRG 65.6
GLL 48.3
SES 83.5
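For comparison, here is a minimal Python sketch of the same logic, assuming whitespace-separated columns. The function name `append_difference` is made up for illustration, and the `.6g` format is chosen to approximate awk's default number output. Unlike the awk script above, it passes through any line that does not have exactly four fields rather than dropping it.

```python
def append_difference(lines):
    """Append (column 2 - column 4) to rows with four columns; keep other rows as they are."""
    out = []
    for line in lines:
        fields = line.split()  # splits on any whitespace, so trailing spaces are ignored
        if len(fields) == 4:
            diff = float(fields[1]) - float(fields[3])
            # '.6g' approximates awk's default output format for numbers
            out.append(f"{line.rstrip()} {diff:.6g}")
        else:
            out.append(line.rstrip())
    return out


sample = ["HII 62.0 HII 35.1", "MEE 21.3 MEE 21.3", "RRG 65.6"]
print("\n".join(append_difference(sample)))
```

To change a file in place you would read its lines, pass them through the function, and write the result back to the same path.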
This might work for you (GNU sed & Bash):
sed -ri '/^\S+\s+(\S+)\s+\S+\s+(\S+)/s//echo "&\t$(echo \1-\2|bc)"/e' file
Am quite new in the Unix field and I am currently trying to extract data set from a text file. I tried with sed, grep, awk but it seems to only work with extracting lines, but I want to extract an entire dataset... Here is an example of file from which I'd like to extract the 2 data sets (figures after the lines "R.Time Intensity")
[Header]
Application Name LabSolutions
Version 5.87
Data File Name C:\LabSolutions\Data\Antoine\170921_AC_FluoSpectra\069_WT3a derivatized lignin LiCl 430_GPC_FOREVER_430_049.lcd
Output Date 2017-10-12
Output Time 12:07:32
[Configuration]
Instrument Name BOTAN127-Instrument1
Instrument # 1
Line # 1
# of Detectors 3
Detector ID Detector A Detector B PDA
Detector Name Detector A Detector B PDA
# of Channels 1 1 2
[LC Chromatogram(Detector A-Ch1)]
Interval(msec) 500
# of Points 9603
Start Time(min) 0,000
End Time(min) 80,017
Intensity Units mV
Intensity Multiplier 0,001
Ex. Wavelength(nm) 405
Em. Wavelength(nm) 430
R.Time (min) Intensity
0,00000 -709779
0,00833 -709779
0,01667 17
0,02500 3
0,03333 7
0,04167 19
0,05000 9
0,05833 5
0,06667 2
0,07500 24
0,08333 48
[LC Chromatogram(Detector B-Ch1)]
Interval(msec) 500
# of Points 9603
Start Time(min) 0,000
End Time(min) 80,017
Intensity Units mV
Intensity Multiplier 0,001
R.Time (min) Intensity
0,00000 149
0,00833 149
0,01667 -1
I would greatly appreciate any idea. Thanks in advance.
Antoine
awk '/^[^0-9]/&&d{d=0} /R.Time/{d=1}d' file
Brief explanation,
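Set d as a flag that decides whether the current line is printed
/^[^0-9]/&&d{d=0}: if the line matches the regex ^[^0-9] (it does not start with a digit) and d is set, clear d
/R.Time/{d=1}: if the line contains "R.Time", set d
d: a bare pattern with no action prints the line whenever d is set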
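The same flag idea can be sketched in Python. This is only an illustration under the assumption that data rows always start with a digit. The function name `extract_datasets` is hypothetical. Unlike the awk one-liner, it skips the "R.Time" header line and returns each dataset as a separate list.

```python
def extract_datasets(lines):
    """Return one list of data rows for every "R.Time" header found in lines."""
    datasets = []
    collecting = False  # plays the role of the awk flag d
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("R.Time"):
            datasets.append([])  # start a new dataset; the header itself is not stored
            collecting = True
        elif collecting and line[:1].isdigit():
            datasets[-1].append(line)
        else:
            collecting = False  # any non-numeric line ends the current dataset
    return datasets
```

Calling `extract_datasets(open("file"))` would give a list with one entry per detector block.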
awk '/R.Time/,/LC/' file|grep -v -E "R.Time|LC"
grep part will remove the R.Time and LC lines that come as a part of the output from awk
I think it's a job for sed.
sed '/R.Time/!d;:A;N;/\n$/!bA' infile
I tried to use
sed 's/ */:/' file | awk -F: '{ if (arr[$1":"$2]) print "\""$1"\":"$2; else { arr[$1":"$2]++; print $0 }}'
but cannot get ideal output. Thanks.
The following is the file information and the desired output that I want.
Text File:
Jon DeLoach:408-253-3122:123 Park St., San Jose, CA 04086:7/25/53:85100
Karen Evich:284-758-2857:23 Edgecliff Place, Lincoln, NB 92086:7/25/53:85100
Karen Evich:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
Karen Evich:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
Fred Fardbarkle:674-843-1385:20 Parak Lane, DeLuth, MN 23850:4/12/23:780900
Fred Fardbarkle:674-843-1385:20 Parak Lane, DeLuth, MN 23850:4/12/23:780900
Lori Gortz:327-832-5728:3465 Mirlo Street, Peabody, MA 34756:10/2/65:35200
Paco Gutierrez:835-365-1284:454 Easy Street, Decatur, IL 75732:2/28/53:123500
Paco Gutierrez:835-365-1284:454 Easy Street, Decatur, IL 75732:2/28/53:123500
Jesse Neal:408-233-8971:45 Rose Terrace, San Francisco, CA 92303:2/3/36:25000
Jesse Neal:408-233-8971:45 Rose Terrace, San Francisco, CA 92303:2/3/36:25000
Zippy Pinhead:834-823-8319:2356 Bizarro Ave., Farmount, IL 84357:1/1/67:89500
Required output: Add stars indicating duplicated names
Jon DeLoach:408-253-3122:123 Park St., San Jose, CA 04086:7/25/53:85100
*Karen Evich*:284-758-2857:23 Edgecliff Place, Lincoln, NB 92086:7/25/53:85100
*Karen Evich*:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
*Karen Evich*:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
*Fred Fardbarkle*:674-843-1385:20 Parak Lane, DeLuth, MN 23850:4/12/23:780900
*Fred Fardbarkle*:674-843-1385:20 Parak Lane, DeLuth, MN 23850:4/12/23:780900
Lori Gortz:327-832-5728:3465 Mirlo Street, Peabody, MA 34756:10/2/65:35200
*Paco Gutierrez*:835-365-1284:454 Easy Street, Decatur, IL 75732:2/28/53:123500
*Paco Gutierrez*:835-365-1284:454 Easy Street, Decatur, IL 75732:2/28/53:123500
*Jesse Neal*:408-233-8971:45 Rose Terrace, San Francisco, CA 92303:2/3/36:25000
*Jesse Neal*:408-233-8971:45 Rose Terrace, San Francisco, CA 92303:2/3/36:25000
Zippy Pinhead:834-823-8319:2356 Bizarro Ave., Farmount, IL 84357:1/1/67:89500
Give a test to this. Seems to work ok.
$ awk -F":" 'NR==FNR{a[$1]++;next}(a[$1]>1){sub($1,"*" $1 "*")}1' file1 file1
Explanation:
This code reads the same file twice, which may carry a performance penalty depending on the file size.
-F":" : Global Input Fields Delimiter is defined as :
NR==FNR{a[$1]++;next} : The code in { } is executed when NR==FNR = the first file is read by awk
a[$1]++ : Creates an array a with index $1 and value ++ => +1 for each $1 found. So for record 1 we have a[Jon DeLoach]=1. For Record2 a[Karen Evich]=1, for record 3 a[Karen Evich]++ => 2,etc
next : instructs awk to go to the next record and skip the rest of the script.
(a[$1]>1){sub($1,"*" $1 "*")}1 : This condition & action is performed on the second file. For each a[$1] found in second file with a value >1 (as has been finalized when the first file finished), we insert * around $1 using awk sub function. sub function applies substitution directly to $0 = Whole record.
1 : prints the whole record of the second file.
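A rough Python equivalent of the two-pass approach, in case it helps to see it spelled out. The function name `star_duplicate_names` is invented for this sketch, and it assumes the name is everything before the first colon.

```python
from collections import Counter


def star_duplicate_names(lines):
    """Wrap the name field in * when the same name occurs on more than one line."""
    # First pass: count how often each name (the text before the first colon) appears.
    counts = Counter(line.split(":", 1)[0] for line in lines)
    # Second pass: rewrite only the lines whose name was counted more than once.
    result = []
    for line in lines:
        name, sep, rest = line.partition(":")
        if counts[name] > 1:
            line = f"*{name}*{sep}{rest}"
        result.append(line)
    return result
```

Because the name is rebuilt from its parts rather than substituted, it is never interpreted as a regular expression, which is what awk's sub() does with its first argument.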
SED question
I need to print any lines that contain 11 for November or 12 for December.
My two questions are:
How do I search for more than one item I.E. print lines with the value 11 and 12?
How do I tell the search to look in column 4 which has the dates?
What I have so far:
sed -n -e '/11/,/12/p' datebook
File datebook:
Steve Blenheim:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
Betty Boop:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23:14500
Igor Chevsky:385-375-8395:3567 Populus Place, Caldwell, NJ 23875:6/18/68:23400
Norma Corder:397-857-2735:74 Pine Street, Dearborn, MI 23874:3/28/45:245700
Jennifer Cowan:548-834-2348:583 Laurel Ave., Kingsville, TX 83745:10/1/35:58900
Jon DeLoach:408-253-3122:123 Park St., San Jose, CA 04086:7/25/53:85100
Karen Evich:284-758-2857:23 Edgecliff Place, Lincoln, NB 92086:7/25/53:85100
Karen Evich:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
Karen Evich:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
Fred Fardbarkle:674-843-1385:20 Parak Lane, DeLuth, MN 23850:4/12/23:780900
Fred Fardbarkle:674-843-1385:20 Parak Lane, DeLuth, MN 23850:4/12/23:780900
Lori Gortz:327-832-5728:3465 Mirlo Street, Peabody, MA 34756:10/2/65:35200
Paco Gutierrez:835-365-1284:454 Easy Street, Decatur, IL 75732:2/28/53:123500
Ephram Hardy:293-259-5395:235 CarltonLane, Joliet, IL 73858:8/12/20:56700
James Ikeda:834-938-8376:23445 Aster Ave., Allentown, NJ 83745:12/1/38:45000
Barbara Kertz:385-573-8326:832 Ponce Drive, Gary, IN 83756:12/1/46:268500
Lesley Kirstin:408-456-1234:4 Harvard Square, Boston, MA 02133:4/22/62:52600
William Kopf:846-836-2837:6937 Ware Road, Milton, PA 93756:9/21/46:43500
Sir Lancelot:837-835-8257:474 Camelot Boulevard, Bath, WY 28356:5/13/69:24500
Jesse Neal:408-233-8971:45 Rose Terrace, San Francisco, CA 92303:2/3/36:25000
Zippy Pinhead:834-823-8319:2356 Bizarro Ave., Farmount, IL 84357:1/1/67:89500
Arthur Putie:923-835-8745:23 Wimp Lane, Kensington, DL 38758:8/31/69:126000
Popeye Sailor:156-454-3322:945 Bluto Street, Anywhere, USA 29358:3/19/35:22350
Jose Santiago:385-898-8357:38 Fife Way, Abilene, TX 39673:1/5/58:95600
Tommy Savage:408-724-0140:1222 Oxbow Court, Sunnyvale, CA 94087:5/19/66:34200
Yukio Takeshida:387-827-1095:13 Uno Lane, Ashville, NC 23556:7/1/29:57000
Vinh Tranh:438-910-7449:8235 Maple Street, Wilmington, VM 29085:9/23/63:68900
How do I tell the search to look in column 4 which has the dates?
This is an indication that you should use awk because sed doesn't have the concept of fields. An awk solution would be
awk -v FS=":" '$4 ~ /^1[12]\/.*/{print}' datebook
Output
Steve Blenheim:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
Karen Evich:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
Karen Evich:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
James Ikeda:834-938-8376:23445 Aster Ave., Allentown, NJ 83745:12/1/38:45000
Barbara Kertz:385-573-8326:832 Ponce Drive, Gary, IN 83756:12/1/46:268500
Deciphering the solution
FS=":" sets the field/column delimiter to colon.
$4 represents the column four in your input file which is the date in the format mm/dd/yy
The ~ in $4 ~ /^1[12]\/.*/ means we do a regex match in which
^ represents the beginning of the string
[12] can match either one or two.
Since the regex part itself is delimited by / you need to escape any literal / as in \/
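The field-based match can also be sketched in Python, assuming colon-separated records with the date in the fourth field. The names `NOV_DEC` and `november_december` are made up for this example.

```python
import re

# Month 11 or 12 followed by a slash, anchored at the start of the field.
NOV_DEC = re.compile(r"^1[12]/")


def november_december(lines):
    """Keep lines whose fourth colon-separated field starts with 11/ or 12/."""
    selected = []
    for line in lines:
        fields = line.rstrip("\n").split(":")
        if len(fields) >= 4 and NOV_DEC.match(fields[3]):
            selected.append(line)
    return selected
```

Matching against the field rather than the whole line is what keeps a date such as 4/12/23 from being selected.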
It appears that you want to select lines where the first characters after the third colon on the line are 11/ or 12/ (since the data formats appear to be pre-Y2K-style US-format dates with mm/dd/yy notation). So you write:
$ sed -n '/^\([^:]*:\)\{3\}1[12]\//p' datebook
Steve Blenheim:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
Karen Evich:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
Karen Evich:284-758-2867:23 Edgecliff Place, Lincoln, NB 92743:11/3/35:58200
James Ikeda:834-938-8376:23445 Aster Ave., Allentown, NJ 83745:12/1/38:45000
Barbara Kertz:385-573-8326:832 Ponce Drive, Gary, IN 83756:12/1/46:268500
$
The ^ matches at the start of a line; the \([^:]*:\) part looks for a series of zero or more non-colons followed by a colon; the \{3\} requires 3 of them; the 1[12]\/ demands 11/ or 12/ after that; the p prints.
I observe that the initial statement says 'contain 11 for November or 12 for December', but your first numbered question says 'value 11 and 12'. These are contradictory; a given date field can only start with one or the other, not both. I've assumed that 'or' is what you intended.
Here's my code. I am unable to read the dates from the input, it keeps giving me incorrect format, I tried changing a few times to mmddyy10. mmddyy8. and others but it still does not read them in correctly.
data master_patients;
infile datalines;
input account_number name $8-16 address $17-34 date MMDDYYYY10. gender $1.
insurance_code $49-51 updated_date mmddyyyy10.;
datalines;
620135 Smith 234 Aspen St. 12-21-1975 m CBC 02-16-1998
645722 Miyamoto 65 3rd Ave. 04-03-1936 f MCR 05-30-1999
645739 Jensvold 505 Glendale Ave. 06-15-1960 f HLT 09-23-1993
874329 Kazoyan 76-C La Vista . . MCD 01-15-2003
;
proc print data=master_patients;
run;
Could you please point out where I am going wrong? Thanks for any help.
I recommend a specific informat, rather than anydtdte though it helps you get started. It will ensure that your data is correct.
data master_patients;
infile datalines;
informat date updated_date mmddyy10.;
format date updated_date date9.;
input account_number name $ 8-16 address $ 17-34 date gender $1.
insurance_code $ 49-51 updated_date;
datalines;
620135 Smith 234 Aspen St. 12-21-1975 m CBC 02-16-1998
645722 Miyamoto 65 3rd Ave. 04-03-1936 f MCR 05-30-1999
645739 Jensvold 505 Glendale Ave. 06-15-1960 f HLT 09-23-1993
874329 Kazoyan 76-C La Vista . . MCD 01-15-2003
;
run;
There are two main problems. First, the informat name does not have four Y's in it, just two. Second, the column pointer is not in the right place when you try to read 10 characters as a date, so you get a blank followed by the first 9 characters of the date. SAS cannot represent dates in the second or third century AD. Try MDY(12,21,197) and see what happens.
data master_patients;
infile datalines firstobs=2;
input account_number name $8-16 address $17-34 @36 date MMDDYY10.
gender $1. insurance_code $49-51 @53 updated_date mmddyy10.
;
datalines;
----+----1----+----2----+----3----+----4----+----5----+----6----+
620135 Smith 234 Aspen St. 12-21-1975 m CBC 02-16-1998
645722 Miyamoto 65 3rd Ave. 04-03-1936 f MCR 05-30-1999
645739 Jensvold 505 Glendale Ave. 06-15-1960 f HLT 09-23-1993
874329 Kazoyan 76-C La Vista . . MCD 01-15-2003
;
proc print data=master_patients;
run;
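To make the pointer problem concrete, here is a small Python sketch. The fixed-width record is hypothetical: it is built to match the columns named in the INPUT statement (name in 8-16, address in 17-34, a blank in 35, the date starting in 36), because the exact spacing of the posted data is not recoverable. The helper `parse_mdy` is also made up for illustration.

```python
# Hypothetical fixed-width record laid out to match the INPUT statement:
# account 1-6, name 8-16, address 17-34, blank in 35, birth date in 36-45.
record = "620135 " + "Smith".ljust(9) + "234 Aspen St.".ljust(18) + " " + "12-21-1975"


def parse_mdy(text):
    """Split an mm-dd-yyyy string into (month, day, year) integers."""
    month, day, year = text.strip().split("-")
    return int(month), int(day), int(year)


# After columns 17-34 are read the pointer sits at column 35, so a 10-wide
# read picks up the blank plus only nine characters of the date.
misaligned = record[34:44]
# Starting the read at column 36 instead covers the whole date.
aligned = record[35:45]

print(repr(misaligned), parse_mdy(misaligned))
print(repr(aligned), parse_mdy(aligned))
```

The misaligned read ends up with the year 197, which is the value the MDY(12,21,197) experiment refers to.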
Modified list input also works for this problem: just add ":" between the variable name and the informat.
data master_patients;
infile datalines;
input account_number name $8-16 address $17-34 date : mmddyy10. gender $1.
insurance_code $49-51 updated_date : mmddyy10.;
datalines;
620135 Smith 234 Aspen St. 12-21-1975 m CBC 02-16-1998
645722 Miyamoto 65 3rd Ave. 04-03-1936 f MCR 05-30-1999
645739 Jensvold 505 Glendale Ave. 06-15-1960 f HLT 09-23-1993
874329 Kazoyan 76-C La Vista . . MCD 01-15-2003
;
proc print data=master_patients;
run;
Please note that if you don't add ":" and just change mmddyy10. to anydtdte., the data read into the dataset may not be correct.
I'm trying to do some calculations on the columns of a tab delimited file using this perl one-liner:
perl -ape 'if (/^\d/) { s/$F[2]/$F[2]\/$F[4]/e && s/$F[3]/$F[3]\/$F[4]/e}' infile
the idea is to get A and B columns divided by C column
infile:
X Y A B C
5001 3 1.03333 0.652549 4215
6001 4 1.2 0.723137 4870
7001 2 1 0.807843 5153
8001 2 1 0.807843 5355
9001 2 1 0.807843 5389
10001 2 1 0.807843 4955
11001 7 1.7671 1.05573 4966
12001 17 8.18802 4.72554 5124
But the output is this:
X Y A B C
5001 3 0.000245155397390273 0.000154815895610913 4215
6001 4 0.000246406570841889 0.000148488090349076 4870
7000.000194061711624297 2 1 0.000156771395303707 5153
8000.000186741363211951 2 1 0.000150857703081232 5355
9000.000185563184264242 2 1 0.000149905919465578 5389
0.0002018163471241170001 2 1 0.000163035923309788 4955
11001 7 0.000355839710028192 0.000212591623036649 4966
12001 17 0.00159797423887588 0.000922236533957845 5124
What is going on in the 3rd to 6th lines? How can I fix this?
Thanks.
EDIT:
I removed the /e option from the substitute command and it seems that the calculation is being performed on the wrong column.
perl -ape 'if (/^\d/) { s/$F[2]/$F[2]\/$F[4]/ && s/$F[3]/$F[3]\/$F[4]/}' infile
X Y A B C
5001 3 1.03333/4215 0.652549/4215 4215
6001 4 1.2/4870 0.723137/4870 4870
7001/5153 2 1 0.807843/5153 5153
8001/5355 2 1 0.807843/5355 5355
9001/5389 2 1 0.807843/5389 5389
1/49550001 2 1 0.807843/4955 4955
11001 7 1.7671/4966 1.05573/4966 4966
12001 17 8.18802/5124 4.72554/5124 5124
13001 30 13.8763/5138 8.05385/5138 5138
After substitution and evaluation, you have something like s/1/0.000194061711624297/. So the s operator looks for a 1 and finds it as part of the first column. Whoops. If we add some \b word-boundary markers, we can force the match part of the s operators to match a complete column, never just part of a column:
perl -ape 'if (/^\d/) { s/\b$F[2]\b/$F[2]\/$F[4]/e && s/\b$F[3]\b/$F[3]\/$F[4]/e}' infile
But that's still going to run into issues if it's possible for column X to equal column A or B. Better to just do the calculations and then replace the entire line by assigning to $_:
perl -lape 'if (/^\d/) { $F[2] /= $F[4]; $F[3] /= $F[4]; $_ = join(" ", @F); }' infile
Use sprintf instead of join if you want a particular format to the output.
Your basic problem is that you are substituting the values in columns 3 and 4 wherever they appear in the whole line. For row 3, for example, you are doing s/1/1\/5153/e, which affects the first occurrence of the digit 1 in the line, not necessarily the 1 that happens to be in column 3.
Try this:
perl -lane 'if ($F[4] =~ /[1-9]/) { $F[2] /= $F[4]; $F[3] /= $F[4] } print join "\t", @F' infile
If you want to limit the precision, do something like $F[2] = sprintf "%f", $F[2]/$F[4]; ...
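The "work on the fields, then rebuild the line" idea looks much the same in Python. This is only a sketch: the function name `divide_columns` and the `.6g` precision are choices made for this example, and it assumes whitespace-separated input with at least five columns on data rows.

```python
def divide_columns(lines):
    """Divide the 3rd and 4th fields by the 5th on rows that start with a digit."""
    out = []
    for line in lines:
        fields = line.split()
        is_data = len(fields) >= 5 and fields[0][0].isdigit()
        if is_data and float(fields[4]) != 0:
            divisor = float(fields[4])
            # Fields are addressed by position, so an equal value in another
            # column (such as the 1 inside 7001) is never touched.
            fields[2] = f"{float(fields[2]) / divisor:.6g}"
            fields[3] = f"{float(fields[3]) / divisor:.6g}"
            out.append("\t".join(fields))
        else:
            out.append(line.rstrip("\n"))  # header and other lines pass through
    return out
```

Rows whose fifth column is zero are left unchanged here to avoid a division error.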