How to find lines in a file matching lines in another file?

I have a large file with 11 columns of either text or numbers:
ETNOFIKK 03001 E0146 a1 1001 0146 10303001 10 500 EKO24 2001_200
ETNOFIKK 03002 E0147 a1 1001 0147 10303002 10 500 EKO24 2001_200
ETNOFIKK 03003 E0148 a1 1001 0148 10303003 10 500 EKO24 2001_200
...
and another file of only one column of numbers:
0146
0148
...
I need to extract lines from the first file when the 6th column matches the entries of the second file. So, in the above example, if the second file contains only the two entries, then the first and the third lines are printed from the first file.
Thanks

Using awk
awk 'FNR==NR {a[$1];next} $6 in a' file2 file1
ETNOFIKK 03001 E0146 a1 1001 0146 10303001 10 500 EKO24 2001_200
ETNOFIKK 03003 E0148 a1 1001 0148 10303003 10 500 EKO24 2001_200
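This stores the entries of file2 as the indices of array a (FNR==NR is true only while the first file is being read).
Then, for each line of file1, it checks whether $6 is an index of the array and, if so, prints the line.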
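The same idiom can be wrapped in a small function so that the column number is a parameter. This is only a sketch: the function name match_by_column and the temporary sample files are made up for illustration, and the FNR==NR test assumes the key file is not empty.

```shell
# Hypothetical sample data, copied from the question.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
ETNOFIKK 03001 E0146 a1 1001 0146 10303001 10 500 EKO24 2001_200
ETNOFIKK 03002 E0147 a1 1001 0147 10303002 10 500 EKO24 2001_200
ETNOFIKK 03003 E0148 a1 1001 0148 10303003 10 500 EKO24 2001_200
EOF
printf '0146\n0148\n' > "$tmp/file2"

# match_by_column KEYS DATA COL (hypothetical helper):
# print the DATA lines whose COL-th field is listed in KEYS.
match_by_column() {
    awk -v col="$3" '
        FNR == NR { keys[$1]; next }   # first file: remember every key
        $col in keys                   # second file: print when the field is a key
    ' "$1" "$2"
}

match_by_column "$tmp/file2" "$tmp/file1" 6
```

Because the lookup is an exact array-index test, 0146 does not match a longer value such as 01461.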

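sed 's/^/^\\([^[:blank:]]\\{1,\\}[[:blank:]]\\{1,\\}\\)\\{5\\}/' Other.file > /tmp/pregrep.txt
grep -f /tmp/pregrep.txt Source.File
The sed command turns each entry into a basic regular expression that skips the first five columns, so the patterns must be read by plain grep rather than egrep. Doing it with sed alone is possible (after a cat of both files and a pipe) but takes many more instructions, so Jotne's awk seems to be the champ.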

Try this, which also ignores empty lines in file2:
awk 'FNR==NR && NF {a[$1]; next} $6 in a' file2 file1

Related

Decomposing the output of tail

I have a text file like below. I was wondering how I can check the values of each element in the last line after using tail -n 1.
abd we 12345 1000
abd we 12350 1000
abd we 12355 1000
abd we 12360 1000
Thanks in advance

Your requirements are very vague. Do you want this?
fields=( $(tail -n 1 file) )
if [[ ${fields[2]} == 13000 ]]; then ...
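If bash arrays are not available, the same check can be sketched in plain sh by letting awk pick the field. The helper name last_field, the sample file and the threshold 12355 are assumptions made for illustration only.

```shell
# Hypothetical sample file, copied from the question.
tmp=$(mktemp)
printf 'abd we 12345 1000\nabd we 12350 1000\nabd we 12355 1000\nabd we 12360 1000\n' > "$tmp"

# last_field FILE N (hypothetical helper):
# print the N-th whitespace-separated field of the last line of FILE.
last_field() {
    tail -n 1 "$1" | awk -v n="$2" '{ print $n }'
}

third=$(last_field "$tmp" 3)
# 12355 is an arbitrary threshold chosen for the example.
if [ "$third" -gt 12355 ]; then
    echo "third field $third exceeds 12355"
fi
```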

Awk to generate a file of 100 lines using 5 lines supplied based on some condition

I have a file with 5 lines in it. I'd like to copy and paste these lines until a condition is satisfied, e.g.,
123 89.98 34.56
AbcDef
0.0 10.567
Hijkl
XYZ 345
I'd like to repeat these lines 20 times, each time incrementing the number '123' (on the first line) by 1, i.e.,
123 89.98 34.56
AbcDef
0.0 10.567
Hijkl
XYZ 345
124 89.98 34.56
AbcDef
0.0 10.567
Hijkl
XYZ 345
.....
.....
172 89.98 34.56
AbcDef
0.0 10.567
Hijkl
XYZ 345
Is this possible?

One way to do it (but without any elegance) is this:
1) Get a file with the desired final number of lines (in your case, 20x5). Let's name it "100lines.txt".
2) Do:
awk '{if(NR%5==1) {print (NR+4)/5+122, "89.98 34.56"} if(NR%5==2) {print "AbcDef"} if(NR%5==3) {print "0.0 10.567"} if(NR%5==4) {print "Hijkl"} if(NR%5==0) {print "XYZ 345"}}' 100lines.txt
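For each block of five input lines this prints the fixed text, with the first number starting at 123 and increasing by 1 per block; only the number of lines in 100lines.txt matters, not their content.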
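A variant that needs no helper file is to read the 5-line template itself and print it repeatedly from the END block. This is a sketch under the assumption that the first line starts with the number to increment (no leading blanks); the function name repeat_block is made up.

```shell
# Hypothetical template file, copied from the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
123 89.98 34.56
AbcDef
0.0 10.567
Hijkl
XYZ 345
EOF

# repeat_block FILE COUNT (hypothetical helper): print FILE COUNT times,
# adding 1 to the first field of the first line on every repetition.
repeat_block() {
    awk -v count="$2" '
        { line[NR] = $0 }                          # keep the whole template in memory
        NR == 1 {
            start = $1                             # number to increment
            rest = substr($0, length($1) + 1)      # remainder of line 1, unchanged
        }
        END {
            for (r = 0; r < count; r++) {
                printf "%d%s\n", start + r, rest
                for (i = 2; i <= NR; i++) print line[i]
            }
        }
    ' "$1"
}

repeat_block "$tmp" 2    # use 20 to get the 100 lines asked for
```

With a count of 20 the last block starts with 142, since 123 + 19 = 142.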

How to sum values in a column grouped by values in the other

I have a large file consisting of data in 2 columns:
100 5
100 10
100 10
101 2
101 4
102 10
102 2
I want to sum the values in 2nd column with matching values in column 1. For this example, the output I'm expecting is
100 25
101 6
102 12
I'd prefer to do this with a bash script. Can someone explain how I can do this?

Using awk:
awk '{a[$1]+=$2}END{for(i in a){print i, a[i]}}' inputfile
For your input, it'd produce:
100 25
101 6
102 12

In a Perl one-liner (single-quoted so that the shell does not expand $s, $F and $_):
perl -lane '$s{$F[0]} += $F[1]; END { print "$_ $s{$_}" for sort keys %s }' file.txt

You can use an associative array. The first column is the index and the second becomes what you add to it.
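#!/bin/bash
declare -A columns=()
# read each line into an array: line[0] is the key, line[1] the value to add
while read -r -a line ; do
    columns[${line[0]}]=$((${columns[${line[0]}]} + ${line[1]}))
done < "${1}"
# ${!columns[@]} expands to the keys of the associative array
for idx in "${!columns[@]}" ; do
    echo "${idx} ${columns[${idx}]}"
done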

Using awk and maintain the order:
awk '!($1 in a){a[$1]=$2; b[++i]=$1;next} {a[$1]+=$2} END{for (k=1; k<=i; k++) print b[k], a[b[k]]}' file
100 25
101 6
102 12

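Python is my choice:
import sys
from collections import defaultdict

d = defaultdict(int)  # missing keys start at 0
with open(sys.argv[1]) as f:  # input file is given as the first argument
    for line in f:
        key, value = line.split()
        d[key] += int(value)  # convert the text to a number before adding
for key, total in d.items():
    print(key, total)
Why would you want a bash script?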

AWK - filter file with not equal fields

I've been trying to pull a field from a row in a file although each row may have plus or minus 2 or 3 fields per row. They aren't always equal in the number of fields per row.
Here is a snippet:
A orarpp 45286124 1 1 0 20 60 Nov 25 9-16:42:32 01:04:58 11176 117056 0 - oracleXXX (LOCAL=NO)
A orarpp 45351560 1 1 3 20 61 Nov 30 5-03:54:42 02:24:48 4804 110684 0 - ora_w002_XXX
A orarpp 45548236 1 1 22 20 71 Nov 26 8-19:36:28 00:56:18 10628 116508 0 - oracleXXX (LOCAL=NO)
A orarpp 45679190 1 1 0 20 60 Nov 28 6-23:42:20 00:37:59 10232 116112 0 - oracleXXX (LOCAL=NO)
A orarpp 45744808 1 1 0 20 60 10:52:19 23:08:12 00:04:58 11740 117620 0 - oracleXXX (LOCAL=NO)
A root 45810380 1 1 0 -- 39 Nov 25 9-19:54:34 00:00:00 448 448 0 - garbage
In the case of the first line, I'm interested in 9-16:42:32 and the similar fields for each row.
I've tried to pull it by using ':' as the field separator and then filtering from there. What I am trying to accomplish is to do something if the number before the dash (9 in the example) is greater than one.
cat file.txt | grep oracle | awk -F: '{print substr($1, length($1)-5)}'
This is because the number of fields on either side of the actual field I need can be different from line to line.
Definitely not the most efficient but I've been trying to do this with an awk one liner.
Hints or a direction would be appreciated to get me moving again. I am not opposed to doing in a better way than awk.
Thanks.

Maybe cut is the right tool for this job? For example, with your snippet:
$ cut -c 62-71 file.txt
9-16:42:32
5-03:54:42
8-19:36:28
6-23:42:20
23:08:12
9-19:54:34
The arguments tell cut to snip columns (-c) 62 through 71.
For additional processing, you can pipe it to awk.
You can also accomplish the whole thing in awk by accepting entire lines and then using substr to extract the columns you want. For example, this awk command produces the same output as the cut command above:
awk '{ print substr($0, 62, 10) }' file.txt
Whether you create a pipeline or do the processing entirely in awk is at least in part a matter of personal taste / style.

Would this do?
awk '/oracle/ {print substr($0,62,10)}' file.txt
9-16:42:32
8-19:36:28
6-23:42:20
23:08:12
This searches for lines containing oracle and then prints 10 characters starting from position 62.

You can grab those identifiers with one of
grep -o '[[:digit:]]\+-[[:digit:]]\{2\}:[[:digit:]]\{2\}:[[:digit:]]\{2\}'
grep -oP '\d+-\d\d:\d\d:\d\d' # GNU grep
It sounds like you want to do something with the lines, not just find the ids. Please elaborate.
Using GNU awk:
gawk --re-interval '
    /oracle/ && \
    match($0, /([[:digit:]]+)-([[:digit:]]{2}:){2}[[:digit:]]{2}/, a) && \
    a[1] > 1 {
        # do something with the matching line
        print
    }
' file
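Where GNU awk is not available, a similar filter can be sketched in POSIX awk by scanning the fields for one shaped like D-HH:MM:SS, so the position of the field no longer matters. The function name long_running and the output format (days followed by the full field) are assumptions for illustration.

```shell
# Hypothetical sample file: four of the rows from the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
A orarpp 45286124 1 1 0 20 60 Nov 25 9-16:42:32 01:04:58 11176 117056 0 - oracleXXX (LOCAL=NO)
A orarpp 45351560 1 1 3 20 61 Nov 30 5-03:54:42 02:24:48 4804 110684 0 - ora_w002_XXX
A orarpp 45744808 1 1 0 20 60 10:52:19 23:08:12 00:04:58 11740 117620 0 - oracleXXX (LOCAL=NO)
A root 45810380 1 1 0 -- 39 Nov 25 9-19:54:34 00:00:00 448 448 0 - garbage
EOF

# long_running FILE MIN_DAYS (hypothetical helper): for lines containing
# "oracle", print "days field" when the D-HH:MM:SS field has more than
# MIN_DAYS days.
long_running() {
    awk -v min="$2" '
        /oracle/ {
            for (i = 1; i <= NF; i++) {
                # The field is located by its shape, not by its position.
                if ($i ~ /^[0-9]+-[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/) {
                    split($i, part, "-")
                    if (part[1] + 0 > min) print part[1], $i
                    break
                }
            }
        }
    ' "$1"
}

long_running "$tmp" 1
```

Rows without a day count, such as the one showing only 23:08:12, have no field of that shape and are skipped.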

Create file by fetching corresponding data from other files?

I have a list of SNPs for example (let's call it file1):
SNP_ID chr position
rs9999847 4 182120631
rs999985 11 107192257
rs9999853 4 148436871
rs999986 14 95803856
rs9999883 4 870669
rs9999929 4 73470754
rs9999931 4 31676985
rs9999944 4 148376995
rs999995 10 78735498
rs9999963 4 84072737
rs9999966 4 5927355
rs9999979 4 135733891
I have another list of SNPs with the corresponding P-value (P) and BETA (as shown below) for different phenotypes; here I have shown only one (let's call it file2):
CHR SNP BP A1 TEST NMISS BETA SE L95 U95 STAT P
1 rs3094315 742429 G ADD 1123 0.1783 0.2441 -0.3 0.6566 0.7306 0.4652
1 rs12562034 758311 A ADD 1119 -0.2096 0.2128 -0.6267 0.2075 -0.9848 0.3249
1 rs4475691 836671 A ADD 1111 -0.006033 0.2314 -0.4595 0.4474 -0.02608 0.9792
1 rs9999847 878522 A ADD 1109 -0.2784 0.4048 -1.072 0.5149 -0.6879 0.4916
1 rs999985 890368 C ADD 1111 0.179 0.2166 -0.2455 0.6034 0.8265 0.4087
1 rs9999853 908247 C ADD 1110 -0.02015 0.2073 -0.4265 0.3862 -0.09718 0.9226
1 rs999986 918699 G ADD 1111 -1.248 0.7892 -2.795 0.2984 -1.582 0.114
Now I want to make two files named file3 and file4 such that:
file3 should contain:
SNPID Pvalue_for_phenotype1 Pvalue_for_phenotype2 Pvalue_for_phenotype3 and so on....
rs9999847 0.9263 0.00005 0.002 ..............
The first column (SNPIDs) in file3 will be fixed (all the SNPs in my chip will be listed here), and I want to write a program so that it will match the SNP ID in file3 and file2 and will fetch the P-value for that corresponding SNP ID and put it in file3 from file2.
file4 should contain:
SNPID BETAvale_for_phenotype1 BETAvale_for_phenotype2 BETAvale_for_phenotype3 .........
rs9999847 0.01812 -0.011 0.22
the 1st column (SNPIDs) in file4 will be fixed (all the SNPs in my chip will be listed here), and I want to write a program so that it will match SNP ID in file4 and file2 and will fetch the BETA for that corresponding SNP ID and put it in file4 from file2.

This is a simple exercise along the lines of "How to transfer the data of columns to rows (with awk)?"
file2 to file3.
I assume that you have a machine with plenty of RAM, because file2 probably has millions of lines.
You could save this code into a column2row.awk file:
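#!/usr/bin/awk -f
BEGIN {
    snp = 2     # column holding the SNP ID
    val = 12    # column holding the value to collect (12 = P)
}
{
    # append the value to the row already started for this SNP, if any;
    # "in" is used so that a value of 0 is not mistaken for an empty entry
    if ($snp in vector)
        vector[$snp] = vector[$snp] "," $val
    else
        vector[$snp] = $val
}
END {
    for (snp in vector)
        print snp "," vector[snp]
}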
Here snp is column 2 and val is column 12 (the P-value).
Now you can run the script:
/usr/bin/awk -f column2row.awk file2 > file3
If you have little RAM, you can split the load:
while read -r s rest; do grep -w "$s" file2 > "$s.snp"; /usr/bin/awk -f column2row.awk "$s.snp" >> file3; done < file1
This takes the first field of each line of file1 ($s, the SNP name), searches file2 for $s, and writes the matches to a small file per SNP name.
The awk script then turns each small file into a line of file3.
file2 to file4.
Change val in the awk script from column 12 to column 7 (BETA).
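Another possible approach, sketched here with trimmed copies of the sample data, is to pass the column number on the command line and to print the SNPs in the order of file1. The function name fetch_column and the NA placeholder for SNPs missing from file2 are assumptions, not part of the question.

```shell
# Hypothetical sample files: a few rows from the question.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
SNP_ID chr position
rs9999847 4 182120631
rs999985 11 107192257
rs9999883 4 870669
EOF
cat > "$tmp/file2" <<'EOF'
CHR SNP BP A1 TEST NMISS BETA SE L95 U95 STAT P
1 rs3094315 742429 G ADD 1123 0.1783 0.2441 -0.3 0.6566 0.7306 0.4652
1 rs9999847 878522 A ADD 1109 -0.2784 0.4048 -1.072 0.5149 -0.6879 0.4916
1 rs999985 890368 C ADD 1111 0.179 0.2166 -0.2455 0.6034 0.8265 0.4087
EOF

# fetch_column RESULTS SNPLIST COL (hypothetical helper): for every SNP in
# SNPLIST, in its original order, print the SNP and field COL of its
# RESULTS line, or NA when the SNP does not occur in RESULTS.
fetch_column() {
    awk -v col="$3" '
        FNR == NR { if (FNR > 1) value[$2] = $col; next }   # results file, header skipped
        FNR > 1   { print $1, (($1 in value) ? value[$1] : "NA") }
    ' "$1" "$2"
}

fetch_column "$tmp/file2" "$tmp/file1" 12   # P values, as wanted for file3
fetch_column "$tmp/file2" "$tmp/file1" 7    # BETA values, as wanted for file4
```

Each call handles one phenotype file; the per-phenotype outputs would still have to be combined column by column, for example with paste.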
