Is there a way to parse multiple lines to a single line by merging duplicates and tabbing non-duplicates?

I'm having trouble formatting a list like this:
Problem:
XYZ gene1
XYZ gene2
GHE ATG01
GHE ATG02
Goal (tab-delimited):
XYZ gene1 gene2
GHE ATG01 ATG02
I tried ruby -F -ane '$F[1].split(/\t/).each {|x| print [$F[0], x, $F[2]]*"\t"}', as well as the xargs and paste commands, but then got stuck figuring out how they would work, and realized the ruby command is for turning one line into multiple lines, not multiple lines into one. I'm also new to command-line text processing.
This is what I'm actually dealing with (and some more):
14-3-3 proteins AT1G22300
14-3-3 proteins AT1G26480
14-3-3 proteins AT1G34760
14-3-3 proteins AT1G35160
ZIK subfamily AT1G64630
ZIK subfamily AT3G04910
ZIK subfamily AT3G18750
And I wish to get this:
14-3-3 proteins AT1G22300 AT1G26480 AT1G34760 AT1G35160
ZIK subfamily AT1G64630 AT3G04910 AT3G18750
This is what I get:
xargs -a <some_file> | sed 's/ /,/g'
14-3-3,proteins,AT1G22300,14-3-3,proteins,AT1G26480,14-3-3,proteins,AT1G34760,14-3-3,proteins,AT1G35160,14-3-3,proteins,AT1G78220,14-3-3,proteins,AT1G78300,14-3-3,proteins,AT2G42590,14-3-3,proteins,AT3G02520,14-3-3,proteins

With miller (https://github.com/johnkerl/miller/releases/tag/5.4.0)
mlr --nidx --ofs "\t" nest --nested-fs " " --implode --values --across-records -f 3 input.csv
You get this (tab as the output field separator, a space between the imploded nested values):
14-3-3 proteins AT1G22300 AT1G26480 AT1G34760 AT1G35160
ZIK subfamily AT1G64630 AT3G04910 AT3G18750
As input I have used this (space-delimited):
14-3-3 proteins AT1G22300
14-3-3 proteins AT1G26480
14-3-3 proteins AT1G34760
14-3-3 proteins AT1G35160
ZIK subfamily AT1G64630
ZIK subfamily AT3G04910
ZIK subfamily AT3G18750
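If miller is not available, a plain awk sketch along the same lines should also work (assuming, as in the sample, that the first two space-separated fields form the group key and the third field is the value to collect; first-appearance order is preserved and the output is tab-delimited):
awk '{
  key = $1 "\t" $2
  if (key in vals) vals[key] = vals[key] "\t" $3
  else { order[++n] = key; vals[key] = $3 }
} END {
  for (i = 1; i <= n; i++) print order[i] "\t" vals[order[i]]
}' input.csv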

Related

search for pattern line by line from the second field and print out the first field corresponding to the line where the pattern was found

I have the following 0.txt file, with content separated into columns (fields):
'Disinfectants', 'Brand A', 'Brand B', 'Brand C'
'brand A', 'below brand C', 'greater than brand B'
'brand B', 'greater than brand D', 'below brand A'
I would like to find (from the second column) every time a pattern (say "brand A") occurs, and print out the contents of the first column belonging to the line where this pattern is found.
For both occurrences, the content of the resulting file should be:
Disinfectants
brand B
I have seen other similar issues but only printing the column itself where the pattern was found, usually using grep.
EDIT UPDATE: following @jubilatious1's suggestion, I found a question (https://stackoverflow.com/a/9153113) on SO as part of my search for a solution.
awk '/brand A/{ print substr( $1, RLENGTH )}' 0.txt > 1.txt
but my 1.txt output has been different than expected as it prints only part of the content of the first field (column):
'brand
'brand
Besides, just using awk '/brand A/{ print substr( $1, RLENGTH )}' I can't specify that the search should only apply from the second field (column) onwards on each line.
EDIT UPDATE 1: maybe just fixing the output of awk '/brand A/{ print substr( $1, RLENGTH )}' so that it correctly prints the contents of the first column would be a first step.
Hackish pipeline:
cut -d, -f2- 0.txt | grep -ni 'brand a' | sed 's/:.*/p/' | sed -nf- 0.txt | cut -d, -f1
Split on commas and omit field 1
grep for line numbers with 'brand a' (case insensitive)
turn line numbers into {linenumber}p -- a sed command to print that line
pipe those sed commands to sed -nf- ... this will print only when instructed to from stdin ... so you get only the lines you want
split on commas and print only the first field
Or perl:
perl -lanF, -e 'print $F[0] if grep /brand a/i, @F[1..$#F]' 0.txt
Autosplit into @F on commas, and print the first field if 'brand a' (case-insensitive) is found in any of the other fields.
Both output this:
'Disinfectants'
'brand B'
You can strip the single quotes however you'd like, or, you can change the split regex for perl autosplit:
perl -lanF"/[',]+/" -e 'print $F[1] if grep /brand a/i, #F[2..$#F]' brand.txt
To get this:
Disinfectants
brand B
... note that once the line starts with a split delimiter, $F[0] is an empty string.
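If you would rather stay in awk, a sketch that only inspects the second field onwards (assuming comma-separated fields as in 0.txt; the surrounding single quotes are left in the output, as in the first two results above) could be:
awk -F', *' '{ for (i = 2; i <= NF; i++) if (tolower($i) ~ /brand a/) { print $1; next } }' 0.txt
tolower() makes the match case-insensitive, like the grep -i and /brand a/i variants above.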

Treat Two Columns as One

Sample Text:
$ cat X
Birth Death Name
02/28/42 07/03/69 Brian Jones
11/27/42 09/18/70 Jimi Hendrix
11/19/43 10/04/70 Janis Joplin
12/08/43 07/03/71 Jim Morrison
11/20/46 10/29/71 Duane Allman
After Processing With Perl, column & sed:
$ perl -lae 'print "$F[2]_$F[3] $F[0]"' X | column -t | sed 's/_/ /g'
Name Birth
Brian Jones 02/28/42
Jimi Hendrix 11/27/42
Janis Joplin 11/19/43
Jim Morrison 12/08/43
Duane Allman 11/20/46
This is the exact output I want. But the issue is, I do not want to use column -t | sed 's/_/ /g' at the end.
My intuition is that this can be done only with perl oneliner (without the need of sed or column).
Is it possible? How can I do that?
P.S. I have an awk solution (awk '{print $3"_"$4" "$1}' X | column -t | sed 's/_/ /g') as well for this exact same result. However, I am looking for a perl-only solution.
One way
perl -wlnE'say join " ", (split " ", $_, 3)[-1,0]' input.txt
This limits the split to three terms -- first two fields obtained by normally splitting by the given pattern, and then the rest, here comprising the name.
It won't line up nicely as in the shown output.
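For example, on one sample line the slice picks out the name first and the birth date second:
perl -wlE 'say join " ", (split " ", "02/28/42 07/03/69 Brian Jones", 3)[-1,0]'
which prints Brian Jones 02/28/42.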
If the proper alignment is a must, then there's more to do since one must first see the whole file in order to know what the field width should be. Then the "one"-liner (command-line program) is
perl -MList::Util=max -wlne'
push @recs, [ (split " ", $_, 3)[-1,0] ];
END {
$m = max map { length $_->[0] } @recs;
printf("%-${m}s %s\n", @$_) for @recs
}' input.txt
If an a priori set field width is acceptable, as brought up in a comment, we can do
perl -wlne'printf "%-20s %s\n", (split " ", $_, 3)[-1,0]' input.txt
The saving grace for the obvious shortcoming here (what about names that are longer?) is that only those particular lines will be misaligned.
See if the following one-liner is an acceptable solution
perl -ne "/(\S+)\s+\S+\s+(.*)/, printf \"%-13s %s\n\",$2,$1" birth_data.dat
Input birth_data.dat
Birth Death Name
02/28/42 07/03/69 Brian Jones
11/27/42 09/18/70 Jimi Hendrix
11/19/43 10/04/70 Janis Joplin
12/08/43 07/03/71 Jim Morrison
11/20/46 10/29/71 Duane Allman
Output
Name Birth
Brian Jones 02/28/42
Jimi Hendrix 11/27/42
Janis Joplin 11/19/43
Jim Morrison 12/08/43
Duane Allman 11/20/46

comparing columns of multiple files using shell

I have to compare two files based on the first column; if they match, print the second columns of file1 and file2 on the same line.
file 1
1cu2 pf00959
3nnr pf00440
3nnr pf13972
2v89 pf13341
4aqb pf00431
4aqb pf00431
4aqb pf07645
4aqb pf00084
2liv pf13458
2liv pf01094
file 2
1cu2 d.2.1.3
2v89 g.50.1.2
2v89 g.50.1.2
2liv c.93.1.1
2liv c.93.1.1
1q2w b.47.1.4
1q2w b.47.1.4
1rgh d.1.1.2
1rgh d.1.1.2
1zxl c.2.1.2
output
1cu2 pf00959 d.2.1.3
2v89 pf13341 g.50.1.2
2liv pf13458 c.93.1.1
Assuming you're using a shell with process substitution (i.e. more than plain Bourne /bin/sh), you can do this in a one-liner:
$ join <(sort -u file1) <(sort -u file2)
1cu2 pf00959 d.2.1.3
2liv pf01094 c.93.1.1
2liv pf13458 c.93.1.1
2v89 pf13341 g.50.1.2
If you're actually writing a shell script for /bin/sh, you'll need temporary files, e.g.
$ sort file1 > file1-sorted
$ sort file2 > file2-sorted
$ join file1-sorted file2-sorted
Update: Your example output has one hit per key, even though 2liv has two values in file1. To accomplish this, you need to run through a post-processor to note the duplicates:
$ join <(sort -u file1) <(sort -u file2) |awk '!done[$1,$3] { print; done[$1,$3] = 1 }'
1cu2 pf00959 d.2.1.3
2liv pf01094 c.93.1.1
2v89 pf13341 g.50.1.2
This uses a simple hash in awk. The sort -u items already got rid of the duplicates from file1 (the second column of the final output), so we're merely looking for the first unique pairing of the keys with the values from file2 (the first and third columns). If we find a new pair, the line is printed and the pair is saved so it won't print on its next hit.
Note that this is not sorted the way your sample output was. That would be nontrivial (you'd need a third job just to determine the original order and then map things to it).
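For completeness, a single awk pass can produce the same deduplicated result without any sorting, assuming file1 fits in memory; this sketch remembers the first value seen per key in file1 and prints one merged line per key as it appears in file2:
awk 'NR==FNR { if (!($1 in f1)) f1[$1] = $2; next }
     ($1 in f1) && !seen[$1]++ { print $1, f1[$1], $2 }' file1 file2
On the sample data this happens to reproduce the requested output in file2's original order.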

Sorting lines according to the numerical value of the first element on each line?

I have just started learning Perl, hence my question might seem very silly. I apologize in advance.
I have a list, say @data, which contains the lines read from the input. The lines contain numbers that are separated by an unknown number of spaces.
Now, I would like to sort them and print them out, but not in the lexicographical order but according to the numerical value of the first number appearing on the line.
I know this must be something very simple, but I cannot figure out how to do it.
Thanks in advance,
You can use a Schwartzian transform, capturing the first number in the row with a regex:
use strict;
use warnings;
my @sorted = map $_->[0],
sort { $a->[1] <=> $b->[1] }
map { [ $_, /^(-?[\d.]+)/ ] } <DATA>;
print @sorted;
__DATA__
21 13 14
0 1 2
32 0 4
11 2 3
1 3 3
Output:
0 1 2
1 3 3
11 2 3
21 13 14
32 0 4
Reading the transform from behind, the <DATA> is the file handle we use, it will return a list of the lines in the file. The first map statement returns an array reference [ ... ], that contains the original line, plus the first number that is captured in the line. Alternatively, you can use the regex /^(\S+)/ here, to just capture whatever non-whitespace that comes first. The sort uses this captured number inside the array ref when comparing lines. And finally, the last map converts the array ref back to the original value, stored in $_->[0].
Be aware that this relies on the lines having a number at the start of the line. If that can be missing, or blank, this will have some unforeseen consequences.
Note that only using a simple numerical sort will also "work", because Perl will convert each of your lines to a number from its leading digits, assuming each line begins with a number followed by a space. You will get warnings about that, such as Argument "21 13 14\n" isn't numeric in sort. For example, if I replace my code above with
my @foo = sort { $a <=> $b } <DATA>;
I will get the output:
Argument "21 13 14\n" isn't numeric in sort at foo.pl line 6, <DATA> line 5.
Argument "0 1 2\n" isn't numeric in sort at foo.pl line 6, <DATA> line 5.
Argument "32 0 4\n" isn't numeric in sort at foo.pl line 6, <DATA> line 5.
Argument "11 2 3\n" isn't numeric in sort at foo.pl line 6, <DATA> line 5.
Argument "1 3 3\n" isn't numeric in sort at foo.pl line 6, <DATA> line 5.
0 1 2
1 3 3
11 2 3
21 13 14
32 0 4
But as you can see, it has sorted correctly. I would not advise this solution, but it is a nice demonstration in this context, I think.
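If the input is small, you can also skip the transform and extract the leading number inside the comparator itself; the regex then reruns on every comparison, so it is less efficient, but it makes for a compact sketch (again assuming every line starts with a number):
my @sorted = sort { ($a =~ /^\s*(-?[\d.]+)/)[0] <=> ($b =~ /^\s*(-?[\d.]+)/)[0] } @data;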
You can use the sort function:
@sorted_data = sort(@data);

Merging two files into one based on the first column

I have two files, both in the same format -- two columns both containing a number, for example:
file 1
1.00 99
2.00 343
3.00 34
...
10.00 343
file 2
1.00 0.4
2.00 0.5
3.00 0.34
...
10.00 0.9
and I want to generate the following file (using awk, bash, or perl):
1.00 99 0.4
2.00 343 0.5
3.00 34 0.34
...
10.00 343 0.9
thanks
join file1 file2
Which assumes that the files are sorted on the join field. If they are not, you can do this:
join <(sort -V file1) <(sort -V file2)
Here's an AWK version (the sort compensates for AWK's non-deterministic array ordering):
awk '{a[$1]=a[$1] FS $2} END {for (i in a) print i a[i]}' file1 file2 | sort -V
It seems shorter and more readable than the Perl answer.
In gawk 4, you can set the array traversal order:
awk 'BEGIN {PROCINFO["sorted_in"] = "@ind_num_asc"} {a[$1]=a[$1] FS $2} END {for (i in a) print i a[i]}' file1 file2
and you won't have to use the sort utility. @ind_num_asc is Index Numeric Ascending. See Controlling Array Traversal and Array Sorting and Using Predefined Array Scanning Orders with gawk.
Note that -V (--version-sort) in the sort commands above requires GNU sort from coreutils 7.0 or later. Thanks to @simlev for pointing out that it should be used if available.
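If the two files are guaranteed to be sorted identically and to contain exactly the same keys line for line (as the sample data suggests), a simpler but much more fragile sketch is:
paste file1 file2 | awk '{print $1, $2, $4}'
This does no key checking at all, so join remains the safer choice whenever that assumption might not hold.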
A Perl solution:
perl -anE 'push @{$h{$F[0]}}, $F[1]; END{ say "$_\t$h{$_}->[0]\t$h{$_}->[1]" for sort{$a<=>$b} keys %h }' file_1 file_2 > file_3
OK, looking at the awk one-liner: this is shorter than my first try, has nicer output than the awk one-liner, and doesn't need to pipe through sort:
perl -anE '$h{$F[0]}="$h{$F[0]}\t$F[1]"; END{say "$_$h{$_}" for sort {$a<=>$b} keys %h}' file_1 file_2
And these one-liners behave differently from the join example if there are entries in the first file with no value in the second column.
You can do it with Alacon, a command-line utility for the Alasql database.
It works with Node.js, so you need to install Node.js and then the Alasql package.
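Presumably via npm, something like:
npm install alasql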
To join data from two tab-separated files you can use the following command:
> node alacon "SELECT * INTO TSV('main.txt') FROM TSV('data1.txt') data1
JOIN TSV('data2.txt') data2 USING [0]"
This is one very long line. In this example all files have data in "Sheet1" sheets.