Extraction of rows which have a value > 50 - perl

How do I select those lines which have a value > 50 in any column, from a large matrix of 21 columns and 150 rows? E.g.
miRNameIDs degradome AGO LKM......till 21
osa-miR159a 0 42 42
osa-miR396e 0 7 9
vun-miR156a 121 77 4
ppt-miR156a 12 7 4
gma-miR6300 118 2 0
bna-miR156a 0 114 48
gma-miR156k 0 46 1
osa-miR1882e 0 7 0
.
.
.
Desired output is:
miRNameIDs degradome AGO LKM......till 21
vun-miR156a 121 77 4
gma-miR6300 118 2 0
bna-miR156a 0 114 48
.
.
.
till 150 rows

Using a perl one-liner
perl -ane 'print if $. == 1 || grep {$_ > 50} @F[1..$#F]' file.txt
Explanation:
Switches:
-a: Splits the line on whitespace and loads the fields into the array @F.
-n: Creates a while(<>){...} loop for each “line” in your input file.
-e: Tells perl to execute the code on command line.
Code:
$. == 1: Checks whether the current line is line number 1.
grep {$_ > 50} @F[1..$#F]: Tests each entry of the array slice to see if it is greater than 50.
||: Logical OR operator. If either of the above conditions is true, the line is printed.
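Run against the sample above (header truncated here for brevity), it keeps the header plus every row with at least one value above 50:
$ perl -ane 'print if $. == 1 || grep {$_ > 50} @F[1..$#F]' file.txt
miRNameIDs degradome AGO LKM
vun-miR156a 121 77 4
gma-miR6300 118 2 0
bna-miR156a 0 114 48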


Print the missing number in a unique sequential list with an arbitrary starting range or starting from 1

This question is similar to "How can I find the missing integers in a unique and sequential list (one per line) in a unix terminal?". The difference is that I want to be able to specify a starting value for the list.
I have noted the following solutions:
awk '{for(i=p+1; i<$1; i++) print i} {p=$1}' file1
and
perl -nE 'say for $a+1 .. $_-1; $a=$_'
file1 is as below:
5
6
7
8
15
16
17
20
Running both solutions, it gives the following output:
1
2
3
4
9
10
11
12
13
14
18
19
Note that the output starts printing from 1.
The question is how to pass an arbitrary starting/minimum number and, if nothing is provided, assume 1 as the starting/minimum. For example, with a starting number of 5, the output would be:
9
10
11
12
13
14
18
19
Yes, sometimes you will want the starting number to be 1, but sometimes you will want it to be the least number from the list.
You can use your awk script, slightly modified, and pass it an initial p value with the -v option:
$ awk 'BEGIN{p=p<1?1:p} {for(i=p; i<$1; i++) print i} {p=p<=$1?$1+1:p}' file1
1
2
3
4
9
10
11
12
13
14
18
19
$ awk -v p=10 'BEGIN{p=p<1?1:p} {for(i=p; i<$1; i++) print i} {p=p<=$1?$1+1:p}' file1
10
11
12
13
14
18
19
The BEGIN block initializes p to 1 if it is not specified, or if it is set to 0 or a negative value. The loop starts at p instead of p+1, and the last block assigns $1+1 (instead of $1) to p, but only if p is less than or equal to $1.
This assumes that the default (1) is the minimum starting number you would want. If you would like to start from 0 or even from a negative number just replace BEGIN{p=p<1?1:p} by BEGIN{p=(p==""?1:p)}:
$ awk -v p=-2 'BEGIN{p=(p==""?1:p)} {for(i=p; i<$1; i++) print i} {p=p<=$1?$1+1:p}' file1
-2
-1
0
1
...
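If you do this often, the whole thing can be wrapped in a small shell function (the function name missing and its argument order are my own invention), keeping the same default-to-1 behaviour:
missing() {
    # usage: missing [start] file -- start defaults to 1 when omitted
    local start=1
    (( $# > 1 )) && { start=$1; shift; }
    awk -v p="$start" '{for(i=p; i<$1; i++) print i} {p=p<=$1?$1+1:p}' "$1"
}
missing file1       # same as the default run above: 1 2 3 4 9 ... 19
missing 10 file1    # 10 11 12 13 14 18 19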
Slight variations of those one-liners to include a start point:
awk
# Optionally include start=NN before the first filename
$ awk 'BEGIN { start= 1 }
$1 < start { next }
$1 == start { p = start }
{ for (i = p + 1; i < $1; i++) print i; p = $1}' start=5 file1
9
10
11
12
13
14
18
19
$ awk 'BEGIN { start= 1 }
$1 < start { next }
$1 == start { p = start }
{ for (i = p + 1; i < $1; i++) print i; p = $1}' file1
1
2
3
4
9
10
11
12
13
14
18
19
perl
# Optionally include -start=NN before the first file and after the --
$ perl -snE 'BEGIN { $start //= 1 }
if ($_ < $start) { next }
if ($_ == $start) { $a = $start }
say for $a+1 .. $_-1; $a=$_' -- -start=5 file1
9
10
11
12
13
14
18
19
$ perl -snE 'BEGIN { $start //= 1 }
if ($_ < $start) { next }
if ($_ == $start) { $a = $start }
say for $a+1 .. $_-1; $a=$_' -- file1
1
2
3
4
9
10
11
12
13
14
18
19
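For reference, the -s switch is what turns -start=NN into the Perl variable $start: perl removes any -name=value arguments found before the filenames and sets the corresponding global, and the -- is needed so perl's own option parsing stops first. A minimal standalone demonstration:
$ perl -se 'print "start=$start\n"' -- -start=5
start=5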
Using Raku (formerly known as Perl_6)
raku -e 'my @a=lines.map: *.Int; .put for (@a.Set (^) @a.minmax.Set).sort.map: *.key;'
Sample Input:
5
6
7
8
15
16
17
20
Sample Output:
9
10
11
12
13
14
18
19
Here's an answer coded in Raku, a member of the Perl family of programming languages. No, it doesn't address the OP's request for a user-definable starting point. Instead, the code above is a general solution that computes the input's minimum Int and counts up from there, returning any missing Ints found, up to the input's maximum Int.
Really need a user-defined lower limit? Try the following code, which lets you set an $init variable:
~$ raku -e 'my @a=lines.map: *.Int; my $init = 1; .put for (@a.Set (^) ($init..@a.max).Set).sort.map: *.key;'
1
2
3
4
9
10
11
12
13
14
18
19
For an explanation and shorter code (including single-line return and/or return without sort), see the links below.
https://stackoverflow.com/a/72221301/7270649
https://raku.org
Not as elegant as I hoped:
< file | mawk '
BEGIN { _= int(_)^(\
( ORS = "")<_)
} { ___[ __= $0 ] }
END {
do {
print _ in ___ \
? "" : _ "\n"
} while(++_ < __) }' \_=10
10
11
12
13
14
18
19
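For readers who prefer it spelled out, here is my own de-golfed sketch of the same idea (a seen-set plus a counting loop in END); like the original, it assumes the input is sorted, so the last line read is the maximum:
awk -v start=10 '
    { seen[$0]; max = $0 }            # remember every number; the last one read is the maximum
    END {
        if (start == "") start = 1    # default the lower bound to 1 when -v start is omitted
        for (i = start; i < max; i++)
            if (!(i in seen)) print i
    }' file1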

Modifying a script to include the count of each time a name appears in a table

I have a script below that takes my FILE1 and parses FILE2, printing a row only if the first column of FILE1 matches column number 10 of FILE2. So it prints the rows I need, and this part works great. The part I am having a tad bit of difficulty with is inserting a count into the output. The goal is to take column 10 at the end and produce a count of each name: in my real list there are 12 names, and I want the count of each. For the example below, I have used four names.
FILE1:
name1 15
name2 15
name2 30
name5 15
name4 10
name2 5
name2 5
FILE2:
23 15 5.4 1.3 5 55 128 21799 + 32 name2 1 77 0 1
23 20 5.4 1.3 5 55 128 7998 + 18 name4 1 77 0 1
23 20 5.4 1.3 6 55 128 9984 + 13 name4 1 77 1 1
23 20 5.4 1.3 7 55 128 7998 + 14 name5 1 77 2 1
23 20 5.4 1.3 6 55 128 994 + 14 name1 1 77 3
23 20 5.4 1.3 9 55 128 984 + 5 name7 1 77 4 1
23 20 5.4 1.3 5 55 128 99 + 5 name8 1 77 5 1
Expected Output
$VAR1 = {
'name1' => 1,
'name2' => 4,
'name4' => 1,
'name5' => 1,
};
5 55 128 21799 32 name2 77 0 1
5 55 128 7998 18 name4 77 0 1
6 55 128 9984 13 name4 77 1 1
7 55 128 7998 14 name5 77 2 1
6 55 128 994 14 name1 77 3 1
name1 1
name2 1
name4 2
name5 1
You can test the script; it works. The part I am having difficulty with is inserting the count of each name based on the output. The print Dumper(\%x) is a way of checking that my original list was truly used, as I am working with a much larger set of data. If someone could point me in the right direction on how to modify my script without changing it drastically, that would be great. I feel like this script fulfills the majority of my needs, even if it is not the most efficient way of doing it.
use strict;
use warnings;
use Data::Dumper;

my %x;
open(FILE1, $ARGV[0]) or die "Cannot open the file: $!";
while (my $line = <FILE1>) {
    my @array = split(" ", $line);
    $x{$array[0]}++;
}
close FILE1;
print Dumper( \%x );

my %count;
open(FILE2, $ARGV[1]) or die "Cannot open the file: $!";
while (my $line = <FILE2>) {
    my @name = split(" ", $line);
    my $y = $name[9];
    if ( $x{ $y } ) {
        print join(" ", @name[4,5,6,7,9,11,12,13]), "\n";
        $count{$y}++;
    }
}
print Dumper( \%count );
close FILE2;
exit;
The script now counts; I just need to debug it.
the "minimal" change would be to set the elements of %x to 0 in the FILE1 loop, then check for exists $x{$y} in the FILE2 loop and do ++$x{$y} inside the condition body. Now at the end %x has the counts of all the occurrences.
The usual way (as mentioned in the comments of the question) would be to declare an additional %count and perform the same ++$count{$y} inside the if block as in the above method.
The first has the advantage and disadvantage (depending on your needs) of reporting the count even when the name has zero found occurrences.
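Here is a minimal sketch of that first variant, assuming the same command-line usage (script.pl FILE1 FILE2) and the name in the 10th column ($name[9]) as in the OP's script:
use strict;
use warnings;
use Data::Dumper;

my %x;
open my $fh1, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
while (my $line = <$fh1>) {
    my @array = split ' ', $line;
    $x{$array[0]} = 0;          # register the name; its count starts at zero
}
close $fh1;

open my $fh2, '<', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
while (my $line = <$fh2>) {
    my @name = split ' ', $line;
    my $y = $name[9];
    if (exists $x{$y}) {        # membership test instead of truth test
        print join(' ', @name[4,5,6,7,9,11,12,13]), "\n";
        ++$x{$y};               # %x now accumulates the counts itself
    }
}
close $fh2;

print Dumper(\%x);              # names with zero matches still appear, as 0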

Find "N" minimum and "N" maximum values with respect to a column in the file and print the specific rows

I have a tab-delimited file such as
Jack 2 98 F
Jones 6 25 51.77
Mike 8 11 61.70
Gareth 1 85 F
Simon 4 76 4.79
Mark 11 12 38.83
Tony 7 82 F
Lewis 19 17 12.83
James 12 1 88.83
I want to find the N minimum and N maximum values (N can be more than 5) in the last column, and then print the rows that have those values. I want to ignore the rows with F. For example, if I want the minimum two and maximum two values from the above data, my output would be
Minimum case
Simon 4 76 4.79
Lewis 19 17 12.83
Maximum case
James 12 1 88.83
Mike 8 11 61.70
I can ignore the rows that do not have a numeric value in the fourth column using
awk -F "\t" '$4+0 != $4{next}1' inputfile.txt
I can also pipe this output and find one minimum value using
awk -F "\t" '$4+0 != $4{next}1' inputfile.txt |awk 'NR == 1 || $4 < min {line = $0; min = $4}END{print line}'
and similarly for the maximum value. But how can I extend this to more than one value, like the 2 values in the toy example above and 10 for my real data?
n can be a variable; in this case I set n=3. Note: this may have problems if there are lines with the same value in the last column. (This uses gawk's asorti.)
kent$ awk -v n=3 '$NF+0==$NF{a[$NF]=$0}
END{ asorti(a,k,"@ind_num_asc")
print "min:"
for(i=1;i<=n;i++) print a[k[i]]
print "max:"
for(i=length(a)-n+1;i<=length(a);i++)print a[k[i]]}' f
min:
Simon 4 76 4.79
Lewis 19 17 12.83
Mark 11 12 38.83
max:
Jones 6 25 51.77
Mike 8 11 61.70
James 12 1 88.83
You can get the minimum and maximum in one pass with a little redirection: tee duplicates the sorted stream, so head takes the smallest lines on one file descriptor while tail takes the largest on another:
minmaxlines=2
( ( grep -v 'F$' inputfile.txt | sort -n -k4 | tee /dev/fd/4 | head -n $minmaxlines >&3 ) 4>&1 | tail -n $minmaxlines ) 3>&1
Here's a pipeline approach to the problem.
$ grep -v 'F$' inputfile.txt | sort -nk 4 | head -2
Simon 4 76 4.79
Lewis 19 17 12.83
$ grep -v 'F$' inputfile.txt | sort -nk 4 | tail -2
Mike 8 11 61.70
James 12 1 88.83
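To extend either pipeline to N values, put the count in a variable:
$ n=2
$ grep -v 'F$' inputfile.txt | sort -nk 4 | head -n "$n"    # the n smallest rows
$ grep -v 'F$' inputfile.txt | sort -nk 4 | tail -n "$n"    # the n largest rows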

Read and parse multiple text files

Can anyone suggest a simple way of achieving this? I have several files ending with the extension .vcf. I will give an example with two files.
In the files below, we are interested in the 4th (Ref) and 6th (Alt) columns of each.
File 1:
38 107 C 3 T 6 C/T
38 241 C 4 T 5 C/T
38 247 T 4 C 5 T/C
38 259 T 3 C 6 T/C
38 275 G 3 A 5 G/A
38 304 C 4 T 5 C/T
38 323 T 3 A 5 T/A
File2:
38 107 C 8 T 8 C/T
38 222 - 6 A 7 -/A
38 241 C 7 T 10 C/T
38 247 T 7 C 10 T/C
38 259 T 7 C 10 T/C
38 275 G 6 A 11 G/A
38 304 C 5 T 12 C/T
38 323 T 4 A 12 T/A
38 343 G 13 A 5 G/A
Index file :
107
222
241
247
259
275
304
323
343
The index file is created from the unique positions in file 1 and file 2; I have it ready as shown. Now I need to read all the files, pick out the data for each of these positions, and write it in columns.
From the above files, we are interested in the 4th (Ref) and 6th (Alt) columns.
Another challenge is naming the headers accordingly, so the output should be something like this:
Position File1_Ref File1_Alt File2_Ref File2_Alt
107 3 6 8 8
222 6 7
241 4 5 7 10
247 4 5 7 10
259 3 6 7 10
275 3 5 6 11
304 4 5 5 12
323 3 5 4 12
343 13 5
You can do this using the join command:
# add file1
$ join -e' ' -1 1 -2 2 -a 1 -o 0,2.4,2.6 <(sort -n index) <(sort -n -k2 file1) > file1.merged
# add file2
$ join -e' ' -1 1 -2 2 -a 1 -o 0,1.2,1.3,2.4,2.6 file1.merged <(sort -n -k2 file2) > file2.merged
# create the header
$ echo "Position File1_Ref File1_Alt File2_Ref File2_Alt" > report
$ cat file2.merged >> report
Output:
$ cat report
Position File1_Ref File1_Alt File2_Ref File2_Alt
107 3 6 8 8
222 6 7
241 4 5 7 10
247 4 5 7 10
259 3 6 7 10
275 3 5 6 11
304 4 5 5 12
323 3 5 4 12
323 4 12 4 12
343 13 5 13 5
Update:
Here is a script which can be used to combine multiple files.
The following assumptions have been made:
The index file is sorted
The vcf files are sorted on their second column
There are no spaces (or any other special characters) in filenames
Save the following script to a file e.g. report.sh and run it without any arguments from the directory containing your files.
#!/bin/bash
INDEX_FILE=index # the name of the file containing the index data
REPORT_FILE=report # the file to write the report to
TMP_FILE=$(mktemp) # a temporary file
header="Position" # the report header
num_processed=0 # the number of files processed so far
# loop over all files beginning with "file".
# this pattern can be changed to something else e.g. *.vcf
for file in file*
do
echo "Processing $file"
if [[ $num_processed -eq 0 ]]
then
# it's the first file so use the INDEX file in the join
join -e' ' -t, -1 1 -2 2 -a 1 -o 0,2.4,2.6 <(sort -n "$INDEX_FILE") <(sed 's/ \+/,/g' "$file") > "$TMP_FILE"
else
# work out the output fields
for ((outputFields="0",j=2; j < $((2 + $num_processed * 2)); j++))
do
outputFields="$outputFields,1.$j"
done
outputFields="$outputFields,2.4,2.6"
# join this file with the current report
join -e' ' -t, -1 1 -2 2 -a 1 -o "$outputFields" "$REPORT_FILE" <(sed 's/ \+/,/g' "$file") > "$TMP_FILE"
fi
((num_processed++))
header="$header,File${num_processed}_Ref,File${num_processed}_Alt"
mv "$TMP_FILE" "$REPORT_FILE"
done
# add the header to the report
echo "$header" | cat - "$REPORT_FILE" > "$TMP_FILE" && mv "$TMP_FILE" "$REPORT_FILE"
# the report is a csv file. Uncomment the line below to make it space-separated.
# tr ',' ' ' < "$REPORT_FILE" > "$TMP_FILE" && mv "$TMP_FILE" "$REPORT_FILE"
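A possible run, assuming the only files matching the file* glob are the two data files file1 and file2 (the names here are illustrative) and that the index file is named index:
$ bash report.sh
Processing file1
Processing file2
$ head -1 report
Position,File1_Ref,File1_Alt,File2_Ref,File2_Alt
The body of the report is comma-separated; uncomment the tr line at the end of the script to make it space-separated.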
This Perl solution will handle one or more files (e.g., 50).
#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp qw/ slurp /;
use Text::Table;

my $path = '.';
my @file = qw/ o33.txt o44.txt /;
my @position = slurp('index.txt') =~ /\d+/g;

my %data;
for my $filename (@file) {
    open my $fh, '<', "$path/$filename" or die "Can't open $filename: $!";
    while (<$fh>) {
        my ($pos, $ref, $alt) = (split)[1, 3, 5];
        $data{$pos}{$filename} = [$ref, $alt];
    }
    close $fh or die "Can't close $filename: $!";
}

my @head;
for my $file (@file) {
    push @head, "${file}_Ref", "${file}_Alt";
}

my $tb = Text::Table->new( map {title => $_}, "Position", @head );
for my $pos (@position) {
    $tb->load( [
        $pos,
        map $data{$pos}{$_} ? @{ $data{$pos}{$_} } : ('', ''), @file
    ] );
}
print $tb;
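Note that the column titles are built from the filenames in @file, so with the o33.txt/o44.txt example above the header row of the printed table comes out roughly as (Text::Table pads the columns to align them):
Position o33.txt_Ref o33.txt_Alt o44.txt_Ref o44.txt_Alt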

Line extraction depending on range for specific columns

I would like to extract some lines from a text file; I have started to tweak sed lately.
I have a file with the structure
88 3 3 0 0 1 101 111 4 3
89 3 3 0 0 1 3 4 112 102
90 3 3 0 0 1 102 112 113 103
91 3 3 0 0 2 103 113 114 104
What I would like to do is extract the information according to the second column. I use something like this in my bash script (argument 2 is the input file):
sed -n '/^[0-9]* [23456789]/ p' < $2 > out
However, I have entries other than the range [23456789], for instance 10. Since 10 is composed of 1 and 0, those two characters would have to be in the range, yet there are entries with '1' in the second column that I do not want to keep. So how can I match the '10's but not the '1's?
sed -rn '/^[0-9]* ([23456789]|10)/ p' < $2 > out
You need extended-regex support (-r) to have the | (or) operator.
Another interesting way is:
sed -rn '/^[0-9]* ([23456789]|[0-9]{2,})/ p' < $2 > out
Which means [23456789], or two or more repetitions of a digit (note this also matches values such as 12 or 100, not just 10).
The instant you see variable-sized columns in your data, you should start thinking about awk:
awk '$2 > 1 && $2 < 11 {print}'
will do the trick assuming your file format is correct.
sed -rn '/^[0-9]* (2|3|4|5|6|7|8|9|10)/p' < $2 > out