Line extraction depending on a range for specific columns - sed

I would like to extract some lines from a text file; I have started to tweak sed lately.
I have a file with the structure
88 3 3 0 0 1 101 111 4 3
89 3 3 0 0 1 3 4 112 102
90 3 3 0 0 1 102 112 113 103
91 3 3 0 0 2 103 113 114 104
What I would like to do is extract the information according to the second column. I use something like this in my bash script (argument 2 is the input file):
sed -n '/^[0-9]* [23456789]/ p' < $2 > out
However, I have entries in the second column outside the range [23456789], for instance 10. Since 10 is composed of 1 and 0, I guess both of those characters would need to be in the range, but there are also rows with just '1' in the second column that I do not want to keep. So how can I match the '10's but not the '1's?
Best,
Umut

sed -rn '/^[0-9]* ([23456789]|10)/ p' < $2 > out
You need extended-regexp support (-r) to get the | (or) operator.
Another interesting way is:
sed -rn '/^[0-9]* ([23456789]|[0-9]{2,})/ p' < $2 > out
Which means [23456789], or two or more repetitions of a digit.
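A side note, hedged because sed versions differ: -r is GNU sed's spelling, and -E is an increasingly portable synonym (BSD sed accepts it too). If you would rather stay with basic regular expressions, GNU sed also accepts \| for alternation, so an equivalent sketch is:
sed -n '/^[0-9]* \([23456789]\|10\)/ p' < $2 > out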

The instant you see variable-sized columns in your data, you should start thinking about awk:
awk '$2 > 1 && $2 < 11 {print}'
will do the trick assuming your file format is correct.
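For example, with the sample lines above saved in a file (called infile here, standing in for the script's $2 argument), every row passes because each second column is 3:
$ awk '$2 > 1 && $2 < 11' infile
88 3 3 0 0 1 101 111 4 3
89 3 3 0 0 1 3 4 112 102
90 3 3 0 0 1 102 112 113 103
91 3 3 0 0 2 103 113 114 104
(The condition alone is enough; awk's default action is to print the line.)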

sed -rn '/^[0-9]* (2|3|4|5|6|7|8|9|10)/p' < $2 > out

Related

Extraction of rows which have a value > 50

How to select those lines which have a value > 50 from a large matrix of 21 columns and 150 rows, e.g.
miRNameIDs degradome AGO LKM......till 21
osa-miR159a 0 42 42
osa-miR396e 0 7 9
vun-miR156a 121 77 4
ppt-miR156a 12 7 4
gma-miR6300 118 2 0
bna-miR156a 0 114 48
gma-miR156k 0 46 1
osa-miR1882e 0 7 0
.
.
.
Desired output is:
miRNameIDs degradome AGO LKM......till 21
vun-miR156a 121 77 4
gma-miR6300 118 2 0
bna-miR156a 0 114 48
.
.
.
till 150 rows
Using a perl one-liner
perl -ane 'print if $. == 1 || grep {$_ > 50} @F[1..$#F]' file.txt
Explanation:
Switches:
-a: Splits the line on whitespace and loads the fields into the array @F.
-n: Creates a while(<>){...} loop for each “line” in your input file.
-e: Tells perl to execute the code on command line.
Code:
$. == 1: Checks if the current line is line number 1.
grep {$_ > 50} @F[1..$#F]: Looks at each entry of the array slice to see if it is greater than 50.
||: Logical OR operator. If either of the above conditions is true, the line is printed.
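If awk reads more naturally to you, a roughly equivalent sketch (print the header line, then any row with a value above 50 in columns 2 onward) would be:
awk 'NR==1 {print; next} {for (i=2; i<=NF; i++) if ($i > 50) {print; next}}' file.txt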

Find "N" minimum and "N" maximum values with respect to a column in the file and print the specific rows

I have a tab delimited file such as
Jack 2 98 F
Jones 6 25 51.77
Mike 8 11 61.70
Gareth 1 85 F
Simon 4 76 4.79
Mark 11 12 38.83
Tony 7 82 F
Lewis 19 17 12.83
James 12 1 88.83
I want to find the N minimum values and N maximum values (more than 5) in the last column and print the rows that have those values. I want to ignore the rows with F. For example, if I want the two minimum and two maximum values in the above data, my output would be
Minimum case
Simon 4 76 4.79
Lewis 19 17 12.83
Maximum case
James 12 1 88.83
Mike 8 11 61.70
I can ignore the rows that do not have a numeric value in the fourth column using
awk -F "\t" '$4+0 != $4{next}1' inputfile.txt
I can also pipe this output and find one minimum value using
awk -F "\t" '$4+0 != $4{next}1' inputfile.txt |awk 'NR == 1 || $4 < min {line = $0; min = $4}END{print line}'
and similarly for the maximum value, but how can I extend this to more than one value, like 2 values in the toy example above and 10 for my real data?
n could be a variable; in this case, I set n=3. Note, this may have problems if there are lines with the same value in the last column.
kent$ awk -v n=3 '$NF+0==$NF{a[$NF]=$0}
END{ asorti(a,k,"@ind_num_asc")
print "min:"
for(i=1;i<=n;i++) print a[k[i]]
print "max:"
for(i=length(a)-n+1;i<=length(a);i++)print a[k[i]]}' f
min:
Simon 4 76 4.79
Lewis 19 17 12.83
Mark 11 12 38.83
max:
Jones 6 25 51.77
Mike 8 11 61.70
James 12 1 88.83
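If identical values in the last column are a concern (the array above is keyed by $NF, so duplicates overwrite each other), one possible variant keys by line number instead and sorts by value via PROCINFO["sorted_in"]. A sketch, still assuming GNU awk 4.0+, which should give the same output on this data:
awk -v n=3 '$NF+0==$NF{line[NR]=$0; val[NR]=$NF}
END{ PROCINFO["sorted_in"]="@val_num_asc"   # iterate val[] from smallest to largest value
     m=length(val); i=0
     for(k in val) ordered[++i]=line[k]     # k is the original input line number
     print "min:"; for(i=1;i<=n;i++) print ordered[i]
     print "max:"; for(i=m-n+1;i<=m;i++) print ordered[i] }' f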
You can get the minimum and maximum at once with a little redirection:
minmaxlines=2
( ( grep -v 'F$' inputfile.txt | sort -n -k4 | tee /dev/fd/4 | head -n $minmaxlines >&3 ) 4>&1 | tail -n $minmaxlines ) 3>&1
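To unpack that: tee /dev/fd/4 duplicates the sorted stream, head takes the smallest lines and sends them to fd 3 (which the outer subshell ties back to stdout), while the duplicate on fd 4 is piped into tail for the largest lines. Roughly the same command, spread out with comments:
minmaxlines=2
( (
    grep -v 'F$' inputfile.txt |
      sort -n -k4 |
      tee /dev/fd/4 |                  # also copy the sorted stream onto fd 4
      head -n "$minmaxlines" >&3       # smallest N lines -> fd 3
  ) 4>&1 |                             # fd 4 becomes the input of tail
  tail -n "$minmaxlines"               # largest N lines
) 3>&1                                 # fd 3 is routed back to stdout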
Here's a pipeline approach to the problem.
$ grep -v 'F$' inputfile.txt | sort -nk 4 | head -2
Simon 4 76 4.79
Lewis 19 17 12.83
$ grep -v 'F$' inputfile.txt | sort -nk 4 | tail -2
Mike 8 11 61.70
James 12 1 88.83

How do I right justify columns in a file [duplicate]

This question already has answers here:
right text align - bash
How do I right justify the columns of a file in awk, sed, or bash?
My file is currently left justified and space delimited.
Can I use printf or rev?
Here is what my file looks like :
$ cat file
14,107 aaa 12,436 0.0 0 0 313 0 373
3,806,201 bbb 1,573 0.0 0 0 -25 0 -25
And using rev doesn't give me the output I'm looking for.
$ rev file | column -t | rev
14,107 aaa 12,436 0.0 0 0 313 0 373
3,806,201 bbb 1,573 0.0 0 0 -25 0 -25
In lieu of a specific example, here is a general solution using a trick with rev:
$ cat file
a 10000.00 x
b 100 y
c 1 zzzZZ
$ rev file | column -t | rev
a  10000.00      x
b       100      y
c         1  zzzZZ
Where column -t is replaced by whatever you are trying to do.
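As an alternative, if you are happy to pick the column widths yourself, awk's printf right-justifies with a plain %Ns format (the widths of 10 below are arbitrary, and the command assumes the three-column example file above):
$ awk '{printf "%10s %10s %10s\n", $1, $2, $3}' file
         a   10000.00          x
         b        100          y
         c          1      zzzZZ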

How to filter a file by n. word in line after pattern?

I've got a large file with different lines.
The lines I am interested in look like this:
lcl|NC_005966.1_gene_59 scaffold441.6 99.74 390 1 0 1 390 34065 34454 0.0 715
lcl|NC_005966.1_gene_59 scaffold2333.4 89.23 390 42 0 1 390 3114 2725 1e-138 488
lcl|NC_005966.1_gene_60 scaffold441.6 100.00 186 0 0 1 186 34528 34713 1e-95 344
Now I want to get the lines starting with the pattern 'lcl|NC_', but only if the third word (or the nth word in the line) is smaller than 100.
(In this case the first two lines, since they have the values 99.74 and 89.23.)
Next they should be saved into a new file.
This can do it:
$ awk '$1 ~ /^lcl\|NC_/ && $3<100' file
lcl|NC_005966.1_gene_59 scaffold441.6 99.74 390 1 0 1 390 34065 34454 0.0 715
lcl|NC_005966.1_gene_59 scaffold2333.4 89.23 390 42 0 1 390 3114 2725 1e-138 488
It checks both things:
- 1st field starting with lcl|NC_: $1 ~ /^lcl\|NC_/ does it. (Thanks Ed Morton for improving the previous $1~"^lcl|NC_")
- 3rd field being <100: $3<100.
To save into a file, you can do:
awk '$1 ~ /^lcl\|NC_/ && $3<100' file > new_file
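If the column number (and the threshold) should be parameters rather than hard-coded, awk variables can be passed in with -v. A sketch, where col and max are names made up for this example:
awk -v col=3 -v max=100 '$1 ~ /^lcl\|NC_/ && $col < max' file > new_file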

How to delete all characters but the last

I want to parse a file and delete all leading 0's of a number using sed (of course, if I have something like 0000, the result should be 0). How do I do that?
I think you may be searching for this. Here lies your answer; you will need to modify it, of course:
How to remove first/last character from a string using SED
This is probably overcomplicated, but it catches all the corner cases I tested:
sed 's/^\([^0-9]*\)0/\1\n0/;s/$/}/;s/\([^0-9\n]\)0/\1\n/g;s/\n0\+/\n/g;s/\n\([^0-9]\)/0\1/g;s/\n//g;s/}$//' inputfile
Explanation:
This uses the divide-and-conquer technique of inserting newlines to delimit segments of a line so they can be manipulated individually.
s/^\([^0-9]*\)0/\1\n0/ - insert a newline before the first zero
s/$/}/ - add a buffer character at the end
s/\([^0-9\n]\)0/\1\n/g - insert newlines before each leading zero (and remove the first)
s/\n0\+/\n/g - remove the remaining leading zeros
s/\n\([^0-9]\)/0\1/g - replace bare zeros
s/\n//g - remove the newlines
s/}$// - remove the end-of-line buffer
This file:
0 foo 1 bar 01 10 001 baz 010 100 qux 000 00 0001 0100 0010
100 | 00100
010 | 010
001 | 001
100 | 100
0 | 0
00 | 0
000 | 0
00 | 00
00 | 00
00 | 00 z
Becomes:
0 foo 1 bar 1 10 1 baz 10 100 qux 0 0 1 100 10
100 | 100
10 | 10
1 | 1
100 | 100
0 | 0
0 | 0
0 | 0
0 | 0
0 | 0
0 | 0 z
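For what it's worth, if GNU sed is available, a much shorter expression appears to produce the same output on the sample above: its \b word-boundary extension lets you delete zeros that begin a digit run while keeping the run's last digit, so 0000 still becomes 0:
$ sed 's/\b0\+\([0-9]\)/\1/g' inputfile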
If you have leading zeroes followed by a string of digits, all you have to do is convert the field into an integer. Something like this:
$ echo "000123 test " | awk '{$1=$1+0}1'
123 test
This does not require any regex, whether simple or overly complicated.
Similarly (Ruby 1.9+):
$ echo "000123 test " | ruby -lane '$F[0]=$F[0].to_i; print $F.join(" ")'
123 test
For the case of all 0000s:
$ echo "0000 test " | ruby -lane '$F[0]=$F[0].to_i; print $F.join(" ")'
0 test
$ echo "000 test " | awk '{$1=$1+0}1'
0 test
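The same +0 trick extends to every field on the line if needed. A small sketch that converts each numeric-looking field in place:
$ echo "007 0042 0 x 010" | awk '{for(i=1;i<=NF;i++) if($i+0==$i) $i=$i+0} 1'
7 42 0 x 10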