Deleting lines containing duplicated strings - sed

I always appreciate your help.
I would like to delete lines containing duplicated strings in the second column.
test.txt
658 invert_d2e_q_reg_0_/Qalu_ecl_zlow_e 0.825692
659 invert_d2e_q_reg_0_/Qalu_byp_rd_data_e[31] 0.825692
660 invert_d2e_q_reg_0_/Qalu_byp_rd_data_e[63] 0.825692
661 invert_d2e_q_reg_0_/Qalu_ecl_zhigh_e 0.825692
665 invert_d2e_q_reg_0_/Qalu_byp_rd_data_e[62] 0.825692
666 invert_d2e_q_reg_0_/Qalu_ecl_zlow_e 0.825692
668 invert_d2e_q_reg_0_/Qalu_ecl_zhigh_e 0.825692
670 invert_d2e_q_reg_0_/Qalu_ecl_zhigh_e 0.825692
673 invert_d2e_q_reg_0_/Qalu_ecl_zlow_e 0.825692
675 invert_d2e_q_reg_0_/Qalu_ecl_zhigh_e 0.825692
677 invert_d2e_q_reg_0_/Qalu_ecl_zhigh_e 0.825692
678 invert_d2e_q_reg_0_/Qalu_byp_rd_data_e[27] 0.825692
679 invert_d2e_q_reg_0_/Qalu_byp_rd_data_e[27] 0.8120
.
.
.
output.txt
658 invert_d2e_q_reg_0_/Qalu_ecl_zlow_e 0.825692
659 invert_d2e_q_reg_0_/Qalu_byp_rd_data_e[31] 0.825692
660 invert_d2e_q_reg_0_/Qalu_byp_rd_data_e[63] 0.825692
661 invert_d2e_q_reg_0_/Qalu_ecl_zhigh_e 0.825692
665 invert_d2e_q_reg_0_/Qalu_byp_rd_data_e[62] 0.825692
678 invert_d2e_q_reg_0_/Qalu_byp_rd_data_e[27] 0.825692
.
.
.
I know sed can delete lines containing predefined, specific strings, but in my case I cannot know in advance which strings will be duplicated. Also, there may be more than 1000 duplicated strings.
I used “uniq” to do this job, but it does not work.
uniq -u -f 4 test.txt
(-u prints unique lines. -f skips the first 4 letters.)
Is there any way to do this with sed/awk/perl? Or please correct my uniq semantics.
Best,
Jaeyoung

This might work for you (GNU sed):
sed -r 'G;/^\S+\s+(\S+)\s+.*\n.*\1/!{P;s/\S+\s+(\S+)\s+.*/\1/;H};d' file
Test the second column against all the unique values of that column stored in the hold space (HS); if it is not present, print the line and append its value to the HS.
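If it helps to see the moving parts, the same one-liner can be written over several lines with comments (purely a readability sketch of the command above; still GNU sed):
sed -r '
  # append the hold space (the keys seen so far) to the current line
  G
  # if the second column does not already appear among the stored keys ...
  /^\S+\s+(\S+)\s+.*\n.*\1/!{
    # ... print the original line (up to the first newline),
    P
    # reduce the pattern space to just the key ...
    s/\S+\s+(\S+)\s+.*/\1/
    # ... and append it to the hold space
    H
  }
  # print nothing else
  d
' file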
Or use sort:
sort -suk2,2 file | sort -nk1,1
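The sort version relies on -s (stable) plus -u with -k2,2 to keep only the first line seen for each second-column value, and the second sort restores the original order via the numeric first column. With the file names from the question, a spelled-out sketch of the same command:
sort -s -u -k2,2 test.txt | sort -n -k1,1 > output.txt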

Awk would do this with one tool, but here is a fairly straightforward way to do it with Bash associative arrays. Loop over the lines and pull out the third column (after the '/' has been turned into a space); if there is no associative array entry for it yet, echo the line out and set a value so it won't be printed again.
# setting IFS to a newline (plus backspace) keeps whole lines intact when the unquoted $(< test.txt) is split
unset col3 && declare -A col3 && IFS=$(echo -en "\n\b") && for a in $(< test.txt); do
  # the key is the third field once the '/' is turned into a space
  lncol3=$(echo "${a}" | tr '/' ' ' | awk '{print $3}')
  # print the line only the first time this key is seen
  [[ -z "${col3["${lncol3}"]}" ]] && echo "${a}" && col3["${lncol3}"]=1
done

awk '!seen[$2]++' input.txt > output.txt
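For reference, the same one-liner spelled out with a comment (seen is just an arbitrary array name):
awk '
  # print a line only the first time its second-column value is seen
  !seen[$2]++
' input.txt > output.txt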

Related

Splitting one file into multiple files

I have a large file like the one below and I want to split it into multiple files. Each output file should end after ENDMDL. For the following file there will be three output files, named pose1.av, pose2.av and pose3.av.
MODEL 1
SML 170 O PRO A 17 16.893 3.030 0.799 1.00 1.00 O
SML 171 OXT PRO A 17 18.167 2.722 2.597 1.00 1.00 O
TER 172 PRO A 17
ENDMDL
MODEL 2
SML 4 CG ARG A 1 -2.171 -7.105 -4.278 1.00 1.00 C
SML 5 CD ARG A 1 -1.851 -8.581 -4.022 1.00 1.00 C
SML 113 HD1 HIS A 12 2.465 -8.206 5.062 1.00 1.00 H
TER 114 HIS A 12
ENDMDL
MODEL 3
SML 101 N HIS A 12 3.765 -3.995 7.233 1.00 1.00 N
SML 102 CA HIS A 12 2.584 -4.736 6.934 1.00 1.00 C
TER 103 HIS A 12
ENDMDL
A rather efficient one, using bash and sed:
n=0
while IFS= read -r firstline; do
{ echo "$firstline"; sed '/^ENDMDL$/q'; } > "pose$((++n)).av"
done < file
It's much more efficient than the pure Bash answer below: each output file is only opened once, and most of the parsing is done by sed rather than by bash.
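As a quick sanity check after running it on the sample above, each of the resulting pose*.av files should end with ENDMDL:
tail -n1 pose*.av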
csplit can do this out of the box
csplit -z -s -f pose -b "%01d.av" file '/^ENDMDL$/+1' '{*}'
Awk is a good choice for this task:
awk '{file="pose"++i;printf "%s%s",$0,RS > file;close(file)}' RS='ENDMDL\n' file
Using a perl one-liner
perl -ne '$fh or open $fh, "> pose".++$i.".av"; print $fh $_; undef $fh if /^ENDMDL/' file.txt
In pure Bash:
cnt=1
while read line; do
  echo "$line" >> pose${cnt}.av
  [ "$line" == "ENDMDL" ] && let cnt+=1
done < filename.txt
awk '/^MODEL/{out="pose"++cnt".av"} {print > out}' file

How to capture several pieces of text from a file and print them in a specific format?

I have a file with the following content:
CLASS
1001
CATEGORY
11 12 13 15
16 17
CLASS
3101
CATEGORY
900 901 902 904 905 907
908 909
910 912 913
CLASS
8000
CATEGORY
400 401 402 403
and I would like to reformat it using perl or awk to get the following result:
1001 11&12&13&15&16&17
3101 900&901&902&904&905&907&908&909&910&912&913
8000 400&401&402&403
Your help would be appreciated. (I used to do it with Excel VBA, but this time I would like to keep it simple using perl or awk.) Thanks in advance. :)
perl -lne'
BEGIN{ $/ ="CLASS"; $" ="&" }
($x, @F) = /\d+/g or next;
print "$x @F"
' file
output
1001 11&12&13&15&16&17
3101 900&901&902&904&905&907&908&909&910&912&913
8000 400&401&402&403
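If the one-liner is hard to read, here is the same logic spelled out with comments (just a sketch; $class and @cats are descriptive names for $x and @F above):
perl -lne '
  BEGIN { $/ = "CLASS"; $" = "&" }     # read CLASS-delimited records; interpolate arrays with &
  ($class, @cats) = /\d+/g or next;    # grab every number in the record, skip empty records
  print "$class @cats";                # first number, then the rest joined by &
' file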
Another awk version
awk '/CLASS/ {c=1;f=0;if (NR>1) print a;next} c {a=$0 " ";c=0} /CATEGORY/ {f=1;c=0;sep="";next} f {gsub(/ /,"\\&",$0);a=a sep $0;sep="&"} END {print a}' file
1001 11&12&13&15&16&17
3101 900&901&902&904&905&907&908&909&910&912&913
8000 400&401&402&403

Find "N" minimum and "N" maximum values with respect to a column in the file and print the specific rows

I have a tab delimited file such as
Jack 2 98 F
Jones 6 25 51.77
Mike 8 11 61.70
Gareth 1 85 F
Simon 4 76 4.79
Mark 11 12 38.83
Tony 7 82 F
Lewis 19 17 12.83
James 12 1 88.83
I want to find the N minimum values and N maximum values (N may be more than 5) in the last column and print the rows that have those values. I want to ignore the rows with F in the last column. For example, if I want the two minimum and two maximum values in the above data, my output would be
Minimum case
Simon 4 76 4.79
Lewis 19 17 12.83
Maximum case
James 12 1 88.83
Mike 8 11 61.70
I can ignore the rows that do not have a numeric value in the fourth column using
awk -F "\t" '$4+0 != $4{next}1' inputfile.txt
I can also pipe this output and find one minimum value using
awk -F "\t" '$4+0 != $4{next}1' inputfile.txt |awk 'NR == 1 || $4 < min {line = $0; min = $4}END{print line}'
and similarly for the maximum value, but how can I extend this to more than one value, like 2 values in the toy example above and 10 for my real data?
n can be a variable; in this case, I set n=3. Note, this may have a problem if there are lines with the same value in the last column.
kent$ awk -v n=3 '$NF+0==$NF{a[$NF]=$0}
END{ asorti(a,k,"@ind_num_asc")
print "min:"
for(i=1;i<=n;i++) print a[k[i]]
print "max:"
for(i=length(a)-n+1;i<=length(a);i++)print a[k[i]]}' f
min:
Simon 4 76 4.79
Lewis 19 17 12.83
Mark 11 12 38.83
max:
Jones 6 25 51.77
Mike 8 11 61.70
James 12 1 88.83
You can get the minimum and maximum at once with a little redirection:
minmaxlines=2
( ( grep -v 'F$' inputfile.txt | sort -n -k4 | tee /dev/fd/4 | head -n $minmaxlines >&3 ) 4>&1 | tail -n $minmaxlines ) 3>&1
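In case the file-descriptor juggling is hard to follow, here is the same pipeline spread over several lines with comments (purely a readability sketch; it relies on /dev/fd/4 being available, as it is on Linux):
minmaxlines=2
(
  (
    grep -v 'F$' inputfile.txt |
      sort -n -k4 |
      tee /dev/fd/4 |                # copy the sorted stream onto fd 4
      head -n "$minmaxlines" >&3     # the N smallest lines are parked on fd 3
  ) 4>&1 |                           # the fd 4 copy becomes the inner stdout ...
    tail -n "$minmaxlines"           # ... so tail keeps the N largest lines from it
) 3>&1                               # finally fd 3 (the N smallest) comes back to stdout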
Here's a pipeline approach to the problem.
$ grep -v 'F$' inputfile.txt | sort -nk 4 | head -2
Simon 4 76 4.79
Lewis 19 17 12.83
$ grep -v 'F$' inputfile.txt | sort -nk 4 | tail -2
Mike 8 11 61.70
James 12 1 88.83

How to filter a file by the nth word in a line after a pattern?

I've got a large file with different kinds of lines.
The lines I am interested in look like this:
lcl|NC_005966.1_gene_59 scaffold441.6 99.74 390 1 0 1 390 34065 34454 0.0 715
lcl|NC_005966.1_gene_59 scaffold2333.4 89.23 390 42 0 1 390 3114 2725 1e-138 488
lcl|NC_005966.1_gene_60 scaffold441.6 100.00 186 0 0 1 186 34528 34713 1e-95 344
Now I want to get the lines starting with the pattern 'lcl|NC_', but only if the third word (or the nth word in the line) is smaller than 100.
(In this case the first two lines, since they only have the values 99.74 and 89.23.)
Next, they should be saved into a new file.
This can make it:
$ awk '$1 ~ /^lcl\|NC_/ && $3<100' file
lcl|NC_005966.1_gene_59 scaffold441.6 99.74 390 1 0 1 390 34065 34454 0.0 715
lcl|NC_005966.1_gene_59 scaffold2333.4 89.23 390 42 0 1 390 3114 2725 1e-138 488
It checks both things:
- 1st field starting with lcl|NC_: $1 ~ /^lcl\|NC_/ does it. (Thanks Ed Morton for improving the previous $1~"^lcl|NC_")
- 3rd field being <100: $3<100.
To save into a file, you can do:
awk '$1 ~ /^lcl\|NC_/ && $3<100' file > new_file
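Since the question mentions "the nth word", the column to test can also be passed in as a variable (a sketch; n=3 reproduces the case above):
awk -v n=3 '$1 ~ /^lcl\|NC_/ && $n < 100' file > new_file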

SED: How to remove every 10 lines in a file (thin or subsample the file)

I have this so far:
sed -n '0,10p' yourfile > newfile
But it is not working, just outputs a blank file :(
Your question is ambiguous, so here is every permutation I can think of:
Print only the first 10 lines
head -n10 yourfile > newfile
Skip the first 10 lines
tail -n+10 yourfile > newfile
Print every 10th line
awk '!(NR%10)' yourfile > newfile
Delete every 10th line
awk 'NR%10' yourfile > newfile
(Since an ambiguous question can only have an ambiguous answer...)
To print every tenth line (GNU sed):
$ seq 1 100 | sed -n '0~10p'
10
20
30
40
...
100
Alternatively (GNU sed):
$ seq 1 100 | sed '0~10!d'
10
20
30
40
...
100
To delete every tenth line (GNU sed):
$ seq 1 100 | sed '0~10d'
1
...
9
11
...
19
21
...
29
31
...
39
41
...
To print the first ten lines (POSIX):
$ seq 1 100 | sed '11,$d'
1
2
3
4
5
6
7
8
9
10
To delete the first ten lines (POSIX):
$ seq 1 100 | sed '1,10d'
11
12
13
14
...
100
python -c "import sys;sys.stdout.write(''.join(line for i, line in enumerate(open('yourfile')) if i%10 == 0 ))" >newfile
It is longer, but it is a single language, not different syntax and parameters for each thing one tries to do.
With non-GNU sed, to print every 10th line use
sed -n '10,${p;n;n;n;n;n;n;n;n;n;}'
(GNU: sed -n '0~10p')
and to delete every 10th line use
sed 'n;n;n;n;n;n;n;n;n;d;'
(GNU: sed '0~10d')
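A quick way to check the portable print variant, in the same style as the seq examples above:
$ seq 1 30 | sed -n '10,${p;n;n;n;n;n;n;n;n;n;}'
10
20
30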