I'd like to join any line from the set {st, corridor, tunnel} onto the end of the preceding line, using awk or sed.
Input
abcd
efgjk
st
wer
dfgh
corridor
weerr
tunnel
twdf
Desired output
abcd
efgjk st
wer
dfgh corridor
weerr tunnel
twdf
One way using awk:
awk '!/st|corridor|tunnel/ { if (line) print line; line = $0; next } { line = line " " $0 } END { print line }' file.txt
Results:
abcd
efgjk st
wer
dfgh corridor
weerr tunnel
twdf
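For readability, here is the same one-liner spread over several lines with comments (functionally identical):
awk '
!/st|corridor|tunnel/ {     # a normal line
  if (line) print line      # flush any buffered line first
  line = $0                 # buffer the current line
  next
}
{ line = line " " $0 }      # a keyword line: append it to the buffered line
END { print line }          # flush the final buffered line
' file.txt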
This might work for you (GNU sed):
sed '$!N;s/\n\(st\|corridor\|tunnel\)\s*$/ \1/;P;D' file
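If it helps, the same GNU sed script can be written out with comments (same behaviour, just formatted):
sed '
  # append the next input line to the pattern space (except on the last line)
  $!N
  # if the appended line is one of the keywords, replace the newline with a space
  s/\n\(st\|corridor\|tunnel\)\s*$/ \1/
  # print up to the first newline, then delete up to it and restart
  P
  D
' file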
Or, an awk version that reads the whole file into memory first (not recommended for large files):
$ awk 'BEGIN {i=1} {line[i++] = $0} END {j=1; while (j<i) {if (match(line[j+1], /^(st|corridor|tunnel)$/)) {print line[j] " " line[j+1]; j+=2} else print line[j++];}}' streets
abcd
efgjk st
wer
dfgh corridor
weerr tunnel
twdf
I'll leave you with the exercise of doing this one-or-two-lines-at-a-time. :)
With awk:
BEGIN {
  s["st"]; s["corridor"]; s["tunnel"]
}
$1 in s {
  print prev, $1
  prev = ""
}
!($1 in s) {
  if (prev != "") print prev
  prev = $0
}
END {
  if (prev != "") print prev
}
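If that program is saved to a file (say join.awk; the name is just for illustration), it can be run against the input with:
awk -f join.awk file.txt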
Related
I have a text file that consists of 45999 lines. Each line contains one word (a unigram). I want to create pairs of two consecutive words (bigrams). For example:
apple
pie
red
vine
I want 'apple pie', 'pie red', 'red vine'. I tried sed 'N;s/\n/ /', but it creates just 'apple pie' and 'red vine'. How can I solve this problem? Thank you.
Could you please try the following, if you are OK with awk.
awk -v RS="" '
BEGIN{
OFS=","
s1="\047"
}
{
for(i=2;i<=NF;i++){
print s1 $(i-1) s1, s1 $i s1
}
}' Input_file
Output will be as follows.
'apple','pie'
'pie','red'
'red','vine'
2nd solution: since the OP's expected output is not completely clear, adding this one too, which prints all pairs on a single line.
awk -v RS="" '
BEGIN{
OFS=","
s1="\047"
}
{
for(i=2;i<=NF;i++){
val=(val?val OFS:"")s1 $(i-1) s1 OFS s1 $i s1
}
}
END{
print val
}' Input_file
Output will be as follows.
'apple','pie','pie','red','red','vine'
This might work for you (GNU sed):
sed -nE 'N;s/\n(.*)/ \1&/;P;D' file
Append the next line to the current line, then replace the newline by a space and append the second line again. Print/delete the first line and repeat.
N.B. This does not print the last line, as it is not part of a pair. If the last line is needed, use:
sed -E 'N;s/\n(.*)/ \1&/;P;D' file
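As a quick check with GNU sed, using printf to recreate the sample words:
printf '%s\n' apple pie red vine | sed -nE 'N;s/\n(.*)/ \1&/;P;D'
# apple pie
# pie red
# red vine
printf '%s\n' apple pie red vine | sed -E 'N;s/\n(.*)/ \1&/;P;D'
# same three lines, plus a final line containing just: vine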
If the output is to be printed as a single line with each pair surrounded by single quotes and separated by a comma, use:
sed -E ':a;$!N;s/(\S+)\n(.*)/'\''\1 \2'\'', \2/;ta;s/ (\S+)$/ '\''\1'\''/' file
Or:
sed -E ':a;$!N;s/(\S+)\n(.*)/'\''\1 \2'\'', \2/;ta;s/, \S+$//' file
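Again as a rough check (GNU sed): the first form keeps the unpaired last word as its own quoted item, the second drops it:
printf '%s\n' apple pie red vine | sed -E ':a;$!N;s/(\S+)\n(.*)/'\''\1 \2'\'', \2/;ta;s/ (\S+)$/ '\''\1'\''/'
# 'apple pie', 'pie red', 'red vine', 'vine'
printf '%s\n' apple pie red vine | sed -E ':a;$!N;s/(\S+)\n(.*)/'\''\1 \2'\'', \2/;ta;s/, \S+$//'
# 'apple pie', 'pie red', 'red vine'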
I am looking for a sed command which will transform the following input:
>AT1G01020.6 | ARV1 family protein | Chr1:6788-8737 REVERSE LENGTH=944 | 201606
AGACCCGGACTCTAATTGCTCCGTATTCTTCTTCTCTTGAGAGAGAGAGAGAGAGAGAGA
GAGAGAGAGCAATGGCGGCGAGTGAACACAGATGCGTGGGATGTGGTTTTAGGGTAAAGT
CATTGTTCATTCAATACTCTCCGGGGAAATTGCAAGGAAGTAGCAGATGAGTACATCGAG
TGTGAACGCATGATTATTTTCATCGATTTAATCCTTCACAGACCAAAGGTATATAGACAC
into
>AT1G01020.6 | ARV1 family protein | Chr1:6788-8737 REVERSE LENGTH=944 | 201606
AGACCCGGACTCTAATTGCTCCGTATTCTTCTTCTCTTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGCAATGGCGGCGAGTGAACACAGATGCGTGGGATGTGGTTTTAGGGTAAAGTCATTGTTCATTCAATACTCTCCGGGGAAATTGCAAGGAAGTAGCAGATGAGTACATCGAGTGTGAACGCATGATTATTTTCATCGATTTAATCCTTCACAGACCAAAGGTATATAGACAC
which means that the newline after a header line starting with > remains unchanged, while in all other cases the newlines are joined.
I have tried the following, but it is not working:
sed s/^!>\n$// <in.fasta>out.fasta
I have a 28MB fasta file which I need to transform.
sed is not a particularly good tool for this.
awk '/^>/ { if(prev) printf "\n"; print; next }
{ printf "%s", $0; prev = 1; }
END { if(prev) printf "\n" }' in.fasta >out.fasta
Using awk:
awk '/^>/{print (l?l ORS:"") $0;l="";next}{l=l $0}END{print l}' file
When a line starting with > or the end of the file is reached, the buffered sequence is printed; otherwise the line is appended to the variable l.
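A quick sanity check on a tiny made-up sample (not the OP's data):
printf '>seq1\nAAA\nTTT\n>seq2\nGGG\nCCC\n' |
awk '/^>/{print (l?l ORS:"") $0;l="";next}{l=l $0}END{print l}'
# >seq1
# AAATTT
# >seq2
# GGGCCC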
The following awk may also help you here. It is a solution that does not use any arrays or variables to buffer values.
awk 'BEGIN{ORS=""} /^>/{if(FNR==1){print $0 RS} else {print RS $0 RS};next}1' Input_file
OR
awk 'BEGIN{ORS=""} /^>/{printf("%s",FNR==1?$0 RS:RS $0 RS);next}1' Input_file
I have 100 html files in a directory
I need to print a line from each file that matches a regex and, at the same time, the lines between two other regexes.
The commands below each produce the correct results:
sed -n '/string1/p' *.html >result.txt
sed -n '/string2/,/string3/p' *.html > result2.txt
but I need them in one result.txt file, in the format
string1
string2
string3
I have been trying with grep, awk and sed and have searched but I have not found the answer.
Any help would be appreciated.
This might work for you:
sed -n '/string1/p;/string2/,/string3/p' INPUTFILE > OUTPUTFILE
Or here's an awk solution:
awk '/string1/ { print ; next }
/string2/ { p = 1 }
/string3/ { print ; p = 0 ; next }
p == 1 { print }' INPUTFILE > OUTPUTFILE
Simply put both sed expressions in one invocation:
echo $'a\nstring1\nb\nstring2\nc\nstring3\nd\n' | \
sed -n -e '/string1/p' -e '/string2/,/string3/p'
Input is:
a
string1
b
string2
c
string3
d
Output is:
string1
string2
c
string3
This is the data file I have, and I want the output shown below. How can I achieve it using awk or sed?
07:15:01 ST go-b-s1
07:15:21 FA go-b-s1
07:15:22 FA go-a-s1
07:15:01 ST go-c-s2
07:15:21 FA go-c-s2
Output:
07:15:01 ST go-b-s1 07:15:21 FA go-b-s1
07:15:22 FA go-a-s1
07:15:01 ST go-c-s2 07:15:21 FA go-c-s2
It looks like you just want to trim the newline from lines that have ST in the second column. If that's the case:
awk '$2 == "ST" { printf "%s ", $0; next} 1' input-file
Other options:
sed '/ST/{ N; s/\n/ /; }' input-file
perl -pe 's/\n/ / if /ST/' input-file
It's difficult to tell what you actually want with just one sample. It is generally a good idea to at least attempt to describe how you want to manipulate the data.
input file:
$ cat t.txt
id1;value1_1
id1;value1_2
id2;value2_1
id3;value3_1
id4;value4_1
id4;value4_2
id5;value5_1
result would be:
id1;value1_1;id1;value1_2
id3;value3_1
id4;value4_1;id4;value4_2
id5;value5_1
using sed or awk. Please give your opinion.
Here's one way to do it:
awk -F';' 'BEGIN { getline; id=$1; line=$0 } { if ($1 != id) { print line; line = $0; } else { line = line ";" $0; } id=$1; } END { print line; }' t.txt
Explanation:
Set field separator to ;:
-F';'
Start by reading the first line of input (getline), save the first field ($1) as id, and the first line ($0) as line:
BEGIN { getline; id=$1; line=$0 }
For each line of input, check if the first field differs from the stored id:
if ($1 != id)
If it does, then print the saved line and store the new one ($0):
print line; line = $0;
Otherwise, append the new line to the stored line(s):
line = line ";" $0;
And save the new id:
id=$1
At the end, print whatever is left in line:
END { print line; }
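For what it's worth, the same logic can also be written without the getline in BEGIN, seeding the buffer on NR == 1 instead (a sketch of an equivalent, not the answer above):
awk -F';' 'NR == 1 { id = $1; line = $0; next }
$1 == id { line = line ";" $0; next }
{ print line; id = $1; line = $0 }
END { print line }' t.txt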
I guess in your result example, the id2; line is missing by mistake, right?
Anyway, you could try the awk line below:
awk -F';' '{a[$1]=($1 in a)?a[$1]";"$0:$0}END{for(x in a)print a[x]}' yourFile|sort
output would be:
id1;value1_1;id1;value1_2
id2;value2_1
id3;value3_1
id4;value4_1;id4;value4_2
id5;value5_1
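Note that for (x in a) visits indices in an unspecified order, which is why the result is piped through sort. With GNU awk (gawk 4.0 or later) the traversal itself can be ordered instead, something like:
awk -F';' 'BEGIN{PROCINFO["sorted_in"]="@ind_str_asc"}
{a[$1]=($1 in a)?a[$1]";"$0:$0}
END{for(x in a)print a[x]}' yourFile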
This might work for you:
sed -e '1{h;d};H;${x;:a;s/\(\([^;]*;\)\([^\n]*\)\)\n\2/\1;\2/;ta;p};d' t.txt
Explanation:
Slurp the file into the hold space (HS), then on end-of-file swap to the HS and, using substitution, concatenate lines with duplicate keys and print. N.B. The lines that would normally be printed are all deleted.
EDIT:
The above solution works (as far as I know) but for large volumes it is not very fast (read: incredibly slow). This solution is better:
# cat -A /tmp/t.txt
id1;value1_1$
id1;value1_2$
id2;value2_1$
id3;value3_1$
id4;value4_1$
id4;value4_2$
id5;value5_1$
# for x in {1..1000};do cat /tmp/t.txt;done |
> sed ':a;$!N;/^\([^;]*;\).*\n\1/s/\n/;/;ta;P;D' | sort | uniq
id1;value1_1;id1;value1_2
id2;value2_1
id3;value3_1
id4;value4_1;id4;value4_2
id5;value5_1