awk join multiple lines - sed

I want to join lines between opening tag and closing tag with class named "content_subhd",
For example:
<span class="content_subhd">1
2
3 </span>
<span class="xyz">1
2
3</span>
Output should be:
<span class="content_subhd">123</span>
<span class="xyz">1
2
3
</span>
How can this be achieved? Any suggestions?

awk '/<span class="content_subhd">/, /<\/span>/ {
r = r ? r $0 : $0
if (/<\/span>/) {
print r; r = x
}
next
}1' infile
If you want to replace the content of your existing file:
awk > _new_ '/<span class="content_subhd">/, /<\/span>/ {
r = r ? r $0 : $0
if (/<\/span>/) {
print r; r = x
}
next
}1' your_file &&
mv -- _new_ your_file
Added solution for mass replacement (as per OP request):
find <your arguments here> |
while IFS= read -r; do
awk > _new_ '/<span class="content_subhd">/, /<\/span>/ {
r = r ? r $0 : $0
if (/<\/span>/) {
print r; r = x
}
next
}1' "$REPLY" &&
mv -- _new_ "$REPLY"
done

As sed is tagged in this question, here is a one liner:
sed '/<span class="content_subhd">/,/<\/span>/{H;/<\/span>/{s/.*//;x;s/\n//g;p;};d}' source
All lines are passed through unchanged except those inside the "content_subhd" span. Those lines are collected in the hold space; when the closing tag is reached, the newlines are removed and the result is printed as a single line.
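The same buffer-and-flush idea can be sketched in awk with an explicit flag instead of a range pattern. This is only an illustration: the function name join_subhd and the variables inside and buf are made up here, and it assumes the opening tag and the closing tag are recognisable by the two regular expressions shown.

```shell
join_subhd() {
    awk '
        # Opening tag of the wanted class: start buffering.
        /<span class="content_subhd">/ { inside = 1 }
        inside {
            buf = buf $0                 # append the line without a newline
            if (/<\/span>/) {            # closing tag: flush the joined line
                print buf
                buf = ""
                inside = 0
            }
            next
        }
        { print }                        # every other line passes through
    ' "$@"
}

# Sample input from the question
printf '%s\n' '<span class="content_subhd">1' '2' '3 </span>' \
              '<span class="xyz">1' '2' '3</span>' | join_subhd
```

Because the flag is only set by the "content_subhd" opening tag, the "xyz" span is printed as it came in.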

Related

Reformatting separated char to couples

Input:
rs001 A C T G C G T T
rs002 C C T T G G A A
out1:
rs001 AC TG CG TT
rs002 CC TT GG AA
out2 :
rs001 1 1 1 2
rs002 2 2 2 2
Ok so basically I want to replace any pair of identical nucleotides (AA, CC, TT or GG) with 2 and any pair of different nucleotides (AT, TA, CG, etc.) with 1. The input should first be converted to out1 and then to out2. Each row has many fields (around 200 columns), so a loop is needed here.
This is what I tried:
cat input | awk '{ for (x = 2; x <= NF; x = x+2) print $x$(x+1) }'
The results are strange. Can anyone please tell me why I don't get out1? What mistake did I make in the awk loop?
Thanks in advance
For the first,
sed 's/ \([ACGT]\) / \1/g' input >out1
This will remove the space after every other nucleotide. It matches a nucleotide with a space on both sides; the next match will pick up where the previous one ended.
For the second,
sed 's/\([ACGT]\)\1/2/g;s/[ACGT][ACGT]/1/g' out1 >out2
This replaces two adjacent identical letters with 2, then any remaining adjacent two letters with 1.
This assumes you have Linux; other sed dialects may require minor modifications.
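The order of the two substitutions matters, which a small sketch can show. The function names encode_pairs and encode_pairs_reversed are made up for this illustration, and the input is assumed to be in the out1 layout already.

```shell
# Identical pairs first, then whatever pairs are left.
encode_pairs() {
    sed 's/\([ACGT]\)\1/2/g; s/[ACGT][ACGT]/1/g' "$@"
}

# Reversed order, for comparison only: every pair is consumed by the
# first substitution, so no 2 can ever be produced.
encode_pairs_reversed() {
    sed 's/[ACGT][ACGT]/1/g; s/\([ACGT]\)\1/2/g' "$@"
}

printf '%s\n' 'rs001 AC TG CG TT' 'rs002 CC TT GG AA' | encode_pairs
printf '%s\n' 'rs001 AC TG CG TT' 'rs002 CC TT GG AA' | encode_pairs_reversed
```

Note that both versions would also rewrite any A, C, G or T pairs that happen to occur in the first column, so this assumes identifiers like rs001 that contain none.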
awk '{
out1 = out2 = $1
for (i=2;i<=NF;i+=2) {
out1 = out1 FS $i $(i+1)
out2 = out2 FS ($i == $(i+1) ? 2 : 1)
}
print out1 > "out1"
print out2 > "out2"
}' input
Here's how you fix your awk script to get output 1:
awk '{ printf "%s ", $1; for (x = 2; x <= NF; x = x + 2) {printf "%s%s ", $x, $(x+1)} printf "\n"}' input
print adds a new line at the end by default, so you'll have to use formatted strings printf to specify where exactly you want the new lines.
(Also added printf "%s ", $1; at the start to print the header at the start of each line)
Edit: Triplee's solution looks much more elegant than mine - you should ditch awk and go with his =)
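A variant of the printf approach that avoids the trailing space is sketched below. The function name pair_columns is made up for this illustration; it assumes the first field is the identifier and the remaining fields come in pairs.

```shell
pair_columns() {
    awk '{
        printf "%s", $1                       # identifier, no newline yet
        for (x = 2; x < NF; x += 2)           # walk the fields two at a time
            printf " %s%s", $x, $(x + 1)      # separator goes before each pair
        printf "\n"                           # end the line exactly once
    }' "$@"
}

printf '%s\n' 'rs001 A C T G C G T T' 'rs002 C C T T G G A A' | pair_columns
```

Putting the separator in front of each pair rather than after it is what keeps the end of the line clean.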
This might work for you (GNU sed):
sed -re 's/ (.) / \1/g;w out1' -e 's/([ACTG])\1/2/g;s/[ACTG]./1/g' file >out2

Joining lines with awk and sed

I would like to join each line that is one of {st,corridor,tunnel} onto the line before it, using AWK or SED
Input
abcd
efgjk
st
wer
dfgh
corridor
weerr
tunnel
twdf
Desired output
abcd
efgjk st
wer
dfgh corridor
weerr tunnel
twdf
One way using awk:
awk '!/st|corridor|tunnel/ { if (line) print line; line = $0; next } { line = line " " $0 } END { print line }' file.txt
Results:
abcd
efgjk st
wer
dfgh corridor
weerr tunnel
twdf
This might work for you (GNU sed):
sed '$!N;s/\n\(st\|corridor\|tunnel\)\s*$/ \1/;P;D' file
Or, an awk version that reads the whole file into memory first (not recommended for large files):
$ awk 'BEGIN {i=1} {line[i++] = $0} END {j=1; while (j<i) {if (match(line[j+1], /^(st|corridor|tunnel)$/)) {print line[j] " " line[j+1]; j+=2} else print line[j++];}}' streets
abcd
efgjk st
wer
dfgh corridor
weerr tunnel
twdf
I'll leave you with the exercise of doing this one-or-two-lines-at-a-time. :)
With awk
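BEGIN {
# referencing an element is enough to create the key
s["st"]; s["corridor"]; s["tunnel"]
}
# keyword line: print it joined to the pending line, then clear the buffer
$1 in s {
print prev, $1
prev = ""
next
}
# any other line: flush the pending line and remember this one
{
if (prev != "") print prev
prev = $0
}
# do not lose the last pending line
END {
if (prev != "") print prev
}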

Gather the data with similar columns

I want to filter the data from a text file in unix.
I have text file in unix as below:
A 200
B 300
C 400
A 100
B 600
B 700
How could I create the output below from the data above using awk?
A 200 100
B 300 600 700
C 400
I am not very good at awk, but I believe awk or perl is best suited for this.
awk 'END {
for (R in r)
print R, r[R]
}
{
r[$1] = $1 in r ? r[$1] OFS $2 : $2
}' infile
If the order of the values in the first field is important,
more code will be needed.
The solution will depend on your awk implementation and version.
Explanation:
r[$1] = $1 in r ? r[$1] OFS $2 : $2
Set the value of the array r element $1 to:
if the key $1 is already present: $1 in r, append OFS $2
to the existing value
otherwise set it to the value of $2
expression ? if true : if false is the ternary operator.
See ternary operation for more.
You could do it like this, but with Perl there's always more than one way to do it:
my %hash;
while(<>) {
my($letter, $int) = split(" ");
push @{ $hash{$letter} }, $int;
}
for my $key (sort keys %hash) {
print "$key " . join(" ", @{ $hash{$key} }) . "\n";
}
Should work like that:
$ cat data.txt | perl script.pl
A 200 100
B 300 600 700
C 400
Not language-specific. More like pseudocode, but here's the idea :
- Get all lines in an array
- Set a target dictionary of arrays
- Go through the array :
- Split the string using ' '(space) as the delimiter, into array parts
- Check whether there is already a dictionary entry for `parts[0]` (e.g. 'A').
If not, create it.
- Add `parts[1]` (e.g. 100) to `dictionary(parts[0])`
And that's it! :-)
I'd do it, probably in Python, but that's rather a matter of taste.
Using GNU awk (asort is a gawk extension), sorting the output inside it:
awk '
{ data[$1] = (data[$1] ? data[$1] " " : "") $2 }
END {
for (i in data) {
idx[++j] = i
}
n = asort(idx);
for ( i=1; i<=n; i++ ) {
print idx[i] " " data[idx[i]]
}
}
' infile
Using external program sort:
awk '
{ data[$1] = (data[$1] ? data[$1] " " : "") $2 }
END {
for (i in data) {
print i " " data[i]
}
}
' infile | sort
For both commands output is:
A 200 100
B 300 600 700
C 400
Using sed:
Content of script.sed:
## First line. Newline will separate data, so add it after the content.
## Save it in 'hold space' and read next one.
1 {
s/$/\n/
h
b
}
## Append content of 'hold space' to current line.
G
## Search if first char (\1) in line was saved in 'hold space' (\4) and add
## the number (\2) after it.
s/^\(.\)\( *[0-9]\+\)\n\(.*\)\(\1[^\n]*\)/\3\4\2/
## If last substitution succeed, goto label 'a'.
ta
## Here last substitution failed, so it is the first appearance of the
## letter, add it at the end of the content.
s/^\([^\n]*\n\)\(.*\)$/\2\1/
## Label 'a'.
:a
## Save content to 'hold space'.
h
## In last line, get content of 'hold space', remove last newline and print.
$ {
x
s/\n*$//
p
}
Run it like:
sed -nf script.sed infile
And result:
A 200 100
B 300 600 700
C 400
This might work for you:
sort -sk1,1 file | sed ':a;$!N;s/^\([^ ]*\)\( .*\)\n\1/\1\2/;ta;P;D'
A 200 100
B 300 600 700
C 400

Splitting file based on variable

I have a file with several lines of the following:
DELIMITER ;
I want to create a separate file for each of these sections.
The man page of split command does not seem to have such option.
The split command only splits a file into blocks of equal size (maybe except for the last one).
However, awk is perfect for your type of problem. Here's a solution example.
Sample input
1
2
3
DELIMITER ;
4
5
6
7
DELIMITER ;
8
9
10
11
awk script split.awk
#!/usr/bin/awk -f
BEGIN {
n = 1;
outfile = n;
}
{
# FILENAME is undefined inside the BEGIN block
if (outfile == n) {
outfile = FILENAME n;
}
if ($0 ~ /DELIMITER ;/) {
n++;
outfile = FILENAME n;
} else {
print $0 >> outfile;
}
}
As pointed out by glenn jackman, the code also can be written as:
#!/usr/bin/awk -f
BEGIN {
n = 1;
}
$0 ~ /DELIMITER ;/ {
n++;
next;
}
{
# parentheses keep the redirection target unambiguous across awk implementations
print $0 >> (FILENAME n);
}
On the command line, the one-liner awk -v x="DELIMITER ;" -v n=1 '$0 ~ x {n++; next} {print > (FILENAME n)}' input is more convenient if you only need it occasionally; if you use it often, you can save it in a file instead.
Test run
$ ls input*
input
$ chmod +x split.awk
$ ./split.awk input
$ ls input*
input input1 input2 input3
$ cat input1
1
2
3
$ cat input2
4
5
6
7
$ cat input3
8
9
10
11
The script is just a starting point. You probably have to adapt it to your personal needs and environment.
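One possible adaptation, sketched under a few assumptions: the delimiter line is matched exactly rather than by regex, each finished output file is closed so that inputs with many sections do not run into an awk implementation's limit on open files, and the output files are overwritten rather than appended to. The function name split_on_delimiter and the variable delim are made up for this sketch.

```shell
split_on_delimiter() {
    # $1: input file; sections are written to <input>1, <input>2, ...
    awk -v delim='DELIMITER ;' '
        BEGIN { n = 1 }
        $0 == delim {                 # exact match of the whole line
            close(FILENAME n)         # finished with this section
            n++
            next
        }
        { print > (FILENAME n) }      # truncates on first open, then appends
    ' "$1"
}

# Usage (writes input1, input2, ... next to the input file):
# split_on_delimiter input
```

Unlike the >> versions above, running this twice does not double the contents of the output files.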

Awk or Sed: File Annotation

Hallo, my SO friend, my question is:
Specification: annotate the fields of FILE_2 to the corresponding position of FILE_1.
A field is marked, and hence identified, by a delimiter pair.
I did this job in python before I knew awk and sed, with a couple hundred lines of code.
Now I want to see how powerful and efficient awk and sed can be.
Show me some masterpiece of awk or sed, please!
The delimiter pairs can be configured in FILE_3, but let's assume the first delimiter in a pair is 'Marker (number i) start', the other one is 'Marker (number i) done'
Example:
|-----------------FILE_1------------------|
text text text
text blabla
Marker_1_start
Marker_1_done
any text
in between blabla
Marker_2_start
Marker_2_done
text text
|-----------------FILE_2------------------|
Marker_1_start
11
1111
Marker_1_done
Marker_2_start
2222
22
Marker_2_done
Expected Output:
|-----------------FILE_Out------------------|
text text text
text blabla
Marker_1_start
11
1111
Marker_1_done
any text
in between blabla
Marker_2_start
2222
22
Marker_2_done
text text
awk '
FNR==NR && /Marker_.*_done/ {sep = ""; next}
FNR==NR && /Marker_.*_start/ {marker = $0; next}
FNR==NR {marker_text[marker] = marker_text[marker] sep $0; sep = "\n"; next}
1 {print}
/Marker_.*_start/ {print marker_text[$0]}
' file_2 file_1
There are several ways to approach this. I'm assuming that FILE_2 is smaller than FILE_1 and of a reasonable size.
#!/usr/bin/awk -f
FNR == NR {
if ($0 ~ /^Marker.*start$/) {
flag = 1
idx = $0
next
}
if ($0 ~ /^Marker.*done$/) {
flag = 0
nl = ""
next
}
if (flag) lines[idx] = lines[idx] nl $0
nl = "\n"
next
}
{
print
if (lines[$0]) print lines[$0]
}
To run it:
./script.awk FILE_2 FILE_1
"Now I want to see how powerful and efficient awk and sed can be"
For this type of problem, very efficient. I'm sure my code can be further reduced.
#!/bin/bash
awk '
FNR == NR {
if ($0 ~ /Marker_1_start/){m1=1;next}
if ($0 ~ /Marker_2_start/){m2=1;next}
if ($0 ~ /Marker_1_done/){m1=0}
if ($0 ~ /Marker_2_done/){m2=0}
if(m1){a[i++]=$0}
if(m2){b[j++]=$0}
}
FNR != NR {
if ($0 ~ /Marker_1_start/){print;n1=1}
if ($0 ~ /Marker_2_start/){print;n2=1}
if ($0 ~ /Marker_1_done/){n1=0}
if ($0 ~ /Marker_2_done/){n2=0}
if(n1)
for (k = 0; k < i; k++)
print a[k]
else if(n2)
for (l = 0; l < j; l++)
print b[l]
else
print
}' ./file_2 ./file_1
Output
$ ./filemerge.sh
text text text
text blabla
Marker_1_start
11
1111
Marker_1_done
any text
in between blabla
Marker_2_start
2222
22
Marker_2_done
text text