Join previous line with next depending on pattern of previous line - sed

I have this Input:
1 a
a
2 b b
3 c
c
4 d d
5 e e
6 f
f
7 g
g
I want this output, using a sed command:
1 a a
2 b b
3 c c
4 d d
5 e e
6 f f
7 g g
I'm trying this without success
sed '/^[^0-9]/ x; N; { s/\n/ / }; n' file

Another in awk:
$ awk 'BEGIN{RS=""}{for(i=1;i<=NF;i+=3)print $i,$(i+1),$(i+2)}' file
1 a a
2 b b
3 c c
4 d d
5 e e
6 f f
7 g g
Explained:
$ awk 'BEGIN {
    RS=""                         # prime awk to read in a paragraph of data
}
{
    for(i=1;i<=NF;i+=3)           # jump forward 3 fields at a time
        print $i,$(i+1),$(i+2)    # print 3 fields
}' file

awk 'NR>1 && /^[0-9]/ {print substr(s,2); s=""} {s=s FS $0} END {print substr(s,2)}' file
NR>1 && /^[0-9]/: If a line is not the first and begins with a digit,
{print substr(s,2); s=""}: print "s" without the leading space, then clear it.
{s=s FS $0}: On every line, append the current line to the value of "s". FS is a space by default.
edit: Added END condition to catch last line, hated it, made a better separate answer.
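To watch the accumulator at work, here is a minimal sketch that wraps the same awk program in a shell function and feeds it the sample input from the question. The function name join_wrapped is made up for this illustration.

```shell
# join_wrapped is a hypothetical helper name, not part of the answer above.
join_wrapped() {
  awk '
    NR>1 && /^[0-9]/ { print substr(s,2); s="" }  # new record starts: flush the buffer
    { s = s FS $0 }                                # append the current line to the buffer
    END { print substr(s,2) }                      # flush the last record
  '
}

printf '%s\n' '1 a' a '2 b b' '3 c' c '4 d d' '5 e e' '6 f' f '7 g' g | join_wrapped
```

On the sample data this prints the seven joined lines "1 a a" through "7 g g". Note that on empty input the END block still prints one empty line.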

Made it simpler with awk:
awk 'NF==2 {printf("%s ", $0); next} 1' file
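Basically, "Don't print a newline if there are exactly two fields." This relies on the sample data, where every split line has two fields and every complete line has three.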

This might work for you (GNU sed):
sed '/^[0-9]/{:a;N;s/\n\([^0-9]\)/ \1/;ta;P;D}' file
If the current line begins with a digit, append the following line. If the appended line does not begin with a digit, replace the newline with a space and repeat. Otherwise print and delete the first line in the pattern space and repeat.
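The one-liner above is written for GNU sed. As a sketch only, assuming a POSIX-style sed, the same loop can be split into separate -e fragments (some seds do not accept a label followed by a semicolon) and can use $!N instead of N, so that a sed which quits silently when N runs out of input still prints the last record. The function name join_wrapped_sed is hypothetical, and the $!N change is mine, not the answer author's.

```shell
# join_wrapped_sed is a hypothetical helper name used only for this sketch.
join_wrapped_sed() {
  # /^[0-9]/{   only act on lines that start a record
  # :a          loop label
  # $!N         append the next line unless this is the last one
  # s/...       join a continuation line with a space
  # ta          if a join happened, try to append another line
  # P, D        print and delete the first line, then restart the cycle
  sed -e '/^[0-9]/{' \
      -e ':a' \
      -e '$!N' \
      -e 's/\n\([^0-9]\)/ \1/' \
      -e 'ta' \
      -e 'P' \
      -e 'D' \
      -e '}'
}

printf '%s\n' '1 a' a '2 b b' '3 c' c '4 d d' '5 e e' '6 f' f '7 g' g | join_wrapped_sed
```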

Print specific lines that have two or more occurrences of a particular character

I have a file with some text lines. I need to print lines 3-7 and 11 if they contain two "b" characters. I tried
sed -n '/b\{2,\}/p' file
but it only printed lines where "b" occurs twice in a row.
You can use
sed -n '3,7{/b[^b]*b/p};11{/b[^b]*b/p}' file
## that is equal to
sed -n '3,7{/b[^b]*b/p};11{//p}' file
Note that b[^b]*b matches a b, then zero or more characters other than b, and then another b. The //p in the second part reuses the most recently used regex, i.e. it matches the same b[^b]*b pattern.
You could also use the b.*b regex if you want, but bracket expressions tend to work faster.
See an online demo, tested with sed (GNU sed) 4.7:
s='11bb1
b222b
b n b
ww
ee
bb
rrr
fff
999
10
11 b nnnn bb
www12'
sed -ne '3,7{/b[^b]*b/p};11{/b[^b]*b/p}' <<< "$s"
Output:
b n b
bb
11 b nnnn bb
Only lines 3, 6 and 11 are returned.
Just use awk for simplicity, clarity, portability, maintainability, etc. Using any awk in any shell on every Unix box:
awk '( (3<=NR && NR<=7) || (NR==11) ) && ( gsub(/b/,"&") >= 2 )' file
Notice how if you need to change a range, add a range, add other line numbers, change how many bs there are, add other chars and/or strings to match, add some completely different condition, etc. it's all absolutely clear and trivial.
For example, want to print the line if there are exactly 13 or 27 bs instead of 2 or more?
awk '( (3<=NR && NR<=7) || (NR==11) ) && ( gsub(/b/,"&") ~ /^(13|27)$/ )' file
Want to print the line if the line number is between 23 and 59 but isn't 34?
awk '( 23<=NR && NR<=59 && NR!=34 ) && ( gsub(/b/,"&") >= 2 )' file
Try making similar changes to a sed script. I'm not saying you can't force it to happen, but it's not nearly as trivial, clear, portable, etc. as it is using awk.
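As a quick check, here is a sketch that runs the first awk command on the same sample text used in the sed demo. The wrapper name print_if_two_b is made up for this example; gsub(/b/,"&") replaces every b with itself and returns how many it found.

```shell
# print_if_two_b is a hypothetical wrapper name for this demo.
print_if_two_b() {
  awk '( (3<=NR && NR<=7) || (NR==11) ) && ( gsub(/b/,"&") >= 2 )'
}

s='11bb1
b222b
b n b
ww
ee
bb
rrr
fff
999
10
11 b nnnn bb
www12'

printf '%s\n' "$s" | print_if_two_b
```

It prints lines 3, 6 and 11, the same three lines as the sed version.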

How to parse rows in my txt file properly using perl

I hope to parse a txt file that looks like this:
A a, b, c
B e
C f, g
The format I hope to get is:
A a
A b
A c
B e
C f
C g
I tried this:
perl -ane '@s=split(/\,/, $F[1]); foreach $k (@s){print "$F[0] $k\n";}' txt.txt
but it only works when there's no space after commas. In the original file, there is a space after each comma. What should I do?
$ perl -lane 'print "$F[0] $_" for map { tr/,//rd } @F[1..$#F]' input.txt
A a
A b
A c
B e
C f
C g
Use auto-split mode on whitespace like normal, and for each element of an array slice of @F from the second field to the last one, remove any commas (I used tr//d, the more usual s/// works too, of course) and print it with the first field prepended.
Alternatively, don't use -a because it splits too much.
perl -nle'@F = split(" ", $_, 2); print "$F[0] $_" for split(/,\s*/, $F[1])' input.txt
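If perl is not available, the same idea (split on whitespace, then strip the commas from every field after the first) can be sketched in awk. This is an alternative swapped in for illustration, not either answer's method, and the function name expand_pairs is hypothetical.

```shell
# expand_pairs is a hypothetical name; awk is swapped in for perl here.
expand_pairs() {
  awk '{
    for (i = 2; i <= NF; i++) {
      v = $i
      gsub(/,/, "", v)           # drop the comma that trails each value
      if (v != "") print $1, v   # one output row per value, key first
    }
  }'
}

printf '%s\n' 'A a, b, c' 'B e' 'C f, g' | expand_pairs
```

Because awk splits on runs of whitespace by default, the space after each comma needs no special handling.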

How to repeat a sequence of numbers to the end of a column?

I have a data file that needs a new column of identifiers from 1 to 5. The final purpose is to split the data into five separate files with no leftover file (split leaves a leftover file).
Data:
aa
bb
cc
dd
ff
nn
ww
tt
pp
with identifier column:
aa 1
bb 2
cc 3
dd 4
ff 5
nn 1
ww 2
tt 3
pp 4
Not sure if this can be done with seq? Afterwards it will be split with:
awk '$2 == 1 {print $0}'
awk '$2 == 2 {print $0}'
awk '$2 == 3 {print $0}'
awk '$2 == 4 {print $0}'
awk '$2 == 5 {print $0}'
Perl to the rescue:
perl -pe 's/$/" " . $. % 5/e' < input > output
Note that this uses 0 instead of 5 for every fifth line.
$. is the current line number.
% is the modulo operator.
The /e modifier tells the substitution to evaluate the replacement part as code,
i.e. the end of line ($) is replaced with a space concatenated (.) with the line number modulo 5.
$ awk '{print $0, ((NR-1)%5)+1}' file
aa 1
bb 2
cc 3
dd 4
ff 5
nn 1
ww 2
tt 3
pp 4
Of course, you don't need the identifier column to create 5 separate files. All you need is:
awk '{print > ("file_" ((NR-1)%5)+1)}' file
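Here is a small sketch of that round-robin split on the sample data, writing into a temporary directory so nothing in the current directory is touched. The function name split_round_robin and the part_ prefix are hypothetical choices for this example.

```shell
# split_round_robin and the "part_" prefix are hypothetical names for this sketch.
split_round_robin() {
  # $1: input file, $2: prefix for the five output files
  awk -v prefix="$2" '{ out = prefix (((NR - 1) % 5) + 1); print > out }' "$1"
}

tmp=$(mktemp -d)
printf '%s\n' aa bb cc dd ff nn ww tt pp > "$tmp/data"
split_round_robin "$tmp/data" "$tmp/part_"

cat "$tmp/part_1"   # lines 1 and 6 of the data: aa and nn
```

With nine input lines, part_1 through part_4 get two lines each and part_5 gets one, so there is no leftover file.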
Looks like you're happy with a perl solution that outputs 1-4 then 0 instead of 1-5 so FYI here's the equivalent in awk:
$ awk '{print $0, NR%5}' file
aa 1
bb 2
cc 3
dd 4
ff 0
nn 1
ww 2
tt 3
pp 4
I am going to offer a Perl solution even though it wasn't tagged because Perl is well suited to solve this problem.
If I understand what you want to do, you have a single file that you want to split into 5 separate files based on the position of a line in the data file:
the first line in the data file goes to file 1
the second line in the data file goes to file 2
the third line in the data file goes to file 3
...
Since you already know each line's position in the file, you don't really need the identifier column (though you could pursue that solution if you wanted).
Instead you can open 5 filehandles and simply alternate which handle you write to:
use strict;
use warnings;
my $datafilename = shift @ARGV;
# open filehandles and store them in an array
my @fhs;
foreach my $i ( 0 .. 4 ) {
    open my $fh, '>', "${datafilename}_$i"
        or die "$!";
    $fhs[$i] = $fh;
}
# open the datafile
open my $datafile_fh, '<', $datafilename
    or die "$!";
my $row_number = 0;
while ( my $datarow = <$datafile_fh> ) {
    print { $fhs[$row_number++ % @fhs] } $datarow;
}
# close resources
foreach my $fh ( @fhs ) {
    close $fh;
}

remove token repeatedly if line does not start with #

I want to remove all commas from my text file unless a line starts with #
for example:
a, b, c
#a, b, c
should turn to:
a b c
#a, b, c
I don't mind scanning the file twice, but I want to do this with sed.
You could try the below sed command,
$ sed '/^ *#/!s/,//g' file
a b c
#a, b, c
^ anchors the match at the start of the line, so /^ *#/ matches lines that start with zero or more spaces followed by a # symbol. The ! that follows inverts the selection, i.e. sed performs the substitution only on lines that do not match. s/,//g replaces every comma with an empty string.
Through awk,
$ awk '!/^ *#/{gsub(/,/,"")}1' file
a b c
#a, b, c
The ! at the start negates the pattern, so likewise the replacement is done only on lines that don't have a # at the start.

Reformatting separated char to couples

Input:
rs001 A C T G C G T T
rs002 C C T T G G A A
out1:
rs001 AC TG CG TT
rs002 CC TT GG AA
out2 :
rs001 1 1 1 2
rs002 2 2 2 2
OK, so basically I want to replace any pair of identical nucleotides (AA, CC, TT, or GG) with 2 and any pair of different nucleotides (AT, TA, CG, etc.) with 1, with the input converted first to out1 and then to out2. Each row has many fields (around 200 columns), so a loop is needed.
This is what I tried:
cat input | awk '{ for (x = 2; x <= NF; x = x+2) print $x$(x+1) }'
The results are weird. Can anyone tell me why I can't get out1? What mistake did I make in the awk loop?
Thanks in advance
For the first,
sed 's/ \([ACGT]\) / \1/g' input >out1
This removes the space after every other nucleotide. It matches a nucleotide with a space on both sides; the next match picks up where the previous one ended.
For the second,
sed 's/\([ACGT]\)\1/2/g;s/[ACGT][ACGT]/1/g' out1 >out2
This replaces two adjacent identical letters with 2, then any remaining adjacent two letters with 1.
This assumes you have Linux; other sed dialects may require minor modifications.
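To check both steps without creating out1 and out2, the two sed commands can be wrapped in functions and chained in a pipeline. This is only a sketch; the names pair_up and score_pairs are made up for this example.

```shell
# pair_up and score_pairs are hypothetical names wrapping the two sed commands above.
pair_up()     { sed 's/ \([ACGT]\) / \1/g'; }                      # A C T G -> AC TG
score_pairs() { sed 's/\([ACGT]\)\1/2/g;s/[ACGT][ACGT]/1/g'; }     # AA -> 2, AC -> 1

sample() { printf '%s\n' 'rs001 A C T G C G T T' 'rs002 C C T T G G A A'; }

sample | pair_up                  # the out1 format
sample | pair_up | score_pairs    # the out2 format
```

The order inside score_pairs matters: identical pairs must be replaced with 2 first, otherwise the second substitution would turn them into 1 as well.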
awk '{
out1 = out2 = $1
for (i=2;i<=NF;i+=2) {
out1 = out1 FS $i $(i+1)
out2 = out2 FS ($i == $(i+1) ? 2 : 1)
}
print out1 > "out1"
print out2 > "out2"
}' input
Here's how you fix your awk script to get output 1:
awk '{ printf "%s ", $1; for (x = 2; x <= NF; x = x + 2) {printf "%s%s ", $x, $(x+1)} printf "\n"}' input
print adds a new line at the end by default, so you'll have to use formatted strings printf to specify where exactly you want the new lines.
(Also added printf "%s ", $1; at the start to print the header at the start of each line)
Edit: Triplee's solution looks much more elegant than mine - you should ditch awk and go with his =)
This might work for you (GNU sed):
sed -re 's/ (.) / \1/g;w out1' -e 's/([ACTG])\1/2/g;s/[ACTG]./1/g' file >out2