How to parse rows in my txt file properly using perl - perl

I hope to parse a txt file that looks like this:
A a, b, c
B e
C f, g
The format I hope to get is:
A a
A b
A c
B e
C f
C g
I tried this:
perl -ane '@s=split(/\,/, $F[1]); foreach $k (@s){print "$F[0] $k\n";}' txt.txt
but it only works when there's no space after commas. In the original file, there is a space after each comma. What should I do?

$ perl -lane 'print "$F[0] $_" for map { tr/,//rd } @F[1..$#F]' input.txt
A a
A b
A c
B e
C f
C g
Use auto-split mode on whitespace as usual. For each element of the array slice of @F from the second field to the last one, remove any commas (I used tr///d with the /r flag; the more usual s/// works too, of course) and print it with the first field prepended. The /r flag needs Perl 5.14 or later.

Alternatively, don't use -a because it splits too much.
perl -nle'@F = split(" ", $_, 2); print "$F[0] $_" for split(/,\s*/, $F[1])' txt.txt
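To make the second approach easier to follow step by step, here is a minimal sketch of the same logic in Python rather than Perl (Python is my choice for illustration only, and the function name expand_rows is hypothetical). It splits the key off once, then splits the rest of the line on a comma followed by optional whitespace.

```python
import re

def expand_rows(lines):
    """Yield 'KEY value' once for every comma-separated value on a line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split only once on whitespace: the key, then everything else.
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue  # a key with no values produces no output
        key, rest = parts
        # A comma followed by any amount of whitespace separates values.
        for value in re.split(r",\s*", rest):
            yield f"{key} {value}"

if __name__ == "__main__":
    sample = ["A a, b, c", "B e", "C f, g"]
    print("\n".join(expand_rows(sample)))
```

Splitting the line only once is what keeps the values together, so the comma pattern can deal with the spaces itself.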

Related

join previous line with next depending of pattern of previous line

I have this Input:
1 a
a
2 b b
3 c
c
4 d d
5 e e
6 f
f
7 g
g
I want this output using sed command
1 a a
2 b b
3 c c
4 d d
5 e e
6 f f
7 g g
I'm trying this without success
sed '/^[^0-9]/ x; N; { s/\n/ / }; n' file
Another in awk:
$ awk 'BEGIN{RS=""}{for(i=1;i<=NF;i+=3)print $i,$(i+1),$(i+2)}' file
1 a a
2 b b
3 c c
4 d d
5 e e
6 f f
7 g g
Explained:
$ awk 'BEGIN {
    RS=""                        # prime awk to read in a paragraph of data
}
{
    for(i=1;i<=NF;i+=3)          # jump forward 3 fields at a time
        print $i,$(i+1),$(i+2)   # print 3 fields
}' file
awk 'NR>1 && /^[0-9]/ {print substr(s,2); s=""} {s=s FS $0} END {print substr(s,2)}' file
NR>1 && /^[0-9]/: If a line is not the first and begins with a digit,
{print substr(s,2); s=""}: print "s" without the leading space, then clear it.
{s=s FS $0}: On every line, append the current line to the value of "s". FS is a space by default.
edit: Added END condition to catch last line, hated it, made a better separate answer.
Made it simpler with awk:
awk 'NF==2 {printf("%s ", $0); next} 1' file
Basically, "Don't print a newline if there are only exactly two fields."
This might work for you (GNU sed):
sed '/^[0-9]/{:a;N;s/\n\([^0-9]\)/ \1/;ta;P;D}' file
If the current line begins with an integer, append the following line. If that line does not begin with an integer, replace the newline by a space and repeat. Otherwise print/delete the first line in the pattern space and repeat.
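The accumulate-then-flush idea behind the awk and sed answers above can also be written out longhand. This is only a sketch in Python, assuming (as the answers do) that a new record always starts with a digit; the function name join_continuations is hypothetical.

```python
def join_continuations(lines):
    """Join every line that does not start with a digit onto the previous record."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if line[:1].isdigit() or not records:
            records.append(line)        # a new record starts here
        else:
            records[-1] += " " + line   # continuation of the previous record
    return records

if __name__ == "__main__":
    sample = ["1 a", "a", "2 b b", "3 c", "c", "4 d d"]
    print("\n".join(join_continuations(sample)))
```

Because each record is kept in a list until the end, there is no need for a separate step to flush the last line.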

perl : how to print after a specific line

For example, my.txt contains
a
b
xx
c
d
I want to print from the second line below the line that contains xx
I tried
perl -nle 'if(/xx/){$n=$.};print if $.>($n+1)' my.txt
But it didn't work. It just prints all lines.
Before $n is defined it is treated as 0 (zero) in the numeric comparison, so every line with $. > 1 is printed until xx is seen. This might be what you wanted:
perl -nle 'if(/xx/){$n=$.}; print if defined($n) and $. > $n+1' my.txt
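The same guard can be sketched in Python, where "not seen yet" is an explicit None instead of an undefined value that silently counts as zero. The function name lines_after_match and its skip parameter are hypothetical, chosen only for this illustration.

```python
import re

def lines_after_match(lines, pattern=r"xx", skip=1):
    """Return lines that come more than `skip` lines after the latest match."""
    match_line = None  # stays None until the pattern has been seen
    out = []
    for number, line in enumerate(lines, start=1):
        if re.search(pattern, line):
            match_line = number
        # Print nothing before the first match, then skip `skip` lines after it.
        if match_line is not None and number > match_line + skip:
            out.append(line)
    return out

if __name__ == "__main__":
    print(lines_after_match(["a", "b", "xx", "c", "d"]))
```

As in the Perl one-liner, a later match resets the counter, so lines are suppressed again right after every line that contains the pattern.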

remove token repeatedly if line does not start with #

I want to remove all commas from my text file unless a line starts with #
for example:
a, b, c
#a, b, c
should turn to:
a b c
#a, b, c
I don't mind scanning the file twice, but I want to do this with sed
You could try the below sed command,
$ sed '/^ *#/!s/,//g' file
a b c
#a, b, c
^ asserts that we are at the start of the line, so /^ *#/ matches lines that start with zero or more spaces followed by a # symbol. The ! that follows inverts the selection, so sed applies the substitution only to the lines that do not match. s/,//g replaces every comma with an empty string.
Through awk,
$ awk '!/^ *#/{gsub(/,/,"")}1' file
a b c
#a, b, c
The ! at the start negates the pattern. As with the sed version, the replacement is done only on the lines that don't have # at the start.
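For comparison, the "skip lines that match, edit the rest" logic looks like this as a small Python sketch. It uses the same pattern as the sed and awk commands (optional spaces, then #); the function name strip_commas is hypothetical.

```python
import re

def strip_commas(lines):
    """Remove commas from every line except those starting with optional spaces and '#'."""
    result = []
    for line in lines:
        if re.match(r" *#", line):
            result.append(line)                   # comment line: leave untouched
        else:
            result.append(line.replace(",", ""))  # drop every comma
    return result

if __name__ == "__main__":
    print("\n".join(strip_commas(["a, b, c", "#a, b, c"])))
```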

Reformatting separated char to couples

Input:
rs001 A C T G C G T T
rs002 C C T T G G A A
out1:
rs001 AC TG CG TT
rs002 CC TT GG AA
out2 :
rs001 1 1 1 2
rs002 2 2 2 2
OK, so basically I want to replace any pair of identical nucleotides (AA, CC, TT, or GG) with 2 and any pair of different ones (AT, TA, CG, etc.) with 1. The input should be converted first to out1 and then to out2. Each row has many fields (around 200 columns), so a loop is needed here.
This is what I tried:
cat input | awk '{ for (x = 2; x <= NF; x = x+2) print $x$(x+1) }'
The results are weird. Can anyone please tell me why I can't get out1? What mistakes did I make in the awk loop?
Thanks in advance
For the first,
sed 's/ \([ACGT]\) / \1/g' input >out1
This removes the space after every other nucleotide. It matches a nucleotide with a space on both sides; the next match picks up where the previous one ended.
For the second,
sed 's/\([ACGT]\)\1/2/g;s/[ACGT][ACGT]/1/g' out1 >out2
This replaces two adjacent identical letters with 2, then any remaining adjacent two letters with 1.
This assumes you have Linux; other sed dialects may require minor modifications.
awk '{
    out1 = out2 = $1
    for (i=2;i<=NF;i+=2) {
        out1 = out1 FS $i $(i+1)
        out2 = out2 FS ($i == $(i+1) ? 2 : 1)
    }
    print out1 > "out1"
    print out2 > "out2"
}' input
Here's how you fix your awk script to get output 1:
awk '{ printf "%s ", $1; for (x = 2; x <= NF; x = x + 2) {printf "%s%s ", $x, $(x+1)} printf "\n"}' input
print adds a new line at the end by default, so you'll have to use formatted strings printf to specify where exactly you want the new lines.
(Also added printf "%s ", $1; at the start to print the header at the start of each line)
Edit: Triplee's solution looks much more elegant than mine - you should ditch awk and go with his =)
This might work for you (GNU sed):
sed -re 's/ (.) / \1/g;w out1' -e 's/([ACTG])\1/2/g;s/[ACTG]./1/g' file >out2
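The two conversions can also be checked against a short Python sketch that builds both output lines from one input row. It assumes an even number of nucleotide columns after the ID (a trailing unpaired column would be dropped), and the function name pair_and_score is hypothetical.

```python
def pair_and_score(line):
    """Return (out1_line, out2_line) for one whitespace-separated input row."""
    fields = line.split()
    name, bases = fields[0], fields[1:]
    # Glue columns together two at a time: (A, C) -> "AC".
    pairs = [a + b for a, b in zip(bases[0::2], bases[1::2])]
    # Identical letters score 2, different letters score 1.
    scores = ["2" if pair[0] == pair[1] else "1" for pair in pairs]
    return " ".join([name] + pairs), " ".join([name] + scores)

if __name__ == "__main__":
    for row in ["rs001 A C T G C G T T", "rs002 C C T T G G A A"]:
        out1, out2 = pair_and_score(row)
        print(out1)
        print(out2)
```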

Apply regexp replace only to quoted piece

I need to apply a regexp filtration to affect only pieces of text within quotes and I'm baffled.
$in = 'ab c "d e f" g h "i j" k l';
#...?
$inquotes =~ s/\s+/_/g; #arbitrary regexp working only on the pieces inside quote marks
#...?
$out = 'ab c "d_e_f" g h "i_j" k l';
(the final effect can strip/remove the quotes if that makes it easier, 'ab c d_e_f g...)
You could figure out some cute trick that looks like line noise.
Or you could keep it simple and readable, and just use split and join. Using the quote mark as a field separator, operate on every other field:
my @pieces = split /\"/, $in, -1;
foreach my $i (0 ... $#pieces) {
    next unless $i % 2;
    $pieces[$i] =~ s/\s+/_/g;
}
my $out = join '"', @pieces;
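The split-and-join approach carries over to other languages unchanged. Here is a minimal sketch in Python, assuming the quote marks are balanced and never escaped; the function name underscore_in_quotes is hypothetical.

```python
import re

def underscore_in_quotes(text):
    """Replace runs of whitespace with '_' only inside double-quoted pieces."""
    pieces = text.split('"')
    # After splitting on the quote mark, odd-numbered pieces are the quoted ones.
    for i in range(1, len(pieces), 2):
        pieces[i] = re.sub(r"\s+", "_", pieces[i])
    return '"'.join(pieces)

if __name__ == "__main__":
    print(underscore_in_quotes('ab c "d e f" g h "i j" k l'))
```

Any other substitution can be dropped in where re.sub is called, since only the quoted pieces ever reach it.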
If you want to use just a regex, the following should work:
my $in = q(ab c "d e f" g h "i j" k l);
$in =~ s{"(.+?)"}{$1 =~ s/\s+/_/gr}eg;
print "$in\n";
(You said the "s may be dropped :) )
HTH,
Paul
Something like
m/"([^"]*)"/
should match the quoted chunks. My perl regex syntax is a little rusty, but shouldn't just placing quote literals around what you're capturing do the job? You've then got your quoted string d e f inside the first capture group, so you can do whatever you want to it... What kind of 'arbitrary operation' are you trying to do to the quoted strings?
Hmm.
You might be better off matching the quoted strings, then passing them to another regex, rather than doing it all in one.