Why can't I delete the empty row? - perl

File A contains a single row like the one below,
SNP1 AA TT GGSNP2 GG CC AASNP3 GG CC AA...
I want to change the format and make it like this,
SNP1 TT AA TT GG CC AA
SNP2 AA GG CC TT GG CC
SNP3 GG AA TT TT CC TT
...
(each row begins with SNP_)
I wrote a Perl script to replace SNP with \nSNP, but the first row of the output file was always empty, even though I made several attempts to delete it.
So, is there any suggestion for me, or another way to get the final output file?
Thanks a lot.

Just make sure there is a character before every SNP that you alter:
s/.\K(?=SNP)/\n/g
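For example, that substitution can be applied straight from the command line; here A is the input file named in the question, and A_new is just a placeholder name for the output file:
perl -pe 's/.\K(?=SNP)/\n/g' A > A_new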

open $file_in, ...;
open $file_out, ...;
local $/ = 'SNP';    # read records separated by "SNP"
<$file_in>;          # discard the leading empty record
while ( my $record = <$file_in> ) {
    chomp $record;   # strip the trailing "SNP" separator
    # make changes to $record, which will start as e.g. "1 AA TT GG"
    ...
    print $file_out "SNP$record\n";
}
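For completeness, here is a runnable sketch of the same record-separator idea; the file names A and A_new, and the absence of any extra per-record editing, are assumptions rather than part of the original answer:

use strict;
use warnings;

# Sketch only: 'A' and 'A_new' are placeholder file names.
open my $file_in,  '<', 'A'     or die "Cannot open A: $!";
open my $file_out, '>', 'A_new' or die "Cannot open A_new: $!";

local $/ = 'SNP';    # treat "SNP" as the record separator
<$file_in>;          # discard the leading empty record

while ( my $record = <$file_in> ) {
    chomp $record;   # remove the trailing "SNP" separator
    print $file_out "SNP$record\n";
}

close $file_in;
close $file_out;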

Related

How to parse rows in my txt file properly using perl

I hope to parse a txt file that looks like this:
A a, b, c
B e
C f, g
The format I hope to get is:
A a
A b
A c
B e
C f
C g
I tried this:
perl -ane '@s=split(/\,/, $F[1]); foreach $k (@s){print "$F[0] $k\n";}' txt.txt
but it only works when there's no space after commas. In the original file, there is a space after each comma. What should I do?
$ perl -lane 'print "$F[0] $_" for map { tr/,//rd } @F[1..$#F]' input.txt
A a
A b
A c
B e
C f
C g
Use auto-split mode on whitespace as normal and, for each element of an array slice of @F from the second field to the last, remove any commas (I used tr///d; the more usual s/// works too, of course) and print it with the first field prepended.
Alternatively, don't use -a because it splits too much.
perl -lne'@F = split(" ", $_, 2); print "$F[0] $_" for split(/,\s*/, $F[1])'
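If the one-liners feel too cryptic, the same logic can also be written as a small standalone script; this is only a sketch (invoke it as, say, perl pairs.pl txt.txt, where the script name is a placeholder):

use strict;
use warnings;

# Split each line into a label and the rest, then split the rest on
# commas followed by optional whitespace and print one pair per line.
while (my $line = <>) {
    chomp $line;
    my ($label, $rest) = split ' ', $line, 2;
    next unless defined $rest;
    print "$label $_\n" for split /,\s*/, $rest;
}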

Perl : $_ and $array[$_] issue

Below is my code, please review.
use strict;
my @people = qw{a b c d e f};
foreach (@people){
    print $_,"$people[$_]\n";
}
Below is the output,
[~/perl]$ perl test.pl
aa   # why is the output of $people[$_] not the same as $_?
ba
ca
da
ea
fa
Thanks for your help.
$_ is the actual element you're looking at. $people[$_] is getting the $_th element out of @people. It's intended to be used with numerical indices, so it numifies its argument. Taking a letter and "converting" it to a number just converts to zero since it can't parse a number out of it. So $people[$_] is just $people[0] when $_ is not a number. If you try the same experiment where some of the list elements are actually numerical, you'll get some more interesting results.
Try:
use strict;
my @people = qw{a b c 3 2 1};
foreach (@people){
    print $_,"$people[$_]\n";
}
Output:
aa
ba
ca
33
2c
1b
In the first three cases we couldn't parse a, b, or c as numbers, so we got the zeroth element. Then 3, 2, and 1 can actually be converted to numbers, so we get elements 3, 2, and then 1.
EDIT: As mentioned in a comment by @ikegami, put use warnings at the top of your file in addition to use strict to get warned about this sort of thing.
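If the intent was to print each element next to its position, the usual idiom is to iterate over the indices rather than the values; a minimal sketch:

use strict;
use warnings;

my @people = qw{a b c d e f};

# Loop over the indices so $people[$i] refers to a real position.
foreach my $i (0 .. $#people) {
    print "$i $people[$i]\n";
}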

mark list of items in string without overlap

I have a sample of text:
my $text = 'a bb cc xx aa a b c a';
and a list of terms that might be in the text:
my @words = ('bb cc',
             'a bb cc',
             'xx aa a b',
             'a b',
             'a'
            );
I need to find the occurrences of these words, using the longest matches possible, and not marking anything twice. So if I marked the matches in the text above, it would look like this:
<a bb cc> <xx aa a b> c <a>
Notice that I did not mark bb cc, because that is part of the larger match a bb cc.
Any ideas on a way to do this? I feel like it should have been encountered many times before.
A simple substitution should do; you'll have to sort by length:
my $re = '('.join('|', sort {length $b <=> length $a} map(quotemeta, @words)).')';
$text =~ s/$re/<$1>/g;
say $text;
The output is as expected on Perl 5.20.2; I can't check other versions right now.
The quotemeta part isn't actually needed for the examples you gave; it's there to escape characters with special meaning in the regexen.
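Put together as a self-contained sketch with the sample data from the question (use feature 'say' is added only so the snippet runs on its own):

use strict;
use warnings;
use feature 'say';

my $text  = 'a bb cc xx aa a b c a';
my @words = ('bb cc', 'a bb cc', 'xx aa a b', 'a b', 'a');

# Build one alternation, longest terms first, so the regex engine
# always tries the longest possible match at each position.
my $re = '(' . join('|', sort { length $b <=> length $a } map quotemeta, @words) . ')';

$text =~ s/$re/<$1>/g;
say $text;   # <a bb cc> <xx aa a b> c <a>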

Use keys and pairing elements Perl

My data looks like this:
G1 G2 G3 G4
Pf1 NO B1 NO D1
Pf2 NO NO C1 D1
Pf3 A1 B1 NO D1
Pf4 A1 NO C1 D2
Pf5 A3 B2 C2 D3
Pf6 NO B3 NO D3
My purpose is to check, in each column, whether an element (other than the "NO" cases) appears exactly twice (like A1 in column 2, for example; if it appears three times or more I don't want it in the output) and, if so, to write the corresponding elements of the first column. So, the desired output looks like this:
Pf3 Pf4 A1
Pf1 Pf3 B1
Pf2 Pf4 C1
Pf5 Pf6 D3
I'm trying to write a Perl script, but I need some help working out the different steps. This is what I have done so far:
open (HAN, "< $file_in") || die "Impossible open the in_file";
@r = <HAN>;
close (HAN);
for ($i=0; $i<=$#r; $i++){
    chomp ($r[$i]);
    ($Ids, @v) = split (/\t/, $r[$i]);
}
But I cannot go on in any direction!
(My perl knowledge needs to be pushed by you!)
The main points in my mind are:
how do I compare elements from the same column (or anyway within the same file)?
how can I associate the elements of the first column with the ones in the other columns (maybe with hash keys)?
Any help is absolutely necessary and welcome!
use Data::Dumper;
my %hash;
while (<DATA>) {
    next if $. == 1;
    chomp;
    my ($first, @others) = split /\s+/;
    for (@others) {
        $hash{$_} .= ' ' . $first;
    }
}
print Dumper \%hash;
__DATA__
G1 G2 G3 G4
Pf1 NO B1 NO D1
Pf2 NO NO C1 D1
Pf3 A1 B1 NO D1
Pf4 A1 NO C1 D2
Pf5 A3 B2 C2 D3
Pf6 NO B3 NO D3
What do I use here? (tricks)
while (<DATA>) {BLOCK} - reads data from the special DATA section of the Perl script file. (Yes, you can put test data there if you want, but don't store everything there; it is not a bin!)
next if $. == 1 - $. is a special variable that stores the line number of the input data, like an index.
chomp; - back in the while (<DATA>) loop.
Some variables in Perl are hidden. In functions, @_ is the array of input parameters. And Perl programmers always like to use $_, the hidden "you" variable.
This while (<DATA>) is really a hidden while (defined($_ = <DATA>)).
The function chomp uses the hidden $_ variable and tries to chop the \n at the end.
The function split /REGEX/ also takes the hidden $_ variable as its default.
A Perl multi-liner :),
perl -anE '
    /^\S/ or next;
    $k = shift @F;
    push @{$t{$_}}, $k for @F;
  }{
    @$_ - 1 == 2 and say join " ", @$_ for map [ @{$t{$_}}, $_ ], sort keys %t;
' file
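For readers who prefer the same idea spelled out, here is a sketch of a standalone script that builds the hash and then prints only the values occurring exactly twice; the whitespace splitting and the explicit NO filter are assumptions based on the sample data:

use strict;
use warnings;

my %seen;    # value => list of first-column labels it appeared with

while (my $line = <>) {
    chomp $line;
    next if $. == 1;                    # skip the header row
    my ($label, @values) = split /\s+/, $line;
    for my $value (@values) {
        next if $value eq 'NO';         # ignore the "NO" placeholders
        push @{ $seen{$value} }, $label;
    }
}

# Print only the values that occurred exactly twice, e.g. "Pf3 Pf4 A1".
for my $value (sort keys %seen) {
    my @labels = @{ $seen{$value} };
    print "@labels $value\n" if @labels == 2;
}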

Sed, awk, Perl or other for de-interleaving text file

I would like a relatively compact command to perform line-by-line de-interleaving of a text file, i.e.
a1
a2
a3
a4
b1
b2
b3
b4
c1
c2
c3
c4
d1
d2
d3
d4
maps to
a1
b1
c1
d1
a2
b2
c2
d2
a3
b3
c3
d3
a4
b4
c4
d4
The interleaving depth should be adjustable. The lines themselves do not contain any useful structure to assist with the process, and the example above is just a toy example for demonstration purposes. What tool can I use to do this?
sort can do it!
$ sort -k1.2 your_file
-k1.2 sorts by the first field, starting from its 2nd character.
Output:
a1
b1
c1
d1
a2
b2
c2
d2
a3
b3
c3
d3
a4
b4
c4
d4
Basically, what you're looking at doing is reading your data into a 2D array. As you read it in, you can (for example) put the data into the array row by row.
Then when you write the data out, you traverse the array column by column. Adjusting the (de-)interleaving you do just requires a different size of array (or at least that you use a different amount of it, though you could leave the array size itself fixed, if you chose).
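As a sketch of that idea in Perl (the fixed depth of 4 and reading the lines from a file given on the command line or from STDIN are assumptions for this example):

use strict;
use warnings;

my $depth = 4;           # adjustable (de-)interleaving depth
my @matrix;              # 2D array, filled row by row

# Read the input into rows of $depth lines each.
my @lines = <>;
chomp @lines;
while (@lines) {
    push @matrix, [ splice @lines, 0, $depth ];
}

# Write the data out column by column.
for my $col (0 .. $depth - 1) {
    for my $row (@matrix) {
        print "$row->[$col]\n" if defined $row->[$col];
    }
}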
According to your new requirements, reordering elements based on their position in the file:
use strict;
use warnings;
my @sorted;
my $depth = 4;                    # the adjustable interleaving depth
while (<DATA>) {
    my $num = ($. - 1) % $depth;  # $. is the input line number
    push @{ $sorted[$num] }, $_;
}
for (@sorted) {
    print @$_;
}
__DATA__
a1
a2
a3
a4
b1
b2
b3
b4
c1
c2
c3
c4
d1
d2
d3
d4
Note that the script can be tested on an input file by changing <DATA> to <> and running:
perl script.pl input.txt
Update
Having finally understood your question, thanks to TLP, I suggest this solution. It expects the depth and the input file name on the command line:
$ perl deinter.pl 4 interleaved.txt
and prints the reordered data to STDOUT.
use strict;
use warnings;
my $depth = shift;
my @data = <>;
for my $start (0 .. $depth-1) {
    for (my $i = $start; $i < @data; $i += $depth) {
        print $data[$i];
    }
}
Output:
a1
b1
c1
d1
a2
b2
c2
d2
a3
b3
c3
d3
a4
b4
c4
d4
Previous solution
Here is a technique that reads the whole file into memory, builds a set of keys for comparison, and sorts the indices of the data so that they can be printed in the new order.
It can be changed for your purposes by modifying the regular expression that extracts the keys fields, and changing the sort block so that the sorted order is correct.
If your file is enormous then it may be necessary to build only the array of keys in memory, and leave the rest of the data on file to be read as it is output.
use strict;
use warnings;
open my $fh, '<', 'interleaved.txt' or die $!;
my @data = <$fh>;
my @keys = map [ /^(.)(.)/ ], @data;
my @sorted = sort {
    $keys[$a][1] <=> $keys[$b][1] or
    $keys[$a][0] cmp $keys[$b][0]
} 0 .. $#keys;
print $data[$_] for @sorted;
This might work for you (GNU sed and sort):
sed '1{x;s/^/1/;x};G;s/\n/\t/p;x;y/1234/2341/;x;d' file|sort -sk2|sed 's/\t.*//'
I'd like to credit Borodin and TLP for their input and answers, which inspired this solution. It's ugly, but I like it:
awk 'BEGIN{v=4}{now=(NR-1)%v; STOR[now] = STOR[now] "\n" $0;} END {for (v in STOR) print STOR[v]}'
It also has the flaw of printing spurious newlines (the ones prepended to each group as it is built), but I can deal with that.
EDIT:
Solution for the newlines:
awk 'BEGIN{v=4}{now=(NR-1)%v; STOR[now] = STOR[now] "\n" $0;} END {for (v in STOR) print substr(STOR[v],2)}'