Sed, awk, Perl or other for de-interleaving text file

I would like a relatively compact command to perform line-by-line de-interleaving of a text file, i.e.
a1
a2
a3
a4
b1
b2
b3
b4
c1
c2
c3
c4
d1
d2
d3
d4
maps to
a1
b1
c1
d1
a2
b2
c2
d2
a3
b3
c3
d3
a4
b4
c4
d4
The interleaving depth should be adjustable. The lines themselves do not contain any useful structure to assist with the process, and the example above is just a toy example for demonstration purposes. What tool can I use to do this?

sort can do it!
$ sort -k1.2 your_file
-k1.2 sorts by first field starting from 2nd character.
Output:
a1
b1
c1
d1
a2
b2
c2
d2
a3
b3
c3
d3
a4
b4
c4
d4

Basically, what you're looking at doing is reading your data into a 2D array. As you read it in, you can (for example) put the data into the array row by row.
Then when you write the data out, you traverse the array column by column. Adjusting the (de-)interleaving you do just requires a different size of array (or at least that you use a different amount of it, though you could leave the array size itself fixed, if you chose).
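As a rough illustration of the row-in, column-out idea described above, here is a minimal Python sketch. The function name `deinterleave` and its `depth` parameter are made up for this example; it assumes the whole input fits in memory and that a short final row is acceptable.

```python
def deinterleave(lines, depth):
    """Fill a grid row by row (depth columns wide), then read it column by column."""
    rows = [lines[i:i + depth] for i in range(0, len(lines), depth)]
    out = []
    for col in range(depth):
        for row in rows:
            if col < len(row):  # the last row may be shorter than depth
                out.append(row[col])
    return out


if __name__ == "__main__":
    # Rebuild the toy input from the question: a1..a4, b1..b4, c1..c4, d1..d4
    sample = [f"{letter}{n}" for letter in "abcd" for n in range(1, 5)]
    print("\n".join(deinterleave(sample, 4)))
```

Changing `depth` is the only adjustment needed for a different interleaving depth; the grid width follows from it.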

According to your new requirements, reordering elements based on their position in the file:
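use strict;
use warnings;
my @sorted;
my $depth = 4; # the adjustable interleaving depth
while (<DATA>) {
    # $. is the 1-based input line number; subtract 1 before taking the
    # modulus so the index never becomes -1 (which would alias the last bucket)
    my $num = ($. - 1) % $depth;
    push @{ $sorted[$num] }, $_;
}
for (@sorted) {
    print @$_;
}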
__DATA__
a1
a2
a3
a4
b1
b2
b3
b4
c1
c2
c3
c4
d1
d2
d3
d4
Note that the script can be tested on an input file by changing <DATA> to <> and running:
perl script.pl input.txt

Update
Having finally understood your question, thanks to TLP, I suggest this solution. It expects the depth and the input file name on the command line:
$ perl deinter.pl 4 interleaved.txt
and prints the reordered data to STDOUT.
use strict;
use warnings;
my $depth = shift;
my @data = <>;
for my $start (0 .. $depth-1) {
    for (my $i = $start; $i < @data; $i += $depth) {
        print $data[$i];
    }
}
output
a1
b1
c1
d1
a2
b2
c2
d2
a3
b3
c3
d3
a4
b4
c4
d4
Previous solution
Here is a technique that reads the whole file into memory, builds a set of keys for comparison, and sorts the indices of the data so that they can be printed in the new order.
It can be changed for your purposes by modifying the regular expression that extracts the key fields, and changing the sort block so that the sorted order is correct.
If your file is enormous then it may be necessary to build only the array of keys in memory, and leave the rest of the data on file to be read as it is output.
use strict;
use warnings;
open my $fh, '<', 'interleaved.txt' or die $!;
my @data = <$fh>;
my @keys = map [ /^(.)(.)/ ], @data;
my @sorted = sort {
    $keys[$a][1] <=> $keys[$b][1] or
    $keys[$a][0] cmp $keys[$b][0]
} 0 .. $#keys;
print $data[$_] for @sorted;
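For the enormous-file case mentioned above, one possible approach is to keep only the byte offset of each line in memory and seek back into the file while printing. The Python sketch below is a simplification under the assumption that the reordering depends only on line position (as in the later question update), so the "key" is just the line index. The name `iter_deinterleaved` is hypothetical.

```python
def iter_deinterleaved(path, depth):
    """Yield the lines of `path` de-interleaved, holding only offsets in memory."""
    offsets = []
    with open(path, "rb") as fh:
        # Pass 1: remember where each line starts.
        while True:
            pos = fh.tell()
            if not fh.readline():
                break
            offsets.append(pos)
        # Pass 2: walk the offsets with a stride of `depth` and re-read each line.
        for start in range(depth):
            for pos in offsets[start::depth]:
                fh.seek(pos)
                yield fh.readline()
```

A caller could write each yielded `bytes` line to `sys.stdout.buffer`. The trade-off is one seek per output line, which may be slow on some storage.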

This might work for you (GNU sed and sort):
sed '1{x;s/^/1/;x};G;s/\n/\t/p;x;y/1234/2341/;x;d' file|sort -sk2|sed 's/\t.*//'
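Read step by step, the pipeline above tags each line with a counter that cycles through 1 to 4, stable-sorts on that tag, then strips the tag again. The same decorate, sort, undecorate pattern can be sketched in Python as follows; `deinterleave_by_sort` is an illustrative name, not part of any tool.

```python
def deinterleave_by_sort(lines, depth):
    # Decorate: tag each line with its position inside its group of `depth` lines.
    tagged = [(index % depth, line) for index, line in enumerate(lines)]
    # Sort on the tag only; list.sort() is stable, so file order survives within a tag.
    tagged.sort(key=lambda pair: pair[0])
    # Undecorate: drop the tag again.
    return [line for _, line in tagged]
```

The stable sort plays the role of `sort -s` here: without stability, lines that share a tag could come out in any order.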

I'd like to credit Borodin and TLP for their inputs and answers, which inspired the solution. It's ugly, but I like it:
awk 'BEGIN{v=4}{now=(NR-1)%v; STOR[now] = STOR[now] "\n" $0;} END {for (v in STOR) print STOR[v]}'
It also has the flaw of printing spurious newlines (well, the ones prepended to the start of each stored string), but I can deal with that.
EDIT:
Solution for the newlines (this version also walks the indices in order, since awk does not guarantee the order of for (v in STOR)):
awk 'BEGIN{v=4}{now=(NR-1)%v; STOR[now] = STOR[now] "\n" $0} END {for (i=0; i<v; i++) print substr(STOR[i],2)}'

Related

Why can't I delete the empty row?

The file A contains one row like below,
SNP1 AA TT GGSNP2 GG CC AASNP3 GG CC AA...
I want to change the format and make it like this,
SNP1 TT AA TT GG CC AA
SNP2 AA GG CC TT GG CC
SNP3 GG AA TT TT CC TT
...
(each row begins with SNP_)
I wrote a Perl script to replace SNP with \nSNP, but the first row of the output file was always empty, even though I tried several ways to delete it.
Is there any suggestion for me, or another way to get the final output file?
Thanks a lot.
Just make sure there is a character before every SNP that you alter:
s/.\K(?=SNP)/\n/g
open $file_in...
open $file_out...
local $/ = 'SNP';
<$file_in>; # discard empty row
while ( my $record = <$file_in> ) {
chomp $record;
# make changes to $record, which will start as e.g. "1 AA TT GG"
....
print $file_out "SNP$record\n";
}
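The record-separator idea above (split on the marker and discard the empty piece in front of the first SNP) can be sketched in Python as well. `split_records` is a hypothetical helper, and the sample row is the one from the question.

```python
def split_records(text, marker="SNP"):
    # str.split leaves an empty first piece when the text starts with the marker;
    # dropping empty pieces is what avoids the blank first row.
    pieces = text.split(marker)
    return [marker + piece.strip() for piece in pieces if piece.strip()]


if __name__ == "__main__":
    row = "SNP1 AA TT GGSNP2 GG CC AASNP3 GG CC AA"
    print("\n".join(split_records(row)))
```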

Perl: read file and re-arrange into columns

I have a file that I want to read in, which has the following structure:
EDIT: I made the example a bit more specific to clarify what I need
HEADER
MORE HEADER
POINTS 2
x1 y1 z1
x2 y2 z2
VECTORS velocities
u1 v1 w1
u2 v2 w2
VECTORS displacements
a1 b1 c1
a2 b2 c2
The number of blocks containing some data is arbitrary, and so is their order.
I want to read only the data under "POINTS" and under "VECTORS displacements" and rearrange them in the following format:
x1 y1 z1 a1 b1 c1
x2 y2 z2 a2 b2 c2
I managed to read the xyz and abc blocks into separate arrays, but my problem is combining them into one.
I should mention that I am a Perl newbie. Can somebody help me?
This is made very simple using the range operator. The expression
/DATA-TO-READ/ .. /DATA-NOT-TO-READ/
evaluates to 1 on the first line of the range (the DATA-TO-READ line), 2 on the second etc. On the last line (the DATA-NOT-TO-READ line) E0 is appended to the count so that it evaluates to the same numeric value but can also be tested for being the last line. On lines outside the range it evaluates to a false value.
This program accumulates the data in array @output and prints it when the end of the input is reached. It expects the path to the input file as a parameter on the command line.
use strict;
use warnings;
my (@output, $i);
while (<>) {
    my $index = /DATA-TO-READ/ .. /DATA-NOT-TO-READ/;
    if ($index and $index > 1 and $index !~ /E/) {
        push @{ $output[$index-2] }, split;
    }
}
print "@$_\n" for @output;
output
x1 y1 z1 a1 b1 c1
x2 y2 z2 a2 b2 c2
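For the block names used in the edited question ("POINTS" and "VECTORS displacements"), a comparable approach might look like the Python sketch below. It assumes every block header starts with POINTS or VECTORS; the names `read_blocks` and `paste` are invented for this example.

```python
def read_blocks(lines, wanted=("POINTS", "VECTORS displacements")):
    """Collect the data lines that follow each wanted block header."""
    blocks = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(("POINTS", "VECTORS")):
            # A header line opens a new block; keep it only if it is wanted.
            current = next((w for w in wanted if line.startswith(w)), None)
            if current is not None:
                blocks[current] = []
        elif current is not None and line:
            blocks[current].append(line)
    return blocks


def paste(blocks, order=("POINTS", "VECTORS displacements")):
    """Join the n-th line of each block side by side."""
    return [" ".join(parts) for parts in zip(*(blocks[name] for name in order))]
```

Because the blocks are stored by name, their order in the file does not matter; `paste` decides the column order.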
I only used 1 array to remember the first 3 columns. You can output directly when processing the second part of the data.
#!/usr/bin/perl
use strict;
use warnings;
my @first; # To store the first 3 columns.
my $reading; # Flag: are we reading the data?
while (<>) {
    next unless $reading or /DATA-TO-READ/; # Skip the header.
    $reading = 1, next unless $reading; # Skip the DATA-TO-READ line, enter the
                                        # reading mode.
    last if /DATA-NOT-TO-READ/; # End of the first part.
    chomp; # Remove a newline.
    push @first, $_; # Remember the line.
}
undef $reading; # Restore the flag.
while (<>) {
    next unless $reading or /DATA-TO-READ/;
    $reading = 1, next unless $reading;
    last if /DATA-NOT-TO-READ/;
    print shift @first, " $_"; # Print the remembered columns + current line.
}

How to append new values to hash in Perl

I have a hash of hashes where, at the last level, I want each value to be appended - not updated - if that value already exists. What would be the best way to do this? I was thinking about making the values as lists, but this is probably not the most efficient way...
Here's where I got so far:
#!/usr/local/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $dir='D:\\';
open my $data,"<","$dir\\file.txt";
my (@selecteddata,@array,%hash);
while (<$data>) {
    chomp $_;
    my @line= split "\t";
    $hash{$line[1]}{$line[2]}=$line[0];
    warn Dumper \%hash;
}
close $data;
Note, this code updates the values at last level with value $line[0], but if the key $line[4] already exists (meaning, it already has a previous value $line[0]) I want this value to be appended and not updated.
So, ideally, for the following (tab-separated) list:
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b4 c4
The hash would look something like this - I don't know exactly how the grouping of a4 and a5 should look like, so as long as they are grouped it should be ok:
{
'b1' => {'c1' => 'a1'},
'b2' => {'c2' => 'a2'},
'b3' => {'c3' => 'a3'},
'b4' => {'c4' => 'a4, a5'}
}
You can append your string,
$_ = defined($_) ? "$_, $line[0]" : $line[0]
for $hash{$line[1]}{$line[2]};
or use an array, which is better suited for storing a list of elements,
push @{ $hash{$line[1]}{$line[2]} }, $line[0];
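For comparison, the array-valued variant can be sketched in Python with nested `defaultdict`s, which create the inner containers on first use much like Perl's autovivification. `group_rows` is a hypothetical name, and the rows are assumed to be already split into three fields.

```python
from collections import defaultdict


def group_rows(rows):
    """Map second field -> third field -> list of first fields, appending on repeats."""
    grouped = defaultdict(lambda: defaultdict(list))
    for first, second, third in rows:
        grouped[second][third].append(first)
    return grouped


rows = [("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a3", "b3", "c3"),
        ("a4", "b4", "c4"), ("a5", "b4", "c4")]
print(group_rows(rows)["b4"]["c4"])  # the two values grouped under b4 / c4
```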

split one column by ',' and use the values in calculations

I'm writing a script in which I'm using a text file, where one column can contain two letters (A, B, C or D) separated by a ",". This column can also contain just one of those letters. I have to use both letters for further calculations in the rest of the script. This is a simplified example of my input file (here $variants):
C1 C2 C3 C4 C5 C6 ... C9
text 2 A D values and text in the other columns
text 4 B C values and text in the other columns
text 5 A B,D values and text in the other columns
So in line 3 of C4 there is a B and D. After C4 there are still a lot of columns, which cannot be changed since I need them in other parts of my script.
I have a second input file from which, based on the letters present in C3 and C4, some values are extracted. This is how this second input file looks like (here $frequency)
C1 C2 A a B b C c D d
text 1 0 1 0 0 0 0 0 0
text 2 1 0 5 4 0 0 0 0
text 3 0 0 0 0 10 11 3 6
text 4 1 0 9 4 0 2 0 0
text 5 5 3 0 0 6 7 4 0
This is how my output should look like:
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
text 2 A D 1 0 0 0 empty
text 4 B C 9 4 0 2 empty
text 5 A B,D 5 3 0 0 4 0
So for line 1, there is A in C3, then the script extracts the values for A and a from $frequency and puts them in C5 and C6. The values from C4 are then put in C7 and C8 from the output file. Now in the 3rd line there is B,D in C4. So what the script needs to do now is putting the corresponding values from B and b in C7 and C8 and the values for D and d in C9 and C10.
The only thing where I have still problems in my script is in splitting up this C4 when there is a ','. The rest is working.
This is how the problematic part of my script looks like
while(<$variants>){
    next if /^\s*#/;
    next if /^\s*"/;
    chomp;
    my ($chr, $pos, $refall, @altall) = split /\t/; # How should I specify here the C4, as an array? So that I don't know
    my @ref_data = @{$frequency_data[$pos]}{$refall, lc($refall)};
    my @alt_data = @{$frequency_data[$pos]}{$altall, lc($altall)}; # this works for C3 ($refall), but not for C4 when there are two letters
    $pos = $#genes if $circular and $pos > $#genes; # adding annotation # this can be ignored here, since this line isn't part of my question
    print join("\t","$_ ", $genes[$pos] // q(), @ref_data, @alt_data), "\n"; # printing annotation
}
So could someone help me with splitting of this C4 by ',' and still use the information for extracting values from $variants
I think the easiest would be treating columns 3 and 4 as lists from the get-go:
while(<$variants>){
    next if /^\s*#/;
    next if /^\s*"/;
    chomp;
    my ($chr, $pos, $refall_string, $altall_string, @other) = split /\t/;
    my @refall = split(",", $refall_string);
    my @altall = split(",", $altall_string);
    my @ref_data_all = (); # Treat C3 as array just in case...
    foreach my $refall (@refall) {
        push @ref_data_all, @{$frequency_data[$pos]}{ $refall, lc($refall) };
    }
    my @alt_data_all = ();
    foreach my $altall (@altall) {
        push @alt_data_all, @{$frequency_data[$pos]}{ $altall, lc($altall) };
    }
    $pos = $#genes if $circular and $pos > $#genes;
    print join("\t","$_ ", $genes[$pos] // q(),
        @ref_data_all, @alt_data_all), "\n";
}
I didn't test this, but the approach should be clear even if there are some minor bugs.
All you need is a couple of map calls.
If you write
map { $_, lc } split /,/, $refall
then you have split the field at any commas and duplicated each letter as upper case and lower case.
This is the complete loop (tested).
while (<$variants>) {
    next if /^\s*#/;
    next if /^\s*"/;
    chomp;
    my ($chr, $pos, $refall, $altall) = split /\t/;
    my $entry = $frequency_data[$pos];
    my @ref_data = map { $entry->{$_} } map { $_, lc } split /,/, $refall;
    my @alt_data = map { $entry->{$_} } map { $_, lc } split /,/, $altall;
    $pos = $#genes if $circular and $pos > $#genes;
    print join("\t","$_ ", $genes[$pos] // q(), @ref_data, @alt_data), "\n";
}
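The split-then-duplicate step is perhaps easier to see in isolation. In this Python sketch, `lookup` is a hypothetical helper and `entry` is a hand-built dictionary holding the position-5 row of the frequency table from the question; how the real script stores that row is an assumption.

```python
def lookup(entry, field):
    # "B,D" -> ["B", "b", "D", "d"] -> the matching counts from `entry`.
    names = [name for letter in field.split(",") for name in (letter, letter.lower())]
    return [entry.get(name) for name in names]


# Row for position 5 of the frequency table: columns A a B b C c D d
entry = {"A": 5, "a": 3, "B": 0, "b": 0, "C": 6, "c": 7, "D": 4, "d": 0}
print(lookup(entry, "A"))    # values for A and a
print(lookup(entry, "B,D"))  # values for B, b, D and d
```

A field with a single letter simply yields two values, so the same call covers both C3 and C4.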

Use keys and pairing elements Perl

My data looks like this:
G1 G2 G3 G4
Pf1 NO B1 NO D1
Pf2 NO NO C1 D1
Pf3 A1 B1 NO D1
Pf4 A1 NO C1 D2
Pf5 A3 B2 C2 D3
Pf6 NO B3 NO D3
My purpose is to check in each column if an element (different from the "NO" cases) is showed twice (like A1 in column 2, for example) and only twice (if it is showed three times or more I don't want it in the output) and, if so, write the correspondent elements of the first column. So, the desired output looks like this:
Pf3 Pf4 A1
Pf1 Pf3 B1
Pf2 Pf4 C1
Pf5 Pf6 D3
I'm trying to write a perl script, but I need some help to focus on the different steps. This is what I did so far:
open (HAN, "< $file_in") || die "Impossible open the in_file";
@r = <HAN>;
close (HAN);
for ($i=0; $i<=$#r; $i++){
    chomp ($r[$i]);
    ($Ids, @v) = split (/\t/, $r[$i]);
}
But I cannot go on in any direction!
(My perl knowledge needs to be pushed by you!)
The hot points in my mind are:
how do I compare elements from the same column (or anyway in the same file)?
how can I associate the elements of the first column with the other column ones (may be keys)?
Any help is absolutely necessary and welcome!
use Data::Dumper;
my %hash;
while (<DATA>) {
    next if $.==1;
    chomp;
    my ($first,@others) = (split /\s+/);
    for (@others){
        $hash{$_}.=' '.$first;
    }
}
print Dumper \%hash;
__DATA__
G1 G2 G3 G4
Pf1 NO B1 NO D1
Pf2 NO NO C1 D1
Pf3 A1 B1 NO D1
Pf4 A1 NO C1 D2
Pf5 A3 B2 C2 D3
Pf6 NO B3 NO D3
What did I use here? (tricks)
while (<DATA>) {BLOCK} - reads data from the special DATA section (everything after __DATA__) of the Perl script file. (Yes, you can put test data there if you want. But don't store everything! It is not a bin!)
next if $.==1 - $. is a special variable that stores the current line number of the input, like an index; this skips the header line.
chomp; - to explain this one, go back to while (<DATA>).
Some variables in Perl are implicit. In functions, @_ is the array of input parameters. And Perl programmers always like to use $_, the default variable.
This while (<DATA>) is really a hidden while (defined($_ = <DATA>)).
The chomp function uses that default variable and removes the \n at its end, if there is one.
The split /REGEX/ function also takes the default variable ($_) as its input when no string is given.
Perl multi liner :),
perl -anE '
/^\S/ or next;
$k = shift @F;
push @{$t{$_}}, $k for @F;
}{
@$_-1==2 and say join " ", @$_ for map [@{$t{$_}},$_], sort keys %t;
' file
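A column-aware variant of the same counting idea, written as a Python sketch: it keys each value by its column, skips "NO", and keeps only values seen exactly twice. The name `pairs_per_column` is made up, and the input is assumed to be whitespace-separated with one header row.

```python
from collections import defaultdict


def pairs_per_column(lines):
    seen = defaultdict(list)  # (column index, value) -> row labels, in file order
    for line in lines[1:]:    # skip the header row
        label, *values = line.split()
        for col, value in enumerate(values):
            if value != "NO":
                seen[(col, value)].append(label)
    # Keep values that occur exactly twice within their column.
    return [labels + [value]
            for (col, value), labels in sorted(seen.items())
            if len(labels) == 2]


if __name__ == "__main__":
    data = """    G1 G2 G3 G4
Pf1 NO B1 NO D1
Pf2 NO NO C1 D1
Pf3 A1 B1 NO D1
Pf4 A1 NO C1 D2
Pf5 A3 B2 C2 D3
Pf6 NO B3 NO D3""".splitlines()
    for row in pairs_per_column(data):
        print(" ".join(row))
```

Keying on the column index as well as the value keeps a label that happens to appear in two different columns from being counted together.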