How do I replace row identifiers with sequential numbers? - perl

I have written a Perl script that splits 3 columns into scalars and replaces various values in the second column using regexes. This part works fine, as shown below. What I would like to do, though, is change the first column ($item_id) into a series of sequential numbers that restart when the original (numeric) value of $item_id changes.
For example:
123
123
123
123
2397
2397
2397
2397
8693
8693
8693
8693
would be changed to something like this (in a column):
1
2
3
4
1
2
3
4
1
2
3
4
This could either replace the first column or be a new fourth column.
I understand that I might do this through a series of if-else statements, and I tried that, but it doesn't seem to play well with the while loop I've already got working for me. - Thanks, Thom Shepard
open(DATA,"< text_to_be_processed.txt");
while (<DATA>)
{
chomp;
my ($item_id,$callnum,$data)=split(/\|/);
$callnum=~s/110/\%I/g;
$callnum=~s/245/\%T/g;
$callnum=~s/260/\%U/g;
print "$item_id\t$callnum\t$data\n";
} #End while
close DATA;

The basic steps are:
Outside of the loop declare the counter and a variable holding the previous $item_id.
Inside the loop you do three things:
reset the counter to 1 if the current $item_id differs from the previous one, otherwise increase it
use that counter, e.g. print it
remember the previous value
In code, this could look something like the following (untested):
my ($counter, $prev_item_id) = (0, '');
while (<DATA>) {
    chomp;
    my ($item_id, $callnum, $data) = split /\|/;   # plus your substitutions ("do your thing") here
    $counter = $item_id eq $prev_item_id ? $counter + 1 : 1;
    $prev_item_id = $item_id;
    print "$item_id\t$counter\t...\n";
}

This goes a little further than just what you asked...
Use lexical filehandles
autodie makes open throw an error automatically
Replace the call nums using a table
Don't assume the data is sorted by item ID
Here's the code.
use strict;
use warnings;
use autodie;

open(my $fh, "<", "text_to_be_processed.txt");

my %Callnum_Map = (
    110 => '%I',
    245 => '%T',
    260 => '%U',
);

my %item_id_count;
while (<$fh>) {
    chomp;
    my ($item_id, $callnum, $data) = split m{\|};

    for my $search (keys %Callnum_Map) {
        my $replace = $Callnum_Map{$search};
        $callnum =~ s{$search}{$replace}g;
    }

    my $item_count = ++$item_id_count{$item_id};
    print "$item_id\t$callnum\t$data\t$item_count\n";
}
By using a hash, it does not presume the data is sorted by item ID. So if it sees...
123|foo|bar
456|up|down
123|left|right
789|this|that
456|black|white
123|what|huh
It will produce...
1
1
2
1
2
3
This is more robust, assuming you want a count of how many times you've seen an item ID in the whole file. If you want how many times it's been seen consecutively, use Moritz's solution.

Is this what you are looking for?
open(DATA,"< text_to_be_processed.txt");
my $counter = 0;
my $prev;
while (<DATA>)
{
chomp;
my ($item_id,$callnum,$data)=split(/\|/);
$callnum=~s/110/\%I/g;
$callnum=~s/245/\%T/g;
$callnum=~s/260/\%U/g;
++$counter;
$item_id = $counter;
#reset counter if $prev is different than $item_id
$counter = 0 if ($prev ne $item_id );
$prev = $item_id;
print "$item_id\t$callnum\t$data\n";
} #End while
close DATA;

Related

Perl File with newline to HASH then to CSV

Hello Perl experts,
I'm sorry if I'm asking too much, but I've just started learning Perl and want to know more about going from a hash to CSV. I tried breaking my head over this for a few days but didn't get far.
My requirement is to convert the file below and print it as CSV, without using a custom module from CPAN, since custom modules are not allowed here.
The logic I tried was to read the input file into a hash and print it like CSV.
The first problem is that something is wrong with my code.
There is a complicated part to this: I need to search using the first keyword before the space, and if I don't find it, I need to add an empty '' field. E.g. ovpa_type is not present in the second node.
Like this there are 5L lines in the file.
I want to learn about adding lines into a hash of arrays, like PowerShell does, and converting it into CSV wherever I want.
My input file contains the below data.
begin node
name ccaita23.contoso.com
on_off on
group SYSTEM_PING_UNIX
suppress no
auto_delete yes
read_community public
write_community public
address 2.1.52.36
port 161
ovpa_type router
trace off
snmp_version 1
engineid 0
auth_protocol 1
is_key_ok 0
error_status 0
security_level 0
v3_user 0
priv_protocol 0
end node
begin node
name ccaidi7c.contoso.com
on_off on
group SYSTEM_PING_UNIX
suppress no
auto_delete yes
read_community public
write_community public
address 1.1.210.76
port 161
trace off
snmp_version 1
engineid 0
auth_protocol 1
is_key_ok 0
error_status 0
security_level 0
v3_user 0
priv_protocol 0
end node
Output required
ccaita23.contoso.com,on,SYSTEM_PING_UNIX,no,yes,public,public,2.11.52.36,161,router,off,1,0,1,0,0,0,0,0
ccaidi7c.contoso.com,on,SYSTEM_PING_UNIX,no,yes,public,public,1.1.210.76,161,,off,1,0,1,0,0,0,0,0
open FILE1, "File.node" or die;
my %hash;
while (my $line=<FILE1>) {
chomp($line);
(my $key,my $value) = split / /, $line;
$hash{$key} .= $value;
}
my $name = $hash{'name'};
my $group = $hash{'group'};
my $csv = "$name,$group\n";
print $csv;
my @fields = qw(
    name on_off group suppress auto_delete read_community write_community
    address port ovpa_type trace snmp_version engineid auth_protocol
    is_key_ok error_status security_level v3_user priv_protocol
);

open my $FILE1, "<", "File.node" or die $!;
local $/ = "";   # paragraph reading mode: reads a whole node record at once

while (my $rec = <$FILE1>) {
    # $rec =~ s/(?:begin|end)\s+node//g;   # get rid of begin/end (optional)
    my %hash = split ' ', $rec;   # split node on whitespace and feed into hash, in key/value fashion
    # take a hash slice for the @fields keys, mapping every undef to "" so join won't warn under warnings
    print join(",", map { $_ // "" } @hash{@fields}), "\n";
}

fast way to compare rows in a dataset

I asked this question in R and got a lot of answers, but all of them crash my 4 GB RAM computer after a few hours of running, or they take a very long time to finish.
faster way to compare rows in a data frame
Some people said that it's not a job to be done in R. As I don't know C and I'm a little bit fluent in Perl, I'll ask here.
I'd like to know if there is a fast way to compare each row of a large dataset with the other rows, identifying the rows with a specific degree of homology. Let's say for the simple example below that I want homology >= 3.
data:
sample_1,10,11,10,13
sample_2,10,11,10,14
sample_3,10,10,8,12
sample_4,10,11,10,13
sample_5,13,13,10,13
The output should be something like:
output
sample duplicate matches
1 sample_1 sample_2 3
2 sample_1 sample_4 4
3 sample_2 sample_4 3
Matches are counted when both lines have the same numbers in the same positions.
perl -F',' -lane'
    $k = shift @F;
    for my $kk (@o) {
        $m = grep { $h{$kk}[$_] == $F[$_] } 0 .. $#F;
        $m >= 3 or next;
        print ++$i, " $kk $k $m";
    }
    push @o, $k;
    $h{$k} = [ @F ];
' file
Output:
1 sample_1 sample_2 3
2 sample_1 sample_4 4
3 sample_2 sample_4 3
This solution provides an alternative to direct comparison, which will be slow for large data amounts.
Basic idea is to build an inverted index while reading the data.
This makes comparison faster if there are a lot of different values per column.
For each row, you look up the index and count the matches - this way you only consider the samples where this value actually occurs.
You might still have a memory problem because the index gets as large as your data.
To overcome that, you can shorten the sample name and use a persistent index (using DB_File, for example).
use strict;
use warnings;
use 5.010;

my @h;                        # inverted index: column number -> value -> list of sample names
my $LIMIT_HOMOLOGY = 3;

while (my $line = <>) {
    chomp $line;
    my @arr = split /,/, $line;
    my $sample_no = shift @arr;
    my %sim;                  # how many columns this sample shares with each earlier sample
    foreach my $i (0 .. $#arr) {
        my $value = $arr[$i];
        our $l;
        *l = \$h[$i]->{$value};      # alias $l to the index slot for (column $i, $value)
        foreach my $s (@$l) {
            $sim{$s}++;              # every sample already in that slot shares this column
        }
        push @$l, $sample_no;        # add the current sample to the slot
    }
    foreach my $s (keys %sim) {
        if ($sim{$s} >= $LIMIT_HOMOLOGY) {
            say "$sample_no: $s. Matches: $sim{$s}";
        }
    }
}
For 25000 rows with 26 columns of random integer values between 1 and 100, the program took 69 seconds on my MacBook Air to finish.
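For the persistent-index idea mentioned above, a rough sketch (my addition, not part of the original answer, and untested) could tie a flattened index to a DB_File database so it does not have to live entirely in RAM. Keys combine the column number with the value, and each bucket holds a space-joined list of sample names; "index.db" is just a placeholder file name here.

use strict;
use warnings;
use DB_File;

my $LIMIT_HOMOLOGY = 3;

# On-disk inverted index; "index.db" is an assumed file name.
tie my %index, 'DB_File', 'index.db' or die "Cannot tie index.db: $!";

while (my $line = <>) {
    chomp $line;
    my ($sample_no, @arr) = split /,/, $line;
    my %sim;
    for my $i (0 .. $#arr) {
        my $key  = "$i:$arr[$i]";                    # one bucket per (column, value) pair
        my @seen = split ' ', ($index{$key} // '');  # samples that already had this value in this column
        $sim{$_}++ for @seen;
        $index{$key} = join ' ', @seen, $sample_no;  # append the current sample to the bucket
    }
    for my $s (keys %sim) {
        print "$sample_no: $s. Matches: $sim{$s}\n" if $sim{$s} >= $LIMIT_HOMOLOGY;
    }
}

untie %index;

The trade-off is that every bucket update now goes through the tied hash and the disk, so this is only worth it when the in-memory index does not fit.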

Parallelization of perl script

I have a Perl script I wish to parallelise.
It consists of a while loop over more than 11000 lines nested inside another while loop over 3400 lines, which makes it extremely slow.
open (FILE1, "File1.txt") or die "Can't open File1";
open (OUT, ">Outfile.txt");
while (<FILE1>)
{
my #data=split (/ /, $_);
my $RS=1;
open (FILE2, "File2.txt") or die "Can't open File2";
while (<FILE2>)
{
my #value=split (/ /, $_);
if ($data[$RS] == 1) {print OUT $value[1];$RS++;}
elsif ($data[$RS] == 2) {print OUT $value[2];$RS++;}
elsif ($data[$RS] == 0) {print OUT $value[3];$RS++;}
}
close FILE2;
}
I'm looking for a way to do the equivalent of qsub with every line of File1 so I can send 3440 jobs. Any suggestions? I'd like to stay with Perl if possible. I tried to insert this code inside a bash script, but I don't really understand how to embed one language inside another.
My File1 contains a list of IDs with information in columns. Each column is then related to a single line in File2. I'd like to be able to run the second loop for multiple IDs simultaneously instead of one after another.
File1
ID RS_10 RS_15 RS_30
23 1 0 1
34 2 2 0
45 1 1 0
23 0 0 2
10 2 1 1
File2
RS_10 A B C
RS_15 D E F
RS_30 G H I
The first rule of optimization is not to do it too early (i.e. jumping to premature conclusions without profiling your code).
The second rule would probably refer to caching.
Your File2 isn't very large, so I'd say we load it into memory. This has the following advantages:
We do our parsing once and only once.
The file isn't obscenely large, so space isn't much of an issue.
We can create a data structure that makes lookups very simple.
About that first point: You split each line over three thousand times. Those cycles could have been better spent.
About that third point: you seem to do an index conversion:
1 → 1, 2 → 2, 0 → 3
Instead of testing for all values with an if/elsif-switch (linear complexity), we could use an array that does this translation (constant time lookups):
my @conversion = (3, 1, 2);
...;
print OUT $value[$conversion[$data[$RS++]]];
If this index conversion is constant, we could do it once and only once when parsing File2. This would look like
use strict; use warnings;
use autodie; # automatic error handling

my @file2;
{
    open my $file2, "<", "File2.txt";
    while (<$file2>) {
        my (undef, @vals) = split;
        # do the reordering. This is equivalent to @vals = @vals[2, 0, 1];
        unshift @vals, pop @vals;
        push @file2, \@vals;
    }
}
Now we can move on to iterating through File1. Printing the corresponding entry from File2 now looks like
open my $file1, "<", "File1.txt";
<$file1>; # remove header
while (<$file1>) {
    my ($id, @indices) = split;
    print $id, map $file2[$_][$indices[$_]], 0 .. $#indices;
    # but I guess you'd want some separator in between;
    # if so, set the $, variable
}
This algorithm is still quadratic (the map is just a for-loop in disguise), but this should have a better constant factor. The output of above code given your example input is
23 A F G
34 B E I
45 A D I
23 C F H
10 B D G
(with $, = " "; $\ = "\n").
Where to go from here
This last step (looping through File1) could be parallelized, but this is unlikely to help much: IO is slow, communication between threads is expensive (IPC even more so), and the output would be in random order. We could spawn a bunch of workers, and pass unparsed lines in a queue:
use threads;   # should be 1st module to be loaded
use Thread::Queue;

use constant NUM_THREADS => 4; # number of cores

# parse the File2 data here

my $queue = Thread::Queue->new;
my @threads = map threads->new(\&worker), 1 .. NUM_THREADS;

# enqueue data
$queue->enqueue($_) while <$file1>;
# end the queue
$queue->enqueue((undef) x NUM_THREADS); # $queue->end in newer versions
# wait for threads to complete
$_->join for @threads;

sub worker {
    while (defined(my $line = $queue->dequeue)) {
        my ($id, @indices) = split ' ', $line;
        print $id, map $file2[$_][$indices[$_]], 0 .. $#indices;
    }
}
Note that this copies the @file2 into all threads. Fun fact: for the example data, this threaded solution takes roughly 4× as long. This is mostly the overhead of thread creation, so this will be less of an issue for your data.
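If that per-thread copy of @file2 ever became a concern, one possible variation (my addition, not part of the original answer, and untested) is to put the parsed table into shared memory with threads::shared, so all workers read the same structure. Access to shared structures is slower than to a private copy, so this mainly pays off when the table is large. The names below reuse @file2 and $queue from the snippets above.

use threads;
use threads::shared;

# Build the shared table before spawning the worker threads,
# otherwise the workers will not see it.
my $file2_shared = shared_clone(\@file2);   # deep copy into shared memory

sub worker {
    while (defined(my $line = $queue->dequeue)) {
        my ($id, @indices) = split ' ', $line;
        print $id, map $file2_shared->[$_][$indices[$_]], 0 .. $#indices;
    }
}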
In any case, profile your code to see where you can optimize most effectively. I recommend the excellent Devel::NYTProf. E.g. for my non-threaded test run with this very limited data, the overhead implied by autodie and friends took more time than the actual processing. For you, the most expensive line would probably be
print $id, map $file2[$_][$indices[$_]], 0 .. $#indices;
but there isn't much we can do here inside Perl.

Merge lines and do operations if a condition is satisfied

I'm new to Perl and I would like to read a table and sum some values from specific lines. This is a simplified example of my input file:
INPUT :
Gene Size Feature
GeneA 1200 Intron 1
GeneB 100 Intron 1
GeneB 200 Intron 1
GeneB 150 Intron 2
GeneC 300 Intron 5
OUTPUT :
GeneA 1200 Intron 1
GeneB 300 Intron 1 <-- the size values are summed
GeneB 150 Intron 2
GeneC 300 Intron 5
Because GeneB is present for intron 1 with two different sizes, I would like to sum these two values and print only one line per intron number.
This is an example of the code that I want to write. I would like to make it more elaborate once I understand how to handle this kind of data.
#!/usr/bin/perl
use strict;
use warnings;

my $sum;
my @GAP_list;
my $prevline = 'na';

open INFILE, "Table.csv";
while (my $ligne = <INFILE>)
{
    chomp ($ligne);
    my @list = split /\t/, $ligne;
    my $gene          = $list[0];
    my $GAP_size      = $list[2];
    my $intron        = $list[3];
    my $intron_number = $list[4];
    if ($prevline eq 'na') {
        push @GAP_list, $GAP_size;
    }
    elsif ($prevline ne 'na') {
        my @list_p = split /\t/, $prevline;
        my $gene_p          = $list_p[0];
        my $GAP_size_p      = $list_p[2];
        my $intron_p        = $list_p[3];
        my $intron_number_p = $list_p[4];
        if (($gene eq $gene_p) && ($intron eq $intron_p) && ($intron_number eq $intron_number_p)) {
            push @GAP_list, $GAP_size;
        }
    }
    else {
        $sum = doSum(@GAP_list);
        print "$gene\tGAP\t$GAP_size\t$intron\t$intron_number\t$sum\n";
        $prevline = $ligne;
    }
}

# Subroutine
sub doSum {
    my $sum = 0;
    foreach my $x (@_) {
        $sum += $x;
    }
    return $sum;
}
Assuming the fields are separated by tabs, the following strategy would work. It buffers the last line, either adding the sizes up if the other fields are equal, or printing the old data and then replacing the buffer with the current line.
After the whole input has been processed, we must not forget to print out the contents that are still in the buffer.
my $first_line = do { my $l = <>; chomp $l; $l };
my ($last_gene, $last_tow, $last_intron) = split /\t/, $first_line;

while (<>) {
    chomp;
    my ($gene, $tow, $intron) = split /\t/;
    if ($gene eq $last_gene and $intron eq $last_intron) {
        $last_tow += $tow;
    } else {
        print join("\t", $last_gene, $last_tow, $last_intron), "\n";
        ($last_gene, $last_tow, $last_intron) = ($gene, $tow, $intron);
    }
}
print join("\t", $last_gene, $last_tow, $last_intron), "\n";
This works fine as long as genes that may be folded together are always consecutive. If the joinable records are spread all over the file, we have to keep a data structure of all records. After the whole file is parsed, we can emit nicely sorted sums.
We will use a multilevel hash that uses the gene as first level key, and the intron as 2nd level key. The value is the count/tow/whatever:
my %records;

# parse the file
while (<>) {
    chomp;
    my ($gene, $tow, $intron) = split /\t/;
    $records{$gene}{$intron} += $tow;
}

# emit the data:
for my $gene (sort keys %records) {
    for my $intron (sort keys %{ $records{$gene} }) {
        print join("\t", $gene, $records{$gene}{$intron}, $intron), "\n";
    }
}
This seems like something that can be done easily using a simple SQL query, especially as your files come in a database table format. I couldn't comment on your question to ask more about it, as I don't have enough reputation to do so.
So I'm assuming that you get your data from a table. Not that you can't solve this problem in Perl, but I strongly recommend using the database to do such calculations when fetching the data file, as that seems much easier. I'm not sure why you chose to do it in Perl, especially when you have lots of such fields in a file and want to do such operations on all of them. You could still use Perl to interact with your database and solve the problem via an SQL query.
So my proposed solution in SQL, if the data is collected from a database, is:
Write an SQL statement with a GROUP BY on the gene and feature fields and aggregate the size column.
If your table looks exactly like what you described, let us call it the GeneInformation table, and you loaded your data file into an SQL database (SQLite, maybe), then your select query would be:
SELECT gene, feature, SUM(size)
FROM GeneInformation
GROUP BY gene, feature;
That should give you a list of genes, features, and their corresponding total sizes.
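To illustrate the point about driving the database from Perl, a minimal sketch (my addition, untested, assuming the data has already been loaded into a GeneInformation table in a SQLite file called genes.db) using DBI with DBD::SQLite could look like this:

use strict;
use warnings;
use DBI;

# "genes.db" is an assumed database file name for this sketch.
my $dbh = DBI->connect("dbi:SQLite:dbname=genes.db", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

my $sth = $dbh->prepare(
    "SELECT gene, feature, SUM(size) FROM GeneInformation GROUP BY gene, feature"
);
$sth->execute;

# Print one tab-separated line per (gene, feature) group, sizes summed by the database.
while (my ($gene, $feature, $total) = $sth->fetchrow_array) {
    print join("\t", $gene, $total, $feature), "\n";
}

$dbh->disconnect;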
If an SQL solution is completely impossible for you, then I will talk about the Perl solution.
I noticed that the Perl solutions are based on the assumption that a particular gene's values appear consecutively in the file. If that is the case, then I would like to upvote amon's answer (which I can't do at the moment).

Perl grep not returning expected value

I have the following code:
#!/usr/bin/perl
# splits.pl
use strict;
use warnings;
use diagnostics;
my $pivotfile = "myPath/Internal_Splits_Pivot.txt";
open PIVOTFILE, $pivotfile or die $!;
while (<PIVOTFILE>) {              # loop through each line in file
    next if ($. == 1);             # skip first line (contains business segment code)
    next if ($. == 2);             # skip second line (contains transaction amount text)
    my @fields = split('\t',$_);   # split fields for line into an array
    print scalar(grep $_, @fields), "\n";
}
Given that the data in the text file is this:
4 G I M N U X
Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount
0000-13-I21 600
0001-8V-034BLA 2,172 2,172
0001-8V-191GYG 13,125 4,375
0001-9W-GH5B2A -2,967.09 2,967.09 25.00
I would expect the output from the Perl script to be 2 3 3 4, given the number of defined elements in each line. The file is a tab-delimited text file with 8 columns.
Instead I get 3 4 3 4 and I have no idea why!
For background, I am using Counting array elements in Perl as the basis for my development, as I am trying to count the number of elements in the line to know if I need to skip that line or not.
I suspect you have spaces mixed with the tabs in some places, and your grep test will consider " " true.
What does:
use Data::Dumper;
$Data::Dumper::Useqq=1;
print Dumper [<PIVOTFILE>];
show?
The problem should be in this line:
my @fields = split('\t',$_); # split fields for line into an array
Although the '\t' is still compiled as a regex and does match a literal tab, your file doesn't seem to be tab-only separated, at least here on SO. I changed the split regex to match arbitrary whitespace, ran the code on my machine and got the "right" result:
my @fields = split(/\s+/,$_); # split fields for line into an array
Result:
2
3
3
4
As a side note:
For background, I am using Counting array elements in Perl as the basis for my development, as I am trying to count the number of elements in the line to know if I need to skip that line or not.
Now I understand why you use grep to count array elements. That's important when your array contains undefined values, like here:
my @a;
$a[1] = 42;      # @a contains the list (undef, 42)
say scalar @a;   # 2
or when you manually deleted entries:
my @a = split /,/ => 'foo,bar';   # @a contains the list ('foo', 'bar')
delete $a[0];                     # @a contains the list (undef, 'bar')
say scalar @a;                    # 2
But in many cases, especially when you're using arrays to just store lists without operating on single array elements, scalar @a works perfectly fine.
my @a = (1 .. 17, 1 .. 25);   # (1, 2, ..., 17, 1, 2, .., 25)
say scalar @a;                # 42
It's important to understand what grep does! In your case
print scalar(grep $_, @fields), "\n";
grep returns the list of true values of @fields, and then you print how many there are. But sometimes this isn't what you want/expect:
my @things = (17, 42, 'foo', '', 0);   # even '' and 0 are things
say scalar grep $_ => @things;         # 3!
Because the empty string and the number 0 are false values in Perl, they won't get counted with that idiom. So if you want to know how long an array is, just use
say scalar @array;                      # number of array entries
If you want to count true values, use this
say scalar grep $_ => @array;           # number of true values
But if you want to count defined values, use this
say scalar grep defined($_) => @array;  # number of defined values
I'm pretty sure you already know this from the other answers on the linked page. In hashes, the situation is a little bit more complex because setting something to undef is not the same as deleteing it:
my %h = (a => 0, b => 42, c => 17, d => 666);
$h{c} = undef; # still there, but undefined
delete $h{d}; # BAM! $h{d} is gone!
What happens when we try to count values?
say scalar grep $_ => values %h; # 1
because 42 is the only true value in %h.
say scalar grep defined $_ => values %h; # 2
because 0 is defined although it's false.
say scalar grep exists $h{$_} => qw(a b c d); # 3
because undefined values can exist. Conclusion:
know what you're doing instead of copy'n'pasting code snippets :)
There are not only tabs; there are spaces as well.
Splitting on whitespace works, as shown below.
#!/usr/bin/perl
# splits.pl
use strict;
use warnings;
use diagnostics;

while (<DATA>) {                  # loop through each line in file
    next if ($. == 1);            # skip first line (contains business segment code)
    next if ($. == 2);            # skip second line (contains transaction amount text)
    my @fields = split(" ",$_);   # split fields by SPACE
    print scalar(@fields), "\n";
}
__DATA__
4 G I M N U X
Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount
0000-13-I21 600
0001-8V-034BLA 2,172 2,172
0001-8V-191GYG 13,125 4,375
0001-9W-GH5B2A -2,967.09 2,967.09 25.00
Output
2
3
3
4
Your code works for me. The problem may be that the input file contains some "hidden" whitespace fields (e.g. whitespace other than tabs). For instance:
A<tab><space><CR> gives two fields, A and <space><CR>
A<tab>B<tab><CR> gives three: A, B, and <CR> (remember, the end of line is part of the input!)
I suggest you chomp every line you read; other than that, you will have to clean the array of whitespace-only fields. E.g.
scalar(grep /\S/, @fields)
should do it.
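Applied to the loop from the question, those two suggestions (chomp plus filtering on /\S/) would look roughly like this sketch, which is not the poster's final code:

while (<PIVOTFILE>) {
    next if $. <= 2;               # skip the two header lines
    chomp;                         # drop the trailing newline / carriage return
    my @fields = split /\t/;       # still split on tabs only
    # count only fields that contain at least one non-whitespace character
    print scalar(grep { /\S/ } @fields), "\n";
}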
A lot of great help on this question, and quickly too!
After a long, drawn-out learning process, this is what I came up with; it worked quite well and produced the intended results.
#!/usr/bin/perl
# splits.pl
use strict;
use warnings;
use diagnostics;

my $pivotfile = "myPath/Internal_Splits_Pivot.txt";
open PIVOTFILE, $pivotfile or die $!;

while (<PIVOTFILE>) {               # loop through each line in file
    next if ($. == 1);              # skip first line (contains business segment code)
    next if ($. == 2);              # skip second line (contains transaction amount text)
    chomp $_;                       # clean line of trailing \n and white space
    my @fields = split(/\t/,$_);    # split fields for line into an array
    print scalar(grep $_, @fields), "\n";
}