Hash incorrectly tracking counts, runtime long

Hash incorrectly tracking counts, runtime long - perl

I am working on a program in Perl and my output is wrong and taking forever to process. The code is meant to take in a large DNA sequence file, read through it in 15 letter increments (kmers), stepping forward 1 position at a time. I'm supposed to enter the kmer sequences into a hash, with their value being the number of incidences of that kmer- meaning each key should be unique and when a duplicate is found, it should increase the count for that particular kmer. I know from my Prof. expected output file, that I have too many lines, so it is allowing duplicates and not counting correctly. It's also running 5+ minutes, so I have to Ctrl+C to escape. When I go look at kmers.txt, the file is at least written and formatted correctly.
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
# countKmers.pl
# Open file /scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta
# Identify all k-mers of length 15, load them into a hash
# and count the number of occurences of each k-mer. Each
# unique k-mer and its' count will be written to file
# kmers.txt
#Create an empty hash
my %kMersHash = ();
#Open a filehandle for the output file kmers.txt
unless ( open ( KMERS, ">", "kmers.txt" ) ) {
die $!;
}
#Call subroutine to load Fly Chromosome 2L
my $sequenceRef = loadSequence("/scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta");
my $kMer = 15; #Set the size of the sliding window
my $stepSize = 1; #Set the step size
for (
#The sliding window's start position is 0
my $windowStart = 0;
#Prevent going past end of the file
$windowStart <= ( length($$sequenceRef) - $kMer );
#Advance the window by the step size
$windowStart += $stepSize
)
{
#Get the substring from $windowStart for length $kMer
my $kMerSeq = substr( $$sequenceRef, $windowStart, $kMer );
#Call the subroutine to iterate through the kMers
processKMers($kMerSeq);
}
sub processKMers {
my ($kMerSeq) = #_;
#Initialize $kCount with at least 1 occurrence
my $kCount = 1;
#If the key already exists, the count is
#increased and changed in the hash
if ( not exists $kMersHash{$kMerSeq} ) {
#The hash key=>value is loaded: kMer=>count
$kMersHash{$kMerSeq} = $kCount;
}
else {
#Increment the count
$kCount ++;
#The hash is updated
$kMersHash{$kMerSeq} = $kCount;
}
#Print out the hash to filehandle KMERS
for (keys %kMersHash) {
print KMERS $_, "\t", $kMersHash{$_}, "\n";
}
}
sub loadSequence {
#Get my sequence file name from the parameter array
my ($sequenceFile) = #_;
#Initialize my sequence to the empty string
my $sequence = "";
#Open the sequence file
unless ( open( FASTA, "<", $sequenceFile ) ) {
die $!;
}
#Loop through the file line-by-line
while (<FASTA>) {
#Assign the line, which is in the default
#variable to a named variable for readability.
my $line = $_;
#Chomp to get rid of end-of-line characters
chomp($line);
#Check to see if this is a FASTA header line
if ( $line !~ /^>/ ) {
#If it's not a header line append it
#to my sequence
$sequence .= $line;
}
}
#Return a reference to the sequence
return \$sequence;
}

Here's how I would write your application. The processKMers subroutine boils down to just incrementing a hash element, so I've removed that. I've also altered the identifiers to be match the snake_case that is more usual in Perl code, and I didn't see any point in load_sequence returning a reference to the sequence so I've changed it to return the string itself
use strict;
use warnings 'all';
use constant FASTA_FILE => '/scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta';
use constant KMER_SIZE => 15;
use constant STEP_SIZE => 1;
my $sequence = load_sequence( FASTA_FILE );
my %kmers;
for (my $offset = 0;
$offset + KMER_SIZE <= length $sequence;
$offset += STEP_SIZE ) {
my $kmer_seq = substr $sequence, $start, KMER_SIZE;
++$kmers{$kmer_seq};
}
open my $out_fh, '>', 'kmers.txt' or die $!;
for ( keys %kmers ) {
printf $out_fh "%s\t%d\n", $_, $kmers{$_};
}
sub load_sequence {
my ( $sequence_file ) = #_;
my $sequence = "";
open my $fh, '<', $sequence_file or die $!;
while ( <$fh> ) {
next if /^>/;
chomp;
$sequence .= $_;
}
return $sequence;
}
Here's a neater way to increment a hash element without using ++ on the hash directly
my $n;
if ( exists $kMersHash{$kMerSeq} ) {
$n = $kMersHash{$kMerSeq};
}
else {
$n = 0;
}
++$n;
$kMersHash{$kMerSeq} = $n;

Everything looks fine in your code besides processKMers. The main issues:
$kCount is not persistent between calls to processKMers, so in your else statement, $kCount will always be 2
You are printing every time you call processKMers, which is what is slowing you down. Printing frequently slows down your process significantly, you should wait until the end of your program and print once.
Keeping your code mostly the same:
sub processKMers {
my ($kMerSeq) = #_;
if ( not exists $kMersHash{$kMerSeq} ) {
$kMersHash{$kMerSeq} = 1;
}
else {
$kMersHash{$kMerSeq}++;
}
}
Then you want to move your print logic to immediately after your for-loop.

Related

How to identify nth lines of n files in while<>

I have a code which adds all vectors in all files.
There can be any number of input files. For example first input file is:
0.55 0 0.3335 1.2
0.212 0 2.2025 1
and the second one is:
0.25 0 0.3333 1.0
0.1235 0 0.2454 1
What I get is the sum of all vectors, thus in result i get one vector
which is:
1.13550 0 3.1147 4.2
But I'm trying to sum the first vector of the first file with the first vector of the second file and so on. In result according to this example I should get 2 vectors.
For now I have this:
use strict;
use warnings;
if ($ARGV[0] ne "vector1.dat"){
die ("vector1.dat is necessary as first argument");
}
my #sum = 0;
my $dim = 0;
while (<>) {
#Ignore blank lines, hashtags
#and lines starting with $
if ($_ =~ /#/ || $_ =~ /^$/ || $_ =~ /^\s$/){
next;
}
my #vectors = split(" ", $_);
my $vector_length = #vectors;
if ($dim eq 0) {
$dim = $vector_length;
}
else {
if ($dim ne $vector_length) {
die ("Vector dimensions do not match. : $!");
}
}
for (my $i = 0; $i <= $#vectors; $i++) {
$sum[$i] += $vectors[$i];
}
}
$" = "\t\t";
print "\n --- \n #sum \n";
What I need is just to find out how to identify each file's nth line
and to sum the column values of those lines while keeping in mind, that there can be n number of files.
I saw filehandling question over here with similar issue, however
I didn't find my answer there.
Just looking for some suggestions and guidance. Got stuck on this.

Open each file yourself and use the $. variable to know which line you are on (or count the files yourself). Here's the basic structure:
foreach my $file ( #files ) {
open my $fh, '<', $file or die ...;
while( <$fh> ) {
chomp;
$sum[ $. ] = ...; # $. is the line number
}
}
If you don't like $., you can use its longer name. You have to turn on English (which comes with Perl):
use English;
## use English qw( -no_match_vars ); # for v5.16 and earlier
foreach my $file ( #files ) {
open my $fh, '<', $file or die ...;
while( <$fh> ) {
chomp;
$sum[ $INPUT_LINE_NUMBER ] = ...;
}
}
Or, you can count yourself, which might be handy if the vectors in the files don't line up by strict line number (perhaps because of comments or some other formatting oddity):
foreach my $file ( #files ) {
open my $fh, '<', $file or die ...;
my $line = -1;
while( <$fh> ) {
$line++;
chomp;
$sum[ $line ] = ...;
}
}
The harder way is the answer bart gives which inspects eof at the end of every line to see if the magical ARGV handle is looking at a new file, and resetting $. if it is. It's an interesting trick but hardly anyone is going to understand what it's doing (or even notice it).
For the other part of the problem, I think you're doing the vector sum wrong, or using confusing variable names. A line is a vector, and the numbers in the lines are a component. A two dimensional array will work. The first index is the line number and the second in the component index:
while( <$fh> ) {
chomp;
... skip unwanted lines
my #components = split;
... various dimension checks
foreach my $i ( 0 .. $#components ) {
$sum[ $. ][ $i ] += $components[ $i ];
}
}
The Data::Dumper module is handy for complex data structures. You can also see the perldsc (Perl Data Structures Cookbook) documentation. The $. variable is found in perlvar.

$. is the line number of the most recently read file handle. close(ARGV) if eof; can be used to reset the file number between files (as documented in eof). (Note: eof() is different than eof.) So you now have line numbers.
The second problem you have is that you are adding vector components ($vectors[$i]) to a vectors ($sum[$i]). You need to add vector components to vectors components. Start by using more appropriate variable names.
This is what we get:
my #sum_vectors;
while (<>) {
s/#.*//; # Remove comments.
next if /^\s*$/; # Ignore blank lines.
my #vector = split;
if ($sum_vectors[$.] && #{ $sum_vectors[$.] } != #vector) {
die("$ARGV:$.: Vector dimensions do not match\n");
}
for my $i (0..$#vector) {
$sum_vectors[$.][$i] += $vector[$i];
}
} continue {
close(ARGV) if eof; # Reset line numbers for each file.
}
Two other errors fixed:
$! did not contain anything meaningful when you used it.
You ignored lines that contain comments, even if they contained valid data too.

Try this:
#!/usr/bin/perl
use strict;
use warnings;
if ($ARGV[0] ne "vector1.dat"){
die ("vector1.dat is necessary as first argument");
}
my %sum;
my $dim = 0;
my $vector_length;
my $line_number;
while (<>) {
#Ignore blank lines, hashtags
#and lines starting with $
if ($_ =~ /#/ || $_ =~ /^$/ || $_ =~ /^\s$/){
next;
}
my #vectors = split(" ", $_);
$vector_length = #vectors;
if ($dim eq 0) {
$dim = $vector_length;
}
else {
if ($dim ne $vector_length) {
die ("Vector dimensions do not match. : $!");
}
}
for (my $i = 0; $i <= $#vectors; $i++) {
$sum{$.}{$i} += $vectors[$i];
}
$line_number = $.;
$. = 0 if eof;
}
$" = "\t\t";
for (my $line=1; $line<=$line_number; $line++)
{
print $line;
for (my $vector=0; $vector<$vector_length; $vector++)
{
print " " . $sum{$line}{$vector};
}
print "\n";
}

compare two txt files

I'm new to Perl. I have two text files and I need to check matching string on both lists.
For example matching strings are:
line - file 1: fe/bla/blablabla/abcdefg
line - file 2: blablabla/abcdefg
There is a match!
In addition, the location (line number) of the matching strings is not the same on both files.
I tried put the lists in arrays and compare the arrays with nested loop, but the running time of the program is huge (the lists contain thousand of lines) and I believe there is another way, less naïve and more productive.
This is the way I put the data in the array:
my $list1 = /path/to/the/file;
open (FILE , '<' , $list1) or die ("Could not open the file");
while ( my $line = <FILE> ) {
chomp ($line);
$list_1[$i] = $line;
$i = $i+1;
}
close FILE;
I did it to the other list as well.
And this is my nested loop.
for ( $k = 0 ; $k < #list_1 ; $k = $k+1 ) {
for ($i = 0 ; $i < #list_2 ; $i = $i+1 ) {
if (index($list_1[$k] , $list_2[$i]) != -1) {
splice (#list_2 , $i , 1);
last;
}
}
}

As long as file2 isn't enormous, the simplest way is to build a regular expression pattern from its contents and check each line in file1 against the pattern.
You don't say what output you want, so I have printed all lines in file1 that have a match in file2.
use strict;
use warnings;
use 5.010;
use autodie;
my ($list1, $list2) = qw( /path/to/list1 /path/to/list2 );
open my $fh, '<', $list2;
my $re = join '|', map { chomp; quotemeta; } <$fh>;
$re = qr/$re/;
open $fh, '<', $list2;
while ( <$fh> ) {
print if /$re/;
}

Perl Program error

I wrote a PERL program which takes an excel sheet (coverted to a text file by changing the extension from .xls to .txt) and a sequence file for its input. The excel sheet contains the start point and the end point of an area in the sequence file (along with 70 flanking values on either side of the match area) that needs to cut and extracted into a third output file. There are like 300 values. The program reads in the start point and the end point of the sequence that needs to be cut each time but it repeatedly tells me that the value is outside the length on the input file when it clearly isn't. I just cant seem to get this fixed
This is the program
use strict;
use warnings;
my $blast;
my $i;
my $idline;
my $sequence;
print "Enter Your BLAST result file name:\t";
chomp( $blast = <STDIN> ); # BLAST result file name
print "\n";
my $database;
print "Enter Your Gene list file name:\t";
chomp( $database = <STDIN> ); # sequence file
print "\n";
open IN, "$blast" or die "Can not open file $blast: $!";
my #ids = ();
my #seq_start = ();
my #seq_end = ();
while (<IN>) {
#spliting the result file based on each tab
my #feilds = split( "\t", $_ );
push( #ids, $feilds[0] ); #copying the name of sequence
#coping the 6th tab value of the result which is the start point of from where a value should be cut.
push( #seq_start, $feilds[6] );
#coping the 7th tab value of the result file which is the end point of a value should be cut.
push( #seq_end, $feilds[7] );
}
close IN;
open OUT, ">Result.fasta" or die "Can not open file $database: $!";
for ( $i = 0; $i <= $#ids; $i++ ) {
($sequence) = &block( $ids[$i] );
( $idline, $sequence ) = split( "\n", $sequence );
#extracting the sequence from the start point to the end point
my $seqlen = $seq_end[$i] - $seq_start[$i] - 1;
my $Nucleotides = substr( $sequence, $seq_start[$i], $seqlen ); #storing the extracted substring into $sequence
$Nucleotides =~ s/(.{1,60})/$1\n/gs;
print OUT "$idline\n";
print OUT "$Nucleotides\n";
}
print "\nExtraction Completed...";
sub block {
#block for id storage which is the first tab in the Blast output file.
my $id1 = shift;
print "$id1\n";
my $start = ();
open IN3, "$database" or die "Can not open file $database: $!";
my $blockseq = "";
while (<IN3>) {
if ( ( $_ =~ /^>/ ) && ($start) ) {
last;
}
if ( ( $_ !~ /^>/ ) && ($start) ) {
chomp;
$blockseq .= $_;
}
if (/^>$id1/) {
my $start = $. - 1;
my $blockseq .= $_;
}
}
close IN3;
return ($blockseq);
}
BLAST RESULT FILE: http://www.fileswap.com/dl/Ws7ehftejp/
SEQUENCE FILE: http://www.fileswap.com/dl/lPwuGh2oKM/
Error
substr outside of string at Nucleotide_Extractor.pl line 39.
Use of uninitialized value $Nucleotides in substitution (s///) at
Nucleotide_Extractor.pl line 41.
Use of uninitialized value $Nucleotides in concatenation (.) or string
at Nucleotide_Extractor.pl line 44.
Any help is very much appreciated and queries are always invited

There were several problems with the existing code, and I ended up rewriting the script while fixing the errors. Your implementation isn't very efficient as it opens, reads, and closes the sequence file for every ID in your Excel sheet. A better approach would be either to read and store the data from the sequence file, or, if memory is limited, go through each entry in the sequence file and pick out the corresponding data from the Excel file. You would also be better off using hashes, instead of arrays; hashes store data in key -- value pairs, so it is MUCH easier to find what you're looking for. I have also used references throughout, as they make it easy to pass data into and out of subroutines.
If you are not familiar with perl data structures, check out perlfaq4 and perldsc, and perlreftut has information on using references.
The main problem with your existing code was that the subroutine to get the sequence from the fasta file was not returning anything. It is a good idea to put plenty of debugging statements in your code to ensure that it is doing what you think it is doing. I've left in my debugging statements but commented them out. I've also copiously commented the code that I changed.
#!/usr/bin/perl
use strict;
use warnings;
# enables 'say', which prints out your text and adds a carriage return
use feature ':5.10';
# a very useful module for dumping out data structures
use Data::Dumper;
#my $blast = 'infesmall.txt';
print "Enter Your BLAST result file name:\t";
chomp($blast = <STDIN>); # BLAST result file name
print "\n";
#my $database = 'infe.fasta';
print "Enter Your Gene list file name:\t";
chomp($database = <STDIN>); # sequence file
print "\n";
open IN,"$blast" or die "Can not open file $blast: $!";
# instead of using three arrays, let's use a hash reference!
# for each ID, we want to store the start and the end point. To do that,
# we'll use a hash of hashes. The start and end information will be in one
# hash reference:
# { start => $fields[6], end => $fields[7] }
# and we will use that hashref as the value in another hash, where the key is
# the ID, $fields[0]. This means we can access the start or end data using
# code like this:
# $info->{$id}{start}
# $info->{$id}{end}
my $info;
while(<IN>){
#splitting the result file based on each tab
my #fields = split("\t",$_);
# add the data to our $info hashref with the ID as the key:
$info->{ $fields[0] } = { start => $fields[6], end => $fields[7] };
}
close IN;
#say "info: " . Dumper($info);
# now read the sequence info from the fasta file
my $sequence = read_sequences($database);
#say "data from read_sequences:\n" . Dumper($sequence);
my $out = 'result.fasta';
open(OUT, ">" . $out) or die "Can not open file $out: $!";
foreach my $id (keys %$info) {
# check whether the sequence exists
if ($sequence->{$id}) {
#extracting the sequence from the start point to the end point
my $seqlen = $info->{$id}{end} - $info->{$id}{start} - 1;
#say "seqlen: $seqlen; stored seq length: " . length($sequence->{$id}{seq}) . "; start: " . $info->{$id}{start} . "; end: " . $info->{$id}{end};
#storing the extracted substring into $sequence
my $nucleotides = substr($sequence->{$id}{seq}, $info->{$id}{start}, $seqlen);
$nucleotides =~ s/(.{1,60})/$1\n/gs;
#say "nucleotides: $nucleotides";
print OUT $sequence->{$id}{header} . "\n";
print OUT "$nucleotides\n";
}
}
print "\nExtraction Completed...";
sub read_sequences {
# fasta file
my $fasta_file = shift;
open IN3, "$fasta_file" or die "Can not open file $fasta_file: $!";
# initialise two variables. We will store our sequence data in $fasta
# and use $id to track the current sequence ID
# the $fasta hash will look like this:
# $fasta = {
# 'gi|7212472|ref|NC_002387.2' => {
# header => '>gi|7212472|ref|NC_002387.2| Phytophthora...',
# seq => 'ATAAAATAATATGAATAAATTAAAACCAAGAAATAAAATATGTT...',
# }
#}
my ($fasta, $id);
while(<IN3>){
chomp;
if (/^>/) {
if (/^>(\S+) /){
# the header line with the sequence info.
$id = $1;
# save the data to the $fasta hash, keyed by seq ID
# we're going to build up an entry as we go along
# set the header to the current line
$fasta->{ $id }{ header } = $_;
}
else {
# no ID found! Erk. Emit an error and undef $id.
warn "Formatting error: $_";
undef $id;
}
}
## ensure we're getting sequence lines...
elsif (/^[ATGC]/) {
# if $id is not defined, there's something weird going on, so
# don't save the sequence. In a correctly-formatted file, this
# should not be an issue.
if ($id) {
# if $id is set, add the line to the sequence.
$fasta->{ $id }{ seq } .= $_;
}
}
}
close IN3;
return $fasta;
}

How do I average column values from a tab-separated data file, ignoring a header row and the left column?

My task is to compute averages from the following data file, titled Lab1_table.txt:
retrovirus genome gag pol env
HIV-1 9181 1503 3006 2571
FIV 9474 1353 2993 2571
KoRV 8431 1566 3384 1980
GaLV 8088 1563 3498 2058
PERV 8072 1560 3621 1532
I have to write a script that will open and read this file, read each line by splitting the contents into an array and computer the average of the numerical values (genome, gag, pol, env), and write to a new file the average from each of the aforementioned columns.
I've been trying my best to figure out how to not take into account the first row, or the first column, but every time I try to execute on the command line I keep coming up with 'explicit package name' errors.
Global symbol #average requires explicit package name at line 23.
Global symbol #average requires explicit package name at line 29.
Execution aborted due to compilation errors.
I understand that this involves # and $, but even knowing that I've not been able to change the errors.
This is my code, but I emphasise that I'm a beginner having started this just last week:
#!/usr/bin/perl -w
use strict;
my $infile = "Lab1_table.txt"; # This is the file path
open INFILE, $infile or die "Can't open $infile: $!";
my $count = 0;
my $average = ();
while (<INFILE>) {
chomp;
my #columns = split /\t/;
$count++;
if ( $count == 1 ) {
$average = #columns;
}
else {
for( my $i = 1; $i < scalar $average; $i++ ) {
$average[$i] += $columns[$i];
}
}
}
for( my $i = 1; $i < scalar $average; $i++ ) {
print $average[$i]/$count, "\n";
}
I'd appreciate any insight, and I would also great appreciate letting me know by list numbering what you're doing at each step - if appropriate. I'd like to learn and it would make more sense to me if I was able to read through what someone's process was.

Here are the points you need to change
Use another variable for the headers
my $count = 0;
my #header = ();
my #average = ();
then change the logic inside if statement
if ( $count == 1 ) {
#header = #columns;
}
Now don't use the #average for the limit, use $i < scalar #columns for else statement.
Initially #average is zero, you will never get inside the for loop ever.
else {
for( my $i = 1; $i < scalar #columns; $i++ ) {
$average[$i] += $columns[$i];
}
}
Finally add -1 to your counter. Remember you increment your counter when you parse your header
for( my $i = 1; $i < scalar #average; $i++ ) {
print $average[$i]/($count-1), "\n";
}
Here is the final code
You can take advantage of #header to display the result neatly
#!/usr/bin/perl -w
use strict;
my $infile = "Lab1_table.txt"; # This is the file path
open INFILE, $infile or die "Can't open $infile: $!";
my $count = 0;
my #header = ();
my #average = ();
while (<INFILE>) {
chomp;
my #columns = split /\t/;
$count++;
if ( $count == 1 ) {
#header = #columns;
}
else {
for( my $i = 1; $i < scalar #columns; $i++ ) {
$average[$i] += $columns[$i];
}
}
}
for( my $i = 1; $i < scalar #average; $i++ ) {
print $average[$i]/($count-1), "\n";
}
There are other ways to write this code but I thought it would be better to just correct your code so that you can easily understand what is wrong with your code. Hope it helps

Perl merging 2 csv files line by line with a primary key

Edit: solution added.
Hi, I currently have some working albeit slow code.
It merges 2 CSV files line by line using a primary key.
For example, if file 1 has the line:
"one,two,,four,42"
and file 2 has this line;
"one,,three,,42"
where in 0 indexed $position = 4 has the primary key = 42;
then the sub: merge_file($file1,$file2,$outputfile,$position);
will output a file with the line:
"one,two,three,four,42";
Every primary key is unique in each file, and a key might exist in one file but not in the other (and vice versa)
There are about 1 million lines in each file.
Going through every line in the first file, I am using a hash to store the primary key, and storing the line number as the value. The line number corresponds to an array[line num] which stores every line in the first file.
Then I go through every line in the second file, and check if the primary key is in the hash, and if it is, get the line from the file1array and then add the columns I need from the first array to the second array, and then concat. to the end. Then delete the hash, and then at the very end, dump the entire thing to file. (I am using a SSD so I want to minimise file writes.)
It is probably best explained with a code:
sub merge_file2{
my ($file1,$file2,$out,$position) = ($_[0],$_[1],$_[2],$_[3]);
print "merging: \n$file1 and \n$file2, to: \n$out\n";
my $OUTSTRING = undef;
my %line_for;
my #file1array;
open FILE1, "<$file1";
print "$file1 opened\n";
while (<FILE1>){
chomp;
$line_for{read_csv_string($_,$position)}=$.; #reads csv line at current position (of key)
$file1array[$.] = $_; #store line in file1array.
}
close FILE1;
print "$file2 opened - merging..\n";
open FILE2, "<", $file2;
my #from1to2 = qw( 2 4 8 17 18 19); #which columns from file 1 to be added into cols. of file 2.
while (<FILE2>){
print "$.\n" if ($.%1000) == 0;
chomp;
my #array1 = ();
my #array2 = ();
my #array2 = split /,/, $_; #split 2nd csv line by commas
my #array1 = split /,/, $file1array[$line_for{$array2[$position]}];
# ^ ^ ^
# prev line lookup line in 1st file,lookup hash, pos of key
#my #output = &merge_string(\#array1,\#array2); #merge 2 csv strings (old fn.)
foreach(#from1to2){
$array2[$_] = $array1[$_];
}
my $outstring = join ",", #array2;
$OUTSTRING.=$outstring."\n";
delete $line_for{$array2[$position]};
}
close FILE2;
print "adding rest of lines\n";
foreach my $key (sort { $a <=> $b } keys %line_for){
$OUTSTRING.= $file1array[$line_for{$key}]."\n";
}
print "writing file $out\n\n\n";
write_line($out,$OUTSTRING);
}
The first while is fine, takes less than 1 minute, however the second while loop takes about 1 hour to run, and I am wondering if I have taken the right approach. I think it is possible for a lot of speedup? :) Thanks in advance.
Solution:
sub merge_file3{
my ($file1,$file2,$out,$position,$hsize) = ($_[0],$_[1],$_[2],$_[3],$_[4]);
print "merging: \n$file1 and \n$file2, to: \n$out\n";
my $OUTSTRING = undef;
my $header;
my (#file1,#file2);
open FILE1, "<$file1" or die;
while (<FILE1>){
if ($.==1){
$header = $_;
next;
}
print "$.\n" if ($.%100000) == 0;
chomp;
push #file1, [split ',', $_];
}
close FILE1;
open FILE2, "<$file2" or die;
while (<FILE2>){
next if $.==1;
print "$.\n" if ($.%100000) == 0;
chomp;
push #file2, [split ',', $_];
}
close FILE2;
print "sorting files\n";
my #sortedf1 = sort {$a->[$position] <=> $b->[$position]} #file1;
my #sortedf2 = sort {$a->[$position] <=> $b->[$position]} #file2;
print "sorted\n";
#file1 = undef;
#file2 = undef;
#foreach my $line (#file1){print "\t [ #$line ],\n"; }
my ($i,$j) = (0,0);
while ($i < $#sortedf1 and $j < $#sortedf2){
my $key1 = $sortedf1[$i][$position];
my $key2 = $sortedf2[$j][$position];
if ($key1 eq $key2){
foreach(0..$hsize){ #header size.
$sortedf2[$j][$_] = $sortedf1[$i][$_] if $sortedf1[$i][$_] ne undef;
}
$i++;
$j++;
}
elsif ( $key1 < $key2){
push(#sortedf2,[#{$sortedf1[$i]}]);
$i++;
}
elsif ( $key1 > $key2){
$j++;
}
}
#foreach my $line (#sortedf2){print "\t [ #$line ],\n"; }
print "outputting to file\n";
open OUT, ">$out";
print OUT $header;
foreach(#sortedf2){
print OUT (join ",", #{$_})."\n";
}
close OUT;
}
Thanks everyone, the solution is posted above. It now takes about 1 minute to merge the whole thing! :)

Two techniques come to mind.
Read the data from the CSV files into two tables in a DBMS (SQLite would work just fine), and then use the DB to do a join and write the data back out to CSV. The database will use indexes to optimize the join.
First, sort each file by primary key (using perl or unix sort), then do a linear scan over each file in parallel (read a record from each file; if the keys are equal then output a joined row and advance both files; if the keys are unequal then advance the file with the lesser key and try again). This step is O(n + m) time instead of O(n * m), and O(1) memory.

What's killing the performance is this code, which is concatenating millions of times.
$OUTSTRING.=$outstring."\n";
....
foreach my $key (sort { $a <=> $b } keys %line_for){
$OUTSTRING.= $file1array[$line_for{$key}]."\n";
}
If you want to write to the output file only once, accumulate your results in an array, and then print them at the very end, using join. Or, even better perhaps, include the newlines in the results and write the array directly.
To see how concatenation does not scale when crunching big data, experiment with this demo script. When you run it in concat mode, things start slowing down considerably after a couple hundred thousand concatenations -- I gave up and killed the script. By contrast, simply printing an array of a million lines took less than a than a minute on my machine.
# Usage: perl demo.pl 50 999999 concat|join|direct
use strict;
use warnings;
my ($line_len, $n_lines, $method) = #ARGV;
my #data = map { '_' x $line_len . "\n" } 1 .. $n_lines;
open my $fh, '>', 'output.txt' or die $!;
if ($method eq 'concat'){ # Dog slow. Gets slower as #data gets big.
my $outstring;
for my $i (0 .. $#data){
print STDERR $i, "\n" if $i % 1000 == 0;
$outstring .= $data[$i];
}
print $fh $outstring;
}
elsif ($method eq 'join'){ # Fast
print $fh join('', #data);
}
else { # Fast
print $fh #data;
}

If you want merge you should really merge. First of all you have to sort your data by key and than merge! You will beat even MySQL in performance. I have a lot of experience with it.
You can write something along those lines:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV_XS;
use autodie;
use constant KEYPOS => 4;
die "Insufficient number of parameters" if #ARGV < 2;
my $csv = Text::CSV_XS->new( { eol => $/ } );
my $sortpos = KEYPOS + 1;
open my $file1, "sort -n -k$sortpos -t, $ARGV[0] |";
open my $file2, "sort -n -k$sortpos -t, $ARGV[1] |";
my $row1 = $csv->getline($file1);
my $row2 = $csv->getline($file2);
while ( $row1 and $row2 ) {
my $row;
if ( $row1->[KEYPOS] == $row2->[KEYPOS] ) { # merge rows
$row = [ map { $row1->[$_] || $row2->[$_] } 0 .. $#$row1 ];
$row1 = $csv->getline($file1);
$row2 = $csv->getline($file2);
}
elsif ( $row1->[KEYPOS] < $row2->[KEYPOS] ) {
$row = $row1;
$row1 = $csv->getline($file1);
}
else {
$row = $row2;
$row2 = $csv->getline($file2);
}
$csv->print( *STDOUT, $row );
}
# flush possible tail
while ( $row1 ) {
$csv->print( *STDOUT, $row1 );
$row1 = $csv->getline($file1);
}
while ( $row2 ) {
$csv->print( *STDOUT, $row2 );
$row2 = $csv->getline($file1);
}
close $file1;
close $file2;
Redirect output to file and measure.
If you like more sanity around sort arguments you can replace file opening part with
(open my $file1, '-|') || exec('sort', '-n', "-k$sortpos", '-t,', $ARGV[0]);
(open my $file2, '-|') || exec('sort', '-n', "-k$sortpos", '-t,', $ARGV[1]);

I can't see anything that strikes me as obviously slow, but I would make these changes:
First, I'd eliminate the #file1array variable. You don't need it; just store the line itself in the hash:
while (<FILE1>){
chomp;
$line_for{read_csv_string($_,$position)}=$_;
}
Secondly, although this shouldn't really make much of a difference with perl, I wouldn't add to $OUTSTRING all the time. Instead, keep an array of output lines and push onto it each time. If for some reason you still need to call write_line with a massive string you can always use join('', #OUTLINES) at the end.
If write_line doesn't use syswrite or something low-level like that, but rather uses print or other stdio-based calls, then you aren't saving any disk writes by building up the output file in memory. Therefore, you might as well not build your output up in memory at all, and instead just write it out as you create it. Of course if you are using syswrite, forget this.
Since nothing is obviously slow, try throwing Devel::SmallProf at your code. I've found that to be the best perl profiler for producing those "Oh! That's the slow line!" insights.

Assuming around 20 bytes lines each of your file would amount to about 20 MB, which isn't too big.
Since you are using hash your time complexity doesn't seem to be a problem.
In your second loop, you are printing to the console for each line, this bit is slow. Try removing that should help a lot.
You can also avoid the delete in the second loop.
Reading multiple lines at a time should also help. But not too much I think, there is always going to be a read ahead behind the scenes.

I'd store each record in a hash whose keys are the primary keys. A given primary key's value is a reference to an array of CSV values, where undef represents an unknown value.
use 5.10.0; # for // ("defined-or")
use Carp;
use Text::CSV;
sub merge_csv {
my($path,$record) = #_;
open my $fh, "<", $path or croak "$0: open $path: $!";
my $csv = Text::CSV->new;
local $_;
while (<$fh>) {
if ($csv->parse($_)) {
my #f = map length($_) ? $_ : undef, $csv->fields;
next unless #f >= 1;
my $primary = pop #f;
if ($record->{$primary}) {
$record->{$primary}[$_] //= $f[$_]
for 0 .. $#{ $record->{$primary} };
}
else {
$record->{$primary} = \#f;
}
}
else {
warn "$0: $path:$.: parse failed; skipping...\n";
next;
}
}
}
Your main program will resemble
my %rec;
merge_csv $_, \%rec for qw/ file1 file2 /;
The Data::Dumper module shows that the resulting hash given the simple inputs from your question is
$VAR1 = {
'42' => [
'one',
'two',
'three',
'four'
]
};

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Hash incorrectly tracking counts, runtime long - perl

Related

How to identify nth lines of n files in while<>

compare two txt files

Perl Program error

How do I average column values from a tab-separated data file, ignoring a header row and the left column?

Perl merging 2 csv files line by line with a primary key

Categories

Resources