Perl Program error - perl

I wrote a PERL program which takes an excel sheet (coverted to a text file by changing the extension from .xls to .txt) and a sequence file for its input. The excel sheet contains the start point and the end point of an area in the sequence file (along with 70 flanking values on either side of the match area) that needs to cut and extracted into a third output file. There are like 300 values. The program reads in the start point and the end point of the sequence that needs to be cut each time but it repeatedly tells me that the value is outside the length on the input file when it clearly isn't. I just cant seem to get this fixed
This is the program
use strict;
use warnings;
my $blast;
my $i;
my $idline;
my $sequence;
print "Enter Your BLAST result file name:\t";
chomp( $blast = <STDIN> ); # BLAST result file name
print "\n";
my $database;
print "Enter Your Gene list file name:\t";
chomp( $database = <STDIN> ); # sequence file
print "\n";
open IN, "$blast" or die "Can not open file $blast: $!";
my #ids = ();
my #seq_start = ();
my #seq_end = ();
while (<IN>) {
#spliting the result file based on each tab
my #feilds = split( "\t", $_ );
push( #ids, $feilds[0] ); #copying the name of sequence
#coping the 6th tab value of the result which is the start point of from where a value should be cut.
push( #seq_start, $feilds[6] );
#coping the 7th tab value of the result file which is the end point of a value should be cut.
push( #seq_end, $feilds[7] );
}
close IN;
open OUT, ">Result.fasta" or die "Can not open file $database: $!";
for ( $i = 0; $i <= $#ids; $i++ ) {
($sequence) = &block( $ids[$i] );
( $idline, $sequence ) = split( "\n", $sequence );
#extracting the sequence from the start point to the end point
my $seqlen = $seq_end[$i] - $seq_start[$i] - 1;
my $Nucleotides = substr( $sequence, $seq_start[$i], $seqlen ); #storing the extracted substring into $sequence
$Nucleotides =~ s/(.{1,60})/$1\n/gs;
print OUT "$idline\n";
print OUT "$Nucleotides\n";
}
print "\nExtraction Completed...";
sub block {
#block for id storage which is the first tab in the Blast output file.
my $id1 = shift;
print "$id1\n";
my $start = ();
open IN3, "$database" or die "Can not open file $database: $!";
my $blockseq = "";
while (<IN3>) {
if ( ( $_ =~ /^>/ ) && ($start) ) {
last;
}
if ( ( $_ !~ /^>/ ) && ($start) ) {
chomp;
$blockseq .= $_;
}
if (/^>$id1/) {
my $start = $. - 1;
my $blockseq .= $_;
}
}
close IN3;
return ($blockseq);
}
BLAST RESULT FILE: http://www.fileswap.com/dl/Ws7ehftejp/
SEQUENCE FILE: http://www.fileswap.com/dl/lPwuGh2oKM/
Error
substr outside of string at Nucleotide_Extractor.pl line 39.
Use of uninitialized value $Nucleotides in substitution (s///) at
Nucleotide_Extractor.pl line 41.
Use of uninitialized value $Nucleotides in concatenation (.) or string
at Nucleotide_Extractor.pl line 44.
Any help is very much appreciated and queries are always invited

There were several problems with the existing code, and I ended up rewriting the script while fixing the errors. Your implementation isn't very efficient as it opens, reads, and closes the sequence file for every ID in your Excel sheet. A better approach would be either to read and store the data from the sequence file, or, if memory is limited, go through each entry in the sequence file and pick out the corresponding data from the Excel file. You would also be better off using hashes, instead of arrays; hashes store data in key -- value pairs, so it is MUCH easier to find what you're looking for. I have also used references throughout, as they make it easy to pass data into and out of subroutines.
If you are not familiar with perl data structures, check out perlfaq4 and perldsc, and perlreftut has information on using references.
The main problem with your existing code was that the subroutine to get the sequence from the fasta file was not returning anything. It is a good idea to put plenty of debugging statements in your code to ensure that it is doing what you think it is doing. I've left in my debugging statements but commented them out. I've also copiously commented the code that I changed.
#!/usr/bin/perl
use strict;
use warnings;
# enables 'say', which prints out your text and adds a carriage return
use feature ':5.10';
# a very useful module for dumping out data structures
use Data::Dumper;
#my $blast = 'infesmall.txt';
print "Enter Your BLAST result file name:\t";
chomp($blast = <STDIN>); # BLAST result file name
print "\n";
#my $database = 'infe.fasta';
print "Enter Your Gene list file name:\t";
chomp($database = <STDIN>); # sequence file
print "\n";
open IN,"$blast" or die "Can not open file $blast: $!";
# instead of using three arrays, let's use a hash reference!
# for each ID, we want to store the start and the end point. To do that,
# we'll use a hash of hashes. The start and end information will be in one
# hash reference:
# { start => $fields[6], end => $fields[7] }
# and we will use that hashref as the value in another hash, where the key is
# the ID, $fields[0]. This means we can access the start or end data using
# code like this:
# $info->{$id}{start}
# $info->{$id}{end}
my $info;
while(<IN>){
#splitting the result file based on each tab
my #fields = split("\t",$_);
# add the data to our $info hashref with the ID as the key:
$info->{ $fields[0] } = { start => $fields[6], end => $fields[7] };
}
close IN;
#say "info: " . Dumper($info);
# now read the sequence info from the fasta file
my $sequence = read_sequences($database);
#say "data from read_sequences:\n" . Dumper($sequence);
my $out = 'result.fasta';
open(OUT, ">" . $out) or die "Can not open file $out: $!";
foreach my $id (keys %$info) {
# check whether the sequence exists
if ($sequence->{$id}) {
#extracting the sequence from the start point to the end point
my $seqlen = $info->{$id}{end} - $info->{$id}{start} - 1;
#say "seqlen: $seqlen; stored seq length: " . length($sequence->{$id}{seq}) . "; start: " . $info->{$id}{start} . "; end: " . $info->{$id}{end};
#storing the extracted substring into $sequence
my $nucleotides = substr($sequence->{$id}{seq}, $info->{$id}{start}, $seqlen);
$nucleotides =~ s/(.{1,60})/$1\n/gs;
#say "nucleotides: $nucleotides";
print OUT $sequence->{$id}{header} . "\n";
print OUT "$nucleotides\n";
}
}
print "\nExtraction Completed...";
sub read_sequences {
# fasta file
my $fasta_file = shift;
open IN3, "$fasta_file" or die "Can not open file $fasta_file: $!";
# initialise two variables. We will store our sequence data in $fasta
# and use $id to track the current sequence ID
# the $fasta hash will look like this:
# $fasta = {
# 'gi|7212472|ref|NC_002387.2' => {
# header => '>gi|7212472|ref|NC_002387.2| Phytophthora...',
# seq => 'ATAAAATAATATGAATAAATTAAAACCAAGAAATAAAATATGTT...',
# }
#}
my ($fasta, $id);
while(<IN3>){
chomp;
if (/^>/) {
if (/^>(\S+) /){
# the header line with the sequence info.
$id = $1;
# save the data to the $fasta hash, keyed by seq ID
# we're going to build up an entry as we go along
# set the header to the current line
$fasta->{ $id }{ header } = $_;
}
else {
# no ID found! Erk. Emit an error and undef $id.
warn "Formatting error: $_";
undef $id;
}
}
## ensure we're getting sequence lines...
elsif (/^[ATGC]/) {
# if $id is not defined, there's something weird going on, so
# don't save the sequence. In a correctly-formatted file, this
# should not be an issue.
if ($id) {
# if $id is set, add the line to the sequence.
$fasta->{ $id }{ seq } .= $_;
}
}
}
close IN3;
return $fasta;
}

Related

Search and add a text file for words in an XML file

I have two files. The first has a sequence of words on each line
bus do car
car tree
The second file is an XML file
<title>i have a car. i take bus..</title>
I want to search the text file for each word in the XML file. If it is found, I want to insert all lines from the text file where it appears, with any spaces replaced by x.
The result file would be
<title>i have a car busxdoxcar carxtree. i take bus busxdoxcar..</title>
I try this
use strict;
use warnings;
use autodie;
my $QueryFile = "query.txt";
my $SequenceFile = "Seq_2_terms_150.txt";
my %hashlist;
open NewQueryFile ,">./NewQuery.txt"
or die "Cannot create NewQuery.txt";
open(my $fh,$SequenceFile)
or die "$SequenceFile : $!";
while ( <$fh> ) {
chop;
s/^\s+|\s+$//g;
my $h = \%hashlist;
foreach ( split('\s+', $_) ) {
$h->{$_} //= {};
$h = $h->{$_};
}
$h->{'#'} = 1;
}
close $fh;
open(my $fd, $QueryFile)
or die "$QueryFile : $!";
for my $xml (<$fd>) {
foreach my $line (split(/\n/, $xml)) {
my #words = split(/\s/, $line);
if $words = #hashlist[$_] {
print NewQueryFile join ('x',$words) ;
}
}
}
close NewQueryFile ;
close($fd);
I have put together a quick script to indicate how one might go about this.
I have not bothered with xml, because that may well have left me in a bad mood.
My advice would be: do use variables, whatever you save from not doing so is lost as your code gets confusing and then buggy.
#!/usr/bin/env perl
use strict;
use warnings;
# Notes:
# - more than one space or tab in a row are mangled: They become one space only
# - the query file is not checked for containing actual words to match on,
# it is assumed to be suitable
# - I have made no attempt to parse xml. You should use a parser for that.
# Search Stack Overflow or Google or CPAN or all of those for examples.
# - The replace_xml_text function can be used on the text supplied by the
# parser to get the desired output
# - a feeble attempt is made to deal with punctuation in replace_xml_text
# - This code is not really tested
my %query_words;
my $query_fn = 'query.txt';
open (my $fh, "<",$query_fn) or die "could not open file '$query_fn'";
# build list of words from query file
while ( <$fh> ){
chomp;
# Words mentioned in line.
my #words = split(/\s+/,$_);
# Words joined by 'x'. Seems a strange choice *shrug*.
# This is used to replace words with later.
my $line = join("x",#words);
# Storing in arrayref, this seems easier to me
# than concatening repeatedly and getting the spaces right.
for my $word ( #words ){
push #{$query_words{$word}}, $line;
}
}
# Expects the text to replace.
# Returns the text to replace it with.
sub replace_xml_text {
my $original_text = shift;
my #words;
for my $word ( split(/\s+/,$original_text) ){
my $punctuation = '';
# Remove punctuation before matching,
# but do preserve it.
if ( $word =~s /(\s*[,.]\s*)$// ){
$punctuation = $1;
}
if ( my $additions = $query_words{$word} ){
$word = join(" ",$word,#$additions);
}
# Put punctuation back.
$word .= $punctuation;
# Done replacing in this word, next
push #words,$word;
}
return join(" ",#words);
}

Write a Perl script that takes in a fasta and reverses all the sequences (without BioPerl)?

I dont know if this is just a quirk with Stawberry Perl, but I can't seem to get it to run. I just need to take a fasta and reverse every sequence in it.
-The problem-
I have a multifasta file:
>seq1
ABCDEFG
>seq2
HIJKLMN
and the expected output is:
>REVseq1
GFEDCBA
>REVseq2
NMLKJIH
The script is here:
$NUM_COL = 80; ## set the column width of output file
$infile = shift; ## grab input sequence file name from command line
$outfile = "test1.txt"; ## name output file, prepend with “REV”
open (my $IN, $infile);
open (my $OUT, '>', $outfile);
$/ = undef; ## allow entire input sequence file to be read into memory
my $text = <$IN>; ## read input sequence file into memory
print $text; ## output sequence file into new decoy sequence file
my #proteins = split (/>/, $text); ## put all input sequences into an array
for my $protein (#proteins) { ## evaluate each input sequence individually
$protein =~ s/(^.*)\n//m; ## match and remove the first descriptive line of
## the FATA-formatted protein
my $name = $1; ## remember the name of the input sequence
print $OUT ">REV$name\n"; ## prepend with #REV#; a # will help make the
## protein stand out in a list
$protein =~ s/\n//gm; ## remove newline characters from sequence
$protein = reverse($protein); ## reverse the sequence
while (length ($protein) > $NUM_C0L) { ## loop to print sequence with set number of cols
$protein =~ s/(.{$NUM_C0L})//;
my $line = $1;
print $OUT "$line\n";
}
print $OUT "$protein\n"; ## print last portion of reversed protein
}
close ($IN);
close ($OUT);
print "done\n";
This will do as you ask
It builds a hash %fasta out of the FASTA file, keeping array #keys to keep the sequences in order, and then prints out each element of the hash
Each line of the sequence is reversed using reverse before it is added to the hash, and using unshift adds the lines of the sequence in reverse order
The program expects the input file as a parameter on the command line, and prints the result to STDOUT, which may be redirected on the command line
use strict;
use warnings 'all';
my (%fasta, #keys);
{
my $key;
while ( <> ) {
chomp;
if ( s/^>\K/REV/ ) {
$key = $_;
push #keys, $key;
}
elsif ( $key ) {
unshift #{ $fasta{$key} }, scalar reverse;
}
}
}
for my $key ( #keys ) {
print $key, "\n";
print "$_\n" for #{ $fasta{$key} };
}
output
>REVseq1
GFEDCBA
>REVseq2
NMLKJIH
Update
If you prefer to rewrap the sequence so that short lines are at the end, then you just need to rewrite the code that dumps the hash
This alternative uses the length of the longest line in the original file as the limit, and rerwraps the reversed sequence to the same length. It's claer that it would be simple to specify an explicit length instead of calculating it
You will need to add use List::Util 'max' at the top of the program
my $len = max map length, map #$_, values %fasta;
for my $key ( #keys ) {
print $key, "\n";
my $seq = join '', #{ $fasta{$key} };
print "$_\n" for $seq =~ /.{1,$len}/g;
}
Given the original data the output is identical to that of the solution above. I used this as input
>seq1
ABCDEFGHI
JKLMNOPQRST
UVWXYZ
>seq2
HIJKLMN
OPQRSTU
VWXY
with this result. All lines have been wrapped to eleven characters - the length of the longest JKLMNOPQRST line in the original data
>REVseq1
ZYXWVUTSRQP
ONMLKJIHGFE
DCBA
>REVseq2
YXWVUTSRQPO
NMLKJIH
I don't know if this is just for a class that uses toy datasets or actual research FASTAs that can be gigabytes in size. If the latter, it would make sense not to keep the whole data set in memory as both your program and Borodin's do but read it one sequence at a time, print that out reversed and forget about it. The following code does that and also deals with FASTA files that may have asterisks as sequence-end markers as long as they start with >, not ;.
#!/usr/bin/perl
use strict;
use warnings;
my $COL_WIDTH = 80;
my $sequence = '';
my $seq_label;
sub print_reverse {
my $seq_label = shift;
my $sequence = reverse shift;
return unless $sequence;
print "$seq_label\n";
for(my $i=0; $i<length($sequence); $i += $COL_WIDTH) {
print substr($sequence, $i, $COL_WIDTH), "\n";
}
}
while(my $line = <>) {
chomp $line;
if($line =~ s/^>/>REV/) {
print_reverse($seq_label, $sequence);
$seq_label = $line;
$sequence = '';
next;
}
$line = substr($line, 0, -1) if substr($line, -1) eq '*';
$sequence .= $line;
}
print_reverse($seq_label, $sequence);

Open a file.txt and find the possible start and end positions of its genes

Hi I have a file and I would like to open it and find the start and end positions of its genes,also I have some extra imformations.The beginning of each gene is mapped by the following pattern. There is an 8 letter consensus known as the Shine-Dalgarno sequence (TAAGGAGG) followed by 4-10 bases downstream before the initiation codon (ATG). However there are variants of the Shine-Dalgarno sequence with the most common of which being [TA][AC]AGGA[GA][GA].The end of the gene is specified by the stop codon TAA, TAG and TGA. It must be taken care the stop codon is found after the correct Open.Reading Frame (ORF).
Now I have make a txt file with genome and I open it with this code,and the error begin when I go to read the genome and put start and end.Any help?Thanks a lot.:
#!/usr/bin/perl -w
use strict;
use warnings;
# Searching for motifs
# Ask the user for the filename of the file containing
my $proteinfilename = "yersinia_genome.fasta";
print "\nYou open the filename of the protein sequence data: yersinia_genome.fasta \n";
# Remove the newline from the protein filename
chomp $proteinfilename;
# open the file, or exit
unless (open(PROTEINFILE, $proteinfilename) )
{
print "Cannot open file \"$proteinfilename\"\n\n";
exit;
}
# Read the protein sequence data from the file, and store it
# into the array variable #protein
my #protein = <PROTEINFILE>;
# Close the file - we've read all the data into #protein now.
close PROTEINFILE;
# Put the protein sequence data into a single string, as it's easier
# to search for a motif in a string than in an array of
# lines (what if the motif occurs over a line break?)
my $protein = join( '', #protein);
# Remove whitespace.
$protein =~ s/\s//g;
# In a loop, ask the user for a motif, search for the motif,
# and report if it was found.
my $motif='TAAGGAGG';
do
{
print "\n Your motif is:$motif\n";
# Remove the newline at the end of $motif
chomp $motif;
# Look for the motif
if ( $protein =~ /$motif/ )
{
print "I found it!This is the motif: $motif in line $.. \n\n";
}
else
{
print "I couldn't find it.\n\n";
}
}
until ($motif =~ /TAAGGAGG/g);
my $reverse=reverse $motif;
print "Here is the reverse Motif: $reverse. \n\n";
#HERE STARTS THE PROBLEMS,I DONT KNOW WHERE I MAKE THE MISTAKES
#$genome=$motif;
#$genome = $_[0];
my $ORF = 0;
while (my $genome = $proteinfilename) {
chomp $genome;
print "processing $genome\n";
my $mrna = split(/\s+/, $genome);
while ($mrna =~ /ATG/g) {
# $start and $stop are 0-based indexes
my $start = pos($mrna) - 3; # back up to include the start sequence
# discard remnant if no stop sequence can be found
last unless $mrna=~ /TAA|TAG|TGA/g;
#m/^ATG(?:[ATGC]{3}){8,}?(?:TAA|TAG|TGA)/gm;
my $stop = pos($mrna);
my $genlength = $stop - $start;
my $genome = substr($mrna, $start, $genlength);
print "\t" . join(' ', $start+1, $stop, $genome, $genlength) . "\n";
# $ORF ++;
#print "$ORF\n";
}
}
exit;
Thanks,I have make it the solution is :
local $_=$protein;
while(/ATG/g){
my $start = pos()-3;
if(/T(?:TAA|TAG|TGA)/g){
my $stop = pos;
print $start, " " , $stop, " " ,$stop - $start, " " ,
substr ($_,$start,$stop - $start),$/;
}
}
while (my $genome = $proteinfilename) {
This creates an endless loop: you are copying the file name (not the $protein data) over and over.
The purpose of the while loop is unclear; it will never terminate.
Perhaps you simply mean
my ($genome) = $protein;
Here is a simplistic attempt at fixing the obvious problems in your code.
#!/usr/bin/perl -w
use strict;
use warnings;
my $proteinfilename = "yersinia_genome.fasta";
chomp $proteinfilename;
unless (open(PROTEINFILE, $proteinfilename) )
{
# die, don't print & exit
die "Cannot open file \"$proteinfilename\"\n";
}
# Avoid creating a potentially large temporary array
# Read directly into $protein instead
my $protein = join ('', <PROTEINFILE>);
close PROTEINFILE;
$protein =~ s/\s//g;
# As this is a static variable, no point in looping
my $motif='TAAGGAGG';
chomp $motif;
if ( $protein =~ /$motif/ )
{
print "I found it! This is the motif: $motif in line $.. \n\n";
}
else
{
print "I couldn't find it.\n\n";
}
my $reverse=reverse $motif;
print "Here is the reverse Motif: $reverse. \n\n";
# $ORF isn't used; removed
# Again, no point in writing a loop
# Also, $genome is a copy of the data, not the filename
my $genome = $protein;
# It was already chomped, so no need to do that again
my $mrna = split(/\s+/, $genome);
while ($mrna =~ /ATG/g) {
my $start = pos($mrna) - 3; # back up to include the start sequence
last unless $mrna=~ /TAA|TAG|TGA/g;
my $stop = pos($mrna);
my $genlength = $stop - $start;
my $genome = substr($mrna, $start, $genlength);
print "\t" . join(' ', $start+1, $stop, $genome, $genlength) . "\n";
}
exit;

Reading the next line in the file and keeping counts separate

Another question for everyone. To reiterate I am very new to the Perl process and I apologize in advance for making silly mistakes
I am trying to calculate the GC content of different lengths of DNA sequence. The file is in this format:
>gene 1
DNA sequence of specific gene
>gene 2
DNA sequence of specific gene
...etc...
This is a small piece of the file
>env
ATGCTTCTCATCTCAAACCCGCGCCACCTGGGGCACCCGATGAGTCCTGGGAA
I have established the counter and to read each line of DNA sequence but at the moment it is do a running summation of the total across all lines. I want it to read each sequence, print the content after the sequence read then move onto the next one. Having individual base counts for each line.
This is what I have so far.
#!/usr/bin/perl
#necessary code to open and read a new file and create a new one.
use strict;
my $infile = "Lab1_seq.fasta";
open INFILE, $infile or die "$infile: $!";
my $outfile = "Lab1_seq_output.txt";
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!";
#establishing the intial counts for each base
my $G = 0;
my $C = 0;
my $A = 0;
my $T = 0;
#initial loop created to read through each line
while ( my $line = <INFILE> ) {
chomp $line;
# reads file until the ">" character is encounterd and prints the line
if ($line =~ /^>/){
print OUTFILE "Gene: $line\n";
}
# otherwise count the content of the next line.
# my percent counts seem to be incorrect due to my Total length counts skewing the following line. I am currently unsure how to fix that
elsif ($line =~ /^[A-Z]/){
my #array = split //, $line;
my $array= (#array);
# reset the counts of each variable
$G = ();
$C = ();
$A = ();
$T = ();
foreach $array (#array){
#if statements asses which base is present and makes a running total of the bases.
if ($array eq 'G'){
++$G;
}
elsif ( $array eq 'C' ) {
++$C; }
elsif ( $array eq 'A' ) {
++$A; }
elsif ( $array eq 'T' ) {
++$T; }
}
# all is printed to the outfile
print OUTFILE "G:$G\n";
print OUTFILE "C:$C\n";
print OUTFILE "A:$A\n";
print OUTFILE "T:$T\n";
print OUTFILE "Total length:_", ($A+=$C+=$G+=$T), "_base pairs\n";
print OUTFILE "GC content is(percent):_", (($G+=$C)/($A+=$C+=$G+=$T)*100),"_%\n";
}
}
#close the outfile and the infile
close OUTFILE;
close INFILE;
Again I feel like I am on the right path, I am just missing some basic foundations. Any help would be greatly appreciated.
The final problem is in the final counts printed out. My percent values are wrong and give me the wrong value. I feel like the total is being calculated then that new value is incorporated into the total.
Several things:
1. use hash instead of declaring each element.
2. assignment such as $G = (0); is indeed working, but it is not the right way to assign scalar. What you did is declaring an array, which in scalar context $G = is returning the first array item. The correct way is $G = 0.
my %seen;
$seen{/^([A-Z])/}++ for (grep {/^\>/} <INFILE>);
foreach $gene (keys %seen) {
print "$gene: $seen{$gene}\n";
}
Just reset the counters when a new gene is found. Also, I'd use hashes for the counting:
use strict; use warnings;
my %counts;
while (<>) {
if (/^>/) {
# print counts for the prev gene if there are counts:
print_counts(\%counts) if keys %counts;
%counts = (); # reset the counts
print $_; # print the Fasta header
} else {
chomp;
$counts{$_}++ for split //;
}
}
print_counts(\%counts) if keys %counts; # print counts for last gene
sub print_counts {
my ($counts) = #_;
print "$_:=", ($counts->{$_} || 0), "\n" for qw/A C G T/;
}
Usage: $ perl count-bases.pl input.fasta.
Example output:
> gene 1
A:=3
C:=1
G:=5
T:=5
> gene 2
A:=1
C:=5
G:=0
T:=13
Style comments:
When opening a file, always use lexical filehandles (normal variables). Also, you should do a three-arg open. I'd also recommend the autodie pragma for automatic error handling (since perl v5.10.1).
use autodie;
open my $in, "<", $infile;
open my $out, ">", $outfile;
Note that I don't open files in my above script because I use the special ARGV filehandle for input, and print to STDOUT. The output can be redirected on the shell, like
$ perl count-bases.pl input.fasta >counts.txt
Declaring scalar variables with their values in parens like my $G = (0) is weird, but works fine. I think this is more confusing than helpful. → my $G = 0.
Your intendation is a bit weird. It is very unusual and visually confusing to put closing braces on the same line with another statement like
...
elsif ( $array eq 'C' ) {
++$C; }
I prefer cuddling elsif:
...
} elsif ($base eq 'C') {
$C++;
}
This statement my $array= (#array); puts the length of the array into $array. What for? Tip: You can declare variables right inside foreach-loops, like for my $base (#array) { ... }.

Perl merging 2 csv files line by line with a primary key

Edit: solution added.
Hi, I currently have some working albeit slow code.
It merges 2 CSV files line by line using a primary key.
For example, if file 1 has the line:
"one,two,,four,42"
and file 2 has this line;
"one,,three,,42"
where in 0 indexed $position = 4 has the primary key = 42;
then the sub: merge_file($file1,$file2,$outputfile,$position);
will output a file with the line:
"one,two,three,four,42";
Every primary key is unique in each file, and a key might exist in one file but not in the other (and vice versa)
There are about 1 million lines in each file.
Going through every line in the first file, I am using a hash to store the primary key, and storing the line number as the value. The line number corresponds to an array[line num] which stores every line in the first file.
Then I go through every line in the second file, and check if the primary key is in the hash, and if it is, get the line from the file1array and then add the columns I need from the first array to the second array, and then concat. to the end. Then delete the hash, and then at the very end, dump the entire thing to file. (I am using a SSD so I want to minimise file writes.)
It is probably best explained with a code:
sub merge_file2{
my ($file1,$file2,$out,$position) = ($_[0],$_[1],$_[2],$_[3]);
print "merging: \n$file1 and \n$file2, to: \n$out\n";
my $OUTSTRING = undef;
my %line_for;
my #file1array;
open FILE1, "<$file1";
print "$file1 opened\n";
while (<FILE1>){
chomp;
$line_for{read_csv_string($_,$position)}=$.; #reads csv line at current position (of key)
$file1array[$.] = $_; #store line in file1array.
}
close FILE1;
print "$file2 opened - merging..\n";
open FILE2, "<", $file2;
my #from1to2 = qw( 2 4 8 17 18 19); #which columns from file 1 to be added into cols. of file 2.
while (<FILE2>){
print "$.\n" if ($.%1000) == 0;
chomp;
my #array1 = ();
my #array2 = ();
my #array2 = split /,/, $_; #split 2nd csv line by commas
my #array1 = split /,/, $file1array[$line_for{$array2[$position]}];
# ^ ^ ^
# prev line lookup line in 1st file,lookup hash, pos of key
#my #output = &merge_string(\#array1,\#array2); #merge 2 csv strings (old fn.)
foreach(#from1to2){
$array2[$_] = $array1[$_];
}
my $outstring = join ",", #array2;
$OUTSTRING.=$outstring."\n";
delete $line_for{$array2[$position]};
}
close FILE2;
print "adding rest of lines\n";
foreach my $key (sort { $a <=> $b } keys %line_for){
$OUTSTRING.= $file1array[$line_for{$key}]."\n";
}
print "writing file $out\n\n\n";
write_line($out,$OUTSTRING);
}
The first while is fine, takes less than 1 minute, however the second while loop takes about 1 hour to run, and I am wondering if I have taken the right approach. I think it is possible for a lot of speedup? :) Thanks in advance.
Solution:
sub merge_file3{
my ($file1,$file2,$out,$position,$hsize) = ($_[0],$_[1],$_[2],$_[3],$_[4]);
print "merging: \n$file1 and \n$file2, to: \n$out\n";
my $OUTSTRING = undef;
my $header;
my (#file1,#file2);
open FILE1, "<$file1" or die;
while (<FILE1>){
if ($.==1){
$header = $_;
next;
}
print "$.\n" if ($.%100000) == 0;
chomp;
push #file1, [split ',', $_];
}
close FILE1;
open FILE2, "<$file2" or die;
while (<FILE2>){
next if $.==1;
print "$.\n" if ($.%100000) == 0;
chomp;
push #file2, [split ',', $_];
}
close FILE2;
print "sorting files\n";
my #sortedf1 = sort {$a->[$position] <=> $b->[$position]} #file1;
my #sortedf2 = sort {$a->[$position] <=> $b->[$position]} #file2;
print "sorted\n";
#file1 = undef;
#file2 = undef;
#foreach my $line (#file1){print "\t [ #$line ],\n"; }
my ($i,$j) = (0,0);
while ($i < $#sortedf1 and $j < $#sortedf2){
my $key1 = $sortedf1[$i][$position];
my $key2 = $sortedf2[$j][$position];
if ($key1 eq $key2){
foreach(0..$hsize){ #header size.
$sortedf2[$j][$_] = $sortedf1[$i][$_] if $sortedf1[$i][$_] ne undef;
}
$i++;
$j++;
}
elsif ( $key1 < $key2){
push(#sortedf2,[#{$sortedf1[$i]}]);
$i++;
}
elsif ( $key1 > $key2){
$j++;
}
}
#foreach my $line (#sortedf2){print "\t [ #$line ],\n"; }
print "outputting to file\n";
open OUT, ">$out";
print OUT $header;
foreach(#sortedf2){
print OUT (join ",", #{$_})."\n";
}
close OUT;
}
Thanks everyone, the solution is posted above. It now takes about 1 minute to merge the whole thing! :)
Two techniques come to mind.
Read the data from the CSV files into two tables in a DBMS (SQLite would work just fine), and then use the DB to do a join and write the data back out to CSV. The database will use indexes to optimize the join.
First, sort each file by primary key (using perl or unix sort), then do a linear scan over each file in parallel (read a record from each file; if the keys are equal then output a joined row and advance both files; if the keys are unequal then advance the file with the lesser key and try again). This step is O(n + m) time instead of O(n * m), and O(1) memory.
What's killing the performance is this code, which is concatenating millions of times.
$OUTSTRING.=$outstring."\n";
....
foreach my $key (sort { $a <=> $b } keys %line_for){
$OUTSTRING.= $file1array[$line_for{$key}]."\n";
}
If you want to write to the output file only once, accumulate your results in an array, and then print them at the very end, using join. Or, even better perhaps, include the newlines in the results and write the array directly.
To see how concatenation does not scale when crunching big data, experiment with this demo script. When you run it in concat mode, things start slowing down considerably after a couple hundred thousand concatenations -- I gave up and killed the script. By contrast, simply printing an array of a million lines took less than a than a minute on my machine.
# Usage: perl demo.pl 50 999999 concat|join|direct
use strict;
use warnings;
my ($line_len, $n_lines, $method) = #ARGV;
my #data = map { '_' x $line_len . "\n" } 1 .. $n_lines;
open my $fh, '>', 'output.txt' or die $!;
if ($method eq 'concat'){ # Dog slow. Gets slower as #data gets big.
my $outstring;
for my $i (0 .. $#data){
print STDERR $i, "\n" if $i % 1000 == 0;
$outstring .= $data[$i];
}
print $fh $outstring;
}
elsif ($method eq 'join'){ # Fast
print $fh join('', #data);
}
else { # Fast
print $fh #data;
}
If you want merge you should really merge. First of all you have to sort your data by key and than merge! You will beat even MySQL in performance. I have a lot of experience with it.
You can write something along those lines:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV_XS;
use autodie;
use constant KEYPOS => 4;
die "Insufficient number of parameters" if #ARGV < 2;
my $csv = Text::CSV_XS->new( { eol => $/ } );
my $sortpos = KEYPOS + 1;
open my $file1, "sort -n -k$sortpos -t, $ARGV[0] |";
open my $file2, "sort -n -k$sortpos -t, $ARGV[1] |";
my $row1 = $csv->getline($file1);
my $row2 = $csv->getline($file2);
while ( $row1 and $row2 ) {
my $row;
if ( $row1->[KEYPOS] == $row2->[KEYPOS] ) { # merge rows
$row = [ map { $row1->[$_] || $row2->[$_] } 0 .. $#$row1 ];
$row1 = $csv->getline($file1);
$row2 = $csv->getline($file2);
}
elsif ( $row1->[KEYPOS] < $row2->[KEYPOS] ) {
$row = $row1;
$row1 = $csv->getline($file1);
}
else {
$row = $row2;
$row2 = $csv->getline($file2);
}
$csv->print( *STDOUT, $row );
}
# flush possible tail
while ( $row1 ) {
$csv->print( *STDOUT, $row1 );
$row1 = $csv->getline($file1);
}
while ( $row2 ) {
$csv->print( *STDOUT, $row2 );
$row2 = $csv->getline($file1);
}
close $file1;
close $file2;
Redirect output to file and measure.
If you like more sanity around sort arguments you can replace file opening part with
(open my $file1, '-|') || exec('sort', '-n', "-k$sortpos", '-t,', $ARGV[0]);
(open my $file2, '-|') || exec('sort', '-n', "-k$sortpos", '-t,', $ARGV[1]);
I can't see anything that strikes me as obviously slow, but I would make these changes:
First, I'd eliminate the #file1array variable. You don't need it; just store the line itself in the hash:
while (<FILE1>){
chomp;
$line_for{read_csv_string($_,$position)}=$_;
}
Secondly, although this shouldn't really make much of a difference with perl, I wouldn't add to $OUTSTRING all the time. Instead, keep an array of output lines and push onto it each time. If for some reason you still need to call write_line with a massive string you can always use join('', #OUTLINES) at the end.
If write_line doesn't use syswrite or something low-level like that, but rather uses print or other stdio-based calls, then you aren't saving any disk writes by building up the output file in memory. Therefore, you might as well not build your output up in memory at all, and instead just write it out as you create it. Of course if you are using syswrite, forget this.
Since nothing is obviously slow, try throwing Devel::SmallProf at your code. I've found that to be the best perl profiler for producing those "Oh! That's the slow line!" insights.
Assuming around 20 bytes lines each of your file would amount to about 20 MB, which isn't too big.
Since you are using hash your time complexity doesn't seem to be a problem.
In your second loop, you are printing to the console for each line, this bit is slow. Try removing that should help a lot.
You can also avoid the delete in the second loop.
Reading multiple lines at a time should also help. But not too much I think, there is always going to be a read ahead behind the scenes.
I'd store each record in a hash whose keys are the primary keys. A given primary key's value is a reference to an array of CSV values, where undef represents an unknown value.
use 5.10.0; # for // ("defined-or")
use Carp;
use Text::CSV;
sub merge_csv {
my($path,$record) = #_;
open my $fh, "<", $path or croak "$0: open $path: $!";
my $csv = Text::CSV->new;
local $_;
while (<$fh>) {
if ($csv->parse($_)) {
my #f = map length($_) ? $_ : undef, $csv->fields;
next unless #f >= 1;
my $primary = pop #f;
if ($record->{$primary}) {
$record->{$primary}[$_] //= $f[$_]
for 0 .. $#{ $record->{$primary} };
}
else {
$record->{$primary} = \#f;
}
}
else {
warn "$0: $path:$.: parse failed; skipping...\n";
next;
}
}
}
Your main program will resemble
my %rec;
merge_csv $_, \%rec for qw/ file1 file2 /;
The Data::Dumper module shows that the resulting hash given the simple inputs from your question is
$VAR1 = {
'42' => [
'one',
'two',
'three',
'four'
]
};