Problems with user arguments in Perl

I'm currently trying to take user arguments (usually 2) that are text files, get the number of characters, lines, and words from each text file, and display them back. My code currently adds them all together instead of listing them separately for each file. How do I list the file name based on user arguments, and the number of lines, characters, and words for each file without adding them together? Thank you for taking time to read this.
#!usr/bin/perl -w
use strict;
my $user_files = @ARGV;
chomp($user_files);
my @parts;
my $word_count = 0;
my $total_words = 0;
my $line_count = 0;
foreach my $line (<>)
{
    @parts = split (/\s+/,$line);
    $line_count += ($line =~ tr/\n//);
    $word_count += length($line) + 1;
    $total_words += scalar(@parts);
}
for(my $i = 0; $i < 1; $i++)
{
    print "File name:", @ARGV,
        "\t\t Word Count: ", $word_count,
        "\t\t Total words: ", $total_words,
        "\t\t Total lines: ", $line_count,
        "\n";
}

There are two basic things you need to change to enable this to work.
Use $ARGV - when reading across multiple files using <>, it contains the name of the current file
Store the data in a hash (that is keyed on $ARGV)
In this sample, I've retained all of your calculations (but I think you'll need to reconsider some of those) and made a few other changes to clean up your code a bit.
#!/usr/bin/perl
use strict;
use warnings; # better than '-w'

my %files; # Store all the data here

# 'while' is better than 'foreach' here, as it reads the file one line at a time.
# Each line goes into $_
while (<>) {
    # By default, split splits $_ on whitespace
    my @parts = split;
    # By default, tr/// works on $_
    $files{$ARGV}{line_count} += tr/\n//;
    # I think this calculation is wrong.
    # length() has no relation to word count. And why add 1 to it?
    $files{$ARGV}{word_count} += length($_) + 1;
    # Addition imposes scalar context, no need for the scalar keyword
    $files{$ARGV}{total_words} += @parts;
}
# Print all the information in the hash
foreach (keys %files) {
    print "File name: $_",
        "\t\t Word Count: $files{$_}{word_count}",
        "\t\t Total words: $files{$_}{total_words}",
        "\t\t Total lines: $files{$_}{line_count}",
        "\n";
}

This line:
foreach my $line (<>)
is reading all the input as one continuous stream (or STDIN if no arguments are given), so the counts are never reset per file. You need to do something like:
for my $file (@user_files) {
    open my $fin, '<', $file or die $!;
    while ( my $line = <$fin> ) {
        # count stuff
    }
    close $fin;
    # print counted stuff
}
Also note that if you want to take multiple filenames as args:
my $user_files = @ARGV;
will only give you the number of arguments (assigning an array to a scalar yields its element count), not the filenames. You probably want:
my @user_files = @ARGV;
Also, the chomp on an arg is unnecessary.
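For instance (a quick illustration of the scalar-versus-list assignment, not part of the original answer):

my @args  = @ARGV;   # copies all of the filenames
my $count = @ARGV;   # scalar context: the number of arguments, e.g. 2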
In your script, you're counting all the files before printing, which is good, but you probably want to store that data in an array or hash. That data structure might look like this:
$file_counts = [
    {
        $file_name1 => {
            characters => $characters,
            words      => $words,
            lines      => $lines,
        }
    },
    {
        $file_name2 => {
            characters => $characters,
            words      => $words,
            lines      => $lines,
        }
    },
];
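For what it's worth, here is a minimal sketch (not the poster's exact code) that combines the per-file loop above with a hash keyed on the file name, a flat variant of the structure shown; the counting expressions are simple placeholders rather than the corrected calculations:

#!/usr/bin/perl
use strict;
use warnings;

my @user_files = @ARGV;
my %counts;    # file name => { lines, words, chars }

for my $file (@user_files) {
    open my $fin, '<', $file or die "Cannot open $file: $!";
    while ( my $line = <$fin> ) {
        my @words = split ' ', $line;
        $counts{$file}{lines}++;
        $counts{$file}{words} += @words;
        $counts{$file}{chars} += length $line;
    }
    close $fin;
}

for my $file (@user_files) {
    printf "File name: %s\tLines: %d\tWords: %d\tCharacters: %d\n",
        $file,
        $counts{$file}{lines} || 0,
        $counts{$file}{words} || 0,
        $counts{$file}{chars} || 0;
}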

Related

Perl script to print out all the lines containing a keyword and the line below it

I need to write a perl script to search for a keyword in a large file and then print all the lines containing the keyword plus the line below each keyword to a new file.
In the original file, there are multiple lines (the exact number varies) below each keyword-containing line. I already have a script that makes the variable number of lines to equal 1. I need this functionality to remain in the script and build upon it.
I found out that I could use grep to extract the lines, but this requires running the script I already have first and then using the grep command. I'd really need to have these functions to be combined into one.
Any help is much appreciated!
Here is the script I have so far:
use strict;
open (FILE, $ARGV[0]) or die ("Cannot open file");
my $name;
my $sequence;
while (my $line = <FILE>) {
    chomp ($line);
    if (substr ($line, 0, 1) eq ">") {
        if ($sequence ne "") {
            printf ("%s\n%s\n", $name, $sequence);
        }
        $name = $line;
        $sequence = "";
    } else {
        $sequence .= $line;
    }
}
if ($sequence ne "") {
    printf ("%s\n%s\n", $name, $sequence);
}
And an example of the original file:
sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R OS=Frog virus 3 (isolate Goorha) GN=FV3-001R PE=4 SV=1
MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLDAKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDDSFRKIYTDLGW
In this example, the keyword would be "FRG3G". The keyword is always in the same place, the characters before it vary, but the structure is the same.
If you have only 1 line to print after the keyword line, you can just remember if you found the keyword and then print the line like this:
my $matched = 0;
while (<FILE>) {
    print if ($matched);
    if (m/$keyword/) {
        print;
        $matched = 1;
    }
    else {
        $matched = 0;
    }
}
If you can detect the end of the lines you want to print somehow, you can adjust the code above instead of just hard-coding it to print 1 line.
Redirect to a new file as needed.
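For example, here is a rough sketch assuming (purely for illustration) that each block of interest ends at the first blank line; $keyword and FILE are set up as in the snippet above:

my $in_block = 0;
while (<FILE>) {
    if (m/$keyword/) {
        $in_block = 1;      # start of a block we want
    }
    elsif (m/^\s*$/) {
        $in_block = 0;      # a blank line ends the block
    }
    print if $in_block;
}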

Using a regular expression with nested for loops, using Perl

I have two arrays:
@file_list holds a list of files in a directory, and
@name_list holds some names.
For example, these arrays could contain
@file_list = ('Bob_car', 'Bob_house', 'Bob_work', 'Fred_car', 'Fred_house', 'Fred_work', ...);
@name_list = ('Bob', 'Fred', ...);
(the real data is not that simple).
My goal is to compare each file with every name and see if they match. They match if the file string starts with the name.
I could then use these matches to sort the files into new directories, based on their corresponding name.
Here is my code:
for ( my $i = 0; $i < scalar @file_list ; $i++ )
{
    for ( my $j = 0; $j < @name_list ; $j++ )
    {
        if ( $file_list[ $i ] =~ m/^$name_list[ $j ]/ )
        {
            print "$file_list[ $i ] goes with $name_list[ $j ]\n";
        }
        else
        {
            print "no match\n";
        }
    }
}
However, I don't get any matches. I've tested the individual loops and they are working. Otherwise, is there something off about the regex?
About how the arrays were made:
For @name_list, the file containing the names is organized in a seemingly random way, just because of how it was used for something else. The names in that file are on several different lines, with lots of blank lines in between and lots of blank entries within lines. Names can appear more than once.
I used the following code to make @name_list:
while (my $line = <$OriginalFILE>)
{
    chomp $line;
    my @current_line = split( "\t", $line );
    for ( my $i = 0; $i < scalar @current_line ; $i ++ )
    {
        if ( $current_line[ $i ] =~ m/^\s*$/ )
        {
            # print "$current_line[$i] is blank\n";
        }
        else
        {
            push( @raw_name_list, $current_line[ $i ] );
        }
    } # end of for
} # while
# collect list without repeat instances of the same name
my %unique = ();
foreach my $name (@raw_name_list)
{
    $unique{$name} ++;
}
my @name_list = keys %unique;
foreach my $name ( @name_list )
{
    # print "$name\n";
    chomp $name;
    unless(mkdir $name, 0700)
    {
        die "Unable to create directory called $name\n";
    }
}
The array @file_list was made using:
opendir(DIR, $ARGV[1]);
my @file_list = grep ! /^\./, readdir DIR;
closedir(DIR);
# print @file_list;
@amon, here is what I did to test the loops and regex:
FILE: for my $file (@transposed_files) {
    print "$file\n";
    for my $name (@transposedunique) {
        print "i see this $name\n";
        if ($file =~ /^\Q$name\E/) {
            print "$file goes with $name\n";
            next FILE;
        }
    }
    #print "no match for $file\n";
}
Oh, and I transposed the arrays so that they would print to an outfile in separate rows.
Short version: You are building your name array wrong. Look at this line:
$unique{name} ++;
You are just incrementing the name entry of the hash. You probably wanted the $name variable.
The Longer Version
On English, and Foreach-Loops
Your code is a bit unperlish and looks more like C than like Perl. Perl is much closer to English than you might think. From the original wording of your question:
take the first element from #file_list and then to compare that to each element in #name_list
You wrote this as
for (my $i = 0; $i < @file_list; $i++) {
    for (my $j = 0; $j < @name_list; $j++) {
        ...; # compare $file_list[$i] with $name_list[$j]
    }
}
I'd rather do
for my $file (@file_list) {
    for my $name (@name_list) {
        ...; # compare $file with $name
    }
}
and save myself from the hassle of array subscripting.
Building Correct Regexes
Your code contains the following test:
$file_list[ $i ] =~ m/^$name_list[ $j ]/
This will not do what you think if $name_list[$j] contains special characters like (, ., +. You can match the literal contents of a variable by enclosing it in \Q ... \E. This would make the code
$file =~ /^\Q$name\E/
(if used with my variant of the loop).
You could also go the nifty route and compare the leading substring directly:
$name eq substr $file, 0, length($name)
This expresses the same condition.
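As a small illustration (hypothetical data, not from the question), a name containing a regex metacharacter matches more than it should without \Q ... \E:

my $name = 'C.J';                                              # '.' matches any character in a regex
print "unexpected match\n" if     'CxJ_car' =~ /^$name/;       # matches
print "no match\n"         unless 'CxJ_car' =~ /^\Q$name\E/;   # '.' is now literal, so no match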
On Loop Control
I will make two assumptions:
You are only interested in the first matching name for any file
You only want to print the no match message if no name was found
Perl allows us to break out of arbitrary loops, or restart the current iteration, or go directly to the next iteration, without using flags, as you would do in other languages. All we have to do is to label our loops like LABEL: for (...).
So once we have a match, we can start our search for the next file. Also, we only want to print no match if we left the inner loop without going to the next file. This code does it:
FILE: for my $file (@file_list) {
    for my $name (@name_list) {
        if ($file =~ /^\Q$name\E/) {
            print "$file goes with $name\n";
            next FILE;
        }
    }
    print "no match for $file\n";
}
The Zen of Negation
In your file parsing code, you express a condition
if ($field =~ /^\s*$/) {
} else {
# do this stuff only if the field does not consist only of
# zero or more whitespace characters
}
That description is far too complex. How about
if ($field =~ /\S/) {
    # do this stuff only if the field contains a non-whitespace character.
}
The same condition, but simpler, and more efficient.
Simplify your Parse
In short, your file parsing code can be condensed to
my %uniq;
while (<$OriginalFILE>) {
    chomp;
    $uniq{$_} = undef for grep /\S/, split /\t/;
}
my @name_list = sort { length($b) <=> length($a) } keys %uniq;
The split function takes a regex as first argument, and will split on $_ if no other string is specified. It returns a list of fields.
The grep function takes a condition and a list, and will return all elements of a list that match the condition. The current element is in $_, which regexes match by default. For explanation of the regex, see above.
Note: This still allows for the fields to contain whitespace, even in leading position. To split on all whitespace, you can give split the special argument of a string containing a single space: split ' '. This would make the grep unnecessary.
The for loop can also be used as a statement modifier, i.e. like EXPR for LIST. The current element is in $_. We assign something to the $_ entry in our %uniq hash (which is already initialized to the empty hash). This could be a number, but undef works as well.
The keys are returned in a seemingly random order. But as multiple names could match a file and we only want to select one match, we have to match the most specific name first. Therefore, I sort the names by their length in descending order.
Your code seems to work for me. All I did was construct two arrays like this:
my @file_list = qw/Bob_car Bob_house Bob_work Fred_car Fred_house Fred_work/;
my @name_list = qw/Fred Bob Mary/;
Then running your code produces output like this:
no match
Bob_car goes with Bob
no match
no match
Bob_house goes with Bob
no match
no match
Bob_work goes with Bob
no match
Fred_car goes with Fred
no match
no match
Fred_house goes with Fred
no match
no match
Fred_work goes with Fred
no match
no match
So it looks like it's working.
A common problem with reading input from files or from a user is forgetting to strip the newline character from the end of the input. This could be your problem. If so, have a read about perldoc -f chomp, and just chomp each value as you add it to your array.
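For example (a hypothetical sketch; $name_fh stands in for whatever handle the names are read from):

while (my $line = <$name_fh>) {
    chomp $line;                               # strip the trailing newline first
    push @name_list, $line if $line =~ /\S/;   # keep only non-blank entries
}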
I'm always interested in doing things in an efficient way, so every time I see an O(N^2) algorithm, alarm bells ring for me. Why should it be O(N*M) and not O(N+M)?
my $re = join('|', map quotemeta, @name_list);
$re = qr/$re/;
for my $file (@file_list) {
    if ($file =~ /^($re)/) {
        my $name = $1;
        ...; # do what you need
    }
}
It looks like something is wrong in the loop.
Follow the comments in the code:
for ( my $i = 0; $i < scalar @file_list ; $i++ )
{
    # use a string variable, initialized to ""
    my $matched_name = "";
    for ( my $j = 0; $j < @name_list ; $j++ )
    {
        if ( $file_list[ $i ] =~ m/^$name_list[ $j ]/ )
        {
            # assign the string variable to the matched name_list[$j]
            $matched_name = $name_list[ $j ];
            last;    # break out of the inner loop
        }
    }
    # check the condition: if the string is not equal to "", a match was found,
    # so print your output using the string value; otherwise report no match
}

Put next lines in an array after finding the matched pattern in Perl

I open a text report inside my Perl script and need to find specific lines and store them in arrays.
This is the report I need to process:
matched pattern 1
line1:10
line2:20
line3:30
next matched pattern 2
line1:5
line2:10
line3:15
next matched pattern 3
lineA:A
lineB:B
lineC:C
.
.
------------------------------------
This part is my script:
@numbers;
@numbers2;
@letters;
while (<FILE>)
{
    if ($_ =~/matched pattern 1/ && $_ ne "\n")
    {
        chomp();
        push (@numbers,$_)
    }
    if ($_ =~/next matched pattern 2/ && $_ ne "\n")
    {
        chomp();
        push (@numbers2,$_)
    }
    if ($_ =~/next matched pattern 3/ && $_ ne "\n")
    {
        chomp();
        push (@letters,$_)
    }
}
Then I can use the numbers and letters inside the arrays.
This is a part of my report file:
Maximum points per Lab
Lab1:10
Lab2:30
Lab3:20
Maximum points per Exam
Exam1:50
Exam2:50
Maximum points on Final
Final:150
What is your program supposed to be doing? Your current program is looking for the lines that have matched pattern and storing THOSE VERY LINES into three different arrays. All other lines are ignored.
You show some sort of example output, but there's no real relationship between your output and input.
First, learn about references, so you don't need five different arrays. In my example, I use an array of arrays to store all of your separate files. If each file represents something else, you could use an array of hashes or a hash of arrays or a hash of hashes of arrays to represent this data in a unified structure. (Don't get me started on how you really should learn object oriented Perl. Get the hang of references first).
Also get a book on modern Perl and learn the new Perl syntax. It looks like your Perl reference is for Perl 4.0. Perl 5.0 has been out since 1994. There's a big difference between Perl 4 and Perl 5 in the way syntax is done.
use strict;
use warnings;

# Prints out your data structure
use Data::Dumper;

my $array_num;
my @array_of_arrays;

use constant {
    PATTERN => qr/matched pattern/,
};

while (my $line = <DATA>) {
    chomp $line;
    next if $line =~ /^\s*$/;    # Skip blank lines
    if ($line =~ PATTERN) {
        if (not defined $array_num) {
            $array_num = 0;
        }
        else {
            $array_num++;
        }
        next;
    }
    push @{ $array_of_arrays[$array_num] }, $line;
}
print Dumper (\@array_of_arrays) . "\n";
__DATA__
matched pattern 1
line1:10
line2:20
line3:30
next matched pattern 2
line1:5
line2:10
line3:15
next matched pattern 3
lineA:A
lineB:B
lineC:C
OUTPUT. Each set of lines is in a different array:
$VAR1 = [
          [
            'line1:10',
            'line2:20',
            'line3:30'
          ],
          [
            'line1:5',
            'line2:10',
            'line3:15'
          ],
          [
            'lineA:A',
            'lineB:B',
            'lineC:C'
          ]
        ];
@numbers;
@letters;
open FILE, "report2.txt" or die $!;
while (<FILE>)
{
    if ($_ =~/:(\d+)/ && $_ ne "\n")
    {
        chomp();
        push (@numbers,$1)
    }elsif ($_ =~/:(\w+)/ && $_ ne "\n")
    {
        chomp();
        push (@letters,$1)
    }
}
print "numbers: ", @numbers, "\n";
print "letters: ", @letters, "\n";
Revised for some best practices and my own style preferences (programmed for extensibility, as I always end up extending code, so I try to program in a generally extensible way):
# Things we search for
my %patterns = (
    foo => qr/^matched pattern 1/,
    bar => qr/^matched pattern 2/,
    baz => qr/^matched pattern 3/,
);

# Where we store matches, initialized to empty array refs
my %matches = map { $_ => [] } keys %patterns;

open(my $fh, '<', $file) or die $!;

my $current_match;
LINE: while (my $line = <$fh>) {
    # We never want empty lines, so skip them early
    next if $line eq "\n";

    # Check current line for matches, to note which bucket we are saving into
    for my $matchable (keys %patterns) {
        # Skip to the next pattern unless it matches
        next unless $line =~ $patterns{$matchable};
        # Set the current match and jump to the next line:
        $current_match = $matchable;
        next LINE;
    }

    # If there's a current match found, save the line
    push( @{ $matches{$current_match} }, $line ) if $current_match;
}

How do I list multiple sentences which contain the same word? The heading is the word that is contained in those sentences

This currently prints all the nouns with sentences they are found in right below.
#!/usr/bin/perl
use strict;
use warnings FATAL => "all";

my $search_key = "expend"; ## CHANGE "..." to <>

open(my $tag_corpus, '<', "ch13tagged.txt") or die $!;
my @sentences = <$tag_corpus>; # This breaks up each line into list
my @words;
my %seens = ();
my %seenw = ();

for (my $i = 0; $i <= @sentences; $i++) {
    if (defined($sentences[$i]) and $sentences[$i] =~ /($search_key)_VB.*/i) {
        @words = split /\s/, $sentences[$i]; ## \s is a whitespace
        for (my $j = 0; $j <= @words; $j++) {
            #FILTER if word is noun, and therefore will end with _NN:
            if (defined($words[$j]) and $words[$j] =~ /_NN/) {
                #PRINT word (without _NN) and sentence (without any _ENDING):
                next if $seenw{$words[$j]}++; ## How to include plural etc
                push @words, $words[$j];
                print "**", split(/_\S+/, $words[$j]), "**", "\n";
                ## next if $seens{ $sentences[$i] }++;
                ## push @sentences, $sentences[$i];
                print split(/_\S+/, $sentences[$i]), "\n"
                ## HOW PRINT bold or specifically word bold?
                #FILTER if word has been output, add sentence under that heading
            }
        } ## put print sentences here to print each sentence after all the nouns inside
    }
}
close $tag_corpus || die "Can't close $tag_corpus: $!";
Your original:
#!/usr/bin/perl
use strict;
use warnings FATAL => "all";
That's a good start...
my $search_key = "expend"; ## CHANGE "..." to <>
Since you're going to use this in a regex in a loop, it's better to compile the
regex right now: my $verb_regex = qr/\bexpend_VB\b/i. I put word boundaries in
there, because it seems like you need them.
open(my $tag_corpus, '<', "ch13tagged.txt") or die $!;
my @sentences = <$tag_corpus>; # This breaks up each line into list
my @words;
my %seens = ();
my %seenw = ();
for (my $i = 0; $i <= @sentences; $i++) {
This does much of the same with less overhead:
while ( <$tag_corpus> ) {
...
Back to yours:
if (defined($sentences[$i]) and $sentences[$i] =~ /($search_key)_VB.*/i) {
If the line contains the record separator--and it will, unless you chomp it--you'll always be
getting a defined line until the end of the file. There's no need to test for defined.
Additionally, you don't need the .* after the search term and capturing the $search_key
here has no effect.
@words = split /\s/, $sentences[$i]; ## \s is a whitespace
You don't want to split on a single whitespace character. You should use /\s+/, but
even better is: @words = split ' ', $sentences[$i];
But you won't even need that.
for (my $j = 0; $j <= @words; $j++) {
#FILTER if word is noun, and therefore will end with _NN:
if (defined($words[$j]) and $words[$j] =~ /_NN/) {
#PRINT word (without _NN) and sentence (without any _ENDING):
But that's all you're if-ing on: words ending in _NN. In addition, the whole
list from a split will be defined-- no need to test.
next if $seenw{$words[$j]}++; ## How to include plural etc
Unless you want to reset %seenw after each sentence, you'll only process each _NN
word once per file.
push @words, $words[$j];
I don't see how this push can serve any possible purpose by appending nouns
back on the list of words. Sure you've got the uniqueness check before it to save
you from the infinite loop if there are any _NN words, but it just means you'll have
all the words in the sentence, followed by all the "nouns". Not only that, but you're simply
going to test that it's a noun and do nothing with it. Not to mention that you
clobber the list with the next sentence.
print "**", split(/_\S+/, $words[$j]), "**", "\n";
## next if $seens{ $sentences[$i] }++;
You don't want to do this in the word loop
## push @sentences, $sentences[$i];
Again, I'm not thinking that you would want to do this if it were uncommented
and outside the word loop. It seems like everything from 2 lines ago would be
after the word loop.
print split(/_\S+/, $sentences[$i]), "\n"
## HOW PRINT bold or specifically word bold?
#FILTER if word has been output, add sentence under that heading
}
} ## put print sentences here to print each sentence after all the nouns inside
}
}
close $tag_corpus || die "Can't close $tag_corpus: $!";
Nope. That won't handle a bad return from close. The || is binding too
tightly. You are closing either $tag_corpus or the output of die. Luckily (or perhaps unluckily)
the die never gets called because if we got this far, $tag_corpus should be a
true value.
This is a kind of cleaned-up version of what you're trying to do--with the
parts that I can make sense of left in.
my @sentences;

# We're processing a single line at a time.
while ( <$tag_corpus> ) {
    # Test if we want to work with the line
    next unless m/$verb_regex/;

    # If we do, then test that we haven't dealt with it before
    # Although I suspect that this may not be needed as much if we're not
    # pushing to a queue that we're reading from.
    next if $seens{ $_ }++;

    # split -> split ' ', $_
    # pass through only those words that match _NN at the end and
    # are unique so far. We test on a substitution, because the result
    # still uniquely identifies a noun
    foreach my $noun ( grep { s/_NN$// && !$seenw{ $_ }++ } split ) {
        print "**$noun**\n";
    }

    # This will omit any adjacent punctuation you have after the word--if
    # that's a problem.
    print split( /_\S+/ ), "\n";

    # Here we save the sentence.
    push @sentences, $_;
}
close $tag_corpus or die "Can't close ch13tagged.txt: $!";

Perl merging 2 csv files line by line with a primary key

Edit: solution added.
Hi, I currently have some working albeit slow code.
It merges 2 CSV files line by line using a primary key.
For example, if file 1 has the line:
"one,two,,four,42"
and file 2 has this line:
"one,,three,,42"
where (0-indexed) $position = 4 holds the primary key = 42;
then the sub: merge_file($file1,$file2,$outputfile,$position);
will output a file with the line:
"one,two,three,four,42";
Every primary key is unique in each file, and a key might exist in one file but not in the other (and vice versa)
There are about 1 million lines in each file.
Going through every line in the first file, I am using a hash to store the primary key, and storing the line number as the value. The line number corresponds to an array[line num] which stores every line in the first file.
Then I go through every line in the second file and check if the primary key is in the hash. If it is, I get the line from the file1 array, add the columns I need from the first array to the second array, and concatenate it to the output. Then I delete the hash entry, and at the very end dump the entire thing to file. (I am using an SSD, so I want to minimise file writes.)
It is probably best explained with code:
sub merge_file2{
    my ($file1,$file2,$out,$position) = ($_[0],$_[1],$_[2],$_[3]);
    print "merging: \n$file1 and \n$file2, to: \n$out\n";
    my $OUTSTRING = undef;

    my %line_for;
    my @file1array;
    open FILE1, "<$file1";
    print "$file1 opened\n";
    while (<FILE1>){
        chomp;
        $line_for{read_csv_string($_,$position)}=$.; #reads csv line at current position (of key)
        $file1array[$.] = $_; #store line in file1array.
    }
    close FILE1;

    print "$file2 opened - merging..\n";
    open FILE2, "<", $file2;
    my @from1to2 = qw( 2 4 8 17 18 19); #which columns from file 1 to be added into cols. of file 2.
    while (<FILE2>){
        print "$.\n" if ($.%1000) == 0;
        chomp;
        my @array1 = ();
        my @array2 = ();
        my @array2 = split /,/, $_; #split 2nd csv line by commas
        my @array1 = split /,/, $file1array[$line_for{$array2[$position]}];
        # prev line: lookup line in 1st file, lookup hash, pos of key

        #my @output = &merge_string(\@array1,\@array2); #merge 2 csv strings (old fn.)
        foreach(@from1to2){
            $array2[$_] = $array1[$_];
        }
        my $outstring = join ",", @array2;
        $OUTSTRING.=$outstring."\n";
        delete $line_for{$array2[$position]};
    }
    close FILE2;

    print "adding rest of lines\n";
    foreach my $key (sort { $a <=> $b } keys %line_for){
        $OUTSTRING.= $file1array[$line_for{$key}]."\n";
    }

    print "writing file $out\n\n\n";
    write_line($out,$OUTSTRING);
}
The first while loop is fine and takes less than a minute; however, the second while loop takes about an hour to run, and I am wondering if I have taken the right approach. I think a lot of speedup is possible? :) Thanks in advance.
Solution:
sub merge_file3{
    my ($file1,$file2,$out,$position,$hsize) = ($_[0],$_[1],$_[2],$_[3],$_[4]);
    print "merging: \n$file1 and \n$file2, to: \n$out\n";
    my $OUTSTRING = undef;
    my $header;

    my (@file1,@file2);
    open FILE1, "<$file1" or die;
    while (<FILE1>){
        if ($.==1){
            $header = $_;
            next;
        }
        print "$.\n" if ($.%100000) == 0;
        chomp;
        push @file1, [split ',', $_];
    }
    close FILE1;

    open FILE2, "<$file2" or die;
    while (<FILE2>){
        next if $.==1;
        print "$.\n" if ($.%100000) == 0;
        chomp;
        push @file2, [split ',', $_];
    }
    close FILE2;

    print "sorting files\n";
    my @sortedf1 = sort {$a->[$position] <=> $b->[$position]} @file1;
    my @sortedf2 = sort {$a->[$position] <=> $b->[$position]} @file2;
    print "sorted\n";
    @file1 = undef;
    @file2 = undef;
    #foreach my $line (@file1){print "\t [ @$line ],\n"; }

    my ($i,$j) = (0,0);
    while ($i < $#sortedf1 and $j < $#sortedf2){
        my $key1 = $sortedf1[$i][$position];
        my $key2 = $sortedf2[$j][$position];
        if ($key1 eq $key2){
            foreach(0..$hsize){ #header size.
                $sortedf2[$j][$_] = $sortedf1[$i][$_] if $sortedf1[$i][$_] ne undef;
            }
            $i++;
            $j++;
        }
        elsif ( $key1 < $key2){
            push(@sortedf2,[@{$sortedf1[$i]}]);
            $i++;
        }
        elsif ( $key1 > $key2){
            $j++;
        }
    }
    #foreach my $line (@sortedf2){print "\t [ @$line ],\n"; }

    print "outputting to file\n";
    open OUT, ">$out";
    print OUT $header;
    foreach(@sortedf2){
        print OUT (join ",", @{$_})."\n";
    }
    close OUT;
}
Thanks everyone, the solution is posted above. It now takes about 1 minute to merge the whole thing! :)
Two techniques come to mind.
Read the data from the CSV files into two tables in a DBMS (SQLite would work just fine), and then use the DB to do a join and write the data back out to CSV; a rough sketch of this approach follows the next point. The database will use indexes to optimize the join.
First, sort each file by primary key (using perl or unix sort), then do a linear scan over each file in parallel (read a record from each file; if the keys are equal then output a joined row and advance both files; if the keys are unequal then advance the file with the lesser key and try again). This step is O(n + m) time instead of O(n * m), and O(1) memory.
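Here is a rough sketch of the first approach using DBI with DBD::SQLite. The table layout, column names, and file names are made up for illustration; a real script would create one column per CSV field, use Text::CSV for parsing rather than a naive split, and would also need an extra query for rows whose key appears in only one file (omitted here):

#!/usr/bin/env perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( "dbi:SQLite:dbname=merge.db", "", "",
                        { RaiseError => 1, AutoCommit => 0 } );

# One table per input file; SQLite columns can be left untyped.
$dbh->do("CREATE TABLE $_ (c0, c1, c2, c3, pk)") for qw(f1 f2);

# Bulk-load a CSV file into a table (assumes exactly five columns per line).
sub load {
    my ($table, $path) = @_;
    my $ins = $dbh->prepare("INSERT INTO $table VALUES (?, ?, ?, ?, ?)");
    open my $fh, '<', $path or die "$path: $!";
    while (<$fh>) {
        chomp;
        $ins->execute( (split /,/, $_, -1)[0 .. 4] );
    }
    close $fh;
}
load( f1 => 'file1.csv' );
load( f2 => 'file2.csv' );
$dbh->commit;
$dbh->do("CREATE INDEX i$_ ON f$_ (pk)") for 1, 2;

# Let the database do the join; NULLIF/COALESCE prefer the non-empty value.
my $sth = $dbh->prepare(q{
    SELECT COALESCE(NULLIF(f1.c0,''), f2.c0),
           COALESCE(NULLIF(f1.c1,''), f2.c1),
           COALESCE(NULLIF(f1.c2,''), f2.c2),
           COALESCE(NULLIF(f1.c3,''), f2.c3),
           f1.pk
    FROM f1 JOIN f2 ON f1.pk = f2.pk
});
$sth->execute;
while ( my @row = $sth->fetchrow_array ) {
    print join(',', @row), "\n";
}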
What's killing the performance is this code, which is concatenating millions of times.
$OUTSTRING.=$outstring."\n";
....
foreach my $key (sort { $a <=> $b } keys %line_for){
$OUTSTRING.= $file1array[$line_for{$key}]."\n";
}
If you want to write to the output file only once, accumulate your results in an array, and then print them at the very end, using join. Or, even better perhaps, include the newlines in the results and write the array directly.
To see how concatenation does not scale when crunching big data, experiment with this demo script. When you run it in concat mode, things start slowing down considerably after a couple hundred thousand concatenations -- I gave up and killed the script. By contrast, simply printing an array of a million lines took less than a minute on my machine.
# Usage: perl demo.pl 50 999999 concat|join|direct
use strict;
use warnings;

my ($line_len, $n_lines, $method) = @ARGV;
my @data = map { '_' x $line_len . "\n" } 1 .. $n_lines;

open my $fh, '>', 'output.txt' or die $!;

if ($method eq 'concat'){ # Dog slow. Gets slower as @data gets big.
    my $outstring;
    for my $i (0 .. $#data){
        print STDERR $i, "\n" if $i % 1000 == 0;
        $outstring .= $data[$i];
    }
    print $fh $outstring;
}
elsif ($method eq 'join'){ # Fast
    print $fh join('', @data);
}
else { # Fast
    print $fh @data;
}
If you want to merge, you should really merge. First of all, you have to sort your data by key and then merge! You will beat even MySQL in performance. I have a lot of experience with it.
You can write something along those lines:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV_XS;
use autodie;

use constant KEYPOS => 4;

die "Insufficient number of parameters" if @ARGV < 2;

my $csv = Text::CSV_XS->new( { eol => $/ } );
my $sortpos = KEYPOS + 1;
open my $file1, "sort -n -k$sortpos -t, $ARGV[0] |";
open my $file2, "sort -n -k$sortpos -t, $ARGV[1] |";

my $row1 = $csv->getline($file1);
my $row2 = $csv->getline($file2);
while ( $row1 and $row2 ) {
    my $row;
    if ( $row1->[KEYPOS] == $row2->[KEYPOS] ) {    # merge rows
        $row  = [ map { $row1->[$_] || $row2->[$_] } 0 .. $#$row1 ];
        $row1 = $csv->getline($file1);
        $row2 = $csv->getline($file2);
    }
    elsif ( $row1->[KEYPOS] < $row2->[KEYPOS] ) {
        $row  = $row1;
        $row1 = $csv->getline($file1);
    }
    else {
        $row  = $row2;
        $row2 = $csv->getline($file2);
    }
    $csv->print( *STDOUT, $row );
}

# flush possible tail
while ( $row1 ) {
    $csv->print( *STDOUT, $row1 );
    $row1 = $csv->getline($file1);
}
while ( $row2 ) {
    $csv->print( *STDOUT, $row2 );
    $row2 = $csv->getline($file2);
}

close $file1;
close $file2;
Redirect output to file and measure.
If you would like more sanity around the sort arguments, you can replace the file-opening part with
(open my $file1, '-|') || exec('sort', '-n', "-k$sortpos", '-t,', $ARGV[0]);
(open my $file2, '-|') || exec('sort', '-n', "-k$sortpos", '-t,', $ARGV[1]);
I can't see anything that strikes me as obviously slow, but I would make these changes:
First, I'd eliminate the @file1array variable. You don't need it; just store the line itself in the hash:
while (<FILE1>){
    chomp;
    $line_for{read_csv_string($_,$position)} = $_;
}
Secondly, although this shouldn't really make much of a difference with perl, I wouldn't add to $OUTSTRING all the time. Instead, keep an array of output lines and push onto it each time. If for some reason you still need to call write_line with a massive string you can always use join('', @OUTLINES) at the end.
If write_line doesn't use syswrite or something low-level like that, but rather uses print or other stdio-based calls, then you aren't saving any disk writes by building up the output file in memory. Therefore, you might as well not build your output up in memory at all, and instead just write it out as you create it. Of course if you are using syswrite, forget this.
Since nothing is obviously slow, try throwing Devel::SmallProf at your code. I've found that to be the best perl profiler for producing those "Oh! That's the slow line!" insights.
Assuming around 20-byte lines, each of your files would amount to about 20 MB, which isn't too big.
Since you are using a hash, your time complexity doesn't seem to be a problem.
In your second loop, you are printing progress to the console; this bit is slow. Removing that should help a lot.
You can also avoid the delete in the second loop.
Reading multiple lines at a time should also help, but not too much I think; there is always going to be read-ahead behind the scenes.
I'd store each record in a hash whose keys are the primary keys. A given primary key's value is a reference to an array of CSV values, where undef represents an unknown value.
use 5.10.0;  # for // ("defined-or")
use Carp;
use Text::CSV;

sub merge_csv {
    my($path,$record) = @_;

    open my $fh, "<", $path or croak "$0: open $path: $!";

    my $csv = Text::CSV->new;
    local $_;
    while (<$fh>) {
        if ($csv->parse($_)) {
            my @f = map length($_) ? $_ : undef, $csv->fields;
            next unless @f >= 1;
            my $primary = pop @f;
            if ($record->{$primary}) {
                $record->{$primary}[$_] //= $f[$_]
                    for 0 .. $#{ $record->{$primary} };
            }
            else {
                $record->{$primary} = \@f;
            }
        }
        else {
            warn "$0: $path:$.: parse failed; skipping...\n";
            next;
        }
    }
}
Your main program will resemble
my %rec;
merge_csv $_, \%rec for qw/ file1 file2 /;
The Data::Dumper module shows that the resulting hash given the simple inputs from your question is
$VAR1 = {
          '42' => [
                    'one',
                    'two',
                    'three',
                    'four'
                  ]
        };
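If you then want the merged records back on disk as CSV, something along these lines would work (an assumption on my part, not shown in the original answer); the primary key is appended again as the last field, matching how merge_csv popped it off:

my $out_csv = Text::CSV->new( { eol => "\n" } );
open my $out, ">", "merged.csv" or croak "$0: open merged.csv: $!";
for my $primary (sort keys %rec) {
    # replace undef (unknown) values with empty fields on output
    my @fields = map { $_ // '' } @{ $rec{$primary} };
    $out_csv->print( $out, [ @fields, $primary ] );
}
close $out;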