How to sort rows in a text file in Perl?

I have a couple of text files (A.txt and B.txt) which look like this (might have ~10000 rows each)
processa,id1=123,id2=5321
processa,id1=432,id2=3721
processa,id1=3,id2=521
processb,id1=9822,id2=521
processa,id1=213,id2=1
processc,id1=822,id2=521
I need to check if every row in file A.txt is present in B.txt as well (B.txt might have more too, that is okay).
The thing is that rows can be in any order in the two files, so I am thinking I will sort them in some particular order in both the files in O(nlogn) and then match each line in A.txt to the next lines in B.txt in O(n). I could implement a hash, but the files are big and this comparison happens only once after which these files are regenerated, so I don't think that is a good idea.
What is the best way to sort the files in Perl? Any ordering would do, it just needs to be some ordering.
For example, in dictionary ordering, this would be
processa,id1=123,id2=5321
processa,id1=213,id2=1
processa,id1=3,id2=521
processa,id1=432,id2=3721
processb,id1=9822,id2=521
processc,id1=822,id2=521
As I mentioned before, any ordering would be just as fine, as long as Perl is fast in doing it.
I want to do it from within Perl code, after opening the file like so
open (FH, "<A.txt");
Any comments, ideas etc would be helpful.

To sort the file in your script, you will still have to load the entire thing into memory. If you're doing that, I'm not sure what's the advantage of sorting it vs just loading it into a hash?
Something like this would work:
my %seen;
open(A, "<A.txt") or die "Can't read A: $!";
while (<A>) {
$seen{$_}=1;
}
close A;
open(B, "<B.txt") or die "Can't read B: $!";
while(<B>) {
delete $seen{$_};
}
close B;
print "Lines found in A, missing in B:\n";
print keys %seen; # each key still ends with its own newline

Here's another way to do it. The idea is to create a flexible data structure that allows you to answer many kinds of questions easily with grep.
use strict;
use warnings;
my ($fileA, $fileB) = @ARGV;
# Load all lines: $h{LINE}{FILE_NAME} = TALLY
my %h;
$h{$_}{$ARGV} ++ while <>;
# Do whatever you need.
my @all_lines = keys %h;
my @in_both = grep { keys %{$h{$_}} == 2 } keys %h;
my @in_A = grep { exists $h{$_}{$fileA} } keys %h;
my @only_in_A = grep { not exists $h{$_}{$fileB} } @in_A;
my @in_A_mult = grep { $h{$_}{$fileA} > 1 } @in_A;

Well, I routinely parse very large (600MB) daily Apache log files with Perl, and I use a hash to store the information. I also go through about 30 of these files in one script instance, using the same hash. It's not a big deal, assuming you have enough RAM.

May I ask why you must do this in native Perl? If the cost of calling a system call or 3 is not an issue (e.g. you do this infrequently and not in a tight loop), why not simply do:
my $cmd = "sort $file1 > $file1.sorted";
$cmd .= "; sort $file2 > $file2.sorted";
$cmd .= "; comm -23 $file1.sorted $file2.sorted |wc -l";
my $count = `$cmd`;
$count =~ s/\s+//g;
if ($count != 0) {
print "Stuff in A exists that aren't in B\n";
}
Please note that comm parameter might be different, depending on what exactly you want.

As usual, CPAN has an answer for this. Either Sort::External or File::Sort looks like it would work. I've never had occasion to try either, so I don't know which would be better for you.
Another possibility would be to use AnyDBM_File to create a disk-based hash that can exceed available memory. Without trying it, I couldn't say whether using a DBM file would be faster or slower than the sort, but the code would probably be simpler.

Test if A.txt is a subset of B.txt
open my $fh_b, '<', 'B.txt' or die "Can't read B.txt: $!";
open my $fh_a, '<', 'A.txt' or die "Can't read A.txt: $!";
my %bFile;
while (<$fh_b>) {
    chomp;
    my ($process, $id1, $id2) = split /,/;
    $bFile{$process}{$id1}{$id2}++;
}
my $missingRows = 0;
while (<$fh_a>) {
    chomp;
    # Split the row from A the same way before looking it up
    my ($process, $id1, $id2) = split /,/;
    $missingRows++ unless $bFile{$process}{$id1}{$id2};
    last if $missingRows; # One miss means they aren't all verified
}
my $is_Atxt_Subset_Btxt = $missingRows ? 0 : 1;
This tests whether every row in A is also in B: it reads all of B into a hash, then checks each row against that hash while reading A, stopping at the first miss.
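The sort-then-scan idea from the question can also be sketched in memory. This is only a minimal sketch, not a tuned implementation: `missing_from_b` is a hypothetical helper name, and it assumes both files have already been read into arrays of chomped lines.

```perl
use strict;
use warnings;

# Return the lines of @$rows_a that do not occur in @$rows_b.
# Both lists are sorted first, then walked once in step.
sub missing_from_b {
    my ($rows_a, $rows_b) = @_;
    my @sorted_a = sort @$rows_a;
    my @sorted_b = sort @$rows_b;
    my @missing;
    my $j = 0;
    for my $line (@sorted_a) {
        # Skip B lines that sort before the current A line.
        $j++ while $j < @sorted_b && $sorted_b[$j] lt $line;
        push @missing, $line
            unless $j < @sorted_b && $sorted_b[$j] eq $line;
    }
    return @missing;
}

my @rows_a = ("processa,id1=3,id2=521", "processc,id1=822,id2=521");
my @rows_b = ("processc,id1=822,id2=521", "processb,id1=9822,id2=521");
print "$_\n" for missing_from_b(\@rows_a, \@rows_b);
```

The sort costs O(n log n) and the walk is linear, as the question anticipated. At around 10000 rows, the hash versions above are likely simpler and at least as fast.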

Related

Reading lines of a file into a hash parallel in Perl

I have thousands of files. My goal is to insert the lines of those files into a hash (Big amount of those lines repeats).
For now, I iterate through an array on files and for each file, I open it and split the row (Because each row is in the following format: <path>,<number>).
Then I insert into the %paths hash. Also each line I write into one main file (trying to save time by combining).
Piece of my code:
open(my $fh_main, '>', "$main_file") or die;
foreach my $dir (@dirs)
{
my $test = $dir."/"."test.csv";
open(my $fh, '<', "$test") or die;
while (my $row = <$fh>)
{
print $fh_main $row;
chomp($row);
my ($path,$counter) = split(",",$row);
my $abs_path = abs_path($path);
$paths{$abs_path} += $counter;
}
close ($fh);
}
close ($fh_main);
Because there are so many files, I would like to split the iteration at least in half. I thought of using the Parallel::ForkManager module
in order to insert the files in parallel into a hash A and a hash B (or, if possible, into more than two hashes).
Then I can combine those two (or more) hashes into one main hash. There should not be a memory issue (because I'm running on a machine that does not have memory issues).
I read the documentation, but every attempt failed, and each iteration still ran on its own. I would like to see an initial example of how I should solve this issue.
Also, I would like to hear other opinions on how to implement this in a cleaner and wiser way.
Edit: maybe I didn't understand what exactly the module does. I would like to create a fork in the script so that one half of the files will be collected by process 1 and the other half by process 2. The first one to finish will write to a file and the other one will read from it. Is it possible to implement? Will it reduce the run time?
Try MCE::Map. It will automatically gather the output of the sub-processes into a list, which in your case can be a hash. Here's some untested pseudocode:
use MCE::Map qw[ mce_map ];
# note that MCE passes the argument via $_, not @_
sub process_file {
my $file = $_;
my %result_hash;
# ... fill hash ...
return %result_hash;
}
my %result_hash = mce_map \&process_file, \@list_of_files;
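One thing to watch when combining per-file results: the counters for a path that appears in several files need to be added together, and assigning several flattened hashes to one hash would keep only the last value for each key. Below is a minimal sketch of the merge step only. It assumes each worker's result has been kept as a separate hash reference, which is an assumption about how you collect them, and `merge_counts` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Add every counter in each partial hash to one combined hash.
sub merge_counts {
    my @partials = @_;    # list of hash references
    my %total;
    for my $partial (@partials) {
        $total{$_} += $partial->{$_} for keys %$partial;
    }
    return \%total;
}

my $merged = merge_counts({ '/a' => 2, '/b' => 1 }, { '/a' => 5 });
print "$_ => $merged->{$_}\n" for sort keys %$merged;
```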

A Perl script to process a CSV file, aggregating properties spread over multiple records

Sorry for the vague question, I'm struggling to think how to better word it!
I have a CSV file that looks a little like this, only a lot bigger:
550672,1
656372,1
766153,1
550672,2
656372,2
868194,2
766151,2
550672,3
868179,3
868194,3
550672,4
766153,4
The values in the first column are ID numbers and the second column could be described as a property (for want of a better word...). The ID number 550672 has properties 1,2,3,4. Can anyone point me towards how I can begin solving how to produce strings such as that for all the ID numbers? My ideal output would be a new csv file which looks something like:
550672,1;2;3;4
656372,1;2
766153,1;4
etc.
I am very much a Perl baby (only 3 days old!) so would really appreciate direction rather than an outright solution, I'm determined to learn this stuff even if it takes me the rest of my days! I have tried to investigate it myself as best as I can, although I think I've been encumbered by not really knowing what to really search for. I am able to read in and parse CSV files (I even got so far as removing duplicate values!) but that is really where it drops off for me. Any help would be greatly appreciated!
I think it is best if I offer you a working program rather than a few hints. Hints can only take you so far, and if you take the time to understand this code it will give you a good learning experience.
It is best to use Text::CSV whenever you are processing CSV data, as all the debugging has already been done for you.
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new;
open my $fh, '<', 'data.txt' or die $!;
my %data;
while (my $line = <$fh>) {
$csv->parse($line) or die "Invalid data line";
my ($key, $val) = $csv->fields;
push @{ $data{$key} }, $val;
}
for my $id (sort keys %data) {
printf "%s,%s\n", $id, join ';', @{ $data{$id} };
}
output
550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3
Firstly props for seeking an approach not a solution.
As you've probably already found with perl, There Is More Than One Way To Do It.
The approach I would take would be:
use strict; # will save you big time in the long run
my %ids # Use a hash table with the id as the key to accumulate the properties
open a file handle on csv or die
while (read another line from the file handle){
split line into ID and property variable # google the split function
append new property to existing properties for this id in the hash table # If it doesn't exist already, it will be created
}
foreach my $key (keys %ids) {
deduplicate properties
print/display/do whatever you need to do with the result
}
This approach means you will need to iterate over the whole set twice (once in memory), so depending on the size of the dataset that may be a problem.
A more sophisticated approach would be to use a hashtable of hashtables to do the deduplication in the initial step, but depending on how quickly you want/need to get it working, that may not be worthwhile in the first instance.
Check out this question for a discussion on how to do the deduplication.
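The hash-of-hashes variant mentioned above could look roughly like the following. It is only a sketch under stated assumptions: `group_properties` is a hypothetical name, the input is a list of "id,property" lines, and properties are assumed to be numeric so they can be sorted with `<=>`.

```perl
use strict;
use warnings;

# Group "id,property" lines by id, keeping each property only once.
sub group_properties {
    my @lines = @_;
    my %seen;    # $seen{ID}{PROPERTY} = 1
    for my $line (@lines) {
        chomp $line;
        my ($id, $prop) = split /,/, $line;
        $seen{$id}{$prop} = 1;    # a repeated pair just overwrites itself
    }
    # Turn each inner hash into a sorted "1;2;3" string.
    my %joined;
    for my $id (keys %seen) {
        $joined{$id} = join ';', sort { $a <=> $b } keys %{ $seen{$id} };
    }
    return \%joined;
}

my $grouped = group_properties("550672,1\n", "550672,2\n", "550672,1\n", "656372,1\n");
print "$_,$grouped->{$_}\n" for sort keys %$grouped;
```

Because the inner hash keys are unique by construction, no separate deduplication pass is needed.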
Well, open the file as stdin in perl, assume each row is of two columns, then iterate over all lines using left column as hash identifier, and gathering right column into an array pointed by a hash key. At the end of input file you'll get a hash of arrays, so iterate over it, printing a hash key and assigned array elements separated by ";" or any other sign you wish.
and here you go
dtpwmbp:~ pwadas$ cat input.txt
550672,1
656372,1
766153,1
550672,2
656372,2
868194,2
766151,2
550672,3
868179,3
868194,3
550672,4
766153,4
dtpwmbp:~ pwadas$ cat bb2.pl
#!/opt/local/bin/perl
my %hash;
while (<>)
{
chomp;
my($key, $value) = split /,/;
push @{$hash{$key}} , $value ;
}
foreach my $key (sort keys %hash)
{
print $key . "," . join(";", @{$hash{$key}} ) . "\n" ;
}
dtpwmbp:~ pwadas$ cat input.txt | perl -f bb2.pl
550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3
dtpwmbp:~ pwadas$
perl -F"," -ane 'chomp($F[1]);$X{$F[0]}=$X{$F[0]}.";".$F[1];if(eof){for(keys %X){$X{$_}=~s/;//;print $_.",".$X{$_}."\n"}}'
Another (not perl) way which incidentally is shorter and more elegant:
#!/opt/local/bin/gawk -f
BEGIN {FS=OFS=",";}
NF > 0 { IDs[$1]=IDs[$1] ";" $2; }
END { for (i in IDs) print i, substr(IDs[i], 2); }
The first line (after specifying the interpreter) sets the input field separator (FS) and the output field separator (OFS) to the comma. The second line checks whether the record has more than zero fields and, if it does, appends $2 to the string stored under the ID ($1) as key. This happens for every line.
The END block prints these pairs in an unspecified order. If you want them sorted, you have the option of gawk's asorti function, or of piping the output of this snippet to sort -t, -k1n,1n.

Parsing multiple files at a time in Perl

I have a large data set (around 90GB) to work with. There are data files (tab delimited) for each hour of each day and I need to perform operations in the entire data set. For example, get the share of OSes which are given in one of the columns. I tried merging all the files into one huge file and performing the simple count operation but it was simply too huge for the server memory.
So, I guess I need to perform the operation on each file one at a time and then add up the results at the end. I am new to Perl and am especially naive about performance issues. How do I do such operations in a case like this?
As an example two columns of the file are.
ID OS
1 Windows
2 Linux
3 Windows
4 Windows
Let's do something simple: counting the share of the OSes in the data set. Each .txt file has millions of these lines, and there are many such files. What would be the most efficient way to operate on all the files?
Unless you're reading the entire file into memory, I don't see why the size of the file should be an issue.
my %osHash;
while (<>)
{
chomp; # otherwise a last-column value would carry its newline into the key
my ($id, $os) = split("\t", $_);
if (!exists($osHash{$os}))
{
$osHash{$os} = 0;
}
$osHash{$os}++;
}
foreach my $key (sort(keys(%osHash)))
{
print "$key : ", $osHash{$key}, "\n";
}
While Paul Tomblin's answer dealt with filling the hash, here's the same plus opening the files:
use strict;
use warnings;
use 5.010;
use autodie;
my @files = map { "file$_.txt" } 1..10;
my %os_count;
for my $file (@files) {
open my $fh, '<', $file;
while (<$fh>) {
my ($id, $os) = split /\t/;
... #Do something with %os_count and $id/$os here.
}
}
We just open each file serially. Since you need to read all lines from all files, there isn't much more you can do about it. Once you have the hash, you could store it somewhere and load it when the program starts, then skip all lines up to the last one you read, or simply seek there if your records permit, which it doesn't look like they do.
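Once the counts have been accumulated across all files, turning them into shares is a small final step. A minimal sketch, assuming the counts live in a hash keyed by OS name (`os_shares` is a hypothetical helper, not part of either answer):

```perl
use strict;
use warnings;

# Convert raw counts into percentage shares of the total.
sub os_shares {
    my ($counts) = @_;    # hash reference: OS name => number of lines
    my %share;
    my $total = 0;
    $total += $_ for values %$counts;
    return \%share unless $total;    # no data: avoid dividing by zero
    $share{$_} = 100 * $counts->{$_} / $total for keys %$counts;
    return \%share;
}

my $shares = os_shares({ Windows => 3, Linux => 1 });
printf "%s: %.1f%%\n", $_, $shares->{$_} for sort keys %$shares;
```

Only the small hash of counts stays in memory, so the size of the input files does not matter for this step.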

How can I combine files into one CSV file?

If I have one file FOO_1.txt that contains:
FOOA
FOOB
FOOC
FOOD
...
and lots of other files, FOO_files.txt, each of which contains:
1110000000...
one line that contains a 0 or 1 for each of the FOO_1 values (fooa, foob, ...)
Now I want to combine them to one file FOO_RES.csv that will have the following format:
FOOA,1,0,0,0,0,0,0...
FOOB,1,0,0,0,0,0,0...
FOOC,1,0,0,0,1,0,0...
FOOD,0,0,0,0,0,0,0...
...
What is a simple and elegant way to do that
(with hashes and arrays -> $hash{$key} = \@data)?
Thanks a lot for any help !
Yohad
If you can't describe your data and your desired result clearly, there is no way that you will be able to code it. Taking on a simple project is a good way to get started using a new language.
Allow me to present a simple method you can use to churn out code in any language, whether you know it or not. This method only works for smallish projects. You'll need to actually plan ahead for larger projects.
How to write a program:
1. Open up your text editor and write down what data you have. Make each line a comment.
2. Describe your desired results.
3. Start describing the steps needed to change your data into the desired form.
Numbers 1 & 2 completed:
#!/usr/bin/perl
use strict;
use warnings;
# Read data from multiple files and combine it into one file.
# Source files:
# Field definitions: has a list of field names, one per line.
# Data files:
# * Each data file has a string of digits.
# * There is a one-to-one relationship between the digits in the data file and the fields in the field defs file.
#
# Results File:
# * The results file is a CSV file.
# * Each field will have one row in the CSV file.
# * The first column will contain the name of the field represented by the row.
# * Subsequent values in the row will be derived from the data files.
# * The order of subsequent fields will be based on the order files are read.
# * However, each column (2-X) must represent the data from one data file.
Now that you know what you have, and where you need to go, you can flesh out what the program needs to do to get you there - this is step 3:
You know you need to have the list of fields, so get that first:
# Get a list of fields.
# Read the field definitions file into an array.
Since it is easiest to write CSV in a row oriented fashion, you will need to process all your files before generating each row. So you'll need someplace to store the data.
# Create a variable to store the data structure.
Now we read the data files:
# Get a list of data files to parse
# Iterate over list
# For each data file:
# Read the string of digits.
# Assign each digit to its field.
# Store data for later use.
We've got all the data in memory, now write the output:
# Write the CSV file.
# Open a file handle.
# Iterate over list of fields
# For each field
# Get field name and list of values.
# Create a string - comma separated string with field name and values
# Write string to file handle
# close file handle.
Now you can start converting comments into code. You could have anywhere from 1 to 100 lines of code for each comment. You may find that something you need to do is very complex and you don't want to take it on at the moment. Make a dummy subroutine to handle the complex task, and ignore it until you have everything else done. Now you can solve that complex, thorny sub-problem on its own.
Since you are just learning Perl, you'll need to hit the docs to find out how to do each of the subtasks represented by the comments you've written. The best resource for this kind of work is the list of functions by category in perlfunc. The Perl syntax guide will come in handy too. Since you'll need to work with a complex data structure, you'll also want to read from the Data Structures Cookbook.
You may be wondering how the heck you should know which perldoc pages you should be reading for a given problem. An article on Perlmonks titled How to RTFM provides a nice introduction to the documentation and how to use it.
The great thing is that if you get stuck, you have some code to share when you ask for help.
If I understand correctly your first file is your key order file, and the remaining files each contain a byte per key in the same order. You want a composite file of those keys with each of their data bytes listed together.
In this case you should open all the files simultaneously. Read one key from the key order file, read one byte from each of the data files. Output everything as you read it to your final file. Repeat for each key.
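The same key-by-key walk can be sketched on data that has already been read in. This is a simplified, in-memory illustration rather than the streaming version described above: `build_rows` is a hypothetical name, and it assumes every data string has at least as many digits as there are field names.

```perl
use strict;
use warnings;

# Build one CSV row per field name; column N holds the digit that
# data string N has at that field's position.
sub build_rows {
    my ($names, $data) = @_;    # array refs: field names, digit strings
    my @rows;
    for my $i (0 .. $#$names) {
        my @digits = map { substr($_, $i, 1) } @$data;
        push @rows, join(',', $names->[$i], @digits);
    }
    return @rows;
}

print "$_\n" for build_rows([qw(FOOA FOOB FOOC)], ['110', '100', '001']);
```

Each data string here stands for the single line of one data file, so each of them contributes one column to the result.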
It looks like you have many foo_files that have 1 line in them, something like:
1110000000
Which stands for
fooa=1
foob=1
fooc=1
food=0
fooe=0
foof=0
foog=0
fooh=0
fooi=0
fooj=0
And it looks like your foo_res is just a summation of those values? In that case, you don't need a hash of arrays, but just a hash.
my @foo_files = (); #NOT SURE HOW YOU POPULATE THIS ONE
my @foo_keys = qw(a b c d e f g h i j);
my %foo_hash = map{ ( $_, 0 ) } @foo_keys; # initialize hash
foreach my $foo_file ( @foo_files ) {
open( my $FOO, "<", $foo_file) || die "Cannot open $foo_file\n";
my $line = <$FOO>;
close( $FOO );
chomp($line);
my @foo_values = split(//, $line);
foreach my $indx ( 0 .. $#foo_keys ) {
last if ( ! defined $foo_values[ $indx ] ); # stop when the line runs out of digits; a plain truth test would also stop at the first 0
$foo_hash{ $foo_keys[$indx] } += $foo_values[ $indx ];
}
}
It's pretty hard to understand what you are asking for, but maybe this helps?
Your specifications aren't clear. You couldn't have a "lots of other files" named FOO_files.txt, because it's only one name. So I'm going to take this as the files-with-data + filelist pattern. In this case, there are files named FOO*.txt, each containing "[01]+\n".
Thus the idea is to process all the files in the filelist file and to insert them all into a result file FOO_RES.csv, comma-delimited.
use strict;
use warnings;
use English qw<$OS_ERROR>;
use IO::Handle;
open my $foos, '<', 'FOO_1.txt'
or die "I'm dead: $OS_ERROR";
@ARGV = sort map { chomp; "$_.txt" } <$foos>;
$foos->close;
open my $foo_csv, '>', 'FOO_RES.csv'
or die "I'm dead: $OS_ERROR";
while ( my $line = <> ) {
chomp $line; # keep the newline out of the split digits
my ( $foo_name ) = ( $ARGV =~ /(.*)\.txt$/ );
$foo_csv->print( join( ',', $foo_name, split //, $line ), "\n" );
}
$foo_csv->close;
You don't really need to use a hash. My Perl is a little rusty, so syntax may be off a bit, but basically do this:
open KEYFILE, "<", "foo_1.txt" or die "cannot open foo_1 for reading: $!";
open VALFILE, "<", "foo_files.txt" or die "cannot open foo_files for reading: $!";
open OUTFILE, ">", "foo_out.txt" or die "cannot open foo_out for writing: $!";
my %output;
while (<KEYFILE>) {
    chomp(my $key = $_);
    my $val = <VALFILE>;
    chomp $val;
    # List assignment keeps the digits; a scalar would only get their count
    my @arrVal = split(//, $val);
    $output{$key} = \@arrVal;
    print OUTFILE $key . "," . join(",", @arrVal) . "\n";
}
Edit: Syntax check OK
Comment by Sinan: @Byron, it really bothers me that your first sentence says the OP does not need a hash yet your code has %output which seems to serve no purpose. For reference, the following is a less verbose way of doing the same thing.
#!/usr/bin/perl
use strict;
use warnings;
use autodie qw(:file :io);
open my $KEYFILE, '<', "foo_1.txt";
open my $VALFILE, '<', "foo_files.txt";
open my $OUTFILE, '>', "foo_out.txt";
while (my $key = <$KEYFILE>) {
chomp $key;
chomp(my $val = <$VALFILE>);
print $OUTFILE join(q{,}, $key, split //, $val), "\n";
}
__END__

Should I manually set Perl's @ARGV so I can use <> to open, scan, and close files?

I have recently started learning Perl and one of my latest assignments involves searching a bunch of files for a particular string. The user provides the directory name as an argument and the program searches all the files in that directory for the pattern. Using readdir() I have managed to build an array with all the searchable file names and now need to search each and every file for the pattern, my implementation looks something like this -
sub searchDir($) {
my $dirN = shift;
my @dirList = glob("$dirN/*");
for(@dirList) {
push @fileList, $_ if -f $_;
}
@ARGV = @fileList;
while(<>) {
## Search for pattern
}
}
My question is - is it alright to manually load the @ARGV array as has been done above and use the <> operator to scan in individual lines or should I open / scan / close each file individually? Will it make any difference if this processing exists in a subroutine and not in the main function?
On the topic of manipulating @ARGV - that's definitely working code, Perl certainly allows you to do that. I don't think it's a good coding habit though. Most of the code I've seen that uses the "while (<>)" idiom is using it to read from standard input, and that's what I initially expect your code to do. A more readable pattern might be to open/close each input file individually:
foreach my $file (@files) {
open FILE, "<$file" or die "Error opening file $file ($!)";
my @lines = <FILE>;
close FILE or die $!;
foreach my $line (@lines) {
if ( $line =~ /$pattern/ ) {
# do something here!
}
}
}
That would read more easily to me, although it is a few more lines of code. Perl allows you a lot of flexibility, but I think that makes it that much more important to develop your own style in Perl that's readable and understandable to you (and your co-workers, if that's important for your code/career).
Whether the processing lives in the main program or in a subroutine is also mostly a stylistic decision that you should play around with and think about. Modern computers are so fast at this stuff that style and readability are much more important for scripts like this, as you're not likely to encounter situations in which such a script over-taxes your hardware.
Good luck! Perl is fun. :)
Edit: It's of course true that if he had a very large file, he should do something smarter than slurping the entire file into an array. In that case, something like this would definitely be better:
while ( my $line = <FILE> ) {
if ( $line =~ /$pattern/ ) {
# do something here!
}
}
The point when I wrote "you're not likely to encounter situations in which such a script over-taxes your hardware" was meant to cover that, sorry for not being more specific. Besides, who even has 4GB hard drives, let alone 4GB files? :P
Another Edit: After perusing the Internet on the advice of commenters, I've realized that there are hard drives that are much larger than 4GB available for purchase. I thank the commenters for pointing this out, and promise in the future to never-ever-ever try to write a sarcastic comment on the internet.
I would prefer this more explicit and readable version:
#!/usr/bin/perl -w
foreach my $file (<$ARGV[0]/*>){
open(F, $file) or die "$!: $file";
while(<F>){
# search for pattern
}
close F;
}
But it is also okay to manipulate @ARGV:
#!/usr/bin/perl -w
@ARGV = <$ARGV[0]/*>;
while(<>){
# search for pattern
}
Yes, it is OK to adjust the argument list before you start the 'while (<>)' loop; it would be more nearly foolhardy to adjust it while inside the loop. If you process option arguments, for instance, you typically remove items from @ARGV; here, you are adding items, but it still changes the original value of @ARGV.
It makes no odds whether the code is in a subroutine or in the 'main function'.
The previous answers cover your main Perl-programming question rather well.
So let me comment on the underlying question: How to find a pattern in a bunch of files.
Depending on the OS it might make sense to call a specialised external program, say
grep -l <pattern> <path>
on unix.
Depending on what you need to do with the files containing the pattern, and how big the hit/miss ratio is, this might save quite a bit of time (and re-uses proven code).
The big issue with tweaking @ARGV is that it is a global variable. Also, you should be aware that while (<>) has special magic attributes (reading each file in @ARGV or processing STDIN if @ARGV is empty, testing for definedness rather than truth). To reduce the magic that needs to be understood, I would avoid it, except for quickie-hack-jobs.
You can get the filename of the current file by checking $ARGV.
You may not realize it, but you are actually affecting two global variables, not just @ARGV. You are also hitting $_. It is a very, very good idea to localize $_ as well.
You can reduce the impact of munging globals by using local to localize the changes.
BTW, there is another important, subtle bit of magic with <>. Say you want to return the line number of the match in the file. You might think, ok, check perlvar and find $. gives the line number in the last handle accessed--great. But there is an issue lurking here--$. is not reset between @ARGV files. This is great if you want to know how many lines total you have processed, but not if you want a line number for the current file. Fortunately there is a simple trick with eof that will solve this problem.
use strict;
use warnings;
...
searchDir( 'foo' );
sub searchDir {
my $dirN = shift;
my $pattern = shift;
local $_;
my @fileList = grep { -f $_ } glob("$dirN/*");
return unless @fileList; # Don't want to process STDIN.
local @ARGV;
@ARGV = @fileList;
while(<>) {
my $found = 0;
## Search for pattern
if ( $found ) {
print "Match at $. in $ARGV\n";
}
}
continue {
# reset line numbering after each file.
close ARGV if eof; # don't use eof().
}
}
WARNING: I just modified your code in my browser. I have not run it, so it may have typos and probably won't work without a bit of tweaking.
Update: The reason to use local instead of my is that they do very different things. my creates a new lexical variable that is only visible in the contained block and cannot be accessed through the symbol table. local saves the existing package variable and aliases it to a new variable. The new localized version is visible in any subsequent code, until we leave the enclosing block. See perlsub: Temporary Values Via local().
In the general case of making new variables and using them, my is the correct choice. local is appropriate when you are working with globals, but you want to make sure you don't propagate your changes to the rest of the program.
This short script demonstrates local:
$foo = 'foo';
print_foo();
print_bar();
print_foo();
sub print_bar {
local $foo;
$foo = 'bar';
print_foo();
}
sub print_foo {
print "Foo: $foo\n";
}