I'm having an issue with Perl and I'm hoping someone here can help me figure out what's going on. I have about 130,000 .txt files in a directory called RawData, and a Perl program that loads their names into an array and then loops through that array, opening each .txt file. For simplicity, suppose I have four text files I'm looping through:
File1.txt
File2.txt
File3.txt
File4.txt
The contents of each .txt file look something like this:
007 C03XXYY ZZZZ
008 A01XXYY ZZZZ
009 A02XXYY ZZZZ
where X,Y,Z are digits. In my simplified code below, the program then pulls out just line 007 in each .txt file, saves XX as ID, ignores YY and grabs the variable data ZZZZ that I've called VarVal. Then it writes everything to a file with a header specified in the code below:
#!/usr/bin/perl
use warnings;
use strict;
open(OUTFILE, "> ../Data/OutputFile.csv") or die $!;
opendir(MYDIR,"../RawData")||die $!;
my @txtfiles=grep {/\.txt$/} readdir(MYDIR);
closedir(MYDIR);
print OUTFILE "ID,VarName,VarVal\n";
foreach my $txtfile (@txtfiles){
    # Prints to the screen so I can see where I am in the loop.
    print $txtfile,"\n";
    open(INFILE, "< ../RawData/$txtfile") or die $!;
    while(<INFILE>){
        if(m{^007 C03(\d{2})(\d+)(\s+)(.+)}){
            print OUTFILE "$1,VarName,$4\n";
        }
    }
}
The issue I'm having is that the contents of, for example, File3.txt don't show up in OutputFile.csv. However, it's not an issue with Perl failing to find a match, because I checked that the if statement is being executed by removing OUTFILE from the print statement and looking at what the code prints to the terminal screen. What shows up is exactly what should be there.
Furthermore, if I just run the problematic file (File3.txt) through the loop by itself, by commenting out the opendir and closedir stuff and doing something like my @txtfiles = ("File3.txt");, then the only data that shows up in OutputFile.csv IS what's in File3.txt. But when it goes through the full loop, it won't show up in OutputFile.csv. Plus, I know that File3.txt is being sent into the loop because I can see it being printed on the screen by print $txtfile,"\n";. I'm at a loss as to what is going on here.
The other issue is that I don't think it's something specific to this one particular file (maybe it is), but I can't just troubleshoot this one file, because I have 130,000 files and I just happened to stumble across the fact that this one wasn't being written to the output file. So there may be other files that also aren't getting written, even though there is no obvious reason they shouldn't be, just like File3.txt.
Perhaps looping through 130,000 files in rapid succession causes some sort of I/O issue that randomly fails every so often to write the contents in memory to the output file? That's my best guess, but I have no idea how to diagnose or fix this.
This is kind of a difficult question to debug, but I'm hoping someone on here has some insight or has seen similar problems that would provide me with a solution.
Thanks
There's nothing obviously wrong that I can see in your code. It is a little outdated, though: using autodie and lexical filehandles would be better.
However, I would recommend that you make your regex slightly less restrictive by allowing variable-length spacing after the first value and allowing the last field to be of zero length. I'd also output the filename. Then you can see which other files aren't being caught, for whatever reason:
if (m{^007\s+C03(\d{2})\d+\s+(.*)}) {
    print OUTFILE "$txtfile $1,VarName,$2\n";
    last;
}
Finally, assuming there is only a single 007 C03 line in each file, you can throw in a last call after one is found, as in the snippet above.
You may want to try sorting the @txtfiles list, then systematically looking through the output to see what is or isn't there. With 130k files in random order, it would be pretty difficult to be certain you hadn't missed one. Perl should be giving you the files in the actual order they appear in the directory, which is different from user-level commands like ls, so it may be different from what you'd expect.
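For reference, here is a minimal sketch pulling together the suggestions above (autodie, lexical filehandles, the looser regex, the filename in the output, and a sorted file list), using the paths from the question; it's an illustration of the style, not a tested drop-in replacement:
use strict;
use warnings;
use autodie;

open my $outfile, '>', '../Data/OutputFile.csv';
opendir my $dir, '../RawData';
my @txtfiles = sort grep { /\.txt$/ } readdir $dir;
closedir $dir;

print {$outfile} "ID,VarName,VarVal\n";
for my $txtfile (@txtfiles) {
    print "$txtfile\n";
    open my $infile, '<', "../RawData/$txtfile";
    while (<$infile>) {
        if (m{^007\s+C03(\d{2})\d+\s+(.*)}) {
            print {$outfile} "$txtfile $1,VarName,$2\n";
            last;
        }
    }
    close $infile;
}
close $outfile;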
I have a very simple dictionary application that does search and display. It's built with the Win32::GUI module. I put all the plain-text data needed for the dictionary under the __DATA__ section. The script itself is very small, but with everything under the __DATA__ section its size reaches 30 MB. In order to share the work with my friends, I then packed the script into a stand-alone executable using the pp utility of the PAR::Packer module with the highest compression level, 9, and now I have a single-file dictionary app of about 17 MB.
But although I'm very comfortable with the idea of a single-file script, placing such a huge amount of text data under the script's DATA section does not feel right. For one thing, when I try opening the script in Padre (Notepad++ is okay), I receive an error like:
Can't open my script as the script is over the arbitrary file size limit which is currently 500000.
My questions:
Does it bring me any extra benefits, apart from eliminating Padre's file-opening issue, if I move everything under the DATA section to a separate text file?
If I do so, what should I do to reduce the size of the separate file? Zip it and uncompress it while doing search and display?
How do people normally format the text data needed for a dictionary application?
Any comments, ideas or suggestions? Thanks like always :)
If I do so, what should I do to reduce the size of the separate file? Zip it and uncompress it while doing search and display?
Well, it depends on WHY you want to reduce the size. If it is to minimize disk space usage (a rather unusual goal these days), then zip/unzip is the way to go.
However, if the goal is to minimize memory usage, then a better approach is to split the dictionary data into smaller chunks (for example, indexed by first letter) and only load the needed chunks.
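A rough sketch of that idea, assuming the dictionary has been pre-split into one tab-separated file per first letter (dict/a.txt, dict/b.txt, ...); the directory layout, file names, and field separator are assumptions for illustration only:
use strict;
use warnings;

my %chunk;    # first letter => hashref of entries loaded from that chunk file

sub lookup {
    my ($word) = @_;
    my $letter = lc substr $word, 0, 1;
    unless ($chunk{$letter}) {
        open my $fh, '<', "dict/$letter.txt" or return;
        my %entries;
        while (<$fh>) {
            chomp;
            my ($head, $definition) = split /\t/, $_, 2;   # assumed tab-separated
            $entries{$head} = $definition;
        }
        close $fh;
        $chunk{$letter} = \%entries;   # keep only the chunks actually used
    }
    return $chunk{$letter}{$word};
}

my $def = lookup('apple');
print defined $def ? $def : 'not found', "\n";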
How do people normally format the text data needed for a dictionary application?
IMHO the usual approach is the logical conclusion of the approach mentioned above (partitioned and indexed data): using a back-end database, which allows you to retrieve only the data that is actually needed.
In your case, probably something simple like SQLite or Berkeley DB/DBM files should be OK.
Does it bring me any extra benefits, apart from eliminating Padre's file-opening issue, if I move everything under the DATA section to a separate text file?
This depends somewhat on your usage... if it's a never-changing script used by 3 people, there may be no tangible benefits.
In general, it will make maintenance much easier (you can change the dictionary and the code logic independently - think virus definitions file vs. antivirus executable for real world example).
It will also decrease the process memory consumption if you go with the approaches I mentioned above.
Since you are using PAR::Packer already, why not move it to a separate file or module and include it in the PAR file?
The easy way (no extra commandline options to pp, it will see the use statement and do the right thing):
words.pl
#!/usr/bin/perl
use strict;
use warnings;
use Words;
for my $i (1 .. 2) {
    print "Run $i\n";
    while (defined(my $word = Words->next_word)) {
        print "\t$word\n";
    }
}
Words.pm
package Words;
use strict;
use warnings;
my $start = tell DATA
    or die "could not find current position: $!";

sub next_word {
    if (eof DATA) {
        seek DATA, $start, 0
            or die "could not seek: $!";
        return undef;
    }
    chomp(my $word = scalar <DATA>);
    return $word;
}

1;
__DATA__
a
b
c
I have an application generating logs every 5 seconds. The logs are in the format below:
11:13:49.250,interface,0,RX,0
11:13:49.250,interface,0,TX,0
11:13:49.250,interface,1,close,0
11:13:49.250,interface,4,error,593
11:13:49.250,interface,4,idle,2994215
and so on for other interfaces...
I am working to convert these into the CSV format below:
Time,interface.RX,interface.TX,interface.close....
11:13:49,0,0,0,....
Simple enough so far, but the problem is that I have to produce the CSV online, i.e. as soon as the log file is updated, the CSV should also be updated.
What I have tried so far, to read the log output and build the header, is:
#!/usr/bin/perl -w
use strict;
use File::Tail;
my $head=["Time"];
my $pos={};
my $last_pos=0;
my $current_event=[];
my $events=[];
my $file = shift;
$file = File::Tail->new($file);
while(defined($_=$file->read)) {
    next if $_ =~ /some filters/;    # placeholder; filters elided in the question
    my ($time,$interface,$count,$eve,$value) = split /[,\n]/, $_;
    my $key = $interface.".".$eve;
    if (not defined $pos->{$key}) {
        $last_pos+=1;
        $pos->{$key}=$last_pos;
        push @$head, $key;
    }
    print join(",", @$head) . "\n";
}
Is there any way to do this using Perl?
Module Text::CSV will allow you to both read and write CSV format files. Text::CSV will internally use Text::CSV_XS if it's installed, or it will fall back to using Text::CSV_PP (thanks to Brad Gilbert for improving this explanation).
Grouping the related rows together is something you will have to do; it is not clear from your example where the source date goes to.
Making sure that the CSV output is updated is primarily a question of ensuring that you have the output file line buffered.
As David M suggested, perhaps you should look at the File::Tail module to deal with the continuous reading aspect of the problem. That should allow you to continually read from the input log file.
You can then use the 'parse' method in Text::CSV to split up the read line, and the 'print' method to format the output. How you combine the information from the various input lines to create an output line is a mystery to me - I cannot see how the logic works from the example you give. However, I assume you know what you need to do, and these tools will give you the mechanisms you need to handle the data.
No-one can do much more to spoon-feed you the answer. You are going to have to do some thinking for yourself. You will have a file handle that can be continuously read via File::Tail; you will have a CSV structure for reading the data lines; you will probably have another CSV structure for the written output; you will have an output file handle that you ensure is flushed every time you write. Connecting these dots is now your problem.
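To show roughly how those pieces fit together, here is a minimal sketch (not a complete solution) combining File::Tail for the continuous read, Text::CSV for parsing and printing, and autoflush for the line buffering; the output file name is an assumption, and the grouping of related rows into one output record is deliberately left as a stub because it isn't clear from the question:
use strict;
use warnings;
use File::Tail;
use Text::CSV;
use IO::Handle;

my $csv = Text::CSV->new({ binary => 1, eol => "\n" })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag;

my $logfile = shift @ARGV or die "usage: $0 logfile\n";
my $tail    = File::Tail->new(name => $logfile, maxinterval => 5);

open my $out, '>', 'output.csv' or die "Cannot open output.csv: $!";
$out->autoflush(1);    # flush every write so the CSV stays up to date

while (defined(my $line = $tail->read)) {
    chomp $line;
    $csv->parse($line) or next;
    my ($time, $interface, $count, $event, $value) = $csv->fields;
    # ... here you would accumulate the related rows for one timestamp
    #     into a single output record; that grouping logic is yours ...
    $csv->print($out, [ $time, "$interface.$event", $value ]);
}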
I have a list of sorted coordinates (let's call it xycord.txt) that looks like this:
chr1 10003486 10043713
chr1 10003507 10043106
chr2 10003486 10043713
chr2 10003507 10043162
chr2 10003532 10042759
In reality this file is very large, with about 10^7 lines.
What I want to do: given another pair of coordinates, check whether it falls between any pair of coordinates in the xycord.txt file.
The approach I currently have is super slow, because there are also many other coordinate pairs to check against this large xycord.txt file.
Is there a fast way to do it?
#!/usr/bin/perl -w
my $point_to_check_x = $ARGV[0] || '10003488';
my $point_to_check_y = $ARGV[1] || '10003489';
my $chrid = $ARGV[2] || "chr1";
my %allxycordwithchr;
# skip file opening construct
while (<XYCORD_FILE>) {
    my ($chr,$tx,$ty) = split(/\s+/,$_);
    push @{$allxycordwithchr{$chr}}, $tx."-".$ty;
}
my @chosenchr_cord = @{$allxycordwithchr{$chrid}};
for my $chro_cord (@chosenchr_cord){
    my ($repox,$repoy) = split("-",$chro_cord);
    my $stat = is_in_xycoordsfile($repox,$repoy,$point_to_check_x,$point_to_check_y);
    if ($stat eq "IN"){
        print "IN\n";
    }
}
sub is_in_xycoordsfile {
    my ($x,$y,$xp,$yp) = @_;
    if ( $xp >= $x && $yp <= $y ) {
        return "IN";
    }
    else {
        return "OUT";
    }
}
Update: I apologize for correcting this; in my earlier posting I oversimplified the problem.
Actually, there is one more query field (e.g. chromosome name).
Hence, might the DB/RB-tree/SQL approaches be infeasible in this case?
A few suggestions:
You could store your data in a database, such as MySQL or SQLite. You could then use a simple query such as:
"SELECT * FROM coordinates WHERE x < $xp AND y > $yp"
Provided you have indexes on x and y, this should be super fast (see the DBI sketch after this list).
You could also take a look at R-Trees. I used R-trees a few years ago to store tens of thousands of city coordinates, and I could find the closest city from a given point in a fraction of a second. In your example, you are storing 1D ranges but I'm pretty sure R-trees would work fine too. You might find R-tree implementations for Perl here. Or you can use RectanglesContainingDot, which seems to do what you need.
You could cache the coordinates in memory: each number looks like it would take 4 bytes to store, so this would lead to about 80 MB of memory usage if you have 10^7 pairs of numbers. That's what Firefox uses on my machine! Of course, if you do this, you need to have some sort of daemon running in order to avoid reloading the whole file every time you need to check coordinates.
You can mix solutions 2 & 3.
My preference is for solution 1: it has a good efficiency/complexity ratio.
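To make solution 1 concrete, here is a minimal sketch using DBI with SQLite, extended with the chromosome field mentioned in the update; the table and column names (ranges, chr, x, y) and the database file name are assumptions for illustration:
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=coords.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

# One-time load: create the table, index the lookup columns, import the file.
$dbh->do('CREATE TABLE IF NOT EXISTS ranges (chr TEXT, x INTEGER, y INTEGER)');
$dbh->do('CREATE INDEX IF NOT EXISTS idx_ranges ON ranges (chr, x, y)');

my $ins = $dbh->prepare('INSERT INTO ranges (chr, x, y) VALUES (?, ?, ?)');
$dbh->begin_work;    # a single transaction makes the bulk load much faster
open my $fh, '<', 'xycord.txt' or die "Cannot open xycord.txt: $!";
while (<$fh>) {
    my ($chr, $x, $y) = split;
    $ins->execute($chr, $x, $y);
}
close $fh;
$dbh->commit;

# Query: does any stored range on this chromosome contain ($xp, $yp)?
my ($chrid, $xp, $yp) = ('chr1', 10003488, 10003489);
my ($count) = $dbh->selectrow_array(
    'SELECT COUNT(*) FROM ranges WHERE chr = ? AND x <= ? AND y >= ?',
    undef, $chrid, $xp, $yp);
print $count ? "IN\n" : "OUT\n";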
In addition to Udi Pasmon's good advice, you could also convert your large file to a DBM and then tie the DBM file to a hash for easy look ups.
Convert the file:
#!/usr/bin/perl
use strict;
use warnings;
use DB_File;
my $dbfile = "coords.db";
tie my %coords, "DB_File", $dbfile
    or die "could not open $dbfile: $!";

while (<>) {
    my ($x, $y) = split;
    $coords{"$x-$y"} = 1;
}
Check to see if arguments are members of the file:
#!/usr/bin/perl
use strict;
use warnings;
use DB_File;
my ($x, $y) = @ARGV;
tie my %coords, "DB_File", "coords.db"
    or die "could not open coords.db: $!";
print "IN\n" if $coords{"$x-$y"};
Try a binary search rather than a sequential search. There are two apparent options for doing this:
Split the file into smaller files (xycord001.txt, xycord002.txt and so on). Now you can easily determine which file to search, and the search is rather faster. The big con here is that if you need to add data to a file it might get messy.
Do a binary search over the file: start at the middle, splitting the file into two logical parts. Decide in which part your coordinates might be, and look at the middle of that part. You'll rapidly (exponentially) reduce the size of the region you're searching, until you're down to a single line. Read more about seeking into files; there is a Perl example of binary-searching a file here.
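Rather than hand-rolling the seek logic, the core Search::Dict module implements this kind of seek-based binary search over a sorted file. A rough sketch, assuming the file is sorted lexically (which holds for the sample shown, where the coordinates all have the same number of digits) and sorted by start coordinate within each chromosome:
use strict;
use warnings;
use Search::Dict;

my ($chrid, $xp, $yp) = ('chr1', 10003488, 10003489);

open my $fh, '<', 'xycord.txt' or die "Cannot open xycord.txt: $!";
look $fh, $chrid, 0, 0;            # binary-search to the first line ge "chr1"
my $found = 0;
while (<$fh>) {
    my ($chr, $tx, $ty) = split;
    last if $chr gt $chrid;        # past this chromosome's block
    next if $chr lt $chrid;
    last if $tx > $xp;             # sorted by start: nothing later can contain $xp
    if ($xp >= $tx && $yp <= $ty) {
        $found = 1;
        last;
    }
}
close $fh;
print $found ? "IN\n" : "OUT\n";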
EDIT: Generally, using a database or DB file is preferred; however, a binary file search is the quick-and-dirty way, especially if the script should run on different files on different machines (thanks @MiniQuark, @Chas. Owens).
If both inputs, or at least the large one, are sorted, you can try a variation of a merge join between them.
If the lookup file (the smaller file) isn't too large, then the easiest approach is to read it in and put it in a hash keyed by name, with sorted arrays of start-end pairs as values.
Then go through each row in the large file and look up the array of lookup values that could match it by its name. Go through each pair in the lookup array: if the lookup start is less than the input pair's start, discard that value, as it can no longer match anything; if the lookup start is past the input end, break the loop, as no further lookup values can match; if the lookup end is before the input end, you have a match, and you can add the input and lookup to the list of matches.
My Perl is rusty, so no Perl example code, but I threw together a quick and dirty Python implementation. On my arbitrary randomly generated dataset matching 10M rows to 10k lookup rows for 14k matches took 22s, matching to 100k lookups for 145k matches took 24s and matching to 1M lookups for 1.47M matches took 35s.
If the smaller file is too big to fit in memory at once, it can be loaded in batches of keys as the keys are encountered in the input file.
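Since the original answer only sketched this in Python, here is a rough Perl version of the same idea; the lookup file name is a placeholder, "match" is treated here as interval overlap, and the progressive discard described above is simplified to an early exit in the inner loop:
use strict;
use warnings;

# 1. Load the smaller lookup file into a hash of [start, end] pairs, sorted by start.
my %lookup;
open my $small, '<', 'lookups.txt' or die "Cannot open lookups.txt: $!";
while (<$small>) {
    my ($name, $start, $end) = split;
    push @{ $lookup{$name} }, [ $start, $end ];
}
close $small;
@$_ = sort { $a->[0] <=> $b->[0] } @$_ for values %lookup;

# 2. Stream the large file and report overlapping pairs.
open my $big, '<', 'xycord.txt' or die "Cannot open xycord.txt: $!";
while (<$big>) {
    my ($name, $start, $end) = split;
    for my $pair (@{ $lookup{$name} || [] }) {
        last if $pair->[0] > $end;      # lookup starts after this row ends: stop
        next if $pair->[1] < $start;    # lookup ends before this row starts: skip
        print "match: $name $start-$end overlaps $pair->[0]-$pair->[1]\n";
    }
}
close $big;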
Restating your question: do you want to print all ranges in the file that contain the (x, y) pair and also have the same id? If that's the case, you don't need to parse the file and store it in memory.
while (<DATA>) {
    my ($chr, $tx, $ty) = split /\s+/;
    print "IN : $chr, $tx, $ty\n"
        if $chr eq $chrid
        && $point_to_check_x >= $tx
        && $point_to_check_y <= $ty;
}
OK, so let me clarify the problem, based on my understanding of your code. You have a file with a very large number of entries in it. Each entry includes a label "chr1", "chr2", etc. and two numbers, the first of which is less than the second. You then have a query which comprises a label and a number, and you wish to know whether there is a record in the large file which has the same label as the query, and has the two values such that one is less than the query number and the other is greater than it. Essentially, whether the number in the query is within the interval specified by the two numbers in the record.
Provided my understanding is correct, the first thing to notice is that any of the records which do not have the same label as the query have no part to play in the problem. So you can ignore them. Your program reads them all in, puts them in a hash and then doesn't look at most of the data. Sure, if you have several queries to do, you're going to need to keep data for each of the labels you're interested in, but you can throw the rest away as early as possible. This will keep your memory requirement down.
I would be tempted to take this a little further. Is there a possibility of breaking the huge file up into smaller files? It would seem to be a good idea to break it into files which have only certain labels in them. For example, you might have one file per label, or one file for all data with labels beginning with "a", or so on. This way you can open only those files which you know you're going to be interested in, and save yourself a lot of time and effort.
These changes alone may make enough difference for you. If you still need more performance, I would start thinking about how you are storing the records you're going to need. Storing them ordered on the lower (or higher) of the two numbers should cut down a bit the time taken to find what you're after, particularly if you store them in a binary search tree or a trie.
That should give you enough to work on.
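As a quick illustration of the per-label split suggested above (not part of the original answer), here is a small sketch that writes one file per chromosome; the output file names (chr1.txt and so on) are an assumption:
use strict;
use warnings;

my %fh;    # label => output filehandle
open my $in, '<', 'xycord.txt' or die "Cannot open xycord.txt: $!";
while (<$in>) {
    my ($chr) = split;
    unless ($fh{$chr}) {
        open $fh{$chr}, '>', "$chr.txt" or die "Cannot open $chr.txt: $!";
    }
    print { $fh{$chr} } $_;    # copy the whole line into its label's file
}
close $in;
close $_ for values %fh;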
PDL for genomic data processing
We processed a lot of files in the same format as you show in your question and found that PDL (documentation) is a very good tool for doing this. You need some time to learn it --- but it's definitely worth the effort (if you do genomics data processing): PDL can process huge files a few thousand times faster than MySQL.
Here are some hints where to go:
First of all, PDL is a language somewhat like Matlab --- but fully integrated with Perl. Read the documentation, do some examples. Consult a mathematician for advice on which features to use for what purpose.
PDL stores its data in plain C arrays. Learn about Inline::C and access this data directly from C if PDL doesn't do the job for you. To me, PDL and Inline::C seem like a perfect match: PDL for high-level operations, Inline::C for anything missing. Still, PDL is as fast as your best C because it does its work in C.
Use PDL::IO::FastRaw --- store and access data in files on disk. I often write the files "by hand" (see below) and read them as memory-mapped files (using PDL::IO::FastRaw::mapfraw, often with the flag ReadOnly=>1). This is the most efficient way to read data from disk on Linux.
The format of the data files is trivial: just a sequence of C numbers. You can easily write such files in Perl with 'print FileHandle pack("i*", @data);'. Check 'perldoc -f pack'.
In my experience, just reading the input files line by line and printing them out in binary format is the slowest part of processing: Once you have them ready for PDL to 'mmap' them, processing is much faster.
I hope this advice helps --- even though not much code is given.
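To make the advice above a little more concrete, here is a small sketch (not part of the original answer) of the writefraw/mapfraw workflow with toy data standing in for the start/end columns; the file names are arbitrary:
use strict;
use warnings;
use PDL;
use PDL::IO::FastRaw;

# Toy data standing in for the start/end columns of the real file.
my $start = pdl(10003486, 10003507, 10003486);
my $end   = pdl(10043713, 10043106, 10043713);

writefraw($start, 'start.raw');   # writes start.raw plus a small header file
writefraw($end,   'end.raw');

# Later (possibly in another process): memory-map the data read-only.
my $s = mapfraw('start.raw', { ReadOnly => 1 });
my $e = mapfraw('end.raw',   { ReadOnly => 1 });

# Vectorised containment test: which stored ranges contain the query range?
my ($qx, $qy) = (10003488, 10003489);
my $in = which(($s <= $qx) & ($e >= $qy));
print "matching row indices: $in\n";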
I'm learning Perl and building an application that gets a random line from a file using this code:
use List::Util qw(shuffle);    # needed for shuffle()

open(my $random_name, "<", "out.txt") or die "Cannot open out.txt: $!";
my @array = shuffle(<$random_name>);
chomp @array;
close($random_name) or die "Error when trying to close $random_name: $!";
print shift @array;
But now I want to delete this random name from the file. How can I do this?
shift already deletes a name from the array.
So does pop (one removes from the beginning, the other from the end). I would suggest using pop, as it may be more efficient, and since the element is random you don't care which one you use.
Or do you need to delete it from a file?
If that's the case, you need to:
A. Get a count of the names inside the file (if it's small, read it all into memory using File::Slurp; if it's large, either read it line by line and count, or simply execute the wc -l $filename command via backticks).
B. Generate a random number from 1 to the number of lines (say, $random_line_number).
C. Read the file line by line. For every line read, WRITE it to another temp file (use File::Temp to generate temp files), except do NOT write the line numbered $random_line_number to the temp file.
D. Close the temp file and move it into place, replacing your original file (see the sketch below).
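A rough sketch of steps A-D using File::Temp and File::Copy, with out.txt being the file from the question (the count is done with a plain read here rather than wc -l):
use strict;
use warnings;
use File::Temp qw(tempfile);
use File::Copy qw(move);

my $file = 'out.txt';

# A. count the lines
open my $in, '<', $file or die "Cannot open $file: $!";
my $count = 0;
$count++ while <$in>;
close $in;

# B. pick a random line number between 1 and the number of lines
my $random_line_number = 1 + int rand $count;

# C. copy every line except that one to a temp file
open $in, '<', $file or die "Cannot reopen $file: $!";
my ($tmp, $tmpname) = tempfile(UNLINK => 0);
while (<$in>) {
    print {$tmp} $_ unless $. == $random_line_number;
}
close $in;
close $tmp;

# D. move the temp file over the original
move($tmpname, $file) or die "move failed: $!";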
If the list contains filenames and you need to delete the file itself (the random file), use the unlink() function. Don't forget to check the return code from unlink() and, as with any IO operation, print an error message containing $!, which will be the text of the system error on failure.
Done.
D.
When you say "delete this … from the list" do you mean delete it from the file? If you simply mean remove it from @array, then you've already done that by using shift. If you want it removed from the file, and the order doesn't matter, simply write the remaining names in @array back into the file. If the file order does matter, you're going to have to do something slightly more complicated, such as reopening the file, reading the items in order, except for the one you don't want, and then writing them all back out again. Either that, or take more notice of the order when you read the file.
If you need to delete a line from a file (it's not entirely clear from your question), one of the simplest and most efficient ways is to use Tie::File to manipulate the file as if it were an array. Otherwise, perlfaq5 explains how to do it the long way.
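A minimal sketch of the Tie::File approach for removing a random line in place, with out.txt being the file from the question:
use strict;
use warnings;
use Tie::File;

tie my @lines, 'Tie::File', 'out.txt' or die "Cannot tie out.txt: $!";
my $index = int rand @lines;            # pick a random line number
my $name  = splice @lines, $index, 1;   # remove that line from the file itself
untie @lines;

print "removed: $name\n";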