Is there added value in using Text::CSV for writing output?

If I am writing tabular output to a file as CSV, what advantage does loading an extra module
Text::CSV
and converting my data to an object get me over a basic loop and string manipulation? I have seen a couple of answers that suggest doing this: How can I get column names and row data in order with DBI in Perl and How do I create a CSV file using Perl.
Loading up an entire module seems like overkill and significant overhead for something I can write in four lines of Perl (ignoring data retrieval, etc.):
my $rptText = join(',', map { qq/"$_"/ } @head) . "\n";
foreach my $person ( @$data ) {
    $rptText .= join(',', map { qq/"$person->{$_}"/ } @head) . "\n";
}
So what does loading
Text::CSV
get me over the above code?

For simple, trivial data (which, admittedly, is quite common) there's no advantage. You can join, print, and go on your merry way. For a quick throw-away script that's likely all you need (and it's faster, too, if you'd otherwise need to consult the Text::CSV documentation).
The advantages start when your data is non-trivial.
Does it contain commas?
Does it contain double-quotes?
Does it contain newlines?
Does it contain non-ASCII characters?
Is there a distinction between undef (no value) and '' (empty string)?
Might you need to use a different separator (or let the user specify it)?
For production code, the standard pro-module advice applies:
Reuse instead of reinvent (particularly if you'll be generating more than one CSV file)
It keeps your code cleaner (more consistent and with better separation of concerns)
CPAN modules are often faster (better optimized), more robust (edge-case handling), and have a cleaner API than in-house solutions.
The overhead of loading a module is almost certainly a non-issue.
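By way of comparison, here is a minimal sketch of the Text::CSV route, reusing the @head and @$data structures from the question (the output file name is made up):
use Text::CSV;

# binary => 1 allows embedded newlines and other awkward bytes in fields;
# eol appends the line terminator on every print().
my $csv = Text::CSV->new({ binary => 1, eol => "\n" })
    or die Text::CSV->error_diag;

open my $fh, '>', 'report.csv' or die "report.csv: $!";
$csv->print($fh, \@head);                          # header row, quoted as needed
$csv->print($fh, [ @{$_}{@head} ]) for @$data;     # one row per person hashref
close $fh or die "report.csv: $!";
Quoting, escaping of embedded quotes and commas, and the undef-vs-empty-string distinction then become Text::CSV's problem rather than yours.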

Your CSV won't be valid if $person->{col1} contains ". Also, all columns will be wrapped in double quotes, which might not be desired (e.g. numbers). You'll also get "" for undefined values, which might not work if you plan to load the CSV into a database that makes a distinction between null and ''.

Related

How can I convert CGI input to UTF-8 without Perl's Encode module?

Through this forum, I have learned that it is not a good idea to use the following for converting CGI input (from either an escape()d Ajax call or a normal HTML form post) to UTF-8:
read (STDIN, $_, $ENV{CONTENT_LENGTH});
s{%([a-fA-F0-9]{2})}{ pack ('C', hex ($1)) }eg;
utf8::decode $_;
A safer way (which for example does not allow bogus characters through) is to do the following:
use Encode qw (decode);
read (STDIN, $_, $ENV{CONTENT_LENGTH});
s{%([a-fA-F0-9]{2})}{ pack ('C', hex ($1)) }eg;
decode ('UTF-8', $_, Encode::FB_CROAK);
I would, however, very much like to avoid using any modules (including XSLoader, Exporter, and whatever else they bring with them). The function is for a high-volume mod_perl driven website and I think both performance and maintainability will be better without modules (especially since the current code does not use any).
I guess one approach would be to examine the Encode module and strip out the functions and constants used for the “decode ('UTF-8', $_, Encode::FB_CROAK)” call. I am not sufficiently familiar with Unicode and Perl modules to do this. Maybe somebody else is capable of doing this or knows a similar, safe “native” way of doing the UTF-8 conversion?
UPDATE:
I prefer keeping things non-modular, because then the only black-box is Perl's own compiler (unless of course you dig down into the module libs).
Sometimes you see large modules being replaced with a few specific lines of code. For example, instead of the CGI.pm module (which people are also in love with), one can use the following for parsing AJAX posts:
my %Input;
if ($ENV{CONTENT_LENGTH}) {
    read (STDIN, $_, $ENV{CONTENT_LENGTH});
    foreach (split (/&/)) {
        tr/+/ /; s/%([a-fA-F0-9]{2})/pack("C", hex($1))/eg;
        if (m{^(\w+)=\s*(.*?)\s*$}s) { $Input{$1} = $2; }
        else { die ("bad input ($_)"); }
    }
}
In a similar way, it would be great if one could extract or replicate Encode's UTF-8 decode function.
Don't pre-optimize. Do it the conventional way first, then profile and benchmark later to see where you need to optimize. People usually waste all their time somewhere else, so starting off blindfolded and handcuffed doesn't give you any benefit.
Don't be afraid of modules. The point of mod_perl is to load up everything as few times as possible so the startup time and module loading time are insignificant.
Don't use escape() to create your posted data. This isn't compatible with URL-encoding, it's a mutant JavaScript oddity which should normally never be used. One of the defects is that it will encode non-ASCII characters to non-standard %uNNNN sequences based on UTF-16 code units, instead of standard URL-encoded UTF-8. Your current code won't be able to handle that.
You should typically use encodeURIComponent() instead.
If you must URL-decode posted input yourself rather than using a form library (and this does mean you won't be able to handle multipart/form-data), you will need to convert + symbols to spaces before replacing %-sequences. This replacement is standard in form submissions (though not elsewhere in URL-encoded data).
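For example, a minimal sketch that combines the + handling with the Encode-based decode already shown in the question (FB_CROAK dies on malformed input, so wrap it in eval if you want to recover instead):
use Encode qw(decode FB_CROAK);

read(STDIN, my $raw, $ENV{CONTENT_LENGTH});
my %Input;
for my $pair (split /&/, $raw) {
    my ($name, $value) = split /=/, $pair, 2;
    defined $value or die "bad input ($pair)";
    for ($name, $value) {
        tr/+/ /;                                    # form encoding: '+' means space
        s/%([a-fA-F0-9]{2})/pack 'C', hex $1/eg;    # undo %XX escapes (raw bytes)
        $_ = decode('UTF-8', $_, FB_CROAK);         # dies on invalid UTF-8
    }
    $Input{$name} = $value;
}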
To ensure input is valid UTF-8 if you really don't want to use a library, try this regex. It also excludes some control characters (you may want to tweak it to exclude more).

How do I properly format plain text data for a simple Perl dictionary app?

I have a very simple dictionary application that does search and display. It's built with the Win32::GUI module. I put all the plain text data needed for the dictionary under the __DATA__ section. The script itself is very small, but with everything under the __DATA__ section its size reaches 30 MB. In order to share the work with my friends, I then packed the script into a stand-alone executable using the pp utility of the PAR::Packer module with the highest compression level (9), and now I have a single-file dictionary app of about 17 MB.
But although I'm very comfortable with the idea of a single-file script, placing such a huge amount of text data under the script's DATA section does not feel right. For one thing, when I try opening the script in Padre (Notepad++ is okay), I get an error like:
Can't open my script as the script is over the arbitrary file size limit which is currently 500000.
My questions:
Does it bring me any extra benefits, apart from eliminating Padre's file-opening issue, if I move everything under the DATA section to a separate text file?
If I do so, what should I do to reduce the size of the separate file? Zip it and uncompress it while doing search and display?
How do people normally format the text data needed for a dictionary application?
Any comments, ideas or suggestions? Thanks, as always :)
If I do so, what should I do to reduce the size of the separate file? Zip it and uncompress it while doing search and display?
Well, it depends on WHY you want to reduce the size. If it is to minimize disk space usage (a rather odd goal most of the time these days), then zip/unzip is the way to go.
However, if the goal is to minimize memory usage, then a better approach is to split up the dictionary data into smaller chunks (for example, indexed by first letter) and only load the needed chunks.
How do people normally format the text data needed for a dictionary application?
IMHO the usual approach is what you get as the logical conclusion of the approach mentioned above (partitioned and indexed data): using a back-end database, which allows you to retrieve only the data that is actually needed.
In your case, probably something simple like SQLite or Berkeley DB/DBM files should be OK.
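For instance, a minimal SQLite-backed lookup could look like the sketch below (the database file, the entries table, and its word/definition columns are all assumptions for illustration):
use DBI;

# Assumes a table created once with:
#   CREATE TABLE entries (word TEXT PRIMARY KEY, definition TEXT);
my $dbh = DBI->connect('dbi:SQLite:dbname=dictionary.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

my $sth = $dbh->prepare('SELECT definition FROM entries WHERE word = ?');

sub lookup {
    my ($word) = @_;
    my ($definition) = $dbh->selectrow_array($sth, undef, $word);
    return $definition;    # undef if the word is not in the dictionary
}
Only the row you ask for is pulled off disk, so memory usage no longer depends on the size of the dictionary.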
Does it bring me any extra benefits, apart from eliminating Padre's file-opening issue, if I move everything under the DATA section to a separate text file?
This depends somewhat on your usage... if it's a never-changing script used by 3 people, there may be no tangible benefits.
In general, it will make maintenance much easier (you can change the dictionary and the code logic independently; think virus-definitions file vs. antivirus executable for a real-world example).
It will also decrease the process memory consumption if you go with the approaches I mentioned above.
Since you are using PAR::Packer already, why not move it to a separate file or module and include it in the PAR file?
The easy way (no extra command-line options to pp; it will see the use statement and do the right thing):
words.pl
#!/usr/bin/perl
use strict;
use warnings;
use Words;

for my $i (1 .. 2) {
    print "Run $i\n";
    while (defined(my $word = Words->next_word)) {
        print "\t$word\n";
    }
}
Words.pm
package Words;

use strict;
use warnings;

my $start = tell DATA
    or die "could not find current position: $!";

sub next_word {
    if (eof DATA) {
        seek DATA, $start, 0
            or die "could not seek: $!";
        return undef;
    }
    chomp(my $word = scalar <DATA>);
    return $word;
}

1;
__DATA__
a
b
c

How can I count the respective lines for each sub in my Perl code?

I am refactoring a rather large body of code, and a sort of esoteric question came to me while pondering how to go on with this. What this code needs, in large part, is a shortening of subs.
As such, it would be very advantageous to point some sort of statistics collector at the directory, which would go through all the .pm, .cgi and .pl files, find all subs (I'm fine if it only gets the named ones), and give me a table of all of them along with their line counts.
I gave PPI a cursory look but could not find anything directly relevant, only some tools that might be appropriate but are rather complex to use.
Are there any easier modules that do something like this?
Failing that, how would you do this?
Edit:
Played around with PPI a bit and created a script that collects relevant statistics on a code base: http://gist.github.com/514512
my $document = PPI::Document->new($file);

# Strip out comments and documentation
$document->prune('PPI::Token::Pod');
$document->prune('PPI::Token::Comment');

# Find all the named subroutines
my $sub_nodes = $document->find(
    sub { $_[1]->isa('PPI::Statement::Sub') and $_[1]->name } );

print map { sprintf "%s %s\n", $_->name, scalar split /\n/, $_->content } @$sub_nodes;
I'm dubious that simply identifying long functions is the best way to identify what needs to be refactored. Instead, I'd run the code through perlcritic at increasing levels of harshness and follow the suggestions.

How can I extract fields from a CSV file in Perl?

I want to extract particular fields from a CSV file (830k records) and store them in a hash. Is there any fast and easy way to do this in Perl without using any external modules?
How can I achieve that?
Use Text::CSV_XS. It's fast, moderately flexible, and extremely well-tested. The answer to many of these questions is something on CPAN. Why spend the time to make something not as good as what a lot of people have already perfected and tested?
If you don't want to use external modules, which is a silly objection, look at the code in Text::CSV_XS and do that. I'm constantly surprised that people think that even though they think they can't use a module they won't use a known and tested solution as example code for the same task.
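As a minimal sketch of that approach (I'm assuming the key you want is the first column and the value the second; adjust the indexes and file name to your data):
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });

open my $fh, '<', 'data.csv' or die "data.csv: $!";
my %lookup;
while (my $row = $csv->getline($fh)) {
    $lookup{ $row->[0] } = $row->[1];   # key column => value column
}
close $fh;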
Assuming normal CSV (i.e., no embedded commas), to get the 2nd field, for example:
$ perl -F"," -lane 'print $F[1];' file
See also this code fragment taken from the Perl Cookbook, which is a great book in itself for Perl solutions to common problems.
Using the split command would do the job, I guess (assuming columns are separated by commas and no commas are present in the fields):
while (my $line = <INPUTFILE>) {
    my @columns = split('<field_separator>', $line);   # field separator is ","
}
and then from the elements of the @columns array you can construct whatever hash you like.

Fast algorithm to check membership of a pair of numbers in a large list of (x,y) coordinates in Perl

I have a list of sorted coordinates (let's call it xycord.txt) that looks like this:
chr1 10003486 10043713
chr1 10003507 10043106
chr2 10003486 10043713
chr2 10003507 10043162
chr2 10003532 10042759
In reality this file is very, very large, with 10^7 lines.
What I want to do is, given another two-point coordinate, check whether it falls in between any of the coordinates in the xycord.txt file.
The current approach I have is super slow, because there are also many other two-point coordinates to check against this large xycord.txt file.
Is there a fast way to do it?
#!/usr/bin/perl -w
my $point_to_check_x = $ARGV[0] || '10003488';
my $point_to_check_y = $ARGV[1] || '10003489';
my $chrid            = $ARGV[2] || "chr1";
my %allxycordwithchr;

# skip file opening construct
while (<XYCORD_FILE>) {
    my ($chr, $tx, $ty) = split(/\s+/, $_);
    push @{ $allxycordwithchr{$chr} }, $tx . "-" . $ty;
}

my @chosenchr_cord = @{ $allxycordwithchr{$chrid} };

for my $chro_cord (@chosenchr_cord) {
    my ($repox, $repoy) = split("-", $chro_cord);
    my $stat = is_in_xycoordsfile($repox, $repoy, $point_to_check_x, $point_to_check_y);
    if ($stat eq "IN") {
        print "IN\n";
    }
}

sub is_in_xycoordsfile {
    my ($x, $y, $xp, $yp) = @_;
    if ($xp >= $x && $yp <= $y) {
        return "IN";
    }
    else {
        return "OUT";
    }
}
Update: I apologize for this correction. In my earlier posting I oversimplified
the problem.
Actually, there is one more query field (e.g. chromosome name).
Hence the DB/R-tree/SQL approaches may be infeasible in this matter?
A few suggestions:
You could store your data in a database, such as MySQL or SQLite. You could then use a simple request such as:
"SELECT * FROM coordinates WHERE x<"+xp+" AND y>"+yp
Provided you have indexes on x and y, this should be super fast.
You could also take a look at R-Trees. I used R-trees a few years ago to store tens of thousands of city coordinates, and I could find the closest city from a given point in a fraction of a second. In your example, you are storing 1D ranges but I'm pretty sure R-trees would work fine too. You might find R-tree implementations for Perl here. Or you can use RectanglesContainingDot, which seems to do what you need.
You could cache the coordinates in memory: each number looks like it would take 4 bytes to store, so this would lead to about 80 MB of memory usage if you have 10^7 pairs of numbers. That's what Firefox uses on my machine! Of course, if you do this, you need to have some sort of daemon running in order to avoid reloading the whole file every time you need to check coordinates.
You can mix solutions 2 & 3.
My preference is for solution 1: it has a good efficiency/complexity ratio.
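A sketch of option 1 with DBI placeholders, reusing the query variables from the question and including the chromosome column mentioned in the update (the table and column names are assumptions):
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=coords.db', '', '', { RaiseError => 1 });

# Assumes the data was loaded once with something like:
#   CREATE TABLE coordinates (chr TEXT, x INTEGER, y INTEGER);
#   CREATE INDEX idx_chr_x_y ON coordinates (chr, x, y);
my $sth = $dbh->prepare(
    'SELECT chr, x, y FROM coordinates WHERE chr = ? AND x <= ? AND y >= ?');

$sth->execute($chrid, $point_to_check_x, $point_to_check_y);
while (my ($chr, $x, $y) = $sth->fetchrow_array) {
    print "IN: $chr $x $y\n";
}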
In addition to Udi Pasmon's good advice, you could also convert your large file to a DBM and then tie the DBM file to a hash for easy lookups.
Convert the file:
#!/usr/bin/perl
use strict;
use warnings;
use DB_File;

my $dbfile = "coords.db";
tie my %coords, "DB_File", $dbfile
    or die "could not open $dbfile: $!";

while (<>) {
    my ($x, $y) = split;
    $coords{"$x-$y"} = 1;
}
Check to see if arguments are members of the file:
#!/usr/bin/perl
use strict;
use warnings;
use DB_File;

my ($x, $y) = @ARGV;

tie my %coords, "DB_File", "coords.db"
    or die "could not open coords.db: $!";

print "IN\n" if $coords{"$x-$y"};
Try a binary search rather than a sequential search. There are two apparent options to do this:
Split the file into smaller files (xycord001.txt, xycord002.txt and so on). Now you can easily determine which file to search, and the search itself is much faster. The big con here is that if you need to add data to a file it might get messy.
Do a binary search over the file: start at the middle, splitting the file into two logical parts. Decide in which part your coordinates might be, and look at the middle of that part. You'll rapidly (exponentially) reduce the size of the region you're searching, until you're searching only one line. Read more about seeking into files; there is a Perl example of binary-searching a file here.
EDIT: Generally, using a database or DB file is preferred; however, a binary file search is the quick-and-dirty way, especially if the script should run on different files on different machines (thanks @MiniQuark, @Chas. Owens).
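A rough sketch of the seek-based binary search over the file itself (this assumes the file is sorted by chromosome and then numerically by start position, and that the comparison below matches that sort order; it returns the byte offset of the first candidate line, from which you would scan forward):
use strict;
use warnings;

sub lower_bound_offset {
    my ($fh, $size, $want_chr, $want_start) = @_;
    my ($lo, $hi) = (0, $size);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        seek $fh, $mid, 0 or die "seek: $!";
        <$fh> if $mid > 0;             # discard the (possibly partial) line we landed in
        my $pos  = tell $fh;
        my $line = <$fh>;
        if (!defined $line) {          # hit end of file: the answer lies to the left
            $hi = $mid;
            next;
        }
        my ($chr, $start) = split /\s+/, $line;
        if ($chr lt $want_chr or ($chr eq $want_chr and $start < $want_start)) {
            $lo = $pos;                # this line is still before the query
        } else {
            $hi = $mid;                # this line is at or after the query
        }
    }
    seek $fh, $lo, 0 or die "seek: $!";
    <$fh> if $lo > 0;                  # re-align to the start of a full line
    return tell $fh;                   # offset of the first line at or after the query
}

open my $fh, '<', 'xycord.txt' or die "xycord.txt: $!";
my $offset = lower_bound_offset($fh, -s 'xycord.txt', 'chr1', 10003488);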
If both inputs, or at least the large one, are sorted, you can try a variation of a merge join between them.
If the lookup file (the smaller file) isn't too large, then the easiest approach is to just read it in and put it in a hash keyed by name, with sorted arrays of start-end pairs as values.
Then go through each row in the large file and look up the array of lookup values that could match it by name. Go through each pair in the lookup array: if the lookup start is less than the input pair's start, discard that value, as it can no longer match anything. If the lookup start is past the input end, break the loop, as no further lookup values can match. If the lookup end is before the input end, you have a match and can add the input and lookup to the list of matches.
My Perl is rusty, so no Perl example code, but I threw together a quick and dirty Python implementation. On my arbitrary, randomly generated dataset, matching 10M rows to 10k lookup rows (14k matches) took 22 s, matching to 100k lookups (145k matches) took 24 s, and matching to 1M lookups (1.47M matches) took 35 s.
If the smaller file is too big to fit in memory at once, it can be loaded in batches of keys as the keys are encountered in the input file.
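A small Perl sketch of the first step (the file name lookup.txt is an assumption; columns are name, start, end as in the question), after which the large file is streamed row by row and compared only against the sorted pairs stored under its own name, following the rules above:
use strict;
use warnings;

# Build: name => sorted array of [start, end] pairs from the smaller file.
my %lookup;
open my $lk, '<', 'lookup.txt' or die "lookup.txt: $!";
while (<$lk>) {
    my ($name, $start, $end) = split;
    push @{ $lookup{$name} }, [ $start, $end ];
}
close $lk;

# Sort each name's pairs by start so the per-row scan can stop early.
@{$_} = sort { $a->[0] <=> $b->[0] } @{$_} for values %lookup;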
Restating your question: do you want to print all ranges in the file that contain the (x, y) pair and also have the same id? If that's the case, you don't need to parse the file and store it in memory.
while (<DATA>) {
    my ($chr, $tx, $ty) = split /\s+/;
    print "IN : $chr, $tx, $ty\n"
        if $chr eq $chrid
        && $point_to_check_x >= $tx
        && $point_to_check_y <= $ty;
}
OK, so let me clarify the problem, based on my understanding of your code. You have a file with a very large number of entries in it. Each entry includes a label "chr1", "chr2", etc. and two numbers, the first of which is less than the second. You then have a query which comprises a label and a number, and you wish to know whether there is a record in the large file which has the same label as the query, and has the two values such that one is less than the query number and the other is greater than it. Essentially, whether the number in the query is within the interval specified by the two numbers in the record.
Provided my understanding is correct, the first thing to notice is that any of the records which do not have the same label as the query have no part to play in the problem. So you can ignore them. Your program reads them all in, puts them in a hash and then doesn't look at most of the data. Sure, if you have several queries to do, you're going to need to keep data for each of the labels you're interested in, but you can throw the rest away as early as possible. This will keep your memory requirement down.
I would be tempted to take this a little further. Is there a possibility of breaking the huge file up into smaller files? It would seem to be a good idea to break it into files which have only certain labels in them. For example, you might have one file per label, or one file for all data with labels beginning with "a", or so on. This way you can open only those files which you know you're going to be interested in, and save yourself a lot of time and effort.
These changes alone may make enough difference for you. If you still need more performance, I would start thinking about how you are storing the records you're going to need. Storing them ordered on the lower (or higher) of the two numbers should cut down a bit the time taken to find what you're after, particularly if you store them in a binary search tree or a trie.
That should give you enough to work on.
PDL for genomic data processing
We processed a lot of files in the same format as you show in your question and found that PDL (documentation) is a very good tool for doing this. You need some time to learn it --- but it's definitely worth the effort (if you do genomics data processing): PDL can process huge files a few thousand times faster than MySQL.
Here are some hints where to go:
First of all, PDL is a language somewhat like Matlab --- but fully integrated with Perl. Read the documentation, do some examples. Consult a mathematician for advice on which features to use for what purpose.
PDL stores its data in plain C arrays. Learn about Inline::C and access this data directly from C if PDL doesn't do the job for you. To me, PDL and Inline::C seem like a perfect match: PDL for high-level operations, Inline::C for anything missing. Still, PDL is as fast as your best C because it does its work in C.
Use PDL::IO::FastRaw --- store and access data in files on disk. I often write the files "by hand" (see below) and read them as memory-mapped files (using PDL::IO::FastRaw::mapfraw, often with the flag ReadOnly=>1). This is the most efficient way to read data from disk on Linux.
The format of the data files is trivial: just a sequence of C numbers. You can easily write such files in Perl with 'print FileHandle pack "i*", @data;'. Check 'perldoc -f pack'.
In my experience, just reading the input files line by line and printing them out in binary format is the slowest part of processing: Once you have them ready for PDL to 'mmap' them, processing is much faster.
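As a rough sketch of that write-by-hand / mmap-back workflow (the Dims, Datatype and ReadOnly options are what PDL::IO::FastRaw documents for header-less files; treat the details as assumptions to verify against your PDL version):
use PDL;
use PDL::IO::FastRaw;

# Write one numeric column "by hand" as a flat array of native 32-bit ints.
my @starts = (10003486, 10003507, 10003532);
open my $out, '>:raw', 'starts.dat' or die "starts.dat: $!";
print {$out} pack 'l*', @starts;     # matches PDL's long type on typical platforms
close $out;

# Memory-map the file read-only; Dims/Datatype stand in for the usual .hdr file.
my $starts = mapfraw('starts.dat',
                     { Dims => [ scalar @starts ], ReadOnly => 1, Datatype => long });

# Vectorised query: indices of intervals starting at or before a given point.
print which($starts <= 10003507), "\n";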
I hope this advice helps --- even though not much code is given.