I have to process text files 10-20GB in size of the format:
field1 field2 field3 field4 field5
I would like to parse the data from each line of field2 into one of several files; the file this gets pushed into is determined line-by-line by the value in field4. There are 25 different possible values in field4 and hence 25 different files the data can get parsed into.
I have tried using Perl (slow) and awk (faster but still slow) - does anyone have any suggestions or pointers toward alternative approaches?
FYI here is the awk code I was trying to use; note I had to revert to going through the large file 25 times because I wasn't able to keep 25 files open at once in awk:
chromosomes=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25)
for chr in ${chromosomes[@]}
do
awk < my_in_file_here -v pat="$chr" '{if ($4 == pat) for (i = $2; i <= $2+52; i++) print i}' >> my_out_file_"$chr".query
done
With Perl, open the files during initialization and then match the output for each line to the appropriate file:
#! /usr/bin/perl
use warnings;
use strict;

my @values = (1..25);

my %fh;
foreach my $chr (@values) {
    my $path = "my_out_file_$chr.query";
    open my $fh, ">", $path
        or die "$0: open $path: $!";
    $fh{$chr} = $fh;
}

while (<>) {
    chomp;
    my ($a, $b, $c, $d, $e) = split " ", $_, 5;
    print { $fh{$d} } "$_\n"
        for $b .. $b+52;
}
Here is a solution in Python. I have tested it on a small fake file I made up. I think this will be acceptably fast for even a large file, because most of the work will be done by C code inside of Python. And I think this is a pleasant and easy to understand program; I prefer Python to Perl.
import sys

s_usage = """\
Usage: csplit <filename>
Splits input file by columns, writes column 2 to file based on chromosome from column 4."""

if len(sys.argv) != 2 or sys.argv[1] in ("-h", "--help", "/?"):
    sys.stderr.write(s_usage + "\n")
    sys.exit(1)

# replace these with the actual patterns, of course
lst_pat = [
    'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
    'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
    'u', 'v', 'w', 'x', 'y'
]

d = {}
for s_pat in lst_pat:
    # build a dictionary mapping each pattern to an open output file
    d[s_pat] = open("my_out_file_" + s_pat, "wt")

if False:
    # if the patterns are unsuitable for filenames (contain '*', '?', etc.) use this:
    for i, s_pat in enumerate(lst_pat):
        # build a dictionary mapping each pattern to an output file
        d[s_pat] = open("my_out_file_" + str(i), "wt")

for line in open(sys.argv[1]):
    # split a line into words, and unpack into variables.
    # use '_' for a variable name to indicate data we don't care about.
    # s_data is the data we want, and s_pat is the pattern controlling the output
    _, s_data, _, s_pat, _ = line.split()
    # use s_pat to get to the file handle of the appropriate output file, and write data.
    d[s_pat].write(s_data + "\n")

# close all the output file handles.
for key in d:
    d[key].close()
EDIT: Here's a little more information about this program, since it seems you will be using it.
All of the error handling is implicit. If an error happens, Python will "raise an exception" which will terminate processing. For example, if one of the files fails to open, this program will stop executing and Python will print a backtrace showing which line of code caused the exception. I could have wrapped the critical parts with a "try/except" block, to catch errors, but for a program this simple, I didn't see any point.
It's subtle, but there is a check to see if there are exactly five words on each line of the input file. When this code unpacks a line, it does so into five variables. (The variable name "_" is a legal variable name, but there is a convention in the Python community to use it for variables you don't actually care about.) Python will raise an exception if there are not exactly five words on the input line to unpack into the five variables. If your input file can sometimes have four words on a line, or six or more, you could modify the program to not raise an exception; change the main loop to this:
for line in open(sys.argv[1]):
    lst = line.split()
    d[lst[3]].write(lst[1] + "\n")
This splits the line into words, and then just assigns the whole list of words into a single variable, lst. So that line of code doesn't care how many words are on the line. Then the next line indexes into the list to get the values out. Since Python indexes a list using 0 to start, the second word is lst[1] and the fourth word is lst[3]. As long as there are at least four words in the list, that line of code won't raise an exception either.
And of course, if the fourth word on the line is not in the dictionary of file handles, Python will raise an exception for that too. That would stop processing. Here is some example code for how to use a "try/except" block to handle this:
for line in open(sys.argv[1]):
    lst = line.split()
    try:
        d[lst[3]].write(lst[1] + "\n")
    except KeyError:
        sys.stderr.write("Warning: illegal line seen: " + line)
Good luck with your project.
EDIT: @larelogio pointed out that this code doesn't match the AWK code. The AWK code has an extra for loop that I do not understand. Here is Python code to do the same thing:
for line in open(sys.argv[1]):
    lst = line.split()
    n = int(lst[1])
    for i in range(n, n+53):
        d[lst[3]].write(str(i) + "\n")
And here is another way to do it. This might be a little faster, but I have not tested it so I am not certain.
for line in open(sys.argv[1]):
    lst = line.split()
    n = int(lst[1])
    s = "\n".join(str(i) for i in range(n, n+53))
    d[lst[3]].write(s + "\n")
This builds a single string with all the numbers to write, then writes them in one chunk. This may save time compared to calling .write() 53 times.
Do you know why it's slow? It's because you are processing that big file 25 times with the outer shell for loop!
awk '
$4 <= 25 {
    for (i = $2; i <= $2+52; i++) {
        print i >> "my_out_file_" $4 ".query"
    }
}' bigfile
There are times when awk is not the answer.
There are also times when scripting languages are not the answer, when you are just plain better off biting the bullet and dragging down your copy of K&R and hacking some C code.
If your operating system implements pipes using concurrent processes and interprocess communication, as opposed to big temp files, what you might be able to do is write an awk script that reformats each line to put the selector field at the beginning in a format easily readable with scanf(), write a C program that opens the 25 files and distributes lines among them, and then pipe the awk script's output into the C program.
Sounds like you're on your way, but I just wanted to mention Memory Mapped I/O as being a huge help when working with gigantic files. There was a time when I had to parse a .5GB binary file with Visual Basic 5 (yes)... importing the CreateFileMapping API allowed me to parse the file (and create a several-gigabyte "human-readable" file) in minutes. And it only took a half hour or so to implement.
Here's a link describing the API on Microsoft platforms, though I'm sure MMIO should be on just about any platform: MSDN
Good luck!
There are some precalculations that may help.
For example, you can precalculate the output block for each value of field2, assuming it takes only 25 values like field4:
my %tx = map { my $tx = ''; for my $tx1 ($_ .. $_ + 52) { $tx .= "$tx1\n" }; $_ => $tx } (1 .. 25);
Later, when writing, you can do print {$fh{$pat}} $tx{$base};
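Putting that together with the filehandle hash from the Perl answer above, a minimal sketch (assuming field2 really is limited to 1..25, per the admission above; the variable names here are mine):

use strict;
use warnings;

# Precompute the 53-line block of output for each possible field2 value.
my %tx = map {
    my $base = $_;
    $base => join "", map { "$_\n" } $base .. $base + 52;
} 1 .. 25;

# One output file per field4 value, all opened up front.
my %fh;
for my $chr (1 .. 25) {
    open $fh{$chr}, ">", "my_out_file_$chr.query"
        or die "open my_out_file_$chr.query: $!";
}

# One pass over the input, one precomputed write per line.
while (<>) {
    my (undef, $base, undef, $pat) = split " ", $_, 5;
    print { $fh{$pat} } $tx{$base};
}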
Related
I am confused by one Perl question; does anyone have an idea?
I use one hash structure to store keys and values, like:
$hash{1} => 'a'
$hash{2} => 'b'
$hash{3} => 'c'
$hash{4} => 'd'
....
There are more than 1000 entries, and the hash is named %hash.
I then plan to have one loop that goes through all keys to see whether each key matches the line read from the file.
for example, below is the file content:
first line 1
second line 2
nothing
another line 3
my logic is:
while (read line) {
    while (($key, $value) = each(%hash)) {
        if ($line =~ /$key/i) {
            print "found";
        }
    }
}
so my expectation is :
first line 1 - > return found
second line 2 - > return found
nothing
another line 3 - > return found
....
However, during my testing, only the first and second lines return 'found'; for 'another line 3', the program does not return 'found'.
Note: the hash has more than 1000 records.
So I tried to debug it by adding a counter inside, and found that for the cases where 'found' is returned, the loop runs 600 or 700 times, but for the 'another line 3' case it only runs around 300 times, then just exits the loop without returning 'found'.
Any idea why it happens like that?
One more test I have done: if my hash is small, say only 10 keys, the logic works.
I also tried foreach, and it looks like foreach does not have this kind of issue.
The pseudo code you give should work fine, but there might be a subtle problem.
If, after you find your key and print it out, you end the while loop, then the next time each is called it will continue where you left off. To put it another way, each is an iterator that stores its state in the hash it iterates over.
In http://blogs.perl.org/users/rurban/2014/04/do-not-use-each.html the author explains this in more detail. His conclusion:
So each should be treated as in php: Avoid it like a plague. Only use it in optimized cases where you know what you are doing.
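A minimal sketch of the failure mode and the usual fixes (the data here is made up for illustration):

use strict;
use warnings;

my %hash = (1 => 'a', 2 => 'b', 3 => 'c', 4 => 'd');

# each() keeps its cursor inside %hash itself, so leaving a loop
# early strands the iterator partway through the hash...
while (my ($key, $value) = each %hash) {
    last if $key == 2;
}
# ...and the next each() on %hash resumes there, not at the start.

keys %hash;    # calling keys() in void context resets the iterator

# foreach over keys() has no hidden state, which is why it "just works":
my $line = "another line 3";
for my $key (keys %hash) {
    print "found\n" if $line =~ /$key/i;
}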
The problem is not very well articulated by the OP, and the provided sample data are poor for demonstration purposes.
The following sample code is an attempt based on the problem description provided by the OP.
It recreates the filter hash from the DATA block, composes $re_filter from the filter hash keys, and walks through a file given as a command line argument, printing lines that match $re_filter.
use strict;
use warnings;
my $data = do { local $/; <DATA> };
my %hash = split ' ', $data;
my $re_filter = join('|',keys %hash);
/$re_filter/ && print for <>;
__DATA__
1 a
2 b
3 c
4 d
Input data file content
first line 1
second line 2
nothing
another line 3
Output
first line 1
second line 2
another line 3
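One caveat worth adding (my addition, not part of the answer): with more realistic keys, a bare alternation can over-match or break on regex metacharacters, so escaping the keys and anchoring on word boundaries is safer:

my $re_filter = join '|', map quotemeta, keys %hash;
/\b(?:$re_filter)\b/ && print for <>;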
This is a perl code I use for compiling pressure data.
$data_ct--;
mkdir "365Days", 0777 unless -d "365Days";

my $file_no = 1;
my $j = $num_levels;
for ($i = 0; $i < $data_ct; $i++) {
    if ($j == $num_levels) {
        close OUT;
        $j = 0;
        my $file = "365Days/wind$file_no";
        $file_no++;
        open OUT, "> $file" or die "Can't open $file: $!";
    }
    $wind_direction = (270 - atan2($vwind[$i], $uwind[$i]) * (180/pi)) % 360;
    $wind_speed = sqrt($uwind[$i]*$uwind[$i] + $vwind[$i]*$vwind[$i]);
    printf OUT "%.0f %.0f %.1f\n", $level[$i], $wind_direction, $wind_speed;
    $j++;
}
$file_no--;
print STDERR "Wrote out $file_no wind files.\n";
print STDERR "Done\n";
The problem I am having is when it prints out the numbers, I want it to be in this format
Level  Wind direction  windspeed
250    320             1.5
870    56              4.6
Right now when I run the script, the column names do not show up, just the numbers. Can someone direct me as to how to rectify the script?
There are several ways to do this in Perl. First, Perl has a built-in format ability. It's been a part of Perl since version 3.0 (about 20 years old). However, it is rarely used. In fact, it is so rarely used that I am not even going to attempt to write an example with it, because I'd have to spend way too much time relearning it. It's there and documented.
You can try to figure it out for yourself. Or, maybe some old timer Perl programmer might wake up from his nap and help out. (All bets are off if it's meatloaf night at the old age home, though).
Perl has evolved greatly in the last few decades, and this old forms bit represents a much older way of writing Perl programs. It just isn't pretty.
Another way this can be done, and one that is more popular, is to use the printf function. If you're not familiar with C and printf from there, it can be a bit intimidating to use. It depends upon formatting codes (the things that start with %) to specify what you want to print (strings, integers, floating-point numbers, etc.) and how you want those values formatted.
Fortunately, printf is so useful that most programming languages have their own version of it. Once you learn it, your knowledge is transferable to other places. There's an equivalent sprintf for assigning formatted output to a variable.
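For instance, sprintf takes the same formatting codes but returns the formatted string instead of printing it (a small illustration of my own):

# Build the row as a string; print it later, or store it somewhere.
my $row = sprintf "%5d %14d %9.1f", 250, 320, 1.5;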
# Define header and line formats
my $header_fmt = "%-5.5s %-14.14s %-9.9s\n";
my $data_fmt = "%5d %14d %9.1f\n";
# Print my table header
printf $header_fmt, "Level", "Wind direction", "windspeed";
my $level = 230;
my $direction = 120;
my $speed = 32.3;
# Print my table data
printf $data_fmt, $level, $direction, $speed;
This prints out:
Level Wind direction windspeed
  230            120      32.3
I like defining the format of my printed lines all together, so I can tweak them to get what I want. It's a great way to make sure your data line lines up with your header.
Okay, Matlock wasn't on tonight, so this crusty old Perl programmer has plenty of time.
In my previous answer, I said there was an old way of doing forms in Perl, but I didn't remember how it went. Well, I spent some time and got you an example of how it works.
Basically, you sort of need globalish variables. I thought you needed our variables for this to work, but I can get my variables to work if I define them on the same level as my format statements. It's not pretty.
You use GLOBS to define your formats with _TOP appended for your headers. Since I'm printing these on STDOUT, I define STDOUT_TOP for my heading and STDOUT for my data lines.
The format definition must start at the beginning of a line. The lone . at the end terminates the format definition. You'll notice I write the entire thing with just a single write statement. How does it know to print the heading? Perl tracks the number of lines printed and automatically writes a form feed character and a new heading when Perl thinks it's at the bottom of a page. I am assuming Perl uses 66-line pages as a default.
You can set your own format names in Perl via select. Perl uses $= as the number of lines on a page, and $- as the number of lines left on the page. These variables are global, but apply to the currently selected filehandle, which is set via the select statement. You can use IO::Handle for better variable naming.
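For instance, a tiny sketch of the IO::Handle spelling (the page length of 60 is an arbitrary choice of mine):

use IO::Handle;
STDOUT->format_lines_per_page(60);   # the method form of: select STDOUT; $= = 60;

With that background, here's the complete example: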
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);

my @data = (
    {
        level     => 250,
        direction => 320,
        speed     => 1.5,
    },
    {
        level     => 870,
        direction => 55,
        speed     => 4.5,
    },
);

my $level;
my $direction;
my $speed;

for my $item_ref ( @data ) {
    $level     = $item_ref->{level};
    $direction = $item_ref->{direction};
    $speed     = $item_ref->{speed};
    write;
}

format STDOUT_TOP =
Level Wind Direction Windspeed
===== ============== =========
.

format STDOUT =
@#### @############# @#####.##
$level, $direction, $speed
.
This prints:
Level Wind Direction Windspeed
===== ============== =========
  250            320      1.50
  870             55      4.50
@Gunnerfan: Can you replace the line from your code as shown below?
Your line of code: printf OUT "%.0f %.0f %.1f\n", $level[$i], $wind_direction, $wind_speed;
Replacement code:
if ($i == 0) {
    printf OUT "\n\t%-10s %-20s %-10s\n", 'Level', 'Wind direction', 'windspeed';
}
printf OUT "\t%-10.0f %-20.0f %-10.1f\n", $level[$i], $wind_direction, $wind_speed;
I'm a student in an intro Perl class, looking for suggestions and feedback on my approach to writing a small (but tricky) program that analyzes data about atoms. My professor encourages using forums. I am not advanced with Perl subs or modules (including Bioperl), so please keep responses at an appropriate beginner level so that I can understand and learn from your suggestions and/or code (and please limit the "magic").
The requirements of the program are as follows:
Read a file (containing data about Atoms) from the command line & create an array of atom records (one record/atom per newline). For each record the program will need to store:
• The atom's serial number (cols 7 - 11)
• The three-letter name of the amino acid to which it belongs (cols 18 - 20)
• The atom's three coordinates (x,y,z) (cols 31 - 54 )
• The atom's one- or two-letter element name (e.g. C, O, N, Na) (cols 77-78 )
Prompt for one of three commands: freq, length, density d (d is some number):
• freq - how many of each type of atom is in the file (for example, nitrogen, sodium, etc. would be displayed like this: N: 918 S: 23)
• length - the distances among coordinates
• density d (where d is a number) - the program will prompt for the name of a file to save computations to. For each atom, it computes the distance between that atom and every other atom; if that distance is less than or equal to d, it increments that atom's count of neighbors within that distance. Each nonzero count is written into the file. The output will look something like:
1: 5
2: 3
3: 6
... (very big file) and will close when it finishes.
I'm looking for feedback on what I have written (and need to write) in the code below. I especially appreciate any feedback in how to approach writing my subs. I've included sample input data at the bottom.
The program structure and function descriptions as I see it:
$^W = 1;    # turn on warnings
use strict; # behave!

my @fields;
my @recs;

while ( <DATA> ) {
    chomp;
    @fields = split(/\s+/);
    push @recs, makeRecord(@fields);
}

for (my $i = 0; $i < @recs; $i++) {
    printRec( $recs[$i] );
}

my %command_table = (
    freq    => \&freq,
    length  => \&length,
    density => \&density,
    help    => \&help,
    quit    => \&quit
);

print "Enter a command: ";
while ( <STDIN> ) {
    chomp;
    my @line = split( /\s+/ );
    my $command = shift @line;
    if ($command !~ /^freq$|^density$|length|^help$|^quit$/ ) {
        print "Command must be: freq, length, density or quit\n";
    }
    else {
        $command_table{$command}->();
    }
    print "Enter a command: ";
}

sub makeRecord
# Read the entire line and make records from the lines that contain the
# word ATOM or HETATM in the first column. Not sure how to do this:
{
    my %record = (
        serialnumber => shift,
        aminoacid    => shift,
        coordinates  => shift,
        element      => [ @_ ]
    );
    return \%record;
}
sub freq
# take an array of atom records, return a hash whose keys are
# distinct atom names and whose values are the frequencies of
# these atoms in the array.

sub length
# take an array of atom records and return the max distance
# between all pairs of atoms in that array. My instructor
# advised this would be constructed as a for loop inside a for loop.

sub density
# take an array of atom records and a number d and return a
# hash whose keys are atom serial numbers and whose values are
# the number of atoms within that distance from the atom with that
# serial number.
sub help
{
    print "To use this program, type either\n",
          "freq\n",
          "length\n",
          "density followed by a number, d,\n",
          "help\n",
          "quit\n";
}

sub quit
{
    exit 0;
}
# truncating for testing purposes. Actual data is aprox. 100 columns
# and starts with ATOM or HETATM.
__DATA__
ATOM 4743 CG GLN A 704 19.896 32.017 54.717 1.00 66.44 C
ATOM 4744 CD GLN A 704 19.589 30.757 55.525 1.00 73.28 C
ATOM 4745 OE1 GLN A 704 18.801 29.892 55.098 1.00 75.91 O
It looks like your Perl skills are advancing nicely -- using references and complex data structures. Here are a few tips and pieces of general advice.
Enable warnings with use warnings rather than $^W = 1. The former is self-documenting and has the advantage of being local to the enclosing block rather than being a global setting.
Use well-named variables, which will help document the program's behavior, rather than relying on Perl's special $_. For example:
while (my $input_record = <DATA>){
}
In user-input scenarios, an endless loop provides a way to avoid repeated instructions like "Enter a command". See below.
Your regex can be simplified to avoid the need for repeated anchors. See below.
As a general rule, affirmative tests are easier to understand than negative tests. See the modified if-else structure below.
Enclose each part of the program within its own subroutine. This is a good general practice for a bunch of reasons, so I would just start the habit.
A related good practice is to minimize the use of global variables. As an exercise, you could try to write the program so that it uses no global variables at all. Instead, any needed information would be passed around between the subroutines. With small programs one does not necessarily need to be rigid about the avoidance of globals, but it's not a bad idea to keep the ideal in mind.
Give your length subroutine a different name. That name is already used by the built-in length function.
Regarding your question about makeRecord, one approach is to ignore the filtering issue inside makeRecord. Instead, makeRecord could include an additional hash field, and the filtering logic would reside elsewhere. For example:
my $record = makeRecord(@fields);
push @recs, $record if $record->{type} =~ /^(ATOM|HETATM)$/;
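A sketch of how makeRecord might carry that type field (the field order here is assumed from the question's split, not prescribed):

sub makeRecord {
    my ($type, $serial, $amino, @rest) = @_;
    return {
        type         => $type,    # 'ATOM', 'HETATM', or something else
        serialnumber => $serial,
        aminoacid    => $amino,
        element      => [ @rest ],
    };
}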
An illustration of some of the points above:
use strict;
use warnings;

run();

sub run {
    my $atom_data = load_atom_data();
    print_records($atom_data);
    interact_with_user($atom_data);
}

...

sub interact_with_user {
    my $atom_data = shift;
    my %command_table = (...);
    while (1){
        print "Enter a command: ";
        chomp(my $reply = <STDIN>);
        my ($command, @line) = split /\s+/, $reply;
        if ( $command =~ /^(freq|density|length|help|quit)$/ ) {
            # Run the command.
        }
        else {
            # Print usage message for user.
        }
    }
}
...
FM's answer is pretty good. I'll just mention a couple of additional things:
You already have a hash with the valid commands (which is a good idea). There's no need to duplicate that list in a regex. I'd do something like this:
if (my $routine = $command_table{$command}) {
    $routine->(@line);
} else {
    print "Command must be: freq, length, density or quit\n";
}
Notice I'm also passing @line to the subroutine, because you'll need that for the density command. Subroutines that don't take arguments can just ignore them.
You could also generate the list of valid commands for the error message by using keys %command_table, but I'll leave that as an exercise for you.
Another thing is that the description of the input file mentions column numbers, which suggests that it's a fixed-width format. That's better parsed with substr or unpack. If a field is ever blank or contains a space, then your split will not parse it correctly. (If you use substr, be aware that it numbers columns starting at 0, when people often label the first column 1.)
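For example, the column ranges in the question translate into an unpack template like this (a sketch of mine; unpack's @ offsets count from 0, so column 7 becomes @6, and $line holds one input record):

# cols 7-11: serial number, 18-20: amino acid, 31-54: x/y/z, 77-78: element
my ($serial, $amino, $coords, $element) =
    unpack '@6 A5 @17 A3 @30 A24 @76 A2', $line;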
How does this code work at all?
#!/usr/bin/perl
$i=4;$|=@f=map{("!"x$i++)."K$_^\x{0e}"}
"BQI!\\","BQI\\","BQI","BQ","B","";push
@f,reverse@f[1..5];@f=map{join"",undef,
map{chr(ord()-1)}split""}@f;{;$f=shift@
f;print$f;push@f,$f;select undef,undef,
undef,.25;redo;last;exit;print or die;}
Let's first put this through perltidy:
$i = 5;
$| = @f = map { ("!" x $i++) . "9$_*\x{0e}" } ">>>E!)", ">>>E)", ">>>E", ">>>", ">>", ">", "";
push @f, reverse @f[ 1..5 ];
@f = map {
    join "",
        map { chr(ord() - 1) }
        split //
} @f;
{
    $f = shift @f;
    print $f;
    push @f, $f;
    select undef, undef, undef, .25;
    redo;
    last;
    exit;
    print or die;
}
The first line is obvious.
The second line makes the list ">>>E!)", ">>>E)", ">>>E", ">>>", ">>", ">", "", pads each entry with a run of '!' characters so they all end up equally long, and appends an asterisk and a 'Shift Out' (the character right after a carriage return).
The third line appends items 5 to 1 (in that order) to that list, so it becomes ">>>E!)", ">>>E)", ">>>E", ">>>", ">>", ">", "", ">", ">>", ">>>", ">>>E", ">>>E)".
The map decrements all of the characters by one, thus creating elements like "8===D ()".
The final bare block simply prints the elements of the list in a loop, one every 0.25 seconds. The carriage return causes them to overwrite each other, so that an animation is seen. The last couple of lines are never reached and are thus bogus.
The code in the file is loaded into a program called the Perl interpreter. The interpreter parses the code and converts it to a series of "opcodes" -- a bytecode language that is sort of halfway between Perl code and the machine language the code is running on. If there were no errors in the conversion process (called "compiling"), then the code is executed by another part of the Perl interpreter. During execution, the program may change various states of the machine, such as allocating, deallocating, reading, and writing memory, or using the input/output and other features of the system.
(CW - More hardcore hackers than I are welcome to correct any errors or misconceptions and to add more information)
There's no magic going on here, just obfuscation. Let's take a high-level view. The first thing to notice is that later on, every character in strings is interpreted as if it were the previous character:
[1] map{chr(ord()-1)} ...
Thus, a string like "6qD" will result in "5pC" (the characters before '6', 'q', and 'D', respectively). The main point of interest is the array of strings near the beginning:
[2] ">>>E!)",">>>E)",">>>E",">>>",">>",">",""
This defines a sequence of "masks" that we will substitute later on, into this string:
[3] "9$_*\x{0e}"
They'll get inserted at the $_ point. The string \x{0e} represents a hex control character; notice that \x{0d}, the character just before it, is a carriage return. That's what'll get substituted into [3] when we do [1].
Before the [3] string is assembled, we prepend a number of ! characters equal to $i to each element in [2]. Each successive element gets one more ! than the element before it. Notice that the character whose value is just before ! is a space.
The rest of the script iterates over each of the assembled array elements, which now look more like this:
[4] "!!!!!9>>>E!)\x{0e}", ---> " 8===D ("
"!!!!!!9>>>E)\x{0e}", ---> " 8===D("
"!!!!!!!9>>>E\x{0e}", ---> " 8===D"
"!!!!!!!!9>>>\x{0e}", ---> " 8==="
"!!!!!!!!!9>>\x{0e}", ---> " 8=="
"!!!!!!!!!!9>\x{0e}", ---> " 8="
"!!!!!!!!!!!9\x{0e}", ---> " 8"
Then the reverse operation appends the same elements in reverse, creating a loop.
At this point you should be able to see the pattern emerge that produces the animation. Now it's just a matter of moving through each step in the animation and back again, which is accomplished by the rest of the script. The timestep delay of each step is governed by the select statement:
[5] select undef, undef, undef, 0.25
which tells us to wait 250 milliseconds between each iteration. You can change this if you want to see it speed up or slow down.
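As an aside, the 4-argument select is just an old idiom for sub-second sleeps; Time::HiRes does the same thing more readably:

use Time::HiRes qw(sleep);
sleep 0.25;    # the same quarter-second pause, without the select trick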
I have an image file name that consists of four parts:
$Directory (the directory where the image exists)
$Name (for an art site, this is the painting's name reference #)
$File (the image's file name minus extension)
$Extension (the image's extension)
$example = "100020003000.png"
Which I desire to be broken down accordingly:
$dir=1000 $name=2000 $file=3000 $ext=.png
I was wondering if substr was the best option in breaking up the incoming $example so I can do stuff with the 4 variables like validation/error checking, grabbing the verbose name from its $Name assignment or whatever. I found this post:
is unpack faster than substr?
So, in my beginner's "stone tool" approach:
my $example = "100020003000.png";
my $dir  = substr($example, 0, 4);
my $name = substr($example, 4, 4);
my $file = substr($example, 8, 4);
my $ext  = substr($example, 13, 3); # will add the "." later #
So, can I use unpack, or maybe even another approach that would be more efficient?
I would also like to avoid loading any modules unless doing so would use fewer resources for some reason. Modules are great tools, I luv 'em, but I don't think they're necessary here.
I realize I should probably push the vars into an array/hash, but I am really a beginner here and would need further instruction on how to do that and how to pull them back out.
Thanks to everyone at stackoverflow.com!
Absolutely:
my $example = "100020003000.png";
my ($dir, $name, $file, $ext) = unpack 'A4' x 4, $example;
print "$dir\t$name\t$file\t$ext\n";
Output:
1000 2000 3000 .png
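The template works because the string repetition expands before unpack ever sees it:

print 'A4' x 4;    # "A4A4A4A4": four consecutive 4-character ASCII fields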
I'd just use a regex for that:
my ($dir, $name, $file, $ext) = $path =~ m:(.*)/(.*)/(.*)\.(.*):;
Or, to match your specific example:
my ($dir, $name, $file, $ext) = $example =~ m:^(\d{4})(\d{4})(\d{4})\.(.{3})$:;
Using unpack is good, but since the elements are all the same width, the regex is very simple as well:
my $example = "100020003000.png";
my ($dir, $name, $file, $ext) = $example =~ /(.{4})/g;
It isn't unpack, but since you have groups of 4 characters, you could use a limited split, with a capture:
my ($dir, $name, $file, $ext) = grep length, split /(....)/, $filename, 4;
This is pretty obfuscated, so I probably wouldn't use it, but the capture in a split is an often overlooked ability.
So, here's an explanation of what this code does:
Step 1. split with capturing parentheses adds the values captured by the pattern to its output stream. The stream contains a mix of fields and delimiters.
qw( a 1 b 2 c 3 ) == split /(\d)/, 'a1b2c3';
Step 2. split with 3 args limits how many times the string is split.
qw( a b2c3 ) == split /\d/, 'a1b2c3', 2;
Step 3. Now, when we use a delimiter pattern that matches pretty much anything /(....)/, we get a bunch of empty (0 length) strings. I've marked delimiters with D characters, and fields with F:
( '', 'a', '', '1', '', 'b', '', '2' ) == split /(.)/, 'a1b2';
F D F D F D F D
Step 4. So if we limit the number of fields to 3 we get:
( '', 'a', '', '1', 'b2' ) == split /(.)/, 'a1b2', 3;
F D F D F
Step 5. Putting it all together we can do this (I used a .jpeg extension so that the extension would be longer than 4 characters):
( '', 1000, '', 2000, '', 3000, '.jpeg' ) = split /(....)/, '100020003000.jpeg',4;
F D F D F D F
Step 6. Step 5 is almost perfect, all we need to do is strip out the null strings and we're good:
( 1000, 2000, 3000, '.jpeg' ) = grep length, split /(....)/, '100020003000.jpeg',4;
This code works, and it is interesting. But it's not any more compact than any of the other solutions. I haven't benchmarked it, but I'd be very surprised if it wins any speed or memory efficiency prizes.
But the real issue is that it is too tricky to be good for real code. Using split to capture delimiters (and maybe one final field), while throwing out the field data is just too weird. It's also fragile: if one field changes length the code is broken and has to be rewritten.
So, don't actually do this.
At least it provided an opportunity to explore some lesser known features of split.
Both substr and unpack bias your thinking toward fixed-layout, while regex solutions are more oriented toward flexible layouts with delimiters.
The example you gave appeared to be fixed layout, but directories are usually separated from file names by a delimiter (e.g. slash for POSIX-style file systems, backslash for MS-DOS, etc.). So you might actually have a case for both: a regex solution to split directory and file name apart (or even directory/name/extension), and then a fixed-length approach for the name part by itself, as in the sketch below.
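A quick sketch of that hybrid (the path and field layout here are assumed for illustration):

my $path = "images/100020003000.png";

# The regex handles the delimited parts: directory, base name, extension.
my ($dir, $base, $ext) = $path =~ m{^(.*)/(.*)\.([^.]+)$};

# Fixed-width unpack handles the base name's three 4-character fields.
my ($name, $file, $serial) = unpack 'A4 A4 A4', $base;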