Related
I have a perl script I wish to parrallelise.
It is composed of a while loop with over 11000 lines inside of another while loop of 3400 lines, which makes it extremely slow.
open (FILE1, "File1.txt") or die "Can't open File1";
open (OUT, ">Outfile.txt");
while (<FILE1>)
{
my #data=split (/ /, $_);
my $RS=1;
open (FILE2, "File2.txt") or die "Can't open File2";
while (<FILE2>)
{
my #value=split (/ /, $_);
if ($data[$RS] == 1) {print OUT $value[1];$RS++;}
elsif ($data[$RS] == 2) {print OUT $value[2];$RS++;}
elsif ($data[$RS] == 0) {print OUT $value[3];$RS++;}
}
close FILE2;
}
I'm looking for a way to do the equivalent of qsub with every line of File1 so I can send 3440 jobs. Any suggestions? I'd like to stay with perl if possible. I tried to insert this code inside of a bash script, but I don't really understand how to insert a language inside another one.
My File1 contains a list of ID with information in column. Each column is then related to a single line in File2. I'd like to be able to run the second loop for multiple ID simultaneously instead of one after another.
File1
ID RS_10 RS_15 RS_30
23 1 0 1
34 2 2 0
45 1 1 0
23 0 0 2
10 2 1 1
File2
RS_10 A B C
RS_15 D E F
RS_30 G H I
The first rule of optimization is not to do it too early (i.e. jumping to premature conclusions without profiling your code).
The second rule would probably refer to caching.
The File2 of yours isn't very large. I'd say we load it into memory. This has the following advantages:
We do our parsing once and only once.
The file isn't obscenly large, so space isn't much of an issue.
We can create a data structure that makes lookups very simple.
About that first point: You split each line over three thousand times. Those cycles could have been better spent.
About that third point: you seem to do an index conversion:
1 → 1, 2 → 2, 0 → 3
Instead of testing for all values with an if/elsif-switch (linear complexity), we could use an array that does this translation (constant time lookups):
my #conversion = (3, 1, 2);
...;
print OUT $value[$conversion[$data[$RS++]]];
If this index conversion is constant, we could do it once and only once when parsing File2. This would look like
use strict; use warnings;
use autodie; # automatic error handling
my #file2;
{
open my $file2, "<", "File2.txt";
while (<$file2>) {
my (undef, #vals) = split;
# do the reordering. This is equivalent to #vals = #vals[2, 0, 1];
unshift #vals, pop #vals;
push #file2, \#vals;
}
}
Now we can move on to iterating through File1. Printing the corresponding entry from File2 now looks like
open my $file1, "<", "File1.txt";
<$file1>; # remove header
while (<$file1>) {
my ($id, #indices) = split;
print $id, map $file2[$_][$indices[$_]], 0 .. $#indices;
# but I guess you'd want some separator in between
# If so, set the $, variable
}
This algorithm is still quadratic (the map is just a for-loop in disguise), but this should have a better constant factor. The output of above code given your example input is
23 A F G
34 B E I
45 A D I
23 C F H
10 B D G
(with $, = " "; $\ = "\n").
Where to go from here
This last step (looping through File1) could be parallelized, but this is unlikely to help much: IO is slow, communication between threads is expensive (IPC even more so), and the output would be in random order. We could spawn a bunch of workers, and pass unparsed lines in a queue:
use threads; # should be 1st module to be loaded
use Thread::Queue;
use constant NUM_THREADS => 4; # number of cores
# parse the File2 data here
my $queue = Thread::Queue->new;
my #threads = map threads->new(\&worker), 1 .. NUM_THREADS;
# enqueue data
$queue->enqueue($_) while <$file1>;
# end the queue
$queue->enqueue((undef) x NUM_THREADS); # $queue->end in never versions
# wait for threads to complete
$_->join for #threads;
sub worker {
while(defined(my $_ = $queue->dequeue)) {
my ($id, #indices) = split;
print $id, map $file2[$_][$indices[$_]], 0 .. $#indices;
}
}
Note that this copies the #file2 into all threads. Fun fact: for the example data, this threaded solution takes roughly 4× as long. This is mostly the overhead of thread creation, so this will be less of an issue for your data.
In any case, profile your code to see where you can optimize most effectively. I recommend the excellent Devel::NYTProf. E.g. for my non-threaded test run with this very limited data, the overhead implied by autodie and friends used more time than doing the actual processing. For you, the most expensive line would probably be
print $id, map $file2[$_][$indices[$_]], 0 .. $#indices;
but there isn't much we can do here inside Perl.
Ok, so I'm very new to Perl. I have a text file and in the file there are 4 columns of data(date, time, size of files, files). I need to create a small script that can open the file and get the average size of the files. I've read so much online, but I still can't figure out how to do it. This is what I have so far, but I'm not sure if I'm even close to doing this correctly.
#!/usr/bin/perl
open FILE, "files.txt";
##array = File;
while(FILE){
#chomp;
($date, $time, $numbers, $type) = split(/ /,<FILE>);
$total += $numbers;
}
print"the total is $total\n";
This is how the data looks in the file. These are just a few of them. I need to get the numbers in the third column.
12/02/2002 12:16 AM 86016 a2p.exe
10/10/2004 11:33 AM 393 avgfsznew.pl
11/01/2003 04:42 PM 38124 c2ph.bat
Your program is reasonably close to working. With these changes it will do exactly what you want
Always use use strict and use warnings at the start of your program, and declare all of your variables using my. That will help you by finding many simple errors that you may otherwise overlook
Use lexical file handles, the three-parameter form of open, and always check the return status of any open call
Declare the $total variable outside the loop. Declaring it inside the loop means it will be created and destroyed each time around the loop and it won't be able to accumulate a total
Declare a $count variable in the same way. You will need it to calculate the average
Using while (FILE) {...} just tests that FILE is true. You need to read from it instead, so you must use the readline operator like <FILE>
You want the default call to split (without any parameters) which will return all the non-space fields in $_ as a list
You need to add a variable in the assignment to allow for athe AM or PM field in each line
Here is a modification of your code that works fine
use strict;
use warnings;
open my $fh, '<', "files.txt" or die $!;
my $total = 0;
my $count = 0;
while (<$fh>) {
my ($date, $time, $ampm, $numbers, $type) = split;
$total += $numbers;
$count += 1;
}
print "The total is $total\n";
print "The count is $count\n";
print "The average is ", $total / $count, "\n";
output
The total is 124533
The count is 3
The average is 41511
It's tempting to use Perl's awk-like auto-split option. There are 5 columns; three containing date and time information, then the size and then the name.
The first version of the script that I wrote is also the most verbose:
perl -n -a -e '$total += $F[3]; $num++; END { printf "%12.2f\n", $total / ($num + 0.0); }'
The -a (auto-split) option splits a line up on white space into the array #F. Combined with the -n option (which makes Perl run in a loop that reads the file name arguments in turn, or standard input, without printing each line), the code adds $F[3] (the fourth column, counting from 0) to $total, which is automagically initialized to zero on first use. It also counts the lines in $num. The END block is executed when all the input is read; it uses printf() to format the value. The + 0.0 ensures that the arithmetic is done in floating point, not integer arithmetic. This is very similar to the awk script:
awk '{ total += $4 } END { print total / NR }'
First drafts of programs are seldom optimal — or, at least, I'm not that good a programmer. Revisions help.
Perl was designed, in part, as an awk killer. There is still a program a2p distributed with Perl for converting awk scripts to Perl (and there's also s2p for converting sed scripts to Perl). And Perl does have an automatic (built-in) variable that keeps track of the number of lines read. It has several names. The tersest is $.; the mnemonic name $NR is available if you use English; in the script; so is $INPUT_LINE_NUMBER. So, using $num is not necessary. It also turns out that Perl does a floating point division anyway, so the + 0.0 part was unnecessary. This leads to the next versions:
perl -MEnglish -n -a -e '$total += $F[3]; END { printf "%12.2f\n", $total / $NR; }'
or:
perl -n -a -e '$total += $F[3]; END { printf "%12.2f\n", $total / $.; }'
You can tune the print format to suit your whims and fancies. This is essentially the script I'd use in the long term; it is fairly clear without being long-winded in any way. The script could be split over multiple lines if you desired. It is a simple enough task that the legibility of the one-line is not a problem, IMNSHO. And the beauty of this is that you don't have to futz around with split and arrays and read loops on your own; Perl does most of that for you. (Granted, it does blow up on empty input; that fix is trivial; see below.)
Recommended version
perl -n -a -e '$total += $F[3]; END { printf "%12.2f\n", $total / $. if $.; }'
The if $. tests whether the number of lines read is zero or not; the printf and division are omitted if $. is zero so the script outputs nothing when given no input.
There is a noble (or ignoble) game called 'Code Golf' that was much played in the early days of Stack Overflow, but Code Golf questions are no longer considered good questions. The object of Code Golf is to write a program that does a particular task in as few characters as possible. You can play Code Golf with this and compress it still further if you're not too worried about the format of the output and you're using at least Perl 5.10:
perl -Mv5.10 -n -a -e '$total += $F[3]; END { say $total / $. if $.; }'
And, clearly, there are a lot of unnecessary spaces and letters in there:
perl -Mv5.10 -nae '$t+=$F[3];END{say$t/$.if$.}'
That is not, however, as clear as the recommended version.
#!/usr/bin/perl
use warnings;
use strict;
open my $file, "<", "files.txt";
my ($total, $cnt);
while(<$file>){
$total += (split(/\s+/, $_))[3];
$cnt++;
}
close $file;
print "number of files: $cnt\n";
print "total size: $total\n";
printf "avg: %.2f\n", $total/$cnt;
Or you can use awk:
awk '{t+=$4} END{print t/NR}' files.txt
Try doing this :
#!/usr/bin/perl -l
use strict; use warnings;
open my $file, '<', "my_file" or die "open error [$!]";
my ($total, $count);
while (<$file>){
chomp;
next if /^$/;
my ($date, $time, $x, $numbers, $type) = split;
$total += $numbers;
$count++;
}
print "the average is " . $total/$count . " and the total is $total";
close $file;
It is as simple as this:
perl -F -lane '$a+=$F[3];END{print "The average size is ".$a/$.}' your_file
tested below:
> cat temp
12/02/2002 12:16 AM 86016 a2p.exe
10/10/2004 11:33 AM 393 avgfsznew.pl
11/01/2003 04:42 PM 38124 c2ph.bat
Now the execution:
> perl -F -lane '$a+=$F[3];END{print "The average size is ".$a/$.}' temp
The average size is 41511
>
explanation:
-F -a says store the line in an array format.with the default separator as space or tab.
so nopw $F[3] has you size of the file.
sum up all the sizes in the 4th column untill all the lines are processed.
END will be executed after processing all the lines in the file.
so $. at the end will gives the number of lines.
so $a/$. will give the average.
This solution opens the file and loops through each line of the file. It then splits the file into the five variables in the line by splitting on 1 or more spaces.
open the file for reading, "<", and if it fails, raise an error or die "..."
my ($total, $cnt) are our column total and number of files added count
while(<FILE>) { ... } loops through each line of the file using the file handle and stores the line in $_
chomp removes the input record separator in $_. In unix, the default separator is a newline \n
split(/\s+/, $_) Splits the current line represented by$_, with the delimiter \s+. \s represents a space, the + afterward means "1 or more". So, we split the next line on 1 or more spaces.
Next we update $total and $cnt
#!/usr/bin/perl
open FILE, "<", "files.txt" or die "Error opening file: $!";
my ($total, $cnt);
while(<FILE>){
chomp;
my ($date, $time, $am_pm, $numbers, $type) = split(/\s+/, $_);
$total += $numbers;
$cnt++;
}
close FILE;
print"the total is $total and count of $cnt\n";`
Hello everyone I'm a beginner in perl and I'm facing some problems as I want to put my strings starting from AA to \ in to an array and want to save it. There are about 2000-3000 strings in a txt file starting from same initials i.e., AA to / I'm doing it by this way plz correct me if I'm wrong.
Input File
AA c0001
BB afsfjgfjgjgjflffbg
CC table
DD hhhfsegsksgk
EB jksgksjs
\
AA e0002
BB rejwkghewhgsejkhrj
CC chair
DD egrhjrhojohkhkhrkfs
VB rkgjehkrkhkh;r
\
Source code
$flag = 0
while ($line = <ifh>)
{
if ( $line = m//\/g)
{
$flag = 1;
}
while ( $flag != 0)
{
for ($i = 0; $i <= 10000; $i++)
{ # Missing brace added by editor
$array[$i] = $line;
} # Missing brace added by editor
}
} # Missing close brace added by editor; position guessed!
print $ofh, $line;
close $ofh;
Welcome to StackOverflow.
There are multiple issues with your code. First, please post compilable Perl; I had to add three braces to give it the remotest chance of compiling, and I had to guess where one of them went (and there's a moderate chance it should be on the other side of the print statement from where I put it).
Next, experts have:
use warnings;
use strict;
at the top of their scripts because they know they will miss things if they don't. As a learner, it is crucial for you to do the same; it will prevent you making errors.
With those in place, you have to declare your variables as you use them.
Next, remember to indent your code. Doing so makes it easier to comprehend. Perl can be incomprehensible enough at the best of times; don't make it any harder than it has to be. (You can decide where you like braces - that is open to discussion, though it is simpler to choose a style you like and stick with it, ignoring any discussion because the discussion will probably be fruitless.)
Is the EB vs VB in the data significant? It is hard to guess.
It is also not clear exactly what you are after. It might be that you're after an array of entries, one for each block in the file (where the blocks end at the line containing just a backslash), and where each entry in the array is a hash keyed by the first two letters (or first word) on the line, with the remainder of the line being the value. This is a modestly complex structure, and probably beyond what you're expected to use at this stage in your learning of Perl.
You have the line while ($line = <ifh>). This is not invalid in Perl if you opened the file the old fashioned way, but it is not the way you should be learning. You don't show how the output file handle is opened, but you do use the modern notation when trying to print to it. However, there's a bug there, too:
print $ofh, $line; # Print two values to standard output
print $ofh $line; # Print one value to $ofh
You need to look hard at your code, and think about the looping logic. I'm sure what you have is not what you need. However, I'm not sure what it is that you do need.
Simpler solution
From the comments:
I want to flag each record starting from AA to \ as record 0 till record n and want to save it in a new file with all the record numbers.
Then you probably just need:
#!/usr/bin/env perl
use strict;
use warnings;
my $recnum = 0;
while (<>)
{
chomp;
if (m/^\\$/)
{
print "$_\n";
$recnum++;
}
else
{
print "$recnum $_\n";
}
}
This reads from the files specified on the command line (or standard input if there are none), and writes the tagged output to standard output. It prefixes each line except the 'end of record' marker lines with the record number and a space. Choose your output format and file handling to suit your needs. You might argue that the chomp is counter-productive; you can certainly code the program without it.
Overly complex solution
Developed in the absence of clear direction from the questioner.
Here is one possible way to read the data, but it uses moderately advanced Perl (hash references, etc). The Data::Dumper module is also useful for printing out Perl data structures (see: perldoc Data::Dumper).
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my #data;
my $hashref = { };
my $nrecs = 0;
while (<>)
{
chomp;
if (m/^\\$/)
{
# End of group - save to data array and start new hash
$data[$nrecs++] = $hashref;
$hashref = { };
}
else
{
m/^([A-Z]+)\s+(.*)$/;
$hashref->{$1} = $2;
}
}
foreach my $i (0..$nrecs-1)
{
print "Record $i:\n";
foreach my $key (sort keys $data[$i])
{
print " $key = $data[$i]->{$key}\n";
}
}
print Data::Dumper->Dump([ \#data ], [ '#data' ]);
Sample output for example input:
Record 0:
AA = c0001
BB = afsfjgfjgjgjflffbg
CC = table
DD = hhhfsegsksgk
EB = jksgksjs
Record 1:
AA = e0002
BB = rejwkghewhgsejkhrj
CC = chair
DD = egrhjrhojohkhkhrkfs
VB = rkgjehkrkhkh;r
$#data = [
{
'EB' => 'jksgksjs',
'CC' => 'table',
'AA' => 'c0001',
'BB' => 'afsfjgfjgjgjflffbg',
'DD' => 'hhhfsegsksgk'
},
{
'CC' => 'chair',
'AA' => 'e0002',
'VB' => 'rkgjehkrkhkh;r',
'BB' => 'rejwkghewhgsejkhrj',
'DD' => 'egrhjrhojohkhkhrkfs'
}
];
Note that this data structure is not optimized for searching except by record number. If you need to search the data in some other way, then you need to organize it differently. (And don't hand this code in as your answer without understanding it all - it is subtle. It also does no error checking; beware faulty data.)
It can't be right. I can see two main issues with your while-loop.
Once you enter the following loop
while ( $flag != 0)
{
...
}
you'll never break out because you do not reset the flag whenever you find an break-line. You'll have to parse you input and exit the loop if necessary.
And second you never read any input within this loop and thus process the same $line over and over again.
You should not put the loop inside your code but instead you can use the following pattern (pseudo-code)
if flag != 0
append item to array
else
save array to file
start with new array
end
I believe what you want is to split the files content at \ though it's not too clear.
To achieve this you can slurp the file into a variable by setting the input record separator, then split the content.
To find out about Perl's special variables related to filehandlers read perlvar
#!perl
use strict;
use warnings;
my $content;
{
open my $fh, '<', 'test.txt';
local $/; # slurp mode
$content = <$fh>;
close $fh;
}
my #blocks = split /\\/, $content;
Make sure to localize modifications of Perl's special variables to not interfere with different parts of your program.
If you want to keep the separator you could set $/ to \ directly and skip split.
#!perl
use strict;
use warnings;
my #blocks;
{
open my $fh, '<', 'test.txt';
local $/ = '\\'; # seperate at \
#blocks = <$fh>;
close $fh;
}
Here's a way to read your data into an array. As I said in a comment, "saving" this data to a file is pointless, unless you change it. Because if I were to print the #data array below to a file, it would look exactly like the input file.
So, you need to tell us what it is you want to accomplish before we can give you an answer about how to do it.
This script follows these rules (exactly):
Find a line that begins with "AA",
and save that into $line
Concatenate every new line from the
file into $line
When you find a line that begins with
a backslash \, stop concatenating
lines and save $line into #data.
Then, find the next line that begins
with "AA" and start the loop over.
These matching regexes are pretty loose, as they will match AAARGH and \bonkers as well. If you need them stricter, you can try /^\\$/ and /^AA$/, but then you need to watch out for whitespace at the beginning and end of line. So perhaps /^\s*\\\s*$/ and /^\s*AA\s*$/ instead.
The code:
use warnings;
use strict;
my $line="";
my #data;
while (<DATA>) {
if (/^AA/) {
$line = $_;
while (<DATA>) {
$line .= $_;
last if /^\\/;
}
}
push #data, $line;
}
use Data::Dumper;
print Dumper \#data;
__DATA__
AA c0001
BB afsfjgfjgjgjflffbg
CC table
DD hhhfsegsksgk
EB jksgksjs
\
AA e0002
BB rejwkghewhgsejkhrj
CC chair
DD egrhjrhojohkhkhrkfs
VB rkgjehkrkhkh;r
\
The problem I am having is when I have a Perl script reading data (PE Executable) via STDIN and the stream contains a line terminator "0A" the conversion to hex misses it. Then when I convert the hex data back it is corrupted (missing 0A in the hex format). So how can I detect the "windows" version of line feed "0A" in Perl?
Note: Linux OS (Perl) is reading a Windows PE
!usr/bin/perl
while($line = <STDIN>)
{
chomp($line);
#bytes = split //, $line;
foreach (#bytes)
{
printf "%02lx", ord $_;
}
}
Usage example:
[root#mybox test]# cat test.exe | perl encoder.pl > output
In your loop, you are running chomp on each input line. This is removing whatever value is currently in $/ from the end of your line. Chances are this is 0x0a, and that's where the value is going. Try removing chomp($line) from your loop.
In general, using line oriented reading doesn't make sense for binary files that are themselves not line oriented. You should take a look at the lower level read function which allows you to read a block of bytes from the file without caring what those bytes are. You can then process your data in blocks instead of lines.
#With split
cat magic.exe | perl -e 'print join("", map { sprintf("\\x%02x", ord($_)) } split(//, join("", <STDIN>)))' > hex_encoded_binary
#With pack
cat magic.exe| perl -e 'print join("", map { "\\x" . $_ } unpack("H*", join("", <STDIN>)) =~ /.{2}/gs)' > hex_encoded_binary
In Perl most of my print statements take the form
print "hello." . "\n";
Is there a nice way to avoid keeping all the pesky "\n"s lying around?
I know I could make a new function such as myprint that automatically appends \n, but it would be nice if I could override the existing print.
Raku (Perl 6) has the say function that automatically appends \n.
You can also use say in Perl 5.10 or 5.12 if you add
use feature qw(say);
to the beginning of your program. Or you can use Modern::Perl to get this and other features.
See perldoc feature for more details.
You can use the -l option in the she-bang header:
#!/usr/bin/perl -l
$text = "hello";
print $text;
print $text;
Output:
hello
hello
See "-l[octnum]" in perlrun(1) for details.
If Perl 5.10+ is not an option, here is a quick and dirty approximation. It's not exactly the same, since say has some magic when its first arg is a handle, but for printing to STDOUT:
sub say {print #_, "\n"}
say 'hello';
The way you're writing your print statement is unnecessarily verbose. There's no need to separate the newline into its own string. This is sufficient.
print "hello.\n";
This realization will probably make your coding easier in general.
In addition to using use feature "say" or use 5.10.0 or use Modern::Perl to get the built in say feature, I'm going to pimp perl5i which turns on a lot of sensible missing Perl 5 features by default.
Perhaps you want to change your output record separator to linefeed with:
local $\ = "\n";
$ perl -e 'print q{hello};print q{goodbye}' | od -c
0000000 h e l l o g o o d b y e
0000014
$ perl -e '$\ = qq{\n}; print q{hello};print q{goodbye}' | od -c
0000000 h e l l o \n g o o d b y e \n
0000016
Update: my answer speaks to capability rather than advisability. I don't regard adding "\n" at the end of lines to be a "pesky" chore, but if someone really wants to avoid them, this is one way. If I had to maintain a bit of code that uses this technique, I'd probably refactor it out pronto.
Here's what I found at https://perldoc.perl.org/perlvar.html:
$\
The output record separator for the print operator. If defined, this value is
printed after the last of print's arguments. Default is undef.
You cannot call output_record_separator() on a handle, only as a static method.
See IO::Handle.
Mnemonic: you set $\ instead of adding "\n" at the end of the print. Also, it's
just like $/ , but it's what you get "back" from Perl.
example:
$\ = "\n";
print "a newline will be appended to the end of this line automatically";
In Raku (Perl 6) there is, the say function.
If you're stuck with pre-5.10, then the solutions provided above will not fully replicate the say function. For example
sub say { print #_, "\n"; }
Will not work with invocations such as
say for #arr;
or
for (#arr) {
say;
}
... because the above function does not act on the implicit global $_ like print and the real say function.
To more closely replicate the perl 5.10+ say you want this function
sub say {
if (#_) { print #_, "\n"; }
else { print $_, "\n"; }
}
Which now acts like this
my #arr = qw( alpha beta gamma );
say #arr;
# OUTPUT
# alphabetagamma
#
say for #arr;
# OUTPUT
# alpha
# beta
# gamma
#
The say builtin in perl6 behaves a little differently. Invoking it with say #arr or #arr.say will not just concatenate the array items, but instead prints them separated with the list separator. To replicate this in perl5 you would do this
sub say {
if (#_) { print join($", #_) . "\n"; }
else { print $_ . "\n"; }
}
$" is the global list separator variable, or if you're using English.pm then is is $LIST_SEPARATOR
It will now act more like perl6, like so
say #arr;
# OUTPUT
# alpha beta gamma
#
As requested:
sub myprint { print #_, "\n"; }
myprint "foo", 3 . 'bar'
And what was the actual problem?
Maybe try understanding the language you have before trying to change it.
Man perlvar(1) says: "Within a subroutine the array #_ contains the parameters passed to that subroutine."