I am trying to process each line in a file through a Perl script instead of sending the entire file to the Perl script and loading so much data into memory at once.
In a shell script, I began what I thought to be line iteration as follows:
while read line
do
perl script.pl --script=options "$line"
done < input
When I do this, how do I save the data to an output file with >> output?
while read line
do
perl script.pl --script=options "$line"
done < input
>> output
If it takes less memory to split the file first, I also had trouble with the for statement:
for file in /dev/*
do
split -l 1000 $file prefix
done < input
## Where do I save the output?
for file in /dev/out/*
do
perl script.pl --script=options
etc...
Which is the most memory-efficient way to do this?
You can also process your very big file line by line within the Perl script, without loading the entire file into memory. To do that, you just need to wrap the text of your current Perl script (which I hope doesn't read the whole file into memory any more :)) in a while loop. For example:
my $line;
while ($line = <>) {
    # your script text here, referring to $line instead of the parameter variable
}
And in this Perl script you can also write results to an output file. Say the result is stored in the variable $res; you can do it this way:
open (my $fh, ">>", "out") or die "ERROR: $!"; # opening a file descriptor
my $line;
while ($line = <>) {
    # your script text here, referring to $line instead of the parameter variable
    print $fh $res, "\n"; # writing to file descriptor
}
close $fh; # closing file descriptor
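Since the empty <> reads from the files named on the command line (or from standard input), you would then run the script as, for example:
perl script.pl input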
Try this:
while read line
do
perl script.pl --script=options "$line" >> "out"
done < input
"out" is a name of your output file.
I fixed my issue with:
split -l 100000 input /dev/shm/split/split.input.txt.
find /dev/shm/split/ -type f -name '*.txt.*' -exec perl script.pl --script=options {} + > output
This made my script process the files faster.
I want to make the same calculations in two similar files, but I do not want to duplicate the code for each file nor create two scripts for this.
use File::Copy;

my $file = "file1.txt";
my $tempfile = "file1_temp.txt";

if (not defined $file) {
    die "Input file not found";
}

open(my $inputFileHandler, '<:encoding(UTF-8)', $file)
    or die "Could not open file '$file' $!";
open(my $outs, '>', $tempfile) or die $!;

# Calculations made

close($outs);
copy($tempfile, $file) or die "Copy failed: $!";
unlink($tempfile) or die "Could not delete the file!\n";
close($inputFileHandler);
So I want to do the exact same calculations for file2_temp.txt and copy it into file2.txt. Is there a way to do it without writing the code again for file2?
Thank you very much.
Write your code as a Unix filter. Read the data from STDIN and write it to STDOUT. Your code will be simpler and your program will be more flexible.
#!/usr/bin/perl
use strict;
use warnings;

while (<STDIN>) {
    # do a calculation using the data that is in $_
    my $output_data = $_;   # placeholder: replace with your real calculation
    print $output_data;
}
The cleverness is in how you call the program:
$ ./my_clever_filter < file1.txt > file1_out.txt
$ ./my_clever_filter < file2.txt > file2_out.txt
See The Unix Filter Model: What, Why and How? for far more information.
Assuming your code is well written (not manipulating any globals, etc.), you could use a for loop:
foreach my $prefix ('file1', 'file2') {
my $file = $prefix . ".txt";
my $tempfile = $prefix . "_temp.txt";
...
}
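Putting it together, a minimal sketch under the assumption that your calculations are factored into a sub; process_file is a hypothetical name standing in for your own code:
use strict;
use warnings;
use File::Copy;

# hypothetical sub standing in for your calculations
sub process_file {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        print $out $line;   # replace with your real calculations
    }
}

foreach my $prefix ('file1', 'file2') {
    my $file     = $prefix . ".txt";
    my $tempfile = $prefix . "_temp.txt";

    open(my $inputFileHandler, '<:encoding(UTF-8)', $file)
        or die "Could not open file '$file' $!";
    open(my $outs, '>', $tempfile) or die $!;
    process_file($inputFileHandler, $outs);
    close($outs);
    close($inputFileHandler);

    copy($tempfile, $file) or die "Copy failed: $!";
    unlink($tempfile) or die "Could not delete the file!\n";
}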
There is a Perl feature designed especially for cases like this:
$ perl -pi -e'# calculations made' file1.txt file2.txt ... fileN.txt
Loosely referred to as "in-place edit", it basically does what your code does: it writes to a temp file and then overwrites the original. It applies your calculations to each of the files named as arguments. If you have complex calculations you can put them in a file and skip the -e'....' part:
$ perl -pi foo.pl file1.txt ...
Say for example that your "calculations" consist of incrementing each pair of numbers by 1:
s/(\d+) (\d+)/($1 + 1) . " " . ($2 + 1)/ge
You would do either
$ perl -pi -e's/(\d+) (\d+)/($1 + 1) . " " . ($2 + 1)/ge' file1.txt file2.txt
or
$ perl -pi foo.pl file1.txt file2.txt
Where foo.pl contains the code.
Be aware that the -i switch is destructive, so make backups before running the command. You can supply a backup extension to save a backup, but that backup is overwritten if you run the command again. E.g. -i.bak.
-p places a while (<>) loop around your code, followed by a print of each line,
-i.bak does the editing of the original file, and saves a backup with the extension, if it is supplied.
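To see why this works: with -p, Perl wraps your code roughly like this before running it (a simplified sketch of what perlrun documents, not the literal source):
# rough equivalent of: perl -p -e'...your code...' file1.txt file2.txt
while (<>) {
    # ...your code runs here, modifying $_...
} continue {
    print;   # -p prints $_ after every line
}
# adding -i.bak sends that print back into each file,
# saving the originals as file1.txt.bak, file2.txt.bak, ...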
I'm running a Perl script, which in turn calls a batch script. I need to pass 3 parameters to the batch script. I'm passing the parameters since it is easier to read a file in the Perl script and capture the desired value. But my script is erroring out with 'The system cannot find the path specified.' I'm using the code below:
while (<FILE>)
{
    ($file, $rcc, $loc) = split(',');
    my @lines = qx/"D:\\SiebelAdmin\\Commands\\WinFile_Move.bat $file $rcc $loc"/;
}
Remove the double quotes. With them, the system interprets the whole line as a command, not as a command with parameters.
my @lines = qx/D:\\SiebelAdmin\\Commands\\WinFile_Move.bat $file $rcc $loc/;
Please check if this works for you.
I have created a sample batch script which takes two args and prints them to the prompt.
The script is located on the desktop.
Code of batch script
@echo off
set arg1=%1
set arg2=%2
shift
shift
echo %arg1%
echo %arg2%
Output of batch script
C:\Users\Administrator\Desktop>a.bat perl5.8 perl5.18
perl5.8
perl5.18
C:\Users\Administrator\Desktop>
Now I have created the Perl script which calls this batch script. This Perl script is present in the C drive.
Code for perl script
my $bat_file = 'C:\Users\Administrator\Desktop\a.bat';
my $arg1 = 'perl5.8';
my $arg2 = 'perl5.18';
my @lines = `$bat_file $arg1 $arg2`;
print @lines;
Output of perl script
C:\>perl tmp.pl
perl5.8
perl5.18
You can do it like this:
Perl file:
my $arg = "hey";
my $bat_file_loc = "C:\\abc.bat";
system($bat_file_loc,$arg);
Batch file:
set arg1=%1
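The design choice between the two: system runs the batch file with its output going to the console and returns only the exit status, while backticks (as in the earlier answer) capture the output. A minimal sketch of the difference, reusing the hypothetical C:\abc.bat:
my $bat_file_loc = "C:\\abc.bat";
my $status = system($bat_file_loc, "hey");   # output appears on the console; returns exit status
my @output = `$bat_file_loc hey`;            # output is captured into @output
print @output;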
I want to search for a string and get the full line from a text file through Perl scripting.
So the text file will be like the following.
data-key-1,col-1.1,col-1.2
data-key-2,col-2.1,col-2.2
data-key-3,col-3.1,col-3.2
Here I want to apply data-key-1 as the search string and get the full line into a Perl variable.
Here I want the exact equivalent of grep "data-key-1" data.csv in the shell.
Some syntax like the following worked when run from the console:
perl -wln -e 'print if /\bAPPLE\b/' your_file
But how can I place it in a script? We can't put the perl command itself into a script. Is there a way to avoid the loops?
If you knew what the command line options you are giving your one-liner do, you'd know exactly what to write inside your Perl script. When you read a file, you need a loop. The choice of loop can yield different performance: using a for loop to read a file is more expensive than using a while loop.
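A minimal sketch of that difference: a for loop evaluates the filehandle in list context, slurping the whole file into memory before the first iteration, while a while loop reads one line at a time:
open my $fh, '<', 'your_file' or die "Cannot open file: $!\n";
for my $line (<$fh>) {        # list context: the entire file is read into a list first
    # process $line
}
close $fh;

open $fh, '<', 'your_file' or die "Cannot open file: $!\n";
while (my $line = <$fh>) {    # scalar context: one line in memory at a time
    # process $line
}
close $fh;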
Your one-liner:
perl -wln -e 'print if /\bAPPLE\b/' your_file
is basically saying:
-w : Use warnings
-l : Chomp the newline character from each line before processing and place it back during printing.
-n : Create an implicit while(<>) { ... } loop to perform an action on each line
-e : Tell the perl interpreter to execute the code that follows it.
print if /\bAPPLE\b/ prints the entire line if it contains the word APPLE.
So to use the above inside a perl script, you'd do:
#!/usr/bin/perl
use strict;
use warnings;

open my $fh, '<', 'your_file' or die "Cannot open file: $!\n";
while (<$fh>) {
    next unless /\bAPPLE\b/;
    my $line = $_;
    # do something with $line
}
chomp is not really required here because you are not doing anything with the line other than checking for the existence of a word.
open(my $file, '<', 'filename') or die "Cannot open file: $!";
while (<$file>) {
    print $_ if ($_ =~ /^data-key-3,/);
}
use strict;
use warnings;

# the file name of your .csv file
my $file = 'data.csv';

# open the file for reading
open(FILE, "<$file") or
    die("Could not open log file. $!\n");

my @final;

# process line by line:
while (<FILE>) {
    my($line) = $_;
    # remove any trailing space (the newline)
    # not necessary, but again, good habit
    chomp($line);
    my @result = grep (/data-key-1/, $line);
    push (@final, @result);
}
print @final;
I currently have an issue with reading files in one directory.
I need to take all the fastq files in a folder and run the script for each file, then put the new files in an 'Edited_sequences' folder.
The one script I had is
perl -ne '$i++; if($i<80001){print}' BM2003_TCCCAGAACAAC_L001_R1_001.fastq > ./Edited_sequences/BM2003_TCCCAGAACAAC_L001_R1_001.fastq
It takes the first 80000 lines in one fastq file then outputs the result.
Now if, for example, I have 2000 fastq files, I would need to copy and paste the command 2000 times.
I know there is a glob command suited to this situation, but I just do not know how to use it.
Please help me out.
You can use Perl to do the copy/paste for you. The first argument, *.fastq, matches all the fastq files, and the second, ./Edited_sequences, is the target folder for the new files:
perl -e '$d=pop; `head -80000 "$_" > "$d/$_"` for @ARGV' *.fastq ./Edited_sequences
glob gets you an array of filenames matching a particular expression. It's frequently used with <> angle brackets, a lot like reading input (you can think of it as reading filenames from a directory).
This is a simple example that will print the names of every ".fastq" file in the current directory:
print "$_\n" for <*.fastq>;
The important part is <*.fastq>, which gives us an array of filenames matching that expression (in this case, a file extension). If you need to change which directory your Perl script is working in, you can use chdir.
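Equivalently, you can call glob as a function and collect all the names up front (handy if you want a count first); /path/to/data here is a hypothetical directory, used only to show chdir:
chdir('/path/to/data') or die $!;    # hypothetical directory; skip if you are already there
my @files = glob('*.fastq');         # same matching as <*.fastq>
print scalar(@files), " fastq files found\n";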
From there, we can process your files as needed:
while (my $filename = <*.fastq>) {
    open(my $in, '<', $filename) or die $!;
    open(my $out, '>', "./Edited_sequences/$filename") or die $!;
    for (1..80000) {
        my $line = <$in>;
        last unless defined $line;   # stop early if the file has fewer than 80000 lines
        print $out $line;
    }
    close $out;
    close $in;
}
You have two choices:
Use Perl to read in the 2000 files and run it as part of your program
Use the shell to pass each of those 2000 files to your command line
Here's the bash alternative:
for file in *.fastq
do
perl -ne '$i++; if($i<80001){print}' "$file" > "./Edited_sequences/$file"
done
Your same Perl script, but with the shell finding each file. This should work and not overload the command line. The for loop in bash, if handed a glob, can expand it correctly.
However, I always recommend that you don't actually execute the command, but echo the resulting commands into a file:
for file in *.fastq
do
echo "perl -ne '\$i++; if(\$i<80001){print}' \
\"$file\" > \"./Edited_sequences/$file\"" >> myoutput.txt
done
Then, you can look at myoutput.txt to make sure it looks good before you actually do any real harm. Once you've determined that myoutput.txt is a good file, you can execute that as a shell script:
$ bash myoutput.txt
I found this Perl script here which seems like it will work for my purposes. It opens a Unicode text file and reads each line so that a command can be run. But I cannot figure out how to run a certain ICU command on each line. Can someone help me out? The error I get is (largefile is the script name):
syntax error at ./largefile line 11, near "/ ."
Search pattern not terminated at ./largefile line 11.
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'test.txt';
open my $info, $file or die "Could not open $file: $!";
while( my $line = <$info>) {
do
LD_LIBRARY_PATH=icu/source/lib/ ./a.out "$line" >> newtext.txt
done
}
close $info;
Basically I want to open a large text file and run the command "LD_LIBRARY_PATH=icu/source/lib/ ./a.out "$line" >> newtext.txt" on each line (it normally runs from the command line; I think how I call it in the Perl script is the problem, but I don't know how to fix it), so that "newtext.txt" is populated with all the lines after they have been processed by the script. The ICU part is breaking words for Khmer.
Any help would be much appreciated! I'm not much of a programmer... Thanks!
For executing terminal commands, the command needs to be wrapped in system(), so change the line to:
system("LD_LIBRARY_PATH=icu/source/lib/ ./a.out $line >> newtext.txt");
Have you tried backticks:
while (my $line = <$info>) {
    `LD_LIBRARY_PATH=icu/source/lib/ ./a.out "$line" >> newtext.txt`;
    last if $. == 2;   # stop after two lines while testing; remove this to process the whole file
}