Perl: read text by paragraph and pattern-match each paragraph

I have an index file that stores one entry per object, with entries separated by blank lines. I need to search each object for a keyword and, if it is present, dump that object to another file, instead of rebuilding the entire index from scratch. Here is the code:
# @files is an array that contains the list of packages in the index
open("FH", $indexfile) or die ;
my @linearray = <FH>;
close ("FH");
open (NFH, '>', "$tmpfile") or die "cannot create";
foreach my $pattern (@files)
{
    if (my @matches = grep /$pattern/, @linearray) {
        print NFH "@matches";
    } else {
        push @newpkgs,$pattern;
    }
}
close (NFH);
But this is not working as expected. How can I get a paragraph as an element in an array?

Modify $/, the input record separator:
use Data::Dumper;
my @linearray;
{
    open(my $fh, '<', $indexfile) or die "Cannot open $indexfile: $!";
    local $/ = '';    # paragraph mode
    @linearray = <$fh>;
    close($fh);
    print Dumper @linearray;
}
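This gives you one paragraph per array element. Setting $/ to the empty string puts Perl in paragraph mode, where one or more consecutive blank lines end a record.
I used the braces {} to limit the scope of the input record separator change to the block in which the objects are read into the array. You can extend the block as required.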
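Tying this back to the original goal, a minimal sketch of the whole flow might look like the following. The sample data, the `Package:` field layout, and the names `read_paragraphs`, `@kept`, and `$index` are assumptions made for illustration; an in-memory filehandle stands in for the real index file.

```perl
use strict;
use warnings;

# Hypothetical sample index: three objects separated by blank lines.
my $index = "Package: foo\nVersion: 1.0\n\n"
          . "Package: bar\nVersion: 2.1\n\n\n"
          . "Package: baz\nVersion: 0.3\n";

# Read everything from $fh in paragraph mode, one object per element.
sub read_paragraphs {
    my ($fh) = @_;
    local $/ = '';              # paragraph mode; restored when the sub returns
    my @paragraphs = <$fh>;
    return @paragraphs;
}

open(my $fh, '<', \$index) or die "Cannot open in-memory index: $!";
my @objects = read_paragraphs($fh);
close($fh);

my @files = ('foo', 'baz', 'qux');   # hypothetical package names to look for
my (@kept, @newpkgs);
foreach my $pattern (@files) {
    # \Q...\E quotes any regex metacharacters in the package name
    my @matches = grep { /^Package: \Q$pattern\E$/m } @objects;
    if (@matches) {
        push @kept, @matches;        # whole objects, not single lines
    } else {
        push @newpkgs, $pattern;     # not in the index yet
    }
}

print @kept;
print "New packages: @newpkgs\n";
```

Because each element of `@objects` is a whole paragraph, `grep` now selects complete objects rather than individual lines, which is what the question's line-based version was missing.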

Related

Picking a specific line with a specific string

I am trying to use Perl to pick out the one complete line in a document that contains "CURRENT_RUN_ID". I have been using the code below to do this, but execution never enters the while loop.
my $sSuccessString = "CURRENT_RUN_ID";
open(LOG, "$slogfile") or die("Can't open $slogfile\n");
my $sLines;
{
local $/ = undef;
$sLines=<LOG>;
}
my $spool = 0;
my @matchingLines;
while (<LOG>)
{
print OUTLOG "in while loop\n";
if (m/$sSuccessString/i) {
print OUTLOG "in if loop\n";
$spool = 1;
print map { "$_ \n" } @matchingLines;
@matchingLines = ();
}
if ($spool) {
push (@matchingLines, $_);
}
}
You are already done reading from the filehandle LOG after you have slurped it into $sLines. <LOG> in the head of the while will return undef because it has reached eof. You either have to use that variable $sLines in your while loop or get rid of it. You're not using it anyway.
If you only want to print the line that matches, all you need to do is this:
use strict;
use warnings;
open my $fh_in, '<', 'input_file' or die $!;
open my $fh_out, '>', 'output_file' or die $!;
while (my $line = <$fh_in>) {
print $fh_out $line if $line =~ m/CURRENT_RUN_ID/;
}
close $fh_in;
close $fh_out;
When you execute this code:
$sLines=<LOG>;
it reads all of the data from LOG into $sLines and it leaves the file pointer for LOG at the end of the file. So when you next try to read from that file handle with:
while (<LOG>)
nothing is returned as there is no more data to read.
If you want to read the file twice, then you will need to use the seek() function to reset the file pointer before your second read.
seek LOG, 0, 0;
But, given that you never do anything with $sLines I suspect that you can probably just remove that whole section of the code.
The whole thing with $spool and @matchingLines seems strange too. What were you trying to achieve there?
I think your code can be simplified to just:
my $sSuccessString = "CURRENT_RUN_ID";
open(LOG, $slogfile) or die("Can't open $slogfile\n");
while (<LOG>) {
print OUTLOG if /$sSuccessString/i;
}
Personally, I'd make it even simpler, by reading from STDIN and writing to STDOUT.
my $sSuccessString = 'CURRENT_RUN_ID';
while (<>) {
print if /$sSuccessString/i;
}
And then using Unix I/O redirection to connect up the correct files.
$ ./this_filter.pl < your_input.log > your_output.log

How do I read in an editable file that contains words that I don't want stemmed using Lingua::Stem's add_exceptions($exceptions_hash_ref) in perl?

I am using Perl's Lingua::Stem module and I want a text file (or other editable file format) that contains a list of words I do not want stemmed. I want to be able to add words to the file at any time.
Their example shows:
add_exceptions($exceptions_hash_ref);
What is the best way to do this?
I used their method in hard coding some exceptions, but I want to do this with a file.
# adding default exceptions
Lingua::Stem::add_exceptions({ 'emily' => 'emily',
'driven' => 'driven',
});
You can define a function to load exceptions from the given file:
sub load_exceptions {
my $fname = shift;
my %list;
open (my $in, "<", $fname) or die("load_exceptions: cannot open $fname: $!");
while (<$in>) {
chomp;
$list{$_} = $_;
}
close $in;
return \%list;
}
And use it:
Lingua::Stem::add_exceptions(load_exceptions("notstem.txt"));
Example input file:
emily
driven
Assuming your "editable" file is whitespace separated, like so:
emily emily
driven driven
Your code could be:
open my $fh, "<", "excep.txt" or die $!;
my $href = { map split, <$fh> };
Lingua::Stem::add_exceptions($href);

perl file read, truncate

I am trying to modify a config file.
I first read it into @buffer, keeping only the lines that match a regex.
The modified buffer then gets written back to disk; in case the file got smaller, a truncation is done.
Unfortunately this does not work: it already crashes at fseek, although as far as I can tell my usage of fseek conforms to the Perl documentation.
open (my $file, "+<", "somefilethatexists.txt");
flock ($file, LOCK_EX);
foreach my $line (<$file>) {
if ($line =~ m/(something)*/) {
push (@buffer, $line);
}
}
print "A\n";
seek($file,0,0); #seek to the beginning, we read some data already
print "B\n"; # never appears
write($file, join('\n',@buffer)); #write new data
truncate($file, tell($file)); #get rid of everything beyond the just written data
flock($file, LOCK_UN);
close ($file);
perlopentut says this about Mixing Reads and Writes
... when it comes to updating a file ... you probably don't want to
use this approach for updating.
You should use Tie::File for this. It opens the file for both read and write on the same filehandle and allows you to treat a file as an array of lines
use strict;
use warnings;
use Tie::File;
tie my @file, 'Tie::File', 'somefilethatexists.txt' or die $!;
for (my $i = 0; $i < @file; ) {
    if ($file[$i] =~ m/(something)*/) {
        $i++;    # keep this line and move on
    }
    else {
        splice @file, $i, 1;    # delete the line; the next one slides into index $i
    }
}
untie @file;
Where are your fseek(), fwrite() and ftruncate() functions defined? Perl doesn't have those functions. You should be using seek(), print() (or syswrite()) and truncate(). We can't really help you if you're using functions that we know nothing about.
You also don't need (and probably don't want) that explicit call to unlock the file or the call to close the file. The filehandle will be closed and unlocked as soon as your $file variable goes out of scope.
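For reference, a minimal sketch of the seek(), print() and truncate() sequence named above might look like this. The sub name `filter_file_in_place` and its callback interface are assumptions made for illustration, not part of the original code.

```perl
use strict;
use warnings;
use Fcntl qw(:flock SEEK_SET);

# Rewrite $path in place, keeping only the lines for which $keep->($line)
# returns true. Returns the number of lines kept.
sub filter_file_in_place {
    my ($path, $keep) = @_;

    open(my $fh, '+<', $path) or die "Cannot open $path: $!";
    flock($fh, LOCK_EX)       or die "Cannot lock $path: $!";

    # Read every line first, keeping only the wanted ones.
    my @buffer = grep { $keep->($_) } <$fh>;

    seek($fh, 0, SEEK_SET)    or die "Cannot seek in $path: $!";
    print {$fh} @buffer       or die "Cannot write to $path: $!";

    # Cut off whatever is left over from the old, longer content.
    truncate($fh, tell($fh))  or die "Cannot truncate $path: $!";

    close($fh)                or die "Cannot close $path: $!";   # also releases the lock
    return scalar @buffer;
}
```

It could then be called as `filter_file_in_place('somefilethatexists.txt', sub { $_[0] =~ /something/ })`. Note that the lines in the buffer already end in a newline, so they are printed as they are rather than joined with a separator.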
Maybe you can try this:
$^I = '.bak';
@ARGV = ('somefilethatexists.txt');
while (<>) {
if (/(something)*/) {
print;
}
}

File handle array

I wanted to choose what data to put into which file depending on the index. However, I seem to be stuck with the following.
I have created the files using an array of file handles:
my @file_h;
my $file;
foreach $file (0..11)
{
$file_h[$file]= new IO::File ">seq.$file.fastq";
}
$file= index;
print $file_h[$file] "$record_r1[0]$record_r1[1]$record_r1[2]$record_r1[3]\n";
However, I get an error for some reason in the last line. Help anyone....?
That should simply be:
my @file_h;
for my $file (0..11) {
open($file_h[$file], ">", "seq.$file.fastq")
|| die "cannot open seq.$file.fastq: $!";
}
# then later load up $some_index and then print
print { $file_h[$some_index] } @record_r1[0..3], "\n";
You can always use the object-oriented syntax:
$file_h[$file]->print("$record_r1[0]$record_r1[1]$record_r1[2]$record_r1[3]\n");
Also, you can print out the array more simply:
$file_h[$file]->print(@record_r1[0..3],"\n");
Or like this, if those four elements are actually the whole thing:
$file_h[$file]->print("@record_r1\n");
Try assigning the $file_h[$file] to a temporary variable first:
my @file_h;
my $file;
my $current_file;
foreach $file (0..11)
{
$file_h[$file]= new IO::File ">seq.$file.fastq";
}
$file= index;
$current_file = $file_h[$file];
print $current_file "$record_r1[0]$record_r1[1]$record_r1[2]$record_r1[3]\n";
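As far as I remember, Perl doesn't recognize it as an output handle otherwise and complains about invalid syntax: in the filehandle slot of print it accepts only a bareword, a plain scalar variable, or a block that returns the handle.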
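The approaches from the answers above can be put together in one self-contained sketch. The scratch directory, the three-file range, and the sample record in `@record` are assumptions made so the example runs anywhere; substitute the real file names and data.

```perl
use strict;
use warnings;
use IO::Handle;                       # makes ->print available on lexical handles
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);      # scratch directory, removed at exit

# One lexical filehandle per output file, stored in an array.
my @file_h;
for my $i (0 .. 2) {
    open($file_h[$i], '>', "$dir/seq.$i.fastq")
        or die "Cannot open $dir/seq.$i.fastq: $!";
}

# Hypothetical four-line FASTQ record; each element ends in a newline.
my @record = ("\@read1\n", "ACGT\n", "+\n", "IIII\n");
my $index  = 1;                       # chosen by whatever rule the program uses

print { $file_h[$index] } @record;    # block form: any expression may yield the handle
$file_h[$index]->print(@record);      # method form, equivalent to the line above

close($_) or die "Cannot close: $!" for @file_h;
```

Both print calls write the same record to seq.1.fastq, so either style can be used on its own.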

Can a Perl subroutine return data but keep processing?

Is there any way to have a subroutine send data back while still processing? For instance (this example used simply to illustrate) - a subroutine reads a file. While it is reading through the file, if some condition is met, then "return" that line and keep processing. I know there are those that will answer - why would you want to do that? and why don't you just ...?, but I really would like to know if this is possible.
A common way to implement this type of functionality is with a callback function:
{
open my $log, '>', 'logfile' or die $!;
sub log_line {print $log @_}
}
sub process_file {
my ($filename, $callback) = @_;
open my $file, '<', $filename or die $!;
local $_;
while (<$file>) {
if (/some condition/) {
$callback->($_)
}
# whatever other processing you need ....
}
}
process_file 'myfile.txt', \&log_line;
or without even naming the callback:
process_file 'myfile.txt', sub {print STDERR @_};
Some languages offer this sort of feature using "generators" or "coroutines", but Perl does not. The generator page linked above has examples in Python, C#, and Ruby (among others).
The Coro module looks like it would be useful for this problem, though I have no idea how it works and no idea whether it does what it advertises.
The easiest way to do this in Perl is probably with an iterator-type solution. For example, here we have a subroutine which forms a closure over a filehandle:
open my $fh, '<', 'some_file.txt' or die $!;
my $iter = sub {
while( my $line = <$fh> ) {
return $line if $line =~ /foo/;
}
return;
};
The sub iterates over the lines until it finds one matching the pattern /foo/ and then returns it, or else returns nothing (undef in scalar context). Because the filehandle $fh is defined outside the scope of the sub, it remains resident in memory between calls. Most importantly, its state, including the current seek position in the file, is retained. So each call to the subroutine resumes reading the file where it last left off.
To use the iterator:
while( defined( my $next_line = $iter->() ) ) {
# do something with each line here
}
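If several files or patterns are involved, the same idea can be wrapped in a small factory so that each iterator closes over its own handle. This is only a sketch: `make_line_iterator` and the sample data are illustrative names, and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# Return an iterator that yields the next line of $fh matching $regex,
# or undef once the input is exhausted.
sub make_line_iterator {
    my ($fh, $regex) = @_;
    return sub {
        while ( defined( my $line = <$fh> ) ) {
            return $line if $line =~ $regex;
        }
        return;                       # undef in scalar context: nothing left
    };
}

# An in-memory filehandle stands in for a real file here.
my $data = "foo one\nbar\nfoo two\nbaz\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

my $next_foo = make_line_iterator($fh, qr/foo/);
my @found;
while ( defined( my $line = $next_foo->() ) ) {
    push @found, $line;               # other work can happen between calls
}
print @found;
```

The caller decides when the next line is fetched, so the file is read lazily and processing can be interleaved with whatever else the program does.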
If you really want to do this, you can use threading. One option is to start a separate thread that reads the file and, when it finds a matching line, places it in an array that is shared between threads. The other thread can then take the lines as they are found and process them. Here is an example that reads a file, looks for an 'X' in each line, and performs an action when one is found.
use strict;
use threads;
use threads::shared;
my @ary : shared;
my $thr = threads->create('file_reader');
while(1){
my ($value);
{
lock(@ary);
if ($#ary > -1){
$value = shift(@ary);
print "Found a line to process: $value\n";
}
else{
print "no more lines to process...\n";
}
}
sleep(1);
#process $value
}
sub file_reader{
#File input
open(INPUT, "<test.txt");
while(<INPUT>){
my($line) = $_;
chomp($line);
print "reading $line\n";
if ($line =~ /X/){
print "pushing $line\n";
lock(@ary);
push @ary, $line;
}
sleep(4)
}
close(INPUT);
}
Try this code as the test.txt file:
line 1
line 2X
line 3
line 4X
line 5
line 6
line 7X
line 8
line 9
line 10
line 11
line 12X
If your language supports closures, you may be able to do something like the following (JavaScript-like pseudo-code).
Note that the function would not keep processing the file in the background; it runs only when you call it, so it may not be what you need.
function fileReader (filename) {
    var file = open(filename);
    return function () {
        var line;
        while (line = file.read()) {
            if (condition) {
                return line;
            }
        }
        return null;
    }
}
a = fileReader("myfile");
line1 = a();
line2 = a();
line3 = a();
What about a recursive sub? Passing the same open filehandle to each call does not reset its read position, so each call carries on from where the previous one left off.
Here is an example where the process_file subroutine prints out blank-line-separated "\n\n" paragraphs that contain foo.
sub process_file {
    my ($fileHandle) = @_;
    my $paragraph = '';
    while ( defined(my $line = <$fileHandle>) ) {
        $paragraph .= $line;
        last if $line =~ /^\s*$/;    # a blank line ends the paragraph
    }
    print $paragraph if $paragraph =~ /foo/;
    goto &process_file unless eof($fileHandle);
    # goto &sub replaces the current call frame, so the recursion
    # does not grow the stack; a plain loop around the body would also work
}
open my $fileHandle, '<', 'file.txt' or die "Cannot open file.txt: $!";
process_file($fileHandle);