How do I delete a random value from an array in Perl? - perl

I'm learning Perl and building an application that gets a random line from a file using this code:
use List::Util qw(shuffle);   # shuffle() comes from List::Util
open(my $random_name, "<", "out.txt") or die "Cannot open out.txt: $!";
my @array = shuffle(<$random_name>);
chomp @array;
close($random_name) or die "Error when trying to close $random_name: $!";
print shift @array;
But now I want to delete this random name from the file. How can I do this?

shift already deletes a name from the array.
So does pop (one removes from the beginning, the other from the end). I would suggest using pop, as it may be more efficient, and since the list is shuffled you don't care which one you use.
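For example, with a small shuffled list:
my @array = ('carol', 'alice', 'bob');   # already shuffled
my $first = shift @array;                # removes and returns the first element ('carol')
my $last  = pop @array;                  # removes and returns the last element ('bob')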
Or do you need to delete it from a file?
If that's the case, you need to:
A. Get a count of the lines in the file (if it is small, read it all into memory using File::Slurp; if it is large, either read it line by line and count, or simply run the wc -l $filename command via backticks).
B. Generate a random number from 1 to the number of lines (say, $random_line_number).
C. Read the file line by line. For every line read, write it to another temp file (use File::Temp to generate temp files), except do NOT write the line numbered $random_line_number.
D. Close the temp file and move it over your original file. A sketch of these steps is shown below.
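A minimal sketch of steps A-D, assuming the names live one per line in out.txt (the file name from the question):
use strict;
use warnings;
use File::Temp qw(tempfile);

my $file = 'out.txt';

# A. Count the lines.
open my $in, '<', $file or die "Cannot open $file: $!";
my $count = 0;
$count++ while <$in>;
close $in;

# B. Pick a random line number (1-based).
my $random_line_number = 1 + int rand $count;

# C. Copy every line except the chosen one into a temp file.
open $in, '<', $file or die "Cannot open $file: $!";
my ($tmp_fh, $tmp_name) = tempfile(DIR => '.', UNLINK => 0);
while (my $line = <$in>) {
    print {$tmp_fh} $line unless $. == $random_line_number;   # $. is the current line number
}
close $in;
close $tmp_fh;

# D. Move the temp file over the original.
rename $tmp_name, $file or die "Cannot rename $tmp_name to $file: $!";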
If the list contains filenames and you need to delete the file itself (the random file), use the unlink() function. Don't forget to check the return code from unlink() and, as with any I/O operation, print an error message containing $!, which holds the text of the system error on failure.
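For example, with the name to delete held in a hypothetical $random_file:
unlink($random_file) or warn "Could not delete $random_file: $!";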
Done.

When you say "delete this … from the list", do you mean delete it from the file? If you simply mean remove it from @array, then you've already done that by using shift. If you want it removed from the file, and the order doesn't matter, simply write the remaining names in @array back into the file. If the file order does matter, you're going to have to do something slightly more complicated, such as reopening the file, reading the items in order except for the one you don't want, and then writing them all back out again. Either that, or take more notice of the order when you read the file.
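If the order doesn't matter, writing the remainder back is short (a sketch, reusing @array and out.txt from the question):
open my $out, '>', 'out.txt' or die "Cannot write out.txt: $!";
print {$out} "$_\n" for @array;   # @array no longer contains the shifted name
close $out or die "Error closing out.txt: $!";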

If you need to delete a line from a file (it's not entirely clear from your question), one of the simplest and most efficient ways is to use Tie::File to manipulate the file as if it were an array. Otherwise, perlfaq5 explains how to do it the long way.
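A sketch with Tie::File (using out.txt from the question and a random index):
use Tie::File;

tie my @lines, 'Tie::File', 'out.txt' or die "Cannot tie out.txt: $!";
my $index = int rand @lines;                  # pick a random line number
my ($removed) = splice(@lines, $index, 1);    # remove it from the file itself
untie @lines;
print "Removed: $removed\n";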

Related

Tailing already opened file in Perl

I have one child process that dynamically writes some information into a file. Sometimes I need to get the last N lines of this file. But while the parent process is reading the file, the child will continue writing to it.
I have read that there is no sense in locking and unlocking it, but I am not sure. I will not write anything from the parent process, so I only need to open it for reading.
I have found the module File::Tail, but I don't understand how to use it to get the last N lines; please provide a simple example.
Also, I need advice: is it necessary to use locking in this case?
To read the last N lines of a file you can use the CPAN module File::ReadBackwards.
use File::ReadBackwards;
my $lastlines = File::ReadBackwards->new("file.txt") or die "Cannot open file.txt: $!";
print reverse map { $lastlines->readline() } (1 .. 2);
This will print the last 2 lines of the file. Replace 2 with whatever number of lines you want.

Perl Skipping Some Files When Looping Across Files and Writing Their Contents to Output File

I'm having an issue with Perl and I'm hoping someone here can help me figure out what's going on. I have about 130,000 .txt files in a directory called RawData and I have a Perl program that loads them into an array, then loops through this array, loading each .txt file. For simplicity, suppose I have four text files I'm looping through:
File1.txt
File2.txt
File3.txt
File4.txt
The contents of each .txt file look something like this:
007 C03XXYY ZZZZ
008 A01XXYY ZZZZ
009 A02XXYY ZZZZ
where X,Y,Z are digits. In my simplified code below, the program then pulls out just line 007 in each .txt file, saves XX as ID, ignores YY and grabs the variable data ZZZZ that I've called VarVal. Then it writes everything to a file with a header specified in the code below:
#!/usr/bin/perl
use warnings;
use strict;
open(OUTFILE, "> ../Data/OutputFile.csv") or die $!;
opendir(MYDIR,"../RawData")||die $!;
my @txtfiles = grep { /\.txt$/ } readdir(MYDIR);
closedir(MYDIR);
print OUTFILE "ID,VarName,VarVal\n";
foreach my $txtfile (@txtfiles) {
    # Prints to the screen so I can see where I am in the loop.
    print $txtfile, "\n";
    open(INFILE, "< ../RawData/$txtfile") or die $!;
    while (<INFILE>) {
        if (m{^007 C03(\d{2})(\d+)(\s+)(.+)}) {
            print OUTFILE "$1,VarName,$4\n";
        }
    }
}
The issue I'm having is that the contents of, for example File3.txt, don't show up in OutputFile.csv. However, it's not an issue with Perl not finding a match because I checked that the if statement is being executed by deleting OUTFILE and looking at what the code prints to the terminal screen. What shows up is exactly what should be there.
Furthermore, if I just run the problematic file (File3.txt) through the loop by itself, commenting out the opendir and closedir stuff and doing something like my @textfile = "File3.txt";, then the only data that shows up in OutputFile.csv IS what's in File3.txt. But when it goes through the full loop, it won't show up in OutputFile.csv. Plus, I know that File3.txt is being sent into the loop because I can see it being printed on the screen with print $txtfile, "\n";. I'm at a loss as to what is going on here.
The other issue is that I don't think it's something specific to this one particular file (maybe it is) but I can't just troubleshoot this one file because I have 130,000 files and I just happened to stumble across the fact that this one wasn't being written to the output file. So there may be other files that also aren't getting written, even though there is no obvious reason they shouldn't be just like the case of File3.txt.
Perhaps, because I'm looping over 130,000 files in rapid succession, there is some sort of I/O issue that randomly fails every so often to write the in-memory contents to the output file? That's my best guess, but I have no idea how to diagnose or fix this.
This is kind of a difficult question to debug, but I'm hoping someone on here has some insight or has seen similar problems that would provide me with a solution.
Thanks
There's nothing obviously wrong that I can see in your code. It is a little outdated; using autodie and lexical filehandles would be better.
However, I would recommend making your regex slightly less restrictive: allow variable-length spacing after the first value and allow the last capture to be empty. I'd also output the filename, so you can see which other files aren't being caught for whatever reason:
if (m{^007\s+C03(\d{2})\d+\s+(.*)}) {
    print OUTFILE "$txtfile $1,VarName,$2\n";
    last;
}
Finally, assuming there is only a single 007 C03 in each file, you could throw in a last call after one is found.
You may want to try sorting the @txtfiles list, then systematically looking through the output to see what is or isn't there. With 130k files in random order, it would be pretty difficult to be certain that you missed one. Perl gives you the files in the order they actually appear in the directory, which is different from user-level commands like ls, so the order may be different from what you'd expect.
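A sketch of the loop rewritten along those lines, with autodie, lexical filehandles, a sorted file list, the relaxed regex, the filename added as an extra output column, and last after the first match (same paths as the question):
#!/usr/bin/perl
use strict;
use warnings;
use autodie;

open my $outfile, '>', '../Data/OutputFile.csv';
opendir my $dir, '../RawData';
my @txtfiles = sort grep { /\.txt$/ } readdir $dir;
closedir $dir;

print {$outfile} "File,ID,VarName,VarVal\n";

for my $txtfile (@txtfiles) {
    open my $infile, '<', "../RawData/$txtfile";
    while (<$infile>) {
        if (m{^007\s+C03(\d{2})\d+\s+(.*)}) {
            print {$outfile} "$txtfile,$1,VarName,$2\n";
            last;   # assumes only one 007 C03 line per file
        }
    }
    close $infile;
}
close $outfile;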

perl: open filehandle, write into it, give it a name later on?

I think I've read how to do this somewhere but I can't find where. Maybe it's only possible in new(ish) versions of Perl. I am using 5.14.2:
I have a Perl script that writes down results into a file if certain criteria are met. It's more logical given the structure of the script to write down the results and later on check if the criteria to save the results into a file are met.
I think I've read somewhere that I can write content to a filehandle, which in Linux I guess will correspond to a temporary file or a pipe of some sort, and then give that file its name, including the directory where it should live, later on. If not, the content will be discarded when the script finishes.
Other than faffing around with temporary files and deleting them manually, is there a straightforward way of doing this in Perl?
There's no simple (UNIX) facility for what you describe, but the behavior can be composed out of basic system operations. Perl's File::Temp already does most of what you want:
use File::Temp;
my $tmp = File::Temp->new;   # Will be unlinked at end of program.
while ($work_to_do) {
    print $tmp a_lot_of_stuff();   # $tmp is a filehandle
}
if ($save_it) {
    rename($tmp, $new_file);   # $tmp is also a string. Move (rename) the file.
}                              # If you need this to work across filesystems, you
                               # might want to "use File::Copy qw(move)" instead.
exit;   # $tmp will be unlinked here if it was not renamed
I use File::Temp for this.
But you should bear in mind that File::Temp deletes the file by default. That is OK, but in my case I don't want it when debugging: if the script terminates and the output is not the desired one, I cannot check the temp file.
So I prefer to set $KEEP_ALL=1, or $fh->unlink_on_destroy( 0 ) when using the OO interface, or ($fh, $filename) = tempfile($template, UNLINK => 0), and then unlink the file myself or move it to its proper place.
It would also be safer to move the file after closing the filehandle (just in case there is some buffering going on). So I would prefer an approach where the temp file is not deleted by default, and then, when all is done, a conditional either deletes it or moves it to your desired place and name.
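A sketch of that approach; $keep_result and $final_path are made-up names assumed to be set elsewhere in the script:
use File::Temp qw(tempfile);
use File::Copy qw(move);

my ($fh, $tmpname) = tempfile(UNLINK => 0);   # not removed automatically
print {$fh} "results...\n";
close $fh or die "Error closing $tmpname: $!";   # close (and flush) before moving

if ($keep_result) {
    move($tmpname, $final_path) or die "Cannot move $tmpname to $final_path: $!";
} else {
    unlink $tmpname or warn "Cannot delete $tmpname: $!";
}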

Accessing a file in perl

In my script I am dealing with opening files and writing to files. I found that there is something wrong with a file I try to open: the file exists, it is not empty, and I am passing the right path to the filehandle.
I know that my question might sound weird, but while I was debugging my code I put the following command in my script to check some files:
system ("ls");
Then my script worked well; when it is removed, it no longer works correctly.
my @unique = ("test1", "test2");
open(unique_fh, ">orfs");
print unique_fh @unique;
open(ORF, "orfs") or die("file does not exist");
system("ls");
while (<ORF>) {
    split;
}
@neworfs = @_;
print @neworfs;
Perl buffers the output when you print to a file. In other words, it doesn't actually write to the file every time you say print; it saves up a bunch of data and writes it all at once. This is faster.
In your case, you couldn't see anything you had written to the file, because Perl hadn't written anything yet. Adding the system("ls") call, however, caused Perl to write your output first (the interpreter is smart enough to do this, because it thinks you might want to use the system() call to do something with the file you just created).
How do you get around this? You can close the file before you open it again to read it, as choroba suggested. Or you can disable buffering for that file. Put this code just after you open the file:
my $fh = select (unique_fh);
$|=1;
select ($fh);
Then anytime you print to the file, it will get written immediately ($| is a special variable that sets the output buffering behavior).
Closing the file first is probably a better idea, although it is possible to have a filehandle for reading and writing open at the same time.
You did not close the filehandle before trying to read from the same file.
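A minimal corrected sketch of the question's code, closing the write handle before reopening the file for reading (same file name, with lexical filehandles and an explicit split):
use strict;
use warnings;

my @unique = ("test1", "test2");

open my $out, '>', 'orfs' or die "Cannot write orfs: $!";
print {$out} "$_\n" for @unique;
close $out or die "Error closing orfs: $!";   # flushes the buffered output to disk

open my $in, '<', 'orfs' or die "Cannot read orfs: $!";
my @neworfs;
while (my $line = <$in>) {
    chomp $line;
    push @neworfs, split ' ', $line;
}
close $in;
print "@neworfs\n";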

How do I read a CSV file column by column in order to transpose it?

I have a data set in the following format:
snp,T2DG0200001,T2DG0200002,T2DG0200003,T2DG0200004
3_60162,AA,AA,AA,AA
3_61495,AA,AA,GA,GA
3_61466,GG,GG,CG,CG
The real data is much larger than this, extending to millions of rows and about a thousand columns. My eventual goal is to transpose this monstrosity and output the result in a text file (or CSV file or whatever, doesn't matter).
I need to feed the data to my computer piece by piece so as not to overload my memory. I read the CSV file line by line, and then transpose it, and write to file. I then loop back and repeat the steps, appending to the text file as I go.
The problem, of course, is that I am supposed to append to the text file column by column rather than row by row if the result is to be the transpose of the original data file. But a friend told me that is not feasible in Perl. I am wondering if I can read the data column by column instead. Is there something similar to the getline method I used in my original code,
while (my $row = $csv->getline ($fh)) {
that can return columns instead of rows? Something akin to a Unix cut command would be preferred, if it does not require loading the entire data into memory.
A CSV is simply a text file; it consists of one long sequential run of text characters, so there is no random access to columns. Ideally, you would put the CSV into a database, which would then be able to do this directly.
However, barring that, I believe you could do this in Perl with a little cleverness. My approach would be something like this:
my @filehandles;
my $line = 0;
while (my $row = $csv->getline($fh))
{
    # Open an output file for each column!
    if (not defined $filehandles[0])
    {
        for (0 .. $#$row)
        {
            my $handle;
            open $handle, '>', "column_$_.txt" or die "Oops!";
            push @filehandles, $handle;
        }
    }
    # Print each column to its respective output file.
    for (0 .. $#$row)
    {
        print {$filehandles[$_]} $row->[$_] . ",";
    }
    # This is going to take a LONG time, so show some sign of life.
    print '.' if (($line++ % 1000) == 0);
}
At the end, each column would be printed as a row in its own text file. Don't forget to close all the files, then open them all again for reading, then write them into a single output file one at a time. My guess is this would be slow, but fast enough to do millions of rows, as long as you don't have to do it very often. And it wouldn't face memory limitations.
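A sketch of that final stitching step, assuming the column_*.txt files written above and a made-up transposed.csv output file; each column file holds one row of the transposed data with a trailing comma to trim:
use strict;
use warnings;

open my $out, '>', 'transposed.csv' or die "Cannot write transposed.csv: $!";
# Sort the column files numerically so column_10 comes after column_9.
for my $colfile (sort { ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0] } glob 'column_*.txt') {
    open my $in, '<', $colfile or die "Cannot read $colfile: $!";
    local $/;                       # slurp the whole (single-line) file
    my $row = <$in>;
    close $in;
    $row =~ s/,\z//;                # drop the trailing comma left by the writer
    print {$out} "$row\n";
    unlink $colfile;                # clean up the intermediate file
}
close $out;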
If the file does not fit in your computer's memory, your program has to read through it multiple times. There is no way around it.
There might be modules that obscure or hide this fact - like DBD::CSV - but those just do the same work behind the scenes.