How can I randomly sample the contents of a file? - perl

I have a file with contents
abc
def
high
lmn
...
...
There are more than 2 million lines in the files.
I want to randomly sample lines from the files and output 50K lines. Any thoughts on how to approach this problem? I was thinking along the lines of Perl and its rand function (Or a handy shell command would be neat).
Related (Possibly Duplicate) Questions:
Randomly Pick Lines From a File Without Slurping It With Unix
How can I get exactly n random lines from a file with Perl?

Assuming you basically want to output about 2.5% of all lines, this would do:
print if 0.025 > rand while <$input>;

Shell way:
sort -R file | head -n 50000

From perlfaq5: "How do I select a random line from a file?"
Short of loading the file into a database or pre-indexing the lines in the file, there are a couple of things that you can do.
Here's a reservoir-sampling algorithm from the Camel Book:
srand;
rand($.) < 1 && ($line = $_) while <>;
This has a significant advantage in space over reading the whole file in. You can find a proof of this method in The Art of Computer Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.
You can use the File::Random module which provides a function for that algorithm:
use File::Random qw/random_line/;
my $line = random_line($filename);
Another way is to use the Tie::File module, which treats the entire file as an array. Simply access a random array element.

Perl way:
use CPAN. There is module File::RandomLine that does exactly what you need.

If you need to extract an exact number of lines:
use strict;
use warnings;
# Number of lines to pick and file to pick from
# Error checking omitted!
my ($pick, $file) = #ARGV;
open(my $fh, '<', $file)
or die "Can't read file '$file' [$!]\n";
# count lines in file
my ($lines, $buffer);
while (sysread $fh, $buffer, 4096) {
$lines += ($buffer =~ tr/\n//);
}
# limit number of lines to pick to number of lines in file
$pick = $lines if $pick > $lines;
# build list of N lines to pick, use a hash to prevent picking the
# same line multiple times
my %picked;
for (1 .. $pick) {
my $n = int(rand($lines)) + 1;
redo if $picked{$n}++
}
# loop over file extracting selected lines
seek($fh, 0, 0);
while (<$fh>) {
print if $picked{$.};
}
close $fh;

Related

Remove duplicate lines on file by substring - preserve order (PERL)

i m trying to write a perl script to deal with some 3+ gb text files, that are structured like :
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212123x534534534534xx4545454x232323xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
I want to perform two operations :
Count the number of delimiters per line and compare it to a static number (ie 5), those lines that exceed said number should be output to a file.control.
Remove duplicates on the file by substring($line, 0, 7) - first 7 numbers, but i want to preserve order. I want the output of that in a file.output.
I have coded this in simple shell script (just bash), but it took too long to process, the same script calling on perl one liners was quicker, but i m interested in a way to do this purely in perl.
The code i have so far is :
open $file_hndl_ot_control, '>', $FILE_OT_CONTROL;
open $file_hndl_ot_out, '>', $FILE_OT_OUTPUT;
# INPUT.
open $file_hndl_in, '<', $FILE_IN;
while ($line_in = <$file_hndl_in>)
{
# Calculate n. of delimiters
my $delim_cur_line = $line_in =~ y/"$delimiter"//;
# print "$commas \n"
if ( $delim_cur_line != $delim_amnt_per_line )
{
print {$file_hndl_ot_control} "$line_in";
}
# Remove duplicates by substr(0,7) maintain order
my substr_in = substr $line_in, 0, 11;
print if not $lines{$substr_in}++;
}
And i want the file.output file to look like
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
and the file.control file to look like :
(assuming delimiter control number is 6)
4352342xx23232xxx345545x45454x23232xxx
Could someone assist me? Thank you.
Posting edits : Tried code
my %seen;
my $delimiter = 'x';
my $delim_amnt_per_line = 5;
open(my $fh1, ">>", "outputcontrol.txt");
open(my $fh2, ">>", "outputoutput.txt");
while ( <> ) {
my $count = ($_ =~ y/x//);
print "$count \n";
# print $_;
if ( $count != $delim_amnt_per_line )
{
print fh1 $_;
}
my ($prefix) = substr $_, 0, 7;
next if $seen{$prefix}++;
print fh2;
}
I dont know if i m supposed to post new code in here. But i tried the above, based on your example. What baffles me (i m still very new in perl) is that it doesnt output to either filehandle, but if i redirected from the command line just as you said, it worked perfect. The problem is that i need to output into 2 different files.
It looks like entries with the same seven-character prefix may appear anywhere in the file, so it's necessary to use a hash to keep track of which ones have already been encountered. With a 3GB text file this may result in your perl process running out of memory, in which case a different approach is necessary. Please give this a try and see if it comes in under the bar
The tr/// operator (the same as y///) doesn't accept variables for its character list, so I've used eval to create a subroutine delimiters() that will count the number of occurrences of $delimiter in $_
It's usually easiest to pass the input file as a parameter on the command line, and redirect the output as necessary. That way you can run your program on different files without editing the source, and that's how I've written this program. You should run it as
$ perl filter.pl my_input.file > my_output.file
use strict;
use warnings 'all';
my %seen;
my $delimiter = 'x';
my $delim_amnt_per_line = 5;
eval "sub delimiters { tr/$delimiter// }";
while ( <> ) {
next if delimiters() == $delim_amnt_per_line;
my ($prefix) = substr $_, 0, 7;
next if $seen{$prefix}++;
print;
}
output
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx

Using Perl to find and fix errors in CSV files

I am dealing with very large amounts of data. Every now and then there is a slip up. I want to identify each row with an error, under a condition of my choice. With that I want the row number along with the line number of each erroneous row. I will be running this script on a handful of files and I will want to output the report to one.
So here is my example data:
File_source,ID,Name,Number,Date,Last_name
1.csv,1,Jim,9876,2014-08-14,Johnson
1.csv,2,Jim,9876,2014-08-14,smith
1.csv,3,Jim,9876,2014-08-14,williams
1.csv,4,Jim,9876,not_a_date,jones
1.csv,5,Jim,9876,2014-08-14,dean
1.csv,6,Jim,9876,2014-08-14,Ruzyck
Desired output:
Row#5,4.csv,4,Jim,9876,not_a_date,jones (this is an erroneous row)
The condition I have chosen is print to output if anything in the date field is not a date.
As you can see, my desired output contains the line number where the error occurred, along with the data itself.
After I have my output that shows the lines within each file that are in error, I want to grab that line from the untouched original CSV file to redo (both modified and original files contain the same amount of rows). After I have a file of these redone rows, I can omit and clean up where needed to prevent interruption of an import.
Folder structure will contain:
Modified: 4.txt
Original: 4.csv
I have something started here, written in Perl, which by the logic will at least return the rows I need. However I believe my syntax is a little off and I do not know how to plug in the other subroutines.
Code:
$count = 1;
while (<>) {
unless ($F[4] =~ /\d+[-]\d+[-]\d+/)
print "Row#" . $count++ . "," . "$_";
}
The code above is supposed to give me my erroneous rows, but to be able to extract them from the originals is beyond me. The above code also contains some syntax errors.
This will do as you ask.
Please be certain that none of the fields in the data can ever contain a comma , otherwise you will need to use Text::CSV to process it instead of just a simple split.
use strict;
use warnings;
use 5.010;
use autodie;
open my $fh, '<', 'example.csv';
<$fh>; # Skip header
while (<$fh>) {
my #fields = split /,/;
if( $fields[4] !~ /^\d{4}-\d{2}-\d{2}$/ ) {
print "Row#$.,$_";
}
}
output
Row#5,4.csv,4,Jim,9876,not_a_date,jones
Update
If you want to process a number of files then you need this instead.
The close ARGV at the end of the loop is there so that the line counter $. is reset to
1 at the start of each file. Without it it just continues from 1 upwards across all the files.
You would run this like
rob#Samurai-U:~$ perl findbad.pl *.csv
or you could list the files individually, separated by spaces.
For the test I have created files 1.csv and 2.csv which are identical to your example data except that the first field of each line is the name of the file containing the data.
You may not want the line in the output that announces each file name, in which case you should replace the entire first if block with just next if $. == 1.
use strict;
use warnings;
#ARGV = map { glob qq{"$_"} } #ARGV; # For Windows
while (<>) {
if ($. == 1) {
print "\n\nFile: $ARGV\n\n";
next;
}
my #fields = split /,/;
unless ( $fields[4] =~ /^\d{4}-\d{2}-\d{2}$/ ) {
printf "Row#%d,%s", $., $_;
}
close ARGV if eof ARGV;
}
output
File: 1.csv
Row#5,1.csv,4,Jim,9876,not_a_date,jones
File: 2.csv
Row#5,2.csv,4,Jim,9876,not_a_date,jones

How to print result STDOUT to a temporary blank new file in the same directory in Perl?

I'm new in Perl, so it's maybe a very basic case that i still can't understand.
Case:
Program tell user to types the file name.
User types the file name (1 or more files).
Program read the content of file input.
If it's single file input, then it just prints the entire content of it.
if it's multi files input, then it combines the contents of each file in a sequence.
And then print result to a temporary new file, which located in the same directory with the program.pl .
file1.txt:
head
a
b
end
file2.txt:
head
c
d
e
f
end
SINGLE INPUT program ioSingle.pl:
#!/usr/bin/perl
print "File name: ";
$userinput = <STDIN>; chomp ($userinput);
#read content from input file
open ("FILEINPUT", $userinput) or die ("can't open file");
#PRINT CONTENT selama ada di file tsb
while (<FILEINPUT>) {
print ; }
close FILEINPUT;
SINGLE RESULT in cmd:
>perl ioSingle.pl
File name: file1.txt
head
a
b
end
I found tutorial code that combine content from multifiles input but cannot adapt the while argument to code above:
while ($userinput = <>) {
print ($userinput);
}
I was stucked at making it work for multifiles input,
How am i suppose to reformat the code so my program could give result like this?
EXPECTED MULTIFILES RESULT in cmd:
>perl ioMulti.pl
File name: file1.txt file2.txt
head
a
b
end
head
c
d
e
f
end
i appreciate your response :)
A good way to start working on a problem like this, is to break it down into smaller sections.
Your problem seems to break down to this:
get a list of filenames
for each file in the list
display the file contents
So think about writing subroutines that do each of these tasks. You already have something like a subroutine to display the contents of the file.
sub display_file_contents {
# filename is the first (and only argument) to the sub
my $filename = shift;
# Use lexical filehandl and three-arg open
open my $filehandle, '<', $filename or die $!;
# Shorter version of your code
print while <$filehandle>;
}
The next task is to get our list of files. You already have some of that too.
sub get_list_of_files {
print 'File name(s): ';
my $files = <STDIN>;
chomp $files;
# We might have more than one filename. Need to split input.
# Assume filenames are separated by whitespace
# (Might need to revisit that assumption - filenames can contain spaces!)
my #filenames = split /\s+/, $files;
return #filenames;
}
We can then put all of that together in the main program.
#!/usr/bin/perl
use strict;
use warnings;
my #list_of_files = get_list_of_files();
foreach my $file (#list_of_files) {
display_file_contents($file);
}
By breaking the task down into smaller tasks, each one becomes easier to deal with. And you don't need to carry the complexity of the whole program in you head at one time.
p.s. But like JRFerguson says, taking the list of files as command line parameters would make this far simpler.
The easy way is to use the diamond operator <> to open and read the files specified on the command line. This would achieve your objective:
while (<>) {
chomp;
print "$_\n";
}
Thus: ioSingle.pl file1.txt file2.txt
If this is the sole objective, you can reduce this to a command line script using the -p or -n switch like:
perl -pe '1' file1.txt file2.txt
perl -ne 'print' file1.txt file2.txt
These switches create implicit loops around the -e commands. The -p switch prints $_ after every loop as if you had written:
LINE:
while (<>) {
# your code...
} continue {
print;
}
Using -n creates:
LINE:
while (<>) {
# your code...
}
Thus, -p adds an implicit print statement.

Perl: How to add a line to sorted text file

I want to add a line to the text file in perl which has data in a sorted form. I have seen examples which show how to append data at the end of the file, but since I want the data in a sorted format.
Please guide me how can it be done.
Basically from what I have tried so far :
(I open a file, grep its content to see if the line which I want to add to the file already exists. If it does than exit else add it to the file (such that the data remains in a sorted format)
open(my $FH, $file) or die "Failed to open file $file \n";
#file_data = <$FH>;
close($FH);
my $line = grep (/$string1/, #file_data);
if($line) {
print "Found\n";
exit(1);
}
else
{
#add the line to the file
print "Not found!\n";
}
Here's an approach using Tie::File so that you can easily treat the file as an array, and List::BinarySearch's bsearch_str_pos function to quickly find the insert point. Once you've found the insert point, you check to see if the element at that point is equal to your insert string. If it's not, splice it into the array. If it is equal, don't splice it in. And finish up with untie so that the file gets closed cleanly.
use strict;
use warnings;
use Tie::File;
use List::BinarySearch qw(bsearch_str_pos);
my $insert_string = 'Whatever!';
my $file = 'something.txt';
my #array;
tie #array, 'Tie::File', $file or die $!;
my $idx = bsearch_str_pos $insert_string, #array;
splice #array, $idx, 0, $insert_string
if $array[$idx] ne $insert_string;
untie #array;
The bsearch_str_pos function from List::BinarySearch is an adaptation of a binary search implementation from Mastering Algorithms with Perl. Its convenient characteristic is that if the search string isn't found, it returns the index point where it could be inserted while maintaining the sort order.
Since you have to read the contents of the text file anyway, how about a different approach?
Read the lines in the file one-by-one, comparing against your target string. If you read a line equal to the target string, then you don't have to do anything.
Otherwise, you eventually read a line 'greater' than your current line according to your sort criteria, or you hit the end of the file. In the former case, you just insert the string at that position, and then copy the rest of the lines. In the latter case, you append the string to the end.
If you don't want to do it that way, you can do a binary search in #file_data to find the spot to add the line without having to examine all of the entries, then insert it into the array before outputting the array to the file.
Here's a simple version that reads from stdin (or filename(s) specified on command line) and appends 'string to append' to the output if it's not found in the input. Outuput is printed on stdout.
#! /usr/bin/perl
$found = 0;
$append='string to append';
while(<>) {
$found = 1 if (m/$append/o);
print
}
print "$append\n" unless ($found);;
Modifying it to edit a file in-place (with perl -i) and taking the append string from the command line would be quite simple.
A 'simple' one-liner to insert a line without using any module could be:
perl -ni -le '$insert="lemon"; $eq=($insert cmp $_); if ($eq == 0){$found++}elsif($eq==-1 && !$found){print$insert} print'
giver a list.txt whose context is:
ananas
apple
banana
pear
the output is:
ananas
apple
banana
lemon
pear
{
local ($^I, #ARGV) = ("", $file); # Enable in-place editing of $file
while (<>) {
# If we found the line exactly, bail out without printing it twice
last if $_ eq $insert;
# If we found the place where the line should be, insert it
if ($_ gt $insert) {
print $insert;
print;
last;
}
print;
}
# We've passed the insertion point, now output the rest of the file
print while <>;
}
Essentially the same answer as pavel's, except with a lot of readability added. Note that $insert should already contain a trailing newline.

How can I remove non-unique lines from a large file with Perl?

Duplicate data removal using Perl called within via a batch file within Windows
A DOS window in Windows called via a batch file.
A batch file calls the Perl script which carries out the actions. I have the batch file.
The code script I have works duplicate data is removal so long as the data file is not too big.
The problem that requires resolving is with data files which are larger, (2 GB or more), with this size of file a memory error occurs when trying to load the complete file in to an array for duplicate data removal.
The memory error occurs in the subroutine at:-
#contents_of_the_file = <INFILE>;
(A completely different method is acceptable so long as it solves this issue, please suggest).
The subroutine is:-
sub remove_duplicate_data_and_file
{
open(INFILE,"<" . $output_working_directory . $output_working_filename) or dienice ("Can't open $output_working_filename : INFILE :$!");
if ($test ne "YES")
{
flock(INFILE,1);
}
#contents_of_the_file = <INFILE>;
if ($test ne "YES")
{
flock(INFILE,8);
}
close (INFILE);
### TEST print "$#contents_of_the_file\n\n";
#unique_contents_of_the_file= grep(!$unique_contents_of_the_file{$_}++, #contents_of_the_file);
open(OUTFILE,">" . $output_restore_split_filename) or dienice ("Can't open $output_restore_split_filename : OUTFILE :$!");
if ($test ne "YES")
{
flock(OUTFILE,1);
}
for($element_number=0;$element_number<=$#unique_contents_of_the_file;$element_number++)
{
print OUTFILE "$unique_contents_of_the_file[$element_number]\n";
}
if ($test ne "YES")
{
flock(OUTFILE,8);
}
}
You are unnecessarily storing a full copy of the original file in #contents_of_the_file and -- if the amount of duplication is low relative to the file size -- nearly two other full copies in %unique_contents_of_the_file and #unique_contents_of_the_file. As ire_and_curses noted, you can reduce the storage requirements by making two passes over the data: (1) analyze the file, storing information about the line numbers of non-duplicate lines; and (2) process the file again to write non-dups to the output file.
Here is an illustration. I don't know whether I've picked the best module for the hashing function (Digest::MD5); perhaps others will comment on that. Also note the 3-argument form of open(), which you should be using.
use strict;
use warnings;
use Digest::MD5 qw(md5);
my (%seen, %keep_line_nums);
my $in_file = 'data.dat';
my $out_file = 'data_no_dups.dat';
open (my $in_handle, '<', $in_file) or die $!;
open (my $out_handle, '>', $out_file) or die $!;
while ( defined(my $line = <$in_handle>) ){
my $hashed_line = md5($line);
$keep_line_nums{$.} = 1 unless $seen{$hashed_line};
$seen{$hashed_line} = 1;
}
seek $in_handle, 0, 0;
$. = 0;
while ( defined(my $line = <$in_handle>) ){
print $out_handle $line if $keep_line_nums{$.};
}
close $in_handle;
close $out_handle;
You should be able to do this efficiently using hashing. You don't need to store the data from the lines, just identify which ones are the same. So...
Don't slurp - Read one line at a time.
Hash the line.
Store the hashed line representation as a key in a Perl hash of lists. Store the line number as the first value of the list.
If the key already exists, append the duplicate line number to the list corresponding to that value.
At the end of this process, you'll have a data-structure identifying all the duplicate lines. You can then do a second pass through the file to remove those duplicates.
Perl does heroic things with large files, but 2GB may be a limitation of DOS/Windows.
How much RAM do you have?
If your OS doesn't complain, it may be best to read the file one line at a time, and write immediately to output.
I'm thinking of something using the diamond operator <> but I'm reluctant to suggest any code because on the occasions I've posted code, I've offended a Perl guru on SO.
I'd rather not risk it. I hope the Perl cavalry will arrive soon.
In the meantime, here's a link.
Here's a solution that works no matter how big the file is. But it doesn't use RAM exclusively, so its slower than a RAM-based solution. You can also specify the amount of RAM you want this thing to use.
The solution uses a temporary file that the program treats as a database with SQLite.
#!/usr/bin/perl
use DBI;
use Digest::SHA 'sha1_base64';
use Modern::Perl;
my $input= shift;
my $temp= 'unique.tmp';
my $cache_size_in_mb= 100;
unlink $temp if -f $temp;
my $cx= DBI->connect("dbi:SQLite:dbname=$temp");
$cx->do("PRAGMA cache_size = " . $cache_size_in_mb * 1000);
$cx->do("create table x (id varchar(86) primary key, line int unique)");
my $find= $cx->prepare("select line from x where id = ?");
my $list= $cx->prepare("select line from x order by line");
my $insert= $cx->prepare("insert into x (id, line) values(?, ?)");
open(FILE, $input) or die $!;
my ($line_number, $next_line_number, $line, $sha)= 1;
while($line= <FILE>) {
$line=~ s/\s+$//s;
$sha= sha1_base64($line);
unless($cx->selectrow_array($find, undef, $sha)) {
$insert->execute($sha, $line_number)}
$line_number++;
}
seek FILE, 0, 0;
$list->execute;
$line_number= 1;
$next_line_number= $list->fetchrow_array;
while($line= <FILE>) {
$line=~ s/\s+$//s;
if($next_line_number == $line_number) {
say $line;
$next_line_number= $list->fetchrow_array;
last unless $next_line_number;
}
$line_number++;
}
close FILE;
Well you could use the inline replace mode of command line perl.
perl -i~ -ne 'print unless $seen{$_}++' uberbigfilename
In the "completely different method" category, if you've got Unix commands (e.g. Cygwin):
cat infile | sort | uniq > outfile
This ought to work - no need for Perl at all - which may, or may not, solve your memory problem. However, you will lose the ordering of the infile (as outfile will now be sorted).
EDIT: An alternative solution that's better able to deal with large files may be by using the following algorithm:
Read INFILE line-by-line
Hash each line to a small hash (e.g. a hash# mod 10)
Append each line to a file unique to the hash number (e.g. tmp-1 to tmp-10)
Close INFILE
Open and sort each tmp-# to a new file sortedtmp-#
Mergesort sortedtmp-[1-10] (i.e. open all 10 files and read them simultaneously), skipping duplicates and writing each iteration to the end output file
This will be safer, for very large files, than slurping.
Parts 2 & 3 could be changed to a random# instead of a hash number mod 10.
Here's a script BigSort that may help (though I haven't tested it):
# BigSort
#
# sort big file
#
# $1 input file
# $2 output file
#
# equ sort -t";" -k 1,1 $1 > $2
BigSort()
{
if [ -s $1 ]; then
rm $1.split.* > /dev/null 2>&1
split -l 2500 -a 5 $1 $1.split.
rm $1.sort > /dev/null 2>&1
touch $1.sort1
for FILE in `ls $1.split.*`
do
echo "sort $FILE"
sort -t";" -k 1,1 $FILE > $FILE.sort
sort -m -t";" -k 1,1 $1.sort1 $FILE.sort > $1.sort2
mv $1.sort2 $1.sort1
done
mv $1.sort1 $2
rm $1.split.* > /dev/null 2>&1
else
# work for empty file !
cp $1 $2
fi
}