How can I optimise the search of unique lines? - perl

I have large tab separated files like the following example:
scaffold1443 182629 182998 chr1.1.1.241051.241420 367 99.80
scaffold1443 131948 132412 chr1.1.2.291778.292242 462 99.80
scaffold1443 96142 96474 chr1.1.3.327471.327803 330 99.70
scaffold1443 53153 53479 chr1.1.4.370342.370668 322 99.40
scaffold526 2870014 2870523 chr1.1.5.488372.488881 507 99.90
scaffold526 2865956 2866314 chr1.1.6.490869.491234 357 98.10
scaffold526 2867666 2868024 chr1.1.6.490869.491234 357 98.10
scaffold526 2485557 2485867 chr1.1.7.610677.610987 310 100.00
I want to print to a new file only the lines whose 4th column is unique.
In the previous example, all the lines should be printed except the 2 lines that have "chr1.1.6.490869.491234" in the 4th column.
The following script that I wrote (it is part of a larger pipeline) does the job, but it is extremely slow, especially when the input file is very big.
#!/usr/bin/perl
use strict;
use warnings;
#This script takes the best hits output and finds the unique elements that up to only one scaffold.
my $target = $ARGV[0];
my $chromosome = $ARGV[1];
my @mykeys = `cat OUTPUT_$target/psl_score_byname_$target/$chromosome.table | awk '{print \$4}' | sort -u`;
foreach (@mykeys)
{
    my $key = $_;
    chomp($key);
    my $command = "cat OUTPUT_$target/psl_score_byname_$target/$chromosome.table | grep -w $key";
    my @belongs = `$command`;
    chomp(@belongs);
    my $count = scalar(@belongs);
    if ($count == 1)
    {
        open FILE, ">>OUTPUT_$target/unique_hces_$target/$chromosome.txt" or die $!;
        print FILE "@belongs\n";
        @belongs = ();
    }
    else {
        @belongs = ();
    }
}
Is there any smarter and faster way to do it?
Thank you very much in advance.

Given that you do not want to print lines that have duplicates at all, you need to see the whole file before printing anything, in order to first find the lines with duplicates. Then go back and print the others.
This can be done by keeping the whole file in memory along with ancillary data structures, or by making two passes. Since the file is "very big", here is a less memory-straining way:
use warnings;
use strict;
my $file = 'skip.txt';
open my $fh, '<', $file or die "Can't open $file: $!";
my (%seen, %dupe);
while (<$fh>)
{
    my $patt = (split)[3];
    # Record line numbers if the 4th field has been seen
    if (exists $seen{$patt}) {
        $dupe{ $seen{$patt} }++;   # num of line with it seen first, with count
        $dupe{$.} = 1;             # this line's number as well
    }
    else { $seen{$patt} = $. }     # first time this 4th field is seen
}
# Now we know all lines which carry duplicate fourth field
my $outfile = 'filtered_' . $file;
open my $fh_out, '>', $outfile or die "Can't open $outfile: $!";
seek $fh, 0, 0; # rewind to the beginning
$. = 0; # seek doesn't reset $.
while (<$fh>) {
    print $fh_out $_ if not exists $dupe{$.};
}
close $fh_out;
The first time a duplicate is found, the line where its field was first seen also needs to be recorded, via $dupe{$seen{$patt}}++, in that branch. This only needs to happen once; while we could check whether it has already been recorded, we may as well keep a potentially useful count of duplicates instead.
I've added a few more duplicates (some more than twice) to your posted sample and this produces the correct output.
Comment on the posted code
The posted code checks the fourth field of each line against the whole file, thus processing the file as many times as there are lines. That is a lot of work and it has to take time, especially for big files.
Also, there is no reason to use external programs for that job.

As a one-liner:
perl -F"\t" -lanE 'push @l,[@F];$s{$F[3]}++}{say join"\t",@$_ for grep{$s{$_->[3]}==1}@l' <<EOF
scaffold1443 182629 182998 chr1.1.1.241051.241420 367 99.80
scaffold1443 131948 132412 chr1.1.2.291778.292242 462 99.80
scaffold1443 96142 96474 chr1.1.3.327471.327803 330 99.70
scaffold1443 53153 53479 chr1.1.4.370342.370668 322 99.40
scaffold526 2870014 2870523 chr1.1.5.488372.488881 507 99.90
scaffold526 2865956 2866314 chr1.1.6.490869.491234 357 98.10
scaffold526 2867666 2868024 chr1.1.6.490869.491234 357 98.10
scaffold526 2485557 2485867 chr1.1.7.610677.610987 310 100.00
EOF
output
scaffold1443 182629 182998 chr1.1.1.241051.241420 367 99.80
scaffold1443 131948 132412 chr1.1.2.291778.292242 462 99.80
scaffold1443 96142 96474 chr1.1.3.327471.327803 330 99.70
scaffold1443 53153 53479 chr1.1.4.370342.370668 322 99.40
scaffold526 2870014 2870523 chr1.1.5.488372.488881 507 99.90
scaffold526 2485557 2485867 chr1.1.7.610677.610987 310 100.00
more readable:
perl -F"\t" -lanE '
  push @lines, [ @F ]; $seen{ $F[3] }++;
  END {
    say join("\t", @$_) for grep { $seen{ $_->[3] } == 1 } @lines
  }
'
You can translate it to a full script if you want; I created this as a one-liner because you said it is part of a larger pipeline.
Also note that the above reads the whole file into memory first, so very big files could cause problems.
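If you do want it as a full script, here is roughly what the one-liner above translates to (a sketch only; the input file name file.tsv is an assumption):
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

my $file = 'file.tsv';    # assumed input name
open my $fh, '<', $file or die "Can't open $file: $!";

my (@lines, %seen);
while (<$fh>) {
    chomp;
    my @F = split /\t/;
    push @lines, \@F;      # keep every row, split into fields
    $seen{ $F[3] }++;      # count occurrences of the 4th column
}
close $fh;

# print only the rows whose 4th column occurred exactly once
say join "\t", @$_ for grep { $seen{ $_->[3] } == 1 } @lines;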

The simple approach involves using an associative array to identify duplicates.
perl -F'\t' -lane'
  push @{ $h{ $F[3] } }, $_;
  END {
    for (values(%h)) {
      print(@$_) if @$_ == 1;
    }
  }
' file.tsv
The above approach requires as much memory as the file is large. That's a no-go if your files are truly large.
If you have truly large files, the simple approach is to sort the file using the sort command-line utility (which is rather fast and can handle arbitrarily large files). By first rearranging the file so that duplicates are next to each other, we can easily filter out the duplicates without worrying about memory issues.
sort -t$'\t' -k 4,4 file.tsv | perl -F'\t' -lane'
  if ($key ne $F[3]) {
    print(@buf) if @buf == 1;
    @buf = ();
  }
  $key = $F[3];
  push @buf, $_;
  END { print(@buf) if @buf == 1; }
'
If you have truly large files, another relatively simple approach is to load the data into a database (e.g. an sqlite3 database). You could easily maintain the original order with this approach.
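For illustration, a minimal sketch of that idea, assuming DBI with DBD::SQLite is available; the database, table, and file names are made up and not part of the original answer:
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $file = 'file.tsv';    # assumed input name
my $dbh  = DBI->connect('dbi:SQLite:dbname=dedup.sqlite', '', '',
                        { RaiseError => 1, AutoCommit => 1 });
$dbh->do('CREATE TABLE rows (lineno INTEGER PRIMARY KEY, key TEXT, line TEXT)');
my $ins = $dbh->prepare('INSERT INTO rows (lineno, key, line) VALUES (?, ?, ?)');

open my $fh, '<', $file or die "Can't open $file: $!";
$dbh->begin_work;         # one big transaction for the bulk insert
while (my $line = <$fh>) {
    chomp $line;
    my $key = (split /\t/, $line)[3];
    $ins->execute($., $key, $line);    # $. is the current line number
}
$dbh->commit;
close $fh;

# keep only keys that occur exactly once, preserving the original line order
my $sth = $dbh->prepare(
    'SELECT line FROM rows
     WHERE key IN (SELECT key FROM rows GROUP BY key HAVING COUNT(*) = 1)
     ORDER BY lineno'
);
$sth->execute;
while (my ($line) = $sth->fetchrow_array) {
    print "$line\n";
}
$dbh->disconnect;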

Related

Use Perl to loop over files and calculate the mean of each column

I'm new to Perl and I would like to learn how to use loops with it. I have multiple directories and each directory contains a file named data.txt. The data.txt file has several columns. I basically need to use a loop to calculate the mean of each column for each data.txt file.
I have this command that does the job for one single file:
perl -lane 'for $c (0..$#F){$t[$c] += $F[$c]}; END{for $c (0..$#t){print $t[$c]/$.}}' data.txt
I wish to write a script where I visit every directory, read every file that's in it and apply the command.
Example:
data.txt:
-79.2335 0.4041 71.9143 1.3392 -0.7687 0.0212 -8.0934 1.1425
-74.4163 0.6188 60.0468 1.8782 -0.8540 0.0305 -15.1574 1.4755
-74.4118 0.6046 62.1771 1.8058 -0.9143 0.0304 -13.2272 1.3408
-74.3895 0.5935 66.4264 1.6532 -0.8509 0.0223 -8.8819 1.2670
-74.3192 0.5589 67.1619 1.4763 -0.9656 0.0274 -8.1090 1.1450
-73.8272 0.6274 61.6632 1.7554 -0.8840 0.0256 -13.0435 1.3641
-73.3525 0.5856 60.6622 1.7872 -0.8489 0.0222 -13.5014 1.3947
-73.3206 0.6275 53.3129 2.2961 -0.7962 0.0337 -20.8195 1.8538
-72.5461 0.5212 62.0359 1.4267 -0.9378 0.0240 -11.4203 1.0295
-72.3058 0.7225 56.2304 2.1480 -0.7539 0.0293 -16.7954 1.5952
-72.1180 0.6460 51.7954 2.0845 -0.8479 0.0265 -21.0355 1.4630
-72.0690 0.4905 58.8372 1.3918 -0.9866 0.0333 -14.1823 1.1045
-71.7949 0.5799 55.6006 1.9189 -0.8541 0.0313 -17.0112 1.4530
-71.3074 0.4482 45.9271 2.1135 -0.6637 0.0354 -25.9309 1.8761
-71.2542 0.4879 57.3196 1.5406 -0.9523 0.0281 -14.9113 1.2705
-71.2421 0.5480 47.9065 2.2445 -0.8107 0.0352 -24.2489 1.7997
-70.3751 0.5278 49.5489 1.8395 -0.8208 0.0371 -21.5205 1.4994
-69.2181 0.4823 54.8234 1.0645 -0.9897 0.0246 -15.3506 0.9369
-69.0456 0.4650 40.3798 2.0117 -0.6476 0.0360 -29.3403 1.7013
-66.5402 0.5006 42.1805 1.7872 -0.7692 0.0356 -25.1431 1.4522
Output:
-72.354355 0.552015 56.297505 1.77814 -0.845845 0.029485 -16.88618 1.408235
As your comments imply that you have a simple directory structure, with one main directory called mean containing hundreds of subdirectories, each with a file called data.txt, the list of files can be compiled easily with a glob, and the math is fairly straightforward. This is a suggestion for how it can be done.
I would not use $. to calculate the average, since it can be affected by other factors. Instead, use a count variable for each file and count the non-blank lines.
use strict;
use warnings;
use feature 'say';
for my $data (glob "mean/*/data.txt") {    # get list of files
    open my $fh, '<', $data or die "Cannot open file '$data': $!";
    my @sum;
    my $count = 0;
    while (<$fh>) {
        $count++ if /\S/;                  # count non-blank lines
        my @fields = split;                # split on whitespace
        for (0 .. $#fields) {
            $sum[$_] += $fields[$_];       # sum columns
        }
    }
    say $data;          # file name
    say join "\t",      # 3. ...join them with tab and print
        map $_/$count,  # 2. ...for each sum, divide by count
        @sum;           # 1. Take list of sums...
}
Output:
mean/A/data.txt
-72.354355 0.552015 56.297505 1.77814 -0.845845 0.029485 -16.88618 1.408235
mean/B/data.txt
-142.354355 0.552015 56.297505 1.77814 -0.845845 0.029485 -16.88618 1.408235
mean/C/data.txt
-72.354355 17.152015 56.297505 1.77814 -0.845845 0.029485 -16.88618 1.408235
I am not a Perl expert, but this worked for me. It prints the results to the terminal; you could redirect the output to a file, or write to a file directly instead of printing to the terminal.
use 5.28.2;
use warnings;
use File::Find;
my ($inf, @sum);
for my $dir (glob "/mainDirectory/*") {    # finds files/subdirectories
    if (! -d $dir) {
        next;                              # keeps only directories
    }
    $inf = "$dir/data.txt";
    say "$inf";
    find(\&sum_columns, $inf);
}
sub sum_columns {
    open (IN, "<", "$inf") or die "Cannot open file.\n $!";
    while (<IN>) {
        my $line = $_;
        chomp $line;
        my @columns = split(/\s+/, $line);
        for my $item (0 .. $#columns) {
            $sum[$item] += $columns[$item];
        }
    }
    say "@sum";
    @sum = ();
}

perl find text from file in certain position

130723,-001,1.14,130725,+002,4.20,130731,+006,1.52,130728
130725,+002,4.20,130731,+006,1.52,130728,-003,0.00,130731
130731,+006,1.52,130728,-003,0.00,130731,+003,1.00,130731
130728,-003,0.00,130731,+003,1.00,130731,+000,0.00,130729
130731,+000,0.00,130729,-002,1.00,130728,-001,0.00,130728
The above is part of a log file. Each line in the log file is always the same length and has the same pattern, as you can see above. I need to read the file and place in an array all the lines where positions 42 to 46 in each line meet certain expectations. In the case above we are looking at the following numbers:
+006
-003
+003
+000
-001
Can someone point me in the right direction?
EDIT:
Thanks to Amon for his suggestion.
I ended up with this code, for future reference.
my @data;
open (FILE, $filename) or die "Couldn't open log: $!";
while (<FILE>) {
    if ((split /,/)[7] eq "+003") {
        push @data, $_;
    }
}
close FILE;
foreach (@data) {
    print "$_\r\n";
}
Thinking towards the future: if this file gets really big, what steps should I take to optimise the process speed-wise?
If you want to do it by column numbers, then substr() is usable with care:
perl -pe '$_ = substr($_, 41, 4) . "\n"' data
Your question asks for columns 42..46, but with inclusive notation that selects 5 positions, the last of which is a comma. Perhaps 42..46 is meant as a 1-based half-open range of columns.
The 41 in the code is 'column 42 - 1' (0-based indexes); the 4 is '46 - 42'. So, for columns [N..M), the formula would be:
perl -pe '$_ = substr($_, N-1, M-N) . "\n"' data
While @amon's answer is elegant, you can just use a regex:
my @data;
open FILE, "filename.txt" or die $!;
while (<FILE>) {
    push @data, $_ if /^.{41}(\+006|-003|\+003|\+000|-001)/;
}
close FILE;
Try
perl -F, -ane '$F[7] eq "+003" and push @l,$_; END { print for @l }' <<XXX
130723,-001,1.14,130725,+002,4.20,130731,+006,1.52,130728
130725,+002,4.20,130731,+006,1.52,130728,-003,0.00,130731
130731,+006,1.52,130728,-003,0.00,130731,+003,1.00,130731
130728,-003,0.00,130731,+003,1.00,130731,+000,0.00,130729
130731,+000,0.00,130729,-002,1.00,130728,-001,0.00,130728
XXX
Output:
130731,+006,1.52,130728,-003,0.00,130731,+003,1.00,130731

Loading Big files into Hashes in Perl (BLAST tables)

I'm a Perl beginner; please help me out with my query. I'm trying to extract information from a BLAST table (a snippet of what it looks like is below).
It's a standard BLAST table input. I basically want to extract any information on a list of reads (look at my second script below to get an idea of what I want to do). Anyhow, this is precisely what I've done in the second script:
INPUTS:
1) the blast table:
38.1 0.53 59544 GH8NFLV01A02ED GH8NFLV01A02ED rank=0113471 x=305.0 y=211.5 length=345 1 YP_003242370 Dynamin family protein [Paenibacillus sp. Y412MC10] -1 0 48.936170212766 40.4255319148936 47 345 1213 13.6231884057971 3.87469084913438 31 171 544 590
34.3 7.5 123828 GH8NFLV01A03QJ GH8NFLV01A03QJ rank=0239249 x=305.0 y=1945.5 length=452 1 XP_002639994 Hypothetical protein CBG10824 [Caenorhabditis briggsae] 3 0 52.1739130434783 32.6086956521739 46 452 367 10.1769911504425 12.5340599455041 111 248 79 124
37.7 0.70 62716 GH8NFLV01A09B8 GH8NFLV01A09B8 rank=0119267 x=307.0 y=1014.0 length=512 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] 1 0 73.5294117647059 52.9411764705882 34 512 703 6.640625 4.83641536273115 43 144 273 306
37.7 0.98 33114 GH8NFLV01A0H5C GH8NFLV01A0H5C rank=0066011 x=298.0 y=2638.5 length=573 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] -3 0 73.5294117647059 52.9411764705882 34 573 703 5.93368237347295 4.83641536273115 131 232 273 306
103 1e-020 65742 GH8NFLV01A0MXI GH8NFLV01A0MXI rank=0124865 x=300.5 y=644.0 length=475 1 ABZ08973 hypothetical protein ALOHA_HF4000APKG6B14ctg1g18 [uncultured marine crenarchaeote HF4000_APKG6B14] 2 0 77.9411764705882 77.9411764705882 68 475 151 14.3157894736842 45.0331125827815 2 205 1 68
41.6 0.053 36083 GH8NFLV01A0QKX GH8NFLV01A0QKX rank=0071366 x=301.0 y=1279.0 length=526 1 XP_766153 hypothetical protein [Theileria parva strain Muguga] -1 0 66.6666666666667 56.6666666666667 30 526 304 5.70342205323194 9.86842105263158 392 481 31 60
45.4 0.003 78246 GH8NFLV01A0Z29 GH8NFLV01A0Z29 rank=0148293 x=304.0 y=1315.0 length=432 1 ZP_04111769 hypothetical protein bthur0007_56280 [Bacillus thuringiensis serovar monterrey BGSC 4AJ1] 3 0 51.8518518518518 38.8888888888889 54 432 193 12.5 27.979274611399 48 209 97 150
71.6 4e-011 97250 GH8NFLV01A14MR GH8NFLV01A14MR rank=0184885 x=317.5 y=609.5 length=314 1 ZP_03823721 DNA replication protein [Acinetobacter sp. ATCC 27244] 1 0 92.5 92.5 40 314 311 12.7388535031847 12.8617363344051 193 312 13 52
58.2 5e-007 154555 GH8NFLV01A1KCH GH8NFLV01A1KCH rank=0309994 x=310.0 y=2991.0 length=267 1 ZP_03823721 DNA replication protein [Acinetobacter sp. ATCC 27244] 1 0 82.051282051282 82.051282051282 39 267 311 14.6067415730337 12.540192926045 142 258 1 39
2) The reads list:
GH8NFLV01A09B8
GH8NFLV01A02ED
etc
etc
3) the output I want:
37.7 0.70 62716 GH8NFLV01A09B8 GH8NFLV01A09B8 rank=0119267 x=307.0 y=1014.0 length=512 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] 1 0 73.5294117647059 52.9411764705882 34 512 703 6.640625 4.83641536273115 43 144 273 306
38.1 0.53 59544 GH8NFLV01A02ED GH8NFLV01A02ED rank=0113471 x=305.0 y=211.5 length=345 1 YP_003242370 Dynamin family protein [Paenibacillus sp. Y412MC10] -1 0 48.936170212766 40.4255319148936 47 345 1213 13.6231884057971 3.87469084913438 31 171 544 590
I want a subset of the information in the first input, given a list of read names to extract (which are found in the 4th column).
Instead of hashing the reads list (only?), I want to hash the BLAST table itself and use the information in column 4 (of the BLAST table) as the keys, extracting the values of each key even when that key may have more than one value (i.e. each read name might actually have more than one hit, or associated BLAST result, in the table), keeping in mind that the value includes the WHOLE row with that key (read name) in it.
My greplist.pl script does this, but it is very, very slow. I think (and correct me if I'm wrong) that loading the whole table into a hash should speed things up tremendously.
Thank you for your help.
My scripts:
The Broken one (mambo5.pl)
#!/usr/bin/perl -w
# purpose: extract blastX data from a list of readnames
use strict;
open (DATA,$ARGV[0]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
open (LIST,$ARGV[1]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
my %hash = <DATA>;
close (DATA);
my $filename=$ARGV[0];
open(OUT, "> $filename.bololom");
my $readName;
while ( <LIST> )
{
    #########;
    if (/^(.*?)$/)
    {
        $readName = $1;
        chomp $readName;
        if (exists $hash{$readName})
        {
            print "bingo!";
            my $output = $hash{$readName};
            print OUT "$output\n";
        }
        else
        {
            print "it aint workin\n";
            #print %hash;
        }
    }
}
close (LIST);
The slow and quick cheat (that works), which is very slow (my blast tables can be about 400 MB to 2 GB, so I'm sure you can see why):
#!/usr/bin/perl -w
##
# This script finds a list of names in a blast table and outputs the result in a new file
# name must exist and list must be correctly formatted
# will not output anything using a "normal" blast file, must be a table blast
# if you have the standard blast output use blast2table script
use strict;
my $filein=$ARGV[0] or die ("usage: ./listgrep.pl readslist blast_table\n");
my $db=$ARGV[1] or die ("usage: ./listgrep.pl readslist blast_table\n");
#open the reads you want to grep
my $read;
my $line;
open(READSLIST,$filein);
while ($line = <READSLIST>)
{
    if ($line =~ /^(.*)$/)
    {
        $read = $1;
        print "$read\n";
        system("grep \"$read\" $db >$read\_.out\n");
    }
    #system("grep $read $db >$read\_.out\n");
}
system("cat *\_.out >$filein\_greps.txt\n");
system("rm *.out\n");
I don't know how to define that 4th column as the key: maybe I could use the split function, but I've tried to find a way that does this for a table of more than 2 columns, to no avail... Please help!
If there is an easy way out of this, please let me know.
Thanks!
I'd do the opposite, i.e. read the readslist file into a hash, then walk through the big blast file and print the desired lines.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
# Read the readslist file into a hash
open my $fh, '<', 'readslist' or die "Can't open 'readslist' for reading:$!";
my %readslist = map { chomp; $_ => 1 }<$fh>;
close $fh;
open my $fh_blast, '<', 'blastfile' or die "Can't open 'blastfile' for reading:$!";
# loop on all the blastfile lines
while (<$fh_blast>) {
    chomp;
    # retrieve the key (4th column)
    my ($key) = (split /\s+/)[3];
    # print the line if the key exists in the hash
    say $_ if exists $readslist{$key};
}
close $fh_blast;
I suggest you build an index to turn your blasts file temporarily into an indexed-sequential file. Read through it and build a hash of addresses within the file where every record for each key starts.
After that it is just a matter of seeking to the correct places in the file to pick up the records required. This will certainly be faster than most simple solutions, as it entails reading the big file only once. This example code demonstrates.
use strict;
use warnings;
use Fcntl qw/SEEK_SET/;
my %index;
open my $blast, '<', 'blast.txt' or die $!;
until (eof $blast) {
    my $place = tell $blast;
    my $line = <$blast>;
    my $key = (split ' ', $line, 5)[3];
    push @{$index{$key}}, $place;
}
open my $reads, '<', 'reads.txt' or die $!;
while (<$reads>) {
    next unless my ($key) = /(\S+)/;
    next unless my $places = $index{$key};
    foreach my $place (@$places) {
        seek $blast, $place, SEEK_SET;
        my $line = <$blast>;
        print $line;
    }
}
Voila, two ways of doing this, one of which has nothing to do with Perl:
awk 'BEGIN {while ( i = getline < "reads_list") ar[$i] = $1;} {if ($4 in ar) print $0;}' blast_table > new_blast_table
Mambo6.pl
#!/usr/bin/perl -w
# purpose: extract blastX data from a list of readnames. HINT: Make sure your list file only has unique names , that way you save time.
use strict;
open (DATA,$ARGV[0]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
open (LIST,$ARGV[1]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
my %hash;
my $val;
my $key;
while (<DATA>)
{
    #chomp;
    if (/((.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?))$/)
    {
        #print "$1\n";
        $key = $5;    # read
        $val = $1;    # whole row; notice the brackets around the whole match.
        $hash{$key} .= exists $hash{$key} ? "$val\n" : $val;
    }
    else {
        print "something wrong with format";
    }
}
close (DATA);
open(OUT, "> $ARGV[1]\_out\.txt");
my $readName;
while ( <LIST> )
{
    #########;
    if (/^(.*?)$/)
    {
        $readName = $1;
        chomp $readName;
        if (exists $hash{$readName})
        {
            print "$readName\n";
            my $output = $hash{$readName};
            print OUT "$output";
        }
        else
        {
            #print "it aint workin\n";
        }
    }
}
close (LIST);
close (OUT);
The one-liner is faster, and probably better than my script; I'm sure some people can find easier ways to do it. I just thought I'd put this up since it does what I want.

How can I remove non-unique lines from a large file with Perl?

Duplicate data removal using Perl, called via a batch file within Windows.
A DOS window in Windows is opened via a batch file.
The batch file calls the Perl script which carries out the actions. I have the batch file.
The script I have works for duplicate data removal so long as the data file is not too big.
The problem that needs resolving is with larger data files (2 GB or more); with files of this size a memory error occurs when trying to load the complete file into an array for duplicate data removal.
The memory error occurs in the subroutine at:
@contents_of_the_file = <INFILE>;
(A completely different method is acceptable so long as it solves this issue; please suggest one.)
The subroutine is:
sub remove_duplicate_data_and_file
{
    open(INFILE, "<" . $output_working_directory . $output_working_filename) or dienice("Can't open $output_working_filename : INFILE :$!");
    if ($test ne "YES")
    {
        flock(INFILE, 1);
    }
    @contents_of_the_file = <INFILE>;
    if ($test ne "YES")
    {
        flock(INFILE, 8);
    }
    close (INFILE);
    ### TEST print "$#contents_of_the_file\n\n";
    @unique_contents_of_the_file = grep(!$unique_contents_of_the_file{$_}++, @contents_of_the_file);
    open(OUTFILE, ">" . $output_restore_split_filename) or dienice("Can't open $output_restore_split_filename : OUTFILE :$!");
    if ($test ne "YES")
    {
        flock(OUTFILE, 1);
    }
    for ($element_number = 0; $element_number <= $#unique_contents_of_the_file; $element_number++)
    {
        print OUTFILE "$unique_contents_of_the_file[$element_number]\n";
    }
    if ($test ne "YES")
    {
        flock(OUTFILE, 8);
    }
}
You are unnecessarily storing a full copy of the original file in @contents_of_the_file and -- if the amount of duplication is low relative to the file size -- nearly two other full copies in %unique_contents_of_the_file and @unique_contents_of_the_file. As ire_and_curses noted, you can reduce the storage requirements by making two passes over the data: (1) analyze the file, storing information about the line numbers of non-duplicate lines; and (2) process the file again to write non-dups to the output file.
Here is an illustration. I don't know whether I've picked the best module for the hashing function (Digest::MD5); perhaps others will comment on that. Also note the 3-argument form of open(), which you should be using.
use strict;
use warnings;
use Digest::MD5 qw(md5);
my (%seen, %keep_line_nums);
my $in_file = 'data.dat';
my $out_file = 'data_no_dups.dat';
open (my $in_handle, '<', $in_file) or die $!;
open (my $out_handle, '>', $out_file) or die $!;
while ( defined(my $line = <$in_handle>) ) {
    my $hashed_line = md5($line);
    $keep_line_nums{$.} = 1 unless $seen{$hashed_line};
    $seen{$hashed_line} = 1;
}
seek $in_handle, 0, 0;
$. = 0;
while ( defined(my $line = <$in_handle>) ) {
    print $out_handle $line if $keep_line_nums{$.};
}
close $in_handle;
close $out_handle;
You should be able to do this efficiently using hashing. You don't need to store the data from the lines, just identify which ones are the same. So...
Don't slurp - Read one line at a time.
Hash the line.
Store the hashed line representation as a key in a Perl hash of lists. Store the line number as the first value of the list.
If the key already exists, append the duplicate line number to the list corresponding to that value.
At the end of this process, you'll have a data-structure identifying all the duplicate lines. You can then do a second pass through the file to remove those duplicates.
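A minimal sketch of that hash-of-lists structure, assuming Digest::MD5 for the hashing step and a made-up file name (the earlier answer's code is similar but keeps only a seen flag):
use strict;
use warnings;
use Digest::MD5 qw(md5);

my %lines_by_hash;    # hashed line => [ line numbers where it occurs ]
open my $in, '<', 'data.dat' or die $!;    # assumed input name
while (my $line = <$in>) {
    push @{ $lines_by_hash{ md5($line) } }, $.;
}

# mark every occurrence after the first of any duplicated line for removal
my %drop;
for my $nums (values %lines_by_hash) {
    next if @$nums == 1;
    $drop{$_} = 1 for @{$nums}[1 .. $#$nums];
}

# second pass: print everything that was not marked
seek $in, 0, 0;
$. = 0;    # seek does not reset the line counter
while (my $line = <$in>) {
    print $line unless $drop{$.};
}
close $in;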
Perl does heroic things with large files, but 2GB may be a limitation of DOS/Windows.
How much RAM do you have?
If your OS doesn't complain, it may be best to read the file one line at a time, and write immediately to output.
I'm thinking of something using the diamond operator <> but I'm reluctant to suggest any code because on the occasions I've posted code, I've offended a Perl guru on SO.
I'd rather not risk it. I hope the Perl cavalry will arrive soon.
In the meantime, here's a link.
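For what it's worth, a minimal sketch of that line-at-a-time idea with the diamond operator (essentially what a later answer condenses into a one-liner):
use strict;
use warnings;

my %seen;
while (my $line = <>) {                   # <> reads the files named on the command line
    print $line unless $seen{$line}++;    # print each distinct line the first time only
}
Run it as, say, perl dedup.pl bigfile.txt > deduped.txt; note that the %seen hash still grows with the number of distinct lines.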
Here's a solution that works no matter how big the file is. But it doesn't use RAM exclusively, so it's slower than a RAM-based solution. You can also specify the amount of RAM you want this thing to use.
The solution uses a temporary file that the program treats as a database with SQLite.
#!/usr/bin/perl
use DBI;
use Digest::SHA 'sha1_base64';
use Modern::Perl;
my $input= shift;
my $temp= 'unique.tmp';
my $cache_size_in_mb= 100;
unlink $temp if -f $temp;
my $cx= DBI->connect("dbi:SQLite:dbname=$temp");
$cx->do("PRAGMA cache_size = " . $cache_size_in_mb * 1000);
$cx->do("create table x (id varchar(86) primary key, line int unique)");
my $find= $cx->prepare("select line from x where id = ?");
my $list= $cx->prepare("select line from x order by line");
my $insert= $cx->prepare("insert into x (id, line) values(?, ?)");
open(FILE, $input) or die $!;
my ($line_number, $next_line_number, $line, $sha) = 1;
while ($line = <FILE>) {
    $line =~ s/\s+$//s;
    $sha = sha1_base64($line);
    unless ($cx->selectrow_array($find, undef, $sha)) {
        $insert->execute($sha, $line_number);
    }
    $line_number++;
}
seek FILE, 0, 0;
$list->execute;
$line_number = 1;
$next_line_number = $list->fetchrow_array;
while ($line = <FILE>) {
    $line =~ s/\s+$//s;
    if ($next_line_number == $line_number) {
        say $line;
        $next_line_number = $list->fetchrow_array;
        last unless $next_line_number;
    }
    $line_number++;
}
close FILE;
Well, you could use the in-place edit mode of command-line Perl.
perl -i~ -ne 'print unless $seen{$_}++' uberbigfilename
In the "completely different method" category, if you've got Unix commands (e.g. Cygwin):
cat infile | sort | uniq > outfile
This ought to work - no need for Perl at all - which may, or may not, solve your memory problem. However, you will lose the ordering of the infile (as outfile will now be sorted).
EDIT: An alternative solution that's better able to deal with large files may be by using the following algorithm:
Read INFILE line-by-line
Hash each line to a small hash value (e.g. a hash number mod 10)
Append each line to a file unique to the hash number (e.g. tmp-1 to tmp-10)
Close INFILE
Open and sort each tmp-# to a new file sortedtmp-#
Mergesort sortedtmp-[1-10] (i.e. open all 10 files and read them simultaneously), skipping duplicates and writing each iteration to the end output file
This will be safer, for very large files, than slurping.
Parts 2 and 3 could be changed to use a random number instead of a hash number mod 10. (A Perl sketch of steps 1-4 follows below.)
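A minimal Perl sketch of steps 1-4, assuming the input is named INFILE and the buckets are tmp-1 .. tmp-10; the sorting and merging steps are left to tools such as the BigSort script below:
use strict;
use warnings;

my $buckets = 10;
my @out;
for my $n (1 .. $buckets) {
    open $out[$n], '>', "tmp-$n" or die "Can't open tmp-$n: $!";
}

open my $in, '<', 'INFILE' or die "Can't open INFILE: $!";
while (my $line = <$in>) {
    # cheap checksum of the line's bytes, reduced to a bucket number 1..10
    my $n = (unpack('%32C*', $line) % $buckets) + 1;
    print { $out[$n] } $line;
}
close $in;
close $out[$_] for 1 .. $buckets;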
Here's a script BigSort that may help (though I haven't tested it):
# BigSort
#
# sort big file
#
# $1 input file
# $2 output file
#
# equ sort -t";" -k 1,1 $1 > $2
BigSort()
{
    if [ -s $1 ]; then
        rm $1.split.* > /dev/null 2>&1
        split -l 2500 -a 5 $1 $1.split.
        rm $1.sort > /dev/null 2>&1
        touch $1.sort1
        for FILE in `ls $1.split.*`
        do
            echo "sort $FILE"
            sort -t";" -k 1,1 $FILE > $FILE.sort
            sort -m -t";" -k 1,1 $1.sort1 $FILE.sort > $1.sort2
            mv $1.sort2 $1.sort1
        done
        mv $1.sort1 $2
        rm $1.split.* > /dev/null 2>&1
    else
        # work for empty file !
        cp $1 $2
    fi
}

How can I randomly sample the contents of a file?

I have a file with contents
abc
def
high
lmn
...
...
There are more than 2 million lines in the file.
I want to randomly sample lines from the file and output 50K lines. Any thoughts on how to approach this problem? I was thinking along the lines of Perl and its rand function (or a handy shell command would be neat).
Related (Possibly Duplicate) Questions:
Randomly Pick Lines From a File Without Slurping It With Unix
How can I get exactly n random lines from a file with Perl?
Assuming you basically want to output about 2.5% of all lines, this would do:
print if 0.025 > rand while <$input>;
Shell way:
sort -R file | head -n 50000
From perlfaq5: "How do I select a random line from a file?"
Short of loading the file into a database or pre-indexing the lines in the file, there are a couple of things that you can do.
Here's a reservoir-sampling algorithm from the Camel Book:
srand;
rand($.) < 1 && ($line = $_) while <>;
This has a significant advantage in space over reading the whole file in. You can find a proof of this method in The Art of Computer Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.
You can use the File::Random module which provides a function for that algorithm:
use File::Random qw/random_line/;
my $line = random_line($filename);
Another way is to use the Tie::File module, which treats the entire file as an array. Simply access a random array element.
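For example, a minimal sketch of the Tie::File approach (Tie::File ships with Perl; the file name here is an assumption):
use strict;
use warnings;
use Fcntl 'O_RDONLY';
use Tie::File;

my $filename = 'data.txt';    # assumed input name
tie my @lines, 'Tie::File', $filename, mode => O_RDONLY
    or die "Can't tie $filename: $!";

my $line = $lines[ int rand @lines ];    # pick a random array element
print "$line\n";                         # Tie::File strips the trailing newline

untie @lines;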
Perl way:
Use CPAN. There is a module, File::RandomLine, that does exactly what you need.
If you need to extract an exact number of lines:
use strict;
use warnings;
# Number of lines to pick and file to pick from
# Error checking omitted!
my ($pick, $file) = @ARGV;
open(my $fh, '<', $file)
    or die "Can't read file '$file' [$!]\n";
# count lines in file
my ($lines, $buffer);
while (sysread $fh, $buffer, 4096) {
    $lines += ($buffer =~ tr/\n//);
}
# limit number of lines to pick to number of lines in file
$pick = $lines if $pick > $lines;
# build list of N lines to pick, use a hash to prevent picking the
# same line multiple times
my %picked;
for (1 .. $pick) {
    my $n = int(rand($lines)) + 1;
    redo if $picked{$n}++;
}
# loop over file extracting selected lines
seek($fh, 0, 0);
while (<$fh>) {
    print if $picked{$.};
}
close $fh;