Fast alternative to grep -f - perl

file.contain.query.txt
ENST001
ENST002
ENST003
file.to.search.in.txt
ENST001 90
ENST002 80
ENST004 50
Because ENST003 has no entry in the second file and ENST004 has no entry in the first file, the expected output is:
ENST001 90
ENST002 80
To grep multiple queries in a particular file we usually do the following:
grep -f file.contain.query <file.to.search.in >output.file
Since I have about 10,000 queries and almost 100,000 rows in file.to.search.in, it takes a very long time to finish (around 5 hours). Is there a fast alternative to grep -f?

If you want a pure Perl option, read your query file keys into a hash table, then check standard input against those keys:
#!/usr/bin/env perl
use strict;
use warnings;

# build hash table of keys
my $keyring;
open KEYS, "<", "file.contain.query.txt" or die "file.contain.query.txt: $!";
while (<KEYS>) {
    chomp $_;
    $keyring->{$_} = 1;
}
close KEYS;

# look up key from each line of standard input
while (<STDIN>) {
    chomp $_;
    my ($key, $value) = split("\t", $_); # assuming search file is tab-delimited; replace delimiter as needed
    if (defined $keyring->{$key}) { print "$_\n"; }
}
You'd use it like so:
lookup.pl < file.to.search.in.txt
A hash table can take a fair amount of memory, but searches are much faster (hash table lookups are in constant time), which is handy since you have ten times more lines to look up than keys to store.

If you have fixed strings, use grep -F -f, for example grep -F -f file.contain.query.txt file.to.search.in.txt > output.file. Fixed-string matching is significantly faster than regex search.

This Perl code may help you:
use strict;
use warnings;

open my $file1, "<", "file.contain.query.txt" or die $!;
open my $file2, "<", "file.to.search.in.txt" or die $!;

# Hash %KEYS marks the keys listed in the "file.contain.query.txt" file
my %KEYS = ();
while (my $line = <$file1>) {
    chomp $line;
    $KEYS{$line} = 1;
}

while (my $line = <$file2>) {
    if ( $line =~ /(\w+)\s+(\d+)/ ) {
        print "$1 $2\n" if $KEYS{$1};
    }
}

close $file1;
close $file2;

If the files are already sorted:
join file1 file2
if not:
join <(sort file1) <(sort file2)

If you are using Perl version 5.10 or newer, you can join the 'query' terms into a regular expression with the terms separated by a 'pipe' (like ENST001|ENST002|ENST003). Perl builds a 'trie' which, like a hash, does lookups in constant time, so it should run about as fast as the solution using a lookup hash. This is just to show another way to do it.
#!/usr/bin/perl
use strict;
use warnings;
use Inline::Files;

my $query = join "|", map {chomp; $_} <QUERY>;
while (<RAW>) {
    print if /^(?:$query)\s/;
}
__QUERY__
ENST001
ENST002
ENST003
__RAW__
ENST001 90
ENST002 80
ENST004 50

MySQL:
Importing the data into MySQL or a similar database will provide an immense improvement, if that is feasible for you. You could see results in a few seconds.
mysql -e 'select search.* from search join contains using (keyword)' > outfile.txt
# but first you need to create the tables like this (only once off)
create table contains (
keyword varchar(255)
, primary key (keyword)
);
create table search (
keyword varchar(255)
,num bigint
,key (keyword)
);
# and load the data in:
load data infile 'file.contain.query.txt'
into table contains fields terminated by "add column separator here";
load data infile 'file.to.search.in.txt'
into table search fields terminated by "add column separator here";

use strict;
use warnings;

system("sort file.contain.query.txt > qsorted.txt") == 0 or die "sort failed: $?";
system("sort file.to.search.in.txt > dsorted.txt") == 0 or die "sort failed: $?";

open (QFILE, "<qsorted.txt") or die "qsorted.txt: $!";
open (DFILE, "<dsorted.txt") or die "dsorted.txt: $!";

# merge the two sorted files: advance through the data file until its id
# passes the current query id, printing the data line on a match
my $dline = <DFILE>;
while (my $qline = <QFILE>) {
    my ($queryid) = ($qline =~ /ENST(\d+)/);
    while (defined $dline) {
        my ($dataid) = ($dline =~ /ENST(\d+)/);
        last if $dataid > $queryid;
        print $dline if $dataid == $queryid;
        $dline = <DFILE>;
    }
}

This may be a little dated, but is tailor-made for simple UNIX utilities. Given:
keys are fixed-length (here 7 chars)
files are sorted (true in the example) allowing the use of fast merge sort
Then:
$ sort -m file.contain.query.txt file.to.search.in.txt | tac | uniq -d -w7
ENST002 80
ENST001 90
Variants:
To strip the number printed after the key, remove the tac command:
$ sort -m file.contain.query.txt file.to.search.in.txt | uniq -d -w7
To keep sorted order, add an extra tac command at the end:
$ sort -m file.contain.query.txt file.to.search.in.txt | tac | uniq -d -w7 | tac


Extract preceding and trailing characters to a matched string from file in awk

I have a large string file seq.txt of letters, unwrapped, with over 200,000 characters. No spaces, numbers etc, just a-z.
I have a second file search.txt which has lines of 50 unique letters which will match once in seq.txt. There are 4000 patterns to match.
I want to be able to find each of the patterns (lines in file search.txt), and then get the 100 characters before and 100 characters after the pattern match.
I have a script which uses grep and works, but this runs very slowly, only does the first 100 characters, and is written out with echo. I am not knowledgeable enough in awk or perl to interpret scripts online that may be applicable, so I am hoping someone here is!
cat search.txt | while read p; do echo "grep -zoP '.{0,100}$p' seq.txt | sed G"; done > do.grep
Easier example with desired output:
>head seq.txt
abcdefghijklmnopqrstuvwxyz
>head search.txt
fgh
pqr
uvw
>head desiredoutput.txt
cdefghijk
mnopqrstu
rstuvwxyz
Best outcome would be a tab separated file of the 100 characters before \t matched pattern \t 100 characters after. Thank you in advance!
One way
use warnings;
use strict;
use feature 'say';

my $string;
# Read submitted files line by line (or STDIN if @ARGV is empty)
while (<>) {
    chomp;
    $string = $_;
    last;  # just in case, as we need ONE line
}
# $string = q(abcdefghijklmnopqrstuvwxyz);  # test

my $padding = 3;  # for the given test sample

my @patterns = do {
    my $search_file = 'search.txt';
    open my $fh, '<', $search_file or die "Can't open $search_file: $!";
    <$fh>;
};
chomp @patterns;
# my @patterns = qw(bcd fgh pqr uvw);  # test

foreach my $patt (@patterns) {
    if ( $string =~ m/(.{0,$padding}) ($patt) (.{0,$padding})/x ) {
        say "$1\t$2\t$3";
        # or
        # printf "%-3s\t%3s%3s\n", $1, $2, $3;
    }
}
Run as program.pl seq.txt, or pipe the content of seq.txt to it.†
The pattern .{0,$padding} matches any character (.), up to $padding times (3 above); I used the 0 lower bound in case the pattern $patt is found closer to the beginning of the string than $padding (like the first one, bcd, that I added to the example provided in the question). The same goes for the padding after $patt.
For your problem, set $padding to 100. With the 100-wide "padding" before and after each pattern, if a pattern is found closer to the beginning than 100 characters the desired \t alignment could break, when the position falls short of 100 by more than the tab width (typically 8).
That's what the line with the formatted print (printf) is for: it ensures the width of each field regardless of the length of the string being printed. (It is commented out since we are told that no pattern ever falls in the first or last 100 chars.)
If there is indeed never a chance that a matched pattern breaches the first or the last 100 positions then the regex can be simplified to
/(.{$padding}) ($patt) (.{$padding})/x
Note that if a $patt is within the first/last $padding chars then this just won't match.
The program starts the regex engine for each of @patterns, which in principle may raise performance issues (not for one run with the modest number of 4000 patterns, but such requirements tend to change and generally grow). But this is by far the simplest way to go since
we have no clue how the patterns may be distributed in the string, and
one match may be inside the 100-char buffer of another (we aren't told otherwise)
If there is a performance problem with this approach please update.
† The input (and output) of the program can be organized in a better way using named command-line arguments via Getopt::Long, for an invocation like
program.pl --sequence seq.txt --search search.txt --padding 100
where each argument may be optional, with defaults set in the file, and argument names may be shortened and/or given additional names, etc. Let me know if that is of interest.
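For reference, a minimal sketch of such a Getopt::Long interface (the option names and defaults here are only illustrative, not part of the original program):
use strict;
use warnings;
use Getopt::Long;

my %opt = (sequence => 'seq.txt', search => 'search.txt', padding => 100);  # assumed defaults
GetOptions(
    'sequence=s' => \$opt{sequence},
    'search=s'   => \$opt{search},
    'padding=i'  => \$opt{padding},
) or die "Error in command-line arguments\n";
# ... then read $opt{sequence}, load patterns from $opt{search}, and use $opt{padding}
# in the regex exactly as in the program above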
One in awk. -v b=3 is the before-context length, -v a=3 is the after-context length, and -v n=3 is the match length, which is always constant. It hashes all the substrings of seq.txt into memory, so memory use depends on the size of seq.txt and you might want to follow the consumption with top. For example, for abcdefghij it stores s["def"]="abcdefghi", s["efg"]="bcdefghij", etc.
$ awk -v b=3 -v a=3 -v n=3 '
NR==FNR {
    e=length()-(n+a-1)
    for(i=1;i<=e;i++) {
        k=substr($0,(i+b),n)
        s[k]=s[k] (s[k]==""?"":ORS) substr($0,i,(b+n+a))
    }
    next
}
($0 in s) {
    print s[$0]
}' seq.txt search.txt
Output:
cdefghijk
mnopqrstu
rstuvwxyz
You can tell grep to search for all the patterns in one go.
sed 's/.*/.{0,100}&.{0,100}/' search.txt |
grep -zoEf - seq.txt |
sed G >do.grep
4000 patterns should be easy peasy, though if you get to hundreds of thousands, maybe you will want to optimize.
There is no Perl regex here, so I switched from the nonstandard grep -P to the POSIX-compatible and probably more efficient grep -E.
The surrounding context will consume any text it prints, so any match within 100 characters from the previous one will not be printed.
You can try the following approach to your problem:
load string input data
load into an array patterns
loop through each pattern and look for it in the string
form an array from found matches
loop through matches array and print result
NOTE: the code is not tested due to the absence of input data
use strict;
use warnings;
use feature 'say';

my $fname_s = 'seq.txt';
my $fname_p = 'search.txt';

open my $fh_s, '<', $fname_s
    or die "Couldn't open $fname_s";
my $data = do { local $/; <$fh_s> };
close $fh_s;

open my $fh_p, '<', $fname_p
    or die "Couldn't open $fname_p";
my @patterns = <$fh_p>;
close $fh_p;

chomp @patterns;

for ( @patterns ) {
    my @found = $data =~ /(.{100}$_.{100})/g;
    s/(.{100})(.{50})(.{100})/$1\t$2\t$3/ && say for @found;
}
Test code for the provided test data (added later)
use strict;
use warnings;
use feature 'say';

my @pat = qw/fgh pqr uvw/;
my $data = do { local $/; <DATA> };

for ( @pat ) {
    say $1 if $data =~ /(.{3}$_.{3})/;
}
__DATA__
abcdefghijklmnopqrstuvwxyz
Output
cdefghijk
mnopqrstu
rstuvwxyz

How to sort huge CSV file?

I have some huge (2 GB and more) CSV files that I need to sort, using PowerShell or Perl (company request).
I need to be able to sort them by any column, depending on the file.
Some of my CSV files look like this, using double quotes:
Column1;Column2;Column3;Column4
1234;1234;ABCD;"1234;ABCD"
5678;5678;ABCD;"5678;ABCD"
9012;5678;ABCD;"9012;ABCD"
...
In PowerShell, I already tested the Import-CSV solution, but I got an OutOfMemory exception.
I also tried to load my CSV file into a SQL table using an OleDb connection with this code:
$provider = (New-Object System.Data.OleDb.OleDbEnumerator).GetElements() | Where-Object { $_.SOURCES_NAME -like "Microsoft.ACE.OLEDB.*" }
if ($provider -is [system.array]) { $provider = $provider[0].SOURCES_NAME } else { $provider = $provider.SOURCES_NAME }
$csv = "PathToCSV\file.csv"
$firstRowColumnNames = "Yes"
$connstring = "Provider=$provider;Data Source=$(Split-Path $csv);Extended Properties='text;HDR=$firstRowColumnNames;';"
$tablename = (Split-Path $csv -leaf).Replace(".","#")
$sql = "SELECT * from [$tablename] ORDER BY Column3"
# Setup connection and command
$conn = New-Object System.Data.OleDb.OleDbconnection
$conn.ConnectionString = $connstring
$conn.Open()
$cmd = New-Object System.Data.OleDB.OleDBCommand
$cmd.Connection = $conn
$cmd.CommandText = $sql
$cmd.ExecuteReader()
# Clean up
$cmd.dispose
$conn.dispose
But this returns me the error :
Exception calling "ExecuteReader" with "1" argument(s): "No value given for one or more required parameters." and I don't understand why. I tried to modify the code and the SQL, and it still doesn't work.
I think this is the best way to do it, but I haven't managed to make it work so far, and I'm open to any other solution that could work...
I'm a total beginner in Perl, so I just tried to understand how it works and how to use it, but haven't written any code yet.
EDIT
I just tested all the solutions you proposed, and thank you very much for this help.
Neither solution worked, because the modules you told me to use (Text::CSV, File::Sort or Data::Dumper) are not installed in the Perl version I am using, and I can't install them (company restrictions...).
What I've tried instead is a simple sort on a column, without taking care of the double-quotes problem:
use CGI qw(:standard);
use strict;
use warnings;

my $file = 'path\to\my\file.csv';
open (my $csv, '<', $file) || die "cant open";

foreach (<$csv>) {
    chomp;
    my @fields = split(/\;/);
}
@sorted = sort { $a->[1] cmp $b->[1] } @fields;
I was thinking this should work, sorting into my array @sorted the data I have by the 2nd column, but it doesn't work, and I don't understand why...
Use *NIX sort (or its equivalent in the OS you are using) with a delimiter option (e.g., -t';' - make sure to quote the semicolon) to sort the file. If the company requires that you use Perl, wrap the system call to sort in Perl like so:
system "sort -t';' [options] in_file > out_file" and die "cannot sort: $?"
Examples:
Sort by column 1, numerically:
sort -k1,1g -t';' in_file.txt > out_file.txt
Note that the ( head -n1 in_file.txt ; tail -n+2 in_file.txt | sort ... ) construct used in the examples below is needed to keep the header line on top and sort only the data lines.
Sort by column 1, numerically descending:
( head -n1 in_file.txt ; tail -n+2 in_file.txt | sort -k1,1gr -t';' ) > out_file.txt
Sort by column 4, ASCIIbetically, then by column 1, numerically descending:
( head -n1 in_file.txt ; tail -n+2 in_file.txt | sort -k4,4 -k1,1gr -t';' ) > out_file.txt
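If the company insists on Perl, the same header-preserving pipeline can be wrapped in the system call suggested above. A minimal sketch, assuming a Unix-like shell and placeholder file and column names:
use strict;
use warnings;

my ($col, $in, $out) = (3, 'in_file.txt', 'out_file.txt');  # placeholders
my $cmd = "( head -n1 $in ; tail -n+2 $in | sort -t';' -k$col,$col ) > $out";
system($cmd) == 0 or die "cannot sort: $?";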
One could use DBD::CSV, in which case something like this might do the job:
use Data::Dumper;
$Data::Dumper::Deepcopy=1;
$Data::Dumper::Indent=1;
$Data::Dumper::Sortkeys=1;
use DBI;
use Getopt::Long::Descriptive ('describe_options');
use Text::CSV_XS ('csv');
use Try::Tiny;
use 5.01800;
use warnings;
try {
    push @ARGV, '--help'
        unless (@ARGV);
    my ($opts,$usage) = describe_options(
        'my-program %o <some-arg>'
        ,['field|f=s'  ,'the order by field']
        ,['table|t=s'  ,'The table (file) to be sorted']
        ,['new|n=s'    ,'The sorted table (file)']
        ,[]
        ,['verbose|v'  ,'print extra stuff' ,{ default => !!0 }]
        ,['help'       ,'print usage message and exit' ,{ shortcircuit => !1 }]
    );
    if ($opts->help()) { # MAN! MAN!
        say <<"_HELP_";
@{[$usage->text]}
_HELP_
        exit;
    }
    else { # No MAN required.
    };
    my $dbh = DBI->connect("dbi:CSV:",undef,undef,{
        f_ext        => ".csv/r",
        csv_sep_char => ";",
        RaiseError   => 1,
    }) or die "Cannot connect: $DBI::errstr";
    my $sth = $dbh->prepare(
        "select * from @{[$opts->table()]} order by @{[$opts->field()]}"
    );
    # New table
    my $csv = Text::CSV_XS->new({ sep_char => ";" });
    open my $fh, ">:encoding(utf8)", "@{[$opts->new()]}.csv"
        or die "@{[$opts->new()]}.csv: $!";
    # fields
    $sth->execute();
    my $fields_aref = $sth->{NAME};
    $csv->say($fh, $fields_aref);
    # the sorted rows
    my $max_rows = 5_000;
    while (my $aref = $sth->fetchall_arrayref(undef, $max_rows)) {
        $csv->say($fh, $_)
            for (@$aref);
    };
    close $fh
        or die "@{[$opts->new()]}.csv: $!";
    $dbh->disconnect();
}
catch {
    Carp::confess $_;
};
__END__
(Under Windows) invoke it as
perl CSV_01.t -t data -f "column2 DESC" -n newest
Invoked without parameters, it prints the usage message:
my-program [-fntv] [long options...] <some-arg>
-f STR --field STR the order by field
-t STR --table STR The table (file) to be sorted
-n STR --new STR The sorted table (file)
-v --verbose print extra stuff
--help print usage message and exit
(Sadly, verbose doesn't do anything extra.)

How to perform a series of string replacements and be able to easily undo them?

I have a series of strings and their replacements separated by spaces:
a123 b312
c345 d453
I'd like to replace the strings in the left column with those in the right column, and undo the replacements later on. For the first part I could construct a sed command like s/.../...;s/.../..., but that doesn't cover reversing, and it requires me to significantly alter the input, which takes time. Is there a convenient way to do this?
I've listed some examples above; the solution could be anything free for Windows or Linux.
Text editors provide "undo" functionality, but command-line utilities don't. You can write a script to do the replacement, then reverse the replacements file to do the same thing in reverse.
Here's a script that takes a series of replacements in 'replacements.txt' and runs them against the script's input:
#!/usr/bin/perl -w
use strict;

open REPL, "<", "replacements.txt" or die "replacements.txt: $!";
my @replacements;
while (<REPL>) {
    chomp;
    push @replacements, [ split ];
}
close REPL;

while (<>) {
    for my $r (@replacements) { s/$r->[0]/$r->[1]/g }
    print;
}
If you save this file as 'repl.pl', and you save your file above as 'replacements.txt', you can use it like this:
perl repl.pl input.txt >output.txt
To convert your replacements file into a 'reverse-replacements.txt' file, you can use a simple awk command:
awk '{ print $2, $1 }' replacements.txt >reverse-replacements.txt
Then just modify the Perl script to use the reverse replacements file instead of the forward one.
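Alternatively (a small assumed variation, not in the original script), you could take the replacements file name as the first command-line argument so the same script serves both directions:
my $repl_file = shift @ARGV;   # e.g. replacements.txt or reverse-replacements.txt
open REPL, "<", $repl_file or die "$repl_file: $!";
You'd then run perl repl.pl reverse-replacements.txt output.txt > restored.txt to undo the changes.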
use strict;
use warnings;

unless (@ARGV == 3) {
    print "Usage: script.pl <reverse_changes?> <rfile> <input>\n";
    exit;
}
my $reverse_changes = shift;
my $rfile = shift;

open my $fh, "<", $rfile or die $!;
my %reps = map split, <$fh>;
if ($reverse_changes) {
    %reps = reverse %reps;
}

my $rx = join "|", keys %reps;

while (<>) {
    s/\b($rx)\b/$reps{$1}/g;
    print;
}
The word boundary checks \b surrounding the replacements will prevent partial matches, e.g. replacing a12345 with b31245. In the $rx you may wish to escape metacharacters (for example with quotemeta), if such characters can be present in your replacements.
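For instance, a hedged variation of the line above that builds $rx, quoting any metacharacters in the keys:
my $rx = join "|", map quotemeta, keys %reps;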
Usage:
To perform the replacements:
script.pl 0 replace.txt input.txt > output.txt
To reverse changes:
script.pl 1 replace.txt output.txt > output2.txt

Remove duplicates from list of files in perl

I know this should be pretty simple and the shell version is something like:
$ sort example.txt | uniq -u
in order to remove duplicate lines from a file. How would I go about doing this in Perl?
The interesting spin on this question is the uniq -u! I don't think the other answers I've seen tackle this; they deal with sort -u example.txt or (somewhat wastefully) sort example.txt | uniq.
The difference is that the -u option eliminates all occurrences of duplicated lines, so the output is of lines that appear only once.
To tackle this, you need to know how many times each name appears, and then you need to print the names that appear just once. Assuming the list is to be read from standard input, then this code does the trick:
my %counts;
while (<>)
{
    chomp;
    $counts{$_}++;
}

foreach my $name (sort keys %counts)
{
    print "$name\n" if $counts{$name} == 1;
}
Or, using grep:
my %counts;
while (<>)
{
    chomp;
    $counts{$_}++;
}

{
    local $, = "\n";
    print grep { $counts{$_} == 1 } sort keys %counts;
}
Or, if you don't need to remove the newlines (because you're only going to print the names):
my %counts;
$counts{$_}++ for (<>);
print grep { $counts{$_} == 1 } sort keys %counts;
If you do in fact want every name that appears in the input to appear in the output (but only once), then any of the other solutions will do the trick (or will with minimal adaptation). In fact, since the input lines will end with a newline, you can generate the answer in just two lines:
my %counts = map { $_, 1 } <>;
print sort keys %counts;
No, you can't do it in one line by simply replacing %counts in the print line with the map from the first line:
print sort keys map { $_, 1 } <>;
You get the error:
Type of arg 1 to keys must be hash or array (not map iterator) at ...
Or use the uniq function from the List::MoreUtils module after reading the whole file into a list (although it's not a good solution).
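A minimal sketch of that suggestion; note that uniq keeps one copy of each repeated line (like sort -u), rather than dropping all repeated lines the way uniq -u does:
use strict;
use warnings;
use List::MoreUtils qw(uniq);

my @lines = <>;         # slurp the whole input, as the suggestion says
print for uniq @lines;  # newlines are still attached, so no extra separator is needed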
Are you wanting to update a list of files to remove duplicate lines?
Or process a list of files, ignoring duplicate lines?
Or remove duplicate filenames from a list?
Assuming the latter:
my %seen;
@filenames = grep !$seen{$_}++, @filenames;
or other solutions from perldoc -q duplicate
First of all, sort -u xxx.txt would have been smarter than sort | uniq -u.
Second, perl -ne 'print unless $seen{$_}++' is prone to integer overflow, so a more sophisticated way of perl -ne 'if(!$seen{$_}){print;$seen{$_}=1}' seems preferable.

How can I remove non-unique lines from a large file with Perl?

Duplicate data removal using Perl, called via a batch file within Windows
A DOS window in Windows, opened via a batch file.
The batch file calls the Perl script which carries out the actions. I have the batch file.
The script I have works for duplicate data removal, as long as the data file is not too big.
The problem that needs resolving is with larger data files (2 GB or more); at that size a memory error occurs when trying to load the complete file into an array for duplicate data removal.
The memory error occurs in the subroutine at:-
@contents_of_the_file = <INFILE>;
(A completely different method is acceptable so long as it solves this issue, please suggest).
The subroutine is:-
sub remove_duplicate_data_and_file
{
    open(INFILE,"<" . $output_working_directory . $output_working_filename) or dienice ("Can't open $output_working_filename : INFILE :$!");
    if ($test ne "YES")
    {
        flock(INFILE,1);
    }
    @contents_of_the_file = <INFILE>;
    if ($test ne "YES")
    {
        flock(INFILE,8);
    }
    close (INFILE);
    ### TEST print "$#contents_of_the_file\n\n";
    @unique_contents_of_the_file= grep(!$unique_contents_of_the_file{$_}++, @contents_of_the_file);
    open(OUTFILE,">" . $output_restore_split_filename) or dienice ("Can't open $output_restore_split_filename : OUTFILE :$!");
    if ($test ne "YES")
    {
        flock(OUTFILE,1);
    }
    for($element_number=0;$element_number<=$#unique_contents_of_the_file;$element_number++)
    {
        print OUTFILE "$unique_contents_of_the_file[$element_number]\n";
    }
    if ($test ne "YES")
    {
        flock(OUTFILE,8);
    }
}
You are unnecessarily storing a full copy of the original file in @contents_of_the_file and -- if the amount of duplication is low relative to the file size -- nearly two other full copies in %unique_contents_of_the_file and @unique_contents_of_the_file. As ire_and_curses noted, you can reduce the storage requirements by making two passes over the data: (1) analyze the file, storing information about the line numbers of non-duplicate lines; and (2) process the file again to write non-dups to the output file.
Here is an illustration. I don't know whether I've picked the best module for the hashing function (Digest::MD5); perhaps others will comment on that. Also note the 3-argument form of open(), which you should be using.
use strict;
use warnings;
use Digest::MD5 qw(md5);

my (%seen, %keep_line_nums);
my $in_file  = 'data.dat';
my $out_file = 'data_no_dups.dat';

open (my $in_handle, '<', $in_file) or die $!;
open (my $out_handle, '>', $out_file) or die $!;

while ( defined(my $line = <$in_handle>) ){
    my $hashed_line = md5($line);
    $keep_line_nums{$.} = 1 unless $seen{$hashed_line};
    $seen{$hashed_line} = 1;
}

seek $in_handle, 0, 0;
$. = 0;
while ( defined(my $line = <$in_handle>) ){
    print $out_handle $line if $keep_line_nums{$.};
}

close $in_handle;
close $out_handle;
You should be able to do this efficiently using hashing. You don't need to store the data from the lines, just identify which ones are the same. So...
Don't slurp - Read one line at a time.
Hash the line.
Store the hashed line representation as a key in a Perl hash of lists. Store the line number as the first value of the list.
If the key already exists, append the duplicate line number to the list corresponding to that value.
At the end of this process, you'll have a data-structure identifying all the duplicate lines. You can then do a second pass through the file to remove those duplicates.
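A minimal sketch of that hash-of-lists approach, assuming illustrative file names in.txt and out.txt (the digest function is one possible choice, as in the answer above):
use strict;
use warnings;
use Digest::MD5 'md5';

my (%line_nums_by_digest, %is_dup);
open my $in, '<', 'in.txt' or die $!;
while (my $line = <$in>) {
    # digest of the line => list of line numbers where it occurs
    push @{ $line_nums_by_digest{ md5($line) } }, $.;
}
# every line number after the first in each list is a duplicate
for my $nums (values %line_nums_by_digest) {
    $is_dup{$_} = 1 for @{$nums}[1 .. $#$nums];
}
# second pass: copy only the non-duplicate occurrences
seek $in, 0, 0;
$. = 0;
open my $out, '>', 'out.txt' or die $!;
while (my $line = <$in>) {
    print {$out} $line unless $is_dup{$.};
}
close $in;
close $out;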
Perl does heroic things with large files, but 2GB may be a limitation of DOS/Windows.
How much RAM do you have?
If your OS doesn't complain, it may be best to read the file one line at a time, and write immediately to output.
I'm thinking of something using the diamond operator <> but I'm reluctant to suggest any code because on the occasions I've posted code, I've offended a Perl guru on SO.
I'd rather not risk it. I hope the Perl cavalry will arrive soon.
In the meantime, here's a link.
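For what it's worth, a minimal sketch of that line-at-a-time idea with the diamond operator (essentially what the -i one-liner further down does; note the %seen hash still grows with the number of unique lines):
use strict;
use warnings;

my %seen;
while (<>) {                  # reads the file(s) named on the command line, or STDIN
    print unless $seen{$_}++; # print a line only the first time it is seen
}
Run it as perl dedup.pl infile > outfile (the script name is arbitrary).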
Here's a solution that works no matter how big the file is. But it doesn't work purely in RAM, so it's slower than a RAM-based solution. You can also specify the amount of RAM you want this thing to use.
The solution uses a temporary file that the program treats as a database with SQLite.
#!/usr/bin/perl
use DBI;
use Digest::SHA 'sha1_base64';
use Modern::Perl;

my $input = shift;
my $temp  = 'unique.tmp';
my $cache_size_in_mb = 100;

unlink $temp if -f $temp;
my $cx = DBI->connect("dbi:SQLite:dbname=$temp");
$cx->do("PRAGMA cache_size = " . $cache_size_in_mb * 1000);
$cx->do("create table x (id varchar(86) primary key, line int unique)");
my $find   = $cx->prepare("select line from x where id = ?");
my $list   = $cx->prepare("select line from x order by line");
my $insert = $cx->prepare("insert into x (id, line) values(?, ?)");

open(FILE, $input) or die $!;
my ($line_number, $next_line_number, $line, $sha) = 1;
while($line = <FILE>) {
    $line =~ s/\s+$//s;
    $sha = sha1_base64($line);
    unless($cx->selectrow_array($find, undef, $sha)) {
        $insert->execute($sha, $line_number);
    }
    $line_number++;
}

seek FILE, 0, 0;
$list->execute;
$line_number = 1;
$next_line_number = $list->fetchrow_array;
while($line = <FILE>) {
    $line =~ s/\s+$//s;
    if($next_line_number == $line_number) {
        say $line;
        $next_line_number = $list->fetchrow_array;
        last unless $next_line_number;
    }
    $line_number++;
}
close FILE;
close FILE;
Well, you could use the in-place edit mode (-i) of command-line Perl.
perl -i~ -ne 'print unless $seen{$_}++' uberbigfilename
In the "completely different method" category, if you've got Unix commands (e.g. Cygwin):
cat infile | sort | uniq > outfile
This ought to work - no need for Perl at all - which may, or may not, solve your memory problem. However, you will lose the ordering of the infile (as outfile will now be sorted).
EDIT: An alternative solution that's better able to deal with large files uses the following algorithm:
Read INFILE line-by-line
Hash each line to a small hash value (e.g. a hash mod 10)
Append each line to a file unique to the hash number (e.g. tmp-1 to tmp-10)
Close INFILE
Open and sort each tmp-# to a new file sortedtmp-#
Mergesort sortedtmp-[1-10] (i.e. open all 10 files and read them simultaneously), skipping duplicates and writing each iteration to the end output file
This will be safer, for very large files, than slurping.
Parts 2 & 3 could be changed to a random number instead of a hash value mod 10.
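A minimal sketch of steps 1-4 (bucketing the lines into temporary files by a hash of the line; the bucket count and file names here are only illustrative):
use strict;
use warnings;
use Digest::MD5 'md5_hex';

my $buckets = 10;
my @fh;
for my $i (1 .. $buckets) {
    open $fh[$i], '>', "tmp-$i" or die "tmp-$i: $!";
}
open my $in, '<', 'INFILE' or die "INFILE: $!";
while (my $line = <$in>) {
    # hash of the line, mod 10, picks the temporary file
    my $b = 1 + hex(substr(md5_hex($line), 0, 8)) % $buckets;
    print { $fh[$b] } $line;
}
close $in;
close $_ for grep defined, @fh;
# each tmp-# can then be sorted to sortedtmp-# and the sorted files merged,
# skipping duplicates, as in steps 5-6 above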
Here's a script BigSort that may help (though I haven't tested it):
# BigSort
#
# sort big file
#
# $1 input file
# $2 output file
#
# equ sort -t";" -k 1,1 $1 > $2
BigSort()
{
    if [ -s $1 ]; then
        rm $1.split.* > /dev/null 2>&1
        split -l 2500 -a 5 $1 $1.split.
        rm $1.sort > /dev/null 2>&1
        touch $1.sort1
        for FILE in `ls $1.split.*`
        do
            echo "sort $FILE"
            sort -t";" -k 1,1 $FILE > $FILE.sort
            sort -m -t";" -k 1,1 $1.sort1 $FILE.sort > $1.sort2
            mv $1.sort2 $1.sort1
        done
        mv $1.sort1 $2
        rm $1.split.* > /dev/null 2>&1
    else
        # work for empty file !
        cp $1 $2
    fi
}