I needed to use "iconv" to convert char encoding from some files generated on windows. Sometimes those files are very big and execution fails because it runs out of RAM. Googling i found a script which is called "iconv-chunks.pl" which is basically a perl script which processes the files and works pretty well, but it generates temporary files on my /tmp folder.
The problem is that this scripts runs automatically everyday for many files and it keeps generating garbage on my /tmp dir even though it has the cleanup flag ON.
The script im talking about is:
https://code.google.com/p/clschool-team4/source/browse/trunk/iconv-chunks.pl?r=53
#!/usr/bin/perl
our $CHUNK_SIZE = 1024 * 1024 * 100; # 100M
=head1 NAME
iconv-chunks - Process huge files with iconv
=head1 SYNOPSIS
iconv-chunks <filename> [iconv-options]
=head1 DESCRIPTION
The standard iconv program reads the entire input file into
memory, which doesn't work for large files (such as database exports).
This script is just a wrapper that processes the input file
in manageable chunks and writes it to standard output.
The first argument is the input filename (use - to specify standard input).
Anything else is passed through to iconv.
The real iconv needs to be somewhere in your PATH.
=head1 EXAMPLES
# Convert latin1 to utf-8:
./iconv-chunks database.txt -f latin1 -t utf-8 > out.txt
# Input filename of - means standard input:
./iconv-chunks - -f iso8859-1 -t utf8 < database.txt > out.txt
# More complex example, using compressed input/output to minimize disk use:
zcat database.txt.gz | ./iconv-chunks - -f iso8859-1 -t utf8 | \
gzip - > database-utf.dump.gz
=head1 AUTHOR
Maurice Aubrey <maurice.aubrey+iconv#gmail.com>
=cut
# $Id: iconv-chunks 6 2007-08-20 21:14:55Z mla $
use strict;
use warnings;
use bytes;
use File::Temp qw/ tempfile /;
# iconv errors:
# iconv: unable to allocate buffer for input: Cannot allocate memory
# iconv: cannot open input file `database.txt': File too large
@ARGV >= 1 or die "Usage: $0 <inputfile> [iconv-options]\n";
my @options = splice @ARGV, 1;
my($oh, $tmp) = tempfile(undef, CLEANUP => 1);
# warn "Tempfile: $tmp\n";
my $iconv = "iconv @options $tmp";
sub iconv { system($iconv) == 0 or die "command '$iconv' failed: $!" }
my $size = 0;
# must read by line to ensure we don't split multi-byte character
while (<>) {
  $size += length $_;
  print $oh $_;
  if ($size >= $CHUNK_SIZE) {
    iconv;
    truncate $oh, 0 or die "truncate '$tmp' failed: $!";
    seek $oh, 0, 0 or die "seek on '$tmp' failed: $!";
    $size = 0;
  }
}
iconv if $size > 0;
Any help finding the problem, or a way to make it delete the temporary files after it finishes, would be appreciated.
Regards
Change
my($oh, $tmp) = tempfile(undef, CLEANUP => 1);
to
my($oh, $tmp) = tempfile(UNLINK => 1);
CLEANUP is used to trigger removal of temporary directories on exit, not files. Note that passing undef as the first argument in order to use the default template is unnecessary.
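As an aside, if the script ever needs the file gone before the program exits, File::Temp's object interface (per its documentation) removes the file as soon as the object goes out of scope; a small sketch, reusing the script's variable names:

use File::Temp;

# UNLINK => 1 is the default for the object interface; the temporary file is
# removed when $fh goes out of scope, not just at program exit.
my $fh  = File::Temp->new;
my $tmp = $fh->filename;   # use $fh and $tmp where the script used $oh and $tmp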
I am having trouble with directory names with UTF8 characters in them on a Mac (10.11.2) with Perl 5.22 and Postgresql (9.4). Text encoding in Postgresql is set to UTF8.
If I have a directory name with a non-ASCII UTF-8 character in it, I can chdir() to that directory if the directory name is read in by the Perl script or hard-coded as a string in the Perl script. If I insert this name into a PG table and read it back out (SELECT dirname FROM utfdirs), I can't chdir to that directory. However, the strings printed on screen are identical, a Perl cmp test on the two strings reports they are identical, and guess_encoding() reports both are UTF-8.
#!/opt/local/bin/perl5.22
use strict;
use Cwd;
use DBI;
use Encode;
use Encode qw/from_to/;
use Encode::Detect;
use Encode::Guess;
use Encode::UTF8Mac;
#
Encode::Guess->add_suspects(qw/utf-8-mac/);
#
my $dbname = 'test';
my $dbh = DBI->connect("dbi:Pg:dbname=$dbname;host=localhost");
$dbh->do("SET client_min_messages TO WARNING");
#
my $homeDir = '/Users/jldasch';
chdir($homeDir) or die "Cannot cd to [$homeDir]\n";
opendir(D,".");
my @tdlist = sort grep(/(Lambda?)|(Delta?)/,readdir(D));
closedir(D);
$dbh->do("DELETE FROM utfdirs");
my $ins = $dbh->prepare("INSERT INTO utfdirs (dirname) VALUES (?)");
foreach my $d (@tdlist) {
    chdir($homeDir);
    my $ok = chdir($d) ? 1 : 0;
    my $fp = "${homeDir}/${d}";
    printf("%2d %s\n",$ok,$fp);
    $ins->execute($fp);
}
my $rset = $dbh->selectall_arrayref("SELECT dirname FROM utfdirs ORDER BY dirname");
my $i = 0;
foreach my $r (@$rset) {
    my $dbdir = $r->[0];
    my $pdir = ${homeDir} . '/' . $tdlist[$i++];
    print "$r->[0] $pdir\n";
    my $encPerl = guess_encoding($pdir);
    my $encDb = guess_encoding($dbdir);
    print "Perl Encoding [$encPerl->{Name}]\n";
    print "Db Encoding [$encDb->{Name}]\n";
    unless ( chdir($dbdir) ) {
        print "Cannot CD to DbDir [$dbdir]\n";
        print "DbDir and PerlDir Match\n" if ($dbdir eq $pdir);
    }
}
exit;
The output:
bash-3.2$ ./utfstuff2.pl
1 /Users/jldasch/DeltaΔ
1 /Users/jldasch/Lambdaλ
/Users/jldasch/DeltaΔ /Users/jldasch/DeltaΔ
Perl Encoding [utf8]
Db Encoding [utf8]
Cannot CD to DbDir [/Users/jldasch/DeltaΔ]
DbDir and PerlDir Match
/Users/jldasch/Lambdaλ /Users/jldasch/Lambdaλ
Perl Encoding [utf8]
Db Encoding [utf8]
Cannot CD to DbDir [/Users/jldasch/Lambdaλ]
DbDir and PerlDir Match
So at every level I have checked so far, Perl tells me the strings are the same (both cmp and guess_encoding()), and they print the same, but they evidently are not the same.
How do I convert the UTF8 string returned by Postgresql to a string which is accepted (in Perl) as a valid directory name for chdir()?
There is a module Encode::UTF8Mac which appears to solve this.
my $macOkDir = Encode::decode('utf-8-mac',$dbDir)
– John Daschbach
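For what it's worth, a minimal sketch of how that might be applied to the question's loop (untested; it reuses the question's $dbdir and assumes the string coming back from DBD::Pg is the culprit):

use Encode;
use Encode::UTF8Mac;

my $macOkDir = Encode::decode('utf-8-mac', $dbdir);
unless ( chdir($macOkDir) ) {
    print "Still cannot CD to DbDir [$macOkDir]\n";
}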
I have a very tricky-to-diagnose Perl problem that has been seriously hampering my ability to maintain a Perl/CGI website. It usually occurs when editing a script: after a change I get error 500, and after I revert the change it won't work again unless I delete the file and start from scratch. However, I currently have a state in which it can be reproduced by the following two simple scripts, which show just how crazy this bug is:
file1.pl
#! /usr/bin/perl
use CGI::Carp qw(fatalsToBrowser warningsToBrowser);
print "content-type: text/html\n\nIt works";
file2.pl
#! /usr/bin/perl
use CGI::Carp qw(fatalsToBrowser warningsToBrowser);
print "content-type: text/html\n\nIt works";
(I.e., they're identical.)
server.com/cgi-bin/file1.pl works
server.com/cgi-bin/file2.pl results in error 500
Both files have the same size and md5 hash.
Both have the same permissions (755) and the same owner and group.
Both are in the correct folder (hosting supplied cgi-bin).
Both were uploaded in text mode.
Both work with local perl interpreters.
If I rename file1 -> file3, file2 -> file1, and file3 -> file2 (i.e. swap the two files), now file2.pl works and file1.pl doesn't. So my guess is that some state is attached to the files themselves.
If I edit the files in FileZilla and re-upload them (e.g. add some whitespace after a semicolon), the same behaviour occurs with the re-uploaded files.
My error 500 page is set to auto-retry using a meta refresh (in case of out-of-memory errors, etc.), and the error doesn't go away after countless refreshes. It doesn't seem to matter which one is accessed first.
I do not have access to the HTTP error_log on this hosting, so I do not know the reason for the failure. The error also occurs without the "use error messages to browser" diagnostic line.
Can anyone give me a hint as to what this could be and help me fix it?
What you describe can be either caused by some problem on your hosting provider side (some bad caching, or transparent proxies, or any other magic), or—and that is what I think it is—still caused by wrong file permissions or line breaks, even if your file manager reports that everything is good.
If I'm reading your description correctly you basically
can put a script and it will work, but
cannot edit it as it will stop working after that.
As you don't have shell access, just put the following small script in the same directory and run it (hopefully it will run, since you are not going to edit it):
#!/usr/bin/perl
use strict;
use warnings;
print "Content-Type: text/plain\n\n";
opendir( my $dirh, "." );
my #files = grep { -f $_; } readdir $dirh;
closedir $dirh;
foreach my $file (#files) {
my #stat = stat $file;
my ($dev, $ino, $mode, $nlink, $uid, $gid, $rdev,
$size, $atime, $mtime, $ctime, $blksize, $blocks
) = stat($file);
my $octmode = sprintf "%04o", $mode & 07777;
print "$file\tmode=$octmode\tuid=$uid\tgid=$gid\tsize=$size\t";
if ( -r $file ) {
open( my $fh, $file );
my $firstline = <$fh>;
print $firstline =~ /\r\n/ ? "crlf\n" : "lf\n";
close $fh;
} else {
print "can't read\n";
}
}
It will show the real permissions, line breaks, and sizes of the files, taken from the server's filesystem rather than from what your FTP client reports.
It might be worth adding an MD5 or SHA1 hash calculation to this script, but I'm not sure whether you have Digest::MD5 or Digest::SHA1 available.
If you see the same output for file1.pl and file2.pl, just go ahead and contact your hosting provider's support.
My guess: the files don't use the same newline convention.
You can check this (in a Unix shell) using the file command.
Not being able to inspect the errorlog is a big headache.
Nevertheless, I suspect that the cause is still most likely line endings. I would upload a script to examine all of your files like the following:
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use CGI qw(header);
use CGI::Carp qw(fatalsToBrowser warningsToBrowser);
use File::stat;
print header('text/plain');
my $fmt = "%-15s %4s %4s %4s %7s %4s %4s\n";
printf $fmt, qw(File mode uid gid size lf crlf);
printf $fmt, map { '-' x $_ } $fmt =~ /(\d+)/g;
opendir my $dh, '.';
while ( my $file = readdir $dh ) {
    next unless -f $file;
    my $stat = stat $file;
    my %chars;
    my $data = do { local ( @ARGV, $/ ) = $file; <> };
    $chars{$1}++ while $data =~ /(\R)/g;
    printf $fmt, $file, sprintf( "%04o", $stat->mode & 07777 ), $stat->uid,
        $stat->gid, $stat->size, map { $_ // 0 } @chars{ "\n", "\r\n" };
}
Outputs:
Content-Type: text/plain; charset=ISO-8859-1
File mode uid gid size lf crlf
--------------- ---- ---- ---- ------- ---- ----
env.cgi 0775 0 0 266 25 0
le.pl 0775 501 0 696 28 0
lineendings.pl 0755 501 0 516 30 0
mywiki.pl 0755 501 0 226947 0 6666
test.cgi 0755 0 0 2071 65 0
wiki.pl 0755 0 0 219231 6494 0
For additional testing, I would recommend executing each of the scripts using system and inspecting the error conditions if there are any.
I have had the same problem and got help from user BOC, as below:
"You may have problem with encoding of characters. Some editors replace some characters by very close characters when you save files (for example " by “). Try changing editor (notepad++ works well on windows). – BOC"
I downloaded and used Notepad++ instead of Notepad and WinWord; it works for me now.
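If you want to check for that kind of substitution without opening every file in an editor, a quick one-liner in the same spirit as the scripts above is to print any line containing a non-ASCII byte (the filenames here are just placeholders):

perl -ne 'printf "%s line %d: %s", $ARGV, $., $_ if /[^\x00-\x7F]/' file1.pl file2.pl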
I was trying out a sample program that calculates the size of a file given on the command line. It gives the size correctly when the file name is stored in a variable inside the script, but produces no output when the filename comes from the command-line arguments.
#! /usr/bin/perl
use File::stat;
while(<>){
    if(($_ cmp "\n") == 0){
        exit 0;
    }
    else{
        my $file_size = stat($_)->size; # $filesize = -s $_;
        print $file_size;
    }
}
I get no output when using the file test operator -s, and I get errors when using the File::stat module:
Unsuccessful stat on filename containing newline at /usr/share/perl/5.10/File/stat.pm line 49, <> line 1.
Can't call method "size" on an undefined value at 2.pl line 17, <> line 1.
1.txt is the filename I'm giving as an input.
#!/usr/bin/perl
# Take the filenames from @ARGV instead of reading them from <>, so they
# arrive without a trailing newline and -s can stat them.
for (@ARGV){
    my $file_size = -s $_;
    print $file_size;
}
or similar cmd oneliner,
perl -E 'say "$_, size: ", -s for #ARGV' *
#!/usr/bin/perl -w
$filename = '/path/to/your/file.doc';
$filesize = -s $filename;
print $filesize;
Simple enough, right? First you create a string that contains the path to the file that you want to test, then you use the -s File Test Operator on it. You could easily shorten this to one line using simply:
print -s '/path/to/your/file.doc';
Also, keep in mind that -s returns the file's size, so it is true for any file larger than zero bytes and false for a zero-byte file, which makes it a handy, quick way to check for empty files.
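Here is a small sketch of that zero-byte check (the path is just a placeholder):

#!/usr/bin/perl
use strict;
use warnings;

my $path = '/path/to/your/file.doc';
if (-s $path) {
    printf "%s is %d bytes\n", $path, -s _;   # '_' reuses the stat buffer from the last file test
} else {
    print "$path is empty (or does not exist)\n";
}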
I have hundreds of thousands of files that I would like to analyze. Specifically, I would like to calculate the percentage of printable characters in a sample of arbitrary size taken from each file. Some of these files are from mainframes, Windows, Unix, etc., so it is likely that binary and control characters are included.
I started by using the Linux "file" command, but it did not provide enough detail for my purposes. The following code conveys what I am trying to do, but does not always work.
#!/usr/bin/perl -n
use strict;
use warnings;
my $cnt_n_print = 0;
my $cnt_print = 0;
my $cnt_total = 0;
my $prc_print = 0;
#Count the number of non-printable characters
while ($_ =~ m/[^[:print:]]/g) {$cnt_n_print++};
#Count the number of printable characters
while ($_ =~ m/[[:print:]]/g) {$cnt_print++};
$cnt_total = $cnt_n_print + $cnt_print;
$prc_print = $cnt_print/$cnt_total;
#Print the # total number of bytes read followed by the % printable
print "$cnt_total|$prc_print\n"
This is a test call that works:
echo "test_string of characters" | /home/user/scripts/prl/s16_count_chars.pl
This is how I intend to call it, and works for one file:
find /fct/inbound/trans/ -name "TRNST.20121115231358.xf2" -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl
This does not work correctly:
find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl
Neither does this:
find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | perl -0 /home/user/scripts/prl/s16_count_chars.pl
Instead of executing the script once for EACH line returned by the find, it executes ONCE for ALL the results.
Thanks in advance.
Research so far:
Pipe and XARGS and separators
http://help.lockergnome.com/linux/help-understand-pipe-xargs--ftopict549399.html
http://en.wikipedia.org/wiki/Xargs#The_separator_problem
Clarification(s):
1.) Desired output: If there are 932 files in a directory, the output would be a 932 line list of file names, the total bytes read from the file and the % that were printable characters.
2.) Many of the files are binary. Script needs to handle embedded binary eol or eof sequences.
3.) Many of the files are large, so I would like to read only the first/last xx bytes. I had been trying to use head -c 256 or tail -c 128 to read either the first 256 bytes or the last 128 bytes respectively. The solution could either work in the pipeline or limit the bytes read within the Perl script.
The -n option wraps your entire code in a while (defined($_ = <ARGV>)) { ... } block. This means your my $cnt_print and other variable declarations are repeated for every line of input, essentially resetting all your variable values.
The workaround is to use global variables (declare them with our if you want to keep using use strict), and not to initialize them to 0, as they would be reinitialized for every line of input. You could say something like
our $cnt_print //= 0;
if you don't want $cnt_print and its friends to be undefined for the first line of input.
See this recent question with a similar issue.
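Putting that together, here is a minimal sketch of the counting script with the -n wrapper kept but the counters made package variables (untested; it assumes Perl 5.10+ for //= and prints a single total for everything read from standard input):

#!/usr/bin/perl -n
use strict;
use warnings;

# -n wraps this whole body in a while (<>) loop, so only count per line here.
our $cnt_n_print //= 0;
our $cnt_print   //= 0;

$cnt_n_print++ while /[^[:print:]]/g;
$cnt_print++   while /[[:print:]]/g;

# Report once, after the last line has been read.
END {
    my $cnt_total = $cnt_n_print + $cnt_print;
    print $cnt_total, '|', $cnt_total ? $cnt_print / $cnt_total : 0, "\n";
}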
You could have find pass you one arg at a time.
find /fct/inbound/trans/ -type f -exec perl script.pl {} \;
But I'd continue passing multiple files at a time, either through xargs, or using GNU find's -exec +.
find /fct/inbound/trans/ -type f -exec perl script.pl {} +
The following code snippets support both.
You can continue reading a line at a time:
#!/usr/bin/perl
use strict;
use warnings;
my $cnt_total = 0;
my $cnt_n_print = 0;
while (<>) {
    $cnt_total += length;
    ++$cnt_n_print while /[^[:print:]]/g;
} continue {
    if (eof) {
        my $cnt_print = $cnt_total - $cnt_n_print;
        my $prc_print = $cnt_print/$cnt_total;
        print "$ARGV: $cnt_total|$prc_print\n";
        $cnt_total = 0;
        $cnt_n_print = 0;
    }
}
Or you could read a whole file at a time:
#!/usr/bin/perl
use strict;
use warnings;
local $/;
while (<>) {
    my $cnt_n_print = 0;
    ++$cnt_n_print while /[^[:print:]]/g;
    my $cnt_total = length;
    my $cnt_print = $cnt_total - $cnt_n_print;
    my $prc_print = $cnt_print/$cnt_total;
    print "$ARGV: $cnt_total|$prc_print\n";
}
Here is my working solution based on the feedback provided.
I would appreciate any further feedback on form or more efficient methods:
#!/usr/bin/perl
use strict;
use warnings;
# This program receives a file path and name.
# The program attempts to read the first 2000 bytes.
# The output is a list of files, the number of bytes
# actually read and the percent of the bytes that are
# ASCII "printable" aka [\x20-\x7E].
my ($data, $n_bytes, $cnt_n_print, $cnt_print, $prc_print);
@ARGV or die "Pass at least one file name on the command line.\n";
# loop through each file named on the command line
foreach my $file_name (@ARGV) {
    # open the file read only
    open(FILE, '<', $file_name) or die "Can't open $file_name: $!";
    # read each file in binary mode to handle non-printable characters
    binmode FILE;
    # try to read 2000 bytes from FILE, save the results in $data and the
    # actual number of bytes read in $n_bytes
    $n_bytes = read FILE, $data, 2000;
    $cnt_n_print = 0;
    $cnt_print = 0;
    # count the number of non-printable characters
    ++$cnt_n_print while ($data =~ m/[^[:print:]]/g);
    $cnt_print = $n_bytes - $cnt_n_print;
    $prc_print = $cnt_print/$n_bytes;
    print "$file_name|$n_bytes|$prc_print\n";
    close(FILE);
}
Here is a sample of how to call the above script:
find /some/path/to/files/ -type f -exec perl this_script.pl {} +
Here's a list of references I found helpful:
POSIX Bracket Expressions
Opening files in binmode
Read function
Open file read only
Duplicate data removal using Perl, called via a batch file within Windows.
A DOS window in Windows is opened via a batch file.
The batch file calls the Perl script, which carries out the actions. I have the batch file.
The script I have works for duplicate data removal so long as the data file is not too big.
The problem that needs resolving is with larger data files (2 GB or more): with files of this size, a memory error occurs when trying to load the complete file into an array for duplicate data removal.
The memory error occurs in the subroutine at:
@contents_of_the_file = <INFILE>;
(A completely different method is acceptable so long as it solves this issue; please suggest.)
The subroutine is:
sub remove_duplicate_data_and_file
{
    open(INFILE,"<" . $output_working_directory . $output_working_filename) or dienice ("Can't open $output_working_filename : INFILE :$!");
    if ($test ne "YES")
    {
        flock(INFILE,1);
    }
    @contents_of_the_file = <INFILE>;
    if ($test ne "YES")
    {
        flock(INFILE,8);
    }
    close (INFILE);
    ### TEST print "$#contents_of_the_file\n\n";
    @unique_contents_of_the_file = grep(!$unique_contents_of_the_file{$_}++, @contents_of_the_file);
    open(OUTFILE,">" . $output_restore_split_filename) or dienice ("Can't open $output_restore_split_filename : OUTFILE :$!");
    if ($test ne "YES")
    {
        flock(OUTFILE,1);
    }
    for($element_number=0;$element_number<=$#unique_contents_of_the_file;$element_number++)
    {
        print OUTFILE "$unique_contents_of_the_file[$element_number]\n";
    }
    if ($test ne "YES")
    {
        flock(OUTFILE,8);
    }
}
You are unnecessarily storing a full copy of the original file in @contents_of_the_file and -- if the amount of duplication is low relative to the file size -- nearly two other full copies in %unique_contents_of_the_file and @unique_contents_of_the_file. As ire_and_curses noted, you can reduce the storage requirements by making two passes over the data: (1) analyze the file, storing information about the line numbers of non-duplicate lines; and (2) process the file again to write non-dups to the output file.
Here is an illustration. I don't know whether I've picked the best module for the hashing function (Digest::MD5); perhaps others will comment on that. Also note the 3-argument form of open(), which you should be using.
use strict;
use warnings;
use Digest::MD5 qw(md5);
my (%seen, %keep_line_nums);
my $in_file = 'data.dat';
my $out_file = 'data_no_dups.dat';
open (my $in_handle, '<', $in_file) or die $!;
open (my $out_handle, '>', $out_file) or die $!;
while ( defined(my $line = <$in_handle>) ){
    my $hashed_line = md5($line);
    $keep_line_nums{$.} = 1 unless $seen{$hashed_line};
    $seen{$hashed_line} = 1;
}
seek $in_handle, 0, 0;
$. = 0;
while ( defined(my $line = <$in_handle>) ){
    print $out_handle $line if $keep_line_nums{$.};
}
close $in_handle;
close $out_handle;
You should be able to do this efficiently using hashing. You don't need to store the data from the lines, just identify which ones are the same. So...
Don't slurp - Read one line at a time.
Hash the line.
Store the hashed line representation as a key in a Perl hash of lists. Store the line number as the first value of the list.
If the key already exists, append the duplicate line number to the list corresponding to that value.
At the end of this process, you'll have a data-structure identifying all the duplicate lines. You can then do a second pass through the file to remove those duplicates.
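Here is a minimal sketch of that first pass (untested; 'data.dat' is a placeholder and MD5 is just one reasonable choice of hash function):

use strict;
use warnings;
use Digest::MD5 qw(md5);

# Group line numbers by a digest of each line's content.
my %line_nums_for;
open my $in, '<', 'data.dat' or die $!;
while (my $line = <$in>) {
    push @{ $line_nums_for{ md5($line) } }, $.;
}
close $in;
# Any array with more than one element names a set of duplicate lines; keep
# the first line number in each array and skip the rest on the second pass.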
Perl does heroic things with large files, but 2GB may be a limitation of DOS/Windows.
How much RAM do you have?
If your OS doesn't complain, it may be best to read the file one line at a time, and write immediately to output.
I'm thinking of something using the diamond operator <> but I'm reluctant to suggest any code because on the occasions I've posted code, I've offended a Perl guru on SO.
I'd rather not risk it. I hope the Perl cavalry will arrive soon.
In the meantime, here's a link.
Here's a solution that works no matter how big the file is. But it doesn't work purely in RAM, so it's slower than a RAM-based solution. You can also specify the amount of RAM you want this thing to use.
The solution uses a temporary file that the program treats as a database with SQLite.
#!/usr/bin/perl
use DBI;
use Digest::SHA 'sha1_base64';
use Modern::Perl;
my $input= shift;
my $temp= 'unique.tmp';
my $cache_size_in_mb= 100;
unlink $temp if -f $temp;
my $cx= DBI->connect("dbi:SQLite:dbname=$temp");
$cx->do("PRAGMA cache_size = " . $cache_size_in_mb * 1000);
$cx->do("create table x (id varchar(86) primary key, line int unique)");
my $find= $cx->prepare("select line from x where id = ?");
my $list= $cx->prepare("select line from x order by line");
my $insert= $cx->prepare("insert into x (id, line) values(?, ?)");
open(FILE, $input) or die $!;
my ($line_number, $next_line_number, $line, $sha)= 1;
while($line= <FILE>) {
    $line=~ s/\s+$//s;
    $sha= sha1_base64($line);
    unless($cx->selectrow_array($find, undef, $sha)) {
        $insert->execute($sha, $line_number);
    }
    $line_number++;
}
seek FILE, 0, 0;
$list->execute;
$line_number= 1;
$next_line_number= $list->fetchrow_array;
while($line= <FILE>) {
    $line=~ s/\s+$//s;
    if($next_line_number == $line_number) {
        say $line;
        $next_line_number= $list->fetchrow_array;
        last unless $next_line_number;
    }
    $line_number++;
}
close FILE;
Well, you could use the in-place edit mode of command-line perl.
perl -i~ -ne 'print unless $seen{$_}++' uberbigfilename
In the "completely different method" category, if you've got Unix commands (e.g. Cygwin):
cat infile | sort | uniq > outfile
This ought to work - no need for Perl at all - which may, or may not, solve your memory problem. However, you will lose the ordering of the infile (as outfile will now be sorted).
EDIT: An alternative solution that's better able to deal with large files uses the following algorithm:
1.) Read INFILE line by line.
2.) Hash each line to a small hash value (e.g. a hash mod 10).
3.) Append each line to a file unique to the hash number (e.g. tmp-1 to tmp-10).
4.) Close INFILE.
5.) Open and sort each tmp-# to a new file sortedtmp-#.
6.) Mergesort sortedtmp-[1-10] (i.e. open all 10 files and read them simultaneously), skipping duplicates and writing each iteration to the end output file (see the sketch after this list).
This will be safer, for very large files, than slurping.
Parts 2 & 3 could be changed to use a random number instead of a hash value mod 10.
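The "skipping duplicates" part of step 6 is straightforward once the chunks are merged, because a sorted merge puts identical lines next to each other. A small Perl sketch of that final filter, reading the merged stream on standard input (the script name in the comment is just a placeholder):

#!/usr/bin/perl
use strict;
use warnings;

# e.g.  sort -m sortedtmp-* | ./skip_dups.pl > outfile
my $prev;
while (my $line = <STDIN>) {
    print $line unless defined $prev && $line eq $prev;
    $prev = $line;
}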
Here's a script BigSort that may help (though I haven't tested it):
# BigSort
#
# sort big file
#
# $1 input file
# $2 output file
#
# equ sort -t";" -k 1,1 $1 > $2
BigSort()
{
    if [ -s $1 ]; then
        rm $1.split.* > /dev/null 2>&1
        split -l 2500 -a 5 $1 $1.split.
        rm $1.sort > /dev/null 2>&1
        touch $1.sort1
        for FILE in `ls $1.split.*`
        do
            echo "sort $FILE"
            sort -t";" -k 1,1 $FILE > $FILE.sort
            sort -m -t";" -k 1,1 $1.sort1 $FILE.sort > $1.sort2
            mv $1.sort2 $1.sort1
        done
        mv $1.sort1 $2
        rm $1.split.* > /dev/null 2>&1
    else
        # work for empty file !
        cp $1 $2
    fi
}