How to sort huge CSV file? - powershell

I have some huge (2 GB and more) CSV files that I need to sort, using PowerShell or Perl (company request).
I need to be able to sort them by any column, depending on the file.
Some of my CSV files look like this, using double quotes:
Column1;Column2;Column3;Column4
1234;1234;ABCD;"1234;ABCD"
5678;5678;ABCD;"5678;ABCD"
9012;5678;ABCD;"9012;ABCD"
...
In PowerShell, I already tried Import-Csv, but I got an OutOfMemoryException.
I also tried to load my CSV file into a SQL table using an OleDb connection with this code:
$provider = (New-Object System.Data.OleDb.OleDbEnumerator).GetElements() | Where-Object { $_.SOURCES_NAME -like "Microsoft.ACE.OLEDB.*" }
if ($provider -is [system.array]) { $provider = $provider[0].SOURCES_NAME } else { $provider = $provider.SOURCES_NAME }
$csv = "PathToCSV\file.csv"
$firstRowColumnNames = "Yes"
$connstring = "Provider=$provider;Data Source=$(Split-Path $csv);Extended Properties='text;HDR=$firstRowColumnNames;';"
$tablename = (Split-Path $csv -leaf).Replace(".","#")
$sql = "SELECT * from [$tablename] ORDER BY Column3"
# Setup connection and command
$conn = New-Object System.Data.OleDb.OleDbconnection
$conn.ConnectionString = $connstring
$conn.Open()
$cmd = New-Object System.Data.OleDB.OleDBCommand
$cmd.Connection = $conn
$cmd.CommandText = $sql
$cmd.ExecuteReader()
# Clean up
$cmd.Dispose()
$conn.Dispose()
But this returns the error:
Exception calling "ExecuteReader" with "1" argument(s): "No value given for one or more required parameters." and I don't understand why. I tried modifying the code and the SQL query, and it still doesn't work.
I think this is the best approach, but I haven't managed to make it work so far, and I'm open to any other solution that could work...
I'm a total beginner in Perl, so I have only tried to understand how it works and how to use it; I haven't written any code yet.
EDIT
I just tested all the solutions you proposed, and thank you very much for this help.
Neither solution worked, because the modules you suggested (Text::CSV, File::Sort or Data::Dumper) are not installed in the Perl version I am using, and I can't install them (company restrictions...).
What I tried instead is a simple sort on one column, ignoring the double-quotes problem for now:
use CGI qw(:standard);
use strict;
use warnings;
my $file = 'path\to\my\file.csv';
open (my $csv, '<', $file) || die "can't open";
foreach (<$csv>) {
chomp;
my @fields = split(/;/);
}
@sorted = sort { $a->[1] cmp $b->[1] } @fields;
I thought this would work, sorting the data by the 2nd column into my array @sorted, but it doesn't work, and I don't understand why...

Use *NIX sort (or its equivalent in the OS you are using) with a delimiter option (e.g., -t';' - make sure to quote the semicolon) to sort the file. If the company requires that you use Perl, wrap the system call to sort in Perl like so:
system "sort -t';' [options] in_file > out_file" and die "cannot sort: $?"
Examples:
Sort by column 1, numerically:
sort -k1,1g -t';' in_file.txt > out_file.txt
Note that ( head -n1 in_file.txt ; tail -n+2 in_file.txt | sort ... ) is needed to keep the header on top and sort only the data lines.
Sort by column 1, numerically descending:
( head -n1 in_file.txt ; tail -n+2 in_file.txt | sort -k1,1gr -t';' ) > out_file.txt
Sort by column 4, ASCIIbetically, then by column 1, numerically descending:
( head -n1 in_file.txt ; tail -n+2 in_file.txt | sort -k4,4 -k1,1gr -t';' ) > out_file.txt
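If the call has to live inside a Perl script, a minimal sketch might look like this (assuming a Unix-like environment where head, tail and sort are on the PATH; file names and the sort column are placeholders):
#!/usr/bin/perl
use strict;
use warnings;

my ($in, $out) = ('in_file.txt', 'out_file.txt');

# Keep the header line on top and sort only the data lines,
# here by column 3, ASCIIbetically, with ';' as the field delimiter.
my $cmd = qq{( head -n1 '$in' ; tail -n+2 '$in' | sort -t';' -k3,3 ) > '$out'};
system($cmd) == 0 or die "cannot sort: $?";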

One could use DBD::CSV, in which case something like this might do the job:
use Data::Dumper;
$Data::Dumper::Deepcopy=1;
$Data::Dumper::Indent=1;
$Data::Dumper::Sortkeys=1;
use Carp;
use DBI;
use Getopt::Long::Descriptive ('describe_options');
use Text::CSV_XS ('csv');
use Try::Tiny;
use 5.018;
use warnings;
try {
push @ARGV,'--help'
unless (@ARGV);
my ($opts,$usage)=describe_options(
'my-program %o <some-arg>'
,['field|f=s' ,'the order by field']
,['table|t=s' ,'The table (file) to be sorted']
,['new|n=s' ,'The sorted table (file)']
,[]
,['verbose|v' ,'print extra stuff' ,{ default => !!0 }]
,['help' ,'print usage message and exit' ,{ shortcircuit => 1 }]
);
if ($opts->help()) { # MAN! MAN!
say <<"_HELP_";
@{[$usage->text]}
_HELP_
exit;
}
else { # No MAN required.
};
my $dbh=DBI->connect("dbi:CSV:",undef,undef,{
f_ext => ".csv/r",
csv_sep_char => ";",
RaiseError => 1,
}) or die "Cannot connect: $DBI::errstr";
my $sth=$dbh->prepare(
"select * from #{[$opts->table()]} order by #{[$opts->field()]}"
);
# New table
my $csv=Text::CSV_XS->new({ sep_char => ";" });
open my $fh,">:encoding(utf8)","@{[$opts->new()]}.csv"
or die "@{[$opts->new()]}.csv: $!";
# fields
$sth->execute();
my $fields_aref=$sth->{NAME};
$csv->say($fh,$fields_aref);
# the sorted rows
my $max_rows=5_000;
while (my $aref=$sth->fetchall_arrayref(undef,$max_rows)) {
last unless @$aref; # an empty batch means all rows have been fetched
$csv->say($fh,$_)
for (@$aref);
};
close $fh
or die "@{[$opts->new()]}.csv: $!";
$dbh->disconnect();
}
catch {
Carp::confess $_;
};
__END__
(Under Windows) invoke it as:
perl CSV_01.t -t data -f "column2 DESC" -n newest
Invoked without parameters, it prints the help:
my-program [-fntv] [long options...] <some-arg>
-f STR --field STR the order by field
-t STR --table STR The table (file) to be sorted
-n STR --new STR The sorted table (file)
-v --verbose print extra stuff
--help print usage message and exit
(Sadly, --verbose doesn't do anything extra.)
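For reference, a stripped-down sketch of the same idea without the option handling (assuming a file named data.csv in the current directory, with ';' separators and column names in a header row):
use strict;
use warnings;
use DBI;

# DBD::CSV treats each .csv file in the working directory as a table.
my $dbh = DBI->connect('dbi:CSV:', undef, undef, {
    f_ext        => '.csv/r',
    csv_sep_char => ';',
    RaiseError   => 1,
}) or die "Cannot connect: $DBI::errstr";

my $sth = $dbh->prepare('SELECT * FROM data ORDER BY Column3');
$sth->execute();
while (my $row = $sth->fetchrow_arrayref) {
    print join(';', @$row), "\n";
}
$dbh->disconnect();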

Related

How to get array of hash arguments using Getopt::Long lib in perl?

I want to take arguments as an array of hashes by using Getopt::Long in my script.
Consider the following command line example:
perl testing.pl --systems id=sys_1 ip_address=127.0.0.1 id=sys_2 ip_address=127.0.0.2
For the sake of simplicity, I'm using two systems and only two sub-arguments for each system, i.e., id and ip_address. Ideally, the number of systems is dynamic; it may be 1, 2 or more, and likewise for the number of arguments of each system.
My script should handle these arguments in such a way that they are stored in the @systems array, with each element being a hash containing id and ip_address.
Is there any way in Getopt::Long to achieve this without parsing it myself?
Following is pseudocode for what I'm trying to achieve:
testing.pl
use Getopt::Long;
my @systems;
GetOptions('systems=s' => \@systems);
foreach (@systems) {
print $_->{id},' ', $_->{ip_address};
}
Here is an attempt, there might be more elegant solutions:
GetOptions('systems=s{1,}' => \my @temp );
my @systems;
while (@temp) {
my $value1 = shift @temp;
$value1 =~ s/^(\w+)=//; my $key1 = $1;
my $value2 = shift @temp;
$value2 =~ s/^(\w+)=//; my $key2 = $1;
push @systems, { $key1 => $value1, $key2 => $value2 };
}
for (@systems) {
print $_->{id},' ', $_->{ip_address}, "\n";
}
Output:
sys_1 127.0.0.1
sys_2 127.0.0.2
I actually think this is a design problem more than a problem with GetOpt: the notion of supporting multiple, paired arguments passed as command-line arguments is something you'd be far better off avoiding.
There's a reason GetOpt doesn't really support it: it isn't really a scalable solution.
How about instead just reading the values from STDIN?:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my %systems = do { local $/; <STDIN> } =~ m/id=(\w+) ip_address=([\d\.]+)/mg;
print Dumper \%systems;
And then you'd be able to invoke your script as:
perl testing.pl <filename_with_args>
Or similar.
And if you really must:
my %systems = "#ARGV" =~ m/id=(\w+) ip_address=([\d\.]+)/g;
Both of the above work for multiple parameters.
However, your comment on another post:
I can't because I'm fetching parameters from database and converting them into command line and then passing it to the script using system command $cmd_lines_args = '--system --id sys_1 --ip_address 127.0.0.0.1'; system("perl testing.pl $cmd_lines_args"); $cmd_lines_args I'll generate dynamically using for loop by reading from database
.. that makes this an XY Problem.
Don't do it like that:
open ( my $script, '|-', "perl testing.pl" );
print {$script} "id=some_id ip_address=10.9.8.7\n";
print {$script} "id=sys2 ip_address=10.9.8.7\n";
etc.
What you are describing,
--systems id=sys_1 ip_address=127.0.0.1 id=sys_2 ip_address=127.0.0.2
appears to be one option that takes a variable number of arguments that are pairs, and come in multiples of two. Getopt::Long's "Options with multiple values" lets you do the following:
GetOptions('systems=s{2,4}' => \@systems);
This lets you specify 2, 3 or 4 arguments, but it does not have syntax for "any even number of arguments" (to cover an arbitrary number of pairs beyond two), and you still have to unpack the id=sys_1 manually then. You can write a user-defined subroutine that handles the processing of --systems' arguments (but does not take into account missing id=...s):
my $system;
my %systems;
GetOptions('systems=s{,}' => sub {
my $option = shift;
my $pair = shift;
my ($key, $value) = split /=/, $pair;
$system = $value if $key eq 'id';
$systems{$system} = $value if $key eq 'ip_address';
});
I would however prefer one of the following schemes:
--system sys_1 127.0.0.1 --system sys_2 127.0.0.1
--system sys_1=127.0.0.1 --system sys_2=127.0.0.1
They're achieved with the following:
GetOptions('system=s{2}' => \@systems);
GetOptions('system=s%' => \%systems);
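For example, the second scheme can be handled in a few lines (the script and option names are just for illustration):
use strict;
use warnings;
use Getopt::Long;

my %systems;
GetOptions('system=s%' => \%systems) or die "bad options\n";

# invoked as: perl testing.pl --system sys_1=127.0.0.1 --system sys_2=127.0.0.2
print "$_ $systems{$_}\n" for sort keys %systems;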
I would just parse the --systems arg and quote the "hashes" like this:
perl testing.pl --systems "id=s1 ip_address=127.0.0.1 id=s2 ip_address=127.0.0.2"
Parse like perhaps so:
my($systems,@systems);
GetOptions('systems=s' => \$systems);
for(split/\s+/,$systems){
my($key,$val)=split/=/,$_,2;
push @systems, {} if !@systems or exists $systems[-1]{$key};
$systems[-1]{$key}=$val;
}
print "$_->{id} $_->{ip_address}\n" for @systems;
Prints:
sys_1 127.0.0.1
sys_2 127.0.0.2

Perl Text::CSV $csv->fields() property not populated

I've got a script that reformats incoming data from a CSV into a format readable by a vended system. I may be going crazy, but I'm pretty sure it worked a week or two ago in the production environment. However, at some point in the last week or two, it stopped working. I tracked the problem down to the Text::CSV module not populating the $csv->fields() property.
my $csv = Text::CSV->new({sep_char => '|', allow_loose_quotes => 1});
$csv->column_names($csv->getline(*READ));
my @keys = $csv->fields;
Now, on my local machine (and, at least in my head, in the production environment two weeks ago, too), this would populate @keys with the parsed header fields. However, now, in both production and pre-production, this fails. The only difference I can find is that my machine is running Perl 5.12.4, while prod/pprd run 5.8.8. The Text::CSV module on both is 1.21.
On my machine, if I use Data::Dumper and dump the $csv object, part of the properties is
'_FIELDS' => [
'ID',
'IDCARD_TYPE',
'FIRST_NAME',
'MIDDLE_NAME',
'LAST_NAME',
...
'EMAIL',
],
On the other machines:
'_FIELDS' => undef,
I've worked around this by using $csv->column_names to populate @keys, but something doesn't seem right and I'd really like to figure out what's going on. Any ideas?
Per the Text::CSV documentation, returning undef is the expected result of fields() after calling getline(). Try using parse() first. You might be using a different version of this module on your local machine. You can check the version using perl -MText::CSV -e 'print $Text::CSV::VERSION'.
Note that the return value is undefined after using getline (), which
does not fill the data structures returned by parse ().
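For instance, a minimal sketch using parse() on the header line (assuming a '|'-separated file named input.csv whose first line is the header):
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ sep_char => '|', allow_loose_quotes => 1 });
open my $fh, '<', 'input.csv' or die "input.csv: $!";

my $header = <$fh>;
$csv->parse($header) or die "parse failed";
my @keys = $csv->fields;   # populated here, because parse() was used
print join(', ', @keys), "\n";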
Following alternate sequence worked for me:
$file = "test.csv" ;
if(!open($fh, "<", $file )) {
# Cannot call getline is a symptom of a bad open()
printf("### Error %s: could not open file %s\n", $ws, $file) ;
close($fh) ;
exit 1 ;
}
while(my $row = $csv->getline($fh)) {
# $row is a pointer to an Array
# The array is already parsed.
#items = #{$row} ;
for(my $i=0 ; $i<=$#items; $i++) {
printf("Field %d: (%s)\n", $i, $items[$i] ) ;
}
}
close $fh ;

ksh perl script.. if condition

Friends...
I have a bash script which calls a Perl script and emails the logfile result every time.
I want to change my bash script so that it only sends the email when the Perl subroutine's row counter ($rcounter++) found something, not every time.
Any tips on how to change the .ksh file?
.ksh
#!/bin/ksh
d=`date +%Y%m%d`
log_dir=$HOME
output_file=log.list
if ! list_tables -login /@testdb -outputFile $output_file
then
mailx -s "list report : $d" test@mail < $output_file
fi
=======Below if condition also works for me=============================
list_tables -login /@testdb -outputFile $output_file
if [ "$?" -ne 0 ]
then
mailx -s "list report : $d" test@mail < $output_file
fi
========================================================================
Perl Script: list_tables
use strict;
use Getopt::Long;
use DBI;
use DBD::Oracle qw(:ora_types);
my $exitStatus = 0;
my %options = ();
my $oracleLogin;
my $outputFile;
my $runDate;
my $logFile;
my $rcounter;
($oracleLogin, $outputFile) = &validateCommandLine();
my $db = &attemptconnect($oracleLogin);
&reportListTables($outputFile);
$db->disconnect;
exit($rcounter);
#---------------------------
sub reportListTables {
my $outputFile = shift;
if ( ! open (OUT,">" . $outputFile)) {
&logMessage("Error opening $outputFile");
}
print OUT &putTitle;
my $oldDB="DEFAULT";
my $dbCounter = 0;
print OUT &putHeader();
# iterate over results
for (my $i=0; $i<=$#lstSessions; $i++) {
# print result row
print OUT &putRow($i);
$dbCounter++;
}
print OUT &putFooter($dbCounter);
print OUT " *** Report End \n";
close OUT;
}
#------------------------------
sub putTitle {
my $title = qq{
List Tables: Yesterday
--------------
};
return $title;
}
#------------------------------
sub putHeader {
my $header = qq{
TESTDB
==============
OWNER Table Created
};
return $header;
}
#------------------------------
sub putRow {
my $indx = shift;
my $ln = sprintf "%-19s %-30s %-19s",
$lstSessions[$indx]{owner},
$lstSessions[$indx]{object_name},
$lstSessions[$indx]{created};
return "$ln\n";
}
#------------------------------
sub getListTables {
my $runDt = shift;
$rcounter = 0;
my $SQL = qq{
select owner, object_name, to_char(created,'MM-DD-YYYY') from dba_objects
};
my $sth = $db->prepare ($SQL) or die $db->errstr;
$sth->execute() or die $db->errstr;
while (my @row = $sth->fetchrow_array) {
$lstSessions[$rcounter] {owner} =$row[0];
$lstSessions[$rcounter] {object_name} =$row[1];
$lstSessions[$rcounter] {created} =$row[2];
&logMessage(" Owner: $lstSessions[$rcounter]{owner}");
&logMessage(" Table: $lstSessions[$rcounter]{object_name}");
&logMessage(" created: $lstSessions[$rcounter]{created}");
$rcounter++;
}
&logMessage("$rcounter records found...");
}
thanks..
also happy to include mail-x part in perl if that makes life more easy..
I am not sure I understood your question correctly. Also, your code is incomplete. So there's some guessing involved.
You cannot check the value of a local Perl variable from the caller's side.
But if your question is whether the Perl code added anything to the logfile, the solution is simple: delete the "$rcounter records found..." line (which doesn't make sense anyway, since it is always executed whether the query returned results or not). Then let the shell script back up the logfile before the call to Perl and compare the two afterwards, sending the mail only if output has been added to the logfile (a rough sketch follows).
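A rough sketch of that idea, in Perl for consistency (the file names and the mail command are placeholders, and File::Compare stands in for diff):
use strict;
use warnings;
use File::Compare qw(compare);
use File::Copy    qw(copy);

my $log = 'log.list';
copy($log, "$log.bak") or die "backup failed: $!";

system('list_tables', '-login', '/@testdb', '-outputFile', $log) == 0
    or warn "list_tables exited non-zero\n";

# Mail only if the log changed since the backup; compare() returns 0 for equal files.
if (compare($log, "$log.bak") != 0) {
    system("mailx -s 'list report' test\@mail < $log");
}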
If this doesn't help you, please clarify the question.
EDIT (from comments below):
Shell scripting isn't that difficult. Right now, your Perl script ends with:
exit($rcounter);
That is your exit code, and it is exactly what you need, provided $rcounter holds the number of data rows written. Make sure $rcounter is declared at the top of the Perl script (e.g. after my $logFile;) rather than only inside getListTables(), so the value counted in the subroutine is the one the script exits with.
Voila, your Perl script now returns the number of data rows written.
To the shell, an exit status of 0 means success and any non-zero status means failure. Since exit($rcounter) is non-zero exactly when rows were found, all you need is to make the mailing depend on a non-zero ("failed") exit of the Perl script, which is what your first snippet already does:
if ! list_tables -login /@testdb -outputFile $output_file
then
mailx -s "list report : $d" test@mail < $output_file
fi
(One caveat: exit statuses are truncated to the range 0-255, so a row count that is an exact multiple of 256 would look like success. Exiting with $rcounter ? 1 : 0 avoids that.)
A comment on the side: It looks to me as if your programming skill isn't up to par with the scope of the problem you are trying to solve. If returning a value from Perl to bash gives you that much trouble, you should probably spend your time with tutorials, not with getting input from a database and sending emails around. Learn to walk before you try to fly...

Fast alternative to grep -f

file.contain.query.txt
ENST001
ENST002
ENST003
file.to.search.in.txt
ENST001 90
ENST002 80
ENST004 50
Because ENST003 has no entry in the 2nd file and ENST004 has no entry in the 1st file, the expected output is:
ENST001 90
ENST002 80
To grep multiple queries in a particular file, we usually do the following:
grep -f file.contain.query <file.to.search.in >output.file
Since I have around 10,000 queries and almost 100,000 rows in file.to.search.in, it takes a very long time to finish (around 5 hours). Is there a fast alternative to grep -f?
If you want a pure Perl option, read your query file keys into a hash table, then check standard input against those keys:
#!/usr/bin/env perl
use strict;
use warnings;
# build hash table of keys
my $keyring;
open KEYS, "< file.contain.query.txt";
while (<KEYS>) {
chomp $_;
$keyring->{$_} = 1;
}
close KEYS;
# look up key from each line of standard input
while (<STDIN>) {
chomp $_;
my ($key, $value) = split("\t", $_); # assuming search file is tab-delimited; replace delimiter as needed
if (defined $keyring->{$key}) { print "$_\n"; }
}
You'd use it like so:
lookup.pl < file.to.search.in.txt
A hash table can take a fair amount of memory, but searches are much faster (hash table lookups are in constant time), which is handy since you have 10-fold more keys to look up than to store.
If you have fixed strings, use grep -F -f. This is significantly faster than regex search.
This Perl code may help you:
use strict;
open my $file1, "<", "file.contain.query.txt" or die $!;
open my $file2, "<", "file.to.search.in.txt" or die $!;
my %KEYS = ();
# Hash %KEYS marks the filtered keys by "file.contain.query.txt" file
while(my $line=<$file1>) {
chomp $line;
$KEYS{$line} = 1;
}
while(my $line=<$file2>) {
if( $line =~ /(\w+)\s+(\d+)/ ) {
print "$1 $2\n" if $KEYS{$1};
}
}
close $file1;
close $file2;
If the files are already sorted:
join file1 file2
if not:
join <(sort file1) <(sort file2)
If you are using Perl version 5.10 or newer, you can join the query terms into a regular expression, with the terms separated by pipes (like ENST001|ENST002|ENST003). Perl builds a 'trie', which, like a hash, does lookups in constant time. It should run as fast as the solution using a lookup hash. Just to show another way to do this:
#!/usr/bin/perl
use strict;
use warnings;
use Inline::Files;
my $query = join "|", map {chomp; $_} <QUERY>;
while (<RAW>) {
print if /^(?:$query)\s/;
}
__QUERY__
ENST001
ENST002
ENST003
__RAW__
ENST001 90
ENST002 80
ENST004 50
MySQL:
Importing the data into MySQL or similar will provide an immense improvement. Would this be feasible? You could see results in a few seconds.
mysql -e 'select search.* from search join contains using (keyword)' > outfile.txt
# but first you need to create the tables like this (only once off)
create table contains (
keyword varchar(255)
, primary key (keyword)
);
create table search (
keyword varchar(255)
,num bigint
,key (keyword)
);
# and load the data in:
load data infile 'file.contain.query.txt'
into table contains fields terminated by "add column separator here";
load data infile 'file.to.search.in.txt'
into table search fields terminated by "add column separator here";
use strict;
use warnings;
system("sort file.contain.query.txt > qsorted.txt");
system("sort file.to.search.in.txt > dsorted.txt");
open (QFILE, "<qsorted.txt") or die();
open (DFILE, "<dsorted.txt") or die();
while (my $qline = <QFILE>) {
my ($queryid) = ($qline =~ /ENST(\d+)/);
while (my $dline = <DFILE>) {
my ($dataid) = ($dline =~ /ENST(\d+)/);
if ($dataid == $queryid) { print $dline; }
elsif ($dataid > $queryid) { last; }
}
}
This may be a little dated, but is tailor-made for simple UNIX utilities. Given:
keys are fixed-length (here 7 chars)
files are sorted (true in the example) allowing the use of fast merge sort
Then:
$ sort -m file.contain.query.txt file.to.search.in.txt | tac | uniq -d -w7
ENST002 80
ENST001 90
Variants:
To strip the number printed after the key, remove the tac command:
$ sort -m file.contain.query.txt file.to.search.in.txt | uniq -d -w7
To keep sorted order, add an extra tac command at the end:
$ sort -m file.contain.query.txt file.to.search.in.txt | tac | uniq -d -w7 | tac

In Perl, how can I watch a directory for changes?

use Text::Diff;
for($count = 0; $count <= 1000; $count++){
my $data_dir="archive/oswiostat/oracleapps.*dat";
my $data_file= `ls -t $data_dir | head -1`;
while($data_file){
print $data_file;
open (DAT,$data_file) || die("Could not open file! $!");
$stats1 = (stat $data_file)[9];
print "Stats: \n";
@raw_data=<DAT>;
close(DAT);
print "Stats1 is :$stats1\n";
sleep(5);
if($stats1 != $stats2){
@diff = diff \@raw_data, $data_file, { STYLE => "Context" };
$stats2 = $stats1;
}
print @diff || die ("Didn't see any updates $!");
}
}
Output:
$ perl client_socket.pl
archive/oswiostat/oracleapps.localdomain_iostat_12.06.28.1500.dat
Stats:
Stats1 is :
Didn't see any updates at client_socket.pl line 18.
Can you tell me why the stats are missing and how to fix it?
The real fix is File::ChangeNotify or File::Monitor or something similar (e.g., on Windows, Win32::ChangeNotify).
use File::ChangeNotify;
my $watcher = File::ChangeNotify->instantiate_watcher(
directories => [ 'archive/oswiostat' ],
filter => qr/\Aoracleapps[.].*dat\z/,
);
while (my @events = $watcher->wait_for_events) {
# ...
}
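What goes in the loop body depends on what you want to do with each change; as a sketch (path() and type() are the accessors on File::ChangeNotify's event objects):
while (my @events = $watcher->wait_for_events) {
    for my $event (@events) {
        # type() is one of "create", "modify" or "delete"
        printf "%s: %s\n", $event->path, $event->type;
    }
}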
Note that I'm answering your original question, why the stat() seemed to fail, rather than the newly edited question title, which asks something different.
This is the fix:
my $data_file= `ls -t $data_dir | head -1`;
chomp($data_file);
The reason this is the fix is a little murky. Without that chomp(), $data_file contains a trailing newline: "some_filename\n". The two-argument form of open() ignores trailing newlines in filenames; I don't know why, but presumably it's because two-arg open mimics shell behavior. Your call to stat(), however, does not ignore the newline in the filename, so it is stat()ing a non-existent file, and thus $stats1 is undef.
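A quick way to see the difference (the file name is made up; the script creates it first):
use strict;
use warnings;

# make a file to test against
my $plain = "demo_file";
open my $out, '>', $plain or die "create: $!";
close $out;

my $name = "$plain\n";   # same name, but with a trailing newline

# two-arg open strips the trailing whitespace, so this succeeds
if (open my $in, $name) { print "two-arg open: ok\n"; close $in; }

# stat() takes the name literally, so this fails
my @st = stat $name;
print @st ? "stat: ok\n" : "stat failed: $!\n";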