I've got a multi-GB mailserver log file and a list of ~350k message IDs.
I want to pull the rows whose IDs appear on that long list out of the big log file... and I want it faster than it is now...
Currently I do it in Perl:
#!/usr/bin/perl
use warnings;

#opening file with the list - over 350k unique ID
open ID, "maillog_id";
@lista_id = <ID>;
close ID;
chomp @lista_id;

open LOG, "maillog";
# while - foreach would cause out of memory
while ( <LOG> ) {
    $wiersz = $_;
    my @wiersz_split = split ( ' ', $wiersz );

    foreach ( @lista_id ) {
        $id = $_;
        # ID in maillog is 6th column
        if ( $wiersz_split[5] eq $id ) {
            # print whole row when matched - can be STDOUT or file or anything
            print "@wiersz_split\n";
        }
    }
}
close LOG;
It works, but it is slow... every line from the log is compared against the whole list of IDs.
Should I use a database and perform a kind of join? Or compare substrings?
There are a lot of tools for log analysis - e.g. pflogsumm... but they just summarize. E.g. I could use
grep -c "status=sent" maillog
It would be fast but useless, and I would only use it AFTER filtering my log file... the same goes for pflogsumm etc. - they just increment counters.
Any suggestions?
-------------------- UPDATE -------------------
thank you Dallaylaen,
I succeeded with this (instead of the internal foreach over @lista_id):
if ( exists $lista_id_hash{$wiersz_split[5]} ) { print "$wiersz"; }
where %lista_id_hash is a hash whose keys are the items taken from my ID list. It works superfast.
Processing a 4.6 GB log file with >350k IDs takes less than 1 minute to filter out the interesting logs.
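For completeness, a minimal sketch of the full filter using that hash lookup (the file-reading details here are assumptions; only the lookup line above is taken from the actual script):

#!/usr/bin/perl
use strict;
use warnings;

# Build a lookup hash from the ID list once; each log line then costs a
# single hash lookup instead of a scan over ~350k IDs.
open my $id_fh, '<', 'maillog_id' or die "maillog_id: $!";
chomp( my @lista_id = <$id_fh> );
close $id_fh;

my %lista_id_hash = map { $_ => 1 } @lista_id;

open my $log_fh, '<', 'maillog' or die "maillog: $!";
while ( my $wiersz = <$log_fh> ) {
    my @wiersz_split = split ' ', $wiersz;
    print $wiersz if exists $lista_id_hash{ $wiersz_split[5] };
}
close $log_fh;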
Use a hash.
my %known;
$known{$_} = 1 for @lista_id;
# ...
while (<>) {
    # ... determine id
    if ($known{$id}) {
        # process line
    }
}
P.S. If your log is THAT big, you're probably better off splitting it according to e.g. the last two letters of $id into 256 (or 36**2?) smaller files - something like a poor man's MapReduce. The number of IDs to store in memory at a time is also reduced (i.e. when you're processing maillog.split.cf, you only need to keep the IDs ending in "cf" in the hash).
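A rough, hedged sketch of such a pre-splitting pass (the bucket file naming follows the maillog.split.cf example above; the suffix sanitising is my own assumption):

#!/usr/bin/perl
use strict;
use warnings;

# Pre-split pass: bucket log lines by the last two characters of the ID
# (6th column), so each bucket can later be filtered with a much smaller hash.
my %bucket_fh;

open my $log, '<', 'maillog' or die "maillog: $!";
while ( my $line = <$log> ) {
    my @col = split ' ', $line;
    next unless defined $col[5];

    my $suffix = lc( substr( $col[5], -2 ) );   # e.g. "cf"
    $suffix =~ s/[^a-z0-9]/_/g;                 # keep the bucket filename safe

    unless ( $bucket_fh{$suffix} ) {
        open $bucket_fh{$suffix}, '>>', "maillog.split.$suffix"
            or die "maillog.split.$suffix: $!";
    }
    print { $bucket_fh{$suffix} } $line;
}
close $log;
close $_ for values %bucket_fh;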
Related
Basically, I have a script to create a hash for COGs with corresponding gene IDs:
use Tie::IxHash;                   # needed for the tie below
use Storable qw(store retrieve);   # needed for store/retrieve below

# Open directory and get all the files in it
opendir(DIR, "/my/path/to/COG/");
my @infiles = grep(/OG-.*\.fasta/, readdir(DIR));
closedir(DIR);

# Create hash for COGs and their corresponding gene IDs
tie my %ids_for, 'Tie::IxHash';

if (! -e '/my/path/to/COG/COG_hash.ref') {
    for my $infile (@infiles) {
        ## $infile
        %ids_for = (%ids_for, read_COG_fasta($infile));
    }
    ## %ids_for
    store \%ids_for, '/my/path/to/COG/COG_hash.ref';
}

my $id_ref = retrieve('/my/path/to/COG/COG_hash.ref');
%ids_for = %$id_ref;
## %ids_for
The problem isn't that it doesn't work (at least I think it works), but that it is extremely slow for some reason. When I tried a test run, it would have taken weeks to get an actual result. Somehow the hash creation is really, really slow, and I'm sure there is some way to optimize it so it runs much faster.
Ideally, the paths should be input to the script, so there would be no need to change the script whenever the path changes.
It would also be great if there were a way to see the progress of the hash creation, e.g. showing that it is 25% done, 50% done, 75% done and finally 100% done. Regarding this last point, I have seen things like Term::ProgressBar, but I am not sure whether it would be appropriate in this case.
Do you really need Tie::IxHash?
That aside, I suspect your culprit is this set of lines:
for my $infile (@infiles) {
    ## $infile
    %ids_for = (%ids_for, read_COG_fasta($infile));
}
To add keys to the hash, you are flattening all of the current key-value pairs into a list, appending the new pairs, and then assigning the whole thing back to the hash. Every file therefore copies the entire hash again, so the total work grows roughly quadratically with the number of keys.
What happens if you take the results from read_COG_fasta and add the keys one at a time?
for my $infile (@infiles) {
    my %new_hash = read_COG_fasta($infile);
    foreach my $key ( keys %new_hash ) {
        $ids_for{$key} = $new_hash{$key};
    }
}
As for progress, I usually have something like this when I'm trying to figure out something:
use v5.26;

my $file_count = @files;
foreach my $n ( 0 .. $#files ) {
    say "[$n/$file_count] Processing $files[$n]";
    my %result = ...;
    printf "\tGot %d results\n", scalar %result; # v5.26 feature!
}
You could do the same sort of thing with the keys that you get back so you can track the size.
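For instance, a hedged sketch of the whole merge loop with that kind of reporting (read_COG_fasta is the OP's own routine and is assumed to return a flat key/value list; the glob pattern is a guess):

use strict;
use warnings;
use v5.26;

my @infiles    = glob 'OG-*.fasta';   # adjust the pattern/path as needed
my $file_count = @infiles;
my %ids_for;

foreach my $n ( 0 .. $#infiles ) {
    say "[", $n + 1, "/$file_count] Processing $infiles[$n]";

    my %new = read_COG_fasta( $infiles[$n] );   # the OP's own routine
    @ids_for{ keys %new } = values %new;        # bulk insert via a hash slice

    say "\tGot ", scalar keys %new, " IDs from this file, ",
        scalar keys %ids_for, " total so far";
}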
First time poster and new to Perl, so I'm a little stuck. I'm iterating through collections of file listings whose columns are separated by variable amounts of whitespace, for example:
0 19933 12/18/2013 18:00:12 filename1.someextention
1 11912 12/17/2013 18:00:12 filename2.someextention
2 19236 12/16/2013 18:00:12 filename3.someextention
These are generated by multiple servers so I am iterating through multiple collections. That mechanism is simple enough.
I'm focused solely on the date column and need to ensure the date is changing from line to line, as in the example above, since that tells me the file is being created on a daily basis and only once per day. If a file is created more than once per day, I need to do something like send myself an email and move on to the next server's collection. If the date changes from the first file to the second, exit the loop as well.
My issue is that I don't know how to keep the date element of the first file stored so that I can compare it to the next file's date as I go through the loop. I thought about storing the element in an array inside the loop until the current collection is finished and then moving on to the next collection, but I don't know the correct way of doing so. Any help would be greatly appreciated. Also, if there is a more elegant way, please enlighten me; I am willing to learn and not just looking for someone to write my script for me.
@file = `command -h server -secFilePath $secFilePath analyzer -archive -list`;
@array = reverse(@file); # The output from the above command lists the oldest file first

foreach $item (@array) {
    @first    = split(/ +/, $item);
    $firstvar = $first[2];

    # if there is a way to save the first date in $firstvar and keep it until the date changes

    if ($firstvar == $first[2]) {
        # This part isn't designed correctly, I know.
    }
    elsif ($firstvar ne $first[2]) {
        last;
    }
}
One common technique is to use a hash, which is a data structure mapping key-value pairs. If you key by date, you can check if a given date has been encountered before.
If a date hasn't been encountered, it has no key in the hash.
If a date has been encountered, we insert 1 under that key to mark it.
my %dates;
foreach my $line (@array) {
    my ($idx, $id, $date, $time, $filename) = split(/\s+/, $line);
    if ($dates{$date}) {
        # handle duplicate
    } else {
        $dates{$date} = 1;
        # ...
        # code here will be executed only if the entry's date is unique
    }
    # ...
    # code here will be executed for each entry
}
Note that this will check each date against each other date. If for some reason you only want to check if two adjacent dates match, you could just cache the last $date and check against that.
In comments, the OP mentioned they might rather perform the second check I described. It's similar, and might look like this:
# we declare the variable OUTSIDE of the loop
# if it needs to be, so that it stays in scope between runs
my $last_date;
foreach my $line (@array) {
    my ($idx, $id, $date, $time, $filename) = split(/\s+/, $line);
    if (defined $last_date && $date eq $last_date) { # we use 'eq' for string comparison
        # handle duplicate
    } else {
        $last_date = $date;
        # ...
        # code here will be executed only if the entry's date is unique
    }
    # ...
    # code here will be executed for each entry
}
I am trying to figure out a way to do this; I know it should be possible. A little background first.
I want to automate the process of creating the NCBI Sequin block for submitting DNA sequences to GenBank. I always end up creating a table that lists the species name, the specimen ID value, the type of sequence, and finally the location of the collection. It is easy enough for me to export this into a tab-delimited file. Right now I do something like this:
while ($csv) {
    foreach ($_) {
        if ($_ !~ m/table|species|accession/i) {
            @csv = split('\t', $csv);
            print NEWFILE ">[species=$csv[0]] [molecule=DNA] [moltype=genomic] [country=$csv[2]] [spec-id=$csv[1]]\n";
        }
        else {
            next;
        }
    }
}
I know that is messy; I just typed up something similar to what I have from memory (I don't have the script on any of my computers at home, only at work).
That works fine for me right now because I know which columns the information I need (species, location, and ID number) is in.
But is there a way (there must be) to find the columns for the needed info dynamically? That is, no matter the order of the columns, the correct info from the correct column goes to the right place?
The first row will usually read as "Table X" (where X is the number of the table in the publication), and the next row will usually have the column headings of interest, which are nearly universal in wording. Nearly all tables will have standard headings to search for, and I can just use | in my pattern matching.
First off, I would be remiss if I didn’t recommend the excellent Text::CSV_XS module; it does a much more reliable job of reading CSV files, and can even handle the column-mapping scheme that Barmar referred to above.
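For reference, here is a hedged sketch of that column-mapping style with Text::CSV_XS (the filename is hypothetical, and it assumes the heading row comes immediately after a single "Table X" line):

use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1, sep_char => "\t", auto_diag => 1 });

open my $fh, '<', 'table1.txt' or die $!;   # hypothetical filename
my $first  = $csv->getline($fh);            # the "Table X" row - skip it
my $header = $csv->getline($fh);            # the column headings
$csv->column_names( map { lc } @$header );  # key each row by lower-cased heading

while ( my $row = $csv->getline_hr($fh) ) {
    print ">[species=$row->{species}] [molecule=DNA] [moltype=genomic] ",
          "[country=$row->{country}] [spec-id=$row->{accession}]\n";
}
close $fh;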
That said, Barmar has the right approach, though it ignores the "Table X" row being a separate row entirely. I recommend taking an explicit approach, perhaps something like this (and this is going to have a bit more detail just to make things clear; I would probably write it more tightly in production code):
# Assumes the file has been opened and that the filehandle is stored in $csv_fh.
# Get header information first.
my $hdr_data = {};
while( <$csv_fh> ) {
    chomp;   # strip the newline so the last column name matches cleanly

    if( ! $hdr_data->{'table'} && /Table (\d+)/ ) {
        $hdr_data->{'table'} = $1;
        next;
    }
    if( ! $hdr_data->{'species'} && /species/ ) {
        my $n = 0;
        # Takes the column headers as they come, creating
        # a map between the column name and column number.
        # Assumes that column names are case-insensitively
        # unique.
        my %columns = map { lc($_) => $n++ } split( /\t/ );
        # Now pick out exactly the columns we want.
        foreach my $thingy ( qw{ species accession country } ) {
            $hdr_data->{$thingy} = $columns{$thingy};
        }
        last;
    }
}
# Now process the rest of the lines.
while( <$csv_fh> ) {
    chomp;
    my @col = split( /\t/ );
    printf NEWFILE ">[species=%s] [molecule=DNA] [moltype=genomic] [country=%s] [spec-id=%s]\n",
        $col[$hdr_data->{'species'}],
        $col[$hdr_data->{'country'}],
        $col[$hdr_data->{'accession'}];
}
Some variation of that will get you close to what you need.
Create a hash that maps column headings to column numbers:
my %columns;
...
if (/table|species|accession/i) {
    my @headings = split /\t/;
    my $i = 0;
    foreach my $heading (@headings) {
        $columns{"\L$heading"} = $i++;
    }
}
Then you can use $csv[$columns{'species'}].
Description: I am reading from a list of flat files and generating and loading an Access database. Windows XP, Perl 5.8.8, and no access to additional modules outside the default install.
Issue(s): Performance, performance, performance. It is taking ~20 minutes to load all of the data. I am assuming there might be a better way to load the data than AddNew & Update.
Logic: Without posting a lot of my transformations and additional logic, here is what I am attempting:
1. Open file x
2. Read row 0 of file x
3. jet->execute a Create statement built from the string derived from step 2
4. Read rows 1 - n, creating a tab-delimited string from each and storing them in an array
5. Open a recordset using select * from tablename
6. For each item in the array:
   recordset->AddNew
   split the item on the tab
   for each item in the split:
       rs->Fields->Item(pos)->{Value} = item_value
   recordset->Update
One issue with slow loads is doing a commit on every update. Make sure automatic commits are off and commit every 1000 rows or whatever. If it is not a gigantic load, don't do them at all. Also, do not create indexes during the load; create them afterwards.
Also, I'm not sure that OLE is the best way to do this. I load Access databases all the time using DBI and Win32::ODBC. It goes pretty fast.
Per request, here is a sample load program; it did about 100k records per minute on WinXP, Access 2003, ActiveState Perl 5.8.8.
use strict;
use warnings;
use Win32::ODBC;
$| = 1;
my $dsn = "LinkManagerTest";
my $db = new Win32::ODBC($dsn)
or die "Connect to database $dsn failed: " . Win32::ODBC::Error();
my $rows_added = 0;
my $error_code;
while (<>) {
    chomp;

    print STDERR "."     unless $. % 100;
    print STDERR " $.\n" unless $. % 5000;

    my ($source, $source_link, $url, $site_name) = split /\t/;

    my $insert = qq{
        insert into Links (
            URL,
            SiteName,
            Source,
            SourceLink
        )
        values (
            '$url',
            '$site_name',
            '$source',
            '$source_link'
        )
    };

    $error_code = $db->Sql($insert);
    if ($error_code) {
        print "\nSQL update failed on line $. with error code $error_code\n";
        print "SQL statement:\n$insert\n\n";
        print "Error:\n" . $db->Error() . "\n\n";
    }
    else {
        $rows_added++;
    }

    $db->Transact('SQL_COMMIT') unless $. % 1000;
}
$db->Transact('SQL_COMMIT');
$db->Close();
print "\n";
print "Lines Read: $.\n";
print "Rows Added: $rows_added\n";
exit 0;
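For comparison, a hedged sketch of the DBI route mentioned above (it assumes DBD::ODBC, the same DSN, and the same Links table; prepared statements with placeholders also avoid the quoting problems of interpolated SQL):

use strict;
use warnings;
use DBI;

# Connect through ODBC with automatic commits off, so we can batch commits.
my $dbh = DBI->connect("dbi:ODBC:LinkManagerTest", '', '',
    { RaiseError => 1, AutoCommit => 0 });

# Prepare once, execute per row; placeholders handle quoting for us.
my $sth = $dbh->prepare(
    'insert into Links (URL, SiteName, Source, SourceLink) values (?, ?, ?, ?)'
);

my $rows_added = 0;
while (<>) {
    chomp;
    my ($source, $source_link, $url, $site_name) = split /\t/;
    $sth->execute($url, $site_name, $source, $source_link);
    $rows_added++;
    $dbh->commit unless $. % 1000;   # commit in batches, as suggested above
}
$dbh->commit;
$dbh->disconnect;

print "Rows Added: $rows_added\n";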
I am having a problem printing out the correct number of records for a given file. My test script simply makes a Perl DBI connection to a MySQL database and, given a list of tables, extracts one record per table.
For every table I list, I also want to print that one record out to its own file. For example, if I have a list of 100 tables, I should end up with 100 unique files containing one record each.
So far, I am able to generate the 100 files, but they contain more than one record - up to 280 records in a file. Ironically, I am generating a key for each record, and those keys are all unique.
If I print $data to a single file (outside the foreach loop), I get the expected number of records, but all in one single file - so one file with 100 records, for example, whereas I want one file per table.
I seem to have a problem opening a file handle and outputting this correctly, or something else is wrong with my code.
Can someone show me how to set this up properly, and show me some best practices for achieving this?
Thank you.
Here is my test code:
use DBI;

# Get list of tables
my @tblist = qx(mysql -u foo-bar -ppassw0rd --database $dbsrc -h $node --port 3306 -ss -e "show tables");

# Create data output
my $data = '';

foreach my $tblist (@tblist)
{
    chomp $tblist;

    # Testing to create file
    my $out_file = "/home/$node-$tblist.$dt.dat";
    open (my $out_fh, '>', $out_file) or die "cannot create $out_file: $!";

    my $dbh = DBI->connect("DBI:mysql:database=$dbsrc;host=$node;port=3306",'foo-bar','passw0rd');
    my $sth = $dbh->prepare("SELECT UUID(), '$node', ab, cd, ef, gh, hi FROM $tblist limit 1");
    $sth->execute();

    while (my($id, $nd,$ab,$cd,$ef,$gh,$hi) = $sth->fetchrow_array() ) {
        $data = $data . "__pk__^A$id^E1^A$nd^E2^A$ab^E3^A$cd^E4^A$ef^E5^A$gh^E6^A$hi^E7^D";
    }

    $sth->finish;
    $dbh->disconnect;

    # Testing to create file
    print $out_fh $data;
    close $out_fh or die "Failed to close file: $!";
}

# print $data; # If I uncomment this and output to a single file, I see the correct number of records, but all in one file
You need to clear $data on each iteration of the $tblist (outer) loop.
In this line: $data = $data . "__pk__^A$id^E1^A$... you are appending each table's data on TOP of the old data, and it is preserved between tables because the $data variable is scoped OUTSIDE the outer loop and its value is never reset inside it.
The simplest solution is to declare $data inside the outer ($tblist) loop:
foreach my $tblist (@tblist) {
    my $data = '';
You could keep declaring it before the outer loop and simply assign it "" at the start of each iteration, but there's no point: there is usually no legitimate reason to know the value of $data after a loop like this finishes, so there's no need for its scope to be larger than the loop block.
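Putting it together, the relevant part of the loop would then look roughly like this (only the $data scoping changes; the elided middle is the DBI code from the question):

foreach my $tblist (@tblist) {
    chomp $tblist;
    my $data = '';   # fresh buffer for every table

    my $out_file = "/home/$node-$tblist.$dt.dat";
    open my $out_fh, '>', $out_file or die "cannot create $out_file: $!";

    # ... run the SELECT exactly as before, appending each row to $data ...

    print $out_fh $data;
    close $out_fh or die "Failed to close file: $!";
}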