Description: I am reading from a list of flat files and generating and loading an Access database. Windows XP, Perl 5.8.8, and no access to additional modules outside the default install.
Issue(s): Performance, performance, performance. It is taking ~20 minutes to load all of the data. I assume there is a better way to load the data than AddNew & Update.
Logic: Without posting a lot of my transformations and additional logic, here is what I am attempting:
Open file x
read row 0 of file x
jet->execute a Create statement from a string derived from step 2
read in rows 1 - n creating a tab delimited string and store into an array
Open a recordset using select * from tablename
for each item in array
    recordset->AddNew
    split the item based on the tab
    for each item in the split
        rs->Fields->Item(pos)->{Value} = item_value
    recordset->Update
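Spelled out, that outline corresponds to something like the following. This is only a minimal sketch of how I read the steps, assuming Win32::OLE driving ADO; the connection string, table definition and row data are placeholders, not the real transformation logic.
use strict;
use warnings;
use Win32::OLE;

# Placeholder connection; the .mdb path and provider string are assumptions.
my $conn = Win32::OLE->new('ADODB.Connection')
    or die "Cannot create ADO connection: " . Win32::OLE->LastError;
$conn->Open('Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\\load\\target.mdb');

# Step 3: CREATE TABLE built from row 0 (placeholder DDL)
$conn->Execute('CREATE TABLE tablename (col1 TEXT, col2 TEXT, col3 TEXT)');

# Rows 1..n, already joined into tab-delimited strings (placeholder data)
my @rows = ("a\tb\tc", "d\te\tf");

my $rs = Win32::OLE->new('ADODB.Recordset');
$rs->Open('SELECT * FROM tablename', $conn, 1, 3);    # adOpenKeyset, adLockOptimistic

for my $row (@rows) {
    $rs->AddNew;
    my @values = split /\t/, $row;
    for my $pos (0 .. $#values) {
        $rs->Fields->Item($pos)->{Value} = $values[$pos];
    }
    $rs->Update;    # one round trip per row is what makes this slow
}
$rs->Close;
$conn->Close;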
One cause of slow loads is committing on every update. Make sure that automatic commits are off and commit every 1000 rows or so. If it is not a gigantic load, don't commit at all until the end. Also, do not create indexes during the load; create them afterwards.
Also, I'm not sure that OLE is the best way to do this. I load Access db's all of the time using DBI and Win32::ODBC. Goes pretty fast.
Per request, here is a sample load program; it did about 100k records per minute on WinXP, Access 2003, ActiveState Perl 5.8.8.
use strict;
use warnings;
use Win32::ODBC;
$| = 1;
my $dsn = "LinkManagerTest";
my $db = new Win32::ODBC($dsn)
    or die "Connect to database $dsn failed: " . Win32::ODBC::Error();
my $rows_added = 0;
my $error_code;
while (<>) {
    chomp;
    print STDERR "." unless $. % 100;
    print STDERR " $.\n" unless $. % 5000;
    my ($source, $source_link, $url, $site_name) = split /\t/;
    my $insert = qq{
        insert into Links (
            URL,
            SiteName,
            Source,
            SourceLink
        )
        values (
            '$url',
            '$site_name',
            '$source',
            '$source_link'
        )
    };
    $error_code = $db->Sql($insert);
    if ($error_code) {
        print "\nSQL update failed on line $. with error code $error_code\n";
        print "SQL statement:\n$insert\n\n";
        print "Error:\n" . $db->Error() . "\n\n";
    }
    else {
        $rows_added++;
    }
    $db->Transact('SQL_COMMIT') unless $. % 1000;
}
$db->Transact('SQL_COMMIT');
$db->Close();
print "\n";
print "Lines Read: $.\n";
print "Rows Added: $rows_added\n";
exit 0;
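For comparison, here is roughly the same loader through DBI. Treat it as a sketch: it assumes DBI and DBD::ODBC are installed (which the original question says may not be the case), it reuses the DSN, table and column names from the sample above, and the index name is made up.
use strict;
use warnings;
use DBI;

# AutoCommit off so we can commit in batches instead of per row
my $dbh = DBI->connect('dbi:ODBC:LinkManagerTest', '', '',
    { RaiseError => 1, AutoCommit => 0 })
    or die "Connect failed: $DBI::errstr";

my $sth = $dbh->prepare(
    'INSERT INTO Links (URL, SiteName, Source, SourceLink) VALUES (?, ?, ?, ?)'
);

my $rows_added = 0;
while (<>) {
    chomp;
    my ($source, $source_link, $url, $site_name) = split /\t/;
    $sth->execute($url, $site_name, $source, $source_link);
    $dbh->commit unless ++$rows_added % 1000;    # commit every 1000 rows
}
$dbh->commit;

# Build indexes after the load, not during it (index name is just an example)
$dbh->do('CREATE INDEX idx_links_url ON Links (URL)');
$dbh->commit;

$dbh->disconnect;
print "Rows added: $rows_added\n";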
I am reading the values of one column from my SQL database. Now I want to convert the column values into a single comma-separated row (e.g. "abc","xyz").
Source data:
amol
aakash
shami
krishna
Output expected: ( "amol","aakash","shami","krishna")
Code I am trying:
#!/usr/bin/env perl
$t = `date`;
#print $t;
$GCMS_SERVER = $ENV{DSQUERY};
$GCMS_USERNAME = $ENV{GCMS_USERNAME};
$GCMS_PASSWORD = $ENV{GCMS_PASSWORD};
$GCMS_DATABASE = $ENV{GCMS_DATABASE};
#print "test\n";
my $query = "SELECT Label FROM FeedGenSource WHERE BaseFileName ='aldgctna'";
#print "SQL =$query\n";
my $sqlcmd = (qq/
set nocount on
go
use $GCMS_DATABASE
go
$query
/);
open(::DBCMD, "sqsh -S$GCMS_SERVER -U$GCMS_USERNAME -h -w100 -P- <<EOF
$GCMS_PASSWORD
$sqlcmd
go
EOF
|") || die("Could not communicate with DB ");
while(<::DBCMD>){
    print "$_";
}
#print "done\n";
close ::DBCMD;
exit
This is the Perl script. I want to use the output in a query as an IN statement.
Your output comes from this section of your code:
while(<::DBCMD>){
    print "$_";
}
The filehandle ::DBCMD is connected to the output from your sqsh command. Your code reads each record from that filehandle and prints it to STDOUT.
If you want to do something cleverer with the output, then you're going to have to store the data in some kind of data structure (probably an array in this case) and manipulate that.
I expect you want something like this:
my @data;
while (<::DBCMD>) {
    chomp; # remove the newline
    push @data, $_;
}

# And then:
print join(',', @data), "\n";
To print the exact output that you ask for, you would need this:
print '(', join(',', map { qq["$_"] } @data), ")\n";
But I have to ask... why are you making your life so difficult by manipulating data that comes back from sqsh? You should really look at Perl's database interface library, DBI. That will make your life far simpler.
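For example, the whole thing could look something like this with DBI. This is a sketch that assumes DBD::Sybase is available (sqsh normally talks to a Sybase/SQL Server), reusing the environment variables from your script.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect(
    "dbi:Sybase:server=$ENV{DSQUERY};database=$ENV{GCMS_DATABASE}",
    $ENV{GCMS_USERNAME}, $ENV{GCMS_PASSWORD},
    { RaiseError => 1 },
);

# selectcol_arrayref returns an arrayref of the first column of every row
my $labels = $dbh->selectcol_arrayref(
    q{SELECT Label FROM FeedGenSource WHERE BaseFileName = 'aldgctna'}
);
$dbh->disconnect;

# Prints: ( "amol","aakash","shami","krishna" )
print '( ', join(',', map { qq["$_"] } @$labels), " )\n";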
A few other tips:
Always have use strict and use warnings in your code. And fix the issues they will reveal.
Use Perl's built-in date and time tools instead of shelling out to date (see the short sketch after these tips).
Using :: on your bareword filehandle achieves nothing. Just DBCMD works the same way and is less confusing.
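For the date tip, a couple of core-Perl lines that replace the backticked date call; the strftime pattern is just an example format.
use strict;
use warnings;
use POSIX qw(strftime);

print scalar localtime, "\n";                           # e.g. Mon Jun  3 10:15:04 2024
print strftime('%Y-%m-%d %H:%M:%S', localtime), "\n";   # or any format you need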
I have a script that uses a custom module EVTConf which is just a wrapper around DBI.
It has the username and password hard coded so we don't have to write the username and password in every script.
I want to see the data that the query picks up - but it does not seem to pick up anything from the query - just a bless statement.
What is bless?
#!/sbcimp/dyn/data/scripts/perl/bin/perl
use EVTConf;
EVTConf::makeDBConnection(production);
$dbh = $EVTConf::dbh;
use Data::Dumper;
my %extend_hash = %{@_[0]};
my $query = "select level_id, e_risk_symbol, e_exch_dest, penny, specialist from etds_extend";
if (!$dbh) {
    print "Error connecting to DataBase; $DBI::errstr\n";
}
my $cur_msg = $dbh->prepare($query) or die "\n\nCould not prepare statement: ".$dbh->errstr;
$cur_msg->execute();
$cur_msg->fetchrow_array;
print Dumper($cur_msg) ;
This is what I get:
Foohost:~/walt $
Foohost:~/walt $ ./Test_extend_download_parse_the_object
$VAR1 = bless( {}, 'DBI::st' );
$cur_msg is a statement handle (hence it is blessed into class DBI::st). You need something like:
my $cur_msg = $dbh->prepare($query) or die "…";
$cur_msg->execute();
my @row;
while (@row = $cur_msg->fetchrow_array)
{
    print "@row\n";
    # print Dumper(\@row);
}
only you need to be a bit more careful about how you actually print the data than I was. There are a number of other fetching methods, such as fetchrow_arrayref, fetchrow_hashref, fetchall_arrayref. All the details are available via perldoc DBI at the command line or the DBI page on CPAN.
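For instance, continuing from the $cur_msg handle above, a fetchrow_hashref version might look like this; the column names come from your query, and the key case depends on the driver (fetchrow_hashref('NAME_lc') forces lowercase if needed).
while (my $row = $cur_msg->fetchrow_hashref) {
    printf "%s %s %s %s %s\n",
        map { defined $_ ? $_ : 'NULL' }    # guard against NULL columns
        @{$row}{qw(level_id e_risk_symbol e_exch_dest penny specialist)};
}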
You can see what the official documentation says about bless by using perldoc -f bless (or reading the bless entry in perlfunc online). It is a way of associating a variable with a class, and the class in this example is DBI::st, the DBI statement handle class. Your $dbh would be in class DBI::db, for example.
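If it helps, here is a tiny self-contained illustration of bless, unrelated to DBI except by analogy; the Greeter class and its field are made up for the example.
use strict;
use warnings;

package Greeter;
sub new   { my ($class, %args) = @_; return bless { %args }, $class }
sub greet { my $self = shift; print "Hello, $self->{name}\n" }

package main;
my $g = Greeter->new(name => 'Walt');
print ref($g), "\n";   # "Greeter", just as ref($cur_msg) would be "DBI::st"
$g->greet;             # method calls work because the hashref is blessed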
What is the best way to print the results?
The best way to print them out depends on what you know about the result set.
You might choose:
printf "%-12s %6.2f\n", $row[0], $row[3];
if you know that there are only two fields you're interested in (though in that case, why not select just those two columns in the query? It costs time, if only a little, to fetch elements 1 and 2 when they're unused).
You might choose:
foreach my $val (@row) { print "$val\n"; }
You might choose:
for (my $i = 0; $i < scalar(@row); $i++)
{
    printf "%-12s = %s\n", $cur_msg->{NAME}[$i], $row[$i];
}
to print out the column name as well as the value. There are many other possibilities too, but those cover the key ones.
As noted by Borodin in his comment, you should be using use strict; and use warnings; automatically and reflexively in your Perl code. There's one variable that is not handled strictly in the code you show, namely $dbh. 'Tis easily remedied; add my before it where it is assigned. But it is a good idea to ensure that you use them all the time. Using them allows you to avoid unexpected behaviours that you weren't aware of and weren't intending to exploit.
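Concretely, the top of the posted script with those fixes applied would look like this; EVTConf is your own wrapper module, so this only runs where it is installed.
use strict;
use warnings;
use EVTConf;
use Data::Dumper;

EVTConf::makeDBConnection('production');   # quote the bareword, or strict subs will complain
my $dbh = $EVTConf::dbh;                   # declared with my, so it is strict-clean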
I am finding unique URLs in a log file along with the response time, which is available in $line[7]. I am using a hash to get the unique URLs.
How can I get the count of Unique URL?
How can I get the average of response time along with the count of Unique URL?
With the code below I am getting
url1
url2
url3
but I want them along with the average response time and the count for each URL:
URL Av.RT Count
url1 10.5 125
url2 9.3 356
url3 7.8 98
Code:
#!/usr/bin/perl
open(IN, "web1.txt") or die "can not open file";
# Hash to store final list of unique IPs
my %uniqueURLs = ();
my $z;
# Read log file line by line
while (<IN>) {
    @line = split(" ", $_);
    $uniqueURLs{$line[9]} = 1;
}

# Go through the hash table and print the keys
# which are the unique IPs
for $url (keys %uniqueURLs) {
    print $url . "\n";
}
Store a listref as each value in your hash:
$uniqueURLs{$line[9]} = [ <avg response time>, <count> ];
Adjust the elements accordingly, e.g. the count:
if (defined($uniqueURLs{$line[9]})) {
    # url known, increment count,
    # update average response time with data from current log entry
    $uniqueURLs{$line[9]}->[0] =
        (($uniqueURLs{$line[9]}->[0] * $uniqueURLs{$line[9]}->[1]) + ($line[7] + 0.0))
        / ($uniqueURLs{$line[9]}->[1] + 1);
    $uniqueURLs{$line[9]}->[1] += 1;
}
else {
    # url not yet known,
    # init count with 1 and average response time with actual response time from log entry
    $uniqueURLs{$line[9]} = [ $line[7] + 0.0, 1 ];
}
to print results:
# Go through the hash table and print the keys
# which are the unique IPs
for $url (keys %uniqueURLs) {
    printf("%s %f %d\n", $url, $uniqueURLs{$url}->[0], $uniqueURLs{$url}->[1]);
}
Adding 0.0 will guarantee type coercion from string to float, as a safeguard.
Read up on References. Also, read up on modern Perl practices which will help improve your programming skills.
Instead of just using the keys of your hash of unique URLs, you could store information in those hashes. Let's start with just a count of the unique URLs:
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw(say);
use constant {
    WEB_FILE => "web1.txt",
};

open my $web_fh, "<", WEB_FILE;    # Autodie will catch this for you

my %unique_urls;

while ( my $line = <$web_fh> ) {
    my $url = (split /\s+/, $line)[9];
    if ( not exists $unique_urls{$url} ) {    # Not really needed
        $unique_urls{$url} = 0;
    }
    $unique_urls{$url} += 1;
}
close $web_fh;
Now each key in your %unique_urls hash is a unique URL, and its value is the number of times that URL was seen; scalar(keys %unique_urls) gives you the count of unique URLs.
This, by the way, is your code written in a bit more modern style. The use strict; and use warnings; pragmas will catch about 90% of standard programming errors. The use autodie; pragma turns failures you forget to check into fatal errors; in this case, the program will automatically die if the file doesn't exist.
The three parameter version of the open command is preferred, and so is using scalar variables for file handles. Using scalar variables for the file handle makes them easier to pass in subroutines, and the file will automatically close if the file handle falls out of scope.
However, we want to store in two items per hash. We want to store the unique count, and we want to store something that will help us find the average response time. This is where references come in.
In Perl, variables deal with single data items. A scalar variable (like $foo) deals with an individual data item. Arrays and hashes (like @foo and %foo) deal with lists of individual data items. References help you get around this limitation.
Let's look at an array of people:
$person[0] = "Bob";
$person[1] = "Ted";
$person[2] = "Carol";
$person[3] = "Alice";
However, people are more than just first names. They have last names, phone numbers, addresses, etc. Let's take a look at a hash for Bob:
my %bob_hash;
$bob_hash{FIRST_NAME} = "Bob";
$bob_hash{LAST_NAME} = "Jones";
$bob_hash{PHONE} = "555-1234";
We can take a reference to this hash by putting a backslash in front of it. A reference is merely the memory address where this hash is stored:
$bob_reference = \%bob_hash;
print "$bob_reference\n": # Prints out something like HASH(0x7fbf79004140)
However, that memory address is a single item, and could be stored in our array of people!
$person[0] = $bob_reference;
If we want to get to the items in our reference, we dereference it by putting the right data type symbol in front. Since this is a hash, we will use %:
%bob_hash = %{ $person[0] };
Perl provides an easy way to dereference hashes with the -> syntax:
$person[0]->{FIRST_NAME} = "Bob";
$person[0]->{LAST_NAME} = "Jones";
$person[0]->{PHONE} = "555-1212";
We'll use the same technique in %unique_urls to store the number of times, and the total amount of response time. (Average will be total time / number of times).
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw(say);
use constant {
    WEB_FILE => "web1.txt",
};

open my $web_fh, "<", WEB_FILE;    # Autodie will catch this for you

my %unique_urls;

while ( my $line = <$web_fh> ) {
    my $url           = (split /\s+/, $line)[9];
    my $response_time = (split /\s+/, $line)[10];    # Taking a guess
    if ( not exists $unique_urls{$url} ) {    # Not really needed
        $unique_urls{$url}->{INSTANCES}       = 0;
        $unique_urls{$url}->{TOTAL_RESP_TIME} = 0;
    }
    $unique_urls{$url}->{INSTANCES}       += 1;
    $unique_urls{$url}->{TOTAL_RESP_TIME} += $response_time;
}
close $web_fh;
Now we can print them out:
print "%20.20s %6s %8s\n", "URL", "INST", "AVE";
for my $url ( sort keys %unique_urls ) {
my $total_resp_time = $unique_urls{$url}->{TOTAL_RESP_TIME};
my $instances = $unique_urls{$url}->{INSTANCES};
my $average = $total_resp_time / $instances
printf "%-20.20s %-6d %-8.5f\n", $url, $instances, $average";
}
I like using printf for tables.
Instead of setting the value to 1 here:
$uniqueURLs{$line[9]}=1;
Store a data structure indicating the response time and the number of times this URL has been seen (so you can properly calculate the average). You can use an array ref, or hashref if you want. If the key doesn't exist yet, that means it hasn't been seen yet, and you can set some initial values.
# Initialize 3-element arrayref: [count, total, average]
$uniqueURLs{$line[9]} = [0, 0, 0] if not exists $uniqueURLs{$line[9]};
$uniqueURLs{$line[9]}->[0]++; # Count
$uniqueURLs{$line[9]}->[1] += $line[7]; # Total time
# Calculate average
$uniqueURLs{$line[9]}->[2] = $uniqueURLs{$line[9]}->[1] / $uniqueURLs{$line[9]}->[0];
One way you can get the count of unique URLs is by counting the keys:
print scalar(keys %uniqueURLs); # Print number of unique URLs
In your loop, you can print out the url and average time like this:
for $url (keys %uniqueURLs) {
    print $url, ' - ', $uniqueURLs{$url}->[2], " seconds\n";
}
I've got a multi-GB mail server log file and a list of ~350k message IDs.
I want to pull out of the big log file the rows whose IDs appear in the long list... and I want it to be faster than it is now...
Currently I do it in perl:
#!/usr/bin/perl
use warnings;
#opening file with the list - over 350k unique ID
open ID, maillog_id;
@lista_id = <ID>;
close ID;
chomp @lista_id;

open LOG, maillog;
# while - foreach would cause out of memory
while ( <LOG> ) {
    $wiersz = $_;
    my @wiersz_split = split ( ' ' , $wiersz );

    foreach ( @lista_id ) {
        $id = $_;
        # ID in maillog is 6th column
        if ( $wiersz_split[5] eq $id) {
            # print whole row when matched - can be STDOUT or file or anything
            print "@wiersz_split\n";
        }
    }
}
close LOG;
It works but it is slow... Every line from the log is compared against the whole list of IDs.
Should I use a database and perform a kind of join? Or compare substrings?
There are a lot of tools for log analysis, e.g. pflogsumm... but they just summarize. For example, I could use
grep -c "status=sent" maillog
It would be fast but useless, and I would use it AFTER filtering my log file... the same goes for pflogsumm etc. - they just increment counters.
Any suggestions?
-------------------- UPDATE -------------------
Thank you, Dallaylaen.
I succeeded with this (instead of the internal foreach over @lista_id):
if ( exists $lista_id_hash{$wiersz_split[5]} ) { print "$wiersz"; }
where %lista_id_hash is a hash whose keys are the items taken from my ID list. It works super fast.
Processing a 4.6 GB log file against >350k IDs takes less than a minute to filter out the interesting lines.
Use a hash.
my %known;
$known{$_} = 1 for @lista_id;

# ...

while (<>) {
    # ... determine id
    if ($known{$id}) {
        # process line
    };
};
P.S. If your log is THAT big, you're probably better off with splitting it according to e.g. last two letters of $id into 256 (or 36**2?) smaller files. Something like a poor man's MapReduce. The number of IDs to store in memory at a time will also be reduced (i.e. when you're processing maillog.split.cf, you should only keep IDs ending in "cf" in hash).
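If you do go that route, a rough sketch of the splitting pass might look like this; the file names and the assumption that the ID sits in column 6 are just carried over from your script.
use strict;
use warnings;

# Bucket log lines into maillog.split.XX files keyed on the last two
# characters of the ID, so each later pass only needs a small slice of
# the ID list in memory. Note this keeps up to 36**2 handles open at once.
my %bucket_fh;

open my $log, '<', 'maillog' or die "maillog: $!";
while (my $line = <$log>) {
    my $id = (split ' ', $line)[5];
    next unless defined $id && length $id >= 2;

    my $suffix = lc substr $id, -2;
    unless ($bucket_fh{$suffix}) {
        open $bucket_fh{$suffix}, '>>', "maillog.split.$suffix"
            or die "maillog.split.$suffix: $!";
    }
    print { $bucket_fh{$suffix} } $line;
}
close $log;
close $_ for values %bucket_fh;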
I am having a problem printing out the correct number of records for a given file. My test script simply makes a Perl DBI connection to a MySQL database and, given a list of tables, extracts (1) record per table.
For every table I list, I also want to print that (1) record out to its own file. For example, if I have a list of 100 tables, I should expect 100 unique files with (1) record each.
So far, I am able to generate the 100 files, but there is more than (1) record in each; there are up to 280 records in a file. Ironically, I am generating a unique key for each record and the keys are unique.
If I print out $data to a single file (outside the foreach loop), I get the expected results, but in one single file. So one file with 100 records, for example, but I want to create a file for each table.
I seem to have a problem opening a file handle and outputting this correctly, or something else is wrong with my code.
Can someone show me how to set this up properly? Show me some best practices for achieving this?
Thank you.
Here is my test code:
# Get list of tables
my @tblist = qx(mysql -u foo-bar -ppassw0rd --database $dbsrc -h $node --port 3306 -ss -e "show tables");

# Create data output
my $data = '';

foreach my $tblist (@tblist)
{
    chomp $tblist;

    # Testing to create file
    my $out_file = "/home/$node-$tblist.$dt.dat";
    open (my $out_fh, '>', $out_file) or die "cannot create $out_file: $!";

    my $dbh = DBI->connect("DBI:mysql:database=$dbsrc;host=$node;port=3306", 'foo-bar', 'passw0rd');
    my $sth = $dbh->prepare("SELECT UUID(), '$node', ab, cd, ef, gh, hi FROM $tblist limit 1");
    $sth->execute();

    while (my ($id, $nd, $ab, $cd, $ef, $gh, $hi) = $sth->fetchrow_array() ) {
        $data = $data . "__pk__^A$id^E1^A$nd^E2^A$ab^E3^A$cd^E4^A$ef^E5^A$gh^E6^A$hi^E7^D";
    }
    $sth->finish;
    $dbh->disconnect;

    # Testing to create file
    print $out_fh $data;
    close $out_fh or die "Failed to close file: $!";
};

#print $data; # Here, if I uncomment this and output to a single file, I can see the correct number of records, but it's in (1) file
You need to clear $data on each $tblist loop iteration (outer loop).
In this line: $data = $data . "__pk__^A$id^E1^A$... you are appending each new table's data on top of the old data, and it is carried over between tables because the $data variable is scoped OUTSIDE the outer loop and its value never gets reset inside it.
The simplest solution is to declare $data inside the outer ($tblist) loop:
foreach my $tblist (@tblist) {
    my $data = '';
You could keep declaring it before the outer loop and simply assign it an empty value at the start of each iteration, but there's no point: there is usually no legitimate reason to know the value of $data after a loop like this finishes, so there's no need for it to have a scope bigger than the loop block.