How can I match records in two files using Perl? - perl

I have two files, CUSTOMER_ACCOUNT_LOG.TXT and CUSTOMER_ID_LOG.TXT.
Each is a log: the first records a timestamp and an account ID, and the second records a timestamp and a customer ID.
I simply want to pick out the account ID and customer ID that share the same timestamp.
For example, for the timestamp 123456793 the matching records are ABC0103 and CUSTOMER_ID_0103.
I want to collect these matched records and write them to another file.
CUSTOMER_ACCOUNT_LOG.TXT
TIMESTAMP| N1| N2 |ACCOUNT ID
-----------------------------------
123456789,111,1000,ABC0101
123456791,112,1001,ABC0102
123456793,113,1002,ABC0103
123456795,114,1003,ABC0104
123456797,115,1004,ABC0105
123456799,116,1005,ABC0106
123456801,117,1006,ABC0107
123456803,118,1007,ABC0108
123456805,119,1008,ABC0109
123456807,120,1009,ABC0110
123456809,121,1010,ABC0111
123456811,122,1011,ABC0112
123456813,123,1012,ABC0113
123456815,124,1013,ABC0114
123456817,125,1014,ABC0115
123456819,126,1015,ABC0116
123456821,127,1016,ABC0117
123456823,128,1017,ABC0118
123456825,129,1018,ABC0119
123456827,130,1019,ABC0120
123456829,131,1020,ABC0121
CUSTOMER_ID_LOG.TXT
TIMESTAMP| N1| N2 | CUSTOMER ID
-----------------------------------
123456789,111,1000,CUSTOMER_ID_0101
123456791,112,1001,CUSTOMER_ID_0102
123456793,113,1002,CUSTOMER_ID_0103
123456795,114,1003,CUSTOMER_ID_0104
123456797,115,1004,CUSTOMER_ID_0105
123456799,116,1005,CUSTOMER_ID_0106
123456801,117,1006,CUSTOMER_ID_0107
123456803,118,1007,CUSTOMER_ID_0108
123456805,119,1008,CUSTOMER_ID_0109
123456807,120,1009,CUSTOMER_ID_0110
123456809,121,1010,CUSTOMER_ID_0111
123456811,122,1011,CUSTOMER_ID_0112
123456813,123,1012,CUSTOMER_ID_0113
123456815,124,1013,CUSTOMER_ID_0114
123456817,125,1014,CUSTOMER_ID_0115
123456819,126,1015,CUSTOMER_ID_0116
123456821,127,1016,CUSTOMER_ID_0117
123456823,128,1017,CUSTOMER_ID_0118
123456825,129,1018,CUSTOMER_ID_0119
123456827,130,1019,CUSTOMER_ID_0120
123456829,131,1020,CUSTOMER_ID_0121
I am a PHP programmer and new to Perl.
So far I read the file and build an array that holds the timestamp and the rest of the required details. What should I do now? My guess is that I should read each file into an array whose key is the account ID and whose value is the timestamp (or vice versa, I'm not sure), do the same for the other file, and then compare the timestamps, picking the account ID and customer ID wherever they match. I have filled the array, but I don't know how to proceed: I assume I need a foreach loop to match the timestamps from both files, and that is where I'm stuck.

Here are the steps I'd take:
0) Some rudimentary Perl boilerplate (this is step 0 because you should always always always do it, and some people will add other stuff to this boilerplate, but this is the bare minimum):
use strict;
use warnings;
use 5.010;
1) Read the first file into a hash whose keys are the timestamp:
my %account;
open( my $fh1, '<', $file1 ) or die "$!";
while( my $line = <$fh1> ) {
    chomp $line;    # drop the newline so it doesn't end up in the account ID
    my @values = split ',', $line;
    $account{$values[0]} = $values[3];
}
close $fh1;
2) Read the second file, and each time you read a line, pull out the timestamp, then print out the timestamp, the account ID, and the customer ID to a new file.
open( my $out_fh, '>', $outfile ) or die "$!";
open( my $fh2, '<', $file2 ) or die "$!";
while( my $line = <$fh2> ) {
    chomp $line;
    my @values = split ',', $line;
    next unless exists $account{$values[0]};    # only keep matched timestamps
    say $out_fh join ',', $values[0], $account{$values[0]}, $values[3];
}
close $out_fh;
close $fh2;
You don't want to read the whole file into an array because that's a waste of memory. Only store the information that you need, and take advantage of Perl's datatypes to help you store that information.
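The two steps above can be put together into one small script. This is only a sketch: the helper name match_records is made up for illustration, it takes the lines as array references instead of opening the files, and it assumes every data row has exactly four comma-separated fields (so the header and dashed lines are skipped).

```perl
use strict;
use warnings;

# Hypothetical helper: join two sets of CSV lines on their first field.
sub match_records {
    my ($account_lines, $customer_lines) = @_;

    # Index the first file by timestamp.
    my %account;
    for my $line (@$account_lines) {
        chomp( my $copy = $line );
        my @values = split /,/, $copy;
        next unless @values == 4;    # skip headers and malformed rows
        $account{ $values[0] } = $values[3];
    }

    # Walk the second file and keep only timestamps seen in the first.
    my @matched;
    for my $line (@$customer_lines) {
        chomp( my $copy = $line );
        my @values = split /,/, $copy;
        next unless @values == 4 && exists $account{ $values[0] };
        push @matched, join ',', $values[0], $account{ $values[0] }, $values[3];
    }
    return @matched;
}

my @accounts  = ("123456791,112,1001,ABC0102\n", "123456793,113,1002,ABC0103\n");
my @customers = ("123456793,113,1002,CUSTOMER_ID_0103\n", "123456999,1,2,CUSTOMER_ID_9999\n");
print "$_\n" for match_records(\@accounts, \@customers);
```

With the sample rows above, only the timestamp 123456793 appears in both lists, so a single joined line is printed.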

Related

Join multiple files into one using a key and rearrange the columns using perl.

What approach should I take if I am trying to read multiple large files and join them using a key? There is a possibility of one-to-many combinations, so reading one line at a time works for my simple scenario. Looking for some guidance. Thanks!
use strict;
use warnings;
open my $head, $ARGV[0] or die "Can't open $ARGV[0] for reading: $!";
open my $addr, $ARGV[1] or die "Can't open $ARGV[1] for reading: $!";
open my $phone, $ARGV[2] or die "Can't open $ARGV[2] for reading: $!";
#open my $final, $ARGV[3] or die "Can't open $ARGV[3] for reading: $!";
while( my $line1 = <$head> and my $line2 = <$addr> and my $line3 = <$phone>)
{
    # split lines into fields
    my @headValues = split('\|', $line1);
    my @addrValues = split('\|', $line2);
    my @phoneValues = split('\|', $line3);
    # if the key matches, join them
    if($headValues[0]==$addrValues[0] and $headValues[0]==$phoneValues[0])
    {
        print "$headValues[0]|$headValues[1]|$headValues[2]|$addrValues[1]|$addrValues[2]|$phoneValues[1]";
    }
}
close $head;
close $addr;
close $phone;
I am not sure if it's exactly what you're looking for but did you try the UNIX command join?
Consider these two files:
x.tsv
001 X1
002 X2
004 X4
y.tsv
002 Y2
003 Y3
004 Y4
the command join x.tsv y.tsv produces:
002 X2 Y2
004 X4 Y4
That is, it merges lines with the same ID and discards the others (to keep things simple).
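For comparison, here is a rough Perl sketch of the same inner join. The helper name inner_join is hypothetical, and unlike the real join command (which expects both inputs sorted on the join field) it builds an in-memory index of the second file, so it assumes that file fits in memory. Because each key maps to a list of values, it also copes with the one-to-many case from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: inner join of whitespace-separated lines on field 1.
sub inner_join {
    my ($left, $right) = @_;

    # Group the right-hand values by key; a key may occur many times.
    my %right_by_key;
    for my $line (@$right) {
        my ($key, $value) = split ' ', $line, 2;
        push @{ $right_by_key{$key} }, $value;
    }

    my @joined;
    for my $line (@$left) {
        my ($key, $value) = split ' ', $line, 2;
        # Keys missing on the right produce no output, as with join.
        push @joined, "$key $value $_" for @{ $right_by_key{$key} || [] };
    }
    return @joined;
}

my @x = ('001 X1', '002 X2', '004 X4');
my @y = ('002 Y2', '003 Y3', '004 Y4', '004 Y4b');
print "$_\n" for inner_join(\@x, \@y);
```

Here key 004 occurs twice in the second list, so it yields two output lines, while 001 and 003 are dropped.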
If I were you, I would build an SQLite database from the three files; then it would be much easier to use SQL to retrieve the results.
I do not know how fast it is going to be, but I think it is much more robust than reading three files in parallel. SQLite can handle this amount of data.
http://perlmaven.com/simple-database-access-using-perl-dbi-and-sql
SQLite for large data sets?
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
my $dbfile = "sample.db";
my $dsn = "dbi:SQLite:dbname=$dbfile";
my $user = "";
my $password = "";
my $dbh = DBI->connect($dsn, $user, $password, {
PrintError => 1,
RaiseError => 1,
FetchHashKeyName => 'NAME_lc',
AutoCommit => 0,
});
$dbh->do('PRAGMA synchronous = OFF');
my $sql = <<'END_SQL';
CREATE TABLE t1 (
    id INTEGER PRIMARY KEY,
    c1 VARCHAR(100),
    c2 VARCHAR(100),
    c3 VARCHAR(100),
    c4 VARCHAR(100)
)
END_SQL
$dbh->do($sql);
$sql = <<'END_SQL';
CREATE TABLE t2 (
    id INTEGER PRIMARY KEY,
    c1 VARCHAR(100),
    c2 VARCHAR(100),
    c3 VARCHAR(100),
    c4 VARCHAR(100)
)
END_SQL
$dbh->do($sql);
$sql = <<'END_SQL';
CREATE TABLE t3 (
    id INTEGER PRIMARY KEY,
    c1 VARCHAR(100),
    c2 VARCHAR(100),
    c3 VARCHAR(100),
    c4 VARCHAR(100)
)
END_SQL
$dbh->do($sql);
### populate data
# The key from each file goes into c1; id is left for SQLite to assign,
# so repeated keys (one-to-many) do not collide.
open my $fh, '<', $ARGV[0] or die "Can't open $ARGV[0] for reading: $!";
while( my $line = <$fh> ){
    chomp $line;
    my @cols = split('\|', $line);
    $dbh->do('INSERT INTO t1 (c1, c2, c3, c4) VALUES (?, ?, ?, ?)',undef,@cols[0..3]);
}
close($fh);
$dbh->commit();
open $fh, '<', $ARGV[1] or die "Can't open $ARGV[1] for reading: $!";
while( my $line = <$fh> ){
    chomp $line;
    my @cols = split('\|', $line);
    $dbh->do('INSERT INTO t2 (c1, c2, c3, c4) VALUES (?, ?, ?, ?)',undef,@cols[0..3]);
}
close($fh);
$dbh->commit();
open $fh, '<', $ARGV[2] or die "Can't open $ARGV[2] for reading: $!";
while( my $line = <$fh> ){
    chomp $line;
    my @cols = split('\|', $line);
    $dbh->do('INSERT INTO t3 (c1, c2, c3, c4) VALUES (?, ?, ?, ?)',undef,@cols[0..3]);
}
close($fh);
$dbh->commit();
### process data
$sql = 'SELECT t1.c1, t1.c2, t1.c3, t2.c2, t2.c3, t3.c2 FROM t1,t2,t3 WHERE t1.c1=t2.c1 AND t1.c1=t3.c1 ORDER BY t1.c1';
my $sth = $dbh->prepare($sql);
$sth->execute();    # the query has no placeholders, so no bind values
while (my @row = $sth->fetchrow_array) {
    print join("\t",@row)."\n";
}
$dbh->disconnect;
#unlink($dbfile);
Trying to understand your files. You have one file of head values (whatever those are) one file filled with phone numbers, and one file filled with addresses. Is that correct? Each file can have multiple head, addresses, or phone numbers, and each file somehow corresponds to each other.
Could you give an example of the data in the files, and how they relate to each other? I'll update my answer as soon as I get a better understanding on what your data actually looks like.
Meanwhile, it's time to learn about references. References allow you to create more complex data structures. And, once you understand references, you can move onto Object Oriented Perl which will really allow you to tackle programming tasks that you didn't know were possible.
Perl references allow you to have hashes of hashes, arrays of arrays, arrays of hashes, or hashes of arrays, and of course those arrays or hashes in that array or hash can itself have arrays or hashes. Maybe an example will help.
Let's say you have a hash of people assigned by employee number. I'm assuming that your first file is employee_id|name, and the second file is address|city_state, and the third is home_phone|work_phone:
First, just read in the files into arrays:
use strict;
use warnings;
use autodie;
use feature qw(say);
open my $heading_fh, "<", $file1;
open my $address_fh, "<", $file2;
open my $phone_fh, "<", $file3;
my @headings = <$heading_fh>;
chomp @headings;
close $heading_fh;
my @addresses = <$address_fh>;
chomp @addresses;
close $address_fh;
my @phones = <$phone_fh>;
chomp @phones;
close $phone_fh;
That'll make it easier to manipulate the various data streams. Now, we can go through each row:
my %employees;
for my $employee_number (0..$#headings) {
    my ( $employee_id, $employee_name ) = split /\s*\|\s*/, $headings[$employee_number];
    my ( $address, $city ) = split /\s*\|\s*/, $addresses[$employee_number];
    my ( $home_phone, $work_phone ) = split /\s*\|\s*/, $phones[$employee_number];
    $employees{$employee_id}->{NAME} = $employee_name;
    $employees{$employee_id}->{ADDRESS} = $address;
    $employees{$employee_id}->{CITY} = $city;
    $employees{$employee_id}->{WORK} = $work_phone;
    $employees{$employee_id}->{HOME} = $home_phone;
}
Now, you have a single hash called %employees that is keyed by the $employee_id, and each entry in the hash is a reference to another hash. You have a hash of hashes.
The end result is a single data structure (your %employees) that is keyed by $employee_id, but each field is individually accessible. What is the name of employee number A103? It's $employees{A103}->{NAME}.
Code is far from complete. For example, you probably want to verify that all of your initial arrays are all the same size and die if they're not:
if ( ( not $#headings == $#phones ) or ( not $#headings == $#addresses ) ) {
    die qq(The files don't have the same number of entries);
}
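To see the hash of hashes in action without any files, here is a self-contained sketch. The sample rows and the helper name build_employees are invented for illustration, and the field layouts are the assumed employee_id|name, address|city_state and home_phone|work_phone formats, not anything confirmed about the real data.

```perl
use strict;
use warnings;

# Hypothetical sample rows in the assumed layouts.
my @headings  = ('A101|Alice', 'A103|Bob');
my @addresses = ('1 Main St|Springfield IL', '2 Oak Ave|Portland OR');
my @phones    = ('555-0100|555-0101', '555-0102|555-0103');

# Hypothetical helper: build a hash of hashes keyed by employee ID.
sub build_employees {
    my ($headings, $addresses, $phones) = @_;
    die "The files don't have the same number of entries\n"
        unless @$headings == @$addresses && @$headings == @$phones;

    my %employees;
    for my $i (0 .. $#$headings) {
        my ($id, $name)      = split /\s*\|\s*/, $headings->[$i];
        my ($address, $city) = split /\s*\|\s*/, $addresses->[$i];
        my ($home, $work)    = split /\s*\|\s*/, $phones->[$i];
        # Each value is a reference to an anonymous hash of fields.
        $employees{$id} = {
            NAME => $name, ADDRESS => $address, CITY => $city,
            HOME => $home, WORK => $work,
        };
    }
    return \%employees;
}

my $employees = build_employees(\@headings, \@addresses, \@phones);
print "$employees->{A103}{NAME}\n";
for my $id (sort keys %$employees) {
    print "$id: $employees->{$id}{NAME}, $employees->{$id}{CITY}\n";
}
```

The arrow between adjacent subscripts is optional, so $employees->{A103}{NAME} and $employees->{A103}->{NAME} mean the same thing.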
I hope the idea of using references and making use of more complex data structures makes things easier to handle. However, if you need more help, post an example of what your data looks like. Also explain what the various fields are and how they relate to each other.
Many postings on Stack Overflow look like this to me:
My data looks like this:
ajdjadd|oieuqweoqwe|qwoeqwe|(asdad|asdads)|adsadsnrrd|hqweqwe
And, I need to make it look like this:
##*()#&&###|##*##&)(*&!#!|####&(*&##

Perl dbi sqlite 'select * ..' only returns first elem

I've got a problem with Perl DBI and SQLite.
I have set up a database (and checked it with the sqlite command line).
Now I want to search in this database, which did not work.
So I tried to just do a 'SELECT *'.
This prints only the first element in the database, not everything in the table as it should.
I think the error that causes the SELECT * to fail is the same one that prevents me from using "like %..%" stuff.
This is the relevant code. If the code is correct and the database table seems good, what else could have caused the problem?
my $dbh = DBI->connect("dbi:SQLite:dbname=$dbfile","","") || die "Cannot connect: $DBI::errstr";
my $sth = $dbh->prepare('SELECT * FROM words');
$sth->execute;
my @result = $sth->fetchrow_array();
foreach( @result) {
    print $_;
}
fetchrow_array() only fetches one row.
Try
while ( my @row = $sth->fetchrow_array ) {
    print "@row\n";
}
According to the documentation, fetchrow_array
Fetches the next row of data and returns it as a list containing the field values.
If you want all of the data you can call fetchrow_array (or fetchrow_arrayref) repeatedly until you reach the end of the table, or you can use fetchall_arrayref:
The fetchall_arrayref method can be used to fetch all the data to be returned from a prepared and executed statement handle. It returns a reference to an array that contains one reference per row
The code would look like this
use strict;
use warnings;
use DBI;
my $dbfile = 'words.db';
my $dbh = DBI->connect("dbi:SQLite:dbname=$dbfile", '', '') or die "Cannot connect: $DBI::errstr";
my $sth = $dbh->prepare('SELECT * FROM words');
$sth->execute;
my $result = $sth->fetchall_arrayref;
foreach my $row ( @$result ) {
    print "@$row\n";
}

Why does MySQL DATE_FORMAT print a blank result?

For the past couple of hours I've been trying to format a MySQL timestamp using DATE_FORMAT and it doesn't do anything!
Perl Code:
use CGI;
use DBI;
my $q = new CGI;
# Database connection goes here
my $sth_select = $dbh->prepare(
"SELECT DATE_FORMAT(timestamp, '%m/%d/%y') FROM foo"
);
$sth_select->execute() || die "Unable to execute query: $dbh->errstr";
if (my $ref = $sth_select->fetchrow_hashref()) {
print $q->header;
print " TIME: $ref->{timestamp}";
exit;
}
Results
TIME:
It doesn't print the formatted time at all, it is blank!
When I attempt to print the timestamp it doesn't print anything, but if I remove DATE_FORMAT and simply do a SELECT timestamp FROM foo, then it prints the timestamp just fine, albeit unformatted. Can somebody provide their insight on this matter, please?
The hash returned has as keys column headers as provided by the database. When using a function like that, the column header is actually "DATE_FORMAT(timestamp, '%m/%d/%y')".
Try modifying your SQL to be:
my $sth_select = $dbh->prepare("SELECT DATE_FORMAT(timestamp, '%m/%d/%y') AS timestamp FROM foo");

Can I get the table names from an SQL query with Perl's DBI?

I am writing small snippets in Perl and DBI (SQLite yay!)
I would like to log some specific queries to text files having the same filename as that of the table name(s) on which the query is run.
Here is the code I use to dump results to a text file :
sub dumpResultsToFile {
my ( $query ) = @_;
# Prepare and execute the query
my $sth = $dbh->prepare( $query );
$sth->execute();
# Open the output file
open FILE, ">results.txt" or die "Can't open results output file: $!";
# Dump the formatted results to the file
$sth->dump_results( 80, "\n", ", ", \*FILE );
# Close the output file
close FILE or die "Error closing result file: $!\n";
}
Here is how I can call this :
dumpResultsToFile ( <<"END_SQL" );
SELECT TADA.fileName, TADA.labelName
FROM TADA
END_SQL
What I effectively want is, instead of stuff going to "results.txt" ( that is hardcoded above ), it should now go to "TADA.txt".
Had this been a join between tables "HAI" and "LOL", then the resultset should be written to "HAI.LOL.txt"
Is what I am saying even possible using some magic in DBI?
I would rather do without parsing the SQL query for tables, but if there is a widely used and debugged function to grab source table names in a SQL query, that would work for me too.
What I want is just to have a filename
that gives some hint as to what query
output it holds. Seggregating based on
table name seems a nice way for now.
Probably not. Your SQL generation code takes the wrong approach. You are hiding too much information from your program. At some point, your program knows which table to select from. Instead of throwing that information away and embedding it inside an opaque SQL command, you should keep it around. Then your logger function doesn't have to guess where the log data should go; it knows.
Maybe this is clearer with some code. Your code looks like:
sub make_query {
    my ($table, $columns, $conditions) = @_;
    return "SELECT $columns FROM $table WHERE $conditions";
}
sub run_query {
    my ($query) = @_;
    $dbh->prepare($query);
    ...
}
run_query( make_query( 'foo', '*', '1=1' ) );
This doesn't let you do what you want to do. So you should structure
your program to do something like:
sub make_query {
    my ($table, $columns, $conditions) = @_;
    return +{
        query => "SELECT $columns FROM $table WHERE $conditions",
        table => $table,
    } # an object might not be a bad idea
}
sub run_query {
    my ($query) = @_;
    $dbh->prepare($query->{query});
    log_to_file( $query->{table}.'.log', ... );
    ...
}
run_query( make_query( 'foo', '*', '1=1' ) );
The API is the same, but now you have the information you need to log
the way you want.
Also, consider SQL::Abstract for dynamic SQL generation. My code
above is just an example.
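To make the idea concrete for the TADA and HAI/LOL cases from the question, here is a small runnable sketch. It is not a DBI feature: make_query and results_filename are hypothetical helpers, and passing the tables as an array reference is an assumption about how the calling code would be reorganised.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the table list next to the generated SQL.
sub make_query {
    my ($tables, $columns, $conditions) = @_;
    return {
        query  => sprintf('SELECT %s FROM %s WHERE %s',
                          $columns, join(', ', @$tables), $conditions),
        tables => [@$tables],
    };
}

# Hypothetical helper: derive the output filename from the stored tables.
sub results_filename {
    my ($query) = @_;
    return join('.', @{ $query->{tables} }) . '.txt';
}

my $single = make_query(['TADA'], 'TADA.fileName, TADA.labelName', '1=1');
my $joined = make_query(['HAI', 'LOL'], '*', 'HAI.id = LOL.id');
print results_filename($single), "\n";    # TADA.txt
print results_filename($joined), "\n";    # HAI.LOL.txt
```

The dump routine can then open results_filename($query) instead of the hardcoded results.txt and pass $query->{query} to prepare.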
Edit: OK, so you say you're using SQLite. It has an EXPLAIN command
which you could parse the output of:
sqlite> explain select * from test;
0|Trace|0|0|0|explain select * from test;|00|
1|Goto|0|11|0||00|
2|SetNumColumns|0|2|0||00|
3|OpenRead|0|2|0||00|
4|Rewind|0|9|0||00|
5|Column|0|0|1||00|
6|Column|0|1|2||00|
7|ResultRow|1|2|0||00|
8|Next|0|5|0||00|
9|Close|0|0|0||00|
10|Halt|0|0|0||00|
11|Transaction|0|0|0||00|
12|VerifyCookie|0|1|0||00|
13|TableLock|0|2|0|test|00|
14|Goto|0|2|0||00|
Looks like TableLock is what you would want to look for. YMMV, this
is a bad idea.
In general, in SQL, you cannot reliably deduce table names from result set, both for theoretical reasons (the result set may only consist of computed columns) and practical (the result set never includes table names - only column names - in its data).
So the only way to figure out the tables used is to store them with (or deduce them from) the original query.
I've heard good things about the parsing ability of SQL::Statement but never used it before now myself.
use SQL::Statement;
use strict;
use warnings;
my $sql = <<"END_SQL";
SELECT TADA.fileName, TADA.labelName
FROM TADA
END_SQL
my $parser = SQL::Parser->new();
$parser->{RaiseError} = 1;
$parser->{PrintError} = 0;
my $stmt = eval { SQL::Statement->new($sql, $parser) }
    or die "parse error: $@";
print join',',map{$_->name}$stmt->tables;

How do I insert values from parallel arrays into a database using Perl's DBI module?

I need to insert values into a database using Perl's DBI module. I have parsed a file to obtain these values, so they are held in arrays, say @array1, @array2, @array3. I know how to insert one value at a time but not from arrays.
This is how I insert one value at a time:
$dbh = DBI->connect("dbi:Sybase:server=$Srv;database=$Db", "$user", "$passwd") or die "could not connect to database";
$query= "INSERT INTO table1 (id, name, address) VALUES (DEFAULT, tom, Park_Road)";
$sth = $dbh->prepare($query) or die "could not prepare statement\n";
$sth-> execute or die "could not execute statement\n $command\n";
If array1 contains ids, array2 contains names, and array3 contains addresses, I am not sure how I would insert the values.
Since you have parallel arrays, you could take advantage of execute_array:
my $sth = $dbh->prepare('INSERT INTO table1 (id, name, address) VALUES (?, ?, ?)');
my $num_tuples_executed = $sth->execute_array(
    { ArrayTupleStatus => \my @tuple_status },
    \@ids,
    \@names,
    \@addresses,
);
Please note that this is a truncated (and slightly modified) example from the documentation. You'll definitely want to check out the rest of it if you decide to use this function.
Use placeholders.
Update: I just realized you have parallel arrays. That is really not a good way of working with data items that go together. With that caveat, you can use List::MoreUtils::each_array:
#!/usr/bin/perl
use strict; use warnings;
use DBI;
use List::MoreUtils qw( each_array );
my $dbh = DBI->connect(
"dbi:Sybase:server=$Srv;database=$Db",
$user, $passwd,
) or die sprintf 'Could not connect to database: %s', DBI->errstr;
my $sth = $dbh->prepare(
'INSERT INTO table1 (id, name, address) VALUES (?, ?, ?)'
) or die sprintf 'Could not prepare statement: %s', $dbh->errstr;
my @ids = qw( a b c );
my @names = qw( d e f );
my @addresses = qw( g h i );
my $it = each_array(@ids, @names, @addresses);
while ( my @data = $it->() ) {
    $sth->execute( @data )
        or die sprintf 'Could not execute statement: %s', $sth->errstr;
}
$dbh->commit
or die sprintf 'Could not commit updates: %s', $dbh->errstr;
$dbh->disconnect;
Note that the code is not tested.
You might also want to read the FAQ: What's wrong with always quoting "$vars"?.
Further, given that the only way you are handling error is by dying, you might want to consider specifying { RaiseError => 1 } in the connect call.
How could you not be sure what your arrays contain? Anyway, the approach would be to iterate through one array and, assuming the other arrays have corresponding values, put those into the insert statement.
Another way would be to use a hash as an intermediate storage area. IE:
my $hash = {};
# Key the intermediate hash by id so each row keeps its own name and address.
foreach my $i ( 0 .. $#array1 ) {
    $hash->{ $array1[$i] } = { name => $array2[$i], address => $array3[$i] };
}
my $sql = "insert into table1 values(?,?,?)";
my $sth = $dbh->prepare($sql) or die;
foreach my $id ( keys %$hash ) {
    $sth->execute($id, $hash->{$id}{name}, $hash->{$id}{address}) or die;
}
Though again this depends on the three arrays being synced up. However you could modify this to do value modifications or checks or greps in the other arrays within the first loop through array1 (ie: if your values in array2 and array3 are maybe stored as "NN-name" and "NN-address" where NN is the id from the first array and you need to find the corresponding values and remove the NN- with a s// regex). Depends on how your data is structured though.
Another note is to check out Class::DBI and see if it might provide a nicer and more object oriented way of getting your data in.
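If you would rather avoid a non-core module, the index-based iteration described above can be sketched in plain Perl. The helper name zip_rows and the sample values are made up for illustration, and the database call is replaced by a print so the snippet runs without a connection.

```perl
use strict;
use warnings;

# Hypothetical helper: turn parallel arrays into one array ref per row.
sub zip_rows {
    my @columns = @_;
    return unless @columns;
    my $count = @{ $columns[0] };
    for my $col (@columns) {
        die "Parallel arrays differ in length\n" unless @$col == $count;
    }
    return map { my $i = $_; [ map { $_->[$i] } @columns ] } 0 .. $count - 1;
}

my @ids       = (1, 2, 3);
my @names     = qw( tom ann bob );
my @addresses = qw( Park_Road High_Street Mill_Lane );

for my $row ( zip_rows(\@ids, \@names, \@addresses) ) {
    # With a prepared DBI statement this would be $sth->execute(@$row).
    print join(', ', @$row), "\n";
}
```

Checking the lengths up front means a mismatch between the arrays is reported before any row is inserted, rather than silently inserting undefined values.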