Only taking certain values from a list in Perl

First I will describe what I have, then the problem.
I have a text file that is structured as follows:
----------- Start of file-----
<!-->
name,name2,ignore,name4,jojobjim,name3,name6,name9,pop
-->
<csv counter="1">
1,2,3,1,6,8,2,8,2,
2,6,5,1,5,8,7,7,9,
1,4,3,1,2,8,9,3,4,
4,1,6,1,5,6,5,2,9
</csv>
-------- END OF FILE-----------
I also have a Perl program that has a map:
my %column_mapping = (
    "name"  => 'name',
    "name1" => 'name_1',
    "name2" => 'name_2',
    "name3" => 'name_3',
    "name4" => 'name_4',
    "name5" => 'name_5',
    "name6" => 'name_6',
    "name7" => 'name_7',
    "name9" => 'name_9',
);
My dynamic insert statement (assume I have connected to the database properly, and @headers is my array of header names, such as test1, test2, etc.):
my $sql = sprintf 'INSERT INTO tablename ( %s ) VALUES ( %s )',
    join( ',', map { $column_mapping{$_} } @headers ),
    join( ',', ('?') x scalar @headers );
my $sth = $dbh->prepare($sql);
Now for the problem I am actually having:
I need a way to do the insert using only the headers, and their associated values, that are in the map.
In the data file given as an example, there are several names that are not in the map. Is there a way I can ignore them, and the numbers associated with them, in the CSV section?
Basically, I want to make a subset CSV, turning it into:
name,name2,name4,name3,name6,name9,
1,2,1,8,2,8,
2,6,1,8,7,7,
1,4,1,8,9,3,
4,1,1,6,5,2,
so that my insert statement will only insert the ones in the map. The data files are always different: the columns are not in the same order, and an unknown number of them will be in the map.
Ideally this would be efficient, since the script will be going through thousands of files, and each file has millions of CSV lines with hundreds of columns.
It is just a text file being read, though, not a real CSV, so I am not sure whether CSV libraries can work in this scenario or not.

You would typically put the set of valid indices in a list and use array slices after that.
my @valid = grep { defined($column_mapping{ $headers[$_] }) } 0 .. $#headers;
...
my $sql = sprintf 'INSERT INTO tablename ( %s ) VALUES ( %s )',
    join( ',', map { $column_mapping{$_} } @headers[@valid] ),
    join( ',', ('?') x scalar @valid );
my $sth = $dbh->prepare($sql);
...
my @row = split /,/, <INPUT>;
$sth->execute( @row[@valid] );
...

Because this is about four different questions in one, I'm going to take a higher level approach to the broad set of problems and leave the programming details to you (or you can ask new questions about the details).
I would get the data format changed as quickly as possible. Mixing CSV columns into an XML file is bizarre and inefficient, as I'm sure you're aware. Use a CSV file for bulk data. Use an XML file for complicated metadata.
Having the headers in an XML comment is worse: now you're parsing comments, and comments are supposed to be ignored. If you must retain the mixed XML/CSV format, put the headers into a proper XML tag. Otherwise, what's the point of using XML?
Since you're going to be parsing a large file, use an XML SAX parser. Unlike a more traditional DOM parser, which must parse the whole document before doing anything, a SAX parser processes the document as it reads the file. This will save a lot of memory. I leave SAX processing as an exercise; start with XML::SAX::Intro.
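Here is a rough, minimal sketch of that idea, assuming the mixed XML/CSV layout from the question; CsvExtractor and handle_csv_line() are hypothetical names, and handle_csv_line() is where the CSV parsing and database work would go:
package CsvExtractor;
use strict;
use warnings;
use base qw(XML::SAX::Base);

sub start_element {
    my ($self, $el) = @_;
    if ($el->{Name} eq 'csv') {
        $self->{in_csv} = 1;
        $self->{buf}    = '';
    }
}

sub characters {
    my ($self, $chars) = @_;
    return unless $self->{in_csv};
    $self->{buf} .= $chars->{Data};
    # Hand off complete lines as they arrive so memory use stays bounded.
    while ($self->{buf} =~ s/^(.*?)\n//) {
        main::handle_csv_line($1) if length $1;
    }
}

sub end_element {
    my ($self, $el) = @_;
    if ($el->{Name} eq 'csv') {
        main::handle_csv_line($self->{buf}) if length $self->{buf};
        $self->{in_csv} = 0;
    }
}

package main;
use strict;
use warnings;
use XML::SAX::ParserFactory;

sub handle_csv_line {
    my ($line) = @_;
    # Parse $line with a CSV parser and insert into the database here.
}

my $parser = XML::SAX::ParserFactory->parser(Handler => CsvExtractor->new);
$parser->parse_uri('data.xml');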
Within the SAX parser, extract the data from the <csv> and use a CSV parser on that. Text::CSV_XS is a good choice. It is efficient and has solved all the problems of parsing CSV data you are likely to run into.
When you finally have it down to a Text::CSV_XS object, call getline_hr in a loop to get the rows as hashes, apply your mapping, and insert into your database. @mob's solution is fine, but I would go with SQL::Abstract to generate the SQL rather than doing it by hand. This will protect against SQL injection attacks as well as more mundane things like the headers containing SQL metacharacters and reserved words.
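A rough sketch of that last step, assuming the CSV rows have already been extracted to a filehandle and the header names recovered separately; the subroutine name and table name are placeholders, and %column_mapping is the hash from the question (passed in by reference):
use strict;
use warnings;
use Text::CSV_XS;
use SQL::Abstract;

sub load_csv_rows {
    my ($dbh, $csv_fh, $headers, $column_mapping) = @_;

    my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
    $csv->column_names(@$headers);

    my $sqla = SQL::Abstract->new;

    while (my $row = $csv->getline_hr($csv_fh)) {
        # Keep only the columns that appear in the mapping, renamed per the map.
        my %mapped = map  { $column_mapping->{$_} => $row->{$_} }
                     grep { exists $column_mapping->{$_} }
                     keys %$row;
        next unless %mapped;

        my ($stmt, @bind) = $sqla->insert('tablename', \%mapped);
        $dbh->do($stmt, undef, @bind);
    }
}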
It's important to separate the processing of the parsed data from the parsing of the data. I'm quite sure that hideous data format will change, either for the worse or the better, and you don't want to tie the code to it.

Get the number of columns in an ASCII file

I have found many questions regarding CSV files, but not regarding a normal ASCII (.dat) file.
Assume I have a subroutine sub writeMyFile($data), which writes different values to an ASCII file my_file.dat. Each column is then a value, which I want to plot in another subroutine sub plotVals(), but for that I need to know the number of columns of my_file.dat, which is not always the same.
What is an easy and readable way in Perl to get the number of columns of an ASCII file my_file.dat?
Some sample input/output would be (note: file might have multiple rows):
In:
(first line on my_data1.dat) -19922 233.3442 12312 0 0
(first line on my_data2.dat) 0 0 0
Out:
(for my_data1.dat) 5
(for my_data2.dat) 3
You haven't really given us enough detail for any answer to be really helpful (explaining the format of your data file, for example, would have been a great help).
But let's assume that you have a file where the fields are separated by whitespace - something like this:
col1 col2 col3 col4 col5 col6 col7 col8
We know nothing about the columns, only that they are separated by varying amounts of white space.
We can open the file in the usual manner.
my $file = 'my_file.dat';
open my $data_fh, '<', $file or die "Can't open $file: $!";
We can read each record from the file in turn in the usual manner.
while (<$data_fh>) {
    # Data is in $_. Let's remove the newline from the end.
    chomp;
    # Here we do other interesting stuff with the data...
}
Probably, a useful thing to do would be to split the record so that each field is stored in a separate element of an array. That's simple with split().
# By default, split() works on $_ and splits on whitespace, so this is
# equivalent to:
# my @data = split /\s+/, $_;
my @data = split;
Now we get to your question. We have all of our values in #data. But we don't know how many values there are. Luckily, Perl makes it simple to find out the number of elements in an array. We just assign the array to a scalar variable.
my $number_of_values = @data;
I think that's all the information you'll need. Depending on the actual format of your data file, you might need to change the split() line in some way - but without more information it's impossible for us to know what you need there.
When reading the file in plotVals(), split each line on whatever delimiter you use in the data file and count how many fields you get. I presume that you have to split the lines anyway to plot the individual data points, unless you call an external utility to do the plotting. If you call an external utility for plotting, then it is enough to read one representative row (the first?) and count the fields in that.
Alternatively, pass the data or some metadata (the number of columns) directly to plotVals().
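A minimal sketch of the "read one representative row" variant, assuming whitespace-separated columns; plotVals() here is only a stand-in whose real signature may differ:
use strict;
use warnings;

my $file = 'my_file.dat';
open my $data_fh, '<', $file or die "Can't open $file: $!";

my $first_line = <$data_fh>;
die "$file is empty\n" unless defined $first_line;
close $data_fh;

my @fields            = split ' ', $first_line;   # whitespace split, as above
my $number_of_columns = @fields;                  # array in scalar context

plotVals($number_of_columns);

sub plotVals {
    my ($n_cols) = @_;
    print "plotting data with $n_cols columns\n";  # placeholder for real plotting
}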

Storing large strings in Postgres and comparing them

In our analytics application, we parse URLs and store them in the database.
We parse the URLs using the urlparse module and store each component in a separate table.
from urlparse import urlsplit
parsed = urlsplit('http://user:pass@NetLoc:80/path;parameters/path2;parameters2?query=argument#fragment')
print parsed
print 'scheme :', parsed.scheme
print 'netloc :', parsed.netloc
print 'path :', parsed.path
print 'query :', parsed.query
print 'fragment:', parsed.fragment
print 'username:', parsed.username
print 'password:', parsed.password
print 'hostname:', parsed.hostname, '(netloc in lower case)'
print 'port :', parsed.port
Before inserting, we check whether the content is already in the table; if it's there, we don't insert it.
It works fine except for the PATH table. The PATH content for our URLs is large (2000-3000 bytes), and it takes a lot of time to index, compare, and insert the rows.
Is there a better way to store a 2000-3000 byte field that needs to be compared?
Personally I would store a hash of the path component and/or the whole URL. Then for searches I'd check the hash.
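A minimal sketch of that approach with Postgres and DBI, assuming a hypothetical urls table with a path column (table, column, and connection details are placeholders); it relies on Postgres's built-in md5() function and an expression index:
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=analytics', 'user', 'secret',
                       { RaiseError => 1, AutoCommit => 1 });

# One-time setup: index the 32-character hash instead of the 2-3 KB path itself.
$dbh->do(q{CREATE INDEX urls_path_md5_idx ON urls ( md5(path) )});

my $path = '/path;parameters/path2;parameters2';

# Look up via the hashed expression, then confirm on the full path so a
# (rare) hash collision cannot cause a false positive.
my ($exists) = $dbh->selectrow_array(
    q{SELECT 1 FROM urls WHERE md5(path) = md5(?) AND path = ? LIMIT 1},
    undef, $path, $path,
);

$dbh->do(q{INSERT INTO urls (path) VALUES (?)}, undef, $path)
    unless $exists;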
You can use jsonb with GIN or GiST indexing, depending on your dataset:
http://www.postgresql.org/docs/9.4/static/datatype-json.html
Basically, I would store each parsed part separately; this way everything you want can be indexed and searched, and your comparisons can be quite efficient too:
scheme, host, port, etc.

Optimal way of writing to a file after a DB query and then using this file as the data file for BCP IN

The requirement is to copy a table, say account, on server A to a table account_two on server B.
There are many tables like this, each having thousands of rows.
I want to try BCP for it. The problem is that account_two might have fewer columns than account.
I understand that in such scenarios I can either use a format file or a temp table.
The issue is that I do not own the server A tables, and if someone changes the order or the number of columns, bcp will fail.
In Sybase, queryout is not working.
The only option left is doing a SELECT A, B FROM account on server A, writing the output to a file, and using that file as the data file for BCP IN.
However, since there is a huge amount of data, I am not able to find a convenient way of doing this.
while ( my $row = $isth->fetchrow_arrayref ) {
    print FILE join("\t", @$row), "\n";
}
But writing the file this way, performance will take a hit.
I cannot use dump_results() or Data::Dumper; it would be an additional task to bring thousands of lines of data into the bcp data file format.
Can someone help me decide on the best approach?
PS: I am new to Perl. Sorry if there is an obvious answer to this.
#!/usr/local/bin/perl
use strict;
use warnings;
use Sybase::BCP;
my ($user, $passwd) = ('username', 'password');   # your Sybase credentials

my $bcp = Sybase::BCP->new($user, $passwd);
$bcp->config( INPUT     => 'foo.bcp',
              OUTPUT    => 'mydb.dbo.bar',
              SEPARATOR => '|' );
$bcp->run;
You should record the column names as well, so that later you can check whether the order has changed. There is no bcp option to retrieve column names, so you have to get that information and store it separately; a DBI-based sketch of that follows.
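A minimal DBI-based sketch of capturing the column names next to the data dump; the query and file names are placeholders:
use strict;
use warnings;

sub dump_with_header {
    my ($dbh, $query, $data_file, $header_file) = @_;

    my $sth = $dbh->prepare($query);
    $sth->execute;

    # DBI exposes the result set's column names via the NAME attribute; store
    # them so a later run can detect reordered, added, or dropped columns.
    open my $hdr_fh, '>', $header_file or die "Can't write $header_file: $!";
    print {$hdr_fh} join("\t", @{ $sth->{NAME} }), "\n";
    close $hdr_fh;

    open my $out_fh, '>', $data_file or die "Can't write $data_file: $!";
    while (my $row = $sth->fetchrow_arrayref) {
        print {$out_fh} join("\t", map { defined $_ ? $_ : '' } @$row), "\n";
    }
    close $out_fh;
}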
If you need to reorder them, then:
$bcp->config( ...,
              REORDER => { 1  => 2,
                           3  => 1,
                           2  => 'foobar',
                           12 => 4 },
              ... );
Non-Perl solution:
-- Create the headers file
sqlcmd -Q"SET NOCOUNT ON SELECT 'col1','col2'" -Syour_server -dtempdb -E -W -h-1 -s" " >c:\temp\headers.txt
-- Output data
bcp "SELECT i.col1, col2 FROM x" queryout c:\temp\temp.txt -Syour_server -T -c
-- Combine the files using DOS copy command. NB switches: /B - binary; avoids appending invalid EOF character 26 to end of file.
copy c:\temp\headers.txt + c:\temp\temp.txt c:\temp\output.txt /B

Perl: How to retrieve field names when doing $dbh->selectall_..?

$sth = $dbh->prepare($sql);
$sth->execute();
$sth->{NAME};
But how do you do that when:
$hr = $dbh->selectall_hashref($sql,'pk_id');
There's no $sth, so how do you get the $sth->{NAME}? $dbh->{NAME} doesn't exist.
When you're looking at a row, you can always use keys %$row to find out what columns it contains. They'll be exactly the same thing as NAME (unless you change FetchHashKeyName to NAME_lc or NAME_uc).
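A minimal sketch, assuming $dbh is already connected and some_table has a pk_id primary key column (both placeholders):
my $hr = $dbh->selectall_hashref('SELECT * FROM some_table', 'pk_id');

for my $pk_id (keys %$hr) {
    my @columns = keys %{ $hr->{$pk_id} };   # same names $sth->{NAME} would report
    print "columns: @columns\n";
    last;                                    # one row is enough to see the names
}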
You can always prepare and execute the handle yourself, get the column names from it, and then pass the handle instead of the sql to selectall_hashref (e.g. if you want the column names but the statement may return no rows). Though you may as well call fetchall_hashref on the statement handle.
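A minimal sketch of that approach, again assuming $dbh and a pk_id column; preparing and executing the handle yourself makes the column names available even if the query returns no rows:
my $sth = $dbh->prepare('SELECT * FROM some_table');
$sth->execute;

my @column_names = @{ $sth->{NAME} };        # grab these before fetching the rows
my $hr = $sth->fetchall_hashref('pk_id');    # what selectall_hashref would return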

Escaping & in Perl DB queries

I need to manipulate some records in a DB and write their values to another table. Some of these values have an '&' in the string, e.g. 'Me & You'. Short of finding all of these values and placing a \ before every &, how can I insert these values into a table without Oracle choking on the &?
Use placeholders. Instead of putting '$who' in your SQL statement, prepare with a ? there instead, and then either bind $who, or execute with $who as the appropriate argument.
my $sth = $dbh->prepare_cached('INSERT INTO FOO (WHO) VALUES (?)');
$sth->bind_param(1, $who);
my $rc = $sth->execute();
This is safer and faster than trying to do it yourself. (There is a "quote" method in DBI, but this is better than that.)
This is definitely a wheel you don't need to reinvent. If you are using DBI, don't escape the input; use placeholders.
Example:
my $string = "database 'text' with &special& %characters%";
my $sth = $dbh->prepare("UPDATE some_table SET some_column=?
WHERE some_other_column=42");
$sth->execute($string);
The DBD::Oracle module (and all the other DBD::xxxxx modules) have undergone extensive testing and real world use. Let it worry about how to get your text inserted into the database.