Storing large strings in POSTGRES and comparing

In our analytics application, we parse URLs and store them in the database.
We parse the URLs using the urlparse module and store each and every component in a separate table.
from urlparse import urlsplit
parsed = urlsplit('http://user:pass@NetLoc:80/path;parameters/path2;parameters2?query=argument#fragment')
print parsed
print 'scheme :', parsed.scheme
print 'netloc :', parsed.netloc
print 'path :', parsed.path
print 'query :', parsed.query
print 'fragment:', parsed.fragment
print 'username:', parsed.username
print 'password:', parsed.password
print 'hostname:', parsed.hostname, '(netloc in lower case)'
print 'port :', parsed.port
Before inserting, we check whether the content is already in the table; if it is there, we don't insert it.
This works fine except for the PATH table. The path content of our URLs is too big (2000-3000 bytes), and it takes a lot of time to index, compare, and insert the row.
Is there a better way to store a 2000-3000 byte field that needs to be compared?

Personally I would store a hash of the path component and/or the whole URL. Then for searches I'd check the hash.
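A minimal sketch of that idea (Perl with DBI/DBD::Pg and the core Digest::SHA module; the urls table, its columns, and the connection details are made up for illustration):

use DBI;
use Digest::SHA qw(sha1_hex);

my $dbh = DBI->connect('dbi:Pg:dbname=analytics', 'user', 'secret',
                       { RaiseError => 1 });

my $path = '/path;parameters/path2;parameters2';
my $hash = sha1_hex($path);   # fixed 40-character key instead of 2000-3000 bytes

# Compare against the short, indexed hash rather than the long path itself
my ($id) = $dbh->selectrow_array(
    'SELECT id FROM urls WHERE path_sha1 = ?', undef, $hash);

unless (defined $id) {
    $dbh->do('INSERT INTO urls (path, path_sha1) VALUES (?, ?)',
             undef, $path, $hash);
}

An index on the 40-character path_sha1 column stays much smaller and faster than one on the full path, and on a hash match you can still compare the stored path to guard against the (very unlikely) collision.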

You can use jsonb with GIN or GiST indexing, depending on your dataset:
http://www.postgresql.org/docs/9.4/static/datatype-json.html
Basically, I would store each parsed part separately; that way everything you want can be indexed and searched, and your comparisons can be quite efficient too:
scheme, host, port, etc.
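For instance, a rough sketch of the jsonb approach (again via Perl's DBI; the parsed_urls table and the JSON keys are made up, and jsonb requires PostgreSQL 9.4 or later):

use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=analytics', 'user', 'secret',
                       { RaiseError => 1 });

# One row per URL, with all parsed parts in a single indexed jsonb column
$dbh->do(q{CREATE TABLE parsed_urls (id serial PRIMARY KEY, parts jsonb)});
$dbh->do(q{CREATE INDEX parsed_urls_parts_gin ON parsed_urls USING gin (parts)});

# PostgreSQL coerces the text parameter to jsonb on insert
$dbh->do(q{INSERT INTO parsed_urls (parts) VALUES (?)}, undef,
         '{"scheme":"http","host":"netloc","port":80,"path":"/path;parameters"}');

# The GIN index makes containment queries like this one fast
my $ids = $dbh->selectcol_arrayref(
    q{SELECT id FROM parsed_urls WHERE parts @> '{"host":"netloc"}'});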

Related

Sed command inside TCL script

Help me understand sed syntax. I removed the single quotes, but the code still does not work.
set id [open file.txt]
# send the request, get a lot of data
set tok [::http::geturl "http://example.com" -channel $id]
# cut out the necessary data between two words
exec sed s/{"data1":\(.*\)/data2\1/ $id
close $id
set ir [open file.txt]
set phone [read $ir]
close $ir
puts $phone
The problem is that I get data from a query of the following kind
{"id":3876,"form":"index","time":21,"data":"2529423","service":"Atere","response":"WAIT"}
The brace is an element of the language's syntax, and I need to cut out exactly the value between the word and the brace. How can I implement this in the script?
Your code is rather confused: (a) you are passing a file handle to the sed command, which is not going to work, and (b) you are passing an input channel to http rather than an output channel (try opening the file for writing).
As for the underlying problem: if you are receiving basic JSON data back as shown, you have two options.
a) You can use a JSON parser: tcllib's json module
b) Convert it to a form that Tcl can parse as a dictionary
# Assuming the JSON data is in the $data variable, and there's no
# other data present. This also assumes the data is very basic:
# there are no embedded commas. So many assumptions mean this
# code is likely to break in the future. A JSON parser would
# be a better choice.
set data "\{"
append data {"id":3876,"form":"index","time":21,"data":"2529423","service":"Atere","response":"WAIT"}
append data "\}"
regsub -all {[{}:",]} $data { } data
set mydatadict $data
puts [dict get $mydatadict id]
Edit:
For http processing:
set tok [::http::geturl "http://example.com"]
set data [::http::data $tok]
::http::cleanup $tok

only taking certain values from a list in perl

First I will describe what I have, then the problem.
I have a text file that is structured like this:
----------- Start of file-----
<!--
name,name2,ignore,name4,jojobjim,name3,name6,name9,pop
-->
<csv counter="1">
1,2,3,1,6,8,2,8,2,
2,6,5,1,5,8,7,7,9,
1,4,3,1,2,8,9,3,4,
4,1,6,1,5,6,5,2,9
</csv>
-------- END OF FILE-----------
I also have a perl program that has a map:
my %column_mapping = (
    "name"  => 'name',
    "name1" => 'name_1',
    "name2" => 'name_2',
    "name3" => 'name_3',
    "name4" => 'name_4',
    "name5" => 'name_5',
    "name6" => 'name_6',
    "name7" => 'name_7',
    "name9" => 'name_9',
);
Here is my dynamic insert statement (assume I have connected to the database properly, and @headers is my array of header names, such as test1, test2, etc.):
my $sql = sprintf 'INSERT INTO tablename ( %s ) VALUES ( %s )',
    join( ',', map { $column_mapping{$_} } @headers ),
    join( ',', ('?') x scalar @headers );
my $sth = $dbh->prepare($sql);
Now for the problem I am actually having:
I need a way to do the insert only for the headers and values that are in the map.
In the data file given as an example, there are several names that are not in the map. Is there a way I can ignore them, along with the numbers associated with them in the CSV section?
Basically, I want to make a subset CSV, turning it into:
name,name2,name4,name3,name6,name9,
1,2,1,8,2,8,
2,6,1,8,7,7,
1,4,1,8,9,3,
4,1,1,6,5,2,
so that my insert statement will only insert the ones that are in the map. The data files are always different, the columns are not in the same order, and an unknown number of them will be in the map.
Ideally this should be efficient, since this script will be going through thousands of files, each containing millions of CSV lines with hundreds of columns.
It is just a text file being read, though, not a real CSV file, so I am not sure whether CSV libraries can work in this scenario.
You would typically put the set of valid indices in a list and use array slices after that.
my @valid = grep { defined($column_mapping{ $headers[$_] }) } 0 .. $#headers;
...
my $sql = sprintf 'INSERT INTO tablename ( %s ) VALUES ( %s )',
    join( ',', map { $column_mapping{$_} } @headers[@valid] ),
    join( ',', ('?') x scalar @valid );
my $sth = $dbh->prepare($sql);
...
my @row = split /,/, <INPUT>;
$sth->execute( @row[@valid] );
...
Because this is about four different questions in one, I'm going to take a higher level approach to the broad set of problems and leave the programming details to you (or you can ask new questions about the details).
I would get the data format changed as quickly as possible. Mixing CSV columns into an XML file is bizarre and inefficient, as I'm sure you're aware. Use a CSV file for bulk data. Use an XML file for complicated metadata.
Having the headers be an XML comment is even worse: now you're parsing comments, and comments are supposed to be ignored. If you must retain the mixed XML/CSV format, put the headers into a proper XML tag. Otherwise, what's the point of using XML?
Since you're going to be parsing a large file, use an XML SAX parser. Unlike a more traditional DOM parser which must parse the whole document before doing anything, a SAX parser will process it as it reads the file. This will save a lot of memory. I leave SAX processing as an exercise, start with XML::SAX::Intro.
Within the SAX parser, extract the data from the <csv> and use a CSV parser on that. Text::CSV_XS is a good choice. It is efficient and has solved all the problems of parsing CSV data you are likely to run into.
When you finally have it down to a Text::CSV_XS object, call getline_hr in a loop to get the rows as hashes, apply your mapping, and insert into your database. @mob's solution is fine, but I would go with SQL::Abstract to generate the SQL rather than doing it by hand. This will protect against both SQL injection attacks and more mundane things like headers containing SQL metacharacters or reserved words.
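For example, a sketch of that last stage (assuming $fh yields the lines extracted from the <csv> element, $dbh is an open DBI handle, %column_mapping and @headers are as above, and tablename is a placeholder):

use Text::CSV_XS;
use SQL::Abstract;

my $csv  = Text::CSV_XS->new({ binary => 1 });
my $sqla = SQL::Abstract->new;

# getline_hr needs the column names up front
$csv->column_names(@headers);

while (my $row = $csv->getline_hr($fh)) {
    # Keep only the columns that appear in the mapping
    my %fields = map  { $column_mapping{$_} => $row->{$_} }
                 grep { exists $column_mapping{$_} }
                 keys %$row;

    my ($stmt, @bind) = $sqla->insert('tablename', \%fields);
    $dbh->do($stmt, undef, @bind);
}

SQL::Abstract quotes and parameterizes the generated statement for you, which is where the protection mentioned above comes from.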
It's important to separate the processing of the parsed data from the parsing of the data. I'm quite sure that hideous data format will change, either for the worse or the better, and you don't want to tie the code to it.

opening utf8 files on perl and double encoding

I have a MySQL DB which has COLLATE='utf8_general_ci' for every table.
I connect to the tables with DBI, my $db = DBI->connect($cstring, $user, $password), without
$db->{mysql_enable_utf8} = 1;
$db->do(qq{SET NAMES 'utf8';});
Then I select from a table and copy it to a CSV file using Text::CSV, writing to myFile, where myFile is opened like this:
binmode(Myfile, ":utf8")
The problem is that I repeat this process on different tables, writing to different files opened as above. On some files I get double encoding, and the problem goes away only if I remove the binmode for those specific files; the other files are fine and come out encoded as UTF-8, and if I remove the binmode for them, their UTF-8 encoding breaks. What could be the problem?
It is worth mentioning that I tried use utf8 in my script, and also tried
$db->{mysql_enable_utf8} = 1;
$db->do(qq{SET NAMES 'utf8';});
but the problem is not solved.
If I understand correctly, you see
éëè
where you expect
éëè
when using phpMyAdmin. This indicates the data in your database is wrong (double-encoded). You'll need to go back and repopulate your database with the correct data.
If you can't fix your database, it's most likely safe to just add the following:
utf8::decode($str); # Fix double-encoding
It will attempt to decode the already-decoded data from the database. If the data was double-encoded, this will fix it. If the data wasn't double-encoded, it will silently fail, leaving the correct value in $str (assuming your strings aren't very, very weird).
I recommend that you write a small tool that reads the data from the database, uses this trick to fix the data, then puts it back in the database correctly.
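A sketch of such a repair tool (the items table and its id and name columns are hypothetical; utf8::decode is built in, so nothing needs to be imported for it):

use DBI;

my $dbh = DBI->connect('dbi:mysql:mydb', 'user', 'secret',
                       { RaiseError => 1, mysql_enable_utf8 => 1 });

my $rows = $dbh->selectall_arrayref('SELECT id, name FROM items');
my $upd  = $dbh->prepare('UPDATE items SET name = ? WHERE id = ?');

for my $row (@$rows) {
    my ($id, $name) = @$row;
    # Undo one level of double-encoding; effectively a no-op if the row was fine
    utf8::decode($name);
    $upd->execute($name, $id);
}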

Perl CGI - How can I delete contents of text fields?

So, I am totally new to CGI programming in Perl.
The question is simple: is there any way to delete the content of a text field in CGI?
I have to write code that has some popup_menu widgets, a submit button, and text fields (a textarea).
When I click on the submit button, the program reads the value from one of the popup_menu widgets.
The task is to copy this content into the text field, and then, when I choose another element from the popup_menu (and click on the submit button, of course), the new content should be written into the text field, replacing the old one.
I think perldoc.perl.org gives only a little information about CGI programming. I'd have a lot of questions on this topic... :(
Any help would be appreciated!
I guess what you describe is: when you click the submit button, your CGI script runs, given the parameters you entered in the form. What it then has to do is write something back and print the form again, with different values.
So even if this is not the perfect way of doing this kind of thing (for simple form element substitution you should do it client-side with JavaScript; you don't need a CGI backend script for it), let's see what such a CGI script might look like.
First, it's important to know how you write your form. Let's assume you write it "the hard way", with print.
What your script has to do is parse the input and then add it as a value to the output.
use CGI;
my $q = CGI->new;
# get the value from the popup / html select
my $popup_value = $q->param('popup_menu'); # name of the <select name="..."> in your html
# ...
# writing the form
print $q->header;
# some more prints with form etc.
print $q->textarea( -name => 'text_area',
    -default => $popup_value // '', # will use empty string on first call
);
# Don't turn off autoescaping!
BTW, the value of a select option is meant to be a short indicator, not a full text (even if that might be possible up to a certain number of characters). So you might think of building a hash or an array with the appropriate values to be printed in the textarea, and giving your select options the values 0, 1, 2, ...
my @text_values = ('', 'First text', 'second text', 'third text');
my $popup_value = $q->param('popup_menu') || 0; # default index
# now use 1, 2, 3, ... as values in your popup_menu options
# ...
print $q->textarea( -name => 'text_area',
    -default => $text_values[$popup_value] );

Why do I get wide character warnings from Perl when I insert in a Berkeley DB?

I'm running an experiment with Berkeley DBs. I'm simply removing the contents from DB a and reinserting the key-value pairs into DB b. However, I am getting "Wide character" warnings when inserting key-value pairs into DB b. Help?
BerkeleyDB stores bytes ("octets"). Perl strings are made of Perl characters. In order to store Perl characters in the octet-based store, you have to convert the characters to bytes. This is called encoding, as in character-encoding.
The warning you get indicates that Perl is doing the conversion for you, and is guessing about what character encoding you want to use. Since it will probably guess wrong, it's best to explicitly say. The Encode module allows you to do that.
Instead of writing:
$db->store( key => $value );
You should instead write:
use Encode qw(encode);
$db->store( key => encode('utf-8', $value) );
And on the way out:
use Encode qw(decode);
$db->get($key, $octets); # BDB returns the result via the arg list. C programmers...
my $value = decode('utf-8', $octets);
This is true of more than just BDB; whenever you are communicating across the network, via files, via the terminal, or pretty much anything, you must be sure to encode characters to octets on the way out, and decode octets to characters on the way in. Otherwise, your program will not work.
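For file I/O in Perl, a convenient way to follow this rule is to push the conversion into the I/O layer so the encoding and decoding happen automatically (a small self-contained illustration; the file name is arbitrary):

use strict;
use warnings;

# Encode characters to octets on the way out ...
open my $out, '>:encoding(UTF-8)', 'data.txt' or die $!;
print {$out} "r\x{e9}sum\x{e9}\n";   # \x{e9} is the character é
close $out;

# ... and decode octets back to characters on the way in
open my $in, '<:encoding(UTF-8)', 'data.txt' or die $!;
my $line = <$in>;   # $line now holds characters, not raw octets
close $in;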