Opening UTF-8 files in Perl and double encoding - perl

I have a MySQL db in which every table has COLLATE='utf8_general_ci'.
I connect to the tables with DBI, my $db = DBI->connect($cstring, $user, $password), and without
$db->{mysql_enable_utf8} = 1;
$db->do(qq{SET NAMES 'utf8';});
Then I select from a table and copy it to a CSV file using Text::CSV, writing to myFile, where myFile is opened like the below:
binmode(Myfile, ":utf8")
The problem is that I repeat this process on different tables, with different files opened as above, but on some files I get double encoding, and only if I remove the binmode for those specific files is the problem solved. The other files are fine and come out encoded as UTF-8, and if I remove the binmode for them I get a problem with the UTF-8 encoding. What could be the problem?
It is worth mentioning that I tried use utf8 in my script, and also tried
$db->{mysql_enable_utf8} = 1;
$db->do(qq{SET NAMES 'utf8';});
but the problem is not solved.
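A condensed sketch of what the script does (connection values, table and file names here are stand-ins for the real ones):
use strict;
use warnings;
use DBI;
use Text::CSV;

my ($cstring, $user, $password) = ('DBI:mysql:database=mydb', 'user', 'pass');  # placeholders
my $db  = DBI->connect($cstring, $user, $password, { RaiseError => 1 });
my $csv = Text::CSV->new({ binary => 1, eol => "\n" });

open(my $myfile, '>', 'myFile.csv') or die "open: $!";
binmode($myfile, ':utf8');

my $sth = $db->prepare('SELECT * FROM some_table');
$sth->execute;
while (my $row = $sth->fetchrow_arrayref) {
    $csv->print($myfile, $row);    # one CSV line per table row
}
close $myfile;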

If I understand correctly, you see
éëè
where you expect
éëè
when using phpMyAdmin. This indicates the data in your database is wrong (double-encoded). You'll need to go back and repopulate your database with the correct data.
If you can't fix your database, it's most likely safe to just add the following:
utf8::decode($str); # Fix double-encoding
It will attempt to decode the already-decoded data from the database. If the data was double-encoded, this will fix it. If the data wasn't double-encoded, it will silently fail, leaving the correct value in $str (assuming your strings aren't very, very weird).
I recommend that you write a small tool that reads the data from the database, uses this trick to fix the data, then puts it back in the database correctly.
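A minimal sketch of such a repair pass, assuming a table named some_table with an integer id column and a text column named body (adjust the names to your schema):
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=mydb', 'user', 'pass',
                       { RaiseError => 1, mysql_enable_utf8 => 1 });

my $sel = $dbh->prepare('SELECT id, body FROM some_table');
my $upd = $dbh->prepare('UPDATE some_table SET body = ? WHERE id = ?');

$sel->execute;
while (my ($id, $body) = $sel->fetchrow_array) {
    my $fixed = $body;
    utf8::decode($fixed);        # the trick above: undoes one extra layer of encoding
    next if $fixed eq $body;     # value was not double-encoded, leave it alone
    $upd->execute($fixed, $id);
}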

Related

updating table rows based on txt file

I have been searching, but so far I have only found how to insert data into tables based on CSV files.
I have the following scenario:
Directory name = ticketID
Inside this directory I have a couple of files, like:
Description.txt
Summary.txt - Contains the ticket header and has been imported successfully.
Progress_#.txt - created every time a ticket gets updated; I get a new file each time.
Solution.txt
Importing the Issue.txt was easy since this was actually a CSV.
Now my problem is with Description and Progress files.
I need to update the existing rows with the data from these files, something along the lines of
update table_ticket set table_ticket.description = Description.txt where ticket_number = directoryname
I'm using PostgreSQL, and the COPY command is only valid for new data; it would also fail due to the ',;/ special chars.
I wanted to do this using a bash script, but it seems it won't be possible:
for i in `find . -type d`
do
    update table_ticket
    set table_ticket.description = $i/Description.txt
    where ticket_number = $i
done
Of course the above code would also have to handle the connection to the database.
Does anyone have an idea of how I could achieve this using a shell script? Or would it be better to just write something in Java that reads the file and updates the record? I would like to avoid that approach, though.
Thanks
Alex
Thanks for the answer, but I came across this:
psql -U dbuser -h dbhost db
\set content `cat PATH/Description.txt`
update table_ticket set description = :'content' where ticketnr = TICKETNR;
Putting this into a simple script I created the following:
#!/bin/bash
for i in `find . -type d | grep ^./CS`
do
    p=`echo $i | cut -b3-12 -`
    echo $p
    sed s/PATH/${p}/g cmd.sql > cmd.tmp.sql
    ticketnr=`echo $p | cut -b5-10 -`
    sed -i s/TICKETNR/${ticketnr}/g cmd.tmp.sql
    cat cmd.tmp.sql
    psql -U supportAdmin -h localhost supportdb -f cmd.tmp.sql
done
The downside is that it will always create a new connection; later I'll change it to generate a single file.
But it does exactly what I was looking for, putting the contents inside a single column.
psql can't read the file in for you directly unless you intend to store it as a large object in which case you can use lo_import. See the psql command \lo_import.
Update: #AlexandreAlves points out that you can actually slurp file content in using
\set myvar `cat somefile`
then reference it as a psql variable with :'myvar'. Handy.
While it's possible to read the file in using the shell and feed it to psql it's going to be awkward at best as the shell offers neither a native PostgreSQL database driver with parameterised query support nor any text escaping functions. You'd have to roll your own string escaping.
Even then, you need to know that the text encoding of the input file is valid for your client_encoding, otherwise you'll insert garbage and/or get errors. It quickly lands up being easier to do it in a language with proper integration with PostgreSQL, like Python, Perl, Ruby or Java.
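For example, a minimal Perl/DBI sketch of that approach (connection details are placeholders, and the ticket number is taken straight from the directory name here; trim it the way the cut commands in the script above do, if yours needs that):
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=supportdb;host=localhost',
                       'supportAdmin', 'secret', { RaiseError => 1 });

my $upd = $dbh->prepare(
    'UPDATE table_ticket SET description = ? WHERE ticket_number = ?');

for my $dir (glob 'CS*') {
    next unless -d $dir && -f "$dir/Description.txt";

    open my $fh, '<', "$dir/Description.txt" or die "open $dir/Description.txt: $!";
    my $content = do { local $/; <$fh> };   # slurp the whole file
    close $fh;

    # placeholders take care of quoting, so ',;/ etc. in the text are harmless
    $upd->execute($content, $dir);
}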
There is a way to do what you want in bash if you really must, though: use Pg's delimited dollar quoting with a randomized delimiter to help prevent SQL injection attacks. It's not perfect but it's pretty darn close. I'm writing an example now.
Given problematic file:
$ cat > difficult.txt <<__END__
Shell metacharacters like: $!(){}*?"'
SQL-significant characters like "'()
__END__
and sample table:
psql -c 'CREATE TABLE testfile(filecontent text not null);'
You can:
#!/bin/bash
filetoread=$1
sep=$(printf '%04x%04x\n' $RANDOM $RANDOM)
psql <<__END__
INSERT INTO testfile(filecontent) VALUES (
\$x${sep}\$$(cat ${filetoread})\$x${sep}\$
);
__END__
This could be a little hard to read and the random string generation is bash specific, though I'm sure there are probably portable approaches.
A random tag string consisting of alphanumeric characters (I used hex for convenience) is generated and stored in sep.
psql is then invoked with a here-document tag that isn't quoted. The lack of quoting is important, as <<'__END__' would tell bash not to interpret shell metacharacters within the string, whereas plain <<__END__ allows the shell to interpret them. We need the shell to interpret metacharacters as we need to substitute sep into the here document and also need to use $(...) (equivalent to backticks) to insert the file text. The x before each substitution of sep is there because dollar-quote tags must be valid PostgreSQL identifiers, so they must start with a letter, not a number. There's an escaped dollar sign at the start and end of each tag because PostgreSQL dollar quotes are of the form $taghere$quoted text$taghere$.
So when the script is invoked as bash testscript.sh difficult.txt the here document lands up expanding into something like:
INSERT INTO testfile(filecontent) VALUES (
$x0a305c82$Shell metacharacters like: $!(){}*?"'
SQL-significant characters like "'()$x0a305c82$
);
where the tags vary each time, making SQL injection exploits that rely on prematurely ending the quoting difficult.
I still advise you to use a real scripting language, but this shows that it is indeed possible.
The best thing to do is to create a temporary table, COPY the data from the files in question into it, and then run your updates.
Your secondary option would be to create a function in a language like pl/perlu and do this in the stored procedure, but you will lose a lot of performance optimizations that you can do when you update from a temp table.
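A rough Perl/DBD::Pg sketch of that temp-table route (table and column names are guesses based on the question; pg_putcopydata/pg_putcopyend feed the COPY ... FROM STDIN):
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=supportdb', 'supportAdmin', 'secret',
                       { RaiseError => 1 });

# Stage the file contents in a temp table, then update everything in one statement.
$dbh->do('CREATE TEMP TABLE ticket_desc (ticket_number text, description text)');
$dbh->do('COPY ticket_desc (ticket_number, description) FROM STDIN');

for my $dir (glob 'CS*') {
    next unless -f "$dir/Description.txt";
    open my $fh, '<', "$dir/Description.txt" or die "open: $!";
    my $text = do { local $/; <$fh> };
    close $fh;

    # escape for COPY's text format: backslashes first, then delimiter and newlines
    $text =~ s/\\/\\\\/g;
    $text =~ s/\t/\\t/g;
    $text =~ s/\r/\\r/g;
    $text =~ s/\n/\\n/g;
    $dbh->pg_putcopydata("$dir\t$text\n");
}
$dbh->pg_putcopyend();

$dbh->do(q{
    UPDATE table_ticket t
       SET description = d.description
      FROM ticket_desc d
     WHERE t.ticket_number = d.ticket_number
});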

perl utf8 corruption

I am using the Perl module sapnwrfc to connect to SAP and retrieve reports. This module uses UTF-8, and when the data is returned some of it shows a pattern of UTF-8 character corruption. This appears to happen when a line in the SAP report is more than 4096 bytes long, and my current thinking is that Perl's read buffer is splitting UTF-8 characters and causing the corruption.
$abap_lookup = $sap_rfc->function_lookup("REPORT");
$abap_program = $abap_lookup->create_function_call;
# set abap program input variables
$abap_program->REPORT($abap_program_name);
$abap_program->VARIANT($abap_variant_name);
# call the abap program
$abap_program->invoke;
$abap_program->DATA has the corruption in one place in each line that is longer than 4 KB.
This is the fragment with the corruption; the actual line is a byte or two over 4 KB.
\x{f8fc}\x{2500} \x{500}/\x{f8fc}\x{2500}
This is what is expected, so I am assuming something is splitting the line and causing the problem.
\x{f8fc}\x{2500}\x{f8fc}\x{2500}\x{f8fc}\x{2500}
I have tried all manner of settings: the open ':utf8' pragma, use utf8, binmode(STDIN, ":utf8"), binmode(STDOUT, ":utf8"). I have also tried turning off buffering ($| = 1;). I cannot tell whether this is a UTF-8 problem or a buffering problem. Does anyone know why this would happen and how to fix it?
I was not able to figure out where the corruption is happening, but it is repeatable, so I built a filter to work around it.

Error when using complex file names for tar->write in Perl

While using tar->write() I am getting errors with complex file names.
The code is:
my $filename = $archive_type . "_" . $from_date_time . "_" . $to_date_time . ".tar";
$tar->write($filename);
The error I get is:
Could not create filehandle for 'postProcessProbe_2010/6/23/3_2010/6/23/7.tar':
No such file or directory at test.pl line 24
If I change the $filename to a simple string like out.tar everything works.
Well, / is the directory separator on *nix systems (and internally, Windows treats / and \ interchangeably), and I believe tar files, regardless of platform, use it internally as the directory separator.
I do not think you can create file names containing / on either *nix or Windows-based systems. Even if you could, it would probably create a whole bunch of headaches down the road.
It would be better in my humble opinion to switch to a saner date format such as YYYYMMDD.
Also, you are using string concatenation when sprintf would have been much clearer:
my $filename = sprintf '%s_%s_%s.tar', $archive_type, $from_date_time, $to_date_time;
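For example, if $from_date_time currently looks like "2010/6/23/3" (year/month/day/hour, judging by the error message; the field order is an assumption), something along these lines would flatten it first:
# hypothetical helper: turn "2010/6/23/3" into "2010062303"
sub flatten_date {
    my ($y, $m, $d, $h) = split m{/}, shift;
    return sprintf '%04d%02d%02d%02d', $y, $m, $d, $h;
}

my $filename = sprintf '%s_%s_%s.tar',
    $archive_type, flatten_date($from_date_time), flatten_date($to_date_time);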

Why won't this ISQL command run through Perl's DBI?

A while back I was looking for a way to insert values into a text field through isql
and eventually found some load command that worked out for me.
It doesn't work when I try to execute it from Perl. I get a syntax error. I have tried two separate methods and neither is working so far.
I print out the SQL statement variable at the end of each loop cycle, so I know the syntax is correct, but it is just not getting across correctly.
Here's the latest snip of code I was testing:
foreach (@files)
{
    $STMT = <<EOF;
load from $_ insert into some_table
EOF
    $sth = $db1->prepare($STMT);
    $sth->execute;
}
@files is an array whose elements are the full path/location of a pipe-delimited text file (e.g. /home/xx/xx/xx/something.txt).
The number of columns in the table matches the number of fields in the text file and the type checking is fine (I've loaded test files manually without fail).
The error I get back is:
DBD::Informix::db prepare failed: SQL: -201: A syntax error has occurred.
Any idea what might be causing this?
EDIT to RET's & Petr's answers
$STMT = "'LOAD FROM $_ INSERT INTO table'";
system("echo $STMT | isql $db")
I had to change it to this, because the die command would force an unnatural death and the statement had to be wrapped in single quotes.
Petr is exactly right, the LOAD statement is an ISQL or DB-Access extension, so you can't execute it through DBI. If you have a look at the manual, you'll see it is also invalid syntax for SPL, ESQL/C and so on.
It's not clear whether you have to use Perl to execute the script, or whether Perl is just a convenient way of generating the SQL.
If the former, and you want a pure-perl method, you have to prepare an INSERT statement (there's just one table involved by the look of it?), and slurp through the file, using split to break it up into columns and executing the prepared insert.
Otherwise, you can generate the SQL using perl and execute it through DB-Access, either directly with system or by wrapping both in either a shell script or DOS batch file.
System call version
foreach (@files) {
    my $stmt = "LOAD FROM $_ INSERT INTO table;\n";
    system("echo $stmt | dbaccess $database") == 0
        or die "Statement $stmt failed: $?\n";
}
In a batch script version, you could write all the SQL into a single script, i.e.:
perl -e 'while (@ARGV) { $f = shift; print "LOAD FROM \x27$f\x27 INSERT INTO table;\n" }' file1 [ file2 ... ] > loadfiles.sql
isql database loadfiles.sql
NB, the comment about quotes on the filename is only relevant if the filename contains spaces or metacharacters, the usual issue.
Also, one key difference in behaviour between isql and dbaccess is that when executed in this manner, dbaccess does not stop on error, but isql will. To make dbaccess stop processing on error, set DBACCNOIGN=1 in the environment.
Hope that's helpful.
This is because your query is not an SQL query; it is an isql command that tells isql to parse the input file and generate INSERT statements.
If you think about it, the server can be on a completely different machine and has no idea what file you are talking about or how to access it.
So you basically have two options:
call isql and pipe the LOAD command to it - very ugly
parse the file yourself and generate the INSERT statements
Please note that the file Notes/load.unload is distributed with DBD::Informix and contains guidelines on how to handle UNLOAD operations using Perl, DBI and DBD::Informix. Somewhat to my chagrin, I see that it says "T.B.D." (more or less) for the LOAD section.
As other people have stated, the LOAD and UNLOAD statements are faked by various client-side tools to look like SQL statements, but the Informix server does not support them itself, mainly because of the issue with getting the file from a client machine (perhaps a PC) to the server machine (perhaps a Solaris machine).
To simulate the LOAD statement, you would need to analyze the INSERT INTO Table part. If it lists columns (INSERT INTO Table(Col03, Col05, Col09)), then you can expect three values in the load data file, and they go into those three columns. You would prepare a statement 'SELECT Col03, Col05, Col09 FROM Table' to get the types of the columns. Otherwise, you need to prepare a statement 'SELECT * FROM Table' to get the complete list of columns (and their types). Given the column names and the number of columns, you can create and prepare a suitable insert statement: 'INSERT INTO Table(Col03, Col05, Col09) VALUES(?,?,?)' or 'INSERT INTO Table VALUES(?,?,?,?,?,?,?,?,?)'. You could (arguably should) include column names in the second one.
With that ready, you now have to parse the unloaded data. There is a document in the SQLCMD program, available from the IIUG Software Archive (which has been around a lot longer than Microsoft's upstart program of the same name), that describes the UNLOAD format in considerable detail. Perl has the ability to handle anything Informix uses - witness the UNLOAD information in the load.unload file distributed with DBD::Informix.
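A bare-bones sketch of that approach for the simple case: all columns loaded in table order, plain pipe-delimited lines, and none of the embedded-escape handling the UNLOAD format document describes (the table name is a stand-in):
use strict;
use warnings;
use DBI;

my $dbh   = DBI->connect('dbi:Informix:mydb', '', '', { RaiseError => 1 });
my @files = @ARGV;    # full paths to the pipe-delimited files, as in the question

# Probe the table once to find out how many columns it has.
my $probe = $dbh->prepare('SELECT * FROM some_table WHERE 1 = 0');
$probe->execute;
my $ncols = $probe->{NUM_OF_FIELDS};
$probe->finish;

my $ins = $dbh->prepare(
    'INSERT INTO some_table VALUES (' . join(',', ('?') x $ncols) . ')');

for my $file (@files) {
    open my $fh, '<', $file or die "open $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my @fields = split /\|/, $line, -1;    # -1 keeps trailing empty fields
        pop @fields if @fields == $ncols + 1;  # drop the empty field after a trailing '|'
        $ins->execute(@fields);
    }
    close $fh;
}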
A quick bit of Googling showed that the syntax for load puts quote marks around the file name. What if you change your statement to be:
load from '$_' insert into some_table
Since your statement is not using placeholders, you have to put the quotes in yourself, as opposed to using the DBI quoting functionality.

How can I handle unicode with Perl's DBI?

My delicious-to-wp Perl script works, but for all "weird" characters it gives even weirder output.
So I tried
$description = decode_utf8( $description );
but that doesn't make a difference. I would like e.g. “go live” to become “go live” and not “go live†. How can I handle Unicode in Perl so that this works?
UPDATE: I found the problem: I had to set the UTF-8 option for DBI in Perl:
my $sql = qq{SET NAMES 'utf8';};
$dbh->do($sql);
That was the part that I had to set, tricky. Thanks!
It's worth noting that if you're running a version of DBD::mysql new enough (3.0008 on), you can do the following: $dbh->{'mysql_enable_utf8'} = 1; and then everything's decode()ed/encode()ed for you on the way out from/in to DBI.
Enable UTF-8 when you connect to the database, like this:
my $dbh = DBI->connect(
"dbi:mysql:dbname=db_name",
"db_user", "db_pass",
{RaiseError => 0, PrintError => 0, mysql_enable_utf8 => 1}
) or die "Connect to database failed.";
This should get you character mode strings with the UTF8 flag set as needed.
From DBI General Interface Rules & Caveats:
Perl supports two kinds of strings: Unicode (utf8 internally) and non-Unicode (defaults to iso-8859-1 if forced to assume an encoding). Drivers should accept both kinds of strings and, if required, convert them to the character set of the database being used. Similarly, when fetching from the database character data that isn't iso-8859-1 the driver should convert it into utf8.
And the specifics from DBD::mysql for mysql_enable_utf8
Additionally, turning on this flag tells MySQL that incoming data should be treated as UTF-8. This will only take effect if used as part of the call to connect(). If you turn the flag on after connecting, you will need to issue the command SET NAMES utf8 to get the same effect.
The statement
$dbh->do(qq{SET NAMES 'utf8';});
definitely saves the day for accessing a UTF-8 declared database, but take notice: if you are going to do any Perl processing of data obtained from the db, it would be wise to store it in a Perl variable as a utf8 string with decode, as this operation is not implicit:
$utfstring = decode('utf8',$string_from_db);
Of course, for proper I/O handling of utf8 strings (reading, printing, writing to output), remember to set
use open ':utf8';
and
binmode STDOUT, ":utf8";
the latter being essential for printing out utf8 strings. Hope this helps.
It may have nothing to do with Perl. Check to make sure you're using UTF encodings in the pertinent MySQL table columns.
Leave this one out:
binmode STDOUT, ":utf8";
when using:
$dbh->do(qq{SET NAMES 'utf8';});
Otherwise your output will have double utf8 encoding, resulting in unreadable double byte characters!
It took me a couple of hours to figure this out.
By default, the Perl/MySQL driver handles binary data (at least, I concluded this from some experiments with MySQL 5.1 and 5.5).
Without setting mysql_enable_utf8, I encoded/decoded the strings to/from UTF-8 before writing/reading to/from the database.
You should not rely on the Perl-internal string representation as an array of bytes; be aware that the internal 'utf8' is not guaranteed to be standard UTF-8 and, conversely, the single-byte encoding is not guaranteed to be ISO-8859-1; really do encode/decode to/from UTF-8 (and not 'utf8').
There are also some MySQL settings regarding the encodings (like SET NAMES above; as far as I remember there is a client encoding, a connection encoding, and a server encoding, whose interactions are not quite clear to me when they do not all have the same value); setting all of them to UTF-8, together with the recipe above, worked for me.
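As a concrete illustration of that recipe, with the driver flag left off and the conversion done explicitly at the boundary (the table and column are made-up examples):
use strict;
use warnings;
use DBI;
use Encode qw(encode decode);

my $dbh = DBI->connect('DBI:mysql:database=mydb', 'user', 'pass',
                       { RaiseError => 1 });    # mysql_enable_utf8 deliberately not set

# writing: convert the Perl character string to UTF-8 octets first
my $title = "\x{201C}go live\x{201D}";          # “go live”
$dbh->do('INSERT INTO posts (title) VALUES (?)', undef, encode('UTF-8', $title));

# reading: convert the octets coming back from the driver into characters
my ($raw) = $dbh->selectrow_array('SELECT title FROM posts ORDER BY id DESC LIMIT 1');
my $text  = decode('UTF-8', $raw);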