Postgresql plperlu and encodings - postgresql

I want to generate a PDF in plperlu, store it in the database and then add it to an email as an attachment.
I am using PDF::Report to generate the PDF. The code looks like this:-
CREATE OR REPLACE FUNCTION workflow.make_pdf(report_template json, report_json json)
RETURNS bytea AS
$BODY$
use strict;
use PDF::Report;
my $pdf = new PDF::ReportNG(PageSize => 'A4', PageOrientation => "Landscape");
...
lots of tricky stuff to make PDF
...
return $pdf->Finish();
$BODY$ LANGUAGE plperlu;
This errors with invalid input syntax for type bytea which I assume is something to do with the encoding of the PDF document created.
The document itself is fine as $pdf->saveAs('/tmp/test.pdf'); creates a document that is perfectly readable.
I tried base64 encoding the result before returning it as the attachment to email will need to be in base64.
return MIME::Base64::encode($pdf->Finish());
This removed the error, I can then store it in a table with:-
INSERT INTO weekly_report_pdfs(report)
VALUES (make_pdf(report_template,report_json));
This also works fine, and the result can be attached to an email, but the file is corrupt after the base64 decode.
Exporting the file directly from the database and running base64 -d test.b64 gives an invalid input error after only one line.
This appears to be something to do with the way postgres outputs bytea, as the file looks like this:-
MDA0OTY1MSAwMDAwMCBuIAowMDAwMDQ5ODU0IDAwMDAwIG4gCjAwMDAwNTAxNjcgMDAwMDAgbiAK\\012dHJhaWxlcgo8PCAvUm9vdCAxIDAgUiAvU2l6ZSA0MCAvSW5mbyA0IDAgUiA+PgpzdGFydHhyZWYK\\012
With lots of \012 separators.
Any ideas? I've been at this for hours and am completely stumped.

Returning binary from plperl
When declared as returning bytea, a PL/Perl function must actually return the PostgreSQL text representation of the bytea value.
Consider this excerpt from PL/Perl Functions and Arguments in the doc (and more specifically the last sentence):
Anything in a function argument that is not a reference is a string,
which is in the standard PostgreSQL external text representation for
the relevant data type. In the case of ordinary numeric or text types,
Perl will just do the right thing and the programmer will normally not
have to worry about it. However, in other cases the argument will need
to be converted into a form that is more usable in Perl. For example,
the decode_bytea function can be used to convert an argument of type
bytea into unescaped binary.
Similarly, values passed back to PostgreSQL must be in the external
text representation format. For example, the encode_bytea function can
be used to escape binary data for a return value of type bytea.
According to this, you should do:
return encode_bytea($pdf->Finish());
and then the invalid input syntax for type bytea error would go away.
Returning base64
If returning base64, the function should normally be declared as RETURNS text and the report column should be text too. That would solve the problem of the \012 sequences, which are line feeds (ASCII code 10, written 012 in octal) as rendered in a bytea string literal in postgres escape format.
The line feeds are typically added by base64 encoders every 76 characters to avoid long lines in a MIME body (RFC 2045).
If the report column stays in bytea and the function produces base64 with RETURNS text, the value has to be converted on INSERT, since PostgreSQL has no implicit cast from text to bytea. The conversion can be made explicitly with convert_to(bytea_plperl_func(), 'US-ASCII').
But storing base64 in a bytea column doesn't make much sense.
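To illustrate the line-feed point outside the database, here is a minimal Python sketch. Python is used only for illustration, the function names are hypothetical, and the sample bytes are a stand-in for the PDF. It shows that a MIME-style encoder wraps its output at 76 characters, and that a dump in which those line feeds appear as literal \012 text has to be repaired before decoding:

```python
import base64


def encode_mime_style(data: bytes) -> bytes:
    """Base64-encode with a line feed after every 76 output characters."""
    return base64.encodebytes(data)


def repair_escaped_dump(dump: bytes) -> bytes:
    """Decode base64 whose line feeds were written out as the text \\012.

    The digits 0, 1 and 2 are valid base64 characters, so the literal
    sequence must be turned back into a line feed before decoding.
    """
    return base64.b64decode(dump.replace(b"\\012", b"\n"))


# Stand-in for the PDF bytes (an assumption for this sketch).
sample = bytes(range(256)) * 2

wrapped = encode_mime_style(sample)      # contains line feeds
single_line = base64.b64encode(sample)   # no line feeds at all

# Simulate what the exported file looked like: line feeds shown as \012.
escaped = wrapped.replace(b"\n", b"\\012")
restored = repair_escaped_dump(escaped)
```

With a text column and a single-line encoder, no repair step should be needed in the first place.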

Related

Base64 SQL Default Value, Error Validating

I'm trying to insert a base64 image code as the default value for one of my columns. I've tried encapsulating my base64 with single quotes but in SSMS it doesn't show it as a string and am getting errors when trying to save the table.
-Error validating the default for column 'mycol'
-Error modifying column properties, Unclosed quotation mark after the character string : then shows the entire base64 code.
I also tried the following way: (N'(base64here)') but that also throws the same error.
I did a search/replace in notepad for any other single quotes in my string but there are none.
Not sure what's wrong here. Could it be that the string is too long for a varchar(MAX) field? It's 223210 characters long. I'm using the SSMS GUI, not T-SQL, to enter the default value.
(N'(\=)')
The base64 string above is truncated because stackoverflow has character limit btw.
I was able to add it using TSQL
ALTER TABLE MYTBL
ADD CONSTRAINT MYCOL
DEFAULT 'base64here' FOR MYCOL;

Convert text to image data type

I have a table that for some reason stores text as IMAGE. I can grab the data and read it using
SELECT CONVERT(NVARCHAR(MAX), CONVERT(VARBINARY(MAX), column,2)) FROM table
Now I need to insert data back in to the table. I've tried
SELECT CONVERT(IMAGE, CAST('TEST TEXT' AS VARBINARY(MAX)))
But when I test converting it back using
SELECT CONVERT(NVARCHAR(MAX), CONVERT(VARBINARY(MAX), CONVERT(IMAGE, CAST('TEST TEXT' AS VARBINARY(MAX))),2))
It returns 䕔呓吠塅 which is obviously not right as it should return "TEST TEXT"
What am I doing wrong here?
The text you're trying to store is encoded as binary ASCII characters. You're trying to convert it back into a Unicode text string, which isn't what it originally was, therefore you're getting back garbled text.
Change your source text string into a Unicode string by adding N in front of it:
SELECT CONVERT(NVARCHAR(MAX), CONVERT(VARBINARY(MAX), CONVERT(IMAGE, CAST(N'TEST TEXT' AS VARBINARY(MAX))),2))
It should return the correct text. Tested this on SQL Server 2008
You can use this one:
SELECT CONVERT(VARCHAR(MAX), CONVERT(VARBINARY(MAX), CAST('TEST TEXT' AS IMAGE),0))
Basically, you were not consistent with your character type conversions. In some parts you used NVARCHAR and in others VARCHAR. Also, the number 2 at the end is affecting the result. In your CONVERT statements, when you don't specify the style code, the default value (0) is used. So if you are converting it back, you should use the same code.
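The mismatch can be reproduced outside SQL Server. The following Python sketch is only an illustration, assuming that VARCHAR holds one byte per ASCII character and NVARCHAR holds UTF-16LE code units; the variable names are hypothetical. The odd trailing byte is dropped so the bytes pair up, which is an assumption made to mirror the four-character output in the question:

```python
text = "TEST TEXT"

# Roughly what CAST('TEST TEXT' AS VARBINARY(MAX)) stores: one byte per character.
ascii_bytes = text.encode("ascii")

# Roughly what CAST(N'TEST TEXT' AS VARBINARY(MAX)) stores: two bytes per character.
utf16_bytes = text.encode("utf-16-le")

# Reading the single-byte data back as NVARCHAR pairs the bytes up,
# so every two ASCII characters become one unrelated code point.
even_length = ascii_bytes[: len(ascii_bytes) // 2 * 2]
garbled = even_length.decode("utf-16-le")

# Consistent encodings on both sides round-trip cleanly.
unicode_round_trip = utf16_bytes.decode("utf-16-le")
ascii_round_trip = ascii_bytes.decode("ascii")
```

Either fix in the answers above amounts to using the same encoding in both directions.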

Why does PostgreSQL store data in hex in its own format?

I can't understand the reason why PostgreSQL stores data in its own format:
The "hex" format encodes binary data as 2 hexadecimal digits per byte, most significant nibble first. The entire string is preceded by the sequence \x (to distinguish it from the escape format).
Does this mean that it is not plain hex, that I cannot simply convert this hex to a byte type, and that I have to write a parser for the PostgreSQL hex format?
The client driver usually takes care of bytea conversion for you, supplying you a native language data type like byte[] for Java. The representation of bytea on the wire shouldn't generally concern you. The only time it'll really matter is if you're using bytea literals in SQL text, rather than sending them as bind parameters.
Anyway, it is normal hex, it just has a \x prefix. So it's utterly trivial to "parse" if you do need to do so manually. E.g. in Python 3:
bytes.fromhex(r'\x736f6d65737472696e67'[2:])
The reason for the \x prefix is largely historical. PostgreSQL used to use an octal escape format for bytea data. When the format was changed to hex - to make it easier for clients to consume and work with and make it a bit more compact - it was necessary for the client to be able to tell what format the data was in. Since \x can never appear in octal ("escape") format literals, any string beginning with \x must be a hex bytea literal. This is even more important when receiving data from a client, which might be sending either hex or escape style literals, and the server must be able to tell which is which.
We could've just required that all clients use the format specified by the server. But that would break compatibility for all old clients that use bytea. Personally I think that's exactly what we should've done, and required that people using old clients set bytea_format = escape or something. That's not what happened, though. The setting bytea_output controls the format the server sends, but it still understands both formats as input. That makes interoperating with old clients and scripts easier. In theory.
In practice lots of old clients blindly interpreted hex literals sent by the server as if they were escape-format even though they were invalid; they'd ignore the backslash or treat it as a literal backslash. So they'd tend to corrupt bytea data when loading it then saving it again. Exactly what we wanted to avoid.
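For a client that does have to handle the text form itself, telling the two formats apart could look like the sketch below. This is an illustration only: parse_bytea_text is a hypothetical helper, not part of any driver, and it assumes escape-format output in which a backslash is doubled and other non-printable bytes are written as three octal digits.

```python
import re

# Matches the two escape-format tokens: a doubled backslash or \nnn in octal.
_ESCAPE_TOKEN = re.compile(rb"\\\\|\\[0-3][0-7]{2}")


def _unescape(match):
    token = match.group(0)
    if token == b"\\\\":
        return b"\\"
    return bytes([int(token[1:], 8)])


def parse_bytea_text(value: str) -> bytes:
    """Convert the text form of a bytea value into raw bytes."""
    if value.startswith("\\x"):
        # Hex format: plain hex digits after the \x prefix.
        return bytes.fromhex(value[2:])
    # Escape format: printable ASCII is literal, the rest is escaped.
    return _ESCAPE_TOKEN.sub(_unescape, value.encode("ascii"))


hex_result = parse_bytea_text(r"\x736f6d65737472696e67")
escape_result = parse_bytea_text(r"abc\012def")
```

Checking for the \x prefix first is what keeps a hex literal from being misread as escape format.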

Non UTF8 chars in function parameters

I have a badly behaved client application that is sending non UTF8 characters in a sql string to postgres. The string is in the form
"select call_function('paramvalue_1', 'paramvalue_2')"
the function looking like:
create or replace function call_function(parameter_1 text, parameter_2 text)
returns...
Unfortunately the client occasionally sends dodgy characters in the function parameters. It's going to be tricky to clean up the client app and I was wondering if there is any way for a function in a UTF-8 database to accept non-UTF8 characters and strip them from the param values before they get inserted anywhere.
I'm pretty convinced the answer is no but I thought it was worth asking here.

Parsing COPY...WITH BINARY results

I'm using this:
COPY( select field1, field2, field3 from table ) TO 'C://Program Files/PostgreSql//8.4//data//output.dat' WITH BINARY
To export some fields to a file, one of them is a ByteA field. Now, I need to read the file with a custom made program.
How can I parse this file?
The general format of a file generated by COPY...BINARY is explained in the documentation, and it's non-trivial.
bytea contents are the easiest to deal with, since they're not encoded.
Every other data type has its own encoding rules, which are not described in the documentation but in the source code. From the doc:
To determine the appropriate binary format for the actual tuple data
you should consult the PostgreSQL source, in particular the *send and
*recv functions for each column's data type (typically these functions are found in the src/backend/utils/adt/ directory of the source
distribution).
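As a rough illustration of the overall layout described in the documentation (an 11-byte signature, a flags field, a header extension, then tuples of length-prefixed fields and a trailer), a reader might be sketched in Python as follows. read_copy_binary is a hypothetical name, the sketch assumes the file was written without OIDs, and it returns every field as raw bytes without decoding any data type:

```python
import io
import struct

# Fixed signature at the start of every binary COPY file.
SIGNATURE = b"PGCOPY\n\xff\r\n\x00"


def read_copy_binary(stream):
    """Yield each tuple as a list of raw field values (None for NULL)."""
    if stream.read(len(SIGNATURE)) != SIGNATURE:
        raise ValueError("not a PostgreSQL binary COPY file")
    # All integers in the file are in network byte order.
    flags, extension_length = struct.unpack("!II", stream.read(8))
    if flags & (1 << 16):
        raise ValueError("files with OIDs are not handled by this sketch")
    stream.read(extension_length)  # skip the header extension area
    while True:
        (field_count,) = struct.unpack("!h", stream.read(2))
        if field_count == -1:  # file trailer
            return
        row = []
        for _ in range(field_count):
            (length,) = struct.unpack("!i", stream.read(4))
            # A length of -1 marks NULL and is followed by no data bytes.
            row.append(None if length == -1 else stream.read(length))
        yield row


# Hand-built sample: one tuple with an int4, a NULL and a 3-byte bytea.
sample = (
    SIGNATURE
    + struct.pack("!II", 0, 0)
    + struct.pack("!h", 3)
    + struct.pack("!i", 4) + struct.pack("!i", 42)
    + struct.pack("!i", -1)
    + struct.pack("!i", 3) + b"\x00\x01\x02"
    + struct.pack("!h", -1)
)
rows = list(read_copy_binary(io.BytesIO(sample)))
```

The bytea field comes back ready to use, while the int4 field still needs its own decoding step, which is where the per-type rules from the source code come in.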
It might be easier to use the text format rather than binary (so just remove the WITH BINARY). The text format has better documentation and is designed for better interoperability. The binary format is more intended for moving between postgres installations, and even there they have version incompatibilities.
Text format will write the bytea field as if it was text, and encode any non-printable characters with \nnn octal representation (except for a few special cases that it encodes with C style \x patterns, such as \n and \t etc.) These are listed in the COPY documentation.
The only caveat with this is that you need to be absolutely sure that the character encoding you're using is the same when saving the file as when reading it, so that the printable characters map to the same numbers. I'd stick to SQL_ASCII as it keeps things simpler.