Why is BYTEA treated as TEXT format? - postgresql

I'm debugging a streaming problem between an app and my database.
After turning my client logs up to the finest level, I found that Postgres sends the column description metadata with the TEXT_FORMAT byte (the T flag) for the BYTEA type (see here for the source of the class that generates the log):
02/08/2022 15:32:54.907 [TRACE] o.p.core.v3.QueryExecutorImpl - Field(fichier2_18_0_,BYTEA,65535,T)
I need a binary format from Postgres. If BYTEA is not considered a binary type, which one can I try to use?

Related

SQLAlchemy/MariaDB - incorrectly using Windows-1252 and UTF-8 encodings

I am currently trying to migrate an Oracle database to MariaDB. I have a CHAR column in the Oracle database, let's call it my_col, which I believe is encoded using latin-1.
I have fetched the raw bytes of the column, and decoded them successfully using Python, with:
my_col.decode('latin-1')
No error was thrown, which leads me to believe the column is, in fact, encoded with latin-1.
Now, even though the column on the MariaDB table has the latin1 charset, I found that SQLAlchemy was trying to insert UTF-8 encoded strings into it; so I specified ?charset=latin1 on my MariaDB connection string.
STILL, whenever I try inserting the decoded strings into MariaDB, I get the following error:
UnicodeDecodeError: 'charmap' codec can't encode character \x96 ...
and at the end of my error trace:
File "usr/lib64/python3.9/encodings/cp1252.py", line 12, in encode
This raises two questions:
Why is Python trying to use the cp-1252 encoding instead of latin-1, as specified?
How can I tell SQLAlchemy to use latin-1 when fetching the strings from Oracle? I would rather not have to fetch the bytes and decode them myself.
Thanks.
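The error in the question is consistent with a difference between the two codecs that can be reproduced in plain Python. The sketch below uses a made-up byte string (not data from the question). It shows that decoding with latin-1 never fails, so a successful decode does not prove the source encoding, and that the resulting U+0096 character cannot be encoded with cp1252. MariaDB documents its latin1 charset as cp1252-based; that the driver in use maps it to Python's cp1252 codec is an assumption here.

```python
# Hypothetical sample bytes: 0x96 is an en dash in cp1252,
# but a C1 control character (U+0096) in ISO 8859-1 / latin-1.
raw = b"2010 \x96 2012"

# latin-1 maps every byte 0x00-0xFF to a code point, so this never raises.
all_bytes_decode = bytes(range(256)).decode("latin-1")
as_latin1 = raw.decode("latin-1")   # contains U+0096
as_cp1252 = raw.decode("cp1252")    # contains U+2013 (en dash)

# Encoding the latin-1 result with cp1252 fails, because cp1252
# has no byte for the code point U+0096.
try:
    as_latin1.encode("cp1252")
    cp1252_error = None
except UnicodeEncodeError as exc:
    cp1252_error = exc

print(repr(as_latin1), repr(as_cp1252), cp1252_error)
```

If the Oracle data is really cp1252, decoding the raw bytes with "cp1252" instead of "latin-1" produces strings that round-trip through a cp1252 connection.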

Dealing with parsing oids in Postgres

I'm currently improving a client library for PostgreSQL. The library already has a working implementation of the communication protocol, including DataRow and RowDescription.
The problem I'm facing right now is how to deal with values.
Returning a plain string for an array of integers, for example, is kind of pointless.
From my research, some other libraries (such as those for Python) either return it as an unmodified string or convert primitive types, including arrays.
By conversion I mean turning the raw Postgres DataRow data into a Python-typed value: a Postgres integer is parsed as a Python number, Postgres booleans as Python booleans, etc.
Should I make a second query to get the column type information and use its converters, or should I leave it plain?
You could opt to get the array values in the internal format by setting the corresponding "result-column format code" in the Bind message to 1, but that is typically a bad choice, since the internal format varies from type to type and may even depend on the server's architecture.
So your best option is probably to parse the string representation of the array on the client side, including all the escape characters.
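As a minimal sketch of client-side parsing, the function below handles only the simplest case: the text form of a one-dimensional integer array such as {1,2,NULL}. The function name is hypothetical, and quoted elements, escape characters and nested arrays are deliberately not covered, so a real client needs a fuller parser.

```python
def parse_int_array(text):
    """Parse the text form of a one-dimensional integer array, e.g. '{1,2,NULL}'.

    Simplified sketch: no quoted elements, no escapes, no nested arrays.
    """
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError(f"not an array literal: {text!r}")
    body = text[1:-1]
    if body == "":
        return []  # '{}' is the empty array
    # An unquoted NULL element denotes SQL NULL.
    return [None if item == "NULL" else int(item) for item in body.split(",")]


print(parse_int_array("{1,2,NULL,40}"))
```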
When it comes to finding the base type for an array type, there is no other option than querying pg_type, like this:
SELECT typelem::regtype FROM pg_type WHERE oid = 1007;
typelem
---------
integer
(1 row)
You could cache these values on the client side so that you don't have to query more than once per type and database session.
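The caching idea could look roughly like the sketch below. The class and the lookup callable are hypothetical: in a real client the lookup would run the pg_type query above on the current connection, and here it is replaced by a stand-in so the example runs without a database.

```python
class ElementTypeCache:
    """Remember the element type of each array type OID for one session."""

    def __init__(self, lookup):
        self._lookup = lookup   # callable: array type OID -> element type name
        self._cache = {}

    def element_type(self, array_oid):
        # Only call the (expensive) lookup the first time an OID is seen.
        if array_oid not in self._cache:
            self._cache[array_oid] = self._lookup(array_oid)
        return self._cache[array_oid]


# Stand-in for the real query; records how often it is called.
calls = []

def fake_lookup(oid):
    calls.append(oid)
    return {1007: "integer"}[oid]

cache = ElementTypeCache(fake_lookup)
first = cache.element_type(1007)
second = cache.element_type(1007)   # served from the cache
print(first, second, len(calls))
```

The cache should be dropped together with the session, since OIDs of user-defined types can differ between databases.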

PostgreSQL - Converting Binary data to Varchar

We are migrating databases from MSSQL to PostgreSQL. During this process we came across a table containing a password field of type NVARCHAR, whose value was converted from VARBINARY and stored as NVARCHAR.
For example: if I execute
SELECT HASHBYTES('SHA1','Password')
then it returns 0x8BE3C943B1609FFFBFC51AAD666D0A04ADF83C9D, and if this value is in turn converted to NVARCHAR, it returns text of the form "䏉悱゚얿괚浦Њ鴼"
Since PostgreSQL doesn't support VARBINARY, we used BYTEA instead, which returns binary data. But when we try to convert this binary data to VARCHAR, it returns the hex format.
For example: if the same statement is executed in PostgreSQL
SELECT ENCODE(DIGEST('Password','SHA1'),'hex')
then it returns
8be3c943b1609fffbfc51aad666d0a04adf83c9d.
When we try to convert this encoded text to VARCHAR, it returns the same result, 8be3c943b1609fffbfc51aad666d0a04adf83c9d.
Is it possible to get the same result that we retrieved from MSSQL Server? Because these are password fields, we do not intend to change the values. Please suggest what needs to be done.
It sounds like you're taking a byte array containing a cryptographic hash and you want to convert it to a string to do a string comparison. This is a strange way to do hash comparisons but it might be possible depending on which encoding you were using on the MSSQL side.
If you have a byte array that can be converted to string in the encoding you're using (e.g. doesn't contain any invalid code points or sequences for that encoding) you can convert the byte array to string as follows:
SELECT CONVERT_FROM(DIGEST('Password','SHA1'), 'latin1') AS hash_string;
hash_string
-----------------------------
\u008BãÉC±`\u009Fÿ¿Å\x1A­fm+
\x04­ø<\u009D
If you're using a Unicode encoding such as UTF-8, this approach won't work at all, because arbitrary byte arrays can't be decoded: certain byte sequences are always invalid. You'll get an error like the following:
# SELECT CONVERT_FROM(DIGEST('Password','SHA1'), 'utf-8');
ERROR: invalid byte sequence for encoding "UTF8": 0x8b
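The same contrast can be reproduced on the client side. This is only a rough Python analogue of the two CONVERT_FROM calls, using the hash bytes quoted in the question. It assumes that Python's latin-1 and utf-8 codecs behave like the corresponding PostgreSQL encodings for this input.

```python
# The SHA-1 digest quoted in the question, as raw bytes.
digest = bytes.fromhex("8BE3C943B1609FFFBFC51AAD666D0A04ADF83C9D")

# latin-1 assigns a character to every byte value, so decoding always succeeds.
hash_string = digest.decode("latin-1")

# UTF-8 rejects the input: the first byte, 0x8b, is not a valid start byte.
try:
    digest.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

print(len(hash_string), utf8_error)
```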
Here's a list of valid string encodings in PostgreSQL. Find out which encoding you're using on the MSSQL side and try to match it to PostgreSQL. If you can, I'd recommend changing your business logic to compare byte arrays directly, since this will be less error-prone and should be significantly faster.
From the question: then it returns 0x8BE3C943B1609FFFBFC51AAD666D0A04ADF83C9D and in turn if this value is converted into NVARCHAR then it is returning a text in the format "䏉悱゚얿괚浦Њ鴼"
Based on that, MSSQL interprets these bytes as a text encoded in UTF-16LE.
With PostgreSQL and using only built-in functions, you cannot obtain that result because PostgreSQL doesn't use or support UTF-16 at all, for anything.
It also doesn't support nul bytes in strings, and there are nul bytes in UTF-16.
This Q/A: UTF16 hex to text suggests several solutions.
Changing your business logic not to depend on UTF-16 would be your best long-term option, though. The hexadecimal representation, for instance, is simpler and much more portable.
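The UTF-16LE interpretation can be checked outside the database. The sketch below decodes the hash bytes quoted in the question in Python, which is only an illustration of the byte-to-character mapping and not a way to do this inside PostgreSQL.

```python
# The SHA-1 digest quoted in the question, as raw bytes.
digest = bytes.fromhex("8BE3C943B1609FFFBFC51AAD666D0A04ADF83C9D")

# UTF-16LE reads the bytes in pairs, low byte first:
# 8B E3 -> U+E38B, C9 43 -> U+43C9, and so on.
text = digest.decode("utf-16-le")
code_points = [f"U+{ord(ch):04X}" for ch in text]

print(text)
print(code_points)

# Encoding the text again gives back the original 20 bytes.
round_trip = text.encode("utf-16-le")
```

Some of the resulting code points fall in the Private Use Area, so how the string is displayed depends on the client and its fonts.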

BigQuery - create table via UI from cloud storage results in integer error

I am trying to test out BigQuery but am getting stuck on creating a table from data stored in google cloud storage. I am able to reduce the data down to just one value, but it is not making sense.
I have a text file I uploaded to google cloud storage with just one integer value in it, 177790884
I am trying to create a table via the BigQuery web UI, and go through the wizard. When I get to the schema definition section, I enter...
ID:INTEGER
The load always fails with...
Errors:
File: 0 / Line:1 / Field:1: Invalid argument: 177790884 (error code: invalid)
Too many errors encountered. Limit is: 0. (error code: invalid)
Job ID trusty-hangar-120519:job_LREZ5lA8QNdGoG2usU4Q1jeMvvU
Start Time Jan 30, 2016, 12:43:31 AM
End Time Jan 30, 2016, 12:43:34 AM
Destination Table trusty-hangar-120519:.onevalue
Source Format CSV
Allow Jagged Rows true
Ignore Unknown Values true
Source URI gs:///onevalue.txt
Schema
ID: INTEGER
If I load with a schema of ID:STRING it works fine. The number 177790884 is not larger than a 64-bit signed int, so I am really unsure what is going on.
Thanks,
Craig
Your input file likely contains a UTF-8 byte order mark (3 "invisible" bytes at the beginning of the file that indicate the encoding) that can cause BigQuery's CSV parser to fail.
https://en.wikipedia.org/wiki/Byte_order_mark
I'd suggest Googling for a platform-specific method to view and remove the byte order mark. (A hex editor would do.)
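If Python is available, the byte order mark can also be detected and removed with a few lines. This is a minimal sketch. The function name is hypothetical, and the sample bytes reuse the single value from the question with a BOM added for illustration.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark (EF BB BF)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Illustrative file content: the three BOM bytes followed by the integer.
with_bom = codecs.BOM_UTF8 + b"177790884\n"
cleaned = strip_utf8_bom(with_bom)
print(with_bom[:3].hex(), cleaned)
```

When reading text rather than bytes, opening the file with encoding="utf-8-sig" skips the mark in the same way.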
The issue is definitely with the file's encoding. I was able to reproduce the error.
I then "fixed" it by saving the "problematic" file as ANSI (just as a test), and it loaded successfully.

deserialize cassandra row key

I'm trying to use the sstablekeys utility for Cassandra to retrieve all the keys currently in the sstables for a Cassandra cluster. When I run sstablekeys for a single sstable, the keys come back in what appears to be a serialized format. Does anyone know how to deserialize the keys or get them back into their original format? They were inserted into Cassandra using astyanax, where the serialized type is a tuple in Scala. The key_validation_class for the column family is a CompositeType.
Thanks to the comment by Thomas Stets, I was able to figure out that the keys are actually just converted to hex and printed out. See here for a way of getting them back to their original format.
For the specific problem of figuring out the format of a CompositeType row key and unhexifying it, see the Cassandra source which describes the format of a CompositeType key that gets output by sstablekeys. With CompositeType(UTF8Type, UTF8Type, Int32Type), the UTF8Type treats bytes as ASCII characters (so the function in the link above works in this case), but with Int32Type, you must interpret the bytes as one number.
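A rough sketch of the decoding for CompositeType(UTF8Type, UTF8Type, Int32Type) is shown below. It assumes the layout described in the Cassandra source comments: each component is a 2-byte big-endian length, the value bytes, and one end-of-component byte. The function name, the type labels and the sample key are made up for illustration.

```python
import struct


def decode_composite_key(hex_key, types):
    """Decode a hex-printed CompositeType key into a list of Python values.

    types is a sequence of "utf8" or "int32" labels (hypothetical names),
    one per component, in key order.
    """
    raw = bytes.fromhex(hex_key)
    pos = 0
    parts = []
    for kind in types:
        (length,) = struct.unpack_from(">H", raw, pos)  # 2-byte length
        pos += 2
        value = raw[pos:pos + length]
        pos += length + 1  # skip the value and the end-of-component byte
        if kind == "utf8":
            parts.append(value.decode("utf-8"))
        elif kind == "int32":
            # Int32Type: the four bytes form one big-endian signed number.
            parts.append(struct.unpack(">i", value)[0])
        else:
            raise ValueError(f"unsupported component type: {kind}")
    return parts


# Made-up key holding the components "ab", "c" and 7.
sample = "00026162000001630000040000000700"
decoded = decode_composite_key(sample, ["utf8", "utf8", "int32"])
print(decoded)
```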