I'm currently improving a client library for PostgreSQL. The library already implements the communication protocol, including the DataRow and RowDescription messages.
The problem I'm facing right now is how to deal with values.
Returning a plain string for an array of integers, for example, is kind of pointless.
From my research, other libraries (for Python, for example) either return the value as an unmodified string or convert primitive types, including arrays.
By conversion I mean turning the raw data of a Postgres DataRow into a value of a Python type: a Postgres integer is parsed as a Python number, a Postgres boolean as a Python boolean, etc.
Should I run a second query to get the column type information and use its converters, or should I leave the values as plain strings?
You could opt to get the array values in the internal format by setting the corresponding "result-column format code" in the Bind message to 1, but that is typically a bad choice, since the internal format varies from type to type and may even depend on the server's architecture.
So your best option is probably to parse the string representation of the array on the client side, including all the escape characters.
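As a rough illustration of that client-side parsing, here is a minimal Python sketch. It assumes the input is the text form the server itself emits (for example `{1,2,NULL}`), with a comma as the delimiter and no dimension prefix such as `[1:3]=`. The names `parse_pg_array` and `convert` are made up for this example and do not belong to any existing library.

```python
def parse_pg_array(text, convert=str, delim=","):
    """Parse server-emitted array text such as '{1,2,NULL}' into nested lists.

    convert is applied to every non-NULL element (e.g. int for integer[]).
    """
    pos = 0

    def parse_quoted():
        nonlocal pos
        pos += 1                      # skip the opening double quote
        chars = []
        while text[pos] != '"':
            if text[pos] == "\\":     # a backslash escapes the next character
                pos += 1
            chars.append(text[pos])
            pos += 1
        pos += 1                      # skip the closing double quote
        return "".join(chars)

    def parse_unquoted():
        nonlocal pos
        start = pos
        while text[pos] not in (delim, "}"):
            pos += 1
        token = text[start:pos]
        # Only an unquoted NULL means SQL NULL; a quoted "NULL" is a string.
        return None if token.upper() == "NULL" else convert(token)

    def parse_array():
        nonlocal pos
        if text[pos] != "{":
            raise ValueError(f"expected '{{' at offset {pos}")
        pos += 1
        items = []
        if text[pos] == "}":          # empty array: '{}'
            pos += 1
            return items
        while True:
            if text[pos] == "{":      # nested dimension
                items.append(parse_array())
            elif text[pos] == '"':
                items.append(convert(parse_quoted()))
            else:
                items.append(parse_unquoted())
            if text[pos] == delim:
                pos += 1
            elif text[pos] == "}":
                pos += 1
                return items
            else:
                raise ValueError(f"unexpected character at offset {pos}")

    result = parse_array()
    if pos != len(text):
        raise ValueError("trailing data after array")
    return result


print(parse_pg_array("{1,2,3}", int))            # [1, 2, 3]
print(parse_pg_array("{{1,2},{3,NULL}}", int))   # [[1, 2], [3, None]]
```

Truncated input ends in an IndexError here; a real client would probably want to turn that into its own protocol error.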
When it comes to finding the base type for an array type, there is no other option than querying pg_type, like this:
SELECT typelem::regtype FROM pg_type WHERE oid = 1007;
typelem
---------
integer
(1 row)
You could cache these values on the client side so that you don't have to query more than once per type and database session.
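The caching idea could look roughly like the following Python sketch. `ElementTypeCache` and `run_query` are hypothetical names: `run_query` stands for whatever function your client uses to send a parameterized query and get back a list of row tuples, and the `fake_run_query` below only simulates a server so the example runs on its own.

```python
class ElementTypeCache:
    """Remember array type OID -> element type name for one session."""

    def __init__(self, run_query):
        # run_query is an assumed callable: (sql, params) -> list of row tuples
        self._run_query = run_query
        self._cache = {}

    def element_type(self, array_oid):
        if array_oid not in self._cache:
            rows = self._run_query(
                "SELECT typelem::regtype FROM pg_type WHERE oid = $1",
                (array_oid,),
            )
            # Unknown OIDs are cached as None so they are not looked up again.
            self._cache[array_oid] = rows[0][0] if rows else None
        return self._cache[array_oid]


# Stand-in for a real connection, used only to show the caching behaviour.
calls = []

def fake_run_query(sql, params):
    calls.append(params)
    return [("integer",)]

cache = ElementTypeCache(fake_run_query)
first = cache.element_type(1007)
second = cache.element_type(1007)   # served from the cache, no second query
```

The cache should live no longer than the connection, since type OIDs of user-defined types can differ between databases.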
I have a table with about 200 million rows and multiple columns of data type DECIMAL(p,s) with varying precisions and scales.
Now, as far as I understand, DECIMAL(p,s) is a fixed-size column, with a size depending on the precision, see:
https://learn.microsoft.com/en-us/sql/t-sql/data-types/decimal-and-numeric-transact-sql?view=sql-server-ver16
Now, when altering the table and changing a column from DECIMAL(15,2) to DECIMAL(19,6), for example, I would have expected almost no work to be done on the SQL Server side, as the number of bytes required to store the value is the same. Yet the alter itself takes a long time. So what exactly does the server do when I execute the ALTER statement?
Also, is there any benefit (other than having constraints on a column) to storing a DECIMAL(15,2) instead of, for example, a DECIMAL(19,2)? It seems to me the storage requirements would be the same, but I could store larger values in the latter.
Thanks in advance!
The precision and scale of a decimal / numeric type matters considerably.
As far as SQL Server is concerned, decimal(15,2) is a different data type to decimal(19,6), and is stored differently. You therefore cannot make the assumption that just because the overall storage requirements do not change, nothing else does.
SQL Server stores decimal data types in byte-reversed (little-endian) format, with the scale being the first incrementing value. Changing the definition can therefore require the data to be rewritten: SQL Server will use an internal worktable to safely convert the data and update the values on every page.
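To make the distinction between "same size" and "same stored value" concrete, here is a small Python sketch. The byte sizes follow the precision table in the Microsoft documentation linked in the question. The `scaled_integer` helper is a simplified model that assumes the value is kept as an integer scaled by 10^scale; it is not a description of the exact on-page layout.

```python
from decimal import Decimal


def decimal_storage_bytes(precision):
    """Documented storage size of a SQL Server decimal/numeric, by precision."""
    if not 1 <= precision <= 38:
        raise ValueError("precision must be between 1 and 38")
    if precision <= 9:
        return 5
    if precision <= 19:
        return 9
    if precision <= 28:
        return 13
    return 17


def scaled_integer(value, scale):
    """Simplified model (assumption): the value as an integer scaled by 10**scale."""
    return int(Decimal(value).scaleb(scale))


# Both definitions occupy the same number of bytes ...
print(decimal_storage_bytes(15), decimal_storage_bytes(19))
# ... but under this model the same value maps to different stored integers.
print(scaled_integer("1234.56", 2), scaled_integer("1234.56", 6))
```

Under this model, changing the scale from 2 to 6 changes the integer kept for every row, which is consistent with the server having to touch every page even though the column width stays at 9 bytes.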
How to avoid the unnecessary CPU cost?
See this historic question with failure tests. Example: j->'x' is a JSONb value representing a number and j->'y' one representing a boolean. From the first versions of JSONb (released in 2014 with 9.4) until today (6 years!), with PostgreSQL v12, it seems that we need to enforce a double conversion:
Discard the "binary JSONb number" information of j->'x' and transform it into the printable string j->>'x'; discard the "binary JSONb boolean" information of j->'y' and transform it into the printable string j->>'y'.
Parse the string to obtain a "binary SQL float" by casting it, (j->>'x')::float AS x; parse the string to obtain a "binary SQL boolean" by casting it, (j->>'y')::boolean AS y.
Is there no syntax or optimized function that lets a programmer enforce the direct conversion?
I don't see one in the guide. Or was it never implemented: is there a technical barrier to it?
NOTES about typical scenario where we need it
(responding to comments)
Imagine a scenario where your system needs to store a great many small datasets (real example!) with minimal disk usage, managing them all with centralized control, metadata, etc. JSONb is a good solution, and offers at least 2 good alternatives for storing them in the database:
Metadata (with schema descriptor) and the whole dataset in an array of arrays;
Separating metadata and table rows into two tables.
(and variations where the metadata is translated to a cache of text[], etc.) Alternative 1, monolithic, is the best for the "minimal disk usage" requirement, and faster for full information retrieval. Alternative 2 can be the choice for random access or partial retrieval, when the table Alt2_DatasetLine also has one more column, like time, for time series.
You can create all the SQL views in a separate schema, for example:
CREATE VIEW mydatasets.t1234 AS
SELECT (j->>'d')::date AS d, j->>'t' AS t, (j->>'b')::boolean AS b,
(j->>'i')::int AS i, (j->>'f')::float AS f
FROM (
select jsonb_array_elements(j_alldata) j FROM Alt1_AllDataset
where dataset_id=1234
) t
-- or FROM alt2...
;
And the CREATE VIEW statements can all be generated automatically, running the SQL string dynamically ... we can reproduce the above "stable schema casting" with simple formatting rules, extracted from metadata:
SELECT string_agg( CASE
WHEN x[2]!='text' THEN format(E'(j->>\'%s\')::%s AS %s',x[1],x[2],x[1])
ELSE format(E'j->>\'%s\' AS %s',x[1],x[1])
END, ',' ) as x2
FROM (
SELECT regexp_split_to_array(trim(x),'\s+') x
FROM regexp_split_to_table('d date, t text, b boolean, i int, f float', ',') t1(x)
) t2;
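For readers who would rather generate the view definitions from application code, the same formatting rule can be sketched in Python. `build_select_list` is a hypothetical helper that mirrors the query above; like the SQL version, it assumes single-word type names and trusted metadata, and it does no quoting of identifiers.

```python
def build_select_list(schema):
    """Turn 'name type, name type, ...' into a list of casts over j->>'name'."""
    parts = []
    for column_def in schema.split(","):
        # Only the first two whitespace-separated tokens are used: name and type.
        name, type_name = column_def.split()[:2]
        if type_name != "text":
            parts.append(f"(j->>'{name}')::{type_name} AS {name}")
        else:
            # ->> already returns text, so no cast is needed.
            parts.append(f"j->>'{name}' AS {name}")
    return ",".join(parts)


select_list = build_select_list("d date, t text, b boolean, i int, f float")
print(select_list)
```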
... It's a "real-life scenario"; this (apparently ugly) model is surprisingly fast for low-traffic applications. There are other advantages besides reduced disk usage: flexibility (you can change a dataset's schema without changing the SQL schema) and scalability (2, 3, ... 1 billion different datasets in the same table).
Returning to the question: imagine a dataset with ~50 or more columns; the SQL VIEW would be faster if PostgreSQL offered a "binary to binary" cast.
Short answer: No, there is no better way to extract a jsonb number as a PostgreSQL number than (for example)
CAST(j ->> 'attr' AS double precision)
A JSON number happens to be stored as PostgreSQL numeric internally, so that wouldn't work “directly” anyway. But there is no reason in principle why there could not be a more efficient way to extract such a value as numeric.
So, why don't we have that?
Nobody has implemented it. That is often an indication that nobody thought it worth the effort. I personally think that this would be a micro-optimization – if you want to go for maximum efficiency, you extract that column from the JSON and store it directly as column in the table.
It is not necessary to modify the PostgreSQL source to do this. It is possible to write your own C function that does exactly what you envision. If many people thought this was beneficial, I'd expect that somebody would already have written such a function.
PostgreSQL has just-in-time compilation (JIT). So if an expression like this is evaluated for a lot of rows, PostgreSQL will build executable code for that on the fly. That mitigates the inefficiency and makes it less necessary to have a special case for efficiency reasons.
It might not be quite as easy as it seems for many data types. JSON standard types don't necessarily correspond to PostgreSQL types in all cases. That may seem contrived, but look at this recent thread in the Hackers mailing list that deals with the differences between the numeric types between JSON and PostgreSQL.
None of the above means that such a feature could never exist; I just wanted to give reasons why we don't have it.
We are migrating databases from MSSQL to PostgreSQL. During this process we came across a table containing a password field of type NVARCHAR whose value was converted from VARBINARY before being stored as NVARCHAR.
For example, if I execute
SELECT HASHBYTES('SHA1','Password')
then it returns 0x8BE3C943B1609FFFBFC51AAD666D0A04ADF83C9D and in turn if this value is converted into NVARCHAR then it is returning a text in the format "䏉悱゚얿괚浦Њ鴼"
Since PostgreSQL doesn't support VARBINARY, we have used BYTEA instead, and it returns binary data. But when we try to convert this binary data to the VARCHAR type, it is returned in hex format.
For example: if the same statement is executed in PostgreSQL
SELECT ENCODE(DIGEST('Password','SHA1'),'hex')
then it returns
8be3c943b1609fffbfc51aad666d0a04adf83c9d.
When we try to convert this encoded text to the VARCHAR type, it returns the same result, 8be3c943b1609fffbfc51aad666d0a04adf83c9d.
Is it possible to get the same result that we retrieved from MSSQL Server? As these are password fields, we do not intend to change the values. Please suggest what needs to be done.
It sounds like you're taking a byte array containing a cryptographic hash and you want to convert it to a string to do a string comparison. This is a strange way to do hash comparisons but it might be possible depending on which encoding you were using on the MSSQL side.
If you have a byte array that can be converted to string in the encoding you're using (e.g. doesn't contain any invalid code points or sequences for that encoding) you can convert the byte array to string as follows:
SELECT CONVERT_FROM(DIGEST('Password','SHA1'), 'latin1') AS hash_string;
hash_string
-----------------------------
\u008BãÉC±`\u009Fÿ¿Å\x1Afm+
\x04ø<\u009D
If you're using Unicode this approach won't work at all, since arbitrary byte arrays can't be converted to Unicode: certain sequences are always invalid. You'll get an error like the following:
# SELECT CONVERT_FROM(DIGEST('Password','SHA1'), 'utf-8');
ERROR: invalid byte sequence for encoding "UTF8": 0x8b
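The same behaviour can be reproduced outside the database. The Python sketch below takes the digest bytes from the question and tries both decodings; it assumes Python's `latin-1` codec as a stand-in for PostgreSQL's LATIN1, which maps every byte value to a character.

```python
# The SHA-1 digest from the question, as raw bytes.
digest = bytes.fromhex("8be3c943b1609fffbfc51aad666d0a04adf83c9d")

# A single-byte encoding accepts any byte sequence, one character per byte.
as_latin1 = digest.decode("latin-1")

# UTF-8 rejects the digest: 0x8b cannot start a valid sequence.
try:
    digest.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

print(len(as_latin1))
print(utf8_error)
```

Note that the latin-1 round trip is lossless (encoding the string again gives back the original bytes), which is what makes string comparison work at all in that encoding.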
Here's a list of valid string encodings in PostgreSQL. Find out which encoding you're using on the MSSQL side and try to match it to PostgreSQL. If you can, I'd recommend changing your business logic to compare byte arrays directly, since this will be less error-prone and should be significantly faster.
then it returns 0x8BE3C943B1609FFFBFC51AAD666D0A04ADF83C9D and in turn
if this value is converted into NVARCHAR then it is returning a text
in the format "䏉悱゚얿괚浦Њ鴼"
Based on that, MSSQL interprets these bytes as a text encoded in UTF-16LE.
With PostgreSQL and using only built-in functions, you cannot obtain that result because PostgreSQL doesn't use or support UTF-16 at all, for anything.
It also doesn't support nul bytes in strings, and there are nul bytes in UTF-16.
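The UTF-16LE interpretation is easy to check on the client side. This Python sketch decodes the digest bytes from the question as UTF-16LE; it is only a way to verify the claim, not something PostgreSQL can do with built-in functions.

```python
# The 20 digest bytes returned by HASHBYTES in the question.
digest = bytes.fromhex("8BE3C943B1609FFFBFC51AAD666D0A04ADF83C9D")

# Every pair of bytes becomes one UTF-16 code unit, low byte first.
text = digest.decode("utf-16-le")

# Show each resulting character with its code point.
for ch in text:
    print(f"U+{ord(ch):04X}", ch)
```

The 20 bytes yield 10 code units. Two of them (U+E38B and U+F8AD) fall in the private use area and usually have no visible glyph, which would explain why the string quoted in the question shows fewer than 10 characters.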
This Q/A: UTF16 hex to text suggests several solutions.
Changing your business logic not to depend on UTF-16 would be your best long-term option, though. The hexadecimal representation, for instance, is simpler and much more portable.
I have been implementing user-defined types in PostgreSQL 9.2 and got confused.
In the PostgreSQL 9.2 documentation, there is a section (35.11) on user defined types. In the third paragraph of that section, the documentation refers to input and output functions that are used to construct a type. I am confused about the purpose of these functions. Are they concerned with on-disk representation or only in-memory representation? In the section referred to above, after defining the input and output functions, it states that:
If we want to do anything more with the type than merely store it,
we must provide additional functions to implement whatever operations
we'd like to have for the type.
Do the input and output functions deal with serialization?
As I understand it, the input function is the one used to perform INSERT INTO and the output function the one used to perform SELECT on the type. So basically, if we want to perform an INSERT INTO, we need a serialization function embedded or invoked in the input or output function. Can anyone help explain this to me?
Types must have a text representation, so that values of this type can be expressed as literals in a SQL query, and returned as results in output columns.
For example, '2013-01-20' is a text representation of a date. It's possible to write VALUES('2013-01-20'::date) in a SQL statement, because the input function of the date type recognizes this string as a date and transforms it into an internal representation (for both using it in memory and storing to disk).
Conversely, when client code issues SELECT date_field FROM table, the values inside date_field are returned in their text representation, which is produced by the type's output function from the internal representation (unless the client requested a binary format for this column).
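As an analogy only, the pair of functions can be pictured in Python like this. `date_in` and `date_out` are illustrative names, and the "internal representation" chosen here (a day ordinal) is an assumption made for the sketch; it says nothing about how PostgreSQL actually stores dates.

```python
from datetime import date


def date_in(text):
    """Input function: text representation -> internal representation."""
    # Rejects strings that are not valid dates, like a real input function.
    return date.fromisoformat(text).toordinal()


def date_out(internal):
    """Output function: internal representation -> text representation."""
    return date.fromordinal(internal).isoformat()


internal = date_in("2013-01-20")   # what would be kept in memory or on disk
print(internal, date_out(internal))
```

Everything else (comparison, arithmetic, indexing) operates on the internal value, which is why the documentation says additional functions are needed to do more than store the type.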
What is the syntax in PostgreSQL for inserting varbinary values?
I tried SQL Server's syntax, using a constant like 0xFFFF, but it didn't work.
Given there's no "varbinary" data type in Postgres I believe you mean "bytea". Take a look at the docs about the way to specify "bytea" literals.
Depending on the language and the bindings you use there could be more sophisticated ways for transferring binary data - you could find a .Net/C#/Npgsql example here (under "Working with binary data and bytea datatype").
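For the literal form itself, here is a small Python sketch that renders bytes as a bytea literal in hex format. It assumes a server with standard_conforming_strings turned on, so the backslash in the literal needs no doubling. `bytea_hex_literal` is a made-up helper name, and passing the bytes as a bound parameter through your driver is usually preferable to building literals.

```python
def bytea_hex_literal(data):
    # Hex format: a backslash, an 'x', then two hex digits per byte.
    return "'\\x" + data.hex() + "'::bytea"


# The 0xFFFF constant from the question, as two raw bytes.
literal = bytea_hex_literal(b"\xff\xff")
print("INSERT INTO t (b) VALUES (" + literal + ");")
```

The table `t` and column `b` in the printed statement are placeholders for illustration.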