Hash nvarchar as UTF32

I have a column of type nvarchar. I would like to get a SHA256 hash of the UTF32 representation of these characters.
I have found HASHBYTES which seems to do the meat of what I want to do, with SELECT HASHBYTES('SHA2_256', MyBinaryData).
It also indicates it can operate on nvarchar, but it doesn't indicate how it does the conversion from characters to bytes. I particularly need the hash of the UTF32 representation. How can I get that hash? Is there an in-database way to encode the nvarchar to UTF32 that I can feed to HASHBYTES? Is there another way?

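SQL Server stores nvarchar data as UTF-16 (little-endian), and HASHBYTES hashes exactly those bytes; there is no built-in function that re-encodes a string as UTF-32. To get the UTF-32 representation of an nvarchar column in SQL Server you therefore need some CLR code.
Try something similar to the code in this answer: https://stackoverflow.com/a/14041069/1187211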
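To check the CLR result against a reference value outside the database, something like the following Python sketch could be used. It only illustrates what "SHA-256 of the UTF-32 representation" means. It assumes little-endian UTF-32 without a byte order mark, and the function name sha256_utf32le is made up for this example; whatever byte order you pick, the CLR code has to use the same one.

```python
import hashlib


def sha256_utf32le(text: str) -> str:
    """Return the hex SHA-256 digest of text encoded as UTF-32LE (no BOM)."""
    # 'utf-32-le' emits 4 bytes per code point and no byte order mark;
    # plain 'utf-32' would prepend a BOM and therefore change the hash.
    return hashlib.sha256(text.encode("utf-32-le")).hexdigest()


# The same string encoded as UTF-16LE yields different bytes, so the hash differs.
utf16_digest = hashlib.sha256("abc".encode("utf-16-le")).hexdigest()
print(sha256_utf32le("abc") != utf16_digest)
```

The comparison at the end shows why hashing the nvarchar value directly is not enough: the bytes being hashed are not the same.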

Related

Postgres truncates trailing zeros for timestamps

Postgres (V11.3, 64bit, Windows) truncates trailing zeros for timestamps. So if I insert the timestamp '2019-06-12 12:37:07.880' into the table and I read it back as text postgres returns '2019-06-12 12:37:07.88'.
Table date_test:
CREATE TABLE public.date_test (
id SERIAL,
"timestamp" TIMESTAMP WITHOUT TIME ZONE NOT NULL,
CONSTRAINT pkey_date_test PRIMARY KEY(id)
)
SQL command when inserting data:
INSERT INTO date_test (timestamp) VALUES( '2019-06-12 12:37:07.880' )
SQL command to retrieve data:
SELECT dt.timestamp ::TEXT FROM date_test dt
returns '2019-06-12 12:37:07.88'
Do you consider this a bug or a feature?
My real issue is: I'm running queries from a C++ program and I have to convert the data returned from the database to the appropriate data types. Since the protocol is text-based, everything I read from the database is plain text. When parsing timestamps I first tokenize the string and then convert each token to an integer. Because the millisecond part is truncated, the last token is "88" instead of "880", and converting "88" yields a different value than converting "880" to an integer.

That's the default display format when using a cast to text.
If you want to see all three digits, use to_char()
SELECT to_char(dt.timestamp,'yyyy-mm-dd hh24:mi:ss.ms')
FROM date_test dt;
will return 2019-06-12 12:37:07.880

It’s a matter of presentation only.
First note that 07.88 seconds and 07.880 seconds are the same amount of time (as are 7.88 and 07.880000000, for that matter).
PostgreSQL internally represents a timestamp in a way that we shouldn’t be concerned about as long as it’s an unambiguous representation. When you retrieve the timestamp, it is formatted into some string. This is where PostgreSQL apparently chooses not to print redundant trailing zeros. So it’s probably not even correct to say that it truncates anything. It just refrains from generating that 0.
I think that the nice solution would be to modify your parser in C++ to accept any number of decimals and parse them correctly with and without trailing zeroes. Another solution that should work is given in the answer by a_horse_with_no_name.
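The parser change suggested above could look roughly like this. It is a sketch in Python rather than the asker's C++, and the function names are invented for the illustration. The idea is to right-pad the fractional digits before converting them, so that "88" and "880" both come out as 880 milliseconds.

```python
from typing import Tuple


def fraction_to_millis(fraction: str) -> int:
    """Convert the digits after the decimal point to whole milliseconds."""
    # Right-pad with zeros so "88" becomes "880", then keep three digits.
    # Digits beyond milliseconds are dropped (truncated, not rounded).
    return int((fraction + "000")[:3])


def split_seconds(token: str) -> Tuple[int, int]:
    """Split a seconds token such as '07.88' into (seconds, milliseconds)."""
    whole, _, fraction = token.partition(".")
    return int(whole), fraction_to_millis(fraction)


print(split_seconds("07.88"), split_seconds("07.880"), split_seconds("07"))
```

The same padding approach carries over to C++: pad or cut the fractional token to a fixed width before calling the integer conversion.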

PostgreSQL: Difference between "bytea" and "bit varying" types

The PostgreSQL types bytea and bit varying sound similar:
bytea stores binary strings.
bit varying stores strings of 1's and 0's.
The documentation does not mention a maximum size for either. Is it 1GB like character varying?
I have two separate use cases, both over a table with millions of rows:
Storing MD5 hashes
That would be a bytea with a length of 16 bytes or a bit(128). It would be used for:
Deduplication: Heavy use of GROUP BY, with an index I suppose.
Querying with WHERE md5 = for exact matches only.
Displaying as a hex string for human use.
Storing arbitrary binary data
Strings of binary data of varying length up to 4kB for:
Bitwise operations to find the strings matching a certain mask. Example at the end of this post.
Extracting some bytes, for instance get the integer value of the byte 14 in my string.
Some deduplication.
Working example for the bitwise operation, using bit varying. The mask is X'00FF00' and it returns only the row X'AAAAAA'. I shortened the strings for the example, but the operation would run over their full length, up to 4kB. Is it possible to do something similar with bytea?
CREATE TABLE test1 (mystring bit varying);
INSERT INTO test1 VALUES (X'AAAAAA'), (X'ABCABC');
SELECT * FROM test1 WHERE mystring & X'00FF00' = X'00AA00';
Which of bytea and bit varying is the more appropriate?
I saw the UUID type is made to store exactly 16 bytes, would that be any advantage to store the MD5's?

In general, if you're not using bitwise operations you should be using bytea.
I store larger values in bytea and then convert substrings to bit varying for bitwise operations where possible, mostly because clients understand bytea much more consistently than bit varying and the I/O format is more compact.
MD5 values should be stored as bytea. Bitwise operations on them make no sense, and you generally want to fetch them as binary.
I think bit varying really has two uses:
To store flags fields that are literally bit strings; and
As an interim data type for internal calculations
For pretty much everything else, use bytea.
There's nothing stopping you storing a 4k bitfield if that's what it is, though.
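If the values are fetched as bytea and the mask test is done on the client instead, the check from the question can be sketched as follows. This is a Python illustration under that assumption, not something PostgreSQL provides; the helper name matches_mask is hypothetical.

```python
def matches_mask(value: bytes, mask: bytes, expected: bytes) -> bool:
    """Return True when (value AND mask) equals expected, byte by byte."""
    if not (len(value) == len(mask) == len(expected)):
        raise ValueError("value, mask and expected must have equal length")
    return bytes(v & m for v, m in zip(value, mask)) == expected


# Same data as the bit varying example in the question.
rows = [bytes.fromhex("AAAAAA"), bytes.fromhex("ABCABC")]
mask = bytes.fromhex("00FF00")
expected = bytes.fromhex("00AA00")
print([row.hex() for row in rows if matches_mask(row, mask, expected)])
```

Only the first row passes, as in the SQL version. The trade-off is that every candidate row has to be transferred to the client, so for millions of rows the in-database bit varying comparison is likely the better fit.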

It appears the maximum length of bytea is 1 GB. [1]
For bitwise operations, use bit varying (see the explanation below).
For storing an MD5 hash, use bytea. It takes less storage than bit varying.
The benefit of UUID is that the UUID algorithm is designed to make values unique, not only in your table but also in your database, or even across databases (even if you generate the UUIDs in your application). I think that if you are using UUID without dashes it will be more efficient for storing, comparing and sorting (for a comparison between bytea and UUID, see below).
If you are concerned about storage, note that bit varying takes more storage than bytea. If that is acceptable, compare the functions that bit varying and bytea each offer.
So far, I can see that bit varying is more suitable for your bitwise operations, though bytea is the generally accepted way to store arbitrary data.
According to [1], PostgreSQL offers a single bytea operator: concatenation. You can append one bytea value to another using the concatenation operator ||.
[1] also states that you cannot compare two bytea values, even for equality or inequality. That statement is outdated: current PostgreSQL versions support the usual comparison operators (=, <>, <, >) on bytea, which is what allows the bytea primary key in the comparison below. Converting a bytea value to another type with CAST() opens up further operators.
Comparison between UUID and bytea
create table u(uuid uuid primary key, payload character(300));
create table b( bytea bytea primary key, payload character(300));
-- uuid_generate_v4() comes from the uuid-ossp extension
INSERT INTO u
SELECT uuid_generate_v4()
FROM generate_series(1,1000*1000);
-- random_bytea() is not a built-in function; it is a user-defined helper that is not shown here
INSERT INTO b
SELECT random_bytea(16)
FROM generate_series(1,1000*1000);
VACUUM ANALYZE u;
VACUUM ANALYZE b;
-- Table size
SELECT pg_size_pretty(pg_total_relation_size('u'));
pg_size_pretty
----------------
81 MB
SELECT pg_size_pretty(pg_total_relation_size('b'));
pg_size_pretty
----------------
101 MB
-- Speed comparison
\timing on
-- Common select
select * from u limit 1000;
Time: 1.433 ms
select * from b limit 1000;
Time: 1.396 ms
-- Random select
SELECT * FROM u OFFSET random()*1000 LIMIT 10000;
Time: 42.453 ms
SELECT * FROM b OFFSET random()*1000 LIMIT 10000;
Time: 10.962 ms
Conclusion: I don't think there is any further benefit to using UUID apart from its uniqueness and smaller size (which should make inserts faster).
Note: no index, and only one connection.
Source:
[1] PostgreSQL: The Comprehensive Guide to Building, Programming, and Administering PostgreSQL Databases (book)

Postgresql constraint to check for non-ascii characters

I have a Postgresql 9.3 database that is encoded 'UTF8'. However, there is a column in database that should never contain anything but ASCII. And if non-ascii gets in there, it causes a problem in another system that I have no control over. Therefore, I want to add a constraint to the column. Note: I already have a BEFORE INSERT trigger - so that might be a good place to do the check.
What's the best way to accomplish this in PostgreSQL?

You can define ASCII as ordinal 1 to 127 for this purpose, so the following query will identify a string with "non-ascii" values:
SELECT exists(SELECT 1 from regexp_split_to_table('abcdéfg','') x where ascii(x) not between 1 and 127);
but it's not likely to be super-efficient, and the use of subqueries would force you to do it in a trigger rather than a CHECK constraint.
Instead I'd use a regular expression. If you want all printable characters then you can use a range in a check constraint, like:
CHECK (my_column ~ '^[ -~]*$')
this will match everything from the space to the tilde, which is the printable ASCII range.
If you want all ASCII, printable and nonprintable, you can use byte escapes:
CHECK (my_column ~ '^[\x00-\x7F]*$')
The most strictly correct approach would be to convert_to(my_string, 'ascii') and let an exception be raised if it fails ... but PostgreSQL doesn't offer an ascii (i.e. 7-bit) encoding, so that approach isn't possible.
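If the application that writes to the column can also validate the value before it reaches the database, the same two checks can be mirrored on the client. The following Python sketch reuses the printable-range pattern from the CHECK constraint above; the function names are made up for the example, and it is meant as a complement to the constraint, not a replacement.

```python
import re

# Same character range as the CHECK constraint: space (0x20) through tilde (0x7E).
PRINTABLE_ASCII = re.compile(r"[ -~]*")


def is_printable_ascii(text: str) -> bool:
    """True when every character is printable ASCII (or the string is empty)."""
    return PRINTABLE_ASCII.fullmatch(text) is not None


def is_ascii(text: str) -> bool:
    """True when every character is in the 7-bit range, printable or not."""
    return text.isascii()


for sample in ("abcdefg", "abcdéfg", "tab\there"):
    print(repr(sample), is_printable_ascii(sample), is_ascii(sample))
```

The third sample shows the difference between the two variants: a tab is ASCII but not in the printable range.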

Use a CHECK constraint built around a regular expression.
Assuming that you mean a certain column should never contain anything but the lowercase letters from a to z, the uppercase letters from A to Z, and the numbers 0 through 9, something like this should work.
alter table your_table
add constraint allow_ascii_only
check (your_column ~ '^[a-zA-Z0-9]+$');
This is what people usually mean when they talk about "only ASCII" with respect to database columns, but ASCII also includes glyphs for punctuation, arithmetic operators, etc. Characters you want to allow go between the square brackets.

How to convert PostgreSQL escape bytea to hex bytea?

I got an answer on how to check for one specific BOM in a PostgreSQL text column. What I would really like is something more general, i.e. something like
select decode(replace(textColumn, '\\', '\\\\'), 'escape') from tableXY;
The result of a UTF8 BOM is:
\357\273\277
Which is octal bytea and can be converted by switching the output of bytea in pgadmin:
update pg_settings set setting = 'hex' WHERE name = 'bytea_output';
select '\357\273\277'::bytea
The result is:
\xefbbbf
What I would like to have is this result as one query, e.g.
update pg_settings set setting = 'hex' WHERE name = 'bytea_output';
select decode(replace(textColumn, '\\', '\\\\'), 'escape') from tableXY;
But that doesn't work. The result is empty, probably because the decode cannot handle hex output.

If the final purpose is to get the hexadecimal representation of all the bytes that constitute the strings in textColumn, this can be done with:
SELECT encode(convert_to(textColumn, 'UTF-8'), 'hex') from tableXY;
It does not depend on bytea_output. BTW, this setting plays a role only at the final stage of a query, when a result column is of type bytea and has to be returned in text format to the client (which is the most common case, and what pgAdmin does). It's a matter of representation, the actual values represented (the series of bytes) are identical.
In the query above, the result is of type text, so this is irrelevant anyway.
I think that your query with decode(..., 'escape') can't work because the argument is supposed to be encoded in escape format and it's not, per comments it's normal xml strings.
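To see that the escape and hex outputs describe the same bytes, the conversion can be reproduced outside the database. This is a small Python sketch, assuming the text is UTF-8 encoded as in the query above; the helper name utf8_hex is hypothetical.

```python
def utf8_hex(text: str) -> str:
    """Hex representation of the UTF-8 bytes of text, like encode(convert_to(...), 'hex')."""
    return text.encode("utf-8").hex()


# The octal escape form \357\273\277 and the hex form efbbbf are the same three bytes.
bom_from_octal = bytes([0o357, 0o273, 0o277])
print(bom_from_octal.hex())

# A string that starts with a BOM shows it as the leading 'efbbbf' in the hex output.
print(utf8_hex("\ufeff<xml>"))
```

A value whose hex output starts with efbbbf carries a UTF-8 BOM; other unexpected leading bytes can be spotted the same way.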

With the great help of Daniel Vérité, I now use this general query to check for all kinds of BOM or Unicode character problems:
select encode(textColumn::bytea, 'hex'), * from tableXY;
I had a problem with pgAdmin and columns that were too long: they showed no result. For pgAdmin I used this query instead:
select encode(substr(textColumn,1,100)::bytea, 'hex'), * from tableXY;
Thanks Daniel!

AnsiString being truncated with plenty of space

I'm inserting a row with a JOBCODE field defined as varchar(50). When the string for that field is greater than 20 characters I get an error from SQL Server warning that the string would be truncated.
I suspect this may have to do with Unicode wide characters, but I thought then 25 characters would pass.
Has anyone seen something like this before? What am I missing?

I think there is something else at fault here.
VARCHAR(50) should hold 50 characters under a single-byte code page (the length is counted in bytes, so only a double-byte collation would reduce it).
As an example:
CREATE TABLE AnsiString
(
JobCode VARCHAR(20), -- ANSI with codepage
JobCodeUnicode NVARCHAR(20) -- Unicode
)
Inserting 20 unicode characters into both columns
INSERT INTO AnsiString(JobCode, JobCodeUnicode) VALUES ('葉2葉4葉6葉8葉0葉2葉4葉6葉8叶0',
N'葉2葉4葉6葉8葉0葉2葉4葉6葉8叶0')
select * from ansistring
Returns
?2?4?6?8?0?2?4?6?8?0 葉2葉4葉6葉8葉0葉2葉4葉6葉8叶0
As expected, ? is inserted for characters which weren't mapped into ANSI, but either way, we can still insert 20 characters.
Do you possibly have a trigger on the table? Could it be another column entirely? Could your data access layer somehow be expanding your unicode string to something else (e.g. byte[])?
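To check the last possibility, it can help to look at the byte lengths the data access layer might be producing. The sketch below is Python and only an approximation: it assumes code page 1252 stands in for the column's single-byte ANSI code page, which may not match your collation.

```python
text = "葉2葉4葉6葉8葉0葉2葉4葉6葉8叶0"  # the 20-character sample from the INSERT above

# Single-byte code page: unmappable characters become '?', one byte per character.
ansi_bytes = text.encode("cp1252", errors="replace")

# UTF-16LE, as used for nvarchar: two bytes per character for this sample.
utf16_bytes = text.encode("utf-16-le")

print(len(text), len(ansi_bytes), len(utf16_bytes))
print(ansi_bytes.decode("cp1252"))
```

If wide characters were the cause, a 50-byte limit would be reached at 25 characters of this kind, not 20, which supports looking for another cause such as a trigger or a different column.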
