PostgreSQL and BLOBs - maximum size of bytea?

I'm currently trying to store images in a PostgreSQL table, following this guide and using a bytea column for the image. The problem is that the image I'm trying to insert is ~24 kB, and I keep getting an error that the maximum size is 8191 bytes, though I've read elsewhere that a bytea should be able to store up to 1 GB. Surely I should be able to raise this limit somehow?
Code:
String query = "INSERT INTO " + tableName + " VALUES (?)";
File file = new File(location);
// try-with-resources closes the statement and stream even if execute() throws
try (PreparedStatement stmt = conn.prepareStatement(query);
     FileInputStream fi = new FileInputStream(file)) {
    // stream the file contents into the bytea parameter
    stmt.setBinaryStream(1, fi, (int) file.length());
    return stmt.execute();
}
The database table consists of just a single bytea column at the moment.
Error message:
org.postgresql.util.PSQLException: ERROR: index row requires 23888 bytes, maximum size is 8191

Max size of bytea
According to this old thread, the maximum size for a field in Postgres is 1 GB.
The PostgreSQL wire protocol limits a row to 2 GiB minus the message header when it is sent to the client (SELECTed), because the protocol uses 32-bit signed integers to denote message size.
I found no other limits (see this other topic).
Large objects, however, are stored as multiple bytea records, so they are not limited in this way. See these docs for them.
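For illustration, a minimal sketch of the large-object route (the table name and file paths here are hypothetical; lo_import()/lo_export() run server-side and read/write files on the database server, so they need appropriate privileges):
CREATE TABLE images (name text, img oid);
-- lo_import() reads the file on the server and returns the new large object's oid
INSERT INTO images VALUES ('photo.jpg', lo_import('/path/on/server/photo.jpg'));
-- the contents can later be written back out, or read via the lo_* client interfaces
SELECT lo_export(img, '/tmp/photo_copy.jpg') FROM images WHERE name = 'photo.jpg';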

Apparently you have an index on that column (to be honest I'm surprised that you could create it - I would have expected Postgres to reject that).
An index on a bytea column does not really make sense. If you remove that index, you should be fine.
The real question is: why did you create an index on a column that stores binary data?
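If you do decide to drop it, a hedged sketch of finding and removing the index (the table and index names below are hypothetical; check pg_indexes for the real ones):
-- find the offending index on the bytea column
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'images';
-- then drop it by name
DROP INDEX images_image_idx;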

If you need to ensure that you don't upload the same image twice, you can create a unique index on the md5 (or some other hash) of the bytea:
create table a(a bytea);
create unique index a_bytea_unique_hash on a (md5(a));
insert into a values ('abc');
INSERT 0 1
insert into a values ('abc');
ERROR: duplicate key value violates unique constraint "a_bytea_unique_hash"
DETAIL: Key (md5(a))=(900150983cd24fb0d6963f7d28e17f72) already exists.

Related

How to query parquet data files from Azure Synapse when data may be structured and exceed 8000 bytes in length

I am having trouble reading, querying and creating external tables from Parquet files stored in Data Lake Storage Gen2 using Azure Synapse.
Specifically I see this error while trying to create an external table through the UI:
"Error details
New external table
Previewing the file data failed. Details: Failed to execute query. Error: Column 'members' of type 'NVARCHAR' is not compatible with external data type 'JSON string. (underlying parquet nested/repeatable column must be read as VARCHAR or CHAR)'. File/External table name: [DELETED] Total size of data scanned is 1 megabytes, total size of data moved is 0 megabytes, total size of data written is 0 megabytes.
. If the issue persists, contact support and provide the following id :"
My main hunch is that since a couple of columns were originally JSON types, and some of the rows are quite long (up to 9,000 characters right now, which could increase at any point during my ETL), this is some kind of conflict with the default limits I have seen referenced in the documentation. The data appears internally like the following example; please bear in mind that sometimes it would be much longer:
["100.001", "100.002", "100.003", "100.004", "100.005", "100.006", "100.023"]
If I try to manually create the external table (which has worked every other time I have tried), following code similar to this:
CREATE EXTERNAL TABLE example1(
[id] bigint,
[column1] nvarchar(4000),
[column2] nvarchar(4000),
[column3] datetime2(7)
)
WITH (
LOCATION = 'location/**',
DATA_SOURCE = [datasource],
FILE_FORMAT = [SynapseParquetFormat]
)
GO
the table is created with no errors or warnings, but trying to run a very simple select:
SELECT TOP (100) [id],
[column1],
[column2],
[column3]
FROM [schema1].[example1]
The following error is shown:
"External table 'dbo' is not accessible because content of directory cannot be listed."
It can also show the equivalent:
"External table 'schema1' is not accessible because content of directory cannot be listed."
This error persists even when creating the external table with the "max" argument, as described in this doc.
Summary: how can I create an external table from Parquet files with fields exceeding 4,000 or 8,000 bytes, or even up to 2 GB, which would be the maximum size according to this?
Thank you all in advance
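For reference, this is roughly what the attempt with "max" described above would look like (a sketch only; the column, data source and file format names are taken from the question and error message, and whether max is accepted here depends on the Synapse pool type):
CREATE EXTERNAL TABLE example1_max(
[id] bigint,
[members] nvarchar(max), -- the wide, JSON-like column from the error message
[column3] datetime2(7)
)
WITH (
LOCATION = 'location/**',
DATA_SOURCE = [datasource],
FILE_FORMAT = [SynapseParquetFormat]
)
GO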

Postgres: update value of TEXT column (CLOB)

I have a column of type TEXT which is supposed to represent a CLOB value and I'm trying to update its value like this:
UPDATE my_table SET my_column = TEXT 'Text value';
Normally this column is written and read by Hibernate and I noticed that values written with Hibernate are stored as integers (perhaps some internal Postgres reference to the CLOB data).
But when I try to update the column with the above SQL, the value is stored as a string and when Hibernate tries to read it, I get the following error: Bad value for type long : ["Text value"]
I tried all the options described in this answer but the result is always the same. How do I insert/update a TEXT column using SQL?
In order to update a CLOB created by Hibernate you should use the functions for handling large objects;
the documentation can be found in the following links:
https://www.postgresql.org/docs/current/lo-interfaces.html
https://www.postgresql.org/docs/current/lo-funcs.html
Examples:
To query:
select mytable.*, convert_from(loread(lo_open(mycblobfield::int, x'40000'::int), x'40000'::int), 'UTF8') from mytable where mytable.id = 4;
Note:
x'40000' corresponds to read mode (INV_READ).
To Update:
select lowrite(lo_open(16425, x'60000'::int), convert_to('this an updated text','UTF8'));
Note:
x'60000' corresponds to read and write mode (INV_READ + INV_WRITE).
The number 16425 is an example loid (large object id) that already exists in a record in your table. It's the integer number you can see as the value in the blob field created by Hibernate.
To Insert:
select lowrite(lo_open(lo_creat(-1), x'60000'::int), convert_to('this is a new text','UTF8'));
Note:
lo_creat(-1) generates a new large object and returns its loid.
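Putting this together with the original UPDATE, a minimal sketch (assuming my_column stores the loid as text, a hypothetical id column identifies the row, and lo_from_bytea() from the lo-funcs page linked above is available):
-- create a new large object holding the text and store its loid in the row
UPDATE my_table
SET my_column = lo_from_bytea(0, convert_to('Text value', 'UTF8'))::text
WHERE id = 4;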

PostgreSQL insert with nested query fails with large numbers of rows

I'm trying to insert data into a PostgreSQL table using a nested SQL statement. I'm finding that my inserts work when a small number of rows (a few thousand) is returned from the nested query. For instance, when I attempt:
insert into the_target_table (a_few_columns, average_metric)
SELECT a_few_columns, AVG(a_metric)
FROM a_table
GROUP BY a_few_columns
LIMIT 5000
However, this same query fails when I remove my LIMIT (the inner query without limit returns about 30,000 rows):
ERROR: Integer out of range
a_metric is double precision, and a_few_columns are text. I've played around with the LIMIT value, and it seems like the maximum number of rows it can insert without throwing an error is around 14,000. I don't know if this is non-deterministic or a constant threshold before the error is thrown.
I've looked through a few other SO posts on this topic, including this one, and changed my table primary key data type to BIGINT. I still get the same error. I don't think it's an issue w/ numerical overflow, however, as the number of inserts I'm making is small and nowhere even close to hitting the threshold.
Anyone have any clues what is causing this error?
The issue here was an improper definition of the average_metric field in the table I wanted to insert into. I had accidentally defined it as an integer. This normally isn't a huge issue, but I also had a handful of infinity values (inf). Once I switched the field's data type to double precision I was able to insert successfully. Of course, it's probably best if my application had checked for finite values prior to attempting the insert; normally I'd do this programmatically via asserts, but with a nested query I hadn't bothered to check.
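In other words, the fix amounted to something like this (a sketch; the column name matches the insert statement below):
-- average_metric was mistakenly created as integer; infinity does not fit in an integer
ALTER TABLE the_target_table
ALTER COLUMN average_metric TYPE double precision;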
The final query I used was
insert into the_target_table (a_few_columns, average_metric)
SELECT a_few_columns, CASE WHEN AVG(a_metric) = 'inf' THEN NULL ELSE AVG(a_metric) END
FROM a_table
GROUP BY a_few_columns
LIMIT 5000
An even better solution would have been to go through a_table and replace all inf values first.
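That cleanup might look something like this (a sketch, assuming NULL is an acceptable replacement for non-finite values):
-- replace non-finite metrics before aggregating and inserting
UPDATE a_table
SET a_metric = NULL
WHERE a_metric IN ('infinity', '-infinity');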

PostgreSQL: Difference between "bytea" and "bit varying" types

The PostgreSQL types bytea and bit varying sound similar:
bytea stores binary strings.
bit varying stores strings of 1's and 0's.
The documentation does not mention a maximum size for either. Is it 1GB like character varying?
I have two separate use cases, both over a table with millions of rows:
Storing MD5 hashes
That would be a bytea with a length of 16 bytes or a bit(128). It would be used for:
Deduplication: Heavy use of GROUP BY, with an index I suppose.
Querying with WHERE md5 = for exact matches only.
Displaying as a hex string for human use.
Storing arbitrary binary data
Strings of binary data of varying length up to 4kB for:
Bitwise operations to find the strings matching a certain mask. Example at the end of this post.
Extracting some bytes, for instance getting the integer value of byte 14 in my string.
Some deduplication.
Working example for the bitwise operation, using bit varying. The mask is X'00FF00' and it returns only the row X'AAAAAA'. I shortened the strings for the example, but it would be over their full length, up to 4 kB. Is it possible to do something similar with bytea?
CREATE TABLE test1 (mystring bit varying);
INSERT INTO test1 VALUES (X'AAAAAA'), (X'ABCABC');
SELECT * FROM test1 WHERE mystring & X'00FF00' = X'00AA00';
Which of bytea and bit varying is the more appropriate?
I saw the UUID type is made to store exactly 16 bytes, would that be any advantage to store the MD5's?
In general, if you're not using bitwise operations you should be using bytea.
I store larger values in bytea and then convert substrings to bit varying for bitwise operations where possible, mostly because clients understand bytea much more consistently than bit varying and the I/O format is more compact.
MD5 values should be stored as bytea. Bitwise operations on them make no sense, and you generally want to fetch them as binary.
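A small sketch of that pattern (the table and column names are made up): the digest is stored as 16 raw bytes and only rendered as hex for human use:
CREATE TABLE files (md5 bytea PRIMARY KEY, name text);
-- md5() returns hex text; decode(..., 'hex') packs it into 16 bytes
INSERT INTO files VALUES (decode(md5('some file contents'), 'hex'), 'example.txt');
-- display as a hex string for humans
SELECT encode(md5, 'hex') AS md5_hex, name FROM files;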
I think bit varying really has two uses:
To store flags fields that are literally bit strings; and
As an interim data type for internal calculations
For pretty much everything else, use bytea.
There's nothing stopping you storing a 4k bitfield if that's what it is, though.
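If you do want a mask-style test without leaving bytea, one option (not the only one) is to check individual bytes with get_byte(); a sketch mirroring the earlier bit varying example:
CREATE TABLE test2 (mystring bytea);
INSERT INTO test2 VALUES ('\xAAAAAA'), ('\xABCABC');
-- byte 1 is the middle byte; the mask X'00FF00' reduces to testing just that byte
SELECT * FROM test2 WHERE (get_byte(mystring, 1) & x'FF'::int) = x'AA'::int;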
It appears the maximum length of bytea is 1 GB. [1]
For bitwise operation use bit varying (explanation see below)
For storing MD5 hash use bytea. It will take less storage than bit varying
The benefit of using UUID is that the UUID algorithm practically guarantees uniqueness, not only in your table but also in your database, or even across databases (even if you generate the UUID in your application). I think if you are using UUIDs without dashes it will be more efficient for storing, comparing and sorting (a comparison between bytea and UUID is below).
For bitwise operation use bit varying
If you are concerned about storage: bit varying takes more storage than bytea. If that is acceptable, you should compare the functions the two types offer: the bit string functions/operators vs the binary string (bytea) functions/operators.
So far I can see bit varying will be more suitable for bitwise operations, though bytea is the generally accepted way to store arbitrary data.
PostgreSQL offers a single bytea operator: concatenation. You can append one bytea value to another bytea value using the concatenation operator ||. [1]
Note that you cannot compare two bytea values, even for equality/inequality. You can, of course, convert a bytea value into another type using CAST(), and that opens up other operators. [1]
Comparison between UUID and bytea
create table u(uuid uuid primary key, payload character(300));
create table b( bytea bytea primary key, payload character(300));
INSERT INTO u
SELECT uuid_generate_v4()
FROM generate_series(1,1000*1000);
INSERT INTO b
SELECT random_bytea(16)  -- random_bytea(n) is a user-defined helper returning n random bytes (not built in)
FROM generate_series(1,1000*1000);
VACUUM ANALYZE u;
VACUUM ANALYZE b;
## Your table size
SELECT pg_size_pretty(pg_total_relation_size('u'));
pg_size_pretty
----------------
81 MB
SELECT pg_size_pretty(pg_total_relation_size('b'));
pg_size_pretty
----------------
101 MB
## Speed comparison
\timing on
## Common select
select * from u limit 1000;
Time: 1.433 ms
select * from b limit 1000;
Time: 1.396 ms
## Random Select
SELECT * FROM u OFFSET random()*1000 LIMIT 10000;
Time: 42.453 ms
SELECT * FROM b OFFSET random()*1000 LIMIT 10000;
Time: 10.962 ms
Conclusion: I don't think there is much benefit to using UUID beyond its uniqueness and smaller size (it will be faster to insert).
Note: no additional indexes; only one connection was used.
Source [1]:
PostgreSQL: "The Comprehensive Guide to Building, Programming, and Administering PostgreSQL Databases" book

How to write a DB2 stored procedure to insert/update/delete with random values?

1. I want to write a DB2 stored procedure to do common inserts/updates/deletes on a table. The problem is how to generate SQL statements with random values. For example, for a column of integer type the stored procedure could generate numbers between 1 and 10000, and for a column of varchar type it could generate a string of randomly chosen characters with a fixed length, say 10.
2. Does the DB2 SQL syntax support putting the data from a file into a LOB column for a randomly chosen row? Say I have a table t1(c0 integer, c1 clob); how could I do something like "insert into t1 values(100, some_path_to_a_text_file)"?
3. When using DB2 "import" to load data, if the file contains 10000 rows, it seems DB2 by default commits the entire 10000-row insertion in one single transaction. Is there any configuration/option I could use to divide the "import" process into, say, 10 transactions, each with 1000 rows?
Thank you very much!
1) To do a random operation, get a random value and process it according to a set of rules. I have a similar case in a utility I am currently developing:
https://github.com/angoca/log4db2/blob/master/src/examples/sql-pl/bank/DemoBankRandom.sql
It performs an insert, a select, an update or a delete based on a random value.
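As a starting point for the random values themselves, a rough sketch using standard DB2 built-ins (RAND, INT, CHR); building a full 10-character string would mean repeating or looping the single-character expression:
-- random integer between 1 and 10000
SELECT INT(RAND() * 10000) + 1 AS random_int
FROM SYSIBM.SYSDUMMY1;
-- one random uppercase letter; concatenate ten of these (e.g. in a loop) for a 10-character string
SELECT CHR(65 + INT(RAND() * 26)) AS random_char
FROM SYSIBM.SYSDUMMY1;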
2) No idea.
3) For more frequent commits, use the COMMITCOUNT option. For more info, please check the Info Center: http://publib.boulder.ibm.com/infocenter/db2luw/v10r1/topic/com.ibm.db2.luw.admin.cmd.doc/doc/r0008304.html
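For example, a sketch of an IMPORT command that commits every 1000 rows (the file name and target table are placeholders):
-- commit after every 1000 inserted rows instead of one big transaction
IMPORT FROM mydata.del OF DEL COMMITCOUNT 1000 INSERT INTO t1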