Select subset of an array in jsonb postgres - postgresql

I have a table in Postgres 9.6.3:
CREATE TABLE public."Records"
(
"Id" uuid NOT NULL,
"Json" jsonb,
CONSTRAINT "PK_Records" PRIMARY KEY ("Id")
)
Inside my "Json" column i store arrays like so:
[
{"a":"b0","c":0,"z":true},
{"a":"b1","c":1,"z":false},
{"a":"b2","c":2,"z":true}
]
There can be some 10 million entries in each array, and there can be some 5 million records in the table.
I want to get the JSON out, paged, e.g. skip 1 record and return 2 records.
I can do it like so:
select string_agg(txt, ',') as x FROM
(select jsonb_array_elements_text("Json") as txt
FROM "Records" where "Id" = 'de70aadc-70e8-4c77-bd4b-af75ed36897e' -- some id here
limit 50 offset 5000 -- paging parameters
) s;
However, the query takes almost a second (between 780 and 900 ms) to run on my dev laptop, which has quite decent hardware (MacBook Pro 2017). Note: the timing is for the data sizes specified above; the three-record sample data obviously returns much faster.
Adding a GIN index like so: CREATE INDEX records_gin ON "Records" USING gin ("Json"); didn't actually do anything for the query performance (I suppose because I am not querying by the contents of the array).
Is there any way to make this work faster?

It would be faster if you normalized your data and stored the array elements in a second table. Then you could use keyset pagination to page through the data.
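A minimal sketch of that layout and of keyset pagination against it (the "RecordItems" table and its "Position" and "Item" columns are made-up names for illustration):
CREATE TABLE public."RecordItems"
(
"RecordId" uuid NOT NULL REFERENCES "Records" ("Id"),
"Position" integer NOT NULL,
"Item" jsonb NOT NULL,
CONSTRAINT "PK_RecordItems" PRIMARY KEY ("RecordId", "Position")
);
-- keyset pagination: instead of OFFSET, remember the last "Position" you returned
-- and continue from there, so each page is a cheap index range scan
SELECT "Item"
FROM "RecordItems"
WHERE "RecordId" = 'de70aadc-70e8-4c77-bd4b-af75ed36897e'
AND "Position" > 5000 -- last position from the previous page
ORDER BY "Position"
LIMIT 50;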

Related

Postgres: Fast facets on big blobs

I have a Postgres table with a large jsonb column.
CREATE TABLE mytable (id integer, my_jsonb jsonb)
The my_jsonb column contains data like this:
{
"name": "Bob",
"city": "Somecity",
"zip": "12345"
}
The table contains several million rows.
I need to create facets, i.e. aggregations, on individual fields in our user interface. For example:
city | count
New York | 1000
Chicago | 3000
Los Angeles | 4000
maybe 200 more values...
My current query, which yields the correct results, looks like this:
select my_jsonb->>'city', count(*)
from mytable
where foo='bar'
group by my_jsonb->>'city'
order by my_jsonb->>'city'
The problem is that it is painfully slow. It takes 5-10 seconds, depending on the particular column that I pick. It has to do a full table scan and extract each jsonb value, row by row.
Question: how do I create an index that does this query efficiently, and works no matter which jsonb field I choose?
A GIN index doesn't work. The query optimizer doesn't use it. Same for a simple BTREE on the jsonb column.
I'm thinking that there might be some kind of expression index, and I might be able to rewrite the facet query to use the expression, but I haven't figured it out.
Worst case, I can extract all of the values into a second table and index that, but I'd prefer not to.
Your only hope would be an index-only scan, but since that doesn't work with expression indexes, you're out. There is no way to avoid scanning the whole table and extracting the JSON values.
You'll have to extract the JSON values into a normalized form. Take this as a reminder that data models involving JSON are very often a bad choice in a relational database (although there are valid use cases).
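For example, a sketch of that normalization (mytable_fields and its columns are made-up names): one row per (id, key, value) pair, with a btree index so the facet query never has to touch the jsonb:
CREATE TABLE mytable_fields AS
SELECT id, key, value
FROM mytable, jsonb_each_text(my_jsonb) AS f(key, value);
CREATE INDEX ON mytable_fields (key, value);
-- the facet query then aggregates plain, indexed columns:
SELECT value AS city, count(*)
FROM mytable_fields
WHERE key = 'city'
GROUP BY value
ORDER BY value;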

postgres cube euclidean distance query performance issues

I have a Postgres database containing a table of documents with 100-dimensional word embeddings, and I am using it to find similar documents.
CREATE TABLE documents(
id bigint,
title text,
body text,
vector double precision[],
PRIMARY KEY(id)
);
I have installed the cube extension and am using it to sort documents by similarity to a selected document like this (as explained here):
SELECT id,title,body FROM documents ORDER BY cube(documents.vector)
<-> '(0.0990813672542572021,.. 0.0537704713642597198)'::cube LIMIT 10;
I have the index setup here:
CREATE INDEX ix_vect ON documents USING gist (cube(vector));
I am getting results as expected, but the query time is inordinately long, ~30-45 seconds for a table of ~2 million rows. How can I improve performance to bring it down to acceptable levels, i.e. <1 sec on millions of rows?
The correct way to use cube, per the documentation:
SELECT c FROM test ORDER BY c <-> cube(array[0.5,0.5,0.5]) LIMIT 1;
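Applied to the documents table above, the query would look something like the sketch below (the array is truncated to three components purely for illustration; in practice it would carry all 100, and the ORDER BY expression has to match the indexed expression cube(vector) for the GiST index to be considered):
SELECT id, title, body
FROM documents
ORDER BY cube(documents.vector) <-> cube(array[0.1, 0.2, 0.3])
LIMIT 10;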

PostgreSQL different index creation time for same datatype

I have a table with three columns A, B, C, all of type bytea.
There are around 180,000,000 rows in the table. A, B and C all contain exactly 20 bytes of data; C sometimes contains NULLs.
When creating indexes for all columns with
CREATE INDEX index_A ON transactions USING hash (A);
CREATE INDEX index_B ON transactions USING hash (B);
CREATE INDEX index_C ON transactions USING hash (C);
index_A is created in around 10 minutes, while the indexes on B and C were still running after more than 10 hours, at which point I aborted them. I ran every CREATE INDEX on its own, so no indexes were created in parallel. There are also no other queries running in the database.
When running
SELECT * FROM pg_stat_activity;
wait_event_type and wait_event are both NULL, state is active.
Why are the second index creations taking so long, and can I do anything to speed them up?
Ensure the statistics on your table are up-to-date.
Then execute the following query:
SELECT attname, n_distinct, correlation
from pg_stats
where tablename = '<Your table name here>'
Basically, the database will have more work to create indexes when:
The number of distinct values gets higher.
The correlation (i.e. whether values in the field are physically stored in order) is close to 0.
I suspect you will see that field A differs from the other two fields in its number of distinct values and/or has a higher correlation.
Edit: Basically, creating an index means a FULL SCAN of the table, creating entries in the index as you progress. With the stats you have shared, that means:
Column A: it was detected as unique
A single scan is enough as the DB knows 1 record = 1 index entry.
Columns B & C : it was detected as having very few distinct values + abs(correlation) is very low.
Each index entry takes an entire FULL SCAN of the table.
Note: the description is simplified to highlight the difference.
Solution 1:
Do not create indexes for B and C.
It might sound stupid, but in fact, as explained here, a small correlation means the indexes will probably not be used (an index is useful only when the entries it points to are not scattered across all the table blocks).
Solution 2:
Order records on the disk.
The initialization would be something like this:
CREATE TABLE Transactions_order as SELECT * FROM Transactions;
TRUNCATE TABLE Transactions;
INSERT INTO Transactions SELECT * FROM Transactions_order ORDER BY B,C,A;
DROP TABLE Transactions_order;
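An alternative sketch for the same reordering, assuming you are willing to create a btree index first, is CLUSTER, which rewrites the table in index order (it takes an exclusive lock while it runs; index_bca is a made-up name):
CREATE INDEX index_bca ON transactions (B, C, A);
CLUSTER transactions USING index_bca;
ANALYZE transactions;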
The tricky part comes next: as records are inserted, updated and deleted, you need to keep track of the correlation and ensure it does not drop too much.
If you can't guarantee that, stick to solution 1.
Solution 3:
Create partitions and enjoy partition pruning.
A lot of effort has gone into partitioning in recent PostgreSQL releases, so it could be worth looking into.
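A minimal sketch of declarative hash partitioning (available from PostgreSQL 11; the four-way split and the choice of B as the partitioning column are just for illustration):
CREATE TABLE transactions_part (
A bytea,
B bytea,
C bytea
) PARTITION BY HASH (B);
CREATE TABLE transactions_part_0 PARTITION OF transactions_part
FOR VALUES WITH (MODULUS 4, REMAINDER 0);
-- repeat for remainders 1 to 3
Each partition then carries its own, much smaller index, and the index builds can be run one partition at a time.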

Cassandra CQL3 select row keys from table with compound primary key

I'm using Cassandra 1.2.7 with the official Java driver that uses CQL3.
Suppose a table created by
CREATE TABLE foo (
row int,
column int,
txt text,
PRIMARY KEY (row, column)
);
Then I'd like to perform the equivalent of SELECT DISTINCT row FROM foo
As far as I understand, it should be possible to execute this query efficiently inside Cassandra's data model (given the way compound primary keys are implemented), as it would just query the 'raw' table.
I searched the CQL documentation but I didn't find any options to do that.
My backup plan is to create a separate table - something like
CREATE TABLE foo_rows (
row int,
PRIMARY KEY (row)
);
But this requires the hassle of keeping the two in sync - writing to foo_rows for any write to foo (also a performance penalty).
So is there any way to query for distinct row (partition) keys?
I'll give you the bad way to do this first. If you insert these rows:
insert into foo (row,column,txt) values (1,1,'First Insert');
insert into foo (row,column,txt) values (1,2,'Second Insert');
insert into foo (row,column,txt) values (2,1,'First Insert');
insert into foo (row,column,txt) values (2,2,'Second Insert');
Doing a
'select row from foo;'
will give you the following:
row
-----
1
1
2
2
Not distinct since it shows all possible combinations of row and column. To query to get one row value, you can add a column value:
select row from foo where column = 1;
But then you will get this warning:
Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
Ok. Then with this:
select row from foo where column = 1 ALLOW FILTERING;
row
-----
1
2
Great. What I wanted. Let's not ignore that warning though. If you only have a small number of rows, say 10,000, then this will work without a huge hit on performance. Now what if I have 1 billion? Depending on the number of nodes and the replication factor, your performance is going to take a serious hit. First, the query has to scan every possible row in the table (read: full table scan) and then filter the unique values for the result set. In some cases, this query will simply time out. Given that, it's probably not what you were looking for.
You mentioned that you were worried about a performance hit on inserting into multiple tables. Multiple table inserts are a perfectly valid data modeling technique. Cassandra can do an enormous amount of writes. As for it being a pain to sync, I don't know your exact application, but I can give general tips.
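For the syncing concern specifically, the two writes can at least be grouped; a sketch using the tables from the question (a logged batch keeps the statements together, at the cost of some coordination overhead):
BEGIN BATCH
INSERT INTO foo (row, column, txt) VALUES (3, 1, 'Third row, first insert');
INSERT INTO foo_rows (row) VALUES (3);
APPLY BATCH;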
If you need a distinct scan, you need to think in terms of partition columns. This is what we call an index or query table. The important thing to consider in any Cassandra data model is the application's queries. If I were using IP addresses as the row key, I might create something like this to scan all the IP addresses I have, in order.
CREATE TABLE ip_addresses (
first_quad int,
last_quads ascii,
PRIMARY KEY (first_quad, last_quads)
);
Now, to insert some rows in my 192.x.x.x address space:
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000000001');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000000002');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000001001');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000001255');
To get the distinct rows in the 192 space, I do this:
SELECT * FROM ip_addresses WHERE first_quad = 192;
first_quad | last_quads
------------+------------
192 | 000000001
192 | 000000002
192 | 000001001
192 | 000001255
To get every single address, you would just need to iterate over every possible row key from 0-255. In my example, I would expect the application to be asking for specific ranges to keep things performant. Your application may have different needs but hopefully you can see the pattern here.
According to the documentation, as of CQL version 3.1.1, Cassandra understands the DISTINCT modifier.
So you can now write
SELECT DISTINCT row FROM foo
@edofic
Partition row keys are used as a unique index to distinguish different rows in the storage engine, so by nature row keys are always distinct. You don't need to put DISTINCT in the SELECT clause.
Example
INSERT INTO foo(row,column,txt) VALUES (1,1,'1-1');
INSERT INTO foo(row,column,txt) VALUES (2,1,'2-1');
INSERT INTO foo(row,column,txt) VALUES (1,2,'1-2');
Then
SELECT row FROM foo
will return 2 values: 1 and 2
Below is how things are persisted in Cassandra
+----------+-------------------+------------------+
| row key  | column1/value     | column2/value    |
+----------+-------------------+------------------+
| 1        | 1/'1-1'           | 2/'1-2'          |
| 2        | 1/'2-1'           |                  |
+----------+-------------------+------------------+

Strange result when using a WHERE filter in CQL - cassandra

I have a counter column family created with the CREATE TABLE command below (for KEY I use bigint so I can filter on it when querying):
CREATE TABLE BannerCount (
KEY bigint PRIMARY KEY
) WITH
comment='' AND
comparator=text AND
read_repair_chance=0.100000 AND
gc_grace_seconds=864000 AND
default_validation=counter AND
min_compaction_threshold=4 AND
max_compaction_threshold=32 AND
replicate_on_write='true' AND
compaction_strategy_class='SizeTieredCompactionStrategy' AND
compression_parameters:sstable_compression='SnappyCompressor';
But when I insert data into this column family and select using a WHERE clause to filter the data, the results I retrieve are very strange, like this:
Using the query:
select count(1) From BannerCount where KEY > -1
count
-------
71
Using the query:
select count(1) From BannerCount where KEY > 0;
count
-------
3
Using the query:
select count(1) From BannerCount ;
count
-------
122
What is happening with my query? Can anyone tell me why I get these results?
To understand the reason for this, you should understand Cassandra's data model. You're probably using RandomPartitioner here, so each of the KEY values in your table is hashed to a token value and stored in a distributed way around your ring.
So finding all rows whose key has a higher value than X isn't the sort of query Cassandra is optimized for. You should probably be keying your rows on some other value, and then either using wide rows for your bigint values (since columns are sorted) or putting them in a second column and creating an index on it (see the sketch at the end of this answer).
To explain in a little more detail why your results seem strange: CQL 2 implicitly turns "KEY >= X" into "token(KEY) >= token(X)", so that a querier can iterate through all the rows in a somewhat-efficient way. So really, you're finding all the rows whose hash is greater than the hash of X. See CASSANDRA-3771 for how that confusion is being resolved in CQL 3. That said, the proper fix for you is to structure your data according to the queries you expect to be running on it.
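A sketch of the wide-row option suggested above (BannerCountByBucket, bucket and hits are made-up names; within a bucket the bigint becomes a sorted clustering column you can range-filter on, at the cost of all of a bucket's rows living in one partition):
CREATE TABLE BannerCountByBucket (
bucket int,
id bigint,
hits counter,
PRIMARY KEY (bucket, id)
);
-- counters can only be UPDATEd, never INSERTed:
UPDATE BannerCountByBucket SET hits = hits + 1 WHERE bucket = 0 AND id = 42;
-- a range filter on the clustering column now works within a partition:
SELECT id, hits FROM BannerCountByBucket WHERE bucket = 0 AND id > 0;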