Returning JSON from Postgres is slow - postgresql

I have a table in Postgres with a JSONB column; each row of the table contains a large JSONB object (~4500 keys, around 110 KB as JSON text). I want to query these rows and get the entire JSONB object.
The query is fast -- when I run EXPLAIN ANALYZE, or omit the JSONB column, it returns in 100-300 ms. But when I execute the full query, it takes on the order of minutes. The exact same query on a previous version of the data was also fast (each JSONB was about half as large).
Some notes:
This ends up in Python (via SQLAlchemy/psycopg2). I'm worried that the query executor is converting JSONB to JSON, which then gets encoded as text for transfer over the wire, and then gets parsed from JSON again on the Python end.
Is this correct? If so, how can I mitigate the issue? When I select the JSONB column as ::text, the query is roughly twice as fast.
I only need a small subset of the JSON (around 300 keys or 6% of keys). I tried methods of filtering the JSON output in the query but they caused a substantial further performance hit -- it ended up being faster to return the entire object.
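For reference, the filtering I tried was roughly of this shape (a reconstruction with placeholder connection string, table, column and key names, not my exact query), re-aggregating only the wanted keys per row:

import psycopg2  # the app actually goes through SQLAlchemy; plain psycopg2 shown for brevity

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()

# Explode each JSONB object into key/value pairs and re-aggregate only the
# keys of interest; all names below are placeholders.
wanted_keys = ["key_1", "key_2", "key_3"]
cur.execute("""
    SELECT t.id, s.subset
    FROM my_table AS t
    CROSS JOIN LATERAL (
        SELECT jsonb_object_agg(key, value) AS subset
        FROM jsonb_each(t.jsondata)
        WHERE key = ANY(%(keys)s)
    ) AS s
""", {"keys": wanted_keys})
rows = cur.fetchall()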

This is not necessarily a solution, but here is an update:
By casting the JSONB column to text in the Postgres query, I was able to substantially cut down query execution and result fetching on the Python end.
On the Python end, doing json.loads for every row in the result set brings me back to roughly the same timing as the regular query. However, with the ujson library I was able to obtain a significant speedup. Casting to text in the query and then calling ujson.loads on the Python end is roughly 3x faster than simply returning the JSONB from the query.
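Roughly, the pattern that worked looks like this (a simplified sketch; the connection string and table/column names are placeholders, and the real code goes through SQLAlchemy):

import ujson
import psycopg2  # psycopg2 shown directly for brevity

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()

# Casting to text makes psycopg2 return a plain string instead of running the
# stdlib json.loads on every row; parse it with ujson instead.
cur.execute("SELECT id, jsondata::text FROM my_table")  # placeholder table/column names
rows = [(row_id, ujson.loads(raw)) for row_id, raw in cur]

psycopg2 can also be given a different parser directly, e.g. psycopg2.extras.register_default_jsonb(conn, loads=ujson.loads), which should avoid the explicit ::text cast, though I haven't measured that path.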

Related

How to efficiently retrieve rows with large JSON objects in Postgres

I've inherited the task of retrieving data from a Postgres table.
The table has ~1m rows, and there are about 145k rows that I wish to retrieve. These 145k rows have a common string in one of their columns batch_name that I can use to search for them.
The table has two columns payload & result that are of type JSON. The result column contains the data that I wish to retrieve.
When I make even the simplest queries to the table:
SELECT * FROM table_name WHERE batch_name = 'an_id' limit 10
The request takes ~7-10 seconds to return data.
This is despite the fact that the batch_name column is of type varchar(255) and has an index on it.
Whilst investigating this, I've discovered that the JSON objects in the result and payload columns can be absolutely gigantic. When prettified, they are sometimes ~27k lines long.
These gigantic JSON objects seem to be the root cause of the problem.
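One way to quantify this (a sketch; I happen to connect from Python, and the connection string is a placeholder, but a psql session works the same) is to look at the stored size of each value with pg_column_size():

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()

# pg_column_size() reports the stored (possibly compressed/TOASTed) size of a
# value, which shows how many bytes each row drags along for the JSON columns.
cur.execute("""
    SELECT count(*)                      AS rows,
           avg(pg_column_size(result))   AS avg_result_bytes,
           avg(pg_column_size(payload))  AS avg_payload_bytes
    FROM table_name
    WHERE batch_name = %s
""", ("an_id",))
print(cur.fetchone())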
My questions are:
What can I do to improve the efficiency of this query? Or is the ultimate solution here to just modify the table such that we are no longer storing gigantic JSON objects?
Given that I don't need to actually query fields in these JSON objects (but I DO need to retrieve them), would simply storing them as strings improve efficiency?
Why is storing large JSON objects SO inefficient?
Thanks in advance for any help, it's much appreciated.

Data Lake Analytics - Large vertex query

I have a simple query which does a GROUP BY on two fields:
#facturas =
SELECT a.CodFactura,
Convert.ToInt32(a.Fecha.ToString("yyyyMMdd")) AS DateKey,
SUM(a.Consumo) AS Consumo
FROM #table_facturas AS a
GROUP BY a.CodFactura, a.DateKey;
#table_facturas has 4100 rows, but the query takes several minutes to finish. Looking at the graph explorer, I see it uses 2500 vertices because there are 2500 unique CodFactura+DateKey combinations. I don't know if this is normal ADLA behaviour. Is there any way to reduce the number of vertices and execute this query faster?
First: I am not sure your query will actually compile. You would need the Convert expression in your GROUP BY, or do the conversion in a previous SELECT statement.
Secondly: In order to answer your question, we would need to know how the full query is defined. Where does #table_facturas come from? How was it produced?
Without this information, I can only give some wild speculative guesses:
If #table_facturas is coming from an actual U-SQL table, your table is over-partitioned/fragmented. This could be because:
you inserted a lot of data originally with a distribution on the grouping columns, and you either have a predicate that reduces the number of rows per partition and/or you do not have up-to-date statistics (run CREATE STATISTICS on the columns).
you did a lot of INSERT statements, each inserting a small number of rows into the table, thus creating a large number of individual files. This will "scale out" the processing as well. Use ALTER TABLE REBUILD to recompact.
If it is coming from a fileset, you may have too many small files in the input. See if you can merge them into fewer, larger files.
If the above does not help, you can also try hinting a small row count in the query that creates #table_facturas by adding OPTION(ROWCOUNT=4000).

kdb+/q optimize union function

To give you a bit of background: I have a process which does a large, complex calculation that takes a while to complete. It runs on a timer. After some investigation I realised that what is causing the slowness isn't the actual calculation but the internal q function, union.
I am trying to union two simple tables, table A and table B. A is approximately 5m rows and B is 500. Both tables have only two columns, the first of which is a symbol. Table A is actually a compound primary key of a table. (Also, how do you copy directly from the console?)
n:5000000
big:([]n?`4;n?100)        / 5,000,000 rows: random 4-char syms and random ints below 100
small:([]500?`4;500?100)  / 500 rows of the same shape
\ts big union small
I tried keying both columns and upserting, join and then distinct, "big, small where not small in big" but nothing seems to work :(
Any help will be appreciated!
If you want to upsert into the big table, it has to be keyed and the upsert operator should be used. For example:
n:5000000
//big ids are unique numbers from 0 to 4999999
//table is keyed with the 1! operator
big:1!([]id:(neg n)?n;val:n?100)
//small ids are unique numbers: 250 from the 0-4999999 interval and 250 from the 5000000-9999999 interval
small:([]id:(-250?n),(n+-250?n);val:500?100)
If big is a global variable, it is efficient to upsert it in place:
`big upsert small
If big is local:
big: big upsert small
As a result, big will have 5000250 rows, because there are 250 common keys (the id column) between the big and small tables.
This may not be relevant, but just a quick thought: if your big table has a column of type `sym and this column does not really show up much throughout your program, why not cast it to string or some other type? If you are doing this update process every single day, then as the data gets packed into your partitioned HDB, whenever new data is added the kdb+ process has to reassign/rewrite its sym file, and I believe this is the part that actually takes a lot of time, not the union calculation itself.
If the above is true, I'd suggest either rewriting the schema for the table in a way that minimises the amount of rehashing (not sure if this is the right term!) of your sym file, or, as mentioned above, trying to assign an attribute to your table; this may reduce the time too.

PostgreSql Dynamic JSON Indexing

I am new to the PostgreSQL world. We chose this DB so that we could run filter queries (contains, less than, greater than, etc.) against our JSON results. The JSON results are dynamic and we cannot know in advance what keys will be generated in the output. The table (result_id (int64), jsondata (jsonb)) data looks like this:
id1, {"k1": "vab", "k2": "abc", "k3": "def"}
id1, {"k1": "abv", "k2": "v7", "k3": "ghu"}
id1, {"k1": "v5", "k2": "vdd", "k3": "vew"}
id1, {"k1": "v6", "k2": "v9s", "k3": "ved"}
id2, {"k4": "vw", "k5": "vds", "k6": "vdss"}
id2, {"k4": "v1", "k5": "fgg", "k6": "dd"}
id2, {"k4": "qw", "k5": "gfd", "k6": "ess"}
id2, {"k4": "er", "k5": "dfs", "k6": "fss"}
My queries would be something like
SELECT * FROM mytable WHERE result_id = 'id1' AND jsondata->>'k1' LIKE '%ab%';
My script outputs JSON content that I store in this table.
Each JSON key is represented as a grid column, and the key's values are the column data. The grid offers filtering capabilities, which means filtering on the JSON data.
My problem is that the filtering can happen on any JSON key, but the key names are not static. The keys (in the JSON output) might change when the script content is changed, so previously indexed keys would become irrelevant. But if the script is not changed, the keys remain constant.
How do I apply indexing so that my JSON filter operations become faster? The same table contains many keys within the same JSON row and across rows. Wouldn't it be inefficient to index all keys so that filtering can be made efficient?
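For context, the whole-column index I have been considering looks like the following (a sketch with a made-up table name and connection string, driven from Python only for convenience). As far as I understand, a GIN index over the whole jsonb column covers every key without naming them, but it only helps containment and key-existence operators, not substring or range filters:

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()

# One GIN index over the whole jsonb column covers every key without naming
# them, but it only accelerates containment (@>) and key-existence (?, ?|, ?&)
# operators -- not LIKE-style substring matches or </> comparisons on values.
cur.execute("CREATE INDEX IF NOT EXISTS jsondata_gin ON mytable USING GIN (jsondata)")

# A query shape the index can serve: exact-value containment on any key.
cur.execute(
    "SELECT * FROM mytable WHERE result_id = %s AND jsondata @> %s::jsonb",
    ("id1", '{"k1": "vab"}'),  # placeholder id and key/value
)
conn.commit()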

Hidden records which exceed numeric limits in Postgresql DB

I have a PG DB containing mostly numeric data, where accuracy of the numeric data is important, so fields are typically numeric(7,3) because the largest expected non-error value is around 1300. I am importing some data which contains records exceeding the 10^4 limit, due to sensor faults, so I have applied an error code to these values using:
UPDATE schema.table1 SET fieldname1 = 3333 WHERE fieldname1 > 3333;
(i.e. if an operator sees 3333 in the data, it is known to be an erroneous value because it is outside the normally expected range, and the repeated-digit sequence is unlikely to occur in nature, so it is clearly human-applied).
The success of this query was confirmed by:
SELECT MAX(fieldname1) FROM schema.table1;
which returns 3333 for all fields which previously had values exceeding 3333.
This initial tidying of the data was done in a temporary table with the fields defined as numeric(10,3), and I now need to import the data into the main table, using:
INSERT INTO schema.table2
(DateTime, Fieldname1)
SELECT
DISTINCT ON (DateTime)
DateTime,
Fieldname1
FROM Schema.table1
WHERE DateTime IS NOT NULL
But I get an error message saying 'datatype Numeric(7,3) cannot contain values exceeding 10^4' (or words to that effect).
As an experiment I tried redefining the datatypes in the temporary table as Numeric (7,3). This worked for most of the fields, but for a few fields I got the 'datatype Numeric(7,3) cannot contain values exceeding 10^4' message, implying that there is still data >10^4 despite the MAX(Fieldname1) command returning 3333.
Tried VACUUM and ANALYZE to clean up the tables, no cigar.
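One further check that occurs to me but that I have not run yet (a sketch from Python with a placeholder connection string; a psql session would do the same): numeric(7,3) rejects anything whose absolute value reaches 10^4, and neither my UPDATE (which only caps values greater than 3333) nor MAX() would reveal a large negative sensor fault.

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()

# numeric(7,3) rejects anything whose absolute value reaches 10^4, and MAX()
# only shows the top of the range, so look at both extremes of the column.
cur.execute("""
    SELECT count(*) FILTER (WHERE abs(fieldname1) >= 10000) AS overflowing,
           min(fieldname1) AS min_val,
           max(fieldname1) AS max_val
    FROM schema.table1
""")
print(cur.fetchone())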
Is there a known issue here? Am I going about this the wrong way?