How to combine DISTINCT and ORDER BY in array_agg of jsonb values in PostgreSQL - postgresql

Note: I am using the latest version of Postgres (9.4)
I am trying to write a query which does a simple join of 2 tables, and groups by the primary key of the first table, and does an array_agg of several fields in the 2nd table which I want returned as an object. The array needs to be sorted by a combination of 2 fields in the json objects, and also uniquified.
So far, I have come up with the following:
SELECT
    zoo.id,
    ARRAY_AGG(
        DISTINCT ROW_TO_JSON((
            SELECT x
            FROM (SELECT animals.type, animals.name) x
        ))::JSONB
        -- ORDER BY animals.type, animals.name
    )
FROM zoo
JOIN animals ON animals.zooId = zoo.id
GROUP BY zoo.id;
This results in one row for each zoo, with an aggregate array of jsonb objects, one for each animal, with duplicates removed.
However, I can't seem to figure out how to also sort this by the parameters in the commented out part of the code.
If I take out the distinct, I can ORDER BY original fields, which works great, but then I have duplicates.

If you use row_to_json() you will lose the column names unless you put in a row that is typed. If you "manually" build the jsonb object with json_build_object() using explicit names then you get them back:
SELECT zoo.id, array_agg(za.jb) AS animals
FROM zoo
JOIN (
    SELECT DISTINCT ON (zooId, "type", "name")
           zooId,
           json_build_object('animal_type', "type", 'animal_name', "name")::jsonb AS jb
    FROM animals
    ORDER BY zooId, jb->>'animal_type', jb->>'animal_name'
    -- ORDER BY zooId, "type", "name" is far more efficient
) AS za ON za.zooId = zoo.id
GROUP BY zoo.id;
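As a side note, here is a minimal illustration (with made-up values) of the row_to_json() behaviour described at the start of this answer:

-- with an anonymous row the column names are lost (keys become f1, f2)
SELECT row_to_json(ROW('lion'::text, 'Simba'::text));
-- {"f1":"lion","f2":"Simba"}

-- a subselect gives the row a type, so the column names are kept
SELECT row_to_json(t)
FROM (SELECT 'lion'::text AS animal_type, 'Simba'::text AS animal_name) t;
-- {"animal_type":"lion","animal_name":"Simba"}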
You can ORDER BY the elements of a jsonb object, as shown above, but (as far as I know) you cannot use DISTINCT on a jsonb object. In your case this would be rather inefficient anyway (first building all the jsonb objects, then throwing out duplicates) and at the aggregate level it is plain impossible with standard SQL. You can achieve the same result, however, by applying the DISTINCT clause before building the jsonb object.
Also, avoid using SQL key words like "type" and standard data type names like "name" as column names. Both are non-reserved keywords, so you can use them in their proper contexts, but practically speaking your commands could get really confusing. You could, for instance, have a schema, a table, a column in that table, and a data type each called "type", and then you could get this:
SELECT type::type FROM type.type WHERE type = something;
While PostgreSQL will graciously accept this, it is plain confusing at best and prone to error in all sorts of more complex situations. You can get a long way by double-quoting any key words, but they are best just avoided as identifiers.
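For illustration, a hypothetical redefinition of the animals table from the first query with non-keyword column names (the id column and the zoo reference are assumptions):

CREATE TABLE animals (
    id          serial PRIMARY KEY,            -- assumed surrogate key
    zooid       integer REFERENCES zoo (id),
    animal_type text,                          -- instead of "type"
    animal_name text                           -- instead of "name"
);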

Related

Postgres, where in on a json object

I've been looking around and can't seem to find anything that is helping me understand how I can achieve the following. (Bear in mind I have simplified this to the problem I'm having and I am only storing simple JSON objects in this field)
Given I have a table "test" defined
CREATE TABLE test (
id int primary key
, features jsonb
)
And some test data
id | features
---+-----------------------
 1 | {"country": "Sweden"}
 2 | {"country": "Denmark"}
 3 | {"country": "Norway"}
I've been trying to filter on the JSONB column "features". I can do this easily with one value
SELECT *
FROM test
WHERE features #> '{"country": "Sweden"}'
But I've been having trouble working out how I could filter by multiple values succinctly. I can do this
SELECT *
FROM test
WHERE features #> '{"country": "Sweden"}'
OR features #> '{"country": "Norway"}'
But I have been wondering if there would be an equivalent to WHERE IN ($1, $2, ...) for JSONB columns.
I suspect that I will likely need to stick with the WHERE... OR... but would like to know if there is another way to achieve this.
You can use jsonb->>'field_name' to extract a field as text, then use any operator compatible with the text type:
SELECT *
FROM test
WHERE features->>'country' = 'Sweden'
SELECT *
FROM test
WHERE features->>'country' in ('Sweden', 'Norway')
You can also work directly with jsonb as follows:
jsonb->'field_name' extracts the field as jsonb, so you can use any operator compatible with jsonb:
SELECT *
FROM test
WHERE features->'country' ?| array['Sweden', 'Norway']
See docs for more details
You can extract the country value, then you can use a regular IN condition:
select *
from test
where features ->> 'country' in ('Sweden', 'Norway')

Most efficient way to DECODE multiple columns -- DB2

I am fairly new to DB2 (and SQL in general) and I am having trouble finding an efficient method to DECODE columns.
Currently, the database has a number of tables, most of which store a significant number of their columns as numeric codes; these numbers correspond to a table with the real values. We are talking about 9,500 different values (e.g. '502 = yes' or '1413 = Graduate Student').
Ordinarily, I would just add a WHERE clause and show where they are equal, but since there are 20-30 columns that need to be decoded per table, I can't really do this (that I know of).
Is there a way to effectively just display the corresponding value from the other table?
Example:
SELECT TEST_ID, DECODE(TEST_STATUS, 5111, 'Approved', 5112, 'In Progress') TEST_STATUS
FROM TEST_TABLE
The above works fine, but I have to manually look up the numbers and review them to build the statements. As I mentioned, some tables have 20-30 columns that would need this AND some would need DECODE statements with 12-15 conditions.
Is there anything that would allow me to do something simpler like:
SELECT TEST_ID, DECODE(TEST_STATUS = *TableWithCodeValues*) TEST_STATUS
FROM TEST_TABLE
EDIT: Also, to be more clear, I know I can do a ton of INNER JOINS, but I wasn't sure if there was a more efficient way than that.
From a logical point of view, I would consider splitting the lookup table into several domain/dimension tables. Not sure if that is possible for you to do, so I'll leave that part aside.
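A hedged sketch of what that split might look like, using the TEST_STATUS_TABLE name from the join example below (the big code table's name and columns are assumptions):

-- one dimension table per code domain, populated from the single big code table
CREATE TABLE TEST_STATUS_TABLE (
    TEST_STATUS             INTEGER     NOT NULL PRIMARY KEY,
    TEST_STATUS_DESCRIPTION VARCHAR(64) NOT NULL
);

INSERT INTO TEST_STATUS_TABLE (TEST_STATUS, TEST_STATUS_DESCRIPTION)
SELECT CODE_ID, CODE_TEXT          -- assumed column names of the big lookup table
FROM ALL_CODES                     -- hypothetical name for the 9,500-value lookup table
WHERE CODE_ID IN (5111, 5112);     -- the codes belonging to this domain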
As mentioned in my comment, I would stay away from using DECODE as described in your post. I would start by doing it with ordinary joins:
SELECT a.TEST_STATUS
, b.TEST_STATUS_DESCRIPTION
, a.ANOTHER_STATUS
, c.ANOTHER_STATUS_DESCRIPTION
, ...
FROM TEST_TABLE as a
JOIN TEST_STATUS_TABLE as b
ON a.TEST_STATUS = b.TEST_STATUS
JOIN ANOTHER_STATUS_TABLE as c
ON a.ANOTHER_STATUS = c.ANOTHER_STATUS
JOIN ...
If things are too slow there are a couple of things you can try:
Create a statistical view that can help determine cardinalities from the joins (this may help the optimizer create a better plan):
https://www.ibm.com/support/knowledgecenter/sl/SSEPGG_9.7.0/com.ibm.db2.luw.admin.perf.doc/doc/c0021713.html
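A minimal sketch of that approach (view name and schema are assumptions; RUNSTATS is a command-line processor command rather than SQL):

-- a view over the join, marked as a statistical view, with statistics collected on it
CREATE VIEW SV_TEST_STATUS AS
    SELECT b.TEST_STATUS
    FROM TEST_TABLE a, TEST_STATUS_TABLE b
    WHERE a.TEST_STATUS = b.TEST_STATUS;

ALTER VIEW SV_TEST_STATUS ENABLE QUERY OPTIMIZATION;

-- from the CLP:
-- RUNSTATS ON TABLE <schema>.SV_TEST_STATUS WITH DISTRIBUTION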
If your license permits, you can experiment with Materialized Query Tables (MQTs). Note that there is a penalty for modifications of the base tables, so if you have more of an OLTP workload, this is probably not a good idea:
https://www.ibm.com/developerworks/data/library/techarticle/dm-0509melnyk/index.html
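A hedged sketch of a deferred-refresh MQT over one of the joins (table and column names are assumptions):

-- materialize the decoded join once; REFRESH TABLE re-populates it on demand
CREATE TABLE TEST_TABLE_DECODED AS (
    SELECT a.TEST_ID, a.TEST_STATUS, b.TEST_STATUS_DESCRIPTION
    FROM TEST_TABLE a
    JOIN TEST_STATUS_TABLE b ON a.TEST_STATUS = b.TEST_STATUS
) DATA INITIALLY DEFERRED REFRESH DEFERRED;

REFRESH TABLE TEST_TABLE_DECODED;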
A third option, if your lookup table is fairly static, is to cache the lookup table in the application. Read the TEST_TABLE from the database, and look up descriptions in the application. A further improvement may be to add triggers that invalidate the cache when the lookup table is modified.
If you don't want to do all these joins, you could create your own LOOKUP function.
create or replace function lookup(IN_ID INTEGER)
returns varchar(32)
deterministic reads sql data
begin atomic
declare OUT_TEXT varchar(32);--
set OUT_TEXT=(select text from test.lookup where id=IN_ID);--
return OUT_TEXT;--
end;
With a table TEST.LOOKUP like
create table test.lookup(id integer, text varchar(32))
containing some id/text pairs, this will return the text value corresponding to an id, or NULL if not found.
With your mentioned 10k id/text pairs and an index on the ID field, this shouldn't be a performance issue, as that amount of data should easily be cached in the corresponding bufferpool.
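For example (a sketch, reusing the assumed column names from the join example above), decoding then looks like this, together with the supporting index:

-- index so each function call is a cheap lookup
create index lookup_id_ix on test.lookup (id);

-- one function call per code column instead of one join per column
SELECT TEST_ID,
       lookup(TEST_STATUS)    AS TEST_STATUS_DESCRIPTION,
       lookup(ANOTHER_STATUS) AS ANOTHER_STATUS_DESCRIPTION
FROM TEST_TABLE;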

Postgresql get references from a dictionary

I'm trying to build a query to get the data from a table, but some of those columns have foreign keys that I would like to replace with the associated name, all in one query.
Basically there's
table A with column 1:PKA-ID and column 2:name.
table B with column 1:PKB-ID, column 2:FKA-ID, column 3:amount.
I want to get all the lines in table B but with all foreign keys replaced by the associated names in table A.
I started building a query with a subquery + alias to get that, but of course I got more than one result per subquery, and I can't find a way to link that subquery to the ID of table B [might be exhausted, dumb or both] from the main query. I did something like this:
SELECT (SELECT "NAME" FROM A JOIN B ON ID = FKA-ID) AS name, amount FROM TABLEB;
it feels like such a simple query, yet...
You don't need a join in the subselect.
SELECT pkb_id,
(SELECT name FROM a WHERE a.pka_id = b.fka_id),
amount
FROM b;
The subselect query runs for each and every row of its parent select and has the parent row available from the context.
You can also use a simple join.
SELECT b.pkb_id, a.name, b.amount
FROM b, a
WHERE a.pka_id = b.fka_id;
Note that the join version puts fewer restrictions on the PostgreSQL query optimizer, so in some cases the join version might work faster. (For example, in PostgreSQL 9.6 the join might utilize multiple CPU units, cf. Parallel Query.)

Effectively searching through entire 1 level nested JSONB in Postgres

Let's say we need to check if a jsonb column contains a particular value matching by a substring in any of the values (non-nested, first level only).
How does one effectively optimize a query to search the entire JSONB column (that means every key) for a value?
Is there a good alternative to doing ILIKE %val% on the jsonb datatype cast to text?
jsonb_each_text(jsonb_column) ILIKE '%val%'
As an example consider this data:
SELECT
'{
"col1": "somevalue",
"col2": 5.5,
"col3": 2016-01-01,
"col4": "othervalue",
"col5": "yet_another_value"
}'::JSONB
How would you go about optimizing a query like that when in need to search for pattern %val% in records containing different keys configuration for different rows in a jsonb column?
I'm aware that searching with a preceding and following % sign is inefficient, so I am looking for a better way but having a hard time finding one. Also, indexing all the fields within the json column explicitly is not an option, since they vary for each type of record and that would create a huge set of indexes (not every row has the same set of keys).
Question
Is there a better alternative to extracting each key-value pair to text and performing an ILIKE/POSIX search?
If you know you will need to query only a few known keys, then you can simply index those expressions.
This is an overly simple but self-explanatory example:
create table foo as SELECT '{"col1": "somevalue", "col2": 5.5, "col3": "2016-01-01", "col4": "othervalue", "col5": "yet_another_value"}'::JSONB as bar;
create index pickfoo1 on foo ((bar #>> '{col1}'));
create index pickfoo2 on foo ((bar #>> '{col2}'));
This is the basic idea. Even though it isn't useful for ILIKE queries on its own, you can do more things with it (depending on your needs).
For example: If you need only case insensitive matching, it would be sufficient to do:
-- Create index over lowered value:
create index pickfoo1 on foo (lower(bar #>> '{col1}'));
create index pickfoo2 on foo (lower(bar #>> '{col2}'));
-- Check that it matches:
select * from foo where lower(bar #>> '{col1}') = lower('soMEvaLUe');
NOTE: This is only an example. If you run EXPLAIN on the previous SELECT, you will see that Postgres actually performs a sequential scan instead of using the index, but that is only because we are testing on a table with a single row, which is not the usual case. You could verify it with a bigger table ;-)
With huge tables, even LIKE queries should benefit from the index if the first wildcard doesn't appear at the beginning of the string (but that isn't a matter of jsonb; it is how btree indexes themselves work).
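A hedged aside (not from the answer itself): with a non-C collation, a prefix LIKE typically needs the expression index declared with text_pattern_ops, for example:

-- index name is made up; only helps patterns without a leading wildcard
create index pickfoo1_prefix on foo ((bar #>> '{col1}') text_pattern_ops);

select * from foo where bar #>> '{col1}' like 'some%';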
If you need to optimize queries like:
select * from foo where bar #>> '{col1}' ilike '%MEvaL%';
...then you should consider using GIN or GIST indexes instead.
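One common way to do that (an assumption on my part, not spelled out in the answer) is a trigram GIN index from the pg_trgm extension, which can serve ILIKE even with wildcards on both sides:

-- a sketch; the index name is made up
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX pickfoo1_trgm ON foo USING gin ((bar #>> '{col1}') gin_trgm_ops);

SELECT * FROM foo WHERE bar #>> '{col1}' ILIKE '%MEvaL%';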

Cannot sort a row of size 8130, which is greater than the allowable maximum of 8094

SELECT DISTINCT tblJobReq.JobReqId
, tblJobReq.JobStatusId
, tblJobClass.JobClassId
, tblJobClass.Title
, tblJobReq.JobClassSubTitle
, tblJobAnnouncement.JobClassDesc
, tblJobAnnouncement.EndDate
, tblJobAnnouncement.AgencyMktgVerbage
, tblJobAnnouncement.SpecInfo
, tblJobAnnouncement.Benefits
, tblSalary.MinRateSal
, tblSalary.MaxRateSal
, tblSalary.MinRateHour
, tblSalary.MaxRateHour
, tblJobClass.StatementEval
, tblJobReq.ApprovalDate
, tblJobReq.RecruiterId
, tblJobReq.AgencyId
FROM ((tblJobReq
LEFT JOIN tblJobAnnouncement ON tblJobReq.JobReqId = tblJobAnnouncement.JobReqId)
INNER JOIN tblJobClass ON tblJobReq.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary ON tblJobClass.SalaryCode = tblSalary.SalaryCode
WHERE (tblJobReq.JobClassId in (SELECT JobClassId
from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
When I try to execute the query, it results in the following error.
Cannot sort a row of size 8130, which is greater than the allowable maximum of 8094
I checked and didn't find any solution. The only way I found is to truncate (with substring()) the "tblJobAnnouncement.JobClassDesc" column in the query, which has a column size of around 8000.
Is there any workaround so that I need not truncate the values? Or can this query be optimised? Any setting in SQL Server 2000?
The [non-obvious] reason why SQL needs to SORT is the DISTINCT keyword.
Depending on the data and underlying table structures, you may be able to do away with this DISTINCT, and hence not trigger this error.
You readily found the alternative solution which is to truncate some of the fields in the SELECT list.
Edit: Answering "Can you please explain how DISTINCT would be the reason here?"
Generally, the fashion in which the DISTINCT requirement is satisfied varies with
the data context (expected number of rows, presence/absence of index, size of row...)
the version/make of the SQL implementation (the query optimizer in particular receives new or modified heuristics with each new version, sometimes resulting in alternate query plans for various constructs in various contexts)
Yet, all the possible plans associated with a "DISTINCT query" involve *some form* of sorting of the qualifying records. In its simplest form, the plan first produces the list of qualifying rows (records) (the list of records which satisfy the WHERE/JOINs/etc. parts of the query) and then sorts this list (which possibly includes some duplicates), only retaining the very first occurrence of each distinct row. In other cases, for example when only a few columns are selected and when some index(es) covering these columns is(are) available, no explicit sorting step is used in the query plan, but the reliance on an index implicitly implies the "sortability" of the underlying columns. In other cases yet, steps involving various forms of merging or hashing are selected by the query optimizer, and these too, eventually, imply the ability of comparing two rows.
Bottom line: DISTINCT implies some sorting.
In the specific case of the question, the error reported by SQL Server and preventing the completion of the query is that "Sorting is not possible on rows bigger than...", and the DISTINCT keyword is the only apparent reason for the query to require any sorting (BTW, many other SQL constructs imply sorting, for example UNION), hence the idea of removing the DISTINCT (if it is logically possible).
In fact you should remove it, for test purposes, to assert that, without DISTINCT, the query completes OK (if only including some duplicates). Once this fact is confirmed, and if effectively the query could produce duplicate rows, look into ways of producing a duplicate-free query without the DISTINCT keyword; constructs involving subqueries can sometimes be used for this purpose.
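Purely as an illustration (the table relationships are assumed here): when a one-to-many join is only used for filtering, an EXISTS subquery avoids the row multiplication that forces the DISTINCT in the first place:

SELECT jr.JobReqId, jr.JobStatusId, jr.ApprovalDate
FROM tblJobReq AS jr
WHERE EXISTS (SELECT 1
              FROM tblJobClass AS jc
              WHERE jc.JobClassId = jr.JobClassId
                AND jc.Title LIKE '%Family Therapist%')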
An unrelated hint is to use table aliases, using a short string to avoid repeating these long table names. For example (I only did a few tables, but you get the idea...):
SELECT DISTINCT JR.JobReqId, JR.JobStatusId,
tblJobClass.JobClassId, tblJobClass.Title,
JR.JobClassSubTitle, JA.JobClassDesc, JA.EndDate, JA.AgencyMktgVerbage,
JA.SpecInfo, JA.Benefits,
S.MinRateSal, S.MaxRateSal, S.MinRateHour, S.MaxRateHour,
tblJobClass.StatementEval,
JR.ApprovalDate, JR.RecruiterId, JR.AgencyId
FROM (
(tblJobReq AS JR
LEFT JOIN tblJobAnnouncement AS JA ON JR.JobReqId = JA.JobReqId)
INNER JOIN tblJobClass ON JR.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary AS S ON tblJobClass.SalaryCode = S.SalaryCode
WHERE (JR.JobClassId in
(SELECT JobClassId from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
FYI, running this SQL command on your DB can fix the problem if it is caused by space that needs to be reclaimed after dropping variable length columns:
DBCC CLEANTABLE (0, 'dbo.TableName')
See: http://msdn.microsoft.com/en-us/library/ms174418.aspx
This is a limitation of SQL Server 2000. You can:
Split it into two queries and combine elsewhere
SELECT ID, ColumnA, ColumnB FROM TableA JOIN TableB
SELECT ID, ColumnC, ColumnD FROM TableA JOIN TableB
Truncate the columns appropriately
SELECT LEFT(LongColumn,2000)...
Remove any redundant columns from the SELECT
SELECT ColumnA, ColumnB, --IDColumnNotUsedInOutput
FROM TableA
Migrate off of SQL Server 2000