I have a big table that is joined using tsvector:
SELECT
    queries.query
FROM reports
INNER JOIN queries ON (
    strip(to_tsvector('italian', reports.query)) = strip(to_tsvector('italian', queries.query))
    OR strip(to_tsvector('italian', reports.text)) = strip(to_tsvector('italian', queries.query))
)
Now this query is really slow, and I would like to follow what is suggested here and create a new tsvector column (+ index it) on the reports table that stores the value of
strip(to_tsvector('italian', reports.query))
so I can search on this newly created column. I am not interested in weights.
The article suggests creating a trigger function, but the problem is that the dictionary could be 'italian', 'english', 'german', and so on.
The language can be derived via a join with another table:
SELECT
language
FROM profiles
WHERE profileid = reports.profileid
But I suspect this could be complicated. How could I do it?
Otherwise, I could run a scheduled script that manually populates the tsvector column after importing/updating data in the reports table. In that case I would switch languages at the application level.
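For reference, a minimal sketch of the trigger approach with a per-row language lookup. It assumes a new reports.query_tsv column, and that profiles.language holds a valid text search configuration name ('italian', 'english', ...); the function, trigger, and column names are hypothetical:
CREATE OR REPLACE FUNCTION reports_tsv_update() RETURNS trigger AS $$
DECLARE
    lang regconfig;
BEGIN
    -- look up the language for this row's profile
    SELECT p.language::regconfig INTO lang
    FROM profiles p
    WHERE p.profileid = NEW.profileid;

    -- fall back to 'simple' when no profile/language is found
    NEW.query_tsv := strip(to_tsvector(coalesce(lang, 'simple'), coalesce(NEW.query, '')));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER reports_tsv_trg
BEFORE INSERT OR UPDATE ON reports
FOR EACH ROW EXECUTE PROCEDURE reports_tsv_update();

CREATE INDEX reports_query_tsv_idx ON reports USING GIN (query_tsv);
A GIN index supports @@ full-text searches; for the plain tsvector equality join shown above, a regular btree index on query_tsv may be the better fit.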
I am doing a query on a very large data set and I am using the WITH (CTE) syntax. This seems to take a while, and I was reading online that temp tables could be faster to use in these cases. Can someone advise me on which direction to go? In the CTE we join to a lot of tables, then we filter on the CTE result.
Only interested in Postgres answers.
What version of PostgreSQL are you using? CTEs perform differently in PostgreSQL versions 11 and older than in versions 12 and above.
In PostgreSQL 11 and older, CTEs are optimization fences: outer query restrictions are not passed on to the CTE. The database evaluates the query inside the CTE and caches the results (i.e., materializes them), and outer WHERE clauses are applied later, when the outer query is processed. This means either a full table scan or a full index scan is performed, resulting in horrible performance for large tables. To avoid this, apply as many filters as possible in the WHERE clause inside the CTE:
WITH UserRecord AS (SELECT * FROM Users WHERE Id = 100)
SELECT * FROM UserRecord;
PostgreSQL 12 addresses this problem by introducing query optimizer hints that let us control whether the CTE should be materialized or not: MATERIALIZED and NOT MATERIALIZED.
WITH AllUsers AS NOT MATERIALIZED (SELECT * FROM Users)
SELECT * FROM AllUsers WHERE Id = 100;
Note: Text and code examples are taken from my book Migrating your SQL Server Workloads to PostgreSQL
Summary:
PostgreSQL 11 and older: Use Subquery
PostgreSQL 12 and above: Use CTE with NOT MATERIALIZED clause
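For PostgreSQL 11 and older, the "use a subquery" advice means inlining the CTE body so the planner can push the outer filter down. A sketch using the same hypothetical Users table as above:
SELECT * FROM (SELECT * FROM Users) AS AllUsers WHERE Id = 100;
Here the subquery is flattened and Id = 100 is applied directly against Users, so an index on Id can be used.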
My follow-up is more than I can fit in a comment... so understand this may not be an answer to the OP per se.
Take the following query, which uses a CTE:
with sales as (
    select item, sum(qty) as sales_qty, sum(revenue) as sales_revenue
    from sales_data
    where country = 'USA'
    group by item
),
inventory as (
    select item, sum(on_hand_qty) as inventory_qty
    from inventory_data
    where country = 'USA' and on_hand_qty != 0
    group by item
)
select
    a.item, a.description, s.sales_qty, s.sales_revenue,
    i.inventory_qty, i.inventory_qty * a.cost as inventory_cost
from
    all_items a
    left join sales s on a.item = s.item
    left join inventory i on a.item = i.item
There are times where I cannot explain why the query runs slower than I would expect. Sometimes simply materializing the CTEs makes it run better, as expected. Other times it does not, but when I do this:
drop table if exists sales;
drop table if exists inventory;

create temporary table sales as
select item, sum(qty) as sales_qty, sum(revenue) as sales_revenue
from sales_data
where country = 'USA'
group by item;

create temporary table inventory as
select item, sum(on_hand_qty) as inventory_qty
from inventory_data
where country = 'USA' and on_hand_qty != 0
group by item;

select
    a.item, a.description, s.sales_qty, s.sales_revenue,
    i.inventory_qty, i.inventory_qty * a.cost as inventory_cost
from
    all_items a
    left join sales s on a.item = s.item
    left join inventory i on a.item = i.item;
Suddenly all is right in the world.
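For what it's worth, on PostgreSQL 12+ the same forced materialization can be requested inline, without temp tables. A sketch on the sales CTE from above:
with sales as materialized (
    select item, sum(qty) as sales_qty, sum(revenue) as sales_revenue
    from sales_data
    where country = 'USA'
    group by item
)
select * from sales;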
Temp tables in PostgreSQL are session-based: both the data and the structure are dropped automatically at the end of the session. Within a session, though, they do persist, which is why to be safe I always drop first:
drop table if exists sales;
And use "if exists" to avoid any errors about the object not existing.
I rarely use these in common queries for the simple reason that they are not as portable as a simple SQL statement (you can't give the final query to another user without having the temp tables). My most common use case is when I am processing within a procedure/function:
create procedure sales_and_inventory()
language plpgsql
as
$BODY$
BEGIN
    -- build the temp table as shown earlier
    create temp table sales ...;

    insert into sales_inventory
    select ...;

    drop table sales;
END;
$BODY$;
Hopefully this helps.
Also, to answer your question on indexes... typically I don't, but nothing says that's always the right answer. If I put data into a temp table, I assume I'm going to use all or most of it. That said, if you plan to query it multiple times with conditions where an index makes sense, then by all means do it.
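If you do decide to index, a small sketch on the sales temp table from above. Note that autovacuum does not process temp tables, so an explicit ANALYZE helps the planner use the index sensibly:
create index on sales (item);
analyze sales;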
I have a table Design and a view on that table called ArchivedDesign. The view is declared as:
CREATE OR REPLACE VIEW public."ArchivedDesign" ("RootId", "Id", "Created", "CreatedBy", "Modified", "ModifiedBy", "VersionStatusId", "OrganizationId")
AS
SELECT DISTINCT ON (des."RootId") "RootId", des."Id", des."Created", des."CreatedBy", des."Modified", des."ModifiedBy", des."VersionStatusId", des."OrganizationId"
FROM public."Design" AS des
JOIN public."VersionStatus" AS vt ON des."VersionStatusId" = vt."Id"
WHERE vt."Code" = 'Archived'
ORDER BY "RootId", des."Modified" DESC;
Then, I have a large query which gets a short summary of the latest changes, thumbnails, etc. The whole query is not important, but it contains two almost identical subqueries - one for the main table and one for the view.
SELECT DISTINCT ON (1) x."Id",
TRIM(con."Name") AS "Contributor",
extract(epoch from x."Modified") * 1000 AS "Modified",
x."VersionStatusId",
x."OrganizationId"
FROM public."Design" AS x
JOIN "Contributor" AS con ON con."DesignId" = x."Id"
WHERE x."OrganizationId" = ANY (ARRAY[]::uuid[])
AND x."VersionStatusId" = ANY (ARRAY[]::uuid[])
GROUP BY x."Id", con."Name"
ORDER BY x."Id";
and
SELECT DISTINCT ON (1) x."Id",
TRIM(con."Name") AS "Contributor",
extract(epoch from x."Modified") * 1000 AS "Modified",
x."VersionStatusId",
x."OrganizationId"
FROM public."ArchivedDesign" AS x
JOIN "Contributor" AS con ON con."DesignId" = x."Id"
WHERE x."OrganizationId" = ANY (ARRAY[]::uuid[])
AND x."VersionStatusId" = ANY (ARRAY[]::uuid[])
GROUP BY x."Id", con."Name"
ORDER BY x."Id";
Link to SQL fiddle: http://sqlfiddle.com/#!17/d1d0f/1
The query is valid for the table, but fails for the view with the error: column x."Modified" must appear in the GROUP BY clause or be used in an aggregate function. I don't understand why there is a difference in the behavior of those two queries. How do I fix the view query to work the same way as the table query?
My ultimate goal is to replace all table sub-queries with view sub-queries so we can easily separate draft, active and archived designs.
You get that error because when you query the table directly, Postgres is able to identify the primary key of the table and knows that grouping by it is enough.
Quote from the manual
When GROUP BY is present, or any aggregate functions are present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions or when the ungrouped column is functionally dependent on the grouped columns, since there would otherwise be more than one possible value to return for an ungrouped column. A functional dependency exists if the grouped columns (or a subset thereof) are the primary key of the table containing the ungrouped column
(emphasis mine)
When querying the view, Postgres isn't able to detect that functional dependency that makes it possible to have a "shortened" GROUP BY when querying the table directly.
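One way to make the view query valid is therefore to list every ungrouped select column explicitly in the GROUP BY. A sketch based on the second query above:
SELECT DISTINCT ON (1) x."Id",
    TRIM(con."Name") AS "Contributor",
    extract(epoch from x."Modified") * 1000 AS "Modified",
    x."VersionStatusId",
    x."OrganizationId"
FROM public."ArchivedDesign" AS x
JOIN "Contributor" AS con ON con."DesignId" = x."Id"
WHERE x."OrganizationId" = ANY (ARRAY[]::uuid[])
  AND x."VersionStatusId" = ANY (ARRAY[]::uuid[])
GROUP BY x."Id", con."Name", x."Modified", x."VersionStatusId", x."OrganizationId"
ORDER BY x."Id";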
I'm not sure how to do this without using a JOIN (which ODB doesn't have, of course). In "generic" SQL, you might do something like this:
SELECT * FROM table
INNER JOIN (
    SELECT max(field) AS max_of_field, key FROM table GROUP BY key
) sub ON table.field = sub.max_of_field AND table.key = sub.key
Is there an efficient way to do this in ODB, using SELECT and/or MATCH?
Is it possible to add a new column to an existing table from another table using insert or update in conjunction with a full outer join?
In my main table I am missing some records in one column; the other table has all those records, and I want to take the full record set into the maintable. Something like this:
UPDATE maintable
SET all_records= othertable.records
FROM
FULL JOIN othertable on maintable.col = othertable.records;
where maintable.col has the same id as othertable.records.
I know I could simply join the tables, but I have a lot of comments in the maintable that I don't want to have to copy/paste back in if possible. As I understand it, joining via WHERE is the equivalent of an inner join, so it won't show me what I'm missing.
EDIT:
What I want is effectively a new maintable.col with all the records, which I can then pare down based on the presence of records in other columns from other tables.
Try this:
UPDATE maintable
SET all_records = o.records
FROM othertable o
WHERE maintable.col = o.records;
This is the general syntax to use in postgres when updating via a join.
HTH
EDIT
BTW you will need to change this - I used your example, but you are updating the maintable with the column used for the join! Your SET needs to be something like SET missingcol = o.extracol.
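In other words, something along these lines (a sketch; missingcol and extracol are placeholder names, as above):
UPDATE maintable m
SET missingcol = o.extracol
FROM othertable o
WHERE m.col = o.records;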
AMENDED GENERALISED ANSWER (following off-line chat)
To take a simplified example, suppose that you have two tables maintable and subtable, each with the same columns, but where the subtable has extra records. For both tables id is the primary key. To fill maintable with the missing records, for pre 9.5 versions of Postgres you must use the following syntax:
INSERT INTO maintable (SELECT * FROM subtable s WHERE NOT EXISTS
(SELECT 1 FROM maintable m WHERE m.id = s.id));
Since 9.5 there is a (preferred) alternative:
INSERT INTO maintable (SELECT * FROM subtable) ON CONFLICT DO NOTHING;
This is preferred because (apart from being simpler) it avoids the situation that has been known to arise in the former, where a race condition is created between the INSERT and the sub-SELECT.
Obviously when the columns are different, you need to specify in the INSERT statement which columns are inserted from which. Something like:
INSERT INTO maintable (id, ColA, ColB)
(SELECT id, ColE, ColG FROM subtable ....)
Similarly the common field might not be id in both tables. However, the simplified example should be enough to point you in the right direction.
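Putting the pieces together, a sketch with an explicit column list and an explicit conflict target (column names borrowed from the example above; requires 9.5+):
INSERT INTO maintable (id, ColA, ColB)
SELECT id, ColE, ColG FROM subtable
ON CONFLICT (id) DO NOTHING;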
I have the following dataset, which I want to transform into a table that can be exported to SQL. I am using Hive. The input is as follows:
call_id,stat1,stat2,stat3
1,a,b,c,
2,x,y,z,
3,d,e,f,
1,j,k,l,
The output table needs to have call_id as its primary key, so it needs to be unique. The output schema should be:
call_id,stat2,stat3,
1,b,c, or (1,k,l)
2,y,z,
3,e,f,
The problem is that when I use the keyword DISTINCT in the Hive query, the DISTINCT applies to all the columns combined. I want to apply the DISTINCT operation only to the call_id. Something along the lines of:
SELECT DISTINCT(call_id), stat2, stat3 from intable;
However, this is not valid in Hive (I am not well-versed in SQL either).
The only legal query seems to be
SELECT DISTINCT call_id, stat2,stat3 from intable;
But this returns multiple rows with the same call_id, as the other columns are different and the row as a whole is distinct.
NOTE: There is no arithmetic relation between a, b, c, x, y, z, etc., so any trick of averaging or summing is not viable.
Any ideas how I can do this?
One quick idea - not the best one, but it will do the job:
hive> create table temp1(a int, b string);
hive> insert overwrite table temp1
      select call_id, max(concat(stat1,'|',stat2,'|',stat3)) from intable group by call_id;
hive> insert overwrite table intable
      select a, split(b,'\\|')[0], split(b,'\\|')[1], split(b,'\\|')[2] from temp1;
Note that split() takes a regular expression, so the '|' separator has to be escaped as '\\|'.
"I want to apply the DISTINCT operation only to the call_id"
But how will Hive then know which row to eliminate?
Without knowing the amount of data / size of the stat fields you have, the following query can do the job:
select distinct i1.call_id, i1.stat2, i1.stat3
from (
    select call_id, min(concat(stat1, '|', stat2, '|', stat3)) as smin
    from intable
    group by call_id
) i2
join intable i1
    on i1.call_id = i2.call_id
    and concat(i1.stat1, '|', i1.stat2, '|', i1.stat3) = i2.smin;
(The '|' separators guard against different stat combinations concatenating to the same string.)
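For what it's worth, on Hive 0.11 and later a window function expresses "one row per call_id" more directly. A sketch that keeps an arbitrary row per group (the order by inside the window is just a deterministic tie-breaker):
select call_id, stat2, stat3
from (
    select call_id, stat2, stat3,
           row_number() over (partition by call_id order by stat1) as rn
    from intable
) t
where rn = 1;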