I have a complex SQL query against PostgreSQL that consists of multiple nested UNION ALL subqueries, something like this:
(
(
(<QUERY 1-1-1> UNION ALL <QUERY 1-1-2>) UNION ALL
(<QUERY 1-1-3> UNION ALL <QUERY 1-1-4>) UNION ALL
...
) UNION ALL
(
(<QUERY 1-2-1> UNION ALL <QUERY 1-2-2>) UNION ALL
(<QUERY 1-2-3> UNION ALL <QUERY 1-2-4>) UNION ALL
...
) UNION ALL
...
) UNION ALL
(
(
(<QUERY 2-1-1> UNION ALL <QUERY 2-1-2>) UNION ALL
(<QUERY 2-1-3> UNION ALL <QUERY 2-1-4>) UNION ALL
...
) UNION ALL
(
(<QUERY 2-2-1> UNION ALL <QUERY 2-2-2>) UNION ALL
(<QUERY 2-2-3> UNION ALL <QUERY 2-2-4>) UNION ALL
...
) UNION ALL
...
) UNION ALL
(
...
)
Each <QUERY i> is a relatively lightweight query that produces about 100K-1M rows and can be sorted in memory without significant performance impact.
The resulting query consists of tens of thousands of multi-level nested UNION ALL queries in a strict, conventional order, like a depth-first traversal of a tree, so the result is a dataset of several billion rows.
So the question is: since SQL does not guarantee the order of a UNION ALL statement, the outer query should contain an ORDER BY clause, but the server hardware cannot sort billions of rows in the required time.
However, the order of the united queries is strictly determined and should be <QUERY 1-1-1>, <QUERY 1-1-2>, and so on, sorted hierarchically; sorting in the outer query is in fact redundant, since the dataset is already sorted by the structure of the SQL query itself.
I need to force Postgres to preserve the order of the nested UNION ALL statements. How can I do that? Plugins, extensions, and even dirty hacks are welcome.
Please avoid answers and comments that frame this as an XY problem; the question is formulated as-is, in a research spirit. The structure of the database and the dataset cannot be changed under the conditions of the question. Thanks.
Try this: materialize the queries' results into a temporary table.
Here it is step by step:
Create a temporary table, e.g. the_temp_table, with the same record type as <QUERY 1-1-1>:
create temporary table the_temp_table as <QUERY 1-1-1> limit 0;
Add an auto-increment primary key column extra_id to the_temp_table
alter table the_temp_table add column extra_id serial primary key;
Then run all your queries one by one in the right order
insert into the_temp_table <QUERY 1-1-1>; insert into the_temp_table <QUERY 1-1-2>;
insert into the_temp_table <QUERY 1-1-3>; insert into the_temp_table <QUERY 1-1-4>;
insert into the_temp_table <QUERY 1-2-1>; insert into the_temp_table <QUERY 1-2-2>;
insert into the_temp_table <QUERY 1-2-3>; insert into the_temp_table <QUERY 1-2-4>;
-- continue
Finally
select <fields list w/o extra_id> from the_temp_table order by extra_id;
-- no sorting is taking place here
This way you effectively emulate UNION ALL in a controlled manner, with an insignificant performance penalty.
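As a minimal end-to-end sketch, assuming each <QUERY i> is a plain SELECT over a hypothetical table part_i with columns (id, payload):

create temporary table the_temp_table as
    select id, payload from part_1 limit 0;

alter table the_temp_table add column extra_id serial primary key;

-- run the inserts in exactly the required order; extra_id is assigned ascending
insert into the_temp_table (id, payload) select id, payload from part_1;
insert into the_temp_table (id, payload) select id, payload from part_2;
insert into the_temp_table (id, payload) select id, payload from part_3;

-- the ORDER BY can be satisfied by the primary-key index instead of a sort
select id, payload from the_temp_table order by extra_id;

Check the plan with EXPLAIN: if a Sort node still appears, an index scan on extra_id can usually be encouraged, for example by lowering random_page_cost.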
There are 2 ways of looking at this:
The safest alternative is to declare an id column using SERIAL or BIGSERIAL, which will be ordered and indexed. As the records are already ordered, there will be minimal effect on query speed, and you can be sure there are no errors in the ordering.
If the order is not critical and you don't modify the data at all, it will probably be fetched in the same order you inserted it, but there is no guarantee. How important is the order to your application?
Related
I have three tables: table_a, table_b, table_c. All of them have a GiST index.
I would like to perform a left join between table_c and the UNION of table_a and table_b.
Can the UNION be considered "indexed"? I assume it would be better to create a new table as the UNION, but these tables are huge, so I am trying to avoid that kind of redundancy.
In terms of SQL, my question:
Is this
SELECT * FROM myschema.table_c AS a
LEFT JOIN
(SELECT col_1,col_2,the_geom FROM myschema.table_a
UNION
SELECT col_1,col_2,the_geom FROM myschema.table_b) AS b
ON ST_Intersects(a.the_geom,b.the_geom);
equal to this?
CREATE TABLE myschema.table_d AS
SELECT col_1,col_2,the_geom FROM myschema.table_a
UNION
SELECT col_1,col_2,the_geom FROM myschema.table_b;
CREATE INDEX idx_table_d_the_geom
ON myschema.table_d USING gist
(the_geom)
TABLESPACE mydb;
SELECT * FROM myschema.table_c AS a
LEFT JOIN myschema.table_d AS b
ON ST_Intersects(a.the_geom,b.the_geom);
You can look at the execution plan with EXPLAIN, but I doubt that it will use the indexes.
Rather than performing a left join between one table and the union of the other tables, you should perform the union of the left joins between the one table and each of the others in turn. That will be a longer statement, but PostgreSQL will be sure to use the index if that can speed up the left joins.
Be sure to use UNION ALL rather than UNION unless you really have to remove duplicates.
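A sketch of that rewrite, with the tables and columns from the question. Note it is not strictly identical to the original query: a row of table_c that intersects nothing will come out once per branch with NULLs, so extra filtering may be needed.

SELECT a.*, b.col_1, b.col_2, b.the_geom
FROM myschema.table_c AS a
LEFT JOIN myschema.table_a AS b ON ST_Intersects(a.the_geom, b.the_geom)
UNION ALL
SELECT a.*, b.col_1, b.col_2, b.the_geom
FROM myschema.table_c AS a
LEFT JOIN myschema.table_b AS b ON ST_Intersects(a.the_geom, b.the_geom);

Each branch joins against a single physical table, so the planner is free to use the GiST index on that table's the_geom.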
Couldn't find an exact duplicate question, so please point me to one if you know of it.
https://i.stack.imgur.com/Xjmca.jpg
See the screenshot (sorry for link, not enough rep). In the table I have ID, Cat, Awd, and Xmit.
I want a resultset where each row is a distinct ID plus the aggregate Awd and Xmit amounts for each Cat (so four add'l columns per ID).
Currently I'm using two CTEs, one to aggregate each of Awd and Xmit. Both make use of the PIVOT operator, using Cat to spread and ID to group. After each CTE does its thing, I'm INNER JOINing them on ID.
WITH CTE1 (ID, P_Awd, G_Awd) AS (
    SELECT ID, [P], [G]
    FROM (SELECT ID, Cat, Awd FROM [Table]) AS src
    PIVOT (SUM(Awd) FOR Cat IN ([P], [G])) AS p
),
CTE2 (ID, P_Xmit, G_Xmit) AS (
    -- same as CTE1, but with Xmit in place of Awd
    SELECT ID, [P], [G]
    FROM (SELECT ID, Cat, Xmit FROM [Table]) AS src
    PIVOT (SUM(Xmit) FOR Cat IN ([P], [G])) AS p
)
SELECT CTE1.ID, P_Awd, P_Xmit, G_Awd, G_Xmit
FROM CTE1 INNER JOIN CTE2 ON CTE1.ID = CTE2.ID;
The output of this (greatly simplified) is two rows per ID, with each row holding the resultset of one CTE or the other.
What am I overlooking? Am I overcomplicating this?
Here is one method, via CROSS APPLY: it unpivots each row into (Item, Value) pairs so that a single PIVOT can aggregate both Awd and Xmit at once, rather than joining two pivoted CTEs.
Also, this assumes you don't need dynamic SQL.
Example
Select *
From (
Select ID
,B.*
From YourTable A
Cross Apply ( values (cat+'_Awd',Awd)
,(cat+'_Xmit',Xmit)
) B(Item,Value)
) src
Pivot (sum(Value) for Item in ([P_Awd],[P_XMit],[G_Awd],[G_XMit]) ) pvt
Returns (limited set -- it's best if you don't use images for sample data):
ID P_Awd P_XMit G_Awd G_XMit
1 1000 500 1000 0
2 2000 1500 500 500
I have a load of partitioned tables which I would like to consume into Tableau. This worked really well with Qlik Sense, because it would consume each table into its own memory, then process it.
In Tableau I can't see a way to UNION tables (though you can UNION files). If I try to union them as custom SQL, it just loads for hours, so I'm assuming it's pulling all the data at once, which is 7GB of data and won't perform well on the DB or in Tableau. The database is PostgreSQL.
The partitions are pre-aggregated, so when I do the custom query union it looks like this:
SELECT user_id, grapes, day FROM steps.steps_2016_04_02 UNION
SELECT user_id, grapes, day FROM steps.steps_2016_04_03 UNION
SELECT user_id, grapes, day FROM steps.steps_2016_04_04 UNION ...
If you can guarantee that the data of each table is unique, then don't use UNION, because it has to do extra work to make the rows distinct.
Use UNION ALL instead, which is basically an append of rows. UNION (the same as UNION DISTINCT), as you wrote it, is roughly equivalent to:
SELECT DISTINCT * FROM (
SELECT user_id, grapes, day FROM steps.steps_2016_04_02 UNION ALL
SELECT user_id, grapes, day FROM steps.steps_2016_04_03 UNION ALL
SELECT user_id, grapes, day FROM steps.steps_2016_04_04
) t;
And the DISTINCT can be a very slow operation.
Another, simpler option is to use PostgreSQL's partitioning with table inheritance and let Tableau work on it as a single table.
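A minimal sketch of the inheritance approach, assuming the per-day tables all have exactly the three columns shown above (the parent table name and the column types are assumptions):

-- empty parent table; each child must contain all of the parent's columns
CREATE TABLE steps.steps (user_id int, grapes int, day date);

ALTER TABLE steps.steps_2016_04_02 INHERIT steps.steps;
ALTER TABLE steps.steps_2016_04_03 INHERIT steps.steps;
ALTER TABLE steps.steps_2016_04_04 INHERIT steps.steps;

-- a query on the parent scans every child, no UNION ALL needed
SELECT user_id, grapes, day FROM steps.steps;

Tableau can then be pointed at steps.steps as if it were one ordinary table.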
Take this query:
SELECT c.CustomerID, c.AccountNumber, COUNT(*) AS CountOfOrders,
SUM(s.TotalDue) AS SumOfTotalDue
FROM Sales.Customer AS c
INNER JOIN Sales.SalesOrderHeader AS s ON c.CustomerID = s.CustomerID
GROUP BY c.CustomerID, c.AccountNumber
ORDER BY c.CustomerID;
I expected COUNT(*) to count the rows in Sales.Customer but to my surprise it counts the number of rows in the joined table.
Any idea why this is? Also, is there a way to be explicit in specifying which table COUNT() should operate on?
Query Processing Order...
The FROM clause is processed before the SELECT clause -- which is to say, by the time SELECT comes into play, there is only one (virtual) table to select from -- namely, the individual tables after they have been joined (JOIN), filtered (WHERE), etc.
If you just want to count over the one table, then you might try a couple of things...
COUNT(DISTINCT table1.id)
Or turn the table you want to count into a sub-query with COUNT() inside of it.
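A sketch of the sub-query variant against the same tables as the question: the orders are aggregated per customer before the join, so COUNT(*) only ever sees Sales.SalesOrderHeader rows.

SELECT c.CustomerID, c.AccountNumber, s.CountOfOrders, s.SumOfTotalDue
FROM Sales.Customer AS c
INNER JOIN (
    -- COUNT(*) runs over the one table, before the join happens
    SELECT CustomerID, COUNT(*) AS CountOfOrders, SUM(TotalDue) AS SumOfTotalDue
    FROM Sales.SalesOrderHeader
    GROUP BY CustomerID
) AS s ON c.CustomerID = s.CustomerID
ORDER BY c.CustomerID;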
I have 2 simple SELECT queries that get me lists of IDs. My first table returns, let's say, 5 IDs.
1, 2, 5, 10, 23
My second table returns a list of 50 IDs, not in any order.
What is the most efficient way to write a query that maps each of the IDs from my first table to all of the IDs from the second table?
Edit: sorry, here is more info.
If table 1 has a result of ids = 1, 2, 5, 10, 23
and table 2 has a list of ids = 123, 234, 345, 456, 567
I would like to write an insert that puts these values into table 3:
Table1ID | Table2ID
1|123
1|234
1|345
1|456
1|567
2|123
2|234
2|345
2|456
2|567
and so on.
It seems like what you are looking for is a Cartesian product.
You can get one simply by joining the two tables together with no join condition, which is what CROSS JOIN does.
INSERT dbo.TableC (AID, BID)
SELECT A.ID, B.ID
FROM
dbo.TableA A
CROSS JOIN dbo.TableB B
;
Here is an image with a visualization of a Cartesian product. The inputs are small: the column of symbols on the left corresponds to the first table, and the column on the right to the second table. Performing a JOIN with no conditions gives you one row per connecting line in the middle.
Use an INSERT INTO ... SELECT statement with a CROSS JOIN:
INSERT INTO TableC (ID1, ID2)
SELECT A.ID AS ID1, b.ID AS ID2 FROM TableA A CROSS JOIN TableB B;
INSERT INTO…SELECT is described on MSDN: INSERT (Transact-SQL)
You can use INSERT INTO <target_table> SELECT <columns> FROM <source_table> to efficiently transfer a large number of rows from one table, such as a staging table, to another table with minimal logging. Minimal logging can improve the performance of the statement and reduce the possibility of the operation filling the available transaction log space during the transaction.