Using UNNEST in a custom Postres Query in Google Data Studio - postgresql

I have a table that has multiple values saved in an array in one column. (I know that this is not normalized/optimal database structure.) I'm trying to write a query that can create rows for each value in the array. The query below is working for me in Tableau but not Google Data Studio (I'm using a custom query with the PostgreSQL connector). Are there any limitations/different syntax requirements when using UNNEST in Data Studio?
SELECT
e.name as event_name,
e.date as event_date,
l.full_name as leader_name,
p.full_name as participant_name
FROM
(
SELECT
event_id,
user_id,
UNNEST(participants_ids)::INTEGER as participant_id
from event_reports
) r
LEFT JOIN events e ON r.event_id = e.id
LEFT JOIN users l ON r.user_id = l.id
LEFT JOIN users p ON r.participant_id = p.id

Your inner query should probably have a lateral join:
SELECT e.event_id,
e.user_id,
p.participant_id
FROM event_reports AS e
CROSS JOIN
LATERAL unnest(e.participants_ids) AS p(as participant_id)
With a lateral join, you can reference something from the left side of the join on the right side. The cross join combines each event_reports row with rach participant ID that belongs to that row.

Related

Convert LEFT JOIN query to Ecto

I have some queries I need migrate to Ecto and for maintainability reasons, I'd rather not just wrap them in a fragment and call it a day.
They have a lot of LEFT JOINs in them and as I understand from this answer, a left_join in Ecto does a LEFT OUTER JOIN by default. I can't seem to figure out how to specify to Ecto that I want a LEFT INNER JOIN, which is the default behavior for a LEFT JOIN in Postgresql.
To look at a toy example, let's say posts in our database can be either anonymous or they can have a creator. I have a query to get just enough info to make a post preview, but I only want non-anonymous posts to be included:
SELECT
p.id,
p.title,
p.body,
u.name AS creator_name,
u.avatar AS creator_avatar,
FROM posts p
LEFT JOIN users u ON p.creator_id = u.id;
I would translate that into Ecto as:
nonanonymous_posts =
from p in Post,
left_join: u in User, on: p.creator_id == u.id,
select: [p.id, p.title, p.body, u.name, u.avatar]
and Ecto spits out
SELECT
t0."id",
t0."title",
t0."body",
t1."name" AS creator_name,
t1."avatar" AS creator_avatar,
FROM "posts" AS t0
LEFT OUTER JOIN "users" as t1 ON t0."creator_id" = t1."id";
which will give back anonymous posts as well.
There is no such thing as LEFT INNER JOIN. There is only INNER JOIN and LEFT [OUTER] JOIN (OUTER part is optional, as LEFT JOIN must be outer join). So what you want is just :join or :inner_join in your Ecto query.

Left join results in cross join in spark

I am trying to join two tables in pyspark using a SQLContext:
create table joined_table stored
as orc
as
SELECT A.*,
B.*
FROM TABLEA AS A
LEFT JOIN TABLEB AS B ON 1=1
where lower(A.varA) LIKE concat('%',lower(B.varB),'%')
AND (B.varC = 0 OR (lower(A.varA) = lower(B.varB)));
But I get the following error:
AnalysisException: u'Detected cartesian product for LEFT OUTER join between logical plans
parquet\nJoin condition is missing or trivial.\nUse the CROSS JOIN syntax to allow cartesian products between these relations.;
Edit:
I solved the problem using the following in Spark:
conf.set('spark.sql.crossJoin.enabled', 'true')
This enables the cross join in Pyspark!
I cannot see the on clause condition with your left join.. A Left join without join condition will always results in cross join. A cross join will repeat each row of your left hand table for each row the table on your right side. Can you edit your query and include the 'ON' clause with your join key column.

INNER JOIN, LEFT/RIGHT OUTER JOIN

Apology in advance for a long question, but doing this just for the sake of learning:
i'm new to SQL and researching on JOIN for now. I'm getting two different behaviors when using INNER and OUTER JOIN. What I know is, INNER JOIN gives an intersection kind of result while returning only common rows among tables, and (LEFT/RIGHT) OUTER JOIN is outputting what is common and remaining rows in LEFT or RIGHT tables, depending upon LEFT/RIGHT clause respectively.
While working with MS Training Kit and trying to solve this practice: "Practice 2: In this practice, you identify rows that appear in one table but have no matches in another. You are given a task to return the IDs of employees from the HR.Employees table who did not handle orders (in the Sales.Orders table) on February 12, 2008. Write three different solutions using the following: joins, subqueries, and set
operators. To verify the validity of your solution, you are supposed to return employeeIDs: 1, 2, 3, 5, 7, and 9."
I'm successful doing this with subqueries and set operators but with JOIN is returning something not expected. I've written the following query:
USE TSQL2012;
SELECT
E.empid
FROM
HR.Employees AS H
JOIN Sales.Orders AS O
ON H.empid = O.empid
AND O.orderdate = '20080212'
JOIN HR.Employees AS E
ON E.empid <> H.empid
ORDER BY
E.empid
;
I'm expecting results as: 1, 2, 3, 5, 7, and 9 (6 rows)
But what i'm getting is: 1,1,1,2,2,2,3,3,3,4,4,5,5,5,6,6,7,7,7,8,8,9,9,9 (24 rows)
I tried some videos but could not understand this side of INNER/OUTER JOIN. I'll be grateful if someone could help this side of JOIN, why is it so and what should I try to understand while working with JOIN.
you can also use left outer join to get not matching
*** The LEFT JOIN keyword returns all rows from the left table (table1), with the matching rows in the right table (table2). The result is NULL in the right side when there is no match.
SELECT
H.empid
FROM
HR.Employees AS H
LEFT OUTER JOIN Sales.Orders AS O
ON H.empid = O.empid
AND O.orderdate = '20080212'
WHERE O.empid IS NULL
Above script will return emp id who did not handle orders on specify date
here you can see all kind of join
Diagram taken from: http://dsin.wordpress.com/2013/03/16/sql-join-cheat-sheet/
adjust your query to be like this
USE TSQL2012;
SELECT
E.empid
FROM
HR.Employees AS H
JOIN Sales.Orders AS O
ON H.empid = O.empid
where O.orderdate = '2008-02-12' AND O.empid IN null
ORDER BY
E.empid
;
USE TSQL2012;
SELECT
distinct E.empid
FROM
HR.Employees AS H
JOIN Sales.Orders AS O
ON H.empid = O.empid
AND O.orderdate = '20080212'
JOIN HR.Employees AS E
ON E.empid <> H.empid
ORDER BY
E.empid
;
Primary things to always remind yourself when working with SQL JOINs:
INNER JOINs require a match in the join in order for result set rows produced prior to the INNER JOIN to remain in the result set. When no match is found for a row, the row is discarded from the result set.
For a row fed to an INNER JOIN that matches to ONLY one row, only one copy of that row fed to the result set is delivered.
For a row fed to an INNER JOIN that matches to multiple rows, the row will be delivered multiple times, once for each row match from the INNER JOIN table.
OUTER JOINs will not discard rows fed to them in the result set, whether or not the OUTER JOIN results in a match or not.
Just like INNER JOINs, if an OUTER JOIN matches to more than one row, it will increase the number of rows in the result set by duplicating rows equal to the number of rows matched from the OUTER JOIN table.
Ask yourself "if I get NO match on the JOIN, do I want the row discarded or not?" If the answer is NO, use an OUTER JOIN. If the answer is YES, use an INNER JOIN.
If you don't need to reference any of the columns from a JOIN table, don't perform a JOIN at all. Instead, use a WHERE EXISTS, WHERE NOT EXISTS, WHERE IN, WHERE NOT IN, etc. or similar, depending on your database engine in use. Don't rely on the database engine to be smart enough to discard unreferenced columns resulting from JOINs from the result set. Some databases may be smart enough to do that, some not. There's no reason to pull columns into a result set only to not reference them. Doing so increases chance of reduced performance.
Your JOIN of:
JOIN HR.Employees AS E
ON E.empid <> H.empid
...is matching to all Employees rows with a DIFFERENT EMPID to all rows fed to that join. Use of NOT EQUAL on an INNER JOIN is a very rare thing to do or need, especially if the JOIN predicate is testing only ONE condition. That is why your getting duplicate rows in the result set.
On DB2, we could perform an EXCEPTION JOIN to accomplish that using a JOIN alone. Normally, on DB2, I would use a WHERE NOT EXISTS for that. On SQL Server you could do a JOIN to a query where the query set is all employees without orders in SALES.ORDERS on the specified date, but I don't know if that violates the rules of your tutorial.
Naveen posted the solution it appears your tutorial is looking for!

How to join two tables using sub queries?

How to join two tables using sub queries in oracle sql server?
i tried it this way
select s.*,p.* from salary s
cross join (select p.* from pension p where p.sal=21);
I believe you need to name the subquery:
select s.*,p.* from salary s
cross join (select * from pension where sal=21) p

Postgres: left join with order by and limit 1

I have the situation:
Table1 has a list of companies.
Table2 has a list of addresses.
Table3 is a N relationship of Table1 and Table2, with fields 'begin' and 'end'.
Because companies may move over time, a LEFT JOIN among them results in multiple records for each company.
begin and end fields are never NULL. The solution to find the latest address is use a ORDER BY being DESC, and to remove older addresses is a LIMIT 1.
That works fine if the query can bring only 1 company. But I need a query that brings all Table1 records, joined with their current Table2 addresses. Therefore, the removal of outdated data must be done (AFAIK) in LEFT JOIN's ON clause.
Any idea how I can build the clause to not create duplicated Table1 companies and bring latest address?
Use a dependent subquery with max() function in a join condition.
Something like in this example:
SELECT *
FROM companies c
LEFT JOIN relationship r
ON c.company_id = r.company_id
AND r."begin" = (
SELECT max("begin")
FROM relationship r1
WHERE c.company_id = r1.company_id
)
INNER JOIN addresses a
ON a.address_id = r.address_id
demo: http://sqlfiddle.com/#!15/f80c6/2
Since PostgreSQL 9.3 there is JOIN LATERAL (https://www.postgresql.org/docs/9.4/queries-table-expressions.html) that allows to make a sub-query to join, so it solves your issue in an elegant way:
SELECT * FROM companies c
JOIN LATERAL (
SELECT * FROM relationship r
WHERE c.company_id = r.company_id
ORDER BY r."begin" DESC LIMIT 1
) r ON TRUE
JOIN addresses a ON a.address_id = r.address_id
The disadvantage of this approach is the indexes of the tables inside LATERAL do not work outside.
I managed to solve it using Windows Function:
WITH ranked_relationship AS(
SELECT
*
,row_number() OVER (PARTITION BY fk_company ORDER BY dt_start DESC) as dt_last_addr
FROM relationship
)
SELECT
company.*
address.*,
dt_last_addr as dt_relationship
FROM
company
LEFT JOIN ranked_relationship as relationship
ON relationship.fk_company = company.pk_company AND dt_last_addr = 1
LEFT JOIN address ON address.pk_address = relationship.fk_address
row_number() creates an int counter for each record, inside each window based to fk_company. For each window, the record with latest date comes first with rank 1, then dt_last_addr = 1 makes sure the JOIN happens only once for each fk_company, with the record with latest address.
Window Functions are very powerful and few ppl use them, they avoid many complex joins and subqueries!