Inner Join On (Hive) gives different result than PySpark Inner Join - pyspark

I observed this today: when I execute the following query in the Hive CLI, I get a different result than when I run it with PySpark:
Hive :
Select count(distinct t1.fieldX) from table1 t1 JOIN table2 t2 ON (t1.fieldX=t2.fieldX AND t1.fieldY=t2.fieldY);
Result : 17 488
SparkSQL :
hc.sql("Select count(distinct t1.fieldX) from table1 t1 JOIN table2 t2 ON (t1.fieldX==t2.fieldX AND t1.fieldY==t2.fieldY)")
Result : 5555
I obtain the same result using this code :
from pyspark.sql.functions import col

table1.alias("t1").join(
    other=table2.alias("t2"),
    on=[col("t1.fieldX") == col("t2.fieldX"), col("t1.fieldY") == col("t2.fieldY")],
    how='inner'
).select("t1.fieldX").distinct().count()
Result : 5555
I don't understand why I get different results!

Related

Left outer join in Teradata cannot get the full result of left subquery

I want to join two sub-queries:
(Select col1,col2,col3,col4, sum(money1) as sum_money
from table1 group by col1,col2,col3,col4) -- this is subquery1, returns 135 rows
(Select col1,col2,col3,col4, sum(money2) as sum_money_new
from table2 group by col1,col2,col3,col4) -- this is subquery2, returns 79 rows
Then I left join these two sub-queries as follows, but I still get only 79 rows:
select subquery1.col1,subquery1.col2,subquery1.col3,subquery1.col4,subquery1.sum_money,subquery2.sum_money_new
from subquery1
left join subquery2
on subquery1.col1=subquery2.col1
and subquery1.col2=subquery2.col2
and subquery1.col3=subquery2.col3
and subquery1.col4=subquery2.col4
If I change the join above to a right join, I still get 79 rows. Can anyone help?
The Teradata version is 16.20.53.27.

How to use Lateral joins with TypeORM for Postgres db?

I have a Postgres query with 5 joins that was taking more than 10 seconds to execute. I rewrote it using a lateral join and it now executes in less than 1 second. I want to write that query in TypeORM but haven't found a way to do it. Any help is appreciated.
This is the query that needs to be converted to TypeORM:
FROM "institution" "i"
INNER JOIN "financial_account" "fa" ON "fa"."institutionId"="i"."id"
left join (
select h1.*
from holding as h1
inner join (
select cast(max(h3.holding_date) as DATE) holding_date,h3."accountId"
From "holding" as h3
group by h3."accountId"
) as h2 on h2."accountId" = h1."accountId" and h2."holding_date" = cast(h1."holding_date" as date)
) "h" on "h"."accountId" = "fa".id
LEFT JOIN "securities" "sec" ON "sec"."id"="h"."securityId"
LEFT JOIN "symbol" "sym" ON "sym"."id"="sec"."symbolId" AND "sym"."trRIC"="sec"."symbolTrRIC"
where "i".af_user_id='1234'
ORDER BY "h".institution_value desc;

SQL: Joining tables using wildcards

When I join 2 tables
select t1.col1, t1.col2, t2.col3
from t1
left join t2
on t1.col2=t2.col3
Then I don't get any duplicate rows. However, when I try joining
tables using wildcards:
select t1.col1, t1.col2, t2.col3
from t1
left join t2
on t1.col2 like '%'||t2.col3
Then I'll get duplicate values. I saw this post that I think is
getting me somewhere but I couldn't really understand the solution.
Joining 2 Tables using a wildcard
It says I can use exists to get rid of duplicate values. I also
don't really understand his query. Here is the query in the other
post:
select *
from tableA a
where exists (select 1 from tableB b where a.id like '%' + b.id + '%');
What does the select 1 from tableB do?
I'm using PostgreSQL
This is what I've tried, even though I don't fully understand it, and it still gives me duplicates:
select t1.col1,t1.col2,t2.col3
from t1
left join t2
on t1.col2 like '%'||t2.col3
where exists (select 1 from t2 where t1.col2 like '%'||t2.col3)
Use the DISTINCT clause:
SELECT DISTINCT t1.col1, t1.col2, t2.col3
FROM t1
LEFT JOIN t2 ON t1.col2 LIKE '%'||t2.col3;
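To make the difference concrete, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for PostgreSQL (the `LIKE` and `||` syntax used here behaves the same way). The sample rows are made up for illustration. It shows that the wildcard join returns one row per matching t2 row, that DISTINCT only collapses rows that are identical in every selected column, and that EXISTS returns each t1 row at most once because the subquery is only checked for "does any row come back". The `select 1` is just a placeholder; EXISTS ignores what the subquery selects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: 'xabc' ends with both 'abc' and 'bc',
# and 'bc' appears twice in t2.
conn.executescript("""
    CREATE TABLE t1 (col1 INTEGER, col2 TEXT);
    CREATE TABLE t2 (col3 TEXT);
    INSERT INTO t1 VALUES (1, 'xabc'), (2, 'zzz');
    INSERT INTO t2 VALUES ('abc'), ('bc'), ('bc');
""")

# Plain wildcard join: one output row per (t1 row, matching t2 row) pair.
joined = conn.execute("""
    SELECT t1.col1, t1.col2, t2.col3
    FROM t1
    LEFT JOIN t2 ON t1.col2 LIKE '%' || t2.col3
    ORDER BY 1, 3
""").fetchall()

# DISTINCT removes only rows that are identical in all selected columns.
distinct_rows = conn.execute("""
    SELECT DISTINCT t1.col1, t1.col2, t2.col3
    FROM t1
    LEFT JOIN t2 ON t1.col2 LIKE '%' || t2.col3
    ORDER BY 1, 3
""").fetchall()

# EXISTS filters t1 without joining, so each t1 row appears at most once.
exists_rows = conn.execute("""
    SELECT t1.col1, t1.col2
    FROM t1
    WHERE EXISTS (SELECT 1 FROM t2 WHERE t1.col2 LIKE '%' || t2.col3)
""").fetchall()

print(len(joined), len(distinct_rows), exists_rows)
```

In the attempt from the question, the LEFT JOIN is still present, so the rows are multiplied by the join before the WHERE EXISTS filter is applied. That is why the duplicates remain there.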

left join from a group of tables to another group of tables in postgresql

I have some tables, say t1, t2 and t3.
I need to implement something like this in PostgreSQL:
select * from (t1 , t2) left join t3
where t1.some_column = t3.some_column;
But PostgreSQL complains:
ERROR: syntax error at or near "," SQL state: 42601 Character: 77
You can't use from (t1,t2); you have to join them in some way.
Try something like this:
select * from t1
inner join t2 on t1.some_column = t2.some_column
left join t3 on t1.some_column = t3.some_column;
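As a quick check of the shape of that query, here is a small sketch run against an in-memory SQLite database from Python (standing in for PostgreSQL; this join syntax is the same in both). The table contents, the `id` and `extra` columns, and the assumption that t1 and t2 can be joined on `some_column` are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables: t3 has a match for 'a' but not for 'b'.
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, some_column TEXT);
    CREATE TABLE t2 (id INTEGER, some_column TEXT);
    CREATE TABLE t3 (some_column TEXT, extra TEXT);
    INSERT INTO t1 VALUES (1, 'a'), (2, 'b');
    INSERT INTO t2 VALUES (10, 'a'), (20, 'b');
    INSERT INTO t3 VALUES ('a', 'match');
""")

# Join t1 and t2 first, then left join t3 onto the combined rows.
rows = conn.execute("""
    SELECT t1.id, t2.id, t3.extra
    FROM t1
    INNER JOIN t2 ON t1.some_column = t2.some_column
    LEFT JOIN t3 ON t1.some_column = t3.some_column
    ORDER BY t1.id
""").fetchall()

for row in rows:
    print(row)
```

The row for 'b' is kept with a NULL in the t3 column, which is the left join behaviour the question asks for.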

Join table variable vs join view

I have a stored procedure which is running quite slowly, so I want to extract part of the query into a separate view.
My code looks something like this:
DECLARE #tmpTable TABLE(..)
INSERT INTO #tmpTable (..) *query* (returns 3000 rows)
Select ... from table1
inner join table2
inner join table3
inner join #tmpTable
...
I then extract (copy-paste) the *query* and put it in a view - i.e. vView.
Doing this will then give me a different result:
Select ... from table1
inner join table2
inner join table3
inner join vView
...
Why? I can see that vView and #tmpTable both return 3000 rows, so they should match (I also ran an EXCEPT query to check).
Any comments would be much appreciated, as I feel quite stuck with this.
EDITED:
This is the full query for getting the result (using #tmpTable or vView gives me different results, although they appear the same):
select dep.sid as depsid, dep.[name], COUNT(b.sid) as possiblelogins, count(ls.clientsid) as logins
from department dep
inner join relationship r on dep.sid=r.primarysid and r.relationshiptypeid=27 and r.validto is null
inner join [user] u on r.secondarysid=u.sid
inner join relationship r2 on u.sid=r2.secondarysid and r2.validto is null and r2.relationshiptypeid in (1,37)
inner join client c on r2.primarysid=c.sid
inner join ***#tmpTable or vView*** b on b.sid = c.sid
left outer join (select distinct clientsid from logonstatistics) as ls on b.sid=ls.clientsid
GROUP BY dep.sid, dep.[name],dep.isdepartment
HAVING dep.isdepartment=1
You may not need the view/table at all if you change the query as follows.
It is joined to client c and appears to be there only to JOIN onto logonstatistics.
--remove inner join ***#tmpTable or vView*** b on b.sid = c.sid
--change JOIN
left outer join (select distinct clientsid from logonstatistics) as ls on c.sid=ls.clientsid
Also change COUNT(b.sid) to COUNT(c.sid) in the SELECT clause.
Otherwise, if you still get different results, I can see two possibilities:
The table and the view contain different data. Have you run a line-by-line comparison?
One has NULL where the other has a value (especially in the sid column, which affects the JOIN).
Finally, when you say "different results", do you mean you get 2x or 3x the rows? A different COUNT? What exactly?
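To illustrate the two possibilities above, here is a small sketch using Python's sqlite3 module in place of SQL Server. The tables `tmp_table` and `v_rows` are hypothetical stand-ins for #tmpTable and the rows returned by vView, and the data is invented; this may not be what is happening in your case. It shows two inputs with the same row count where an EXCEPT in one direction finds nothing (EXCEPT works on distinct rows, so a duplicate goes unnoticed), while the join counts still differ because a NULL sid never matches in the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE client (sid INTEGER);
    CREATE TABLE tmp_table (sid INTEGER);  -- stands in for #tmpTable
    CREATE TABLE v_rows (sid INTEGER);     -- stands in for vView's rows
    INSERT INTO client VALUES (1), (2), (3);
    INSERT INTO tmp_table VALUES (1), (2), (2);   -- 3 rows, one duplicate
    INSERT INTO v_rows VALUES (1), (2), (NULL);   -- 3 rows, one NULL
""")

def count_joined(table_name):
    """Count the rows produced by joining client to the given table on sid."""
    sql = f"SELECT COUNT(*) FROM client c INNER JOIN {table_name} b ON b.sid = c.sid"
    return conn.execute(sql).fetchone()[0]

# EXCEPT must be run in both directions to catch every difference.
only_in_tmp = conn.execute(
    "SELECT sid FROM tmp_table EXCEPT SELECT sid FROM v_rows"
).fetchall()
only_in_view = conn.execute(
    "SELECT sid FROM v_rows EXCEPT SELECT sid FROM tmp_table"
).fetchall()

print("only in tmp_table:", only_in_tmp)
print("only in v_rows:", only_in_view)
print("join counts:", count_joined("tmp_table"), count_joined("v_rows"))
```

If an EXCEPT in both directions comes back empty, comparing a per-sid row count (GROUP BY sid with COUNT(*)) on each side would be a reasonable next check for duplicates.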