Hello, I'd like to join on array intersection. I've found an arrays_overlap function in Spark, yet I cannot seem to get it to work. I've also tried writing a custom UDF, to no avail. The error message I receive is "requires attributes from more than one child". I have no idea what that means, which makes this hard to debug. Am I missing something basic here? I'm using Hive/PySpark.
Sample queries:
select
a.id
from tbl a
JOIN tbl_b b
ON arrays_overlap(a.my_arr, b.my_arr) = TRUE
from pyspark.sql.types import BooleanType

def _arrays_overlap(a, b):
    # True as soon as any element of a also appears in b
    for item in a:
        if item in b:
            return True
    return False

spark.udf.register(
    "_arrays_overlap",
    _arrays_overlap,
    BooleanType()
)
select
a.id
from tbl a
JOIN tbl_b b
ON _arrays_overlap(a.my_arr, b.my_arr) = TRUE
What am I missing here?
From Spark 2.4:
We can use the array_intersect function together with size to join only the rows where the size of the intersection is != 0,
or use arrays_overlap and join on the boolean TRUE.
Example:
df=spark.createDataFrame([(1,[1,2,3]),(2,[4,5])],["id","my_arr"])
df1=spark.createDataFrame([(1,[2,3]),(2,[8,9])],["id","my_arr"])
df.createOrReplaceTempView("tbl")
df1.createOrReplaceTempView("tbl_b")
spark.sql("select a.id from tbl a join tbl_b b on arrays_overlap(a.my_arr,b.my_arr) = TRUE").show()
spark.sql("select a.id from tbl a join tbl_b b on SIZE(array_intersect(a.my_arr,b.my_arr)) != 0").show()
#+---+
#| id|
#+---+
#| 1|
#+---+
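To see what the overlap join computes without a Spark session, here is a minimal plain-Python sketch on the same sample data. The helper name overlap_join is hypothetical. It builds an inverted index from array element to row id, which is roughly the idea behind exploding both arrays, equi-joining on the element and de-duplicating. That rewrite is an assumption on my part, not something the answer above uses, and null elements are not modelled here.

```python
from collections import defaultdict

# Same sample data as the Spark example above: (id, my_arr)
tbl = [(1, [1, 2, 3]), (2, [4, 5])]
tbl_b = [(1, [2, 3]), (2, [8, 9])]

def overlap_join(left, right):
    """Return sorted (left_id, right_id) pairs whose arrays share an element."""
    # Inverted index: element -> ids of right-hand rows containing it
    index = defaultdict(set)
    for right_id, arr in right:
        for item in arr:
            index[item].add(right_id)
    # Probe the index with every left-hand element; the set removes duplicates
    pairs = set()
    for left_id, arr in left:
        for item in arr:
            for right_id in index.get(item, ()):
                pairs.add((left_id, right_id))
    return sorted(pairs)

matches = overlap_join(tbl, tbl_b)
print(matches)  # [(1, 1)]
```

Only the row with id 1 on the left finds a partner, which agrees with the output shown above.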
I've got multiple tables.
I wrote my query like this:
SELECT a.creation, b.caseno, c.instanceno
FROM TableB b
JOIN TableA a
ON a.caseno = b.caseno
JOIN TableC c
ON c.caseno = b.caseno
WHERE a.creation BETWEEN '2021-01-01' AND '2021-12-31'
I've also got TableD, which contains the following columns:
| InstanceNo | Position | Creation | TaskNo |
The idea is to add a new column (result) to my query.
If the instance from c.instanceno exists in TableD and taskno is 30 or 20, I would like d.creation for the row with the max(position).
If not, NULL is enough for the result column.
SELECT a.creation, b.caseno, c.instanceno, d.creation
FROM TableB b
JOIN TableA a
ON a.caseno = b.caseno
JOIN TableC c
ON c.caseno = b.caseno
LEFT JOIN (SELECT MAX(position) position, instanceno, creation, taskno FROM TableD GROUP BY instanceno, creation, taskno) d
ON d.instanceno = c.instanceno
AND d.taskno in (20,30)
WHERE a.creation BETWEEN '2021-01-01' AND '2021-12-31'
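Note that grouping the subquery by instanceno, creation and taskno yields one row per combination of those columns, so MAX(position) does not isolate a single row per instance. A minimal sketch of one alternative follows, run through Python's sqlite3 so it is self-contained. The sample rows are made up, TableA and TableB are left out for brevity, and I am assuming that "max(position)" means the highest position among the rows with taskno 20 or 30.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TableC (caseno INTEGER, instanceno INTEGER);
CREATE TABLE TableD (instanceno INTEGER, position INTEGER,
                     creation TEXT, taskno INTEGER);
-- Hypothetical sample data
INSERT INTO TableC VALUES (1, 100), (2, 200), (3, 300);
INSERT INTO TableD VALUES
  (100, 1, '2021-02-01', 20),
  (100, 2, '2021-03-01', 30),
  (100, 3, '2021-04-01', 40),
  (200, 1, '2021-05-01', 10);
""")

rows = conn.execute("""
SELECT c.instanceno, d.creation AS result
FROM TableC c
LEFT JOIN TableD d
  ON d.instanceno = c.instanceno
 AND d.taskno IN (20, 30)
 -- keep only the row holding the highest qualifying position
 AND d.position = (SELECT MAX(d2.position)
                   FROM TableD d2
                   WHERE d2.instanceno = d.instanceno
                     AND d2.taskno IN (20, 30))
ORDER BY c.instanceno
""").fetchall()

for row in rows:
    print(row)
```

Instance 100 gets the creation date of its position-2 row, while instances 200 and 300 have no qualifying task and get NULL (None in Python).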
There is an example request in which there are several joins.
SELECT DISTINCT ON(a.id_1) 1, a.name, b.task, c.created_at
FROM a
INNER JOIN b ON a.id_2 = b.id
INNER JOIN c ON a.ID_2 = c.id
WHERE a.deleted_at IS NULL
ORDER BY a.id_1 desc
In this case the query works and returns one row per unique id_1, sorted by id_1. But I need to sort by the column a.name. When I try that, PostgreSQL raises ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions.
The following query can serve as a solution to the problem:
SELECT *
FROM(
SELECT DISTINCT ON(a.id_1) a.name, b.task, c.created_at
FROM a
INNER JOIN b ON a.id_2 = b.id
INNER JOIN c ON a.ID_2 = c.id
WHERE a.deleted_at IS NULL
) sub
ORDER BY name DESC
But in reality the database is very large and such a query is not optimal. Are there other ways to sort by the selected column while still keeping one row per id_1?
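The two-step logic of that workaround (keep one row per id_1, then sort the survivors by name) can be illustrated in plain Python. This is only a sketch of the semantics, with made-up rows and a hypothetical helper name. It says nothing about how PostgreSQL plans or optimises the query.

```python
# Hypothetical rows standing in for the joined result of a, b and c
rows = [
    {"id_1": 1, "name": "beta", "task": "t1"},
    {"id_1": 1, "name": "alpha", "task": "t2"},
    {"id_1": 2, "name": "gamma", "task": "t3"},
]

def distinct_on_then_sort(rows, key, sort_by, reverse=False):
    """Keep the first row seen per key, then sort the kept rows by sort_by."""
    first_per_key = {}
    # Step 1: order by the DISTINCT ON key (descending) and keep the first row
    for row in sorted(rows, key=lambda r: r[key], reverse=True):
        first_per_key.setdefault(row[key], row)
    # Step 2: the outer ORDER BY, applied only to the de-duplicated rows
    return sorted(first_per_key.values(), key=lambda r: r[sort_by], reverse=reverse)

result = distinct_on_then_sort(rows, key="id_1", sort_by="name", reverse=True)
names = [r["name"] for r in result]
print(names)  # ['gamma', 'beta']
```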
I have 2 tables, a and b.
Table a
id      | name    | code
VARCHAR | VARCHAR | jsonb
1       | xyz     | [14, 15, 16]
2       | abc     | [null]
3       | def     | [null]
Table b
id | name | code
1  | xyz  | [16, 15, 14]
2  | abc  | [null]
I want to find the rows where code does not match for the same id and name. I sort the code column in b because I know some values are the same but ordered differently.
SELECT a.id,
a.name,
a.code,
c.id,
c.name,
c.code
FROM a
FULL OUTER JOIN ( SELECT id,
name,
jsonb_agg(code ORDER BY code) AS code
FROM (
SELECT id,
name,
jsonb_array_elements(code) AS code
FROM b
GROUP BY id,
name,
jsonb_array_elements(code)
) t
GROUP BY id,
name
) c
ON a.id = c.id
AND a.name = c.name
AND COALESCE (a.code, '[]'::jsonb) = COALESCE (c.code, '[]'::jsonb)
WHERE (a.id IS NULL OR c.id IS NULL)
In this case the query should return only id = 3, because it is not in table b, but it returns id = 2 as well, because I am not handling the null case properly in the inner subquery.
How can I handle the null case in the inner subquery?
The <@ operator checks whether all elements of the left array occur in the right one. The @> operator does the same the other way round. So by using both you can ensure that both arrays contain the same elements:
a.code @> b.code AND a.code <@ b.code
However, this also matches if one array contains duplicates, so [42,42] is treated the same as [42]. If you want to avoid that too, also compare the array lengths:
AND jsonb_array_length(a.code) = jsonb_array_length(b.code)
Furthermore, you may want to treat two NULL values as equal. This case has to be checked separately:
a.code IS NULL AND b.code IS NULL
A slightly shorter form uses the COALESCE function:
COALESCE(a.code, b.code) IS NULL
So the whole query could look like this:
SELECT
*
FROM a
FULL OUTER JOIN b
ON a.id = b.id AND a.name = b.name
AND (
COALESCE(a.code, b.code) IS NULL -- both null
OR (a.code @> b.code AND a.code <@ b.code
AND jsonb_array_length(a.code) = jsonb_array_length(b.code) -- avoid accepting duplicates
)
)
After that you can filter the NULL values in the WHERE clause.
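The matching rule in that join condition can be mirrored in a few lines of Python to make the cases easier to check. This is a sketch of the logic only: the helper name same_elements is hypothetical, None stands in for SQL NULL, and I am assuming the arrays hold scalars (hashable values).

```python
def same_elements(a, b):
    """Mirror the join condition: both NULL, or mutual containment plus equal length."""
    if a is None and b is None:   # COALESCE(a.code, b.code) IS NULL
        return True
    if a is None or b is None:    # only one side NULL: no match
        return False
    # set equality stands in for containment in both directions;
    # the length check rejects [42, 42] versus [42]
    return set(a) == set(b) and len(a) == len(b)

print(same_elements([14, 15, 16], [16, 15, 14]))  # True
print(same_elements([42, 42], [42]))              # False
print(same_elements(None, None))                  # True
```

One caveat of this rule, in SQL as in the sketch: [1, 1, 2] and [1, 2, 2] pass both checks although they are not the same multiset. Comparing sorted copies of the arrays would be stricter.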
I've created a query that joins six tables:
SELECT a.accession, b.value, c.name, d.description, e.value, f.seqlen, f.residues
FROM chado.dbxref a inner join chado.dbxrefprop b on a.dbxref_id = b.dbxref_id
inner join chado.biomaterial d on b.dbxref_id = d.dbxref_id
inner join chado.feature f on d.dbxref_id = f.dbxref_id
inner join chado.biomaterialprop e on d.biomaterial_id = e.biomaterial_id
inner join chado.contact c on d.biosourceprovider_id = c.contact_id;
The output:
I'm currently working with a PostgreSQL schema called Chado (http://gmod.org/wiki/Chado_Tables). My attempts to comply with the preexisting schema have led me to deposit multiple joined values within the same table (two different values within the dbxrefprop table, three different values within the biomaterialprop table). Querying the database results in a substantial amount of redundant output. Is there a way for me to reduce output redundancy by modifying my query statement? Ideally, I'd like the output to resemble the following:
test001 | GB0101 | source011 | Faaberg,K.; Lyoo,K.; Korol,D.M. | serum | T1 | Iowa, USA | 01 Jan 2005 | 1234 | AUGAACGCCUUGCAUUACUAUGACUAUGAUU
Working query statement:
SELECT a.accession, string_agg(distinct b.value, ' | ' ORDER BY b.value) AS bvalue_list, c.name, d.description, string_agg(distinct e.value, ' | ' ORDER BY e.value) AS evalue_list, f.seqlen, f.residues
FROM chado.dbxref a INNER JOIN chado.dbxrefprop b ON a.dbxref_id = b.dbxref_id
INNER JOIN chado.biomaterial d ON b.dbxref_id = d.dbxref_id
INNER JOIN chado.feature f ON d.dbxref_id = f.dbxref_id
INNER JOIN chado.biomaterialprop e ON d.biomaterial_id = e.biomaterial_id
INNER JOIN chado.contact c ON d.biosourceprovider_id = c.contact_id
GROUP BY a.accession, c.name, d.description, f.seqlen, f.residues;
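What string_agg(distinct ..., ' | ' ORDER BY ...) with GROUP BY does to the redundant rows can be sketched in plain Python. The rows below are hypothetical, loosely based on the desired output above, and only three of the columns are modelled. Python sorts by code point, so the order may differ from the database's collation.

```python
from collections import defaultdict

# Hypothetical joined rows: (accession, b_value, e_value).
# Two b values times two e values produce four redundant rows.
joined = [
    ("test001", "GB0101", "serum"),
    ("test001", "GB0101", "T1"),
    ("test001", "source011", "serum"),
    ("test001", "source011", "T1"),
]

def aggregate(rows, sep=" | "):
    """Collapse rows per accession, joining the distinct values of each column."""
    groups = defaultdict(lambda: (set(), set()))
    for accession, b_value, e_value in rows:
        b_set, e_set = groups[accession]
        b_set.add(b_value)   # the set plays the role of DISTINCT
        e_set.add(e_value)
    return {
        accession: (sep.join(sorted(b_set)), sep.join(sorted(e_set)))
        for accession, (b_set, e_set) in groups.items()
    }

result = aggregate(joined)
print(result)
```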
I got this table:
+---+----+----+----+
|ID |KEY1|KEY2|COL1|
+---+----+----+----+
|001|aaa1|bbb1|ccc1|
|101|aaa1|bbb1|ddd2|
|002|aaa2|bbb2|eee3|
|102|aaa2|bbb2|fff4|
|003|aaa3|bbb3|ggg5|
|103|aaa3|bbb3|hhh6|
+---+----+----+----+
The result must contain only the row with the highest ID among rows whose KEY1 and KEY2 are equal.
+---+----+----+----+
|ID |KEY1|KEY2|COL1|
+---+----+----+----+
|101|aaa1|bbb1|ddd2|
|102|aaa2|bbb2|fff4|
|103|aaa3|bbb3|hhh6|
+---+----+----+----+
Since in HQL I can't do a subquery like:
select * from (select....)
How can I perform this query?
**SOLUTION**
Actually the solution was a little more complex, so I want to share it: KEY1 and KEY2 were in another table, which joins to the first table on two keys.
+-----+-------+-------+-------+
|t1.ID|t2.KEY1|t2.KEY2|t1.COL1|
+-----+-------+-------+-------+
| 001| aaa1| bbb1| ccc1|
| 101| aaa1| bbb1| ddd2|
| 002| aaa2| bbb2| eee3|
| 102| aaa2| bbb2| fff4|
| 003| aaa3| bbb3| ggg5|
| 103| aaa3| bbb3| hhh6|
+-----+-------+-------+-------+
I used this CORRECT query:
SELECT t1.ID, t2.KEY1, t2.KEY2, t1.COL1
FROM yourTable1 t1, yourTable2 t2
WHERE
t1.JoinCol1 = t2.JoinCol1 and t1.JoinCol2=t2.JoinCol2 and
t1.ID = (SELECT MAX(s1.ID) FROM yourTable1 s1, yourTable2 s2
WHERE
s1.JoinCol1 = s2.JoinCol1 and s1.JoinCol2=s2.JoinCol2 and
s2.KEY1 = t2.KEY1 AND s2.KEY2 = t2.KEY2)
If we were writing this query to be run directly on a regular database, such as MySQL or SQL Server, we might be tempted to join to a subquery. However, from what I read here, subqueries in HQL can only appear in the SELECT or WHERE clauses. We can phrase your query as follows, using the WHERE clause to implement your logic.
The query will be:
SELECT t1.ID, t1.KEY1, t1.KEY2, t1.COL1
FROM yourTable t1
WHERE t1.ID = (SELECT MAX(t2.ID) FROM yourTable t2
WHERE t2.KEY1 = t1.KEY1 AND t2.KEY2 = t1.KEY2)
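As a quick check of the correlated-subquery approach, here is a minimal sketch that runs the same statement against the sample table from the question. It uses Python's sqlite3 rather than Hive or HQL, so it only demonstrates the logic. IDs are stored as text to keep the leading zeros, and the ORDER BY is added just to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (ID TEXT, KEY1 TEXT, KEY2 TEXT, COL1 TEXT)")
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?, ?)",
    [
        ("001", "aaa1", "bbb1", "ccc1"),
        ("101", "aaa1", "bbb1", "ddd2"),
        ("002", "aaa2", "bbb2", "eee3"),
        ("102", "aaa2", "bbb2", "fff4"),
        ("003", "aaa3", "bbb3", "ggg5"),
        ("103", "aaa3", "bbb3", "hhh6"),
    ],
)

# For each row, keep it only if its ID is the maximum within its (KEY1, KEY2) group
rows = conn.execute("""
SELECT t1.ID, t1.KEY1, t1.KEY2, t1.COL1
FROM yourTable t1
WHERE t1.ID = (SELECT MAX(t2.ID) FROM yourTable t2
               WHERE t2.KEY1 = t1.KEY1 AND t2.KEY2 = t1.KEY2)
ORDER BY t1.ID
""").fetchall()

for row in rows:
    print(row)
```

This returns the rows with IDs 101, 102 and 103, matching the expected result table in the question.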