Hi, I have a complex (by my SQL standards) count that I need to perform over multiple tables, and I'd love some help with it.
I have three tables:
RELEASE
ID     | METADATA_ID | Is_Active | Creation_Source | Release_Status
123456 | 123         | Y         | A1              | Active
134567 | 124         | Y         | A1              | Active
145678 | 125         | N         | A2              | Closed
RELEASE_METADATA
ID  | UPC
123 | 8001
124 | 8002
125 | 8003
RELEASE_COUNTRY_RIGHT
(RELEASE)ID | COUNTRY_ID | MARKETING_RIGHT | OPT_OUT
123456      | UK         | N               | N
123456      | US         | Y               | N
123456      | FR         | Y               | Y
134567      | UK         | Y               | N
134567      | US         | Y               | Y
145678      | UK         | Y               | Y
145678      | FR         | Y               | N
I need to be able to filter the results by the Creation_Source and Release_Status from RELEASE, include the UPC from RELEASE_METADATA, and count the related number of rows and the MARKETING_RIGHT and OPT_OUT 'Y' values from RELEASE_COUNTRY_RIGHT.
So my result would be something like:
ID     | Is_Active | Creation_Source | Release_Status | UPC  | RELEASE_COUNTRY_RIGHT Rows | MARKETING_RIGHT | OPT_OUT
123456 | Y         | A1              | Active         | 8001 | 3                          | 2               | 1
134567 | Y         | A1              | Active         | 8002 | 2                          | 2               | 1
Any help would be much appreciated.
Cheers!
Update:
I've tried using SHALAKA's solution below, but I'm having trouble after substituting the table and field names, as shown below. I've updated the table and field names above, as I realized that what I had may have been misleading.
There is also an additional field which I had missed, and the joins are not as I expected.
RELEASE joins to RELEASE_METADATA via release.METADATA_ID = release_metadata.ID, and the third table joins via release.ID = release_country_right.RELEASE_ID.
Here is my attempt:
SELECT grp_1.*, COUNT(a5.OPT_OUT) OPT_OUT
FROM (SELECT grp.*, COUNT(a4.MARKETING_RIGHT) MARKETING_RIGHT
FROM (SELECT a1.id,
a1.IS_ACTIVE,
a1.CREATION_SOURCE,
a1.RELEASE_STATUS,
a2.upc,
COUNT(a3.RELEASE_ID) a_row
FROM dschd.release a1, dschd.release_metadata a2, DSCHD.RELEASE_COUNTRY_RIGHT a3
WHERE a1.RELEASE_METADATA_ID = a2.id
AND a1.id = a3.RELEASE_ID
AND a1.IS_ACTIVE = 'Y'
GROUP BY a1.id, a1.IS_ACTIVE, a1.CREATION_SOURCE, a1.RELEASE_STATUS, a2.UPC) grp,
DSCHD.RELEASE_COUNTRY_RIGHT a4
WHERE a4.RELEASE_ID = grp.id
AND a4.MARKETING_RIGHT = 'Y'
GROUP BY grp.id,
grp.IS_ACTIVE,
grp.CREATION_SOURCE,
grp.RELEASE_STATUS,
grp.upc,
grp.a_row) grp_1,
DSCHD.RELEASE_COUNTRY_RIGHT a5
WHERE a5.RELEASE_ID = grp_1.id
AND a5.OPT_OUT = 'Y'
GROUP BY grp_1.id,
grp_1.IS_ACTIVE,
grp_1.CREATION_SOURCE,
grp_1.RELEASE_STATUS,
grp_1.upc,
grp_1.a_row,
grp_1.MARKETING_RIGHT
For reference, here is SHALAKA's original solution, with the generic table names (data1/data2/data3):
SELECT grp_1.*, COUNT(a5.selected) selected
FROM (SELECT grp.*, COUNT(a4.marketing) marketing
FROM (SELECT a1.id,
a1.active,
a1.source,
a1.status,
a2.ref,
COUNT(a3.id) a_row
FROM data1 a1, data2 a2, data3 a3
WHERE a1.id = a2.id
AND a2.id = a3.id
AND a1.active = 'Y'
GROUP BY a1.id, a1.active, a1.source, a1.status, a2.ref) grp,
data3 a4
WHERE a4.id = grp.id
AND a4.marketing = 'Y'
GROUP BY grp.id,
grp.active,
grp.source,
grp.status,
grp.ref,
grp.a_row) grp_1,
data3 a5
WHERE a5.id = grp_1.id
AND a5.selected = 'Y'
GROUP BY grp_1.id,
grp_1.active,
grp_1.source,
grp_1.status,
grp_1.ref,
grp_1.a_row,
grp_1.marketing
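For what it's worth, the three nested aggregations can usually be collapsed into a single query with conditional aggregation (a SUM over a CASE per flag). Here is a minimal sketch using an in-memory SQLite database via Python's sqlite3 to stand in for the real schema; the table and column names are copied from the question, and this is an alternative shape rather than the original answer's approach:

```python
import sqlite3

# In-memory stand-ins for the three tables from the question
# (names and types are assumptions based on the sample data).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE "release" (
    id INTEGER, metadata_id INTEGER, is_active TEXT,
    creation_source TEXT, release_status TEXT);
CREATE TABLE release_metadata (id INTEGER, upc TEXT);
CREATE TABLE release_country_right (
    release_id INTEGER, country_id TEXT, marketing_right TEXT, opt_out TEXT);

INSERT INTO "release" VALUES
    (123456, 123, 'Y', 'A1', 'Active'),
    (134567, 124, 'Y', 'A1', 'Active'),
    (145678, 125, 'N', 'A2', 'Closed');
INSERT INTO release_metadata VALUES (123, '8001'), (124, '8002'), (125, '8003');
INSERT INTO release_country_right VALUES
    (123456, 'UK', 'N', 'N'), (123456, 'US', 'Y', 'N'), (123456, 'FR', 'Y', 'Y'),
    (134567, 'UK', 'Y', 'N'), (134567, 'US', 'Y', 'Y'),
    (145678, 'UK', 'Y', 'Y'), (145678, 'FR', 'Y', 'N');
""")

# One scan of RELEASE_COUNTRY_RIGHT: COUNT(*) gives the row count, and
# conditional sums count the 'Y' values for each flag.
rows = con.execute("""
SELECT r.id, r.is_active, r.creation_source, r.release_status, m.upc,
       COUNT(*)                                                 AS rcr_rows,
       SUM(CASE WHEN c.marketing_right = 'Y' THEN 1 ELSE 0 END) AS marketing_right,
       SUM(CASE WHEN c.opt_out = 'Y'         THEN 1 ELSE 0 END) AS opt_out
FROM "release" r
JOIN release_metadata m      ON m.id = r.metadata_id
JOIN release_country_right c ON c.release_id = r.id
WHERE r.is_active = 'Y'
GROUP BY r.id, r.is_active, r.creation_source, r.release_status, m.upc
ORDER BY r.id
""").fetchall()

for row in rows:
    print(row)
# → (123456, 'Y', 'A1', 'Active', '8001', 3, 2, 1)
#   (134567, 'Y', 'A1', 'Active', '8002', 2, 2, 1)
```

This reproduces the expected result table with a single GROUP BY instead of three nested ones; the same shape should carry over to Oracle with the DSCHD-prefixed table names.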
As the title says, I want to reject rows so that I don't create duplicates.
The first step is not to join on values that have more than one matching row in the second table.
Here is an example if needed:
Table a:
aa |bb |
---|----|
1 |111 |
2 |222 |
Table h:
hh |kk |
---|----|
1 |111 |
2 |111 |
3 |222 |
Using a normal LEFT JOIN:
SELECT
*
FROM a
LEFT JOIN h
ON a.bb = h.kk
;
I get:
aa |bb |hh |kk |
---|----|---|----|
1 |111 |1 |111 |
1 |111 |2 |111 |
2 |222 |3 |222 |
I want to get rid of the first two rows, where aa = 1.
...
The second step would be another query, probably with some CASE, where from table a I will filter out only those rows which have more than one matching row in table h.
Therefore I want to create table c, where I will have:
aa |bb |
---|----|
1 |111 |
Can someone help me please?
Thank you.
To get only the 1:1 joins
SELECT a.aa,h.hh,h.kk FROM a
LEFT JOIN h ON a.bb = h.kk
GROUP BY bb HAVING COUNT(kk)=1
To get only the 1:n joins
SELECT a.aa,h.hh,h.kk FROM a
LEFT JOIN h ON a.bb = h.kk
GROUP BY bb HAVING COUNT(kk)>1
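Note that selecting a.aa, h.hh, h.kk while grouping only by bb relies on MySQL's lax GROUP BY handling; on other engines a correlated count is a portable alternative. A minimal sketch of that variant, using SQLite via Python's sqlite3 with the sample tables from the question:

```python
import sqlite3

# The sample tables a and h from the question.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE a (aa INTEGER, bb INTEGER);
CREATE TABLE h (hh INTEGER, kk INTEGER);
INSERT INTO a VALUES (1, 111), (2, 222);
INSERT INTO h VALUES (1, 111), (2, 111), (3, 222);
""")

# 1:1 case - rows of a whose bb matches exactly one row in h.
one_to_one = con.execute("""
SELECT a.aa, a.bb
FROM a
WHERE (SELECT COUNT(*) FROM h WHERE h.kk = a.bb) = 1
""").fetchall()

# 1:n case - rows of a whose bb matches more than one row in h
# (these are the rows that would go into table c).
one_to_many = con.execute("""
SELECT a.aa, a.bb
FROM a
WHERE (SELECT COUNT(*) FROM h WHERE h.kk = a.bb) > 1
""").fetchall()

print(one_to_one)   # [(2, 222)]
print(one_to_many)  # [(1, 111)]
```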
I'm joining multiple tables using Spark SQL. One of the tables is very big and the others are small (10-20 records). Really, I want to replace values in the biggest table using the other tables, which contain key-value pairs.
i.e.
Bigtable:
| Col 1 | Col 2 | Col 3 | Col 4 | ....
--------------------------------------
| A1 | B1 | C1 | D1 | ....
| A2 | B1 | C2 | D2 | ....
| A1 | B1 | C3 | D2 | ....
| A2 | B2 | C3 | D1 | ....
| A1 | B2 | C2 | D1 | ....
...
Table2:
| Col 1 | Col 2
----------------
| A1 | 1a
| A2 | 2a
Table3:
| Col 1 | Col 2
----------------
| B1 | 1b
| B2 | 2b
Table4:
| Col 1 | Col 2
----------------
| C1 | 1c
| C2 | 2c
| C3 | 3c
Table5:
| Col 1 | Col 2
----------------
| D1 | 1d
| D2 | 2d
Expected table is
| Col 1 | Col 2 | Col 3 | Col 4 | ....
--------------------------------------
| 1a | 1b | 1c | 1d | ....
| 2a | 1b | 2c | 2d | ....
| 1a | 1b | 3c | 2d | ....
| 2a | 2b | 3c | 1d | ....
| 1a | 2b | 2c | 1d | ....
...
My question is: which is the best way to join the tables? (Assume there are 100 or more small tables.)
1) Collecting the small DataFrames, transforming them to maps, broadcasting the maps, and transforming the big DataFrame in one single step
bigdf.transform(ds.map(row => (small1.get(row.col1),.....)
2) Broadcasting the tables and making the join using the select method.
spark.sql("""
select *
from bigtable
left join small1 using(id1)
left join small2 using(id2)""")
3) Broadcasting the tables and chaining multiple joins
bigtable.join(broadcast(small1), bigtable("col1") === small1("col1")).join...
Thanks in advance
You might:
broadcast all small tables (done automatically by setting spark.sql.autoBroadcastJoinThreshold slightly above the small tables' size in bytes)
run a SQL query that joins the big table, such as:
val df = spark.sql("""
select *
from bigtable
left join small1 using(id1)
left join small2 using(id2)""")
EDIT:
Choosing between SQL and Spark "DataFrame" syntax:
The SQL syntax is more readable, and less verbose, than the Spark syntax (from a database user's perspective).
From a developer's perspective, the DataFrame syntax may be more readable.
The main advantage of the "Dataset" syntax is that the compiler can catch some errors. With any string-based syntax, such as SQL or column names (col("mycol")), mistakes are only spotted at run time.
The best way, as already written in the other answers, is to broadcast all the small tables. In Spark SQL this can also be done using the BROADCAST hint:
val df = spark.sql("""
select /*+ BROADCAST(t2, t3) */
*
from bigtable t1
left join small1 t2 using(id1)
left join small2 t3 using(id2)
""")
If the data in your small tables is less than the threshold size and the physical files for your data are in Parquet format, then Spark will automatically broadcast the small tables. But if you are reading the data from other data sources (a SQL database such as PostgreSQL, etc.), then sometimes Spark does not broadcast the table automatically.
If you know that the tables are small and their size is not expected to increase (as with lookup tables), you can explicitly broadcast the DataFrame or table, and in this way you can efficiently join a larger table with the small tables.
You can verify that the small table is being broadcast by using the explain command on the DataFrame, or from the Spark UI as well.
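Stripping away Spark, option 1 (broadcasting each small table as a map) boils down to a per-row dictionary lookup. A plain-Python sketch of the idea with the sample data from the question; in a real job the dicts would come from collecting each small DataFrame to the driver, and the names here are illustrative:

```python
# Each small table, collected into a plain map - the moral equivalent of
# broadcasting it to every executor.
small1 = {"A1": "1a", "A2": "2a"}
small2 = {"B1": "1b", "B2": "2b"}
small3 = {"C1": "1c", "C2": "2c", "C3": "3c"}
small4 = {"D1": "1d", "D2": "2d"}
lookups = [small1, small2, small3, small4]

bigtable = [
    ("A1", "B1", "C1", "D1"),
    ("A2", "B1", "C2", "D2"),
    ("A1", "B1", "C3", "D2"),
    ("A2", "B2", "C3", "D1"),
    ("A1", "B2", "C2", "D1"),
]

# One pass over the big table: replace each column through its map,
# keeping the original value when there is no match (left-join semantics).
result = [tuple(m.get(v, v) for m, v in zip(lookups, row)) for row in bigtable]

for row in result:
    print(row)
# → ('1a', '1b', '1c', '1d'), ('2a', '1b', '2c', '2d'), ...
```

With 100+ small tables this costs one dict lookup per column per row, with no shuffle at all, which is why the map/broadcast route can beat a chain of joins when the lookup tables really are tiny.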
The table can contain two rows that give the same information, where only a single column value is different. Basically the data is duplicated because of this one column. Can I somehow sum the other elements in such a manner that it takes this duplication into account?
To illustrate the idea of the problem:
Example:
|id|type|val1|val2|
|1 | 2 | 1 | 1 |
|1 | 3 | 1 | 1 |
|1 | 2 | 2 | 2 |
|1 | 3 | 2 | 2 |
Expected result
|id|type|val1|val2|count|
|1 |2,3 | 3 | 3 | 2 |
Actual result
|id|type|val1|val2|count|
|1 |2,3 | 6 | 6 | 4 |
In the actual data the type and val come from two different tables connected by a third table, so the query is like this:
SELECT id,
array_to_string(array_agg(DISTINCT x.type ORDER BY x.type), ','::text) AS type,
sum(y.val1) AS val1,
sum(y.val2) AS val2,
count(y.val1) AS count
FROM a
JOIN x ON x.a_id = a.id AND x.active = true
JOIN y ON y.a_id = a.id AND y.active = true
GROUP BY a.id
SOLUTION
SELECT id,
array_to_string(array_agg(DISTINCT x.type ORDER BY x.type), ','::text) AS type,
sum(distinct y.val1) AS val1,
sum(distinct y.val2) AS val2,
count(distinct y.val1) AS count
FROM a
JOIN x ON x.a_id = a.id AND x.active = true
JOIN y ON y.a_id = a.id AND y.active = true
GROUP BY a.id
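One caveat with the SOLUTION above: SUM(DISTINCT ...) happens to work for this sample, but it would silently drop a second y row that legitimately carries the same value. Aggregating y in a derived table before joining avoids both the duplication and that pitfall. A minimal sketch with SQLite via Python's sqlite3; the table and column names are assumed from the query above, with active as 1/0 for brevity:

```python
import sqlite3

# Minimal stand-ins for tables a, x and y.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE a (id INTEGER);
CREATE TABLE x (a_id INTEGER, type INTEGER, active INTEGER);
CREATE TABLE y (a_id INTEGER, val1 INTEGER, val2 INTEGER, active INTEGER);
INSERT INTO a VALUES (1);
INSERT INTO x VALUES (1, 2, 1), (1, 3, 1);
INSERT INTO y VALUES (1, 1, 1, 1), (1, 2, 2, 1);
""")

# Joining x and y directly multiplies every y row by the number of matching
# x rows (2 x 2 = 4 joined rows), so the sums come out doubled.
naive = con.execute("""
SELECT a.id, SUM(y.val1), SUM(y.val2), COUNT(y.val1)
FROM a
JOIN x ON x.a_id = a.id AND x.active = 1
JOIN y ON y.a_id = a.id AND y.active = 1
GROUP BY a.id
""").fetchone()

# Aggregating y before the join keeps the sums correct, even when two
# y rows legitimately carry the same value (where SUM(DISTINCT) would fail).
fixed = con.execute("""
SELECT a.id, ys.val1, ys.val2, ys.n
FROM a
JOIN (SELECT a_id, SUM(val1) AS val1, SUM(val2) AS val2, COUNT(*) AS n
      FROM y WHERE active = 1 GROUP BY a_id) ys ON ys.a_id = a.id
""").fetchone()

print(naive)  # (1, 6, 6, 4)
print(fixed)  # (1, 3, 3, 2)
```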
I have a table that looks like this
id carrier
A1 | 66.87.151.2,sprint pcs,2015-05-21,2015-05-21,66.87.151.145,sprint pcs,2015-05-21,2015-05-21,66.87.150.131,sprint pcs,2015-05-13,2015-05-13
B1 | 67.83.18.128,optimum online,2015-05-09,2015-05-09,b8bcdb64-72c2-4578-9db5-c011263b1180 69.204.80.158,time warner cabl,
C1 | 76.180.4.64,time warner cable,2015-07-01,2015-07-29,66.87.137.65,sprint pcs
I want to return a table that looks like this
id carrier
A1 | sprint pcs
A1 | sprint pcs
A1 | sprint pcs
B1 | optimum online
B1 | optimum online
C1 | time warner cable
C1 | sprint pcs
The only thing I can think of is an application of the regex operator, regexp_replace(carrier, '[^a-z ]', '') or something like that, but I have had no luck.
Convert the text column to an array, unnest it, and check the elements against a regex:
select id, elem
from (
select id, unnest(string_to_array(carrier, ',')) elem
from the_table
) sub
where elem ~ '^[a-z ]+$';
id | elem
----+-------------------
A1 | sprint pcs
A1 | sprint pcs
A1 | sprint pcs
B1 | optimum online
B1 | time warner cabl
C1 | time warner cable
C1 | sprint pcs
(7 rows)
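The same split-then-filter logic can be checked in plain Python for anyone without Postgres at hand; str.split stands in for string_to_array + unnest, and the re module applies the '^[a-z ]+$' filter:

```python
import re

# The sample rows from the question: (id, carrier) pairs.
the_table = [
    ("A1", "66.87.151.2,sprint pcs,2015-05-21,2015-05-21,66.87.151.145,"
           "sprint pcs,2015-05-21,2015-05-21,66.87.150.131,sprint pcs,"
           "2015-05-13,2015-05-13"),
    ("B1", "67.83.18.128,optimum online,2015-05-09,2015-05-09,"
           "b8bcdb64-72c2-4578-9db5-c011263b1180 69.204.80.158,"
           "time warner cabl,"),
    ("C1", "76.180.4.64,time warner cable,2015-07-01,2015-07-29,"
           "66.87.137.65,sprint pcs"),
]

# Split on commas and keep only the elements made up of lowercase
# letters and spaces; IPs, dates and UUIDs all fail the regex.
keep = re.compile(r"^[a-z ]+$")
result = [(row_id, elem)
          for row_id, carrier in the_table
          for elem in carrier.split(",")
          if keep.match(elem)]

for pair in result:
    print(pair)
# → ('A1', 'sprint pcs') three times, then ('B1', 'optimum online'), ...
```

This reproduces the 7-row output shown above, including the truncated 'time warner cabl' element, which survives the filter because it still consists only of letters and spaces.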
I thought that the query below would naturally do what I explain, but apparently not...
My table looks like this:
id | name | g | partner | g2
1 | John | M | Sam | M
2 | Devon | M | Mike | M
3 | Kurt | M | Susan | F
4 | Stacy | F | Bob | M
5 | Rosa | F | Rita | F
I'm trying to get the id where either the g or g2 value equals 'M'... But, a record where both the g and g2 values are 'M' should return two lines, not 1.
So, in the above sample data, I'm trying to return:
$q = pg_query("SELECT id FROM mytable WHERE ( g = 'M' OR g2 = 'M' )");
1
1
2
2
3
4
But, it always returns:
1
2
3
4
Your query doesn't work because each row is returned only once whether it matches one or both of the conditions. To get what you want use two queries and use UNION ALL to combine the results:
SELECT id FROM mytable WHERE g = 'M'
UNION ALL
SELECT id FROM mytable WHERE g2 = 'M'
ORDER BY id
Result:
1
1
2
2
3
4
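To see the two set operators side by side, here is a minimal sketch using SQLite via Python's sqlite3 with the sample data from the question:

```python
import sqlite3

# The sample table from the question.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE mytable (id INTEGER, name TEXT, g TEXT, partner TEXT, g2 TEXT);
INSERT INTO mytable VALUES
    (1, 'John',  'M', 'Sam',   'M'),
    (2, 'Devon', 'M', 'Mike',  'M'),
    (3, 'Kurt',  'M', 'Susan', 'F'),
    (4, 'Stacy', 'F', 'Bob',   'M'),
    (5, 'Rosa',  'F', 'Rita',  'F');
""")

# UNION ALL keeps both hits when a row matches g = 'M' AND g2 = 'M' ...
union_all = [r[0] for r in con.execute("""
SELECT id FROM mytable WHERE g = 'M'
UNION ALL
SELECT id FROM mytable WHERE g2 = 'M'
ORDER BY id
""")]

# ... while plain UNION removes duplicates, collapsing those rows to one.
union = [r[0] for r in con.execute("""
SELECT id FROM mytable WHERE g = 'M'
UNION
SELECT id FROM mytable WHERE g2 = 'M'
ORDER BY id
""")]

print(union_all)  # [1, 1, 2, 2, 3, 4]
print(union)      # [1, 2, 3, 4]
```

So for the requirement here, where a row matching both conditions must appear twice, UNION ALL is the one to use.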
you might try a UNION ALL along these lines:
"SELECT id FROM mytable WHERE ( g = 'M') UNION ALL SELECT id FROM mytable WHERE ( g2 = 'M')"
Hope this helps, Martin
SELECT id FROM mytable WHERE g = 'M'
UNION ALL
SELECT id FROM mytable WHERE g2 = 'M'