I have a table that looks like this:
id | carrier
A1 | 66.87.151.2,sprint pcs,2015-05-21,2015-05-21,66.87.151.145,sprint pcs,2015-05-21,2015-05-21,66.87.150.131,sprint pcs,2015-05-13,2015-05-13
B1 | 67.83.18.128,optimum online,2015-05-09,2015-05-09,b8bcdb64-72c2-4578-9db5-c011263b1180 69.204.80.158,time warner cabl,
C1 | 76.180.4.64,time warner cable,2015-07-01,2015-07-29,66.87.137.65,sprint pcs
I want to return a table that looks like this:
id | carrier
A1 | sprint pcs
A1 | sprint pcs
A1 | sprint pcs
B1 | optimum online
B1 | optimum online
C1 | time warner cable
C1 | sprint pcs
The only thing I can think of is an application of the regex function,
regexp_replace(carrier, '[^a-z ]', '') or something like that,
but I have had no luck.
Convert the text column to array, unnest it and check elements against a regex:
select id, elem
from (
select id, unnest(string_to_array(carrier, ',')) elem
from the_table
) sub
where elem ~ '^[a-z ]+$';
id | elem
----+-------------------
A1 | sprint pcs
A1 | sprint pcs
A1 | sprint pcs
B1 | optimum online
B1 | time warner cabl
C1 | time warner cable
C1 | sprint pcs
(7 rows)
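Equivalently, regexp_split_to_table can do the split and unnest in a single step in the FROM clause (an implicit LATERAL call, Postgres 9.3+); a sketch of the same query:
select id, elem
from the_table, regexp_split_to_table(carrier, ',') as elem
where elem ~ '^[a-z ]+$';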
Consider the below dataframe of stores and the books available:
+-----------+------+-------+
| storename | book | price |
+-----------+------+-------+
| S1 | B11 | 10$ | <<
| S2 | B11 | 11$ |
| S1 | B15 | 29$ | <<
| S2 | B10 | 25$ |
| S2 | B16 | 30$ |
| S1 | B09 | 21$ | <
| S3 | B15 | 22$ |
+-----------+------+-------+
Suppose we need to find the stores which have two particular books, namely B11 and B15. Here, the answer is S1, as it stocks both books.
One way of doing it is to find the intersection of the stores having book B11 with the stores having book B15, using the command below:
val df_select = df.filter($"book" === "B11").select("storename")
.join(df.filter($"book" === "B15").select("storename"), Seq("storename"), "inner")
which contains the names of the stores having both books.
But instead I want a table
+-----------+------+-------+
| storename | book | price |
+-----------+------+-------+
| S1 | B11 | 10$ | <<
| S1 | B15 | 29$ | <<
| S1 | B09 | 21$ | <
+-----------+------+-------+
which contains all records related to the fulfilling store. Note that B09 is not left out. (Use case: the user can explore some other books as well in the same store.)
We can do this by doing another intersection of above result with original dataframe:
df_select.join(df, Seq("storename"), "inner")
But I see a scalability and readability issue with step 1, as I have to keep joining one dataframe to another if the number of books is more than 2. That's a lot of pain and error-prone too. Is there a more elegant way to do the same? Something like:
val storewise = Window.partitionBy("storename")
df.filter($"book".contains{"B11", "B15"}.over(storewise))
Found a simple solution using the array_except function.
Add the required set of field values as an array in a new column, req_books.
Add a column, all_books, storing all the books stocked by a store, using a Window.
Using the above two columns, find whether the store is missing any required book, and filter it out if it misses anything.
Drop the excess columns created.
Code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df1 = df.withColumn("req_books", array(lit("B11"), lit("B15")))
  .withColumn("all_books", collect_set('book).over(Window.partitionBy('storename)))
df1.withColumn("missing_books", array_except('req_books, 'all_books))
  .filter(size('missing_books) === 0)
  .drop("missing_books", "all_books", "req_books")
  .show()
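If the intermediate columns are not needed, the same idea fits in one expression (a sketch, assuming Spark 2.4+ for array_except):
df.withColumn("all_books", collect_set('book).over(Window.partitionBy('storename)))
  .filter(size(array_except(array(lit("B11"), lit("B15")), 'all_books)) === 0)
  .drop("all_books")
  .show()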
Using window functions, create an array of all values per store and check whether it contains all the necessary values:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val bookList = List("B11", "B15") // list of books to search
// true when allBooks contains every element of bookList
def arrayContainsMultiple(bookList: Seq[String]) = udf((allBooks: WrappedArray[String]) => allBooks.intersect(bookList).sorted.equals(bookList.sorted))
val filteredDF = input
  .withColumn("allBooks", collect_set($"book").over(Window.partitionBy($"storename")))
  .filter(arrayContainsMultiple(bookList)($"allBooks"))
  .drop($"allBooks")
I have data like below:
Id | Data                               | Parent Id
---+------------------------------------+----------
1  | IceCream # Chocolate # SoftDrink   | 0
2  | Amul,Havemore#Cadbary,Nestle#Pepsi | 1
3  | Party#Wedding                      | 0
I want to split this data into the below format, where row 2 depends on row 1. I have added Parent Id, which is used to find the dependency.
IceCream | Amul | Party
IceCream | Havemore | Party
IceCream | Amul | Wedding
IceCream | Havemore | Wedding
Chocolate | Cadbery | Party
Chocolate | Nestle | Party
Chocolate | Cadbery | Wedding
Chocolate | Nestle | Wedding
SoftDrink | Pepsi | Party
SoftDrink | Pepsi | Wedding
I have used unnest(string_to_array(...)) to split the strings, but am unable to loop through the results to build these combinations.
This is very "unstable", like sitting on a knife edge, and could easily fall apart. It depends on assigning an ordinal to each delimited value and then joining on those ordinals. Maybe those flags that are known to you (but unfortunately not to us) can stabilize it. But it does match your indicated expectations. It uses the function regexp_split_to_table rather than unnest to split on the delimiters.
with base (num, list) as
( values (1,'IceCream#Chocolate#SoftDrink')
, (2,'Amul,Havemore#Cadbary,Nestle#Pepsi')
, (3,'Party#Wedding')
)
, product as
(select p, row_number() over () pn
from (
select regexp_split_to_table(list,'#') p
from base
where num=1
) x
)
, maker as
(select regexp_split_to_table(m, ',') m, row_number() over () mn
from (
select regexp_split_to_table(list,'#') m
from base
where num=2
) y
)
, event as
( select regexp_split_to_table(regexp_split_to_table(list,'#'), ',') e
from base
where num=3
)
select p as product
, m as maker
, e as event
from (product join maker on pn = mn) cross join event e
order by pn, e, m;
Hope it helps.
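If the bare over() numbering ever misbehaves, WITH ORDINALITY (Postgres 9.4+) pins each element to its position explicitly instead of relying on output order; a sturdier sketch of the same query:
with base (num, list) as
  ( values (1,'IceCream#Chocolate#SoftDrink')
         , (2,'Amul,Havemore#Cadbary,Nestle#Pepsi')
         , (3,'Party#Wedding')
  )
, product as
  ( select p.elem as p, p.pos as pn
    from base, regexp_split_to_table(list, '#') with ordinality as p(elem, pos)
    where num = 1
  )
, maker as
  ( select m.elem as m, g.pos as mn
    from base
       , regexp_split_to_table(list, '#') with ordinality as g(grp, pos)
       , regexp_split_to_table(g.grp, ',') as m(elem)
    where num = 2
  )
, event as
  ( select e.elem as e
    from base, regexp_split_to_table(list, '#') as e(elem)
    where num = 3
  )
select p as product, m as maker, e as event
from product
join maker on pn = mn
cross join event
order by pn, e, m;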
I'm joining multiple tables using Spark SQL. One of the tables is very big and the others are small (10-20 records each). Really, I want to replace values in the biggest table using the other tables, which contain key-value pairs.
i.e.
Bigtable:
| Col 1 | Col 2 | Col 3 | Col 4 | ....
--------------------------------------
| A1 | B1 | C1 | D1 | ....
| A2 | B1 | C2 | D2 | ....
| A1 | B1 | C3 | D2 | ....
| A2 | B2 | C3 | D1 | ....
| A1 | B2 | C2 | D1 | ....
...
Table2:
| Col 1 | Col 2
----------------
| A1 | 1a
| A2 | 2a
Table3:
| Col 1 | Col 2
----------------
| B1 | 1b
| B2 | 2b
Table4:
| Col 1 | Col 2
----------------
| C1 | 1c
| C2 | 2c
| C3 | 3c
Table5:
| Col 1 | Col 2
----------------
| D1 | 1d
| D2 | 2d
Expected table is
| Col 1 | Col 2 | Col 3 | Col 4 | ....
--------------------------------------
| 1a | 1b | 1c | 1d | ....
| 2a | 1b | 2c | 2d | ....
| 1a | 1b | 3c | 2d | ....
| 2a | 2b | 3c | 1d | ....
| 1a | 2b | 2c | 1d | ....
...
My question is: which is the best way to join the tables? (Assume there are 100 or more small tables.)
1) Collecting the small dataframes, transforming them to maps, broadcasting the maps, and transforming the big dataframe in a single step:
bigdf.transform(ds.map(row => (small1.get(row.col1),.....)
2) Broadcasting the tables and joining with a SQL query:
spark.sql("
select *
from bigtable
left join small1 using(id1)
left join small2 using(id2)")
3) Broadcasting the tables and chaining multiple joins:
bigtable.join(broadcast(small1), bigtable("col1") === small1("col1")).join...
Thanks in advance
You might do:
broadcast all small tables (done automatically by setting spark.sql.autoBroadcastJoinThreshold slightly above the small tables' size; note the threshold is a size in bytes, not a row count)
run a SQL query that joins the big table, such as:
val df = spark.sql("""
  select *
  from bigtable
  left join small1 using(id1)
  left join small2 using(id2)""")
EDIT:
Choosing between SQL and Spark "dataframe" syntax:
The SQL syntax is more readable and less verbose than the Spark syntax (from a database user's perspective).
From a developer's perspective, the dataframe syntax might be more readable.
The main advantage of the "dataset" syntax is that the compiler can catch some errors; mistakes in any string-based syntax, such as SQL or column names (col("mycol")), are only spotted at run time.
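A toy illustration of that trade-off, with a hypothetical misspelled column name:
import org.apache.spark.sql.functions.col

// String-based syntax: the typo compiles fine and only fails at run time
// with an AnalysisException when the plan is analyzed.
bigtable.select(col("mycoll"))

// Typed Dataset syntax: the same typo is rejected at compile time.
case class BigRow(mycol: String)
// bigtable.as[BigRow].map(_.mycoll)  // won't compile: value mycoll is not a member of BigRow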
The best way, as already written in the answers, is to broadcast all small tables. It can also be done in Spark SQL using the BROADCAST hint:
val df = spark.sql("""
select /*+ BROADCAST(t2, t3) */
*
from bigtable t1
left join small1 t2 using(id1)
left join small2 t3 using(id2)
""")
If the data in your small tables is less than the threshold size and the physical files are in Parquet format, then Spark will automatically broadcast the small tables, but if you are reading the data from some other data source, like a SQL database (PostgreSQL etc.), then sometimes Spark does not broadcast the table automatically.
If you know that the tables are small and their size is not expected to increase (as with lookup tables), you can explicitly broadcast the dataframe or table, and in this way you can efficiently join the larger table with the small tables.
You can verify that the small table is getting broadcast by using the explain command on the dataframe, or from the Spark UI.
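For example (a sketch, using the table and key names from the question):
import org.apache.spark.sql.functions.broadcast

// Explicitly mark the lookup table for broadcast, overriding the
// autoBroadcastJoinThreshold estimate.
val joined = bigtable.join(broadcast(small1), Seq("id1"), "left")

// The physical plan should now contain a BroadcastHashJoin node.
joined.explain()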
I have one main table called deliveries, and it has a one-to-many relationship with deliveries_languages (dl), deliveries_markets (dm), and deliveries_tags (dt), each having delivery_id as a foreign key. These 3 tables have a one-to-one relation with languages, markets, and tags respectively. Additionally, the deliveries table has a one-to-one relation with companies and has company_id as a foreign key. Following is a query that I have written:
SELECT deliveries.*, languages.display_name, markets.default_name, tags.default_name, companies.name
FROM deliveries
JOIN deliveries_languages dl ON dl.delivery_id = deliveries.id
JOIN deliveries_markets dm ON dm.delivery_id = deliveries.id
JOIN deliveries_tags dt ON dt.delivery_id = deliveries.id
JOIN languages ON languages.id = dl.language_id
JOIN markets ON markets.id = dm.market_id
JOIN tags ON tags.id = dt.tag_id
JOIN companies ON companies.id = deliveries.company_id
WHERE
deliveries.name ILIKE '%new%' AND
deliveries.created_by = '5f331347-fb58-4f63-bcf0-702f132f97c5' AND
deliveries.deleted_at IS NULL
LIMIT 10
Here I am getting redundant delivery_ids, because for each delivery_id there are multiple languages, markets, and tags. I want to apply the limit to distinct delivery_ids, and at the same time I want those multiple languages, markets, and tags to be grouped and populated in a single row.
Currently it looks like:
delivery_id | name |languages | markets | tags
------------|------|----------|----------|-----------
1 | d1 |en | au | tag1
1 | d1 |de | sw | tag2
2 | d2 |en | au | tag1
2 | d2 |de | sw | tag2
3 | d3 |en | au | tag1
3 | d3 |de | sw | tag2
Is there any way that I can have the data look like below:
delivery_id | name |languages | markets | tags
------------|------|----------|----------|-----------
1 | d1 |en, de | au,sw | tag1, tag2
2 | d2 |en, de | au,sw | tag1, tag2
3 | d3 |en, de | au,sw | tag2, tag3
P.S. The above tables contain only part of the data; the actual query returns many more columns, but the above are the important ones here. Can someone please help me resolve this issue?
You can use GROUP BY with string_agg, like this:
SELECT deliveries.id AS delivery_id, deliveries.name,
string_agg(distinct languages.display_name, ',' order by languages.display_name) as langs,
string_agg(distinct markets.default_name, ',' order by markets.default_name) as markets,
string_agg(distinct tags.default_name, ',' order by tags.default_name) as tags,
string_agg(distinct companies.name, ',' order by companies.name) as companies
...
GROUP BY deliveries.id, deliveries.name;
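Spelled out against the query in the question (a sketch; with the GROUP BY in place, the LIMIT now applies to distinct delivery ids):
SELECT deliveries.id AS delivery_id,
       deliveries.name,
       string_agg(distinct languages.display_name, ',' order by languages.display_name) AS languages,
       string_agg(distinct markets.default_name, ',' order by markets.default_name) AS markets,
       string_agg(distinct tags.default_name, ',' order by tags.default_name) AS tags,
       string_agg(distinct companies.name, ',' order by companies.name) AS companies
FROM deliveries
JOIN deliveries_languages dl ON dl.delivery_id = deliveries.id
JOIN deliveries_markets dm ON dm.delivery_id = deliveries.id
JOIN deliveries_tags dt ON dt.delivery_id = deliveries.id
JOIN languages ON languages.id = dl.language_id
JOIN markets ON markets.id = dm.market_id
JOIN tags ON tags.id = dt.tag_id
JOIN companies ON companies.id = deliveries.company_id
WHERE deliveries.name ILIKE '%new%'
  AND deliveries.created_by = '5f331347-fb58-4f63-bcf0-702f132f97c5'
  AND deliveries.deleted_at IS NULL
GROUP BY deliveries.id, deliveries.name
LIMIT 10;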
Hi, I have a complex (by my SQL standards) count I need to perform over multiple tables, and I'd love some help with it.
I have three tables:
RELEASE
ID |METADATA_ID|Is_Active|Creation_Source|Release_Status
123456|123 | Y | A1 |Active
134567|124 | Y | A1 |Active
145678|125 | N | A2 |Closed
RELEASE_METADATA
ID |UPC
123 |8001
124 |8002
125 |8003
RELEASE_COUNTRY_RIGHT
(RELEASE)ID |COUNTRY_ID|MARKETING_RIGHT|OPT_OUT
123456 | UK | N |N
123456 | US | Y |N
123456 | FR | Y |Y
134567 | UK | Y |N
134567 | US | Y |Y
145678 | UK | Y |Y
145678 | FR | Y |N
I need to be able to filter the results by the Creation_Source and Release_Status from RELEASE, include the UPC from RELEASE_METADATA, and count the related number of rows and the MARKETING_RIGHT and OPT_OUT fields from RELEASE_COUNTRY_RIGHT.
So my result would be something like:
ID |Is_Active|Creation_Source|Release_Status|UPC |RELEASE_COUNTRY_RIGHT Rows|MARKETING_RIGHT|OPT_OUT
123456| Y | A1 |Active |8001| 3 | 2 | 1
134567| Y | A1 |Active |8002| 2 | 2 | 1
Any help would be much appreciated.
Cheers!
Update:
I've tried using SHALAKA's solution below, but I'm having trouble after substituting the table and field names, as shown below. I've updated the table and field names above, as I realized that what I had may have been misleading.
There is also an additional field which I missed out, as the joins are not as I expected.
RELEASE joins with RELEASE_METADATA via release.release_metadata_id = release_metadata.id, and the third table joins via release.id = release_country_right.release_id.
Here is my attempt:
SELECT grp_1.*, COUNT(a5.OPT_OUT) OPT_OUT
FROM (SELECT grp.*, COUNT(a4.MARKETING_RIGHT) MARKETING_RIGHT
FROM (SELECT a1.id,
a1.IS_ACTIVE,
a1.CREATION_SOURCE,
a1.RELEASE_STATUS,
a2.upc,
COUNT(a3.RELEASE_ID) a_row
FROM dschd.release a1, dschd.release_metadata a2, DSCHD.RELEASE_COUNTRY_RIGHT a3
WHERE a1.RELEASE_METADATA_ID = a2.id
AND a1.id = a3.RELEASE_ID
AND a1.IS_ACTIVE = 'Y'
GROUP BY a1.id, a1.IS_ACTIVE, a1.CREATION_SOURCE, a1.RELEASE_STATUS, a2.UPC) grp,
DSCHD.RELEASE_COUNTRY_RIGHT a4
WHERE a4.OPT_OUT = grp.id
AND a4.MARKETING_RIGHT = 'N'
GROUP BY grp.id,
grp.IS_ACTIVE,
grp.CREATION_SOURCE,
grp.RELEASE_STATUS,
grp.upc,
grp.a_row) grp_1,
DSCHD.RELEASE_COUNTRY_RIGHT a5
WHERE a5.RELEASE_ID = grp_1.id
AND a5.OPT_OUT = 'Y'
GROUP BY grp_1.id,
grp_1.IS_ACTIVE,
grp_1.CREATION_SOURCE,
grp_1.RELEASE_STATUS,
grp_1.upc,
grp_1.a_row,
grp_1.MARKETING_RIGHT
For reference, here is SHALAKA's original solution (with the original field names):
SELECT grp_1.*, COUNT(a5.selected) selected
FROM (SELECT grp.*, COUNT(a4.marketing) marketing
FROM (SELECT a1.id,
a1.active,
a1.source,
a1.status,
a2.ref,
COUNT(a3.id) a_row
FROM data1 a1, data2 a2, data3 a3
WHERE a1.id = a2.id
AND a2.id = a3.id
AND a1.active = 'Y'
GROUP BY a1.id, a1.active, a1.source, a1.status, a2.ref) grp,
data3 a4
WHERE a4.id = grp.id
AND a4.marketing = 'Y'
GROUP BY grp.id,
grp.active,
grp.source,
grp.status,
grp.ref,
grp.a_row) grp_1,
data3 a5
WHERE a5.id = grp_1.id
AND a5.selected = 'Y'
GROUP BY grp_1.id,
grp_1.active,
grp_1.source,
grp_1.status,
grp_1.ref,
grp_1.a_row,
grp_1.marketing
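As an aside, the join condition a4.OPT_OUT = grp.id in the attempt looks like a typo for a4.RELEASE_ID = grp.id. More broadly, the three passes over RELEASE_COUNTRY_RIGHT can usually be collapsed into a single pass with conditional aggregation; a sketch against the renamed tables, assuming the join keys described in the update:
SELECT r.id,
       r.is_active,
       r.creation_source,
       r.release_status,
       rm.upc,
       COUNT(*) AS country_right_rows,
       COUNT(CASE WHEN rcr.marketing_right = 'Y' THEN 1 END) AS marketing_right,
       COUNT(CASE WHEN rcr.opt_out = 'Y' THEN 1 END) AS opt_out
FROM dschd.release r
JOIN dschd.release_metadata rm ON r.release_metadata_id = rm.id
JOIN dschd.release_country_right rcr ON rcr.release_id = r.id
WHERE r.is_active = 'Y'  -- filter by creation_source / release_status here as needed
GROUP BY r.id, r.is_active, r.creation_source, r.release_status, rm.upc;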