PySpark: join dataframes on each word and compare list with string

I am using PySpark and I have two tables.
Table REF_A:
id | name
---------
1 | help
2 | need
3 | hello
4 | hel
Table DATA_B contains a list
| sentence |
----------------------------
| [I , say , hello, to, you]|
| [I , need , your, help] |
I need to join the two tables to get this result:
id | name | sentence |
---------------------------------
1 | help | I need your help |
2 | need | I need your help |
3 | hello | I say hello to you|
Because REF_A contains the keyword "need", I need to match it with the sentence containing "need", which is "I need your help".
Thank you for your help

REF_A.createOrReplaceTempView('REF_A')
DATA_B.createOrReplaceTempView('DATA_B')
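# ARRAY_CONTAINS compares whole array elements, so "hel" does not match "hello".
# An inner JOIN drops keywords with no matching sentence (id 4), as in the expected result;
# a LEFT JOIN would keep id 4 with a NULL sentence.
spark.sql('SELECT * FROM REF_A JOIN DATA_B ON ARRAY_CONTAINS(DATA_B.sentence, REF_A.name)')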
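The matching rule can be checked without a cluster. The sketch below is plain Python, not Spark: `join_on_word` is a hypothetical helper that models a join on "the word list contains the keyword", using the sample rows from the question. It assumes the list elements carry no surrounding whitespace.

```python
# Plain-Python model of a join on "sentence array contains name" (no Spark involved).
ref_a = [(1, "help"), (2, "need"), (3, "hello"), (4, "hel")]
data_b = [
    ["I", "say", "hello", "to", "you"],
    ["I", "need", "your", "help"],
]


def join_on_word(ref_rows, sentences):
    """Pair each (id, name) with every sentence whose word list contains name."""
    result = []
    for ref_id, name in ref_rows:
        for words in sentences:
            # Whole-element comparison: "hel" is not an element of the list,
            # even though it is a prefix of "help" and "hello".
            if name in words:
                result.append((ref_id, name, " ".join(words)))
    return result


joined = join_on_word(ref_a, data_b)
for row in joined:
    print(row)
```

Keyword 4 ("hel") produces no row here because the comparison is on whole words, which is the inner-join behaviour.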

Related

Fast split Spark dataframe by keys in some column and save as different dataframes

I have a very big dataframe in Spark 2.3, like this:
-------------------------
| col_key | col1 | col2 |
-------------------------
| AA | 1 | 2 |
| AB | 2 | 1 |
| AA | 2 | 3 |
| AC | 1 | 2 |
| AA | 3 | 2 |
| AC | 5 | 3 |
-------------------------
I need to "split" this dataframe by the values in the col_key column and save each part in a separate csv file, so I have to get smaller dataframes like
-------------------------
| col_key | col1 | col2 |
-------------------------
| AA | 1 | 2 |
| AA | 2 | 3 |
| AA | 3 | 2 |
-------------------------
and
-------------------------
| col_key | col1 | col2 |
-------------------------
| AC | 1 | 2 |
| AC | 5 | 3 |
-------------------------
and so on.
I need to save every resulting dataframe as a separate csv file.
The number of keys is not big (20-30), but the total amount of data is (~200 million records).
I have a solution where a loop selects each part of the data and then saves it to a file:
val keysList = df.select("col_key").distinct().map(r => r.getString(0)).collect.toList
keysList.foreach(k => {
  val dfi = df.where($"col_key" === lit(k))
  SaveDataByKey(dfi, path_to_save)
})
It works correctly, but the problem with this solution is that selecting the data for each key causes a full pass through the whole dataframe, which takes too much time.
I think there must be a faster solution, where we pass through the dataframe only once and put every record into the "right" result dataframe (or directly into a separate file). But I don't know how to do it :)
Maybe someone has ideas about this?
Also, I prefer to use Spark's DataFrame API because it provides the fastest way of processing data (so using RDDs is not desirable, if possible).
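You need to partition the output by the column when writing it as csv. Each key value is written to its own subdirectory (for example col_key=AA), which holds one or more part files, and the write is a single job over the data.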
yourDF
  .write
  .partitionBy("col_key")
  .csv("/path/to/save")
Why don't you try this?
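To illustrate the "pass through the data only once" idea from the question, here is a minimal plain-Python sketch. It is not how Spark implements partitionBy; `split_by_key` is a hypothetical helper, the one-file-per-key layout is an assumption for illustration, and the rows are the sample data from the question.

```python
import csv
import os
import tempfile


def split_by_key(rows, header, key_index, out_dir):
    """Write each row to <out_dir>/<key>.csv in a single pass over rows."""
    handles = {}  # key -> open file object
    writers = {}  # key -> csv.writer bound to that file
    try:
        for row in rows:
            key = row[key_index]
            if key not in writers:
                # First time this key is seen: open its file and write the header.
                handle = open(os.path.join(out_dir, f"{key}.csv"), "w", newline="")
                handles[key] = handle
                writers[key] = csv.writer(handle)
                writers[key].writerow(header)
            writers[key].writerow(row)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)


rows = [
    ("AA", 1, 2), ("AB", 2, 1), ("AA", 2, 3),
    ("AC", 1, 2), ("AA", 3, 2), ("AC", 5, 3),
]
out_dir = tempfile.mkdtemp()
keys = split_by_key(rows, ["col_key", "col1", "col2"], 0, out_dir)
print(keys)
```

Keeping one open writer per key works here because the number of keys is small (20-30 in the question); with many distinct keys the number of open files would become the limit.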

How to aggregate Postgres table so that ID is unique and column values are collected in array?

I'm not sure what to call what I'm trying to do, so trying to look it up didn't work very well. I would like to aggregate my table based on one column and have all the values from another column collapsed into an array per unique ID.
| ID | some_other_value |
-------------------------
| 1 | A |
| 1 | B |
| 2 | C |
| .. | ... |
To return
| ID | values_array |
-------------------------
| 1 | {A, B} |
| 2 | {C} |
Sorry for the bad explanation, I'm really lacking the vocabulary here. Any help with writing a query that achieves what's in the example would be very much appreciated.
Try the following.
select id, array_agg(some_other_value order by some_other_value ) as values_array from <yourTableName> group by id
You can also check here.
See Aggregate Functions documentation.
SELECT
id,
array_agg(some_other_value)
FROM
the_table
GROUP BY
id;
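For readers who want to see what the grouping does to the rows, here is a small plain-Python sketch of the same idea. It only models the result shape and does not run against Postgres; `collect_by_id` is a hypothetical helper, and sorting the values mirrors the `order by` inside the first query.

```python
from collections import defaultdict


def collect_by_id(rows):
    """Group (id, value) pairs into {id: sorted list of values}."""
    grouped = defaultdict(list)
    for row_id, value in rows:
        grouped[row_id].append(value)
    # Sort each list so the result does not depend on input order.
    return {row_id: sorted(values) for row_id, values in grouped.items()}


rows = [(1, "A"), (1, "B"), (2, "C")]
values_by_id = collect_by_id(rows)
print(values_by_id)
```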

LibreOffice - RANDBETWEEN return a name

I have a two-column list like this:
+----+-------+
| Nr | Name |
+----+-------+
| 1 | Alice |
| 2 | Bob |
| 3 | Joe |
| 4 | Ann |
| 5 | Jane |
+----+-------+
And would like to generate a random name from this list.
For now I am only able to randomly select a number and then manually pick out the corresponding name, using this function: =RANDBETWEEN(A2;A10). How can I pick out the name instead?
Assuming that the data of your table are in cells E7:F11 the following code can do what you need:
=VLOOKUP(RANDBETWEEN(1;5);E7:F11;2)
Further, in case you need to create a random permutation of the names you may also use the Calc extension Permutate at https://sourceforge.net/projects/permutate/.
Hope that helps.
Assuming your data is with Nr in A1 I suggest:
=INDEX(B$2:B$6;RANDBETWEEN(1;5))
then there is no need for the Nr column in making the selection.
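The logic behind both formulas is "draw a random position, then look up the name at that position". A minimal Python sketch of that logic, outside LibreOffice, with `random_name` as a hypothetical helper and the names taken from the question:

```python
import random

names = ["Alice", "Bob", "Joe", "Ann", "Jane"]


def random_name(names, rng=random):
    """Pick a 1-based random position, then return the name stored there."""
    # randint is inclusive on both ends, like RANDBETWEEN(1;5).
    position = rng.randint(1, len(names))
    # Convert the 1-based position to a 0-based list index, like INDEX(B$2:B$6; position).
    return names[position - 1]


picked = random_name(names)
print(picked)
```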

Grouping fields from one to many relationship in postgres

I need to group the fields of a child table in one query in Postgres.
I have the following data:
Stores:
| id | name |
|----|------|
| 1 | abcd |
Features:
| id | store | name | other |
|----|-------|------|-------|
| 1 | 1 | door | metal |
| 2 | 1 | fork | green |
I've got as far as this query:
SELECT
stores.id,
stores.name,
concat_ws(',', features.id, features.name, features.other)
FROM stores
LEFT JOIN features
ON(features.store=stores.id)
WHERE stores.id =1
GROUP BY stores.id, features.id;
This is the best I've got so far, but it yields 2 tuples:
1, abcd, (1,door,metal)
1, abcd, (2,fork,green)
I'd like to get one row with the features concatenated with '|', like so:
1, abcd ,(1,door,metal|2,fork,green)
Use string_agg():
SELECT stores.id,
stores.name,
string_agg(concat_ws(',', features.id, features.name, features.other), '|')
FROM stores
LEFT JOIN features ON features.store=stores.id
WHERE stores.id =1
GROUP BY stores.id, stores.name;
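The two levels of joining (',' inside one feature, '|' between features) can be sketched in plain Python as follows. This only models the result and is not Postgres; `stores_with_features` is a hypothetical helper, and the data is the sample from the question.

```python
stores = [(1, "abcd")]
# (id, store, name, other)
features = [(1, 1, "door", "metal"), (2, 1, "fork", "green")]


def stores_with_features(stores, features):
    """Return one row per store with its features joined by '|'."""
    rows = []
    for store_id, store_name in stores:
        parts = [
            # Inner join of one feature's fields with ','.
            ",".join(str(value) for value in (feature_id, name, other))
            for feature_id, store, name, other in features
            if store == store_id
        ]
        # Outer join of all features with '|'. A store without features
        # gets an empty string here, whereas string_agg would return NULL.
        rows.append((store_id, store_name, "|".join(parts)))
    return rows


result = stores_with_features(stores, features)
print(result)
```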

PostgreSQL simple count query

Trying to scale this down so the answer is simple. I can probably extrapolate the answers here to apply to a bigger data set.
Given the following table:
+------+-----+
| name | age |
+------+-----+
| a | 5 |
| b | 7 |
| c | 8 |
| d | 8 |
| e | 10 |
+------+-----+
I want to make a table that shows the count of people whose age is equal to or greater than x. For instance, the table above would produce:
+--------------+-------+
| at least age | count |
+--------------+-------+
| 5 | 5 |
| 6 | 4 |
| 7 | 4 |
| 8 | 3 |
| 9 | 1 |
| 10 | 1 |
+--------------+-------+
Is there a single query that can accomplish this task? Obviously, it is easy to write a simple function for it, but I'm hoping to be able to do this quickly with one query.
Thanks!
Yes, what you're looking for is a window function.
with cte_age_count as (
    select age,
           count(*) c_star
    from people
    group by age)
select age,
       -- sum from the current age upward, so each row counts people with age >= this age
       sum(c_star) over (order by age
                         range between current row
                         and unbounded following)
from cte_age_count
Not syntax checked ... let me know if it works!
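Note that the expected table lists every age from 5 to 10, including 6 and 9, which no row in the sample has; a query grouped by age only returns ages that are present in the table. As a cross-check of the expected numbers, here is a minimal plain-Python sketch (not SQL) that counts people at or above each threshold in the full range; `at_least_counts` is a hypothetical helper.

```python
people = {"a": 5, "b": 7, "c": 8, "d": 8, "e": 10}


def at_least_counts(ages):
    """For every age from min to max, count how many ages are >= that age."""
    ages = sorted(ages)
    return [
        (threshold, sum(1 for age in ages if age >= threshold))
        for threshold in range(ages[0], ages[-1] + 1)
    ]


counts = at_least_counts(people.values())
for threshold, count in counts:
    print(threshold, count)
```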