How to check all values in columns efficiently using Spark? - scala

I'm wondering how to build a dynamic filter over columns that are unknown in advance in Spark.
For example, the dataframe looks like this:
+-------+-------+-------+-------+-------+-------+
| colA  | colB  | colC  | colD  | colE  | colF  |
+-------+-------+-------+-------+-------+-------+
| Red   | Red   | Red   | Red   | Red   | Red   |
| Red   | Red   | Red   | Red   | Red   | Red   |
| Red   | Blue  | Red   | Red   | Red   | Red   |
| Red   | Red   | Red   | Red   | Red   | Red   |
| Red   | Red   | Red   | Red   | Blue  | Red   |
| Red   | Red   | White | Red   | Red   | Red   |
+-------+-------+-------+-------+-------+-------+
The columns are only known at runtime, meaning there could also be a colG, colH, and so on.
I need to check, for each column, whether every value in it is Red, and then count those columns; in the example above the count is 3, since colA, colD and colF contain only Red.
What I am doing is something like the following, and it is SLOW:
val allColumns = df.columns
var count = 0
allColumns.foldLeft(df) { (df, column) =>
  // a column counts only when no row holds a value other than "Red"
  val tmpDf = df.filter(df(column) =!= "Red")
  if (tmpDf.rdd.isEmpty) {
    count += 1
  }
  df
}
I am wondering if there is a better way. Many thanks!

With that approach you get N RDD scans, where N is the number of columns. You can scan all of them at once and reduce in parallel, for example this way:
import org.apache.spark.sql.Row

df.reduce((a, r) => Row.fromSeq(a.toSeq.zip(r.toSeq)
  .map { case (a, r) =>
    if (a == "Red" && r == "Red") "Red" else "Not"
  }
))
res11: org.apache.spark.sql.Row = [Red,Not,Not,Red,Not,Red]
This code does a single RDD scan and then iterates over the Row columns inside reduce. Row.toSeq gets a Seq from the Row, and Row.fromSeq rebuilds a Row, so the function returns the same type of object.
Edit: for the count, just append .toSeq.filter(_ == "Red").size to the result.
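Put together, a minimal end-to-end sketch of this suggestion (assuming every column is a string) could look like the following:

import org.apache.spark.sql.Row

// One pass over the data: fold the rows pairwise, keeping "Red" in a position
// only while both sides are "Red", then count the positions that survived.
val reduced: Row = df.reduce((a, r) =>
  Row.fromSeq(a.toSeq.zip(r.toSeq).map {
    case (x, y) => if (x == "Red" && y == "Red") "Red" else "Not"
  })
)
val count = reduced.toSeq.count(_ == "Red")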

Why not simply do df.filter + df.count using only the DataFrame API?
import org.apache.spark.sql.functions.{col, lit}

val filter_expr = df.columns.map(c => col(c) === lit("Red")).reduce(_ and _)
val count = df.filter(filter_expr).count
// count: Long = 3
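Note that this expression counts the rows in which every column is Red (which also happens to be 3 for the sample data). If the goal is to count the columns that are entirely Red, a single aggregation pass over the DataFrame works as well; a minimal sketch, assuming string columns and no nulls:

import org.apache.spark.sql.functions.{col, max, min}

// Compute min and max for every column in one job; a column is all "Red"
// exactly when both its min and max equal "Red".
val aggs = df.columns.flatMap(c => Seq(min(col(c)).as(s"min_$c"), max(col(c)).as(s"max_$c")))
val stats = df.agg(aggs.head, aggs.tail: _*).first()
val allRedColumns = df.columns.count(c =>
  stats.getAs[String](s"min_$c") == "Red" && stats.getAs[String](s"max_$c") == "Red"
)
// allRedColumns: Int = 3 for the example dataframe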

Related

Choose custom column name in SELECT expression depending on returning data, PostgreSQL

Let's say we have the following table:
id | bird | color
1 | raven | black
2 | gull | white
3 | broadbill | green
4 | goldfinch | yellow
5 | tit | yellow
Is it possible in PostgreSQL to write a SELECT expression that gives the color column a dynamic alias? The alias name should depend on the selected data in the color column. It is assumed that only one row is returned (i.e., LIMIT 1 is applied at the end).
Pseudocode:
SELECT id, bird, color
as 'bw' if color.value in ['black', 'white']
else
as 'colored'
FROM table
WHERE color = 'white'
LIMIT 1
Returning examples:
-- WHERE color = 'white'
id | bird | bw
2 | gull | white
-- WHERE color = 'yellow'
id | bird | colored
4 | goldfinch | yellow

How to filter out rows with lots of conditions in pyspark?

Let's say that these are my data:
Product_Number | Condition | Type       | Country
1              | New       | Chainsaw   | USA
1              | Old       | Chainsaw   | USA
1              | Null      | Chainsaw   | USA
2              | Old       | Tractor    | India
3              | Null      | Machete    | Colombia
4              | New       | Shovel     | Brazil
5              | New       | Fertilizer | Italy
5              | Old       | Fertilizer | Italy
The problem is that sometimes a Product_Number appears more than once, although it should be unique. For the product numbers that appear in the dataframe more than once, I want to keep only the rows whose Condition is New, without touching the rest. That gives this result:
Product_Number | Condition | Type       | Country
1              | New       | Chainsaw   | USA
2              | Old       | Tractor    | India
3              | Null      | Machete    | Colombia
4              | New       | Shovel     | Brazil
5              | New       | Fertilizer | Italy
What I tried to do is first to see how many distinct product numbers I have:
df.select('Product_Number').distinct().count()
Then identify the product numbers that appear more than once and put them in a list:
numbers = df.select('Product_Number').groupBy('Product_Number').count().where('count > 1')\
.select('Product_Number').rdd.flatMap(lambda x: x).collect()
Then I am trying to filter out the rows whose product number appears more than once and whose Condition isn't New. If the filtering is done correctly, counting the result should give the same number as df.select('Product_Number').distinct().count().
The code that I have tried is:
1) df.filter(~(df.Product_Number.isin(numbers)) & ~((df.Condition == 'Old') | (df.Condition.isNull())))
2) df.filter(~((df.Product_Number.isin(numbers)) & ((df.Condition == 'Old') | (df.Condition.isNull()))))
3) df.filter(~(df.Product_Number.isin(numbers)) & (df.Condition == 'New'))
However, I haven't succeeded until now.
Your conditions should be:
(Product_Number is in numbers AND Condition == New) OR
(Product_Number is not in numbers)
So, this is the correct filter condition.
df.filter((df.Product_Number.isin(numbers) & (df.Condition == 'New'))
| (~df.Product_Number.isin(numbers)))
However, collect can be a heavy operation if you have a large dataset, and you can rewrite your code without it.
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy('Product_Number')
df = (df.withColumn('cnt', F.count('*').over(w))
      .filter(((F.col('Condition') == 'New') & (F.col('cnt') > 1)) | (F.col('cnt') == 1)))

Create a new DF using another two

I have two dataframes that share the column colour, and I would like to create a new DF with an extra column holding the Code that corresponds to each colour, as you can see below:
DF1
+------------+--------------------+
| Code | colour |
+------------+--------------------+
| 1001 | brown |
| 1201 | black |
| 1300 | green |
+------------+--------------------+
DF2
+------------+--------------------+-----------+
| Name | colour | date |
+------------+--------------------+-----------+
| Joee | brown | 20210101 |
| Jess | black | 20210101 |
| James | green | 20210101 |
+------------+--------------------+-----------+
Output:
+------------+--------------------+-----------+----------+
| Name       | colour             | date      | Got      |
+------------+--------------------+-----------+----------+
| Joee       | brown              | 20210101  | 1001     |
| Jess       | black              | 20210101  | 1201     |
| James      | green              | 20210101  | 1300     |
+------------+--------------------+-----------+----------+
How can I do this? With join?
As mck suggested, a simple join is enough for your case: explicitly specify equality of the colour column's values between the two DataFrames, as seen below (we drop one of the two colour columns, since they hold the same value in each row after the join):
val joined = df1.join(df2, df1("colour").equalTo(df2("colour")))
.drop(df1("colour"))
This is what we get after showing the newly formed joined DataFrame:
+----+-----+------+--------+
|code| name|colour| date|
+----+-----+------+--------+
|1001| Joe| brown|20210101|
|1201| Jess| black|20210101|
|1300|James| green|20210101|
+----+-----+------+--------+
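As a small side note (not part of the original answer), joining on Seq("colour") keeps a single colour column automatically, so the explicit drop is not needed; a sketch:

// The using-column form of join deduplicates the "colour" column for us.
val joined = df2.join(df1, Seq("colour"))
  .select("Name", "colour", "date", "Code")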

Compare specific rows of DataFrames in Scala

I have two Spark DataFrames in Scala which I am testing for similarities. I want to be able to pick a specific row number and compare each value of that row between the two DataFrames. For example:
Dataframe 1: df1
+------+-----+-----------+
| Name | Age | Eye Color |
+------+-----+-----------+
| Bob | 12 | Blue |
| Bil | 17 | Red |
| Ron | 13 | Brown |
+------+-----+-----------+
Dataframe 2: df2
+------+-----+-----------+
| Name | Age | Eye Color |
+------+-----+-----------+
| Bob | 12 | Blue |
| Bil | 14 | Blue |
| Ron | 13 | Brown |
+------+-----+-----------+
Input: Row 2, output: Age, Eye Color.
What would be ideal is for the output to also show the values that are different. I have considered the option here, but the issue is that my DataFrames are very large (in excess of 200,000 rows), so this takes far too long. Is there a simpler way to select a specific row of a DataFrame in Scala?
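One possible approach (a sketch, assuming "row 2" refers to positional order; the helper rowAt is hypothetical) is to attach an index with zipWithIndex and compare the two rows field by field. Note that without an explicit ordering column, the row order of a DataFrame is not guaranteed.

import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical helper: fetch the i-th row (0-based) of a DataFrame by attaching an index.
def rowAt(df: DataFrame, i: Long): Row =
  df.rdd.zipWithIndex().filter(_._2 == i).map(_._1).first()

val r1 = rowAt(df1, 1) // "Row 2" from the question, 0-based here
val r2 = rowAt(df2, 1)

// Names of the columns whose values differ between the two rows.
val differing = df1.columns.zipWithIndex.collect {
  case (name, idx) if r1.get(idx) != r2.get(idx) => name
}
// For the example data this would yield Age and Eye Color.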

Postgres Multiple Rows as Single Row

I know I need to use sub-queries for this, but I'm not sure how to go about it. I have multiple entries per position_id, but I want to display them as a single row. Here's the table design:
UUID | position_id | spot
-----+-------------+-----
111 | 1 | left
112 | 1 | right
113 | 3 | center
114 | 4 | right
The way I want to output this data is as follows:
position_1_left | position_1_right | position_3_center | position_4_right
----------------+------------------+-------------------+------------------
true            | true             | true              | true
The reason for this is that I want to put this data into a BIRT report, and having absolute values for each position_id and spot as true or false would make the report much nicer. The report would look like this:
           | left | center | right
-----------+------+--------+-------
position 1 | yes  | no     | yes
position 2 | no   | no     | no
position 3 | no   | yes    | no
position 4 | no   | no     | yes
I cannot think of a better way of doing this, so if anyone has a suggestion I'm open to it. Otherwise I'll proceed with this layout but I'm having a hard time coming up with the query. I tried starting with a query like this:
SELECT (SELECT spot FROM positions_table WHERE position_id = 3 AND spot = 'left')
from positions_table
WHERE uuid = 'afb36733'
But obviously that wouldn't work.
As you simply want to check whether you have a given spot out of a finite list ('left', 'center', 'right') for each position_id, I see a very simple solution for your case using the bool_or aggregate function (see also on SQL Fiddle):
SELECT
pt.position_id,
bool_or(pt.spot = 'left') AS left,
bool_or(pt.spot = 'right') AS right,
bool_or(pt.spot = 'center') AS center
FROM
positions_table pt
GROUP BY
pt.position_id
ORDER BY
pt.position_id;
Result:
position_id | left | right | center
-------------+------+-------+--------
1 | t | t | f
3 | f | f | t
4 | f | t | f
(3 rows)
You can then expand it with CASE to format better (or do that in your presentation layer):
SELECT
pt.position_id,
CASE WHEN bool_or(pt.spot = 'left') THEN 'yes' ELSE 'no' END AS left,
CASE WHEN bool_or(pt.spot = 'right') THEN 'yes' ELSE 'no' END AS right,
CASE WHEN bool_or(pt.spot = 'center') THEN 'yes' ELSE 'no' END AS center
FROM
positions_table pt
GROUP BY
pt.position_id
ORDER BY
pt.position_id;
Result:
position_id | left | right | center
-------------+------+-------+--------
1 | yes | yes | no
3 | no | no | yes
4 | no | yes | no
(3 rows)
Other common options for pivoting would be:
using the crosstab function from the tablefunc extension
using a FILTER clause or CASE inside an aggregate function
But as it is only true/false, bool_or seems more than enough here.
Use generate_series() to fill gaps in the position_ids and aggregate the spots into an array per id:
select
id,
coalesce('left' = any(arr), false) as left,
coalesce('center' = any(arr), false) as center,
coalesce('right' = any(arr), false) as right
from (
select id, array_agg(spot) arr
from generate_series(1, 4) id
left join positions_table on id = position_id
group by 1
) s
order by 1;
id | left | center | right
----+------+--------+-------
1 | t | f | t
2 | f | f | f
3 | f | t | f
4 | f | f | t
(4 rows)