Check if any of the dataframe columns are empty - Scala

I need to check whether any of the columns of the dataframe are empty. A column counts as empty if every row in it is either null or an empty string.
The dataframe is as follows:
+---+-------+-------+-------+-------+
| ID|Sample1|Sample2|Sample3|Sample4|
+---+-------+-------+-------+-------+
| 1| a1| b1| c1| null|
| 2| | | | |
| 3| a3| | | |
+---+-------+-------+-------+-------+
The code I use for the check:
mainDF.select(mainDF.columns.map(c => sum((col(c).isNotNull && col(c)!="").cast("int")).alias(c)): _*).show()
What I get is
+---+-------+-------+-------+-------+
| ID|Sample1|Sample2|Sample3|Sample4|
+---+-------+-------+-------+-------+
| 3| 3| 3| 3| 2|
+---+-------+-------+-------+-------+
What I hope to get is
+---+-------+-------+-------+-------+
| ID|Sample1|Sample2|Sample3|Sample4|
+---+-------+-------+-------+-------+
| 3| 2| 1| 1| 0|
+---+-------+-------+-------+-------+
Also, my final result should be a single true or false indicating whether any one of the columns is empty.
In this case it will be true, because the count for Sample4 is 0.

You could map valid values to 1 and empty strings/nulls to 0, and then sum each column. (Your original check fails because col(c)!="" uses Scala's !=, which compares the Column object itself to a String and is therefore always true; the Column-level inequality operator is =!=.)
mainDF.select(mainDF.columns.map(c => sum(when(col(c)==="" or col(c).isNull,0).otherwise(1)).as(c)):_*).show()
+---+-------+-------+-------+-------+
| ID|Sample1|Sample2|Sample3|Sample4|
+---+-------+-------+-------+-------+
| 3| 2| 1| 1| 0|
+---+-------+-------+-------+-------+
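To get the final true/false answer the question asks for, one option (a minimal sketch, assuming mainDF and import org.apache.spark.sql.functions._ as above) is to collect the single row of counts and check whether any count is zero:
val counts = mainDF.select(mainDF.columns.map(c => sum(when(col(c) === "" or col(c).isNull, 0).otherwise(1)).as(c)): _*)
// first() returns the one row of per-column counts; a 0 means that column is entirely null/empty
val anyColumnEmpty = counts.first().toSeq.exists(_ == 0)  // true here, because Sample4's count is 0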

Related

PySpark: combine different rows based on a column

I have a dataframe
+----------------+------------+-----+
| Sport|Total_medals|count|
+----------------+------------+-----+
| Alpine Skiing| 3| 4|
| Alpine Skiing| 2| 18|
| Alpine Skiing| 4| 1|
| Alpine Skiing| 1| 38|
| Archery| 2| 12|
| Archery| 1| 72|
| Athletics| 2| 50|
| Athletics| 1| 629|
| Athletics| 3| 8|
| Badminton| 2| 5|
| Badminton| 1| 86|
| Baseball| 1| 216|
| Basketball| 1| 287|
|Beach Volleyball| 1| 48|
| Biathlon| 4| 1|
| Biathlon| 3| 9|
| Biathlon| 1| 61|
| Biathlon| 2| 23|
| Bobsleigh| 2| 6|
| Bobsleigh| 1| 60|
+----------------+------------+-----+
Is there a way for me to combine the value of counts from multiple rows if they are from the same sport?
For example, if Sport = Alpine Skiing I would have something like this:
+----------------+-----+
| Sport|count|
+----------------+-----+
| Alpine Skiing| 61|
+----------------+-----+
where count is equal to 4 + 18 + 1 + 38 = 61. I would like to do this for all sports;
any help would be appreciated.
You need to group by the Sport column and then aggregate the count column with the sum() function.
Example:
import pyspark.sql.functions as F
grouped_df = df.groupby('Sport').agg(F.sum('count'))
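By default the aggregated column is named sum(count); you can alias it back to count and verify the totals (a small sketch on the same dataframe):
import pyspark.sql.functions as F
grouped_df = df.groupby('Sport').agg(F.sum('count').alias('count'))
# For Alpine Skiing this gives 4 + 18 + 1 + 38 = 61
grouped_df.show()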

How to count changes in row values in PySpark

I need the logic to count the changes in the row values of a given column.
Input:
df22 = spark.createDataFrame(
[(1, 1.0), (1,22.0), (1,22.0), (1,21.0), (1,20.0), (2, 3.0), (2,3.0),
(2, 5.0), (2, 10.0), (2,3.0), (3,11.0), (4, 11.0), (4,15.0), (1,22.0)],
("id", "v"))
+---+----+
| id| v|
+---+----+
| 1| 1.0|
| 1|22.0|
| 1|22.0|
| 1|21.0|
| 1|20.0|
| 2| 3.0|
| 2| 3.0|
| 2| 5.0|
| 2|10.0|
| 2| 3.0|
| 3|11.0|
| 4|11.0|
| 4|15.0|
+---+----+
Expected output:
+---+----+---+
| id| v| c|
+---+----+---+
| 1| 1.0| 0|
| 1|22.0| 1|
| 1|22.0| 1|
| 1|21.0| 2|
| 1|20.0| 3|
| 2| 3.0| 0|
| 2| 3.0| 0|
| 2| 5.0| 1|
| 2|10.0| 2|
| 2| 3.0| 3|
| 3|11.0| 0|
| 4|11.0| 0|
| 4|15.0| 1|
+---+----+---+
Any help on this will be greatly appreciated.
Thanks in advance,
Ramabadran
Before adding an answer, I would like to ask: what have you tried? Please try something on your end first and then seek support on this platform. Also, your question is not clear: you have not said whether you are looking for a delta-capture count per 'id' or as a whole. Just giving an expected output does not make the question clear.
Now, to your question: if I understood the sample input and output correctly, you need a delta-capture count per 'id'. One way to achieve it is as below:
#Capture the incremented count using lag() and sum() over below mentioned window
import pyspark.sql.functions as F
from pyspark.sql.window import Window
winSpec=Window.partitionBy('id').orderBy('v') # Your Window for capturing the incremented count
df22.\
withColumn('prev',F.coalesce(F.lag('v').over(winSpec),F.col('v'))).\
withColumn('c',F.sum(F.expr("case when v-prev<>0 then 1 else 0 end")).over(winSpec)).\
drop('prev').\
orderBy('id','v').\
show()
+---+----+---+
| id| v| c|
+---+----+---+
| 1| 1.0| 0|
| 1|20.0| 1|
| 1|21.0| 2|
| 1|22.0| 3|
| 1|22.0| 3|
| 1|22.0| 3|
| 2| 3.0| 0|
| 2| 3.0| 0|
| 2| 3.0| 0|
| 2| 5.0| 1|
| 2|10.0| 2|
| 3|11.0| 0|
| 4|11.0| 0|
| 4|15.0| 1|
+---+----+---+
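Note that this window orders each 'id' group by 'v', which is why the rows above come out in a different order than the expected output. A DataFrame has no inherent row order, so to count changes in the original insertion order you need an explicit ordering column. A sketch (assuming monotonically_increasing_id() still reflects insertion order, which holds for a freshly created local DataFrame like df22):
import pyspark.sql.functions as F
from pyspark.sql.window import Window

df_seq = df22.withColumn('seq', F.monotonically_increasing_id())  # 'seq' is a hypothetical ordering column
win_seq = Window.partitionBy('id').orderBy('seq')
(df_seq
 .withColumn('prev', F.coalesce(F.lag('v').over(win_seq), F.col('v')))  # previous value in original order
 .withColumn('c', F.sum((F.col('v') != F.col('prev')).cast('int')).over(win_seq))  # running count of changes
 .drop('prev', 'seq')
 .show())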

Creating a column based on a list or dictionary

I am new to PySpark. I have a dataframe containing a customer id and text, with an associated value:
+------+-----+------+
|  id  |text |value |
+------+-----+------+
|   1  | Cat |    5 |
|   1  | Dog |    4 |
|   2  | Oil |    1 |
+------+-----+------+
I would like to parse the text column based on a list of keywords, create a column telling me whether a keyword appears in the text field, and extract the associated value. The expected result is this:
List_keywords = ["Dog", "Cat"]
Out:
+------+-----+------+--------+---------+--------+---------+
|  id  |text |value |bool_Dog|value_Dog|bool_cat|value_cat|
+------+-----+------+--------+---------+--------+---------+
|   1  | Cat |    5 |       0|        0|       1|        5|
|   1  | Dog |    4 |       1|        4|       0|        0|
|   2  | Oil |    1 |       0|        0|       0|        0|
+------+-----+------+--------+---------+--------+---------+
What is the best way to do that? I was thinking of creating a list or a dictionary containing my keywords, and parsing it with a for loop, but I'm sure there is a better way to do that.
Please see the solution:
import pyspark.sql.functions as F
data = [[1, 'Cat', 5], [1, 'Dog', 4], [2, 'Oil', 1]]
df = spark.createDataFrame(data, ['id', 'text', 'value'])
df.show()
+---+----+-----+
| id|text|value|
+---+----+-----+
| 1| Cat| 5|
| 1| Dog| 4|
| 2| Oil| 1|
+---+----+-----+
keywords = ['Dog','Cat']
(
    df
    .groupby('id', 'text', 'value')
    .pivot('text', keywords)
    .agg(
        F.count('value').alias('bool'),
        F.max('value').alias('value')
    )
    .fillna(0)
    .sort('text')
).show()
+---+----+-----+--------+---------+--------+---------+
| id|text|value|Dog_bool|Dog_value|Cat_bool|Cat_value|
+---+----+-----+--------+---------+--------+---------+
| 1| Cat| 5| 0| 0| 1| 5|
| 1| Dog| 4| 1| 4| 0| 0|
| 2| Oil| 1| 0| 0| 0| 0|
+---+----+-----+--------+---------+--------+---------+
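If you need the exact column names from the question (bool_Dog, value_Dog, ...), you can rename the pivoted columns afterwards; a sketch, assuming the pivoted result above is bound to a variable named pivoted (a hypothetical name):
pivoted.select(
    'id', 'text', 'value',
    F.col('Dog_bool').alias('bool_Dog'),
    F.col('Dog_value').alias('value_Dog'),
    F.col('Cat_bool').alias('bool_cat'),
    F.col('Cat_value').alias('value_cat'),
).show()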

Scala Spark: incrementing a column based on another column in a dataframe without for loops

I have a dataframe like the one below. I want a new column called cutofftype which, instead of the current monotonically increasing number, should reset to 1 every time the ID column changes.
df = df.orderBy("ID","date").withColumn("cutofftype",monotonically_increasing_id()+1)
+------+---------------+----------+
| ID | date |cutofftype|
+------+---------------+----------+
| 54441| 2016-06-20| 1|
| 54441| 2016-06-27| 2|
| 54441| 2016-07-04| 3|
| 54441| 2016-07-11| 4|
| 54500| 2016-05-02| 5|
| 54500| 2016-05-09| 6|
| 54500| 2016-05-16| 7|
| 54500| 2016-05-23| 8|
| 54500| 2016-06-06| 9|
| 54500| 2016-06-13| 10|
+------+---------------+----------+
The target is as below:
+------+---------------+----------+
| ID | date |cutofftype|
+------+---------------+----------+
| 54441| 2016-06-20| 1|
| 54441| 2016-06-27| 2|
| 54441| 2016-07-04| 3|
| 54441| 2016-07-11| 4|
| 54500| 2016-05-02| 1|
| 54500| 2016-05-09| 2|
| 54500| 2016-05-16| 3|
| 54500| 2016-05-23| 4|
| 54500| 2016-06-06| 5|
| 54500| 2016-06-13| 6|
+------+---------------+----------+
I know this can be done with for loops; I want to do it without them. Is there a way?
This is a simple partition-by problem; you should use a window function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val w = Window.partitionBy("ID").orderBy("date")
df.withColumn("cutofftype", row_number().over(w)).show()
+-----+----------+----------+
| ID| date|cutofftype|
+-----+----------+----------+
|54500|2016-05-02| 1|
|54500|2016-05-09| 2|
|54500|2016-05-16| 3|
|54500|2016-05-23| 4|
|54500|2016-06-06| 5|
|54500|2016-06-13| 6|
|54441|2016-06-20| 1|
|54441|2016-06-27| 2|
|54441|2016-07-04| 3|
|54441|2016-07-11| 4|
+-----+----------+----------+
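Note that the partitions come back in arbitrary order after the shuffle; if you want the rows displayed as in the target, add an explicit sort before show() (a minimal sketch):
df.withColumn("cutofftype", row_number().over(w)).orderBy("ID", "date").show()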

Join two data frames as input for Machine Learning with Spark

I have 2 data frames in Apache Spark, each with a column named "JoinValue". JoinValue is numeric and has the same semantics and meaning in both data frames.
I need the combination of both data frames as input (training and test set) for a Machine Learning algorithm. Is it correct that I first need to combine both DataFrames into a single DataFrame before using it in an ML Pipeline?
Example:
df1.show()
+---------+---------+
| a|JoinValue|
+---------+---------+
|A value 0| 0|
|A value 1| 5|
|A value 2| 10|
|A value 3| 15|
|A value 4| 20|
|A value 5| 25|
|A value 6| 30|
+---------+---------+
and
df2.show()
+---------+---------+
| b|JoinValue|
+---------+---------+
|B value 0| 0|
|B value 1| 7|
|B value 2| 14|
|B value 3| 21|
|B value 4| 28|
+---------+---------+
An outer join followed by an orderBy yields the following results:
df1.join(df2, 'JoinValue', 'outer').orderBy('JoinValue').show()
+---------+---------+---------+
|JoinValue| a| b|
+---------+---------+---------+
| 0|A value 0|B value 0|
| 5|A value 1| null|
| 7| null|B value 1|
| 10|A value 2| null|
| 14| null|B value 2|
| 15|A value 3| null|
| 20|A value 4| null|
| 21| null|B value 3|
| 25|A value 5| null|
| 28| null|B value 4|
| 30|A value 6| null|
+---------+---------+---------+
What I actually want is this, without nulls:
+---------+---------+---------+
|JoinValue| a| b|
+---------+---------+---------+
| 0|A value 0|B value 0|
| 5|A value 1|B value 0|
| 7|A value 1|B value 1|
| 10|A value 2|B value 1|
| 14|A value 2|B value 2|
| 15|A value 3|B value 2|
| 20|A value 4|B value 2|
| 21|A value 4|B value 3|
| 25|A value 5|B value 3|
| 28|A value 5|B value 4|
| 30|A value 6|B value 4|
+---------+---------+---------+
What is the best way to use the JoinValue, a and b, coming from multiple data frames as features and labels in a machine learning algorithm?
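The null-free table above is a forward fill over JoinValue; one way to build it in PySpark (a sketch, my own suggestion rather than an answer from the original thread) is last() with ignorenulls over a running window:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Running window over all rows seen so far, ordered by JoinValue
# (no partitionBy, so everything lands in one partition -- fine for small data)
w = Window.orderBy('JoinValue').rowsBetween(Window.unboundedPreceding, Window.currentRow)
filled = (df1.join(df2, 'JoinValue', 'outer')
          .withColumn('a', F.last('a', ignorenulls=True).over(w))  # carry the last non-null a forward
          .withColumn('b', F.last('b', ignorenulls=True).over(w))  # carry the last non-null b forward
          .orderBy('JoinValue'))
filled.show()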