Unable to perform row operations in pyspark dataframe - pyspark

I have a dataset in this form:
Store_Name    Items                                  Ratings
Cartmax       Cosmetics, Clothing, Perfumes          4.6/5
DollarSmart   Watches, Clothing                      NEW
Megaplex      Shoes, Cosmetics, Medicines, Sports    4.2/5
I want to create a new column which contains the number of items in the store. For example, the Items column in the first row has 3 items, so the new column should have the value 3 for that row.
In the Ratings column, a few rows have 'NEW' and 'NULL' values. I want to remove all those rows.

You can achieve this with filter and split as below -
Data Preparation
from io import StringIO

import pandas as pd
import pyspark.sql.functions as F

s = StringIO("""
Store_Name Items Ratings
Cartmax Cosmetics, Clothing, Perfumes 4.6/5
DollarSmart Watches, Clothing NEW
Megaplex Shoes, Cosmetics, Medicines, Sports 4.2/5
""")
df = pd.read_csv(s, delimiter='\t')  # the sample above is tab-separated

# `sql` is assumed to be an existing SparkSession (or SQLContext)
sparkDF = sql.createDataFrame(df)
sparkDF.show(truncate=False)
+-----------+-----------------------------------+-------+
|Store_Name |Items |Ratings|
+-----------+-----------------------------------+-------+
|Cartmax |Cosmetics, Clothing, Perfumes |4.6/5 |
|DollarSmart|Watches, Clothing |NEW |
|Megaplex |Shoes, Cosmetics, Medicines, Sports|4.2/5 |
+-----------+-----------------------------------+-------+
Filter & Split
sparkDF = sparkDF.filter(~F.col('Ratings').isin(['NEW', 'NULL']) & F.col('Ratings').isNotNull()) \
                 .withColumn('NumberOfItems', F.size(F.split(F.col('Items'), ',')))
sparkDF.show(truncate=False)
+----------+-----------------------------------+-------+-------------+
|Store_Name|Items |Ratings|NumberOfItems|
+----------+-----------------------------------+-------+-------------+
|Cartmax |Cosmetics, Clothing, Perfumes |4.6/5 |3 |
|Megaplex |Shoes, Cosmetics, Medicines, Sports|4.2/5 |4 |
+----------+-----------------------------------+-------+-------------+

First filter out the rows whose Ratings are null or 'NEW', then use the size and split functions to get the number of items.
import pyspark.sql.functions as F
......
df = df.filter('Ratings is not null and Ratings != "NEW"').withColumn('num_items', F.size(F.split('Items', ',')))
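For a self-contained run, here is a minimal sketch of the same approach, assuming an active SparkSession named spark; the sample DataFrame below is recreated from the question rather than taken from the elided setup:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the sample data from the question
df = spark.createDataFrame(
    [("Cartmax", "Cosmetics, Clothing, Perfumes", "4.6/5"),
     ("DollarSmart", "Watches, Clothing", "NEW"),
     ("Megaplex", "Shoes, Cosmetics, Medicines, Sports", "4.2/5")],
    ["Store_Name", "Items", "Ratings"],
)

# Drop NEW/null ratings, then count the comma-separated items
result = (
    df.filter(F.col("Ratings").isNotNull() & (F.col("Ratings") != "NEW"))
      .withColumn("num_items", F.size(F.split("Items", ",")))
)
result.show(truncate=False)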

Related

I need to create a new dataframe as below in pyspark from the given input dataset

Persons who have the same salary should come in the same record, and their names should be separated by ",".
Input dataset:
Expected dataset:
You can achieve this as below -
Apply a groupBy on Salary and use collect_list to collect all the Name values into an ArrayType().
Further, you can choose to convert it to a StringType using concat_ws.
Data Preparation
from io import StringIO

import pandas as pd
import pyspark.sql.functions as F

df = pd.read_csv(StringIO("""Name,Salary
abc,100000
bcd,20000
def,100000
pqr,20000
xyz,30000
""")
,delimiter=','
).applymap(lambda x: str(x).strip())
# `sql` is assumed to be an existing SparkSession (or SQLContext)
sparkDF = sql.createDataFrame(df)
sparkDF.groupby("Salary").agg(F.collect_list(F.col("Name")).alias('Name')).show(truncate=False)
+------+----------+
|Salary|Name |
+------+----------+
|100000|[abc, def]|
|20000 |[bcd, pqr]|
|30000 |[xyz] |
+------+----------+
Concat WS
sparkDF.groupby("Salary").agg(F.concat_ws(",",F.collect_list(F.col("Name"))).alias('Name')).show(truncate=False)
+------+-------+
|Salary|Name |
+------+-------+
|100000|abc,def|
|20000 |bcd,pqr|
|30000 |xyz |
+------+-------+
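If you prefer to skip the pandas round-trip, below is a minimal sketch that builds the DataFrame directly, assuming an active SparkSession named spark:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("abc", 100000), ("bcd", 20000), ("def", 100000),
     ("pqr", 20000), ("xyz", 30000)],
    ["Name", "Salary"],
)

# Group by Salary and join the names with a comma
df.groupBy("Salary") \
  .agg(F.concat_ws(",", F.collect_list("Name")).alias("Name")) \
  .show(truncate=False)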

Querying one column by max value on another column after groupBy

+-------+--------------------+-------+
| brand| category_code| count|
+-------+--------------------+-------+
|samsung|electronics.smart...|1782386|
| apple|electronics.smart...|1649525|
| xiaomi|electronics.smart...| 924383|
| huawei|electronics.smart...| 477946|
| oppo|electronics.smart...| 242022|
|samsung|electronics.video.tv| 183988|
| apple|electronics.audio...| 165277|
| acer| computers.notebook| 154599|
| casio| electronics.clocks| 141403|
I want to select the value from the brand column corresponding to the max value of the count column, after performing a groupBy on the category_code column. So in the first row, for the group electronics.smartphone in category_code, I want the string samsung from the brand column because it has the highest value in the count column...
First groupBy to identify the rows with the largest count for each category_code, then join with the original dataframe to retrieve the brand value corresponding to that max count:
df1 = df.groupBy("category_code").agg(F.max("count").alias("count"))
df2 = df.join(df1, ["count", "category_code"]).drop("count")
This will produce df2 as follows:

category_code          brand
---------------------  -------
electronics.smart...   samsung
electronics.video.tv   samsung
electronics.audio      apple
computers.notebook     acer
electronics.clocks     casio
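As a design note, the join approach above returns every brand that ties for the maximum count in a category. If you want exactly one row per category_code, a window function with row_number is a common alternative; here is a minimal PySpark sketch (the sample rows are a small subset recreated from the question):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("samsung", "electronics.smartphone", 1782386),
     ("apple", "electronics.smartphone", 1649525),
     ("samsung", "electronics.video.tv", 183988),
     ("acer", "computers.notebook", 154599)],
    ["brand", "category_code", "count"],
)

# Rank brands within each category by descending count and keep the top one
w = Window.partitionBy("category_code").orderBy(F.desc("count"))
top_brands = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)
      .select("category_code", "brand")
)
top_brands.show(truncate=False)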

How to filter a dataframe based on multiple column matches in other dataframes in Spark Scala

Say I have three dataframes as follows:
val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish")).toDF("name","sport1","sport2")
val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),("anthony","Fish","Fish"),("liz","Swim","Fish")).toDF("name","sport1","sport2")
Here is a tabular view:

df1:
+-----+------+------+
|name |sport1|sport2|
+-----+------+------+
|steve|Run   |Run   |
|mike |Swim  |Swim  |
|bob  |Fish  |Fish  |
+-----+------+------+

df2:
+-------+------+------+
|name   |sport1|sport2|
+-------+------+------+
|chris  |Bike  |Bike  |
|dave   |Bike  |Fish  |
|kevin  |Run   |Run   |
|anthony|Fish  |Fish  |
|liz    |Swim  |Fish  |
+-------+------+------+
I want to filter df2 to only the rows where sport1 and sport2 combinations are valid rows of df1. For example, since in df1, sport1 -> Run, sport2 -> Run is a valid row, it would return that as one of the rows from df2. It would not return sport1 -> Bike, sport2 -> Bike from df2 though. And it would not factor in what the 'name' column value is at all.
The expected result I'm looking for is the dataframe with the following data:
+-------+------+------+
|name |sport1|sport2|
+-------+------+------+
|kevin |Run |Run |
|anthony|Fish |Fish |
+-------+------+------+
Thanks and have a great day!
Try this,
val res = df3.intersect(df1).union(df3.intersect(df2))
+------+------+
|sport1|sport2|
+------+------+
| Run| Run|
| Fish| Fish|
| Swim| Fish|
+------+------+
To filter a dataframe based on multiple column matches in other dataframes, you can use join:
df2.join(df1.select("sport1", "sport2"), Seq("sport1", "sport2"))
Since a join is an inner join by default, you will keep only the rows where "sport1" and "sport2" are the same in the two dataframes. And since the join condition is given as a list of columns, Seq("sport1", "sport2"), those columns will not be duplicated in the result.
With your example's input data:
val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish")).toDF("name","sport1","sport2")
val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),("anthony","Fish","Fish"),("liz","Swim","Fish")).toDF("name","sport1","sport2")
You get:
+------+------+-------+
|sport1|sport2|name |
+------+------+-------+
|Run |Run |kevin |
|Fish |Fish |anthony|
+------+------+-------+

Scala Spark Explode multiple columns pairs into rows

How can I explode multiple columns pairs into multiple rows?
I have a dataframe with the following
client, type, address, type_2, address_2
abc, home, 123 Street, business, 456 Street
I want to have a final dataframe with the follow
client, type, address
abc, home, 123 Street
abc, business, 456 Street
I tried using the code below, but it returns 4 records instead of the two records I want:
df
  .withColumn("type", explode(array("type", "type_2")))
  .withColumn("address", explode(array("address", "address_2")))
I can do this with two separate dataframes and a union, but I wanted to see if there is another way to do it within a single dataframe.
Thanks
You can do it using structs:
df
  .withColumn("str", explode(
    array(
      struct($"type", $"address"),
      struct($"type_2".as("type"), $"address_2".as("address"))))
  )
  .select($"client", $"str.*")
  .show()
gives
+------+--------+----------+
|client| type| address|
+------+--------+----------+
| abc| home|123 Street|
| abc|business|456 Street|
+------+--------+----------+
Here is a technique I use for complicated transformations: map over the records of the dataframe and use Scala to apply a transformation of any complexity.
Here I am hard-coding the creation of 2 rows, but any logic can be put here to explode rows as needed. I used flatMap to split the resulting array of rows into individual rows.
import spark.implicits._  // encoders for the tuples produced by map/flatMap

val df = spark.createDataFrame(Seq(("abc", "home", "123 Street", "business", "456 Street")))
  .toDF("client", "type", "address", "type_2", "address_2")

df.map { r =>
  Seq((r.getAs[String]("client"), r.getAs[String]("type"), r.getAs[String]("address")),
      (r.getAs[String]("client"), r.getAs[String]("type_2"), r.getAs[String]("address_2")))
}.flatMap(identity(_)).toDF("client", "type", "address").show(false)
Result
+------+--------+----------+
|client|type |address |
+------+--------+----------+
|abc |home |123 Street|
|abc |business|456 Street|
+------+--------+----------+

How to fetch data for a column from two tables in spark scala

There are two tables Customer1 and Customer2
Customer1: List the details of the customer
https://docs.google.com/spreadsheets/d/1GuQaHhZ70D0NHGXuW51B5nNZXrSkthmEduHOhwoZmRg/edit#gid=722500260
Customer2: List the updated details of the customer
https://docs.google.com/spreadsheets/d/1GuQaHhZ70D0NHGXuW51B5nNZXrSkthmEduHOhwoZmRg/edit#gid=0
CustomerName has to be fetched from both tables. If the customer name has been updated, it has to be fetched from the Customer2 table; otherwise it has to be fetched from the Customer1 table. So all customer names should be listed.
Expected result set:
https://docs.google.com/spreadsheets/d/1GuQaHhZ70D0NHGXuW51B5nNZXrSkthmEduHOhwoZmRg/edit#gid=1227228207
How can this be achieved in Spark Scala?
You can perform a left join from the customer1 table to the customer2 table, then use coalesce to take the first non-null customername (the updated value from customer2 if present, otherwise the original from customer1).
Example:
scala> val customer1=Seq((1,"shiva","9994323565"),(2,"Mani","9994323567"),(3,"Sneha","9994323568")).toDF("customerid","customername","contact")
scala> val customer2=Seq((1,"shivamoorthy","9994323565"),(2,"Manikandan","9994323567")).toDF("customerid","customername","contact")
scala> customer1.as("c1")
.join(customer2.as("c2"),$"c1.customerid" === $"c2.customerid","left")
.selectExpr("c1.customerid",
"coalesce(c2.customername,c1.customername) as customername")
.show()
Result:
+----------+------------+
|customerid|customername|
+----------+------------+
| 1|shivamoorthy|
| 2| Manikandan|
| 3| Sneha|
+----------+------------+