I want to generate features by performing a group of operations based on a set of columns of a dataframe.
My dataframe looks like:
root
|-- CreatedOn: string (nullable = true)
|-- ID: string (nullable = true)
|-- Industry: string (nullable = true)
|-- region: string (nullable = true)
|-- Customer: string (nullable = true)
For example, count the number of times the ID and region were used in the last 3/2/1 months.
For this I have to scan the entire dataframe with respect to the current record.
Current logic:
1. for i in df.collect() - Row-wise collect.
2. Filter the data to the 3 months before this row's CreatedOn.
3. Generate features.
The code works, but since it is a row-wise loop it runs for more than 10 hours. Is there any way I can replace the row-wise operation in PySpark, since the loop does not leverage the parallelism that PySpark provides?
Something like groupBy?
Sample data:
S.No ID CreatedOn Industry Region
1 ERP 05thMay2020 Communications USA
2 ERP 28thSept2020 Communications USA
3 ERP 15thOct2020 Communications Europe
4 ERP 15thNov2020 Communications Europe
5 Cloud 1stDec2020 Insurance Europe
Consider record #4.
Feature 1 (Count_3monthsRegion): I want to see how many times ERP was used in Europe in the last 3 months (w.r.t. CreatedOn). The answer will be 1. (Record #2 is also ERP, but in a different region, so it is not counted.)
Feature 2 (Count_3monthsIndustry): I want to see how many times ERP was used in Communications in the last 3 months (w.r.t. CreatedOn). The answer will be 2.
Expected output:
S.No ID CreatedOn Industry Region Count_3monthsRegion Count_3monthsIndustry
1 ERP 05thMay2020 Communications USA 0 0
2 ERP 28thSept2020 Communications USA 0 0
3 ERP 15thOct2020 Communications Europe 0 1
4 ERP 15thNov2020 Communications Europe 1 2
5 Cloud 1stDec2020 Insurance Europe 0 0
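One way to replace the row-wise loop is a window function with a time-based frame: partition by the grouping columns, order by the event time in epoch seconds, and count the rows falling in the preceding 90 days. Below is a minimal PySpark sketch under two assumptions: CreatedOn has already been parsed into a proper timestamp (the "05thMay2020" strings would need a custom parse first), and "last 3 months" is approximated as 90 days.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# rangeBetween uses the units of the ordering column, i.e. epoch seconds here.
days = lambda n: n * 86400

# Convert the (already parsed) timestamp to epoch seconds for the range frame.
df = df.withColumn("created_ts", F.unix_timestamp("CreatedOn"))

# Count earlier rows with the same ID+Region (or ID+Industry) within the
# previous 90 days; the -1 upper bound excludes the current row.
w_region = (Window.partitionBy("ID", "Region")
            .orderBy("created_ts")
            .rangeBetween(-days(90), -1))
w_industry = (Window.partitionBy("ID", "Industry")
              .orderBy("created_ts")
              .rangeBetween(-days(90), -1))

result = (df
          .withColumn("Count_3monthsRegion", F.count(F.lit(1)).over(w_region))
          .withColumn("Count_3monthsIndustry", F.count(F.lit(1)).over(w_industry)))
The 2-month and 1-month variants follow by changing the lower bound of the frame.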
I'm reading a USA_Housing.csv file whose columns are
(Avg Area Income, Avg Area House Age, Avg Area Number of Rooms, Avg Area Number of Bedrooms, Area Population, Price, Address)
All columns are numerical values except Address.
When reading the data with this:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
val data = spark.read.option("header","true").option("inferSchema","true").format("csv").load("USA_Housing.csv")
data.printSchema()
output of printSchema is:
|-- Avg Area Income: string (nullable = true)
|-- Avg Area House Age: string (nullable = true)
|-- Avg Area Number of Rooms: double (nullable = true)
|-- Avg Area Number of Bedrooms: double (nullable = true)
|-- Area Population: double (nullable = true)
|-- Price: double (nullable = true)
|-- Address: string (nullable = true)
Avg Area Income and Avg Area House Age are both inferred as string, but they are actually double in the CSV file.
When I open the data in Atom it's shown as:
Avg Area Income,Avg Area House Age,Avg Area Number of Rooms,Avg Area Number of Bedrooms,Area Population,Price,Address
79545.45857431678,5.682861321615587,7.009188142792237,4.09,23086.800502686456,1059033.5578701235,"208 Michael Ferry Apt. 674
Laurabury, NE 37010-5101"
79248.64245482568,6.0028998082752425,6.730821019094919,3.09,40173.07217364482,1505890.91484695,"188 Johnson Views Suite 079
Lake Kathleen, CA 48958"
Setting multiLine to true should work.
val data = spark.read.option("header","true").option("inferSchema","true").option("multiLine", "true").format("csv").load("USA_Housing.csv")
The CSV (from Kaggle) seems malformed: there is a line break in the address column. So the first column is actually parsed as:
+------------------+
| _c0|
+------------------+
| 79545.45857431678|
| Laurabury|
| 79248.64245482568|
| Lake Kathleen|
|61287.067178656784|
| Danieltown|
| 63345.24004622798|
| FPO AP 44820"|
|59982.197225708034|
| FPO AE 09386"|
Therefore Spark infers this column as string.
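For reference, a PySpark equivalent of the same fix (assuming the same file path) would be:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# multiLine lets the CSV parser keep quoted fields that span several lines,
# so the numeric columns stay intact and inferSchema can type them correctly.
data = (spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .option("multiLine", "true")
        .csv("USA_Housing.csv"))
data.printSchema()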
I have a massive PySpark dataframe. I need to group by Person and then collect their Budget items into a list to perform a further calculation.
As an example,
a = [('Bob', 562,"Food", "12 May 2018"), ('Bob',880,"Food","01 June 2018"), ('Bob',380,'Household'," 16 June 2018"), ('Sue',85,'Household'," 16 July 2018"), ('Sue',963,'Household'," 16 Sept 2018")]
df = spark.createDataFrame(a, ["Person", "Amount","Budget", "Date"])
Group By:
import pyspark.sql.functions as F
df_grouped = df.groupby('person').agg(F.collect_list("Budget").alias("data"))
Schema:
root
|-- person: string (nullable = true)
|-- data: array (nullable = true)
| |-- element: string (containsNull = true)
However, I am getting a memory error when I try to apply a UDF on each person. How can I get the size (in megabytes or gigabytes) of each list (data) for each person?
I have done the following, but I am getting nulls
import sys
from pyspark.sql.types import DoubleType

size_list_udf = F.udf(lambda data: sys.getsizeof(data)/1000, DoubleType())
df_grouped = df_grouped.withColumn("size",size_list_udf("data") )
df_grouped.show()
Output:
+------+--------------------+----+
|person| data|size|
+------+--------------------+----+
| Sue|[Household, House...|null|
| Bob|[Food, Food, Hous...|null|
+------+--------------------+----+
You just have one minor issue with your code. sys.getsizeof() returns the size of an object in bytes as an integer. You're dividing this by the integer 1000 to get kilobytes, and in Python 2 that division returns an integer, while you defined your udf to return a DoubleType(). The simple fix is to divide by 1000.0.
import sys
from pyspark.sql.types import DoubleType

size_list_udf = F.udf(lambda data: sys.getsizeof(data)/1000.0, DoubleType())
df_grouped = df_grouped.withColumn("size", size_list_udf("data"))
df_grouped.show(truncate=False)
#+------+-----------------------+-----+
#|person|data |size |
#+------+-----------------------+-----+
#|Sue |[Household, Household] |0.112|
#|Bob |[Food, Food, Household]|0.12 |
#+------+-----------------------+-----+
I have found that in cases where a udf is returning null, the culprit is very frequently a type mismatch.
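To make the type-mismatch point concrete with a small hypothetical example: a udf declared as DoubleType() that returns a Python int yields null, while returning an actual float works.
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Yields null for every row: the lambda returns a Python int,
# which does not match the declared DoubleType.
bad_udf = F.udf(lambda data: len(data), DoubleType())

# Works: cast to float so the returned value matches DoubleType.
good_udf = F.udf(lambda data: float(len(data)), DoubleType())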
I'm a Spark beginner. I'm trying to collect all the keys present in a particular column, where different rows have different key-value pairs.
|-- A: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
A                                          ID
name: 'Peter', age: '25'                   5
name: 'John', country: 'USA', pet: 'dog'   7
I need to transform this into a dataframe with all the keys as new columns. I tried exploding the column, which creates new "key" and "value" columns, but the dataframe is a few GB in size and the Spark job fails.
dataframe.select(explode("A")).select("key").show()
The expected result is :
name age ID country pet
Peter 25 5 null null
John null 7 USA dog
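A common way to get there (a sketch, assuming Spark 2.3+ for map_keys and that the set of distinct keys is small enough to collect to the driver) is to gather the distinct keys first and then project each one with getItem:
from pyspark.sql import functions as F

# Collect the distinct map keys to the driver (assumed to be a small set,
# even though the dataframe itself is large).
keys = (dataframe
        .select(F.explode(F.map_keys(F.col("A"))).alias("k"))
        .distinct()
        .rdd.flatMap(lambda row: row)
        .collect())

# One column per key; rows that lack a key get null for that column.
result = dataframe.select(
    "ID",
    *[F.col("A").getItem(k).alias(k) for k in keys]
)
Because only the distinct keys are collected, rather than the fully exploded rows, the driver-side data stays small.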
I am importing a Postgres database into Spark. I know that I can partition on import, but that requires that I have a numeric column (I don't want to use the value column because it's all over the place and doesn't maintain order):
df = spark.read.format('jdbc').options(url=url, dbtable='tableName', properties=properties).load()
df.printSchema()
root
|-- id: string (nullable = false)
|-- timestamp: timestamp (nullable = false)
|-- key: string (nullable = false)
|-- value: double (nullable = false)
Instead, I am converting the dataframe into an RDD (of enumerated tuples) and trying to partition that:
rdd = df.rdd.flatMap(lambda x: enumerate(x)).partitionBy(20)
Note that I used 20 because I have 5 workers with 4 cores each in my cluster, and 5*4=20.
Unfortunately, the following command still takes forever to execute:
result = rdd.first()
Therefore I am wondering whether my logic above makes sense. Am I doing anything wrong? From the web GUI, it looks like the workers are not being used.
Since you already know you can partition by a numeric column, this is probably what you should do. Here is the trick. First, let's find the minimum and maximum epoch:
url = ...
properties = ...
min_max_query = """(
SELECT
CAST(min(extract(epoch FROM timestamp)) AS bigint),
CAST(max(extract(epoch FROM timestamp)) AS bigint)
FROM tablename
) tmp"""
min_epoch, max_epoch = spark.read.jdbc(
url=url, table=min_max_query, properties=properties
).first()
and use it to query the table:
numPartitions = ...
query = """(
SELECT *, CAST(extract(epoch FROM timestamp) AS bigint) AS epoch
FROM tablename) AS tmp"""
spark.read.jdbc(
url=url, table=query,
lowerBound=min_epoch, upperBound=max_epoch + 1,
column="epoch", numPartitions=numPartitions, properties=properties
).drop("epoch")
Since this splits the data into ranges of equal size, it is relatively sensitive to data skew, so you should use it with caution.
You could also provide a list of disjoint predicates as a predicates argument.
predicates= [
"id BETWEEN 'a' AND 'c'",
"id BETWEEN 'd' AND 'g'",
... # Continue to get full coverage and the desired number of predicates
]
spark.read.jdbc(
url=url, table="tablename", properties=properties,
predicates=predicates
)
The latter approach is much more flexible and can address certain issues with non-uniform data distribution but requires more knowledge about the data.
Using partitionBy fetches the data first and then performs a full shuffle to get the desired number of partitions, so it is relatively expensive.
I have the following ds:
|-- created_at: timestamp (nullable = true)
|-- channel_source_id: integer (nullable = true)
|-- movie_id: integer (nullable = true)
I would like to classify the movie_id based on some conditions:
Number of times it has been played;
#Count occurrences by ID
SELECT COUNT(created_at) FROM logs GROUP BY movie_id
The range of time (created_at) over which the movie has been played;
#Returns distinct movies_id
SELECT DISTINCT(movie_id) FROM logs
#For each movie_id, I would like to retrieve the hours at which it has been played
#When I have the result, I can apply a filter on the df to extract the intervals
SELECT created_at FROM logs WHERE movie_id = ?
Number of different channel_source_id values that have played the movie;
#Count number of channels that have played
SELECT COUNT(DISTINCT(channel_source_id)) FROM logs where movie_id = ? GROUP BY movie_id
I've written a simple table to help me with the classification:
Played 1 to 5 times, range between 00:00:00 - 03:59:59, 1 to 3 different channels >> Movie Type A
Played 6 to 10 times, range between 04:00:00 - 07:59:59, 4 to 5 different channels >> Movie Type B
etc
I'm using Spark to import the file, but I'm lost on how to perform the classification. Could anyone give me a hand with where to start?
def run() = {
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("com.databricks.spark.csv")
.options(Map("header" -> "true", "inferSchema" -> "true"))
.load("/home/plc/Desktop/movies.csv")
df.printSchema()
}
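As a starting point, the three computations can be combined into one groupBy aggregation and the classification table applied with when/otherwise. Here is a rough PySpark sketch (the same logic carries over to the Scala API). The dataframe name logs is an assumption, and reading the "range" rule as the earliest hour of play is just one possible interpretation.
from pyspark.sql import functions as F

# Per-movie statistics corresponding to the three SQL snippets above.
stats = (logs.groupBy("movie_id")
         .agg(F.count("created_at").alias("play_count"),
              F.min(F.hour("created_at")).alias("min_hour"),
              F.countDistinct("channel_source_id").alias("channel_count")))

# Apply the classification table.
classified = stats.withColumn(
    "movie_type",
    F.when(F.col("play_count").between(1, 5) &
           F.col("min_hour").between(0, 3) &
           F.col("channel_count").between(1, 3), "Movie Type A")
     .when(F.col("play_count").between(6, 10) &
           F.col("min_hour").between(4, 7) &
           F.col("channel_count").between(4, 5), "Movie Type B")
     .otherwise("Unclassified"))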