Distribute group tasks evenly using pandas_udf in PySpark

I have a Spark Dataframe which contains groups of training data. Each group is identified by the "group" column.
group | feature_1 | feature_2 | label
--------------------------------------
1 | 123 | 456 | 0
1 | 553 | 346 | 1
... | ... | ... | ...
2 | 623 | 498 | 0
2 | 533 | 124 | 1
... | ... | ... | ...
I want to train a Python ML model (LightGBM in my case) for each group in parallel.
Therefore I have the following working code:
import pickle
import pandas as pd
import pyspark.sql.functions as F
import pyspark.sql.types as T

schema = T.StructType([T.StructField("group_id", T.IntegerType(), True),
                       T.StructField("model", T.BinaryType(), True)])

@F.pandas_udf(schema, F.PandasUDFType.GROUPED_MAP)
def _fit(pdf):
    group_id = pdf.loc[0, "group"]
    X = pdf.loc[:, X_col]          # X_col: list of feature column names
    y = pdf.loc[:, y_col].values   # y_col: label column name
    # train model
    model = ...
    out_df = pd.DataFrame(
        [[group_id, pickle.dumps(model)]],
        columns=["group_id", "model"],
    )
    return out_df

df.groupby("group").apply(_fit)
I have 10 groups in the dataset and 10 worker nodes.
Most of the time, each group is assigned to its own executor and the processing is very quick.
Sometimes, however, more than one group is assigned to an executor while other executors are left idle.
This makes the processing very slow, as that executor has to train multiple models at the same time.
Question: how do I schedule each group to train on a separate executor to avoid this problem?

I think you'll want to look into tuning the following two Spark configuration settings:
spark.task.cpus (the number of CPUs per task)
spark.executor.cores (the number of CPUs per executor)
I believe setting spark.executor.cores = spark.task.cpus = (cores per worker - 1) might solve your problem.
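For example, here is a minimal sketch of how those two settings might be passed when building the session (the value 7 is a placeholder that assumes 8-core workers; the same settings can also be passed to spark-submit via --conf):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("per-group-training")
    # assume 8-core workers: give each executor 7 cores...
    .config("spark.executor.cores", "7")
    # ...and make every task claim all 7 cores, so at most one
    # grouped-map task runs on an executor at a time
    .config("spark.task.cpus", "7")
    .getOrCreate()
)
With spark.task.cpus equal to spark.executor.cores, Spark can only schedule one task per executor at a time, so each group's model should train on its own executor, provided there are at least as many executors as groups.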

Related

Combining two columns from different datasets into one table in Spark

I want to take two columns from two different tables and combine them into a single table, without joining on any primary key common to both. For example:
import java.util
import org.apache.spark.sql.{DataFrame, Encoders}

val testDSArray: java.util.List[Integer] = new util.ArrayList[Integer]()
testDSArray.add(4)
testDSArray.add(7)
testDSArray.add(10)
val testDS: DataFrame = spark.createDataset(testDSArray)(Encoders.INT).toDF("col1")
val testDS2: DataFrame = spark.createDataset(testDSArray)(Encoders.INT).toDF("col2")
val columns = testDS.withColumn("col2", testDS2.col("col2"))
columns.show(5)
I would expect this code to show something like:
---------------
| col1 | col2 |
---------------
| 4 | 4 |
| 7 | 7 |
| 10 | 10 |
---------------
However, the code above fails to run with error
Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) col2#12 missing from col1#6 in operator !Project [col1#6, col2#12 AS col2#15];
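One possible workaround, sketched here in PySpark to stay consistent with the main question (df1 and df2 are hypothetical stand-ins for testDS and testDS2): Spark cannot resolve a column belonging to a different DataFrame inside withColumn, which is what the AnalysisException is complaining about, so instead you can give each DataFrame an explicit row index and join on it.
from pyspark.sql import Window, functions as F

# df1 / df2: hypothetical stand-ins for testDS / testDS2 above.
# row_number over monotonically_increasing_id approximates each DataFrame's
# original row order, which is enough to zip two small DataFrames together.
w = Window.orderBy(F.monotonically_increasing_id())
df1_idx = df1.withColumn("_rn", F.row_number().over(w))
df2_idx = df2.withColumn("_rn", F.row_number().over(w))
result = df1_idx.join(df2_idx, "_rn").drop("_rn")
result.show()
The same pattern works in Scala with the equivalent Window and functions APIs.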

How to preserve order of a DataFrame when writing it as CSV with partitioning by columns?

I sort the rows of a DataFrame and write it out to disk like so:
df.
orderBy("foo").
write.
partitionBy("bar", "moo").
option("compression", "gzip").
csv(outDir)
When I look into the generated .csv.gz files, their order is not preserved. Is this just how Spark does it? Is there a way to preserve order when writing a DataFrame to disk with partitioning?
Edit: To be more precise: it is not the order of the CSV files that is off, but the order inside them. Let's say I have the following after df.orderBy (for simplicity, I now only partition by one column):
foo | bar | baz
===============
1 | 1 | 1
1 | 2 | 2
1 | 1 | 3
2 | 3 | 4
2 | 1 | 5
3 | 2 | 6
3 | 3 | 7
4 | 2 | 9
4 | 1 | 10
I expect it to be like this, e.g. for files in folder bar=1:
part-00000-NNN.csv.gz:
1,1
1,3
2,5
part-00001-NNN.csv.gz:
3,8
4,10
But what it is like:
part-00000-NNN.csv.gz:
1,1
2,5
1,3
part-00001-NNN.csv.gz:
4,10
3,8
It's been a while, but I witnessed this again and finally came across a workaround.
Suppose your schema is:
time: bigint
channel: string
value: double
If you do:
df.orderBy("time").write.partitionBy("channel").csv("hdfs:///foo")
the timestamps in the individual part-* files end up out of order.
If you do:
df.orderBy("channel", "time").write.partitionBy("channel").csv("hdfs:///foo")
the order is correct.
I think it has to do with shuffling. So, as a workaround, I now sort first by the columns the data is to be partitioned by, and then by the column the rows should be sorted on inside the individual files.
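A minimal PySpark sketch of that workaround, using the hypothetical time/channel/value schema above:
# sort first by the partition column, then by the key the rows should be
# ordered on inside each file, so the partitionBy shuffle does not scramble
# the per-directory row order
(df.orderBy("channel", "time")
   .write
   .partitionBy("channel")
   .csv("hdfs:///foo"))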

Add a key element for n rows in PySpark Dataframe

I have a dataframe like the one shown below.
id | run_id
--------------
4 | 12345
6 | 12567
10 | 12890
13 | 12450
I wish to add a new column, say key, that has the value 1 for the first n rows, 2 for the next n rows, and so on. The result would look like:
id | run_id | key
----------------------
4 | 12345 | 1
6 | 12567 | 1
10 | 12890 | 2
13 | 12450 | 2
Is it possible to do this with PySpark? Thanks in advance for the help.
Here is one way to do it using zipWithIndex:
# sample rdd
rdd = sc.parallelize([[4, 12345], [6, 12567], [10, 12890], [13, 12450]])
# group size for key
n = 2
# add a row number, then label rows in batches of size n
rdd = rdd.zipWithIndex().map(lambda x: x[0] + [int(x[1] / n) + 1])
# convert to dataframe
df = rdd.toDF(schema=['id', 'run_id', 'key'])
df.show(4)
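An alternative sketch that stays in the DataFrame API; it assumes that ordering by monotonically_increasing_id reproduces the original row order (true for a freshly created DataFrame like this one, but not guaranteed in general), and input_df / n are hypothetical names for the input DataFrame and the group size:
from pyspark.sql import Window, functions as F

n = 2
w = Window.orderBy(F.monotonically_increasing_id())
keyed_df = input_df.withColumn(
    "key",
    # integer division of the 0-based row number by n, plus 1
    ((F.row_number().over(w) - 1) / n).cast("int") + 1
)
keyed_df.show()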

How to remove records with their count per group below a threshold?

Here's the DataFrame:
id | sector | balance
---------------------------
1 | restaurant | 20000
2 | restaurant | 20000
3 | auto | 10000
4 | auto | 10000
5 | auto | 10000
How to find the count of each sector type and remove the records with sector type count below a specific LIMIT?
The following:
dataFrame.groupBy(columnName).count()
gives me the number of times a value appears in that column.
How to do it in Spark and Scala using DataFrame API?
You can use a SQL window function to do so (here, 2 stands in for your LIMIT):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

yourDf.withColumn("count", count("*").over(Window.partitionBy($"colName")))
  .where($"count" > 2)
  // .drop($"count") // if you don't want to keep the count column
  .show()
For your given dataframe:
import org.apache.spark.sql.expressions.Window

dataFrame.withColumn("count", count("*").over(Window.partitionBy($"sector")))
  .where($"count" > 2)
  .show()
You should see results like this:
id | sector | balance | count
------------------------------
3 | auto | 10000 | 3
4 | auto | 10000 | 3
5 | auto | 10000 | 3
I don't know if it is the best way, but this worked for me (as the name suggests, it returns the records whose value in the given column occurs fewer than limit times, i.e. the ones you would drop):
import org.apache.spark.sql.DataFrame

def getRecordsWithColumnFrequencyLessThanLimit(dataFrame: DataFrame, columnName: String, limit: Integer): DataFrame = {
  val g = dataFrame.groupBy(columnName)
    .count()
    .filter("count < " + limit)
    .select(columnName)
    .rdd
    .map(r => r(0))
    .collect()
  dataFrame.filter(dataFrame(columnName).isin(g: _*))
}
Since it's a DataFrame, you can register it as a temporary view and use a SQL query like:
select sector, count(1)
from TABLE
group by sector
having count(1) >= LIMIT
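A sketch of wiring that up from code, shown in PySpark for consistency with the main question; the view name is a placeholder, LIMIT_N stands for your threshold, and the counts query above is extended with an IN subquery so the original records are returned rather than just the counts:
# dataFrame: the DataFrame with id / sector / balance columns
dataFrame.createOrReplaceTempView("sectors")
LIMIT_N = 3
kept = spark.sql("""
    SELECT *
    FROM sectors
    WHERE sector IN (
        SELECT sector
        FROM sectors
        GROUP BY sector
        HAVING count(1) >= {limit}
    )
""".format(limit=LIMIT_N))
kept.show()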

postgresql Subselect Aggregate in larger query

I'm working with a gigantic dataset of individuals with demographic information and action tracking. I am trying to get the percentage of people who committed an action, which is simple, but I am also trying to get the average age of the people who fit into a specific subgroup of the original SELECT. The CASE WHEN line works fine alone, and the subquery runs fine in its own query, but I cannot seem to integrate it into this query as a subquery; it gives me a syntax error on the CASE WHEN statement. Here's a slightly anonymized version of the query. Any help would be very much appreciated.
SELECT
AVG(ageagg)
FROM
(
SELECT
age AS ageagg
FROM
agetable
WHERE
age>30
AND action_taken=1) AvgAge_30Action,
COUNT(
CASE
WHEN action_taken=1
AND age> 30
THEN 1
ELSE 0 NULL) / COUNT(
CASE
WHEN age>30) AS Over_30_Action
FROM
agetable
WHERE
website_type=3
If I've interpreted your intent correctly, you wish to compute the following:
1) the number of people over the age of 30 that took a specific action as a percentage of the total number of people over the age of 30
2) the average age of the people over the age of 30 that took a specific action
Assuming my interpretation is correct, this query might work for you:
SELECT
100 * over_30_action / over_30_total AS percentage_of_over_30_took_action,
average_age_of_over_30_took_action
FROM (
SELECT
SUM(CASE WHEN action_taken=1 THEN 1 ELSE 0 END) AS over_30_action,
COUNT(*) AS over_30_total,
AVG(CASE WHEN action_taken=1 THEN age ELSE NULL END)
AS average_age_of_over_30_took_action
FROM agetable
WHERE website_type=3 AND age>30
) aggregated;
I created a dummy table and populated it with the following data.
postgres=# select * from agetable order by website_type, action_taken, age;
age | action_taken | website_type
-----+--------------+--------------
33 | 1 | 1
32 | 1 | 2
28 | 1 | 3
29 | 1 | 3
32 | 1 | 3
33 | 1 | 3
34 | 1 | 3
32 | 2 | 3
32 | 3 | 3
33 | 4 | 3
34 | 5 | 3
33 | 6 | 3
34 | 7 | 3
35 | 8 | 3
(14 rows)
Of the 14 rows, 4 rows (the first four in this listing) have either the wrong website_type or have age below 30. Of the ten remaining rows, you can see that 3 of them have an action_taken of 1. So, the query should determine that 30% of folks over the age of 30 took a particular action, and the average age among that particular population should be 33 (ages 32, 33, and 34). The results of the query I posted:
percentage_of_over_30_took_action | average_age_of_over_30_took_action
-----------------------------------+------------------------------------
30 | 33.0000000000000000
(1 row)
Again, all of this is predicated upon my interpretation of your intent actually being accurate. This is of course based on a highly contrived data set, but hopefully it's enough of a functional signpost to get you on the right path.