How to "dense" a data frame in Spark [duplicate] - scala

I have a data frame that looks like this:
item_id  week_id  sale_amount
1        1        10
1        2        12
1        3        15
2        1        4
2        2        7
2        3        9
I want to transform this data frame into a new one that looks like this:
item_id  week_1  week_2  week_3
1        10      12      15
2        4       7       9
This can easily be done in R, but I don't know how to do it using the Spark API with Scala.

You can use groupBy.pivot and then aggregate the sale_amount column. In this case you can take the first value of each (item_id, week_id) combination, provided there is no more than one row per combination:
import org.apache.spark.sql.functions.first
df.groupBy("item_id").pivot("week_id").agg(first("sale_amount")).show
+-------+---+---+---+
|item_id| 1| 2| 3|
+-------+---+---+---+
| 1| 10| 12| 15|
| 2| 4| 7| 9|
+-------+---+---+---+
You can use other aggregation functions if there is more than one row per combination of item_id and week_id, for instance sum:
df.groupBy("item_id").pivot("week_id").agg(sum("sale_amount")).show
+-------+---+---+---+
|item_id| 1| 2| 3|
+-------+---+---+---+
| 1| 10| 12| 15|
| 2| 4| 7| 9|
+-------+---+---+---+
To get proper column names, you can transform the week_id column before pivoting:
import org.apache.spark.sql.functions._
(df.withColumn("week_id", concat(lit("week_"), df("week_id"))).
groupBy("item_id").pivot("week_id").agg(first("sale_amount")).show)
+-------+------+------+------+
|item_id|week_1|week_2|week_3|
+-------+------+------+------+
| 1| 10| 12| 15|
| 2| 4| 7| 9|
+-------+------+------+------+

Related

Is there any method by which we can limit the rows in the repartition function?

In Spark I am trying to limit the number of rows to 100 in each partition. But I don't want to write it to a file yet; I need to perform more operations on the data before overwriting the records.
You can do it using repartition.
To keep roughly n records in each partition, choose the number of partitions so that total_record_count / num_partitions = n.
For example: I have 100 records. If I want each partition to hold 10 records, I have to repartition my data into 10 parts with df.repartition(10).
>>> from pyspark.sql.functions import spark_partition_id, asc
>>> df=spark.read.csv("/path to csv/sample2.csv",header=True)
>>> df.count()
100
>>> df1=df.repartition(10)
>>> df1\
... .withColumn("partitionId", spark_partition_id())\
... .groupBy("partitionId")\
... .count()\
... .orderBy(asc("count"))\
... .show()
+-----------+-----+
|partitionId|count|
+-----------+-----+
| 6| 10|
| 3| 10|
| 5| 10|
| 9| 10|
| 8| 10|
| 4| 10|
| 7| 10|
| 1| 10|
| 0| 10|
| 2| 10|
+-----------+-----+
Here you can see that each partition has 10 records.
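To get back to the original goal of roughly 100 rows per partition without hard-coding the partition count, here is a minimal sketch (the helper name is an assumption, and repartition distributes rows round-robin, so the limit is approximate rather than a hard cap):
import math

def repartition_by_row_limit(df, rows_per_partition=100):
    # Count the rows (one extra pass over the data), then derive the number of
    # partitions needed so each partition holds about rows_per_partition rows.
    total = df.count()
    num_partitions = max(1, math.ceil(total / rows_per_partition))
    return df.repartition(num_partitions)

df_limited = repartition_by_row_limit(df, 100)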

Pyspark Crosstab Pivot Challenge / Problem

I unfortunately could not find a solution for my exact problem. It is related to pivot and crosstab but I could not solve it with these functions.
I have the feeling I am missing an in-between-table, but I somehow cannot come up with a solution.
Problem description:
A table of customers indicates the categories they have bought a product from. If a customer bought a product from a category, the category ID is shown next to their name.
There are 4 categories (1-4) and 3 customers (A, B, C):
+--------+----------+
|customer| category |
+--------+----------+
| A| 1|
| A| 2|
| A| 3|
| B| 1|
| B| 4|
| C| 1|
| C| 3|
| C| 4|
+--------+----------+
The table is DISTINCT, meaning each combination of customer and category appears only once.
What I want is a crosstab by category from which I can easily read, e.g., how many of those who bought from category 1 also bought from category 4.
Desired results table:
+--------+---+---+---+---+
| | 1 | 2 | 3 | 4 |
+--------+---+---+---+---+
| 1| 3| 1| 2| 2|
| 2| 1| 1| 1| 0|
| 3| 2| 1| 2| 1|
| 4| 2| 0| 1| 1|
+--------+---+---+---+---+
Reading examples:
row1 column1 : total number of customers who bought product 1 (A, B, C)
row1 column2 : number of customers who bought product 1 and 2 (A)
row1 column3 : number of customers who bought product 1 and 3 (A, C)
etc.
As you can see, the table is symmetric about its diagonal.
Any suggestions on how to create the desired table?
Additional challenge:
How to get the results as %?
For the first row the results would then be: | 100% | 33% | 66% | 66% |
Many thanks in advance!
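For reference, a minimal sketch that reproduces the sample data used by the answer below (the SparkSession setup is an assumption):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# One row per distinct (customer, category) combination, as in the question.
df = spark.createDataFrame(
    [("A", 1), ("A", 2), ("A", 3),
     ("B", 1), ("B", 4),
     ("C", 1), ("C", 3), ("C", 4)],
    ["customer", "category"])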
You can join the input data with itself using customer as the join criterion. This returns all combinations of categories that exist for a given customer. After that you can use crosstab to get the result.
df2 = df.withColumnRenamed("category", "cat1").join(df.withColumnRenamed("category", "cat2"), "customer") \
.crosstab("cat1", "cat2") \
.orderBy("cat1_cat2")
df2.show()
Output:
+---------+---+---+---+---+
|cat1_cat2| 1| 2| 3| 4|
+---------+---+---+---+---+
| 1| 3| 1| 2| 2|
| 2| 1| 1| 1| 0|
| 3| 2| 1| 2| 1|
| 4| 2| 0| 1| 2|
+---------+---+---+---+---+
To get the relative frequency you can sum over each row and then divide each element by this sum.
from pyspark.sql import functions as F
df2.withColumn("sum", sum(df2[col] for col in df2.columns if col != "cat1_cat2")) \
.select("cat1_cat2", *(F.round(df2[col]/F.col("sum"),2).alias(col) for col in df2.columns if col != "cat1_cat2")) \
.show()
Output:
+---------+----+----+----+----+
|cat1_cat2| 1| 2| 3| 4|
+---------+----+----+----+----+
| 1|0.38|0.13|0.25|0.25|
| 2|0.33|0.33|0.33| 0.0|
| 3|0.33|0.17|0.33|0.17|
| 4| 0.4| 0.0| 0.2| 0.4|
+---------+----+----+----+----+
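If instead you want the percentages described in the question (100% | 33% | 66% | 66% for the first row, i.e. relative to the number of buyers of the row's own category), one possible sketch is to divide each row by its diagonal entry rather than by the row sum (column names assumed to match the crosstab output above):
from pyspark.sql import functions as F

value_cols = [c for c in df2.columns if c != "cat1_cat2"]
# For each row, pick the count in the column that matches the row's own category.
diag = F.coalesce(*[F.when(F.col("cat1_cat2") == c, F.col(c)) for c in value_cols])
df2.select(
    "cat1_cat2",
    *[F.round(F.col(c) / diag, 2).alias(c) for c in value_cols]
).orderBy("cat1_cat2").show()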

pyspark: Auto filling in implicit missing values

I have a dataframe:
user  day  amount
a     2    10
a     1    14
a     4    5
b     1    4
You can see that the maximum value of day is 4 and the minimum is 1. I want to fill in 0 for the amount column for all missing days of all users, so the above data frame becomes:
user  day  amount
a     2    10
a     1    14
a     4    5
a     3    0
b     1    4
b     2    0
b     3    0
b     4    0
How could I do that in PySpark? Many thanks.
Here is one approach. You can get the min and max values first, then group on the user column and pivot, then fill in the missing columns and fill all nulls with 0, then stack them back:
from pyspark.sql import functions as F
min_max = df.agg(F.min("day"),F.max("day")).collect()[0]
df1 = df.groupBy("user").pivot("day").agg(F.first("amount").alias("amount")).na.fill(0)
missing_cols = [F.lit(0).alias(str(i)) for i in range(min_max[0],min_max[1]+1)
if str(i) not in df1.columns ]
df1 = df1.select("*",*missing_cols)
#+----+---+---+---+---+
#|user| 1| 2| 4| 3|
#+----+---+---+---+---+
#| b| 4| 0| 0| 0|
#| a| 14| 10| 5| 0|
#+----+---+---+---+---+
#the next step is inspired from https://stackoverflow.com/a/37865645/9840637
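# Build an array of (day, amount) structs, one per pivoted column, and explode
# it to stack the wide columns back into long (user, day, amount) rows.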
arr = F.explode(F.array([F.struct(F.lit(c).alias("day"), F.col(c).alias("amount"))
for c in df1.columns[1:]])).alias("kvs")
(df1.select(["user"] + [arr])
.select(["user"]+ ["kvs.day", "kvs.amount"]).orderBy("user")).show()
+----+---+------+
|user|day|amount|
+----+---+------+
| a| 1| 14|
| a| 2| 10|
| a| 4| 5|
| a| 3| 0|
| b| 1| 4|
| b| 2| 0|
| b| 4| 0|
| b| 3| 0|
+----+---+------+
Note: since the day column was pivoted, its dtype may have changed (the pivoted column names are strings), so you may have to cast it back to the original dtype.
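For example, a minimal sketch of such a cast, assuming the unpivoted result from the step above was assigned to a variable named result and day was originally an integer:
from pyspark.sql import functions as F

# "day" comes back as a string column because the pivoted column names are
# strings; cast it back to the original integer type.
result = result.withColumn("day", F.col("day").cast("int"))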
Another way to do this is to use sequence, array functions and explode (Spark 2.4+).
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w=Window().partitionBy(F.lit(0))
df.withColumn("boundaries", F.sequence(F.min("day").over(w),F.max("day").over(w),F.lit(1)))\
.groupBy("user").agg(F.collect_list("day").alias('day'),F.collect_list("amount").alias('amount')\
,F.first("boundaries").alias("boundaries")).withColumn("boundaries", F.array_except("boundaries","day"))\
.withColumn("day",F.flatten(F.array("day","boundaries"))).drop("boundaries")\
.withColumn("zip", F.explode(F.arrays_zip("day","amount")))\
.select("user","zip.day", F.when(F.col("zip.amount").isNull(),\
F.lit(0)).otherwise(F.col("zip.amount")).alias("amount")).show()
#+----+---+------+
#|user|day|amount|
#+----+---+------+
#| a| 2| 10|
#| a| 1| 14|
#| a| 4| 5|
#| a| 3| 0|
#| b| 1| 4|
#| b| 2| 0|
#| b| 3| 0|
#| b| 4| 0|
#+----+---+------+

Pyspark - add missing values per key?

I have a PySpark dataframe with a non-unique key column key and columns number and value.
For most keys, the number column goes from 1 to 12, but for some of them there are gaps (e.g. we have numbers [1, 2, 5, 9]). I would like to add the missing rows, so that for every key we have all the numbers in the range 1-12, populated with the last seen value.
So that for the table
key  number  value
a    1       6
a    2       10
a    5       20
a    9       25
I would like to get
key  number  value
a    1       6
a    2       10
a    3       10
a    4       10
a    5       20
a    6       20
a    7       20
a    8       20
a    9       25
a    10      25
a    11      25
a    12      25
I thought about creating a table of a and an array of 1-12, exploding the array and joining with my original table, then separately populating the value column with previous value using a window function bounded by current row. However, it seems a bit inelegant and I wonder if there is a better way to achieve what I want?
I do not think your proposed approach is inelegant, but you can achieve the same thing using range instead of explode.
First create a dataframe with all the numbers in your range. You will also want to cross join this with the distinct key column from your DataFrame.
all_numbers = spark.range(1, 13).withColumnRenamed("id", "number")
all_numbers = all_numbers.crossJoin(df.select("key").distinct()).cache()
all_numbers.show()
#+------+---+
#|number|key|
#+------+---+
#| 1| a|
#| 2| a|
#| 3| a|
#| 4| a|
#| 5| a|
#| 6| a|
#| 7| a|
#| 8| a|
#| 9| a|
#| 10| a|
#| 11| a|
#| 12| a|
#+------+---+
Now you can outer join this to your original DataFrame and forward fill using the last known good value. If the number of keys is small enough, you may be able to broadcast the numbers DataFrame for the join:
from pyspark.sql.functions import broadcast, last
from pyspark.sql import Window
df.join(broadcast(all_numbers), on=["number", "key"], how="outer")\
.withColumn(
"value",
last(
"value",
ignorenulls=True
).over(
Window.partitionBy("key").orderBy("number")\
.rowsBetween(Window.unboundedPreceding, 0)
)
)\
.show()
#+------+---+-----+
#|number|key|value|
#+------+---+-----+
#| 1| a| 6|
#| 2| a| 10|
#| 3| a| 10|
#| 4| a| 10|
#| 5| a| 20|
#| 6| a| 20|
#| 7| a| 20|
#| 8| a| 20|
#| 9| a| 25|
#| 10| a| 25|
#| 11| a| 25|
#| 12| a| 25|
#+------+---+-----+
You could do this without a join. I have run multiple tests on this with different gaps and it will always work as long as number 1 is present in the input (since the sequence needs to start from there), and the range will always extend to 12. I used a couple of windows to get a column I could use in the sequence, then built a custom sequence using expressions, and then exploded it to get the desired result. If for some reason you have inputs that do not contain number 1, let me know and I will update my solution.
from pyspark.sql.window import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import when
w=Window().partitionBy("key").orderBy("number")
w2=Window().partitionBy("key").orderBy("number").rowsBetween(Window.unboundedPreceding,Window.unboundedFollowing)
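# w orders the rows within each key; w2 spans the whole key so that
# F.max("number").over(w2) yields the largest number provided for that key.
# The chain below computes, for each row, how many numbers are missing before
# the next row (diff2), extends the last row up to 12, builds
# sequence(number, number + diff2) to fill each gap with the current value,
# and finally explodes those sequences back into individual rows.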
df.withColumn("number2", F.lag("number").over(w)).withColumn("diff", F.when((F.col("number2").isNotNull()) & ((F.col("number")-F.col("number2")) > 1), (F.col("number")-F.col("number2"))).otherwise(F.lit(0)))\
.withColumn("diff2", F.lead("diff").over(w)).withColumn("diff2", F.when(F.col("diff2").isNull(), F.lit(0)).otherwise(F.col("diff2"))).withColumn("diff2", F.when(F.col("diff2")!=0, F.col("diff2")-1).otherwise(F.col("diff2"))).withColumn("max", F.max("number").over(w2))\
.withColumn("diff2", F.when((F.col("number")==F.col("max")) & (F.col("number")<F.lit(12)), F.lit(12)-F.col("number")).otherwise(F.col("diff2")))\
.withColumn("number2", F.when(F.col("diff2")!=0,F.expr("""sequence(number,number+diff2,1)""")).otherwise(F.expr("""sequence(number,number+diff2,0)""")))\
.drop("diff","diff2","max")\
.withColumn("number2", F.explode("number2")).drop("number")\
.select("key", F.col("number2").alias("number"), "value")\
.show()
+---+------+-----+
|key|number|value|
+---+------+-----+
| a| 1| 6|
| a| 2| 10|
| a| 3| 10|
| a| 4| 10|
| a| 5| 20|
| a| 6| 20|
| a| 7| 20|
| a| 8| 20|
| a| 9| 25|
| a| 10| 25|
| a| 11| 25|
| a| 12| 25|
+---+------+-----+

How to pivot a Spark dataframe table? [duplicate]

I have this table of 3 columns:
+---+----+----+
| id|type| val|
+---+----+----+
| 1| A| 0|
| 2| A| 0|
| 4| A| 0|
| 2| B| 1|
| 4| B| 1|
+---+----+----+
and I would like to transform it into something like:
+---+----+----+
| | A| B|
+---+----+----+
| 1| 0| -|
| 2| 1| 1|
| 4| 0| 1|
+---+----+----+
I tried this, but it didn't work:
val data_array = data.pivot(cols=['type'],rows=['id'],values='val')
df.groupBy("id").pivot("type").agg(first("value")).na.fill("-").show
df is the dataframe created from the test data file