I'm trying to transpose some of my PySpark dataframe rows into columns.
I've made many attempts, but I can't seem to get the correct results.
The dataframe currently looks like this:
| ArticleID | Category | Value |
|-----------|----------|-------|
| 1 | Color | Black |
| 1 | Gender | Male |
| 2 | Color | Green |
| 2 | Gender | Female |
| 3 | Color | Blue |
| 3 | Gender | Male |
What I'm trying to get is:
| ArticleID | Color | Gender |
|-----------|-------|--------|
| 1 | Black | Male |
| 2 | Green | Female |
| 3 | Blue | Male |
Edit: The suggested question might be the same in some areas, but this one requires an aggregation on the first item for the pivoted row, i.e.
agg(f.first())
The suggested question could aggregate with numerical operations.
Use groupBy + pivot:
import pyspark.sql.functions as f
df.groupBy('ArticleID').pivot('Category').agg(f.first('Value')).show()
+---------+-----+------+
|ArticleID|Color|Gender|
+---------+-----+------+
| 3| Blue| Male|
| 1|Black| Male|
| 2|Green|Female|
+---------+-----+------+
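If the set of categories is known up front, you can also pass the pivot values explicitly, which saves Spark an extra pass over the data to discover them. A small sketch, assuming the only categories are Color and Gender:
import pyspark.sql.functions as f

# Passing the pivot values explicitly avoids an extra job that computes
# the distinct values of Category (assumed here to be just Color and Gender).
df.groupBy('ArticleID').pivot('Category', ['Color', 'Gender']).agg(f.first('Value')).show()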
Hi, I'm trying to sum the values of one column for all rows in a dataframe where 'ID' matches.
For example:
| ID | Gender | value |
|----|--------|-------|
| 1 | Male | 5 |
| 1 | Male | 6 |
| 2 | Female | 3 |
| 3 | Female | 0 |
| 3 | Female | 9 |
| 4 | Male | 10 |
How do I get the following table?
| ID | Gender | value |
|----|--------|-------|
| 1 | Male | 11 |
| 2 | Female | 3 |
| 3 | Female | 9 |
| 4 | Male | 10 |
In the example above, ID 1 is now shown just once and its values have been summed up (the same goes for ID 3).
Thanks
I'm new to PySpark and still learning. I've tried count(), select, and groupBy(), but nothing has given me what I'm trying to do.
Try this:
import pyspark.sql.functions as f
from pyspark.sql.window import Window

df = (
    df
    .withColumn('value', f.sum(f.col('value')).over(Window.partitionBy(f.col('ID'))))
)
Link to documentation about Window operation https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.window.html
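Note that this window version keeps one row per original row, with the summed value repeated. To match the expected table (one row per ID), you could drop the now-identical duplicates afterwards; a small sketch of that extra step:
# After the window sum, the rows of a given ID are identical,
# so dropDuplicates() collapses them to a single row per ID.
df = df.dropDuplicates()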
You can use a simple groupBy, with the sum function:
from pyspark.sql import functions as F
(
    df
    .groupby("ID", 'Gender')  # sum rows with same ID and Gender
    # .groupby("ID")  # use this line instead if you want to sum rows with the same ID, even if they have different Gender
    .agg(F.sum('value').alias('value'))
)
The result is:
+---+------+-----+
| ID|Gender|value|
+---+------+-----+
| 1| Male| 11|
| 2|Female| 3|
| 3|Female| 9|
| 4| Male| 10|
+---+------+-----+
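If you group only by ID but still want to keep the Gender column in the result, one option (an assumption on my part, since Gender is constant per ID in the sample) is to carry it along with first():
from pyspark.sql import functions as F

# Group only by ID; Gender is assumed constant per ID, so first() keeps it.
result = (
    df
    .groupby("ID")
    .agg(
        F.sum("value").alias("value"),
        F.first("Gender").alias("Gender"),
    )
)
result.show()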
I have one dataframe with 3 columns and 20,000 rows. I need to convert all 20,000 transid values into columns.
table macro:
| prodid | transid | flag |
|--------|---------|------|
| A | 1 | 1 |
| B | 2 | 1 |
| C | 3 | 1 |
so on..
Expected output would look like this, with up to 20,000 columns:
| prodid | 1 | 2 | 3 |
|--------|---|---|---|
| A | 1 | 1 | 1 |
| B | 1 | 1 | 1 |
| C | 1 | 1 | 1 |
I have tried the PIVOT/transpose function, but it takes too long for high-volume data: converting 20,000 rows to columns takes around 10 hours.
e.g.:
val array = a1.select("trans_id").distinct.collect.map(x => x.getString(0)).toSeq
val a2 = a1.groupBy("prodid").pivot("trans_id", array).sum("flag")
When I used pivot on 200-300 rows it worked fast, but as the number of rows increases PIVOT does not perform well.
Can anyone please help me find a solution? Is there any method to avoid the PIVOT function, since PIVOT seems suited to low-volume conversion only? How do I deal with high-volume data?
I need this type of conversion for matrix multiplication.
For matrix multiplication, my input would look like the table below, and the final result will be a matrix product.
|col1|col2|col3|col4|
|----|----|----|----|
|1 | 0 | 1 | 0 |
|0 | 1 | 0 | 0 |
|1 | 1 | 1 | 1 |
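Since the end goal is matrix multiplication, one way to sidestep the pivot entirely is to keep the data in its long (prodid, trans_id, flag) form and build a distributed sparse matrix from it. Below is a PySpark sketch (a suggestion, not code from the question; the Scala mllib API is analogous). It assumes the same data is available as a DataFrame df, and that prodid has already been mapped to a hypothetical integer index row_idx (e.g. via a small lookup join), with trans_id numeric:
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

# Build a sparse distributed matrix straight from the long-format rows,
# skipping the 20,000-column pivot entirely. row_idx is a hypothetical
# integer row index derived from prodid.
entries = df.rdd.map(
    lambda r: MatrixEntry(int(r['row_idx']), int(r['trans_id']), float(r['flag']))
)
mat = CoordinateMatrix(entries)

# BlockMatrix supports distributed multiplication.
block = mat.toBlockMatrix()
product = block.multiply(block.transpose())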
I am trying to convert a piece of SAS code containing the retain functionality and multiple if-else statements to PySpark. I had no luck when I searched for similar answers.
Input Dataset:
| Prod_Code | Rate | Rank |
|-----------|------|------|
| ADAMAJ091 | 1234.0091 | 1 |
| ADAMAJ091 | 1222.0001 | 2 |
| ADAMAJ091 | 1222.0000 | 3 |
| BASSDE012 | 5221.0123 | 1 |
| BASSDE012 | 5111.0022 | 2 |
| BASSDE012 | 5110.0000 | 3 |
I have calculated the rank using df.withColumn("rank", row_number().over(Window.partitionBy('Prod_code').orderBy('Rate'))).
The value in the Rate column at rank 1 must be replicated to all other rows in the partition (ranks 1 to N).
Expected Output Dataset:
| Prod_Code | Rate | Rank |
|-----------|------|------|
| ADAMAJ091 | 1234.0091 | 1 |
| ADAMAJ091 | 1234.0091 | 2 |
| ADAMAJ091 | 1234.0091 | 3 |
| BASSDE012 | 5221.0123 | 1 |
| BASSDE012 | 5221.0123 | 2 |
| BASSDE012 | 5221.0123 | 3 |
The Rate column's value at rank=1 must be replicated to all other rows in the same partition. This is the retain functionality, and I need help replicating it in PySpark code.
I tried using the df.withColumn() approach on individual rows, but I was not able to achieve this functionality in PySpark.
Since you already have the Rank column, you can use the first function to get the first Rate value in a window ordered by Rank.
from pyspark.sql.functions import first
from pyspark.sql.window import Window
df = df.withColumn('Rate', first('Rate').over(Window.partitionBy('prod_code').orderBy('rank')))
df.show()
# +---------+---------+----+
# |Prod_Code| Rate|Rank|
# +---------+---------+----+
# |BASSDE012|5221.0123| 1|
# |BASSDE012|5221.0123| 2|
# |BASSDE012|5221.0123| 3|
# |ADAMAJ091|1234.0091| 1|
# |ADAMAJ091|1234.0091| 2|
# |ADAMAJ091|1234.0091| 3|
# +---------+---------+----+
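If the Rank column were not already there, a similar result could be obtained by ordering the window by Rate itself; a minimal sketch, assuming (as in the sample data) that the rank-1 row is the one with the highest Rate:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# first() over a window ordered by Rate descending picks the highest Rate
# per Prod_Code, which is the rank-1 value in the sample data.
w = Window.partitionBy('Prod_Code').orderBy(F.col('Rate').desc())
df = df.withColumn('Rate', F.first('Rate').over(w))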
I'm currently learning Spark. Let's say we have the following DataFrame:
| user_id | activity |
|---------|----------|
| 1 | liked |
| 2 | comment |
| 1 | liked |
| 1 | liked |
| 1 | comment |
| 2 | liked |
Each type of activity has its own weight which is used to calculate the score
| activity | weight |
|----------|--------|
| liked | 1 |
| comment | 3 |
And this is the desired output
| user_id | score |
|---------|-------|
| 1 | 6 |
| 2 | 4 |
The score is calculated by counting how many times each event occurred and multiplying by its weight. For instance, user 1 performed 3 likes and 1 comment, so the score is given by
(3 * 1) + (1 * 3)
How do we do this calculation in Spark?
My initial attempt is below
val df1 = evidenceDF
  .groupBy("user_id")
  .agg(collect_set("event") as "event_ids")
but I got stuck on the mapping portion. What I want to achieve is: after aggregating the events into the event_ids field, split them and do the calculation in a map function, but I'm having difficulty moving further.
I searched for how to use a custom aggregator function, but it sounds complicated. Is there a straightforward way to do this?
You can join with the weights dataframe, then group by and sum the weights:
val df1 = evidenceDF.join(df_weight, Seq("activity"))
  .groupBy("user_id")
  .agg(
    sum(col("weight")).as("score")
  )
df1.show
//+-------+-----+
//|user_id|score|
//+-------+-----+
//| 1| 6|
//| 2| 4|
//+-------+-----+
Or, if you actually have only 2 categories, use a when expression directly in the sum:
val df1 = evidenceDF.groupBy("user_id")
  .agg(
    sum(
      when(col("activity") === "liked", 1)
        .when(col("activity") === "comment", 3)
    ).as("score")
  )
I have a data frame in Scala Spark as:
| category | score |
|----------|-------|
| A | 0.2 |
| A | 0.3 |
| A | 0.3 |
| B | 0.9 |
| B | 0.8 |
| B | 1 |
I would like to add a row-id column, as follows:
| category | score | row-id |
|----------|-------|--------|
| A | 0.2 | 0 |
| A | 0.3 | 1 |
| A | 0.3 | 2 |
| B | 0.9 | 0 |
| B | 0.8 | 1 |
| B | 1 | 2 |
Basically, I want the row id to be monotonically increasing for each distinct value in the category column. I already have a sorted dataframe, so all rows with the same category are grouped together. However, I still don't know how to generate a row_id that restarts when a new category appears. Please help!
This is a good use case for Window aggregation functions
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import df.sparkSession.implicits._
val window = Window.partitionBy('category).orderBy('score)
df.withColumn("row-id", row_number.over(window))
Window functions work kind of like groupBy, except that instead of each group returning a single value, each row in each group returns a single value. In this case the value is the row's position within the group of rows of the same category. Also, if this is the effect you are trying to achieve, then you don't need to sort the dataframe by category beforehand.