For the following example DataFrame:
df = spark.createDataFrame(
    [
        ('2017-01-01', 'A', 1),
        ('2017-01-01', 'B', 2),
        ('2017-01-01', 'C', 3),
        ('2017-01-02', 'A', 4),
        ('2017-01-02', 'B', 5),
        ('2017-01-02', 'C', 6),
        ('2017-01-03', 'A', 7),
        ('2017-01-03', 'B', 8),
        ('2017-01-03', 'C', 9),
    ],
    ('date', 'type', 'value')
)
I would like to convert it to have the columns equal to all unique "types" (A, B, and C).
Currently, the closest I have found to what I would like to achieve is this code:
df.groupby("date", "type").pivot("type").sum().orderBy("date").show()
+----------+----+----+----+----+
| date|type| A| B| C|
+----------+----+----+----+----+
|2017-01-01| C|null|null| 3|
|2017-01-01| A| 1|null|null|
|2017-01-01| B|null| 2|null|
|2017-01-02| B|null| 5|null|
|2017-01-02| C|null|null| 6|
|2017-01-02| A| 4|null|null|
|2017-01-03| A| 7|null|null|
|2017-01-03| C|null|null| 9|
|2017-01-03| B|null| 8|null|
+----------+----+----+----+----+
The issue is that I still have too many rows (containing all "null").
What I would like to get is:
+----------+---+---+---+
| date| A| B| C|
+----------+---+---+---+
|2017-01-01| 1| 2| 3|
|2017-01-02| 4| 5| 6|
|2017-01-03| 7| 8| 9|
+----------+---+---+---+
In other words, I would like something with functionality similar to pandas.DataFrame.unstack().
If anyone has any tips on how I can achieve this in PySpark, that would be great.
You need to do another group by on the "date" column, then select the max values from A, B, and C.
Example:
df.groupby("date", "type").pivot("type").sum().orderBy("date").groupBy("date").agg(max(col("A")).alias("A"),max(col("B")).
#+----------+---+---+---+
#|      date|  A|  B|  C|
#+----------+---+---+---+
#|2017-01-01|  1|  2|  3|
#|2017-01-02|  4|  5|  6|
#|2017-01-03|  7|  8|  9|
#+----------+---+---+---+
# Dynamic way: build the aggregation expressions from a list of columns
aggregate = ["A", "B", "C"]
funs = [max]
exprs = [f(col(c)).alias(c) for f in funs for c in aggregate]

df.groupby("date", "type").pivot("type").sum().orderBy("date") \
    .groupBy("date").agg(*exprs).show()
#+----------+---+---+---+
#|      date|  A|  B|  C|
#+----------+---+---+---+
#|2017-01-01|  1|  2|  3|
#|2017-01-02|  4|  5|  6|
#|2017-01-03|  7|  8|  9|
#+----------+---+---+---+
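As a side note (not part of the original answer), if you aggregate the value column explicitly you can skip the extra "type" grouping and the second group by altogether; grouping by "date" alone before the pivot should produce the desired table directly:

df.groupBy("date").pivot("type").sum("value").orderBy("date").show()
# one row per date, with one column per type (A, B, C)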
For example, if we have the following dataframe:
df = spark.createDataFrame(
    [['a', 1], ['a', 1],
     ['b', 1], ['b', 2],
     ['c', 2], ['c', 2], ['c', 2]],
    ['col1', 'col2'])
+----+----+
|col1|col2|
+----+----+
| a| 1|
| a| 1|
| b| 1|
| b| 2|
| c| 2|
| c| 2|
| c| 2|
+----+----+
I want to mark groups based on col1 where values in col2 repeat themselves. My idea is to take the difference between the group size and the count of distinct values:
from pyspark.sql import Window, functions as F

window = Window.partitionBy('col1')
df.withColumn('col3', F.count('col2').over(window)).\
    withColumn('col4', F.approx_count_distinct('col2').over(window)).\
    select('col1', 'col2', (F.col('col3') - F.col('col4')).alias('col3')).show()
Maybe you have a better solution. My expected output:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| a| 1| 1|
| a| 1| 1|
| b| 1| 0|
| b| 2| 0|
| c| 2| 2|
| c| 2| 2|
| c| 2| 2|
+----+----+----+
As you can see, all groups where col3 equals zero contain only unique values in col2.
For what you need, you can count the rows within each (col1, col2) group and subtract one:
df = df.withColumn('col3', F.expr('count(*) over (partition by col1,col2) - 1'))
df.show(truncate=False)
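The same thing can be written with the DataFrame window API instead of F.expr (an equivalent formulation, not from the original answer):

from pyspark.sql import Window, functions as F

# count the rows in each (col1, col2) group; repeated values yield a count > 1
w = Window.partitionBy('col1', 'col2')
df = df.withColumn('col3', F.count(F.lit(1)).over(w) - 1)
df.show(truncate=False)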
I need to get the value of the previous group in Spark and set it on the current group.
How can I achieve that?
I must order by count instead of TEXT_NUM; ordering by TEXT_NUM is not possible because events repeat over time, as counts 10 and 11 show.
I'm trying with the following code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

val spark = SparkSession.builder()
  .master("spark://spark-master:7077")
  .getOrCreate()
val df = spark
  .createDataFrame(
    Seq[(Int, String, Int)](
      (0, "", 0),
      (1, "", 0),
      (2, "A", 1),
      (3, "A", 1),
      (4, "A", 1),
      (5, "B", 2),
      (6, "B", 2),
      (7, "B", 2),
      (8, "C", 3),
      (9, "C", 3),
      (10, "A", 1),
      (11, "A", 1)
    ))
  .toDF("count", "TEXT", "TEXT_NUM")

val w1 = Window
  .orderBy("count")
  .rangeBetween(Window.unboundedPreceding, -1)

df
  .withColumn("LAST_VALUE", last("TEXT_NUM").over(w1))
  .orderBy("count")
  .show()
Result:
+-----+----+--------+----------+
|count|TEXT|TEXT_NUM|LAST_VALUE|
+-----+----+--------+----------+
| 0| | 0| null|
| 1| | 0| 0|
| 2| A| 1| 0|
| 3| A| 1| 1|
| 4| A| 1| 1|
| 5| B| 2| 1|
| 6| B| 2| 2|
| 7| B| 2| 2|
| 8| C| 3| 2|
| 9| C| 3| 3|
| 10| A| 1| 3|
| 11| A| 1| 1|
+-----+----+--------+----------+
Desired result:
+-----+----+--------+----------+
|count|TEXT|TEXT_NUM|LAST_VALUE|
+-----+----+--------+----------+
| 0| | 0| null|
| 1| | 0| null|
| 2| A| 1| 0|
| 3| A| 1| 0|
| 4| A| 1| 0|
| 5| B| 2| 1|
| 6| B| 2| 1|
| 7| B| 2| 1|
| 8| C| 3| 2|
| 9| C| 3| 2|
| 10| A| 1| 3|
| 11| A| 1| 3|
+-----+----+--------+----------+
Consider using the Window function last(columnName, ignoreNulls) to backfill nulls in a helper column that holds the previous "text_num" only at group boundaries, as shown below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, last, when}
import spark.implicits._

val df = Seq(
  (0, "", 0), (1, "", 0),
  (2, "A", 1), (3, "A", 1), (4, "A", 1),
  (5, "B", 2), (6, "B", 2), (7, "B", 2),
  (8, "C", 3), (9, "C", 3),
  (10, "A", 1), (11, "A", 1)
).toDF("count", "text", "text_num")

val w1 = Window.orderBy("count")
val w2 = w1.rowsBetween(Window.unboundedPreceding, 0)

df.
  withColumn("prev_num", lag("text_num", 1).over(w1)).
  withColumn("last_change", when($"text_num" =!= $"prev_num", $"prev_num")).
  withColumn("last_value", last("last_change", ignoreNulls=true).over(w2)).
  show
/*
+-----+----+--------+--------+-----------+----------+
|count|text|text_num|prev_num|last_change|last_value|
+-----+----+--------+--------+-----------+----------+
| 0| | 0| null| null| null|
| 1| | 0| 0| null| null|
| 2| A| 1| 0| 0| 0|
| 3| A| 1| 1| null| 0|
| 4| A| 1| 1| null| 0|
| 5| B| 2| 1| 1| 1|
| 6| B| 2| 2| null| 1|
| 7| B| 2| 2| null| 1|
| 8| C| 3| 2| 2| 2|
| 9| C| 3| 3| null| 2|
| 10| A| 1| 3| 3| 3|
| 11| A| 1| 1| null| 3|
+-----+----+--------+--------+-----------+----------+
*/
The intermediate columns are kept in the output for reference; just drop them if they aren't needed.
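For anyone working in PySpark, a sketch of the same lag/backfill approach (a translation of the answer above, not from the original post) would look like:

from pyspark.sql import Window, functions as F

df = spark.createDataFrame(
    [(0, "", 0), (1, "", 0),
     (2, "A", 1), (3, "A", 1), (4, "A", 1),
     (5, "B", 2), (6, "B", 2), (7, "B", 2),
     (8, "C", 3), (9, "C", 3),
     (10, "A", 1), (11, "A", 1)],
    ["count", "text", "text_num"])

w1 = Window.orderBy("count")
w2 = w1.rowsBetween(Window.unboundedPreceding, 0)

(df
    # previous row's text_num
    .withColumn("prev_num", F.lag("text_num", 1).over(w1))
    # keep it only where the group changes, null elsewhere
    .withColumn("last_change", F.when(F.col("text_num") != F.col("prev_num"), F.col("prev_num")))
    # carry the last non-null marker forward across the group
    .withColumn("last_value", F.last("last_change", ignorenulls=True).over(w2))
    .show())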
Let's say I have a dataframe
import org.apache.spark.sql.functions.{col, when}

val userData = spark.createDataFrame(Seq(
  (1, 0),
  (2, 2),
  (3, 3),
  (4, 0),
  (5, 3),
  (6, 4)
)).toDF("order_clause", "some_value")

userData.withColumn("passed", when(col("some_value") <= 1.5, 1))
  .show()
+------------+----------+------+
|order_clause|some_value|passed|
+------------+----------+------+
| 1| 0| 1|
| 2| 2| null|
| 3| 3| null|
| 4| 0| 1|
| 5| 3| null|
| 6| 4| null|
+------------+----------+------+
That dataframe is ordered by order_clause. Whenever a value in some_value drops below 1.5, I can say one round is done.
What I want to do is create a round column like this:
+------------+----------+------+-----+
|order_clause|some_value|passed|round|
+------------+----------+------+-----+
| 1| 0| 1| 1|
| 2| 2| null| 1|
| 3| 3| null| 1|
| 4| 0| 1| 2|
| 5| 3| null| 2|
| 6| 4| null| 2|
+------------+----------+------+-----+
Then I would be able to get subsets of rounds from this dataframe. I have searched for hints on how to do this but have not found a way.
You're probably looking for a rolling sum of the passed column. You can do it using a sum window function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum, when}

val result = userData.withColumn(
  "passed",
  when(col("some_value") <= 1.5, 1)
).withColumn(
  "round",
  sum("passed").over(Window.orderBy("order_clause"))
)
result.show
+------------+----------+------+-----+
|order_clause|some_value|passed|round|
+------------+----------+------+-----+
| 1| 0| 1| 1|
| 2| 2| null| 1|
| 3| 3| null| 1|
| 4| 0| 1| 2|
| 5| 3| null| 2|
| 6| 4| null| 2|
+------------+----------+------+-----+
Or more simply
import org.apache.spark.sql.expressions.Window

val result = userData.withColumn(
  "round",
  sum(when(col("some_value") <= 1.5, 1)).over(Window.orderBy("order_clause"))
)
Below is the DataFrame I have:
df = sqlContext.createDataFrame(
    [("0", "0"), ("1", "2"), ("2", "3"), ("3", "4"), ("4", "0"), ("5", "5"), ("6", "5")],
    ["id", "value"])
+---+-----+
| id|value|
+---+-----+
| 0| 0|
| 1| 2|
| 2| 3|
| 3| 4|
| 4| 0|
| 5| 5|
| 6| 5|
+---+-----+
And what I want to get is:
+---+-----+--------+-------+
| id|value|masterid|partsum|
+---+-----+--------+-------+
|  0|    0|       0|      0|
|  1|    2|       0|      2|
|  2|    3|       0|      5|
|  3|    4|       0|      9|
|  4|    0|       4|      0|
|  5|    5|       4|      5|
|  6|    5|       4|     10|
+---+-----+--------+-------+
So I tried to use Spark SQL to do it:
df = df.withColumn("masterid", F.when(df.value != 0, F.lag(df.id)).otherwise(df.id))
I originally thought the lag function could help me use the result of the previous row to get the masterid column. Unfortunately, after checking the manual, it can't help here.
So, I would like to ask if there are any special functions I could use to do what I want. Or is there any "conditional lag" function I could use, so that when I see a non-zero item, I can keep lagging until I find a zero?
IIUC, you can try defining a sub-group label (g in the below code) and two Window Specs:
from pyspark.sql import Window, functions as F
w1 = Window.orderBy('id')
w2 = Window.partitionBy('g').orderBy('id')
df.withColumn('g', F.sum(F.expr('if(value=0,1,0)')).over(w1)).select(
'id'
, 'value'
, F.first('id').over(w2).alias('masterid')
, F.sum('value').over(w2).alias('partsum')
).show()
#+---+-----+--------+-------+
#| id|value|masterid|partsum|
#+---+-----+--------+-------+
#| 0| 0| 0| 0.0|
#| 1| 2| 0| 2.0|
#| 2| 3| 0| 5.0|
#| 3| 4| 0| 9.0|
#| 4| 0| 4| 0.0|
#| 5| 5| 4| 5.0|
#| 6| 5| 4| 10.0|
#+---+-----+--------+-------+
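Note that partsum shows up as 0.0, 2.0, ... because value was created as a string column and sum implicitly casts it to double. If integer sums are preferred, one small (untested) tweak of the same code is to cast value before summing:

from pyspark.sql import Window, functions as F

w1 = Window.orderBy('id')
w2 = Window.partitionBy('g').orderBy('id')

df.withColumn('g', F.sum(F.expr('if(value=0,1,0)')).over(w1)).select(
    'id',
    'value',
    F.first('id').over(w2).alias('masterid'),
    # casting keeps the running sum integral (0, 2, 5, ...) instead of double
    F.sum(F.col('value').cast('int')).over(w2).alias('partsum')
).show()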
I have a data frame that looks something like this:
val df = sc.parallelize(Seq(
  (3, 1, "A"), (3, 2, "B"), (3, 3, "C"),
  (2, 1, "D"), (2, 2, "E"),
  (3, 1, "F"), (3, 2, "G"), (3, 3, "G"),
  (2, 1, "X"), (2, 2, "X")
)).toDF("TotalN", "N", "String")
+------+---+------+
|TotalN| N|String|
+------+---+------+
| 3| 1| A|
| 3| 2| B|
| 3| 3| C|
| 2| 1| D|
| 2| 2| E|
| 3| 1| F|
| 3| 2| G|
| 3| 3| G|
| 2| 1| X|
| 2| 2| X|
+------+---+------+
I need to aggregate the strings by concatenating them together, based on TotalN and the sequentially increasing ID (N). The problem is that there is no unique ID for each aggregation that I can group by. So I need to do something like "for each row, look at TotalN, loop through the next N rows and concatenate, then reset".
+------+------+
|TotalN|String|
+------+------+
| 3| ABC|
| 2| DE|
| 3| FGG|
| 2| XX|
+------+------+
Any pointers would be much appreciated.
I'm using Spark 2.3.1 and the Scala API.
Try this:
import org.apache.spark.sql.functions.collect_list
import spark.implicits._

val df = spark.sparkContext.parallelize(Seq(
  (3, 1, "A"), (3, 2, "B"), (3, 3, "C"),
  (2, 1, "D"), (2, 2, "E"),
  (3, 1, "F"), (3, 2, "G"), (3, 3, "G"),
  (2, 1, "X"), (2, 2, "X")
)).toDF("TotalN", "N", "String")

df.createOrReplaceTempView("data")

val sqlDF = spark.sql(
  """
    | SELECT TotalN d, N, String, ROW_NUMBER() over (order by TotalN) as rowNum
    | FROM data
  """.stripMargin)

sqlDF.withColumn("key", $"N" - $"rowNum")
  .groupBy("key").agg(collect_list('String).as("texts")).show()
The solution is to calculate a grouping variable using the row_number function, which can then be used in a later groupBy.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._

val w = Window.orderBy("TotalN")
df.withColumn("GeneratedID", $"N" - row_number().over(w)).show
+------+---+------+-----------+
|TotalN| N|String|GeneratedID|
+------+---+------+-----------+
| 2| 1| D| 0|
| 2| 2| E| 0|
| 2| 1| X| -2|
| 2| 2| X| -2|
| 3| 1| A| -4|
| 3| 2| B| -4|
| 3| 3| C| -4|
| 3| 1| F| -7|
| 3| 2| G| -7|
| 3| 3| G| -7|
+------+---+------+-----------+
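The snippet above stops at the GeneratedID column. As a rough sketch of the remaining aggregation step (shown in PySpark for brevity, and assuming Spark 2.4+ for the array functions, so not a drop-in for the Spark 2.3.1 / Scala setup in the question), you could collect the (N, String) pairs per group, sort them by N, and glue the strings together. Note that row_number over ties in TotalN is not deterministic, so the grouping may vary across runs:

from pyspark.sql import Window, functions as F

df = spark.createDataFrame(
    [(3, 1, "A"), (3, 2, "B"), (3, 3, "C"),
     (2, 1, "D"), (2, 2, "E"),
     (3, 1, "F"), (3, 2, "G"), (3, 3, "G"),
     (2, 1, "X"), (2, 2, "X")],
    ["TotalN", "N", "String"])

w = Window.orderBy("TotalN")

(df.withColumn("GeneratedID", F.col("N") - F.row_number().over(w))
   .groupBy("GeneratedID")
   .agg(
       F.first("TotalN").alias("TotalN"),
       # collect (N, String) pairs, sort by N, keep the strings, then concatenate them
       F.expr("array_join(transform(array_sort(collect_list(struct(N, String))), x -> x.String), '')").alias("String"))
   .select("TotalN", "String")
   .show())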