How do I transform a Spark dataframe so that my values become column names? [duplicate] - scala

This question already has answers here:
How to pivot Spark DataFrame?
(10 answers)
Closed 4 years ago.
I'm not sure of a good way to phrase the question, but an example will help. Here is the dataframe that I have with the columns: name, type, and count:
+------+------+-------+
| Name | Type | Count |
+------+------+-------+
| a | 0 | 5 |
| a | 1 | 4 |
| a | 5 | 5 |
| a | 4 | 5 |
| a | 2 | 1 |
| b | 0 | 2 |
| b | 1 | 4 |
| b | 3 | 5 |
| b | 4 | 5 |
| b | 2 | 1 |
| c | 0 | 5 |
| c | ... | ... |
+------+------+-------+
I want to get a new dataframe structured like this where the Type column values have become new columns:
+------+---+-----+---+---+---+---+
| Name | 0 | 1 | 2 | 3 | 4 | 5 | <- Number columns are types from input
+------+---+-----+---+---+---+---+
| a | 5 | 4 | 1 | 0 | 5 | 5 |
| b | 2 | 4 | 1 | 5 | 5 | 0 |
| c | 5 | ... | | | | |
+------+---+-----+---+---+---+---+
The columns here are [Name,0,1,2,3,4,5].

Do this by using the pivot function in Spark.
val df2 = df.groupBy("Name").pivot("Type").sum("Count")
Here, if the name and the type is the same for two rows, the count values are simply added together, but other aggregations are possible as well.
Resulting dataframe when using the example data in the question:
+----+---+----+----+----+----+----+
|Name| 0| 1| 2| 3| 4| 5|
+----+---+----+----+----+----+----+
| c| 5|null|null|null|null|null|
| b| 2| 4| 1| 5| 5|null|
| a| 5| 4| 1|null| 5| 5|
+----+---+----+----+----+----+----+

Related

Mean with differents columns ignoring Null values, Spark Scala

I have a dataframe with different columns, what I am trying to do is the mean of this diff columns ignoring null values. For example:
+--------+-------+---------+-------+
| Baller | Power | Vision | KXD |
+--------+-------+---------+-------+
| John | 5 | null | 10 |
| Bilbo | 5 | 3 | 2 |
+--------+-------+---------+-------+
The output have to be:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | 7.5 |
| Bilbo | 5 | 3 | 2 | 3,33 |
+--------+-------+---------+-------+-----------+
What I am doing:
val a_cols = Array(col("Power"), col("Vision"), col("KXD"))
val avgFunc = a_cols.foldLeft(lit(0)){(x, y) => x+y}/a_cols.length
val avg_calc = df.withColumn("MEAN", avgFunc)
But I get the null values:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | null |
| Bilbo | 5 | 3 | 2 | 3,33 |
+--------+-------+---------+-------+-----------+
You can explode the columns and do a group by + mean, then join back to the original dataframe using the Baller column:
val result = df.join(
df.select(
col("Baller"),
explode(array(col("Power"), col("Vision"), col("KXD")))
).groupBy("Baller").agg(mean("col").as("MEAN")),
Seq("Baller")
)
result.show
+------+-----+------+---+------------------+
|Baller|Power|Vision|KXD| MEAN|
+------+-----+------+---+------------------+
| John| 5| null| 10| 7.5|
| Bilbo| 5| 3| 2|3.3333333333333335|
+------+-----+------+---+------------------+

subtract two columns with null in spark dataframe

I new to spark, I have dataframe df:
+----------+------------+-----------+
| Column1 | Column2 | Sub |
+----------+------------+-----------+
| 1 | 2 | 1 |
+----------+------------+-----------+
| 4 | null | null |
+----------+------------+-----------+
| 5 | null | null |
+----------+------------+-----------+
| 6 | 8 | 2 |
+----------+------------+-----------+
when subtracting two columns, one column has null so resulting column also resulting as null.
df.withColumn("Sub", col(A)-col(B))
Expected output should be:
+----------+------------+-----------+
| Column1 | Column2 | Sub |
+----------+------------+-----------+
| 1 | 2 | 1 |
+----------+------------+-----------+
| 4 | null | 4 |
+----------+------------+-----------+
| 5 | null | 5 |
+----------+------------+-----------+
| 6 | 8 | 2 |
+----------+------------+-----------+
I don't want to replace the column2 to replace with 0, it should be null only.
Can someone help me on this?
You can use when function as
import org.apache.spark.sql.functions._
df.withColumn("Sub", when(col("Column1").isNull(), lit(0)).otherwise(col("Column1")) - when(col("Column2").isNull(), lit(0)).otherwise(col("Column2")))
you should have final result as
+-------+-------+----+
|Column1|Column2| Sub|
+-------+-------+----+
| 1| 2|-1.0|
| 4| null| 4.0|
| 5| null| 5.0|
| 6| 8|-2.0|
+-------+-------+----+
You can coalesce nulls to zero on both columns and then do the subtraction:
val df = Seq((Some(1), Some(2)),
(Some(4), null),
(Some(5), null),
(Some(6), Some(8))
).toDF("A", "B")
df.withColumn("Sub", abs(coalesce($"A", lit(0)) - coalesce($"B", lit(0)))).show
+---+----+---+
| A| B|Sub|
+---+----+---+
| 1| 2| 1|
| 4|null| 4|
| 5|null| 5|
| 6| 8| 2|
+---+----+---+

reorder column values pyspark

When I perform a Select operation on a DataFrame in PySpark it reduces to the following:
+-----+--------+-------+
| val | Feat1 | Feat2 |
+-----+--------+-------+
| 1 | f1a | f2a |
| 2 | f1a | f2b |
| 8 | f1b | f2f |
| 9 | f1a | f2d |
| 4 | f1b | f2c |
| 6 | f1b | f2a |
| 1 | f1c | f2c |
| 3 | f1c | f2g |
| 9 | f1c | f2e |
+-----+--------+-------+
I require the val column to be ordered group wise based on another field Feat1 like the following:
+-----+--------+-------+
| val | Feat1 | Feat2 |
+-----+--------+-------+
| 1 | f1a | f2a |
| 2 | f1a | f2b |
| 3 | f1a | f2d |
| 1 | f1b | f2c |
| 2 | f1b | f2a |
| 3 | f1b | f2f |
| 1 | f1c | f2c |
| 2 | f1c | f2g |
| 3 | f1c | f2e |
+-----+--------+-------+
NOTE that the val values don't depend on the order of Feat2 but are instead ordered based on their original val values.
Is there a command to reorder the column value in PySpark as required.
NOTE: Question exists for the same but is specific to SQL-lite.
data = [(1, 'f1a', 'f2a'),
(2, 'f1a', 'f2b'),
(8, 'f1b', 'f2f'),
(9, 'f1a', 'f2d'),
(4, 'f1b', 'f2c'),
(6, 'f1b', 'f2a'),
(1, 'f1c', 'f2c'),
(3, 'f1c', 'f2g'),
(9, 'f1c', 'f2e')]
table = sqlContext.createDataFrame(data, ['val', 'Feat1', 'Feat2'])
Edit: For this purpose, you can use window with rank function:
from pyspark.sql import Window
from pyspark.sql.functions import rank
w = Window.partitionBy('Feat1').orderBy('val')
table.withColumn('val', rank().over(w)).orderBy('Feat1').show()
+---+-----+-----+
|val|Feat1|Feat2|
+---+-----+-----+
| 1| f1a| f2a|
| 2| f1a| f2b|
| 3| f1a| f2d|
| 1| f1b| f2c|
| 2| f1b| f2a|
| 3| f1b| f2f|
| 1| f1c| f2c|
| 2| f1c| f2g|
| 3| f1c| f2e|
+---+-----+-----+

Spark groupby filter sorting with top 3 read articles each city

I have a table data like following :
+-----------+--------+-------------+
| City Name | URL | Read Count |
+-----------+--------+-------------+
| Gurgaon | URL1 | 3 |
| Gurgaon | URL3 | 6 |
| Gurgaon | URL6 | 5 |
| Gurgaon | URL4 | 1 |
| Gurgaon | URL5 | 5 |
| Delhi | URL3 | 4 |
| Delhi | URL7 | 2 |
| Delhi | URL5 | 1 |
| Delhi | URL6 | 6 |
| Punjab | URL6 | 5 |
| Punjab | URL4 | 1 |
| Mumbai | URL5 | 5 |
+-----------+--------+-------------+
I would like to see somthing like -> Top 3 Read article(if exists) each city
+-----------+--------+--------+
| City Name | URL | Count |
+-----------+--------+--------+
| Gurgaon | URL3 | 6 |
| Gurgaon | URL6 | 5 |
| Gurgaon | URL5 | 5 |
| Delhi | URL6 | 6 |
| Delhi | URL3 | 4 |
| Delhi | URL1 | 3 |
| Punjab | URL6 | 5 |
| Punjab | URL4 | 1 |
| Mumbai | URL5 | 5 |
+-----------+--------+--------+
I am working on Spark 2.0.2, Scala 2.11.8
You can use window function to get the output.
import org.apache.spark.sql.expressions.Window
val df = sc.parallelize(Seq(
("Gurgaon","URL1",3), ("Gurgaon","URL3",6), ("Gurgaon","URL6",5), ("Gurgaon","URL4",1),("Gurgaon","URL5",5)
("DELHI","URL3",4), ("DELHI","URL7",2), ("DELHI","URL5",1), ("DELHI","URL6",6),("Mumbai","URL5",5)
("Punjab","URL6",6), ("Punjab","URL4",1))).toDF("City", "URL", "Count")
df.show()
+-------+----+-----+
| City| URL|Count|
+-------+----+-----+
|Gurgaon|URL1| 3|
|Gurgaon|URL3| 6|
|Gurgaon|URL6| 5|
|Gurgaon|URL4| 1|
|Gurgaon|URL5| 5|
| DELHI|URL3| 4|
| DELHI|URL7| 2|
| DELHI|URL5| 1|
| DELHI|URL6| 6|
| Mumbai|URL5| 5|
| Punjab|URL6| 6|
| Punjab|URL4| 1|
+-------+----+-----+
val w = Window.partitionBy($"City").orderBy($"Count".desc)
val dfTop = df.withColumn("row", rowNumber.over(w)).where($"row" <= 3).drop("row")
dfTop.show
+-------+----+-----+
| City| URL|Count|
+-------+----+-----+
|Gurgaon|URL3| 6|
|Gurgaon|URL6| 5|
|Gurgaon|URL5| 5|
| Mumbai|URL5| 5|
| DELHI|URL6| 6|
| DELHI|URL3| 4|
| DELHI|URL7| 2|
| Punjab|URL6| 6|
| Punjab|URL4| 1|
+-------+----+-----+
Output tested on Spark 1.6.2
Window functions are probably the way to go, and there is a built-in function for this purpose:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, desc}
val window = Window.partitionBy($"City").orderBy(desc("Count"))
val dfTop = df.withColumn("rank", rank.over(window)).where($"rank" <= 3)

Design a saturated sum with window functions

I have, in a pyspark dataframe, a column with values 1, -1 and 0, representing the events "startup", "shutdown" and "other" of an engine. I want to build a column with the state of the engine, with 1 when the engine is on and 0 when it's off, something like:
+---+-----+-----+
|seq|event|state|
+---+-----+-----+
| 1 | 1 | 1 |
| 2 | 0 | 1 |
| 3 | -1 | 0 |
| 4 | 0 | 0 |
| 5 | 0 | 0 |
| 6 | 1 | 1 |
| 7 | -1 | 0 |
+---+-----+-----+
If 1s and -1s are always alternated, this can easily be done with a window function such as
sum('event').over(Window.orderBy('seq'))
It can happen, though, that I have some spurious 1s or -1s, which I want to ignore if I am already in state 1 or 0, respectively. I want thus to be able to do something like:
+---+-----+-----+
|seq|event|state|
+---+-----+-----+
| 1 | 1 | 1 |
| 2 | 0 | 1 |
| 3 | 1 | 1 |
| 4 | -1 | 0 |
| 5 | 0 | 0 |
| 6 | -1 | 0 |
| 7 | 1 | 1 |
+---+-----+-----+
That would require a "saturated" sum function that never goes above 1 or below 0, or some other approach which I am not able to imagine at the moment.
Does somebody have any ideas?
You can achieve the desired result using the last function to fill-forwards the most recent state-change.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df = (spark.createDataFrame([
(1, 1),
(2, 0),
(3, -1),
(4, 0)
], ["seq", "event"]))
w = Window.orderBy('seq')
#replace zeros with nulls so they can be ignored easily.
df = df.withColumn('helperCol',F.when(df.event != 0,df.event))
#fill statechanges forward in a new column.
df = df.withColumn('state',F.last(df.helperCol,ignorenulls=True).over(w))
#replace -1 values with 0
df = df.replace(-1,0,['state'])
df.show()
This produces:
+---+-----+---------+-----+
|seq|event|helperCol|state|
+---+-----+---------+-----+
| 1| 1| 1| 1|
| 2| 0| null| 1|
| 3| -1| -1| 0|
| 4| 0| null| 0|
+---+-----+---------+-----+
helperCol need not be added to the dataframe, I have only included it to make the process more readable.