Concatenate PySpark Dataframe Column Names by Value and Sum - pyspark

I have an example dataframe:
df = spark.createDataFrame([
(1, 0, 1, 1, 1, 1, "something"),
(2, 0, 1, 1, 1, 0, "something"),
(3, 1, 0, 0, 0, 0, "something"),
(4, 0, 1, 0, 0, 0, "something"),
(5, 1, 0, 0, 0, 0, "something"),
(6, 0, 0, 0, 0, 0, "something")
], ["int" * 6, "string"]) \
.toDF("id", "a", "b", "c", "d", "e", "extra_column")
df.show()
+---+---+---+---+---+---+------------+
| id| a| b| c| d| e|extra_column|
+---+---+---+---+---+---+------------+
| 1| 0| 1| 1| 1| 1| something|
| 2| 0| 1| 1| 1| 0| something|
| 3| 1| 0| 0| 0| 0| something|
| 4| 0| 1| 0| 0| 0| something|
| 5| 1| 0| 0| 0| 0| something|
| 6| 0| 0| 0| 0| 0| something|
I want to concatenate across the columns per row and produce a key where the column = 1. I don't need to show this result but this is the intermediate step I need to solve:
df_row_concat = spark.createDataFrame([
(1, 0, 1, 1, 1, 1, "something", "bcde"),
(2, 0, 1, 1, 1, 0, "something", "bcd"),
(3, 1, 0, 0, 0, 0, "something", "a"),
(4, 0, 1, 0, 0, 0, "something", "b"),
(5, 1, 0, 0, 0, 0, "something", "a"),
(6, 0, 0, 0, 0, 0, "something", "")
], ["int" * 6, "string" * 2]) \
.toDF("id", "a", "b", "c", "d", "e", "extra_column", "key")
df_row_concat.show()
+---+---+---+---+---+---+------------+----+
| id| a| b| c| d| e|extra_column| key|
+---+---+---+---+---+---+------------+----+
| 1| 0| 1| 1| 1| 1| something|bcde|
| 2| 0| 1| 1| 1| 0| something| bcd|
| 3| 1| 0| 0| 0| 0| something| a|
| 4| 0| 1| 0| 0| 0| something| b|
| 5| 1| 0| 0| 0| 0| something| a|
| 6| 0| 0| 0| 0| 0| something| |
+---+---+---+---+---+---+------------+----+
This last part I can get on my own, but to complete the example, I want to sum the key values and output:
+----+-----+
| key|value|
+----+-----+
| a| 2|
| b| 1|
| bcd| 1|
|bcde| 1|
+----+-----+
My actual dataset is much longer and wider. I could hard-code every combination but there must be a more efficient way to loop over the list of columns to consider (e.g. column_list = ["a", "b", "c", "d", "e"]). Maybe not necessary, but I included the extra_column because there are additional columns in my dataset which won't be considered..

I don't see anything wrong with writing a for loop here
from pyspark.sql import functions as F
cols = ['a', 'b', 'c', 'd', 'e']
temp = (df.withColumn('key', F.concat(*[F.when(F.col(c) == 1, c).otherwise('') for c in cols])))
+---+---+---+---+---+---+------------+----+
| id| a| b| c| d| e|extra_column| key|
+---+---+---+---+---+---+------------+----+
| 1| 0| 1| 1| 1| 1| something|bcde|
| 2| 0| 1| 1| 1| 0| something| bcd|
| 3| 1| 0| 0| 0| 0| something| a|
| 4| 0| 1| 0| 0| 0| something| b|
| 5| 1| 0| 0| 0| 0| something| a|
| 6| 0| 0| 0| 0| 0| something| |
+---+---+---+---+---+---+------------+----+
(temp
.groupBy('key')
.agg(F.count('*').alias('value'))
.where(F.col('key') != '')
.show()
)
+----+-----+
| key|value|
+----+-----+
|bcde| 1|
| b| 1|
| a| 2|
| bcd| 1|
+----+-----+

Related

How to get value from previous group in spark?

I need to get value of previous group in spark and set it to the current group.
How can I achieve that?
I must order by count instead of TEXT_NUM.
Ordering by TEXT_NUM is not possible because events repeat in time, as count 10 and 11 shows.
I'm trying with the following code:
val spark = SparkSession.builder()
.master("spark://spark-master:7077")
.getOrCreate()
val df = spark
.createDataFrame(
Seq[(Int, String, Int)](
(0, "", 0),
(1, "", 0),
(2, "A", 1),
(3, "A", 1),
(4, "A", 1),
(5, "B", 2),
(6, "B", 2),
(7, "B", 2),
(8, "C", 3),
(9, "C", 3),
(10, "A", 1),
(11, "A", 1)
))
.toDF("count", "TEXT", "TEXT_NUM")
val w1 = Window
.orderBy("count")
.rangeBetween(Window.unboundedPreceding, -1)
df
.withColumn("LAST_VALUE", last("TEXT_NUM").over(w1))
.orderBy("count")
.show()
Result:
+-----+----+--------+----------+
|count|TEXT|TEXT_NUM|LAST_VALUE|
+-----+----+--------+----------+
| 0| | 0| null|
| 1| | 0| 0|
| 2| A| 1| 0|
| 3| A| 1| 1|
| 4| A| 1| 1|
| 5| B| 2| 1|
| 6| B| 2| 2|
| 7| B| 2| 2|
| 8| C| 3| 2|
| 9| C| 3| 3|
| 10| A| 1| 3|
| 11| A| 1| 1|
+-----+----+--------+----------+
Desired result:
+-----+----+--------+----------+
|count|TEXT|TEXT_NUM|LAST_VALUE|
+-----+----+--------+----------+
| 0| | 0| null|
| 1| | 0| null|
| 2| A| 1| 0|
| 3| A| 1| 0|
| 4| A| 1| 0|
| 5| B| 2| 1|
| 6| B| 2| 1|
| 7| B| 2| 1|
| 8| C| 3| 2|
| 9| C| 3| 2|
| 10| A| 1| 3|
| 11| A| 1| 3|
+-----+----+--------+----------+
Consider using Window function last(columnName, ignoreNulls) to backfill nulls in a column that consists of previous "text_num" at group boundaries, as shown below:
val df = Seq(
(0, "", 0), (1, "", 0),
(2, "A", 1), (3, "A", 1), (4, "A", 1),
(5, "B", 2), (6, "B", 2), (7, "B", 2),
(8, "C", 3), (9, "C", 3),
(10, "A", 1), (11, "A", 1)
).toDF("count", "text", "text_num")
import org.apache.spark.sql.expressions.Window
val w1 = Window.orderBy("count")
val w2 = w1.rowsBetween(Window.unboundedPreceding, 0)
df.
withColumn("prev_num", lag("text_num", 1).over(w1)).
withColumn("last_change", when($"text_num" =!= $"prev_num", $"prev_num")).
withColumn("last_value", last("last_change", ignoreNulls=true).over(w2)).
show
/*
+-----+----+--------+--------+-----------+----------+
|count|text|text_num|prev_num|last_change|last_value|
+-----+----+--------+--------+-----------+----------+
| 0| | 0| null| null| null|
| 1| | 0| 0| null| null|
| 2| A| 1| 0| 0| 0|
| 3| A| 1| 1| null| 0|
| 4| A| 1| 1| null| 0|
| 5| B| 2| 1| 1| 1|
| 6| B| 2| 2| null| 1|
| 7| B| 2| 2| null| 1|
| 8| C| 3| 2| 2| 2|
| 9| C| 3| 3| null| 2|
| 10| A| 1| 3| 3| 3|
| 11| A| 1| 1| null| 3|
+-----+----+--------+--------+-----------+----------+
*/
The intermediary columns are kept in the output for references. Just drop them if they aren't needed.

scala explode method Cartesian product multiple array

Trying to resolve some transformation within dataframes, any help is much appreciated.
Within scala (version 2.3.1) : I have a dataframe which has array of string and long.
+------+---------+----------+---------+---------+
|userId| varA| varB| varC| varD|
+------+---------+----------+---------+---------+
| 1|[A, B, C]| [0, 2, 5]|[1, 2, 9]|[0, 0, 0]|
| 2|[X, Y, Z]|[1, 20, 5]|[9, 0, 6]|[1, 1, 1]|
+------+---------+----------+---------+---------+
I would want my output to be like below dataframe.
+------+---+---+---+---+
|userId| A| B| C| D|
+------+---+---+---+---+
| 1| A| 0| 1| 0|
| 1| B| 2| 2| 0|
| 1| C| 5| 9| 0|
| 2| X| 1| 9| 1|
| 2| Y| 20| 0| 1|
| 2| Z| 5| 6| 1|
+------+---+---+---+---+
I tried doing this using explode, getting Cartesian product. Is there a way to keep the record count to 6 rows, instead of 18 rows.
scala> val data = sc.parallelize(Seq("""{"userId": 1,"varA": ["A", "B", "C"], "varB": [0, 2, 5], "varC": [1, 2, 9], "varD": [0, 0, 0]}""","""{"userId": 2,"varA": ["X", "Y", "Z"], "varB": [1, 20, 5], "varC": [9, 0, 6], "varD": [1, 1, 1]}"""))
scala> val df = spark.read.json(data)
scala> df.show()
+------+---------+----------+---------+---------+
|userId| varA| varB| varC| varD|
+------+---------+----------+---------+---------+
| 1|[A, B, C]| [0, 2, 5]|[1, 2, 9]|[0, 0, 0]|
| 2|[X, Y, Z]|[1, 20, 5]|[9, 0, 6]|[1, 1, 1]|
+------+---------+----------+---------+---------+
scala>
scala> df.printSchema
root
|-- userId: long (nullable = true)
|-- varA: array (nullable = true)
| |-- element: string (containsNull = true)
|-- varB: array (nullable = true)
| |-- element: long (containsNull = true)
|-- varC: array (nullable = true)
| |-- element: long (containsNull = true)
|-- varD: array (nullable = true)
| |-- element: long (containsNull = true)
scala>
scala> val zip_str = udf((x: Seq[String], y: Seq[Long]) => x.zip(y))
zip_str: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(StructType(StructField(_1,StringType,true), StructField(_2,LongType,false)),true),Some(List(ArrayType(StringType,true), ArrayType(LongType,false))))
scala> val zip_long = udf((x: Seq[Long], y: Seq[Long]) => x.zip(y))
zip_long: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(StructType(StructField(_1,LongType,false), StructField(_2,LongType,false)),true),Some(List(ArrayType(LongType,false), ArrayType(LongType,false))))
scala> df.withColumn("zip_1", explode(zip_str($"varA", $"varB"))).withColumn("zip_2", explode(zip_long($"varC", $"varD"))).select($"userId", $"zip_1._1".alias("A"),$"zip_1._2".alias("B"),$"zip_2._1".alias("C"),$"zip_2._2".alias("D")).show()
+------+---+---+---+---+
|userId| A| B| C| D|
+------+---+---+---+---+
| 1| A| 0| 1| 0|
| 1| A| 0| 2| 0|
| 1| A| 0| 9| 0|
| 1| B| 2| 1| 0|
| 1| B| 2| 2| 0|
| 1| B| 2| 9| 0|
| 1| C| 5| 1| 0|
| 1| C| 5| 2| 0|
| 1| C| 5| 9| 0|
| 2| X| 1| 9| 1|
| 2| X| 1| 0| 1|
| 2| X| 1| 6| 1|
| 2| Y| 20| 9| 1|
| 2| Y| 20| 0| 1|
| 2| Y| 20| 6| 1|
| 2| Z| 5| 9| 1|
| 2| Z| 5| 0| 1|
| 2| Z| 5| 6| 1|
+------+---+---+---+---+
scala>
Some reference used here
https://intellipaat.com/community/17050/explode-transpose-multiple-columns-in-spark-sql-table
Something down the line of combining posexplode and expr could work.
if we do the following:
df.select(
col("userId"),
posexplode("varA"),
col("varB"),
col("varC")
).withColumn(
"varB",
expr("varB[pos]")
).withColumn(
"varC",
expr("varC[pos]")
)
I am writing this from memory so I am not 100% sure. I will run a test later and update with Edit if I verify.
EDIT
Above expression works except one minor correct is needed. Updated expression -
df.select(col("userId"),posexplode(col("varA")),col("varB"),col("varC"), col("varD")).withColumn("varB",expr("varB[pos]")).withColumn("varC",expr("varC[pos]")).withColumn("varD",expr("varD[pos]")).show()
Ouput -
+------+---+---+----+----+----+
|userId|pos|col|varB|varC|varD|
+------+---+---+----+----+----+
| 1| 0| A| 0| 1| 0|
| 1| 1| B| 2| 2| 0|
| 1| 2| C| 5| 9| 0|
| 2| 0| X| 1| 9| 1|
| 2| 1| Y| 20| 0| 1|
| 2| 2| Z| 5| 6| 1|
+------+---+---+----+----+----+
You don't need udfs, it could be achieved using spark sql arrays_zip and then explode:
df.select('userId,explode(arrays_zip('varA,'varB,'varC,'varD)))
.select("userId","col.varA","col.varB","col.varC","col.varD")
.show
output:
+------+----+----+----+----+
|userId|varA|varB|varC|varD|
+------+----+----+----+----+
| 1| A| 0| 1| 0|
| 1| B| 2| 2| 0|
| 1| C| 5| 9| 0|
| 1| X| 1| 9| 1|
| 1| Y| 20| 0| 1|
| 1| Z| 5| 6| 1|
+------+----+----+----+----+

Creating a unique grouping key from column-wise runs in a Spark DataFrame

I have something analogous to this, where spark is my sparkContext. I've imported implicits._ in my sparkContext so I can use the $ syntax:
val df = spark.createDataFrame(Seq(("a", 0L), ("b", 1L), ("c", 1L), ("d", 1L), ("e", 0L), ("f", 1L)))
.toDF("id", "flag")
.withColumn("index", monotonically_increasing_id)
.withColumn("run_key", when($"flag" === 1, $"index").otherwise(0))
df.show
df: org.apache.spark.sql.DataFrame = [id: string, flag: bigint ... 2 more fields]
+---+----+-----+-------+
| id|flag|index|run_key|
+---+----+-----+-------+
| a| 0| 0| 0|
| b| 1| 1| 1|
| c| 1| 2| 2|
| d| 1| 3| 3|
| e| 0| 4| 0|
| f| 1| 5| 5|
+---+----+-----+-------+
I want to create another column with a unique grouping key for each nonzero chunk of run_key, something equivalent to this:
+---+----+-----+-------+---+
| id|flag|index|run_key|key|
+---+----+-----+-------+---|
| a| 0| 0| 0| 0|
| b| 1| 1| 1| 1|
| c| 1| 2| 2| 1|
| d| 1| 3| 3| 1|
| e| 0| 4| 0| 0|
| f| 1| 5| 5| 2|
+---+----+-----+-------+---+
It could be the first value in each run, average of each run, or some other value -- it doesn't really matter as long as it's guaranteed to be unique so that I can group on it afterward to compare other values between groups.
Edit: BTW, I don't need to retain the rows where flag is 0.
One approach would be to 1) create a column $"lag1" using Window function lag() from $"flag", 2) create another column $"switched" with $"index" value in rows where $"flag" is switched, and finally 3) create the column which copies $"switched" from the last non-null row via last() and rowsBetween().
Note that this solution uses Window function without partitioning hence may not work for large dataset.
val df = Seq(
("a", 0L), ("b", 1L), ("c", 1L), ("d", 1L), ("e", 0L), ("f", 1L),
("g", 1L), ("h", 0L), ("i", 0L), ("j", 1L), ("k", 1L), ("l", 1L)
).toDF("id", "flag").
withColumn("index", monotonically_increasing_id).
withColumn("run_key", when($"flag" === 1, $"index").otherwise(0))
import org.apache.spark.sql.expressions.Window
df.withColumn( "lag1", lag("flag", 1, -1).over(Window.orderBy("index")) ).
withColumn( "switched", when($"flag" =!= $"lag1", $"index") ).
withColumn( "key", last("switched", ignoreNulls = true).over(
Window.orderBy("index").rowsBetween(Window.unboundedPreceding, 0)
) )
// +---+----+-----+-------+----+--------+---+
// | id|flag|index|run_key|lag1|switched|key|
// +---+----+-----+-------+----+--------+---+
// | a| 0| 0| 0| -1| 0| 0|
// | b| 1| 1| 1| 0| 1| 1|
// | c| 1| 2| 2| 1| null| 1|
// | d| 1| 3| 3| 1| null| 1|
// | e| 0| 4| 0| 1| 4| 4|
// | f| 1| 5| 5| 0| 5| 5|
// | g| 1| 6| 6| 1| null| 5|
// | h| 0| 7| 0| 1| 7| 7|
// | i| 0| 8| 0| 0| null| 7|
// | j| 1| 9| 9| 0| 9| 9|
// | k| 1| 10| 10| 1| null| 9|
// | l| 1| 11| 11| 1| null| 9|
// +---+----+-----+-------+----+--------+---+
You can label the "run" with the largest index where flag is 0 smaller than the index of the row in question.
Something like:
flags = df.filter($"flag" === 0)
.select("index")
.withColumnRenamed("index", "flagIndex")
indices = df.select("index").join(flags, df.index > flags.flagIndex)
.groupBy($"index")
.agg(max($"index$).as("groupKey"))
dfWithGroups = df.join(indices, Seq("index"))

Calculate links between nodes using Spark

I have the following two DataFrames in Spark 2.2 and Scala 2.11. The DataFrame edges defines the edges of a directed graph, while the DataFrame types defines the type of each node.
edges =
+-----+-----+----+
|from |to |attr|
+-----+-----+----+
| 1| 0| 1|
| 1| 4| 1|
| 2| 2| 1|
| 4| 3| 1|
| 4| 5| 1|
+-----+-----+----+
types =
+------+---------+
|nodeId|type |
+------+---------+
| 0| 0|
| 1| 0|
| 2| 2|
| 3| 4|
| 4| 4|
| 5| 4|
+------+---------+
For each node, I want to know the number of edges to the nodes of the same type. Please notice that I only want to count the edges outgoing from a node, since I deal with the directed graph.
In order to reach this objective, I performed the joining of both DataFrames:
val graphDF = edges
.join(types, types("nodeId") === edges("from"), "left")
.drop("nodeId")
.withColumnRenamed("type","type_from")
.join(types, types("nodeId") === edges("to"), "left")
.drop("nodeId")
.withColumnRenamed("type","type_to")
I obtained the following new DataFrame graphDF:
+-----+-----+----+---------------+---------------+
|from |to |attr|type_from |type_to |
+-----+-----+----+---------------+---------------+
| 1| 0| 1| 0| 0|
| 1| 4| 1| 0| 4|
| 2| 2| 1| 2| 2|
| 4| 3| 1| 4| 4|
| 4| 5| 1| 4| 4|
+-----+-----+----+---------------+---------------+
Now I need to get the following final result:
+------+---------+---------+
|nodeId|numLinks |type |
+------+---------+---------+
| 0| 0| 0|
| 1| 1| 0|
| 2| 0| 2|
| 3| 0| 4|
| 4| 2| 4|
| 5| 0| 4|
+------+---------+---------+
I was thinking about using groupBy and agg(count(...), but I do not know how to deal with directed edges.
Update:
numLinks is calculated as the number of edges outgoing from a given node. For example, the node 5 does not have any outgoing edges (only ingoing edge 4->5, see the DataFrame edges). The same refers to the node 0. But the node 4 has two outgoing edges (4->3 and 4->5).
My solution:
This is my solution, but it lacks those nodes that have 0 links.
graphDF.filter("from != to").filter("type_from == type_to").groupBy("from").agg(count("from") as "numLinks").show()
You can filter, aggregate by id and type and add missing nodes using types:
val graphDF = Seq(
(1, 0, 1, 0, 0), (1, 4, 1, 0, 4), (2, 2, 1, 2, 2),
(4, 3, 1, 4, 4), (4, 5, 1, 4, 4)
).toDF("from", "to", "attr", "type_from", "type_to")
val types = Seq(
(0, 0), (1, 0), (2, 2), (3, 4), (4,4), (5, 4)
).toDF("nodeId", "type")
graphDF
// I want to know the number of edges to the nodes of the same type
.where($"type_from" === $"type_to" && $"from" =!= $"to")
// I only want to count the edges outgoing from a node,
.groupBy($"from" as "nodeId", $"type_from" as "type")
.agg(count("*") as "numLinks")
// but it lacks those nodes that have 0 links.
.join(types, Seq("nodeId", "type"), "rightouter")
.na.fill(0)
// +------+----+--------+
// |nodeId|type|numLinks|
// +------+----+--------+
// | 0| 0| 0|
// | 1| 0| 1|
// | 2| 2| 1|
// | 3| 4| 0|
// | 4| 4| 2|
// | 5| 4| 0|
// +------+----+--------+
To skip self-links add $"from" =!= $"to" to the selection:
graphDF
.where($"type_from" === $"type_to" && $"from" =!= $"to")
.groupBy($"from" as "nodeId", $"type_from" as "type")
.agg(count("*") as "numLinks")
.join(types, Seq("nodeId", "type"), "rightouter")
.na.fill(0)
// +------+----+--------+
// |nodeId|type|numLinks|
// +------+----+--------+
// | 0| 0| 0|
// | 1| 0| 1|
// | 2| 2| 0|
// | 3| 4| 0|
// | 4| 4| 2|
// | 5| 4| 0|
// +------+----+--------+

E-num / get Dummies in pyspark

I would like to create a function in PYSPARK that get Dataframe and list of parameters (codes/categorical features) and return the data frame with additional dummy columns like the categories of the features in the list
PFA the Before and After DF:
before and After data frame- Example
The code in python looks like that:
enum = ['column1','column2']
for e in enum:
print e
temp = pd.get_dummies(data[e],drop_first=True,prefix=e)
data = pd.concat([data,temp], axis=1)
data.drop(e,axis=1,inplace=True)
data.to_csv('enum_data.csv')
First you need to collect distinct values of TYPES and CODE. Then either select add column with name of each value using withColumn or use select fro each column.
Here is sample code using select statement:-
import pyspark.sql.functions as F
df = sqlContext.createDataFrame([
(1, "A", "X1"),
(2, "B", "X2"),
(3, "B", "X3"),
(1, "B", "X3"),
(2, "C", "X2"),
(3, "C", "X2"),
(1, "C", "X1"),
(1, "B", "X1"),
], ["ID", "TYPE", "CODE"])
types = df.select("TYPE").distinct().rdd.flatMap(lambda x: x).collect()
codes = df.select("CODE").distinct().rdd.flatMap(lambda x: x).collect()
types_expr = [F.when(F.col("TYPE") == ty, 1).otherwise(0).alias("e_TYPE_" + ty) for ty in types]
codes_expr = [F.when(F.col("CODE") == code, 1).otherwise(0).alias("e_CODE_" + code) for code in codes]
df = df.select("ID", "TYPE", "CODE", *types_expr+codes_expr)
df.show()
OUTPUT
+---+----+----+--------+--------+--------+---------+---------+---------+
| ID|TYPE|CODE|e_TYPE_A|e_TYPE_B|e_TYPE_C|e_CODE_X1|e_CODE_X2|e_CODE_X3|
+---+----+----+--------+--------+--------+---------+---------+---------+
| 1| A| X1| 1| 0| 0| 1| 0| 0|
| 2| B| X2| 0| 1| 0| 0| 1| 0|
| 3| B| X3| 0| 1| 0| 0| 0| 1|
| 1| B| X3| 0| 1| 0| 0| 0| 1|
| 2| C| X2| 0| 0| 1| 0| 1| 0|
| 3| C| X2| 0| 0| 1| 0| 1| 0|
| 1| C| X1| 0| 0| 1| 1| 0| 0|
| 1| B| X1| 0| 1| 0| 1| 0| 0|
+---+----+----+--------+--------+--------+---------+---------+---------+
The solutions provided by Freek Wiemkeijer and Rakesh Kumar are perfectly adequate, however, since I coded it up, I thought it was worth posting this generic solution as it doesn't require hard coding of the column names.
pivot_cols = ['TYPE','CODE']
keys = ['ID','TYPE','CODE']
before = sc.parallelize([(1,'A','X1'),
(2,'B','X2'),
(3,'B','X3'),
(1,'B','X3'),
(2,'C','X2'),
(3,'C','X2'),
(1,'C','X1'),
(1,'B','X1')]).toDF(['ID','TYPE','CODE'])
#Helper function to recursively join a list of dataframes
#Can be simplified if you only need two columns
def join_all(dfs,keys):
if len(dfs) > 1:
return dfs[0].join(join_all(dfs[1:],keys), on = keys, how = 'inner')
else:
return dfs[0]
dfs = []
combined = []
for pivot_col in pivot_cols:
pivotDF = before.groupBy(keys).pivot(pivot_col).count()
new_names = pivotDF.columns[:len(keys)] + ["e_{0}_{1}".format(pivot_col, c) for c in pivotDF.columns[len(keys):]]
df = pivotDF.toDF(*new_names).fillna(0)
combined.append(df)
join_all(combined,keys).show()
This gives as output:
+---+----+----+--------+--------+--------+---------+---------+---------+
| ID|TYPE|CODE|e_TYPE_A|e_TYPE_B|e_TYPE_C|e_CODE_X1|e_CODE_X2|e_CODE_X3|
+---+----+----+--------+--------+--------+---------+---------+---------+
| 1| A| X1| 1| 0| 0| 1| 0| 0|
| 2| C| X2| 0| 0| 1| 0| 1| 0|
| 3| B| X3| 0| 1| 0| 0| 0| 1|
| 2| B| X2| 0| 1| 0| 0| 1| 0|
| 3| C| X2| 0| 0| 1| 0| 1| 0|
| 1| B| X3| 0| 1| 0| 0| 0| 1|
| 1| B| X1| 0| 1| 0| 1| 0| 0|
| 1| C| X1| 0| 0| 1| 1| 0| 0|
+---+----+----+--------+--------+--------+---------+---------+---------+
I was looking for the same solution but is scala, maybe this will help someone:
val list = df.select("category").distinct().rdd.map(r => r(0)).collect()
val oneHotDf = list.foldLeft(df)((df, category) => finalDf.withColumn("category_" + category, when(col("category") === category, 1).otherwise(0)))
If you'd like to get the PySpark version of pandas "pd.get_dummies" function, you can you the following function:
import itertools
def spark_get_dummies(df):
categories = []
for i, values in enumerate(df.columns):
categories.append(df.select(values).distinct().rdd.flatMap(lambda x: x).collect())
expressions = []
for i, values in enumerate(df.columns):
expressions.append([F.when(F.col(values) == i, 1).otherwise(0).alias(str(values) + "_" + str(i)) for i in categories[i]])
expressions_flat = list(itertools.chain.from_iterable(expressions))
df_final = df.select(*expressions_flat)
return df_final
The reproducible example is:
df = sqlContext.createDataFrame([
("A", "X1"),
("B", "X2"),
("B", "X3"),
("B", "X3"),
("C", "X2"),
("C", "X2"),
("C", "X1"),
("B", "X1"),
], ["TYPE", "CODE"])
dummies_df = spark_get_dummies(df)
dummies_df.show()
You will get:
The first step is to make a DataFrame from your CSV file.
See Get CSV to Spark dataframe ; the first answer gives a line by line example.
Then you can add the columns. Assume you have a DataFrame object called df, and the columns are: [ID, TYPE, CODE].
The rest van be fixed with DataFrame.withColumn() and pyspark.sql.functions.when:
from pyspark.sql.functions import when
df_with_extra_columns = df.withColumn("e_TYPE_A", when(df.TYPE == "A", 1).otherwise(0).withColumn("e_TYPE_B", when(df.TYPE == "B", 1).otherwise(0)
(this adds the first two columns. you get the point.)