I would like to create a function in PySpark that takes a DataFrame and a list of parameters (codes/categorical features) and returns the DataFrame with additional dummy columns, one for each category of the features in the list.
Please find attached the before and after DataFrames:
Before and after data frame example
The code in Python looks like this:
import pandas as pd

enum = ['column1', 'column2']
for e in enum:
    print(e)
    temp = pd.get_dummies(data[e], drop_first=True, prefix=e)
    data = pd.concat([data, temp], axis=1)
    data.drop(e, axis=1, inplace=True)
data.to_csv('enum_data.csv')
First you need to collect the distinct values of TYPE and CODE. Then you can either add a column named after each value using withColumn, or build a select expression for each value.
Here is sample code using select expressions:
import pyspark.sql.functions as F
df = sqlContext.createDataFrame([
(1, "A", "X1"),
(2, "B", "X2"),
(3, "B", "X3"),
(1, "B", "X3"),
(2, "C", "X2"),
(3, "C", "X2"),
(1, "C", "X1"),
(1, "B", "X1"),
], ["ID", "TYPE", "CODE"])
types = df.select("TYPE").distinct().rdd.flatMap(lambda x: x).collect()
codes = df.select("CODE").distinct().rdd.flatMap(lambda x: x).collect()
types_expr = [F.when(F.col("TYPE") == ty, 1).otherwise(0).alias("e_TYPE_" + ty) for ty in types]
codes_expr = [F.when(F.col("CODE") == code, 1).otherwise(0).alias("e_CODE_" + code) for code in codes]
df = df.select("ID", "TYPE", "CODE", *types_expr+codes_expr)
df.show()
OUTPUT
+---+----+----+--------+--------+--------+---------+---------+---------+
| ID|TYPE|CODE|e_TYPE_A|e_TYPE_B|e_TYPE_C|e_CODE_X1|e_CODE_X2|e_CODE_X3|
+---+----+----+--------+--------+--------+---------+---------+---------+
| 1| A| X1| 1| 0| 0| 1| 0| 0|
| 2| B| X2| 0| 1| 0| 0| 1| 0|
| 3| B| X3| 0| 1| 0| 0| 0| 1|
| 1| B| X3| 0| 1| 0| 0| 0| 1|
| 2| C| X2| 0| 0| 1| 0| 1| 0|
| 3| C| X2| 0| 0| 1| 0| 1| 0|
| 1| C| X1| 0| 0| 1| 1| 0| 0|
| 1| B| X1| 0| 1| 0| 1| 0| 0|
+---+----+----+--------+--------+--------+---------+---------+---------+
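If you want to wrap this up as the reusable function asked for in the question, a minimal sketch could look like the following (assuming the distinct values of each listed column are few enough to collect to the driver; the function name and prefix argument are just illustrative, and df means the input DataFrame before the select above):
import pyspark.sql.functions as F

def add_dummy_columns(df, columns, prefix="e"):
    # Append one 0/1 indicator column per distinct value of each listed column.
    exprs = []
    for column in columns:
        values = [row[0] for row in df.select(column).distinct().collect()]
        exprs += [F.when(F.col(column) == v, 1).otherwise(0).alias("{}_{}_{}".format(prefix, column, v))
                  for v in values]
    return df.select("*", *exprs)

dummies_df = add_dummy_columns(df, ["TYPE", "CODE"])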
The solutions provided by Freek Wiemkeijer and Rakesh Kumar are perfectly adequate; however, since I coded it up anyway, I thought it was worth posting this generic solution, as it doesn't require hard-coding the column names.
pivot_cols = ['TYPE','CODE']
keys = ['ID','TYPE','CODE']
before = sc.parallelize([(1,'A','X1'),
(2,'B','X2'),
(3,'B','X3'),
(1,'B','X3'),
(2,'C','X2'),
(3,'C','X2'),
(1,'C','X1'),
(1,'B','X1')]).toDF(['ID','TYPE','CODE'])
#Helper function to recursively join a list of dataframes
#Can be simplified if you only need two columns
def join_all(dfs, keys):
    if len(dfs) > 1:
        return dfs[0].join(join_all(dfs[1:], keys), on=keys, how='inner')
    else:
        return dfs[0]

combined = []
for pivot_col in pivot_cols:
    pivotDF = before.groupBy(keys).pivot(pivot_col).count()
    new_names = pivotDF.columns[:len(keys)] + ["e_{0}_{1}".format(pivot_col, c) for c in pivotDF.columns[len(keys):]]
    df = pivotDF.toDF(*new_names).fillna(0)
    combined.append(df)

join_all(combined, keys).show()
This gives as output:
+---+----+----+--------+--------+--------+---------+---------+---------+
| ID|TYPE|CODE|e_TYPE_A|e_TYPE_B|e_TYPE_C|e_CODE_X1|e_CODE_X2|e_CODE_X3|
+---+----+----+--------+--------+--------+---------+---------+---------+
| 1| A| X1| 1| 0| 0| 1| 0| 0|
| 2| C| X2| 0| 0| 1| 0| 1| 0|
| 3| B| X3| 0| 1| 0| 0| 0| 1|
| 2| B| X2| 0| 1| 0| 0| 1| 0|
| 3| C| X2| 0| 0| 1| 0| 1| 0|
| 1| B| X3| 0| 1| 0| 0| 0| 1|
| 1| B| X1| 0| 1| 0| 1| 0| 0|
| 1| C| X1| 0| 0| 1| 1| 0| 0|
+---+----+----+--------+--------+--------+---------+---------+---------+
I was looking for the same solution but in Scala; maybe this will help someone:
val list = df.select("category").distinct().rdd.map(r => r(0)).collect()
val oneHotDf = list.foldLeft(df)((acc, category) => acc.withColumn("category_" + category, when(col("category") === category, 1).otherwise(0)))
If you'd like to get the PySpark version of the pandas "pd.get_dummies" function, you can use the following function:
import itertools
import pyspark.sql.functions as F

def spark_get_dummies(df):
    categories = []
    for i, values in enumerate(df.columns):
        categories.append(df.select(values).distinct().rdd.flatMap(lambda x: x).collect())

    expressions = []
    for i, values in enumerate(df.columns):
        expressions.append([F.when(F.col(values) == cat, 1).otherwise(0).alias(str(values) + "_" + str(cat)) for cat in categories[i]])

    expressions_flat = list(itertools.chain.from_iterable(expressions))
    df_final = df.select(*expressions_flat)

    return df_final
The reproducible example is:
df = sqlContext.createDataFrame([
("A", "X1"),
("B", "X2"),
("B", "X3"),
("B", "X3"),
("C", "X2"),
("C", "X2"),
("C", "X1"),
("B", "X1"),
], ["TYPE", "CODE"])
dummies_df = spark_get_dummies(df)
dummies_df.show()
You will get:
The first step is to make a DataFrame from your CSV file.
See Get CSV to Spark dataframe; the first answer gives a line-by-line example.
Then you can add the columns. Assume you have a DataFrame object called df, and the columns are: [ID, TYPE, CODE].
The rest can be done with DataFrame.withColumn() and pyspark.sql.functions.when:
from pyspark.sql.functions import when
df_with_extra_columns = df.withColumn("e_TYPE_A", when(df.TYPE == "A", 1).otherwise(0)).withColumn("e_TYPE_B", when(df.TYPE == "B", 1).otherwise(0))
(This adds the first two columns; you get the point.)
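To avoid hard-coding every value, a minimal sketch (assuming the distinct values of TYPE and CODE are few enough to collect to the driver) could generate these columns in a loop:
from pyspark.sql import functions as F

df_with_extra_columns = df
for column in ["TYPE", "CODE"]:
    # collect the distinct values of this column and add one indicator column per value
    values = [row[0] for row in df.select(column).distinct().collect()]
    for value in values:
        df_with_extra_columns = df_with_extra_columns.withColumn(
            "e_{}_{}".format(column, value),
            F.when(F.col(column) == value, 1).otherwise(0))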
I'm using Spark in Scala, with Datasets, and I'm facing the following situation. (EDIT: I've rewritten the examples to be more concrete. This is meant to be put in a method and reused with different datasets.)
Suppose we have the following Dataset, representing moves from region a to region b at specific timestamps:
case class Move(id: Int, ra: Int, rb: Int, time_a: Timestamp, time_b: Timestamp)
val ds = Seq(
(1, 123, 125, "2021-07-25 13:05:20", "2021-07-25 15:05:20"),
(2, 470, 125, "2021-07-25 00:05:20", "2021-07-25 02:05:20"),
(1, 470, 125, "2021-07-25 02:05:20", "2021-07-25 04:05:20"),
(3, 123, 125, "2021-07-26 00:45:20", "2021-07-26 02:45:20"),
(3, 125, 123, "2021-07-28 16:05:20", "2021-07-28 18:05:20"),
(1, 125, 123, "2021-07-29 20:05:20", "2021-07-30 01:05:20")
).toDF("id", "ra", "rb", "time_a", "time_b")
.withColumn("time_a", to_timestamp($"time_a"))
.withColumn("time_b", to_timestamp($"time_b"))
.as[Move]
What I want is to count the number of moves for each day, so that I can get an origin-destination matrix with some time divisions. Also, I want to be able to apply some labels/categories along those time divisions. (The code for how this categorization is made isn't relevant; we can just assume here that it works and that each category is described in a val categs: Seq[TimeCategory].) This could be done like this:
case class Flow(ra: Int, rb: Int, time: Long, categ: Long, inflow: Long, outflow: Long)
val Row(t0: Timestamp, tf: Timestamp) = ds.select(min($"time_a"), max($"time_b")).head
val ldt0 = t0.toLocalDateTime()
val ldtf = tf.toLocalDateTime()
val unit = ChronoUnit.DAYS
val fmds = ds.map({
case Move(id, ra, rb, ta, tb) => {
val ldta = ta.toLocalDateTime()
val ldtb = tb.toLocalDateTime()
(
id, ra, rb,
unit.between(ldt0, ldta),
unit.between(ldt0, ldtb),
TimeCategory.encode(categs, ldta),
TimeCategory.encode(categs, ldtb)
)
}
}).toDF("id", "ra", "rb", "time_a", "time_b", "categ_a", "categ_b")
val outfds = fmds
.groupBy($"ra", $"rb", col("time_a").as("time"), col("categ_a").as("categ"))
.agg(count("id").as("outflow"))
.withColumn("inflow", lit(0))
val infds = fmds
.groupBy($"ra", $"rb", col("time_b").as("time"), col("categ_b").as("categ"))
.agg(count("id").as("inflow"))
.withColumn("outflow", lit(0))
val fds = outfds
.unionByName(infds)
.groupBy("ra", "rb", "time", "categ")
.agg(sum("outflow").as("outflow"), sum("inflow").as("inflow"))
.orderBy("ra", "rb", "time", "categ")
.as[Flow]
and that would yield the following result:
+---+---+----+-----+-------+------+
| ra| rb|time|categ|outflow|inflow|
+---+---+----+-----+-------+------+
|123|125| 0| 19| 1| 1|
|123|125| 1| 13| 1| 0|
|123|125| 1| 14| 0| 1|
|125|123| 3| 11| 1| 0|
|125|123| 3| 12| 0| 1|
|125|123| 4| 12| 1| 0|
|125|123| 5| 14| 0| 1|
|470|125| 0| 21| 1| 0|
|470|125| 0| 22| 1| 2|
+---+---+----+-----+-------+------+
The problem is, if I wanted to get the average inflow per day for each pair of regions, many days with inflow = 0 would be missing. For example, if the following agg is done:
fds.groupBy("ra", "rb", "categ").agg(avg("inflow"))
// output:
+---+---+-----+-----------+
| ra| rb|categ|avg(inflow)|
+---+---+-----+-----------+
|123|125| 14| 1.0|
|123|125| 13| 0.0|
|470|125| 21| 0.0|
|125|123| 12| 0.5|
|125|123| 11| 0.0|
|123|125| 19| 1.0|
|125|123| 14| 1.0|
|470|125| 22| 2.0|
+---+---+-----+-----------+
For 125 -> 123 and category 12, the avg was 0.5, but considering the start and end timestamps of the whole dataset, there should be 5 days with category 12, not just 2, so the avg should be 1 / 5 = 0.2. This second case is what I want to calculate. Considering I also want to compute other aggregate functions (like stddev), I suppose the most flexible alternative would be to "fill in" the rows whose values should be zero. What is the best way, in terms of performance and scalability, of doing that?
(EDIT: I thought of a better approach here as well.) So far, a possible solution that comes to my mind is to create another DataFrame with the "time slots" (in this case, each "slot" would be a day index with all suitable categories) and do one more union, like this:
// TimeCategory.encodeAll basically just returns
// every category that suits that timestamp
val timeindex: Seq[(Long, Long)] = (0L to unit.between(ldt0, ldtf))
.flatMap(i => {
val t = ldt0.plus(i, unit)
val categCodes = TimeCategory.encodeAll(categs, unit, t)
categCodes.map((i, _))
})
val zds = infds
.select($"ra", $"rb", explode(typedLit(timeindex)).as("time"))
.select($"ra", $"rb", col("time")("_1").as("time"), col("time")("_2").as("categ"),
lit(0).as("outflow"), lit(0).as("inflow"))
val fds = outfds
.unionByName(infds)
.unionByName(zds)
.groupBy("ra", "rb", "time", "categ")
.agg(sum("outflow").as("outflow"), sum("inflow").as("inflow"))
.orderBy("ra", "rb", "time", "categ")
.as[Flow]
fds.show()
fds.groupBy("ra", "rb", "categ")
.agg(avg("inflow"))
.where($"avg(inflow)" > 0.0)
.show()
and the result:
+---+---+----+-----+-------+------+
| ra| rb|time|categ|outflow|inflow|
+---+---+----+-----+-------+------+
|123|125| 0| 17| 0| 0|
|123|125| 0| 18| 0| 0|
|123|125| 0| 19| 1| 1|
|123|125| 0| 20| 0| 0|
|123|125| 0| 21| 0| 0|
|123|125| 0| 22| 0| 0|
|123|125| 1| 9| 0| 0|
|123|125| 1| 10| 0| 0|
|123|125| 1| 11| 0| 0|
|123|125| 1| 12| 0| 0|
|123|125| 1| 13| 1| 0|
|123|125| 1| 14| 0| 1|
|123|125| 2| 9| 0| 0|
|123|125| 2| 10| 0| 0|
|123|125| 2| 11| 0| 0|
|123|125| 2| 12| 0| 0|
|123|125| 2| 13| 0| 0|
|123|125| 2| 14| 0| 0|
|123|125| 3| 9| 0| 0|
|123|125| 3| 10| 0| 0|
+---+---+----+-----+-------+------+
only showing top 20 rows
+---+---+-----+-----------+
| ra| rb|categ|avg(inflow)|
+---+---+-----+-----------+
|123|125| 14| 0.2|
|125|123| 12| 0.2| // <- the correct avg
|123|125| 19| 1.0|
|125|123| 14| 0.2|
|470|125| 22| 2.0|
+---+---+-----+-----------+
Is it possible to improve this?
I can only propose a little improvement to your initial solution: replacing the join with unionByName. It should be more efficient than a join, but I didn't test the performance:
val ds = Seq(
1L -> java.sql.Timestamp.valueOf("2021-07-25 13:05:20"),
2L -> java.sql.Timestamp.valueOf("2021-07-25 00:05:20"),
1L -> java.sql.Timestamp.valueOf("2021-07-25 02:05:20"),
3L -> java.sql.Timestamp.valueOf("2021-07-26 00:45:20"),
3L -> java.sql.Timestamp.valueOf("2021-07-28 16:05:20"),
1L -> java.sql.Timestamp.valueOf("2021-07-29 20:05:20")
).toDF("id", "timestamp")
val ddf = ds
.withColumn("day", dayofmonth($"timestamp"))
.drop("timestamp")
val r = ds.select(min($"timestamp"), max($"timestamp")).head()
val t0 = r.getTimestamp(0)
val tf = r.getTimestamp(1)
val idf = (t0.getDate() to tf.getDate()).toDF("day")
.withColumn("id", lit(null).cast(LongType))
ddf.unionByName(idf)
.groupBy($"day")
.agg(sum(when($"id".isNotNull, 1).otherwise(0)).as("count"))
.orderBy("day")
.show
I have something analogous to this, where spark is my SparkSession. I've imported spark.implicits._ so I can use the $ syntax:
val df = spark.createDataFrame(Seq(("a", 0L), ("b", 1L), ("c", 1L), ("d", 1L), ("e", 0L), ("f", 1L)))
.toDF("id", "flag")
.withColumn("index", monotonically_increasing_id)
.withColumn("run_key", when($"flag" === 1, $"index").otherwise(0))
df.show
df: org.apache.spark.sql.DataFrame = [id: string, flag: bigint ... 2 more fields]
+---+----+-----+-------+
| id|flag|index|run_key|
+---+----+-----+-------+
| a| 0| 0| 0|
| b| 1| 1| 1|
| c| 1| 2| 2|
| d| 1| 3| 3|
| e| 0| 4| 0|
| f| 1| 5| 5|
+---+----+-----+-------+
I want to create another column with a unique grouping key for each nonzero chunk of run_key, something equivalent to this:
+---+----+-----+-------+---+
| id|flag|index|run_key|key|
+---+----+-----+-------+---+
| a| 0| 0| 0| 0|
| b| 1| 1| 1| 1|
| c| 1| 2| 2| 1|
| d| 1| 3| 3| 1|
| e| 0| 4| 0| 0|
| f| 1| 5| 5| 2|
+---+----+-----+-------+---+
It could be the first value in each run, average of each run, or some other value -- it doesn't really matter as long as it's guaranteed to be unique so that I can group on it afterward to compare other values between groups.
Edit: BTW, I don't need to retain the rows where flag is 0.
One approach would be to 1) create a column $"lag1" from $"flag" using the Window function lag(), 2) create another column $"switched" holding the $"index" value in rows where $"flag" switches, and finally 3) create the key column, which copies $"switched" from the last non-null row via last() and rowsBetween().
Note that this solution uses a Window function without partitioning, hence it may not work for large datasets.
val df = Seq(
("a", 0L), ("b", 1L), ("c", 1L), ("d", 1L), ("e", 0L), ("f", 1L),
("g", 1L), ("h", 0L), ("i", 0L), ("j", 1L), ("k", 1L), ("l", 1L)
).toDF("id", "flag").
withColumn("index", monotonically_increasing_id).
withColumn("run_key", when($"flag" === 1, $"index").otherwise(0))
import org.apache.spark.sql.expressions.Window
df.withColumn( "lag1", lag("flag", 1, -1).over(Window.orderBy("index")) ).
withColumn( "switched", when($"flag" =!= $"lag1", $"index") ).
withColumn( "key", last("switched", ignoreNulls = true).over(
Window.orderBy("index").rowsBetween(Window.unboundedPreceding, 0)
) )
// +---+----+-----+-------+----+--------+---+
// | id|flag|index|run_key|lag1|switched|key|
// +---+----+-----+-------+----+--------+---+
// | a| 0| 0| 0| -1| 0| 0|
// | b| 1| 1| 1| 0| 1| 1|
// | c| 1| 2| 2| 1| null| 1|
// | d| 1| 3| 3| 1| null| 1|
// | e| 0| 4| 0| 1| 4| 4|
// | f| 1| 5| 5| 0| 5| 5|
// | g| 1| 6| 6| 1| null| 5|
// | h| 0| 7| 0| 1| 7| 7|
// | i| 0| 8| 0| 0| null| 7|
// | j| 1| 9| 9| 0| 9| 9|
// | k| 1| 10| 10| 1| null| 9|
// | l| 1| 11| 11| 1| null| 9|
// +---+----+-----+-------+----+--------+---+
You can label the "run" with the largest index where flag is 0 smaller than the index of the row in question.
Something like:
flags = df.filter($"flag" === 0)
.select("index")
.withColumnRenamed("index", "flagIndex")
indices = df.select("index").join(flags, df.index > flags.flagIndex)
.groupBy($"index")
.agg(max($"index$).as("groupKey"))
dfWithGroups = df.join(indices, Seq("index"))
I have the following two DataFrames in Spark 2.2 and Scala 2.11. The DataFrame edges defines the edges of a directed graph, while the DataFrame types defines the type of each node.
edges =
+-----+-----+----+
|from |to |attr|
+-----+-----+----+
| 1| 0| 1|
| 1| 4| 1|
| 2| 2| 1|
| 4| 3| 1|
| 4| 5| 1|
+-----+-----+----+
types =
+------+---------+
|nodeId|type |
+------+---------+
| 0| 0|
| 1| 0|
| 2| 2|
| 3| 4|
| 4| 4|
| 5| 4|
+------+---------+
For each node, I want to know the number of edges to nodes of the same type. Please note that I only want to count the edges outgoing from a node, since I am dealing with a directed graph.
In order to reach this objective, I joined both DataFrames:
val graphDF = edges
.join(types, types("nodeId") === edges("from"), "left")
.drop("nodeId")
.withColumnRenamed("type","type_from")
.join(types, types("nodeId") === edges("to"), "left")
.drop("nodeId")
.withColumnRenamed("type","type_to")
I obtained the following new DataFrame graphDF:
+-----+-----+----+---------------+---------------+
|from |to |attr|type_from |type_to |
+-----+-----+----+---------------+---------------+
| 1| 0| 1| 0| 0|
| 1| 4| 1| 0| 4|
| 2| 2| 1| 2| 2|
| 4| 3| 1| 4| 4|
| 4| 5| 1| 4| 4|
+-----+-----+----+---------------+---------------+
Now I need to get the following final result:
+------+---------+---------+
|nodeId|numLinks |type |
+------+---------+---------+
| 0| 0| 0|
| 1| 1| 0|
| 2| 0| 2|
| 3| 0| 4|
| 4| 2| 4|
| 5| 0| 4|
+------+---------+---------+
I was thinking about using groupBy and agg(count(...)), but I do not know how to deal with the directed edges.
Update:
numLinks is calculated as the number of edges outgoing from a given node. For example, the node 5 does not have any outgoing edges (only ingoing edge 4->5, see the DataFrame edges). The same refers to the node 0. But the node 4 has two outgoing edges (4->3 and 4->5).
My solution:
This is my solution, but it lacks those nodes that have 0 links.
graphDF.filter("from != to").filter("type_from == type_to").groupBy("from").agg(count("from") as "numLinks").show()
You can filter, aggregate by id and type and add missing nodes using types:
val graphDF = Seq(
(1, 0, 1, 0, 0), (1, 4, 1, 0, 4), (2, 2, 1, 2, 2),
(4, 3, 1, 4, 4), (4, 5, 1, 4, 4)
).toDF("from", "to", "attr", "type_from", "type_to")
val types = Seq(
(0, 0), (1, 0), (2, 2), (3, 4), (4,4), (5, 4)
).toDF("nodeId", "type")
graphDF
// I want to know the number of edges to the nodes of the same type
.where($"type_from" === $"type_to" && $"from" =!= $"to")
// I only want to count the edges outgoing from a node,
.groupBy($"from" as "nodeId", $"type_from" as "type")
.agg(count("*") as "numLinks")
// but it lacks those nodes that have 0 links.
.join(types, Seq("nodeId", "type"), "rightouter")
.na.fill(0)
// +------+----+--------+
// |nodeId|type|numLinks|
// +------+----+--------+
// | 0| 0| 0|
// | 1| 0| 1|
// | 2| 2| 1|
// | 3| 4| 0|
// | 4| 4| 2|
// | 5| 4| 0|
// +------+----+--------+
To skip self-links add $"from" =!= $"to" to the selection:
graphDF
.where($"type_from" === $"type_to" && $"from" =!= $"to")
.groupBy($"from" as "nodeId", $"type_from" as "type")
.agg(count("*") as "numLinks")
.join(types, Seq("nodeId", "type"), "rightouter")
.na.fill(0)
// +------+----+--------+
// |nodeId|type|numLinks|
// +------+----+--------+
// | 0| 0| 0|
// | 1| 0| 1|
// | 2| 2| 0|
// | 3| 4| 0|
// | 4| 4| 2|
// | 5| 4| 0|
// +------+----+--------+
I have looked at a number of questions online, but they don't seem to do what I'm trying to achieve.
I'm using Apache Spark 2.0.2 with Scala.
I have a dataframe:
+----------+-----+----+----+----+----+----+
|segment_id| val1|val2|val3|val4|val5|val6|
+----------+-----+----+----+----+----+----+
| 1| 100| 0| 0| 0| 0| 0|
| 2| 0| 50| 0| 0| 20| 0|
| 3| 0| 0| 0| 0| 0| 0|
| 4| 0| 0| 0| 0| 0| 0|
+----------+-----+----+----+----+----+----+
which I want to transpose to
+----+-----+----+----+----+
|vals| 1| 2| 3| 4|
+----+-----+----+----+----+
|val1| 100| 0| 0| 0|
|val2| 0| 50| 0| 0|
|val3| 0| 0| 0| 0|
|val4| 0| 0| 0| 0|
|val5| 0| 20| 0| 0|
|val6| 0| 0| 0| 0|
+----+-----+----+----+----+
I've tried using pivot() but I couldn't get to the right answer. I ended up looping through my val{x} columns, and pivoting each as per below, but this is proving to be very slow.
val d = df.select('segment_id, 'val1)
+----------+-----+
|segment_id| val1|
+----------+-----+
| 1| 100|
| 2| 0|
| 3| 0|
| 4| 0|
+----------+-----+
d.groupBy('val1).sum().withColumnRenamed("val1", "vals")
+----+-----+----+----+----+
|vals| 1| 2| 3| 4|
+----+-----+----+----+----+
|val1| 100| 0| 0| 0|
+----+-----+----+----+----+
Then I union() the result of each val{x} iteration onto my first dataframe.
+----+-----+----+----+----+
|vals| 1| 2| 3| 4|
+----+-----+----+----+----+
|val2| 0| 50| 0| 0|
+----+-----+----+----+----+
Is there a more efficient way to do a transpose when I do not want to aggregate the data?
Thanks :)
Unfortunately there is no case where both of the following hold: a Spark DataFrame is justified considering the amount of data, and transposition of the data is feasible.
You have to remember that DataFrame, as implemented in Spark, is a distributed collection of rows and each row is stored and processed on a single node.
You could express transposition on a DataFrame as pivot:
val kv = explode(array(df.columns.tail.map {
c => struct(lit(c).alias("k"), col(c).alias("v"))
}: _*))
df
.withColumn("kv", kv)
.select($"segment_id", $"kv.k", $"kv.v")
.groupBy($"k")
.pivot("segment_id")
.agg(first($"v"))
.orderBy($"k")
.withColumnRenamed("k", "vals")
but it is merely toy code with no practical application. In practice it is no better than collecting the data:
val (header, data) = df.collect.map(_.toSeq.toArray).transpose match {
case Array(h, t @ _*) => {
(h.map(_.toString), t.map(_.collect { case x: Int => x }))
}
}
val rows = df.columns.tail.zip(data).map { case (x, ys) => Row.fromSeq(x +: ys) }
val schema = StructType(
StructField("vals", StringType) +: header.map(StructField(_, IntegerType))
)
spark.createDataFrame(sc.parallelize(rows), schema)
For DataFrame defined as:
val df = Seq(
(1, 100, 0, 0, 0, 0, 0),
(2, 0, 50, 0, 0, 20, 0),
(3, 0, 0, 0, 0, 0, 0),
(4, 0, 0, 0, 0, 0, 0)
).toDF("segment_id", "val1", "val2", "val3", "val4", "val5", "val6")
both would give you the desired result:
+----+---+---+---+---+
|vals| 1| 2| 3| 4|
+----+---+---+---+---+
|val1|100| 0| 0| 0|
|val2| 0| 50| 0| 0|
|val3| 0| 0| 0| 0|
|val4| 0| 0| 0| 0|
|val5| 0| 20| 0| 0|
|val6| 0| 0| 0| 0|
+----+---+---+---+---+
That being said, if you need efficient transpositions on a distributed data structure, you'll have to look elsewhere. There are a number of structures, including the core CoordinateMatrix and BlockMatrix, which can distribute data across both dimensions and can be transposed.
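For illustration, here is a minimal sketch of a distributed transpose via CoordinateMatrix, using the PySpark mllib API (the Scala API is analogous); the entries shown are just made-up sample values:
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

# Sparse matrix as (row, col, value) entries, distributed as an RDD.
entries = sc.parallelize([
    MatrixEntry(0, 0, 100.0),
    MatrixEntry(1, 1, 50.0),
    MatrixEntry(1, 4, 20.0),
])
mat = CoordinateMatrix(entries)
transposed = mat.transpose()          # swaps row and column indices, stays distributed
print(transposed.entries.collect())   # collect() only for this tiny example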
In Python, this can be done in a simple way.
I normally use the transpose function in pandas by converting the Spark DataFrame:
spark_df.toPandas().T
Here is the solution for PySpark (the pandas API on Spark):
https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transpose.html
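A minimal sketch of that approach (assuming Spark 3.2+, where the pandas API on Spark is available, and a Spark DataFrame named spark_df); note that the .T used in the steps below also requires a pandas-on-Spark DataFrame like this one:
import pyspark.pandas as ps

# Convert the Spark DataFrame to a pandas-on-Spark DataFrame, then transpose.
psdf = spark_df.pandas_api()                  # or ps.DataFrame(spark_df)
transposed = psdf.set_index("segment_id").T   # columns become the segment_id values
print(transposed.head())                      # transpose is intended for small data (see compute.max_rows)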
Here is the solution code for your problem:
Step 1: Choose columns
d = df.select('val1','val2','val3','val4','val5','val6','segment_id')
This part of the code produces a data frame like this:
+----+----+----+----+----+----+----------+
|val1|val2|val3|val4|val5|val6|segment_id|
+----+----+----+----+----+----+----------+
| 100|   0|   0|   0|   0|   0|         1|
|   0|  50|   0|   0|  20|   0|         2|
|   0|   0|   0|   0|   0|   0|         3|
|   0|   0|   0|   0|   0|   0|         4|
+----+----+----+----+----+----+----------+
Step 2: Transpose the whole table.
d_transposed = d.T.sort_index()
This part of the code produces a data frame like this:
+----------+----+----+----+----+
|segment_id|   1|   2|   3|   4|
+----------+----+----+----+----+
|      val1| 100|   0|   0|   0|
|      val2|   0|  50|   0|   0|
|      val3|   0|   0|   0|   0|
|      val4|   0|   0|   0|   0|
|      val5|   0|  20|   0|   0|
|      val6|   0|   0|   0|   0|
+----------+----+----+----+----+
Step 3: You need to rename the segment_id to vals:
d_transposed.withColumnRenamed("segment_id","vals")
+----+----+----+----+----+
|vals|   1|   2|   3|   4|
+----+----+----+----+----+
|val1| 100|   0|   0|   0|
|val2|   0|  50|   0|   0|
|val3|   0|   0|   0|   0|
|val4|   0|   0|   0|   0|
|val5|   0|  20|   0|   0|
|val6|   0|   0|   0|   0|
+----+----+----+----+----+
Here is your full code:
d = df.select('val1','val2','val3','val4','val5','val6','segment_id')
d_transposed = d.T.sort_index()
d_transposed.withColumnRenamed("segment_id","vals")
This should be a perfect solution.
val seq = Seq((1,100,0,0,0,0,0),(2,0,50,0,0,20,0),(3,0,0,0,0,0,0),(4,0,0,0,0,0,0))
val df1 = seq.toDF("segment_id", "val1", "val2", "val3", "val4", "val5", "val6")
df1.show()
val schema = df1.schema
val df2 = df1.flatMap(row => {
val metric = row.getInt(0)
(1 until row.size).map(i => {
(metric, schema(i).name, row.getInt(i))
})
})
val df3 = df2.toDF("metric", "vals", "value")
df3.show()
import org.apache.spark.sql.functions._
val df4 = df3.groupBy("vals").pivot("metric").agg(first("value"))
df4.show()
I created a dataframe in spark with the following schema:
root
|-- user_id: long (nullable = false)
|-- event_id: long (nullable = false)
|-- invited: integer (nullable = false)
|-- day_diff: long (nullable = true)
|-- interested: integer (nullable = false)
|-- event_owner: long (nullable = false)
|-- friend_id: long (nullable = false)
And the data is shown below:
+----------+----------+-------+--------+----------+-----------+---------+
| user_id| event_id|invited|day_diff|interested|event_owner|friend_id|
+----------+----------+-------+--------+----------+-----------+---------+
| 4236494| 110357109| 0| -1| 0| 937597069| null|
| 78065188| 498404626| 0| 0| 0| 2904922087| null|
| 282487230|2520855981| 0| 28| 0| 3749735525| null|
| 335269852|1641491432| 0| 2| 0| 1490350911| null|
| 437050836|1238456614| 0| 2| 0| 991277599| null|
| 447244169|2095085551| 0| -1| 0| 1579858878| null|
| 516353916|1076364848| 0| 3| 1| 3597645735| null|
| 528218683|1151525474| 0| 1| 0| 3433080956| null|
| 531967718|3632072502| 0| 1| 0| 3863085861| null|
| 627948360|2823119321| 0| 0| 0| 4092665803| null|
| 811791433|3513954032| 0| 2| 0| 415464198| null|
| 830686203| 99027353| 0| 0| 0| 3549822604| null|
|1008893291|1115453150| 0| 2| 0| 2245155244| null|
|1239364869|2824096896| 0| 2| 1| 2579294650| null|
|1287950172|1076364848| 0| 0| 0| 3597645735| null|
|1345896548|2658555390| 0| 1| 0| 2025118823| null|
|1354205322|2564682277| 0| 3| 0| 2563033185| null|
|1408344828|1255629030| 0| -1| 1| 804901063| null|
|1452633375|1334001859| 0| 4| 0| 1488588320| null|
|1625052108|3297535757| 0| 3| 0| 1972598895| null|
+----------+----------+-------+--------+----------+-----------+---------+
I want to filter out the rows have null values in the field of "friend_id".
scala> val aaa = test.filter("friend_id is null")
scala> aaa.count
I got res52: Long = 0, which is obviously not right. What is the right way to get it?
One more question, I want to replace the values in the friend_id field. I want to replace null with 0 and 1 for any other value except null. The code I can figure out is:
val aaa = train_friend_join.select($"user_id", $"event_id", $"invited", $"day_diff", $"interested", $"event_owner", ($"friend_id" != null)?1:0)
This code also doesn't work. Can anyone tell me how I can fix it? Thanks.
Let's say you have this data setup (so that results are reproducible):
// declaring data types
case class Company(cName: String, cId: String, details: String)
case class Employee(name: String, id: String, email: String, company: Company)
// setting up example data
val e1 = Employee("n1", null, "n1#c1.com", Company("c1", "1", "d1"))
val e2 = Employee("n2", "2", "n2#c1.com", Company("c1", "1", "d1"))
val e3 = Employee("n3", "3", "n3#c1.com", Company("c1", "1", "d1"))
val e4 = Employee("n4", "4", "n4#c2.com", Company("c2", "2", "d2"))
val e5 = Employee("n5", null, "n5#c2.com", Company("c2", "2", "d2"))
val e6 = Employee("n6", "6", "n6#c2.com", Company("c2", "2", "d2"))
val e7 = Employee("n7", "7", "n7#c3.com", Company("c3", "3", "d3"))
val e8 = Employee("n8", "8", "n8#c3.com", Company("c3", "3", "d3"))
val employees = Seq(e1, e2, e3, e4, e5, e6, e7, e8)
val df = sc.parallelize(employees).toDF
Data is:
+----+----+---------+---------+
|name| id| email| company|
+----+----+---------+---------+
| n1|null|n1#c1.com|[c1,1,d1]|
| n2| 2|n2#c1.com|[c1,1,d1]|
| n3| 3|n3#c1.com|[c1,1,d1]|
| n4| 4|n4#c2.com|[c2,2,d2]|
| n5|null|n5#c2.com|[c2,2,d2]|
| n6| 6|n6#c2.com|[c2,2,d2]|
| n7| 7|n7#c3.com|[c3,3,d3]|
| n8| 8|n8#c3.com|[c3,3,d3]|
+----+----+---------+---------+
Now to filter employees with null ids, you will do --
df.filter("id is null").show
which will correctly show you following:
+----+----+---------+---------+
|name| id| email| company|
+----+----+---------+---------+
| n1|null|n1#c1.com|[c1,1,d1]|
| n5|null|n5#c2.com|[c2,2,d2]|
+----+----+---------+---------+
Coming to the second part of your question, you can replace the null ids with 0 and other values with 1 with this --
df.withColumn("id", when($"id".isNull, 0).otherwise(1)).show
This results in:
+----+---+---------+---------+
|name| id| email| company|
+----+---+---------+---------+
| n1| 0|n1#c1.com|[c1,1,d1]|
| n2| 1|n2#c1.com|[c1,1,d1]|
| n3| 1|n3#c1.com|[c1,1,d1]|
| n4| 1|n4#c2.com|[c2,2,d2]|
| n5| 0|n5#c2.com|[c2,2,d2]|
| n6| 1|n6#c2.com|[c2,2,d2]|
| n7| 1|n7#c3.com|[c3,3,d3]|
| n8| 1|n8#c3.com|[c3,3,d3]|
+----+---+---------+---------+
Or like df.filter($"friend_id".isNotNull)
df.where(df.col("friend_id").isNull)
There are two ways to do it, by creating the filter condition 1) manually or 2) dynamically.
Sample DataFrame:
val df = spark.createDataFrame(Seq(
(0, "a1", "b1", "c1", "d1"),
(1, "a2", "b2", "c2", "d2"),
(2, "a3", "b3", null, "d3"),
(3, "a4", null, "c4", "d4"),
(4, null, "b5", "c5", "d5")
)).toDF("id", "col1", "col2", "col3", "col4")
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
| 0| a1| b1| c1| d1|
| 1| a2| b2| c2| d2|
| 2| a3| b3|null| d3|
| 3| a4|null| c4| d4|
| 4|null| b5| c5| d5|
+---+----+----+----+----+
1) Creating the filter condition manually, i.e. using the DataFrame where or filter function
df.filter(col("col1").isNotNull && col("col2").isNotNull).show
or
df.where("col1 is not null and col2 is not null").show
Result:
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
| 0| a1| b1| c1| d1|
| 1| a2| b2| c2| d2|
| 2| a3| b3|null| d3|
+---+----+----+----+----+
2) Creating the filter condition dynamically: this is useful when we don't want any column to have a null value and there is a large number of columns, which is often the case.
Creating the filter condition manually in such cases would waste a lot of time. In the code below we include all columns dynamically, using the map and reduce functions on the DataFrame's columns:
val filterCond = df.columns.map(x=>col(x).isNotNull).reduce(_ && _)
How filterCond looks:
filterCond: org.apache.spark.sql.Column = (((((id IS NOT NULL) AND (col1 IS NOT NULL)) AND (col2 IS NOT NULL)) AND (col3 IS NOT NULL)) AND (col4 IS NOT NULL))
Filtering:
val filteredDf = df.filter(filterCond)
Result:
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
| 0| a1| b1| c1| d1|
| 1| a2| b2| c2| d2|
+---+----+----+----+----+
A good solution for me was to drop the rows with any null values:
Dataset<Row> filtered = df.filter(row => !row.anyNull);
In case one is interested in the other case, just call row.anyNull.
(Spark 2.1.0 using Java API)
The following lines work well:
test.filter("friend_id is not null")
Following the hint from Michael Kopaniov, the below works:
df.where(df("id").isNotNull).show
Here is a solution for Spark in Java. To select the data rows containing nulls, when you have Dataset<Row> data, you do:
Dataset<Row> containingNulls = data.where(data.col("COLUMN_NAME").isNull())
To filter out data without nulls you do:
Dataset<Row> withoutNulls = data.where(data.col("COLUMN_NAME").isNotNull())
Often dataframes contain columns of type String where instead of nulls we have empty strings like "". To filter out such data as well we do:
Dataset<Row> withoutNullsAndEmpty = data.where(data.col("COLUMN_NAME").isNotNull().and(data.col("COLUMN_NAME").notEqual("")))
For the first question: it is correct that you are filtering out the nulls, and hence the count is zero.
For the second part, the replacement, use something like the below:
val options = Map("path" -> "...\\ex.csv", "header" -> "true")
val dfNull = spark.sqlContext.load("com.databricks.spark.csv", options)
scala> dfNull.show
+----------+----------+-------+--------+----------+-----------+---------+
| user_id| event_id|invited|day_diff|interested|event_owner|friend_id|
+----------+----------+-------+--------+----------+-----------+---------+
| 4236494| 110357109| 0| -1| 0| 937597069| null|
| 78065188| 498404626| 0| 0| 0| 2904922087| null|
| 282487230|2520855981| 0| 28| 0| 3749735525| null|
| 335269852|1641491432| 0| 2| 0| 1490350911| null|
| 437050836|1238456614| 0| 2| 0| 991277599| null|
| 447244169|2095085551| 0| -1| 0| 1579858878| a|
| 516353916|1076364848| 0| 3| 1| 3597645735| b|
| 528218683|1151525474| 0| 1| 0| 3433080956| c|
| 531967718|3632072502| 0| 1| 0| 3863085861| null|
| 627948360|2823119321| 0| 0| 0| 4092665803| null|
| 811791433|3513954032| 0| 2| 0| 415464198| null|
| 830686203| 99027353| 0| 0| 0| 3549822604| null|
|1008893291|1115453150| 0| 2| 0| 2245155244| null|
|1239364869|2824096896| 0| 2| 1| 2579294650| d|
|1287950172|1076364848| 0| 0| 0| 3597645735| null|
|1345896548|2658555390| 0| 1| 0| 2025118823| null|
|1354205322|2564682277| 0| 3| 0| 2563033185| null|
|1408344828|1255629030| 0| -1| 1| 804901063| null|
|1452633375|1334001859| 0| 4| 0| 1488588320| null|
|1625052108|3297535757| 0| 3| 0| 1972598895| null|
+----------+----------+-------+--------+----------+-----------+---------+
dfNull.withColumn("friend_idTmp", when($"friend_id".isNull, "1").otherwise("0")).drop($"friend_id").withColumnRenamed("friend_idTmp", "friend_id").show
+----------+----------+-------+--------+----------+-----------+---------+
| user_id| event_id|invited|day_diff|interested|event_owner|friend_id|
+----------+----------+-------+--------+----------+-----------+---------+
| 4236494| 110357109| 0| -1| 0| 937597069| 1|
| 78065188| 498404626| 0| 0| 0| 2904922087| 1|
| 282487230|2520855981| 0| 28| 0| 3749735525| 1|
| 335269852|1641491432| 0| 2| 0| 1490350911| 1|
| 437050836|1238456614| 0| 2| 0| 991277599| 1|
| 447244169|2095085551| 0| -1| 0| 1579858878| 0|
| 516353916|1076364848| 0| 3| 1| 3597645735| 0|
| 528218683|1151525474| 0| 1| 0| 3433080956| 0|
| 531967718|3632072502| 0| 1| 0| 3863085861| 1|
| 627948360|2823119321| 0| 0| 0| 4092665803| 1|
| 811791433|3513954032| 0| 2| 0| 415464198| 1|
| 830686203| 99027353| 0| 0| 0| 3549822604| 1|
|1008893291|1115453150| 0| 2| 0| 2245155244| 1|
|1239364869|2824096896| 0| 2| 1| 2579294650| 0|
|1287950172|1076364848| 0| 0| 0| 3597645735| 1|
|1345896548|2658555390| 0| 1| 0| 2025118823| 1|
|1354205322|2564682277| 0| 3| 0| 2563033185| 1|
|1408344828|1255629030| 0| -1| 1| 804901063| 1|
|1452633375|1334001859| 0| 4| 0| 1488588320| 1|
|1625052108|3297535757| 0| 3| 0| 1972598895| 1|
+----------+----------+-------+--------+----------+-----------+---------+
val df = Seq(
("1001", "1007"),
("1002", null),
("1003", "1005"),
(null, "1006")
).toDF("user_id", "friend_id")
Data is:
+-------+---------+
|user_id|friend_id|
+-------+---------+
| 1001| 1007|
| 1002| null|
| 1003| 1005|
| null| 1006|
+-------+---------+
Drop rows containing any null or NaN values in the specified columns of the Seq:
df.na.drop(Seq("friend_id"))
.show()
Output:
+-------+---------+
|user_id|friend_id|
+-------+---------+
| 1001| 1007|
| 1003| 1005|
| null| 1006|
+-------+---------+
If you do not specify columns, rows are dropped as long as any column of the row contains null or NaN values:
df.na.drop()
.show()
Output:
+-------+---------+
|user_id|friend_id|
+-------+---------+
| 1001| 1007|
| 1003| 1005|
+-------+---------+
Another easy way to filter out null values from multiple columns in a Spark DataFrame. Note that this only drops rows where all of the listed columns are null (the null checks are effectively ANDed together):
df.filter(" COALESCE(col1, col2, col3, col4, col5, col6) IS NOT NULL")
If you need to filter out rows that contain any null (i.e. the checks are OR-connected), please use
df.na.drop()
I used the following code to solve my question. It works, but as we all know, it goes a country mile around the problem. So, is there a shortcut for that? Thanks.
def filter_null(field : Any) : Int = field match {
case null => 0
case _ => 1
}
val test = train_event_join.join(
user_friends_pair,
train_event_join("user_id") === user_friends_pair("user_id") &&
train_event_join("event_owner") === user_friends_pair("friend_id"),
"left"
).select(
train_event_join("user_id"),
train_event_join("event_id"),
train_event_join("invited"),
train_event_join("day_diff"),
train_event_join("interested"),
train_event_join("event_owner"),
user_friends_pair("friend_id")
).rdd.map{
line => (
line(0).toString.toLong,
line(1).toString.toLong,
line(2).toString.toLong,
line(3).toString.toLong,
line(4).toString.toLong,
line(5).toString.toLong,
filter_null(line(6))
)
}.toDF("user_id", "event_id", "invited", "day_diff", "interested", "event_owner", "creator_is_friend")