Joining data in Spark DataFrames using Scala

I have a Spark dataframe in Scala as below -
val df = Seq(
(0,0,0,0.0,0),
(1,0,0,0.1,1),
(0,1,0,0.11,1),
(0,0,1,0.12,1),
(1,1,0,0.24,2),
(1,0,1,0.27,2),
(0,1,1,0.3,2),
(1,1,1,0.4,3)
).toDF("A","B","C","rate","total")
Here is what it looks like:
scala> df.show
+---+---+---+----+-----+
| A| B| C|rate|total|
+---+---+---+----+-----+
| 0| 0| 0| 0.0| 0|
| 1| 0| 0| 0.1| 1|
| 0| 1| 0|0.11| 1|
| 0| 0| 1|0.12| 1|
| 1| 1| 0|0.24| 2|
| 1| 0| 1|0.27| 2|
| 0| 1| 1| 0.3| 2|
| 1| 1| 1| 0.4| 3|
+---+---+---+----+-----+
A, B and C are channels in this case; 0 and 1 represent the absence and presence of a channel, respectively. The DataFrame holds all 2^3 = 8 combinations, and the column 'total' gives the row-wise sum of the three channel flags.
The individual probabilities of each channel occurring are given by -
scala> val oneChannelCase = df.filter($"total" === 1).toDF()
scala> oneChannelCase.show()
+---+---+---+----+-----+
| A| B| C|rate|total|
+---+---+---+----+-----+
| 1| 0| 0| 0.1| 1|
| 0| 1| 0|0.11| 1|
| 0| 0| 1|0.12| 1|
+---+---+---+----+-----+
However, I am interested only in the pair-wise probabilities of these channels, which are given by -
scala> val probs = df.filter($"total" === 2).toDF()
scala> probs.show()
+---+---+---+----+-----+
| A| B| C|rate|total|
+---+---+---+----+-----+
| 1| 1| 0|0.24| 2|
| 1| 0| 1|0.27| 2|
| 0| 1| 1| 0.3| 2|
+---+---+---+----+-----+
What I would like to do is append 3 new columns to the "probs" DataFrame that show the individual probabilities. Below is the output that I am looking for -
A B C rate prob_A prob_B prob_C
1 1 0 0.24 0.1 0.11 0
1 0 1 0.27 0.1 0 0.12
0 1 1 0.3 0 0.11 0.12
To make things clearer, the first row of the output has A=1, B=1, C=0, so the individual probabilities A=0.1, B=0.11 and C=0 are appended to the probs DataFrame. Similarly, the second row has A=1, B=0, C=1, so A=0.1, B=0 and C=0.12 are appended.
Here is what I have tried -
scala> val channels = df.columns.filter(v => !(v.contains("rate") | v.contains("total")))
// channels: Array[String] = Array(A, B, C)
scala> val pivotedProb = channels.map(v => f"case when $v = 1 then rate else 0 end as prob_${v}")
scala> val param = pivotedProb.mkString(",")
scala> val probs = spark.sql(f"select *, $param from df")
scala> probs.show()
+---+---+---+----+-----+------+------+------+
| A| B| C|rate|total|prob_A|prob_B|prob_C|
+---+---+---+----+-----+------+------+------+
| 0| 0| 0| 0.0| 0| 0.0| 0.0| 0.0|
| 1| 0| 0| 0.1| 1| 0.1| 0.0| 0.0|
| 0| 1| 0|0.11| 1| 0.0| 0.11| 0.0|
| 0| 0| 1|0.12| 1| 0.0| 0.0| 0.12|
| 1| 1| 0|0.24| 2| 0.24| 0.24| 0.0|
| 1| 0| 1|0.27| 2| 0.27| 0.0| 0.27|
| 0| 1| 1| 0.3| 2| 0.0| 0.3| 0.3|
| 1| 1| 1| 0.4| 3| 0.4| 0.4| 0.4|
+---+---+---+----+-----+------+------+------+
which gives me the wrong output.
Kindly help.

If I understand your requirement correctly, you can use foldLeft to traverse the channel columns and 1) generate a ratesMap from the one-channel DataFrame, then 2) add columns to the two-channel DataFrame whose values are the product of each channel flag and the corresponding ratesMap value:
import org.apache.spark.sql.functions.col
val df = Seq(
(0, 0, 0, 0.0, 0),
(1, 0, 0, 0.1, 1),
(0, 1, 0, 0.11, 1),
(0, 0, 1, 0.12, 1),
(1, 1, 0, 0.24, 2),
(1, 0, 1, 0.27, 2),
(0, 1, 1, 0.3, 2),
(1, 1, 1, 0.4, 3)
).toDF("A", "B", "C", "rate", "total")
val oneChannelDF = df.filter($"total" === 1)
val twoChannelDF = df.filter($"total" === 2)
val channels = df.columns.filter(v => !(v.contains("rate") || v.contains("total")))
// channels: Array[String] = Array(A, B, C)
val ratesMap = channels.foldLeft( Map[String, Double]() ){ (acc, c) =>
acc + (c -> oneChannelDF.select("rate").where(col(c) === 1).head.getDouble(0))
}
// ratesMap: scala.collection.immutable.Map[String,Double] = Map(A -> 0.1, B -> 0.11, C -> 0.12)
val probsDF = channels.foldLeft( twoChannelDF ){ (acc, c) =>
acc.withColumn( "prob_" + c, col(c) * ratesMap.getOrElse(c, 0.0) )
}
probsDF.show
// +---+---+---+----+-----+------+------+------+
// | A| B| C|rate|total|prob_A|prob_B|prob_C|
// +---+---+---+----+-----+------+------+------+
// | 1| 1| 0|0.24| 2| 0.1| 0.11| 0.0|
// | 1| 0| 1|0.27| 2| 0.1| 0.0| 0.12|
// | 0| 1| 1| 0.3| 2| 0.0| 0.11| 0.12|
// +---+---+---+----+-----+------+------+------+
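
To make the two foldLeft steps easier to follow, here is a minimal plain-Python sketch of the same logic. It does not use Spark at all; the names rows, rates_map and probs are made up for illustration, and the data is copied from the question.

```python
# Plain-Python model of the answer's two steps (no Spark involved).
rows = [
    {"A": 0, "B": 0, "C": 0, "rate": 0.0, "total": 0},
    {"A": 1, "B": 0, "C": 0, "rate": 0.1, "total": 1},
    {"A": 0, "B": 1, "C": 0, "rate": 0.11, "total": 1},
    {"A": 0, "B": 0, "C": 1, "rate": 0.12, "total": 1},
    {"A": 1, "B": 1, "C": 0, "rate": 0.24, "total": 2},
    {"A": 1, "B": 0, "C": 1, "rate": 0.27, "total": 2},
    {"A": 0, "B": 1, "C": 1, "rate": 0.3, "total": 2},
    {"A": 1, "B": 1, "C": 1, "rate": 0.4, "total": 3},
]
channels = ["A", "B", "C"]

# Step 1: map each channel to the rate of the row where only that channel is present.
rates_map = {
    c: next(r["rate"] for r in rows if r["total"] == 1 and r[c] == 1)
    for c in channels
}

# Step 2: for every two-channel row, prob_<c> = channel flag * single-channel rate.
probs = [
    {**r, **{"prob_" + c: r[c] * rates_map[c] for c in channels}}
    for r in rows
    if r["total"] == 2
]

for p in probs:
    print(p)
```

Multiplying by the 0/1 flag is what zeroes out the probability of an absent channel, which is the part the CASE WHEN attempt in the question got wrong by reusing the row's own rate.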

Creating a unique grouping key from column-wise runs in a Spark DataFrame

I have something analogous to this, where spark is my SparkSession. I've imported spark.implicits._ so I can use the $ syntax:
val df = spark.createDataFrame(Seq(("a", 0L), ("b", 1L), ("c", 1L), ("d", 1L), ("e", 0L), ("f", 1L)))
.toDF("id", "flag")
.withColumn("index", monotonically_increasing_id)
.withColumn("run_key", when($"flag" === 1, $"index").otherwise(0))
df.show
df: org.apache.spark.sql.DataFrame = [id: string, flag: bigint ... 2 more fields]
+---+----+-----+-------+
| id|flag|index|run_key|
+---+----+-----+-------+
| a| 0| 0| 0|
| b| 1| 1| 1|
| c| 1| 2| 2|
| d| 1| 3| 3|
| e| 0| 4| 0|
| f| 1| 5| 5|
+---+----+-----+-------+
I want to create another column with a unique grouping key for each nonzero chunk of run_key, something equivalent to this:
+---+----+-----+-------+---+
| id|flag|index|run_key|key|
+---+----+-----+-------+---+
| a| 0| 0| 0| 0|
| b| 1| 1| 1| 1|
| c| 1| 2| 2| 1|
| d| 1| 3| 3| 1|
| e| 0| 4| 0| 0|
| f| 1| 5| 5| 2|
+---+----+-----+-------+---+
It could be the first value in each run, average of each run, or some other value -- it doesn't really matter as long as it's guaranteed to be unique so that I can group on it afterward to compare other values between groups.
Edit: BTW, I don't need to retain the rows where flag is 0.
One approach would be to 1) create a column $"lag1" from $"flag" using the Window function lag(), 2) create another column $"switched" holding the $"index" value in rows where $"flag" has switched, and finally 3) create the column $"key", which copies $"switched" from the last non-null row via last() and rowsBetween().
Note that this solution uses a Window without partitioning, so Spark moves all rows into a single partition; it may therefore not scale to large datasets.
val df = Seq(
("a", 0L), ("b", 1L), ("c", 1L), ("d", 1L), ("e", 0L), ("f", 1L),
("g", 1L), ("h", 0L), ("i", 0L), ("j", 1L), ("k", 1L), ("l", 1L)
).toDF("id", "flag").
withColumn("index", monotonically_increasing_id).
withColumn("run_key", when($"flag" === 1, $"index").otherwise(0))
import org.apache.spark.sql.expressions.Window
df.withColumn( "lag1", lag("flag", 1, -1).over(Window.orderBy("index")) ).
withColumn( "switched", when($"flag" =!= $"lag1", $"index") ).
withColumn( "key", last("switched", ignoreNulls = true).over(
Window.orderBy("index").rowsBetween(Window.unboundedPreceding, 0)
) )
// +---+----+-----+-------+----+--------+---+
// | id|flag|index|run_key|lag1|switched|key|
// +---+----+-----+-------+----+--------+---+
// | a| 0| 0| 0| -1| 0| 0|
// | b| 1| 1| 1| 0| 1| 1|
// | c| 1| 2| 2| 1| null| 1|
// | d| 1| 3| 3| 1| null| 1|
// | e| 0| 4| 0| 1| 4| 4|
// | f| 1| 5| 5| 0| 5| 5|
// | g| 1| 6| 6| 1| null| 5|
// | h| 0| 7| 0| 1| 7| 7|
// | i| 0| 8| 0| 0| null| 7|
// | j| 1| 9| 9| 0| 9| 9|
// | k| 1| 10| 10| 1| null| 9|
// | l| 1| 11| 11| 1| null| 9|
// +---+----+-----+-------+----+--------+---+
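
The lag / switched / last chain above can be modeled in a few lines of plain Python. This is only a sketch of the idea, not Spark code: run_keys is a hypothetical helper, and it assumes the index is simply the row position 0, 1, 2, ... (monotonically_increasing_id only guarantees increasing values, not consecutive ones).

```python
def run_keys(flags):
    """Return, for each position, the index at which the current run started."""
    keys = []
    previous = None      # plays the role of the -1 default passed to lag()
    current_key = None
    for index, flag in enumerate(flags):
        if flag != previous:
            # The flag switched, so a new run starts at this index.
            current_key = index
        # Otherwise carry the last switch index forward, like last(..., ignoreNulls = true).
        keys.append(current_key)
        previous = flag
    return keys


flags = [0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1]
print(run_keys(flags))  # [0, 1, 1, 1, 4, 5, 5, 7, 7, 9, 9, 9]
```

The printed list matches the key column of the table above. Since the question does not need the flag-0 rows, they can be filtered out afterwards and the remaining keys still identify each run uniquely.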
You can label each run with the largest index of a row where flag is 0 that is smaller than the index of the row in question.
Something like:
val flags = df.filter($"flag" === 0)
  .select($"index".as("flagIndex"))
// For every row, match all earlier flag-0 rows and keep the closest one
val indices = df.select("index")
  .join(flags, $"index" > $"flagIndex")
  .groupBy($"index")
  .agg(max($"flagIndex").as("groupKey"))
val dfWithGroups = df.join(indices, Seq("index"))

Calculate links between nodes using Spark

I have the following two DataFrames in Spark 2.2 and Scala 2.11. The DataFrame edges defines the edges of a directed graph, while the DataFrame types defines the type of each node.
edges =
+-----+-----+----+
|from |to |attr|
+-----+-----+----+
| 1| 0| 1|
| 1| 4| 1|
| 2| 2| 1|
| 4| 3| 1|
| 4| 5| 1|
+-----+-----+----+
types =
+------+---------+
|nodeId|type |
+------+---------+
| 0| 0|
| 1| 0|
| 2| 2|
| 3| 4|
| 4| 4|
| 5| 4|
+------+---------+
For each node, I want to know the number of edges to nodes of the same type. Please note that I only want to count the edges outgoing from a node, since I am dealing with a directed graph.
In order to reach this objective, I performed the joining of both DataFrames:
val graphDF = edges
.join(types, types("nodeId") === edges("from"), "left")
.drop("nodeId")
.withColumnRenamed("type","type_from")
.join(types, types("nodeId") === edges("to"), "left")
.drop("nodeId")
.withColumnRenamed("type","type_to")
I obtained the following new DataFrame graphDF:
+-----+-----+----+---------------+---------------+
|from |to |attr|type_from |type_to |
+-----+-----+----+---------------+---------------+
| 1| 0| 1| 0| 0|
| 1| 4| 1| 0| 4|
| 2| 2| 1| 2| 2|
| 4| 3| 1| 4| 4|
| 4| 5| 1| 4| 4|
+-----+-----+----+---------------+---------------+
Now I need to get the following final result:
+------+---------+---------+
|nodeId|numLinks |type |
+------+---------+---------+
| 0| 0| 0|
| 1| 1| 0|
| 2| 0| 2|
| 3| 0| 4|
| 4| 2| 4|
| 5| 0| 4|
+------+---------+---------+
I was thinking about using groupBy and agg(count(...)), but I do not know how to deal with directed edges.
Update:
numLinks is calculated as the number of edges outgoing from a given node. For example, the node 5 does not have any outgoing edges (only ingoing edge 4->5, see the DataFrame edges). The same refers to the node 0. But the node 4 has two outgoing edges (4->3 and 4->5).
My solution:
This is my solution, but it lacks those nodes that have 0 links.
graphDF.filter("from != to").filter("type_from == type_to").groupBy("from").agg(count("from") as "numLinks").show()
You can filter, aggregate by id and type and add missing nodes using types:
val graphDF = Seq(
(1, 0, 1, 0, 0), (1, 4, 1, 0, 4), (2, 2, 1, 2, 2),
(4, 3, 1, 4, 4), (4, 5, 1, 4, 4)
).toDF("from", "to", "attr", "type_from", "type_to")
val types = Seq(
(0, 0), (1, 0), (2, 2), (3, 4), (4,4), (5, 4)
).toDF("nodeId", "type")
graphDF
// I want to know the number of edges to the nodes of the same type
.where($"type_from" === $"type_to")
// I only want to count the edges outgoing from a node,
.groupBy($"from" as "nodeId", $"type_from" as "type")
.agg(count("*") as "numLinks")
// but it lacks those nodes that have 0 links.
.join(types, Seq("nodeId", "type"), "rightouter")
.na.fill(0)
// +------+----+--------+
// |nodeId|type|numLinks|
// +------+----+--------+
// | 0| 0| 0|
// | 1| 0| 1|
// | 2| 2| 1|
// | 3| 4| 0|
// | 4| 4| 2|
// | 5| 4| 0|
// +------+----+--------+
To skip self-links add $"from" =!= $"to" to the selection:
graphDF
.where($"type_from" === $"type_to" && $"from" =!= $"to")
.groupBy($"from" as "nodeId", $"type_from" as "type")
.agg(count("*") as "numLinks")
.join(types, Seq("nodeId", "type"), "rightouter")
.na.fill(0)
// +------+----+--------+
// |nodeId|type|numLinks|
// +------+----+--------+
// | 0| 0| 0|
// | 1| 0| 1|
// | 2| 2| 0|
// | 3| 4| 0|
// | 4| 4| 2|
// | 5| 4| 0|
// +------+----+--------+
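
For readers who want to check the counting rule without a Spark session, here is a small plain-Python sketch of the same filter, group and fill-with-zero logic. The function count_links and its skip_self_links parameter are illustrative names only; the edge and type data come from the question.

```python
from collections import Counter

edges = [(1, 0), (1, 4), (2, 2), (4, 3), (4, 5)]   # (from, to)
node_type = {0: 0, 1: 0, 2: 2, 3: 4, 4: 4, 5: 4}   # nodeId -> type


def count_links(edges, node_type, skip_self_links=True):
    # Count outgoing edges whose endpoints share a type (the where + groupBy step).
    counts = Counter(
        src
        for src, dst in edges
        if node_type[src] == node_type[dst]
        and not (skip_self_links and src == dst)
    )
    # Iterating over every node plays the role of the right outer join + na.fill(0).
    return {node: counts.get(node, 0) for node in node_type}


print(count_links(edges, node_type))                         # self-links skipped
print(count_links(edges, node_type, skip_self_links=False))  # self-links counted
```

Only the source node of each edge is counted, which is how the direction of the edges is respected.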

Pivot scala dataframe with conditional counting

I would like to aggregate this DataFrame and count the number of observations with a value less than or equal to the "BUCKET" field for each level. For example:
val myDF = Seq(
("foo", 0),
("foo", 0),
("bar", 0),
("foo", 1),
("foo", 1),
("bar", 1),
("foo", 2),
("bar", 2),
("foo", 3),
("bar", 3)).toDF("COL1", "BUCKET")
myDF.show
+----+------+
|COL1|BUCKET|
+----+------+
| foo| 0|
| foo| 0|
| bar| 0|
| foo| 1|
| foo| 1|
| bar| 1|
| foo| 2|
| bar| 2|
| foo| 3|
| bar| 3|
+----+------+
I can count the number of observations matching each bucket value using this code:
myDF.groupBy("COL1").pivot("BUCKET").count.show
+----+---+---+---+---+
|COL1| 0| 1| 2| 3|
+----+---+---+---+---+
| bar| 1| 1| 1| 1|
| foo| 2| 2| 1| 1|
+----+---+---+---+---+
But I want to count the number of rows with a value in the "BUCKET" field which is less than or equal to the final header after pivoting, like this:
+----+---+---+---+---+
|COL1| 0| 1| 2| 3|
+----+---+---+---+---+
| bar| 1| 2| 3| 4|
| foo| 2| 4| 5| 6|
+----+---+---+---+---+
You can achieve this using a window function, as follows:
import org.apache.spark.sql.expressions.Window.partitionBy
import org.apache.spark.sql.functions.first
myDF.
select(
$"COL1",
$"BUCKET",
count($"BUCKET").over(partitionBy($"COL1").orderBy($"BUCKET")).as("ROLLING_COUNT")).
groupBy($"COL1").pivot("BUCKET").agg(first("ROLLING_COUNT")).
show()
+----+---+---+---+---+
|COL1| 0| 1| 2| 3|
+----+---+---+---+---+
| bar| 1| 2| 3| 4|
| foo| 2| 4| 5| 6|
+----+---+---+---+---+
What you are specifying here is that you want to count your observations within windows partitioned by a key (COL1 in this case). By also specifying an ordering, you make the count a rolling one over the window, which yields the values that are then pivoted into the final result.
This is the result of applying the window function:
myDF.
select(
$"COL1",
$"BUCKET",
count($"BUCKET").over(partitionBy($"COL1").orderBy($"BUCKET")).as("ROLLING_COUNT")).
show()
+----+------+-------------+
|COL1|BUCKET|ROLLING_COUNT|
+----+------+-------------+
| bar| 0| 1|
| bar| 1| 2|
| bar| 2| 3|
| bar| 3| 4|
| foo| 0| 2|
| foo| 0| 2|
| foo| 1| 4|
| foo| 1| 4|
| foo| 2| 5|
| foo| 3| 6|
+----+------+-------------+
Finally, by grouping by COL1, pivoting over BUCKET and taking only the first rolling count in each cell (any one would do, since rows with the same COL1 and BUCKET share the same rolling count), you obtain the result you were looking for.
In a way, window functions are very similar to aggregations over groupings, but are more flexible and powerful. This just scratches the surface of window functions, and you can dig a little deeper by having a look at this introductory reading.
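
As a rough illustration of what the ordered window computes, the following plain-Python sketch counts, per key, the rows whose bucket is less than or equal to each distinct bucket value. It is a model of the semantics only, not of how Spark executes the window; rolling_counts is a made-up helper name.

```python
from collections import defaultdict

rows = [("foo", 0), ("foo", 0), ("bar", 0), ("foo", 1), ("foo", 1),
        ("bar", 1), ("foo", 2), ("bar", 2), ("foo", 3), ("bar", 3)]


def rolling_counts(rows):
    # Partition the bucket values by key (the partitionBy step).
    by_key = defaultdict(list)
    for key, bucket in rows:
        by_key[key].append(bucket)
    result = {}
    for key, buckets in by_key.items():
        # For each distinct bucket b, count the rows in the partition with bucket <= b.
        result[key] = {
            b: sum(1 for x in buckets if x <= b) for b in sorted(set(buckets))
        }
    return result


print(rolling_counts(rows))
```

Rows that tie on the bucket value get the same count here, which is why taking first("ROLLING_COUNT") after the pivot is safe.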
Here's one approach to get the rolling counts by traversing the pivoted BUCKET value columns using foldLeft to aggregate the counts. Note that a tuple of (DataFrame, Int) is used for foldLeft to transform the DataFrame as well as store the count in the previous iteration:
val pivotedDF = myDF.groupBy($"COL1").pivot("BUCKET").count
val buckets = pivotedDF.columns.filter(_ != "COL1")
buckets.drop(1).foldLeft((pivotedDF, buckets.head))( (acc, c) =>
( acc._1.withColumn(c, col(acc._2) + col(c)), c )
)._1.show
// +----+---+---+---+---+
// |COL1| 0| 1| 2| 3|
// +----+---+---+---+---+
// | bar| 1| 2| 3| 4|
// | foo| 2| 4| 5| 6|
// +----+---+---+---+---+
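
The foldLeft above is effectively a running sum across the pivoted columns. A minimal plain-Python sketch of that idea, assuming the pivoted counts are already available as lists ordered by bucket (the pivoted dictionary below is copied by hand from the question's pivot output):

```python
from itertools import accumulate

# Per-key counts for buckets 0..3, as produced by groupBy(...).pivot(...).count
pivoted = {"bar": [1, 1, 1, 1], "foo": [2, 2, 1, 1]}

# Each column becomes the sum of itself and all columns to its left.
cumulative = {key: list(accumulate(counts)) for key, counts in pivoted.items()}
print(cumulative)
```

This mirrors how each iteration of the foldLeft adds the previously updated column to the current one.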

E-num / get Dummies in pyspark

I would like to create a function in PySpark that takes a DataFrame and a list of parameters (codes/categorical features) and returns the DataFrame with additional dummy columns, one for each category of the features in the list.
Please find attached the before and after DataFrames:
[Image: before and after DataFrame example]
The code in Python looks like this:
import pandas as pd
enum = ['column1', 'column2']
for e in enum:
    print(e)
    # One dummy column per category of e (the first category is dropped)
    temp = pd.get_dummies(data[e], drop_first=True, prefix=e)
    data = pd.concat([data, temp], axis=1)
    data.drop(e, axis=1, inplace=True)
data.to_csv('enum_data.csv')
First you need to collect the distinct values of TYPE and CODE. Then either add a column named after each value using withColumn, or build all the columns in a single select.
Here is sample code using a select statement:
import pyspark.sql.functions as F
df = sqlContext.createDataFrame([
(1, "A", "X1"),
(2, "B", "X2"),
(3, "B", "X3"),
(1, "B", "X3"),
(2, "C", "X2"),
(3, "C", "X2"),
(1, "C", "X1"),
(1, "B", "X1"),
], ["ID", "TYPE", "CODE"])
types = df.select("TYPE").distinct().rdd.flatMap(lambda x: x).collect()
codes = df.select("CODE").distinct().rdd.flatMap(lambda x: x).collect()
types_expr = [F.when(F.col("TYPE") == ty, 1).otherwise(0).alias("e_TYPE_" + ty) for ty in types]
codes_expr = [F.when(F.col("CODE") == code, 1).otherwise(0).alias("e_CODE_" + code) for code in codes]
df = df.select("ID", "TYPE", "CODE", *types_expr+codes_expr)
df.show()
OUTPUT
+---+----+----+--------+--------+--------+---------+---------+---------+
| ID|TYPE|CODE|e_TYPE_A|e_TYPE_B|e_TYPE_C|e_CODE_X1|e_CODE_X2|e_CODE_X3|
+---+----+----+--------+--------+--------+---------+---------+---------+
| 1| A| X1| 1| 0| 0| 1| 0| 0|
| 2| B| X2| 0| 1| 0| 0| 1| 0|
| 3| B| X3| 0| 1| 0| 0| 0| 1|
| 1| B| X3| 0| 1| 0| 0| 0| 1|
| 2| C| X2| 0| 0| 1| 0| 1| 0|
| 3| C| X2| 0| 0| 1| 0| 1| 0|
| 1| C| X1| 0| 0| 1| 1| 0| 0|
| 1| B| X1| 0| 1| 0| 1| 0| 0|
+---+----+----+--------+--------+--------+---------+---------+---------+
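
The same collect-distinct-then-compare idea can be sketched in plain Python, which may help when reasoning about the expected columns before running it on a cluster. This is a simplified model without Spark; get_dummies here is a hypothetical helper (not the pandas function), and only the first four rows of the sample data are used.

```python
rows = [(1, "A", "X1"), (2, "B", "X2"), (3, "B", "X3"), (1, "B", "X3")]
columns = ["ID", "TYPE", "CODE"]


def get_dummies(rows, columns, encode):
    records = [dict(zip(columns, row)) for row in rows]
    for col in encode:
        # Collect the distinct values first (the distinct().collect() step).
        values = sorted({rec[col] for rec in records})
        for rec in records:
            for v in values:
                # One 0/1 indicator per (column, value) pair, like when(...).otherwise(0).
                rec["e_{0}_{1}".format(col, v)] = 1 if rec[col] == v else 0
    return records


encoded = get_dummies(rows, columns, ["TYPE", "CODE"])
for rec in encoded:
    print(rec)
```

Unlike pd.get_dummies(..., drop_first=True) in the question, neither this sketch nor the select-based answer drops the first category.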
The solutions provided by Freek Wiemkeijer and Rakesh Kumar are perfectly adequate; however, since I had already coded it up, I thought it was worth posting this generic solution, as it doesn't require hard-coding the column names.
pivot_cols = ['TYPE','CODE']
keys = ['ID','TYPE','CODE']
before = sc.parallelize([(1,'A','X1'),
(2,'B','X2'),
(3,'B','X3'),
(1,'B','X3'),
(2,'C','X2'),
(3,'C','X2'),
(1,'C','X1'),
(1,'B','X1')]).toDF(['ID','TYPE','CODE'])
# Helper function to recursively join a list of dataframes
# Can be simplified if you only need two columns
def join_all(dfs, keys):
    if len(dfs) > 1:
        return dfs[0].join(join_all(dfs[1:], keys), on=keys, how='inner')
    else:
        return dfs[0]

combined = []
for pivot_col in pivot_cols:
    pivotDF = before.groupBy(keys).pivot(pivot_col).count()
    new_names = pivotDF.columns[:len(keys)] + ["e_{0}_{1}".format(pivot_col, c) for c in pivotDF.columns[len(keys):]]
    df = pivotDF.toDF(*new_names).fillna(0)
    combined.append(df)
join_all(combined, keys).show()
This gives as output:
+---+----+----+--------+--------+--------+---------+---------+---------+
| ID|TYPE|CODE|e_TYPE_A|e_TYPE_B|e_TYPE_C|e_CODE_X1|e_CODE_X2|e_CODE_X3|
+---+----+----+--------+--------+--------+---------+---------+---------+
| 1| A| X1| 1| 0| 0| 1| 0| 0|
| 2| C| X2| 0| 0| 1| 0| 1| 0|
| 3| B| X3| 0| 1| 0| 0| 0| 1|
| 2| B| X2| 0| 1| 0| 0| 1| 0|
| 3| C| X2| 0| 0| 1| 0| 1| 0|
| 1| B| X3| 0| 1| 0| 0| 0| 1|
| 1| B| X1| 0| 1| 0| 1| 0| 0|
| 1| C| X1| 0| 0| 1| 1| 0| 0|
+---+----+----+--------+--------+--------+---------+---------+---------+
I was looking for the same solution but in Scala; maybe this will help someone:
val list = df.select("category").distinct().rdd.map(r => r(0)).collect()
val oneHotDf = list.foldLeft(df)((acc, category) => acc.withColumn("category_" + category, when(col("category") === category, 1).otherwise(0)))
If you'd like a PySpark version of the pandas "pd.get_dummies" function, you can use the following function:
import itertools
import pyspark.sql.functions as F

def spark_get_dummies(df):
    # Collect the distinct values of every column
    categories = []
    for column in df.columns:
        categories.append(df.select(column).distinct().rdd.flatMap(lambda x: x).collect())
    # Build one 0/1 indicator expression per (column, value) pair
    expressions = []
    for i, column in enumerate(df.columns):
        expressions.append([F.when(F.col(column) == value, 1).otherwise(0).alias(str(column) + "_" + str(value)) for value in categories[i]])
    expressions_flat = list(itertools.chain.from_iterable(expressions))
    df_final = df.select(*expressions_flat)
    return df_final
The reproducible example is:
df = sqlContext.createDataFrame([
("A", "X1"),
("B", "X2"),
("B", "X3"),
("B", "X3"),
("C", "X2"),
("C", "X2"),
("C", "X1"),
("B", "X1"),
], ["TYPE", "CODE"])
dummies_df = spark_get_dummies(df)
dummies_df.show()
You will get:
The first step is to make a DataFrame from your CSV file.
See "Get CSV to Spark dataframe"; the first answer gives a line-by-line example.
Then you can add the columns. Assume you have a DataFrame object called df, and the columns are: [ID, TYPE, CODE].
The rest can be done with DataFrame.withColumn() and pyspark.sql.functions.when:
from pyspark.sql.functions import when
df_with_extra_columns = (
    df.withColumn("e_TYPE_A", when(df.TYPE == "A", 1).otherwise(0))
      .withColumn("e_TYPE_B", when(df.TYPE == "B", 1).otherwise(0))
)
(This adds the first two columns; the remaining ones follow the same pattern.)

Filtering rows based on subsequent row values in a Spark DataFrame

I have a dataframe(spark):
id value
3 0
3 1
3 0
4 1
4 0
4 0
I want to create a new dataframe:
3 0
3 1
4 1
I need to remove all the rows after the first 1 (value) for each id. I tried window functions on a Spark DataFrame (Scala) but couldn't find a solution. It seems I am going in the wrong direction.
I am looking for a solution in Scala. Thanks.
Output using monotonically_increasing_id
scala> val data = Seq((3,0),(3,1),(3,0),(4,1),(4,0),(4,0)).toDF("id", "value")
data: org.apache.spark.sql.DataFrame = [id: int, value: int]
scala> val dataWithIndex = data.withColumn("idx", monotonically_increasing_id())
scala> val minIdx = dataWithIndex.filter($"value" === 1).groupBy($"id").agg(min($"idx")).toDF("r_id", "min_idx")
minIdx: org.apache.spark.sql.DataFrame = [r_id: int, min_idx: bigint]
scala> dataWithIndex.join(minIdx,($"r_id" === $"id") && ($"idx" <= $"min_idx")).select($"id", $"value").show
+---+-----+
| id|value|
+---+-----+
| 3| 0|
| 3| 1|
| 4| 1|
+---+-----+
This solution won't work if a sort transformation was applied to the original DataFrame: in that case monotonically_increasing_id() is generated based on the original DataFrame rather than the sorted one. I missed that requirement earlier.
All suggestions are welcome.
One way is to use monotonically_increasing_id() and a self-join:
val data = Seq((3,0),(3,1),(3,0),(4,1),(4,0),(4,0)).toDF("id", "value")
data.show
+---+-----+
| id|value|
+---+-----+
| 3| 0|
| 3| 1|
| 3| 0|
| 4| 1|
| 4| 0|
| 4| 0|
+---+-----+
Now we generate a column named idx with an increasing Long:
val dataWithIndex = data.withColumn("idx", monotonically_increasing_id())
// dataWithIndex.cache()
Now we get the min(idx) for each id where value = 1:
val minIdx = dataWithIndex
.filter($"value" === 1)
.groupBy($"id")
.agg(min($"idx"))
.toDF("r_id", "min_idx")
Now we join the min(idx) back to the original DataFrame:
dataWithIndex.join(
minIdx,
($"r_id" === $"id") && ($"idx" <= $"min_idx")
).select($"id", $"value").show
+---+-----+
| id|value|
+---+-----+
| 3| 0|
| 3| 1|
| 4| 1|
+---+-----+
Note: monotonically_increasing_id() generates its value based on the partition of the row. This value may change each time dataWithIndex is re-evaluated. In my code above, because of lazy evaluation, it's only when I call the final show that monotonically_increasing_id() is evaluated.
If you want to force the value to stay the same, for example so you can use show to evaluate the above step-by-step, uncomment this line above:
// dataWithIndex.cache()
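
To see the min(idx) plus self-join logic in isolation, here is a plain-Python sketch that uses the list position as the index. It is not Spark code; keep_until_first_one is an illustrative name, and it assumes the rows are already in the order that matters.

```python
def keep_until_first_one(rows):
    """rows: list of (id, value) tuples in their original order."""
    # Index of the first row with value == 1, per id (the min(idx) step).
    first_one = {}
    for idx, (row_id, value) in enumerate(rows):
        if value == 1 and row_id not in first_one:
            first_one[row_id] = idx
    # Keep rows at or before that index (the join condition idx <= min_idx).
    return [
        (row_id, value)
        for idx, (row_id, value) in enumerate(rows)
        if row_id in first_one and idx <= first_one[row_id]
    ]


data = [(3, 0), (3, 1), (3, 0), (4, 1), (4, 0), (4, 0)]
print(keep_until_first_one(data))  # [(3, 0), (3, 1), (4, 1)]
```

As with the inner join in the Spark version, an id that never has value 1 is dropped entirely.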
Hi, I found a solution using Window and a self-join.
val data = Seq((3,0,2),(3,1,3),(3,0,1),(4,1,6),(4,0,5),(4,0,4),(1,0,7),(1,1,8),(1,0,9),(2,1,10),(2,0,11),(2,0,12)).toDF("id", "value","sorted")
scala> data.show
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 3| 0| 2|
| 3| 1| 3|
| 3| 0| 1|
| 4| 1| 6|
| 4| 0| 5|
| 4| 0| 4|
| 1| 0| 7|
| 1| 1| 8|
| 1| 0| 9|
| 2| 1| 10|
| 2| 0| 11|
| 2| 0| 12|
+---+-----+------+
val sort_df=data.sort($"sorted")
scala> sort_df.show
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 3| 0| 1|
| 3| 0| 2|
| 3| 1| 3|
| 4| 0| 4|
| 4| 0| 5|
| 4| 1| 6|
| 1| 0| 7|
| 1| 1| 8|
| 1| 0| 9|
| 2| 1| 10|
| 2| 0| 11|
| 2| 0| 12|
+---+-----+------+
import org.apache.spark.sql.expressions.Window
val window = Window.partitionBy("id").orderBy($"sorted")
val sort_idx = sort_df.select($"*", row_number().over(window).as("count_index"))
val minIdx=sort_idx.filter($"value"===1).groupBy("id").agg(min("count_index")).toDF("idx","min_idx")
val result_id=sort_idx.join(minIdx,($"id"===$"idx") &&($"count_index" <= $"min_idx"))
result_id.show
+---+-----+------+-----------+---+-------+
| id|value|sorted|count_index|idx|min_idx|
+---+-----+------+-----------+---+-------+
| 1| 0| 7| 1| 1| 2|
| 1| 1| 8| 2| 1| 2|
| 2| 1| 10| 1| 2| 1|
| 3| 0| 1| 1| 3| 3|
| 3| 0| 2| 2| 3| 3|
| 3| 1| 3| 3| 3| 3|
| 4| 0| 4| 1| 4| 3|
| 4| 0| 5| 2| 4| 3|
| 4| 1| 6| 3| 4| 3|
+---+-----+------+-----------+---+-------+
I am still looking for a more optimized solution. Thanks.
You can simply use groupBy like this:
val df2 = df1.groupBy("id","value").count().select("id","value")
Here your df1 is
id value
3 0
3 1
3 0
4 1
4 0
4 0
The resulting DataFrame df2 contains the distinct (id, value) pairs. Note that it still includes the row (4, 0), which the question's expected output excludes:
id value
3 0
3 1
4 1
4 0
Use the isin method with filter, as below:
val data = Seq((3,0,2),(3,1,3),(3,0,1),(4,1,6),(4,0,5),(4,0,4),(1,0,7),(1,1,8),(1,0,9),(2,1,10),(2,0,11),(2,0,12)).toDF("id", "value","sorted")
val idFilter = List(1, 2)
data.filter($"id".isin(idFilter:_*)).show
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 1| 0| 7|
| 1| 1| 8|
| 1| 0| 9|
| 2| 1| 10|
| 2| 0| 11|
| 2| 0| 12|
+---+-----+------+
Example: filtering on the value column
val valFilter = List(0)
data.filter($"value".isin(valFilter:_*)).show
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 3| 0| 2|
| 3| 0| 1|
| 4| 0| 5|
| 4| 0| 4|
| 1| 0| 7|
| 1| 0| 9|
| 2| 0| 11|
| 2| 0| 12|
+---+-----+------+