Given any df, I want to add a column called "has_duplicates" that indicates, for each row, whether that row is unique (i.e. whether it appears more than once). Example input df:
val df = Seq((1, 2), (2, 5), (1, 7), (1, 2), (2, 5)).toDF("A", "B")
Given an input columns: Seq[String], I know how to get the count of each row:
val countsDf = df.withColumn("count", count("*").over(Window.partitionBy(columns.map(col(_)): _*)))
But I'm not sure how to use this to create a column expression for the final column indicating whether each row is unique.
Something like
def getEvaluationExpression(df: DataFrame): Column = {
  when(col("count") > 1, lit("fail")).otherwise(lit("pass"))
}
but the count needs to be evaluated on the spot using the query above.
Try the code below.
scala> df.withColumn("has_duplicates", when(count("*").over(Window.partitionBy(df.columns.map(col(_)): _*)) > 1 , lit("fail")).otherwise("pass")).show(false)
+---+---+--------------+
|A |B |has_duplicates|
+---+---+--------------+
|1 |7 |pass |
|1 |2 |fail |
|1 |2 |fail |
|2 |5 |fail |
|2 |5 |fail |
+---+---+--------------+
Or
scala> df.withColumn("count",count("*").over(Window.partitionBy(df.columns.map(col(_)): _*))).withColumn("has_duplicates", when($"count" > 1 , lit("fail")).otherwise("pass")).show(false)
+---+---+-----+--------------+
|A |B |count|has_duplicates|
+---+---+-----+--------------+
|1 |7 |1 |pass |
|1 |2 |2 |fail |
|1 |2 |2 |fail |
|2 |5 |2 |fail |
|2 |5 |2 |fail |
+---+---+-----+--------------+
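If you want to keep the helper shape from the question, here is a minimal sketch (assuming the key columns are passed in as columns: Seq[String], as in the question; the usage line is hypothetical):
import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Returns a Column expression; the window count is evaluated on the spot,
// so no intermediate "count" column is needed.
def getEvaluationExpression(columns: Seq[String]): Column = {
  val w = Window.partitionBy(columns.map(col(_)): _*)
  when(count("*").over(w) > 1, lit("fail")).otherwise(lit("pass"))
}

// hypothetical usage:
// df.withColumn("has_duplicates", getEvaluationExpression(df.columns.toSeq))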
Related
I have a problem joining 2 dataframes grouped by ID.
val df1 = Seq(
(1, 1,100),
(1, 3,20),
(2, 5,5),
(2, 2,10)).toDF("id", "index","value")
val df2 = Seq(
(1, 0),
(2, 0),
(3, 0),
(4, 0),
(5,0)).toDF("index", "value")
df1 should be joined with df2 on the index column for every id.
Expected result:
+---+-----+-----+
|id |index|value|
+---+-----+-----+
|1  |1    |100  |
|1  |2    |0    |
|1  |3    |20   |
|1  |4    |0    |
|1  |5    |0    |
|2  |1    |0    |
|2  |2    |10   |
|2  |3    |0    |
|2  |4    |0    |
|2  |5    |5    |
+---+-----+-----+
Please help me with this.
First of all, I would replace your df2 table with this:
var df2 = Seq(
(Array(1, 2), Array(1, 2, 3, 4, 5))
).toDF("id", "index")
This allows us to use explode and auto-generate a table which can be of help to us:
df2 = df2
.withColumn("id", explode(col("id")))
.withColumn("index", explode(col("index")))
and it gives:
+---+-----+
|id |index|
+---+-----+
|1 |1 |
|1 |2 |
|1 |3 |
|1 |4 |
|1 |5 |
|2 |1 |
|2 |2 |
|2 |3 |
|2 |4 |
|2 |5 |
+---+-----+
Now, all we need to do is join with your df1 as below:
df2 = df2
.join(df1, Seq("id", "index"), "left")
.withColumn("value", when(col("value").isNull, 0).otherwise(col("value")))
And we get this final output:
+---+-----+-----+
|id |index|value|
+---+-----+-----+
|1 |1 |100 |
|1 |2 |0 |
|1 |3 |20 |
|1 |4 |0 |
|1 |5 |0 |
|2 |1 |0 |
|2 |2 |10 |
|2 |3 |0 |
|2 |4 |0 |
|2 |5 |5 |
+---+-----+-----+
which should be what you want. Good luck!
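If you prefer not to hard-code the arrays, a possible alternative sketch (assuming the original df1 and df2 from the question) builds the full (id, index) grid with a crossJoin and then does the same left join:
// every (id, index) combination, derived from the data itself
val grid = df1.select("id").distinct().crossJoin(df2.select("index"))

// left-join the known values and fill the gaps with 0
val result = grid
  .join(df1, Seq("id", "index"), "left")
  .na.fill(0, Seq("value"))
  .orderBy("id", "index")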
I'm trying to replace null or invalid values in a column with the nearest non-null value above or below in the same column. For example:
Name|Place|row_count
a |a1 |1
a |a2 |2
a |a2 |3
|d1 |4
b |a2 |5
c |a2 |6
| |7
| |8
d |c1 |9
In this case, I want to replace the NULL values in the "Name" column: the 1st NULL should be replaced with 'a' and the 2nd NULL with 'c'. In the "Place" column, the NULL should be replaced with 'a2'.
When we replace the NULL in the 8th cell of the 'Place' column, it should likewise be replaced with its nearest non-null value, 'a2'.
Required Result:
If we select the NULL in the 8th cell of the 'Place' column for replacement, then the result will be:
Name|Place|row_count
a |a1 |1
a |a2 |2
a |a2 |3
|d1 |4
b |a2 |5
c |a2 |6
| |7
|a2 |8
d |c1 |9
If we select the NULL in the 4th cell of the 'Name' column for replacement, then the result will be:
Name|Place|row_count
a |a1 |1
a |a2 |2
a |a2 |3
a |d1 |4
b |a2 |5
c |a2 |6
| |7
| |8
d |c1 |9
Window functions come in handy to solve this. For the sake of simplicity, I'm focusing on just the name column. If the previous row has null, I'm using the next row's value; you can change this order according to your need. The same approach needs to be applied to the other columns as well (see the sketch after the output below).
import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df = Seq(("a", "a1", "1"),
("a", "a2", "2"),
("a", "a2", "3"),
("d1", null, "4"),
("b", "a2", "5"),
("c", "a2", "6"),
(null, null, "7"),
(null, null, "8"),
("d", "c1", "9")).toDF("name", "place", "row_count")
val window = Window.orderBy("row_count")
val lagNameWindowExpression = lag('name, 1).over(window)
val leadNameWindowExpression = lead('name, 1).over(window)
val nameConditionExpression = when($"name".isNull.and('previous_name_col.isNull), 'next_name_col)
.when($"name".isNull.and('previous_name_col.isNotNull), 'previous_name_col).otherwise($"name")
df.select($"*", lagNameWindowExpression as 'previous_name_col, leadNameWindowExpression as 'next_name_col)
.withColumn("name", nameConditionExpression).drop("previous_name_col", "next_name_col")
.show(false)
Output
+----+-----+---------+
|name|place|row_count|
+----+-----+---------+
|a |a1 |1 |
|a |a2 |2 |
|a |a2 |3 |
|a |d1 |4 |
|b |a2 |5 |
|c |a2 |6 |
|c |null |7 |
|d |null |8 |
|d |c1 |9 |
+----+-----+---------+
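For completeness, a sketch of the same lag/lead pattern applied to the place column (same window ordering and the same conditional logic, only the column and the illustrative helper names swapped):
val lagPlaceWindowExpression = lag('place, 1).over(window)
val leadPlaceWindowExpression = lead('place, 1).over(window)
val placeConditionExpression = when($"place".isNull.and('previous_place_col.isNull), 'next_place_col)
  .when($"place".isNull.and('previous_place_col.isNotNull), 'previous_place_col).otherwise($"place")

df.select($"*", lagPlaceWindowExpression as 'previous_place_col, leadPlaceWindowExpression as 'next_place_col)
  .withColumn("place", placeConditionExpression).drop("previous_place_col", "next_place_col")
  .show(false)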
I have the input data set like:
id operation value
1 null 1
1 discard 0
2 null 1
2 null 2
2 max 0
3 null 1
3 null 1
3 list 0
I want to group the input and produce rows according to the "operation" column:
for group 1, operation="discard", so the output is empty (the group is dropped);
for group 2, operation="max", so the output is:
2 null 2
for group 3, operation="list", so the output is:
3 null 1
3 null 1
So finally the output is like:
id operation value
2 null 2
3 null 1
3 null 1
Is there a solution for this?
I know there is a similar question how-to-iterate-grouped-data-in-spark
But the differences compared to that are:
I want to produce more than one row for each group. Is that possible, and how?
I want my logic to be easily extended as more operations are added in the future. Are user-defined aggregate functions (UDAFs) the only possible solution?
Update 1:
Thanks stack0114106. Here are more details following his answer: e.g. for id=1, operation="max", I want to iterate over all the items with id=1 and find the max value, rather than assign a hard-coded value; that's why I want to iterate the rows in each group. Below is an updated example:
The input:
scala> val df = Seq((0,null,1),(0,"discard",0),(1,null,1),(1,null,2),(1,"max",0),(2,null,1),(2,null,3),(2,"max",0),(3,null,1),(3,null,1),(3,"list",0)).toDF("id","operation","value")
df: org.apache.spark.sql.DataFrame = [id: int, operation: string ... 1 more field]
scala> df.show(false)
+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|0 |null |1 |
|0 |discard |0 |
|1 |null |1 |
|1 |null |2 |
|1 |max |0 |
|2 |null |1 |
|2 |null |3 |
|2 |max |0 |
|3 |null |1 |
|3 |null |1 |
|3 |list |0 |
+---+---------+-----+
The expected output:
+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|1 |null |2 |
|2 |null |3 |
|3 |null |1 |
|3 |null |1 |
+---+---------+-----+
Group everything, collecting the values, then write logic for each operation:
import spark.implicits._
import org.apache.spark.sql.functions._
val grouped=df.groupBy($"id").agg(max($"operation").as("op"),collect_list($"value").as("vals"))
val maxs=grouped.filter($"op"==="max").withColumn("val",explode($"vals")).groupBy($"id").agg(max("val").as("value"))
val lists=grouped.filter($"op"==="list").withColumn("value",explode($"vals")).filter($"value"!==0).select($"id",$"value")
//we don't collect the "discard"
//and we can add additional subsets for new "operations"
val result=maxs.union(lists)
//if you need the null in "operation" column add it with withColumn
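For example, a minimal sketch of that last note (the finalResult name is just for illustration; the column order is restored to match the input):
val finalResult = result
  .withColumn("operation", lit(null).cast("string"))  // put the null operation column back
  .select("id", "operation", "value")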
You can use the flatMap operation on the dataframe and generate the required rows based on the conditions that you mentioned. Check this out:
scala> val df = Seq((1,null,1),(1,"discard",0),(2,null,1),(2,null,2),(2,"max",0),(3,null,1),(3,null,1),(3,"list",0)).toDF("id","operation","value")
df: org.apache.spark.sql.DataFrame = [id: int, operation: string ... 1 more field]
scala> df.show(false)
+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|1 |null |1 |
|1 |discard |0 |
|2 |null |1 |
|2 |null |2 |
|2 |max |0 |
|3 |null |1 |
|3 |null |1 |
|3 |list |0 |
+---+---------+-----+
scala> df.filter("operation is not null").flatMap( r=> { val x=r.getString(1); val s = x match { case "discard" => (0,0) case "max" => (1,2) case "list" => (2,1) } ; (0
until s._1).map( i => (r.getInt(0),null,s._2) ) }).show(false)
+---+----+---+
|_1 |_2 |_3 |
+---+----+---+
|2 |null|2 |
|3 |null|1 |
|3 |null|1 |
+---+----+---+
Spark assigns _1, _2, etc., so you can map them to actual names as below:
scala> val df2 = df.filter("operation is not null").flatMap( r=> { val x=r.getString(1); val s = x match { case "discard" => (0,0) case "max" => (1,2) case "list" => (2,1) } ; (0 until s._1).map( i => (r.getInt(0),null,s._2) ) }).toDF("id","operation","value")
df2: org.apache.spark.sql.DataFrame = [id: int, operation: null ... 1 more field]
scala> df2.show(false)
+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|2 |null |2 |
|3 |null |1 |
|3 |null |1 |
+---+---------+-----+
scala>
EDIT1:
Since you need the max(value) for each id, you can use window functions and get the max value in a new column, then use the same technique and get the results. Check this out
scala> val df = Seq((0,null,1),(0,"discard",0),(1,null,1),(1,null,2),(1,"max",0),(2,null,1),(2,null,3),(2,"max",0),(3,null,1),(3,null,1),(3,"list",0)).toDF("id","operation","value")
df: org.apache.spark.sql.DataFrame = [id: int, operation: string ... 1 more field]
scala> df.createOrReplaceTempView("michael")
scala> val df2 = spark.sql(""" select *, max(value) over(partition by id) mx from michael """)
df2: org.apache.spark.sql.DataFrame = [id: int, operation: string ... 2 more fields]
scala> df2.show(false)
+---+---------+-----+---+
|id |operation|value|mx |
+---+---------+-----+---+
|1 |null |1 |2 |
|1 |null |2 |2 |
|1 |max |0 |2 |
|3 |null |1 |1 |
|3 |null |1 |1 |
|3 |list |0 |1 |
|2 |null |1 |3 |
|2 |null |3 |3 |
|2 |max |0 |3 |
|0 |null |1 |1 |
|0 |discard |0 |1 |
+---+---------+-----+---+
scala> val df3 = df2.filter("operation is not null").flatMap( r=> { val x=r.getString(1); val s = x match { case "discard" => 0 case "max" => 1 case "list" => 2 } ; (0 until s).map( i => (r.getInt(0),null,r.getInt(3) )) }).toDF("id","operation","value")
df3: org.apache.spark.sql.DataFrame = [id: int, operation: null ... 1 more field]
scala> df3.show(false)
+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|1 |null |2 |
|3 |null |1 |
|3 |null |1 |
|2 |null |3 |
+---+---------+-----+
scala>
I am learning Spark and Scala, and was experimenting in the spark REPL.
When I try to convert a List to a DataFrame, it works as follows:
val convertedDf = Seq(1,2,3,4).toDF("Field1")
However, when I try to convert a list of lists to a DataFrame with two columns (Field1, Field2):
val twoColumnDf = Seq(Seq(1,2,3,4,5), Seq(5,4,3,2,3)).toDF("Field1", "Field2")
it fails with the error message:
java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match
How to convert such a List of Lists to a DataFrame in Scala?
If you are seeking a way to have the elements of each sequence become the rows of their respective columns, then the following are options for you.
zip
zip both sequences and then apply toDF as
val twoColumnDf =Seq(1,2,3,4,5).zip(Seq(5,4,3,2,3)).toDF("Field1", "Field2")
which should give you twoColumnDf as
+------+------+
|Field1|Field2|
+------+------+
|1 |5 |
|2 |4 |
|3 |3 |
|4 |2 |
|5 |3 |
+------+------+
zipped
Another better way is to use zipped as
val threeColumnDf = (Seq(1,2,3,4,5), Seq(5,4,3,2,3), Seq(10,10,10,12,14)).zipped.toList.toDF("Field1", "Field2", "field3")
which should give you
+------+------+------+
|Field1|Field2|field3|
+------+------+------+
|1 |5 |10 |
|2 |4 |10 |
|3 |3 |10 |
|4 |2 |12 |
|5 |3 |14 |
+------+------+------+
But zipped works only for a maximum of three sequences. Thanks for pointing that out, @Shaido.
Note: the number of rows is determined by the shortest sequence present
transpose
transpose combines all sequences as zip and zipped do, but returns lists instead of tuples, so a little hacking is needed:
Seq(Seq(1,2,3,4,5), Seq(5,4,3,2,3)).transpose.map{case List(a,b) => (a, b)}.toDF("Field1", "Field2")
+------+------+
|Field1|Field2|
+------+------+
|1 |5 |
|2 |4 |
|3 |3 |
|4 |2 |
|5 |3 |
+------+------+
and
Seq(Seq(1,2,3,4,5), Seq(5,4,3,2,3), Seq(10,10,10,12,14)).transpose.map{case List(a,b,c) => (a, b, c)}.toDF("Field1", "Field2", "Field3")
+------+------+------+
|Field1|Field2|Field3|
+------+------+------+
|1 |5 |10 |
|2 |4 |10 |
|3 |3 |10 |
|4 |2 |12 |
|5 |3 |14 |
+------+------+------+
and so on ...
Note: transpose requires all sequences to be of same length
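If you need an arbitrary number of columns of the same type, a hedged sketch (assuming equal-length sequences, as transpose requires, and an active SparkSession named spark) builds the Rows and the schema explicitly:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// one inner Seq per column; transpose turns them into one Seq per row
val data = Seq(Seq(1,2,3,4,5), Seq(5,4,3,2,3), Seq(10,10,10,12,14))
val names = Seq("Field1", "Field2", "Field3")
val schema = StructType(names.map(n => StructField(n, IntegerType, nullable = false)))
val rows = data.transpose.map(r => Row.fromSeq(r))
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)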
I hope the answer is helpful
By default, each element is considered to be a Row of the DataFrame.
If you want each of the Seqs to be a different column, you need to group them inside a tuple:
val twoColumnDf =Seq((Seq(1,2,3,4,5), Seq(5,4,3,2,3))).toDF("Field1", "Field2")
twoColumnDf.show
+---------------+---------------+
| Field1| Field2|
+---------------+---------------+
|[1, 2, 3, 4, 5]|[5, 4, 3, 2, 3]|
+---------------+---------------+
I am trying to filter table rows based on a column value.
I have a dataframe:
+---+-----+
|id |value|
+---+-----+
|3 |0 |
|3 |1 |
|3 |0 |
|4 |1 |
|4 |0 |
|4 |0 |
+---+-----+
I want to create a new dataframe deleting all rows with value!=0:
+---+-----+
|id |value|
+---+-----+
|3 |0 |
|3 |0 |
|4 |0 |
|4 |0 |
+---+-----+
I figured the syntax should be something like this but couldn't get it right:
val newDataFrame = OldDataFrame.filter($"value"==0)
The correct way is as follows; you just forgot to add one = sign:
val newDataFrame = OldDataFrame.filter($"value" === 0)
There are various ways in which you can do the filtering:
val newDataFrame = OldDataFrame.filter($"value"===0)
val newDataFrame = OldDataFrame.filter(OldDataFrame("value") === 0)
val newDataFrame = OldDataFrame.filter("value === 0")
You can also use the where function instead of filter, as shown below.
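A minimal equivalent sketch (where is an alias for filter):
// where is an alias for filter, so this is equivalent to the examples above
val newDataFrame = OldDataFrame.where($"value" === 0)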