How to create a DataFrame with the row of the first date for each id, keeping the additional columns - Scala

I am trying to create a DataFrame with the following condition:
I have multiple IDs, multiple columns with defaults (0 or 1) and a start date column. For each id, I would like to get the defaults as they appear at the first start date (default_date).
The original df looks like this:
|id |def_a|def_b|deb_c|date |
| 01| 1| 0| 1| 2019-01-31|
| 02| 1| 1| 0| 2018-12-31|
| 03| 1| 1| 1| 2018-10-31|
| 01| 1| 0| 1| 2018-09-30|
| 02| 1| 1| 0| 2018-08-31|
| 03| 1| 1| 0| 2018-07-31|
| 03| 1| 1| 1| 2019-05-31|
This is how I would like to have it:
|id |def_a|def_b|deb_c|date |
| 01| 1| 0| 1| 2018-09-30|
| 02| 1| 1| 0| 2018-08-31|
| 03| 1| 1| 1| 2018-07-31|
I tried the following code:
val w = Window.partitionBy($"id").orderBy($"date".asc)
val reult = join3.withColumn("rn", row_number.over(w)).where($"def_a" === 1 || $"def_b" === 1 ||$"def_c" === 1).filter($"rn" >= 1).drop("rn")
I would be grateful for any help

This should work for you. First assign the min date to the original df then join the new df2 with df.
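import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.min
// sample data taken from the table in the question
val df = Seq(
  (1, 1, 0, 1, "2019-01-31"), (2, 1, 1, 0, "2018-12-31"), (3, 1, 1, 1, "2018-10-31"),
  (1, 1, 0, 1, "2018-09-30"), (2, 1, 1, 0, "2018-08-31"), (3, 1, 1, 0, "2018-07-31"),
  (3, 1, 1, 1, "2019-05-31")
).toDF("id" ,"def_a" , "def_b", "deb_c", "date")
val w = Window.partitionBy($"id").orderBy($"date".asc)
// one (id, min_date) pair per id; distinct() avoids duplicated rows after the join
val df2 = df.withColumn("date", $"date".cast("date"))
  .withColumn("min_date", min($"date").over(w))
  .select("id", "min_date")
  .distinct()
df.join(df2, df("id") === df2("id") && df("date") === df2("min_date"))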
And the output should be:
| id|def_a|def_b|deb_c| date|
| 1| 1| 0| 1|2018-09-30|
| 2| 1| 1| 0|2018-08-31|
| 3| 1| 1| 0|2018-07-31|
By the way I believe you had a little mistake on your expected results. It is (3, 1, 1, 0, 2018-07-31) not (3, 1, 1, 1, 2018-07-31)
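For comparison, the attempt in the question can also be made to work directly: the filter rn >= 1 keeps every row, whereas rn === 1 keeps only the earliest row per id. A minimal sketch, assuming the sample data from the question typed in by hand (ids as integers, dates as ISO strings, the column name spelled deb_c as in the tables); the name firstPerId and the local SparkSession are illustrative only:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Local session for the sketch; in spark-shell the existing session is reused.
val spark = SparkSession.builder().master("local[*]").appName("first-row-per-id").getOrCreate()
import spark.implicits._

// Sample rows copied from the question's table.
val df = Seq(
  (1, 1, 0, 1, "2019-01-31"),
  (2, 1, 1, 0, "2018-12-31"),
  (3, 1, 1, 1, "2018-10-31"),
  (1, 1, 0, 1, "2018-09-30"),
  (2, 1, 1, 0, "2018-08-31"),
  (3, 1, 1, 0, "2018-07-31"),
  (3, 1, 1, 1, "2019-05-31")
).toDF("id", "def_a", "def_b", "deb_c", "date")

// Number the rows of each id from the earliest date onwards.
val w = Window.partitionBy(col("id")).orderBy(col("date").asc)

// Keep only the first-numbered row per id, then drop the helper column.
val firstPerId = df
  .withColumn("date", col("date").cast("date"))
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)
  .drop("rn")

firstPerId.orderBy("id").show()
```

This avoids the self-join, at the cost of picking an arbitrary row if two rows of the same id share the earliest date.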


how to join dataframes with some similar values and multiple keys / scala

I am having trouble producing the following table. The first two tables are my source tables, which I would like to join. The third table is how I would like the result to look.
I tried an outer join on the keys "ID" and "date", but the result is not the same as in this example. The problem is that some def_ values in the two tables have the same date, and I would like to get them in the same row.
I used the following join:
val df_result = df_1.join(df_2, Seq("ID", "date"), "outer")
|ID |def_a| date |
| 01| 1| 2019-01-31|
| 02| 1| 2019-12-31|
| 03| 1| 2019-11-30|
| 01| 1| 2019-10-31|
|ID |def_b|def_c|date |
| 01| 1| 0| 2017-01-31|
| 02| 1| 1| 2019-12-31|
| 03| 1| 1| 2018-11-30|
| 03| 0| 1| 2019-11-30|
| 01| 1| 1| 2018-09-30|
| 02| 1| 1| 2018-08-31|
| 01| 1| 1| 2018-07-31|
|ID |def_a|def_b|deb_c|date |
| 01| 1| 0| 0| 2019-01-31|
| 02| 1| 1| 1| 2019-12-31|
| 03| 1| 0| 1| 2019-11-30|
| 01| 1| 0| 0| 2019-10-31|
| 01| 0| 1| 0| 2017-01-31|
| 03| 0| 1| 1| 2018-11-30|
| 01| 0| 1| 1| 2018-09-30|
| 02| 0| 1| 1| 2018-08-31|
| 01| 0| 1| 1| 2018-07-31|
I would be grateful for any help.
Hope the following code is helpful:
.groupBy("ID", "date")
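The snippet in this answer is cut off, so the original approach cannot be recovered from it. As a separate sketch, not the answerer's code: the expected table in the question looks like the outer join on ID and date with the missing def_ values filled with 0. The sample rows below are typed in from the question's tables, and the local SparkSession is an assumption for running the example standalone:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("outer-join-fill").getOrCreate()
import spark.implicits._

// First source table from the question.
val df_1 = Seq(
  ("01", 1, "2019-01-31"),
  ("02", 1, "2019-12-31"),
  ("03", 1, "2019-11-30"),
  ("01", 1, "2019-10-31")
).toDF("ID", "def_a", "date")

// Second source table from the question.
val df_2 = Seq(
  ("01", 1, 0, "2017-01-31"),
  ("02", 1, 1, "2019-12-31"),
  ("03", 1, 1, "2018-11-30"),
  ("03", 0, 1, "2019-11-30"),
  ("01", 1, 1, "2018-09-30"),
  ("02", 1, 1, "2018-08-31"),
  ("01", 1, 1, "2018-07-31")
).toDF("ID", "def_b", "def_c", "date")

// Rows sharing (ID, date) land in one row; values missing on either side become 0.
val df_result = df_1
  .join(df_2, Seq("ID", "date"), "outer")
  .na.fill(0L, Seq("def_a", "def_b", "def_c"))
  .select("ID", "def_a", "def_b", "def_c", "date")

df_result.show()
```

If the nulls left by the plain outer join were the only difference from the expected table, the na.fill step is all that was missing from the join in the question.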

How to create a sequence of events (column values) per some other column?

I have a Spark data frame as shown below -
val myDF = Seq(
| 1| A| 100| 0| 0|
| 1| E| 200| 0| 0|
| 1| | 300| 1| 49|
| 2| A| 200| 0| 0|
| 2| C| 300| 0| 0|
| 2| D| 100| 0| 0|
I would like to create Sequence dataframe for every visitor from myDF that traces a visitor's path to purchase ordered by timestamp dimension.
The output dataframe should look like below(-> can be any delimiter) -
|visitor|channel sequence |
| 1| A->E->purchase |
| 2| D->A->C->no_purchase|
To make things clear, visitor 2 has been exposed to channel D, then A and then C; and he does not make a purchase.
Hence the sequence is to be formed as D->A->C->no_purchase.
NOTE: Whenever a purchase happens, channel value goes blank and purchase_flag is set to 1.
I want to do this using a Scala UDF in Spark so that I can re-apply the method on other datasets.
Here's how it can be done using a udf function:
// myDF is the DataFrame from the question
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
// sort the (channel, timestamp) pairs by timestamp, drop blank channels, then append the outcome
def sequenceUdf = udf((struct: Seq[Row], purchased: Seq[Int]) =>
  struct.map(row => (row.getAs[String]("channel"), row.getAs[Int]("timestamp")))
    .sortBy(_._2).map(_._1).filterNot(_ == "").mkString("->") +
    (if (purchased.contains(1)) "->purchase" else "->no_purchase"))
myDF.groupBy("visitor").agg(collect_list(struct("channel", "timestamp")).as("struct"), collect_list("purchase_flag").as("purchased"))
.select(col("visitor"), sequenceUdf(col("struct"), col("purchased")).as("channel sequence"))
which should give you
|visitor|channel sequence |
|1 |A->E->purchase |
|2 |D->A->C->no_purchase|
You can make it as generic as you need; this is just a demo of how to proceed.
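Since the question asks for something that can be re-applied to other datasets, here is a hedged sketch of the same idea wrapped in a function and written with built-in functions instead of a udf. It assumes Spark 2.4 or later (for array_remove), blank channels stored as empty strings, and a DataFrame with the columns visitor, channel, timestamp and purchase_flag; the helper name channelSequence and the four-column sample data are made up for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("channel-sequence").getOrCreate()
import spark.implicits._

// Hypothetical helper: expects columns visitor, channel, timestamp, purchase_flag.
def channelSequence(df: DataFrame): DataFrame =
  df.groupBy("visitor")
    .agg(
      // structs sort by their first field, so putting timestamp first orders the events
      sort_array(collect_list(struct(col("timestamp"), col("channel")))).as("events"),
      max(col("purchase_flag")).as("purchased")
    )
    .select(
      col("visitor"),
      concat(
        // events.channel extracts the channel of every struct; blanks are removed
        concat_ws("->", array_remove(col("events.channel"), "")),
        when(col("purchased") === 1, "->purchase").otherwise("->no_purchase")
      ).as("channel sequence")
    )

// Sample rows from the question, restricted to the columns the helper needs.
val myDF = Seq(
  (1, "A", 100, 0),
  (1, "E", 200, 0),
  (1, "", 300, 1),
  (2, "A", 200, 0),
  (2, "C", 300, 0),
  (2, "D", 100, 0)
).toDF("visitor", "channel", "timestamp", "purchase_flag")

channelSequence(myDF).show(false)
```

Staying with built-in functions lets Spark optimize the whole expression, whereas a udf is opaque to the optimizer; the udf version remains the more flexible choice if the sequence logic grows.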

pyspark window function with filter

I have the following DataFrame with columns: ["id", "timestamp", "x", "y"]:
| id| timestamp| x| y|
| 0|1443489380|100| 1|
| 0|1443489390|200| 0|
| 0|1443489400|300| 0|
| 1|1443489410|400| 1|
| 1|1443489550|100| 1|
| 2|1443489560|600| 0|
| 2|1443489570|200| 0|
| 2|1443489580|700| 1|
I have defined the following Window:
from pyspark.sql import Window
w = Window.partitionBy("id").orderBy("timestamp")
I would like to extract only the first and last row of data in the window w. How can I accomplish this?
If you want the first and last values on the same row, one way is to use pyspark.sql.functions.first():
from pyspark.sql import Window
from pyspark.sql.functions import col, first
w1 = Window.partitionBy("id").orderBy("timestamp")
w2 = Window.partitionBy("id").orderBy(col("timestamp").desc()) # sort desc
# first value under each ordering per id; distinct() collapses to one row per id
df.select(
    "id",
    *([first(c).over(w1).alias("first_" + c) for c in df.columns if c != "id"] +
      [first(c).over(w2).alias("last_" + c) for c in df.columns if c != "id"])
).distinct().show()
#| id|first_timestamp|first_x|first_y|last_timestamp|last_x|last_y|
#| 0| 1443489380| 100| 1| 1443489400| 300| 0|
#| 1| 1443489410| 400| 1| 1443489550| 100| 1|
#| 2| 1443489560| 600| 0| 1443489580| 700| 1|

how to filter out a null value from spark dataframe

I created a dataframe in spark with the following schema:
|-- user_id: long (nullable = false)
|-- event_id: long (nullable = false)
|-- invited: integer (nullable = false)
|-- day_diff: long (nullable = true)
|-- interested: integer (nullable = false)
|-- event_owner: long (nullable = false)
|-- friend_id: long (nullable = false)
And the data is shown below:
| user_id| event_id|invited|day_diff|interested|event_owner|friend_id|
| 4236494| 110357109| 0| -1| 0| 937597069| null|
| 78065188| 498404626| 0| 0| 0| 2904922087| null|
| 282487230|2520855981| 0| 28| 0| 3749735525| null|
| 335269852|1641491432| 0| 2| 0| 1490350911| null|
| 437050836|1238456614| 0| 2| 0| 991277599| null|
| 447244169|2095085551| 0| -1| 0| 1579858878| null|
| 516353916|1076364848| 0| 3| 1| 3597645735| null|
| 528218683|1151525474| 0| 1| 0| 3433080956| null|
| 531967718|3632072502| 0| 1| 0| 3863085861| null|
| 627948360|2823119321| 0| 0| 0| 4092665803| null|
| 811791433|3513954032| 0| 2| 0| 415464198| null|
| 830686203| 99027353| 0| 0| 0| 3549822604| null|
|1008893291|1115453150| 0| 2| 0| 2245155244| null|
|1239364869|2824096896| 0| 2| 1| 2579294650| null|
|1287950172|1076364848| 0| 0| 0| 3597645735| null|
|1345896548|2658555390| 0| 1| 0| 2025118823| null|
|1354205322|2564682277| 0| 3| 0| 2563033185| null|
|1408344828|1255629030| 0| -1| 1| 804901063| null|
|1452633375|1334001859| 0| 4| 0| 1488588320| null|
|1625052108|3297535757| 0| 3| 0| 1972598895| null|
I want to filter out the rows that have null values in the "friend_id" field.
scala> val aaa = test.filter("friend_id is null")
scala> aaa.count
I got res52: Long = 0, which is obviously not right. What is the right way to do it?
One more question: I want to replace the values in the friend_id field, with 0 for null and 1 for any other value. The code I could come up with is:
val aaa = test.select($"user_id", $"event_id", $"invited", $"day_diff", $"interested", $"event_owner", ($"friend_id" != null)?1:0)
This code doesn't work either. Can anyone tell me how I can fix it? Thanks
Let's say you have this data setup (so that results are reproducible):
// declaring data types
case class Company(cName: String, cId: String, details: String)
case class Employee(name: String, id: String, email: String, company: Company)
// setting up example data
val e1 = Employee("n1", null, "", Company("c1", "1", "d1"))
val e2 = Employee("n2", "2", "", Company("c1", "1", "d1"))
val e3 = Employee("n3", "3", "", Company("c1", "1", "d1"))
val e4 = Employee("n4", "4", "", Company("c2", "2", "d2"))
val e5 = Employee("n5", null, "", Company("c2", "2", "d2"))
val e6 = Employee("n6", "6", "", Company("c2", "2", "d2"))
val e7 = Employee("n7", "7", "", Company("c3", "3", "d3"))
val e8 = Employee("n8", "8", "", Company("c3", "3", "d3"))
val employees = Seq(e1, e2, e3, e4, e5, e6, e7, e8)
val df = sc.parallelize(employees).toDF
Data is:
|name| id| email| company|
| n1|null||[c1,1,d1]|
| n2| 2||[c1,1,d1]|
| n3| 3||[c1,1,d1]|
| n4| 4||[c2,2,d2]|
| n5|null||[c2,2,d2]|
| n6| 6||[c2,2,d2]|
| n7| 7||[c3,3,d3]|
| n8| 8||[c3,3,d3]|
Now to filter employees with null ids, you will do --
df.filter("id is null").show
which will correctly show you following:
|name| id| email| company|
| n1|null||[c1,1,d1]|
| n5|null||[c2,2,d2]|
Coming to the second part of your question, you can replace the null ids with 0 and other values with 1 with this --
df.withColumn("id", when($"id".isNull, 0).otherwise(1)).show
This results in:
|name| id| email| company|
| n1| 0||[c1,1,d1]|
| n2| 1||[c1,1,d1]|
| n3| 1||[c1,1,d1]|
| n4| 1||[c2,2,d2]|
| n5| 0||[c2,2,d2]|
| n6| 1||[c2,2,d2]|
| n7| 1||[c3,3,d3]|
| n8| 1||[c3,3,d3]|
Or like df.filter($"friend_id".isNotNull)
There are two ways to do it: creating filter condition 1) Manually 2) Dynamically.
Sample DataFrame:
val df = spark.createDataFrame(Seq(
(0, "a1", "b1", "c1", "d1"),
(1, "a2", "b2", "c2", "d2"),
(2, "a3", "b3", null, "d3"),
(3, "a4", null, "c4", "d4"),
(4, null, "b5", "c5", "d5")
)).toDF("id", "col1", "col2", "col3", "col4")
| id|col1|col2|col3|col4|
| 0| a1| b1| c1| d1|
| 1| a2| b2| c2| d2|
| 2| a3| b3|null| d3|
| 3| a4|null| c4| d4|
| 4|null| b5| c5| d5|
1) Creating filter condition manually i.e. using DataFrame where or filter function
df.filter(col("col1").isNotNull && col("col2").isNotNull).show
df.where("col1 is not null and col2 is not null").show
| id|col1|col2|col3|col4|
| 0| a1| b1| c1| d1|
| 1| a2| b2| c2| d2|
| 2| a3| b3|null| d3|
2) Creating filter condition dynamically: This is useful when we don't want any column to have null value and there are large number of columns, which is mostly the case.
To create the filter condition manually in these cases will waste a lot of time. In below code we are including all columns dynamically using map and reduce function on DataFrame columns:
val filterCond = df.columns.map(x => col(x).isNotNull).reduce(_ && _)
How filterCond looks:
filterCond: org.apache.spark.sql.Column = (((((id IS NOT NULL) AND (col1 IS NOT NULL)) AND (col2 IS NOT NULL)) AND (col3 IS NOT NULL)) AND (col4 IS NOT NULL))
val filteredDf = df.filter(filterCond)
| id|col1|col2|col3|col4|
| 0| a1| b1| c1| d1|
| 1| a2| b2| c2| d2|
A good solution for me was to drop the rows with any null values:
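// requires: import org.apache.spark.api.java.function.FilterFunction;
Dataset<Row> filtered = df.filter((FilterFunction<Row>) row -> !row.anyNull());
In case one is interested in the opposite case (rows that do contain a null), just use row.anyNull() without the negation.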
(Spark 2.1.0 using Java API)
The following lines work well:
test.filter("friend_id is not null")
From the hint from Michael Kopaniov, below works
Here is a solution for Spark in Java. To select the rows containing nulls in a Dataset<Row> data, you do:
Dataset<Row> containingNulls = data.where(data.col("COLUMN_NAME").isNull());
To keep only the rows without nulls, you do:
Dataset<Row> withoutNulls = data.where(data.col("COLUMN_NAME").isNotNull());
Often DataFrames contain columns of type String where, instead of nulls, we have empty strings like "". To filter out such rows as well, we do:
Dataset<Row> withoutNullsAndEmpty = data.where(data.col("COLUMN_NAME").isNotNull().and(data.col("COLUMN_NAME").notEqual("")));
For the first question: it is correct, you are filtering out nulls and hence the count is zero.
For the second part, the replacement, use something like the following:
val options = Map("path" -> "...\\ex.csv", "header" -> "true")
val dfNull = spark.sqlContext.load("com.databricks.spark.csv", options)
| user_id| event_id|invited|day_diff|interested|event_owner|friend_id|
| 4236494| 110357109| 0| -1| 0| 937597069| null|
| 78065188| 498404626| 0| 0| 0| 2904922087| null|
| 282487230|2520855981| 0| 28| 0| 3749735525| null|
| 335269852|1641491432| 0| 2| 0| 1490350911| null|
| 437050836|1238456614| 0| 2| 0| 991277599| null|
| 447244169|2095085551| 0| -1| 0| 1579858878| a|
| 516353916|1076364848| 0| 3| 1| 3597645735| b|
| 528218683|1151525474| 0| 1| 0| 3433080956| c|
| 531967718|3632072502| 0| 1| 0| 3863085861| null|
| 627948360|2823119321| 0| 0| 0| 4092665803| null|
| 811791433|3513954032| 0| 2| 0| 415464198| null|
| 830686203| 99027353| 0| 0| 0| 3549822604| null|
|1008893291|1115453150| 0| 2| 0| 2245155244| null|
|1239364869|2824096896| 0| 2| 1| 2579294650| d|
|1287950172|1076364848| 0| 0| 0| 3597645735| null|
|1345896548|2658555390| 0| 1| 0| 2025118823| null|
|1354205322|2564682277| 0| 3| 0| 2563033185| null|
|1408344828|1255629030| 0| -1| 1| 804901063| null|
|1452633375|1334001859| 0| 4| 0| 1488588320| null|
|1625052108|3297535757| 0| 3| 0| 1972598895| null|
dfNull.withColumn("friend_idTmp", when($"friend_id".isNull, "1").otherwise("0")).drop($"friend_id").withColumnRenamed("friend_idTmp", "friend_id").show
| user_id| event_id|invited|day_diff|interested|event_owner|friend_id|
| 4236494| 110357109| 0| -1| 0| 937597069| 1|
| 78065188| 498404626| 0| 0| 0| 2904922087| 1|
| 282487230|2520855981| 0| 28| 0| 3749735525| 1|
| 335269852|1641491432| 0| 2| 0| 1490350911| 1|
| 437050836|1238456614| 0| 2| 0| 991277599| 1|
| 447244169|2095085551| 0| -1| 0| 1579858878| 0|
| 516353916|1076364848| 0| 3| 1| 3597645735| 0|
| 528218683|1151525474| 0| 1| 0| 3433080956| 0|
| 531967718|3632072502| 0| 1| 0| 3863085861| 1|
| 627948360|2823119321| 0| 0| 0| 4092665803| 1|
| 811791433|3513954032| 0| 2| 0| 415464198| 1|
| 830686203| 99027353| 0| 0| 0| 3549822604| 1|
|1008893291|1115453150| 0| 2| 0| 2245155244| 1|
|1239364869|2824096896| 0| 2| 1| 2579294650| 0|
|1287950172|1076364848| 0| 0| 0| 3597645735| 1|
|1345896548|2658555390| 0| 1| 0| 2025118823| 1|
|1354205322|2564682277| 0| 3| 0| 2563033185| 1|
|1408344828|1255629030| 0| -1| 1| 804901063| 1|
|1452633375|1334001859| 0| 4| 0| 1488588320| 1|
|1625052108|3297535757| 0| 3| 0| 1972598895| 1|
val df = Seq(
("1001", "1007"),
("1002", null),
("1003", "1005"),
(null, "1006")
).toDF("user_id", "friend_id")
Data is:
| 1001| 1007|
| 1002| null|
| 1003| 1005|
| null| 1006|
Drop rows containing any null or NaN values in the specified columns of the Seq:
df.na.drop(Seq("friend_id"))
| 1001| 1007|
| 1003| 1005|
| null| 1006|
If you do not specify columns, a row is dropped as long as any of its columns contains a null or NaN value:
| 1001| 1007|
| 1003| 1005|
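Both variants can be tried side by side. A minimal, self-contained sketch using the same sample data; the local SparkSession and the names byFriendId and anyColumn are assumptions made for the example:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("na-drop").getOrCreate()
import spark.implicits._

val df = Seq(
  ("1001", "1007"),
  ("1002", null),
  ("1003", "1005"),
  (null, "1006")
).toDF("user_id", "friend_id")

// Only rows whose friend_id is null are dropped; a null user_id is tolerated.
val byFriendId = df.na.drop(Seq("friend_id"))

// No column list: a null in any column drops the row.
val anyColumn = df.na.drop()

byFriendId.show()
anyColumn.show()
```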
Another easy way to filter out null values across multiple columns in a Spark DataFrame. Note that the columns are AND-connected: a row is dropped only when all of the listed columns are null.
df.filter(" COALESCE(col1, col2, col3, col4, col5, col6) IS NOT NULL")
If you need to filter out rows that contain any null (OR connected) please use
I used the following code to solve my question. It works, but I went a country mile around to get there. Is there a shortcut for this? Thanks
def filter_null(field : Any) : Int = field match {
case null => 0
case _ => 1
val test = train_event_join.join(
train_event_join("user_id") === user_friends_pair("user_id") &&
train_event_join("event_owner") === user_friends_pair("friend_id"),
line => (
}.toDF("user_id", "event_id", "invited", "day_diff", "interested", "event_owner", "creator_is_friend")

In Spark Dataframe how to get duplicate records and distinct records in two dataframes?

I am working on a problem in which I load data from a Hive table into a Spark DataFrame, and now I want all the unique accts in one DataFrame and all the duplicates in another. For example, if I have acct ids 1,1,2,3,4, I want to get 2,3,4 in one DataFrame and 1,1 in another. How can I do this?
Depending on the version of spark you have, you could use window functions in datasets/sql like below:
Dataset<Row> New = df.withColumn("Duplicate", count("*").over( Window.partitionBy("id") ) );
Dataset<Row> Dups = New.filter(col("Duplicate").gt(1));
Dataset<Row> Uniques = New.filter(col("Duplicate").equalTo(1));
The above is written in Java; the Scala and Python versions should be similar.
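A rough Scala equivalent of the Java snippet, as a sketch: the sample ids come from the question, while the column name id, the local SparkSession and the variable names are assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit}

val spark = SparkSession.builder().master("local[*]").appName("dups-and-uniques").getOrCreate()
import spark.implicits._

// acct ids from the example in the question
val df = Seq(1, 1, 2, 3, 4).toDF("id")

// Attach to every row the number of rows sharing its id.
val withCount = df.withColumn("Duplicate", count(lit(1)).over(Window.partitionBy("id")))

// Rows whose id occurs more than once, and rows whose id occurs exactly once.
val dups = withCount.filter(col("Duplicate") > 1).drop("Duplicate")
val uniques = withCount.filter(col("Duplicate") === 1).drop("Duplicate")

dups.show()
uniques.show()
```

Unlike a groupBy followed by count, the window keeps every original row, so the duplicates come out as 1,1 rather than a single row with a count.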
df.groupBy($"field1",$"field2"...).count.filter($"count" > 1).show()
val acctDF = List(("1", "Acc1"), ("1", "Acc1"), ("1", "Acc1"), ("2", "Acc2"), ("2", "Acc2"), ("3", "Acc3")).toDF("AcctId", "Details")
| 1| Acc1|
| 1| Acc1|
| 1| Acc1|
| 2| Acc2|
| 2| Acc2|
| 3| Acc3|
// Need to convert the DF to rdd to apply map and reduceByKey and again to DF to use it further more
val countsDF = acctDF.rdd.map(rec => (rec(0), 1)).reduceByKey(_+_).map(rec=> (rec._1.toString, rec._2)).toDF("AcctId", "AcctCount")
val accJoinedDF = acctDF.join(countsDF, acctDF("AcctId")===countsDF("AcctId"), "left_outer").select(acctDF("AcctId"), acctDF("Details"), countsDF("AcctCount"))
| 1| Acc1| 3|
| 1| Acc1| 3|
| 1| Acc1| 3|
| 2| Acc2| 2|
| 2| Acc2| 2|
| 3| Acc3| 1|
val distAcctDF = accJoinedDF.filter($"AcctCount"===1)
| 3| Acc3| 1|
val duplAcctDF = accJoinedDF.filter($"AcctCount">1)
| 1| Acc1| 3|
| 1| Acc1| 3|
| 1| Acc1| 3|
| 2| Acc2| 2|
| 2| Acc2| 2|