I have the following dataframe:
+---------+--------+-------+
|date |id |typ_mvt|
+---------+--------+-------+
|date_1 |5697 |C |
|date_2 |5697 |M |
|date_3 |NULL |M |
|date_4 |NULL |S |
+---------+--------+-------+
I want to restore the NULL id values as below:
+---------+--------+-------+
|date |id |typ_mvt|
+---------+--------+-------+
|date_1 |5697 |C |
|date_2 |5697 |M |
|date_3 |5697 |M |
|date_4 |5697 |S |
+---------+--------+-------+
Is there a way to achieve this?
Thank you for your answers.
Hello Doc,
na.fill does the job nicely:
import spark.implicits._

val rdd = sc.parallelize(Seq(
  (201901, new Integer(5697), "C"),
  (201902, new Integer(5697), "M"),
  (201903, null.asInstanceOf[Integer], "M"),
  (201904, null.asInstanceOf[Integer], "S")
))
val df = rdd.toDF("date", "id", "typ_mvt")

// Grab any non-null id and fill the NULLs with it.
val sampleId = df.filter($"id".isNotNull).select($"id").first.getInt(0)
val newDf = df.na.fill(sampleId, Seq("id"))
Otherwise, I found the following very similar post with a much better solution:
Fill in null with previously known good value with pyspark
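In case it helps, here is a minimal sketch of that window-based forward fill adapted to the Scala DataFrame above (my assumption: the date column sorts chronologically, Spark 2.1+):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

// Carry the most recent non-null id forward, ordered by date.
// Note: a window without partitionBy pulls everything into one partition,
// which is fine for small data but does not scale.
val fillWindow = Window.orderBy("date")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val filledDf = df.withColumn("id", last($"id", ignoreNulls = true).over(fillWindow))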
The use case is to group by each column in a given dataset and get the count for that column.
The resulting set is a (key, value) map, and finally a union of them all.
For example:
students = {(age, firstname, lastname)(12, "FN", "LN"), (13, "df", "gh")}
groupby age => (12, 1), (13, 1)
groupby firstname => etc
I know the brute-force approach is to do a map and maintain a count map for each column,
but I wanted to see if there is something more we can do, maybe with foldLeft and window functions. I tried using rollup and cube, but those group all the columns together rather than individually.
Assuming that you need Key, Value, and Grouping Column name as three columns in the output, you can use the code below so that the relationship between each key and its grouping column is preserved.
Code
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import spark.implicits._

val df = Seq(("12", "FN", "LN"),
  ("13", "FN", "gh")).toDF("age", "firstname", "lastname")
df.show(false)
// Empty frame with the target schema; count() yields a bigint, so Value is LongType.
val initialDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], StructType(
  Seq(StructField("Key", StringType), StructField("Value", LongType),
    StructField("GroupColumn", StringType))
))
// Fold over the columns: group by each, count, and tag the rows with the source column.
val resultantDf = df.columns.foldLeft(initialDF)((df1, column) => df1.union(
  df.groupBy(column).count().withColumn("GroupColumn", lit(column))
))
resultantDf.show(false)
resultantDf.collect().map { row =>
(row.getString(0), row.getLong(1))
}.foreach(println)
Output
INPUT DF:
+---+---------+--------+
|age|firstname|lastname|
+---+---------+--------+
|12 |FN |LN |
|13 |FN |gh |
+---+---------+--------+
OUTPUT DF:
+---+-----+-----------+
|Key|Value|GroupColumn|
+---+-----+-----------+
|12 |1 |age |
|13 |1 |age |
|FN |2 |firstname |
|gh |1 |lastname |
|LN |1 |lastname |
+---+-----+-----------+
OUTPUT LIST:
(12,1)
(13,1)
(FN,2)
(gh,1)
(LN,1)
Assuming that you just need the union of the grouped data frames, I was able to solve it as below:
Code
// The same imports as in the previous snippet (Row, the sql.types members, spark.implicits._) are assumed.
val df = Seq(("12", "FN", "LN"),
  ("13", "FN", "gh")).toDF("age", "firstname", "lastname")
df.show(false)
// count() yields a bigint, so the count column is declared as LongType.
val initialDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], StructType(
  Seq(StructField("column", StringType), StructField("count", LongType))
))
// Fold over the columns, unioning each per-column count onto the empty frame.
df.columns.foldLeft(initialDF)((df1, column) => df1.union(df.groupBy(column).count())).show(false)
Output
INPUT DF:
+---+---------+--------+
|age|firstname|lastname|
+---+---------+--------+
|12 |FN |LN |
|13 |FN |gh |
+---+---------+--------+
OUTPUT DF:
+------+-----+
|column|count|
+------+-----+
|12 |1 |
|13 |1 |
|FN |2 |
|gh |1 |
|LN |1 |
+------+-----+
My spark dataframe looks like this:
+------+------+-------+------+
|userid|useid1|userid2|score |
+------+------+-------+------+
|23 |null |dsad |3 |
|11 |44 |null |4 |
|231 |null |temp |5 |
|231 |null |temp |2 |
+------+------+-------+------+
I want to do the calculation for each pair of userid and useid1/userid2 (whichever is not null).
If it's useid1, I multiply the score by 5; if it's userid2, I multiply the score by 3.
Finally, I want to sum all the scores for each pair.
The result should be:
+------+--------+-----------+
|userid|useid1/2|final score|
+------+--------+-----------+
|23 |dsad |9 |
|11 |44 |20 |
|231 |temp |21 |
+------+--------+-----------+
How can I do this?
For the groupBy part, I know DataFrames have a groupBy function, but I don't know if I can use it conditionally, e.g. if useid1 is null, group by (userid, userid2); if userid2 is null, group by (userid, useid1).
For the calculation part, how do I multiply by 3 or 5 based on the condition?
The solution below will help solve your problem.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"..." column syntax; assumes a SparkSession named spark is in scope

// userDF is the question's input frame (userid, useid1, userid2, score).
val groupByUserWinFun = Window.partitionBy("userid", "useid1/2")
val finalScoreDF = userDF
  .withColumn("useid1/2", when($"useid1".isNull, $"userid2").otherwise($"useid1"))
  .withColumn("finalscore", when($"useid1".isNull, $"score" * 3).otherwise($"score" * 5))
  .withColumn("finalscore", sum("finalscore").over(groupByUserWinFun))
  .select("userid", "useid1/2", "finalscore").distinct()
Using the when function in Spark SQL, we select useid1 or userid2 and multiply the score based on which one is present.
Output:
+------+--------+----------+
|userid|useid1/2|finalscore|
+------+--------+----------+
| 11 | 44| 20.0|
| 23 | dsad| 9.0|
| 231| temp| 21.0|
+------+--------+----------+
Group by will work:
import org.apache.spark.sql.functions.{coalesce, sum, when}
import spark.implicits._

val original = Seq(
(23, null, "dsad", 3),
(11, "44", null, 4),
(231, null, "temp", 5),
(231, null, "temp", 2)
).toDF("userid", "useid1", "userid2", "score")
// action
val result = original
.withColumn("useid1/2", coalesce($"useid1", $"userid2"))
.withColumn("score", $"score" * when($"useid1".isNotNull, 5).otherwise(3))
.groupBy("userid", "useid1/2")
.agg(sum("score").alias("final score"))
result.show(false)
Output:
+------+--------+-----------+
|userid|useid1/2|final score|
+------+--------+-----------+
|23 |dsad |9 |
|231 |temp |21 |
|11 |44 |20 |
+------+--------+-----------+
coalesce will do what you need.
df.withColumn("userid1/2", coalesce(col("useid1"), col("useid1")))
Basically, this function returns the first non-null value in the given order.
documentation :
COALESCE(T v1, T v2, ...)
Returns the first v that is not NULL, or NULL if all v's are NULL.
It needs an import: import org.apache.spark.sql.functions.coalesce
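A quick illustrative sketch of that ordering behaviour (toy columns a and b, not from the question; assumes a SparkSession named spark):
import org.apache.spark.sql.functions.coalesce
import spark.implicits._

val demo = Seq(
  (Option(1), Option(10)),
  (Option.empty[Int], Option(20)),
  (Option.empty[Int], Option.empty[Int])
).toDF("a", "b")

// coalesce(a, b): take a when it is non-null, otherwise b, otherwise null.
demo.select(coalesce($"a", $"b").alias("first_non_null")).show()
// rows: 1, 20, null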
I have two dataframes, i.e. left & right. I have a working solution for my question, but I need a way to make it generic. My question is at the end.
leftDF:
+------+---------+-------+-------+
|leftId|leftAltId|leftCur|leftAmt|
+------+---------+-------+-------+
|1 |100 |USD |20 |
|2 |200 |INR |100 |
|4 |500 |MXN |100 |
+------+---------+-------+-------+
rightDF:
+-------+----------+--------+--------+
|rightId|rightAltId|rightCur|rightAmt|
+-------+----------+--------+--------+
|1 |300 |USD |20 |
|3 |400 |MXN |100 |
|4 |600 |MXN |200 |
+-------+----------+--------+--------+
I want to perform a join between these two dataframes, and I expect four dataframes as output:
transactions that exist in leftDF & not in rightDF
transactions that exist in rightDF & not in leftDF
transactions that have at least one of the ids common between the two dataframes
3.a Strict Match: same currency and amount between the two dataframes. Example: transaction with id 1.
3.b Relaxed Match: transactions that have the same id but a different currency/amount combo. Example: transaction with id 4.
Here's the desired output:
transactions that exist in leftDF & not in rightDF
+------+---------+-------+-------+-------+----------+--------+--------+
|leftId|leftAltId|leftCur|leftAmt|rightId|rightAltId|rightCur|rightAmt|
+------+---------+-------+-------+-------+----------+--------+--------+
|2 |200 |INR |100 |null |null |null |null |
+------+---------+-------+-------+-------+----------+--------+--------+
transactions that exist in rightDF & not in leftDF
+------+---------+-------+-------+-------+----------+--------+--------+
|leftId|leftAltId|leftCur|leftAmt|rightId|rightAltId|rightCur|rightAmt|
+------+---------+-------+-------+-------+----------+--------+--------+
|null |null |null |null |3 |400 |MXN |100 |
+------+---------+-------+-------+-------+----------+--------+--------+
transactions that have at least one of the ids common between the two dataframes
+------+---------+-------+-------+-------+----------+--------+--------+
|leftId|leftAltId|leftCur|leftAmt|rightId|rightAltId|rightCur|rightAmt|
+------+---------+-------+-------+-------+----------+--------+--------+
|1 |100 |USD |20 |1 |300 |USD |20 |
|4 |500 |MXN |100 |4 |600 |MXN |200 |
+------+---------+-------+-------+-------+----------+--------+--------+
3.a Strict Match: same currency and amount between the two dataframes. Example: transaction with id 1.
+------+---------+-------+-------+-------+----------+--------+--------+
|leftId|leftAltId|leftCur|leftAmt|rightId|rightAltId|rightCur|rightAmt|
+------+---------+-------+-------+-------+----------+--------+--------+
|1 |100 |USD |20 |1 |300 |USD |20 |
+------+---------+-------+-------+-------+----------+--------+--------+
3.b Relaxed Match: transactions that have the same id but a different currency/amount combo. Example: transaction with id 4.
+------+---------+-------+-------+-------+----------+--------+--------+
|leftId|leftAltId|leftCur|leftAmt|rightId|rightAltId|rightCur|rightAmt|
+------+---------+-------+-------+-------+----------+--------+--------+
|4 |500 |MXN |100 |4 |600 |MXN |200 |
+------+---------+-------+-------+-------+----------+--------+--------+
Here's the working code I have for it:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import sparkSession.implicits._
val leftDF: DataFrame = Seq((1, 100, "USD", 20), (2, 200, "INR", 100), (4, 500, "MXN", 100)).toDF("leftId", "leftAltId", "leftCur", "leftAmt")
val rightDF: DataFrame = Seq((1, 300, "USD", 20), (3, 400, "MXN", 100), (4, 600, "MXN", 200)).toDF("rightId", "rightAltId", "rightCur", "rightAmt")
leftDF.show(false)
rightDF.show(false)
val idMatchQuery = leftDF("leftId") === rightDF("rightId") || leftDF("leftAltId") === rightDF("rightAltId")
val currencyMatchQuery = leftDF("leftCur") === rightDF("rightCur") && leftDF("leftAmt") === rightDF("rightAmt")
val leftOnlyQuery = (col("leftId").isNotNull && col("rightId").isNull) || (col("leftAltId").isNotNull && col("rightAltId").isNull)
val rightOnlyQuery = (col("rightId").isNotNull && col("leftId").isNull) || (col("rightAltId").isNotNull && col("leftAltId").isNull)
val matchQuery = (col("rightId").isNotNull && col("leftId").isNotNull) || (col("rightAltId").isNotNull && col("leftAltId").isNotNull)
val result = leftDF.join(rightDF, idMatchQuery, "fullouter")
val leftOnlyDF = result.filter(leftOnlyQuery)
val rightOnlyDF = result.filter(rightOnlyQuery)
val matchDF = result.filter(matchQuery)
val strictMatchDF = matchDF.filter(currencyMatchQuery.equalTo(true))
val relaxedMatchDF = matchDF.filter(currencyMatchQuery.equalTo(false))
leftOnlyDF.show(false)
rightOnlyDF.show(false)
matchDF.show(false)
strictMatchDF.show(false)
relaxedMatchDF.show(false)
What I'm looking for:
I want to be able to take the column names to join on, as a list and make the code generic.
For example:
val relaxedJoinList = Array(("leftId", "rightId"), ("leftAltId", "rightAltId"))
val strictJoinList = Array(("leftCur", "rightCur"), ("leftAmt", "rightAmt"))
I want to be able to take the column names to join on, as a list and make the code generic.
This is not a perfect suggestion, but it should definitely help you generalize. The suggestion is to go with foldLeft:
val relaxedJoinList = Array(("leftId", "rightId"), ("leftAltId", "rightAltId"))
val rHead = relaxedJoinList.head
val strictJoinList = Array(("leftCur", "rightCur"), ("leftAmt", "rightAmt"))
val sHead = strictJoinList.head

// OR together the equality of every relaxed (left, right) column pair.
val idMatchQuery = relaxedJoinList.tail.foldLeft(leftDF(rHead._1) === rightDF(rHead._2)){ (x, y) => x || leftDF(y._1) === rightDF(y._2) }
// AND together the equality of every strict (left, right) column pair.
val currencyMatchQuery = strictJoinList.tail.foldLeft(leftDF(sHead._1) === rightDF(sHead._2)){ (x, y) => x && leftDF(y._1) === rightDF(y._2) }
// Left-only / right-only / matched conditions, built from the same relaxed pairs.
val leftOnlyQuery = relaxedJoinList.tail.foldLeft(col(rHead._1).isNotNull && col(rHead._2).isNull){ (x, y) => x || (col(y._1).isNotNull && col(y._2).isNull) }
val rightOnlyQuery = relaxedJoinList.tail.foldLeft(col(rHead._1).isNull && col(rHead._2).isNotNull){ (x, y) => x || (col(y._1).isNull && col(y._2).isNotNull) }
val matchQuery = relaxedJoinList.tail.foldLeft(col(rHead._1).isNotNull && col(rHead._2).isNotNull){ (x, y) => x || (col(y._1).isNotNull && col(y._2).isNotNull) }
The rest of the code stays the same as yours.
I hope the answer is helpful.
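For what it's worth, one possible way to package the whole flow into a reusable helper, combining the generalized conditions above with your original join/filter code (a sketch; "reconcile" is just a hypothetical name, and both pair lists are assumed non-empty):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Returns (leftOnlyDF, rightOnlyDF, matchDF, strictMatchDF, relaxedMatchDF).
def reconcile(leftDF: DataFrame, rightDF: DataFrame,
              relaxedJoinList: Array[(String, String)],
              strictJoinList: Array[(String, String)]) = {
  val rHead = relaxedJoinList.head
  val sHead = strictJoinList.head
  val idMatchQuery = relaxedJoinList.tail.foldLeft(leftDF(rHead._1) === rightDF(rHead._2)) { (x, y) => x || leftDF(y._1) === rightDF(y._2) }
  val currencyMatchQuery = strictJoinList.tail.foldLeft(leftDF(sHead._1) === rightDF(sHead._2)) { (x, y) => x && leftDF(y._1) === rightDF(y._2) }
  val leftOnlyQuery = relaxedJoinList.tail.foldLeft(col(rHead._1).isNotNull && col(rHead._2).isNull) { (x, y) => x || (col(y._1).isNotNull && col(y._2).isNull) }
  val rightOnlyQuery = relaxedJoinList.tail.foldLeft(col(rHead._1).isNull && col(rHead._2).isNotNull) { (x, y) => x || (col(y._1).isNull && col(y._2).isNotNull) }
  val matchQuery = relaxedJoinList.tail.foldLeft(col(rHead._1).isNotNull && col(rHead._2).isNotNull) { (x, y) => x || (col(y._1).isNotNull && col(y._2).isNotNull) }

  val result = leftDF.join(rightDF, idMatchQuery, "fullouter")
  val matchDF = result.filter(matchQuery)
  (result.filter(leftOnlyQuery), result.filter(rightOnlyQuery), matchDF,
    matchDF.filter(currencyMatchQuery.equalTo(true)), matchDF.filter(currencyMatchQuery.equalTo(false)))
}
Calling reconcile(leftDF, rightDF, relaxedJoinList, strictJoinList) then gives you the five frames from the original code in one go.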
Suppose I have two datasets as follows:
Dataset 1:
id, name, score
1, Bill, 200
2, Bew, 23
3, Amy, 44
4, Ramond, 68
Dataset 2:
id,message
1, i love Bill
2, i hate Bill
3, Bew go go !
4, Amy is the best
5, Ramond is the wrost
6, Bill go go
7, Bill i love ya
8, Ramond is Bad
9, Amy is great
I want to join the above two datasets and count how many times each person's name from dataset 1 appears in the messages of dataset 2, ranked by count. The result should be:
Bill, 4
Ramond, 2
..
..
I managed to join both of them together, but I'm not sure how to count how many times the name appears for each person.
Any suggestion would be appreciated.
Edited:
my join code:
val rdd = sc.textFile("dataset1")
val rdd2 = sc.textFile("dataset2")
val rddPair1 = rdd.map { x =>
  val data = x.split(",")
  (data(0), data(1))
}
val rddPair2 = rdd2.map { x =>
  val data = x.split(",")
  (data(0), data(1))
}
rddPair1.join(rddPair2).collect().foreach(f =>{
println(f._1+" "+f._2._1+" "+f._2._2)
})
Using RDDs, achieving the solution you want would be complex; not so much using dataframes.
The first step would be to read the two files you have into dataframes, as below:
val df1 = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", true)
  .load("dataset1")
val df2 = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", true)
  .load("dataset2")
so that you should have
df1
+---+------+-----+
|id |name |score|
+---+------+-----+
|1 |Bill |200 |
|2 |Bew |23 |
|3 |Amy |44 |
|4 |Ramond|68 |
+---+------+-----+
df2
+---+-------------------+
|id |message |
+---+-------------------+
|1 |i love Bill |
|2 |i hate Bill |
|3 |Bew go go ! |
|4 |Amy is the best |
|5 |Ramond is the wrost|
|6 |Bill go go |
|7 |Bill i love ya |
|8 |Ramond is Bad |
|9 |Amy is great |
+---+-------------------+
A join on the message containing the name, followed by groupBy and count, should give your desired output:
df1.join(df2, df2("message").contains(df1("name")), "left").groupBy("name").count().show(false)
Final output would be
+------+-----+
|name |count|
+------+-----+
|Ramond|2 |
|Bill |4 |
|Amy |2 |
|Bew |1 |
+------+-----+
Hello guys, I want to update an old dataframe based on the pos_id and article_id fields.
If the tuple (pos_id, article_id) exists, I add each column to the old one; if it doesn't exist, I add the new row. It worked fine. But I don't know how to deal with the case when the old dataframe is initially empty; in that case, I want to add the new rows from the second dataframe to the old one. Here is what I did:
import org.apache.spark.sql.types.{DateType, DoubleType, LongType}
import spark.implicits._ // for the 'column symbol syntax

val histocaisse = spark.read
  .format("csv")
  .option("header", "true") // reading the headers
  .load("C:/Users/MHT/Desktop/histocaisse_dte1.csv")
val hist = histocaisse
  .withColumn("pos_id", 'pos_id.cast(LongType))
  .withColumn("article_id", 'article_id.cast(LongType))
  .withColumn("date", 'date.cast(DateType))
  .withColumn("qte", 'qte.cast(DoubleType))
  .withColumn("ca", 'ca.cast(DoubleType))
val histocaisse2 = spark.read
  .format("csv")
  .option("header", "true") // reading the headers
  .load("C:/Users/MHT/Desktop/histocaisse_dte2.csv")
val hist2 = histocaisse2
  .withColumn("pos_id", 'pos_id.cast(LongType))
  .withColumn("article_id", 'article_id.cast(LongType))
  .withColumn("date", 'date.cast(DateType))
  .withColumn("qte", 'qte.cast(DoubleType))
  .withColumn("ca", 'ca.cast(DoubleType))
hist2.show(false)
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-07|2.5 |3.5 |
|2 |2 |2000-01-07|14.7|12.0|
|3 |3 |2000-01-07|3.5 |1.2 |
+------+----------+----------+----+----+
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-08|2.5 |3.5 |
|2 |2 |2000-01-08|14.7|12.0|
|3 |3 |2000-01-08|3.5 |1.2 |
|4 |4 |2000-01-08|3.5 |1.2 |
|5 |5 |2000-01-08|14.5|1.2 |
|6 |6 |2000-01-08|2.0 |1.25|
+------+----------+----------+----+----+
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-08|5.0 |7.0 |
|2 |2 |2000-01-08|39.4|24.0|
|3 |3 |2000-01-08|7.0 |2.4 |
|4 |4 |2000-01-08|3.5 |1.2 |
|5 |5 |2000-01-08|14.5|1.2 |
|6 |6 |2000-01-08|2.0 |1.25|
+------+----------+----------+----+----+
Here is the solution I found (hist1 being the old frame, called hist above, and hist2 the new one):
val df = hist2.join(hist1, Seq("article_id", "pos_id"), "left")
.select($"pos_id", $"article_id",
coalesce(hist2("date"), hist1("date")).alias("date"),
(coalesce(hist2("qte"), lit(0)) + coalesce(hist1("qte"), lit(0))).alias("qte"),
(coalesce(hist2("ca"), lit(0)) + coalesce(hist1("ca"), lit(0))).alias("ca"))
.orderBy("pos_id", "article_id")
This doesn't work when hist1 is empty. Any help please?
Thanks a lot.
Not sure if I understood correctly, but if the problem is that sometimes the hist1 dataframe is empty and that makes the join crash, something you can try is this:
import scala.util.{Failure, Success, Try}

// Try(hist1.first) fails when hist1 has no rows, which drives the match below.
val checkHist1Empty = Try(hist1.first)
val df = checkHist1Empty match {
case Success(df) => {
hist2.join(hist1, Seq("article_id", "pos_id"), "left")
.select($"pos_id", $"article_id",
coalesce(hist2("date"), hist1("date")).alias("date"),
(coalesce(hist2("qte"), lit(0)) + coalesce(hist1("qte"), lit(0))).alias("qte"),
(coalesce(hist2("ca"), lit(0)) + coalesce(hist1("ca"), lit(0))).alias("ca"))
.orderBy("pos_id", "article_id")
}
case Failure(e) => {
hist2.select($"pos_id", $"article_id",
coalesce(hist2("date")).alias("date"),
coalesce(hist2("qte"), lit(0)).alias("qte"),
coalesce(hist2("ca"), lit(0)).alias("ca"))
.orderBy("pos_id", "article_id")
}
}
This basically checks whether hist1 is empty before performing the join. If it is empty, it generates the df with the same logic but applied only to the hist2 dataframe. If it contains data, it applies the logic you already had, which you said works.
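If you prefer an explicit check over Try, an alternative is to test for emptiness directly (a sketch, assuming the same imports as in the question; head(1) fetches at most one row, so the check stays cheap):
// Branch on emptiness instead of wrapping first() in a Try.
val df =
  if (hist1.head(1).isEmpty) {
    hist2.select($"pos_id", $"article_id", $"date",
        coalesce($"qte", lit(0)).alias("qte"),
        coalesce($"ca", lit(0)).alias("ca"))
      .orderBy("pos_id", "article_id")
  } else {
    hist2.join(hist1, Seq("article_id", "pos_id"), "left")
      .select($"pos_id", $"article_id",
        coalesce(hist2("date"), hist1("date")).alias("date"),
        (coalesce(hist2("qte"), lit(0)) + coalesce(hist1("qte"), lit(0))).alias("qte"),
        (coalesce(hist2("ca"), lit(0)) + coalesce(hist1("ca"), lit(0))).alias("ca"))
      .orderBy("pos_id", "article_id")
  }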
Instead of doing a join, why don't you do a union of the two dataframes, then groupBy (pos_id, article_id) and aggregate each column, summing qte and ca?
val df3 = df1.unionAll(df2)
val df4 = df3.groupBy("pos_id", "article_id").agg($"pos_id", $"article_id", max("date"), sum("qte"), sum("ca"))
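Note that unionAll is deprecated since Spark 2.0 in favour of union, and the grouping columns don't need to be repeated inside agg. A minimal sketch of the same idea with the current API (assuming df1 and df2 both have the pos_id, article_id, date, qte, ca columns):
import org.apache.spark.sql.functions.{max, sum}

// Union the old and new frames, then aggregate per (pos_id, article_id):
// keep the latest date and sum the quantities and amounts.
val df4 = df1.union(df2)
  .groupBy("pos_id", "article_id")
  .agg(max("date").alias("date"), sum("qte").alias("qte"), sum("ca").alias("ca"))
  .orderBy("pos_id", "article_id")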