How to get this using Scala

**DF1**
120 D
120 E
125 F
**DF2**
A
B
C
D
E
F
G
H
**output_DF**
120 null A
120 null B
120 null C
120 D D
120 E E
120 null F
120 null G
120 null H
125 null A
125 null B
125 null C
125 null D
125 null E
125 F F
125 null G
125 null H
From dataframes 1 and 2 I need to build the final output dataframe in spark-shell.
A, B, C, D, E, F are dates in yyyy-MM-dd format, and 120 and 125 are values of the ticket_id column, which contains thousands of ticket_ids.
I have shown only a small extract here.

To get the expected result you can use df.join() and df.na.fill() (as mentioned in comments), like this:
For Spark 2.0+
val resultDF = df1.select("col1").distinct.collect.map(_.getInt(0)).map(id => df1.filter(s"col1 = $id").join(df2, df1("col2") === df2("value"), "right").na.fill(id)).reduce(_ union _)
For Spark 1.6
val resultDF = df1.select("col1").distinct.collect.map(_.getInt(0)).map(id => df1.filter(s"col1 = $id").join(df2, df1("col2") === df2("value"), "right").na.fill(id)).reduce(_ unionAll _)
It will give you the following result -
+---+----+-----+
|120|null| A|
|120|null| B|
|120|null| C|
|120| D| D|
|120| E| E|
|120|null| F|
|120|null| G|
|120|null| H|
|125|null| A|
|125|null| B|
|125|null| C|
|125|null| D|
|125|null| E|
|125| F| F|
|125|null| G|
|125|null| H|
+---+----+-----+
I hope it helps!

Cross join of the unique IDs with all possible values, then left join with the original dataframe:
import hiveContext.implicits._
import org.apache.spark.sql.functions.col
val df1Data = List((120, "D"), (120, "E"), (125, "F"))
val df2Data = List("A", "B", "C", "D", "E", "F", "G", "H")
val df1 = sparkContext.parallelize(df1Data).toDF("id", "date")
val df2 = sparkContext.parallelize(df2Data).toDF("date")
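// get unique ID: 120, 125
val uniqueIDDF = df1.select(col("id")).distinct()
// Cartesian product of IDs and dates (on Spark 2.1+, uniqueIDDF.crossJoin(df2) makes this explicit)
val fullJoin = uniqueIDDF.join(df2)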
val result = fullJoin.as("full").join(df1.as("df1"), col("full.id") === col("df1.id") && col("full.date") === col("df1.date"), "left_outer")
val sorted = result.select(col("full.id"), col("df1.date"), col("full.date")).sort(col("full.id"), col("full.date"))
sorted.show(false)
output:
+---+----+----+
|id |date|date|
+---+----+----+
|120|null|A |
|120|null|B |
|120|null|C |
|120|D |D |
|120|E |E |
|120|null|F |
|120|null|G |
|120|null|H |
|125|null|A |
|125|null|B |
|125|null|C |
|125|null|D |
|125|null|E |
|125|F |F |
|125|null|G |
|125|null|H |
+---+----+----+
The sorting is only there to show the rows in the same order as the expected result; it can be skipped.
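To make the join logic above easier to follow, here is a minimal sketch of the same idea on plain Scala collections, without Spark. The object name JoinSketch and the function expand are hypothetical, introduced only for illustration; the sample data is taken from the question.

```scala
object JoinSketch {
  // Hypothetical stand-ins for df1 (ticket id, date) and df2 (all possible dates)
  val df1Rows: Seq[(Int, String)] = Seq((120, "D"), (120, "E"), (125, "F"))
  val allDates: Seq[String] = Seq("A", "B", "C", "D", "E", "F", "G", "H")

  def expand(rows: Seq[(Int, String)], dates: Seq[String]): Seq[(Int, Option[String], String)] = {
    val present = rows.toSet                 // (id, date) pairs that exist in df1
    val ids = rows.map(_._1).distinct.sorted // unique ticket ids
    for {
      id   <- ids   // cross join: every id ...
      date <- dates // ... paired with every possible date
    } yield {
      // left join: keep the date from df1 if the pair exists, otherwise None (null)
      (id, if (present((id, date))) Some(date) else None, date)
    }
  }
}
```

Calling JoinSketch.expand(JoinSketch.df1Rows, JoinSketch.allDates) yields 16 rows, one per (id, date) combination, with the middle element set only where df1 had a matching row.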

Related

new column in dataframe derived from second dataframe

I have two dataframes, df1 and df2. I have to add new columns to df1 from df2:
df1
X Y Z
1 2 3
4 5 6
7 8 9
3 6 9
df2
col1 col2
XX aa
YY bb
XX cc
ZZ vv
Each value of col1 in df2 should be added as a new column in df1 (if it doesn't already exist), with the col2 value as the value of that new column. For example:
df1
X Y Z XX YY ZZ
1 2 3 aa bb vv
4 5 6 cc
7 8 9
3 6 9
df2
col1 col2
XX aa
YY bb
XX cc
ZZ vv
First, Spark datasets are made to be distributed, but column names are part of the schema, so they live in the memory of the driver (the master). Thus, to add a column for each distinct value of df2.col1, you first need to bring those values to the driver (i.e. collect):
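// inputs (toDF needs the session implicits, which spark-shell imports automatically)
import org.apache.spark.sql.functions._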
val df1 = List((1,2,3), (4,5,6), (7,8,9), (3,6,9)).toDF("X", "Y", "Z")
val df2 = List(("XX", "aa"), ("YY", "bb"), ("XX", "cc"), ("ZZ", "vv")).toDF("col1", "col2")
val newColumns = df2.select("col1").as[String].distinct.collect
val newDF = newColumns.foldLeft(df1)( (df, col) => df.withColumn(col, lit("?")))
newDF.show
+---+---+---+---+---+---+
| X| Y| Z| ZZ| YY| XX|
+---+---+---+---+---+---+
| 1| 2| 3| ?| ?| ?|
| 4| 5| 6| ?| ?| ?|
| 7| 8| 9| ?| ?| ?|
| 3| 6| 9| ?| ?| ?|
+---+---+---+---+---+---+
But:
I don't know what values you want to put in those columns (above, I put "?" everywhere).
If there are a lot of rows in df2, like tens of thousands, collecting them and adding them all to df1 can kill the driver.
Now, to go a little further, here is how you can add the columns from df2.col1 and use the concatenated values of df2.col2 as their values:
val toAdd = df2.groupBy("col1").agg(concat_ws(",", collect_set("col2")).as("col2All"))
toAdd.show
+----+-------+
|col1|col2All|
+----+-------+
| ZZ| vv|
| YY| bb|
| XX| cc,aa|
+----+-------+
val newColumns = toAdd.rdd.map(r => (r.getAs[String]("col1"), r.getAs[String]("col2All"))).collectAsMap()
val newDF = newColumns.foldLeft(df1){ case (df, (name, value)) => df.withColumn(name, lit(value))}
newDF.show
+---+---+---+-----+---+---+
| X| Y| Z| XX| YY| ZZ|
+---+---+---+-----+---+---+
| 1| 2| 3|cc,aa| bb| vv|
| 4| 5| 6|cc,aa| bb| vv|
| 7| 8| 9|cc,aa| bb| vv|
| 3| 6| 9|cc,aa| bb| vv|
+---+---+---+-----+---+---+
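Neither variant above reproduces the layout shown in the question, where the n-th value of each col1 key seems to land in the n-th row of df1. Below is a minimal sketch of that positional reading on plain Scala collections. It rests on an assumption about what the asker meant, and on rows having a fixed order, which a distributed DataFrame does not guarantee without an explicit index column. The names PositionalColumns and addColumns are hypothetical.

```scala
object PositionalColumns {
  // rows:  the rows of df1, in a fixed order (assumption: an ordering exists)
  // pairs: the (col1, col2) rows of df2, in their original order
  // Returns the new column names and the rows extended with the new values.
  def addColumns(
      rows: Seq[Seq[String]],
      pairs: Seq[(String, String)]
  ): (Seq[String], Seq[Seq[String]]) = {
    // New column names, in order of first appearance in df2
    val names = pairs.map(_._1).distinct
    // Values per new column, kept in their original order
    val valuesByName: Map[String, Seq[String]] =
      pairs.groupBy(_._1).map { case (k, v) => k -> v.map(_._2) }
    // Row i receives the i-th value of each new column, or "" if there is none
    val filled = rows.zipWithIndex.map { case (row, i) =>
      row ++ names.map(n => valuesByName(n).lift(i).getOrElse(""))
    }
    (names, filled)
  }
}
```

With the data from the question this gives the columns XX, YY, ZZ, the first row 1 2 3 aa bb vv and the second row 4 5 6 cc followed by two empty values.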

Collect most occurring unique values across columns after a groupby in Spark

I have the following dataframe
val input = Seq(("ZZ","a","a","b","b"),
("ZZ","a","b","c","d"),
("YY","b","e",null,"f"),
("YY","b","b",null,"f"),
("XX","j","i","h",null))
.toDF("main","value1","value2","value3","value4")
input.show()
+----+------+------+------+------+
|main|value1|value2|value3|value4|
+----+------+------+------+------+
| ZZ| a| a| b| b|
| ZZ| a| b| c| d|
| YY| b| e| null| f|
| YY| b| b| null| f|
| XX| j| i| h| null|
+----+------+------+------+------+
I need to group by the main column and pick the two most occurring values from the remaining columns for each main value
I did the following
val newdf = input.select('main,array('value1,'value2,'value3,'value4).alias("values"))
val newdf2 = newdf.groupBy('main).agg(collect_set('values).alias("values"))
val newdf3 = newdf2.select('main, flatten($"values").alias("values"))
To get the data in the following form
+----+--------------------+
|main| values|
+----+--------------------+
| ZZ|[a, a, b, b, a, b...|
| YY|[b, e,, f, b, b,, f]|
| XX| [j, i, h,]|
+----+--------------------+
Now I need to pick the most occurring two items from the list as two columns. Dunno how to do that.
So, in this case the expected output should be
+----+------+------+
|main|value1|value2|
+----+------+------+
| ZZ| a| b|
| YY| b| f|
| XX| j| i|
+----+------+------+
Nulls should not be counted, and the final values should be null only if there are no other values to fill them with.
Is this the best way to do things? Is there a better way of doing it?
You can use a UDF to select the two values that occur most often in the array.
input.withColumn("values", array("value1", "value2", "value3", "value4"))
.groupBy("main").agg(flatten(collect_list("values")).as("values"))
.withColumn("max", maxUdf('values)) //(1)
.cache() //(2)
.withColumn("value1", 'max.getItem(0))
.withColumn("value2", 'max.getItem(1))
.drop("values", "max")
.show(false)
with maxUdf being defined as
def getMax[T](array: Seq[T]) = {
array
.filter(_ != null) //remove null values
.groupBy(identity).mapValues(_.length) //count occurrences of each value
.toSeq.sortWith(_._2 > _._2) //sort (3)
.map(_._1).take(2) //return the two (or one) most common values
}
val maxUdf = udf(getMax[String] _)
Remarks:
(1) Using a UDF here means that the whole array with all entries for a single value of main has to fit into the memory of one Spark executor.
(2) cache is required here, or the UDF will be called twice: once for value1 and once for value2.
(3) The sortWith here is stable, but extra logic may be needed to handle elements that have the same number of occurrences (like i, j and h for the main value XX), because the result of groupBy has no defined order.
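For the tie situation mentioned in the remarks, one possible rule is to prefer the value that appears first in the array. The sketch below shows that rule in plain Scala; the names TopTwo and topTwo are hypothetical, and the tie-breaking choice is an assumption rather than something the question specifies. Inside Spark, the order of the collected array is itself not guaranteed after a groupBy, so the result is only as deterministic as that order.

```scala
object TopTwo {
  // Returns the two most frequent non-null values.
  // Ties are broken by first appearance in the input (an assumed rule).
  def topTwo(values: Seq[String]): Seq[String] = {
    val nonNull = values.filter(_ != null)
    // Number of occurrences of each distinct value
    val counts: Map[String, Int] =
      nonNull.groupBy(identity).map { case (v, vs) => v -> vs.length }
    // distinct keeps first-seen order and sortBy is stable,
    // so equally frequent values stay in first-seen order
    nonNull.distinct.sortBy(v => -counts(v)).take(2)
  }
}
```

It could be wrapped the same way as getMax, for example udf(TopTwo.topTwo _). For the XX row of the sample data this picks j and i, as in the expected output.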
Here is my attempt without a UDF.
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy('main).orderBy('count.desc)
newdf3.withColumn("values", explode('values))
.groupBy('main, 'values).agg(count('values).as("count"))
.filter("values is not null")
.withColumn("target", concat(lit("value"), lit(row_number().over(w))))
.filter("target < 'value3'")
.groupBy('main).pivot('target).agg(first('values)).show
+----+------+------+
|main|value1|value2|
+----+------+------+
| ZZ| a| b|
| YY| b| f|
| XX| j| null|
+----+------+------+
The last row has the null value because I have modified your dataframe in this way,
+----+--------------------+
|main| values|
+----+--------------------+
| ZZ|[a, a, b, b, a, b...|
| YY|[b, e,, f, b, b,, f]|
| XX| [j,,,]| <- For null test
+----+--------------------+

spark scala: remove consecutive (by date) duplicates records from a dataframe

I am working with dataframes and want to delete records that are complete duplicates once some fields (dates) are ignored.
I tried to use a windowFunction (WindowSpec) as:
val wFromDupl: WindowSpec = Window
.partitionBy(comparateFields: _*)
.orderBy(asc(orderField))
The comparateFields variable holds all the fields that I have to compare (DESC1 and DESC2 in the example) to eliminate duplicates; the rule is that, when records are duplicated, we discard those with the later dates.
The orderField variable simply holds the effective_date field.
So, by applying the window function, I calculate a temporary column that assigns the smallest date to all the duplicate records, and then filter the dataFrame as:
val dfFinal: DataFrame = dfInicial
.withColumn("w_eff_date", min(col("effective_date")).over(wFromDupl))
.filter(col("effective_date") === col("w_eff_date"))
.drop("w_eff_date")
.distinct()
.withColumn("effective_end_date", lead(orderField, 1, "9999-12-31").over(w))
For the following case it works correctly:
KEY EFFECTIVE_DATE DESC1 DESC2 W_EFF_DATE (tmp)
E2 2000 A B 2000
E2 2001 A B 2000
E2 2002 AA B 2002
The code will drop the second record:
E2 2001 A B 2000
But the logic must be applied only to CONSECUTIVE records (by date). In the following case, the code as implemented deletes the third record (DESC1 and DESC2 are the same, and the min eff date is 2000). We don't want this, because there is a record in between by eff_date (2001 AA B), so we want to keep all 3 records:
KEY EFFECTIVE_DATE DESC1 DESC2 W_EFF_DATE (tmp)
E1 2000 A B 2000
E1 2001 AA B 2001
E1 2002 A B 2000
Any advice on this?
Thank you all!
One approach would be to use when/otherwise along with Window function lag to determine which rows to keep, as shown below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
("E1", "2000", "A", "B"),
("E1", "2001", "AA", "B"),
("E1", "2002", "A", "B"),
("E1", "2003", "A", "B"),
("E1", "2004", "A", "B"),
("E2", "2000", "C", "D"),
("E2", "2001", "C", "D"),
("E2", "2002", "CC", "D"),
("E2", "2003", "C", "D")
).toDF("key", "effective_date", "desc1", "desc2")
val compareCols = List("desc1", "desc2")
val win1 = Window.partitionBy("key").orderBy("effective_date")
val df2 = df.
withColumn("compCols", struct(compareCols.map(col): _*)).
withColumn("rowNum", row_number.over(win1)).
withColumn("toKeep",
when($"rowNum" === 1 || $"compCols" =!= lag($"compCols", 1).over(win1), true).
otherwise(false)
)
// +---+--------------+-----+-----+--------+------+------+
// |key|effective_date|desc1|desc2|compCols|rowNum|toKeep|
// +---+--------------+-----+-----+--------+------+------+
// | E1| 2000| A| B| [A,B]| 1| true|
// | E1| 2001| AA| B| [AA,B]| 2| true|
// | E1| 2002| A| B| [A,B]| 3| true|
// | E1| 2003| A| B| [A,B]| 4| false|
// | E1| 2004| A| B| [A,B]| 5| false|
// | E2| 2000| C| D| [C,D]| 1| true|
// | E2| 2001| C| D| [C,D]| 2| false|
// | E2| 2002| CC| D| [CC,D]| 3| true|
// | E2| 2003| C| D| [C,D]| 4| true|
// +---+--------------+-----+-----+--------+------+------+
df2.where($"toKeep").select(df.columns.map(col): _*).
show
// +---+--------------+-----+-----+
// |key|effective_date|desc1|desc2|
// +---+--------------+-----+-----+
// | E1| 2000| A| B|
// | E1| 2001| AA| B|
// | E1| 2002| A| B|
// | E2| 2000| C| D|
// | E2| 2002| CC| D|
// | E2| 2003| C| D|
// +---+--------------+-----+-----+
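The rule applied by the lag comparison can also be written on plain Scala collections, which may help when checking the expected result by hand. This is only a sketch of the same rule, not a replacement for the Spark code; the names ConsecutiveDedup, Rec and dropConsecutiveDuplicates are hypothetical.

```scala
object ConsecutiveDedup {
  case class Rec(key: String, effectiveDate: String, desc1: String, desc2: String)

  // Within each key, ordered by date, keep a row only if it is the first one
  // or if its compared fields differ from the row immediately before it.
  def dropConsecutiveDuplicates(rows: Seq[Rec]): Seq[Rec] =
    rows
      .groupBy(_.key)
      .toSeq
      .sortBy(_._1)
      .flatMap { case (_, group) =>
        val sorted = group.sortBy(_.effectiveDate)
        // Pair every row with its predecessor (None for the first row), like lag(..., 1)
        val previous: Seq[Option[Rec]] = None +: sorted.map(Option(_))
        sorted.zip(previous).collect {
          case (cur, prev)
              if !prev.exists(p => p.desc1 == cur.desc1 && p.desc2 == cur.desc2) =>
            cur
        }
      }
}
```

Applied to the nine sample rows above, it keeps 2000, 2001 and 2002 for E1 and 2000, 2002 and 2003 for E2, matching the output of the Spark version.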

How to update column of spark dataframe based on the values of previous record

I have three columns in df
Col1,col2,col3
X,x1,x2
Z,z1,z2
Y,
X,x3,x4
P,p1,p2
Q,q1,q2
Y
I want to do the following
when col1=x,store the value of col2 and col3
and assign those column values to next row when col1=y
expected output
X,x1,x2
Z,z1,z2
Y,x1,x2
X,x3,x4
P,p1,p2
Q,q1,q2
Y,x3,x4
Any help would be appreciated
Note:-spark 1.6
Here's one approach using Window function with steps as follows:
1) Add a row-identifying column (not needed if there is already one) and combine the non-key columns (presumably many of them) into one
2) Create tmp1 with conditional nulls, and tmp2 using the last/rowsBetween Window function to fill forward with the last non-null value
3) Create newcols conditionally from cols and tmp2
4) Expand newcols back to individual columns using foldLeft
Note that this solution uses a Window function without partitioning, so Spark moves all rows into a single partition and it may not work for large datasets.
val df = Seq(
("X", "x1", "x2"),
("Z", "z1", "z2"),
("Y", "", ""),
("X", "x3", "x4"),
("P", "p1", "p2"),
("Q", "q1", "q2"),
("Y", "", "")
).toDF("col1", "col2", "col3")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val colList = df.columns.filter(_ != "col1")
val df2 = df.select($"col1", monotonically_increasing_id.as("id"),
struct(colList.map(col): _*).as("cols")
)
val df3 = df2.
withColumn( "tmp1", when($"col1" === "X", $"cols") ).
withColumn( "tmp2", last("tmp1", ignoreNulls = true).over(
Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)
) )
df3.show
// +----+---+-------+-------+-------+
// |col1| id| cols| tmp1| tmp2|
// +----+---+-------+-------+-------+
// | X| 0|[x1,x2]|[x1,x2]|[x1,x2]|
// | Z| 1|[z1,z2]| null|[x1,x2]|
// | Y| 2| [,]| null|[x1,x2]|
// | X| 3|[x3,x4]|[x3,x4]|[x3,x4]|
// | P| 4|[p1,p2]| null|[x3,x4]|
// | Q| 5|[q1,q2]| null|[x3,x4]|
// | Y| 6| [,]| null|[x3,x4]|
// +----+---+-------+-------+-------+
val df4 = df3.withColumn( "newcols",
when($"col1" === "Y", $"tmp2").otherwise($"cols")
).select($"col1", $"newcols")
df4.show
// +----+-------+
// |col1|newcols|
// +----+-------+
// | X|[x1,x2]|
// | Z|[z1,z2]|
// | Y|[x1,x2]|
// | X|[x3,x4]|
// | P|[p1,p2]|
// | Q|[q1,q2]|
// | Y|[x3,x4]|
// +----+-------+
val dfResult = colList.foldLeft( df4 )(
(accDF, c) => accDF.withColumn(c, df4(s"newcols.$c"))
).drop($"newcols")
dfResult.show
// +----+----+----+
// |col1|col2|col3|
// +----+----+----+
// | X| x1| x2|
// | Z| z1| z2|
// | Y| x1| x2|
// | X| x3| x4|
// | P| p1| p2|
// | Q| q1| q2|
// | Y| x3| x4|
// +----+----+----+
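The carry-forward step can also be pictured as a running state over ordered rows. The following plain Scala sketch uses scanLeft to remember the values of the last X row and to apply them to each Y row. It assumes the rows are already in the intended order, and the names CarryForward and fill are hypothetical.

```scala
object CarryForward {
  type Row = (String, String, String) // (col1, col2, col3)

  def fill(rows: Seq[Row]): Seq[Row] = {
    // State: (values of the last X row seen so far, row to emit at this step)
    val init: (Option[(String, String)], Option[Row]) = (None, None)
    rows
      .scanLeft(init) { case ((lastX, _), (c1, c2, c3)) =>
        c1 match {
          case "X" => (Some((c2, c3)), Some((c1, c2, c3))) // remember this X row
          case "Y" =>
            // Use the remembered X values; keep the row as-is if no X was seen yet
            val (f2, f3) = lastX.getOrElse((c2, c3))
            (lastX, Some((c1, f2, f3)))
          case _ => (lastX, Some((c1, c2, c3))) // other rows pass through unchanged
        }
      }
      .collect { case (_, Some(row)) => row } // drop the initial state
  }
}
```

For the seven sample rows this returns the same rows as dfResult, with the two Y rows filled with x1, x2 and x3, x4 respectively.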
[UPDATE]
For Spark 1.x, last(colName, ignoreNulls) isn't available in the DataFrame API. A workaround is to fall back to Spark SQL, which supports ignoring nulls in its last() function:
df2.
withColumn( "tmp1", when($"col1" === "X", $"cols") ).
createOrReplaceTempView("df2table")
// createOrReplaceTempView needs Spark 2.0+; on Spark 1.x use registerTempTable("df2table") and sqlContext.sql(...) instead
val df3 = spark.sqlContext.sql("""
select col1, id, cols, tmp1, last(tmp1, true) over (
order by id rows between unbounded preceding and current row
) as tmp2
from df2table
""")
Yes, there is a lag function, which requires an ordering:
import org.apache.spark.sql.expressions.Window.orderBy
import org.apache.spark.sql.functions.{coalesce, lag}
case class Temp(a: String, b: Option[String], c: Option[String])
val input = ss.createDataFrame(
Seq(
Temp("A", Some("a1"), Some("a2")),
Temp("D", Some("d1"), Some("d2")),
Temp("B", Some("b1"), Some("b2")),
Temp("E", None, None),
Temp("C", None, None)
))
+---+----+----+
| a| b| c|
+---+----+----+
| A| a1| a2|
| D| d1| d2|
| B| b1| b2|
| E|null|null|
| C|null|null|
+---+----+----+
val order = orderBy($"a")
input
.withColumn("b", coalesce($"b", lag($"b", 1).over(order)))
.withColumn("c", coalesce($"c", lag($"c", 1).over(order)))
.show()
+---+---+---+
| a| b| c|
+---+---+---+
| A| a1| a2|
| B| b1| b2|
| C| b1| b2|
| D| d1| d2|
| E| d1| d2|
+---+---+---+

Spark : Aggregating based on a column

I have a file consisting of 3 fields (Emp_ids, Groups, Salaries)
100 A 430
101 A 500
201 B 300
I want to get result as
1) Group name and count(*)
2) Group name and max( salary)
val myfile = "/home/hduser/ScalaDemo/Salary.txt"
val conf = new SparkConf().setAppName("Salary").setMaster("local[2]")
val sc= new SparkContext( conf)
val sal= sc.textFile(myfile)
Scala DSL:
case class Data(empId: Int, group: String, salary: Int)
// lst is assumed to hold the lines of the file, e.g. the sal RDD from the question
val df = sqlContext.createDataFrame(lst.map {v =>
val arr = v.split(' ').map(_.trim())
Data(arr(0).toInt, arr(1), arr(2).toInt)
})
df.show()
+-----+-----+------+
|empId|group|salary|
+-----+-----+------+
| 100| A| 430|
| 101| A| 500|
| 201| B| 300|
+-----+-----+------+
df.groupBy($"group").agg(count("*") as "count").show()
+-----+-----+
|group|count|
+-----+-----+
| A| 2|
| B| 1|
+-----+-----+
df.groupBy($"group").agg(max($"salary") as "maxSalary").show()
+-----+---------+
|group|maxSalary|
+-----+---------+
| A| 500|
| B| 300|
+-----+---------+
Or with plain SQL:
df.registerTempTable("salaries")
sqlContext.sql("select group, count(*) as count from salaries group by group").show()
+-----+-----+
|group|count|
+-----+-----+
| A| 2|
| B| 1|
+-----+-----+
sqlContext.sql("select group, max(salary) as maxSalary from salaries group by group").show()
+-----+---------+
|group|maxSalary|
+-----+---------+
| A| 500|
| B| 300|
+-----+---------+
While Spark SQL is the recommended way to do such aggregations for performance reasons, they can also be done easily with the RDD API:
val rdd = sc.parallelize(Seq(Data(100, "A", 430), Data(101, "A", 500), Data(201, "B", 300)))
rdd.map(v => (v.group, 1)).reduceByKey(_ + _).collect()
res0: Array[(String, Int)] = Array((B,1), (A,2))
rdd.map(v => (v.group, v.salary)).reduceByKey((s1, s2) => if (s1 > s2) s1 else s2).collect()
res1: Array[(String, Int)] = Array((B,300), (A,500))
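Both results can also be computed in one pass by carrying a (count, max) pair per group. The sketch below shows the idea on a plain Scala collection, together with a parser that splits on any run of whitespace; the names SalaryStats, parse and countAndMax are hypothetical. With an RDD, the same shape could be expressed by mapping each record to (group, (1, salary)) and combining the pairs with reduceByKey.

```scala
object SalaryStats {
  case class Data(empId: Int, group: String, salary: Int)

  // Split on any run of whitespace, so tabs or repeated spaces also work
  def parse(line: String): Data = {
    val Array(id, group, salary) = line.trim.split("\\s+")
    Data(id.toInt, group, salary.toInt)
  }

  // (count, max salary) per group, computed in a single pass
  def countAndMax(rows: Seq[Data]): Map[String, (Int, Int)] =
    rows.foldLeft(Map.empty[String, (Int, Int)]) { (acc, d) =>
      val (n, m) = acc.getOrElse(d.group, (0, Int.MinValue))
      acc.updated(d.group, (n + 1, math.max(m, d.salary)))
    }
}
```

For the three sample lines this gives a count of 2 and a maximum of 500 for group A, and a count of 1 and a maximum of 300 for group B.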