So, I have data that looks like this:
id | tag | flag | num
1 | tag_1 | Y | 100
1 | tag_2 | N | 200
2 | tag_3 | N | 100
3 | tag_4 | N | 300
3 | tag_5 | Y | 200
I need to group by id and sum num, and if a group has more than one row, I want to take the tag and flag from the row with flag = Y; otherwise use whatever row is available. So the result looks like this:
id | tag | flag | num
1 | tag_1 | Y | 300
2 | tag_3 | N | 100
3 | tag_5 | Y | 500
There is at most one Y and one N for each id, so there will never be two or more rows with flag = Y for the same id.
Is there any way to do this in Spark Scala?
Since, as you have noted, at most one row per id will have the flag set to Y, the following should work:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "tag_1", "Y", 100),
  (1, "tag_2", "N", 200),
  (2, "tag_3", "N", 100),
  (3, "tag_4", "N", 300),
  (3, "tag_5", "Y", 200)
).toDF("id", "tag", "flag", "num")
df.show
//+---+-----+----+---+
//| id| tag|flag|num|
//+---+-----+----+---+
//| 1|tag_1| Y|100|
//| 1|tag_2| N|200|
//| 2|tag_3| N|100|
//| 3|tag_4| N|300|
//| 3|tag_5| Y|200|
//+---+-----+----+---+
val joined = df.as("l").join(
    df.groupBy($"id").agg(sum($"num").as("num"), count($"*").as("cnt")).as("r"),
    $"l.id" === $"r.id")
  .select($"l.id", $"l.tag", $"l.flag", $"r.num", $"r.cnt")
joined.show(false)
//+---+-----+----+---+---+
//|id |tag |flag|num|cnt|
//+---+-----+----+---+---+
//|1 |tag_1|Y |300|2 |
//|1 |tag_2|N |300|2 |
//|2 |tag_3|N |100|1 |
//|3 |tag_4|N |500|2 |
//|3 |tag_5|Y |500|2 |
//+---+-----+----+---+---+
// keep the flag = Y row when the group has more than one row; otherwise keep the only row
joined.withColumn("filtered",
    when($"cnt" > lit(1) && $"flag" === lit("Y"), lit("y"))
      .when($"cnt" > lit(1) && $"flag" === lit("N"), lit("n"))
      .otherwise(lit("y")))
  .where($"filtered" === lit("y"))
  .drop("filtered", "cnt")
  .show
//+---+-----+----+---+
//| id| tag|flag|num|
//+---+-----+----+---+
//| 1|tag_1| Y|300|
//| 2|tag_3| N|100|
//| 3|tag_5| Y|500|
//+---+-----+----+---+
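As a possible alternative (a sketch of my own, not part of the answer above), a single groupBy can do the same thing: because "Y" sorts after "N", taking max over a struct of (flag, tag) picks the Y row whenever the group has one, and the only row otherwise.
import org.apache.spark.sql.functions.{max, struct, sum}

// max on a struct compares fields left to right, so the flag = "Y" row wins when present
val alt = df.groupBy($"id")
  .agg(max(struct($"flag", $"tag")).as("picked"), sum($"num").as("num"))
  .select($"id", $"picked.tag".as("tag"), $"picked.flag".as("flag"), $"num")
alt.orderBy("id").show
//+---+-----+----+---+
//| id|  tag|flag|num|
//+---+-----+----+---+
//|  1|tag_1|   Y|300|
//|  2|tag_3|   N|100|
//|  3|tag_5|   Y|500|
//+---+-----+----+---+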
Input:
+----+---+---+---+
| id | a | b | c |
+----+---+---+---+
| 1  | 1 | 0 | 1 |
+----+---+---+---+
Output:
+----+---+---+---+-----+-----+-----+
| id | a | b | c | a_b | a_c | b_c |
+----+---+---+---+-----+-----+-----+
| 1  | 1 | 0 | 1 | 0   | 1   | 0   |
+----+---+---+---+-----+-----+-----+
Basically, I have a sequence of pairs, Seq((a,b), (a,c), (b,c)), and their values for the new columns will be col(a)*col(b), col(a)*col(c), and col(b)*col(c).
I know how to add these columns to a DataFrame one by one, but I am not able to write a transform of return type DataFrame => DataFrame.
Is this what you want?
Take a look at the API page; you will save yourself some time :)
val df = Seq((1, 1, 0, 1))
.toDF("id", "a", "b", "c")
.withColumn("a_b", $"a" * $"b")
.withColumn("a_c", $"a" * $"c")
.withColumn("b_c", $"b" * $"c")
Output:
+---+---+---+---+---+---+---+
| id| a| b| c|a_b|a_c|b_c|
+---+---+---+---+---+---+---+
| 1| 1| 0| 1| 0| 1| 0|
+---+---+---+---+---+---+---+
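To address the DataFrame => DataFrame part of the question, here is a sketch (the helper name addProducts and the transform wiring are my own) that folds the sequence of column-name pairs into withColumn calls:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Build one DataFrame => DataFrame function from the sequence of pairs.
def addProducts(pairs: Seq[(String, String)]): DataFrame => DataFrame =
  df => pairs.foldLeft(df) { case (acc, (l, r)) =>
    acc.withColumn(s"${l}_$r", col(l) * col(r))
  }

// Usage: chain it with transform on a DataFrame that has only id, a, b, c.
val base = Seq((1, 1, 0, 1)).toDF("id", "a", "b", "c")
val result = base.transform(addProducts(Seq(("a", "b"), ("a", "c"), ("b", "c"))))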
How do I aggregate a sum over multiple columns of a DataFrame using a reduce function and not groupBy? Since the groupBy sum is taking a lot of time, I am now thinking of using a reduce function. Any lead will be helpful.
Input:
| A | B | C | D |
| x | 1 | 2 | 3 |
| x | 2 | 3 | 4 |
CODE:
dataFrame.groupBy("A").sum()
Output:
| A | B | C | D |
| x | 3 | 5 | 7 |
You will have to convert the DataFrame to an RDD to perform the reduceByKey operation.
val rows: RDD[Row] = df.rdd
Once you have created your RDD, you can use reduceByKey to add the values of multiple columns:
val input = sc.parallelize(List(("X", 1, 2, 3), ("X", 2, 3, 4)))
val final_rdd = input.map { case (a, b, c, d) => (a, (b, c, d)) }
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2, x._3 + y._3))
spark.createDataFrame(final_rdd).toDF("M", "N").select($"M", $"N._1".as("X"), $"N._2".as("Y"), $"N._3".as("Z")).show(10)
+---+---+---+---+
| M| X| Y| Z|
+---+---+---+---+
| X| 3| 5| 7|
+---+---+---+---+
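If you would rather start from the existing DataFrame (via df.rdd, as mentioned above) instead of sc.parallelize, here is a sketch along the same lines; it assumes column A is a string and B, C, D are integers, as in the sample data:
import org.apache.spark.sql.Row

// Key by column A, sum B, C and D per key, then rebuild a DataFrame.
val summed = dataFrame.rdd
  .map { case Row(a: String, b: Int, c: Int, d: Int) => (a, (b, c, d)) }
  .reduceByKey { case ((b1, c1, d1), (b2, c2, d2)) => (b1 + b2, c1 + c2, d1 + d2) }
  .map { case (a, (b, c, d)) => (a, b, c, d) }
spark.createDataFrame(summed).toDF("A", "B", "C", "D").show()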
I have data like this:
+------+------+------+----------+----------+----------+----------+----------+----------+
| Col1 | Col2 | Col3 | Col1_cnt | Col2_cnt | Col3_cnt | Col1_wts | Col2_wts | Col3_wts |
+------+------+------+----------+----------+----------+----------+----------+----------+
| AAA | VVVV | SSSS | 3 | 4 | 5 | 0.5 | 0.4 | 0.6 |
| BBB | BBBB | TTTT | 3 | 4 | 5 | 0.5 | 0.4 | 0.6 |
| CCC | DDDD | YYYY | 3 | 4 | 5 | 0.5 | 0.4 | 0.6 |
+------+------+------+----------+----------+----------+----------+----------+----------+
I have tried, but I have not gotten anywhere with this.
val df = Seq(("G",Some(4),2,None),("H",None,4,Some(5))).toDF("A","X","Y", "Z")
I want the output in the form of the table below:
+-----------+---------+---------+
| Cols_name | Col_cnt | Col_wts |
+-----------+---------+---------+
| Col1 | 3 | 0.5 |
| Col2 | 4 | 0.4 |
| Col3 | 5 | 0.6 |
+-----------+---------+---------+
Here's a general approach for transposing a DataFrame:
For each of the pivot columns (say c1, c2, c3), combine the column name and associated value columns into a struct (e.g. struct(lit(c1), c1_cnt, c1_wts))
Put all these struct-typed columns into an array which is then explode-ed into rows of struct columns
Group by the pivot column name to aggregate the associated struct elements
The following sample code has been generalized to handle an arbitrary list of columns to be transposed:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
("AAA", "VVVV", "SSSS", 3, 4, 5, 0.5, 0.4, 0.6),
("BBB", "BBBB", "TTTT", 3, 4, 5, 0.5, 0.4, 0.6),
("CCC", "DDDD", "YYYY", 3, 4, 5, 0.5, 0.4, 0.6)
).toDF("c1", "c2", "c3", "c1_cnt", "c2_cnt", "c3_cnt", "c1_wts", "c2_wts", "c3_wts")
val pivotCols = Seq("c1", "c2", "c3")
val valueColSfx = Seq("_cnt", "_wts")
val arrStructs = pivotCols.map { c =>
  val cols = lit(c).as("_pvt") +: valueColSfx.map(s => col(c + s).as(s))
  struct(cols: _*).as(c + "_struct")
}
val valueColAgg = valueColSfx.map(s => first($"struct_col.$s").as(s + "_first"))
df.
select(array(arrStructs: _*).as("arr_structs")).
withColumn("struct_col", explode($"arr_structs")).
groupBy($"struct_col._pvt").agg(valueColAgg.head, valueColAgg.tail: _*).
show
// +----+----------+----------+
// |_pvt|_cnt_first|_wts_first|
// +----+----------+----------+
// | c1| 3| 0.5|
// | c3| 5| 0.6|
// | c2| 4| 0.4|
// +----+----------+----------+
Note that function first is used in the above example, but it could be any other aggregate function (e.g. avg, max, collect_list) depending on the specific business requirement.
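For example, here is a small variation of the code above (reusing arrStructs and valueColSfx) that swaps first for collect_list, keeping every value per pivot column instead of only the first:
val valueColAggList = valueColSfx.map(s => collect_list($"struct_col.$s").as(s + "_all"))
df.
  select(array(arrStructs: _*).as("arr_structs")).
  withColumn("struct_col", explode($"arr_structs")).
  groupBy($"struct_col._pvt").agg(valueColAggList.head, valueColAggList.tail: _*).
  show(false)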
I want to join two tables A and B and pick the records having max date from table B for each value.
Consider the following tables:
Table A:
+---+-----+----------+
| id|Value|start_date|
+---+-----+----------+
| 1 | a | 1/1/2018 |
| 2 | a | 4/1/2018 |
| 3 | a | 8/1/2018 |
| 4 | c | 1/1/2018 |
| 5 | d | 1/1/2018 |
| 6 | e | 1/1/2018 |
+---+-----+----------+
Table B:
+---+-----+----------+
|Key|Value|sent_date |
+---+-----+----------+
| x | a | 2/1/2018 |
| y | a | 7/1/2018 |
| z | a | 11/1/2018|
| p | c | 5/1/2018 |
| q | d | 5/1/2018 |
| r | e | 5/1/2018 |
+---+-----+----------+
The aim is to bring the id column from Table A into Table B for each Value in Table B.
To do this, Tables A and B need to be joined on the Value column, and for each record in B, the row in A with max(A.start_date) is found subject to the condition A.start_date < B.sent_date.
Let's consider value=a here.
In Table A, there are 3 records for Value=a with 3 different start_date values.
So when joining with Table B, for value=a with sent_date=2/1/2018, the record with the max(start_date) among the start_dates that are less than the sent_date is taken (in this case 1/1/2018), and the corresponding A.id is pulled into Table B.
Similarly, for the record with value=a and sent_date=11/1/2018 in Table B, id=3 from Table A needs to be pulled into Table B.
The result must be as follows:
+---+-----+----------+---+
|Key|Value|sent_date |id |
+---+-----+----------+---+
| x | a | 2/1/2018 | 1 |
| y | a | 7/1/2018 | 2 |
| z | a | 11/1/2018| 3 |
| p | c | 5/1/2018 | 4 |
| q | d | 5/1/2018 | 5 |
| r | e | 5/1/2018 | 6 |
+---+-----+----------+---+
I am using Spark 2.3.
I have joined the two tables (using DataFrames) and found the max(start_date) based on the condition.
But I am unable to figure out how to pull the corresponding records.
Can anyone help me out here?
Thanks in advance!
I just changed the date "11/1/2018" to "9/1/2018" because string sorting gives incorrect results for it. Once the strings are converted to proper dates, the logic still works. See below:
scala> val df_a = Seq((1,"a","1/1/2018"),
| (2,"a","4/1/2018"),
| (3,"a","8/1/2018"),
| (4,"c","1/1/2018"),
| (5,"d","1/1/2018"),
| (6,"e","1/1/2018")).toDF("id","value","start_date")
df_a: org.apache.spark.sql.DataFrame = [id: int, value: string ... 1 more field]
scala> val df_b = Seq(("x","a","2/1/2018"),
| ("y","a","7/1/2018"),
| ("z","a","9/1/2018"),
| ("p","c","5/1/2018"),
| ("q","d","5/1/2018"),
| ("r","e","5/1/2018")).toDF("key","valueb","sent_date")
df_b: org.apache.spark.sql.DataFrame = [key: string, valueb: string ... 1 more field]
scala> val df_join = df_b.join(df_a,'valueb==='valuea,"inner")
df_join: org.apache.spark.sql.DataFrame = [key: string, valueb: string ... 4 more fields]
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> df_join.filter('sent_date >= 'start_date).withColumn("rank", rank().over(Window.partitionBy('key,'valueb,'sent_date).orderBy('start_date.desc))).filter('rank===1).drop("valuea","start_date","rank").show()
+---+------+---------+---+
|key|valueb|sent_date| id|
+---+------+---------+---+
| q| d| 5/1/2018| 5|
| p| c| 5/1/2018| 4|
| r| e| 5/1/2018| 6|
| x| a| 2/1/2018| 1|
| y| a| 7/1/2018| 2|
| z| a| 9/1/2018| 3|
+---+------+---------+---+
scala>
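As a possible alternative to the window rank (a sketch of my own, reusing df_join from above), a groupBy with max over a struct of (start_date, id) picks the row with the latest qualifying start_date per record in B; this assumes start_date sorts correctly, i.e. after the date conversion shown in the update below:
import org.apache.spark.sql.functions.{max, struct}

// struct comparison is field by field, so max picks the row with the greatest start_date
df_join.filter('sent_date >= 'start_date)
  .groupBy('key, 'valueb, 'sent_date)
  .agg(max(struct('start_date, 'id)).as("best"))
  .select('key, 'valueb, 'sent_date, $"best.id".as("id"))
  .show()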
UPDATE
Below is a UDF to handle the date strings in M/d/yyyy format:
scala> def dateConv(x:String):String=
| {
| val y = x.split("/").map(_.toInt).map("%02d".format(_))
| y(2)+"-"+y(0)+"-"+y(1)
| }
dateConv: (x: String)String
scala> val udfdateconv = udf( dateConv(_:String):String )
udfdateconv: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> val df_a_dt = df_a.withColumn("start_date",date_format(udfdateconv('start_date),"yyyy-MM-dd").cast("date"))
df_a_dt: org.apache.spark.sql.DataFrame = [id: int, valuea: string ... 1 more field]
scala> df_a_dt.printSchema
root
|-- id: integer (nullable = false)
|-- valuea: string (nullable = true)
|-- start_date: date (nullable = true)
scala> df_a_dt.show()
+---+------+----------+
| id|valuea|start_date|
+---+------+----------+
| 1| a|2018-01-01|
| 2| a|2018-04-01|
| 3| a|2018-08-01|
| 4| c|2018-01-01|
| 5| d|2018-01-01|
| 6| e|2018-01-01|
+---+------+----------+
scala>
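As a possible alternative to the UDF: Spark 2.2+ has a to_date overload that takes a format string, so the M/d/yyyy strings can be parsed directly (the *_dt2 names here are mine):
import org.apache.spark.sql.functions.to_date

val df_a_dt2 = df_a.withColumn("start_date", to_date($"start_date", "M/d/yyyy"))
val df_b_dt2 = df_b.withColumn("sent_date", to_date($"sent_date", "M/d/yyyy"))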
I have a Spark DataFrame (df) which looks like this:
+---+---+----+----+
| c1| c2|  c3|  c4|
+---+---+----+----+
|  1|  5|null|   7|
|  1|  5|   4|   8|
|  1|  3|null|  11|
|  1|  3|null|null|
|  2|  6|  23|  17|
|  2|  6|   7|   3|
|  2|  3|null|  11|
|  2|  3|null|  17|
+---+---+----+----+
I want to aggregate using (c1, c2) as the key and take the average of c3 and c4, so that I have this:
+---+---+----+---+
| c1| c2|  c3| c4|
+---+---+----+---+
|  1|  5|   4|7.5|
|  1|  3|null| 11|
|  2|  6|  15| 10|
|  2|  3|null| 14|
+---+---+----+---+
So, essentially I am ignoring the null values.
My half-baked code looks something like this:
val df1 = df.
// just working on c3 for time being
map(x => ((x.getInt(0), x.getInt(1)), x.getDouble(3))).
reduceByKey(
(x, y) => {
var temp = 0
var sum = 0.0
var flag = false
if (x == null) {
if (y != null) {
temp = temp + 1
sum = y
flag = true
}
} else {
if (y == null) {
temp = temp + 1
sum = x
} else {
temp = temp + 1
sum = x + y
flag = true
}
}
if (flag == false) {
null
} else {
sum/temp
}
}
)
Obviously, the above code is not working. Any help to make the code work is very much appreciated.
Edit 1: The answer given by #zero232 is a solution. However, it is not "the solution" I am looking for. My interest is in understanding how to deal with null values when writing a custom function for reduceByKey(). I am re-asking the question below:
I want to aggregate using (c1, c2) as the key and compute the root mean square [{sum(a_i^2)}^0.5] (or some function which is not available in Spark, for that matter) of c3 and c4 while ignoring the null values, so that I have this:
+---+---+-----+-----+
| c1| c2|   c3|   c4|
+---+---+-----+-----+
|  1|  5|    4|10.63|
|  1|  3| null|   11|
|  2|  6|24.04|17.26|
|  2|  3| null|20.24|
+---+---+-----+-----+
Just groupBy and use mean:
df.groupBy("c1", "c2").mean("c3", "c4")
or agg
df.groupBy("c1", "c2").agg(avg("c3"), avg("c4"))
Typically, all primitive functions on DataFrames handle null values correctly. For the root mean square from your edit, you can build it from the built-in functions:
import org.apache.spark.sql.functions._
def rms(c: String) = sqrt(avg(pow(col(c), 2))).alias(s"rms($c)")
df.groupBy("c1", "c2").agg(rms("c3"), rms("c4"))
If you want to ignore nulls with RDDs, just filter them out before you apply the reduction:
somePairRDD.filter(_._2 != null)
  .foldByKey(someDefaultValue)(someReducingFunction)
or convert values to Option and use pattern matching:
somePairRDD.mapValues(Option(_)).reduceByKey {
  case (Some(x), Some(y)) => doSomething(x, y)
  case (Some(x), _) => doSomething(x)
  case (_, Some(y)) => doSomething(y)
  case _ => someDefault
}
or use map / flatMap / getOrElse and other standard tools to handle undefined values.
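Putting the Option approach together with the edit above, here is a minimal sketch of a null-ignoring reduceByKey that computes the [{sum(a_i^2)}^0.5] for c4, keyed by (c1, c2); it assumes c1 and c2 are integers and c4 is a double, so adjust the getters to your actual schema:
// Minimal sketch: null-aware reduce for one value column (c4 at index 3).
val rootSumSquares = df.rdd
  .map { r =>
    val key = (r.getInt(0), r.getInt(1))                                          // (c1, c2)
    val sq  = if (r.isNullAt(3)) None else Some(r.getDouble(3) * r.getDouble(3))  // c4 squared, null -> None
    (key, sq)
  }
  .reduceByKey {
    case (Some(x), Some(y)) => Some(x + y)
    case (Some(x), None)    => Some(x)
    case (None, y)          => y
  }
  .mapValues(_.map(math.sqrt))  // groups that were all null stay None (i.e. null)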