Spark dataframe last object in a group [duplicate] - scala

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 4 years ago.
I need to select the last 'name' for the given 'id'. A possible solution could be the following:
val channels = sessions
.select($"start_time", $"id", $"name")
.orderBy($"start_time")
.select($"id", $"name")
.groupBy($"id")
.agg(last("name"))
I don't know if this is correct, because I'm not sure the ordering from orderBy is preserved through groupBy.
In any case it's certainly not a performant solution. I should probably use reduceByKey. I tried the following in the spark shell and it works:
val x = sc.parallelize(Array(("1", "T1"), ("2", "T2"), ("1", "T11"), ("1", "T111"), ("2", "T22"), ("1", "T100"), ("2", "T222"), ("2", "T200")), 3)
x.reduceByKey((acc,x) => x).collect
But it doesn't work with my dataframe.
case class ChannelRecord(id: Long, name: String)
val channels = sessions
.select($"start_time", $"id", $"name")
.orderBy($"start_time")
.select($"id", $"name")
.as[ChannelRecord]
.reduceByKey((acc, x) => x) // take the last object
I got a compilation error: value reduceByKey is not a member of org.apache.spark.sql.Dataset
I think I should add a map() call before doing reduceByKey, but I cannot figure out what I should map over.

You could do it with a window function, for example. This will require a shuffle on the id column and a sort on start_time.
There are two stages:
Get last name for each id
Keep only rows with the last name (max start_time)
Example dataframe:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val rowsRdd: RDD[Row] = spark.sparkContext.parallelize(
Seq(
Row(1, "a", 1),
Row(1, "b", 2),
Row(1, "c", 3),
Row(2, "d", 4),
Row(2, "e", 5),
Row(2, "f", 6),
Row(3, "g", 7),
Row(3, "h", 8)
))
val schema: StructType = new StructType()
.add(StructField("id", IntegerType, false))
.add(StructField("name", StringType, false))
.add(StructField("start_time", IntegerType, false))
val df0: DataFrame = spark.createDataFrame(rowsRdd, schema)
Define a window. Note that I am sorting here by start_time in decreasing order. This makes it possible to pick the first row per id in the next step.
val w = Window.partitionBy("id").orderBy(col("start_time").desc)
Then
df0.withColumn("last_name", first("name").over(w)) // get first name for each id (first because of decreasing start_time)
.withColumn("row_number", row_number().over(w)) // get row number for each id sorted by start_time
.filter("row_number=1") // choose only first rows (first row = max start_time)
.drop("row_number") // get rid of row_number columns
.sort("id")
.show(10, false)
This returns
+---+----+----------+---------+
|id |name|start_time|last_name|
+---+----+----------+---------+
|1 |c |3 |c |
|2 |f |6 |f |
|3 |h |8 |h |
+---+----+----------+---------+
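As an aside, if you would rather stay in the typed Dataset API the question was reaching for, the rough analogue of reduceByKey is groupByKey followed by reduceGroups. A minimal sketch, assuming sessions has id, name and start_time columns and that start_time can be read as a Long (adjust the field types to your schema; TimedRecord and lastNames are illustrative names):
import spark.implicits._

// carry start_time along so the reduce can keep the later record per id
case class TimedRecord(id: Long, name: String, start_time: Long)

val lastNames = sessions
  .select($"id", $"name", $"start_time")
  .as[TimedRecord]
  .groupByKey(_.id)
  .reduceGroups((a, b) => if (a.start_time >= b.start_time) a else b)
  .map { case (_, rec) => (rec.id, rec.name) }
  .toDF("id", "name")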

Related

Spark SQL Split or Extract words from String of Words

I have a Spark Dataframe like below. I'm trying to split the content column into 3 more columns:
date time content
28may 11am [ssid][customerid,shopid]
val personDF2 = personDF.withColumn("temp", split(col("content"), "\\[")).select(
  col("*") +: (0 until 3).map(i => col("temp").getItem(i).as(s"col$i")): _*)
date time content col1 col2 col3
28may 11 [ssid][customerid,shopid] ssid customerid shopid
Assuming a String represents an Array of Words. Got your request. You can also optimize the number of intermediate dataframes to reduce load on the system. If there are more than 9 columns you may need to use c00, c01, etc. so that names like c10 sort correctly, or just use an integer as the column name; I leave that up to you.
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
// Set up data
val df = spark.sparkContext.parallelize(Seq(
("A", "[foo][customerid,shopid][Donald,Trump,Esq][single]"),
("B", "[foo]")
)).toDF("k", "v")
val df2 = df.withColumn("words_temp", regexp_replace($"v", lit("]"), lit("" )))
val df3 = df2.withColumn("words_temp2", regexp_replace($"words_temp", lit(","), lit("[" ))).drop("words_temp")
val df4 = df3.withColumn("words_temp3", expr("substring(words_temp2, 2, length(words_temp2))")).withColumn("cnt", expr("length(words_temp2)")).drop("words_temp2")
val df5 = df4.withColumn("words",split(col("words_temp3"),"\\[")).drop("words_temp3")
val df6 = df5.withColumn("num_words", size($"words"))
val df7 = df6.withColumn("v2", explode($"words"))
// Convert to Array of sorts via group by
val df8 = df7.groupBy("k")
.agg(collect_list("v2"))
// Convert to an RDD of tuples and then find each element's position so as to generate column names; that is the key to being able to use pivot
val rdd = df8.rdd
val rdd2 = rdd.map(row => (row.getAs[String](0), row.getAs[WrappedArray[String]](1).toArray))
val rdd3 = rdd2.map { case (k, list) => (k, list.zipWithIndex) }
val df9 = rdd3.toDF("k", "v")
val df10 = df9.withColumn("vn", explode($"v"))
val df11 = df10.select($"k", $"vn".getField("_1"), concat(lit("c"),$"vn".getField("_2"))).toDF("k", "v", "c")
// Final manipulation
val result = df11.groupBy("k")
.pivot("c")
.agg(expr("coalesce(first(v),null)")) // May never occur in your case, just done for completeness and variable length cols.
result.show(100,false)
returns in this case:
+---+---+----------+------+------+-----+----+------+
|k |c0 |c1 |c2 |c3 |c4 |c5 |c6 |
+---+---+----------+------+------+-----+----+------+
|B |foo|null |null |null |null |null|null |
|A |foo|customerid|shopid|Donald|Trump|Esq |single|
+---+---+----------+------+------+-----+----+------+
Update:
Based on the original title stating an array of words; see the other answer as well.
If you are new to this, a few notes: it can presumably also be done with a Dataset and map. Here is a solution using DataFrames and RDDs. I might well investigate a complete Dataset approach in future, but this works for sure and at scale.
// Can amalgamate more steps
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
// Set up data
val df = spark.sparkContext.parallelize(Seq(
("A", Array(Array("foo", "bar"), Array("Donald", "Trump","Esq"), Array("single"))),
("B", Array(Array("foo2", "bar2"), Array("single2"))),
("C", Array(Array("foo3", "bar3", "x", "y", "z")))
)).toDF("k", "v")
// flatten via 2x explode; could be done more elegantly with a def or UDF, but keeping it simple here
val df2 = df.withColumn("v2", explode($"v"))
val df3 = df2.withColumn("v3", explode($"v2"))
// Convert to Array of sorts via group by
val df4 = df3.groupBy("k")
.agg(collect_list("v3"))
// Convert to an RDD of tuples and then find each element's position so as to generate column names; that is the key to being able to use pivot
val rdd = df4.rdd
val rdd2 = rdd.map(row => (row.getAs[String](0), row.getAs[WrappedArray[String]](1).toArray))
val rdd3 = rdd2.map { case (k, list) => (k, list.zipWithIndex) }
val df5 = rdd3.toDF("k", "v")
val df6 = df5.withColumn("vn", explode($"v"))
val df7 = df6.select($"k", $"vn".getField("_1"), concat(lit("c"),$"vn".getField("_2"))).toDF("k", "v", "c")
// Final manipulation
val result = df7.groupBy("k")
.pivot("c")
.agg(expr("coalesce(first(v),null)")) // May never occur in your case, just done for completeness and variable length cols.
result.show(100,false)
returns, in the correct column order:
+---+----+----+-------+-----+----+------+
|k |c0 |c1 |c2 |c3 |c4 |c5 |
+---+----+----+-------+-----+----+------+
|B |foo2|bar2|single2|null |null|null |
|C |foo3|bar3|x |y |z |null |
|A |foo |bar |Donald |Trump|Esq |single|
+---+----+----+-------+-----+----+------+
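As a side note, on Spark 2.1+ the RDD round-trip used above to attach positions can usually be avoided with posexplode, which emits each element together with its index. A minimal, self-contained sketch under that assumption (the input data and the withWords/pivoted names are illustrative):
import org.apache.spark.sql.functions._
import spark.implicits._

// hypothetical input: one row per key with its flattened array of words
val withWords = Seq(
  ("A", Seq("foo", "bar", "Donald", "Trump", "Esq", "single")),
  ("B", Seq("foo2", "bar2", "single2"))
).toDF("k", "words")

val pivoted = withWords
  .select($"k", posexplode($"words"))         // yields columns k, pos, col
  .withColumn("c", concat(lit("c"), $"pos"))  // build the c0, c1, ... column names
  .groupBy("k")
  .pivot("c")
  .agg(first("col"))

pivoted.show(false)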

Finding size of distinct array column

I am using Scala and Spark to create a dataframe. Here's my code so far:
val df = transformedFlattenDF
  .groupBy($"market", $"city", $"carrier")
  .agg(
    count("*").alias("count"),
    min($"bandwidth").alias("bandwidth"),
    first($"network").alias("network"),
    concat_ws(",", collect_list($"carrierCode")).alias("carrierCode"))
  .withColumn("carrierCode", split($"carrierCode", ",").cast("array<string>"))
  .withColumn("Carrier Count", collect_set("carrierCode"))
The column carrierCode becomes an array column. The data is present as follows:
CarrierCode
1: [12,2,12]
2: [5,2,8]
3: [1,1,3]
I'd like to create a column that counts the number of distinct values in each array. I tried doing collect_set; however, it gives me an error saying grouping expressions sequence is empty. Is it possible to find the number of distinct values in each row's array? That way, in our same example, there could be a column like so:
Carrier Count
1: 2
2: 3
3: 2
collect_set is an aggregate function, hence it should be applied within your groupBy-agg step:
val df = transformedFlattenDF.groupBy($"market", $"city", $"carrier").agg(
count("*").alias("count"), min($"bandwidth").alias("bandwidth"),
first($"network").alias("network"),
concat_ws(",", collect_list($"carrierCode")).alias("carrierCode"),
size(collect_set($"carrierCode")).as("carrier_count") // <-- ADDED `collect_set`
).
withColumn("carrierCode", split(($"carrierCode"), ",").cast("array<string>"))
If you don't want to change the existing groupBy-agg code, you can create a UDF like in the following example:
import org.apache.spark.sql.functions._
val codeDF = Seq(
Array("12", "2", "12"),
Array("5", "2", "8"),
Array("1", "1", "3")
).toDF("carrier_code")
def distinctElemCount = udf( (a: Seq[String]) => a.toSet.size )
codeDF.withColumn("carrier_count", distinctElemCount($"carrier_code")).
show
// +------------+-------------+
// |carrier_code|carrier_count|
// +------------+-------------+
// | [12, 2, 12]| 2|
// | [5, 2, 8]| 3|
// | [1, 1, 3]| 2|
// +------------+-------------+
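As a side note, if you are on Spark 2.4 or later, the built-in array_distinct can replace the UDF entirely; a minimal sketch reusing codeDF from above:
import org.apache.spark.sql.functions._

// Spark 2.4+: size(array_distinct(...)) counts distinct elements without a UDF
codeDF.withColumn("carrier_count", size(array_distinct($"carrier_code"))).show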
Without a UDF, using RDD conversion (and back to a DF if needed), for posterity:
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(
("A", 2, 100, 2), ("F", 7, 100, 1), ("B", 10, 100, 100)
)).toDF("c1", "c2", "c3", "c4")
val x = df.select("c1", "c2", "c3", "c4").rdd.map(x => (x.get(0), List(x.get(1), x.get(2), x.get(3))))
val y = x.map { case (k, vL) => (k, vL.toSet.size) }
y.collect
// Manipulate back to your DF via conversion, join, or whatever you need.
Returns:
res15: Array[(Any, Int)] = Array((A,2), (F,3), (B,2))
The solution above is better; this one is included mostly for posterity.
You can use a UDF and do it like this:
//Input
df.show
+-----------+
|CarrierCode|
+-----------+
|1:[12,2,12]|
| 2:[5,2,8]|
| 3:[1,1,3]|
+-----------+
//udf
val countUDF = udf { (str: String) =>
  val strArr = str.split(":")
  // strip the surrounding brackets before splitting, then count distinct values
  strArr(0) + ":" + strArr(1).stripPrefix("[").stripSuffix("]").split(",").distinct.length.toString
}
df.withColumn("Carrier Count", countUDF(col("CarrierCode"))).show
//Sample Output:
+-----------+-------------+
|CarrierCode|Carrier Count|
+-----------+-------------+
|1:[12,2,12]|          1:2|
|  2:[5,2,8]|          2:3|
|  3:[1,1,3]|          3:2|
+-----------+-------------+

Spark dataframe change column value to timestamp [duplicate]

This question already has answers here:
Apache Spark subtract days from timestamp column
(2 answers)
Closed 4 years ago.
I have a jsonl file I've read in, created a temporary table view from, and filtered down to the records that I want to amend.
val df = session.read.json("tiny.jsonl")
df.createOrReplaceTempView("tempTable")
val filter = df.select("*").where("field IS NOT NULL")
Now I am at the part where I have been trying various things. I want to change a column called "time" to the current timestamp before I write it back. Sometimes I will want it to be the current timestamp minus 5 days, for example.
val change = filter.withColumn("server_time", date_add(current_timestamp(), -1))
The example above gives me back a date that is one day before today, rather than a timestamp.
Edit:
Sample Dataframe that mocks out my jsonl input:
val df = Seq(
(1, "fn", "2018-02-18T22:18:28.645Z"),
(2, "fu", "2018-02-18T22:18:28.645Z"),
(3, null, "2018-02-18T22:18:28.645Z")
).toDF("id", "field", "time")
Expected output:
+---+------+-------------------------+
| id|field |time |
+---+------+-------------------------+
| 1| fn | 2018-04-09T22:18:28.645Z|
| 2| fu | 2018-04-09T22:18:28.645Z|
+---+------+-------------------------+
If you want to replace the current time column with the current timestamp, you can use the current_timestamp function. To add a number of days you can use a SQL INTERVAL expression:
val df = Seq(
(1, "fn", "2018-02-18T22:18:28.645Z"),
(2, "fu", "2018-02-18T22:18:28.645Z"),
(3, null, "2018-02-18T22:18:28.645Z")
).toDF("id", "field", "time")
.na.drop()
val ddf = df
.withColumn("time", current_timestamp())
.withColumn("newTime", $"time" + expr("INTERVAL 5 DAYS"))
Output:
+---+-----+-----------------------+-----------------------+
|id |field|time |newTime |
+---+-----+-----------------------+-----------------------+
|1 |fn |2018-04-10 15:14:27.501|2018-04-15 15:14:27.501|
|2 |fu |2018-04-10 15:14:27.501|2018-04-15 15:14:27.501|
+---+-----+-----------------------+-----------------------+
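For the other case the question mentions (current timestamp minus five days), the same interval arithmetic works in the other direction; a minimal sketch reusing the df above (past and pastTime are illustrative names):
import org.apache.spark.sql.functions._

// subtract an interval from the timestamp instead of adding one
val past = df
  .withColumn("time", current_timestamp())
  .withColumn("pastTime", expr("time - INTERVAL 5 DAYS"))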

How to transpose/pivot the rows data to column in Spark Scala? [duplicate]

This question already has answers here:
How to pivot Spark DataFrame?
(10 answers)
Closed 5 years ago.
I am new to Spark-SQL. I have information in a Spark Dataframe like this:
Company Type Status
A X done
A Y done
A Z done
C X done
C Y done
B Y done
I want it to be displayed like the following:
Company X-type Y-type Z-type
A done done done
B pending done pending
C done done pending
I am not able to achieve this in Spark-SQL.
Please help.
You can group by Company and then use the pivot function on the Type column.
Here is a simple example:
import org.apache.spark.sql.functions._
val df = spark.sparkContext.parallelize(Seq(
("A", "X", "done"),
("A", "Y", "done"),
("A", "Z", "done"),
("C", "X", "done"),
("C", "Y", "done"),
("B", "Y", "done")
)).toDF("Company", "Type", "Status")
val result = df.groupBy("Company")
.pivot("Type")
.agg(expr("coalesce(first(Status), \"pending\")"))
result.show()
Output:
+-------+-------+----+-------+
|Company| X| Y| Z|
+-------+-------+----+-------+
| B|pending|done|pending|
| C| done|done|pending|
| A| done|done| done|
+-------+-------+----+-------+
You can rename the columns later.
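For example, a quick sketch (not part of the original answer) of renaming the pivoted columns to the headers asked for; renamed is just an illustrative name:
val renamed = result
  .withColumnRenamed("X", "X-type")
  .withColumnRenamed("Y", "Y-type")
  .withColumnRenamed("Z", "Z-type")
renamed.show()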
Hope this helps!

Merging the results of a scala spark dataframe as an array of results in another dataframe's column

Is there a way to take the following two dataframes and join them by the col0 field producing the output below?
//dataframe1
val df1 = Seq(
(1, 9, 100.1, 10)
).toDF("pk", "col0", "col1", "col2")
//dataframe2
val df2 = Seq(
(1, 9 "a1", "b1"),
(2, 9 "a2", "b2")
).toDF("pk", "col0", "str_col1", "str_col2")
//expected dataframe result
+---+-----+----+---------------------------+
| pk| col1|col2| new_arr_col |
+---+-----+----+---------------------------+
| 1|100.1| 10|[[1,9,a1, b1],[2,9,a2, b2]]|
+---+-----+----+---------------------------+
import org.apache.spark.sql.functions._
import spark.implicits._
// creating new array column out of all df2 columns:
val df2AsArray = df2.select($"col0", array(df2.columns.map(col): _*) as "new_arr_col")
val result = df1.join(df2AsArray, "col0")
.groupBy(df1.columns.map(col): _*) // grouping by all df1 columns
.agg(collect_list("new_arr_col") as "new_arr_col") // collecting array of arrays
.drop("col0")
result.show(false)
// +---+-----+----+--------------------------------------------------------+
// |pk |col1 |col2|new_arr_col |
// +---+-----+----+--------------------------------------------------------+
// |1 |100.1|10 |[WrappedArray(2, 9, a2, b2), WrappedArray(1, 9, a1, b1)]|
// +---+-----+----+--------------------------------------------------------+
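A hedged variation: array() coerces all of df2's columns to a single common element type (strings here), so if you want to keep each column's original type you can collect a struct instead (df2AsStruct and structResult are illustrative names):
// same join/group-by pattern, but each df2 row becomes a typed struct rather than a string array
val df2AsStruct = df2.select($"col0", struct(df2.columns.map(col): _*) as "new_struct_col")

val structResult = df1.join(df2AsStruct, "col0")
  .groupBy(df1.columns.map(col): _*)
  .agg(collect_list("new_struct_col") as "new_arr_col")
  .drop("col0")

structResult.show(false)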