Spark scala aggregate to an array and concat it - scala

I have a Dataset with a number of columns that looks like this (columns: name, timestamp, platform, clickcount, id):
Joy 2021-10-10T10:27:16 apple 5 1
May 2020-12-12T22:28:08 android 6 2
June 2021-09-15T20:20:06 Microsoft 9 3
Joy 2021-09-09T09:30:09 android 10 1
May 2021-08-08T05:05:05 apple 8 2
I want to group by id, and after grouping it should look like this:
Joy 2021-10-10T10:27:16,2021-09-09T09:30:09 apple,android 5,10 1
May 2020-12-12T22:28:08,2021-08-08T05:05:05 android,apple 6,8 2
June 2021-09-15T20:20:06 Microsoft 9 3
After calling another API which converts the id to a pseudo id, I want to map that id so the result looks like this:
Joy 2021-10-10T10:27:16,2021-09-09T09:30:09 apple,android 5,10 1 A12
May 2020-12-12T22:28:08,2021-08-08T05:05:05 android,apple 6,8 2 B23
June 2021-09-15T20:20:06 Microsoft 9 3 C34
I have tried using groupBy and foreach but I am stuck and unable to proceed further.

In order to apply the aggregation you want, you should use collect_set as the aggregation function and concat_ws to join the resulting arrays with commas:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{collect_set, concat_ws}
import spark.implicits._

val df: DataFrame = Seq(
  ("joy", "2021-10-10T10:27:16", "apple", 5, 1),
  ("may", "2020-12-12T22:28:08", "android", 6, 2),
  ("june", "2021-09-15T20:20:06", "microsoft", 9, 3),
  ("joy", "2021-09-09T09:30:09", "android", 10, 1),
  ("may", "2021-08-08T05:05:05", "apple", 8, 2)
).toDF("name", "timestamp", "platform", "clickcount", "id")

df
  .groupBy("id")
  .agg(
    concat_ws(",", collect_set("timestamp")).as("timestamp"),
    concat_ws(",", collect_set("name")).as("name"),
    concat_ws(",", collect_set("platform")).as("platform"),
    concat_ws(",", collect_set("clickcount")).as("clickcount")
  ).show()
The output should be:
+---+--------------------+----+-------------+----------+
| id| timestamp|name| platform|clickcount|
+---+--------------------+----+-------------+----------+
| 1|2021-10-10T10:27:...| joy|apple,android| 5,10|
| 3| 2021-09-15T20:20:06|june| microsoft| 9|
| 2|2021-08-08T05:05:...| may|apple,android| 6,8|
+---+--------------------+----+-------------+----------+
In order to add a pseudo id column, you should join the created dataframe with another dataframe that contains the conversion values, or write a UDF that receives an id value and converts it into a pseudo id.
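For example, a minimal sketch of the join approach, assuming the id-to-pseudo-id mapping returned by the external API has already been collected into a small local sequence (the mapping values below are just the ones shown in the question, used here for illustration):

// Hypothetical mapping returned by the external API: id -> pseudo id
val pseudoIds = Seq((1, "A12"), (2, "B23"), (3, "C34")).toDF("id", "pseudo_id")

val aggregated = df
  .groupBy("id")
  .agg(
    concat_ws(",", collect_set("timestamp")).as("timestamp"),
    concat_ws(",", collect_set("name")).as("name"),
    concat_ws(",", collect_set("platform")).as("platform"),
    concat_ws(",", collect_set("clickcount")).as("clickcount")
  )

// Attach the pseudo id by joining on id
aggregated.join(pseudoIds, Seq("id"), "left").show()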

Related

Better way to concatenate many columns?

I have 30 columns. 26 of them are named after the letters of the alphabet. I'd like to take those 26 columns and combine them into one column as a single string.
price dateCreate volume country A B C D E ... Z
19 20190501 25 US 1 2 5 6 19 ... 30
49 20190502 30 US 5 4 5 0 34 ... 50
I want this:
price dateCreate volume country new_col
19 20190501 25 US "1,2,5,6,19,....30"
49 20190502 30 US "5,4,5,0,34,50"
I know I can do something like this:
df.withColumn("new_col", concat($"A", $"B", ...$"Z"))
However, in the future when faced with this problem I'd like to know how I can more easily concatenate many columns. Is there a way?
Just apply the following to any number of columns you want to concatenate:
import org.apache.spark.sql.functions.{array, col, concat_ws}
import spark.implicits._

val df = Seq(
  (19, 20190501, 24, "US", 1, 2, 5, 6, 19),
  (49, 20190502, 30, "US", 5, 4, 5, 0, 34)
).toDF("price", "dataCreate", "volume", "country", "A", "B", "C", "D", "E")

val exprs = df.columns.drop(4).map(col _)

df.select($"price", $"dataCreate", $"volume", $"country",
  concat_ws(",", array(exprs: _*)).as("new_col")).show()
+-----+----------+------+-------+----------+
|price|dataCreate|volume|country| new_col|
+-----+----------+------+-------+----------+
| 19| 20190501| 24| US|1,2,5,6,19|
| 49| 20190502| 30| US|5,4,5,0,34|
+-----+----------+------+-------+----------+
For completeness, here is the PySpark equivalent:
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [[19, 20190501, 24, "US", 1, 2, 5, 6, 19], [49, 20190502, 30, "US", 5, 4, 5, 0, 34]],
    ["price", "dataCreate", "volume", "country", "A", "B", "C", "D", "E"])

exprs = df.columns[4:]
df.select("price", "dataCreate", "volume", "country",
          F.concat_ws(",", F.array(*exprs)).alias("new_col"))
Maybe you had something like the following in mind:
Scala
import org.apache.spark.sql.functions.{col, concat_ws}

val cols = ('A' to 'Z').map(c => col(c.toString))
df.withColumn("new_col", concat_ws(",", cols: _*))
Python
from pyspark.sql.functions import col, concat_ws
import string
cols = [col(x) for x in string.ascii_uppercase]
df.withColumn("new_col", concat_ws(",", *cols))
From Spark 2.3.0 onwards, you can use the concatenation operator || directly in Spark SQL to do this:
spark.sql("select A||B||C from table");
https://issues.apache.org/jira/browse/SPARK-19951
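Note that || concatenates without a separator; to reproduce the comma-separated new_col from the question, one option is to build the expression string dynamically. A sketch, assuming the question's dataframe with columns A..Z has been registered as a temp view named letters (the view name is illustrative):

// Hypothetical: build the || expression for columns A..Z and run it through Spark SQL
df.createOrReplaceTempView("letters")
val concatExpr = ('A' to 'Z').map(_.toString).mkString(" || ',' || ")
spark.sql(s"SELECT price, dateCreate, volume, country, $concatExpr AS new_col FROM letters")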

Spark dataframe change column value to timestamp [duplicate]

This question already has answers here:
Apache Spark subtract days from timestamp column
(2 answers)
Closed 4 years ago.
I have a jsonl file I've read in, created a temporary view from, and filtered down to the records that I want to amend.
val df = session.read.json("tiny.jsonl")
df.createOrReplaceTempView("tempTable")
val filter = df.select("*").where("field IS NOT NULL")
Now I am at the part where I have been trying various things. I want to replace a column called "time" with the current timestamp before I write it back. Sometimes I will want the timestamp to be, for example, the current timestamp minus 5 days.
val change = test.withColumn("server_time", date_add(current_timestamp(), -1))
The example above gives me back a date offset by one day from today, rather than a timestamp.
Edit:
Sample Dataframe that mocks out my jsonl input:
val df = Seq(
  (1, "fn", "2018-02-18T22:18:28.645Z"),
  (2, "fu", "2018-02-18T22:18:28.645Z"),
  (3, null, "2018-02-18T22:18:28.645Z")
).toDF("id", "field", "time")
Expected output:
+---+------+-------------------------+
| id|field |time |
+---+------+-------------------------+
| 1| fn | 2018-04-09T22:18:28.645Z|
| 2| fu | 2018-04-09T22:18:28.645Z|
+---+------+-------------------------+
If you want to replace the current time column with the current timestamp, you can use the current_timestamp function. To add a number of days you can use a SQL INTERVAL expression:
import org.apache.spark.sql.functions.{current_timestamp, expr}
import spark.implicits._

val df = Seq(
  (1, "fn", "2018-02-18T22:18:28.645Z"),
  (2, "fu", "2018-02-18T22:18:28.645Z"),
  (3, null, "2018-02-18T22:18:28.645Z")
).toDF("id", "field", "time")
  .na.drop()

val ddf = df
  .withColumn("time", current_timestamp())
  .withColumn("newTime", $"time" + expr("INTERVAL 5 DAYS"))
Output:
+---+-----+-----------------------+-----------------------+
|id |field|time |newTime |
+---+-----+-----------------------+-----------------------+
|1 |fn |2018-04-10 15:14:27.501|2018-04-15 15:14:27.501|
|2 |fu |2018-04-10 15:14:27.501|2018-04-15 15:14:27.501|
+---+-----+-----------------------+-----------------------+
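The question also mentions wanting the current timestamp minus 5 days; the same interval can be subtracted instead of added. A small sketch under the same setup as above (the names past and pastTime are illustrative):

// Hypothetical: current timestamp shifted back by 5 days
val past = df
  .withColumn("time", current_timestamp())
  .withColumn("pastTime", $"time" - expr("INTERVAL 5 DAYS"))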

Spark Scala GroupBy column and sum values

I am a newbie in Apache Spark and recently started coding in Scala.
I have an RDD with 4 columns that looks like this (columns: 1 - name, 2 - title, 3 - views, 4 - size):
aa File:Sleeping_lion.jpg 1 8030
aa Main_Page 1 78261
aa Special:Statistics 1 20493
aa.b User:5.34.97.97 1 4749
aa.b User:80.63.79.2 1 4751
af Blowback 2 16896
af Bluff 2 21442
en Huntingtown,_Maryland 1 0
I want to group based on the name column and get the sum of the views column.
It should be like this:
aa 3
aa.b 2
af 2
en 1
I have tried to use groupByKey and reduceByKey but I am stuck and unable to proceed further.
This should work: you read the text file, split each line by the separator, map to key-value pairs with the appropriate fields, and use countByKey:
sc.textFile("path to the text file")
  .map(x => x.split(" ", -1))
  .map(x => (x(0), x(3)))
  .countByKey
To complete my answer, you can also approach the problem using the DataFrame API (if that is possible for you, depending on your Spark version), for example:
val result = df.groupBy("column to Group on").agg(count("column to count on"))
Another possibility is to use the SQL approach:
val df = spark.read.csv("csv path")
df.createOrReplaceTempView("temp_table")
val result = spark.sql("select <col to group on>, count(<col to count on>) from temp_table group by <col to group on>")
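If you actually need the sum of the views column per name, as the question states, rather than a row count per key, a minimal DataFrame sketch could look like this (assuming a dataframe whose columns are named as in the question):

import org.apache.spark.sql.functions.sum

// Hypothetical: sum the views per name instead of counting rows
val summed = df.groupBy("name").agg(sum("views").as("total_views"))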
I assume that you already have your RDD populated.
// For simplicity, I build the RDD this way
val data = Seq(
  ("aa", "File:Sleeping_lion.jpg", 1, 8030),
  ("aa", "Main_Page", 1, 78261),
  ("aa", "Special:Statistics", 1, 20493),
  ("aa.b", "User:5.34.97.97", 1, 4749),
  ("aa.b", "User:80.63.79.2", 1, 4751),
  ("af", "Blowback", 2, 16896),
  ("af", "Bluff", 2, 21442),
  ("en", "Huntingtown,_Maryland", 1, 0))
// The RDD used in the RDD approaches below
val rdd = sc.parallelize(data)
Dataframe approach
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

val sql = new SQLContext(sc)
import sql.implicits._

val df = data.toDF("name", "title", "views", "size")
df.groupBy($"name").agg(count($"name").as("count")).show()
Result:
+----+-----+
|name|count|
+----+-----+
| aa| 3|
| af| 2|
|aa.b| 2|
| en| 1|
+----+-----+
RDD approach (countByKey(...))
rdd.keyBy(f => f._1).countByKey().foreach(println(_))
RDD approach (reduceByKey(...))
rdd.map(f => (f._1, 1)).reduceByKey((accum, curr) => accum + curr).foreach(println(_))
If any of this does not solve your problem, please share where exactly you got stuck.

how to apply partition in spark scala dataframe with multiple columns? [duplicate]

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 5 years ago.
I have the following dataframe df in Spark Scala:
id project start_date Change_date designation
1 P1 08/10/2018 01/09/2017 2
1 P1 08/10/2018 02/11/2018 3
1 P1 08/10/2018 01/08/2016 1
I then want to get the designation whose Change_date is closest to the start_date while still being earlier than it.
Expected output:
id project start_date designation
1 P1 08/10/2018 2
This is because change date 01/09/2017 is the closest date before start_date.
Can somebody advise how to achieve this?
This is not about selecting the first row, but about selecting the designation corresponding to the change date closest to the start date.
Parse dates:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark: SparkSession = ???
import spark.implicits._

val df = Seq(
  (1, "P1", "08/10/2018", "01/09/2017", 2),
  (1, "P1", "08/10/2018", "02/11/2018", 3),
  (1, "P1", "08/10/2018", "01/08/2016", 1)
).toDF("id", "project_id", "start_date", "changed_date", "designation")
val parsed = df
  .withColumn("start_date", to_date($"start_date", "dd/MM/yyyy"))
  .withColumn("changed_date", to_date($"changed_date", "dd/MM/yyyy"))
Find the difference:
val diff = parsed
  .withColumn("diff", datediff($"start_date", $"changed_date"))
  .where($"diff" > 0)
Apply solution of your choice from How to select the first row of each group?, for example window functions. If you group by id:
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"id").orderBy($"diff")
diff.withColumn("rn", row_number.over(w)).where($"rn" === 1).drop("rn").show
// +---+----------+----------+------------+-----------+----+
// | id|project_id|start_date|changed_date|designation|diff|
// +---+----------+----------+------------+-----------+----+
// | 1| P1|2018-10-08| 2017-09-01| 2| 402|
// +---+----------+----------+------------+-----------+----+
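To match the expected output columns in the question exactly, you could additionally keep only the requested fields. A small sketch continuing from the snippet above (note that this answer's dataframe names the column project_id rather than project):

diff.withColumn("rn", row_number.over(w))
  .where($"rn" === 1)
  .select("id", "project_id", "start_date", "designation")
  .show()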
Reference:
How to select the first row of each group?

How to get the avg values in front of the current position in a RDD with spark/scala

I have an RDD, and I want to get the average of all values before the current position (including the current position) in the RDD.
for example:
inputRDD:
1, 2, 3, 4, 5, 6, 7, 8
output:
1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5
This is my attempt:
import scala.collection.mutable.ArrayBuffer

val rdd = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8), 4)
var sum = 0.0
var index = 0.0
val partition = rdd.getNumPartitions
rdd.zipWithIndex().collect().foreach(println)
rdd.zipWithIndex().sortBy(x => x._2, true, 1).mapPartitions(ite => {
  val result = new ArrayBuffer[(Double, Long)]()
  while (ite.hasNext) {
    val iteNext = ite.next()
    sum += iteNext._1
    index += 1
    val avg: Double = sum / index
    result.append((avg, iteNext._2))
  }
  result.toIterator
}).sortBy(x => x._2, true, partition).map(x => x._1).collect().foreach(println)
I have to repartition to 1 and then calculate it with an array, which is very inefficient.
Is there any cleaner solution that keeps the 4 partitions and avoids collecting into an array?
Sorry, I don't use Scala; I hope you can read it anyway.
import pyspark.sql.functions as f
from pyspark.sql import Window

df = spark.createDataFrame(map(lambda x: (x,), range(1, 9)), ['val'])
df = df.withColumn('spec_avg',
                   f.avg('val').over(Window.orderBy('val').rowsBetween(Window.unboundedPreceding, 0)))
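Since the question asks for Scala, a rough Scala equivalent of the window expression above might look like the following (a sketch, assuming the values sit in a DataFrame column named val):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg
import spark.implicits._

// Hypothetical Scala translation of the PySpark snippet above
val valuesDf = List(1, 2, 3, 4, 5, 6, 7, 8).toDF("val")
val w = Window.orderBy("val").rowsBetween(Window.unboundedPreceding, Window.currentRow)
valuesDf.withColumn("spec_avg", avg("val").over(w)).show()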
A simpler solution would be to use Spark SQL.
Here I am computing the running average for each row:
val df = sc.parallelize(List(1,2,3,4,5,6,7,8)).toDF("col1")
df.createOrReplaceTempView("table1")
val result = spark.sql("""SELECT col1, sum(col1) over(order by col1 asc)/row_number() over(order by col1 asc) as avg FROM table1""")
Or alternatively, if you want to use the DataFrame API:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

val result = df
  .withColumn("csum", sum($"col1").over(Window.orderBy($"col1")))
  .withColumn("rownum", row_number().over(Window.orderBy($"col1")))
  .withColumn("avg", $"csum" / $"rownum")
  .select("col1", "avg")
Output:
result.show()
+----+---+
|col1|avg|
+----+---+
| 1|1.0|
| 2|1.5|
| 3|2.0|
| 4|2.5|
| 5|3.0|
| 6|3.5|
| 7|4.0|
| 8|4.5|
+----+---+