Hourly Aggregation in Scala Spark - scala

I'm looking for a way to aggregate my data by hour. As a first step I want to keep only the hour part of evtTime. My DataFrame looks like this:
+-------+-----------------------+-----------+
|reqUser|evtTime |event_count|
+-------+-----------------------+-----------+
|X166814|2018-01-01 11:23:06.426|1 |
|X166815|2018-01-01 02:20:06.426|2 |
|X166816|2018-01-01 11:25:06.429|5 |
|X166817|2018-02-01 10:23:06.429|1 |
|X166818|2018-01-01 09:23:06.430|3 |
|X166819|2018-01-01 10:15:06.430|8 |
|X166820|2018-08-01 11:00:06.431|20 |
|X166821|2018-03-01 06:23:06.431|7 |
|X166822|2018-01-01 07:23:06.434|2 |
|X166823|2018-01-01 11:23:06.434|1 |
+-------+-----------------------+-----------+
My objective is to get something like this:
+-------+-----------------------+-----------+
|reqUser|evtTime |event_count|
+-------+-----------------------+-----------+
|X166814|2018-01-01 11:00:00.000|1 |
|X166815|2018-01-01 02:00:00.000|2 |
|X166816|2018-01-01 11:00:00.000|5 |
|X166817|2018-02-01 10:00:00.000|1 |
|X166818|2018-01-01 09:00:00.000|3 |
|X166819|2018-01-01 10:00:00.000|8 |
|X166820|2018-08-01 11:00:00.000|20 |
|X166821|2018-03-01 06:00:00.000|7 |
|X166822|2018-01-01 07:00:00.000|2 |
|X166823|2018-01-01 11:00:00.000|1 |
+-------+-----------------------+-----------+
I'm using Scala 2.10.5 and Spark 1.6.3. My next objective is to group by reqUser and calculate the sum of event_count. I tried this:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{round, sum}

val new_df = df
  .groupBy($"reqUser", Window(col("evtTime"), "1 hour"))
  .agg(sum("event_count") as "aggregate_sum")
This is my error message:
Error:(81, 15) org.apache.spark.sql.expressions.Window.type does not take parameters
Window(col("time"), "1 hour"))
Help? Thx!

In Spark 1.x you can use the date formatting functions, e.g. date_format:
import org.apache.spark.sql.functions.date_format

val df = Seq("2018-01-01 10:15:06.430").toDF("evtTime")
df.select(date_format($"evtTime".cast("timestamp"), "yyyy-MM-dd HH:00:00")).show
+------------------------------------------------------------+
|date_format(CAST(evtTime AS TIMESTAMP), yyyy-MM-dd HH:00:00)|
+------------------------------------------------------------+
| 2018-01-01 10:00:00|
+------------------------------------------------------------+
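To cover the second part of the question (grouping by reqUser and summing event_count on Spark 1.6.3), here is a minimal sketch. It assumes df is the DataFrame shown in the question, that the implicits for the $ syntax are in scope, and the evtHour column name is just illustrative; the window(col, "1 hour") grouping function the attempted code seems to be reaching for only arrived in Spark 2.0, so the hour is derived with date_format instead:
import org.apache.spark.sql.functions.{date_format, sum}

// truncate the timestamp to the hour, then aggregate per user and hour
val hourly = df
  .withColumn("evtHour", date_format($"evtTime".cast("timestamp"), "yyyy-MM-dd HH:00:00"))
  .groupBy($"reqUser", $"evtHour")
  .agg(sum($"event_count").as("aggregate_sum"))

hourly.show(false)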

Related

pyspark to_timestamp() handling format of milliseconds SSS

I have data with inconsistent millisecond precision in the timestamps, and I am using the function below:
to_timestamp("col","yyyy-MM-dd'T'hh:mm:ss.SSS'Z'")
Data:
time | OUTPUT | IDEAL
2022-06-16T07:01:25.346Z | 2022-06-16T07:01:25.346+0000 | 2022-06-16T07:01:25.346+0000
2022-06-16T06:54:21.51Z | 2022-06-16T06:54:21.051+0000 | 2022-06-16T06:54:21.510+0000
2022-06-16T06:54:21.5Z | 2022-06-16T06:54:21.005+0000 | 2022-06-16T06:54:21.500+0000
So the data has S, SS, or SSS digits for the milliseconds. How can I normalise them to SSS the correct way? Here, .51 means 510 milliseconds, not 051.
Using Spark version 3.2.1.
Code :
import pyspark.sql.functions as F
test = spark.createDataFrame([(1,'2022-06-16T07:01:25.346Z'),(2,'2022-06-16T06:54:21.51Z'),(3,'2022-06-16T06:54:21.5Z')],['no','timing1'])
timeFmt = "yyyy-MM-dd'T'hh:mm:ss.SSS'Z'"
test = test.withColumn("timing2", (F.to_timestamp(F.col('timing1'),format=timeFmt)))
test.select("timing1","timing2").show(truncate=False)
I also use v3.2.1 and it works for me if you just don't pass a format to to_timestamp; the string is already in a format it handles:
from pyspark.sql import functions as F
test = spark.createDataFrame([(1,'2022-06-16T07:01:25.346Z'),(2,'2022-06-16T06:54:21.51Z'),(3,'2022-06-16T06:54:21.5Z')],['no','timing1'])
new_df = test.withColumn('timing1_ts', F.to_timestamp('timing1'))
new_df.show(truncate=False)
new_df.dtypes
+---+------------------------+-----------------------+
|no |timing1 |timing1_ts |
+---+------------------------+-----------------------+
|1 |2022-06-16T07:01:25.346Z|2022-06-16 07:01:25.346|
|2 |2022-06-16T06:54:21.51Z |2022-06-16 06:54:21.51 |
|3 |2022-06-16T06:54:21.5Z |2022-06-16 06:54:21.5 |
+---+------------------------+-----------------------+
Out[9]: [('no', 'bigint'), ('timing1', 'string'), ('timing1_ts', 'timestamp')]
I was using this setting:
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
I had to reset it, and now it works as expected.
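For reference, a minimal sketch of undoing that setting, assuming a SparkSession named spark (Scala-style comments shown; the same calls exist in PySpark): either unset the override so the default policy applies, or set it explicitly to the non-legacy value.
// remove the runtime override so the default policy applies again
spark.conf.unset("spark.sql.legacy.timeParserPolicy")

// or set it explicitly to the non-legacy behaviour
spark.sql("SET spark.sql.legacy.timeParserPolicy=CORRECTED")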

np.where logic in pyspark dataframe

I'm looking for a way to get the characters after the 2nd position of a string in a DataFrame column, but only if the string's length is > 2, and put the result into another column, else null. I have several other columns in the Spark DataFrame.
I have a Spark dataframe that looks like this:
animal
======
mo
cat
mouse
snake
reptiles
I want something like this:
remainder
========
null
t
use
ake
ptiles
I can do it using np.where on a pandas DataFrame like below:
import numpy as np
df['remainder'] = np.where(df['animal'].str.len() > 2, df['animal'].str[2:], None)
How do I do the same with a PySpark DataFrame?
You can easily do this with a combination of when/otherwise and substring.
Data Preparation
from io import StringIO

import pandas as pd
import pyspark.sql.functions as F

s = StringIO("""
animal
mo
cat
mouse
snake
reptiles
""")

df = pd.read_csv(s, delimiter=',')
# assumes an active SparkSession named `spark`
sparkDF = spark.createDataFrame(df)
sparkDF.show()
+--------+
| animal|
+--------+
| mo|
| cat|
| mouse|
| snake|
|reptiles|
+--------+
When-Otherwise - Substring
sparkDF = sparkDF.withColumn('animal_length', F.length(F.col('animal'))) \
                 .withColumn('remainder', F.when(F.col('animal_length') > 2,
                                    F.substring(F.col('animal'), 3, 1000)
                                 ).otherwise(None)
                 ) \
                 .drop('animal_length')

sparkDF.show()

+--------+---------+
|  animal|remainder|
+--------+---------+
|      mo|     null|
|     cat|        t|
|   mouse|      use|
|   snake|      ake|
|reptiles|   ptiles|
+--------+---------+

how to update all the values of a column in a dataFrame

I have a data frame which has an unformatted Date column:
+--------+-----------+--------+
|CDOPEINT| bbbbbbbbbb| Date|
+--------+-----------+--------+
| AAA|bbbbbbbbbbb|13190326|
| AAA|bbbbbbbbbbb|10190309|
| AAA|bbbbbbbbbbb|36190908|
| AAA|bbbbbbbbbbb|07190214|
| AAA|bbbbbbbbbbb|13190328|
| AAA|bbbbbbbbbbb|23190608|
| AAA|bbbbbbbbbbb|13190330|
| AAA|bbbbbbbbbbb|26190630|
+--------+-----------+--------+
The Date column is formatted as wwyyMMdd (week, year, month, day), which I want to reformat as yyyy/MM/dd; for that I have a method, format, that does it.
So my question is: how can I map all the values of the Date column to the needed format? Here is the output that I want:
+--------+-----------+----------+
|CDOPEINT| bbbbbbbbbb| Date|
+--------+-----------+----------+
| AAA|bbbbbbbbbbb|2019/03/26|
| AAA|bbbbbbbbbbb|2019/03/09|
| AAA|bbbbbbbbbbb|2019/09/08|
| AAA|bbbbbbbbbbb|2019/02/14|
| AAA|bbbbbbbbbbb|2019/03/28|
| AAA|bbbbbbbbbbb|2019/06/08|
| AAA|bbbbbbbbbbb|2019/03/30|
| AAA|bbbbbbbbbbb|2019/06/30|
+--------+-----------+----------+
With Spark 2.4.3 you can use unix_timestamp to convert the data to the expected output.
scala> var df2 =spark.createDataFrame(Seq(("AAA","bbbbbbbbbbb","13190326"),("AAA","bbbbbbbbbbb","10190309"),("AAA","bbbbbbbbbbb","36190908"),("AAA","bbbbbbbbbbb","07190214"),("AAA","bbbbbbbbbbb","13190328"),("AAA","bbbbbbbbbbb","23190608"),("AAA","bbbbbbbbbbb","13190330"),("AAA","bbbbbbbbbbb","26190630"))).toDF("CDOPEINT","bbbbbbbbbb","Date")
scala> df2.withColumn("Date",from_unixtime(unix_timestamp(substring(col("Date"),3,7),"yyMMdd"),"yyyy/MM/dd")).show
+--------+-----------+----------+
|CDOPEINT| bbbbbbbbbb| Date|
+--------+-----------+----------+
| AAA|bbbbbbbbbbb|2019/03/26|
| AAA|bbbbbbbbbbb|2019/03/09|
| AAA|bbbbbbbbbbb|2019/09/08|
| AAA|bbbbbbbbbbb|2019/02/14|
| AAA|bbbbbbbbbbb|2019/03/28|
| AAA|bbbbbbbbbbb|2019/06/08|
| AAA|bbbbbbbbbbb|2019/03/30|
| AAA|bbbbbbbbbbb|2019/06/30|
+--------+-----------+----------+
Let me know if you have any questions related to this.
If the dates are all from the 2000s and the Date column in your original DataFrame is of Integer type, you could try something like this:
def getDate = (date: Int) => {
  val dateString = date.toString.drop(2).sliding(2, 2)
  dateString.zipWithIndex.map {
    case (value, index) => if (index == 0) 20 + value else value
  }.mkString("/")
}
Then create a UDF which calls this function
val updateDateUdf = udf(getDate)
If originalDF is the original DataFrame that you have, you can then transform it like this:
val updatedDF = originalDF.withColumn("Date",updateDateUdf(col("Date")))
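As a quick, hand-traced sanity check of the helper on one of the sample values (illustrative REPL output, not from an actual run):
// "13190326" -> drop the week "13" -> "190326" -> chunks "19", "03", "26"
// -> prefix the year chunk with 20 -> "2019/03/26"
getDate(13190326)   // res0: String = 2019/03/26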

Date format in pyspark

My data frame looks like -
id date
1 2018-08-23 11:48:22
2 2019-05-03 06:22:01
3 2019-05-13 10:12:15
4 2019-01-22 16:13:29
5 2018-11-27 11:17:19
My expected output is -
id date date1
1 2018-08-23 11:48:22 2018-08
2 2019-05-03 06:22:01 2019-05
3 2019-05-13 10:12:15 2019-05
4 2019-01-22 16:13:29 2019-01
5 2018-11-27 11:17:19 2018-11
How to do it in pyspark?
I think you are trying to drop the day and time details; you can use the date_format function for that:
>>> df.show()
+---+-------------------+
| id| date|
+---+-------------------+
| 1|2018-08-23 11:48:22|
| 2|2019-05-03 06:22:01|
| 3|2019-05-13 10:12:15|
| 4|2019-01-22 16:13:29|
| 5|2018-11-27 11:17:19|
+---+-------------------+
>>> import pyspark.sql.functions as F
>>>
>>> df.withColumn('date1',F.date_format(F.to_date('date','yyyy-MM-dd HH:mm:ss'),'yyyy-MM')).show()
+---+-------------------+-------+
| id| date| date1|
+---+-------------------+-------+
| 1|2018-08-23 11:48:22|2018-08|
| 2|2019-05-03 06:22:01|2019-05|
| 3|2019-05-13 10:12:15|2019-05|
| 4|2019-01-22 16:13:29|2019-01|
| 5|2018-11-27 11:17:19|2018-11|
+---+-------------------+-------+
Via the to_date and then substring functions, for example:
import pyspark.sql.functions as F
import pyspark.sql.types as T
rawData = [(1, "2018-08-23 11:48:22"),
(2, "2019-05-03 06:22:01"),
(3, "2019-05-13 10:12:15")]
df = spark.createDataFrame(rawData).toDF("id","my_date")
df.withColumn("new_my_date",\
F.substring(F.to_date(F.col("my_date")), 1,7))\
.show()
+---+-------------------+-----------+
| id| my_date|new_my_date|
+---+-------------------+-----------+
| 1|2018-08-23 11:48:22| 2018-08|
| 2|2019-05-03 06:22:01| 2019-05|
| 3|2019-05-13 10:12:15| 2019-05|
+---+-------------------+-----------+
import pyspark.sql.functions as F
split_col = F.split(df['date'], '-')
df = df.withColumn('year', split_col.getItem(0)).withColumn('month', split_col.getItem(1))
df = df.select(F.concat(df['year'], F.lit('-'),df['month']).alias('year_month'))
df.show()
+----------+
|year_month|
+----------+
| 2018-08|
| 2019-05|
| 2019-05|
| 2019-01|
| 2018-11|
+----------+

Spark DataFrame: How to add an index column (aka distributed data index)

I read data from a CSV file, but it doesn't have an index.
I want to add a column running from 1 to the number of rows.
What should I do? Thanks (Scala).
With Scala you can use:
import org.apache.spark.sql.functions._
df.withColumn("id",monotonicallyIncreasingId)
You can refer to this exemple and scala docs.
With Pyspark you can use:
from pyspark.sql.functions import monotonically_increasing_id
df_index = df.select("*").withColumn("id", monotonically_increasing_id())
monotonically_increasing_id - The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
"I want to add a column from 1 to row's number."
Let say we have the following DF
+--------+-------------+-------+
| userId | productCode | count |
+--------+-------------+-------+
| 25 | 6001 | 2 |
| 11 | 5001 | 8 |
| 23 | 123 | 5 |
+--------+-------------+-------+
To generate the IDs starting from 1:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val w = Window.orderBy("count")
val result = df.withColumn("index", row_number().over(w))
This would add an index column ordered by increasing value of count.
+--------+-------------+-------+-------+
| userId | productCode | count | index |
+--------+-------------+-------+-------+
| 25 | 6001 | 2 | 1 |
| 23 | 123 | 5 | 2 |
| 11 | 5001 | 8 | 3 |
+--------+-------------+-------+-------+
How to get a sequential id column id[1, 2, 3, 4...n]:
from pyspark.sql.functions import desc, row_number, monotonically_increasing_id
from pyspark.sql.window import Window
df_with_seq_id = df.withColumn('index_column_name', row_number().over(Window.orderBy(monotonically_increasing_id())) - 1)
Note that row_number() starts at 1, so subtract 1 if you want a 0-indexed column.
NOTE: monotonically_increasing_id in the approaches above does not give a sequential number, only an increasing ID.
A simple way to do that and ensure the order of the indexes is zipWithIndex, as below.
Sample data.
+-------------------+
| Name|
+-------------------+
| Ram Ghadiyaram|
| Ravichandra|
| ilker|
| nick|
| Naveed|
| Gobinathan SP|
|Sreenivas Venigalla|
| Jackela Kowski|
| Arindam Sengupta|
| Liangpi|
| Omar14|
| anshu kumar|
+-------------------+
package com.example

import org.apache.spark.internal.Logging
import org.apache.spark.sql.SparkSession._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row}

/**
 * DistributedDataIndex : Program to index a DataFrame with zipWithIndex
 */
object DistributedDataIndex extends App with Logging {

  val spark = builder
    .master("local[*]")
    .appName(this.getClass.getName)
    .getOrCreate()

  import spark.implicits._

  val df = spark.sparkContext.parallelize(
    Seq("Ram Ghadiyaram", "Ravichandra", "ilker", "nick"
      , "Naveed", "Gobinathan SP", "Sreenivas Venigalla", "Jackela Kowski"
      , "Arindam Sengupta", "Liangpi", "Omar14", "anshu kumar"
    )).toDF("Name")
  df.show

  logInfo("addColumnIndex here")
  // Add index now...
  val df1WithIndex = addColumnIndex(df)
    .withColumn("monotonically_increasing_id", monotonically_increasing_id)
  df1WithIndex.show(false)

  /**
   * Add a column index to each row of the dataframe
   */
  def addColumnIndex(df: DataFrame) = {
    spark.sqlContext.createDataFrame(
      df.rdd.zipWithIndex.map {
        case (row, index) => Row.fromSeq(row.toSeq :+ index)
      },
      // Create schema for index column
      StructType(df.schema.fields :+ StructField("index", LongType, false)))
  }
}
Result :
+-------------------+-----+---------------------------+
|Name |index|monotonically_increasing_id|
+-------------------+-----+---------------------------+
|Ram Ghadiyaram |0 |0 |
|Ravichandra |1 |8589934592 |
|ilker |2 |8589934593 |
|nick |3 |17179869184 |
|Naveed |4 |25769803776 |
|Gobinathan SP |5 |25769803777 |
|Sreenivas Venigalla|6 |34359738368 |
|Jackela Kowski |7 |42949672960 |
|Arindam Sengupta |8 |42949672961 |
|Liangpi |9 |51539607552 |
|Omar14 |10 |60129542144 |
|anshu kumar |11 |60129542145 |
+-------------------+-----+---------------------------+
As Ram said, zipWithIndex is better than monotonically_increasing_id if you need consecutive row numbers. Try this (PySpark environment):
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, LongType
new_schema = StructType(**original_dataframe**.schema.fields[:] + [StructField("index", LongType(), False)])
zipped_rdd = **original_dataframe**.rdd.zipWithIndex()
indexed = (zipped_rdd.map(lambda ri: row_with_index(*list(ri[0]) + [ri[1]])).toDF(new_schema))
where original_dataframe is the DataFrame you have to add the index to, and row_with_index is the new Row schema for the index column, which you can write as:
row_with_index = Row(
"calendar_date"
,"year_week_number"
,"year_period_number"
,"realization"
,"index"
)
Here, calendar_date, year_week_number, year_period_number and realization were the columns of my original DataFrame. You can replace them with the names of your columns; index is the new column name added for the row numbers.
If you require a unique sequence number for each row, I have a slightly different approach: add a static column and use it to compute the row number.
val srcData = spark.read.option("header","true").csv("/FileStore/sample.csv")
srcData.show(5)
+--------+--------------------+
| Job| Name|
+--------+--------------------+
|Morpheus| HR Specialist|
| Kayla| Lawyer|
| Trisha| Bus Driver|
| Robert|Elementary School...|
| Ober| Judge|
+--------+--------------------+
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}

val srcDataModf = srcData.withColumn("sl_no", lit("1"))
val windowSpecRowNum = Window.partitionBy("sl_no").orderBy("sl_no")

srcDataModf.withColumn("row_num", row_number.over(windowSpecRowNum))
  .drop("sl_no")
  .select("row_num", "Name", "Job")
  .show(5)
+-------+--------------------+--------+
|row_num| Name| Job|
+-------+--------------------+--------+
| 1| HR Specialist|Morpheus|
| 2| Lawyer| Kayla|
| 3| Bus Driver| Trisha|
| 4|Elementary School...| Robert|
| 5| Judge| Ober|
+-------+--------------------+--------+
For SparkR:
(Assuming sdf is some sort of spark data frame)
sdf <- withColumn(sdf, "row_id", SparkR:::monotonically_increasing_id())