I have a list of events and I need to compute the time between two consecutive events. The previous event may fall in a different window. How can I do this? I am using Scala and Structured Streaming (micro-batches). I wrote the following, but I get the error: "Non-time-based windows are not supported on streaming DataFrames/Datasets;"
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val ttt = tester
.withWatermark("time", "1 minutes")
.groupBy(
window($"time", "1 minutes", "10 seconds"),
$"nodeId"
)
.count()
.withColumn("WindowSize",(col("window.end").cast("Long") -
col("window.start").cast("Long")) / 60)
.withColumn("prev_time" ,lead(col("window.start"), 1)
.over(Window.partitionBy("nodeId").orderBy("window.start")))
I have data in a DataFrame with a DateTime column. I want to find the first day of the week for each value, where the first day is Monday.
I have thought of the following ways to achieve this:
With an arithmetic calculation:
import org.apache.spark.sql.functions._
val df1 = Seq((1, "2020-05-12 10:23:45", 5000), (2, "2020-11-11 12:12:12", 2000)).toDF("id", "DateTime", "miliseconds")
val new_df1=df1.withColumn("week",date_sub(next_day(col("DateTime"),"monday"),7))
Result -
Creating a UDF that does the same thing (this is the lowest priority)
df1.withColumn("week", date_trunc("week", $"DateTime"))
If there are any more methods to achieve this, I would like to see more implementations.
Thanks in Advance.
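One more option, sketched in plain Scala rather than Spark: the logic that a UDF for this could wrap, using java.time. The helper name weekStart is made up for illustration, and the sketch assumes the DateTime strings always follow the yyyy-MM-dd HH:mm:ss pattern from the sample data.

```scala
import java.time.{DayOfWeek, LocalDateTime}
import java.time.format.DateTimeFormatter
import java.time.temporal.TemporalAdjusters

// Pattern assumed from the sample rows, e.g. "2020-05-12 10:23:45"
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// Hypothetical helper: returns the Monday on or before the given date-time
def weekStart(dateTime: String): String =
  LocalDateTime.parse(dateTime, fmt)
    .toLocalDate
    .`with`(TemporalAdjusters.previousOrSame(DayOfWeek.MONDAY))
    .toString

println(weekStart("2020-05-12 10:23:45")) // a Tuesday, prints 2020-05-11
println(weekStart("2020-11-11 12:12:12")) // a Wednesday, prints 2020-11-09
```

Wrapped with Spark's udf function, this could be applied to the DateTime column, although the built-in functions above are probably the better choice since they avoid UDF overhead.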
I am trying to load a Parquet file with columns storyId1 and publisher1. I want to find all pairs of publishers that publish articles about the same stories. For each publisher pair I need to report the number of co-published stories, where a co-published story is a story published by both publishers. The pairs should be reported in decreasing order of frequency. The solution must conform to the following rules:
1. There should not be any replicated entries like:
NASDAQ, NASDAQ, 1000
2. Should not have the same pair occurring twice in opposite order. Only one of the following should occur:
NASDAQ, Reuters, 1000
Reuters, NASDAQ, 1000
(i.e. it is incorrect to have both of the above two lines in your result)
I have tried the following code:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import spark.implicits._
val worddocDF = spark.read.parquet("file:///home/user204943816622/t4_story_publishers.parquet")
val worddocDF1 = spark.read.parquet("file:///home/user204943816622/t4_story_publishers.parquet")
worddocDF.cache()
val joinDF = worddocDF.join(worddocDF1, "storyId1").withColumnRenamed("worddocDF.publisher1", "publisher2")
joinDF.filter($"publisher1" !== $"publisher2")
Input format:
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, Livemint]
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, IFA Magazine]
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, Moneynews]
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, NASDAQ]
[dPhGU51DcrolUIMxbRm0InaHGA2XM, IFA Magazine]
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, Los Angeles Times]
[dPhGU51DcrolUIMxbRm0InaHGA2XM, NASDAQ]
Required output:
[NASDAQ,IFA Magazine,2]
[Moneynews,Livemint,1]
[Moneynews,IFA Magazine,1]
[NASDAQ,Livemint,1]
[NASDAQ,Los Angeles Times,1]
[Moneynews,Los Angeles Times,1]
[Los Angeles Times,IFA Magazine,1]
[Livemint,IFA Magazine,1]
[NASDAQ,Moneynews,1]
[Los Angeles Times,Livemint,1]
import spark.implicits._
wordDocDf.as("a")
.join(
wordDocDf.as("b"),
$"a.storyId1" === $"b.storyId1" && $"a.publisher1" =!= $"b.publisher1",
"inner"
)
.select(
$"a.storyId1".as("storyId"),
$"a.publisher1".as("publisher1"),
$"b.publisher1".as("publisher2")
)
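The join above still returns each pair in both orders, because =!= keeps (NASDAQ, Reuters) as well as (Reuters, NASDAQ). A possible variation, sketched below on the sample rows from the question, is to compare the publishers with < instead, so that only one ordering of each pair survives, and then group and count. The local SparkSession, the inline sample data and the names stories and pairCounts are assumptions for illustration. It also assumes a publisher appears at most once per story (otherwise apply distinct first).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Local session for the sketch; in spark-shell `spark` already exists
val spark = SparkSession.builder.master("local[*]").appName("publisher-pairs").getOrCreate()
import spark.implicits._

// Sample rows copied from the question's input format
val stories = Seq(
  ("ddUyU0VZz0BRneMioxUPQVP6sIxvM", "Livemint"),
  ("ddUyU0VZz0BRneMioxUPQVP6sIxvM", "IFA Magazine"),
  ("ddUyU0VZz0BRneMioxUPQVP6sIxvM", "Moneynews"),
  ("ddUyU0VZz0BRneMioxUPQVP6sIxvM", "NASDAQ"),
  ("dPhGU51DcrolUIMxbRm0InaHGA2XM", "IFA Magazine"),
  ("ddUyU0VZz0BRneMioxUPQVP6sIxvM", "Los Angeles Times"),
  ("dPhGU51DcrolUIMxbRm0InaHGA2XM", "NASDAQ")
).toDF("storyId1", "publisher1")

val pairCounts = stories.as("a")
  .join(
    stories.as("b"),
    // `<` drops self-pairs and keeps only one ordering of each pair
    $"a.storyId1" === $"b.storyId1" && $"a.publisher1" < $"b.publisher1"
  )
  .select($"a.publisher1".as("publisher1"), $"b.publisher1".as("publisher2"))
  .groupBy("publisher1", "publisher2")
  .count()
  .orderBy(desc("count"))

pairCounts.show(false)
```

With this ordering the pair is always written alphabetically, so the top row for the sample data is (IFA Magazine, NASDAQ, 2) rather than (NASDAQ, IFA Magazine, 2).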
I am attempting to group data that fits into a specified window period using Spark Structured Streaming.
val profiles = rawProfiles.select("*")
.groupBy(window($"date", "10 minutes", "5 minutes").alias("date"), $"id", $"name")
.agg(sum("value").alias("value"))
.join(url.value, Seq("url"), "left")
.where("value > 20")
.as[profileRecord]
The format of the date from the rawProfiles is a string like this:
2017-07-20 18:27:45
What is returned for the date column after the window aggregation is something like this:
[0,554c749fb8a00,554c76dbed000]
I'm not really sure what to do with that. Does anyone have any ideas?
You can reformat your date field as follows:
rawProfiles.select(<your other fields>, to_date(unix_timestamp($"date").cast(DataTypes.TimestampType)).as("date"))
.groupBy(window($"date", "10 minutes", "5 minutes").alias("date"), $"id", $"name")
.agg(sum("value").alias("value"))
.join(url.value, Seq("url"), "left")
.where("value > 20")
.as[profileRecord]
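For what it's worth, the [0,554c749fb8a00,554c76dbed000] value looks like the internal form of the window struct: the two hex numbers are microseconds since the epoch and differ by exactly 600,000,000, i.e. the 10 minute window length. Below is a minimal sketch, assuming a local batch SparkSession and made-up sample rows, that casts the string to a timestamp and pulls window.start and window.end out as ordinary columns. Note that to_date keeps only the calendar date, so a plain cast to timestamp is used here instead.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Local session for the sketch; in spark-shell `spark` already exists
val spark = SparkSession.builder.master("local[*]").appName("window-columns").getOrCreate()
import spark.implicits._

// Made-up rows using the date format from the question
val raw = Seq(("a", "2017-07-20 18:27:45", 30L)).toDF("id", "date", "value")

val windowed = raw
  .withColumn("ts", $"date".cast("timestamp"))                // string -> timestamp
  .groupBy(window($"ts", "10 minutes", "5 minutes"), $"id")   // yields a struct column named "window"
  .agg(sum("value").alias("value"))
  .select(
    $"window.start".as("window_start"),                       // struct fields as plain timestamp columns
    $"window.end".as("window_end"),
    $"id",
    $"value"
  )

windowed.show(false)
```

With a 5 minute slide, the 18:27:45 event falls into two overlapping windows, so the sample row produces two output rows.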
Good evening.
I am doing some comparative work on the performance of RDDs, DataFrames and Datasets in Spark 2.1.0 (using the built-in Scala 2.11.8). I downloaded some freely available data from https://data.london.gov.uk/dataset/smartmeter-energy-use-data-in-london-households and ran the script shown further down on it. To give you a preview, the data looks as follows:
LCLid,stdorToU,DateTime,KWH/hh (per half hour) ,Acorn,Acorn_grouped
MAC000002,Std,2012-10-12 00:30:00.0000000, 0 ,ACORN-A,Affluent
MAC000002,Std,2012-10-12 01:00:00.0000000, 0 ,ACORN-A,Affluent
MAC000002,Std,2012-10-12 01:30:00.0000000, 0 ,ACORN-A,Affluent
MAC000002,Std,2012-10-12 02:00:00.0000000, 0 ,ACORN-A,Affluent
MAC000002,Std,2012-10-12 02:30:00.0000000, 0 ,ACORN-A,Affluent
MAC000002,Std,2012-10-12 03:00:00.0000000, 0 ,ACORN-A,Affluent
MAC000002,Std,2012-10-12 03:30:00.0000000, 0 ,ACORN-A,Affluent
MAC000002,Std,2012-10-12 04:00:00.0000000, 0 ,ACORN-A,Affluent
For this comparison, I time Spark at different stages of the import and transformation of the 6 variables shown above, typed as [String, String, Timestamp, Double, String, String]. I have managed to map the data into a DataFrame and a Dataset but cannot achieve the same with an RDD. Every time I try to convert the file into an RDD, I get the following error:
ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
I am very confused, since the variable 'DateTime' is already in the timestamp format 'yyyy-mm-dd hh:mm:ss[.fffffffff]'. I have read posts such as these (Convert Date to Timestamp in Scala, How to convert unix timestamp to date in Spark, Spark SQL: parse timestamp without seconds), but they do not cover my case.
It's even more confusing as the defined class 'londonDataSchemaDS' I constructed works on my Dataset conversion but not on my RDD one.
This is the script I have used:
import java.io.File
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
val sparkSession = SparkSession.builder.appName("SmartData London").master("local[*]").getOrCreate()
val LCLid = StructField("LCLid", DataTypes.StringType)
val stdorToU = StructField("stdorToU", DataTypes.StringType)
val DateTime = StructField("DateTime", DataTypes.TimestampType)
val KWHhh = StructField("KWH/hh (per half hour) ", DataTypes.DoubleType)
val Acorn = StructField("Acorn", DataTypes.StringType)
val Acorn_grouped = StructField("Acorn_grouped", DataTypes.StringType)
val fields = Array(LCLid,stdorToU,DateTime,KWHhh,Acorn,Acorn_grouped)
val londonDataSchemaDF = StructType(fields)
import sparkSession.implicits._
case class londonDataSchemaDS(LCLid: String, stdorToU: String, DateTime: java.sql.Timestamp, KWHhh: Double, Acorn: String, Acorn_grouped: String)
val t0 = System.nanoTime()
val loadFileRDD=sparkSession.sparkContext.textFile("C:/Data/Smart_Data_London/Power-Networks-LCL-June2015(withAcornGps).csv_Pieces/Power-Networks-LCL-June2015(withAcornGps)v2_1.csv")
.map(_.split(","))
.map(r=>londonDataSchemaDS(r(0), r(1), Timestamp.valueOf(r(2)), r(3).toDouble, r(4), r(5)))
val t1 = System.nanoTime()
val loadFileDF=sparkSession.read.schema(londonDataSchemaDF).option("header", true)
.csv("C:/Data/Smart_Data_London/Power-Networks-LCL-June2015(withAcornGps).csv_Pieces/Power-Networks-LCL-June2015(withAcornGps)v2_1.csv")
val t2=System.nanoTime()
val loadFileDS=sparkSession.read.option("header", "true")
.csv("C:/Data/Smart_Data_London/Power-Networks-LCL-June2015(withAcornGps).csv_Pieces/Power-Networks-LCL-June2015(withAcornGps)v2_1.csv")
.withColumn("DateTime", $"DateTime".cast("timestamp"))
.withColumnRenamed("KWH/hh (per half hour) ", "KWHhh")
.withColumn("KWHhh", $"KWHhh".cast("double"))
.as[londonDataSchemaDS]
val t3 = System.nanoTime()
loadFileRDD.take(10)
loadFileDF.show(10, false)
loadFileDF.printSchema()
loadFileDS.show(10, false)
loadFileDS.printSchema()
println("Time Elapsed to implement RDD: " + (t1 - t0) * 1E-9 + " seconds")
println("Time Elapsed to implement DataFrame: " + (t2 - t1) * 1E-9 + " seconds")
println("Time Elapsed to implement Dataset: " + (t3 - t2) * 1E-9 + " seconds")
Any help on this would be most appreciated and/or a nudge in the right direction.
Many thanks,
Christian
I know what I did wrong. I was so caught up in the DataFrame and Dataset conversions, which have a built-in option to skip the header, that I forgot to remove the header in the RDD conversion.
By adding the lines below, I remove the header and successfully convert my CSV to an RDD (this explains why I was getting a Timestamp format error):
val loadFileRDDwH=sparkSession.sparkContext.textFile("C:/Data/Smart_Data_London/Power-Networks-LCL-June2015(withAcornGps).csv_Pieces/Power-Networks-LCL-June2015(withAcornGps)v2_1.csv").map(_.split(","))
val header=loadFileRDDwH.first()
val loadFileRDD=loadFileRDDwH.filter(_(0) != header(0)).map(r=>londonDataSchemaDS(r(0), r(1), Timestamp.valueOf(r(2)), r(3).split("\\s+").mkString.toDouble, r(4), r(5)))
Thanks for reading
Christian
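The error can be reproduced without Spark. This is a small sketch in plain Scala, assuming the first line of the file is the header shown above, so that the third field handed to Timestamp.valueOf is the literal text DateTime:

```scala
import java.sql.Timestamp
import scala.util.Try

// Third field of the header line and of the first data line from the preview
val headerField = "DateTime"
val dataField = "2012-10-12 00:30:00.0000000"

// The header text is not a timestamp, so valueOf throws IllegalArgumentException
val headerParse = Try(Timestamp.valueOf(headerField))

// A real data value parses, since up to 9 fractional digits are allowed
val dataParse = Try(Timestamp.valueOf(dataField))

println(s"header parsed: ${headerParse.isSuccess}")
println(s"data parsed: ${dataParse.isSuccess}")
```

Wrapping the parse in Try like this is also one way to find other malformed rows, if there are any, instead of failing the whole task.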
I have two timestamp columns in a dataframe and I'd like to get the difference between them in minutes or, alternatively, in hours. Currently I'm able to get the difference in days, with rounding, by doing
val df2 = df1.withColumn("time", datediff(df1("ts1"), df1("ts2")))
However, when I looked at the doc page
https://issues.apache.org/jira/browse/SPARK-8185
I didn't see any extra parameters to change the unit. Is there a different function I should be using for this?
You can get the difference in seconds by
import org.apache.spark.sql.functions._
val diff_secs_col = col("ts1").cast("long") - col("ts2").cast("long")
Then you can do some math to get the unit you want. For example:
val df2 = df1
.withColumn( "diff_secs", diff_secs_col )
.withColumn( "diff_mins", diff_secs_col / 60D )
.withColumn( "diff_hrs", diff_secs_col / 3600D )
.withColumn( "diff_days", diff_secs_col / (24D * 3600D) )
Or, in pyspark:
from pyspark.sql.functions import *
diff_secs_col = col("ts1").cast("long") - col("ts2").cast("long")
df2 = df1 \
.withColumn( "diff_secs", diff_secs_col ) \
.withColumn( "diff_mins", diff_secs_col / 60.0 ) \
.withColumn( "diff_hrs", diff_secs_col / 3600.0 ) \
.withColumn( "diff_days", diff_secs_col / (24.0 * 3600.0) )
The answer given by Daniel de Paula works, but that solution does not work in the case where the difference is needed for every row in your table. Here is a solution that will do that for each row:
import org.apache.spark.sql.functions
val df2 = df1.selectExpr("(unix_timestamp(ts1) - unix_timestamp(ts2))/3600")
This first converts the data in the columns to a unix timestamp in seconds, subtracts them and then converts the difference to hours.
A useful list of functions can be found at:
http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.functions$