How to retrieve the last 24 hours of data from a Spark DataFrame (Scala)?

I want to retrieve the last 24 hours of data from my DataFrame.
val data = spark.read.parquet(path_to_parquet_file)
data.createOrReplaceTempView("table")
var df = spark.sql("SELECT datetime, product_PK FROM table WHERE datetime BETWEEN (datetime - 24*3600000) AND datetime")
However, I do not know how to convert datetime to milliseconds using Spark SQL (Spark 2.2.0 and Scala 2.11).
I can do it with the DataFrame API, but I don't know how to merge everything together:
import org.apache.spark.sql.functions.{col, unix_timestamp}
df = df.withColumn("unix_timestamp", unix_timestamp(col("datetime"))).drop("datetime")
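For reference, here is a minimal sketch of one way to express the 24-hour filter directly with DataFrame functions (assuming datetime is a TimestampType column; current_timestamp and the INTERVAL syntax are standard Spark SQL features):
import org.apache.spark.sql.functions.{col, expr}

// A sketch: keep only rows whose datetime falls within the last 24 hours.
val last24h = data.filter(col("datetime") >= expr("current_timestamp() - INTERVAL 24 HOURS"))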

Related

Convert Spark.sql timestamp to java.time.Instant in Scala

Very simple question: I need to convert a timestamp column in a Spark DataFrame to java.time.Instant format.
Here is how you can convert to java.time.Instant:
import spark.implicits._  // provides the encoder for java.sql.Timestamp

val time1 = spark
  .sql("...")
  .as[java.sql.Timestamp]
  .first()
  .toInstant
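If the goal is a whole column of Instants rather than a single value, one option (just a sketch; the query below is a placeholder returning a single timestamp column) is to collect the timestamps and convert them on the driver:
import spark.implicits._  // encoder for java.sql.Timestamp

// Placeholder query: substitute your own query returning one timestamp column.
val instants: Array[java.time.Instant] =
  spark.sql("SELECT datetime FROM table")
    .as[java.sql.Timestamp]
    .collect()
    .map(_.toInstant)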

How to include kafka timestamp value as columns in spark structured streaming?

I am looking for a solution for adding the Kafka timestamp value to my Spark Structured Streaming schema. I have extracted the value field from Kafka and am building a DataFrame from it. My issue is that I need the timestamp field (from Kafka) along with the other columns.
Here is my current code:
val kafkaDatademostr = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers","zzzz.xxx.xxx.xxx.com:9002")
.option("subscribe","csvstream")
.load
val interval = kafkaDatademostr.select(col("value").cast("string")).alias("csv")
.select("csv.*")
val xmlData = interval.selectExpr("split(value,',')[0] as ddd" ,
"split(value,',')[1] as DFW",
"split(value,',')[2] as DTG",
"split(value,',')[3] as CDF",
"split(value,',')[4] as DFO",
"split(value,',')[5] as SAD",
"split(value,',')[6] as DER",
"split(value,',')[7] as time_for",
"split(value,',')[8] as fort")
How can I get the timestamp from Kafka and add it as a column along with the other columns?
The timestamp is included in the Kafka source schema; just add col("timestamp") to your select, like below:
val interval = kafkaDatademostr.select(col("value").cast("string").alias("csv"), col("timestamp"))
On the official Apache Spark website you can find the guide: Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher).
There you can find information about the schema of the DataFrame that is loaded from Kafka.
Each row from the Kafka source has the following columns:
key - the message key
value - the message value
topic - the name of the topic the message came from
partition - the partition the message came from
offset - the offset of the message
timestamp - the message timestamp
timestampType - the timestamp type
All of the above columns are available to query.
In your example you use only value, so to get the timestamp you just need to add timestamp to your select statement:
val allFields = kafkaDatademostr.selectExpr(
  "CAST(value AS STRING) AS csv",
  "CAST(key AS STRING) AS key",
  "topic",
  "partition",
  "offset",
  "timestamp",
  "timestampType"
)
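For completeness, the split columns from the question can then be derived from the csv field while keeping the Kafka timestamp alongside them (a sketch reusing the column names from the question; the remaining fields follow the same pattern):
val xmlData = allFields.selectExpr(
  "split(csv,',')[0] as ddd",
  "split(csv,',')[1] as DFW",
  "split(csv,',')[2] as DTG",
  // ...remaining split columns as in the question...
  "timestamp"
)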
In my case, Kafka was delivering the values in JSON format, which contained the actual data along with the original event time, not the Kafka timestamp. Below is the schema:
import org.apache.spark.sql.types.{DoubleType, LongType, StructField, StructType}

val mySchema = StructType(Array(
  StructField("time", LongType),
  StructField("close", DoubleType)
))
In order to use the watermarking feature of Spark Structured Streaming, I had to cast the time field into the timestamp format:
import org.apache.spark.sql.functions.{col, from_json}
import spark.implicits._

val df1 = df.selectExpr("CAST(value AS STRING)").as[String]
  .select(from_json($"value", mySchema).as("data"))
  .select(col("data.time").cast("timestamp").alias("time"), col("data.close"))
Now you can use the time field for window operations as well as for watermarking.
import org.apache.spark.sql.functions.{col, window}
import spark.implicits._

val windowedData = df1
  .withWatermark("time", "1 minute")
  .groupBy(
    window(col("time"), "1 minute", "30 seconds"),
    $"close"
  )
  .count()
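A possible way to run the aggregation above and inspect its output (a sketch; the console sink, the update output mode and the 30-second trigger are placeholder choices):
import org.apache.spark.sql.streaming.Trigger

val query = windowedData.writeStream
  .outputMode("update")                            // emit updated window counts as the watermark advances
  .format("console")                               // placeholder sink for inspection
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()

query.awaitTermination()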
I hope this answer clarifies things.

How to order string of exact format (dd-MM-yyyy HH:mm) using sparkSQL or Dataframe API

I want a DataFrame to be reordered in ascending order based on a datetime column which is in the format "23-07-2018 16:01".
My program sorts down to the date level but not to HH:mm. I want the output to include the HH:mm details as well and to be sorted according to them.
package com.spark

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{to_date, to_timestamp}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object conversion {
  def main(args: Array[String]) = {
    val spark = SparkSession.builder().master("local").appName("conversion").enableHiveSupport().getOrCreate()
    import spark.implicits._

    val sourceDF = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("D:\\2018_Sheet1.csv")

    val modifiedDF = sourceDF.withColumn("CredetialEndDate", to_date($"CredetialEndDate", "dd-MM-yyyy HH:mm"))
    // This converts into "dd-MM-yyyy" but "dd-MM-yyyy HH:mm" is expected.
    // What is the equivalent DataFrame API to convert the string to HH:mm?

    modifiedDF.createOrReplaceGlobalTempView("conversion")

    val sortedDF = spark.sql("select * from global_temp.conversion order by CredetialEndDate ASC").show(50)
    // dd-MM-YYYY 23-07-2018 16:01
  }
}
So my result should have the column in the format "23-07-2018 16:01" instead of just "23-07-2018", sorted in ascending order.
The method to_date converts the column into a DateType which has date only, no time. Try to use to_timestamp instead.
Edit: If you want to do the sorting but keep the original string representation you can do something like:
val modifiedDF = sourceDF.withColumn("SortingColumn",to_timestamp($"CredetialEndDate","dd-MM-yyyy HH:mm"))
and then modify the result to:
val sortedDF = spark.sql("select * from global_temp.conversion order by SortingColumn ASC").drop("SortingColumn").show(50)
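Equivalently, the same sort can be expressed purely with the DataFrame API (a sketch using the column name and pattern from the question):
val sortedDF = sourceDF
  .withColumn("SortingColumn", to_timestamp($"CredetialEndDate", "dd-MM-yyyy HH:mm"))
  .orderBy($"SortingColumn".asc)
  .drop("SortingColumn")

sortedDF.show(50)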

Spark Scala: How to transform a column in a DF

I have a dataframe in Spark with many columns and a udf that I defined. I want the same dataframe back, except with one column transformed. Furthermore, my udf takes in a string and returns a timestamp. Is there an easy way to do this? I tried
val test = myDF.select("my_column").rdd.map(r => getTimestamp(r))
but this returns an RDD and just with the transformed column.
If you really need to use your function, I can suggest two options:
Using map / toDF:
import org.apache.spark.sql.Row
import sqlContext.implicits._
def getTimestamp: (String => java.sql.Timestamp) = // your function here
val test = myDF.select("my_column").rdd.map {
  case Row(string_val: String) => (string_val, getTimestamp(string_val))
}.toDF("my_column", "new_column")
Using UDFs (UserDefinedFunction):
import org.apache.spark.sql.functions._
def getTimestamp: (String => java.sql.Timestamp) = // your function here
val newCol = udf(getTimestamp).apply(col("my_column")) // creates the new column
val test = myDF.withColumn("new_column", newCol) // adds the new column to original DF
Alternatively,
If you just want to transform a StringType column into a TimestampType column you can use the unix_timestamp column function available since Spark SQL 1.5:
val test = myDF
  .withColumn("new_column", unix_timestamp(col("my_column"), "yyyy-MM-dd HH:mm")
    .cast("timestamp"))
Note: for Spark 1.5.x, it is necessary to multiply the result of unix_timestamp by 1000 before casting to timestamp (issue SPARK-11724). The resulting code would be:
val test = myDF
  .withColumn("new_column", (unix_timestamp(col("my_column"), "yyyy-MM-dd HH:mm") * 1000L)
    .cast("timestamp"))
Edit: Added udf option

Extract value from scala TimeStampType

I have a SchemaRDD created from a Hive query:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val rdd = sqlContext.sql("Select * from mytime")
My RDD contains the following schema
StructField(id,StringType,true)
StructField(t,TimestampType,true)
We have our own custom database and want to save the TimestampType as a string, but I could not find a way to extract the value and save it as a string.
Can you help? Thanks!
What happens if you change your query to:
SELECT id, cast(t as STRING) from mytime
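A minimal end-to-end sketch of that approach (assuming the HiveContext from the question; the resulting string uses Hive's default timestamp rendering):
val stringified = sqlContext.sql("SELECT id, cast(t AS STRING) AS t_str FROM mytime")

// Each row now carries the timestamp as a plain string.
stringified.collect().foreach(row => println(row.getString(1)))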