I have to compute a cumulative sum of a value column, per group, from the beginning of the time series, with one output per day.
If I do it with a batch job, it looks something like this:
val columns = Seq("timestamp", "group", "value")
val data = List(
(Instant.parse("2020-01-01T00:00:00Z"), "Group1", 0),
(Instant.parse("2020-01-01T00:00:00Z"), "Group2", 0),
(Instant.parse("2020-01-01T12:00:00Z"), "Group1", 1),
(Instant.parse("2020-01-01T12:00:00Z"), "Group2", -1),
(Instant.parse("2020-01-02T00:00:00Z"), "Group1", 2),
(Instant.parse("2020-01-02T00:00:00Z"), "Group2", -2),
(Instant.parse("2020-01-02T12:00:00Z"), "Group1", 3),
(Instant.parse("2020-01-02T12:00:00Z"), "Group2", -3),
)
val df = spark
.createDataFrame(data)
.toDF(columns: _*)
// defines a window from the beginning by `group`
val event_window = Window
.partitionBy(col("group"))
.orderBy(col("timestamp"))
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
val computed_df = df
.withColumn(
"cumsum",
functions
.sum('value)
.over(event_window) // apply the aggregation on a window from the beginning
)
.groupBy(window($"timestamp", "1 day"), $"group")
.agg(functions.last("cumsum").as("cumsum_by_day")) // display the last value for each day
computed_df.show(truncate = false)
and the output is
+------------------------------------------+------+-------------+
|window |group |cumsum_by_day|
+------------------------------------------+------+-------------+
|{2020-01-01 01:00:00, 2020-01-02 01:00:00}|Group1| 1 |
|{2020-01-02 01:00:00, 2020-01-03 01:00:00}|Group1| 6 |
|{2020-01-01 01:00:00, 2020-01-02 01:00:00}|Group2|-1 |
|{2020-01-02 01:00:00, 2020-01-03 01:00:00}|Group2|-6 |
+------------------------------------------+------+-------------+
The result is perfectly fine.
However, in my case the data source is not an existing dataset but a stream, and I haven't found any way to apply the aggregation from the beginning of the stream rather than on a sliding window.
The closest code I can do is:
// MemoryStream to reproduce the issue locally
implicit val sqlCtx: SQLContext = spark.sqlContext
val memoryStream = MemoryStream[(Instant, String, Int)]
memoryStream.addData(data)
val df = memoryStream
.toDF()
.toDF(columns: _*)
val computed_df = df
.groupBy(window($"timestamp", "1 day"), $"group")
.agg(functions.sum('value).as("agg"))
computed_df.writeStream
.option("truncate", value = false)
.format("console")
.outputMode("complete")
.start()
.processAllAvailable()
It produces an aggregation for each day but not from the beginning of the stream.
If I try to add something like .over(event_window) (like in batch), it compiles but fails at runtime.
How can we apply an aggregation function from the beginning of a stream?
Here is a GitHub repository with all the context needed to run that code.
I didn't find any solution using the high-level functions. For example, it is not possible to add another groupBy on top of the main aggregation agg(functions.sum('value).as("agg"), functions.last('timestamp).as("ts")) to get the daily report.
After many experiments, I switched to the lower-level functions. The most versatile one seems to be flatMapGroupsWithState.
// same `events` Dataframe as before
// Accumulate value by group and report every day
val computed_df = events
.withWatermark("timestamp", "0 second") // watermarking required to use GroupStateTimeout
.as[(Instant, String, Int)]
.groupByKey(event => event._2)
.flatMapGroupsWithState[IntermediateState, AggResult](
OutputMode.Append(),
GroupStateTimeout.EventTimeTimeout
)(processEventGroup)
Then it returns:
-------------------------------------------
Batch: 0
-------------------------------------------
+-------------------+------+-------------+
|day_start |group |cumsum_by_day|
+-------------------+------+-------------+
|2020-01-01 01:00:00|Group2|-1 |
|2020-01-02 01:00:00|Group2|-6 |
|2020-01-01 01:00:00|Group1|1 |
|2020-01-02 01:00:00|Group1|6 |
+-------------------+------+-------------+
-------------------------------------------
Batch: 1
-------------------------------------------
+-------------------+------+-------------+
|day_start |group |cumsum_by_day|
+-------------------+------+-------------+
|2020-01-03 01:00:00|Group2|-15 |
|2020-01-03 01:00:00|Group1|15 |
+-------------------+------+-------------+
processEventGroup is the key function and contains all the technical parts: the cumulative aggregation and the report emitted after each day.
def processEventGroup(
group: String,
events: Iterator[(Instant, String, Int)],
state: GroupState[IntermediateState]
) = {
def mergeState(events: List[Event]): Iterator[AggResult] = {
// Initialize the aggregation from the previous state, or start a fresh one
var (acc_value, acc_timestamp) = state.getOption
.map(s => (s.agg_value, s.last_timestamp))
.getOrElse((0, Instant.EPOCH))
val agg_results = events.flatMap { e =>
// emit a daily report if the new event occurs on another day
val intermediate_day_result =
if ( // not same day
acc_timestamp != Instant.EPOCH &&
truncateDay(e.timestamp) > truncateDay(acc_timestamp)
) {
Seq(AggResult(truncateDay(acc_timestamp), group, acc_value))
} else {
Seq.empty
}
// apply the aggregation as usual (`sum` on value, `last` on timestamp)
acc_value += e.value
acc_timestamp = e.timestamp
intermediate_day_result
}
// if a timeout occurs before the next events arrive for this group,
// a daily report will be generated
state.setTimeoutTimestamp(state.getCurrentWatermarkMs, "1 day")
// save the current aggregated value as state storage
state.update(IntermediateState(acc_timestamp, group, acc_value))
agg_results.iterator
}
if (state.hasTimedOut && events.isEmpty) {
// generate a daily report on timeout
state.getOption
.map(agg_result =>
AggResult(truncateDay(agg_result.last_timestamp), group, agg_result.agg_value)
)
.iterator
} else {
// a list of daily reports may be generated while processing the new events
mergeState(events.map { case (timestamp, group, value) =>
Event(timestamp, group, value)
}.toList)
}
}
processEventGroup will be called once per group at each batch.
The state is managed by GroupState (the state just has to be serializable).
For completeness, here are the missing elements:
def truncateDay(ts: Instant): Instant = {
ts.truncatedTo(ChronoUnit.DAYS)
}
case class Event(timestamp: Instant, group: String, value: Int)
case class IntermediateState(last_timestamp: Instant, group: String, agg_value: Int)
case class AggResult(day_start: Instant, group: String, cumsum_by_day: Int)
(code available here)
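For reference, the streaming query above can be started much like the earlier MemoryStream example; a minimal sketch (the sink and its options here are illustrative, not taken from the repository):
computed_df.writeStream
  .option("truncate", value = false)
  .format("console")
  .outputMode("append") // must match OutputMode.Append() declared in flatMapGroupsWithState
  .start()
  .processAllAvailable()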
Related
How can I compute the 15th and 50th percentiles of the students column, weighted by the occ column, without using array_repeat and explode? My input dataframe is huge and the explosion blows up the memory.
My DF is:
name | occ | students
aaa  | 1   | 1
aaa  | 3   | 7
aaa  | 6   | 11
...
For example, if I treat students and occ as two arrays, then to compute the 50th percentile of students weighted by occ I would normally compute it like this:
val students = Array(1,7,11)
val occ = Array(1,3,6)
it gives:
val student_repeated = Array(1,7,7,7,11,11,11,11,11,11)
and student_50th would then be the 50th percentile of student_repeated, i.e. 11.
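A minimal plain-Scala sketch of that weighted, nearest-rank computation (only to make the expected semantics concrete; the index formula below is one common percentile definition, not the exact algorithm used by percentile_approx):
val students = Array(1, 7, 11)
val occ = Array(1, 3, 6)

// repeat each student value `occ` times, as array_repeat + explode would do
val studentRepeated = students.zip(occ).flatMap { case (s, n) => Array.fill(n)(s) }.sorted

// nearest-rank percentile on the expanded, sorted array
def percentile(xs: Array[Int], p: Double): Int =
  xs(math.max(math.ceil(p * xs.length).toInt - 1, 0))

println(percentile(studentRepeated, 0.50)) // 11
println(percentile(studentRepeated, 0.15)) // 7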
My current code:
import spark.implicits._
val inputDF = Seq(
("aaa", 1, 1),
("aaa", 3, 7),
("aaa", 6, 11),
)
.toDF("name", "occ", "student")
// Solution 1
inputDF
.withColumn("student", array_repeat(col("student"), col("occ")))
.withColumn("student", explode(col("student")))
.groupBy("name")
.agg(
percentile_approx(col("student"), lit(0.5), lit(10000)).alias("student_50"),
percentile_approx(col("student"), lit(0.15), lit(10000)).alias("student_15"),
)
.show(false)
which outputs:
+----+----------+----------+
|name|student_50|student_15|
+----+----------+----------+
|aaa |11 |7 |
+----+----------+----------+
EDIT:
I am looking for a Scala equivalent of this solution:
https://stackoverflow.com/a/58309977/4450090
EDIT2:
I am proceeding with sketches-java
https://github.com/DataDog/sketches-java
I have decided to use DDSketch, which has an accept method that allows the sketch to be updated incrementally.
"com.datadoghq" % "sketches-java" % "0.8.2"
First, I initialize an empty sketch.
Then I accept pairs of values (value, weight).
Finally, I call the DDSketch method getValueAtQuantile.
I execute all of this as a Spark Scala Aggregator:
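For illustration, those three steps look roughly like this outside of Spark (a sketch based on the calls used in the Aggregator below; the import paths are assumed from sketches-java 0.8.x):
import com.datadoghq.sketch.ddsketch.{DDSketch, DDSketches}

// 1. initialize an empty sketch with the desired relative accuracy
val sketch: DDSketch = DDSketches.unboundedDense(0.01)
// 2. accept (value, weight) pairs; the weight plays the role of `occ`
sketch.accept(1.0, 1.0)
sketch.accept(7.0, 3.0)
sketch.accept(11.0, 6.0)
// 3. read the quantile once everything has been accumulated
println(sketch.getValueAtQuantile(0.5)) // ~11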
class DDSInitAgg(pct: Double, accuracy: Double) extends Aggregator[ValueWithWeigth, SketchData, Double]{
private val precision: String = "%.6f"
override def zero: SketchData = DDSUtils.sketchToTuple(DDSketches.unboundedDense(accuracy))
override def reduce(b: SketchData, a: ValueWithWeigth): SketchData = {
val s = DDSUtils.sketchFromTuple(b)
s.accept(a.value, a.weight)
DDSUtils.sketchToTuple(s)
}
override def merge(b1: SketchData, b2: SketchData): SketchData = {
val s1: DDSketch = DDSUtils.sketchFromTuple(b1)
val s2: DDSketch = DDSUtils.sketchFromTuple(b2)
s1.mergeWith(s2)
DDSUtils.sketchToTuple(s1)
}
override def finish(reduction: SketchData): Double = {
val percentile: Double = DDSUtils.sketchFromTuple(reduction).getValueAtQuantile(pct)
precision.format(percentile).toDouble
}
override def bufferEncoder: Encoder[SketchData] = ExpressionEncoder()
override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
You can execute it as a udaf taking two columns as input.
Additionally, I developed methods for encoding/decoding back and forth between DDSketch and Array[Byte]:
case class SketchData(backingArray: Array[Byte], numWrittenBytes: Int)
object DDSUtils {
val emptySketch: DDSketch = DDSketches.unboundedDense(0.01)
val supplierStore: Supplier[Store] = () => new UnboundedSizeDenseStore()
def sketchToTuple(s: DDSketch): SketchData = {
val o = GrowingByteArrayOutput.withDefaultInitialCapacity()
s.encode(o, false)
SketchData(o.backingArray(), o.numWrittenBytes())
}
def sketchFromTuple(sketchData: SketchData): DDSketch = {
val i: ByteArrayInput = ByteArrayInput.wrap(sketchData.backingArray, 0, sketchData.numWrittenBytes)
DDSketch.decode(i, supplierStore)
}
}
This is how I create it as a udaf:
val ddsInitAgg50UDAF: UserDefinedFunction = udaf(new DDSInitAgg(0.50, 0.50), ExpressionEncoder[ValueWithWeigth])
and finally use it in an aggregation:
ddsInitAgg50UDAF(col("weigthCol"), col("valueCol")).alias("value_pct_50")
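Putting it together on the question's inputDF could look like this (a sketch; the argument order has to match the field order of ValueWithWeigth, here weight first and value second, mirroring the weigthCol/valueCol call above):
// occ plays the role of the weight, student the role of the value
val result = inputDF
  .groupBy("name")
  .agg(ddsInitAgg50UDAF(col("occ"), col("student")).alias("student_50"))
result.show(false)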
I have a Dataset[Metric] and transform it into a KeyValueGroupedDataset (grouping by metricId) in order to then perform reduceGroups.
The problem that I've faced is that when there is just one record with some metricId, like metric3 in the example below, it is returned as-is and the processTime field is not getting updated. However when there is more than one record with the same metricId, they are getting reduced and the processTime field is updated correctly.
I guess that it's happening since reduceGroups needs at least 2 records in a group and otherwise just returns the single record unchanged.
But I can't figure out how to update the processTime field when there is a single record in a group.
case class Metric (
metricId: String,
rank: Int,
features: List[Feature],
processTime: Timestamp
)
case class Feature (
featureId: String,
name: String,
value: String
)
val f1 = Feature("1", "f1", "v1")
val f2 = Feature("1", "f2", "v2")
val f3 = Feature("2", "f3", "v3")
val metric1 = Metric("1", 1, List(f1, f2, f3), Timestamp.valueOf("2019-07-01 00:00:00"))
val metric2 = Metric("1", 2, List(f3, f2), Timestamp.valueOf("2019-07-01 00:00:00"))
val metric3 = Metric("2", 1, List(f1, f2), Timestamp.valueOf("2019-07-21 00:00:00"))
val metricsList = List(metric1, metric2, metric3)
val groupedMetrics: KeyValueGroupedDataset[String, Metric] = metricsList.toDS().groupByKey(x => x.metricId)
val aggregatedMetrics: Dataset[(String, Metric)] = groupedMetrics.reduceGroups {
(m1: Metric, m2: Metric) =>
val theMetric: Metric = if (m2.rank >= m1.rank) {
m2
} else {
m1
}
Metric(
m2.metricId,
m2.rank,
m2.features ++ m1.features,
Timestamp.valueOf(LocalDateTime.now())
)
}
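One way around this (a sketch, assuming the goal is simply to stamp every output metric with the current time) is to apply the update with mapValues before reducing, so that single-record groups are transformed as well and reduceGroups only has to merge:
import java.sql.Timestamp
import java.time.LocalDateTime

val now = Timestamp.valueOf(LocalDateTime.now())

val aggregated: Dataset[(String, Metric)] = groupedMetrics
  .mapValues(m => m.copy(processTime = now)) // applied even when the group has a single record
  .reduceGroups { (m1: Metric, m2: Metric) =>
    val best = if (m2.rank >= m1.rank) m2 else m1
    best.copy(features = m1.features ++ m2.features)
  }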
I have a small scenario where I read a text file, calculate an average based on date, and store the summary into a MySQL database.
Following is the code:
val repo_sum = joined_data.map(SensorReport.generateReport)
repo_sum.show() // STEP 1
repo_sum.write.mode(SaveMode.Overwrite).jdbc(url, "sensor_report", prop)
repo_sum.show() // STEP 2
After calculating the average in the repo_sum dataframe, the following is the result of STEP 1:
+----------+------------------+-----+-----+
| date| flo| hz|count|
+----------+------------------+-----+-----+
|2017-10-05|52.887049194476745|10.27| 5.0|
|2017-10-04| 55.4188048943416|10.27| 5.0|
|2017-10-03| 54.1529270444092|10.27| 10.0|
+----------+------------------+-----+-----+
Then the save command is executed, and the dataset values at STEP 2 are:
+----------+-----------------+------------------+-----+
| date| flo| hz|count|
+----------+-----------------+------------------+-----+
|2017-10-05|52.88704919447673|31.578524597238367| 10.0|
|2017-10-04| 55.4188048943416| 32.84440244717079| 10.0|
+----------+-----------------+------------------+-----+
Following is the complete code:
class StreamRead extends Serializable {
org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this);
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Application").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(2))
val sqlContext = new SQLContext(ssc.sparkContext)
import sqlContext.implicits._
val sensorDStream = ssc.textFileStream("file:///C:/Users/M1026352/Desktop/Spark/StreamData").map(Sensor.parseSensor)
val url = "jdbc:mysql://localhost:3306/streamdata"
val prop = new java.util.Properties
prop.setProperty("user", "root")
prop.setProperty("password", "root")
val tweets = sensorDStream.foreachRDD {
rdd =>
if (rdd.count() != 0) {
val databaseVal = sqlContext.read.jdbc("jdbc:mysql://localhost:3306/streamdata", "sensor_report", prop)
val rdd_group = rdd.groupBy { x => x.date }
val repo_data = rdd_group.map { x =>
val sum_flo = x._2.map { x => x.flo }.reduce(_ + _)
val sum_hz = x._2.map { x => x.hz }.reduce(_ + _)
val sum_flo_count = x._2.size
print(sum_flo_count)
SensorReport(x._1, sum_flo, sum_hz, sum_flo_count)
}
val df = repo_data.toDF()
val joined_data = df.join(databaseVal, Seq("date"), "fullouter")
joined_data.show()
val repo_sum = joined_data.map(SensorReport.generateReport)
repo_sum.show()
repo_sum.write.mode(SaveMode.Overwrite).jdbc(url, "sensor_report", prop)
repo_sum.show()
}
}
ssc.start()
WorkerAndTaskExample.main(args)
ssc.awaitTermination()
}
case class Sensor(resid: String, date: String, time: String, hz: Double, disp: Double, flo: Double, sedPPM: Double, psi: Double, chlPPM: Double)
object Sensor extends Serializable {
def parseSensor(str: String): Sensor = {
val p = str.split(",")
Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble, p(6).toDouble, p(7).toDouble, p(8).toDouble)
}
}
case class SensorReport(date: String, flo: Double, hz: Double, count: Double)
object SensorReport extends Serializable {
def generateReport(row: Row): SensorReport = {
print(row)
if (row.get(4) == null) {
SensorReport(row.getString(0), row.getDouble(1) / row.getDouble(3), row.getDouble(2) / row.getDouble(3), row.getDouble(3))
} else if (row.get(2) == null) {
SensorReport(row.getString(0), row.getDouble(4), row.getDouble(5), row.getDouble(6))
} else {
val count = row.getDouble(3) + row.getDouble(6)
val flow_avg_update = (row.getDouble(6) * row.getDouble(4) + row.getDouble(1)) / count
val flow_flo_update = (row.getDouble(6) * row.getDouble(5) + row.getDouble(1)) / count
print(count + " : " + flow_avg_update + " : " + flow_flo_update)
SensorReport(row.getString(0), flow_avg_update, flow_flo_update, count)
}
}
}
As far as I understand, when the save command is executed in Spark the whole process runs again. Is my understanding correct? Please let me know.
In Spark all transformations are lazy; nothing happens until an action is called. At the same time, this means that if multiple actions are called on the same RDD or dataframe, all computations will be performed multiple times. This includes loading the data and all transformations.
To avoid this, use cache() or persist() (they are the same thing, except that persist() lets you specify a storage level, while cache() uses the default). cache() will keep the RDD/dataframe in memory after the first action has been run on it, hence avoiding running the same transformations multiple times.
In this case, the two actions performed on the dataframe are causing this unexpected behavior, so caching the dataframe solves the problem:
val repo_sum = joined_data.map(SensorReport.generateReport).cache()
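If the cached data might not fit in memory, the same idea works with persist and an explicit storage level (a sketch):
import org.apache.spark.storage.StorageLevel

// keep the computed report in memory, spilling to disk instead of recomputing it
val repo_sum = joined_data.map(SensorReport.generateReport)
  .persist(StorageLevel.MEMORY_AND_DISK)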
I have records like the ones below. I would like to convert a single record into two records, one with the value EXTERNAL and one with INTERNAL, whenever the 3rd attribute is All.
Input dataset:
Surender,cts,INTERNAL
Raja,cts,EXTERNAL
Ajay,tcs,All
Expected output:
Surender,cts,INTERNAL
Raja,cts,EXTERNAL
Ajay,tcs,INTERNAL
Ajay,tcs,EXTERNAL
My Spark code:
case class Customer(name:String,organisation:String,campaign_type:String)
val custRDD = sc.textFile("/user/cloudera/input_files/customer.txt")
val mapRDD = custRDD.map(record => record.split(","))
.map(arr => (arr(0), arr(1), arr(2)))
.map(tuple => {
val name = tuple._1.trim
val organisation = tuple._2.trim
val campaign_type = tuple._3.trim.toUpperCase
Customer(name, organisation, campaign_type)
})
mapRDD.toDF().registerTempTable("customer_processed")
sqlContext.sql("SELECT * FROM customer_processed").show
Could someone help me fix this issue?
Since it's Scala...
If you want to write more idiomatic Scala code (perhaps trading some performance, due to the lack of optimizations, for more idiomatic code), you can use the flatMap operator (the implicit parameter is removed below):
flatMap[U](func: (T) ⇒ TraversableOnce[U]): Dataset[U]
Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
NOTE: flatMap is equivalent to the explode function, but you don't have to register a UDF (as in the other answer).
A solution could be as follows:
// I don't care about the names of the columns since we use Scala
// as you did when you tried to write the code
scala> input.show
+--------+---+--------+
| _c0|_c1| _c2|
+--------+---+--------+
|Surender|cts|INTERNAL|
| Raja|cts|EXTERNAL|
| Ajay|tcs| All|
+--------+---+--------+
val result = input.
as[(String, String, String)].
flatMap { case r @ (name, org, campaign) =>
if ("all".equalsIgnoreCase(campaign)) {
Seq("INTERNAL", "EXTERNAL").map { cname =>
(name, org, cname)
}
} else Seq(r)
}
scala> result.show
+--------+---+--------+
| _1| _2| _3|
+--------+---+--------+
|Surender|cts|INTERNAL|
| Raja|cts|EXTERNAL|
| Ajay|tcs|INTERNAL|
| Ajay|tcs|EXTERNAL|
+--------+---+--------+
Comparing the performance of the two queries, i.e. the flatMap-based vs the explode-based one, I think the explode-based query may be slightly faster and better optimized, since some of the code is under Spark's control (using logical operators before they get mapped to their physical counterparts). With flatMap, the entire optimization is your responsibility as a Scala developer.
The red-bounded area below (in the web UI screenshot) corresponds to the flatMap-based code, and the warning signs mark the very expensive DeserializeToObject and SerializeFromObject operators.
What's interesting is the number of Spark jobs per query and their durations. It appears that the explode-based query takes 2 Spark jobs and 200 ms, while the flatMap-based one takes only 1 Spark job and 43 ms.
That surprises me a lot and suggests that the flatMap-based query could be faster (!)
You can use a udf to map the campaign_type column to a Seq of strings containing the campaign types, and then explode:
val campaignType_ : (String => Seq[String]) = {
case s if s == "ALL" => Seq("EXTERNAL", "INTERNAL")
case s => Seq(s)
}
val campaignType = udf(campaignType_)
val df = Seq(("Surender", "cts", "INTERNAL"),
("Raja", "cts", "EXTERNAL"),
("Ajay", "tcs", "ALL"))
.toDF("name", "organisation", "campaign_type")
val step1 = df.withColumn("campaign_type", campaignType($"campaign_type"))
step1.show
// +--------+------------+--------------------+
// | name|organisation| campaign_type|
// +--------+------------+--------------------+
// |Surender| cts| [INTERNAL]|
// | Raja| cts| [EXTERNAL]|
// | Ajay| tcs|[EXTERNAL, INTERNAL]|
// +--------+------------+--------------------+
val step2 = step1.select($"name", $"organisation", explode($"campaign_type"))
step2.show
// +--------+------------+--------+
// | name|organisation| col|
// +--------+------------+--------+
// |Surender| cts|INTERNAL|
// | Raja| cts|EXTERNAL|
// | Ajay| tcs|EXTERNAL|
// | Ajay| tcs|INTERNAL|
// +--------+------------+--------+
EDIT:
You don't actually need a udf; you can use a when().otherwise expression instead for step1, as follows:
val step1 = df.withColumn("campaign_type",
when(col("campaign_type") === "ALL", array(lit("EXTERNAL"), lit("INTERNAL"))).otherwise(array(col("campaign_type"))))
I have a CSV in which one field is a datetime in a specific format. I cannot import it directly into my Dataframe because it needs to be a timestamp. So I import it as a string and convert it into a Timestamp like this:
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.sql.Row
def getTimestamp(x:Any) : Timestamp = {
val format = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss")
if (x.toString() == "")
return null
else {
val d = format.parse(x.toString());
val t = new Timestamp(d.getTime());
return t
}
}
def convert(row : Row) : Row = {
val d1 = getTimestamp(row(3))
return Row(row(0),row(1),row(2),d1)
}
Is there a better, more concise way to do this, with the Dataframe API or spark-sql? The above method requires creating an RDD and specifying the schema for the Dataframe again.
Spark >= 2.2
Since 2.2 you can provide the format string directly:
import org.apache.spark.sql.functions.to_timestamp
val ts = to_timestamp($"dts", "MM/dd/yyyy HH:mm:ss")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+-------------------+
// |id |dts |ts |
// +---+-------------------+-------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01|
// |2 |#$#### |null |
// +---+-------------------+-------------------+
Spark >= 1.6, < 2.2
You can use the date processing functions which were introduced in Spark 1.5. Assuming you have the following data:
val df = Seq((1L, "05/26/2016 01:01:01"), (2L, "#$####")).toDF("id", "dts")
You can use unix_timestamp to parse strings and cast the result to timestamp:
import org.apache.spark.sql.functions.unix_timestamp
val ts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("timestamp")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+---------------------+
// |id |dts |ts |
// +---+-------------------+---------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01.0|
// |2 |#$#### |null |
// +---+-------------------+---------------------+
As you can see it covers both parsing and error handling. The format string should be compatible with Java SimpleDateFormat.
Spark >= 1.5, < 1.6
You'll have to use something like this:
unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("double").cast("timestamp")
or
(unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss") * 1000).cast("timestamp")
due to SPARK-11724.
Spark < 1.5
You should be able to use these with expr and a HiveContext.
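For example, something along these lines (only a sketch: unix_timestamp with a pattern is resolved as a Hive UDF here, hence the HiveContext, and from_unixtime sidesteps the integer-to-timestamp cast semantics):
// df is assumed to come from a HiveContext, with the id and dts columns as above
val withTs = df.selectExpr(
  "id",
  "dts",
  "CAST(from_unixtime(unix_timestamp(dts, 'MM/dd/yyyy HH:mm:ss')) AS timestamp) AS ts"
)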
I haven't played with Spark SQL yet, but I think this would be more idiomatic Scala (using null is not considered good practice):
import java.sql.Timestamp
import java.text.SimpleDateFormat
import scala.util.{Failure, Success, Try}

def getTimestamp(s: String) : Option[Timestamp] = s match {
case "" => None
case _ => {
val format = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss")
Try(new Timestamp(format.parse(s).getTime)) match {
case Success(t) => Some(t)
case Failure(_) => None
}
}
}
Please notice I assume you know the Row element types beforehand (if you read it from a CSV file, all of them are String); that's why I use a proper type like String and not Any (everything is a subtype of Any).
It also depends on how you want to handle parsing exceptions. In this case, if a parsing exception occurs, a None is simply returned.
You could use it further on with:
rows.map(row => Row(row(0), row(1), row(2), getTimestamp(row.getString(3)).orNull))
I have ISO 8601 timestamps in my dataset and I needed to convert them to the "yyyy-MM-dd" format. This is what I did:
import org.joda.time.{DateTime, DateTimeZone}
object DateUtils extends Serializable {
def dtFromUtcSeconds(seconds: Int): DateTime = new DateTime(seconds * 1000L, DateTimeZone.UTC)
def dtFromIso8601(isoString: String): DateTime = new DateTime(isoString, DateTimeZone.UTC)
}
sqlContext.udf.register("formatTimeStamp", (isoTimestamp : String) => DateUtils.dtFromIso8601(isoTimestamp).toString("yyyy-MM-dd"))
And you can just use the UDF in your spark SQL query.
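For example (the table and column names here are made up, just to show the call):
// `events` / `event_time` are hypothetical names registered elsewhere
sqlContext.sql(
  """SELECT formatTimeStamp(event_time) AS event_day, COUNT(*) AS n
    |FROM events
    |GROUP BY formatTimeStamp(event_time)""".stripMargin
).show()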
Spark Version: 2.4.4
scala> import org.apache.spark.sql.types.TimestampType
import org.apache.spark.sql.types.TimestampType
scala> val df = Seq("2019-04-01 08:28:00").toDF("ts")
df: org.apache.spark.sql.DataFrame = [ts: string]
scala> val df_mod = df.select($"ts".cast(TimestampType))
df_mod: org.apache.spark.sql.DataFrame = [ts: timestamp]
scala> df_mod.printSchema()
root
|-- ts: timestamp (nullable = true)
I would move the getTimestamp method you wrote into the RDD's mapPartitions and reuse a GenericMutableRow across the rows in an iterator:
val strRdd = sc.textFile("hdfs://path/to/cvs-file")
val rowRdd: RDD[Row] = strRdd.map(_.split('\t')).mapPartitions { iter =>
new Iterator[Row] {
val row = new GenericMutableRow(4)
var current: Array[String] = _
def hasNext = iter.hasNext
def next() = {
current = iter.next()
row(0) = current(0)
row(1) = current(1)
row(2) = current(2)
val ts = getTimestamp(current(3))
if(ts != null) {
row.update(3, ts)
} else {
row.setNullAt(3)
}
row
}
}
}
And you should still use a schema to generate a DataFrame:
val df = sqlContext.createDataFrame(rowRdd, tableSchema)
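tableSchema is not shown above; for the four columns written into the GenericMutableRow it would look roughly like this (column names are placeholders):
import org.apache.spark.sql.types._

val tableSchema = StructType(Seq(
  StructField("col1", StringType, nullable = true),
  StructField("col2", StringType, nullable = true),
  StructField("col3", StringType, nullable = true),
  StructField("ts", TimestampType, nullable = true)
))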
Usages of GenericMutableRow inside an iterator implementation can be found in the Aggregate operator, InMemoryColumnarTableScan, ParquetTableOperations, etc.
I would use https://github.com/databricks/spark-csv
This will infer timestamps for you.
import com.databricks.spark.csv._
val rdd: RDD[String] = sc.textFile("csvfile.csv")
val df : DataFrame = new CsvParser().withDelimiter('|')
.withInferSchema(true)
.withParseMode("DROPMALFORMED")
.csvRdd(sqlContext, rdd)
I had some issues with to_timestamp where it was returning an empty string. After a lot of trial and error, I was able to get around it by casting to a timestamp and then casting back to a string. I hope this helps anyone else with the same issue:
df.columns.intersect(cols).foldLeft(df)((newDf, col) => {
val conversionFunc = to_timestamp(newDf(col).cast("timestamp"), "MM/dd/yyyy HH:mm:ss").cast("string")
newDf.withColumn(col, conversionFunc)
})
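Here cols is assumed to be the subset of column names you want converted, for example (hypothetical names):
// only the columns present in both df.columns and cols are converted
val cols = Seq("created_at", "updated_at")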