Datetime conversion in Spark - Scala

It seems that I can't make date_format work. Using a format that I know works on my data (see below):
import org.apache.spark.sql.functions._
dat.withColumn("ts", date_format(dat("timestamp"), "MMM-dd-yyyy hh:mm:ss:SSS a (z)")).select("timestamp", "ts").first
I get
res310: org.apache.spark.sql.Row = [Aug-11-2016 09:21:43:749 PM (CEST),null]
Reading the docs, I understand that date_format should accept any SimpleDateFormat pattern. Is that correct?
I can make it work by going through the pain of the code below:
import java.text.SimpleDateFormat
import java.sql.Timestamp
val timestamp_parser = new SimpleDateFormat("MMM-dd-yyyy hh:mm:ss:SSS a (z)")
val udf_timestamp_string_to_long = udf[Long, String]( timestamp_parser.parse(_).getTime() )
val udf_timestamp_long_to_sql_timestamp = udf[Timestamp, Long]( new Timestamp(_) )
dat.withColumn("ts", udf_timestamp_long_to_sql_timestamp(udf_timestamp_string_to_long(dat("timestamp")))).select("timestamp", "ts").first
which gives
res314: org.apache.spark.sql.Row = [Aug-11-2016 09:21:43:749 PM (CEST),2016-08-11 21:21:43.749]
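For reference, a DataFrame-only sketch of the same parse via unix_timestamp (not a full equivalent, since unix_timestamp resolves to whole seconds and therefore drops the :SSS milliseconds):
import org.apache.spark.sql.functions.unix_timestamp
// Sketch: parse the string column with the explicit pattern, then cast the resulting seconds to a timestamp.
val withTs = dat.withColumn("ts", unix_timestamp(dat("timestamp"), "MMM-dd-yyyy hh:mm:ss:SSS a (z)").cast("timestamp"))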

Related

How can I read multiple parquet files in Spark Scala?

Below are some folders, which might keep updating over time. They contain multiple .parquet files. How can I read them into a Spark DataFrame in Scala?
"id=200393/date=2019-03-25"
"id=200393/date=2019-03-26"
"id=200393/date=2019-03-27"
"id=200393/date=2019-03-28"
"id=200393/date=2019-03-29" and so on ...
Note: there could be 100 date folders; I need to pick only specific ones (let's say the 25th, 26th, and 28th).
Is there any better way than below ?
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
val spark = SparkSession.builder.appName("ScalaCodeTest").master("yarn").getOrCreate()
val parquetFiles = List("id=200393/date=2019-03-25", "id=200393/date=2019-03-26", "id=200393/date=2019-03-28")
spark.read.format("parquet").load(parquetFiles: _*)
The above code is working but I want to do something like below-
val parquetFiles = List()
parquetFiles(0) = "id=200393/date=2019-03-25"
parquetFiles(1) = "id=200393/date=2019-03-26"
parquetFiles(2) = "id=200393/date=2019-03-28"
spark.read.format("parquet").load(parquetFiles: _*)
You can read all folders in the directory id=200393 this way:
val df = spark.read.parquet("id=200393/*")
If you want to select only some dates, for example only September 2019:
val df = spark.read.parquet("id=200393/date=2019-09-*")
If you only need specific days, you can keep them in a list:
val days = List("2019-09-02", "2019-09-03")
val paths = days.map(day => "id=200393/date=" ++ day)
val df = spark.read.parquet(paths:_*)
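If you really want to build the list of paths incrementally, as in the question, a mutable buffer is one option (a minimal sketch using the example paths above):
import scala.collection.mutable.ListBuffer
val parquetFiles = ListBuffer[String]()
parquetFiles += "id=200393/date=2019-03-25"
parquetFiles += "id=200393/date=2019-03-26"
parquetFiles += "id=200393/date=2019-03-28"
val df = spark.read.parquet(parquetFiles: _*)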
If you want to keep the column 'id', you could try this:
val df = sqlContext
.read
.option("basePath", "id=200393/")
.parquet("id=200393/date=*")

Getting a date x days back from a custom date in Scala

I have a date key like 20170501, which is in yyyyMMdd format. How can I get a date x days back from this date in Scala?
This is what I have in the program:
val runDate = 20170501
Now I want a date say 30 days back from this date.
Using Scala/JVM/Java 8...
scala> import java.time._
import java.time._
scala> import java.time.format._
import java.time.format._
scala> val formatter = DateTimeFormatter.ofPattern("yyyyMMdd")
formatter: java.time.format.DateTimeFormatter = Value(YearOfEra,4,19,EXCEEDS_PAD)Value(MonthOfYear,2)Value(DayOfMonth,2)
scala> val runDate = 20170501
runDate: Int = 20170501
scala> val runDay = LocalDate.parse(runDate.toString, formatter)
runDay: java.time.LocalDate = 2017-05-01
scala> val runDayMinus30 = runDay.minusDays(30)
runDayMinus30: java.time.LocalDate = 2017-04-01
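If the result is needed back as an Int key in the same yyyyMMdd shape (an assumption about the desired output), it can be formatted with the same formatter:
scala> val runDateMinus30 = runDayMinus30.format(formatter).toInt
runDateMinus30: Int = 20170401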
You can also use the joda-time API, which has really handy functions like
date.minusMonths
date.minusYears
date.minusDays
date.minusHours
date.minusMinutes
Here is a simple example using the joda-time API:
import org.joda.time.format.DateTimeFormat
val dtf = DateTimeFormat.forPattern("yyyyMMdd")
val dt= "20170531"
val date = dtf.parseDateTime(dt)
println(date.minusDays(30))
Output:
2017-05-01T00:00:00.000+05:45
For this on a DataFrame column you need to use a UDF, create a DateTime object from your input format "yyyyMMdd", and do the operations; a rough sketch follows.
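A rough sketch of that UDF approach with joda-time (the DataFrame df, the column name dateKey, and the fixed 30-day offset are assumptions for illustration):
import org.apache.spark.sql.functions.{col, udf}
import org.joda.time.format.DateTimeFormat
// Hypothetical: "dateKey" holds Int values like 20170501 in yyyyMMdd form.
val minus30Days = udf { dateKey: Int =>
  val fmt = DateTimeFormat.forPattern("yyyyMMdd")
  fmt.print(fmt.parseDateTime(dateKey.toString).minusDays(30)).toInt
}
val result = df.withColumn("runDateMinus30", minus30Days(col("dateKey")))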
Hope this helps!

Unable to find encoder for Decimal type stored in a DataSet

Folks-
I am a complete Spark newbie and have been trying to get the code below to work in spark-shell for the past day. I took the time to review the docs and tried to Google the problem, but I am running out of ideas.
Below is the code:
import spark.implicits._
val opts = Map(
"url" -> "jdbc:netezza://netezza:5480/test_schema",
"user" -> "user",
"password" -> "password",
"dbtable" -> "test_messages",
"numPartitions" -> "48"
)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val site = sqlContext
.read
.format("com.ibm.spark.netezza")
.options(opts)
.load()
.select("az","range","time")
.where("id == 34000007")
site.printSchema() shows that all columns are of type decimal.
val calcs = site.agg(
min("az"), (max("az")-min("az")).divide(100),
min("range"), (max("range")-min("range")).divide(100),
min("time"), (max("time")-min("time")).divide(100)
).collect()(0)
Inspecting calcs likewise shows that all values are of type decimal.
Everything works as expected until the next line. I thought that importing spark.implicits._ would give me access to the required Encoder, but that is not the case.
val newSite = site.map( r => r.getDecimal(0).subtract(calcs.getDecimal(0)) )
Every post that I have reviewed talks about importing implicits, but this has not helped. I am using Spark 2.0.2.
Any ideas would be greatly appreciated.
There is simply no Encoder for Decimal in spark.implicits. You can provide it either explicitly:
import org.apache.spark.sql.types.DecimalType
import org.apache.spark.sql.Encoders
val dt = DecimalType(38, 0)
val df = Seq((1, 2)).toDF("x", "y").select($"x".cast(dt), $"y".cast(dt))
df.map(r => r.getDecimal(0).subtract(r.getDecimal(1)))(Encoders.DECIMAL).first
java.math.BigDecimal = -1.000000000000000000
or implicitly:
implicit val decimalEncoder = Encoders.DECIMAL
df.map(r => r.getDecimal(0).subtract(r.getDecimal(1))).first
java.math.BigDecimal = -1.000000000000000000
That being said, it could be a better idea to stay with DataFrame operations all the way, for example:
site.select($"az" - calcs.getDecimal(0))
or
site.select($"az" - calcs.getAs[java.math.BigDecimal]("min(az)"))

How to convert unix timestamp to date in Spark

I have a data frame with a column of Unix timestamps (e.g. 1435655706000), and I want to convert it to dates with the format 'yyyy-MM-dd'. I've tried nscala-time but it doesn't work.
val time_col = sqlc.sql("select ts from mr").map(_(0).toString.toDateTime)
time_col.collect().foreach(println)
and I got error:
java.lang.IllegalArgumentException: Invalid format: "1435655706000" is malformed at "6000"
Here it is using the Spark DataFrame functions from_unixtime and to_date:
// NOTE: divide by 1000 required if milliseconds
// e.g. 1446846655609 -> 2015-11-06 21:50:55 -> 2015-11-06
mr.select(to_date(from_unixtime($"ts" / 1000)))
Since Spark 1.5, there is a built-in function for doing that:
val df = sqlContext.sql("select from_unixtime(ts/1000,'yyyy-MM-dd') as `ts` from mr")
(The division by 1000 is needed because the timestamps are in milliseconds, and the pattern should use yyyy rather than the week-year YYYY.)
Please check Spark 1.5.2 API Doc for more info.
You need to import the following libraries:
import org.joda.time.{DateTime, DateTimeZone}
import org.joda.time.format.DateTimeFormat
val stri = new DateTime(timeInMillisec).toString("yyyy/MM/dd")
Or, adjusting it to your case:
val time_col = sqlContext.sql("select ts from mr")
.map(line => new DateTime(line(0).toString.toLong).toString("yyyy/MM/dd"))
There could be another way:
import com.github.nscala_time.time.Imports._
val date = (new DateTime() + ((threshold.toDouble)/1000).toInt.seconds )
.toString("yyyy/MM/dd")
Hope this helps :)
You needn't convert to a String before applying toDateTime with nscala-time:
import com.github.nscala_time.time.Imports._
scala> 1435655706000L.toDateTime
res4: org.joda.time.DateTime = 2015-06-30T09:15:06.000Z
I have solved this issue using the joda-time library by mapping over the DataFrame and converting the DateTime into a String:
import org.joda.time._
val time_col = sqlContext.sql("select ts from mr")
.map(line => new DateTime(line(0)).toString("yyyy-MM-dd"))
You can use the following syntax in Java:
input.select("timestamp")
.withColumn("date", date_format(col("timestamp").$div(1000).cast(DataTypes.TimestampType), "yyyyMMdd").cast(DataTypes.IntegerType))
What you can do is:
input.withColumn("time", concat(from_unixtime(input.col("COL_WITH_UNIX_TIME")/1000,
"yyyy-MM-dd'T'HH:mm:ss"), typedLit("."), substring(input.col("COL_WITH_UNIX_TIME"), 11, 3),
typedLit("Z")))
where time is the new column name and COL_WITH_UNIX_TIME is the name of the column you want to convert. This keeps the milliseconds, making your data more precise, e.g. in the form "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'".