Spark Timestamp issue, same timestamp but mismatching - scala

I am moving data from a source into my bucket and need to write a script for data validation. For the Timestamp data type I hit a weird issue: I have two rows containing the same timestamp, [2017-06-08 17:50:02.422437] and [2017-06-08 17:50:02.422]. Because the second one has a different precision due to a different file system configuration, Spark considers them different. Is there any way to resolve this problem? Ideally I would ignore this column when comparing the data frames.

You can use unix_timestamp and compare the resulting number. If you need an actual date afterwards, from_unixtime converts it back to whatever format you require. I am not sure this is an efficient method on a large volume of data, though...
sqlContext.sql("Select unix_timestamp('2017-06-08 17:50:02.422'), unix_timestamp('2017-06-08 17:50:02.422437') ").show
+----------+----------+
| _c0| _c1|
+----------+----------+
|1496958602|1496958602|
+----------+----------+
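If the goal is the whole data frame comparison rather than a single value, here is a minimal sketch of both options, the drop and the whole-second comparison (written in PySpark; the Scala API has the same functions, and the frame/column names are made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import unix_timestamp, col

spark = SparkSession.builder.getOrCreate()

# Stand-ins for the source table and the bucket copy; "event_ts" is the column
# whose sub-second precision differs between the two systems.
source_df = spark.createDataFrame([("a", "2017-06-08 17:50:02.422437")], ["id", "event_ts"])
bucket_df = spark.createDataFrame([("a", "2017-06-08 17:50:02.422")], ["id", "event_ts"])

# Option 1: ignore the timestamp column entirely when diffing.
diff_ignoring_ts = source_df.drop("event_ts").subtract(bucket_df.drop("event_ts"))

# Option 2: compare at whole-second precision via unix_timestamp, as in the answer above.
def to_seconds(df):
    return df.withColumn("event_ts", unix_timestamp(col("event_ts")))

diff_on_seconds = to_seconds(source_df).subtract(to_seconds(bucket_df))

diff_ignoring_ts.show()  # empty when the remaining columns match
diff_on_seconds.show()   # empty, since both rows truncate to 1496958602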

Related

Spark sql function to get time stamp from date time string

I am converting the SQL version of some code to Spark SQL (PySpark). I need to get the time component from a datetime string. In SQL, the code used is
to_char(a.timeclose, 'HH24:MI:SS') as timeclose
Which function can be used in PySpark?
Input : 1970-01-01 16:37:59
Output: 16:37:59
You can do it like this.
Data :
+--------+
|datetime|
+--------+
|09:39:20|
+--------+
df.registerTempTable("sample")
spark.sql("select split(datetime,':')[0] as a from sample").show()
Output :
+---+
| a|
+---+
| 09|
+---+
Spark does not have a standalone Time data type that you can use.
The closest you can get is to convert your current datetime/timestamp to epoch seconds using unix_timestamp.
Apart from that, kindly go through Minimal, Reproducible Example and share samples from your dataset so we can assist you.
Found a solution.
SQL code: to_char(a.timeopen, 'HH24:MI:SS') as timeopen
The alternative in PySpark: substring(CAST(a.timeopen AS string), -8) as timeopen
Input : 1970-01-01 16:37:59
Output: 16:37:59
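Another option, sketched here in PySpark and assuming the column can be cast to a timestamp, is date_format, which is probably the closest analogue to to_char(..., 'HH24:MI:SS'):

from pyspark.sql import SparkSession
from pyspark.sql.functions import date_format, col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("1970-01-01 16:37:59",)], ["timeclose"])

# Cast the string to a timestamp, then keep only the time portion.
df.withColumn("timeclose_hms",
              date_format(col("timeclose").cast("timestamp"), "HH:mm:ss")).show()
# +-------------------+-------------+
# |          timeclose|timeclose_hms|
# +-------------------+-------------+
# |1970-01-01 16:37:59|     16:37:59|
# +-------------------+-------------+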

Spark Scala - Data type for columns representing times without dates (hhmm format)

I have seen that there are types TimestampType and DateType for representing dates and timestamps like this:
+-------------------+----------+
| timestamp| date|
+-------------------+----------+
|2020-07-01 00:00:00|2020-07-01|
+-------------------+----------+
I was wondering how to deal in Spark (Scala) with columns representing day times in hhmm format.
For instance these columns:
deptime: actual departure time (e.g. 1345)
arrtime: actual arrival time (e.g. 1545)
In this case: a plane departs at 13:45 and arrives at its destination at 15:45
I don't know if there is a specific type to deal with these columns, or whether I have to treat them as integers or maybe as strings.
Thanks in advance.

pySpark Timestamp as String to DateTime

I read from a CSV where the column time contains a timestamp with milliseconds, '1414250523582'.
When I use TimestampType in the schema it returns NULL.
The only way it will read my data is with StringType.
Now I need this value to be a datetime for further processing.
First I got rid of the too-long timestamp with this:
df2 = df.withColumn("date", col("time")[0:10].cast(IntegerType()))
A schema check says it is an integer now.
Now I try to make it a datetime with
df3 = df2.withColumn("date", datetime.fromtimestamp(col("time")))
and it returns
TypeError: an integer is required (got type Column)
When I google, people always just use col("x") to read and transform data, so what am I doing wrong here?
The schema check is a bit tricky: the data in that column may be pyspark.sql.types.IntegerType, but that is not equivalent to Python's int type. The col function returns a pyspark.sql.column.Column object, which generally does not play nicely with vanilla Python functions like datetime.fromtimestamp; that explains the TypeError. Even though the "date" value in the actual rows is an integer, col does not let you access it as a plain integer to feed into a Python function. To apply arbitrary Python code to that value you could compile a udf, but in this case pyspark.sql.functions already has a solution for your unix timestamp. Try this: df3 = df2.withColumn("date", from_unixtime(col("date"))) (note that it is applied to the truncated seconds column, not the raw millisecond string), and you should see a nice date in 2014 for your example.
Small note: This "date" column will be of StringType.
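Putting the answer together as a runnable sketch (imports included; the example value is the one from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_unixtime
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# The CSV column arrives as a string of epoch milliseconds.
df = spark.createDataFrame([("1414250523582",)], ["time"])

# Keep the first 10 digits (epoch seconds), as in the question.
df2 = df.withColumn("date", col("time")[0:10].cast(IntegerType()))

# Let Spark do the conversion; from_unixtime yields a formatted string...
df3 = df2.withColumn("date", from_unixtime(col("date")))

# ...which can be cast further if a real TimestampType column is needed downstream.
df4 = df3.withColumn("date", col("date").cast("timestamp"))

df4.printSchema()
df4.show(truncate=False)  # a date in October 2014 for this example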

Spark read csv containing nanosecond timestamps

I am dumping a Postgres table using a copy command outputting to CSV.
The CSV contains timestamps formatted as such: 2011-01-01 12:30:10.123456+00.
I'm reading the CSV like
df = spark.read.csv(
"s3://path/to/csv",
inferSchema=True,
timestampFormat="yyyy-MM-dd HH:mm:ss.SSSSSSX",
...
)
but this doesn't work (as expected). The timestampFormat uses java.text.SimpleDateFormat which does not have nanosecond support.
I've tried a lot of variations on the timestampFormat, and they all produce either String columns or misformat the timestamp. Seems like the nanoseconds end up overflowing the seconds and adding time to my timestamp.
I can't apply a schema to the CSV because I don't always know it, and I can't cast the columns because I don't always know which will be timestamps. I also can't cast the timestamp on the way out of Postgres, because I'm just doing select * ....
How can I solve this so I can ingest the CSV with the proper timestamp format?
My first thought was that I just had to modify timestampFormat, but that does not seem to be possible. My second thought is to use sed to truncate the timestamps as I dump them from Postgres.
I'm using spark 2.3.1.
Thanks for the help!
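One way to do the truncation inside Spark rather than with sed, sketched here with a hypothetical column name ("created_at") and assuming Spark 2.x's SimpleDateFormat-based parsing:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace, to_timestamp

spark = SparkSession.builder.getOrCreate()

# Read everything as strings first (no inferSchema), so nothing gets misparsed up front.
df = spark.read.csv("s3://path/to/csv", header=True)

# Truncate microseconds to milliseconds, but only in values that look like
# '2011-01-01 12:30:10.123456+00', so non-timestamp columns pass through untouched.
ts_pattern = r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\d*(\+\d{2})$"
for c in df.columns:
    df = df.withColumn(c, regexp_replace(col(c), ts_pattern, "$1$2"))

# For columns you do know are timestamps (the hypothetical "created_at"),
# the truncated value is now within SimpleDateFormat's abilities:
df = df.withColumn("created_at",
                   to_timestamp(col("created_at"), "yyyy-MM-dd HH:mm:ss.SSSX"))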

Create a Spark Dataframe on Time

Quick question that I did not find on Google.
What is the best way to create a Spark DataFrame with timestamps?
Let's say I have a start point, an end point, and a 15 min interval. What would be the best way to resolve this in Spark?
Try the Spark SQL functions, e.g. DF.withColumn("regTime", current_timestamp); it can replace a column.
You can also convert a String to a date or timestamp with DF.withColumn("regTime", to_date(spark_temp_user_day("dateTime"))).
regTime is a column of DF.
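Building on the withColumn idea above, here is one hedged way to materialise a start-to-end timeline at a 15-minute step using only built-in functions (the start/end values and column names are made up; sketched in PySpark, the Scala API is analogous):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_unixtime

spark = SparkSession.builder.getOrCreate()

start = "2021-01-01 00:00:00"
end = "2021-01-01 02:00:00"
step = 15 * 60  # 15 minutes, expressed in seconds

# Resolve the bounds to epoch seconds once on the driver.
bounds = spark.sql(
    "SELECT unix_timestamp('{0}') AS lo, unix_timestamp('{1}') AS hi".format(start, end)
).first()

# spark.range generates the arithmetic sequence of epoch seconds;
# from_unixtime turns each value back into a timestamp column.
timeline = (spark.range(bounds.lo, bounds.hi + 1, step)
                 .withColumn("ts", from_unixtime(col("id")).cast("timestamp"))
                 .drop("id"))

timeline.show(truncate=False)  # 00:00:00, 00:15:00, ..., 02:00:00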