I have a Pandas dataframe with category and datetime64[ns] columns. Looking through the Spark datatypes in pyspark.sql.types, I could not find any equivalent for categorical. Is there a recommended mapping, or a way of defining a custom datatype?
The datetime64[ns] type can only be mapped to a LongType in the schema; it crashes with both DateType and TimestampType. For example:
2016-06-06 07:15:32.112202 -> 1465197332112202000
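Spark has no categorical type, so the usual workaround is to cast the category column to plain strings, and if the datetime64[ns] values come through as nanosecond longs they can be converted back into a proper timestamp on the Spark side. A minimal sketch, assuming an existing SparkSession named spark and made-up column names:

import pandas as pd
from pyspark.sql import functions as F

pdf = pd.DataFrame({
    "label": pd.Series(["a", "b", "a"], dtype="category"),
    "ts": pd.to_datetime(["2016-06-06 07:15:32.112202"] * 3),
})

# category -> string, since Spark has no categorical equivalent
pdf = pdf.assign(label=pdf["label"].astype(str))

sdf = spark.createDataFrame(pdf)

# if "ts" arrived as a LongType of nanoseconds, convert it back to a timestamp
if dict(sdf.dtypes)["ts"] == "bigint":
    sdf = sdf.withColumn("ts", (F.col("ts") / 1_000_000_000).cast("timestamp"))

sdf.printSchema()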
Can you help with a pyspark query to convert the dates as below, like in SQL?
Spark has datetime functions for converting StringType to/from DateType or TimestampType, for example unix_timestamp, date_format, to_unix_timestamp, from_unixtime, to_date, to_timestamp, from_utc_timestamp, to_utc_timestamp, etc.
You can use the date_format function to get the desired date format, parsing the dd-MM-yyyy input with to_date first:
select date_format(to_date('24-03-2022', 'dd-MM-yyyy'), 'yyyy/MM/dd');
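A hedged PySpark sketch of the same approach with the DataFrame API (it assumes a SparkSession named spark; the column name raw_date is made up):

from pyspark.sql import functions as F

df = spark.createDataFrame([("24-03-2022",)], ["raw_date"])
df.select(
    F.date_format(F.to_date("raw_date", "dd-MM-yyyy"), "yyyy/MM/dd").alias("formatted")
).show()
# +----------+
# | formatted|
# +----------+
# |2022/03/24|
# +----------+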
I read from a CSV where the column time contains a timestamp with milliseconds, '1414250523582'.
When I use TimestampType in the schema it returns NULL.
The only way it reads my data is with StringType.
Now I need this value to be a datetime for further processing.
First I got rid of the too-long timestamp with this:
df2 = df.withColumn("date", col("time")[0:10].cast(IntegerType()))
A schema check says it's an integer now.
Now I try to make it a datetime with
df3 = df2.withColumn("date", datetime.fromtimestamp(col("time")))
it returns
TypeError: an integer is required (got type Column)
When I google, people always just use col("x") to read and transform data, so what am I doing wrong here?
The schema checks are a bit tricky; the data in that column may be pyspark.sql.types.IntegerType, but that is not equivalent to Python's int type. The col function returns a pyspark.sql.column.Column object, which does not play nicely with vanilla Python functions like datetime.fromtimestamp. This explains the TypeError. Even though the "date" data in the actual rows is an integer, col doesn't let you access it as an integer to feed into a Python function quite so simply. To apply arbitrary Python code to that value you can compile a udf easily enough, but in this case pyspark.sql.functions already has a solution for your unix timestamp. Try this:
df3 = df2.withColumn("date", from_unixtime(col("date")))
and you should see a nice date in 2014 for your example.
Small note: This "date" column will be of StringType.
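Putting it together, a minimal end-to-end sketch (it assumes a SparkSession named spark, and a one-row DataFrame stands in for the CSV read):

from pyspark.sql.functions import col, from_unixtime
from pyspark.sql.types import IntegerType

df = spark.createDataFrame([("1414250523582",)], ["time"])

df2 = df.withColumn("date", col("time")[0:10].cast(IntegerType()))  # keep the seconds part
df3 = df2.withColumn("date", from_unixtime(col("date")))            # "2014-10-25 ..." as a string
df3 = df3.withColumn("date", col("date").cast("timestamp"))         # optional: a real TimestampType
df3.printSchema()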
I get an error when writing data to Elasticsearch from Spark. Most documents are written fine; then I get this kind of exception:
org.elasticsearch.hadoop.rest.EsHadoopRemoteException: date_time_exception: date_time_exception: Invalid value for Year (valid values -999999999 - 999999999): -6220800000
The field mapping in Elasticsearch is "date".
The field type in PySpark is DateType, not TimestampType, which imo should make clear that this is a date without time. The value shown by Spark is "1969-10-21", so a perfectly reasonable date.
(It was originally a TimestampType, from another Elasticsearch date read, but I converted it to a DateType hoping to solve this error. I get the exact same error message, with the exact same timestamp value, whether I send Elasticsearch a TimestampType or a DateType.)
My guess is that there are three trailing zeros that shouldn't be in the timestamp sent to Elasticsearch, but I can't find any way to normalize it.
Is there an option for the org.elasticsearch.hadoop connector?
(The ELK version is 7.5.2, Spark is 2.4.4.)
Obvious workaround: use any type other than TimestampType or DateType, e.g. this udf producing a LongType (to demonstrate that it's indeed a timestamp-length issue):
import time

from pyspark.sql import functions as F
from pyspark.sql.types import LongType

def conv_ts(d):
    # d is a Python datetime coming from a TimestampType column
    return time.mktime(d.timetuple())

ts_udf = F.udf(lambda z: int(conv_ts(z)), LongType())
(Note that in that snippet the Spark input is a TimestampType, not a DateType, so the udf receives a Python datetime rather than a date, because I tried messing around with time conversions too.)
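Applying it might look like this (a sketch; the column name my_ts is made up):

df_out = df.withColumn("ts_long", ts_udf(F.col("my_ts")))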
Or (a much more efficient way) avoid the udf altogether by sending a StringType field containing the formatted date instead of the long timestamp, thanks to the pyspark.sql.functions.date_format function.
A solution, but not a really satisfying one; I would rather understand why the connector doesn't properly deal with TimestampType and DateType by adjusting the timestamp length accordingly.
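For reference, a hedged sketch of that date_format alternative (the column and index names are made up; es.resource and the org.elasticsearch.spark.sql format are the usual ES-Hadoop connector settings):

from pyspark.sql import functions as F

df_out = df.withColumn("my_date", F.date_format("my_date", "yyyy-MM-dd'T'HH:mm:ss"))

df_out.write \
    .format("org.elasticsearch.spark.sql") \
    .option("es.resource", "my-index") \
    .mode("append") \
    .save()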
How can I convert a timestamp in the format 2019-08-22T23:57:57-07:00 into unixtime using Spark SQL or PySpark?
The most similar function I know is unix_timestamp(), but it doesn't accept the above time format with the UTC offset.
Any suggestion on how I could approach that using preferably Spark SQL or PySpark?
Thanks
The Java SimpleDateFormat pattern for an ISO 8601 time zone in this case is XXX.
So you need to use yyyy-MM-dd'T'HH:mm:ssXXX as your format string.
SparkSQL
spark.sql(
"""select unix_timestamp("2019-08-22T23:57:57-07:00", "yyyy-MM-dd'T'HH:mm:ssXXX")
AS epoch"""
).show(truncate=False)
#+----------+
#|epoch |
#+----------+
#|1566543477|
#+----------+
Spark DataFrame
from pyspark.sql.functions import unix_timestamp
df = spark.createDataFrame([("2019-08-22T23:57:57-07:00",)], ["timestamp"])
df.withColumn(
"unixtime",
unix_timestamp("timestamp", "yyyy-MM-dd'T'HH:mm:ssXXX")
).show(truncate=False)
#+-------------------------+----------+
#|timestamp |unixtime |
#+-------------------------+----------+
#|2019-08-22T23:57:57-07:00|1566543477|
#+-------------------------+----------+
Note that PySpark is just a wrapper around Spark; generally I've found the Scala/Java docs are more complete than the Python ones, which may be helpful to know in the future.
Suppose I have a dataframe in which a Timestamp column is present.
Timestamp
2016-04-19T17:13:17
2016-04-20T11:31:31
2016-04-20T18:44:31
2016-04-20T14:44:01
I have to check, in Scala, whether the current timestamp is greater than the Timestamp column + 1 (i.e. with one day added to it).
The DataFrame API supports two current_* functions for this, current_date() and current_timestamp().
Let's consider a DataFrame df with id and event_date columns.
We can perform the following filter operations:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
// the event_date is before the current timestamp
df.filter('event_date.lt(current_timestamp()))
// the event_date is after the current timestamp
df.filter('event_date.gt(current_timestamp()))
I advise you to read the associated Scala doc for org.apache.spark.sql.functions for more information; it has a whole section on date and timestamp operations.
EDIT: As discussed in the comments, in order to add a day to your event_date column, you can use the date_add function:
df.filter(date_add('event_date,1).lt(current_timestamp()))
You can do it like this:
df.filter(date_add('column_name, 1).lt(current_timestamp()))
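For completeness, a hedged PySpark equivalent of the same filter (assuming a column named event_date as in the example above):

from pyspark.sql.functions import col, current_timestamp, date_add

df.filter(date_add(col("event_date"), 1) < current_timestamp())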