I read from a CSV where the column time contains a timestamp with milliseconds, e.g. '1414250523582'.
When I use TimestampType in the schema it returns NULL.
The only way it reads my data is with StringType.
Now I need this value to be a datetime for further processing.
First I got rid of the too-long timestamp with this:
df2 = df.withColumn("date", col("time")[0:10].cast(IntegerType()))
A schema check says it is an integer now.
Now I try to make it a datetime with
df3 = df2.withColumn("date", datetime.fromtimestamp(col("time")))
It returns:
TypeError: an integer is required (got type Column)
When I google this, people always just use col("x") to read and transform data, so what am I doing wrong here?
The schema checks are a bit tricky: the data in that column may be pyspark.sql.types.IntegerType, but that is not equivalent to Python's int type. The col function returns a pyspark.sql.column.Column object, which does not play nicely with vanilla Python functions like datetime.fromtimestamp; this explains the TypeError. Even though the "date" value in the actual rows is an integer, col doesn't let you access it as a plain integer to feed into a Python function quite so simply. To apply arbitrary Python code to that integer you could compile a udf easily enough, but in this case pyspark.sql.functions already has a solution for your unix timestamp. Try this:
df3 = df2.withColumn("date", from_unixtime(col("date")))
and you should see a nice date in 2014 for your example.
Small note: This "date" column will be of StringType.
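For reference, here is a minimal end-to-end sketch of that approach (column names follow the question; the final cast to TimestampType is an optional extra if you need an actual timestamp rather than a string):

from pyspark.sql.functions import col, from_unixtime
from pyspark.sql.types import IntegerType

# "time" holds a millisecond epoch as a string, e.g. '1414250523582'
df2 = df.withColumn("date", col("time")[0:10].cast(IntegerType()))   # keep only the seconds part
df3 = df2.withColumn("date", from_unixtime(col("date")))             # '2014-10-25 ...' as a string
df4 = df3.withColumn("date", col("date").cast("timestamp"))          # optional: a real TimestampType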
I am trying to convert a string value (2022-07-24T07:04:27.5765591Z) into a datetime/timestamp to insert into a SQL table in datetime format, without losing any precision down to the milliseconds. The string I am providing is actually a datetime and my source is an ADLS CSV. I tried the options below in a data flow.
Using Projection -> changed the data type for the specific column to timestamp with format yyyy-MM-dd'T'HH:mm:ss.SSS'Z', but I get NULL in the output.
Derived column -> tried the expressions below but get NULL values in the output:
toTimestamp(DataLakeModified_DateTime,'%Y-%m-%dT%H:%M:%s%z')
toTimestamp(DataLakeModified_DateTime,'yyyy-MM-ddTHH:mm:ss:fffffffK')
toTimestamp(DataLakeModified_DateTime,'yyyy-MM-dd HH:mm:ss.SSS')
I want the same value in the output:
2022-07-24T07:04:27.5765591Z (coming in as a string) to 2022-07-24T07:04:27.5765591Z (in a datetime format that will be accepted by the SQL database)
I have tried to repro the issue and it gives me the same result, i.e., NULL values for the yyyy-MM-dd'T'HH:mm:ss.SSS'Z' timestamp format. The issue is with the string format you are providing in the source: ADF isn't recognizing the given string as a timestamp and hence returns NULL.
But if you try a slightly different input, keeping only 3 fractional digits before the Z, it does convert into a timestamp and does not return NULL.
This is what I tried: I kept one timestamp as per your given data and another with that modification. The first returns NULL and the second returns a datetime.
The exact format you are looking for is still not achievable, but with the existing source data, yyyy-MM-dd'T'HH:mm:ss works fine. This format is also accepted by SQL tables; I have tried it and it works.
Alternatively, try toString instead of a timestamp and use this to create your desired value:
toString(DataLakeModified_DateTime, 'yyyy-MM-dd HH:mm:ss:SS')
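This is not ADF data flow syntax, but a rough Python illustration of the truncation idea mentioned above (keep the first 3 fractional digits, drop the rest, and the value parses):

from datetime import datetime

raw = "2022-07-24T07:04:27.5765591Z"
truncated = raw[:23] + "Z"                                    # "2022-07-24T07:04:27.576Z"
parsed = datetime.strptime(truncated, "%Y-%m-%dT%H:%M:%S.%fZ")
print(parsed)                                                 # 2022-07-24 07:04:27.576000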
Say I have a dataframe with two columns, both of which need to be converted to datetime format. However, the current formatting of the columns varies from row to row, and when I apply the to_date method, I get all nulls returned.
Here's a screenshot of the format (something like MM/dd/yyyy HH:mm).
The code I tried is:
date_subset.select(col("InsertDate"),to_date(col("InsertDate")).as("to_date")).show()
which returned all nulls.
Your datetime is not in the default format, so you need to specify the format:
to_date(col("InsertDate"), "MM/dd/yyyy HH:mm")
I don't know which one is the month and which is the day, but you can handle it this way.
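For reference, a minimal PySpark sketch of this (the dataframe and column names follow the question; whether MM or dd comes first is an assumption you will need to confirm against your data):

from pyspark.sql.functions import col, to_date

# parse "InsertDate" with an explicit pattern; swap MM and dd if needed
date_subset.select(
    col("InsertDate"),
    to_date(col("InsertDate"), "MM/dd/yyyy HH:mm").alias("to_date")
).show()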
I get an error when writing data to Elasticsearch from Spark. Most documents are written fine, but then I get this kind of exception:
org.elasticsearch.hadoop.rest.EsHadoopRemoteException: date_time_exception: date_time_exception: Invalid value for Year (valid values -999999999 - 999999999): -6220800000
The field mapping in Elasticsearch is "date".
The field type in PySpark is DateType, not TimestampType, which IMO should make it clear that this is a date without a time. The value shown by Spark is "1969-10-21", a perfectly reasonable date.
(It was originally a TimestampType, read from another Elasticsearch date field, but I converted it to a DateType in the hope of solving this error; I get the exact same error message, with the exact same timestamp value, whether I send Elasticsearch a TimestampType or a DateType.)
My guess is that there are three 0s that shouldn't be in the timestamp sent to Elasticsearch, but I can't find any way to normalize it.
Is there an option for the org.elasticsearch.hadoop connector?
(ELK version is 7.5.2, Spark is 2.4.4.)
Obvious workaround: use any type other than TimestampType or DateType,
e.g. using this UDF for LongType (to demonstrate that it's indeed a timestamp-length issue):
import time
from pyspark.sql import functions as F
from pyspark.sql.types import LongType

def conv_ts(d):
    # convert a Python datetime to seconds since the epoch
    return time.mktime(d.timetuple())

ts_udf = F.udf(lambda z: int(conv_ts(z)), LongType())
(Note that in that snippet the Spark input is a TimestampType, not a DateType, so a Python datetime rather than a date, because I tried messing around with time conversions too.)
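A hypothetical usage line, assuming the TimestampType column is called "time":

df_out = df.withColumn("ts_long", ts_udf(F.col("time")))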
OR (a much more efficient way): avoid the UDF altogether by sending a StringType field of formatted dates instead of the long timestamp, thanks to the pyspark.sql.functions.date_format function.
It is a solution, but not a really satisfying one; I would rather understand why the connector doesn't properly deal with TimestampType and DateType by adjusting the timestamp length accordingly.
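For reference, a minimal sketch of that date_format alternative (the column name "ts" and the output pattern are assumptions; use whatever pattern the Elasticsearch "date" mapping expects):

from pyspark.sql import functions as F

# send a formatted string instead of a raw TimestampType/DateType
df_out = df.withColumn("ts_str", F.date_format(F.col("ts"), "yyyy-MM-dd'T'HH:mm:ss"))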
I have passed a string (datestr) to a function (that does ETL on a dataframe in Spark using the Scala API); however, at some point I need to filter the dataframe by a certain date,
something like:
df.filter(col("dt_adpublished_simple") === date_add(datestr, -8))
where datestr is the parameter that I passed to the function.
Unfortunately, the function date_add requires a Column type as its first parameter.
Can anyone help me with how to convert the parameter into a column, or suggest a similar solution that will solve the issue?
You probably only need lit to create a string Column from your input String, and then to_date to create a date Column from it:
df.filter(col("dt_adpublished_simple") === date_add(to_date(lit(datestr), format), -8))
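For comparison, the same idea in PySpark (the yyyy-MM-dd format for datestr is an assumption; use whatever format your input string actually has):

from pyspark.sql.functions import col, lit, to_date, date_add

datestr = "2019-05-20"  # example input, assumed to be yyyy-MM-dd
df.filter(col("dt_adpublished_simple") == date_add(to_date(lit(datestr), "yyyy-MM-dd"), -8))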
I am having trouble converting a HubSpot UNIX timestamp to a date. The timestamp is stored as a text value.
The value looks like this:
1549324800000
My logic was to first convert the number to a bigint and then convert it to a date using:
TO_CHAR(TO_TIMESTAMP(properties__vape_station__value), 'DD/MM/YYYY')
What would be the best way to convert a UNIX timestamp stored as text to a date in PostgreSQL 11?
You can cast the value in that column to a bigint, divide by 1000 (the value is in milliseconds), and then use to_timestamp.
But apparently that column also stores empty strings rather than NULL values if no value is present, so you need to take that into account:
to_timestamp(nullif(trim(properties__vape_station__value),'')::bigint/1000)
This would still fail, however, if anything other than a number is stored in that column.
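As a quick sanity check of the millisecond arithmetic (plain Python rather than PostgreSQL, just to confirm the division by 1000):

from datetime import datetime, timezone

# 1549324800000 is milliseconds since the epoch; divide by 1000 to get seconds
print(datetime.fromtimestamp(1549324800000 / 1000, tz=timezone.utc))   # 2019-02-05 00:00:00+00:00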