I have passed a string (datestr) to a function that does ETL on a DataFrame in Spark using the Scala API. However, at some point I need to filter the DataFrame by a certain date,
something like:
df.filter(col("dt_adpublished_simple") === date_add(datestr, -8))
where datestr is the parameter that I passed to the function.
Unfortunately, the function date_add requires a Column type as its first parameter.
Can anyone help me with how to convert the parameter into a Column, or with a similar solution that would solve the issue?
You probably only need to use lit to create a String Column from your input String, and then use to_date to create a Date Column from it.
df.filter(col("dt_adpublished_simple") === date_add(to_date(lit(datestr), format), -8))
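A minimal runnable sketch of that answer, assuming the input string arrives as "yyyy-MM-dd" (the DataFrame contents, sample values, and format are hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_add, lit, to_date}

val spark = SparkSession.builder().appName("date-filter-example").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical: datestr is the parameter passed into the ETL function
val datestr = "2019-03-20"

val df = Seq("2019-03-12", "2019-03-11").toDF("dt_adpublished_simple")

// lit turns the String into a Column, to_date parses it, date_add shifts it back 8 days
val filtered = df.filter(col("dt_adpublished_simple") === date_add(to_date(lit(datestr), "yyyy-MM-dd"), -8))
filtered.show()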
I read from a CSV where the column time contains a timestamp with milliseconds, '1414250523582'.
When I use TimestampType in the schema it returns NULL.
The only way it reads my data is to use StringType.
Now I need this value to be a datetime for further processing.
First I got rid of the too-long timestamp with this:
df2 = df.withColumn("date", col("time")[0:10].cast(IntegerType()))
A schema check says it's an integer now.
Now I try to make it a datetime with
df3 = df2.withColumn("date", datetime.fromtimestamp(col("time")))
it returns
TypeError: an integer is required (got type Column)
When I google, people always just use col("x") to read and transform data, so what am I doing wrong here?
The schema checks are a bit tricky; the data in that column may be pyspark.sql.types.IntegerType, but that is not equivalent to Python's int type. The col function returns a pyspark.sql.column.Column object, which does not play nicely with vanilla Python functions like datetime.fromtimestamp. This explains the TypeError. Even though the "date" data in the actual rows is an integer, col doesn't let you access it as a plain integer to feed into a Python function quite so simply.
To apply arbitrary Python code to that integer value you could write a udf, but in this case pyspark.sql.functions already has a solution for your unix timestamp. Try this, using the truncated seconds column you already built in df2:
from pyspark.sql.functions import col, from_unixtime
df3 = df2.withColumn("date", from_unixtime(col("date")))
and you should see a nice date in 2014 for your example.
Small note: This "date" column will be of StringType.
Say I have a dataframe with two columns, both of which need to be converted to datetime format. However, the current formatting of the columns varies from row to row, and when I apply the to_date method, I get all nulls returned.
Here's a screenshot of the format....
The code I tried is:
date_subset.select(col("InsertDate"),to_date(col("InsertDate")).as("to_date")).show()
which returned all nulls.
Your datetime is not in the default format, so you should specify the format explicitly.
to_date(col("InsertDate"), "MM/dd/yyyy HH:mm")
I don't know which part is the month and which is the day, but you can handle it this way.
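A hedged Scala sketch of the same fix; the sample rows and the "MM/dd/yyyy HH:mm" pattern are assumptions, so adjust the pattern to whatever the screenshot actually shows:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

val spark = SparkSession.builder().appName("to-date-example").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical rows in the assumed "MM/dd/yyyy HH:mm" layout
val date_subset = Seq("05/01/2017 10:30", "12/24/2018 23:59").toDF("InsertDate")

// Without the pattern, to_date returns null for non-default formats; with it, parsing succeeds
date_subset
  .select(col("InsertDate"), to_date(col("InsertDate"), "MM/dd/yyyy HH:mm").as("to_date"))
  .show()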
One column in my CSV file is a date that is read as a string, and it follows this pattern: 2018-09-19 10:27:28.409Z. I am struggling to convert the column from string to date.
The conversion options in Spotfire didn't allow me to change the column type. However, I found the solution: at the moment of importing the data set (file), you need to specify the type (DateTime), and Spotfire magically manages the conversion.
I am processing the data in Spark shell and have a dataframe with a date column. The format of the column is like "2017-05-01 00:00:00.0", but I want to change all the values to "2017-05-01" without the "00:00:00.0".
Thanks!
Just use String.split():
"2017-05-01 00:00:00.0".split(" ")(0)
I'm using the Neo4j database. Neo4j does not have a date data type; it only has a timestamp data type.
I need to compare the current date with an existing date using a Cypher query.
My existing date format is "8/4/2011", which is a string.
So how can I compare them? Is there any way to use the stored procedure [date] at CSV bulk data import time?
I used the APOC stored procedures, but I don't know how to compare with them.
CALL apoc.date.format(timestamp(),"ms","dd.MM.yyyy")
07.07.2016
CALL apoc.date.parse("13.01.1975 19:00","s","dd.MM.yyyy HH:mm")
158871600
I expect something like this:
MATCH(dst:Distributor) WHERE dst.DIST_ID = "111137401" WITH dst CALL apoc.date.parse(dst.ENTRY_DATE,'s', 'dd/MM/yyyy') YIELD d SET dst.ENTRY_DATE = d RETURN dst;
Are there any possibilities? Please help me...
Neo4j's built-in temporal functions can give you epoch milliseconds directly:
RETURN datetime("2018-06-04T10:58:30.007Z").epochMillis
1528109910007
The right query is:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///DST.csv" AS row
CALL apoc.date.parse(toString(row.ENTRY_DATE), "ms", "dd-MMM-yy") YIELD value AS date
CREATE (DST:Distributor {ENTRY_DATE: date})