I am unable to apply certain methods to Spark DataFrames that have pre-epoch timestamps.
For example:
from pyspark.sql.functions import to_timestamp

df = ss.createDataFrame([('1969-12-31 18:59:59',)], ['col_timestamp'])
df.select(to_timestamp(df.col_timestamp, 'yyyy-MM-dd HH:mm:ss')).toPandas()
gives this error:
File "C:\venv37\lib\site-packages\pyspark\sql\types.py", line 201, in fromInternal
return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
OSError: [Errno 22] Invalid argument
Does anyone know a workaround?
Try with the new parser:
from pyspark.sql.functions import to_timestamp
df = spark.createDataFrame([('1969-12-31 18:59:59',)], ['col_timestamp'])
df.select(to_timestamp(df.col_timestamp, 'y-M-d H:m:s')).toPandas()
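The OSError itself comes from datetime.datetime.fromtimestamp, which on Windows rejects pre-epoch (negative) values during the row-by-row toPandas conversion. Another workaround that may help, sketched here under the assumption that you're on Spark 3.x with pyarrow installed, is to let Arrow handle the conversion instead of that row-by-row path:

from pyspark.sql.functions import to_timestamp

# Assumption: Spark 3.x with pyarrow available; Arrow-based conversion
# avoids the fromInternal path that calls datetime.fromtimestamp.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.createDataFrame([('1969-12-31 18:59:59',)], ['col_timestamp'])
pdf = df.select(to_timestamp(df.col_timestamp, 'yyyy-MM-dd HH:mm:ss')).toPandas()
print(pdf)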
I want to generate a DataFrame with dates using PySpark's sequence() function (not looking for workarounds using other methods). I got this working with the default step of 1, but how do I generate a sequence with dates, say, one week apart? I can't figure out what type/value to feed into the step parameter of the function.
from datetime import date
from pyspark.sql.functions import explode, sequence, to_date, lit

df = (spark.createDataFrame([{'date': 1}])
      .select(explode(sequence(to_date(lit('2021-01-01')), to_date(lit(date.today())))).alias('calendar_date')))
df.show()
You have to use an INTERVAL literal. Building on your code:
from datetime import date
from pyspark.sql.functions import explode, sequence, to_date, lit, expr

df = (
    spark
    .createDataFrame([{'date': 1}])
    .select(
        explode(sequence(
            to_date(lit('2021-01-01')),  # start
            to_date(lit(date.today())),  # stop
            expr("INTERVAL 1 WEEK")      # step
        )).alias('calendar_date')
    )
)
df.show()
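For reference, the first few rows of the output look like this (later rows continue weekly up to today's date):

+-------------+
|calendar_date|
+-------------+
|   2021-01-01|
|   2021-01-08|
|   2021-01-15|
+-------------+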
https://spark.apache.org/docs/latest/sql-ref-literals.html#interval-literal
I keep getting an error saying that I pass too many arguments, and I'm not sure why, as I am following the exact examples:
command-629173529675356:9: error: too many arguments for method apply: (colName: String)org.apache.spark.sql.Column in class Dataset
val df_2 = date_format.withColumn("day_of_week", date_format(col("date"), "EEEE"))
My code:
val date_format = df_filter.withColumn("date", to_date(col("pickup_datetime")))
val df_2 = date_format.withColumn("day_of_week", date_format(col("date"), "EEEE"))
Thank you for the help!
dayofweek is the function you're looking for (note that it returns an integer, 1 = Sunday through 7 = Saturday, rather than the day name), so something like this:
import org.apache.spark.sql.functions.dayofweek
date_format.withColumn("day_of_week", dayofweek(col("date")))
You get the error because you named your first dataframe date_format, which is the same name as the Spark built-in function you want to use. So when you call date_format, you're retrieving your dataframe instead of the built-in function.
To solve this, you should either rename your first dataframe:
val df_1 = df_filter.withColumn("date", to_date(col("pickup_datetime")))
val df_2 = df_1.withColumn("day_of_week", date_format(col("date"), "EEEE"))
Or ensure that you're calling the right date_format by importing the functions object and then calling functions.date_format when extracting the day of week:
import org.apache.spark.sql.functions
val date_format = df_filter.withColumn("date", to_date(col("pickup_datetime")))
val df_2 = date_format.withColumn("day_of_week", functions.date_format(col("date"), "EEEE"))
I want to insert the string 2021-02-12 16:16:43 from Databricks into a Snowflake timestamp column. This is what I tried:
1. I used the to_timestamp function in Databricks to convert from string to timestamp, but this function gives a timestamp format which Snowflake doesn't recognize.
.withColumn('date_test',to_timestamp("date_test", 'yyyy-MM-dd HH:mm:ss'))
Output format: 2021-02-12T16:16:43.000+0000
Error trace:
Py4JJavaError: An error occurred while calling o5318.save.
: net.snowflake.client.jdbc.SnowflakeSQLException: Timestamp '40' is not recognized
File '3Xko6AuNme/19.CSV.gz', line 1, character 66
Row 1, column "TEST_READY_STAGING_862888178"["DATE_TEST":5]
If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.
2. Then I tried date_format, which gives the correct format, but the type is string and Snowflake complains about it again:
.withColumn('date_test',date_format(to_timestamp("date_test", 'yyyy-MM-dd HH:mm:ss'), 'yyyy-MM-dd HH:mm:ss'))
Error trace:
: net.snowflake.client.jdbc.SnowflakeSQLException: Timestamp '40' is not recognized
File 'gSOm5eLHFZ/22.CSV.gz', line 1, character 86
Row 1, column "TEST_READY_STAGING_25342852"["DATE_TEST":5]
If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option.
3. I tried converting the string to a timestamp using a UDF from this topic: pyspark to_timestamp does not include milliseconds. But it just doesn't convert to a timestamp:
import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType

def _to_timestamp(s):
    return datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S')

udf_to_timestamp = udf(_to_timestamp, TimestampType())

# Note: this transformed frame is only shown; it is never assigned back,
# so df_source_data itself still holds the string column.
df_source_data.select('PURCHASE_DATE_TIME').withColumn("PURCHASE_DATE_TIME", udf_to_timestamp("PURCHASE_DATE_TIME")).show(1, False)
display(df_source_data)
df_source_data.printSchema()
Output:
+------------------------+
|PURCHASE_EVENT_DATE_TIME|
+------------------------+
|2021-02-12 16:16:43 |
+------------------------+
only showing top 1 row
root
|-- PURCHASE_EVENT_DATE_TIME: string (nullable = true)
Does anybody have any advice on how to push this string from Databricks into a Snowflake timestamp?
@alterego your format yyyy-MM-dd HH:mm:ss should be yyyy-MM-dd HH:mi:ss. In Snowflake's format tokens, mi means minutes, while mm means month.
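Building on that, here is a hedged sketch of one way to push the value through: keep the column as a real TimestampType and tell Snowflake how to parse the staged value via a session-level preaction. The sf_options dict and the TEST_READY table name are placeholders, and preactions is a standard spark-snowflake connector option; adjust for your setup.

from pyspark.sql.functions import to_timestamp

(df_source_data
    .withColumn("PURCHASE_DATE_TIME",
                to_timestamp("PURCHASE_DATE_TIME", "yyyy-MM-dd HH:mm:ss"))
    .write
    .format("snowflake")                 # or net.snowflake.spark.snowflake
    .options(**sf_options)               # placeholder: your Snowflake connection options
    .option("dbtable", "TEST_READY")     # placeholder: your target table
    .option("preactions",
            "ALTER SESSION SET TIMESTAMP_INPUT_FORMAT = 'YYYY-MM-DD HH24:MI:SS'")
    .mode("append")
    .save())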
Inferring the schema of a Spark DataFrame throws an error if the CSV file has a column with special characters.
Test sample
foo.csv
id,comment
1, #Hi
2, Hello
spark = SparkSession.builder.appName("footest").getOrCreate()
df = spark.read.load("foo.csv", format="csv", inferSchema="true", header="true")
print(df.dtypes)
which fails with:
raise ValueError("Could not parse datatype: %s" % json_value)
I found a comment from Dat Tran on inferSchema in the spark-csv package about how to resolve this... but can't we still infer the schema before cleaning the data?
Use it like this:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').enableHiveSupport().getOrCreate()
df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("foo.csv")
print(df.dtypes)
Output:
[('id', 'int'), ('comment', 'string')]
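If inference still trips over messy values before you get a chance to clean the data, a minimal fallback (assuming the two-column file above) is to declare the schema explicitly so nothing needs to be inferred:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("comment", StringType(), True),
])

df = spark.read.format("csv").option("header", "true").schema(schema).load("foo.csv")
print(df.dtypes)  # [('id', 'int'), ('comment', 'string')]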
I am trying to save a DataFrame that contains a timestamp column to a CSV file.
The problem is that the column's format changes once written to the CSV file. Here is the code I used:
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header", true).option("inferSchema", "true").csv("C:/Users/mhattabi/Desktop/dataTest2.csv")
//val df = spark.read.option("header",true).option("inferSchema", "true").csv("C:\\dataSet.csv\\datasetTest.csv")
//convert all columns to numeric values in order to apply aggregation functions
df.columns.map { c => df.withColumn(c, col(c).cast("int")) }
//add a new column including the new timestamp column
val result2 = df.withColumn("new_time", ((unix_timestamp(col("time")) / 300).cast("long") * 300).cast("timestamp")).drop("time")
val finalresult = result2.groupBy("new_time").agg(result2.drop("new_time").columns.map((_ -> "mean")).toMap).sort("new_time") // agg(avg(...)) over all remaining columns
finalresult.coalesce(1).write.option("header", true).option("inferSchema", "true").csv("C:/mydata.csv")
When displayed via df.show it shows the correct format, but in the CSV file the timestamps are written in a different format.
Use the timestampFormat write option to format the timestamp the way you need (dateFormat only applies to date columns, and inferSchema is a read option, so neither belongs here):
finalresult.coalesce(1).write.option("header", true).option("timestampFormat", "yyyy-MM-dd HH:mm:ss").csv("C:/mydata.csv")
or
finalresult.coalesce(1).write.format("csv").option("delimiter", "\t").option("header", true).option("timestampFormat", "yyyy-MM-dd HH:mm:ss").option("escape", "\\").save("C:/mydata.csv")
Here is the code snippet that worked for me to modify the CSV output format for timestamps.
I needed a 'T' character in there, and no seconds or microseconds. The timestampFormat option did work for this.
DF.write
  .mode(SaveMode.Overwrite)
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm")
  .csv("C:/mydata.csv")
Such as 2017-02-20T06:53
If you substitute a space for 'T' then you get this:
DF.write
  .mode(SaveMode.Overwrite)
  .option("timestampFormat", "yyyy-MM-dd HH:mm")
  .csv("C:/mydata.csv")
Such as 2017-02-20 06:53