I have a PySpark script that processes a dataframe with a date-and-time column called startdate, which I convert to a timestamp with the following code:
from pyspark.sql.functions import col, expr, regexp_replace, to_timestamp

# Convert startdate to timestamp
df = df.withColumn('startdate', regexp_replace('startdate', 'T', ' ')) \
       .withColumn('startdate', expr("substring(startdate, 1, length(startdate)-1)")) \
       .withColumn('startdate', to_timestamp(col('startdate'), 'yyyy-MM-dd HH:mm:ss'))
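If the incoming strings always look like 2021-06-01T12:30:00Z, the same conversion could also be done in a single step by quoting the literal T and Z inside the format pattern. A minimal sketch of that alternative (not the code I'm actually running):

from pyspark.sql.functions import col, to_timestamp

# Parse the ISO-style string directly; the quoted 'T' and 'Z' match those literal characters.
df = df.withColumn('startdate', to_timestamp(col('startdate'), "yyyy-MM-dd'T'HH:mm:ss'Z'"))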
I'm creating a type 2 slowly changing dimension, so for every ID I take the start date and "lag" it to create an end date. For this, I run the following code:
from pyspark.sql import Window
from pyspark.sql.functions import col, dense_rank, desc, lag, lit, when

# Get end date
df = df.withColumn("rank",
                   dense_rank().over(Window.partitionBy('id').orderBy(desc("startdate"))))
partition = Window.partitionBy().orderBy(col('id'))
df = df.withColumn("enddate",
                   when(df.rank == 1, lit(None)).otherwise(lag("startdate").over(partition)))
This all works fine: the script writes parquet files to Data Lake Gen2, and when I display the output it looks correct. But the next step, an ADF pipeline that copies the data from the parquet files into an Azure SQL database, fails because it isn't turning the 'undefined' values (the output of the lit(None) part of the script) into NULL in the database.
My question is: what do I need to change in the script above, where I'm currently using lit(None), so that the output can be loaded as a NULL value in the SQL database by my pipeline?
The column in my database is of type datetime2 and is nullable. The startdate column loads fine, but it is never NULL/empty, which is why I've concluded the issue is with the 'empty' values.
The question was asked in error, as something else was blocking my pipeline. But to answer the question of how to produce an "empty" value in parquet files that becomes a NULL value in a SQL database, either of the following can be used:
lit(None)
None
So the following line in my code was actually working:
df = df.withColumn("enddate",when(df.rank == 1,lit(None)).otherwise(lag("startdate").over(partition)))
I'm trying to convert a string value (2022-07-24T07:04:27.5765591Z) into a datetime/timestamp to insert into a SQL table without losing any of the precision. The string I'm providing is actually a datetime, and my source is a CSV in ADLS. I tried the options below in a data flow.
Using Projection: I changed the data type for the specific column to timestamp with format yyyy-MM-dd'T'HH:mm:ss.SSS'Z', but I get NULL in the output.
Using a Derived Column: I tried the expressions below, but I also get NULL in the output:
toTimestamp(DataLakeModified_DateTime,'%Y-%m-%dT%H:%M:%s%z')
toTimestamp(DataLakeModified_DateTime,'yyyy-MM-ddTHH:mm:ss:fffffffK')
toTimestamp(DataLakeModified_DateTime,'yyyy-MM-dd HH:mm:ss.SSS')
I want the same value in the output:
2022-07-24T07:04:27.5765591Z (coming in as a string) to 2022-07-24T07:04:27.5765591Z (as a datetime that the SQL database will accept).
I have tried to repro the issue and I get the same result, i.e. NULL values for the yyyy-MM-dd'T'HH:mm:ss.SSS'Z' timestamp format. The issue is with the string format you are providing in the source: ADF isn't recognizing the given string as a timestamp and therefore returns NULL.
But if you try a slightly different input, e.g. keeping only 3 fractional digits before the Z, it will convert to a timestamp and will not return NULL.
This is what I have tried: one timestamp exactly as in your data, and another with that modification. The first returns NULL and the second returns a datetime.
The exact format you are looking for is still not achievable, though. With the existing source data, yyyy-MM-dd'T'HH:mm:ss would work fine, and that format also loads fine into SQL tables; I have tried it and it works.
Try using toString instead of toTimestamp to produce the value you want:
toString(DataLakeModified_DateTime, 'yyyy-MM-dd HH:mm:ss:SS')
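For comparison, if the same file were being processed in PySpark rather than a mapping data flow, one way to keep as much of the precision as Spark can store (microseconds) is to trim the 7-digit fraction to 6 digits and then parse. This is only a rough sketch, assuming Spark 3.x and reusing DataLakeModified_DateTime as an illustrative column name:

from pyspark.sql.functions import col, regexp_replace, to_timestamp

# Trim the fraction to 6 digits (Spark timestamps have microsecond precision),
# then parse with the literal 'T' and the trailing offset in the pattern.
df = df.withColumn(
    'DataLakeModified_DateTime',
    to_timestamp(
        regexp_replace(col('DataLakeModified_DateTime'), r'(\.\d{6})\d*Z$', '$1Z'),
        "yyyy-MM-dd'T'HH:mm:ss.SSSSSSX",
    ),
)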
How do we convert a string to a date in Cloud Data Fusion?
I have a column with a value like 20191120 (format yyyyMMdd) that I want to load into a BigQuery table as a date. The table column's data type is also DATE.
What I have tried so far is converting the string to a timestamp using "parse-as-simple-date" and then converting it with "format-date" to "yyyy-MM-dd", but that step turns it back into a string and the final load fails. I have even tried explicitly declaring the column as date in the output schema, but it still fails at runtime.
I also tried keeping it as a timestamp in the pipeline and loading it into the BigQuery DATE column.
The error that came up said field dt_1 is incompatible with Avro integer. Is Data Fusion internally converting the extract into Avro before loading? Avro does not have a date data type, which may be what is causing the issue.
Adding answer for posterity:
You can try the following:
Go to the LocalDateTime column in Wrangler
Open the dropdown and click on "Custom Transform"
Type timestamp.toLocalDate() (timestamp being the column name)
After the last step it should be converted into a LocalDate type, which you can write to BigQuery. Hope this helps.
For this specific date format, the Wrangler Transform directive would be:
parse-as-simple-date date_field_dt yyyyMMdd
set-column date_field_dt date_field_dt.toLocalDate()
The second line is required if the destination is of type Date.
Skip empty values:
set-column date_field_dt empty(date_field_dt) ? date_field_dt : date_field_dt.toLocalDate()
References:
https://github.com/data-integrations/wrangler/blob/develop/wrangler-docs/directives/parse-as-simple-date.md
https://github.com/data-integrations/wrangler/blob/develop/wrangler-docs/directives/parse-as-date.md
You could try to parse your input data with Data Fusion using Wrangler.
In order to test it out I have replicated a workflow where a Data Fusion pipeline is fed with data coming from BigQuery. This data is then parsed to the proper type and exported back to BigQuery. Note that the public dataset is austin_311 and I have used the 311_request table, as some of its columns are of TIMESTAMP type.
The steps I have done are the following:
I have queried a public dataset that contained TIMESTAMP data using:
select * from `bigquery-public-data.austin_311.311_request`
limit 1000;
I have uploaded it to Google Cloud Storage.
I have created a new Data Fusion batch pipeline.
I have used Wrangler to parse the CSV data, with the date columns set to the custom 'Simple date' format yyyy-MM-dd HH:mm:ss.
I have exported Pipeline results to BigQuery.
A Qwiklab walkthrough helped me through the steps.
Result:
Following the above procedure I have been able to export Data Fusion data to BigQuery and the DATE fields are exported as TIMESTAMP, as expected.
I am dumping a Postgres table using a copy command outputting to CSV.
The CSV contains timestamps formatted as such: 2011-01-01 12:30:10.123456+00.
I'm reading the CSV like
df = spark.read.csv(
    "s3://path/to/csv",
    inferSchema=True,
    timestampFormat="yyyy-MM-dd HH:mm:ss.SSSSSSX",
    ...
)
but this doesn't work (as expected). The timestampFormat uses java.text.SimpleDateFormat which does not have nanosecond support.
I've tried a lot of variations on the timestampFormat, and they all produce either String columns or misformat the timestamp. Seems like the nanoseconds end up overflowing the seconds and adding time to my timestamp.
I can't apply a schema to the CSV because I don't always know it, and I can't cast the columns because I don't always know which will be timestamps. I also can't cast the timestamp on the way out of Postgres, because I'm just doing select * ....
How can I solve this so I can ingest the CSV with the proper timestamp format?
My first thought was that I just had to modify timestampFormat, but that doesn't seem to be possible. My second thought is to use sed to truncate the timestamps as I dump from Postgres.
I'm using spark 2.3.1.
Thanks for the help!
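If I can identify the timestamp columns up front (for example from the Postgres catalog), one workaround I'm sketching is to read everything as strings, strip the trailing +00 offset, and rely on Spark's built-in string-to-timestamp cast, which seems to keep the microsecond digits even on 2.3. Column names here are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.session.timeZone", "UTC")  # assumes the Postgres dump is all UTC

# Read with every column as a string instead of inferSchema
df = spark.read.csv("s3://path/to/csv", header=True)

# For each column known to hold timestamps (names are illustrative), drop the
# trailing "+00" offset and cast; the cast preserves the microsecond digits.
for ts_col in ["created_at", "updated_at"]:
    df = df.withColumn(ts_col, regexp_replace(col(ts_col), r"\+00$", "").cast("timestamp"))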
I am trying to insert rows like:
2016/02/03,name,12345,34,...
I am trying to copy an S3 file like so:
copy events
from 's3://dailyevents/eventdata/l/'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
CSV
DATEFORMAT AS 'YYYY/MM/DD';
However, I am getting a type mismatch error as it's interpreting 2016/02/03 as three separate values:
Invalid digit, Value 'n', Pos 3, Type: Integer
How can I get it to parse the first column as the date format?
The COPY command also needs the column list in order to parse the column as a date.
copy events
(event_date, event_name,event_id,cost)
from 's3://dailyevents/eventdata/l/'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
CSV
DATEFORMAT AS 'YYYY/MM/DD';
Worked.
I am trying to load a Redshift table from a flat file with multiple date formats, which is causing NULLs to be inserted. My COPY command looks like this:
echo "COPY xxscty.daily_facebook_campaign from '${S3_BUCKET}/Society/20140701_20150315_campaign.csv' credentials as 'aws_access_key_id=${ACCESS_KEY};aws_secret_access_key=${SECRET_KEY}' acceptanydate dateformat 'auto' delimiter',' csv quote as '~' ACCEPTINVCHARS as '~' IGNOREHEADER 1"|psql "$PSQLARGS"
The NULLs seem to be inserted fairly sporadically, with data loading for some rows and not for others of the same date format.
For example, the date column does get loaded with
1/07/2014 (DD/MM/YYYY)
but inserts NULL for
2014-07-13 (YYYY-MM-DD)
You need to use dateformat 'YYYY/MM/DD' instead of 'auto'.
The problem is that if you have different formats in the same file, it won't copy all of the dates.
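If the file genuinely mixes formats, one workaround is to normalize the date column before running COPY. A rough sketch of that idea in PySpark, with illustrative column and path names and the two patterns assumed from the examples above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, col, to_date

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("s3://my-bucket/Society/20140701_20150315_campaign.csv", header=True)

# Try each expected pattern in turn; to_date returns null when a pattern does not match,
# so coalesce keeps whichever parse succeeded.
df = df.withColumn(
    "campaign_date",
    coalesce(
        to_date(col("campaign_date"), "d/MM/yyyy"),
        to_date(col("campaign_date"), "yyyy-MM-dd"),
    ),
)

df.write.mode("overwrite").csv("s3://my-bucket/Society/normalized/", header=True)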