Copy command unable to copy Date into Redshift - amazon-redshift

I am trying to insert rows like:
2016/02/03,name,12345,34,...
I am trying to copy an S3 file like so:
copy events
from 's3://dailyevents/eventdata/l/'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
CSV
DATEFORMAT AS 'YYYY/MM/DD';
However, I am getting a type mismatch error, as it's interpreting 2016/02/03 as three separate values:
Invalid digit, Value 'n', Pos 3, Type: Integer
How can I get it to parse the first column as the date format?

The COPY command needs the column list as well for it to parse the column as a date.
copy events
(event_date, event_name, event_id, cost)
from 's3://dailyevents/eventdata/l/'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
CSV
DATEFORMAT AS 'YYYY/MM/DD';
Worked.
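For reference, a minimal sketch of a matching target table, with column types inferred from the sample row in the question (the real table may differ):
create table events (
    event_date date,
    event_name varchar(256),
    event_id   bigint,
    cost       integer
);
With event_date declared as DATE, COPY with DATEFORMAT 'YYYY/MM/DD' knows how to parse values like 2016/02/03 into it.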

Related

data frame date field with null values to SQL database

I have a PySpark script that processes a dataframe containing a date & time column called startdate, which I convert into a timestamp with the following code:
## Convert startdate to timestamp
from pyspark.sql.functions import col, expr, regexp_replace, to_timestamp
df = df.withColumn('startdate', regexp_replace('startdate', 'T', ' ')) \
       .withColumn('startdate', expr("substring(startdate, 1, length(startdate)-1)")) \
       .withColumn('startdate', to_timestamp(col('startdate'), 'yyyy-MM-dd HH:mm:ss'))
I'm creating a type 2 slowly changing dimension, so for every ID, I'm taking the start date and "lagging it" to create an end date. For this, I run the following code:
# Get end date
from pyspark.sql.functions import dense_rank, desc, lag, lit, when
from pyspark.sql.window import Window
df = df.withColumn("rank",
                   dense_rank().over(Window.partitionBy('id').orderBy(desc("startdate"))))
partition = Window().partitionBy().orderBy(col('id'))
df = df.withColumn("enddate", when(df.rank == 1, lit(None)).otherwise(lag("startdate").over(partition)))
This all works fine: the script writes parquet files to Data Lake Gen2, and when I display my output I can see the right values. But the next step, an ADF pipeline that copies the data from the parquet files into an Azure SQL database, fails because it isn't turning the 'undefined' values (the output of the lit(None) part of the script) into NULL in the database.
My question is: what do I need to do in my script above, where I'm currently using lit(None), so that the output can be turned into a NULL value in the SQL database by my pipeline?
The column in my database is of datetime2 type, and is nullable. The startdate column is working fine, but it will never be NULL / empty, which is why I've concluded the issue here is with the 'empty' values.
The question was asked in error, as something else was obstructing my pipeline. But to answer the question of how to produce an "empty" value in parquet files that becomes a NULL value in a SQL database, either of the following can be used:
lit(None)
None
So the following line in my code was actually working:
df = df.withColumn("enddate",when(df.rank == 1,lit(None)).otherwise(lag("startdate").over(partition)))

Problem loading an "interval" from a CSV file

I am trying to create a table from a very big CSV file. Among the data, I have a column named 'bouwjaar', which means construction year, and I selected 'date' as its data type. That gave me an error, so I changed the type to an interval, but that won't work either. It gives me the following error. What should I select as the data type?
ERROR: interval field value out of range: "1971-1980"
CONTEXT: COPY fundadata, line 24, column bouwjaar : "1971-1980"
An interval in PostgreSQL is not something with a starting point and an end point, but a duration like "9 years".
A more appropriate data type for that would be daterange, but the values would have to look like [1971-01-01,1981-01-01). You either have to pre-process the file before loading, or you have to load the data into a text column and post-process it.
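For the post-processing option, a minimal sketch in SQL, assuming the raw text was loaded into a column named bouwjaar_raw (the column name is an assumption, not from the question):
alter table fundadata add column bouwjaar daterange;
update fundadata
   set bouwjaar = daterange(
       make_date(split_part(bouwjaar_raw, '-', 1)::int, 1, 1),
       make_date(split_part(bouwjaar_raw, '-', 2)::int + 1, 1, 1),
       '[)'
   );
This turns "1971-1980" into [1971-01-01,1981-01-01), matching the shape described above.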

PostgreSQL, trying to copy column of dates from csv file to table's column

I have a PostgreSQL table that contains an empty column of type 'date'.
I'm trying to copy date values into it from a CSV file, but it raises this:
COPY books (publication_date) FROM 'path/to/file/pub.csv' CSV;
ERROR: date/time field value out of range: "11/31/2000"
This value is at index 8178 of the CSV, so it's not the entire file that's faulty.
I don't understand why, as the date seems perfectly fine.
So, how can I fix this or make Postgres ignore the faulty dates?
ERROR: date/time field value out of range: "11/31/2000"
I don't understand why, as the date seems perfectly fine.
Well, November has only 30 days, so the date is indeed invalid.
You need to set the datestyle to the required format.
https://www.postgresql.org/docs/7.2/sql-set.html
https://www.postgresql.org/docs/9.1/runtime-config-client.html#GUC-DATESTYLE
SET datestyle = DMY;
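Put together, a minimal sketch of the suggested approach in one session (path and column taken from the question); note that "11/31/2000" itself is not a valid date under any datestyle, so that row still has to be corrected in the file:
SET datestyle = DMY;
COPY books (publication_date) FROM 'path/to/file/pub.csv' CSV;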

Convert to date in cloud datafusion

How do we convert a string to a date in Cloud Data Fusion?
I have a column with a value such as 20191120 (format yyyyMMdd) that I want to load into a BigQuery table as a date. The table column's data type is also date.
What I have tried so far: I converted the string to a timestamp using "parse-as-simple-date", then tried to convert it with "format-date" to "yyyy-MM-dd", but that step turns it back into a string and the final load fails. I have even tried to explicitly declare the column as date in the output schema, but it fails at runtime.
I also tried keeping it as a timestamp in the pipeline and loading it into the BigQuery date type.
The error that came up said field dt_1 is incompatible with Avro integer. Is Data Fusion internally converting the extract into Avro before loading? Avro does not have a date data type, which may be what is causing the issue.
Adding answer for posterity:
You can try the following:
Go to the LocalDateTime column in Wrangler
Open the dropdown and click on "Custom Transform"
Type timestamp.toLocalDate() (timestamp being the column name)
After the last step it should be converted into a LocalDate type, which you can write to BigQuery. Hope this helps.
For this specific date format, the Wrangler Transform directive would be:
parse-as-simple-date date_field_dt yyyyMMdd
set-column date_field_dt date_field_dt.toLocalDate()
The second line is required if the destination is of type Date.
Skip empty values:
set-column date_field_dt empty(date_field_dt) ? date_field_dt : date_field_dt.toLocalDate()
References:
https://github.com/data-integrations/wrangler/blob/develop/wrangler-docs/directives/parse-as-simple-date.md
https://github.com/data-integrations/wrangler/blob/develop/wrangler-docs/directives/parse-as-date.md
You could try to parse your input data with Data Fusion using Wrangler.
In order to test it out I have replicated a workflow where a Data Fusion pipeline is fed with data coming from BigQuery. This data is then parsed to the proper type and then exported back to BigQuery. Note that the public dataset is "austin_311" and I have used the '311_request' table, as some of its columns are of TIMESTAMP type.
The steps I have done are the following:
I have queried a public dataset that contained TIMESTAMP data using:
select * from `bigquery-public-data.austin_311.311_request`
limit 1000;
I have uploaded it to Google Cloud Storage.
I have created a new Data Fusion batch pipeline following this.
I have used Wrangler to parse the CSV data, with a custom 'Simple Date' format of yyyy-MM-dd HH:mm:ss (see the directive sketch after these steps).
I have exported Pipeline results to BigQuery.
This qwiklab has helped me through the steps.
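For reference, the parsing step above corresponds to a Wrangler directive along these lines (timestamp_column is a placeholder for whichever TIMESTAMP column is being parsed):
parse-as-simple-date timestamp_column yyyy-MM-dd HH:mm:ss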
Result:
Following the above procedure I have been able to export Data Fusion data to BigQuery and the DATE fields are exported as TIMESTAMP, as expected.

parse int to date (tMap) Talend PostgreSQL

My Talend job maps a CSV file onto a PostgreSQL table.
I need to insert a date column which, in the CSV file, is either in the normal yyyyMMdd format or is 0/99999999. If the date equals 0 or 99999999 it must be mapped to a null value in the database; otherwise the data must be loaded as a timestamp of format yyyy-MM-dd HH:mm:ss.
In the CSV file I declared the date as an int, so in the tMap I must parse the int to a datetime and load 0/99999999 as null.
Any help please.
If I understand the problem correctly, its solution is as follows:
// corresponding tMap expression to convert the string, mapping the special 0/99999999 values to null:
(row1.dateAsString.equals("0") || row1.dateAsString.equals("99999999")) ? null : TalendDate.parseDate("yyyyMMdd", row1.dateAsString)
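Since the question says the column is declared as an int in the CSV schema, a variant of the same expression starting from an Integer column might look like this (row1.dateAsInt is a hypothetical column name; the null check guards against empty CSV fields):
(row1.dateAsInt == null || row1.dateAsInt == 0 || row1.dateAsInt == 99999999) ? null : TalendDate.parseDate("yyyyMMdd", String.valueOf(row1.dateAsInt))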