How to check date format in Azure Data Factory

I am creating a pipeline where the source is CSV files and the sink is SQL Server.
The date column in the CSV files may have values like
12/31/2020
10162018
20201017
31/12/1982
1982/12/31
I cannot find a function that checks the format of a date. How do I check the format and convert the above values to yyyy-MM-dd?

The solution was given by HimanshuSinha-msft:
Solved the issue using the expression builder in a Derived Column in a Mapping Data Flow.
coalesce(toDate(Somedate,'MM/dd/yyyy'),toDate(Somedate,'yyyy/MM/dd'),toDate(Somedate,'dd/MM/yyyy'),toDate(Somedate,'MMddyyyy'),toDate(Somedate,'yyyyddMM'),toDate(Somedate,'yyyyMMdd'))

This coalesce answer will not actually solve the problem; it just gets rid of the errors. There are plenty of dates that are valid in multiple formats, for example "2/1/2020" (MM/dd/yyyy) and "1/2/2020" (dd/MM/yyyy). The errors disappear, but your analyses downstream will be very incorrect.
You need to do an aggregate analysis of which date format best fits the incoming stream, and then route the logic to the respective separate pipeline branches.
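If the data lands in a SQL Server staging table first (as the answer below suggests), one hedged sketch of that aggregate analysis in T-SQL is to count how many rows parse cleanly under each candidate format and route on the winner. StagingTable and Somedate are assumed names; styles 101 and 103 are SQL Server's MM/dd/yyyy and dd/MM/yyyy.
-- COUNT skips NULLs and TRY_CONVERT returns NULL on a failed parse,
-- so each column counts the rows that parse under that format
SELECT
  COUNT(TRY_CONVERT(date, Somedate, 101)) AS mdy_matches, -- MM/dd/yyyy
  COUNT(TRY_CONVERT(date, Somedate, 103)) AS dmy_matches, -- dd/MM/yyyy
  COUNT(*) AS total_rows
FROM dbo.StagingTable;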

You can configure this in the Mapping tab of your copy activity. The datetime format can be specified, but it only supports one format type; if you have a mix of formats like in your example, it will not work.
One option would be to ingest the column into a staging table as an nvarchar. Then, in another copy activity, use a custom SELECT statement to detect the column format and cast the date as needed. You should be able to do this with a CASE expression in your SELECT from the staging table, as in the sketch below.
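A minimal sketch of that CASE approach, assuming the question's sample formats and hypothetical StagingTable/Somedate names. TRY_CONVERT returns NULL rather than erroring when a parse fails, and genuinely ambiguous slashed values such as 01/02/2020 still need a business rule to pick month-first or day-first.
SELECT
  Somedate,
  CASE
    -- four-digit year with slashes: 1982/12/31
    WHEN Somedate LIKE '[12][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9]'
      THEN TRY_CONVERT(date, Somedate, 111) -- yyyy/MM/dd
    -- eight digits starting with a plausible year: 20201017
    WHEN Somedate LIKE '[12][0-9][0-9][0-9][01][0-9][0-3][0-9]'
      THEN TRY_CONVERT(date, Somedate, 112) -- yyyyMMdd
    -- eight digits starting with a month: 10162018, rearranged to yyyyMMdd
    WHEN Somedate LIKE '[01][0-9][0-3][0-9][12][0-9][0-9][0-9]'
      THEN TRY_CONVERT(date, SUBSTRING(Somedate, 5, 4) + SUBSTRING(Somedate, 1, 2) + SUBSTRING(Somedate, 3, 2), 112)
    -- slashed month-first or day-first: try MM/dd/yyyy, then dd/MM/yyyy
    WHEN Somedate LIKE '[0-9][0-9]/[0-9][0-9]/[12][0-9][0-9][0-9]'
      THEN COALESCE(TRY_CONVERT(date, Somedate, 101), TRY_CONVERT(date, Somedate, 103))
  END AS CleanDate
FROM dbo.StagingTable;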
FYI: data type mapping
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping#data-type-mapping

Related

Convert string timestamp in DataStage to Timestamp in DB2

I am working on a simple ETL job in DataStage: Source ---> Transformer ---> Destination.
The source is a CSV file and the destination is a DB2 database. The problem is that the CSV file contains a string timestamp that I need to load into the DB2 stage. (Screenshots of the source data, the table-creation script, and the transformer configuration were attached to the original post.) When I run the job, this error appears, which translates to English as:
update_or_insert, 3: Unhandled conversion error in the "SEC_DAT_DATE_INSERT" zone from the source
type "timestamp" to the target type "timestamp [microseconds]":
source value = "*****************". The result does not accept a NULL value and there is no
handle_null to specify a default value
I don't know what it means. If anyone could help, that would be nice. Thanks.
First off, verify how Excel has handled the timestamp. Change the display format so that it conforms to ISO 8601, namely YYYY-MM-DD HH:NN:SS format, before you export it to CSV. Check the CSV file using a text editor (Notepad or Wordpad) to confirm that the timestamp format is correct.
Then change the StringToTimestamp() function so that it reflects the new format (or leave out the format entirely if this is your default format).
Note that the Else part of your expression uses a string. Perhaps you need to wrap that in a StringToTimestamp() function as well.
I suggest you check whether you have marked that column as a key in the source (this can happen by mistake); if so, deselect the key checkbox. Also check whether Nullable is set to YES for that column in the source; if not, try running with Nullable set to YES. Hope this helps.

In a data flow task, how do I restrict rows flowing using a value from another source?

I have an excel sheet with many tabs. Say one is called wsMain and the other is called wsDate.
In my data flow transformation I am able to successfully load the data from wsMain to my table.
Now I have to update this transformation: I have to fetch the maximum date from the worksheet wsDate and only load data from wsMain where the date is less than or equal to the maximum date in wsDate (that is the only column available).
So far I have figured out that I need to create a new Excel connection manager to read the data from wsDate, and I have used the Aggregate transformation to get the maximum date.
Now the question is how do I use this date to restrict the rows coming from wsMain?
I understand from the link below that you can store the value in a variable, but what do I do next?
SSIS set result set from data flow to variable
I have tried using a merge join, but I am not sure if I am doing it right. (A screenshot of the current data flow was attached to the original post.)
I could not achieve the above but would be interested to know if that is possible. As a workaround, I have created a separate data flow where I have stored the value in a variable and then used the variable in the Conditional Split to filter the required rows. (A screenshot was attached to the original post.)
Here is a step-by-step guide I followed to write the variable:
https://www.proteanit.com/2008/12/11/ssis-writing-to-a-package-variable-in-a-dataflow/
You can obtain the maximum value of the wsDate column first, then use this as a filter to avoid introducing unnecessary records into the data flow that would be discarded by the Conditional Split. An overview of this process is below. I'd also recommend confirming the data types for all columns involved.
Create an SSIS DateTime variable and name this something descriptive such as MaxDate.
Create a Data Flow Task before the current one with an Excel Source component. Use the SQL command option for the Data Access Mode and enter a SQL statement to return the max value of the wsDate column. In the following example, ExcelSource is the name of the sheet that you're pulling from. I'd suggest confirming the query with the Preview button on the Excel Source as well.
Add a Script Component (not Task) after the Excel Source. Add the MaxDate variable in the ReadWriteVariables field on the main page of the Script Component. On the Inputs and Outputs pane, add the output column from the Excel Source as an Input Column with the ReadOnly usage type. Example C# code for this is below. Note that variables can only be written to in the PostExecute method. The Input0_ProcessInputRow method is called once for each row that passes through; however, there will only be the single row in this case. In the following code, MaxExcelDate is the name of the output column from the Excel Source.
On the Excel Source component in the Data Flow Task where the records are imported from Excel, change the Data Access Mode to SQL command and enter a SQL statement to return records that have a date less than or equal to the maximum wsDate value. This is the last example, and the ? is a placeholder for the parameter. After entering this SQL, click the Parameters button and select Parameter0 for the Parameters field, the MaxDate variable for the Variables field, and a direction of Input. The Conditional Split can then be removed, since these records will now be filtered out.
Excel MAX wsDate SELECT:
SELECT MAX(wsDate) AS MaxExcelDate FROM ExcelSource
C# Script Component:
// Holds the value read from the single row that flows through this component.
DateTime maxDate;

public override void PostExecute()
{
    base.PostExecute();
    // Package variables can only be written to in PostExecute.
    Variables.MaxDate = maxDate;
}

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Called once per row; only the single MAX row arrives here.
    maxDate = Row.MaxExcelDate;
}
Excel Command with Date Filter:
SELECT
Column1,
Column2,
Column3
FROM ExcelSheet
WHERE DateColumn <= ?
Yes, it is possible. In the data flow, you will need to determine the max date, which you already have. Next, you will need to MERGE JOIN the two data flows on the date column. From there, you will feed it into a CONDITIONAL SPLIT and split where the date columns match [i.e., !ISNULL()] versus do not match [i.e., ISNULL()]. In your case, you only want the matches. The non-matches will be disregarded.
Note: if you use an INNER JOIN on the MERGE JOIN where there is only one date (i.e., MaxDate) to join on, then this will take care of the row filtering for you. You will not need a CONDITIONAL SPLIT.
Welcome to ETL.
Update
It is a real pain that SSIS's MERGE JOINs only perform joins on EQUAL operations as opposed to LESS THAN and GREATER THAN operations. You will need to separate the data flows.
Use a Script Component to scan the Excel file for the MAX date and assign that value to a package variable in SSIS. Alternatively, you can have a dates table in SQL Server and then use an Execute SQL Task in SSIS to retrieve the MAX date from the table and assign that value to a package variable (a one-line query sketch follows after these notes).
Modify your existing data flow to remove the reading of the Excel date file completely. Then add a DERIVED COLUMN transformation and add a new column that is mapped to the package variable in SSIS that stores the MAX date. You can set the Derived Column Name to 'MaxDate'.
Add a conditional split transformation with the following CONDITION logic: [AsOfDt] <= [MaxDate]
Set the Output Name to Insert Records
Note: The CONDITIONAL SPLIT creates a new output data flow with restricted/filtered rows. It does not create a new column within the existing data flow. Think of this as a transposition of data flow output from column modification to row modification. Only those rows that match the condition will be sent to the output that you desire. I assume you only want to Insert these records, so I named it that. You can choose whatever naming convention you prefer.
Note 2: Sorry for not making the Update my original answer - I haven't used the AGGREGATE transformation before so I was not aware that it restricts row output as opposed to reading a value in the data flow and then assigning it to a variable. That would be a terrific transformation for Microsoft to add to SSIS. It appears that the ROWCOUNT and SCRIPT COMPONENT transformations are the only ones that have the ability to set a package variable value within the data flow.
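For the Execute SQL Task route in the first step above, the query itself is a one-liner. dbo.DatesTable and DateValue are hypothetical names; map the single scalar result to the MaxDate package variable.
-- single-row result; assign the value to the MaxDate package variable
SELECT MAX(DateValue) AS MaxDate FROM dbo.DatesTable;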

Date in table is dd.mm.yyyy - can't import to Postgres via CSV

I'm trying to add a .csv to a table in a database.
All dates in the .csv are in the format dd.mm.yyyy (18.10.2017).
I'm importing via pgadmin and always get an invalid input error.
I've tried to use almost all date formatting options for the column but without any luck.
I would rather not change the csv manually.
Can anyone help me with this?
I almost always import data into a staging table where all the columns are strings.
Then I use queries to load the final table.
This has several advantages:
It gives me much more control over how the data is transformed.
It makes it easier to debug problems -- the entire staging table can be queried to find all rows with a particular issue (for instance).
Additional validations can be performed before loading into the final table.
This is just a suggestion, but you might find that overall this takes less time.
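A minimal sketch of that staging approach in psql, assuming hypothetical staging_events/events table names and a single event_date column; to_date does the dd.mm.yyyy conversion while loading the final table.
-- stage the column as plain text so the import cannot fail on format
CREATE TABLE staging_events (event_date text);
\copy staging_events FROM 'data.csv' WITH (FORMAT csv, HEADER true)
-- convert while loading the real table; bad rows can be inspected in staging first
INSERT INTO events (event_date)
SELECT to_date(event_date, 'DD.MM.YYYY')
FROM staging_events;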
The DateStyle setting is probably set to MDY. You can check this by running:
show datestyle;
Although dd.mm.yyyy isn't listed as a standard input format, if you expect it to work, you will need the DateStyle to line up with the ordering here (DMY).
The date/time style can be selected by the user using the SET datestyle command, the DateStyle parameter in the postgresql.conf configuration file, or the PGDATESTYLE environment variable on the server or client.
See section "Date Order Conventions":
https://www.postgresql.org/docs/current/static/datatype-datetime.html
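A quick way to verify the idea in a psql session (the commented output is what I'd expect, not taken from the original post):
show datestyle;              -- e.g. "ISO, MDY" -- the likely culprit
SET datestyle = 'ISO, DMY';  -- make day-first input the convention
SELECT '18.10.2017'::date;   -- now parses as 2017-10-18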

Data Type Cast Won't Stick in SSIS

I'm trying to automate a process with SSIS that exports data into a flat file (.csv) that is then saved to a directory, where it will be scanned and imported by some accounting software. The software (unfortunately) only recognizes dates in MM/DD/YYYY fashion. I have tried every which way to cast or convert the data pulled from SQL to MM/DD/YYYY, but somehow the data is always recognized as either a DT_DATE or DT_DBDATE data type in the flat file connection, and saved down as YYYY-MM-DD.
I've tried various combinations of data conversions, derived columns, and changing the properties of the flat file columns to string in hopes that I can at least use substring operations to get this formatted correctly, but it never fails to save down as YYYY-MM-DD. It is truly baffling. The preview in the OLE DB source will show the dates as "MM/DD/YYYY", but somehow it always changes to "YYYY-MM-DD" when it hits the flat file.
I've tried to look up solutions (for example, here: Stubborn column data type in SSIS flat file connection manager won't change. :() but with no luck. Amazingly, if I merely open the file in Excel and save it, it will then show dates in a text editor as "MM/DD/YYYY", only adding more mystery to this Bermuda Triangle-esque caper.
If there are any tips, I would be very appreciative.
This is a date formatting issue.
In SQL and in SSIS, dates have one literal string format, and that is YYYY-MM-DD. Ignore the way they appear to you in the data previewer and/or Excel. Dates are displayed to you based upon your Windows regional preferences.
For example, unlike folks in the US, folks in the UK will see all dates as DD/MM/YYYY. The way we are shown dates is NOT the way they are stored on disk. When you open the file in Excel, it does this conversion as a favor. It's not until you SAVE that the dates are stored, as text, according to your regional preferences.
In order to get dates to always display the same way, we need to save them not as dates but as strings of text. To do this, we have to get the data out of a date column (DT_DATE or DT_DBDATE) and into a string column (DT_STR or DT_WSTR), then map this new string column into your csv file. There are two ways to do this date-to-string conversion...
First, have SQL do it. Update your OLE DB Source query and add one more column...
SELECT
*,
CONVERT(VARCHAR(10), MyDateColumn, 101) AS MyFormattedDateColumn
FROM MyTable
The other way is to let SSIS do it. Add a Derived Column component with the expression
SUBSTRING([MyDateColumn],6,2) + "/" + SUBSTRING([MyDateColumn],9,2) + "/" + SUBSTRING([MyDateColumn],1,4)
Map the string columns into your csv file, NOT the date columns. Hope this helps.
It's been a while but I just came across this today because I had the same issue and hope to be able to spare someone the trouble of figuring it out. What worked for me was adding a new field in the Derived Column transform rather than trying to change the existing field.
Edit
I can't comment on Troy Witthoeft's answer, but wanted to note that if you have a Date type input, you wouldn't be able to do SUBSTRING. Instead, you could use something like this:
(DT_WSTR,255)(MONTH([Visit Date])) + "/" + (DT_WSTR,255)(DAY([Visit Date])) + "/" + (DT_WSTR,255)(YEAR([Visit Date]))
Note that MONTH() and DAY() return unpadded numbers, so February 1st comes out as 2/1/2020 rather than 02/01/2020; if the target software insists on two digits, pad each part with something like RIGHT("0" + (DT_WSTR,2)MONTH([Visit Date]), 2).

Talend: How to convert a String to a Date for SAP integration using tMap

1) I need to pass a date (billing) to the RFC, but I am not sure how to map it using tMap. (Screenshots were attached to the original post.)
2) I need to run this job daily (M-F), and I am not sure how to automate the date input.
3) For the date input, I thought of using a Joblet, but I can't find it in Talend. Most screenshots show Joblets in the same window as Job Designs and Metadata, but I don't have it there.
As you might have guessed, I am very new to Talend.
Use a tMap, and inside it use the function TalendDate.parseDate("yyyy-MM-dd", sap_data.date) in the expression field where you want the output. Also note that the output type must be Date. The Date Pattern in the type definition (at the bottom of the tMap) is irrelevant.