How can I split a column delimited with spaces? - azure-data-factory

I have a CSV delimited with spaces, where the number of spaces can vary between 9, 10 and 11. Is there a way to split the column in two with Azure Data Factory?
Examples:
This is my CSV
I tried using data flows:
but when I execute the data flow, it throws this error:
PS: the CSV has 4,000,000 rows.
I need to solve the problem using Azure Data Factory; the CSV has to end up in my DW.

I have the following data in my sample CSV file, with either 9, 10 or 11 spaces in between the values.
Now, after reading the column, use a derived column transformation to split the required column on 9 literal spaces (the minimum number of spaces).
req1 : split(col1,'         ')[1]
req2 : split(col1,'         ')[2]
This will split the data into an array of 2 elements, where the 1st element has no spaces in its value and the 2nd element has leading spaces (0, 1, or 2 of them, depending on whether the separator was 9, 10, or 11 spaces).
Now apply ltrim on the req2 column. Check the length of this column before and after the transformation to confirm that we are eliminating the leading spaces.
req2 : ltrim(req2)
After doing this, you can check the length of req2; in this sample it would be 1.
Now, select only the required columns and then write them to the required sink. The same split-then-trim logic can be sanity-checked locally, as sketched below.
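Here is a minimal Python sketch of that logic, using made-up sample rows rather than the actual file, just to confirm what the expressions do:

rows = ['A         1', 'B          2', 'C           3']   # 9, 10, and 11 spaces
for row in rows:
    req1, req2 = row.split(' ' * 9, 1)   # split on the minimum separator
    req2 = req2.lstrip()                 # equivalent of ltrim(req2)
    print(req1, req2, len(req2))         # len(req2) is 1 once the spaces are gone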

Related

Is it possible to generate a space-separated header row using the Data Factory copy activity?

I am using Azure SQL as the source dataset and a delimited file as the sink dataset in the copy activity.
I tried the copy activity, but First row as header gives comma-separated headers.
Is there a way to change the header output style?
Please note the spacing is unequal (h3...h4).
In this repro, I tried to give
1 space between 1st and 2nd column,
2 spaces between 2nd and 3rd column,
3 spaces between 3rd and 4th column.
Also, I tried to give the same column name to column2 and column3. The approach is as follows.
Data is copied from the Azure SQL database to the data lake in comma-delimited format as a staging file.
This staging file is taken as a source in Dataflow activity.
In the source dataset, first row as header is not checked.
Data preview of Source transformation:
A derived column transformation is added to change the column names of column2 and column3.
In this case, the 'date_col' value in Column_1 marks the header row. Thus, when Column_1 is 'date_col', the Column_2 and Column_3 data are replaced with the same column name.
column_2 = iif(Column_1=='date_col','ECIX',Column_2);
column_3 = iif(Column_1=='date_col','ECIX',Column_3);
Again, a derived column transformation is added to concat all the columns with spaces. The column name is given as concat. The value for this column, with 1, 2, and 3 spaces between the successive columns as above, is
concat(Column_1,' ',Column_2,'  ',Column_3,'   ',Column_4)
A select transformation is added, and only the concat column is selected here.
In the sink, a new delimited file is added as the sink dataset. In the sink dataset also, first row as header is not checked.
Output file screenshot
After the pipeline is run, the target file looks like this.
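The staging-then-concatenate step itself is simple; as a rough Python sketch (hypothetical file names, and the 1/2/3-space layout from this repro):

import csv
with open('staging.csv', newline='') as src, open('target.txt', 'w') as dst:
    for col1, col2, col3, col4 in csv.reader(src):
        # join with 1, 2, and 3 spaces, mirroring the concat expression above
        dst.write(col1 + ' ' + col2 + '  ' + col3 + '   ' + col4 + '\n')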
Keeping the source as Azure SQL itself in the data flow, I created a single derived column 'OUTDC' and added all the columns from the source like this:
(h1)+' '+(h2)+' '+(h3)
Then fed the OUTDC to a delimited sink and kept the Headers option as a single string like this:
['h1 h2 h3']
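As a rough sketch of this second approach in Python (stand-in rows instead of the Azure SQL source, assuming three columns h1, h2, h3):

header = 'h1 h2 h3'                         # the single-string header
rows = [('a', 'b', 'c'), ('d', 'e', 'f')]   # stand-in for the SQL source
with open('out.txt', 'w') as dst:
    dst.write(header + '\n')
    for h1, h2, h3 in rows:
        dst.write(h1 + ' ' + h2 + ' ' + h3 + '\n')   # the OUTDC concatenation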

Custom Row Delimiter in Azure Data Factory (ADF)

I have a CSV file that terminates a row with Comma and CRLF.
I set my dataset's row delimiter to ",\r\n", but when I ran the pipeline it wouldn't accept this, treating it as multiple values in the delimiter. If I don't put the comma in the dataset row delimiter, the pipeline thinks there is an unnamed header. Is it possible in ADF to have this combination as a delimiter (comma + CRLF), i.e. ",\r\n"?
FirstName,LastName,Occupation,<CRLF Char>
Michael,Jordan,Doctor,<CRLF Char>
Update:
When running the copy activity, I encountered the same problem as you.
Then I selected Line feed (\n) as the Row delimiter at the Source.
Add the column mapping as follows:
When I ran debug, the CSV file was successfully copied into the Azure SQL table.
I created a simple test. Do you just want ADF to read 3 columns?
This is the origin csv file.
In ADF, we can use the default Row delimiter and Column delimiter settings and select First row as header.
We can also select Edit and enter \r\n in the Row delimiter field.
You can import schema here.
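The workaround amounts to treating \n as the row delimiter and discarding the empty column created by the trailing comma. A hedged Python sketch of that logic, with a hypothetical file name:

with open('people.csv', newline='') as f:
    for line in f.read().splitlines():
        if not line:
            continue
        fields = line.split(',')[:-1]   # drop the empty field after the trailing comma
        print(fields)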

Extract columns from a csv file with fields containing delimited values

I am trying to extract certain fields from a CSV file having comma-separated values.
The issue is, some of the fields also contain commas, and the fields are not enclosed within quotes. Given that scenario, how can I extract the fields?
Also, only one of the fields contains commas within its values, and I don't need that one. E.g.: I want to extract the first 2 columns and the last 5 columns from a data set of 8 columns, where the third column contains values with commas.
PS: Instead of downvoting, I would suggest coming forward and posting your own brilliant ideas if you have any.
Solution:
$path = "C:\IE3BW0047A_08112017133859.csv"
# Keep the first two and the last eight comma-split tokens, re-joined with '|'
Get-Content $path | ForEach-Object { "$($_.Split(',')[0,1,-8,-7,-6,-5,-4,-3,-2,-1] -join '|')" } | Set-Content C:\IE3BW0047A_08112017133859_filter.csv
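For comparison, a rough Python equivalent of the same positional trick (same paths and token positions as the PowerShell above):

in_path = r'C:\IE3BW0047A_08112017133859.csv'
out_path = r'C:\IE3BW0047A_08112017133859_filter.csv'
with open(in_path) as src, open(out_path, 'w') as dst:
    for line in src:
        tokens = line.rstrip('\n').split(',')
        keep = tokens[:2] + tokens[-8:]   # first two tokens and the last eight
        dst.write('|'.join(keep) + '\n')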

How to split 2 or more delimited columns in a single row to multiple rows using Talend

I am trying to move data from a CSV file to a DB table. There are 2 delimited columns in the CSV file (separated by ";"). I would like to create a row for each of the delimited values at matching indexes, as shown below. The assumption is that both columns will contain the same number of delimited items.
Example CSV Input:
Labels Values
A;B;C 1;2;3
D 4
F;G 5;6
Expected Output:
Labels Values
A 1
B 2
C 3
D 4
F 5
G 6
How can I achieve this? I have tried using tNormalize, but it only works for a single column. I also tried 2 successive tNormalize components, but as expected this resulted in unwanted combinations.
Thanks
Read your CSV file with a tFileInputDelimited, and
define your schema for the file.
Assuming you are using MySQL, also drop a tMysqlOutput component on your designer to save the parsed rows to the DB. The splitting itself still has to happen between those two components; see the sketch below.
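The core of that step is an index-wise zip of the two split lists; here it is as a hedged Python sketch using the sample rows from the question:

rows = [('A;B;C', '1;2;3'), ('D', '4'), ('F;G', '5;6')]
for labels, values in rows:
    # zip pairs items at matching indexes; both sides are assumed equal length
    for label, value in zip(labels.split(';'), values.split(';')):
        print(label, value)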

Read csv file excluding first column and first line

I have a CSV file containing 8 lines and 1777 columns.
I need to read all the contents in MATLAB, excluding the first line and first column. The first line and first column contain strings, and MATLAB can't parse them.
Do you have any idea?
data = csvread(filepath);
The code above reads all the contents.
As suggested, csvread with a range will read in the numeric data (the exact call is shown at the end of this answer). If you would like to read in the strings as well (which are presumably column headers), you can use readtable:
t = readtable(filepath);
This will create a table with the column headers in your file as variable names of the columns of the table. This way you can keep the strings associated with the data, if need be.
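For completeness, the range form of csvread mentioned above takes zero-based row and column offsets, so skipping the first line and first column looks like this:

data = csvread(filepath, 1, 1);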