Using the Data Set Stage to read the file as a single record - datastage

I have a DataStage Data Set file and the Data Set has 120 columns. I'm trying to use the Data Set Stage to read the records into one field. Is this possible? Thanks in advance for any help.

This is not possible. A Data Set is constrained by the metadata sitting in its descriptor file. You could, of course, use a Column Export stage immediately following the Data Set stage, to convert the 120 columns into a single delimited string.
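For illustration only (plain Python, not DataStage code; the column names, values, and pipe delimiter are assumptions), this is the effect the Column Export stage has on each record:

```python
# Illustration (not DataStage): fold all columns of one record into a single
# delimited string, which is what Column Export produces downstream of the
# Data Set stage. Column names, values, and the "|" delimiter are assumptions.
record = {"COL_1": "abc", "COL_2": "123", "COL_3": "xyz"}  # stand-in for the 120 columns

single_field = "|".join(str(value) for value in record.values())
print(single_field)  # abc|123|xyz
```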

Related

Capturing Each Skipped Record - Copy Data Activity

I'm doing a large-scale project with multiple pipelines, millions of records per pipeline. I'm trying to develop a generic skipped row capture process.
What I need to do is: for every source row skipped due to any error encountered on the attempted load, I want to capture a key column value from the row and write it to a distinct log file (or a separate DB table row). This can't be summary data: for each individual row that fails, I need to capture the row key from that row so we can review/re-load it later (I will add in system variable values to identify the pipeline, component, timestamp, etc.). The pipeline must complete with all successful rows loaded and all unsuccessful rows logged.
This is no-brainer functionality in most ETL tools; I have to be overlooking something in ADF, because I can't find a way to do this. Appreciate any/all suggestions.
You can enable fault tolerance and choose the Skip incompatible rows option. It will skip rows that are incompatible between the source and the target store during the copy, e.g. a type or field mismatch or a PK violation.
Then you can enable the session log and choose the Warning log level in the copy activity to log the skipped rows. Finally, you can save your log file in Azure Storage or Azure Data Lake Storage Gen2.
Reference:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-fault-tolerance
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-log
With your first copy activity, check the fault tolerance option in 'Settings' to log skipped fault rows.
Make sure to place your row key column first in the mapping definition.
Get the copy activity logFilePath from the activity output into a variable.
Add another copy activity to load the skipped rows into a relational table.
Its source path will be the variable holding logFilePath.
Set the file path type to 'Wildcard file path'.
Keep the wildcard folder path empty; the variable value goes in 'Wildcard file name'.
Make sure the delimited file dataset's escape character is set to double quotes.
The OperationItem field of the log file holds your record fields separated by commas; because we placed the row ID first in the mapping, it will appear first in OperationItem as well.
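As a rough sketch of what the second copy activity does, outside ADF (plain Python; the log file name is hypothetical and it assumes the session log is a CSV whose OperationItem column holds the skipped record's fields comma-separated, with the row key first as mapped above):

```python
# Sketch only: extract the row key of every skipped row from the copy activity
# session log. Assumes the log is a CSV with Level and OperationItem columns and
# that the row key was mapped first; the file name is hypothetical.
import csv

row_keys = []
with open("copy_activity_session_log.csv", newline="") as log:
    for entry in csv.DictReader(log):
        if entry.get("Level") == "Warning":                     # skipped rows are logged at Warning level
            skipped_fields = entry["OperationItem"].split(",")  # the record's fields, key first
            row_keys.append(skipped_fields[0])

print(row_keys)  # keys of the skipped rows, ready to review or re-load
```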
Good luck.

Google Cloud Data Fusion is appending a column to original data

When I am loading encrypted data from a GCS source to a GCS sink, one additional column is getting added.
Original data
Employee ID,Employee First Name,Employee Last Name,Employee Joining Date,Employee location
1,Vinay,Argekar,01/01/2017,India
2,Thirukkumaran,Haridass,02/02/2017,USA
3,David,Wu,03/04/2000,Canada
4,Vinod,Kumar,04/02/2002,India
5,Joshua,Abraham,04/15/2010,France
6,Allaudin,Dastigar,09/24/2012,UK
7,Senthil,Kumar,08/15/2009,Germany
8,Sudha,Narayanan,12/14/2016,India
9,Ravi,Prasad,11/11/2011,Costa Rica
Data that came to the file after running the pipeline:
0,Employee ID,Employee First Name,Employee Last Name,Employee Joining Date,Employee location
91,1,Vinay,Argekar,01/01/2017,India
124,2,Thirukkumaran,Haridass,02/02/2017,US
164,3,David,Wu,03/04/2000,Canada
193,4,Vinod,Kumar,04/02/2002,India
224,5,Joshua,Abraham,04/15/2010,France
259,6,Allaudin,Dastigar,09/24/2012,UK
293,7,Senthil,Kumar,08/15/2009,Germany
328,8,Sudha,Narayanan,12/14/2016,India
363,9,Ravi,Prasad,11/11/2011,Costa Rica
The first column ("0" in the header) was not present in the original file.
When you were configuring the GCS source, did you specify the Format to be CSV, or was it left as Text? When the Format is Text, the output schema actually contains an offset, which is the first column you see in the output data. When you specify the Format to be CSV, you have to specify the output schema of the file.
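If the pipeline has already written files with the extra offset column, a quick cleanup outside Data Fusion could look like this (plain Python sketch; the file names are hypothetical and it assumes a simple comma-delimited file with the offset as the first field):

```python
# Sketch: drop the leading offset column that the Text format added.
import csv

with open("employees_with_offset.csv", newline="") as src, \
     open("employees_fixed.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow(row[1:])  # skip the first field (the byte offset)
```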

Set row as a header Azure Data Factory [mapping data flow]

Currently, I have an Excel file that I'm processing using a mapping data flow to remove some null values.
This is my input file:
and after removing the null values I have:
I'm sinking my data into a Cosmos DB but I need to change the names of the columns and set my first row as headers...
I need to do this (first row as header) in the previous step, before the sink, and I can't use the mapping option to set the column names manually, because the positions of some of these columns can change.
Any idea to do this?
Thanks
First row as header can only be checked in the dataset connection.
As a workaround, you can save your Excel data to blob storage (CSV format) after removing the null values.
Then create a copy data activity or a data flow, use this CSV file as the source (check First row as header), and Cosmos DB as the sink.
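Outside ADF, the "first row as header" step this workaround relies on amounts to the following (pandas sketch; the file names are hypothetical):

```python
# Sketch: promote the first data row of the intermediate CSV to column headers,
# then emit JSON records that a Cosmos DB sink could take. File names are hypothetical.
import pandas as pd

df = pd.read_csv("cleaned_output.csv", header=None)   # CSV written by the data flow sink
df.columns = df.iloc[0]                                # first row becomes the header
df = df.iloc[1:].reset_index(drop=True)                # drop that row from the data
df.to_json("for_cosmos.json", orient="records")        # one JSON document per row
```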
Update
Setting of sink in data flow:
Data preview of sink:
Result:

Add headers in csv file using azure data factory while moving to sink

How can we add headers to files that already exist in Blob Storage / Azure Data Lake using Azure Data Factory?
I am using a copy activity to move the headerless files to the sink, but while moving them the files should get default headers like "Prop_0" or "Column_1". Is there any method available to achieve this?
Any help would be appreciated.
Thanks and Regards,
Sandeep
Usually, Data Factory uses the default headers Prop_0, Prop_1, ..., Prop_N for a headerless CSV file to help us copy the data if we don't set the first row as header.
This is to help us do the column mapping, but it won't change the CSV file.
From my experience and knowledge of Data Factory, it doesn't support changing the schema of the CSV file. It's impossible to add headers to the CSV files, at least for now.
Hope this helps.
In ADF, create a new Data Flow. Add your CSV source with a no-header dataset. Then add your sink with a dataset that writes to an ADLS Gen2 folder as a text-delimited file WITH headers. In the sink mapping, you can name your columns:
I tried a different solution. I used the 'no delimiter' option to keep everything as one column. Then, in a derived column transformation, I split the single column into multiple columns and gave each column a proper name. Now we can map the columns to the target table.
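If a post-processing step outside ADF is acceptable, prepending default headers to a headerless CSV is straightforward (Python sketch; file names are hypothetical and a plain comma-delimited file is assumed):

```python
# Sketch: write a Prop_0..Prop_N header row in front of a headerless CSV.
import csv

with open("input_no_header.csv", newline="") as src:
    rows = list(csv.reader(src))

header = [f"Prop_{i}" for i in range(len(rows[0]))]     # Prop_0, Prop_1, ...

with open("output_with_header.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    writer.writerow(header)
    writer.writerows(rows)
```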

Talend shuffle the order of the columns

I was trying to merge all the rows of a file into columns based on a certain sequence number. This has been achieved with tPivotToColumnsDelimited (this has to be done; it cannot be changed).
But after using it, the column ordering has changed.
Is there any way of reading a file according to one schema and writing the file according to some other schema in Talend? (Basically, shuffling the column ordering in a file.)
I had tried setting tdynamicschema on the input and output but was not able to read and write the data properly.
Any help would be highly appreciated.
I have solved the issue.
I simply added a column holding the index number read from the file; before using tPivotToColumnsDelimited, I used that column to sort the results and write them to a tmp file, and then, with the help of tPivotToColumnsDelimited, the output is now according to the input schema.
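The general idea, shown outside Talend (plain Python sketch; file names are hypothetical): tag each row with its original index and sort on that index before the pivot, so the ordering survives.

```python
# Sketch of the approach (not Talend code): add an index column, sort on it, and
# write a tmp file that the pivot step can consume in the original order.
import csv

with open("input.csv", newline="") as src:
    rows = list(enumerate(csv.reader(src)))   # (index, row) pairs in file order

rows.sort(key=lambda pair: pair[0])           # sort on the added index column

with open("tmp_sorted.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for index, row in rows:
        writer.writerow([index] + row)        # keep the index so the pivot stays ordered
```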