Pipeline not picking up all rows/records from source - streamsets

The StreamSets pipeline is not picking up all the records/rows from the source. Am I missing something obvious?
For example:
Source (Informix): 39,136 rows
StreamSets input: 38,926 rows
StreamSets output: 38,928 rows

Related

Azure Data Factory schema mapping not working with SQL sink

I have a simple pipeline that loads data from a csv file to an Azure SQL db.
I have added a data flow where I have ensured that all schemas match the SQL table. There is a specific field which contains numbers with leading zeros. The data type in the source projection is set to string, the field is mapped to the SQL sink as a string data type, and the field in SQL has the nvarchar(50) data type.
Once the pipeline is run, all the leading zeros are lost and the field appears to be treated as decimal:
Original data: 0012345
Inserted data: 12345.0
The CSV data shown in the data preview displays correctly; however, for some reason it loses its formatting during the insert.
Any ideas how I can get it to insert correctly?
I reproduced this in my lab and was able to load the data as expected. Please see the repro details below.
Source file (CSV file):
Sink table (SQL table):
ADF:
Connect the data flow source to the CSV source file. As my file is in text format, all the source columns in the projection are of string type.
Source data preview:
Connect sink to Azure SQL database to load the data to the destination table.
Data in Azure SQL database table.
Note: You can also add a derived column before the sink to convert the value to string, as the sink data type is a string.
Thank you very much for your response.
As per your post, the data flow appears to be working correctly. I have finally discovered an issue with the transformation: I have an Azure Batch service which runs a Python script that does a basic transformation and saves the output to a CSV file.
Interestingly, when I preview the data in the dataflow, it looks as expected. However, the values stored in SQL are not.
For the sake of others having a similar issue: my existing Python script converted a 'float' column to string type. The conversion retained one decimal place, and since all of my numbers are integers they ended up with a trailing .0.
The solution was to convert values to integer and then to string:
df['col_name'] = df['col_name'].astype('Int64').astype('str')
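For context, here is a minimal pandas sketch of the behaviour described above (the column name and sample values are illustrative, not from the actual data):

import pandas as pd

# A float column of whole numbers: casting straight to string keeps the trailing .0.
df = pd.DataFrame({'col_name': [12345.0, 67.0]})
print(df['col_name'].astype('str').tolist())    # ['12345.0', '67.0']

# Casting to the nullable Int64 type first drops the decimal before the string conversion.
df['col_name'] = df['col_name'].astype('Int64').astype('str')
print(df['col_name'].tolist())                  # ['12345', '67']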

ADF Using Union to combine data

I'm trying to combine data from 2 data sources in ADF using a union. The combining of data works correctly, but not in the correct order.
Below is my dataflow containing the union.
Source1 contains 1 row of data, whereas source2 contains multiple rows. When these rows are combined using a union, they come out in a random order; however, I want the single row from source1 to be the first row in the sink output. Does anyone know how to do this? I've tried adding the union to source1 instead, but this doesn't work either.
source2 data example:
1110,555,666,1
1130,345,876,5
source1 data example:
uniquekey,number,id,position
Current Output:
1110,555,666,1
1130,345,876,5
uniquekey,number,id,position
Desired Output:
uniquekey,number,id,position
1110,555,666,1
1130,345,876,5
I tried to repro your issue and initially got the expected output, but when I ran the pipeline it generated two different files, the same as the source and the sink. So I performed the steps below to get the required output.
Source1 file.
Source2 file.
Union Configuration as follows:
Union settings tab
Optimize tab
Inspect tab
Sink Configuration:
Do not provide a file name in the sink configuration.
In the sink configuration, use Single partition in the Optimize tab.
Keep the mapping in the sink configuration as shown below:
Expected Output:
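For intuition, the ordering the asker wants amounts to appending source2 after source1, sketched here in pandas with the values from the examples above (this only illustrates the expected output, not the ADF configuration):

import pandas as pd

# source1: the single row that must come first; source2: the remaining rows.
source1 = pd.DataFrame([['uniquekey', 'number', 'id', 'position']])
source2 = pd.DataFrame([[1110, 555, 666, 1], [1130, 345, 876, 5]])

# Concatenating source1 first preserves the desired row order in the combined output.
combined = pd.concat([source1, source2], ignore_index=True)
print(combined.to_csv(index=False, header=False))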

How to get row count in file using Azure Lookup Activity

I am reading a data file and a RecordCount file that contains the count of records in the data file. I am using a Lookup activity to get the count from the data file and comparing it with the count in the RecordCount file. This approach works well and I can compare the records when the count is less than 5000. When the data file has more than 5000 records, the Lookup considers only 5000 records and my pipeline aborts because of the count mismatch.
eg:
Datafile count: 7500
RecordCount file: 7500
Though the counts are equal, the Lookup will consider only 5000 records and will report a mismatch.
How can I achieve this?
Add a Data Flow to your pipeline before the Lookup. Source = ADLS Gen2, Sink = ADLS Gen2. Add a Surrogate Key transformation and name the new column "mycounter". Add an Aggregate transformation, name the new column "rowcount", and use a formula of max(mycounter). In the Sink, output just the "rowcount" column. You'll now have a new dataset that is just the row count of any file. You can consume that row count with a single-row Lookup activity in the pipeline directly after your data flow.
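For intuition, this is roughly what the suggested data flow computes, sketched in pandas (file and column names are assumptions, not from the post):

import pandas as pd

# Read the data file and add a running counter, like the Surrogate Key transformation.
data = pd.read_csv('datafile.csv')
data['mycounter'] = range(1, len(data) + 1)

# Aggregate transformation: rowcount = max(mycounter), i.e. the total number of rows.
rowcount = data['mycounter'].max()

# Sink a single-row result that the Lookup activity can then read and compare.
pd.DataFrame({'rowcount': [rowcount]}).to_csv('rowcount.csv', index=False)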

Set row as a header Azure Data Factory [mapping data flow]

Currently, I have an Excel file that I'm processing using a mapping data flow to remove some null values.
This is my input file:
and after remove the null values I have:
I'm sinking my data into a Cosmos DB but I need to change the names of the columns and set my first row as headers...
I need to do this (first row as header) in the step before the sink, and I can't use the mapping option to set the column names manually because the positions of these columns may change.
Any idea to do this?
Thanks
"First row as header" can only be checked in the dataset connection.
As a workaround, you can save your Excel data to blob storage (CSV format) after removing the null values.
Then create a Copy Data activity or data flow, use this CSV file as the source (check "first row as header"), and Cosmos DB as the sink.
Update
Setting of sink in data flow:
Data preview of sink:
Result:
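If it is easier to promote the header row while producing the intermediate CSV, a minimal pandas sketch (file names are assumptions, not from the post) would be:

import pandas as pd

# Read the cleaned data without headers, promote the first row to column names, and drop it from the body.
df = pd.read_csv('cleaned.csv', header=None)
df.columns = df.iloc[0]
df = df.iloc[1:].reset_index(drop=True)
df.to_csv('with_header.csv', index=False)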

How can I filter my source dataset to copy only specific vaules to my sink?

I have a CSV file with 2 columns (id and name). The CSV file has over 1 million names. I'm struggling to work out how I can filter my results to only copy data where column 2 has the name 'mary' in it.
Can anyone advise?
Add a Data Flow activity to your ADF pipeline. In that data flow, point the Source to your CSV dataset. Next, add a Filter transformation and write an expression such as name == 'mary'. Next, add a Sink. This will copy only the rows that have 'mary' as the value in the name column.
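For comparison, the same filter sketched locally in pandas (the file name is an assumption; the id and name columns come from the question):

import pandas as pd

# Keep only the rows where the name column equals 'mary', then write them out.
df = pd.read_csv('names.csv')
filtered = df[df['name'] == 'mary']
filtered.to_csv('names_mary.csv', index=False)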