StreamSets Transformer - JDBC Origin without offset column

I'm testing platforms that allow any user to easily create data processing pipelines. Each platform has to meet certain requirements, one of which is the ability to move data from Oracle/SQL Server to HDFS.
StreamSets Transformer (v3.11) meets all the requirements, including the one referred to above. I just can't get it to work in one very specific case: when ingesting a table that contains no numeric columns.
In these cases I want the pipeline to process all data so, in the JDBC Origin, I enabled the "Skip Offset Tracking" property. I thought that by skipping the offset tracking there would be no need to set the "Offset Column" property (guess I was wrong).
Instead, I get the following error:
JDBC_05 - Table doesn't have compatible primary key configuration - supporting exactly one column but table have 0
If a numeric column exists, a possible workaround is to set it as the offset column, but I can't find a way to do this when none exists.
Am I missing something?
Thanks

We are looking at providing this functionality in Transformer in a future release. I'll come back and update this answer with any news.
In the meantime, you might want to look at using StreamSets Data Collector for these tables. It does not have the 'numeric offset column' requirement.
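If you need to stay on Transformer in the meantime, one possible stopgap (a sketch only, not a documented feature; it assumes you can create objects in the source database and point the origin at a view instead of the table) is to expose a synthetic numeric column through a view and set it as the Offset Column. With Skip Offset Tracking enabled the column only has to satisfy the origin's validation and is never used to resume reads. The connection string and object names below are placeholders, and the syntax shown is SQL Server (Oracle could use ROWNUM instead):

# Hypothetical workaround: wrap the table in a view that adds a numeric
# column so the JDBC origin's Offset Column validation can be satisfied.
import pyodbc

DDL = """
CREATE OR ALTER VIEW dbo.MyTable_WithOffset AS
SELECT
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS synthetic_offset,  -- numeric, validation-only
    t.*
FROM dbo.MyTable AS t;
"""

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;UID=user;PWD=secret"
)
conn.autocommit = True  # run the DDL outside an explicit transaction
conn.cursor().execute(DDL)
conn.close()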

Can a flowlet pass through all columns?

We are using Azure Data Factory and are exploring whether we could use Flowlets for transformations that occur in most of our data flows.
Our first attempt was to create a flowlet that only adds some columns (using a "Derived Column" step) to a stream. So in the "Input" step we don't require any column to be present in the received stream. Then the "Derived Column" step, followed by the "Output" step. And done... we thought.
When using this flowlet in a data flow, we go from 25 columns down to only the column we added; all our original columns are no longer available.
Is it possible to have a flowlet work on only a selection of the available columns, while all columns in the stream are "passed through" and thus remain available in the sink of the original data flow?
Be sure to select the Allow Schema Drift option on your Flowlet input settings

How to configure a Synapse Mapping Data Flow for INSERT/UPDATE/DELETE when the destination table does not yet exist

I am trying to build a generic Mapping Data Flow to do some basic cleansing on tables in my Data Lake. I need it to work both on an ongoing basis, after data already exists in my cleansed tables, and when new tables are added (it would detect them automatically and create and populate the destination). Both the source and destination tables will be Delta tables.
The approach I have taken is to configure Sources for both my actual source and the target, and use either JOIN transformations or EXISTS transformations to identify the new, updated and removed rows.
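Conceptually, those JOIN/EXISTS comparisons classify rows the way this small pandas sketch does (illustration only; the key and value columns are placeholders, and in the actual data flow the classification feeds Alter Row policies rather than DataFrames):

# Illustrative only: split incoming rows into inserts/updates/deletes by
# comparing the source with what is already in the cleansed (target) table.
import pandas as pd

source = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})      # incoming data
target = pd.DataFrame({"id": [2, 3, 4], "value": ["old", "c", "d"]})    # current cleansed table

inserts = source[~source["id"].isin(target["id"])]       # key only in source -> INSERT
deletes = target[~target["id"].isin(source["id"])]       # key only in target -> DELETE
both = source.merge(target, on="id", suffixes=("_src", "_tgt"))
updates = both[both["value_src"] != both["value_tgt"]]    # same key, changed value -> UPDATE

print(len(inserts), len(updates), len(deletes))

When the target is empty (the initial run), deletes is simply an empty frame, which is the situation described below.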
This works fine for INSERTs and UPDATEs; my issue is dealing with DELETEs when there is no data currently in the destination. Obviously there will be nothing to DELETE - that is as expected. However, because I reference the key column that will exist once data is loaded to the table, I get an error on an initial run that states:
ERROR Dataflow AppManager: name=BatchJobListener.failed, opId=xxx, message=Job 'xxx failed due to reason: DF-SINK-007 at Sink 'cleansedTableWithDeletes': Sink results in 0 output columns. Please ensure at least one column is mapped.
Has anyone developed a pattern that works for a generic flow (this one is parameter-driven and accommodates schema drift), or a way to make the Data Flow think there IS a column in the destination it can refer to, so it can get past this issue?
In the Source options, check Allow no files found.
You can also provide the date dynamically in the Filter by last modified option.
Refer to https://learn.microsoft.com/en-us/azure/data-factory/data-flow-sink#sink-settings

ADF Mapping Data Flow Sink Table Error - Sink Table remains empty

I have a Data flow in Azure Data Factory which I want to use to combine data from three sources and then sink in a destination table (with some transformation in-between). For the sink table I created a table in SQL, matching the column headers and data types from my Data Flow in Azure.
However, when I publish the data flow, the sink table remains empty. The only error I get is under Mapping: "At least one incoming column is mapped to a column in the sink dataset schema with a conflicting type, which can cause NULL values or runtime errors." This seems to prevent me from enabling Auto Mapping, so I mapped the columns manually.
So here is where I am at the moment:
I tried manually mapping the columns - the data types in my input and my sink tables match up with each other, but my sink table is still empty.
Under Data Preview for both my source and my sink tables I am able to view a sample of my data, so they are not empty.
Has anyone experienced something similar?
The details provided above are not sufficient to offer any input. Can you add more details, such as the file formats of the three sources, where you are pulling the data from, and additional screenshots? That would really help in guiding you better. Thanks
You can try the below steps:
Disable auto-mapping of columns in the Sink transformation and map the columns manually.
Check that the Allow insert option is selected under the Sink transformation settings.
Also make sure all column data types of the input and output of the Sink transformation match, to avoid nulls.
Thanks for the feedback. In the end the issue was that the pipeline which sinks the data from the flow into the destination table was not properly set up - that is why the data flow was not showing any errors but the sink table still remained empty. The data flow was essentially hanging in the air, with no instruction to actually perform the sink.

How to create a Derived Column in IIDR CDC for Kafka Topics?

We are currently working on a project to get data from an IBM i (formerly known as AS/400) system to Apache Kafka (Confluent Platform) with IBM IIDR CDC.
So far everything has been working fine; everything gets replicated and appears in the topics.
Now we are trying to create a derived column in a table mapping which gives us the journal entry type from the source system (IBM i).
We would like that information so we can see whether it was an Insert, Update or Delete operation.
Therefore we created a derived column called OPERATION as Char(2) with the expression &ENTTYP.
But unfortunately the Kafka Topic doesn't show the value.
Can someone tell me what we were missing here?
Best regards,
Michael
I own the IBM IDR Kafka target, so let's see if I can help a bit.
So you have two options. The recommended way to see audit information is to use one of the audit KCOPs. For instance, you might use this one:
https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdckafka.doc/tasks/kcopauditavroformat.html#kcopauditavroformat
You'll note that the audit.jcf property in the example is set to CCID and ENTTYP, so you get both the operation type and the transaction id.
Now, if you are using derived columns, I believe you would follow this procedure: https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.mcadminguide.doc/tasks/addderivedcolumn.html
If this is not working out, open a ticket and the L2 folks will provide a deeper debug. Oh also if you end up adding one, does the actual column get created in the output, just with no value in it?
Cheers,
Shawn
Your colleagues told me how to do it:
In the IDR Management Console, go to the "Filtering" tab, find the derived column in the "Filter Columns" (Source Columns) section and mark "replicate" next to it. Save the table mapping afterwards and see if it appears now.
Unfortunately a derived column isn't automatically selected for replication, but now I know how to select it.
You need to select the new column for replication on the Filtering tab as well:
https://www.ibm.com/docs/en/idr/11.4.0?topic=mstkul-mapping-audit-fields-journal-control-fields-kafka-targets

Best practices for parameterizing load of multiple CSV files in Data Factory

I am experimenting with Azure Data Factory to replace some other data-load solutions we currently have, and I'm struggling with finding the best way to organize and parameterize the pipelines to provide the scalability we need.
Our typical pattern is that we build an integration for a particular Platform. This "integration" is essentially the mapping and transformation of fields from their data files (CSVs) into our Stage1 SQL database, and by the time the data lands there, the data types should be set properly and the indexes in place.
Within each Platform, we have Customers. Each Customer has their own set of data files that get processed in that Customer context -- within the scope of a Platform, all Customer files follow the same schema (or close to it), but they all get sent to us separately. If you looked at our incoming file store, it might look like (simplified, there are 20-30 source datasets per customer depending on platform):
Platform
    Customer A
        Employees.csv
        PayPeriods.csv
        etc
    Customer B
        Employees.csv
        PayPeriods.csv
        etc
Each customer lands in their own SQL schema. So after processing the above, I should have CustomerA.Employees and CustomerB.Employees tables. (This allows a little bit of schema drift between customers, which does happen on some platforms. We handle it later in our stage 2 ETL process.)
What I'm trying to figure out is:
What is the best way to set up ADF so I can effectively manage one set of mappings per platform, and automatically accommodate any new customers we add to that platform without having to change the pipeline/flow?
My current thinking is to have one pipeline per platform, and one dataflow per file per platform. The pipeline has a variable, "schemaname", which is set using the path of the file that triggered it (e.g. "CustomerA"). Then, depending on the file name, a branching conditional fires the right dataflow: if it's "employees.csv" it runs one dataflow, if it's "payperiods.csv" it runs a different one. They would all use the same generic target sink datasource, with the table name parameterized and those parameters set in the pipeline using the schema variable and the file name from the conditional branch.
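As a sanity check on that naming logic, here is a small Python sketch of the derivation (illustration only; in ADF the same values would come from the trigger's folder path and file name via pipeline expressions, and the dataflow names below are placeholders):

# Illustrative only: derive schema, table and target dataflow from the path
# of the file that triggered the pipeline, e.g. "Platform/CustomerA/Employees.csv".
from pathlib import PurePosixPath

def derive_target(blob_path: str) -> tuple[str, str]:
    p = PurePosixPath(blob_path)
    return p.parent.name, p.stem            # ("CustomerA", "Employees")

schema, table = derive_target("Platform/CustomerA/Employees.csv")
print(f"{schema}.{table}")                  # CustomerA.Employees

dataflow_by_file = {                        # the branching conditional, as a lookup
    "employees.csv": "df_platform_employees",
    "payperiods.csv": "df_platform_payperiods",
}
print(dataflow_by_file[PurePosixPath("Platform/CustomerA/Employees.csv").name.lower()])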
Are there any pitfalls to setting it up this way? Am I thinking about this correctly?
This sounds solid. Just be aware that if you define column-specific mappings with expressions that expect those columns to be present, you may get data flow execution failures when those columns are not present in your customer source files.
The way to protect against that in ADF Data Flow is to use column patterns. These allow you to define mappings that are generic and more flexible.