ADF Mapping Dataflow Temp Table issue inside SP call - azure-data-factory

I have a mapping dataflow inside a ForEach activity which I'm using to copy several tables into ADLS; in the dataflow source, I call a stored procedure from my Synapse environment. In the SP, I create a small temp table to store some values which I later use for processing a query.
When I run the pipeline, I get an error on the mapping dataflow; "SQLServerException: 111212: Operation cannot be performed within a transaction." If I remove the temp table, and just do a simple select * from a small table, it returns the data fine; it's only after I bring back the temp table that I get an issue.
Have you guys ever seen this before, and is there a way around this?

If you go through the official MS docs, this error is very well documented.
Failed with an error: "SQLServerException: 111212; Operation cannot be performed within a transaction."
Symptoms
When you use the Azure SQL Database as a sink in the data flow to preview data, debug/trigger run and do other activities, you may find your job fails with following error message:
{"StatusCode":"DFExecutorUserError","Message":"Job failed due to reason: at Sink 'sink': shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerException: 111212;Operation cannot be performed within a transaction.","Details":"at Sink 'sink': shaded.msdataflow.com.microsoft.sqlserver.jdbc.SQLServerException: 111212;Operation cannot be performed within a transaction."}
Cause
The error "111212;Operation cannot be performed within a transaction." only occurs in the Synapse dedicated SQL pool. But you mistakenly use the Azure SQL Database as the connector instead.
Recommendation
Confirm if your SQL Database is a Synapse dedicated SQL pool. If so, use Azure Synapse Analytics as a connector shown in the picture below.

After running some tests around this issue, it seems that Mapping Dataflows do not like temp tables inside a stored procedure called from the source.
I fixed this by replacing the temp table with a CTE, which, believe it or not, runs a bit faster than the temp table version did.
@KarthikBhyresh
I looked at that article before, but my issue wasn't with the sink: I was using a Synapse linked service as my source and Data Lake Storage as my sink, so I knew from the beginning that it did not apply to my case, even though it was the same error number.
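For readers who want to see the shape of the temp-table-to-CTE rewrite described above, here is a minimal sketch. It is an assumption-laden illustration, not the poster's actual procedure: it uses Python's built-in sqlite3 module as a stand-in for Synapse, and the orders table, the wanted_regions name, and the data are all hypothetical. It only shows that a small lookup set held in a temp table can often be expressed as a CTE inside a single statement.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'EU', 10.0), (2, 'US', 25.0), (3, 'EU', 5.0);
""")

def totals_with_temp_table(conn):
    # Original shape: stage the lookup values in a temp table, then join to it.
    conn.execute("CREATE TEMP TABLE wanted_regions (region TEXT)")
    conn.execute("INSERT INTO wanted_regions VALUES ('EU')")
    rows = conn.execute("""
        SELECT o.region, SUM(o.amount)
        FROM orders o JOIN wanted_regions w ON w.region = o.region
        GROUP BY o.region
    """).fetchall()
    conn.execute("DROP TABLE wanted_regions")
    return rows

def totals_with_cte(conn):
    # Rewritten shape: the same lookup values live in a CTE, so the whole
    # thing is one SELECT with no DDL or intermediate writes.
    return conn.execute("""
        WITH wanted_regions(region) AS (SELECT 'EU')
        SELECT o.region, SUM(o.amount)
        FROM orders o JOIN wanted_regions w ON w.region = o.region
        GROUP BY o.region
    """).fetchall()

temp_result = totals_with_temp_table(conn)
cte_result = totals_with_cte(conn)
```

Both functions return the same rows; the CTE version avoids creating an object, which is the kind of operation the data flow source appeared to reject in the poster's tests.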

Related

Incrementally loading into a Synapse table using Spark

I am creating a data warehouse using Azure Data Factory to extract data from a MySQL table and saving it in parquet format in an ADLS Gen 2 filesystem. From there, I use Synapse notebooks to process and load data into destination tables.
The initial load is fairly easy using spark.write.saveAsTable('orders'). However, I am running into some issues doing incremental loads following the initial load. In particular, I have not been able to find a way to reliably insert/update information into an existing Synapse table.
Since Spark does not allow DML operations on a table, I have resorted to reading the current table into a Spark DataFrame and inserting/updating records in that DataFrame. However, when I try to save that DataFrame using spark.write.saveAsTable('orders', mode='overwrite', format='parquet'), I run into a Cannot overwrite table 'orders' that is also being read from error.
One suggested solution is to create a temporary table and then insert using that, but that still results in the above error.
Another post suggests writing the data into a temporary table, dropping the target table, and then renaming the temporary table, but upon doing this, Spark gives me FileNotFound errors regarding metadata.
I know Delta Tables can fix this issue pretty reliably, but our company is not yet ready to move over to Databricks.
All suggestions are greatly appreciated.

Stored Procedure as Source in Data Flow

I'm trying to execute a stored procedure which returns rows as output, but when I try it in the Data Flow source I get this error message:
DF-SYS-01 at Source 'source1':
com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword 'EXEC'.
My source uses the Query option, and I'm trying to execute
"EXEC [UVREP].spFeedsProduct 'HH',-2"
Can't I use a stored procedure as a source in Data Flow? I'm able to do the same in a Copy Data activity and it works fine. What am I doing wrong?
ADF Data Flow source can take queries or UDFs, but not sprocs.
https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-sql-database#source-transformation
As Joel mentioned in comments, you can use an ADF Stored Proc activity in the pipeline to execute the sproc before your data flow and store the results in a table or staging file (Parquet/CSV) for the data flow source to read it.
Thanks MarkKromer and JoelCochran.
I have now switched from the stored procedure to views, using a pipeline with a Lookup and a Data Flow inside a ForEach loop. I have to copy about 12 tables to three different sinks.
Is there a better way?

COPY command runs but no data being copied from Teradata (on-prem)

I am running into an issue where I have set up a pipeline that gets a list of tables from Teradata using a Lookup activity and then passes those items to a ForEach activity that copies the data in parallel and saves each table as a gzipped file. The requirement is essentially to archive some tables that are no longer being used.
For this pipeline I am not using any partition options as most of the tables are small and I kept it to be flexible.
Pipeline
COPY activity within ForEach activity
99% of the tables ran without issues and were copied as gz files into blob storage, but two tables in particular run for a long time (approx. 4 to 6 hours) without any of the data being written into the blob storage account.
Note that the image above says "Cancelled", but that was done by me. Before that I had a run time as described above, but still no data being written. This is affecting only 2 tables.
I checked with our Teradata team and those tables are not being used by anyone (hence they are not locked). I also looked at the query monitor in "Teradata Viewpoint" (admin tool) and saw that the query was running on Teradata without issues.
Any insight would be greatly appreciated.
Looking at the issue you describe, it seems the table's data size is more than a single blob can store (as you are not using any partition options).
Use a partition option to optimize performance and handle the data.
Just in case someone else comes across this, the way I solved this was to create a new dataset called "TD_Prod_datasetname". The purpose of this dataset is not to point to a specific table, but just to accept an "item().TableName" value.
This dataset contains two main values. The first is @dataset().TeradataName
Dataset property
I only came up with that after doing a little bit of digging in Google.
I then created a parameter called "TeradataTable" as String.
I then updated my pipeline. As above the main two activities remain the same. I have a lookup and then a ForEach Activity (where for each will get the item values):
However, in the COPY command inside the ForEach activity I updated the source. Instead of getting "item().Name" I am passing through @item().TableName:
This enabled me to select the "Table" option, and because I am using Table instead of Query I can use the "Hash" partition. I left the partition column blank because, according to the Microsoft documentation, it will automatically find the Primary Key to use for this.
The only issue I ran into with this approach is that a table without a Primary Key will fail and will need to be run through either a different process or manually outside of this job.
Because of this change, the files that previously just hung and did not copy were now copied successfully into our blob storage account.
Hope this helps someone else that wants to see how to create parallel copies using Teradata as a source and pass through multiple table values.
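To illustrate the general idea behind the "Hash" partition option mentioned above, here is a minimal sketch of hash partitioning in plain Python. It is not the hash function that ADF or Teradata actually uses; the hash_partition helper, the order_id column, and the sample rows are hypothetical. It only shows why a partition column is needed: each row is assigned to one of several buckets based on a hash of that column, so the buckets can be copied in parallel.

```python
import zlib
from collections import defaultdict

def hash_partition(rows, key, num_partitions):
    """Group rows into buckets by a deterministic hash of rows[key]."""
    buckets = defaultdict(list)
    for row in rows:
        if row.get(key) is None:
            # Mirrors the failure mode described above: without a usable
            # partition column there is nothing to hash on.
            raise ValueError(f"row has no value for partition column {key!r}")
        # crc32 is used here only because it is deterministic across runs.
        bucket = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        buckets[bucket].append(row)
    return dict(buckets)

# Hypothetical sample data: ten rows split across four parallel copies.
rows = [{"order_id": i, "amount": i * 1.5} for i in range(10)]
partitions = hash_partition(rows, "order_id", 4)
```

Every row lands in exactly one bucket, and the same key always maps to the same bucket, which is what lets each parallel copy read a disjoint slice of the table.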

Loading many tables in Cloud Data Fusion fails with DAG error

I have an MS SQL Server data source with around 1000 tables, which I need to put into BigQuery. I was hoping to use Data Fusion to load them all into staging tables in BigQuery, and then perform transformations on them afterwards. However, as soon as I create a pipeline with two "islands" it gives a DAG error. Is that a feature or just something I'm doing wrong? I can't find anything in the documentation. My pipeline looks like this:
And the error I get when I try to deploy is: "Invalid DAG. There is an island made up of stages BigTest,BigQuery BigTest (no other stages connect to them)."
Each pipeline is a single DAG (directed acyclic graph), and all sources and sinks must be connected for the configuration to be valid. You can use the multi-table source plugin, which can bring in multiple tables at once to a landing table in BQ.
You can use Multi table plugins and BQ Multi table sink for your use-case.
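As a rough illustration of what the "island" in the error message means, the sketch below finds groups of stages that are not connected to each other. This is not Data Fusion's own validation code; the find_islands helper is hypothetical, and only the stage names BigTest and BigQuery BigTest come from the error above (OtherSource and BigQuery OtherSource are made up to stand for the second source/sink pair).

```python
from collections import defaultdict

def find_islands(stages, connections):
    """Return groups of stages that are connected only among themselves."""
    neighbours = defaultdict(set)
    for src, dst in connections:
        # Direction does not matter for connectivity, so link both ways.
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    seen, islands = set(), []
    for stage in stages:
        if stage in seen:
            continue
        # Depth-first walk collecting everything reachable from this stage.
        island, stack = [], [stage]
        seen.add(stage)
        while stack:
            current = stack.pop()
            island.append(current)
            for nxt in neighbours[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        islands.append(sorted(island))
    return islands

stages = ["BigTest", "BigQuery BigTest", "OtherSource", "BigQuery OtherSource"]
connections = [
    ("BigTest", "BigQuery BigTest"),
    ("OtherSource", "BigQuery OtherSource"),
]
islands = find_islands(stages, connections)
```

With two independent source-to-sink pairs, the walk finds two separate groups, which is the situation the deploy step rejects; a pipeline is accepted only when all stages form a single group.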

How to copy a database to SAP HANA from PostgreSQL with Talend?

My problem is: how can I copy a database with Talend from PostgreSQL to SAP HANA without needing to write a job for every table?
The reason is that it could take a long time to prepare all those jobs, considering there are at least 200 tables with at least 30 columns each.
I tried the tTransferDatabase plugin, but I can't get it to transfer to SAP HANA; it gives me an error that it can't copy the schema (while it worked when copying to another PostgreSQL database), and I am sure that the schema names are right.
Here is the error:
Exception in component tTransferDatabase_1
java.lang.NullPointerException
at org.apache.ddlutils.PlatformFactory.createNewPlatformInstance(PlatformFactory.java:86)
at org.apache.ddlutils.PlatformFactory.createNewPlatformInstance(PlatformFactory.java:124)
at com.devjpcb.transferdatabase.TransferDatabase.getPlatformDestine(TransferDatabase.java:179)
at com.devjpcb.transferdatabase.TransferDatabase.copySchemaToDatabase(TransferDatabase.java:249)
at local_project.aaasa_0_1.aaasa.tTransferDatabase_1Process(aaasa.java:836)
at local_project.aaasa_0_1.aaasa.runJobInTOS(aaasa.java:1130)
at local_project.aaasa_0_1.aaasa.main(aaasa.java:951)
Is there maybe a way to do something like: for each table in the connection, guess the table schema, copy the columns from the table to the other side of a tMap, and run?
Any advice would be helpful, thank you!
With some work, you could use the example job created by rbaldwin on Talend Exchange; note that it starts with files, not a database. But you could easily create a job that loops through all your database tables and does an extract to file, to then use as the starting point.
Another option is Bekwam's solution