Calling Azure SQL DW Stored Procedure from Azure Data Factory

I am getting the following error when trying to run a stored procedure in an Azure SQL Data Warehouse:
Activity 'SprocActivitySample' contains an invalid Dataset reference 'Destination-SQLDW-nna'. This dataset points to the Azure SQL DW, and the stored procedure is in it.
Here is the entire code.
{
"name": "SprocActivitySamplePipeline",
"properties": {
"activities": [
{
"type":"SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "DailyImport",
"storedProcedureParameters": {
"DateToImportFor": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}', SliceStart)"
}
},
"outputs": [
{
"name": "Destination-SQLDW-nna"
}
],
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "SprocActivitySample"
}
],
"start": "2017-01-01T00:00:00Z",
"end": "2017-02-20T05:00:00Z",
"isPaused": true
}
}

I'm afraid that Azure SQL Data Warehouse does not support table-valued parameters in stored procedures.
Read more about it here: https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-develop-stored-procedures
If you find a workaround for it, please share! I couldn't find any.
Also, it would be good if you could post the dataset JSON so we can try to find any errors in it.
Cheers!

I got this working. The problem was that I was referencing the wrong dataset:
"outputs": [
{
"name": "Destination-SQLDW-nna"
}
After correcting the name to the right dataset, it is working.
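For reference, a minimal sketch of the kind of Azure SQL DW dataset the activity's "outputs" entry has to point at; the dataset "name" must match the reference exactly. The linked service and table names below are hypothetical:
{
    "name": "Destination-SQLDW-nna",
    "properties": {
        "type": "AzureSqlDWTable",
        "linkedServiceName": "Destination-SQLDW-LinkedService",
        "typeProperties": {
            "tableName": "DailyImportLog"
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}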

Related

Unable to ingest data from PostgreSQL to Druid (deployed using Helm)

I deployed Druid using the Helm chart from https://github.com/apache/druid/tree/master/helm/druid and got it deployed successfully, but when I created a task with the following spec:
{
"type": "index_parallel",
"id": "sairam_testing_postgresql_100",
"spec": {
"dataSchema": {
"dataSource": "test-ingestion-postgresql-100",
"timestampSpec": {
"format": "iso",
"column": "created_at"
},
"dimensionsSpec": {
"dimensions": [
"app_id","user_id"
]
}
},
"ioConfig": {
"type": "index_parallel",
"inputSource": {
"type": "sql",
"database": {
"type": "postgresql",
"connectorConfig": {
"connectURI": "jdbc:postgresql://35.200.128.167:5432/mhere_trans",
"user": "postgres#jiovishwam-frp-att-prod-mhere-trans-psql-db-1",
"password": "lFRWncdXG4Po0e"
}
},
"sqls": [
"SELECT app_id ,user_id FROM transactions limit 10"
]
}
},
"maxNumConcurrentSubTasks": 2,
"tuningConfig": {
"type": "index_parallel",
"partitionsSpec": {
"type": "dynamic"
}
}
}
}
it is throwing this error:
Failed to submit task: Cannot construct instance of org.apache.druid.firehose.PostgresqlFirehoseDatabaseConnector, problem: java.lang.ClassNotFoundException: org.postgresql.Driver at [Source: (org.eclipse.jetty.server.HttpInputOverHTTP); line: 1, column: 969] (through reference chain: org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask["spec"]->org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexIngestionSpec["ioConfig"]->org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexIOConfig["inputSource"]->org.apache.druid.metadata.input.SqlInputSource["database"])
NOTE: I did try the quickstart and hit a similar issue (it got fixed by manually adding the PostgreSQL JAR file to the lib directory), but I am not sure how to handle this when Druid is deployed using Helm charts in production.
According to the docs, in order to use the SQL input source with PostgreSQL, you will need to add the extension to the load list:
druid.extensions.loadList=["postgresql-metadata-storage"]
If you are installing with the helm chart, you can specify this in the general parameters section of your custom values.yaml.
configVars:
  druid_extensions_loadList: '["druid-histogram", "druid-datasketches", "druid-lookups-cached-global", "postgresql-metadata-storage"]'
Note: while you can extract data from PostgreSQL using a JDBC connection, it is recommended that for larger data sets you use multiple SQL statements so that ingestion can be parallelized.
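For illustration, a rough sketch of what a parallelizable ioConfig could look like, splitting the extraction into range-bounded statements; the id column and ranges are hypothetical, and the database block is unchanged from the spec above:
"ioConfig": {
    "type": "index_parallel",
    "inputSource": {
        "type": "sql",
        "database": {
            "type": "postgresql",
            "connectorConfig": { ... }
        },
        "sqls": [
            "SELECT app_id, user_id FROM transactions WHERE id >= 1 AND id < 100000",
            "SELECT app_id, user_id FROM transactions WHERE id >= 100000 AND id < 200000"
        ]
    }
}
Each entry in sqls becomes its own split, so a maxNumConcurrentSubTasks greater than 1 can actually be used.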

Azure Data Factory dependencies

I have two activities in my Azure Data Factory.
Activity A1 = a stored proc on a SQL DB. Input = none, output = DB (output1). The stored proc targets the output dataset.
Activity A2 = an Azure copy activity ("type": "Copy") which copies from blob to the same SQL DB. Input = blob, output = DB (output2).
I need to run activity A1 before A2, and I can't for the world figure out what dependencies to put between them.
I tried to mark A2 as having two inputs: the blob + the DB (output1). If I do this, the copy activity doesn't throw an error, but it does NOT copy the blob to the DB (I think it silently uses the DB as the source of the copy instead of the blob, and somehow does nothing).
If I remove the DB input (output1) on A2, it can successfully copy the blob to the DB, but I no longer have the dependency chain that makes A1 run before A2.
Thanks!
I figured this out. I was able to keep two dependencies on A2, but just needed to make sure of the ordering of the two inputs. Weird. It looks like the Copy activity just acts on the FIRST input, so when I made the blob the first input it worked! :) (Earlier I had the DB output1 as the first input and it silently did nothing.)
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "MyBlobInput"
},
{
"name": "MyDBOutput1"
}
],
"outputs": [
{
"name": "MyDBOutput2"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 3,
"retry": 3
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "AzureBlobtoSQL",
"description": "Copy Activity"
}
],

USQL Activity in ADF V2 - 2705 User is not able to access to datalake store

I'm facing an issue when running a U-SQL script with Azure Data Factory V2.
This U-SQL script works fine in the portal or Visual Studio:
@a =
    SELECT * FROM
        (VALUES
            ("Contoso", 1500.0, "2017-03-39"),
            ("Woodgrove", 2700.0, "2017-04-10")
        ) AS D( customer, amount, date );
@results =
    SELECT
        customer,
        amount
    FROM @a;
OUTPUT @results
TO "test"
USING Outputters.Text();
But it doesn't work when started from an Azure Data Factory V2 activity; below are my ADF scripts.
Creating or updating linked service ADLA [adla] ...
{
"properties": {
"type": "AzureDataLakeAnalytics",
"typeProperties": {
"accountName": "adla",
"servicePrincipalId": "ea4823f2-3b7a-4c-78d29cffa68b",
"servicePrincipalKey": {
"type": "SecureString",
"value": "jKhyspEwMScDAGU0MO39FcAP9qQ="
},
"tenant": "41f06-8ca09e840041",
"subscriptionId": "9cf053128b749",
"resourceGroupName": "test"
}
}
}
Creating or updating linked service BLOB [BLOB] ...
{
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "DefaultEndpointsProtocol=https;AccountName=totoblob;AccountKey=qZqpKyGtWMRXZO2CNLa0qTyvLTMko4lzfgsg07pjloIPGZtJel4qvRBkoVOA==;EndpointSuffix=core.windows.net"
}
}
}
}
Creating or updating pipeline ADLAPipeline...
{
"properties": {
"activities": [
{
"type": "DataLakeAnalyticsU-SQL",
"typeProperties": {
"scriptPath": "src/test.usql",
"scriptLinkedService": {
"referenceName": "blob",
"type": "LinkedServiceReference"
},
"degreeOfParallelism": 1
},
"linkedServiceName": {
"referenceName": "adla",
"type": "LinkedServiceReference"
},
"name": "Usql-toto"
}
]
}
}
1 - I checked the connection to the blob storage; the U-SQL script is successfully found (if I rename it, it throws a not-found error).
2 - I checked the connection to Azure Data Lake Analytics; it seems to connect (if I set a wrong credential, it throws a different error).
3 - When running the pipeline I have the following error: "Activity Usql-toto failed: User is not able to access to datalake store". I don't actually provide Data Lake Store credentials, but there is a default Data Lake Store account attached to the ADLA account.
Any hint?
Finally found help in this post: U-SQL Job Failing in Data Factory.
The /system and /catalog folders don't inherit permissions from their parent, so I had to reassign permissions on these two folders.
I was also having this problem. What helped me was running the "Add User Wizard" from Data Lake Analytics. Using this wizard I added the service principal that I use in the linked service for Data Lake Analytics as an Owner with R+W permissions.
Before using the wizard, I tried to configure this manually by setting the appropriate permissions from the Explore data screen, but that didn't resolve the problem. (The SP was already a Contributor on the service.)

Azure Data Factory: Parameterized the Folder and file path

Environments
Azure Data Factory
Scenario
I have an ADF pipeline which reads data from an on-premises server and writes it to Azure Data Lake.
For this, I have provided the folder structure in the ADF dataset as follows:
Folder Path: DBName/RawTables/Transactional
File Path: TableName.csv
Problem
Is it possible to parameterize the folder name or file path? Basically, if tomorrow I want to change the folder path (without a deployment), then we should only have to update the metadata or table structure.
So the short answer here is no. You won't be able to achieve this level of dynamic flexibility with ADF on its own.
You'll need to add new defined datasets to your pipeline as inputs for folder changes. In Data Lake you could probably get away with a single stored procedure that accepts a parameter for the file path which could be reused. But this would still require tweaks to the ADF JSON when calling the proc.
Of course, the catch-all option here is to use an ADF custom activity and write a C# class with methods to do whatever you need. Maybe overkill though, and a lot of effort to set up the authentication to Data Lake Store.
Hope this gives you a steer.
Mangesh, why don't you try the .NET custom activity in ADF? This custom activity would be your first activity: it would check for the processed folder and, if the processed folder is present, move it to a History folder (say). As ADF is a platform for data movement and data transformation, it doesn't deal with IO activities itself. You can learn more about the .NET custom activity at:
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-use-custom-activities
What you want to do is possible with the new Lookup activity in Azure Data Factory V2. Documentation is here: Lookup Activity.
A JSON example would be something like this:
{
"name": "LookupPipelineDemo",
"properties": {
"activities": [
{
"name": "LookupActivity",
"type": "Lookup",
"typeProperties": {
"dataset": {
"referenceName": "LookupDataset",
"type": "DatasetReference"
}
}
},
{
"name": "CopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from #{activity('LookupActivity').output.tableName}"
},
"sink": {
"type": "BlobSink"
}
},
"dependsOn": [
{
"activity": "LookupActivity",
"dependencyConditions": [ "Succeeded" ]
}
],
"inputs": [
{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SinkDataset",
"type": "DatasetReference"
}
]
}
]
}
}
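As a further sketch (not part of the answer above), ADF V2 also supports parameters on the dataset itself, which is closer to the original ask of changing the folder path without a redeployment. All names here are hypothetical:
{
    "name": "ParameterizedBlobDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "AzureStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "folderPath": { "type": "String" },
            "fileName": { "type": "String" }
        },
        "typeProperties": {
            "folderPath": {
                "value": "@dataset().folderPath",
                "type": "Expression"
            },
            "fileName": {
                "value": "@dataset().fileName",
                "type": "Expression"
            },
            "format": { "type": "TextFormat" }
        }
    }
}
An activity that references this dataset supplies the values on its DatasetReference, e.g. "parameters": { "folderPath": "DBName/RawTables/Transactional", "fileName": "TableName.csv" }, and those values can themselves come from pipeline parameters or expressions such as the Lookup output above.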

Exporting an AWS Postgres RDS Table to AWS S3

I wanted to use AWS Data Pipeline to pipe data from a Postgres RDS to AWS S3. Does anybody know how this is done?
More precisely, I wanted to export a Postgres table to AWS S3 using Data Pipeline. The reason I am using Data Pipeline is that I want to automate this process, and the export is going to run once every week.
Any other suggestions will also work.
There is a sample on GitHub:
https://github.com/awslabs/data-pipeline-samples/tree/master/samples/RDStoS3
Here is the code:
https://github.com/awslabs/data-pipeline-samples/blob/master/samples/RDStoS3/RDStoS3Pipeline.json
You can define a copy-activity in the Data Pipeline interface to extract data from a Postgres RDS instance into S3.
Create a data node of the type SqlDataNode. Specify table name and select query.
Set up the database connection by specifying the RDS instance ID (the instance ID is in your URL, e.g. your-instance-id.xxxxx.eu-west-1.rds.amazonaws.com) along with the username, password and database name.
Create a data node of the type S3DataNode.
Create a Copy activity and set the SqlDataNode as input and the S3DataNode as output.
Another option is to use an external tool like Alooma. Alooma can replicate tables from a PostgreSQL database hosted on Amazon RDS to Amazon S3 (https://www.alooma.com/integrations/postgresql/s3). The process can be automated, and you can run it once a week.
I built a pipeline from scratch using the MySQL sample and the documentation as a reference.
You need to have the roles in place: DataPipelineDefaultResourceRole and DataPipelineDefaultRole.
I haven't loaded the parameters, so you need to go into the Architect and put in your credentials and folders.
Hope it helps.
{
"objects": [
{
"failureAndRerunMode": "CASCADE",
"resourceRole": "DataPipelineDefaultResourceRole",
"role": "DataPipelineDefaultRole",
"pipelineLogUri": "#{myS3LogsPath}",
"scheduleType": "ONDEMAND",
"name": "Default",
"id": "Default"
},
{
"database": {
"ref": "DatabaseId_WC2j5"
},
"name": "DefaultSqlDataNode1",
"id": "SqlDataNodeId_VevnE",
"type": "SqlDataNode",
"selectQuery": "#{myRDSSelectQuery}",
"table": "#{myRDSTable}"
},
{
"*password": "#{*myRDSPassword}",
"name": "RDS_database",
"id": "DatabaseId_WC2j5",
"type": "RdsDatabase",
"rdsInstanceId": "#{myRDSId}",
"username": "#{myRDSUsername}"
},
{
"output": {
"ref": "S3DataNodeId_iYhHx"
},
"input": {
"ref": "SqlDataNodeId_VevnE"
},
"name": "DefaultCopyActivity1",
"runsOn": {
"ref": "ResourceId_G9GWz"
},
"id": "CopyActivityId_CapKO",
"type": "CopyActivity"
},
{
"dependsOn": {
"ref": "CopyActivityId_CapKO"
},
"filePath": "#{myS3Container}#{format(#scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
"name": "DefaultS3DataNode1",
"id": "S3DataNodeId_iYhHx",
"type": "S3DataNode"
},
{
"resourceRole": "DataPipelineDefaultResourceRole",
"role": "DataPipelineDefaultRole",
"instanceType": "m1.medium",
"name": "DefaultResource1",
"id": "ResourceId_G9GWz",
"type": "Ec2Resource",
"terminateAfter": "30 Minutes"
}
],
"parameters": [
]
}
You can now do this with the aws_s3.query_export_to_s3 function within Postgres itself: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/postgresql-s3-export.html
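A minimal sketch of that call; the table, bucket, key and region are hypothetical placeholders, and the RDS instance needs an IAM role that allows writing to the bucket:
-- Install the extension once per database (CASCADE also installs aws_commons).
CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;

-- Export the result of a query to a CSV object in S3.
SELECT *
FROM aws_s3.query_export_to_s3(
    'SELECT * FROM my_table',
    aws_commons.create_s3_uri('my-export-bucket', 'exports/my_table.csv', 'us-east-1'),
    options := 'format csv'
);
To run it once a week you would still need something to issue the statement on a schedule, for example a Lambda function triggered by an EventBridge rule, or pg_cron on the instance.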