Get Blob API call from Azure Data Factory - azure-data-factory

I asked the same question on the MS Q&A site too.
In ADF, I tried to call Get Blob (https://learn.microsoft.com/en-us/rest/api/storageservices/get-blob).
I got this error message: "Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature."
I'd like to read an image or other unstructured file and insert it into a varchar(max) column in SQL Server (source: Binary to sink: Binary in SQL Server).
My pipeline is configured as below.
Linked service:
base URL: https://{account name}.blob.core.windows.net/
authentication type: anonymous
server certificate validation: disabled
type: REST
Dataset:
type: REST
relative URL: {container name}/xyz.jpeg
Copy data activity:
request method: GET
x-ms-date: @concat(formatDateTime(utcNow(), 'yyyy-MM-ddTHH:mm:ss'), 'Z')
x-ms-version: 2018-11-09
x-ms-blob-type: BlockBlob
Authorization: SharedKey {storage name}:CBntp....{SAS key}....LsIHw%3D
(I took the key from a SAS connection string: https&sig=CBntp{SAS key}LsIHw%3D)
Is it possible to call the Azure Blob REST API in ADF pipelines?

Unfortunately this is not possible, because when using a Binary dataset in a copy activity, you can only copy from a Binary dataset to a Binary dataset.
See the source and sink dataset properties for the Binary format in the reference below.
Reference - https://learn.microsoft.com/en-us/azure/data-factory/format-binary#copy-activity-properties
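As an aside, the Authorization value in the question mixes two schemes: a SAS token is normally passed in the blob URL's query string, while SharedKey requires an HMAC-SHA256 signature computed over the canonicalized request (the SAS key cannot be used there). A minimal sketch of Get Blob with a SAS token, outside ADF, using placeholder account, container, blob, and token values:
import requests

# Placeholder values - substitute your own storage account, container, blob and SAS token
account = "mystorageaccount"
container = "mycontainer"
blob = "xyz.jpeg"
sas_token = "sv=2018-11-09&ss=b&...&sig=CBntp...LsIHw%3D"  # SAS goes in the query string, not an Authorization header

# Get Blob: the SAS token in the query string authorizes the request
url = "https://{0}.blob.core.windows.net/{1}/{2}?{3}".format(account, container, blob, sas_token)
resp = requests.get(url, headers={"x-ms-version": "2018-11-09"})
resp.raise_for_status()
image_bytes = resp.content  # raw blob content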

Related

Getting an error while using copy activity (PolyBase) in ADF to copy Parquet files in ADLS Gen2 to an Azure Synapse table

My source is Parquet files in ADLS Gen2. All the Parquet files are part files of 10-14 MB; the total size is around 80 GB.
The sink is an Azure Synapse table.
The copy method is PolyBase. I get the error below within 5 seconds of execution:
ErrorCode=PolybaseOperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error happened when loading data into SQL Data Warehouse. Operation: 'Create external table'.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=System.Data.SqlClient.SqlException,Message=External file access failed due to internal error: 'Error occurred while accessing HDFS: Java exception raised on call to HdfsBridge_IsDirExist. Java exception message:
HdfsBridge::isDirExist - Unexpected error encountered checking whether directory exists or not: AbfsRestOperationException: Operation failed: "This request is not authorized to perform this operation.", 403, HEAD, URL',Source=.Net SqlClient Data Provider,SqlErrorNumber=105019,Class=16,ErrorCode=-2146232060,State=1,Errors=[{Class=16,Number=105019,State=1,Message=External file access failed due to internal error: 'Error occurred while accessing HDFS: Java exception raised on call to HdfsBridge_IsDirExist. Java exception message:
HdfsBridge::isDirExist - Unexpected error encountered checking whether directory exists or not: AbfsRestOperationException: Operation failed: "This request is not authorized to perform this operation.", 403, HEAD,
I've seen this error when authentication fails; check whether the authorization header and/or signature is wrong.
For example, create the database scoped credential using your ADLS Gen2 storage account access key:
CREATE DATABASE SCOPED CREDENTIAL [MyADLSGen2Cred] WITH
    IDENTITY = 'user',
    SECRET = 'zge . . . 8V/rw==';
The external data source is created as follows:
CREATE EXTERNAL DATA SOURCE [MyADLSGen2] WITH (
    TYPE = HADOOP,
    LOCATION = 'abfs://myblob@pabechevb.dfs.core.windows.net',
    CREDENTIAL = [MyADLSGen2Cred]);
You can specify wasb instead of abfs, and if you're using SSL, specify it as abfss. Then the external table is created as follows:
CREATE EXTERNAL TABLE [dbo].[ADLSGen2] (
    [Content] varchar(128))
WITH (
    LOCATION = '/',
    DATA_SOURCE = [MyADLSGen2],
    FILE_FORMAT = [TextFileFormat]);
You can find additional information in my book "Hands-On Data Virtualization with Polybase".

How to create a bucket using the Python SDK?

I'm trying to create a bucket in IBM Cloud Object Storage using Python. I have followed the instructions in the API docs.
This is the code I'm using:
import ibm_boto3
from ibm_botocore.client import Config

COS_ENDPOINT = "https://control.cloud-object-storage.cloud.ibm.com/v2/endpoints"

# Create client
cos = ibm_boto3.client("s3",
    ibm_api_key_id=COS_API_KEY_ID,
    ibm_service_instance_id=COS_INSTANCE_CRN,
    config=Config(signature_version="oauth"),
    endpoint_url=COS_ENDPOINT
)

s3 = ibm_boto3.resource('s3')

def create_bucket(bucket_name):
    print("Creating new bucket: {0}".format(bucket_name))
    s3.Bucket(bucket_name).create()
    return

bucket_name = 'test_bucket_442332'
create_bucket(bucket_name)
I'm getting this error. I tried setting CreateBucketConfiguration={"LocationConstraint":"us-south"}, but it doesn't seem to work:
"ClientError: An error occurred (IllegalLocationConstraintException) when calling the CreateBucket operation: The unspecified location constraint is incompatible for the region specific endpoint this request was sent to."
Resolved by going to https://cloud.ibm.com/docs/cloud-object-storage?topic=cloud-object-storage-endpoints#endpoints and choosing the endpoint specific to the region I need. The "Endpoint" provided with the credentials is not the actual endpoint.
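For illustration, a rough sketch of the same bucket creation against a region-specific endpoint (the us-south public endpoint below is an assumption; pick the one for your region from the page above):
import ibm_boto3
from ibm_botocore.client import Config

COS_API_KEY_ID = "<api-key>"                  # placeholder
COS_INSTANCE_CRN = "<service-instance-crn>"   # placeholder

# Assumed region-specific public endpoint (us-south); see the endpoints page above
COS_ENDPOINT = "https://s3.us-south.cloud-object-storage.appdomain.cloud"

cos = ibm_boto3.resource("s3",
    ibm_api_key_id=COS_API_KEY_ID,
    ibm_service_instance_id=COS_INSTANCE_CRN,
    config=Config(signature_version="oauth"),
    endpoint_url=COS_ENDPOINT
)

# Bucket names must be globally unique and follow S3 naming rules (no underscores)
cos.Bucket("test-bucket-442332").create()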

How to create/start a Databricks cluster from an ADF Web Activity by invoking the Databricks REST API

I have 2 requirements:
1: I have a cluster ID and need to start the cluster from a Web Activity in ADF. The activity parameters look like this:
url: https://XXXX..azuredatabricks.net/api/2.0/clusters/start
body: {"cluster_id":"0311-004310-cars577"}
Authentication: Azure Key Vault Client Certificate
Upon running this activity I encounter the error below:
"errorCode": "2108",
"message": "Error calling the endpoint
'https://xxxxx.azuredatabricks.net/api/2.0/clusters/start'. Response status code: ''. More
details:Exception message: 'Cannot find the requested object.\r\n'.\r\nNo response from the
endpoint. Possible causes: network connectivity, DNS failure, server certificate validation or
timeout.",
"failureType": "UserError",
"target": "GetADBToken",
"GetADBToken" is my activity name.
The above security mechanism works for other Databricks-related activities, such as running a JAR that is already installed on my Databricks cluster.
2: I want to create a new cluster with the below settings:
url: https://XXXX..azuredatabricks.net/api/2.0/clusters/create
body:
{
  "cluster_name": "my-cluster",
  "spark_version": "5.3.x-scala2.11",
  "node_type_id": "i3.xlarge",
  "spark_conf": {
    "spark.speculation": true
  },
  "num_workers": 2
}
Upon calling this API, if cluster creation is successful I would like to capture the cluster ID in the next activity.
So what would be the output of the above activity, and how can I access it in a subsequent ADF activity?
For #2), can you please check whether changing the version
"spark_version": "5.3.x-scala2.11"
to
"spark_version": "6.4.x-scala2.11"
helps?
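As for the output of #2: the clusters/create endpoint returns a JSON body containing the new cluster's cluster_id, and an ADF Web Activity exposes the parsed response under @activity('<activity name>').output, so a later activity can use an expression like @activity('CreateCluster').output.cluster_id (the activity name here is assumed). A rough sketch of the same call outside ADF, with a placeholder workspace URL and personal access token:
import requests

# Placeholder workspace URL and Databricks personal access token
workspace_url = "https://XXXX.azuredatabricks.net"
token = "<databricks-pat>"

body = {
    "cluster_name": "my-cluster",
    "spark_version": "6.4.x-scala2.11",
    "node_type_id": "i3.xlarge",
    "spark_conf": {"spark.speculation": True},
    "num_workers": 2,
}

resp = requests.post(
    workspace_url + "/api/2.0/clusters/create",
    headers={"Authorization": "Bearer " + token},
    json=body,
)
resp.raise_for_status()
cluster_id = resp.json()["cluster_id"]  # this is the value the next activity needs
print(cluster_id)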

Google Cloud Data Fusion -- building pipeline from REST API endpoint source

Attempting to build a pipeline to read from a 3rd party REST API endpoint data source.
I am using the HTTP (version 1.2.0) plugin found in the Hub.
The request URL is: https://api.example.io/v2/somedata?return_count=false
A sample of the response body:
{
  "paging": {
    "token": "12456789",
    "next": "https://api.example.io/v2/somedata?return_count=false&__paging_token=123456789"
  },
  "data": [
    {
      "cID": "aerrfaerrf",
      "first": true,
      "_id": "aerfaerrfaerrf",
      "action": "aerrfaerrf",
      "time": "1970-10-09T14:48:29+0000",
      "email": "example@aol.com"
    },
    {...}
  ]
}
The main error in the logs is:
java.lang.NullPointerException: null
at io.cdap.plugin.http.source.common.pagination.BaseHttpPaginationIterator.getNextPage(BaseHttpPaginationIterator.java:118) ~[1580429892615-0/:na]
at io.cdap.plugin.http.source.common.pagination.BaseHttpPaginationIterator.ensurePageIterable(BaseHttpPaginationIterator.java:161) ~[1580429892615-0/:na]
at io.cdap.plugin.http.source.common.pagination.BaseHttpPaginationIterator.hasNext(BaseHttpPaginationIterator.java:203) ~[1580429892615-0/:na]
at io.cdap.plugin.http.source.batch.HttpRecordReader.nextKeyValue(HttpRecordReader.java:60) ~[1580429892615-0/:na]
at io.cdap.cdap.etl.batch.preview.LimitingRecordReader.nextKeyValue(LimitingRecordReader.java:51) ~[cdap-etl-core-6.1.1.jar:na]
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:214) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:128) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1415) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.Task.run(Task.scala:109) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) [spark-core_2.11-2.3.3.jar:2.3.3]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_232]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_232]
Possible issues
After trying to troubleshoot this for a while, I'm thinking the issue might be with:
Pagination
The Data Fusion HTTP plugin has a lot of methods to deal with pagination.
Based on the response body above, it seems like the best option for Pagination Type is Link in Response Body.
For the required Next Page JSON/XML Field Path parameter, I've tried $.paging.next and paging/next. Neither works.
I have verified that the link in /paging/next works when opened in Chrome.
Authentication
When simply trying to view the response URL in Chrome, a prompt pops up asking for a username and password.
Only the API key needs to be entered as the username to get past this prompt in Chrome.
To do this in the Data Fusion HTTP plugin, the API key is used as the Username in the Basic Authentication section.
Anyone have any success in creating a pipeline in Google Cloud Data Fusion where the data source is a REST API?
In answer to
Anyone have any success in creating a pipeline in Google Cloud Data Fusion where the data source is a REST API?
This is not the optimal way to achieve this. The best way would be to ingest the data from the service APIs (see the Service APIs Overview) into Pub/Sub, and then use Pub/Sub as the source for your pipeline. This provides a simple and reliable staging location for your data on its way to processing, storage, and analysis; see the documentation for the Pub/Sub API. To use this in conjunction with Dataflow, follow the steps in the official documentation: Using Pub/Sub with Dataflow.
I think your problem is in the data format that you receive. The exception:
java.lang.NullPointerException: null
occurs when you do not specify a correct output schema (no schema in this case I believe)
Solution 1
To solve it, try configuring the HTTP Data Fusion plugin with:
Receive format: Text
Output Schema: name: user, type: String
This should work to obtain the response from the API in string format. Once that is done, use a JSONParser to convert the string into a table-like object.
Solution 2
Configure the HTTP Data Fusion plugin to:
Receive format: JSON
JSON/XML Result Path: data
JSON/XML Fields Mapping: include the fields you presented (see the attached screenshot).
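For comparison, the "Link in Response Body" pagination the plugin is meant to perform is essentially the following loop. A rough sketch, assuming the sample API above and a hypothetical API key sent as the Basic-auth username:
import requests

# Hypothetical starting URL and API key (sent as the Basic-auth username, per the question)
url = "https://api.example.io/v2/somedata?return_count=false"
api_key = "<api-key>"

records = []
while url:
    resp = requests.get(url, auth=(api_key, ""))   # empty password
    resp.raise_for_status()
    body = resp.json()
    records.extend(body.get("data", []))           # JSON/XML Result Path: data
    url = body.get("paging", {}).get("next")       # Next Page JSON/XML Field Path: $.paging.next
print("fetched {0} records".format(len(records)))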

How to read file content from Bitbucket via a Mule job?

I have a requirement to integrate data from Bitbucket through a Mule job. I'm not looking for CI/CD integration; I mean actually reading the content of a repository file through Mule. Also, I am a newbie to Mule.
I am able to read the contents of a repo branch through Mule using the Atlassian Stash connector.
Q1: Does this flow actually retrieve the files or just their metadata? I am able to print the names of the files.
Q2: Assuming files are indeed retrieved, how do I read the contents of a file? I tried using Mule Requester to read the output payload of the Stash connector, but I see that the payload is null when I print it. No errors are thrown, just a blank payload. Appreciate your help!
My Mule flow: HTTP -> Stash Connector (read files) -> For Each -> Mule Requester: Retrieve File -> Log payload.
Mule to Stash connector API I am using: http://hotovo.github.io/mule-stash-connector/mule/stash-config.html#commit-files-get
Syntax I am trying with Mule Requester instance: file://#[payload.value]
Output:
INFO 2019-08-09 14:37:07,691
[[test-bitbucket-connect].HTTP_Listener_Configuration.worker.01]
org.mule.api.processor.LoggerMessageProcessor: FOR payload ---- Key:
repository/.java Value: repository/.java INFO
2019-08-09 14:37:07,692
[[test-bitbucket-connect].HTTP_Listener_Configuration.worker.01]
org.mule.lifecycle.AbstractLifecycleManager: Initialising:
'file-connector-config.requester.1217682634'. Object is:
FileMessageRequester INFO 2019-08-09 14:37:07,692
[[test-bitbucket-connect].HTTP_Listener_Configuration.worker.01]
org.mule.lifecycle.AbstractLifecycleManager: Starting:
'file-connector-config.requester.1217682634'. Object is:
FileMessageRequester INFO 2019-08-09 14:37:07,693
[[test-bitbucket-connect].HTTP_Listener_Configuration.worker.01]
org.mule.api.processor.LoggerMessageProcessor: Post retrieve File:
null
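For what it's worth, one way to sanity-check what the connector should be returning is to call the Bitbucket Server (Stash) REST API directly over HTTP instead of the file:// requester. A rough sketch under the assumption that the server exposes the standard /browse endpoint (which returns file content as JSON lines); the host, project key, repo slug, path, and credentials are all placeholders:
import requests

# Placeholder Bitbucket Server (Stash) details
base = "https://bitbucket.example.com"
project, repo, path = "PROJ", "my-repo", "src/Main.java"

resp = requests.get(
    "{0}/rest/api/1.0/projects/{1}/repos/{2}/browse/{3}".format(base, project, repo, path),
    params={"at": "refs/heads/master"},   # branch or commit to read from
    auth=("username", "password-or-token"),
)
resp.raise_for_status()
# The /browse endpoint returns the file as a list of line objects
content = "\n".join(line["text"] for line in resp.json().get("lines", []))
print(content)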