Google Cloud Data Fusion -- building pipeline from REST API endpoint source

I am attempting to build a pipeline that reads from a third-party REST API endpoint data source.
I am using the HTTP plugin (version 1.2.0) found in the Hub.
The request URL is: https://api.example.io/v2/somedata?return_count=false
A sample of the response body:
{
  "paging": {
    "token": "12456789",
    "next": "https://api.example.io/v2/somedata?return_count=false&__paging_token=123456789"
  },
  "data": [
    {
      "cID": "aerrfaerrf",
      "first": true,
      "_id": "aerfaerrfaerrf",
      "action": "aerrfaerrf",
      "time": "1970-10-09T14:48:29+0000",
      "email": "example@aol.com"
    },
    {...}
  ]
}
The main error in the logs is:
java.lang.NullPointerException: null
at io.cdap.plugin.http.source.common.pagination.BaseHttpPaginationIterator.getNextPage(BaseHttpPaginationIterator.java:118) ~[1580429892615-0/:na]
at io.cdap.plugin.http.source.common.pagination.BaseHttpPaginationIterator.ensurePageIterable(BaseHttpPaginationIterator.java:161) ~[1580429892615-0/:na]
at io.cdap.plugin.http.source.common.pagination.BaseHttpPaginationIterator.hasNext(BaseHttpPaginationIterator.java:203) ~[1580429892615-0/:na]
at io.cdap.plugin.http.source.batch.HttpRecordReader.nextKeyValue(HttpRecordReader.java:60) ~[1580429892615-0/:na]
at io.cdap.cdap.etl.batch.preview.LimitingRecordReader.nextKeyValue(LimitingRecordReader.java:51) ~[cdap-etl-core-6.1.1.jar:na]
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:214) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:128) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1415) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.Task.run(Task.scala:109) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) [spark-core_2.11-2.3.3.jar:2.3.3]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_232]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_232]
Possible issues
After trying to troubleshoot this for a while, I think the issue might be with:
Pagination
The Data Fusion HTTP plugin has many options for dealing with pagination.
Based on the response body above, the best choice for Pagination Type seems to be Link in Response Body.
For the required Next Page JSON/XML Field Path parameter, I've tried $.paging.next and paging/next. Neither works (see the sketch below).
I have verified that the link in /paging/next works when opened in Chrome.
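A quick way to rule out the path expression itself is to evaluate it outside Data Fusion. This is a minimal sketch, assuming the Jayway json-path library is on the classpath; the body is a trimmed copy of the sample response above.

import com.jayway.jsonpath.JsonPath;

public class PaginationPathCheck {
    public static void main(String[] args) {
        // Trimmed copy of the sample response body from the question.
        String body = "{\"paging\":{\"token\":\"12456789\","
                + "\"next\":\"https://api.example.io/v2/somedata?return_count=false&__paging_token=123456789\"},"
                + "\"data\":[]}";

        // If the path is valid, this prints the next-page URL.
        String next = JsonPath.read(body, "$.paging.next");
        System.out.println(next);
    }
}

If this resolves correctly, the problem is more likely the plugin configuration or authentication than the path syntax itself.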
Authentication
When simply opening the request URL in Chrome, a prompt pops up asking for a username and password.
Only the API key needs to be entered as the username to get past this prompt in Chrome.
To replicate this in the Data Fusion HTTP plugin, the API key is used as the Username in the Basic Authentication section (see the sketch below).
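For reference, this is a minimal sketch of how Basic authentication with the API key as the username (and an empty password) is normally encoded; the API key value is a placeholder.

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthHeader {
    public static void main(String[] args) {
        String apiKey = "YOUR_API_KEY";  // placeholder, not a real key
        // Basic auth is base64("username:password"); here the password is empty.
        String headerValue = "Basic " + Base64.getEncoder()
                .encodeToString((apiKey + ":").getBytes(StandardCharsets.UTF_8));
        System.out.println("Authorization: " + headerValue);
    }
}

Comparing this header with what Chrome sends (visible in the browser's network tab) is a reasonable way to confirm the plugin's Basic Authentication section produces the same credentials.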
Anyone have any success in creating a pipeline in Google Cloud Data Fusion where the data source is a REST API?

In answer to
Anyone have any success in creating a pipeline in Google Cloud Data Fusion where the data source is a REST API?
This is not the optimal way to achieve this. The better approach would be to ingest the data (see Service APIs Overview) into Pub/Sub and then use Pub/Sub as the source for your pipeline. This provides a simple and reliable staging location for your data on its way to processing, storage, and analysis; see the documentation for the Pub/Sub API. To use this in conjunction with Dataflow, follow the steps in the official documentation: Using Pub/Sub with Dataflow.
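As a rough sketch of that ingestion path (the project id and topic name below are placeholders, not taken from the question), each API response could be published to Pub/Sub with the google-cloud-pubsub Java client:

import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class PublishApiResponse {
    public static void main(String[] args) throws Exception {
        // Placeholder project and topic; the topic must already exist.
        Publisher publisher = Publisher.newBuilder(
                TopicName.of("my-project-id", "somedata-ingest")).build();
        try {
            String responseBody = "{\"data\":[]}";  // body fetched from the REST API
            PubsubMessage message = PubsubMessage.newBuilder()
                    .setData(ByteString.copyFromUtf8(responseBody))
                    .build();
            publisher.publish(message).get();  // wait for the publish to complete
        } finally {
            publisher.shutdown();
        }
    }
}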

I think your problem is in the data format that you receive. The exception:
java.lang.NullPointerException: null
occurs when you do not specify a correct output schema (in this case, I believe, no schema at all).
Solution 1
To solve it, try configuring the HTTP Data Fusion plugin as follows:
Receive format: Text.
Output Schema: name: user, Type: String
This should obtain the response from the API as a single string. Once that is done, use a JSONParser to convert the string into a table-like object (sketched below).
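As a rough sketch of what that JSONParser step has to do with the single user field (Gson is used here purely for illustration; the record values come from the sample body above):

import com.google.gson.JsonArray;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

public class ExtractDataArray {
    public static void main(String[] args) {
        // The whole API response arrives as one string field named "user".
        String user = "{\"paging\":{},\"data\":[{\"cID\":\"aerrfaerrf\",\"email\":\"example@aol.com\"}]}";

        JsonObject root = JsonParser.parseString(user).getAsJsonObject();
        JsonArray data = root.getAsJsonArray("data");

        // Each element of "data" becomes one output record.
        for (int i = 0; i < data.size(); i++) {
            JsonObject record = data.get(i).getAsJsonObject();
            System.out.println(record.get("cID").getAsString() + " " + record.get("email").getAsString());
        }
    }
}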
Solution 2
Configure the HTTP Data Fusion plugin as follows:
Receive format: JSON
JSON/XML Result Path: data
JSON/XML Fields Mapping: include the fields present in the sample response (an example inferred from the body above follows).
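For example, the fields and their apparent types inferred from the sample response body above (the mapping path syntax itself may differ between plugin versions):

cID    : string
first  : boolean
_id    : string
action : string
time   : string
email  : string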

Related

Get Blob API call from Azure Data Factory

I asked the same question on the MS Q&A site too.
In ADF, I tried to call Get Blob: https://learn.microsoft.com/en-us/rest/api/storageservices/get-blob
I got this error message: "Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature."
I'd like to read an image or other unstructured file and insert it into a varchar(max) column in SQL Server (source: binary to sink: binary in SQL Server).
My pipeline is configured as below.
linked service:
base url: https://{account name}.blob.core.windows.net/
authentication type: anonymous
server certificate: disabled
type: Rest
data set
type :Rest
relative url: {container name}/xyz.jpeg
copy data activity
request method: get
x-ms-date: #concat(formatDateTime(utcNow(), 'yyyy-MM-ddTHH:mm:ss'), 'Z')
x-ms-version: 2018-11-09
x-ms-blob-type: BlockBlob
Authorization: SharedKey {storage name}:CBntp....{SAS key}....LsIHw%3D
(I took a key from a SAS connection string: ...https&sig=CBntp{SAS key}LsIHw%3D)
Is it possible to call the Azure Blob rest API in ADF pipelines?
Unfortunately this is not possible, because when using a Binary dataset in a copy activity, you can only copy from a Binary dataset to another Binary dataset.
Source dataset property when Source is Binary
Sink dataset property
Reference - https://learn.microsoft.com/en-us/azure/data-factory/format-binary#copy-activity-properties

How to create/start a Databricks cluster from an ADF Web activity by invoking the Databricks REST API

I have 2 requirements:
1: I have a cluster ID. I need to start the cluster from a Web activity in ADF. The activity parameters look like this:
url:https://XXXX..azuredatabricks.net/api/2.0/clusters/start
body: {"cluster_id":"0311-004310-cars577"}
Authentication: Azure Key Vault Client Certificate
Upon running this activity I encounter the error below:
"errorCode": "2108",
"message": "Error calling the endpoint
'https://xxxxx.azuredatabricks.net/api/2.0/clusters/start'. Response status code: ''. More
details:Exception message: 'Cannot find the requested object.\r\n'.\r\nNo response from the
endpoint. Possible causes: network connectivity, DNS failure, server certificate validation or
timeout.",
"failureType": "UserError",
"target": "GetADBToken",
"GetADBToken" is my activity name.
The above security mechanism works for other Databricks-related activities, such as running a jar that is already installed on my Databricks cluster.
2: I want to create a new cluster with the below settings:
url:https://XXXX..azuredatabricks.net/api/2.0/clusters/create
body:
{
  "cluster_name": "my-cluster",
  "spark_version": "5.3.x-scala2.11",
  "node_type_id": "i3.xlarge",
  "spark_conf": {
    "spark.speculation": true
  },
  "num_workers": 2
}
Upon calling this API, if cluster creation is successful I would like to capture the cluster ID in the next activity.
So what would the output of the above activity be, and how can I access it in the immediately following ADF activity?
For #2, can you please check whether changing the version from
"spark_version": "5.3.x-scala2.11"
to
"spark_version": "6.4.x-scala2.11"
helps?
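On the second part of #2 (capturing the cluster id), a hedged sketch: the Clusters Create API responds with a small JSON body, and in ADF the parsed output of a Web activity can normally be referenced from a subsequent activity with an expression. The activity name CreateCluster and the id value below are placeholders.

Response body from /api/2.0/clusters/create:
{
  "cluster_id": "0123-456789-abcd123"
}

Expression in the next activity:
@activity('CreateCluster').output.cluster_id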

Google Dataflow Pipeline creation fails with 400: Bad Request / invalid grant

I have been building and creating templates for Google Dataflow for over a year now. I never had a problem creating templates and uploading them to GCS with the options.setTemplateLocation(templatePath) call. Since today, when creating the pipeline with Pipeline.create(options) and running the Java program in Eclipse, I get the following exception:
Exception in thread "main" java.lang.RuntimeException: Failed to construct instance from factory method DataflowRunner#fromOptions(interface org.apache.beam.sdk.options.PipelineOptions)
at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:233)
at org.apache.beam.sdk.util.InstanceBuilder.build(InstanceBuilder.java:162)
at org.apache.beam.sdk.PipelineRunner.fromOptions(PipelineRunner.java:52)
at org.apache.beam.sdk.Pipeline.create(Pipeline.java:142)
at mypackage.PipelineCreation.getTemplatePipeline(PipelineCreation.java:34)
at myotherpackage.Main.main(Main.java:51)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:222)
... 5 more
Caused by: java.lang.RuntimeException: Unable to verify that GCS bucket gs://my-projects-staging-bucket exists.
at org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator.verifyPathIsAccessible(GcsPathValidator.java:92)
at org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator.validateOutputFilePrefixSupported(GcsPathValidator.java:61)
at org.apache.beam.runners.dataflow.DataflowRunner.fromOptions(DataflowRunner.java:228)
... 10 more
Caused by: com.google.api.client.http.HttpResponseException: 400 Bad Request
{
"error" : "invalid_grant",
"error_description" : "Bad Request"
}
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1070)
at com.google.auth.oauth2.UserCredentials.refreshAccessToken(UserCredentials.java:207)
at com.google.auth.oauth2.OAuth2Credentials.refresh(OAuth2Credentials.java:149)
at com.google.auth.oauth2.OAuth2Credentials.getRequestMetadata(OAuth2Credentials.java:135)
at com.google.auth.http.HttpCredentialsAdapter.initialize(HttpCredentialsAdapter.java:96)
at com.google.cloud.hadoop.util.ChainingHttpRequestInitializer.initialize(ChainingHttpRequestInitializer.java:52)
at com.google.api.client.http.HttpRequestFactory.buildRequest(HttpRequestFactory.java:93)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.buildHttpRequest(AbstractGoogleClientRequest.java:300)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.util.ResilientOperation$AbstractGoogleClientRequestExecutor.call(ResilientOperation.java:166)
at com.google.cloud.hadoop.util.ResilientOperation.retry(ResilientOperation.java:66)
at org.apache.beam.sdk.util.GcsUtil.getBucket(GcsUtil.java:505)
at org.apache.beam.sdk.util.GcsUtil.bucketAccessible(GcsUtil.java:492)
at org.apache.beam.sdk.util.GcsUtil.bucketAccessible(GcsUtil.java:457)
at org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator.verifyPathIsAccessible(GcsPathValidator.java:88)
... 12 more
I was logged in to gcloud with another account today, but I logged in again with the account associated with the project as "Owner" using gcloud auth login.
I also restarted Eclipse, but the same error keeps occurring. When trying to run the pipeline locally, I get another error, also with the "invalid_grant" / "Bad Request" content. Restarting the laptop had no effect either.
My pom defines google-cloud-dataflow-java-sdk-all version 2.2.0, and upgrading to 2.5.0 had no effect.
I am able to copy data to the bucket with gsutil from the command line, but when running the Java program from the command line with mvn compile exec:java -Dexec.mainClass=mypackage.Main I still get the same errors.
My function to create a templatePipeline looks like the following:
public static Pipeline getTemplatePipeline(String jobName, String templatePath) {
    DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
    options.setProject("my-project-id");
    options.setRunner(DataflowRunner.class);
    options.setStagingLocation("gs://my-projects-staging-bucket/binaries");
    options.setTempLocation("gs://my-projects-staging-bucket/binaries/tmp");
    options.setGcpTempLocation("gs://my-projects-staging-bucket/binaries/tmp");
    options.setZone("europe-west3-a");
    options.setWorkerMachineType("n1-standard-2");
    options.setJobName(jobName);
    options.setMaxNumWorkers(2);
    options.setDiskSizeGb(40);
    options.setTemplateLocation(templatePath);
    return Pipeline.create(options);
}
Any help is highly appreciated.
You don't have to use a service account; you can still use gcloud. Use the following command and log in with your account:
gcloud auth application-default login
I found the solution in the quickstart docs.
It seems that the plain gcloud auth is no longer used and you have to use a service account. So, as in the docs, I created a service account with the role "project/owner" and downloaded its JSON file to $path.
Then, on my Mac, I used export GOOGLE_APPLICATION_CREDENTIALS="$path" and, within the same session, used the command mentioned in the question to compile and execute the Java program.
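Putting those steps together (the key path below is a placeholder; the main class matches the question):

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
mvn compile exec:java -Dexec.mainClass=mypackage.Main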

WSO2IS NullPointerException when using step authenticator

Occasionally (?) the WSO2 IS user is unable to authenticate, with the following exception. When retrying, the user is authenticated. Any ideas what the reason / resolution could be? We have set up session caching.
We are using WSO2 Identity Server 5.0.0.SP1 / SAML authentication with the authenticator set to advanced (single step, multiple options). I cannot find the correct source code commit to check out (to match the line number in the exception).
Thank you all in advance
Gabriel
TID: [0] [IS] [2016-02-15 13:07:22,914] ERROR
{org.wso2.carbon.identity.application.authentication.framework.handler.request.impl.DefaultRequestCoordinator}
- Exception in Authentication Framework {org.wso2.carbon.identity.application.authentication.framework.handler.request.impl.DefaultRequestCoordinator}
java.lang.NullPointerException at
org.wso2.carbon.identity.application.authentication.framework.handler.sequence.impl.DefaultStepBasedSequenceHandler.handle(DefaultStepBasedSequenceHandler.java:83)
at
org.wso2.carbon.identity.application.authentication.framework.handler.request.impl.DefaultAuthenticationRequestHandler.handle(DefaultAuthenticationRequestHandler.java:121)
at
org.wso2.carbon.identity.application.authentication.framework.handler.request.impl.DefaultRequestCoordinator.handle(DefaultRequestCoordinator.java:94)
at
org.wso2.carbon.identity.application.authentication.framework.servlet.CommonAuthenticationServlet.doPost(CommonAuthenticationServlet.java:54)
at
org.wso2.carbon.identity.application.authentication.framework.servlet.CommonAuthenticationServlet.doGet(CommonAuthenticationServlet.java:44)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:735) at
javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
Edit:
This exception occurs on WSO2 IS 5.1.0 too.
See the source code, line 105:
StepConfig stepConfig = context.getSequenceConfig().getStepMap().get(currentStep);
// if the current step is completed
if (stepConfig.isCompleted()) {
    stepConfig.setCompleted(false);
ERROR org.wso2.carbon.identity.application.authentication.framework.handler.request.impl.DefaultRequestCoordinator} - Exception in Authentication Framework
java.lang.NullPointerException
at org.wso2.carbon.identity.application.authentication.framework.handler.sequence.impl.DefaultStepBasedSequenceHandler.handle(DefaultStepBasedSequenceHandler.java:105)
at org.wso2.carbon.identity.application.authentication.framework.handler.request.impl.DefaultAuthenticationRequestHandler.handle(DefaultAuthenticationRequestHandler.java:115)
It looks like the stepConfig 'disappeared' from the authentication config. The setup is a single node with session persistence into a database.
Apparently it is a problem with concurrency.
When multiple concurrent requests are sent to the SSO endpoint while the user is already authenticated, all threads attempt to process the request and modify the same authentication context object (the currentStep counter), so the cached authentication context ends up in an invalid state.
The valid use case is that the client should send only a single request to the SSO endpoint, so the team dealing with the UI has to fix that. But that is only a quick fix that does not prevent the issue in the long term; we really have to pick it up with WSO2 (and maybe fix the code ourselves) :) A stripped-down sketch of the problem follows.
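This sketch is illustrative only (not WSO2 code): it shows why a duplicate request processed against the same cached context ends in the NullPointerException.

import java.util.HashMap;
import java.util.Map;

public class StepRaceSketch {
    static class AuthContext {
        int currentStep = 1;
        final Map<Integer, String> stepMap = new HashMap<>();
        AuthContext() { stepMap.put(1, "basic-auth"); }  // only one step is configured
    }

    public static void main(String[] args) {
        AuthContext shared = new AuthContext();  // same cached context served to both requests

        // First SSO request: finds step 1 and advances the counter.
        System.out.println(shared.stepMap.get(shared.currentStep));  // prints "basic-auth"
        shared.currentStep++;

        // Duplicate SSO request against the already-advanced context:
        // the lookup returns null, which is where the real handler dereferences it and throws.
        System.out.println(shared.stepMap.get(shared.currentStep));  // prints "null"
    }
}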
g.

BAM 2.5.0 - Monitoring realtime traffic - Error when creating a new execution plan "Imported streams cannot be empty"

As a new user of BAM with CEP integration, I'm currently following the "Monitoring Realtime Traffic" sample from the WSO2 documentation and am blocked at the execution-plan creation step. Link to doc
The doc requires:
4. Under Import Stream select org.wso2.sample.rt.traffic for Import Stream, and enter traffic for As.
Unfortunately, when I click "Import" nothing happens (the doc shows we should get //imported from org.wso2.sample.rt.traffic:1.0.0).
When I try to add the execution plan, I get the "Imported streams cannot be empty" error.
Am I making a mistake?
Regards
Vpl
I was able to work around this UI problem by creating the event stream definition directly in the registry. For that I created the following resource:
/_system/governance/StreamDefinitions/org.wso2.sample.rt.traffic/1.0.0
containing
{
  "streamId": "org.wso2.sample.rt.traffic:1.0.0",
  "name": "org.wso2.sample.rt.traffic",
  "version": "1.0.0",
  "payloadData": [
    {
      "name": "entry",
      "type": "STRING"
    }
  ]
}
with Media Type: application/json.
Then, when creating the execution plan, I could import the event stream and continue the use case / tutorial.
Regards