Cannot create a batch pipeline to get data from ZohoCRM to BigQuery with the HTTP plugin 1.2.1. Returns Spark Program 'phase-1' failed - rest

This is my first post here. I'm new to Data Fusion and have little to no coding experience.
I want to get data from ZohoCRM into BigQuery, with each ZohoCRM module (e.g. accounts, contacts...) landing in a separate BigQuery table.
To connect to Zoho CRM I obtained a grant code, access token, refresh token and everything else needed, as described here: https://www.zoho.com/crm/developer/docs/api/v2/get-records.html. Then I ran a successful Get Records request via Postman, as described there, and it returned the records from the Zoho CRM Accounts module as JSON.
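For reference, the request that worked in Postman corresponds roughly to the following sketch, using the Zoho CRM v2 endpoints from the docs linked above; the client ID, client secret and refresh token values are placeholders:

```python
# Rough Python equivalent of the Postman request that worked.
# Endpoints follow the Zoho CRM v2 docs; credentials are placeholders.
import requests

ACCOUNTS_URL = "https://www.zohoapis.com/crm/v2/Accounts"
TOKEN_URL = "https://accounts.zoho.com/oauth/v2/token"

def get_access_token(client_id, client_secret, refresh_token):
    """Exchange the long-lived refresh token for a short-lived access token."""
    resp = requests.post(TOKEN_URL, params={
        "refresh_token": refresh_token,
        "client_id": client_id,
        "client_secret": client_secret,
        "grant_type": "refresh_token",
    })
    resp.raise_for_status()
    return resp.json()["access_token"]

def get_accounts(access_token):
    """Fetch records from the Accounts module as JSON."""
    resp = requests.get(
        ACCOUNTS_URL,
        headers={"Authorization": f"Zoho-oauthtoken {access_token}"},
    )
    resp.raise_for_status()
    return resp.json()["data"]

if __name__ == "__main__":
    token = get_access_token("my-client-id", "my-client-secret", "my-refresh-token")
    for record in get_accounts(token):
        print(record["id"], record.get("Account_Name"))
```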
I thought it would all work, so I set the parameters in Data Fusion (DataFusion_settings_1 and DataFusion_settings_2), and it validated fine. Then I previewed and ran the pipeline without deploying it. It failed with the following info in the logs (logs_screenshot). I tried manually entering a few fields in the schema while the format was JSON, and I tried changing the format to CSV; neither worked. I also tried switching Verify HTTPS Trust Certificates on and off. It did not help.
I'd be really thankful for some help. Thanks.
Update, 2020-12-03
I got in touch with a Google Cloud Account Manager, who took my question to their engineers, and here is the info:
The HTTP plugin can be used to "fetch Atom or RSS feeds regularly, or to fetch the status of an external system"; it does not seem to be designed for APIs.
At the moment, a more suitable tool for data collected via APIs is Dataflow: https://cloud.google.com/dataflow
"Google Cloud Dataflow is used as the primary ETL mechanism, extracting the data from the API Endpoints specified by the customer, which is then transformed into the required format and pushed into BigQuery, Cloud Storage and Pub/Sub."
https://www.onixnet.com/insights/gcp-101-an-introduction-to-google-cloud-platform
So in the next few weeks I'll be looking at Dataflow.

Can you please attach the complete logs of the preview run? Make sure to redact any PII. Also, which version of CDF are you using? Is the CDF instance private or public?
Thanks and Regards,
Sagar

Did you end up using Dataflow?
I am also experiencing the same issue with the HTTP plugin, but my temporary workaround was to use Cloud Scheduler to periodically trigger a Cloud Function that fetches my data from the API and exports it as JSON to GCS, which Data Fusion can then read.
My solution is of course non-ideal, so I am still looking for a way to use the Data Fusion HTTP plugin. I was able to make it work for sample data from public API endpoints, but for a reason still unknown to me I can't get it to work for my actual API.
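For anyone who wants to try the same workaround, the Cloud Function looks roughly like this (a minimal sketch only; the API URL, auth header and bucket name are placeholders, and Cloud Scheduler simply calls the function's HTTP trigger on a schedule):

```python
# Sketch of the workaround: Cloud Scheduler calls this HTTP-triggered Cloud
# Function, which fetches the API response and writes it to a GCS bucket
# that Data Fusion then reads from. URL, token and bucket are placeholders.
import json
import datetime
import requests
from google.cloud import storage

API_URL = "https://api.example.com/v1/records"   # hypothetical endpoint
BUCKET = "my-datafusion-staging"                 # hypothetical bucket

def fetch_to_gcs(request):
    """HTTP entry point for the Cloud Function."""
    resp = requests.get(API_URL, headers={"Authorization": "Bearer <token>"})
    resp.raise_for_status()

    blob_name = f"exports/records-{datetime.date.today().isoformat()}.json"
    bucket = storage.Client().bucket(BUCKET)
    bucket.blob(blob_name).upload_from_string(
        json.dumps(resp.json()), content_type="application/json"
    )
    return f"Wrote gs://{BUCKET}/{blob_name}", 200
```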

Related

How can I troubleshoot the 'No available gateway error' in Power BI?

I am having trouble with Power BI dataset refresh via the REST API.
I get this error whenever I try to execute the dataset refresh API:
Last refresh failed: ...
There is no available gateway.
I'm testing on two accounts, and this happens only on one of them.
What is more confusing is that the storage I'm using is cloud based (Azure Data Lake), so it shouldn't need a gateway connection. And when I refresh the datasets manually, it works.
When I get the refresh history for further investigation, I get this:
serviceExceptionJson={"errorCode":"DMTS_MonikerWithUnboundDataSources"}
I could use any help.
I have found a workaround, although I'm not convinced this is really the source of the problem (the why is still very blurry to me).
First, for more context, this is what I'm doing (using the APIs); a sketch of the calls follows this list:
1. Upload a dataset to the workspace (import a .pbix and skip the report; I only need the parameters and Power Query within it)
2. Update the parameters
3. Refresh the dataset
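For reference, the three steps map onto the documented Power BI REST endpoints roughly as follows (a sketch only; the group/dataset IDs, the DataLakeUrl parameter name and the access token are placeholders, not values from my setup):

```python
# Hedged sketch of the three API calls described in the steps above.
# Group/dataset ids, parameter name and access token are placeholders.
import requests

BASE = "https://api.powerbi.com/v1.0/myorg/groups/{group_id}"
HEADERS = {"Authorization": "Bearer <access_token>"}

def import_pbix(group_id, pbix_path, dataset_name):
    # 1. Import the .pbix, skipping the report
    url = BASE.format(group_id=group_id) + \
          f"/imports?datasetDisplayName={dataset_name}&skipReport=true"
    with open(pbix_path, "rb") as f:
        return requests.post(url, headers=HEADERS, files={"file": f}).json()

def update_parameters(group_id, dataset_id, url_value):
    # 2. Update the (hypothetical) data lake URL parameter
    url = BASE.format(group_id=group_id) + f"/datasets/{dataset_id}/Default.UpdateParameters"
    body = {"updateDetails": [{"name": "DataLakeUrl", "newValue": url_value}]}
    requests.post(url, headers=HEADERS, json=body).raise_for_status()

def refresh_dataset(group_id, dataset_id):
    # 3. Trigger the refresh, then return the latest refresh history entry
    url = BASE.format(group_id=group_id) + f"/datasets/{dataset_id}/refreshes"
    requests.post(url, headers=HEADERS).raise_for_status()
    return requests.get(url + "?$top=1", headers=HEADERS).json()
```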
The thing is, I have a parameter for the URL of the data lake data, and there are two possible values of that URL depending on the use case/user.
So what I did was remove that parameter and make two different .pbix files, each with its specific URL; I then import the .pbix that matches the use case.
I still don't understand why I get a gateway error rather than an "error while processing data" message (seeing that it seems to be a data access issue).

Need a design/approach to build a custom sync engine between Elasticsearch and REST APIs

My application interacts with an Elasticsearch instance for all data retrieval and search. Elasticsearch needs to be fed data from a REST API (bulk data of around 2-4 GB). For that I need to write my own REST client to feed Elasticsearch from the API, which is perfectly alright.
Since the client database is not exposed and we are integrating with REST endpoints only, how can I fetch just the most recently updated data from the database (MongoDB)? There is also no "last updated" field in MongoDB.
I need to write a custom transporter utility that runs on a fixed schedule, fetches the data from the REST APIs, compares it with what is in Elasticsearch, and updates only the modified content.
Please help! Thanks.
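One possible shape for such a transporter, sketched with the Python Elasticsearch client: since the source exposes no "last updated" field, the utility compares a content hash of each record against what is already stored in Elasticsearch and re-indexes only the changed documents. The API URL, index name and ID field are assumptions, not part of the original question.

```python
# Sketch: fetch from the REST API, hash each record, compare against the
# stored hash in Elasticsearch, and bulk-index only new/changed documents.
import hashlib
import json
import requests
from elasticsearch import Elasticsearch, NotFoundError
from elasticsearch.helpers import bulk

API_URL = "https://api.example.com/records"   # hypothetical REST endpoint
INDEX = "records"                             # hypothetical index name

es = Elasticsearch("http://localhost:9200")

def content_hash(record):
    """Stable hash of the record body, used as a change marker."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

def changed_records():
    for record in requests.get(API_URL).json():
        doc_id, digest = record["id"], content_hash(record)
        try:
            stored = es.get(index=INDEX, id=doc_id)["_source"]
            if stored.get("_content_hash") == digest:
                continue  # unchanged, skip
        except NotFoundError:
            pass  # new document, index it
        record["_content_hash"] = digest
        yield {"_op_type": "index", "_index": INDEX, "_id": doc_id, "_source": record}

def run_sync():
    """Run on a fixed schedule (cron, scheduler, etc.)."""
    ok, errors = bulk(es, changed_records())
    print(f"indexed/updated {ok} documents, {len(errors)} errors")
```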

Azure Batch support for Data Lake Store Linked Service

I am using a Data Factory pipeline with a custom activity (configured to run on Azure Batch) that has a Data Lake Store input dataset and output dataset. The Data Lake Store linked service uses service-to-service auth (service principal) and works fine when used in a Copy activity created through the Copy Wizard. But when it is used with a custom activity that tries to check whether a file is present in the data lake, the activity fails with the error "Authorization is required". When Azure Blob Storage is used as the input and output datasets instead, the same custom activity works fine.
It seems like an issue with the Azure Batch compute node not being able to authenticate against Data Lake Store. Please help if you have solved the above-mentioned problem.
I had this exact same issue about 3 weeks ago. I feel your pain!
This is a Microsoft bug!
After much trial and error and many redeployments, I raised a support ticket with Microsoft, who confirmed that service principal authentication for Data Lake Store currently only works with Copy activities, not with custom activities.
This is the official response I got on Monday, 10th April:
The issue happen because of a bug that custom activity's connector schema doesn't match the latest published connector schema. Actually, we notice the issue on custom activity and have plan to fix & deploy to prod in next 2 weeks.
Be aware that if you change your linked service back to use a session token etc., you'll also need to redeploy the pipelines that contain the custom activities. Otherwise you'll get another error, something like the following...
Access is forbidden, please check credentials and try again. Code: 'AuthenticationFailed' Message: 'Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.'
Hope this helps.
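For what it's worth, the file-existence check the custom activity performs can also be reproduced directly against Data Lake Store with the same service principal, which is one way to confirm whether the principal itself has access while the linked-service bug is outstanding. A minimal sketch, assuming the azure-datalake-store Python package and placeholder credentials:

```python
# Minimal sketch: verify that the service principal itself can reach the
# Data Lake Store account, independently of the Data Factory linked service.
# Tenant/client ids, secret, store name and path are placeholders.
from azure.datalake.store import core, lib

creds = lib.auth(
    tenant_id="<tenant-id>",
    client_id="<service-principal-app-id>",
    client_secret="<service-principal-key>",
)
adl = core.AzureDLFileSystem(creds, store_name="<datalake-store-name>")

# The same check the custom activity attempts: is the input file there?
print(adl.exists("/input/somefile.csv"))
```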

How to automatically create a new file based on an existing file within Google Cloud Storage

This is the first time I've used Google Cloud, so I might be asking this question in the wrong place.
 
An information provider uploads a new file to Google Cloud Storage every day.
The file contains the information for all my clients/departments.
I have to sort through the information and create a new file (or files) containing only the relevant information for each department in my company, so that everyone gets just the information that is relevant to them (security).
I can't figure out what steps I need to follow to complete this task.
Can you help me?
You want to have a process that starts automatically and subsequently generates a new file once you upload something to Google Cloud Storage.
The easiest way to handle this is with Object Change Notifications. You can set up Object Change Notifications per bucket, and this will send a POST request to a URL that you define.
You can then easily set up a server (or run it on App Engine) that executes an action based on the POST request it receives.
There is an even simpler option (although still in alpha) named Cloud Functions. Cloud Functions is a serverless service that provides event-based microservices (e.g. 'do this' when a new file is uploaded to GCS). This means you only have to write the code that defines what should happen when a new file is uploaded, and Cloud Functions will take care of executing it whenever a file lands in GCS. See this tutorial on using Cloud Functions with Google Cloud Storage.
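To make the Cloud Functions option concrete, a function triggered by the daily upload could split the provider's file into one object per department, along these lines (a sketch only; the output bucket name, the CSV layout and the "department" column are assumptions about your data):

```python
# Sketch of the Cloud Functions approach: triggered when the provider's file
# lands in the bucket, split it into one file per department in another bucket.
import csv
import io
from collections import defaultdict
from google.cloud import storage

OUTPUT_BUCKET = "per-department-files"   # hypothetical output bucket

def split_by_department(event, context):
    """Background Cloud Function triggered by a GCS object finalize event."""
    client = storage.Client()
    source = client.bucket(event["bucket"]).blob(event["name"])
    rows = csv.DictReader(io.StringIO(source.download_as_text()))

    # Group rows by the (assumed) "department" column
    per_dept = defaultdict(list)
    for row in rows:
        per_dept[row["department"]].append(row)

    # Write one CSV per department so each team only sees its own data
    out_bucket = client.bucket(OUTPUT_BUCKET)
    for dept, dept_rows in per_dept.items():
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=dept_rows[0].keys())
        writer.writeheader()
        writer.writerows(dept_rows)
        out_bucket.blob(f"{dept}/{event['name']}").upload_from_string(buf.getvalue())
```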

Google SQL Cloud operations callback?

I currently have an application that triggers import jobs on Google Cloud SQL using their API:
https://cloud.google.com/sql/docs/admin-api/v1beta4/instances/import
This works great. However, this is only a request to import an SQL file; I have to check a minute or two afterwards that the request actually succeeded.
What I would like is to somehow register a callback that notifies my application when the operation is complete. Then I can delete the bucket item and mark the data as persisted.
I have no idea if this is possible, but I would be grateful for any advice. Perhaps the Pub/Sub API could be used for this, but so far I have been unable to find any documentation on how that would be done.
There's currently no out-of-the-box way to do this. You need to poll the operation status to determine when it has finished.
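A minimal polling sketch with the Google API client for Python, assuming Application Default Credentials and placeholder project, instance and file names: the import call returns an operation whose status can be polled until it is DONE, at which point the bucket object can be cleaned up.

```python
# Sketch: start the import via the Cloud SQL Admin API, then poll the
# returned operation until its status is DONE. Names are placeholders.
import time
from googleapiclient import discovery

PROJECT = "my-project"   # hypothetical project id

def import_and_wait(instance, gcs_uri, database):
    sqladmin = discovery.build("sqladmin", "v1beta4")

    # Kick off the import (the method is import_ because "import" is a Python keyword)
    body = {"importContext": {"uri": gcs_uri, "database": database, "fileType": "SQL"}}
    op = sqladmin.instances().import_(project=PROJECT, instance=instance, body=body).execute()

    # Poll the operation until it finishes
    while op.get("status") != "DONE":
        time.sleep(10)
        op = sqladmin.operations().get(project=PROJECT, operation=op["name"]).execute()

    if "error" in op:
        raise RuntimeError(op["error"])
    return op

# Once the operation is DONE, it is safe to delete the imported file from the
# bucket (e.g. with the google-cloud-storage client) and mark the data as persisted.
```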