I have a pyspark script as part of an oozie job. The actions are as follows:
1. Fetch data from a data store.
2. Some data munging operations on the cluster.
3. Download the data. Currently, I am doing a toPandas() on the PySpark dataframe. (Doing this to easily convert it to a JSON dump.)
4. Push the data to a REST API.
Step (3) is only necessitated by step (4), as you need the data to be on the driver to be able to make the REST call. However, I have noticed that step (3) is responsible for the variable execution time of my script as well as for slowing it down. My question is: is it possible to invoke a POST to a REST API from the worker nodes? I saw some examples of using a GET request from REST (https://dataplatform.cloud.ibm.com/analytics/notebooks/52845a4a-1b5e-4f6e-b1a3-f312d796a93a/view?access_token=e3f303d7dd90138a9cf1fb77b00265a7b02aa12b891c2018e2e547f2050ef4e0), but this did not work for my use case.
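To make the question concrete, this is the kind of thing I have in mind: posting each partition's rows directly from the executors with foreachPartition, so nothing has to be collected to the driver. The endpoint URL is a placeholder, df is the munged dataframe from step (2), and it assumes the requests library is installed on the worker nodes.

import requests  # must be available on the worker nodes

API_URL = "https://example.com/api/records"  # placeholder endpoint

def post_partition(rows):
    # One session per partition avoids re-opening a connection for every record.
    session = requests.Session()
    for row in rows:
        payload = row.asDict()  # plain dict so it serializes cleanly to JSON
        resp = session.post(API_URL, json=payload, timeout=30)
        resp.raise_for_status()

# Runs on the executors, so the data never has to come back to the driver.
df.foreachPartition(post_partition)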
Related
I'm trying to pull data from Hubspot to my SQL Server Database through an Azure Data Factory pipeline with the usage of a REST dataset. I have problems setting up the right pagination rules. I've already spent a day on Google and MS guides, but I find it hard to get it working properly.
This is the source API. I am able to connect and pull the first set of 20 rows. The response body returns an offset which can be used with vidOffset=.
I need to feed the vid-offset value from the response body back into the next HTTP request. Also, the process needs to stop when has-more returns 'false'.
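For reference, this is roughly the pagination behaviour I am trying to reproduce in the pipeline, sketched in plain Python with requests (the hapikey is the demo key from the URL above; the count parameter and the response fields contacts, has-more and vid-offset are assumptions based on what I see in the body):

import requests

BASE_URL = "https://api.hubapi.com/contacts/v1/lists/all/contacts/all"
params = {"hapikey": "demo", "count": 20}
contacts = []

while True:
    body = requests.get(BASE_URL, params=params, timeout=30).json()
    contacts.extend(body.get("contacts", []))
    if not body.get("has-more"):               # stop when has-more is false
        break
    params["vidOffset"] = body["vid-offset"]   # feed the offset into the next request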
I tried to reproduce the same in my environment and I got the below results:
First, I created a linked service with this URL: https://api.hubapi.com/contacts/v1/lists/all/contacts/all?hapikey=demo&vidOffset
Then I created the pagination end-condition rule with $.has-more and the absolute URL.
For demo purposes, I used a storage account as the sink.
The pipeline run was successful; see the image below for reference.
For more information, refer to this MS document.
Hi, I am processing a set of ~50K records from a pipe-delimited flat file in Azure Data Factory and need to invoke a REST API call for each input record. So, I am using a ForEach loop to access each record and, inside the loop, a Copy activity to invoke the REST API call.
My question is: can I invoke the REST API call in bulk for all the records at once, since the ForEach loop is slowing down pipeline execution? I want to remove the ForEach loop, and also process the API JSON response and store it in an Azure SQL database.
Thanks
You will have to check the pagination properties so that you can decide how much payload you need to return from the source API:
https://learn.microsoft.com/en-us/azure/data-factory/connector-rest?tabs=data-factory#pagination-support
Also, if you need to store the API JSON response in Azure SQL, you can then work with it using the built-in JSON functions such as OPENJSON and JSON_VALUE.
More details can be found in this link:
https://learn.microsoft.com/en-us/azure/azure-sql/database/json-features
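As a rough illustration of the JSON part, here is a sketch in Python with pyodbc that lands the raw response and then shreds it with OPENJSON at query time; the connection string, the api_response(payload NVARCHAR(MAX)) table, and the column paths are all made up for the example:

import json
import pyodbc

# Placeholder connection string; fill in your own server, database and credentials.
conn = pyodbc.connect("DRIVER={ODBC Driver 18 for SQL Server};SERVER=<server>;DATABASE=<db>;UID=<user>;PWD=<password>")
cur = conn.cursor()

# Land the raw API response as-is in an NVARCHAR(MAX) column.
response_body = {"contacts": [{"vid": 1, "email": "a@example.com"}]}
cur.execute("INSERT INTO api_response (payload) VALUES (?)", [json.dumps(response_body)])
conn.commit()

# Shred the stored JSON into columns with OPENJSON when querying.
cur.execute("""
    SELECT c.vid, c.email
    FROM api_response
    CROSS APPLY OPENJSON(payload, '$.contacts')
         WITH (vid INT '$.vid', email NVARCHAR(200) '$.email') AS c
""")
print(cur.fetchall())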
Currently I am working with a Node.js-based pipeline with backpressure in order to download all data from several API endpoints; some of these endpoints have data related to each other.
The idea is to refactor it in order to build a more maintainable application and to improve how the data is updated; the current approach is the following.
The first step (map) is to download all data from the endpoints and push it into several topics; some of this data is complex, and data from one endpoint is needed in order to retrieve data from another endpoint.
The second step (reduce) is to get that data from all topics and push only the data that we need into a SQL database (sketched below).
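For concreteness, a stripped-down sketch of the two steps using kafka-python and requests; the broker address, endpoint URL, topic name, field names and the save_to_sql stub are all placeholders:

import json
import requests
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["localhost:9092"]  # placeholder broker

def save_to_sql(row):
    # Placeholder: the real pipeline would INSERT this row into the SQL database.
    print(row)

# Map step: download from an endpoint and push the raw records into a topic.
producer = KafkaProducer(bootstrap_servers=BROKERS,
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))
for record in requests.get("https://api.example.com/items", timeout=30).json():
    producer.send("items-raw", record)
producer.flush()

# Reduce step: read the topic and keep only the fields we need before writing to SQL.
consumer = KafkaConsumer("items-raw", bootstrap_servers=BROKERS,
                         value_deserializer=lambda v: json.loads(v.decode("utf-8")),
                         auto_offset_reset="earliest")
for message in consumer:
    item = message.value
    save_to_sql({"id": item["id"], "name": item["name"]})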
The questions are:
Could this be a good approach to this problem?
Is it better to use Kafka Streams, so that KSQL can do the transforms and only a microservice publishes into the database?
The architecture schema is the following, and real time is not necessary for this data:
Thanks
We're trying to put together a proof-of-concept where we read data from an API and store it in a blob. We have a For Each activity that loops through a file that has parameters that are used in an API call. We are trying to do this in parallel. The first API call works fine, but the second call returns a 429 error. Are we asking the impossible?
Error code 429 usually means too many requests. Inside the ForEach activity, use the Sequential execution option and see if that helps.
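Outside of ADF, the usual way to cope with 429 when calling an API in a loop is to honour the Retry-After header and back off before retrying; a minimal Python sketch of that idea (the URL and retry limits are placeholders):

import time
import requests

def get_with_backoff(url, max_retries=5):
    # Retry on 429, honouring Retry-After when the API provides it.
    delay = 1
    for _ in range(max_retries):
        resp = requests.get(url, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        # Prefer the server's hint; otherwise back off exponentially.
        delay = int(resp.headers.get("Retry-After", delay * 2))
        time.sleep(delay)
    raise RuntimeError("Gave up after repeated 429 responses")

print(get_with_backoff("https://api.example.com/data").status_code)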
Is it possible to create a H2OFrame using the H2O's REST API and if so how?
My main objective is to utilize models stored inside H2O so as to make predictions on external H2OFrames.
I need to be able to generate those H2OFrames externally from JSON (I suppose by calling an endpoint)
I read the API documentation but couldn't find any clear explanation.
I believe that the closest endpoints are /3/CreateFrame (which creates random data) and /3/ParseSetup, but I couldn't find any reliable tutorial.
Currently there is no REST API endpoint to directly convert some JSON record into a Frame object. Thus, the only way forward for you would be to first write the data to a CSV file, then upload it to h2o using POST /3/PostFile, and then parse using POST /3/Parse.
(Note that POST /3/PostFile endpoint is not in the documentation. This is because it is handled separately from the other endpoints. Basically, it's an endpoint that takes an arbitrary file in the body of the post request, and saves it as "raw data file").
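As a rough outline of that flow with plain requests, something like the sketch below; the parameter names are only my best guess and should be checked against the /3 schema docs of your h2o version, and the host, file path and frame names are made up:

import requests

H2O = "http://localhost:54321"  # placeholder h2o host

# 1. Upload the raw CSV bytes; h2o keeps it as a "raw data file".
with open("data.csv", "rb") as f:
    raw = requests.post(H2O + "/3/PostFile", files={"file": f}).json()
source_key = raw["destination_frame"]

# 2. Ask h2o to guess the parse settings for that raw file.
setup = requests.post(H2O + "/3/ParseSetup", data={"source_frames": '["%s"]' % source_key}).json()

# 3. Trigger the actual parse into a real Frame, reusing the guessed settings.
requests.post(H2O + "/3/Parse", data={
    "destination_frame": "my_frame",
    "source_frames": '["%s"]' % source_key,
    "parse_type": setup["parse_type"],
    "separator": setup["separator"],
    "number_columns": setup["number_columns"],
    "check_header": setup["check_header"],
    "chunk_size": setup["chunk_size"],
    "delete_on_done": "true",
})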
The same job is much easier to do in Python or in R: for example in order to upload some dataset into h2o for scoring, you only need to say
df = h2o.H2OFrame(plaindata)
I am already doing something similar in my project. Since there is no REST API endpoint to directly convert a JSON record into a Frame object, I am doing the following:
1. For model building: first transfer and write the data into a CSV file on the machine where the h2o server or cluster is running. Then import the data into h2o using POST /3/ImportFiles, and then parse, build a model, etc. I am using the h2o-bindings APIs (RESTful APIs) for this. Since I have large data (hundreds of MBs to a few GBs), I use /3/ImportFiles instead of POST /3/PostFile, as the latter is slow for uploading large data.
2. For model scoring or prediction: I am using the model MOJO and POJO. In your case, you can use POST /3/PostFile as suggested by @Pasha, if your data is not large. But, as per the h2o documentation, it's advisable to use the MOJO or POJO for model scoring or prediction in a production environment rather than calling the h2o server/cluster directly. MOJO and POJO are thread-safe, so you can scale them using multithreading for concurrent requests.
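For example, scoring a MOJO from Python without going through a running h2o cluster can look roughly like this (the feature names and file paths are placeholders, and it assumes the h2o package plus h2o-genmodel.jar are available locally):

import pandas as pd
import h2o

# Rows to score, as a pandas DataFrame; column names must match the training data.
rows = pd.DataFrame([{"feature_a": 5.1, "feature_b": 3.5}])

# Scores locally through the MOJO and h2o-genmodel.jar, without calling the h2o cluster.
preds = h2o.mojo_predict_pandas(rows,
                                mojo_zip_path="model.zip",
                                genmodel_jar_path="h2o-genmodel.jar")
print(preds)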