Is it possible to create an H2OFrame using H2O's REST API, and if so, how?
My main objective is to use models stored inside H2O to make predictions on external H2OFrames.
I need to be able to generate those H2OFrames externally from JSON (I suppose by calling an endpoint).
I read the API documentation but couldn't find any clear explanation.
I believe that the closest endpoints are
/3/CreateFrame, which creates random data, and /3/ParseSetup,
but I couldn't find any reliable tutorial.
Currently there is no REST API endpoint to directly convert a JSON record into a Frame object. Thus, the only way forward for you would be to first write the data to a CSV file, then upload it to h2o using POST /3/PostFile, and then parse it using POST /3/Parse.
(Note that the POST /3/PostFile endpoint is not in the documentation. This is because it is handled separately from the other endpoints. Basically, it's an endpoint that takes an arbitrary file in the body of the POST request and saves it as a "raw data file".)
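For illustration, here is a rough sketch of that workflow using Python's requests against an assumed local H2O instance; the field names follow the v3 schemas, but double-check them against the ParseSetup response for your H2O version:

import json
import requests

H2O = "http://localhost:54321"  # assumed local H2O instance

# 1. Upload the raw CSV bytes; PostFile stores them as a "raw data file".
with open("scoring_data.csv", "rb") as f:
    requests.post(H2O + "/3/PostFile",
                  params={"destination_frame": "scoring_data.csv"},
                  files={"file": f}).raise_for_status()

# 2. Let H2O guess the separator, column types, etc. for the raw file.
setup = requests.post(H2O + "/3/ParseSetup",
                      data={"source_frames": json.dumps(["scoring_data.csv"])}).json()

# 3. Turn the raw file into a real Frame by echoing the guessed settings back.
requests.post(H2O + "/3/Parse", data={
    "destination_frame": "scoring_data.hex",
    "source_frames": json.dumps(["scoring_data.csv"]),
    "parse_type": setup["parse_type"],
    "separator": setup["separator"],
    "number_columns": setup["number_columns"],
    "column_names": json.dumps(setup["column_names"]),
    "column_types": json.dumps(setup["column_types"]),
    "check_header": setup["check_header"],
    "chunk_size": setup["chunk_size"],
    "delete_on_done": "true",
}).raise_for_status()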
The same job is much easier to do in Python or in R: for example, to upload some dataset into h2o for scoring, you only need to say
df = h2o.H2OFrame(plaindata)
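To put that one-liner in the context of the scoring objective, here is a minimal sketch (the model id "my_model" and the column values are hypothetical):

import h2o

h2o.connect(url="http://localhost:54321")        # attach to the running cluster
df = h2o.H2OFrame({"x1": [1.5], "x2": [0.3]})    # build a frame from plain Python data
model = h2o.get_model("my_model")                # an already-trained model on the cluster
preds = model.predict(df)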
I am already doing something similar in my project. Since there is no REST API endpoint to directly convert a JSON record into a Frame object, I am doing the following:
1. For model building: first transfer and write the data to a CSV file on the machine where the h2o server or cluster is running. Then import the data into h2o using POST /3/ImportFiles, and then parse it and build a model, etc. I am using the h2o-bindings APIs (RESTful APIs) for this. Since my data is large (hundreds of MBs to a few GBs), I use /3/ImportFiles instead of POST /3/PostFile, as the latter is slow for uploading large data.
2. For model scoring or prediction: I am using the model MOJO and POJO. In your case, you can use POST /3/PostFile as suggested by Pasha if your data is not large. However, as per the h2o documentation, it is advisable to use the MOJO or POJO for model scoring or prediction in a production environment rather than calling the h2o server/cluster directly. MOJOs and POJOs are thread safe, so you can scale scoring with multithreading for concurrent requests.
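As an illustration of the MOJO route without calling the cluster, here is a sketch using the h2o Python package's genmodel wrapper (the file paths are hypothetical, and h2o-genmodel.jar must be available locally):

import h2o
import pandas as pd

# Incoming JSON record(s) converted to a pandas frame for scoring.
record = pd.DataFrame([{"x1": 1.5, "x2": 0.3}])

# Score against the exported MOJO without talking to the h2o cluster at all.
preds = h2o.mojo_predict_pandas(record,
                                mojo_zip_path="/models/gbm_model.zip",
                                genmodel_jar_path="/models/h2o-genmodel.jar")
print(preds)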
Related
Currently I am working with a Node.js-based pipeline with backpressure that downloads all data from several API endpoints; some of these endpoints have data related to one another.
The idea is to refactor towards an application that is more maintainable than the current one and improves how the data is updated. The current approach is the following.
The first step (map) is to download all data from the endpoints and push it into several topics; some of this data is complex, and data from one endpoint is needed in order to retrieve data from another endpoint.
The second step (reduce) is to read that data from all topics and push only the data we need into a SQL database.
The questions are:
Could this be a good approach to the problem?
Would it be better to use Kafka Streams, so that KSQL handles the transforms and only a microservice publishes into the database?
The architecture schema is the following, and real time is not necessary for this data:
Thanks
I really like the way Spring cloud function decouples the business logic from the runtime target (local or cloud) and makes it easy to integrate with serverless providers.
I plan to use SCF with AWS Lambda behind an API gateway to design the backend of a system.
However, I am not completely clear on the recommended way to handle REST-related parameters such as query params, headers, path, etc. inside Spring Cloud Functions.
As per our initial analysis, we could derive two possible approaches:
1. When enabling "Lambda proxy integration" in API Gateway, query params and other information are available as Message headers inside the SCF.
2. We can use "Mapping templates" in API Gateway to map all the required information into a JSON body and deserialize it as a POJO to take input directly into the SCF. This way, the SCF does not need to care how the required data is passed to the API.
What is the recommended way to achieve this? Are we missing something that enables to do this in a better way?
I don't think you are missing anything feature-wise, except perhaps that it might also be convenient to work with composite functions - e.g. marshal|transform, where marshal is a Function<Message<?>, ?> and transform is the business logic. The marshal function could be generic (converting to some sort of canonical form) and could be provided as an autoconfiguration in a shared library, for instance.
I have a pyspark script as part of an oozie job. The actions are as follows:
1. Fetch data from a data store.
2. Some data munging operations on the cluster.
3. Download the data. Currently, I am doing a toPandas() on the pyspark dataframe. [Doing this to easily convert it to a JSON dump.]
4. Push the data to a REST API.
Step (3) is only necessitated by step (4), since the data needs to be on the driver to be able to make a REST call. However, I have noticed that step (3) is responsible for the variable execution time of my script, as well as for slowing it down. My question: is it possible to invoke and POST to a REST API from the worker nodes? I saw some examples of using a GET request against a REST API (https://dataplatform.cloud.ibm.com/analytics/notebooks/52845a4a-1b5e-4f6e-b1a3-f312d796a93a/view?access_token=e3f303d7dd90138a9cf1fb77b00265a7b02aa12b891c2018e2e547f2050ef4e0), but this did not work for my use case.
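For what it's worth, executors can issue the POST themselves via foreachPartition, which avoids pulling everything back to the driver; this is a general Spark pattern rather than something from the script above, and the endpoint URL plus the availability of requests on the workers are assumptions:

import json
import requests

API_URL = "https://example.com/api/ingest"  # hypothetical endpoint

def post_partition(rows):
    # Serialize one partition's rows and POST them from the executor.
    payload = [row.asDict() for row in rows]
    if payload:
        resp = requests.post(API_URL,
                             data=json.dumps(payload),
                             headers={"Content-Type": "application/json"},
                             timeout=60)
        resp.raise_for_status()

# spark_df is the DataFrame produced by the munging step (2).
spark_df.foreachPartition(post_partition)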
I want to use the Tarantool database for logging user activity.
Are there any out-of-the-box solutions to create a web dashboard with nice charts based on the collected data?
A long time ago, using a very old version of Tarantool, I created a draft of tarbon - a time-series database with an interface identical to carbon-cache.
Since that time the protocol has changed, but the general idea is still the same: use spaces to store data, use compact data organization and the right indexes to access spaces as time-series rows, and use Lua to prepare the resulting JSON.
That solution had excellent performance (on both reads and writes), but that old version lacked disk storage, and without disk I was very limited in metrics capacity.
Tarantool has an embedded Lua language, so you could generate JSON from your data and use any charting library. For example, D3.js has a method to load JSON directly from a URL.
d3.json(url[, callback])
Creates a request for the JSON file at the specified url with the mime type "application/json". If a callback is specified, the request is immediately issued with the GET method, and the callback will be invoked asynchronously when the file is loaded or the request fails; the callback is invoked with two arguments: the error, if any, and the parsed JSON. The parsed JSON is undefined if an error occurs. If no callback is specified, the returned request can be issued using xhr.get or similar, and handled using xhr.on.
You could also look at c3.js, a simple facade for d3.
We are currently using visjs version 3 to map the dependencies of our custom built workflow engine. This has been WONDERFUL because it helps us to visualize the flow and find invalid or missing dependencies. What we want to do next is simplify the process of building the dependencies using the visjs manipulation feature. The idea would be that we would display a large group of nodes and allow the user to order them correctly. We then want to be able to submit that json structure back to the server for processing.
Would this be possible?
Yes, this is possible.
Vis.js dispatches various events that relate to user interactions with the graph (e.g. manipulations or position changes), for which you can add handlers that modify or store the data on change. If you use DataSets to store the nodes and edges in your network, you can always use the DataSets' get() function to retrieve all elements in your handler in JSON format. Then, in your handler, just use an ajax request to transmit the JSON to your server, either storing the entire graph in your DB or saving the JSON as a file.
The opposite applies for loading the graph: simply query the JSON from your server and inject it into the node and edge DataSets using the set method.
You can also store the network's current options using the network's getOptions method, which returns all applied options as JSON.