Currently I am working with a Node.js-based pipeline with backpressure, in order to download all the data from several API endpoints; some of these endpoints have data related between them.
The idea is to refactor it into an application that is more maintainable than the current one, in order to improve the updates over the data. The current approach is the following.
The first step (map) downloads all the data from the endpoints and pushes it into several topics; some of this data is complex, and data from one endpoint is needed in order to retrieve data from another endpoint.
The second step (reduce) gets that data from all the topics and pushes it into a SQL database, but only the data that we need.
The questions are:
Could this be a good approach to this problem?
Would it be better to use Kafka Streams, so that the transforms are done with KSQL and only a single microservice publishes into the database? The kind of topology I have in mind is sketched below.
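For reference, a rough Kafka Streams sketch of what I mean (the topic names, string values, and the join are placeholders, not my real data):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class EndpointPipelineSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // "map" output: raw endpoint data that the downloader pushed into topics
        KStream<String, String> details = builder.stream("endpoint-details");
        KTable<String, String> parents = builder.table("endpoint-parents");

        // "reduce": join the related data and forward only what the SQL sink needs
        details.join(parents, (detail, parent) -> detail + "|" + parent)
               .to("sql-sink"); // a small service (or Kafka Connect JDBC) writes this topic to SQL

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "endpoint-pipeline");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }
}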
The architecture schema is the following (real time is not necessary for this data):
Thanks
I have a simple model with a repository configured, persisting to PostgreSQL. Using spring-data-rest, the APIs are available out of the box for all the CRUD operations.
Now I want to introduce caching with redis-6.0, so that on any write operation (REST APIs for POST, PUT or DELETE), the model is persisted to the db first and then updated in the cache.
For the read operation (REST API GET), the item is looked up in the cache first; if available, use that, or else fall back to the spring-data-rest default behavior, i.e. find it in PostgreSQL.
Write Operations (POST, PUT, DELETE):
Using a RepositoryEventHandler, the HandleAfterCreate, HandleAfterDelete, and HandleAfterSave events are subscribed to and used to sync up the cache. This keeps the cache reasonably up to date. A sketch of what I mean is below.
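Roughly, the write-side handler looks like this ("Book" is a placeholder for my actual entity):

import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.data.rest.core.annotation.HandleAfterCreate;
import org.springframework.data.rest.core.annotation.HandleAfterDelete;
import org.springframework.data.rest.core.annotation.HandleAfterSave;
import org.springframework.data.rest.core.annotation.RepositoryEventHandler;
import org.springframework.stereotype.Component;

@Component
@RepositoryEventHandler // handler methods are matched by their entity parameter type
public class BookCacheSyncHandler {

    private final RedisTemplate<String, Book> redis;

    public BookCacheSyncHandler(RedisTemplate<String, Book> redis) {
        this.redis = redis;
    }

    @HandleAfterCreate
    @HandleAfterSave
    public void onWrite(Book book) {
        // the db write has already happened; refresh the cache entry
        redis.opsForValue().set("book:" + book.getId(), book);
    }

    @HandleAfterDelete
    public void onDelete(Book book) {
        redis.delete("book:" + book.getId());
    }
}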
Read Operations (GET):
I do not see any event listener for the read operation. Read is the only operation for which I want to bypass hitting the db as much as possible, but currently I cannot find a way to do this.
Please let me know if there is a way to listen for the read operation and do a cache lookup first.
Thanks.
I'm looking for a design pattern/library to combine multiple different endpoints so that they are visible as one.
Let's say I have 3 endpoints, each of which returns the same type of objects; of course the objects are different and have unique ids. I want to create an endpoint which calls all of them, combines the results, filters & sorts & pages them, and then returns the results.
There can be many objects, so it would be good to have caching, so that those three endpoints are called only when something has changed (let's say I somehow know when I need to refresh the cache).
I imagine there's a library which can connect multiple endpoints into one, cache the results, and deliver filter & sort & page. Some sort of, let's say, Spring repository: however, the data is not read from a database but from a cache, which gathers it from REST endpoints.
I was looking at the Gateway design pattern, like Spring Cloud Gateway or Zuul Proxy, but they seem to be only wrappers, with no possibility to process the data.
Of course I can do such things manually:
create a controller
call the three endpoints (if a refresh is needed)
fill the cache
read the data from the cache, sort, filter, page, and return it
but if I need to do that multiple times, I'm looking for a library to do it for me; a sketch of the manual approach is below.
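The manual version I mean would be roughly this (the Item type and the three URLs are placeholders, and the cache-refresh logic is left out):

import java.util.Comparator;
import java.util.List;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

@RestController
public class AggregateController {

    // hypothetical shared shape of the objects returned by all three endpoints
    record Item(long id, String name) {}

    private final WebClient client = WebClient.create();

    @GetMapping("/items")
    public Mono<List<Item>> items(@RequestParam int page, @RequestParam int size) {
        // placeholder endpoint URLs
        return Flux.just("http://svc-a/items", "http://svc-b/items", "http://svc-c/items")
                .flatMap(url -> client.get().uri(url).retrieve().bodyToFlux(Item.class))
                .filter(item -> item.name() != null)          // filter
                .sort(Comparator.comparingLong(Item::id))     // sort
                .skip((long) page * size)                     // page
                .take(size)
                .collectList();
    }
}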
You can implement a GatewayFilterFactory with aggregation and caching in Spring Cloud Gateway, but it ends up looking a lot like writing a controller.
Let's say there are two (or more) RESTful microservices serving JSON. Service (A) stores user information (name, login, password, etc.) and service (B) stores messages to/from those users (e.g. sender_id, subject, body, rcpt_ids).
Service (A) on /profile/{user_id} may respond with:
{"id": 1, "name": "Bob"}
{"id": 2, "name": "Alice"}
{"id": 3, "name": "Sue"}
and so on
Service (B), responding at /user/{user_id}/messages, returns a list of messages destined for that {user_id}, like so:
{"id": 1, "subj": "Hey", "body": "Lorem ipsum", "sender_id": 2, "rcpt_ids": [1, 3]},
{"id": 2, "subj": "Test", "body": "blah blah", "sender_id": 3, "rcpt_ids": [1]}
How does the client application consuming these services handle putting the message listing together such that names are shown instead of sender/rcpt ids?
Method 1: Pull the list of messages, then start pulling profile info for each id listed in sender_id and rcpt_ids? That may require hundreds of requests and could take a while. Rather naive and inefficient, and it may not scale with complex apps.
Method 2: Pull the list of messages, extract all the user ids, and make a bulk request for all the relevant users separately... this assumes such a service endpoint exists. There is still a delay between getting the message listing, extracting the user ids, sending the request for bulk user info, and then awaiting the bulk user info response.
Ideally I want to serve out a complete response set in one go (messages and user info). My research brings me to merging of responses at the service layer... a.k.a. Method 3: the API Gateway technique.
But how does one even implement this?
I can obtain the list of messages, extract the user ids, make a call behind the scenes to obtain the users' data, merge the result sets, and then serve this final result up... This works OK with 2 services behind the scenes... But what if the message listing depends on more services? What if I needed to query multiple services behind the scenes, further parse their responses, query more services based on the secondary (tertiary?) results, and then finally merge... where does this madness stop? How does this affect response times?
And I've now effectively created another "client" that combines all the microservice responses into one mega-response... which is no different from Method 1 above... except at the server level.
Is that how it's done in the "real world"? Any insights? Are there any open source projects built on such an API Gateway architecture that I could examine?
The solution we used for this problem was denormalization of the data, with events for updating it.
Basically, a microservice keeps beforehand a subset of the data it requires from other microservices, so that it doesn't have to call them at run time. This data is managed through events: when another microservice is updated, it fires an event with the id as context, which can be consumed by any microservice that has an interest in it. This way the data remains in sync (of course it requires some form of failure handling for the events). This seems like a lot of work, but it helps us with any future decisions regarding consolidation of data from different microservices. Our microservice will always have all the data available locally to process any request, without a synchronous dependency on other services.
In your case, i.e. for showing names with a message, you can keep an extra property for names in Service (B). Whenever a name is updated in Service (A), it fires an update event with the id of the updated name. Service (B) then consumes the event, fetches the relevant data from Service (A), and updates its own database. This way, even if Service (A) is down, Service (B) will still function, albeit with some stale data that will eventually become consistent when Service (A) comes back up, and you will always have some name to show on the UI. A sketch of the consuming side is below.
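A minimal sketch of the Service (B) side, assuming Kafka as the event bus (the topic name, the Profile shape, and the local repository are placeholders):

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;
import org.springframework.web.client.RestTemplate;

@Component
public class UserNameSyncListener {

    // placeholder for Service (B)'s own local name table
    interface LocalUserNameRepository { void saveName(long userId, String name); }

    record Profile(long id, String name) {}

    private final RestTemplate rest = new RestTemplate();
    private final LocalUserNameRepository localNames;

    public UserNameSyncListener(LocalUserNameRepository localNames) {
        this.localNames = localNames;
    }

    // Service (A) fires this event with the updated user's id as the payload
    @KafkaListener(topics = "user-updated")
    public void onUserUpdated(String userId) {
        // fetch the fresh profile from Service (A) and store the name locally
        Profile p = rest.getForObject("http://service-a/profile/{id}", Profile.class, userId);
        localNames.saveName(p.id(), p.name());
    }
}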
https://enterprisecraftsmanship.com/2017/07/05/how-to-request-information-from-multiple-microservices/
You might want to perform response aggregation strategies on your API gateway. I've written an article on how to do this with ASP.NET Core and Ocelot, but there should be a counterpart for other API gateway technologies:
https://www.pogsdotnet.com/2018/09/api-gateway-response-aggregation-with.html
You need to write another service, called an Aggregator, which will internally call both services, get the responses, merge/filter them, and return the desired result. This can easily be achieved in a non-blocking way using Mono/Flux in Spring Reactive. A sketch is below.
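A minimal sketch of such an Aggregator with WebFlux, reusing the profile/message shapes from the question (the service URLs are placeholders):

import java.util.List;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class MessageAggregator {

    private final WebClient client = WebClient.create();

    record Profile(long id, String name) {}
    record Message(long id, String subj, String body, long sender_id, List<Long> rcpt_ids) {}
    record MessageView(long id, String subj, String body, String senderName) {}

    public Mono<List<MessageView>> messagesFor(long userId) {
        Mono<List<Message>> messages = client.get()
                .uri("http://service-b/user/{id}/messages", userId)
                .retrieve().bodyToFlux(Message.class).collectList();

        return messages.flatMap(msgs -> Flux.fromIterable(msgs)
                .map(Message::sender_id).distinct()
                // fan out to Service (A) for each distinct id (rcpt_ids would be resolved the same way)
                .flatMap(id -> client.get()
                        .uri("http://service-a/profile/{id}", id)
                        .retrieve().bodyToMono(Profile.class))
                .collectMap(Profile::id, Profile::name)
                // merge: replace sender ids with names in one response
                .map(names -> msgs.stream()
                        .map(m -> new MessageView(m.id(), m.subj(), m.body(),
                                names.get(m.sender_id())))
                        .toList()));
    }
}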
An API Gateway often does API composition.
But this is a typical engineering problem you face when you have microservices implementing the database-per-service pattern.
The API Composition and Command Query Responsibility Segregation (CQRS) patterns are useful ways to implement such queries.
Ideally I want to serve out a complete response set in one go (messages and user info).
The problem you've described is what Facebook realized years ago, and they decided to tackle it by creating an open source specification called GraphQL.
But how does one even implement this?
It has already been implemented in various popular programming languages, and maybe you can give it a try in the programming language of your choice.
Is it possible to create an H2OFrame using H2O's REST API, and if so, how?
My main objective is to utilize models stored inside H2O so as to make predictions on external H2OFrames.
I need to be able to generate those H2OFrames externally from JSON (I suppose by calling an endpoint).
I read the API documentation but couldn't find any clear explanation.
I believe that the closest endpoints are /3/CreateFrame, which creates random data, and /3/ParseSetup, but I couldn't find any reliable tutorial.
Currently there is no REST API endpoint to directly convert a JSON record into a Frame object. Thus, the only way forward for you would be to first write the data to a CSV file, then upload it to h2o using POST /3/PostFile, and then parse it using POST /3/Parse.
(Note that the POST /3/PostFile endpoint is not in the documentation. This is because it is handled separately from the other endpoints. Basically, it is an endpoint that takes an arbitrary file in the body of the POST request and saves it as a "raw data file".)
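A rough sketch of that flow over raw HTTP (the host/port are the h2o defaults; the query parameter and the parse step are only indicative, since in practice you would feed the output of /3/ParseSetup into /3/Parse):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class H2oUploadSketch {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // 1. POST /3/PostFile: the CSV travels as the raw request body
        //    (destination_frame naming the raw key is an assumption; h2o picks one if omitted)
        HttpRequest upload = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:54321/3/PostFile?destination_frame=data.csv"))
                .POST(HttpRequest.BodyPublishers.ofFile(Path.of("data.csv")))
                .build();
        System.out.println(http.send(upload, HttpResponse.BodyHandlers.ofString()).body());

        // 2. POST /3/ParseSetup with the uploaded key in source_frames to get the
        //    suggested parse parameters, then POST /3/Parse with those parameters
        //    to turn the raw file into a real Frame (omitted here; the exact
        //    parameter set comes from the ParseSetup response).
    }
}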
The same job is much easier to do in Python or in R: for example, in order to upload some dataset into h2o for scoring, you only need to say
df = h2o.H2OFrame(plaindata)
I am already doing something similar in my project. Since there is no REST API endpoint to directly convert a JSON record into a Frame object, I am doing the following:
1- For model building: first transfer and write the data into a CSV file on the machine where the h2o server or cluster is running. Then import the data into h2o using POST /3/ImportFiles, and then parse, build a model, etc. I am using the h2o-bindings APIs (RESTful APIs) for this. Since I have large data (hundreds of MBs to a few GBs), I use /3/ImportFiles instead of POST /3/PostFile, as the latter is slow for uploading large data.
2- For model scoring or prediction: I am using the model MOJO and POJO. In your case, use POST /3/PostFile as suggested by @Pasha if your data is not large. But, as per the h2o documentation, it is advisable to use the MOJO or POJO for model scoring or prediction in a production environment, rather than calling the h2o server/cluster directly. MOJOs and POJOs are thread safe, so you can scale them using multithreading for concurrent requests. A scoring sketch is below.
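For reference, scoring with a MOJO via the h2o-genmodel library looks roughly like this (the model file and column names are placeholders, and a binomial model is assumed):

import hex.genmodel.MojoModel;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.prediction.BinomialModelPrediction;

public class MojoScoringSketch {
    public static void main(String[] args) throws Exception {
        // load the MOJO exported from h2o (path is a placeholder)
        EasyPredictModelWrapper model = new EasyPredictModelWrapper(
                MojoModel.load("GBM_model.zip"));

        // one input row; column names must match the training frame
        RowData row = new RowData();
        row.put("feature1", "1.5");
        row.put("feature2", "red");

        // EasyPredictModelWrapper is thread safe, so it can serve concurrent requests
        BinomialModelPrediction p = model.predictBinomial(row);
        System.out.println("label: " + p.label + ", p(1): " + p.classProbabilities[1]);
    }
}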
I have set up a subscription between Orion Context Broker and Cosmos BigData using Cygnus, and data is properly persisted in Cosmos when an update is made in Orion.
But I want to analyze the data in Cosmos, return the results to Orion, and finally access the resulting data in Orion from the "outside".
How would one do this? Of course, I would like the solution I build to be as "automated" as possible, but mostly I just want to solve this problem.
Any advice is much appreciated!
As a general response (as the question is also very general ;), what you need is a process that accesses the information stored in Cosmos (either using HDFS APIs, such as WebHDFS or HttpFs, Hive queries, general MapReduce jobs on top of Hadoop, etc.) and then implements the client side of the NGSI API that Orion exposes, in order to inject context elements into Orion based on the information you retrieved from Cosmos. The key operation to do so in the Orion API is updateContext.
The degree of automation depends on how you implement that process. It can be as automated as you want.
EDIT: considering the comments on this answer, I will try to add more detail.
What I mean is to develop a piece of software (let's call it APOS, A Piece Of Software) implementing the following behaviour:
APOS will grab data from Cosmos using any of the interfaces provided by Cosmos, i.e. WebHDFS/HttpFs, Hive, MapReduce jobs, etc.
APOS will process the data to produce some result.
APOS will inject that result into Orion, using the Orion REST API described in the Orion user manual. The updateContext operation is particularly useful for that task. From a client-server point of view, Orion is a server exposing a REST API and APOS is the client interacting with that server.
It is completely up to you how to implement this APOS and how to orchestrate the flow from 1 to 3 (e.g. it can run in batch mode every midnight, be triggered by user interaction on a web portal, etc.). A sketch of step 3 is shown below.
At the present moment, FI-WARE doesn't provide any generic enabler to convert Cosmos data to NGSI, given that each particular realization of steps 1 to 3 above is different and depends on the use case. However, note that there is a software component named Cygnus which implements the opposite direction: from NGSI to Cosmos.
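For illustration, step 3 could look roughly like this, assuming the NGSI v1 updateContext payload (the entity id, type, attribute, and host are placeholders):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AposUpdateContext {
    public static void main(String[] args) throws Exception {
        // placeholder result produced in step 2, wrapped as an NGSI v1 context element
        String payload = """
            {
              "contextElements": [
                {
                  "type": "AnalysisResult",
                  "isPattern": "false",
                  "id": "Result1",
                  "attributes": [
                    { "name": "average", "type": "float", "value": "22.5" }
                  ]
                }
              ],
              "updateAction": "APPEND"
            }
            """;

        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:1026/v1/updateContext"))
                .header("Content-Type", "application/json")
                .header("Accept", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();

        HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body());
    }
}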