Google Data Fusion: "Looping" over input data to then execute multiple Restful API calls per input row - google-cloud-data-fusion

I have the following challenge that I would like to solve, preferably in Google Data Fusion:
I have one web service that returns about 30-50 elements describing an invoice in a JSON payload like this:
{
  "invoice-services": [
    {
      "serviceId": "[some-20-digit-string]",
      // some other stuff omitted
    },
    [...]
  ]
}
For each occurrence of serviceId I then need to call another web service, https://example.com/api/v2/services/{serviceId}/items, where each serviceId comes from the first call. I am only interested in the data from the second call, which is to be persisted into BigQuery. This second service doesn't support wildcards or any other mechanism to aggregate the items - i.e. if I get 30 serviceIds from the first call, I need to call the second web service 30 times.
I have made the first call work, I have made the second call work with a hard-coded serviceId, and I have the persistence into BigQuery working. These calls simply use the Data Fusion HTTP adapter.
However, how can I use the output of the first service in such a way that I issue one call to the second service for each row returned from the first call - effectively looping over all serviceIds?
I completely appreciate that this is very easy in Python code, but for maintainability and fit with our environment I would prefer to solve this in Data Fusion or, if need be, in any of the other as-a-Service offerings from Google.
Any help is really appreciated!
J
PS: This is NOT a big data problem - I am looking at about 50 serviceIds and maybe 300 items.
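PPS: For illustration, this is the loop I am trying to express in Data Fusion, as a minimal sketch in plain Java (the first endpoint's URL is a placeholder, and a real pipeline would use a proper JSON parser rather than a regex):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InvoiceItemsFanOut {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // 1. First call: fetch the invoice services (placeholder URL).
        String payload = client.send(
                HttpRequest.newBuilder(URI.create("https://example.com/api/v2/invoice-services")).build(),
                HttpResponse.BodyHandlers.ofString()).body();

        // 2. Pull out every serviceId.
        Matcher m = Pattern.compile("\"serviceId\"\\s*:\\s*\"(\\d{20})\"").matcher(payload);

        // 3. One second-service call per serviceId; each response is what
        //    would be persisted into BigQuery.
        while (m.find()) {
            String serviceId = m.group(1);
            String items = client.send(
                    HttpRequest.newBuilder(URI.create(
                            "https://example.com/api/v2/services/" + serviceId + "/items")).build(),
                    HttpResponse.BodyHandlers.ofString()).body();
            System.out.println(serviceId + " -> " + items);
        }
    }
}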

Related

Is there a way to GET all items in a global secondary index via a REST API using AWS API Gateway? I can only GET some

I created a REST API using AWS API Gateway and DynamoDB without AWS Lambda (I wrote mapping templates for both the integration request and the integration response instead), using a GET API method, a POST integration HTTP method, and the Scan action. I'm fetching from a global secondary index in DynamoDB to make my scan smaller than the original table.
It's working well, except that I am only able to scan roughly 1,000 of the 7,500 items I need. I looked into paginating the JSON via an S3 bucket, but I really want to keep it simple with just API Gateway and DynamoDB, if possible.
Is there a way to get all 7,500 of the items into my payload with some modification to my integration request and/or response mappings? If not, what do you suggest?
Below is the mapping code I'm using. It works for a 1,000-item JSON payload instead of the 7,500 that I would like to have:
Integration Request:
{
  "TableName": "TrailData",
  "IndexName": "trail-index"
}
Integration Response:
#set($inputRoot = $input.path('$'))
[
#foreach($elem in $inputRoot.Items)
  {
    "id": $elem.id.N,
    "trail_name": "$elem.trail_name.S",
    "challenge_rank": $elem.challenge_rank.N,
    "challenge_description": "$elem.challenge_description.S",
    "reliability_description": "$elem.reliability_description.S"
  }#if($foreach.hasNext),#end
#end
]
Here is a screenshot of the GET method settings for my API: [API screenshot]
I have already checked out a related Stack Overflow question, but I can't figure out how to apply it to my situation. I have put a lot of time into this.
I am aware of the 1MB query limit for DynamoDB, but the limited data I am returning is only 142KB.
I appreciate any help or suggestions. I am new to this. Thank you!
This limitation is not related to the DynamoDB Scan itself: the #foreach directive in a VTL response template is restricted to 1,000 iterations.
You can confirm this by simply removing the #foreach (or the entire response template); you should then see all the records (up to the 1MB Scan limit) come back, though not well formatted.
The easiest solution is to pass request parameters that restrict the result to only the necessary attributes from the DynamoDB table:
{
  "TableName": "ana-qa-linkshare",
  "Limit": 2000,
  "ProjectionExpression": "challenge_rank,reliability_description,trail_name"
}
However, you can also avoid any single loop going over 1,000 iterations by nesting multiple #foreach loops. This gets a little complex within the template (using a Lambda function would be simpler), but here is how it might look. With $maxRec = 500, the inner loop runs 500 times and the outer loop runs Count / 500 + 1 times, so neither #foreach exceeds the cap:
#set($inputRoot = $input.path('$'))
#set($maxRec = 500)
#set($totalLoops = $inputRoot.Count / $maxRec)
#set($innerEnd = $maxRec - 1)
#set($outerArray = [0..$totalLoops])
#set($innerArray = [0..$innerEnd])
[
#foreach($outer in $outerArray)
#foreach($inner in $innerArray)
#set($index = $outer * $maxRec + $inner)
#if($index < $inputRoot.Count)
#if($index > 0),#end
  {
    "id": $inputRoot.Items.get($index).id.N,
    "trail_name": "$inputRoot.Items.get($index).trail_name.S",
    "challenge_rank": $inputRoot.Items.get($index).challenge_rank.N,
    "challenge_description": "$inputRoot.Items.get($index).challenge_description.S",
    "reliability_description": "$inputRoot.Items.get($index).reliability_description.S"
  }
#end
#end
#end
]

Asynchronous bulk data validation service - GET or POST?

Here is a different scenario for the GET-or-POST confusion. I am working on a web application built with a Spring Boot microservice architecture, where bulk data from an Excel sheet needs to be validated and updated.
There can be 500-1000 records in the Excel sheet, with 6 different columns, for bulk processing. Once the UI submits the Excel sheet to the server, the whole process is asynchronous. It involves microservice-to-microservice calls, and I am confused about whether they should use GET or POST.
Here is the problem: I have 4 microservices (let's say orchestra-service, A-service, B-service and C-service).
OrchestraService creates a DTO list from the Excel sheet, which is used in the further calls. The orchestrator calls 'A'. 'A' validates the data against the DB, marks success and failure records in the DTO list object, and returns the list to the orchestrator. The orchestrator then calls 'B', which does a similar job to 'A' and returns the list back.
Now the orchestrator calls 'C', which updates the success records in the database, updates the file status in the database, and also creates a new resultant Excel sheet with error messages per row, which will be emailed to the user later (a small report of sorts).
In the above microservice-to-microservice calls, only C updates the database and creates a resource on the server. I used the POST method for all of these calls because I need a request body to pass my input list to each service.
According to the HTTP standards, am I doing this right?
https://www.rfc-editor.org/rfc/rfc7231#section-4.3.3
According to it, providing a block of data, such as the fields entered into an HTML form, to a data-handling process should be a POST call.
Please advise me whether:
I should use POST only for 'C' and GET for the others, or
it should be POST for all, since the other services also take part in processing the data.
NOTE: services A, B, and C do not all use every column of the Excel sheet, only some of them in combination. One column holds values 18 characters long, so I think the URL length limit for GET could be a problem for a bulk operation.
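For context, this is the shape of the orchestrator-to-A call I am describing, as a minimal Spring sketch (the endpoint path, DTO fields, and validation rule here are made up for illustration):

import java.util.List;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical sketch of service A: the DTO list travels in the POST body
// and comes back with each record marked as success or failure.
@RestController
public class ValidationController {

    public record RecordDto(String id, String col1, String col2, boolean valid, String message) {}

    @PostMapping("/a-service/validations")
    public List<RecordDto> validate(@RequestBody List<RecordDto> records) {
        return records.stream()
                .map(r -> {
                    boolean ok = r.col1() != null && !r.col1().isBlank(); // stand-in rule
                    return new RecordDto(r.id(), r.col1(), r.col2(),
                            ok, ok ? "OK" : "col1 missing");
                })
                .toList();
    }
}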
HTTP protocol
There is no actual violation in passing information via GET, and if the request doesn't mutate anything and behaves identically across identical requests, then it's fine.
Microservice-wise
Now for clarification: are service A and service B actually needed?
Aren't they in the same domain as service C, and couldn't they reside inside it?
It's more than good practice to have a microservice validate its own domain and return a collection of successes and failures with the relevant messages.
I had a similar question a few years back, and here is a possible solution for the first part of your question.
As mentioned by @Oreal Eraki in his answer, I would also question whether you need services A and B. If it's just validation and data transformation, it can be done in the same domain where the data is actually stored.

How to pass a large number of input parameters to a RESTful service?

I have a RESTful service that returns detailed data about machines for a supplied list of IDs: GET api/machine/
http://service.com/api/machine/1,2,3,4
Up till now this has been fine, since I am getting a small number of machines at a time, but now I need to get all machines (more than 1,000). This exceeds the 2,000-character limit on URLs.
I have gotten both of the options below to work, and I'm looking for some community feedback on which way to go.
Option 1: Split up my GET. Make multiple calls with a subset of the IDs. Pros: I am doing a read, so using the HTTP verb GET makes sense. Cons: if a person new to the service doesn't know about this limit, or doesn't use my client, it will cause problems.
Option 2: Add a PUT/POST method and include the full list of IDs in the body. Pros: makes one call to get all the data. Cons: I am now doing a read via PUT/POST.
Probably your best course of action would be something along the lines of option 2: you can create a JSON array on your side with the IDs you want and send it in the body of the message. If there's a possibility of it still being far too large, you can split it into several messages: when you receive the response to one, you send the next item in the queue, and so on.
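A rough sketch of that chunked approach in Java (the /api/machine/search endpoint is an invented example):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class MachineBatchClient {
    private static final int CHUNK = 500;   // keeps each request comfortably small

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        List<Integer> ids = IntStream.rangeClosed(1, 4000).boxed().toList();

        // POST one chunk of IDs at a time; the body is a plain JSON array.
        for (int from = 0; from < ids.size(); from += CHUNK) {
            String json = ids.subList(from, Math.min(from + CHUNK, ids.size())).stream()
                    .map(String::valueOf)
                    .collect(Collectors.joining(",", "[", "]"));
            HttpRequest request = HttpRequest.newBuilder(
                        URI.create("https://service.com/api/machine/search"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(json))
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            // ... accumulate response.body() into the full result here ...
        }
    }
}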
Another option, used by the Facebook API among others, is to create a "/batch" POST method which can be used to make multiple requests in one go.
So instead of having http://service.com/api/machine/1,2,3,4,5,... you'd have a batch of requests: /machine/1, /machine/2, /machine/3, etc.
The advantage is that you keep clean RESTful URLs (no more comma-separated values) and it scales very well, since you can batch as many requests as you want.
The disadvantage is that it is slightly more complex to build.
See the Facebook documentation for more information: https://developers.facebook.com/docs/graph-api/making-multiple-requests
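For illustration, a hypothetical batch request body for this machine service might look like this:

POST /api/batch
[
  { "method": "GET", "relative_url": "/machine/1" },
  { "method": "GET", "relative_url": "/machine/2" },
  { "method": "GET", "relative_url": "/machine/3" }
]

In this sketch the server executes each sub-request and returns an array of the individual responses.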

Get request with very long query string. (CSV)

I'm looking to implement an API call where you can specify any combination of up to ~6,000 IDs to get data from the server. The trouble is that a request will quite likely contain a large number of IDs - say around 4,000. The query string would therefore be very long, and possibly too long for the browser.
I wonder what the best approach would be. I could use a POST, but it doesn't really fit with REST - then again, I'm not too fussed about that. Is there a better way of doing this?
In this case, POST really is the solution. From a REST perspective, and also from an optimization perspective, if you expect this call to be invoked multiple times with the same list of IDs, you may want to consider one POST call to create a server-side named/defined list, and then have subsequent GET requests reference the created list, so that this data doesn't have to be repeated each and every time.
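Sketched as a hypothetical exchange (paths and payloads invented for illustration):

POST /api/machine-lists
{ "ids": [1, 2, 3, 4] }
-> 201 Created, Location: /api/machine-lists/42

GET /api/machine-lists/42/machines
-> 200 OK, detailed data for every machine in list 42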

Spring Batch ItemReader to iterate over a REST API call

I have a Spring Batch job which needs to fetch details from a REST API call and process the data on my side. The REST API call mainly takes the following parameters:
StartingIdNumber (offset)
PageSize (limit)
PS: StartingIdNumber serves the same purpose as a row number or "offset" in this particular API. The API results are sorted by IdNumber, so by specifying a StartingIdNumber, the API in turn performs a "WHERE IdNumber >= StartingIdNumber ORDER BY IdNumber LIMIT PageSize" in its DB query.
It returns the given number of user details, and I need to iterate through all the IDs by changing the StartingIdNumber parameter on each request.
I have looked at the existing ItemReader implementations in the Spring Batch framework, which read from a database, XML, etc., but I didn't come across any reader that helps in my case. Please suggest a way to iterate through the user details as specified above.
Note: if I write my own custom item reader, I have to take care of preserving state (the last processed StartingIdNumber), which is proving challenging for me.
Does implementing ItemStream serve my purpose? Or is there a better way?
Implementing the ItemStream interface and writing my own custom reader served my purpose. It is now stateful, as required. Thanks.
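For anyone with the same need, here is a minimal sketch of what such a reader can look like (the endpoint URL and the UserDetail type are hypothetical; the essential part is the checkpoint kept in the ExecutionContext):

import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemStreamReader;
import org.springframework.web.client.RestTemplate;

// UserDetail is assumed to be a DTO exposing getIdNumber().
public class RestUserItemReader implements ItemStreamReader<UserDetail> {

    private static final String CHECKPOINT_KEY = "reader.startingIdNumber";

    private final RestTemplate restTemplate = new RestTemplate();
    private final int pageSize = 100;               // "limit" parameter of the API
    private long startingIdNumber = 0;              // "offset" parameter of the API
    private Iterator<UserDetail> currentPage = Collections.emptyIterator();

    @Override
    public void open(ExecutionContext context) {
        // On restart, resume from the last committed checkpoint.
        if (context.containsKey(CHECKPOINT_KEY)) {
            startingIdNumber = context.getLong(CHECKPOINT_KEY);
        }
    }

    @Override
    public void update(ExecutionContext context) {
        // Called before each chunk commit: persist the checkpoint.
        context.putLong(CHECKPOINT_KEY, startingIdNumber);
    }

    @Override
    public UserDetail read() {
        if (!currentPage.hasNext()) {
            // Fetch the next page: WHERE IdNumber >= startingIdNumber LIMIT pageSize.
            UserDetail[] page = restTemplate.getForObject(
                    "https://example.com/users?StartingIdNumber={start}&PageSize={size}",
                    UserDetail[].class, startingIdNumber, pageSize);
            if (page == null || page.length == 0) {
                return null;                        // no more data: the step ends
            }
            currentPage = Arrays.asList(page).iterator();
        }
        UserDetail next = currentPage.next();
        startingIdNumber = next.getIdNumber() + 1;  // next request starts after this id
        return next;
    }

    @Override
    public void close() {
        // Nothing to release for a stateless HTTP client.
    }
}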