Spring Data R2DBC - Backpressure not taken into account? - spring-data

This thread is a continuation of the Github issue at: https://github.com/spring-projects/spring-data-r2dbc/issues/194
Context:
Hi,
I just tried a very simple example, based on two reactive repositories:
Given br, an R2DBC CRUD repository, and cr, another R2DBC CRUD repository:
br.findAll()
    .flatMap(b -> {
        return cr.findById(b.getPropertyOne())
            .doOnNext(c -> b.setProperty2(c))
            .thenReturn(b);
    })
    .collectList().block();
This code sample never completes (only the first 250 or so entries reach the .collectList operator). After some digging, adding an onBackpressureXXX operator after findAll seems to "fix" the issue by... well, dropping elements or buffering them.
At this point, my understanding is that the R2DBC reactive repositories don't use the consumer feedback mechanism, which removes a significant part of R2DBC's benefits.
Am I wrong? Is there a better way to achieve the same objective?
Thanks!
Suggestion from @mp911de:
As a general rule, avoid creating a stream while another stream is active (famous quote: do not cross the streams).
If you want to fetch related data, then ideally collect all results as a List and then run the subqueries. This way, the initial response stream is consumed and the connection is free to fetch additional results.
Something like the following snippet should do the job:
br.findAll().collectList()
    .flatMap(it -> {
        List<Mono<Reference>> refs = new ArrayList<>();
        for (Person p : it) {
            refs.add(cr.findById(p.getPropertyOne()).doOnNext(…));
        }
        return Flux.concat(refs).then(Mono.just(it));
    });
But this removes the benefit of streaming the data without keeping it all in memory (my final step is not to build a list but to stream-write the output to a file).
Any help on this one?
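For what it's worth, one possible middle ground is to page through the outer table so that each page is fully drained (in the spirit of the suggestion above) before its per-row lookups run, while only ever holding one page in memory. This is only a sketch: it assumes a finder on br that accepts a Pageable (Spring Data R2DBC supports Pageable parameters on repository query methods, e.g. a derived findAllBy(Pageable)), and writePage(...) is a hypothetical Mono<Void> that appends the enriched rows to the output file.
Mono<Void> exportAllPages(Pageable page, int pageSize) {
    return br.findAllBy(page)                              // assumed finder taking a Pageable
        .collectList()                                     // fully drain this page; the connection is free again
        .flatMapMany(Flux::fromIterable)
        .concatMap(b -> cr.findById(b.getPropertyOne())    // subqueries run only once the page is in memory
            .doOnNext(c -> b.setProperty2(c))
            .thenReturn(b))
        .collectList()
        .flatMap(enriched -> writePage(enriched)           // hypothetical: stream-write this page to the file
            .then(enriched.size() < pageSize
                ? Mono.<Void>empty()                       // short page: nothing left to fetch
                : exportAllPages(page.next(), pageSize))); // otherwise continue with the next page
}
Calling exportAllPages(PageRequest.of(0, 500), 500).block() walks the whole table in 500-row pages. It trades the single streaming cursor for bounded memory and no crossed streams; whether the extra round trips are acceptable depends on your data set.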

Related

What if many Kafka streams update a domain model (a.k.a. materialized view)?

I have a materialized view that is updated from many streams. Each one enriches it partially. Order doesn't matter. Updates come in at unspecified times. Is the following algorithm a good approach:
An update comes in and I check what is stored in the materialized view via get(); it is the initial one, so I enrich and save.
A second update comes in and get() shows that a partial update exists, so I add the next piece of information.
... and I continue in the same style.
If there is a query/join, the stored object has an isValid() method that shows whether the update is complete, which could be used in KafkaStreams#filter().
Could you please tell me whether this is a good plan? Is there any pattern in the Kafka Streams world that handles this case?
Please advise.
Your plan looks good; you have the general idea, but you'll have to use the lower-level Kafka Streams API: the Processor API.
There is a .transform operator that allows you to access a KeyValueStore. Inside this operation's implementation you are free to decide whether your current aggregated value is valid or not,
and therefore either send it downstream or return null while waiting for more information.
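A rough sketch of the shape of that (none of this comes from the original question: the topic names "updates" and "complete-views", the PartialUpdate and MaterializedView types with from()/merge()/isValid(), and the updateSerde/viewSerde serdes are hypothetical placeholders):
StreamsBuilder builder = new StreamsBuilder();

// Register the state store the transformer will read and update.
builder.addStateStore(Stores.keyValueStoreBuilder(
    Stores.persistentKeyValueStore("view-store"),
    Serdes.String(),
    viewSerde));                                           // hypothetical serde for MaterializedView

KStream<String, PartialUpdate> updates =
    builder.stream("updates", Consumed.with(Serdes.String(), updateSerde));

updates
    .transform(() -> new Transformer<String, PartialUpdate, KeyValue<String, MaterializedView>>() {
        private KeyValueStore<String, MaterializedView> store;

        @Override
        @SuppressWarnings("unchecked")
        public void init(ProcessorContext context) {
            store = (KeyValueStore<String, MaterializedView>) context.getStateStore("view-store");
        }

        @Override
        public KeyValue<String, MaterializedView> transform(String key, PartialUpdate update) {
            MaterializedView view = store.get(key);        // null on the very first update for this key
            MaterializedView merged = (view == null)
                ? MaterializedView.from(update)            // hypothetical factory
                : view.merge(update);                      // order does not matter
            store.put(key, merged);
            // Forward only complete aggregates; returning null emits nothing downstream.
            return merged.isValid() ? KeyValue.pair(key, merged) : null;
        }

        @Override
        public void close() { }
    }, "view-store")
    .to("complete-views", Produced.with(Serdes.String(), viewSerde));
Alternatively, you can forward every merged value and keep the isValid() check in a plain .filter(), as suggested in the question; the transform plus state store is the part that needs the Processor API.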

How to perform batch insertion for a list of events using the Google Calendar API in Dart?

Problem Statement
As a developer, my requirement is to insert a list of events into Google Calendar with a single call.
Why Batch?
As mentioned here, it helps to reduce network overhead and increases performance.
Current Scenario
Using the googleapis package, I can only perform a single insert operation, as shown in the snippet below:
var eventResponse = await _calendarApi.events
.insert(event, calendarId, sendUpdates: 'all');
From a development perspective, it's not an efficient approach to call this method multiple times.
Also, it would be a bad idea to create an array of insert calls wrapped in Futures and use Future.wait() to wait for all the insertion calls to be executed; see the snippet below.
Future<List<Event>> insertEvents(List<Event> events) async {
  var _calendarApi = CalendarApi(_authClient);
  List<Future<Event>> _futureEvents = [];
  for (int i = 0; i < events.length; i++) {
    _futureEvents.add(_calendarApi.events.insert(events[i], calendarId));
  }
  var _eventResponse =
      await Future.wait(_futureEvents).catchError((e) => print(e));
  return _eventResponse;
}
As per the official Google blog, there's no way to perform a batch operation in Dart.
Does anyone know a better, more optimal solution for this problem?
If you check the Google Calendar API documentation for events.insert, you will notice that it states:
Creates an event.
There is, however, the option to use the batching endpoint. You should be aware that the only thing batching is going to save you is the HTTP calls back and forth.
Each call within the batch still counts against your quota; the batch does not count as a single request against your quota. You're also limited to 50 calls in a single batch request.
Question: As per the official Google blog, there's no way to perform a batch operation in Dart.
If you check the batching guide, it shows how to build up a batch request using HTTP. I would be surprised if you could not do this with Flutter as well, since Flutter is capable of making an HTTP request. I was also unable to find anyone who had gotten it to work; that may just mean that no one has bothered to try, or that they haven't posted the results.
Actually, the batching blog post states that the Dart client library does not support batching:
The Google API Dart Client Library does not support these features.
That by no means implies that it's not possible at all; it just means that you're going to have to code it from scratch yourself.
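For reference, here is a rough sketch of what "coding it from scratch" looks like at the HTTP level, based on the batching guide. It is written in Java (java.net.http) purely to keep the illustration concrete; the same multipart/mixed body can be built and posted from Dart with package:http. ACCESS_TOKEN, calendarId, and eventJsonBodies are placeholders, and the endpoint and part format should be double-checked against the current batch documentation.
// Build one multipart/mixed body containing up to 50 inner HTTP requests.
String boundary = "batch_calendar_insert";
StringBuilder body = new StringBuilder();
for (String eventJson : eventJsonBodies) {                 // pre-serialised Event resources
    body.append("--").append(boundary).append("\r\n")
        .append("Content-Type: application/http\r\n\r\n")
        .append("POST /calendar/v3/calendars/").append(calendarId).append("/events\r\n")
        .append("Content-Type: application/json\r\n\r\n")
        .append(eventJson).append("\r\n\r\n");
}
body.append("--").append(boundary).append("--");

HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("https://www.googleapis.com/batch/calendar/v3"))
    .header("Authorization", "Bearer " + ACCESS_TOKEN)
    .header("Content-Type", "multipart/mixed; boundary=" + boundary)
    .POST(HttpRequest.BodyPublishers.ofString(body.toString()))
    .build();

HttpResponse<String> response = HttpClient.newHttpClient()
    .send(request, HttpResponse.BodyHandlers.ofString());
// The response is itself multipart/mixed: one part per inserted event, which has
// to be split on the response boundary and checked individually.
Even with this in place, each inner call still counts against the quota, as noted above.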
Is it worth it?
Again, there is little or no advantage to using batching aside from saving the HTTP requests, so you will need to decide whether it's worth your time trying to get it to work. In my opinion it's not; I have been using these APIs for more than 11 years and I have never found batching to be very stable or worth the aggravation it causes.
With batching you tend to get more quota flooding errors, as the server thinks you are sending too many requests at once; even sending 5 at once has resulted in flooding errors in my experience, and then you have to go back and check which ones made it through so you can send the rest again.

Is there a way to GET all items in a global secondary index with a REST API using AWS API Gateway? I can only GET some

I created a REST API using AWS API Gateway and DynamoDB without using AWS Lambda (I wrote mapping templates for both the integration request and the integration response instead of a Lambda) on a GET API method, with a POST HTTP method and a Scan action setting. I'm fetching from a global secondary index in DynamoDB to make my scan smaller than the original table.
It's working well, except I am only able to scan roughly 1,000 of the 7,500 items that I need to scan. I looked into paginating the JSON into an S3 bucket, but I really want to keep it simple with just API Gateway and DynamoDB, if possible.
Is there a way to get all 7,500 of the items in my payload with some modification to my integration request and/or response mappings? If not, what do you suggest?
Below is the mapping code I'm using, which works for a 1,000-item JSON payload instead of the 7,500 items I would like to have:
Integration Request:
{
    "TableName": "TrailData",
    "IndexName": "trail-index"
}
Integration Response:
#set($inputRoot = $input.path('$'))
[
#foreach($elem in $inputRoot.Items)
    {
        "id": $elem.id.N,
        "trail_name": "$elem.trail_name.S",
        "challenge_rank": $elem.challenge_rank.N,
        "challenge_description": "$elem.challenge_description.S",
        "reliability_description": "$elem.reliability_description.S"
    }
    #if($foreach.hasNext),#end
#end
]
Here is a screenshot of the GET method settings for my API:
API Screenshot
I have already checked out this related Stack Overflow question, but I can't figure out how to apply it to my situation. I have put a lot of time into this.
I am aware of the 1 MB query limit for DynamoDB, but the limited data I am returning is only 142 KB.
I appreciate any help or suggestions. I am new to this. Thank you!
This limitation is not related to the DynamoDB Scan; rather, #foreach in VTL within the response template is restricted to 1,000 iterations. Here is the issue.
We can also confirm this by simply removing the #foreach (or the entire response template); we should then see all the records (up to 1 MB) come back, although not well formatted.
The easiest solution is to pass request parameters that restrict the response to only the necessary attributes from the DynamoDB table:
{
    "TableName": "ana-qa-linkshare",
    "Limit": 2000,
    "ProjectionExpression": "challenge_rank,reliability_description,trail_name"
}
However, we can also avoid a single loop that goes over 1,000 iterations by using multiple nested #foreach loops. It gets a little complex within the template (using a Lambda instead would be simpler), but here is how it might look:
#set($inputRoot = $input.path('$'))
#set($maxRec = 500)
#set($totalLoops = $inputRoot.Count / $maxRec)
#set($outerArray = [0..$totalLoops])
## each inner loop stays below the 1,000-iteration limit
#set($innerMax = $maxRec - 1)
#set($innerArray = [0..$innerMax])
[
#foreach($outer in $outerArray)
#foreach($inner in $innerArray)
#set($index = $outer * $maxRec + $inner)
#if($index < $inputRoot.Count)
#set($elem = $inputRoot.Items.get($index))
#if($index > 0),#end
    {
        "id": $elem.id.N,
        "trail_name": "$elem.trail_name.S",
        "challenge_rank": $elem.challenge_rank.N,
        "challenge_description": "$elem.challenge_description.S",
        "reliability_description": "$elem.reliability_description.S"
    }
#end
#end
#end
]

How to control data failures in Azure Data Factory Pipelines?

I receive an error from time to time due to incompatible data in my source data set compared to my target data set. I would like to control the action that the pipeline takes based on the error type, perhaps outputting or dropping those particular rows while completing everything else. Is that possible? Furthermore, is there a simple way to get hold of the actual failing line(s) from Data Factory without accessing and searching the actual source data set?
Copy activity encountered a user error at Sink side: ErrorCode=UserErrorInvalidDataValue,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Column 'Timestamp' contains an invalid value '11667'. Cannot convert '11667' to type 'DateTimeOffset'.,Source=Microsoft.DataTransfer.Common,''Type=System.FormatException,Message=String was not recognized as a valid DateTime.,Source=mscorlib,'.
Thanks
I think you've hit a fairly common problem and limitation within ADF. Although the datasets you define with your JSON allow ADF to understand the structure of the data, that is all it knows: just the structure. The orchestration tool can't do anything to transform or manipulate the data as part of the activity processing.
To answer your question directly, it's certainly possible. But you need to break out the C# and use ADF's extensibility functionality to deal with your bad rows before passing the data to the final destination.
I suggest you expand your data factory to include a custom activity where you can build some lower-level cleaning processes to divert the bad rows as described.
This is an approach we often take, as not all data is perfect (I wish) and ETL or ELT doesn't work. I prefer the acronym ECLT, where the 'C' stands for clean (or cleanse, prepare, etc.). This certainly applies to ADF, because this service doesn't have its own compute or an SSIS-style data flow engine.
So...
In terms of how to do this. First I recommend you check out this blog post on creating ADF custom activities. Link:
https://www.purplefrogsystems.com/paul/2016/11/creating-azure-data-factory-custom-activities/
Then, within your C# class that implements IDotNetActivity, do something like the below.
public IDictionary<string, string> Execute(
    IEnumerable<LinkedService> linkedServices,
    IEnumerable<Dataset> datasets,
    Activity activity,
    IActivityLogger logger)
{
    // etc: resolve YourSource / YourDestination from the linked services and datasets

    using (StreamReader vReader = new StreamReader(YourSource))
    {
        using (StreamWriter vWriter = new StreamWriter(YourDestination))
        {
            while (!vReader.EndOfStream)
            {
                // data transform logic: read a line, validate it,
                // write good rows to vWriter and divert bad rows elsewhere
            }
        }
    }

    return new Dictionary<string, string>();
}
You get the idea. Build your own SSIS data flow!
Then write out your clean rows as an output dataset, which can be the input for your next ADF activity, either with multiple pipelines or as chained activities within a single pipeline.
This is the only way you will get ADF to deal with your bad data in the current service offerings.
Hope this helps

Querying a list of Actors in Azure Service Fabric

I currently have a ReliableActor for every user in the system. This actor is appropriately named User, and for the sake of this question has a Location property. What would be the recommended approach for querying Users by Location?
My current thought is to create a ReliableService that contains a ReliableDictionary. The data in the dictionary would be a projection of the User data. If I did that, then I would need to:
Query the dictionary. After GA, this seems like the recommended approach.
Keep the dictionary in sync. Perhaps through Pub/Sub or IActorEvents.
Another alternative would be to have a persistent store outside Service Fabric, such as a database. This feels wrong, as it goes against some of the ideals of using the Service Fabric. If I did, I would assume something similar to the above but using a Stateless service?
Thank you very much.
I'm personally exploring the use of Actors as the main datastore (i.e. the source of truth) for my entities. As Actors are added, updated, or deleted, I use MassTransit to publish events. I then have reliable stateful services subscribed to these events. The services receive the events and update their internal IReliableDictionary instances. The services can then be queried to find the entities required by the client. Each service keeps only the entity data that it requires to perform its queries.
I'm also exploring the use of EventStore to publish the events as well. That way, if in the future I decide I need to query the entities in a new way, I can create a new service and replay all the events to it.
These pub/sub methods do mean the query services are only eventually consistent, but in a distributed system this seems to be the norm.
While the standard recommendation is definitely Vaclav's response, if querying is the exception then Actors could still be appropriate. For me, whether they're suitable or not is defined by the normal way of accessing them; if it's by key (and for a user record it presumably would be), then Actors work well.
It is possible to iterate over Actors, but it's quite a heavy task, so as I say it's only appropriate in the exceptional case. The following code will build up a collection of actor references; you then iterate over this collection to fetch the actors, and can use LINQ or similar on the collection you've built up.
ContinuationToken continuationToken = null;
var actorServiceProxy = ActorServiceProxy.Create(new Uri("fabric:/MyActorApp/MyActorService"), partitionKey);
var actorInformation = new List<ActorInformation>();

do
{
    var queryResult = actorServiceProxy.GetActorsAsync(continuationToken, cancellationToken).GetAwaiter().GetResult();
    actorInformation.AddRange(queryResult.Items);   // collect the actor references from this page
    continuationToken = queryResult.ContinuationToken;
} while (continuationToken != null);
TL;DR: It's not always advisable to query over actors, but it can be achieved if required. The code above will get you started.
If you find yourself needing to query across a data set by some data property, like User.Location, then Reliable Collections are the right answer. Reliable Actors are not meant to be queried over this way.
In your case, a user could simply be a row in a Reliable Dictionary.