Jackson Streaming API & WebFlux response body in Spring Cloud Gateway

I'm pretty new to Spring Cloud Gateway and Spring WebFlux.
I have a requirement where I need to gather all the IDs of the data elements in the response for logging purposes.
I have done so in a very memory-intensive way: reading each DataBuffer into a byte array, parsing it, and then wrapping the bytes in a new DataBuffer and passing it along.
However, this is not viable when the responses are large, and it seems silly given that I am already using the Jackson Streaming API.
Does anyone have tips on how to achieve this? All the examples I have seen do it in a similar way, by buffering the entire response in memory.
Current version (Groovy):
class DataIdResponseHandler extends ServerHttpResponseDecorator {
    final DataBufferFactory dataBufferFactory
    DataIdResponseHandler(ServerHttpResponse delegate) {
        super(delegate)
        dataBufferFactory = delegate.bufferFactory()
    }
    @Override
    Mono<Void> writeWith(Publisher<? extends DataBuffer> body) {
        Flux<? extends DataBuffer> fluxBody = (Flux<? extends DataBuffer>) body
        return super.writeWith(fluxBody.map { dataBuffer ->
            // Copy the chunk out of the buffer, parse it, then re-wrap the same bytes
            byte[] content = new byte[dataBuffer.readableByteCount()]
            dataBuffer.read(content)
            List<String> dataIds = DataIdParser.parseFromByteArray(content)
            DataIdCollector.add(dataIds)
            return dataBufferFactory.wrap(content)
        })
    }
}
A reactive version of the above method would be:
return super.writeWith(fluxBody.doOnNext { DataBuffer dataBuffer ->
    List<String> dataIds = DataIdParser.parseFromByteArray(dataBuffer.toString(Charset.defaultCharset()).bytes)
    Logger logger = LoggerFactory.getLogger(DataIdResponseFilter)
    logger.info(dataIds.toString())
})
Oddly enough, if you call asInputStream() on the DataBuffer and read from that stream, the DataBuffer is emptied and the actual response body ends up empty. If you use the toString() method instead, which obviously also reads from the DataBuffer, the buffer and hence the body remain complete.
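A likely explanation for that difference is that a DataBuffer tracks a read position: the stream returned by asInputStream() shares that position with the buffer, so reading it consumes the data, while toString(Charset) decodes the readable bytes without moving the position. The sketch below illustrates the same distinction with java.nio.ByteBuffer, as an analogy only. It does not use the Spring API, and the class and method names (ReadPositionDemo, consume, peek) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class ReadPositionDemo {
    // Consuming read: a relative get() advances the buffer's position,
    // so nothing is left for whoever reads the buffer next.
    static String consume(ByteBuffer buf) {
        byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Non-consuming read: decode a duplicate, which shares the content
    // but has its own independent position.
    static String peek(ByteBuffer buf) {
        return StandardCharsets.UTF_8.decode(buf.duplicate()).toString();
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap("{\"id\":1}".getBytes(StandardCharsets.UTF_8));
        System.out.println(peek(buf) + " remaining=" + buf.remaining());
        System.out.println(consume(buf) + " remaining=" + buf.remaining());
    }
}
```

For logging inside doOnNext, a non-consuming read is the safer choice, because the same buffer is still written to the client afterwards.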

Related

Is there any way to compress the data while using mongo persistence with NEventStore?

I'm working with C#, .NET Core, and NEventStore (version 9.0.1), evaluating the persistence options it supports out of the box.
Specifically, when I use the MongoDB persistence, the payload is stored without any compression applied.
Note: payload compression works as expected with the SQL persistence of NEventStore, but not with the MongoDB persistence.
I'm using the below code to create the event store and initialize:
private IStoreEvents CreateEventStore(string connectionString)
{
    var store = Wireup.Init()
        .UsingMongoPersistence(connectionString,
            new NEventStore.Serialization.DocumentObjectSerializer())
        .InitializeStorageEngine()
        .UsingBsonSerialization()
        .Compress()
        .HookIntoPipelineUsing()
        .Build();
    return store;
}
And, I'm using the below code for storing the events:
public async Task AddMessageTostore(Command command)
{
    using (var stream = _eventStore.CreateStream(command.Id))
    {
        stream.Add(new EventMessage { Body = command });
        stream.CommitChanges(Guid.NewGuid());
    }
}
The workaround I used: I implemented the PreCommit(CommitAttempt attempt) and Select methods of IPipelineHook and applied gzip compression logic there, which got the events compressed in MongoDB.
Attaching data store image of both SQL and mongo persistence:
So, the questions are:
Is there some other option or setting I'm missing so that the events get compressed on save (the fluent way, by calling the Compress method)?
Is the workaround mentioned above sensible, or does it add performance overhead?
I faced the same issue while using NEventStore.Persistence.MongoDB.
Even with the fluent Compress method, the payload is not compressed in the MongoDB persistence the way it is in the SQL persistence.
In the end, I achieved compression and decompression by customizing the logic inside the PreCommit(CommitAttempt attempt) and Select(ICommit committed) methods.
Code used for compression:
using (var stream = new MemoryStream())
{
    using (var compressedStream = new GZipStream(stream,
        CompressionMode.Compress))
    {
        var serializer = new JsonSerializer {
            TypeNameHandling = TypeNameHandling.None,
            ReferenceLoopHandling = ReferenceLoopHandling.Ignore
        };
        var writer = new JsonTextWriter(new StreamWriter(compressedStream));
        serializer.Serialize(writer, this);
        writer.Flush();
    }
    // The GZipStream must be disposed before the bytes are read back
    return stream.ToArray();
}
Code used for decompression:
using (var stream = new MemoryStream(bytes))
// Dispose the decompression stream and readers along with the memory stream
using (var decompressedStream = new GZipStream(stream, CompressionMode.Decompress))
using (var streamReader = new StreamReader(decompressedStream))
using (var reader = new JsonTextReader(streamReader))
{
    var serializer = new JsonSerializer {
        TypeNameHandling = TypeNameHandling.None,
        ReferenceLoopHandling = ReferenceLoopHandling.Ignore
    };
    var body = serializer.Deserialize(reader, type);
    return body as Command;
}
I'm not sure whether this is the right approach, or whether it will affect the performance of event store operations such as Insert and Select.

How to generate output files for each input in Apache Flink

I'm using Flink to process my streaming data.
The stream comes from other middleware, such as Kafka, Pravega, etc.
Say Pravega is sending a stream of words: hello world my name is ...
I need three processing steps:
1. Map each word to an object of my custom class MyJson.
2. Map the MyJson object to a String.
3. Write the Strings to files: each String is written to its own file.
For example, for the stream hello world my name is, I should get five files.
Here is my code:
// init Pravega connector
PravegaDeserializationSchema<String> adapter = new PravegaDeserializationSchema<>(String.class, new JavaSerializer<>());
FlinkPravegaReader<String> source = FlinkPravegaReader.<String>builder()
    .withPravegaConfig(pravegaConfig)
    .forStream(stream)
    .withDeserializationSchema(adapter)
    .build();
// map stream to MyJson
DataStream<MyJson> jsonStream = env.addSource(source).name("Pravega Stream")
    .map(new MapFunction<String, MyJson>() {
        @Override
        public MyJson map(String s) throws Exception {
            MyJson myJson = JSON.parseObject(s, MyJson.class);
            return myJson;
        }
    });
// map MyJson to String
DataStream<String> valueInJson = jsonStream
    .map(new MapFunction<MyJson, String>() {
        @Override
        public String map(MyJson myJson) throws Exception {
            return myJson.toString();
        }
    });
// output
valueInJson.print();
This code writes all of the results to the Flink log files.
My question is: how do I write each word to its own output file?
I think the easiest way to do this would be with a custom sink.
stream.addSink(new WordFileSink());
public static class WordFileSink implements SinkFunction<String> {
    @Override
    public void invoke(String value, Context context) {
        // generate a unique name for the new file and open it
        // write the word to the file
        // close the file
    }
}
Note that this implementation won't necessarily provide exactly once behavior. You might want to take care that the file naming scheme is both unique and deterministic (rather than depending on processing time), and be prepared for the case that the file may already exist.
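To make the naming advice concrete, here is a minimal plain-Java sketch of the file-writing part only, independent of Flink. It assumes each record carries a stable sequence number (for example an offset supplied by the source), which is not something the code above provides. WordFileWriter, fileNameFor and the word-NNN.txt pattern are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class WordFileWriter {
    private final Path outputDir;

    WordFileWriter(Path outputDir) throws IOException {
        this.outputDir = Files.createDirectories(outputDir);
    }

    // Deterministic name: derived from a caller-supplied sequence number
    // (an assumption of this sketch), never from processing time.
    static String fileNameFor(long sequence) {
        return String.format("word-%012d.txt", sequence);
    }

    // Writes one word to one file. If the file already exists (for example
    // because a record is replayed after a restart), it is overwritten.
    Path write(long sequence, String word) throws IOException {
        Path target = outputDir.resolve(fileNameFor(sequence));
        Files.write(target, word.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);
        return target;
    }

    public static void main(String[] args) throws IOException {
        WordFileWriter writer = new WordFileWriter(Files.createTempDirectory("words"));
        String[] words = "hello world my name is".split(" ");
        for (int i = 0; i < words.length; i++) {
            System.out.println(writer.write(i, words[i]));
        }
    }
}
```

Inside a sink, invoke() would call something like write(sequence, value). Because a replayed record maps to the same file name and overwrites it, duplicates after a restart do not produce extra files.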

How to enrich event stream with big file in Apache Flink?

I have a Flink application for click stream collection and processing. The application consists of Kafka as the event source, a map function and a sink, as shown in the image below:
I want to enrich the incoming click stream data with the user's IP location, based on the userIp field in the raw event ingested from Kafka.
A simplified slice of the CSV file is shown below:
start_ip,end_ip,country
"1.1.1.1","100.100.100.100","United States of America"
"100.100.100.101","200.200.200.200","China"
I have done some research and found a couple of potential solutions:
1. Solution: Broadcast the enrichment data and connect it with the event stream, applying some IP matching logic.
1. Result: It worked well for a small sample of IP location data but not for the whole CSV file. The JVM heap reached 3.5 GB, and broadcast state cannot be put on disk (in RocksDB).
2. Solution: Load the CSV data into state (ValueState) in the open() method of a RichFlatMapFunction before event processing starts, and enrich the event data in the flatMap method.
2. Result: The enrichment data is too big to store in the JVM heap, so it is impossible to load it into ValueState. Also, de/serializing through ValueState is bad practice for data that is key-value in nature.
3. Solution: To avoid the JVM heap constraint, I tried to put the enrichment data into RocksDB (which uses disk) as state, using MapState.
3. Result: Loading the CSV file into MapState in the open() method failed with an error saying that I cannot put into MapState there, because open() does not run in a keyed context, as in this question: Flink keyed stream key is null
4. Solution: Because MapState needs a keyed context (to be stored in RocksDB), I tried to load the whole CSV file into the local RocksDB instance (disk) in the process function, after turning the DataStream into a KeyedStream:
class KeyedIpProcess extends KeyedProcessFunction[Long, Event, Event] {
  var ipMapState: MapState[String, String] = _
  var csvFinishedFlag: ValueState[Boolean] = _
  override def processElement(event: Event,
                              ctx: KeyedProcessFunction[Long, Event, Event]#Context,
                              out: Collector[Event]): Unit = {
    val ipDescriptor = new MapStateDescriptor[String, String]("ipMapState", classOf[String], classOf[String])
    val csvFinishedDescriptor = new ValueStateDescriptor[Boolean]("csvFinished", classOf[Boolean])
    ipMapState = getRuntimeContext.getMapState(ipDescriptor)
    csvFinishedFlag = getRuntimeContext.getState(csvFinishedDescriptor)
    // Blocking load of the whole CSV file on the first event seen for each key
    if (!csvFinishedFlag.value()) {
      val csv = new CSVParser(defaultCSVFormat)
      val fileSource = Source.fromFile("/tmp/ip.csv", "UTF-8")
      for (row <- fileSource.getLines()) {
        val Some(List(start, end, country)) = csv.parseLine(row)
        ipMapState.put(start, country)
      }
      fileSource.close()
      csvFinishedFlag.update(true)
    }
    out.collect {
      if (ipMapState.contains(event.userIp)) {
        // The map value is the country name itself (a String)
        val details = ipMapState.get(event.userIp)
        event.copy(data =
          event.data.copy(
            ipLocation = Some(details)
          ))
      } else {
        event
      }
    }
  }
}
4. Result: It's too hacky, and the blocking file read holds up event processing.
Could you tell me what I can do in this situation?
Thanks
What you can do is to implement a custom partitioner, and load a slice of the enrichment data into each partition. There's an example of this approach here; I'll excerpt some key portions:
The job is organized like this:
DataStream<SensorMeasurement> measurements = env.addSource(new SensorMeasurementSource(100_000));
DataStream<EnrichedMeasurements> enrichedMeasurements = measurements
    .partitionCustom(new SensorIdPartitioner(), measurement -> measurement.getSensorId())
    .flatMap(new EnrichmentFunctionWithPartitionedPreloading());
The custom partitioner needs to know how many partitions there are, and deterministically assigns each event to a specific partition:
private static class SensorIdPartitioner implements Partitioner<Long> {
    @Override
    public int partition(final Long sensorMeasurement, final int numPartitions) {
        return Math.toIntExact(sensorMeasurement % numPartitions);
    }
}
And then the enrichment function takes advantage of knowing how the partitioning was done to load only the relevant slice into each instance:
public class EnrichmentFunctionWithPartitionedPreloading extends RichFlatMapFunction<SensorMeasurement, EnrichedMeasurements> {
    private Map<Long, SensorReferenceData> referenceData;
    @Override
    public void open(final Configuration parameters) throws Exception {
        super.open(parameters);
        // Each parallel instance loads only the slice that matches its subtask index
        referenceData = loadReferenceData(getRuntimeContext().getIndexOfThisSubtask(), getRuntimeContext().getNumberOfParallelSubtasks());
    }
    @Override
    public void flatMap(
            final SensorMeasurement sensorMeasurement,
            final Collector<EnrichedMeasurements> collector) throws Exception {
        SensorReferenceData sensorReferenceData = referenceData.get(sensorMeasurement.getSensorId());
        collector.collect(new EnrichedMeasurements(sensorMeasurement, sensorReferenceData));
    }
    private Map<Long, SensorReferenceData> loadReferenceData(
            final int partition,
            final int numPartitions) {
        SensorReferenceDataClient client = new SensorReferenceDataClient();
        return client.getSensorReferenceDataForPartition(partition, numPartitions);
    }
}
Note that the enrichment is not being done on a keyed stream, so you cannot use keyed state or timers in the enrichment function.
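The key point is that the loader and the partitioner must agree on which keys belong to which partition. The following plain-Java sketch shows that agreement in isolation, with no Flink dependency. PartitionedPreloading, partitionFor and loadSlice are hypothetical names, and the in-memory map only stands in for the external reference data source; a real loader would ideally fetch just its slice instead of filtering the full set. The sketch uses Math.floorMod rather than % so that negative keys also map to a valid partition, which is a deliberate deviation from the partitioner above.

```java
import java.util.HashMap;
import java.util.Map;

final class PartitionedPreloading {
    // Single source of truth for the key-to-partition rule, used both for
    // routing events and for choosing which reference rows to load.
    static int partitionFor(long key, int numPartitions) {
        return Math.toIntExact(Math.floorMod(key, (long) numPartitions));
    }

    // Keeps only the reference entries owned by the given partition.
    static Map<Long, String> loadSlice(Map<Long, String> all, int partition, int numPartitions) {
        Map<Long, String> slice = new HashMap<>();
        for (Map.Entry<Long, String> e : all.entrySet()) {
            if (partitionFor(e.getKey(), numPartitions) == partition) {
                slice.put(e.getKey(), e.getValue());
            }
        }
        return slice;
    }

    public static void main(String[] args) {
        Map<Long, String> reference = new HashMap<>();
        for (long id = 0; id < 10; id++) {
            reference.put(id, "sensor-" + id);
        }
        int numPartitions = 3;
        for (int p = 0; p < numPartitions; p++) {
            System.out.println(p + " -> " + loadSlice(reference, p, numPartitions).keySet());
        }
    }
}
```

With this arrangement every event routed to partition p finds its reference entry in the slice held by subtask p, and each subtask holds roughly 1/numPartitions of the data.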

Non-blocking functional methods with Reactive Mongo and Web client

I have a micro service which reads objects from a database using a ReactiveMongoRepository interface.
The goal is to take each one of those objects and push it to an AWS Lambda function (after converting it to a DTO). If the result of that Lambda function is in the 200 range, mark the object as a success; otherwise ignore it.
In the old days of a simple Mongo repository and a RestTemplate this would be a trivial task. However, I'm trying to understand this reactive deal and avoid blocking.
Here is the code I've come up with. I know I'm blocking on the webClient, but how do I avoid that?
@Override
public Flux<Video> index() {
    return videoRepository.findAllByIndexedIsFalse().flatMap(video -> {
        final SearchDTO searchDTO = SearchDTO.builder()
            .name(video.getName())
            .canonicalPath(video.getCanonicalPath())
            .objectID(video.getObjectID())
            .userId(video.getUserId())
            .build();
        // Blocking call
        final HttpStatus httpStatus = webClient.post()
            .uri(URI.create(LAMBDA_ENDPOINT))
            .body(BodyInserters.fromObject(searchDTO)).exchange()
            .block()
            .statusCode();
        if (httpStatus.is2xxSuccessful()) {
            video.setIndexed(true);
        }
        return videoRepository.save(video);
    });
}
I'm calling the above from a scheduled task, and I don't really care about that actual result of the index() method, just what happens during.
@Scheduled(fixedDelay = 60000)
public void indexTask() {
    indexService
        .index()
        .log()
        .subscribe();
}
I've read a bunch of blog posts etc on the subject but they're all just simple CRUD operations without anything happening in the middle so don't really give me a full picture of how to implement these things.
Any help?
Your solution is actually quite close.
In those cases, you should try and decompose the reactive chain in steps and not hesitate to turn bits into independent methods for clarity.
@Override
public Flux<Video> index() {
    Flux<Video> unindexedVideos = videoRepository.findAllByIndexedIsFalse();
    return unindexedVideos.flatMap(video -> {
        final SearchDTO searchDTO = SearchDTO.builder()
            .name(video.getName())
            .canonicalPath(video.getCanonicalPath())
            .objectID(video.getObjectID())
            .userId(video.getUserId())
            .build();
        // Non-2xx responses are filtered out, so those videos are not saved
        Mono<ClientResponse> indexedResponse = webClient.post()
            .uri(URI.create(LAMBDA_ENDPOINT))
            .body(BodyInserters.fromObject(searchDTO)).exchange()
            .filter(res -> res.statusCode().is2xxSuccessful());
        return indexedResponse.flatMap(response -> {
            video.setIndexed(true);
            return videoRepository.save(video);
        });
    });
}
My approach is perhaps a little more readable. I admit I didn't run it, so I can't guarantee 100% that it works.
public Flux<Video> index() {
    return videoRepository.findAll()
        .flatMap(this::callLambda)
        .flatMap(videoRepository::save);
}
private Mono<Video> callLambda(final Video video) {
    SearchDTO searchDTO = new SearchDTO(video);
    return webClient.post()
        .uri(URI.create(LAMBDA_ENDPOINT))
        .body(BodyInserters.fromObject(searchDTO))
        .exchange()
        .map(ClientResponse::statusCode)
        .filter(HttpStatus::is2xxSuccessful)
        .map(t -> {
            video.setIndexed(true);
            return video;
        });
}

How to do Async Http Call with Apache Beam (Java)?

The input PCollection contains HTTP requests and is a bounded dataset. I want to make an async HTTP call (Java) in a ParDo, parse the response and put the results into the output PCollection. My code is below. I'm getting the following exception.
I couldn't figure out the reason and need some guidance.
java.util.concurrent.CompletionException: java.lang.IllegalStateException: Can't add element ValueInGlobalWindow{value=streaming.mapserver.backfill.EnrichedPoint@2c59e, pane=PaneInfo.NO_FIRING} to committed bundle in PCollection Call Map Server With Rate Throttle/ParMultiDo(ProcessRequests).output [PCollection]
Code:
public class ProcessRequestsFn extends DoFn<PreparedRequest,EnrichedPoint> {
    private static AsyncHttpClient _HttpClientAsync;
    private static ExecutorService _ExecutorService;
    static {
        AsyncHttpClientConfig cg = config()
            .setKeepAlive(true)
            .setDisableHttpsEndpointIdentificationAlgorithm(true)
            .setUseInsecureTrustManager(true)
            .addRequestFilter(new RateLimitedThrottleRequestFilter(100,1000))
            .build();
        _HttpClientAsync = asyncHttpClient(cg);
        _ExecutorService = Executors.newCachedThreadPool();
    }
    @DoFn.ProcessElement
    public void processElement(ProcessContext c) {
        PreparedRequest request = c.element();
        if (request == null)
            return;
        _HttpClientAsync.prepareGet((request.getRequest()))
            .execute()
            .toCompletableFuture()
            .thenApply(response -> {
                if (response.getStatusCode() == HttpStatusCodes.STATUS_CODE_OK) {
                    return response.getResponseBody();
                }
                return null;
            })
            .thenApply(responseBody -> {
                List<EnrichedPoint> resList = new ArrayList<>();
                /*some process logic here*/
                System.out.printf("%d enriched points back\n", result.length());
                }
                return resList;
            })
            .thenAccept(resList -> {
                // This runs on the HTTP client's thread, after processElement has returned
                for (EnrichedPoint enrichedPoint : resList) {
                    c.output(enrichedPoint);
                }
            })
            .exceptionally(ex -> {
                System.out.println(ex);
                return null;
            });
    }
}
The Scio library implements a DoFn which deals with asynchronous operations. The BaseAsyncDoFn might provide the handling you need. Since you're dealing with CompletableFuture, also take a look at the JavaAsyncDoFn.
Please note that you don't necessarily need to use the Scio library; you can take the main idea of the BaseAsyncDoFn, since it's independent of the rest of Scio.
The issue you're hitting is that you're outputting outside the context of a processElement or finishBundle call.
You'll want to gather all your outputs in memory, output them eagerly during later processElement calls, and output the rest in finishBundle after blocking until all your calls have finished.
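The buffering idea can be sketched in plain Java without any Beam types. This is not the Scio implementation; BufferedAsyncCalls and its methods are hypothetical, and the Consumer argument stands in for the output call that Beam makes available inside processElement and finishBundle. Completed results are parked in a thread-safe queue by the callback thread, and only the thread that calls processElement or finishBundle ever emits them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;
import java.util.function.Function;

final class BufferedAsyncCalls<I, O> {
    private final Function<I, CompletableFuture<O>> asyncCall;
    // Filled by callback threads, drained only by the caller's thread.
    private final Queue<O> completed = new ConcurrentLinkedQueue<>();
    private final List<CompletableFuture<Void>> inFlight = new ArrayList<>();

    BufferedAsyncCalls(Function<I, CompletableFuture<O>> asyncCall) {
        this.asyncCall = asyncCall;
    }

    // Analogue of processElement: start the call, then eagerly emit
    // whatever earlier calls have finished in the meantime.
    void processElement(I input, Consumer<O> output) {
        inFlight.add(asyncCall.apply(input).thenAccept(completed::add));
        drain(output);
    }

    // Analogue of finishBundle: block until every call is done, then emit the rest.
    // A failed call surfaces here as a CompletionException from join().
    void finishBundle(Consumer<O> output) {
        CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
        inFlight.clear();
        drain(output);
    }

    private void drain(Consumer<O> output) {
        O next;
        while ((next = completed.poll()) != null) {
            output.accept(next);
        }
    }

    public static void main(String[] args) {
        BufferedAsyncCalls<Integer, String> calls = new BufferedAsyncCalls<>(
                n -> CompletableFuture.supplyAsync(() -> "response-" + n));
        List<String> out = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            calls.processElement(i, out::add);
        }
        calls.finishBundle(out::add);
        System.out.println(out.size() + " results emitted");
    }
}
```

In an actual DoFn the buffer would be reset in startBundle, and, as far as I know, outputs emitted from finishBundle need an explicit timestamp and window, so those would have to be buffered alongside each result.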