How to build fault-tolerant aggregations in Esper CEP - real-time

As per the Esper (CEP) docs, it is standalone: it doesn't connect to any DB or external systems, and it does all aggregations by keeping minimal past data in memory.
So if I have an aggregate over the last 1-hour window and the Esper container node dies in between, wouldn't we lose the aggregation data permanently?
How to handle such scenarios with Esper?
Thanks,
Harish

It's handled by EsperHA, but that is not open source.

Related

Apache NiFi: Oracle to MongoDB data transfer

I want to transfer data from Oracle to MongoDB using Apache NiFi. Oracle has a total of 9 million records.
I have created a NiFi flow using the QueryDatabaseTable and PutMongoRecord processors. This flow is working fine but has some performance issues.
After starting the NiFi flow, the records in the queue for SplitJson -> PutMongoRecord keep increasing.
Is there any way to slow down the rate at which the SplitJson processor puts records into the queue?
OR
Increase the rate of insertion in PutMongoRecord?
Right now, 100k records are inserted in 30 minutes; how can I speed up this process?
#Vishal. The solution you are looking for is to increase the concurrency of PutMongoRecord:
You can also experiment with the batch size in the configuration tab.
You can also reduce the execution time of SplitJson. However, you should remember this processor is going to take 1 FlowFile and make a lot of FlowFiles regardless of the timing.
How much you can increase concurrency is going to depend on how many NiFi nodes you have, and how many CPU cores each node has. Be experimental and methodical here. Move up in single increments (1-2-3-etc.) and test your flow at each increment. If you only have 1 node, you may not be able to tune the flow to your performance expectations. Tune the flow instead for stability and as fast as you can get it, then consider scaling.
How much you can increase concurrency and batch size is also going to depend on the MongoDB data source and the total number of connections you can get from NiFi to Mongo.
In addition to Steven's answer, there are two properties on QueryDatabaseTable that you should experiment with:
Max Results Per Flowfile
Use Avro logical types
With the latter, you might be able to do a direct shift from Oracle to MongoDB, because it'll convert Oracle date types into Avro ones, and those should in turn be converted directly into proper Mongo date types. Max Results Per Flowfile should also allow you to specify appropriate batching without having to use the extra processors.

Reading from a MongoDB changeStream with unbounded PCollections in Apache Beam

I'm designing a new way for my company to stream data from multiple MongoDB databases, perform some arbitrary initial transformations, and sink them into BigQuery.
There are various requirements but the key ones are speed and ability to omit or redact certain fields before they reach the data warehouse.
We're using Dataflow to basically do this:
MongoDB -> Dataflow (Apache Beam, Python) -> BigQuery
We basically need to just wait on the collection.watch() call as the input, but from the docs and existing research it may not be possible.
At the moment, the MongoDB connector is bounded and there seems to be no readily-available solution to read from a changeStream, or a collection in an unbounded way.
Is it possible to read from a changeStream and have the pipeline wait until the task is killed rather than being out of records?
In this instance I decided to go via Google Pub/Sub, which serves as the unbounded data source.
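
Since the route I took puts Pub/Sub between MongoDB and Beam, here is a minimal sketch of what that can look like, assuming PyMongo, the google-cloud-pubsub client and the Beam Python SDK. Every project, topic, subscription, collection, field and schema name below is a placeholder for illustration, not a value from the question.

```python
# Bridge script: watch a MongoDB change stream and republish each event to Pub/Sub.
# All names (project, topic, URI, collection, redacted field) are hypothetical.
from bson import json_util
from google.cloud import pubsub_v1
from pymongo import MongoClient

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "mongo-changes")     # hypothetical topic

collection = MongoClient("mongodb://localhost:27017")["mydb"]["orders"]  # hypothetical

# collection.watch() blocks and yields one document per change event.
with collection.watch(full_document="updateLookup") as stream:
    for change in stream:
        doc = change.get("fullDocument") or {}
        doc.pop("ssn", None)  # redact sensitive fields before they leave the source
        publisher.publish(topic_path, json_util.dumps(doc).encode("utf-8"))
```

On the Beam side the pipeline then stays unbounded by reading from the Pub/Sub subscription in streaming mode:

```python
# Streaming Beam pipeline: Pub/Sub -> parse JSON -> BigQuery.
# Subscription, table and schema are hypothetical placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadChanges" >> beam.io.ReadFromPubSub(
            subscription="projects/my-gcp-project/subscriptions/mongo-changes-sub")
        | "Parse" >> beam.Map(json.loads)
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-gcp-project:analytics.orders",
            schema="id:STRING,amount:FLOAT,updated_at:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```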

can MongoDB ChangeStream feature replace Pub/Sub technologies

I was going through MongoDB ChangeStreams, and I understand they reduce the risk of tailing the oplog - we are currently tailing the oplog to publish the data to Kafka.
Please help me understand how ChangeStreams are better compared to Pub/Sub technologies like Kafka or RabbitMQ.
ChangeStreams should not be compared to Pub/Sub technologies - ChangeStreams are there to provide a safe way to capture real-time (data) change events and then process them (and, as you rightly pointed out, previously you had to tail the oplog in MongoDB to achieve a similar outcome, which had its own set of issues, risks and complexities that taxed the developer).
As I mentioned above, ChangeStreams provide a safe way to look at each data change event occurring in MongoDB, apply a filter to those events, and then process each qualifying event. ChangeStreams also let you replay previous events within the timeframe that your oplog covers - for example, if your application that implements ChangeStreams fails, you have the ability to pick up from the point where the application failed.
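A minimal sketch of that resume behaviour, assuming PyMongo; the connection URI, collection name and file-based checkpoint are placeholders chosen for illustration, and in practice the token would go to durable storage.

```python
# Resume a change stream from the last checkpointed event (if the oplog still covers it).
import os

from bson import json_util
from pymongo import MongoClient

TOKEN_FILE = "resume_token.json"  # hypothetical checkpoint location


def load_token():
    if os.path.exists(TOKEN_FILE):
        with open(TOKEN_FILE) as f:
            return json_util.loads(f.read())
    return None


def save_token(token):
    with open(TOKEN_FILE, "w") as f:
        f.write(json_util.dumps(token))


collection = MongoClient("mongodb://localhost:27017")["mydb"]["orders"]  # hypothetical

# resume_after=None simply starts a fresh stream on the first run.
with collection.watch(resume_after=load_token()) as stream:
    for change in stream:
        print(change["operationType"], change.get("documentKey"))  # stand-in for real processing
        save_token(stream.resume_token)  # checkpoint after each processed event
```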
Whilst ChangeStreams exhibit Pub/Sub-like behaviour from an event identification/processing perspective, that is where the likeness stops. A typical/common use case, where you are interested in capturing/identifying data-change events in MongoDB for downstream processing, is to create a Kafka producer that utilises the MongoDB driver, instantiates a ChangeStream, and, for each qualifying event occurring in MongoDB (made available via the ChangeStream), passes it on to Kafka.
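A rough sketch of that producer pattern, assuming PyMongo and kafka-python; the broker address, topic name and the $match filter are illustrative assumptions rather than anything prescribed above.

```python
# Kafka producer fed by a filtered MongoDB change stream.
from bson import json_util
from kafka import KafkaProducer
from pymongo import MongoClient

producer = KafkaProducer(bootstrap_servers="localhost:9092")             # hypothetical broker
collection = MongoClient("mongodb://localhost:27017")["mydb"]["orders"]  # hypothetical

# The change stream accepts an aggregation pipeline, so "qualifying events"
# can be filtered server-side before they ever reach the producer.
pipeline = [{"$match": {"operationType": {"$in": ["insert", "update"]}}}]

with collection.watch(pipeline, full_document="updateLookup") as stream:
    for event in stream:
        producer.send("mongo.orders.changes",                            # hypothetical topic
                      json_util.dumps(event).encode("utf-8"))
```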

Apache Drill parallel queries are executed sequentially

I tested Apache Drill on a local file system.
I used the REST API to query some Parquet files. When I run a REST query, I cannot execute another one; it waits until the first query finishes. I want the two queries to each use half of the CPU, but it seems multiple queries finish sequentially.
This is a regression, present in versions 1.13 and 1.14:
https://issues.apache.org/jira/browse/DRILL-6693
It has now been resolved. The fix is in the master branch and will be part of the upcoming Drill 1.15 release.
Under the Apache Drill options in the UI, check the following options:
exec.queue.enable
exec.queue.large
exec.queue.small
Description:
exec.queue.enable: Changes the state of query queues. False allows unlimited concurrent queries.
exec.queue.large: Sets the number of large queries that can run concurrently in the cluster. Range: 0-1000
exec.queue.small: Sets the number of small queries that can run concurrently in the cluster. Range: 0-1001
It also depends on the complexity of the query: if the query has joins it will be treated as multiple queries internally, and exec.queue.large should be set higher.
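As a rough illustration (not part of the original answer), these options can also be set with ALTER SYSTEM statements sent through the same REST API the question is already using; the host, port and chosen values below are assumptions, not recommendations.

```python
# Toggle and size Drill's query queues over the REST API, then read the settings back.
import requests

DRILL_URL = "http://localhost:8047/query.json"  # default drillbit web port; adjust as needed


def run_sql(sql):
    resp = requests.post(DRILL_URL, json={"queryType": "SQL", "query": sql})
    resp.raise_for_status()
    return resp.json()


# Enable queueing and allow a few queries of each size class to run concurrently.
run_sql("ALTER SYSTEM SET `exec.queue.enable` = true")
run_sql("ALTER SYSTEM SET `exec.queue.small` = 4")
run_sql("ALTER SYSTEM SET `exec.queue.large` = 2")

# Read the current values back for verification.
print(run_sql("SELECT * FROM sys.options WHERE name LIKE 'exec.queue%'"))
```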

What is an efficient way to send data from MongoDB to Hadoop?

I was discussing the usage of the MongoDB Connector for Hadoop with a coworker, and he explained that it was very inefficient. He stated that the MongoDB connector uses its own map-reduce and then uses the Hadoop map-reduce, which internally slows down the entire system.
If that is the case, what is the most efficient way to transport my data to the Hadoop cluster? What purpose does the MongoDB connector serve if it is so inefficient? In my scenario, I want to get the daily inserted data from MongoDB (roughly around 10MB) and put all of it into Hadoop. I should also add that each MongoDB node and Hadoop node share the same server.
The MongoDB Connector for Hadoop reads data directly from MongoDB. You can configure multiple input splits to read data from the same collection in parallel. The Mapper and Reducer jobs are run by Hadoop's Map/Reduce engine, not MongoDB's Map/Reduce.
If your data estimate is correct (only 10MB per day?), that is a small amount to ingest, and the job may be faster if you don't have any input splits calculated.
You should be wary of Hadoop and MongoDB competing for resources on the same server, as contention for memory or disk can affect the efficiency of your data transfer.
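To put that daily volume in perspective (this is not the connector's API, just a hedged sketch), roughly 10MB a day can also be moved by a plain script that pulls one day's inserts and writes them to HDFS; it assumes PyMongo and the HdfsCLI ("hdfs") package, and every connection string, path and name below is hypothetical.

```python
# Pull documents inserted in the last day and write them to HDFS as JSON lines.
from datetime import datetime, timedelta

from bson import ObjectId, json_util
from hdfs import InsecureClient
from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["mydb"]["events"]  # hypothetical
hdfs_client = InsecureClient("http://namenode:9870", user="hadoop")      # hypothetical

# With default ObjectId _ids, a range query on _id selects roughly the last day's
# inserts, because ObjectIds embed their creation timestamp.
since = datetime.utcnow() - timedelta(days=1)
cursor = collection.find({"_id": {"$gte": ObjectId.from_datetime(since)}})

day_file = "/data/mongo/events-%s.json" % since.strftime("%Y%m%d")
with hdfs_client.write(day_file, overwrite=True, encoding="utf-8") as writer:
    for doc in cursor:
        writer.write(json_util.dumps(doc) + "\n")
```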
To transfer your data from MongoDB to Hadoop you can also use ETL tools like Talend or Pentaho; it's much easier and more practical. Good luck!