Apache Ignite CEP implementation with large datasets - complex-event-processing

We need a CEP engine that can run over large datasets, so I looked at alternatives such as Flink and Ignite.
When I got to Ignite, I found that its query API does not seem suitable for running over large data. The reason is that this much data cannot be stored in the cache (insufficient memory: 2 TB is needed). I have looked at write-through and read-through, but the data payload (not the key) is not queryable with predicates (for example SQLPredicate).
My question is: am I missing something, or is it really like that?
Thanks

Ignite is an in-memory system by design. Cache store (read-through/write-through) allows storing data on disk, but queries only work over in-memory data.
"that much data cannot be stored in the cache (insufficient memory: 2 TB is needed)"
Why not? Ignite is a distributed system, it is possible to build a cluster with more than 2TB of combined RAM.
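To make the "combined RAM" point concrete, here is a back-of-the-envelope sizing sketch. The 256 GB per node, the 70% usable fraction and the backup count are assumptions picked for illustration, not Ignite recommendations.

```python
import math

def nodes_needed(dataset_gb, ram_per_node_gb, usable_fraction=0.7, backups=0):
    """Estimate how many nodes are needed to hold the dataset (plus backup copies) in RAM."""
    total_gb = dataset_gb * (1 + backups)            # primary copy plus backup copies
    usable_per_node = ram_per_node_gb * usable_fraction  # leave headroom for OS, heap, indexes
    return math.ceil(total_gb / usable_per_node)

print(nodes_needed(2048, 256))             # 2 TB, no backups -> 12
print(nodes_needed(2048, 256, backups=1))  # 2 TB, one backup copy -> 23
```

So a couple of dozen commodity-sized nodes would already cover 2 TB under these assumptions.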

Related

Apache Ignite with PostgreSQL

Objective: To scale existing application where PostgreSQL is used as a data store.
How can Apache Ignite help: We have an application with many modules, and all the modules use some shared tables. So we have only one PostgreSQL master database, and it is already on AWS large SSD machines. We already have Redis for caching but, as we know, a limitation of Redis is that partial updates and querying on secondary indexes are not easy.
Our use case:
We have two big tables: one is member and the second is subscription. It is a many-to-many relation where one member is subscribed to multiple groups, and we maintain the subscriptions in the subscription table.
The member table has around 40 million rows, so its size is around 40M x 1.5KB + more ~= 60GB.
Challenge:
We can't archive this data, since every member is active and there are frequent updates and reads on this table.
My thought:
Apache Ignite can help by providing a caching layer on top of the PostgreSQL tables, as far as I understand from the documentation.
Now, I have a couple of questions from an implementation point of view.
Will Apache Ignite fit our use case? If yes, then:
Will Apache Ignite keep all 60GB of data in RAM? Or can we distribute the RAM load across multiple machines?
To update the PostgreSQL database table, we are using Python and SQLAlchemy (ORM). Will there be a separate call to Apache Ignite to update the same record in memory, or is there any way that Apache Ignite can sync it immediately from the database?
Is there enough support for Python?
Is there REST API support to interact with Apache Ignite, so that I can avoid an ODBC connection?
What if this load doubles in the next year?
A quick answer is much appreciated. Thanks in advance.
Yes, it should fit your case.
Apache Ignite has persistence, meaning it can optionally store the data on disk, but if you employ it for caching only it will happily store everything in RAM.
There are two approaches. You can do your updates on Apache Ignite (which will propagate them to PostgreSQL), or you can do your updates to PostgreSQL and have Apache Ignite fetch them on first use (pull from PostgreSQL). The latter only works for new records, as you can imagine. There is no support for propagating data from PostgreSQL to Apache Ignite; I guess you could do something like that by using triggers, but it is untested.
There is a third-party client; I haven't tried it. Apache Ignite only has built-in native clients for C++/C#/Java for now; other platforms can only connect through JDBC/ODBC/REST and only use a fraction of the functionality.
There is a REST API, and it has improved recently.
120GB doesn't sound like anything scary as far as Apache Ignite is concerned.
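The two update approaches described in this answer can be sketched in plain Python. This is only a toy model of the read-through/write-through idea: `BackingStore` and `CachingLayer` are hypothetical names standing in for PostgreSQL and the cache, and none of this is Ignite's actual API.

```python
class BackingStore:
    """Stand-in for PostgreSQL: a plain dict keyed by primary key."""
    def __init__(self):
        self.rows = {}

    def load(self, key):
        return self.rows.get(key)

    def save(self, key, row):
        self.rows[key] = row


class CachingLayer:
    """Toy cache in front of the store (hypothetical, not an Ignite class)."""
    def __init__(self, store):
        self.store = store
        self.cache = {}

    def get(self, key):
        # Read-through: on a miss, pull the row from the store once and keep it.
        if key not in self.cache:
            row = self.store.load(key)
            if row is None:
                return None
            self.cache[key] = row
        return self.cache[key]

    def put(self, key, row):
        # Write-through: update the cache and propagate to the store.
        self.cache[key] = row
        self.store.save(key, row)


store = BackingStore()
layer = CachingLayer(store)

store.save(1, {"name": "alice"})     # row written directly to the database
print(layer.get(1))                  # first use: fetched from the store
store.save(1, {"name": "alice-v2"})  # later update that bypasses the cache
print(layer.get(1))                  # still the old row: the cache is stale
layer.put(1, {"name": "alice-v3"})   # update through the cache reaches both
print(store.load(1))
```

The second `get` shows why pulling from the database "only works for new records": once a row is cached, a direct database update is not seen unless something invalidates or rewrites the cached entry.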
In addition to alamar's answer:
You can store your data in memory on many machines, as Ignite supports partitioned caches that are divided into partitions and distributed across machines. You can configure data collocation and the number of backups.
There is an interesting memory model in Apache Ignite that allows you to persist data on disk quickly. As the Ignite developers have said, a database behind the cluster will be slower than Ignite persistence, because communication goes through external protocols.
At our company we have a huge Ignite cluster that keeps much more data than that in RAM.
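The partitioning, backup and collocation ideas can be illustrated with a small sketch. It is a deliberate simplification: the crc32 hash, the round-robin node assignment and the partition count of 1024 are assumptions made for this example and are not how Ignite actually assigns partitions to nodes.

```python
import zlib

PARTITIONS = 1024  # assumed partition count for this sketch

def partition_of(affinity_key: str, partitions: int = PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (crc32 here, for illustration)."""
    return zlib.crc32(affinity_key.encode("utf-8")) % partitions

def owners(partition: int, nodes: list, backups: int = 1) -> list:
    """Return the primary node followed by up to `backups` other nodes."""
    copies = min(backups + 1, len(nodes))
    return [nodes[(partition + i) % len(nodes)] for i in range(copies)]

nodes = ["node-a", "node-b", "node-c"]
# Collocation: a member and its subscriptions share the member id as affinity key,
# so they map to the same partition and therefore to the same nodes.
p = partition_of("member:42")
print(p, owners(p, nodes, backups=1))
```

With one backup, every partition lives on two nodes, so losing a single node does not lose data, at the cost of roughly doubling the RAM needed.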

How to modify the configuration of Kafka to process a large amount of data

I am using kafka_2.10-0.10.0.1. I have two questions:
- I want to know how I can modify the default configuration of Kafka to process a large amount of data with good performance.
- Is it possible to configure Kafka to process the records in memory without storing them on disk?
Thank you
"Is it possible to configure Kafka to process the records in memory without storing them on disk?"
No. Kafka is all about storing records reliably on disk, and then reading them back quickly off of disk. In fact, its documentation says:
As a result of taking storage seriously and allowing the clients to control their read position, you can think of Kafka as a kind of special purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.
You can read more about its design here: https://kafka.apache.org/documentation/#design. The implementation section is also quite interesting: https://kafka.apache.org/documentation/#implementation.
That said, Kafka is also all about processing large amounts of data with good performance. In 2014 it could handle 2 million writes per second on three cheap instances: https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines. More links about performance:
https://docs.confluent.io/current/kafka/deployment.html
https://www.confluent.io/blog/optimizing-apache-kafka-deployment/
https://community.hortonworks.com/articles/80813/kafka-best-practices-1.html
https://www.cloudera.com/documentation/kafka/latest/topics/kafka_performance.html
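As a rough illustration of the kind of producer settings those performance guides discuss, here is a sketch that renders a few overrides as properties-file lines. The property names are standard Kafka producer settings; the values are placeholder assumptions for illustration only, not tuned recommendations, and `to_properties` is a hypothetical helper.

```python
# Illustrative producer overrides; the values are assumptions, not recommendations.
producer_overrides = {
    "acks": "1",                   # wait for the leader only
    "batch.size": 65536,           # bytes batched per partition before sending
    "linger.ms": 20,               # wait briefly so batches can fill up
    "compression.type": "snappy",  # compress batches on the wire and on disk
    "buffer.memory": 67108864,     # total bytes the producer may buffer
}

def to_properties(settings: dict) -> str:
    """Render settings as key=value lines, sorted by key."""
    return "\n".join(f"{key}={value}" for key, value in sorted(settings.items()))

print(to_properties(producer_overrides))
```

Whether any of these help depends on message size and latency requirements, so it is worth measuring each change against a baseline.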

Key/Value distributed database for caching binary data

I am looking for a distributed key-value database for caching small binary objects, such as images, with a TTL. The size limitation is not a problem, as I am planning to split each object anyway to minimize latency. I need C# and Java drivers, and in the very near future I will also need a C++ driver. Databases like CouchDB and Redis seem to be document based. Mongo supports binary data and is well documented, but it is persistent and I am not sure it is scalable in terms of throughput. Cassandra is also persistent, and I am not sure about the quality of its C++/C# drivers, plus there is the need for constant repair because of deletions.
Aerospike is commercial and also document based. Maybe Riak with a memory or LevelDB backend (has anyone worked with its C++ client)?
Aerospike would be a perfect solution for you, for the reasons below:
Serves all your use cases
Key-value based.
Open sourced since version 3.0. Earlier, Aerospike clusters of up to 2 nodes were open sourced, and clusters of 3 or more nodes were paid.
Can be used in caching mode with no persistence.
Supports LRU and TTL.
Can save binary data.
Reasons for choosing Aerospike
Throughput: better than Mongo/Couchbase or any other NoSQL solution. See http://www.aerospike.com/benchmark/.
I have personally seen it work fine with more than 300k read TPS and 100k write TPS concurrently.
Automatic and efficient data sharding, data re-balancing and data distribution using RIPEMD160.
Highly available system in case of failover and/or network partitions.
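For readers unfamiliar with what "caching mode with LRU and TTL" means for binary values, here is a minimal single-process sketch of that behaviour. `TtlLruCache` is a hypothetical class written for this example; it does not reflect Aerospike's implementation or client API.

```python
import time
from collections import OrderedDict

class TtlLruCache:
    """Toy in-memory cache: entries expire after a TTL, and the least
    recently used entry is evicted when capacity is exceeded."""

    def __init__(self, capacity, ttl_seconds, clock=time.monotonic):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.clock = clock
        self._items = OrderedDict()  # key -> (expires_at, value), least recently used first

    def put(self, key, value: bytes):
        self._items[key] = (self.clock() + self.ttl, value)
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._items[key]  # lazily drop expired entries on access
            return None
        self._items.move_to_end(key)  # mark as recently used
        return value

cache = TtlLruCache(capacity=1000, ttl_seconds=300)
cache.put("img:1:chunk:0", b"\x89PNG...")
print(cache.get("img:1:chunk:0"))
```

A distributed store adds sharding and replication on top of this, but the eviction and expiry semantics per node are the same idea.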
Couchbase (not CouchDB) is a great option for you. It is highly scalable and easy to understand, use and scale. It's a KV document database evolved from memcached that also offers secondary indexes through Map/Reduce, with many new things coming soon. You can still use the memcached protocol/libraries, or speed things up with the Couchbase SDKs.
Have you looked at Pivotal GemFire? Pivotal GemFire is a distributed data management platform providing dynamic scalability, high performance, and database-like persistence.
Pivotal GemFire also has client drivers in C++, C# and Java.

HBase, mongoDB, Cassandra - overhead on small cluster, small data

OK, these systems are scalable with respect to the number of nodes and large amounts of data.
But what about the overhead involved if I use these systems on a small cluster (5-10 nodes) and on a small amount of data, processing/storing on the scale of a couple of gigabytes? Or on even smaller data, like hundreds of MB?
Are there better database systems to use for my cluster and my amount of data?
A scalable solution usually pays a penalty in order to scale over large data. That penalty is paltry compared to the large data that you get to process. If you do not envisage processing terabytes of data, then you could do with a more responsive system that does not pay that penalty.
Use an SQLite database for smaller data. Frankly, it depends on what other requirements/constraints you have.
You can probably just use a single-node MySQL server for this kind of data, with the advantages of full SQL capabilities, full ACID, mature tools, etc.
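To show how little setup a small-data option needs, here is a minimal SQLite sketch using Python's standard `sqlite3` module. The `readings` table and its columns are made up for the example.

```python
import sqlite3

# In-memory database for the demo; pass a file path instead to persist data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, sensor TEXT, value REAL)")
conn.execute("CREATE INDEX idx_readings_sensor ON readings (sensor)")
conn.executemany(
    "INSERT INTO readings (sensor, value) VALUES (?, ?)",
    [("a", 1.5), ("a", 2.5), ("b", 4.0)],
)
conn.commit()

# Full SQL (grouping, aggregates, indexes) with no server or cluster to run.
rows = conn.execute(
    "SELECT sensor, AVG(value) FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()
print(rows)  # [('a', 2.0), ('b', 4.0)]
```

For hundreds of MB to a few GB on one machine, this kind of embedded database avoids the coordination overhead the distributed systems pay.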

Redis versus Cassandra(Bigtable data model)

Suppose I need to do the following operations intensively:
put(key, value)
where value is a map of <column name, column value>.
I haven't known NoSQL for long. What I know is that both a Cassandra insert (which conforms to the API defined in the Bigtable paper) and the Redis "HSET" command could do that. But what are the pros and cons of each approach? Is there any difference in performance and scalability?
EDIT:
My requirement is something like an IM server: I need to store session data, and I want all of it to be in memory so that low latency can be easily achieved. A session lasts for at most 2 hours. There is no consistency requirement to consider yet, and disk is only for fail-over. Loss of data is not terrible. All I need is lower latency. Operations per second: the more, the better.
Both Redis and Cassandra can be used as a key-value store. The difference is in speed, scale and reliability.
Redis works best as a single server, where the entire data set resides in memory.
Cassandra can handle data sets that don't fit in memory, and data sets that don't fit on a single machine. As part of distributing over multiple machines, Cassandra is much more reliable. Cassandra can handle machine failures, rebuilding machines, and adding capacity to the cluster when needed.
Because Redis is entirely in memory, and reads/writes are served by a single machine (a single Cassandra write will typically talk to multiple machines), Redis will most likely be faster.
If your primary goal is speed, you don't need to store data reliably, and your data set fits in memory, then Redis would probably be the better solution.
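To make the `put(key, value)` operation from the question concrete, here is a small in-process sketch of a session store where the value is a map of columns and sessions expire after 2 hours. `SessionStore` is a hypothetical class for illustration; with Redis the equivalent would roughly be an HSET followed by an EXPIRE on the key, and refreshing the expiry on every put is a design choice of this sketch.

```python
import time

class SessionStore:
    """Toy in-memory store: each key holds a map of column name -> value."""

    def __init__(self, ttl_seconds=2 * 60 * 60, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._sessions = {}  # key -> (expires_at, {column: value})

    def _live(self, key):
        """Return the column map if the session exists and has not expired."""
        entry = self._sessions.get(key)
        if entry is None:
            return None
        if self.clock() >= entry[0]:
            del self._sessions[key]  # lazily drop expired sessions
            return None
        return entry[1]

    def put(self, key, columns: dict):
        """Merge columns into the session (partial update) and refresh its expiry."""
        current = self._live(key) or {}
        current.update(columns)
        self._sessions[key] = (self.clock() + self.ttl, current)

    def get(self, key):
        current = self._live(key)
        return dict(current) if current is not None else None

sessions = SessionStore()
sessions.put("user:1", {"status": "online"})
sessions.put("user:1", {"device": "phone"})  # only the new column is written
print(sessions.get("user:1"))
```

The point of the column map is that a later put touches only the fields it names, which is the partial-update behaviour both HSET and a Cassandra column insert provide.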