Apache Ignite with PostgreSQL

Objective: To scale an existing application where PostgreSQL is used as the data store.
How can Apache Ignite help: We have an application with many modules, and all the modules use some shared tables. So we have only one PostgreSQL master database, and it's already on large AWS SSD machines. We already have Redis for caching, but a limitation of Redis is that partial updates and querying on secondary indexes are not easy.
Our use case:
We have two big tables: one is member and the second is subscription. It's a many-to-many relation where one member is subscribed to multiple groups, and we maintain the subscriptions in the subscription table.
The member table has around 40 million rows, and its size is roughly 40M x 1.5KB + overhead ~= 60GB.
Challenge
The challenge is that we can't archive this data, since every member is active and there are frequent updates and reads on this table.
My thought:
Apache Ignite can provide a caching layer on top of a PostgreSQL table, as far as I can tell from the documentation.
Now, I have a couple of questions from an implementation point of view.
Will Apache Ignite fit our use case? If yes, then:
Will Apache Ignite keep all 60GB of data in RAM? Or can we distribute the RAM load across multiple machines?
For updating the PostgreSQL database tables we use Python and SQLAlchemy (ORM). Will there need to be a separate call to Apache Ignite to update the same record in memory, or is there any way that Apache Ignite can sync it immediately from the database?
Is there enough support for Python?
Is there REST API support to interact with Apache Ignite? I'd like to avoid an ODBC connection.
What if this load doubles in the next year?
A quick answer is much appreciated. Thanks in advance.

Yes, it should fit your case.
Apache Ignite has persistence, meaning it can optionally store data on disk, but if you employ it for caching only, it will happily store everything in RAM.
There are two approaches. You can do your updates through Apache Ignite (which will propagate them to PostgreSQL), or you can do your updates to PostgreSQL and have Apache Ignite fetch them on first use (pull from PostgreSQL). The latter only works for new records, as you can imagine. There is no support for propagating data from PostgreSQL to Apache Ignite; I guess you could build something like that with triggers, but it is untested.
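The two update strategies described above can be sketched roughly like this. This is a toy illustration, not Ignite's API: plain dicts stand in for the Ignite cache and the PostgreSQL table, and all names are made up.

```python
# Toy sketch of write-through vs. read-through (pull) caching.
# Plain dicts stand in for Ignite and PostgreSQL; names are illustrative.

class WriteThroughCache:
    """Updates go to the cache, which propagates them to the database."""
    def __init__(self, db):
        self.db = db
        self.cache = {}

    def put(self, key, value):
        self.cache[key] = value
        self.db[key] = value          # the cache store writes to PostgreSQL

    def get(self, key):
        return self.cache[key]

class ReadThroughCache:
    """Updates go straight to the database; the cache pulls on first use."""
    def __init__(self, db):
        self.db = db
        self.cache = {}

    def get(self, key):
        if key not in self.cache:     # cache miss: load from PostgreSQL
            self.cache[key] = self.db[key]
        return self.cache[key]

db = {}
wt = WriteThroughCache(db)
wt.put("member:1", {"name": "Alice"})
assert db["member:1"] == {"name": "Alice"}    # propagated to the store

rt = ReadThroughCache(db)
db["member:2"] = {"name": "Bob"}              # written directly to the DB
assert rt.get("member:2") == {"name": "Bob"}  # pulled on first access
db["member:2"] = {"name": "Robert"}           # a later direct DB update...
assert rt.get("member:2") == {"name": "Bob"}  # ...is NOT seen by the cache
```

The last assertion is exactly the limitation mentioned above: once a record is cached, a direct update to PostgreSQL is not propagated back to the cache.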
There is a third-party client; I haven't tried it. Apache Ignite only has built-in native clients for C++/C#/Java for now; other platforms can only connect through JDBC/ODBC/REST and use a fraction of the functionality.
There is a REST API, and it has improved recently.
120GB doesn't sound like anything scary as far as Apache Ignite is concerned.

in addition to alamar's answer:
You can store your data in memory on many machines, as Ignite supports partitioned caches that are divided into parts distributed between machines. You can set up data collocation and the number of backups.
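The idea behind a partitioned cache with backups can be sketched as follows. This is a hypothetical simplification, not Ignite's actual affinity function: keys hash to a fixed number of partitions, and each partition is owned by a primary node plus a configurable number of backup nodes.

```python
# Rough sketch of partitioned-cache key placement with backups.
# Node names, partition count, and the mapping scheme are illustrative.
import hashlib

NODES = ["node-a", "node-b", "node-c"]
PARTITIONS = 16
BACKUPS = 1  # number of backup copies per partition

def partition(key: str) -> int:
    """Map a key to a stable partition via a hash."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % PARTITIONS

def owners(key: str) -> list:
    """Primary node plus BACKUPS following nodes hold the key's partition."""
    start = partition(key) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(1 + BACKUPS)]

# Every key lands on a primary plus one backup, so losing a single
# node never loses data, and RAM usage is spread across the cluster.
replicas = owners("member:12345")
assert len(replicas) == 1 + BACKUPS
assert replicas[0] != replicas[1]
```

With BACKUPS=1, the cluster holds the 60GB working set twice, spread across all machines, which is how the RAM load gets distributed.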
Apache Ignite has an interesting memory model that allows it to persist data to disk quickly. As the Ignite developers have said, a database behind the cluster will be slower than Ignite's native persistence, because communication goes through external protocols.
In our company we have a huge Ignite cluster that keeps far more data than that in RAM.

Related

Can 2 Debezium Connectors read from same source at the same time?

As the title says, I have two separate servers, and I want both connectors to read from the same source and write to their respective topics. A single connector works well, but when I create another one on a different server, they both seem to be running, yet no data flows for either.
My question is: is it possible to run two Debezium connectors that read from the same source? I couldn't find any information about this in the documentation.
Edit: I've tested this with an Oracle database and never saw it work well. I definitely wouldn't recommend it, especially with Oracle.
Generally speaking, Debezium does not recommend using multiple connectors per database source and prefers that you adjust your connector configuration instead. We understand that isn't always possible when you have different business use cases at play.
That said, if you do deploy multiple connectors, it's important to configure each connector so that it doesn't share state, such as the same database history topic.
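As a trimmed, illustrative example of "not sharing state", two connector registrations against the same Oracle source would each get their own server name and history topic. The names below are hypothetical, and the exact property keys vary by Debezium version (newer releases use `topic.prefix` and `schema.history.internal.kafka.topic` instead of `database.server.name` and `database.history.kafka.topic`):

```json
{
  "name": "oracle-connector-a",
  "config": {
    "connector.class": "io.debezium.connector.oracle.OracleConnector",
    "database.server.name": "server-a",
    "database.history.kafka.topic": "schema-changes.server-a"
  }
}
```

```json
{
  "name": "oracle-connector-b",
  "config": {
    "connector.class": "io.debezium.connector.oracle.OracleConnector",
    "database.server.name": "server-b",
    "database.history.kafka.topic": "schema-changes.server-b"
  }
}
```

If both connectors were registered with the same server name and history topic, they would corrupt each other's schema history state, which matches the "running but no data flows" symptom described in the question.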
For certain database platforms, such as MySQL, having multiple source connectors doesn't really place any burden on the database. But for other databases like Oracle, running multiple connectors can have a pretty substantial impact.
When an Oracle connector streams changes, it starts an Oracle LogMiner mining session. This session is responsible for loading, reading, parsing, and preparing the contents of the data read into a special in-memory table that the connector uses to generate change events. When you run multiple connectors, you will have concurrent Oracle LogMiner sessions, and each session will consume its own share of PGA memory to support the steps taken by Oracle LogMiner. Depending on your database's volatility, this can be stressful on the database server, since Oracle specifically assigns one LogMiner session to a CPU.
For an Oracle environment, I highly recommend avoiding multiple connectors unless you need to stream changes from different PDBs within the same instance, since there is really no technical reason to read, load, parse, and generate change data for the same redo entries multiple times, once per connector deployment.

Limitations of Kafka as a Distributed DB

I have an application which requires an interesting orchestration between states of instances distributed across geographic regions, in combination with the need for a scalable distributed database.
At the moment I think that Kafka with log compaction will fit my needs for state maintenance and message exchange between instances, and Cassandra will fit my needs for high volume distributed reads and writes of persisted data.
However, there is quite a lot of data duplicated that way: Many of the data exchanged via Kafka would also need to be stored to Cassandra for distributed data access. Using Kafka for both messaging and distributed data querying and persistence seems tempting.
Therefore, I'm interested to figure out the real-world pros and cons to be expected when using e.g. the pull queries feature of Kafka to use it as a distributed database [1].
Though, I'm a bit suspicious about what to expect of that in terms of performance and scalability, especially when compared to Cassandra, as well as unknown pitfalls.
What are the tradeoffs when using Kafka as a distributed DB, and what would it compare performance-wise to "native" distributed systems like Cassandra?
[1] https://www.confluent.io/de-de/blog/pull-queries-in-preview-confluent-cloud-ksqdb/
For pure KV lookups, Kafka StateStores / Interactive Queries can work, but with the caveat that if you use containers and an orchestrator, you need to keep the state of those stores on persistent volumes. Otherwise, when a container moves to a fresh host, the Streams changelog topic has to be read from the very beginning, giving you a "cold-start" problem during which you will be unable to query.
Using any database (with persistent storage) will not have this problem, and will always be able to query immediately.
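The cold-start problem can be illustrated with a toy model (plain Python, not Kafka Streams): a fresh instance without a persistent volume must replay the entire retained changelog before it can answer a single query, whereas a database-backed store is queryable immediately.

```python
# Toy model of the cold-start problem: a state store rebuilt from a
# compacted changelog topic. Keys and values are illustrative.

# Compacted topic: only the latest value per key is retained.
changelog = [
    ("user:1", b"v3"),
    ("user:2", b"v1"),
    ("user:3", b"v7"),
]

def rebuild_store(changelog):
    """What a fresh container without a persistent volume has to do:
    replay every retained record before the store can serve queries."""
    store = {}
    for key, value in changelog:  # O(retained records) before first query
        store[key] = value
    return store

store = rebuild_store(changelog)
assert store["user:3"] == b"v7"
```

With a real topic holding millions of retained keys, this replay can take minutes to hours, which is exactly the window during which interactive queries fail. Kafka Streams standby replicas or persistent volumes shrink that window; a database avoids it entirely.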
I'm not sure I would suggest Cassandra for strictly KV data, though.

What's the recommended number of Kafka connectors for a large database ? (Debezium)

I'm trying to set up Debezium for data change monitoring in this huge database.
Documentation says: "Debezium can monitor any number of databases. The number of connectors that can be deployed to a single cluster of Kafka Connect services depends upon the volume and rate of events. However, Debezium supports multiple Kafka Connect service clusters and, if needed, multiple Kafka clusters as well."
However, there's no mention about how many connectors is a good practice.
Reading Medium posts and some use cases, it seems like one connector for a whole database is a suitable option. But if we have a lot of tables and a lot of change events in a short span of time, won't that become a bottleneck? I've also seen people working with one connector per table, which would mean a LOT of connectors in this case. If you have a use case involving heavy databases along with Debezium, could you share your experience with connectors?
(The source databases, in this case, are mostly Postgres.)
Sorry if it's a dumb question. Thank you in advance.

Key/Value distributed database for caching binary data

I am looking for a distributed KV database for caching small binary objects, like images, with TTL. Size limitation is not a problem, as I am planning to split each object anyway to minimize latency. I need C# and Java drivers, and in the very near future I will also need a C++ driver. Databases like CouchDB and Redis seem to be document-based. Mongo supports binary data and is well documented, but it is persistent and I am not sure it is scalable in terms of throughput; Cassandra is also persistent, and I am not sure about the quality of its C++/C# drivers, plus there is the need for constant repairs because of deletions.
Aerospike is commercial and also document-based. Maybe Riak with the memory or LevelDB backend (has anyone worked with its C++ client?).
Aerospike would be a perfect solution for you for the reasons below:
Serves all your Use cases
Key Value based.
Open-sourced from version 3.0. Earlier, only clusters of up to 2 Aerospike nodes were open source, and clusters of 3 or more nodes were paid.
Can be used in Caching mode with no persistence.
Supports LRU and TTL.
Can save binary data.
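The TTL semantics mentioned above can be sketched in a few lines. This is a toy, client-side illustration of what "supports TTL" means; a real store like Aerospike expires entries server-side:

```python
# Toy sketch of TTL cache semantics for binary values.
# A real KV store expires entries server-side; names are illustrative.
import time

class TTLCache:
    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp)

    def put(self, key, value, ttl_seconds):
        self._data[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if time.monotonic() >= expiry:  # lazily expire on read
            del self._data[key]
            return None
        return value

cache = TTLCache()
cache.put("img:42", b"\x89PNG...", ttl_seconds=0.05)
assert cache.get("img:42") == b"\x89PNG..."
time.sleep(0.1)
assert cache.get("img:42") is None  # expired after the TTL elapsed
```

The value is opaque bytes, which is all "can save binary data" requires: the store never needs to interpret the image payload.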
Reasons for choosing Aerospike
Throughput: better than Mongo/Couchbase or any other NoSQL solution. See http://www.aerospike.com/benchmark/.
I have personally seen it work fine with more than 300k read TPS and 100k write TPS concurrently.
Automatic and efficient data sharding, rebalancing, and distribution using RIPEMD-160.
Highly available in the case of failover and/or network partitions.
Couchbase (not CouchDB) is a great option for you. Highly scalable, and easy to understand, use, and scale. It's a KV document database evolved from memcached that also offers secondary indexes through Map/Reduce, with many new things coming soon. You can still use the memcached protocol/libraries or speed things up with the Couchbase SDKs.
Have you looked at Pivotal GemFire? Pivotal GemFire is a distributed data management platform providing dynamic scalability, high performance, and database-like persistence.
Pivotal GemFire also has client drivers in C++, C# and Java

I am looking for a key-value datastore which has the following (preferable) properties

I am trying to build a distributed task queue, and I am wondering if there is any data store which has some or all of the following properties. I am looking to have a completely decentralized, multi-node/multi-master, self-replicating datastore cluster to avoid any single point of failure.
Essential
Supports Python pickled objects as values.
Persistent.
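To illustrate the first requirement: any store that accepts opaque bytes as values can hold pickled Python objects. A plain dict stands in for the hypothetical datastore here:

```python
# Sketch of storing pickled Python objects as values in a KV store.
# A dict stands in for the datastore; the task structure is made up.
import pickle

store = {}  # hypothetical KV datastore that accepts bytes values

task = {"id": 7, "func": "resize_image", "args": ("img-42.png", (128, 128))}
store["task:7"] = pickle.dumps(task)      # serialize on write

restored = pickle.loads(store["task:7"])  # deserialize on read
assert restored == task
```

So the requirement really reduces to "values are arbitrary bytes", which nearly every KV store satisfies; the pickling itself happens client-side.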
More the better, in decreasing order of importance (I do not expect any datastore to meet all the criteria :-)).
Distributed.
Synchronous Replication across multiple nodes supported.
Runs/Can run on multiple nodes, in multi-master configuration.
Datastore cluster exposed as a single server.
Round-robin access to/selection of a node for read/write action.
Decent python client.
Support for Atomicity in get/put and replication.
Automatic failover
Decent documentation and/or Active/helpful community
Significantly mature
Decent read/write performance
Any suggestions would be much appreciated.
Cassandra (open-sourced by Facebook) has pretty much all of these properties. There are several Python clients, including pycassa.
Edited to add:
Cassandra is fully distributed, multi-node P2P, with tunable consistency levels (i.e. your replication can be synchronous or asynchronous or a mixture of both). Clients can connect to any server. Failover is automatic, and new servers can be added on-the-fly for load balancing. Cassandra is in production use by companies such as Facebook. There is an O'Reilly book. Write performance is extremely high, read performance is also high.
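The "tunable consistency" mentioned above comes down to a simple arithmetic rule: with replication factor N, a write acknowledged by W replicas and a read from R replicas are guaranteed to overlap on at least one up-to-date replica whenever R + W > N. A minimal sketch of that rule:

```python
# Quorum-overlap rule behind Cassandra-style tunable consistency:
# reads and writes overlap on a fresh replica whenever R + W > N.

def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    """n = replication factor, r = read replicas, w = write acks."""
    return r + w > n

N = 3
assert is_strongly_consistent(N, r=2, w=2)      # QUORUM reads + QUORUM writes
assert not is_strongly_consistent(N, r=1, w=1)  # ONE + ONE: stale reads possible
assert is_strongly_consistent(N, r=1, w=3)      # write ALL, read ONE
```

Choosing R and W per operation is what lets the same cluster behave synchronously for some workloads and asynchronously for others.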