Apache Geode region vs cache - geode

When defining caches in, let's say, EhCache, we define multiple caches such as employee, departments, etc. Do we define separate regions in Apache Geode in a similar fashion?

From About Apache Geode:
Caches are an abstraction that describe a node in a Geode cluster.
Within each cache, you define data regions. Data regions are analogous
to tables in a relational database and manage data in a distributed
fashion as name/value pairs.
So yes, as an analogy you can think of the Apache Geode Cache as the EhCache CacheManager, and the Apache Geode Region as the EhCache Cache.
Hope this helps. Cheers.
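To make the analogy concrete, here is a toy model in plain Python. It is only a sketch: the `Cache` and `Region` classes and their methods are hypothetical stand-ins written for this illustration, not the Apache Geode API.

```python
class Region:
    """Toy stand-in for a Geode region: a named map of key/value pairs."""

    def __init__(self, name):
        self.name = name
        self._entries = {}

    def put(self, key, value):
        self._entries[key] = value

    def get(self, key):
        return self._entries.get(key)


class Cache:
    """Toy stand-in for a Geode cache: one per node, owning many regions."""

    def __init__(self):
        self._regions = {}

    def create_region(self, name):
        if name in self._regions:
            raise ValueError(f"region {name!r} already exists")
        region = Region(name)
        self._regions[name] = region
        return region

    def get_region(self, name):
        return self._regions[name]


# One cache, several regions, much like one CacheManager with several caches.
cache = Cache()
employees = cache.create_region("employees")
departments = cache.create_region("departments")
employees.put(1, "Alice")
departments.put(1, "Engineering")
```

The same key (`1`) lives independently in each region, which is the point of defining separate regions per kind of data.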

Related

Limitations of Kafka as a Distributed DB

I have an application which requires an interesting orchestration between states of instances distributed across geographic regions, in combination with the need for a scalable distributed database.
At the moment I think that Kafka with log compaction will fit my needs for state maintenance and message exchange between instances, and Cassandra will fit my needs for high volume distributed reads and writes of persisted data.
However, quite a lot of data is duplicated that way: much of the data exchanged via Kafka would also need to be stored in Cassandra for distributed data access. Using Kafka for both messaging and distributed data querying and persistence seems tempting.
Therefore, I'm interested to figure out the real-world pros and cons to be expected when using e.g. the pull queries feature of Kafka to use it as a distributed database [1].
Though, I'm a bit suspicious about what to expect of that in terms of performance and scalability, especially when compared to Cassandra, as well as unknown pitfalls.
What are the tradeoffs when using Kafka as a distributed DB, and how would it compare performance-wise to "native" distributed systems like Cassandra?
[1] https://www.confluent.io/de-de/blog/pull-queries-in-preview-confluent-cloud-ksqdb/
For pure KV lookups, Kafka StateStores / Interactive Queries can work, but with the caveat that if you use containers and an orchestrator, you need to keep the state of those stores on persistent volumes. Otherwise, when the containers move to a fresh host, the streams changelog topic needs to be read from the very beginning, giving you a "cold-start" problem, and you will be unable to query until the replay has finished.
Using any database (with persistent storage) will not have this problem, and will always be able to query immediately.
I'm not sure I would suggest Cassandra for strictly KV data, though.
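The cold-start issue can be pictured with a small simulation. This is a sketch in plain Python, not Kafka Streams code: the `replay` function and the list-based `changelog` are assumptions made for illustration, with `None` values playing the role of tombstones in a compacted topic.

```python
def replay(changelog, store=None, start_offset=0):
    """Apply changelog records from start_offset onward and return
    (store, next_offset). A value of None acts as a tombstone (delete)."""
    store = {} if store is None else store
    for key, value in changelog[start_offset:]:
        if value is None:
            store.pop(key, None)
        else:
            store[key] = value
    return store, len(changelog)


changelog = [
    ("user-1", "online"),
    ("user-2", "online"),
    ("user-1", "away"),
    ("user-2", None),  # tombstone: user-2 is removed
]

# Cold start: no local state survived, so every record is replayed.
cold_store, cold_offset = replay(changelog)

# Warm start: state and offset were kept on a persistent volume after the
# first three records, so only the tail of the log is replayed.
saved_store, saved_offset = replay(changelog[:3])
warm_store, warm_offset = replay(changelog, saved_store, saved_offset)
```

Both paths end in the same state, but the cold path has to read the whole log before it can serve a query, and that cost grows with the size of the changelog.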

Configure Apache Ignite Cluster with multiple database as backend partitions

I am new to Apache Ignite. Here's what I am curious about:
Can I set up an Ignite cluster as a frontend proxy that distributes requests, based on a data column such as tenantID, to MySQL instances where each instance holds the data for a single tenant?
Just to make it clear, it is pretty much like a proxy to multiple database instances with same table. So I could save single tenant data into an isolated database.
There are two possible approaches:
You could implement a custom cache store implementation[1] that uses the right connection depending on the record's attribute.
You could use tenantId as the affinity key to map records with the same tenantId to the same partitions. A custom affinity function[2] also lets you map partitions to the corresponding nodes marked with some attribute[3].
On each node, the cache store configuration[4] could use a data source pointing to a specific MySQL server, based on the node attribute.
[1] https://apacheignite.readme.io/docs/3rd-party-store
[2] https://apacheignite.readme.io/docs/affinity-collocation#affinity-function
[3] https://apacheignite.readme.io/docs/cluster#cluster-node-attributes
[4] https://apacheignite.readme.io/docs/3rd-party-store#cachejdbcpojostore
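The first approach boils down to choosing a connection per record. Below is a rough sketch of that routing idea in Python. It does not use the Ignite cache store interface: `TenantRoutingStore` and its methods are hypothetical, and in-memory SQLite databases stand in for the per-tenant MySQL instances.

```python
import sqlite3


class TenantRoutingStore:
    """Hypothetical write-through store that picks a database per tenant."""

    def __init__(self, connections):
        # Maps tenant_id -> DB-API connection (SQLite here, MySQL in the question).
        self._connections = connections
        for conn in connections.values():
            conn.execute(
                "CREATE TABLE IF NOT EXISTS records "
                "(id INTEGER PRIMARY KEY, payload TEXT)"
            )

    def _conn_for(self, tenant_id):
        try:
            return self._connections[tenant_id]
        except KeyError:
            raise LookupError(f"no database configured for tenant {tenant_id!r}") from None

    def write(self, tenant_id, record_id, payload):
        conn = self._conn_for(tenant_id)
        conn.execute(
            "INSERT OR REPLACE INTO records (id, payload) VALUES (?, ?)",
            (record_id, payload),
        )
        conn.commit()

    def load(self, tenant_id, record_id):
        row = self._conn_for(tenant_id).execute(
            "SELECT payload FROM records WHERE id = ?", (record_id,)
        ).fetchone()
        return row[0] if row else None


store = TenantRoutingStore({
    "acme": sqlite3.connect(":memory:"),
    "globex": sqlite3.connect(":memory:"),
})
store.write("acme", 1, "invoice")
store.write("globex", 1, "order")
```

Each tenant's rows end up only in that tenant's database, even though both use the same table name and the same record id.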
You can also just use the default cache store implementation and pass a different data source to each node. It is typically taken from the local Spring container.
In this case, every node will only talk to a local MySQL, leading to the consequence that every MySQL only holds data of one node, and then you can configure the distribution via Ignite's facilities.
I have not tried this, but it may be sound.

Apache Ignite with PostgreSQL

Objective: To scale existing application where PostgreSQL is used as a data store.
How can Apache Ignite help: We have an application with many modules, and all the modules use some shared tables. So we have only one PostgreSQL master database, and it's already on large AWS SSD machines. We already have Redis for caching but, as we know, a limitation of Redis is that partial updates and querying on secondary indexes are not easy.
Our use case:
We have two big tables: one is member and the other is subscription. It's a many-to-many relation where one member is subscribed to multiple groups, and we maintain the subscriptions in the subscription table.
The member table has around 40 million rows, so its size is around 40M x 1.5KB + more ~= 60GB
Challenge
The challenge is that we can't archive this data, since every member is active and there are frequent updates and reads on this table.
My thought:
From what I read in the documentation, Apache Ignite can help by providing a caching layer on top of the PostgreSQL table.
Now, I have a couple of questions from an implementation point of view.
Will Apache Ignite fit our use case? If yes, then:
Will Apache Ignite keep all 60GB of data in RAM? Or can we distribute the RAM load across multiple machines?
To update the PostgreSQL database table we use Python and SQLAlchemy (ORM). Will there be a separate call to Apache Ignite to update the same record in memory, or is there any way for Apache Ignite to sync it immediately from the database?
Is there enough support for Python?
Is there REST API support for interacting with Apache Ignite, so that I can avoid an ODBC connection?
What if this load doubles in the next year?
A quick answer is much appreciated. Thanks in advance.
Yes it should fit your case.
Apache Ignite has persistence, meaning it can optionally store the data on disk, but if you employ it for caching only it will happily store everything in RAM.
There are two approaches. You can do your updates on Apache Ignite (which will propagate them to PostgreSQL), or you can do your updates on PostgreSQL and have Apache Ignite fetch them on first use (pull from PostgreSQL). The latter only works for new records, as you can imagine, because a record that is already cached in Ignite is not re-read. There is no support for propagating data from PostgreSQL to Apache Ignite. I guess you could do something like that using triggers, but it is untested.
There is a 3rd party client. I didn't try it. Apache Ignite only has built-in native clients for C++/C#/Java for now; other platforms can only connect through JDBC/ODBC/REST and only use a fraction of the functionality.
There is a REST API and it has improved recently.
120GB doesn't sound like anything scary as far as Apache Ignite is concerned.
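The difference between the two update approaches is easier to see in a toy model. The sketch below is plain Python, not Ignite code: `ReadWriteThroughCache` is a hypothetical class and a dict stands in for the PostgreSQL table.

```python
class ReadWriteThroughCache:
    """Toy cache in front of a database (a dict stands in for PostgreSQL)."""

    def __init__(self, database):
        self._db = database
        self._entries = {}

    def get(self, key):
        # Read-through: hit the database only the first time a key is requested.
        if key not in self._entries and key in self._db:
            self._entries[key] = self._db[key]
        return self._entries.get(key)

    def put(self, key, value):
        # Write-through: update the cache and propagate to the database.
        self._entries[key] = value
        self._db[key] = value


database = {1: "alice@old.example"}
cache = ReadWriteThroughCache(database)

first_read = cache.get(1)            # loaded from the database on first use
cache.put(2, "bob@example.com")      # update via the cache reaches the database

database[1] = "alice@new.example"    # update made directly in the database
stale_read = cache.get(1)            # cache still holds the old value

database[3] = "carol@example.com"    # new record written directly
fresh_read = cache.get(3)            # never cached before, so it is pulled in
```

Here `stale_read` still returns the old address while `fresh_read` sees the new row, which is why updating PostgreSQL directly only works for records the cache has not loaded yet.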
In addition to alamar's answer:
You can store your data in memory across many machines, as Ignite supports partitioned caches that are divided into partitions and distributed between machines. You can configure data collocation and the number of backups.
There is an interesting memory model in Apache Ignite that allows you to persist data on disk quickly. According to the Ignite developers, a database behind the cluster will be slower than Ignite persistence because communication goes through external protocols.
In our company we have a huge Ignite cluster that keeps much more data in RAM.
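As a rough illustration of how partitioning and backups spread the load, here is a small Python sketch. The placement rule (CRC32 modulo a partition count, then round-robin over nodes) is a simplification chosen for this example and is not Ignite's actual affinity function. The sizing helper also ignores index and per-entry overhead.

```python
import zlib


def owners(key, nodes, partitions=1024, backups=1):
    """Return [primary, backup, ...] nodes for a key under a simple
    modulo placement (an assumption, not Ignite's real algorithm)."""
    partition = zlib.crc32(str(key).encode("utf-8")) % partitions
    copies = min(backups + 1, len(nodes))
    return [nodes[(partition + i) % len(nodes)] for i in range(copies)]


def ram_per_node_gb(data_gb, node_count, backups=1):
    """Raw data each node holds when every entry has `backups` extra copies."""
    return data_gb * (backups + 1) / node_count


nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = owners("member:42", nodes, backups=1)
per_node_today = ram_per_node_gb(60, len(nodes), backups=1)
per_node_doubled = ram_per_node_gb(120, len(nodes), backups=1)
```

With 4 nodes and 1 backup, the 60GB table from the question comes to about 30GB of raw data per node, and 60GB per node if the data doubles without adding machines.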

What's a Cluster / Bucket in Couchbase Server

I'm new to Couchbase and NoSQL technologies in general, but I'm working on a web chat application running on Node.js using Express and some other modules.
I've chosen to work with NoSQL to store sessions and all needed data on the server side. But I don't really understand some important features of Couchbase: What is a Cluster? What is a Bucket? Where can I find some clear definitions of how the server works?
Couchbase uses the term cluster in the same way as many other products: a Couchbase cluster is simply a collection of machines running as a coordinated, distributed system of Couchbase nodes.
A Bucket is a Couchbase specific term that is roughly analogous to a 'database' in traditional RDBMS terms. A Bucket provides a container for grouping your data, both in terms of organisation and grouping of similar data and resource allocation. You can configure your buckets separately, providing different quotas, different IO priorities and different security settings on a per bucket basis. Buckets are also the primary method for namespacing documents in Couchbase.
For further information, the Architecture and Concepts overview in the Couchbase documentation, specifically the part on data storage, is a good starting point. A somewhat outdated video, Introduction to Couchbase, might also be useful to you.
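For a chat application like the one in the question, the bucket idea can be pictured with a toy model. This is plain Python and not the Couchbase SDK: the `Cluster` and `Bucket` classes, their methods, and the quota numbers are all made up for illustration.

```python
class Bucket:
    """Toy bucket: a named container of documents with its own settings."""

    def __init__(self, name, ram_quota_mb):
        self.name = name
        self.ram_quota_mb = ram_quota_mb  # per-bucket resource setting
        self._documents = {}

    def upsert(self, key, document):
        self._documents[key] = document

    def get(self, key):
        return self._documents.get(key)


class Cluster:
    """Toy cluster: a group of nodes managed as one unit, hosting buckets."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._buckets = {}

    def create_bucket(self, name, ram_quota_mb):
        if name in self._buckets:
            raise ValueError(f"bucket {name!r} already exists")
        bucket = Bucket(name, ram_quota_mb)
        self._buckets[name] = bucket
        return bucket

    def bucket(self, name):
        return self._buckets[name]


cluster = Cluster(["node1", "node2"])
sessions = cluster.create_bucket("sessions", ram_quota_mb=256)
messages = cluster.create_bucket("messages", ram_quota_mb=1024)

# The same key can exist in both buckets without clashing.
sessions.upsert("u:1", {"user": "ann"})
messages.upsert("u:1", {"text": "hi"})
```

The two buckets act as separate namespaces with separate quotas, which is the "database in an RDBMS" role described above.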
Even though it's answered, hope the following would be more helpful for someone.
A Couchbase cluster contains nodes. Nodes contain buckets. Buckets contain documents. Documents can be retrieved in multiple ways: by their keys, by querying with N1QL, and by using Views.
As specified in the Couchbase Documentation,
Node
A single Couchbase Server instance running on a physical server,
virtual machine, or a container. All nodes are identical: they consist
of the same components and services and provide the same interfaces.
Cluster
A cluster is a collection of nodes that are accessed and managed as a
single group. Each node is an equal partner in orchestrating the
cluster to provide facilities such as operational information
(monitoring) or managing cluster membership of nodes and health of
nodes.
Clusters are scalable. You can expand a cluster by adding new nodes
and shrink a cluster by removing nodes.
The Cluster Manager is the main component that orchestrates the
cluster level operations. For more information, see Cluster Manager.
Bucket
A bucket is a logical container for a related set of items such as
key-value pairs or documents. Buckets are similar to databases in
relational databases. They provide a resource management facility for
the group of data that they contain. Applications can use one or more
buckets to store their data. Through configuration, buckets provide
segregation along the following boundaries:
Cache and IO management
Authentication
Replication and Cross Datacenter Replication (XDCR)
Indexing and Views
For further info, see Couchbase Terminology.

Key/Value distributed database for caching binary data

I am looking for a distributed KV database for caching small binary objects, such as images, with TTL. Size limits are not a problem, as I am planning to split each object anyway to minimize latency. I need C# and Java drivers, and in the very near future I will also need a C++ driver. Databases like CouchDB and Redis seem to be document based. Mongo supports binary data and is well documented, but it is persistent and I am not sure it scales in terms of throughput. Cassandra is also persistent, and I am not sure about the quality of its C++/C# drivers, plus it needs constant repair because of deletions.
Aerospike is commercial and also document based. Maybe Riak with a memory or LevelDB backend (has anyone worked with its C++ client?).
Aerospike would be a perfect solution for you, for the reasons below:
Serves all your use cases
Key-value based.
Open source since version 3.0. Earlier, Aerospike clusters of up to 2 nodes were open source, and clusters of 3 or more nodes were paid.
Can be used in Caching mode with no persistence.
Supports LRU and TTL.
Can save binary data.
Reasons for choosing Aerospike
Throughput: Better than Mongo/Couchbase or any other NoSQL solution. See this http://www.aerospike.com/benchmark/.
Have personally seen it work fine with more than 300k read TPS and 100k Write TPS concurrently.
Automatic and efficient data sharding, data re-balancing and data distribution using RIPEMD160.
Highly Available system in case of Failover and/or Network Partitions.
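To show what "caching mode with LRU and TTL" means for binary values, here is a single-process sketch in Python. It is not Aerospike code and says nothing about how Aerospike implements eviction: `TtlLruCache` and its parameters are hypothetical, and the clock is injectable so expiry can be demonstrated without sleeping.

```python
import time
from collections import OrderedDict


class TtlLruCache:
    """In-memory cache for bytes with per-entry TTL and LRU eviction."""

    def __init__(self, max_entries, clock=time.monotonic):
        self._max = max_entries
        self._clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, blob), oldest first

    def put(self, key, blob, ttl_seconds):
        self._entries[key] = (self._clock() + ttl_seconds, bytes(blob))
        self._entries.move_to_end(key)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict least recently used

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, blob = item
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop lazily on access
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return blob


now = [0.0]  # fake clock so the example is deterministic
cache = TtlLruCache(max_entries=2, clock=lambda: now[0])
cache.put("a", b"\x89PNG", ttl_seconds=10)
cache.put("b", b"\xff\xd8", ttl_seconds=60)
cache.get("a")                            # touch "a", so "b" is now the LRU entry
cache.put("c", b"GIF8", ttl_seconds=60)   # cache is full: "b" is evicted
now[0] = 11.0                             # "a" is past its 10 second TTL
```

After these steps only "c" is still retrievable: "b" was evicted by LRU and "a" expired by TTL.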
Couchbase (not CouchDB) is a great option for you. It is highly scalable and easy to understand and use. It's a KV document database that evolved from memcached and also offers secondary indexes through Map/Reduce, with many new things coming soon. You can still use the memcached protocol/libraries, or speed things up with the Couchbase SDKs.
Have you looked at Pivotal GemFire? Pivotal GemFire is a distributed data management platform providing dynamic scalability, high performance, and database-like persistence.
Pivotal GemFire also has client drivers in C++, C# and Java.