What are the unusable scenarios of TDengine? - cluster-analysis

I was testing the brain split brain issue of TDengine these days with different configurations which prevents me from continuing to deploy cluster. In order to reduce the possibility of encountering bottlenecks, the unusable scenarios of TDengine are necessary to declare in advance. So I want to know what situation will cause the TDengine cluster core dump or unavailable except split brain or master selection failure? How can TDengine cluster effectively ensure high availability?

I think TDengine is designed for storing/processing time-series data, for example in IoT industry data collected by sensor or data generated through monitoring devices over time. The underlying data structure for storing data on Disk uses LSM tree so its more suitable for situations with more data writing than reading since new data will be just append to the last.
Basically TDengine is more suitable for storing time-series data with respect to optimizing the performance, but not recommended for usage in OLAP and other situations.

Related

Limitations of Kafka as a Distributed DB

I have an application which requires an interesting orchestration between states of instances distributed across geographic regions, in combination with the need for a scalable distributed database.
At the moment I think that Kafka with log compaction will fit my needs for state maintenance and message exchange between instances, and Cassandra will fit my needs for high volume distributed reads and writes of persisted data.
However, there is quite a lot of data duplicated that way: Many of the data exchanged via Kafka would also need to be stored to Cassandra for distributed data access. Using Kafka for both messaging and distributed data querying and persistence seems tempting.
Therefore, I'm interested to figure out the real-world pros and cons to be expected when using e.g. the pull queries feature of Kafka to use it as a distributed database [1].
Though, I'm a bit suspicious about what to expect of that in terms of performance and scalability, especially when compared to Cassandra, as well as unknown pitfalls.
What are the tradeoffs when using Kafka as a distributed DB, and what would it compare performance-wise to "native" distributed systems like Cassandra?
[1] https://www.confluent.io/de-de/blog/pull-queries-in-preview-confluent-cloud-ksqdb/
pure KV lookups
Then Kafka StateStores / Interactive Queries can work, but with the caveat that if you use containers and an orchestrator, you need to maintain the state of those stores somewhere on persistent volumes. Otherwise, when the containers move to a fresh host, the streams changelog topic needs to be read from the very beginning, giving you a "cold-start" problem, and you will be unable to query.
Using any database (with persistent storage) will not have this problem, and will always be able to query immediately.
I'm not sure I would suggest Cassandra for strictly KV data, though.

Redis Streams for implementing a Messaging System (chat) app versus traditional approaches

I'm implementing a chat app, which will support both one-on-one conversation and Group conversations.
So far the direction was to use Redis Pub/Sub with PostgreSQL as the cold storage, and WebSocket being the transport.
Every user will fetch the history from postgresql upon launch (up until the timestamp of the WebSocket+redis connection), and then subscribe to channels that go by their own user_id.
However, having a roundtrip to a DMBS with each new message sounds a bit strange, while definitely doable and legit.
So I decided to examine other approaches. One possible approach was to use Kafka and eliminate the need for an DBMS altogether.
It sounds viable and comes with its own set of advantages.
But turns out there's a new kid on the block - Redis Streams.
From what I gather, it is actually quite similar to Kafka in this specific scenario (chat).
It has many nice features that sound very convenient for implementing a chat system.
And now I am trying to understand whether Streams + disk persistency is the wise way to go versus Kafka versus PostgreSQL+Redis pub/sub
The main aspects in consideration are:
Performance. Postgres and Kafka both operate on disk, meaning slower than the in-memory operations in the case of redis. On the other hand , obviously the messages must be persisted and available at all times and events, so redis will be persisted to disk. Wouldn't that negate the whole in-memory performance gain?
And even if not - would the performance gain under peak load and a big data base be noticeable?
Memory / Costs. With redis these two are closely tied together. As a small startup, the efforts are focused on being ready to cope with sudden scale peaks (up to a million users), but at the same time - the costs should be minimized.
Is storing millions of messages in Streams going to be too memory-costly which in turn will translate to financially-costly?
Recovery, Reliability & Availability, Persistency. with Postgres, even a single instance can handle a big traffic load, but it can also offer master-slave setups and also consistency. Can Redis be a match to that? Also, with a DMBS I can be assured that the data is there to stay. Can I know that with redis?
Scaling.

Propagate change in distributed in-memory cache

I've an application deployed on a cluster of 1000 commodity boxes. While starting, each instance of the application loads a non-trivial amount of data from database and uses this as cache. During a day, around 20%of this cached data needs to be updated.
What are the efficient ways of near simultaneous update of in-memory data of entire cluster? I thought of JMX, Zookeeper, but not sure if that would be really efficient/fast enough.
Well assuming you're using Memcached's consistent hashing, go a step further and have each cache replicate to their closest successor. This can lessen the problem but not entirely alleviate it but it's a simple solution, Gossip + CRDTs are another solution, Dynamo and Riak use a combination of Gossip, Consistent Hashing, and CRDTs.

Redis versus Cassandra(Bigtable data model)

Suppose I need to do the following operations intensively:
put(key, value)
where value is a map of <column name, column value>.
I havn’t known NoSQL for long, what I know is that both Cassandra insert(which conform the api defined in Bigtable paper) and Redis “HSET” command could do that. But what’s the pros and cons of both way? Any performance and scalability difference there?
EDIT :
My requirement is something like an IM server --- I need to store session data , and I want all of them to be in memory so that low latency can be easily achieved. The session last for at most 2 hours. No consistency requirement to consider yet. And disk is only for fail-over. Lost of data is not terrible. All i need is lower latency. Operations per second --- the more, the better.
Both redis and cassandra can be used as a key value store. The difference is in speed, scale and reliability.
Redis works best as a single server, where the entire data set resides in memory.
Cassandra can handle data sets that don't fit in memory, and data sets that don't fit on a single machine. As part of distributing over multiple machines, cassandra is much more reliable. Cassandra can handle machine failures, rebuilding machines, adding capacity to the cluster when needed.
Because redis is entirely in memory, and reads/writes are served by a single machine (a single cassandra write will typically talk to multiple machines), redis will most likely be faster.
If your primary goal is speed, and you don't need to store data reliably, and your data set fits in memory, then redis would probably be a better solution.

Do NoSQL datacenter aware features enable fast reads and writes when nodes are distributed across high-latency connections?

We have a data system in which writes and reads can be made in a couple of geographic locations which have high network latency between them (crossing a few continents, but not this slow). We can live with 'last write wins' conflict resolution, especially since edits can't be meaningfully merged.
I'd ideally like to use a distributed system that allows fast, local reads and writes, and copes with the replication and write propagation over the slow connection in the background. Do the datacenter-aware features in e.g. Voldemort or Cassandra deliver this?
It's either this, or we roll our own, probably based on collecting writes using something like
rsync and sorting out the conflict resolution ourselves.
You should be able to get the behavior you're looking for using Voldemort. (I can't speak to Cassandra, but imagine that it's similarly possible using it.)
The key settings in the configuration will be:
replication-factor — This is the total number of times the data is stored. Each put or delete operation must eventually hit this many nodes. A replication factor of n means it can be possible to tolerate up to n - 1 node failures without data loss.
required-reads — The least number of reads that can succeed without throwing an exception.
required-writes — The least number of writes that can succeed without the client getting back an exception.
So for your situation, the replication would be set to whatever number made sense for your redundancy requirements, while both required-reads and required-writes would be set to 1. Reads and writes would return quickly, with a concomitant risk of stale or lost data, and the data would only be replicated to the other nodes afterwards.
I have no experience with Voldemort, so I can only comment on Cassandra.
You can deploy Cassandra to multiple datacenters with an inter-DC latency higher than a few milliseconds (see http://spyced.blogspot.com/2010/04/cassandra-fact-vs-fiction.html).
To ensure fast local reads, you can configure the cluster to replicate your data to a certain number of nodes in each datacenter (see "Network Topology Strategy"). For example, you specify that there should always be two replica in each data center. So even when you lose a node in a data center, you will still be able to read your data locally.
Write requests can be sent to any node in a Cassandra cluster. So for fast writes, your clients would always speak to a local node. The node receiving the request (the "coordinator") will replicate the data to other nodes (in other datacenters) in the background. If nodes are down, the write request will still succeed and the coordinator will replicate the data to the failed nodes at a later time ("hinted handoff").
Conflict resolution is based on a client-supplied timestamp.
If you need more than eventual consistency, Cassandra offers several consistency options (including datacenter-aware options).