Different Ways to store session information in distributed cluster while building a distributed chat server? - chat

Look for various ways and designs used in storing session information such that message from a client can be handled by a server and delegated to appropriate server to forward message further.
I have explored following methods
Using a central storage such as redis for storing server_id : user_session_id (single point of failure)
Using Distributed Hash table for distributing information across servers - O(n) for fetching info
Using Chord (efficient DHT) - O(log n) for fetching info
Are there other standard practices while solving the above problem?

Related

For Data Transfer, REST API vs SFTP, which is more secure?

Two different applications needs a data transfer from one another for certain activities. Option to do this data transfer is either prepare a file of data and push it through SFTP at certain point of time, or push/pull the changes through REST API in real time.
Which approach will be more secure if the data in one system is completely encrypted and in one it is raw?
When choosing an integration pattern, as always, the answer is: "It depends".
How much data, and how frequently?
What is the security classification for the data (and the potential impact of a data breach)?
How mature are the IT Engineering/Operations/DevOps teams that will be involved in implementing the integration points, on both ends?
What facilities are available for ensuring that data is encrypted at-rest, as well as in-transit?
What is the business requirement regarding data latency?
What is the physical distance between the two systems?
What is the sustainable network connection speed between the two systems?
What are the hours during which the integration needs to be active/scheduled?
Is the data of a bulk/reference type - or is it an event/transaction nature type?
What is the size per message/transaction?
What is the total size of the data (MB? GB? PB? ...?) that needs to be transferred during a given processing cycle (per hour? per day?)?
Additional considerations that should be examined:
Network timeout/retry scenarios
Batch reruns
CPU, Memory, Network bandwidth utilization
Peak hour processing vs. off peak hour processing
Infrastructure/environmental limits/constraints - and cost factors (e.g. Cloud hosting limits on transactions, data, file size, Cloud-hosted API Gateway pricing strategies, ...)
Network Latency introduced by number of messages that must be sent to complete transfer of all data via an API vs. SFTP
Data Latency introduced by SFTP batch scheduling
It is hard to compare SSH (SFTP) with SSL (RESTful API using HTTPS) as both have different functions.
SFTP has a broader surface area for attack as it uses SSH for tunneling which means there are multiple areas for there to be compromises. That does not mean it is less secure.

Questions about using Apache Kafka Streams to implement event sourcing microservices

Event sourcing means a 180 degree shift in the way many of us have been architecting and developing web applications, with lots of advantages but also many challenges.
Apache Kafka is an awesome platform that through its Apache Kafka Streams API is advertised as a tool that allows us to implement this paradimg through its many features (decoupling, fault tolerance, scalability...): https://www.confluent.io/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/
On the other hand there are some articles discouraging us from using it for event sourcing: https://medium.com/serialized-io/apache-kafka-is-not-for-event-sourcing-81735c3cf5c
These are my questions regarding Kafka Streams suitability as an event sourcing plaftorm:
The article above comes from Jesper Hammarbäck (who works for serialized.io, an event sourcing platform). I would like to get an answer to the main problems he brings up:
Loading current state. In my view with log compaction and state stores it's not a problem. Am I right?
Consistent writes.
When moving certain pieces of functionality into Kafka Streams I'm not sure if they do fit naturally:
Authentication & Security: Imagine your customers are stored in a state store generated from a customer-topic. Should we keep their passwords in the topic/store? It doesn't sound safe enough, does it? Then how are we supposed to manage this aspect of having customers on a state store and their passwords somewhere else? Any recommended good practice?
Queries: Interactive queries are a nice tool to generate queriable views of our data (by key). That's ok to get an entity by id but what about complex queries (joins)? Do we need to generate state stores per query? For instance one store for customers by id, another one for customers by state, another store for customers who purchased a product last year... It doesn't sound manageable. Another point is the lack of pagination: how can we handle big sets of data when querying the state stores? One more point, we can’t do dynamic queries (like JPA criteria API) anymore. This leads to CQRS maybe? Complexity keeps growing this way...
Data growth: with databases we are used to have thousands and thousands of rows per table. Kafka Streams applications keep a local state store that will grow and grow over time. How scalable is that? How is that local storage kept (local disk/RAM)? If it's disk we should provision applications with enough space, if it's RAM enough memory.
Loading Current State: The mechanism described in the blog, about re-reacting current state ad-hoc for a single entity would indeed be costly with Kafka. However Kafka Streams follow the philosophy to keep the current state for all object in a KTable (that is distributed/sharded). Thus, it's never required to do this -- of course, it come with certain memory costs.
Kafka Streams parallelized based on different events. Thus, all interactions for a single event (processing, state updates) are performed by a single thread. Thus, I don't see why there should be inconsistent writes.
I am not sure what the exact requirement would be. In the current implementation, Kafka Streams does not offer any store specific authentication or security features. There are several things one could do for security though: (a) encrypt the local disk: this might be the simplest thing to do to protect data. (2) encrypt messages within the business logic, before you put them into the store.
Interactive Queries offers limited support for many reasons (don't want to go into details) and it was never design with the goal to support complex queries. The idea is about eager computation of result what can be retrieved with simple lookups. As you pointed out, this is not very scalable (cost intensive) if you have a lot of different queries. To tackle this, it would make sense to load the data into a database, and let the DB does what it is build for. Kafka Streams alone is not the right tool for this atm -- however, there is no reason to not combine both.
Per default Kafka Streams uses RocksDB to keep local state (you can switch to in-memory stores, too). Thus, it's possible to write to disk and to use very large state. Of course, you need to provision your instances accordingly (cf: https://docs.confluent.io/current/streams/sizing.html). Besides this, Kafka Streams scales horizontally and is fully elastic. Thus, you can add new instances at any point in time allowing you to hold terra-bytes of state if you have large disks and enough instances. Note, that the number of input topic partitions limit the number of instances you can use (internally, Kafka Streams is a consumer group, and you cannot have more instances than partitions). If this is a concern, it's recommended to over-partition the input topics in the first place.

Are all distributed database designed to process data in parallel?

I am learning about the characteristics of distributed database and I came across this website that describes some of the advantages of distributed database:
https://www.atlantic.net/cloud-hosting/about-distributed-databases-and-distributed-data-systems/
According to that site, the advantages of distributed database are listed below:
Reliability – Building an infrastructure is similar to investing: diversify to reduce your chances of loss. Specifically, if a failure occurs in one area of the distribution, the entire database does not experience a setback.
Security – You can give permissions to single sections of the overall database, for better internal and external protection.
Cost-effective – Bandwidth prices go down because users are accessing remote data less frequently.
Local access – Similarly to #1 above, if there is a failure in the umbrella network, you can still get access to your portion of the database.
Growth – If you add a new location to your business, it’s simple to create an additional node within the database, making distribution highly scalable.
Speed & resource efficiency – Most requests and other interactivity with the database are performed at a local level, also decreasing remote traffic.
Responsibility & containment – Because any glitches or failures occur locally, the issue is contained and can potentially be handled by the IT staff designated to handle that piece of the company.
However, parallelism (I mean not concurrent write, but processing data in parallel in each node) is not on the list. This makes me wonder: are all distributed databases (i.e. Mongo DB, Cassandra, HBase) designed to process data in parallel? If this is false, which distributed databases support parallel processing and which ones don't?
To find out what I mean by Parallel Processing (not concurrent write), please see: https://softwareengineering.stackexchange.com/questions/190719/the-difference-between-concurrent-and-parallel-execution

How good are ZooKeeper and Etcd?

Disclaimer: I'm quite new for the etcd project and ZooKeeper project.
I'm recently getting interested in the distributed open source products.
I found they seems to require configuration(coordination?) systems such as ZooKeeper for Presto DB, Hive and Etcd for kubernetes and I think that understanding the role of etcd and ZooKeeper is the first step to understand the distributed systems.
But now, I feel like getting lost... I could not yet understand what is the good and unique points of the etcd and ZooKeeper. They look for me a well-distributed key-value storage or file systems.
Here is the impression that I have for the products. I know the impressions don't reflect the feature of the products. but I don't know what is the remaining feature that I should know.
ZooKeeper: According to the overview page of ZooKeeper, it guarantees the following things.
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.
The sequential consistency and atomicity are the unique features which is not supported by most file systems but others are common among other file systems.
Etcd: According to the README of etcd. it focuses on
Simple: curl'able user-facing API (HTTP+JSON)
Secure: optional SSL client cert authentication
Fast: benchmarked 1000s of writes/s per instance
Reliable: properly distributed using Raft
Most of them seems common with Amazon S3 (S3 doesn't support such a fast access.)
I know those products are very good ones because most of the distributed open source products depend on them. but what is the key, unique feature that the distributed open source product choose them?
I think you're confusing the file-system-like interface with an actual file system. The systems you are mentioning are well suited for cluster coordination, in particular ZooKeeper. What they are not designed for is storing large amounts of data like a file system would. You should think of them more as suited for coordinating a file system. That is, one could imagine a file system storing paths to files in a consistent store like ZooKeeper or etcd, but not the files themselves. That they expose a file system-like interface does not correlate to any ability to store files. Indeed, these systems are designed to store small amounts of data that can be held in memory. By using a consistent store like ZooKeeper for storing file information in a distributed file system, the file system would ensure that clients see changes in the file system in sequential order.
ZooKeeper is really a set of primitives with which distributed systems can be coordinated. Particularly relevant to coordinating distributed systems with ZooKeeper are its session events (watches) which allow clients to listen for changes to the cluster state. Distributed systems typically use watches in ZooKeeper for things like locks, and the strong consistency guarantees of ZooKeeper make it perfectly suitable for that use case.
If you want a good idea of what systems like ZooKeeper and etcd are used for, you should check out the Apache Curator recipes. Atomix also implements similar types of APIs for coordinating distributed systems on top of a consensus algorithm. All of these tools are demonstrative of typical use cases for consensus-based distributed systems.
What's important to note is that these types of systems are built on top of consensus algorithms and usually store state in memory. They're suitable for operations that involve a small amount of data but require a high level of consistency, and that's why they're frequently used for things like distributed locking, configuration management, and group membership.

Suggested Hadoop-based Design / Component for Ingestion of Periodic REST API Calls

We are planning to use REST API calls to ingest data from an endpoint and store the data to HDFS. The REST calls are done in a periodic fashion (daily or maybe hourly).
I've already done Twitter ingestion using Flume, but I don't think using Flume would suit my current use-case because I am not using a continuous data firehose like this one in Twitter, but rather discrete regular time-bound invocations.
The idea I have right now, is to use custom Java that takes care of REST API calls and saves to HDFS, and then use Oozie coordinator on that Java jar.
I would like to hear suggestions / alternatives (if there's easier than what I'm thinking right now) about design and which Hadoop-based component(s) to use for this use-case. If you feel I can stick to Flume, then kindly give me also an idea how to do this.
As stated in the Apache Flume web:
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
As you can see, among the features attributed to Flume is the gathering of data. "Pushing-like or emitting-like" data sources are easy to integrate thanks to HttpSource, AvroSurce, ThriftSource, etc. In your case, where the data must be let's say "actively pulled" from a http-based service, the integration is not so obvious, but can be done. For instance, by using the ExecSource, which runs a script getting the data and pushing it to the Flume agent.
If you use a proprietary code in charge of pulling the data and writting it into HDFS, such a design will be OK, but you will be missing some interesting built-in Flume characteristics (that probably you will have to implement by yourself):
Reliability. Flume has mechanisms to ensure the data is really persisted in the final storage, retrying until is is effectively written. This is achieved through the usage of an internal channel buffering data both at the input (ingesting peaks of loads) and the output (retaining data until it is effecively persisted) and the transaction concept.
Performance. The usage of transactions and the possibility to configure multiple parallel sinks (data processors) will your deployment able to deal with really large amounts of data generated per second.
Usability. By using Flume you don't need to deal with the storage details (e.g. HDFS API). Even, if some day you decide to change the final storage you only have to reconfigure the Flume agent for using the new related sink.