System/protocol for sending data that deals with connection loss - streaming

This is not really a coding question, but at my company we are currently working on a tool for processing data. It works like this:
A local computer captures data from sensors every second and sends it on the fly to a cloud InfluxDB (a time-series database).
A cloud computer retrieves the data from the cloud InfluxDB and preprocesses it.
... (this part is not important here)
The problem is that the internet connection on the local computer is unstable. With the current infrastructure, we lose data when the connection drops, which is very inconvenient for us. We would like to never lose data, even when the connection drops, while still streaming it to InfluxDB in real time. (We could store the data on disk locally first and then send it to InfluxDB, but then we lose the real-time aspect.)
We wanted to use Kafka or MQTT for this, but neither seems to handle connection loss particularly well (AFAIK).
Is there a protocol/system which can handle this problem?
EDIT: we do not need high throughput, as we only produce one data point per second.
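As a minimal sketch of the "store on disk locally, then send" variant mentioned above, here is a Python store-and-forward loop. SQLite is just one convenient durable queue, and send_to_influx() and read_sensor() are placeholders for your existing InfluxDB write and sensor capture code; while the connection is up the queue drains immediately, so data still arrives close to real time.

```python
# Store-and-forward sketch: every reading is committed to a local SQLite
# queue first; a drain step then pushes queued rows to the cloud and only
# deletes them after a successful write, so nothing is lost during outages.
import sqlite3
import time

db = sqlite3.connect("buffer.db")
db.execute("CREATE TABLE IF NOT EXISTS queue (id INTEGER PRIMARY KEY, ts REAL, payload TEXT)")
db.commit()

def read_sensor():
    # Placeholder for the existing sensor capture code.
    return '{"temperature": 21.5}'

def send_to_influx(ts, payload):
    # Placeholder for the existing InfluxDB write call; it must raise on
    # failure so that the row stays in the queue.
    raise NotImplementedError

def enqueue(ts, payload):
    # Durable write: survives process crashes and connection drops.
    db.execute("INSERT INTO queue (ts, payload) VALUES (?, ?)", (ts, payload))
    db.commit()

def drain(batch_size=100):
    # Send the oldest rows first; stop at the first failure and retry later.
    rows = db.execute(
        "SELECT id, ts, payload FROM queue ORDER BY id LIMIT ?", (batch_size,)
    ).fetchall()
    for row_id, ts, payload in rows:
        try:
            send_to_influx(ts, payload)
        except Exception:
            return  # connection is down; keep the backlog on disk
        db.execute("DELETE FROM queue WHERE id = ?", (row_id,))
        db.commit()

while True:
    enqueue(time.time(), read_sensor())
    drain()        # catches up automatically once the connection comes back
    time.sleep(1)
```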

Related

What is the correct streaming pattern to replace database table polling?

I am trying to architect an event streaming system to replace our existing database table polling mechanism. We currently have a process where Application ABC will query/scan the entire XYZ (MySQL) table every 5 minutes so that we may get any updates to our data and cache them on Application ABC. As our data grows this will not be scalable or performant.
Instead, I want to have Application ABC read from a Kafka stream that contains any new events around the XYZ table, and use that to modify Application ABC's in-memory cache.
Where I'm having a hard time formulating a good solution is the initial database table load onto the Kafka stream. Since all the XYZ data that would be consumed by Application ABC is cached, we lose that data when we redeploy all of the Application ABC nodes. So we would need some kind of mechanism to be able to get all the XYZ data from the initial load onto the stream. I know Kafka streams are supposed to allow for infinite retention but I'm not sure if infinite retention is a realistic solution in this case due to cost.
What's the usual prescribed solution for this initial-load case, where Application ABC would need to reload the entire database off the stream every time a new instance is spun up? I'm also trying to work out the most performant solution here, so that Application ABC has the lowest latency when gathering all the data it needs from the XYZ table.
Another constraint to mention is that Application ABC needs to have this data in memory for performance reasons. We need to be able to iterate over the entire XYZ data set at all times. We cannot do simple queries by ID.
There is a bit to unpack here, but here is some info.
Instead of polling the DB, consider using a source connector to get the data into Kafka. Debezium is made for this. You haven't specified what type of database you are using, but it supports quite a few variants. The mechanism is called CDC (Change Data Capture), and it needs to be enabled on the database and on each of the tables first.
As for the Application ABC side, consider using a distributed cache with persistence enabled. Redis is a good option for this. That way it will retain the data even if your application is restarted. Reloading all the data back from Kafka is not a good idea: it will take a long time (depending on the amount of data), and the application will be unavailable for that duration after a restart.
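To make that a bit more concrete, here is a rough Python sketch of the consumer side. It assumes Debezium's default JSON envelope (payload.op / payload.after / payload.before), the kafka-python and redis-py client libraries, and illustrative topic and key names.

```python
# Apply Debezium change events for the XYZ table to a persistent Redis cache.
import json
import redis
from kafka import KafkaConsumer

cache = redis.Redis(host="localhost", port=6379)

consumer = KafkaConsumer(
    "dbserver.mydb.xyz",                  # Debezium topic for the XYZ table
    bootstrap_servers="localhost:9092",
    group_id="application-abc",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")) if m else None,
)

for message in consumer:
    if message.value is None:             # tombstone records carry no payload
        continue
    event = message.value["payload"]
    if event["op"] in ("c", "u", "r"):    # create, update, initial snapshot read
        row = event["after"]
        cache.hset("xyz", row["id"], json.dumps(row))
    elif event["op"] == "d":              # delete
        cache.hdel("xyz", event["before"]["id"])
```

Debezium's initial snapshot (the "r" read events) should also cover the initial table load, so the stream itself does not need infinite retention for this pattern.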

Sync two Mongodb Databases and have the client connect to the closest one

I am using MongoDB as the database for my game servers. When I started them, they were bound to just one region; I used the free tier of MongoDB Atlas in that region (India), and it worked perfectly, with around 20 ms of latency even under high load.
Now when I'm trying to scale up my servers and reach other regions like US-East, the latency jumps up to 500ms.
Is there a way I could run two MongoDB servers, one on my India instance and one on my US instance, that would always be in complete sync with each other, while each game server process just uses localhost to connect to its local replica? I'm using pyMongo.
The players have no connection with the database; it's the game server process that manages it.
Is there a way I could run two MongoDB servers, one on my India instance and one on my US instance, that would always be in complete sync with each other, while each game server process just uses localhost to connect to its local replica?
Assuming you are using a replica set, I believe the closest you can come to achieving the stated requirement for reads is by:
Performing all writes with write concern w=(# of nodes in the deployment); if you have two servers, then use w=2.
Performing reads with read preference=nearest and read concern=majority.
I say "closest" because I suspect read concern=majority on a secondary could return stale data even with w=(# of nodes in deployment), since I imagine the majority commit point would necessarily lag behind the point when each secondary commits each document. To guarantee that you are reading current data you must read from the primary.
Note that such a setup has (at least) two additional major drawbacks:
If any of the servers becomes unavailable, your application ceases to be able to write any data to the database.
Each write will wait for all servers in the deployment to store it, hence all writes will be slow.
Achieving the same requirement for reads and writes is not physically possible. (You essentially want to be able to write a document in, say, India in 20ms and have it be "instantly" available in US, i.e. be able to retrieve it in US 20 ms later, but it takes a minimum of 500 ms for data to travel from India to US.)
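Since the question mentions pyMongo, here is a rough sketch of what those settings look like in code; the connection string, database, and collection names are placeholders.

```python
# One collection handle configured for w=2 writes, another for nearest/majority reads.
from pymongo import MongoClient, ReadPreference, WriteConcern
from pymongo.read_concern import ReadConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
db = client["game"]

# Writes wait for both members: slow, and they fail if either server is down.
players_write = db.get_collection("players", write_concern=WriteConcern(w=2))
players_write.insert_one({"_id": "player42", "score": 1000})

# Reads go to the nearest member (lowest latency), at read concern majority.
players_read = db.get_collection(
    "players",
    read_preference=ReadPreference.NEAREST,
    read_concern=ReadConcern("majority"),
)
doc = players_read.find_one({"_id": "player42"})
```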

Redis Streams for implementing a Messaging System (chat) app versus traditional approaches

I'm implementing a chat app, which will support both one-on-one conversation and Group conversations.
So far the direction was to use Redis Pub/Sub with PostgreSQL as the cold storage, and WebSocket being the transport.
Every user will fetch the history from postgresql upon launch (up until the timestamp of the WebSocket+redis connection), and then subscribe to channels that go by their own user_id.
However, having a round trip to a DBMS for each new message sounds a bit strange, while definitely doable and legitimate.
So I decided to examine other approaches. One possible approach was to use Kafka and eliminate the need for a DBMS altogether.
It sounds viable and comes with its own set of advantages.
But turns out there's a new kid on the block - Redis Streams.
From what I gather, it is actually quite similar to Kafka in this specific scenario (chat).
It has many nice features that sound very convenient for implementing a chat system.
And now I am trying to understand whether Streams + disk persistence is the wise way to go, versus Kafka, versus PostgreSQL + Redis Pub/Sub.
The main aspects in consideration are:
Performance. Postgres and Kafka both operate on disk, meaning slower than the in-memory operations in the case of Redis. On the other hand, the messages obviously must be persisted and available at all times, so Redis would have to be persisted to disk as well. Wouldn't that negate the whole in-memory performance gain?
And even if not, would the performance gain under peak load and a large database be noticeable?
Memory / Costs. With Redis these two are closely tied together. As a small startup, our efforts are focused on being ready to cope with sudden scale peaks (up to a million users), but at the same time costs should be minimized.
Is storing millions of messages in Streams going to be too memory-costly, which in turn would make it financially costly?
Recovery, Reliability & Availability, Persistence. With Postgres, even a single instance can handle a big traffic load, and it can also offer master-slave setups and consistency. Can Redis match that? Also, with a DBMS I can be assured that the data is there to stay. Can I know that with Redis?
Scaling.
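For reference, here is a rough sketch of what the Streams variant could look like with redis-py; the chat:<conversation_id> key naming is just illustrative.

```python
# Each conversation is a Redis Stream; XADD appends, XRANGE replays history,
# and a blocking XREAD replaces the Pub/Sub subscription for live delivery.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def send_message(conversation_id, sender_id, text):
    # Returns the auto-generated stream ID, which is timestamp-ordered and
    # can double as the message ID.
    return r.xadd(f"chat:{conversation_id}", {"from": sender_id, "text": text})

def load_history(conversation_id, since_id="-"):
    # Replaces the "fetch history from PostgreSQL on launch" step.
    return r.xrange(f"chat:{conversation_id}", min=since_id, max="+")

def wait_for_new(conversation_id, last_seen_id="$"):
    # Blocks up to 5 s for new messages after last_seen_id.
    return r.xread({f"chat:{conversation_id}": last_seen_id}, block=5000)
```

Whether that is wise then mostly comes down to Redis configuration rather than the data model: AOF/RDB persistence for durability, maxmemory plus stream trimming (XTRIM/MAXLEN) for the memory cost, and replication/Sentinel or Cluster for availability and scaling.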

For Data Transfer, REST API vs SFTP, which is more secure?

Two different applications need to transfer data to one another for certain activities. The options for this transfer are either to prepare a data file and push it over SFTP at a certain point in time, or to push/pull the changes through a REST API in real time.
Which approach will be more secure if the data in one system is completely encrypted and in the other it is raw?
When choosing an integration pattern, as always, the answer is: "It depends".
How much data, and how frequently?
What is the security classification for the data (and the potential impact of a data breach)?
How mature are the IT Engineering/Operations/DevOps teams that will be involved in implementing the integration points, on both ends?
What facilities are available for ensuring that data is encrypted at-rest, as well as in-transit?
What is the business requirement regarding data latency?
What is the physical distance between the two systems?
What is the sustainable network connection speed between the two systems?
What are the hours during which the integration needs to be active/scheduled?
Is the data of a bulk/reference type, or is it of an event/transaction type?
What is the size per message/transaction?
What is the total size of the data (MB? GB? PB? ...?) that needs to be transferred during a given processing cycle (per hour? per day?)?
Additional considerations that should be examined:
Network timeout/retry scenarios
Batch reruns
CPU, Memory, Network bandwidth utilization
Peak hour processing vs. off peak hour processing
Infrastructure/environmental limits/constraints - and cost factors (e.g. Cloud hosting limits on transactions, data, file size, Cloud-hosted API Gateway pricing strategies, ...)
Network Latency introduced by number of messages that must be sent to complete transfer of all data via an API vs. SFTP
Data Latency introduced by SFTP batch scheduling
It is hard to compare SSH (SFTP) with SSL (a RESTful API using HTTPS), as the two have different functions.
SFTP has a broader attack surface because it uses SSH for tunneling, which means there are more areas where things could be compromised. That does not mean it is less secure.
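Both options encrypt data in transit (SSH for SFTP, TLS for HTTPS), so for the "raw on one side" concern it can help to encrypt the payload itself before it leaves the source system. Here is a rough Python sketch, assuming the cryptography and paramiko libraries and placeholder hosts, paths, and key handling.

```python
# Encrypt a data file client-side, then push it over SFTP; the remote side
# needs the same symmetric key to decrypt. Key management is out of scope here.
import paramiko
from cryptography.fernet import Fernet

key = Fernet.generate_key()                # in practice, load from a key store
fernet = Fernet(key)

with open("export.csv", "rb") as f:
    ciphertext = fernet.encrypt(f.read())  # data is now protected at rest too

with open("export.csv.enc", "wb") as f:
    f.write(ciphertext)

transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(username="transfer", password="secret")  # or use a key pair
sftp = paramiko.SFTPClient.from_transport(transport)
sftp.put("export.csv.enc", "/inbound/export.csv.enc")
sftp.close()
transport.close()
```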

Do NoSQL datacenter aware features enable fast reads and writes when nodes are distributed across high-latency connections?

We have a data system in which writes and reads can be made in a couple of geographic locations which have high network latency between them (crossing a few continents, but not this slow). We can live with 'last write wins' conflict resolution, especially since edits can't be meaningfully merged.
I'd ideally like to use a distributed system that allows fast, local reads and writes, and copes with the replication and write propagation over the slow connection in the background. Do the datacenter-aware features in e.g. Voldemort or Cassandra deliver this?
It's either this, or we roll our own, probably based on collecting writes using something like rsync and sorting out the conflict resolution ourselves.
You should be able to get the behavior you're looking for using Voldemort. (I can't speak to Cassandra, but imagine that it's similarly possible using it.)
The key settings in the configuration will be:
replication-factor — This is the total number of times the data is stored. Each put or delete operation must eventually hit this many nodes. A replication factor of n means it can be possible to tolerate up to n - 1 node failures without data loss.
required-reads — The least number of reads that can succeed without throwing an exception.
required-writes — The least number of writes that can succeed without the client getting back an exception.
So for your situation, the replication would be set to whatever number made sense for your redundancy requirements, while both required-reads and required-writes would be set to 1. Reads and writes would return quickly, with a concomitant risk of stale or lost data, and the data would only be replicated to the other nodes afterwards.
I have no experience with Voldemort, so I can only comment on Cassandra.
You can deploy Cassandra to multiple datacenters with an inter-DC latency higher than a few milliseconds (see http://spyced.blogspot.com/2010/04/cassandra-fact-vs-fiction.html).
To ensure fast local reads, you can configure the cluster to replicate your data to a certain number of nodes in each datacenter (see "Network Topology Strategy"). For example, you can specify that there should always be two replicas in each datacenter. So even if you lose a node in a datacenter, you will still be able to read your data locally.
Write requests can be sent to any node in a Cassandra cluster. So for fast writes, your clients would always speak to a local node. The node receiving the request (the "coordinator") will replicate the data to other nodes (in other datacenters) in the background. If nodes are down, the write request will still succeed and the coordinator will replicate the data to the failed nodes at a later time ("hinted handoff").
Conflict resolution is based on a client-supplied timestamp.
If you need more than eventual consistency, Cassandra offers several consistency options (including datacenter-aware options).
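To make the Cassandra side concrete, here is a hedged sketch using the DataStax Python driver; datacenter names, contact points, and the schema are illustrative.

```python
# Datacenter-aware Cassandra setup with the DataStax Python driver:
# keep requests in the local DC and acknowledge writes with one local replica.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy

profile = ExecutionProfile(
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc="DC_EU"),
    consistency_level=ConsistencyLevel.LOCAL_ONE,  # fast, local-only acknowledgement
)
cluster = Cluster(["10.0.0.1"], execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()

# Two replicas per datacenter, as in the "Network Topology Strategy" example above.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS app
    WITH replication = {'class': 'NetworkTopologyStrategy', 'DC_EU': 2, 'DC_US': 2}
""")
session.execute("CREATE TABLE IF NOT EXISTS app.events (id text PRIMARY KEY, payload text)")

# The local coordinator acknowledges as soon as one local replica has the write;
# replication to the other datacenter (and hinted handoff) happens in the background.
session.execute("INSERT INTO app.events (id, payload) VALUES (%s, %s)", ("event-1", "hello"))
```

If stronger guarantees are needed later, LOCAL_QUORUM and the other datacenter-aware consistency levels slot into the same execution profile.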