How to improve publishing efficiency between CTPs and client procs? - kdb

I have
host A - US
host B - Germany
CTP is in host A and all client procs are in host B.
CTP is publishing data to about 30 client processes.
My question is:
If I move the CTP onto the same host as the client processes, will the data transfer speed improve?

If the 30 clients on host B subscribe only to small tables, and/or filter on sym so that they receive only subsets of tables, then data volumes between the hosts could increase if the CTP is moved, because all data would then be sent across rather than a subset.
If the 30 clients on host B each subscribe to a different table without overlap, the data transfer volume will not change.
If the 30 clients on host B all subscribe to all data from the CTP, then moving it to host B would give a 30x reduction in traffic between the hosts, because the data would be sent only once between the machines before fanning out to subscribers.
In most scenarios you will likely see a decrease. You can see what subscribers are listening to in .u.w in standard tick.q:
https://code.kx.com/q/kb/publish-subscribe/
Then you can check the row counts of the tables and sum up the data that would be transferred, to estimate how much traffic will increase or decrease.
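The estimate described above can be sketched in Python. Everything here is a hypothetical illustration: the table sizes, the client names, and the percentages are made-up stand-ins for what you would read from .u.w and from table counts, and the sketch assumes that a CTP moved to host B pulls each full table across once from host A.

```python
# Hypothetical bytes published per table over the measurement interval.
table_bytes = {"trade": 400_000_000, "quote": 1_200_000_000}

# Hypothetical subscriptions (what .u.w would tell you):
# client -> {table: percent of that table's data received after sym filtering}
subscriptions = {
    "client01": {"trade": 100, "quote": 100},
    "client02": {"trade": 10},
    "client03": {"quote": 5},
}


def cross_host_bytes_ctp_remote(table_bytes, subscriptions):
    """CTP stays on host A: every client's own (filtered) copy crosses hosts."""
    return sum(
        table_bytes[table] * pct // 100
        for subs in subscriptions.values()
        for table, pct in subs.items()
    )


def cross_host_bytes_ctp_local(table_bytes, subscriptions):
    """CTP moved to host B: each subscribed table crosses once, unfiltered."""
    subscribed = {table for subs in subscriptions.values() for table in subs}
    return sum(table_bytes[table] for table in subscribed)


remote = cross_host_bytes_ctp_remote(table_bytes, subscriptions)
local = cross_host_bytes_ctp_local(table_bytes, subscriptions)
print(f"CTP on host A: {remote:,} bytes, CTP on host B: {local:,} bytes")
```

With these made-up numbers the move reduces cross-host traffic slightly; remove client01 and the same calculation shows an increase, which is the first case described above.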

Related

Peer to Peer network bandwidth

I am working on a project that involves a peer to peer network. Someone raised concerns that we may be expecting a larger bandwidth than is reasonable.
Suppose we had a large number of registered nodes (in the thousands, possibly 10,000), and these nodes constantly are receiving data which they wish to propagate around the network. The data doesn't have to get to every node, but we would like it to get to most of them.
In general, how much data creation and transmission could be handled reasonably as the number of nodes increases? I am hoping that, in my case, the answer is more than 50 MB/minute (as this is the maximum amount of data my system is expected to create), but I don't have a basis for comparison.
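As a basis for comparison, the 50 MB/minute figure can be converted into a per-node rate. This is a rough sketch under assumptions that are not in the question: the 50 MB/minute is taken as the total for the whole network, every node receives each message at least once, and each node forwards every message to a hypothetical `fanout` number of peers.

```python
def per_node_mbit_per_s(total_mb_per_min: float, fanout: int):
    """Return (inbound lower bound, outbound) in Mbit/s for a single node."""
    bytes_per_s = total_mb_per_min * 1_000_000 / 60
    inbound = bytes_per_s * 8 / 1_000_000   # one copy of everything
    outbound = inbound * fanout             # forwarded to `fanout` peers
    return inbound, outbound


inbound, outbound = per_node_mbit_per_s(50, fanout=3)
print(f"inbound >= {inbound:.2f} Mbit/s, outbound = {outbound:.2f} Mbit/s")
```

In this simple model the per-node rate depends on the total data rate and the fanout, not on the number of nodes; duplicate deliveries in a real gossip protocol would push the inbound figure above this lower bound.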

How to reliably shard data across multiple servers

I am currently reading up on some distributed systems design patterns. One of the design patterns for when you have to deal with a lot of data (billions of entries or multiple petabytes) is to spread it out across multiple servers or storage units.
One of the solutions for this is to use a consistent hash. This should result in an even spread across all servers in the hash.
The concept is rather simple: we can just add new servers and only the servers in the affected range would be impacted, and if you lose servers the remaining servers in the consistent hash would take over. This works when all servers in the hash have the same data (in memory, on disk or in a database).
My question is: how do we handle adding and removing servers from a consistent hash when there is so much data that it can't be stored on a single host? How do the servers figure out what data to store and what not to?
Example:
Let's say that we have 2 machines running, "0" and "1". They are starting to reach 60% of their maximum capacity, so we decide to add an additional machine "2". Now a large part of the data on machine 0 has to be migrated to machine 2.
How would we automate this so that it happens without downtime and reliably?
My own suggested approach would be that the service hosting the consistent hash and the machines would have to be aware of how to transfer data between each other. When a new machine is added, the consistent hash service calculates the affected hash ranges. It then informs the affected machines of the affected hash range and that they need to transfer the affected data to machine 2. Once the affected machines are done transferring their data, they ACK back to the consistent hash service. Once all affected machines are done transferring data, the consistent hash service starts sending data to machine 2 and informs the affected machines that they can now remove their transferred data. If we have petabytes on each server, this process can take a long time. We therefore need to keep track of which entries were changed during the transfer so we can sync them afterwards, or we can submit the writes/updates to both machine 0 and machine 2 during the transfer.
My approach would work, but I feel it is a little risky with all the back and forth, so I would like to hear if there is a better way.
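To make the "affected hash ranges" concrete, here is a minimal Python sketch of a consistent hash ring. The SHA-256 based token function, the 100 virtual nodes per machine, and the key names are illustrative assumptions, not taken from any particular system. It shows that adding machine "2" changes the owner only for keys that end up on machine "2", so those are the only keys that need to be transferred.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a ring position (first 8 bytes of its SHA-256)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []    # sorted (token, node) pairs
        self._tokens = []  # tokens only, kept parallel to _ring for bisect
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        self._ring.extend((token(f"{node}#{i}"), node) for i in range(self.vnodes))
        self._ring.sort()
        self._tokens = [t for t, _ in self._ring]

    def owner(self, key: str) -> str:
        # First virtual node clockwise from the key's token, wrapping around.
        idx = bisect.bisect_right(self._tokens, token(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"key-{i}" for i in range(10_000)]
ring = HashRing(["0", "1"])
before = {k: ring.owner(k) for k in keys}
ring.add("2")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys change owner")
```

Comparing `before` and `after` is the calculation the coordinating service (or each node) would do to decide which data to stream to the new machine.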
How would we automate so this will happen without downtime and while being reliable?
It depends on the technology used to store your data, but for example in Cassandra, there is no "central" entity that governs the process and it is done like almost everything else; by having nodes gossiping with each other. There is no downtime when a new node joins the cluster (performance might be slightly impacted though).
The process is as follows:
The new node joining the cluster is defined as an empty node without system tables or data.
When a new node joins the cluster using the auto bootstrap feature, it will perform the following operations
- Contact the seed nodes to learn about gossip state.
- Transition to Up and Joining state (to indicate it is joining the cluster; represented by UJ in the nodetool status).
- Contact the seed nodes to ensure schema agreement.
- Calculate the tokens that it will become responsible for.
- Stream replica data associated with the tokens it is responsible for from the former owners.
- Transition to Up and Normal state once streaming is complete (to indicate it is now part of the cluster; represented by UN in the nodetool status).
Taken from https://thelastpickle.com/blog/2017/05/23/auto-bootstrapping-part1.html
So when the joining node is in the Joining State, it is receiving data from other nodes but not ready for reads until the process is complete (Up status).
DataStax also has some material on this https://academy.datastax.com/units/2017-ring-dse-foundations-apache-cassandra?path=developer&resource=ds201-datastax-enterprise-6-foundations-of-apache-cassandra

Database interaction in an IoT network

Suppose we have several (100) nodes in an IoT network. Each node has limited resources. There is a PostgreSQL database server running on one of these nodes. Every node has several (4-5) processes which need to interact with this server to perform insert and select queries. Each query response has to be as fast as possible for the process to work as it should. The ways I can think of to do this are:
1. Each process in a node creates its own database client and performs queries.
2. All processes in a node send their queries to a destination on localhost, from where the queries are performed through an optimum number of database clients. This gives us some control over the database clients, for example by scheduling queries through a priority queue, or by running queries in separate threads/processes, each with its own database client. In this case we control the number of clients, the number of threads/processes, and the order in which queries are executed.
3. Each node sends all queries through some network protocol directly to the database server, which then uses a limited number of database clients to perform the queries against its own local database and returns the response to each node through the same channel. This increases latency but keeps the number of clients to a minimum. We can also apply the same optimisations here, such as running each client in a different process/thread. Database interaction can be faster in this case, since the number of clients is kept to a minimum and they run on the database machine itself, but it adds some overhead to transfer the query response data back to the node's process.
In order to keep resource usage in every node as low as possible and query responses as fast as possible, what is the best strategy to solve this problem?
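The second approach (a local broker with a priority queue and a fixed number of database clients) could look roughly like the Python sketch below. `QueryBroker` and `run_query` are hypothetical names introduced for illustration; `run_query` stands in for whatever executes SQL over a real database connection, and in practice each worker thread would hold its own connection.

```python
import itertools
import queue
import threading
from concurrent.futures import Future


class QueryBroker:
    """Runs submitted queries on a fixed number of worker threads, lowest priority value first."""

    def __init__(self, run_query, num_clients=2):
        self._run_query = run_query          # hypothetical callable: sql -> result
        self._queue = queue.PriorityQueue()
        self._seq = itertools.count()        # tie-breaker: FIFO within a priority
        for _ in range(num_clients):
            threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, sql: str, priority: int = 10) -> Future:
        """Queue a query and return a Future that will hold its result."""
        future = Future()
        self._queue.put((priority, next(self._seq), sql, future))
        return future

    def _worker(self):
        while True:
            _priority, _seq, sql, future = self._queue.get()
            try:
                future.set_result(self._run_query(sql))
            except Exception as exc:         # report failures to the caller
                future.set_exception(exc)
            finally:
                self._queue.task_done()


# Usage with a stub in place of a real database client.
broker = QueryBroker(run_query=lambda sql: f"ran: {sql}", num_clients=2)
urgent = broker.submit("SELECT 1", priority=0)
print(urgent.result(timeout=5))
```

The same structure applies to the third approach if the broker runs on the database host and the nodes reach it over the network instead of localhost.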
Without knowing the networking details, option 3 would normally be used.
Reasons:
Authentication: Typically you do not want to use database users to authenticate IoT devices.
Security: By using a specific IoT protocol, you can be sure to use TLS and certificate based server authentication.
Protocol compatibility: In upgrade cases, you must ensure that you can upgrade the client nodes independently from the server nodes, or vice versa. This may not be the case for database protocol.

For Data Transfer, REST API vs SFTP, which is more secure?

Two different applications need to transfer data to one another for certain activities. The options for this data transfer are either to prepare a file of data and push it through SFTP at a certain point in time, or to push/pull the changes through a REST API in real time.
Which approach will be more secure if the data in one system is completely encrypted and in the other it is raw?
When choosing an integration pattern, as always, the answer is: "It depends".
How much data, and how frequently?
What is the security classification for the data (and the potential impact of a data breach)?
How mature are the IT Engineering/Operations/DevOps teams that will be involved in implementing the integration points, on both ends?
What facilities are available for ensuring that data is encrypted at-rest, as well as in-transit?
What is the business requirement regarding data latency?
What is the physical distance between the two systems?
What is the sustainable network connection speed between the two systems?
What are the hours during which the integration needs to be active/scheduled?
Is the data of a bulk/reference type - or is it an event/transaction nature type?
What is the size per message/transaction?
What is the total size of the data (MB? GB? PB? ...?) that needs to be transferred during a given processing cycle (per hour? per day?)?
Additional considerations that should be examined:
Network timeout/retry scenarios
Batch reruns
CPU, Memory, Network bandwidth utilization
Peak hour processing vs. off peak hour processing
Infrastructure/environmental limits/constraints - and cost factors (e.g. Cloud hosting limits on transactions, data, file size, Cloud-hosted API Gateway pricing strategies, ...)
Network Latency introduced by number of messages that must be sent to complete transfer of all data via an API vs. SFTP
Data Latency introduced by SFTP batch scheduling
It is hard to compare SSH (SFTP) with SSL (RESTful API using HTTPS) as both have different functions.
SFTP has a broader attack surface, as it uses SSH for tunneling, which means there are multiple areas where a compromise could occur. That does not mean it is less secure.

How to deploy zookeeper across multiple data centers and failover?

I would like to know about the existing approaches that are available for running Zookeeper across data centers.
One approach that I found after doing some research is to have observers. That approach is to have only one ensemble in the main data center, with leader and followers, and to have observers in the backup data center. When the main data center crashes, we select the other data center as the new main data center and convert the observers to leader/followers manually.
I would like to know about better approaches to achieve the same.
Thanks
First I would like to point out the cons of your solution, which hopefully my solution would solve:
a) in case of main data center failure the recovery process is manual (I quote you: "convert observers to leader/follower manually")
b) only the main data center accepts writes -> in case of failure all data (when the observers don't write logs) or only the last updates (when the observers do write logs) are lost
Because the question is about data centerS, I'll consider that we have enough (DCs) to reach our objective: solving a. and b. while having a usable multi data center distributed ZK.
So, when having an even number of data centers (DCs), one could use an additional DC only to get an odd number of ZK nodes in the ensemble. When having e.g. 2 DCs, then a 3rd one could be added; each DC could contain 1 rwZK (read-write ZK node) or, for better tolerance against failures, each DC could contain 3 rwZKs organized as hierarchical quorums (both cases could benefit from ZK observers). Inside a DC, all ZK clients should point only to that DC's ZK group, so the traffic remaining between DCs would be only for e.g. leader election and writes. With this kind of setup one solves both a. and b. but loses write/recovery performance, because writes and elections must be agreed between data centers: at least 2 DCs must agree on writes/elections, with agreement of 2 ZK nodes per DC (see hierarchical quorums). The intra-DC agreement should be fast enough that it won't matter much for the overall write agreement process; bottom line, approximately only the delay between DCs would matter. The disadvantages of this approach are:
- additional cost for the 3rd data center: this could be mitigated by using the company office (a guy did that) as the 3rd data center
- lost sessions because of inter-DC network latency and/or throughput: with high enough timeouts one could reach a maximum possible write throughput (depending on the average inter-DC network speed), so this solution would be valid only when that maximum is acceptable. Still, when using 1 rw-ZK per DC, I guess there won't be much difference from your solution, because the writes from the backup DC to the main DC must travel between DCs too; but your solution has no inter-DC write agreement or leader election traffic, so it's faster.
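The agreement rule described above ("at least 2 DCs must agree, with 2 ZK nodes agreeing per DC") can be checked with a small Python sketch. It assumes every server has equal weight, and the DC names and server ids are made up for illustration; it models only the counting rule, not ZooKeeper itself.

```python
def has_hierarchical_quorum(groups: dict, acks: set) -> bool:
    """True if a majority of groups each have a majority of their servers in `acks`.

    groups: {dc_name: set of server ids}, acks: ids of servers that agreed.
    """
    def group_agrees(members: set) -> bool:
        return len(members & acks) > len(members) / 2

    agreeing_groups = sum(1 for members in groups.values() if group_agrees(members))
    return agreeing_groups > len(groups) / 2


# Three DCs with three read-write ZK nodes each (hypothetical ids).
groups = {"dc1": {1, 2, 3}, "dc2": {4, 5, 6}, "dc3": {7, 8, 9}}

print(has_hierarchical_quorum(groups, {1, 2, 4, 5}))  # 2 nodes in each of 2 DCs
print(has_hierarchical_quorum(groups, {1, 2, 3, 4}))  # all of dc1 plus 1 node in dc2
print(has_hierarchical_quorum(groups, {4, 5, 7, 8}))  # dc1 completely down
```

The first and third cases reach a quorum and the second does not, which is why a whole DC can fail without stopping writes, while four votes concentrated in the wrong places are not enough.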
Other consideration:
Regardless of the chosen solution, the inter-DC communication should be secured, and ZK offers no solution for this, so tunneling or another approach must be implemented.
UPDATE
Another solution would be to still use an additional 3rd DC (or company office), but to keep the rw-ZKs (1, 3 or another odd number) only there, while the other 2 DCs have only observer-ZKs. The clients should still connect only to their own DC's ZK servers, but we no longer need hierarchical quorums. The gain here is that the write agreements and leader elections would happen only inside the DC with the rw-ZKs (let's call it the arbiter DC). The disadvantages are:
- the arbiter DC is a single point of failure
- the write requests will still have to travel from observer DCs to arbiter DC