workload balance and partition relationship - pyspark

In the Azure cloud, I have an Apache Spark pool with 3 nodes, scalable up to 10 nodes.
My query takes a long time to run, but the pool is not scaling up. It always uses only 3 nodes.
Is there anything I need to change in my query?
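For illustration, a minimal PySpark sketch of the kind of check that usually matters here: whether the query exposes enough partitions for additional nodes to pick up work. The path and partition counts below are placeholders, and autoscale itself remains a pool-level setting rather than something a query controls.

```python
# Sketch only: autoscale is configured on the Spark pool (min/max nodes), but a job
# that exposes too little parallelism can leave the extra nodes idle.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder input path standing in for whatever the slow query reads.
df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/path")

# How many partitions (and therefore parallel tasks) does the input expose?
print(df.rdd.getNumPartitions())

# If this number is small (e.g. 3), repartitioning lets Spark spread tasks
# across more executors once the pool scales out. 80 is an arbitrary example.
df = df.repartition(80)

# Shuffle-heavy stages are governed by this setting as well.
spark.conf.set("spark.sql.shuffle.partitions", "200")
```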

Related

How to estimate RAM and CPU per Kubernetes Pod for a Spring Batch processing job?

I'm trying to estimate hardware resources for a Kubernetes cluster so it can handle the following scenarios:
On a daily basis I need to read 46.3 million XML messages of about 10 KB each from a queue and then insert them into a Spark instance and a Sybase DB instance. I need to come up with an estimate of how many pods will be needed to process this amount of data, and how much RAM and how many vCPUs will be required per pod, in order to determine the characteristics of the nodes of the cluster. The reason behind all this is that we have some budget restrictions and need an idea of the sizing before starting the corresponding development.
The second scenario is the same as the one already described but 18.65 times bigger, i.e. 833.33 million XML messages per day. This is expected to be the case within a couple of years.
So far we plan to use Spring Batch with partitioned steps. I need orientation on how to determine the ideal Spring Batch configuration, the required RAM and CPU per pod, and the number of pods.
I will greatly appreciate any comments from your side.
Thanks in advance.
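As a rough framing (not a definitive sizing), the numbers in the question reduce to a small throughput calculation. The per-pod rate below is a placeholder to be replaced with a value measured from a single-pod load test; RAM and CPU per pod would likewise come from profiling that one pod.

```python
# Back-of-envelope sizing sketch, not a definitive answer.
MSGS_PER_DAY_NOW = 46_300_000          # from the question
MSGS_PER_DAY_FUTURE = 833_330_000      # ~18.65x growth scenario
MSG_SIZE_KB = 10
SECONDS_PER_DAY = 24 * 60 * 60

MEASURED_MSGS_PER_SEC_PER_POD = 50     # ASSUMPTION: replace with a measured value

for label, per_day in [("today", MSGS_PER_DAY_NOW), ("in 2 years", MSGS_PER_DAY_FUTURE)]:
    per_sec = per_day / SECONDS_PER_DAY
    gb_per_day = per_day * MSG_SIZE_KB / 1_000_000
    pods = -(-per_sec // MEASURED_MSGS_PER_SEC_PER_POD)   # ceiling division
    print(f"{label}: {per_sec:,.0f} msg/s, ~{gb_per_day:,.0f} GB/day, "
          f"~{pods:.0f} pods at {MEASURED_MSGS_PER_SEC_PER_POD} msg/s each")

# today:       ~536 msg/s,  ~463 GB/day
# in 2 years: ~9645 msg/s, ~8333 GB/day
```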

Kubernetes Orchestration depending upon number of rows/records/Input Files

The requirement is to orchestrate ETL containers depending on the number of records present in the source system (SQL / Google Analytics / SaaS / CSV files).
To explain, take a use case: an ETL job has to process 50K records present in SQL Server; however, it takes considerable processing time for a single server/node to execute this job, since that server connects to SQL Server, fetches the data, and processes the records.
The problem now is how to orchestrate this ETL job in Kubernetes so that it scales the containers up/down depending on the number of records/input. As in the case above, if there are 50K records to process in parallel, it should scale up the containers, process the records, and then scale down.
You would generally use a queue of some kind and Horizontal Pod Autoscaler (HPA) to watch the queue size and adjust the queue consumer replicas automatically. Specifics depend on the exact tools you use.
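For illustration, a sketch of the queue-driven idea using the official Kubernetes Python client. In practice an HPA driven by an external/custom metric automates this loop; the queue-depth function, deployment name, and records-per-worker ratio here are all assumptions.

```python
# Illustrative sketch of queue-depth-driven scaling; an HPA with an external metric
# (e.g. queue length exposed through a metrics adapter) would do this automatically.
from kubernetes import client, config

RECORDS_PER_WORKER = 5_000      # assumption: one worker pod handles ~5K records per run
MAX_WORKERS = 20

def current_queue_depth() -> int:
    """Placeholder: return the number of pending records/messages in your queue."""
    raise NotImplementedError

def scale_workers(namespace: str = "etl", deployment: str = "etl-worker") -> None:
    config.load_kube_config()               # or load_incluster_config() inside the cluster
    apps = client.AppsV1Api()
    depth = current_queue_depth()
    replicas = min(MAX_WORKERS, max(0, -(-depth // RECORDS_PER_WORKER)))  # ceiling division
    apps.patch_namespaced_deployment_scale(
        name=deployment,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )
```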

Kubernetes: why would you need more than 2 nodes?

Given a K8s cluster (a managed cluster, for example AKS) with 2 worker nodes, I've read that if one node fails, all the pods will be restarted on the second node.
Why would you need more than 2 worker nodes per cluster in this scenario? You can always choose the number of nodes you want, and the more you select, the more expensive it gets.
It depends on the solution you are deploying in the Kubernetes cluster and the kind of high availability you want to achieve.
If you want to work in an active-standby mode, where the pods are moved to the other node when one node fails, two nodes work fine (as long as the single surviving node has the capacity to run all the pods).
Some databases / stateful applications, for instance, need a minimum of three replicas so that you can reconcile a mismatch/conflict in the data caused by a network partition (i.e., you can pick the content held by two of the three replicas).
For instance, etcd needs 3 replicas.
If whatever you are building needs only two nodes, then you don't need more than 2. If you are building something big, where the amount of compute and memory needed is much higher, then instead of opting for expensive nodes with huge CPU and RAM, you can join more and more lower-priced nodes to the cluster. This is called horizontal scaling.
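To make the etcd point concrete, a tiny sketch of the majority-quorum arithmetic behind the "minimum of three replicas" rule:

```python
# Quorum-based systems such as etcd need a surviving majority to keep accepting writes.
def fault_tolerance(members: int) -> int:
    quorum = members // 2 + 1          # majority of members
    return members - quorum            # members you can lose and still have quorum

for n in (1, 2, 3, 5):
    print(f"{n} members -> quorum {n // 2 + 1}, tolerates {fault_tolerance(n)} failure(s)")

# 2 members tolerate 0 failures, so a 2-member control plane gains nothing over 1;
# 3 members tolerate 1 failure, which is why 3 replicas is the usual minimum.
```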

Lagom: is it possible to split service instances across multiple clusters?

Let's say I have Hello-Service. In Lagom, this service can run across multiple nodes of a single cluster.
So within Cluster 1, we can have multiple "copies" of Hello-Service:
Cluster1: Hello-Service-1, Hello-Service-2, Hello-Service-3
But is it possible to run service Hello-Service across multiple clusters?
Like this:
Cluster1: Hello-Service-1, Hello-Service-2, Hello-Service-3,
Cluster2: Hello-Service-4, Hello-Service-5, Hello-Service-6
What I want to achieve is better scalability of the read-side processors and event consumers:
In Lagom, we need to set the number of shards for a given event tag up front, within the cluster.
So I wonder if I can just add another cluster to distribute the load across them.
And, of course, I'd like to shard persistent entities by some key.
(Let's say I'm building a multi-tenant application. I would shard entities by organization id, so all entities of one set of organizations would go into Cluster 1 and the entities of another set would go into Cluster 2. I could then have sharded read-side processors per cluster, each handling only a subset of the events/entities within its cluster, for better scalability.)
With a single-cluster approach, as the system grows, a sharded processor within that cluster may become slower and slower because it needs to handle more and more events.
So as the system grows, I would just add a new cluster (say Cluster 2, then Cluster 3), each handling its own subset of events/entities.
If you are using sharded read sides, Lagom will distribute the processing of the shards across all the nodes in the cluster. So, if you have 10 shards and 6 nodes in 1 cluster, each node will process 1-2 shards. If you instead deploy two clusters of 3 nodes each, each node will end up processing 3-4 shards, but every event will be processed twice, once in each cluster. That's not helping scalability; that's doing twice as much work as needs to be done. So I don't see why you would want two clusters: just have one cluster, and Lagom will distribute the shards evenly across it.
If you are not using sharded read sides, then it doesn't matter how many nodes you have in your cluster: all events will be processed by one node. If you deploy a second cluster, it won't share the load; it will also process the same events, so each event gets processed once per cluster, which is not what you want.
So, just use sharded read sides, and let Lagom distribute the work across your single cluster for you, that's what it's designed to do.
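A toy model of the arithmetic above, with round-robin shard assignment standing in for Akka cluster sharding (an assumption for illustration, not Lagom's actual algorithm):

```python
# One cluster of 6 nodes vs. two independent clusters of 3 nodes, 10 shards.
SHARDS = list(range(10))

def assign(shards, nodes):
    # round-robin stand-in for shard-to-node distribution
    return {n: [s for s in shards if s % nodes == n] for n in range(nodes)}

one_cluster = assign(SHARDS, 6)                        # each node gets 1-2 shards
two_clusters = [assign(SHARDS, 3), assign(SHARDS, 3)]  # each node gets 3-4 shards

print({n: len(s) for n, s in one_cluster.items()})     # {0: 2, 1: 2, 2: 2, 3: 2, 4: 1, 5: 1}

# A single cluster processes each shard's events once; two independent clusters
# that both subscribe to every tag each process the full event stream.
work_single = len(SHARDS)
work_double = len(SHARDS) * len(two_clusters)
print(work_single, work_double)                        # 10 vs 20 shard-streams of work
```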

Issue about workload balance in Flink streaming

I have a WordCount program running in a 4-worker-node Flink cluster which reads data from a Kafka topic.
In this topic there is a lot of pre-loaded text (words). The words in the topic follow a Zipf distribution. The topic has 16 partitions, and each partition holds around 700 M of data.
There is one node which is much slower than the others. As you can see in the picture, worker2 is the slower node, but the slower node is not always worker2. From my tests, worker3 or other nodes in the cluster can also be the slow one.
However, there is always such a slow worker node in the cluster. Each worker node has 4 task slots, so there are 16 task slots in total.
After some time, the records sent to the other worker nodes (all except the slower node) stop increasing. The records sent to the slower node then catch up to the same level as the others, and at a much faster rate.
Can anyone explain why this situation occurs? Also, what am I doing wrong in my setup?
Here is the throughput (counted by words at the Keyed Reduce -> Sink stage) of the cluster.
From this picture we can see that the throughput of the slower node, node2, is much higher than that of the others. This means node2 received more records from the first stage. I think this is because of the Zipf distribution of the words in the topic: the words with very high frequency are mapped to node2.
When a node spends more compute resources on the Keyed Reduce -> Sink stage, the speed at which it reads data from Kafka decreases. Once all the data in the partitions corresponding to node1, node3 and node4 has been processed, the throughput of the cluster drops.
Since your data follows a Zipf distribution, this behavior is expected: some workers simply receive more data due to the imbalance in the distribution itself. You would observe the same behavior in other systems, too.
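A quick way to see the effect outside Flink: simulate Zipf-distributed words and hash them into 16 "slots" mirroring the 16 task slots in the question. The Zipf exponent and the hashing scheme are assumptions for illustration, not Flink's actual key-group mapping.

```python
# Why one subtask gets most of the load under a Zipf word distribution: keyBy sends
# each word to one slot, and the few very frequent words dominate whichever slot
# their hash maps to.
import numpy as np

rng = np.random.default_rng(0)
NUM_SLOTS = 16                              # 4 workers x 4 task slots, as in the question
words = rng.zipf(a=1.5, size=1_000_000)     # Zipf-distributed "word ids"

slot_counts = np.zeros(NUM_SLOTS, dtype=int)
for w, c in zip(*np.unique(words, return_counts=True)):
    slot_counts[hash(int(w)) % NUM_SLOTS] += c

print(slot_counts)                          # one or two slots dominate the totals
print(slot_counts.max() / slot_counts.mean())
```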