How long should a weighted average take on Apache Ignite? - real-time

I am currently benchmarking Apache Ignite for a near real-time application, and simple operations seem to be excessively slow for a relatively small sample size. The setup details and timings are given below; please see the two questions at the bottom.
Setup:
Cache mode: Partitioned
Number of server nodes: 3
CPUs: 4 per node (12 total)
Heap size: 2 GB per node (6 GB total)
The first use case is computing a weighted average over two fields of the object, issued at different query rates.
The first method is to run a SQL-style query:
...
query = new SqlFieldsQuery("select SUM(field1*field2)/SUM(field2) from MyObject");
cache.query(query).getAll();
....
The observed timings are:
Cache size: 500,000 entries; queries/second: 10
Median: 428 ms, 90th percentile: 13,929 ms
Cache size: 500,000 entries; queries/second: 50
Median: 191,465 ms, 90th percentile: 402,285 ms
Clearly the queries are queuing up with enormous latency (>400 ms); a simple weighted average computation on a single JVM (4 cores) takes 6 ms.
The second approach is to use IgniteCompute to broadcast Callables across the nodes and compute the weighted average on each node, reducing at the caller (a sketch of this approach follows the timings below). Latency is only marginally better and throughput improves, but both remain at unusable levels.
Cache size: 500,000 entries; queries/second: 10
Median: 408 ms, 90th percentile: 507 ms
Cache size: 500,000 entries; queries/second: 50
Median: 114,155 ms, 90th percentile: 237,521 ms
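For reference, the broadcast-and-reduce approach from this second experiment could look roughly like the sketch below, written in Scala against Ignite's Java API. This is only a sketch of the technique described above, not the original code: the cache name "MyObjectCache" and the field1/field2 accessors on the question's MyObject class are assumptions.
import org.apache.ignite.{Ignite, Ignition}
import org.apache.ignite.cache.CachePeekMode
import org.apache.ignite.lang.IgniteCallable
import scala.collection.JavaConverters._

// Partial result produced by each node: sum(field1*field2) and sum(field2).
case class Partial(weightedSum: Double, weightSum: Double)

def weightedAverageAcrossCluster(ignite: Ignite): Double = {
  // Broadcast a callable to every server node; each node scans only the entries it owns.
  val partials = ignite.compute().broadcast(new IgniteCallable[Partial] {
    override def call(): Partial = {
      val cache = Ignition.localIgnite().cache[Long, MyObject]("MyObjectCache")
      var weighted = 0.0
      var weights = 0.0
      // PRIMARY peek mode iterates the local primary copies only, so backups are not double counted.
      cache.localEntries(CachePeekMode.PRIMARY).asScala.foreach { e =>
        weighted += e.getValue.field1 * e.getValue.field2
        weights += e.getValue.field2
      }
      Partial(weighted, weights)
    }
  })
  // Reduce the per-node partials at the caller.
  val total = partials.asScala.foldLeft(Partial(0.0, 0.0)) { (a, b) =>
    Partial(a.weightedSum + b.weightedSum, a.weightSum + b.weightSum)
  }
  total.weightedSum / total.weightSum
}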
A few things I noticed during the experiment:
No disk swapping is happening
CPUs run at up to 400%
The query is split up into two different weighted averages (map and reduce)
Entries are evenly split across the nodes
No garbage collections are triggered, with each heap staying around 500 MB
Now to my questions:
Are these timings expected, or is there some obvious setting I am missing? I could not find benchmarks for similar operations.
What is the advised method to run fork-join style computations on Ignite without moving the data?

This topic was discussed in detail on the Apache Ignite user forum: http://apache-ignite-users.70518.x6.nabble.com/Ignite-performance-td6703.html

Related

How to perform large computations on Spark

I have two tables in Hive, user and item, and I am trying to calculate the cosine similarity between two features of each table for the Cartesian product of the two tables, i.e. a cross join.
There are around 20,000 users and 5,000 items, resulting in 100 million rows of calculation. I am running the computation using Scala Spark on a Hive cluster with 12 cores.
The code goes a little something like this:
val pairs = userDf.crossJoin(itemDf).repartition(100)
val results = pairs.mapPartitions(computeScore) // computeScore is a function to compute the similarity scores I need
The Spark job always fails due to memory issues (GC Allocation Failure) on the Hadoop cluster. If I reduce the computation to around 10 million rows, it definitely works, in under 15 minutes.
How do I compute the whole set without increasing the hardware specifications? I am fine if the job takes longer to run and does not fail halfway.
If you take a look at the Spark documentation you will see that Spark uses different strategies for data management. These policies are enabled by the user via configuration in the Spark configuration files, or directly in the code or script.
Below is the documentation about data management policies:
The "MEMORY_AND_DISK" policy would be good for you, because if the data (RDD) does not fit in RAM then the remaining partitions will be stored on the hard disk. But this strategy can be slow if you have to access the hard drive often; a short code sketch follows the steps below.
There are a few steps for doing that:
1. Check the expected data volume after the cross join and divide it by 200, since spark.sql.shuffle.partitions defaults to 200; check whether this comes to more than 1 GB of raw data per partition.
2. Calculate each row's size and multiply it by the other table's row count to estimate the rough volume. The process will work much better with Parquet than with CSV files.
3. spark.sql.shuffle.partitions needs to be set based on total data volume / 500 MB.
4. spark.shuffle.minNumPartitionsToHighlyCompress needs to be set a little lower than the shuffle partition count.
5. Bucketize the source Parquet data on the join column for both of the files/tables.
6. Provide high Spark executor memory, and manage the Java heap memory too, keeping the heap space in mind.
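As a rough sketch of how the MEMORY_AND_DISK advice and the partition sizing from the steps above might look in code: the partition count of 400 is only an illustrative guess (derive yours from total volume / 500 MB), and userDf, itemDf and computeScore are the names from the question.
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("cosine-similarity").getOrCreate()
// Step 3: size the shuffle partitions from the measured data volume (placeholder value here).
spark.conf.set("spark.sql.shuffle.partitions", "400")

val pairs = userDf.crossJoin(itemDf)
  .repartition(400)
  // Spill partitions that do not fit in RAM to local disk instead of failing.
  .persist(StorageLevel.MEMORY_AND_DISK)

// Shown here on the underlying RDD so computeScore can stay a plain Iterator-to-Iterator function over rows.
val results = pairs.rdd.mapPartitions(computeScore)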

What is the meaning of 99th percentile latency and throughput

I've read an article benchmarking the performance of stream processing engines like Spark Streaming, Storm, and Flink. In the evaluation part, the criteria were 99th percentile latency and throughput. For example, Apache Kafka sent data at around 100,000 events per second, the three engines acted as stream processors, and their performance was described using 99th percentile latency and throughput.
Can anyone clarify these two criteria for me?
A 99th percentile latency of X milliseconds in stream jobs means that 99% of the items arrived at the end of the pipeline in less than X milliseconds. Read this reference for more details:
"When application developers expect a certain latency, they often need a latency bound. We measure several latency bounds for the stream record grouping job which shuffles data over the network. The following figure shows the median latency observed, as well as the 90-th, 95-th, and 99-th percentiles (a 99-th percentile of latency of 50 milliseconds, for example, means that 99% of the elements arrive at the end of the pipeline in less than 50 milliseconds)."
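To make that definition concrete, here is a small Scala sketch that computes percentiles from a sample of end-to-end latencies with the nearest-rank method; the latency values are made up for illustration.
// Made-up end-to-end latencies, in milliseconds.
val latenciesMs = Vector(12.0, 15.0, 9.0, 48.0, 11.0, 14.0, 10.0, 13.0, 52.0, 16.0)

def percentile(sample: Seq[Double], p: Double): Double = {
  val sorted = sample.sorted
  // Nearest-rank method: smallest value such that at least p% of the samples are <= it.
  val rank = math.ceil(p / 100.0 * sorted.size).toInt
  sorted(math.max(rank - 1, 0))
}

val p99 = percentile(latenciesMs, 99) // 99% of the items finished in at most this many milliseconds
val p50 = percentile(latenciesMs, 50) // the median, for comparison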

Spark executor max memory limit

I am wondering if there is any size limit to Spark executor memory?
Considering the case of running a badass job doing collect, unions, count, etc.
Just a bit of context: let's say I have these resources (2 machines)
Cores: 40 per machine, total = 80 cores
Memory: 156 GB per machine, total = 312 GB
What's the recommendation, bigger vs smaller executors ?
The suggestion from the Spark development team is to not have an executor larger than about 64 GB (this is often mentioned in training videos by Databricks). The idea is that a larger JVM will have a larger heap, which can result in really slow garbage collection cycles.
I think it is good practice to keep your executors at 32 GB, or even 24 GB or 16 GB. So instead of one large executor you have 2-4 smaller ones.
It will perhaps add some coordination overhead, but these sizes should be fine for the vast majority of applications.
If you have not read this post, please do.
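Purely as an illustration of that sizing advice applied to the 2 x 156 GB / 80-core setup from the question (the exact numbers below are assumptions, not an official recommendation), the executors could be declared like this:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("right-sized-executors")
  .config("spark.executor.memory", "24g")   // well under the ~64 GB ceiling mentioned above
  .config("spark.executor.cores", "5")
  .config("spark.executor.instances", "12") // roughly 6 executors per machine, leaving headroom
  .getOrCreate()
This gives 12 x 24 GB = 288 GB and 12 x 5 = 60 cores in use, leaving memory and cores free for the OS, the driver, and overhead.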

Same program is much slower on a supposedly better machine

When running the same application on two different machines, I see one is much slower but it ought to be the faster of the two. This is a compute bound application with a thread pool. The threads do not communicate with each other nor externally. The application reads from disk at the beginning (for a fraction of a second) and writes to disk at the end (for a fraction of a second).
The program repeatedly runs a simulation on a deterministically changing set of inputs. Since the inputs are identical across machines, the outputs can be compared, and they are in fact identical. The only difference is the elapsed time. There is an object that I recall is "shared" in the sense that all threads read from it, but my recollection is that it is strictly read-only. The threaded work is homogeneous.
Dual machine: 2 core / 4 thread machine, 2.53 GHz, 3MB cache, 8GB RAM, passmark.com benchmark is approximately 2100, my application's thread pool size set to 4, JVM memory high water mark was 2.8 GB, elapsed time is 47 minutes
Quad machine: 4 core / 8 thread machine, 2.2 GHz to 3.1 GHz, 6MB cache, 8GB RAM, passmark.com benchmark is approximately 6000, my application's thread pool size set to 8, JVM memory high water mark was 2.8GB, elapsed time 164 minutes
Another comparison:
Dual machine: thread pool size set to 2, elapsed time 98 minutes * Could be less. Please see the footnote.
Quad machine: thread pool size set to 2, elapsed time 167 minutes
*Probably should be less than 98 minutes since I was also playing an audio file. This means the anomaly is worse than this result makes it appear.
The jvisualvm profiles seem similar but due to what seem to be profiler glitches I haven't gotten much use from it. I'm looking for suggestions on where to look.
Both machines are Ubuntu 14.04.3 and on Java 8.
The answer is: collect more data and draw some conclusions. It appears that when comparing these two systems some conclusions can be drawn but they might not extend to the chipsets or the processors.
Reviewing the data in the original posting and the following measurements, it appears that for small data sets, not only does the quad system's hyperthreading not significantly improve throughput, but even going beyond 2 threads on a 4-core device does not improve throughput per unit of time, at least with these particular homogeneous workloads. For large data sets it appears that hyperthreading reduces throughput per unit of time. Note the 2933-second result compared to an average of 1883 seconds (the mean of 2032 and 1734).
The dual core hyperthreading is amazingly good, scaling well across the thread pool size dimension. The dual core also scaled well across the data set size dimension.
All measurements are elapsed times. Other means can be inferred, for example 2032 and 1734 can be averaged.
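As a rough illustration of the kind of "collect more data" sweep used here, a sketch like the following times the same homogeneous, compute-bound batch under different fixed thread pool sizes; simulateOnce is only a hypothetical stand-in for the real simulation.
import java.util.concurrent.{Callable, Executors}
import scala.collection.JavaConverters._

// Stand-in for one deterministic simulation run (pure CPU work, no I/O, no shared state).
def simulateOnce(seed: Long): Double = {
  var x = seed.toDouble + 1.0
  var i = 0
  while (i < 5000000) { x = math.sqrt(x * x + 1.0); i += 1 }
  x
}

for (poolSize <- Seq(1, 2, 4, 8)) {
  val pool = Executors.newFixedThreadPool(poolSize)
  val tasks = (1 to 64).map(i => new Callable[Double] { def call(): Double = simulateOnce(i.toLong) })
  val start = System.nanoTime()
  pool.invokeAll(tasks.asJava) // run the whole batch, blocking until every task completes
  val elapsedSeconds = (System.nanoTime() - start) / 1e9
  pool.shutdown()
  println(f"pool size $poolSize%d: $elapsedSeconds%.1f s")
}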

Degrading performance when increasing number of slaves [duplicate]

I am doing a simple scaling test on Spark using a sort benchmark -- from 1 core up to 8 cores. I notice that 8 cores is slower than 1 core.
//run spark using 1 core
spark-submit --master local[1] --class john.sort sort.jar data_800MB.txt data_800MB_output
//run spark using 8 cores
spark-submit --master local[8] --class john.sort sort.jar data_800MB.txt data_800MB_output
The input and output directories in each case are in HDFS.
1 core: 80 secs
8 cores: 160 secs
I would expect the 8-core run to show some amount of speedup.
Theoretical limitations
I assume you are familiar with Amdahl's law, but here is a quick reminder. The theoretical speedup is defined as:
S(s) = 1 / ((1 - p) + p / s)
where:
s - is the speedup of the parallel part.
p - is the fraction of the program that can be parallelized.
In practice the theoretical speedup is always limited by the part that cannot be parallelized, and even if p is relatively high (0.95) the theoretical limit is quite low:
[Figure: Amdahl's law speedup curves. Attribution: Daniels220 at English Wikipedia, licensed under Creative Commons Attribution-Share Alike 3.0 Unported.]
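Plugging the quoted p = 0.95 into the formula and letting the speedup of the parallel part grow without bound gives the ceiling that the figure above illustrates:
S_max = lim (s -> infinity) of 1 / ((1 - p) + p / s) = 1 / (1 - p) = 1 / 0.05 = 20
So even if 95% of the program parallelizes perfectly, the overall speedup can never exceed 20x.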
Effectively this sets a theoretical bound on how fast you can get. You can expect p to be relatively high for embarrassingly parallel jobs, but I wouldn't dream of anything close to 0.95 or higher. This is because:
Spark is a high cost abstraction
Spark is designed to work on commodity hardware at datacenter scale. Its core design is focused on making the whole system robust and immune to hardware failures. That is a great feature when you work with hundreds of nodes and execute long-running jobs, but it does not scale down very well.
Spark is not focused on parallel computing
In practice Spark and similar systems are focused on two problems:
Reducing overall IO latency by distributing IO operations between multiple nodes.
Increasing amount of available memory without increasing the cost per unit.
which are fundamental problems for large scale, data intensive systems.
Parallel processing is more a side effect of this particular solution than the main goal. Spark is distributed first, parallel second. The main point is to keep processing time constant with increasing amounts of data by scaling out, not to speed up existing computations.
With modern coprocessors and GPGPUs you can achieve much higher parallelism on a single machine than a typical Spark cluster, but that does not necessarily help in data-intensive jobs, due to IO and memory limitations. The problem is how to load the data fast enough, not how to process it.
Practical implications
Spark is not a replacement for multiprocessing or multithreading on a single machine.
Increasing parallelism on a single machine is unlikely to bring any improvement, and will typically decrease performance due to the overhead of the components.
In this context:
Assuming that the class and jar are meaningful and it is indeed a sort, it is just cheaper to read the data (single partition in, single partition out) and sort it in memory on a single partition than to execute the whole Spark sorting machinery with shuffle files and data exchange.
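For comparison, the single-machine baseline this argument appeals to is essentially the sketch below: read the file, sort it in memory, write it back out, with no partitions and no shuffle. The file names are taken from the spark-submit commands above (reading from the local filesystem for simplicity, whereas the question uses HDFS), and it assumes the ~800 MB of lines fits in the JVM heap.
import scala.io.Source
import java.io.PrintWriter

// Read everything, sort in memory, write the result - the work local[1] effectively does,
// minus Spark's shuffle files and task scheduling.
val source = Source.fromFile("data_800MB.txt")
val sortedLines = try source.getLines().toVector.sorted finally source.close()
val out = new PrintWriter("data_800MB_output")
try sortedLines.foreach(out.println) finally out.close()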