Apache beam - random numbers

Apache beam - random numbers - apache-beam

What is the recommended way to generate random numbers with apache beam so that each entries is associated with the same random number if it retries processing the entry?
For example, to map each entries, I would like to treat 90% one way and 10% the other way. If a worker crashes and beam retries the processing, I need to ensure that the random number that dictates which way to process the entry stays the same?

There is no built-in functionality for that in Apache Beam. However you could write a DoFn that implements whatever random number generation you want for this use-case. For example, a DoFn that uses some unique identifier from each element (some kind of ID that would be unique but wouldn't change on retries) and runs it through a noise function to get a random value which it associates with that element.

Related

Why does every simulation I perform in Arena Simulation give the same results?

Every time I run a simulation with the same parameters in the Run window I get exactly the same results. The results are different if a different number of replications is set each time the run is started
These are my settings in the run window:
enter image description here
I have a lot of Process blocks. Most of them have a normal distribution in duration. Why are the results not different?
If it helps in any way, her is a photo of the constructed model:
enter image description here

Arena uses the same random nr stream for your run unless you tell it to use another - so your first rep will look the same every time. The answer depends on the distributions you sample from. If you change the logic and sample at other times, the answer will change. Each following rep will have a unique answer based on the random nr streams you are sampling from. It makes it easier/possible to find errors if you can execute exactly the same run.

Arena provides (by default) ten different "streams" of (pseudo) random numbers. If you don't ask the system to use a particular stream, it will use stream 10. For example NORM(10,2) will use stream 10 to calculate a random normally distributed number (mean 10, standard deviation 2); NORM(10,2,4) will use stream 4 to calculate a similarly distributed number.
By default, the 10 random number generators for the streams are initialised at the beginning of each run to 14561, 25971, 31131, 22553, 12121, 32323, 19991, 18765, 14327, and 32535 (from Arena help). At the end of one replication, the generators will not be re-initialised, so they will start the next replication with a new value.
You can control the random number generator initialisation with the SEEDS element.
As #Marlize says, this helps to ensure that you can reproduce a simulation result if you need to.

Matlab random number rng: choosing a seed

I would like to know more precisely what happends when you choose a custom seed in Matlab, e.g.:
rng(101)
From my (limited, nut nevertheless existing) understanding of how pseudo-random number generators work, one can see the seed conceptually as choosing a position in a "very long list of pseudo-random numbers".
Question: lets say, (in my Matlab script), I choose rng(100) for my first computation (a sequence of instructions) and then rng(1e6) for my second. Please, note that each time I do some computations it involves generating up to about 300k random numbers (each time).
-> Does that imply that I make sure there is no overlap between the sequence in the "list" starting at 100 and ending around 300k and the one starting at 1e6 and ending at 1'300'000 ? (the idead of "no overlap" comes from the fact since the rng(100) and rng(1e6) are separated by much more than 300k)
i.e. that these are 2 "independent" sequences, (as far as I remember this 'long list' would be generated by a special PRNG algorithm, most likely involing modular arithmetic..?)

No that is not the case. The mapping between the seed and the "position" in our list of generated numbers is not linear, you could actually interpret it as a hash/one way function. It could actually happen that we get the same sequence of numbers shifted by one position (but it is very unlikely).
By default, MATLAB uses the Mersenne Twister (source).

Not quite. The seed you give to rng is the initiation point for the Mersenne Twister algorithm (by default) that is used to generate the pseudorandom numbers. If you choose two different seeds (no matter their relative non-negative integer values, except for maybe a special case or two), you will have effectively independent pseudorandom number streams.
For "99%" of people, the major uses of seeding the rng are using the 'shuffle' argument (to use a non-default seed based on the time to help ensure independence of numbers generated across multiple sessions), or to give it one particular seed (to be able to reproduce the same pseudorandom stream at a later date). If you try to finesse the seeds further without being extremely careful, you are more likely to cause issues than do anything helpful.
RandStream can be used to break off separate streams of pseudorandom numbers if that really matters for your application (it likely doesn't).

How to calculate the best numberOfPartitions for coalesce?

So, I understand that in general one should use coalesce() when:
the number of partitions decreases due to a filter or some other operation that may result in reducing the original dataset (RDD, DF). coalesce() is useful for running operations more efficiently after filtering down a large dataset.
I also understand that it is less expensive than repartition as it reduces shuffling by moving data only if necessary. My problem is how to define the parameter that coalesce takes (idealPartionionNo). I am working on a project which was passed to me from another engineer and he was using the below calculation to compute the value of that parameter.
// DEFINE OPTIMAL PARTITION NUMBER
implicit val NO_OF_EXECUTOR_INSTANCES = sc.getConf.getInt("spark.executor.instances", 5)
implicit val NO_OF_EXECUTOR_CORES = sc.getConf.getInt("spark.executor.cores", 2)
val idealPartionionNo = NO_OF_EXECUTOR_INSTANCES * NO_OF_EXECUTOR_CORES * REPARTITION_FACTOR
This is then used with a partitioner object:
val partitioner = new HashPartitioner(idealPartionionNo)
but also used with:
RDD.filter(x=>x._3<30).coalesce(idealPartionionNo)
Is this the right approach? What is the main idea behind the idealPartionionNo value computation? What is the REPARTITION_FACTOR? How do I generally work to define that?
Also, since YARN is responsible for identifying the available executors on the fly is there a way of getting that number (AVAILABLE_EXECUTOR_INSTANCES) on the fly and use that for computing idealPartionionNo (i.e. replace NO_OF_EXECUTOR_INSTANCES with AVAILABLE_EXECUTOR_INSTANCES)?
Ideally, some actual examples of the form:
Here 's a dataset (size);
Here's a number of transformations and possible reuses of an RDD/DF.
Here is where you should repartition/coalesce.
Assume you have n executors with m cores and a partition factor equal to k
then:
The ideal number of partitions would be ==> ???
Also, if you can refer me to a nice blog that explains these I would really appreciate it.

In practice optimal number of partitions depends more on the data you have, transformations you use and overall configuration than the available resources.
If the number of partitions is too low you'll experience long GC pauses, different types of memory issues, and lastly suboptimal resource utilization.
If the number of partitions is too high then maintenance cost can easily exceed processing cost. Moreover, if you use non-distributed reducing operations (like reduce in contrast to treeReduce), a large number of partitions results in a higher load on the driver.
You can find a number of rules which suggest oversubscribing partitions compared to the number of cores (factor 2 or 3 seems to be common) or keeping partitions at a certain size but this doesn't take into account your own code:
If you allocate a lot you can expect long GC pauses and it is probably better to go with smaller partitions.
If a certain piece of code is expensive then your shuffle cost can be amortized by a higher concurrency.
If you have a filter you can adjust the number of partitions based on a discriminative power of the predicate (you make different decisions if you expect to retain 5% of the data and 99% of the data).
In my opinion:
With one-off jobs keep higher number partitions to stay on the safe side (slower is better than failing).
With reusable jobs start with conservative configuration then execute - monitor - adjust configuration - repeat.
Don't try to use fixed number of partitions based on the number of executors or cores. First understand your data and code, then adjust configuration to reflect your understanding.
Usually, it is relatively easy to determine the amount of raw data per partition for which your cluster exhibits stable behavior (in my experience it is somewhere in the range of few hundred megabytes, depending on the format, data structure you use to load data, and configuration). This is the "magic number" you're looking for.
Some things you have to remember in general:
Number of partitions doesn't necessarily reflect
data distribution. Any operation that requires shuffle (*byKey, join, RDD.partitionBy, Dataset.repartition) can result in non-uniform data distribution. Always monitor your jobs for symptoms of a significant data skew.
Number of partitions in general is not constant. Any operation with multiple dependencies (union, coGroup, join) can affect the number of partitions.

Your question is a valid one, but Spark partitioning optimization depends entirely on the computation you're running. You need to have a good reason to repartition/coalesce; if you're just counting an RDD (even if it has a huge number of sparsely populated partitions), then any repartition/coalesce step is just going to slow you down.
Repartition vs coalesce
The difference between repartition(n) (which is the same as coalesce(n, shuffle = true) and coalesce(n, shuffle = false) has to do with execution model. The shuffle model takes each partition in the original RDD, randomly sends its data around to all executors, and results in an RDD with the new (smaller or greater) number of partitions. The no-shuffle model creates a new RDD which loads multiple partitions as one task.
Let's consider this computation:
sc.textFile("massive_file.txt")
.filter(sparseFilterFunction) // leaves only 0.1% of the lines
.coalesce(numPartitions, shuffle = shuffle)
If shuffle is true, then the text file / filter computations happen in a number of tasks given by the defaults in textFile, and the tiny filtered results are shuffled. If shuffle is false, then the number of total tasks is at most numPartitions.
If numPartitions is 1, then the difference is quite stark. The shuffle model will process and filter the data in parallel, then send the 0.1% of filtered results to one executor for downstream DAG operations. The no-shuffle model will process and filter the data all on one core from the beginning.
Steps to take
Consider your downstream operations. If you're just using this dataset once, then you probably don't need to repartition at all. If you are saving the filtered RDD for later use (to disk, for example), then consider the tradeoffs above. It takes experience to become familiar with these models and when one performs better, so try both out and see how they perform!

As others have answered, there is no formula which calculates what you ask for. That said, You can make an educated guess on the first part and then fine tune it over time.
The first step is to make sure you have enough partitions. If you have NO_OF_EXECUTOR_INSTANCES executors and NO_OF_EXECUTOR_CORES cores per executor then you can process NO_OF_EXECUTOR_INSTANCES*NO_OF_EXECUTOR_CORES partitions at the same time (each would go to a specific core of a specific instance).
That said this assumes everything is divided equally between the cores and everything takes exactly the same time to process. This is rarely the case. There is a good chance that some of them would be finished before others either because of locallity (e.g. the data needs to come from a different node) or simply because they are not balanced (e.g. if you have data partitioned by root domain then partitions including google would probably be quite big). This is where the REPARTITION_FACTOR comes into play. The idea is that we "overbook" each core and therefore if one finishes very quickly and one finishes slowly we have the option of dividing the tasks between them. A factor of 2-3 is generally a good idea.
Now lets take a look at the size of a single partition. Lets say your entire data is X MB in size and you have N partitions. Each partition would be on average X/N MBs. If N is large relative to X then you might have very small average partition size (e.g. a few KB). In this case it is usually a good idea to lower N because the overhead of managing each partition becomes too high. On the other hand if the size is very large (e.g. a few GB) then you need to hold a lot of data at the same time which would cause issues such as garbage collection, high memory usage etc.
The optimal size is a good question but generally people seem to prefer partitions of 100-1000MB but in truth tens of MB probably would also be good.
Another thing you should note is when you do the calculation how your partitions change. For example, lets say you start with 1000 partitions of 100MB each but then filter the data so each partition becomes 1K then you should probably coalesce. Similar issues can happen when you do a groupby or join. In such cases both the size of the partition and the number of partitions change and might reach an undesirable size.

Terminology for creating an ID using the position in a graph? (adjacency data)

Is there a term for the process of creating an ID, based on connectivity information, that helps identify elements as matching based on their neighbors?
This can be as simple as looping over items in an array and accumulating next and previous items (possibly with bit-shifting, xor... etc).
Another example is using an order independent hash based on nodes connected by edges in a graph.
I've used this multiple times, but don't know if there is a term for it.
Typically the following steps are done by assigning an ID to each element(often a number - created by hashing the contents).
Then iterate:
Store a copy of all ID's to prevent reading modified values.
Loop over each element and create a new ID from its value combined with the connected elements.
Each iteration the range of influence elements have on each-other increases - following a triangle number sequence.
Is there a term for each iteration?
Is there a term for this entire process?

Should i use Trident to compute the global mean of tuples in Storm?

I want to compute with Storm the mean from incoming tuples made of [int id,int value].
As you can see i can't partition the data by using a fields grouping. I need a topology architecture to distribute this computation and the only way im thinking of is doing mini batches within each bolt instances and then aggregate.
I kind of understood that trident was the appropriate solution to do mini-batch processing within storm.
What is the best practice to compute global analytics with storm like means, global count, std-devs when you can't partition the data based on attribute? Any topology example?

You can easily compute stream statistics such as mean, standard deviation and count computed using Trident-ML. There's a section in the README which explains how to compute theses stats within a trident topology.
Hope it helps.