I am learning parallel computation in IPython and came across this example:
from ipyparallel import Client
rc = Client()
rc.block = True
print(rc.ids)
def mul(a, b):
    return a * b
dview = rc[:]
print(dview.apply(mul, 5, 6))
print(rc[0].apply(mul, 5, 5))
print(rc[1].apply(mul, 5, 9))
In the above code, dview.apply passes the same set of arguments to all the clients. I have learned to call each client separately, but if the clients are to do data-intensive tasks, is there a way to pass different arguments through dview.apply, since that is rather the point of doing parallel computation?
If there is no other way, can we make each client call asynchronous, so the tasks are done in parallel instead of waiting for the result from the first client when clients are called individually?
In general, parallel computations can be expressed as maps, where you pass sequences of arguments:
dview = rc[:]
inputs = [6, 5, 9]
results = dview.map(mul, [5] * len(inputs), inputs)
can we make each client call asynchronous
Yes, you can use view.apply_async to return a Future corresponding to the result:
ar = view.apply_async(mul, 5, 6)
result = ar.get()
I'm trying to migrate a stream processing project to Spark Structured Streaming.
Within this project, there is a correlation logic like this:
A dict with initial values:
{
    1: [2, 3],
    4: [5, 6],
}
Then a new input comes, saying that 2 and 5 should be correlated.
We know the key for 2 is 1, and the key for 5 is 4, so we merge entry 4 (its key and all its values) into entry 1.
Finally, the dict becomes { 1: [2, 3, 4, 5, 6] }.
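Independent of how the state is stored, the merge step itself can be sketched in plain Python. This is only an illustration of the correlation logic, not Spark code; the correlate helper and the owner reverse index (mapping each value to the key that owns it) are assumptions of this sketch:

```python
def correlate(state, owner, a, b):
    """Merge the entry owning value b into the entry owning value a."""
    key_a, key_b = owner[a], owner[b]
    if key_a == key_b:
        return  # already correlated
    # Fold key_b's values (and key_b itself) into key_a's entry
    merged = state.pop(key_b)
    state[key_a].extend([key_b] + merged)
    for v in [key_b] + merged:
        owner[v] = key_a

# Initial state from the example above
state = {1: [2, 3], 4: [5, 6]}
# Reverse index: which key owns each value (keys own themselves)
owner = {k: k for k in state}
for k, vs in state.items():
    for v in vs:
        owner[v] = k

correlate(state, owner, 2, 5)  # new input: correlate 2 and 5
print(state)  # {1: [2, 3, 4, 5, 6]}
```

In Spark terms this is exactly the hard part of the question: the update touches two keys at once, which is why a per-key GroupState cannot express it directly.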
Currently, we use a distributed database to save the dict. But with Spark, I want to retire the database and only rely on Spark's memory state.
According to this tutorial, I created a mapping function:
def mappingFunction(
    key: String,
    values: Iterator[Input],
    state: GroupState[State]
): Iterator[...] = {
}
But it seems I can only access the state of the specific key (first param in this func).
My questions are:
If I receive <2, 5>, how can I update the group state of 1 and delete the group state of 4?
Can we rely on Spark for maintaining a complicated state like this? Or is a distributed global state store always needed for this case?
Thanks!
I am going through the List methods in Scala.
val mylist = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 3, 10)
I am quite confused by hasDefiniteSize and knownSize.
For List, hasDefiniteSize returns true and knownSize returns -1.
What is the exact theory behind these methods?
These methods are defined by a superclass of List that is shared with possibly endless collections (like Stream, LazyList and Iterator).
For more details, I believe the documentation puts it best.
Here is the one for hasDefiniteSize in version 2.13.1:
Tests whether this collection is known to have a finite size. All
strict collections are known to have finite size. For a non-strict
collection such as Stream, the predicate returns true if all elements
have been computed. It returns false if the stream is not yet
evaluated to the end. Non-empty Iterators usually return false even if
they were created from a collection with a known finite size.
Note: many collection methods will not work on collections of infinite
sizes. The typical failure mode is an infinite loop. These methods
always attempt a traversal without checking first that hasDefiniteSize
returns true. However, checking hasDefiniteSize can provide an
assurance that size is well-defined and non-termination is not a
concern.
Note that hasDefiniteSize is deprecated with the following message:
(Since version 2.13.0) Check .knownSize instead of .hasDefiniteSize
for more actionable information (see scaladoc for details)
The documentation for knownSize further states:
The number of elements in this collection, if it can be cheaply
computed, -1 otherwise. Cheaply usually means: Not requiring a
collection traversal.
List is an implementation of a linked list, which is why List(1, 2, 3).hasDefiniteSize returns true (the collection is not boundless) but List(1, 2, 3).knownSize returns -1 (computing the collection size requires traversing the whole list).
Some collections know their size
Vector(1,2,3).knownSize // 3
and some do not
List(1,2,3).knownSize // -1
If a collection knows its size then some operations can be optimised, for example, consider how Iterable#sizeCompare uses knownSize to possibly return early
def sizeCompare(that: Iterable[_]): Int = {
  val thatKnownSize = that.knownSize

  if (thatKnownSize >= 0) this sizeCompare thatKnownSize
  else {
    val thisKnownSize = this.knownSize

    if (thisKnownSize >= 0) {
      val res = that sizeCompare thisKnownSize
      // can't just invert the result, because `-Int.MinValue == Int.MinValue`
      if (res == Int.MinValue) 1 else -res
    } else {
      val thisIt = this.iterator
      val thatIt = that.iterator
      while (thisIt.hasNext && thatIt.hasNext) {
        thisIt.next()
        thatIt.next()
      }
      java.lang.Boolean.compare(thisIt.hasNext, thatIt.hasNext)
    }
  }
}
See related question Difference between size and sizeIs
Is it possible to use the result of a simulation Sim1 at time t as the start value of a simulation Sim2? Using extends doesn't work for start values.
Example:
model Sim1
  Real a;
equation
  a = 2*time;
end Sim1;
For model Sim2, I need
Real b(start = a at time t)
to use in several other sets of equations.
You have to distinguish between the modeling and the simulation process:
With the language Modelica you define your models
With the simulation tool (like Dymola) you perform the simulation.
The keyword extends is part of the Modelica language, so it cannot be of any use in this context: you use it to define models, not to describe how a simulation should be performed.
The solution to your problem must be sought in the simulation tool, and Dymola offers a simulator function which does exactly what you want: simulateExtendedModel. It allows you to read the final value of a variable, and you can initialize parameters and state variables with it. You can use it in a .mos script or within a Modelica function.
So if we rename your models Sim1 and Sim2 to Model1 and Model2 (because they are really models, not simulations), the function below does what you want:
function sim
  import DymolaCommands.SimulatorAPI.simulateExtendedModel;
protected
  Boolean ok;
  Real a;
  Real[1] finalValues;
algorithm
  (ok, finalValues) := simulateExtendedModel("Model1", 0, 5, finalNames={"a"});
  a := finalValues[1];
  simulateExtendedModel("Model2", 5, 10, initialNames={"b"}, initialValues={a});
end sim;
If you want to set multiple variables, you can use this code:
function sim2
  import DymolaCommands.SimulatorAPI.simulateExtendedModel;
protected
  Boolean ok;
  Real[:] finalValues_sim1;
  String[:] finalNames_sim1 = {"a1", "a2", "a3"};
  String[:] initialNames_sim2 = {"b1", "b2", "b3"};
algorithm
  (ok, finalValues_sim1) := simulateExtendedModel("SO.Model1", 0, 5, finalNames=finalNames_sim1);
  simulateExtendedModel("SO.Model2", 5, 10, initialNames=initialNames_sim2, initialValues=finalValues_sim1);
end sim2;
In the example below, how can n be incremented when using multiprocessing?
class Test:
    def __init__(self, n):
        self.n = n

    def run(self):
        self.n += 1
        return True

# Generate 4 instances
klasses = [Test(0) for i in range(4)]
When [k.n for k in klasses] is run it produces [0, 0, 0, 0] as expected.
Trying to run the function run() for each class in parallel using:
from multiprocessing import Pool

with Pool() as pool:
    results = [pool.apply_async(k.run, ()) for k in klasses]
    result = [i.get() for i in results]
results in result returning [True, True, True, True] as expected. The instances' n attribute has not changed though: running [k.n for k in klasses] still gives [0, 0, 0, 0].
When the method is not processed in parallel e.g. [k.run() for k in klasses], [k.n for k in klasses] returns [1, 1, 1, 1] as expected.
Is there a way for the classes to maintain state when run in parallel though?
Shared state in multiprocessing must be managed explicitly, since each worker runs in a separate process. The multiprocessing docs cover the various options in some detail. The simplest solution would be to make n a multiprocessing.Value, though that requires significant changes to the Test class so it uses the proper types and attributes.
Alternatively, try and find a way to perform your work using pool.imap/pool.imap_unordered with state being passed in as arguments and new data returned; if your problem can be expressed this way, it's often better to limit sharing to inputs and outputs, not live state.
In Python we have lru_cache as a function decorator. Add it to your function and the function will only be evaluated once per distinct set of input arguments.
Example (from Python docs):
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    if n < 2:
        return n
    return fib(n-1) + fib(n-2)
>>> [fib(n) for n in range(16)]
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610]
>>> fib.cache_info()
CacheInfo(hits=28, misses=16, maxsize=None, currsize=16)
I wonder whether a similar thing exists in Matlab? At the moment I am using cache files, like so:
function result = fib(n)
% FIB example like the Python example. Don't implement it like that!
cachefile = ['fib_', num2str(n), '.mat'];
try
    load(cachefile);
catch e
    if n < 2
        result = n;
    else
        result = fib(n-1) + fib(n-2);
    end
    save(cachefile, 'result');
end
end
The problem I have with doing it this way, is that if I change my function, I need to delete the cachefile.
Is there a way to do this with Matlab realising when I changed the function and the cache has become invalidated?
Since Matlab R2017a this is built in:
https://nl.mathworks.com/help/matlab/ref/memoizedfunction.html
a = memoize(@sin)
I've created something like this for my own personal use: a CACHE class. (I haven't documented the code yet though.) It appears to be more flexible than Python's lru_cache (I wasn't aware of that, thanks) in that it has several methods for adjusting exactly what gets cached (to save memory) and how the comparisons are made. It could still use some refinement (@Daniel's suggestion to use the containers.Map class is a good one, though it would limit compatibility with old Matlab versions). The code is on GitHub so you're welcome to fork and improve it.
Here is a basic example of how it can be used:
function Output1 = CacheDemo(Input1,Input2)
persistent DEMO_CACHE
if isempty(DEMO_CACHE)
    % Initialize cache object on first run
    CACHE_SIZE = 10; % Number of input/output patterns to cache
    DEMO_CACHE = CACHE(CACHE_SIZE,Input1,Input2);
    CACHE_IDX = 1;
else
    % Check if input pattern corresponds to something stored in cache
    % If not, return next available CACHE_IDX
    CACHE_IDX = DEMO_CACHE.IN([],Input1,Input2);
    if ~isempty(CACHE_IDX) && DEMO_CACHE.OUT(CACHE_IDX) > 0
        [~,Output1] = DEMO_CACHE.OUT(CACHE_IDX);
        return;
    end
end

% Perform computation
Output1 = rand(Input1,Input2);

% Save output to cache at CACHE_IDX
DEMO_CACHE.OUT(CACHE_IDX,Output1);
end
I created this class to cache the results from time-consuming stochastic simulations and have since used it to good effect in a few other places. If there is interest, I might be willing to spend some time documenting the code sooner as opposed to later. It would be nice if there was a way to limit memory use as well (a big consideration in my own applications), but getting the size of arbitrary Matlab datatypes is not trivial. I like your idea of caching to a file, which might be a good idea for larger data. Also, it might be nice to create a "lite" version that does what Python's lru_cache does.