I have multiple tasks I'd like to set up to execute in parallel.
Update packages: returns a list of packages.
Update versions: takes a package and returns a list of versions.
Update releases: takes a version (for a package) and fetches releases.
Which gets me something like:
@task()
def update_packages():
    return [1, 2, 3]

@task()
def update_versions(package):
    # Get versions
    return [1, 2, 3]

@task()
def update_releases(version):
    # Get releases
What I can do is execute them in order and wait for the results, but I would rather push intermediate results forward, like on the shell:
update_packages | update_versions | update_releases
What invocation of magic would accomplish this?
I think you are looking for groups and the scatter-gather pattern:
@task()
def update_packages():
    res = group(update_versions.s(i) for i in [1, 2, 3])()  # run tasks in parallel (scatter)
    res.get()  # wait for all results (gather)
    return res

@task()
def update_versions(package):
    # Get versions
    res = group(update_releases.s(i) for i in [1, 2, 3])()  # run tasks in parallel (scatter)
    res.get()  # wait for all results (gather)
    return res

@task()
def update_releases(version):
    # Get releases
    return <what you want to see in the final result>
Now you can simply run update_packages and wait for all results:
res = update_packages()
You don't need to use .delay, because update_packages doesn't do any work by itself.
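A hedged alternative sketch (assuming the tasks return plain lists, as in the question's code): the same scatter-gather can also be driven from the caller instead of blocking inside the tasks:
from celery import group

packages = update_packages.delay().get()                                # e.g. [1, 2, 3]
version_lists = group(update_versions.s(p) for p in packages)().get()   # scatter/gather versions
versions = [v for sub in version_lists for v in sub]                    # flatten
group(update_releases.s(v) for v in versions)()                         # scatter releases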
Related
I want a few jobs executed every day at specific times.
The first job acquires data from the database and stores it in a global variable.
The second job should run a few minutes after the first and use the data the first job stored in that global variable.
global dataacq
dataacq = None

def condb():
    global check
    global dataacq
    conn = psycopg2.connect(#someinformation)
    cursor = conn.cursor()
    query = "SELECT conversation_id FROM tablename"
    cursor.execute(query)
    dataacq = cursor.fetchall()
    print(dataacq)
    cursor.close()
    conn.close()
    check = True
    print(check)
    return dataacq

def printresult(result):
    print(result)

schedule.every().day.at("08:59").do(condb)
schedule.every().day.at("09:00").do(printresult, dataacq)
Above is a part of the code I am using for testing. The problem here is that when the "printresult" function is called, it displays None as output. But if I execute all the functions without any scheduling, it works and displays what I need it to show. So why is this happening?
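A likely explanation, stated as an assumption based on the code above: the second .do() call evaluates dataacq at registration time, while it is still None, and that None is what gets handed to printresult at 09:00. A minimal sketch that defers the lookup until the job actually fires:
# hedged sketch: look dataacq up when the job runs, not when it is registered
schedule.every().day.at("09:00").do(lambda: printresult(dataacq))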
I'm trying to train a pytorch model as follows:
start = time.time()
for epoch in range(100):
    t_loss = 0
    for i in range(100):
        optimizer.zero_grad
        scores = my_model(sent_dict_list[i])
        scores = scores.permute(0, 2, 1)
        loss = loss_function(scores, torch.tensor(targ_list[i]).cuda())
        t_loss += loss.item()
        loss.backward()
        optimizer.step()
    print("t_loss = ", t_loss)
I find that when I call "optimizer.zero_grad" my loss decreases at the end of every epoch, whereas when I call "optimizer.zero_grad()" with the parentheses it stays almost exactly the same. I don't know what difference this makes and was hoping someone could explain it to me.
I assume you're new to Python; the '()' simply means a function call.
Consider this example:
>>> def foo():
...     print("function")
...
>>> foo
<function __main__.foo>
>>> foo()
function
Remember, functions are objects in Python; you can even store them like this:
>>> [foo, foo, foo]
Returning to your question: you have to call the function, otherwise the gradients are never actually cleared.
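For reference, a minimal sketch of the corrected inner loop from the question; only the zero_grad line changes:
for i in range(100):
    optimizer.zero_grad()   # with parentheses: actually resets the accumulated gradients
    scores = my_model(sent_dict_list[i])
    scores = scores.permute(0, 2, 1)
    loss = loss_function(scores, torch.tensor(targ_list[i]).cuda())
    t_loss += loss.item()
    loss.backward()
    optimizer.step()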
Consider this example:
import numpy as np
a = np.array(1)
np.save("a.npy", a)
a = np.load("a.npy", mmap_mode='r')
print(type(a))
b = a + 2
print(type(b))
which outputs
<class 'numpy.core.memmap.memmap'>
<class 'numpy.int32'>
So it seems that b is not a memmap any more, and I assume that this forces numpy to read the whole a.npy, defeating the purpose of the memmap. Hence my question: can operations on memmaps be deferred until access time?
I believe subclassing ndarray or memmap could work, but don't feel confident enough about my Python skills to try it.
Here is an extended example showing my problem:
import numpy as np

# create 8 GB file
# np.save("memmap.npy", np.empty([1000000000]))

# I want to print the first value using f and memmaps
def f(value):
    print(value[1])

# this is fast: f receives a memmap
a = np.load("memmap.npy", mmap_mode='r')
print("a = ")
f(a)

# this is slow: b has to be read completely and converted into an array
b = np.load("memmap.npy", mmap_mode='r')
print("b + 1 = ")
f(b + 1)
Here's a simple example of an ndarray subclass that defers operations on it until a specific element is requested by indexing.
I'm including this to show that it can be done, but it almost certainly will fail in novel and unexpected ways, and require substantial work to make it usable.
For a very specific case it may be easier than redesigning your code to solve the problem in a better way.
I'd recommend reading over these examples from the docs to help understand how it works.
import numpy as np

class Defered(np.ndarray):
    """
    An array class that defers calculations applied to it, only
    performing them when an index is requested.
    """
    def __new__(cls, arr):
        arr = np.asanyarray(arr).view(cls)
        arr.toApply = []
        return arr

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        ## Convert all arguments to ndarray, otherwise arguments
        #  of type Defered will cause infinite recursion.
        #  Also store self as None, to be replaced later on.
        newinputs = []
        for i in inputs:
            if i is self:
                newinputs.append(None)
            elif isinstance(i, np.ndarray):
                newinputs.append(i.view(np.ndarray))
            else:
                newinputs.append(i)

        ## Store the function to apply and the necessary arguments
        self.toApply.append((ufunc, method, newinputs, kwargs))
        return self

    def __getitem__(self, idx):
        ## Get the index and convert to a regular array
        sub = self.view(np.ndarray).__getitem__(idx)

        ## Apply the stored actions
        for ufunc, method, inputs, kwargs in self.toApply:
            inputs = [i if i is not None else sub for i in inputs]
            sub = super().__array_ufunc__(ufunc, method, *inputs, **kwargs)

        return sub
This will fail if modifications are made to it that don't use numpy's universal functions. For instance, percentile and median aren't based on ufuncs and would end up loading the entire array. Likewise, if you pass it to a function that iterates over the array, or that indexes a substantial portion of it, the entire array will be loaded.
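A quick usage sketch (my own illustration, reusing memmap.npy from the extended example above):
a = np.load("memmap.npy", mmap_mode='r')
d = Defered(a)     # wrap the memmap
d = d + 1          # recorded, not computed
d = d * 2          # recorded, not computed
print(d[1])        # only element 1 is read; the stored ufuncs are applied to it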
This is just how Python works. By default numpy operations return a new array, so b never exists as a memmap - it is created when + is called on a.
There are a couple of ways to work around this. The simplest is to do all operations in place:
a += 1
This requires loading the memory-mapped array for reading and writing:
a = np.load("a.npy", mmap_mode='r+')
Of course this isn't any good if you don't want to overwrite your original array.
In this case you need to specify that b should be memmapped.
b = np.memmap("b.npy", mode='w+', dtype=a.dtype, shape=a.shape)
Assignment can then be done using the out keyword provided by numpy ufuncs.
np.add(a, 2, out=b)
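Putting the pieces together, a small sketch (my own, assuming a.npy holds a large array you don't want to overwrite):
import numpy as np

a = np.load("a.npy", mmap_mode='r')                              # read-only memmap
b = np.memmap("b.npy", mode='w+', dtype=a.dtype, shape=a.shape)  # writable memmap for the result
np.add(a, 2, out=b)                                              # writes the result directly into b.npy
b.flush()                                                        # make sure the result is on disk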
I want to be able to choose the best-fitting algorithm with its best params.
How can I do it in one go, without creating separate pipelines for each algorithm, and without the cross-validation checking params that are not relevant for a specific algorithm?
I.e. I want to check how logistic regression performs against random forest.
My code is:
lr = LogisticRegression().setFamily("multinomial")

# Chain indexer and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, labelIndexer2, assembler, lr, labelconverter])

paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.3, 0.01]) \
    .addGrid(lr.elasticNetParam, [0.1, 0.8, 0.01]) \
    .addGrid(lr.maxIter, [10, 20, 25]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Train model. This also runs the indexer.
model = crossval.fit(trainingData)
I've written a quick and dirty workaround in Python/PySpark. It is a bit primitive (it doesn't have a corresponding Scala class), and I think it lacks save/load capabilities, but it might be a starting point for your case. Eventually it might become new functionality in Spark; it would be nice to have.
The idea is to have a special pipeline stage that acts as a switch between different objects, and maintains a dictionary to refer to them by string keys. The user enables one or another by name. The stages can be Estimators, Transformers, or a mix of both - the user is responsible for keeping the pipeline coherent (doing things that make sense, at her own risk). The parameter holding the name of the enabled stage can be included in the grid to be cross-validated.
from pyspark.ml.wrapper import JavaEstimator
from pyspark.ml.base import Estimator, Transformer
from pyspark.ml.param import Param, Params, TypeConverters

class PipelineStageChooser(JavaEstimator):

    selectedStage = Param(Params._dummy(), "selectedStage", "key of the selected stage in the dict",
                          typeConverter=TypeConverters.toString)
    stagesDict = None
    _paramMap = {}

    def __init__(self, stagesDict, selectedStage):
        super(PipelineStageChooser, self).__init__()
        self.stagesDict = stagesDict

        if selectedStage not in self.stagesDict.keys():
            raise KeyError("selected stage {0} not found in stagesDict".format(selectedStage))

        if isinstance(self.stagesDict[selectedStage], Transformer):
            self.fittedSelectedStage = self.stagesDict[selectedStage]

        for stage in stagesDict.values():
            if not (isinstance(stage, Estimator) or isinstance(stage, Transformer)):
                raise TypeError("Cannot recognize a pipeline stage of type %s." % type(stage))

        self._set(selectedStage=selectedStage)
        self._java_obj = None

    def fit(self, dataset, params=None):
        selectedStage_str = self.getOrDefault(self.selectedStage)
        if isinstance(self.stagesDict[selectedStage_str], Estimator):
            return self.stagesDict[selectedStage_str].fit(dataset, params=params)
        elif isinstance(self.stagesDict[selectedStage_str], Transformer):
            return self.stagesDict[selectedStage_str]
Usage example:
count_vectorizer = CountVectorizer()  # set params
hashing_tf = HashingTF()  # set params

chooser = PipelineStageChooser(stagesDict={"count_vectorizer": count_vectorizer,
                                           "hashing_tf": hashing_tf},
                               selectedStage="count_vectorizer")

pipeline = Pipeline(stages=[chooser])

# Test which among CountVectorizer or HashingTF works better to create features
# Could be used as well to decide between different ML algorithms
paramGrid = ParamGridBuilder() \
    .addGrid(chooser.selectedStage, ["count_vectorizer", "hashing_tf"]) \
    .build()
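The grid can then be handed to a CrossValidator just as in the question; a hedged sketch (the evaluator and fold count are placeholders I picked, not from the original answer):
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=3)
model = crossval.fit(trainingData)  # trainingData as in the question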
One step of my pipeline involves fetching from an external data source and I'd like to do that in chunks (order doesn't matter). I couldn't find any class that does something similar so I've created the following:
class FixedSizeBatchSplitter(beam.DoFn):
    def __init__(self, size):
        self.size = size

    def start_bundle(self):
        self.current_batch = []

    def finish_bundle(self):
        if self.current_batch:
            yield self.current_batch

    def process(self, element):
        self.current_batch.append(element)
        if len(self.current_batch) >= self.size:
            yield self.current_batch
            self.current_batch = []
However, when I run this pipeline, I get a RuntimeError: Finish Bundle should only output WindowedValue type error:
with beam.Pipeline() as p:
    res = (p
           | beam.Create(range(10))
           | beam.ParDo(FixedSizeBatchSplitter(3))
           )
Why is that? How come I can yield outputs in process but not in finish_bundle? By the way, if I remove finish_bundle the pipeline works, but it obviously discards the leftovers.
A DoFn may be processing elements from multiple different windows. When you're in process(), the "current window" is unambiguous - it's the window of the element being processed. When you're in finish_bundle, it's ambiguous and you need to specify the window explicitly. You need to be yielding something of the form yield WindowedValue(something, timestamp, [window]).
If all your data is in the global window, that makes it easier: window will be just GlobalWindow(). If you're using multiple windows, then you'll need to have 1 buffer per window; capture the window in process() so that you add to the proper buffer; and in finish_bundle emit each of them in the respective window.
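If everything sits in the global window, a minimal sketch of the fixed finish_bundle could look like this (the exact import paths and the MIN_TIMESTAMP choice are my assumptions):
from apache_beam.transforms.window import GlobalWindow
from apache_beam.utils.timestamp import MIN_TIMESTAMP
from apache_beam.utils.windowed_value import WindowedValue

def finish_bundle(self):
    if self.current_batch:
        # wrap the leftover batch explicitly in the global window
        yield WindowedValue(self.current_batch, MIN_TIMESTAMP, [GlobalWindow()])
        self.current_batch = []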