ERROR: Uncaught exception in luigi (TypeError: must be string or buffer, not None) - subprocess

I am having trouble while calling /triggering Luigi Task from a python code.
Basically i need to trigger a luigi task just like we do on command line, but from a python code
I am using supbrocess.popen to call a luigi task using a shell
command
I have a test code named as test.py and have a test class in module
task_scheduler.py which contains my luigi task (both modules in same location/dir)
import luigi
class TestClass(luigi.Task):
# param = luigi.DictParameter(default=dict())
def requires(self):
print "I am TestClass req"
def run(self):
with open('myfile.txt', 'w') as f:
f.write("asasasas")
print "I am TestClass run"
import subprocess
p = subprocess.Popen("python -m luigi --module task_scheduler TestClass", shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print p.pid
(output, err) = p.communicate()
print "-------------O/P-------------"
print output
print "-------------error-------------"
print err
But I am getting the error as
52688
-------------O/P-------------
-------------error-------------
ERROR: Uncaught exception in luigi
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/luigi/retcodes.py", line 61, in run_with_retcodes
worker = luigi.interface._run(argv)['worker']
File "/Library/Python/2.7/site-packages/luigi/interface.py", line 238, in _run
return _schedule_and_run([cp.get_task_obj()], worker_scheduler_factory)
File "/Library/Python/2.7/site-packages/luigi/interface.py", line 172, in _schedule_and_run
not(lock.acquire_for(env_params.lock_pid_dir, env_params.lock_size, kill_signal))):
File "/Library/Python/2.7/site-packages/luigi/lock.py", line 82, in acquire_for
my_pid, my_cmd, pid_file = get_info(pid_dir)
File "/Library/Python/2.7/site-packages/luigi/lock.py", line 67, in get_info
pid_file = os.path.join(pid_dir, hashlib.md5(cmd_hash).hexdigest()) + '.pid'
TypeError: must be string or buffer, not None
Can anyone please suggest me what I am doing wrong here?
The command "python -m luigi --module task_scheduler TestClass" works perfectly if I use shell prompt

I just tried executing the test.py from command line.
It seems when I was using the IDE to run (PyCharm) then only it gives this issue

This is fixed in version 2.2.0. Check github issue #1654

Related

Adjust wait_time in locust using custom arguments via CLI

I want to adjust wait_time parameter by passing it via CLI.
I have tried the following way:
custom_wait_time = None
# Add custom argument to locust
#events.init_command_line_parser.add_listener
def init_parser(parser):
parser.add_argument("--locust-wait-time",
type=int,
include_in_web_ui=True,
default=None,
help="Wait time per each request of a user.")
#events.init.add_listener
def _(environment, **kwargs):
global custom_wait_time
custom_wait_time = int(environment.parsed_options.locust_wait_time)
print(custom_wait_time) # First print
class MyUser(HttpUser):
global custom_wait_time
print(custom_wait_time) # Second print
wait_time = constant(custom_wait_time)
Assume that custom_wait_time=10 when I pass it via CLI, the First print gives me custom_wait_time=10 while the Second print gives me custom_wait_time=None instead of 10, so the wait_time = constant(custom_wait_time) will break and give me the error below:
Traceback (most recent call last):
File "src/gevent/greenlet.py", line 908, in gevent._gevent_cgreenlet.Greenlet.run
File "/Users/opt/anaconda3/envs/ai/lib/python3.7/site-packages/locust/user/users.py", line 176, in run_user
user.run()
File "/Users/opt/anaconda3/envs/ai/lib/python3.7/site-packages/locust/user/users.py", line 144, in run
self._taskset_instance.run()
File "/Users/opt/anaconda3/envs/ai/lib/python3.7/site-packages/locust/user/task.py", line 367, in run
self.wait()
File "/Users/opt/anaconda3/envs/ai/lib/python3.7/site-packages/locust/user/task.py", line 445, in wait
self._sleep(self.wait_time())
File "/Users/opt/anaconda3/envs/ai/lib/python3.7/site-packages/locust/user/task.py", line 451, in _sleep
gevent.sleep(seconds)
File "/Users/opt/anaconda3/envs/ai/lib/python3.7/site-packages/gevent/hub.py", line 157, in sleep
if seconds <= 0:
TypeError: '<=' not supported between instances of 'NoneType' and 'int'
Any help would be appreciated.
The problem is that the code will run in the wrong order - MyUser is defined before any of the init-methods are called.
If you instead do MyUser.wait_time = constant(custom_wait_time) inside your init handler (and dont set it at all in the class) it should work.
That way you dont need any globals either :)
I've just do the same work.
#events.init_command_line_parser.add_listener
def _(parser):
parser.add_argument("--waitTime", type=float, env_var="LOCUST_WAIT_TIME", default=1.0, help="wait time between each task of an user")
# Set `include_in_web_ui` to False if you want to hide from the web UI
#parser.add_argument("--my-ui-invisible-argument", include_in_web_ui=False, default="I am invisible")
and in the test class, just use it value like this
class GetInfoUser(HttpUser):
def wait_time(self):
return self.environment.parsed_options.waitTime

Chaining an iterable task result to a group, then chaining the results of this group to a chord

Related: How to chain a Celery task that returns a list into a group? and Chain a celery task's results into a distributed group, but they lack a MWE, to my understanding do not fully cover my needs, and are maybe outdated.
I need to dynamically pass the output of a task to a parallel group of tasks, then pass the results of this group to a final task. I would like to make this a single "meta task".
Here's my attempt so far:
# chaintasks.py
import random
from typing import List, Iterable, Callable
from celery import Celery, signature, group, chord
app = Celery(
"chaintasks",
backend="redis://localhost:6379",
broker="pyamqp://guest#localhost//",
)
app.conf.update(task_track_started=True, result_persistent=True)
#app.task
def select_items(items: List[str]):
print("In select_items, received", items)
selection = tuple(set(random.choices(items, k=random.randint(1, len(items)))))
print("In select_items, sending", selection)
return selection
#app.task
def process_item(item: str):
print("In process_item, received", item)
return item + "_processed"
#app.task
def group_items(items: List[str]):
print("In group_items, received", items)
return ":".join(items)
#app.task
def dynamic_map_group(iterable: Iterable, callback: Callable):
# Map a callback over an iterator and return as a group
# Credit to https://stackoverflow.com/a/13569873/5902284
callback = signature(callback)
return group(callback.clone((arg,)) for arg in iterable).delay()
#app.task
def dynamic_map_chord(iterable: Iterable, callback: Callable):
callback = signature(callback)
return chord(callback.clone((arg,)) for arg in iterable).delay()
Now, to try this out, let's spin up a celery worker and use docker to for rabbitmq and redis
$ docker run -p 5672:5672 rabbitmq
$ docker run -p 6379:6379 redis
$ celery -A chaintasks:app worker -E
Now, let's try to illustrate what I need:
items = ["item1", "item2", "item3", "item4"]
print(select_items.s(items).apply_async().get())
This outputs: ['item3'], great.
meta_task1 = select_items.s(items) | dynamic_map_group.s(process_item.s())
meta_task1_result = meta_task1.apply_async().get()
print(meta_task1_result)
It gets trickier here: the output is [['b916f5d3-4367-46c5-896f-f2d2e2e59a00', None], [[['3f78d31b-e374-484e-b05b-40c7a2038081', None], None], [['bce50d49-466a-43e9-b6ad-1b78b973b50f', None], None]]].
I am not sure what the Nones, mean, but I guess the UUIDs represents the subtasks of my meta task. But how do I get their results from here?
I tried:
subtasks_ids = [i[0][0] for i in meta_task1_result[1]]
processed_items = [app.AsyncResult(i).get() for i in subtasks_ids]
but this just hangs. What am I doing wrong here? I'm expecting an output looking like ['processed_item1', 'processed_item3']
What about trying to chord the output of select_items to group_items?
meta_task2 = select_items.s(items) | dynamic_map_chord.s(group_items.s())
print(meta_task2.apply_async().get())
Ouch, this raises an AttributeError that I don't really understand.
Traceback (most recent call last):
File "chaintasks.py", line 70, in <module>
print(meta_task2.apply_async().get())
File "/site-packages/celery/result.py", line 223, in get
return self.backend.wait_for_pending(
File "/site-packages/celery/backends/asynchronous.py", line 201, in wait_for_pending
return result.maybe_throw(callback=callback, propagate=propagate)
File "/site-packages/celery/result.py", line 335, in maybe_throw
self.throw(value, self._to_remote_traceback(tb))
File "/site-packages/celery/result.py", line 328, in throw
self.on_ready.throw(*args, **kwargs)
File "/site-packages/vine/promises.py", line 234, in throw
reraise(type(exc), exc, tb)
File "/site-packages/vine/utils.py", line 30, in reraise
raise value
AttributeError: 'NoneType' object has no attribute 'clone'
Process finished with exit code 1
And finally, what I really am trying to do is something like:
ultimate_meta_task = (
select_items.s(items)
| dynamic_map_group.s(process_item.s())
| dynamic_map_chord.s(group_items.s())
)
print(ultimate_meta_task.apply_async().get())
but this leads to:
Traceback (most recent call last):
File "chaintasks.py", line 77, in <module>
print(ultimate_meta_task.apply_async().get())
File "/site-packages/celery/result.py", line 223, in get
return self.backend.wait_for_pending(
File "/site-packages/celery/backends/asynchronous.py", line 199, in wait_for_pending
for _ in self._wait_for_pending(result, **kwargs):
File "/site-packages/celery/backends/asynchronous.py", line 265, in _wait_for_pending
for _ in self.drain_events_until(
File "/site-packages/celery/backends/asynchronous.py", line 58, in drain_events_until
on_interval()
File "/site-packages/vine/promises.py", line 160, in __call__
return self.throw()
File "/site-packages/vine/promises.py", line 157, in __call__
retval = fun(*final_args, **final_kwargs)
File "/site-packages/celery/result.py", line 236, in _maybe_reraise_parent_error
node.maybe_throw()
File "/site-packages/celery/result.py", line 335, in maybe_throw
self.throw(value, self._to_remote_traceback(tb))
File "/site-packages/celery/result.py", line 328, in throw
self.on_ready.throw(*args, **kwargs)
File "/site-packages/vine/promises.py", line 234, in throw
reraise(type(exc), exc, tb)
File "/site-packages/vine/utils.py", line 30, in reraise
raise value
kombu.exceptions.EncodeError: TypeError('Object of type GroupResult is not JSON serializable')
Process finished with exit code 1
What I try to achieve here, is to get an output looking like 'processed_item1:processed_item3'. Is this even doable using celery? Any help here is appreciated.

Apache Beam ReadFromSpanner decoding issue

I'm trying to run the following script in a GCP Dataflow pipeline.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from typing import NamedTuple, Optional
from apache_beam.io.gcp.spanner import *
from past.builtins import unicode
import logging
class ItemRow(NamedTuple):
item_id: unicode
class LogResults(beam.DoFn):
"""Just log the results"""
def process(self, element):
logging.info("row: %s", element)
yield element
class SpannerToSpannerAndBigQueryPipelineOptions(PipelineOptions):
"""
Runtime Parameters given during template execution
path parameter is necessary for execution of pipeline
"""
#classmethod
def _add_argparse_args(cls, parser):
parser.add_argument(
'--SOURCE_SPANNER_PROJECT_ID', type=str, help='Source Spanner project ID',
default='project_id')
parser.add_argument(
'--SOURCE_SPANNER_DATASET_ID', type=str, help='Source Spanner dataset ID',
default='dataset_id')
parser.add_argument(
'--SOURCE_SPANNER_INSTANCE_ID', type=str, help='Source Spanner instance ID',
default='instance_id')
parser.add_argument(
'--SOURCE_QUERY', type=str, help='SQL to run in Source Spanner Instance',
required=True)
# Setup pipeline
def run():
beam.coders.registry.register_coder(ItemRow, beam.coders.RowCoder)
pipeline_options = PipelineOptions()
p = beam.Pipeline(options=pipeline_options)
importer_options = pipeline_options.view_as(
SpannerToSpannerAndBigQueryPipelineOptions)
rows = (
p
| "Read from source Spanner" >> ReadFromSpanner(
project_id=importer_options.SOURCE_SPANNER_PROJECT_ID,
instance_id=importer_options.SOURCE_SPANNER_INSTANCE_ID,
database_id=importer_options.SOURCE_SPANNER_DATASET_ID,
row_type=ItemRow,
sql='Select item_id from Items WHERE created_ts BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 SECOND) AND CURRENT_TIMESTAMP()',
timestamp_bound_mode=TimestampBoundMode.MAX_STALENESS,
staleness=3,
time_unit=TimeUnit.HOURS,
).with_output_types(ItemRow)
)
rows | 'Log results' >> beam.ParDo(LogResults())
result = p.run()
result.wait_until_finish()
if __name__ == "__main__":
run()
However, I've been running into issues for decoding the results obtained from Spanner. These are the output logs from my Dataflow job:
"An exception was raised when trying to execute the workitem 6665479626992209510 : Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 649, in do_work
work_executor.execute()
File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 179, in execute
op.start()
File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
File "dataflow_worker/native_operations.py", line 48, in dataflow_worker.native_operations.NativeReadOperation.start
File "/usr/local/lib/python3.7/site-packages/dataflow_worker/inmemory.py", line 108, in __iter__
yield self._source.coder.decode(value)
File "/usr/local/lib/python3.7/site-packages/apache_beam/coders/coders.py", line 468, in decode
return self.get_impl().decode(encoded)
File "apache_beam/coders/coder_impl.py", line 226, in apache_beam.coders.coder_impl.StreamCoderImpl.decode
File "apache_beam/coders/coder_impl.py", line 228, in apache_beam.coders.coder_impl.StreamCoderImpl.decode
File "apache_beam/coders/coder_impl.py", line 123, in apache_beam.coders.coder_impl.CoderImpl.decode_from_stream
File "/usr/local/lib/python3.7/site-packages/apache_beam/coders/row_coder.py", line 215, in decode_from_stream
is_null in zip(self.components, nulls)))
File "/usr/local/lib/python3.7/site-packages/apache_beam/coders/row_coder.py", line 215, in <genexpr>
is_null in zip(self.components, nulls)))
File "apache_beam/coders/coder_impl.py", line 259, in apache_beam.coders.coder_impl.CallbackCoderImpl.decode_from_stream
File "apache_beam/coders/coder_impl.py", line 261, in apache_beam.coders.coder_impl.CallbackCoderImpl.decode_from_stream
File "/usr/local/lib/python3.7/site-packages/apache_beam/coders/coders.py", line 414, in decode
return value.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x83 in position 9: invalid start byte
"
I'm unsure as to how to solve this problem. I'm using this https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.io.gcp.spanner.html?highlight=spanner#module-apache_beam.io.gcp.spanner example as a starting point. The issue appears to be in decoding the results obtained from Spanner. There is little to no documentation on how to specify the schema for the Spanner table/tables I'm trying to query.
There is also an experimental IO module for Spanner which does not use the Java expansion module. Is it recommended to switch to the experimental version?
Thanks
I could not run a pipeline using the apache_beam.io.gcp.spanner module so I ended up using the apache_beam.io.gcp.experimental.spannerio module instead.

Hyperopt mongotrials issue with Pickle: AttributeError: 'module' object has no attribute

I'm trying to use Hyperopt parallel search with MongoDB, and encountered some issues with Mongotrials, which have been discussed here. I've tried all their methods, and I am still unable to find solutions to my specific problem. The specific model I'm trying to minimize is RadomForestRegressor from sklearn.
I've followed this tutorial. And I'm able to print out the calculated "fmin" with no issue.
Here are my steps so far:
1) Activate a virtual environment called "tensorflow" (I've installed all my libraries there)
2) Start MongoDB:
(tensorflow) bash-3.2$ mongod --dbpath . --port 1234 --directoryperdb --journal --nohttpinterface
3) Initiate workers:
(tensorflow) bash-3.2$ hyperopt-mongo-worker --mongo=localhost:1234/foo_db --poll-interval=0.1
4) Run my python code, and my python code is as follows:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials
from hyperopt.mongoexp import MongoTrials
# Preprocessing data
train_xg = pd.read_csv('train.csv')
n_train = len(train_xg)
print "Whole data set size: ", n_train
# Creating columns for features, and categorical features
features_col = [x for x in train_xg.columns if x not in ['id', 'loss', 'log_loss']]
cat_features_col = [x for x in train_xg.select_dtypes(include=['object']).columns if x not in ['id', 'loss', 'log_loss']]
for c in range(len(cat_features_col)):
train_xg[cat_features_col[c]] = train_xg[cat_features_col[c]].astype('category').cat.codes
# Use this to train random forest regressor
train_xg_x = np.array(train_xg[features_col])
train_xg_y = np.array(train_xg['loss'])
space_rf = { 'min_samples_leaf': hp.choice('min_samples_leaf', range(1,100)) }
trials = MongoTrials('mongo://localhost:1234/foo_db/jobs', exp_key='exp1')
def minMe(params):
# Hyperopt tuning for hyperparameters
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from hyperopt import STATUS_OK
try:
import dill as pickle
print('Went with dill')
except ImportError:
import pickle
def hyperopt_rf(params):
rf = RandomForestRegressor(**params)
return cross_val_score(rf, train_xg_x, train_xg_y).mean()
acc = hyperopt_rf(params)
print 'new acc:', acc, 'params: ', params
return {'loss': -acc, 'status': STATUS_OK}
best = fmin(fn=minMe, space=space_rf, trials=trials, algo=tpe.suggest, max_evals=100)
print "Best: ", best
5) After I run the above Python code, I get the following errors:
INFO:hyperopt.mongoexp:Error while unpickling. Try installing dill via "pip install dill" for enhanced pickling support.
INFO:hyperopt.mongoexp:job exception: 'module' object has no attribute 'minMe'
Traceback (most recent call last):
File "/Users/WernerChao/tensorflow/bin/hyperopt-mongo-worker", line 6, in <module>
sys.exit(hyperopt.mongoexp.main_worker())
File "/Users/WernerChao/tensorflow/lib/python2.7/site-packages/hyperopt/mongoexp.py", line 1302, in main_worker
return main_worker_helper(options, args)
File "/Users/WernerChao/tensorflow/lib/python2.7/site-packages/hyperopt/mongoexp.py", line 1249, in main_worker_helper
mworker.run_one(reserve_timeout=float(options.reserve_timeout))
File "/Users/WernerChao/tensorflow/lib/python2.7/site-packages/hyperopt/mongoexp.py", line 1064, in run_one
domain = pickle.loads(blob)
AttributeError: 'module' object has no attribute 'minMe'
INFO:hyperopt.mongoexp:PROTOCOL mongo
INFO:hyperopt.mongoexp:USERNAME None
INFO:hyperopt.mongoexp:HOSTNAME localhost
INFO:hyperopt.mongoexp:PORT 1234
INFO:hyperopt.mongoexp:PATH /foo_db/jobs
INFO:hyperopt.mongoexp:DB foo_db
INFO:hyperopt.mongoexp:COLLECTION jobs
INFO:hyperopt.mongoexp:PASS None
INFO:hyperopt.mongoexp:Error while unpickling. Try installing dill via "pip install dill" for enhanced pickling support.
INFO:hyperopt.mongoexp:job exception: 'module' object has no attribute 'minMe'
Traceback (most recent call last):
File "/Users/WernerChao/tensorflow/bin/hyperopt-mongo-worker", line 6, in <module>
sys.exit(hyperopt.mongoexp.main_worker())
File "/Users/WernerChao/tensorflow/lib/python2.7/site-packages/hyperopt/mongoexp.py", line 1302, in main_worker
return main_worker_helper(options, args)
File "/Users/WernerChao/tensorflow/lib/python2.7/site-packages/hyperopt/mongoexp.py", line 1249, in main_worker_helper
mworker.run_one(reserve_timeout=float(options.reserve_timeout))
File "/Users/WernerChao/tensorflow/lib/python2.7/site-packages/hyperopt/mongoexp.py", line 1064, in run_one
domain = pickle.loads(blob)
AttributeError: 'module' object has no attribute 'minMe'
INFO:hyperopt.mongoexp:PROTOCOL mongo
INFO:hyperopt.mongoexp:USERNAME None
INFO:hyperopt.mongoexp:HOSTNAME localhost
INFO:hyperopt.mongoexp:PORT 1234
INFO:hyperopt.mongoexp:PATH /foo_db/jobs
INFO:hyperopt.mongoexp:DB foo_db
INFO:hyperopt.mongoexp:COLLECTION jobs
INFO:hyperopt.mongoexp:PASS None
INFO:hyperopt.mongoexp:Error while unpickling. Try installing dill via "pip install dill" for enhanced pickling support.
INFO:hyperopt.mongoexp:job exception: 'module' object has no attribute 'minMe'
Traceback (most recent call last):
File "/Users/WernerChao/tensorflow/bin/hyperopt-mongo-worker", line 6, in <module>
sys.exit(hyperopt.mongoexp.main_worker())
File "/Users/WernerChao/tensorflow/lib/python2.7/site-packages/hyperopt/mongoexp.py", line 1302, in main_worker
return main_worker_helper(options, args)
File "/Users/WernerChao/tensorflow/lib/python2.7/site-packages/hyperopt/mongoexp.py", line 1249, in main_worker_helper
mworker.run_one(reserve_timeout=float(options.reserve_timeout))
File "/Users/WernerChao/tensorflow/lib/python2.7/site-packages/hyperopt/mongoexp.py", line 1064, in run_one
domain = pickle.loads(blob)
AttributeError: 'module' object has no attribute 'minMe'
INFO:hyperopt.mongoexp:PROTOCOL mongo
INFO:hyperopt.mongoexp:USERNAME None
INFO:hyperopt.mongoexp:HOSTNAME localhost
INFO:hyperopt.mongoexp:PORT 1234
INFO:hyperopt.mongoexp:PATH /foo_db/jobs
INFO:hyperopt.mongoexp:DB foo_db
INFO:hyperopt.mongoexp:COLLECTION jobs
INFO:hyperopt.mongoexp:PASS None
INFO:hyperopt.mongoexp:no job found, sleeping for 0.7s
INFO:hyperopt.mongoexp:Error while unpickling. Try installing dill via "pip install dill" for enhanced pickling support.
INFO:hyperopt.mongoexp:job exception: 'module' object has no attribute 'minMe'
Traceback (most recent call last):
File "/Users/WernerChao/tensorflow/bin/hyperopt-mongo-worker", line 6, in <module>
sys.exit(hyperopt.mongoexp.main_worker())
File "/Users/WernerChao/tensorflow/lib/python2.7/site-packages/hyperopt/mongoexp.py", line 1302, in main_worker
return main_worker_helper(options, args)
File "/Users/WernerChao/tensorflow/lib/python2.7/site-packages/hyperopt/mongoexp.py", line 1249, in main_worker_helper
mworker.run_one(reserve_timeout=float(options.reserve_timeout))
File "/Users/WernerChao/tensorflow/lib/python2.7/site-packages/hyperopt/mongoexp.py", line 1064, in run_one
domain = pickle.loads(blob)
AttributeError: 'module' object has no attribute 'minMe'
INFO:hyperopt.mongoexp:exiting with N=9223372036854775803 after 4 consecutive exceptions
6) Then Mongo workers would shut off.
Things I've tried:
install "dill" as the error suggested -> didn't work
Put global imports into the objective function so it can pickle -> didn't work
Put try except with "dill" or "pickle" as import -> didn't work
Does anyone have similar issues? I'm running out of ideas to try, and have been working on this for 2 days in vain. I think I am missing something really simple here, just can't seem to find it.
What am I missing?
Any suggestion is welcomed please!
Had the same problem in python 3.5. Installing Dill didn't help, nor dir setting workdir in MongoTrials or hyperopt-mongo-worker cli. hyperopt-mongo-worker doesn't seem to have access to __main__ where the function was defined:
AttributeError: Can't get attribute 'minMe' on <module '__main__' from ...hyperopt-mongo-worker
As #jaikumarm suggested, I circumvented the problem by writing a module file with all the required functions. However, instead of soft-linking it into the bin directory, I extended the PYTHONPATH before running hyperopt-mongo-worker:
export PYTHONPATH="${PYTHONPATH}:<dir_with_the_module.py>"
hyperopt-mongo-worker ...
That way, the hyperopt-monogo-worker is able to import the module containing minMe.
I fought with this for several days before coming up with a workable solution. there are two problems:
1. the mongo worker spawns off a separate process to run the optimizer so any context from your original python file is lost and unavailable for this new process.
2. the imports on this new process happen in the context of the hyperopt-mongo-worker scipy, which is in your case will be /Users/WernerChao/tensorflow/bin/.
So my solution is to make this new optimizer function completely self sufficient
optimizer.py
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
# Preprocessing data
train_xg = pd.read_csv('train.csv')
n_train = len(train_xg)
print "Whole data set size: ", n_train
# Creating columns for features, and categorical features
features_col = [x for x in train_xg.columns if x not in ['id', 'loss', 'log_loss']]
cat_features_col = [x for x in train_xg.select_dtypes(include=['object']).columns if x not in ['id', 'loss', 'log_loss']]
for c in range(len(cat_features_col)):
train_xg[cat_features_col[c]] = train_xg[cat_features_col[c]].astype('category').cat.codes
# Use this to train random forest regressor
train_xg_x = np.array(train_xg[features_col])
train_xg_y = np.array(train_xg['loss'])
def minMe(params):
# Hyperopt tuning for hyperparameters
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from hyperopt import STATUS_OK
try:
import dill as pickle
print('Went with dill')
except ImportError:
import pickle
def hyperopt_rf(params):
rf = RandomForestRegressor(**params)
return cross_val_score(rf, train_xg_x, train_xg_y).mean()
acc = hyperopt_rf(params)
print 'new acc:', acc, 'params: ', params
return {'loss': -acc, 'status': STATUS_OK}
wrapper.py
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials
from hyperopt.mongoexp import MongoTrials
import optimizer
space_rf = { 'min_samples_leaf': hp.choice('min_samples_leaf', range(1,100)) }
best = fmin(fn=optimizer.minMe, space=space_rf, trials=trials, algo=tpe.suggest, max_evals=100)
print "Best: ", best
trials = MongoTrials('mongo://localhost:1234/foo_db/jobs', exp_key='exp1')
Once you have this code link the optimizer.py to the bin folder
ln -s /Users/WernerChao/Git/test/optimizer.py /Users/WernerChao/tensorflow/bin/
now run the wrapper.py and then the mongo worker it should be able to import the optimizer from its local context and run the minMe function.
Try to install Dill in the Python environment of your tensorflow (or possibly the worker):
/Users/WernerChao/tensorflow/lib/python2.7/site-packages/hyperopt
Your aim is to get rid of the hyperopt error message:
hyperopt.mongoexp:Error while unpickling. Try installing dill via "pip install dill" for enhanced pickling support.
This is because the Python by default cannot marshal a function. It requires dill library to extend Python's pickling module for serialising/de-serialising Python objects. In your case, it failed to serialise your function minMe().
I made a separate file which calculates the loss and copied it to /anaconda2/bin/
and
/anaconda2/lib/python2.7/site-packages/hyperopt
it is working fine.
This was my Traceback
Traceback (most recent call last):
File "/home/greatskull/anaconda2/bin/hyperopt-mongo-worker", line 6, in <module>
sys.exit(hyperopt.mongoexp.main_worker())
File "/home/greatskull/anaconda2/lib/python2.7/site-packages/hyperopt/mongoexp.py", line 1302, in main_worker
return main_worker_helper(options, args)
File "/home/greatskull/anaconda2/lib/python2.7/site-packages/hyperopt/mongoexp.py", line 1249, in main_worker_helper
mworker.run_one(reserve_timeout=float(options.reserve_timeout))
File "/home/greatskull/anaconda2/lib/python2.7/site-packages/hyperopt/mongoexp.py", line 1073, in run_one
with temp_dir(workdir, erase_created_workdir), working_dir(workdir):
File "/home/greatskull/anaconda2/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/home/greatskull/anaconda2/lib/python2.7/site-packages/hyperopt/utils.py", line 229, in temp_dir
os.makedirs(dir)
File "/home/greatskull/anaconda2/lib/python2.7/os.py", line 150, in makedirs
makedirs(head, mode)
File "/home/greatskull/anaconda2/lib/python2.7/os.py", line 157, in makedirs
mkdir(name, mode)

Issue with multiline TokenInputTransformer in IPython extension

I am trying to use a TokenInputTransformer within an IPython extension module, but it seems that there is something wrong with the standard implementation of token transformers with multiline input. Consider the following minimal extension:
from IPython.core.inputtransformer import TokenInputTransformer
#TokenInputTransformer.wrap
def test_transformer(tokens):
return tokens
def load_ipython_extension(ip):
for s in (ip.input_splitter, ip.input_transformer_manager):
s.python_line_transforms.extend([test_transformer()])
print "Test activated"
When I load the extension in IPython 1.1.0 I get a non-handled exception with multiline input:
In [1]: %load_ext test
Test activated
In [2]: abs(
...: 2
...: )
Traceback (most recent call last):
File "/Applications/anaconda/bin/ipython", line 6, in <module>
sys.exit(start_ipython())
File "/Applications/anaconda/lib/python2.7/site-packages/IPython/__init__.py", line 118, in start_ipython
return launch_new_instance(argv=argv, **kwargs)
File "/Applications/anaconda/lib/python2.7/site-packages/IPython/config/application.py", line 545, in launch_instance
app.start()
File "/Applications/anaconda/lib/python2.7/site-packages/IPython/terminal/ipapp.py", line 362, in start
self.shell.mainloop()
File "/Applications/anaconda/lib/python2.7/site-packages/IPython/terminal/interactiveshell.py", line 436, in mainloop
self.interact(display_banner=display_banner)
File "/Applications/anaconda/lib/python2.7/site-packages/IPython/terminal/interactiveshell.py", line 548, in interact
self.input_splitter.push(line)
File "/Applications/anaconda/lib/python2.7/site-packages/IPython/core/inputsplitter.py", line 620, in push
out = self.push_line(line)
File "/Applications/anaconda/lib/python2.7/site-packages/IPython/core/inputsplitter.py", line 655, in push_line
line = transformer.push(line)
File "/Applications/anaconda/lib/python2.7/site-packages/IPython/core/inputtransformer.py", line 152, in push
return self.output(tokens)
File "/Applications/anaconda/lib/python2.7/site-packages/IPython/core/inputtransformer.py", line 157, in output
return untokenize(self.func(tokens)).rstrip('\n')
File "/Applications/anaconda/lib/python2.7/site-packages/IPython/utils/_tokenize_py2.py", line 276, in untokenize
return ut.untokenize(iterable)
File "/Applications/anaconda/lib/python2.7/site-packages/IPython/utils/_tokenize_py2.py", line 214, in untokenize
self.add_whitespace(start)
File "/Applications/anaconda/lib/python2.7/site-packages/IPython/utils/_tokenize_py2.py", line 199, in add_whitespace
assert row >= self.prev_row
AssertionError
If you suspect this is an IPython bug, please report it at:
https://github.com/ipython/ipython/issues
or send an email to the mailing list at ipython-dev#scipy.org
You can print a more detailed traceback right now with "%tb", or use "%debug"
to interactively debug it.
Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
%config Application.verbose_crash=True
Am I doing something wrong or is it really an IPython bug?
It is really an IPython bug, I think. Specifically, the way we handle tokenize fails when an expression involving brackets (()[]{}) is spread over more than one line. I'm trying to work out what we can do about this.
Kinda late answer but, I was trying to use it in my own extension and just had the same problem. I've solved it by simply removing NL from the list (it's not the same as NEWLINE token which end statement), NL token only appears inside [], (), {} so it should be safely removable.
from tokenize import NL
#TokenInputTransformer.wrap
def mat_transformer(tokens):
tokens = list(filter(lambda t: t.type != NL, tokens))
return tokens
If you are looking for full example, I've posted my goofy code there: https://github.com/Quinzel/pymat