How to deploy a Google Dataflow worker with a file loaded into memory? - streaming

I am trying to deploy a Google Dataflow streaming job for use in my machine learning streaming pipeline, but cannot seem to deploy the worker with a file already loaded into memory. Currently, I have set up the job to pull a pickle file from a GCS bucket, load it into memory, and use it for model prediction. But this is executed on every cycle of the job, i.e. it pulls from GCS every time a new object enters the Dataflow pipeline, meaning the current execution of the pipeline is much slower than it needs to be.
What I really need is a way to allocate a variable within the worker nodes on setup of each worker, and then use that variable within the pipeline without having to re-load it on every execution of the pipeline.
Is there a way to do this step before the job is deployed, something like:
with open('model.pkl', 'rb') as file:
    pickle_model = pickle.load(file)
But within my setup.py file?
##### based on - https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/juliaset/setup.py
"""Setup.py module for the workflow's worker utilities.

All the workflow related code is gathered in a package that will be built as a
source distribution, staged in the staging area for the workflow being run and
then installed in the workers when they start running.

This behavior is triggered by specifying the --setup_file command line option
when running the workflow for remote execution.
"""
# pytype: skip-file

from __future__ import absolute_import
from __future__ import print_function

import subprocess
from distutils.command.build import build as _build  # type: ignore

import setuptools

# Note: a single list, so that all three installs run. Three separate
# assignments would overwrite each other and only the last one would execute.
CUSTOM_COMMANDS = [
    ['pip', 'install', 'scikit-learn==0.23.1'],
    ['pip', 'install', 'google-cloud-storage'],
    ['pip', 'install', 'mlxtend'],
]


# This class handles the pip install mechanism.
class build(_build):  # pylint: disable=invalid-name
  """A build command class that will be invoked during package install.

  The package built using the current setup.py will be staged and later
  installed in the worker using `pip install package'. This class will be
  instantiated during install for this specific scenario and will trigger
  running the custom commands specified.
  """
  sub_commands = _build.sub_commands + [('CustomCommands', None)]


class CustomCommands(setuptools.Command):
  """A setuptools Command class able to run arbitrary commands."""

  def initialize_options(self):
    pass

  def finalize_options(self):
    pass

  def RunCustomCommand(self, command_list):
    print('Running command: %s' % command_list)
    p = subprocess.Popen(
        command_list,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT)
    # Can use communicate(input='y\n'.encode()) if the command run requires
    # some confirmation.
    stdout_data, _ = p.communicate()
    print('Command output: %s' % stdout_data)
    if p.returncode != 0:
      raise RuntimeError(
          'Command %s failed: exit code: %s' % (command_list, p.returncode))

  def run(self):
    for command in CUSTOM_COMMANDS:
      self.RunCustomCommand(command)


REQUIRED_PACKAGES = [
    'google-cloud-storage',
    'mlxtend',
    'scikit-learn==0.23.1',
]

setuptools.setup(
    name='ML pipeline',
    version='0.0.1',
    description='ML set workflow package.',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    cmdclass={
        'build': build,
        'CustomCommands': CustomCommands,
    })
Snippet of current ML load mechanism:
class MlModel(beam.DoFn):
    def __init__(self):
        self._model = None
        from google.cloud import storage
        import pandas as pd
        import pickle as pkl
        self._storage = storage
        self._pkl = pkl
        self._pd = pd

    def process(self, element):
        if self._model is None:
            bucket = self._storage.Client().get_bucket(myBucket)
            blob = bucket.get_blob(myBlob)
            self._model = self._pkl.loads(blob.download_as_string())
        new_df = self._pd.read_json(element, orient='records').iloc[:, 3:-1]
        predict = self._model.predict(new_df)
        df = self._pd.DataFrame(data=predict, columns=["A", "B"])
        A = df.iloc[0]['A']
        B = df.iloc[0]['B']
        d = {'A': A, 'B': B}
        return [d]

You can use the setup method in your MlModel DoFn, where you can load your model once and then use it in your process method. The setup method is called once when the worker initializes the DoFn instance.
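For illustration, here is a minimal sketch of that pattern, reusing the bucket/blob placeholders (myBucket, myBlob) from the snippet above:

import pickle

import apache_beam as beam


class MlModel(beam.DoFn):
    def setup(self):
        # setup() runs when the worker initializes this DoFn instance, so the
        # GCS download happens once instead of on every incoming element.
        from google.cloud import storage
        bucket = storage.Client().get_bucket(myBucket)  # placeholder from the question
        blob = bucket.get_blob(myBlob)                  # placeholder from the question
        self._model = pickle.loads(blob.download_as_string())

    def process(self, element):
        import pandas as pd
        new_df = pd.read_json(element, orient='records').iloc[:, 3:-1]
        predict = self._model.predict(new_df)
        df = pd.DataFrame(data=predict, columns=["A", "B"])
        yield {'A': df.iloc[0]['A'], 'B': df.iloc[0]['B']}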
I had written a similar answer here
HTH

Related

Unit testing in Databricks notebooks

The following code is intended to run unit tests in Databricks notebooks, using pytest.
import pytest
import os
import sys
repo_name = "Databricks-Code-Repo"
# Get the path to this notebook, for example "/Workspace/Repos/{username}/{repo-name}".
notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
# Get the repo's root directory name.
repo_root = os.path.dirname(os.path.dirname(notebook_path))
# Prepare to run pytest from the repo.
os.chdir(f"/Workspace/{repo_root}/{repo_name}")
print(os.getcwd())
# Skip writing pyc files on a readonly filesystem.
sys.dont_write_bytecode = True
# Run pytest.
retcode = pytest.main([".", "-v", "-p", "no:cacheprovider"])
# Fail the cell execution if there are any test failures.
assert retcode == 0, "The pytest invocation failed. See the log for details."
This code snippet is from the guide provided by Databricks.
However, it produces the following error:
PermissionError: [Errno 1] Operation not permitted: '/Workspace//Repos/<email_address>/Databricks-Code-Repo/Databricks-Code-Repo'
This notebook is inside Databricks Repos. I have two other notebooks:
functions (where I have defined three data transformation functions);
test_functions (where I have defined a test function for each of the data transformation functions from the previous notebook).
I get that the error has something to do with permissions, but I can't figure out what is causing it. I would appreciate any suggestions.
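One hedged diagnostic sketch, assuming the test notebook sits at the repo root: print the pieces that build the chdir target before running pytest, since the failing path shows the repo name appended twice.

import os

# Diagnostic only, not a verified fix: inspect how the chdir target is assembled.
notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
repo_dir = os.path.dirname(notebook_path)  # assumes this notebook sits at the repo root
print("notebook_path:", notebook_path)
print("chdir target :", f"/Workspace{repo_dir}")
os.chdir(f"/Workspace{repo_dir}")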

How to stop Locust Load Test after all users complete their task?

According to the Locust documentation, we can only stop a task using self.interrupt(), but that just hands execution back to the parent class; it does not stop the load test. I want to stop the complete load test after all users log in and complete their tasks.
Locust Version: 1.1
class RegisteredUser(User):
    @task
    class Forum(TaskSet):
        @task(5)
        def view_thread(self):
            pass

        @task(1)
        def stop(self):
            self.interrupt()

    @task
    def frontpage(self):
        pass
You can call self.environment.runner.quit() to stop the whole run.
More info: https://docs.locust.io/en/stable/writing-a-locustfile.html#environment-attribute
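A minimal hedged sketch of where that call can live (the user class and host below are illustrative, not from the question):

from locust import HttpUser, task

class RegisteredUser(HttpUser):
    host = 'hosturl'  # illustrative placeholder

    @task
    def finish(self):
        # Stops the entire load test, not just the current TaskSet.
        self.environment.runner.quit()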
My framework creates a list of tuples of the credentials and support variables for every user. I have stored all my user credentials, tokens, support file names, etc. in those tuples as part of the list (this is actually done automatically before starting Locust).
I import that list in the locustfile:
# creds is created before running the locustfile and can be stored outside of,
# or as part of, the locustfile
creds = [('demo_user1', 'pass1', 'lnla'),
         ('demo_user2', 'pass2', 'taam9'),
         ('demo_user3', 'pass3', 'wevee'),
         ('demo_user4', 'pass4', 'avwew')]

class RegisteredUser(SequentialTaskSet):
    def on_start(self):
        self.credentials = creds.pop()

    @task
    def task_one_name(self):
        task_one_commands

    @task
    def task_two_name(self):
        task_two_commands

    @task
    def stop(self):
        if len(creds) == 0:
            self.user.environment.reached_end = True
            self.user.environment.runner.quit()

class ApiUser(HttpUser):
    tasks = [RegisteredUser]
    host = 'hosturl'
I use self.credentials in the tasks, and I created the stop task in my class. Also, note that RegisteredUser inherits from SequentialTaskSet so that all tasks run in sequence.

ModuleNotFoundError: No module named 'pyspark.dbutils' while running multiple .py files/notebooks on job clusters in Databricks

I am working in a TravisCI, MLflow and Databricks environment, where .travis.yml sits on the git master branch and detects any change in a .py file; whenever one gets updated, it runs an MLflow command to run the .py file in the Databricks environment.
My MLProject file looks as follows:
name: mercury_cltv_lib
conda_env: conda-env.yml
entry_points:
  main:
    command: "python3 run-multiple-notebooks.py"
The workflow is as follows:
TravisCI detects a change in the master branch --> triggers a build, which runs the MLflow command and spins up a job cluster in Databricks to run the .py file from the repo.
It worked fine with one .py file, but when I tried to run multiple notebooks using dbutils, it throws:
File "run-multiple-notebooks.py", line 3, in <module>
from pyspark.dbutils import DBUtils
ModuleNotFoundError: No module named 'pyspark.dbutils'
Please find below the relevant code section from run-multiple-notebooks.py
def get_spark_session():
    from pyspark.sql import SparkSession
    return SparkSession.builder.getOrCreate()

def get_dbutils(self, spark = None):
    try:
        if spark == None:
            spark = spark
        from pyspark.dbutils import DBUtils  # error line
        dbutils = DBUtils(spark)  # error line
    except ImportError:
        import IPython
        dbutils = IPython.get_ipython().user_ns["dbutils"]
    return dbutils

def submitNotebook(notebook):
    print("Running notebook %s" % notebook.path)
    spark = get_spark_session()
    dbutils = get_dbutils(spark)
I tried all the options, including https://stackoverflow.com/questions/61546680/modulenotfounderror-no-module-named-pyspark-dbutils, but it is not working :(
Can someone please suggest whether there is a fix for the above-mentioned error while running a .py file on a job cluster? My code works fine inside a Databricks notebook, but running it from outside using TravisCI and MLflow isn't working, which is a must-have requirement for pipeline automation.
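One thing worth checking against the traceback: it points at a module-level import on line 3 of run-multiple-notebooks.py, while pyspark.dbutils only exists in the Databricks runtime. A hedged sketch of keeping that import confined to a guarded helper, using the same fallback as the question's code:

def get_dbutils(spark):
    # Sketch only: keep the Databricks-specific import inside the function so that
    # merely importing this module does not fail where pyspark.dbutils is unavailable.
    try:
        from pyspark.dbutils import DBUtils  # present only in the Databricks runtime
        return DBUtils(spark)
    except ImportError:
        import IPython
        return IPython.get_ipython().user_ns["dbutils"]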

How come the same .py file runs in one PyCharm project but not another, while both projects have the same module installed

Recently, I installed pandas_profiling for a particular project I created in the PyCharm IDE. It worked after I updated 'External tools' in Settings. Realising a similar need in another project, I installed pandas_profiling in the …\venv\Scripts path for that project as well, and did a similar update of External tools in the new project. Yet the console keeps telling me that it cannot detect the module. Both projects have the pandas_profiling package files in their 'site-packages' and 'venv' directories when I check. Any ideas out there? Thx for your kind support.
from pathlib import Path
import pandas as pd
import numpy as np
import requests
import pandas_profiling

if __name__ == "__main__":
    file_name = Path("C:\\Users\…..csv")
    if not file_name.exists():
        data = requests.get(
            "C:\\Users\…..csv"
        )
        file_name.write_bytes(data.content)
    df = pd.read_csv(file_name)
    df["Feature_1"] = pd.to_datetime(df["Feature_1"], errors="coerce")
    # Example: Constant variable
    # df["source"] = "name of org"
    # Example: Boolean variable
    df["boolean"] = np.random.choice([True, False], df.shape[0])
    # Example: Mixed with base types
    df["mixed"] = np.random.choice([1, "A"], df.shape[0])
    # Example: Highly correlated variables
    df["Feature_2"] = df["Feature_2"] + np.random.normal(scale=5, size=(len(df)))
    # Example: Duplicate observations
    duplicates_to_add = pd.DataFrame(df.iloc[0:10])
    duplicates_to_add[u"Feature_1"] = duplicates_to_add[u"Feature_1"]
    df = df.append(duplicates_to_add, ignore_index=True)
    profile = df.profile_report(
        title="Report", correlation_overrides=["recclass"]
    )
    profile.to_file(output_file=Path("C:\\Users.....html"))
Response from the console in the new project (while it works in the existing project):
Traceback (most recent call last):
File "C:/Users/.../PycharmProjects/.../Pandas_Profiling_2.py", line 8, in <module>
import pandas_profiling
ModuleNotFoundError: No module named 'pandas_profiling'
Process finished with exit code 1
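A quick, generic diagnostic (not specific to PyCharm settings) to confirm which interpreter the failing run uses and whether it can see the package:

import sys
import importlib.util

# Which interpreter is executing this run/console?
print(sys.executable)

# Can that interpreter locate pandas_profiling? None means it is not installed there.
print(importlib.util.find_spec("pandas_profiling"))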

How to load objects in memory and share across different executions of Celery worker?

I have set up Celery + RabbitMQ on a 3-machine cluster. I have also created a task which generates a regular expression based on data from a file and uses that information to parse text. However, I would like the file to be read only once when a worker is spawned, and not on every execution of a task.
from celery import Celery

celery = Celery('tasks', broker='amqp://localhost//')

import re

@celery.task
def add(x, y):
    return x + y

def get_regular_expression():
    with open("text") as fp:
        data = fp.readlines()
    str_re = "|".join([x.split()[2] for x in data])
    return str_re

@celery.task
def analyse_json(tw):
    str_re = get_regular_expression()
    re.match(str_re, tw.text)
In the above code, I would like to open the file and read the output into the string only once per worker, and then the task analyse_json should just use the string.
Any help will be appreciated,
thanks,
Amit
Put the call to get_regular_expression at the module level:
str_re = get_regular_expression()

@celery.task
def analyse_json(tw):
    re.match(str_re, tw.text)
It will only be called once, when the module is first imported.
Additionally, if you must have only one instance of your worker running at a time (for example CUDA), you have to use the -P solo option:
celery worker --pool solo
Works with celery 4.4.2.
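For reference, a self-contained sketch that puts the two answers together (same names as above; it assumes the "text" file is present and readable when the worker imports the module):

import re

from celery import Celery

celery = Celery('tasks', broker='amqp://localhost//')


def get_regular_expression():
    with open("text") as fp:
        data = fp.readlines()
    return "|".join([x.split()[2] for x in data])


# Runs once per worker process, when the module is first imported.
str_re = get_regular_expression()


@celery.task
def analyse_json(tw):
    return re.match(str_re, tw.text)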