How come the same .py file runs in one PyCharm project but not another, when both projects have the same module installed (pandas-profiling)?

Recently, I installed pandas_profiling for a particular project I created in the PyCharm IDE. It worked after I updated 'External tools' in Settings. Needing the same functionality in another project, I installed pandas_profiling into that project's …\venv\Scripts path as well, and made the same update to External tools there. Yet the console keeps telling me that it cannot find the module. When I check, both projects have the pandas_profiling package files in their 'site-packages' and 'venv' directories. Any ideas? Thanks for your kind support.
from pathlib import Path

import pandas as pd
import numpy as np
import requests

import pandas_profiling

if __name__ == "__main__":
    file_name = Path("C:\\Users\…..csv")
    if not file_name.exists():
        data = requests.get(
            "C:\\Users\…..csv"
        )
        file_name.write_bytes(data.content)

    df = pd.read_csv(file_name)
    df["Feature_1"] = pd.to_datetime(df["Feature_1"], errors="coerce")

    # Example: Constant variable
    # df["source"] = "name of org"

    # Example: Boolean variable
    df["boolean"] = np.random.choice([True, False], df.shape[0])

    # Example: Mixed with base types
    df["mixed"] = np.random.choice([1, "A"], df.shape[0])

    # Example: Highly correlated variables
    df["Feature_2"] = df["Feature_2"] + np.random.normal(scale=5, size=(len(df)))

    # Example: Duplicate observations
    duplicates_to_add = pd.DataFrame(df.iloc[0:10])
    duplicates_to_add[u"Feature_1"] = duplicates_to_add[u"Feature_1"]
    df = df.append(duplicates_to_add, ignore_index=True)

    profile = df.profile_report(
        title="Report", correlation_overrides=["recclass"]
    )
    profile.to_file(output_file=Path("C:\\Users.....html"))
Console output in the new project (the same file runs fine in the existing project):
Traceback (most recent call last):
File "C:/Users/.../PycharmProjects/.../Pandas_Profiling_2.py", line 8, in <module>
import pandas_profiling
ModuleNotFoundError: No module named 'pandas_profiling'
Process finished with exit code 1
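A minimal diagnostic sketch (not from the original post) that can be run in each project to compare which interpreter and search path the run configuration actually uses:

import sys

print(sys.executable)  # the interpreter this run configuration actually invokes
print(sys.path)        # the directories searched when importing modules

If sys.executable in the failing project does not point into its …\venv, the run configuration is using a different interpreter than the one pandas_profiling was installed into.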

Related

How to deploy a Google Dataflow worker with a file loaded into memory?

I am trying to deploy a Google Dataflow streaming job for use in my machine learning streaming pipeline, but cannot seem to deploy the worker with a file already loaded into memory. Currently, I have set up the job to pull a pickle file from a GCS bucket, load it into memory, and use it for model prediction. But this is executed on every cycle of the job, i.e. it pulls from GCS every time a new object enters the pipeline, meaning that the current execution of the pipeline is much slower than it needs to be.
What I really need is a way to allocate a variable within the worker nodes on setup of each worker, and then use that variable within the pipeline without having to re-load it on every execution of the pipeline.
Is there a way to do this step before the job is deployed, something like
with open('model.pkl', 'rb') as file:
    pickle_model = pickle.load(file)
But within my setup.py file?
##### based on - https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/juliaset/setup.py
"""Setup.py module for the workflow's worker utilities.

All the workflow related code is gathered in a package that will be built as a
source distribution, staged in the staging area for the workflow being run and
then installed in the workers when they start running.

This behavior is triggered by specifying the --setup_file command line option
when running the workflow for remote execution.
"""
# pytype: skip-file

from __future__ import absolute_import
from __future__ import print_function

import subprocess
from distutils.command.build import build as _build  # type: ignore

import setuptools


# This class handles the pip install mechanism.
class build(_build):  # pylint: disable=invalid-name
    """A build command class that will be invoked during package install.

    The package built using the current setup.py will be staged and later
    installed in the worker using `pip install package'. This class will be
    instantiated during install for this specific scenario and will trigger
    running the custom commands specified.
    """
    sub_commands = _build.sub_commands + [('CustomCommands', None)]


# Commands to run on the workers at package-install time.
CUSTOM_COMMANDS = [
    ['pip', 'install', 'scikit-learn==0.23.1'],
    ['pip', 'install', 'google-cloud-storage'],
    ['pip', 'install', 'mlxtend'],
]


class CustomCommands(setuptools.Command):
    """A setuptools Command class able to run arbitrary commands."""

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def RunCustomCommand(self, command_list):
        print('Running command: %s' % command_list)
        p = subprocess.Popen(
            command_list,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT)
        # Can use communicate(input='y\n'.encode()) if the command run requires
        # some confirmation.
        stdout_data, _ = p.communicate()
        print('Command output: %s' % stdout_data)
        if p.returncode != 0:
            raise RuntimeError(
                'Command %s failed: exit code: %s' % (command_list, p.returncode))

    def run(self):
        for command in CUSTOM_COMMANDS:
            self.RunCustomCommand(command)


REQUIRED_PACKAGES = [
    'google-cloud-storage',
    'mlxtend',
    'scikit-learn==0.23.1',
]

setuptools.setup(
    name='ML pipeline',
    version='0.0.1',
    description='ML set workflow package.',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    cmdclass={
        'build': build,
        'CustomCommands': CustomCommands,
    })
Snippet of the current model-loading mechanism:
class MlModel(beam.DoFn):
    def __init__(self):
        self._model = None
        from google.cloud import storage
        import pandas as pd
        import pickle as pkl
        self._storage = storage
        self._pkl = pkl
        self._pd = pd

    def process(self, element):
        if self._model is None:
            bucket = self._storage.Client().get_bucket(myBucket)
            blob = bucket.get_blob(myBlob)
            self._model = self._pkl.loads(blob.download_as_string())

        new_df = self._pd.read_json(element, orient='records').iloc[:, 3:-1]
        predict = self._model.predict(new_df)
        df = self._pd.DataFrame(data=predict, columns=["A", "B"])
        A = df.iloc[0]['A']
        B = df.iloc[0]['B']
        d = {'A': A, 'B': B}
        return [d]
You can use the setup method in your MlModel DoFn, where you can load your model once and then use it in your process method. The setup method is called once per worker initialization.
I had written a similar answer here.
HTH
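A minimal sketch of that approach, reusing the myBucket and myBlob placeholders from the question (DoFn.setup is available in recent Apache Beam releases; treat this as a sketch rather than a drop-in implementation):

import apache_beam as beam

class MlModel(beam.DoFn):
    def setup(self):
        # Called once when the worker initializes the DoFn, so the model
        # is downloaded and unpickled a single time, not once per element.
        from google.cloud import storage
        import pickle as pkl
        bucket = storage.Client().get_bucket(myBucket)  # placeholder from the question
        blob = bucket.get_blob(myBlob)                  # placeholder from the question
        self._model = pkl.loads(blob.download_as_string())

    def process(self, element):
        import pandas as pd
        new_df = pd.read_json(element, orient='records').iloc[:, 3:-1]
        predict = self._model.predict(new_df)
        df = pd.DataFrame(data=predict, columns=["A", "B"])
        yield {'A': df.iloc[0]['A'], 'B': df.iloc[0]['B']}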

How to solve numpy import error when calling Anaconda env from Matlab

I want to execute a Python script from Matlab (on a Windows 7 machine). The libraries necessary are installed in an Anaconda virtual environment. When running the script from command line, it runs flawlessly.
When calling the script from Matlab as follows:
[status, commandOut] = system('C:/Users/user/AppData/Local/Continuum/anaconda3/envs/tf/python.exe test.py');
or with shell commands, I get an Import Error:
commandOut =
'Traceback (most recent call last):
File "C:\Users\user\AppData\Local\Continuum\anaconda3\envs\tf\lib\site-packages\numpy\core\__init__.py", line 16, in <module>
from . import multiarray
ImportError: DLL load failed: The specified path is invalid.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "test.py", line 2, in <module>
import numpy as np
File "C:\Users\user\AppData\Local\Continuum\anaconda3\envs\tf\lib\site-packages\numpy\__init__.py", line 142, in <module>
from . import add_newdocs
File "C:\Users\user\AppData\Local\Continuum\anaconda3\envs\tf\lib\site-packages\numpy\add_newdocs.py", line 13, in <module>
from numpy.lib import add_newdoc
File "C:\Users\user\AppData\Local\Continuum\anaconda3\envs\tf\lib\site-packages\numpy\lib\__init__.py", line 8, in <module>
from .type_check import *
File "C:\Users\user\AppData\Local\Continuum\anaconda3\envs\tf\lib\site-packages\numpy\lib\type_check.py", line 11, in <module>
import numpy.core.numeric as _nx
File "C:\Users\user\AppData\Local\Continuum\anaconda3\envs\tf\lib\site-packages\numpy\core\__init__.py", line 26, in <module>
raise ImportError(msg)
ImportError:
Importing the multiarray numpy extension module failed. Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try `git clean -xdf` (removes all
files not under version control). Otherwise reinstall numpy.
Original error was: DLL load failed: The specified path is invalid.
I already changed the default Matlab Python version to the Anaconda env, but no change:
version: '3.5'
executable: 'C:\Users\user\AppData\Local\Continuum\anaconda3\envs\tf\python.exe'
library: 'C:\Users\user\AppData\Local\Continuum\anaconda3\envs\tf\python35.dll'
home: 'C:\Users\user\AppData\Local\Continuum\anaconda3\envs\tf'
isloaded: 1
Just running my test script without importing numpy works. Reloading numpy (py.importlib.import_module('numpy');) didn't help either; it threw the same error as before.
Does anyone have an idea how to fix this?
So after corresponding with Matlab support, I found out that Matlab depends on the path environment variable (whose entries are deliberately not set when using a virtual environment), and therefore numpy fails to find the necessary paths when called from within Matlab, even if the call contains the path to the virtual environment.
The solution is either to call Matlab from within the virtual environment (via the command line) or to add the missing paths manually to the path environment variable.
Maybe this information can help someone else.
First Method
You can change the python interpreter with:
pyversion("/home/nibalysc/Programs/anaconda3/bin/python");
And check it with:
pyversion();
You could also do this in a startup.m file in your project folder; every time you start MATLAB from this folder, the python interpreter will be changed automatically.
Now you can try to use:
py.importlib.import_module('numpy');
Read the documentation on how to use the integrated python in MATLAB:
Call user defined custom module
Call modified python module
Alternative Method
An alternative method would be to create a matlab_shell.sh file with the following content; this is basically the code that anaconda's installer appends to .bashrc when you let it modify that file:
#!/bin/bash
# >>> conda init >>>
__conda_setup="$(CONDA_REPORT_ERRORS=false '$HOME/path/to/anaconda3/bin/conda' shell.bash hook 2> /dev/null)"
if [ $? -eq 0 ]; then
    \eval "$__conda_setup"
else
    if [ -f "$HOME/path/to/anaconda3/etc/profile.d/conda.sh" ]; then
        CONDA_CHANGEPS1=false conda activate base
    else
        \export PATH="$HOME/path/to/anaconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda init <<<

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('$HOME/path/to/anaconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "$HOME/path/to/anaconda3/etc/profile.d/conda.sh" ]; then
        . "$HOME/path/to/anaconda3/etc/profile.d/conda.sh"
    else
        export PATH="$HOME/path/to/anaconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

conda activate base
# MATLAB calls this script as `matlab_shell.sh -c "command"`, so $2 holds the
# command passed to system(); run it inside the activated environment.
eval $2
Then you need to set the MATLAB_SHELL environment variable, either before running MATLAB or in MATLAB itself. In my opinion the best approach is to do this in the startup.m file as well, like this:
setenv("MATLAB_SHELL", "/path/to/matlab_shell.sh");
Afterwards you can use the system(...) function to run the conda python with all your modules installed, like this:
String notation:
system("python -c ""python code goes here""");
Char notation:
system('python -c "python code goes here"');
Hope this helps!
Firstly, if you execute your Python script like a regular system command ([status, commandOut] = system('...python.exe test.py')),
then pyversion (and pyenv, since R2019b) has no effect at all. It only matters if you use the py. integration, as in the code below (and, in most cases, that is a far better approach).
Currently (I use R2019b update 5) there are a number of pitfalls that might cause issues similar to yours. I'd recommend starting with the following:
Create a new clean conda environment:
conda create -n test_py36 python=3.6 numpy
Create the following dummy demo1.py:
def dummy_py_method(x):
    return x + 1
Create the following run_py_code.m:
function run_py_code()
% explicit module import sometimes shows more detailed error messages
py.importlib.import_module('numpy');
% reload in case there were any changes:
pymodule = py.importlib.import_module('demo1');
py.importlib.reload(pymodule);
% passing data back and forth
x = rand([3 3]);
x_np = py.numpy.array(x);
y_np = pymodule.dummy_py_method(x_np);
y = double(y_np);
disp(y - x);
Create the following before_first_run.m:
setenv('PYTHONUNBUFFERED','1');
setenv('path',['C:\Users\username\Anaconda3\envs\test_py36\Library\bin;'...
getenv('path')]);
pe=pyenv('Version','C:\users\username\Anaconda3\envs\test_py36\pythonw.exe',...
'ExecutionMode','InProcess'...
);
% add "demo1.py" to path
py_file_path = 'W:\tests\Matlab\python_demos\call_pycode\pycode';
if count(py.sys.path,py_file_path) == 0
insert(py.sys.path,int32(0),py_file_path);
end
Run the before_first_run.m first and run the run_py_code.m next.
Notes:
As already mentioned in this answer, one key point is to add the folder containing the necessary dll files to the %PATH% before starting python. This can be achieved with setenv from within Matlab. Usually, Library\bin is what should be added.
It might be a good idea to try a clean officially-supported CPython distribution (e.g. CPython 3.6.8). Only install numpy (python -m pip install numpy). In my experience, the setenv step is not necessary in this case.
For me, OutOfProcess mode proved to be buggy. Thus, I'd recommend explicitly setting InProcess mode (for versions before R2019b the OutOfProcess option is not present, and neither is pyenv).
Do not concatenate the two .m files above into one - the py.importlib statements seem to be pre-executed and thus conflict with pyenv.

make packages installed in virtualenv visible to sphinx

I am using sphinx to document my software, and I am using a virtualenv for the installation. Some packages are only installed in the virtual environment, and sphinx does not see them.
I have this code in my conf.py:
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
import os
import sys

p = os.path.abspath('..')
sys.path.insert(0, p)

if 'VIRTUAL_ENV' in os.environ:
    q = os.sep.join([os.environ['VIRTUAL_ENV'],
                     'lib', 'python2.7', 'site-packages'])
    sys.path.insert(0, q)
    p = p + ":" + q

os.environ['PYTHONPATH'] = p
yet when I run make html, I get warnings like this:
/home/mario/Local/github/Bauble/bauble.classic/doc/api.rst:358: WARNING: autodoc: failed to import class u'TagItemGUI' from module u'bauble.plugins.tag'; the following exception was raised:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/sphinx/ext/autodoc.py", line 385, in import_object
__import__(self.modname)
File "/home/mario/Local/github/Bauble/bauble.classic/bauble/plugins/tag/__init__.py", line 30, in <module>
from sqlalchemy import *
ImportError: No module named sqlalchemy
my $VIRTUAL_ENV/lib/python2.7/site-packages contains SQLAlchemy-1.0.4-py2.7-linux-x86_64.egg.
This is definitely related to the question Sphinx autodoc dies on ImportError of third party package, but the procedure I chose to follow is described behind a broken link.
The problem is that packages are not directly included in the virtualenv's site-packages dir; you need to specify the full path to be able to import a package from there. I use the following hack:
import glob
import os
import sys

if 'VIRTUAL_ENV' in os.environ:
    site_packages_glob = os.sep.join([
        os.environ['VIRTUAL_ENV'],
        'lib', 'python2.7', 'site-packages', 'projectname-*py2.7.egg'])
    site_packages = glob.glob(site_packages_glob)[-1]
    sys.path.insert(0, site_packages)
Where projectname is the name of the python module I would like to import. Note that this is error prone, especially when you have multiple versions of the module, but so far it works for me.
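If the packages are installed flat in site-packages rather than as eggs, a version-agnostic variant of the same conf.py hack might look like this sketch (assuming the docs build runs under the same Python version as the virtualenv):

import os
import sys

if 'VIRTUAL_ENV' in os.environ:
    # derive the site-packages directory from the interpreter running sphinx
    ver = 'python%d.%d' % sys.version_info[:2]
    site_packages = os.path.join(os.environ['VIRTUAL_ENV'],
                                 'lib', ver, 'site-packages')
    sys.path.insert(0, site_packages)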

fails to import module when using matlabdomain

While trying to use the sphinx matlab domain, I can't get the MWE provided on the extension's PyPI site to work.
There is always this Can't import module error. I'd guess that the extension generates a kind of pseudo module from the m-code, but up to now I could not figure out how this mechanism works.
The dir structure looks like this
root
|-- test_data
|   |-- MyHandleClass.m
|
|-- doc
    |-- conf.py
    |-- Makefile
    |-- index.rst
The files MyHandleClass.m and index.rst contain the example code given on the package site, and conf.py starts like this:
import sys, os
sys.path.append(os.path.abspath('.'))
sys.path.append(os.path.abspath('./test_data'))
# -- General configuration -----------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be extensions
# coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
extensions = [
    "sphinxcontrib.matlab",
    "sphinx.ext.autosummary",
    "sphinx.ext.autodoc"]
autodoc_default_flags = ['members','show-inheritance','undoc-members']
autoclass_content = 'both'
mathjax_path = 'http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=default'
# The suffix of source filenames.
source_suffix = '.rst'
# The encoding of source files.
#source_encoding = 'utf-8'
# The master toctree document.
master_doc = 'index'
Error msg
WARNING: autodoc: failed to import module u'test_data'; the following exception was raised:
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\sphinx\ext\autodoc.py", line 335, in import_object
__import__(self.modname)
ImportError: No module named test_data
E:\ME\doc\index.rst:13: WARNING: don't know which module to import for autodocumenting u'MyHandleClass' (try placing a "module" or "currentmodule" directive in the document, or giving an explicit module name)
After varying this and that, I'm hoping somebody out there has a clue?
Thanks for trying the matlabdomain sphinxcontrib extension. In order to use Sphinx to document MATLAB m-files, you need to add matlab_src_dir to conf.py, as described in the Configuration section of the documentation. This is because the Python interpreter can't import a MATLAB m-file. Therefore you should not add your MATLAB root to the Python sys.path, or you will get the error you received. Instead, set matlab_src_dir to the path containing the folder of the MATLAB project which you want to document.
Given your file structure, in order to document test_data use a conf.py with the following:
import os

# NOTE: don't add MATLAB m-files to `sys.path`
#sys.path.insert(0, os.path.abspath('.'))
# instead add them to `matlab_src_dir`
matlab_src_dir = os.path.abspath('..')  # MATLAB
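For reference, a minimal conf.py along these lines should cover the example layout; primary_domain = 'mat' is described in the extension's documentation, but treat the exact option names as assumptions to verify against the installed version:

import os

extensions = [
    "sphinxcontrib.matlab",  # the MATLAB domain
    "sphinx.ext.autodoc",
]
# point the extension at the folder *containing* test_data, not at test_data itself
matlab_src_dir = os.path.abspath('..')
# lets index.rst use plain autodoc directives for MATLAB objects
primary_domain = 'mat'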
Hope that does it! Please feel free to ask any more questions. I'm happy to help!

IPython Notebook: Open/select file with GUI (Qt Dialog)

When you perform the same analysis in a notebook on different data files, it may be handy to graphically select a data file.
In my python scripts I usually implement a QT dialog that returns the file-name of the selected file:
from PySide import QtCore, QtGui

def gui_fname(dir=None):
    """Select a file via a dialog and return the file name."""
    if dir is None:
        dir = './'
    fname = QtGui.QFileDialog.getOpenFileName(None, "Select data file...",
                                              dir, filter="All files (*);; SM Files (*.sm)")
    return fname[0]
However, running this function from a notebook
full_fname = gui_fname()
causes the kernel to die (and restart):
Interestingly, putting these 3 commands in 3 separate cells works:
%matplotlib qt
full_fname = gui_fname()
%matplotlib inline
but when I put those commands in one single cell the kernel dies again.
This prevents creating a function like gui_fname_ipynb() that transparently allows selecting a file with a GUI.
For convenience, I created a notebook illustrating the problem:
Open/select file with GUI (Qt Dialog)
Any suggestion on how to execute a dialog for file selection from within an IPython Notebook?
Using Anaconda 5.0.0 on Windows (Python 3.6.2, IPython 6.1.0), the following two options are both working for me.
OPTION 1: Entirely in a Jupyter notebook:
CELL 1:
%gui qt

from PyQt5.QtWidgets import QFileDialog

def gui_fname(dir=None):
    """Select a file via a dialog and return the file name."""
    if dir is None:
        dir = './'
    fname = QFileDialog.getOpenFileName(None, "Select data file...",
                                        dir, filter="All files (*);; SM Files (*.sm)")
    return fname[0]
CELL 2:
gui_fname()
This is working for me but it seems a bit...fragile. If I combine these two things into the same cell, it crashes. Or if I omit the %gui qt, it crashes. If I "restart kernel and run all cells", it doesn't work. So I kinda like this other option...
MORE RELIABLE OPTION: Separate script that opens dialog box in a new process
(Based on mkrog code here.)
PUT THE FOLLOWING IN A SEPARATE PYTHON SCRIPT CALLED blah.py:
from sys import executable, argv
from subprocess import check_output
from PyQt5.QtWidgets import QFileDialog, QApplication

def gui_fname(directory='./'):
    """Open a file dialog, starting in the given directory, and return
    the chosen filename"""
    # run this exact file in a separate process, and grab the result
    file = check_output([executable, __file__, directory])
    return file.strip()

if __name__ == "__main__":
    directory = argv[1]
    app = QApplication([directory])
    fname = QFileDialog.getOpenFileName(None, "Select a file...",
                                        directory, filter="All files (*)")
    print(fname[0])
...AND IN YOUR JUPYTER NOTEBOOK
import blah
blah.gui_fname()
I have a universal code that does its job without any problem. Here is my suggestion:
try:
    # Python 3
    from tkinter import Tk
    from tkinter import filedialog
except ImportError:
    # Python 2
    from Tkinter import Tk
    import tkFileDialog as filedialog

Tk().withdraw()  # we don't want a full GUI, so keep the root window from appearing
filenames = filedialog.askopenfilenames()  # show an "Open" dialog box and return the paths of the selected files
print(filenames)
hope it can be useful
This behaviour was a bug in IPython:
https://github.com/ipython/ipython/issues/4997
that was fixed here:
https://github.com/ipython/ipython/pull/5077
The function to open a GUI dialog should work on current master and on the upcoming 2.0 release.
To date, the last 1.x version (1.2.1) does not include a backport of the fix.
EDIT: The example code still crashes IPython 2.x, see this issue.