Merging spans is very slow when using the spaCy model "en_core_web_md"

Thanks to the spaCy team for providing such a nice library. I have a question: merging a span is very slow (about 200 ms) when using the model "en_core_web_md". The code is as follows:
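import spacy
import time
nlp = spacy.load("en_core_web_md")
doc = nlp("I need to speak with one of your technicians")
st = time.time()
with doc.retokenize() as retokenizer:
    # Assumption: the merged span is missing from the post; doc[0:2] is only illustrative.
    retokenizer.merge(doc[0:2])
print(time.time() - st)
print([w.text for w in doc])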
This is fast (about 12 ms) when using the model "en_core_web_sm" or "en_core_web_lg".
The environment:
spacy 2.3.2
en-core-web-md 2.3.1
Python 3.8.6
Ubuntu 16.04.7 LTS
I tried adding:
doc.tensor = None
before retokenizing, as suggested here, but it did not work. Can anyone help? Thanks in advance.
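A single time.time() measurement can be noisy, so when comparing models it may help to repeat the measurement and keep the fastest run. The following is a minimal sketch using only the standard library; best_of and stand_in_workload are hypothetical names, and the workload is a placeholder for the retokenize block, not spaCy code.

```python
import time


def best_of(fn, repeats=5):
    """Return the fastest wall-clock time (in seconds) over `repeats` calls of fn."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def stand_in_workload():
    # Hypothetical placeholder for the block being measured.
    return sum(i * i for i in range(10_000))


elapsed = best_of(stand_in_workload)
print(f"best of 5: {elapsed * 1000:.3f} ms")
```

Because a merge changes the Doc it is applied to, each repeat would need its own freshly parsed Doc, created outside the timed call.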


No module named 'spacy' in PySpark

I am attempting to perform entity extraction using a custom spaCy NER model. The extraction will be done over a Spark DataFrame, and everything is orchestrated in a Dataproc cluster (using a Jupyter Notebook, available in the "Workbench"). The code I am using looks as follows:
import pandas as pd
import numpy as np
import time
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, pandas_udf
from pyspark.sql.types import ArrayType, StringType
spark = SparkSession.builder.appName('SpacyOverPySpark') \
    .getOrCreate()
def load_spacy_model():
    import spacy
    print("Loading spacy model...")
    return spacy.load("./spacy_model")  # This model exists locally
@pandas_udf(ArrayType(StringType()))
def entities(list_of_text: pd.Series) -> pd.Series:
    # retrieving the shared nlp object
    nlp = broadcasted_nlp.value
    # batch processing our list of text
    docs = nlp.pipe(list_of_text)
    # entity extraction (`ents` is a list[list[str]])
    ents = [
        [ent.text for ent in doc.ents]
        for doc in docs
    ]
    return pd.Series(ents)
pdf = pd.DataFrame(
    [
        "Pyhton and Pandas are very important for Automation",
        "Tony Stark is a Electrical Engineer",
        "Pipe welding is a very dangerous task in Oil mining",
        "Nursing is often underwhelmed, but it's very interesting",
        "Software Engineering now opens a lot of doors for you",
        "Civil Engineering can get exiting, as you travel very often",
        "I am a Java Programmer, and I think I'm quite good at what I do",
        "Diane is never bored of doing the same thing all day",
        "My father is a Doctor, and he supports people in condition of poverty",
        "A janitor is required as soon as possible"
    ],
    columns=['postings']
)
sdf = spark.createDataFrame(pdf)
# loading spaCy model and broadcasting it
broadcasted_nlp = spark.sparkContext.broadcast(load_spacy_model())
# Extracting entities
df_new = sdf.withColumn('skills', entities('postings'))
# Displaying results
df_new.show(truncate=20)
The error I am getting looks similar to this one, but that answer does not apply to my case, because it deals with "executing a PySpark job in Yarn", which is different (or so I think; feel free to correct me). I have also found this, but the answer is rather vague. To be honest, the only thing I have done to "restart the Spark session" is to run spark.stop() in the last cell of my Jupyter Notebook and then run the cells above again (feel free to correct me here too).
The code was heavily inspired by "Answer 2 of 2" in this forum, which makes me wonder whether some missing setting is still eluding me (BTW, "Answer 1 of 2" was already tested but did not work). My specific software versions can be found here.
Thank you.
Because some queries or hints generated in the comment section can be lengthy, I have decided to include them here:
No. 1: "Which command did you use to create your cluster?": I used this method, so the command was not visible in plain sight. However, I have just realized that when you are about to create the cluster, there is an "EQUIVALENT COMMAND LINE" button that shows the command.
In my case, the Dataproc cluster creation code (automatically generated by GCP) is:
gcloud dataproc clusters create my-cluster \
--enable-component-gateway \
--region us-central1 \
--zone us-central1-c \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 500 \
--num-workers 2 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 500 \
--image-version 2.0-debian10 \
--optional-components JUPYTER \
--metadata PIP_PACKAGES=spacy==3.2.1 \
--project hidden-project-name
Notice how spaCy is installed through the metadata (following these recommendations). However, running pip freeze | grep spacy right after the Dataproc cluster creation does not display any result (i.e., spaCy does NOT get installed successfully). To install it, the official method is used afterwards.
No. 2: "Wrong path as possible cause": Not my case. It actually looks similar to this case (even though I can't say the root cause is the same for both):
Running which python shows /opt/conda/miniconda3/bin/python as result.
Running which spacy (read "Clarification No. 1") shows /opt/conda/miniconda3/bin/spacy as result.
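The which checks above only describe the shell's view. A minimal sketch of the same check from inside Python follows; describe_module is a hypothetical helper name, and it uses only standard-library calls to report which interpreter is running and whether that interpreter can locate a given module.

```python
import importlib.util
import sys


def describe_module(name):
    """Report the running interpreter and where it would import `name` from."""
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,
        "module": name,
        "found": spec is not None,
        "location": spec.origin if spec is not None else None,
    }


print(describe_module("spacy"))
```

Running the same function in a notebook cell and inside a UDF would show whether the driver and the executors resolve the module from the same environment.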
While replicating your case in Jupyter Notebook, I encountered the same error.
I used this pip command and it worked:
pip install -U spacy
But after installing, I got a JAVA_HOME is not set error, so I used these commands:
conda install openjdk
conda install -c conda-forge findspark
!python3 -m spacy download en_core_web_sm
I just included it in case you might also encounter it.
Here is the output:
Note: I used spacy.load("en_core_web_sm").
I managed to solve this issue by combining two pieces of information:
"Configure Dataproc Python environment", "Dataproc image version 2.0" (as that is the version I am using): available here (special thanks to @Dagang in the comment section).
"Create a (Dataproc) cluster": available here.
In specific, during the Dataproc cluster setup via Google Console, I "installed" spaCy by doing:
And when the cluster was already created, I ran the code mentioned in my original post (NO modifications) with the following result:
That solves my original question. I am planning to apply my solution to a larger dataset, but whatever happens there is the subject of a different thread.

How can I run pytesseract / tesseract in Foundry Code Repositories?

I am trying to use the function image_to_string from the library pytesseract in a repository to perform OCR of PDFs. However, I am getting the following error:
From the checks I would assume the library was loaded correctly:
Does anyone have an idea how to troubleshoot this?
It seems that Foundry is not running the environment activation script that sets the TESSDATA_PREFIX environment variable automatically. However, we can infer the value manually and provide it to the pytesseract API calls.
Define the following helper function:
def _get_tessdata_directory_path():
    import sys
    from pathlib import Path
    env_root = Path(sys.executable).parent.parent
    share_dir = env_root / 'share' / 'tessdata'
    assert share_dir.exists(), 'tessdata directory does not exist in <envroot>/share/tessdata'
    return str(share_dir)
and use it as shown in the following snippet:
tessdata_dir_config = f'--tessdata-dir "{_get_tessdata_directory_path()}"'
pytesseract.image_to_string(image, ..., config=tessdata_dir_config)
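As a variation on the helper above, here is a minimal sketch that raises an explicit exception instead of using assert (assertions are skipped when Python runs with -O) and returns the finished config string. The name tessdata_config and the env_root parameter are assumptions added for illustration; the <envroot>/share/tessdata layout is the one described in the answer.

```python
import sys
from pathlib import Path


def tessdata_config(env_root=None):
    """Build the --tessdata-dir option for pytesseract from the environment root."""
    # Default to the root of the environment that owns the running interpreter.
    root = Path(env_root) if env_root is not None else Path(sys.executable).parent.parent
    share_dir = root / "share" / "tessdata"
    if not share_dir.is_dir():
        raise FileNotFoundError(f"tessdata directory not found at {share_dir}")
    return f'--tessdata-dir "{share_dir}"'


# Usage (not executed here):
# pytesseract.image_to_string(image, config=tessdata_config())
```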

MATLAB — Unable to Import cv2 Library

I'm a beginner at OpenCV, and trying to run an open-source program.
I currently have the Computer Vision Toolbox OpenCV Interface 20.1.0 installed and Computer Vision Toolbox 9.2.
I cannot run this simple open-source feature matching algorithm without encountering errors.
import cv2
import matplotlib.pyplot as plt
%matplotlib inline
% read images
img1 = cv2.imread('[INSERT PATH #1]');
img2 = cv2.imread('[INSERT PATH #2]');
img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY);
img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY);
sift = cv2.xfeatures2d.SIFT_create();
keypoints_1, descriptors_1 = sift.detectAndCompute(img1,None);
keypoints_2, descriptors_2 = sift.detectAndCompute(img2,None);
len(keypoints_1), len(keypoints_2)
The following message is returned:
Error: File: Keypoints.m Line: 1 Column: 8
The import statement 'import cv2' cannot be found or cannot be imported. Imported names must end with '.*' or be
fully qualified.
However, when I remove Line 1, I instead get the following error.
Error: File: Keypoints.m Line: 2 Column: 8
The import statement 'import matplotlib.pyplot' cannot be found or cannot be imported. Imported names must end
with '.*' or be fully qualified.
Finally, following the error message only results in a sequence of further errors from the cv2 library. Any ideas?
That's because the code you've used isn't MATLAB code; it's Python code.
As per the website you've linked:
From within Matlab
The parallel implementation coded in Matlab can be run by using the surf_find_keypoints() function. The output keypoints can be sorted by strength using surf_best_n_keypoints(), and plotted using surf_plot_keypoints().
Check that you've downloaded the correct files and try again.
Furthermore, the MATLAB OpenCV Interface is designed to integrate C++ OpenCV code, not Python. Documentation here.
Yes, it is correct that this is Python code. I would recommend checking your dependencies/libraries. The PyCharm IDE is what I personally use since it takes care of all the libraries easily.
If you do end up trying out PyCharm click on the red icon when hovering on CV2. It’ll then give you a prompt to download the library.
Using Python some setup can be done. Using pip:
Install opencv-python
pip install opencv-python
Install opencv-contrib-python
pip install opencv-contrib-python
Unfortunately, there is an issue with the SIFT feature, since by default it is excluded from newer free versions of OpenCV.
sift = cv2.xfeatures2d.SIFT_create() not working even though have contrib installed
import cv2
Image_1 = cv2.imread("Image_1.png", cv2.IMREAD_COLOR)
Image_2 = cv2.imread("Image_2.jpg", cv2.IMREAD_COLOR)
Image_1 = cv2.cvtColor(Image_1, cv2.COLOR_BGR2GRAY)
Image_2 = cv2.cvtColor(Image_2, cv2.COLOR_BGR2GRAY)
sift = cv2.SIFT_create()
keypoints_1, descriptors_1 = sift.detectAndCompute(Image_1,None)
keypoints_2, descriptors_2 = sift.detectAndCompute(Image_2,None)
len(keypoints_1), len(keypoints_2)
The error I received:
"/Users/michael/Documents/PYTHON/Test Folder/venv/bin/python" "/Users/michael/Documents/PYTHON/Test Folder/"
Traceback (most recent call last):
  File "/Users/michael/Documents/PYTHON/Test Folder/", line 9, in <module>
    sift = cv2.SIFT_create()
AttributeError: module 'cv2.cv2' has no attribute 'SIFT_create'
Process finished with exit code 1
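Where SIFT lives depends on the installed OpenCV build: cv2.SIFT_create is in the main module from OpenCV 4.4.0 onward, while older builds expose it, if at all, under cv2.xfeatures2d from the contrib package. A minimal sketch that tries both locations follows; create_sift is a hypothetical helper, and it takes the module as an argument so the lookup logic can be read on its own.

```python
def create_sift(cv2_module):
    """Return a SIFT detector from whichever location this OpenCV build provides."""
    # Newer builds: SIFT is part of the main module.
    factory = getattr(cv2_module, "SIFT_create", None)
    if factory is None:
        # Older builds: SIFT may be available through the contrib module.
        contrib = getattr(cv2_module, "xfeatures2d", None)
        factory = getattr(contrib, "SIFT_create", None)
    if factory is None:
        raise RuntimeError("This OpenCV build does not expose SIFT_create")
    return factory()


# Usage (not executed here):
# import cv2
# sift = create_sift(cv2)
```

This sketch does not handle builds where the contrib function exists but raises an error when called; upgrading opencv-python and opencv-contrib-python to matching recent versions is likely the simpler route.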

How Can I use my GPU on Ipython Notebook?

OS : Ubuntu 14.04LTS
Language : Python Anaconda 2.7 (keras, theano)
GPU : GTX980Ti
I want to run Keras Python code in an IPython Notebook using my GPU (GTX980Ti), but I can't find out how.
I want to test the code below. When I run it from the Ubuntu terminal, I use the commands below (it uses the GPU well, without any problem).
First, I set the paths as below:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Second, I run the code as below:
THEANO_FLAGS='floatX=float32,device=gpu0,nvcc.fastmath=True' python
And it runs well.
But when I run the code in PyCharm (a Python IDE) or in an IPython Notebook, it doesn't use the GPU; it only uses the CPU. The code is as below.
from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time
vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')
To solve it, I forced the code to use the GPU as below (inserting two more lines):
import theano.sandbox.cuda
Then it generates the error below:
ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.
How can I do this? I have spent two days on it.
I have also tried using a '.theanorc' file in my home directory.
I'm using Theano in an IPython notebook, making use of my system's GPU. This configuration works fine on my system (MacBook Pro with GTX 750M).
My ~/.theanorc file:
cnmem = True
floatX = float32
device = gpu0
Various environment variables (I use a virtual environment, macvnev):
echo $PATH
How I run the IPython notebook (for me, the device is gpu0):
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu0,floatX=float32 ipython notebook
Output of nvcc -V:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Thu_Sep_24_00:26:39_CDT_2015
Cuda compilation tools, release 7.5, V7.5.19
From your post, you have probably set the $PATH variable incorrectly.
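A notebook kernel or an IDE started from the desktop usually does not see export lines typed into a terminal, which would explain why nvcc is found in one place and not the other. A minimal sketch of one workaround follows: extend PATH from inside Python before Theano is imported. The directory /usr/local/cuda/bin comes from the question; prepend_to_path is a hypothetical helper name.

```python
import os
import shutil


def prepend_to_path(directory, environ=None):
    """Put `directory` at the front of PATH unless it is already listed."""
    environ = os.environ if environ is None else environ
    parts = [p for p in environ.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.insert(0, directory)
    environ["PATH"] = os.pathsep.join(parts)
    return environ["PATH"]


# Run this in the first notebook cell, before `import theano`.
prepend_to_path("/usr/local/cuda/bin")
print("nvcc found at:", shutil.which("nvcc"))
```

If shutil.which still prints None, the CUDA toolkit is probably installed in a different directory than the one assumed here.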

How to start IPython Notebook with a specified namespace

I have a GUI-based (TraitsUI/PyQt/Envisage) application written in Python. I would like to spawn an IPython Notebook in which I expose a small API and a number of objects. Those objects include a SQLAlchemy session and a bunch of SQLAlchemy models.
I've looked a lot, but I can't find any examples of this. I can start a notebook:
from IPython.frontend.html.notebook import notebookapp
app = notebookapp.NotebookApp.instance()
and that works well enough (although I'd prefer if 'start' was nonblocking... I assume I can do it in another thread if needed), but I can't alter the namespace.
I've also found examples like this:
from IPython.zmq.ipkernel import IPKernelApp
namespace = dict(z=1010)
kapp = IPKernelApp.instance()
# Update the ns we want with special variables auto-created by the kernel
namespace.update(kapp.shell.user_ns)
# Now set the kernel's ns to be ours
kapp.shell.user_ns = namespace
But I'm not sure how to actually open the Notebook from here.
Does anybody have any suggestions?
>>> import IPython
>>> z=1010
>>> IPython.embed()
Python 3.5.2 (default, Oct 8 2019, 13:06:37)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.9.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: z
Out[1]: 1010
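The session above works because IPython.embed() picks up the variables of the calling scope. The underlying idea, handing a dictionary to an interpreter as its namespace, can be sketched with the standard library alone. Note that this uses code.InteractiveConsole as a stand-in; it is not IPython and it does not start a notebook.

```python
import code

# The objects the application wants to expose.
namespace = {"z": 1010}

# The console executes code against the dictionary it is given.
console = code.InteractiveConsole(locals=namespace)
console.push("doubled = z * 2")

# Names created inside the console are visible to the application afterwards.
print(namespace["doubled"])  # prints 2020
```

Calling console.interact() instead of push would open an interactive prompt over the same dictionary.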