PyPDF2.PdfFileReader hangs indefinitely - pypdf

I'm trying to read this pdf file (https://www.accessdata.fda.gov/cdrh_docs/pdf14/K141693.pdf) and am following these suggestions from SO
Opening pdf urls with pyPdf
I have actually downloaded the file locally and am running the following code
import PyPDF2
pdf_file = open("K141693.pdf")
pdf_read = PyPDF2.PdfFileReader(pdf_file)
but my code hangs indefinitely. I'm running Python 2.7 and here is the stacktrace.
Traceback (most recent call last):
File "", line 1, in
runfile('C:/PoC/pdf_reader.py', wdir='C:/PoC')
File
"C:\ProgramData\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py",
line 880, in runfile
execfile(filename, namespace)
File
"C:\ProgramData\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py",
line 87, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/PoC/pdf_reader.py", line 13, in
pdf_read = PyPDF2.PdfFileReader(pdf_file)
File "C:\ProgramData\Anaconda2\lib\site-packages\PyPDF2\pdf.py",
line 1084, in init
self.read(stream)
File "C:\ProgramData\Anaconda2\lib\site-packages\PyPDF2\pdf.py",
line 1697, in read
line = self.readNextEndLine(stream)
File "C:\ProgramData\Anaconda2\lib\site-packages\PyPDF2\pdf.py",
line 1938, in readNextEndLine
x = stream.read(1)
KeyboardInterrupt
I came across another post here PyPDF2 hangs on processing but that too doesn't have a response.

You need to parse the file in binary ('rb') mode. (This works in Python 3:)
import PyPDF2
pdf_file = open("K141693.pdf", "rb")
read_pdf = PyPDF2.PdfFileReader(pdf_file)

Related

Read GCS file into dask dataframe

I want to read a csv file stored in Google Cloud Storage using dask dataframe.
I have insalled gcsfs & dask in the conda env. on Windows
import dask.dataframe as dd
import gcsfs
project_id = 'my-project'
token_file = 'C:\\path\to\credentials.json'
fs = gcsfs.GCSFileSystem(project=project_id)
gcs_bucket_name = 'my_bucket'
df = dd.read_csv('gs://'+gcs_bucket_name+'/my_file.csv',storage_options={'token': token_file, 'project': project_id})
I know I'm not providing the key file correctly as per
https://gcsfs.readthedocs.io/en/latest/ but not sure how to do it. Can anyone please help?
Error I'm getting -
File "<ipython-input-XXXXXXXX>", line 1, in <module>
runfile('C:/path/to/scripts/my_python_script.py', wdir='C:/path/to/scripts')
File "C:\Users\AppData\Local\Continuum\anaconda3\envs\my-rdkit-env\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)
File "C:\Users\AppData\Local\Continuum\anaconda3\envs\my-rdkit-env\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/path/to/scripts/my_python_script.py", line 28, in <module>
df = dd.read_csv('gcs://'+gcs_bucket_name+'/my_file.csv',storage_options={'token': token_file, 'project': project_id})
File "C:\Users\AppData\Local\Continuum\anaconda3\envs\my-rdkit-env\lib\site-packages\dask\dataframe\io\csv.py", line 578, in read
**kwargs
File "C:\Users\AppData\Local\Continuum\anaconda3\envs\my-rdkit-env\lib\site-packages\dask\dataframe\io\csv.py", line 444, in read_pandas
head = reader(BytesIO(b_sample), **kwargs)
File "C:\Users\AppData\Local\Continuum\anaconda3\envs\my-rdkit-env\lib\site-packages\pandas\io\parsers.py", line 685, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\AppData\Local\Continuum\anaconda3\envs\my-rdkit-env\lib\site-packages\pandas\io\parsers.py", line 463, in _read
data = parser.read(nrows)
File "C:\Users\AppData\Local\Continuum\anaconda3\envs\my-rdkit-env\lib\site-packages\pandas\io\parsers.py", line 1154, in read
ret = self._engine.read(nrows)
File "C:\Users\AppData\Local\Continuum\anaconda3\envs\my-rdkit-env\lib\site-packages\pandas\io\parsers.py", line 2059, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 881, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 896, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 950, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 937, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2132, in pandas._libs.parsers.raise_parser_error
ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 134

PyDev in Debug mode: NameError: dict_pop when importing pandas or xarray

I'm running code to read data from a NetCDF file from a URL with xarray. When I run this code in Eclipse/PyDev in debug mode I am seeing errors that don't happen when launching as a normal Python run or from the command line or as cells of a Jupyter notebook.
I get the following error when I include an import xarray in my code:
NameError: name 'dict_pop' is not defined
Full stack trace looks like this:
pydev debugger: starting (pid: 4004)
URL: https://www.ncei.noaa.gov/data/nclimgrid/nclimgrid_prcp.nc
Traceback (most recent call last):
File "_pydevd_bundle\pydevd_cython_win32_35_64.pyx", line 741, in _pydevd_bundle.pydevd_cython_win32_35_64.PyDBFrame.trace_dispatch (_pydevd_bundle/pydevd_cython_win32_35_64.c:15515)
File "_pydevd_bundle\pydevd_cython_win32_35_64.pyx", line 254, in _pydevd_bundle.pydevd_cython_win32_35_64.PyDBFrame.do_wait_suspend (_pydevd_bundle/pydevd_cython_win32_35_64.c:5631)
File "C:\User Programs\eclipse\neon\plugins\org.python.pydev_5.9.0.201708101613\pysrc\pydevd.py", line 764, in do_wait_suspend
self._activate_mpl_if_needed()
File "C:\User Programs\eclipse\neon\plugins\org.python.pydev_5.9.0.201708101613\pysrc\pydevd.py", line 415, in _activate_mpl_if_needed
activate_function = dict_pop(self.mpl_modules_for_patching, module)
NameError: name 'dict_pop' is not defined
2017-08-16 15:25:06 ERROR Failed to complete ERROR
Traceback (most recent call last):
File "C:\home\git\indices_github\process_xarray.py", line 27, in <module>
ncdata = url.read()
File "C:\home\git\indices_github\process_xarray.py", line 27, in <module>
ncdata = url.read()
File "_pydevd_bundle\pydevd_cython_win32_35_64.pyx", line 982, in _pydevd_bundle.pydevd_cython_win32_35_64.SafeCallWrapper.__call__ (_pydevd_bundle/pydevd_cython_win32_35_64.c:19346)
File "_pydevd_bundle\pydevd_cython_win32_35_64.pyx", line 498, in _pydevd_bundle.pydevd_cython_win32_35_64.PyDBFrame.trace_dispatch (_pydevd_bundle/pydevd_cython_win32_35_64.c:18639)
File "_pydevd_bundle\pydevd_cython_win32_35_64.pyx", line 750, in _pydevd_bundle.pydevd_cython_win32_35_64.PyDBFrame.trace_dispatch (_pydevd_bundle/pydevd_cython_win32_35_64.c:15669)
File "_pydevd_bundle\pydevd_cython_win32_35_64.pyx", line 741, in _pydevd_bundle.pydevd_cython_win32_35_64.PyDBFrame.trace_dispatch (_pydevd_bundle/pydevd_cython_win32_35_64.c:15515)
File "_pydevd_bundle\pydevd_cython_win32_35_64.pyx", line 254, in _pydevd_bundle.pydevd_cython_win32_35_64.PyDBFrame.do_wait_suspend (_pydevd_bundle/pydevd_cython_win32_35_64.c:5631)
File "C:\User Programs\eclipse\neon\plugins\org.python.pydev_5.9.0.201708101613\pysrc\pydevd.py", line 764, in do_wait_suspend
self._activate_mpl_if_needed()
File "C:\User Programs\eclipse\neon\plugins\org.python.pydev_5.9.0.201708101613\pysrc\pydevd.py", line 415, in _activate_mpl_if_needed
activate_function = dict_pop(self.mpl_modules_for_patching, module)
NameError: name 'dict_pop' is not defined
Traceback (most recent call last):
File "C:\User Programs\eclipse\neon\plugins\org.python.pydev_5.9.0.201708101613\pysrc\pydevd.py", line 1621, in <module>
main()
File "C:\User Programs\eclipse\neon\plugins\org.python.pydev_5.9.0.201708101613\pysrc\pydevd.py", line 1615, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "C:\User Programs\eclipse\neon\plugins\org.python.pydev_5.9.0.201708101613\pysrc\pydevd.py", line 1022, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "C:\User Programs\eclipse\neon\plugins\org.python.pydev_5.9.0.201708101613\pysrc\_pydev_imps\_pydev_execfile.py", line 25, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:\home\git\indices_github\process_xarray.py", line 27, in <module>
ncdata = url.read()
File "C:\home\git\indices_github\process_xarray.py", line 27, in <module>
ncdata = url.read()
File "_pydevd_bundle\pydevd_cython_win32_35_64.pyx", line 982, in _pydevd_bundle.pydevd_cython_win32_35_64.SafeCallWrapper.__call__ (_pydevd_bundle/pydevd_cython_win32_35_64.c:19346)
File "_pydevd_bundle\pydevd_cython_win32_35_64.pyx", line 498, in _pydevd_bundle.pydevd_cython_win32_35_64.PyDBFrame.trace_dispatch (_pydevd_bundle/pydevd_cython_win32_35_64.c:18639)
File "_pydevd_bundle\pydevd_cython_win32_35_64.pyx", line 750, in _pydevd_bundle.pydevd_cython_win32_35_64.PyDBFrame.trace_dispatch (_pydevd_bundle/pydevd_cython_win32_35_64.c:15669)
File "_pydevd_bundle\pydevd_cython_win32_35_64.pyx", line 741, in _pydevd_bundle.pydevd_cython_win32_35_64.PyDBFrame.trace_dispatch (_pydevd_bundle/pydevd_cython_win32_35_64.c:15515)
File "_pydevd_bundle\pydevd_cython_win32_35_64.pyx", line 254, in _pydevd_bundle.pydevd_cython_win32_35_64.PyDBFrame.do_wait_suspend (_pydevd_bundle/pydevd_cython_win32_35_64.c:5631)
File "C:\User Programs\eclipse\neon\plugins\org.python.pydev_5.9.0.201708101613\pysrc\pydevd.py", line 764, in do_wait_suspend
self._activate_mpl_if_needed()
File "C:\User Programs\eclipse\neon\plugins\org.python.pydev_5.9.0.201708101613\pysrc\pydevd.py", line 415, in _activate_mpl_if_needed
activate_function = dict_pop(self.mpl_modules_for_patching, module)
NameError: name 'dict_pop' is not defined
I'm using recent/latest versions of xarray and pandas installed on Python 3.5.3 (Anaconda).
I get the same error with an 'import pandas' statement as well, so this may be pandas related instead since xarray depends upon pandas internally.
The packages are installed in Anaconda as shown below:
$ conda list pandas
# packages in environment at C:\home\Anaconda3:
#
pandas 0.20.3 py35_1 conda-forge
$ conda list xarray
# packages in environment at C:\home\Anaconda3:
#
xarray 0.9.6 py35_0 conda-forge
Again I can run this code with the imports of pandas and xarray with no issue from the command line and/or within a Jupyter notebook, this only happens when I launch the code in Eclipse/PyDev in debug mode (regular Python run works as expected).
Example Python code I'm using to test this:
import sys
import xarray as xr
if __name__ == '__main__':
try:
# get the command line arguments
input_netcdf_url = sys.argv[1]
ds = xr.open_dataset(input_netcdf_url)
except:
raise
What can I try next?
Upgrade to 5.9.2... this was one of the critical issues fixed witch prompted the new release.

TensorFlow: use gfile.FastGfile() method can't not read a file with its path include Chinese characters

I want to read use gfile.FastGFile(image_path, 'rb').read() to read a picture and use it as the input of my project, and I use the directory name as the lable of these pictures which are include in the directory, when the directory name is in English, my code works fine, but when the directory name is in Chinese, it throws this Error:
Traceback (most recent call last):
File "F:/pythonWS/imageFilter/jpegFileJudge.py", line 27, in <module>
image_data = gfile.FastGFile(image_path, 'rb').read()
File "C:\Program Files\Python35\lib\site-
packages\tensorflow\python\lib\io\file_io.py", line 106, in read
self._preread_check()
File "C:\Program Files\Python35\lib\site-
packages\tensorflow\python\lib\io\file_io.py", line 73, in _preread_check
compat.as_bytes(self.__name), 1024 * 512, status)
File "C:\Program Files\Python35\lib\contextlib.py", line 66, in __exit__
next(self.gen)
File "C:\Program Files\Python35\lib\site-
packages\tensorflow\python\framework\errors_impl.py", line 466, in
raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: NewRandomAccessFile
failed to Create/Open: F:\vsWorkspace\pics\test\三宝鸟
\0ff41bd5ad6eddc403fa02d13bdbb6fd526633fe.jpg :
ϵͳ\udcd5Ҳ\udcbb\udcb5\udcbdָ\udcb6\udca8\udcb5\udcc4\udcceļ\udcfe\udca1\udca3
my test code is :
# -*- coding: utf-8 -*-
import glob
import os
import random
import numpy as np
import tensorflow as tf
from tensorflow.python.platform import gfile
image_folder='F:/vsWorkspace/pics/test'
os.chdir(image_folder)
count=0
for each in os.listdir(image_folder):
each=os.path.abspath(each)
os.chdir(each)
for image_path in os.listdir(each):
image_path = os.path.abspath(image_path)
print(image_path)
image_data = gfile.FastGFile(image_path, 'rb').read()
count += 1
os.chdir(image_folder)
My envirorment is Windows7 x64, python 3.5.3 and TensorFlow 1.0, How can I solve this problem?
By the way,I have to use Chinese directories' name use my pictures lables.

django pipeline Error: The file build/fonts/glyphicons-halflings-regular.eot' could not be found with <pipeline.storage.PipelineCachedStorage object

I am using django-pipeline, On running "sudo python manage.py collectstatic"
Getting this error:
Traceback (most recent call last):
File "manage.py", line 10, in <module> execute_from_command_line(sys.argv)
File "/Users/office/lib/python2.7/site-packages/django/core/management/__init__.py", line 338, in execute_from_command_line utility.execute()
File "/Users/office/lib/python2.7/site-packages/django/core/management/__init__.py", line 330, in execute self.fetch_command(subcommand).run_from_argv(self.argv)
File "/Users/office/lib/python2.7/site-packages/django/core/management/base.py", line 390, in run_from_argv self.execute(*args, **cmd_options)
File "/Users/office/lib/python2.7/site-packages/django/core/management/base.py", line 441, in execute output = self.handle(*args, **options)
File "/Users/office/lib/python2.7/site-packages/django/contrib/staticfiles/management/commands/collectstatic.py", line 168, in handle collected = self.collect()
File "/Users/office/lib/python2.7/site-packages/django/contrib/staticfiles/management/commands/collectstatic.py", line 120, in collect raise processed ValueError: The file 'bower_components/eonasdan-bootstrap-datetimepicker/build/fonts/glyphicons-halflings-regular.eot' could not be found with <pipeline.storage.PipelineCachedStorage object at 0x10d274e10>.
Any idea how to fix this ?
Using django-pipeline-forgiving(https://pypi.python.org/pypi/django-pipeline-forgiving) solved the issue.
pip install django-pipeline-forgiving
Set in your settings.py: STATICFILES_STORAGE = 'django_pipeline_forgiving.storages.PipelineForgivingStorage'

ipython: OperationalError: disk I/O error

I was running ipython successfully on fedora 18 until now: I'm getting the following exception when trying to launch it:
Traceback (most recent call last):
File "/usr/bin/ipython", line 9, in <module>
load_entry_point('ipython==1.1.0', 'console_scripts', 'ipython')()
File "/usr/lib/python2.7/site-packages/IPython/__init__.py", line 118, in start_ipython
return launch_new_instance(argv=argv, **kwargs)
File "/usr/lib/python2.7/site-packages/IPython/config/application.py", line 544, in launch_instance
app.initialize(argv)
File "<string>", line 2, in initialize
File "/usr/lib/python2.7/site-packages/IPython/config/application.py", line 89, in catch_config_error
return method(app, *args, **kwargs)
File "/usr/lib/python2.7/site-packages/IPython/terminal/ipapp.py", line 323, in initialize
self.init_shell()
File "/usr/lib/python2.7/site-packages/IPython/terminal/ipapp.py", line 339, in init_shell
ipython_dir=self.ipython_dir, user_ns=self.user_ns)
File "/usr/lib/python2.7/site-packages/IPython/config/configurable.py", line 349, in instance
inst = cls(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/IPython/terminal/interactiveshell.py", line 320, in __init__
**kwargs
File "/usr/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 456, in __init__
self.init_history()
File "/usr/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 1487, in init_history
self.history_manager = HistoryManager(shell=self, parent=self)
File "/usr/lib/python2.7/site-packages/IPython/core/history.py", line 481, in __init__
self.new_session()
File "<string>", line 2, in new_session
File "/usr/lib/python2.7/site-packages/IPython/core/history.py", line 65, in needs_sqlite
return f(self, *a, **kw)
File "/usr/lib/python2.7/site-packages/IPython/core/history.py", line 500, in new_session
self.session_number = cur.lastrowid
OperationalError: disk I/O error
If you suspect this is an IPython bug, please report it at:
https://github.com/ipython/ipython/issues
or send an email to the mailing list at ipython-dev#scipy.org
You can print a more detailed traceback right now with "%tb", or use "%debug"
to interactively debug it.
Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
c.Application.verbose_crash=True
I tried to upgrade ipython to the latest version using pip, which did not help. Any solution or workaround is very much welcomed.
IPython store history in a profile generally in ~/.ipython/profile_default/history.sqlite. There seem to be a disk error reading/writting to it.
Check the permissions of the file/folders, if necessary delete the file.
I solved it by set c.NotebookNotary.db_file = u':memory:' in jupyter configure file ~/.jupyter/jupyter_notebook_config.py