.extractText() returns "invalid literal for decimal" - pypdf

I'm coding something which will read PDFs online and return a set of keywords that are found in the document. However I keep running into a problem with the extractText() function from the PyPDF2 package.
Here's my code to open the PDFs and read it:
x = myurl.pdf
if ".pdf" in x:
remoteFile = urlopen(Request(x, headers={"User-Agent": "Magic-Browser"})).read()
memoryFile = StringIO(remoteFile)
pdfFile = PyPDF2.PdfFileReader(memoryFile, strict=False)
num_pages = pdfFile.numPages
count = 0
text = ""
while count < num_pages:
pageObj = pdfFile.getPage(count)
count += 1
text += pageObj.extractText()
The error that I keep running into on the extractText() line goes like this:
Traceback (most recent call last):
File "errortest.py", line 30, in <module>
text += pageObj.extractText()
File "/anaconda2/lib/python2.7/site-packages/PyPDF2/pdf.py", line 2595, in extractText
content = ContentStream(content, self.pdf)
File "/anaconda2/lib/python2.7/site-packages/PyPDF2/pdf.py", line 2674, in __init__
self.__parseContentStream(stream)
File "/anaconda2/lib/python2.7/site-packages/PyPDF2/pdf.py", line 2706, in __parseContentStream
operands.append(readObject(stream, None))
File "/anaconda2/lib/python2.7/site-packages/PyPDF2/generic.py", line 98, in readObject
return NumberObject.readFromStream(stream)
File "/anaconda2/lib/python2.7/site-packages/PyPDF2/generic.py", line 271, in readFromStream
return FloatObject(num)
File "/anaconda2/lib/python2.7/site-packages/PyPDF2/generic.py", line 231, in __new__
return decimal.Decimal.__new__(cls, str(value))
File "/anaconda2/lib/python2.7/decimal.py", line 547, in __new__
"Invalid literal for Decimal: %r" % value)
File "/anaconda2/lib/python2.7/decimal.py", line 3872, in _raise_error
raise error(explanation)
decimal.InvalidOperation: Invalid literal for Decimal: '99.-72'
Would be great if someone could help me out! Thanks!

There is too little information to be certain, but PyPDF2 (and now pypdf) improved a lot in 2022. You will probably just need to upgrade to the latest version of pypdf.
If you encounter a bug in pypdf again, please open an issue: https://github.com/py-pdf/pypdf
A good bug ticket contains (1) your pypdf version (2) the code + PDF document that caused the issue.

Related

Problems with "match" function in PyDroid3

Creating a todo list that uses the "match" function:
while True:
user_action = input("Type add, show, edit, or exit: ")
user_action = user_action.strip()
match user_action:
case "add":
todo = input("Enter a todo: ")
todos.append(todo)
file = open("todos.txt", "w")
file.writelines(todos)
Here is what happens:
Traceback (most recent call last):
File "/data/user/0/ru.iiec.pydroid3/files/accomp_files/iiec_run/iiec_run.py", line 31, in <module>
start(fakepyfile,mainpyfile)
File "/data/user/0/ru.iiec.pydroid3/files/accomp_files/iiec_run/iiec_run.py", line 30, in start
exec(open(mainpyfile).read(), __main__.__dict__)
File "<string>", line 7
match user_action:
^
SyntaxError: invalid syntax
I suspect that there is no "match" function in PyDroid. Sorry, but I'm quite new at this. Is there something I'm missing or some way I can make a work-around? Thanks in advance!
It seems the real problem here is that the version of PyDroid is 3.9.7, which is to say, it does not contain the "match" function.

What should I do to highlight error lines after running code in VSCode?

I want to highlight error line when I finish running code in vscode. Just like this code I wanna highlight 45 and 57 lines. What can i do?
Traceback (most recent call last):
File "f:\onedrive\L_ML\L_ML_W3_Lab06-09\L_ML_W3_Lab06.py", line 57, in <module>
dj_db_tmp, dj_dw_tmp = compute_gradient_logistic(X_tmp, y_tmp, w_tmp, b_tmp)
File "f:\onedrive\L_ML\L_ML_W3_Lab06-09\L_ML_W3_Lab06.py", line 45, in compute_gradient_logistic
dj_dw = dj_dw[j] + err_i * X[i,j]
IndexError: invalid index to scalar variable.
like this picture
use yellow highlight error line

I keep getting ValueError in my Python program

I'm trying to open a csv file for my python book project and this error keeps popping up
File "c:\Users\MSFT Surface Pro 3\Documents\Programming\Python\Python AIO\Code.py", line 13, in <module> birthYear = int(row[1] or 0) ValueError: invalid literal for int() with base 10: ' 1/11/2011'
and this is super annoying, help me please!
Looks like line 13 in in Code.py is birthYear = int(row[1] or 0), but probably should be something like birthYear = row[1] or "".

PyPDF2.PdfFileReader hangs indefinitely

I'm trying to read this pdf file (https://www.accessdata.fda.gov/cdrh_docs/pdf14/K141693.pdf) and am following these suggestions from SO
Opening pdf urls with pyPdf
I have actually downloaded the file locally and am running the following code
import PyPDF2
pdf_file = open("K141693.pdf")
pdf_read = PyPDF2.PdfFileReader(pdf_file)
but my code hangs indefinitely. I'm running Python 2.7 and here is the stacktrace.
Traceback (most recent call last):
File "", line 1, in
runfile('C:/PoC/pdf_reader.py', wdir='C:/PoC')
File
"C:\ProgramData\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py",
line 880, in runfile
execfile(filename, namespace)
File
"C:\ProgramData\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py",
line 87, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/PoC/pdf_reader.py", line 13, in
pdf_read = PyPDF2.PdfFileReader(pdf_file)
File "C:\ProgramData\Anaconda2\lib\site-packages\PyPDF2\pdf.py",
line 1084, in init
self.read(stream)
File "C:\ProgramData\Anaconda2\lib\site-packages\PyPDF2\pdf.py",
line 1697, in read
line = self.readNextEndLine(stream)
File "C:\ProgramData\Anaconda2\lib\site-packages\PyPDF2\pdf.py",
line 1938, in readNextEndLine
x = stream.read(1)
KeyboardInterrupt
I came across another post here PyPDF2 hangs on processing but that too doesn't have a response.
You need to parse the file in binary ('rb') mode. (This works in Python 3:)
import PyPDF2
pdf_file = open("K141693.pdf", "rb")
read_pdf = PyPDF2.PdfFileReader(pdf_file)

"InterfaceError: connection already closed" when using multiprocessing.Pool on black box function that queries PostgreSQL database

I've been given a Python (2.7) function that takes 3 strings as arguments, and returns a list of dictionaries. Due to the nature of the project, I can't alter the function, which is quite complex, calling several other non-standard Python modules and querying a PostgreSQL database using psychopg2. I think that it's the Postgres functionality that's causing me problems.
I want to use the multiprocessing module to speed up calling the function hundreds of times. I've written a "helper" function so that I can use multiprocessing.Pool (which takes only 1 argument) with my function:
from function_script import function
def function_helper(args):
return function(*args)
And my main code looks like this:
from helper_script import function_helper
from multiprocessing import Pool
argument_a = ['a0', 'a1', ..., 'a99']
argument_b = ['b0', 'b1', ..., 'b99']
argument_c = ['c0', 'c1', ..., 'c99']
input = zip(argument_a, argument_b, argument_c)
p = Pool(4)
results = p.map(function_helper, input)
print results
What I'm expecting is a list of lists of dictionaries, however I get the following errors:
Traceback (most recent call last):
File "/local/python/2.7/lib/python2.7/site-packages/variantValidator/variantValidator.py", line 898, in validator
vr.validate(input_parses)
File "/local/python/2.7/lib/python2.7/site-packages/hgvs/validator.py", line 33, in validate
return self._ivr.validate(var, strict) and self._evr.validate(var, strict)
File "/local/python/2.7/lib/python2.7/site-packages/hgvs/validator.py", line 69, in validate
(res, msg) = self._ref_is_valid(var)
File "/local/python/2.7/lib/python2.7/site-packages/hgvs/validator.py", line 89, in _ref_is_valid
var_x = self.vm.c_to_n(var) if var.type == "c" else var
File "/local/python/2.7/lib/python2.7/site-packages/hgvs/variantmapper.py", line 223, in c_to_n
tm = self._fetch_TranscriptMapper(tx_ac=var_c.ac, alt_ac=var_c.ac, alt_aln_method="transcript")
File "/local/python/2.7/lib/python2.7/site-packages/hgvs/decorators/lru_cache.py", line 176, in wrapper
result = user_function(*args, **kwds)
File "/local/python/2.7/lib/python2.7/site-packages/hgvs/variantmapper.py", line 372, in _fetch_TranscriptMapper
self.hdp, tx_ac=tx_ac, alt_ac=alt_ac, alt_aln_method=alt_aln_method)
File "/local/python/2.7/lib/python2.7/site-packages/hgvs/transcriptmapper.py", line 69, in __init__
self.tx_identity_info = hdp.get_tx_identity_info(self.tx_ac)
File "/local/python/2.7/lib/python2.7/site-packages/hgvs/decorators/lru_cache.py", line 176, in wrapper
result = user_function(*args, **kwds)
File "/local/python/2.7/lib/python2.7/site-packages/hgvs/dataproviders/uta.py", line 353, in get_tx_identity_info
rows = self._fetchall(self._queries['tx_identity_info'], [tx_ac])
File "/local/python/2.7/lib/python2.7/site-packages/hgvs/dataproviders/uta.py", line 216, in _fetchall
with self._get_cursor() as cur:
File "/local/python/2.7/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/local/python/2.7/lib/python2.7/site-packages/hgvs/dataproviders/uta.py", line 529, in _get_cursor
cur.execute("set search_path = " + self.url.schema + ";")
File "/local/python/2.7/lib/python2.7/site-packages/psycopg2/extras.py", line 144, in execute
return super(DictCursor, self).execute(query, vars)
DatabaseError: SSL error: decryption failed or bad record mac
And:
Traceback (most recent call last):
File "/local/python/2.7/lib/python2.7/site-packages/variantValidator/variantValidator.py", line 898, in validator
vr.validate(input_parses)
File "/local/python/2.7/lib/python2.7/site-packages/hgvs/validator.py", line 33, in validate
return self._ivr.validate(var, strict) and self._evr.validate(var, strict)
File "/local/python/2.7/lib/python2.7/site-packages/hgvs/validator.py", line 69, in validate
(res, msg) = self._ref_is_valid(var)
File "/local/python/2.7/lib/python2.7/site-packages/hgvs/validator.py", line 89, in _ref_is_valid
var_x = self.vm.c_to_n(var) if var.type == "c" else var
File "/local/python/2.7/lib/python2.7/site-packages/hgvs/variantmapper.py", line 223, in c_to_n
tm = self._fetch_TranscriptMapper(tx_ac=var_c.ac, alt_ac=var_c.ac, alt_aln_method="transcript")
File "/local/python/2.7/lib/python2.7/site-packages/hgvs/decorators/lru_cache.py", line 176, in wrapper
result = user_function(*args, **kwds)
File "/local/python/2.7/lib/python2.7/site-packages/hgvs/variantmapper.py", line 372, in _fetch_TranscriptMapper
self.hdp, tx_ac=tx_ac, alt_ac=alt_ac, alt_aln_method=alt_aln_method)
File "/local/python/2.7/lib/python2.7/site-packages/hgvs/transcriptmapper.py", line 69, in __init__
self.tx_identity_info = hdp.get_tx_identity_info(self.tx_ac)
File "/local/python/2.7/lib/python2.7/site-packages/hgvs/decorators/lru_cache.py", line 176, in wrapper
result = user_function(*args, **kwds)
File "/local/python/2.7/lib/python2.7/site-packages/hgvs/dataproviders/uta.py", line 353, in get_tx_identity_info
rows = self._fetchall(self._queries['tx_identity_info'], [tx_ac])
File "/local/python/2.7/lib/python2.7/site-packages/hgvs/dataproviders/uta.py", line 216, in _fetchall
with self._get_cursor() as cur:
File "/local/python/2.7/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/local/python/2.7/lib/python2.7/site-packages/hgvs/dataproviders/uta.py", line 526, in _get_cursor
conn.autocommit = True
InterfaceError: connection already closed
Does anybody know what might cause the Pool function to behave like this, when it seems so simple to use in other examples that I've tried? If this isn't enough information to go on, can anyone advise me on a way of getting to the bottom of the problem (this is the first time I've worked with someone else's code)? Alternatively, are there any other ways that I could use the multiprocessing module to call the function hundreds of times?
Thanks
I think what may be happening is that your connection object is used across all workers and when 1 worker has completed all its tasks it closes the connection and meanwhile the other workers are still working and the connection is closed so when one of those workers tries to use the db it is already closed.