Automate Boring Stuff Ch13 PyPDF2: pdfReader is not defined - pypdf

While trying to follow the instructions on the book, I encountered an error which I am afraid is due to my wrong understanding about the loop. My code is as below.
#! Python3
import PyPDF2, os
# Loop through all the PDF files
for filename in pdfFiles:
try:
pdfFileObj = open(filename, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
except FileNotFoundError:
print('File not foungd ' + filename)
pass
# Read through all the PDF files.
for pageNum in range(1, pdfReader.numPages):
pageObj = pdfReader.getPage(pageNum)
pdfWriter.addPage(pageObj)
=====
and the result is:
Traceback (most recent call last):
File "C:/…automate_py/combinesPdf.py", line 24, in <module>
for pageNum in range(1, pdfReader.numPages):
NameError: name 'pdfReader' is not defined
Does anyone know why is pdfReader not found? Very appreciated.
I have tried adjust indentation but it didn't seem to work. :(

You have committed an indentation mistake. pdfReader is defined within the first loop of your piece of code, so the second one should be inside that block. This is the entire commented code of that sample on the book:
#! python3
# combinePdfs.py - Combines all the PDFs in the current working directory into
# a single PDF.
import PyPDF2, os
# Get all the PDF filenames.
pdfFiles = []
for filename in os.listdir('.'):
if filename.endswith('.pdf'):
pdfFiles.append(filename)
pdfFiles.sort()
pdfWriter = PyPDF2.PdfFileWriter()
# Loop through all the PDF files.
for filename in pdfFiles:
pdfFileObj = open(filename, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# Loop through all the pages (except the first) and add them.
for pageNum in range(1, pdfReader.numPages):
pageObj = pdfReader.getPage(pageNum)
pdfWriter.addPage(pageObj)
# Save the resulting PDF to a file.
pdfOutput = open('allminutes.pdf', 'wb')
pdfWriter.write(pdfOutput)
pdfOutput.close()

Had the same issue (Python 3.x, Windows 10). This worked for me:
from PyPDF2 import PdfFileReader
with open("C:/yourpdf.pdf", 'rb') as f:
pdf = PdfFileReader(f)
page = pdf.getPage(1)
text = page.extractText()
print(text)

Related

Getting "write() argument must be str, not bytes" in simple Pillow module example

I'm pretty new to Python, and I'm trying to generate some images, so I was testing some Pillow functions.
This is one of the very first examples presented in Pillow's documentation:
from PIL import Image, ImageDraw
with Image.open("C:\\Users\\USER\\Desktop\\Quick_edits\\blank_square.jpg") as im:
draw = ImageDraw.Draw(im)
draw.line((0, 0) + im.size, fill=128)
draw.line((0, im.size[1], im.size[0], 0), fill=128)
# write to stdout
im.save(sys.stdout, "PNG")
And I keep getting this error:
Traceback (most recent call last):
File "<stdin>", line 5, in <module>
File "C:\Users\USER\AppData\Local\Programs\Python\Python38-32\lib\site-packages\PIL\Image.py", line 2158, in save
save_handler(self, fp, filename)
File "C:\Users\USER\AppData\Local\Programs\Python\Python38-32\lib\site-packages\PIL\PngImagePlugin.py", line 1179, in _save
fp.write(_MAGIC)
TypeError: write() argument must be str, not bytes
Please, help me. I've seen seimilar errors been addressed, but this is supposedly a simple example and can't find how to fix this...
If your goal is to simply view the image, try using image.show() instead:
from PIL import Image, ImageDraw
with Image.open("C:\\Users\\USER\\Desktop\\Quick_edits\\blank_square.jpg") as im:
draw = ImageDraw.Draw(im)
draw.line((0, 0) + im.size, fill=128)
draw.line((0, im.size[1], im.size[0], 0), fill=128)
im.show()

Is there a way to save each HDF5 data set as a .csv column?

I'm struggling with a H5 file to extract and save data as a multi column csv. as shown in the picture the structure of h5 file consisted of main groups (Genotypes, Positions, and taxa). The main group, Genotypes contains more than 1500 subgroups (genotype partial names) and each subgroup contains sub-sun groups (complete name of genotypes).There are about 1 million data sets (named calls) -each one is laid in one sub-sub group - which i need them to be written - each one - in a separate column. The problem is that when i use h5py (group.get function) i have to use the path of any calls. I extracted the all paths containing "calls" at the end of path but I cant reach all
1 million calls to get them into a csv file.
could anybody help me to extracts "calls" which are 8bit integer i\as a separate columns in a csv file.
By running the code in first answer I get this error:
Traceback (most recent call last): File "path/file.py", line 32,
in
h5r.visititems(dump_calls2csv) #NOTE: function name is NOT a string! File "path/file.py", line 565, in visititems
return h5o.visit(self.id, proxy) File "h5py_objects.pyx", line 54, in h5py._objects.with_phil.wrapper File
"h5py_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py\h5o.pyx", line 355, in h5py.h5o.visit File
"h5py\defs.pyx", line 1641, in h5py.defs.H5Ovisit_by_name File
"h5py\h5o.pyx", line 302, in h5py.h5o.cb_obj_simple File
"path/file.py", line 564, in proxy
return func(name, self[name]) File "path/file.py", line 10, in dump_calls2csv
np.savetxt(csvfname, arr, fmt='%5d', delimiter=',') File "<array_function internals>", line 6, in savetxt File
"path/file.py", line 1377, in savetxt
open(fname, 'wt').close() OSError: [Errno 22] Invalid argument: 'Genotypes_ArgentineFlintyComposite-C(1)-37-B-B-B2-1-B25-B2-B?-1-B:100000977_calls.csv
16-May-2020 Update:
Added a second example that reads and exports using Pytables (aka
tables) using .walk_nodes(). I prefer this method over h5py
.visititems()
For clarity, I separated the code that creates the example file from the
2 examples that read and export the CSV data.
Enclosed below are 2 simple examples that show how to recursively loop on all top level objects. For completeness, the code to create the test file is at the end of this post.
Example 1: with h5py
This example uses the .visititems() method with a callable function (dump_calls2csv).
Summary of this procedure:
1) Checks for dataset objects with calls in the name.
2) When it finds a matching object it does the following:
a) reads the data into a Numpy array,
b) creates a unique file name (using string substitution on the H5 group/dataset path name to insure uniqueness),
c) writes the data to the file with numpy.savetxt().
import h5py
import numpy as np
def dump_calls2csv(name, node):
if isinstance(node, h5py.Dataset) and 'calls' in node.name :
print ('visiting object:', node.name, ', exporting data to CSV')
csvfname = node.name[1:].replace('/','_') +'.csv'
arr = node[:]
np.savetxt(csvfname, arr, fmt='%5d', delimiter=',')
##########################
with h5py.File('SO_61725716.h5', 'r') as h5r :
h5r.visititems(dump_calls2csv) #NOTE: function name is NOT a string!
If you want to get fancy, you can replace arr in np.savetxt() with node[:].
Also, you you want headers in your CSV, extract and reference the dtype field names from the dataset (I did not create any in this example).
Example 2: with PyTables (tables)
This example uses the .walk_nodes() method with a filter: classname='Leaf'. In PyTables, a leaf can be any of the storage classes (Arrays and Table).
The procedure is similar to the method above. walk_nodes() simplifies the process to find datasets and does NOT require a call to a separate function.
import tables as tb
import numpy as np
with tb.File('SO_61725716.h5', 'r') as h5r :
for node in h5r.walk_nodes('/',classname='Leaf') :
print ('visiting object:', node._v_pathname, 'export data to CSV')
csvfname = node._v_pathname[1:].replace('/','_') +'.csv'
np.savetxt(csvfname, node.read(), fmt='%d', delimiter=',')
For completeness, use the code below to create the test file used in the examples.
import h5py
import numpy as np
ngrps = 2
nsgrps = 3
nds = 4
nrows = 10
ncols = 2
with h5py.File('SO_61725716.h5', 'w') as h5w :
for gcnt in range(ngrps):
grp1 = h5w.create_group('Group_'+str(gcnt))
for scnt in range(nsgrps):
grp2 = grp1.create_group('SubGroup_'+str(scnt))
for dcnt in range(nds):
i_arr = np.random.randint(1,100, (nrows,ncols) )
ds = grp2.create_dataset('calls_'+str(dcnt), data=i_arr)

Reading and writing error pajek file in Networkx

I am receiving error when I write to a pajek file and then read back the same file using Networkx library python
>>> G=nx.read_pajek("eatRS.net")
>>> nx.write_pajek(G,"temp.net")
>>> G1=nx.read_pajek("temp.net")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 2, in read_pajek
File "/usr/local/lib/python2.7/dist-packages/networkx/utils/decorators.py", line 193, in _open_file
result = func(*new_args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/networkx/readwrite/pajek.py", line 132, in read_pajek
return parse_pajek(lines)
File "/usr/local/lib/python2.7/dist-packages/networkx/readwrite/pajek.py", line 168, in parse_pajek
splitline=shlex.split(str(next(lines)))
File "/usr/lib/python2.7/shlex.py", line 279, in split
return list(lex)
File "/usr/lib/python2.7/shlex.py", line 269, in next
token = self.get_token()
File "/usr/lib/python2.7/shlex.py", line 96, in get_token
raw = self.read_token()
File "/usr/lib/python2.7/shlex.py", line 172, in read_token
r aise ValueError, "No closing quotation"
ValueError: No closing quotation
Creating a graph within networkx, writing in pajek format and then back again works fine for me. E.g. with gnm_random_graph:
import matplotlib.pyplot as np
n = 10
m = 20
G = nx.gnm_random_graph(n,m)
nx.write_pajek(G, "temp.net")
G1 = nx.read_pajek("temp.net")
Only if I edit the intermediate graph to have, say,
"vertex one 0.3456 0.1234 box ic White fos 20
do I get the ValueError: No closing quotation error you have. Node labels can be numeric or string, but if they include spaces, the name must be quoted.From the Pajek manual:
label - if label starts with character A..Z or 0..9 first blank determines end of the label
(example: vertex1), labels consisting of more words must be enclosed in pair of special
characters (example: "vertex 1")
Thus, I suggest that you inspect your input file "eatRS.net". Perhaps there is an issue with character encoding, mismatched quotes (e.g. opening with " and closing with '), or a line break within the node label?

Python with Gtk3 not setting unicode properly

I have some simple code that isn't working as expected. First, the docs say that Gtk.Clipboard.get(Gdk.SELECTION_PRIMARY).set_text() should be able to accept only one argument with the length argument option, but it doesn't work (see below). Finally, pasting a unicode ° symbol breaks setting the text when trying to retrieve it from the clipboard (and won't paste into other programs). It gives this warning:
Gdk-WARNING **: Error converting selection from UTF8_STRING
>>> from gi.repository.Gtk import Clipboard
>>> from gi.repository.Gdk import SELECTION_PRIMARY
>>> d='\u00B0'
>>> print(d)
°
>>> cb=Clipboard
Clipboard
>>> cb=Clipboard.get(SELECTION_PRIMARY)
>>> cb.set_text(d) #this should work
Traceback (most recent call last):
File "<ipython-input-6-b563adc3e800>", line 1, in <module>
cb.set_text(d)
File "/usr/lib/python3/dist-packages/gi/types.py", line 43, in function
return info.invoke(*args, **kwargs)
TypeError: set_text() takes exactly 3 arguments (2 given)
>>> cb.set_text(d, len(d))
>>> cb.wait_for_text()
(.:13153): Gdk-WARNING **: Error converting selection from UTF8_STRING
'\\Uffffffff\\Uffffffff'
From the documentation for Gtk.Clipboard
It looks like the method set_text needs a second argument. The first is the text, the second is the length of the text. Or if you don't want to provide the length, you can use -1 to let it calculate the length itself.
gtk.Clipboard.set_text
def set_text(text, len=-1)
text : a string.
len : the length of text, in bytes, or -1, to calculate the length.
I've tested it on Python 3 and it works with cb.set_text(d, -1).
Since GTK version 3.16 there is a easier way of getting the clipboard. You can get it with the get_default() method:
import gi
gi.require_version('Gtk', '3.0')
from gi.repository import Gtk, Gdk, GLib, Gio
display = Gdk.Display.get_default()
clipboard = Gtk.Clipboard.get_default(display)
clipboard.set_text(string, -1)
also for me it worked without
clipboard.store()
Reference: https://lazka.github.io/pgi-docs/Gtk-3.0/classes/Clipboard.html#Gtk.Clipboard.get_default
In Python 3.4. this is only needed for GtkEntryBuffers. In case of GtkTextBuffer set_text works without the second parameter.
example1 works as usual:
settinginfo = 'serveradres = ' + server + '\n poortnummer = ' + poort
GtkTextBuffer2.set_text(settinginfo)
example2 needs extra parameter len:
ErrorTextDate = 'choose earlier date'
GtkEntryBuffer1.set_text(ErrorTextDate, -1)

Python 3.2 lxml fill and submit form, select multiple, how to do it? value not working

Great page this one, coming from the perl world and after several years of doing nothing, I've re-started to program again (this web page didn't exist, how things change). And now, after a 2 full-days of searching, I play the last card of asking here for help.
Working under mac environment, with python 3.2 and lxml 2.3 (installed following www.jtmoon.com/?p=21), what I am trying to do:
web: http://biodbnet.abcc.ncifcrf.gov/db/db2db.php
to fill the form that you find there
to submit it
My code. I put several attempts and the output code.
from lxml.html import parse, submit_form, tostring
page = parse('http://biodbnet.abcc.ncifcrf.gov/db/db2db.php').getroot()
page.forms[0].fields['input'] = 'GI Number'
page.forms[0].inputs['outputs[]'].value = 'Gene ID'
page.forms[0].fields['hasComma'] = 'no'
page.forms[0].fields['removeDupValues'] = 'yes'
page.forms[0].fields['request'] = 'db2db'
page.forms[0].action = 'http://biodbnet.abcc.ncifcrf.gov/db/db2dbRes.php'
page.forms[0].fields['idList'] = '86439006'
submit_form(page.forms[0])
Output:
File "/Users/gerard/Desktop/barbacue/MGFtoXML.py", line 30, in <module>
page.forms[0].inputs['outputs[]'].value = 'Gene ID'
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 1058, in _value__set
"You must pass in a sequence")
TypeError: You must pass in a sequence
So, since that element is a multi-select element, I understand that I have to give a list
page.forms[0].inputs['outputs[]'].value = list('Gene ID')
Output:
File "/Users/gerard/Desktop/barbacue/MGFtoXML.py", line 30, in <module>
page.forms[0].inputs['outputs[]'].value = list('Gene ID')
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 1059, in _value__set
self.value.clear()
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/_setmixin.py", line 115, in clear
self.remove(item)
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 1159, in remove
"The option %r is not currently selected" % item)
ValueError: The option 'Affy ID' is not currently selected
'Affy ID' is the first option value of the list, and it is not selected. But what's the problem with it?
Surprisingly, if I instead put
page.forms[0].inputs['outputs[]'].multiple = list('Gene ID')
#page.forms[0].inputs['outputs[]'].value = list('Gene ID')
Then, somehow lxml likes it, and move on. However, the multiple attribute should be a boolean (actually it is if I print the value), I shouldn't touch it, and the "value" of the item should actually point to the selected items, according to the lxml docs.
The new output
File "/Users/gerard/Desktop/barbacue/MGFtoXML.py", line 87, in <module>
submit_form(page.forms[0])
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 856, in submit_form
return open_http(form.method, url, values)
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 876, in open_http_urllib
return urlopen(url, data)
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/urllib/request.py", line 138, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/urllib/request.py", line 364, in open
req = meth(req)
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/urllib/request.py", line 1052, in do_request_
raise TypeError("POST data should be bytes"
TypeError: POST data should be bytes or an iterable of bytes. It cannot be str.
So, what can be done?? I am sure that with python 2.6 I could use mecanize, or that perhaps lxml could work? But I really don't want to code in a sort-of deprecated version. I am enjoying a lot python, but I am starting to consider going back to perl. Perhaps this could be a smart movement??
Any help will be hugely appreciated
Gerard
Reading in this forum, I find pythonpaste.org, could it be a replacement for lxml?
Passing in a sequence to list() will generate a list from that sequence. 'Gene ID' is sequence (namely a sequence of characters). So list('Gene ID') will generate a list of characters, like so:
>>> list('Gene ID')
['G', 'e', 'n', 'e', ' ', 'I', 'D']
That's not what you want. Try this:
>>> ['Gene ID']
['Gene ID']
In other words:
page.forms[0].inputs['outputs[]'].value = ['Gene ID']
That should take you a bit forward.