PyPdf2 merge with scanned PDF yields "An error exists on this page..." - pypdf

I want to use PyPDF2 to take each page of a scanned PDF document,
scale the page to 85% of its original size
and center the page on a blank 8.5 by 11 page
with the same number of pages
to create margins that are needed for printing/adding barcodes.
I've tried a few approaches with mergeScaledTranslatedPage but I keep ending up with an error message when I open the file in Adobe Acrobat DC.
Even if the output appears to be a success, I get the following error when opening the file:
An error exists on this page. Acrobat may not display the page correctly. Please contact the person who created the PDF document to correct the problem.
How can I make it work?

I'm the maintainer of pypdf and PyPDF2. Please use pypdf.
from pypdf import PdfReader, PdfWriter, Transformation
from pypdf.generic import RectangleObject

reader = PdfReader("GeoTopo.pdf")
writer = PdfWriter()
desired_width = 100
desired_height = 100
r = RectangleObject([0, 0, desired_width, desired_height])
for page in reader.pages[:10]:
    old_width = page.mediabox.width
    old_height = page.mediabox.height
    a1 = desired_width / old_width
    a2 = desired_height / old_height
    factor = min(a1, a2)
    new_width = float(old_width * factor)
    new_height = float(old_height * factor)
    dx = (desired_width - new_width) / 2
    dy = (desired_height - new_height) / 2
    op = Transformation().translate(tx=dx, ty=dy)
    page.scale_to(width=new_width, height=new_height)
    page.add_transformation(op)
    page.mediabox = r
    page.artbox = r
    page.cropbox = r
    page.bleedbox = r
    page.trimbox = r
    writer.add_page(page)
with open("foo.pdf", "wb") as fp:
    writer.write(fp)
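The placement arithmetic for the goal stated in the question (85% scale, centered on an 8.5 by 11 page) can be checked without any PDF library. This is a minimal sketch, assuming a Letter page of 612 x 792 points (72 points per inch); the helper name `centered_placement` is hypothetical and not part of pypdf.

```python
LETTER_WIDTH = 8.5 * 72   # 612 points, assuming 72 points per inch
LETTER_HEIGHT = 11 * 72   # 792 points


def centered_placement(page_width, page_height, scale=0.85,
                       target_width=LETTER_WIDTH, target_height=LETTER_HEIGHT):
    """Return (new_width, new_height, dx, dy) for a scaled, centered page."""
    new_width = page_width * scale
    new_height = page_height * scale
    # Split the leftover space evenly between the two opposite margins.
    dx = (target_width - new_width) / 2
    dy = (target_height - new_height) / 2
    return new_width, new_height, dx, dy
```

The returned width and height would play the role of `new_width` and `new_height` passed to `page.scale_to`, and `dx`/`dy` the translation passed to `Transformation().translate`, as in the loop above.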

Related

Macro for multiple images analysis in the calculator plus

I have a set of images in a folder, named 2 days.bmp, 3 days.bmp, ..., 28 days.bmp. I also have a folder of background images with exactly the same names. I want to use Calculator Plus to divide each image by its corresponding background. To save time, I want to process them all in one batch instead of one image at a time. Any idea how I can do that? I have written a macro, but it didn't work for me.
macro "Batch calculate images [1]" {
    LocationOfFiles = getDirectory("Select Folder");
    LocationOfbackgrounds = getDirectory("Select Folder of backgrounds");
    LocationOfSave = getDirectory("Select Save Location");
    setBatchMode(true);
    FileList = getFileList(LocationOfFiles);
    NumberOfFiles = FileList.length;
    for (i=0; i<NumberOfFiles; i+=1) {
        FileName = FileList[i];
        pathtofile1 = LocationOfFiles+FileName;
        open(pathtofile1);
        name1 = getTitle();
        pathtofile2 = LocationOfbackgrounds+FileName;
        open(pathtofile2);
        name2 = getTitle();
        run("Calculator Plus", "i1="+name1+" i2="+name2+" operation=[Divide: i2 = (i1/i2) x k1 + k2] k1=255 k2=0 create");
        selectWindow("Result");
        SaveName = replace(name, ".bmp", "_backgroud subtracted.jpg");
        saveAs("BMP", LocationOfSave+SaveName);
        selectWindow(BackgroundImage);
        close("\\Others");
    }
}
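The batch logic the macro is aiming for can be sketched outside ImageJ: pair files by name across the two folders, then apply the Calculator Plus divide formula `(i1/i2) x k1 + k2` pixel by pixel. This is a rough Python illustration only. `pair_by_name` and `divide_pixels` are hypothetical helpers, plain lists of pixel values stand in for images, and mapping a zero background pixel to 0 and clamping to 0-255 are assumptions rather than documented plugin behavior.

```python
from pathlib import Path


def pair_by_name(image_dir, background_dir, suffix=".bmp"):
    """Return (image, background) path pairs for names present in both folders."""
    pairs = []
    for image_path in sorted(Path(image_dir).glob("*" + suffix)):
        background_path = Path(background_dir) / image_path.name
        if background_path.exists():
            pairs.append((image_path, background_path))
    return pairs


def divide_pixels(i1, i2, k1=255, k2=0):
    """Apply (i1 / i2) * k1 + k2 per pixel, clamped to the 8-bit range."""
    result = []
    for a, b in zip(i1, i2):
        # Assumption: a zero background pixel yields 0 instead of an error.
        value = (a / b) * k1 + k2 if b != 0 else 0
        result.append(max(0, min(255, round(value))))
    return result
```

Pairing by file name first makes a missing background visible up front, before any image is opened.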

Accessing full size (851x315) timeline cover photo via graph api with offset crop

This is a follow-up to https://stackoverflow.com/a/21074207/3269910
The solution there, c0.0.851.315, seems fine,
but the c0.0 values point to the 0x0 position of the image.
What I want is the image with offset_y applied, e.g. offset_y: 20.
That is, I want an API call to return the same size and the same part of the image that I see on my page timeline.
Is it possible to do this?
Note: if I change the zero in c0.0 to some other value, it points to pixels, but I think offset_y is a percentage.
Thanks for your help.
I had a discussion with the Facebook Platform Team;
they are not willing to share the manual calculation with us.
But yes, it is possible to calculate the c0.XXXX value manually from the offset_y value returned by the API.
Here is my PHP code.
$fb_page_cover_json = json_decode( file_get_contents( 'https://graph.facebook.com/11111111?fields=cover'));
$fb_cover_image_url = $fb_page_cover_json->cover->source;
$image_info = getimagesize($fb_cover_image_url);
$image_width = $image_info[0];
$image_height = $image_info[1];
echo getTop($fb_page_cover_json->cover->offset_y,$image_width,$image_height);
function getTop($offset_y,$image_width,$image_height) {
    $cover_w = 851;
    $cover_h = 315;
    $img_w = $image_width;
    $img_h = $image_height;
    $real_img_h = ($cover_w * $img_h / $img_w) - $cover_h;
    $result = ($real_img_h * $offset_y / 100 * -1);
    return floor($result);
}
The getTop function returns the c0.XXX value.
Hope this helps someone.
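For readers who do not use PHP, the same arithmetic can be written in Python. This is a direct port of getTop as posted, not an official Facebook formula; the name `get_top` and the default arguments are choices made for this sketch.

```python
import math


def get_top(offset_y, image_width, image_height, cover_w=851, cover_h=315):
    """Port of the PHP getTop(): vertical crop value for the cover image."""
    # Height of the image once scaled to the cover width, minus the visible part.
    hidden_height = (cover_w * image_height / image_width) - cover_h
    # offset_y is treated as a percentage of the hidden height; the PHP code
    # negates the result before flooring, so this port does the same.
    return math.floor(hidden_height * offset_y / 100 * -1)
```

For a 1702 x 1000 image with an offset_y of 50, the image scales to a height of 500, so 185 pixels are hidden and `get_top(50, 1702, 1000)` returns -93.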

Excessively large overhead in MATLAB .mat file

I am parsing a large text file full of data and then saving it to disk as a *.mat file so that I can easily load in only parts of it (see here for more information on reading in the files, and here for the data). To do so, I read in one line at a time, parse the line, and then append it to the file. The problem is that the file itself is >3 orders of magnitude larger than the data contained therein!
Here is a stripped down version of my code:
database = which('01_hit12.par');
[directory,filename,~] = fileparts(database);
matObj = matfile(fullfile(directory,[filename '.mat']),'Writable',true);
fidr = fopen(database);
hitranTemp = fgetl(fidr);
k = 1;
while ischar(hitranTemp)
    if abs(hitranTemp(1)) == 32;
        hitranTemp(1) = '0';
    end
    hitran = textscan(hitranTemp,'%2u%1u%12f%10f%10f%5f%5f%10f%4f%8f%15c%15c%15c%15c%6u%2u%2u%2u%2u%2u%2u%1c%7f%7f','delimiter','','whitespace','');
    matObj.moleculeNumber(1,k) = uint8(hitran{1});
    matObj.isotopeologueNumber(1,k) = uint8(hitran{2});
    matObj.vacuumWavenumber(1,k) = hitran{3};
    matObj.lineIntensity(1,k) = hitran{4};
    matObj.airWidth(1,k) = single(hitran{6});
    matObj.selfWidth(1,k) = single(hitran{7});
    matObj.lowStateE(1,k) = single(hitran{8});
    matObj.tempDependWidth(1,k) = single(hitran{9});
    matObj.pressureShift(1,k) = single(hitran{10});
    if rem(k,1e4) == 0;
        display(sprintf('line %u (%2.2f)',k,100*k/K));
    end
    hitranTemp = fgetl(fidr);
    k = k + 1;
end
fclose(fidr);
I stopped the code after 13,813 of the 224,515 lines had been parsed because it had been taking a very long time and the file size was getting huge, but the last printout indicated that I had only just cleared 10k lines. I cleared the memory, and then ran:
S = whos('-file','01_hit12.mat');
fileBytes = sum([S.bytes]);
T = dir(which('01_hit12.mat'));
diskBytes = T.bytes;
disp([fileBytes diskBytes diskBytes/fileBytes])
and get the output:
524894 896189009 1707.37141022759
What is taking up the extra 895,664,115 bytes? I know the help page says there should be a little extra overhead, but I feel that nearly a GB of descriptive header is a bit excessive!
New information:
I tried pre-allocating the file, thinking that perhaps MATLAB was doing the same thing it does when a matrix is embiggened in a loop and reallocating a chunk of disk space for the entire matrix on each write, and that isn't it. Filling the file with zeros of the appropriate data types results in a file that my short check script returns:
8531570 71467 0.00837677004349727
This makes more sense to me. MATLAB is saving the file sparsely, so the file size on disk is much smaller than the size of the full matrix in memory. Once it starts replacing values with real data, however, I get the same behavior as before and the file size starts skyrocketing beyond all reasonable bounds.
New new information:
Tried this on a subset of the data, 100 lines long. To stream to disk, the data has to be in v7.3 format, so I ran the subset through my script, loaded it into memory, and then resaved as v7.0 format. Here are the results:
v7.3: 3800 8752 2.30
v7.0: 3800 2561 0.67
No wonder the v7.3 format isn't the default. Does anyone know a way around this? Is this a bug or a feature?
This seems like a bug to me. A workaround is to write in chunks to pre-allocated arrays.
Start off by pre-allocating:
fid = fopen('01_hit12.par', 'r');
data = fread(fid, inf, 'uint8');
nlines = nnz(data == 10) + 1;
fclose(fid);
matObj.moleculeNumber = zeros(1,nlines,'uint8');
matObj.isotopeologueNumber = zeros(1,nlines,'uint8');
matObj.vacuumWavenumber = zeros(1,nlines,'double');
matObj.lineIntensity = zeros(1,nlines,'double');
matObj.airWidth = zeros(1,nlines,'single');
matObj.selfWidth = zeros(1,nlines,'single');
matObj.lowStateE = zeros(1,nlines,'single');
matObj.tempDependWidth = zeros(1,nlines,'single');
matObj.pressureShift = zeros(1,nlines,'single');
Then to write in chunks of 10000, I modified your code as follows:
... % your code plus pre-alloc first
bs = 10000;
while ischar(hitranTemp)
    if abs(hitranTemp(1)) == 32;
        hitranTemp(1) = '0';
    end
    for ii = 1:bs,
        hitran{ii} = textscan(hitranTemp,'%2u%1u%12f%10f%10f%5f%5f%10f%4f%8f%15c%15c%15c%15c%6u%2u%2u%2u%2u%2u%2u%1c%7f%7f','delimiter','','whitespace','');
        hitranTemp = fgetl(fidr);
        if hitranTemp==-1, bs=ii; break; end
    end
    % this part really ugly, sorry! trying to keep it compact...
    matObj.moleculeNumber(1,k:k+bs-1) = uint8(builtin('_paren',cellfun(@(c)c{1},hitran),1:bs));
    matObj.isotopeologueNumber(1,k:k+bs-1) = uint8(builtin('_paren',cellfun(@(c)c{2},hitran),1:bs));
    matObj.vacuumWavenumber(1,k:k+bs-1) = builtin('_paren',cellfun(@(c)c{3},hitran),1:bs);
    matObj.lineIntensity(1,k:k+bs-1) = builtin('_paren',cellfun(@(c)c{4},hitran),1:bs);
    matObj.airWidth(1,k:k+bs-1) = single(builtin('_paren',cellfun(@(c)c{5},hitran),1:bs));
    matObj.selfWidth(1,k:k+bs-1) = single(builtin('_paren',cellfun(@(c)c{6},hitran),1:bs));
    matObj.lowStateE(1,k:k+bs-1) = single(builtin('_paren',cellfun(@(c)c{7},hitran),1:bs));
    matObj.tempDependWidth(1,k:k+bs-1) = single(builtin('_paren',cellfun(@(c)c{8},hitran),1:bs));
    matObj.pressureShift(1,k:k+bs-1) = single(builtin('_paren',cellfun(@(c)c{9},hitran),1:bs));
    k = k + bs;
    fprintf('.');
end
fclose(fidr);
The final size on disk is 21,393,408 bytes. The usage breaks down as,
>> S = whos('-file','01_hit12.mat');
>> fileBytes = sum([S.bytes]);
>> T = dir(which('01_hit12.mat'));
>> diskBytes = T.bytes; ratio = diskBytes/fileBytes;
>> fprintf('%10d whos\n%10d disk\n%10.6f\n',fileBytes,diskBytes,ratio)
8531608 whos
21389582 disk
2.507099
Still fairly inefficient, but not out of control.
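The two ideas in this workaround do not depend on MATLAB: count the lines up front so the arrays can be pre-allocated, then write in blocks rather than one element at a time. Here is a small Python sketch of the same pattern. `count_lines` and `write_in_chunks` are hypothetical names, and a plain list stands in for the on-disk variable.

```python
def count_lines(data: bytes) -> int:
    """Count lines as newline bytes plus one, like nnz(data == 10) + 1."""
    return data.count(b"\n") + 1


def write_in_chunks(values, target, chunk_size=10000):
    """Copy values into the pre-allocated target one block at a time.

    Returns the number of block writes performed.
    """
    writes = 0
    for start in range(0, len(values), chunk_size):
        block = values[start:start + chunk_size]
        target[start:start + len(block)] = block  # one write per block
        writes += 1
    return writes
```

As with the MATLAB expression, a file that ends with a newline is counted as having one extra (empty) line, so the pre-allocated arrays may be one element longer than the data.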

Extract numbers from specific image

I am involved in a project that I think you can help me with. I have multiple images, which you can see here: Images to recognize. The goal is to extract the numbers between the dashed lines. What is the best approach to do that? My idea from the beginning has been to find the coordinates of the dashed lines, crop to them, and then just run OCR software. But it is not easy to find those coordinates; can you help me? Or, if you have a better approach, tell me.
Best regards,
Pedro Pimenta
You may start by looking at more obvious (bigger) objects in your images. The dashed lines are way too small in some images. Searching for the "euros milhoes" logo and the barcode will be easier and it will help you have an idea of the scale and rotation involved.
To find these objects without using match template you can binarize your image (watch out for the background texture) and use the Hu moments on the contours/blobs.
Don't expect a good OCR accuracy on images where the numbers are smaller than 8-10 pixels.
You can use python-tesseract (https://code.google.com/p/python-tesseract/); it works with your image. What you need to do is split the result string. I used your https://www.dropbox.com/sh/kcybs1i04w3ao97/u33YGH_Kv6#f:euro9.jpg to test. The source code is below.
UPDATE:
# -*- coding: utf-8 -*-
from PIL import Image
from PIL import ImageEnhance
import tesseract

im = Image.open('test.jpg')
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(4)
im = im.convert('1')
w, h = im.size
im = im.resize((w * (416 / h), 416))
pix = im.load()
LINE_CR = 0.01
WHITE_HEIGHT_CR = int(h * (20 / 416.0))
status = 0
white_line = []
for i in xrange(h):
    line = []
    for j in xrange(w):
        line.append(pix[(j, i)])
    p = line.count(0) / float(w)
    if not p > LINE_CR:
        white_line.append(i)
wp = None
for i in range(10, len(white_line) - WHITE_HEIGHT_CR):
    k = white_line[i]
    if white_line[i + WHITE_HEIGHT_CR] == k + WHITE_HEIGHT_CR:
        wp = k
        break
result = []
flag = 0
while 1:
    if wp < 0:
        result.append(wp)
        break
    line = []
    for i in xrange(w):
        line.append(pix[(i, wp)])
    p = line.count(0) / float(w)
    if flag == 0 and p > LINE_CR:
        l = []
        for xx in xrange(20):
            l.append(pix[(xx, wp)])
        if l.count(0) > 5:
            break
        l = []
        for xx in xrange(416-1, 416-100-1, -1):
            l.append(pix[(xx, wp)])
        if l.count(0) > 17:
            break
        result.append(wp)
        wp -= 1
        flag = 1
        continue
    if flag == 1 and p < LINE_CR:
        result.append(wp)
        wp -= 1
        flag = 0
        continue
    wp -= 1
result.reverse()
for i in range(1, len(result)):
    if result[i] - result[i - 1] < 15:
        result[i - 1] = -1
result = filter(lambda x: x >= 0, result)
im = im.crop((0, result[0], w, result[-1]))
im.save('test_converted.jpg')
api = tesseract.TessBaseAPI()
api.Init(".","eng",tesseract.OEM_DEFAULT)
api.SetVariable("tessedit_char_whitelist", "0123456789abcdefghijklmnopqrstuvwxyz")
api.SetPageSegMode(tesseract.PSM_AUTO)
mImgFile = "test_converted.jpg"
mBuffer=open(mImgFile,"rb").read()
result = tesseract.ProcessPagesBuffer(mBuffer,len(mBuffer),api)
print "result(ProcessPagesBuffer)=",result
Dependencies: Python 2.7, python-tesseract-win32, python-opencv, numpy, and PIL, and be sure to follow python-tesseract's remember to .
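The first loop in the script is a horizontal projection: a row counts as "white" when the fraction of black pixels in it does not exceed `LINE_CR`. Below is a minimal Python 3 sketch of just that step. It uses a nested list of 0 (black) and 255 (white) values in place of the PIL pixel access object, and `find_white_rows` is a hypothetical name.

```python
def find_white_rows(pixels, line_cr=0.01):
    """Return indices of rows whose share of black (0) pixels is <= line_cr."""
    white_rows = []
    for y, row in enumerate(pixels):
        black_fraction = row.count(0) / len(row)
        if not black_fraction > line_cr:
            white_rows.append(y)
    return white_rows
```

Runs of consecutive indices in the result correspond to blank bands between printed lines, which is what the script then searches for with `WHITE_HEIGHT_CR`.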

Preserving colors during CMYK to RGB transformation in PIL

I'm using PIL to process uploaded images. Unfortunately, I'm having trouble with color conversion from CMYK to RGB, as the resulting image's tone and contrast change.
I suspect that it's only doing direct number transformations. Does PIL, or anything built on top of it, have an Adobe-style, dummy-proof "consume embedded profile, convert to destination, preserve numbers" tool I can use for conversion?
In all my healthy ignorance and inexperience, this sort of jumped at me and it's got me in a pinch. I'd really like to get this done without engaging any intricacies of color spaces, transformations and the necessary math for both at this point.
Though I've never used it before, I'm also open to using ImageMagick for this processing step if anyone has experience showing that it can do this gracefully.
So it didn't take me long to run into other people mentioning Little CMS as the most popular open source solution for color management. I ended up snooping around for Python bindings and found the old pyCMS and some ostensible notions about PIL supporting Little CMS.
Indeed, there is support for Little CMS, it's mentioned in a whole whopping one-liner:
CMS support: littleCMS (1.1.5 or later is recommended).
The documentation contains no references and no topical guides, Google didn't turn up anything, and their mailing list is closed... but digging through the source, there's a PIL.ImageCms module that's well documented and gets the job done. Hope this saves someone from a messy internet excavation.
Goes off getting himself a cookie...
It's 2019 and things have changed. Your problem is significantly more complex than it may appear at first sight. The problem is that CMYK to RGB and RGB to CMYK is not a simple there and back. If, for example, you open an image in Photoshop and convert it there, the conversion takes two additional parameters: the source color profile and the destination color profile. These change things greatly! For a typical use case, you would assume Adobe RGB 1998 on the RGB side and, say, Coated FOGRA 39 on the CMYK side. These two additional pieces of information tell the converter how to deal with the colors on input and output. What you need next is a transformation mechanism, and Little CMS is indeed a great tool for this. It is MIT licensed and, after looking for solutions myself for a considerable time, I would recommend the following setup if you do need a Python way to transform colors:
Python 3.X (necessary because of littlecms)
pip install littlecms
pip install Pillow
In littlecms' /tests folder you will find a great set of examples. I will allow myself a particular adaptation of one test. Before you get the code, let me tell you something about those color profiles. On Windows, as in my case, you will find a set of files with an .icc extension in the folder C:\Windows\System32\spool\drivers\color, where Windows stores its color profiles. You can download other profiles from sites like https://www.adobe.com/support/downloads/iccprofiles/iccprofiles_win.html and install them on Windows simply by double-clicking the corresponding .icc file. The example I provide depends on such profile files, which Little CMS uses to do those magic color transforms. I work as a semi-professional graphics designer and needed to be able to convert colors from CMYK to RGB and vice versa for certain scripts that manipulate objects in InDesign. My setup is RGB: Adobe RGB 1998 and CMYK: Coated FOGRA 39 (these settings were recommended by most of the book printers I use). With these color profiles I got results very similar to the same transforms made by Photoshop and InDesign. Still, be warned: the colors are slightly (by around 1%) off compared with what PS and Id will give you for the same inputs. I am trying to figure out why...
The little program:
import littlecms as lc
from PIL import Image

def rgb2cmykColor(rgb, psrc='C:\\Windows\\System32\\spool\\drivers\\color\\AdobeRGB1998.icc', pdst='C:\\Windows\\System32\\spool\\drivers\\color\\CoatedFOGRA39.icc') :
    ctxt = lc.cmsCreateContext(None, None)
    white = lc.cmsD50_xyY() # Set white point for D50
    dst_profile = lc.cmsOpenProfileFromFile(pdst, 'r')
    src_profile = lc.cmsOpenProfileFromFile(psrc, 'r') # cmsCreate_sRGBProfile()
    transform = lc.cmsCreateTransform(src_profile, lc.TYPE_RGB_8, dst_profile, lc.TYPE_CMYK_8,
                                      lc.INTENT_RELATIVE_COLORIMETRIC, lc.cmsFLAGS_NOCACHE)
    n_pixels = 1
    in_comps = 3
    out_comps = 4
    rgb_in = lc.uint8Array(in_comps * n_pixels)
    cmyk_out = lc.uint8Array(out_comps * n_pixels)
    for i in range(in_comps):
        rgb_in[i] = rgb[i]
    lc.cmsDoTransform(transform, rgb_in, cmyk_out, n_pixels)
    cmyk = tuple(cmyk_out[i] for i in range(out_comps * n_pixels))
    return cmyk

def cmyk2rgbColor(cmyk, psrc='C:\\Windows\\System32\\spool\\drivers\\color\\CoatedFOGRA39.icc', pdst='C:\\Windows\\System32\\spool\\drivers\\color\\AdobeRGB1998.icc') :
    ctxt = lc.cmsCreateContext(None, None)
    white = lc.cmsD50_xyY() # Set white point for D50
    dst_profile = lc.cmsOpenProfileFromFile(pdst, 'r')
    src_profile = lc.cmsOpenProfileFromFile(psrc, 'r') # cmsCreate_sRGBProfile()
    transform = lc.cmsCreateTransform(src_profile, lc.TYPE_CMYK_8, dst_profile, lc.TYPE_RGB_8,
                                      lc.INTENT_RELATIVE_COLORIMETRIC, lc.cmsFLAGS_NOCACHE)
    n_pixels = 1
    in_comps = 4
    out_comps = 3
    cmyk_in = lc.uint8Array(in_comps * n_pixels)
    rgb_out = lc.uint8Array(out_comps * n_pixels)
    for i in range(in_comps):
        cmyk_in[i] = cmyk[i]
    lc.cmsDoTransform(transform, cmyk_in, rgb_out, n_pixels)
    rgb = tuple(rgb_out[i] for i in range(out_comps * n_pixels))
    return rgb

def rgb2cmykImage(PILImage, psrc='C:\\Windows\\System32\\spool\\drivers\\color\\AdobeRGB1998.icc', pdst='C:\\Windows\\System32\\spool\\drivers\\color\\CoatedFOGRA39.icc') :
    ctxt = lc.cmsCreateContext(None, None)
    white = lc.cmsD50_xyY() # Set white point for D50
    dst_profile = lc.cmsOpenProfileFromFile(pdst, 'r')
    src_profile = lc.cmsOpenProfileFromFile(psrc, 'r')
    transform = lc.cmsCreateTransform(src_profile, lc.TYPE_RGB_8, dst_profile, lc.TYPE_CMYK_8,
                                      lc.INTENT_RELATIVE_COLORIMETRIC, lc.cmsFLAGS_NOCACHE)
    n_pixels = PILImage.size[0]
    in_comps = 3
    out_comps = 4
    n_rows = 16  # the image is processed in strips of 16 rows
    rgb_in = lc.uint8Array(in_comps * n_pixels * n_rows)
    cmyk_out = lc.uint8Array(out_comps * n_pixels * n_rows)
    outImage = Image.new('CMYK', PILImage.size, 'white')
    in_row = Image.new('RGB', (PILImage.size[0], n_rows), 'white')
    out_row = Image.new('CMYK', (PILImage.size[0], n_rows), 'white')
    out_b = bytearray(n_pixels * n_rows * out_comps)
    row = 0
    while row < PILImage.size[1] :
        in_row.paste(PILImage, (0, -row))
        data_in = in_row.tobytes('raw')
        j = in_comps * n_pixels * n_rows
        for i in range(j):
            rgb_in[i] = data_in[i]
        lc.cmsDoTransform(transform, rgb_in, cmyk_out, n_pixels * n_rows)
        # Copy by index; iterating over the array itself would use pixel values as indices.
        for j in range(out_comps * n_pixels * n_rows):
            out_b[j] = cmyk_out[j]
        out_row = Image.frombytes('CMYK', in_row.size, bytes(out_b))
        outImage.paste(out_row, (0, row))
        row += n_rows
    return outImage

def cmyk2rgbImage(PILImage, psrc='C:\\Windows\\System32\\spool\\drivers\\color\\CoatedFOGRA39.icc', pdst='C:\\Windows\\System32\\spool\\drivers\\color\\AdobeRGB1998.icc') :
    ctxt = lc.cmsCreateContext(None, None)
    white = lc.cmsD50_xyY() # Set white point for D50
    dst_profile = lc.cmsOpenProfileFromFile(pdst, 'r')
    src_profile = lc.cmsOpenProfileFromFile(psrc, 'r')
    transform = lc.cmsCreateTransform(src_profile, lc.TYPE_CMYK_8, dst_profile, lc.TYPE_RGB_8,
                                      lc.INTENT_RELATIVE_COLORIMETRIC, lc.cmsFLAGS_NOCACHE)
    n_pixels = PILImage.size[0]
    in_comps = 4
    out_comps = 3
    n_rows = 16  # the image is processed in strips of 16 rows
    cmyk_in = lc.uint8Array(in_comps * n_pixels * n_rows)
    rgb_out = lc.uint8Array(out_comps * n_pixels * n_rows)
    outImage = Image.new('RGB', PILImage.size, 'white')
    in_row = Image.new('CMYK', (PILImage.size[0], n_rows), 'white')
    out_row = Image.new('RGB', (PILImage.size[0], n_rows), 'white')
    out_b = bytearray(n_pixels * n_rows * out_comps)
    row = 0
    while row < PILImage.size[1] :
        in_row.paste(PILImage, (0, -row))
        data_in = in_row.tobytes('raw')
        j = in_comps * n_pixels * n_rows
        for i in range(j):
            cmyk_in[i] = data_in[i]
        lc.cmsDoTransform(transform, cmyk_in, rgb_out, n_pixels * n_rows)
        # Copy by index; iterating over the array itself would use pixel values as indices.
        for j in range(out_comps * n_pixels * n_rows):
            out_b[j] = rgb_out[j]
        out_row = Image.frombytes('RGB', in_row.size, bytes(out_b))
        outImage.paste(out_row, (0, row))
        row += n_rows
    return outImage
Something to note for anyone implementing this: you probably want to take the uint8 CMYK values (0-255) and round them into the range 0-100 to better match most color pickers and uses of these values. See my code here: https://gist.github.com/mattdesl/ecf305c2f2b20672d682153a7ed0f133
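That rescaling amounts to one line. Here is a minimal sketch, assuming plain rounding to the nearest integer is acceptable (the linked gist may handle it differently); `cmyk_to_percent` is a hypothetical name.

```python
def cmyk_to_percent(cmyk):
    """Map 8-bit CMYK components (0-255) to rounded percentages (0-100)."""
    return tuple(round(component * 100 / 255) for component in cmyk)
```

For example, the tuple returned by `rgb2cmykColor` could be passed straight through this helper before being shown next to a color picker.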