iText PDFWrite merges the PDF layers, so the generated PDF also shows the hidden layers as spots

Using the iText PDFWrite class, we render inputFile into outputFile. Because inputFile has multiple layers, the output PDF (outputFile) shows the outlines of its internal layers. PDFWrite apparently merges the PDF layers while rendering, and we want to avoid that: only the visible (top) layers should be rendered. We use PDFWrite instead of PDFCopy because we perform matrix operations (move, rotate, scale, etc.) on inputFile.
Files:
Layered Image
Input file
Output file

A bit late but as far as I understand the question, you want to ensure that only visible layers are rendered during processing because your file has some invisible or hidden layers which re-appear. Since iText support will not help you, there is another (free/GPL License) way.
The first step is to 'burn in' the state of the layers and remove the optional content while honoring the visibility state, using another processor. Once that step is done you can continue with whatever version of iText you want.
There are two processing options that can 'burn in' the visibility state of Optional Content Groups (OCGs, i.e. layers):
Option 1
Use Ghostscript (www.ghostscript.com) with -sDEVICE=pdfwrite on the command line. See the Ghostscript documentation for the full command line.
Option 2
Use Poppler's pdftocairo.exe with the -pdf option.
Both will create a PDF that has no interactive layers, with OCG removed. All iText operations will work as normal on that file.
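As a rough sketch of the two options above, the pre-processing step could be scripted like this before handing the file to iText. The helper names and file names are made up for illustration, and the executable names are assumptions (on Windows, Ghostscript is typically gswin64c.exe and Poppler ships pdftocairo.exe). Only the most basic flags are used; check each tool's documentation for the full command line.

```python
import subprocess
from typing import List


def ghostscript_flatten_cmd(src: str, dst: str, gs: str = "gs") -> List[str]:
    """Build a Ghostscript command that rewrites src through the pdfwrite device."""
    # -o sets the output file and implies -dBATCH -dNOPAUSE.
    return [gs, "-sDEVICE=pdfwrite", "-o", dst, src]


def pdftocairo_flatten_cmd(src: str, dst: str, exe: str = "pdftocairo") -> List[str]:
    """Build a Poppler pdftocairo command that writes src out as a new PDF."""
    return [exe, "-pdf", src, dst]


def run(cmd: List[str]) -> None:
    """Run the command and raise CalledProcessError if the tool fails."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file names; pick one of the two options.
    run(ghostscript_flatten_cmd("input.pdf", "flattened.pdf"))
```

The flattened file, not the original, is then what you feed to iText.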


How to export graphics in MATLAB with exportdlg.m to SVG with embedded fonts?

Hello last_hope_community, ;)
I have tried several export options. The standard export to SVG works well:
load patients
figure
tbl = table(LastName,Age,Gender,SelfAssessedHealthStatus,...
Smoker,Weight,Location);
h = heatmap(tbl,'Smoker','SelfAssessedHealthStatus');
saveas(gcf,["test_heatp_save_svg.svg"],'svg') %no export style-> fonts gets embedded
%print(gcf,["test_heatp_print_pdf"],'-dpdf','-bestfit','-r0')
That covers saveas and print, but here is the first problem.
With setprinttemplate one cannot adopt the styles from the export setup (hgexport and so on). print has the very nice -bestfit option, which can be used when printing to PDF (print embeds fonts automatically); the PDF can then be imported into Inkscape and used as SVG. But when an export style is applied, either by exporting manually and saving as SVG from exportdlg or by using saveas, all text is converted to vector paths.
Is there a way to apply export styles automatically and then get an SVG file in which the fonts are embedded, as with the code above?
I have already checked whether this is just a heatmap issue, and it is not. The same happens with the following code: without an export style everything is fine, but with an export style all text becomes vector paths.
figure
plot(1:10000)
title('Hallo')
saveas(gcf,["test_simple_save_svg.svg"],'svg')
%print(gcf,["test_simple_print_pdf"],'-dpdf','-bestfit','-r0')
Ideally, someone knows a way to apply export styles programmatically and then save as SVG with embedded fonts, or to apply export styles programmatically and then print to PDF with -bestfit. Exporting as PDF embeds only the titles and not the TickLabels.
Additionally, print with -bestfit is not always good, because the result often differs considerably from what the figure window shows.
Thanks for any help. It's annoying that exporting as SVG turns text into paths that cannot be edited in other programs.

Merging PDFs with Ghostscript, ignoring outlines and using pdfmark instead

I am using a batch script to merge different PDFs into one complete file.
%gsc% -dBATCH -sDEVICE=pdfwrite -sPAPERSIZE=letter -dEPSFitPage -o %dsk%%zus%%ext% %mfd% %pth%tmp\pdfmarks
%dsk%%zus%%ext%: Path and name of final (complete) document
%mfd%: Path and name of docs to be merged (c:\test\1.pdf c:\test\2.pdf ...)
%pth%tmp = path to the pdfmarks file
Additionally, I create a pdfmark document inside the script, which gs uses to create the bookmarks. Unfortunately, some of the docs I am merging already have their own bookmarks, and I have not yet found a way to ignore those. GS should only use the bookmarks inside the pdfmarks file.
How can this be done?
Firstly, you are not 'merging' PDF files when you use Ghostscript's pdfwrite device. The process is described in detail here
The important point is that the way the input file(s) are constructed has no bearing on the way the output file is constructed. If any other software you use relies on the file being constructed in a particular fashion it may not work on the output PDF file.
The -dEPSFitPage switch only has any effect when the input is an EPS file. If you want to 'fit' PostScript or PDF files then you need to use -dPDFFitPage, -dPSFitPage or just -dFitPage. However, all of these rely on you first selecting a media size, and then preventing it being altered by setting -dFIXEDMEDIA. For EPS files you would more normally use -dEPSCrop which sets the media size to the EPS declared BoundingBox.
You can prevent the PDF interpreter reading the Outlines tree (which you are calling Bookmarks) and then creating a pdfmark from it to pass to the pdfwrite device by using the -dNO_PDFMARK_OUTLINES switch which oddly isn't documented, presumably an oversight.
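Putting the points above together, the command could be assembled roughly as follows. This is only a sketch: the function name and file names are invented for illustration, the switches are the ones named in this answer, and swapping -dEPSFitPage for -dFIXEDMEDIA plus -dPDFFitPage assumes the inputs are PDFs that should be fitted to letter size.

```python
import subprocess
from typing import List, Sequence


def merge_cmd(output: str, inputs: Sequence[str], pdfmarks: str,
              gs: str = "gs") -> List[str]:
    """Build a pdfwrite command that ignores the inputs' own outlines."""
    cmd = [
        gs,
        "-sDEVICE=pdfwrite",
        "-sPAPERSIZE=letter",      # select the media size first ...
        "-dFIXEDMEDIA",            # ... and stop the inputs from changing it
        "-dPDFFitPage",            # fit PDF pages to that media
        "-dNO_PDFMARK_OUTLINES",   # do not turn existing Outlines into pdfmarks
        "-o", output,              # -o implies -dBATCH -dNOPAUSE
    ]
    cmd.extend(inputs)             # documents, in the order they should appear
    cmd.append(pdfmarks)           # bookmark definitions, processed last
    return cmd


if __name__ == "__main__":
    # Hypothetical paths, mirroring the question's example.
    subprocess.run(
        merge_cmd("complete.pdf", [r"c:\test\1.pdf", r"c:\test\2.pdf"], r"tmp\pdfmarks"),
        check=True,
    )
```

The pdfmarks file stays at the end of the argument list, as in the original batch line.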

Method to decompress a PDF (non-Adobe) while retaining form fields?

I found a similar question that involves Acrobat, but in this case the PDF was made with a combination of MS Word and CenoPDF v3, with which I'm unfamiliar. Additionally the PDF is version 1.3. I'd like to decompress it, to see its low-level workings and make some changes. It's easy with GhostScript's -dCompressPages=false parameter, but that simultaneously strips all the fill-in form functionality. Is there a method for decompressing the file while leaving everything else intact? A quick search of the docs for tcpdf and fpdi (cited in the link) didn't reveal a compression option.
Ghostscript and pdfwrite aren't a good combination for this. The PDF file you get out is NOT the same as the one you put in. This is because of the way that Ghostscript and pdfwrite work: the input is fully interpreted to a sequence of graphics primitives, which is sent to the Ghostscript graphics library. These are then sent to the requested device. Most devices then render the result to a bitmap, but the pdfwrite family reassemble those graphics primitives into a new PDF file.
Note that the contents of the new PDF file have no relationship to the original, other than the appearance when rendered. Ghostscript and pdfwrite do maintain much of the non-marking content of PDF files such as hyperlinks and so on (which obviously don't get turned into graphics primitives), by interpreting them into pdfmark operations (an extension to the PostScript language defined by Adobe). However, even if Ghostscript and pdfwrite maintained all this content, the resulting PDF file wouldn't be the same as the original one decompressed....
There are tools which will decompress PDF files, and I would recommend one of our other products, MuPDF. A part of this is mutool, and "mutool clean -d in.pdf out.pdf" will decompress pretty much everything in a PDF file.
QPDF can decompress PDF documents (among other things). I used this tool in the past and it preserved forms and data.
The tool has some issues with large PDFs (can take too much time and memory for decompression). The tool can produce incomplete output (with warnings in console) for some partially broken / nonstandard PDFs.
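For reference, a minimal sketch of how either tool might be invoked from a script. The mutool line is the one quoted above. The QPDF flags (--qdf and --object-streams=disable) are not from this thread; they are the options I would reach for to get an uncompressed, editable file, so check them against the QPDF manual. Function and file names are placeholders.

```python
import subprocess
from typing import List


def mutool_decompress_cmd(src: str, dst: str) -> List[str]:
    """mutool clean -d: rewrite the file with streams decompressed."""
    return ["mutool", "clean", "-d", src, dst]


def qpdf_decompress_cmd(src: str, dst: str) -> List[str]:
    """QPDF in QDF mode: uncompressed streams, no object streams (assumed flags)."""
    return ["qpdf", "--qdf", "--object-streams=disable", src, dst]


if __name__ == "__main__":
    # Hypothetical file names; afterwards, confirm the form fields survived.
    subprocess.run(qpdf_decompress_cmd("form.pdf", "form_uncompressed.pdf"), check=True)
```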

Defining what is a line in Tesseract

I'm working on document recognition for scanned bank statements. The statements that I have are organized by lines, such as the one attached. Because Tesseract does such a good job at detecting the areas of text, it breaks the lines in the middle. I'm assuming this is because of the large white space between the first block in the line (blurred for privacy reasons) and the next one ('EUR' or 'COURS').
In the hocr file, the bbox of all the elements in the line are within 2px or so, so I could potentially rebuild a line myself. However, this seems more like a hack. Is there a way to tell Tesseract that lines should be as wide as the document itself? Or would there be another way to go about it? I've tried playing with the psm option, but with no luck.
-psm 6 -- Assume a single uniform block of text -- should work. If not, you may want to use the older version 2.0x, which does not perform page layout analysis.
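If the page segmentation mode does not help, the fallback mentioned in the question (rebuilding lines from the hOCR bounding boxes) is not much code. Below is a minimal sketch, assuming the words have already been pulled out of the hOCR file as (text, x0, y0, x1, y1) tuples; the Word alias, the function name and the 3 px tolerance are all illustrative choices, not anything Tesseract provides.

```python
from typing import List, Tuple

# (text, x0, y0, x1, y1), with the bbox taken from the hOCR title attribute
Word = Tuple[str, int, int, int, int]


def _center_y(word: Word) -> float:
    return (word[2] + word[4]) / 2


def group_into_lines(words: List[Word], tol: float = 3.0) -> List[str]:
    """Merge words whose vertical centers lie within tol pixels into one line."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=_center_y):
        if lines:
            current = lines[-1]
            mean_center = sum(_center_y(w) for w in current) / len(current)
            if abs(_center_y(word) - mean_center) <= tol:
                current.append(word)
                continue
        lines.append([word])
    # Within each line, order words left to right by x0.
    return [" ".join(w[0] for w in sorted(line, key=lambda w: w[1]))
            for line in lines]


if __name__ == "__main__":
    sample = [
        ("EUR", 400, 101, 440, 121),
        ("Account", 10, 100, 90, 120),
        ("Total", 10, 140, 60, 160),
        ("COURS", 400, 142, 460, 161),
    ]
    print(group_into_lines(sample))
```

The tolerance should stay well below the line pitch of the scan, otherwise adjacent rows get merged.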

Inkscape command line programming

I'd like to be able to derive new images from a pre-existing image from the command line. To do that, I'd turn on/off specific layers that have portions of the image and then save the resulting image to a file. However, while I can see a number of commands listed in the help to manipulate layers, I don't see any that would allow one to select a specific one and turn it on/off.
If what you want to do can be achieved by deleting a few unwanted elements by their id (say, layer17 and layer4711), you can do it this way:
inkscape image.svg \
--select=layer17 --verb=EditDelete \
--select=layer4711 --verb=EditDelete \
--verb=FileSave --verb=FileClose
Note that this will overwrite image.svg with the result, so if you're scripting this, be sure to work on a copy rather than your originals.
inkscape image.svg --export-id-only --export-id=layer17 --export-png=image.png --export-width=100 --export-height=100
On a Mac you might have to do:
/Applications/Inkscape.app/Contents/Resources/bin/inkscape --without-gui --file=image.svg --export-id-only --export-id=layer17 --export-png=image.png --export-width=100 --export-height=100
I've written an Inkscape extension for work like this. It outputs one file for each option layer found. It will also show various layer combinations as needed. Scriptable as well. I call it the SLiCk Layer Combinator:
https://github.com/juanitogan/slick
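Another route, different from the approaches above, is to toggle layer visibility in the SVG itself and then export the copy with the usual command line. A rough sketch, assuming the file uses Inkscape's normal layer markup (a g element with inkscape:groupmode="layer") and that it is fine to overwrite each layer's style attribute; the function name is made up.

```python
import xml.etree.ElementTree as ET
from typing import Set

SVG_NS = "http://www.w3.org/2000/svg"
INKSCAPE_NS = "http://www.inkscape.org/namespaces/inkscape"


def set_layer_visibility(svg_text: str, visible_ids: Set[str]) -> str:
    """Show only the layers whose id is in visible_ids and hide the others."""
    # Keep the usual prefixes when the tree is serialized again.
    ET.register_namespace("", SVG_NS)
    ET.register_namespace("inkscape", INKSCAPE_NS)
    root = ET.fromstring(svg_text)
    for group in root.iter(f"{{{SVG_NS}}}g"):
        if group.get(f"{{{INKSCAPE_NS}}}groupmode") != "layer":
            continue  # ordinary group, not a layer
        shown = group.get("id") in visible_ids
        # Simplification: this replaces the whole style attribute,
        # so any other per-layer style (e.g. opacity) is lost.
        group.set("style", "display:inline" if shown else "display:none")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical file names; always write to a copy, not the original.
    with open("image.svg", encoding="utf-8") as src:
        result = set_layer_visibility(src.read(), {"layer17"})
    with open("image_layer17.svg", "w", encoding="utf-8") as dst:
        dst.write(result)
```

The written copy can then be passed to inkscape with --export-png as in the commands above.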