pdf2htmlEX coredumps while processing a PDF

I used the following command to transform a PDF to HTML, and then got a coredump:
./pdf2htmlEX --zoom 1 --dest-dir ./pdf_test --optimize-text 1 --zoom 1.4 --process-outline 0 --embed-image 0 --font-format ttf pdf_test/020616320411_2.pdf
[coredump message screenshot: https://i.stack.imgur.com/RuL9N.png]
I tried many approaches and compared the differing PDFs, and came to these conclusions:
The coredumping PDF contains an "Arial Unicode MS" font; I suspect that is the cause.
I created a test PDF using "Arial Unicode MS" and the same content, and pdf2htmlEX handled it fine.
The test PDF and the coredumping PDF have the same content but different file sizes:
test PDF file size: 502,032 bytes
coredump PDF file size: 445,916 bytes

Related

Compress PDF content with Zopfli GZIP Compression

I want to do the following for PDF compression (see the sketch below):
Extract the text from a PDF
Use Zopfli for efficient gzip compression of the text
Insert the compressed text back into the PDF
Is this feasible using PDFBox or iText?
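In principle the compression step itself is feasible: PDF FlateDecode streams use the zlib format, and Zopfli can emit zlib-compatible output, so extraction and re-insertion would be the PDFBox/iText part of the job. Below is a minimal sketch of just the Zopfli step using Google's zopfli C library; the input file name is a placeholder, and the text is assumed to have been extracted already:

#include <stdio.h>
#include <stdlib.h>
#include "zopfli.h"  /* from google/zopfli */

int main(void) {
    /* Read the already-extracted text (hypothetical file name). */
    FILE *f = fopen("text.txt", "rb");
    if (!f) return 1;
    fseek(f, 0, SEEK_END);
    long insize = ftell(f);
    fseek(f, 0, SEEK_SET);
    unsigned char *in = malloc((size_t)insize);
    fread(in, 1, (size_t)insize, f);
    fclose(f);

    ZopfliOptions options;
    ZopfliInitOptions(&options);  /* defaults, e.g. 15 iterations */

    /* ZOPFLI_FORMAT_ZLIB matches the format of PDF FlateDecode streams. */
    unsigned char *out = NULL;
    size_t outsize = 0;
    ZopfliCompress(&options, ZOPFLI_FORMAT_ZLIB, in, (size_t)insize, &out, &outsize);

    printf("%ld bytes -> %zu bytes\n", insize, outsize);
    free(in);
    free(out);
    return 0;
}

Note that Zopfli typically gains only a few percent over ordinary zlib output, at a much higher CPU cost, so whether the round trip pays off depends on the PDF.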

Converted stream file of iText PDF not opening in MS Word

Our project has a requirement to generate an end report both as a PDF and as an MS Word document. We are using iTextSharp to dynamically generate the tables and rows in the report. Finally, we upload the file to the server as PDF and as MS Word: both are converted to a byte array/stream and saved as a PDF and an MS Word document. The uploaded PDF works as expected, but the MS Word file gives an error and does not open (screenshot attached).
iTextSharp doesn't produce MS Word documents, so this isn't really an iText question. Looking at your screenshot, I see that you are trying to import a PDF file into Word. Since Word can't interpret PDF syntax, it shows you the raw syntax of the PDF file:
%PDF-1.4
%âãÏÓ
1 0 obj
<</Type/Font...
Your premise is wrong: you are not using iTextSharp to create a PDF file and an MS Word file; you are using iTextSharp to create a PDF file only.
There is no such thing as "save a PDF as an MS Word file" in iTextSharp, and it is extremely difficult to find any other tool that converts a PDF document to a Word document acceptably. (Such tools exist, but the quality is suboptimal for PDFs that weren't made with conversion to another format in mind.)

Determine whether file is a PDF in perl?

Using Perl, what is the best way to determine whether a file is a PDF?
Apparently, not all PDFs start with %PDF. See the comments on this answer: https://stackoverflow.com/a/941962/327528
Detecting a PDF is not hard, but there are some corner cases to be aware of.
All conforming PDFs contain a one-line header identifying the PDF specification to which the file conforms. Usually it's %PDF-1.N where N is a digit between 0 and 7.
The third edition of the PDF Reference has an implementation note that Acrobat viewers require only that the header appear within the first 1024 bytes of the file. (I've seen cases where a job-control prefix was added to the start of a PDF file, so '%PDF-1.' was not the first seven bytes of the file.)
The subsequent implementation note in the third edition (PDF 1.4) states that Acrobat viewers will also accept a header of the form %!PS-Adobe-N.n PDF-M.m, but note that this isn't part of the ISO 32000-1:2008 (PDF 1.7) specification.
If the file doesn't begin immediately with %PDF-1.N, be careful: I've seen a case where a zip file containing a PDF was mistakenly identified as a PDF because that part of the embedded file wasn't compressed. So a check for the PDF file trailer is a good idea.
The end of a PDF will contain a line with '%%EOF'.
The third edition of the PDF Reference has an implementation note that Acrobat viewers require only that the %%EOF marker appear within the last 1024 bytes of the file.
Two lines above the %%EOF should be the 'startxref' token, and the line in between should be a number giving the byte offset from the start of the file to the last cross-reference table.
In sum: read the first and last 1 KB of the file into byte buffers, check that the identifying tokens are approximately where they are supposed to be, and if they are, you have a reasonable expectation that you have a PDF file on your hands; a sketch follows.
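The question asks for Perl, but the checks are easy to express in any language; here is a rough C sketch of the logic above (the 1024-byte windows and the token names come from this answer, everything else is illustrative):

#include <stdio.h>
#include <string.h>

/* Naive search for a token inside a binary buffer. */
static int contains(const unsigned char *buf, size_t len, const char *token) {
    size_t n = strlen(token);
    for (size_t i = 0; n <= len && i <= len - n; i++)
        if (memcmp(buf + i, token, n) == 0) return 1;
    return 0;
}

int looks_like_pdf(const char *path) {
    unsigned char head[1024], tail[1024];
    FILE *f = fopen(path, "rb");
    if (!f) return 0;
    size_t hn = fread(head, 1, sizeof head, f);
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    long off = size > (long)sizeof tail ? size - (long)sizeof tail : 0;
    fseek(f, off, SEEK_SET);
    size_t tn = fread(tail, 1, sizeof tail, f);
    fclose(f);
    /* Header within the first 1 KB, trailer tokens within the last 1 KB. */
    return contains(head, hn, "%PDF-")
        && contains(tail, tn, "startxref")
        && contains(tail, tn, "%%EOF");
}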
The module PDF::Parse has a method called IsaPDF, which:
Returns true, if the file could be parsed and is a PDF-file.

How to draw Thai text to a PDF file using the libharu library

I am using the free PDF library libharu to generate a PDF file, but I have an encoding problem: I cannot draw Thai-language text on the PDF; all the text shows as "???..".
Does somebody know how to fix it?
Thanks
I have succeeded in rendering ideographic text (not Thai, but Chinese and Japanese) using libharu. First of all, I used Unicode mode; please refer to the HPDF_UseUTFEncodings() function documentation.
For the C language, here is the sequence of libharu API calls needed to overcome your trouble:
HPDF_UseUTFEncodings(docHandle);
HPDF_SetCurrentEncoder(docHandle, "UTF-8");
Here docHandle is a valid HPDF_Doc object.
The next step is to load and use a UTF font properly:
const char *libFontName = HPDF_LoadTTFontFromFile(docHandle, fontFileName, HPDF_TRUE);  /* HPDF_TRUE embeds the font */
HPDF_Font font = HPDF_GetFont(docHandle, libFontName, "UTF-8");
After these calls you may render Unicode text containing Thai characters. Also note the embedding flag (3rd parameter of HPDF_LoadTTFontFromFile): without embedding, your PDF file may be unreadable on machines that lack the font, because the file only references it externally. If you are not worried about output PDF size, just embed the fonts.
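Put together, a minimal end-to-end sketch of the above (the font path, coordinates, and Thai sample string are placeholders; error handling is omitted):

#include "hpdf.h"

int main(void) {
    HPDF_Doc pdf = HPDF_New(NULL, NULL);   /* no custom error handler */
    HPDF_UseUTFEncodings(pdf);
    HPDF_SetCurrentEncoder(pdf, "UTF-8");

    /* HPDF_TRUE embeds the font so the file is readable without it installed. */
    const char *name = HPDF_LoadTTFontFromFile(pdf, "thai.ttf", HPDF_TRUE);
    HPDF_Font font = HPDF_GetFont(pdf, name, "UTF-8");

    HPDF_Page page = HPDF_AddPage(pdf);
    HPDF_Page_SetFontAndSize(page, font, 14);
    HPDF_Page_BeginText(page);
    HPDF_Page_TextOut(page, 50, 750, "สวัสดี");  /* UTF-8 encoded Thai text */
    HPDF_Page_EndText(page);

    HPDF_SaveToFile(pdf, "thai.pdf");
    HPDF_Free(pdf);
    return 0;
}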
I've tested a couple of Thai .ttf fonts found via Google and they rendered OK. Also (it may be important, but I'm not sure) I'm using the fork of libharu at https://github.com/kdeforche/libharu, which has since been merged into the master branch.
When you write text to the PDF, use the correct font and encoding. The libharu documentation lists all the possibilities: https://github.com/libharu/libharu/wiki/Fonts
In your case, you must use "ISO8859-11" (Thai, TIS 620-2569 character set).
An example (with Spanish text):
HPDF_Font fontEn = HPDF_GetFont(pdf, "Helvetica-Bold", "ISO8859-2");
HPDF_Page_TextOut(page1, 50.00, 750.00, "Código para correcta codificación en libharu");  /* the string bytes must be in the chosen single-byte encoding (here ISO8859-2), not UTF-8 */
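Adapted to the Thai case, the same pattern would look roughly like this, assuming libharu accepts "ISO8859-11" as an encoder name (as the wiki table suggests) and that the string bytes are TIS-620 encoded; note that a Base-14 font such as Helvetica has no Thai glyphs, so a Thai TrueType font is still needed:

const char *name = HPDF_LoadTTFontFromFile(pdf, "thai.ttf", HPDF_TRUE);
HPDF_Font thaiFont = HPDF_GetFont(pdf, name, "ISO8859-11");
HPDF_Page_SetFontAndSize(page1, thaiFont, 14);
HPDF_Page_BeginText(page1);
HPDF_Page_TextOut(page1, 50.00, 700.00, tis620_bytes);  /* TIS-620 encoded byte string */
HPDF_Page_EndText(page1);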

iTextSharp - Problem while extracting images with CCITTFaxDecode

I am using iTextSharp to extract images from a PDF. However, if the images are CCITT fax encoded, the bitmap creation fails with a "Parameter not valid" error.
Since PdfReader.GetStreamBytesRaw returns the still-encoded CCITT bytes, bitmap creation fails.
Can someone please help me decode the CCITT-encoded bytes and then create a bitmap from them?
Thanks,
Chandru
I found a workaround to get a bitmap from CCITT-encoded PDF files.
Ghostscript supports converting PDF files to TIFF, and there is a simple C# wrapper for converting PDF files to JPG files here:
http://www.mattephraim.com/blog/2009/01/06/a-simple-c-wrapper-for-ghostscript/
The wrapper can easily be modified to produce CCITT-compressed TIFF files instead of JPGs, and it supports converting a specific page of the PDF.
The solution: convert the specific page of the PDF to a temporary TIFF file, load the bitmap from the TIFF, and delete the TIFF file. A sketch of the underlying Ghostscript call is below.
Chandru
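For reference, the Ghostscript call the wrapper drives boils down to something like this (the device and flags are standard Ghostscript options; the page number, resolution, and file names are placeholders):

#include <stdlib.h>

int main(void) {
    /* tiffg4 is Ghostscript's CCITT Group 4 compressed TIFF device. */
    return system("gs -dNOPAUSE -dBATCH -sDEVICE=tiffg4"
                  " -r300"                       /* output resolution, dpi */
                  " -dFirstPage=3 -dLastPage=3"  /* single page to convert */
                  " -sOutputFile=page.tif input.pdf");
}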
But your answer rasterizes at a fixed resolution; I need to get the resolution of the original image inside the PDF.