I am using iTextSharp to extract images from a PDF. However, if the images are CCITT fax encoded, bitmap creation fails with a "Parameter not valid" error.
PdfReader.GetStreamBytesRaw returns the CCITT encoded bytes as they are, which is why bitmap creation fails.
Can someone please help me decode the CCITT encoded bytes and create a bitmap from them?
Thanks,
Chandru
I found a workaround to get a bitmap from CCITT encoded PDF files.
Ghostscript supports converting PDF files to TIFF. There is a simple C# wrapper for converting PDF files to JPG files here:
http://www.mattephraim.com/blog/2009/01/06/a-simple-c-wrapper-for-ghostscript/
The wrapper can easily be modified to produce CCITT compressed TIFF files instead of JPG files.
The wrapper supports converting a specific page of a PDF to TIFF.
The solution is to convert the specific page of the PDF to a temporary TIFF file, load the bitmap from the TIFF, and then delete the TIFF file.
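The same idea can be sketched without the C# wrapper by calling Ghostscript directly. The Python snippet below is only an illustration under a few assumptions: the executable name (`gs`; on Windows it is usually `gswin64c`), the 300 dpi default, and the function names are placeholders. The resolution is a rendering resolution chosen by the caller, not the resolution of the image stored in the PDF.

```python
import os
import subprocess
import tempfile


def ghostscript_args(pdf_path, page, tiff_path, dpi=300, gs="gs"):
    """Build a Ghostscript command line that renders one page as a CCITT G4 TIFF."""
    return [
        gs,                           # executable name is an assumption
        "-dNOPAUSE", "-dBATCH", "-dSAFER", "-q",
        "-sDEVICE=tiffg4",            # 1-bit TIFF with CCITT Group 4 compression
        f"-r{dpi}",                   # rendering resolution in dpi
        f"-dFirstPage={page}",
        f"-dLastPage={page}",
        f"-sOutputFile={tiff_path}",
        pdf_path,
    ]


def render_page_to_tiff_bytes(pdf_path, page, dpi=300, gs="gs"):
    """Render one page to a temporary TIFF, return its bytes, delete the file."""
    fd, tiff_path = tempfile.mkstemp(suffix=".tif")
    os.close(fd)  # Ghostscript writes the file itself
    try:
        subprocess.run(ghostscript_args(pdf_path, page, tiff_path, dpi, gs), check=True)
        with open(tiff_path, "rb") as handle:
            return handle.read()
    finally:
        os.remove(tiff_path)
```

The returned bytes can then be handed to whatever imaging library is in use to build the bitmap.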
Chandru
But your answer sets the resolution itself, and I want to get the resolution of the original image in the PDF.
I wanted to download an image from the web. But when I 'save image', it opens as a .txt file. I figure this is some type of encoding for the image but I can't find out which.
I want to eventually automate downloading the image for further processing, specifically text recognition. I've tried to convert the .txt using some online base64 encoders/decoders with no success. However, https://convertio.co/ was able to convert the .txt to .gif but I don't know how it did what it did.
I've given a sample of the .txt file. The actual file is much bigger.
The file name begins as such (if it helps):
data:image;base64,R0lGODlhyABGAIMAAPRDNvRDNvRDNvRDNvRDNvRDNvRDNvRDNvRDNvRDNvRDNvRDNvRDNvRDNvRDNv///ywAAAAAyABGAAAE+vDB (and it goes on, its very long).
GIF89aÈ�F�ƒ��ôC6ôC6ôC6ôC6ôC6ôC6ôC6ôC6ôC6ôC6ôC6ôC6ôC6ôC6ôC6ÿÿÿ,����È�F��úðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|úðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|úðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|úðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ|ðÁ;
I can see that there are '|' characters in between. Maybe they separate pixels.
The entire file is here: https://pastebin.com/BPbTHMZ7
It seems to be a GIF image encoded as a data URL:
data:image;base64,R0lGODlhyABGAIMAAPRDNvRDNvRDNvR...
This format can be used in HTML and CSS files and is handy because the image data is embedded directly in the HTML/CSS file and does not need to be loaded with a separate request.
The start of the text basically says that it's a data URL containing data for an image, and that the image is encoded using Base64.
To decode it:
Chop off the start of the text, namely data:image;base64,.
Run the remaining text (R0lGODlhy...) through a Base64 decoder. The result will be binary data.
Save the binary data to a file using a file name with the extension .gif.
Now you have a proper GIF image as a file.
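The steps above could be automated roughly like this in Python. It is a minimal sketch: the function name and the file names in the usage note are made up for illustration, and it assumes the text is a single base64 data URL as in the question.

```python
import base64


def decode_data_url(text):
    """Return the binary payload of a base64 data URL such as 'data:image;base64,...'."""
    header, comma, payload = text.strip().partition(",")
    if not comma or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    # Everything after the first comma is the Base64 text; decode it to raw bytes.
    return base64.b64decode(payload)
```

For example, read the saved text with `open("image.txt").read()`, pass it to `decode_data_url`, and write the returned bytes to `image.gif` opened in `"wb"` mode. The decoded data should start with the bytes `GIF89a`, which is the GIF signature visible at the start of the sample in the question.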
I want to do the following for PDF compression:
Extract text from PDF
Use Zopfli for efficient gzip compression of Text
Insert the compressed text into PDF
Is this feasible using PDFBox or iText?
Our project has a requirement to generate the final report both as a PDF and as an MS Word document. We are using iTextSharp to dynamically generate the tables and rows in the report. Finally, we upload the file to the server as both PDF and MS Word: each is converted to a byte array/stream and saved as a PDF and as an MS Word document. The uploaded PDF works as expected, but the MS Word file produces an error and does not open (screenshot attached).
iTextSharp doesn't produce MS Word documents, so this isn't an actual iText question. When I look at your screen shot, I see that you are trying to import a PDF file into Word. Since Word can't interpret PDF syntax, it shows you the syntax of the PDF file:
%PDF-1.4
%âãÏÓ
1 0 obj
<</Type/Font...
I think your question is wrong. You are not using iTextSharp to create a PDF file and an MS Word file. You are using iTextSharp to create a PDF file, and not an MS Word file.
There is no such thing as "Save a PDF as MS Word file" in iTextSharp, and it will be extremely difficult to find another tool that can convert a PDF document to a Word document in an acceptable way. (There are such tools, but the quality is suboptimal for PDFs that weren't made to be converted to another format.)
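A quick way to confirm this diagnosis is to look at the first bytes of the file that was saved with the Word extension. The Python sketch below is only an illustration (the function name is hypothetical): a PDF starts with `%PDF-`, a .docx file is a ZIP archive starting with `PK`, and a legacy .doc file starts with the OLE2 signature `D0 CF 11 E0`.

```python
def sniff_format(data):
    """Guess the real format of a byte array from its leading signature bytes."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):
        return "zip container (e.g. docx)"
    if data.startswith(b"\xd0\xcf\x11\xe0"):
        return "ole2 container (e.g. doc)"
    return "unknown"
```

If the byte array written to the .doc file reports `pdf`, the file is simply a PDF with a different extension, which matches what Word displays.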
I need to write a document using images, texts, hyperlinks... And then convert it to PDF and DOC (but in the future it can be converted to more file formats).
What's the best "starting format" for this document?
Doc or Docx might be the best file format for creating a document containing images, text, hyperlinks, and many other elements. Once the document is created, it's easy to convert .doc/.docx files into other file formats, such as image, PDF, or HTML, by using OpenXML or a commercial library like Spire.Doc.
I have a bunch (about 1200) of jpg/jpeg files, which have a filename pattern of: IMG-YYYYMMDD-WA####.jpg or .jpeg. None of them have any exif data. I would like to (batch) add exif dates (created, modified, ...) using the date pattern in the filename. The time doesn't really matter to me.
I have searched this forum (and others), but I cannot find anything related to ADDING these dates to jpeg files. I was hoping someone here could help me out.
EDIT: Using Linux (Mint 17.1)
This should not be difficult to write. What you need to create is a filter that:
Removes the existing JPEG file APPn header
Inserts an EXIF header with the date.
You would not need to mess with the compressed data at all. You're going to need to read a bit of the JPEG standard, just enough to get an idea of the block structure. Do a byte-by-byte copy until you hit an APPn marker. The APPn markers have byte counts, so you know how much to skip over. Insert your own EXIF marker into the stream. Then copy the rest of the data.
You'll also need to read the EXIF standard to figure out how to format the header.
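As a rough illustration of such a filter, here is a Python sketch. The function names are made up, the time is assumed to be midnight because the file name only carries a date, and the EXIF block is the smallest layout I would expect readers to accept (DateTime in IFD0 plus DateTimeOriginal in an Exif sub-IFD), not a complete implementation of the standard. It only drops APPn segments that directly follow the start-of-image marker, and doing so also discards anything else stored there (for example a colour profile), so try it on copies first.

```python
import re
import struct


def date_from_name(name):
    """Turn 'IMG-YYYYMMDD-WA####.jpg' into the EXIF form 'YYYY:MM:DD HH:MM:SS'."""
    match = re.search(r"IMG-(\d{4})(\d{2})(\d{2})-WA\d+", name)
    if match is None:
        raise ValueError(f"no date found in {name!r}")
    year, month, day = match.groups()
    return f"{year}:{month}:{day} 00:00:00"  # time unknown, midnight assumed


def exif_segment(date_text):
    """Build an APP1/EXIF segment holding DateTime and DateTimeOriginal."""
    stamp = date_text.encode("ascii") + b"\x00"        # 19 characters + NUL = 20 bytes
    if len(stamp) != 20:
        raise ValueError("date must look like 'YYYY:MM:DD HH:MM:SS'")
    entry = ">HHII"                                    # tag, type, count, value/offset
    tiff = b"MM" + struct.pack(">HI", 42, 8)           # big-endian TIFF header, IFD0 at 8
    ifd0 = (struct.pack(">H", 2)
            + struct.pack(entry, 0x0132, 2, 20, 38)    # DateTime, ASCII, stored at 38
            + struct.pack(entry, 0x8769, 4, 1, 58)     # pointer to Exif IFD at 58
            + struct.pack(">I", 0))                    # no further IFD
    exif_ifd = (struct.pack(">H", 1)
                + struct.pack(entry, 0x9003, 2, 20, 76)  # DateTimeOriginal, stored at 76
                + struct.pack(">I", 0))
    payload = b"Exif\x00\x00" + tiff + ifd0 + stamp + exif_ifd + stamp
    # The segment length counts its own two bytes but not the marker.
    return b"\xff\xe1" + struct.pack(">H", len(payload) + 2) + payload


def add_exif(jpeg, date_text):
    """Return JPEG bytes with leading APPn segments replaced by a new EXIF segment."""
    if jpeg[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    pos = 2
    # Skip every APPn segment (markers 0xFFE0..0xFFEF) directly after SOI.
    while pos + 4 <= len(jpeg) and jpeg[pos] == 0xFF and 0xE0 <= jpeg[pos + 1] <= 0xEF:
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        pos += 2 + length
    # Everything after the skipped headers is copied untouched.
    return jpeg[:2] + exif_segment(date_text) + jpeg[pos:]
```

To process a folder, read each file in binary mode, call `add_exif(data, date_from_name(filename))`, and write the result to a new file.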