Remove PdfImageOject from a PDF - itext

I have 1000th of PDF generated from emails containing .png (I am not owner of the generator). For some reasons, those PDF are very very slow to render with the Imaging system I am using (I am not the developer of that system and may not change it).
If I use iTextSharp and implement a IRenderListener to count the Images to be rendered, there are thousands per page (99% being 1 or 2 pixels only). But if I count the Images in the resources of the PDF, there are only a few (~tens).
I am counting the images in the resources, per page, with the code here after
var dict = pdfReader.GetPageN(currentPage)
PdfDictionary res = (PdfDictionary)PdfReader.GetPdfObject(dict.Get(PdfName.RESOURCES));
PdfDictionary xobj = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
if (xobj != null)
{
foreach (PdfName name in xobj.Keys)
{
PdfObject obj = xobj.Get(name);
if ((obj.IsIndirect()))
{
PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
PdfName subtype = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
if (PdfName.IMAGE.Equals(subtype))
{
Count++
And my IRenderListener looks like this:
class ImageRenderListener : IRenderListener
{
public void RenderImage(iTextSharp.text.pdf.parser.ImageRenderInfo renderInfo)
{
PdfImageObject image = renderInfo.GetImage();
if (image == null) return;
var refObj = renderInfo.GetRef();
if (refObj == null)
Count++; // but why no ref ??
else
Count++;
}
I just started to learn about PDF specification and iTextSharp this evening, to analyze my PDF and understand what could be wrong... if I am correct, I see that many images to be rendered that are not referencing a resource (refObj == null) and that they are .png (image.streamContentType.FileExtension = "png"). So, I think those are the images making the rendering so slow...
For testing purpose, I would like to delete those images from the PDF but don't find how to proceed.
I only found code samples to remove image that are in the resources... but the images I want to delete are not :/
Is there any code sample somewhere to help me ? I did google on "iTextSharp remove object", etc... but there was nothing similar to my case :(

Let me start with the blunt observation that you have a shitty PDF.
The image you see when opening the PDF in a PDF viewer seems to be composed of several small 1- or 2-pixel images. The drawing operations to show these pixels one by one is suboptimal, no matter which imaging system you use: you are faced with a bad PDF.
In your first snippet, I see that you loop over all of the indirect objects stored in the the XObject resources of each page in search of images. You count these images, resulting in a number of Image XObjects stored in the PDF. If you add up all the Count values for all the pages, this number can be higher than the actual number of Image XObject stored in the PDF as you don't take into account that some images can be reused on different pages.
You do not count the inline images that are stored in the content streams. I'm biased. In the ISO committees for PDF, I'm on the side of the group of people saying that "inline images are evil" and "inline images should die". For now, we didn't succeed in getting rid of inline images, but we introduced some substantial limitations that should reduce the (ab)use of inline images in PDF that conform to ISO-32000-2 (the PDF 2.0 spec that is due in 2016).
You've already discovered that your PDF has inline images. Those are the images where refObj == null. They are not stored as indirect objects; they are stored inline, in the content stream of the page. As you can imagine based on my feelings towards inline images, I consider your PDF being a bad PDF for this reason (although it does conform to ISO-32000-1).
The presence of inline images is a first explanation why you have a different image count: when you loop over the indirect objects you only find part of the images. When you parse the document for images, you also find the inline images.
A second explanation could be the fact that the Image XObject are used more than once. That's the whole point of not using inline images. For instance: if you have an image that represents a logo that needs to be repeated on every page, one could use inline images. That would be a bad idea: the same image bytes would be present in the PDF as many times as there are pages. One should use an Image XObject. In this case, the image bytes of the logo are stored only once in an indirect object. There's a reference to this object from every page, so that the image bytes are stored in the document only once. In a 10-page document, you can see 10 identical images on 10 pages, but when looking inside the document, you'll find only one image that is referenced from every page.
If you remove Image XObjects by removing the indirect objects containing the image stream objects, you have to be very careful: are you sure you're not corrupting your document? Because there's a reference to the Image XObject in the content stream of your page. This reference points to an entry in the /XObjects entry of the page's /Resources. This /XObject references to the stream object with the image bytes. If you remove that indirect object without removing the references (e.g. from the content stream), you break your PDF. Some viewers will ignore those errors, but at some point in time some tool (or some body) is going to complain that your PDF is corrupt.
If you want to remove inline images, you have to parse all the content streams in your PDF: page content streams as well as Form XObject content streams. You have to rewrite all these streams and make sure all inline images are removed. That is: all objects that that start with the BI operator (Begin Image) and end with the EI operator (End Image).
That's a task for a PDF specialist who knows both iTextSharp and ISO-32000-1 inside-out. The solution to your problem probably doesn't fit into an answering window on StackOverflow.
I'm the original author of iText. From a certain point of view, iText is like a sharp knife. A sharp knife is a very good tool that can be used for many good things. However, you can also seriously cut your fingers when you're not using the knife in a correct way. I hope you'll be careful and that you're not going to create a whole series of damaged PDF files.
For instance: you assume that some of the files in the PDF are PNGs because iText suggests to store them as PNGs. However: PNG is not supported by ISO-32000-1, so your assumption that your PDF contains PNGs is wrong. I honestly worry when I see questions like yours.

Related

three.js toDataURL PNG is black

I'm trying to grab a screenshot with renderer.domElement.toDataURL("image/png"), and save it to a file.
The image is the right size, but it's black.
I have preserveDrawingBuffer turned on.
I think I'm decoding and saving the file correctly, because when I hexdump it I can see the correct initial characters for the PNG format, as well as the IHDR and IDAT chunk headers. However the closing IEND is missing.
Any known issues here? Hints? Windows 7/Firefox up to date if it matters.
Thanks... (Sorry if this is dumb, I'm very new to three.js)
I had somewhat similar problems with Windows 7/Firefox. PNG Data URL's would be randomly truncated or something, much shorter than a successful PNG export. Trying to set that data url as image src resulted in "Image corrupt" exception or something in FF. As little sense it makse, setting a small window.setTimeout (10ms) between rendering and getting the data URL helped in my case. Maybe Firefox needs a rest from the JS engine before it refreshes some canvas internal state or something.. weird.
I switched to JPG format (smaller files => truncation less of an issue?) and still saw it not working, then I tried this tip which I found here
If you want to save data that is derived from a Javascript
canvas.toDataURL() function, you have to convert blanks into plusses.
If you do not do that, the decoded data is corrupted:
<?php
$encodedData = str_replace(' ','+',$encodedData);
$decodedData = base64_decode($encodedData);
?>
This worked. Thanks, Mekal.
This tip seems to apply to JPGs only. I saw PNGs decoding correctly without the + replacement, and corruptly with it. I can use JPGs so my personal problem is solved. However I never saw a PNG that wasn't black even when decoded correctly and not truncated.
Kind of a lousy situation either way, I feel like. What is up with the +'s?
A black texture is a sign that you did not indicate the texture needs to be updated.
Also, you do not need to use canvas.toDataURL(). You can pass in the canvas reference to the THREE.Texture object.
var canvas = document.getElementById('#myCanvas');
var texture = new THREE.Texture(canvas);
texture.needsUpdate = true;
// Now render the scene

How to avoid image duplication in image gallery of iOS device?

I want to store the images to photo gallary. But if images are already available at photo gallary then how to distinguish for whether imgaes are already exists or not? I didnt find property like unique identifier or should i compare by taking data in the form of NSdata?
thanks..
You can maintain a dictionary of hashes for each the image in the photo gallery, and then show only additional images who do not hashes are not present in the dictionary
As a reminder, you can check for an object in a dictionary by doing:
if ([myDictionary objectForKey:variableStoringImageHash] == nil) {
//No such image present
}
else {
//image is present
}
For a bit about hashing an image, this might help:
iPhone: fast hash function for storing web images (url) as files (hashed filenames)
I am not sure if what eitan27 says will work so as an alternative I would say that your best hope is comparing NSData objects. But this as you can see will become very tedious as there will be n number of images in library and comparing each one for repetition does't make sense , still if you want to compare data you look at this answer which will give you a fraction of how much the data is matching.

OpenXML: Issue adding images to documents

Up until now, this block of code has been using to build documents with text for several months with no snags. I am now trying to dynamically add images. I've spent about two days staring at code and researching and am at an end. I suspect the issue is that relationships are not being created (more details below.) Maybe not?
//set stuff up...
WordprocessingDocument doc = WordprocessingDocument.Open(fsPat, true, new OpenSettings(){
AutoSave = true,
MarkupCompatibilityProcessSettings = new MarkupCompatibilityProcessSettings(MarkupCompatibilityProcessMode.ProcessAllParts,
DocumentFormat.OpenXml.FileFormatVersions.Office2007),
MaxCharactersInPart = long.MaxValue
});
MainDocumentPart mainPart = doc.MainDocumentPart;
.
.Other stuff goes here
.
//now the fun...
Run r2 = new Run();
// Add an ImagePart.
ImagePart ip = mainPart.AddImagePart(ImagePartType.Png);
string imageRelationshipID = mainPart.CreateRelationshipToPart(ip); //
using (Stream imgStream = ip.GetStream())
{
System.Drawing.Bitmap b = new System.Drawing.Bitmap("myfile.png");
b.Save(imgStream, System.Drawing.Imaging.ImageFormat.Png);
}
Drawing drawing = BuildImage(imageRelationshipID, "name"+imageRelationshipID.ToString(), 17, 17);
r2.Append(drawing);
p.Append(r2);
The image part is essentially copied from http://blog.stuartwhiteford.com/?p=33) and is running in a loop presently. I also copied his BuildImage() function and use it as-is.
When I open the resulting docx, I see red Xs where the images are saying "This image cannot currently be displayed."
When I open the zip, the images will appear in root/media, but not root/word/media as I'd expect. I also cannot find the images referenced in any of the relationship files. Ideally they'd be in root/word/_rels/document.xml.rels. You'll notice I changed how imageRelationshipID is set hoping to fix this. It didn't.
Please help. Thank you.
So... It seems like OpenXML just hates me. I copied AddImagePart code from like 3-4 places among trying other things--none of which lasted long--and just could not get relationships to form. The implication I see is that they happen automatically with the AddImagePart function.
I ended up doing a complete workaround where I add all the pictures I might want to put and remove the Drawing nodes' parents of the ones I didn't want (Run nodes, generally.) Since these are very small pictures, it's feasible and in ways more elegant than trying to add them as necessary since I don't have to keep track of where images are stored on disk.

ClearCanvas DICOM Library - How to use Overlay Planes?

NOTE:: This may be a better question to answer:: Free DICOM files, with Multiple Overlays
Hi, I have a question relating to tag DicomTags.OverlayData & Overlay Planes.
As of now I can get back overlay data from a DICOM file in ClearCanvas and uncompress & display it using:
var overlayData = dicomFile.DataSet[DicomTags.OverlayData];
I also use other tags in the DICOM file for Overlays such as, OverlayOrigin, OverlayColumns, OverlayRows etc...
So my question is, how do OverlayPlanes come into play here? All these Overlay tags seem to be global & not grouped in a OverlayPlane tag or something.
Is plane data layered in the OverlayData tag?? I'm new to DICOM & a little confused about this.
The ClearCanvas DICOM assembly has several helper IOD classes that make it a bit easier to access specific modules within a DICOM Message. The OverlayPlaneModuleIod class is one such IOD class that make it easier to access all of the tags together within an overlay plane. The following code shows an example of how to use this class to check and access an each of the potential overlay planes, without having to worry about the various tags involved:
DicomFile theFile = new DicomFile("filename.dcm");
theFile.Load();
OverlayPlaneModuleIod iod = new OverlayPlaneModuleIod(theFile.DataSet);
for (int i = 0; i &lt 16; i++)
{
if (iod.HasOverlayPlane(i))
{
OverlayPlane overlay = iod[i];
byte[] overlayData = overlay.OverlayData;
string description = overlay.OverlayDescription;
}
}
This link answered my question for the most part as I needed to just understand something about overlay grouping.
http://www.medicalconnections.co.uk/wiki/Number_of_Overlays_in_Image

C#: Take Out Image Portion of JPEG to Backup Metadata?

This will be a little backwards from the typical approach.
I've used ExifTool for metadata manipulation before, but I really want to keep the best metadata backup I can before I make anything permanent.
What I want to do is remove the compressed image portion of a JPEG file to leave everything else intact. That's backing up EXIF, Makernotes, IPTC, XMP, etc whether at the beginning or end of the file.
What I've tried so far is to strip all metadata from a copy of the original JPEG, and use it as a basis of what bytes will be taken out of the original. After looking at the raw data, it doesn't seem like the stripped copy is contiguous in the original copy. There may be some header information still remaining in the stripped version. I don't really know. Not a good way to do it, I suppose.
Are there any markers that will absolutely tell me where the compressed JPEG image data starts and ends? I understand that JPEG files have 0xFFD8 and 0xFFD9 to mark the start and end of the image, but have come to find out that metadata is actually between those markers.
I'm using C#.
Thank you.
To do this properly you need to fully parse the JPEG/JFIF format and discard anything you don't want. Metadata is all kept in APP segments or trailers after the JPEG EOI, so presumably you will toss everything else. Full parsing of a JPEG/JFIF is not trivial, and for this I refer you to the JPEF/JFIF specification.
You can use the JpegSegmentReader class from my MetadataExtractor library to retrieve specific segments from a JPEG image.