How to convert an array of stream in Contents to PRIndirectRefernce? - itext

I am trying to read pdf and get the text present in it. Using Nugget iTextSharp -LGPL v4.1.5 . (I am not allowed to use ITextsharp v5.5.13 and it makes life difficult)
private string GetTextFromPage(PdfReader pdfReader, int page)
{
StringBuilder pageText = new StringBuilder();
var cpage = pdfReader.GetPageN(page);
var content = cpage.Get(PdfName.CONTENTS);
//Error for casting Pdfarray to PRIndirectReference
var indirectReference = (PRIndirectReference)content;
}
Getting Exception
System.InvalidCastException : Unable to cast object of type 'iTextSharp.text.pdf.PdfArray' to type 'iTextSharp.text.pdf.PRIndirectReference'.
Kindly suggest how to handle PdfArray object (multiple streams in contents)

I am not too versed in c# but the java code should be easily transferable to c#
I would do it like this:
private byte[] getTextFromPage(PdfReader pdfReader, int page){
PdfDictionary cpage = pdfReader.GetPageN(page);
PdfObject content = cpage.get(PdfName.CONTENTS);
return getContent(content);
}
private byte[] getContent(PdfObject content) throws IOException {
byte[] result=null;
switch (content.type()){
case PdfObject.INDIRECT:
PRIndirectReference ref = (PRIndirectReference) content;
PdfObject directObject = PdfReader.getPdfObject(ref);
result = getContent(directObject);
break;
case PdfObject.ARRAY:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
PdfArray cArray = (PdfArray) content;
for(Object object : cArray.getArrayList()) {
baos.write(getContent((PdfObject) object));
}
result = baos.toByteArray();
break;
case PdfObject.STREAM:
PRStream stream = (PRStream) PdfReader.getPdfObject(content);
result = PdfReader.getStreamBytes(stream);
break;
default:
throw new IllegalStateException("Unsupported content type");
}
return result;
}
But this is only a small step in retrieving the text. You need the whole operator processing, leading, spacing, scaling, font handling etc. So to completely write it from scratch is a big task...

Related

Open OSM pbf results in Protobuf exception

Using OSMSharp I am having trouble to open a stream for a file (which I can provide on demand)
The error occurs in PBFReader (line 104)
using (var tmp = new LimitedStream(_stream, length))
{
header = _runtimeTypeModel.Deserialize(tmp, null, _blockHeaderType) as BlobHeader;
}
and states: "ProtoBuf.ProtoException: 'Invalid field in source data: 0'" which might mean different things as I have read in this SO question.
The file opens and is visualized with QGis so is not corrupt in my opinion.
Can it be that the contracts do not match? Is OsmSharp/core updated to the latest .proto files for OSM from here (although not sure if this is the real original source for the definition files).
And what might make more sense, can it be that the file I attached is generated for v2 of OSM PBF specification?
In the code at the line of the exception I see the following comment which makes me wonder:
// TODO: remove some of the v1 specific code.
// TODO: this means also to use the built-in capped streams.
// code borrowed from: http://stackoverflow.com/questions/4663298/protobuf-net-deserialize-open-street-maps
// I'm just being lazy and re-using something "close enough" here
// note that v2 has a big-endian option, but Fixed32 assumes little-endian - we
// actually need the other way around (network byte order):
// length = IntLittleEndianToBigEndian((uint)length);
BlobHeader header;
// again, v2 has capped-streams built in, but I'm deliberately
// limiting myself to v1 features
So this makes me wonder if OSM Sharp is (still) up-to-date.
My sandbox code looks like this:
using OsmSharp.Streams;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using OsmSharp.Tags;
namespace OsmSharp
{
class Program
{
private const string Path = #"C:\Users\Bernoulli IT\Documents\Applications\Argaleo\Test\";
private const string FileNameAntarctica = "antarctica-latest.osm";
private const string FileNameOSPbf = "OSPbf";
private const Boolean useRegisterSource = false;
private static KeyValuePair<string, string> KeyValuePair = new KeyValuePair<string, string>("joep", "monita");
static void Main(string[] args)
{
Console.WriteLine("Hello World!");
//string fileName = $#"{Path}\{FileNameAntarctica}.pbf";
string fileName = $#"{Path}\{FileNameOSPbf}.pbf";
string newFileName = $"{fileName.Replace(".pbf", string.Empty)}-{Guid.NewGuid().ToString().Substring(0, 4)}.pbf";
Console.WriteLine("*** Complete");
string fileNameOutput = CompleteFlow(fileName, newFileName);
Console.WriteLine("");
Console.WriteLine("*** Display");
DisplayFlow(fileNameOutput);
Console.ReadLine();
}
private static string CompleteFlow(string fileName, string newFileName)
{
// 1. Open file and convert to bytes
byte[] fileBytes = FileToBytes(fileName);
// 2. Bytes to OSM stream source (pbf)
PBFOsmStreamSource osmStreamSource;
osmStreamSource = BytesToOsmStreamSource(fileBytes);
osmStreamSource.MoveNext();
if (osmStreamSource.Current() == null)
{
osmStreamSource = FileToOsmStreamSource(fileName);
osmStreamSource.MoveNext();
if (osmStreamSource.Current() == null)
{
throw new Exception("No current in stream.");
}
}
// 3. Add custom tag
AddTag(osmStreamSource);
// 4. OSM stream source to bytes
//byte[] osmStreamSourceBytes = OsmStreamSourceToBytes(osmStreamSource);
// 5. Bytes to file
//string fileNameOutput = BytesToFile(osmStreamSourceBytes, newFileName);
OsmStreamSourceToFile(osmStreamSource, newFileName);
Console.WriteLine(newFileName);
return newFileName;
}
private static void DisplayFlow(string fileName)
{
// 1. Open file and convert to bytes
byte[] fileBytes = FileToBytes(fileName);
// 2. Bytes to OSM stream source (pbf)
BytesToOsmStreamSource(fileBytes);
}
private static byte[] FileToBytes(string fileName)
{
Console.WriteLine(fileName);
return File.ReadAllBytes(fileName);
}
private static PBFOsmStreamSource BytesToOsmStreamSource(byte[] bytes)
{
MemoryStream memoryStream = new MemoryStream(bytes);
memoryStream.Position = 0;
PBFOsmStreamSource osmStreamSource = new PBFOsmStreamSource(memoryStream);
foreach (OsmGeo element in osmStreamSource.Where(osmGeo => osmGeo.Tags.Any(tag => tag.Key.StartsWith(KeyValuePair.Key))))
{
foreach (Tag elementTag in element.Tags.Where(tag => tag.Key.StartsWith(KeyValuePair.Key)))
{
Console.WriteLine("!!!!!!!!!!!!!! Tag found while reading !!!!!!!!!!!!!!!!!!".ToUpper());
}
}
return osmStreamSource;
}
private static PBFOsmStreamSource FileToOsmStreamSource(string fileName)
{
using (FileStream fileStream = new FileInfo(fileName).OpenRead())
{
PBFOsmStreamSource osmStreamSource = new PBFOsmStreamSource(fileStream);
return osmStreamSource;
}
}
private static void AddTag(PBFOsmStreamSource osmStreamSource)
{
osmStreamSource.Reset();
OsmGeo osmGeo = null;
while (osmGeo == null)
{
osmStreamSource.MoveNext();
osmGeo = osmStreamSource.Current();
if(osmGeo?.Tags == null)
{
osmGeo = null;
}
}
osmGeo.Tags.Add("joep", "monita");
Console.WriteLine($"{osmGeo.Tags.FirstOrDefault(tag => tag.Key.StartsWith(KeyValuePair.Key)).Key} - {osmGeo.Tags.FirstOrDefault(tag => tag.Key.StartsWith(KeyValuePair.Key)).Value}");
}
private static byte[] OsmStreamSourceToBytes(PBFOsmStreamSource osmStreamSource)
{
MemoryStream memoryStream = new MemoryStream();
PBFOsmStreamTarget target = new PBFOsmStreamTarget(memoryStream, true);
osmStreamSource.Reset();
target.Initialize();
UpdateTarget(osmStreamSource, target);
target.Flush();
target.Close();
return memoryStream.ToArray();
}
private static string BytesToFile(byte[] bytes, string fileName)
{
using (FileStream fs = new FileStream(fileName, FileMode.Create, FileAccess.Write))
{
fs.Write(bytes, 0, bytes.Length);
}
return fileName;
}
private static void OsmStreamSourceToFile(PBFOsmStreamSource osmStreamSource, string fileName)
{
using (FileStream fileStream = new FileInfo(fileName).OpenWrite())
{
PBFOsmStreamTarget target = new PBFOsmStreamTarget(fileStream, true);
osmStreamSource.Reset();
target.Initialize();
UpdateTarget(osmStreamSource, target);
target.Flush();
target.Close();
}
}
private static void UpdateTarget(OsmStreamSource osmStreamSource, OsmStreamTarget osmStreamTarget)
{
if (useRegisterSource)
{
osmStreamTarget.RegisterSource(osmStreamSource, osmGeo => true);
osmStreamTarget.Pull();
}
else
{
bool isFirst = true;
foreach (OsmGeo osmGeo in osmStreamSource)
{
Tag? tag = osmGeo.Tags?.FirstOrDefault(t => t.Key == KeyValuePair.Key);
switch (osmGeo.Type)
{
case OsmGeoType.Node:
if (isFirst)
{
for (int indexer = 0; indexer < 1; indexer++)
{
(osmGeo as Node).Tags.Add(new Tag(KeyValuePair.Key + Guid.NewGuid(), KeyValuePair.Value));
}
isFirst = false;
}
osmStreamTarget.AddNode(osmGeo as Node);
break;
case OsmGeoType.Way:
osmStreamTarget.AddWay(osmGeo as Way);
break;
case OsmGeoType.Relation:
osmStreamTarget.AddRelation(osmGeo as Relation);
break;
default:
throw new ArgumentOutOfRangeException();
}
}
}
}
}
}
Already I posted this question on the GITHube page of OSMSharp as is linked here. Any help would be very appreciated.

iTextSharp - Create new document as Byte[]

Have a little method which goes to the database and retrieves a pdf document from a varbinary column and then adds data to it. I would like to add code so that if this document (company stationery ) is not found then a new blank document is created and returned. The method could either return a Byte[] or a Stream.
Problem is that the variable "bytes" in the else clause is null.
Any ideas what's wrong?
private Byte[] GetBasePDF(Int32 AttachmentID)
{
Byte[] bytes = null;
DataTable dt = ServiceFactory
.GetService().Attachments_Get(AttachmentID, null, null);
if (dt != null && dt.Rows.Count > 0)
{
bytes = (Byte[])dt.Rows[0]["Data"];
}
else
{
// Create a new blank PDF document and return it as Byte[]
ITST.Document doc =
new ITST.Document(ITST.PageSize.A4, 50f, 50f, 25f, 25f);
MemoryStream ms = new MemoryStream();
PdfCopy copy = new PdfCopy(doc, ms);
ms.Position = 0;
bytes = ms.ToArray();
}
return bytes;
}
You are trying to use PdfCopy but that's intended for existing documents, not new ones. You just need to create a "blank" document using PdfWriter and Document. iText won't let you create a 100% empty document but the code below essentially does that by just adding a space.
private static Byte[] CreateEmptyDocument() {
using (var ms = new System.IO.MemoryStream()) {
using (var doc = new Document()) {
using (var writer = PdfWriter.GetInstance(doc, ms)) {
doc.Open();
doc.Add(new Paragraph(" "));
doc.Close();
}
}
return ms.ToArray();
}
}
I think you may need to use
bytes = ms.GetBuffer();
not
bytes = ms.ToArray();

parsing a string using a Lucene Analyser

How can you use an Analyser to 'analyse' a string, and return the analysed string?
I am trying the below code found off this site, but it is throwing an ArgumentException - "This AttributeSource does not have the attribute Lucene.Net.Analysis.Tokenattributes.TermAttribute"
public static string AnalyseString(Analyzer analyser, string stringToAnalyse)
{
MemoryStream ms = new MemoryStream();
StreamWriter sw = new StreamWriter(ms);
sw.Write(stringToAnalyse);
sw.Flush();
ms.Seek(0, SeekOrigin.Begin);
StreamReader sr = new StreamReader(ms);
TokenStream tokenStreamResult = analyser.TokenStream(null,sr);
StringBuilder sb = new StringBuilder();
//Lucene.Net.Analysis.Token t = new Lucene.Net.Analysis.Token();
while (tokenStreamResult.IncrementToken())
{
var attrib = tokenStreamResult.GetAttribute<TermAttribute>();
string t2 = tokenStreamResult.GetAttribute<TermAttribute>().Term;
sb.Append(t2 + " ");
}
return sb.ToString();
}
I am using the latest Lucene.Net version (3.0.3.0), and am testing with a SimpleAnalyzer
Try tokenStreamResult.GetAttribute<ITermAttribute>() instead

ASP.NET JSON Web Service Response format

I have written one simple web service which get product list in JSONText which is string object
Web Service code is below
using System;
using System.Collections.Generic;
using System.Web;
using System.Web.Services;
using System.Web.Script.Services;
using System.Runtime.Serialization.Json;
using System.IO;
using System.Text;
/// <summary>
/// Summary description for JsonWebService
/// </summary>
[WebService(Namespace = "http://tempuri.org/")]
[WebServiceBinding(ConformsTo = WsiProfiles.BasicProfile1_1)]
[System.Web.Script.Services.ScriptService]
public class JsonWebService : System.Web.Services.WebService
{
public JsonWebService () {
//Uncomment the following line if using designed components
//InitializeComponent();
}
[WebMethod]
[ScriptMethod(ResponseFormat = ResponseFormat.Json)]
public string GetProductsJson(string prefix)
{
List<Product> products = new List<Product>();
if (prefix.Trim().Equals(string.Empty, StringComparison.OrdinalIgnoreCase))
{
products = ProductFacade.GetAllProducts();
}
else
{
products = ProductFacade.GetProducts(prefix);
}
//yourobject is your actula object (may be collection) you want to serialize to json
DataContractJsonSerializer serializer = new DataContractJsonSerializer(products.GetType());
//create a memory stream
MemoryStream ms = new MemoryStream();
//serialize the object to memory stream
serializer.WriteObject(ms, products);
//convert the serizlized object to string
string jsonString = Encoding.Default.GetString(ms.ToArray());
//close the memory stream
ms.Close();
return jsonString;
}
}
now it give me resoponse like below :
{"d":"[{\"ProductID\":1,\"ProductName\":\"Product 1\"},{\"ProductID\":2,\"ProductName\":\"Product 2\"},{\"ProductID\":3,\"ProductName\":\"Product 3\"},{\"ProductID\":4,\"ProductName\":\"Product 4\"},{\"ProductID\":5,\"ProductName\":\"Product 5\"},{\"ProductID\":6,\"ProductName\":\"Product 6\"},{\"ProductID\":7,\"ProductName\":\"Product 7\"},{\"ProductID\":8,\"ProductName\":\"Product 8\"},{\"ProductID\":9,\"ProductName\":\"Product 9\"},{\"ProductID\":10,\"ProductName\":\"Product 10\"}]"}
But i am looking for below out put
[{"ProductID":1,"ProductName":"Product 1"},{"ProductID":2,"ProductName":"Product 2"},{"ProductID":3,"ProductName":"Product 3"},{"ProductID":4,"ProductName":"Product 4"},{"ProductID":5,"ProductName":"Product 5"},{"ProductID":6,"ProductName":"Product 6"},{"ProductID":7,"ProductName":"Product 7"},{"ProductID":8,"ProductName":"Product 8"},{"ProductID":9,"ProductName":"Product 9"},{"ProductID":10,"ProductName":"Product 10"}]
can any one tell me what is actual problem
Thanks
First there was a change with ASP.NET 3.5 for security reasons Microsoft added the "d" to the response. Below is a link from Dave Ward at the Encosia that talks about what your seeing:
A breaking change between versions of ASP.NET AJAX. He has several posts that talks about this that can help you further with processing JSON and ASP.NET
Actually, if you just remove the
[ScriptMethod(ResponseFormat = ResponseFormat.Json)]
from the method, and you return the jsonString that you serialized using the JavaScriptSerializer you will get exactelly the output that you were looking for.
Notice that u have double quotes beside ur array in your response.In this way u return json format not json object from ur web method.Json format is a string.Therefore u have to use json.parse() function in order to parse json string to json object.If u dont want to use parse fuction,u have to remove serialize in ur web method.Thus u get a json object.
in .net web service
[WebMethod]
public string Android_DDD(string KullaniciKey, string Durum, string PersonelKey)
{
return EU.EncodeToBase64("{\"Status\":\"OK\",\"R\":[{\"ImzaTipi\":\"Paraf\", \"Personel\":\"Ali Veli üğişçöıÜĞİŞÇÖI\", \"ImzaDurumTipi\":\"Tamam\", \"TamamTar\":\"1.1.2003 11:21\"},{\"ImzaTipi\":\"İmza\", \"Personel\":\"Ali Ak\", \"ImzaDurumTipi\":\"Tamam\", \"TamamTar\":\"2.2.2003 11:21\"}]}");
}
static public string EncodeToBase64(string toEncode)
{
UTF8Encoding encoding = new UTF8Encoding();
byte[] bytes = encoding.GetBytes(toEncode);
string returnValue = System.Convert.ToBase64String(bytes);
return returnValue;
}
in android
private static String convertStreamToString(InputStream is)
{
BufferedReader reader = new BufferedReader(new InputStreamReader(is));
StringBuilder sb = new StringBuilder();
String line = null;
try
{
while ((line = reader.readLine()) != null)
{
sb.append(line + "\n");
}
}
catch (IOException e)
{
e.printStackTrace();
}
finally
{
try
{
is.close();
}
catch (IOException e)
{
e.printStackTrace();
}
}
return sb.toString();
}
private void LoadJsonDataFromASPNET()
{
try
{
DefaultHttpClient httpclient = new DefaultHttpClient();
HttpPost httpPostRequest = new HttpPost(this.WSURL + "/WS.asmx/Android_DDD");
JSONObject jsonObjSend = new JSONObject();
jsonObjSend.put("KullaniciKey", "value_1");
jsonObjSend.put("Durum", "value_2");
jsonObjSend.put("PersonelKey", "value_3");
StringEntity se = new StringEntity(jsonObjSend.toString());
httpPostRequest.setEntity(se);
httpPostRequest.setHeader("Accept", "application/json");
httpPostRequest.setHeader("Content-type", "application/json");
// httpPostRequest.setHeader("Accept-Encoding", "gzip"); // only set this parameter if you would like to use gzip compression
HttpResponse response = (HttpResponse) httpclient.execute(httpPostRequest);
HttpEntity entity = response.getEntity();
if (entity != null)
{
InputStream instream = entity.getContent();
String resultString = convertStreamToString(instream);
instream.close();
resultString = resultString.substring(6, resultString.length()-3);
resultString = new String(android.util.Base64.decode(resultString, 0), "UTF-8");
JSONObject object = new JSONObject(resultString);
String oDurum = object.getString("Status");
if (oDurum.equals("OK"))
{
JSONArray jsonArray = new JSONArray(object.getString("R"));
if (jsonArray.length() > 0)
{
for (int i = 0; i < jsonArray.length(); i++)
{
JSONObject jsonObject = jsonArray.getJSONObject(i);
String ImzaTipi = jsonObject.getString("ImzaTipi");
String Personel = jsonObject.getString("Personel");
String ImzaDurumTipi = jsonObject.getString("ImzaDurumTipi");
String TamamTar = jsonObject.getString("TamamTar");
Toast.makeText(getApplicationContext(), "ImzaTipi:" + ImzaTipi + " Personel:" + Personel + " ImzaDurumTipi:" + ImzaDurumTipi + " TamamTar:" + TamamTar, Toast.LENGTH_LONG).show();
}
}
}
}
}
catch (Exception e)
{
e.printStackTrace();
}
}

Extract images using iTextSharp

I have been using this code with great success to pull out the first image found in each page of a PDF. However, it is now not working with some new PDFs for an uknown reason. I have used other tools (Datalogics, etc) that do pull out the images fine with these new PDFs. However, I do not want to buy Datalogics or any tool if I can use iTextSharp. Can anybody tell me why this code is not finding the images in the PDF?
Knowns: my PDFs only have 1 image per page and nothing else.
using iTextSharp.text;
using iTextSharp.text.pdf;
...
public static void ExtractImagesFromPDF(string sourcePdf, string outputPath)
{
// NOTE: This will only get the first image it finds per page.
PdfReader pdf = new PdfReader(sourcePdf);
RandomAccessFileOrArray raf = new iTextSharp.text.pdf.RandomAccessFileOrArray(sourcePdf);
try
{
for (int pageNumber = 1; pageNumber <= pdf.NumberOfPages; pageNumber++)
{
PdfDictionary pg = pdf.GetPageN(pageNumber);
PdfDictionary res = (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
PdfDictionary xobj = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
if (xobj != null)
{
foreach (PdfName name in xobj.Keys)
{
PdfObject obj = xobj.Get(name);
if (obj.IsIndirect())
{
PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
PdfName type = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
if (PdfName.IMAGE.Equals(type))
{
int XrefIndex = Convert.ToInt32(((PRIndirectReference)obj).Number.ToString(System.Globalization.CultureInfo.InvariantCulture));
PdfObject pdfObj = pdf.GetPdfObject(XrefIndex);
PdfStream pdfStrem = (PdfStream)pdfObj;
byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)pdfStrem);
if ((bytes != null))
{
using (System.IO.MemoryStream memStream = new System.IO.MemoryStream(bytes))
{
memStream.Position = 0;
System.Drawing.Image img = System.Drawing.Image.FromStream(memStream);
// must save the file while stream is open.
if (!Directory.Exists(outputPath))
Directory.CreateDirectory(outputPath);
string path = Path.Combine(outputPath, String.Format(#"{0}.jpg", pageNumber));
System.Drawing.Imaging.EncoderParameters parms = new System.Drawing.Imaging.EncoderParameters(1);
parms.Param[0] = new System.Drawing.Imaging.EncoderParameter(System.Drawing.Imaging.Encoder.Compression, 0);
System.Drawing.Imaging.ImageCodecInfo jpegEncoder = Utilities.GetImageEncoder("JPEG");
img.Save(path, jpegEncoder, parms);
break;
}
}
}
}
}
}
}
}
catch
{
throw;
}
finally
{
pdf.Close();
raf.Close();
}
}
I found that my problem was that I was not recursively searching inside of forms and groups for images. Basically, the original code would only find images that were embedded at the root of the pdf document. Here is the revised method plus a new method (FindImageInPDFDictionary) that recursively searches for images in the page. NOTE: the flaws of only supporting JPEG and non-compressed images still applies. See R Ubben's code for options to fix those flaws. HTH someone.
public static void ExtractImagesFromPDF(string sourcePdf, string outputPath)
{
// NOTE: This will only get the first image it finds per page.
PdfReader pdf = new PdfReader(sourcePdf);
RandomAccessFileOrArray raf = new iTextSharp.text.pdf.RandomAccessFileOrArray(sourcePdf);
try
{
for (int pageNumber = 1; pageNumber <= pdf.NumberOfPages; pageNumber++)
{
PdfDictionary pg = pdf.GetPageN(pageNumber);
// recursively search pages, forms and groups for images.
PdfObject obj = FindImageInPDFDictionary(pg);
if (obj != null)
{
int XrefIndex = Convert.ToInt32(((PRIndirectReference)obj).Number.ToString(System.Globalization.CultureInfo.InvariantCulture));
PdfObject pdfObj = pdf.GetPdfObject(XrefIndex);
PdfStream pdfStrem = (PdfStream)pdfObj;
byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)pdfStrem);
if ((bytes != null))
{
using (System.IO.MemoryStream memStream = new System.IO.MemoryStream(bytes))
{
memStream.Position = 0;
System.Drawing.Image img = System.Drawing.Image.FromStream(memStream);
// must save the file while stream is open.
if (!Directory.Exists(outputPath))
Directory.CreateDirectory(outputPath);
string path = Path.Combine(outputPath, String.Format(#"{0}.jpg", pageNumber));
System.Drawing.Imaging.EncoderParameters parms = new System.Drawing.Imaging.EncoderParameters(1);
parms.Param[0] = new System.Drawing.Imaging.EncoderParameter(System.Drawing.Imaging.Encoder.Compression, 0);
System.Drawing.Imaging.ImageCodecInfo jpegEncoder = Utilities.GetImageEncoder("JPEG");
img.Save(path, jpegEncoder, parms);
}
}
}
}
}
catch
{
throw;
}
finally
{
pdf.Close();
raf.Close();
}
}
private static PdfObject FindImageInPDFDictionary(PdfDictionary pg)
{
PdfDictionary res =
(PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
PdfDictionary xobj =
(PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
if (xobj != null)
{
foreach (PdfName name in xobj.Keys)
{
PdfObject obj = xobj.Get(name);
if (obj.IsIndirect())
{
PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
PdfName type =
(PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
//image at the root of the pdf
if (PdfName.IMAGE.Equals(type))
{
return obj;
}// image inside a form
else if (PdfName.FORM.Equals(type))
{
return FindImageInPDFDictionary(tg);
} //image inside a group
else if (PdfName.GROUP.Equals(type))
{
return FindImageInPDFDictionary(tg);
}
}
}
}
return null;
}
Here is a simpler solution:
iTextSharp.text.pdf.parser.PdfImageObject pdfImage =
new iTextSharp.text.pdf.parser.PdfImageObject(imgPRStream);
System.Drawing.Image img = pdfImage.GetDrawingImage();
The following code incorporates all of Dave and R Ubben's ideas above, plus it returns a full list of all the images and also deals with multiple bit depths. I had to convert it to VB for the project I'm working on though, sorry about that...
Private Sub getAllImages(ByVal dict As pdf.PdfDictionary, ByVal images As List(Of Byte()), ByVal doc As pdf.PdfReader)
Dim res As pdf.PdfDictionary = CType(pdf.PdfReader.GetPdfObject(dict.Get(pdf.PdfName.RESOURCES)), pdf.PdfDictionary)
Dim xobj As pdf.PdfDictionary = CType(pdf.PdfReader.GetPdfObject(res.Get(pdf.PdfName.XOBJECT)), pdf.PdfDictionary)
If xobj IsNot Nothing Then
For Each name As pdf.PdfName In xobj.Keys
Dim obj As pdf.PdfObject = xobj.Get(name)
If (obj.IsIndirect) Then
Dim tg As pdf.PdfDictionary = CType(pdf.PdfReader.GetPdfObject(obj), pdf.PdfDictionary)
Dim subtype As pdf.PdfName = CType(pdf.PdfReader.GetPdfObject(tg.Get(pdf.PdfName.SUBTYPE)), pdf.PdfName)
If pdf.PdfName.IMAGE.Equals(subtype) Then
Dim xrefIdx As Integer = CType(obj, pdf.PRIndirectReference).Number
Dim pdfObj As pdf.PdfObject = doc.GetPdfObject(xrefIdx)
Dim str As pdf.PdfStream = CType(pdfObj, pdf.PdfStream)
Dim bytes As Byte() = pdf.PdfReader.GetStreamBytesRaw(CType(str, pdf.PRStream))
Dim filter As String = tg.Get(pdf.PdfName.FILTER).ToString
Dim width As String = tg.Get(pdf.PdfName.WIDTH).ToString
Dim height As String = tg.Get(pdf.PdfName.HEIGHT).ToString
Dim bpp As String = tg.Get(pdf.PdfName.BITSPERCOMPONENT).ToString
If filter = "/FlateDecode" Then
bytes = pdf.PdfReader.FlateDecode(bytes, True)
Dim pixelFormat As System.Drawing.Imaging.PixelFormat
Select Case Integer.Parse(bpp)
Case 1
pixelFormat = Drawing.Imaging.PixelFormat.Format1bppIndexed
Case 24
pixelFormat = Drawing.Imaging.PixelFormat.Format24bppRgb
Case Else
Throw New Exception("Unknown pixel format " + bpp)
End Select
Dim bmp As New System.Drawing.Bitmap(Int32.Parse(width), Int32.Parse(height), pixelFormat)
Dim bmd As System.Drawing.Imaging.BitmapData = bmp.LockBits(New System.Drawing.Rectangle(0, 0, Int32.Parse(width), Int32.Parse(height)), System.Drawing.Imaging.ImageLockMode.WriteOnly, pixelFormat)
Marshal.Copy(bytes, 0, bmd.Scan0, bytes.Length)
bmp.UnlockBits(bmd)
Using ms As New MemoryStream
bmp.Save(ms, System.Drawing.Imaging.ImageFormat.Png)
bytes = ms.GetBuffer
End Using
End If
images.Add(bytes)
ElseIf pdf.PdfName.FORM.Equals(subtype) Or pdf.PdfName.GROUP.Equals(subtype) Then
getAllImages(tg, images, doc)
End If
End If
Next
End If
End Sub
This is just another rehash of others' ideas, but the one that worked for me. Here I use #Malco's image grabbing snippet with R Ubben's looping:
private IList<System.Drawing.Image> GetImagesFromPdfDict(PdfDictionary dict, PdfReader doc)
{
List<System.Drawing.Image> images = new List<System.Drawing.Image>();
PdfDictionary res = (PdfDictionary)(PdfReader.GetPdfObject(dict.Get(PdfName.RESOURCES)));
PdfDictionary xobj = (PdfDictionary)(PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT)));
if (xobj != null)
{
foreach (PdfName name in xobj.Keys)
{
PdfObject obj = xobj.Get(name);
if (obj.IsIndirect())
{
PdfDictionary tg = (PdfDictionary)(PdfReader.GetPdfObject(obj));
PdfName subtype = (PdfName)(PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE)));
if (PdfName.IMAGE.Equals(subtype))
{
int xrefIdx = ((PRIndirectReference)obj).Number;
PdfObject pdfObj = doc.GetPdfObject(xrefIdx);
PdfStream str = (PdfStream)(pdfObj);
iTextSharp.text.pdf.parser.PdfImageObject pdfImage =
new iTextSharp.text.pdf.parser.PdfImageObject((PRStream)str);
System.Drawing.Image img = pdfImage.GetDrawingImage();
images.Add(img);
}
else if (PdfName.FORM.Equals(subtype) || PdfName.GROUP.Equals(subtype))
{
images.AddRange(GetImagesFromPdfDict(tg, doc));
}
}
}
}
return images;
}
De c# version:
private IList<System.Drawing.Image> GetImagesFromPdfDict(PdfDictionary dict, PdfReader doc){
List<System.Drawing.Image> images = new List<System.Drawing.Image>();
PdfDictionary res = (PdfDictionary)(PdfReader.GetPdfObject(dict.Get(PdfName.RESOURCES)));
PdfDictionary xobj = (PdfDictionary)(PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT)));
if (xobj != null)
{
foreach (PdfName name in xobj.Keys)
{
PdfObject obj = xobj.Get(name);
if (obj.IsIndirect())
{
PdfDictionary tg = (PdfDictionary)(PdfReader.GetPdfObject(obj));
pdf.PdfName subtype = (pdf.PdfName)(pdf.PdfReader.GetPdfObject(tg.Get(pdf.PdfName.SUBTYPE)));
if (pdf.PdfName.IMAGE.Equals(subtype))
{
int xrefIdx = ((pdf.PRIndirectReference)obj).Number;
pdf.PdfObject pdfObj = doc.GetPdfObject(xrefIdx);
pdf.PdfStream str = (pdf.PdfStream)(pdfObj);
byte[] bytes = pdf.PdfReader.GetStreamBytesRaw((pdf.PRStream)str);
string filter = tg.Get(pdf.PdfName.FILTER).ToString();
string width = tg.Get(pdf.PdfName.WIDTH).ToString();
string height = tg.Get(pdf.PdfName.HEIGHT).ToString();
string bpp = tg.Get(pdf.PdfName.BITSPERCOMPONENT).ToString();
if (filter == "/FlateDecode")
{
bytes = pdf.PdfReader.FlateDecode(bytes, true);
System.Drawing.Imaging.PixelFormat pixelFormat;
switch (int.Parse(bpp))
{
case 1:
pixelFormat = System.Drawing.Imaging.PixelFormat.Format1bppIndexed;
break;
case 24:
pixelFormat = System.Drawing.Imaging.PixelFormat.Format24bppRgb;
break;
default:
throw new Exception("Unknown pixel format " + bpp);
}
var bmp = new System.Drawing.Bitmap(Int32.Parse(width), Int32.Parse(height), pixelFormat);
System.Drawing.Imaging.BitmapData bmd = bmp.LockBits(new System.Drawing.Rectangle(0, 0, Int32.Parse(width),
Int32.Parse(height)), System.Drawing.Imaging.ImageLockMode.WriteOnly, pixelFormat);
Marshal.Copy(bytes, 0, bmd.Scan0, bytes.Length);
bmp.UnlockBits(bmd);
using (var ms = new MemoryStream())
{
bmp.Save(ms, System.Drawing.Imaging.ImageFormat.Png);
bytes = ms.GetBuffer();
}
}
images.Add(System.Drawing.Image.FromStream(new MemoryStream(bytes)));
}
else if (pdf.PdfName.FORM.Equals(subtype) || pdf.PdfName.GROUP.Equals(subtype))
{
images.AddRange(GetImagesFromPdfDict(tg, doc));
}
}
}
}
return images;
}
The above will only work with JPEGs. Excluding inline images and embedded files, you need to go through the objects of subtype IMAGE, then look at the filter and take the appropriate action. Here's an example, assuming we have a PdfObject of subtype IMAGE:
PdfReader pdf = new PdfReader("c:\\temp\\exp0.pdf");
int xo=pdf.XrefSize;
for (int i=0;i<xo;i++)
{
PdfObject obj=pdf.GetPdfObject(i);
if (obj!=null && obj.IsStream())
{
PdfDictionary pd=(PdfDictionary)obj;
if (pd.Contains(PdfName.SUBTYPE) && pd.Get(PdfName.SUBTYPE).ToString()=="/Image")
{
string filter=pd.Get(PdfName.FILTER).ToString();
string width=pd.Get(PdfName.WIDTH).ToString();
string height=pd.Get(PdfName.HEIGHT).ToString();
string bpp=pd.Get(PdfName.BITSPERCOMPONENT).ToString();
string extent=".";
byte [] img=null;
switch (filter)
{
case "/FlateDecode":
byte[] arr=PdfReader.FlateDecode(PdfReader.GetStreamBytesRaw((PRStream)obj),true);
Bitmap bmp=new Bitmap(Int32.Parse(width),Int32.Parse(height),PixelFormat.Format24bppRgb);
BitmapData bmd=bmp.LockBits(new Rectangle(0,0,Int32.Parse(width),Int32.Parse(height)),ImageLockMode.WriteOnly,
PixelFormat.Format24bppRgb);
Marshal.Copy(arr,0,bmd.Scan0,arr.Length);
bmp.UnlockBits(bmd);
bmp.Save("c:\\temp\\bmp1.png",ImageFormat.Png);
break;
default:
break;
}
}
}
}
This will mess the color up because of the Microsoft BGR, of course, but I wanted to keep it short. Do something similar for "/CCITTFaxDecode", etc.