I got the following exception in itextsharp 5.0.6 library when I am trying to parse one PDF file
here is the link of that PDF File
https://backup.filesanywhere.com/fs/v.aspx?v=8c726b8f5a6673b56b6d
try
{
string s = null;
MessageBox.Show("Not-Protected");
PdfReader read = new PdfReader(openFileDialog1.FileName);
//MessageBox.Show(read.NumberOfPages.ToString());
for (int i = 1; i <= read.NumberOfPages; i++)
{
s = PdfTextExtractor.GetTextFromPage(read, i, new SimpleTextExtractionStrategy());
MessageBox.Show(s);
}
}
catch (Exception ex)
{
MessageBox.Show(ex.ToString());
}
System.NullReferenceException: Object reference not set to an instance of an object.
at iTextSharp.text.pdf.PdfContentParser.ReadArray()
at iTextSharp.text.pdf.PdfContentParser.ReadPRObject()
at iTextSharp.text.pdf.PdfContentParser.Parse(List`1 ls)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener)
at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy)
The content stream of the first page of the PDF contains an array start bracket '[' inside the array operand of a TJ operator. This is not allowed as the array operand of a TJ operator may only contain strings and numbers.
Furthermore there is no matching array end bracket ']' inside that array operand, so the end bracket of the array operand itself closes this (illegal) inner array and the array operand does not have a closing bracket anymore. Thus, iText parses all the remaining content stream into the array and at the end of the content stream runs into the exception.
Adobe Reader is well known to ignore certain errors and try to fix others on the run. Knowing that there are no nested arrays allowed in page content descriptions, it seems to simply ignore the illegal opening bracket. This behavior of Adobe Reader is quite a nuisance because it allows defect PDF creation software to flourish.
PS: The line in question:
[(&)110($,"#'#"0'#.\(1\(2'0',#+345467839':'#.\(1;<"'0',#;345467839':'#.\(1!=.0',#\(345467839':'+.\(1\(2'0',#+7)(5)35(5467834':'+.\(1;<"0',#;7)(5)35(5467834)[(&)110($,"#'#"0'#.\(1\(2'0',#+345467839':'#.\(1;<"'0',#;345467839':'#.\(1!=.0',#\(345467839':'+.\(1\(2'0',#+7)(5)35(5467834':'+.\(1;<"0',#;7)(5)35(5467834)(':'*!>1;<"0',#;385467837)] TJ
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>^
Related
I'm trying to read from a Word document and I want the computer to tell me what is written in document not to write itself in other place. So when I say the keyword "word" my program should open a dialog menu and let me to select a word file and tell me what is inside. The other keywords work. So here's my code and also my error.
case "word":
if (openFileDialog1.ShowDialog() == System.Windows.Forms.DialogResult.OK) {
Microsoft.Office.Interop.Word.Application app = new Microsoft.Office.Interop.Word.Application();
object readFromPath = null;
Document doc = app.Documents.Open(ref readFromPath);
foreach (Paragraph objParagraph in doc.Paragraphs)
ss.SpeakAsync(objParagraph.Range.Text.Trim());
((_Document)doc).Close();
((_Application)app).Quit();
}
And my error is enter image description here
Application.Documents.Open takes the full path and filename.
The path must end with \ and prefix the string with # (or leave out the # and double the backslashes \ as one backslash is considered to be an escape character)
object readFromPath = #"C:\Users\N.Horatiu\Desktop\s.docx"
Document doc = app.Documents.Open(ref readFromPath);
I have a custom template in which I'd like to control (as best I can) the types of content that can exist in a document. To that end, I disable controls, and I also intercept pastes to remove some of those content types, e.g. charts. I am aware that this content can also be drag-and-dropped, so I also check for it later, but I'd prefer to stop or warn the user as soon as possible.
I have tried a few strategies:
RTF manipulation
Open XML manipulation
RTF manipulation is so far working fairly well, but I'd really prefer to use Open XML as I expect it to be more useful in the future. I just can't get it working.
Open XML Manipulation
The wonderfully-undocumented (as far as I can tell) "Embed Source" appears to contain a compound document object, which I can use to modify the copied content using the Open XML SDK. But I have been unable to put the modified content back into an object that lets it be pasted correctly.
The modification part seems to work fine. I can see, if I save the modified content to a temporary .docx file, that the changes are being made correctly. It's the return to the clipboard that seems to be giving me trouble.
I have tried assigning just the Embed Source object back to the clipboard (so that the other types such as RTF get wiped out), and in this case nothing at all gets pasted. I've also tried re-assigning the Embed Source object back to the clipboard's data object, so that the remaining data types are still there (but with mismatched content, probably), which results in an empty embedded document getting pasted.
Here's a sample of what I'm doing with Open XML:
using OpenMcdf;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
...
object dataObj = Forms.Clipboard.GetDataObject();
object embedSrcObj = dateObj.GetData("Embed Source");
if (embedSrcObj is Stream)
{
// read it with OpenMCDF
Stream stream = embedSrcObj as Stream;
CompoundFile cf = new CompoundFile(stream);
CFStream cfs = cf.RootStorage.GetStream("package");
byte[] bytes = cfs.GetData();
string savedDoc = Path.GetTempFileName() + ".docx";
File.WriteAllBytes(savedDoc, bytes);
// And then use the OpenXML SDK to read/edit the document:
using (WordprocessingDocument openDoc = WordprocessingDocument.Open(savedDoc, true))
{
OpenXmlElement body = openDoc.MainDocumentPart.RootElement.ChildElements[0];
foreach (OpenXmlElement ele in body.ChildElements)
{
if (ele is Paragraph)
{
Paragraph para = (Paragraph)ele;
if (para.ParagraphProperties != null && para.ParagraphProperties.ParagraphStyleId != null)
{
string styleName = para.ParagraphProperties.ParagraphStyleId.Val;
Run run = para.LastChild as Run; // I know I'm assuming things here but it's sufficient for a test case
run.RunProperties = new RunProperties();
run.RunProperties.AppendChild(new DocumentFormat.OpenXml.Wordprocessing.Text("test"));
}
}
// etc.
}
openDoc.MainDocumentPart.Document.Save(); // I think this is redundant in later versions than what I'm using
}
// repackage the document
bytes = File.ReadAllBytes(savedDoc);
cf.RootStorage.Delete("Package");
cfs = cf.RootStorage.AddStream("Package");
cfs.Append(bytes);
MemoryStream ms = new MemoryStream();
cf.Save(ms);
ms.Position = 0;
dataObj.SetData("Embed Source", ms);
// or,
// Clipboard.SetData("Embed Source", ms);
}
Question
What am I doing wrong? Is this just a bad/unworkable approach?
How can one find where are lines located in a document with iText?
Suppose say I have a table in a PDF document, and want to read its contents; I would like to find where exactly the cells are located. In order to do that I thought I might find the intersections of lines.
I think your only option using iText will be to parse the PDF tokens manually. Before doing that I would have a copy of the PDF spec handy.
(I'm a .Net guy so I use iTextSharp but other than some capitalization differences and property declarations they're almost 100% the same.)
You can get the individual tokens using the PRTokeniser object which you feed bytes into from calling getPageContent(pageNum) on your PdfReader.
//Get bytes for page 1
byte[] pageBytes = reader.getPageContent(1);
//Get the tokens for page 1
PRTokeniser tokeniser = new PRTokeniser(pageBytes);
Then just loop through the PRTokeniser:
PRTokeniser.TokType tokenType;
string tokenValue;
while (tokeniser.nextToken()) {
tokenType = tokeniser.tokenType;
tokenValue = tokeniser.stringValue;
//...check tokenValue, do something with it
}
As far a tokenValue, you'd want to probably look for re and l values for rectangle and line. If you see an re then you want to look at the previous 4 values and if you see an l then previous 2 values. This also means that you need to store each tokenValue in an array so you can look back later.
Depending on what you used to create the PDF with you might get some interesting results. For instance, I created a 4 cell table with Microsoft Word and saved as a PDF. For some reason there are two sets of 10 rectangles with many duplicates, but the general idea still works.
Below is C# code targeting iTextSharp 5.1.1.0. You should be able to convert it to Java and iText very easily, I noted the one line that has .Net-specific code that needs to be adjusted from a Generic List (List<string>) to a Java equivalent, probably an ArrayList. You'll also need to adjust some casing, .Net uses Object.Method() whereas Java uses Object.method(). Lastly, .Net accesses properties without gets and sets, so Object.Property is both the getter and setter compared to Java's Object.getProperty and Object.setProperty.
Hopefully this gets you started at least!
//Source file to read from
string sourceFile = "c:\\Hello.pdf";
//Bind a reader to our PDF
PdfReader reader = new PdfReader(sourceFile);
//Create our buffer for previous token values. For Java users, List<string> is a generic list, probably most similar to an ArrayList
List<string> buf = new List<string>();
//Get the raw bytes for the page
byte[] pageBytes = reader.GetPageContent(1);
//Get the raw tokens from the bytes
PRTokeniser tokeniser = new PRTokeniser(pageBytes);
//Create some variables to set later
PRTokeniser.TokType tokenType;
string tokenValue;
//Loop through each token
while (tokeniser.NextToken()) {
//Get the types and value
tokenType = tokeniser.TokenType;
tokenValue = tokeniser.StringValue;
//If the type is a numeric type
if (tokenType == PRTokeniser.TokType.NUMBER) {
//Store it in our buffer for later user
buf.Add(tokenValue);
//Otherwise we only care about raw commands which are categorized as "OTHER"
} else if (tokenType == PRTokeniser.TokType.OTHER) {
//Look for a rectangle token
if (tokenValue == "re") {
//Sanity check, make sure we have enough items in the buffer
if (buf.Count < 4) throw new Exception("Not enough elements in buffer for a rectangle");
//Read and convert the values
float x = float.Parse(buf[buf.Count - 4]);
float y = float.Parse(buf[buf.Count - 3]);
float w = float.Parse(buf[buf.Count - 2]);
float h = float.Parse(buf[buf.Count - 1]);
//..do something with them here
}
}
}
I've been reading through the adobe pdf spec, along with apple's quartz 2d documentation for pdf rendering and parsing. I've also downloaded Voyeur and inspected a local pdf with it to see it's internal data. At this point I'm able to get the document catalog, and then fetch the outlines dictionary from there. I can see that nested within the outlines dictionary dictionaries that there are named "/Dest" nodes with values such as:
G1.1025588
etc
I'm wondering if there is a way for me to use these values to get a reference to page to render using some methods I've seen github projects such as Reader, along with apple documented examples.
PDF processing is definitely a challenge, so any help would be appreciated.
The /Dest entry in an outline item dictionary can either be a name, a string, or an array.
The simplest case is if it's an array; then the first item is the page object the outline entry points to (a dictionary). To get the page number, you have to iterate over all pages in the document and see which one is equal (==) to the dictionary you have (CGPDFPageRefs are actually CGPDFDictionaryRefs). You could also traverse the page tree, which is a bit harder, but may be faster (not as much as you might expect, I wouldn't optimize prematurely here). The other items in the array are position on the page etc., search for "Explicit Destinations" in the PDF spec to learn more.
If the entry is a name or string, it is a named destination. You have to map the name to a destination from the document catalog's /Dests entry which is a dictionary that contains a name tree. A name tree is essentially a tree map that allows fast access to named values without requiring to read all the data at once (as with a plain dictionary). Unfortunately, there's no direct support for name trees in Quartz, so you'll have to do a little more work to parse this structure recursively (see "Name Trees" in the PDF spec).
Note that an outline item doesn't necessarily have a /Dest entry, it can also specify its destination via an /A (action) entry, which is a little bit more complex. In most cases, however, the action will be a "GoTo" action that is essentially a wrapper for a destination.
The mapping of names to destinations can also be stored as a plain dictionary. In that case, it's in the /Dests entry of the /Names dictionary in the document's catalog. I've rarely seen this though and it was deprecated after PDF 1.2 (current is 1.7).
You will definitely need the PDF spec for this: http://www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
Thanks to Omz, here is a piece of code to retreive a page number for an outline destination in a PDF file :
// Get Page Number from an array
- (int) getPageNumberFromArray:(CGPDFArrayRef)array ofPdfDoc:(CGPDFDocumentRef)pdfDoc withNumberOfPages:(int)numberOfPages
{
int pageNumber = -1;
// Page number reference is the first element of array (el 0)
CGPDFDictionaryRef pageDic;
CGPDFArrayGetDictionary(array, 0, &pageDic);
// page searching
for (int p=1; p<=numberOfPages; p++)
{
CGPDFPageRef page = CGPDFDocumentGetPage(pdfDoc, p);
if (CGPDFPageGetDictionary(page) == pageDic)
{
pageNumber = p;
break;
}
}
return pageNumber;
}
// Get page number from an outline. Only support "Dest" and "A" entries
- (int) getPageNumber:(CGPDFDictionaryRef)node ofPdfDoc:(CGPDFDocumentRef)pdfDoc withNumberOfPages:(int)numberOfPages
{
int pageNumber = -1;
CGPDFArrayRef destArray;
CGPDFDictionaryRef dicoActions;
if(CGPDFDictionaryGetArray(node, "Dest", &destArray))
{
pageNumber = [self getPageNumberFromArray:destArray ofPdfDoc:pdfDoc withNumberOfPages:numberOfPages];
}
else if(CGPDFDictionaryGetDictionary(node, "A", &dicoActions))
{
const char * typeOfActionConstChar;
CGPDFDictionaryGetName(dicoActions, "S", &typeOfActionConstChar);
NSString * typeOfAction = [NSString stringWithUTF8String:typeOfActionConstChar];
if([typeOfAction isEqualToString:#"GoTo"]) // only support "GoTo" entry. See PDF spec p653
{
CGPDFArrayRef dArray;
if(CGPDFDictionaryGetArray(dicoActions, "D", &dArray))
{
pageNumber = [self getPageNumberFromArray:dArray ofPdfDoc:pdfDoc withNumberOfPages:numberOfPages];
}
}
}
return pageNumber;
}
i am using the following code
String keyword=request.getParameter("keyword");
keyword = keyword.toLowerCase();
keyword.replaceAll(" "," "); //first double space and then single space
keyword = keyword.trim();
System.out.println(keyword);
i am given the input as t s
but iam getting as
[3/12/10 12:07:10:431 IST] 0000002c SystemOut O t s // here i am getting the two spaces
how can decrease two single space
use the follwoing program
public class whitespaces {
public static void main(String []args){
try{
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
String str = br.readLine();
System.out.println( str.replaceAll("\b\s{2,}\b", " "));
}catch(Exception e){
e.printStackTrace();
}
}
}
thanks,
murali
If your database always have only one space, you could use some keypress event to automatically ignore any occurrences of multiple spaces (by replace double spaces with single space in the search string or something).
StackOverflow has solved the same (or at least a similar) problem regarding spaces in tags, by not having them. Instead, if you want to denote a space in a tag on SO, use - (dash). You could run a query to replace all spaces with - in your database (even though it would probably take quite some time to run you'll only have to do it once). If you want to display them as spaces on the page, just do a replace when you render.