How do I retrieve a page number or page reference for an Outline destination in a PDF on iOS? - iphone

I've been reading through the Adobe PDF spec, along with Apple's Quartz 2D documentation for PDF rendering and parsing. I've also downloaded Voyeur and inspected a local PDF with it to see its internal data. At this point I'm able to get the document catalog, and then fetch the outlines dictionary from there. Nested within the outlines dictionary's child dictionaries, I can see "/Dest" entries with values such as:
G1.1025588
etc
I'm wondering if there is a way for me to use these values to get a reference to the page to render, using methods I've seen in GitHub projects such as Reader, along with Apple's documented examples.
PDF processing is definitely a challenge, so any help would be appreciated.

The /Dest entry in an outline item dictionary can either be a name, a string, or an array.
The simplest case is if it's an array; then the first item is the page object the outline entry points to (a dictionary). To get the page number, you have to iterate over all pages in the document and see which one is equal (==) to the dictionary you have (CGPDFPageRefs are actually CGPDFDictionaryRefs). You could also traverse the page tree, which is a bit harder, but may be faster (not as much as you might expect, I wouldn't optimize prematurely here). The other items in the array are position on the page etc., search for "Explicit Destinations" in the PDF spec to learn more.
If the entry is a name or string, it is a named destination. You have to map the name to a destination. In PDF 1.2 and later, the mapping is usually a name tree stored in the /Dests entry of the /Names dictionary in the document catalog. A name tree is essentially a tree map that allows fast access to named values without having to read all the data at once (as with a plain dictionary). Unfortunately, there's no direct support for name trees in Quartz, so you'll have to do a little more work to parse this structure recursively (see "Name Trees" in the PDF spec).
Note that an outline item doesn't necessarily have a /Dest entry, it can also specify its destination via an /A (action) entry, which is a little bit more complex. In most cases, however, the action will be a "GoTo" action that is essentially a wrapper for a destination.
The mapping of names to destinations can also be stored as a plain dictionary. In that case, it's the /Dests entry directly in the document's catalog. I've rarely seen this though; it is the older PDF 1.1 mechanism, superseded by name trees in PDF 1.2 (current is 1.7).
You will definitely need the PDF spec for this: http://www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
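The recursive name-tree lookup can be sketched roughly as follows. This is JavaScript for brevity and only illustrates the walk: it assumes the tree has already been read into plain objects using the spec's keys (Kids, Names, Limits). The function name findInNameTree and the sample data are hypothetical; on iOS the same traversal would be written against CGPDFDictionaryRef and CGPDFArrayRef.

```javascript
// Look up `key` in a PDF name tree that has been loaded into plain objects.
// Node shape (following "Name Trees" in the PDF spec):
//   leaf:         { Limits: [least, greatest], Names: [key1, value1, key2, value2, ...] }
//   intermediate: { Limits: [least, greatest], Kids: [node, ...] }
// The root node has no Limits entry.
function findInNameTree(node, key) {
  // Limits lets us skip subtrees that cannot contain the key.
  // Plain string comparison is used here, which matches byte order for ASCII keys.
  if (node.Limits && (key < node.Limits[0] || key > node.Limits[1])) {
    return undefined;
  }
  if (node.Names) {
    // Names is a flat array of alternating keys and values.
    for (let i = 0; i + 1 < node.Names.length; i += 2) {
      if (node.Names[i] === key) return node.Names[i + 1];
    }
  }
  for (const kid of node.Kids || []) {
    const found = findInNameTree(kid, key);
    if (found !== undefined) return found;
  }
  return undefined;
}

// Hypothetical sample data: a root with two leaves. The values stand in for
// destination arrays, with a string placeholder where the page dictionary would be.
const sampleTree = {
  Kids: [
    {
      Limits: ["G1.1025588", "G1.1025600"],
      Names: [
        "G1.1025588", ["page3", "XYZ", 0, 792, null],
        "G1.1025600", ["page5", "Fit"],
      ],
    },
    {
      Limits: ["G2.1000000", "G2.1000000"],
      Names: ["G2.1000000", ["page9", "Fit"]],
    },
  ],
};

const dest = findInNameTree(sampleTree, "G1.1025588");
```

Once the destination array is found, its first element is the page dictionary, and the page number is resolved the same way as for an explicit destination.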

Thanks to Omz, here is a piece of code to retrieve a page number for an outline destination in a PDF file:
// Get Page Number from an array
- (int) getPageNumberFromArray:(CGPDFArrayRef)array ofPdfDoc:(CGPDFDocumentRef)pdfDoc withNumberOfPages:(int)numberOfPages
{
int pageNumber = -1;
// Page number reference is the first element of array (el 0)
CGPDFDictionaryRef pageDic = NULL;
// Bail out if the first element is not a page dictionary
if (!CGPDFArrayGetDictionary(array, 0, &pageDic)) return pageNumber;
// page searching
for (int p=1; p<=numberOfPages; p++)
{
CGPDFPageRef page = CGPDFDocumentGetPage(pdfDoc, p);
if (CGPDFPageGetDictionary(page) == pageDic)
{
pageNumber = p;
break;
}
}
return pageNumber;
}
// Get page number from an outline. Only support "Dest" and "A" entries
- (int) getPageNumber:(CGPDFDictionaryRef)node ofPdfDoc:(CGPDFDocumentRef)pdfDoc withNumberOfPages:(int)numberOfPages
{
int pageNumber = -1;
CGPDFArrayRef destArray;
CGPDFDictionaryRef dicoActions;
if(CGPDFDictionaryGetArray(node, "Dest", &destArray))
{
pageNumber = [self getPageNumberFromArray:destArray ofPdfDoc:pdfDoc withNumberOfPages:numberOfPages];
}
else if(CGPDFDictionaryGetDictionary(node, "A", &dicoActions))
{
const char * typeOfActionConstChar = NULL;
CGPDFDictionaryGetName(dicoActions, "S", &typeOfActionConstChar);
// typeOfAction stays nil if the action has no "S" entry
NSString * typeOfAction = typeOfActionConstChar ? [NSString stringWithUTF8String:typeOfActionConstChar] : nil;
if([typeOfAction isEqualToString:@"GoTo"]) // only support "GoTo" entry. See PDF spec p653
{
CGPDFArrayRef dArray;
if(CGPDFDictionaryGetArray(dicoActions, "D", &dArray))
{
pageNumber = [self getPageNumberFromArray:dArray ofPdfDoc:pdfDoc withNumberOfPages:numberOfPages];
}
}
}
return pageNumber;
}

Related

xPages REST Service Results into Combobox or Typeahead Text Field

I've read all the documentation I can find and watched all the videos I can find and don't understand how to do this. I have set up an xPages REST Service and it works well. Now I want to place the results of the service into either a combobox or typeahead text field. Ideally I would like to know how to do it for both types of fields.
I have an application which has a view containing a list of countries, another view containing a list of states, and another containing a list of cities. I would like the first field to only display the countries field from the list of data it returns in the XPages REST Service. Then, depending upon which country was selected, I would like the states for that country to be listed in another field for selection, etc.
I can see code for calling the REST Service results from a button, or from a dojo grid, but I cannot find how to call it to populate either of the types of fields identified above.
Where would I call the Service for the field? I had thought it would go in the Data area, but perhaps I've just not found the right syntax to use.
November 6, 2017:
I have been following your suggestion, but am still lost as can be. Here's what I currently have in my code:
x$( "#{id:ApplCountry}" ).select2({
placeholder: "select a country",
minimumInputLength: 2,
allowClear : true,
multiple: false,
ajax: {
dataType: 'text/plain',
url: "./Application.xsp/gridData",
quietMillis: 250,
data: function (params) {
return {
search:'[name=]*'+params.term+'*',
page: params.page
};
},
processResults: function (data, page) {
var data = $.map(data, function (obj) {
obj.id = obj.id || obj["@entityid"];
obj.text = obj.text || obj.name;
return obj;
});
},
return {results: data};
}
}
});
I'm using the dataType of 'text/plain' because that was what I understood I should use when gathering data from a domino application. I have tried changing this to json but it makes no difference.
I'm using processResults because I understand this is what should be used in version 4 of select2.
I don't understand the whole use of the hidden field, so I've stayed away from that.
No matter what I do, although my REST service works if I put it directly in the url, I cannot get any data to display in the field. All I want to display in the field is the country code of the document, which is in the field named "name" (not my choice, it's how it came before I imported the data from MySQL).
I have read documentation and watched videos, but still don't really understand how everything fits together. That was my problem with the REST service. If you use it in Dojo, you just put the name of the service in a field on the Dojo element and it's done, so I don't understand why all the additional coding for another type of domino element. Shouldn't it work the same way?
I should point out that at some points it does display the default message, so it does find the field. Just doesn't display the country selections.
I think the issue may be that you are not returning SelectItems to your select2, and that is what it is expecting. When I do something like you are trying, I actually use a bean to generate the selection choices. You may want to try that or I'm putting in the working part of my bean below.
The Utils.getItemValueAsString is a method I use to return either the string value of a field or, if it is not on the document or is empty/null, an empty string. I took out an if that doesn't relate to this, so there may be a mismatch, but I hope not.
You might be able to jump directly to populating the arrayList, but as I recall I needed to leverage the LinkedHashMap for something.
You should be able to do the same using SSJS, but since that renders to Java before executing, I find this more efficient.
For label/value pairs:
LinkedHashMap lhmap = new LinkedHashMap();
Document doc = null;
Document tmpDoc = null;
allObjects.addElement(doc);
if (dc.getCount() > 0) {
    doc = dc.getFirstDocument();
    while (doc != null) {
        lhmap.put(Utils.getItemValueAsString(doc, LabelField, true), Utils.getItemValueAsString(doc, ValueField, true));
        // fetch the next document before recycling the current one
        tmpDoc = dc.getNextDocument(doc);
        doc.recycle();
        doc = tmpDoc;
    }
}
List<SelectItem> options = new ArrayList<SelectItem>();
Set set = lhmap.entrySet();
Iterator hsItr = set.iterator();
while (hsItr.hasNext()) {
Map.Entry me = (Map.Entry) hsItr.next();
// System.out.println("after: " + hStr);
SelectItem option = new SelectItem();
option.setLabel(me.getKey() + "");
option.setValue(me.getValue() + "");
options.add(option);
}
System.out.println("About to return from generating");
return options;
}
I ended up using straight up SSJS. Worked like a charm - very simple.

Reverse display order in UITableView of Childs retrieved from Firebase Database [duplicate]

I'm trying to test out Firebase to allow users to post comments using push. I want to display the data I retrieve with the following;
fbl.child('sell').limit(20).on("value", function(fbdata) {
// handle data display here
});
The problem is the data is returned in order of oldest to newest - I want it in reversed order. Can Firebase do this?
Since this answer was written, Firebase has added a feature that allows ordering by any child or by value. So there are now four ways to order data: by key, by value, by priority, or by the value of any named child. See this blog post that introduces the new ordering capabilities.
The basic approaches remain the same though:
1. Add a child property with the inverted timestamp and then order on that.
2. Read the children in ascending order and then invert them on the client.
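Both approaches can be sketched with plain arrays, independent of any Firebase API. The helper names (withInvertedTimestamp, newestFirst) and the sortKey field are hypothetical; in a real app the sort in approach 1 would be done by the database query rather than by Array.prototype.sort.

```javascript
// Approach 1: store an inverted (negative) timestamp next to each item,
// so that an ascending sort on that field yields newest first.
function withInvertedTimestamp(item, now) {
  return Object.assign({}, item, { createdAt: now, sortKey: -now });
}

// Approach 2: keep the ascending order returned by the query
// and reverse a copy on the client.
function newestFirst(itemsAscending) {
  return itemsAscending.slice().reverse();
}

const items = [
  withInvertedTimestamp({ text: "first" }, 1000),
  withInvertedTimestamp({ text: "second" }, 2000),
  withInvertedTimestamp({ text: "third" }, 3000),
];

// Stand-in for an ascending "order by sortKey" query.
const bySortKey = items.slice().sort((a, b) => a.sortKey - b.sortKey);
const byReverse = newestFirst(items);
```

Approach 1 costs an extra stored field but lets the server return the newest items first; approach 2 needs no schema change but requires the client to hold the whole page of results before displaying it.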
Firebase supports retrieving child nodes of a collection in two ways:
by name
by priority
What you're getting now is by name, which happens to be chronological. That's no coincidence btw: when you push an item into a collection, the name is generated to ensure the children are ordered in this way. To quote the Firebase documentation for push:
The unique name generated by push() is prefixed with a client-generated timestamp so that the resulting list will be chronologically-sorted.
The Firebase guide on ordered data has this to say on the topic:
How Data is Ordered
By default, children at a Firebase node are sorted lexicographically by name. Using push() can generate child names that naturally sort chronologically, but many applications require their data to be sorted in other ways. Firebase lets developers specify the ordering of items in a list by specifying a custom priority for each item.
The simplest way to get the behavior you want is to also specify an always-decreasing priority when you add the item:
var ref = new Firebase('https://your.firebaseio.com/sell');
var item = ref.push();
item.setWithPriority(yourObject, 0 - Date.now());
Update
You'll also have to retrieve the children differently:
fbl.child('sell').startAt().limitToLast(20).on('child_added', function(fbdata) {
console.log(fbdata.exportVal());
})
In my test using on('child_added' ensures that the last few children added are returned in reverse chronological order. Using on('value' on the other hand, returns them in the order of their name.
Be sure to read the section "Reading ordered data", which explains the usage of the child_* events to retrieve (ordered) children.
A bin to demonstrate this: http://jsbin.com/nonawe/3/watch?js,console
Since Firebase 2.0.x you can use limitToLast() to achieve that:
fbl.child('sell').orderByValue().limitToLast(20).on("value", function(fbdataSnapshot) {
// fbdataSnapshot is returned in ascending order;
// you will still need to reverse these 20 items
// on the client to get descending order
});
Here's a link to the announcement: More querying capabilities in Firebase
To augment Frank's answer, it's also possible to grab the most recent records--even if you haven't bothered to order them using priorities--by simply using endAt().limit(x) like this demo:
var fb = new Firebase(URL);
// listen for all changes and update
fb.endAt().limit(100).on('value', update);
// print the output of our array
function update(snap) {
var list = [];
snap.forEach(function(ss) {
var data = ss.val();
data['.priority'] = ss.getPriority();
data['.name'] = ss.name();
list.unshift(data);
});
// print/process the results...
}
Note that this is quite performant even up to perhaps a thousand records (assuming the payloads are small). For more robust usages, Frank's answer is authoritative and much more scalable.
This brute force can also be optimized to work with bigger data or more records by doing things like monitoring child_added/child_removed/child_moved events in lieu of value, and using a debounce to apply DOM updates in bulk instead of individually.
DOM updates, naturally, are a stinker regardless of the approach, once you get into the hundreds of elements, so the debounce approach (or a React.js solution, which is essentially an uber debounce) is a great tool to have.
There is really no way to do this in the query itself, but if you display the data in a RecyclerView you can reverse the layout instead:
query=mCommentsReference.orderByChild("date_added");
query.keepSynced(true);
// Initialize Views
mRecyclerView = (RecyclerView) view.findViewById(R.id.recyclerView);
mManager = new LinearLayoutManager(getContext());
// mManager.setReverseLayout(false);
mManager.setReverseLayout(true);
mManager.setStackFromEnd(true);
mRecyclerView.setHasFixedSize(true);
mRecyclerView.setLayoutManager(mManager);
I have a date variable (long) and wanted to keep the newest items on top of the list. So what I did was:
Add a new long field 'dateInverse'
Add a new method called 'getDateInverse', which just returns: Long.MAX_VALUE - date;
Create my query with: .orderByChild("dateInverse")
Presto! :p
You are looking for limitToLast(int x). It returns the last "x" elements of the ordered result, that is, the "x" highest values, still in ascending order.
If your database contains {10,300,150,240,2,24,220},
this method:
myFirebaseRef.orderByChild("highScore").limitToLast(4)
will retrieve: {150,220,240,300}
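The behaviour can be mimicked on a plain array to check the expected result. This is only an illustration of the semantics; limitToLastByValue is a hypothetical helper, not a Firebase function.

```javascript
// Mimic "order ascending, then keep the last n" on a plain array of numbers.
function limitToLastByValue(values, n) {
  const ascending = values.slice().sort((a, b) => a - b);
  return ascending.slice(Math.max(0, ascending.length - n));
}

const scores = [10, 300, 150, 240, 2, 24, 220];
const topFour = limitToLastByValue(scores, 4);

// The result is still ascending, so reverse a copy to show the highest first.
const topFourDescending = topFour.slice().reverse();
```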
In Android there is a way to actually reverse the data in an Arraylist of objects through the Adapter. In my case I could not use the LayoutManager to reverse the results in descending order since I was using a horizontal Recyclerview to display the data. Setting the following parameters to the recyclerview messed up my UI experience:
llManager.setReverseLayout(true);
llManager.setStackFromEnd(true);
The only working way I found around this was through the BindViewHolder method of the RecyclerView adapter:
@Override
public void onBindViewHolder(final RecyclerView.ViewHolder holder, int position) {
final SuperPost superPost = superList.get(getItemCount() - position - 1);
}
Hope this answer will help all the devs out there who are struggling with this issue in Firebase.
Firebase: How to display a thread of items in reverse order with a limit for each request and an indicator for a "load more" button.
This will get the last 10 items of the list
FBRef.child("childName")
.limitToLast(loadMoreLimit) // loadMoreLimit = 10 for example
This will get the last 10 items. Grab the id of the last record in the list and save it for the load more functionality. Next, convert the collection of objects into an array and do a list.reverse().
LOAD MORE Functionality: The next call will do two things, it will get the next sequence of list items based on the reference id from the first request and give you an indicator if you need to display the "load more" button.
this.FBRef
.child("childName")
.endAt(null, lastThreadId) // Get this from the previous step
.limitToLast(loadMoreLimit+2)
You will need to strip the first and last item of this object collection. The first item is the reference to get this list. The last item is an indicator for the show more button.
I have a bunch of other logic that will keep everything clean. You will need to add this code only for the load more functionality.
list = snapObjectAsArray; // The list is an array from snapObject
lastItemId = key; // get the first key of the list
if (list.length < loadMoreLimit+1) {
lastItemId = false;
}
if (list.length > loadMoreLimit+1) {
list.pop();
}
if (list.length > loadMoreLimit) {
list.shift();
}
// Return the list.reverse() and lastItemId
// If lastItemId is an ID, it will be used for the next reference and a flag to show the "load more" button.
}
I'm using ReactFire for easy Firebase integration.
Basically, it helps me storing the datas into the component state, as an array. Then, all I have to use is the reverse() function (read more)
Here is how I achieve this :
import React, { Component, PropTypes } from 'react';
import ReactMixin from 'react-mixin';
import ReactFireMixin from 'reactfire';
import Firebase from '../../../utils/firebaseUtils'; // Firebase.initializeApp(config);
@ReactMixin.decorate(ReactFireMixin)
export default class Add extends Component {
constructor(args) {
super(args);
this.state = {
articles: []
};
}
componentWillMount() {
let ref = Firebase.database().ref('articles').orderByChild('insertDate').limitToLast(10);
this.bindAsArray(ref, 'articles'); // bind retrieved data to this.state.articles
}
render() {
return (
<div>
{
this.state.articles.slice().reverse().map(function(article) {
return <div>{article.title}</div>
})
}
</div>
);
}
}
There is a better way. You should order by negative server timestamp. How do you get a negative server timestamp even offline? There is a hidden field which helps. Related snippet from the documentation:
var offsetRef = new Firebase("https://<YOUR-FIREBASE-APP>.firebaseio.com/.info/serverTimeOffset");
offsetRef.on("value", function(snap) {
var offset = snap.val();
var estimatedServerTimeMs = new Date().getTime() + offset;
});
To add to Dave Vávra's answer, I use a negative timestamp as my sort_key like so
Setting
const timestamp = new Date().getTime();
const data = {
name: 'John Doe',
city: 'New York',
sort_key: timestamp * -1 // Gets the negative value of the timestamp
}
Getting
const ref = firebase.database().ref('business-images').child(id);
const query = ref.orderByChild('sort_key');
return $firebaseArray(query); // AngularFire function
This fetches all objects from newest to oldest. You can also add an .indexOn rule for sort_key to make it run even faster.
I had this problem too, and I found a very simple solution that doesn't involve manipulating the data in any way. If you are rendering the result to the DOM in a list of some sort, you can use flexbox and set up a class to reverse the elements in their container.
.reverse {
display: flex;
flex-direction: column-reverse;
}
myarray.reverse(); or this.myitems = items.map(item => item).reverse();
I did this by prepend.
query.orderByChild('sell').limitToLast(4).on("value", function(snapshot){
snapshot.forEach(function (childSnapshot) {
// PREPEND
});
});
Someone has pointed out that there are 2 ways to do this:
Manipulate the data client-side
Make a query that will order the data
The easiest way that I have found to do this is to use option 1, but through a LinkedList. I just append each of the objects to the front of the stack. It is flexible enough to still allow the list to be used in a ListView or RecyclerView. This way even though they come in order oldest to newest, you can still view, or retrieve, newest to oldest.
You can add a column named orderColumn where you save time as
Long refrenceTime = "large future time";
Long currentTime = "currentTime";
Long order = refrenceTime - currentTime;
now save Long order in column named orderColumn and when you retrieve data
as orderBy(orderColumn) you will get what you need.
just use reverse() on the array , suppose if you are storing the values to an array items[] then do a this.items.reverse()
ref.subscribe(snapshots => {
this.loading.dismiss();
this.items = [];
snapshots.forEach(snapshot => {
this.items.push(snapshot);
});
this.items.reverse();
},
For me it was limitToLast that worked. I also found out that limitLast is NOT a function:)
const query = messagesRef.orderBy('createdAt', 'asc').limitToLast(25);
The above is what worked for me.
PRINT in reverse order
Let's think outside the box... If your information will be printed directly into user's screen (without any content that needs to be modified in a consecutive order, like a sum or something), simply print from bottom to top.
So, instead of inserting each new block of content to the end of the print space (A += B), add that block to the beginning (A = B+A).
If you'll include the elements as a consecutive ordered list, the DOM can put the numbers for you if you insert each element as a List Item (<li>) inside an Ordered Lists (<ol>).
This way you save space in your database, avoiding unnecessary reversed data.

How to edit pasted content using the Open XML SDK

I have a custom template in which I'd like to control (as best I can) the types of content that can exist in a document. To that end, I disable controls, and I also intercept pastes to remove some of those content types, e.g. charts. I am aware that this content can also be drag-and-dropped, so I also check for it later, but I'd prefer to stop or warn the user as soon as possible.
I have tried a few strategies:
RTF manipulation
Open XML manipulation
RTF manipulation is so far working fairly well, but I'd really prefer to use Open XML as I expect it to be more useful in the future. I just can't get it working.
Open XML Manipulation
The wonderfully-undocumented (as far as I can tell) "Embed Source" appears to contain a compound document object, which I can use to modify the copied content using the Open XML SDK. But I have been unable to put the modified content back into an object that lets it be pasted correctly.
The modification part seems to work fine. I can see, if I save the modified content to a temporary .docx file, that the changes are being made correctly. It's the return to the clipboard that seems to be giving me trouble.
I have tried assigning just the Embed Source object back to the clipboard (so that the other types such as RTF get wiped out), and in this case nothing at all gets pasted. I've also tried re-assigning the Embed Source object back to the clipboard's data object, so that the remaining data types are still there (but with mismatched content, probably), which results in an empty embedded document getting pasted.
Here's a sample of what I'm doing with Open XML:
using OpenMcdf;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
...
Forms.IDataObject dataObj = Forms.Clipboard.GetDataObject();
object embedSrcObj = dataObj.GetData("Embed Source");
if (embedSrcObj is Stream)
{
// read it with OpenMCDF
Stream stream = embedSrcObj as Stream;
CompoundFile cf = new CompoundFile(stream);
CFStream cfs = cf.RootStorage.GetStream("package");
byte[] bytes = cfs.GetData();
string savedDoc = Path.GetTempFileName() + ".docx";
File.WriteAllBytes(savedDoc, bytes);
// And then use the OpenXML SDK to read/edit the document:
using (WordprocessingDocument openDoc = WordprocessingDocument.Open(savedDoc, true))
{
OpenXmlElement body = openDoc.MainDocumentPart.RootElement.ChildElements[0];
foreach (OpenXmlElement ele in body.ChildElements)
{
if (ele is Paragraph)
{
Paragraph para = (Paragraph)ele;
if (para.ParagraphProperties != null && para.ParagraphProperties.ParagraphStyleId != null)
{
string styleName = para.ParagraphProperties.ParagraphStyleId.Val;
Run run = para.LastChild as Run; // I know I'm assuming things here but it's sufficient for a test case
run.RunProperties = new RunProperties();
run.RunProperties.AppendChild(new DocumentFormat.OpenXml.Wordprocessing.Text("test"));
}
}
// etc.
}
openDoc.MainDocumentPart.Document.Save(); // I think this is redundant in later versions than what I'm using
}
// repackage the document
bytes = File.ReadAllBytes(savedDoc);
cf.RootStorage.Delete("Package");
cfs = cf.RootStorage.AddStream("Package");
cfs.Append(bytes);
MemoryStream ms = new MemoryStream();
cf.Save(ms);
ms.Position = 0;
dataObj.SetData("Embed Source", ms);
// or,
// Clipboard.SetData("Embed Source", ms);
}
Question
What am I doing wrong? Is this just a bad/unworkable approach?

Add object to sorted NSMutable array and answer index path

I have a sorted mutable array of a class called Topic. Each topic represents an array of Publications. I present the topics in a table, and periodically fetch new publications from a web service. When a new publication arrives, I'd like to add it to the table with an animation.
What's bothering me is the computational work I need to do to add into this array, and answer the correct index path. Can someone suggest a more direct way than this:
// add a publication to the topic model. if the publication has a new topic, answer
// the index path of the new topic
- (NSIndexPath *)addPublication:(Publication *)pub {
    // first a search to fit into an existing topic
    NSNumber *topicId = [pub valueForKey:@"topic_id"];
    for (Topic *topic in self.topics) {
        if ([topicId isEqualToNumber:[topic valueForKey:@"id"]]) {
            // this publication is part of an existing topic, no new index path
            [topic addPublication:pub];
            return nil;
        }
    }
    // the publication must have a new topic, add a new topic (and therefore a new row)
    Topic *topic = [[Topic alloc] initWithPublication:pub];
    [self.topics addObject:topic];
    // sort it into position
    [self.topics sortUsingSelector:@selector(compareToTopic:)];
    // oh no, we want to return an index path, but where did it sort to?
    // yikes, another search!
    NSInteger row = [self.topics indexOfObject:topic];
    return [NSIndexPath indexPathForRow:row inSection:0];
}
// call this in a loop for all the publications I fetch from the server,
// collect the index paths for table animations
// so much computation, poor user's phone is going to melt!
There's no getting around the first search, I guess. But is there some more efficient way to add a new thing to an array, maintaining a sort and remembering where it got placed?
It's pretty straightforward to insert a value into a sorted list. Think about how you would insert the number "3" into the list "1, 2, 7, 9", for instance. You want to do exactly the same thing.
Loop through the array by index, using a for loop.
For each object, use compareToTopic: to compare it to the object you want to insert.
When you find the appropriate index to insert at, use -[NSMutableArray insertObject:atIndex:] to insert it.
Then return an NSIndexPath with that index.
Edit: and, as the other answers point out, a binary search would be faster -- but definitely trickier to get right.
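For reference, the binary-search variant can be sketched like this. It is written in JavaScript to keep it short and is only an illustration; insertSorted is a hypothetical helper. In Foundation, -[NSArray indexOfObject:inSortedRange:options:usingComparator:] with NSBinarySearchingInsertionIndex returns the same kind of insertion index without hand-written search code.

```javascript
// Insert `item` into `sorted` (already ascending according to `compare`)
// and return the index it was placed at.
function insertSorted(sorted, item, compare) {
  let low = 0;
  let high = sorted.length;
  while (low < high) {
    const mid = (low + high) >>> 1;
    if (compare(sorted[mid], item) <= 0) {
      low = mid + 1; // item belongs after elements that compare less or equal
    } else {
      high = mid;
    }
  }
  sorted.splice(low, 0, item);
  return low;
}

// The example from the text: insert 3 into 1, 2, 7, 9.
const topics = [1, 2, 7, 9];
const row = insertSorted(topics, 3, (a, b) => a - b);
```

The returned index is what you would pass to indexPathForRow:inSection:, so no sort and no second search are needed after the insert.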
This is almost certainly not an issue; NSArrays are actually hashes, and search is a lot faster than it would be for a true array. How many topics can you possibly have anyways?
Still, if you measure the performance and find it poor, you could look into using a B-tree; Kurt Revis commented below with a link to a similar structure (a binary heap) in Core Foundation: CFBinaryHeap.
Another option (which would also need to be measured) might be to do the comparison as you walk the array the first time; you can mark the spot and do the insertion directly:
NSUInteger insertIndex = 0;
NSComparisonResult prevOrder = NSOrderedDescending;
for (Topic *topic in self.topics) {
NSComparisonResult order = [topicId compareToTopic:topic];
if (NSOrderedSame == order) {
// this publication is part of an existing topic, no new index path
[topic addPublication:pub];
return nil;
}
else if( prevOrder == NSOrderedDescending &&
order == NSOrderedAscending )
{
break;
}
insertIndex++;
prevOrder = order;
}
Please note that I haven't tested this, sorry.
I'm not sure this is actually better or faster than the way you've written it, though.
Don't worry about the work the computer is doing unless it's demonstrably doing it too slowly.
What you have done is correct I guess. There's another way. You can write your own binary search implementation method. (Which has only few lines of code). And you can retrieve the index where the new object should fit in. And add the new object to the required index using insertObject:atIndex: method.

Getting line locations with iText

How can one find where lines are located in a document with iText?
Suppose I have a table in a PDF document and want to read its contents; I would like to find where exactly the cells are located. In order to do that I thought I might find the intersections of lines.
I think your only option using iText will be to parse the PDF tokens manually. Before doing that I would have a copy of the PDF spec handy.
(I'm a .Net guy so I use iTextSharp but other than some capitalization differences and property declarations they're almost 100% the same.)
You can get the individual tokens using the PRTokeniser object which you feed bytes into from calling getPageContent(pageNum) on your PdfReader.
//Get bytes for page 1
byte[] pageBytes = reader.getPageContent(1);
//Get the tokens for page 1
PRTokeniser tokeniser = new PRTokeniser(pageBytes);
Then just loop through the PRTokeniser:
PRTokeniser.TokType tokenType;
string tokenValue;
while (tokeniser.nextToken()) {
tokenType = tokeniser.tokenType;
tokenValue = tokeniser.stringValue;
//...check tokenValue, do something with it
}
As for tokenValue, you'd probably want to look for re and l values for rectangle and line. If you see an re then you want to look at the previous 4 values, and if you see an l then the previous 2 values. This also means that you need to store each tokenValue in an array so you can look back later.
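The look-back idea can be sketched independently of iText. The JavaScript below works on a content stream that has already been decoded to text and split on whitespace, which is a simplification (a real tokeniser also handles strings, names, arrays and comments). The function name extractPaths and the sample stream are hypothetical. The handling of m (moveto) is an addition to what is described above: an l operator only carries the end point of a segment, so the start point has to come from the preceding path operator.

```javascript
// Collect rectangles ("re") and straight segments ("m" ... "l") from a
// whitespace-separated content stream. Coordinates are returned as written,
// without applying any transformation matrix set by "cm".
function extractPaths(content) {
  const numbers = []; // operand buffer, filled until an operator is seen
  const rects = [];
  const lines = [];
  let current = null; // current point, set by "m" and updated by "l"

  for (const token of content.trim().split(/\s+/)) {
    if (/^[+-]?(\d+\.?\d*|\.\d+)$/.test(token)) {
      numbers.push(parseFloat(token));
      continue;
    }
    if (token === "re" && numbers.length >= 4) {
      const [x, y, w, h] = numbers.slice(-4);
      rects.push({ x, y, w, h });
    } else if (token === "m" && numbers.length >= 2) {
      const [x, y] = numbers.slice(-2);
      current = { x, y };
    } else if (token === "l" && numbers.length >= 2) {
      const [x, y] = numbers.slice(-2);
      if (current) lines.push({ from: current, to: { x, y } });
      current = { x, y };
    }
    numbers.length = 0; // every operator consumes its operands
  }
  return { rects, lines };
}

// Hypothetical stream: one horizontal line, then one rectangle.
const sample = "0 0 m 100 0 l S 10 20 200 50 re S";
const paths = extractPaths(sample);
```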
Depending on what you used to create the PDF with you might get some interesting results. For instance, I created a 4 cell table with Microsoft Word and saved as a PDF. For some reason there are two sets of 10 rectangles with many duplicates, but the general idea still works.
Below is C# code targeting iTextSharp 5.1.1.0. You should be able to convert it to Java and iText very easily, I noted the one line that has .Net-specific code that needs to be adjusted from a Generic List (List<string>) to a Java equivalent, probably an ArrayList. You'll also need to adjust some casing, .Net uses Object.Method() whereas Java uses Object.method(). Lastly, .Net accesses properties without gets and sets, so Object.Property is both the getter and setter compared to Java's Object.getProperty and Object.setProperty.
Hopefully this gets you started at least!
//Source file to read from
string sourceFile = "c:\\Hello.pdf";
//Bind a reader to our PDF
PdfReader reader = new PdfReader(sourceFile);
//Create our buffer for previous token values. For Java users, List<string> is a generic list, probably most similar to an ArrayList
List<string> buf = new List<string>();
//Get the raw bytes for the page
byte[] pageBytes = reader.GetPageContent(1);
//Get the raw tokens from the bytes
PRTokeniser tokeniser = new PRTokeniser(pageBytes);
//Create some variables to set later
PRTokeniser.TokType tokenType;
string tokenValue;
//Loop through each token
while (tokeniser.NextToken()) {
//Get the types and value
tokenType = tokeniser.TokenType;
tokenValue = tokeniser.StringValue;
//If the type is a numeric type
if (tokenType == PRTokeniser.TokType.NUMBER) {
//Store it in our buffer for later use
buf.Add(tokenValue);
//Otherwise we only care about raw commands which are categorized as "OTHER"
} else if (tokenType == PRTokeniser.TokType.OTHER) {
//Look for a rectangle token
if (tokenValue == "re") {
//Sanity check, make sure we have enough items in the buffer
if (buf.Count < 4) throw new Exception("Not enough elements in buffer for a rectangle");
//Read and convert the values
float x = float.Parse(buf[buf.Count - 4]);
float y = float.Parse(buf[buf.Count - 3]);
float w = float.Parse(buf[buf.Count - 2]);
float h = float.Parse(buf[buf.Count - 1]);
//..do something with them here
}
}
}