How to find the first instance of a pdf file using HTMLElement? - htmlelements

void DownloadFile(object sender, WebBrowserDocumentCompletedEventArgs e)
{
HtmlElementCollection links = webBrowser1.Document.Links;
foreach (HtmlElement link in links) //
{
if (link.InnerText.Equals("*.pdf"))
{
link.InvokeMember("Click");
break;
}
}
}
How do I find the first link to a PDF file using HtmlElement? I tried matching on *.pdf, but it does not work.

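It looks like you are using C#, but you've tagged this as htmlelements, which is a Java library, so you might be in the wrong place.
However, Equals does an exact comparison and does not treat * as a wildcard, so "*.pdf" only matches that literal text. If InnerText gets the link href (or if the link text contains the .pdf), then you probably want: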
EndsWith(".pdf")
instead of
Equals("*.pdf").
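The matching logic itself can be sketched independently of the WebBrowser control. This is a minimal Java sketch, not the asker's code: the class and method names are made up for illustration, and it assumes you already have the link targets as strings. It ignores case and any query string or fragment, which a plain EndsWith(".pdf") would not. In C#, the case-insensitive part would be EndsWith(".pdf", StringComparison.OrdinalIgnoreCase).

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper: finds the first link target that points at a PDF.
final class PdfLinkFinder {

    // True if the path part of href ends in ".pdf", ignoring case,
    // a trailing query string ("?...") and a fragment ("#...").
    static boolean isPdfLink(String href) {
        if (href == null) {
            return false;
        }
        int end = href.length();
        int query = href.indexOf('?');
        if (query >= 0) {
            end = Math.min(end, query);
        }
        int fragment = href.indexOf('#');
        if (fragment >= 0) {
            end = Math.min(end, fragment);
        }
        return href.substring(0, end).toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    // Returns the first PDF link in document order, if there is one.
    static Optional<String> firstPdfLink(List<String> hrefs) {
        return hrefs.stream().filter(PdfLinkFinder::isPdfLink).findFirst();
    }
}
```

Note that a literal "*.pdf" is only matched here because it happens to end in ".pdf"; there is no wildcard handling anywhere.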

How to determine if something was copied or cut to the clipboard

In my @Execute method I am able to get the selection out of the clipboard / LocalSelectionTransfer. But I have no idea how to react to it depending on how the user put the content on the clipboard (copy or cut).
I have to decide whether or not to duplicate the content.
This is what I have:
@Execute
public void execute(@Named(IServiceConstants.ACTIVE_SHELL) Shell shell, @Named(IServiceConstants.ACTIVE_PART) MPart activePart) {
Clipboard clipboard = new Clipboard(shell.getDisplay());
TransferData[] transferDatas = clipboard.getAvailableTypes();
boolean weCanUseIt= false;
for(int i=0; i<transferDatas.length; i++) {
if(LocalSelectionTransfer.getTransfer().isSupportedType(transferDatas[i])) {
weCanUseIt = true;
break;
}
}
if (weCanUseIt) {
@SuppressWarnings("unchecked")
List<Object> objects = ((StructuredSelection)LocalSelectionTransfer.getTransfer().getSelection()).toList();
for(Object o: objects) {
System.out.println(o.getClass());
}
}
}
Any ideas?
You only get something in the clipboard using LocalSelectionTransfer if you code a part in your RCP to use this transfer type for a Copy operation. It provides a way to transfer the selection directly.
This transfer type will not be used if something is copied to the clipboard any other way (in this case it might be something like TextTransfer or FileTransfer).
So you will only be using LocalSelectionTransfer to deal with a selection from another part in which case you presumably know how to deal with the objects.
If you are trying to do Copy and Cut then you should do the Cut in the source viewer - but this will remove the selection so you can't use LocalSelectionTransfer for that. Use a transfer such as FileTransfer or TextTransfer which doesn't rely on the current selection.
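The point about doing the Cut in the source viewer can be illustrated without SWT. The sketch below is plain Java with made-up names (CutInSourceSketch, SourceViewer) and models the clipboard content as a list of strings; it is not the SWT API. It shows why the clipboard has to carry a snapshot of the data: after a cut, the live selection is empty, so anything that reads the selection at paste time finds nothing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a viewer that handles Copy and Cut itself.
final class CutInSourceSketch {

    static final class SourceViewer {
        final List<String> items = new ArrayList<>();
        final List<String> selection = new ArrayList<>();

        // Copy: hand out a snapshot, leave the viewer untouched.
        List<String> copy() {
            return new ArrayList<>(selection);
        }

        // Cut: hand out a snapshot, then remove the items from the source.
        // The selection is gone afterwards, so it cannot serve as the payload.
        List<String> cut() {
            List<String> snapshot = new ArrayList<>(selection);
            items.removeAll(snapshot);
            selection.clear();
            return snapshot;
        }
    }

    // Paste only inserts the snapshot; it does not need to know
    // whether the snapshot came from a copy or a cut.
    static void paste(List<String> clipboardSnapshot, List<String> target) {
        target.addAll(clipboardSnapshot);
    }
}
```

With this split, the paste side never has to decide whether to duplicate: the source already removed the items if it was a cut.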

How to edit pasted content using the Open XML SDK

I have a custom template in which I'd like to control (as best I can) the types of content that can exist in a document. To that end, I disable controls, and I also intercept pastes to remove some of those content types, e.g. charts. I am aware that this content can also be drag-and-dropped, so I also check for it later, but I'd prefer to stop or warn the user as soon as possible.
I have tried a few strategies:
RTF manipulation
Open XML manipulation
RTF manipulation is so far working fairly well, but I'd really prefer to use Open XML as I expect it to be more useful in the future. I just can't get it working.
Open XML Manipulation
The wonderfully-undocumented (as far as I can tell) "Embed Source" appears to contain a compound document object, which I can use to modify the copied content using the Open XML SDK. But I have been unable to put the modified content back into an object that lets it be pasted correctly.
The modification part seems to work fine. I can see, if I save the modified content to a temporary .docx file, that the changes are being made correctly. It's the return to the clipboard that seems to be giving me trouble.
I have tried assigning just the Embed Source object back to the clipboard (so that the other types such as RTF get wiped out), and in this case nothing at all gets pasted. I've also tried re-assigning the Embed Source object back to the clipboard's data object, so that the remaining data types are still there (but with mismatched content, probably), which results in an empty embedded document getting pasted.
Here's a sample of what I'm doing with Open XML:
using OpenMcdf;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
...
Forms.IDataObject dataObj = Forms.Clipboard.GetDataObject();
object embedSrcObj = dataObj.GetData("Embed Source");
if (embedSrcObj is Stream)
{
// read it with OpenMCDF
Stream stream = embedSrcObj as Stream;
CompoundFile cf = new CompoundFile(stream);
CFStream cfs = cf.RootStorage.GetStream("package");
byte[] bytes = cfs.GetData();
string savedDoc = Path.GetTempFileName() + ".docx";
File.WriteAllBytes(savedDoc, bytes);
// And then use the OpenXML SDK to read/edit the document:
using (WordprocessingDocument openDoc = WordprocessingDocument.Open(savedDoc, true))
{
OpenXmlElement body = openDoc.MainDocumentPart.RootElement.ChildElements[0];
foreach (OpenXmlElement ele in body.ChildElements)
{
if (ele is Paragraph)
{
Paragraph para = (Paragraph)ele;
if (para.ParagraphProperties != null && para.ParagraphProperties.ParagraphStyleId != null)
{
string styleName = para.ParagraphProperties.ParagraphStyleId.Val;
Run run = para.LastChild as Run; // I know I'm assuming things here but it's sufficient for a test case
run.RunProperties = new RunProperties();
run.RunProperties.AppendChild(new DocumentFormat.OpenXml.Wordprocessing.Text("test"));
}
}
// etc.
}
openDoc.MainDocumentPart.Document.Save(); // I think this is redundant in later versions than what I'm using
}
// repackage the document
bytes = File.ReadAllBytes(savedDoc);
cf.RootStorage.Delete("Package");
cfs = cf.RootStorage.AddStream("Package");
cfs.Append(bytes);
MemoryStream ms = new MemoryStream();
cf.Save(ms);
ms.Position = 0;
dataObj.SetData("Embed Source", ms);
// or,
// Clipboard.SetData("Embed Source", ms);
}
Question
What am I doing wrong? Is this just a bad/unworkable approach?

How to get text from textbox of MS word document using Apache POI?

I want to get the text written in a text box in an MS Word document. I am using Apache POI to parse the Word document.
Currently I am iterating through all the Paragraph objects, but this Paragraph list does not contain the text from the text box, so it is missing from the output.
e.g.
paragraph in plain text
**<some information in text box>**
one more paragraph in plain text
What I want to extract:
<para>paragraph in plain text</para>
<text_box>some information in text box</text_box>
<para>one more paragraph in plain text</para>
What I am getting currently:
paragraph in plain text
one more paragraph in plain text
Does anyone know how to extract information from a text box using Apache POI?
This worked for me:
private void printContentsOfTextBox(XWPFParagraph paragraph) {
XmlObject[] textBoxObjects = paragraph.getCTP().selectPath(
"declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' "
+ "declare namespace wps='http://schemas.microsoft.com/office/word/2010/wordprocessingShape' "
+ "declare namespace v='urn:schemas-microsoft-com:vml' "
+ ".//*/wps:txbx/w:txbxContent | .//*/v:textbox/w:txbxContent");
for (int i =0; i < textBoxObjects.length; i++) {
XWPFParagraph embeddedPara = null;
try {
XmlObject[] paraObjects = textBoxObjects[i].
selectChildren(
new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "p"));
for (int j=0; j<paraObjects.length; j++) {
embeddedPara = new XWPFParagraph(
CTP.Factory.parse(paraObjects[j].xmlText()), paragraph.getBody());
//Here you have your paragraph;
System.out.println(embeddedPara.getText());
}
} catch (XmlException e) {
//handle
}
}
}
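To see what that XPath is selecting, here is a JDK-only sketch that skips POI and reads the WordprocessingML directly. It assumes you have already pulled word/document.xml out of the .docx (which is a zip archive) into a string; the class and method names are invented for the example. It collects the text of every w:p that sits inside a w:txbxContent element, which is the same element the answer's XPath targets.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of POI.
final class TextBoxTextSketch {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns one string per paragraph found inside any w:txbxContent.
    static List<String> textBoxParagraphs(String documentXml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups below
            doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document XML", e);
        }
        List<String> result = new ArrayList<>();
        NodeList boxes = doc.getElementsByTagNameNS(W_NS, "txbxContent");
        for (int i = 0; i < boxes.getLength(); i++) {
            NodeList paragraphs =
                    ((Element) boxes.item(i)).getElementsByTagNameNS(W_NS, "p");
            for (int j = 0; j < paragraphs.getLength(); j++) {
                // A paragraph's text is the concatenation of its w:t runs.
                NodeList texts =
                        ((Element) paragraphs.item(j)).getElementsByTagNameNS(W_NS, "t");
                StringBuilder text = new StringBuilder();
                for (int k = 0; k < texts.getLength(); k++) {
                    text.append(texts.item(k).getTextContent());
                }
                result.add(text.toString());
            }
        }
        return result;
    }
}
```

One caveat: documents saved by newer Word versions often store a text box twice (a wps:txbx version plus a v:textbox fallback), so a real file may return each paragraph twice and you may need to de-duplicate.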
To extract all occurrences of text from Word .doc and .docx files for crgrep I used the Apache Tika source as a reference of how the Apache POI APIs should be correctly used. This is useful if you want to use POI directly and not depend on Tika.
For Word .docx files, take a look at this Tika class:
org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator
If you ignore XHTMLContentHandler and the formatting code, you can see how to navigate an XWPFDocument correctly using POI.
For .doc files this class is helpful:
org.apache.tika.parser.microsoft.WordExtractor
Both classes are in tika-parsers-1.x.jar. An easy way to access the Tika code through your Maven dependencies is to add Tika temporarily to your pom.xml, such as
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>1.7</version>
</dependency>
Then let your IDE resolve the attached source and step into the classes above.
If you want to get text from a text box in a .docx file (using POI 3.10-FINAL), here is sample code:
FileInputStream fileInputStream = new FileInputStream(inputFile);
XWPFDocument document = new XWPFDocument(OPCPackage.open(fileInputStream));
for (XWPFParagraph xwpfParagraph : document.getParagraphs()) {
String text = xwpfParagraph.getParagraphText(); //here is where you receive text from textbox
}
Or you can iterate over each XWPFRun in the XWPFParagraph and invoke its toString() method. Same result.

How do I retrieve a page number or page reference for an Outline destination in a PDF on iOS?

I've been reading through the Adobe PDF spec, along with Apple's Quartz 2D documentation for PDF rendering and parsing. I've also downloaded Voyeur and inspected a local PDF with it to see its internal data. At this point I'm able to get the document catalog and then fetch the outlines dictionary from there. I can see that the dictionaries nested within the outlines dictionary have "/Dest" entries with values such as:
G1.1025588
etc
I'm wondering if there is a way to use these values to get a reference to the page to render, using methods I've seen in GitHub projects such as Reader and in Apple's documented examples.
PDF processing is definitely a challenge, so any help would be appreciated.
The /Dest entry in an outline item dictionary can either be a name, a string, or an array.
The simplest case is if it's an array; then the first item is the page object the outline entry points to (a dictionary). To get the page number, you have to iterate over all pages in the document and see which one is equal (==) to the dictionary you have (CGPDFPageRefs are actually CGPDFDictionaryRefs). You could also traverse the page tree, which is a bit harder, but may be faster (not as much as you might expect, I wouldn't optimize prematurely here). The other items in the array are position on the page etc., search for "Explicit Destinations" in the PDF spec to learn more.
If the entry is a name or string, it is a named destination. You have to map the name to a destination through the /Dests entry of the /Names dictionary in the document catalog, which holds a name tree. A name tree is essentially a tree map that allows fast access to named values without having to read all the data at once (as with a plain dictionary). Unfortunately, there's no direct support for name trees in Quartz, so you'll have to do a little more work to parse this structure recursively (see "Name Trees" in the PDF spec).
Note that an outline item doesn't necessarily have a /Dest entry; it can also specify its destination via an /A (action) entry, which is a little bit more complex. In most cases, however, the action will be a "GoTo" action that is essentially a wrapper for a destination.
The mapping of names to destinations can also be stored as a plain dictionary. In that case, it's in the /Dests entry directly in the document's catalog. I've rarely seen this though; it is the older PDF 1.1 mechanism, and the name tree was introduced in PDF 1.2 (current is 1.7).
You will definitely need the PDF spec for this: http://www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
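The recursive name tree lookup is easier to see on a toy model first. The following is a plain Java sketch, not Quartz code: Node is a made-up stand-in for a name tree node, with /Limits, /Names and /Kids reduced to simple fields, and the destination values are placeholder strings rather than real destination arrays. It also compares keys with String.compareTo, which matches the spec's byte ordering only for ASCII names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a PDF name tree.
final class NameTreeSketch {

    static final class Node {
        String least;     // first entry of /Limits, null on the root
        String greatest;  // second entry of /Limits, null on the root
        final List<String> keys = new ArrayList<>();    // keys from /Names
        final List<String> values = new ArrayList<>();  // values from /Names
        final List<Node> kids = new ArrayList<>();      // /Kids

        // Builds a leaf from alternating key/value arguments, as in /Names.
        static Node leaf(String... keyValuePairs) {
            Node node = new Node();
            for (int i = 0; i + 1 < keyValuePairs.length; i += 2) {
                node.keys.add(keyValuePairs[i]);
                node.values.add(keyValuePairs[i + 1]);
            }
            if (!node.keys.isEmpty()) {
                node.least = node.keys.get(0);
                node.greatest = node.keys.get(node.keys.size() - 1);
            }
            return node;
        }
    }

    // Returns the value stored under name, or null if it is not in the tree.
    static String lookup(Node node, String name) {
        // /Limits lets us skip whole subtrees that cannot contain the key.
        if (node.least != null
                && (name.compareTo(node.least) < 0 || name.compareTo(node.greatest) > 0)) {
            return null;
        }
        int index = node.keys.indexOf(name);
        if (index >= 0) {
            return node.values.get(index);
        }
        for (Node kid : node.kids) {
            String hit = lookup(kid, name);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }
}
```

In the real document, the value you get back is a destination (an array, or a dictionary with a /D entry), which you then resolve to a page the same way as an explicit destination.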
Thanks to Omz, here is a piece of code to retrieve a page number for an outline destination in a PDF file:
// Get Page Number from an array
- (int) getPageNumberFromArray:(CGPDFArrayRef)array ofPdfDoc:(CGPDFDocumentRef)pdfDoc withNumberOfPages:(int)numberOfPages
{
int pageNumber = -1;
// Page number reference is the first element of array (el 0)
CGPDFDictionaryRef pageDic;
CGPDFArrayGetDictionary(array, 0, &pageDic);
// page searching
for (int p=1; p<=numberOfPages; p++)
{
CGPDFPageRef page = CGPDFDocumentGetPage(pdfDoc, p);
if (CGPDFPageGetDictionary(page) == pageDic)
{
pageNumber = p;
break;
}
}
return pageNumber;
}
// Get page number from an outline. Only support "Dest" and "A" entries
- (int) getPageNumber:(CGPDFDictionaryRef)node ofPdfDoc:(CGPDFDocumentRef)pdfDoc withNumberOfPages:(int)numberOfPages
{
int pageNumber = -1;
CGPDFArrayRef destArray;
CGPDFDictionaryRef dicoActions;
if(CGPDFDictionaryGetArray(node, "Dest", &destArray))
{
pageNumber = [self getPageNumberFromArray:destArray ofPdfDoc:pdfDoc withNumberOfPages:numberOfPages];
}
else if(CGPDFDictionaryGetDictionary(node, "A", &dicoActions))
{
const char * typeOfActionConstChar;
CGPDFDictionaryGetName(dicoActions, "S", &typeOfActionConstChar);
NSString * typeOfAction = [NSString stringWithUTF8String:typeOfActionConstChar];
if([typeOfAction isEqualToString:@"GoTo"]) // only support "GoTo" entry. See PDF spec p653
{
CGPDFArrayRef dArray;
if(CGPDFDictionaryGetArray(dicoActions, "D", &dArray))
{
pageNumber = [self getPageNumberFromArray:dArray ofPdfDoc:pdfDoc withNumberOfPages:numberOfPages];
}
}
}
return pageNumber;
}

KRL and Yahoo Local Search

I'm trying to use Yahoo Local Search in a Kynetx Application.
ruleset avogadro {
meta {
name "yahoo-local-ruleset"
description "use results from Yahoo local search"
author "randall bohn"
key yahoo_local "get-your-own-key"
}
dispatch { domain "example.com"}
global {
datasource local:XML <- "http://local.yahooapis.com/LocalSearchService/V3/localsearch";
}
rule add_list {
select when pageview ".*" setting ()
pre {
ds = datasource:local("?appid=#{keys:yahoo_local()}&query=pizza&zip=#{zip}&results=5");
rs = ds.pick("$..Result");
}
append("body","<ul id='my_list'></ul>");
always {
set ent:pizza rs;
}
}
rule add_results {
select when pageview ".*" setting ()
foreach ent:pizza setting pizza
pre {
title = pizza.pick("$..Title");
}
append("#my_list", "<li>#{title}</li>");
}
}
The list I wind up with is
. [object Object]
and 'title' has
{'$t' => 'Pizza Shop 1'}
I can't figure out how to get just the title. It looks like the 'text content' from the original XML file turns into {'$t' => 'text content'}, and the '$t' causes problems for pick().
When XML datasources and datasets get converted into JSON, the text value within an XML node gets assigned to $t. You can pick the text of the title by changing your pick statement in the pre block to
title = pizza.pick("$..Title.$t");
Try that and see if that solves your problem.
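To make the shape of the converted data concrete, here is a small Java sketch of the same idea using plain maps. It is only a model of the structure described above, not KRL or the Kynetx converter, and the helper names are invented. It shows why picking "Title" alone yields an object (which the page renders as [object Object]) while going one level further to "$t" yields the text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of XML text content after conversion to JSON.
final class DollarTSketch {

    // Wraps element text as described above: <Title>x</Title> becomes {"$t": "x"}.
    static Map<String, Object> textNode(String text) {
        Map<String, Object> node = new LinkedHashMap<>();
        node.put("$t", text);
        return node;
    }

    // Equivalent of picking "child.$t": unwraps the text of a child element.
    static String textOf(Map<String, Object> element, String child) {
        Object node = element.get(child);
        if (!(node instanceof Map)) {
            return null;
        }
        Object text = ((Map<?, ?>) node).get("$t");
        return text == null ? null : text.toString();
    }
}
```

For a result whose Title is textNode("Pizza Shop 1"), result.get("Title") is still a map, while textOf(result, "Title") gives the plain string.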
Side notes on things not related to your question to consider:
1) Thank you for sharing the entire ruleset, what problem you were seeing and what you expected. Made answering your question much easier.
2) The ruleset identifier should not be changed from what AppBuilder or the command-line gem generate for you. Your identifier that is currently
ruleset avogadro {
should look something more like
ruleset a60x304 {
3) You don't need the
setting ()
in the select statement unless you have a capture group in your regular expression.
Turns out that pick("$..Title.$t") does work. It looks funny but it works. Less funny than a clown hat I guess.
name = pizza.pick("$..Title.$t");
city = pizza.pick("$..City.$t");
phone = pizza.pick("$..Phone.$t");
list_item = "<li>#{name}/#{city} #{phone}</li>";
Wish I had some pizza right now!