How would I handle RTF hyperlinks using Apache Tika in XSLT?

How would I handle RTF hyperlinks using Apache Tika in XSLT? - eclipse

This question is a follow-up to: What are some methods to converting RTF text nodes in XML using XSLT 2 / Saxon HE 11.3?.
After implementing the answered solution, I ran the code against a large dataset. During the processing of all that data, an item in source RTF caused the application to error.
The error:
Error on line 11 column 92 of urn:from-string: SXXP0003 Error reported by XML parser: The element type "a" must be terminated by the matching end-tag "</a>".: The element type "a" must be terminated by the matching end-tag "</a>".
I took a look at the source xml, which contained several RTF HYPERLINK codes. Source:
<SPECORMETHOD>{\rtf1\ansi\deff0\uc1\ansicpg1252\deftab720{\fonttbl{\f0\fnil\fcharset1 Arial;}{\f1\fnil\fcharset1 Times New Roman;}{\f2\fnil\fcharset1 WingDings;}}{\colortbl\red0\green0\blue0;\red255\green0\blue0;\red0\green128\blue0;\red0\green0\blue255;\red255\green255\blue0;\red255\green0\blue255;\red128\green0\blue128;\red128\green0\blue0;\red0\green255\blue0;\red0\green255\blue255;\red0\green128\blue128;\red0\green0\blue128;\red255\green255\blue255;\red192\green192\blue192;\red128\green128\blue128;\red0\green0\blue0;\red128\green128\blue0;}\wpprheadfoot1\paperw12240\paperh15840\margl720\margr720\margt720\margb720\headery720\footery720\endnhere\sectdefaultcl{\*\generator WPTools_5.17;}{\stylesheet{\s1\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs20 Normal;}{\s2\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs20 Default Paragraph Font;}{\s3\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs20\cf3\ul\sbasedon2 Hyperlink;}}{\pard\plain\plain\f1\fs36\par\pard\plain\plain\f1\fs36\par\plain\f1\fs28\tab 10\'94Flour Tortilla\par\plain\f1\fs28\tab Caesar \f1\b\i DIP\f1\i0 : {\field{\*\fldinst{HYPERLINK "..\\\\..\\\\SAUCES\\\\Dips\\\\Dip, Caesar.doc"}}{\*\fldtitle{..\\\\..\\\\SAUCES\\\\Dips\\\\Dip, Caesar.doc}}{\fldrslt{\f1\cf3\cs103\ul\cs3 Dip, Caesar.doc\plain\f1\fs28\b}}}\par\plain\f1\fs28\tab Ripped Romaine\par\plain\f1\fs28\tab Blackened Salmon julienne\par\plain\f1\fs28\tab Shaved Red Onion\par\plain\f1\fs28\tab Julienne Tomato\par\plain\f1\fs28\tab Grated Parmesan\par\plain\f1\fs28\tab Blackening spice: {\field{\*\fldinst{HYPERLINK "..\\\\..\\\\SPICE\\\\Blackening Spice.doc"}}{\*\fldtitle{..\\\\..\\\\SPICE\\\\Blackening Spice.doc}}{\fldrslt{\f1\cf3\cs103\ul\cs3 Blackening Spice.doc\plain\f1\fs28}}}\par\pard\plain\plain\f1\fs28\par\plain\f1\fs28 Method\par\plain\f1\fs28 Procedure Text \par\pard\plain\plain\f1\fs36\par}}</SPECORMETHOD>
For my purposes, the URL is not going to be a functional component, but for the sake of utility of this RTF conversion project, what might be needed to have the hyperlink codes work correctly, or to output them as text for reference? One way I can handle this is in the XSLT by intercepting the element, looking for the HYPERLINK code and replacing it with regular text.
The desired output for a hyperlink from this example would be (text only):
CAESAR DIP: ..\..\SAUCES\Dips\Dip, Caesar.doc
The only modification to the original code was in XSLT to do a check for an empty element when processing the <SPECORMETHOD>.
<xsl:choose>
<xsl:when test="string-length(SPECORMETHOD) > 0">
<rtf-as-xhtml>
<xsl:sequence select="tika:parse-rtf(SPECORMETHOD[string-length(.) > 0])"/>
</rtf-as-xhtml>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="'[EMPTY]'"/>
</xsl:otherwise>
</xsl:choose>
I've built this project in Eclipse 2022-12 (4.26.0). It's a Maven project using Apache Tika 2.7.0, and Saxon HE 11.3, using Java SE 1.8. Special thanks to Martin H.

I have run your sample rtf through Tika and the supposed XHTML output is unfortunately not well-formed:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.microsoft.rtf.RTFParser" />
<meta name="Content-Type" content="application/rtf" />
<title></title>
</head>
<body><p />
<p />
<p> 10”Flour Tortilla</p>
<p> Caesar <b><i>DIP</i>: <a href="..\\..\\SAUCES\\Dips\\Dip, Caesar.doc">Dip, Caesar.doc</b><b /></b></p>
<p><b /> Ripped Romaine</p>
<p> Blackened Salmon julienne</p>
<p> Shaved Red Onion</p>
<p> Julienne Tomato</p>
<p> Grated Parmesan</p>
<p> Blackening spice: Blackening Spice.doc</p>
<p />
<p>Method</p>
<p>Procedure Text </p>
<p />
<p />
</body></html>
So the error is in the fragment <p> Caesar <b><i>DIP</i>: <a href="..\\..\\SAUCES\\Dips\\Dip, Caesar.doc">Dip, Caesar.doc</b><b /></b></p>.
I don't know for sure whether that is a problem with the input somehow not being proper rtf but it looks more like a bug in the Tia parser and ToXmlContentHandler.
I have raised the potential issue https://issues.apache.org/jira/browse/TIKA-3972
In the end, with the help of the Saxonica guys (thanks to Michael Kay and Norm Walsh) I have found a better (probably anyway) approach of using Saxon with the Tika parser; instead of using Tika's ToXMLContentHandler() and its toString() method result fed to Saxon's DocumentBuilder it is possible to pass a Saxon BuildingContentHandler to Tika's parser directly to get an XdmNode:
public static XdmNode parseRtfToHTML2(String rtf, Processor processor) throws IOException, SAXException, TikaException, URISyntaxException, SaxonApiException {
DocumentBuilder docBuilder = processor.newDocumentBuilder();
BuildingContentHandler handler = docBuilder.newBuildingContentHandler();
AutoDetectParser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
try (InputStream stream = new ByteArrayInputStream(rtf.getBytes("utf8"))) {
parser.parse(stream, handler, metadata);
return handler.getDocumentNode();//docBuilder.build(new StreamSource(new StringReader(handler.toString())));
} catch (SaxonApiException e) {
throw new RuntimeException(e);
}
}
Using that approach, at least in a short test, no error is thrown for the hyperlink RTF example, see the updated project https://github.com/martin-honnen/SaxonTikaRtfTest1 for the code in more context.

Related

XForms: Possible to modify HTML element attribute ? (I'm using XSLTForms)

Does XForms have a mechanism for manipulating attributes of the resultant HTML?
I guess I mean emitting HTML dynamically and setting the attributes as part of that.
I know that using a xf:repeat - you can effectively emit HTML elements, but I can't work out if this would stretch to attributes?
I'm using XSLTForms as the implementation - so maybe this support hooks for Javascript to do this if there isn't a built-in way?
The reason to ask specifically - I would like to work with the audio element (and some other HTML5 elements).

Yes, it is named AVT for Attribute Value Template. As in XSLT, just wrap XPath expressions into curly braces like in <div class="proto{$myclass}">.

Thanks to the help from Alain Couthures - I was able to put together the following. Sharing in case others find it interesting.
<?xml-stylesheet href="xsltforms/xsltforms.xsl" type="text/xsl"?>
<html
xmlns="http://www.w3.org/1999/xhtml"
xmlns:xf="http://www.w3.org/2002/xforms">
<head>
<title>Podcast Player</title>
<xf:model>
<xf:instance xmlns="">
<data>
<url/>
</data>
</xf:instance>
<xf:instance id="feed" src="https://podcasts.files.bbci.co.uk/b05qqhqp.rss"/>
</xf:model>
<style><![CDATA[
* { font-family: arial; background-color:black; color: white }
]]></style>
</head>
<body>
<h1><xf:output ref="instance('feed')/channel/title"/></h1>
<blockquote><xf:output ref="instance('feed')/channel/description"/></blockquote>
<xf:select1 ref="url" appearance="full">
<xf:itemset nodeset="instance('feed')/channel/item">
<xf:label ref="title"/>
<xf:value ref="enclosure/#url"/>
</xf:itemset>
</xf:select1>
<audio src="{url}" controls="true"/>
</body>
</html>
The relevant bit to this post is the "audio" tag and in particular the "{url}" attribute template.
Here's a screenshot:
For those that wish to try this example, you'll need XSLTForms : https://en.wikibooks.org/wiki/XSLTForms , other XForms implementations are available.
Note: save the file with the extension '.xhtml' and place behind a webserver of your choice.
For instance using test HTTP servers: php, python etc.

German Novel with DkPro

I tried German Novel with DkPro. My Sample input file is an XHTML file. How can I get my PosTagger output based on the XHTML index.
Script:
PACKAGE com.github.uima.ruta.novel;
ENGINE utils.HtmlAnnotator;
ENGINE utils.HtmlConverter;
ENGINE utils.ViewWriter;
TYPESYSTEM utils.HtmlTypeSystem;
TYPESYSTEM utils.TypeSystem;
IMPORT PACKAGE de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos FROM desc.type.POS;
IMPORT de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma FROM desc.type.LexicalUnits;
UIMAFIT org.dkpro.core.opennlp.OpenNlpSegmenter;
UIMAFIT org.dkpro.core.stanfordnlp.StanfordPosTagger;
CONFIGURE(HtmlAnnotator, "onlyContent" = false);
Document{-> EXEC(HtmlAnnotator)};
Document { -> CONFIGURE(HtmlConverter, "inputView" = "_InitialView","outputView" = "plain"),
EXEC(HtmlConverter,{TAG})};
"<\\?xml version=\"1.0\" encoding=\"UTF-8\"\\?>"->MARKUP;
uima.tcas.DocumentAnnotation{-CONTAINS(POS)} -> {
uima.tcas.DocumentAnnotation{-> SETFEATURE("language", "de")};
EXEC(OpenNlpSegmenter);
EXEC(StanfordPosTagger, {POS});
};
Sample Input
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head xmlns="http://www.w3.org/1999/xhtml"><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><meta name="viewport" content="width=device-width, initial-scale=1.0" /><style></style><title></title></head><link xmlns="http://www.w3.org/1999/xhtml" src="./ckeditor.css" /><body xmlns="http://www.w3.org/1999/xhtml"><div class="WordSection1"><p class="Normal" data-name="Normal"><span data-bkmark="para10000"></span><span style="font-size:9pt">Der Idiot</span><span data-bkmark="para10000"></span></p>
<p class="Normal" data-name="Normal"><span data-bkmark="para10001"></span><span style="font-size:9pt">Ein Roman in vier Teilen.</span><span data-bkmark="para10001"></span></p>
</div>
<hr align="left" size="1" width="33%" /></body>
</html>
In the sample script, uima.tcas.DocumentAnnotation is sent to PosTagger Process. The MARKUP in this annotation affecting the accuracy. What I need to do to get the accuracy.

The HtmlAnnotator can be used to hide additional MARKUP so that rules are not affected by them.
The HtmlConverter is able to create a new document text without html/xml markup, but only in a new CAS view as the initial text in a CAS is static and cannot be changed.
The EXEC action is able to apply an external analysis engine on the current CAS object, and it can be configured to be applied on a different CAS view. However, the external analysis engine is applied on the complete CAS including the markup. No new CAS is created on the fly.
There are several options what you could do.
You could apply the pos tagger on the ‘plain’ view, but you cannot access these annotation with rules as the annotation will be present in a different view
You setup a multi view setting, e.g, by a two stage process. First convert the text to plain text without markup, and then apply the pos tagger on the new text
Depending on the external analysis engine, you maybe can also solve this by redefining what a token is.

ZK Textbox content containing <html> tag

When I put the string '' into a ZK Textbox, that is
<textbox value="<html>" />
it causes a JavaScript error in the browser
Uncaught SyntaxError: Unexpected token ILLEGAL
and I can see in the developer tools of the browser that the generated JS code is really incomplete:
zkmx(
[0,'g2JQ_',{dt:'z_y20',cu:'\x2Fdtag',uu:'\x2Fdtag\x2Fzkau',ru:'\x2Fzul\x2Fcomponent\x2Fmenu.zul',style:'width\x3A100\x25\x3B',ct:true},[
...
['zul.inp.Textbox','g2JQp7',{id:'tb',$$0onBlur:true,$$0onSwipe:true,$$0onError:true,$$0onAfterSize:true,$$0onChanging:true,$$1onChange:true,$$1onSelection:true,$$0onFocus:true,width:'500px',style:'font-size:11px;',_value:'
</div>
How can I escape the content of the textbox so that I can display any HTML code in the textbox?
I tried
Replace '<' with '> but then > is displayed.
I also tried
<![CDATA[ <html></html>]]>
but then it was literally displayed, that is, also the
<![CDATA[
UPDATE
It is somehow due to the fact that we have JSPs containing several ZK pages.
And the exact content what causes problem is the closing HTML tag
</html>
The workaround is the following:
Events.echoEvent("onLater", txtDescription, txtDescription);
txtDescription.addEventListener("onLater", new EventListener<Event>() {
#Override
public void onEvent(Event event) throws Exception {
txtDescription.setValue("<html>...</html>");
}
});

Normally you should get values from your composer or viewmodel, and then this problem doesn't exist.
If you want to do it in the zul, you can make a parameter in zscript like this :
<zscript>
<![CDATA[
String a = "<html>";
]]>
</zscript>
<textbox value="${a}" />
Here I created a fiddle so you can test it.

Many time we have saved HTML code in database but when we are going to display that saved value in zk page it show HTML code as well with data .
To resolve this issue we have to use HTML escape there are plenty of way to fix this
You can do it on Java side as well zul page and here is simple way how you can achieve in the zul page
<html>
<![CDATA[
${vm.accessYourValue}
]]>
</html>

Ektron Forms - Generate Plaintext Email

In Ektron I have a form that generates an HTML email on submission and sends it to a mailbox (foo#bar.com). An application (MailReader) checks that mailbox, reads the messages, strips all markup, and saves the resulting messages for later use. This is a problem because all the text from the HTML email end up getting mashed together and are completely unreadable by someone using the MailReader app.
For example, this HTML:
<h1>Header1</h1>
<div>
<h2>Header2</h2>
<p>Some text in a paragraph.</p>
</div>
Becomes:
Header1Header2Some text in a paragraph.
I cannot change MailReader in any way, it will always strip any markup so my solution is to have Ektron generate an email that contains no HTML for just this form. I know the email is generated using XSLT transformations with the file /Workarea/controls/forms/DefaultFormEmailBody.xslt.
My attempt at my solution involved adding a hidden input to the form with name "__nohtml". Then the XSLT will do the following:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:choose>
<xsl:when test="/field[starts-with(#name, '__nohtml')]">
<xsl:output method="text" />
</xsl:when>
<xsl:otherwise>
<xsl:output method="html" />
</xsl:otherwise>
</xsl:choose>
<xsl:template match="/">
<xsl:choose>
<xsl:when test="/field[starts-with(#name, '__nohtml')]">
text output
</xsl:when>
<xsl:otherwise>
html output
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
But it never sends the email when I use this. I tried just rendering with this template on my own machine and I get errors. And I've also noticed that w3 documentation says the The xsl:output element is only allowed as a top-level element. That probably explains why I can't put it in the element.
I've also tried just completely omitting the element and it seems to default to HTML no matter what.
I've tried searching our local Ektron code for the point where the transformation occurs so I can tell Ektron to use the default XSLT in standard cases or use a different XSLT if __nohtml exists, but I don't know if that code is even accessible.
I would greatly appreciate it if someone can help me find a template that would allow either HTML or plaintext based on whether a field exists. If not then it would be equally awesome if someone could point me to the point of XSLT transformation in Ektron's code (if it's even accessible).

I would recommend a form strategy to generate a version of an email that you build out. There should be a public override void OnAfterSubmit(FormData formData, FormSubmittedData submittedFormData, string formXml, CmsEventArgs eventArgs) that would allow you to create pull out the submitted form data (either in the raw object or I believe it returns the xml as well. From there you could just parse it and save your html file.

Is the inputmode attribute valid (in HTML5 forms) or not?

I am getting validation errors with the inputmode attribute on text areas and text fields. The validator tells me Attribute inputmode not allowed on element input at this point but the HTML5 spec indicates that it is allowed.
Is there actually something wrong with this code, or is the validator at fault?
Here is a bare bones case which will produce exactly this kind of validation error (twice), in one case on an email input, and on the other on a textarea.
<!DOCTYPE HTML>
<html lang="en">
<head>
<meta charset="utf-8">
</head>
<body>
<form method="post" action="contactme.php">
<label class='pas block'>
Your E-Mail:<br/>
<input type='email' name='email' required inputmode='latin' placeholder='your e-mail here' />
</label>
<label class='pas block'>
Your Message:<br/>
<textarea name='message' required inputmode='latin' placeholder='and your message here!'></textarea>
</label>
</form>
</body>
</html>

Also, see the chart about which attributes apply to the different input types here:
http://www.whatwg.org/specs/web-apps/current-work/multipage/the-input-element.html#attr-input-type
The "inputmode" attribute applies only to "text" and "search".
UPDATE 2019-09-04: "inputmode" is now a global attribute (per WHATWG) and can be specified on any HTML element: https://html.spec.whatwg.org/multipage/dom.html#global-attributes
Another reference page for "inputmode":
https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/inputmode
On another note, "inputmode" is not a W3C HTML5 attribute, but it is a W3C HTML 5.1 attribute (at least at the time I'm writing this). UPDATE 2019-09-04: "inputmode" has been removed from HTML 5.2 and HTML 5.3.

The HTML5 spec says
The following content attributes must not be specified and do not apply to the element: accept, alt, checked, dirname, formaction, formenctype, formmethod, formnovalidate, formtarget, height, inputmode, max, min, src, step, and width.
It's under bookkeeping details at https://html.spec.whatwg.org/multipage/input.html#e-mail-state-(type=email)

Five years after the question was asked, some may wonder why some of the properties listed by #dsas doesn't trigger such errors, like enctype
The answer is simple support, while enctype for instance gained a wide support
inputmethod is supported only as of IE11 and Edge 14, for more infos click here

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

How would I handle RTF hyperlinks using Apache Tika in XSLT? - eclipse

Related

XForms: Possible to modify HTML element attribute ? (I'm using XSLTForms)

German Novel with DkPro

ZK Textbox content containing <html> tag

Ektron Forms - Generate Plaintext Email

Is the inputmode attribute valid (in HTML5 forms) or not?

Categories

Resources