German Novel with DkPro - uima

I am trying to process a German novel with DKPro. My sample input file is an XHTML file. How can I get the POS tagger output with offsets based on the XHTML index?
Script:
PACKAGE com.github.uima.ruta.novel;
ENGINE utils.HtmlAnnotator;
ENGINE utils.HtmlConverter;
ENGINE utils.ViewWriter;
TYPESYSTEM utils.HtmlTypeSystem;
TYPESYSTEM utils.TypeSystem;
IMPORT PACKAGE de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos FROM desc.type.POS;
IMPORT de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma FROM desc.type.LexicalUnits;
UIMAFIT org.dkpro.core.opennlp.OpenNlpSegmenter;
UIMAFIT org.dkpro.core.stanfordnlp.StanfordPosTagger;
CONFIGURE(HtmlAnnotator, "onlyContent" = false);
Document{-> EXEC(HtmlAnnotator)};
Document{-> CONFIGURE(HtmlConverter, "inputView" = "_InitialView", "outputView" = "plain"),
    EXEC(HtmlConverter, {TAG})};
"<\\?xml version=\"1.0\" encoding=\"UTF-8\"\\?>" -> MARKUP;
uima.tcas.DocumentAnnotation{-CONTAINS(POS)} -> {
    uima.tcas.DocumentAnnotation{-> SETFEATURE("language", "de")};
    EXEC(OpenNlpSegmenter);
    EXEC(StanfordPosTagger, {POS});
};
Sample Input
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head xmlns="http://www.w3.org/1999/xhtml"><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><meta name="viewport" content="width=device-width, initial-scale=1.0" /><style></style><title></title></head><link xmlns="http://www.w3.org/1999/xhtml" src="./ckeditor.css" /><body xmlns="http://www.w3.org/1999/xhtml"><div class="WordSection1"><p class="Normal" data-name="Normal"><span data-bkmark="para10000"></span><span style="font-size:9pt">Der Idiot</span><span data-bkmark="para10000"></span></p>
<p class="Normal" data-name="Normal"><span data-bkmark="para10001"></span><span style="font-size:9pt">Ein Roman in vier Teilen.</span><span data-bkmark="para10001"></span></p>
</div>
<hr align="left" size="1" width="33%" /></body>
</html>
In the sample script, uima.tcas.DocumentAnnotation is passed to the POS tagger. The MARKUP inside this annotation affects the tagging accuracy. What do I need to do to get accurate results?

The HtmlAnnotator can be used to hide additional MARKUP so that rules are not affected by it.
The HtmlConverter is able to create a new document text without HTML/XML markup, but only in a new CAS view, as the initial text in a CAS is static and cannot be changed.
The EXEC action is able to apply an external analysis engine on the current CAS object, and it can be configured to be applied on a different CAS view. However, the external analysis engine is applied on the complete CAS including the markup. No new CAS is created on the fly.
There are several options:
You could apply the POS tagger on the 'plain' view, but you cannot access these annotations with rules, as the annotations will be present in a different view.
You could set up a multi-view setting, e.g., by a two-stage process: first convert the text to plain text without markup, and then apply the POS tagger on the new text.
Depending on the external analysis engine, you may also be able to solve this by redefining what a token is.
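To illustrate the idea behind tagging a markup-free text and still reporting positions in the original XHTML, here is a minimal Python sketch. It is not the Ruta or UIMA API; the function names strip_markup and to_source_span are made up for this example, and the tag pattern is deliberately simplistic (it assumes no '>' inside attribute values). Whether and how the HtmlConverter exposes such a mapping is not shown here.

```python
import re

# Simplistic tag pattern: assumes '>' never occurs inside an attribute value.
TAG = re.compile(r"<[^>]*>")

def strip_markup(source):
    """Return (plain_text, offsets), where offsets[i] is the index in
    `source` of the character that ended up at plain_text[i]."""
    plain_chars = []
    offsets = []
    pos = 0
    for match in TAG.finditer(source):
        # Copy the text between the previous tag and this one.
        for i in range(pos, match.start()):
            plain_chars.append(source[i])
            offsets.append(i)
        pos = match.end()
    # Copy whatever follows the last tag.
    for i in range(pos, len(source)):
        plain_chars.append(source[i])
        offsets.append(i)
    return "".join(plain_chars), offsets

def to_source_span(offsets, begin, end):
    """Map a non-empty [begin, end) span on the plain text back to the source."""
    return offsets[begin], offsets[end - 1] + 1

source = '<p class="Normal"><span style="font-size:9pt">Der Idiot</span></p>'
plain, offsets = strip_markup(source)

# Pretend a tagger annotated "Idiot" on the plain text.
begin = plain.index("Idiot")
src_begin, src_end = to_source_span(offsets, begin, begin + len("Idiot"))
print(plain)
print(source[src_begin:src_end])
```

A tagger would run on plain only, and each resulting annotation span would be projected back with to_source_span, so the markup never reaches the tagger.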

XForms: Possible to modify HTML element attribute ? (I'm using XSLTForms)

Does XForms have a mechanism for manipulating attributes of the resultant HTML?
I guess I mean emitting HTML dynamically and setting the attributes as part of that.
I know that by using an xf:repeat you can effectively emit HTML elements, but I can't work out whether this would stretch to attributes.
I'm using XSLTForms as the implementation, so maybe it supports hooks for JavaScript to do this if there isn't a built-in way?
The reason to ask specifically - I would like to work with the audio element (and some other HTML5 elements).
Yes, it is named AVT for Attribute Value Template. As in XSLT, just wrap XPath expressions into curly braces like in <div class="proto{$myclass}">.
Thanks to the help from Alain Couthures - I was able to put together the following. Sharing in case others find it interesting.
<?xml-stylesheet href="xsltforms/xsltforms.xsl" type="text/xsl"?>
<html
    xmlns="http://www.w3.org/1999/xhtml"
    xmlns:xf="http://www.w3.org/2002/xforms">
  <head>
    <title>Podcast Player</title>
    <xf:model>
      <xf:instance xmlns="">
        <data>
          <url/>
        </data>
      </xf:instance>
      <xf:instance id="feed" src="https://podcasts.files.bbci.co.uk/b05qqhqp.rss"/>
    </xf:model>
    <style><![CDATA[
      * { font-family: arial; background-color:black; color: white }
    ]]></style>
  </head>
  <body>
    <h1><xf:output ref="instance('feed')/channel/title"/></h1>
    <blockquote><xf:output ref="instance('feed')/channel/description"/></blockquote>
    <xf:select1 ref="url" appearance="full">
      <xf:itemset nodeset="instance('feed')/channel/item">
        <xf:label ref="title"/>
        <xf:value ref="enclosure/@url"/>
      </xf:itemset>
    </xf:select1>
    <audio src="{url}" controls="true"/>
  </body>
</html>
The relevant bit to this post is the "audio" tag and in particular the "{url}" attribute template.
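As a rough illustration of what an attribute value template does, the following Python sketch replaces each braced expression in an attribute string with the text of the node it selects. It is only a model of the concept, not how XSLTForms works internally: expand_avt is a made-up name, the example.org URL is a placeholder, and only the small path subset understood by xml.etree.ElementTree is supported (no variables such as $myclass, no escaped braces).

```python
import re
import xml.etree.ElementTree as ET

# One braced expression, e.g. "{url}"; escaped braces are not handled here.
AVT_EXPR = re.compile(r"\{([^{}]+)\}")

def expand_avt(template, context):
    """Replace each {path} in `template` with the text of the first node
    that `path` selects relative to `context` (empty string if none)."""
    def evaluate(match):
        return context.findtext(match.group(1), default="")
    return AVT_EXPR.sub(evaluate, template)

# Stand-in for the default instance after an item has been selected.
instance = ET.fromstring("<data><url>https://example.org/episode.mp3</url></data>")
audio_src = expand_avt("{url}", instance)
print(audio_src)
```

In the form above, the equivalent evaluation happens again whenever the url node changes, which is what keeps the audio element in sync with the selection.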
For those who wish to try this example, you'll need XSLTForms: https://en.wikibooks.org/wiki/XSLTForms (other XForms implementations are available).
Note: save the file with the extension '.xhtml' and place it behind a web server of your choice, for instance one of the test HTTP servers that ship with PHP, Python, etc.

How would I handle RTF hyperlinks using Apache Tika in XSLT?

This question is a follow-up to: What are some methods to converting RTF text nodes in XML using XSLT 2 / Saxon HE 11.3?.
After implementing the answered solution, I ran the code against a large dataset. During the processing of all that data, an item in source RTF caused the application to error.
The error:
Error on line 11 column 92 of urn:from-string: SXXP0003 Error reported by XML parser: The element type "a" must be terminated by the matching end-tag "</a>".: The element type "a" must be terminated by the matching end-tag "</a>".
I took a look at the source xml, which contained several RTF HYPERLINK codes. Source:
<SPECORMETHOD>{\rtf1\ansi\deff0\uc1\ansicpg1252\deftab720{\fonttbl{\f0\fnil\fcharset1 Arial;}{\f1\fnil\fcharset1 Times New Roman;}{\f2\fnil\fcharset1 WingDings;}}{\colortbl\red0\green0\blue0;\red255\green0\blue0;\red0\green128\blue0;\red0\green0\blue255;\red255\green255\blue0;\red255\green0\blue255;\red128\green0\blue128;\red128\green0\blue0;\red0\green255\blue0;\red0\green255\blue255;\red0\green128\blue128;\red0\green0\blue128;\red255\green255\blue255;\red192\green192\blue192;\red128\green128\blue128;\red0\green0\blue0;\red128\green128\blue0;}\wpprheadfoot1\paperw12240\paperh15840\margl720\margr720\margt720\margb720\headery720\footery720\endnhere\sectdefaultcl{\*\generator WPTools_5.17;}{\stylesheet{\s1\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs20 Normal;}{\s2\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs20 Default Paragraph Font;}{\s3\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs20\cf3\ul\sbasedon2 Hyperlink;}}{\pard\plain\plain\f1\fs36\par\pard\plain\plain\f1\fs36\par\plain\f1\fs28\tab 10\'94Flour Tortilla\par\plain\f1\fs28\tab Caesar \f1\b\i DIP\f1\i0 : {\field{\*\fldinst{HYPERLINK "..\\\\..\\\\SAUCES\\\\Dips\\\\Dip, Caesar.doc"}}{\*\fldtitle{..\\\\..\\\\SAUCES\\\\Dips\\\\Dip, Caesar.doc}}{\fldrslt{\f1\cf3\cs103\ul\cs3 Dip, Caesar.doc\plain\f1\fs28\b}}}\par\plain\f1\fs28\tab Ripped Romaine\par\plain\f1\fs28\tab Blackened Salmon julienne\par\plain\f1\fs28\tab Shaved Red Onion\par\plain\f1\fs28\tab Julienne Tomato\par\plain\f1\fs28\tab Grated Parmesan\par\plain\f1\fs28\tab Blackening spice: {\field{\*\fldinst{HYPERLINK "..\\\\..\\\\SPICE\\\\Blackening Spice.doc"}}{\*\fldtitle{..\\\\..\\\\SPICE\\\\Blackening Spice.doc}}{\fldrslt{\f1\cf3\cs103\ul\cs3 Blackening Spice.doc\plain\f1\fs28}}}\par\pard\plain\plain\f1\fs28\par\plain\f1\fs28 Method\par\plain\f1\fs28 Procedure Text \par\pard\plain\plain\f1\fs36\par}}</SPECORMETHOD>
For my purposes, the URL is not going to be a functional component, but for the sake of utility of this RTF conversion project, what might be needed to have the hyperlink codes work correctly, or to output them as text for reference? One way I can handle this is in the XSLT by intercepting the element, looking for the HYPERLINK code and replacing it with regular text.
The desired output for a hyperlink from this example would be (text only):
CAESAR DIP: ..\..\SAUCES\Dips\Dip, Caesar.doc
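One possible way to get at the link targets as plain text, sketched here in Python rather than in the XSLT or Java used in the project, is to pull them out of the raw RTF with a regular expression before or alongside the conversion. This is only a sketch under assumptions: the pattern covers just the field shape seen in the sample above, hyperlink_targets is a made-up name, and collapsing every run of backslashes into a single one is a simplification of the RTF and field-instruction escaping.

```python
import re

# Matches the field-instruction group of an RTF hyperlink as it appears in the sample:
#   {\*\fldinst{HYPERLINK "target"}}
FLDINST_HYPERLINK = re.compile(r'\{\\\*\\fldinst\{HYPERLINK "([^"]*)"\}\}')

def hyperlink_targets(rtf):
    """Return the targets of all HYPERLINK fields, with backslash escapes undone."""
    targets = []
    for raw_target in FLDINST_HYPERLINK.findall(rtf):
        # Assumption: every run of backslashes stands for one path separator
        # (the sample writes each separator as four backslashes).
        targets.append(re.sub(r"\\+", lambda _: "\\", raw_target))
    return targets

# Shortened fragment with the same structure as the source RTF.
sample = (
    r'Caesar DIP: {\field{\*\fldinst{HYPERLINK "..\\\\..\\\\SAUCES\\\\Dips\\\\Dip, Caesar.doc"}}'
    r'{\fldrslt{\f1\cf3\ul Dip, Caesar.doc}}}'
)
for target in hyperlink_targets(sample):
    print(target)
```

The extracted targets could then be written next to the converted text, which gives the text-only form shown above without depending on how the parser renders the field.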
The only modification to the original code was in XSLT to do a check for an empty element when processing the <SPECORMETHOD>.
<xsl:choose>
  <xsl:when test="string-length(SPECORMETHOD) > 0">
    <rtf-as-xhtml>
      <xsl:sequence select="tika:parse-rtf(SPECORMETHOD[string-length(.) > 0])"/>
    </rtf-as-xhtml>
  </xsl:when>
  <xsl:otherwise>
    <xsl:value-of select="'[EMPTY]'"/>
  </xsl:otherwise>
</xsl:choose>
I've built this project in Eclipse 2022-12 (4.26.0). It's a Maven project using Apache Tika 2.7.0, and Saxon HE 11.3, using Java SE 1.8. Special thanks to Martin H.
I have run your sample RTF through Tika, and the supposed XHTML output is unfortunately not well-formed:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.microsoft.rtf.RTFParser" />
<meta name="Content-Type" content="application/rtf" />
<title></title>
</head>
<body><p />
<p />
<p> 10”Flour Tortilla</p>
<p> Caesar <b><i>DIP</i>: <a href="..\\..\\SAUCES\\Dips\\Dip, Caesar.doc">Dip, Caesar.doc</b><b /></b></p>
<p><b /> Ripped Romaine</p>
<p> Blackened Salmon julienne</p>
<p> Shaved Red Onion</p>
<p> Julienne Tomato</p>
<p> Grated Parmesan</p>
<p> Blackening spice: Blackening Spice.doc</p>
<p />
<p>Method</p>
<p>Procedure Text </p>
<p />
<p />
</body></html>
So the error is in the fragment <p> Caesar <b><i>DIP</i>: <a href="..\\..\\SAUCES\\Dips\\Dip, Caesar.doc">Dip, Caesar.doc</b><b /></b></p>.
I don't know for sure whether the problem is that the input is somehow not proper RTF, but it looks more like a bug in the Tika parser and ToXMLContentHandler.
I have raised the potential issue as https://issues.apache.org/jira/browse/TIKA-3972
In the end, with the help of the Saxonica guys (thanks to Michael Kay and Norm Walsh), I found a probably better approach to using Saxon with the Tika parser: instead of feeding the toString() result of Tika's ToXMLContentHandler to Saxon's DocumentBuilder, it is possible to pass a Saxon BuildingContentHandler to Tika's parser directly to get an XdmNode:
public static XdmNode parseRtfToHTML2(String rtf, Processor processor) throws IOException, SAXException, TikaException, URISyntaxException, SaxonApiException {
    DocumentBuilder docBuilder = processor.newDocumentBuilder();
    // Saxon's handler builds the XdmNode directly from the SAX events Tika emits
    BuildingContentHandler handler = docBuilder.newBuildingContentHandler();
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = new ByteArrayInputStream(rtf.getBytes("utf8"))) {
        parser.parse(stream, handler, metadata);
        // replaces the earlier docBuilder.build(new StreamSource(new StringReader(handler.toString())))
        return handler.getDocumentNode();
    } catch (SaxonApiException e) {
        throw new RuntimeException(e);
    }
}
Using that approach, at least in a short test, no error is thrown for the hyperlink RTF example, see the updated project https://github.com/martin-honnen/SaxonTikaRtfTest1 for the code in more context.

Sling Mapping Rewrite Rules do not rewrite paths in meta tags

I have Sling mappings set up that rewrite outgoing paths to the external URL. An example of this rewrite:
/content/www-sitename/home.html would be rewritten to http://www.sitename.com/home.html
I have also configured the LinkCheckerTransformerFactory: linkcheckertransformer.rewriteElements=["a:href","area:href","form:action","link:href","meta:content"]
Some HTML on a page component:
<head>
<link rel="canonical" href="/content/www-sitename/home.html" />
<meta name="canonical" content="/content/www-sitename/home.html" />
</head>
When visited, only the link:href has been rewritten, meta:content is unchanged:
<head>
<link rel="canonical" href="http://www.sitename.com/home.html" />
<meta name="canonical" content="/content/www-sitename/home.html" />
</head>
Worth noting is that the link:href was not rewritten prior to configuring the linkcheckertransformer.rewriteElements to include it. Why did this change work for link:href, but not for meta:content? Aside from creating a custom rewrite filter, what can be done to get links in meta:content attributes to be rewritten?
nerd's answer is correct: by default, the internal Sling mechanism responsible for parsing HTML (htmlparser) supports only the following tags: a, area, form, base, link, script, body. So even if you add meta:content to the LinkChecker configuration, CQ won't recognize <meta> as a tag that needs processing.
In order to reconfigure htmlparser, create a node named generator-htmlparser under /libs/cq/config/rewriter/default with the following properties:
jcr:primaryType = nt:unstructured
includeTags = [A, AREA, FORM, BASE, LINK, SCRIPT, BODY, META]
The includeTags property should be multivalued, so you can add other tags in the future.
If you don't want to override the content under /libs, create your own rewriter configuration:
Copy /libs/cq/config/rewriter/default and its children to /apps/YOURAPP/config/rewriter/my-rewriter.
Set the order property on my-rewriter to 1.
Create generator-htmlparser under my-rewriter as above.
I think you have to add the meta tag to the htmlparser generator.
see my question and answer: How to add additional element to htmlparser generator

Why use template engine in playframework 2 (scala) if we may stay with pure scala

Why use the Scala template engine in Play Framework 2 (Scala) if we could stay with just Scala?
Using a template engine means:
additional processor time for transforming the template syntax into Scala code
then compiling this code (which is not as concise as if it were written by hand, so it compiles even slower)
Also, if a template has not yet been converted into Scala, you see code inconsistencies (red highlighting in your IDE) in your main code, so you have to think about it every time.
Why not just use the core XML/HTML support that Scala provides, like here: http://www.scala-lang.org/node/131
Is there any pure Scala template (that you can recommend) I can use in Play Framework or standalone?
Actually, you should ask the dev team this question; however, consider a few points:
You don't need to use Play's templating engine at all; you can easily return any string with the Ok() method, so according to your link you can just do something like Ok(theDate("John Doe").toString())
Play uses an approach that is very typical of other MVC web frameworks, where views are HTML-based files, because... it's a web-dedicated framework. I can't see anything wrong with this; I sometimes work with other languages/frameworks and see that the only difference in views between them is the language-specific syntax, and that's the goal!
Also, don't forget that Play is a bilingual system; someone could just as well ask 'why not use some Java lib for processing the views?'
The built-in Scala XML literals are not well suited for creating complex programs; you easily run into issues (that's also why there's a library called anti-xml). Martin Odersky himself regretted making this a language feature.
Finally, there are IDEs with support for Play 2 views. I'm working with IDEA 12 with Play 2 support, and although it's not perfect (it's quite new, so sometimes there are small problems), in most cases it works fine. It understands the Play view syntax, offers autocomplete, and even lets you option+click on some object in the view to jump directly to the method/model's declaration, etc.
As for your last question: AFAIK there is officially a Groovy engine available as a module, which offers the template engine known from Play 1.x. However, keep in mind that it's just a bridge for people migrating from Play 1.x to Play 2.x, as it's slower than the native engine of Play 2.
For me this fits as an answer, for the last question at least.
This is just Scala, using its built-in XML support.
http://www.alvarocarrasco.com/2011/03/play-framework-and-templating-with.html?m=1
Sample:
This is a template: Templates.scala file
package templates
import play.api.templates.Html
import scala.xml.Xhtml
import controllers.routes
object Main {
  def page (title:String="Default title")(content: => scala.xml.Elem) = Html {
    "<!DOCTYPE html>" +
    Xhtml.toXhtml(
      <html>
        <head>
          <title>{title}</title>
          <link rel="stylesheet" media="screen" href={routes.Assets.at("stylesheets/main.css").toString()} />
          <link rel="shortcut icon" type="image/png" href={routes.Assets.at("images/favicon.png").toString()} />
          <script src={routes.Assets.at("javascripts/jquery-1.9.0.min.js").toString()} type="text/javascript" />
        </head>
        <body>
          {content}
        </body>
      </html>
    )
  }
  // a panel template, just as an example
  def panel (label:String="Some label")(content: => scala.xml.Elem) = {
    <div class="panel">
      <div class="panel-label">{label}</div>
      <div>{content}</div>
    </div>
  }
}
This is an index page index.scala file
package views
import templates.Main._
object IndexPage {
  def apply() = {
    page(title="Welcome to my Page!") {
      <div>
        <h1>Hello</h1>
        <p>Some template markup</p>
        {
          panel(label="Dashboard panel")(
            <div>
              Panel content
            </div>
          )
        }
      </div>
    }
  }
}
This is a controller: Application.scala file
package controllers
import play.api.mvc._
object Application extends Controller {
  def index = Action {
    Ok(
      views.IndexPage()
    );
  }
}

Is the inputmode attribute valid (in HTML5 forms) or not?

I am getting validation errors with the inputmode attribute on text areas and text fields. The validator tells me "Attribute inputmode not allowed on element input at this point", but the HTML5 spec indicates that it is allowed.
Is there actually something wrong with this code, or is the validator at fault?
Here is a bare bones case which will produce exactly this kind of validation error (twice), in one case on an email input, and on the other on a textarea.
<!DOCTYPE HTML>
<html lang="en">
  <head>
    <meta charset="utf-8">
  </head>
  <body>
    <form method="post" action="contactme.php">
      <label class='pas block'>
        Your E-Mail:<br/>
        <input type='email' name='email' required inputmode='latin' placeholder='your e-mail here' />
      </label>
      <label class='pas block'>
        Your Message:<br/>
        <textarea name='message' required inputmode='latin' placeholder='and your message here!'></textarea>
      </label>
    </form>
  </body>
</html>
Also, see the chart about which attributes apply to the different input types here:
http://www.whatwg.org/specs/web-apps/current-work/multipage/the-input-element.html#attr-input-type
The "inputmode" attribute applies only to "text" and "search".
UPDATE 2019-09-04: "inputmode" is now a global attribute (per WHATWG) and can be specified on any HTML element: https://html.spec.whatwg.org/multipage/dom.html#global-attributes
Another reference page for "inputmode":
https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/inputmode
On another note, "inputmode" is not a W3C HTML5 attribute, but it is a W3C HTML 5.1 attribute (at least at the time I'm writing this). UPDATE 2019-09-04: "inputmode" has been removed from HTML 5.2 and HTML 5.3.
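The rule stated in this answer for the spec at the time (inputmode applies only to input types "text" and "search") can be turned into a small check, sketched here in Python with the standard html.parser module. It models only that older rule for input elements, says nothing about textarea, and does not reflect the current WHATWG spec, where inputmode is a global attribute. InputmodeChecker and flag_inputmode are names made up for this example, and the second input in the sample is hypothetical.

```python
from html.parser import HTMLParser

# Input types on which, per the answer above, inputmode applied at the time.
ALLOWED_TYPES = {"text", "search"}

class InputmodeChecker(HTMLParser):
    """Collects the type of every <input> that carries inputmode
    but is not one of ALLOWED_TYPES."""

    def __init__(self):
        super().__init__()
        self.flagged = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        # A missing type attribute means the default type, "text".
        input_type = (attributes.get("type") or "text").lower()
        if "inputmode" in attributes and input_type not in ALLOWED_TYPES:
            self.flagged.append(input_type)

def flag_inputmode(html):
    checker = InputmodeChecker()
    checker.feed(html)
    checker.close()
    return checker.flagged

sample = """
<input type='email' name='email' required inputmode='latin' />
<input type='search' name='q' inputmode='latin' />
"""
print(flag_inputmode(sample))
```

Applied to the e-mail field from the question, the check flags the input, which matches the validator message for that element.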
The HTML5 spec says
The following content attributes must not be specified and do not apply to the element: accept, alt, checked, dirname, formaction, formenctype, formmethod, formnovalidate, formtarget, height, inputmode, max, min, src, step, and width.
It's under bookkeeping details at https://html.spec.whatwg.org/multipage/input.html#e-mail-state-(type=email)
Five years after the question was asked, some may wonder why some of the attributes listed by @dsas, such as enctype, don't trigger such errors.
The answer is simply browser support: enctype, for instance, gained wide support,
while inputmethod is supported only as of IE11 and Edge 14.