Need to find empty passages - UIMA Ruta

I need to annotate empty passages in a document. I used regex patterns to annotate them, but they also cover the non-empty passages.
Sample Input file:
<p class="MsoNormal"><a name="para10001">You can easily change the formatting</a></p>
<p class="MsoNormal"><a name="para10002"> </a></p>
<p class="MsoNormal"><a name="para10003"></a></p>
<p class="MsoNormal"><a name="para10004">To change the overall look of your document</a></p>
<p class="MsoNormal"><a name="para10005"></a></p>
<p class="MsoNormal"><a name="para10006"></a></p>
Ruta Script:
"<p(.*?)><a name=\"para(\\d+)\"></a></p>"->EMPTYPASSAGE;
"<p(.*?)><a name=\"para(\\d+)\"> </a></p>"->EMPTYPASSAGE;
or
"<p(.*?)><a name=\"para(.+?)\"></a></p>"->EMPTYPASSAGE;
"<p(.*?)><a name=\"para(.+?)\"> </a></p>"->EMPTYPASSAGE;

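Your regex consumes several <p> tags: because . also matches '>', the lazy (.*?) can extend beyond the first <p ...> tag, so one match can start at a non-empty passage and end at an empty one. Restrict the group to characters other than '>', for example: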
"<p([^>]*?)><a name=\"para(\\d+)\"></a></p>"->EMPTYPASSAGE;
"<p([^>]*?)><a name=\"para(\\d+)\"> </a></p>"->EMPTYPASSAGE;

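The difference between the two patterns can be reproduced outside Ruta. The following is a minimal Python sketch, not Ruta code: it assumes the dot is allowed to match line breaks (re.DOTALL here) and uses a shortened copy of the sample input. The \s* that folds the two rules into one is an illustrative assumption, not part of the original script.

```python
import re

# Shortened copy of the sample input from the question.
html = (
    '<p class="MsoNormal"><a name="para10001">You can easily change the formatting</a></p>\n'
    '<p class="MsoNormal"><a name="para10002"> </a></p>\n'
    '<p class="MsoNormal"><a name="para10003"></a></p>\n'
)

# Original pattern: (.*?) may also consume '>' and so run across several tags.
loose = re.compile(r'<p(.*?)><a name="para(\d+)"></a></p>', re.DOTALL)

# Suggested pattern: [^>]*? cannot leave the opening <p ...> tag.
# \s* (assumption) covers both the empty and the single-space case.
strict = re.compile(r'<p([^>]*?)><a name="para(\d+)">\s*</a></p>')

loose_match = loose.search(html).group(0)
strict_ids = [m.group(2) for m in strict.finditer(html)]

print(loose_match.count("<p "))  # number of <p> tags swallowed by one loose match
print(strict_ids)                # paragraph numbers matched one at a time
```

With this input the loose pattern starts at para10001 and only stops at para10003, while the strict pattern matches para10002 and para10003 separately.
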
Sightly - Empty check on list HTL

How do I check for an empty list in Sightly?
I want to prevent the item-list DIV from rendering if there are no items in itemImgaeList. But the size always comes back as one (1), even when there are no items, when I try this:
LIST_SIZE_PRINT = "${container.itemImgaeList.size}"; // returns 1
HTL:
<div data-sly-test="${container.itemImgaeList.size > 1}">
<sly data-sly-list.imageList="${container.itemImgaeList}">
<div class="item-list">
<picture>
<img alt="${imageList.qlImageText}" src="${imageList.qlImagePath}" />
</picture>
</div>
</sly>
</div>
Any help?
data-sly-list alone can meet this requirement: it renders the list elements only when the list is not empty.
A separate data-sly-test is not required for checking the list, because the emptiness check is built into data-sly-list.
Here is a working example using data-sly-list:
<div class="item-list" data-sly-list.item="${container.itemImgaeList}">
<picture>
<img alt="${item.qlImageText}" src="${item.qlImagePath}" />
</picture>
</div>
More information:
https://www.aemquickstart.in/2016/08/htl-sightly-notes.html

Targeting individual elements in HTML using Perl and Mojo::DOM in well-formated HTML

I am a relative beginner with Perl, and this is my first question here. I am trying the following:
I am trying to retrieve certain information from a large online dataset (Eur-Lex), where each HTML document is well-formed HTML, with constant elements. Each HTML file is identified by its Celex number, which is supplied as the argument to the script (see my Perl code below).
The HTML data looks like this (showing only the part I'm interested in):
<!--
<blahblah>
< lots of stuff here, before the interesting part>
-->
<div id="PPClass_Contents" class="panel-collapse collapse in" role="tabpanel"
aria-labelledby="PP_Class">
<div class="panel-body">
<dl class="NMetadata">
<dt xmlns="http://www.w3.org/1999/xhtml">EUROVOC descriptor: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=341&lang=en">
<span lang="en">descriptor_1</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=5158&lang=en">
<span lang="en">descriptor_2</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=7983&lang=en">
<span lang="en">descriptor_3</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=933&lang=en">
<span lang="en">descriptor_4</span>
</a>
</li>
</ul>
</dd>
<dt xmlns="http://www.w3.org/1999/xhtml">Subject matter: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CT_CODED=BUDG&lang=en">
<span lang="en">Subject_1</span>
</a>
</li>
</ul>
</dd>
<dt xmlns="http://www.w3.org/1999/xhtml">Directory code: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>01.60.20.00 <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_1_CODED=01&lang=en">
<span lang="en">Designation_level_1</span>
</a> / <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_2_CODED=0160&lang=en">
<span lang="en">Designation_level_2</span>
</a> / <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_3_CODED=016020&lang=en">
<span lang="en">Designation_level_3</span>
</a>
</li>
</ul>
</dd>
</dl>
</div>
</div>
</div>
<!--
<still more stuff here>
-->
I am interested in the info contained in the div with id "PPClass_Contents", which consists of 3 elements:
- EUROVOC descriptor:
- Subject matter:
- Directory code:
Based on the above HTML, I would like to get the children of those 3 main elements using Perl and Mojo, producing a result like this (a single-line text file with 3 groups separated by tabs; multiple child elements within a group are separated by pipe characters):
CELEX_No "TAB" descriptor_1|descriptor_2|descriptor_3|descriptor_4|..|descriptor_n "TAB" Subject_1|..|Subject_n "TAB" Designation_level_1|Designation_level_2|Designation_level_3|..|Designation_level_n
I have the following code, which does print out the plain text of the interesting part, but I need to address the individual elements and print them out in a new file as described above:
#!/usr/bin/perl
# returns "Classification" descriptors for given CELEX and Language
use strict;
use warnings;
use Mojo::UserAgent;
if (@ARGV != 2) {
print "Wrong number of arguments!\n";
print "Syntax: clookup.pl Lang_ID celex_No.\n";
exit -1;
}
my $lang = $ARGV[0];
my $celex = $ARGV[1];
my $lclang = lc $lang;
# fetch the eurlex page
my $ua = Mojo::UserAgent->new;
my $dom = $ua->get("https://eur-lex.europa.eu/legal-content/$lang/ALL/?uri=CELEX:$celex")->res->dom;
################ let's extract interesting parts:
my $text = $dom->at('#PPClass_Contents')->all_text;
print "$text\n";
EDIT (added):
You can try my Perl script using two arguments:
lang_code ("DE","EN","IT", etc.)
Celex number (e.g.: E2014C0303, 52015BP2212, 52015BP0930(48), 52015BP0930(36), 52015BP0930(41), E2014C0302, E2014C0301, E2014C0271, E2014C0134).
For example (if you name my script "clookup.pl"):
$ perl clookup.pl EN E2014C0303
So, how can I address individual elements (of unknown number) as described above, using Mojo::DOM?
Or, is there something simpler or faster (using Perl)?
You are on the right track. First, you need to understand the HTML inside your #PPClass_Contents. Each set of things is in a definition list. Since you only care about the definition texts, you can search directly for the <dd> elements.
$dom->at('#PPClass_Contents')->find('dd')
This will give you a Mojo::Collection, which you can iterate with ->each. We pass that an anonymous function, pretty much like a callback.
$dom->at('#PPClass_Contents')->find('dd')->each(sub {
$_; # this is the current element
});
Each element will be passed to that sub, and can be referenced using the topic variable $_. There is a <ul> inside, and each <li> contains a <span> element with the text you want. So let's find those.
$_->find('span')
We can directly build the column in your output at this stage. Let's use the other form of ->each, which turns the Mojo::Collection returned from ->find into a normal Perl list. We can then use a regular map operation to grab each <span>'s text node and join that into a string.
join '|', map { $_->text } $_->find('span')->each
To tie all that together, we declare an array outside this construct, and stick the $celex number in it as the first column.
my @columns = ($celex);
$dom->at('#PPClass_Contents')->find('dd')->each(sub {
push @columns, join '|', map { $_->text } $_->find('span')->each;
});
Producing the final tab-separated output is now trivial.
print join "\t", @columns;
I've done this with EN as the language and the $celex number 32006L0121, which the search used in its example tooltip. The result is this:
32006L0121 marketing standard|chemical product|approximation of laws|dangerous substance|scientific report|packaging|European Chemicals Agency|labelling Internal market - Principles|Approximation of laws|Technical barriers|Environment|Consumer protection Industrial policy and internal market|Internal market: approximation of laws|Dangerous substances

Selenium- Tag traversal Xpath for facebook First name not working

<div id="reg_form_box" class="large_form">
<div class="clearfix _58mh">
<div class="mbm _3-90 lfloat _ohe">
<div id="u_0_0" class="_5dbb">
<div class="uiStickyPlaceholderInput uiStickyPlaceholderEmptyInput">
<div class="placeholder" aria-hidden="true">First name</div>
<input id="u_0_1" class="inputtext _58mg _5dba _2ph-" data-type="text" name="firstname" aria-required="1" placeholder="" aria-label="First name" aria-controls="js_0" aria-haspopup="true" role="null" aria-describedby="js_w" aria-invalid="true" type="text"/>
</div>
<i class="_5dbc img sp_beZQzZ7Rg6Q sx_5ca7f2"/>
<i class="_5dbd img sp_beZQzZ7Rg6Q sx_9c246c"/>
</div>
Above is the code for which I want to write an XPath using tag-name traversal. Here is the XPath I have made:
"//div[@id='reg_form_box']/div[1]/div[1]/div[1]/div/input"
Please suggest what is wrong here and how I can correct it. The website is Facebook and the field is First name on the homepage.
Ideally, unless multiple nodes match the same XPath, you don't have to traverse the entire hierarchy.
This will work:
//input[@name='firstname']
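One way to check such an expression offline is Python's xml.etree.ElementTree, which supports a limited XPath subset including attribute predicates. This is only a sketch: the markup below is a trimmed, hand-closed copy of the snippet from the question (an assumption, since the live page differs and changes over time), and in Selenium the same expression would be passed to the driver's XPath locator instead.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's markup with the missing closing tags added.
snippet = """
<div id="reg_form_box" class="large_form">
  <div class="clearfix _58mh">
    <div class="mbm _3-90 lfloat _ohe">
      <div id="u_0_0" class="_5dbb">
        <div class="uiStickyPlaceholderInput uiStickyPlaceholderEmptyInput">
          <div class="placeholder" aria-hidden="true">First name</div>
          <input id="u_0_1" name="firstname" aria-label="First name" type="text"/>
        </div>
      </div>
    </div>
  </div>
</div>
"""

root = ET.fromstring(snippet)

# ElementTree paths are relative to the root element, hence the leading '.'.
matches = root.findall(".//input[@name='firstname']")

print(len(matches))          # how many elements the expression selects
print(matches[0].get("id"))  # id of the selected input
```

The attribute-based expression does not depend on how many wrapper divs surround the input, which is why it tends to survive layout changes better than a positional path.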

Perl Scrappy select using class attribute

I was trying to scrape using Perl Scrappy. I would like to select html elements with class attribute using 'select'.
<p>
<h1>
<a href='http://test.com'>Test</a>
<a href='http://list.com'>List</a>
</h1>
</p>
<p class='parent-1'>
<h1>
<a class='child-1' href="http://sample.com">SampleLink</a>
<a class='child-2' href="http://list.com">List</a>
</h1>
</p>
I need to get the element (an 'a' tag) with class name 'child-1' that is a child node of <p class='parent-1'>, using the select method.
I have tried this:
#!/usr/bin/perl
use Scrappy;
my $scraper = Scrappy->new;
$scraper->get($url);
$scraper->select('p a')->data;
But this also selects the links under the first 'p' tag.
Could you please help me with this?
Bearing in mind choroba's warning, to select an <a> element with a class of child-1 that is a child of a <p> element with a class of parent-1 you would write
$scraper->select('p.parent-1 > a.child-1')
The problem is that in HTML, a <p> tag can't contain a <h1> tag. In fact, the HTML is parsed as
<p></p>
<h1>
<a href='http://test.com'>Test</a>
<a href='http://list.com'>List</a>
</h1>
<p class='parent-1'></p>
<h1>
<a class='child-1' href="http://sample.com">SampleLink</a>
<a class='child-2' href="http://list.com">List</a>
</h1>

Parsing an HTML string in iOS

I am parsing HTML tags in iOS using TFHpple successfully, but I have run into a small problem.
If my HTML is:
<div align="center">
<b>
<a href="/?PageName='TeacherPage'&StaffID=194121">
<span class="sectionheader">
Jessica
Cortes
</span>
</a></b><BR>
<span class="subheader">Migrant Education</span>
<BR>
<img src="/images/Phone.gif" width="22" height="23">
912-367-8630
<img src="images/EmailIconSmall.gif" width="16" height="16" style="vertical-align:bottom" />
<a onclick="openme('z','/Common/Email/Email.asp?UserID=194121&SchoolID=786',417,320);return false;" href="#">Email</a>
<BR><BR>
View All Teachers
<BR><BR>
<table cellpadding="4" cellspacing="4" class="subnavtd">
I am parsing it in iOS using, for example: NSString *tutorialsXpathQueryString = @"//div/span[@class='subheader']";
Now, in one of the HTML pages the value has no tag of its own; it is just a number like 912-367-8630. How should I write tutorialsXpathQueryString to retrieve it? The number is inside the tags given above.
Are you able to change the HTML output and wrap that phone number in a tag that you can target? If not, you will probably have to grab the inner text of the parent div and match a phone-number pattern against that string with a regex.
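The regex fallback could look roughly like this. It is a sketch in Python rather than Objective-C, and the pattern assumes the NNN-NNN-NNNN format seen in the sample; inner_text is a hypothetical stand-in for whatever string the parser returns for the parent div.

```python
import re

# Hypothetical inner text of the parent <div>, as a parser might return it.
inner_text = """
Jessica Cortes
Migrant Education
912-367-8630
Email
View All Teachers
"""

# Three digits, three digits, four digits, separated by hyphens (assumed format).
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

match = PHONE.search(inner_text)
phone = match.group(0) if match else None
print(phone)
```

On iOS the same pattern could be applied with NSRegularExpression to the text extracted from the div.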