Need to match the annotation feature-UIMA RUTA - uima

I need to match a feature of an annotation and also mark the second annotation that has the matching feature value. I've tried it, but I'm facing two issues:
ISSUE 1:
The SeperateDA annotation values got reduced. I think it's due to dictRemoveWS.
ISSUE 2:
It shows only the last match (probably due to some looping problem).
Sample file 1:
Arash Alipour
Rahul Bhargava
Lisette I.S. Wintgens
B. Rahul
Alipour A
Ali Aldabahi
M. Naziruddin Khan
Martin J. Swaans
Naziruddin Khan
Expected Output for file 1:
Rahul
Alipour
Naziruddin
Khan
Sample file 2:
M. Naziruddin Khan
Arash Alipour
Rahul Bhargava
Lisette I.S. Wintgens
Alipour A
Ali Aldabahi
M. Naziruddin Khan
Expected Output for file 2:
Alipour
Naziruddin
Khan
My Script:
PACKAGE uima.ruta.example;
DECLARE SINGLEINITIAL;
CW{REGEXP(".")->MARK(SINGLEINITIAL)};
DECLARE SeperateDA;
DECLARE DA;
"Arash Alipour"->DA;
"Lisette I.S. Wintgens"->DA;
"Alipour A"->DA;
"Rahul Bhargava"->DA;
"M. Naziruddin Khan"->DA;
"B. Rahul"->DA;
"Ali Aldabahi"->DA;
"A. S. Al Dwayyan"->DA;
"Lucas V.A. Boersma"->DA;
"Jippe C. Bal"->DA;
"Benno J.W.M. Rensing"->DA;
"Martin J. Swaans"->DA;
BLOCK(DocAuth) DA{}
{
CW{-PARTOF(SINGLEINITIAL)-> MARK(SeperateDA)};
}
DECLARE RepeatedDA(STRING auth);
STRING MatchedAuth;
SeperateDA{->MARK(RepeatedDA),MATCHEDTEXT(MatchedAuth)}->{RepeatedDA{->RepeatedDA.auth=MatchedAuth};};
STRING auth;
FOREACH(RepAuth) RepeatedDA{}
{
(da1:RepeatedDA {->UNMARK(RepeatedDA)}# da2:RepeatedDA){da1.auth != da2.auth};
}
I also tried something like this:
da:RepeatedDA{->da.auth = RepeatedDA.auth};
FOREACH(RepAuth, true) RepeatedDA{}
{
# da:RepeatedDA{->auth = da.auth, LOG(" auth-" +auth)};
da:RepeatedDA {auth != da.auth-> UNMARK(da)};
}
My goal is to remove the near-duplicate names from DA. For example, in the sample file above both Rahul Bhargava and B. Rahul are in DA, but I need only Rahul Bhargava to be in DA.

There seems to be a problem with your rule logic.
In the rule (da1:RepeatedDA # da2:RepeatedDA){da1.auth != da2.auth}, da2 almost always matches the directly following RepeatedDA/SeperateDA, because the value of its auth feature differs from that of da1. Thus, the rule applies too often, almost every time.
Try this:
DECLARE SINGLEINITIAL;
CW{REGEXP(".")->MARK(SINGLEINITIAL)};
DECLARE SeperateDA (STRING auth);
DECLARE DA;
"Arash Alipour"->DA;
"Lisette I.S. Wintgens"->DA;
"Alipour A"->DA;
"Rahul Bhargava"->DA;
"M. Naziruddin Khan"->DA;
"B. Rahul"->DA;
"Ali Aldabahi"->DA;
"A. S. Al Dwayyan"->DA;
"Lucas V.A. Boersma"->DA;
"Jippe C. Bal"->DA;
"Benno J.W.M. Rensing"->DA;
"Martin J. Swaans"->DA;
BLOCK(DocAuth) DA{}
{
CW{-PARTOF(SINGLEINITIAL)-> CREATE(SeperateDA, "auth" = CW.ct)};
}
DECLARE RepeatedDA;
da1:SeperateDA{-> RepeatedDA} # da2:SeperateDA{da1.auth == da2.auth};
DISCLAIMER: I am a developer of UIMA Ruta
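To make the intended behaviour concrete, here is a minimal Python sketch of the same matching logic outside of Ruta. It is not part of the script above: the function name repeated_name_tokens and the regex-based tokenization are assumptions made purely for illustration. It drops single-letter initials (the role of SINGLEINITIAL), keeps the remaining name tokens (the role of SeperateDA.auth), and reports each token the first time it recurs:

```python
import re


def repeated_name_tokens(lines):
    """Return name tokens that occur more than once, in order of their first recurrence."""
    seen = set()      # tokens encountered so far
    repeated = []     # tokens that have recurred, in order of recurrence
    for line in lines:
        # Split on anything that is not a letter, so "I.S." yields "I" and "S".
        for token in re.findall(r"[A-Za-z]+", line):
            if len(token) == 1:
                continue  # skip single initials
            if token in seen and token not in repeated:
                repeated.append(token)
            seen.add(token)
    return repeated


sample_file_1 = [
    "Arash Alipour",
    "Rahul Bhargava",
    "Lisette I.S. Wintgens",
    "B. Rahul",
    "Alipour A",
    "Ali Aldabahi",
    "M. Naziruddin Khan",
    "Martin J. Swaans",
    "Naziruddin Khan",
]
print(repeated_name_tokens(sample_file_1))
```

For sample file 1 this yields Rahul, Alipour, Naziruddin, Khan, which is the expected output given in the question.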

Related

Error when importing tm Vcorpus into Quanteda corpus

This code snippet worked just fine until I decided to update R (3.6.3) and RStudio (1.2.5042) yesterday, though it is not obvious to me that this is the source of the problem.
In a nutshell, I convert 91 pdf files into a volatile corpus named Vcorp and confirm that I created a volatile corpus as follows:
> Vcorp <- VCorpus(VectorSource(citiesText))
> class(Vcorp)
[1] "VCorpus" "Corpus"
Then I attempt to import this tm VCorpus into quanteda, but I keep getting an error message that I did not get before (e.g., the day before the update).
> data(Vcorp, package = "tm")
> citiesCorpus <- corpus(Vcorp)
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 8714, 91
Any suggestions? Thank you.
Impossible to know the exact problem without a) version information on your packages and b) a reproducible example.
Why use tm at all? You could have created a quanteda corpus directly as:
corpus(citiesText)
Converting a VCorpus works fine for me.
library("quanteda")
## Package version: 2.0.1
library("tm")
packageVersion("tm")
## [1] ‘0.7.7’
reut21578 <- system.file("texts", "crude", package = "tm")
VCorp <- VCorpus(
DirSource(reut21578, mode = "binary"),
list(reader = readReut21578XMLasPlain)
)
corpus(VCorp)
## Corpus consisting of 20 documents and 16 docvars.
## text1 :
## "Diamond Shamrock Corp said that effective today it had cut i..."
##
## text2 :
## "OPEC may be forced to meet before a scheduled June session t..."
##
## text3 :
## "Texaco Canada said it lowered the contract price it will pay..."
##
## text4 :
## "Marathon Petroleum Co said it reduced the contract price it ..."
##
## text5 :
## "Houston Oil Trust said that independent petroleum engineers ..."
##
## text6 :
## "Kuwait"s Oil Minister, in remarks published today, said ther..."
##
## [ reached max_ndoc ... 14 more documents ]

how to iterate on column in pyspark dataframe based on unique records and non na values

I have the code below in Python:
for i in (map.area.unique()):
    # Select all the map records from the currently processed area
    f_0 = f_map[(f_map['area'] == i)]
    m_0 = m_map[(m_map['area'] == i) | (m_map['area'] == "Unknown")]
I am rewriting it in PySpark, but the third line throws an exception. Can anyone point out what I am doing wrong?
map dataframe is:
   play_id calendar_period telephone area
1:  286178          201811  03235095  510
2:  286179          201811  03235113  500
f_map:
       id        value area type
1: 227149 385911000059  510  mob
2: 122270 385911000661  100  fix
m_map:
       id area type
1: 227149  590  mob
2: 122270  190  fix
Output should be:
       id        value    area type
1: 227149 385994266007 Unknown  mob
2: 122270 385989281716 Unknown  mob
I think the problem arises from the last line. If I understand your problem correctly, this should be what you're looking for:
temp1 = sampdf[(sampdf['area'] == i) | (sampdf['area'] == "Unknown")]

htmlTable in Rmd - conversion to Word docx

I have the following Rmd file, which produces an html file, which I then copy-paste into a docx file (for collaborators). Here are things I'd like to know how to do with the tables, but I can't find answers in the vignettes here:
A. I want to know how to remove the blank column that gets inserted in Word in between Cgroup 1 and Cgroup 2.
B. I want to know how to set the width of the column with the row names ("1st row",...)
C. How can I change the font and font size? I tried following this but it doesn't work to have output: word_document with htmlTable()
D. To ease the conversion to Word, is there a way to specify page breaks? Landscape orientation?
Thank you so much!
---
title: "Example"
output:
  Gmisc::docx_document:
    fig_caption: TRUE
    force_captions: TRUE
---
Results
=======
```{r, echo = FALSE}
library(htmlTable)
library(Gmisc)
library(knitr)
mx <-
matrix(ncol=6, nrow=8)
rownames(mx) <- paste(c("1st", "2nd",
"3rd",
paste0(4:8, "th")),
"row")
colnames(mx) <- paste(c("1st", "2nd",
"3rd",
paste0(4:6, "th")),
"hdr")
for (nr in 1:nrow(mx)){
for (nc in 1:ncol(mx)){
mx[nr, nc] <-
paste0(nr, ":", nc)
}
}
htmlTable(mx,
cgroup = c("Cgroup 1", "Cgroup 2"),
n.cgroup = c(2,4))
```
The styling seemed to be off for the row names and it is now fixed in version 1.10.1 that you can download using the devtools package: devtools::install_github("gforge/htmlTable", ref="develop")
Regarding the styling, the function allows almost any CSS style you could imagine. Unfortunately, it requires copy-pasting into Word, and this functionality hasn't been Microsoft's highest priority. You can easily adapt your example to accommodate the required changes using css.cell:
library(htmlTable)
library(knitr)
mx <-
matrix(ncol=6, nrow=8)
rownames(mx) <- paste(c("1st", "2nd",
"3rd",
paste0(4:8, "th")),
"row")
colnames(mx) <- paste(c("1st", "2nd",
"3rd",
paste0(4:6, "th")),
"hdr")
for (nr in 1:nrow(mx)){
for (nc in 1:ncol(mx)){
mx[nr, nc] <-
paste0(nr, ":", nc)
}
}
css.cell = rep("font-size: 1.5em;", times = ncol(mx) + 1)
css.cell[1] = "width: 4cm; font-size: 2em;"
htmlTable(mx,
css.cell=css.cell,
css.cgroup = "color: red",
css.table = "color: blue",
cgroup = c("Cgroup 1", "Cgroup 2"),
n.cgroup = c(2,4))
There is no way to remove the empty column generated by cgroups. This was required for the table to look nice and is a conscious design choice.
Regarding page breaks, I doubt there is any elegant way of doing that. An alternative could possibly be the ReporteRs package. I haven't used it myself, but it is more closely integrated with Word and could possibly be a solution.

Xtext 2.8+ formatter, formatting HiddenRegion with comment

I am using the Xtext 2.9 formatter and I am trying to format a HiddenRegion that contains a comment. Here is the part of my document region I am trying to format:
Columns: 1:offset 2:length 3:kind 4: text 5:grammarElement
Kind: H=IHiddenRegion S=ISemanticRegion B/E=IEObjectRegion
35 0 H
35 15 S ""xxx::a::b"" Refblock:namespace=Namespace
50 0 H
50 1 S "}" Refblock:RCBRACKET
E Refblock PackageHead:Block=Refblock path:PackageHead/Block=Package'xxx_constants'/head=Model/packages[0]
51 0 H
51 1 S ":" PackageHead:COLON
E PackageHead Package:head=PackageHead path:Package'xxx_constants'/head=Model/packages[0]
52 >>>H "\n " Whitespace:TerminalRule'WS'
"# asd" Comment:TerminalRule'SL_COMMENT'
15 "\n " Whitespace:TerminalRule'WS'<<<
B Error'ASSD' Package:expressions+=Expression path:Package'xxx_constants'/expressions[0]=Model/packages[0]
67 5 S "error" Error:'error'
72 1 H " " Whitespace:TerminalRule'WS'
and corresponding part of the grammar
Model:
{Model}
(packages+=Package)*;
Expression:
Error | Warning | Enum | Text;
Package:
{Package}
'package' name=Name head=PackageHead
(BEGIN
(imports+=Import)*
(expressions+=Expression)*
END)?;
Error:
{Error}
('error') name=ENAME parameter=Parameter COLON
(BEGIN
(expressions+=Estatement)+
END)?;
PackageHead:
Block=Refblock COLON;
The problem is that when I try to prepend some characters before the error keyword,
for example
error.regionFor.keyword('error').prepend[setSpace("\n ")]
the indentation is inserted before the comment and not after it. This results in improper formatting when a single-line comment precedes the 'error' keyword.
To provide more clarity, here is example code from my grammar and description of desired behavior:
package xxx_constants {namespace="xxx::a::b"}:
# asd
error ASSD {0}:
Hello {0,world}
This is expected result: (one space to the left)
package xxx_constants {namespace="xxx::a::b"}:
# asd
error ASSD {0}:
Hello {0,world}
and this is the actual result with prepend method
package xxx_constants {namespace="xxx::a::b"}:
# asd
error ASSD {0}:
Hello {0,world}
As the document structure shows, the HiddenRegion in this case is the part between the comment and the keyword:
# asd
error
How can I prepend my characters directly before the keyword 'error' and not before the comment? Thanks.
I assume you're creating an indentation-sensitive language, because you're explicitly calling BEGIN and END.
For an indentation-sensitive language, my answer is: you'll want to override
org.eclipse.xtext.formatting2.internal.HiddenRegionReplacer.applyHiddenRegionFormatting(List<ITextReplacer>)
The methods append[] and prepend[] you're using are agnostic to comments; at a later time, applyHiddenRegionFormatting() is called to decide how that formatting is woven between comments.
To make Xtext use your own subclass of HiddenRegionReplacer, override
org.eclipse.xtext.formatting2.AbstractFormatter2.createHiddenRegionReplacer(IHiddenRegion, IHiddenRegionFormatting)
For languages that do not do whitespace-sensitive lexing/parsing (that's the default) the answer is to not call setSpace() to create indentation or line-wraps.
Instead, do
pkg.interior[indent]
pkg.regionFor.keyword(":").append[newLine]
pkg.append[newLine]

uima ruta Score Condition

I tried a script to mark journal references using the SCORE condition.
W{REGEXP("Journal",true)->MARK(ONLY_Journal)};
W{REGEXP("Retraction|Retracted")->MARK(RETRACT)};
W{REGEXP("Suppl")->MARK(SUPPLY)};
NUM {->MARK(VOLUMEISSUE,1,6)}LParen NUM SPECIAL?{REGEXP("-")} NUM? RParen;
Reference{CONTAINS(ONLY_Journal)->MARKSCORE(10,JOURNAL_MAYBE)};
Reference{CONTAINS(JournalVolumeMarker)->MARKSCORE(5,JOURNAL_MAYBE)};
Reference{CONTAINS(VOLUMEISSUE)->MARKSCORE(15,JOURNAL_MAYBE)};
Reference{CONTAINS(JOURNALNAME)->MARKSCORE(10,JOURNAL_MAYBE)};
Reference{CONTAINS(RETRACT)->MARKSCORE(10,JOURNAL_MAYBE)};
Reference{CONTAINS(SUPPLY)->MARKSCORE(5,JOURNAL_MAYBE)};
JOURNAL_MAYBE{SCORE(20,55)->MARK(JOURNAL)};
Sample Text
1.Lawrence RA. A review of the medical 342–340 benefits and contraindications to breastfeeding in the United States [Internet] . Arlington (VA): National Center for Education in Maternal and Child Health; 1997 Oct [cited 2000 Apr 24]. p. 40. Available from: www.ncemch.org/pubs/PDFs/Welcometojungle.pdf.
2.Shishido A. Retraction notice: Effect of platinum compounds on murine lymphocyte mitogenesis [Retraction of Alsabti EA, Ghalib ON, Salem MH. In: Jpn J Med Biol 1979 Apr; 32(2):53-65]. Jpn J Med Sci Biol 1980 Aug;33(4):235-237.
3.Leist TP, Zinkernagel RM. Effects of treatment with IL-2 receptor specific monoclonal antibody in mice [letter] [Retraction of Leist TP, Kohler M, Eppler M, Zinkernagel RM. In: J Immunol 1989 Jul 15; 143(2): 628-32]. J Immunol 1990 Apr 1;144(7):2847.
4.Chen, L., James, N., Barker, C., Busam, K., & Marghoob, A. (2013). Desmoplastic
melanoma: A review. Journal of the American Academy of Dermatology, 68(5), 825-833.
doi: 10.1016/j.jaad.2012.10.041.
But the above script is not working. Can anyone find a solution for it?
Thanks in advance.
This should work just fine, but of course it depends on the existence and number of annotations of the types ONLY_Journal, JournalVolumeMarker, and so on.
Here's a test script for a simple Ruta project:
ENGINE utils.PlainTextAnnotator;
TYPESYSTEM utils.PlainTextTypeSystem;
Document{->EXEC(PlainTextAnnotator, {Paragraph})};
DECLARE Reference, ONLY_Journal, JOURNAL_MAYBE, JournalVolumeMarker, VOLUMEISSUE, JOURNALNAME, RETRACT, SUPPLY;
DECLARE JOURNAL;
Paragraph{-> Reference};
"Jpn J Med Biol" -> JOURNALNAME;
"32\\(2\\)" -> VOLUMEISSUE;
Reference{CONTAINS(ONLY_Journal)->MARKSCORE(10,JOURNAL_MAYBE)};
Reference{CONTAINS(JournalVolumeMarker)->MARKSCORE(5,JOURNAL_MAYBE)};
Reference{CONTAINS(VOLUMEISSUE)->MARKSCORE(15,JOURNAL_MAYBE)};
Reference{CONTAINS(JOURNALNAME)->MARKSCORE(10,JOURNAL_MAYBE)};
Reference{CONTAINS(RETRACT)->MARKSCORE(10,JOURNAL_MAYBE)};
Reference{CONTAINS(SUPPLY)->MARKSCORE(5,JOURNAL_MAYBE)};
JOURNAL_MAYBE{SCORE(20,55)->MARK(JOURNAL)};
Applied to the sample text, the second reference is annotated with JOURNAL.
DISCLAIMER: I am a developer of UIMA Ruta.
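The arithmetic behind MARKSCORE and SCORE can be sketched in Python. This is only an illustration, not Ruta code: the weights are copied from the rules above, the names WEIGHTS, journal_score and is_journal are made up for this sketch, and the interval check is assumed to include both bounds. Each CONTAINS rule that fires adds its score to the JOURNAL_MAYBE annotation, and the final rule checks whether the total lies between 20 and 55:

```python
# Score added per annotation type found inside a Reference (taken from the rules above).
WEIGHTS = {
    "ONLY_Journal": 10,
    "JournalVolumeMarker": 5,
    "VOLUMEISSUE": 15,
    "JOURNALNAME": 10,
    "RETRACT": 10,
    "SUPPLY": 5,
}


def journal_score(contained_types):
    """Sum the weights of all distinct known types contained in a reference."""
    return sum(WEIGHTS[t] for t in set(contained_types) if t in WEIGHTS)


def is_journal(contained_types, low=20, high=55):
    """Mimic SCORE(20,55): true if the accumulated score lies in [low, high] (bounds assumed inclusive)."""
    return low <= journal_score(contained_types) <= high


# Second sample reference in the test script: JOURNALNAME and VOLUMEISSUE are present.
print(journal_score(["JOURNALNAME", "VOLUMEISSUE"]), is_journal(["JOURNALNAME", "VOLUMEISSUE"]))
```

In the test script only JOURNALNAME and VOLUMEISSUE are created, so the second reference gets 10 + 15 = 25 and falls inside the interval. A reference that contains only JOURNALNAME stays at 10 and is not marked, which is why missing input annotations make the script appear not to work.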