UIMA Ruta SCORE condition

I tried a script to mark journal references using the SCORE condition.
W{REGEXP("Journal",true)->MARK(ONLY_Journal)};
W{REGEXP("Retraction|Retracted")->MARK(RETRACT)};
W{REGEXP("Suppl")->MARK(SUPPLY)};
NUM {->MARK(VOLUMEISSUE,1,6)}LParen NUM SPECIAL?{REGEXP("-")} NUM? RParen;
Reference{CONTAINS(ONLY_Journal)->MARKSCORE(10,JOURNAL_MAYBE)};
Reference{CONTAINS(JournalVolumeMarker)->MARKSCORE(5,JOURNAL_MAYBE)};
Reference{CONTAINS(VOLUMEISSUE)->MARKSCORE(15,JOURNAL_MAYBE)};
Reference{CONTAINS(JOURNALNAME)->MARKSCORE(10,JOURNAL_MAYBE)};
Reference{CONTAINS(RETRACT)->MARKSCORE(10,JOURNAL_MAYBE)};
Reference{CONTAINS(SUPPLY)->MARKSCORE(5,JOURNAL_MAYBE)};
JOURNAL_MAYBE{SCORE(20,55)->MARK(JOURNAL)};
Sample Text
1.Lawrence RA. A review of the medical 342–340 benefits and contraindications to breastfeeding in the United States [Internet] . Arlington (VA): National Center for Education in Maternal and Child Health; 1997 Oct [cited 2000 Apr 24]. p. 40. Available from: www.ncemch.org/pubs/PDFs/Welcometojungle.pdf.
2.Shishido A. Retraction notice: Effect of platinum compounds on murine lymphocyte mitogenesis [Retraction of Alsabti EA, Ghalib ON, Salem MH. In: Jpn J Med Biol 1979 Apr; 32(2):53-65]. Jpn J Med Sci Biol 1980 Aug;33(4):235-237.
3.Leist TP, Zinkernagel RM. Effects of treatment with IL-2 receptor specific monoclonal antibody in mice [letter] [Retraction of Leist TP, Kohler M, Eppler M, Zinkernagel RM. In: J Immunol 1989 Jul 15; 143(2): 628-32]. J Immunol 1990 Apr 1;144(7):2847.
4.Chen, L., James, N., Barker, C., Busam, K., & Marghoob, A. (2013). Desmoplastic
melanoma: A review. Journal of the American Academy of Dermatology, 68(5), 825-833.
doi: 10.1016/j.jaad.2012.10.041.
But the above script is not working. Can anyone suggest a solution?
Thanks in advance.

This should work just fine, but it depends of course on whether annotations of the types ONLY_Journal, JournalVolumeMarker, and so on exist, and on how many of them there are.
Here's the test script for a simple ruta project:
ENGINE utils.PlainTextAnnotator;
TYPESYSTEM utils.PlainTextTypeSystem;
Document{->EXEC(PlainTextAnnotator, {Paragraph})};
DECLARE Reference, ONLY_Journal, JOURNAL_MAYBE, JournalVolumeMarker, VOLUMEISSUE, JOURNALNAME, RETRACT, SUPPLY;
DECLARE JOURNAL;
Paragraph{-> Reference};
"Jpn J Med Biol" -> JOURNALNAME;
"32\\(2\\)" -> VOLUMEISSUE;
Reference{CONTAINS(ONLY_Journal)->MARKSCORE(10,JOURNAL_MAYBE)};
Reference{CONTAINS(JournalVolumeMarker)->MARKSCORE(5,JOURNAL_MAYBE)};
Reference{CONTAINS(VOLUMEISSUE)->MARKSCORE(15,JOURNAL_MAYBE)};
Reference{CONTAINS(JOURNALNAME)->MARKSCORE(10,JOURNAL_MAYBE)};
Reference{CONTAINS(RETRACT)->MARKSCORE(10,JOURNAL_MAYBE)};
Reference{CONTAINS(SUPPLY)->MARKSCORE(5,JOURNAL_MAYBE)};
JOURNAL_MAYBE{SCORE(20,55)->MARK(JOURNAL)};
When this is applied to the sample text, the second reference is annotated with JOURNAL.
DISCLAIMER: I am a developer of UIMA Ruta.
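To see why only the second reference ends up as JOURNAL in the test script, it can help to redo the score arithmetic by hand. The following is a minimal Python sketch of that arithmetic, not Ruta itself. It assumes that MARKSCORE adds its score to the JOURNAL_MAYBE annotation on the same Reference and that SCORE(20,55) accepts totals from 20 to 55; the function names are made up for illustration.

```python
# Weights copied from the MARKSCORE rules in the script.
WEIGHTS = {
    "ONLY_Journal": 10,
    "JournalVolumeMarker": 5,
    "VOLUMEISSUE": 15,
    "JOURNALNAME": 10,
    "RETRACT": 10,
    "SUPPLY": 5,
}


def journal_score(contained_types):
    """Sum the weights of the annotation types found inside one Reference."""
    return sum(WEIGHTS[t] for t in set(contained_types))


def is_journal(contained_types, low=20, high=55):
    # Assumption: both bounds are treated as inclusive in this sketch.
    return low <= journal_score(contained_types) <= high


# In the test script only "Jpn J Med Biol" (JOURNALNAME) and "32(2)"
# (VOLUMEISSUE) are annotated, and both occur in the second reference.
second_reference = ["JOURNALNAME", "VOLUMEISSUE"]
print(journal_score(second_reference))  # 25
print(is_journal(second_reference))     # True
print(is_journal(["SUPPLY"]))           # False
```

With the original script, the same reasoning suggests checking which of the six types actually exist in each Reference: if too few of them are present, the total stays below 20 and JOURNAL is never created.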


Cite in github readme Markdown

I'm trying to publish in JOSS (the Journal of Open Source Software), which requires the paper to be written in Markdown on GitHub. I'm struggling to understand how to add citations. I included a file named paper.bib in the top-level folder of my GitHub repository. In the Readme.md I wrote:
---
title: 'CREDO: a friendly Customizable, REproducible, DOcker file generator'
tags:
  - Docker
  - Reproducibility
  - Docker generator
  - User Iinterface
authors:
  - name: Simone Alessandri'
    equal-contrib: 1
    affiliation: 1
  - name: Rabellino Sergio
    equal-contrib: 2
    affiliation: 2
  - name: Sandro Contaldo
    equal-contrib: 3
    affiliation: 2
  - name: Maria Ratto
    equal-contrib: 3
    affiliation: 4
  - name: Gabriele Piacenti
    equal-contrib: 3
    affiliation: 5
  - name: Qi Wang
    equal-contrib: 3
    affiliation: 3
  - name: Marco Beccuti
    equal-contrib: 4
    affiliation: 2
  - name: Raffaele Adolfo Calogero
    equal-contrib: 4
    affiliation: 4
  - name: Luca Alessandri
    equal-contrib: 5
    affiliation: "3,4"
  - name: Author with no affiliation
    corresponding: true
    affiliation: 3
affiliations:
  - name: Politechnic of Turin, Torino, Italy
    index: 1
  - name: Department of Computer Science, University of Torino, Torino
    index: 2
  - name: Department of Pathology, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
    index: 3
  - name: Department of Molecular Biotechnology and Health Sciences, University of Torino, Torino
    index: 4
  - name: Molecular Biotechnology Center & Department of Life Sciences and Systems Biology, University of Turin, Torino, Italy
    index: 5
date: 11 July 2022
bibliography: paper.bib
aas-doi:
aas-journal: JOSS The Journal of Open Source Software
---
Is this enough to load the citations? Here is my bib file.
@inproceedings{uno,
  title={Reproducible bioinformatics project: a community for reproducible bioinformatics analysis pipelines},
  author={N. Kulkarni , L. Alessandri, R. Panero, M. Arigoni, M. Olivero, G. Ferrero, et al},
  booktitle={BMC Bioinformatic},
  pages={vol. 19 Suppl 10:349, 2018, doi:10.1186/s12859-018-2296-x},
  doi={10.1186/s12859-018-2296-x}
}
@inproceedings{due,
  title={https://docs.docker.com/engine/}
}
@inproceedings{tre,
  title={Containers in Bioinformatics: Applications, Practical Considerations, and Best Practices in Molecular Pathology},
  author={S. Kadri, A. Sboner, A. Sigaras and S. Roy},
  booktitle={J Mol Diagn., 2022},
  doi={10.1016/j.jmoldx.2022.01.006}
}
@inproceedings{quattro,
  title={https://cran.r-project.org/}
}
@inproceedings{cinque,
  title={https://www.python.org/}
}
@inproceedings{sei,
  title={Using R and Bioconductor in Clinical Genomics and Transcriptomics},
  author={J.L. Sepulveda. },
  booktitle={J Mol Diagn vol. 22},
  doi={10.1016/j.jmoldx.2019.08.006}
}
@inproceedings{sette,
  title={Sparsely-connected autoencoder (SCA) for single cell RNAseq data mining},
  author={L. Alessandri, F. Cordero, M. Beccuti, N. Licheri, M. Arigoni, M. Olivero, et al },
  booktitle={NPJ Syst Biol Appl. vol. 7},
  doi={10.1038/s41540-020-00162-6}
}
@inproceedings{otto,
  title={Sparsely Connected Autoencoders: A Multi-Purpose Tool for Single Cell omics Analysis},
  author={L. Alessandri, M.L. Ratto, S.G. Contaldo, M. Beccuti, F. Cordero, M. Arigoni, et al},
  booktitle={nt J Mol Sci., vol. 22},
  doi={10.3390/ijms222312755}
}
@inproceedings{nove,
  title={rCASC: reproducible classification analysis of single-cell sequencing data},
  author={L. Alessandri, F. Cordero, M. Beccuti, M. Arigoni, M. Olivero, G. Romano, et al},
  booktitle={Gigascience, vol. 8},
  doi={10.1093/gigascience/giz105}
}
@inproceedings{dieci,
  title={https://docs.conda.io/en/latest/}
}
@inproceedings{undici,
  title={https://bioconda.github.io/}
}
@inproceedings{dodici,
  title={Orchestrating high-throughput genomic analysis with Bioconductor},
  author={W. Huber, V.J. Carey, R. Gentleman, S. Anders, M. Carlson, B.S. Carvalho, et al},
  booktitle={Nat Methods, vol. 12},
  doi={10.1038/nmeth.3252}
}
@inproceedings{tredici,
  title={Bioconductor: open software development for computational biology and bioinformatics},
  author={R.C. Gentleman, V.J. Carey, D.M. Bates, B. Bolstad, M. Dettling, S. Dudoit, et al},
  booktitle={Genome Biol., vol. 5},
  doi={10.1186/gb-2004-5-10-r80}
}
@inproceedings{quattordici,
  title={https://github.com/}
}
@inproceedings{quindici,
  title={https://uwekorn.com/2021/03/01/deploying-conda-environments-in-docker-how-to-do-it-right.html}
}
@inproceedings{sedici,
  title={https://pythonspeed.com/articles/activate-conda-dockerfile/}
}
@inproceedings{diciassette,
  title={https://biocontainers.pro/}
}
In the text, how can I cite the first paper? I tried \cite{uno}, as suggested in other questions, but it is not working. Here is the link to the repository: https://github.com/alessandriLuca/CREDO_paper
In general, in pandoc's Markdown, which JOSS uses, you cite with [@bibtex-key], e.g. in your first case [@uno]. Here is the documentation regarding citations with pandoc's Markdown.
The complete setup is demonstrated in the JOSS documentation: "Example paper and Bibliography".
Another way to see examples is to look at how other papers on JOSS do it: see this paper, which is the first one I found on JOSS. You can see the [@bibtex-key] syntax there.

Extract date from string with another numbers from R

I need to extract the date from this text:
Mellisoni 2014 Malbec (Columbia Valley (WA))
Okapi 2013 Estate Cabernet Sauvignon (Napa Valley)
Podere dal Nespoli 2015 Prugneto Sangiovese (Romagna)
Simonnet-Febvre 2015 Chablis
Lagler 2012 1000 Eimerberg Smaragd Neuburger (Wachau)
I use this code:
vino<-mutate(vino, year1=sub("^.*([0-9]{4}).*", "\\1", vino$title))
It works, but for the last row it extracts 1000 instead of 2012. How can I fix this when the title contains other numbers?
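A likely cause is that the leading .* is greedy: it swallows as much of the title as it can, so the capture group ends up on the last run of four digits rather than the first. The sketch below reproduces this with Python's re module (not R) and compares the pattern from the question with a lazy variant, .*?. The function names are made up for illustration, and whether R's default regex engine accepts the lazy form or needs perl = TRUE is not verified here.

```python
import re

titles = [
    "Mellisoni 2014 Malbec (Columbia Valley (WA))",
    "Okapi 2013 Estate Cabernet Sauvignon (Napa Valley)",
    "Podere dal Nespoli 2015 Prugneto Sangiovese (Romagna)",
    "Simonnet-Febvre 2015 Chablis",
    "Lagler 2012 1000 Eimerberg Smaragd Neuburger (Wachau)",
]


def year_greedy(title):
    # Same pattern as in the question: the greedy ".*" gives back only as much
    # as it must, so the group lands on the LAST run of four digits.
    return re.sub(r"^.*([0-9]{4}).*", r"\1", title)


def year_lazy(title):
    # ".*?" consumes as little as possible, so the group lands on the FIRST
    # run of four digits.
    return re.sub(r"^.*?([0-9]{4}).*", r"\1", title)


for t in titles:
    print(year_greedy(t), year_lazy(t))
# The last title prints "1000 2012"; the others print the same year twice.
```

If a title could contain a four-digit number before the vintage, a stricter pattern (for example one that only accepts years starting with 19 or 20) would be needed; that is beyond this sketch.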

Unicode letters with more than 1 alphabetic latin character?

I'm not really sure how to express it but I'm searching for unicode letters which are more than one visual latin letter.
I found this in Word so far:
DZ
Dz
dz
NJ
Lj
LJ
Nj
nj
Any others?
Here are some of the characters I've found. I first did this manually by looking at some probable blocks. Later I wrote a Python script to do this automatically; you can find it at the end of this answer.
Digraphs
Two Glyphs | Digraph | Unicode Code Point | HTML
DZ, Dz, dz | DZ, Dz, dz | U+01F1 U+01F2 U+01F3 | DZ Dz dz
DŽ, Dž, dž | DŽ, Dž, dž | U+01C4 U+01C5 U+01C6 | DŽ Dž dž
IJ, ij | IJ, ij | U+0132 U+0133 | IJ ij
LJ, Lj, lj | LJ, Lj, lj | U+01C7 U+01C8 U+01C9 | LJ Lj lj
NJ, Nj, nj | NJ, Nj, nj | U+01CA U+01CB U+01CC | NJ Nj nj
Ligatures
Non-ligature | Ligature | Unicode | HTML
AA, aa | Ꜳ, ꜳ | U+A732, U+A733 | Ꜳ ꜳ
AE, ae | Æ, æ | U+00C6, U+00E6 | Æ æ
AO, ao | Ꜵ, ꜵ | U+A734, U+A735 | Ꜵ ꜵ
AU, au | Ꜷ, ꜷ | U+A736, U+A737 | Ꜷ ꜷ
AV, av | Ꜹ, ꜹ | U+A738, U+A739 | Ꜹ ꜹ
AV, av (with bar) | Ꜻ, ꜻ | U+A73A, U+A73B | Ꜻ ꜻ
AY, ay | Ꜽ, ꜽ | U+A73C, U+A73D | Ꜽ ꜽ
et | 🙰 | U+1F670 | 🙰
ff | ff | U+FB00 | ff
ffi | ffi | U+FB03 | ffi
ffl | ffl | U+FB04 | ffl
fi | fi | U+FB01 | fi
fl | fl | U+FB02 | fl
OE, oe | Œ, œ | U+0152, U+0153 | Œ œ
OO, oo | Ꝏ, ꝏ | U+A74E, U+A74F | Ꝏ ꝏ
ſs, ſz | ẞ, ß | U+1E9E, U+00DF | ß
st | st | U+FB06 | st
ſt | ſt | U+FB05 | ſt
TZ, tz | Ꜩ, ꜩ | U+A728, U+A729 | Ꜩ ꜩ
ue | ᵫ | U+1D6B | ᵫ
VY, vy | Ꝡ, ꝡ | U+A760, U+A761 | Ꝡ ꝡ
There are a few other ligatures that are used for phonetic transcription but look like Latin characters:
Non-ligature | Ligature | Unicode | HTML
db | ȸ | U+0238 | ȸ
dz | ʣ | U+02A3 | ʣ
IJ, ij | IJ, ij | U+0132, U+0133 | IJ ij
ls | ʪ | U+02AA | ʪ
lz | ʫ | U+02AB | ʫ
qp | ȹ | U+0239 | ȹ
ts | ʦ | U+02A6 | ʦ
ui | ꭐ | U+AB50 | ꭐ
turned ui | ꭑ | U+AB51 | ꭑ
https://en.wikipedia.org/wiki/List_of_precomposed_Latin_characters_in_Unicode#Digraphs_and_ligatures
Edit:
There are more letterlike symbols beside ℻ and ℡ like what the OP found in the comment:
℀ ℁ ⅍ ℅ ℆ ℔ ℠ ™
Longer letters are mainly from the CJK Compatibility block
U+XXXX 0 1 2 3 4 5 6 7 8 9 A B C D E F
U+338x ㎀ ㎁ ㎂ ㎃ ㎄ ㎅ ㎆ ㎇ ㎈ ㎉ ㎊ ㎋ ㎌ ㎍ ㎎ ㎏
U+339x ㎐ ㎑ ㎒ ㎓ ㎔ ㎕ ㎖ ㎗ ㎘ ㎙ ㎚ ㎛ ㎜ ㎝ ㎞ ㎟
U+33Ax ㎠ ㎡ ㎢ ㎣ ㎤ ㎥ ㎦ ㎧ ㎨ ㎩ ㎪ ㎫ ㎬ ㎭ ㎮ ㎯
U+33Bx ㎰ ㎱ ㎲ ㎳ ㎴ ㎵ ㎶ ㎷ ㎸ ㎹ ㎺ ㎻ ㎼ ㎽ ㎾ ㎿
U+33Cx ㏀ ㏁ ㏂ ㏃ ㏄ ㏅ ㏆ ㏇ ㏈ ㏉ ㏊ ㏋ ㏌ ㏍ ㏎ ㏏
U+33Dx ㏐ ㏑ ㏒ ㏓ ㏔ ㏕ ㏖ ㏗ ㏘ ㏙ ㏚ ㏛ ㏜ ㏝ ㏞ ㏟
Among the 3-letter-like symbols are ㎈ ㎑ ㎒ ㎓ ㎔㏒ ㏕ ㏖ ㏙ ㎪ ㎫ ㎬ ㎭ ㏆ ㏿ ㍱... Probably the ones with most characters are ㎉ and ㎯
Unicode even has code points for Roman numerals. Here another 4-letter-like character can be found: Ⅷ
U+XXXX 0 1 2 3 4 5 6 7 8 9 A B C D E F
U+215x ⅐ ⅑ ⅒ ⅓ ⅔ ⅕ ⅖ ⅗ ⅘ ⅙ ⅚ ⅛ ⅜ ⅝ ⅞ ⅟
U+216x Ⅰ Ⅱ Ⅲ Ⅳ Ⅴ Ⅵ Ⅶ Ⅷ Ⅸ Ⅹ Ⅺ Ⅻ Ⅼ Ⅽ Ⅾ Ⅿ
U+217x ⅰ ⅱ ⅲ ⅳ ⅴ ⅵ ⅶ ⅷ ⅸ ⅹ ⅺ ⅻ ⅼ ⅽ ⅾ ⅿ
U+218x ↀ ↁ ↂ Ↄ ↄ ↅ ↆ ↇ ↈ ↉ ↊ ↋
If normal numbers can be considered then there are some other code points for multiple digits like ⒆ ⒇ ⓳ ⓴ in enclosed alphanumerics
U+XXXX 0 1 2 3 4 5 6 7 8 9 A B C D E F
U+246x ① ② ③ ④ ⑤ ⑥ ⑦ ⑧ ⑨ ⑩ ⑪ ⑫ ⑬ ⑭ ⑮ ⑯
U+247x ⑰ ⑱ ⑲ ⑳ ⑴ ⑵ ⑶ ⑷ ⑸ ⑹ ⑺ ⑻ ⑼ ⑽ ⑾ ⑿
U+248x ⒀ ⒁ ⒂ ⒃ ⒄ ⒅ ⒆ ⒇ ⒈ ⒉ ⒊ ⒋ ⒌ ⒍ ⒎ ⒏
U+249x ⒐ ⒑ ⒒ ⒓ ⒔ ⒕ ⒖ ⒗ ⒘ ⒙ ⒚ ⒛ ⒜ ⒝ ⒞ ⒟
U+24Ax ⒠ ⒡ ⒢ ⒣ ⒤ ⒥ ⒦ ⒧ ⒨ ⒩ ⒪ ⒫ ⒬ ⒭ ⒮ ⒯
U+24Bx ⒰ ⒱ ⒲ ⒳ ⒴ ⒵ Ⓐ Ⓑ Ⓒ Ⓓ Ⓔ Ⓕ Ⓖ Ⓗ Ⓘ Ⓙ
U+24Cx Ⓚ Ⓛ Ⓜ Ⓝ Ⓞ Ⓟ Ⓠ Ⓡ Ⓢ Ⓣ Ⓤ Ⓥ Ⓦ Ⓧ Ⓨ Ⓩ
U+24Dx ⓐ ⓑ ⓒ ⓓ ⓔ ⓕ ⓖ ⓗ ⓘ ⓙ ⓚ ⓛ ⓜ ⓝ ⓞ ⓟ
U+24Ex ⓠ ⓡ ⓢ ⓣ ⓤ ⓥ ⓦ ⓧ ⓨ ⓩ ⓪ ⓫ ⓬ ⓭ ⓮ ⓯
U+24Fx ⓰ ⓱ ⓲ ⓳ ⓴ ⓵ ⓶ ⓷ ⓸ ⓹ ⓺ ⓻ ⓼ ⓽ ⓾ ⓿
and in Enclosed Alphanumeric Supplement
🅫, 🅪, 🆋, 🆌, 🆍, 🄭, 🄮, 🅊, 🅋, 🅌, 🅍, 🅎, 🅏
A few more:
Currency symbol group
₧ ₨ ₶ ₯ ₠ ₢ ₷
Miscellaneous technical group
⎂ ⏨
Control pictures (probably you'll need to zoom out to see)
U+XXXX 0 1 2 3 4 5 6 7 8 9 A B C D E F
U+240x ␀ ␁ ␂ ␃ ␄ ␅ ␆ ␇ ␈ ␉ ␊ ␋ ␌ ␍ ␎ ␏
U+241x ␐ ␑ ␒ ␓ ␔ ␕ ␖ ␗ ␘ ␙ ␚ ␛ ␜ ␝ ␞ ␟
U+242x ␠ ␡ ␢ ␣ ␤ ␥ ␦
Alchemical Symbols
🜀 🜅 🜆 🜇 🜈 🝪 🝫 🝬 🝛 🝜 🝝
Musical Symbols
𝄶 𝄷 𝄸 𝄹 𝄉 𝄊 𝄫
And there are the emojis 🔟 💤🆔🚾🆖🆗🔢🔡🔠 💯🆘🆎🆑™🔙🔚🔜🔝🔛📆🗓🔞
Vertical bars may be considered uppercase i or lowercase L (like your 〷 example which is actually the TELEGRAPH LINE FEED SEPARATOR SYMBOL) and we have
Vai syllable see ꔖ 0xa516
Large triple vertical bar operator ⫼ 0x2afc
Counting rod tens digit three: 𝍫 0x1d36b
Suzhou numerals 〢 〣
Chinese river 川
║ BOX DRAWINGS DOUBLE VERTICAL...
Here's the automatic script to find the multi-character letters
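import unicodedata

# Scan every code point and keep those whose compatibility decomposition
# (NFKD) consists of more than one ASCII letter.
for c in range(0, 0x10FFFF + 1):
    d = unicodedata.normalize('NFKD', chr(c))
    if len(d) > 1 and d.isascii() and d.isalpha():
        print("U+%04X (%s): %s\n" % (c, chr(c), d))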
It won't be able to find many ligatures like æ or œ because they're not considered orthographic ligatures and aren't decomposable in Unicode. Here's the result in Unicode 11.0.0 (checked with unicodedata.unidata_version)
U+0132 (IJ): IJ
U+0133 (ij): ij
U+01C7 (LJ): LJ
U+01C8 (Lj): Lj
U+01C9 (lj): lj
U+01CA (NJ): NJ
U+01CB (Nj): Nj
U+01CC (nj): nj
U+01F1 (DZ): DZ
U+01F2 (Dz): Dz
U+01F3 (dz): dz
U+20A8 (₨): Rs
U+2116 (№): No
U+2120 (℠): SM
U+2121 (℡): TEL
U+2122 (™): TM
U+213B (℻): FAX
U+2161 (Ⅱ): II
U+2162 (Ⅲ): III
U+2163 (Ⅳ): IV
U+2165 (Ⅵ): VI
U+2166 (Ⅶ): VII
U+2167 (Ⅷ): VIII
U+2168 (Ⅸ): IX
U+216A (Ⅺ): XI
U+216B (Ⅻ): XII
U+2171 (ⅱ): ii
U+2172 (ⅲ): iii
U+2173 (ⅳ): iv
U+2175 (ⅵ): vi
U+2176 (ⅶ): vii
U+2177 (ⅷ): viii
U+2178 (ⅸ): ix
U+217A (ⅺ): xi
U+217B (ⅻ): xii
U+3250 (㉐): PTE
U+32CC (㋌): Hg
U+32CD (㋍): erg
U+32CE (㋎): eV
U+32CF (㋏): LTD
U+3371 (㍱): hPa
U+3372 (㍲): da
U+3373 (㍳): AU
U+3374 (㍴): bar
U+3375 (㍵): oV
U+3376 (㍶): pc
U+3377 (㍷): dm
U+337A (㍺): IU
U+3380 (㎀): pA
U+3381 (㎁): nA
U+3383 (㎃): mA
U+3384 (㎄): kA
U+3385 (㎅): KB
U+3386 (㎆): MB
U+3387 (㎇): GB
U+3388 (㎈): cal
U+3389 (㎉): kcal
U+338A (㎊): pF
U+338B (㎋): nF
U+338E (㎎): mg
U+338F (㎏): kg
U+3390 (㎐): Hz
U+3391 (㎑): kHz
U+3392 (㎒): MHz
U+3393 (㎓): GHz
U+3394 (㎔): THz
U+3396 (㎖): ml
U+3397 (㎗): dl
U+3398 (㎘): kl
U+3399 (㎙): fm
U+339A (㎚): nm
U+339C (㎜): mm
U+339D (㎝): cm
U+339E (㎞): km
U+33A9 (㎩): Pa
U+33AA (㎪): kPa
U+33AB (㎫): MPa
U+33AC (㎬): GPa
U+33AD (㎭): rad
U+33B0 (㎰): ps
U+33B1 (㎱): ns
U+33B3 (㎳): ms
U+33B4 (㎴): pV
U+33B5 (㎵): nV
U+33B7 (㎷): mV
U+33B8 (㎸): kV
U+33B9 (㎹): MV
U+33BA (㎺): pW
U+33BB (㎻): nW
U+33BD (㎽): mW
U+33BE (㎾): kW
U+33BF (㎿): MW
U+33C3 (㏃): Bq
U+33C4 (㏄): cc
U+33C5 (㏅): cd
U+33C8 (㏈): dB
U+33C9 (㏉): Gy
U+33CA (㏊): ha
U+33CB (㏋): HP
U+33CC (㏌): in
U+33CD (㏍): KK
U+33CE (㏎): KM
U+33CF (㏏): kt
U+33D0 (㏐): lm
U+33D1 (㏑): ln
U+33D2 (㏒): log
U+33D3 (㏓): lx
U+33D4 (㏔): mb
U+33D5 (㏕): mil
U+33D6 (㏖): mol
U+33D7 (㏗): PH
U+33D9 (㏙): PPM
U+33DA (㏚): PR
U+33DB (㏛): sr
U+33DC (㏜): Sv
U+33DD (㏝): Wb
U+33FF (㏿): gal
U+FB00 (ff): ff
U+FB01 (fi): fi
U+FB02 (fl): fl
U+FB03 (ffi): ffi
U+FB04 (ffl): ffl
U+FB05 (ſt): st
U+FB06 (st): st
U+1F12D (🄭): CD
U+1F12E (🄮): WZ
U+1F14A (🅊): HV
U+1F14B (🅋): MV
U+1F14C (🅌): SD
U+1F14D (🅍): SS
U+1F14E (🅎): PPV
U+1F14F (🅏): WC
U+1F16A (🅪): MC
U+1F16B (🅫): MD
U+1F190 (🆐): DJ

Need to match the annotation feature-UIMA RUTA

I need to match on an annotation's feature value and also mark the second annotation that has the matching feature. I've tried this, but I'm facing two issues.
ISSUE 1:
The SeperateDA annotation values get reduced. I think it's due to dictRemoveWS.
ISSUE 2:
It shows only the last match (probably due to some looping problem).
Sample file 1:
Arash Alipour
Rahul Bhargava
Lisette I.S. Wintgens
B. Rahul
Alipour A
Ali Aldabahi
M. Naziruddin Khan
Martin J. Swaans
Naziruddin Khan
Expected Output for file 1:
Rahul
Alipour
Naziruddin
Khan
Sample file 2:
M. Naziruddin Khan
Arash Alipour
Rahul Bhargava
Lisette I.S. Wintgens
Alipour A
Ali Aldabahi
M. Naziruddin Khan
Expected Output for file 2:
Alipour
Naziruddin
Khan
My Script:
PACKAGE uima.ruta.example;
DECLARE SINGLEINITIAL;
CW{REGEXP(".")->MARK(SINGLEINITIAL)};
DECLARE SeperateDA;
DECLARE DA;
"Arash Alipour"->DA;
"Lisette I.S. Wintgens"->DA;
"Alipour A"->DA;
"Rahul Bhargava"->DA;
"M. Naziruddin Khan"->DA;
"B. Rahul"->DA;
"Ali Aldabahi"->DA;
"A. S. Al Dwayyan"->DA;
"Lucas V.A. Boersma"->DA;
"Jippe C. Bal"->DA;
"Benno J.W.M. Rensing"->DA;
"Martin J. Swaans"->DA;
BLOCK(DocAuth) DA{}
{
    CW{-PARTOF(SINGLEINITIAL)-> MARK(SeperateDA)};
}
DECLARE RepeatedDA(STRING auth);
STRING MatchedAuth;
SeperateDA{->MARK(RepeatedDA),MATCHEDTEXT(MatchedAuth)}->{RepeatedDA{->RepeatedDA.auth=MatchedAuth};};
STRING auth;
FOREACH(RepAuth) RepeatedDA{}
{
    (da1:RepeatedDA {->UNMARK(RepeatedDA)}# da2:RepeatedDA){da1.auth != da2.auth};
}
I also tried something like this
da:RepeatedDA{->da.auth = RepeatedDA.auth};
FOREACH(RepAuth, true) RepeatedDA{}
{
    # da:RepeatedDA{->auth = da.auth, LOG(" auth-" +auth)};
    da:RepeatedDA {auth != da.auth-> UNMARK(da)};
}
My goal is to remove the near-duplicate names from DA. For example, in the sample file above both Rahul Bhargava and B. Rahul are in DA, but I need only Rahul Bhargava to be in DA.
There seems to be a problem with your rule logic.
In the rule da1:RepeatedDA # da2:RepeatedDA, da2 always matches on the directly following RepeatedDA/SeperateDA, since the value of the auth feature differs. Thus, the rule applies too often, almost every time.
Try this:
DECLARE SINGLEINITIAL;
CW{REGEXP(".")->MARK(SINGLEINITIAL)};
DECLARE SeperateDA (STRING auth);
DECLARE DA;
"Arash Alipour"->DA;
"Lisette I.S. Wintgens"->DA;
"Alipour A"->DA;
"Rahul Bhargava"->DA;
"M. Naziruddin Khan"->DA;
"B. Rahul"->DA;
"Ali Aldabahi"->DA;
"A. S. Al Dwayyan"->DA;
"Lucas V.A. Boersma"->DA;
"Jippe C. Bal"->DA;
"Benno J.W.M. Rensing"->DA;
"Martin J. Swaans"->DA;
BLOCK(DocAuth) DA{}
{
    CW{-PARTOF(SINGLEINITIAL)-> CREATE(SeperateDA, "auth" = CW.ct)};
}
DECLARE RepeatedDA;
da1:SeperateDA{-> RepeatedDA} # da2:SeperateDA{da1.auth == da2.auth};
DISCLAIMER: I am a developer of UIMA Ruta
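The intent of the final rule (mark a name part when the same text occurs again later) can be checked outside Ruta. The sketch below is plain Python, not Ruta. It treats every line of sample file 1 as an author name, which the script does not do (it only looks inside DA matches), and it uses a simple regex as a stand-in for "capitalized word that is not a single initial". The function name is hypothetical.

```python
import re


def repeated_name_tokens(lines):
    """Return name parts that occur again later, in order of first occurrence."""
    # Capitalized words of at least two letters; single initials such as
    # "B." or "A" are skipped, similar to -PARTOF(SINGLEINITIAL).
    tokens = [t for line in lines for t in re.findall(r"\b[A-Z][a-z]+\b", line)]
    repeated = []
    for i, tok in enumerate(tokens):
        # Keep the token if an identical token follows somewhere later.
        if tok in tokens[i + 1:] and tok not in repeated:
            repeated.append(tok)
    return repeated


sample_file_1 = [
    "Arash Alipour",
    "Rahul Bhargava",
    "Lisette I.S. Wintgens",
    "B. Rahul",
    "Alipour A",
    "Ali Aldabahi",
    "M. Naziruddin Khan",
    "Martin J. Swaans",
    "Naziruddin Khan",
]

print(repeated_name_tokens(sample_file_1))
# ['Alipour', 'Rahul', 'Naziruddin', 'Khan']
```

This gives the same four names as the expected output for file 1, only in order of first occurrence.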

Mongodb find documents

I have a MongoDB instance which contains a translation of texts:
{
"_id" : ObjectId("57c68ba415f4d42b6ecd9ee7"),
"en" : "Adana (pronounced [aˈda.na]) is a major city in southern Turkey. The city is situated on the Seyhan river, 35 km (22 mi) inland from the Mediterranean Sea, in south-central Anatolia. It is the administrative seat of the Adana Province and has a population of 1.7 million,[1] making it the fifth most populous city in Turkey. Adana-Mersin polycentric metropolitan area, with a population of 3 million, stretches over 70 km (43 mi) east-west and 25 km (16 mi) north-south; encompassing the cities of Mersin, Tarsus and Adana.",
"sw" : "Adana (Kigiriki Άδανα) ni mji mkubwa katika nchi ya Uturuki. Kwa mujibu wa sensa iliyofanyika mwaka wa 2000, mji una wakazi wapatao 1,130,710 waishio huko,[2] na kuufanya kuwa mmoja kati ya miji mitano mikubwa ya Uturuku (baada ya Istanbul, Ankara, İzmir na Bursa). Mwaka wa 2006 mji wa Adana umekadiriwa kufikia iadadi ya wakazi wapatao 1,271,894. Huu ndiyo mji mkuu wa Mkoa wa Adana."
}
{
"_id" : ObjectId("57c68ba915f4d42b6ecd9eea"),
"en" : "Addis Ababa or Addis Abeba (the spelling used by the official Ethiopian Mapping Authority),(Amharic: አዲስ አበባ? Addis Abäba IPA: [adˈdis ˈabəba] ( listen), \"new flower\"; Oromo: Finfinne,[3][4] [fɪnˈfɪ́n.nɛ́] \"Natural Spring(s)\"), is the capital and largest city of Ethiopia. Finfinne is its Oromo name. It has a population of 3,384,569 according to the 2007 population census, with annual growth rate of 3.8%. This number has been increased from the originally published 2,738,248 figure and appears to be still largely underestimated.[2][5]",
"sw" : "Addis Ababa (pia Addis Abeba; kwa Kiamhara አዲስ አበባ, \"Ua Jipya\"; kwa Kioromo Finfinne) ni mji mkuu wa Ethiopia na wa Umoja wa Afrika."
}
{
"_id" : ObjectId("57c68bab15f4d42b6ecd9eec"),
"en" : "Adelaide of Italy (931 – 16 December 999), also called Adelaide of Burgundy, was the second wife of Holy Roman Emperor Otto the Great[2] and was crowned as the Holy Roman Empress with him by Pope John XII in Rome on February 2, 962. Empress Adelaide was perhaps the most prominent European woman of the 10th century; she was regent of the Holy Roman Empire as the guardian of her grandson in 991-995.[2]",
"sw" : "Adelaide wa Italia (takriban 931 – 16 Desemba, 999) alikuwa binti wa Rudolf II, mfalme wa Burgundia. Kwanza aliolewa na Lothar, mfalme wa Italia. Alipofariki Lothar, Adelaide aliolewa na Otto I, mfalme wa Ujerumani. Aliishi maisha matakatifu. Sikukuu yake ni 16 Desemba."
}
What I would like to do is to select one specific record. For example I expect to select the last record by doing this:
db.wiki.find({"sw": "Adelaide wa Italia"}).pretty();
But the mongo shell returns nothing.
Indeed, I know that I can create an index and do something like:
db.wiki.find({$text: {$search: "\"Adelaide wa Italia\""}}).pretty();
which indeed returns the record as expected.
What am I doing wrong in the non-index searching please?
In this case you should use search with regex:
db.wiki.find({"sw": /Adelaide wa Italia/}).pretty();
With the query you wrote:
db.wiki.find({"sw": "Adelaide wa Italia"}).pretty();
you tell Mongo to return all documents where sw is exactly equal to Adelaide wa Italia, whereas you want all documents whose sw field contains this phrase.
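The difference between the two queries can be imitated without a database. The following Python sketch does not talk to MongoDB at all; it only mimics the two matching rules on plain dictionaries. The documents are shortened copies of two records from the question, and the integer _id values and function names are made up for illustration.

```python
import re

# Shortened stand-ins for two documents from the question.
docs = [
    {"_id": 1, "sw": "Adana (Kigiriki Άδανα) ni mji mkubwa katika nchi ya Uturuki."},
    {"_id": 3, "sw": "Adelaide wa Italia (takriban 931 – 16 Desemba, 999) "
                     "alikuwa binti wa Rudolf II, mfalme wa Burgundia."},
]


def find_equal(docs, field, value):
    """Mimics {"sw": "..."}: the whole field must equal the value."""
    return [d for d in docs if d.get(field) == value]


def find_contains(docs, field, phrase):
    """Mimics {"sw": /.../}: the phrase may occur anywhere in the field."""
    # re.escape keeps the phrase literal even if it contains regex characters.
    pattern = re.compile(re.escape(phrase))
    return [d for d in docs if pattern.search(d.get(field, ""))]


print(len(find_equal(docs, "sw", "Adelaide wa Italia")))     # 0
print(len(find_contains(docs, "sw", "Adelaide wa Italia")))  # 1
```

The equality lookup finds nothing because no sw value is exactly the three-word phrase, which matches what the mongo shell returned for the first query.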