How should I use UIMA Ruta to match all the words between line breaks?

Thanks in advance for any help!
I have some text like the following:
aaaaa aaaa aaaaa aaaaaa
bbbbb bbbbb bbbb bbbbbb
cccccc ccccc ccccc cccccc
I want to use Ruta to create annotations that match all text between line breaks. The annotations should produce the following three matches:
1. aaaaa aaaa aaaaa aaaaaa
2. bbbbb bbbbb bbbb bbbbbb
3. cccccc ccccc ccccc cccccc
I tried to match everything between line breaks with the following rule:
BREAK #{-> MARK(Stuff)} BREAK;
But no luck. Could anyone please make a suggestion?
Thank you very much!

The problem with your rule is probably the current filtering setting. Whitespace, line breaks and markup are not visible to rules by default, so the rule is probably unable to find any anchor to start the matching process. You need to make breaks visible to the rules, e.g., with RETAINTYPE:
Document{-> RETAINTYPE(BREAK)};
BREAK #{-> MARK(Stuff)} BREAK;
Document{-> RETAINTYPE}; // for restoring the default setting
There is also an analysis engine that can create these annotations:
PlainTextAnnotator.
However, this analysis engine also includes whitespace at the beginning and end of each line. It can be removed with something like:
Document{-> RETAINTYPE(SPACE)};
Line{->TRIM(SPACE)};
In UIMA Ruta 2.2.1 (next release) you can also write something like:
Document{-> RETAINTYPE(BREAK)};
(#{-> Stuff} BREAK)+;
(I am a developer of UIMA Ruta)

Related

Is there a way to change multiple lines with one loop in PowerShell?

I have a text file like this:
info
:
I went to Paris
xxx
yyy
zzz
info
:
I went to Italy
aaa
bbb
ccc
I want this text file to look like this:
Info : I went to Paris
xxx
yyy
zzz
Info : I went to Italy
aaa
bbb
ccc
So the steps would be:
1- find every colon
2- (somehow) do the equivalent of pressing backspace twice to move the colon up to the line above, typing space + colon + space, and pressing delete, which pulls the Paris and Italy lines up onto the info line.
Use the regex-based -replace operator:
Tip of the hat to Santiago Squarzon for helping to simplify the regex and substitution.
# Outputs to the screen; pipe to Set-Content to save back to a file as needed.
(Get-Content -Raw file.txt) -replace '(?m)(?<=^info)(?:\r?\n){2}:(?:\r?\n){2}', ' : '
For an explanation of the regex and the ability to experiment with it, see this regex101.com page.
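For readers who want to experiment with the pattern outside PowerShell, here is a minimal Python sketch of the same substitution. It is not part of the original answer. The function name join_info_lines is hypothetical, re.I is added to mirror the case-insensitivity of -replace, and the sample string assumes (as the {2} quantifiers imply) that blank lines separate "info", ":" and the following text.

```python
import re

# Same pattern as the PowerShell answer. re.M makes ^ match at line starts;
# re.I mirrors the case-insensitive behavior of PowerShell's -replace.
PATTERN = re.compile(r"(?<=^info)(?:\r?\n){2}:(?:\r?\n){2}", re.M | re.I)


def join_info_lines(text: str) -> str:
    """Collapse 'info', a lone ':' line and the next line into 'info : ...'."""
    return PATTERN.sub(" : ", text)


if __name__ == "__main__":
    # Assumed input layout: a blank line on each side of the ':' line.
    sample = "info\n\n:\n\nI went to Paris\nxxx\nyyy\nzzz\n"
    print(join_info_lines(sample), end="")
```

If the file has no blank lines around the colon, the pattern does not match and the text is returned unchanged, which is the same behavior as the PowerShell one-liner.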

How to use Markdown to create an ordered list on GitHub?

Does anyone know how I can create an ordered (numbered) list using Markdown on GitHub? For an unordered list I can write
* aaa
* bbb
* ccc
and then it looks like
• aaa
• bbb
• ccc
but I want it to look like
1. aaa
2. bbb
3. ccc
The GitHub Flavored Markdown specification for list items does include:
An ordered list marker is a sequence of 1–9 arabic digits (0-9), followed by either a . character or a ) character.
(The reason for the length limit is that with 10 digits we start seeing integer overflows in some browsers.)
So your markdown source must include those digits explicitly:
1. aaa
2. aaa
3. aaa
As flaxel points out, you also have lazy numbering, which means the markdown source would be:
1. aaa
1. aaa
1. aaa

How to match the last part of a line conditionally?

I am very new to Perl. Currently I am using a very simple Perl regex to print the last part of a line after the string "Lecture", reading from a file 1.txt.
cat 1.txt | perl -ne 'print "$1 \n" while /Lecture\s+(\d+\w)/g;'
It works well but I need to add a simple condition to it:
The first preference is always to print the characters after the string "Lecture".
If the string "Lecture" is not found in a line, simply print the characters at the very end of the line.
PS: The string "Lecture" might not have a space around it. I used a word character throughout because the value is not necessarily a plain number; it can be alphanumeric.
Example
cat 1.txt
Some Topic 1 Lecture 001
Some Topic 2 Lecture 002
Topic 3 ( classroom Session ) Lecture2B
Practicals 07A
Submissions 10
Topic5Lecture4
Expected output:
001
002
2B
07A
10
4
I would prefer a solution that I can run directly in the CLI/console (just like my original code: cat 1.txt | perl code).
I don't want to execute a separate .pl file.
This regex:
(?:\w*Lecture)?([^\s]+)$
will capture ((...)) all (+) non-whitespace characters ([^\s]) at the end of the line ($),
optionally (?) preceded by a non-captured ((?:...)) "Lecture", even if there are other word characters before it (\w*).
It gets the desired output:
001
002
2B
07A
10
4
4
For the sample input:
Some Topic 1 Lecture 001
Some Topic 2 Lecture 002
Topic 3 ( classroom Session ) Lecture2B
Practicals 07A
Submissions 10
Topic5 Lecture4
Topic5Lecture4
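To check the pattern against the sample input without a Perl setup, the following is a small Python sketch. It is not part of the original answer: the helper name last_part is hypothetical, and the pattern is copied unchanged from above (this particular syntax behaves the same way in Python's re module).

```python
import re
from typing import Optional

# Pattern copied verbatim from the answer above.
PATTERN = re.compile(r"(?:\w*Lecture)?([^\s]+)$")


def last_part(line: str) -> Optional[str]:
    """Return the text after 'Lecture', or else the last non-whitespace run."""
    match = PATTERN.search(line.rstrip("\n"))
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = [
        "Some Topic 1 Lecture 001",
        "Some Topic 2 Lecture 002",
        "Topic 3 ( classroom Session ) Lecture2B",
        "Practicals 07A",
        "Submissions 10",
        "Topic5 Lecture4",
        "Topic5Lecture4",
    ]
    for line in sample:
        print(last_part(line))
```

Note that a line with trailing spaces or tabs yields no match, because the pattern requires the captured text to run up to the end of the line.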

Generate word from list of characters

I asked this question and then realized I had asked it incorrectly, although the answer @Zdim provided is exactly what I asked for. So now I need to change that question a bit.
my $str = 'aaaa';
print $str++, $/ while $str le 'dddd';
So the above code does each combination from aaaa to dddd for instance:
aaaa
aaab
aaac
...
daaa
...
dddd
However, we need to generate all possible combinations of a given set of characters, whether they are numeric, special or alphabetical characters. So if I tell the script that the minimum is 2-letter words and the maximum is 4-letter words, and I give an input string of:
abcdefG1234%##
it will then generate:
aa
aaa
aaaa
bb
aaab
bbbb
####
abc#
ab#1
...
So it should use each of the characters and create every possible combination from a minimum of 2 characters to a maximum of 4 characters.
So even if I give it the entire set of alphanumeric and special characters, it will create every possible word or string within the range of 2 to 4 characters.
If we take this glob example, it is close, but it only produces all the sets of 4, not all combinations of 2, then 3, and then 4:
print "$_\n" while glob '{A,B,C,D,#,#,a,d,e,f}' x 4;
for my $i (2..4) {
say while glob '{A,B,C,D,#,#,a,d,e,f}' x $i;
}
One way to do this is a small extension of the linked question and answer. First generate, from a given string, the sequence of ASCII codes that will be sampled from:
perl -wE'say for map { ord($_) } split "", q(abcdefG1234%##)'
Now, with that list in hand, run the code from the linked page for sequences of length 2 through 4.
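As an illustration of the overall goal (every string of length 2 through 4 over a given character set), here is a minimal Python sketch. It is a different technique from the ord-based approach above: it uses a Cartesian product rather than the code from the linked page. The function name generate_words is hypothetical, and dropping repeated input characters (such as the doubled # in the example string) is an assumption about what is wanted.

```python
from itertools import product
from typing import Iterator


def generate_words(chars: str, min_len: int, max_len: int) -> Iterator[str]:
    """Yield every string over `chars` with length min_len..max_len."""
    # Assumption: repeated input characters count once; order is preserved.
    alphabet = list(dict.fromkeys(chars))
    for length in range(min_len, max_len + 1):
        # product(..., repeat=length) enumerates all length-tuples with repetition.
        for combo in product(alphabet, repeat=length):
            yield "".join(combo)


if __name__ == "__main__":
    # Small alphabet so the output stays readable.
    for word in generate_words("ab#", 2, 3):
        print(word)
```

Because this is a generator, the words are produced one at a time rather than held in memory, which matters once the alphabet grows: the number of strings is the sum of k to the power n for each length n, where k is the number of distinct characters.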

sed: replace only in part of string

I have a simple playlist of song files:
1003 James Brown - The Boss Unknown Artist.mp3
1004 James Brown - Slaughters Theme Unknown Artist.mp3
1005 James Brown - Payback(1) Unknown Artist.mp3
...
I would like them in the following format:
1003 James_Brown_-_The_Boss_Unknown_Artist.mp3
1004 James_Brown_-_Slaughters_Theme_Unknown_Artist.mp3
...
Notice that the whitespace after the leading number is NOT replaced. I have the following simple sed script:
sed "s/ /_/g"
but that also replaces the space after the number. I know how to form capture groups, but that will not help either. How can I convince sed to apply the replacement only to a portion of the input string, rather than to the whole string?
You could do
sed 's/ /_/g; s/_/ /'
I.e. first turn all spaces into underscores, then turn the first underscore back into a space.
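The same two-step idea can be sketched in Python for comparison. This is only an illustration, not part of the original answer, and both function names are hypothetical. The second function shows an alternative that splits at the first space instead of undoing a replacement.

```python
def underscore_two_step(line: str) -> str:
    """Mirror of the sed answer: replace all spaces, then restore the first '_'."""
    return line.replace(" ", "_").replace("_", " ", 1)


def underscore_after_first_space(line: str) -> str:
    """Alternative: leave everything up to the first space untouched."""
    head, sep, tail = line.partition(" ")
    return head + sep + tail.replace(" ", "_")


if __name__ == "__main__":
    name = "1003 James Brown - The Boss Unknown Artist.mp3"
    print(underscore_two_step(name))
    print(underscore_after_first_space(name))
```

The two functions agree on the playlist lines shown. They differ only if the text before the first space already contains an underscore, in which case the two-step version (like the sed command) turns that underscore into a space.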