WORDTABLE not matching a word - UIMA Ruta

I'm trying to match words using a WORDTABLE, but some text is not matched.
In the input below, the word Afghanistan is not matched. If I remove the entry A Coruña;n.a. from the WORDTABLE, it matches.
Sample Input:
Afghanistan
Report
report
Sample CSV (test.csv):
Afghanistan;Afghan.
report;rep.
A Coruña;n.a.
Code:
PACKAGE uima.ruta.example;
RETAINTYPE(SPACE);
WORDTABLE Table = 'test.csv';
DECLARE Annotation Abbr(STRING short);
Document{->MARKTABLE(Abbr, 1, Table,true,0,"",0, "short" = 2)};
RETAINTYPE;

This is most likely caused by the whitespace in the wordlist. There are several options to avoid this problem, e.g., activating the configuration parameter dictRemoveWS.
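If the Ruta analysis engine is assembled with uimaFIT, that parameter can be set on the engine description. The following is only a minimal sketch assuming such a setup; the parameter names "mainScript" and "dictRemoveWS" are passed as plain strings, and the script name is a placeholder:

import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.ruta.engine.RutaEngine;

public class RutaWordtableConfig {
    // Sketch: create a Ruta engine description with dictRemoveWS enabled, so
    // whitespace inside wordtable entries such as "A Coruña" no longer breaks
    // matching of the other entries. The script name is illustrative only.
    public static AnalysisEngineDescription createEngine() throws Exception {
        return AnalysisEngineFactory.createEngineDescription(RutaEngine.class,
                "mainScript", "uima.ruta.example.Main",
                "dictRemoveWS", true);
    }
}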

Related

How to get only the numeric value in a text file using PowerShell?

I have a text file sample.txt containing
computer
computer.pc = 1
pc
I want only the number 1, and I want to assign that value to a variable:
$number = Get-Content "sample.txt"
You can extract the number by using the Regex Match method.
Example code to do this:
$number = ([regex]::Match((Get-Content "sample.txt"), "\d+")).Value
The pattern \d+ matches one or more decimal digits, and the Match method returns the first match found.
See Quantifiers in Regular Expressions for additional information regarding the quantifiers available.
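For readers outside PowerShell, the same first-match idea can be sketched in Java; this is an illustrative comparison only, not part of the answer above, and it assumes the sample.txt file from the question:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstNumber {
    public static void main(String[] args) throws Exception {
        // Read the whole file and print the first run of decimal digits,
        // mirroring the \d+ pattern used in the PowerShell one-liner.
        String text = new String(Files.readAllBytes(Paths.get("sample.txt")));
        Matcher m = Pattern.compile("\\d+").matcher(text);
        if (m.find()) {
            System.out.println(m.group()); // prints 1 for the sample input
        }
    }
}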

Word - delete text between <> including tables

I'm trying to delete text between < and > that includes 2 tables. I can delete text spanning multiple lines using a wildcard Find and Replace with (\<)(*)(>),
but this doesn't work when the text includes tables. Any ideas? The tables also have varying numbers of rows.
The correct wildcard Find expression would be:
\<*\>
Nevertheless, your observation is correct: it won't find content that includes a table between the < and >. You would need two passes: one Find/Replace that uses the expression above, and another that employs a loop, finding each < and then extending the found range until the next > is encountered.
Got the solution: https://answers.microsoft.com/en-us/msoffice/forum/msoffice_word-msoffice_custom-mso_2016/delete-text-between-including-tables/51f09dcb-8c77-41d3-840c-e8e0545f313a?tm=1531844975462&auth=1
Sub DeleteBetweenAngleBrackets()
    Dim rng As Range
    Selection.HomeKey wdStory
    With Selection.Find
        Do While .Execute(FindText:="<", Forward:=True, _
                          MatchWildcards:=False, Wrap:=wdFindStop, MatchCase:=True) = True
            ' Extend the found range from the "<" to the next ">", even when
            ' tables lie in between, then delete the whole range.
            Set rng = Selection.Range
            rng.End = ActiveDocument.Range.End
            rng.End = rng.Start + InStr(rng, ">")
            rng.Select
            Selection.Delete
        Loop
    End With
End Sub

Removing unwanted commas from a CSV

I'm writing a program in Progress, OpenEdge, ABL, and whatever else it's known as.
I have a CSV file that is delimited by commas. However, there is a "gift message" field, and users enter messages with commas, so my program sees additional entries because of those stray commas.
The CSV fields are not in double quotes, so I cannot just use my usual method, which is:
/** This next block of code will remove all unwanted commas from the data. **/
if v-line-cnt > 1 then /** we won't run this against the headers, otherwise they will get deleted **/
assign
    v-data = replace(v-data,'","',"\t")  /** special technique: replace the quote-comma-quote delimiter with a tab **/
    v-data = replace(v-data,','," ")     /** now that the delimiter is out of the way, strip all nuisance commas **/
    v-data = replace(v-data,"\t",'","'). /** all nuisance commas are gone; turn the tabs back into delimiters **/
Any advice?
Edit:
From Progress, I can call Linux commands, so I should be able to execute C++/PHP/shell scripts etc. from my Progress program. I look forward to advice; until then I shall look into using external scripts.
You are not providing quite enough data for a perfect answer, but given what you say, I think the IMPORT statement should handle this automatically.
In my example, commaimport.csv is a comma-separated CSV file with quotes around text fields; integers, logical values etc. have no quotes. The last field contains a comma in one line:
commaimport.csv
=======================
"Id1", 123, NO, "This is a message"
"Id2", 124, YES, "This is a another message, with a comma"
"Id3", 323, NO, "This is a another message without a comma"
To import this file I define a temp-table matching the file layout and use the IMPORT statement with comma as delimiter:
DEFINE TEMP-TABLE ttImport NO-UNDO
    FIELD field1 AS CHARACTER FORMAT "xxx"
    FIELD field2 AS INTEGER   FORMAT "zz9"
    FIELD field3 AS LOGICAL
    FIELD field4 AS CHARACTER FORMAT "x(50)".

INPUT FROM VALUE("c:\temp\commaimport.csv").
REPEAT:
    CREATE ttImport.
    IMPORT DELIMITER "," ttImport.
END.
INPUT CLOSE.

FOR EACH ttImport:
    DISPLAY ttImport.
END.
You don't have to import into a temp-table. You could import into variables instead.
DEFINE VARIABLE c AS CHARACTER NO-UNDO FORMAT "xxx".
DEFINE VARIABLE i AS INTEGER   NO-UNDO FORMAT "zz9".
DEFINE VARIABLE l AS LOGICAL   NO-UNDO.
DEFINE VARIABLE d AS CHARACTER NO-UNDO FORMAT "x(50)".

INPUT FROM VALUE("c:\temp\commaimport.csv").
REPEAT:
    IMPORT DELIMITER "," c i l d.
    DISP c i l d.
END.
INPUT CLOSE.
This will render basically the same output.
You don't show what your data file looks like. But if the problematic field is the last one, and there are no quotes, then your best bet is probably to read it using INPUT UNFORMATTED to get it a line at a time, and then split the line into fields using ENTRY(). That way you can treat everything after the nth comma as a single field no matter how many commas the line has.
For example, say your input file has three columns like this:
boris,14.23,12 the avenue
mark,32.10,flat 1, the grange
percy,1.00,Bleak house, Dartmouth
... so that column three is an address which might contain a comma and is not enclosed in quotes, so IMPORT DELIMITER can't help you.
Something like this would work in that case:
/* ...skipping a lot of definitions here ... */
input from "datafile.csv".
repeat:
    import unformatted v-line.
    create tt-thing.
    assign tt-thing.name    = entry(1, v-line, ',')
           tt-thing.price   = entry(2, v-line, ',')
           tt-thing.address = entry(3, v-line, ',').
    /* re-attach any extra comma-separated pieces to the address field */
    do v-i = 4 to num-entries(v-line, ','):
        tt-thing.address = tt-thing.address
                         + ','
                         + entry(v-i, v-line, ',').
    end.
end.
input close.

UIMA Ruta wordlist case ignore

My use case is such that I have a list of match words in a WORDLIST "MonthNames.txt".
Now I want to mark all occurrences of these words in the given document, irrespective of the text case.
PACKAGE uima.ruta.example;
WORDLIST MonthNameList = 'MonthNames.txt';
DECLARE MonthNames;
DECLARE MonthNameValue;
// Regex to be used in finding dates
STRING monthNameValueRegex = "(?i)(january|february|march|april|may|june|july|august|september|october|november|december|jan|feb|mar|apr|jun|jul|aug|sept|oct|nov|dec)";
// Mark month name
Document{-> MARKFAST(MonthNames, MonthNameList)};
Document{CONTAINS(MonthNames) -> MARK(MonthNameValue)};
Document{REGEXP(monthNameValueRegex) -> MARK(MonthNameValue)};
Is there any way to do it?
I tried
Document{-> MARKFAST(MonthNames, MonthNameList, true)};
But that is just to ignore whitespace, not text case.
Please help.
Passing the third argument as true makes it ignore the word case.
Document{-> MARKFAST(MonthNames, MonthNameList,true)};
Thanks to Peter for his help.

Separating file name in parts by identifier

This may be a very simple task for many, but I could not find anything appropriate.
I have a file name: filenm_A006.2011.269.10.47.G25_2010
I want to separate all its parts (separated by . and _) so I can use them individually. How can I do it with simple MATLAB commands?
Kind Regards,
Mushi
I recommend regexp:
fname = 'filenm_A006.2011.269.10.47.G25_2010';
parts = regexp(fname, '[^_.]+', 'match');
parts =
'filenm' 'A006' '2011' '269' '10' '47' 'G25' '2010'
You can now refer to parts{1} through parts{8} for the pieces. Explanation: the regexp pattern [^_.] means any character that is not _ or ., and the + means you want groups of at least 1 such character. The 'match' option asks the regexp function to return a cell array of the strings of all matches of that pattern. There are other regexp modes as well; for example, returning the indices of each piece of the file name.
Use the command strsplit:
cellArrayOfParts = strsplit(fileName,{'.' '_'});
You can use strsplit to split it:
strsplit('filenm_A006.2011.269.10.47.G25_2010',{'_','.'})
ans =
'filenm' 'A006' '2011' '269' '10' '47' 'G25' '2010'
Another option is to use regexp, like Peter suggested.