UIMA Ruta: Can't ignore periods using MarkTable - uima

If I have a dictionary containing various acronyms and designations, I would ideally like to be able to avoid having entries for each "U.S.A.", "USA", and "usa". I have no trouble ignoring case, but the ignore chars argument does not seem to work across the board. After the appropriate import and declare statements, I get something like the following:
Document{->MARKTABLE(Acroynm,1,AcronymDict,true,0,".,-",10,"expandedForm"=2)};
It successfully ignores a single set of 1-10 hyphens. It does not ignore 10 hyphens spaced throughout the word. (It will ignore a-bc and a--bc but not a-b-c.) This is actually fine for hyphens, but I cannot, with the above statement, get it to ignore periods at all. (It ignores neither a.bc or a.b.c.) Further, if I can get it to ignore periods, is there any way I can ignore the periods in A.B.C. and not just the one in A.BC?
Any further description of the limitations of this argument would be useful. Thanks.
Relevant Ruta Documentation: https://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.language.actions.marktable

Related

Multiple regex in one command

Disclaimer: I have no engineering background whatsoever - please don't hold it against me ;)
What I'm trying to do:
Scan a bunch of text strings and find the ones that
are more than one word
contain title case (at least one capitalized word after the first one)
but exclude specific proper nouns that don't get checked for title case
and disregard any parameters in curly brackets
Example: Today, a Man walked his dogs named {FIDO} and {Fifi} down the Street.
Expectation: Flag the string for title capitalization because of Man and Street, not because of Today, {FIDO} or {Fifi}
Example: Don't post that video on TikTok.
Expectation: No flag because TikTok is a proper noun
I have bits and pieces, none of them error-free from what https://www.regextester.com/ keeps telling me so I'm really hoping for help from this community.
What I've tried (in piece meal but not all together):
(?=([A-Z][a-z]+\s+[A-Z][a-z]+))
^(?!(WordA|WordB)$)
^((?!{*}))
I think your problem is not really solvable solely with regex...
My recommendation would be splitting the input via [\s\W]+ (e.g. with python's re.split, if you really need strings with more than one word, you can check the length of the result), filtering each resulting word if the first character is uppercase (e.g with python's string.isupper) and finally filtering against a dictionary.
[\s\W]+ matches all whitespace and non-word characters, yielding words...
The reasoning behind this different approach: compiling all "proper nouns" in a regex is kinda impossible, using "isupper" also works with non-latin letters (e.g. when your strings are unicode, [A-Z] won't be sufficient to detect uppercase). Filtering utilizing a dictionary is a way more forward approach and much easier to maintain (I would recommend using set or other data type suited for fast lookups.
Maybe if you can define your use case more clearer we can work out a pure regex solution...

crystal reports attempting to link two tables by matching string with no luck

As stated in the title, I have two tables I'm attempting to link. Both Strings appear to be a match, however Crystal Reports is not picking it up. The only thing I can think is that that length of the field is different, even though the strings are the same. could that cause a discrepancy? If so how can I correct for it? Thank you
Length of the string will prevent a match. If you are using the Trim(string) function, that only removes spaces found at the beginning or end of your string, so the two strings could still be of different lengths after using this function. You will need to use another function to capture a substring of the original string. To do this you can use the Left(string, length) function to ensure both strings are the same length.
If they still do not match then you may have non-printable characters in one or both of your strings. Carriage Return and Line Feed tend to be the most commonly found non-printable characters. A Carriage Return is represented as Chr(10), while a Line Feed is represented as Chr(13). These are Built In Constants similar to those found in VBA and Visual Basic.
You can use a find and replace to remove them with the following formula. Its not a bad idea to also include the trim and left functions in this as well to ensure you get the best match possible.
Replace(Replace(Left(Trim({YourStringField}), 10),Chr(10), ""),Chr(13), "")
There are a few additional Built In Constants you may need to check for if this doesn't work. A Tab is represented as Chr(9) for example. Its very rare for strings to contain the other Built In Constants though. In most cases Carriage Return and Line Feed are the only ones that are typically found in Plain Text. Tabs and the other constants should only be found in Rich Text and are very rare in string data.

Removing spaces from a string using Powershell

I have an issue where extracting data from database it sometimes (quite often) adds spaces in between strings of texts that should not be there.
What I'm trying to do is create a small script that will look at these strings and remove the spaces.
The problem is that the spaces can be in any position in the string, and the string is a variable that changes.
Example:
"StaffID": "0000 25" <- The space in the number should not be there.
Is there a way to have the script look at this particular line, and if it finds spaces, to remove them.
Or:"DateOfBirth": "23-10-199 0" <-It would also need to look at these spaces and remove them.
The problem is that the same data also has lines such as:
"Address": " 91 Broad street" <- The spaces should be here obviously.
I've tried using TRIM, but that only removes spaces from start/end.
Worth mentioning that the data extracted is in json format and is then imported using API into the new system.
You should think about the logic of what you want to do, and whether or not it's programmatically possible to determine if you can teach your script where it is or is not appropriate to put spaces. As it is, this is one of the biggest problems facing AI research right now, so unfortunately you're probably going to have to do this by hand.
If it were me, I'd specify the kind of data format that I expect from each column, and try my best to attempt to parse those strings. For example, if you know that StaffID doesn't contain spaces, you can have a rule that just deletes them:
$staffid = $staffid.replace("\s+",'')
There are some more complicated things that you can do with forced formatting (.replace) that have already been covered in this answer, but again, that requires some expectation of exactly what data is going to come out of what column.
You might want to look more closely at where those spaces are coming from, rather than process the output like this. Is the retrieval script doing it? Maybe you can optimize the database that you're drawing from?

Are periods in object names bad practice?

For example, a constraint for a default value of 0 could be named DF__tablename.columnname.
Although my search for this being bad practice doesn't yield results, in the numerous constraints examples I've seen on SO and many other sites, I never spotted a period.
Using period in an object name is bad practice.
Don't use dot character in an identifier. Yes it can be done but the drawbacks outweigh any benefits.
tl;dr
Special characters, such as a dot, are not allowed in regular identifiers. If an identifier does not follow the rules for regular identifier, then references to the identifier must be enclosed in square brackets (or ANSI double quotes).
https://learn.microsoft.com/en-us/sql/relational-databases/databases/database-identifiers?view=sql-server-2017
In terms of the period (dot character), using that in an identifier is not allowed in a regular identifier; but it could be used within square brackets.
The dot character is even more of a special-ish character in SQL; it's used to separate an identifier from a preceding qualifier.
SELECT mytable.mycolumn FROM mytable
We could also write that as
SELECT [mytable].[mycolumn] FROM mytable
We could also write
SELECT [mytable.mycolumn] FROM mytable
but that means something very different. With that, we aren't referencing a column named mycolumn, we are now referencing an identifier that contains a dot character.
SQL Server will deal with this just fine.
But if we do this, and start using the dot character in our identifiers, we will be causing confusion and frustration to future readers. Any benefit we would gain by using dot characters in identifiers is going to be far outweighed by the downside for others.
Similarly, why we don't create tables named WHERE (1=1) OR, or create columns named SUBSTR(foo.bar,1,10) to avoid monstrosities like
SELECT [SUBSTR(foo.bar,1,10)] FROM [WHERE (1=1)] OR]
Which may be valid SQL, but it will cause future readers to become very upset, and cause them to curse us, our descendants and loved ones. Don't make them do that. For the love of all that is good and beautiful in this world, don't use dot characters in identifiers.
It is perfectly valid to have periods in the object names. However, this requires you to use square brackets around the object name when referring to it. In case you forget these square brackets you will get some error messages that can be less intuitive to the inexperienced developer. For this reason I recommend not to use periods in the object names. I would also guess this is the main reason you don't often see examples of periods in object names on the internet.
In your example, you could use another underscore instead of the period, like this: DF__tablename_columnname

Any way to extend org-mode tags to include characters like "-"?

BRIEF: can the characters permitted in org-mode tags be extenced in any way?
E.g. to include -, dashes?
DETAIL:
I see
http://orgmode.org/org.html#Tags
Tags are normal words containing letters, numbers, ‘_’, and ‘#’. Tags
must be preceded and followed by a single colon, e.g., ‘:work:’.
I am somewhat surprised that this is not extensible. Is it, and I have missed it?
TODO keywords can include dashes. Occasionally I would like to treat TODOs as intermiscible wth tags - e.g. move a TODO to a tag, and vice versa - but this syntax difference gets in the way.
Before I start coding, does anyone know why dashes are not allowed? I conjecture confusion with timestamps.
Not certain, but it is likely because certain agenda search strings use "-" as an operator, and there is not yet a way to escape those using "-".
I have used hyphens for properties and spaces for todo kw with no problems for my purposes, but I have not tried them in searches.
The new exporter will be included in 8.0 and with it a detailed description of syntax. If you want to file a feature request to allow hyphens, perhaps by allowing escaping in search strings, now is the time to do it.