String Indexing and Suffix Trees

I have to build some kind of "string catalogue" out of large PDF documents for faster string/substring searches.
The mechanism should work like this:
A PDF scanner scans the PDF document for strings and invokes a callback method on my catalogue to index that string.
Now, what technique should be used to build such a catalogue?
I have heard of:
- Suffix trees
- Generalized suffix trees
- Suffix arrays
I am mainly leaning toward generalized suffix trees. Am I right or wrong here?
I guess "normal" suffix trees are only good for indexing a SINGLE string.
But what about suffix arrays? Are there generalized suffix arrays out there?
I found a lot of code in C/C++ for building a suffix tree out of a string, but none for building generalized suffix trees!
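To check that I understand what "generalized" buys, here is the naive picture I have in mind, as a Python sketch (a quadratic-time suffix trie with document ids on every node, nowhere near Ukkonen's linear-time construction; all names are mine):

class GeneralizedSuffixTrie:
    # Naive generalized suffix trie: one node per character, document ids
    # recorded along every path. Construction is quadratic in total text
    # length -- fine as a sketch, unusable for large PDFs.
    DOCS = object()  # sentinel dict key holding the set of document ids

    def __init__(self):
        self.root = {}

    def add(self, doc_id, text):
        for start in range(len(text)):
            node = self.root
            for ch in text[start:]:
                node = node.setdefault(ch, {})
                node.setdefault(self.DOCS, set()).add(doc_id)

    def find(self, pattern):
        # The node reached knows every document containing `pattern`.
        node = self.root
        for ch in pattern:
            if ch not in node:
                return set()
            node = node[ch]
        return node.get(self.DOCS, set())

trie = GeneralizedSuffixTrie()
trie.add(1, "banana")
trie.add(2, "bandana")
print(trie.find("ana"))  # {1, 2}
print(trie.find("dan"))  # {2}

If that picture is right, it might also explain why standalone generalized implementations are rare: concatenating all the strings with distinct terminator symbols lets any single-string suffix tree builder produce the generalized tree.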

Multiple regex in one command

Disclaimer: I have no engineering background whatsoever - please don't hold it against me ;)
What I'm trying to do:
Scan a bunch of text strings and find the ones that:
- are more than one word
- contain title case (at least one capitalized word after the first one)
- but exclude specific proper nouns that don't get checked for title case
- and disregard any parameters in curly brackets
Example: Today, a Man walked his dogs named {FIDO} and {Fifi} down the Street.
Expectation: Flag the string for title capitalization because of Man and Street, not because of Today, {FIDO} or {Fifi}
Example: Don't post that video on TikTok.
Expectation: No flag because TikTok is a proper noun
I have bits and pieces, none of them error-free from what https://www.regextester.com/ keeps telling me, so I'm really hoping for help from this community.
What I've tried (piecemeal, but not all together):
(?=([A-Z][a-z]+\s+[A-Z][a-z]+))
^(?!(WordA|WordB)$)
^((?!{*}))
I think your problem is not really solvable solely with regex...
My recommendation would be splitting the input via [\s\W]+ (e.g. with Python's re.split; if you really need strings with more than one word, you can check the length of the result), keeping each resulting word whose first character is uppercase (e.g. with Python's str.isupper), and finally filtering against a dictionary.
[\s\W]+ matches runs of whitespace and non-word characters, so splitting on it yields the words...
The reasoning behind this different approach: compiling all "proper nouns" into a regex is practically impossible, and using isupper also works with non-Latin letters (e.g. when your strings are Unicode, [A-Z] won't be sufficient to detect uppercase). Filtering against a dictionary is a far more straightforward approach and much easier to maintain (I would recommend a set or another data type suited for fast lookups).
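A minimal sketch of that pipeline in Python (the whitelist contents and the exact flagging rule are placeholders to adapt):

import re

PROPER_NOUNS = {"TikTok"}  # placeholder whitelist -- fill in your own nouns

def flag_title_case(text):
    # Drop {parameters} first so {FIDO} and {Fifi} are never inspected.
    text = re.sub(r"\{[^}]*\}", " ", text)
    # Split on runs of whitespace/non-word characters, dropping empty pieces.
    words = [w for w in re.split(r"[\s\W]+", text) if w]
    if len(words) < 2:
        return False  # single-word strings are not checked
    # Flag if any word after the first starts uppercase and isn't whitelisted.
    return any(w[0].isupper() and w not in PROPER_NOUNS for w in words[1:])

print(flag_title_case("Today, a Man walked his dogs named {FIDO} and {Fifi} down the Street."))  # True
print(flag_title_case("Don't post that video on TikTok."))  # False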
Maybe if you can define your use case more clearly we can work out a pure regex solution...

Searching for two Word wildcard strings that are nested

I'm having trouble finding the proper Word wildcard string to find numbers that fit the following patterns:
"NN NN NN" or "NN NN NN.NN" (where N is any number 0-9)
The trouble is that the first string is a subset of the second. My goal is to find a single wildcard string that will capture both. Unfortunately, that would require a zero-or-more occurrences operator for the ".NN" portion, and Word wildcards don't have one.
I'm having to do two searches, and I'm using the following patterns:
[0-9]{2}[^s ][0-9]{2}[^s ][0-9]{2}?[!0-9]
[0-9]{2}[^s ][0-9]{2}[^s ][0-9]{2}.[0-9]{2}
The problem is the first pattern. It works well unless the number is in a table or something and there is nothing after it to match (or not match, if you will) the [!0-9].
You could use a single wildcard Find:
[0-9]{2}[^s ][0-9]{2}[^s ][0-9][0-9.]{1,4}
or:
[0-9]{2}[^s ][0-9]{2}[^s ][0-9][0-9.]{1;4}
to capture both. Which one you use depends on your regional settings (the list separator inside {1,4} is a comma in some locales and a semicolon in others).
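For comparison, a regex flavour with an optional-group operator (which Word's wildcard syntax lacks) can do it in one pattern; a Python sketch, using a plain space where your Word patterns also allow a non-breaking space (^s):

import re

# (?:\.\d{2})? is the "zero or one occurrence" operator Word wildcards are
# missing; being greedy, it grabs the ".NN" tail whenever one is present.
pattern = re.compile(r"\b\d{2} \d{2} \d{2}(?:\.\d{2})?\b")

for text in ["ref 12 34 56 end", "ref 12 34 56.78 end"]:
    print(pattern.search(text).group())
# 12 34 56
# 12 34 56.78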

How do I extract Unicode normalization tables from the XML Unicode Character Database?

I'm writing a script to create tables containing Unicode characters for case folding, etc.
I was able to extract those tables just fine, but I'm struggling to figure out which properties to use to get codepoints for normalization.
In Unicode Standard Annex #44 the closest property group I can find is NF(C|D|KC|KD)_QC, which is for telling whether a string has already been normalized - and it still doesn't list the values I need to actually build the tables.
What am I doing wrong here?
Edit: I'm writing a C library to handle Unicode; this isn't a simple one-and-done, write-it-in-Python problem. I'm trying to write my own normalization (technically composition/decomposition) functions.
Edit2: The decomposition property is "dm", but what about composition, and the Kompatibility variants?
The Unicode XML database in the ucdxml directory is not authoritative. I'd suggest working with the authoritative files in the ucd directory. You'll need:
- the fields Decomposition_Type and Decomposition_Mapping from column 5 of UnicodeData.txt,
- the field Canonical_Combining_Class from column 3, and
- the composition exclusions from CompositionExclusions.txt.
If there's a decomposition type in angle brackets, it's a compatibility mapping (NFKD), otherwise it's a canonical mapping. Composition is defined in terms of decomposition mappings. See section 3.11 Normalization Forms of the Unicode standard and UAX #15 for details.
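If it helps, a rough Python sketch of pulling those three pieces out of the UCD files (column numbers as above; files assumed to sit next to the script, and Hangul, which is decomposed algorithmically, is ignored):

decomp = {}   # codepoint -> (is_compat, [codepoints])
ccc = {}      # codepoint -> Canonical_Combining_Class

for line in open("UnicodeData.txt", encoding="ascii"):
    fields = line.split(";")
    cp = int(fields[0], 16)
    if fields[3] != "0":
        ccc[cp] = int(fields[3])           # column 3: Canonical_Combining_Class
    if fields[5]:                          # column 5: Decomposition_Type/Mapping
        parts = fields[5].split()
        is_compat = parts[0].startswith("<")  # <type> present => compatibility (NFKD)
        decomp[cp] = (is_compat, [int(p, 16) for p in parts if not p.startswith("<")])

exclusions = set()
for line in open("CompositionExclusions.txt", encoding="utf-8"):
    line = line.split("#")[0].strip()      # drop comments and blanks
    if line:
        exclusions.add(int(line, 16))

# Composition pairs: invert the canonical two-element, non-excluded
# decompositions. A full implementation also derives the singleton and
# non-starter exclusions rather than relying on the file alone.
compose = {tuple(m): cp for cp, (compat, m) in decomp.items()
           if not compat and len(m) == 2 and cp not in exclusions}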

Unicode case folding to upper case

I'm trying to implement a library for reading Microsoft CFB (Compound File Binary) Format files, according to the official specification of that format. The specification is available from this site.
In a nutshell - some of the structures of the file are stored in a red-black tree. I've got a problem with the comparison predicate used for storing these structures in that tree. The specification says that, if the names (the strings are stored as UTF-16, the standard in Windows APIs) of these structures are different, it is necessary to iterate through every UTF-16 code point and:
(...) convert to upper-case with the Unicode Default Case Conversion
Algorithm, simple case conversion variant (simple case foldings), with the following notes.<2> Compare each upper-cased UTF-16 code point binary value.
The <2> reference says that:
For Windows XP and Windows Server 2003: The compound file implementation
conforms to the Unicode 3.0.1 Default Case Conversion Algorithm, simple case folding
(http://www.unicode.org/Public/3.1-Update1/CaseFolding-4.txt) with the following exceptions.
However, when I looked up the referenced case folding file, and read the UTR #21 "Case Mapping" referenced there, I realized that the case folding is defined as an operation that bears much more resemblance to lower-casing, rather than upper-casing.
By using CaseFolding-4.txt, we can obtain the case folding mappings of upper-case letters to lower-case ones. The mapping is always 1-to-1, since full foldings (those that expand to multiple characters) aren't needed here. However, the reverse mapping of lower-case letters to upper-case ones isn't straightforward anymore. For example,
0392; C; 03B2; # GREEK CAPITAL LETTER BETA
03D0; C; 03B2; # GREEK BETA SYMBOL
Thus, we have no way of knowing whether 03B2 should be converted to 0392 or 03D0. Does the standard define something like folding to upper-case? Maybe I should use case folding, and then convert to upper-case? Or have I understood the specification completely wrong?
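To make the ambiguity concrete, here is a quick Python sketch that parses the simple foldings (status C and S) out of CaseFolding-4.txt and inverts them; beta ends up with two preimages:

from collections import defaultdict

# Keep only simple foldings: status C (common) and S (simple).
fold = {}
for line in open("CaseFolding-4.txt", encoding="utf-8"):
    line = line.split("#")[0].strip()       # drop comments and blanks
    if not line:
        continue
    code, status, mapping = [f.strip() for f in line.split(";")[:3]]
    if status in ("C", "S"):
        fold[int(code, 16)] = int(mapping, 16)

# Inverting the folding is where the trouble starts.
unfold = defaultdict(set)
for src, dst in fold.items():
    unfold[dst].add(src)

print([hex(cp) for cp in sorted(unfold[0x03B2])])  # ['0x392', '0x3d0'] -- not 1-to-1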
Summary: The wording used by Microsoft is...confusing to say the least. It appears that simple upper case mapping should be done, though I can't be certain.
Background
Part of the confusion might be the difference between case folding and case mapping. Case mapping maps every character to a designated case. Case folding, while it is based on lower-casing, is defined to result in case-less characters (UTR #21 §1.3).
Now there are two variants of case mapping and case folding: simple and full. Unlike the simple transformation, the full one can change string length and, as you rightly point out, is not needed here. The specification specifically mentions simple, which is probably the only clear thing in this answer. I do feel the need to mention for future reference that the current Unicode Standard (6.3.0) says that the default case transformation is the full one, though the version Microsoft references (3.1.1) does not appear to make this distinction.
Spec Analysis
(...) convert to upper-case with the Unicode Default Case Conversion Algorithm, simple case conversion variant (simple case foldings), with the following notes.<2> Compare each upper-cased UTF-16 code point binary value.
To me this quote appears to suggest they want upper case, and simply made an error by saying case folding instead of case mapping. But then comes that reference you quoted:
For Windows XP and Windows Server 2003: The compound file implementation conforms to the Unicode 3.0.1 Default Case Conversion Algorithm, simple case folding (http://www.unicode.org/Public/3.1-Update1/CaseFolding-4.txt) with the following exceptions.
They actually mention the case folding data file! At this point, I'm not sure what to think. My main line of thought is that Microsoft wants case folding, but erroneously thought it was based on upper-casing rather than lower-casing. Even that is a stretch, but it's the closest I've been able to come to reconciling this possible contradiction, and I hope there's a better explanation.
I've found in section 2.6.1 the following which supports some form of upper-casing:
[...] the directory entry name is compared using a special case-insensitive upper-case mapping, described in Red-Black Tree.
Note that they do in fact use the term mapping here.
The exception list
Taking a look at the exception list for the mentioned Windows XP and Windows Server 2003, most entries are subtractions, suggesting code points Microsoft wants to keep distinct. However, in the table, the code points are actually listed in reverse order to the Unicode case folding data file.
One interpretation of this is that it's just a display quirk. This idea is shot down by the last row where they subtract the case transformation 0x03C2 -> 0x03C2. This transformation does not exist in the data file since the transformation 0x03C2 -> 0x03C3 does (an unlisted case transformation is considered to transform to itself).
Another interpretation is that they do in fact erroneously believe that it's the reverse mapping that's the correct one. As you mentioned though, this runs into trouble, as the reverse mapping is not always straightforward. Otherwise, this interpretation would be fine.
A third interpretation is to consider their reference to the Unicode case folding data file wrong. This of course makes me feel uneasy, but if they actually did mean case mapping originally, they might have just provided the link as a quick reference point. The exception list they mention does have column headers such as "Lowercase UTF-16 code point", but we know that case folding is in fact case-less.
As an aside, I did look at the exception list for the later operating systems, hoping to gain some more insight. I found more confusion. In particular the addition of 0x03C3 -> 0x03A3 troubles me. Since the exception list and the Unicode file list their code points in the opposite order, it appears that the transformation is already in the data file and doesn't need to be added. This part of the specification does not want to be understood!
Conclusion
If you've read all of the above, you'll probably guess that this conclusion is going to be less than ideal. Clearly at one or more points, the specification is in error, but it's hard to tell where. Really there are three possibilities depending on your interpretation as to what kind of case transformation needs to be done.
- Simple upper case mapping
- Simple case folding, followed by simple upper case mapping
- Simple case folding
To me it seems like Microsoft does in fact want upper casing. From there I believe that the case folding reference is an error, and as such my guess is they just want simple upper case mapping.
I highly doubt it's the last option (simple case folding only), though. The other two options would give very similar results, with only a small number of code points possibly differing.
It seems like the only way to know for sure would be to either contact Microsoft, or painstakingly look at binaries to see which method is followed.
In 3.13 Default Case Algorithms (p. 115) of The Unicode Standard Version 6.2 – Core Specification, the text refers to UnicodeData.txt. This contains:
03B2;GREEK SMALL LETTER BETA;Ll;0;L;;;;;N;;;0392;;0392
03D0;GREEK BETA SYMBOL;Ll;0;L;<compat> 03B2;;;;N;GREEK SMALL LETTER CURLED BETA;;0392;;0392
which indicates that both the Greek small letter Beta (03B2) and the Greek Beta symbol (03D0) have the Greek capital letter Beta (0392) as their simple uppercase mapping, and as an aside shows that the Beta symbol is compatibility-equivalent to the small Beta. It also contains the remainder of the bidirectional case conversion you are looking for. You may also need to look at SpecialCasing.txt for boundary cases.
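If simple upper-case mapping is the route you take, extracting it from UnicodeData.txt is short; a sketch in Python (field numbering per UAX #44, file assumed to be local):

# Simple uppercase mappings live in field 12 of UnicodeData.txt
# (field 13 is lowercase, field 14 titlecase); an empty field means
# the code point maps to itself.
upper = {}
for line in open("UnicodeData.txt", encoding="ascii"):
    fields = line.rstrip("\n").split(";")
    if fields[12]:
        upper[int(fields[0], 16)] = int(fields[12], 16)

print(hex(upper[0x03B2]))  # 0x392 (small beta -> capital beta)
print(hex(upper[0x03D0]))  # 0x392 (beta symbol -> capital beta as well)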

Encoding special chars in XSLT output

I have built a set of scripts, part of which transform XML documents from one vocabulary to a subset of the document in another vocabulary.
For reasons that are opaque to me, but apparently non-negotiable, the target platform (Java-based) requires the output document to have 'encoding="UTF-8"' in the XML declaration, but some special characters within text nodes must be encoded with their hex Unicode value - e.g. '–' must be replaced with '&#x2013;' and so forth. I have not been able to acquire a definitive list of which chars must be encoded, but it does not appear to be as simple as "all non-ASCII".
Currently, I have a horrid mess of VBScript using ADODB to directly check each line of the output file after processing, and replace characters where necessary. This is painfully slow, and unsurprisingly some characters get missed (and are consequently nuked by the target platform).
While I could waste time "refining" the VBScript, the long-term aim is to get rid of that entirely, and I'm sure there must be a faster and more accurate way of achieving this, ideally within the XSLT stage itself.
Can anyone suggest any fruitful avenues of investigation?
(edit: I'm not convinced that character maps are the answer - I've looked at them before, and unless I'm mistaken, since my input could conceivably contain any unicode character, I would need to have a map containing all of them except the ones I don't want encoded...)
<xsl:output encoding="us-ascii"/>
Tells the serialiser that it has to produce ASCII-compatible output. That should force it to produce character references for all non-ASCII characters in text content and attribute values. (Should there be non-ASCII in other places like tag or attribute names, serialisation will fail.)
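If you want to sanity-check that behaviour outside XSLT, most serialisers have the same lever; a small Python stdlib sketch (note it emits decimal references such as &#8221; where your platform may want hex, so verify which form it accepts):

import xml.etree.ElementTree as ET

root = ET.fromstring('<doc>a curly quote \u201d and a dash \u2013</doc>')
# An ASCII target encoding forces character references for everything
# outside ASCII -- the same lever <xsl:output encoding="us-ascii"/> pulls.
print(ET.tostring(root, encoding="us-ascii").decode("ascii"))
# <?xml version='1.0' encoding='us-ascii'?><doc>a curly quote &#8221; and a dash &#8211;</doc>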
Well, with the XSLT 2.0 you have tagged your post with, you can use a character map; see http://www.w3.org/TR/xslt20/#character-maps.