In Python (or any language) what does an "upper" function do to Hindi, Amharric and other non-Latin character sets? - uppercase

Subject says it all. Been looking for an answer, but cannot seem to find it.
I am writing a web app that will store data in a database and also have language files translated into a wide variety of character sets. At various moments, the text will be presented. I want to control presentation such as spurious blank spaces at the beginning and end of strings. Also I want to ensure some letters are upper or lower case.
My question is: what happens in upper/lower case functions when the character set only has one case?
EDIT Sub question: Are there any unexpected side effects to be aware of?
My guess is that you simply get back the one and only character.
EDIT - Added Description
The main reason for asking this question is that I am writing a webapp that will be distributed and run on machines in remote areas with little or no chance to fix "on-the-spot" bugs. It's not a complicated webapp, but will run with many different language char sets. I want to be certain of my footing before releasing the server.

First of all the upper() and lower() method in python can be applied to Hindi, Amharric and non-letter character sets.
For instance will the upper() method converts the lowercase characters if an equivalent uppercase of this char exists. If not, then not.
Or better said, if there is nothing to convert, it stays the same.

Related

The list of unicode unusual characters

Where can I get the complete list of all unicode characters that doesn't behave as simple characters. Examples: character 0x0363 (won't be printed without another one before), character 0x0084 (does weird things when printed). I need just a raw list of such unusual characters to replace them with something harmless to avoid unwanted output effects. Regular characters (those who not in this list) should use exactly one character place when printed (= cursor moved +1 to the right), should not depend on previous or next characters, and should not affect printing style in any way.
Edit because of multiple comments:
I have some unicode string, usually consists of "usual" characters like 0x20-0x7E or cyrillic letters. Also, there are a lot of other unicode characters that are usual and may be safely assumed as having strlen() = 1. The string is printed on the terminal and I should know the resulting position of the cursor. I don't want to use some complex and non-stable libraries to do that, i want to have simplest possible logic to do that. Every problematic character may be replaced with U+0xFFFD or something like "<U+0363>" (ASCII string with its index instead of character itself). I want to have a list of "possibly-problematic" characters to replace. It is acceptable to have some non-problematic characters in this list too, but not much.
There is no simple algorithm for this. You'll likely need a complex, but extremely stable library: libicu, or something based on it. Basically every other library that does this kind of work is based on libicu, which is maintained by the Unicode organization.
If you don't want to use the official library (or something based on their library), you'll need to parse the Unicode Character Database yourself. In particular, you need to look at Character Properties, and parse the files in the UCD.
I believe you're asking for Bidi_Class (i.e. "direction") to be Left_To_Right, Canonical_Combining_Class to be Not_Reordered, and Joining_Type to be Non_Joining.
You probably also want to check the General_Category and avoid M* (Marks) and C* (Other).
This should work for some Emoji, but this whole approach will break a lot of emoji that look simple and are not. Most famously: ❤️, which is two "characters," not one. You may want to filter out Emoji. As a simple starting point, you may want to restrict yourself to the Basic Multilingual Plane (BMP), which are code points 0000-FFFF. Anything above this range is, almost by definition, rare or unusual. The BMP does include some emoji, but most emoji (and all new emoji) are outside the range.
Remember that the glyphs for single characters can still have radically different widths, even in nominally fixed-width fonts. For example, 𒈙 (U+12219 CUNEIFORM SIGN LUGAL OPPOSING LUGAL) is a completely "normal" character in the way you're describing. It is left-to-right. It doesn't depend on or influence characters around it (it's non-combining and non-joining). Its "length in characters" is 1. Its glyph is also extremely wide in most fonts and breaks a lot of layout. I don't know anything in the Unicode database that would warn you of this, since "glyph width" is entirely a function of fonts, not characters, and Unicode explicitly does not consider fonts. (That said, most of the most problematic characters are outside the BMP. Probably the most common exception is DŽ, but many fixed-width fonts have a narrow glyph for it: DŽ.)
Let's write some cuneiform in a fixed-width font.
Normally, every character should line up with a character above.
Here: 𒈙. See how these characters don't align correctly?
Not only is it a very wide glyph, but its width is not even a multiple.
At least not in my font (Mac Safari 15.0).
But DŽ is ok.
Also remember that there are multiple ways to encode the same "character." For example, é can be a "simple" character (U+00E9), or it can be two characters (U+0065, U+0301). So in some cases é may print in your scheme, and in others it won't. I suspect this is fine for your problem, but if it isn't, you're going to need to apply a normalization form (likely NFC).

What determines whether a combining character will combine?

If I have a string "\u05D2\u0308" I don't actually get a diaeresis on top of a gimel. It sits to the side, as a discrete glyph.
I don't actually want the combined glyph, but I'm confused in general. How does a combining diacritic like U+0308 decide whether to combine with the preceeding character or hang out on its own?
And how much of this behavior is specified in the unicode standard and how much is up to the individual text renderer or font?
Actually, it does combine.
You are perhaps using some environment where the text engine fails to render this correctly, but in fact your string is one character long (using the conventional sense of "character"), and a correct Unicode-compliant environment will tell you so.

Multiple regex in one command

Disclaimer: I have no engineering background whatsoever - please don't hold it against me ;)
What I'm trying to do:
Scan a bunch of text strings and find the ones that
are more than one word
contain title case (at least one capitalized word after the first one)
but exclude specific proper nouns that don't get checked for title case
and disregard any parameters in curly brackets
Example: Today, a Man walked his dogs named {FIDO} and {Fifi} down the Street.
Expectation: Flag the string for title capitalization because of Man and Street, not because of Today, {FIDO} or {Fifi}
Example: Don't post that video on TikTok.
Expectation: No flag because TikTok is a proper noun
I have bits and pieces, none of them error-free from what https://www.regextester.com/ keeps telling me so I'm really hoping for help from this community.
What I've tried (in piece meal but not all together):
(?=([A-Z][a-z]+\s+[A-Z][a-z]+))
^(?!(WordA|WordB)$)
^((?!{*}))
I think your problem is not really solvable solely with regex...
My recommendation would be splitting the input via [\s\W]+ (e.g. with python's re.split, if you really need strings with more than one word, you can check the length of the result), filtering each resulting word if the first character is uppercase (e.g with python's string.isupper) and finally filtering against a dictionary.
[\s\W]+ matches all whitespace and non-word characters, yielding words...
The reasoning behind this different approach: compiling all "proper nouns" in a regex is kinda impossible, using "isupper" also works with non-latin letters (e.g. when your strings are unicode, [A-Z] won't be sufficient to detect uppercase). Filtering utilizing a dictionary is a way more forward approach and much easier to maintain (I would recommend using set or other data type suited for fast lookups.
Maybe if you can define your use case more clearer we can work out a pure regex solution...

Why use Unicode if your program is English only?

So I've read Joel's article, and looked through SO, and it seems the only reason to switch from ASCII to Unicode is for internationalization. The company I work for, as a policy, will only release software in English, even though we have customers throughout the world. Since all of our customers are scientists, they have functional enough English to use our software as a non-native speaker. Or so the logic goes. Because of this policy, there is no pressing need to switch to Unicode to support other languages.
However, I'm starting a new project and wanted to use Unicode (because that is what a responsible programmer is supposed to do, right?). In order to do so, we would have to start converting all of the libraries we've written into Unicode. This is no small task.
If internationalization of the programs themselves is not considered a valid reason, how would one justify all the time spent recoding libraries and programs to make the switch to Unicode?
This obviously depends on what your app actually does, but just because you only have an english version in no way means that internationalization is not an issue.
What if I want to store a customer name which uses non-english characters? Or the name of a place in another country?
As an added bonus (since you say you're targeting scientists) is that all sorts of scientific symbols and notiations are supported as part of Unicode.
Ultimately, I find it much easier to be consistent. Unicode behaves the same no matter whose computer you run the app on. Non-unicode means that you use some locale-dependant character set or codepage by default, and so text that looks fine on your computer may be full of garbage characters on someone else's.
Apart from that, you probably don't need to translate all your libraries to Unicode in one go. Write wrappers as needed to convert between Unicode and whichever encoding you use otherwise.
If you use UTF-8 for your Unicode text, you even get the ability to read plain ASCII strings, which should save you some conversion headaches.
They say they will always put it in English now, but you admit you have worldwide clients. A client comes in and says internationalization is a deal breaker, will they really turn them down?
To clarify the point I'm trying to make you say that they will not accept this reasoning, but it is sound.
Always better to be safe than sorry, IMO.
The extended Scientific, Technical and Mathematical character set rules.
Where else can you say ⟦∀c∣c∈Unicode⟧ and similar technical stuff.
Characters beyond the 7-bit ASCII range are useful in English as well. Does anyone using your software even need to write the € sign? Or £? How about distinguishing "résumé" from "resume"?You say it's used by scientists around the world, who may have names like "Jörg" or "Guðmundsdóttir". In a scientific setting, it is useful to talk about wavelengths like λ, units like Å, or angles as Θ, even in English.
Some of these characters, like "ö", "£", and "€" may be available in 8-bit encodings like ISO-8859-1 or Windows-1252, so it may seem like you could just use those encodings and be done with it. The problem is that there are characters outside of those ranges that many people use very frequently, and so lots of existing data is encoded in UTF-8. If your software doesn't understand that when importing data, it may interpret the "£" character in UTF-8 as a sequence of 2 Windows-1252 characters, and render it as "£". If this sort of error goes undetected for long enough, you can start to get your data seriously garbled, as multiple passes of misinterpretation alter your data more and more until it becomes unrecoverable.
And it's good to think about these issues early on in the design of your program. Since strings tend to be very low-level concept that are threaded throughout your entire program, with lots of assumptions about how they work implicit in how they are used, it can be very difficult and expensive to add Unicode support to a program later on if you have never even thought about the issue to begin with.
My recommendation is to always use Unicode capable string types and libraries wherever possible, and make sure any tests you have (whether they be unit, integration, regression, or any other sort of tests) that deal with strings try passing some Unicode strings through your system to ensure that they work and come through unscathed.
If you don't handle Unicode, then I would recommend ensuring that all data accepted by the system is 7-bit clean (that is, there are no characters beyond the 7-bit US-ASCII range). This will help avoid problems with incompatibilities between 8-bit legacy encodings like the ISO-8859 family and UTF-8.
Suppose your program allows me to put my name in it, on a form, a dialog, whatever, and my name can't be written with ascii characters... Even though your program is in English, the data may be in other language...
It doesn't matter that your software is not translated, if your users use international characters then you need to support unicode to be able to do correct capitalization, sorting, etc.
If you have no business need to switch to unicode, then don't do it. I'm basing this on the fact that you thought you'd need to change code unrelated to component you already need to change to make it all work with Unicode. If you can make the component/feature you're working on "Unicode ready" without spreading code churn to lots of other components (especially other components without good test coverage) then go ahead and make it unicode ready. But don't go churn your whole codebase without business need.
If the business need arises later, address it then. Otherwise, you aren't going to need it.
People in this thread may suppose scenarios where it becomes a business requirement. Run those scenarios by your product managers before considering them scenarios worth addressing. Make sure they know the cost of addressing them when you ask.
Well for one, your users might know and understand english, but they can still have 'local' names. If you allow your users to do any kind of input to your application, they might want to use characters that are not part of ascii. If you don't support unicode, you will have no way of allowing these names. You'd be forcing your users to adopt a more simple name just because the application isn't smart enough to handle special characters.
Another thing is, even if the standard right now is that the app will only be released in English, you are also blocking the possibility of internationalization with ASCII, adding to the work that needs to be done when the company policy decides that translations are a good thing. Company policy is good, but has also been known to change.
I'd say this attitude expressed naïveté, but I wouldn't be able to spell naïveté in ASCII-only.
ASCII still works for some computer-only codes, but is no good for the façade between machine and user.
Even without the New Yorker's old-fashioned style of coöperation, how would some poor woman called Zoë cope if her employers used such a system?
Alas, she wouldn't even seek other employment, as updating her résumé would be impossible, and she'd have to resume instead. How's she going to explain that to her fiancée?
The company I work for, **as a policy**, will only release software in English, even though we have customers throughout the world.
1 reason only: Policies change, and when they change, they will break your existing code. Period.
Design for evil, and you have a chance of not breaking your code so soon. In this case, use Unicode. Happened to me on a brazilian specific stock-market legacy system.
Many languages (Java [and thus most JVM-based language implementations], C# [and thus most .NET-based language implementatons], Objective C, Python 3, ...) support Unicode strings by preference or even (nearly) exclusively (you have to go out of your way to work with "strings" of bytes rather than of Unicode characters).
If the company you work for ever intends to use any of these languages and platforms, it would therefore be quite advisable to start planning a Unicode-support strategy; a pilot project in particular might not be a bad idea.
That's a really good question. The only reason I can think of that has nothing to do with I18n or non-English text is that Unicode is particularly suited to being what might be called a hub character set. If you think of your system as a hub with its external dependencies as spokes, you want to isolate character encoding conversions to the spokes, so that your hub system works consistently with your chosen encoding. What makes Unicode a ideal character set for the hub of your system is that it acknowledges the existence of other character sets, it defines equivalences between its own characters and characters in those external character sets, and there's an ongoing process where it extends itself to keep up with the innovation and evolution of external character sets. There are all sorts of weird encodings out there: even when the documentation assures you that the external system or library is using plain ASCII it often turns out to be some variant like IBM775 or HPRoman8, and the nice thing about Unicode is that no matter what encoding is thrown at you, there's a good chance that there's a table on unicode.org that defines exactly how to convert that data into Unicode and back out again without losing information. Then again, equivalents of a-z are fairly well-defined in every character set, so if your data really is restricted to the standard English alphabet, ASCII may do just as well as a hub character set.
A decision on encoding is a decision on two things - what set of characters are permitted and how those characters are represented. Unicode permits you to use pretty much any character ever invented, but you may have your own reasons not to want or need such a wide choice. You might still restrict usernames, for example, to combinations of a-z and underscore, maybe because you have to put them into an external LDAP system whose own character set is restricted, maybe because you need to print them out using a font that doesn't cover all of Unicode, maybe because it closes off the security problems opened up by lookalike characters. If you're using something like ASCII or ISO8859-1, the storage/transmission layer implements a lot of those restrictions; with Unicode the storage layer doesn't restrict anything so you might have to implement your own rules at the application layer. This is more work - more programming, more testing, more possible system states. The tradeoff for that extra work is more flexibility, application-level rules being easier to change than system encodings.
The reason to use unicode is to respect proper abstractions in your design.
Just get used to treating the concept of text properly. It is not hard. There's no reason to create a broken design even if your users are English.
Just think of a customer wanting to use names like Schrödingers Cat for files he saved using your software. Or imagine some localized Windows with a translation of My Documents that uses non-ASCII characters. That would be internationalization that has, though you don't support internationalization at all, have effects on your software.
Also, having the option of supporting internationalization later is always a good thing.
Internationalization is so much more than just text in different languages. I bet it's the niche of the future in the IT-world. Heck, it already is. A lot has already been said, just thought I would add a small thing. Even though your customers right now are satisfied with english, that might change in the future. And the longer you wait, the harder it will be to convert your code base. They might even today have problems with e.g. file names or other types of data you save/load in your application.
Unicode is like cooties. Once it "infects" one area, it's usually hard to contain it given interconnectedness of dependencies. Sooner or later, you'll probably have to tie in a library that is unicode compliant and thus will use wchar_t's or the like. Instead of marshaling between character types, it's nice to have consistent strings throughout.
Thus, it's nice to be consistent. Otherwise you'll end up with something similar to the Windows API that has a "A" version and a "W" version for most APIs since they weren't consistent to start with. (And in some cases, Microsoft has abandoned creating "A" versions altogether.)
You haven't said what language you're using. In some languages, changing from ASCII to Unicode may be pretty easy, whereas in others (which don't support Unicode) it might be pretty darn hard.
That said, maybe in your situation you shouldn't support Unicode: you can't think of a compelling reason why you should, and there are some reasons (i.e. your cost to change your existing libraries) which argue against. I mean, perhaps 'ideally' you should but in practice there might be some other, more important or more urgent, thing to spend your time and effort on at the moment.
If program takes text input from the user, it should use unicode; you never know what language the user is going to use.
When using Unicode, it leaves the door open for internationalization if requirements ever change and you are required to use text in other languages than English.
Also, in your new project you could always just write wrappers for the libraries that internally convert between ASCII and Unicode and vice-versa.
Your potential client may already be running a non-unicode application in a language other than English and won't be able to run your program without swichting the windows unicode locale back and forth, which will be a big pain.
Because the internet is overwhelmingly using Unicode. Web pages use unicode. Text files including your customer's documents, and the data on their clipboards, is Unicode.
Secondly Windows, is natively Unicode, and the ANSI APIs are a legacy.
Modern applications should use Unicode where applicable, which is almost everywhere.

Theory: "Lexical Encoding"

I am using the term "Lexical Encoding" for my lack of a better one.
A Word is arguably the fundamental unit of communication as opposed to a Letter. Unicode tries to assign a numeric value to each Letter of all known Alphabets. What is a Letter to one language, is a Glyph to another. Unicode 5.1 assigns more than 100,000 values to these Glyphs currently. Out of the approximately 180,000 Words being used in Modern English, it is said that with a vocabulary of about 2,000 Words, you should be able to converse in general terms. A "Lexical Encoding" would encode each Word not each Letter, and encapsulate them within a Sentence.
// An simplified example of a "Lexical Encoding"
String sentence = "How are you today?";
int[] sentence = { 93, 22, 14, 330, QUERY };
In this example each Token in the String was encoded as an Integer. The Encoding Scheme here simply assigned an int value based on generalised statistical ranking of word usage, and assigned a constant to the question mark.
Ultimately, a Word has both a Spelling & Meaning though. Any "Lexical Encoding" would preserve the meaning and intent of the Sentence as a whole, and not be language specific. An English sentence would be encoded into "...language-neutral atomic elements of meaning ..." which could then be reconstituted into any language with a structured Syntactic Form and Grammatical Structure.
What are other examples of "Lexical Encoding" techniques?
If you were interested in where the word-usage statistics come from :
http://www.wordcount.org
This question impinges on linguistics more than programming, but for languages which are highly synthetic (having words which are comprised of multiple combined morphemes), it can be a highly complex problem to try to "number" all possible words, as opposed to languages like English which are at least somewhat isolating, or languages like Chinese which are highly analytic.
That is, words may not be easily broken down and counted based on their constituent glyphs in some languages.
This Wikipedia article on Isolating languages may be helpful in explaining the problem.
Their are several major problems with this idea. In most languages, the meaning of a word, and the word associated with a meaning change very swiftly.
No sooner would you have a number assigned to a word, before the meaning of the word would change. For instance, the word "gay" used to only mean "happy" or "merry", but it is now used mostly to mean homosexual. Another example is the morpheme "thank you" which originally came from German "danke" which is just one word. Yet another example is "Good bye" which is a shortening of "God bless you".
Another problem is that even if one takes a snapshot of a word at any point of time, the meaning and usage of the word would be under contention, even within the same province. When dictionaries are being written, it is not uncommon for the academics responsible to argue over a single word.
In short, you wouldn't be able to do it with an existing language. You would have to consider inventing a language of your own, for the purpose, or using a fairly static language that has already been invented, such as Interlingua or Esperanto. However, even these would not be perfect for the purpose of defining static morphemes in an ever-standard lexicon.
Even in Chinese, where there is rough mapping of character to meaning, it still would not work. Many characters change their meanings depending on both context, and which characters either precede or postfix them.
The problem is at its worst when you try and translate between languages. There may be one word in English, that can be used in various cases, but cannot be directly used in another language. An example of this is "free". In Spanish, either "libre" meaning "free" as in speech, or "gratis" meaning "free" as in beer can be used (and using the wrong word in place of "free" would look very funny).
There are other words which are even more difficult to place a meaning on, such as the word beautiful in Korean; when calling a girl beautiful, there would be several candidates for substitution; but when calling food beautiful, unless you mean the food is good looking, there are several other candidates which are completely different.
What it comes down to, is although we only use about 200k words in English, our vocabularies are actually larger in some aspects because we assign many different meanings to the same word. The same problems apply to Esperanto and Interlingua, and every other language meaningful for conversation. Human speech is not a well-defined, well oiled-machine. So, although you could create such a lexicon where each "word" had it's own unique meaning, it would be very difficult, and nigh on impossible for machines using current techniques to translate from any human language into your special standardised lexicon.
This is why machine translation still sucks, and will for a long time to come. If you can do better (and I hope you can) then you should probably consider doing it with some sort of scholarship and/or university/government funding, working towards a PHD; or simply make a heap of money, whatever keeps your ship steaming.
It's easy enough to invent one for yourself. Turn each word into a canonical bytestream (say, lower-case decomposed UCS32), then hash it down to an integer. 32 bits would probably be enough, but if not then 64 bits certainly would.
Before you ding for giving you a snarky answer, consider that the purpose of Unicode is simply to assign each glyph a unique identifier. Not to rank or sort or group them, but just to map each one onto a unique identifier that everyone agrees on.
How would the system handle pluralization of nouns or conjugation of verbs? Would these each have their own "Unicode" value?
As a translations scheme, this is probably not going to work without a lot more work. You'd like to think that you can assign a number to each word, then mechanically translate that to another language. In reality, languages have the problem of multiple words that are spelled the same "the wind blew her hair back" versus "wind your watch".
For transmitting text, where you'd presumably have an alphabet per language, it would work fine, although I wonder what you'd gain there as opposed to using a variable-length dictionary, like ZIP uses.
This is an interesting question, but I suspect you are asking it for the wrong reasons. Are you thinking of this 'lexical' Unicode' as something that would allow you to break down sentences into language-neutral atomic elements of meaning and then be able to reconstitute them in some other concrete language? As a means to achieve a universal translator, perhaps?
Even if you can encode and store, say, an English sentence using a 'lexical unicode', you can not expect to read it and magically render it in, say, Chinese keeping the meaning intact.
Your analogy to Unicode, however, is very useful.
Bear in mind that Unicode, whilst a 'universal' code, does not embody the pronunciation, meaning or usage of the character in question. Each code point refers to a specific glyph in a specific language (or rather the script used by a group of languages). It is elemental at the visual representation level of a glyph (within the bounds of style, formatting and fonts). The Unicode code point for the Latin letter 'A' is just that. It is the Latin letter 'A'. It cannot automagically be rendered as, say, the Arabic letter Alif (ﺍ) or the Indic (Devnagari) letter 'A' (अ).
Keeping to the Unicode analogy, your Lexical Unicode would have code points for each word (word form) in each language. Unicode has ranges of code points for a specific script. Your lexical Unicode would have to a range of codes for each language. Different words in different languages, even if they have the same meaning (synonyms), would have to have different code points. The same word having different meanings, or different pronunciations (homonyms), would have to have different code points.
In Unicode, for some languages (but not all) where the same character has a different shape depending on it's position in the word - e.g. in Hebrew and Arabic, the shape of a glyph changes at the end of the word - then it has a different code point. Likewise in your Lexical Unicode, if a word has a different form depending on its position in the sentence, it may warrant its own code point.
Perhaps the easiest way to come up with code points for the English Language would be to base your system on, say, a particular edition of the Oxford English Dictionary and assign a unique code to each word sequentially. You will have to use a different code for each different meaning of the same word, and you will have to use a different code for different forms - e.g. if the same word can be used as a noun and as a verb, then you will need two codes
Then you will have to do the same for each other language you want to include - using the most authoritative dictionary for that language.
Chances are that this excercise is all more effort than it is worth. If you decide to include all the world's living languages, plus some historic dead ones and some fictional ones - as Unicode does - you will end up with a code space that is so large that your code would have to be extremely wide to accommodate it. You will not gain anything in terms of compression - it is likely that a sentence represented as a String in the original language would take up less space than the same sentence represented as code.
P.S. for those who are saying this is an impossible task because word meanings change, I do not see that as a problem. To use the Unicode analogy, the usage of letters has changed (admittedly not as rapidly as the meaning of words), but it is not of any concern to Unicode that 'th' used to be pronounced like 'y' in the Middle ages. Unicode has a code point for 't', 'h' and 'y' and they each serve their purpose.
P.P.S. Actually, it is of some concern to Unicode that 'oe' is also 'œ' or that 'ss' can be written 'ß' in German
This is an interesting little exercise, but I would urge you to consider it nothing more than an introduction to the concept of the difference in natural language between types and tokens.
A type is a single instance of a word which represents all instances. A token is a single count for each instance of the word. Let me explain this with the following example:
"John went to the bread store. He bought the bread."
Here are some frequency counts for this example, with the counts meaning the number of tokens:
John: 1
went: 1
to: 1
the: 2
store: 1
he: 1
bought: 1
bread: 2
Note that "the" is counted twice--there are two tokens of "the". However, note that while there are ten words, there are only eight of these word-to-frequency pairs. Words being broken down to types and paired with their token count.
Types and tokens are useful in statistical NLP. "Lexical encoding" on the other hand, I would watch out for. This is a segue into much more old-fashioned approaches to NLP, with preprogramming and rationalism abound. I don't even know about any statistical MT that actually assigns a specific "address" to a word. There are too many relationships between words, for one thing, to build any kind of well thought out numerical ontology, and if we're just throwing numbers at words to categorize them, we should be thinking about things like memory management and allocation for speed.
I would suggest checking out NLTK, the Natural Language Toolkit, written in Python, for a more extensive introduction to NLP and its practical uses.
Actually you only need about 600 words for a half decent vocabulary.