I want to create a Eng<->Rus dictionary application. As far as I know, I need to use SQLite. But how can I implement the translation function? How can I find corresponding translation of the word requested by the user? Any help, guidance, tutorials will be appreciated.
http://en.wikipedia.org/wiki/FreeDict has a good collection of bilingual dictionaries, including eng-rus. It's under GPL.
If you need only eng->rus translation, this is basically a hashtable, where the english word is the key, and the list of corresponding russian words are the value.
If you need bidirectional translation, I would store each word as follows:
[word], [language], [set of corresponding translations]
The translation is easy then: you lookup the word W in language L, and return the set of possible translations.
You may want to have a look here too How do I initialize a store with default data in a CoreData application?
Also, it's probably a good thing to add some non-aggressive stemming and/or normalization, so that if I search for a plural (ie. eggs) it's resolved to the singular, or if I search a specific verbal form (ie. walking) it's resolved to the verb's infinitive.
Well, there are a lot of ways you could achieve this, but at the simplest level, consider using paired-arrays....two separate arrays. One contains English words, the second array containing the translated equivalents. You can then request a given term at a specified index, and access both terms for it.
In CoreData, you could specify a "Word" entity, with attributes of "English" and "Russian".
Related
I'm currently struggling at a complex URL handling concept question. The application have a product property database table/collection with all the different product types (i.e. categories, colors, manufacturers, materials, etc.).
{_id:1,alias:"mercedes-benz",type:"brand"},
{_id:2,alias:"suv-cars",type:"category"},
{_id:3,alias:"cars",type:"category"},
{_id:4,alias:"toyota",type:"manufacturer"},
{_id:5,alias:"red",type:"color"},
{_id:6,alias:"yellow",type:"color"},
{_id:7,alias:"bmw",type:"manufacturer"},
{_id:8,alias:"leather",type:"material"}
...
Now the mission is to handle URL requests in the style below in every(!) possible order to retrieve the included product properties. The only allowed character is the dash (settled SEO requirement, some properties also can include dashes by themselve - i think also an important point - i.e. the category "suv-cars" or the manufacturer "mercedes-benz"):
http:\\www.example.com\{category}-{color}-{manufacturer}-{material}
http:\\www.example.com\{color}-{manufacturer}
http:\\www.example.com\{color}-{category}-{material}-{manufacturer}
http:\\www.example.com\{category}-{color}-nonexistingproperty-{manufacturer}
http:\\www.example.com\{color}-{category}-{manufacturer}
http:\\www.example.com\{manufacturer}
http:\\www.example.com\{manufacturer}-{category}-{color}-{material}
http:\\www.example.com\{category}
http:\\www.example.com\{manufacturer}-nonexistingproperty-{category}-{color}-{material}
http:\\www.example.com\{color}-crap-{manufacturer}
...
...so: every order of the properties should be allowed! The result have to be the information about the used properties per URL-Request (BTW yes, the duplicate content will be fixed by redirects and a predefined schema). The "nonexistingproperties"/"crap" are possible and just should be ignored.
UPDATE:
Idea 1: One way i'm thinking about the question is to split the query string by dashes and analyze them value by value, the problem: At the two or three or more word combinations at some properties there are too many different combinations and variations so a loooot of queries which kills this idea i think..
Idea 2: The other way is to build a (in my opinion) too large Alias/URL-Table with all of the different combinations, but i think that's just an ugly workaround. There are about 15.000 of different properties so the count of the aliases in the different sort orders is killing this idea.
Idea 3: It's your turn! Thanks for your mind and your time.
While your question is a bit broad, below are some ideas. There isn't a single awesome answer unless you find a free or commercial engine for this that works exactly the way you want.
The way I thought about your problem was to consider the URL as a list of keywords.
use Lucene as a keyword/tag system. It's good at the types of searches you suggest you want, including phrases, stems, etc.
store and index the data in DB of choice, but pull the keywords into memory and build a bit index of all keywords vs items. Iterate through the keyword table producing weighted results. If order of keywords matters, you'll also need make a pass through the result set to weight based on word order. These types of searches always need to cap their result set quickly in order to return results quickly.
cache the results like crazy from working matches, and give precedence to results that users seem to click on the most for a given URL.
attack the database by using tag indexes in MongoDB. You'd still need to merge and weight results. Very intensive and not likely a good use of DB resources.
read some of the academic papers on keyword searches. It's a popular topic.
build a table of words that have dashes in them, and normalize/convert those before running your queries
always check for full exact matches first
The only way this may work, if you restrict all property values to be unique. So, you make a set of categories+colors+manufacturers, etc. All values have to be unique. This will allow you to find to what property the value belongs.
The data structure for this should be fairly simple:
{_id:ValueOfTheProperty, Property:TypeOfProperty}
Here are some possible samples:
{ _id: Red, Property: Color }
{ _id: Green, Property: Color }
{ _id: Boots, Property: Category }
{ _id: Shoes, Property: Category }
...
This way, the order does not matter, and you are able to convert them in a single pass to a map:
{ Color: Red, Category: Boots }
Though, I predict some problems with ambigous names here.
I'm working on an Xcode project.
I need to add a dictionary of words, like English dictionary for example. Then, when required, I need to find some words in it.
I have been thinking about creating an array from a file to store all the words, then sort it in some way I don't know and at last do a binary search or something like this to find the word I'm looking for.
Do you think is a good idea?
Do you have any clue to sort the array?
Is binary search a good idea?
The best way would be to use a database (e.g. SQL-based), or CoreData to store it. That is unless you require the words to be entirely in the memory, which would be the case if you often listed all of them. But even then it can be solved by lazy-loading.
I'm working on an iOS app. I have a Core Data database with a lot of company names.
When the user insert a company name that does not exist, I would like to show "similar" company names. For example, if the user entered "Aple", I would like to show "Did you mean Apple?".
I know that the technique of finding strings that match a pattern approximately (rather than exactly) is called approximate string matching or, colloquially, fuzzy string searching.
In theory, there are many algorithms, more or less valid: the Levenshtein distance computing algorithm and so on.
But in practice, is there someone who has already implemented something similar that can be used easily with core data?
I found a solution. Use this NSString's category available on GitHub: NSString-DamerauLevenshtein.
Try looking at Soundex, I believe that is part of the core featureset for SQLite, if that is your underlying data store.
I am wondering what the best approach would be to check whether or not a common first name is contained within an NSString on an iPhone app. I've got a sorted flat text file of ~5500 common American first names delimited by new lines. The NSString I am searching within for a name is not very long, most likely the size of a normal sentence.
My original plan was to load the sorted list into memory and then iterate over every word in the NSString performing a binary search of the list to determine whether or not that word was a common name.
Am I better off trying to put this name list into CoreData or a SQLite table and performing a query with that? My understanding is I would not have to load the entire list into memory if I went that route.
I am guessing this situation is a common problem with word dictionaries for word games, so I'm just wondering what the best practice is for fast lookups. Thanks!
SQLite sounds ideal for this in terms of both speed of lookup and minimising memory usage. It would also make it potentially possible to update the first name list over the internet if so desired.
Using Core Data (which is in effect an elabourate wrapper around SQLite) would be overkill in this instance, especially as you don't require the ORM like capabilities.
An NSSet might be useful as well. Dave DeLong's answer for another question demonstrates that NSSets have constant look-up times, i.e. O(1).
Load your names into an NSMutableSet one by one. This will be the slowest part but will only need to be done once. If your file is a simple line-delimited file of names, it may be easier to use the standard C library for reading the file, since line-by-line input is not well-supported by Cocoa.
After that, simply use [nameSet containsObject:name] to check whether it is in the list.
A couple of drawbacks to this approach:
The name you want to test must be in the same case as the name in the set, that is “paul” and “Paul” are different strings. You can circumvent this by converting all names to lowercase before inserting them into the set, and then also converting the name you want to check into lowercase before checking it against the set.
It might be easier just to go with the already-accepted answer.
Recently there was a big debate during a code reveiw session on the use of constants.
The developers had used constants for the following purposes:
Each and every message key used in the i18N application was declared as a constant. The application contained around 3000 message keys and hence the same number of constants.
Each and every database column name was declared as a constant. There were around 5000 column names and still counting..
Does it make sense to have such a huge number of constants in any application?
IMHO, common sense should prevail. Message keys just don't need to be declared as constants. We already have one level of indirection - why add one more?
Reg. database column names, I have mixed opinions. If a column is being used in multiple classes, does it make sense to declare it as a global constant?
Please pour in with your thoughts...
If I18N message keys aren't defined as constants, how do you enforce consistency? How do you automatically differentiate between a typo and a missing value? How do you audit to make sure that all I18N keys are fulfilled in each new language file?
As to database columns, you could definitely use some indirection - if your application knows about column names, you've got a binding problem. So there, you might consider a config file with the actual column names - but of course, you would want to refer to the column names by symbolic keys, which should be defined as auditable constants, just like the I18N keys.
I think is a good practice to put message keys used for i18N as constants.
I don't see much benefits in doing the same for the DB columns, if you have a well designed persistence layer.
This depends on the programming language, I think.
In PHP it's not uncommon to ude defines aka contants for such things, while I'd not use this in Java or C#.
In most projects we tried to extract the SQL to templates, so not only the table and column names were configurable but the whole sql statement. We used velocity for basic templating mechanics like variables, small loops,...
Regarding the language constants:
Another layer doesn't make much sense to me, but you hav eto choose your identifiers for the language translation carefully. Using the whole english sentence as key may end up in a lot of work for the translators if you fix the wording for example in the english sentence without changing the meaning. So all translators would have to update their files.
If the constant is used in multiple places and the compiler really catches the problem, yes.