Is it possible to create a dictionary in PostgreSQL dynamically? - postgresql

I'm new to full text search in PostgreSQL and discovered things like Dictionaries and stop words in it.
I have a table with a lot of words from many texts. I want to create my own dictionary and put the first 30 most frequent words as stop words.
Is it possible to do this at runtime?

Anything is possible. Not everything is feasible.
What you can do without too much difficulty is create a stored procedure in a language like pl/perlU which breaks up the words, analyzes them, and writes stop words to a file. You'd have to do a pg_ctl reload in order to ensure that the new stop words file was used. However I don't think you can dynamically determine stop words at search time because if you search through the strings to look for stop words, there isn't much point in then having full text searching.
The actual stop words file is just a new-line separated list of words. Also I think you'd need to start with a template for stemming purposes. Trying to dynamically discover stemming would be both difficult and error-prone.

Related

Word database for iOS custom keyboard

I am writing a custom keyboard for an Asian language and I have a word database with over half a million words. I use Realm for now and use it to give word suggestions. When users type the first few letters keyboard will search the DB and provide words based on priority values given to each word. But this seems inefficient compared to other keyboards in the App Store, I can't find any concrete way or idea on this issue. Anyone can point me in the direction to increase the efficiency of word searching with a custom iOS keyboard.
I haven't tried CoreData but generally, the realm is considered faster than CoreData.
First, type of storage: maybe consider using plist files or .text files wouldn't be bad.
and saving words in a sorted way in ASCII mode would be great.
Second: you need an algorithm to break into a group of words so fast. you can do this by saving the ASCII code.
Here is an example of a binary search algorithm :
Please search around about different algorithms.

Can Sphinx query be tested command line w/o an index?

If I just want to test that my query will work vs a term inside a sentence "and am selling an Iphone 3gs", is it possible to use command line to test this? This way I don't need to keep adding to and rotating an index but can simply tweak my query and the data I'd plan on storing. Mainly I am trying to tweak various query parameters like SENTENCE and PROXIMITY vs wordforms/stopwords/ignore_char and would like to be able to work fast and test different query structures vs test words/patterns.
In theory can use BuildExcerpts, to run an arbitrary query, against a block of text (make sure use query_mode=true).
http://sphinxsearch.com/docs/current.html#api-func-buildexcerpts
But even then not totally sure it will completely honour the query, not sure SENTENCE etc will truely work.
... but if you wanting to play with ignore_char etc, you are going to be modifying the index config file. So surely just quickly running indexer to rebuild the index, to see the results is not htat difficult.

Huge dictionary with random selections: iPhone Dev

i am making an app that will have a very large dictionary of words i choose (so that the words aren't too complicated) and i want it to randomly choose the words. I dont have a problem with the randomly selecting words, but what would be the best way to store all these words, and how? I feel like using an NSMutable array would take up too much memory creating thousands of objects, so what else can i use... Thanks for you help
Core data!, is your best option, or to manage your own SQLite
check a core data tutorial
or a SQLite on iOS tutorial
If all all the application needs to do is access words at random (so no key based queries, or updates), an alternative to core data and SQLite would be to just fseek() to a random location in a flat text file of newline delimited words and then read out the next complete word, possibly with fscanf(dict,"%s\n%s\n",partial_word,full_word).
Deal with EOF by retrying with a different random number, or limit the fseek() range to never hit last word in file.
An issue with the above outline is words won't be uniformly selected. There is a bias towards words following long words. Discarding strlen(partial_word) (or a larger random number) of words before keeping a word might help the distribution if it is a concern.

will this implementation affects the user experience

I am assigned with the task to implement a functionality to shorten text typed text
For example , I type text like "you" when I highlight it and it has to change like "u"
I will have table which has list of words which has longer version of text and with text to be replaced.so whenever a user types word and highlights it i want to query the db for match , if a match is found I want to replace the word with the shortened word.
This is not my idea and am being assigned to this implementation.
I think this functionality will down the speed of the app responsiveness. And it has some disadvantages over the user friendliness of the application.
So I'd like to hear your opinions on what are the disadvantages it has and how can I implement this in a better manner. Or is that ok to have this kind of functionlity? Won't it affect the app speed?
It's hard to imagine that you'll see a noticeable decrease in performance. Even the iPhone 3G's processor runs at around 400MHz, and someone typing really fast on an iPhone might get four or five characters entered in a second. A simple implementation of the sort of thing you're talking about would involve a lookup in a data structure such as a dictionary, tree, or database, and you ought to be able to do that pretty quickly.
Why not try it? Implement the simplest thing you can think of and measure its performance. For the purpose of measuring, you might want to use a loop to repeatedly look up words from an array. Count the number of lookups you can do in, say, 20 seconds, and divide by 20 to get the average number per second.
i dont think it will take a lot of performance, anyway you can use the profiler to check how long every method is taking, as for the functionality, i believe you should give the user the opportunity to "undo" and keep his own word (same as apple's auto correction)

What's the fastest way to save data and read it next time in a IPhone App?

In my dictionary IPhone app I need to save an array of strings which actually contains about 125.000 distinct words; this transforms in aprox. 3.2Mb of data.
The first time I run the app I get this data from an SQLite db. As it takes ages for this query to run, I need to save the data somehow, to read it faster each time the app launches.
Until now I've tried serializing the array and write it to a file, and afterword I've tested if writing directly to NSUserDefaults to see if there's any speed gain but there's none. In both ways it takes about 7 seconds on the device to load the data. It seems that not reading from the file (or NSUserDefaults) actually takes all that time, but the deserialization does:
objectsForCharacters = [[NSKeyedUnarchiver unarchiveObjectWithData:data] retain];
Do you have any ideeas about how I could write this data structure somehow that I could read/put in memory it faster?
The UITableView is not really designed to handle 10s of thousands of records. If would take a long time for a user to find what they want.
It would be better to load a portion of the table, perhaps a few hundred rows, as the user enters data so that it appears they have all the records available to them (Perhaps providing a label which shows the number of records that they have got left in there filtered view.)
The SQLite db should be perfect for this job. Add an index to the words table and then select a limited number of rows from it to show the user some progress. Adding an index makes a big difference to the performance of the even this simple table.
For example, I created two tables in a sqlite db and populated them with around 80,000 words
#Create and populate the indexed table
create table words(word);
.import dictionary.txt words
create unique index on words_index on word DESC;
#Create and populate the unindexed table
create table unindexed_words(word);
.import dictionary.txt unindexed_words
Then I ran the following query and got the CPU Time taken for each query
.timer ON
select * from words where word like 'sn%' limit 5000;
...
>CPU Time: user 0.031250 sys 0.015625;
select * from unindex_words where word like 'sn%' limit 5000;
...
>CPU Time: user 0.062500 sys 0.0312
The results vary but the indexed version was consistently faster that the unindexed one.
With fast access to parts of the dictionary through an indexed table, you can bind the UITableView to the database using NSFecthedResultsController. This class takes care of fecthing records as required, caches results to improve performance and allows predicates to be easily specified.
An example of how to use the NSFetchedResultsController is included in the iPhone Developers Cookbook. See main.m
Just keep the strings in a file on the disk, and do the binary search directly in the file.
So: you say the file is 3.2mb. Suppose the format of the file is like this:
key DELIMITER value PAIRDELIMITER
where key is a string, and value is the value you want to associate. The DELIMITER and PAIRDELIMITER must be chosen as such that they don't occur in the value and key.
Furthermore, the file must be sorted on the key
With this file you can just do the binary search in the file itself.
Suppose one types a letter, you go to the half of the file, and search(forwards or backwards) to the first PAIRDELIMITER. Then check the key and see if you have to search upwards or downwards. And repeat untill you find the key you need,
I'm betting this will be fast enough.
Store your dictionary in Core Data and use NSFetchedResultsController to manage the display of these dictionary entries in your table view. Loading all 125,000 words into memory at once is a terrible idea, both performance- and memory-wise. Using the -setFetchBatchSize: method on your fetch request for loading the words for your table, you can limit NSFetchedResultsController to only handling the small subset of words that are visible at any given moment, plus a little buffer. As the user scrolls up and down the list of words, new batches of words are fetched in transparently.
A case like yours is exactly why this class (and Core Data) was added to iPhone OS 3.0.
Do you need to store/load all data at once?
Maybe you can just load the chunk of strings you need to display and load all other strings in the background.
Perhaps you can load data into memory in one thread and search from it in another? You may not get search results instantly, but having some searches feel snappier may be better than none at all, by waiting until all data are loaded.
Are some words searched more frequently or repeatedly than others? Perhaps you can cache frequently searched terms in a separate database or other store. Load it in a separate thread as a searchable store, while you are loading the main store.
As for a data structure solution, you might look into a suffix trie to search for substrings in linear time. This will probably increase your storage requirements, though, which may affect your ability to implement this with an iPhone's limited memory and disk storage capabilities.
I really don't think you're on the right path trying to load everything at once.
You've already determined that your bottleneck is the deserialization.
Regardless what the UI does, the user only sees a handful (literally) of search results at a time.
SQLlite already has a robust indexing mechanism, there is likely no need to re-invent that wheel with your own indexing, etc.
IMHO, you need to rethink how you are using UITableView. It only needs a few screenfuls of data at a time, and you should reuse cell objects as they scroll out of view rather than creating a ton of them to begin with.
So, use SQLlite's indexing and grab "TOP x" rows, where x is the right balance between giving the user some immediately-available rows to scroll through without spending too much time loading them. Set the table's scroll bar scaling using a separate SELECT COUNT(*) query, which only needs to be updated when the user types something different.
You can always go back and cache aggressively after you deserialize enough to get something up on-screen. A slight lag after the first flick or typing a letter is more acceptable than a 7-second delay just starting the app.
I have currently a somewhat similar coding problem with a large amount of searchable strings.
My solution is to store the prepared data in one large memory array, containing both the texttual data and offsets as links. Meaning I do not allocate objects for each item. This makes the data use less memory and also allows me to load & save it to a file without further processing.
Not sure if this is an option for you, since this is quite an obvious solution once you've realized that the object tree is causing the slowdown.
I use a large NSData memory block, then search through it. Well, there's more to it, it took me about two days to get it well optimized.
In your case I suspect you have a dictionary with a lot of words that have similar beginnings. You could prepare them on another computer in a format the both compacts the data and also facilitates fast lookup. As a first step, the words should be sorted. With that, you can already perform a binary search on them for a fast lookup. If you store it all in one large memory area, you can do the search quite fast, compared to how sqlite would search, I think.
Another way would be to see the words as a kind of tree: You have many thousands that begin with the same letter. So you divide your data accordingly: You have a sql table for each beginning letter of your set of words. that way, if you look up a word, you'd select one of the now-smaller tables depening on the first letter. This makes the amount that has to be searched already much smaller. and you can do this for the 2nd and 3rd letter as well, and you already could have quite a fast access.
Did this give you some ideas?
Well actually I figured it out myself in the end, but of course I thank you all for your quick and pertinent answers. To be concise I will just say that, the fact that Objective-C, just like any other object-based programming language, due to introspection and other objective requirements is significantly slower than procedural programming languages.
The solution was in fact to load all my data in a continuous chunk of memory using malloc (a char **) and search on-demand in it and transform to objects. This concluded in a .5 sec loading time (from file to memory) and resonable (should be read "fast") operations during execution. Thank you all again and if you have any questions I'm here for you. Thanks