Latest research on suffix arrays vs suffix trees

I've been trying to ascertain whether suffix trees or suffix arrays (including their variants) are more space efficient (among other properties given below), but I keep coming up with different viewpoints depending on where I look. This Wikipedia article, for example, suggests that suffix arrays are more space efficient. This book, in section 1.6, suggests that (compressed) suffix trees are very space efficient, based on the paper "Compressed Suffix Trees with Full Functionality" by Kunihiko Sadakane. So what is the latest research comparing suffix trees and suffix arrays (including their variants)? More specifically, I am interested in which is better in terms of i) construction, ii) space (theoretical and practical), and iii) query performance.
I know that portions of this question may have been asked previously, but those questions are at least a year old and I am interested in the latest research.
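For reference, here is a minimal, deliberately naive Python sketch of a suffix array, just to fix terminology for points i)-iii); the construction algorithms the literature actually compares are linear-time (and the interesting variants are compressed), so this is only an illustration:

```python
# Naive suffix array sketch: construction is O(n^2 log n) here purely for
# clarity; the structure itself is just the sorted suffix start positions.
def build_suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])


def contains(text, sa, pattern):
    """Binary-search the suffix array for a suffix starting with `pattern`."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:].startswith(pattern)


text = "banana"
sa = build_suffix_array(text)
print(sa)                         # [5, 3, 1, 0, 4, 2]
print(contains(text, sa, "nan"))  # True
```

Even in this toy form the trade-off is visible: the array is n integers stored alongside the text, and a substring query costs O(m log n) by binary search; what I'm asking about is how compressed suffix trees and compressed/enhanced suffix arrays improve on construction, space, and querying beyond this.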

Have "Brodal search trees" really been implemented for practical use?

Brodal et al. demonstrated in their ESA '06 paper the existence of a purely functional structure with logarithmic-time search, update, and insertion, and constant-time merge. (Note I'm not talking about Brodal heaps, which are a different data structure, quite widely used to implement purely functional priority queues.) This seems to be a very attractive result and should lead to efficient purely functional sets and maps, but I don't see these trees used anywhere:
Haskell's containers uses Adams trees;
OCaml's standard library uses AVL trees;
Scala's immutable sorted maps are implemented using red-black trees.
If Brodal trees really have such good bounds, why have they not been adopted into the standard libraries of mainstream functional programming languages? In fact, I have not seen even one implementation of Brodal trees at all!
Specifically, is this because:
they are very hard (or in fact nearly impossible) to implement correctly;
the constants are very large, and real gains seem small;
other reasons;
or a combination of the aforementioned?
As mentioned in the comments, there is very limited information in the paper, which already leads one to suspect very large constants. In addition:
The structure doesn't actually claim to support general merge in O(1) time. It only claims to support a much more restricted join function that concatenates trees which are sorted relative to each other. Given a way to split trees, this operation is useful for parallel computation, but in that context a logarithmic-time join is cheap enough for any practical purpose.
The structure doesn't support splitting at an element, and no efficient implementations are offered for unions, intersections, etc.
Somewhat overlapping with the above: reading the article, I think the following note from the conclusion may be why not much effort has been made toward an implementation:
Splitting will invalidate this property for every tree-collection and will lead to O(log n log log n) search and update times.

PostgreSQL ltree vs. tree module vs. integer/string arrays or string-delimited path

As you may know, there is a module for PostgreSQL called ltree. You also have the possibility to use an array type for integers (*1, see comment below), which in this test actually performs a little slower on recursive queries compared to ltree, except for the string indexing (*2, see comment below).
I'm not too sure about the credibility of these test results, though.
My biggest question here is actually about the relatively unknown, and almost undocumented, tree module, described here (where the documentation can also be found) as:
support for hierarchical data types (sort of lexicographical trees), should go to contrib/tree, pending because of lack of proper documentation.
After reading through the documentation I'm a bit confused as to whether I should base my big application (a CMS where everything will be stored in a hierarchical tree structure, not only content but also files etc., so you can see this quickly scales up) around ltree, around a normal materialized path (path enumeration) with a delimited string or integer array as the path, or whether the relatively unknown "tree" module should in theory be the faster, more scalable, and better solution of the two.
I've already analysed the different tree-structure models. Since query performance, scalability, and reordering of nodes and subtrees are my main requirements, I've been able to rule out adjacency lists (recursive CTEs will not keep performance up as the tree grows huge), nested sets/intervals (not fast enough in some queries, considering their disadvantages when manipulating the tree), closure tables (they scale terribly for big, complex trees and are not useful for a project as large as mine), etc., and decided to go with the materialized path, which is super fast for read operations and makes it easy to move subtrees and nodes around the hierarchy. So the question is only about the best of the proposed implementations of the materialized path.
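For concreteness, the materialized-path operations I care about look roughly like this plain-Python sketch (the node names and path strings are made up; in PostgreSQL the same logic becomes prefix queries against an indexed path column, e.g. path LIKE '1.1.%' or, with ltree, path <@ '1.1'):

```python
# Hypothetical materialized-path table, sketched as a dict of name -> path.
nodes = {
    "root": "1",
    "docs": "1.1",
    "docs/intro": "1.1.1",
    "files": "1.2",
}


def descendants(paths, ancestor_path):
    """All nodes whose path starts with the ancestor's path plus a separator."""
    prefix = ancestor_path + "."
    return {name: p for name, p in paths.items() if p.startswith(prefix)}


def move_subtree(paths, old_root, new_root):
    """Move a subtree by rewriting the path prefix of its root and descendants."""
    def rewrite(p):
        if p == old_root or p.startswith(old_root + "."):
            return new_root + p[len(old_root):]
        return p
    return {name: rewrite(p) for name, p in paths.items()}


print(descendants(nodes, "1.1"))            # {'docs/intro': '1.1.1'}
print(move_subtree(nodes, "1.1", "1.2.9"))  # docs -> '1.2.9', docs/intro -> '1.2.9.1'
```

Reads are a single prefix scan and a subtree move is a single prefix rewrite, which is exactly why I settled on the materialized path; the open question is only which representation (ltree, "tree", delimited string, or integer array) carries these operations best.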
I'm especially curious to hear your theories about or experiences with "tree" in PostgreSQL.
As far as I've read, contrib/tree was never officially released, whereas ltree was merged into PostgreSQL's core.
I understand both use the same idea of labeled paths, but tree only allowed integer labels, whereas ltree allows text labels that permit full-text searches, though the full path length is limited (65 kB max, 2 kB preferred).

How do I prove a DFA has no synchronizing word?

To find a synchronizing word I have always just used trial and error, which is fine for small DFAs but not so useful on larger ones. What I want to know is whether there exists an algorithm for finding a synchronizing word, or a way to tell that one does not exist (rather than just saying "I can't find one, therefore one cannot exist", which is by no means a proof).
I have had a look around on Google and so far have only come across methods for determining upper and lower bounds on the length of a synchronizing word based on the number of states, which is not helpful to me.
The existence of upper bounds on the length of a synchronizing word immediately implies the existence of a (very slow) algorithm for finding one: just list all strings of length up to the bound and test whether each is a synchronizing word. If any of them is, a synchronizing word exists; if none of them is, there is no synchronizing word. This is exponentially slow, though, so it's not advisable on large DFAs.
David Eppstein designed a polynomial-time algorithm for finding synchronizing words in DFAs, though I'm not very familiar with this algorithm.
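One more pointer that may help: there is a classical polynomial-time yes/no test, and it is also the idea that algorithms like Eppstein's build on. A DFA has a synchronizing word if and only if every pair of states can be mapped to a single state by some word, so you only have to check pairs. Below is a rough Python sketch of that check; the example transition table is the Černý automaton on 4 states, which is known to be synchronizing.

```python
from itertools import combinations


def is_synchronizing(states, alphabet, delta):
    """Return True iff the DFA has a synchronizing (reset) word.

    Criterion: the DFA is synchronizing iff every pair of states {p, q} can be
    mapped to a single state by some word.  Compute the set of "mergeable"
    state sets by a fixpoint: singletons are mergeable, and a pair is
    mergeable if some letter sends it to a mergeable set.
    """
    mergeable = {frozenset([s]) for s in states}   # already a single state
    pairs = [frozenset(pq) for pq in combinations(states, 2)]
    changed = True
    while changed:
        changed = False
        for pq in pairs:
            if pq in mergeable:
                continue
            p, q = tuple(pq)
            if any(frozenset([delta[p][a], delta[q][a]]) in mergeable
                   for a in alphabet):
                mergeable.add(pq)
                changed = True
    return all(pq in mergeable for pq in pairs)


# Example: the Cerny automaton C_4 ("a" cycles the states, "b" sends state 3
# to 0 and fixes the rest); it is synchronizing.
states = [0, 1, 2, 3]
alphabet = ["a", "b"]
delta = {
    0: {"a": 1, "b": 0},
    1: {"a": 2, "b": 1},
    2: {"a": 3, "b": 2},
    3: {"a": 0, "b": 0},
}
print(is_synchronizing(states, alphabet, delta))  # True
```

If the function returns False, there is some pair of states that can never be merged, and that pair is your proof that no synchronizing word exists.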
Hope this helps!

Clustering structured (numeric) and text data simultaneously

Folks,
I have a bunch of documents (approx. 200k) that have a title and an abstract. Other metadata is available for each document, for example category (only one of cooking, health, exercise, etc.) and genre (only one of humour, action, anger, etc.). The metadata is well structured, and all of this is available in a MySQL DB.
I need to show users related documents while they are reading one of these documents on our site. I also need to give the product managers adjustable weights for title, abstract, and metadata so they can experiment with this service.
I am planning to run clustering on top of this data, but I am hampered by the fact that all the Mahout clustering examples use either DenseVectors built from numbers or Lucene-based text vectorization.
The examples use either numeric data only or text data only. Has anyone solved this kind of problem before? I have been reading the Mahout in Action book and the Mahout wiki, without much success.
I can do this from first principles: extract all titles and abstracts into a DB, calculate TF-IDF and LLR, treat each word as a dimension, and go about this experiment with a lot of code writing. That seems like a long way round to the solution.
That, in a nutshell, is where I am stuck: am I doomed to first principles, or does there exist a tool or methodology that I have somehow missed? I would love to hear from folks who have solved a similar problem.
Thanks in advance
You have a text similarity problem here, and I think you're thinking about it correctly. Just follow any example concerning text. Is it really a lot of code? Once you count the words in the docs you're mostly done; then feed the vectors into whatever clusterer you want. Term extraction is not something you do with Mahout, though there are certainly libraries and tools that are good at it.
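If you step outside Mahout for a moment, the whole pipeline really is only a few lines. Here is a rough sketch with scikit-learn (a stand-in I picked for brevity, not what the question asks about); the toy documents, field names, and weights are all invented, but it shows the "weighted TF-IDF text plus one-hot metadata" combination:

```python
# Illustration only: weighted TF-IDF text features + one-hot metadata,
# stacked into one sparse matrix, then clustered / queried for neighbours.
from scipy.sparse import hstack
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import OneHotEncoder

docs = [
    {"title": "Quick pasta dinners", "abstract": "Easy weeknight pasta recipes", "category": "cooking"},
    {"title": "Stretching for runners", "abstract": "Simple stretches after a run", "category": "exercise"},
    {"title": "Sleep and health", "abstract": "Why sleep matters for your health", "category": "health"},
]

w_title, w_abstract, w_meta = 2.0, 1.0, 0.5   # the knobs the product managers tune

title_vec = TfidfVectorizer().fit_transform([d["title"] for d in docs])
abstract_vec = TfidfVectorizer().fit_transform([d["abstract"] for d in docs])
meta_vec = OneHotEncoder().fit_transform([[d["category"]] for d in docs])

X = hstack([w_title * title_vec, w_abstract * abstract_vec, w_meta * meta_vec]).tocsr()

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
neighbours = NearestNeighbors(n_neighbors=2).fit(X)
print(labels)
print(neighbours.kneighbors(X[0], return_distance=False))  # doc 0 and its closest match
```

Because the metadata ends up as just more columns in the same sparse matrix, any clusterer or nearest-neighbour index can consume it, and the per-field weights are exactly the experiment handles the product managers asked for.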
I'm actually working on something similar, but without the need to distinguish between numeric and text fields.
I have decided to go with the semanticvectors package, which handles the TF-IDF part, the building of the semantic space vectors, and the similarity search. It uses a Lucene index.
Please note that you can also use the s-space package if semanticvectors doesn't suit you (if you go down that road of course).
The only caveat I'm facing with this approach is that the indexing part can't be incremental: I have to re-index everything every time a new document is added or an old one is modified. People using semanticvectors say they have very good indexing times, but I don't know how large their corpora are. I'm going to test these issues with the Wikipedia dump to see how fast it can be.

Words Prediction - Get most frequent predecessor and successor

Given a word, I want to get the list of the most frequent predecessors and successors of that word in the English language.
I have developed code that does bigram analysis on a corpus (I have used the Enron email corpus) and can predict the most frequent next word, but I want some other solution because:
a) I want to check the correctness and accuracy of my predictions
b) Corpus- or dataset-based solutions fail for unseen words
For example, given the word "excellent", I want the words that are most likely to come before and after it.
My question is whether any particular service or API exists for this purpose.
Any solution to this problem is bound to be a corpus-based method; you just need a bigger corpus. I'm not aware of any web service or library that does this for you, but there are ways to obtain bigger corpora:
Google has published a huge corpus of n-grams collected from the English part of the web. It's available via the Linguistic Data Consortium (LDC), but I believe you must be an LDC member to obtain it. (Many universities are.)
If you're not an LDC member, try downloading a Wikipedia database dump (get enwiki) and training your predictor on that.
If you happen to be using Python, check out the nice set of corpora (and tools) delivered with NLTK.
As for the unseen words problem, there are ways to tackle it, e.g. by replacing all words that occur less often than some threshold by a special token like <unseen> prior to training. That will make your evaluation a bit harder.
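As a tiny illustration of that preprocessing step (the threshold of 2 and the toy sentence are arbitrary):

```python
# Replace words rarer than a threshold with a special <unseen> token.
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()
counts = Counter(tokens)
threshold = 2
prepared = [t if counts[t] >= threshold else "<unseen>" for t in tokens]
print(prepared)  # 'sat', 'on', 'mat', 'ran' all become '<unseen>'
```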
You have to give some more instances or context of the "unseen" word so that the algorithm can make some inference.
One indirect way is to read the rest of the words in the sentence and to look in a dictionary for entries in which those words are encountered.
In general, you can't expect the algorithm to learn and understand a word the first time it sees it. Think about yourself: if you were given a new word, how well could you make out its meaning? Probably by looking at how it has been used in the sentence and at how good your own understanding is; you make an educated guess, and over time you come to understand its meaning.
I just re-read the original question and realize the answers, mine included, got off base. I think the original poster just wanted to solve a simple programming problem, not look for datasets.
If you list all distinct word-pairs and count them, then you can answer your question with simple math on that list.
Of course you have to do a lot of processing to generate the list. While it's true that if the total number of distinct words is as much as 30,000 then there are around a billion possible pairs, I doubt that in practice there are that many. So you can probably write a program with a huge hash table in memory (or on disk) and just count them all. If you don't need the insignificant pairs, you could write a program that flushes out the less important ones periodically while scanning. You can also segment the word list and count pairs of a hundred words versus the rest, then the next hundred, and so on, calculating in passes.
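Here is roughly what that counting looks like in Python; the tiny inline "corpus" is just a stand-in for whatever text you actually scan:

```python
# Count predecessor and successor frequencies for every word in a token stream.
from collections import Counter, defaultdict

corpus = "this is an excellent book and an excellent read it is excellent".split()

predecessors = defaultdict(Counter)
successors = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    successors[prev][cur] += 1
    predecessors[cur][prev] += 1

print(predecessors["excellent"].most_common(2))  # [('an', 2), ('is', 1)]
print(successors["excellent"].most_common(2))    # e.g. [('book', 1), ('read', 1)]
```

With a hash table (or an on-disk key-value store) keyed by word, the same two counters answer both the "most frequent predecessor" and "most frequent successor" questions directly.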
My original answer is below; I'm leaving it because it describes my own related problem:
I'm interested in something similar (I'm writing an entry system that suggests word completions and punctuation, and I would like it to be multilingual).
I found a download page for Google's n-gram files, but they're not that good; they're full of scanning errors: 'i's become '1's, words run together, etc. Hopefully Google has improved its scanning technology since then.
The just-download-Wikipedia, unpack it, and strip the XML idea is a bust for me; I don't have a fast computer (heh, I have a choice between an Atom netbook and an Android device here). Imagine how long it would take me to unpack 3 gigabytes of bz2 files into what, 100 GB of XML, and then process it all with Beautiful Soup and filters that the author admits crash partway through each file and need to be restarted.
For your purpose (previous and following words) you could create a dictionary of real words and filter the n-gram lists to exclude the mis-scanned words. One might hope that the scanning was good enough that you could exclude mis-scans by keeping only the most popular words, but I saw signs of consistent mistakes.
The n-gram datasets are here, by the way: http://books.google.com/ngrams/datasets
This site may also have what you want: http://www.wordfrequency.info/