how much data can i store per node in Neo4j - nosql

I need to save big chunks of Unicode text strings in Neo4j nodes. The documentation does not mention anything about the size of data one can store per node.
Does anyone know that?

I just tried the following with the neo4j web interface :
I wrote a line of 26 characters and duplicated it through 32000 lines, which makes a total of 832000 characters.
I created a node with a property "text" and copied my text in it, and it worked perfectly.
I tried again with 64000 lines with white spaces at the end of lines, with a total of 1728000 characters. Created a new node, then queried the node and copied the result back in a file to check the size (you never know), and wc gave me 1728001 (the one must be an error in the copy/paste process I suppose).
It didn't seem to complain.
FYI this is equivalent to a text with 345600 words of an average size of 4 and a space (5 characters), and a book of 1000 pages with 300 words per page.
I don't know however how this could impact performances if there are too many nodes. If it doesn't work well because of this, you could always consider having neo4j for storing informations about relationships, with a property ID as an id for another document oriented database to retrieve the text (or simply the path of a file as a path property).

Neo4j is by default indexed using Lucene. Lucene was built as a full text search toolbox (with Solr being the de facto search engine implementation). Since Lucene was intended to search over large amounts of text, my suspicion is that you can put as much text into a node as you want and it will work just fine.

Neo4j is a very nice solution for managing relationships between objects. As you may already know these relationships can have properties as well as the nodes themselves. But I think you cannot store "a big chunk" of data on these nodes. I think Neo4j was intended to be used with another database such as MongoDb or even mysql. You get "really fast" the information you first need and then look up for it using another engine. On my projects I store usernames, names, date of birth, ids, and these kind of information, but not very large text strings.

Related

Database design: Postgres or EAV to hold semi-structured data

I was given the task to decide whether our stack of technologies is adequate to complete the project we have at hand or should we change it (and to which technologies exactly).
The problem is that I'm just a SQL Server DBA and I have a few days to come up with a solution...
This is what our client wants:
They want a web application to centralize pharmaceutical researches separated into topics, or projects, in their jargon. These researches are sent as csv files and they are somewhat structured as follows:
Project (just a name for the project)
Segment (could be behavioral, toxicology, etc. There is a finite set of about 10 segments. Each csv file holds a segment)
Mandatory fixed fields (a small set of fields that are always present, like Date, subjects IDs, etc. These will be the PKs).
Dynamic fields (could be anything here, but always as a key/pair value and shouldn't be more than 200 fields)
Whatever files (images, PDFs, etc.) that are associated with the project.
At the moment, they just want to store these files and retrieve them through a simple search mechanism.
They don't want to crunch the numbers at this point.
98% of the files have a couple of thousand lines, but there's a 2% with a couple of million rows (and around 200 fields).
This is what we are developing so far:
The back-end is SQL 2008R2. I've designed EAVs for each segment (before anything please keep in mind that this is not our first EAV design. It worked well before with less data.) and the mid-tier/front-end is PHP 5.3 and Laravel 4 framework with Bootstrap.
The issue we are experiencing is that PHP chokes up with the big files. It can't insert into SQL in a timely fashion when there's more than 100k rows and that's because there's a lot of pivoting involved and, on top of that, PHP needs to get back all the fields IDs first to start inserting. I'll explain: this is necessary because the client wants some sort of control on the fields names. We created a repository for all the possible fields to try and minimize ambiguity problems; fields, for instance, named as "Blood Pressure", "BP", "BloodPressure" or "Blood-Pressure" should all be stored under the same name in the database. So, to minimize the issue, the user has to actually insert his csv fields into another table first, we called it properties table. This action won't completely solve the problem, but as he's inserting the fields, he's seeing possible matches already inserted. When the user types in blood, there's a panel showing all the fields already used with the word blood. If the user thinks it's the same thing, he has to change the csv header to the field. Anyway, all this is to explain that's not a simple EAV structure and there's a lot of back and forth of IDs.
This issue is giving us second thoughts about our technologies stack choice, but we have limitations on our possible choices: I only have worked with relational DBs so far, only SQL Server actually and the other guys know only PHP. I guess a MS full stack is out of the question.
It seems to me that a non-SQL approach would be the best. I read a lot about MongoDB but honestly, I think it would be a super steep learning curve for us and if they want to start crunching the numbers or even to have some reporting capabilities,
I guess Mongo wouldn't be up to that. I'm reading about PostgreSQL which is relational and it's famous HStore type. So here is where my questions start:
Would you guys think that Postgres would be a better fit than SQL Server for this project?
Would we be able to convert the csv files into JSON objects or whatever to be stored into HStore fields and be somewhat queryable?
Is there any issues with Postgres sitting in a windows box? I don't think our client has Linux admins. Nor have we for that matter...
Is it's licensing free for commercial applications?
Or should we stick with what we have and try to sort the problem out with staging tables or bulk-insert or other technique that relies on the back-end to do the heavy lifting?
Sorry for the long post and thanks for your input guys, I appreciate all answers as I'm pulling my hair out here :)

Getting large rows out of SQL Azure - but where to go? Tables, Blob or something like MongoDB?

I read through a lot of comparisons between Azure Table/Blob/SQL storage and I think I have a good understanding of all of those ... but still, I'm unsure where to go for my specific needs. Maybe someone with experience in similar scenarios and is able to make a recommendation.
What I have
A SQL Azure DB that stores articles in raw HTML inside a varchar(max) column. Each row also has many metadata columns and many indexes for easy querying. The table contains many references to Users, Subscriptions, Tags and more - so a SQL DB will always be needed for my project.
What's the problem
I already have about 500,000 articles in this table and I expect it to grow by millions of articles per year. Each article's HTML content can be anywhere between a few KB and 1 MB or, in very few cases, larger than 1 MB.
Two problems arise: as Azure SQL storage is expensive, rather earlier than later I'll shoot myself in the head with the costs for storing this. Also, I will hit the 150 GB DB size limit also rather earlier than later. Those 500,000 articles already consume 1,6 GB DB space now.
What I want
It's clear those HTML content has to get out of the SQL DB. While the article table itself has to remain for joining it to users, subscriptions, tags and more for fast relational discovery of the needed articles, at least the colum that holds the HTML content could be outsourced to a cheaper storage.
At first sight, Azure Table storage seems like the perfect fit
Terabytes of data in one large table for very cheap prices and fast queries - sounds perfect to have a singe Table Storage table holding the article contents as an add-on to the SQL DB.
But reading through comparisons here shows it might not even be an option: 64 KB per column would be enough for 98 % of my articles, but there are those 2 % left where for some single articles even the whole 1 MB of the row limit might not be enough.
Blob storage sounds completely wrong, but ...
So there's just one option on Azure left: Blobs. Now, it might not be as wrong as it sounds. In most of the cases, I would need the content of only a single article at once. This should work fine and fast enough with Blob storage.
But I also have queries where I would need 50, 100 or even more rows at once INCLUDING even the content. So I would have to run the SQL query to fetch the needed articles and then fetch every single article out of the Blob storage. I have no experience with that but I can't believe I'd be able to remain in millisecond timespan for the queries when doing that. And queries that take multiple seconds are an absolute no-go for my project.
So it also does not seem to be to be an appropriate solution.
Do I look like a guy with a plan?
At least I have something like a plan. I thought about only "exporting" appropriate records into SQL Table Storage and/or Blob Storage.
Something like "as long as the content is < 64 KB export it to table storage, else keep it in the SQL table (or even export this single XL record into BLOB storage)"
That might work good enough. But it makes things complicated and maybe unnecessary error-prone.
Those other options
There are some other NoSQL DBs like MongoDB and CouchDB that seem to better fit my needs (at least from my naive point of view as someone who just read the specs on paper, I don't have experience with them). But they'd require self-hosting, some thing I'd like to get out of it's way if possible. I'm on Azure to do as little as needed in terms of self-hosting servers and services.
Did you really read until here?
Then thank you very much for your valuable time and thinking about my problems :)
Any suggestions would be greatly appreciated. As you see, I have my ideas and plans, but nothing beats experience from someone who walked down the road before :)
Thanks,
Bernhard
I signed up just solely to help with this question. In the past, I have found useful answers to my problems from Stackoverflow - thank you community - so I thought it would just be fair (perhaps fair is an understatement) to attempt to give something back with this question, as it falls on my alley.
In short, while considering all factors stated in the question, table storage may be the best option - iif you can properly estimate transactions per month: a nice article on this.
You can solve the two limitations that you mentioned, row and column limit, by splitting (plain text method or serializing it) the document/html/data. Speaking from experience with 40 GB+ data stored in Table Storage, where frequently our app retrieves more than 10 rows per each page visit in milliseconds - no argument here! If you need 50+ rows at times, you are looking at low single digits second(s), or you can do them in parallel (and further by splitting the data in different partitions), or in some async fashion. Or, read suggested multi level caching below.
A bit more detail. I tried with SQL Azure, Blob (both page and block), and Table Storage. I can not speak for Mongo DB since, partially for the reasons already mentioned here, I did not want to go that route.
Table Storage is fast; in the range of 20-50 milliseconds, or even faster sometimes (depends, for instance in the same data center i have seen it gone as low as 10 milliseconds), when querying with partition and row key. You may also further have several partitions, in some fashion based on your data and your knowledge about it.
It scales better, in terms of GB's but not transactions
Row and column limitations that you mentioned are a burden, agreed, but not a show stopper. I have written my own solution to split entities, you can too easily, or you can see this already-written-solution (does not solve the whole problem but it is a good start): https://code.google.com/p/lokad-cloud/wiki/FatEntities
Also need to keep in mind that uploading data to table storage is time consuming, even when batching entities due to other limitations (i.e., request size less than 4 MB, upload bandwidth, etc).
But using solely just TableStorage may not be the best solution (thinking about growth and economics). The best solution that we ended up implementing used multi-level caching/storage, starting from static classes, Azure Role Based Cache, Table Storage, and Block Blobs. Lets call this, for readability purposes, level 1A, 1B, 2 and 3 respectively. Using this approach, we are using a medium single instance (2 CPU Cores and 3.5 GB Ram - my laptop has better performance), and are able to process/query/rank 100GB+ of data in seconds (95% of cases in under 1 second). I believe this is fairly impressive given that we check all "articles" before displaying them (4+ million "articles").
First, this is tricky and may or may not be possible in your case. I do not have sufficient knowledge about the data and its query/processing usage, but if you can find a way to organize the data well this may be ideal. I will make an assumption: it sounds like you are trying to search through and find relevant articles given some information about a user and some tags (a variant of a news aggregator perhaps, just got a hunch for that). This assumption is made for the sake of illustrating the suggestion, so even if not correct, I hope it will help you or trigger new ideas on how this could be adopted.
Level 1A data.
Identify and add key entities or its properties in a static class (periodically, depending on how you foresee updates). Say we identify user preferences (e.g., demographics and interest, etc) and tags (tech, politics, sports, etc). This will be used to retrieve quickly who the user is, his/her preferences, and any tags. Think of these as key/value pair; for instance key being a tag, and its value being a list of article IDs, or a range of it. This solves a small piece of a problem, and that is: given a set of keys (user pref, tags, etc) what articles are we interested in! This data should be small in size, if organized properly (e.g., instead of storing article path, you can only store a number). *Note: the problem with data persistence in a static class is that application pool in Azure, by default, resets every 20 minutes or so of inactivity, thus your data in the static class is not persistent any longer - also sharing them across instances (if you have more than 1) can become a burden. Welcome level 1B to the rescue.
Leval 1B data
A solution we used, is to keep layer 1A data in a Azure Cache, for its sole purpose to re-populate the static entity when and if needed. Level 1B data solves this problem. Also, if you face issues with application pool reset timing, you can change that programmatically. So level 1A and 1B have the same data, but one is faster than the other (close enough analogy: CPU Cache and RAM).
Discussing level 1A and 1B a bit
One may point out that it is an overkill to use a static class and cache, since it uses more memory. But, the problem we found in practice, is that, first it is faster with static. Second, in cache there are some limitations (ie., 8 MB per object). With big data, that is a small limit. By keeping data in a static class one can have larger than 8 MB objects, and store them in cache by splitting them (i.e., currently we have over 40 splits). BTW please vote to increase this limit in the next release of azure, thank you! Here is the link: www.mygreatwindowsazureidea.com/forums/34192-windows-azure-feature-voting/suggestions/3223557-azure-preview-cache-increase-max-item-size
Level 2 data
Once we get the values from the key/value entity (level 1A), we use the value to retrieve the data in Table Storage. The value should tell you what partition and Row Key you need. Problem being solved here: you only query those rows relevant to the user/search context. As you can see now, having level 1A data is to minimize row querying from table storage.
Level 3 data
Table storage data can hold a summary of your articles, or the first paragraph, or something of that nature. When it is needed to show the whole article, you will get it from Blob. Table storage, should also have a column that uniquely identifies the full article in blob. In blob you may organize the data in the following manner:
Split each article in separate files.
Group n articles in one file.
Group all articles in one file (not recommended although not as bad as the first impression one may get).
For the 1st option you would store, in table storage, the path of the article, then just grab it directly from Blob. Because of the above levels, you should need to read only a few full articles here.
For the 2nd and 3rd option you would store, in table storage, the path of the file and the start and end position from where to read and where to stop reading, using seek.
Here is a sample code in C#:
YourBlobClientWithReferenceToTheFile.Seek(TableStorageData.start, SeekOrigin.Begin);
int numBytesToRead = (int)TableStorageData.end - (int)TableStorageData.start;
int numBytesRead = 0;
while (numBytesToRead > 0)
{
int n = YourBlobClientWithReferenceToTheFile.Read(bytes,numBytesRead,numBytesToRead);
if (n == 0)
break;
numBytesRead += n;
numBytesToRead -= n;
}
I hope this didn't turn into a book, and hope it was helpful. Feel free to contact me if you have follow up questions or comments.
Thanks!
The proper storage for a file is a blob. But if your query needs to return dozens of blobs at the same time, it will be too slow as you are pointing out. So you could use a hybrid approach: use Azure Tables for 98% of your data, and if it's too large, use a Blob instead and store the Blob URI in your table.
Also, are you compressing your content at all? I sure would.
My thoughts on this: Going the MongoDB (or CouchDB) route is going to end up costing you extra Compute, as you'll need to run a few servers (for high availability). And depending on performance needed, you may end up running 2- or 4-core boxes. Three 4-core boxes is going to run more than your SQL DB costs (plus then there's the cost of storage, and MongoDB etc. will back their data in an Azure blob for duable storage).
Now, as for storing your html in blobs: this is a very common pattern, to offload large objects to blob storage. The GETs should be doable in a single call to blob storage (single transaction) especially with the file size range you mentioned. And you don't have to retrieve each blob serially; you can take advantage of TPL to download several blobs to your role instance in parallel.
One more thing: How are you using the content? If you're streaming it from your role instances, then what I said about TPL should work nicely. If, on the other hand, you're injecting href's into your output page, you can just put the blob url directly into your html page. And if you're concerned about privacy, make the blobs private and generate a short-TTL "shared access signature" granting access for a small time window (this only applies if inserting blob url's into some other html page; it doesn't apply if you're downloading to the role instance and then doing something with it there).
You could use MongoDB's GridFS feature: http://docs.mongodb.org/manual/core/gridfs/
It splits the data into 256k chunks by default (configurable up to 16mb) and lets you use the sharded database as a filesystem which you can use to store and retrieve files. If the file is larger than the chunk size, the mongo db drivers handle splitting up / re-assembling the data when the file needs to be retrieved. To add additional disk space, simply add additional shards.
You should be aware, however that only some mongodb drivers support this and it is a driver convention and not a server feature that allows for this behavior.
A few comments:
What you could do is ALWAYS store HTML content in blob storage and store the blob's URL in table storage. I personally don't like the idea of storing data conditionally i.e. if content of HTML file is more than 64 KB only then store it in blob storage otherwise use table storage. Other advantage you get out of this approach is that you can still query the data. If you store everything in blob storage, you would lose querying capability.
As far as using other NoSQL stores are concerned, only problem I see with them is that they are not natively supported on Windows Azure thus you would be responsible for managing them as well.
Another option would be to store your files as a VHD image in blob storage. Your roles can mount the VHD to their filesystem, and read the data from there.
The complication seems to be that only one VM can have read/write access to the VHD. The others can create a snapshot and read from that, but they won't see updates. Depending on how frequently your data is updated that could work. eg, if you update data at well-known times you could have all the clients unmount, take a new snapshot, and remount to get the new data.
You can also share out a VHD using SMB sharing as described in this MSDN blog post. This would allow full read/write access, but might be a little less reliable and a bit more complex.
you don't say, but if you are not compressing your articles that probably solves your issue then just use table storage.
Otherwise just use table storage and use a unique partition key for each article. If an article's too big put it in 2 rows, as long as you query by partition key you'll get both rows, then use the row key as the index indicating how the articles fit back together
One idea that i have would be to use CDN to store your article content, and link them directly from the client side, instead of any multi phase, operation of getting data from sql then going to some storage.
It would be something like
http://<cdnurl>/<container>/<articleId>.html
Infact same thing can be done with Blob storage too.
The advantage here is that this becomes insanely fast.
Disadvantage here is that security aspect is lost.
Something like Shared Access Signature can be explored for security, but I am not sure how helpful would it be for client side links.

Is there a way to get around space usage issues when using long field names in MongoDB?

It looks like having descriptive field names (the ones I like the most) can take much space in the memory for big collections. I don't like the idea of giving them short and cryptic names to save memory, neither do I like the idea to translate field names to shortened fields somewhere in the application.
Is there a way to tell mongo not to store every field name as text?
For now the only thing you can do is to vote and wait for SERVER-863 to be solved. After almost a year of discussion the status of this issue has been changes to planned but not scheduled...
The workaround is to use document mapping libraries likes Spring Data Document or morphia (in Java world) and work with nicely named objects. But the underlying database names are still cryptic.
If you are using an "object-document mapper" library to access MongoDB, many of them provide facilities for using descriptive names within your application code, but storing short names in the database. If your application has a data access layer, it may be possible for you to implement this logic in your application code, as well.
Since you haven't said what language you're using, or whether you're using an ODM at all, I provide any more guidance on which ODMs might fit your needs.

Best data store w/full text search for lots of small documents? (e.g. a Splunk-like system)

We are specing out a system that will index and store zillions of Syslog messages. These are text messages, with a few attributes (system name, date/time, message type, message body), that are typically 100 to 1500 bytes each.
We generate 2 to 10 gb of these messages per day, and need to retain at least 30 days of them.
The splunk system has a really great indexing and document compression system.
What to use?
I thought of mongodb, but it seems inappropriate for documents of this small size.
SQL Server is a possibility, but seems perhaps not super efficient for this purpose.
Text files with lucene?
-- The windows file system doesn't always like dirs with zillions of files
Suggestions ?
Thanks!
I thought of mongodb, but it seems inappropriate for documents of this small size
There's a company called Boxed Ice that actually builds a server monitoring system using MongoDB. I would argue that it's definitely appropriate.
These are text messages, with a few attributes (system name, date/time, message type, message body), that are typically 100 to 1500 bytes each.
From a MongoDB perspective, we would say that you are storing lots of small documents with a few attributes. In a case like this MongoDB has several benefits here.
It can handle changing attributes seamlessly.
It will flexibly handle different types.
We generate 2 to 10 gb of these messages per day, and need to retain at least 30 days of them.
This is well within the type of data range that MongoDB can handle. There are several different methods of handling the 30 day retention periods. These will depend on your reporting needs. I would poke around on the groups for ideas here.
Based on the people I've worked with, this type of insert-heavy logging is one of the places where Mongo tends to be a very good fit.
Graylog2 is an open-source log management tool that is built on top of MongoDB. I believe Loggy, a logging-as-a-service provider, also uses MongoDB as their backend store. So there are quiet a few products using MongoDB for logging.
It should be possible to store the ngrams returned by a Lucene analyzer for better text searching. Not sure about the feasibility though given the large amount of documents. What is primary reporting use case?
It seems that you would want something like mongodb full-text search server, which will enable you to search on different attributes without losing performance. You may try MongoLantern: http://sourceforge.net/projects/mongolantern/. Though it's in alpha stage but gives very best result for me for 5M records.
Let me know whether this serves your purpose.
I would strongly consider using something Lucene or Solr.
Lucene is built specifically for full text search and provides a ton of additional helpful features that you may find useful in your application. As a bonus, Solr is dead simple to setup and configure. (And its super fast for searching)
They do not keep a file per entry, so you shouldnt have to worry much about zillions of files.
None of the free database options specialize in full text search - dont try to force them to do what you want.
I think you should deploy your own (intranet-wide) stack of Grafana, Logstash + ElasticSearch
When setup once you have a flexibel schema, retention and a wonderful UI for your data with Grafana.

How to reduce the size of an sqlite3 database for iphone?

edit: many thanks for all the answers. Here are the results after applying the optimisations so far:
Switching to sorting the characters and run length encoding - new DB size 42M
Dropping the indexes on the booleans - new DB size 33M
The really nice part is this hasn't required any changes in the iphone code
I have an iphone application with a large dictionary held in sqlite format (read only). I'm looking for ideas to reduce the size of the DB file, which is currently very large.
Here is the number of entries and resulting size of the sqlite DB:
franks-macbook:DictionaryMaker frank$ ls -lh dictionary.db
-rw-r--r-- 1 frank staff 59M 8 Oct 23:08 dictionary.db
franks-macbook:DictionaryMaker frank$ wc -l dictionary.txt
453154 dictionary.txt
...an average of about 135 bytes per entry.
Here is my DB schema:
create table words (word text primary key, sowpods boolean, twl boolean, signature text)
create index sowpods_idx on words(sowpods)
create index twl_idx on words(twl)
create index signature_idx on words(signature)
Here is some sample data:
photoengrave|1|1|10002011000001210101010000
photoengraved|1|1|10012011000001210101010000
photoengraver|1|1|10002011000001210201010000
photoengravers|1|1|10002011000001210211010000
photoengraves|1|1|10002011000001210111010000
photoengraving|1|1|10001021100002210101010000
The last field represents the letter frequencies for anagram retrieval (each position is in the range 0..9). The two booleans represent sub dictionaries.
I need to do queries such as:
select signature from words where word = 'foo'
select word from words where signature = '10001021100002210101010000' order by word asc
select word from words where word like 'foo' order by word asc
select word from words where word = 'foo' and (sowpods='1' or twl='1')
One idea I have is to encode the letter frequencies more efficiently, e.g. binary encode them as a blob (perhaps with RLE as there are many zeros?). Any ideas for how best to achieve this, or other ideas to reduce the size? I am building the DB in ruby, and reading it on the phone in objective C.
Also is there any way to get stats on the DB so I can see what is using the most space?
Have you tried typing the "vacuum" command to make sure you don't have extra space in the db you forgot to reclame?
Remove the indexes on sowpods and twl -- they are probably not helping your query times and are definitely taking lots of space.
You can get stats on the database using sqlite3_analyzer from the SQLite downloads page.
As a totally different approach, you could try using a bloom filter instead of a comprehensive database. Basically, a bloom filter consists of a bunch of hash functions, each of which is associated with a bitfield. For each legal word, each hash function is evaluated, and the corresponding bit in the corresponding bit field is set. Drawback is it's theoretically possible to get false positives, but those can be minimized/practically eliminated with enough hashes. Plus side is a huge space savings.
I'm not clear on all the use cases for the signature field but it seems like storing an alphabetized version of the word instead would be beneficial.
The creator of SQLite sells a version of SQLite that includes database compression (and encryption). This would be perfect.
Your best bet is to use compression, which unfortunately SQLite does not support natively at this point. Luckily, someone took the time to develop a compression extension for it which could be what you need.
Otherwise I'd recommend storing your data mostly in compressed format and uncompressing on the fly.
As a text field, signature is currently using at least 26 * 8 bytes per entry (208 bytes) but if you were to pack the data into a bitfield, you could probably get away with only 3 bits per letter (reducing your maximum frequency per letter to 7). That would mean you could pack the entire signature in 26 * 3 bits = 78 bits = 10 bytes. Even if you used 4 bits per letter (for a maximum frequency of 15 per letter) you would only use 104 bits (13 bytes).
EDIT: After a bit more thought, I think 4 bits per letter (instead of 3) would be a better idea because it would make the binary math easier.
EDIT2: Reading through the docs on SQLite data types, it seems that you might be able to just make the "signature" field span 26 columns of type INTEGER and SQLite will do the right thing and only use as many bits as required to store the value.
Do I reckon correctly that you have about 450K words like that in your database ?
I've got no clue about iPhone, neither serious about sqlitem but... as long as sqlite does not allow for a way to save the file as gz right away (it maybe already does internally? no, does not look like that when you say it's about 135 b per entry. not even with both indexes), I would move away from the table approach, save it "manually" in a dictionary approach compression and build the rest on the fly and in memory. That should perform VERY well on your type of data.
Wait... Are you using that signature to allow for fulltextsearching or mistyping recogition ? Would full text search on sqlite not obsolete that field ?
As noted storing "Signature" more efficiently seems like a good idea.
However, it also seems like you could gain a ton of space savings by using some kind of lookup table for words - since you seem to be taking a root word and then appending "er", "ed", "es", etc why not have a column with a numeric ID that references a root word from a separate lookup table, and then a separate column with a numeric ID that references a table of common word suffixes that would be appended to the base word.
If there were any tricks around storing shorthand versions of signatures for multiple entries with a single root word, you could also employ those to reduce the size of stored signatures (not sure what algorithm is producing those values)
This also seems to make a lot of sense to me as you have the "word" column as a primary key, but do not even index it - just create a separate numeric column that is the primary ID for the table.
mhmm... an iPhone... doesn't it have a permanent data connection ?
I think this is where a webapplication/webservice can jump in snugly.
Move most of your business logic to the webserver (he's gonna have real SQL with FTS and looooots of memory) and fetch that info online to the client on the device.
As mentioned elsewhere, lose the indexes on the boolean columns, they will almost certainly be slower (if used at all) than a table scan and are going to use space needlessly.
I'd consider applying a simple compression to the words, Huffman coding is pretty good for this sort of thing. Also, I'd look at the signatures: sort the columns in letter frequency order and don't bother storing trailing zeroes, which can be implied. I guess you could Huffman-encode those, too.
Always assuming your encoded strings don't upset SQLite, of course.