why are orientdb index sizes on disk so large

why are orientdb index sizes on disk so large - orientdb

orientdb 2.0.5
I have a database in which we create non-unque index on 2 properties on a class called indexstat.
The two properties which make up the index are a string identifier plus a long timestamp.
Data is created in batches of few hundred records every 5 minutes. After a few hours old records are deleted.
This is file listing are the files related to that table.
Question:
Why is the .irs file which according to documentation (is related to non-unique indexes)...so monstrously huge after a few hours. 298056704 bytes larger than actual data (.irs size - .sbt size - .cpm size).
I would think the index would be smaller than the actual data.
Second question:
What is best practice here. Should I be using unique indexes instead of non-unique? Should I find a way to make the data in the index smaller (e.g. use longs instead of strings as identifiers)?
Below are file names and the sizes of each.
indexstat.cpm 727778304
indexstatidx.irs 1799095296
indexstatidx.sbt 263168
indexstat.pcl 773260288
This is repeated for a few tables where the index size is larger than the database data.

Internals of *.irs files organised in a such way that when you delete something from an index there is an unused hole left in the file. At some point, when about a half of the file space is wasted, those unused holes come into play again and become available for reuse and allocation. That is done for performance reasons to lower the index data fragmentation. In your case this means that sooner or later the *.irs file will stop growing, and its maximum size should be around 2-3 times larger than the maximum observed size of the corresponding *.pcl file, assuming your single stat record size is not much bigger compared to the size of the id-timestamp pair size.
Regarding the second question, in a long run it is almost always better to use the most specific/strict data types to model the data and the most specific/strict index types to index it.

At this link is shown a discussion relative to the index file, maybe can help you.
For the second question, the index should be chosen according to your purpose and your data (not vice versa). The data type (long, string) must be the one that best represents your fields (and already here if for example if you just an integer and this is sufficient to the scope, it is useless to use a long). The same choice for the index, if you need to not have duplicate the choice will be non-unique. if you need an index that allows to choose the range sb-tree instead of the hash and so on ...

Related

PostgreSQL varchar length performance impact

I have a table in PostgreSQL which is under heavy load (reads). It is practically core table of an application. One column is used as a discriminator - column used by application, that determines type of entity (class) that represents given row. It has to be exactly one varchar column. Currently I do store full class name in it, like: "bank_statement_transaction".
When application selects all bank statement transactions, query is built like ... WHERE Discriminator = 'bank_statement_transaction' . This brings more readability and clarity to data, structure and code.
Table contains currently 3M rows and counting, approximately 100k new rows monthly. Discriminator was indexed during some performance tunings. I don't have any performance issues right now.
I am working on a new feature that requires some little refactoring and yeah I had an idea to change full class name (bank_statement_transaction) to short unique codes (BST)
I replicated dbo and changed full class name to code. With 3M rows, performance gain is barely measurable, same or 1-2 milliseconds faster.
Can anyone share experience with VARCHAR length impact on INDEX size and performance? On bigger data set? Is this change worth of it?

If you index strings, the index will become larger if the strings are long. The fan-out will be less, so the index will become deeper.
With an index scan that searches for a few rows, this won't be noticable: reading a few blocks more and running comparisons on longer strings may be lost in the noise for any but the simplest queries. Still, you'll be faster with smaller strings.
Maybe the most noticeable effect will be that a smaller index needs less RAM for caching, so the number of disk reads should go down.

Postgres determine size of all blobs

hi i'm trying to find the size of all blobs. I always used this
SELECT sum(pg_column_size(pg_largeobject)) lob_size FROM pg_largeobject
but while my database is growing ~40GB this takes several hours and loads the cpu too much.
is there any more efficent way?

Some of the functions mentioned in Database Object Management Functions give an immediate result for the entire table.
I'd suggest pg_table_size(regclass) which is defined as:
Disk space used by the specified table, excluding indexes (but
including TOAST, free space map, and visibility map)
It differs from sum(pg_column_size(tablename)) FROM tablename because it counts entire pages, so that includes the padding between the rows, the dead rows (updated or deleted and not reused), and the row headers.

Lucene: Loading Index files while searching?

Can anyone explain how index files are loaded in memory while searching?
Is the whole file (fnm, tis, fdt etc) loaded at once or in chunks?
How individual segments are loaded and in which order?
How to encrypt Lucene index?

The main point of having the index segments is that you can rarely load the whole index in the memory.
The most important limitation that is taken into account while designing the index format is that disk seek time is relatively long (on plate-base hard drives, that are still most widely used). A good estimation is that the transfer time per byte is about 0.01 to 0.02 μs, while average seek time of disk head is about 5 ms!
So the part that is kept in memory is typically only the dictionary, used to find out the beginning block of the postings list on the disk*. The other parts are loaded only on-demand and then purged from the memory to make room for other searches.
As for encryption, it depends on whether you need to keep the index encrypted all the time (even when in memory) or if it suffices to encrypt only the index files. As for the latter, I think that an encrypted file system will be enough. As for the former, it is also certainly possible, as different index compression techniques are already in place. However, I don't think it's widely used, as the first and foremost requirement for full-text engine is speed.
[*] It's not really such simple, as we're performing binary searches against the dictionary, so we need to ensure that all entries in the first structure have equal length. As it's clearly not the case with normal words in dictionary and applying padding is too much costly (think of word lengths for some chemical substances), we actually maintain two levels of dictionary, the first one (which needs to fit in the memory and is stored in .tii files) keeps sorted list of starting positions of terms in the second index (.tis files). The second index is then a concatenated array of all terms in an increasing order, along with pointer to the sector in the .frq file. The second index often fits in the memory and is loaded at the start, but it can be impossible e.g. for bigram indexes. Also note that for some time Lucene by default doesn't use individual files, but so called compound files (with .cfs extension) to cut down the number of open files.

What's the fastest way to save data and read it next time in a IPhone App?

In my dictionary IPhone app I need to save an array of strings which actually contains about 125.000 distinct words; this transforms in aprox. 3.2Mb of data.
The first time I run the app I get this data from an SQLite db. As it takes ages for this query to run, I need to save the data somehow, to read it faster each time the app launches.
Until now I've tried serializing the array and write it to a file, and afterword I've tested if writing directly to NSUserDefaults to see if there's any speed gain but there's none. In both ways it takes about 7 seconds on the device to load the data. It seems that not reading from the file (or NSUserDefaults) actually takes all that time, but the deserialization does:
objectsForCharacters = [[NSKeyedUnarchiver unarchiveObjectWithData:data] retain];
Do you have any ideeas about how I could write this data structure somehow that I could read/put in memory it faster?

The UITableView is not really designed to handle 10s of thousands of records. If would take a long time for a user to find what they want.
It would be better to load a portion of the table, perhaps a few hundred rows, as the user enters data so that it appears they have all the records available to them (Perhaps providing a label which shows the number of records that they have got left in there filtered view.)
The SQLite db should be perfect for this job. Add an index to the words table and then select a limited number of rows from it to show the user some progress. Adding an index makes a big difference to the performance of the even this simple table.
For example, I created two tables in a sqlite db and populated them with around 80,000 words
#Create and populate the indexed table
create table words(word);
.import dictionary.txt words
create unique index on words_index on word DESC;
#Create and populate the unindexed table
create table unindexed_words(word);
.import dictionary.txt unindexed_words
Then I ran the following query and got the CPU Time taken for each query
.timer ON
select * from words where word like 'sn%' limit 5000;
...
>CPU Time: user 0.031250 sys 0.015625;
select * from unindex_words where word like 'sn%' limit 5000;
...
>CPU Time: user 0.062500 sys 0.0312
The results vary but the indexed version was consistently faster that the unindexed one.
With fast access to parts of the dictionary through an indexed table, you can bind the UITableView to the database using NSFecthedResultsController. This class takes care of fecthing records as required, caches results to improve performance and allows predicates to be easily specified.
An example of how to use the NSFetchedResultsController is included in the iPhone Developers Cookbook. See main.m

Just keep the strings in a file on the disk, and do the binary search directly in the file.
So: you say the file is 3.2mb. Suppose the format of the file is like this:
key DELIMITER value PAIRDELIMITER
where key is a string, and value is the value you want to associate. The DELIMITER and PAIRDELIMITER must be chosen as such that they don't occur in the value and key.
Furthermore, the file must be sorted on the key
With this file you can just do the binary search in the file itself.
Suppose one types a letter, you go to the half of the file, and search(forwards or backwards) to the first PAIRDELIMITER. Then check the key and see if you have to search upwards or downwards. And repeat untill you find the key you need,
I'm betting this will be fast enough.

Store your dictionary in Core Data and use NSFetchedResultsController to manage the display of these dictionary entries in your table view. Loading all 125,000 words into memory at once is a terrible idea, both performance- and memory-wise. Using the -setFetchBatchSize: method on your fetch request for loading the words for your table, you can limit NSFetchedResultsController to only handling the small subset of words that are visible at any given moment, plus a little buffer. As the user scrolls up and down the list of words, new batches of words are fetched in transparently.
A case like yours is exactly why this class (and Core Data) was added to iPhone OS 3.0.

Do you need to store/load all data at once?
Maybe you can just load the chunk of strings you need to display and load all other strings in the background.

Perhaps you can load data into memory in one thread and search from it in another? You may not get search results instantly, but having some searches feel snappier may be better than none at all, by waiting until all data are loaded.

Are some words searched more frequently or repeatedly than others? Perhaps you can cache frequently searched terms in a separate database or other store. Load it in a separate thread as a searchable store, while you are loading the main store.
As for a data structure solution, you might look into a suffix trie to search for substrings in linear time. This will probably increase your storage requirements, though, which may affect your ability to implement this with an iPhone's limited memory and disk storage capabilities.

I really don't think you're on the right path trying to load everything at once.
You've already determined that your bottleneck is the deserialization.
Regardless what the UI does, the user only sees a handful (literally) of search results at a time.
SQLlite already has a robust indexing mechanism, there is likely no need to re-invent that wheel with your own indexing, etc.
IMHO, you need to rethink how you are using UITableView. It only needs a few screenfuls of data at a time, and you should reuse cell objects as they scroll out of view rather than creating a ton of them to begin with.
So, use SQLlite's indexing and grab "TOP x" rows, where x is the right balance between giving the user some immediately-available rows to scroll through without spending too much time loading them. Set the table's scroll bar scaling using a separate SELECT COUNT(*) query, which only needs to be updated when the user types something different.
You can always go back and cache aggressively after you deserialize enough to get something up on-screen. A slight lag after the first flick or typing a letter is more acceptable than a 7-second delay just starting the app.

I have currently a somewhat similar coding problem with a large amount of searchable strings.
My solution is to store the prepared data in one large memory array, containing both the texttual data and offsets as links. Meaning I do not allocate objects for each item. This makes the data use less memory and also allows me to load & save it to a file without further processing.
Not sure if this is an option for you, since this is quite an obvious solution once you've realized that the object tree is causing the slowdown.

I use a large NSData memory block, then search through it. Well, there's more to it, it took me about two days to get it well optimized.
In your case I suspect you have a dictionary with a lot of words that have similar beginnings. You could prepare them on another computer in a format the both compacts the data and also facilitates fast lookup. As a first step, the words should be sorted. With that, you can already perform a binary search on them for a fast lookup. If you store it all in one large memory area, you can do the search quite fast, compared to how sqlite would search, I think.
Another way would be to see the words as a kind of tree: You have many thousands that begin with the same letter. So you divide your data accordingly: You have a sql table for each beginning letter of your set of words. that way, if you look up a word, you'd select one of the now-smaller tables depening on the first letter. This makes the amount that has to be searched already much smaller. and you can do this for the 2nd and 3rd letter as well, and you already could have quite a fast access.
Did this give you some ideas?

Well actually I figured it out myself in the end, but of course I thank you all for your quick and pertinent answers. To be concise I will just say that, the fact that Objective-C, just like any other object-based programming language, due to introspection and other objective requirements is significantly slower than procedural programming languages.
The solution was in fact to load all my data in a continuous chunk of memory using malloc (a char **) and search on-demand in it and transform to objects. This concluded in a .5 sec loading time (from file to memory) and resonable (should be read "fast") operations during execution. Thank you all again and if you have any questions I'm here for you. Thanks

How to reduce the size of an sqlite3 database for iphone?

edit: many thanks for all the answers. Here are the results after applying the optimisations so far:
Switching to sorting the characters and run length encoding - new DB size 42M
Dropping the indexes on the booleans - new DB size 33M
The really nice part is this hasn't required any changes in the iphone code
I have an iphone application with a large dictionary held in sqlite format (read only). I'm looking for ideas to reduce the size of the DB file, which is currently very large.
Here is the number of entries and resulting size of the sqlite DB:
franks-macbook:DictionaryMaker frank$ ls -lh dictionary.db
-rw-r--r-- 1 frank staff 59M 8 Oct 23:08 dictionary.db
franks-macbook:DictionaryMaker frank$ wc -l dictionary.txt
453154 dictionary.txt
...an average of about 135 bytes per entry.
Here is my DB schema:
create table words (word text primary key, sowpods boolean, twl boolean, signature text)
create index sowpods_idx on words(sowpods)
create index twl_idx on words(twl)
create index signature_idx on words(signature)
Here is some sample data:
photoengrave|1|1|10002011000001210101010000
photoengraved|1|1|10012011000001210101010000
photoengraver|1|1|10002011000001210201010000
photoengravers|1|1|10002011000001210211010000
photoengraves|1|1|10002011000001210111010000
photoengraving|1|1|10001021100002210101010000
The last field represents the letter frequencies for anagram retrieval (each position is in the range 0..9). The two booleans represent sub dictionaries.
I need to do queries such as:
select signature from words where word = 'foo'
select word from words where signature = '10001021100002210101010000' order by word asc
select word from words where word like 'foo' order by word asc
select word from words where word = 'foo' and (sowpods='1' or twl='1')
One idea I have is to encode the letter frequencies more efficiently, e.g. binary encode them as a blob (perhaps with RLE as there are many zeros?). Any ideas for how best to achieve this, or other ideas to reduce the size? I am building the DB in ruby, and reading it on the phone in objective C.
Also is there any way to get stats on the DB so I can see what is using the most space?

Have you tried typing the "vacuum" command to make sure you don't have extra space in the db you forgot to reclame?

Remove the indexes on sowpods and twl -- they are probably not helping your query times and are definitely taking lots of space.
You can get stats on the database using sqlite3_analyzer from the SQLite downloads page.

As a totally different approach, you could try using a bloom filter instead of a comprehensive database. Basically, a bloom filter consists of a bunch of hash functions, each of which is associated with a bitfield. For each legal word, each hash function is evaluated, and the corresponding bit in the corresponding bit field is set. Drawback is it's theoretically possible to get false positives, but those can be minimized/practically eliminated with enough hashes. Plus side is a huge space savings.

I'm not clear on all the use cases for the signature field but it seems like storing an alphabetized version of the word instead would be beneficial.

The creator of SQLite sells a version of SQLite that includes database compression (and encryption). This would be perfect.

Your best bet is to use compression, which unfortunately SQLite does not support natively at this point. Luckily, someone took the time to develop a compression extension for it which could be what you need.
Otherwise I'd recommend storing your data mostly in compressed format and uncompressing on the fly.

As a text field, signature is currently using at least 26 * 8 bytes per entry (208 bytes) but if you were to pack the data into a bitfield, you could probably get away with only 3 bits per letter (reducing your maximum frequency per letter to 7). That would mean you could pack the entire signature in 26 * 3 bits = 78 bits = 10 bytes. Even if you used 4 bits per letter (for a maximum frequency of 15 per letter) you would only use 104 bits (13 bytes).
EDIT: After a bit more thought, I think 4 bits per letter (instead of 3) would be a better idea because it would make the binary math easier.
EDIT2: Reading through the docs on SQLite data types, it seems that you might be able to just make the "signature" field span 26 columns of type INTEGER and SQLite will do the right thing and only use as many bits as required to store the value.

Do I reckon correctly that you have about 450K words like that in your database ?
I've got no clue about iPhone, neither serious about sqlitem but... as long as sqlite does not allow for a way to save the file as gz right away (it maybe already does internally? no, does not look like that when you say it's about 135 b per entry. not even with both indexes), I would move away from the table approach, save it "manually" in a dictionary approach compression and build the rest on the fly and in memory. That should perform VERY well on your type of data.
Wait... Are you using that signature to allow for fulltextsearching or mistyping recogition ? Would full text search on sqlite not obsolete that field ?

As noted storing "Signature" more efficiently seems like a good idea.
However, it also seems like you could gain a ton of space savings by using some kind of lookup table for words - since you seem to be taking a root word and then appending "er", "ed", "es", etc why not have a column with a numeric ID that references a root word from a separate lookup table, and then a separate column with a numeric ID that references a table of common word suffixes that would be appended to the base word.
If there were any tricks around storing shorthand versions of signatures for multiple entries with a single root word, you could also employ those to reduce the size of stored signatures (not sure what algorithm is producing those values)
This also seems to make a lot of sense to me as you have the "word" column as a primary key, but do not even index it - just create a separate numeric column that is the primary ID for the table.

mhmm... an iPhone... doesn't it have a permanent data connection ?
I think this is where a webapplication/webservice can jump in snugly.
Move most of your business logic to the webserver (he's gonna have real SQL with FTS and looooots of memory) and fetch that info online to the client on the device.

As mentioned elsewhere, lose the indexes on the boolean columns, they will almost certainly be slower (if used at all) than a table scan and are going to use space needlessly.
I'd consider applying a simple compression to the words, Huffman coding is pretty good for this sort of thing. Also, I'd look at the signatures: sort the columns in letter frequency order and don't bother storing trailing zeroes, which can be implied. I guess you could Huffman-encode those, too.
Always assuming your encoded strings don't upset SQLite, of course.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse