What is wrong with my extendible hashing solution?

I have my solution to an extendible hashing exercise below, and I was wondering whether hashing it this way is a correct representation. I know that the directory can also be shown in binary, and that you rehash and increase your global depth every time a clash (bucket overflow) happens. But is this also a correct representation?
There is an extendible hash table with leaf size M = 4 entries, and with the directory initially indexed using the two most significant bits. Consider insertion, into an initially empty table, of the keys that hash into the following values: 0100010, 0100100, 1000000, 0110101, 0101111, 1000001, 0100000, 1001000, 1001001, 1000010.
A. Show the state of the table after the first 8 insertions.
B. Show the state of the table after the remaining two insertions.
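For reference, here is a minimal sketch (my own, in Python, with made-up class names) of MSB-indexed extendible hashing with leaf size M = 4, matching the setup above: the directory index is the top global-depth bits of a 7-bit hash value, the directory doubles when a full bucket already has local depth equal to the global depth, and the full bucket then splits. Tracing the ten keys through something like this is one way to check a hand-drawn answer.
KEY_BITS = 7   # the hash values in the exercise are 7 bits wide
M = 4          # leaf (bucket) capacity
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.items = []
class ExtendibleHash:
    def __init__(self, global_depth=2):
        self.global_depth = global_depth
        self.dir = [Bucket(global_depth) for _ in range(2 ** global_depth)]  # one bucket per slot to start
    def _index(self, h):
        return h >> (KEY_BITS - self.global_depth)   # top global_depth bits pick the directory slot
    def insert(self, h):
        bucket = self.dir[self._index(h)]
        if len(bucket.items) < M:
            bucket.items.append(h)
            return
        if bucket.local_depth == self.global_depth:  # no room to split: double the directory
            self.dir = [b for b in self.dir for _ in (0, 1)]
            self.global_depth += 1
        self._split(bucket)
        self.insert(h)                               # retry (may recurse if all keys share a long prefix)
    def _split(self, bucket):
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        slots = [i for i, b in enumerate(self.dir) if b is bucket]
        for i in slots[len(slots) // 2:]:            # second half of the slots now point at the new bucket
            self.dir[i] = new_bucket
        items, bucket.items = bucket.items, []
        for item in items:                           # redistribute the old bucket's entries
            self.dir[self._index(item)].items.append(item)
keys = ["0100010", "0100100", "1000000", "0110101", "0101111",
        "1000001", "0100000", "1001000", "1001001", "1000010"]
table = ExtendibleHash()
for k in keys[:8]:
    table.insert(int(k, 2))                          # part A: state after the first 8 insertions
for i, b in enumerate(table.dir):
    print(f"{i:0{table.global_depth}b}", [f"{x:07b}" for x in b.items])
One modelling choice here is to give every initial directory slot its own empty bucket of local depth 2; some textbooks instead start with a single bucket of local depth 0, so adjust to match your course's convention.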

Related

Hash Table Confusion - How much space is needed for a Hash Table with a good (e.g. cryptographic) Hash Function?

I am learning about Hash Tables, Hash Maps etc. I have just implemented a Hash Table in C, with operations: insert(HTable, key), delete(HTable, key), initialize(HTable) and search(HTable, key).
I would like to ask something. Since in a (proper) Hash Table the computed hashed indexes could be very large, doesn't this mean that the space consumed will be like INT_MAX (which is still O(n) of course), or more? I mean given the input element that we want to store in a hash table (ie insert it in), the insert() function would call the hash function which would then compute the hashed index for the element to go in. Thus it would use the hash function to find this index.
When we use the hash function to operate on the element, the hashed index could become very large. With a proper, for example cryptographic, hash function this index could become huge (they use prime numbers with 300 digits - Diffie-Hellman public key cryptography etc.), right? I know that in normal hash functions (such as the trivial ones beginners use to learn) we apply a mod operation so the element fits within the hash table's bounds, but in doing so, maybe we limit the hash function's potential?
So to uniquely map an element to the hash table we must use a HUGE hash table. How are these cryptographic hash tables implemented? They must be completely secure, right? Even the Stack Overflow tag on "cryptographichashfunction" says that it is extremely unlikely to find two inputs that map to the same element (so the possibility of collisions is tiny). Wouldn't this, though, require a HUGE array to be stored in memory (or on disk)? The memory consumption would therefore be huge.
Of course, the time complexity is not a problem. We just take the start address of the hash table / array, add the index to it, and go to that place in memory to get the value (the O(1) search principle of a hash table).
Am I wrong somewhere? Is there something I'm missing? I hope I made myself clear. So to conclude, I would like confirmation on this. Does a good hash function require a huge array (hash table), and as such a very large amount of memory, to be properly implemented? Is so much space justified, or is there something I don't quite get? Thanks.
In general, cryptographic hash values are not used for hash tables. Instead a fast hash is used, and of that hash value only as many bits as needed are used, to match the size of the table. If multiple keys map to the same index then the values are stored in a separate structure, possibly with additional information to choose between them.
It is not required that the hash output is unique; the hash function output would be too large and the table required would certainly not fit in memory. Besides that, cryptographic hashes are generally quite slow.
Cryptographic hash functions are usually built from operations also used in symmetric block ciphers: mixing and bitwise operators applied over a large number of rounds. Modular arithmetic, as used for e.g. RSA, is commonly not used.
All in all, the main thing is that the index generated doesn't need to be unique. Usually if one hash leads to multiple values they are stored in a list or set where the key can be compared by value.
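As a rough illustration of what this answer describes (not any particular library), here is a tiny chained hash table in Python: a fast hash reduced modulo the table size, collisions kept in a small per-slot list, and keys compared by value on lookup.
class ChainedHashTable:
    def __init__(self, size=16):
        self.slots = [[] for _ in range(size)]   # one bucket (list) per slot
    def _index(self, key):
        return hash(key) % len(self.slots)       # use only as many bits of the hash as the table needs
    def insert(self, key, value):
        bucket = self.slots[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                         # on collision, compare keys by value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
    def search(self, key):
        for k, v in self.slots[self._index(key)]:
            if k == key:
                return v
        return None
t = ChainedHashTable()
t.insert("apple", 1)
t.insert("pear", 2)
print(t.search("apple"))   # 1
The table stays small no matter how wide the hash output is, because only the reduced index is used for addressing; the full key is kept alongside the value to resolve collisions.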

PostgreSQL using UUID vs Text as primary key

Our current PostgreSQL database is using GUID's as primary keys and storing them as a Text field.
My initial reaction to this is that trying to perform any kind of minimal cartesian join would be a nightmare of indexing trying to find all the matching records. However, perhaps my limited understanding of database indexing is wrong here.
I'm thinking that we should be using UUID as these are stored as a binary representation of the GUID where a Text is not and the amount of indexing that you get on a Text column is minimal.
It would be a significant project to change these, and I'm wondering if it would be worth it?
When dealing with UUID numbers, store them as data type uuid. Always. There is simply no good reason to even consider text as an alternative. Input and output is done via the text representation by default anyway. The cast is very cheap.
The data type text requires more space in RAM and on disk, is slower to process and more error prone. @khampson's answer provides most of the rationale. Oddly, he doesn't seem to arrive at the same conclusion.
This has all been asked and answered and discussed before. Related questions on dba.SE with detailed explanation:
Would index lookup be noticeably faster with char vs varchar when all values are 36 chars
What is the optimal data type for an MD5 field?
bigint?
Maybe you don't need UUIDs (GUIDs) at all. Consider bigint instead. It only occupies 8 bytes and is faster in every respect. Its range is often underestimated:
-9223372036854775808 to +9223372036854775807
That's about 9.2 million million million positive numbers. IOW, nine quintillion two hundred twenty-three quadrillion three hundred seventy-two trillion thirty-six something billion.
If you burn 1 million IDs per second (which is an insanely high number) you can keep doing so for 292471 years. And then another 292471 years for negative numbers. "Tens or hundreds of millions" is not even close.
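If you want to sanity-check that arithmetic yourself, a couple of lines of Python (nothing Postgres-specific) reproduce the 292471-year figure:
max_bigint = 2**63 - 1                         # 9223372036854775807
ids_per_year = 1_000_000 * 60 * 60 * 24 * 365  # one million IDs per second, all year
print(max_bigint // ids_per_year)              # 292471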
UUID is really just for distributed systems and other special cases.
As @Kevin mentioned, the only way to know for sure with your exact data would be to compare and contrast both methods, but from what you've described, I don't see why this would be different from any other case where a string is either the primary key in a table or part of a unique index.
What can be said up front is that your indexes will probably be larger, since they have to store larger string values, and in theory the comparisons for the index will take a bit longer, but I wouldn't advocate premature optimization if doing so would be painful.
In my experience, I have seen very good performance on a unique index using md5sums on a table with billions of rows. I have found that it tends to be other factors about a query which result in performance issues. For example, when you end up needing to query over a very large swath of the table, say hundreds of thousands of rows, a sequential scan ends up being the better choice, so that's what the query planner chooses, and it can take much longer.
There are other mitigating strategies for that type of situation, such as chunking the query and then UNIONing the results (e.g. a manual simulation of the sort of thing that would be done in Hive or Impala in the Hadoop sphere).
Re: your concern about indexing of text, while I'm sure there are some cases where a dataset produces a key distribution such that it performs terribly, GUIDs, much like md5sums, sha1's, etc. should index quite well in general and not require sequential scans (unless, as I mentioned above, you query a huge swath of the table).
One of the big factors in how an index performs is how many unique values there are. For that reason, a boolean index on a table with a large number of rows isn't likely to help, since it basically ends up having a huge number of row collisions for any of the values (true, false, and potentially NULL) in the index. A GUID index, on the other hand, is likely to have a huge number of values with no collisions (in theory by definition, since they are GUIDs).
Edit in response to comment from OP:
So are you saying that a UUID guid is the same thing as a Text guid as far as the indexing goes? Our entire table structure is using Text fields with a guid-like string, but I'm not sure Postgres recognizes it as a GUID. Just a string that happens to be unique.
Not literally the same, no. However, I am saying that they should have very similar performance for this particular case, and I don't see why optimizing up front is worth doing, especially given that you say to do so would be a very involved task.
You can always change things later if, in your specific environment, you run into performance problems. However, as I mentioned earlier, I think if you hit that scenario, there are other things that would likely yield better performance than changing the PK data types.
A UUID is a 128-bit data type (so, 16 bytes), whereas text has 1 or 4 bytes of overhead plus the actual length of the string. For a GUID, that would mean a minimum of 33 bytes, but could vary significantly depending on the encoding used.
So, with that in mind, certainly indexes of text-based UUIDs will be larger since the values are larger, and comparing two strings versus two numerical values is in theory less efficient, but it's not something that's likely to make a huge difference in this case, at least not in usual cases.
I would not optimize up front when doing so would be a significant cost and is likely never to be needed. That bridge can be crossed if that time does come (although I would pursue other query optimizations first, as I mentioned above).
Regarding whether Postgres knows the string is a GUID, it definitely does not by default. As far as it's concerned, it's just a unique string. But that should be fine for most cases, e.g. matching rows and such. If you find yourself needing some behavior that specifically requires a GUID (for example, some non-equality based comparisons where a GUID comparison may differ from a purely lexical one), then you can always cast the string to a UUID, and Postgres will treat the value as such during that query.
e.g. for a text column foo, you can do foo::uuid to cast it to a uuid.
There's also a module available for generating uuids, uuid-ossp.
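As an illustration only (the connection string, the table name items and the column name id_text are made up), the cast and server-side generation might look like this from Python with psycopg2:
import psycopg2
conn = psycopg2.connect("dbname=mydb")                      # hypothetical database
cur = conn.cursor()
cur.execute('CREATE EXTENSION IF NOT EXISTS "uuid-ossp"')   # needs sufficient privileges
cur.execute("SELECT uuid_generate_v4()")                    # server-side UUID generation from uuid-ossp
print(cur.fetchone()[0])
# cast the text column to uuid only where UUID semantics are actually needed
cur.execute("SELECT count(*) FROM items WHERE id_text::uuid = %s::uuid",
            ("123e4567-e89b-12d3-a456-426614174000",))
print(cur.fetchone()[0])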

Good encoding for VARCHAR with similarity across rows

What is a good Amazon Redshift column encoding for a VARCHAR column where each row contains a short (usually 50-100 characters) value that contains little repetition, but for which there is a high degree of similarity across the rows? (Identical prefixes, in particular.)
The maddeningly terse LZO description makes it sound like LZO is applied individually to each value. In that case, there will be no shared dictionary across the rows and little commonality to exploit. OTOH, if the LZO is applied to an entire 1 MB block of values written to disk, it would perform well.
Byte Dictionary sounds like it only yields savings when the values are identical rather than similar, so not a good option.
Compression is applied per block, which means that LZO is almost always the right choice for VARCHAR. Most of the other alternatives require the values to be either completely identical to other values (e.g. BYTEDICT, RUNLENGTH), or be numeric (e.g. DELTA, MOSTLY8).
The only other alternative for VARCHARs is TEXT255/TEXT32K, which might work for your use case. These build a dictionary of the first N words (245 for TEXT255, variable for TEXT32K) and replace occurrences of those words with a one-byte index. If your values share a lot of words then TEXT255 might work better than LZO.
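A toy model of that word-dictionary idea (my own sketch, not Redshift's actual implementation) can help estimate whether your values would benefit; note it only pays off when rows share whole whitespace-delimited words, not merely character prefixes inside a single token:
def text255_estimate(values, dict_slots=245):
    dictionary = set()
    encoded = 0
    for value in values:
        for word in value.split():
            if word in dictionary:
                encoded += 1                      # later occurrences cost roughly one byte
            else:
                if len(dictionary) < dict_slots:
                    dictionary.add(word)          # the first N distinct words fill the dictionary
                encoded += len(word) + 1          # stored literally
    return encoded
rows = ["connection timeout from host-01 in eu-west-1",
        "connection timeout from host-02 in eu-west-1",
        "connection refused from host-01 in eu-west-1"]
print(text255_estimate(rows), sum(len(r) for r in rows))   # rough encoded size vs raw size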

HBase row key design for monotonically increasing keys

I've an HBase table where I'm writing the row keys like:
<prefix>~1
<prefix>~2
<prefix>~3
...
<prefix>~9
<prefix>~10
The scan on the HBase shell gives an output:
<prefix>~1
<prefix>~10
<prefix>~2
<prefix>~3
...
<prefix>~9
How should a row key be designed so that the row with key <prefix>~10 comes last? I'm looking for some recommended ways or the ways that are more popular for designing HBase row keys.
How should a row key be designed so that the row with key <prefix>~10 comes last?
You see the scan output in this way because rowkeys in HBase are kept sorted lexicographically, irrespective of the insertion order. This means they are sorted based on their string (byte) representations. Remember that rowkeys in HBase are treated as an array of bytes having a string representation. The lowest-order rowkey appears first in a table. That's why <prefix>~10 appears before <prefix>~2, and so on. See the section on Rows on this page to know more about this.
When you left-pad the integers with zeros, their natural ordering is kept intact when sorting lexicographically, and that's why the scan order matches the order in which you inserted the data. To do that you can design your rowkeys as suggested by @shutty.
I'm looking for some recommended ways or the ways that are more popular for designing HBase row keys.
There are some general guidelines to follow in order to devise a good design:
Keep the rowkey as small as possible.
Avoid using monotonically increasing rowkeys, such as timestamps. This is a poor schema design and leads to RegionServer hotspotting. If you can't avoid that, use some technique, like hashing or salting, to avoid hotspotting (a small sketch follows this answer).
Avoid using Strings as rowkeys if possible. The String representation of a number takes more bytes than its integer or long representation. For example: a long is 8 bytes. You can store an unsigned number up to 18,446,744,073,709,551,615 in those eight bytes. If you stored this number as a String -- presuming a byte per character -- you need nearly 3x the bytes.
Use some mechanism, like hashing, to get a uniform distribution of rows in case your regions are not evenly loaded. You could also create pre-split tables to achieve this.
See this link for more on rowkey design.
HTH
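To make the hashing/salting guideline above concrete, here is a small sketch (my own, in Python; the bucket count and key layout are assumptions) that turns a monotonically increasing ID into a salted, fixed-width rowkey so writes spread across regions:
import hashlib
NUM_BUCKETS = 16   # roughly match the number of pre-split regions
def salted_key(seq_id):
    salt = int(hashlib.md5(str(seq_id).encode()).hexdigest(), 16) % NUM_BUCKETS
    return f"{salt:02d}~{seq_id:010d}"   # fixed-width salt plus zero-padded id keeps keys sortable per bucket
print(salted_key(1))
print(salted_key(2))   # consecutive ids usually land in different salt buckets
The trade-off is that a full time-ordered scan now has to read NUM_BUCKETS ranges instead of one.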
HBase stores rowkeys in lexicographical order, so you can try to use this schema with fixed-length rowkeys:
<prefix>~0001
<prefix>~0002
<prefix>~0003
...
<prefix>~0009
<prefix>~0010
Keep in mind that you also should use random prefixes to avoid region hot-spotting (when a single region accepts most of the writes, while the other regions are idle).
Monotonically increasing keys aren't a good schema for HBase.
You can read more here:
http://hbase.apache.org/book/rowkey.design.html
There is also a link there to OpenTSDB, which solves this problem.
Fixed length keys are really recommended if possible. Bytes.toBytes(Long value) can be used to get a byte array from a counter. It will sort well for positive longs less than Long.MAX_VALUE.
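The same idea sketched in Python rather than with HBase's Bytes.toBytes: a fixed-width big-endian encoding makes byte-wise (lexicographic) order agree with numeric order for non-negative values.
import struct
def long_key(value):
    return struct.pack(">q", value)   # 8-byte big-endian signed long
keys = [long_key(v) for v in (1, 2, 9, 10, 100)]
assert keys == sorted(keys)           # byte order matches numeric order for these non-negative values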

How to reduce the size of an sqlite3 database for iphone?

edit: many thanks for all the answers. Here are the results after applying the optimisations so far:
Switching to sorting the characters and run length encoding - new DB size 42M
Dropping the indexes on the booleans - new DB size 33M
The really nice part is this hasn't required any changes in the iPhone code.
I have an iPhone application with a large dictionary held in sqlite format (read only). I'm looking for ideas to reduce the size of the DB file, which is currently very large.
Here is the number of entries and resulting size of the sqlite DB:
franks-macbook:DictionaryMaker frank$ ls -lh dictionary.db
-rw-r--r-- 1 frank staff 59M 8 Oct 23:08 dictionary.db
franks-macbook:DictionaryMaker frank$ wc -l dictionary.txt
453154 dictionary.txt
...an average of about 135 bytes per entry.
Here is my DB schema:
create table words (word text primary key, sowpods boolean, twl boolean, signature text)
create index sowpods_idx on words(sowpods)
create index twl_idx on words(twl)
create index signature_idx on words(signature)
Here is some sample data:
photoengrave|1|1|10002011000001210101010000
photoengraved|1|1|10012011000001210101010000
photoengraver|1|1|10002011000001210201010000
photoengravers|1|1|10002011000001210211010000
photoengraves|1|1|10002011000001210111010000
photoengraving|1|1|10001021100002210101010000
The last field represents the letter frequencies for anagram retrieval (each position is in the range 0..9). The two booleans represent sub dictionaries.
I need to do queries such as:
select signature from words where word = 'foo'
select word from words where signature = '10001021100002210101010000' order by word asc
select word from words where word like 'foo' order by word asc
select word from words where word = 'foo' and (sowpods='1' or twl='1')
One idea I have is to encode the letter frequencies more efficiently, e.g. binary encode them as a blob (perhaps with RLE as there are many zeros?). Any ideas for how best to achieve this, or other ideas to reduce the size? I am building the DB in ruby, and reading it on the phone in objective C.
Also is there any way to get stats on the DB so I can see what is using the most space?
Have you tried running the "vacuum" command to make sure you don't have extra space in the db you forgot to reclaim?
Remove the indexes on sowpods and twl -- they are probably not helping your query times and are definitely taking lots of space.
You can get stats on the database using sqlite3_analyzer from the SQLite downloads page.
As a totally different approach, you could try using a bloom filter instead of a comprehensive database. Basically, a bloom filter consists of a bunch of hash functions, each of which is associated with a bitfield. For each legal word, each hash function is evaluated, and the corresponding bit in the corresponding bit field is set. Drawback is it's theoretically possible to get false positives, but those can be minimized/practically eliminated with enough hashes. Plus side is a huge space savings.
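For a feel of the approach, here is a minimal bloom filter sketch (the sizes and the seeded-md5 hashing are illustrative choices, not recommendations): k hash functions set k bits per word, and lookups can give rare false positives but never false negatives.
import hashlib
class BloomFilter:
    def __init__(self, num_bits=1 << 22, num_hashes=7):   # about 512 KB of bits
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)
    def _positions(self, word):
        for i in range(self.num_hashes):                  # k seeded hashes per word
            digest = hashlib.md5(f"{i}:{word}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits
    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)
    def might_contain(self, word):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word))
bf = BloomFilter()
bf.add("photoengrave")
print(bf.might_contain("photoengrave"))   # True
print(bf.might_contain("zzzzzz"))         # almost certainly False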
I'm not clear on all the use cases for the signature field but it seems like storing an alphabetized version of the word instead would be beneficial.
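The alphabetized-word idea is essentially a one-liner: any two anagrams share the same sorted-letter key, so it can serve the same role as the frequency signature for exact anagram lookups.
def anagram_key(word):
    return "".join(sorted(word))
print(anagram_key("listen") == anagram_key("silent"))   # True: anagrams collapse to one key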
The creator of SQLite sells a version of SQLite that includes database compression (and encryption). This would be perfect.
Your best bet is to use compression, which unfortunately SQLite does not support natively at this point. Luckily, someone took the time to develop a compression extension for it which could be what you need.
Otherwise I'd recommend storing your data mostly in compressed format and uncompressing on the fly.
As a text field, signature is currently using at least 26 * 8 bits = 26 bytes per entry, but if you were to pack the data into a bitfield, you could probably get away with only 3 bits per letter (reducing your maximum frequency per letter to 7). That would mean you could pack the entire signature into 26 * 3 bits = 78 bits = 10 bytes. Even if you used 4 bits per letter (for a maximum frequency of 15 per letter) you would only use 104 bits (13 bytes).
EDIT: After a bit more thought, I think 4 bits per letter (instead of 3) would be a better idea because it would make the binary math easier.
EDIT2: Reading through the docs on SQLite data types, it seems that you might be able to just make the "signature" field span 26 columns of type INTEGER and SQLite will do the right thing and only use as many bits as required to store the value.
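Here is what the 4-bits-per-letter packing could look like (the OP builds the DB in Ruby and reads it in Objective-C, but the packing is the same idea in any language): two digits per byte turns the 26-character text signature into a 13-byte blob.
def pack_signature(sig):                      # sig is 26 digits, each 0..9
    digits = [int(c) for c in sig]
    return bytes((digits[i] << 4) | digits[i + 1] for i in range(0, 26, 2))
def unpack_signature(blob):
    return "".join(f"{b >> 4}{b & 0x0F}" for b in blob)
packed = pack_signature("10002011000001210101010000")
assert len(packed) == 13
assert unpack_signature(packed) == "10002011000001210101010000"
Exact-match queries on the signature still work against the packed blob, as long as the query value is packed the same way before comparison.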
Do I reckon correctly that you have about 450K words like that in your database?
I've got no clue about the iPhone, and I'm not that serious about sqlite either, but... as long as sqlite does not offer a way to save the file as gz right away (maybe it already does internally? no, it doesn't look like it when you say it's about 135 bytes per entry, not even with both indexes), I would move away from the table approach, save it "manually" in a compressed dictionary-style format, and build the rest on the fly and in memory. That should perform VERY well on your type of data.
Wait... Are you using that signature to allow for full-text searching or mistyping recognition? Wouldn't full-text search in sqlite make that field obsolete?
As noted, storing "signature" more efficiently seems like a good idea.
However, it also seems like you could gain a ton of space savings by using some kind of lookup table for words. Since you seem to be taking a root word and then appending "er", "ed", "es", etc., why not have a column with a numeric ID that references a root word in a separate lookup table, and then a separate column with a numeric ID that references a table of common word suffixes that would be appended to the base word?
If there were any tricks around storing shorthand versions of signatures for multiple entries with a single root word, you could also employ those to reduce the size of stored signatures (not sure what algorithm is producing those values)
This also seems to make a lot of sense to me as you have the "word" column as a primary key, but do not even index it - just create a separate numeric column that is the primary ID for the table.
mhmm... an iPhone... doesn't it have a permanent data connection?
I think this is where a web application / web service can jump in snugly.
Move most of your business logic to the webserver (he's gonna have real SQL with FTS and looooots of memory) and fetch that info online to the client on the device.
As mentioned elsewhere, lose the indexes on the boolean columns, they will almost certainly be slower (if used at all) than a table scan and are going to use space needlessly.
I'd consider applying a simple compression to the words, Huffman coding is pretty good for this sort of thing. Also, I'd look at the signatures: sort the columns in letter frequency order and don't bother storing trailing zeroes, which can be implied. I guess you could Huffman-encode those, too.
Always assuming your encoded strings don't upset SQLite, of course.