What is a good Amazon Redshift column encoding for a VARCHAR column where each row contains a short (usually 50-100 characters) value that contains little repetition, but for which there is a high degree of similarity across the rows? (Identical prefixes, in particular.)
The maddeningly terse LZO description makes it sound like LZO is applied individually to each value. In that case, there will be no shared dictionary across the rows and little commonality to exploit. OTOH, if the LZO is applied to an entire 1 MB block of values written to disk, it would perform well.
Byte Dictionary sounds like it only yields savings when the values are identical rather than similar, so not a good option.
Compression is applied per block, which means that LZO is almost always the right choice for VARCHAR. Most of the other alternatives require the values to be either completely identical to other values (e.g. BYTEDICT, RUNLENGTH), or be numeric (e.g. DELTA, MOSTLY8).
The only other alternatives for VARCHARs are TEXT255/TEXT32K, which might work for your use case. They build dictionaries of the first N words (245 for TEXT255, variable for TEXT32K) and replace occurrences of these words with a one-byte index. If your values share a lot of words, then TEXT255 might work better than LZO.
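If you want to spell the choice out explicitly rather than rely on automatic compression analysis, the encoding is just a column attribute in the DDL. A minimal sketch using psycopg2 (the connection string, table, and column names are placeholders, not anything from the question):

import psycopg2

ddl = """
CREATE TABLE url_log (
    id      BIGINT,
    url     VARCHAR(200) ENCODE LZO,     -- block-level LZO: a good default for similar strings
    url_alt VARCHAR(200) ENCODE TEXT255  -- word dictionary: helps when values share many words
);
"""

conn = psycopg2.connect("dbname=analytics host=my-cluster user=admin")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(ddl)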
Related
I have a table with over 120 million rows.
Running analyze compression tbl; suggests LZO encoding for almost every VARCHAR field, but I think that RUNLENGTH encoding may be better for fields with a finite number of options (traffic source, category, etc.).
So should I move certain fields to another encoding or stay with LZO?
Thoughts on runlength
The point about runlength, rather than a finite number of options, is that field values are repeated over many consecutive rows. This is usually the case when the table is sorted by that column. You are right, though, that the fewer distinct values you have, the more likely it is for any particular value to occur in a sequence.
Documentation
Redshift states in their documentation:
We do not recommend applying runlength encoding on any column that is designated as a sort key. Range-restricted scans perform better when blocks contain similar numbers of rows. If sort key columns are compressed much more highly than other columns in the same query, range-restricted scans might perform poorly.
And also:
LZO encoding provides a very high compression ratio with good performance. LZO encoding works especially well for CHAR and VARCHAR columns that store very long character strings, especially free form text, such as product descriptions, user comments, or JSON strings.
Benchmark
So, ultimately, you'll have to take a close look at your data, the way it is sorted, the way you are going to join other tables on it and, if in doubt, benchmark the encodings. Create the same table twice and apply runlength encoding to the column in one table, and LZO in the other. Ideally, you already have a query that you know will be used often. Run it several times for each table and compare the results.
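A rough harness for that comparison, again via psycopg2; it assumes two hypothetical copies of the table (words_lzo and words_runlength) already loaded with the same data, and a placeholder query standing in for one you actually run often:

import time
import psycopg2

QUERY = "SELECT count(*) FROM {table} WHERE traffic_source = 'organic';"  # placeholder query

conn = psycopg2.connect("dbname=analytics host=my-cluster user=admin")  # placeholder DSN
with conn, conn.cursor() as cur:
    for table in ("words_lzo", "words_runlength"):
        timings = []
        for _ in range(5):                      # several runs, as suggested above
            start = time.perf_counter()
            cur.execute(QUERY.format(table=table))
            cur.fetchall()
            timings.append(time.perf_counter() - start)
        print(table, min(timings))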
My recommendation
Do your queries perform ok? Then don't worry about encoding and take Redshift's suggestion. If you want to take it as a learning project, then make sure that you are aware of how performance improves or degrades when you double (quadruple, ...) the rows in the table. 120 million rows are not many and it might well be that one encoding looks great now but will cause queries to perform poorly when a certain threshold is passed.
Our current PostgreSQL database is using GUIDs as primary keys and storing them as a text field.
My initial reaction to this is that trying to perform any kind of minimal cartesian join would be an indexing nightmare when trying to find all the matching records. However, perhaps my limited understanding of database indexing is wrong here.
I'm thinking that we should be using uuid instead, since it is stored as a binary representation of the GUID whereas text is not, and the indexing you get on a text column is minimal.
It would be a significant project to change these, and I'm wondering if it would be worth it?
When dealing with UUID values, store them as the data type uuid. Always. There is simply no good reason to even consider text as an alternative. Input and output are done via the text representation by default anyway, and the cast is very cheap.
The data type text requires more space in RAM and on disk, is slower to process, and is more error prone. @khampson's answer provides most of the rationale. Oddly, he doesn't seem to arrive at the same conclusion.
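A minimal sketch of what that looks like in practice (psycopg2; the table name and connection string are invented for illustration) -- text input is simply cast on the way in and stored as a fixed 16 bytes:

import uuid
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("CREATE TABLE accounts (id uuid PRIMARY KEY, name text);")
    cur.execute(
        "INSERT INTO accounts (id, name) VALUES (%s::uuid, %s);",  # cheap cast from text
        (str(uuid.uuid4()), "example"),
    )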
This has all been asked and answered and discussed before. Related questions on dba.SE with detailed explanation:
Would index lookup be noticeably faster with char vs varchar when all values are 36 chars
What is the optimal data type for an MD5 field?
bigint?
Maybe you don't need UUIDs (GUIDs) at all. Consider bigint instead. It only occupies 8 bytes and is faster in every respect. Its range is often underestimated:
-9223372036854775808 to +9223372036854775807
That's 9.2 million million million positive numbers. IOW, nine quintillion two hundred twenty-three quadrillion three hundred seventy-two trillion and change.
If you burn 1 million IDs per second (which is an insanely high rate), you can keep doing so for about 292,471 years. And then another 292,471 years for negative numbers. "Tens or hundreds of millions" is not even close.
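A quick sanity check of that arithmetic:

max_bigint = 2**63 - 1                        # 9223372036854775807
ids_per_year = 1_000_000 * 60 * 60 * 24 * 365
print(max_bigint // ids_per_year)             # ~292471 years at a million IDs per second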
UUID is really just for distributed systems and other special cases.
As @Kevin mentioned, the only way to know for sure with your exact data would be to compare and contrast both methods, but from what you've described, I don't see why this would be different from any other case where a string is either the primary key in a table or part of a unique index.
What can be said up front is that your indexes will probably be larger, since they have to store larger string values, and in theory the comparisons for the index will take a bit longer, but I wouldn't advocate premature optimization if doing so would be painful.
In my experience, I have seen very good performance on a unique index using md5sums on a table with billions of rows. I have found that other factors about a query tend to be what cause performance issues. For example, when you end up needing to query over a very large swath of the table, say hundreds of thousands of rows, a sequential scan ends up being the better choice, so that's what the query planner chooses, and it can take much longer.
There are other mitigating strategies for that type of situation, such as chunking the query and then UNIONing the results (e.g. a manual simulation of the sort of thing that would be done in Hive or Impala in the Hadoop sphere).
Re: your concern about indexing of text, while I'm sure there are some cases where a dataset produces a key distribution such that it performs terribly, GUIDs, much like md5sums, sha1's, etc. should index quite well in general and not require sequential scans (unless, as I mentioned above, you query a huge swath of the table).
One of the big factors about how an index would perform is how many unique values there are. For that reason, a boolean index on a table with a large number of rows isn't likely to help, since it basically is going to end up having a huge number of row collisions for any of the values (true, false, and potentially NULL) in the index. A GUID index, on the other hand, is likely to have a huge number of values with no collision (in theory definitionally, since they are GUIDs).
Edit in response to comment from OP:
So are you saying that a UUID guid is the same thing as a text guid as far as the indexing goes? Our entire table structure is using text fields with a guid-like string, but I'm not sure Postgres recognizes it as a GUID. Just a string that happens to be unique.
Not literally the same, no. However, I am saying that they should have very similar performance for this particular case, and I don't see why optimizing up front is worth doing, especially given that you say to do so would be a very involved task.
You can always change things later if, in your specific environment, you run into performance problems. However, as I mentioned earlier, I think if you hit that scenario, there are other things that would likely yield better performance than changing the PK data types.
A UUID is a 128-bit data type (so, 16 bytes), whereas text has 1 or 4 bytes of overhead plus the actual length of the string. For a GUID, that would mean a minimum of 33 bytes, but could vary significantly depending on the encoding used.
So, with that in mind, certainly indexes of text-based UUIDs will be larger since the values are larger, and comparing two strings is in theory less efficient than comparing two numeric values, but it's not something that's likely to make a huge difference in this case, at least not in usual cases.
I would not optimize up front when doing so would be a significant cost and is likely to never be needed. That bridge can be crossed if that time does come (although I would pursue other query optimizations first, as I mentioned above).
Regarding whether Postgres knows the string is a GUID, it definitely does not by default. As far as it's concerned, it's just a unique string. But that should be fine for most cases, e.g. matching rows and such. If you find yourself needing some behavior that specifically requires a GUID (for example, some non-equality based comparisons where a GUID comparison may differ from a purely lexical one), then you can always cast the string to a UUID, and Postgres will treat the value as such during that query.
e.g. for a text column foo, you can do foo::uuid to cast it to a uuid.
There's also a module available for generating UUIDs, uuid-ossp.
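A minimal sketch of using it (psycopg2; the table name is invented, and creating the extension requires sufficient privileges):

import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute('CREATE EXTENSION IF NOT EXISTS "uuid-ossp";')
    cur.execute("""
        CREATE TABLE sessions (
            id   uuid PRIMARY KEY DEFAULT uuid_generate_v4(),  -- generated server-side
            data text
        );
    """)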
I have an HBase table where I'm writing row keys like:
<prefix>~1
<prefix>~2
<prefix>~3
...
<prefix>~9
<prefix>~10
The scan on the HBase shell gives an output:
<prefix>~1
<prefix>~10
<prefix>~2
<prefix>~3
...
<prefix>~9
How should a row key be designed so that the row with key <prefix>~10 comes last? I'm looking for some recommended ways or the ways that are more popular for designing HBase row keys.
How should a row key be designed so that the row with key <prefix>~10 comes last?
You see the scan output in this way because rowkeys in HBase are kept sorted lexicographically, irrespective of the insertion order. This means that they are sorted based on their string representations. Remember that rowkeys in HBase are treated as arrays of bytes with a string representation. The lowest-order rowkey appears first in a table. That's why 10 appears before 2, and so on. See the Rows section of the HBase data model documentation to know more about this.
When you left-pad the integers with zeros, their natural ordering is kept intact when sorting lexicographically, so the scan order matches the order in which you inserted the data. To do that, you can design your rowkeys as suggested by @shutty.
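A tiny illustration of why the padding works -- zero-padded keys sort lexicographically in the same order as their numeric values:

keys = [f"prefix~{i}" for i in (1, 2, 10)]
print(sorted(keys))        # ['prefix~1', 'prefix~10', 'prefix~2']  -- 10 before 2

padded = [f"prefix~{i:04d}" for i in (1, 2, 10)]
print(sorted(padded))      # ['prefix~0001', 'prefix~0002', 'prefix~0010']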
I'm looking for some recommended ways or the ways that are more popular for designing HBase row keys.
There are some general guidelines to follow in order to devise a good design:
Keep the rowkey as small as possible.
Avoid using monotonically increasing rowkeys, such as timestamps. This is a poor schema design and leads to RegionServer hotspotting. If you can't avoid it, use a technique such as hashing or salting to avoid hotspotting.
Avoid using strings as rowkeys if possible. The string representation of a number takes more bytes than its integer or long representation. For example, a long is 8 bytes, and you can store an unsigned number up to 18,446,744,073,709,551,615 in those eight bytes. If you store that number as a string -- presuming a byte per character -- you need nearly 3x the bytes.
Use some mechanism, like hashing, in order to get a uniform distribution of rows in case your regions are not evenly loaded (a salting sketch follows this list). You could also create pre-split tables to achieve this.
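A hedged sketch of the salting idea from the list above -- prepend a short, hash-derived bucket so writes spread across regions; the bucket count and key layout are illustrative choices, not something from the HBase docs:

import hashlib

NUM_BUCKETS = 16  # would typically match the number of pre-split regions

def salted_key(row_id: str) -> bytes:
    bucket = int(hashlib.md5(row_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    return f"{bucket:02d}~{row_id}".encode()

print(salted_key("prefix~0010"))  # bucket depends on the hash of the id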
See the HBase Reference Guide's section on rowkey design for more.
HTH
HBase stores rowkeys in lexicographical order, so you can try to use this schema with fixed-length rowkeys:
<prefix>~0001
<prefix>~0002
<prefix>~0003
...
<prefix>~0009
<prefix>~0010
Keep in mind that you also should use random prefixes to avoid region hot-spotting (when a single region accepts most of the writes, while the other regions are idle).
Monotonically increasing keys aren't a good schema for HBase.
You can read more here:
http://hbase.apache.org/book/rowkey.design.html
There is also a link there to OpenTSDB, which solves this problem.
Fixed-length keys are really recommended if possible. Bytes.toBytes(long value) can be used to get a byte array from a counter. It will sort well for positive longs less than Long.MAX_VALUE.
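A Python analogue of that idea, for intuition only (HBase itself would use Bytes.toBytes on the Java side): fixed-length big-endian encodings sort byte-wise in numeric order for non-negative values.

import struct

keys = [struct.pack(">q", n) for n in (1, 2, 10, 300)]  # 8-byte big-endian longs
assert sorted(keys) == keys                             # byte order matches numeric order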
Is VARCHAR(100) any better than VARCHAR(500) from a performance point of view? What about disk usage?
Talking about PostgreSQL today, not some database some time in history.
They are identical.
From the PostgreSQL documentation:
http://www.postgresql.org/docs/8.3/static/datatype-character.html
Tip: There are no performance differences between these three types, apart from increased storage size when using the blank-padded type, and a few extra cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, it has no such advantages in PostgreSQL. In most situations text or character varying should be used instead.
Here they are talking about the differences between char(n), varchar(n) and text (= varchar(1G)). The official story is that there is no difference between varchar(100) and text (very large varchar).
There is no difference between varchar(m) and varchar(n).
http://archives.postgresql.org/pgsql-admin/2008-07/msg00073.php
There is a difference between varchar(n) and text, though: varchar(n) has a built-in length constraint which must be checked and is actually a little slower.
http://archives.postgresql.org/pgsql-general/2009-04/msg00945.php
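A tiny demonstration of that length check (psycopg2 2.8+ for the errors module; throwaway temp table and a placeholder connection string):

import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # placeholder DSN
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CREATE TEMP TABLE t (a varchar(5), b text);")
    cur.execute("INSERT INTO t VALUES ('hello', 'hello');")            # fits
    try:
        cur.execute("INSERT INTO t VALUES ('too long', 'too long');")  # column a is too short
    except psycopg2.errors.StringDataRightTruncation:
        print("varchar(5) rejected the 8-character value; text took it fine")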
TEXT /is/ the same as VARCHAR without an explicit length; the text
"The storage requirement for a short string (up to 126 bytes) is 1 byte plus the actual string, which includes the space padding in the case of character. Longer strings have 4 bytes of overhead instead of 1. Long strings are compressed by the system automatically, so the physical requirement on disk might be less. Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values. In any case, the longest possible character string that can be stored is about 1 GB."
refers to both VARCHAR and TEXT (since VARCHAR(n) is just a limited version of TEXT). Limiting your VARCHARs artificially has no real storage or performance benefits (the overhead is based on the actual length of the string, not the declared limit of the varchar), except possibly for comparisons against wildcards and regexes (but at the level where that starts to matter, you should probably be looking at something like PostgreSQL's full-text indexing support).
edit: many thanks for all the answers. Here are the results after applying the optimisations so far:
Switching to sorting the characters and run length encoding - new DB size 42M
Dropping the indexes on the booleans - new DB size 33M
The really nice part is this hasn't required any changes in the iPhone code.
I have an iPhone application with a large dictionary held in SQLite format (read only). I'm looking for ideas to reduce the size of the DB file, which is currently very large.
Here is the number of entries and resulting size of the sqlite DB:
franks-macbook:DictionaryMaker frank$ ls -lh dictionary.db
-rw-r--r-- 1 frank staff 59M 8 Oct 23:08 dictionary.db
franks-macbook:DictionaryMaker frank$ wc -l dictionary.txt
453154 dictionary.txt
...an average of about 135 bytes per entry.
Here is my DB schema:
create table words (word text primary key, sowpods boolean, twl boolean, signature text)
create index sowpods_idx on words(sowpods)
create index twl_idx on words(twl)
create index signature_idx on words(signature)
Here is some sample data:
photoengrave|1|1|10002011000001210101010000
photoengraved|1|1|10012011000001210101010000
photoengraver|1|1|10002011000001210201010000
photoengravers|1|1|10002011000001210211010000
photoengraves|1|1|10002011000001210111010000
photoengraving|1|1|10001021100002210101010000
The last field represents the letter frequencies for anagram retrieval (each position is in the range 0..9). The two booleans represent sub dictionaries.
I need to do queries such as:
select signature from words where word = 'foo'
select word from words where signature = '10001021100002210101010000' order by word asc
select word from words where word like 'foo' order by word asc
select word from words where word = 'foo' and (sowpods='1' or twl='1')
One idea I have is to encode the letter frequencies more efficiently, e.g. binary-encode them as a blob (perhaps with RLE, as there are many zeros?). Any ideas for how best to achieve this, or other ideas to reduce the size? I am building the DB in Ruby, and reading it on the phone in Objective-C.
Also is there any way to get stats on the DB so I can see what is using the most space?
Have you tried running the "vacuum" command to make sure you don't have extra space in the DB you forgot to reclaim?
Remove the indexes on sowpods and twl -- they are probably not helping your query times and are definitely taking lots of space.
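Both suggestions together, as a small sqlite3 sketch (index names taken from the schema above):

import sqlite3

con = sqlite3.connect("dictionary.db", isolation_level=None)  # autocommit, so VACUUM is allowed
con.execute("DROP INDEX IF EXISTS sowpods_idx;")
con.execute("DROP INDEX IF EXISTS twl_idx;")
con.execute("VACUUM;")  # rewrite the file to reclaim the freed pages
con.close()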
You can get stats on the database using sqlite3_analyzer from the SQLite downloads page.
As a totally different approach, you could try using a bloom filter instead of a comprehensive database. Basically, a bloom filter consists of a bunch of hash functions, each of which is associated with a bitfield. For each legal word, each hash function is evaluated, and the corresponding bit in the corresponding bit field is set. Drawback is it's theoretically possible to get false positives, but those can be minimized/practically eliminated with enough hashes. Plus side is a huge space savings.
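A rough sketch of the idea (the filter size and number of hashes here are arbitrary, not tuned for 450K words):

import hashlib

M = 1 << 22   # bits in the filter (~512 KB)
K = 7         # hash functions per word

def _positions(word):
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{word}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % M

class Bloom:
    def __init__(self):
        self.bits = bytearray(M // 8)
    def add(self, word):
        for p in _positions(word):
            self.bits[p // 8] |= 1 << (p % 8)
    def __contains__(self, word):
        # any clear bit means "definitely not present"; all set means "probably present"
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in _positions(word))

words = Bloom()
words.add("photoengrave")
assert "photoengrave" in words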
I'm not clear on all the use cases for the signature field but it seems like storing an alphabetized version of the word instead would be beneficial.
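For example, an alphabetized key would let anagram lookups share a single indexed column:

def anagram_key(word):
    return "".join(sorted(word))    # all anagrams of a word produce the same key

print(anagram_key("photoengrave"))  # 'aeeghnooprtv'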
The creator of SQLite sells a version of SQLite that includes database compression (and encryption). This would be perfect.
Your best bet is to use compression, which unfortunately SQLite does not support natively at this point. Luckily, someone took the time to develop a compression extension for it which could be what you need.
Otherwise I'd recommend storing your data mostly in compressed format and uncompressing on the fly.
As a text field, signature currently uses at least 26 bytes per entry (26 characters at 8 bits each, i.e. 208 bits), but if you were to pack the data into a bitfield, you could probably get away with only 3 bits per letter (reducing your maximum frequency per letter to 7). That would mean you could pack the entire signature into 26 * 3 bits = 78 bits = 10 bytes. Even if you used 4 bits per letter (for a maximum frequency of 15 per letter), you would only use 104 bits (13 bytes).
EDIT: After a bit more thought, I think 4 bits per letter (instead of 3) would be a better idea because it would make the binary math easier.
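A sketch of that 4-bits-per-letter packing (caps each count at 15; the resulting 13-byte blob would replace the 26-character text signature):

def pack_signature(counts):          # counts: 26 ints (a..z), each 0..15
    assert len(counts) == 26 and all(0 <= c <= 15 for c in counts)
    packed = bytearray(13)
    for i, c in enumerate(counts):
        packed[i // 2] |= c << (4 if i % 2 == 0 else 0)  # two counts per byte
    return bytes(packed)

sig = "10002011000001210101010000"   # sample signature from the question
print(pack_signature([int(ch) for ch in sig]).hex())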
EDIT2: Reading through the docs on SQLite data types, it seems that you might be able to just make the "signature" field span 26 columns of type INTEGER, and SQLite will do the right thing and only use as many bytes as required to store each value.
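A sketch of that variant (column names invented); SQLite's per-row storage classes mean small integer values take only a byte or so each:

import sqlite3, string

cols = ", ".join(f"n_{c} INTEGER" for c in string.ascii_lowercase)
con = sqlite3.connect(":memory:")
con.execute(f"CREATE TABLE words (word TEXT PRIMARY KEY, {cols});")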
Do I reckon correctly that you have about 450K words like that in your database?
I've got no clue about the iPhone, and I'm not serious about SQLite either, but as long as SQLite does not offer a way to save the file gzipped right away (and it doesn't look like it compresses internally when you say it's about 135 bytes per entry, not even with both indexes), I would move away from the table approach, save the data "manually" in a compressed dictionary format, and build the rest on the fly and in memory. That should perform VERY well on your type of data.
Wait... are you using that signature to allow for full-text searching or mistyping recognition? Would full-text search on SQLite not make that field obsolete?
As noted, storing "signature" more efficiently seems like a good idea.
However, it also seems like you could gain a ton of space savings by using some kind of lookup table for words. Since you seem to be taking a root word and then appending "er", "ed", "es", etc., why not have a column with a numeric ID that references a root word in a separate lookup table, and then a separate column with a numeric ID that references a table of common word suffixes that would be appended to the base word? (A rough schema follows below.)
If there were any tricks around storing shorthand versions of signatures for multiple entries with a single root word, you could also employ those to reduce the size of stored signatures (not sure what algorithm is producing those values)
This also seems to make a lot of sense to me as you have the "word" column as a primary key, but do not even index it - just create a separate numeric column that is the primary ID for the table.
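A rough sketch of that layout (all table and column names invented; whether it actually saves space depends on how regular the suffixes are):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE roots    (root_id   INTEGER PRIMARY KEY, root   TEXT);
    CREATE TABLE suffixes (suffix_id INTEGER PRIMARY KEY, suffix TEXT);
    CREATE TABLE words    (word_id   INTEGER PRIMARY KEY,
                           root_id   INTEGER REFERENCES roots(root_id),
                           suffix_id INTEGER REFERENCES suffixes(suffix_id),
                           sowpods   INTEGER,
                           twl       INTEGER);
""")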
Hmmm... an iPhone... doesn't it have a permanent data connection?
I think this is where a web application/web service can jump in snugly.
Move most of your business logic to the web server (it will have real SQL with FTS and lots of memory) and fetch that info online to the client on the device.
As mentioned elsewhere, lose the indexes on the boolean columns, they will almost certainly be slower (if used at all) than a table scan and are going to use space needlessly.
I'd consider applying a simple compression to the words; Huffman coding is pretty good for this sort of thing. Also, I'd look at the signatures: reorder the positions by letter frequency and don't bother storing trailing zeroes, which can be implied (a sketch follows below). I guess you could Huffman-encode those, too.
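A sketch of the reordering idea; the letter order used here is the traditional ETAOIN SHRDLU frequency ranking, which is an assumption -- ideally you would derive the order from this dictionary itself:

ORDER = "etaoinshrdlucmfwypvbgkqjxz"

def reordered_signature(word):
    counts = [str(word.count(ch)) for ch in ORDER]
    return "".join(counts).rstrip("0")   # trailing zeroes are implied

print(reordered_signature("photoengrave"))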
Always assuming your encoded strings don't upset SQLite, of course.