From reading some stuff about TOAST I've learned that Postgres uses an LZ-family compression algorithm, which it calls PGLZ, and that it kicks in automatically for values larger than 2KB.
How does PGLZ compare to GZIP in terms of speed and compression ratio?
I'm curious whether PGLZ and GZIP have similar speeds and compression ratios, such that doing an extra GZIP step before inserting large JSON strings into Postgres would be unnecessary or even harmful.
It's significantly faster, but has a lower compression ratio than gzip. It's optimised for lower CPU costs.
There's definitely a place for gzip'ing large data before storing it in a bytea field, assuming you don't need to manipulate it directly in the DB, or don't mind having to use a function to un-gzip it first. You can do it with things like plpython or plperl if you must do it in the DB, but it's usually more convenient to just do it in the app.
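As a rough sketch of the in-app approach (Python with psycopg2; the table docs and its bytea column body are made up for illustration):

    import gzip
    import json
    import psycopg2

    # Hypothetical payload and connection string
    big_json_obj = {"events": [{"id": i, "msg": "x" * 50} for i in range(1000)]}
    conn = psycopg2.connect("dbname=mydb")

    # Compress in the application before inserting into a bytea column
    payload = gzip.compress(json.dumps(big_json_obj).encode("utf-8"))
    with conn, conn.cursor() as cur:
        cur.execute("INSERT INTO docs (body) VALUES (%s) RETURNING id",
                    (psycopg2.Binary(payload),))
        doc_id = cur.fetchone()[0]

    # Reading it back means decompressing in the app (or via plpython/plperl in the DB)
    with conn, conn.cursor() as cur:
        cur.execute("SELECT body FROM docs WHERE id = %s", (doc_id,))
        restored = json.loads(gzip.decompress(bytes(cur.fetchone()[0])).decode("utf-8"))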
If you're going to go to the effort of doing extra compression though, consider using a stronger compression method like LZMA.
There have been efforts to add support for gzip and/or LZMA compression to TOAST in PostgreSQL. The main problem with doing so has been that we need to maintain compatibility with the on-disk format of older versions, make sure it stays compatible into the future, etc. So far nobody's come up with an implementation that satisfies the relevant core team members. See e.g. pluggable compression support. It tends to get stuck in a catch-22 where pluggable support gets rejected (see that thread for why), but nobody can agree on a suitable, software-patent-safe algorithm to adopt as a new default, on how to change the format to handle multiple compression methods, etc.
Custom compression methods are becoming a reality. As reported here:
https://www.postgresql.org/message-id/20180618173045.7f734aca%40wp.localdomain
synthetic tests showed that zlib gives more compression but is usually slower than pglz.
I suppose that storing images (or any binary data - pdfs, movies, etc.) outside of the DB (MongoDB in my case) and putting them in a public server folder should be at least faster (no encoding, decoding and the overhead around that).
But since there is such an option in MongoDB, I'd like to know the advantages of using it, and the use cases where that approach is recommended.
Replication: It is pretty easy to set up a highly available replica set, so even if one machine goes down the files will still be available. While this can be achieved by various means for a plain filesystem as well, the overhead of doing so might well eliminate any performance advantage (if there is one: MongoDB has quite sophisticated internal caching going on). Furthermore, setting up DRBD and ensuring consistency and availability requires considerably more knowledge and administrative effort than with MongoDB. Plus, you'd need your DB to be highly available as well.
Scalability: It can get quite complicated and/or costly when your files exceed the storage capacity of a single node. While in theory you can scale vertically, there is a certain point where the bang you get for the buck decreases and scaling horizontally makes more sense. However, with a filesystem approach, you'd have to manage which file is located on which node, how and when to rebalance, and so on. MongoDB's GridFS in a sharded environment does this for you automatically and, more importantly, transparently. You neither have to reinvent the wheel nor maintain it.
Query by metadata: While in theory you could do this with a database plus links to a filesystem, GridFS comes with the means to attach arbitrary metadata and query by it. Again, this saves you reinventing the wheel. An interesting example: finding duplicates is quite easy with GridFS, since a hash sum is automatically calculated for each file. With a rather simple aggregation, you can find dupes and then deal with them accordingly.
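To make the metadata point concrete, here is a minimal pymongo/GridFS sketch (database, file and field names are made up; note that the automatic md5 field is written by older drivers/servers and may be absent with newer ones):

    import gridfs
    from pymongo import MongoClient

    db = MongoClient()["mydb"]        # hypothetical database
    fs = gridfs.GridFS(db)

    # Store a file together with arbitrary metadata
    with open("cat.jpg", "rb") as f:  # hypothetical file
        fs.put(f, filename="cat.jpg", metadata={"album": "pets", "owner": 42})

    # Query by that metadata
    for grid_out in fs.find({"metadata.album": "pets"}):
        print(grid_out.filename, grid_out.length)

    # Find duplicates by grouping on the md5 hash stored in fs.files
    dupes = db.fs.files.aggregate([
        {"$group": {"_id": "$md5", "count": {"$sum": 1}, "ids": {"$push": "$_id"}}},
        {"$match": {"count": {"$gt": 1}}},
    ])
    for group in dupes:
        print("same content stored under file ids:", group["ids"])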
When you have a large amount of binary data and want to take advantage of sharding, you can store the binary data in MongoDB using GridFS. But from a performance point of view, as you pointed out, storing the images on a filesystem is obviously the better way.
When creating data tables in Amazon Redshift, you can specify various encodings such as MOSTLY32 or BYTEDICT or LZO. Those are the compressions used when storing the columnar values on disk.
I am wondering if my choice of encoding is supposed to make a difference in query execution times. For example, if I make a column BYTEDICT would that make a difference over LZO when it comes to SELECTs, GROUP BYs or FILTERs?
Yes. The compression encoding used translates into the amount of disk storage used. Generally, the lower the storage, the better the query performance.
But which encoding is more beneficial to you depends on your data type and its distribution. There is no guarantee that LZO will always be better than BYTEDICT or vice versa. In my experience, I usually load some sample data into the intended table, then run ANALYZE COMPRESSION. Whatever Redshift suggests, I go with it. That has worked for me.
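A sketch of that workflow (Python with psycopg2 pointed at Redshift; the cluster endpoint and my_table are placeholders):

    import psycopg2

    conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                            port=5439, dbname="dev", user="awsuser", password="...")
    conn.autocommit = True

    with conn.cursor() as cur:
        # After loading a representative sample into the table:
        cur.execute("ANALYZE COMPRESSION my_table;")
        for table, column, encoding, est_reduction_pct in cur.fetchall():
            print(f"{table}.{column}: suggested {encoding} "
                  f"(~{est_reduction_pct}% estimated reduction)")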
Amazon has actually released a Python script that can apply this automatically to your database. You can find the script here: https://github.com/awslabs/amazon-redshift-utils/blob/master/src/ColumnEncodingUtility/analyze-schema-compression.py
Bit late but likely useful to anyone taking a look here:
Amazon can now decide on the best compression to use (Loading Tables with Automatic Compression) if you are using a COPY command to load your table and there is no existing compression defined on your table.
You just have to add COMPUPDATE ON to your COPY command.
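For example (bucket, IAM role and table name are placeholders; automatic compression only applies when the target table is empty and has no encodings defined):

    import psycopg2

    conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                            port=5439, dbname="dev", user="awsuser", password="...")
    with conn, conn.cursor() as cur:
        cur.execute("""
            COPY my_table
            FROM 's3://my-bucket/logs/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
            FORMAT AS CSV
            COMPUPDATE ON;
        """)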
I have a problem...
I need to store a daily barrage of about 3,000 mid-sized XML documents (100 to 200 data elements).
The data is somewhat unstable in the sense that the schema changes from time to time and the changes are not announced with enough advance notice, but need to be dealt with retroactively on an emergency "hotfix" basis.
The consumption pattern for the data involves both a website and some simple analytics (some averages and pie charts).
MongoDB seems like a great solution except for one problem; it requires converting between XML and JSON. I would prefer to store the XML documents as they arrive, untouched, and shift any intelligent processing to the consumer of the data. That way any bugs in the data-loading code will not cause permanent damage. Bugs in the consumer(s) are always harmless since you can fix and re-run without permanent data loss.
I don't really need "massively parallel" processing capabilities. It's about 4GB of data which fits comfortably in a 64-bit server.
I have eliminated from consideration Cassandra (due to complex setup) and CouchDB (due to lack of familiar features such as indexing, which I will need initially because of my RDBMS ways of thinking).
So finally here's my actual question...
Is it worthwhile to look at native XML databases, which are not as mature as MongoDB, or should I bite the bullet and convert all the XML to JSON as it arrives and just use MongoDB?
You may have a look at BaseX (Basex.org), with a built-in XQuery processor and Lucene text indexing.
That Data Volume is Small
If there is no need for parallel data processing, there is no need for MongoDB. Especially with small amounts of data like 4GB, the overhead of distributing work can easily exceed the actual evaluation effort.
4GB / 60k nodes is not large for XML databases, either. After some time getting into it, you will come to appreciate XQuery as a great tool for XML document analysis.
Is it Really?
Or do you receive 4GB daily and have to evaluate that plus all the data you have already stored? Then you will eventually reach a volume you cannot store and process on one machine any more, and distributing the work will become necessary. Not within days or weeks, but a year will already bring you 1TB.
Converting to JSON
What does your input look like? Does it adhere to any schema, or even resemble tabular data? MongoDB's capabilities for analyzing semi-structured data are much weaker than what XML databases provide. On the other hand, if you only want to pull a few fields on well-defined paths and can process one input file after the other, MongoDB probably will not suffer much.
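If you do end up converting, here is a minimal sketch of the kind of shallow XML-to-document mapping this involves (standard library plus pymongo; the element, database and collection names are invented), which also keeps the untouched original alongside the parsed form:

    import xml.etree.ElementTree as ET
    from pymongo import MongoClient

    def element_to_dict(elem):
        """Turn an XML element into a plain dict of attributes, text and children."""
        node = dict(elem.attrib)
        if elem.text and elem.text.strip():
            node["#text"] = elem.text.strip()
        for child in elem:
            node.setdefault(child.tag, []).append(element_to_dict(child))
        return node

    xml_doc = "<order id='42'><item sku='A1'>3</item><item sku='B2'>1</item></order>"
    root = ET.fromstring(xml_doc)

    MongoClient()["mydb"]["orders"].insert_one({
        "raw_xml": xml_doc,                           # the document as it arrived
        "parsed": {root.tag: element_to_dict(root)},  # queryable/indexable form
    })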
Carrying XML into the Cloud
If you want to use both an XML database's capabilities in analyzing the data and some NoSQL's systems capabilities in distributing the work, you could run the database from that system.
BaseX is getting to the cloud with exactly the capabilities you need -- but it will probably still take some time for that feature to get production-ready.
Can anyone explain why the memcached folks decided to support multi-get but not multi-set?
By "multi" I mean an operation involving more than one key (see the protocol at http://code.google.com/p/memcached/wiki/NewCommands).
So you can get multiple keys in one shot (the basic advantage being the usual savings from fewer round trips), but why can't you do bulk sets?
My theory is that it was meant to encourage fewer sets, done individually (e.g. on a cache read that misses). But I still do not see how multi-set really conflicts with the general philosophy of memcached.
I looked at the client features at http://code.google.com/p/memcached/wiki/NewCommonFeatures and it seems that some clients potentially do support "Multi-Set" (why only in the binary protocol?). I am using Java spymemcached, btw.
It's not supported in the text protocol because it'd be very, very complicated to express, no clients would support it, and it would provide very little that you can't already do from the text protocol.
It's supported in the binary protocol because it's a trivial use case of binary operations.
spymemcached supports it implicitly -- just do a bunch of sets and magic happens:
http://dustin.github.com/2009/09/23/spymemcached-optimizations.html
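For illustration, the same pattern with the python-memcached client (a different client than spymemcached, used here only so the example matches the other Python snippets; the server address is a local default):

    import memcache  # python-memcached

    mc = memcache.Client(["127.0.0.1:11211"])

    # Sets are issued per key ...
    for key, value in {"user:1": "alice", "user:2": "bob"}.items():
        mc.set(key, value, time=300)

    # ... while a multi-get fetches many keys in one round trip
    print(mc.get_multi(["user:1", "user:2", "user:3"]))  # missing keys are simply absent

    # set_multi exists too, but over the text protocol it is still one set
    # command per key on the wire, just written out together by the client
    mc.set_multi({"user:3": "carol", "user:4": "dave"}, time=300)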
I don't know a lot about memcached internals, but I assume writes have to be blocking, atomic operations. I assume that by allowing multiple set operations to be batched, you could block all reads for a long time (or risk a get occurring while only half of a write had been applied). Forcing writes to be done individually allows them to be interleaved fairly with gets.
I would imagine that the restriction against using multi sets is to avoid collisions when writing cached values to the memcache.
As an object cache, I can't foresee an example of when you would need transactional type writes. This use case seems less suited for a caching layer, but better suited for the underlying database.
If sets come in interleaved from different clients, it is most likely the case that for one key, the last one wins, or is at least close enough, until the cache is invalidated and a newer value is written.
As Gian mentions, there don't seem to be any good reasons to block reads from the cache while several or many writes to the cache happen.
I'm going to store a large amount of data (logs) in partitioned PostgreSQL tables (one table per day). I would like to compress some of them to save space on my disks, but I don't want to lose the ability to query them in the usual manner.
Does PostgreSQL support such a transparent compression and where can I read about it in more detail? I think there should be some well-known magic name for such a feature.
Yes, PostgreSQL will do this automatically for you when values go above a certain size. Compression is applied to each individual data value, though, not at the full-table level. That means that if you have a billion rows that are all very narrow, they won't get compressed; the same goes for very many columns that each hold only a small value. Details about this scheme are in the manual.
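You can check whether it is happening for your values by comparing the stored size with the uncompressed length (a quick sketch, assuming a table named logs with a text column named payload):

    import psycopg2

    with psycopg2.connect("dbname=mydb") as conn, conn.cursor() as cur:
        # pg_column_size() reports the stored (possibly TOAST-compressed) size,
        # octet_length() the uncompressed size of the value
        cur.execute("""
            SELECT pg_column_size(payload) AS stored_bytes,
                   octet_length(payload)   AS raw_bytes
            FROM logs
            ORDER BY raw_bytes DESC
            LIMIT 10;
        """)
        for stored, raw in cur.fetchall():
            print(f"{stored} bytes stored for {raw} bytes of raw data")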
If you need it at the full-table level, one solution is to create a TABLESPACE for the tables you want compressed and point it at a compressed filesystem. As long as the filesystem still obeys fsync() and standard POSIX semantics, this should be perfectly safe. Details about this are in the manual.
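A sketch of the tablespace route (the mount point is assumed to be a filesystem with compression enabled, e.g. ZFS or Btrfs, already created and owned by the postgres user; all names here are made up):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=postgres")
    conn.autocommit = True  # CREATE TABLESPACE cannot run inside a transaction block

    with conn.cursor() as cur:
        cur.execute("CREATE TABLESPACE compressed_space "
                    "LOCATION '/mnt/zfs_compressed/pgdata';")
        cur.execute("""
            CREATE TABLE logs_2013_01_01 (
                ts      timestamptz NOT NULL,
                message text
            ) TABLESPACE compressed_space;
        """)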
Probably not what you have in mind but still useful info - Chapter 53. Database Physical Storage of the fine manual. The TOAST section warrants further attention.