PostgreSQL and S3QL for storing/accessing lots of data - postgresql

We're currently using Postgres 9 on Amazon's EC2 and are very satisfied with the performance. Now we're looking at adding ~2TB of data to Postgres, which is larger than our EC2 small instance can hold.
I found S3QL and am considering using it in conjunction with moving the Postgres data directory to S3 storage. Has anyone had experience with doing this? I'm mainly concerned with performance (frequent reads, less frequent writes). Any advice is welcome, thanks.

My advice is "don't do that". I don't know anything about the context of your problem, but I guess that a solution doesn't have to involve doing bulk data processing through PostgreSQL. The whole reason grid processing systems were invented was to solve the problem of analyzing large data sets. I think you should consider building a system that follows standard BI practices around extracting dimensional data. Then take that normalized data and, assuming it's still pretty large, load it into Hadoop/Pig. Do your analysis and aggregation there. Dump the resulting aggregate data into a file and load that into your PG database along-side the dimensions.

Related

NoSQL in a single machine

As part of my university curriculum I ended up with a real project which consists in helping a company shifting from their relational data warehouse into a NoSQL data warehouse. The thing is that what they are looking for is better performance in large jobs but so far they have used a single machine and if they indeed migrate to NoSQL they still wish to keep using a single machine for cost reasons.
As far as I know the whole point of NoSQL is to run it in a large distributed system of several machines. So I don't see the point of this migration, specially since I am pretty sure (but not entirely) that if they do install NoSQL, they will probably end having even worst performance.
But still I am not comfortable telling them this since I am still new to this area (less than a month), so I wonder, is there are any situation where using NoSQL in a single machine for a datawarehouse would be justifiable performance wise? Or is it just a plain bad idea?
The answer to your question, like the answer to so many questions, is "it depends."
Ignoring the commentary on the question, I think there may be legitimacy to your client's question. Both relational and non-relational databases ultimately hold data in key-value tuples, with indexes and such to ensure quick and speedy access to the data. The difference is that SQL/relational databases contain an incredible amount of overhead to attempt the optimal way to retrieve results given an unknown set of queries, as well as ensure stable concurrency. This overhead is both computationally expensive and rarely results in the optimal solution. As a result, SQL databases often perform significantly slower for simple repetitive queries.
No-sql databases, on the other hand, are more of a "bare-bones" database, relying on programmers and intelligent design to achieve success. They are optimized to retrieve a value for a given key very quickly, often sub-millisecond. As a result, increased up front investment in the design results in superior and near-optimal performance. It will be necessary to determine the cost-benefit of doing this up-front design, but it is all but guaranteed that the no-sql approach will perform better regardless of the number of machines involved (in fact, SQL databases are very difficult or impossible to cluster together and is one of the main reasons why NoSql was developed).
Eventually we will see relational-like solutions implemented on a no-sql platform. In fact, Mongo, Elasticsearch, and Couchbase (probably others) already have SQL-like query functionality. But right now, you are faced with this dilemma.
For a single machine if the load is write heavy e.g. your logging a lot of events you could do for cassandra. Also a good alternative is hbase but its heavy and not suggested for single node. If they expose api in json you could look into document based dbs such as couchbase, mongo db. If you have an idea about the load then selecting a nosql data store is much easier
If you're in a position where you need to pick one, I think you should look first at MongoDB. If you've never tried it, I really recommend you visit their live demo with tutorial and give it a try. If you like, download and follow the installation guide on their site. It's free, runs well on a single machine, and is incredibly easy to use.
In addition to MongoDB, I've used Oracle, SQL Server, MySQL, SQLite, and HBase. I understand Cassandra should be in the list but I've not tried it. With MongoDB, I was fully deployed and executing reads and writes from an application in like two hours. I attribute most of that to their website's clear and concise instructional content. The biggest learning curve was figuring out how the queries work for things like updating a record or deleting a record without deleting the entire set of similar records.
Regarding NoSQL vs RDBMS, some points to consider:
Adding a new column to RDBMS table can lock the database in or degrade performance in another
MongoDB is schema-less so adding a new field, does not effect old documents and will be instant (think how flexible that really is - throw any dimension of data into this system without maintenance overhead)
You're less likely to require a DBA to solve your schema problems when an application changes
I think problems related to table size are irrelevant, so you won't run into a scaling problem - just a disk space problem on single machine

Which NoSQL ... again :), but a different use case

Suggestions for a NoSQL datastore so that we can push data and generate real time Qlikview reports easily?
Easily means:
1. Qlikview support for reads (mongodb connector available, otherwise maybe can write a JDBC connector, otherwise maybe can write a custom QVX connector to the datastore)
Easily adaptable to changes in schema, or schemaless. We change our schema quite frequently ...
Java support for writes
Super fast reads - real time incremental access, as well as batch access for old data within a time range. I read that Cassandra excels in ranges.
Reasonably fast writes
Reasonably big data storage - 20 million rows stored per day, about 200 bytes each
Would be nice if it can scale for a years worth of data, elasticity not so important.
Easy to use, install, and run. Looking at minimal setup and configuration time.
Matlabe support for adhoc querying
Initially I don't think we need a distributed system however a cluster is a possibility.
I've looked at Mongodb, Cassandra and Hbase. I don't think going over REST is a good idea due to (theoretically) slower performance.
I'm leaning towards MongoDB at the moment due to its ease of use, matlab support, totally schema less, Qlikview support (beta connector is available). However if anyone can suggest something better that would be great!
Depending on the server infrastructure you will use, I guess the best choice is amazon's NoSQL service, avalaible in aws.amazon.com.
The fact is any DB will have a poor performance in cloud infrastructure due to the way it stores data, amazon EC2 with EBS for instance is VERY slow for this task, requiring you to connect up to 20 EBS volumes in raid to acquire a decent speed. They solved this issue creating this NoSQL service, which I never used, but seems nice.

Which noSQL database is best for high volume inserts / writes?

Which nosql system is better equipped for handling high volume inserts out of the box?
Preferably, running on 1 physical machine (many instances allowed).
Has anyone done any benchmarks? (googling did not help)
Note: I understand that choosing noSQL database depends on what kind of data needs to be stored (document:MongoDB, graph:Neo4j, etc.).
If you want fast write speed, you can just insert your data into memory and flush data to the disc at a background every minute or so. That should be fastest solution.
MongoDB and Redis do this actually. For example, in mongodb you can go without journal enabled and writes will be very fast. But keep in mind that if you store data in memory at a single server there is possibility to loose your data (data that not flushed to the disc yet) when your server goes down.
In general, what database to use highly depends on data you want to store and task you are trying to solve.
Apache Cassandra is great in write operations, thanks to its unique persistence model. Some claim that it writes about 20 times faster than it reads but I believe it's really dependent on your usage profile.
Read about it in their FAQ and in various blog posts.
That is, of course, if you have "classical" DB profile of large amounts of data. If your data is small, or is used temporarily and/or as a cache layer, then of course opt for Redis which has the fastest throughput both for reads and for writes, since it's memory-based (with eventual disk persistence).
If you're dealing with a complex object model for inserts your best option is an object database like Versant's:
http://www.versant.com/vision/The_Magic_Cube.aspx
According to my benchmarks, Cassandra is better than MongoDB on large arrays, but MongodDB is more flexible.

What is the best database/storage to store statistic data?

I'm having a system that collects real-time Apache log data from about 90-100 Web Servers. I had also defined some url patterns.
Now I want to build another system that updates the time of occurrence of each pattern based on those logs.
I had thought about using MySQL to store statistic data, update them by statement:
"Update table set count=count+1 where ....",
but i'm afraid that MySQL will be slow for data from such amount of servers. Moreover, I'm looking for some database/storage solutions that more scalable and simple. (As a RDBMS, MySQL supports too much things that I don't need in this situation) . Do you have any idea ?
Apache Cassandra is a high-performance column-family store and can scale extremely well. The learning curve is a bit steep, but will have no problem handling large amounts of data.
A more simple solution would be a key-value store, like Redis. It's easier to understand than Cassandra. Redis only seems to support master-slave replication as a way to scale, so the write performance of your master server could be a bottleneck. Riak has a decentralized architecture without any central nodes. It has no single point of failure nor any bottlenecks, so it's easier to scale out.
Key value storage seems to be an appropriate solution for my system. After taking a quick look on those storages, I'm concerning about race-condition issue, as there will be a lot of clients trying to do these steps on the same key:
count = storage.get(key)
storage.set(key,count+1)
I had worked with Tokyo Cabinet before, and they have 'addint' method which perfectly matched with my case, I wonder if other storages have similar feature? I didn't choose Tokyo Cabinet/Tyrant cause I had experienced some issues about its scalability and data stability (e.g. repair corrupted data, ...)

Has anyone used an object database with a large amount of data?

Object databases like MongoDB and db4o are getting lots of publicity lately. Everyone that plays with them seems to love it. I'm guessing that they are dealing with about 640K of data in their sample apps.
Has anyone tried to use an object database with a large amount of data (say, 50GB or more)? Are you able to still execute complex queries against it (like from a search screen)? How does it compare to your usual relational database of choice?
I'm just curious. I want to take the object database plunge, but I need to know if it'll work on something more than a sample app.
Someone just went into production with a 12 terabytes of data in MongoDB. The largest I knew of before that was 1 TB. Lots of people are keeping really large amounts of data in Mongo.
It's important to remember that Mongo works a lot like a relational database: you need the right indexes to get good performance. You can use explain() on queries and contact the user list for help with this.
When I started db4o back in 2000 I didn't have huge databases in mind. The key goal was to store any complex object very simply with one line of code and to do that good and fast with low ressource consumption, so it can run embedded and on mobile devices.
Over time we had many users that used db4o for webapps and with quite large amounts of data, going close to todays maximum database file size of 256GB (with a configured block size of 127 bytes). So to answer your question: Yes, db4o will work with 50GB, but you shouldn't plan to use it for terabytes of data (unless you can nicely split your data over multiple db4o databases, the setup costs for a single database are negligible, you can just call #openFile() )
db4o was acquired by Versant in 2008, because it's capabilites (embedded, low ressource-consumption, lightweight) make it a great complimentary product to Versant's high-end object database VOD. VOD scales for huge amounts of data and it does so much better than relational databases. I think it will merely chuckle over 50GB.
MongoDB powers SourceForge, The New York Times, and several other large databases...
You should read the MongoDB use cases. People who are just playing with technology are often just looking at how does this work and are not at the point where they can understand the limitations. For the right sorts of datasets and access patterns 50GB is nothing for MongoDB running on the right hardware.
These non-relational systems look at the trade-offs which RDBMs made, and changed them a bit. Consistency is not as important as other things in some situations so these solutions let you trade that off for something else. The trade-off is still relatively minor ms or maybe secs in some situations.
It is worth reading about the CAP theorem too.
I was looking at moving the API I have for sure with the stack overflow iphone app I wrote a while back to MongoDB from where it currently sits in a MySQL database. In raw form the SO CC dump is in the multi-gigabyte range and the way I constructed the documents for MongoDB resulted in a 10G+ database. It is arguable that I didn't construct the documents well but I didn't want to spend a ton of time doing this.
One of the very first things you will run into if you start down this path is the lack of 32 bit support. Of course everything is moving to 64 bit now but just something to keep in mind. I don't think any of the major document databases support paging in 32 bit mode and that is understandable from a code complexity standpoint.
To test what I wanted to do I used a 64 bit instance EC2 node. The second thing I ran into is that even though this machine had 7G of memory when the physical memory was exhausted things went from fast to not so fast. I'm not sure I didn't have something set up incorrectly at this point because the non-support of 32 bit system killed what I wanted to use it for but I still wanted to see what it looked like. Loading the same data dump into MySQL takes about 2 minutes on a much less powerful box but the script I used to load the two database works differently so I can't make a good comparison. Running only a subset of the data into MongoDB was much faster as long as it resulted in a database that was less than 7G.
I think my take away from it was that large databases will work just fine but you may have to think about how the data is structured more than you would with a traditional database if you want to maintain the high performance. I see a lot of people using MongoDB for logging and I can imagine that a lot of those databases are massive but at the same time they may not be doing a lot of random access so that may mask what performance would look like for more traditional applications.
A recent resource that might be helpful is the visual guide to nosql systems. There are a decent number of choices outside of MongoDB. I have used Redis as well although not with as large of a database.
Here's some benchmarks on db4o:
http://www.db4o.com/about/productinformation/benchmarks/
I think it ultimately depends on a lot of factors, including the complexity of the data, but db4o seems to certainly hang with the best of them.
Perhaps worth a mention.
The European Space Agency's Planck mission is running on the Versant Object Database.
http://sci.esa.int/science-e/www/object/index.cfm?fobjectid=46951
It is a satelite with 74 onboard sensors launched last year which is mapping the infrarred spectrum of the universe and storing the information in a map segment model. It has been getting a ton of hype these days because of it's producing some of the coolest images ever seen of the universe.
Anyway, it has generated 25T of information stored in Versant and replicated across 3 continents. When the mission is complete next year, it will be a total of 50T
Probably also worth noting, object databases tend to be a lot smaller to hold the same information. It is because they are truly normalized, no data duplication for joins, no empty wasted column space and few indexes rather than 100's of them. You can find public information about testing ESA did to consider storage in multi-column relational database format -vs- using a proper object model and storing in the Versant object database. THey found they could save 75% disk space by using Versant.
Here is the implementation:
http://www.planck.fr/Piodoc/PIOlib_Overview_V1.0.pdf
Here they talk about 3T -vs- 12T found in the testing
http://newscenter.lbl.gov/feature-stories/2008/12/10/cosmic-data/
Also ... there are benchmarks which show Versant orders of magnitude faster on the analysis side of the mission.
CHeers,
-Robert