Greenplum selection criteria - postgresql

I'm getting familiar with the greenplum solution concepts, and trying to understand whether, and if so, when the organisation I work for should use this solution. Our conceptual idea is to setup a kind of central 'datastore' suitable for both OLTP and OLAP access.
My research: this article suggests Greenplum is more suitable for OLAP, and PostgreSQL for OLTP. But I also read about Greenplum improvements for OLTP processing. And in favour of Postgresql, there are also articles like this that suggest that OLAP (eg, a datawarehouse implementation) can be done by means of Postgresql.
So my question is: how to move forward, and what are the main criteria to decide? For example, in case we now have a just a few TB's (1-5), start with a Postgresql cluster (for OLTP+OLAP), and when data volumes grow, move on to Greenplum? Or start straight away with Greenplum?

maybe use postgres if it can handle your use case. If you have you have too much data and need to finish reports and analytics faster; change to greenplum

Related

Postgresql tuning to be used as DatawareHouse

I am assembling a Business Intelligence solution using the Pentaho software as a BI engine. Within this solution, I had to set up a requirements for a PostgreSQL database server.
The current situation is very easy, since no ETL process is being carried out for data extraction, so the PostgreSQL configuration has not changed it much, and it is practically as it is configured as "factory".
I would like to know what Postgres configuration parameters have to be touched and modified to optimize it as a Datawarehouse. I have seen a lot of documentation, but it is not clear to me at all, since one documentation says that such values have to be modified, and other documentation, other completely different values.
I would like to know just that, if there is a clearer and more precise documentation to optimize a postgres 9.6 to be used as a Pentaho DW.
Thank you very much

Data mining with postgres in production environment - is there a better way?

There is a web application which is running for a years and during its life time the application has gathered a lot of user data. Data is stored in relational DB (postgres). Not all of this data is needed to run application (to do the business). However form time to time business people ask me to provide reports of this data data. And this causes some problems:
sometimes these SQL queries are long running
quires are executed against production DB (not cool)
not so easy to deliver reports on weekly or monthly base
some parts of data is stored in way which is not suitable for such
querying (queries are inefficient)
My idea (note that I am a developer not the data mining specialist) how to improve this whole process of delivering reports is:
create separate DB which regularly is update with production data
optimize how data is stored
create a dashboard to present reports
Question: But is there a better way? Is there another DB which better fits for such data analysis? Or should I look into modern data mining tools?
Thanks!
Do you really do data mining (as in: classification, clustering, anomaly detection), or is "data mining" for you any reporting on the data? In the latter case, all the "modern data mining tools" will disappoint you, because they serve a different purpose.
Have you used the indexing functionality of Postgres well? Your scenario sounds as if selection and aggregation are most of the work, and SQL databases are excellent for this - if well designed.
For example, materialized views and triggers can be used to process data into a scheme more usable for your reporting.
There are a thousand ways to approach this issue but I think that the path of least resistance for you would be postgres replication. Check out this Postgres replication tutorial for a quick, proof-of-concept. (There are many hits when you Google for postgres replication and that link is just one of them.) Here is a link documenting streaming replication from the PostgreSQL site's wiki.
I am suggesting this because it meets all of your criteria and also stays withing the bounds of the technology you're familiar with. The only learning curve would be the replication part.
Replication solves your issue because it would create a second database which would effectively become your "read-only" db which would be updated via the replication process. You would keep the schema the same but your indexing could be altered and reports/dashboards customized. This is the database you would query. Your main database would be your transactional database which serves the users and the replicated database would serve the stakeholders.
This is a wide topic, so please do your diligence and research it. But it's also something that can work for you and can be quickly turned around.
If you really want try Data Mining with PostgreSQL there are some tools which can be used.
The very simple way is KNIME. It is easy to install. It has full featured Data Mining tools. You can access your data directly from database, process and save it back to database.
Hardcore way is MADLib. It installs Data Mining functions in Python and C directly in Postgres so you can mine with SQL queries.
Both projects are stable enough to try it.
For reporting, we use non-transactional (read only) database. We don't care about normalization. If I were you, I would use another database for reporting. I will desing the tables following OLAP principals, (star schema, snow flake), and use an ETL tool to dump the data periodically (may be weekly) to the read only database to start creating reports.
Reports are used for decision support, so they don't have to be in realtime, and usually don't have to be current. In other words it is acceptable to create report up to last week or last month.

OLAP and postgresql- tool or methodology?

I was reviewing some documents for making my database perform better and I came across with "OLAP" pre-aggregation term. I was wondering if OLAP is a tool or or methodology or approach. For example my DBMS is postgresql and I am working on a big databse. To speed up I have to use some aggregation and pre-aggregation methods. How OLAP can be helpful?
OLAP is a database role. When storing OLAP data in the db, typically you aren't running live transactional information off the db, but rather keeping it around for analytical and business intelligence reasons.
It isn't a tool. It isn't an approach either, since some approaches are needed for OLAP but some are helpful together in transactional environments as well.
In general you shouldn't think about speeding up an application by incorporating OLAP into it. Instead you would look at separating out reporting functions into a separate db server, and import the data periodically, and then separate data feeds from operation data stores, etc. This is a very different field than transactional application development.

How can I use MongoDB as a cache for Postgresql?

I have an application that can not afford to lose data, so Postgresql is my choice for database (ACID)
However, speed and query advantages of MongoDB are very attractive, but based on what I've read so far, MongoDB can report a successful write which may not have gone to disk, so I can't make it my mission critical db (I'll also need transactions)
I've seen references to people using mysql and MongoDB together, one for the transactions and the other for queries. Please not that I'm not talking about keeping some data in one DB and the rest in another. I want to use Postgresql as a gateway to data entry, and MongoDB for reads.
Are there any resources that offer an architecture/guide for Postgresql + MongoDB usage in this way? I can remember seeing this topic in Postgresql conference agenda, but I could not find the link.
I don't think you'll get much speed using MongoDB just as a cache. It's strengths are replication and horizontal scalability. On one computer you'd make Mongo and Postgres compete for memory, IO bandwidth and processor time.
As you can not afford to loose transactions you'll be better with Postgres only. Its has efficient caching, sophisticated query planner, prepared queries and wide indexing support cause that read-only queries will be very fast - really comparable to MongoDB on a single computer.
Postgres can even scale horizontally now using asynchronous, or, from version 9.1, synchronous replication.
One way to achieve this would be to set up a master-slave replication with the PostgreSQL database as master, and the MongoDB database as slave. You would then do all reads from MongoDB, and all writes to PostgreSQL.
This post discusses such a setup using a tool called Bucardo:
http://blog.endpoint.com/2011/06/mongodb-replication-from-postgres-using.html
You may also be able to do it with Tungsten Replicator, although it seems designed to be used with MySQL:
http://code.google.com/p/tungsten-replicator/wiki/TRCHeterogeneousReplication
I can remember seeing this topic in Postgresql conference agenda, but I could not find the
link.
Maybe, you are talking about this: https://www.postgresqlconference.org/content/hybrid-applications-using-mongodb-and-postgres
Depending how important transactions are to you, one option is to use MongoDb driver's safe mode and drop Postgresql.
http://www.mongodb.org/display/DOCS/getLastError+Command
How can you expect transactional consistency from Postgres but trust MongoDB for reads? How would you support rollbacks in this scenario? How do you detect when they've gotten out of sync?
I think you're better off going with memcache and implementing a higher level object cache. Alternatively, you could consider a replication slave for reads. If you have performance needs beyond what a dedicated read slave can provide, consider denormalizing your tables on your slave system.
Make sure that any of this is actually needed. For thin tables with PK lookups most modern database engines like Postgres or InnoDB are going to generally keep up with NoSQL solutions. Don't fall into the ROFLSCALE trap
http://www.youtube.com/watch?v=b2F-DItXtZs
I think you can run a mongo replica set.. Let say 3 Slave and 1 Master.. Then in your app you should run all write transactions on Postgresql and then on Mongo ReplicaSet.. After that you can query read operations on Mongo Replica set..
But Synchronizing will be a problem, you should work on it..
you may find some replacement for mongo in here or here that is safer and fast as well.
but I advise to simplify your solution instead of making a complicated design.
Visual Guide to NoSQL Systems
lucky
In mongodb we can specify writeConcern property to specify that it should write to journal/ instances and then send confirmation/ acknowledgement and i think even mongodb has teh concept of transactions. Not sure why we need postgres behind it.

Postgresql for OLAP

Does anyone have experience of using PostgreSQL for an OLAP setup, using cubes against the database etc. Having come across a number of idiosyncracies when using MySQL for OLAP, are there reasons in favour of using PostgreSQL instead (assuming that I want to go the open source route)?
There are a number of data warehousing software vendors that are based on Postgresql (and contribute OLAP related changes back to core fairly regularly). Check out https://greenplum.org/. You'll find that PG works a lot better (for nearly any workload, OLAP especially) than MySQL. Greenplum and other similar solutions should work a bit better than PG depending on your data sets and use cases.
PGSQL is much better suited for Data Warehousing compared to MySQL. We had thought initially to go with MySQL, but it performs poorly in aggregations if data grows to a few million rows. PGSQL performs almost 20 times faster in caparison with MySQL for 20 million records for a single fact table on same hardware setup. If for some reason you choose to go with MySQL, then you should use MyISAM storage engine for fact tables rather then InnoDB; you will see slightly better performance.