Data mining with postgres in production environment - is there a better way? - postgresql

There is a web application which is running for a years and during its life time the application has gathered a lot of user data. Data is stored in relational DB (postgres). Not all of this data is needed to run application (to do the business). However form time to time business people ask me to provide reports of this data data. And this causes some problems:
sometimes these SQL queries are long running
quires are executed against production DB (not cool)
not so easy to deliver reports on weekly or monthly base
some parts of data is stored in way which is not suitable for such
querying (queries are inefficient)
My idea (note that I am a developer not the data mining specialist) how to improve this whole process of delivering reports is:
create separate DB which regularly is update with production data
optimize how data is stored
create a dashboard to present reports
Question: But is there a better way? Is there another DB which better fits for such data analysis? Or should I look into modern data mining tools?
Thanks!

Do you really do data mining (as in: classification, clustering, anomaly detection), or is "data mining" for you any reporting on the data? In the latter case, all the "modern data mining tools" will disappoint you, because they serve a different purpose.
Have you used the indexing functionality of Postgres well? Your scenario sounds as if selection and aggregation are most of the work, and SQL databases are excellent for this - if well designed.
For example, materialized views and triggers can be used to process data into a scheme more usable for your reporting.

There are a thousand ways to approach this issue but I think that the path of least resistance for you would be postgres replication. Check out this Postgres replication tutorial for a quick, proof-of-concept. (There are many hits when you Google for postgres replication and that link is just one of them.) Here is a link documenting streaming replication from the PostgreSQL site's wiki.
I am suggesting this because it meets all of your criteria and also stays withing the bounds of the technology you're familiar with. The only learning curve would be the replication part.
Replication solves your issue because it would create a second database which would effectively become your "read-only" db which would be updated via the replication process. You would keep the schema the same but your indexing could be altered and reports/dashboards customized. This is the database you would query. Your main database would be your transactional database which serves the users and the replicated database would serve the stakeholders.
This is a wide topic, so please do your diligence and research it. But it's also something that can work for you and can be quickly turned around.

If you really want try Data Mining with PostgreSQL there are some tools which can be used.
The very simple way is KNIME. It is easy to install. It has full featured Data Mining tools. You can access your data directly from database, process and save it back to database.
Hardcore way is MADLib. It installs Data Mining functions in Python and C directly in Postgres so you can mine with SQL queries.
Both projects are stable enough to try it.

For reporting, we use non-transactional (read only) database. We don't care about normalization. If I were you, I would use another database for reporting. I will desing the tables following OLAP principals, (star schema, snow flake), and use an ETL tool to dump the data periodically (may be weekly) to the read only database to start creating reports.
Reports are used for decision support, so they don't have to be in realtime, and usually don't have to be current. In other words it is acceptable to create report up to last week or last month.

Related

Greenplum selection criteria

I'm getting familiar with the greenplum solution concepts, and trying to understand whether, and if so, when the organisation I work for should use this solution. Our conceptual idea is to setup a kind of central 'datastore' suitable for both OLTP and OLAP access.
My research: this article suggests Greenplum is more suitable for OLAP, and PostgreSQL for OLTP. But I also read about Greenplum improvements for OLTP processing. And in favour of Postgresql, there are also articles like this that suggest that OLAP (eg, a datawarehouse implementation) can be done by means of Postgresql.
So my question is: how to move forward, and what are the main criteria to decide? For example, in case we now have a just a few TB's (1-5), start with a Postgresql cluster (for OLTP+OLAP), and when data volumes grow, move on to Greenplum? Or start straight away with Greenplum?
maybe use postgres if it can handle your use case. If you have you have too much data and need to finish reports and analytics faster; change to greenplum

Getting data from REST API and storing it in Azure Data Lake

Does it make sense to get data from REST API and store it as JSON in an Azure Data Lake? Or the data should be stored directly into Azure SQL?
I've tried both options, but it's not clear in which scenario it is worth to save the data into Azure Data Lake.
Yes this is a perfectly normal pattern that has emerged for collecting large volumes in particular. Writing to a database is great but there are (at least) two aspects to consider:
schema-on-write - you have to know the schema before you write to the database. That means all columns, all datatypes, nullability, collation even before you can even think about writing a record. How are you going to handle the schema of your JSON changing for example?
transaction logging - most Microsoft SQL databases work with write-ahead-log or WAL, which means the transaction logging has to complete before the transaction is considered complete as part of an ACID transaction. What will happen in situations of heavy load on the database or high concurrency - queuing and blocking. Often these things take milliseconds but low tiers etc come into play. Alternate patterns like eventual consistency eg with Cosmos are a possibility if you need that sort of thing.
Data Lakes in contract are schema-on-read, ie you do not have to know the schema in order to write to the lake, so you can just land it and figure out the other stuff later.
This does not necessarily apply to your other question about Synapse as you run the risk of losing your perfectly good SQL Server datatypes. Look at one of the migration wizards for that instead.

Postgresql tuning to be used as DatawareHouse

I am assembling a Business Intelligence solution using the Pentaho software as a BI engine. Within this solution, I had to set up a requirements for a PostgreSQL database server.
The current situation is very easy, since no ETL process is being carried out for data extraction, so the PostgreSQL configuration has not changed it much, and it is practically as it is configured as "factory".
I would like to know what Postgres configuration parameters have to be touched and modified to optimize it as a Datawarehouse. I have seen a lot of documentation, but it is not clear to me at all, since one documentation says that such values have to be modified, and other documentation, other completely different values.
I would like to know just that, if there is a clearer and more precise documentation to optimize a postgres 9.6 to be used as a Pentaho DW.
Thank you very much

Postgres Multi-tenant administration/maintenance

We have a SaaS application where each tenant has its own database in Postgres. How would I apply a patch to all the databses? For example if I want to add a table or add a column to a table, I have to either write a program that loops through all databases and execute a SQL against them or using pgadmin, go through them one by one.
Is there smarter and/or faster way?
Any help is greatly appreciated.
Yes, there's a smarter way.
Don't create a new database for each tenant. If everything is in one database then you only need to alter one database.
Pick one database, alter each table to have the column TENANT and add this to the primary key. Then insert into this database every record for all tenants and drop the other databases (obviously considerably more work than this as your application will need to be changed).
The differences with your approach are extensively discussed elsewhere:
What problems will I get creating a database per customer?
What are the advantages of using a single database for EACH client?
Multiple schemas versus enormous tables
Practicality of multiple databases per client vs one database
Multi-tenancy - single database vs multiple database
If you don't put everything in one database then I'm afraid you have to alter them all individually, and doing it programatically would be simplest.
At a higher level, all multi-tenant applications follow one of three approaches:
One tenant's data lives in one database,
One tenant's data lives in one schema, or
Add a tenant_id / account_id column to your tables (shared schema).
I usually find that developers use the following criteria when they evaluate these different approaches.
Isolation: Since you can put each tenant into its own database in one hand, and have tenants share the same table on the other, this becomes the most apparent dimension. If you provide your users raw SQL access or you're in a regulated industry such as healthcare, you may need strict guarantees from your database. That said, PostgreSQL 9.5 comes with row level security policies that makes this less of a concern for most applications.
Extensibility: If your tenants are sharing the same schema (approach #3), and your tenants have fields that varies between them, then you need to think about how to merge these fields.
This article on multi-tenant databases has a great summary of different approaches. For example, you can add a dozen columns, call them C1, C2, and so forth, and have your application infer the actual data in this column based on the tenant_id. PostgresQL 9.4 comes with JSONB support and natively allows you to use semi-structured fields to express variations between different tenants' data.
Scaling: Another criteria is how easily your database would scale-out. If you create a tenant per database or schema (#1 or #2 above), your application can make use of existing Ruby Gems or [Django packages][1] to simplify app integration. That said, you'll need to manually manage your tenants' data and the machines they live on. Similarly, you'll need to build your own sharding logic to propagate foreign key constraints and ALTER TABLE commands.
With approach #3, you can use existing open source scaling solutions, such as Citus. For example, this blog post describes how to easily shard a multi-tenant app with Postgres.
it's time for me to give back to the community :) So after 4 years, our multi-tenant platform is in production and I would like to share the following observations/experiences with all of you.
We used a database per each tenant. This has given us extreme flexibility as the size of the databases in the backups are not huge and hence we can easily import them into our staging environment for customers issues.
We use Liquibase for database development and upgrades. This has been a tremendous help to us, allowing us to package the entire build into a simple war file. All changes are easily versioned and managed very efficiently. There is a bit of learning curve here an there but nothing substantial. 2-5 days can significantly save you time.
Given that we use Spring/JPA/Hibernate, we use a technique called Dynamic Data Source Routing. So when a user logs-in, we find the related datasource with a lookup and connect them to the session to the right database. That's also when the Liquibase scripts get applied for updates.
This is, for now, I will come back with more later on.
Well, there are problems with one database for all tenants in our case for sure.
The backup file gets huge and becomes almost not practical hard to manage
For troubleshooting, we need to restore customer's data in our dev env, we just use that customer's backup file and usually the file is not as big as if we were to use one database for all customers.
Again, Liquibase has been key in allowing to manage updates across all the tenants seamlessly and without any issues. Without Liquibase, I can see lots of complications with this approach. So Liquibase, Liquibase and more Liquibase.
I also suspect that we would need a more powerful hardware to manage a huge database with large joins across millions of records vs much lighter database with much smaller queries.
In case of problems, the service doesn't go down for everyone and there will be limited to one or few tenants.
In general, for our purposes, this has been a great architectural decision and we are benefiting from it every day. One time we had one customer that didn't have their archiving active and their database size grew to over 3 GB. With offshore teams and slower internet as well as storage/bandwidth prices, one can see how things may become complicated very quickly.
Hope this helps someone.
--Rex

OLAP and postgresql- tool or methodology?

I was reviewing some documents for making my database perform better and I came across with "OLAP" pre-aggregation term. I was wondering if OLAP is a tool or or methodology or approach. For example my DBMS is postgresql and I am working on a big databse. To speed up I have to use some aggregation and pre-aggregation methods. How OLAP can be helpful?
OLAP is a database role. When storing OLAP data in the db, typically you aren't running live transactional information off the db, but rather keeping it around for analytical and business intelligence reasons.
It isn't a tool. It isn't an approach either, since some approaches are needed for OLAP but some are helpful together in transactional environments as well.
In general you shouldn't think about speeding up an application by incorporating OLAP into it. Instead you would look at separating out reporting functions into a separate db server, and import the data periodically, and then separate data feeds from operation data stores, etc. This is a very different field than transactional application development.