Row based database or Column based database - nosql

We are working on a audit system where auditor are given access to transaction processed in last quarter. Auditor performs various analysis on the data to find out invalid/erroneous transactions that have some exceptions.
Generally, these analysis requires data to be present on some charts to view the out-layers or sometime duplication detection are done based on multiple columns.
Sometime exception detection algorithm are pretty involved that require multiple processing steps using stored procedure.
Please note that analysis rarely involves aggregation on huge rows.
Occasionally , they can change some data if they find it missing or incorrect.
We are evaluating row based (sql & nosql databases) and column store (like data warehouse systems).
Is this a use case for datawarehouse or row based store, like nosql or some RDBMS?
In short, requirements are:
- Occasional update
- Mostly read queries over last 3/months of data
- Reading data my require several messaging steps, like creating temp table in step 1, forming join with another table in step rule, delete some rows ect.
Thanks

For your task, it does not really matter how the data is stored. You need to think instead how to create a solid dimensional model, populate it with data properly, and what reporting tools to use.
To give you an example, here are a couple of common setups I've used in my projects:
Microsoft stack setup:
SQL Server for data storage
SSIS for data ETL (or write your own stored procedures if you know what you are doing)
Publish dimensional model on the same SQL Server. If your data set is large (over billion records), use SSAS Tabular instead
Power Pivot or Power BI for interactive reporting, or SSRS for paginated reports.
Open-source setup:
PostgreSQL for data storage
Use stored procedures and/or Python to process data
Publish dimensional model to another PostgreSQL database. If your data is large, publish the dimensional model to Redshift or
other columnar database
Use Tableau or Power BI for interactive reporting, or build your own reporting interface.
I think NoSQL database is a wrong choice here because audit will require highly structured data.

Related

Is there a way to get to the stored queries for reports and graphs in Tableau?

We're using Tableau 10.5.6. I used a reporting tool years ago called Oracle Sales Analyzer. In that tool you could get to the queries generated by the reports and graphs you created through back-end catalogs using their command line.
There you could rewrite the query to be more efficient by fine-tuning the code if you needed. It was a very cool feature of that reporting tool for geeks like me who like to dive into the back end of the product and tune it at a very low level.
My question is, does Tableau have any of this type of facility? Is there a way to get to the queries that get stored once you create a report or a graph. Also is there command line where you can access these catalogs if they exist? Otherwise are these queries just stored in ASCII flat files that can be accessed by a user.
Thanks!
There are two ways that Tableau will query a database.
Option 1: Custom SQL
In your data source, you paste in the sql you have written and Tableau will pass that query through to the database. This gives you complete control over the sql, including adding any indexing hints you may want. See https://onlinehelp.tableau.com/current/pro/desktop/en-us/customsql.html
Option 2: Use the Tableau data source designer
This is what many people do. Here, you visually design your data source with the joins. Tableau translates that design into what the Hyper engine considers to be the most effective way to run the query. Sometimes, Hyper translates that into a regular sql statement. Sometimes it does some additional things to help boost performance, like breaking it up into different queries. A lot depends on the db engine you are connecting to. There is no "sql" stored in a flat file for this. Tableau just translates your design at run-time. The Hyper engine does a good job with fine-tuning, assuming you have an efficient database design with proper indexing and current table statistics.
There is a way to see the sql from option 2 at run-time using Performance Recording. Performance Recording keeps track of each step of the visualization process and will spit out the sql statement(s) that Tableau ran to generate your dataset. The sql is not stored in the twb file though, it's a run-time analysis.

How do I use Redshift Database for Transformation and Reporting?

I have 3 tables in my redshift database and data is coming from 3 different csv files from S3 every few seconds. One table has ~3 billion records and other 2 has ~100 million record. For the near realtime reporting purpose, I have to merge this table into 1 table. How do I achieve this in redshift ?
Near Real Time Data Loads in Amazon Redshift
I would say that the first step is to consider whether Redshift is the best platform for the workload you are considering. Redshift is not an optimal platform for streaming data.
Redshift's architecture is better suited for batch inserts than streaming inserts. "COMMIT"s are "costly" in Redshift.
You need to consider the performance impact of VACUUM and ANALYZE if those operations are going to compete for resources with streaming data.
It might still make sense to use Redshift on your project depending on the entire set of requirements and workload, but bear in mind that in order to use Redshift you are going to engineer around it, and probably change your workload from a "near-real-time" to a micro batch architecture.
This blog posts details all the recommendations for micro batch loads in Redshift. Read the Micro-batch article here.
In order to summarize it:
Break input files --- Break your load files in several smaller files
that are a multiple of the number of slices
Column encoding --- Have column encoding pre-defined in your DDL.
COPY Settings --- Ensure COPY does not attempt to evaluate the best
encoding for each load
Load in SORT key order --- If possible your input files should have
the same "natural order" as your sort key
Staging Tables --- Use multiple staging tables and load them in
parallel.
Multiple Time Series Tables --- This documented approach for dealing
with time-series in Redshift
ELT --- Do transformations in-database using SQL to load into the
main fact table.
Of course all the recommendations for data loading in Redshift still apply. Look at this article here.
Last but not least, enable Workload Management to ensure the online queries can access the proper amount of resources. Here is an article on how to do it.

Data mining with postgres in production environment - is there a better way?

There is a web application which is running for a years and during its life time the application has gathered a lot of user data. Data is stored in relational DB (postgres). Not all of this data is needed to run application (to do the business). However form time to time business people ask me to provide reports of this data data. And this causes some problems:
sometimes these SQL queries are long running
quires are executed against production DB (not cool)
not so easy to deliver reports on weekly or monthly base
some parts of data is stored in way which is not suitable for such
querying (queries are inefficient)
My idea (note that I am a developer not the data mining specialist) how to improve this whole process of delivering reports is:
create separate DB which regularly is update with production data
optimize how data is stored
create a dashboard to present reports
Question: But is there a better way? Is there another DB which better fits for such data analysis? Or should I look into modern data mining tools?
Thanks!
Do you really do data mining (as in: classification, clustering, anomaly detection), or is "data mining" for you any reporting on the data? In the latter case, all the "modern data mining tools" will disappoint you, because they serve a different purpose.
Have you used the indexing functionality of Postgres well? Your scenario sounds as if selection and aggregation are most of the work, and SQL databases are excellent for this - if well designed.
For example, materialized views and triggers can be used to process data into a scheme more usable for your reporting.
There are a thousand ways to approach this issue but I think that the path of least resistance for you would be postgres replication. Check out this Postgres replication tutorial for a quick, proof-of-concept. (There are many hits when you Google for postgres replication and that link is just one of them.) Here is a link documenting streaming replication from the PostgreSQL site's wiki.
I am suggesting this because it meets all of your criteria and also stays withing the bounds of the technology you're familiar with. The only learning curve would be the replication part.
Replication solves your issue because it would create a second database which would effectively become your "read-only" db which would be updated via the replication process. You would keep the schema the same but your indexing could be altered and reports/dashboards customized. This is the database you would query. Your main database would be your transactional database which serves the users and the replicated database would serve the stakeholders.
This is a wide topic, so please do your diligence and research it. But it's also something that can work for you and can be quickly turned around.
If you really want try Data Mining with PostgreSQL there are some tools which can be used.
The very simple way is KNIME. It is easy to install. It has full featured Data Mining tools. You can access your data directly from database, process and save it back to database.
Hardcore way is MADLib. It installs Data Mining functions in Python and C directly in Postgres so you can mine with SQL queries.
Both projects are stable enough to try it.
For reporting, we use non-transactional (read only) database. We don't care about normalization. If I were you, I would use another database for reporting. I will desing the tables following OLAP principals, (star schema, snow flake), and use an ETL tool to dump the data periodically (may be weekly) to the read only database to start creating reports.
Reports are used for decision support, so they don't have to be in realtime, and usually don't have to be current. In other words it is acceptable to create report up to last week or last month.

tELTPostgresql* usage issue

I'm trying to use tELTPostgresqlOutput with postgres 9.3 server and this is the result:
With a simple tPostgresqlInput and a tLogRow it works perfectly.
This is not how to use the ELT components. These should be used to do in database server transformations such as creating a star schema table from multiple tables in the same database. This allows you to use the database to do the transformation and avoid reading the data into memory for your job. It's particularly useful when dealing with large datasets that can't be broken down for the transformation.
If you want to transfer data from one database server/vendor to another you will need to use ETL components (pretty much anything not explicitly marked ELT) to read data out of the source database and write it back to the target database.
In this case you should be using a tMSSQLInput component to read in the data you need, a tMap to transform the data in the way you want and a tPostgresqlOutput component to write the data out to the Postgres database.

No merge tables in MariaDB - options for very large tables?

I've been happily using MySQl for years, and have followed the MariahDB fork with interest.
The server for one of my projects is reaching end of life and needs to be rehosted - likely to CentOS 7, which includes MariahDB
One of my concerns is the lack of the merge table feature, which I use extensively. We have a very large (at least by my standards) data set with on the order of 100M records/20 GB (with most data compressed) and growing. I've split this into read only compressed myisam "archive" tables organized by data epoch, and a regular myisam table for current data and inserts. I then span all of these with a merge table.
The software working against this database is then written such that it figures out which table to retrieve data from for the timespan in question, and if the timespan spans multiple tables, it queries the overlying merge table.
This does a few things for me:
Queries are much faster against the smaller tables - unfortunately, the index needed for the most typical query, and preventing duplicate records is relatively complicated
Frees the user from having to query multiple tables and assemble the results when a query spans multiple tables
Allowing > 90% of the data to reside in the compressed tables saves alot of disk space
I can back up the archive tables once - this saves tremendous time, bandwidth and storage on the nightly backups
An suggestions for how to handle this without merge tables? Does any other table type offer the compressed, read-only option that myisam does?
I'm thinking we may have to go with separate tables, and live with the additional complication and changing all the code in the multiple projects which use this database.
MariaDB 10 introduced the CONNECT storage engine that does a lot of different things. One of the table types it provides is TBL, which is basically an expansion of the MERGE table type. The TBL CONNECT type is currently read only, but you should be able to just insert into the base tables as needed. This is probably your best option but I'm not very familiar with the CONNECT engine in general and you will need to do a bit of experimentation to decide if it will work.