I am just going through the concepts of business intelligence for relational databases. There are lots of BI tools for relational databases.
I want to know whether there is any tool for doing BI on NoSQL (MongoDB) and, if yes, which one is the most powerful.
I have heard about Nucleon BI, but I don't know how powerful it is or what advantages it has over other tools.
There are currently three major BI platforms in the MongoDB ecosystem.
Jaspersoft:
The only BI server that can connect directly to MongoDB, leveraging the aggregation framework APIs, so that you can report on and analyze data in MongoDB without having to move the data through ETL to a relational database.
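To make "leveraging the aggregation framework" concrete, this is roughly the kind of pipeline such connectors push down to MongoDB instead of extracting raw documents; the orders collection and its fields below are hypothetical placeholders:

```python
# Rough illustration of an aggregation-framework query of the kind a BI
# connector generates; the shop.orders collection and fields are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Roughly equivalent to: SELECT status, SUM(total), COUNT(*) FROM orders GROUP BY status
pipeline = [
    {"$match": {"created_at": {"$exists": True}}},
    {"$group": {"_id": "$status",
                "revenue": {"$sum": "$total"},
                "orders": {"$sum": 1}}},
    {"$sort": {"revenue": -1}},
]
for row in orders.aggregate(pipeline):
    print(row)
```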
Pentaho:
- Increase data value: with Pentaho, MongoDB data can be accessed, blended, visualized and reported in combination with any other data source for increased insight and operational analytics.
- Reduce complexity: reporting on data stored in MongoDB is simplified, increasing developer productivity with Pentaho's automatic document sampling, drag-and-drop interface and schema generation.
- Accelerate data access and querying: with no impact on throughput, this integration builds on the features and capabilities in MongoDB, such as the aggregation framework, replication and tag sets.
JSON Analytics:
- Native JSON handling: no mapping to dimensions and measures means very short up-and-running times and no changes when the structure of the data changes. Contrary to previous-generation BI tools, JSON Studio was built from the ground up for JSON and MongoDB and is not based on a connector that tries to map JSON data into columns.
- Native use of MongoDB's aggregation framework under an easy-to-use UI means very fast response times, for the first time accessible to all types of users.
- An HTTP gateway with parameters means power users can design reports and graphs that can be used by any user, for building dashboards, and from within other applications.
- Rich D3 visualization and exploratory analytics give power users the perfect platform to understand and work with data.
- Low cost.
Nucleon BI is also in the picture, but it is not as popular.
I have used Jaspersoft and found it great for BI and reporting.
I am using the Delta Lake OSS version 0.8.0.
Let's assume we calculated aggregated data and cubes using the raw data and saved the results in a gold table using Delta Lake.
My question is: is there a well-known way to access this gold table data and deliver it to a web dashboard, for example?
In my understanding, you need a running Spark session to query a Delta table.
So one possible solution could be to write a web API which executes these Spark queries.
You could also write the gold results to a database like Postgres to access them, but that seems like just duplicating the data.
Is there a known best practice solution?
The real answer depends on your requirements regarding latency, number of requests per second, amount of data, deployment options (cloud/on-prem, where the data is located: HDFS/S3/...), etc. Possible approaches are:
Run Spark in local mode inside your application; this may require a lot of memory, etc.
Run the Thrift JDBC/ODBC server as a separate process, and access data via JDBC/ODBC
Read the data directly using the Delta Standalone Reader library for the JVM, or via the delta-rs library that works with Rust/Python/Ruby (see the sketch below)
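For the last option, here is a minimal sketch using the Python bindings of delta-rs (the deltalake package); the table path is a placeholder, and it assumes the web API process can read the gold table's storage directly without a Spark session:

```python
# Minimal sketch: read gold-table data without a Spark session, using the
# delta-rs Python bindings (pip install deltalake). The path is a placeholder.
from deltalake import DeltaTable

# Point at the gold table's storage location (local path, S3 URI, etc.)
gold = DeltaTable("/data/lake/gold/sales_summary")

# Materialize the current snapshot as a pandas DataFrame for the dashboard layer
df = gold.to_pandas()
print(df.head())
```

A web API endpoint could run this on each request or on a cache-refresh interval, which avoids both an always-on Spark session and a duplicated copy of the data in Postgres.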
We are working on an audit system where auditors are given access to transactions processed in the last quarter. Auditors perform various analyses on the data to find invalid/erroneous transactions that have some exceptions.
Generally, these analyses require the data to be presented on charts to view the outliers, or sometimes duplicate detection is done based on multiple columns.
Sometimes the exception detection algorithms are pretty involved and require multiple processing steps using stored procedures.
Please note that the analysis rarely involves aggregation over huge numbers of rows.
Occasionally, they change some data if they find it missing or incorrect.
We are evaluating row-based stores (SQL and NoSQL databases) and column stores (like data warehouse systems).
Is this a use case for a data warehouse or a row-based store, like NoSQL or some RDBMS?
In short, requirements are:
- Occasional updates
- Mostly read queries over the last 3 months of data
- Reading data may require several massaging steps, like creating a temp table in step 1, joining it with another table in step 2, deleting some rows, etc.
Thanks
For your task, it does not really matter how the data is stored. You need to think instead about how to create a solid dimensional model, how to populate it with data properly, and what reporting tools to use.
To give you an example, here are a couple of common setups I've used in my projects:
Microsoft stack setup:
SQL Server for data storage
SSIS for data ETL (or write your own stored procedures if you know what you are doing)
Publish the dimensional model on the same SQL Server. If your data set is large (over a billion records), use SSAS Tabular instead
Power Pivot or Power BI for interactive reporting, or SSRS for paginated reports.
Open-source setup:
PostgreSQL for data storage
Use stored procedures and/or Python to process data (see the sketch after this list)
Publish the dimensional model to another PostgreSQL database. If your data is large, publish the dimensional model to Redshift or another columnar database
Use Tableau or Power BI for interactive reporting, or build your own reporting interface.
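As an illustration of the processing step above, here is a minimal sketch of populating a star schema in PostgreSQL from Python; the table and column names (stg_transactions, dim_date, fact_transactions) are hypothetical placeholders for your own model:

```python
# Minimal sketch of loading a star schema (date dimension + transaction fact)
# from a staging table; all table/column names are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect("dbname=reporting user=etl")  # assumed connection string
with conn, conn.cursor() as cur:
    # Dimension: one row per calendar day seen in the staging data
    cur.execute("""
        INSERT INTO dim_date (date_key, full_date, year, month)
        SELECT DISTINCT to_char(txn_date, 'YYYYMMDD')::int, txn_date,
               extract(year from txn_date), extract(month from txn_date)
        FROM stg_transactions
        ON CONFLICT (date_key) DO NOTHING;
    """)
    # Fact: one row per transaction, keyed to the date dimension
    cur.execute("""
        INSERT INTO fact_transactions (date_key, account_id, amount)
        SELECT to_char(txn_date, 'YYYYMMDD')::int, account_id, amount
        FROM stg_transactions;
    """)
conn.close()
```

The auditors' charts and duplicate checks then run against the fact table and its dimensions rather than the raw transactional schema.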
I think a NoSQL database is the wrong choice here because the audit will require highly structured data.
We have developed a model in the Tabular Object Model (TOM), <= 3.5 GB in size, and built a few Tableau dashboards on top of this model.
Each dashboard is built by dragging multiple sheets into one dashboard. All the sheets (dragged in one dashboard) fetch data from one fact table (of course it has relationships with Date and other related dimensions) in TOM.
Now, when we interact with a Tableau dashboard, we see performance degradation. When we checked SQL Profiler, Tableau was generating a huge query for almost every interaction we had with the dashboard.
We checked the huge query and observed that it includes DAX queries for almost all the measures in the fact tables, irrespective of whether the fact table is used in the said dashboard or not.
We have verified the filter settings in the dashboard; the settings apply only to the sheets dragged into our dashboard, so there is no question of visualizations changing in other dashboards.
Even so, we still see that Tableau is creating a huge query incorporating all the DAX queries, and this results in a performance impact.
Is there any way we can restrict this behavior?
In case anyone else is having this issue, this is tied to Tableau not actually supporting SSAS Tabular; the connector you are using is for SSAS Multidimensional, so Tableau generates MDX queries against the DAX-based Tabular model.
This is also evident from Tableau's own techspecs site:
https://www.tableau.com/products/techspecs
"Microsoft SQL Server Analysis Services 2008 SP4 or later, multi-dimensional mode only*
"
Tableau's website at https://www.tableau.com/products/techspecs clearly states support for
"Microsoft SQL Server Analysis Services 2005 or later, non-tabular mode only*(incl. support for Kerberos)"
There is a web application which has been running for years, and during its lifetime the application has gathered a lot of user data. The data is stored in a relational DB (Postgres). Not all of this data is needed to run the application (to do the business). However, from time to time business people ask me to provide reports on this data. And this causes some problems:
sometimes these SQL queries are long-running
queries are executed against the production DB (not cool)
it is not so easy to deliver reports on a weekly or monthly basis
some parts of the data are stored in a way which is not suitable for such querying (queries are inefficient)
My idea (note that I am a developer, not a data mining specialist) for improving this whole process of delivering reports is:
create a separate DB which is regularly updated with production data
optimize how data is stored
create a dashboard to present reports
Question: But is there a better way? Is there another DB which is a better fit for such data analysis? Or should I look into modern data mining tools?
Thanks!
Do you really do data mining (as in: classification, clustering, anomaly detection), or is "data mining" for you any reporting on the data? In the latter case, all the "modern data mining tools" will disappoint you, because they serve a different purpose.
Have you used the indexing functionality of Postgres well? Your scenario sounds as if selection and aggregation are most of the work, and SQL databases are excellent for this - if well designed.
For example, materialized views and triggers can be used to process data into a shape more usable for your reporting; a minimal sketch is below.
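Here is a minimal sketch of that idea with psycopg2, assuming a hypothetical orders table; the refresh would run from a cron job or other scheduler:

```python
# Minimal sketch: precompute a monthly reporting view in Postgres and refresh
# it periodically; the orders table and connection string are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=app user=report")  # assumed connection string
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS monthly_sales AS
        SELECT date_trunc('month', created_at) AS month,
               count(*)    AS orders,
               sum(amount) AS revenue
        FROM orders
        GROUP BY 1;
    """)
    # Re-run on a schedule so reports read precomputed rows instead of
    # scanning the orders table on every request
    cur.execute("REFRESH MATERIALIZED VIEW monthly_sales;")
conn.close()
```

Reports then select from monthly_sales, which keeps the long-running aggregation off the request path.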
There are a thousand ways to approach this issue, but I think the path of least resistance for you would be Postgres replication. Check out a Postgres replication tutorial for a quick proof of concept (there are many hits when you Google for Postgres replication), and the PostgreSQL wiki documents streaming replication.
I am suggesting this because it meets all of your criteria and also stays within the bounds of the technology you're familiar with. The only learning curve would be the replication part.
Replication solves your issue because it would create a second database which would effectively become your "read-only" DB, updated via the replication process. You would keep the schema the same, but your indexing could be altered and reports/dashboards customized. This is the database you would query (a quick sketch is below). Your main database would be your transactional database which serves the users, and the replicated database would serve the stakeholders.
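As a rough illustration, report jobs would simply connect to the standby instead of the primary; the host name and the report query below are hypothetical placeholders:

```python
# Minimal sketch: run reporting queries against a streaming replica.
# Host names, credentials and the orders table are hypothetical placeholders.
import psycopg2

replica = psycopg2.connect(host="replica.internal", dbname="app", user="report")
with replica, replica.cursor() as cur:
    # Returns true on a hot-standby replica, false if this is the primary
    cur.execute("SELECT pg_is_in_recovery();")
    print("connected to a replica:", cur.fetchone()[0])

    # Heavy reporting query runs here without touching the transactional primary
    cur.execute(
        "SELECT count(*) FROM orders "
        "WHERE created_at >= now() - interval '7 days';"
    )
    print("orders in the last 7 days:", cur.fetchone()[0])
replica.close()
```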
This is a wide topic, so please do your due diligence and research it. But it's also something that can work for you and can be quickly turned around.
If you really want to try data mining with PostgreSQL, there are some tools which can be used.
The very simple way is KNIME. It is easy to install and has full-featured data mining tools. You can access your data directly from the database, process it, and save it back to the database.
The hardcore way is MADlib. It installs data mining functions written in Python and C directly in Postgres, so you can mine with SQL queries.
Both projects are stable enough to try.
For reporting, we use a non-transactional (read-only) database. We don't care about normalization. If I were you, I would use another database for reporting. I would design the tables following OLAP principles (star schema, snowflake) and use an ETL tool to dump the data periodically (maybe weekly) to the read-only database to start creating reports.
Reports are used for decision support, so they don't have to be in real time and usually don't have to be current. In other words, it is acceptable to create reports up to last week or last month.
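A minimal sketch of such a periodic dump, written in plain Python instead of an ETL tool; the connection strings and the sales_summary table are hypothetical placeholders:

```python
# Minimal sketch of a weekly job that copies aggregated rows from the
# production DB to a read-only reporting DB; names are hypothetical.
import psycopg2

src = psycopg2.connect("dbname=production user=etl")
dst = psycopg2.connect("dbname=reporting user=etl")

with src, src.cursor() as read_cur, dst, dst.cursor() as write_cur:
    # Pull last week's data, already aggregated to the reporting grain
    read_cur.execute("""
        SELECT date_trunc('day', created_at) AS day, sum(amount) AS total
        FROM orders
        WHERE created_at >= now() - interval '7 days'
        GROUP BY 1;
    """)
    for day, total in read_cur.fetchall():
        write_cur.execute(
            "INSERT INTO sales_summary (day, total) VALUES (%s, %s)",
            (day, total),
        )

src.close()
dst.close()
```

Scheduled with cron (or any scheduler), this keeps the reporting database roughly a week behind production, which the answer above treats as acceptable for decision support.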
I was reviewing some documents on making my database perform better, and I came across the "OLAP" pre-aggregation term. I was wondering if OLAP is a tool, a methodology, or an approach. For example, my DBMS is PostgreSQL and I am working on a big database. To speed things up I have to use some aggregation and pre-aggregation methods. How can OLAP be helpful?
OLAP is a database role. When storing OLAP data in the db, typically you aren't running live transactional information off the db, but rather keeping it around for analytical and business intelligence reasons.
It isn't a tool. It isn't an approach either, since some approaches are needed for OLAP but some of them are helpful in transactional environments as well.
In general you shouldn't think about speeding up an application by incorporating OLAP into it. Instead you would look at separating out reporting functions onto a separate DB server, importing the data periodically, and then separating data feeds from operational data stores, etc. This is a very different field than transactional application development.
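That said, if pre-aggregation inside PostgreSQL itself is what you're after, here is a minimal sketch using GROUP BY CUBE to precompute totals at every grouping level; the sales table and its columns are hypothetical placeholders:

```python
# Minimal sketch of OLAP-style pre-aggregation in Postgres: build a summary
# table with subtotals and a grand total; the sales table is hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=app user=report")  # assumed connection string
with conn, conn.cursor() as cur:
    # CUBE produces one row per (region, product) combination, plus per-region,
    # per-product and overall totals, so dashboards read small precomputed rows.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS sales_cube AS
        SELECT region, product, sum(amount) AS total
        FROM sales
        GROUP BY CUBE (region, product);
    """)
conn.close()
```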