I'm in the middle of designing an SSAS db. I get the theory and the use of this stuff. Here's the thing: I've got a logging database that records interesting order statuses, and I'd like to measure the time orders spend in each status. I've got these tables (not implemented) to measure status times:
time_dimension
user_dimension
status_dimension
status_fact - dimension references and a timeInStatus measure (rough sketch below)
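For concreteness, a minimal relational sketch of the fact table might look like the following; the column names and types are my own assumptions, not anything prescribed by SSAS or your schema.

    -- Sketch only: the keys reference the three dimension tables listed above,
    -- and timeInStatus is the single measure to aggregate.
    CREATE TABLE status_fact (
        time_key       INT NOT NULL,  -- FK to time_dimension
        user_key       INT NOT NULL,  -- FK to user_dimension
        status_key     INT NOT NULL,  -- FK to status_dimension
        time_in_status INT NOT NULL   -- measure: seconds the order spent in this status (unit assumed)
    );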
So my question is: do I create a regular database and stage these things for an SSIS task to pull into an SSAS db, or do I just create an SSAS db and build it directly over the existing logging db?
Naturally I'm new at this, but this type of analysis has been an interest of mine for a looong time! Your help is appreciated.
If your source DB (the logging one) is really nicely normalized around the data you need, you can probably get away without the stage.
Performance may suffer, development may suffer, etc. I think a DW (staging) db is almost a necessity to fully leverage SSAS, though...
We have a contact center in which about 1 million records are created every day. We use MySQL as the primary database. The records cover call times, the agents who answered, call type, and so on.
Creating analytical reports from this system is really time consuming (example: calculating an agent's calls for a specific month). We need near real-time reports from our system.
So we decided to store logs and reports in a NoSQL database to improve data access time.
Which option would you prefer, and why?
Use MongoDB
Use Elasticsearch as the primary database
Use big data tooling (Hadoop, Spark, ...)
Something else
Lots of people are using Elasticsearch plus Kibana to do this kind of thing.
I've run demos on my laptop with more than 1 million records representing people, building real-time BI reports on them with Kibana.
Disclaimer: I work at Elastic.
MongoDB offers a lot of flexibility and is a general-purpose database, so you can use it for much more than simple text search/storage. Storing 1 million documents in MongoDB will probably not even require sharding; a simple replica set should suffice. However, give thought to your document structure, and be sure you're not simply migrating tables to collections - that is unlikely to give you the performance you require. Look at the read/write profile of your application and be careful not to store unbounded arrays. Also, try to summarize where it makes sense so reporting and retrieval performance stays good.
BTW, you can test this out using MongoDB Atlas, starting for free. I just completed a screencast/blog showing how to get started: http://blog.mlynn.org/getting-started-with-mongodb-atlas/ Hope this helps.
Hi
I am in the process of adding analytics to my SaaS app, and I'd love to hear other people's experiences doing this.
Currently I see two different approaches:
Do most of the data handling at the DB level, building and aggregating data into materialized views for a performance boost. This way the data stays normalized.
Have various cron jobs/processes run at different intervals (10 min, 1 hour, etc.) that query the database and insert aggregated results into a new table. In this case the metrics/analytics are denormalized.
Which approach makes the most sense, or is there something completely different I should consider?
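For illustration, approach 1 might be sketched roughly like this; PostgreSQL syntax is assumed, and the view, table, and column names are made up.

    -- Raw data stays normalized; the database maintains a pre-aggregated copy of it.
    CREATE MATERIALIZED VIEW daily_signup_counts AS
    SELECT date_trunc('day', created_at) AS day,
           count(*)                      AS signups
    FROM   users
    GROUP  BY 1;

    -- Refresh on whatever schedule the dashboards need
    -- (use CONCURRENTLY plus a unique index on the view to avoid blocking readers).
    REFRESH MATERIALIZED VIEW daily_signup_counts;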
On really big data, the cronjob or ETL is the only option. You read the data once, aggregate it and never go back. Querying aggregated data is then relatively cheap.
Views will go through tables. If you use "explain" for a view-based query, you might see the data is still being read from tables, possibly using indexes (if corresponding indexes exist). Querying terabytes of data this way is not viable.
The only problem with the cronjob/ETL approach is that it's a pain to maintain. If you find a bug in the production environment, you're in trouble: you might spend days or weeks fixing it and recalculating the aggregations. Simply put: you have to get it right the first time :)
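A rough sketch of that cron/ETL approach, again assuming PostgreSQL and made-up names: a job runs once an hour and folds the previous hour's raw rows into a summary table.

    CREATE TABLE IF NOT EXISTS hourly_event_counts (
        bucket      timestamptz NOT NULL,
        event_type  text        NOT NULL,
        event_count bigint      NOT NULL,
        PRIMARY KEY (bucket, event_type)
    );

    -- Run from cron shortly after the top of each hour.
    INSERT INTO hourly_event_counts (bucket, event_type, event_count)
    SELECT date_trunc('hour', created_at) AS bucket,
           event_type,
           count(*)
    FROM   events
    WHERE  created_at >= date_trunc('hour', now()) - interval '1 hour'
    AND    created_at <  date_trunc('hour', now())
    GROUP  BY 1, 2
    ON CONFLICT (bucket, event_type)
    DO UPDATE SET event_count = EXCLUDED.event_count;  -- re-running the same hour just overwrites

Making the job idempotent like this at least softens the maintenance pain: a corrected job can be re-run over past windows without double counting.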
I have a couple of million entries in a table with start and end timestamps. I want to implement an analysis tool which determines unique entries for a specific interval. Let's say between yesterday and 2 months before yesterday.
Depending on the interval, the queries take between a couple of seconds and 30 minutes. How would I implement an analysis tool for a web front-end that would allow querying this data quite quickly, similar to Google Analytics?
I was thinking of moving the data into Redis and doing something clever with intervals and sorted sets, etc., but I was wondering if there's something in PostgreSQL which would allow me to execute aggregated queries and reuse earlier results, so that, for instance, after querying the first couple of days it does not start from scratch again when looking at a different interval.
If not, what should I do? Export the data to something like Apache Spark or DynamoDB, do the analysis there, and fill Redis for quicker retrieval?
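For reference, the core query being described could look roughly like this in PostgreSQL; the table and column names are invented.

    -- Count distinct entries whose [started_at, ended_at] span overlaps the window
    -- from two months before yesterday up to yesterday.
    SELECT count(DISTINCT entry_id)
    FROM   entries
    WHERE  started_at <  current_date                              -- began before the window ends
    AND    ended_at   >= current_date - 1 - interval '2 months';   -- still open after the window starts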
Either will do.
Aggregation is a basic task they can all do, and your data is small enough to fit into main memory. So you don't even need a database (though the aggregation functions of a database may still be better implemented than anything you'd rewrite yourself, and SQL is quite convenient to use).
Just do it. Give it a try.
P.S. Make sure to create the right indexes and choose the right data types. Maybe check query plans, too.
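A sketch of that P.S., reusing the assumed names from the query above; a plain btree helps for selective windows, while overlap-heavy workloads are often better served by a range column with a GiST index.

    CREATE INDEX IF NOT EXISTS entries_started_ended_idx
        ON entries (started_at, ended_at);

    -- Check that the planner actually uses it for the interval query.
    EXPLAIN ANALYZE
    SELECT count(DISTINCT entry_id)
    FROM   entries
    WHERE  started_at <  current_date
    AND    ended_at   >= current_date - 1 - interval '2 months';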
I was just searching for the best explanations of and reasons for building an OLAP cube from relational data. Is it all about performance and query optimization?
It would be great if you could give links or point out the best explanations and reasons for building a cube, since we can do everything from the relational database that we can do from the cube, only the cube is faster at showing results. Are there any other explanations or reasons?
There are many reasons why you should use a cube for analytical processing.
Speed. OLAP warehouses are read-only infrastructures, typically serving queries around 10 times faster than their OLTP counterparts. See the wiki.
Integration of multiple data sources. With a cube you can easily use multiple data sources and do minimal work, with many automated tasks (especially when you use SSIS), to integrate them into a single analysis system. See the ETL process.
Minimal code. That is, you rarely need to write queries by hand. Even though you can write MDX (the query language of SSAS cubes), BI Studio does most of the hard work for you. On a project I am working on, we first used SSRS to provide reports for the client. The queries were long and hard to write and took days to implement. Their SSAS equivalents took us half an hour to build, writing only a few simple queries to transform some data.
A cube provides reports and drill up/down/through without the need to write additional queries. End users can traverse the dimensions on their own, since the aggregations are already stored in the warehouse, and produce their own reports without writing queries.
It is part of Business Intelligence. Once you build a cube, it can feed many other technologies and help in the implementation of BI solutions.
I hope this helps.
If you want a top level view, use OLAP. Say you have millions of rows detailing product sales and you want to know your monthly sales totals.
If you want bottom-level detail, use OLTP (e.g. SQL). Say you have millions of rows detailing product sales and want to examine one store's sales on one particular day to find potential fraud.
OLAP is good for big numbers. You wouldn't use it to examine string values, really...
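To make the contrast concrete, here are the two styles side by side in SQL; the sales table and its columns are made up for illustration.

    -- Bottom-level, OLTP-style question: one store, one day, row-level detail.
    SELECT *
    FROM   sales
    WHERE  store_id  = 42
    AND    sale_date = '2015-06-01';

    -- Top-level, OLAP-style question: monthly totals across everything.
    -- A cube pre-aggregates this kind of rollup so each request
    -- doesn't re-scan millions of detail rows.
    SELECT YEAR(sale_date)  AS sale_year,
           MONTH(sale_date) AS sale_month,
           SUM(amount)      AS total_sales
    FROM   sales
    GROUP  BY YEAR(sale_date), MONTH(sale_date)
    ORDER  BY sale_year, sale_month;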
It's a bit like asking why use Java/C++ when we can do everything in assembly language ;-) Building a cube (apart from performance) gives you the MDX language; MDX has higher-level concepts than SQL and is better suited to analytic tasks. Perhaps this question gives more info.
My 2 centavos.
I just got into a new company and my task is to optimize the database performance. One possible (and suggested) way would be to use multiple servers instead of one. As there are many possible ways to do that, I need to analyse the DB first. Is there a tool with which I can measure how many inserts, updates, and deletes are performed on each table?
I agree with Surfer513 that the DMV is going to be much better than CDC. Adding CDC is fairly complex and will add a load to the system. (See my article here for statistics.)
I suggest first setting up a SQL Server Trace to see which commands are long-running.
If your system makes heavy use of stored procedures (which hopefully it does), also check out sys.dm_exec_procedure_stats. That will help you concentrate on the procedures/tables/views that are used most often. Look at execution_count and total_worker_time.
The point is that you want to determine which parts of your system are slow (using Trace) so that you know where to spend your time.
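As an example of the kind of query meant here (the column choice is just a starting point, and it only covers plans currently in the cache):

    SELECT TOP (20)
           OBJECT_NAME(ps.object_id, ps.database_id) AS procedure_name,
           ps.execution_count,
           ps.total_worker_time,                                    -- total CPU time, in microseconds
           ps.total_worker_time / ps.execution_count AS avg_worker_time
    FROM   sys.dm_exec_procedure_stats AS ps
    ORDER  BY ps.total_worker_time DESC;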
One way would be to utilize Change Data Capture (CDC) or Change Tracking. Not sure how in-depth you want to go with this, but there are simpler ways to get a rough estimate (it doesn't look like you want exact numbers, just ballpark figures?).
Assuming that there are indexes on your tables, you can query sys.dm_db_index_operational_stats to get data on inserts/updates/deletes that affect the indexes. Again, this is a rough estimate but it'll give you a decent idea.
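For example, something along these lines gives per-table insert/update/delete counts for the current database; note the counters reset when the instance restarts or the metadata is evicted from memory, so treat them as a sample rather than a running total.

    SELECT OBJECT_NAME(ios.object_id)  AS table_name,
           SUM(ios.leaf_insert_count)  AS leaf_inserts,
           SUM(ios.leaf_update_count)  AS leaf_updates,
           SUM(ios.leaf_delete_count)  AS leaf_deletes
    FROM   sys.dm_db_index_operational_stats(DB_ID(), NULL, NULL, NULL) AS ios
    GROUP  BY ios.object_id
    ORDER  BY SUM(ios.leaf_insert_count) + SUM(ios.leaf_update_count) + SUM(ios.leaf_delete_count) DESC;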