How do tabular partitions affect end users? - ssas-tabular

I have a large SSAS Tabular model. I can do a full process on the server in about 10 minutes, which is not a problem. The problem is that end users working with the model from either Excel or Power BI are reporting sluggish performance.
Can partitioning the data help with this, and are there other ways to speed up the end users' experience?

Partitions would only help you reduce processing time, e.g. you could reload only "new" data on a daily basis. They shouldn't affect performance on the client side.
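For illustration, incremental loading with partitions usually comes down to a date-bounded source query per partition, so only the current slice is reprocessed each day. A minimal sketch, assuming a hypothetical FactSales table and OrderDate column:

-- Hypothetical source query for a "current month" partition;
-- only this partition needs a daily Process Data, older partitions stay untouched.
SELECT SalesKey, OrderDate, CustomerKey, Amount
FROM dbo.FactSales
WHERE OrderDate >= DATEFROMPARTS(YEAR(GETDATE()), MONTH(GETDATE()), 1);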
I would recommend focusing on optimization at the DAX and model level.
Also, when users connect with Excel as the client application, there might be some performance issues even on an optimized model. This is because Excel sends MDX queries to the Tabular model, and those queries are not particularly well optimized when a larger number of measures is involved.

Related

Apache Drill for a Homogeneous Data Store

Just beginning to explore Apache Drill as a data engine for a reporting app.
We're a Postgres shop, as our transactional data is all in an RDBMS.
Moving to any NoSQL (MongoDB) is a distant dream for us and there's no pressing need for us to spend money on that as of today.
Our data size is big (but still all in Postgres). We have a few tables spanning up to the lower hundreds of millions of rows (say 150M).
Performance is key for us. We want our reports generated for the end user as fast as possible, in real time.
I have a basic question here for my use case:
If the time-cost of a native (direct) Postgres query is, say, P,
then by going through Drill, I would imagine the cost is going to be P + D, where D is the extra cost of Drill?
At the end of the day, if Postgres proves to be a bottleneck (say, missing indices etc.), then Drill can't make the situation better, no matter how many drillbits I add horizontally, right?
So, in what way does using Drill for my use case help, compared to optimising Postgres and querying it directly?
Apache Drill is usually used to consolidate access and to join across different database systems, e.g. a PostgreSQL and a MongoDB.
My first question here would be: why change a working and proven database system which, in its newer versions, is fully capable of handling JSON data? What is the main expected benefit driving the wish to move to MongoDB?
If you have only one database system, I'd concentrate on getting the most performance out of it. If you use Apache Drill to consolidate different systems, you have to keep a few facts in mind when designing the Drill layer:
You need Zookeeper nodes for Drill if you set up several drillbits
You need a few drillbit servers with real compute power and plenty of memory
You need to understand how Drill uses the underlying databases when queries are sent: Drill tries to push as much work as possible down to the database systems to minimize the processing it has to do itself (e.g. joins and LIKE statements happen in the database system), so the underlying database infrastructure has to be powerful. A cross-system query looks like the sketch below.
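As a rough sketch, Drill exposes standard SQL over configured storage plugins; a cross-source join could look like this (the plugin, schema, and table names here are hypothetical):

-- Hypothetical storage plugins: pg (PostgreSQL) and mongo (MongoDB).
-- Drill pushes the filter and as much other work as it can down to each source.
SELECT o.order_id, o.total, c.segment
FROM pg.public.orders o
JOIN mongo.crm.customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2023-01-01';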

Improve calculation efficiency in Tableau

I have over 300k records (rows) in my dataset, and I have a tab that stratifies these records into different categories and does conditional calculations. The issue is that it takes approximately an hour to run this tab. Is there any way I can improve calculation efficiency in Tableau?
Thank you for your help,
Probably the issue is accessing your source data. I have run into this problem when working directly against live data in a SQL database.
An easy solution is to use extracts. Quoting Tableau's Improving Database Query Performance article:
Extracts allow you to read the full set of data pointed to by your data connection and store it into an optimized file structure specifically designed for the type of analytic queries that Tableau creates. These extract files can include performance-oriented features such as pre-aggregated data for hierarchies and pre-calculated calculated fields (reducing the amount of work required to render and display the visualization).
Edit:
If you are using an extract and still have performance issues, I suggest massaging your source data to be more Tableau-friendly, for example by generating pre-calculated fields in the ETL process, as in the sketch below.
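A minimal sketch of that idea, assuming a hypothetical records table with an amount column: the stratification is computed once in the ETL layer, so Tableau reads a plain column instead of evaluating a conditional calculation across 300k+ rows at render time.

-- Hypothetical ETL step: persist the stratification instead of recomputing it
-- as a Tableau calculated field every time the view is rendered.
CREATE VIEW reporting.records_stratified AS
SELECT r.*,
       CASE
           WHEN r.amount >= 100000 THEN 'Large'
           WHEN r.amount >= 10000  THEN 'Medium'
           ELSE 'Small'
       END AS size_band
FROM staging.records r;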

How to use the Task Parallel Library with ADO.NET data access

I'm trying to optimise ADO.NET (.NET 4.5) data access with the Task Parallel Library (.NET 4.5). For example, when selecting 1,000,000,000 records from a database, how can we use the machine's multicore processor effectively with the Task Parallel Library? If anyone has found useful sources to get some ideas, please post :)
The following applies to all DB access technologies, not just ADO.NET.
Client-side processing is usually the wrong place to solve data access problems. You can achieve several orders of magnitude improvement in performance by optimizing your schema, creating proper indexes, and writing proper SQL queries.
Why transfer 1M records to a client for processing, over a limited network connection with significant latency, when a proper query could return the 2-3 records that matter?
RDBMS systems are designed to take advantage of available processors, RAM and disk arrays to perform queries as fast as possible. DB servers typically have far larger amounts of RAM and faster disk arrays than client machines.
What type of processing are you trying to do? Are you perhaps trying to analyze transactional data? In that case you should first extract the data to a reporting database or, better yet, an OLAP database. A star schema with proper indexes and precalculated analytics can be 1000x faster than an OLTP schema for analysis.
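As a rough sketch of what precalculated analytics can look like (all table and column names here are hypothetical), the transactional rows are rolled up once during the ETL load so report queries scan far fewer rows:

-- Hypothetical nightly ETL step: roll the last day's order lines up into a
-- daily summary fact table in the reporting star schema.
INSERT INTO dw.FactDailySales (sale_date, product_key, store_key, total_qty, total_amount)
SELECT CAST(o.order_date AS DATE), o.product_id, o.store_id,
       SUM(o.quantity), SUM(o.quantity * o.unit_price)
FROM oltp.OrderLines o
WHERE o.order_date >= DATEADD(DAY, -1, CAST(GETDATE() AS DATE))
GROUP BY CAST(o.order_date AS DATE), o.product_id, o.store_id;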
Improved SQL coding can also yield a 10x-50x improvement or more. A typical mistake by programmers not accustomed to SQL is to use cursors instead of set operations to process data. This usually leads to horrendous performance degradation, on the order of 50x or worse.
Pulling all the data to the client to process it row by row is even worse. It is essentially the same as using cursors, except that the data also has to travel over the wire and processing has to make do with the client's often limited memory.
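To make the cursor point concrete, here is a minimal sketch (the Invoices table and its columns are hypothetical) of the same change done row by row with a cursor and as a single set operation:

-- Hypothetical task: apply a 10% surcharge to overdue invoices.

-- Row-by-row with a cursor (slow):
DECLARE @id INT;
DECLARE c CURSOR FOR SELECT invoice_id FROM dbo.Invoices WHERE due_date < GETDATE();
OPEN c;
FETCH NEXT FROM c INTO @id;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE dbo.Invoices SET amount = amount * 1.10 WHERE invoice_id = @id;
    FETCH NEXT FROM c INTO @id;
END
CLOSE c;
DEALLOCATE c;

-- Equivalent set operation (typically orders of magnitude faster):
UPDATE dbo.Invoices SET amount = amount * 1.10 WHERE due_date < GETDATE();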
The only place where asynchronous processing offers any advantage is when you want to fire off a long operation and execute code when it finishes. ADO.NET already provides asynchronous operations using the APM model (BeginExecute/EndExecute). You can use the TPL to wrap these in a task to simplify programming, but you won't get any performance improvement.
It could be that your problem is not suited to database processing at all. If your algorithm requires that you scan the entire dataset multiple times, it would be better to extract all the data to a suitable file format in one go, and transfer it to another machine for processing.

Why build an SSAS Cube?

I was just searching for the best explanations and reasons to build an OLAP cube from relational data. Is it all about performance and query optimization?
It would be great if you could give links or point out the best explanations and reasons for building a cube, since we can do everything from the relational database that we can do from a cube, and the cube is just faster at showing results. Is there any other explanation or reason?
There are many reasons why you should use a cube for analytical processing.
Speed. OLAP warehouses are read-only infrastructures, providing queries around 10 times faster than their OLTP counterparts. See wiki
Multiple data integration. With a cube you can easily use multiple data sources and, with many automated tasks (especially when you use SSIS), do minimal work to integrate them into a single analysis system. See the ETL process
Minimal code. That is, you don't need to write queries. Even though you can write MDX - the language of cubes in SSAS - BI Studio does most of the hard work for you. On a project I am working on, we first used SSRS to provide reports for the client. The queries were long and hard to write, and took days to implement. Their SSAS equivalent reports took us half an hour to build, writing only a few simple queries to transform some data.
A cube provides reports and drill up/down/through without the need to write additional queries. Since the aggregations are already stored in the warehouse, end users can simply traverse the cube's dimensions to produce their own reports, again without writing any queries.
It is part of Business Intelligence. When you build a cube, it can be fed to many new technologies and help in the implementation of BI solutions.
I hope this helps.
If you want a top-level view, use OLAP. Say you have millions of rows detailing product sales and you want to know your monthly sales totals.
If you want bottom-level detail, use OLTP (e.g. SQL). Say you have millions of rows detailing product sales and want to examine one store's sales on one particular day to find potential fraud.
OLAP is good for big numbers. You wouldn't use it to examine string values, really...
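As a rough illustration in plain SQL (the FactSales table and its columns are hypothetical), those two workloads translate into very different queries:

-- OLAP-style question: monthly sales totals across all stores.
SELECT DATEFROMPARTS(YEAR(sale_date), MONTH(sale_date), 1) AS sale_month,
       SUM(amount) AS total_sales
FROM dbo.FactSales
GROUP BY DATEFROMPARTS(YEAR(sale_date), MONTH(sale_date), 1);

-- OLTP-style question: one store's detailed sales on one particular day.
SELECT *
FROM dbo.FactSales
WHERE store_id = 42 AND sale_date = '2014-03-15';

A cube essentially precomputes and stores aggregates like the first query, so they don't have to be recalculated on every request.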
It's a bit like asking why use Java/C++ when we can do everything with assembly language ;-) Building a cube (apart from performance) gives you the MDX language; this language has higher-level concepts than SQL and is better suited to analytic tasks. Perhaps this question gives more info.
My 2 centavos.

Are document databases good for storing large amounts of Stock Tick data? [closed]

I was thinking of using a database like MongoDB or RavenDB to store a lot of stock tick data and wanted to know if this would be viable compared to a standard relational database such as SQL Server.
The data would not really be relational and would be a couple of huge tables. I was also thinking that I could sum/min/max rows of data by minute/hour/day/week/month etc for even faster calculations.
Example data:
500 symbols * 60 min * 60 sec * 300 days... (per record we store: date, open, high, low, close, volume, openint - all decimal/float)
So what do you guys think?
Since this question was asked in 2010, several database engines have been released or have developed features that specifically handle time series such as stock tick data:
InfluxDB - see my other answer
Cassandra
With MongoDB or other document-oriented databases, if you are targeting performance, the advice is to contort your schema to organize ticks in an object keyed by seconds (or an object of minutes, each minute being another object with 60 seconds). With a specialized time-series database, you can query the data simply with:
SELECT open, close FROM market_data
WHERE symbol = 'AAPL' AND time > '2016-09-14' AND time < '2016-09-21'
I was also thinking that I could sum/min/max rows of data by minute/hour/day/week/month etc for even faster calculations.
With InfluxDB, this is very straightforward. Here's how to get the daily minimums and maximums:
SELECT MIN("close"), MAX("close") FROM "market_data" WHERE WHERE symbol = 'AAPL'
GROUP BY time(1d)
You can group by time intervals which can be in microseconds (u), seconds (s), minutes (m), hours (h), days (d) or weeks (w).
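For example, switching the same query to a weekly downsample (still using the hypothetical market_data schema shown above) only changes the aggregate and the interval:

SELECT MEAN("close") FROM "market_data" WHERE symbol = 'AAPL'
GROUP BY time(1w)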
TL;DR
Time-series databases are better choices than document-oriented databases for storing and querying large amounts of stock tick data.
The answer here will depend on scope.
MongoDB is a great way to get the data "in", and it's really fast at querying individual pieces. It's also nice because it's built to scale horizontally.
However, what you'll have to remember is that all of your significant "queries" are actually going to result from "batch job output".
As an example, Gilt Groupe has created a system called Hummingbird that they use for real-time analytics on their web site. Presentation here. They're basically dynamically rendering pages based on collected performance data in tight intervals (15 minutes).
In their case, they have a simple cycle: post data to mongo -> run map-reduce -> push data to webs for real-time optimization -> rinse / repeat.
This is honestly pretty close to what you probably want to do. However, there are some limitations here:
Map-reduce is new to many people. If you're familiar with SQL, you'll have to accept the learning curve of Map-reduce.
If you're pumping in lots of data, your map-reduces are going to be slower on those boxes. You'll probably want to look at slaving / replica pairs if response times are a big deal.
On the other hand, you'll run into different variants of these problems with SQL.
Of course there are some benefits here:
Horizontal scalability. If you have lots of boxes, you can shard across them and get somewhat linear performance increases on Map/Reduce jobs (that's how they work). Building such a "cluster" with SQL databases is a lot more costly.
Really fast speed, and as with point #1, you get the ability to add RAM horizontally to keep the speed up.
As mentioned by others though, you're going to lose access to ETL and other common analysis tools. You'll definitely be on the hook to write a lot of your own analysis tools.
Here's my reservation with the idea - and I'm going to openly acknowledge that my working knowledge of document databases is weak. I’m assuming you want all of this data stored so that you can perform some aggregation or trend-based analysis on it.
If you use a document-based DB as your source, the loading and manipulation of each row of data (CRUD operations) is very simple: very efficient, very straightforward, basically lovely.
What sucks is that there are very few, if any, options to extract this data and cram it into a structure more suitable for statistical analysis, e.g. a columnar database or cube. If you load it into a basic relational database, there is a host of tools, both commercial and open source, such as Pentaho, that will handle the ETL and analysis very nicely.
Ultimately though, what you want to keep in mind is that every financial firm in the world has a stock analysis/ auto-trader application; they just caused a major U.S. stock market tumble and they are not toys. :)
A simple datastore such as a key-value or document database is also beneficial in cases where the required analytics reasonably exceed a single system's capacity (or would require an exceptionally large machine to handle the load). In these cases it makes sense to use a simple store, since the analytics require batch processing anyway. I would personally look for a horizontally scaling processing method for producing the per-unit/per-time analytics required.
I would investigate using something built on Hadoop for parallel processing: either the framework natively in Java/C++, or some higher-level abstraction such as Pig, Wukong, or binary executables through the streaming interface. Amazon offers reasonably cheap processing time and storage if that route is of interest. (I have no personal experience, but many do and depend on it for their businesses.)