SSAS Tabular server has 16 cores, but only 4 are used in queries, at 100% cpu each? - ssas-tabular

Trying to improve/optimize query response time on a 7gb tabular model, SSAS Tabular 2017.
Server is a VM, has 128gb memory, 8 sockets, 16 cores. No NUMA nodes according to coreinfo.exe.
Biggest table is ~42 million rows, 2nd biggest is ~24 million rows. Query response times are commonly in the 5-15 second range, which seems long to me.
When I run queries against my model from local excel, cores 0-3 peg at 100% while the others remain at 0-ish.
Is this core utilization pattern normal? Should I be looking into server settings that can distribute query work over ALL the cores? Where are such settings?

In Tabular there are two query execution engines: Formula Engine is single-threaded (used for sophisticated calculations), Storage Engine is multi-threaded (used for simple calculations). From what you say it looks like the queries you run are not optimized to take advantage of Storage Engine. Please look for an article on how queries are executed in Tabular on www.sqlbi.com. Hope this helps anyone.

Related

No data available in performance insight

I have a critical application deployed in AWS RDS; the DB engine is PostgreSQL version 10.18. The architecture is unusual, because we're talking about medical data. This means that all the doctors connecting the database (through a PGBouncer) have their own schema; arount 4000 doctors means around 4000 schemas, with the same structure but different data obviously. Around 2000 doctors are actually connecting every day.
The Instance type is db.r5.4xlarge and there's a total buffer of around 100 GB. Still, there are a lot of hits on the disk, this means that on the performance insight side, I can actually see that the greater AAS is because of a metric called "DataFileRead", which (as far as I know) means that the data couldn't be fetched from the buffer and the Engine went for the disk. There's an average value of 60 AAS on DataFileRead.
This is not really the problem; I'm trying to apply some optimizations creating the right indexes for example, the problem is that on the TOP SQL tab I cannot see any data right after the query (like Calls/sec, Rows/sec, Blk hits/sec, etc.).
Does this means that the limit of 5000 rows of the pg_stat_statements is too low? Also I can't find any information about the impact on the database performance about having the statistics enabled. Does it increase critically increasing the limit of 5000 records? Can I go up to 50000 for example?

How to improve application performance? [Updated]

To give you an idea of the data:
DB has a collections/tables that has over a hundred million documents/records each containing more than 100 attributes/columns. The data size is expected to grow by hundred times soon.
Operations on the data:
There are mainly the following types of operations on the data:
Validating the data and then importing the data into the DB, that happens multiple times daily
Aggregations on this imported data
Searches/ finds
Updates
Deletes
Tools/softwares used:
MongoDB for database: PSS architecture based replicaset, indexes (most of the queries are INDEX scans)
NodeJS using Koa.js
Problems:
HOWEVER, the tool is very badly slow when it comes to aggregations, finds, etc.
What have I implemented for performance so far?:
DB Indexing
Caching
Pre-aggregations (using MongoDB aggregate to aggregate the data before hand and store it in different collections during importing to avoid aggregations at runtime)
Increased RAM and CPU cores on the DB server
Separate server for NodeJS server and Front-end build
PM2 to manage NodeJS server application and for spawning clusters
However from my experience, even after implementing all the above, the application is not performant enough. I feel that the reason for this is that the data is pretty huge. I am not aware of how Big Data applications are managed to deliver high performance. Please advise.
Also, is the selection of technology not suitable or will changing the technology/tools help? If yes, what is advised under such scenarios?
I'm requesting your advise to help me improve the performance of the application.
Not easy to give a correct answer because we do not really have that much details. What I would do is a detailed monitoring, at least the following:
Machine Level:
monitor the overall CPU load (for all cores) and RAM usage on your DB machine
monitor disk IO on the disks where the data is stored
this should show, if the machine specs are a bottleneck
Database & DB Process Level (my first guess, that this is the critical part):
what is the overall size of your data at the moment (I know, it will increase drastically but if it is already to slow now, this could be an interesting information - especially in relation to the current RAM size and number of CPU cores)
monitor memory usage and CPU load for your mongo DB process...
did a look on the query plans (while doing aggregations) guided you, what improvements can be done?
have look at the caching strategy. What strategy are you using?
this should give more detailed results on where to make improvements on a DB level. Is it just because of hardware bottlenecks or is it a aggregation problem...
Node.JS APP Level:
node.js app: how much RAM and CPU usage does this one take ...?
if there are multiple instances of the node.js app, track this for all instances
is the data import also happens through the nodejs app. Does the load on the app increases drastically while importing data?
if you see that you have a high load on this app that there is a need to act here (increasing instances, splitting it into seperate apps (e.g. import as a seperate app)

Amazon Redshift for SaaS application

I am currently testing Redshift for a SaaS near-realtime analytics application.
The queries performance are fine on a 100M rows dataset.
However, the concurrency limit of 15 queries per cluster will become a problem when more users will be using the application at the same time.
I cannot cache all aggregated results since we authorize to customize filters on each query (ad-hoc querying)
The requirements for the application are:
queries must return results within 10s
ad-hoc queries with filters on more than 100 columns
From 1 to 50 clients connected at the same time on the application
dataset growing at 10M rows / day rate
typical queries are SELECT with aggregated function COUNT, AVG with 1 or 2 joins
Is Redshift not correct for this use case? What other technologies would you consider for those requirements?
This question was also posted on the Redshift Forum. https://forums.aws.amazon.com/thread.jspa?messageID=498430&#498430
I'm cross-posting my answer for others who find this question via Google. :)
In the old days we would have used an OLAP product for this, something like Essbase or Analysis Services. If you want to look into OLAP there is an very nice open source implementation called Mondrian that can run over a variety of databases (including Redshift AFAIK). Also check out Saiku for an OSS browser based OLAP query tool.
I think you should test the behaviour of Redshift with more than 15 concurrent queries. I suspect that it will not be user noticeable as the queries will simply queue for a second or 2.
If you prove that Redshift won't work you could test Vertica's free 3-node edition. It's a bit more mature than Redshift (i.e. it will handle more concurrent users) and much more flexible about data loading.
Hadoop/Impala is overly complex for a dataset of your size, in my opinion. It is also not designed for a large number of concurrent queries or short duration queries.
Shark/Spark is designed for the case where you data is arriving quickly and you have a limited set of metrics that you can pre-calculate. Again this does not seem to match your requirements.
Good luck.
Redshift is very sensitive to the keys used in joins and group by/order by. There are no dynamic indexes, so usually you define your structure to suit the tasks.
What you need to ensure is that your joins match the structure 100%. Look at the explain plans - you should not have any redistribution or broadcasting, and no leader node activities (such as Sorting). It sounds like the most critical requirement considering the amount of queries you are going to have.
The requirement to be able to filter/aggregate on arbitrary 100 columns can be a problem as well. If the structure (dist keys, sort keys) don't match the columns most of the time, you won't be able to take advantage of Redshift optimisations. However, these are scalability problems - you can increase the number of nodes to match your performance, you just might be surprised of the costs of the optimal solution.
This may not be a serious problem if the number of projected columns is small, otherwise Redshift will have to hold large amounts of data in memory (and eventually spill) while sorting or aggregating (even in distributed manner), and that can again impact performance.
Beyond scaling, you can always implement sharding or mirroring, to overcome some queue/connection limits, or contact AWS support to have some limits lifted
You should consider pre-aggregation. Redshift can scan billions of rows in seconds as long as it does not need to do transformations like reordering. And it can store petabytes of data - so it's OK if you store data in excess
So in summary, I don't think your use case is not suitable based on just the definition you provided. It might require work, and the details depend on the exact usage patterns.

MongoDB Insert performance - Huge table with a couple of Indexes

I am testing Mongo DB to be used in a database with a huge table of about 30 billion records of about 200 bytes each. I understand that Sharding is needed for that kind of volume, so I am trying to get 1 to 2 billion records on one machine. I have reached 1 billion records on a machine with 2 CPU's / 6 cores each, and 64 GB of RAM. I mongoimport-ed without indexes, and speed was okay (average 14k records/s). I added indexes, which took a very long time, but that is okay as it is a one time thing. Now inserting new records into the database is taking a very long time. As far as I can tell, the machine is not loaded while inserting records (CPU, RAM, and I/O are in good shape). How is it possible to speed -up inserting new records?
I would recommend adding this host to MMS (http://mms.10gen.com/help/overview.html#installation) - make sure you install with munin-node support and that will give you the most information. This will allow you to track what might be slowing you down. Sorry I can't be more specific in the answer, but there are many, many possible explanations here. Some general points:
Adding indexes means that that the indexes as well as your working data set will be in RAM now, this may have strained your resources (look for page faults)
Now that you have indexes, they must be updated when you are inserting - if everything fits in RAM this should be OK, see first point
You should also check your Disk IO to see how that is performing - how does your background flush average look?
Are you running the correct filesystem (XFS, ext4) and a kernel version later than 2.6.25? (earlier versions have issues with fallocate())
Some good general information for follow up can be found here:
http://www.mongodb.org/display/DOCS/Production+Notes

Configuration Recommendations for a PostgreSQL Installation

I have a Windows Server 2003 machine which I will be using as a Postgres database server, the machine is a Dual Core 3.0Ghz Xeon with 4 GB ECC Memory and 4 x 120GB 10K RPM SAS Drives, all stripped.
I have read that the default Postgres install is configured to run nicely on a 486 with 32MB RAM, and I have read several web pages about configuration optimizations - but was hoping for something more concrete from my Stackoverflow peeps.
Generally, its only going to serve 1 database (potentially one or two more) but the catch is that the database has 1 table in particular which is massive (hundreds of millions of records with only a few coloumn). Presently, with the default configuration, it's not slow, but I think it could potentially be even faster.
Can people please give me some guidance and recomendations for configuration settings which you would use for a server such as this.
4*stripped drive was a bad idea — if any of this drives will fail then you'll loose all data, and even SAS drives sometimes fail — with 4 drivers it is 4 times more likely than with 1 drive; you should go for RAID 1+0.
Use the latest version of Postgres, 8.3.7 now; there are many performance improvements in every major version.
Set shared_buffers parameter in postgresql.conf to about 1/4 of your memory.
Set effective_cache_size to about 1/2 of your memory.
Set checkpoint_segments to about 32 (checkpoint every 512MB) and checkpoint_completion_target to about 0.8.
Set default_statistics_target to about 100.
Migrate to Enterprise Linux or FreeBSD: Postgres works much better on Unix type systems — Windows support is a recent addition, not very mature.
You can read more on this page: Tuning Your PostgreSQL Server — PostgreSQL Wiki
My experience suggests that (within limits) the hardware is typically the least important factor in database performance.
Assuming that you have enough memory to keep commonly used data in cache, then your CPU speed may vary 10-50% between a top-of the line machine and a common or garden box.
However, a missing index in an important search, or a poorly written recursive trigger could easily make a difference of 1,000% or 10,000% or more in your response times.
Without knowing exactly your table structure and row counts, I think anybody would suggest that your your hardware looks amply sufficient. It is only your database structure which will kill you. :)
UPDATE:
Without knowing the specific queries and your index details, there's not much more we can do. And in general, even knowing the queries, it's often very difficult to optimize without actually installing and running the queries with realistic data sets.
Given the cost of the server, and the cost of your time, I think you need to invest thirty bucks in a book. Then install your database with test data, run the queries, and see what runs well and what runs badly. Fix, rinse, and repeat.
Both of these books are specific to SQL Server and both have high ratings:
http://www.amazon.com/Inside-Microsoft%C2%AE-SQL-Server-2005/dp/0735621969/ref=sr_1_1
http://www.amazon.com/Server-Performance-Tuning-Distilled-Second/dp/B001GAQ53E/ref=sr_1_5