I am in the process of building a dimension from our EDW (source), and I need to pivot source columns to load the dimension.
Currently most of the pivoting I do is with T-SQL PIVOT, and the result is then used in my SSIS package to merge into the Dim table.
This pivoting can also be achieved with the SSIS Pivot Transformation component.
In terms of performance, which approach would be best?
Thanks
In theory, SQL Server PIVOT performance should be faster, or at least the same, but to be sure you would need to run some performance comparison tests.
But even if SSIS currently has the advantage, feel free to use SQL Server; staying out of SSIS is a good thing.
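For reference, a T-SQL PIVOT of this kind might look roughly like the sketch below; the table and column names are made up for illustration, not taken from the question.

    -- Hypothetical example: collapse one row per (customer, quarter) in a
    -- staging table into one row per customer before merging into the dimension.
    SELECT CustomerKey, [Q1], [Q2], [Q3], [Q4]
    FROM (
        SELECT CustomerKey, QuarterName, Amount
        FROM dbo.StgCustomerQuarter
    ) AS src
    PIVOT (
        SUM(Amount)
        FOR QuarterName IN ([Q1], [Q2], [Q3], [Q4])
    ) AS p;

Whichever approach wins the comparison, keeping the pivot logic in one place (either the T-SQL query or the SSIS Pivot component, not both) makes the package easier to maintain.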
When combining multiple fact tables through shared dimensions, you need to use a drill-across query or a multi-pass query to get correct results.
I'm looking for a BI tool that does this correctly, based on recognising which tables are fact tables and which are dimension tables. The tool should preferably generate a PostgreSQL query.
For most of the tools that I've been looking into, you need to recognise these situations yourself and write the SQL manually to fix this.
Are there any tools that will generate the correct queries for you, without the need to write the multi-pass or drill-across query yourself?
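For context, the query such a tool would need to generate is usually one aggregate query per fact table, with the results joined on the conformed dimension attributes so that neither fact fans out the other. A hand-written PostgreSQL sketch of the pattern, with hypothetical table names:

    -- Hypothetical drill-across: sales and shipments facts share a date dimension.
    WITH sales AS (
        SELECT d.month_name, SUM(f.sales_amount) AS sales_amount
        FROM fact_sales f
        JOIN dim_date d ON d.date_key = f.date_key
        GROUP BY d.month_name
    ),
    shipments AS (
        SELECT d.month_name, SUM(f.shipped_qty) AS shipped_qty
        FROM fact_shipments f
        JOIN dim_date d ON d.date_key = f.date_key
        GROUP BY d.month_name
    )
    SELECT COALESCE(s.month_name, sh.month_name) AS month_name,
           s.sales_amount,
           sh.shipped_qty
    FROM sales s
    FULL OUTER JOIN shipments sh ON sh.month_name = s.month_name;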
We are working on an audit system where auditors are given access to the transactions processed in the last quarter. Auditors perform various analyses on the data to find invalid/erroneous transactions that have some exceptions.
Generally, these analyses require the data to be plotted on charts to view the outliers, or sometimes duplicate detection is done based on multiple columns.
Sometimes the exception-detection algorithms are quite involved and require multiple processing steps using stored procedures.
Please note that the analysis rarely involves aggregation over huge numbers of rows.
Occasionally, they can change some data if they find it missing or incorrect.
We are evaluating row-based stores (SQL and NoSQL databases) and column stores (like data warehouse systems).
Is this a use case for a data warehouse, or for a row-based store such as NoSQL or some RDBMS?
In short, the requirements are:
- Occasional updates
- Mostly read queries over the last 3 months of data
- Reading data may require several massaging steps, like creating a temp table in step 1, joining it with another table in step 2, deleting some rows, etc. (a rough sketch of this kind of processing follows below)
Thanks
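To illustrate the kind of multi-step, set-based processing mentioned in the last requirement, here is a rough PostgreSQL-flavoured sketch; every table and column name in it is hypothetical.

    -- Hypothetical multi-step check: stage last quarter's rows, remove rows
    -- that cannot be valid, then flag duplicates across several columns.
    CREATE TEMP TABLE tmp_txn AS
    SELECT t.txn_id, t.account_id, t.amount, t.txn_date
    FROM transactions t
    WHERE t.txn_date >= date_trunc('quarter', now()) - interval '3 months'
      AND t.txn_date <  date_trunc('quarter', now());

    DELETE FROM tmp_txn x
    USING accounts a
    WHERE a.account_id = x.account_id
      AND a.status = 'CLOSED';

    SELECT account_id, amount, txn_date, COUNT(*) AS dup_count
    FROM tmp_txn
    GROUP BY account_id, amount, txn_date
    HAVING COUNT(*) > 1;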
For your task, it does not really matter how the data is stored. You need to think instead about how to create a solid dimensional model, how to populate it with data properly, and what reporting tools to use.
To give you an example, here are a couple of common setups I've used in my projects:
Microsoft stack setup:
- SQL Server for data storage
- SSIS for ETL (or write your own stored procedures if you know what you are doing)
- Publish the dimensional model on the same SQL Server. If your data set is large (over a billion records), use SSAS Tabular instead
- Power Pivot or Power BI for interactive reporting, or SSRS for paginated reports.
Open-source setup:
- PostgreSQL for data storage
- Stored procedures and/or Python to process the data
- Publish the dimensional model to another PostgreSQL database. If your data is large, publish the dimensional model to Redshift or another columnar database
- Tableau or Power BI for interactive reporting, or build your own reporting interface.
I think a NoSQL database is the wrong choice here, because an audit will require highly structured data.
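To make the dimensional-model advice concrete, here is a minimal star-schema sketch for the audit scenario; the table and column names are invented for illustration.

    -- Hypothetical star schema: one fact table for transactions,
    -- with conformed dimensions for date and account.
    CREATE TABLE dim_date (
        date_key     INT PRIMARY KEY,
        full_date    DATE NOT NULL,
        quarter_name VARCHAR(6) NOT NULL
    );

    CREATE TABLE dim_account (
        account_key  INT PRIMARY KEY,
        account_id   VARCHAR(20) NOT NULL,   -- natural/business key
        account_name VARCHAR(100)
    );

    CREATE TABLE fact_transaction (
        date_key    INT NOT NULL REFERENCES dim_date (date_key),
        account_key INT NOT NULL REFERENCES dim_account (account_key),
        txn_id      BIGINT NOT NULL,
        amount      NUMERIC(18, 2) NOT NULL
    );

The auditors' occasional corrections then become ordinary set-based UPDATE statements against the fact table.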
We're using Tableau 10.5.6. Years ago I used a reporting tool called Oracle Sales Analyzer. In that tool you could get to the queries generated by the reports and graphs you created, through back-end catalogs accessed with its command line.
There you could rewrite the query to be more efficient, fine-tuning the code if you needed to. It was a very cool feature of that reporting tool for geeks like me who like to dive into the back end of the product and tune it at a very low level.
My question is: does Tableau have any facility of this type? Is there a way to get to the queries that get stored once you create a report or a graph? Also, is there a command line where you can access these catalogs, if they exist? Or are these queries just stored in ASCII flat files that can be accessed by a user?
Thanks!
There are two ways that Tableau will query a database.
Option 1: Custom SQL
In your data source, you paste in the SQL you have written and Tableau will pass that query through to the database. This gives you complete control over the SQL, including adding any index hints you may want. See https://onlinehelp.tableau.com/current/pro/desktop/en-us/customsql.html
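As an example of the sort of thing you can paste in, here is a hypothetical custom SQL statement with a SQL Server index hint; the tables, columns, and index name are made up.

    -- Hypothetical custom SQL for a Tableau data source; the index hint
    -- is passed through to the database along with the rest of the query.
    SELECT o.order_id, o.order_date, c.customer_name, o.amount
    FROM dbo.Orders o WITH (INDEX (IX_Orders_OrderDate))
    JOIN dbo.Customers c ON c.customer_id = o.customer_id
    WHERE o.order_date >= '2018-01-01'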
Option 2: Use the Tableau data source designer
This is what many people do. Here, you visually design your data source with its joins. Tableau translates that design into what the Hyper engine considers to be the most effective way to run the query. Sometimes Hyper translates that into a regular SQL statement; sometimes it does some additional things to help boost performance, like breaking the work up into different queries. A lot depends on the database engine you are connecting to. There is no SQL stored in a flat file for this; Tableau just translates your design at run-time. The Hyper engine does a good job with fine-tuning, assuming you have an efficient database design with proper indexing and current table statistics.
There is a way to see the SQL from option 2 at run-time using Performance Recording. Performance Recording keeps track of each step of the visualization process and will spit out the SQL statement(s) that Tableau ran to generate your dataset. The SQL is not stored in the .twb file, though; it's a run-time analysis.
We create several Crystal Reports based on SQL Server - usually 2005 or 2008. Broadly, there are two kinds of reports:
a) tabular reports - which show some data in a table (for example, an invoice list)
b) document layouts - which show data in a specific format - usually from one or two main tables and several secondary tables (for example, an invoice)
We sometimes use tables directly in Crystal, or create a procedure in SQL and then use that procedure. One invoice could refer to around 10-12 tables, most of them linked to the primary invoice table with left outer joins.
Which option is better: using tables in Crystal (and letting Crystal create and run the SQL query), or creating a query and then using that query in Crystal? Which one will give better performance?
There will be no difference in performance between a query generated by the 'Database Expert' and the same SQL added to a Command. One caveat: ensure that the record-selection formula can be parsed and sent to the database (a filter applied WhileReadingRecords will definitely be less efficient than a pure-SQL one).
Reasons to prefer the 'Database Expert':
- prior to v2008, Command objects didn't support multivalued parameters
- easier to manage (somewhat subjective)
Reasons to prefer a Command:
- you can add hints
- you have more finely-grained control over the SQL (e.g. in-line views, CTEs, more-complex JOINs, subselects)
Personally, I try to avoid stored procedures, as they offer minimal performance benefits but require a more significant investment in development and maintenance.
In the end, there is no substitute for testing. Try your query both ways and measure the results.
Coding it yourself will almost invariably run faster -- after all, you know what your data looks like, and Crystal doesn't. Also, there are things you can do in manual queries (windowing functions, for example) that Crystal can't.
Crystal has a tendency to do some crazy stuff behind the scenes. You can use "Show SQL Query" under the Database menu to see what it creates. I find it easier to write the query in SQL, as I can optimize it myself much more easily. I also prefer to do any calculated/formula fields in SQL too, and just use Crystal as a display interface. If you do put logic in Crystal, remember that it runs that logic for every record returned... so if there are conditions that exclude a record from a formula, put those first to limit the time spent in the calculation.
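As a sketch of pushing a formula field into the SQL rather than leaving it as a Crystal formula, here is a hypothetical example; the tables and columns are invented.

    -- Hypothetical: compute the "formula field" once in the database instead of
    -- evaluating a Crystal formula for every returned record.
    SELECT i.InvoiceNo,
           i.InvoiceDate,
           c.CustomerName,
           i.Amount,
           CASE
               WHEN i.Amount = 0 THEN 0
               ELSE i.DiscountAmount / i.Amount * 100
           END AS DiscountPercent
    FROM dbo.Invoice i
    LEFT OUTER JOIN dbo.Customer c ON c.CustomerId = i.CustomerId;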
We're considering using SSIS to maintain a PostgreSQL data warehouse. I've used it before between SQL Servers with no problems, but I'm having a lot of difficulty getting it to play nicely with Postgres. I'm using the evaluation version of the OLEDB PGNP data provider (http://www.postgresql.org/about/news.1004).
I wanted to start with something simple like an UPSERT on the fact table (10k-15k rows are updated/inserted daily), but this is proving very difficult (not to mention that I'll want to use surrogate keys in the future).
I've attempted (Link) and (http://consultingblogs.emc.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx), which are effectively the same (except I don't really understand the UNION ALL at the end when I'm trying to upsert). But I run into the same problem with parameters when doing the update using an OLE DB Command, which I tried to overcome using (http://technet.microsoft.com/en-us/library/ms141773.aspx), but that just doesn't seem to work; I get a validation error:
The external columns for complent.... are out of sync with the datasource columns... external column “Param_2” needs to be removed from the external columns.
(This error is repeated for the first two parameters as well - I never came across this using the SQL Server connection, as it supports named parameters.)
Has anyone come across this?
AND:
The fact that this simple task is apparently so difficult to do in SSIS suggests I'm using the wrong tool for the job - is there a better (and still flexible) way of doing this? Or would another ETL package be better for use between two Postgres databases? Other options include any listed on (http://en.wikipedia.org/wiki/Extract,_transform,_load#Open-source_ETL_frameworks). I could just go and write a load of SQL to do this for me, but I wanted a neat and easily maintainable solution.
I have used the Slowly Changing Dimension wizard for this with good success. It may give you what you are looking for:
http://msdn.microsoft.com/en-us/library/ms141715.aspx
The "external columns out of sync" error: SSIS is case-sensitive - I've encountered this issue multiple times and it makes me want to pull my hair out.
This simple task is going to take some work either way. SSIS is by no means an enterprise-class ETL product yet, but it does give you some quick and easy functionality, and it is sufficient for most ETL work. I guess it is also about your level of comfort with it.
The SCD wizard is way too slow for what I want. I need to use set-based SQL.
It turned out that a lot of my problems were due to bugs in the provider.
I opened a forum topic (http://www.pgoledb.com/forum/viewtopic.php?f=4&t=49) and had a useful discussion with the moderator/support/developer person.
Also, Postgres doesn't let you do cross-database queries, so I solved the problem this way:
- Data flow from the production DB into a 'temp' table in the archive DB
- Run a set-based query between the 'temp' table and the archive table
- Truncate the 'temp' table
Note that the 'temp' table is not actually a temp table, but a copy of the archive table's schema used to temporarily store the data.
Took a while, but I got there in the end.
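For reference, the set-based step between the staging ('temp') table and the archive table might look roughly like this; the table and key names below are hypothetical.

    -- Hypothetical set-based upsert: update the archive rows that already
    -- exist, insert the ones that don't, then clear the staging table.
    UPDATE archive_fact a
    SET    amount   = s.amount,
           txn_date = s.txn_date
    FROM   staging_fact s
    WHERE  s.txn_id = a.txn_id;

    INSERT INTO archive_fact (txn_id, amount, txn_date)
    SELECT s.txn_id, s.amount, s.txn_date
    FROM   staging_fact s
    LEFT JOIN archive_fact a ON a.txn_id = s.txn_id
    WHERE  a.txn_id IS NULL;

    TRUNCATE staging_fact;

On PostgreSQL 9.5 or later, a single INSERT ... ON CONFLICT ... DO UPDATE statement could replace the first two steps.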
"SSIS is by no means an enterprise-class ETL product yet" - what enterprise ETL solution would you suggest?