Report Model Inefficiencies. Need Advice - ssrs-2008

I'm trying to help my power users have more access to our data so I don't have to interrupt my work (playing Pac-Man) 25 times a day writing Ad Hoc Queries and such.
I'm trying to use Data Source Views, Data Models, and Report Builder 2 and 3 to allow them to have access to cleansed data in which they can safely do their own basic analysis. I want to create generic Report Models covering business processes rather than a specific report model for each ad hoc report they would need.
I have to create the Data Source View (DSV) with a named query because the source database lacks primary keys, but it does have unique clustered indexes on identity columns.
Here's my problem. When I use a relatively simple query like this:
SELECT SOM.fsono AS SalesNo
     , SOM.fcustno AS CustNo
     , SLC.fcompany AS CustName
     , SOM.fcustpono AS CustPONo
     , SOM.fsoldby AS SalesPerson
     , SOR.fenumber AS ItemNo
     , SOR.finumber AS IntItemNo
     , SOR.frelease AS Rels
     , SOI.fprodcl AS ProdClass
     , SOI.fgroup AS GroupCode
     , RTRIM(SOR.fpartno) AS PartNo
     , SOR.fpartrev AS PartRev
     , CAST(SOI.fdesc AS VARCHAR(20)) AS PartDescription
     , SOM.forderdate AS OrderDate
     , SOR.fduedate AS DueDate
     , SOR.forderqty AS QtyOrd
     , SOR.funetprice AS NetUnitPrice
     , (SOR.forderqty * SOR.funetprice) AS NetAmountOrdered
FROM slcdpm SLC
INNER JOIN somast SOM
    ON SLC.fcustno = SOM.fcustno
LEFT OUTER JOIN soitem SOI
    ON SOM.fsono = SOI.fsono
LEFT OUTER JOIN sorels SOR
    ON SOI.fsono = SOR.fsono
    AND SOI.finumber = SOR.finumber
Let's assume the user takes the Report Model in Report Builder 3 and only requests SalesNo, PartNo, PartRev, OrderDate, and TotalNetAmount for their dataset.
The SQL Generated to pull that data is:
SET DATEFIRST 7
SELECT
CAST(1 AS BIT) [c0_is_agg],
CAST(1 AS BIT) [c1_is_agg],
CAST(1 AS BIT) [c2_is_agg],
CAST(1 AS BIT) [c3_is_agg],
4 [agg_row_count],
[CustomerSales].[TotalNetAmountOrdered] [TotalNetAmountOrdered],
[CustomerSales].[SalesNo] [SalesNo],
[CustomerSales].[PartNo] [PartNo],
[CustomerSales].[PartRev] [PartRev],
[CustomerSales].[OrderDate] [OrderDate]
FROM
(
SELECT
SUM([CustomerSales].[NetAmountOrdered]) [TotalNetAmountOrdered],
[CustomerSales].[SalesNo] [SalesNo],
[CustomerSales].[PartNo] [PartNo],
[CustomerSales].[PartRev] [PartRev],
[CustomerSales].[OrderDate] [OrderDate]
FROM
(
SELECT SOM.fsono AS SalesNo, SOM.fcustno AS CustNo, SLC.fcompany AS CustName, SOM.fcustpono AS CustPONo, SOM.fsoldby AS SalesPerson,
SOR.fenumber AS ItemNo, SOR.finumber AS IntItemNo, SOR.frelease AS Rels, SOI.fprodcl AS ProdClass, SOI.fgroup AS GroupCode, RTRIM(SOR.fpartno) AS PartNo,
SOR.fpartrev AS PartRev, CAST(SOI.fdesc AS VARCHAR(20)) AS PartDescription, SOM.forderdate AS OrderDate, SOR.fduedate AS DueDate, SOR.forderqty AS QtyOrd,
SOR.funetprice AS NetUnitPrice, SOR.forderqty * SOR.funetprice AS NetAmountOrdered
FROM slcdpm AS SLC INNER JOIN
somast AS SOM ON SLC.fcustno = SOM.fcustno LEFT OUTER JOIN
soitem AS SOI ON SOM.fsono = SOI.fsono LEFT OUTER JOIN
sorels AS SOR ON SOI.fsono = SOR.fsono AND SOI.finumber = SOR.finumber
) [CustomerSales]
WHERE
CAST(1 AS BIT) = 1
GROUP BY
[CustomerSales].[SalesNo], [CustomerSales].[PartNo], [CustomerSales].[PartRev], [CustomerSales].[OrderDate]
) [CustomerSales]
ORDER BY
[SalesNo], [PartNo], [PartRev], [OrderDate]
I would have expected only the fields the user requests in the report to be pulled, not every single field in the DSV. Also, if parameters are created that constrain the data, such as a beginning and ending date for OrderDate, the full data set is returned anyway.
Am I doing something wrong here?
Is there a better way to approach this?
Do other administrators find themselves with performance issues when using Report Models?

There are sometimes performance issues when dealing with Report Models. This is one of the reasons that Report Models are not meant to be rolled out to all of your users as a replacement for all reports. The queries generated by the semantic query engine behind Report Models are not tunable and are often totally NOT the way you yourself would write them.
The engine essentially treats the named query as a view and expands it into the underlying query, just as it would a view. This is often an issue when building a model directly over your database.
The ideal situation, from my perspective, is to have a separate database (a data warehouse, possibly) that is preferably housed on a separate server. This DW would be flattened out so that you could optimize it for read performance. Then you could use those tables directly in your data source view, and the semantic query engine behind the model should be able to generate better queries.
This ideal is often not possible due to economic or other constraints. Could you try having a job that more or less ETLs data from your base tables into a new set of tables that you could optimize for reporting to support your model?
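For example, a nightly job could flatten the named query above into a real reporting table. This is only a minimal sketch: rpt_CustomerSales, its column list, and the truncate-and-reload strategy are all assumptions to adapt.

-- Hypothetical nightly refresh of a flattened reporting table.
-- Truncate-and-reload is the simplest strategy; incremental loads work too.
TRUNCATE TABLE rpt_CustomerSales;

INSERT INTO rpt_CustomerSales
    (SalesNo, CustNo, CustName, PartNo, PartRev, OrderDate, NetAmountOrdered)
SELECT SOM.fsono, SOM.fcustno, SLC.fcompany,
       RTRIM(SOR.fpartno), SOR.fpartrev, SOM.forderdate,
       SOR.forderqty * SOR.funetprice
FROM slcdpm AS SLC
INNER JOIN somast AS SOM ON SLC.fcustno = SOM.fcustno
LEFT OUTER JOIN soitem AS SOI ON SOM.fsono = SOI.fsono
LEFT OUTER JOIN sorels AS SOR ON SOI.fsono = SOR.fsono
                             AND SOI.finumber = SOR.finumber;

With the DSV pointing at rpt_CustomerSales directly (a real table with a primary key, so no named query is needed), the semantic query engine has far less to expand on every request.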

Related

Is it OK to store transactional primary key on data warehouse dimension table to relate between fact-dim?

I have a data source (a Postgres transactional system) like this (simplified; the actual tables have more fields than this):
Then I need to create an ETL pipeline, where the required report is something like this:
order number (from sales_order_header)
item name (from sales_order_lines)
batch shift start & end (from receiving_batches)
delivered quantity, approved received quantity, rejected received quantity (from receiving_inventories)
My design for fact-dim tables is this (simplified).
What I don't know is the optimal ETL design.
Let's focus on how to insert the fact rows, and on the relationship between the fact table and dim_sales_orders.
If I have staging tables like these:
The ETL runs daily. After 22:00, there will be no more receiving, so I can run the ETL at 23:00.
Then I can just fetch data from sales_order_header and sales_order_lines, so at 23:00 the script can run something like:
INSERT INTO staging_sales_orders
SELECT order_number,
       item_name
FROM sales_order_header soh,
     sales_order_lines sol
WHERE soh.sales_order_id = sol.sales_order_header_id
  AND date_trunc('day', sol.created_timestamp) = date_trunc('day', now());
And the fact load can run at 23:30, with this query:
SELECT
soh.order_number,
rb.batch_shift_start,
rb.batch_shift_end,
sol.item_name,
ri.delivered_quantity,
ri.approved_received_quantity,
ri.rejected_received_quantity
FROM
receiving_batches rb,
receiving_inventories ri,
sales_order_lines sol,
sales_order_header soh
WHERE
rb.batch_id = ri.batch_id
AND ri.sales_order_line_id = sol.sales_order_line_id
AND sol.sales_order_header_id = soh.sales_order_id
AND date_trunc('day', sol.created_timestamp) = date_trunc('day', now())
But how do I optimally load the data into these tables, particularly the fact table?
My approach:
1. Select from staging_sales_orders and insert into dim_sales_orders, using an auto-increment primary key.
2. Before inserting into fact_receiving_inventories, I need to know the dim_sales_order_id, so I select:
SELECT
dim_sales_order_id
FROM
dim_sales_orders dso
WHERE
order_number = staging_row.order_number
AND item_name = staging_row.item_name
then insert into the fact table.
Now what I doubt is point 2 (selecting from the existing dim). Here I select based on 2 varchar columns, which should be a performance hit. Since the normalized source already carries sales_order_line_id, I'm thinking of modifying the staging tables, adding sales_order_line_id to both. Then, for point 2 above, I can just do:
SELECT
dim_sales_order_id
FROM
dim_sales_orders dso
WHERE
sales_order_line_id = staging_row.sales_order_line_id
But as a consequence, I will need to add sales_order_line_id into dim_sales_orders, which I don't find common in tutorials. I mean, adding the transactional table's PK can technically be done, since I can access the data source. But is it good DW fact-dim design to add such a transactional field (especially since it is a PK)?
Or is there any other approach, rather than selecting the existing dim based on 2 varchars?
How do I optimally select dimension ids for fact tables?
Thanks
It is practically mandatory to include the source PK/BK (primary key / business key) in a dimension.
The standard process is to load your dims and then load your facts. For the fact loads, you translate the source data to the appropriate dim SKs (surrogate keys) with lookups to the dims using the PK/BK.
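In this schema, that fact load could look like the sketch below, assuming dim_sales_orders carries sales_order_line_id as the business key, as proposed in the question; the fact table's column names are also assumptions.

-- Set-based fact load: resolve the dimension SK by joining on the BK,
-- instead of looking up each staging row one at a time.
INSERT INTO fact_receiving_inventories
    (dim_sales_order_id, delivered_quantity,
     approved_received_quantity, rejected_received_quantity)
SELECT dso.dim_sales_order_id,
       ri.delivered_quantity,
       ri.approved_received_quantity,
       ri.rejected_received_quantity
FROM receiving_inventories ri
JOIN sales_order_lines sol
  ON sol.sales_order_line_id = ri.sales_order_line_id
JOIN dim_sales_orders dso
  ON dso.sales_order_line_id = sol.sales_order_line_id   -- single-column BK lookup
WHERE date_trunc('day', sol.created_timestamp) = date_trunc('day', now());

This also replaces the per-row SELECT in step 2 with one set-based statement.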

Why is performance of CTE worse than temporary table in this example

I recently asked a question regarding CTEs and using data with no true root records (i.e. instead of the root record having a NULL parent_Id, it is parented to itself).
The question is here: Creating a recursive CTE with no rootrecord
That question has been answered and I now have the data I require; however, I am interested in the difference between the two approaches that I THINK are available to me.
The approach that yielded the data I required was to create a temp table with cleaned-up parenting data and then run a recursive CTE against it. It looked like this:
SELECT CASE
           WHEN Parent_Id = Party_Id THEN NULL
           ELSE Parent_Id
       END AS Act_Parent_Id
     , Party_Id
     , PARTY_CODE
     , PARTY_NAME
INTO #Parties
FROM DIMENSION_PARTIES
WHERE CURRENT_RECORD = 1;

WITH linkedParties
AS
(
    SELECT Act_Parent_Id, Party_Id, PARTY_CODE, PARTY_NAME, 0 AS LEVEL
    FROM #Parties
    WHERE Act_Parent_Id IS NULL
    UNION ALL
    SELECT p.Act_Parent_Id, p.Party_Id, p.PARTY_CODE, p.PARTY_NAME, t.LEVEL + 1
    FROM #Parties p
    INNER JOIN linkedParties t ON p.Act_Parent_Id = t.Party_Id
)
SELECT *
FROM linkedParties
ORDER BY LEVEL
I also attempted to retrieve the same data by defining two CTEs: one to emulate the creation of the temp table above, and the other to do the same recursive work but referencing the initial CTE rather than a temp table:
WITH Parties
AS
(
    SELECT CASE
               WHEN Parent_Id = Party_Id THEN NULL
               ELSE Parent_Id
           END AS Act_Parent_Id
         , Party_Id
         , PARTY_CODE
         , PARTY_NAME
    FROM DIMENSION_PARTIES
    WHERE CURRENT_RECORD = 1
),
linkedParties
AS
(
    SELECT Act_Parent_Id, Party_Id, PARTY_CODE, PARTY_NAME, 0 AS LEVEL
    FROM Parties
    WHERE Act_Parent_Id IS NULL
    UNION ALL
    SELECT p.Act_Parent_Id, p.Party_Id, p.PARTY_CODE, p.PARTY_NAME, t.LEVEL + 1
    FROM Parties p
    INNER JOIN linkedParties t ON p.Act_Parent_Id = t.Party_Id
)
SELECT *
FROM linkedParties
ORDER BY LEVEL
Now, these two scripts are run on the same server, yet the temp table approach yields the results in approximately 15 seconds.
The multiple-CTE approach takes upwards of 5 minutes (so long, in fact, that I have never waited for the results to return).
Is there a reason why the temp table approach would be so much quicker?
For what it is worth, I believe it is to do with the record counts. The base table has 200k records in it, and from memory CTE performance is severely degraded when dealing with large data sets, but I cannot seem to prove that, so I thought I'd check with the experts.
Many Thanks
Well, as there appears to be no clear answer to this, some further research into the general subject turned up a number of other threads with similar problems.
This one covers many of the differences between temp tables and CTEs, so it is the most useful for people looking to read around the issue:
Which are more performant, CTE or temporary tables?
In my case it would appear that the large amount of data in my CTE causes the issue: a CTE's result is not cached anywhere, so it is recreated each time it is referenced later, which has a large impact.
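If you stay with the temp table approach, it can also help to index the temp table for the recursive join. A sketch against the #Parties table created above; whether an index pays off here is an assumption that depends on your data:

-- Support the recursive member's join (p.Act_Parent_Id = t.Party_Id)
CREATE CLUSTERED INDEX IX_Parties_ActParentId ON #Parties (Act_Parent_Id);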
This might not be exactly the same issue you experienced, but I came across a similar one just a few days ago, and the queries did not even process that many records (a few thousand).
And yesterday my colleague had a similar problem.
Just to be clear we are using SQL Server 2008 R2.
The pattern that I identified, and that seems to throw the SQL Server optimizer off the rails, is using temporary tables in CTEs that are then joined with other temporary tables in the main select statement.
In my case I ended up creating an extra temporary table.
Here is a sample.
I ended up doing this:
SELECT DISTINCT st.field1, st.field2
INTO #Temp1
FROM SomeTable st
WHERE st.field3 <> 0;

SELECT x.field1, x.field2
FROM #Temp1 x
INNER JOIN #Temp2 o ON x.field1 = o.field1
ORDER BY 1, 2;
I tried the following query but it was a lot slower, if you can believe it.
WITH temp1 AS (
    SELECT DISTINCT st.field1, st.field2
    FROM SomeTable st
    WHERE st.field3 <> 0
)
SELECT x.field1, x.field2
FROM temp1 x
INNER JOIN #Temp2 o ON x.field1 = o.field1
ORDER BY 1, 2;
I also tried to inline the first query in the second one, and the performance was the same, i.e. VERY BAD.
SQL Server never ceases to amaze me. Once in a while I come across an issue like this one that reminds me it is a Microsoft product after all, but in the end you could say that other database systems have their own quirks.

Dynamic FROM clause in Postgres

Using PostgreSQL 9.1.13, I've written the following query to calculate some data:
WITH windowed AS (
SELECT a.person_id, a.category_id,
CAST(dense_rank() OVER w AS float) / COUNT(*) OVER (ORDER BY category_id) * 100.0 AS percentile
FROM (
SELECT DISTINCT ON (person_id, category_id) *
FROM performances s
-- Want to insert a FROM clause here
INNER JOIN person p ON s.person_id = p.ident
ORDER BY person_id, category_id, created DESC
) a
WINDOW w AS (PARTITION BY category_id ORDER BY score)
)
SELECT category_id,percentile FROM windowed
WHERE person_id = 1;
I now want to turn this into a stored procedure but my issue is that in the middle there, where I showed the comment, I need to place a dynamic WHERE clause. For example, I'd like to add something like:
WHERE p.weight > 110 OR p.weight IS NULL
The calling application lets people pick filters, so I want to be able to pass the appropriate filters into the query. There could be 0 or many filters, depending on the caller, but I could pass them all in as a properly formatted WHERE clause in a string parameter, for example.
The calling application just sends values to a webservice, which then builds the string and calls the stored procedure, so SQL injection attacks won't really be an issue.
"The calling application just sends values to a webservice, which then builds the string and calls the stored procedure, so SQL injection attacks won't really be an issue."
Too many cooks spoil the broth.
Either let your webservice build the SQL statement or let Postgres do it. Don't use both on the same query. That leaves two possible weak spots for SQL injection attacks and makes debugging and maintenance a lot harder.
Here is a full code example for a plpgsql function that builds and executes an SQL statement dynamically while making SQL injection impossible (from just two days ago):
Robust approach for building SQL queries programmatically
Details heavily depend on exact requirements.
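To illustrate the idea, here is a minimal sketch of what the Postgres side could look like. The function name, the single optional weight filter, and the column types are my assumptions; the point is that filter values travel only as data, never as SQL text:

-- Hypothetical sketch: Postgres assembles the SQL itself; the filter value
-- is embedded with format('%L', ...) (quoted as a literal) and the person id
-- is passed via EXECUTE ... USING, so injection is impossible.
CREATE OR REPLACE FUNCTION get_percentiles(_person_id int, _min_weight numeric DEFAULT NULL)
  RETURNS TABLE (category_id int, percentile float) AS
$func$
DECLARE
   _where text := 'TRUE';   -- no filter by default
BEGIN
   IF _min_weight IS NOT NULL THEN
      _where := format('(p.weight > %L OR p.weight IS NULL)', _min_weight);
   END IF;

   RETURN QUERY EXECUTE format(
      'WITH windowed AS (
          SELECT a.person_id, a.category_id,
                 CAST(dense_rank() OVER w AS float)
                    / COUNT(*) OVER (ORDER BY a.category_id) * 100.0 AS percentile
          FROM (
             SELECT DISTINCT ON (s.person_id, s.category_id) s.*
             FROM performances s
             JOIN person p ON s.person_id = p.ident
             WHERE %s
             ORDER BY s.person_id, s.category_id, s.created DESC
          ) a
          WINDOW w AS (PARTITION BY a.category_id ORDER BY a.score)
       )
       SELECT category_id, percentile FROM windowed WHERE person_id = $1',
      _where)
   USING _person_id;
END
$func$ LANGUAGE plpgsql;

Additional filters would extend _where the same way, with every value routed through format('%L', ...) or USING.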

zf many to many relationship, how to find values stored in intermediary table without a second lookup

I have a many-to-many relationship between two tables, which is represented with an intermediary table. I am using the ZF1 table relationships methodology to model the database in my application, and this works fine. One thing I am struggling with is pulling data from the intermediary table when performing a many-to-many lookup. For example:
productsTable
product_id,
product_name
customerTable
customer_id,
customer_name
salesTable
customer_id,
product_id,
date_of_sale
In this case the sales table is the intermediary table and the many-to-many relationship is between customers and products. I add the referenceMap for products and customers to the sales table model, and add "sales" as a dependent table to the product table model and the customer table model.
I can then successfully use the following code to get all the products for a given customer (or vice-versa).
$productTable = new productsTable();
$product = $productTable->find(1)->current();
$customers = $product->findManyToManyRowset('customerTable','salesTable');
But it does not include the "date_of_sale" value in the returned rowset. Is there a way of including the values from the intermediary table without doing a separate database lookup? I can't see anything in the ZF docs.
Any help would be cool.
I hope to eventually replace Zend_Db_Table with a data mapper implementation, as it seems highly inefficient in terms of the number of DB queries it executes, which could be hugely reduced with slightly more complex SQL joins rather than multiple simple selects, but for now I'm stuck with this.
Thanks.
You can use JOIN queries in your code to get it all in one call. From what I understand, you want to build this query:
SELECT p.*, c.*, s.date_of_sale FROM sales AS s
INNER JOIN products AS p ON p.product_id = s.product_id
INNER JOIN customers AS c ON c.customer_id = s.customer_id
WHERE p.product_id = '1';
To achieve this, you can refer to Complex SQL query into Zend to see how I translate it to a Zend_Db query, or just use something like:
$select = $this->select()->setIntegrityCheck(false);
$select->from(array('s'=>'sales'), array('date_of_sale'));
$select->join(array('p'=>'products'), 'p.product_id = s.product_id', '*');
$select->join(array('c'=>'customers'), 'c.customer_id = s.customer_id', '*');
$select->where('p.product_id = ?', $productId);
The ? placeholder lets Zend_Db quote the value into the WHERE clause for you (the same effect as quoteInto), so the query stays safe from injection.
Hope this helps!

Feasibility of recreating complex SQL query in Crystal Reports XI

I have about 10 fairly complex SQL queries on SQL Server 2008, but the client wants to be able to run them from their internal network (as opposed to from the non-local web app) through Crystal Reports XI.
The client's internal network does not allow us to (a) have write access to their proprietary DB, or (b) set up an intermediary SQL Server (meaning we cannot set up stored procedures or other data cleaning).
The SQL contains multiple instances of row_number() over (partition by col1, col2), group by col1, col2 with cube|rollup, and/or (multiple) pivots.
Can this even be done? Everything I've read seems to indicate that this is only feasible via stored procedures, and I would still need to pull the data from the proprietary DB first.
Following is a stripped-back version of one of the queries (e.g., JOINs not directly related to functionality, WHERE clauses, and half a dozen columns have been removed)...
select sum(programID)
, sum([a.Asian]) as [Episodes - Asian], sum([b.Asian]) as [Eps w/ Next Svc - Asian], sum([c.Asian])/sum([b.Asian]) as [Avg Days to Next Svc - Asian]
, etc... (repeats for each ethnicity)
from (
select programID, 'a.' + ethnicity as ethnicityA, 'b.' + ethnicity as ethnicityB, 'c.' + ethnicity as ethnicityC
, count(*) as episodes, count(daysToNextService) as episodesWithNextService, sum(daysToNextService) as daysToNextService
from (
select programID, ethnicity, datediff(d, dateOfDischarge, nextDateOfService) as daysToNextService from (
select t1.userID, t1.programID, t1.ethnicity, t1.dateOfDischarge, t1.dateOfService, min(t2.dateOfService) as nextDateOfService
from TABLE1 as t1 left join TABLE1 as t2
on datediff(d, t1.dateOfService, t2.dateOfService) between 1 and 31 and t1.userID = t2.userID
group by t1.userID, t1.programID, t1.ethnicity, t1.dateOfDischarge, t1.dateOfService
) as a
) as a
group by programID
) as a
pivot (
max(episodes) for ethnicityA in ([A.Asian],[A.Black],[A.Hispanic],[A.Native American],[A.Native Hawaiian/ Pacific Isl.],[A.White],[A.Unknown])
) as pA
pivot (
max(episodesWithNextService) for ethnicityB in ([B.Asian],[B.Black],[B.Hispanic],[B.Native American],[B.Native Hawaiian/ Pacific Isl.],[B.White],[B.Unknown])
) as pB
pivot (
max(daysToNextService) for ethnicityC in ([C.Asian],[C.Black],[C.Hispanic],[C.Native American],[C.Native Hawaiian/ Pacific Isl.],[C.White],[C.Unknown])
) as pC
group by programID with rollup
Sooooooo.... can something like this even be translated into Crystal Reports XI?
Thanks!
When you create your report, instead of selecting a table or stored procedure, choose "Add Command".
This will allow you to put in whatever valid T-SQL statement you want. Using common table expressions (CTEs) and inline views, I've managed to create some rather large, complex statements (in excess of 400 lines) against Oracle and SQL Server, so it is indeed feasible. However, if you use parameters, you should consider using sp_executesql, and you'll have to figure out how to avoid SQL injection.
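For instance, passing report parameters separately from the SQL text avoids concatenating user input into the command. A minimal sketch reusing TABLE1 from the query above; the date column and the literal values are just placeholders:

-- The parameter values are passed separately from the SQL text,
-- so they can never be interpreted as SQL.
DECLARE @sql nvarchar(max) = N'
    SELECT programID, COUNT(*) AS episodes
    FROM TABLE1
    WHERE dateOfService >= @startDate AND dateOfService < @endDate
    GROUP BY programID;';

EXEC sp_executesql @sql,
     N'@startDate date, @endDate date',
     @startDate = '2012-01-01',
     @endDate   = '2013-01-01';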