Constant issues with Tableau custom sql and Athena - tableau-api

Queries I seem to be easily able to run in SQL Workbench just don't work in Tableau - it's literally one Java error after another... rant over.
One thing I've noticed is that Tableau keeps trying to wrap an additional SELECT around the query, which Athena doesn't recognise. I thought I could overcome this using Athena views, but that doesn't seem to work either.
When I do the following in Tableau:
SELECT count(distinct uuid), category
FROM "pregnancy_analytics"."final_test_parquet"
GROUP BY category
I get the following in Athena, which throws an error (SYNTAX_ERROR: line 1:8: Column 'tableausql._col0' cannot be resolved). As I said, it looks like Tableau is trying to "nest" the SELECT:
SELECT "TableauSQL"."_col0" AS "xcol0"
FROM (
SELECT count(distinct uuid)
FROM "pregnancy_analytics"."final_test_parquet"
WHERE category = ''
LIMIT 100
) "TableauSQL"
LIMIT 10000
NB: as I said above, the problem is that Tableau sticks another SELECT around this, referencing a table that doesn't exist, which is why Athena kicks up the error.
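For reference, the _col0 in the error message is just Athena/Presto's auto-generated name for an unaliased expression; the same custom SQL with the aggregate explicitly aliased (the alias name here is my own choice, and whether Tableau still wraps it is a separate question) would look like this:
-- Same custom SQL, with the aggregate given an explicit alias so the
-- inner query exposes a named column rather than the auto-generated _col0.
SELECT count(distinct uuid) AS uuid_count, category
FROM "pregnancy_analytics"."final_test_parquet"
GROUP BY category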
Starting to feel like Tableau is not a good fit with Athena? Does anyone have a better suggestion?
Thanks!

Related

when to use direct sql query vs type-safe slick

I am not sure when Slick runs the SQL query and returns the result set. I want to run different queries, but my table is very big. The SQL command itself takes a few seconds to run. I am afraid that if I use Slick to return the entire table and then do selection or group by, it will consume a lot of memory and make everything slow.
Here is one example to get the count of customers per day:
sql"""
select timestamp::date, count(distinct id) from tenant GROUP BY timestamp::date ORDER BY timestamp::date asc;
""".as[(Date, Int)]
My table has multiple columns other than date and id. If I bring in the entire table and then do the group by and map in Scala, it takes a lot of memory.
If I use the above command, it is prone to errors.
Is there a way to return only id and date from the SQL db? And then what is the best way to do the group by?
I have tried the above SQL command, but I'm not sure how to use Slick better.
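For what it's worth, a plain projection that returns only the two columns (so the full width of the tenant table never leaves the database) would look like the sketch below; the date filter is purely illustrative and not part of my original question:
-- Return only the columns needed; the WHERE clause is an illustrative
-- range filter, not something required by the question.
select timestamp::date as day, id
from tenant
where timestamp >= date '2020-01-01'
order by day asc;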

Convert T-SQL Cross Apply to Redshift

I am converting the following T-SQL statement to Redshift. The purpose of the query is to convert a column in the table with a value containing a comma delimited string with up to 60 values into multiple rows with 1 value per row.
SELECT
id_1
, id_2
, value
into dbo.myResultsTable
FROM myTable
CROSS APPLY STRING_SPLIT([comma_delimited_string], ',')
WHERE [comma_delimited_string] is not null;
In SQL Server this processes 10 million records in just under 1 hour, which is fine for my purposes. Obviously a direct conversion to Redshift isn't possible, since Redshift has no CROSS APPLY or STRING_SPLIT functionality, so I built a solution using the process detailed here (Redshift. Convert comma delimited values into rows), which uses split_part() to split the comma-delimited string into multiple columns, plus another query that unions everything to get the final output into a single column. But a typical run takes over 6 hours to process the same amount of data.
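A condensed sketch of that split_part-plus-union pattern (only three of the up-to-60 positions shown; column and table names follow the original query):
-- Sketch of the approach I converted to: split each position into its own
-- SELECT, then UNION ALL everything back into a single column.
SELECT id_1, id_2, SPLIT_PART(comma_delimited_string, ',', 1) AS field FROM myTable WHERE comma_delimited_string IS NOT NULL
UNION ALL
SELECT id_1, id_2, SPLIT_PART(comma_delimited_string, ',', 2) FROM myTable WHERE comma_delimited_string IS NOT NULL
UNION ALL
SELECT id_1, id_2, SPLIT_PART(comma_delimited_string, ',', 3) FROM myTable WHERE comma_delimited_string IS NOT NULL
-- ...repeated up to position 60, with empty values filtered out afterwards.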
I wasn't expecting to run into this issue, given the power difference between the machines. The SQL Server I was using for the comparison test was a simple server with 12 processors and 32 GB of RAM, while the Redshift cluster is based on dc1.8xlarge nodes (I don't know the total count). The cluster is shared with other teams, but when I look at the performance information there are plenty of available resources.
I'm relatively new to Redshift, so I'm still assuming I'm not understanding something, but I have no idea what I'm missing. Are there things I need to check to make sure the data is loaded in an optimal way (I'm not an admin, so my ability to check this is limited)? Are there other Redshift query options that are better than the example I found? I've searched for other methods and optimizations, but short of looking into cross joins, something I'd like to avoid (plus, when I tried to talk to the DBAs running the Redshift cluster about this option, their response was a flat "No, can't do that."), I'm not even sure where to go at this point, so any help would be much appreciated!
Thanks!
I've found a solution that works for me.
You need to do a JOIN on a numbers table, for which you can use any table as long as it has more rows than the number of fields you need. You need to make sure the numbers are integers by forcing the type. Using the function regexp_count on the column to be split, in the ON clause, to count the number of fields (delimiters + 1), generates a table with one row per repetition.
Then you use the split_part function on the column, and use the numbers.num column to extract a different part of the text for each of the rows.
SELECT comma_delimited_string,
       numbers.num,
       REGEXP_COUNT(comma_delimited_string, ',') + 1 AS nfields,
       SPLIT_PART(comma_delimited_string, ',', numbers.num) AS field
FROM mytable
JOIN
(
    select (row_number() over (order by 1))::int as num
    from mytable
    limit 15 --max num of fields
) as numbers
ON numbers.num <= regexp_count(comma_delimited_string, ',') + 1
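To mirror the original SELECT ... INTO and keep just one split value per output row, the same pattern can be wrapped in a CREATE TABLE AS. This is only a sketch: the target table name comes from the original T-SQL (minus the dbo schema), the 60-field cap from the question, and the "field" alias from the query above.
-- Materialize one row per delimited value, mirroring the original SELECT ... INTO.
CREATE TABLE myResultsTable AS
SELECT t.id_1,
       t.id_2,
       SPLIT_PART(t.comma_delimited_string, ',', numbers.num) AS field
FROM mytable t
JOIN (
    SELECT (ROW_NUMBER() OVER (ORDER BY 1))::int AS num
    FROM mytable
    LIMIT 60 -- max number of delimited fields
) AS numbers
    ON numbers.num <= REGEXP_COUNT(t.comma_delimited_string, ',') + 1
WHERE t.comma_delimited_string IS NOT NULL;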

Execution Plan on a View looking at Partitioned Tables

I currently have tables that are partitioned out by year & month for our sales transactions. For example, we have sales tables that would look something like this:
factdailysales_201501
factdailysales_201502
factdailysales_201503 etc ...
Generally, I've always used dynamic SQL to capture a start date and end date, find out which partitions those fall in, and then loop through each of those partitions... but it's starting to become such a hassle, and I've learned that this is probably not the best way to do it in terms of maintenance, troubleshooting, and performance.
I decided to build a view that would UNION ALL my sales partitions together. However, I don't want selecting from the view to scan all of the partitions on execution, as that would defeat the whole purpose of partitioning the tables out. Because of this, I added check constraints on the date column to each of my sales tables, so that when I select from the view it knows which tables to access instead of scanning every table.
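For reference, a stripped-down sketch of that setup (the constraint names, date type and exact month boundaries here are mine; the table and view names are the ones above):
-- Each monthly table gets a check constraint on [Date]...
ALTER TABLE factdailysales_201501
    ADD CONSTRAINT ck_factdailysales_201501 CHECK ([Date] >= '20150101' AND [Date] < '20150201')
GO
ALTER TABLE factdailysales_201502
    ADD CONSTRAINT ck_factdailysales_201502 CHECK ([Date] >= '20150201' AND [Date] < '20150301')
GO
-- ...and the view simply UNION ALLs the monthly tables together.
CREATE VIEW Sales_Orig AS
SELECT * FROM factdailysales_201501
UNION ALL
SELECT * FROM factdailysales_201502
UNION ALL
SELECT * FROM factdailysales_201503
GO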
Here are some examples:
SELECT SUM([retail])
FROM Sales_Orig
WHERE [Date] >= '2015-03-01'
This query has the execution plan of only pulling from the partitions that I need.
The problem I'm facing right now is that when my team writes stored procedures, they will more than likely write their queries so that a date variable is passed into the WHERE clause.
DECLARE @SD DATE = '2015-03-01'
SELECT SUM([retail])
FROM Sales_Orig
WHERE [Date] >= @SD
However, when a variable is passed in, the execution plan now scans ALL of the partitions in the view, causing the query to take way longer than when I hard-coded the date.
I suppose I could do dynamic SQL again and insert the date string into the SELECT statement, but that would bring me right back to where I started: trying to get rid of dynamic SQL for this simple sales query in the first place.
So my question is, am I setting this up wrong? Am I on the right track? It seems that the view can't take in a variable for the check constraint and ends up scanning every table. Is there another approach anyone would recommend? Maybe my original solution of just looping through partitions via dynamic SQL is the best way to do it?
** EDIT **
http://sqlsunday.com/2014/08/31/partitioned-views/
This article is actually where I initially saw the idea! It seems that even when using that exact same solution, I'm still experiencing the same struggle!
Thanks!!
Okay, this might work. It's a table-valued function that only accesses tables according to your @Start and @End parameters, so it only touches the "partitions" it needs. I figured you could take this concept and write some dynamic SQL to create all the IF statements.
Now of course new tables are added every day, so how does that tie in? Well, I think the best way would be that every day you alter the function, adding the next sales table. That way querying it stays simple, and you could use the same dynamic SQL you used to create the function to alter it, which should be relatively straightforward.
Note: I added default values that are the min and max of the DATE data type. That way you could query, for example, everything from 20140101 onward, or vice versa.
Your tables
SELECT CAST('20150101' AS DATE) datesVal INTO factDailySales_20150101;
SELECT CAST('20150102' AS DATE) datesVal INTO factDailySales_20150102;
SELECT CAST('20150103' AS DATE) datesVal INTO factDailySales_20150103;
The Function
CREATE FUNCTION ufn_factTotalSales (@Start DATE = '17530101', @End DATE = '99991231')
RETURNS @factTotalSales TABLE
(
    datesVal DATE
)
AS
BEGIN
    IF (CAST('20150101' AS DATE) BETWEEN @Start AND @End)
    BEGIN
        INSERT INTO @factTotalSales
        SELECT datesVal
        FROM factDailySales_20150101
    END

    IF (CAST('20150102' AS DATE) BETWEEN @Start AND @End)
    BEGIN
        INSERT INTO @factTotalSales
        SELECT datesVal
        FROM factDailySales_20150102
    END

    IF (CAST('20150103' AS DATE) BETWEEN @Start AND @End)
    BEGIN
        INSERT INTO @factTotalSales
        SELECT datesVal
        FROM factDailySales_20150103
    END

    RETURN;
END
GO
All tables
SELECT *
FROM ufn_factTotalSales(default,default)
All tables greater than or equal to 20150102
SELECT *
FROM ufn_factTotalSales('20150102',default)
All tables less than or equal to 20150102
SELECT *
FROM ufn_factTotalSales(default,'20150102')
All tables between specific range
SELECT *
FROM ufn_factTotalSales('20150101','20150102')
Is this the ideal solution? No. The ideal would be to combine all the tables into one, with good indexes. I know you said that wouldn't work because of the way other code has been written. Hear me out. Now, perhaps this is off the wall, but let's say you do combine the tables; obviously there are old scripts looking for specific daily sales tables. Maybe you could create views with the dailySales names that access factTotalSales. Or you could create synonyms for factTotalSales that correspond to each factDailySales.
Maybe you could look into that. It wouldn't be easy, but I think letting SQL Server optimize your queries the way it was designed to is better than forcing it with dynamic SQL.
Just my two cents. Hope this helps. At the very least, I hope it gave you some ideas.
5 years later: option(recompile).
The planner needs to have access to the constants to eliminate the table entirely from the query plan. With a variable, without a forced recompile, a generic plan is used. (Related: parameter sniffing.)
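Applied to the query from the question, the hint is simply appended to the statement (a minimal sketch):
-- OPTION (RECOMPILE) lets the optimizer see the runtime value of @SD,
-- so the non-matching partitions can be eliminated from the plan.
DECLARE @SD DATE = '2015-03-01'
SELECT SUM([retail])
FROM Sales_Orig
WHERE [Date] >= @SD
OPTION (RECOMPILE)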
While this means the query plan is larger, as it has to include all tables, it does not mean that all tables are actually scanned: look at the IO stats, because table-scan elimination occurs at execution time even if the scans still show up in the query plan.
The 'Number of Executions' in the query plan will be 0 when the tables are not scanned. Unfortunately, these branches are still reported as "Table Scan" nodes with a non-zero percentage cost in the query plan and UI, which will appear proportionally high if the query is trivially fast. The displayed percentage cost of these extra "Table Scan" nodes approaches zero as the amount of data returned from the base tables actually used increases.
This same optimization/elimination occurs when the view is not a Partitioned View (e.g. the base tables are missing the partition column in the PK), yet the underlying tables have a suitable check constraint on the filtered column. It also occurs when the view selects a constant value to establish the partition that is not otherwise stored in the table. With a constant in the query, or in a recompiled plan, the tables will be eliminated entirely; with a variable, the tables will still not actually be scanned, and are thus eliminated logically during query execution.
The use of a proper Partitioned View is only really beneficial for allowing direct INSERTs and UPDATEs, with the major caveat that it requires the partition column to be in each table's PK and disallows the use of an identity column (making a Partitioned View largely useless IMHO). SQL Server handles the optimizations very similarly for the other quasi-Partitioned-View cases.
(This is on SQL Server 2014; earlier versions might not have optimized the different patterns as efficiently.)

In SSRS, why do I get the error "item with same key has already been added" when I'm making a new report?

I'm getting the following error in SSRS:
An error occurred while the query design method was being saved.
An item with the same key has already been added
What does an "item" denote, though? I even tried editing the RDL and deleting all references to the Stored Procedure I need to use called prc_RPT_Select_BI_Completes_Data_View.
Could this possibly have to do with the fact that the Stored Procedure uses Dynamic SQL (the N' notation)?
In the stored procedure I have:
SET @SQL += N'
SELECT bi.SupplierID as ''Supplier ID''
,bi.SupplierName as ''Supplier Name''
,bi.PID as ''PID''
,bi.RespondentID as ''Respondent ID''
,lk_slt.Name as ''Entry Link Type''
,ts.SurveyNumber as ''Initial Survey ID'''
It appears that SSRS has an issue (at least in version 2008) - I'm studying this website that explains it.
It says that if you have two columns (from two different tables) with the same name, it'll cause that problem.
From source:
SELECT a.Field1, a.Field2, a.Field3, b.Field1, b.field99 FROM TableA a
JOIN TableB b on a.Field1 = b.Field1
SQL handled it just fine, since I had prefixed each with an alias (table) name. But SSRS uses only the column name as the key, not table + column, so it was choking.
The fix was easy: either rename the second column, i.e. b.Field1 AS Field01, or just omit the field altogether, which is what I did.
I faced the same issue, and after debugging I fixed it. If the same column name appears multiple times in your SQL query, this issue occurs. Hence, use aliases in the SQL query to differentiate the column names.
Ex:
The query below will work properly as a SQL query but creates an issue in the SSRS report:
Select P.ID, P.FirstName, P.LastName, D.ID, D.City, D.Area, D.Address
From PersonalDetails P
Left Join CommunicationDetails D On P.ID = D.PersonalDetailsID
Reason: ID is mentioned twice (multiple times).
Correct Query:
Select P.ID As PersonalDetailsID, P.FirstName, P.LastName, D.ID, D.City, D.Area, D.Address
From PersonalDetails P
Left Join CommunicationDetails D On P.ID = D.PersonalDetailsID
I have experienced this issue in the past. Based on that, I can say that we generally get this issue when your dataset has multiple field names that point to the same field source.
Take a look at the following posts for a detailed description of the error:
https://bi-rootdata.blogspot.com/2012/09/an-error-occurred-during-report.html
https://bi-rootdata.blogspot.com/2012/09/an-item-with-same-key-has-already-been.html
In your case, you should check all the field names returned by the SP prc_RPT_Select_BI_Completes_Data_View and make sure that all fields have unique names.
I had the same error in a report query. I had columns from different tables with the same name, prefixed with each table's alias (e.g. select a.description, b.description, c.description), which runs OK in Oracle, but for the report you must have a unique alias for each column, so simply add aliases to the fields with the same name (select a.description a_description, b.description b_description, and so on).
Sorry, it is a response to an old thread, but might still be useful.
In addition to the above responses:
This generally happens when two columns with the same name, even from different tables, are included in the same query.
For example, if we are joining two tables, city and state, and both tables have a column called name (i.e. city.name and state.name), then when such a query is added to the dataset, SSRS removes the table name or table alias and only keeps the column name, which then appears twice in the query and errors out as a duplicate key.
The best way to avoid it is to use aliases when referencing the column names, for example:
city.name as c_name
state.name as s_name
This will resolve the issue.
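Put together, the aliased version of that city/state query might look like this (the join condition is only an assumed example of the schema):
-- Both name columns get distinct aliases, so SSRS no longer sees a duplicate key.
SELECT city.name  AS c_name,
       state.name AS s_name
FROM city
JOIN state ON city.state_id = state.id -- assumed join keys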
I got this error message with VS2015 Enterprise, SSDT 14.1.xxx, SSRS. For me, I think it was something different from the two-columns-with-the-same-name problem described above. I added a report, then deleted the report, and then when I tried to add the query back in the SSRS wizard I got this message: "An error occurred while the query design method was being saved: invalid object name: tablename", where tablename was the table in the query the wizard was reading. I tried cleaning the project; I tried rebuilding the project. In my opinion Microsoft isn't completely cleaning out the report when you delete it, and as long as you try to add the original query back, it won't add. The way I was able to fix it was to create the SSRS report in a whole new project (obviously nothing was wrong with the query) and save it off to the side. Then I reopened my original SSRS project, right-clicked on Reports, then Add, then Add Existing Item. The report added back in just fine with no name conflict.
I just got this error, and it turned out to be about aliasing the local variables.
At the end of the stored procedure I had something like:
select @localvariable1, @localvariable2
It was working fine in SQL, but when I ran it in SSRS it always threw the error. After I gave the variables aliases, it was fixed:
select @localvariable1 as A, @localvariable2 as B
SSRS will not accept duplicated columns, so ensure that your query or stored procedure returns unique column names.
If you are using SPs, and the SPs have multiple SELECT statements (within IF conditions), all those SELECTs need to return unique field names.

Suggestion needed to avoid table scan in large view w/many duplicate values

Could anyone offer a solution to speed up one of our processes? We have a view used for reporting that is a UNION ALL of 10 tables. The view has 180 million rows. We would like to generate a list of the distinct values of individual columns. The current SQL generated by the reporting tool does a SELECT DISTINCT on the view, which takes 10 minutes. Preferably the solution would be automatically updated. We have been trying to create an MQT in DB2 UDB V8 as a union all, refresh immediate, with little success. Any suggestions would be greatly appreciated.
Charles.
There are a lot of restrictions in DB2 8.2 for refresh immediate MQTs, and they can have a significant performance impact on applications that write to the base tables. That said, you may be able to use an MQT. However, instead of using SELECT DISTINCT, try making the query look something like:
select yourcolumn, count(*) as ignore
from union_all_view
group by yourcolumn
The column (yourcolumn) must be defined as NOT NULL for this to work (in DB2 8.2). The optimizer may not select this MQT if you still issue SELECT DISTINCT against the union all view, so you may need to query the MQT (or a view defined on top of it) directly. Ignore the column "ignore" in the MQT -- that is there only for DB2; if you really don't want to see it, you can create a view on top of the MQT.
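As a rough sketch, the MQT definition built from that query might look like the following (the table and view names are placeholders, and whether DB2 8.2 accepts it with the view reference depends on the refresh-immediate restrictions mentioned above):
-- Sketch of a refresh-immediate MQT over the grouping query; refresh once after creation.
CREATE TABLE yourcolumn_mqt AS (
    SELECT yourcolumn, COUNT(*) AS ignore
    FROM union_all_view
    GROUP BY yourcolumn
)
DATA INITIALLY DEFERRED REFRESH IMMEDIATE;

REFRESH TABLE yourcolumn_mqt;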
However, this is really a database design issue. Why do you need to scan 180 million rows of data to find the unique values in a particular column? Why don't these values already reside in their own table, with foreign keys defined against it from each of the 10 base tables?
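In other words, something along these lines (purely illustrative names; the real column type and base table names would come from your schema):
-- A lookup table holding each distinct value exactly once,
-- referenced by each of the 10 base tables via a foreign key.
CREATE TABLE yourcolumn_lookup (
    yourcolumn VARCHAR(50) NOT NULL,
    PRIMARY KEY (yourcolumn)
);

ALTER TABLE base_table_01
    ADD CONSTRAINT fk_base01_yourcolumn
    FOREIGN KEY (yourcolumn) REFERENCES yourcolumn_lookup (yourcolumn);
-- ...repeated for the other nine base tables.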