Quicksight dataset incomplete after refresh - amazon-redshift

I have a dataset in Quicksight that is connected to a Redshift table but pulls from SPICE. The dataset is scheduled to refresh from Redshift to SPICE daily. It is a small table, I am using only a fraction of my SPICE capacity, and this method has been working fine for almost two years.
For some reason, the SPICE dataset has suddenly stopped refreshing completely, and I can't figure out why.
There are 183 records in the Redshift table, but only 181 records in the SPICE dataset. If I query the table using SQL Workbench/J I get 183 records, but the QuickSight dataset shows only 181.
I have tried refreshing multiple times and have also switched the dataset to direct query to bypass SPICE, but I still cannot get those other two rows returned.
Nothing has changed in our permissions or anything about the Redshift-Quicksight IAM config.
Any ideas about what could possibly be going on here?
Thanks for any help!
UPDATE: As I mentioned, if I SELECT * from the table with SQL Workbench/J, I get the 183 rows that I expect. However, if I SELECT * directly in the AWS Query Editor v2, I only get 181 rows. Can anyone explain what is causing this discrepancy?
SOLVED: The difference is that my processing now requires an explicit COMMIT, which it did not require before.
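Presumably the two missing rows were only visible to the session that loaded them until the COMMIT ran, which would explain why one client showed 183 rows and the others showed 181. A minimal sketch of the fix (table names are hypothetical, not my actual objects):
BEGIN;
INSERT INTO myschema.mytable (id, val)   -- the load step that produced the two new rows
SELECT id, val FROM myschema.staging;
COMMIT;   -- without this, QuickSight and Query Editor v2 never see the new rows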

Related

(Databricks) last_value function usage on huge dataset

For a table that contains billions of rows (about 10 million added each day) we are trying to "fix/clean" the source data. For missing rows we copy the measures x, y, z from the past. To do this we use the last_value function at a specified aggregation level, combined via a CROSS JOIN with a simple table that contains the missing days.
This worked in a simple POC with fewer than 1 million rows, but on the production dataset described above the runtime and performance are pretty bad. We cannot simply extend the cluster; we need to try a different approach.
Any thoughts on what we can use instead of SQL (a simple last_value)?
Everything is calculated in an Azure Databricks environment.
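Not a full answer, but one thing that may be worth trying before abandoning SQL is to drop the CROSS JOIN against the fact table and let a window function carry the last non-null measure forward, assuming the missing days are first materialized as rows with NULL measures and assuming Spark SQL's ignore-nulls form of last_value. A minimal sketch with hypothetical table and column names:
SELECT key_id,
       event_date,
       last_value(x, true) OVER (          -- true = ignore NULLs, so gaps take the most recent non-NULL value
           PARTITION BY key_id
           ORDER BY event_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS x_filled
FROM measures_with_gaps;
Whether this is faster depends on how the data is partitioned, since the window still forces a shuffle and sort per key, but it avoids joining the full fact table against the missing-days table.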

Redshift Compile Time For First Time Run Queries

I am struggling with the performance of my dashboard, which runs queries on Redshift using the JDBC driver.
The query looks like this:
select <ALIAS_TO_SCHEMA.TABLENAME>.<ANOTHER_COLUMN_NAME> as col_0_0_,
sum(<ALIAS_TO_SCHEMA.TABLENAME>.devicecount) as col_1_0_ from <table_schema>.<table_name> <ALIAS_TO_SCHEMA.TABLENAME> where <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$1
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$2
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$3
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$4
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$5
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$6
or ........
For the dashboard we use Spring and Hibernate (I am not 100% sure about that, though).
The query might sometimes stretch to $1000+ parameters, depending on the filters/options selected in the UI.
The problem we are seeing is that the first time this query is run by the reports, it takes 40 to 60 seconds to respond. After the first time, the query runs quite fast and takes only a few seconds.
We initially suspected there must be something wrong with Redshift caching, but it turns out that even simple queries like these (but huge) take considerable time to COMPILE, which is clear when we look at the svl_compile table: it shows this query took over 35 seconds to compile.
What should I do to handle such issues?
I recommend restructuring the query generated by your dashboard to use an IN list. Redshift should be able to reuse the already-compiled query segments for IN lists of different lengths.
Note that IN lists with fewer than 10 values are still evaluated as a series of ORs: https://docs.aws.amazon.com/redshift/latest/dg/r_in_condition.html#r_in_condition-optimization-for-large-in-lists
SELECT <ALIAS_TO_SCHEMA.TABLENAME>.<ANOTHER_COLUMN_NAME> as col_0_0_
, SUM(<ALIAS_TO_SCHEMA.TABLENAME>.devicecount) AS col_1_0_
FROM <table_schema>.<table_name> <ALIAS_TO_SCHEMA.TABLENAME>
WHERE <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME> IN ( $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11 … $1000 )
;
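To verify that the rewritten query is actually reusing compiled segments, you can check svl_compile for the query id; the compile column is 1 for segments that had to be compiled and 0 for segments served from the compile cache. A quick sketch:
SELECT query,
       SUM(compile) AS segments_compiled,               -- 0 means every segment was reused
       SUM(DATEDIFF(ms, starttime, endtime)) AS compile_time_ms
FROM svl_compile
WHERE query = <your_query_id>
GROUP BY query;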

Inaccurate COUNT DISTINCT Aggregation with Date dimension in Google Data Studio

When I aggregate values in Google Data Studio with a date dimension on a PostgreSQL Connector, I see buggy behaviour: performing COUNT(DISTINCT) returns the same value as COUNT().
My theory is that it has something to do with the aggregation on the data occurring after the count has already happened. If I attempt the exact same aggregation on the same data in an exported CSV instead of directly from a PostgreSQL Connector Data Source, the issue does not reproduce.
My PostgreSQL Connector is connecting to Amazon Redshift (jdbc:postgresql://*******.eu-west-1.redshift.amazonaws.com) with the following custom query:
SELECT
userid,
submissionid,
date
FROM mytable
Workaround
If I stop using the default date field for the Date Dimension and instead aggregate my own dates directly within the SQL query (date_byweek), the COUNT(DISTINCT) aggregation works as expected:
SELECT
userid,
submissionid,
to_char(date,'YYYY-IW') as date_byweek
FROM mytable
While this workaround solves my immediate problem, it sucks because I miss out on all the date functionality provided by Data Studio (Hierarchy Drill Down, Date Range filtering, etc.), not to mention reducing my confidence in what else may be "buggy" within the product 😞
How to Reproduce
If you'd like to re-create the issue, using the following data as a PostgreSQL Data Source should suffice:
> SELECT * FROM mytable
userid   submissionid
------   ------------
1        1
2        2
1        3
1        4
3        5
> COUNT(DISTINCT userid) -- ERROR: Returns 5 when data source is PostgreSQL
> COUNT(DISTINCT userid) -- EXPECTED: Returns 3 when data source is CSV (exported from same PostgreSQL query above)
I'm happy to report that as of Sep 17 2020, there's a workaround.
Data Studio added the DATETIME_TRUNC function (see https://support.google.com/datastudio/answer/9729685?), which lets you add a custom field that truncates the original date to whatever granularity you want without triggering the distinct bug.
Attempting to set the display granularity in the report still causes the bug (i.e., you'll still see Oct 1 2020 12:00:00 instead of Oct 2020).
This can be solved by creating a SECOND custom field, which just returns the first, and then you can add IT to the report, change the display granularity, and everything will work OK.
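As a concrete example of the two custom fields described above (the field names and the weekly granularity are just for illustration, assuming the source column is called date):
First custom field, e.g. date_trunc_week:
DATETIME_TRUNC(date, WEEK)
Second custom field, e.g. date_for_report, which simply returns the first and is the one you add to the report before changing the display granularity:
date_trunc_week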
I have the same issue with the MySQL Connector, but my problem was solved when I changed the date field format in the DB from DATETIME (YYYY-MM-DD HH:MM:SS) to INT (Unix timestamp). After connecting this table to Google Data Studio I set the type for this field to Date (YYYYMMDD), and everything works as expected. Hope this helps you :)
In this Google forum there is a curious solution by Damien Choizit that involves combining your data source with itself. It works well for me.
https://support.google.com/datastudio/thread/13600719?hl=en&msgid=39060607
It says:
I figured out a solution in my case: I used Blend Data to join the same data source with itself on the corresponding join key(s), then I specified a date range dimension only on the left side and selected the columns I wanted to CTD (count distinct) aggregate as "dimensions" (and not metrics!) on the right side.

SQL Server performance function vs no function

I have a query (a relationship between CONTRACT <-> ORDERS) that I decided to break up into 2 parts (contract and orders) so I can reuse it in another stored procedure.
Before the break-up, the code took around 10 seconds to run; however, when I use a function to get the contract, pump the data into a temp table first, and then join to the other parts, it takes 2m30s. Why the difference in time?
The function takes less than a second to run and returns only one row i.e. details of one contract (contract_id is the parameter supplied to the function).
The part most affecting the performance is the largest table in the query (ORDERS), which has 4.1 million rows and joins to a few other tables. However, if I run the orders subquery in isolation with a particular filter (i.e. the contract id), it takes less than a second and just happens to return zero records for the contract I am testing with (due to the filtering on the type of order it is looking for).
Based on the above, you would think 1 second at most for the function + 1 second at most to get the orders and summarize = 2 seconds at most, not two and a half minutes!
Where am I going wrong, how do I begin to isolate the issue in time difference?
I know someone is going to tell me to paste the code, but surely this is an issue of the database, or the indexes, or perhaps how the query compiler behaves with the raw code versus the code broken up into parts. Is there an area of the code I can look at before having to post the whole thing? I have tried variations of OUTER APPLY vs LEFT JOIN from the contract temp table to the orders subquery, and both give me about the same result. Any ideas?
I don't think the issue was with the code but with the network I was running it on, although it is bizarre: I had two versions of the proc running side by side, and before the weekend one was running in 10 seconds (and it is still running in 10 seconds three days later), while my new version (using the function) was taking anywhere between 2 and 3 minutes. This morning it is running in 2 or 3 seconds! So I don't know whether it was the fact that I changed to declaring my table structure and using a table variable, where previously I was using SELECT ... INTO #Contract, or the network, or precompilation that made the difference. Whatever it was, it is no longer an issue. Should I delete this post?
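For reference, the two variants being compared are roughly the following (the function and column names are illustrative, not the actual proc). A table variable has no statistics, so the optimizer tends to assume it holds very few rows, whereas SELECT ... INTO a temp table gets real statistics; that alone can change the plan chosen for the join against the 4.1-million-row ORDERS table.
-- Variant A: table variable (no statistics available to the optimizer)
DECLARE @Contract TABLE (contract_id INT PRIMARY KEY, contract_name VARCHAR(100));
INSERT INTO @Contract (contract_id, contract_name)
SELECT contract_id, contract_name FROM dbo.fn_GetContract(@contract_id);
-- Variant B: temp table via SELECT ... INTO (statistics are created, so the join can be costed realistically)
SELECT contract_id, contract_name
INTO #Contract
FROM dbo.fn_GetContract(@contract_id);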

Tableau Indicates required columns are not present when performing an incremental extract, but they are

I have a table in BigQuery with 35 million rows that I want to turn into a Tableau extract, but the number of rows is so great that Tableau can't add all of them at one time. The solution I have come up with is to first limit the number of rows presented by the BigQuery view by only showing rows within a particular date range for a full extract, and then, after the extract is completed, slowly increase the number of rows the view exposes (by altering the WHERE clause) and have Tableau perform an incremental extract based on the field containing the timestamp. That worked for a few of the incremental updates, but now I've run into a problem where Tableau Desktop says 'Required columns are not present in the remote data source. Perform full refresh of the extract', even though nothing in the data source or within Tableau has changed other than the WHERE clause affecting the range of dates that the view presents. This was apparently a problem in Tableau 8, but I'm on Tableau 9.3. Does anyone have any suggestions?
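A minimal sketch of the view-based staging approach described above, with placeholder project, dataset, column, and date values:
CREATE OR REPLACE VIEW `my_project.my_dataset.extract_staging` AS
SELECT *
FROM `my_project.my_dataset.big_table`
WHERE event_timestamp < TIMESTAMP('2016-03-01');  -- widen this cutoff before each incremental refresh
Tableau then does the full extract against the narrow version of the view, and subsequent incremental extracts (keyed on the timestamp column) as the cutoff is moved forward.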
I'm not sure if this will be a solution for everyone, but the problem I was having seems to have been caused by the number of connections being made to Google BigQuery per hour.