Inaccurate COUNT DISTINCT Aggregation with Date dimension in Google Data Studio - amazon-redshift

When I aggregate values in Google Data Studio with a date dimension on a PostgreSQL Connector, I see buggy behaviour. The symptom is that COUNT(DISTINCT) returns the same value as COUNT().
My theory is that the date aggregation is applied to the data after the count has already happened. If I attempt the exact same aggregation on the same data from an exported CSV instead of directly from a PostgreSQL Connector Data Source, the issue does not reproduce.
My PostgreSQL Connector is connecting to Amazon Redshift (jdbc:postgresql://*******.eu-west-1.redshift.amazonaws.com) with the following custom query:
SELECT
userid,
submissionid,
date
FROM mytable
Workaround
If I stop using the default date field for the Date Dimension and instead aggregate my own dates directly within the SQL query (date_byweek), the COUNT(DISTINCT) aggregation works as expected:
SELECT
userid,
submissionid,
to_char(date,'YYYY-IW') as date_byweek
FROM mytable
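For reference, 'YYYY-IW' yields a calendar-year-plus-ISO-week string; a caveat worth knowing (my note, not from the original post) is that dates around a year boundary can land in a misleading bucket, whereas 'IYYY-IW' pairs the ISO year with the ISO week:

SELECT to_char(DATE '2021-01-01', 'YYYY-IW');  -- '2021-53' (calendar year, ISO week)
SELECT to_char(DATE '2021-01-01', 'IYYY-IW');  -- '2020-53' (ISO year, ISO week)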
While this workaround solves my immediate problem, it sucks because I miss out on all the date functionality provided by Data Studio (Hierarchy Drill Down, Date Range filtering, etc.), not to mention that it reduces my confidence in what else may be "buggy" within the product 😞
How to Reproduce
If you'd like to re-create the issue, using the following data as a PostgreSQL Data Source should suffice:
> SELECT * FROM mytable
userid   submissionid
------   ------------
1        1
2        2
1        3
1        4
3        5
> COUNT(DISTINCT userid) -- ERROR: Returns 5 when data source is PostgreSQL
> COUNT(DISTINCT userid) -- EXPECTED: Returns 3 when data source is CSV (exported from same PostgreSQL query above)
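A minimal sketch to load that sample into the database (the date values are made up, since the listing above only shows userid and submissionid, but the real table also has the date column used as the Date Dimension):

CREATE TABLE mytable (userid INT, submissionid INT, date DATE);
INSERT INTO mytable VALUES
  (1, 1, '2020-09-01'),
  (2, 2, '2020-09-02'),
  (1, 3, '2020-09-02'),
  (1, 4, '2020-09-03'),
  (3, 5, '2020-09-04');

SELECT COUNT(DISTINCT userid) FROM mytable;  -- 3 when run directly against the database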

I'm happy to report that as of Sep 17 2020, there's a workaround.
Data Studio added the DATETIME_TRUNC function (see https://support.google.com/datastudio/answer/9729685?), which lets you add a custom field that truncates the original date to whatever granularity you want, without triggering the distinct bug.
Setting the display granularity in the report still causes the bug (i.e., you'll still see Oct 1 2020 12:00:00 instead of Oct 2020).
This can be solved by creating a SECOND custom field, which just returns the first; you can then add IT to the report, change the display granularity, and everything will work OK.
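For example (the field names are mine and the granularity is up to you; this is only a sketch of the two custom fields described above, not copied from the original answer):

date_by_month         = DATETIME_TRUNC(date, MONTH)
date_by_month_display = date_by_month

date_by_month_display is then the field to drop into the report and set the display granularity on.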

I have the same issue with the MySQL Connector. My problem was solved when I changed the date field format in the DB from DATETIME (YYYY-MM-DD HH:MM:SS) to INT (Unix timestamp). After connecting this table to Google Data Studio I set the type for this field to Date (YYYYMMDD) and everything works as expected. Hope this helps you :)
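One way to do that conversion in MySQL (a sketch only: it adds a parallel INT column rather than converting the existing one in place, and the table/column names are illustrative):

ALTER TABLE mytable ADD COLUMN date_unix INT;
UPDATE mytable SET date_unix = UNIX_TIMESTAMP(date_col);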

In this Google forum there is a curious solution by Damien Choizit that involves combining your data source with itself. It works well for me.
https://support.google.com/datastudio/thread/13600719?hl=en&msgid=39060607
It says:
I figured out a solution in my case: I used a Blend Data joining twice the same data source with corresponding join key(s), then I specified a data range dimension only on the left side and selected the columns I wanted to CTD aggregate as "dimensions" (and not metric!) on the right side.

Related

JBPM Business central - Data set not working with aggregate function

I have an issue with the Data set - Execution server. I am using PostgreSQL as DB. I want to calculate the difference between the two dates column for my report. The query I have used in DB is:
Query 1:
SELECT end_date AS end,
       start_date AS start,
       processid AS pidd,
       AGE(end_date, start_date) AS duration
FROM processinstancelog
Query 2:
SELECT end_date, start_date, processid,
       end_date - start_date AS duration
FROM processinstancelog
Both queries reflect the correct expected result in the Postgres DB. But when I use the same queries in the Data set > Execution server, it's not showing the "duration" column.
Question
Can anyone please advise why the data set is not showing the duration column?
Many Thanks
Both queries reflect the correct expected result in the Postgres DB. But when I use the same queries in the Data set > Execution server, it's not showing the "duration" column.
How are you using the query in the execution server? Are you implementing the advanced query functionality? If yes, can you please share the exact steps you are following and your advanced query definition for review.
Answer: I deleted the old setup and installed a fresh JBPM setup, and the duration column started appearing in the data set.
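For anyone who cannot reinstall: both AGE(end_date, start_date) and end_date - start_date return PostgreSQL interval values (assuming the columns are timestamps), and if that interval type is what the data set layer fails to map (an assumption on my part, not something confirmed above), returning the duration as a plain number sidesteps it:

SELECT end_date,
       start_date,
       processid,
       EXTRACT(EPOCH FROM (end_date - start_date)) / 3600 AS duration_hours
FROM processinstancelog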

Get value from Measure in OLAP Cube and Insert it in a SQL table

I need help with the following:
I got an OLAP Cube, let's call it "company_prod" on "server\instance"
That Cube has (among many) a calculated member called "[Measures].[Value]"
One of the Dimensions in the Cube is for time (Year, Month, Date and so on). e.g. [TIME].[Y_M_D].[YEAR].&[2020]
Our main frontend is Excel, where we retrieve Data from the Cube with CUBEELEMENT, CUBEVALUE etc.
We have some measures where, unfortunately, if I update the Excel report now and show numbers for last year, the result is different from what I get when I update that same report in a few weeks or months. This is something I won't be able to change, and in some reports it's the desired behaviour, because underlying data from SAP is changed and sometimes valid_from and valid_to dates are changed retroactively.
Now I want to get the value from my "[Measures].[Value]" on a certain date, let's say April 1st. I then want to insert the value I get on April 1st for 2020 in a SQL table. This should be done by an agent job that executes a stored procedure or runs a dtsx package or anything else, whichever works.
I hope it's clear what I am trying to accomplish...
If you can create linked servers from your SQL Server to SSAS Server, then you can run your MDX query against the linked SSAS Server using OPENQUERY and save the result directly to the SQL Server table.
i.e.,
INSERT INTO <your table>
EXECUTE <your mdx statement> AT <linked server>
You can then run the above via a SQL Agent job on a schedule.
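A minimal sketch of the OPENQUERY variant, assuming a linked server named SSAS_COMPANY_PROD that points at server\instance and a target table dbo.CubeValueHistory whose columns line up with what the MDX returns (both of those names are placeholders, not from the original post):

INSERT INTO dbo.CubeValueHistory
SELECT *
FROM OPENQUERY(SSAS_COMPANY_PROD,
    'SELECT {[Measures].[Value]} ON COLUMNS,
            {[TIME].[Y_M_D].[YEAR].&[2020]} ON ROWS
     FROM [company_prod]');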

Identifying next closest record by date in tableau

I have a table of users and another table of transactions.
The transactions all have a date against them. What I am trying to ascertain for each user is the average time between transactions.
User | Transaction Date
-----+-----------------
A | 2001-01-01
A | 2001-01-10
A | 2001-01-12
Consider the above transactions for user A. I am basically looking for the distance from one transaction to the next chronologically to determine the distances.
There are 9 days between transactions one and two, and 2 days between transactions two and three. The average of these is 5.5, so I would want to identify the average time between user A's transactions as 5.5 days.
Any idea of how to achieve this in Tableau?
I am trying to create a calculated field for each transaction to identify the date of the "next" transaction but I am struggling.
{ FIXED [user id] : MIN(IF [Transaction Date] > **this transaction date** THEN [Transaction Date]) }
I am not sure what to replace this transaction date with or whether this is the right approach at all.
Any advice would be greatly appreciated.
LODs don't have access to previous values directly, so you need to create a self-join in your data connection. Follow the steps below to achieve what you want.
Create a self-join on your data using the following criteria
Create an LOD calculation as below
{FIXED [User],[Transaction Date]:
MIN(DATEDIFF('day',[Transaction Date],[Transaction Date (Data1)]))
}
Build the View
PS: If you want to improve the performance, Custom SQL might be the way.
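For reference, a rough custom-SQL equivalent of that self-join (PostgreSQL-style date arithmetic; the column names, and the join criteria of same user plus strictly later date, are my reading of the steps above rather than something spelled out in the original answer):

SELECT t1.user_id,
       t1.transaction_date,
       MIN(t2.transaction_date) - t1.transaction_date AS days_to_next
FROM transactions t1
JOIN transactions t2
  ON t2.user_id = t1.user_id
 AND t2.transaction_date > t1.transaction_date
GROUP BY t1.user_id, t1.transaction_date

Averaging days_to_next per user then gives the figure asked for.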
The only type of calculation that can take order sequence into account (e.g., when the value for a calculated field depends on the value of the immediately preceding row) is a table calc. You can't use an LOD calc for this kind of problem.
You'll need to understand how partitioning and addressing works with table calcs, along with specifying your sort order criteria. See the online help. You can then do something like, for example, define days_since_last_transaction as:
if first() < 0 then
    min([Transaction Date]) - lookup(min([Transaction Date]), -1)
end
If you have very large data or for other reasons want to do your calculations at the database instead of in Tableau by a table calc, then you use SQL windowing (aka analytical) queries instead via Tableau's custom SQL.
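For example, a sketch of such a windowed query for Tableau's custom SQL (PostgreSQL flavour; the table and column names are placeholders):

SELECT user_id,
       transaction_date,
       transaction_date
         - LAG(transaction_date) OVER (PARTITION BY user_id
                                       ORDER BY transaction_date) AS days_since_last_transaction
FROM transactions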
Please attach an example workbook and anything you tried along with the error you have.
This might not be useful if you cannot set the User ID field as a filter. If you can set User ID as a filter, then following the steps mentioned here will lead you to calculating the difference between any two dates. Ideally, if you select any one value in the filter, the calculated field from the link should give you the difference between the dates in the Transaction Date column.

How to get all missing days between two dates

I will try to explain the problem on an abstract level first:
I have X amount of data as input, which will always have a field DATE. Previously, the dates that came as input were (after some processing) put in a table as output. Now I am asked to output both the input dates and every date between the minimum date received and one year from that moment. If there was originally no input for some day between these two dates, all fields must come through as 0, or equivalent.
Example: I have two inputs, one with '18/03/2017' and the other with '18/03/2018'. I now need to create output data for all the missing dates between '18/03/2017' and '18/04/2017'. So, output '19/03/2017' with every field set to 0, and the same for the 20th, the 21st and so on.
I know how to do this programmatically, but not in PowerCenter. I've been told to do the following (which I have done, but I would like to know of a better method):
Get the minimum date, day0. Then, with an aggregator, create 365 fields, each holding day0+1, day0+2, and so on, to create an artificial year.
After that we do several transformations, like sorting the dates and a union between them, to get the data ready for a joiner. The idea of the joiner is to do a full outer join between the original data and the all-zero data we got from the previous aggregator.
Then a router picks, in one of its groups, the data that had actual dates (and fields without nulls), and in another group the rows where all fields are null; those fields are then given a 0 before finally being written to a table.
I am wondering how this can be achieved while, for starters, removing the need to add 365 days to a date. If I were to do this same process for 10 years instead of one, the task gets ridiculous really quickly.
I was wondering about an XOR type of operation, or some other function, that would cut the number of steps needed for what I (maybe wrongly) feel is a simple task. Currently I need 5 steps just to know which dates are missing between two dates: a minimum and one year from that point.
I have tried to be as clear as possible, but if I failed at any point please let me know!
I'm not sure what the aggregator is supposed to do.
The same goes for the full outer join; a normal join on a constant port is fine :)
Can you calculate the needed number of duplicates before the joiner? In that case a lookup configured to return all rows and a less-than-or-equal predicate can help make the mapping much more readable.
In any case you will need a helper table (or file) with a sequence of numbers between 1 and the number of potential duplicates (or more).
I use our time dimension in the warehouse, which has one row per day from 1753-01-01 and the 200,000 following days, and a primary integer column with values from 1 and up...
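A sketch of that helper-table / time-dimension idea in plain SQL (PostgreSQL flavour; dim_date(calendar_day) and the amount field are illustrative names, not from the answers above):

SELECT d.calendar_day,
       COALESCE(SUM(t.amount), 0) AS amount
FROM dim_date d
LEFT JOIN mytable t
       ON t.date_field = d.calendar_day
WHERE d.calendar_day >= (SELECT MIN(date_field) FROM mytable)
  AND d.calendar_day <  (SELECT MIN(date_field) FROM mytable) + INTERVAL '1 year'
GROUP BY d.calendar_day
ORDER BY d.calendar_day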
You've identified that you know how to do this programmatically, and to be fair this problem is more suited to that sort of solution... but that doesn't exclude PowerCenter by any means: just feed the two dates into a Java transformation, apply some code to produce all the dates between them, and output a record for each. The Java transformation is ideal for record generation.
OK... so you could override your source qualifier to achieve this in the selection query itself (I'm giving an Oracle-based example as it's what I'm used to, and I'm assuming your input data comes from a table). I looked up the CONNECT BY syntax here:
SQL to generate a list of numbers from 1 to 100
SELECT MIN(tablea.DATEFIELD) + levquery.n - 1 AS Port1
FROM tablea,
     (SELECT LEVEL AS n FROM DUAL CONNECT BY LEVEL <= 365) levquery
GROUP BY levquery.n
(Check whether the query works for you - I don't have access to a PC to test it at the minute.)

"#Error" when using Lookup in Microsoft Reporting 3.0

We are using Microsoft Reporting for generating a daily report. I want to add another column to one of the tables we have. Initially I had set this up correctly and the report worked fine. However, due to technicalities I have to use a different table (with exactly the same data), so I edited the query, and once I did that I started getting "#Error" in the cell values of my column.
The cell expression:
=Lookup(Fields!fldFlight.Value, Fields!OutboundFlightNumber.Value, Fields!OnTime.Value, "DataSet")
I use the following query to form DataSet:
SELECT
turnarounds_staging.OutboundFlightNumber
,turnarounds_staging.VisitDatabaseID AS [turnarounds_staging VisitDatabaseID]
,turnarounds_staging.STDDate
,events_staging.VisitDatabaseID AS [events_staging VisitDatabaseID]
,events_staging.OnTime
,events_staging.Event
FROM
turnarounds_staging
LEFT OUTER JOIN events_staging
ON turnarounds_staging.VisitDatabaseID = events_staging.VisitDatabaseID
WHERE
events_staging.Event = 'PDC' AND
turnarounds_staging.STDDate = #Date
Where #Date is a parameter indicating yesterday.
If I change the query back to the original table (which is identical), it works fine.
Any ideas why this happens when turnarounds_staging is identical to the original table?