Tableau - How to query large data sources efficiently?

I am new to Tableau and am having performance issues, so I need some help. I have a query that joins several large tables, and I am using a live data connection to a MySQL database.
The issue I am having is that Tableau is not applying the filter criteria before asking MySQL for the data. It is essentially doing a SELECT * for my query without applying the filter criteria to the WHERE clause. It pulls all the data from the MySQL database back to Tableau, then throws away the unneeded rows based on my filter criteria. My two main filter criteria are on account_id and a date range.
I can cleanly get a list of accounts by selecting from my account table to populate the filter list; I then need to know how to apply that selection when Tableau pulls the data for the main query from MySQL.

To apply a filter at the data source first, try using context filters.
Performance can also be improved by using extracts.

I would personally use an extract: go into your MySQL back end and run your query as a CREATE TABLE extract1 AS statement (or whatever you want to call your data table).
When you import this table into Tableau, a SELECT * will already return your aggregated data in the workbook. From here your query efficiency will increase tenfold.
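For example, a minimal sketch of that approach in MySQL; the table and column names below are hypothetical, not from the question, and the WHERE clause pre-filters the date range the workbook will actually need:
CREATE TABLE extract1 AS
SELECT a.account_name,
       t.account_id,
       t.transaction_date,
       SUM(t.amount) AS total_amount
FROM transactions t                       -- hypothetical large tables
JOIN accounts a ON a.account_id = t.account_id
WHERE t.transaction_date >= '2023-01-01'  -- pre-filter to the needed date range
GROUP BY a.account_name, t.account_id, t.transaction_date;
Pointing Tableau at extract1 (live or as an extract) means the heavy join and aggregation happen once in MySQL rather than on every dashboard interaction.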
Unfortunately, it will still take a while to process your data: total time = Tableau processing time + MySQL back-end query time.
Try the extracts...

I've been struggling with the very same thing. I have found that Tableau extracts aren't any faster than pulling directly from a SQL table. What I have done is create tables within SQL that already have the filtered data in them, so the SELECT * returns only the needed data. The downside is that this takes up more space on the server, but that isn't a problem in my case.

For large data sets, Tableau recommends using an extract.
An extract creates a snapshot of the data you are connected to, and processing on this data will be faster than on a live connection.
All the charts and visualizations will load faster, saving you time each time you open the dashboard.
The filters you use on the data set will also work faster with an extract connection. But to get the latest data you have to refresh the extract, or schedule a refresh on the server (if you are uploading the report to the server).
There are multiple types of filters available in Tableau, and which to use depends on your application; context filters and global filters can be used to filter the whole data set.

Related

How to combine data from PostgreSQL and dynamic JSON in Grafana

I have a Grafana dashboard where I want to use an Orchestra Cities Map panel to show the status of some stations. The status is available as JSON from an HTTP server (using Nagios for this part), but the status has no idea of the location of the stations. That I have in a PostGIS database.
I know I can set up a script that reads the status JSON and inserts the data into a table in the PostGIS database. This could run every five minutes or so. It feels a bit kludgy, though, so I wonder if there are other ways of doing this.
Could it be possible to use a foreign data wrapper to fetch the JSON into PostGIS? The only JSON FDW I have found reads a set of files; I would need to read from an HTTP server.
If not, is it possible to combine data from JSON and Postgres in one data set in Grafana? I can read in data from both sources and present them, e.g. as time series, in one panel, but here I need to be able to join the two so that I can use some of the attributes from the JSON to categorize the points from PostGIS (or the other way around if that is easier).
In theory you can do that in Grafana. You need two queries with results from both sources (how to write the queries and configure the data sources is out of the scope of this question), plus a key that can be used for a join in both results (e.g. city_id).
Then you can use the join transformation to "join" both query results into a single dataset.

ELT pipeline for Mongo

I am trying to get my data into Amazon Redshift using Fivetran, but have some general questions about the ELT/ETL process. My source database is Mongo, but I want to perform deep analysis on the data using a 3rd-party BI tool like Looker, which integrates with SQL. I am new to the ELT/ETL process and was wondering whether it would look like this:
Extract data from Mongo (handled by Fivetran)
Load into Amazon Redshift (handled by Fivetran)
Perform Transformation - This is where my biggest knowledge gap is. I obviously have to convert objects and arrays into compatible SQL types. I can perform a transformation on all objects to extract those to columns and transform all arrays to a table. Is this the right idea? Should I design a MYSQL schema and write all the transformations according to that schema design?
As you state, Fivetran will load your data into Redshift, putting individual fields in columns where it can and putting everything else into varchar columns as JSON. So at that point you basically have a data lake: all your data in an analytical platform, still in source format and available for you to do whatever you want with.
Initially, if you don't know much about your data and just want to investigate it, you can probably leave it as it is. Redshift has SQL functions that allow you to query the elements of a JSON structure so there is no need to build additional tables and more ETL just to allow you to investigate your data - especially as these tables may get thrown away once you understand your data and decide what you want to do with it.
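For example, a small sketch using Redshift's JSON functions; the orders_raw table and its shipping_address varchar column are hypothetical stand-ins for whatever Fivetran lands:
SELECT
    order_id,
    json_extract_path_text(shipping_address, 'city')    AS ship_city,
    json_extract_path_text(shipping_address, 'country') AS ship_country
FROM orders_raw
WHERE is_valid_json(shipping_address)
LIMIT 100;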
If you have proper reporting requirements, then that is the point where you can start to design a schema that will support them (I'm not sure why you suggested a MySQL schema, as MySQL is a database vendor?). Traditionally an analytical schema would be designed as a Kimball dimensional model (facts and dimensions; a rough sketch follows the list below), but the type of schema you decide to design will depend on:
The database platform you are using (in your case, Redshift) and the type of structures it works best with e.g. star schema or "flat" tables
The BI tool you are using and how it expects to have data presented to it
For example (and I'm not saying this is a real world example), if Redshift works ok with star schemas but better with flat tables and Looker has to have a star schema then it probably makes more sense to build star schemas in Redshift as this is a single modelling exercise - rather than model flat tables in Redshift and then have to model star schemas in Looker.
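For illustration only, a bare-bones dimensional model of that kind might look like the following; all names are hypothetical and nothing here is specific to your Mongo data:
CREATE TABLE dim_customer (
    customer_key  BIGINT IDENTITY(1,1),   -- surrogate key
    customer_id   VARCHAR(64),            -- natural key from the source documents
    customer_name VARCHAR(256),
    country       VARCHAR(64)
);

CREATE TABLE fact_orders (
    order_key    BIGINT IDENTITY(1,1),
    customer_key BIGINT,                  -- references dim_customer.customer_key
    order_date   DATE,
    order_amount DECIMAL(18,2)
);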
Hope this helps?
It depends on how you need the final stage of your data analysis presented, and what the purpose of your data analysis is. As stated by NickW, assuming you need to integrate your data into a BI tool, the schema should be adapted to the tool's data format requirements.
A MongoDB ETL/ELT process might look like this:
Select Connection: Select the set connection
Collection Name: Choose the collection by using the [database].[collection] format.
If you are pulling data from your authentication database, only the [collection] name can be determined. Example: sample.products.
Extract Method:
All: pull the entire data set in the collection.
Incremental: pull data by incremental value.
Incremental Attributes: Set the name of the incremental attribute to run by, e.g. UpdateTime.
Incremental Type: Timestamp | Epoch. Choose the type of incremental attribute.
Choose Range:
In Timestamp, choose your date increment range to run by.
In Epoch, choose the value increment range to run by.
If no End Date/Value entered, the default is the last date/value in the table.
The increment will be managed automatically
Include End Value: Whether the increment process should include the end value or not.
Interval Chunks: The chunks by which the data will be pulled. Split the data by minutes, hours, days, months, or years.
Filter: Filter the data to pull. The filter format will be a MongoDB Extended JSON.
Limit: Limit the rows to pull.
Auto Mapping: You can choose the set of columns you want to bring, add a new column or leave it as it is.
Converting Entire Key Data As a STRING
In cases where the data is not as expected by the target, such as key names starting with numbers or flexible and inconsistent object data, you can convert attributes to a STRING format by setting their data type to STRING in the mapping section.
Conversion exists for any value under that key.
Arrays and objects will be converted to JSON strings.
Use cases:
Here are a few filtering examples:
{"account":{"$oid":"1234567890abcde"}, "datasource": "google", "is_deleted": {"$ne": true}}
date(MODIFY_DATE_START_COLUMN) >=date("2020-08-01")

Azure Data Factory - querying Cosmos DB using dynamic timestamps

I want to create and maintain a snapshot of a collection in Cosmos DB.
Periodically, I want to retrieve only the delta (new or modified documents) from Cosmos and write it to the snapshot, which will be stored in an Azure Data Explorer cluster.
I wish to get the delta using the _ts member of the documents. In other words, I will fetch only records whose _ts falls within some range.
The range will be the range of a time window, which I get using a tumbling window trigger in the data factory.
The issue is that if I print the dynamic timestamps which I create in the query, and hard code them into the query, it works. But if I let the query generate them, I don't get any results.
For example:
I'm using these values to simulate the window range of the trigger.
I use this query to create timestamps in Unix time,
and I see that the timestamps created are correct.
If I run my query using those hardcoded timestamps, I get results.
But if I run the query using the code that just creates those timestamps, I get no results.
This is the code to create the timestamps:
select
DateTimeToTimestamp('#{formatDateTime('2020-05-20T12:00:00.0000000Z','yyyy-MM-ddTHH:mm:ss.fffffffZ')}')/1000,
DateTimeToTimestamp('#{formatDateTime('2020-08-20T12:00:00.0000000Z','yyyy-MM-ddTHH:mm:ss.fffffffZ')}')/1000
Does anyone have a clue as to what might be the issue?
Any other way to achieve this is also welcome.
Thanks
EDIT: I managed to work around this by taking the other, simpler option:
where TimestampToDateTime(c._ts*1000)> "#{formatDateTime(pipeline().parameters.windowStart,'yyyy-MM-ddTHH:mm:ss.fffffffZ')}"
We are glad that you resolved this problem:
You managed to work around this by taking the other, simpler option:
where TimestampToDateTime(c._ts*1000)> "#{formatDateTime(pipeline().parameters.windowStart,'yyyy-MM-ddTHH:mm:ss.fffffffZ')}"
I think the error in the first option is most likely caused by the different data types of c._ts and DateTimeToTimestamp('#{formatDateTime('2020-05-20T12:00:00.0000000Z','yyyy-MM-ddTHH:mm:ss.fffffffZ')}')/1000.
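For reference, a minimal sketch of the equivalent purely numeric comparison (c._ts is stored as epoch seconds; the literals below correspond to 2020-05-20T12:00:00Z and 2020-08-20T12:00:00Z and are shown only for illustration):
SELECT * FROM c
WHERE c._ts >= 1589976000 AND c._ts < 1597924800
If the generated values end up being compared as a different type (or are off by a factor of 1000), the range simply matches no documents, which would be consistent with the empty result.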

UI5: Retrieve and display thousands of items in sap.m.Table

There is a relational database (MySQL 8) with tens of thousands of items in a table, which need to be displayed in sap.m.Table. The straightforward approach is to retrieve all the items with a SQL query and deliver them to the client side as JSON in an async way. The key drawback of this approach is performance and memory consumption on the client side. The whole table needs to be available on the client side to give the user the ability to conduct fast searches, which is crucial for the app.
Currently, there are two options:
Fetch the top 100 records and push them into the table. This way the user can search the last 100 records immediately. At the same time, run an additional query in a web worker, which will take about 2-5 seconds and fetch all records except those 100. Then merge the two JSONs.
Keep the JSON on the application server as a cached variable and update it when the user adds or deletes a record. Then I fetch the JSON, which is supposed to be much faster than querying the database.
How can I show in OpenUI5's sap.m.Table thousands of items?
My opinion:
You need to create an OData backend for your tables. The user can then filter or search records with OData capabilities. You don't need to push all the data to the client; sap.m.Table automatically requests the rest of the data via the OData protocol while the user scrolls the table.
Quick answer: you can't.
Use sap.ui.table, or provide a proper OData service with top/skip support as shown here under 4.3 and 4.4.
Depending on your backend code (Java, ABAP, Node), there are libs to help you.
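For illustration, on the MySQL 8 side a paged OData request such as ...?$skip=200&$top=100 boils down to something like the following (the items table and its columns are hypothetical):
SELECT id, name, created_at
FROM items                 -- hypothetical table behind the sap.m.Table
ORDER BY created_at DESC
LIMIT 100                  -- $top=100
OFFSET 200;                -- $skip=200
The client only ever holds one page, so memory consumption stays flat no matter how many rows the table contains.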
The SAP recommendation says a maximum of 100 records for sap.m.Table. In practice, I would advise following the recommendation; even on a fast PC the rendering slows down.
If you want to test more than 100 records, you need to set the size limit on your oModel, e.g. oModel.setSizeLimit(1000);

Aggregate as part of ETL or within the database?

Is there a general preference or best practice when it comes to whether the data should be aggregated in memory on an ETL worker (with pandas groupby or pd.pivot_table, for example), versus doing a GROUP BY query at the database level?
At the visualization layer, I connect to the last 30 days of detailed interaction-level data, and then the last few years of aggregated data (daily level).
I suppose that if I plan on materializing the aggregated table, it would be best to just do it during the ETL phase since that can be done remotely and not waste the resources of the database server. Is that correct?
If your concern is to put as little load on the source database server as possible, it is best to pull the tables from the source database to a staging area and do joins and aggregations there. But take care that the ETL tool does not perform a nested loop join on the source database tables, that is to pull in one of the tables and then run thousands of queries against the other table to find matching rows.
If your target is to perform joins and aggregations as fast and efficient as possible, by all means push them down to the source database. This may put more load on the source database though. I say “may” because if all you need is an aggregation on a single table, it can be cheaper to perform this in the source database than to pull the whole table.
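As a sketch of the push-down case, assuming a hypothetical interaction detail table in the source database, a single-table aggregation like this is usually cheaper to run there than to pull the raw rows into the ETL worker and group them with pandas:
SELECT
    account_id,
    DATE(event_time) AS event_date,
    COUNT(*)         AS interactions
FROM interaction                                    -- hypothetical detail table
WHERE event_time >= CURRENT_DATE - INTERVAL 30 DAY  -- MySQL-style syntax; adjust per dialect
GROUP BY account_id, DATE(event_time);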
If you aggregate by day, what if your boss wants it aggregated by hour or week?
The general rule is: Your fact table granularity should be as granular as possible. Then you can drill-down.
You can create pre-aggregated tables too, for example by hour, day, week, month, etc. Space is cheap these days.
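As a sketch (all names hypothetical), a coarser rollup can even be built from a finer pre-aggregate rather than from the raw fact table:
CREATE TABLE sales_by_day AS
SELECT
    DATE(sale_hour) AS sale_date,   -- collapse the hourly grain to daily
    product_id,
    SUM(amount)     AS amount
FROM sales_by_hour                   -- hypothetical hourly pre-aggregate
GROUP BY DATE(sale_hour), product_id;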
Tools like Pentaho Aggregation Designer can automate this for you.