How to select the columns needed to create a report in Tableau - tableau-api

I have Tableau Desktop. I am creating a report using 5 tables, and 2 of the 5 are big. These tables are joined and a filter is applied. Extract creation is taking a long time (6-7 hours and still running). The big tables have 100+ columns, but I use only 12 columns to build my report.
There is an option to use custom SQL, which takes less time to create the extract, but then I cannot use Tableau to its full potential.
Any suggestion is welcome. I am looking for a way to choose only the columns I need when creating the extract.

Follow this process:
Make database connection
Join tables
Go to a sheet and add the fields required for the report, then right-click on the connection and choose to create an extract. Don't forget to click Hide All Unused Fields, then apply the required filters and create the extract.
This way the extract contains only the required fields instead of all of them.
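For comparison, if you did go the custom SQL route mentioned in the question, a narrow hand-written query is roughly what hiding unused fields produces for you anyway. The sketch below is only illustrative; the table names, column names, join keys and filter are hypothetical placeholders:

-- only the handful of columns the report needs, with the join and filter applied
select
    b1.order_id
    ,b1.order_date
    ,b1.customer_id
    ,b2.ship_region
    ,s.product_name
from big_table_1 b1
join big_table_2 b2 on b2.order_id = b1.order_id
join small_table_1 s on s.product_id = b1.product_id
where b1.order_date >= date '2018-01-01';

The advantage of hiding unused fields instead of writing custom SQL is that Tableau still manages the joins and metadata for you, so you keep the full drag-and-drop experience.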
Especially for very large extracts, you can also consider the option to aggregate to visible dimensions when making an extract. That can dramatically reduce the size of the extract and time to create and access it. But that option requires care to be sure you use the faster extract in a way that still gets accurate results. There are assumptions built in to that feature.
An extract is really a cached query result. If you perform aggregation when creating the extract, you can compute totals, mins, max, avg etc during extract creation, and then simply display the aggregate values in Tableau. This can save a lot of time. Of course, you can’t then further drill down past the level of detail in the extract in that case.
More importantly, if you perform further aggregation in Tableau, you have to be careful that the double aggregation gives the result you intend. Some functions are always safe: sums of sums, mins of mins, and maxes of maxes give the same answer as if you had done one large aggregation operation. These are called additive operations. Other combinations may or may not give the result you intend: averages of averages, and especially COUNTD of COUNTD, can be unexpected, although sometimes repeated aggregation is well defined; averages of daily sums can make sense, for example.
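To make the double-aggregation point concrete, here is a minimal SQL sketch against a hypothetical sales(order_date, amount) table; the names are placeholders, not anything from the original question:

-- sum of daily sums equals the detail-level sum (additive),
-- but avg of daily avgs generally does not equal the detail-level avg
select
    sum(daily_sum)  as total_sales         -- same as sum(amount) over all rows
    ,avg(daily_avg) as avg_of_daily_avgs   -- usually not the same as avg(amount) over all rows
from (
    select
        date_trunc('day', order_date) as sale_day
        ,sum(amount) as daily_sum
        ,avg(amount) as daily_avg
    from sales
    group by 1
) d;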
So performing aggregation during extract creation can lead to huge performance gains at visualization time - you effectively precompute much or all of the information you need to display. You just have to understand how it works and use it accordingly. Experiment.
By the way, that feature uses the default aggregation defined for each measure in the data source. Usually SUM(). You can change that in the data pane.

Related

How to make a desktop workbook (TWBX) readonly

I want to share a Tableau workbook (TWBX) that has a TDE as its data source with a different group. I don't want them to be able to see the raw data behind the workbook, basically disabling the Data Source tab. I also don't want them to access any calculated fields inside the workbook. Is it possible to make a TWBX file read-only? We don't have a Tableau Server, so publishing the workbook to a Tableau site with "Viewer" privileges is not an option here.
No, you can't make the data in the extract unreadable, but you can reduce it to a bare minimum, especially with aggregation.
When you make your extract, you can first hide fields you don't want included in the extract, filter out records you don't want included, and aggregate to visible dimensions to pre-aggregate the data in the extract (using the default aggregation for each measure).
That way you can limit the detail that is in your extract. You can make the extract only contain the values you need to display. If you pre-aggregate the extract, be careful about using additional aggregation in your viz.
Some aggregation functions can be meaningfully repeated; these are known as additive aggregation functions: sums of sums, for example, or mins of mins. Others, such as avg of avg, can lead to strange or incorrect results in many cases.

OLAP approach for backend Redshift connection

We have a system where we do some aggregations in Redshift based on some conditions. We aggregate this data with complex joins which usually takes about 10-15 minutes to complete. We then show this aggregated data on Tableau to generate our reports.
Lately, we have been getting many change requests: adding a new dimension (which usually requires a join with a new table) or getting data with some more specific filter. To accommodate these requests we have to change our queries every time, for each of our subprocesses.
I have read about OLAP a little. I just want to know whether it would be better for our use case, or whether there is a better way to design our system to handle such ad hoc requests without requiring a developer to change things every time.
Thanks for the suggestions in advance.
It would work; rather, it should work. Efficiency is the key here. There are a few things you need to monitor closely to make sure your system (Redshift + Tableau) stays up and running.
Prefer Extract over Live Connection (in Tableau)
A live connection queries the source system every time someone changes a filter or refreshes the report. Since you said the dataset is large and the queries are complex, prefer creating an extract. This makes sure the data is available up front whenever someone accesses your dashboard. Do not forget to schedule the extract refresh, otherwise the data will be stale forever.
Write efficient queries
OLAP systems are expected to query large datasets. Make sure you write efficient queries. It is always better to first reduce each table to a small dataset and then join, rather than bringing everything into memory and then joining and filtering the result with a WHERE clause.
A query shaped like (select foo from table1 where ...) a left join (select bar from table2 where ...) b on ... can be the key here: you pull out only the small, relevant data and then join, as in the sketch below.
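A hedged sketch of that pattern; the table names, columns and filter conditions are hypothetical placeholders:

-- filter each side down to the relevant rows first, then join the small results
select a.user_id, a.foo, b.bar
from (select user_id, foo from table1 where event_date >= current_date - 90) a
left join (select user_id, bar from table2 where region = 'EU') b
    on b.user_id = a.user_id;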
Do not query infinite data.
Since this is analytical rather than transactional data, put an upper bound on the data that Tableau will refresh. Historical data is important, but not all the way back to the inception of your product. Analysing the past 3, 6 or 9 months can be enough, rather than querying the entire dataset.
Create aggregates and let Tableau query that table, not the raw tables
Suppose you're analysing user traits. Rather than querying a raw table that captures 100 records per user per day, design a table that has just one (or two) entries per user per day and introduce a count column that tells you how many times the event was triggered. By doing this, you'll be querying a much smaller dataset that is still logically equivalent to what you were doing earlier; see the sketch below.
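A minimal sketch of such a rollup, assuming a hypothetical raw.user_events(user_id, event_time, ...) table; the schema and table names are placeholders:

-- one row per user per day, with a count of how many raw events it summarises
create table agg.user_events_daily as
select
    user_id
    ,date_trunc('day', event_time) as event_date
    ,count(*) as event_count
from raw.user_events
group by 1, 2;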
As mentioned by Mr Prashant Momaya,
"While dealing with extracts, your storage requires (size)^2 of space if your dashboard refers to data of size size."
Be very cautious with whatever design you implement, and do not forget to consider the most important factor: scalability.
This is a typical problem and we tackled it by writing SQL generators in Python. If the definition of the metric is the same (like count(*)) but you have varying dimensions and filters, you can declare it as JSON and write a generator that will produce the SQL. Example with pageviews:
{
  "metric": "unique pageviews"
  ,"definition": "count(distinct cookie_id)"
  ,"source": "public.pageviews"
  ,"tscol": "timestamp"
  ,"dimensions": [
    ["day"]
    ,["day","country"]
  ]
}
can be relatively easily translated into 2 scripts - this:
drop table metrics_daily.pageviews;
create table metrics_daily.pageviews as
select
date_trunc('day',"timestamp") as date
,count(distinct cookie_id) as "unique_pageviews"
from public.pageviews
group by 1;
and this:
drop table metrics_daily.pageviews_by_country;
create table metrics_daily.pageviews_by_country as
select
date_trunc('day',"timestamp") as date
,country
,count(distinct cookie_id) as "unique_pageviews"
from public.pageviews
group by 1,2;
The complexity of the generator required to produce such SQL from such a config is quite low, but it increases sharply as you need to add new joins etc. It's much better to keep your dimensions in encoded form and just use a single wide table as the aggregation source, or produce views for every join you might need and use them as sources.
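A hedged sketch of the wide-view idea, assuming hypothetical public.geo and public.devices dimension tables and join columns; the generator can then always aggregate from this one source instead of learning new joins:

-- pre-join the dimensions once; the generated aggregation scripts only ever read from this view
create view analytics.pageviews_wide as
select
    p."timestamp"
    ,p.cookie_id
    ,g.country
    ,d.device_type
from public.pageviews p
left join public.geo g on g.ip_block = p.ip_block
left join public.devices d on d.user_agent_id = p.user_agent_id;

The generated scripts then only need to swap from public.pageviews to from analytics.pageviews_wide when a new dimension such as device_type is requested.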

Improve calculation efficiency in Tableau

I have over 300k records (rows) in my dataset and I have a tab that stratifies these records into different categories and does conditional calculations. The issue is that it takes approximately an hour to run this tab. Is there any way I can improve calculation efficiency in Tableau?
Thank you for your help,
The issue is probably access to your source data. I have run into this problem when working directly with live data in an SQL database.
An easy solution is to use extracts. Quoting Tableau's Improving Database Query Performance article:
Extracts allow you to read the full set of data pointed to by your data connection and store it into an optimized file structure specifically designed for the type of analytic queries that Tableau creates. These extract files can include performance-oriented features such as pre-aggregated data for hierarchies and pre-calculated calculated fields (reducing the amount of work required to render and display the visualization).
Edited
If you are using an extract and still have performance issues, I suggest massaging your source data to be more Tableau-friendly, for example by generating pre-calculated fields in the ETL process.
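As a minimal sketch of that idea (all table, column and threshold names here are hypothetical), a conditional bucket that would otherwise be a Tableau calculated field can be materialised during ETL:

-- compute the bucket once in ETL instead of per-render in Tableau
create table etl.orders_enriched as
select
    o.*
    ,case
        when o.amount >= 1000 then 'Large'
        when o.amount >= 100  then 'Medium'
        else 'Small'
     end as order_size_bucket
from staging.orders o;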

Performance improvement for fetching records from a Table of 10 million records in Postgres DB

I have an analytics table that contains 10 million records, and to produce charts I have to fetch records from it. Several other tables are also joined to this table when the data is fetched. Currently it takes around 10 minutes even though I have indexed the join columns and used materialized views in Postgres. Performance is still very low: it takes 5 minutes to execute the SELECT query against the materialized view.
Please suggest some technique to get the result within 5 seconds. I don't want to change the DB storage structure, as too many code changes would be needed to support that. I would like to know if there are built-in methods for improving query speed.
Thanks in Advance
In general you can take care of this issue by creating a better data structure (most engines do this for you to an extent with keys).
But if you were to create a sorting column of sorts and build a tree-like structure on it, you'd be looking at a search cost on the order of O(log N) rather than what you may be facing right now. This will ensure you always get a big speed-up in your searches.
This is in regard to binary trees, red-black trees and so on.
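In Postgres terms, that usually means a B-tree index on the columns the dashboard filters by, ideally on a pre-aggregated materialized view rather than the 10-million-row detail table. A minimal sketch, with hypothetical table and column names:

-- roll the detail table up to the grain the charts actually need,
-- then index the filter columns so lookups become a B-tree search instead of a scan
create materialized view analytics_daily as
select
    date_trunc('day', event_time) as event_day
    ,category_id
    ,count(*)    as events
    ,sum(amount) as total_amount
from analytics
group by 1, 2;

create index analytics_daily_day_cat_idx
    on analytics_daily (event_day, category_id);

Refresh it with refresh materialized view analytics_daily on whatever schedule the data needs.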
Another option for a speed-up may be to use something along the lines of Redis, i.e. a database caching layer.
For analytical workloads in the past I have also chosen to use technologies related to Hadoop, though that may be a larger migration in your case at this point.

realtime querying/aggregating millions of records - hadoop? hbase? cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
1000s of datasets
dataset keys:
all datasets have the same keys
1 million keys (this may later be 10 or 20 million)
dataset columns:
each dataset has the same columns
10 to 20 columns
most columns are numerical values for which we need to aggregate on (avg, stddev, and use R to calculate statistics)
a few columns are "type_id" columns, since in a particular query we may want to only include certain type_ids
web application
user can choose which datasets they are interested in (anywhere from 15 to 1000)
application needs to present: key, and aggregated results (avg, stddev) of each column
updates of data:
an entire dataset can be added, dropped, or replaced/updated
would be cool to be able to add columns. But, if required, can just replace the entire dataset.
never add rows/keys to a dataset - so don't need a system with lots of fast writes
infrastructure:
currently two machines with 24 cores each
eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
partitions are nice, since can easily add/drop/replace partitions
database is nice for filtering based on type_id
databases aren't easy for writing parallel queries
databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
created a tab separated file per dataset for a particular type_id
uploaded to hdfs
map: retrieved a value/column for each key
reduce: computed average and standard deviation
From my crude proof of concept, I can see this will scale nicely, but I can also see that Hadoop/HDFS has latency; I've read that it's generally not used for real-time querying (even though I'm OK with returning results to users within 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Hive or Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot come back within 5 seconds.
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavyweight reduces.
check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2] - (E[X])^2)
This implies that you can get the stddev of the combined dataset AB by computing
sqrt(E[AB^2] - (E[AB])^2), where E[AB^2] = (sum(A^2) + sum(B^2)) / (|A| + |B|) and E[AB] = (sum(A) + sum(B)) / (|A| + |B|).
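In other words, if you keep the count, sum and sum of squares per dataset (and per key), any subset of datasets can be combined without touching the raw rows. A sketch in SQL against a hypothetical dataset_stats(dataset_id, key, n, sum_x, sum_x2) table, with one such set of columns per measure:

-- combine per-dataset partial sums into the overall avg and stddev for the selected datasets
select
    key
    ,sum(sum_x)::float8 / sum(n)                                 as combined_avg
    ,sqrt(sum(sum_x2)::float8 / sum(n)
          - power(sum(sum_x)::float8 / sum(n), 2))               as combined_stddev
from dataset_stats
where dataset_id in (17, 42, 99)   -- whichever datasets the user selected
group by key;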
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
It is a serious problem without an immediate good solution in the open source space. In the commercial space, MPP databases like Greenplum/Netezza should do.
Ideally you would need Google's Dremel (the engine behind BigQuery). We are developing an open source clone, but it will take some time...
Regardless of the engine used, I think the solution should include holding the whole dataset in memory; that should give you an idea of what size of cluster you need.
If I understand you correctly, and you only need to aggregate on single columns at a time, you can store your data differently for better results.
In HBase that would look something like:
a table per data column in today's setup, plus another single table for the filtering fields (type_ids)
a row for each key in today's setup - you may want to think about how to incorporate your filter fields into the row key for efficient filtering, otherwise you'd have to do a two-phase read
a column for each table (dataset) in today's setup (i.e. a few thousand columns)
HBase doesn't mind if you add new columns, and it is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row you get all the relevant values, over which you can compute avg. etc. quite easily.
You might want to use a plain old database for this. It doesn't sound like you have a transactional system, so you can probably use just one or two large tables. SQL has problems when you need to join over large data, but since it doesn't sound like you need to join, you should be fine. You can have indexes set up to find the datasets, and then do the math either in SQL or in the application.
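A minimal sketch of the one-big-table version, assuming a hypothetical measurements(dataset_id, key, value) layout; the names and the simplification to a single value column are assumptions:

-- an index on (dataset_id, key) lets the database pick out the selected datasets quickly
create index measurements_dataset_key_idx on measurements (dataset_id, key);

-- aggregate the selected datasets per key in one pass
select
    key
    ,avg(value)        as avg_value
    ,stddev_pop(value) as stddev_value
from measurements
where dataset_id in (17, 42, 99)   -- the datasets the user picked
group by key;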