Filtering Values from Two Data Sets

Filtering Values from Two Data Sets - tableau-api

I am very new to Tableau and am having trouble finding a way to filter between two data sets. These sets are Tableau Data Extracts so I am unable to create custom SQL to achieve this.
In DataSet1 I have levels of precipitation by date.
In DataSet2 I have sales revenue per date and store location.
I am trying to visualize the sum of sales revenue per store location on only days with precipitation. I thought I would be able to simply create a filtered list of all dates in DataSet1 that saw precipitation then subsequently filter all dates in DataSet2 to = my filtered list.
Any thoughts on how I would go about this? I feel like it should be relatively simple, but being so unfamiliar with the software I am having trouble locating a solution.
Thanks!

Please look into Data Blending for Tableau.It is an interesting feature in tableau to link multiple data sources into one sheet.It can get laggy at times so use it carefully.
Some Tutorial Links:
http://www.tableau.com/learn/tutorials/on-demand/data-blending-0
http://kb.tableau.com/articles/knowledgebase/relate-summarized-data-60
Also the screenshot below shows the option to edit relationships between data sources which is essentially Data Blending

Related

Tableau - Data shown on map based on filter

I am new to Tableau and I've tried Googling but can't find an answer on my issue (perhaps my keywords are wrong).
I have a set of data like this:
I am able to create three individual worksheets to show on the map how many small, medium, big houses are there. It is linked to a second data set which pulls out the highest selling price for the house of the category and location.
I want to learn if it's possible to combine all three individual worksheets into one worksheet so that I can filter based on the size of the house (small, medium, big). If not, can I do the filtering on the dashboard instead?
Thanks in advance!

You need to reshape the data to make it flatter. Check this on pivoting the data in Tableau

Improve calculation efficiency in Tableau

I have over 300k records (rows) in my dataset and I have a tab that stratifies these records into different categories and does conditional calculations. The issue with this is that it takes approximately an hour to run this tab. Is there any way that I can improve calculation efficiency in tableau?
Thank you for your help,

Probably the issue is accessing your source data. I have found this problem working directly live data as an sql dabase.
An easy solution is to use extracts. Quoting Tableau Improving Database Query Performance article
Extracts allow you to read the full set of data pointed to by your data connection and store it into an optimized file structure specifically designed for the type of analytic queries that Tableau creates. These extract files can include performance-oriented features such as pre-aggregated data for hierarchies and pre-calculated calculated fields (reducing the amount of work required to render and display the visualization).
Edited
If you are using an extraction and you still have performance issues I suggest to you to massage your source data to be more friendly for tableau, for example, generate pre-calculated fields on ETL process.

Tableau: Difference between 2 sets

I have two different data sources (Users and Shipments). I wish to display all the users who didn't receive any shipment for a specific month which will be selected by user.

Assuming you are working with structured data, you can use Tableau's data munging features. See this Tableau Knowledge Base article. Also, Tableau can support joins across different data stores, a.k.a Data Blending. See this Tableau video for more on data blending.
Also based on your question, I believe you must be relatively new to Tableau. I was in your position about a year ago. Watch Tableau's free On Demand Training videos on its website.

Calculating price drop Apps or Apps gonna free - App Store

I am working on a Website which is displaying all the apps from the App Store. I am getting AppStore data by their EPF Data Feeds through EPF Importer. In that database I get the pricing of each App for every store. There are dozen of rows in that set of data whose table structure is like:
application_price
The retail price of an application.
Name Key Description
export_date The date this application was exported, in milliseconds since the UNIX Epoch.
application_id Y Foreign key to the application table.
retail_price Retail price of the application, or null if the application is not available.
currency_code The ISO3A currency code.
storefront_id Y Foreign key to the storefront table.
This is the table I get now my problem is that I am not getting any way out that how I can calculate the price reduction of apps and the new free apps from this particular dataset. Can any one have idea how can I calculate it?
Any idea or answer will be highly appreciated.
I tried to store previous data and the current data and then tried to match it. Problem is the table is itself too large and comparing is causing JOIN operation which makes the query execution time to more than a hour which I cannot afford. there are approx 60, 000, 000 rows in the table

With these fields you can't directly determine price drops or new application. You'll have to insert these in your own database, and determine the differences from there. In a relational database like MySQL this isn't too complex:
To determine which applications are new, you can add your own column "first_seen", and then query your database to show all objects where the first_seen column is no longer then a day away.
To calculate price drops you'll have to calculate the difference between the retail_price of the current import, and the previous import.
Since you've edited your question, my edited answer:
It seems like you're having storage/performance issues, and you know what you want to achieve. To solve this you'll have to start measuring and debugging: with datasets this large you'll have to make sure you have the correct indexes. Profiling your queries should helping in finding out if they do.
And probably, your environment is "write once a day", and read "many times a minute". (I'm guessing you're creating a website). So you could speed up the frontend by processing the differences (price drops and new application) on import, rather than when displaying on the website.
If you still are unable to solve this, I suggest you open a more specific question, detailing your DBMS, queries, etc, so the real database administrators will be able to help you. 60 million rows are a lot, but with the correct indexes it should be no real trouble for a normal database system.

Compare the table with one you've downloaded the previous day, and note the differences.
Added:
For only 60 million items, and on a contemporary PC, you should be able to store a sorted array of the store id numbers and previous prices in memory, and do an array lookup faster than the data is arriving from the network feed. Mark any differences found and double-check them against the DB in post-processing.

Actually I also trying to play with these data, and I think best approach for you base on data from Apple.
You have 2 type of data : full and incremental (updated data daily). So within new data from incremental (not really big as full) you can compare only which record updated and insert them into another table to determine pricing has changed.
So you have a list of records (app, song, video...) updated daily with price has change, just get data from new table you created instead of compare or join them from various tables.
Cheers

realtime querying/aggregating millions of records - hadoop? hbase? cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
1000s of datasets
dataset keys:
all datasets have the same keys
1 million keys (this may later be 10 or 20 million)
dataset columns:
each dataset has the same columns
10 to 20 columns
most columns are numerical values for which we need to aggregate on (avg, stddev, and use R to calculate statistics)
a few columns are "type_id" columns, since in a particular query we may
want to only include certain type_ids
web application
user can choose which datasets they are interested in (anywhere from 15 to 1000)
application needs to present: key, and aggregated results (avg, stddev) of each column
updates of data:
an entire dataset can be added, dropped, or replaced/updated
would be cool to be able to add columns. But, if required, can just replace the entire dataset.
never add rows/keys to a dataset - so don't need a system with lots of fast writes
infrastructure:
currently two machines with 24 cores each
eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
partitions are nice, since can easily add/drop/replace partitions
database is nice for filtering based on type_id
databases aren't easy for writing parallel queries
databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
created a tab separated file per dataset for a particular type_id
uploaded to hdfs
map: retrieved a value/column for each key
reduce: computed average and standard deviation
From my crude proof-of-concept, I can see this will scale nicely, but I can see hadoop/hdfs has latency I've read that that it's generally not used for real time querying (even though I'm ok with returning results back to users in 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!

Hive or Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot be within 5 seconds
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavy weight reduces.
check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2]- (E[X])^2)
this implies that you can get the stddev of AB by doing
sqrt(E[AB^2]-(E[AB])^2). E[AB^2] is (sum(A^2) + sum(B^2))/(|A|+|B|)

Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.

It is serious problem without immidiate good solution in the open source space. In commercial space MPP databases like greenplum/netezza should do.
Ideally you would need google's Dremel (engine behind BigQuery). We are developing open source clone, but it will take some time...
Regardless of the engine used I think solution should include holding the whole dataset in memory - it should give an idea what size of cluster you need.

If I understand you correctly and you only need to aggregate on single columns at a time
You can store your data differently for better results
in HBase that would look something like
table per data column in today's setup and another single table for the filtering fields (type_ids)
row for each key in today's setup - you may want to think how to incorporate your filter fields into the key for efficient filtering - otherwise you'd have to do a two phase read (
column for each table in today's setup (i.e. few thousands of columns)
HBase doesn't mind if you add new columns and is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row you'd get all the relevant value which you can do avg. etc. quite easily

You might want to use a plain old database for this. It doesn't sound like you have a transactional system. As a result you can probably use just one or two large tables. SQL has problems when you need to join over large data. But since your data set doesn't sound like you need to join, you should be fine. You can have the indexes setup to find the data set and the either do in SQL or in app math.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse