I am really new to Tableau.
I have an "accident" table (excel) that describes each traffic accident in the past few years, including its "district" (location). I have another "district" table that describes each district of the city, including its population.
Now I want to join those two tables and create a graph of accidents per person for each district.
The problem I face is: the two Excel files come from different databases, which means the same district may appear under different names in the two tables. How do I let Tableau know which districts match?
Could you tell me how I can join those two tables so that I can create my chart?
Please let me know if there are any problems with my approach or understanding. Thank you in advance!
Assuming I understand the question correctly.
Your accident transactions might look like:
District#1, Accident Date, ...
District#2, Accident Date, ...
District#3, Accident Date, ...
Your district dimension table has 1 unique record per district, but the names don't match:
District1, CityName
District2, CityName
District3, CityName
You want to combine the results correctly...
Is Excel the preferred data source?
I assume the actual sources of the data do NOT have a single "code" value for the district that actually matches across systems. This is where MDM (master data management) becomes important for system integrations. The right answer IMO is to have the source systems/DBMSs understand that they are integrated, and to create an XREF in one system that is required whenever a district is set up or exists.
Options. I think you need to "clean" the data someplace. The question is where.
MDM in your source system integrations.
Are the Excel sources created in an automated fashion? You could create a manual XREF Excel sheet and look up/decode one of the district values to conform to a single definition of the district. In addition to your graph, create an audit that looks for any new data that doesn't "look up" correctly, so you know when to maintain the manual XREF.
You could probably also create a calculated XREF in a Tableau calculation itself and then blend based on that calculated field (see the sketch below). I think that would require full Desktop rather than the web editor.
Tableau Prep might help as well (although my company doesn't use it).
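For the calculated-XREF option, a minimal sketch of such a field (the district names here are hypothetical; the WHEN/THEN pairs would come from manually comparing the two sources):

// Calculated field in the accident data source: conform its district
// names to the names used in the district table, then blend on this field.
CASE [District]
WHEN "Dist#1" THEN "District1"
WHEN "Dist#2" THEN "District2"
WHEN "Old Town (N)" THEN "District3"
ELSE [District] // unmatched names pass through; audit these
END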
The idea of the SaaS tool is to have dynamic tables with dynamic custom fields and values of different types. We were thinking of following the "force.com/salesforce.com" example, but it seems too complicated to maintain going forward, and it pushes report building to a huge level of abstraction. So we came up with a simple idea, but we have to be sure it is a reasonably good approach.
This is the architecture we have today (in a few steps):
Each tenant has its own separate database on the cluster (Postgres 12).
A TABLE table keeps all of those tables as references; this entity has a ManyToOne relation to the META table and a OneToMany relation to the DATA table.
The META table is used for metadata configuration and has a OneToMany relation to FIELDS (which holds the name of each field, its type, e.g. TEXT/INTEGER/BOOLEAN/DATETIME, and the attribute it maps to, stored as a string and used only as a reference).
The DATA table has a ManyToOne relation to TABLE and 50 character varying columns named attribute1 ... attribute50, all NULL-able.
Example flow today:
When a user wants to open a TABLE's DATA, e.g. "CARS", we load the META table with all the FIELDS (to get the fields for this query). Say the user wants to query the Brand, Class, Year and Price columns.
Our logic then checks the references for Brand, Class, Year and Price in the META>FIELDS table, so we know that Brand = attribute2, Class = attribute5, Year = attribute6 and Price = attribute7.
We parse the request into a query, e.g. SELECT attribute2, attribute5, attribute6, attribute7 FROM DATA, and show the results to the user. If the user then filters on this data, e.g. Year > 2017 AND Class = 'A', we use SQL's CAST(), for example: SELECT CAST(attribute6 AS int), attribute5 FROM DATA WHERE CAST(attribute6 AS int) > 2017 AND attribute5 = 'A'; so we can actually support most principles of SQL.
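For concreteness, a trimmed-down, illustrative sketch of the schema and the generated query (names and types are approximations, not our exact DDL, and only a few of the 50 attribute columns are shown):

CREATE TABLE meta (id bigint PRIMARY KEY);
CREATE TABLE fields (
    id        bigint PRIMARY KEY,
    metaid    bigint REFERENCES meta(id),
    name      text,       -- e.g. 'Year'
    type      text,       -- TEXT / INTEGER / BOOLEAN / DATETIME
    attribute text        -- e.g. 'attribute6', a reference into DATA
);
CREATE TABLE tables (
    id     bigint PRIMARY KEY,
    metaid bigint REFERENCES meta(id),
    name   text           -- e.g. 'CARS'
);
CREATE TABLE data (
    id         bigint,
    tableid    bigint REFERENCES tables(id),
    createdat  timestamptz DEFAULT now(),
    attribute2 varchar(255),  -- Brand for CARS
    attribute5 varchar(255),  -- Class
    attribute6 varchar(255),  -- Year
    attribute7 varchar(255),  -- Price
    PRIMARY KEY (id, tableid)
);

-- Generated query for: Year > 2017 AND Class = 'A'
SELECT CAST(attribute6 AS int) AS year, attribute5 AS class
FROM data
WHERE tableid = 42            -- hypothetical id of the CARS table
  AND CAST(attribute6 AS int) > 2017
  AND attribute5 = 'A';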
However, moving forward, we are a bit worried about:
Managing such an environment for more tenants as the number of tables grows (e.g. 50 per customer, with roughly 1-5 million rows per TABLE; 5 million is the maximum we allow, and for bigger data we have BigQuery), which gives us 50-250 million rows in a single DATA_X table. That might affect query performance, especially since we expose simple WHERE statements (less, equal, null, etc.) through an abstraction language, e.g. GET CARS [BRAND,CLASS,PRICE...] FILTER [EQ(CLASS,A),MT(YEAR,2017)], developed to be similar to JQL (Jira Query Language).
Transaction locks: we allow batch-uploading CSVs into DATA_X, so when someone loads, e.g., 1 GB of data, it effectively locks the DATA table against other systems accessing it.
Keeping many NULL columns, which can affect space a bit. (For now we are not that worried: at TABLE creation the customer decides how many columns they want, and based on that we assign the TABLE to one of the hardcoded entities DATA_5, DATA_10, DATA_15, DATA_20, DATA_30 or DATA_50, where the number is the limit on attribute columns. We also support a migration option if they decide to switch from 5 to 10 attributes, etc.)
We are at a super early stage, so we can and should make these changes before we scale. We knew this was most likely not the best approach, but we kept it to get the project running for small customers, and for now it is working just fine.
We also thought about JSONB objects, but that is not an option, as we want to keep retrieving the data simple.
What do you think about this solution? (FYI, DATA has a composite PRIMARY KEY (ID, TABLEID) and a built-in CreatedAt column that is used by most of the queries, so there will be a maximum of 3 indexes.)
If it seems bad, what would you recommend as an alternative, based on the details I shared (basically a schema-less RDBMS)?
IMHO, I anticipate issues when you want to join tables, and also with using CAST, etc.
We followed the approach below, which may be of help to you.
We have a table called Cars, plus a couple of companion tables: CarsMeta and CarsExtension. The underlying Cars table has all the fields common to all tenants. The CarsMeta table indicates what types of columns you can have for extending the Cars entity. The CarsExtension table has columns like StringCol1...5, IntCol1...5, LongCol1...10.
In this way, you can also filter the data easily:
If the filter is on the base table, perform the search there; if results are found, match the ids against the CarsExtension table to get the extended rows for the entity.
If the filter is on the extended fields, search the extension table and match against the base entity ids.
The extension table is organized like below:
id - UniqueId
entityid - uniqueid (points to the primary key of the entity)
StringCol1 - string,
...
IntCol1 - int,
...
In this case, it will be easy to do a join for entity and then get the data along with the extension fields.
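A rough sketch of that join (illustrative names; what each generic column means, e.g. StringCol1 = 'Color', comes from CarsMeta, and Brand/Year are assumed common fields):

SELECT c.id, c.Brand, c.Year,  -- common fields on the base table
       e.StringCol1,           -- e.g. the tenant-defined 'Color'
       e.IntCol1               -- e.g. the tenant-defined 'Doors'
FROM Cars c
LEFT JOIN CarsExtension e ON e.entityid = c.id
WHERE c.Year > 2017;           -- filter on the base table first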
If you infer the table metadata and data from separate tables, it will be difficult to maintain over a long period of time and with huge volumes of data.
HTH
Is there a way to create several groups in a CASE WHEN statement?
For example,
CASE [Sales Manager]
WHEN "Manager 1" THEN "Germany"
WHEN "Manager 1" THEN "Russia"
WHEN "Manager 2" THEN "Russia"
END
Such a statement will assign Manager 1 only to Germany, while I need to have it for both countries. Are there any other possible ways to do that?
One solution is to define a table in your database (or Excel) that maps managers to countries. You just need two columns, one for manager and one for country, and a row in the table for each association between a manager and a country.
That way you can easily represent a manager that works with many countries, or a country that has many managers (a many-to-many relationship).
You can then combine that table with your other data using joins or data blending. Realize that when you join data that has a to-many relationship, you can in general get duplicate values in the query results (e.g. the sales quota for a manager can be repeated multiple times, once for each country the manager covers). Unless your filters and workflow eliminate that case, you need to make sure your calculations account for the duplication and avoid double counting.
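For example (made-up table names), joining a per-manager quota to the mapping table repeats the quota once per country, so a naive total double counts:

-- Made-up tables: quotas(manager, quota) and manager_country(manager, country),
-- one row per (manager, country) association.
SELECT SUM(q.quota)   -- overstated: each quota repeats once per country
FROM quotas q
JOIN manager_country mc ON mc.manager = q.manager;

-- DISTINCT avoids the double count, e.g. Manager 1 covers both
-- Germany and Russia but should be counted once:
SELECT COUNT(DISTINCT mc.manager)
FROM manager_country mc
WHERE mc.country IN ('Germany', 'Russia');

In Tableau, COUNTD() plays the same role when you aggregate over the joined data.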
Bottom line -- sometimes it is a lot easier to specify information as data than as code.
I have a set of 100 "student records". I want to have checkboxes for each "favorite_food_type" and "favorite_food"; whichever is checked would filter a bar graph that counts the number of records containing that specific "favorite_food_type" and "favorite_food". The schema could be:
name
favorite_food_type (e.g. vegetable)
favorite_food (e.g. banana)
In the dashboard, I would like to be able to select via checkboxes, "Give me the COUNT OF DISTINCT students with a favorite_food of banana, apple, or pear", and filter the graphs across all records. My issue is that a single student might like both banana and apple. How do I best capture that? Should I have:
CASE A: Duplicate Records (this captures the two different "favorite_food" values, but now I have to figure out how many students there are, which here is one student)
NAME, FAVORITE_FOOD_TYPE,FRUIT
Charlie, Fruit, Apple
Charlie, Fruit, Pear
CASE B: Single Records (this captures the two different "favorite_food" values, but is there a way to pick them out from the delimiters?)
NAME, FAVORITE_FOOD_TYPE,FRUITS
Charlie, Fruit, Apple#Pear
CASE C: Column for Each Fruit (this captures one record per student, but needs a column for every fruit, and many values would be FALSE)
NAME, FAVORITE_FOOD_TYPE, APPLE, BANANA, PINEAPPLE, PEAR
Charlie, Fruit, TRUE, FALSE, FALSE, TRUE
I want to do this as easily as possible.
Avoid Case B if at all possible. Repeating information is almost always best handled by repeating rows -- not by cramming multiple values into a single table cell, nor by creating multiple columns such as Favorite_1 and Favorite_2.
If you are provided data with multiple values in a field, Tableau does have functions and data-connection features that can be used to split a single field into its constituent parts to form multiple fields. That works well with a fixed number of different kinds of information -- say, splitting a City, State field into separate fields for City and State.
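For instance, if you were stuck with the Case B data, a couple of hypothetical calculated fields could pull the parts out (SPLIT's last argument is the token number):

SPLIT([FRUITS], "#", 1)  // first favorite, e.g. "Apple"
SPLIT([FRUITS], "#", 2)  // second favorite, e.g. "Pear"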
Avoid Case C if at all possible. That crosstab structure makes it hard to analyze the data and build useful visualizations, because each value is treated as a separate field.
If you are provided data in crosstab format, Tableau allows you to pivot the data in the data connection pane to reshape into a form with fewer columns and many rows.
Case A is usually the best approach. You can simplify it further by factoring out repeating information into separate tables -- a process known as normalization. Then you can use a join to recombine the tables and see the repeating information when desired.
A normalized approach to your example would have two tables (or tabs in Excel). The first table would have exactly one row per student with 2 columns: name and favorite_food_type. The second table would have a row per student/favorite-food combination, with 2 columns: name and favorite_food. Now each student can have as many favorite foods as you like, or none at all. Since both tables have a name column, that would be the key used to join (combine) the tables when needed.
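A sketch of that design as it would look in a database (the same two-column layout works as two Excel tabs):

CREATE TABLE students (
    name               text PRIMARY KEY,
    favorite_food_type text                     -- e.g. 'Fruit'
);
CREATE TABLE favorite_foods (
    name          text REFERENCES students(name),
    favorite_food text                          -- one row per student/food pair
);

-- The (left) join used when looking at favorite foods: every student
-- appears, repeated once per favorite food (or once with NULL if none).
SELECT s.name, s.favorite_food_type, f.favorite_food
FROM students s
LEFT JOIN favorite_foods f ON f.name = s.name;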
Given that table design, you could have 2 data sources in Tableau. The first one points just to the student table and can be used to create visualizations that involve only students and favorite_food_types. The second data source uses a (left) join to read from both tables and can be used to look at favorite foods. When working with the second data source, you have to be careful when reporting on student names and favorite food types, to account for the duplicated information; so use the first data source when possible. Finally, you can put both kinds of visualizations on a dashboard and use filter and highlight actions to make interaction seamless despite the two sources -- getting the best of both worlds.
My company uses a third-party vendor to get all of our NPS information. I'm trying to set up a data feed from this vendor into our data warehouse, which runs PostgreSQL.
The feed is in the form of 2 tab-separated text files: the "question mapping" and the responses. The question map is one row per question, with columns for question id, question text, question label, question type, etc. - straightforward. The responses are one row per survey response, with a column for each question plus fields like user id, etc. Here are the 2 biggest problems:
The survey questions sometimes use the same question ID for different questions, resulting in multiple columns in the response data having the same name but not being the same question.
The number of questions could change, resulting in a different number of columns in the data.
Both of these things make it a real headache to automate a data feed into a single table.
I'm afraid I don't quite know how to phrase my real question other than, "Does anyone have any ideas how I can accomplish this?" If I think of something better, I'll come back and update this. So for now:
Does anyone have any ideas at all about how I can efficiently set up my automated data feed without having to always drop and recreate everything?
If your data is a mess and doesn't really have well-defined columns, you can use the entity-attribute-value (EAV) pattern: turn each fact into a set of rows with 4 columns - a unique row id; the entity id, the same on every row extracted from one map; an attribute column (holding what would have been the column name), taken from the key of the map; and a value column holding the value from the map. It's not that neat, but you can still query it, and you won't have to drop it when you receive a map with a new column.
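A minimal sketch of that shape for this feed (table and key names are assumptions):

CREATE TABLE responses_eav (
    row_id      bigserial PRIMARY KEY,  -- unique row id
    response_id text,                   -- entity: one survey response
    attribute   text,                   -- what would have been the column name
    value       text                    -- the value from the map
);

-- New or renamed questions become new attribute values, not new
-- columns, so the table never has to be dropped and recreated:
SELECT response_id, value
FROM responses_eav
WHERE attribute = 'nps_score';          -- hypothetical question key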
I am designing a warehouse to accommodate a movie related database. I have a table with the columns - Title, Genre, SalesAmount, ProductionAmount.
One such row would be, say: GodFather, Crime|Drama, 1000000, 20000.
I want to move this to the DW; I am looking at getting it into a fact table, say FactSale, with a link to a Genre dimension.
My objective is to analyze revenues by genre. In this case, how would I build the cube matrix? I also have a mapping table with TitleId and GenreId.
Also, would it be possible to create a dynamic hierarchy, say under Action, Drama, Romance, etc.? The idea is to gather info on a single genre or a combination of genres.
Can someone guide me on how to go about it?
I have found the answer, thanks to the whitepaper from SQLBI with extensive examples and explanations: http://www.sqlbi.com/wp-content/uploads/The_Many-to-Many_Revolution_2.0.pdf
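For anyone landing here later, the heart of that pattern is a bridge table between the fact and the genre dimension. A rough sketch (illustrative, not the paper's exact schema):

CREATE TABLE DimGenre (GenreId int PRIMARY KEY, GenreName varchar(50));
CREATE TABLE FactSale (TitleId int, SalesAmount numeric, ProductionAmount numeric);
CREATE TABLE BridgeTitleGenre (TitleId int, GenreId int);  -- the TitleId/GenreId mapping

-- Revenue by genre; a Crime|Drama title contributes to both genres,
-- so genre subtotals can exceed total sales (expected with many-to-many):
SELECT g.GenreName, SUM(f.SalesAmount) AS Sales
FROM FactSale f
JOIN BridgeTitleGenre b ON b.TitleId = f.TitleId
JOIN DimGenre g ON g.GenreId = b.GenreId
GROUP BY g.GenreName;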