Working with a delimited list of items in a Tableau field

I am preparing a data visualization in Tableau.
I have some data that can be simplified like this:
Name, Score, Tag
Joe, 5, A;B
Phil, 7, D
Quinn, 9, A;C
Bill, 3, A;B;C
I would like to generate a word cloud on the Tag field that counts
occurrences of each tag (A, B, C, ...). So I need to generate this:
A,3
B,2
C,2
D,1
In other words, I need help working with a field that contains a list of delimited values.
In the example data ; is the delimiter, but it could be anything.
I would like the word cloud to update as the user
applies filters, e.g. dragging a slider to set score > 5.
So the tag count has to be done on the fly.
I'm pretty sure I'll need to use calculated fields and table calculations?
Possibly I'll need a separate table tracking the tags?
I have no problem building the word cloud and other viz elements.
What I'm looking for help with is parsing the delimited list field and
calculating the tag counts.
I do have full control over the source data, so if there is an easier way to
do this by reorganizing the schema, I'd be glad to do that. I thought of breaking
the field up into separate tag1, tag2, ... tagX fields and trying to count over the
separate fields, but I'm not sure that's any simpler.
Thanks for any tips.

Another (probably better in your case) approach is to reshape the data before feeding it to Tableau. Tableau works best with normalized data.
Preprocess it to look like:
Name, Score, Tag
Joe, 5, A
Joe, 5, B
Phil, 7, D
Quinn, 9, A
Quinn, 9, C
Bill, 3, A
Bill, 3, B
Bill, 3, C
At that point, the standard Tableau word cloud charts should work well, and it will scale easily as you add more tags and data.
Reshaping data to normalize it prior to analysis with Tableau is a pretty standard step. Sometimes you can do it automatically, say with custom SQL, but often you'll have to use some sort of script first. If your data comes from Excel, Tableau has a plug-in that can help with reshaping data; look for it on the Tableau knowledge base.
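If you do end up scripting the reshape, here's a minimal Python sketch of the idea. It assumes a CSV with Name, Score, Tag columns and ";" as the tag delimiter (both taken from the sample data above); adjust the names and delimiter to your real schema.

```python
import csv

# Split the delimited Tag column into one output row per (Name, Score, tag).
# Assumes input.csv has a header row with columns: Name, Score, Tag.
with open("input.csv", newline="") as src, open("normalized.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["Name", "Score", "Tag"])
    writer.writeheader()
    for row in reader:
        for tag in row["Tag"].split(";"):   # ";" is the delimiter in the example
            tag = tag.strip()
            if tag:                          # skip empties from trailing delimiters
                writer.writerow({"Name": row["Name"], "Score": row["Score"], "Tag": tag})
```

Point Tableau at normalized.csv and the word cloud is just the Tag dimension with a count aggregate; score filters keep working because Score is repeated on every tag row.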

Here's an approach that would be tolerable if you had a fixed set of 3 or 4 tags. Since you have closer to 50K possible tags, it's not a feasible approach for your problem as is, but maybe it will give you an idea. Similar approaches can be used to solve different kinds of problems in Tableau, so it's a useful trick to know.
For each tag, create a boolean calculated field that returns 1 if the current row contains that particular tag and null otherwise (or whatever condition you want to test).
For example, define a calculated field called Tag_A as:
if contains(Tag, "A") then
1
end
Similarly, define calculated fields Tag_B, Tag_C, etc.
So far it's easy.
Then you can use those fields in other calculations: to count the number of records that contain tag A, to filter to only those that contain A, or on the Condition tab when defining sets that are computed dynamically by a formula. Of course, the low-level calculated field can be more complex, say checking for the presence of at least 2 tags out of a list.
If nothing else, this approach sometimes lets you break complex problems into bite sized pieces.
Unfortunately, hard-coding calculated field names won't scale to 50K tags. For that, you probably want to reshape your data.
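That said, if you only had a handful of tags, you could generate the per-tag formulas with a short script instead of typing them by hand. This is just a sketch in the spirit of the Tag_A example above; the tag list is hypothetical.

```python
# Print one Tableau calculated-field formula per tag, mirroring the Tag_A example.
# Only practical for a small, fixed tag list -- with ~50K tags, reshape the data instead.
tags = ["A", "B", "C", "D"]  # hypothetical fixed set of tags

for tag in tags:
    print(f'Tag_{tag}: IF CONTAINS([Tag], "{tag}") THEN 1 END')
```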

Related

Is it possible to limit the number of rows in the output of a Dataprep flow?

I'm using Dataprep on GCP to wrangle a large file with a billion rows. I would like to limit the number of rows in the output of the flow, as I am prototyping a Machine Learning model.
Let's say I would like to keep one million rows out of the original billion. Is it possible to do this with Dataprep? I have reviewed the documentation on sampling, but that only applies to the input of the Transformer tool and not to the output of the process.
You can do this, but it does take a bit of extra work in your recipe: set up a formula in a new column using something like RANDBETWEEN to give you a random integer between 1 and 1,000 (in this million-out-of-a-billion case). From there, filter the rows to keep only those with one particular value of that random integer, and your output will contain just that randomized subset. Just have the last part of the recipe remove the temporary column.
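For what it's worth, the logic of that recipe is just "give every row a random integer from 1 to 1,000 and keep one value". Here's a minimal Python sketch of the same idea (not Dataprep syntax; the file names are placeholders):

```python
import random

# Keep roughly 1 row in 1,000 -- the equivalent of filtering a RANDBETWEEN(1, 1000)
# column down to a single value (a million rows out of a billion).
with open("big_input.csv") as src, open("sample.csv", "w") as dst:
    dst.write(next(src))                     # copy the header row
    for line in src:
        if random.randint(1, 1000) == 1:     # each row has a 1-in-1,000 chance
            dst.write(line)
```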
So indeed there are two approaches to this.
As Courtney Grimes said, you can use one of the two functions that generate a random number: RANDBETWEEN (a random integer within a given range) or RAND (a random number between 0 and 1).
These can be used to slice an "even" (uniform) portion of your data. As suggested, add a RANDBETWEEN(1,1000) column, then filter on a single value of x with 1 <= x <= 1000, because that keeps 1/1,000 of the data (a million rows out of a billion).
Alternatively, if you just want a million records in your output but either
don't want to rely on knowing the size of the entire table, or
just want the first million rows regardless of how many rows there are,
you can use the built-in row-filtering transformations (e.g., keep the top rows or a range of rows).
P.S. Using the $sourcerownumber metadata parameter (see the in-product documentation), you can filter/keep a portion of the data (as in the first scenario) in one step, i.e. without creating an additional column.
BTW, an easy way to discover how-tos in Trifacta is to just type what you're looking for in the Search Transformation pane (accessed via Ctrl+K). Searching for "filter" will surface most of the relevant options for your problem.
Cheers!

Tableau: show categories from a calculation even when a category is not visible

I have a calculation that outputs multiple values, and I am creating a table on those values. For example, with the data below my formula is:
if data is 1 then calculation is `one`
if data is 2 then calculation is `two`
if data is 3 then calculation is `three`
Since `three` doesn't actually appear in the output, when I create a table, `three` is not displayed. Is there any way to display it?
I tried Table Layout >> Show Empty Rows and Columns and it didn't work.
data calculation
1 one
2 two
Tableau discovers the possible values for a dimension field dynamically from the query results.
If ‘three’ does not appear in your data, then how do you expect Tableau to know to make a column header for that nonexistent, but potential, value? It can’t read your mind.
This situation does occur often though - perhaps you want row or column headers to remain stable, even when you change filters in a way that causes some to no longer appear in the query results.
There are a few ways you can force Tableau to pad or complete a domain:
One solution is to pad your data to make sure each value for your dimension field appears in at least one data row.
You can often do this easily by using a union to append some extra rows to your original data, and you can add padding rows that don’t impact any results by leaving all your measure columns null, since nulls are ignored by aggregation functions.
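For example, here's a small pandas sketch of that padding trick using the one/two/three example from the question; the amount measure is made up purely to show that padding rows leave measures null:

```python
import pandas as pd

# Original data: the value 'three' never occurs, so Tableau won't show a header for it.
data = pd.DataFrame({"data": [1, 2],
                     "calculation": ["one", "two"],
                     "amount": [10, 20]})        # 'amount' is a hypothetical measure

# Padding rows: one per value that should always appear, with the measure left null
# so aggregates of the measure (SUM, AVG, COUNT of the field) are unaffected.
padding = pd.DataFrame({"data": [1, 2, 3],
                        "calculation": ["one", "two", "three"],
                        "amount": [None, None, None]})

padded = pd.concat([data, padding], ignore_index=True)
padded.to_csv("padded.csv", index=False)          # feed this file to Tableau instead
```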
Another common solution, a bit more effort, is to build what is known as a scaffold data source, which is little more than a list of your dimension members. You can then use that as the primary data source with data blending, making your original data source secondary.
There are two situations where Tableau can detect the absence of data and leave space for it in the visualization automatically:
For numeric types, you can create a bin field that will automatically pad for missing bins.
Similarly, date fields can show missing values because, like bins, Tableau can tell when a month doesn’t appear in the data and leave room for it in the view.

Best way to store metric data used for graphs

What is the best way to store metrics data used in displaying graphs?
Currently I have a table analytics(domain::text, interval_in_days::int, grouping::text, metric::text, type::text, labels[], data[], summary::json)
domain is the overall category of the metrics. Like what part of the application they're under. Could be sales or support etc.
the interval_in_days and grouping are 'view options' the end user can specify at the interface level to have a different view of the data points.
grouping can be date, day_of_week or time_of_day
interval_in_days can be 7, 30 or 90
labels is an array of the labels on the x-axis and data are the corresponding datapoints.
type is either data_series or summary. If data_series, the row represents the data used for drawing the graph, while a summary row has the summary::json field populated with an object like {total_number_of_X: 132, median_X: 320, ... etc}
metric is simply the metric the corresponding graph represents, so there's a separate graph for each value of metric
From this it follows that for each metric/graph I display, I have 9 rows (3 intervals * 3 groupings). For each domain I have a single row with type summary.
Every few hours I aggregate a lot of data across multiple tables into the analytics table, so I don't have to perform expensive queries ad hoc.
I feel this is not the optimal approach, so I'm really interested in seeing how other people accomplish the same task, and in any suggestions.
There is nothing wrong with storing 9 rows of raw data and later aggregating them to something more comfortable. It's a common approach and has performance benefits in some situations.
What I would really re-think in your design are the datatypes. From your description it seems you can transform all ::text fields into something like ::varchar(20). Then you can use STORAGE PLAIN on these columns and your table will become more efficient.
Also, consider adding foreign keys to describe what is stored in individual columns. For example, you stated grouping can be date, day_of_week or time_of_day, so you could have a groupings table that lists these options. But the foreign key would have to be covered by an index, so you may want to skip that for performance reasons.
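If it helps, here's a rough sketch of the kind of DDL being suggested, applied from Python with psycopg2. The table and column names come from the question; the varchar length, the groupings lookup-table name, and the connection string are assumptions, and the foreign key is included even though you might skip it as noted above.

```python
import psycopg2

statements = [
    # text -> varchar(20), then PLAIN storage, as suggested above
    # (repeat for domain, metric and type as appropriate)
    'ALTER TABLE analytics ALTER COLUMN "grouping" TYPE varchar(20)',
    'ALTER TABLE analytics ALTER COLUMN "grouping" SET STORAGE PLAIN',
    # lookup table listing the allowed grouping values, enforced by a foreign key
    "CREATE TABLE groupings (name varchar(20) PRIMARY KEY)",
    "INSERT INTO groupings (name) VALUES ('date'), ('day_of_week'), ('time_of_day')",
    'ALTER TABLE analytics ADD FOREIGN KEY ("grouping") REFERENCES groupings (name)',
]

# The connection string is a placeholder; the with-block commits on success.
with psycopg2.connect("dbname=mydb user=me") as conn:
    with conn.cursor() as cur:
        for sql in statements:
            cur.execute(sql)
```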

Count the number of instances of values across multiple dimensions in Tableau

I'm currently looking to count the number of times a value appears across multiple dimensions. For example, say I have a set of data where each row has a Name plus two food columns (Food1 and Food2),
and I want to return the total count of each food across both columns,
ideally in the form of a bar graph. I want to keep the names associated with the data, so I can filter, let's say, by all "Bobs" or all "Hannahs".
Does anyone have any advice on how to do this in Tableau?
Here are a couple of ways you may be able to do this.
1) Create a calculated field for each food type. This is a bit cumbersome and you would need to add new ones for any new foods added. Your calculations would look like this:
Hamburgers:
SUM(IF [Food1] = 'Hamburgers' OR [Food2] = 'Hamburgers' THEN 1 END)
Then you would make use of the Measure Names and Measure Values built-in fields.
2) You can normalize your data. If you are referencing an Excel or text file, you can do this right in Tableau: go to the Data Source tab, select the Food fields, and choose to Pivot them. That turns the separate food columns into a single column of food values, with one row per name/food pair, and a simple count of records per food then gives the totals you want.
Finally, both approaches support creating a bar chart.
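If you'd rather reshape outside Tableau, the same pivot is straightforward in pandas. Here's a sketch with made-up rows (the Food1/Food2 column names and the Bob/Hannah/Hamburgers values echo the question; Pizza and Salad are invented):

```python
import pandas as pd

# Hypothetical wide data: one row per person, foods spread across two columns.
wide = pd.DataFrame({
    "Name":  ["Bob", "Hannah"],
    "Food1": ["Hamburgers", "Pizza"],
    "Food2": ["Pizza", "Salad"],
})

# Unpivot (melt) the food columns into a single Food column, one row per name/food.
tidy = wide.melt(id_vars="Name", value_vars=["Food1", "Food2"],
                 var_name="Slot", value_name="Food")

# Count how many times each food appears across both original columns.
counts = tidy.groupby("Food").size().reset_index(name="Count")
print(counts)
```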

Filemaker: making queries of large data more efficient

OK, I have a Master Table of shipments and a separate Charges table. There are millions of records in each, and the data came into FileMaker from a legacy system, so all the fields are defined as Text even though they may really be Date, Number, etc.
There's a date field in the charges table. I want to create a number field to represent just the year. I can use the Middle function to parse the field and get just the year in a Calculation field. But wouldn't it be faster to have the year as a literal number field, especially since I'm going to be filtering and sorting? So how do I turn this calculation into its value? I've tried just changing the Calculation field to Number, but it just renders blanks.
There's something wrong with your calculation; it should not turn blank just because the field type is different. For example:
Middle("10-12-2010", 7, 4)
should suffice, provided the calc result is set to Number. You may also wrap it into GetAsNumber(...), but, really, there's no difference as long as field type is right.
If you have FM Advanced, try to set up your calc in the Data Viewer (Tools -> Data Viewer) rather than in Define Fields; this is faster, and once you like the result, you can transfer it into a field or do a replace. But from the searching/sorting standpoint there's no difference between a (stored) calculation and a regular field, so replacing is pointless and, actually, more dangerous, as there's no way to undo a wrong replace.
Here's what I was looking for, from
http://help.filemaker.com/app/answers/detail/a_id/3366/~/converting-unstored-calculation-fields-to-store-data :
Basically, instead of using a Calculation field, you create an EMPTY Number, Date, or Text field and use Replace Field Contents from the Records menu, and put your calculation (or reference, or both) there.
Not dissing FileMaker at all, but millions of records means FileMaker is probably the wrong choice here. Your system will be slow, slow, slow. FileMaker is great for workgroups and there is no way to develop a database app faster. But one thing FileMaker is not good at is handling huge numbers of records.
BTW, Mikhail Edoshin is exactly right.