I have a dataset that I'd like to summarize in chart form. There are about 30 categories whose counts I'd like to display in a bar chart, drawn from 300+ responses. I think a pivot table is probably the best way to do this, but when I create a pivot table and select multiple columns, each added column is nested as a subset of the previous one. My data looks something like the following:
ID  Country  Age    thingA  thingB  thingC  thingD  thingE  thingF
1   US       5-9            thB             thD             thF
2   FI       5-9    thA                                     thF
3   GA       5-9    thA                                     thF
4   US       10-14                  thC
5   US       10-14          thB                             thF
6   US       15-18
7   BR       5-9    thA
8   US       15-18                          thD             thF
9   FI       10-14  thA
So, I'd like to create an interactive chart that shows the counts of "thing" items, and then filter it by demographic data (e.g., Country, Age). Notice that the data is non-numeric, so I have to use COUNTA to see how many there are in each category.
Is there a simple way to display chart data that summarizes the counts and will allow me to filter based on different criteria?
The QUERY function can summarize the data in the form you want. The fact that you have "thA", "thB", etc. instead of "1" complicates matters, but the strings can be transformed to numeric data on the fly.
Assuming the data you've shown is in the cells A1:I10, the following formula will summarize it:
=query({B2:C10, arrayformula(if(len(D2:I10), 1, 0))}, "select Col1, Col2, count(Col3), sum(Col3), sum(Col4), sum(Col5), sum(Col6), sum(Col7), sum(Col8) group by Col1, Col2", 0)
Explanation:
{B2:C10, arrayformula(if(len(D2:I10), 1, 0))} creates a table where the first two columns are your B,C (Country, Age) and the other six are filled with 1 or 0 depending on whether the cells in D-I are filled or not.
select Col1, Col2, count(Col3), sum(Col3), ... group by Col1, Col2 selects Country, Age, the total count of rows with this Country-Age combination, the number of rows with thingA for this Country-Age combination, etc.
the last argument, 0, indicates there are no header rows in the table passed to the query.
It's possible to give labels to the columns returned by the query, using label: see query language documentation. It would be something like
label Col1 'Country', Col2 'Age', count(Col3) 'Total count', sum(Col3) 'thingA count', ...
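For readers who want to check the grouping logic outside Sheets, here is a minimal Python sketch of the same computation. It assumes each thX value sits in its matching thing column (the alignment of the sample is inferred), and the names `rows` and `summarize` are illustrative only.

```python
from collections import defaultdict

# Sample rows from the question: (country, age, [thingA..thingF]).
# An empty string stands for a blank cell (assumed column alignment).
rows = [
    ("US", "5-9",   ["", "thB", "", "thD", "", "thF"]),
    ("FI", "5-9",   ["thA", "", "", "", "", "thF"]),
    ("GA", "5-9",   ["thA", "", "", "", "", "thF"]),
    ("US", "10-14", ["", "", "thC", "", "", ""]),
    ("US", "10-14", ["", "thB", "", "", "", "thF"]),
    ("US", "15-18", ["", "", "", "", "", ""]),
    ("BR", "5-9",   ["thA", "", "", "", "", ""]),
    ("US", "15-18", ["", "", "", "thD", "", "thF"]),
    ("FI", "10-14", ["thA", "", "", "", "", ""]),
]


def summarize(rows):
    """Group by (country, age); return [row count, thingA count, ..., thingF count]."""
    summary = defaultdict(lambda: [0] * 7)
    for country, age, things in rows:
        totals = summary[(country, age)]
        totals[0] += 1  # count(Col3): rows in this Country-Age group
        for i, cell in enumerate(things):
            # Same idea as if(len(cell), 1, 0) followed by sum(...)
            totals[i + 1] += 1 if len(cell) > 0 else 0
    return dict(summary)


summary = summarize(rows)
for key in sorted(summary):
    print(key, summary[key])
```

Each output row corresponds to one row of the QUERY result: the group key, the total count, then one count per thing column.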
Add a Count column to your data containing a "1" for every occurrence; this might solve your problem in the pivot table. I was looking for a solution myself and thought of this. It is working for me now.
I'm a recovering engineer who is trying to remember how to do things like this in the Google Sheets query language, and I've completed just about everything I need. There is one more query to do and I'm stuck.
I know how to split up timestamps, pivot on date info from them, find "contains" results, and count occurrences, but I'm stumped on this one. How do you group by a substring? I've tried LEFT, REGEXEXTRACT, and about anything else I can think of, but no luck.
I have a google spreadsheet with 5 pages, each with 4 columns. Each page is created from a form that a user is entering data into. I've added a date filter in B1 and B2 on my results sheet that works fine, too.
Here is an example of a query I'd like to see work.
=query({User1!A1:F; User2!A1:F; User3!A1:F; User4!A1:F; User5!A1:F}, "SELECT Col4, COUNT(Col4) Where Col1>= datetime '"&TEXT(B1,"yyyy-mm-dd HH:mm:ss")&"' AND Col1 <= datetime '"&TEXT(B2+1,"yyyy-mm-dd HH:mm:ss")&"' AND (Col4 is not null AND Col4 = 'Submits') group by **Left(Col2,4) AND** Col4 pivot month(Col1)+1, Day(Col1), Year(Col1)",1)
That bold bit seems to be the problem area. The rest of this works.
Here are the contents of the fields I'm working within each of the pages:
Col 1 - Timestamp
Col 2 - Opportunity # string; I want to group by its left 4 characters. Strings look like O347-183XXXX, so I want to use just O347 in this case.
Col 3 - Opportunity Name (not needed in result)
Col 4 - One of 7 different strings; I want to count the occurrences of one of them, for example 'Submits'.
I want to output a table with a group by Col2 Substring on the left column, Col4 count values for each substring, and pivot by date.
It ought to look like this when I get the results
I've seen some things here and other places that lead me partway there but I just hope there is an easy way to do this.
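To pin down the intended result independently of the query language, here is a small Python sketch of the grouping described above: filter on one status, take the left 4 characters of the opportunity number, and count per prefix and date. The sample rows and the function name are hypothetical; only the column meanings come from the question.

```python
from collections import Counter
from datetime import datetime

# Hypothetical rows: (timestamp, opportunity number, opportunity name, status)
rows = [
    (datetime(2020, 1, 5, 9, 0), "O347-1830001", "Alpha", "Submits"),
    (datetime(2020, 1, 5, 14, 30), "O347-1830002", "Beta", "Submits"),
    (datetime(2020, 1, 6, 10, 0), "O512-1830003", "Gamma", "Submits"),
    (datetime(2020, 1, 6, 11, 0), "O347-1830004", "Delta", "Declines"),
]


def count_by_prefix_and_date(rows, status, start, end):
    """Count rows with the given status per (left 4 chars of Col2, calendar date)."""
    counts = Counter()
    for timestamp, opportunity, _name, row_status in rows:
        if row_status == status and start <= timestamp <= end:
            counts[(opportunity[:4], timestamp.date())] += 1
    return counts


counts = count_by_prefix_and_date(
    rows, "Submits", datetime(2020, 1, 1), datetime(2020, 1, 31)
)
for (prefix, day), n in sorted(counts.items()):
    print(prefix, day, n)
```

The prefixes would form the left column and the dates the pivoted columns of the desired table.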
I used an Apriori algorithm to view the frequent relationships in the dataset and I want to do a dashboard to better visualize this data but I don't know how to do this filter.
This is the bar chart that I created to show the support (number of times something happened) and the confidence (probability of B happening given A) of these associations:
[Image: Apriori chart]
Next to it on the dashboard, I'll have a table with the full dataset used in this Apriori analysis where I have more information such as ID, Income, Hours Worked, etc:
Table from different data source
How can I create this relationship? The two data sources don't have a column in common that I can use for that.
I would need some way to:
Split the values in the antecedents columns by comma and filter only those columns with value equal to 1 in the other dataset
**Dataset A**
'Age Range <=30, Joblevel 1, Maritalstatus Single'
->
'Age Range <=30'
'Joblevel 1'
'Maritalstatus Single'
**Dataset B**
'Age Range <=30' == 1
'Joblevel 1' == 1
'Maritalstatus Single' == 1
Clicking this would filter the table next to it
Is there any way I can do this in Tableau?
You can download the twbx I used in this example here: https://community.tableau.com/servlet/JiveServlet/download/1083124-384949/Apriori.twbx
Thanks in advance for the help!
I am not able to check your twbx on the machine I'm using, but I think you should be able to do this. The fields in the 2 data sources need to match, so manipulate the data sources to make this happen.
For data source 1, the SPLIT function lets you split the comma-separated string into 3 fields.
Putting those 3 fields to the Detail shelf of your bar chart (or even Rows and hiding the header) will mean you can use them in an action filter.
Your second data source is a cross tab - post pivot. You should be able to pivot this data source. Highlight the measures and pivot them. This will give you the field Pivot Field Names and Pivot Field Values.
You only want to keep those with a value of 1 so create a calculated field
[Lookup1]: IF [Pivot Field Values] = 1 THEN [Pivot Field Names] END
Duplicate this field twice so you have Lookup1, Lookup2 and Lookup3.
Then you should be able to action filter the table.
In the action filter set it up so SplitField1 = Lookup1, SplitField2 = Lookup2, etc.
Fingers crossed this works, I haven't been able to test so I am pulling it out of my head.
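The matching logic behind these steps can be sketched in Python, outside Tableau. This is only an illustration of the idea (split the antecedents, then keep rows where every named column equals 1). The table contents and function names are made up, and it does not reproduce Tableau's action filter itself.

```python
def split_antecedents(antecedents):
    """Split a comma-separated antecedent string into trimmed field names."""
    return [part.strip() for part in antecedents.split(",") if part.strip()]


def matching_rows(antecedents, table):
    """Keep rows of the second data source where every antecedent column is 1."""
    wanted = split_antecedents(antecedents)
    return [row for row in table if all(row.get(col) == 1 for col in wanted)]


# Hypothetical rows from the second (cross-tab) data source
table = [
    {"ID": 1, "Age Range <=30": 1, "Joblevel 1": 1, "Maritalstatus Single": 1},
    {"ID": 2, "Age Range <=30": 1, "Joblevel 1": 0, "Maritalstatus Single": 1},
    {"ID": 3, "Age Range <=30": 0, "Joblevel 1": 1, "Maritalstatus Single": 0},
]

selected = matching_rows("Age Range <=30, Joblevel 1, Maritalstatus Single", table)
print([row["ID"] for row in selected])
```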
I have a spreadsheet that looks like this:
https://docs.google.com/spreadsheets/d/1b29gyEgCDwor_KJ6ACP2rxdvauOzacDI9FL2K-jgg5E/edit?usp=sharing
I have two columns I'm interested in, Date and Count. Every few dates, there will be a "TOTAL" line where all the Counts corresponding to that TOTAL will be summed.
I want an output that looks like the cells to the right, where all the TOTAL counts are summed according to month. The problem lies in that Column A has only the date or TOTAL, in separate rows, and this layout can't be changed, leaving me thinking I need to reference the cell directly above TOTAL in column A, which has the correct month I want to group that TOTAL by.
The reason why I can't just filter column A by date range is because of inconsistent use, where sometimes the count data is only entered in the TOTAL row.
I've scoured the internet exploring FILTER, INDIRECT, QUERY, SUMIFS, etc... but can't find exactly how to do this.
I can easily filter column B where A:A="TOTAL", but what I think I am needing to do after that is use each cell above where A:A="TOTAL" as a range for the month criteria, somehow using what I found here: https://exceljet.net/formula/sum-by-month, expressed by ">="&D3 and "<="&EOMONTH(D3,0).
Any help or alternatives would be appreciated. Thank you.
or a different (offset) approach:
=QUERY(FILTER({EOMONTH(INDIRECT("A1:A"&ROWS(B2:B)), 0), B2:B}, A2:A="total"),
"select Col1,sum(Col2)
group by Col1
label sum(Col2)''
format Col1'mmmm'", 0)
The QUERY formula is great for this kind of situation, but grouping by month name alone will merge the same month from different years if you plan to look at multi-year data:
=arrayformula(QUERY(QUERY({row(A:A),TEXT(A:A,"MMMM"),B:B},"SELECT max(Col1),Col2,sum(Col3) where Col3 is not null group by Col2 order by max(Col1) label Col2 'Month', sum(Col3) 'Count'"),"SELECT Col2,Col3"))
try:
=ARRAYFORMULA(QUERY({IF((B2:B="")*(A2:A=""),,VLOOKUP(ROW(A2:A),
IF(A2:A<>"total", {ROW(A2:A), DATEVALUE("01/"&MONTH(A2:A)&"/2000")}), 2, 1)),
IF(A2:A= "total", A2:A, ), B2:B},
"select Col1,sum(Col3)
where lower(Col2) = 'total'
group by Col1
label sum(Col3)''
format Col1'mmmm'", 0))
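The idea shared by these formulas (assign each TOTAL row to the month of the date above it, then sum per month) can be sketched in Python. The sample rows are invented and the function name is illustrative. The sketch keys the result by year and month, which is an assumption made to sidestep the multi-year issue mentioned above.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (column A, column B) rows: A holds either a date or "TOTAL"
rows = [
    (date(2020, 1, 3), 2),
    (date(2020, 1, 4), 3),
    ("TOTAL", 5),
    (date(2020, 1, 20), None),  # count entered only on the TOTAL row
    ("TOTAL", 7),
    (date(2020, 2, 1), 4),
    ("TOTAL", 4),
]


def totals_by_month(rows):
    """Sum TOTAL rows, grouped by the (year, month) of the nearest date above."""
    totals = defaultdict(int)
    last_date = None
    for label, count in rows:
        if isinstance(label, date):
            last_date = label  # remember the most recent date row
        elif str(label).strip().lower() == "total" and last_date is not None:
            totals[(last_date.year, last_date.month)] += count or 0
    return dict(totals)


monthly = totals_by_month(rows)
print(monthly)
```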
I am trying to pivot using the crosstab function but cannot meet my requirement. Is there a way to perform a crosstab dynamically, with a dynamic result set as well?
I have tried the built-in crosstab function, but it does not meet my requirement.
select * from crosstab ('select item,cd, type, parts, part, cnt
from item
order by 1,2')
AS results (item text,cd text, SUM NUMERIC, AVG NUMERIC);
Sample Data:
ITEM CD TYPE PARTS PART CNT
Item 1 A AVG 4 1 10
Item 1 B AVG 4 2 20
Item 1 C AVG 4 3 30
Item 1 D AVG 4 4 40
Item 1 A SUM 4 1 10
Item 1 B SUM 4 2 20
Item 1 C SUM 4 3 30
Item 1 D SUM 4 4 40
Expected Results:
ITEM CD PARTS TYPE_1 CNT_1 TYPE_1 CNT_1 TYPE_2 CNT_2 TYPE_2 CNT_2 TYPE_3 CNT_3 TYPE_3 CNT_3 TYPE_4 CNT_4 TYPE_4 CNT_4
Item 1 A 4 AVG 10 SUM 10 AVG 20 SUM 20 AVG 30 SUM 30 AVG 40 SUM 40
The PARTS value is based on a parameter passed by the user. If the user passes 2 for example, there will be 4 rows in the result set (2 parts for AVG and 2 parts of SUM).
Can I achieve this requirement using the CROSSTAB function, or does a custom SQL statement need to be developed?
I'm not following your data, so I can't offer examples based on it. But I have been looking at pivot/cross-tab features over the past few days, and I was looking at dynamic cross tabs just before seeing your post. I'm hoping that your question gets some good answers. I'll start off with a bit of background.
You can use the crosstab extension for standard cross tabs. What went wrong when you tried it? Here's an example I wrote for myself the other day, with a bunch of comments and aliases for clarity. The pivot looks at item scans to see where the scans were "to", like the warehouse or the floor.
/* Basic cross-tab example for crosstab (text) format of pivot command.
Notice that the embedded query has to return three columns, see the aliases.
#1 is the row label, it shows up in the output.
#2 is the category, what determines how many columns there are. *You have to work this out in advance to declare them in the return.*
#3 is the cell data, what goes in the cross tabs. Note that this form of the crosstab command may return NULL, and coalesce does not work.
To get rid of the null count/sums/whatever, you need crosstab (text, text).
*/
select *
from crosstab ('select
specialty_name as row_label,
scanned_to as column_splitter,
count(num_inst)::numeric as cell_data
from scan_table
group by 1,2
order by 1,2')
as scan_pivot (
row_label citext,
"Assembly" numeric,
"Warehouse" numeric,
"Floor" numeric,
"QA" numeric);
As a manual alternative, you can use a series of FILTER clauses. Here's an example that summarizes error_log records by day of the week. The "down" is the error name, the "across" (columns) are the days of the week.
select "error_name",
count(*) as "Overall",
count(*) filter (where extract(dow from "updated_dts") = 0) as "Sun",
count(*) filter (where extract(dow from "updated_dts") = 1) as "Mon",
count(*) filter (where extract(dow from "updated_dts") = 2) as "Tue",
count(*) filter (where extract(dow from "updated_dts") = 3) as "Wed",
count(*) filter (where extract(dow from "updated_dts") = 4) as "Thu",
count(*) filter (where extract(dow from "updated_dts") = 5) as "Fri",
count(*) filter (where extract(dow from "updated_dts") = 6) as "Sat"
from error_log
where "error_name" is not null
group by "error_name"
order by 1;
You can do the same thing with CASE, but FILTER is easier to write.
It looks like you want something basic, maybe the FILTER solution appeals? It's easier to read than calls to crosstab(), since that was giving you trouble.
FILTER may be slower than crosstab. Probably. (The crosstab extension is written in C, and I'm not sure how smart FILTER is about reading off indexes.) But I'm not sure as I haven't tested it out yet. (It's on my to do list, but I haven't had time yet.) I'd be super interested if anyone can offer results. We're on 11.4.
I wrote a client-side tool to build FILTER-based pivots over the past few days. You have to supply the down and across fields, an aggregate formula and the tool spits out the SQL. With support for coalesce for folks who don't want NULL, ROLLUP, TABLESAMPLE, view creation, and some other stuff. It was a fun project. Why go to that effort? (Apart from the fun part.) Because I haven't found a way to do dynamic pivots that I actually understand. I love this quote:
"Dynamic crosstab queries in Postgres has been asked many times on SO all involving advanced level functions/types. Consider building your needed query in application layer (Java, Python, PHP, etc.) and pass it in a Postgres connected query call. Recall SQL is a special-purpose, declarative type while app layers are general-purpose, imperative types." – Parfait
So, I wrote a tool to pre-calculate and declare the output columns. But I'm still curious about dynamic options in SQL. If that's of interest to you, have a look at these two items:
https://postgresql.verite.pro/blog/2018/06/19/crosstab-pivot.html
Flatten aggregated key/value pairs from a JSONB field?
Deep magic in both.
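As a rough illustration of the "build the query in the application layer" approach quoted above, here is a minimal Python sketch that generates a FILTER-based pivot. It is not the tool described in this answer. The function names are hypothetical, the table and column names are borrowed from the earlier example, and the `aggregate` text is inserted as-is, so it must come from trusted code. In practice the across values would usually be fetched first, for example with a `select distinct` on the across column.

```python
def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value):
    """Quote a value as an SQL string literal, doubling embedded single quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def build_filter_pivot(table, down, across, across_values, aggregate="count(*)"):
    """Return SQL with one FILTER-ed aggregate column per across value."""
    columns = [quote_ident(down), f"{aggregate} as {quote_ident('Overall')}"]
    for value in across_values:
        columns.append(
            f"{aggregate} filter (where {quote_ident(across)} = {quote_literal(value)})"
            f" as {quote_ident(str(value))}"
        )
    select_list = ",\n       ".join(columns)
    return (
        f"select {select_list}\n"
        f"from {quote_ident(table)}\n"
        f"group by {quote_ident(down)}\n"
        f"order by 1;"
    )


sql = build_filter_pivot(
    "scan_table", "specialty_name", "scanned_to", ["Assembly", "Warehouse"]
)
print(sql)
```

Because the output columns are worked out before the statement is sent, the result set shape is known to the caller, which is the part that plain crosstab() cannot decide at run time.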
Running into an issue when running a query in Google Sheets. The results of the array formula query are correct but the column utilized to order the results (Col1) is comprised of both blank/null cells and dates. As such, when ordered by this column the blank/null values are listed first before the dates. Is it possible to have the dates ranked first and push the blank/null cells to the bottom?
Ordering by DESC will not work as I would want the earlier dates listed first. Additionally, the blank/null cells cannot be excluded entirely from the results either (e.g. they correspond to tasks without deadlines but must still be listed).
The formula I am currently using is:
=ARRAYFORMULA(QUERY({DATA RANGE},"SELECT Col1 WHERE Col2 = X OR Col3 = X ORDER BY Col1 LIMIT 10",0))
Seems like there is an easy way to achieve this but I cannot find anything on the topic in other forums. Any help would be greatly appreciated.
Use SORT()
I believe for your example you could make it work like so:
=SORT(ARRAYFORMULA(QUERY({DATA RANGE},"SELECT Col1 WHERE Col2 = X OR Col3 = X",0)), 1, 1) (untested)
If your LIMIT 10 is important, then I think you could wrap the whole thing in another query and re-add the LIMIT.
Illustrated Example:
Range That Needs Querying and Sorting
Formula
Simple version defining a range in which the header is omitted:
=SORT(QUERY(A2:B7, "select *"), 1, 1)
Version that handles headers:
={A1:B1;SORT(QUERY(tabname!A2:B7, "select *"), 1, 1)}
This version creates an array combining the header row and the data rows so it can sort the data rows independently of the header.
Queried and Sorted Results
Breakdown of Formula Components
Array {[range 1]; [range 2]}
SORT() SORT([range], [column to sort on], [sort ascending - true/false or 1/0])
Query() QUERY([range], "[query]")
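The ordering the question asks for (earliest dates first, blanks pushed to the bottom, then an optional limit) can also be sketched in Python. The function name and sample values are illustrative only, with `None` standing in for a blank cell.

```python
from datetime import date


def earliest_first_blanks_last(values, limit=None):
    """Sort dates ascending, put None (blank) entries last, then apply a limit."""
    ordered = sorted(
        values,
        # Blanks get a True flag so they sort after every real date.
        key=lambda v: (v is None, v if v is not None else date.min),
    )
    return ordered if limit is None else ordered[:limit]


deadlines = [None, date(2020, 3, 1), date(2020, 1, 15), None, date(2020, 2, 1)]
print(earliest_first_blanks_last(deadlines))
print(earliest_first_blanks_last(deadlines, limit=4))
```

Applying the limit after the sort mirrors the suggestion above of re-adding LIMIT in an outer query.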