Please see example screengrab
I would like to populate cell M2. First, K2 (Taylor) should be matched against the column headers C1:I1 to pick the column to inspect, here C2:C32. I then want to count the number of times "a" appears in C2:C32 where Type (column B) = "r".
So the result would be 3 (Reynolds, Maggio & Hamilton).
As you can see, I've managed to populate column R with totals that do not take Type (column B) into account, but I am having great difficulty understanding how to extend the comparison, intentionally without using helper columns or rows.
Any help would be greatly appreciated.
Since the count depends on two columns, you will have to use COUNTIFS. Without the dynamic column lookup, the formula for M2 would be:
=COUNTIFS($B$2:$B$32,"r",$C$2:$C$32,"a")
          ^------------^ ^------------^
          1st Condition  2nd Condition
To make it dynamic, only the second column needs to be changed:
=COUNTIFS($B$2:$B$32,"r",OFFSET($B$2:$B$32,0,MATCH($K2,$C$1:$I$1,0)),"a")
The formula for your totals could also be simplified to this (pass the range to OFFSET as it is rather than manually specifying a height such as 32 rows):
=COUNTA(OFFSET($B$2:$B$32,0,MATCH($K2,$C$1:$I$1,0)))
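To check the logic outside Excel, here is a minimal Python sketch of the same two-condition count. The data and the `count_matches` helper are hypothetical, made up for illustration; `headers.index(name)` plays the role of MATCH, and the row filter plays the role of the two COUNTIFS conditions.

```python
def count_matches(headers, rows, types, name, type_value, cell_value):
    """Count rows whose type equals type_value and whose cell in the
    column headed `name` equals cell_value."""
    # Zero-based equivalent of MATCH($K2,$C$1:$I$1,0)
    col = headers.index(name)
    return sum(
        1
        for row_type, row in zip(types, rows)
        if row_type == type_value and row[col] == cell_value
    )


# Hypothetical miniature of the sheet:
# `types` stands in for column B, `rows` for the data columns under the headers.
headers = ["Taylor", "Smith"]
types = ["r", "r", "x", "r", "r"]
rows = [
    ["a", "b"],
    ["a", "a"],
    ["a", ""],
    ["b", "a"],
    ["a", ""],
]

taylor_count = count_matches(headers, rows, types, "Taylor", "r", "a")
print(taylor_count)  # 3
```

The third row has "a" under Taylor but its type is "x", so it is not counted, which is exactly what the extra COUNTIFS condition adds over a plain count.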
The first column only has one row while the second column has three rows that correspond to the first row of the first column. For example, something like this.
Is there a way to run a logical test where, if any of the values in the second column pass the test, I get a 1, and if none of them pass, I get a 0?
Thank you for your help!
Yes, you can do this using LOD expressions and a simple boolean formula. Please give a more specific example of what you want to do, and a formula can be written for it.
I have a table with two columns. For simplicity, let's say Column A is General Contractors and Column B is Subcontractors. Any given general contractor can have a variable number of subcontractors. I would like to add a third column that simply displays a count of how many subcontractors each general contractor has.
I have tried several calculations using "fixed" and "include" functions as well as "Count" and "CountD" functions and have tried directly using the count functions (right-click>>measure>>count) but all I get are 1's in the resulting column.
The data come from a table where there is one row for each subcontractor, so if a general contractor had 5 subcontractors there would be 5 rows in which the general contractor repeats, each with a different subcontractor next to it.
There are far too many different general contractors to use conditional statements.
Is what I'm doing possible and what other things should I try?
Try this
{Fixed [general contractor]: Countd([sub contractor]) }
Add this field to your view after contractor and subcontractor. The count for each general contractor (say 5) will then be repeated on every row for that general contractor.
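If it helps to see what the FIXED expression computes, here is a rough Python equivalent on made-up data (the contractor names and the `subcontractor_counts` helper are hypothetical). It counts distinct subcontractors per general contractor and repeats that count on every row, which is the behaviour described above.

```python
from collections import defaultdict


def subcontractor_counts(rows):
    """rows: (general_contractor, subcontractor) pairs, one per source row.
    Returns each row with the distinct subcontractor count of its contractor."""
    distinct = defaultdict(set)
    for contractor, sub in rows:
        distinct[contractor].add(sub)
    # Attach the per-contractor distinct count to every original row.
    return [(c, s, len(distinct[c])) for c, s in rows]


# Hypothetical data; note the duplicated ("Acme", "Sub2") row.
rows = [
    ("Acme", "Sub1"),
    ("Acme", "Sub2"),
    ("Acme", "Sub2"),
    ("Bolt", "Sub3"),
]

result = subcontractor_counts(rows)
for line in result:
    print(line)
```

Because a set is used, the duplicated row does not inflate the count, which mirrors COUNTD rather than COUNT.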
I have a scenario where my dataframe has 3 columns: a, b and c. I need to validate whether the length of the values in each column is equal to 100. Based on the validation I create status columns a_status, b_status and c_status with the values 5 (Success) and 10 (Failure). In failure scenarios I need to update a count and create new columns a_sample, b_sample and c_sample holding up to 5 failing sample values separated by ",". To create the sample columns I tried this:
df= df.select(df.columns.toList.map(col(_)) :::
df.columns.toList.map( x => (lit(getSample(df.select(x, x + "_status").filter(x + "_status=10" ).select(x).take(5))).alias(x + "_sample")) ).toList: _* )
The getSample method just takes an array of rows and concatenates them into a string. This works fine for a limited number of columns and a small data size. However, with more than 200 columns and more than 1 million rows, the performance impact is huge. Is there an alternative approach?
While the details of your problem statement are unclear, you can break up the task into two parts:
1. Transform data into a format where you identify several different types of rows you need to sample.
2. Collect sample by row type.
The industry jargon for "row type" is stratum/strata and the way to do (2), without collecting data to the driver, which you don't want to do when the data is large, is via stratified sampling, which Spark implements via df.stat.sampleBy(). As a statistical function, it doesn't work with exact row numbers but fractions. If you absolutely must get a sample with an exact number of rows there are two strategies:
Oversample by fraction and then filter out the unneeded rows, e.g., using the row_number() window function followed by a filter 'row_num <= n (row_number() starts at 1, so this keeps n rows).
Build a custom user-defined aggregate function (UDAF), firstN(col, n). This will be much faster but a lot more work. See https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html
An additional challenge for your use case is that you want this done per column. This is not a good fit with Spark's transformations such as grouping or sampleBy, which operate on rows. The simple approach is to make several passes through the data, one column at a time. If you absolutely must do this in a single pass through the data, you'll need to build a much more custom UDAF or Aggregator, e.g., the equivalent of takeFirstNFromAWhereBHasValueC(n, colA, colB, c).
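To make the single-pass idea concrete, here is a plain Python sketch of the per-column logic such a custom aggregator would carry. This is not Spark code and not the UDAF API; `first_n_failures` and the toy rows are hypothetical. It only shows the state to keep per column (a failure count and at most n samples) so that one pass over the rows is enough.

```python
def first_n_failures(rows, columns, n=5, expected_len=100):
    """Single pass over rows (dicts of column -> string value).
    Returns (failure count per column, comma-joined first n failing values per column)."""
    counts = {c: 0 for c in columns}
    samples = {c: [] for c in columns}
    for row in rows:
        for c in columns:
            value = row[c]
            if len(value) != expected_len:  # validation failed for this cell
                counts[c] += 1
                if len(samples[c]) < n:  # keep only the first n failures
                    samples[c].append(value)
    return counts, {c: ",".join(v) for c, v in samples.items()}


# Toy data with expected_len=3 instead of 100 to keep the example short.
rows = [
    {"a": "abc", "b": "xy"},
    {"a": "a", "b": "xyz"},
    {"a": "ab", "b": "x"},
]

counts, samples = first_n_failures(rows, ["a", "b"], n=5, expected_len=3)
print(counts)   # {'a': 2, 'b': 2}
print(samples)  # {'a': 'a,ab', 'b': 'xy,x'}
```

The per-column state is small and bounded by n, which is what makes it a candidate for an aggregation that merges partial results instead of collecting rows to the driver.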
I have a dimension I am showing in a text table that can have one of 3 possibilities "A", "B", or "C" and I want at all times to have A, B and C shown in a text table even if one of them has 0 occurrences. The issue is that I am filtering this based on date, so it is possible that for example B may not exist, but I still want to have a 0 printed for B.
I have gone to Analysis -> Table layout -> show empty rows which will show "B", but in the count display it shows a blank. How can I get it to display a 0?
This problem is well known among Tableau users, and I have not yet seen a generic Tableau-only solution. All proper solutions start with injecting rows into your data, which I assume you do not want to do.
The method below only works if you have a date dimension in the view alongside the measure and the no-data dates are not completely filtered out. In that case you will see zeros even for dates that have no data, as in the screenshot below.
When you filter out the no-data dates, unfortunately you will keep on seeing NULLs.
If you are using the SUM of Number of Records as your occurrences, then you may create a calculated field as below and use it in your pane:
ZN(LOOKUP(SUM([Number of Records]),0))
You can leave the Default Table Calculation as Automatic so the results are computed along Table (across).
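For reference, the output you are after can be described in a few lines of Python. This is only a sketch of the intended result on made-up records, not of how Tableau evaluates the table calculation: the full list of categories is fixed up front, and any category with no rows after the date filter gets 0, which is the role ZN plays for nulls.

```python
from collections import Counter


def counts_with_zeros(records, categories, keep):
    """records: (category, iso_date) pairs; keep: predicate on the date.
    Returns a count for every category, 0 where nothing survives the filter."""
    counted = Counter(cat for cat, day in records if keep(day))
    # Counter returns 0 for missing keys, so absent categories show as 0.
    return {cat: counted[cat] for cat in categories}


# Hypothetical records; "B" only occurs in February.
records = [
    ("A", "2018-01-05"),
    ("B", "2018-02-10"),
    ("C", "2018-01-20"),
    ("A", "2018-01-25"),
]

january = counts_with_zeros(records, ["A", "B", "C"], lambda d: d.startswith("2018-01"))
print(january)  # {'A': 2, 'B': 0, 'C': 1}
```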
I am working on data in Spotfire. The table has 4 columns:
RowID
StudID
IMT
Date
I am trying to insert a calculated column in Spotfire that gets the date from the previous row for a specific StudID. The date should be left empty for the first entry of each StudID, since it has no previous row.
Please refer to the image for details:
This will be a calculated column using the OVER function, along with Intersect, Previous and the First aggregation.
First([Date]) OVER Intersect(Previous([Date]), [StudID])
It reads: over the intersection between the group of dates previous to the current row (which are all the same date) and the Student IDs that match the current row, give me the first row of that group. In your example it will only ever return one date for that group, but the formula needs to be able to handle multiple rows. You may also need to think about whether this will happen in your data and what you are going to do about it, for example:
StudID Date
124-639 6/12/2018
124-639 6/12/2018
124-639 6/14/2018
Building off of JasonJ's answer, it looks like his solution ran into issues when the dates of different StudIDs overlapped with one another.
So I was seeing something along the lines of this:
StudID, Date, Result
A, 10/1/2014,
A, 10/10/2014, 10/1/2014
A, 10/17/2014, 10/10/2014
B, 10/20/2014,
A, 10/21/2014,
B, 10/22/2014,
B, 10/24/2014, 10/22/2014
I created a weird workaround by adding another Calculated Column.
I doubt this is the IDEAL way to do this (I'd bet there's a better OVER function, but I couldn't identify it right off), but it looks like it's working.
First Calculated Column (Named [CalcRank]):
Rank(Concatenate([StudID],Year([Date]),If(DayOfYear([Date])<10,"0",""),If(DayOfYear([Date])<100,"0",""),DayOfYear([Date])))
Second Calculated Column:
Max([Date]) OVER (Intersect(Previous([CalcRank]),[StudID]))
Please note, you may have to pad your StudID with 0s to make sure it orders properly, like I did with the Date column.
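As a cross-check for either approach, here is a small Python sketch of the result the calculated column is meant to produce, run on the interleaved example above. It is not Spotfire syntax; `previous_dates` is a hypothetical helper that orders rows by (StudID, Date) and carries the last date seen per StudID.

```python
from datetime import date


def previous_dates(rows):
    """rows: (stud_id, date) pairs in any order.
    Returns, per row, the previous date for the same stud_id, or None."""
    last_seen = {}
    result = [None] * len(rows)
    # Visit rows ordered by student and then date, keeping original positions.
    order = sorted(range(len(rows)), key=lambda i: (rows[i][0], rows[i][1]))
    for i in order:
        stud_id, day = rows[i]
        result[i] = last_seen.get(stud_id)  # None for a student's first row
        last_seen[stud_id] = day
    return result


rows = [
    ("A", date(2014, 10, 1)),
    ("A", date(2014, 10, 10)),
    ("A", date(2014, 10, 17)),
    ("B", date(2014, 10, 20)),
    ("A", date(2014, 10, 21)),
    ("B", date(2014, 10, 22)),
    ("B", date(2014, 10, 24)),
]

prev = previous_dates(rows)
for (stud_id, day), p in zip(rows, prev):
    print(stud_id, day, p)
```

With this ordering, A's 10/21/2014 row gets 10/17/2014 and B's 10/22/2014 row gets 10/20/2014, the two values that were blank in the overlapping case. Note that if a student has two rows with the same date, this sketch gives the second one that same date as its "previous" value, so duplicates still need a decision.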