I am trying to split a ranking table into two, but the problem is when I split it, the ranking is messed up.
See the images below. This is the ranking table before the split.
So, with the above table, I wanted to split on the Profit column: when the profit is above 10,000 the row goes in group one, otherwise in group two.
Next 2 images are after split.
As you can see the second split image starts with 1 instead of 10.
How can I split it without messing up the ranking?
Duplicate the Rank of Profit calc and use that to filter.
Below, rank of profit = rank(sum([Profit]))
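To illustrate why filtering on the rank keeps the numbering intact, here is a minimal Python sketch (not Tableau code). It assumes the rank is computed over the whole table before any split is applied; the names and profit values are made up, and `rank_then_split` is a hypothetical helper.

```python
def rank_then_split(profits, threshold=10_000):
    """Rank every row by profit first, then split; ranks are never recomputed."""
    # Sort descending so the highest profit gets rank 1 (no ties in this toy data).
    ordered = sorted(profits.items(), key=lambda item: item[1], reverse=True)
    ranked = [(rank, name, profit)
              for rank, (name, profit) in enumerate(ordered, start=1)]
    group_one = [row for row in ranked if row[2] > threshold]
    group_two = [row for row in ranked if row[2] <= threshold]
    return group_one, group_two


# Hypothetical data, for illustration only.
profits = {"A": 30_000, "B": 5_000, "C": 20_000, "D": 1_000}
group_one, group_two = rank_then_split(profits)
print(group_one)
print(group_two)
```

Because the rank is attached before the split, the second group continues from where the first one stopped instead of restarting at 1.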
I'm fairly sure what I'm attempting is not the ideal way to do things, due to my lack of knowledge of Power BI, but here goes:
I have two tables in the form of:
One has the actual power against wind and the other is a reference
I created calculated columns that add a corresponding binned speed to each row (so 1-2, 2-3, 3-4 etc)
I have filters and slicers applied on the page / visual that will keep changing.
What I want is to create a pivot or a grouped table that changes dynamically based on my filters.
The reason I want this is that the table I currently have shows totals that are averaged (because each individual row is averaged), but I want a sum of the averages by category. If I can have this as a calculated table instead of a visual (picture below), I would likely be able to aggregate it again to get what I want.
So on the above table I want the totals to be the sum of the individual rows. I also want to be able to use these totals to carry out other calculations (simple stuff like total divided by a fixed number, etc.).
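To make the difference between the two totals concrete, here is a small Python sketch (not Power BI or DAX). It assumes each row carries a binned speed and a power value; the numbers are invented and `sum_of_bin_averages` is a hypothetical helper.

```python
from collections import defaultdict


def sum_of_bin_averages(rows):
    """Average the power within each speed bin, then sum those averages."""
    by_bin = defaultdict(list)
    for speed_bin, power in rows:
        by_bin[speed_bin].append(power)
    averages = {b: sum(values) / len(values) for b, values in by_bin.items()}
    return averages, sum(averages.values())


# Hypothetical rows: (binned speed, power).
rows = [("1-2", 10.0), ("1-2", 20.0),
        ("2-3", 30.0), ("2-3", 50.0), ("2-3", 40.0)]
averages, total = sum_of_bin_averages(rows)

# What an "averaged" total row shows instead: the average over all rows.
overall_average = sum(power for _, power in rows) / len(rows)
print(averages, total, overall_average)
```

Here the per-bin averages add up to a different number than the average over all rows, which is the gap between the total the visual shows and the total being asked for.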
I would like to randomly sample n rows from a table using Impala. I can think of two ways to do this, namely:
SELECT * FROM TABLE ORDER BY RANDOM() LIMIT <n>
or
SELECT * FROM TABLE TABLESAMPLE SYSTEM(1) limit <n>
In my case I set n to 10000 and sample from a table of over 20 million rows. If I understand correctly, the first option essentially creates a random number between 0 and 1 for each row and orders by this random number.
The second option creates many different 'buckets' and then randomly samples at least 1% of the data (in practice this always seems to be much greater than the percentage provided). In both cases I then select only the 10000 first rows.
Is the first option reliable to randomly sample the 10K rows in my case?
Edit: some additional context. The structure of the data is why random sampling or shuffling of the entire table seems quite important to me. Additional rows are added to the table daily. For example, one of the columns is country, and the incoming rows usually arrive first all from country A, then from country B, etc. For this reason I am worried that the second option might sample too many rows from a single country, rather than randomly. Is that a justified concern?
Related thread that reveals the second option: What is the best query to sample from Impala for a huge database?
I beg to differ, OP. I prefer the second option.
First option: you are assigning a value between 0 and 1 to every row and then picking the first 10000 records. So Impala has to process all rows in the table, and the operation will be slow on a 20-million-row table.
Second option: Impala randomly picks rows from the data files based on the percentage you provide. Since this works on files, the number of rows returned may differ from the percentage you specified. This method is also used to compute statistics in Impala. So performance-wise this is much better, but the quality of the randomness can be a problem.
Final thought -
If you are worried about the randomness and correctness of your sample, go for option 1. But if you are not too worried about randomness and just want sample data with quick performance, pick the second option. Since Impala uses this for COMPUTE STATS, I pick this one :)
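As a rough illustration of what the first option does, here is a minimal Python sketch (not Impala code). It assumes the query behaves as described above: one random key per row, a sort on that key, then the first n rows. `order_by_random_limit` is a hypothetical name.

```python
import random


def order_by_random_limit(rows, n, seed=None):
    """Mimic ORDER BY RANDOM() LIMIT n on an in-memory list of rows."""
    rng = random.Random(seed)
    # Every row gets a key, so the whole table is touched (the slow part).
    keyed = [(rng.random(), row) for row in rows]
    keyed.sort(key=lambda pair: pair[0])
    # Keep only the n rows with the smallest random keys.
    return [row for _, row in keyed[:n]]


rows = list(range(1000))  # stand-in for table rows
sample = order_by_random_limit(rows, 10, seed=0)
print(sample)
```

Each row is equally likely to end up among the first n, regardless of where it sits in the table, which is why this option does not depend on the order in which the rows were loaded.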
EDIT: After looking at your requirement, I have a method to sample over a particular field or fields.
We will use a window function to assign row numbers randomly within each country group, then pick 1% (or whatever percentage you want) of each group.
This makes sure the data is evenly distributed between countries and each country has the same percentage of its rows in the result set.
select * from
(
select
-- random row number within each country
row_number() over (partition by country order by random()) rn,
-- total number of rows in that country
count(*) over (partition by country) cntpartition,
tab.*
from dat.mytable tab
) rs
where rs.rn between 1 and rs.cntpartition * 1/100 -- This is for 1% data
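The same idea can be sketched in Python to show what the query is meant to return. This is only an illustration under assumptions: rows are dictionaries, the grouping column is called `country`, and `sample_per_group` is a hypothetical helper, not part of Impala.

```python
import random
from collections import defaultdict


def sample_per_group(rows, key, percent, seed=None):
    """Keep the same percentage of rows from every group, chosen at random."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sampled = []
    for members in groups.values():
        rng.shuffle(members)                   # random row number within the group
        keep = len(members) * percent // 100   # rn between 1 and count * percent/100
        sampled.extend(members[:keep])
    return sampled


# Hypothetical data: 200 rows for country A, 300 rows for country B.
rows = ([{"country": "A", "id": i} for i in range(200)]
        + [{"country": "B", "id": i} for i in range(300)])
result = sample_per_group(rows, "country", 1, seed=1)
print(result)
```

Note that a group with fewer than 100 rows contributes nothing at 1%, which matches the behaviour of the `between` condition in the query.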
screenshot from my data -
HTH
I am trying to create a graph with two lines, with two filters from the same dimension.
I have a dimension which has 20+ values. I'd like one line to show data based on just one of the selected values and the other line to show a line excluding that same value.
I've tried the following:
-Creating a duplicate/copy dimension and filtering the original one with the first, and the copy with the 2nd. When I do this, the graphic disappears.
-Creating a calculated field that tries to split the measures up. This isn't letting me track the count.
I want this on the same axis; the best I've been able to do is create two sheets, one with the first filter and one with the 2nd, and stack them in a dashboard.
My end user wants the lines in the same visual, otherwise I'd be happy with the dashboard approach. Right now, though, I'd also like to know how to do this.
It is a little hard to tell exactly what you want to achieve, but the problem with filtering is common.
The principle that is important is that Tableau will filter the whole dataset by row. So duplicating the dimension you want to filter won't help as the filter on the original dimension will also filter the corresponding rows in the second dimension. Any solution has to be clever enough to work around this issue.
One solution is to build two new dimensions that use a calculation rather than a filter to create the new result. Let's say you have a dimension, [size] that has a range of numbers from 1 to 10 and you want to compare the total number of rows including and excluding the number 5. You could create a new field using a formula like if [size] <> 5 then 1 else 0 end
Summing the new field will give a count of the number of rows that don't contain a 5 and this can be compared directly to a rowcount of the original [size] field which will give the number including the value 5.
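The calculation above can be mimicked outside Tableau with a short Python sketch. The list of sizes is invented for illustration and `count_excluding` is a hypothetical helper; the point is only that a per-row flag, once summed, gives the "excluding" count while the unfiltered row count stays available.

```python
def count_excluding(sizes, excluded):
    """Equivalent of SUM(if [size] <> excluded then 1 else 0 end)."""
    flags = [1 if size != excluded else 0 for size in sizes]
    return sum(flags)


# Hypothetical [size] values, one per row.
sizes = [1, 5, 5, 3, 10, 7]
including = len(sizes)                 # row count with the value 5 included
excluding = count_excluding(sizes, 5)  # row count with the value 5 excluded
print(including, excluding)
```

Both numbers come from the same unfiltered rows, so they can sit side by side in one view.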
This basic principle can be extended to much more complex logic. The essential point is to realise that filters act on every row in your data and can't, by themselves, show comparisons with alternative filter choices on a single visualisation.
Depending on the nature of your problem there may be other solutions worth looking at including sets and groups but you would need to provide more specific details for users here to tell you whether they would be useful.
We can make a set out of the values of the dimension and then place it on the required shelf. You will then have your dimension, which plots as usual, and the set, which shows the data as required; with a filter alone you don't get that independence to show different data each time.
I have one table with three columns, say c1, c2 and c3. I want to show the grand total for all three columns. I have tried, but the grand total logic only worked for one column and not for all three.
So is there any way to do this?
If all three columns are measures then you should be able to just go to Analysis->Totals->Show Column Grand Totals.
Or are you also trying to count a dimension? Your question is not very clear.
I've got a bar chart with three months worth of data. Each column in the chart is one month's data showing the percentage of Rows that met a certain criterion for that month. In the first month, 100% of 2 rows meet the measure. In the second month, 24.2% of 641 rows meet the measure. In the 3rd month, 28.3% of 1004 rows meet the measure. My reference line which is supposed to show the average across the entire time-frame is showing 50.8%, the simple average (i.e. [100+24.2+28.3]/3) instead of the weighted average (i.e. [100*2+641*24.2+1004*28.3]/[2+641+1004]).
In the rows shelf, I have a measure called "% that meet the criterion", this is defined as SUM([Criterion])/SUM([NUMBER OF RECORDS])
The criterion measure is 1 for any record that qualifies and null for any that do not qualify.
If I go to Analysis >> Totals >> Show Row Grand Totals, a 4th bar is added, and that bar shows the correct weighted average of the other three bars (26.8%), but I really want this to be shown as a reference line instead of having an extra bar on the chart. (Adding the Grand Total bar also drops the reference line down to 44.8%, which is the simple average of the 4 bars now shown on the chart--I can't think of a less useful piece of information than that).
How can I add the weighted average as a reference line?
Instead of using 'Average' as your aggregation, try using 'Total' instead in the Edit Reference Line dialogue window.
I have to say it's a bit counter-intuitive, but this is what the Tableau online help has to say about it:
http://onlinehelp.tableau.com/current/pro/online/mac/en-us/reflines_addlines.html
Total - places a line at the aggregate of all the values in either the cell, pane, or the entire view. This option is particularly useful when computing a weighted average rather than an average of averages. It is also useful when working with a calculation with a custom aggregation. The total is computed using the underlying data and behaves the same as selecting one of the totals option the Analysis menu.
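The difference between the average of averages and the weighted average can be checked with the figures from the question. This is a plain Python sketch of the arithmetic, not Tableau code; the variable names are my own.

```python
# (percentage that met the criterion, number of rows) for each month,
# taken from the question.
months = [(100.0, 2), (24.2, 641), (28.3, 1004)]

# Average of averages: every month counts equally.
simple_average = sum(pct for pct, _ in months) / len(months)

# Weighted average: every row counts equally.
total_rows = sum(rows for _, rows in months)
weighted_average = sum(pct * rows for pct, rows in months) / total_rows

print(round(simple_average, 1))    # 50.8
print(round(weighted_average, 1))  # 26.8
```

The weighted figure matches the 26.8% shown by the Grand Total bar, which is the value the 'Total' aggregation is meant to reproduce.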
If you are using Tableau 9, you can make a second calculated field using an LOD expression:
{ SUM([Criterion]) / SUM([NUMBER OF RECORDS]) }
This will calculate the ratio for the entire data set after applying context and data source filters, without partitioning the data by any of the other dimensions in your view (such as month in your case).
If you place that new field on the detail shelf then you can use it to create a reference line.
There are other ways to generate a weighted average, but this is probably the simplest in your case.