How does GitHub calculate the "contribution level" of a given day?

Alright, so I am trying to rework the GitHub contribution graph feature, and I would like the "level" of contribution to match GitHub's. By level I mean the brightness of the square (if you are in dark mode).
Here in this image you can see a day with a high level and a day with a low level.
For starters, the level is calculated relative to the other days; however, it is unclear how that calculation is done. The GitHub Docs say the level is based on which quartile the day falls into.
I don't know much about statistics, but shouldn't there be an equal number of days in each quartile? Yet if you look at this contribution graph, you can clearly see there are many more days with a lower level than with a higher level.
Is there something I am missing? Am I wrong about quartiles? Any help is appreciated.
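For what it's worth, here is a minimal Python sketch of one plausible reading of the docs: zero-contribution days always get level 0, and the quartile thresholds are computed over the nonzero days only. The exact rules (and the nearest-rank quartile choice below) are assumptions, not GitHub's published algorithm, but they reproduce the effect described above: a zero-heavy year has far more low-level squares than high-level ones.

```python
def contribution_levels(counts):
    """Map daily contribution counts to levels 0-4 (0 = empty square)."""
    nonzero = sorted(c for c in counts if c > 0)
    if not nonzero:
        return [0] * len(counts)
    # Quartile thresholds over the nonzero days only (nearest-rank style).
    q = [nonzero[min(len(nonzero) - 1, (len(nonzero) * k) // 4)] for k in (1, 2, 3)]

    def level(c):
        if c == 0:
            return 0
        return 1 + sum(c > t for t in q)  # 1..4 depending on thresholds passed

    return [level(c) for c in counts]

levels = contribution_levels([0, 0, 1, 2, 0, 5, 12])
# Most days land at level 0 or 1 even though the quartiles split the nonzero
# days evenly -- which matches the lopsided look of a real contribution graph.
```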

Related

How are pace adjusted stats calculated?

Does anyone know how NBA.com calculates pace adjusted stats? When pulling data, there is a pace_adjust option -- I'm wondering how that differs from non pace adjusted. Conceptually I understand what it means, just wondering how they account for it. Thanks!
Pace adjusting is as simple as normalization. The rationale behind it is quite simple: To fairly compare two NBA teams, we have to normalize the number of game opportunities that they generate against common ground. Otherwise, it would be impossible to properly correlate game statistics between them. For example, that would be the case if you'd want to compare statistics coming from a fast-paced team like the Los Angeles Lakers (3rd highest pace in 2021/22 at 100.36) and a slow-paced team like the New York Knicks (bottom last pace in 2021/22 at a mere 95.11).
Formally, if M is a generic NBA player/team's metric, then its pace-adjusted value M_adj would be:
s = pace_lg / pace_tm
M_adj = s*M
where pace_lg and pace_tm are the league's and the team's pace, respectively. To calculate the league's pace (LP), we simply have to average the number of possessions of all NBA teams and adjust that for a full game (or 48 minutes). Instead, to calculate a team's pace (TP), we follow a slightly different formulation: We average the number of possessions of the team with their opponent's, and only then adjust for 48 minutes. Why? Because LP can be interpreted as a census of all possessions, whereas TP is a sample from the population of all possessions.
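As a worked example (the league pace below is invented for illustration; the two team paces are the 2021/22 figures quoted above), applying s = pace_lg / pace_tm deflates a fast team's raw numbers and inflates a slow team's:

```python
pace_lg = 98.2        # hypothetical league-average pace, for illustration only
pace_lakers = 100.36  # fast-paced team (2021/22 figure quoted above)
pace_knicks = 95.11   # slow-paced team (2021/22 figure quoted above)

def pace_adjust(metric, pace_tm, pace_lg):
    """M_adj = (pace_lg / pace_tm) * M"""
    return (pace_lg / pace_tm) * metric

# The same raw 110 points per game looks different once pace-adjusted:
lakers_adj = pace_adjust(110.0, pace_lakers, pace_lg)  # deflated (fast team)
knicks_adj = pace_adjust(110.0, pace_knicks, pace_lg)  # inflated (slow team)
```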
For practical use of pace adjusting, you can check out my breakdown of the player efficiency rating (PER).
P.S.: When I say "we" I refer to ESPN's J. Hollinger formulation of pace adjusting in the NBA. Different organizations or sports analytics services may slightly alter its computation.

Tableau Summing up aggregated data with FIXED

Data granularity is per customer, per invoice date, per product type.
Generally the idea is simple:
We have a moving-average calculation of the volume per week, based on the last 12 weeks (MA Volume):
WINDOW_SUM(SUM([Volume]), -11, 0) / WINDOW_COUNT(COUNT([Volume]), -11, 0)
We need to see the deviation of the current week vs the MA for that week (Vol DIFF):
SUM([Volume]) - [MA Volume]
We need to sum up the deviations for a fixed period of time (Year/Month)
Basically this should show us whether on average, for a given period of time, we deviate positively or negatively vs the base.
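The intended logic, sketched in plain Python with made-up weekly volumes (the numbers and the two-week period are hypothetical; this is just the arithmetic Tableau is being asked to do):

```python
def moving_average(values, window=12):
    """Trailing moving average; shorter at the start, like WINDOW_SUM(-11, 0)."""
    return [sum(values[max(0, i - window + 1):i + 1]) /
            len(values[max(0, i - window + 1):i + 1])
            for i in range(len(values))]

weekly_volume = [10, 12, 8, 14, 11, 9, 13, 10, 12, 15, 11, 10, 20, 5]
ma = moving_average(weekly_volume)                       # MA Volume
deviations = [v - m for v, m in zip(weekly_volume, ma)]  # Vol DIFF per week

# Sum of the deviations over a period (here, the last two weeks):
period_dev = sum(deviations[-2:])
```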
Unfortunately I get errors like:
"Argument to SUM (an aggregate function) is already an aggregation, and cannot be further aggregated."
Or
"Level of detail expressions cannot contain table calculations or the ATTR function"
Any ideas how I can go around this one?
Managed to solve this one. Needed to add months to the view and then just use WINDOW_SUM([Vol DIFF]).
Simple as that!

prediction and time series

How do I decide how far in advance my prediction is?
I am following the Featuretools churn tutorial: https://github.com/Featuretools/predict-customer-churn
What I don't quite understand is how it decided that the prediction is for one month in advance. In previous churn examples I tried, I just took aggregated data (it could be historical data spanning years or months), built a churn model, and predicted, but I don't know whether my prediction is for a month, a year, or some number of days in advance. How is that decided?
Does it depend on the period of aggregation, or on the data I didn't use? I know the cutoff time is the time at which I want to make the prediction, but how do I tell the system I want to predict two months in advance? Do I just disregard the data for the last two months by setting the cutoff time, provide the label from two months after it, and say that my model, based on the features I get, makes a two-month-ahead prediction?
For example: the cutoff date is 1/8/2010 and the label is the customer's state on 1/10/2010.
Is the two-month gap then the advance of the prediction, given that I used all historical data prior to the cutoff time?
This might be a time series problem turned into a simple classification, but I am not sure!
You pick the amount of time in advance (called "lead time") using your domain expertise. Depending on the real-world application, the lead time might be longer or shorter. Sometimes you might even build multiple models with different lead times to apply in different situations.
You control the lead time by moving the cutoff earlier with respect to the time the label became known. So, the example you give looks correct.
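A minimal sketch of that relationship in plain Python (Featuretools itself takes a cutoff-time dataframe; the helper below is hypothetical and only shows the date arithmetic): the cutoff sits the lead time before the date the label is observed, and features may only use data up to the cutoff.

```python
from datetime import date

def cutoff_for_label(label_date: date, lead_months: int) -> date:
    """Move the cutoff `lead_months` earlier than the label's observation date."""
    month = label_date.month - lead_months
    year = label_date.year
    while month < 1:
        month += 12
        year -= 1
    return date(year, month, label_date.day)

# Label observed on 1 Oct 2010, predicted with a two-month lead time:
cutoff = cutoff_for_label(date(2010, 10, 1), lead_months=2)
# Features are computed only from data up to and including the cutoff.
```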

Showing values for overall dataset as well as subset

I have a dataset that contains various wait-time metrics for all appointments in a practice for a year (check-in to call-back, call-back to check-out, etc). It contains appt time (one of about 40 15 minute slots), provider, various wait times.
I can get Tableau to show me, for each 15 minute slot, the average wait times for each provider in the practice.
What I can't seem to do is also display the overall average for the practice for that given time slot, so as to compare each provider against the "office standard".
I'm super new to trying out Tableau, so I am sure it is something very simple.
Thanks in advance.
Use a level-of-detail (LOD) calculated field. An LOD calculation occurs at whatever aggregation level you specify, rather than what's on the row or column shelf.
You didn't provide any info about your data set so I will use made up names here.
This gives you the overall average wait time, regardless of other dimensions on row/column shelves:
{FIXED : avg([wait time])}
This gives you the overall average wait time per provider, regardless of other dimensions on row/column shelves:
{FIXED [Provider Name] : avg([wait time])}
See the online Tableau help at https://onlinehelp.tableau.com/current/pro/desktop/en-us/calculations_calculatedfields_lod_overview.html for more information. Note that FIXED calculations are evaluated before regular dimension filters; if you need your filters to apply to the LOD result, add them to context, or look at the INCLUDE and EXCLUDE keywords.
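If it helps to see the idea outside Tableau, here is the same comparison in plain Python with made-up provider names and wait times: one overall average (what the first FIXED calc returns for every row) next to per-provider averages (the second calc):

```python
# Hypothetical (provider, wait-minutes) rows for a single 15-minute slot.
waits = [
    ("Dr. A", 12), ("Dr. A", 18), ("Dr. B", 25), ("Dr. B", 15), ("Dr. C", 20),
]

# Like {FIXED : avg([wait time])} -- one number for the whole practice.
overall_avg = sum(w for _, w in waits) / len(waits)

# Like {FIXED [Provider Name] : avg([wait time])} -- one number per provider.
per_provider = {}
for name, w in waits:
    per_provider.setdefault(name, []).append(w)
provider_avg = {n: sum(v) / len(v) for n, v in per_provider.items()}
```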

Ways of representing frequency of updates as a graph?

I want to create a graph representing the frequency of updates to a site (for example, how often I have posted to my blog over the past 5 years). One obvious way to do this is to plot "number of entries posted per month" for the past 60 months, but this feels unsatisfying. Should I be looking at using something like a rolling average instead? What are good visualisation techniques for displaying this kind of data?
Some ideas:
rolling average
instead of plotting the number of posts plot the time (or rolling mean time) between posts
if it's something that may have an annual / seasonal component, try plotting it on a circular / spiral plot where r is the datum and theta is the month, day of month, day of year, or whatever scaled appropriately. Some interesting things to plot for r might be
number of posts in an interval
cumulative posts (giving you a spiral)
length of post (giving each theta its own exact value, not aggregating)
you might also want to look at a scatter plot, with something like length of post on x, time since prior post + time to next post on y, and the age of the datum as the size, gray level or color of the point (fading out the oldest ones)
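The second idea above, sketched in Python with made-up post dates: instead of counting posts per month, compute the gap between consecutive posts and plot that (or a rolling mean of it):

```python
from datetime import date

# Hypothetical post dates; in practice, load these from your blog's archive.
post_dates = [date(2023, 1, 3), date(2023, 1, 10), date(2023, 2, 1), date(2023, 2, 3)]

# Days between consecutive posts -- the series to plot on the y axis.
gaps_days = [(b - a).days for a, b in zip(post_dates, post_dates[1:])]
```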
I think the radar chart (or circular area chart) would be a good bet. Take a look at the excellent Choosing a good chart post and PDF chart-selecting-guide over at the Extreme Presentation(tm) Method site.