significant differences between means - matlab

Considering the picture below, each value X can be identified by the indices X_g_s_d_h:
g = group g=[1:5]
s = subject number (variable for each g)
d = day number (variable for each s)
h = hour h=[1:24]
so X_1_3_4_12 means that the value X refers to the
12th hour
of 4th day
of 3rd subject
of group 1
First I calculate the mean (hour by hour) over all the days of each subject. Doing that, the index d disappears and each subject is represented by a vector of 24 values.
X_g_s_h will be the mean over the days of a subject.
Then I calculate the mean (subject by subject) over all the subjects belonging to the same group, resulting in X_g_h. Each group is then represented by one vector of 24 values.
Then I calculate the mean over the hours for each group, resulting in X_g. Each group is now represented by a single value.
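For concreteness, here is a minimal MATLAB sketch of this averaging chain. It assumes the data of subject s in group g is stored in a cell array X{g}{s} of size [days x 24]; the cell-array layout is my own assumption, since the number of days and subjects varies:
nGroups = numel(X);
X_g = zeros(nGroups,1);
for g = 1:nGroups
    subjMeans = zeros(numel(X{g}), 24);
    for s = 1:numel(X{g})
        subjMeans(s,:) = mean(X{g}{s}, 1); % mean over days     -> X_g_s_h (1 x 24)
    end
    groupCurve = mean(subjMeans, 1);       % mean over subjects -> X_g_h   (1 x 24)
    X_g(g) = mean(groupCurve);             % mean over hours    -> X_g     (scalar)
end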
I would like to see if the means X_g are significantly different between the groups.
Can you tell me what is the proper way?
PS
The number of subjects differs between groups, and the number of days differs between subjects. I have more than 2 groups.
Thanks

Ok so I am posting an answer to summarize some of the problems you may have.
Same subjects in both groups
Not averaging:
1 - First, if we assume that you have only one measure that is repeated every hour for a certain number of days, and that it is independent of which day and which hour you pick, then you can reshape your matrix into one column per subject, per group, and perform a repeated-measures t-test.
2 - If you cannot assume that your measure is independent of the hour, but it is independent of the day (let's say the concentration of a drug after administration that completely vanishes before the next day's measurement), then you can perform a repeated-measures t-test for each hour (N hours), for a total of N tests.
3 - If you cannot assume that your measure is independent of the day, but it is independent of the hour (let's say a measure of the menstrual cycle, which we will assume is stable within each day but varies between days), then you can perform a repeated-measures t-test for each day (M days), for a total of M tests.
4 - If you cannot assume that your measure is independent of either the day or the hour, then you can perform a repeated-measures t-test for each day and hour, for a total of N×M tests.
Averaging:
In the cases where you cannot assume independence, you can average over the dependent variables, thereby removing that variance but also lowering your statistical power and limiting the interpretation.
In case 2, you can average over the hours to get a mean concentration and perform a single repeated-measures t-test. Here you lose the information on how the measure changed from hour 1 to N, and only test whether the mean concentration within the tested hours differs between groups.
In case 3, you can average over both hour and day and test, for example, whether the mean estrogen is higher in one group than in another, again with only one test. Again, you lose the information on how it changed between the different days.
In case 4, you can average over both hour and day, again ending up with only one test. You lose the information on how it changed between the different hours and days.
NOT same subjects in both groups
Paired tests are not possible. Follow the same logic as above but perform an unpaired test.
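For illustration, a minimal sketch of the "average, then unpaired test" route (case 4) for two of the groups, assuming, as in the sketch in the question above, that X{g}{s} holds the [days x 24] matrix of subject s in group g:
m1 = cellfun(@(subj) mean(subj(:)), X{1}); % one mean value per subject in group 1
m2 = cellfun(@(subj) mean(subj(:)), X{2}); % one mean value per subject in group 2
[h, p] = ttest2(m1, m2);                   % unpaired two-sample t-test between the two groups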

You need to perform a statistical test of the null hypothesis H0 that the data in the different groups come from independent random samples drawn from distributions with equal means. It's better to avoid the sequential 'mean' operations and just regroup the data by g. If you assume normality and independence of the observations (as pointed out by ASantosRibeiro below), then you can perform a t-test (http://www.mathworks.nl/help/stats/ttest2.html):
clear all;
X = randn(6,5,4,3);   % dummy data in g_s_d_h format (6 groups)
Y = reshape(X,6,[])'; % one column per group (column-major order keeps the g index together)
h = zeros(6,6);
for i = 1:6
    for j = 1:6
        h(i,j) = ttest2(Y(:,i),Y(:,j)); % pairwise two-sample t-tests between groups
    end
end
If you want to take into account different weights for the observations, you need to calculate the t-value yourself (e.g., see http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_ttest_a0000000126.htm).
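As a starting point, a minimal sketch of computing the two-sample t-value by hand in MATLAB (Welch/Satterthwaite form; the weighted variants described in the SAS link replace the plain means and variances below with their weighted counterparts):
x = Y(:,1); y = Y(:,2);             % two groups from the reshaped data above
nx = numel(x); ny = numel(y);
vx = var(x);   vy = var(y);
t  = (mean(x) - mean(y)) / sqrt(vx/nx + vy/ny);
df = (vx/nx + vy/ny)^2 / ((vx/nx)^2/(nx-1) + (vy/ny)^2/(ny-1)); % Satterthwaite degrees of freedom
p  = 2 * tcdf(-abs(t), df);         % two-sided p-value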


Tableau Weighted Average of Last Value in Date Group over Running Sum Across Extra Level of Detail not in Report

I am an absolute Tableau beginner, so forgive my lack of proper terminology.
Context
To give some context to the problem, think of the dataset as the balances and current interest rates of two different loans for which we are trying to calculate a weighted average cost of funds at any point in time, while retaining the ability to filter on Program (specific loan).
I have a single dataset that looks like:
The Balance field is used as a running sum, i.e. to get the actual balance as of 4/30/2022, you would sum the column across all Date values on or before 4/30/2022.
The Rate field is the opposite: it represents the discrete interest rate as of the Date. Thus, it cannot be summed.
Each data point is specific to a specific loan, or Program.
So to get the interest rate of Program A as of 4/30/2022, you would simply grab the Rate value of the row where Date = 4/30/2022 and Program = A, or 5.30%. Sums are fine here, since the value of Rate is never repeated for a single Program and Date combo, but we cannot use a running sum.
On the other hand, to get the balance of Program A as of 4/30/2022, you would need to add (running sum) the Balance values for all rows where Date <= 4/30/2022 and Program = A, or 10,000 + -2500 + -2500 + -2500 = 2500.
Problem / Need
I need a report (or whatever it's called in Tableau) with the following:
Date as a column
Measures as rows
This report would NOT include Program as a row or column, but would include it as a filter.
In this report, I need a Weighted Average Cost of Funds measure.
This is effectively the weighted average Rate over/weighted by the running sum of Balance across Programs included in the filter, of course for any given Date in the columns.
In other words, by Date: the latest Rate for each Program times that Program's running sum of Balance, divided by the running sum of all Balances for all Programs included in the filter.
Here's an example in Excel:
Here's an example if we were to exclude Program A:
And here's an example if we were to exclude Program B:
Finally, here's the formulas underneath everything in the Excel example:

Cyclic transformation of dates

I would like to use the day of the year in a machine learning model. As the day of the year is not continuous (day 365 of 2019 is followed by day 1 of 2020), I am thinking of performing a cyclic (sine or cosine) transformation, following this link.
However, within each year the transformed variable does not have unique values; for example, there are two values of 0.5 in the same year, see the figures below.
I need to be able to use the day of the year both in model training and in prediction. A value of 0.5 in the sine transformation can correspond to either 31.01.2019 or 31.05.2019, so using the value 0.5 can be confusing for the model.
Is it possible to make the model differentiate between the two values of 0.5 within the same year?
I am modelling the distribution of a species using the Maxent software. The species data is continuous, with daily records over 20 years. I need the model to capture the signal of the day or the season without using either of them explicitly as a categorical variable.
Thanks
EDIT1
Based on furcifer's comment below. However, I find the incremental modelling approach not useful for my application. It solves the issue of a consistent difference between subsequent days, e.g. 30.12.2018, 31.12.2018, and 01.01.2019, but it is no different from counting the number of days from a certain reference day (weight = 1). Having much higher values on the same date in 2019 than in 2014 does not make ecological sense. I hope that interannual changes will be captured by the daily environmental conditions used as explanatory variables. The reason I need to use the day in the model is to capture the seasonal trend of the distribution of a migratory species, without the explicit use of month or season as a categorical variable. To predict suitable habitats for today, I need this prediction to depend not only on the environmental conditions of today but also on the day of the year.
This is a common problem, but I'm not sure if there is a perfect solution. One thing I would note is that there are two things that you might want to model with your date variable:
Seasonal effects
Season-independent trends and autocorrelation
For seasonal effects, the cyclic transformation is sometimes used for linear models, but I don't see the point for ML models - with enough data, you would expect a nice connection at the edges, so what's the problem? I think the posts you link to are a distraction, or at least they do not properly explain why and when a cyclic transformation is useful. I would just use dYear to model the seasonal effect.
However, the discontinuity might be a problem for modelling trends / autocorrelation / variation in the time series that is not seasonal, or that is common between years. For that reason, I would add an absolute date to the model, i.e. use
y = dYear + dAbsolute + otherPredictors
A well-tuned ML model should be able to do the rest, with the usual caveats, provided you have enough data.
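As an illustration, a minimal MATLAB sketch of building these two date features, assuming a datetime column vector dt of observation dates (the variable names are mine):
dYear     = day(dt, 'dayofyear');   % day of year, 1..366: seasonal predictor
dAbsolute = days(dt - min(dt));     % days since the first observation: trend predictor
X = [dYear, dAbsolute];             % append the other predictors to this feature matrix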
This may not be the right choice depending on your needs; there are two approaches that come to my mind.
Incremental modeling
In this case, the dates are modeled in a linear fashion, so, say, 12 Dec 2018 < 12 Dec 2019.
For this you just need some form of transformation function that converts dates to numeric values.
As there are many dates that need to be converted to a numeric representation, the first thing to make sure of is that the output list preserves the same order, as Lukas mentioned. The easiest way to do this is by adding a weight to each unit (weight_year > weight_month > weight_day).
def date2num(date_time):
    d, m, y = date_time.split('-')
    # the year weight must dominate the month weight, and the month weight the day
    # weight, so that the numeric values sort in true date order (YYYYMMDD-style)
    num = int(y)*10000 + int(m)*100 + int(d)
    return num
Now, it's important to normalize the numeric values.
import numpy as np
date_features = []
for d in list(df['date_time']):
    date_features.append(date2num(d))
date_features = np.array(date_features)
date_features_normalized = (date_features - np.min(date_features))/(np.max(date_features) - np.min(date_features))
Using the day, month, year as separate features
Instead of considering the date as a whole, we split it into its components. The motivation is that there may be a relation between the output and a specific day, month, etc.; for example, maybe the output suddenly increases in the summer season (specific months) or on weekends (specific days).

Matlab average number of customers during a single day

I'm having problems creating a graph of the average number of people inside a 24h shopping complex. I have two columns of data in a spreadsheet: the time a customer comes in (intime) and the time they leave (outtime). The data spans a couple of years and is in datetime format (dd-mm-yyyy hh:mm:ss).
I want to make a graph of the data with time of day as x-axis, and average number of people as y-axis. So the graph would display the average number of people inside during the day.
Problems arise because the place is open 24h and the data spans years. Also, a customer's intime and outtime might be on different days.
Example:
intime 2.1.2017 21:50
outtime 3.1.2017 8:31
Any idea how to display the data easily using Matlab?
Been on this for multiple hours without any progress...
Seems like you need to decide what defines a customer as being in the shop during a day: is 1 minute enough? Is there a minimum length of stay below which you don't want to count it as a visit?
In the former case you shouldn't be concerned with the hours at all, and can just count it as 1 entry if the entry and exit are on the same day, or as 2 separate entries if not (see the sketch below).
It's been a couple of years since I coded actively in Matlab and I don't have a handy IDE, but if you add the code you have so far, I can fix it for you.
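A minimal, untested sketch of that day-level counting, assuming intime and outtime are datetime vectors (the variable names are illustrative):
inDay  = dateshift(intime,  'start', 'day');   % calendar day of each entry
outDay = dateshift(outtime, 'start', 'day');   % calendar day of each exit
visitDays = [inDay; outDay(outDay ~= inDay)];  % an exit on a later day counts as a second entry
[uDays, ~, idx] = unique(visitDays);
visitsPerDay = accumarray(idx, 1);             % number of visits per calendar day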
I think you need to start by just plotting the raw count of people in the complex at given times. Once that is visualized, it may help you decide how you want to define "average people per day" and how to go about calculating it. Does that mean the average at a given time, or the total number of "ins" per day? E.g., 100 people enter the complex in a day, but on average there are only 5 in the complex at any given time. Which stat is more important? Maybe you want both.
Here is an example of how to get the raw plot of the number of people present at any given time. I simulated your in and out times with random numbers.
inTime = cumsum(rand(100,1)); %They show up randomly
outTime = inTime + rand(100,1) + 0.25; % Stay for 0.25 to 1.25 hrs
inCount = ones(size(inTime)); %Add one for each entry
outCount = ones(size(outTime))*-1; %Subtract one for each exit.
allTime = [inTime; outTime]; %Stick them together.
allCount = [inCount; outCount];
[allTime, idx] = sort(allTime);%Sort the timestamps
allCount = allCount(idx); %Sort counts by the timestamps
allCount = cumsum(allCount); %total at any given time.
plot(allTime,allCount); % plot the occupancy over time
Note that the x-values are not uniformly spaced.
If you decide you are more interested in total customers per day, then you could just find the inTimes within a given time range (each day) and probably ignore the outTimes altogether.
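To get the average-over-the-day profile asked about in the question, one option is to sample the occupancy on a regular grid and then average by hour of day. A minimal sketch, assuming intime and outtime are datetime vectors (the one-hour bin size and the variable names are my own choices):
sampleTimes = (dateshift(min(intime),'start','day') : hours(1) : dateshift(max(outtime),'end','day'))';
occ = zeros(size(sampleTimes));
for k = 1:numel(sampleTimes)
    occ(k) = sum(intime <= sampleTimes(k) & outtime > sampleTimes(k)); % people inside at this instant
end
avgOcc = accumarray(hour(sampleTimes) + 1, occ, [24 1], @mean); % average occupancy per hour of day
plot(0:23, avgOcc); xlabel('hour of day'); ylabel('average number of people inside');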

Calculating IV60, and IV90 on interactive brokers

I am trading options, but I need to calculate the historical implied volatility over the last year. I am using Interactive Brokers' TWS. Unfortunately, they only calculate V30 (the implied volatility of the stock using options that expire in 30 days). I need to calculate the implied volatility of the stock using options that expire in 60 days and in 90 days.
The problem: calculate the implied volatility of an individual stock for at least a whole year, using options that expire in 60 days and in 90 days, given that:
TWS does not provide V60 or V90.
TWS does not provide historical pricing data for individual options for more than 3 months.
The attempted solution:
Use the V30 that TWS provides to come up with V60 and V90, given the fact that option prices usually behave like a skew (horizontal skew). However, the problem with this attempted solution is that the skew does not always have a positive slope, so I can't come up with a mathematical solution that always correctly estimates IV60 and IV90, since the slope can be positive or negative as in the picture below.
Any ideas?
Your question is either confusing or isn't about programming. This is what IB says.
The IB 30-day volatility is the at-market volatility estimated for a maturity thirty calendar days forward of the current trading day, and is based on option prices from two consecutive expiration months.
It makes no sense to me, and I can't even get those ticks to arrive (generic tick type 24). But even if you get them, they don't seem to be useful. My guess is that it's an average meant to estimate what the IV would be for an option expiring exactly 30 days in the future. I can't imagine the purpose of this; the data would be impossible to trade on and doesn't represent reality. Imagine an earnings report at 29 or 31 days!
If you'd like the IV about 60 or 90 days in the future, call reqMktData with an option contract that expires around then and an empty generic tick list. You will get tick types 10, 11, 12, and 13, which all carry an IV. That's how you build the IV surface. If you'd like to build it with a weighted average to estimate 60 days, that's possible.
This is Python, but it should be self-explanatory:
tickerId = 1
optCont = Contract()
optCont.m_localSymbol = "AAPL 170120C00130000"
optCont.m_exchange = "SMART"
optCont.m_currency = "USD"
optCont.m_secType = "OPT"
tws.reqMktData(tickerId, optCont, "", False)
Then I get data like
<tickOptionComputation tickerId=1, field=10, impliedVol=0.20363398519176756, delta=0.0186015418248492, optPrice=0.03999999910593033, pvDividend=0.0, gamma=0.007611155331932943, vega=0.012855970569816431, theta=-0.005936076573849303, undPrice=116.735001>
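For the weighted-average idea, here is a rough sketch (in MATLAB, with made-up numbers) of interpolating a 60-day IV from two expirations that bracket it. Linear interpolation in total variance is one common convention; the expiries and vols below are purely illustrative:
T1 = 45;  T2 = 73;        % days to the two bracketing expirations (hypothetical)
iv1 = 0.24; iv2 = 0.27;   % their implied vols, e.g. from tick types 10-13 (hypothetical)
Ttarget = 60;
w1 = iv1^2 * T1;  w2 = iv2^2 * T2;                              % total variances
wTarget = (w1*(T2 - Ttarget) + w2*(Ttarget - T1)) / (T2 - T1);  % linear in total variance
iv60 = sqrt(wTarget / Ttarget);                                 % estimated 60-day IV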
If there's something I'm missing about options, you should ask this at https://quant.stackexchange.com/

Clustering a sequence with time stamps (a time series data of two events)

I have been exploring different options for clustering time series data of the following type:
two different events - say 1,2
events time(nanos)
1 1e3
1 6e3
1 8e3
2 12e3
1 54e3
1 58e3
1 62e3
1 67e3
1 70e3
1 75e3
2 103e3
2 108e3
2 114e3
etc etc
i.e., the times are stochastic (exponentially distributed) and either event 1 or event 2 is recorded. The recordings are in nanoseconds. The data set is large, going up to 15-20 minutes, with millions of points.
The events are correlated, so a bunch of 2s or 1s can occur together. For example, there will be small pieces (1 millisecond long, containing 100-200 events of both types). In some cases there will be a series of just one event type, which needs to be discarded.
Most of the time, just single or a few events are recorded, and this is just noise (>80% of the data).
This is clearly a time series data, with event type information.
I would like to apply a clustering methodology to identify the meaningful small pieces. I'm using Matlab and have looked into options such as DBSCAN, k-means (not useful since I don't know the number of clusters a priori), etc.
(The recording times themselves could be taken as a 'distance', since these are sequential chunks; i.e., dist(x1,x2) = abs(x2(2) - x1(2)) if x is (event, time).)
Also, a meaningful sequence of events happening at, say, time = 10.2 to 10.23 seconds has no relationship to any other piece; i.e., the clustering is done only to "identify" the short pieces (expected to be a few 10000s out of the whole dataset).
Any help would be appreciated ! Thanks.
What about taking the difference between time points and determining, either empirically or statistically, a threshold below which the events are "connected"?
dtimes = diff(nanotimes);
THRESH = 100;      % completely made up - will depend on your data
TIMETHRESH = 1000; % minimum cluster duration - also made up, tune to your data
current_cluster = 1;
assign_clusters = zeros(size(nanotimes));
assign_clusters(1) = current_cluster;
% start a new cluster whenever the gap to the previous event exceeds THRESH
for v = 1:length(dtimes)
    if dtimes(v) > THRESH
        current_cluster = current_cluster + 1;
    end
    assign_clusters(1+v) = current_cluster;
end
% discard clusters that contain only one event type or that are too short
for v = 1:current_cluster
    indices = find(assign_clusters == v);
    if ~any(events(indices) == 1) || ...
            all(events(indices) == 1) || ...
            (nanotimes(indices(end)) - nanotimes(indices(1)) < TIMETHRESH)
        assign_clusters(indices) = -1;
    end
end
You probably are looking in the wrong domain.
Cluster analysis is meant for multidimensional data, but you have just one true dimension, time.
You really should look at classic statistical methods for time series, such as kernel density estimation, natural breaks optimization, and the like.
For example, you could estimate the density of event 1 and event 2 using a kernel density estimator, then split the data set whenever the density of one event type becomes higher than the other's by a certain threshold. It's actually quite straightforward once you compute the KDE curves.
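A minimal MATLAB sketch of that idea, assuming the Statistics Toolbox's ksdensity is available; the bandwidth, grid size, and dominance threshold are made-up values that would need tuning:
t1 = nanotimes(events == 1);                              % event-1 times
t2 = nanotimes(events == 2);                              % event-2 times
tgrid = linspace(min(nanotimes), max(nanotimes), 1e5);
f1 = ksdensity(t1, tgrid, 'Bandwidth', 1e3) * numel(t1);  % approximate event-1 rate
f2 = ksdensity(t2, tgrid, 'Bandwidth', 1e3) * numel(t2);  % approximate event-2 rate
thr = 0.5;                                                % dominance threshold (made up)
dominant = (f1 > (1 + thr)*f2) - (f2 > (1 + thr)*f1);     % +1, -1, or 0 per grid point
splitPoints = tgrid(diff(dominant) ~= 0);                 % boundaries where the dominant type changes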