The data I'm working with concern proficiency in math and language arts, and the test I'm conducting is whether grade level influences subject proficiency. Some studies reported a separate test statistic per grade (e.g., p-value for 6th graders = 0.75), while others reported a single statistic spanning multiple grades (e.g., p-value for 6th-8th graders = 0.82). My question is whether I can split up the second kind of data and copy the statistic to each grade in my dataset: in the example, the p-value for 6th graders would be 0.82, the p-value for 7th graders would be 0.82, and the p-value for 8th graders would be 0.82. Would that be statistically sound, or should I exclude these data points from the analysis?
I am using a GLMM to determine differences in soil compaction across 3 locations and 2 seasons in undisturbed and disturbed sites, with location and season as random effects. My teacher says to use the compaction reading divided by its upper bound as the Y value against the different sites (fixed effect). (I was previously doing it the other way around, using disturbed/undisturbed coded as 1/0 as Y against the compaction reading.) The random-effect variances are minimal. I was using both glmer and glmmPQL: glmer to obtain AIC and therefore the best model fit (which cannot be done in glmmPQL), while glmmPQL reports all of the variance components, which glmer does not. The outcomes are very similar when using disturbed/undisturbed as Y (and they match the graphs), but when using the proportion of the compaction reading, only glmmPQL agrees with the graphs; glmer using proportions is totally different. Additionally, my teacher says I need to validate my model choice with a chi-squared value and, if the model is over-dispersed, use a quasi-binomial. But I cannot find any way to do this in glmmPQL, and with glmer showing strange results using proportions as Y, I am unsure if this is correct. I also cannot use a quasi-binomial in either glmer or glmmPQL.
My response was the compaction reading, which is measured from 0 to 6 kg/cm² inclusive. The explanatory variable was Type (disturbed vs. undisturbed soil, split into 4 categories because some sites were artificially disturbed to pull out differences). All compaction readings were divided by 6 to make them a proportion, i.e. a continuous variable bounded by 0 and 1 but containing values of both 0 and 1. (I also tried the reverse: I coded disturbed as 1 and undisturbed as 0, compared these groups separately across all 4 Types, and left the compaction readings as they were.) Using glmer with the code:
model1 <- glmer(comp/6 ~ Type + (1 | Loc/Seas), data = mydata, family = "binomial")
model2 <- glmer(comp/6 ~ Type + (1 | Loc), data = mydata, family = "binomial")
and using glmmPQL:
mod1 <- glmmPQL(comp/6 ~ Type, random = ~1 | Loc, family = binomial, data = mydata)
mod2 <- glmmPQL(comp/6 ~ Type, random = ~1 | Loc/Seas, family = binomial, data = mydata)
I could compare models fitted with glmer but not with glmmPQL; on the other hand, glmmPQL gave me the variance for every random effect plus the residual variance, whereas glmer did not provide the residual variance (so I was left wondering what the total variance was and what proportion of it the random effects were responsible for).
When I used glmer this way, the results were completely different from glmmPQL: there was no significant difference at all in glmer, but a very clear significant difference in glmmPQL. (However, if I do the reverse and code by disturbed/undisturbed, the two packages do give similar results, matching what the graphs suggest, e.g. mod3 <- glmmPQL(Status ~ compaction, random = ~1 | Loc/Seas, family = binomial, data = mydata), where Status is 1 or 0 according to disturbed or undisturbed. But my supervisor said this is not strictly correct, and my supervisor would also like me to provide a chi-squared goodness of fit for the chosen model, so can I only use glmer here?)

Additionally, the random-effect variances are minimal, and model selection in glmer removes them as non-significant (although keeping one in gives a smaller AIC). Removing them (as suggested by the chi-squared test, but not by AIC) and running just a glm is consistent with both the glmmPQL results and what is observed on the graphs.

Sorry if this seems very pedantic, but I am trying to do what is correct for my supervisor and for the species I am researching. I know there are differences: they are seen and observed, and eyeballing both the data and the graphs suggests so. Maybe I should just run the glm? Thank you for answering me. I will find some output to post.
I have a weekly aggregated data set and I split it into 80% train and 20% test.
I am performing a one-step-ahead forecast. However, as the forecast horizon becomes longer, the performance gets really bad. Is that normal?
The first few steps are predicted reasonably well.
This seems like normal behavior. The forecasts of your model are based on the last observations of your time series. Let us say as an example, the forecast for October 1st is based on the values from September. Now, for October 1st you have 30 correct values from September. Depending on how good your model is, the prediction will deviate a little bit from the true value. When you predict October 2nd, you use 29 correct values from September and the previously predicted value for October 1st. This is called dynamic forecasting. The forecasting error from this prediction is now fed into the model, which impacts the quality of the forecast for October 2nd. If you predict October 3rd, you have 28 correct input values and already two with a slight deviation.
The values for November will be based entirely on forecasts, and all of the forecasting errors from October are fed into the model. Your predictions for July are based on predictions that were based on predictions, and so on. Therefore, the predictions get worse and worse the further you forecast into the future, and at some point they tend to converge to a straight line.
For this reason, you usually cannot predict very far into the future. 65 steps ahead is a long horizon, and I suspect that is too much for ARIMA/VAR.
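A small simulation illustrates the effect (a sketch only, not tied to your data): each forecast is fed back in as the input for the next step, so the prediction drifts toward the series mean and flattens out.

% Sketch: dynamic (recursive) forecasting with a simple AR(1) on simulated data.
rng(1);
n   = 200;
phi = 0.8;                                % true AR(1) coefficient
y   = zeros(n,1);
for t = 2:n
    y(t) = phi*y(t-1) + randn;            % simulate y_t = phi*y_{t-1} + noise
end
yTrain = y(1:160);
yTest  = y(161:end);

phiHat = yTrain(1:end-1) \ yTrain(2:end); % least-squares estimate of phi

h    = numel(yTest);                      % dynamic forecast over the test horizon
yHat = zeros(h,1);
last = yTrain(end);
for k = 1:h
    yHat(k) = phiHat*last;                % one-step forecast from the previous forecast
    last    = yHat(k);                    % no new data: feed the forecast back in
end

plot(1:h, yTest, 'k', 1:h, yHat, 'r--');
legend('actual', 'dynamic forecast');     % the forecast decays toward the mean (a flat line)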
I am attempting to predict the ranking of NBA teams next season based on the number of games they won this season. To do this, I thought I could use logistic regression with historical data. My dependent variable is the following season's ranking, and my independent variable is the current season's win total. For example, the Golden State Warriors had 36 wins in 2010-11 and finished the 2011-12 season with the 24th best record in the league, and these two values would serve respectively as my IV and DV.
My ultimate goal is to figure out the odds that a team winning 35 games in the current season will have a top-10 record in the following season. Is logistic regression the best way to handle this problem? If so, should it be ordinal or hierarchical? The MATLAB code I have been using is below:
B = mnrfit(X,Y)
pihat = mnrval(B,35)
where X is the current wins and Y is next season's ranking. I then summed the first 10 values of pihat to get my probability.
Is there a better way to be going about this? Thanks for the help.
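For reference, mnrfit also accepts an ordinal (proportional-odds) specification, which respects the ordering of the rankings; a minimal sketch, assuming X is an n-by-1 vector of current-season wins and Y is the next-season ranking coded 1 (best record) upward:

B = mnrfit(X, Y, 'model', 'ordinal');      % cumulative-logit (proportional-odds) fit
pihat = mnrval(B, 35, 'model', 'ordinal'); % category probabilities for a 35-win team
pTop10 = sum(pihat(1:10));                 % P(next-season rank is in the top 10)

The ordinal specification is usually the more natural choice for ranked outcomes like these.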
Consider a data set in which each value X is identified by the indices X_g_s_d_h:
g = group number, g = 1:5
s = subject number (varies for each g)
d = day number (varies for each s)
h = hour, h = 1:24

so X_1_3_4_12 means the value X for the 12th hour of the 4th day of the 3rd subject of group 1.
First I calculate the mean (hour by hour) over all the days of each subject. Doing that, the index d disappears and each subject is represented by a vector containing 24 values.
X_g_s_h will be the mean over the days of a subject.
Then I calculate the mean (subject by subject) over all the subjects belonging to the same group, resulting in X_g_h. Each group is then represented by one vector of 24 values.
Then I calculate the mean over the hours for each group, resulting in X_g. Each group is now represented by a single value.
I would like to see if the means X_g are significantly different between the groups.
Can you tell me the proper way to do this?
PS: The number of subjects per group differs, and the number of days also differs for each subject. I have more than 2 groups.
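To make the averaging steps concrete, here is a minimal MATLAB sketch; the variable names and the nested cell-array layout are assumptions, chosen because the subject and day counts differ:

% X{g}{s} is assumed to be an (nDays x 24) matrix of hourly values
% for subject s of group g (day and subject counts may differ).
nGroups = numel(X);
Xg = zeros(nGroups, 1);                    % one value per group (X_g)
for g = 1:nGroups
    nSubj = numel(X{g});
    subjMeans = zeros(nSubj, 24);          % X_g_s_h: 24-hour profile per subject
    for s = 1:nSubj
        subjMeans(s,:) = mean(X{g}{s}, 1); % average over days (index d drops out)
    end
    groupProfile = mean(subjMeans, 1);     % X_g_h: average over subjects
    Xg(g) = mean(groupProfile);            % X_g: average over the 24 hours
end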
Thanks
OK, so I am posting an answer to summarize some of the problems you may have.
Same subjects in both groups
Not averaging:
1- First, if we assume that you have only one measure that is repeated every hour for a certain number of days, and that it is independent of which day and hour you pick, then you can reshape your matrix into one column per subject per group and perform a repeated-measures t-test.
2- If you cannot assume that your measure is independent of the hour, but it is independent of the day (let's say the concentration of a drug after administration that completely vanishes before the next day's measurement), then you can run a repeated-measures t-test for each hour (N hours), for a total of N tests.
3- If you cannot assume that your measure is independent of the day, but it is independent of the hour (let's say a measure related to the menstrual cycle, which we will assume is stable within each day but varies between days), then you can run a repeated-measures t-test for each day (M days), for a total of M tests.
4- If you cannot assume that your measure is independent of either the day or the hour, then you can run a repeated-measures t-test for each day and hour, for a total of N×M tests.
Averaging:
In the cases where you cannot assume independence, you can average the dependent variable, thereby removing that variance but also lowering your statistical power and limiting the interpretation.
In case 2, you can average over the hours to get a mean concentration and perform a single repeated-measures t-test. Here you lose the information about how it changed from hour 1 to N, and you only test whether the mean concentration over the tested hours differs between groups.
In case 3, you can average over both hour and day and test, for example, whether the mean estrogen level is higher in one group than in another, again with only one test. You lose the information about how it changed between the different days.
In case 4, you can average over both hour and day, again ending up with only one test. You lose the information about how it changed across the different hours and days.
NOT same subjects in both groups
Paired tests are not possible. Follow the same logic as above but perform unpaired tests.
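In MATLAB terms, the paired/unpaired distinction comes down to ttest versus ttest2. A toy sketch with made-up per-subject averages:

% Made-up per-subject averaged values for two groups of 12 subjects each
a = randn(12,1);
b = randn(12,1) + 0.5;

[hPaired, pPaired] = ttest(a, b);      % same subjects in both conditions: paired t-test
[hUnpaired, pUnpaired] = ttest2(a, b); % different subjects per group: two-sample t-test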
You need to perform a statistical test of the null hypothesis H0 that the data in the different groups come from independent random samples drawn from distributions with equal means. It's better to avoid the sequential 'mean' operations and instead regroup the data by g. If you assume normality and independence of the observations (as pointed out by ASantosRibeiro), then you can perform a t-test (http://www.mathworks.nl/help/stats/ttest2.html):
clear all;
X = randn(6,5,4,3);                      % dummy data in g x s x d x h format
Y = reshape(permute(X,[2 3 4 1]),[],6);  % one column per group (5*4*3 = 60 values each)
h = zeros(6,6);
for i = 1:6
    for j = 1:6
        h(i,j) = ttest2(Y(:,i), Y(:,j)); % pairwise two-sample t-test between groups i and j
    end
end
If you want to take into account the different weights of the observations, you need to calculate the t-value yourself (e.g., see http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_ttest_a0000000126.htm).
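For reference, this is how the basic (unweighted) two-sample t statistic is computed by hand; a weighted version would substitute weighted means and variances, which is beyond this sketch:

x1 = Y(:,1);  x2 = Y(:,2);                                 % two groups from the reshaped data above
n1 = numel(x1);  n2 = numel(x2);
sp2 = ((n1-1)*var(x1) + (n2-1)*var(x2)) / (n1 + n2 - 2);   % pooled variance
tStat = (mean(x1) - mean(x2)) / sqrt(sp2*(1/n1 + 1/n2));   % pooled two-sample t statistic
pVal = 2 * tcdf(-abs(tStat), n1 + n2 - 2);                 % two-sided p-value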
I'm new to MATLAB and am having some trouble with this example.
The Colorado River Drainage Basin covers parts of seven western states. A series of dams has been constructed on the Colorado River and its tributaries to store runoff water and to generate low-cost hydroelectric power. The ability to regulate the flow of water has made the growth of agriculture and population in these arid desert states possible. Even during periods of extended drought, a steady, reliable source of water and electricity has been available to the basin states. Lake Powell is one of these reservoirs; lake_powell.txt contains data on the water level in the reservoir for the eight years 2000 to 2007.
a) Use nested for loops to read one water level value at a time into the lake_powell matrix (see the sketch after this list), using
lake_powell(month,year) = fscanf(fileID, '%f', 1);
Print the lake_powell matrix with a title and year column headings.
b) Use mean to determine the average elevation of the water level for each year and the overall average for the eight-year period over which the data were collected.
c) Use find and length to determine how many months of each year exceed the overall average for the eight-year period.
d) Create a report that lists the month (number) and the year for each of the months that exceed the overall average. For example, June is month 6. Use find.
e) Determine and print the average elevation of the water for each month for the eight-year period. Use mean.
f) Plot the water level values in lake_powell using
date=2000:1/12:2008-1/12;
plot(date,lake_powell(:))
xlabel('Year')
ylabel('Water level, ft')
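For part (a), the nested-loop read might look like this (a sketch assuming lake_powell.txt holds 12 rows of 8 whitespace-separated values, months down the rows and years 2000-2007 across the columns, with no header line):

fileID = fopen('lake_powell.txt', 'r');
lake_powell = zeros(12, 8);              % rows = months, columns = years 2000-2007
for month = 1:12
    for year = 1:8
        lake_powell(month, year) = fscanf(fileID, '%f', 1);  % one value at a time
    end
end
fclose(fileID);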
It sounds like you should be using textscan instead of fscanf.
textscan reads a delimited file line by line, where each line has a consistent format.
Read the documentation for textscan and you should have your solution.
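A minimal textscan sketch, under the same assumption that lake_powell.txt holds 12 rows of 8 whitespace-delimited water levels with no header line (add a 'HeaderLines' option if the file has one); the last few lines show how mean, find, and length cover parts (b), (c), (d) and (e):

fid = fopen('lake_powell.txt', 'r');
C = textscan(fid, repmat('%f', 1, 8));   % 8 numeric columns per line
fclose(fid);
lake_powell = [C{:}];                    % 12-by-8 matrix: rows = months, columns = years

yearlyMean  = mean(lake_powell, 1);      % part (b): average level per year
overallMean = mean(lake_powell(:));      % part (b): overall eight-year average
monthlyMean = mean(lake_powell, 2);      % part (e): average level per month
[mon, yr]   = find(lake_powell > overallMean);   % part (d): month/year pairs above average
nPerYear    = arrayfun(@(y) length(find(lake_powell(:,y) > overallMean)), 1:8);  % part (c)

The column index yr maps to the calendar year as yr + 1999.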