Paired samples t-test or two-sample (heteroscedastic) t-test?

My dataset consists of many patients who underwent a treatment, and I want to know whether there is a difference between before and after the treatment. At first sight this clearly calls for a paired samples t-test. The problem is that I have a different number of historical observations for each patient. Say I have:
patient A with 150 observations before and 350 observations after,
patient B with 50 observations before and 300 observations after...
Which test should I use? The one for paired samples (even though n_before ≠ n_after)?
A more general Welch's t-test (the groups are heteroscedastic)?
Is there anything I am missing or doing wrong?
In the paired samples case, would I need to undersample so that the lengths match?
Thanks to everyone
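One common way to keep the paired design despite unequal observation counts is to reduce each patient to one summary value per period (for example the mean) and run the paired test on those per-patient values. The sketch below is only an illustration under that assumption: the data are simulated, and the names `before`, `after`, `mBefore`, and `mAfter` are hypothetical. It computes the paired t statistic by hand; if the Statistics and Machine Learning Toolbox is available, `ttest(mAfter, mBefore)` should also return a p-value.

```matlab
% Hypothetical data: one cell per patient, with unequal numbers of
% observations before and after treatment (simulated, not real data).
rng(0);
before = {randn(150,1) + 10, randn(50,1) + 12,    randn(80,1) + 11};
after  = {randn(350,1) + 9,  randn(300,1) + 11.5, randn(120,1) + 10.2};

% Collapse each patient to a single mean per period, so every patient
% contributes exactly one before/after pair.
mBefore = cellfun(@mean, before);
mAfter  = cellfun(@mean, after);

% Paired t statistic on the per-patient differences.
d     = mAfter - mBefore;
n     = numel(d);                       % number of patients (pairs)
tStat = mean(d) / (std(d) / sqrt(n));   % paired t statistic
df    = n - 1;                          % degrees of freedom

fprintf('t = %.3f with %d degrees of freedom\n', tStat, df);
```

With this approach the unit of analysis is the patient, so no undersampling is needed. The trade-off is that the differing precision of each patient's mean is ignored.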

Related

PERMANOVA - small and unequal sample sizes

I am comparing fish communities at 2 sites (upstream vs downstream) with data collected in two seasons (wet and dry) over several years (2017-2022), with data from the first pair of wet and dry seasons representing the period before a treatment and subsequent data representing periods after the treatment. During each season I sampled each site four times, and recorded abundance of each fish species from each site. I did not conduct sampling during the last dry season due to resource constraints. The community compositions of the two sites over different seasons and periods are visualised in the NMDS biplots.
I am trying to do further analysis using PERMANOVA to look for any spatial-temporal changes in the fish communities, mainly whether the two groups are becoming more similar in the years following the treatment. As there are samples with no fish recorded, I have to remove those samples from the dataset, which means I have only three instead of four replicates in some of the site x season x year groups.
My question is, does it still make sense to use PERMANOVA if I have unequal sample sizes among groups, given that the number of replicates in each group is small (3-4)? I am planning to run the test separately for wet and dry seasons, but that means I still need to do a two-way (site x year) PERMANOVA for each of the seasons.
I learned from some of the online discussions that unequal sample size would be a problem for two (or more)-way PERMANOVAs and the problem of unequal sample size would be more prominent when the sample sizes are small. Would be grateful for any comments or insight on this. Thanks a tonne!

Am I using PCA in Orange in a correct way?

I am analysing whether 15 books can be grouped according to 6 variables (of the 15 books, 2 are by one author, 6 by a second, and 7 by a third). I counted the number of occurrences of each variable and calculated the percentages. Then I used the Orange software to run PCA: I uploaded the file and selected the columns and rows. In the PCA step the program asks whether I want to normalize the data, but I am not sure about that because I have already calculated percentages. Is normalizing different from calculating percentages? Moreover, below the normalize option there is a "Show only: ..." setting where I have to choose a number between 0 and 100, but I don't really know what it is.
Could you help me understand what I should do? Thank you in advance

Repeated measures ANOVA in MATLAB, F-value interpretation

I am using this package for repeated measures anova for MATLAB.
However, I am not sure about the interpretation, and the code is not fully documented. Say I have one group of people and measurements of them at three timepoints (conditions), hence a repeated measures ANOVA with one within-subject factor that has three levels. Now I want to see whether there is a significant effect of condition. Which F-value corresponds to this question? I would have said the one in the row "time", but for $F_{2,24}$, shouldn't the F-value in the row "Subject", which is around 12, be what the F-table lists?
The F for time should be the effect of interest, since you are not interested in differences between subjects. F = 20.82 then. You have 3 time points and therefore 2 degrees of freedom for the effect. To interpret the effect, you should check which time point differs from the others.
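To make the origin of the "time" F-value concrete, here is a minimal sketch of the standard one-way repeated measures decomposition, computed by hand. It is not the package mentioned in the question. The data are simulated, and 13 subjects are assumed only because that gives the error degrees of freedom (2 x 12 = 24) quoted above; the names `Y`, `nSubj`, and `nTime` are hypothetical.

```matlab
% Simulated subjects-by-time matrix (hypothetical data, 13 subjects, 3 timepoints).
rng(1);
nSubj = 13;
nTime = 3;
Y = randn(nSubj, nTime) + repmat([0 0.5 1], nSubj, 1);

grand = mean(Y(:));

% Sums of squares for the within-subject factor (time), the subjects,
% and the residual (time x subject interaction, used as the error term).
SStime  = nSubj * sum((mean(Y, 1) - grand).^2);
SSsubj  = nTime * sum((mean(Y, 2) - grand).^2);
SStotal = sum((Y(:) - grand).^2);
SSerr   = SStotal - SStime - SSsubj;

dfTime = nTime - 1;                  % 2
dfErr  = (nTime - 1) * (nSubj - 1);  % 24

% F for the effect of time: this is the value compared against F(2,24).
Ftime = (SStime / dfTime) / (SSerr / dfErr);

fprintf('F(%d,%d) for time = %.2f\n', dfTime, dfErr, Ftime);
```

The subject sum of squares is removed from the error term but is not itself the test of interest, which is why the "Subject" row is not the one compared against $F_{2,24}$.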

Generate subset of data with known mean

I have a dataset of n observations (an n x 1 vector) and would like to create a subset of this data whose mean is known in advance, by selecting at random only n/3 observations (or within some constraint, i.e. where the mean of the subset is within a range around the known mean).
Can someone please help me with the code to do this in MATLAB?
Note, I don't want to use the rand function to create random data as I already have my data collected.
For example on a smaller scale: If I had the following dataset of 12 observations:
data = [8;7;4;6;9;6;4;7;3;2;1;1];
but then wanted to randomly select a subset of this data containing only 4 observations with a mean of 4 (or with a mean between 3.5-4.5 for example):
Then the answer might be datasubset=[7;3;2;4] but the answer could also be datasubset=[6;4;2;4] or datasubset=[6;4;3;4].
It doesn't matter if there are several possible solutions, I just need one of them, though I'd like to know the alternative solutions also.
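One simple way to approach this, sketched here rather than offered as a definitive solution, is rejection sampling: draw random subsets without replacement until one has a mean in the target range. For a dataset as small as the example, all qualifying subsets can also be enumerated with `nchoosek`. The variable names and the `maxTries` limit are assumptions made for illustration.

```matlab
data = [8;7;4;6;9;6;4;7;3;2;1;1];
k  = 4;      % subset size
lo = 3.5;    % lower bound on the subset mean
hi = 4.5;    % upper bound on the subset mean

% Rejection sampling: draw k distinct indices at random and keep the
% first draw whose mean falls inside [lo, hi].
maxTries = 10000;   % hypothetical safety limit
subset = [];
for t = 1:maxTries
    idx = randperm(numel(data), k);
    candidate = data(idx);
    if mean(candidate) >= lo && mean(candidate) <= hi
        subset = candidate;
        break
    end
end

% Enumerate every qualifying subset (feasible only for small n).
allCombos  = nchoosek(1:numel(data), k);   % one index combination per row
comboMeans = mean(data(allCombos), 2);
validIdx   = allCombos(comboMeans >= lo & comboMeans <= hi, :);
```

Each row of `validIdx` holds the indices of one alternative solution, so `data(validIdx(1,:))` is the first of them. For large n, enumeration becomes impractical and only the sampling loop remains usable.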

Random number generation with Poisson distribution in Matlab

I am trying to simulate an arrival process of vehicles to an intersection in Matlab. The vehicles are randomly generated with Poisson distribution.
Let's say that in one direction the traffic flow intensity is 600 vehicles per hour. From what I understood from theory, the lambda of the Poisson distribution should be 600/3600 (3600 seconds in 1 hour).
Then I run this loop:
for i = 1:3600
vehicle(i) = poissrnd(600/3600);
end
There is one problem: when I count the "ones" in the array vehicle there are never exactly 600 of them; it is always some nearby number, such as 567 or 595.
The question is, am I doing it wrong, i.e. should lambda be different? Or is it normal, that the numbers will never be equal?
If you generate a random number, you can have an expectation of the output.
If you actually knew the output it would not be random anymore.
As such you are not doing anything wrong.
You could make your code a bit more elegant though.
Consider this vectorized approach:
vehicle = poissrnd(600/3600,3600,1)
If you always want the numbers to be the same (for example to reproduce results) try setting the state of your random generator.
If you have a recent MATLAB version (and no legacy code that relies on the older seeding syntax), you could do it like so:
rng(983722)
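To see how much variation to expect, note that the hourly total is the sum of 3600 independent Poisson(600/3600) counts, which is itself Poisson with mean 600 and standard deviation sqrt(600), roughly 24.5. The sketch below repeats the simulation to illustrate this; `nRuns` is a hypothetical choice, and `poissrnd` requires the Statistics and Machine Learning Toolbox, as in the code above.

```matlab
rng(983722);            % seed from the text above, for reproducibility
lambda = 600/3600;      % expected arrivals per second
nRuns  = 200;           % hypothetical number of repeated simulations

totals = zeros(nRuns, 1);
for r = 1:nRuns
    vehicle   = poissrnd(lambda, 3600, 1);
    totals(r) = sum(vehicle);   % total arrivals in one simulated hour
end

fprintf('mean total = %.1f, std = %.1f (theory: 600 and %.1f)\n', ...
        mean(totals), std(totals), sqrt(600));

% Counting entries equal to 1 is not the same as counting vehicles:
% some seconds contain two or more arrivals.
nOnes  = sum(vehicle == 1);
nTotal = sum(vehicle);
```

Totals such as 567 or 595 are therefore well within the expected spread. Use `sum(vehicle)` rather than the number of ones to count vehicles, because a second with two arrivals counts as two vehicles.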