I have a dataset of fish caught at several points in 120 lakes. In each lake, we collected fish at random places (points), resulting in approximately 800 fish observations. Among other factors, I would like to account for variation in fish (e.g. density of individuals captured) "between" and "within" lakes, since I have coordinates for each fish observation and also for each lake.
How is it possible to include "within" and "between" lake variation in a model, to verify which one is more important for the fish response?
This is an example of my dataset:
observation  lake     point  Xcoord.lake  Ycoord.lake  Xcoord.point  Ycoord.point  fish  factor1  factor2
1            lake1    1      453497       6166094      453022        6166137       100   37.10    158.9
2            lake1    2      453025       6166110      453022        6166137       95    15.7     105.1
3            lake1    3      453093       6166079      453022        6166137       73    0        170.0
4            lake2    1      493269       6319292      493269        6318708       0     5.1      185.7
5            lake2    2      493269       6319292      493568        6318542       10    10.7     129.4
6            lake2    3      493269       6319292      493627        6318531       8     20.8     257.3
...          ...      ...    ...          ...          ...           ...           ...   ...      ...
798          lake120  1      517966       6143495      517967        6143454       252   50.2     326.9
799          lake120  2      517966       6143495      517969        6143379       158   33.8     196.4
800          lake120  3      517966       6143495      517972        6143510       300   87.5     93.2
I was trying something like:
fit <- glmmPQL(fish ~ factor1 + factor2, random = ~ 1 | lake, data = data,
               family = poisson,
               correlation = corExp(form = ~ Xcoord.point + Ycoord.point))
But I don't know if this answers my question.
I really appreciate any help!
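For reference, one common way to separate the two scales is a within-between (contextual-effect) decomposition: split each point-level covariate into its lake mean (the between-lake part) and the point's deviation from that mean (the within-lake part), and enter both as fixed effects alongside a random intercept per lake. A minimal sketch, assuming the data frame above and the lme4 package (the derived variable names are illustrative, and the spatial correlation from the glmmPQL attempt is omitted here; the same decomposed covariates could equally go into that call):
library(lme4)
# between-lake component: the lake mean of the covariate
data$factor1.between <- ave(data$factor1, data$lake)
# within-lake component: each point's deviation from its lake mean
data$factor1.within <- data$factor1 - data$factor1.between
# Poisson GLMM with a random intercept per lake; comparing the two
# coefficients indicates whether between- or within-lake variation
# in factor1 matters more for the fish counts
fit2 <- glmer(fish ~ factor1.between + factor1.within + factor2 + (1 | lake),
              data = data, family = poisson)
summary(fit2)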
I'm trying to implement a custom min-max scaler in kdb+/q. I have taken note of the implementation located in the ml package; however, I'm looking to be able to scale data between a custom range, i.e. 0 and 255. What would be an efficient implementation of min-max scaling in kdb+/q?
Thanks
Looking at the GitHub link on the page you referenced, it looks like you may be able to define a function like so:
minmax255:{[sf;x]sf*(x-mnx)%max[x]-mnx:min x}[255]
Where sf is your scaling factor (here given by 255).
q)minmax255 til 10
0 28.33333 56.66667 85 113.3333 141.6667 170 198.3333 226.6667 255
If you don't like decimals you could round to the nearest whole number like:
q)minmax255round:{[sf;x]floor 0.5+sf*(x-mnx)%max[x]-mnx:min x}[255]
q)minmax255round til 10
0 28 57 85 113 142 170 198 227 255
(The logic here: for a number like 1.7, adding 0.5 and flooring gives 2, whereas for a number like 1.2, adding 0.5 and flooring gives 1.)
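A quick check of that logic:
q)floor 0.5 + 1.7 1.2
2 1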
If you don't want to start at 0, you could use |, which takes the max of its left and right arguments:
q)minmax255roundlb:{[sf;lb;x]lb|floor sf*(x-mnx)%max[x]-mnx:min x}[255;10]
q)minmax255roundlb til 10
10 28 56 85 113 141 170 198 226 255
Where I'm using lb to mean 'lower bound'
If you want to apply this to a table you could use
q)show testtab:([]a:til 10;b:til 10)
a b
---
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
q)update minmax255 a from testtab
a b
----------
0 0
28.33333 1
56.66667 2
85 3
113.3333 4
141.6667 5
170 6
198.3333 7
226.6667 8
255 9
The following will work nicely
minmaxCustom:{[l;u;x]l + (u - l) * (x-mnx)%max[x]-mnx:min x}
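For example, applied to the same test input as above (assuming the definition just given):
q)minmaxCustom[10;255] til 10
10 37.22222 64.44444 91.66667 118.8889 146.1111 173.3333 200.5556 227.7778 255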
As petty as it sounds, it is my strong recommendation that you do not follow Shehir94's solution for a custom minimum value. Applying a maximum (|) to force the starting value will mess with the original distribution. A custom min-max scaling should be a simple linear transformation on top of a standard 0-1 min-max transformation:
X' = a + bX
so the new mean is a + b times the old mean, and the standard deviation changes only by the multiplicative factor b. For example, to get a custom scaling of 10-255, that would be b = 245 and a = 10 (applied to the 0-1 scaled data). We would expect the new mean to follow this formula and the standard deviation to be scaled multiplicatively, but applying a lower bound breaks this. For example:
q)dummyData:10000?100.0
q)stats:{`transform`minVal`maxVal`avgVal`stdDev!(x;min y;max y; avg y; dev y)}
q)minmax255roundlb:{[sf;lb;x]lb|sf*(x-mnx)%max[x]-mnx:min x}[255;10]
q)minmaxCustom:{[l;u;x]l + (u - l) * (x-mnx)%max[x]-mnx:min x}
q)res:stats'[`orig`lb`linear;(dummyData;minmax255roundlb dummyData;minmaxCustom[10;255;dummyData])]
q)res
transform minVal maxVal avgVal stdDev
-----------------------------------------------
orig 0.02741043 99.98293 50.21896 28.92852
lb 10 255 128.2518 73.45999
linear 10 255 133.024 70.9064
// The transformed average should roughly be
q)10 + ((255-10)%100)*49.97936
132.4494
// The transformed std devaition should roughly be
q)2.45*28.92852
70.87487
To answer the comment, this could be applied over a large number of columns; it would be applied to a table in the following manner:
q)n:10000
q)tab:([]sym:n?`3;col1:n?100.0;col2:n?100.0;col3:n?1000.0)
q)multiColApply:{[tab;scaler;colList]flip ft,((),colList)!((),scaler each (ft:flip tab)[colList])}
q)multiColApply[tab;minmaxCustom[10;20];`col1`col2]
sym col1 col2 col3
------------------------------
cag 13.78461 10.60606 392.7524
goo 15.26201 16.76768 517.0911
eoh 14.05111 19.59596 515.9796
kbc 13.37695 19.49495 406.6642
mdc 10.65973 12.52525 178.0839
odn 16.24697 17.37374 301.7723
ioj 15.08372 15.05051 785.033
mbc 16.7268 20 534.7096
bhj 12.95134 18.38384 711.1716
gnf 19.36005 15.35354 411.597
gnd 13.21948 18.08081 493.1835
khi 12.11997 17.27273 578.5203
I ran adonis on community data and an environmental matrix (containing a factor with two levels and six continuous variables) using Bray-Curtis, and I always get 1 df, but this should not be the case. Probably there is a bug here.
See also the example in ?adonis:
data(dune)
data(dune.env)
str(dune.env)
adonis(dune ~ Management*A1, data=dune.env, permutations=99)
Although A1 is a numeric variable, the result shows 1 df.
In the model:
> adonis(dune ~ Management*A1, data=dune.env, permutations=99)
Call:
adonis(formula = dune ~ Management * A1, data = dune.env, permutations = 99)
Permutation: free
Number of permutations: 99
Terms added sequentially (first to last)
Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
Management 3 1.4686 0.48953 3.2629 0.34161 0.01 **
A1 1 0.4409 0.44089 2.9387 0.10256 0.02 *
Management:A1 3 0.5892 0.19639 1.3090 0.13705 0.21
Residuals 12 1.8004 0.15003 0.41878
Total 19 4.2990 1.00000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The main effect of A1 uses a single degree of freedom because it is a continuous variable. The interaction between Management and A1 uses 3 additional degrees of freedom, as there is one additional "effect" of A1 per non-reference level of Management: (4 levels - 1) x 1 df = 3 df.
This is all expected and there is certainly no bug illustrated in adonis() from this model.
Importantly, you must ensure that factor variables are coded as factors; otherwise, for example if the categories are coded as integers, R will interpret those variables as continuous/numeric. It will only interpret them as factors if they are coerced to the "factor" class. Check the output of str(df), where df is your data frame containing the predictor variables (covariates; the things on the right-hand side of ~), and ensure that each factor variable is of the appropriate class. For example, the dune.env data are:
> str(dune.env)
'data.frame': 20 obs. of 5 variables:
$ A1 : num 2.8 3.5 4.3 4.2 6.3 4.3 2.8 4.2 3.7 3.3 ...
$ Moisture : Ord.factor w/ 4 levels "1"<"2"<"4"<"5": 1 1 2 2 1 1 1 4 3 2 ...
$ Management: Factor w/ 4 levels "BF","HF","NM",..: 4 1 4 4 2 2 2 2 2 1 ...
$ Use : Ord.factor w/ 3 levels "Hayfield"<"Haypastu"<..: 2 2 2 2 1 2 3 3 1 1 ...
$ Manure : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 5 3 5 5 3 3 4 4 2 2 ...
which indicates that Management is a factor, A1 is numeric (it is the thickness of the A1 soil horizon), and the remaining variables are ordered factors (but still factors; they work correctly in R's model formula infrastructure).
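For instance (a tiny illustration; df and group are hypothetical names), if a grouping variable had been recorded as integer codes, you would coerce it before fitting:
df$group <- factor(df$group)  # integer codes become factor levels
str(df)                       # confirm group now shows as: Factor w/ ... levels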
Suppose I have a MATLAB table of the following type:
Node_Number Generation_Type Total_power(MW)
1 Wind 600
1 Solar 452
1 Tidal 123
2 Wind 200
2 Tidal 159
What I want to do is produce a table with exactly the same dimensions, the only difference being that the values in the Total_power column corresponding to the Wind generation type are multiplied by 0.5. Hence the result would be:
Node_Number Generation_Type Total_power(MW)
1 Wind 300
1 Solar 452
1 Tidal 123
2 Wind 100
2 Tidal 159
What I believe would do the trick is some code that scans all the rows for the string 'Wind' and, after locating the rows that contain it, multiplies the third column of those rows by 0.5. A for loop seems like a viable solution, though I am not sure how to implement it. Any help would be greatly appreciated.
Just find the indices of the rows with the category Wind, and then you can access them by calling T(index,:).
clc; clear;
T = readtable('data.txt');                        % table with Node_Number, Generation_Type, Total_power_MW_
rows = find(ismember(T.Generation_Type,'Wind'));  % indices of rows whose type is 'Wind'
T(rows,:).Total_power_MW_ = T(rows,:).Total_power_MW_*0.5
Output:
Node_Number Generation_Type Total_power_MW_
___________ _______________ _______________
1 'Wind' 300
1 'Solar' 452
1 'Tidal' 123
2 'Wind' 100
2 'Tidal' 159
Once again, sorry if this has been asked before and if it's too specific, but I'm very stuck and can't quite find a solution.
I have a matrix for three members of a structure, called 2, 4 and 16 (in column 1), with values along each member's relative distance: e.g. member 2 has values at its start (0 m), at 0.5 m, and at the end of its length (1.5 m); member 4 starts again at 0 m, and so on. My matrix looks like this:
2 0 125
2 0.5 25
2 1.5 365
4 0 25
4 0.6 57
16 0 354
16 0.2 95
16 0.8 2
and I want to create a matrix that has the overall distance along all the members 2, 4 and 16 combined:
2 0 125
2 0.5 25
2 1.5 365
4 1.5 25
4 2.1 57
16 2.1 354
16 2.3 95
16 3.1 2
Is there any way to do this in MATLAB? Possibly by locating the first zero, adding the value above it to all of the values below, then finding the next zero value, and so on?
Please tell me if this isn't clear; I realise it's a bit confusing, but I'm not sure how to explain it better!
I came up with the following (M being the matrix from the question):
idx = find(diff(M(:,1)));     % last row of each member except the final one
v = zeros(size(M,1),1);
v(idx+1) = M(idx,2);          % end distance of the previous member, at each new member's first row
M(:,2) = M(:,2) + cumsum(v);  % running offset converts relative to overall distance
The result:
M =
2 0 125
2 0.5 25
2 1.5 365
4 1.5 25
4 2.1 57
16 2.1 354
16 2.3 95
16 2.9 2
Note the last value in the second column disagrees with what you described (2.9 vs 3.1). Either you had a typo, or I'm still not getting it...
data = [2 0 125;
2 0.5 25;
2 1.5 365;
4 0 25;
4 0.6 57;
16 0 354;
16 0.2 95;
16 0.8 2];
idx0 = find(data(:,2)==0);  % rows where a new member starts (distance resets to 0)
idx0 = idx0(2:end);         % ignore first zero of first member, doesn't need an offset
offset = data(idx0-1,2);    % end distance of the previous member
N = size(data,1);
for ii = 1:numel(idx0)
    idxs = (1:N) >= idx0(ii);  % logical mask: this member's rows and everything after
    data(idxs,2) = data(idxs,2) + offset(ii);
end
OK, it seems like a simple problem, but I am having trouble with it.
I have a timer for each data set which resets improperly, and as a result my timing gets mixed up.
Any ideas on how to correct it without losing any data?
Example: the timer column ideally should read like the first column below, but mine reads like the second:
ideal timer    mine reads
1              3
2              4
3              5
4              6
5              1
6              2
How do I change column 2, or make a new column which reads like column 1, without changing the order of the rows that have data?
This is just an example, as my files are 86000 lines long. Also, I have missing timer values which I do not want to lose; a missing timer implies no data for that period of time.
Thanks
EDIT: I do not want to change the other columns. Column 1 is the GPS counter, so it does not sync with the computer timer due to some other issues. I just want to change column 1 so that it runs in order without affecting the other rows, and also take care of missing points (if I did not care about missing points, a simple n = 1:max would work).
Missing data in this case is indicated by a missing timer value: for example, I have 4, 5, 8, 9 with 6 and 7 missing.
OK, let me try to edit again.
It's an 8600 x 80 matrix of data. The timer is one column, which should go from 0 to 8600, but the timer starts at odd times, so my data starts in the middle, say at 3400: in the middle of the day my timer goes to 0 and then counts up from 1 again.
My other columns are fine; I just need to plot the other sets against the timer as time.
I cannot use T = 1:length(file), as that would ignore missed time stamps (timers).
For example, my data reads like:
ideal timer    mine reads
1              3
2              4
3              5
4              8
5              9
8              1
9              2
So you can see time stamps 6 and 7 are missing.
If I used n = 1:length(file), I would have got
1 2 3 4 5 6 7
which is wrong. I want
1 2 3 4 5 8 9
without changing the order of the other rows, so I cannot use sort on the whole file.
I assume the following problem
data says
3 100
4 101
5 102
NaN 0
1 104
2 105
You want
1 100
2 101
3 102
NaN 0
4 104
5 105
I'd solve the problem like this:
%# create test data
data = [3 100
4 101
5 102
NaN 0
1 104
2 105];
%# find good rows (if missing data are indicated by zeros,
%# use goodRows = data(:,1) > 0; instead)
goodRows = isfinite(data(:,1));
%# count good rows
nGoodRows = sum(goodRows);
%# replace the first column with sequential numbers, but only in good rows
data(goodRows,1) = 1:nGoodRows;
data =
1 100
2 101
3 102
NaN 0
4 104
5 105
EDIT 1
Maybe I understand your question this time
data says
4 101
5 102
1 104
2 105
You want
1 4 101
2 5 102
4 1 104
5 2 105
This can be achieved the following way
%# test data
data = [4 101
5 102
1 104
2 105];
%# use sort to get the correct order of the numbers and add it to the left of data
out = [sort(data(:,1)),data]
out =
1 4 101
2 5 102
4 1 104
5 2 105
EDIT 2
Note that out is the result from the solution in EDIT 1
It seems you want to plot the data so that there is no entry for missing values. One way to do this is to make a plot with dots - there won't be a dot for missing data.
plot(out(:,1),out(:,3),'.')
If you want to plot a line that is interrupted, you have to insert NaNs into out
%# create outNaN, that has NaN-rows for missing entries
outNaN = NaN(max(out(:,1)),size(out,2));
outNaN(out(:,1),:) = out;
%# plot using the NaN-padded copy so the line breaks at missing timers
plot(outNaN(:,1),outNaN(:,3))