Why is the RMSE for kriging not strictly decreasing with increasing training data?

I am using the gstat package for ordinary kriging on the Walker Lake data (data size = 470). In each trial I randomly took 20 points from the data and calculated the RMSE for randomly chosen training sets of 50 to 450 points. I then averaged the RMSE for each training set size. The results are as follows:
Trial index   Training points   Avg. RMSE
-----------------------------------------
1             50                43.5936
2             100               40.3413
3             150               34.8842
4             200               28.1230
5             250               28.3111
6             300               30.9915
7             350               30.8903
8             400               28.3148
9             450               28.9578
My questions are:
1) Why is the RMSE wavy? Why doesn't it always decrease as the training data increases?
2) Does that mean we don't need a large dataset for kriging, since the RMSE is lowest when the training set has 200 points?
Waiting for the reply.

Related

Mixed-effects linear regression model using multiple independent measurements

I am trying to implement a linear mixed-effects (LME) regression model for an x-ray imaging quality metric, "CNR" (contrast-to-noise ratio), which I measured for various tube potentials (kV) and filtration materials (Filter). CNR was measured for 3 consecutive slices, so I also have a standard deviation of the CNR from these independent measurements. I am wondering how I can incorporate these multiple independent measurements in my analysis. A representation of the data for a single measurement and my first attempt using fitlme are shown below. I looked at online resources but could not find an answer to my specific questions.
kV=[80 90 100 80 90 100 80 90 100]';
Filter={'Al','Al','Al','Cu','Cu','Cu','Ti','Ti','Ti'}';
CNR=[10 9 8 10.1 8.9 7.9 7 6 5]';
T=table(kV,Filter,CNR);
kV     Filter    CNR
___    ______    ____
 80    'Al'      10
 90    'Al'      9
100    'Al'      8
 80    'Cu'      10.1
 90    'Cu'      8.9
100    'Cu'      7.9
 80    'Ti'      7
 90    'Ti'      6
100    'Ti'      5
OUTPUT
Linear mixed-effects model fit by ML
Model information:
Number of observations 9
Fixed effects coefficients 4
Random effects coefficients 0
Covariance parameters 1
Formula:
CNR ~ 1 + kV + Filter
Model fit statistics:
AIC BIC LogLikelihood Deviance
-19.442 -18.456 14.721 -29.442
Fixed effects coefficients (95% CIs):
Name Estimate SE pValue
'(Intercept)' 18.3 0.17533 1.5308e-09
'kV' -0.10333 0.0019245 4.2372e-08
'Filter_Cu' -0.033333 0.03849 -0.86603
'Filter_Ti' -3 0.03849 -77.942
Random effects covariance parameters (95% CIs):
Group: Error
Name Estimate Lower Upper
'Res Std' 0.04714 0.0297 0.074821
Questions/Issues with current implementation:
How is the fixed-effects coefficient for '(Intercept)' with p = 1.53e-9 interpreted?
I only included fixed effects. Should the standard deviation of the ROI measurements somehow be incorporated into the random effects as well?
How do I incorporate the three independent measurements of CNR for three consecutive slices for a given kV/filter combination? Should I just add more rows to the table "T"? This would result in a total of 27 observations.

lsmeans and continuous time

I would like your help with this problem.
I wish to express the evolution of a quality-of-life score over time, adjusted for clinical factors.
I work on a cohort of more than 1000 patients, and a quality-of-life form was delivered every 3 months. Time is treated as continuous because patients did not fill in the form at exactly the same times.
Here is my problem: I used a linear mixed model with fixed factors (sex and prognosis). I would like to obtain the average score at 3, 6, 9 and 12 months from this model. I used the lsmeans function, but the score obtained corresponds to the average time.
How can I get average scores from my model at these 4 different times?
Below are my code and the result I get with lsmeans.
mod_mix2 <- lme(score_utilite~delai+prono1+sex1,random = ~ delai|numero_patient, data = qdv,na.action=na.omit,method="ML")
lsmeans(mod_mix2, specs ="delai")
$lsmeans
delai lsmean SE df lower.CL upper.CL
10.21976 0.8145542 0.005835597 1016 0.803103 0.8260054
Results are averaged over the levels of: prono1, sex1
Confidence level used: 0.95
$contrasts
contrast estimate SE df z.ratio p.value
(nothing) nonEst NA NA NA NA
Results are averaged over the levels of: prono1, sex1
Thank you very much
Try
lsmeans(mod_mix2, specs = "delai",
        at = list(delai = c(3, 6, 9, 12)))

MSE in neuralnet results and roc curve of the results

Hi, my question is a bit long; please bear with me and read it to the end.
I am working on a project with 30 participants. We have two data sets: the first has 30 rows and 160 columns, and the second has the same 30 rows and 200 columns as outputs (y), and these outputs are independent. I want to use the first data set to predict the outputs of the second. Because the first data set is rectangular and high-dimensional, I used factor analysis and now have 19 factors that cover up to 98% of the variance. I now want to use these 19 factors to predict the outputs of the second data set.
I am using neuralnet with backpropagation; everything goes well and my results are really close to the outputs.
My questions:
1- As my inputs are the factors (they are between -1 and 1) and my outputs are integers between 4 and 10000, should I still scale them before running the neural network?
2- I scaled the data (both inputs and outputs) and then predicted with neuralnet. When I checked the MSE it was very high, around 6000, even though my predictions and the real outputs are very close to each other. But if I rescale the predictions and outputs and then check the MSE, it is near zero. Is it unbiased to rescale and then check the MSE?
3- I read that it is better not to scale the outputs from the beginning, but if I scale only the inputs, all my predictions are 1. Is it correct not to scale the outputs?
4- If I want to plot the ROC curve, how can I do it, given that my results are never exactly equal to the real outputs?
Thank you for reading my question.
[edit#1]: There is a publication on how to produce ROC curves using neural network results
http://www.lcc.uma.es/~jja/recidiva/048.pdf
1) You can scale your values (using min-max, for example), but fit the scaling on your training data set only. Save the parameters used in the scaling process (for min-max, the min and max values by which the data is scaled). Only then scale your test data set, using the min and max values you got from the training data set. Remember, with the test data set you are trying to mimic the process of classifying unseen data. Unseen data is scaled with the scaling parameters from the training data set.
2) When talking about errors, do mention which data set the error was computed on. You can compute an error function (in fact, there are different error functions, one of them, the mean squared error, or MSE) on the training data set, and one for your test data set.
4) Think about this: let's say you train a network with the training data set, and it has only 1 neuron in the output layer. Then you present it with the test data set. Depending on which transfer function (activation function) you use in the output layer, you will get a value for each exemplar. Let's assume you use a sigmoid transfer function, whose max and min values are 1 and 0. That means the predictions will be limited to values between 0 and 1.
Let's also say that your target labels ("truth") only contains discrete values of 0 and 1 (indicating which class the exemplar belongs to).
targetLabels=[0 1 0 0 0 1 0 ];
NNprediction=[0.2 0.8 0.1 0.3 0.4 0.7 0.2];
How do you interpret this?
You can apply a hard-limiting function so that the NNprediction vector contains only the discrete values 0 and 1. Let's say you use a threshold of 0.5:
NNprediction_thresh_0_5 = [0 1 0 0 0 1 0];
vs.
targetLabels =[0 1 0 0 0 1 0];
With this information you can compute your False Positives, FN, TP, and TN (and a bunch of additional derived metrics such as True Positive Rate = TP/(TP+FN) ).
If you had a ROC curve showing the False Positive Rate vs. the True Positive Rate, this would be a single point in the plot. However, if you vary the threshold in the hard-limit function, you can get all the values you need for a complete curve.
Makes sense? See the dependencies of one process on the others?
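The threshold sweep described above can be sketched in a few lines. This is only an illustration in Python (not the MATLAB used in the answer); the function name roc_points and the list of thresholds are assumptions made for the example, while the labels and predictions are the ones from the answer.

```python
def roc_points(target_labels, scores, thresholds):
    """Return one (false positive rate, true positive rate) pair per threshold."""
    points = []
    for t in thresholds:
        # Hard-limit the network output at the current threshold.
        predicted = [1 if s >= t else 0 for s in scores]
        pairs = list(zip(predicted, target_labels))
        tp = sum(1 for p, y in pairs if p == 1 and y == 1)
        fp = sum(1 for p, y in pairs if p == 1 and y == 0)
        fn = sum(1 for p, y in pairs if p == 0 and y == 1)
        tn = sum(1 for p, y in pairs if p == 0 and y == 0)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


# Labels and predictions from the answer; the thresholds are an arbitrary choice.
target_labels = [0, 1, 0, 0, 0, 1, 0]
nn_prediction = [0.2, 0.8, 0.1, 0.3, 0.4, 0.7, 0.2]
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]

points = roc_points(target_labels, nn_prediction, thresholds)
for t, (fpr, tpr) in zip(thresholds, points):
    print(f"threshold={t:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Each threshold contributes one point; plotting FPR on the x axis against TPR on the y axis gives the curve. The threshold of 0.5 used in the answer yields the point (0, 1) for this toy data.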

How to compute & plot Equal Error Rate (EER) from FAR/FRR values using matlab

I have the following FAR/FRR values. I want to compute the EER and then plot it in MATLAB.
FAR      FRR
19.64    20
21.29    18.61
24.92    17.08
19.14    20.28
17.99    21.39
16.83    23.47
15.35    26.39
13.20    29.17
7.92     42.92
3.96     60.56
1.82     84.31
1.65     98.33
26.07    16.39
29.04    13.13
34.49    9.31
40.76    6.81
50.33    5.42
66.83    1.67
82.51    0.28
Is there a MATLAB function available to do this? Can somebody explain this to me? Thanks.
Let me try to answer your question.
1) For your data, the EER can be taken as the mean/max/min of [19.64, 20].
1.1) The idea of the EER is to measure system performance against another system (the lower the better) by finding the point where the False Alarm Rate (FAR) and the False Reject Rate (FRR, or miss rate) are equal (or, if never exactly equal, nearly equal, i.e. at minimum distance).
In your data, [19.64, 20] gives the minimum distance, so it can be used for the EER. You can take the mean/max/min of these two values; however, since the EER is meant for comparing systems, make sure the other system uses the same method (mean/max/min) to pick its EER value.
The difference among mean/max/min can be ignored if there is a large amount of data. In some speaker verification tasks there are 100k data samples.
2) To understand the EER, it is better to compute it yourself. Here is how.
There are two things you need to know:
A) The system score for each test case (trial)
B) The true/false label for each trial
Once you have A and B, create [trial, score, true/false] tuples and sort them by score. Then loop through the scores, e.g. from min to max. At each step, take that score as the threshold and compute the FAR and FRR. After looping through all the scores, find the FAR and FRR with "equal" value.
For the code you can refer to my pyeer.py, function processDataTable2:
https://github.com/StevenLOL/Research_speech_speaker_verification_nist_sre2010/blob/master/SRE2010/sid/pyeer.py
This function is written for the NIST SRE 2010 evaluation.
4) There are other measures similar to EER, such as minDCF which only play with the weights of FAR and FRR. You can refer to "Performance Measure" of http://www.nist.gov/itl/iad/mig/sre10results.cfm
5) You can also refer to this package https://sites.google.com/site/bosaristoolkit/ and DETware_v2.1.tar.gz at http://www.itl.nist.gov/iad/mig/tools/ for computing and plotting EER in Matlab
Plotting in DETWare_v2.1
Pmiss=1:50;Pfa=50:-1:1;
Plot_DET(Pmiss/100.0,Pfa/100.0,'r')
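The score loop described in point 2) can be sketched as follows. This is a hedged illustration, not the code from pyeer.py: the function names far_frr_sweep and approximate_eer, the toy scores and the toy labels are all invented for the example. It assumes a similarity score (a trial is accepted when its score is at or above the threshold) and takes the mean of FAR and FRR at the closest point, which is one of the mean/max/min options mentioned in point 1).

```python
def far_frr_sweep(scores, is_target):
    """Use every observed score as a threshold and return (threshold, FAR, FRR) tuples."""
    n_target = sum(1 for y in is_target if y)
    n_nontarget = len(is_target) - n_target
    results = []
    for threshold in sorted(set(scores)):
        # False accept: a non-target trial scored at or above the threshold.
        false_accepts = sum(
            1 for s, y in zip(scores, is_target) if s >= threshold and not y
        )
        # False reject: a target trial scored below the threshold.
        false_rejects = sum(
            1 for s, y in zip(scores, is_target) if s < threshold and y
        )
        results.append(
            (threshold, false_accepts / n_nontarget, false_rejects / n_target)
        )
    return results


def approximate_eer(results):
    """Pick the threshold where |FAR - FRR| is smallest and average the two rates."""
    threshold, far, frr = min(results, key=lambda r: abs(r[1] - r[2]))
    return threshold, (far + frr) / 2


# Hypothetical toy trials: higher score means "more likely a target".
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.65, 0.8, 0.9]
is_target = [False, False, True, False, False, True, True, True]

eer_threshold, eer = approximate_eer(far_frr_sweep(scores, is_target))
print(eer_threshold, eer)
```

With a distance score instead of a similarity score, the two comparisons against the threshold would be reversed.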
FAR(t) and FRR(t) are parameterized by the threshold, t. They are cumulative distributions, so they should be monotonic in t. Your data is not listed in monotonic order, so if it is indeed FAR and FRR, the measurements were not made in order. For the sake of clarity, we can sort them:
Row    FAR      FRR
1      1.65     98.33
2      1.82     84.31
3      3.96     60.56
4      7.92     42.92
5      13.2     29.17
6      15.35    26.39
7      16.83    23.47
8      17.99    21.39
9      19.14    20.28
10     19.64    20
11     21.29    18.61
12     24.92    17.08
13     26.07    16.39
14     29.04    13.13
15     34.49    9.31
16     40.76    6.81
17     50.33    5.42
18     66.83    1.67
19     82.51    0.28
This is for increasing FAR, which assumes a distance score; if you have a similarity score, then FAR would be sorted in decreasing order.
Loop over FAR until it is larger than FRR, which occurs at row 11. Then interpolate the crossover value between rows 10 and 11. This is your equal error rate.
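A minimal sketch of that interpolation, written in Python rather than MATLAB, using the sorted values from the table above. The function name interpolated_eer is made up for the example, and linear interpolation between the two rows is an assumption; the answer does not say which interpolation to use.

```python
# Sorted FAR/FRR values (in percent) from the table above.
far = [1.65, 1.82, 3.96, 7.92, 13.2, 15.35, 16.83, 17.99, 19.14, 19.64,
       21.29, 24.92, 26.07, 29.04, 34.49, 40.76, 50.33, 66.83, 82.51]
frr = [98.33, 84.31, 60.56, 42.92, 29.17, 26.39, 23.47, 21.39, 20.28, 20,
       18.61, 17.08, 16.39, 13.13, 9.31, 6.81, 5.42, 1.67, 0.28]


def interpolated_eer(far, frr):
    """Find the first row where FAR > FRR and linearly interpolate the crossing."""
    if far[0] > frr[0]:
        raise ValueError("FAR already exceeds FRR in the first row")
    for i in range(1, len(far)):
        if far[i] > frr[i]:
            gap_before = frr[i - 1] - far[i - 1]  # FRR - FAR just before the crossing
            gap_after = far[i] - frr[i]           # FAR - FRR just after the crossing
            w = gap_before / (gap_before + gap_after)
            eer = far[i - 1] + w * (far[i] - far[i - 1])
            return i + 1, eer  # 1-based row number, as in the table
    raise ValueError("FAR never exceeds FRR, so there is no crossover")


crossover_row, eer = interpolated_eer(far, frr)
print(crossover_row, round(eer, 2))
```

For this data the crossover is found at row 11 and the interpolated value is about 19.84, which lies between the two candidates 19.64 and 20 from row 10.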

SOM out of memory in MATLAB

I am trying to use SOM to learn 80000X10 samples (each sample is a vector of size 10). But I can't even configure 8x8 net with 10000X1 samples. It throws "out of memory" error.
Here is my code (data is 80000X10 matrix):
net=selforgmap([8 8])
net=configure(net,data(1:10000,1))
Matlab help: "Unconfigured networks are automatically configured and initialized the first time train is called."
Even for 8000X1 dataset, it takes a lot of time. I noticed a huge numWeightElements: 512000 in net variable (8*8*8000=512000). The weights should be 8*8. SOM training algorithm shouldn't use this much memory. What is wrong?
The output of memory command:
>> memory
Maximum possible array: 3014 MB (3.160e+009 bytes)
Memory available for all arrays: 3014 MB (3.160e+009 bytes)
Memory used by MATLAB: 1154 MB (1.210e+009 bytes)
Physical Memory (RAM): 4040 MB (4.236e+009 bytes)
I think you are configuring the input structure incorrectly. Each input vector must be a column, not a row. Quoting from "Clustering Data - MATLAB & Simulink":
To define a clustering problem, simply arrange Q input vectors to be
clustered as columns in an input matrix (see "Data Structures"
for a detailed description of data formatting for static and time
series data). For instance, you might want to cluster this set of 10
two-element vectors:
inputs = [7 0 6 2 6 5 6 1 0 1; 6 2 5 0 7 5 5 1 2 2]
As you can see, each input vector is a column. You have 10 two-element input vectors as a 2x10 array.
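To see why the orientation matters for memory, here is a small arithmetic sketch in Python. The helper som_weight_count is hypothetical; it assumes the number of weights is the number of neurons times the number of elements per input vector, which matches the numWeightElements of 512000 reported in the question.

```python
def som_weight_count(map_dims, input_shape):
    """Weights in a SOM whose inputs are the COLUMNS of a rows x cols matrix."""
    elements_per_vector, _num_vectors = input_shape
    neurons = 1
    for d in map_dims:
        neurons *= d
    return neurons * elements_per_vector


# An 8000x1 matrix is read as ONE input vector with 8000 elements.
as_in_question = som_weight_count((8, 8), (8000, 1))

# The transposed 10x80000 matrix is read as 80000 input vectors of 10 elements each.
with_columns_as_samples = som_weight_count((8, 8), (10, 80000))

print(as_in_question, with_columns_as_samples)
```

In MATLAB terms this corresponds to passing the transposed matrix, data', so that each of the 80000 samples is a column.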