Comparing RapidMiner models with x-validation

I am working on some forecasting models in RapidMiner and need some guidance on interpreting the outputs and selecting the best one. I am following some tutorials to check their accuracy with x-validation, and I am getting results like:
root_mean_squared_error: 1.613 +/- 0.374 (mikro: 1.651 +/- 0.000)
squared_error: 3.843 +/- 1.632 (mikro: 3.832 +/- 5.398)
for regression, and:
root_mean_squared_error: 1.917 +/- 0.410 (mikro: 1.958 +/- 0.000)
squared_error: 3.843 +/- 1.632 (mikro: 3.832 +/- 5.398)
for the neural nets. Which one would be the best? Why? Is there a straightforward way to get the 'r' value? Thanks!

How do math programs solve calculus-based problems?

Many mathematical programs, such as GeoGebra and Qalculate!, can solve calculus-based problems.
How are those programs able to solve calculus-based problems which humans need to evaluate using a long procedure?
For example, the problem:
It takes a lot of steps for humans to solve this problem as shown here on Quora.
How can those mathematical programs solve them with such a good accuracy?
The Church-Turing thesis implies that anything a human being can calculate can be calculated by any Turing-equivalent system of computation - including programs running on computers. That is to say, if we can solve the problem (or calculate an approximate answer that meets some criteria) then a computer program can be made to do the same thing. Let's consider a simpler example:
f(x) = x
a = Integral(f, 0, 1)
A human being presented with this problem has two options:
try to compute the antiderivative using some procedure, then use procedures to evaluate the definite integral over the supplied range
use some numerical method to calculate an approximate value for the definite integral which meets some criteria for closeness to the true value
In either case, human beings have a set of tools that allow them to do this:
recognize that f(x) is a polynomial in x. There are rules for constructing the antiderivatives of polynomials. Specifically, each term ax^b in the polynomial can be converted to (a/(b+1))x^(b+1), and then an arbitrary constant c is added to the end. We then say ∫f(x)dx = (1/2)x^2 + c. Now that we have the antiderivative, we have a procedure for computing the definite integral over a range: evaluate the antiderivative at the high value, then subtract from that its value at the low value (the constant c cancels). This gives ((1/2)1^2) - ((1/2)0^2) = 1/2 - 0 = 1/2.
decide that for our purposes a Riemann sum with dx=1/10 is sufficient and that we'll take the midpoint value. We get 10 rectangles with base 1/10 and heights 1/20, 3/20, 5/20, 7/20, 9/20, 11/20, 13/20, 15/20, 17/20 and 19/20, respectively. The areas are 1/200, 3/200, 5/200, 7/200, 9/200, 11/200, 13/200, 15/200, 17/200 and 19/200. The sum of these is (1+3+5+7+9+11+13+15+17+19)/200 = 100/200 = 1/2. We happened to get the exact answer since we used the midpoint value and evaluated the definite integral of a linear function; in general, we'd have been close but not exact.
The only difficulty is in adequately specifying the procedure human beings use to solve these problems in various ways. Once specified, computers are perfectly capable of doing them. And make no mistake, human beings have a procedure - conscious or subconscious - for doing these problems reliably.
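Both procedures above can be written down as a short program. The following is a minimal Python sketch, not how GeoGebra or Qalculate! actually work internally; the function names and the coefficient-list representation of a polynomial are assumptions made for illustration. Exact fractions are used so the arithmetic matches the hand calculation.

```python
from fractions import Fraction

def antiderivative(coeffs):
    """coeffs[b] is the coefficient a of the term a*x^b.
    Each term becomes (a/(b+1))*x^(b+1); the constant c is taken as 0."""
    return [Fraction(0)] + [Fraction(a) / (b + 1) for b, a in enumerate(coeffs)]

def evaluate(coeffs, x):
    """Evaluate the polynomial with the given coefficients at x."""
    return sum(a * Fraction(x) ** b for b, a in enumerate(coeffs))

def definite_integral(coeffs, low, high):
    """Antiderivative at the high value minus antiderivative at the low value."""
    big_f = antiderivative(coeffs)
    return evaluate(big_f, high) - evaluate(big_f, low)

def midpoint_riemann(f, low, high, n):
    """Sum of n rectangles of width dx, each as tall as f at its midpoint."""
    dx = Fraction(high - low, n)
    return sum(f(low + (i + Fraction(1, 2)) * dx) for i in range(n)) * dx

# f(x) = x is the coefficient list [0, 1]
exact = definite_integral([0, 1], 0, 1)
approx = midpoint_riemann(lambda x: x, 0, 1, 10)
print(exact, approx)
```

For f(x) = x both routes give 1/2, as in the hand calculation; for a non-linear function the midpoint sum would only be close to the exact value.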

Why is the confidence interval not consistent with the standard errors in this regression?

I am running a linear regression with fixed effects and standard errors clustered by a certain group.
areg ref1 ref1_l1 rf1 ew1 vol_ew1 sk_ew1, a(us_id) vce(cluster us_id)
The one line code is as above and the output is as follows:
Now, the t-stats and the P values look inconsistent. How can we have a t-stat > 5 and a p-value > 11%? Similarly, the 95% confidence intervals appear to be way wider than Coef. +/- 2 Std. Err.
What am I missing?
There is nothing inconsistent here. You have a small sample size and a less than parsimonious model and have all but run out of degrees of freedom. Notice how areg won't post an F statistic or a P-value for the model, a strong danger sign. Your t statistics are consistent with checks by hand:
. display 2 * ttail(1, 5.54)
.11368912
. display 2 * ttail(1, 113.1)
.00562868
In short, there is no bug here and no programming issue. It's just a matter of your model over-fitting your data and the side-effects of that.
Similarly, +/- 2 SE for a 95% confidence interval is way off as a rule of thumb here. Again, a hand calculation is instructive:
. display invt(1, 0.975)
12.706205
. display invt(60, 0.975)
2.0002978
. display invt(61, 0.975)
1.9996236
. display invnormal(0.975)
1.959964
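The same hand checks can be reproduced outside Stata. This is a minimal Python sketch using only the standard library; it assumes 1 degree of freedom, as the ttail(1, ...) and invt(1, ...) calls above imply, and relies on the fact that Student's t with 1 degree of freedom is the standard Cauchy distribution, whose CDF is 1/2 + atan(t)/pi. The function names are made up for illustration.

```python
import math
from statistics import NormalDist

def two_sided_p_df1(t):
    """Two-sided p-value for a t statistic with 1 degree of freedom."""
    return 1 - 2 * math.atan(abs(t)) / math.pi

def critical_value_df1(level=0.95):
    """Multiplier of the standard error for a confidence interval, 1 df."""
    upper_tail_point = 1 - (1 - level) / 2   # 0.975 for a 95% interval
    return math.tan(math.pi * (upper_tail_point - 0.5))

p_small_t = two_sided_p_df1(5.54)          # compare with 2 * ttail(1, 5.54)
p_large_t = two_sided_p_df1(113.1)         # compare with 2 * ttail(1, 113.1)
t_crit = critical_value_df1(0.95)          # compare with invt(1, 0.975)
z_crit = NormalDist().inv_cdf(0.975)       # compare with invnormal(0.975)
print(p_small_t, p_large_t, t_crit, z_crit)
```

With a multiplier near 12.7 instead of 2, the interval is roughly six times wider than the +/- 2 SE rule of thumb suggests.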

Matlab: how to find fundamental frequency from list of energy peaks

In a spectrogram, I have a set of harmonic frequencies (peaks in the spectrum) for a given time frame:
5215
3008.1
2428.1
2214.9
1630.2
1315
997.01
881.39
779.04
667.47
554.21
445.77
336.39
237.69
124.6
If I do -diff(ans), I get the differences between adjacent peaks, which suggest that the fundamental frequency f_0 of this frame is around 110 Hz:
2206.9
580.06
213.11
584.72
315.24
317.97
115.62
102.35
111.57
113.26
108.44
109.38
98.705
113.08
It is clear that the last 9 values of the first list are harmonics of the same f_0, because the last 8 values of the second list are around the same value. Their mean is 109.05 (but I'm not sure if that is the correct f_0). How can I calculate f_0 in a neat function?
I found an answer myself: I calculate the difference between the two peaks with the lowest frequency values and with energy values above a certain threshold. Then, I check if that difference is (within a certain range) in the list of frequencies.
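As a rough illustration in Python (not the Matlab function from the answer, which relies on peak energies that are not listed here), the sketch below computes f_0 from the spacing of the lowest peaks and then refines it with a least-squares fit. The fit is an alternative technique, not something from the answer: it assumes every one of the lowest peaks is an integer multiple of f_0, and the function names and the n_lowest parameter are made up for this example.

```python
peaks = [5215, 3008.1, 2428.1, 2214.9, 1630.2, 1315, 997.01, 881.39,
         779.04, 667.47, 554.21, 445.77, 336.39, 237.69, 124.6]

def f0_from_spacing(peaks, n_lowest=9):
    """Mean difference between adjacent peaks among the n_lowest peaks."""
    lowest = sorted(peaks)[:n_lowest]
    spacings = [hi - lo for lo, hi in zip(lowest, lowest[1:])]
    return sum(spacings) / len(spacings)

def f0_least_squares(peaks, f0_guess, n_lowest=9):
    """Assign each low peak a harmonic number k = round(f / f0_guess),
    then fit f = k * f0 through the origin by least squares."""
    lowest = sorted(peaks)[:n_lowest]
    numbers = [max(1, round(f / f0_guess)) for f in lowest]
    return (sum(k * f for k, f in zip(numbers, lowest))
            / sum(k * k for k in numbers))

f0_spacing = f0_from_spacing(peaks)              # the mean quoted in the question
f0_fit = f0_least_squares(peaks, f0_spacing)     # uses the peak positions too
print(f0_spacing, f0_fit)
```

The spacing estimate only depends on the first and last of the chosen peaks, while the fit uses all of them, so the two values can differ by a few Hz.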

Predict Location coordinates from time data

I have data consisting of time (sec) on the x-axis and location (x,y,z) on the y-axis. I want to be able to predict location (x,y,z) using time (sec). What machine learning algorithm can I use? How can I accomplish this in Matlab/Octave?
Specifically, I have the following data
Time Location
0 [470 491 0]
2 [174 281 5]
70.29 [174 281 0]
72.29 [490 257 2]
How do I predict location from time?
I appreciate your help.
Thanks
Too few data points; you can always use linear interpolation.
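A minimal sketch of that suggestion, written in Python instead of Matlab/Octave and using the four samples from the question. The function name is made up for illustration, and the sketch only interpolates inside the sampled time range; it does not extrapolate.

```python
from bisect import bisect_right

# (time in seconds, (x, y, z)) taken from the question
samples = [
    (0.0, (470, 491, 0)),
    (2.0, (174, 281, 5)),
    (70.29, (174, 281, 0)),
    (72.29, (490, 257, 2)),
]

def interpolate_location(t, samples):
    """Linearly interpolate (x, y, z) between the two samples that bracket t."""
    times = [time for time, _ in samples]
    if not times[0] <= t <= times[-1]:
        raise ValueError("t is outside the sampled time range")
    # Index of the sample just after t, clamped so t == last time still works
    i = min(bisect_right(times, t), len(times) - 1)
    (t0, p0), (t1, p1) = samples[i - 1], samples[i]
    w = (t - t0) / (t1 - t0)
    return tuple(a + w * (b - a) for a, b in zip(p0, p1))

print(interpolate_location(1.0, samples))
```

At t = 1.0, halfway between the first two samples, this returns the midpoint of their locations.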

Arbitrary distribution -> Uniform distribution (Probability Integral Transform?)

I have 500,000 values for a variable derived from financial markets. Specifically, this variable represents distance from the mean (in standard deviations). This variable has an arbitrary distribution. I need a formula that will allow me to select a range around any value of this variable such that an equal (or close to it) number of data points fall within that range.
This will allow me to then analyze all of the data points within a specific range and to treat them as "similar situations to the input."
From what I understand, this means that I need to convert it from arbitrary distribution to uniform distribution. I have read (but barely understood) that what I am looking for is called "probability integral transform."
Can anyone assist me with some code (Matlab preferred, but it doesn't really matter) to help me accomplish this?
Here's something I put together quickly. It's not polished and not perfect, but it does what you want to do.
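clear
% Example data: a mixture of two normal distributions
randList=[randn(1e4,1);2*randn(1e4,1)+5];
% Kernel estimate of the CDF, evaluated at 5e3 points
[xCdf,xList]=ksdensity(randList,'npoints',5e3,'function','cdf');
% Interval around the value 5 that contains 10% of the data
xRange=getInterval(5,xList,xCdf,0.1);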
and the function getInterval is
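function out=getInterval(yPoint,xList,xCdf,areaFraction)
% CDF value at the chosen point
yCdf=interp1(xList,xCdf,yPoint);
% CDF range of width areaFraction centred on that value
yCdfRange=[-areaFraction/2, areaFraction/2]+yCdf;
% Map the CDF range back to data values (inverse CDF)
out=interp1(xCdf,xList,yCdfRange);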
Explanation:
The CDF of the random distribution is shown below by the line in blue. You provide a point (here 5 in the input to getInterval) about which you want a range that gives you 10% of the area (input 0.1 to getInterval). The chosen point is marked by the red cross and the interval is marked by the lines in green. You can get the corresponding points from the original list that lie within this interval as
newList=randList(randList>=xRange(1) & randList<=xRange(2));
You'll find that on average, the number of points in this example is ~2000, which is 10% of numel(randList)
numel(newList)
ans =
2045
NOTE:
Please note that this was done quickly and I haven't made any checks to see if the chosen point is outside the range or if yCdfRange falls outside [0 1], in which case interp1 will return a NaN. This is fairly straightforward to implement, and I'll leave that to you.
Also, ksdensity is very CPU intensive. I wouldn't recommend increasing npoints to more than 1e4. I assume you're only working with a fixed list (i.e., you have a list of 5e5 points that you've obtained somehow and now you're just running tests/analyzing it). In that case, you can run ksdensity once and save the result.
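For a fixed list, a cheaper variant of the same idea is to use the empirical CDF, i.e. ranks in the sorted data, instead of a kernel estimate. The Python sketch below is that swapped-in technique, not a port of the ksdensity code above; the name get_interval and the clamping at the ends of the data are assumptions made for illustration. The data is sorted once and can then be queried repeatedly.

```python
import bisect
import random

def get_interval(sorted_data, point, fraction):
    """Return the data points around `point` that make up `fraction`
    of the data, half on each side by rank (clamped at the ends)."""
    n = len(sorted_data)
    rank = bisect.bisect_left(sorted_data, point)   # empirical CDF times n
    half = round(fraction * n / 2)
    lo = max(rank - half, 0)
    hi = min(rank + half, n)                        # exclusive upper index
    return sorted_data[lo:hi]

# Same kind of example data as above: a mixture of two normals
rng = random.Random(0)
data = sorted([rng.gauss(0, 1) for _ in range(10_000)] +
              [rng.gauss(5, 2) for _ in range(10_000)])

similar = get_interval(data, 5, 0.1)
print(len(similar), similar[0], similar[-1])
```

Because the interval is chosen by rank, it contains the requested number of points whenever the chosen point is far enough from both ends of the data; near the ends the clamping makes the interval smaller instead of returning NaN.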
I do not speak Matlab, but you need to find quantiles in your data. This is Mathematica code which would do this:
In[88]:= data = RandomVariate[SkewNormalDistribution[0, 1, 2], 10^4];
Compute quantile points:
In[91]:= q10 = Quantile[data, Range[0, 10]/10];
Now form pairs of consecutive quantiles:
In[92]:= intervals = Partition[q10, 2, 1];
In[93]:= intervals
Out[93]= {{-1.397, -0.136989}, {-0.136989, 0.123689}, {0.123689,
0.312232}, {0.312232, 0.478551}, {0.478551, 0.652482}, {0.652482,
0.829642}, {0.829642, 1.02801}, {1.02801, 1.27609}, {1.27609,
1.6237}, {1.6237, 4.04219}}
Verify that the splitting points separate data nearly evenly:
In[94]:= Table[Count[data, x_ /; i[[1]] <= x < i[[2]]], {i, intervals}]
Out[94]= {999, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000}
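The same quantile split can be sketched in Python with the standard library. This is an illustration, not a translation of the Mathematica session: the sample is normal instead of skew-normal, the function name is made up, and statistics.quantiles uses its own default interpolation rule, so the cut points would not match Quantile exactly even on identical data.

```python
import bisect
import random
import statistics

def equal_count_bins(data, n_bins=10):
    """Return the n_bins - 1 interior cut points and the count per bin."""
    cuts = statistics.quantiles(data, n=n_bins)
    counts = [0] * n_bins
    for x in data:
        # bisect_right gives the number of cut points <= x, i.e. the bin index
        counts[bisect.bisect_right(cuts, x)] += 1
    return cuts, counts

rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(10_000)]
cuts, counts = equal_count_bins(data, 10)
print(counts)
```

To find the "similar situations" for a new input, look up its bin with the same bisect_right call and take the data points that fall in that bin.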