xgboost linear regression (gblinear) wrong predictions

I am using the python xgboost library, and I am unable to get a simple working example using the gblinear booster:
import numpy as np
import xgboost as xgb
import matplotlib.pyplot as plt
M = np.array([
[1, 2],
[2, 4],
[3, 6],
[4, 8],
[5, 10],
[6, 12],
])
xg_reg = xgb.XGBRegressor(objective ='reg:linear', booster='gblinear')
X, y = M[:, :-1], M[:, -1]
xg_reg.fit(X,y)
# predict expects a 2-D array of shape (n_samples, n_features)
grid = np.arange(-5, 20).reshape(-1, 1)
plt.scatter(grid, xg_reg.predict(grid))
plt.scatter(M[:,0], M[:,-1])
plt.show()
Predictions are in blue, and real data in orange
Am I missing something?

I think the issue is that the model does not converge to the optimum with the configuration and the amount of data you have chosen. GBMs do not fit the target directly with each booster; instead, each round fits the gradient of the loss and adds a fraction of that fit (the fraction is the learning rate) to the prediction from the previous round.
So the obvious ways to improve the fit are to increase the learning rate, the number of iterations, or the amount of data.
For example, this variant of your code already gives a better prediction:
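import numpy as np
import xgboost as xgb
import matplotlib.pyplot as plt
X = np.expand_dims(range(1,7), axis=1)
y = 2*X
# note increased learning rate!
# ('reg:linear' is the older name for 'reg:squarederror' in newer xgboost releases)
xg_reg = xgb.XGBRegressor(objective ='reg:linear', booster='gblinear', learning_rate=1)
# eval_set makes fit report the training metric as it goes
xg_reg.fit(X, y, verbose=20, eval_set=[(X,y)])
grid = np.arange(-5, 20).reshape(-1, 1)  # predict expects a 2-D array
plt.scatter(grid, xg_reg.predict(grid), label='prediction')
plt.scatter(X[:20,:], y[:20], label='target')
plt.legend()
plt.show()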
This gives a metric value (RMSE by default) of 0.872 on the training data (I've added an evaluation set to the fit call to see how it changes). It drops further to ~0.1 if you increase the amount of data, e.g. by using range(1, 70) instead of range(1, 7).
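To make the "add a fraction of the fit each round" argument concrete, here is a minimal pure-Python sketch. It is a toy stand-in, not xgboost's actual gblinear updater: each round it fits a least-squares line to the current residuals and adds learning_rate times that fit to the model. The helper names fit_line and boost_linear, and the choice of 10 rounds, are made up for illustration.

```python
def fit_line(xs, rs):
    """Closed-form least-squares fit of r ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_r = sum(rs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxr = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, rs))
    slope = sxr / sxx
    return slope, mean_r - slope * mean_x


def boost_linear(xs, ys, learning_rate, n_rounds):
    """Toy boosting loop: fit the residuals, then add a fraction of that fit."""
    slope, intercept = 0.0, 0.0
    for _ in range(n_rounds):
        # For squared error, the negative gradient is just the residual.
        residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
        d_slope, d_intercept = fit_line(xs, residuals)
        slope += learning_rate * d_slope
        intercept += learning_rate * d_intercept
    return slope, intercept


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]  # same y = 2x data as in the question

slow = boost_linear(xs, ys, learning_rate=0.1, n_rounds=10)
fast = boost_linear(xs, ys, learning_rate=1.0, n_rounds=10)
print("learning_rate=0.1:", slow)
print("learning_rate=1.0:", fast)
```

In this toy setting each round closes only a fraction learning_rate of the remaining gap, so after 10 rounds at 0.1 the slope is still roughly 1.3 instead of 2, while a learning rate of 1 recovers the line in the first round. The real booster also applies regularization, so the numbers will not match xgboost's output.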

Related

Matlab resample time series for specific times, rather than frequencies

I have the following problem in Matlab:
I have a time series which looks like this:
size(ts) = (n,2); % with n being the number of samples, the first column is the time, the second the value.
Let's say I have:
ts(:,1) = [0, 10, 20, 30, 40];
ts(:,2) = [1, 3, 10, 6, 11];
I would like to resample the signal above to get the interpolated values at different times. Say:
ts(:,1) = [0, 1, 3, 15, 40];
ts(:,2) = ???
I had a look at the Matlab functions for signal processing, but they all seem to handle only regular sampling at various frequencies.
Is there a built-in function which would give me the above, or do I have to compute the linear interpolation for each new desired time manually? If so, do you have a recommendation for doing this efficiently using vectorized code? (I just started Matlab a month ago, so I'm still not 100% at ease with this and still rely on for loops a lot.)
For a bit of context, I'm using a finite difference scheme in series to investigate a problem. The output of one FD scheme is fed into the following. Due to the nature of my problem, I have to change the time stepping from one FD to the next, and my time steps can be irregular.
Thanks.
Since your data are 1-D, you can use interp1 to perform the interpolation. The code would work as follows:
ts = [0, 10, 20, 30, 40; % Time/step number
1, 3, 10, 6, 11]; % Values
resampled_steps = [0, 1, 3, 15, 40]; % Times at which we want to resample
resampled_values = interp1(ts(1, :), ts(2, :), resampled_steps);
% Put everything in an array to match initial format
ts_resampled = [resampled_steps; resampled_values];
Alternatively, following the same idea:
ts = [0, 10, 20, 30, 40; % Time/step number
1, 3, 10, 6, 11]; % Values
% Create resample array
ts_resampled = zeros(size(ts));
ts_resampled(1, :) = [0, 1, 3, 15, 40];
% Interpolate
ts_resampled(2, :) = interp1(ts(1, :), ts(2, :), ts_resampled(1, :));
You can even choose the interpolation method by passing its name as a string to interp1; the available methods are listed in the interp1 documentation.
Note that this only works if you resample at time stamps within your original range. For time stamps outside it, you have to tell the function how to extrapolate using the keyword 'extrap'; the details are in the same documentation.
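For a sense of what the default linear method computes, here is a rough sketch in Python (rather than MATLAB, purely for illustration). interp_linear is a made-up helper name, not a library function; it assumes the known times are sorted in ascending order and returns NaN for queries outside the original range.

```python
from bisect import bisect_right


def interp_linear(x_known, y_known, x_query):
    """Piecewise-linear interpolation; x_known must be sorted ascending."""
    results = []
    for xq in x_query:
        if xq < x_known[0] or xq > x_known[-1]:
            results.append(float("nan"))  # outside the original range
            continue
        # Index i of the segment [x_known[i], x_known[i + 1]] containing xq.
        i = min(bisect_right(x_known, xq) - 1, len(x_known) - 2)
        x0, x1 = x_known[i], x_known[i + 1]
        y0, y1 = y_known[i], y_known[i + 1]
        results.append(y0 + (y1 - y0) * (xq - x0) / (x1 - x0))
    return results


times = [0, 10, 20, 30, 40]
values = [1, 3, 10, 6, 11]
resampled = interp_linear(times, values, [0, 1, 3, 15, 40])
print(resampled)
```

For the example data this should come out as roughly 1, 1.2, 1.6, 6.5 and 11, which is what the ??? in the question stands for under linear interpolation.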

Repeating rows of matrix in MATLAB

I have a question that is related to this post: "Cloning" row or column vectors. I tried to work around the answers posted there, yet failed to apply them to my problem.
In my case, I'd like to "clone" each row of a matrix by converting a matrix like
A = [1,2; 3, 4; 5, 6]
into the matrix
B = [1, 2
1, 2
3, 4
3, 4
5, 6
5, 6]
by repeating each row of A a number of times.
So far, I was able to work with repmat for a single row like
A = [1, 2];
B = repmat(A, 2, 1)
>> B = [1, 2
1, 2]
I was trying to build a loop using that formula, in order to obtain the matrix wanted. The loop looked like
T = 3; N = 2;
for t = 1:T
for I = 1:N
B = repmat(C, 21, 1)
end
end
Does anyone have an idea how to write the loop correctly, or a better way to do this?
kron
There are a few ways you can do this. The shortest way would be to use the kron function as suggested by Adiel in the comments.
A = [1,2; 3, 4; 5, 6];
B = kron(A, [1;1]);
Note that the number of elements in the ones vector controls how many times each row is duplicated. For n times, use kron(A, ones(n,1)).
kron calculates the Kronecker tensor product, which is not necessarily a fast process, nor is it intuitive to understand, but it does give the right result!
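If the Kronecker product feels opaque, a small sketch may help. The Python function below (written out by hand for illustration, with matrices as lists of rows) replaces every element of a with that element times the whole of b. With b being a column of n ones, each row of a is simply stacked n times.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a_val * b_val for a_val in a_row for b_val in b_row]
        for a_row in a
        for b_row in b
    ]


A = [[1, 2], [3, 4], [5, 6]]
ones_column = [[1], [1]]  # 2x1 column of ones: repeat each row twice

B = kron(A, ones_column)
for row in B:
    print(row)
```

Each element of A meets a 2x1 block of ones, so every row of A shows up twice in a row in B, which is exactly the "cloned rows" layout asked for.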
reshape and repmat
A more understandable process might involve a combination of reshape and repmat. The aim is to reshape the matrix into a row vector, repeat it the desired number of times, then reshape it again to regain the two-column matrix.
B = reshape(repmat(reshape(A, 1, []), 2, 1), [], 2);
Note that the 2 within the repmat function controls how many times each row is duplicated (the final 2 in the outer reshape is the number of columns of A). For n times, use reshape(repmat(reshape(A, 1, []), n, 1), [], 2).
Speed
A quick benchmark can be written:
% Setup, using a large A
A = rand(1e5, 2);
f = @() kron(A, [1;1]);
g = @() reshape(repmat(reshape(A, 1, []), 2, 1), [], 2);
% timing
timeit(f);
timeit(g);
Output:
kron option: 0.0016622 secs
repmat/reshape option: 0.0012831 secs
Extended benchmark over different sizes:
Summary:
The reshape option is quicker (~25%) when duplicating each row just once, so go for it if you want to end up with 2 of each row for a large matrix.
The reshape option appears to have complexity O(n) in the number of row repetitions. kron has some initial overhead, but is much quicker when you want many repetitions and hardly slows down because of them! Go for the kron method if you are doing more than a few repetitions.

How can I find each max element of three matrices as new matrix?

Maybe the question is a little confusing, so I'll give an example below.
Let's say I have three matrices a, b, c of the same size.
a = [2, 5; 6, 9];
b = [3, 3; 8, 1];
c = [5, 5; 2, 7];
How can I get a new matrix max that holds, at each position, the largest of the corresponding elements of the three matrices?
max = [5, 5; 8, 9]
I know I could create logical matrices like a>b and work it out from there, but is there a more efficient way to do it?
You can concatenate the matrices into one 2x2x3 matrix using
d=cat(3,a,b,c)
and then use the max function to get your desired output:
maxValues=max(d,[],3)
The 3rd input to max defines along which dimension of the first input you want to find the maximum value.
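To spell out what "maximum along the third dimension" means, here is a rough pure-Python equivalent for illustration. The helper name elementwise_max is made up: it lines up the matrices row by row, then cell by cell, and takes the maximum of the cells that share a position.

```python
def elementwise_max(*matrices):
    """Element-wise maximum of same-sized matrices given as lists of rows."""
    return [
        [max(cells) for cells in zip(*rows)]  # cells sharing one position
        for rows in zip(*matrices)            # rows sharing one row index
    ]


a = [[2, 5], [6, 9]]
b = [[3, 3], [8, 1]]
c = [[5, 5], [2, 7]]

max_values = elementwise_max(a, b, c)
print(max_values)  # [[5, 5], [8, 9]]
```

Stacking with cat(3, ...) and reducing along dimension 3 does the same grouping in one vectorized call, with no pairwise comparisons like a>b needed.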

Matlab find the maximum and minimum value for each point of series of arrays (with negative values)

Let's say that we have the following series of arrays:
A = [1, 2, -2, -24];
B = [1, 4, -7, -2];
C = [3, 1, -7, -14];
D = [11, 4, -7, -1];
E = [1, 2, -3, -4];
F = [5, 14, -17, -12];
I would like to create two arrays,
the first will be the maximum of each column for all arrays,
i.e.
Maxi = [11,14,-2 -1];
the second will be the minimum of each column for all arrays
i.e.
Mini= [1,1,-17 -24];
I have been trying all day, using loops with max and abs, but I can't make it work.
In my actual problem I have a (100,200) matrix; the example above is just a simpler way to approach it. The ultimate goal is a kind of fit of the 100 y-lines of 200 x-points. The idea is to calculate two lines (max and min) that will be the "visual" borders of all the lines (the maximum and minimum values for each x). The next step will be to calculate the average of these two arrays, giving a line that runs between all the lines.
Any help is more than welcome!
How about this?
Suppose you stack all the row vectors, namely A, B, ..., F, as
arr=[A;B;C;D;E;F];% stack the vectors
And then use the max(), min() and mean() functions provided by Matlab. That is,
Maxi = max(arr); % Maxi is a row vector carrying the max of each column of arr
Mini = min(arr);
Meani = mean(arr);
You just have to stack them as shown above. If you have hundreds of row vectors, use a loop to stack them into the array arr.
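As a cross-check of the numbers in the question, here is a small pure-Python sketch of the same column-wise reduction (illustrative only; the variable names are made up). It also computes the midline the question describes, i.e. the average of the max and min envelopes, which is generally not the same thing as the column mean over all rows.

```python
rows = [
    [1, 2, -2, -24],    # A
    [1, 4, -7, -2],     # B
    [3, 1, -7, -14],    # C
    [11, 4, -7, -1],    # D
    [1, 2, -3, -4],     # E
    [5, 14, -17, -12],  # F
]

columns = list(zip(*rows))  # transpose: one tuple per column

maxi = [max(col) for col in columns]
mini = [min(col) for col in columns]
# Line halfway between the upper and lower envelope
midline = [(hi + lo) / 2 for hi, lo in zip(maxi, mini)]
# Column mean over all rows, for comparison
meani = [sum(col) / len(col) for col in columns]

print(maxi)     # [11, 14, -2, -1]
print(mini)     # [1, 1, -17, -24]
print(midline)  # [6.0, 7.5, -9.5, -12.5]
```

Negative values need no special handling, so abs is not required. Which of midline or meani is wanted depends on whether the "line between all lines" should sit halfway between the borders or follow the average of the curves.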

How to get a regular sampled matrix in Scilab

I'm trying to program a function in Scilab (or, even better, find one if it already exists) that calculates regularly timed samples of values.
I.e., I have a vector 'values' which contains the values of a signal at different times. These times are in the vector 'times', so at time times(N) the signal has value values(N).
At the moment the times are not regular, so the variables 'times' and 'values' can look like:
times = [0, 2, 6, 8, 14]
values= [5, 9, 10, 1, 6]
This means the signal had value 5 from second 0 to second 2, value 9 from second 2 to second 6, etc.
Therefore, if I want to calculate the signal's average value, I cannot just take the average of the vector 'values', because the signal can, for example, hold the same value for a long time while contributing only one entry to the vector.
One option is to weight by deltaT when calculating the mean, but I will also need to perform other calculations.
Another option is to create a function that, given a deltaT, samples the time and value vectors to produce an equally spaced time vector and the corresponding values. For example, with deltaT=2 and the previous vectors,
[sampledTime, sampledValues] = regularSample(times, values, 2)
sampledTime = [0, 2, 4, 6, 8, 10, 12, 14]
sampledValues = [5, 9, 9, 10, 1, 1, 1, 6]
This is easy if deltaT is small enough to fit exactly with all the times. If deltaT is bigger, then an average of the values or some other approximation must be used...
Is there anything already done in Scilab?
How can this function be programmed?
Thanks a lot!
PS: I don't know if this is the correct forum to post scilab questions, so any pointer would also be useful.
If you would like to implement it yourself, you can use a weighted sum.
times = [0, 2, 6, 8, 14]
values = [5, 9, 10, 1, 6]
weightedSum = 0
highestIndex = length(times)
for i=1:(highestIndex-1)
// Get the amount of time a certain value contributed
deltaTime = times(i+1) - times(i);
// Add the weighted amount to the total weighted sum
weightedSum = weightedSum + deltaTime * values(i);
end
totalTimeDelta = times($) - times(1);
average = weightedSum / totalTimeDelta
printf( "Result is %f", average )
Or, if you want functionally equivalent but less readable code:
timeDeltas = diff(times)
sum(timeDeltas.*values(1:$-1))/sum(timeDeltas)
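The question also asks how a regularSample function could be programmed. Below is a minimal Python sketch of one possible approach (Python rather than Scilab, purely for illustration; the names regular_sample and time_weighted_average are made up). It assumes the signal holds each value until the next time stamp and that times is sorted in ascending order. It does no averaging within a step, so if delta_t is coarser than the original spacing it just reports the value held at each sample instant.

```python
from bisect import bisect_right


def regular_sample(times, values, delta_t):
    """Sample a piecewise-constant signal every delta_t, from times[0] to times[-1]."""
    n_steps = int((times[-1] - times[0]) // delta_t)
    sampled_times = [times[0] + k * delta_t for k in range(n_steps + 1)]
    # For each sample instant, take the last known value at or before it.
    sampled_values = [values[bisect_right(times, t) - 1] for t in sampled_times]
    return sampled_times, sampled_values


def time_weighted_average(times, values):
    """Average of the held signal, weighting each value by how long it lasted."""
    deltas = [t1 - t0 for t0, t1 in zip(times, times[1:])]
    return sum(d * v for d, v in zip(deltas, values)) / sum(deltas)


times = [0, 2, 6, 8, 14]
values = [5, 9, 10, 1, 6]

sampled_times, sampled_values = regular_sample(times, values, 2)
print(sampled_times)   # [0, 2, 4, 6, 8, 10, 12, 14]
print(sampled_values)  # [5, 9, 9, 10, 1, 1, 1, 6]
print(time_weighted_average(times, values))
```

The sampled vectors match the example in the question, and the weighted average is 72 / 14, about 5.14, the same quantity the Scilab loop above computes.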