Panel data regression comparison in Matlab - matlab

I have a very large panel data and would like to apply a number of simple machine learning techniques in Matlab (Logistic Regression, Decision Trees, Bagged Trees).
During my preparation I came across fitglm and fitLifetimePDModel, the latter of which is meant to capture panel data. I was trying to understand how/if that differs from fitglm because when I try the below, the results are exactly the same.
Why is that? For example, under fitglm I'm not telling the program that each customer can have more than one data points.
load RetailCreditPanelData.mat
pdModel_1 = fitLifetimePDModel(data,"Logistic", 'AgeVar','YOB', 'IDVar','ID', 'LoanVars','ScoreGroup','ResponseVar','Default');
disp(pdModel_1.Model)
pdModel_2 = fitglm(data,'Default ~ 1 + ScoreGroup + YOB', 'Distribution','binomial', 'link', 'logit');
disp(pdModel_2)

Related

How do I compare two weighted regressions in MatLab?

I've been using MatLab as a statistics tool. I like how much I can customise and code myself.
I was delighted to find that it's fairly straightforward to do a weighted linear regression in MatLab. As a slightly silly example, I can load the "carbig" data file and compare horsepower vs mileage for US cars to that of cars from other countries, but decide I only trust 8-cylinder cars.
load carbig
w=(Cylinders==8)+0.5*(Cylinders~=8)%1 if 8 cylinders, 0.5 otherwise.
for i=1:length(org)
o(i,1)=strcmp(org(i,:),org(1,:));%strcmp only works on one string.
end
x1=Horsepower(o==1)
x2=Horsepower(o==0)
y1=MPG(o==1)
y2=MPG(o==0)
w1=w(o==1)
w2=w(o==0)
lm1=fitlm(x1,y1,'Weights',w1)
lm2=fitlm(x2,y2,'Weights',w2)
This way, data from 8-cylinder cars will count as one data-point, and data frm 3,4,5,6-cylinder cars will count as half a data point.
Problem is, the obvious way to compare the two regressions is to use ANCOVA, which MatLab has a function for:
aoctool(Horsepower,MPG,o)
This function compares linear regressions on the two groups, but I haven't found an obvious way to include weights.
I suspect I can have a closer look at what the ANCOVA does and include the weights manually. Any easier solution?
I figured if I give the "trusted" measuremets weight 2, the "untrusted" measurements weight 1, for regression purposes that's the same thing as having an extra 1 identical measurement for each trusted one. Setting the weight to 1 and 0.5 should do the same thing. I can do this with a script.
That also increases the degrees of freedom quite a bit, so I manually set the degrees of freedom to sum(w)-rank instead on n-rank.
x=[];
y=[];
g=[];
w=(Cylinders==8)+0.5*(Cylinders~=8);
df=sum(w)
for i=1:length(w)
while w(i)>0
x=[x;Horsepower(i)];
y=[y;MPG(i)];
g=[g;o(i)];
w(i)=w(i)-0.5
end
end
I then copied the aoctool.m file (edit aoctool) and inserted the value of df somewhere in the new file. It isn't elegant, but it seems to work.
edit aoctool.m
%(insert new df somewhere. Save as aoctool2.m)
aoctool2(x,y,g)

Resample factors are too large

I have a large vector of recorded data which I need to resample. The problem I encounter is that when using resample, I get the following error:
??? Error using ==> upfirdn at 82 The product of the downsample factor
Q and the upsample factor P must be less than 2^31.
Now, I understand why this is happening - my two sampling rates are very close together, so the integer factors need to be quite large (something like 73999/74000). Unfortunately this means that the appropriate filter can't be created by MATLAB. I also tried resampling just up, with the intention of then resampling down, but there is not enough memory to do this to even 1 million samples of data (mine is 93M).
What other methods could I use to properly resample this data?
An interpolated polyphase FIR filter can be used to interpolate just the new set of sample points without using an upsampling+downsampling process.
But if performance is completely unimportant, here's a Quick and Dirty windowed-Sinc interpolator in Basic.
here's my code, I hope it helps :
function resig = resamplee(sig,upsample,downsample)
if upsample*downsample<2^31
resig = resample(sig,upsample,downsample);
else
sig1half=sig(1:floor(length(sig)/2));
sig2half=sig(floor(length(sig)/2):end);
resig1half=resamplee(sig1half,floor(upsample/2),length(sig1half));
resig2half=resamplee(sig2half,upsample-floor(upsample/2),length(sig2half));
resig=[resig1half;resig2half];
end

How to build an ARMAX model in Matlab

I'm trying to build an ARMAX model which predicts reservoir water elevation as a function of previous elevations and an upstream inflow. My data is on a timestep of roughly 0.041 days, but it does vary slightly, and I have 3643 time series points. I've tried using the basic armax Matlab command, but am getting this error:
Error using armax (line 90)
Operands to the || and && operators must be convertible to
logical scalar values.
The code I'm trying is:
data = iddata(y,x,[],'SamplingInstants',JDAYs)
m1 = armax(data, [30 30 30 1])
where y is a vector of elevations that starts like y=[135.780
135.800
135.810
135.820
135.820
135.830]', x is a vector of flowrates that starts like x=[238.865
238.411
238.033
237.223
237.223
233.828]', and JDAYs is a vector of timestamps that starts like JDAYs=[122.604
122.651
122.688
122.729
122.771
122.813]'.
I'm new to this model type and the system identification toolbox, so I'm having issues figuring out what's causing that error. The Matlab examples aren't very helpful...
I hope this is not getting to you a bit late.
Checking your code i see that you are using a parameter called SamplingInstants. I'm not sure ARMAX functions works with it. Actually i'm sure. I have tried several times, and no, it doesn't. And it don't seems to be a well documented option for ARMAX -or for other methods- too.
The ARX, ARMAX, and other models are based on linear discrete systems from the Z-Transform formalism, that is, one can ussualy assume that your system has been sampled under a regular sampling rate. Although of course, this is not a law, this is the standard framework when dealing with linear -and also non-linear- systems. And also most industrial control & acquisition systems work under a regular rate sampling. Yet.
Try to get inside the ARMAX standard setting, like this:
y=[135.780 135.800 135.810 135.820 135.820 135.830 .....]';
x=[238.865 238.411 238.033 237.223 237.223 233.828 .....]';
%JDAYs=[122.604 122.651 122.688 122.729 122.771 122.813 .....]';
JDAYs=122.601+[0:length(y)-1]*4.18';
data = iddata(y,x,[],'SamplingInstants',JDAYs);
m1 = armax(data, [30 30 30 1])
And this will always work. Please just ensure that x and y are long enough to enable the proper estimation of all the free coefficients, greater than mean(4*orders), for ARMAX to work -in this case, greater than 121-, and desirable greater than 10*mean(4*orders), for ARMAX algorithm to properly solve your problem, and enough time-variant for prevent reaching onto ill-conditioned solutions.
Good Luck ;)...

function parameters in matlab wander off after curve fitting

first a little background. I'm a psychology student so my background in coding isn't on par with you guys :-)
My problem is as follow and the most important observation is that curve fitting with 2 different programs gives completly different results for my parameters, altough my graphs stay the same. The main program we have used to fit my longitudinal data is kaleidagraph and this should be seen as kinda the 'golden standard', the program I'm trying to modify is matlab.
I was trying to be smart and wrote some code (a lot at least for me) and the goal of that code was the following:
1. Taking an individual longitudinal datafile
2. curve fitting this data on a non-parametric model using lsqcurvefit
3. obtaining figures and the points where f' and f'' are zero
This all worked well (woohoo :-)) but when I started comparing the function parameters both programs generate there is a huge difference. The kaleidagraph program stays close to it's original starting values. Matlab wanders off and sometimes gets larger by a factor 1000. The graphs stay however more or less the same in both situations and both fit the data well. However it would be lovely if I would know how to make the matlab curve fitting more 'conservative' and more located near it's original starting values.
validFitPersons = true(nbValidPersons,1);
for i=1:nbValidPersons
personalData = data{validPersons(i),3};
personalData = personalData(personalData(:,1)>=minAge,:);
% Fit a specific model for all valid persons
try
opts = optimoptions(#lsqcurvefit, 'Algorithm', 'levenberg-marquardt');
[personalParams,personalRes,personalResidual] = lsqcurvefit(heightModel,initialValues,personalData(:,1),personalData(:,2),[],[],opts);
catch
x=1;
end
Above is a the part of the code i've written to fit the datafiles into a specific model.
Below is an example of a non-parametric model i use with its function parameters.
elseif strcmpi(model,'jpa2')
% y = a.*(1-1/(1+(b_1(t+e))^c_1+(b_2(t+e))^c_2+(b_3(t+e))^c_3))
heightModel = #(params,ages) abs(params(1).*(1-1./(1+(params(2).* (ages+params(8) )).^params(5) +(params(3).* (ages+params(8) )).^params(6) +(params(4) .*(ages+params(8) )).^params(7) )));
modelStrings = {'a','b1','b2','b3','c1','c2','c3','e'};
% Define initial values
if strcmpi('male',gender)
initialValues = [176.76 0.339 0.1199 0.0764 0.42287 2.818 18.52 0.4363];
else
initialValues = [161.92 0.4173 0.1354 0.090 0.540 2.87 14.281 0.3701];
end
I've tried to mimick the curve fitting process in kaleidagraph as good as possible. There I've found they use the levenberg-marquardt algorithm which I've selected. However results still vary and I don't have any more clues about how I can change this.
Some extra adjustments:
The idea for this code was the following:
I'm trying to compare different fitting models (they are designed for this purpose). So what I do is I have 5 models with different parameters and different starting values ( the second part of my code) and next I have the general curve fitting file. Since there are different models it would be interesting if I could put restrictions into how far my starting values could wander off.
Anyone any idea how this could be done?
Anybody willing to help a psychology student?
Cheers
This is a common issue when dealing with non-linear models.
If I were, you, I would try to check if you can remove some parameters from the model in order to simplify it.
If you really want to keep your solution not too far from the initial point, you can use upper bounds and lower bounds for each variable:
x = lsqcurvefit(fun,x0,xdata,ydata,lb,ub)
defines a set of lower and upper bounds on the design variables in x so that the solution is always in the range lb ≤ x ≤ ub.
Cheers
You state:
I'm trying to compare different fitting models (they are designed for
this purpose). So what I do is I have 5 models with different
parameters and different starting values ( the second part of my code)
and next I have the general curve fitting file.
You will presumably compare the statistics from fits with different models, to see whether reductions in the fitting error are unlikely to be due to chance. You may want to rely on that comparison to pick the model that not only fits your data suitably but is also simplest (which is often referred to as the principle of parsimony).
The problem is really with the model you have shown resulting in correlated parameters and therefore overfitting, as mentioned by #David. Again, this should be resolved when you compare different models and find that some do just as well (statistically speaking) even though they involve fewer parameters.
edit
To drive the point home regarding the problem with the choice of model, here are (1) results of a trial fit using simulated data (2) the correlation matrix of the parameters in graphical form:
Note that absolute values of the correlation close to 1 indicate strongly correlated parameters, which is highly undesirable. Note also that the trend in the data is practically linear over a long portion of the dataset, which implies that 2 parameters might suffice over that stretch, so using 8 parameters to describe it seems like overkill.

MATLAB interview questions?

I programmed in MATLAB for many years, but switched to using R exclusively in the past few years so I'm a little out of practice. I'm interviewing a candidate today who describes himself as a MATLAB expert.
What MATLAB interview questions should I ask?
Some other sites with resources for this:
"Matlab interview questions" on Wilmott
"MATLAB Questions and Answers" on GlobaleGuildLine
"Matlab Interview Questions" on CoolInterview
This is a bit subjective, but I'll bite... ;)
For someone who is a self-professed MATLAB expert, here are some of the things that I would personally expect them to be able to illustrate in an interview:
How to use the arithmetic operators for matrix or element-wise operations.
A familiarity with all the basic data types and how to convert effortlessly between them.
A complete understanding of matrix indexing and assignment, be it logical, linear, or subscripted indexing (basically, everything on this page of the documentation).
An ability to manipulate multi-dimensional arrays.
The understanding and regular usage of optimizations like preallocation and vectorization.
An understanding of how to handle file I/O for a number of different situations.
A familiarity with handle graphics and all of the basic plotting capabilities.
An intimate knowledge of the types of functions in MATLAB, in particular nested functions. Specifically, given the following function:
function fcnHandle = counter
value = 0;
function currentValue = increment
value = value+1;
currentValue = value;
end
fcnHandle = #increment;
end
They should be able to tell you what the contents of the variable output will be in the following code, without running it in MATLAB:
>> f1 = counter();
>> f2 = counter();
>> output = [f1() f1() f2() f1() f2()]; %# WHAT IS IT?!
We get several new people in the technical support department here at MathWorks. This is all post-hiring (I am not involved in the hiring), but I like to get to know people, so I give them the "Impossible and adaptive MATLAB programming challenge"
I start out with them at MATLAB and give them some .MAT file with data in it. I ask them to analyze it, without further instruction. I can very quickly get a feel for their actual experience.
http://blogs.mathworks.com/videos/2008/07/02/puzzler-data-exploration/
The actual challenge does not mean much of anything, I learn more from watching them attempt it.
Are they making scripts, functions, command line or GUI based? Do they seem to have a clear idea where they are going with it? What level of confidence do they have with what they are doing?
Are they computer scientists or an engineer that learned to program. CS majors tend to do things like close their parenthesis immediately, and other small optimizations like that. People that have been using MATLAB a while tend to capture the handles from plotting commands for later use.
How quickly do they navigate the documentation? Once I see they are going down the 'right' path then I will just change the challenge to see how quickly they can do plots, pull out submatrices etc...
I will throw out some old stuff from Project Euler. Mostly just ramp up the questions until one of us is stumped.
Floating Point Questions
Given that Matlab's main (only?) data type is the double precision floating point matrix, and that most people use floating point arithmetic -- whether they know it or not -- I'm astonished that nobody has suggested asking basic floating point questions. Here are some floating point questions of variable difficulty:
What is the range of |x|, an IEEE dp fpn?
Approximately how many IEEE dp fpns are there?
What is machine epsilon?
x = 10^22 is exactly representable as a dp fpn. What are the fpns xp
and xs just below and just above x ?
How many dp fpns are in [1,2)? How many atoms are on an edge of a
1-inch sugar cube?
Explain why sin(pi) ~= 0, but cos(pi) = -1.
Why is if abs(x1-x2) < 1e-10 then a bad convergence test?
Why is if f(a)*f(b) < 0 then a bad sign check test?
The midpoint c of the interval [a,b] may be calculated as:
c1 = (a+b)/2, or
c2 = a + (b-a)/2, or
c3 = a/2 + b/2.
Which do you prefer? Explain.
Calculate in Matlab: a = 4/3; b = a-1; c = b+b+b; e = 1-c;
Mathematically, e should be zero but Matlab gives e = 2.220446049250313e-016 = 2^(-52), machine epsilon (eps). Explain.
Given that realmin = 2.225073858507201e-308, and Matlab's u = rand gives a dp fpn uniformly distributed over the open interval (0,1):
Are the floating point numbers [2^(-400), 2^(-100), 2^(-1)]
= 3.872591914849318e-121, 7.888609052210118e-031, 5.000000000000000e-001
equally likely to be output by rand ?
Matlab's rand uses the Mersenne Twister rng which has a period of
(2^19937-1)/2, yet there are only about 2^64 dp fpns. Explain.
Find the smallest IEEE double precision fpn x, 1 < x < 2, such that x*(1/x) ~= 1.
Write a short Matlab function to search for such a number.
Answer: Alan Edelman, MIT
Would you fly in a plane whose software was written by you?
Colin K would not hire me (and probably fire me) for saying "that
Matlab's main (only?) data type is the double precision floating
point matrix".
When Matlab started that was all the user saw, but over the years
they have added what they coyly call 'storage classes': single,
(u)int8,16,32,64, and others. But these are not really types
because you cannot do USEFUL arithmetic on them. Arithmetic on
these storage classes is so slow that they are useless as types.
Yes, they do save storage but what is the point if you can't do
anything worthwhile with them?
See my post (No. 13) here, where I show that arithmetic on int32s is 12 times slower than
double arithmetic and where MathWorkser Loren Shure says "By
default, MATLAB variables are double precision arrays. In the olden
days, these were the ONLY kind of arrays in MATLAB. Back then even
character arrays were stored as double values."
For me the biggest flaw in Matlab is its lack of proper types,
such as those available in C and Fortran.
By the way Colin, what was your answer to Question 14?
Ask questions about his expertise and experience in applying MATLAB in your domain.
Ask questions about how he would approach designing an application for implementation in MATLAB. If he refers to recent features of MATLAB, ask him to explain them, and how they are different from the older features they replace or supplement, and why they are preferable (or not).
Ask questions about his expertise with MATLAB data structures. Many of the MATLAB 'experts' I've come across are very good at writing code, but very poor at determining what are the best data structures for the job in hand. This is often a direct consequence of their being domain experts who've picked up MATLAB rather than having been trained in computerism. The result is often good code which has to compensate for the wrong data structures.
Ask questions about his experience, if any, with other languages/systems and invite him to expand upon his observations about the relative strengths and weaknesses of MATLAB.
Ask for top tips on optimising MATLAB programs. Expect the answers: vectorisation, pre-allocation, clearing unused variables, etc.
Ask about his familiarity with the MATLAB profiler, debugger and lint tools. I've recently discovered that the MATLAB 'expert' over in the corner here had never, in 10 years using the tool, found the profiler.
That should get you started.
I. I think this recent SO question
on indexing is a very good question
for an "expert".
I have a 2D array, call it 'A'. I have
two other 2D arrays, call them 'ix'
and 'iy'. I would like to create an
output array whose elements are the
elements of A at the index pairs
provided by x_idx and y_idx. I can do
this with a loop as follows:
for i=1:nx
for j=1:ny
output(i,j) = A(ix(i,j),iy(i,j));
end
end
How can I do this without the loop? If
I do output = A(ix,iy), I get the
value of A over the whole range of
(ix)X(iy).
II. Basic knowledge of operators like element-wise multiplication between two matrices (.*).
III. Logical indexing - generate a random symmetric matrix with values from 0-1 and set all values above T to 0.
IV. Read a file with some properly formatted data into a matrix (importdata)
V. Here's another sweet SO question
I have three 1-d arrays where elements
are some values and I want to compare
every element in one array to all
elements in other two.
For example:
a=[2,4,6,8,12]
b=[1,3,5,9,10]
c=[3,5,8,11,15]
I want to know if there are same
values in different arrays (in this
case there are 3,5,8)
Btw, there's an excellent chance your interviewee will Google "MATLAB interview questions" and see this post :)
Possible question:
I have an array A of n R,G,B triplets. It is a 3xn matrix. I have another array B in the form 1xn which stores an index value (association to a cluster) for each triplet.
How do I plot the triplets of A in 3D space (using plot3 function), coloring each triplet according to its index in B? (The goal is to qualitatively evaluate my clustering)
Really, really good programmers who are MATLAB novices won't be able to give you an efficient (== MATLAB style) solution. However, it is a very simple problem if you do know your MATLAB.
Depends a bit what you want to test.
To test MATLAB fluency, there are several nice Stack Overflow questions that you could use to test e.g. array manipulations (example 1, example 2), or you could use fix-this problems like this question (I admit, I'm rather fond of that one), or look into this list for some highly MATLAB-specific stuff. If you want to be a bit mean, throw in a question like this one, where the best solution is a loop, and the typical MATLAB-way-of-thinking solution would just fill up the memory.
However, it may be more useful to ask more general programming questions that are related to your area of work and see whether they get the problem solved with MATLAB.
For example, since I do image analysis, I may ask them to design a class for loading images of different formats (a MATLAB expert should know how to do OOP, after all, it has been out for two years now), and then ask follow-ups as to how to deal with large images (I want to see a check on how much memory would be used - or maybe they know memory.m - and to hear about how MATLAB usually works with doubles), etc.