Bootstrapping in Matlab - how many original data points are used? - matlab

I have data sets for two groups, with one being much smaller than the other. For that reason, I am using the MatLab bootstrapping function to estimate the performance of the smaller group. I have code that draws on my original data, and it generates 1000 'new' means. However, it is not clear as to how many of the original data points are used each time. Obviously, if all the original data was used, the same mean would continue to be generated.
Can anyone help me out with this?

Bootstrapping comes from sampling with replacement. You'll use the same number of points as the original data, but some of them will be repeated. There are some variants of bootstrapping which work slightly differently, however. See https://en.wikipedia.org/wiki/Bootstrapping_(statistics).

Related

How to plot all the stream lines in paraview?

I am simulating the case "Cavity driven lid" and I try to get all the stream lines with the stream tracer of paraview, but I only get the ones that intersect the reference line, and because of that there are vortices that are not visible. How can I see all the stream-lines in the domain?
Thanks a lot in adavance.
To add a little bit to Mathieu's answer, if you really want streamlines everywhere, then you can create a Stream Tracer With Custom Source (as Mathieu suggested) and set your data to both the Input and the Seed Source. That will create a streamline originating from every point in your dataset, which is pretty much what you asked for.
However, while you can do this, you will probably not be happy with the results. First of all, unless your data is trivially small, this will take a long time to compute and create a large amount of data. Even worse, the result will be so dense that you won't be able to see anything. You will get all those interesting streamlines through vortices, but they will be completely hidden by all the boring streamlines around them.
Thus, you are better off with trying to derive a data set that contains seed points that are likely to trace a stream through the vortices that you are interested in. One thing you might want to try is to compute the vorticity of your vector field (Gradient Of Unstructured Data Set when turning on advanced option Compute Vorticity), find the magnitude of that (Calculator), and then use the Threshold filter to pull out the cells with large vorticity. Then use that as your Seed Source.
Another (probably better) option if your data is 2D or you can extract an interesting surface along the flow of your data is to use the Surface LIC plugin. Details can be found at https://www.paraview.org/Wiki/ParaView/Line_Integral_Convolution.
You have to choose a representative source for your streamline.
You could use a "Sphere Source", so in the StreamTracer properties.
If that fails, you can use a StreamTracerWithCustomSource and use your own source that you will have to create yourself first.

running NN software with my own data

New with Matlab.
When I try to load my own date using the NN pattern recognition app window, I can load the source data, but not the target (it is never on the drop down list). Both source and target are in the same directory. Source is 5000 observations with 400 vars per observation and target can take on 10 different values (recognizing digits). Any Ideas?
Before you do anything with your own data you might want to try out the example data sets available in the toolbox. That should make many problems easier to find later on because they definitely work, so you can see what's wrong with your code.
Regarding your actual question: Without more details, e.g. what your matrices contain and what their dimensions are, it's hard to help you. In your case some of the problems mentioned here might be similar to yours:
http://www.mathworks.com/matlabcentral/answers/17531-problem-with-targets-in-nprtool
From what I understand about nprtool your targets have to consist of a matrix with only one 1 (for the correct class) in either row or column (depending on the input matrix), so make sure that's the case.

Gaussian Mixture Modelling Matlab

Im using the Gaussian Mixture Model to estimate loglikelihood function(the parameters are estimated by the EM algorithm)Im using Matlab...my data is of the size:17991402*1...17991402 data points of one dimension:
When I run gmdistribution.fit(X,2) I get the desired output
But when I run gmdistribution.fit(X,k) for k>2....the code crashes and I get the error"OUT OF MEMORY"..I have also tried an open source code which again gives me the same problem.Can someone help me out here?..Im basically looking for a code which will allow me to use different number of components on such a large dataset.
Thanks!!!
Is it possible for you to decrease the iteration time? The default is 100.
OPTIONS = statset('MaxIter',50,'Display','final','TolFun',1e-6)
gmdistribution.fit(X,3,OPTIONS)
Or you may consider under-sampling the original data.
A general solution to out of memory problem is described in this document.

Collect more than 10,000 data points from a Tektronix oscilloscope?

I am building a MATLAB GUI to do data collection from a Tektronix DPO4104 oscilloscope (MATLAB driver here).
I am playing around with tmtool and with my GUI code and have found that the driver can only collect 10,000 data points, regardless of if the oscilloscope is set to show more than 10k points. I found this post on in CCSM but it hasn't been terribly helpful. (I'm the last post on there if you care to read it.) I am using the DPO4104 driver, whereas this post discusses use of the DPO4100 driver, I believe.
As far as I can tell, the steps are:
Edit driver's readwaveform function to account for the current recordLength - in my case, 100,000 points, say.
Manually edit the driver's MaxNumberPoint from 10,000 to 100,000. (In my case, the default number was 0.. I changed this to 100,000).
Manually edit EndingPoint. I set this to 100,000 also.
Before creating a device object, set(interfaceObj, 'InputBufferLength', 2.5*recordLength), that is, make sure the input buffer can fit more than 100,000 points. It's recommended to use at least double the expected buffer. I used 2.5 just because.
Build device object and waveform object, connect() to it, and readwaveform. Profit.
I am still unable to collect more than 10,000 points, either through tmtool or through my GUI. Any help would be appreciated.
I got ahold of a Tektronix engineer; he basically told me just to use the SCPI commands directly and skip the driver. While annoying, this might be the simplest solution.
Is it possible for you to collect the data points 10,000 at a time, then save them somewhere, collect the next 10,000, append them to the saved points, repeat?
It's a work-around, sure.
I figured it out! I think. Taking a couple weeks to step back and refresh really helped. Here's what I did:
1) Edit the driver's init function to configure a larger buffer size. Complete init code:
function init(obj)
% This method is called after the object is created.
% OBJ is the device object.
% End of function definition - DO NOT EDIT
% Extract the interface object.
interface = get(obj, 'Interface');
fclose(interface);
% Configure the buffer size to allow for waveform transfer.
set(interface, 'InputBufferSize', 12e6);
set(interface, 'OutputBufferSize', 12e6); % Originally is set to 50,000
I originally tried to set the the buffer sizes to 22e6 (I wanted to get 10 million points) but I got out-of-memory errors. Supposedly the buffer should be a little more than double what you expect to get out, plus space for headers. I probably don't need 2 million points worth of "header", but eh.
2) Edit the driver's readwaveform() to first query what the user-settable number of points to collect should be. Then, write SCPI commands to the scope to ensure that the number of data points to be transferred is equal to the number of points the user desires. The following snippet will do the trick in readwaveform:
try
% Specify source
fprintf(interface,['DATA:SOURCE ' trueSource]);
%----------BEGIN CODE TO HANDLE MORE THAN 10k POINTS----------
recordLength = query(interface, 'HORizontal:RECordlength?');
fprintf(interface, 'DATA:START 1');
fprintf(interface, 'DATA:STOP %d', str2num(recordLength));
%----------END CODE TO HANDLE MORE THAN 10k POINTS----------
% Issue the curve transfer command.
fprintf(interface, 'CURVE?');
raw = binblockread(interface, 'int16');
% Tektronix scopes send and extra terminator after the binblock.
fread(interface, 1);
3) In the user code, set a SCPI command to change the record size to the underlying interface object:
% interfaceObj is a VISA object.
fprintf(interfaceObj, 'HORizontal:RECordlength 5000000');
There you have it. Hopefully this helps out anyone else that's trying to figure out this issue.
Here's a bad idea. Start collecting 10,000 points. When you get to 5000 points, start collecting data again (you might need to run that in a new thread). Keep going back and forth until all the data you need are stored in 20 some data structures. Then, combine the structures into one structure by lining up the data points. This might be more work than calling the SCPI commands directly and might have some nasty caveats I haven't thought of. But as I said, it's a bad idea...

Merge sensor data for clustering/neural net usage

I have several datasets i.e. matrices that have a 2 columns, one with a matlab date number and a second one with a double value. Here an example set of one of them
>> S20_EavesN0x2DEAir(1:20,:)
ans =
1.0e+05 *
7.345016409722222 0.000189375000000
7.345016618055555 0.000181875000000
7.345016833333333 0.000177500000000
7.345017041666667 0.000172500000000
7.345017256944445 0.000168750000000
7.345017465277778 0.000166875000000
7.345017680555555 0.000164375000000
7.345017888888889 0.000162500000000
7.345018104166667 0.000161250000000
7.345018312500001 0.000160625000000
7.345018527777778 0.000158750000000
7.345018736111110 0.000160000000000
7.345018951388888 0.000159375000000
7.345019159722222 0.000159375000000
7.345019375000000 0.000160625000000
7.345019583333333 0.000161875000000
7.345019798611111 0.000162500000000
7.345020006944444 0.000161875000000
7.345020222222222 0.000160625000000
7.345020430555556 0.000160000000000
Now that I have those different sensor values, I need to get them together into a matrix, so that I could perform clustering, neural net and so on, the only problem is, that the sensor data was taken with slightly different timings or timestamps and there is nothing I can do about that from a data collection point of view.
My first thought was interpolation to make one sensor data set fit another one, but that seems like a messy approach and I was thinking maybe I am missing something, a toolbox or function that would enable me to do this quicker without me fiddling around. To even complicate things more, the number of sensors grew over time, therefore I am looking at different start dates as well.
Someone a good idea on how to go about this? Thanks
I think your first thought about interpolation was the correct one, at least if you plan to use NNs. Another option would be to use approaches which are designed to deal with missing data, like http://en.wikipedia.org/wiki/Dempster%E2%80%93Shafer_theory for example.
It's hard to give an answer for the clustering part, because I have no idea what you're looking for in the data.
For the neural network, beside interpolating there are at least two other methods that come to mind:
training separate networks for each matrix
feeding them all together to the same network, with a flag specifying which matrix the data is coming from, i.e. something like: input (timestamp, flag_m1, flag_m2, ..., flag_mN) => target (value) where the flag_m* columns are mutually exclusive boolean values - i.e. flag_mK is 1 iff the line comes from matrix K, 0 otherwise.
These are the only things I can safely say with the amount of information you provided.