Extracting data from a very large dataset using matlab [duplicate] - matlab

This question already has an answer here:
Extracting unique data from large .txt files [closed]
(1 answer)
Closed 9 years ago.
http://tinypic.com/r/2dt5ge1/5
This is a link to a screenshot of the data I want to extract from. The data contains 500,000 records (rows) in total. I want to extract only those rows that have 19 at a particular position.
As you can see in the 9th and 19th rows, after the two 350 values in the middle, there is a 19. I want to extract only these rows. Please help.
Also, how many columns should I create when importing, and in which format (text or numeric)?

The data set is not really large, so I would import everything and then filter. In a numeric format your data is under 500 MB, which should be no problem.
Start here: http://www.mathworks.de/de/help/matlab/import_export/import-numeric-data-from-a-text-file.html
To filter your data quickly, use logical indexing, e.g. data(data(:,4)==19,:), which selects every row where the 4th column is 19.
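As a minimal sketch of the logical-indexing step, assuming the file has already been imported into a numeric matrix named data and that the value of interest sits in column 4 (both are assumptions, and the toy values below are made up):

```matlab
% Toy stand-in for the imported numeric matrix (values are made up).
data = [ 1 350 350 19 7
         2 350 350 12 8
         3 350 350 19 9 ];

col  = 4;                      % assumed column that should contain 19
mask = data(:, col) == 19;     % logical vector, one entry per row
selected = data(mask, :);      % keep only the rows where mask is true
```

Building the mask first makes it easy to check how many rows match (nnz(mask)) before extracting them.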

Related

Am I using PCA in Orange in a correct way?

I am analysing whether 15 books can be grouped according to 6 variables (of the 15 books, 2 were written by one author, 6 by another, and 7 by a third). I counted the number of occurrences of each variable and calculated the percentages. Then I used the Orange software to run PCA. I uploaded the file and selected the columns and rows. When it comes to PCA, the program asks me whether I want to normalize the data, but I am not sure about that because I have already calculated percentages. Is normalizing different from calculating percentages? Moreover, below the normalize button it asks me to "show only: ..." and I have to choose a number between 0 and 100, but I don't really know what it is.
Could you help me understand what I should do? Thank you in advance

How do you do stratified sampling across different groups, when creating train and test sets, in pyspark? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I am looking for a way to split my data into test and train sets such that all the levels of my categorical variable appear in both.
My variable has 200 levels and the data has 18 million records. I tried the sampleBy function with fractions of 0.8 and got the training set, but had difficulty getting the test set: there is no index in Spark, and even after creating a key, using a left join or subtract to get the test set is very slow.
I want to do a groupBy on my categorical variable and randomly sample each category; if a category has only one observation, it should go into the train set.
Is there a default function or library to help with this operation?
A pretty hard problem.
I don't know of a built-in function that will do this for you. Using sampleBy followed by a subtraction would work but, as you said, would be pretty slow.
Alternatively, I wonder if you could try this*:
1. Use window functions to add a row number within each category, and move every row with rownum = 1 into a separate dataframe, which you will add to your training set at the end.
2. On the remaining data, use randomSplit (a dataframe function) to divide it into training and test.
3. Add the separated data from Step 1 to training.
This should work faster.
*(I haven't tried it before! Would be great if you can share what worked in the end!)

Difference between Array and Timeseries [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
I want to save results with a To File block in my MATLAB (Simulink) model.
I just want to know the difference between Array and Timeseries in the Save format field.
Let's start with Array, which is the simplest. With the Array option, the To Workspace block saves only the sample values of your variable, with no time information. (The To File block's Array format is slightly different: it stores a matrix whose first row holds the time stamps and whose remaining rows hold the data.)
If you use Timeseries, the values are written as a timeseries object. This object has several properties, the main ones being Time and Data. So you get not only the values but also the times corresponding to them. It also carries some additional information, such as the interpolation method (see the documentation).
When should I use Array and when Timeseries?
Clearly, if the time points matter to you, you need Timeseries. For example, if your simulation uses a variable time step, the samples will not be uniformly spaced in time, so it helps to have the times too.
Array is useful when the times of the samples are not important, for example if I save only one value of my variable from an Enabled Subsystem.
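The difference can be sketched outside Simulink as follows. This is only an illustration with made-up sample values; the variable names are hypothetical and the time vector merely imitates what a variable-step solver might produce:

```matlab
% Made-up samples at non-uniform time points (as from a variable-step solver).
t = [0; 0.1; 0.35; 1.0];
y = [1; 2; 4; 8];

ts  = timeseries(y, t);   % timeseries object: keeps Time together with Data
arr = y;                  % plain array of values: the time points are lost

dt = diff(ts.Time);       % step sizes can be recovered from the timeseries
```

With only arr there is no way to tell that the samples were unevenly spaced, whereas ts.Time still carries that information.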

How to give the matlab "find" function a hint that the list is sorted? [duplicate]

This question already has answers here:
Faster version of find for sorted vectors (MATLAB)
(5 answers)
Closed 7 years ago.
I want to use the find function in MATLAB to find the index of the first value that is bigger than a number C. The list is very long, so this takes a lot of time to execute. However, the values are sorted in increasing order. How can I take advantage of that property of the data in MATLAB?
find(Data>C,1,'first')
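Set the 'first' option in find. This ensures that find stops looking as soon as it reaches the first element satisfying the criterion. Note, however, that the comparison Data>C is still evaluated over the whole vector before find runs, so this does not exploit the sort order.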
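Because the values are sorted, a binary search is one way to use the ordering. It is a different technique from the find call above, not a built-in option of find. A minimal sketch, assuming Data is sorted in ascending order (the sample values are made up):

```matlab
% Made-up sorted data; Data must be in ascending order for this to work.
Data = [1 3 3 5 8 13 21];
C = 4;

lo = 1;
hi = numel(Data);
idx = [];                       % stays empty if no element exceeds C
while lo <= hi
    mid = floor((lo + hi) / 2);
    if Data(mid) > C
        idx = mid;              % candidate found; look for an earlier one
        hi = mid - 1;
    else
        lo = mid + 1;           % everything up to mid is <= C
    end
end
```

This inspects on the order of log2(numel(Data)) elements instead of comparing every element with C.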

how to calculate mean and variance in online learning [duplicate]

This question already has answers here:
Rolling variance algorithm
(13 answers)
Closed 7 years ago.
How can I calculate the mean and variance in online learning with MATLAB?
Suppose we have a stream of data and receive only 40 samples at a time. I want to update the mean and variance of the data set with each batch of 40 samples.
That is, every time I get 40 samples, I would like to update the mean and variance of all the data received so far. Please note that I cannot store all the data; at any time I can keep only 40 samples.
Thanks a lot.
You might want to calculate a running mean and a running variance. There is a very good tutorial here:
http://www.johndcook.com/blog/standard_deviation/
With these algorithms you don't need to keep all values in memory.
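As a rough sketch of how this could look for batches of 40, the code below uses the batch form of the running update (merging each batch's mean and sum of squared deviations into the totals, often attributed to Chan et al.) rather than the per-sample update in the linked tutorial. The full vector allData exists here only to simulate the stream and to check the result; in a real stream only the current batch and the three running values n, mu, and M2 would be kept. All names are hypothetical.

```matlab
rng(0);                        % reproducible simulated stream
allData = randn(200, 1);       % stand-in for the stream (not stored in practice)
batchSize = 40;

n  = 0;                        % number of samples seen so far
mu = 0;                        % running mean
M2 = 0;                        % running sum of squared deviations from the mean

for k = 1:batchSize:numel(allData)
    batch = allData(k:k + batchSize - 1);   % the only data held in memory

    nb  = numel(batch);
    mub = mean(batch);
    M2b = sum((batch - mub).^2);

    delta = mub - mu;
    nNew  = n + nb;
    M2 = M2 + M2b + delta^2 * n * nb / nNew; % merge squared deviations
    mu = mu + delta * nb / nNew;             % merge means
    n  = nNew;
end

runningVar = M2 / (n - 1);     % sample variance of everything seen so far
```

After each pass through the loop, mu and M2 / (n - 1) are the mean and sample variance of all samples received up to that point.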