This question already has answers here:
Rolling variance algorithm
(13 answers)
Closed 7 years ago.
How can I calculate the mean and variance online in MATLAB?
Suppose we have a stream of data that arrives in chunks of 40 values at a time. Each time I receive 40 new values, I want to update the mean and variance of all the data received so far. Please note that I cannot store all the data; at any time I can keep only the current 40 values.
Thanks a lot.
You might want to calculate a running mean and a running variance. There is a very good tutorial here:
http://www.johndcook.com/blog/standard_deviation/
With these algorithms you don't need to keep all values in memory.
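Since the question asks about MATLAB but the idea is language-independent, here is a sketch in Python of the chunked update: keep only a count, a mean, and a sum of squared deviations, and fold in each batch of 40 with Chan et al.'s formula for combining two Welford-style summaries (the class name `RunningStats` is my own):

```python
class RunningStats:
    """Running mean/variance over a stream, updated one chunk at a time.

    Keeps only three numbers (count, mean, and M2 = sum of squared
    deviations), so the full history never has to be stored.
    """
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, chunk):
        """Fold one batch (e.g. 40 values) into the running summary."""
        m = len(chunk)
        if m == 0:
            return
        chunk_mean = sum(chunk) / m
        chunk_m2 = sum((x - chunk_mean) ** 2 for x in chunk)
        delta = chunk_mean - self.mean
        total = self.n + m
        # Chan et al.'s formula for merging two summaries
        self.mean += delta * m / total
        self.m2 += chunk_m2 + delta ** 2 * self.n * m / total
        self.n = total

    def variance(self):
        """Sample variance of everything seen so far."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0
```

After each call to `update` with a new batch of 40, `mean` and `variance()` reflect all data seen so far, without keeping any earlier batch in memory.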
I am analysing whether 15 books can be grouped according to 6 variables (of the 15 books, 2 are written by one author, 6 by another, and 7 by a third). I counted the number of occurrences of each variable and converted the counts to percentages. Then I used the Orange software to run PCA: I uploaded the file and selected the columns and rows. When it comes to PCA, the program asks me whether I want to normalize the data, but I am not sure about that because I have already calculated percentages. Is normalizing different from calculating percentages? Moreover, below the normalize button it asks me to "show only:" and I have to choose a number between 0 and 100, but I don't really know what it is.
Could you help me understand what I should do? Thank you in advance.
This question already has answers here:
ELKI Kmeans clustering Task failed error for high dimensional data
(2 answers)
Closed 3 years ago.
I have gone through this question, but the solution doesn't help:
ELKI Kmeans clustering Task failed error for high dimensional data
This is my first time with ELKI, so please bear with me. I have 45000 2D data points (after performing doc2vec) that contain negative values and are not normalized. The dataset looks something like this:
-4.688612 32.793335
-42.990147 -20.499323
-24.948868 -10.822767
-45.502155 -40.917801
27.979715 -40.012688
1.867812 -9.838544
56.284512 6.756072
I am using the K-means algorithm to get 2 clusters. However, I get the following error:
Task failed
de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field AND NumberVector,variable
Available types: DBID DoubleVector,variable,mindim=0,maxdim=1 LabelList
at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:126)
at de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(AbstractAlgorithm.java:81)
at de.lmu.ifi.dbs.elki.workflow.AlgorithmStep.runAlgorithms(AlgorithmStep.java:105)
at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:112)
at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:61)
at [...]
So my question is: does ELKI require the data to be in the range [0,1]? All the examples I came across had their data within that range. Or is it that ELKI does not accept negative values?
If it is something else, can someone please guide me through this?
Thank you!
ELKI can handle negative values just fine.
Your input data is not correctly formatted. It is the same problem as in ELKI Kmeans clustering Task failed error for high dimensional data.
Apparently your lines parse to either 0 or 1 values (note mindim=0, maxdim=1 in the error message) instead of 2. ELKI itself is fine with variable-dimensional data, but k-means requires the data to be in an R^d vector space, hence ELKI cannot run k-means on your data set. You may want to double-check your file: there is probably at least one line that is not properly formatted.
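If it helps, a quick way to locate such lines is to check that every line of the input file parses to exactly two numbers. A generic Python sketch (the whitespace-separated format is an assumption based on the sample shown in the question):

```python
def find_bad_lines(path, expected_dims=2):
    """Return (line_number, line) pairs whose line does not parse to
    exactly `expected_dims` floats, e.g. because of a stray separator
    or a missing value."""
    bad = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                values = [float(x) for x in line.split()]
            except ValueError:
                values = []  # a field was not numeric
            if len(values) != expected_dims:
                bad.append((lineno, line.rstrip("\n")))
    return bad
```

Running this over the data file reports the exact line numbers ELKI could not parse into 2D vectors.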
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
I am looking for a way to split my data into test and train sets, but I want all levels of my categorical variable to be present in both test and train.
My variable has 200 levels and the data has 18 million records. I tried the sampleBy function with fractions of 0.8 and could get the training set, but I had difficulty getting the test set: there is no row index in Spark, and even after creating a key, using a left join or subtract to get the test set is very slow!
I want to do a groupBy on my categorical variable and randomly sample within each category, and if there is only one observation for a category, put it in the train set.
Is there a default function or library to help with this operation?
A pretty hard problem.
I don't know of a built-in function that will do this for you. Using sampleBy and then subtracting would work, but as you said, it would be pretty slow.
Alternatively, you could try this*:
1. Use a window function to add a row number within each category, and move every row with rownum = 1 into a separate DataFrame that you will add back to your training set at the end.
2. On the remaining data, use randomSplit (a DataFrame function) to divide it into training and test sets.
3. Add the separated data from step 1 to the training set.
This should run faster.
*(I haven't tried it before! It would be great if you could share what worked in the end!)
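In PySpark the first step would use Window.partitionBy plus row_number, but the core logic of the three steps can be sketched in plain Python (the function name, the key callback, and the 0.8 fraction are illustrative):

```python
import random
from collections import defaultdict

def stratified_split(rows, key, train_frac=0.8, seed=42):
    """Split rows into train/test so every category appears in train.

    Step 1: the first (shuffled) row of each category is held out,
            mirroring the 'rownum == 1' window-function trick.
    Step 2: the remaining rows are split randomly by train_frac.
    Step 3: the held-out rows are added back to the training set.
    """
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for row in rows:
        by_cat[key(row)].append(row)

    train, test = [], []
    for cat_rows in by_cat.values():
        rng.shuffle(cat_rows)
        train.append(cat_rows[0])        # steps 1 and 3
        for row in cat_rows[1:]:         # step 2
            (train if rng.random() < train_frac else test).append(row)
    return train, test
```

Every category is guaranteed at least one row in train (so singleton categories always land there), and categories with more rows usually appear in test as well; in Spark the same guarantee comes from routing the rownum = 1 rows to training.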
This question already has answers here:
Faster version of find for sorted vectors (MATLAB)
(5 answers)
Closed 7 years ago.
I want to use the find function in MATLAB to get the index of the first value that is bigger than a number C. The list is very long, and the call takes a lot of time to execute. The values are sorted in increasing order. How can I take advantage of that property of the data in MATLAB?
find(Data>C,1,'first')
Set the 'first' switch in find. This ensures that find stops looking as soon as it reaches the first element satisfying the criterion.
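Note, though, that find(Data>C,1,'first') still evaluates the comparison Data>C over the whole array. Since the data is sorted, a binary search finds the first element greater than C in O(log n); the duplicate linked above discusses this for MATLAB. The idea, sketched in Python (bisect_right returns the 0-based index of the first element strictly greater than C; a MATLAB equivalent would be a hand-written binary search loop):

```python
from bisect import bisect_right

def first_greater(sorted_data, c):
    """0-based index of the first element strictly greater than c,
    or None if every element is <= c. O(log n) for sorted input."""
    i = bisect_right(sorted_data, c)
    return i if i < len(sorted_data) else None
```

On a long sorted vector this avoids touching most of the elements, unlike the linear scan behind find.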
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Basically I have some hourly data recorded over several days, like:
Day 1
Hours,Measure
(1,21)
(2,22)
(3,27)
(4,24)
Day 2
Hours,Measure
(1,23)
(2,26)
(3,29)
(4,20)
Now I want to find outliers in the data by considering the hourly as well as the daily variations, using bivariate analysis on (hour, measure).
So which clustering algorithm is best suited to finding outliers in this scenario?
One piece of 'good' advice (:P) I can give you, based on my experience, is that it is NOT a good idea to treat time like a spatial feature, so beware of solutions that do this. You could start by searching the literature on outlier detection for time-series data.
You really should use a different representation for your data.
Why don't you use an actual outlier detection method, if you want to detect outliers?
Other than that, just read through some literature. k-means for example is known to have problems with outliers. DBSCAN on the other hand is designed to be used on data with "Noise" (the N in DBSCAN), which essentially are outliers.
Still, the way you are representing your data will make none of these work very well.
You should use a time-series-based outlier detection method because of the nature of your data (it has its own seasonality, trend, autocorrelation, etc.). Time-series outliers come in different kinds (additive outliers, innovational outliers, etc.), and the topic is somewhat involved, but there are applications that make it easy to implement.
Download the latest build of R from http://cran.r-project.org/ and install the packages "forecast" and "TSA".
Use the auto.arima function of the forecast package to derive the best model fit for your data, and pass those model parameters along with your data to the detectAO and detectIO functions of TSA. These functions will report any outliers present in the data along with their time indexes.
R is also easy to integrate with other applications, or you can simply run it as a batch job. Hope that helps!
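The ARIMA-based route above needs R. As a much-simplified illustration of the same residual idea in plain Python, one can fit a per-hour seasonal mean and flag points whose residual exceeds k standard deviations (the function name and the k=3 threshold are my own choices, and the real detectAO/detectIO do considerably more):

```python
from statistics import mean, stdev

def seasonal_outliers(series, period, k=3.0):
    """Flag indexes whose residual from the per-hour (seasonal) mean
    exceeds k standard deviations of all residuals.

    series: flat list of measures, e.g. day 1 hours then day 2 hours.
    period: number of hours per day.
    """
    # seasonal component: mean of each hour-of-day across all days
    seasonal = [mean(series[h::period]) for h in range(period)]
    residuals = [x - seasonal[i % period] for i, x in enumerate(series)]
    sigma = stdev(residuals)
    if sigma == 0:
        return []
    return [i for i, r in enumerate(residuals) if abs(r) > k * sigma]
```

This captures only the hour-of-day seasonality, not trend or autocorrelation, which is exactly what the ARIMA-based tools add on top.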