Which is the best clustering algorithm to find outliers? [closed] - cluster-analysis

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
Basically I have some hourly and daily data like
Day 1
Hours,Measure
(1,21)
(2,22)
(3,27)
(4,24)
Day 2
hours,measure
(1,23)
(2,26)
(3,29)
(4,20)
Now I want to find outliers in the data by considering hourly variations and as well as the daily variations using bivariate analysis...which includes hourly and measure...
So which is the best clustering algorithm is more suited to find outlier considering this scenario?
.

one 'good' advice (:P) I can give you is that (based on my experience) it is NOT a good idea to treat time similar to spatial features. So beware of solutions that do this. You probably can start with searching the literature in outlier detection for time-series data.

You really should use a different repesentation for your data.
Why don't you use an actual outlier detection method, if you want to detect outliers?
Other than that, just read through some literature. k-means for example is known to have problems with outliers. DBSCAN on the other hand is designed to be used on data with "Noise" (the N in DBSCAN), which essentially are outliers.
Still, the way you are representing your data will make none of these work very well.

You should use time series based outlier detection method because of the nature of your data (it has its own seasonality, trend, autocorrelation etc.). Time series based outliers are of different kinds (AO, IO etc.) and it's kind of complicated but there are applications which make it easy to implement.
Download the latest build of R from http://cran.r-project.org/. Install the packages "forecast" & "TSA".
Use the auto.arima function of forecast package to derive the best model fit for your data amd pass on those variables along with your data to detectAO & detectIO of TSA functions. These functions will pop up any outlier which is present in the data with their time indexes.
R is also easy to integrate with other applications or just simply run a batch job ....Hope that helps...

Related

How do you do stratified sampling across different groups, when creating train and test sets, in pyspark? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
I am looking for a solution to split my data to Test and Train sets but I want to have all the levels of my categorical variable in both test and train.
My variable has 200 levels and the data is 18 million records. I tried sampleBy function with fractions (0.8) and could get the training set but had difficulties getting the test set since there is no index in Spark and even with creating a key, using left join or subtract is very slow to get the test set!
I want to do a groupBy based on my categorical variable and randomly sample each category and if there is only one observation for that category, put that in the train set.
Is there a default function or library to help with this operation?
A pretty hard problem.
I don't know of an in-built function which will help you get this. Using sampleBy and then so subtraction subtraction would work but as you said - would be pretty slow.
Alternatively, wonder if you can try this*:
Use window functions, add row num and remove everything with rownum=1 into a separate dataframe which you will add into your training in the end.
On the remaining data, using randomSplit (a dataframe function) to divide into training and test
Add the separated data from Step 1 to training.
This should work faster.
*(I haven't tried it before! Would be great if you can share what worked in the end!)

Optimizing total number of cables [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
I need lots of cables of different small sizes (under 100 meters) and cables are only sold in lenghts of 100 meters.
So, to optimize my purchase, I would like a code where I can input the lengths of all pieces of cables that I need. The code will combine my inputs under the constraint the sum is under 100, while minimizing the total number of 100m-length cables that I need to buy.
If anyone could help with a code in VBA, Matlab or Python I would be very grateful.
This is known as a bin-packing problem, and it's actually very difficult (computationally speaking) to find the optimal solution.
However, it is a problem that is practically useful to solve (as you have seen for yourself) and so there are several approaches that seek to find an approximate solution--one that is "good enough" without guaranteeing that it's the best possible solution. I did a quick search and found this course website, which has some examples that may help you out.
If you are looking for an exact solution, you can ask the related question "will I be able to fit the cables I need into N 100-meter cables?". This feasibility problem can be expressed as a "binary program", which is a special case of a "mixed-integer linear program", for which MATLAB has a solver called intlinprog (requires the optimization toolbox).
I'm sorry that I don't have any code to solve your problem, but I hope that this at least gives you some keywords to help you find more resources!
I believe this is like the cutting stock problem. There are some very good methods to solve this. Here is an implementation and some background. It is not too difficult to write an Excel front-end for this (see here).
If you google for "cutting stock problem" you will find lots of references.

ways for speed up MATLAB application [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
i have a question on speed up application built by MATLAB software, i need to know the affect of using vectorization and parallel computation on speed up the application ? and if there is better method than both previous way in such case ? thanks
The first thing you need to do when your MATLAB code runs too slow is to run it in the profiler. In recent versions of MATLAB, this can by done by pressing the "Run and Time" button on the main toolbar. This way, you will now which functions and which lines in these function take up the most time. Once you know this, you may do one of the following, depending on your circumstances and the nature of the particular piece of code:
Think if your algorithm is the most optimal one in terms of O() complexity.
Try turning loops into vector operations. The efficacy of this has declined in recent versions of MATLAB because of improvements in how loops are executed.
If you have a multi-core CPU try using the parallel computing toolbox. If your code parallelizes well, you will get a sped up nearly equal to the number of cores.
If you have an nVidia GPU try using the GPU support. You can get a speed-up by a factor of 10 or more with some problems, but not all problems are amicable to this sort of optimization.
If everything else fails, you may outsource the slowest piece of your code to a low level language like C. See here for how to do this. You could then use low-level profiling tools like Intel vTune to get the absolute maximum speed from the low-level code.
If it is still too slow, you may need to buy an FPGA. See here for a brief tutorial.

Can an artificial neural network predict the outcome of sports games? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 3 years ago.
Improve this question
I was trying to find something original and fun to do with artificial neural networks (ANNs) as a personal/learning project and I though it would be cool if I could predict the results of sports games (especially NHL games).
I'm pretty sure it would be easy to evolve an ANN that can predict which team is most likely to win (usually the team with the better record). However, what I would like to do is create an ANN that would tell how likely the outcome is, similar to bookmaker odds.
Is this something an ANN can do? In the affirmative, what kind of success can I expect? I know I can't beat the bookmaker (at least not with a software solution). I want do this as a recreational project/challenge to myself. I don't expect to bet money on sports games with this project.
Way back in the days of the IBM XT I played with a shareware ANN program to try and improve my chances on the British football (soccer) pools. This is a form of betting where you try and predict which football matches will result in draws. I assigned each team a number then looked back thorough past results and from them generated a single digit for the result. From memory it was 0 from a home win , 1 for an away win and 2 for a draw. Each result went on a single line in a training file. I would then run the training file through the program and generate the ANN settings. I would then look up the following Saturdays matches and feed them into the ANN then look for matches predicted as draws.
As the weeks went on my predictions of draws did definetly become more and more accurate. However ...
1) The XT was so slow that by Christmas it was taking 24 hours to generate the ANN settings from the training data. I really had better things to do with my precious (and expensive) PC.
2) Although it was better at predicting draws it wasn't predicting enough to actually win any money. Looking back I suppose the program had just worked out that Manchester United would always beat Sheffield United. This was more football knowledge than I had but not enough to win any money.
3) Entering the results into the training data and then generating the forthcoming matches data was taking me ages and to be honest sport bores me rigid.
So I gave up and didn't become a millionaire.
These days however PC's are much faster and much of the training data could be scraped from the web. But I still doubt it is a route to a fortune but its certainly an interesting project.
Ian
A reply above stated:
I know that if the bookmakers odds could be beaten by an ANN,
bookmakers would already be using one to fix their odds.
Bookmakers don't set the line based on their analysis of the teams - they set it based on their analysis of the betting public's opinion of the teams. An ideal line for the bookie is where he has exactly the same amount bet on each side of the line - then he is guaranteed a profit = the 'juice' on the losers' bets. They move the line as game approaches to try to keep that 50/50 split. Bookie may think Home team -5 is accurate line based on game analysis, but if he expects that will draw 2x $$ on the Home team he will not set the line at -5 - he will set at -7 or -8 - to where he expects to draw equal $$ for both -5 and +5 bets.
ANNs are really good at pattern matching and prediction, so yes, odds are you could build an ANN that does what you want.
You'll need more than just team win/loss ratio to make it really effective however. Feed it stats for the players, too. For real effectiveness, try to include game-flow information... like which players are on the line for each play (for football, for example).
Ultimately, the biggest problem you'll run into (aside from the whole "writing the ANN" issue) is getting the data you need to feed it.
I've done some stock market predictions with an AI and my conclusion is that it is not very hard to make an AI that gets good results with the historical data.
Making winning transactions in the future is a different ballgame.
I have just worked on this very problem (predicting English Premier League games) for the past 10 days, and ended up with very similar results using 3 different methods: SVM, Logistic Regression, and NN.
LR and NN will give probabilities. SVM outputs 0/1 (but it can be tweaked for probas too (I haven't tried yet).
I needed a "massive" (by my standards at least) feature set though (almost 300) and a good chunk of data (13 years worth).
Re. data, I got it from the web, simply.
Conclusion: I can just about match the bookies in terms of accuracy (predicting victories in my case). If I add the pre-match odds to the feature set, I get the exact same accuracy as the bookies (as expected), but no better (surely meaning my feature set is summarized in the bookies odds, and they have a little extra knowledge on top).
I'm sure there is a way to get better accuracy, either by improving the algos, or more likely by having extremely granular data (as in which players play which games, for how many minutes, and a lot of player-level historical stats, so as to build bottom-up models of team performance).
But bottom line is I can testify NNs work quite well for that purpose. SVM is slightly better though, in my limited experience.
I think it's indeed all about data, but there's no end to what you could feed it with in order to be more accurate : winning/loosing streaks, players biorhythms, player's girlfriends mood before the game, minor/major injuries they suffered in the recent past, extra-sportive events that are bothering the players, etc, etc, etc.
But I don't think you can accurately predict which team is more likely to win, it would be just a more-or-less educated guess.
In my opinion and experience, because of the excessively large number of factors in play, designing and training the ANN will be unreasonably complex and time-consuming. ANNs are good at pattern matching, and game prediction takes much deductive reasoning rather than mere pattern matching.
But if you want to enjoy learning neural networks, it will be a good adventure. If you are successful, you might want to host your code somewhere for others to see and learn!
For game prediction, it would be much easier and faster with decision trees or a rules engine and so on. This will be no easy task either, but it will be another interesting activity.
My belief is that the unpredictability of an event is due to lack of information and understanding...If you have all the knowledge, then yes it could be done. Or, the more knowledge you have, the better it can be done.
So in theory, the answer is yes.
However, in practice, you can get a PhD and have a whole career working on this question and you still may not succeed.

Separation of singing voice from music [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
I want to know how to perform "spectral change detection" for the classification of vocal & non-vocal segments of a song. We need to find the spectral changes from a spectrogram. Any elaborate information about this, particularly involving MATLAB?
Separating out distinct signals from audio is a very active area of research, and it is a very hard problem. This is often called Blind Signal Separation in the literature. (There is some MATLAB demo code in the previous link.
Of course, if you know that there is vocal in the music, you can use one of the many vocal separation algorithms.
As others have noted, solving this problem using only raw spectrum analysis is a dauntingly hard problem, and you're unlikely to find a good solution to it. At best, you might be able to extract some of the vocals and a few extra crossover frequencies from the mix.
However, if you can be more specific about the nature of the audio material you are working with here, you might be able to get a little bit further.
In the worst case, your material would be normal mp3's of regular songs -- ie, a full band + vocalist. I have a feeling that this is the case you are probably looking at given the nature of your question.
In the best case, you have access to the multitrack studio recordings and have at least a full mixdown and an instrumental track, in which case you could extract the vocal frequencies from the mix. You would do this by generating an impulse response from one of the tracks and applying it to the other.
In the middle case, you are dealing with simple music which you could apply some sort of algorithm tuned to the parameters of the music to. For instance, if you are dealing with electronic music, you can use to your advantage the stereo width of the track to eliminate all mono elements (ie, basslines + kicks) to extract the vocals + other panned instruments, and then apply some type of filtering and spectrum analysis from there.
In short, if you are planning on making an all-purpose algorithm to generate clean acapella cuts from arbitrary source material, you're probably biting off more than you can chew here. If you can specifically limit your source material, then you have a number of algorithms at your disposal depending on the nature of those sources.
This is hard. If you can do this reliably you will be an accomplished computer scientist. The most promising method I read about used the lyrics to generate a voice only track for comparison. Again, if you can do this and write a paper about it you will be famous (amongst computer scientists). Plus you could make a lot of money by automatically generating timings for karaoke.
If you just want to decide wether a block of music is clean a-capella or with instrumental background, you could probably do that by comparing the bandwidth of the signal to a normal human singer bandwidth. Also, you could check for the base frequency, which can only be in a pretty limited frequency range for human voices.
Still, it probably won't be easy. However, hearing aids do this all the time, so it is clearly doable. (Though they typically search for speech, not singing)
first sync the instrumental with the original, make sure they are the same length and bitrate and start and end at the exact time and convert them to .wav
then do something like
I = wavread(instrumental.wav);
N = wavread(normal.wav);
i = inv(I);
A = (N - i); // it could be A = (N * i) or A = (N + i) you'll have to play around
wavwrite(A, acapella.wav)
that should do it.. a little linear algebra goes a long way.