How to find the "optimal" cut-off point (threshold) - scipy

I have a set of weighted features for machine learning. I'd like to reduce the feature set and just use those with a very large or very small weight.
So, given the image of sorted weights below, I'd only like to use the features that have weights above the higher or below the lower yellow line.
What I'm looking for is some kind of slope change detection so I can discard all the features until the first/last slope coefficient increase/decrease.
While I (think I) know how to code this myself (with first and second numerical derivatives), I'm interested in any established methods. Perhaps there's some statistic or index that computes something like that, or anything I can use from SciPy?
Edit:
At the moment, I'm using 1.8*positive.std() as positive and 1.8*negative.std() as negative threshold (fast and simple), but I'm not mathematician enough to determine how robust this is. I don't think it is, though.

If the data are (approximately) Gaussian distributed, then just using a multiple
of the standard deviation is sensible.
If you are worried about heavier tails, then you may want to base your analysis on order
statistics.
Since you've plotted it, I'll assume you're willing to sort all of the
data.
Let N be the number of data points in your sample.
Let x[i] be the i'th value in the sorted list of values.
Then 0.5*(x[int(0.8413*N)] - x[int(0.1587*N)]) is an estimate of the standard deviation
which is more robust against outliers. This estimate of the std can be used as you
indicated above. (The magic numbers above are the fractions of Gaussian-distributed data
that are less than [mean+1sigma] and [mean-1sigma] respectively.)
There are also conditions where just keeping the highest 10% and lowest 10% would be
sensible as well; and these cutoffs are easily computed if you have the sorted data
on hand.
These are somewhat ad hoc approaches based on the content of your question.
The general sense of what you're trying to do is (a form of) anomaly detection,
and you can probably do a better job of it if you're careful in defining/estimating
what the shape of the distribution is near the middle, so that you can tell when
the features are getting anomalous.
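The order-statistic estimate and the percentile cutoffs described above could be sketched roughly as follows. This is only an illustration with NumPy: the function names (robust_sigma, select_extreme, select_tails) are made up for this example, and centering the cutoff on the median is an assumption, not something stated in the question.

```python
import numpy as np

def robust_sigma(values):
    """Order-statistic estimate of the standard deviation.

    Half the distance between the 84.13% and 15.87% order statistics,
    which equals one sigma for Gaussian data.
    """
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    upper = x[min(int(0.8413 * n), n - 1)]
    lower = x[int(0.1587 * n)]
    return 0.5 * (upper - lower)

def select_extreme(values, k=1.8):
    """Mask of weights further than k robust sigmas from the median.

    Assumption: the median is used as the center of the distribution.
    """
    w = np.asarray(values, dtype=float)
    return np.abs(w - np.median(w)) > k * robust_sigma(w)

def select_tails(values, fraction=0.10):
    """Mask keeping only the lowest and highest `fraction` of the weights."""
    w = np.asarray(values, dtype=float)
    low, high = np.quantile(w, [fraction, 1.0 - fraction])
    return (w < low) | (w > high)
```

Either mask can then be used to index the feature array, e.g. features[select_extreme(weights)]. The multiplier k plays the same role as the 1.8 in the question and still has to be chosen by hand.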

Related

Mutate weights and biases in a neural network through genetic algorithm

I have a genetic algorithm evolving a population of neural networks.
Until now I have mutated weights and biases using random.randn (Python), which draws a random value from a normal distribution with mean = 0.
It works "well" and I managed to complete my project using it, but wouldn't it be better to use a uniform distribution on a given interval?
My intuition is that it would lead to more variety in my networks.
I don't think this question has a simple answer. With a normal distribution, values near the mean are more likely to be drawn by your number generator, whereas a uniform distribution gives every value in the interval an equal chance. That much is clear, but whether equal chances lead to better results can, in my opinion, only be settled by experiment. So I suggest you run experiments with both the normal and the uniform distribution and judge by the results.
About variety: I assume you create a random vector that represents the weights, and at the mutation stage you add a random number to it. With a normal distribution this number is most likely to come from a narrow interval around the mean, so with mean 0 a mutation will, with high probability, change the affected elements only slightly. The vector therefore improves in small steps, and only occasionally does a big change show up. With a uniform distribution the changes are more random, which leads to more varied individuals. The question is whether these individuals will be better. I don't know, but I can offer another view. I see genetic algorithms as an analogy to the theory of evolution, and from that point of view an accumulation of small improvements, with a small probability of a big change, is more appropriate. Consider what happens with a uniform distribution: children have low fitness because of the big changes, so they are not selected when the new generation is created, and you wait a long time for the one tiny improvement that makes your network give good results.
One more thing: your experiments may show that either the uniform or the normal distribution is better, but such a result may hold only for your current problem, not in general.
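For the suggested experiment, a mutation operator that can switch between the two distributions might look like the sketch below. The function name mutate and the parameters scale and rate are hypothetical. One assumption is worth calling out: the uniform interval is chosen so that both distributions have the same standard deviation, which keeps the comparison about the shape of the distribution and not about step size.

```python
import numpy as np

def mutate(weights, rng, kind="normal", scale=0.1, rate=0.2):
    """Return a mutated copy of `weights`.

    Each weight is perturbed with probability `rate`. `scale` is the
    standard deviation of the perturbation for both distributions.
    """
    weights = np.asarray(weights, dtype=float)
    if kind == "normal":
        noise = rng.normal(0.0, scale, size=weights.shape)
    elif kind == "uniform":
        # A uniform distribution on [-a, a] has standard deviation a / sqrt(3).
        half_width = scale * np.sqrt(3.0)
        noise = rng.uniform(-half_width, half_width, size=weights.shape)
    else:
        raise ValueError(f"unknown mutation kind: {kind!r}")
    # Only a random subset of the weights is actually changed.
    mask = rng.random(weights.shape) < rate
    return weights + mask * noise
```

Running the same genetic algorithm with kind="normal" and kind="uniform" over several seeds (rng = np.random.default_rng(seed)) and comparing the fitness curves is one way to carry out the comparison.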

Automatically truncating a curve to discard outliers in matlab

I am generating some data whose plots are as shown below.
In all the plots I get some outliers at the beginning and at the end. Currently I am truncating the first and the last 10 values. Is there a better way to handle this?
I am basically trying to automatically identify the two points shown below.
This is a fairly general problem with lots of approaches; usually you will use some a priori knowledge of the underlying system to make it tractable.
So for instance if you expect to see the pattern above - a fast drop, a linear section (up or down) and a fast rise - you could try taking the derivative of the curve and looking for large values and/or sign reversals. Perhaps it would help to bin the data first.
If your pattern is not so easy to define but you are expecting a linear trend you might fit the data to an appropriate class of curve using fit and then detect outliers as those whose error from the fit exceeds a given threshold.
In either case you still have to choose thresholds - mean, variance and higher order moments can help here but you would probably have to analyse existing data (your training set) to determine the values empirically.
And perhaps, after all that, as Shai points out, you may find that lopping off the first and last ten points gives the best results for the time you spent (cf. Pareto principle).
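The fit-and-threshold idea above can be sketched as follows, in Python with NumPy instead of MATLAB. Everything here is an assumption made for illustration: the function name trim_ends, fitting a straight line to the middle half of the curve only (so the end outliers do not distort the fit), and a threshold of k times a median-based residual scale.

```python
import numpy as np

def trim_ends(y, k=5.0, core=(0.25, 0.75)):
    """Return (start, end) so that y[start:end] excludes end outliers.

    A line is fitted to the central part of the curve; the kept region is
    then grown outwards until a residual exceeds k times the typical
    residual of the central part.
    """
    y = np.asarray(y, dtype=float)
    x = np.arange(y.size)
    lo, hi = int(core[0] * y.size), int(core[1] * y.size)
    slope, intercept = np.polyfit(x[lo:hi], y[lo:hi], 1)
    resid = np.abs(y - (slope * x + intercept))
    # Median absolute residual, scaled to match a Gaussian sigma.
    scale = max(1.4826 * np.median(resid[lo:hi]), np.finfo(float).eps)
    ok = resid <= k * scale
    start, end = lo, hi
    while start > 0 and ok[start - 1]:
        start -= 1
    while end < y.size and ok[end]:
        end += 1
    return start, end
```

The threshold k still has to be chosen, as noted above; it only replaces "drop ten points" with "drop points that are k typical residuals away from the trend".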

KNN classification with categorical data

I'm busy working on a project involving k-nearest neighbor (KNN) classification. I have mixed numerical and categorical fields. The categorical values are ordinal (e.g. bank name, account type). Numerical types are, e.g., salary and age. There are also some binary types (e.g., male, female).
How do I go about incorporating categorical values into the KNN analysis?
As far as I'm aware, one cannot simply map each categorical field to number keys (e.g. bank 1 = 1; bank 2 = 2, etc.), so I need a better approach for using the categorical fields. I have heard that one can use binary numbers. Is this a feasible method?
You need to find a distance function that works for your data. The use of binary indicator variables solves this problem implicitly. This has the benefit of letting you keep your (probably matrix-based) implementation with this kind of data, but a much simpler way, and one appropriate for most distance-based methods, is to just use a modified distance function.
There is an infinite number of such combinations. You need to experiment which works best for you. Essentially, you might want to use some classic metric on the numeric values (usually with normalization applied; but it may make sense to also move this normalization into the distance function), plus a distance on the other attributes, scaled appropriately.
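One of the many possible combinations is sketched below: a range-normalized absolute difference on the numeric fields plus a simple mismatch count on the categorical fields. The names mixed_distance and knn_predict, the parameter cat_weight, and the particular scaling are all assumptions for illustration, not a recommendation for any specific dataset.

```python
import numpy as np

def mixed_distance(a_num, a_cat, b_num, b_cat, num_range, cat_weight=1.0):
    """Distance between two records with numeric and categorical parts.

    Numeric differences are divided by each feature's range; every
    categorical mismatch contributes `cat_weight`.
    """
    a_num = np.asarray(a_num, dtype=float)
    b_num = np.asarray(b_num, dtype=float)
    num_part = np.abs((a_num - b_num) / num_range)
    cat_part = np.asarray(a_cat) != np.asarray(b_cat)
    return num_part.sum() + cat_weight * cat_part.sum()

def knn_predict(query_num, query_cat, train_num, train_cat, labels, k=3):
    """Majority vote among the k training records closest to the query."""
    train_num = np.asarray(train_num, dtype=float)
    train_cat = np.asarray(train_cat)
    labels = np.asarray(labels)
    num_range = train_num.max(axis=0) - train_num.min(axis=0)
    num_range[num_range == 0] = 1.0  # avoid division by zero
    dists = np.array([
        mixed_distance(query_num, query_cat, n, c, num_range)
        for n, c in zip(train_num, train_cat)
    ])
    nearest = np.argsort(dists)[:k]
    values, counts = np.unique(labels[nearest], return_counts=True)
    return values[np.argmax(counts)]
```

The relative weight of the categorical part (cat_weight) is exactly the kind of scaling that has to be tuned by experiment.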
In most real application domains of distance based algorithms, this is the most difficult part, optimizing your domain specific distance function. You can see this as part of preprocessing: defining similarity.
There is much more than just Euclidean distance. There are various set theoretic measures which may be much more appropriate in your case. For example, Tanimoto coefficient, Jaccard similarity, Dice's coefficient and so on. Cosine might be an option, too.
There are whole conferences dedicated to the topics of similarity search - nobody claimed this is trivial in anything but Euclidean vector spaces (and actually, not even there): http://www.sisap.org/2012
The most straightforward way to convert categorical data into numeric data is to use indicator vectors. See the reference I posted in my previous comment.
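As a small illustration of indicator vectors, the sketch below turns one categorical column into one binary column per category using only NumPy. The helper name one_hot is made up for this example.

```python
import numpy as np

def one_hot(column):
    """Encode a 1-D categorical column as indicator (one-hot) vectors.

    Returns the sorted category names and a matrix with one row per
    record and one column per category.
    """
    categories, codes = np.unique(np.asarray(column), return_inverse=True)
    return categories, np.eye(categories.size)[codes]

categories, encoded = one_hot(["bank 1", "bank 2", "bank 1", "bank 3"])
```

With this encoding any two different categories are equally far apart under Euclidean distance, so no artificial order such as bank 1 < bank 2 is introduced.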
Can we use Locality Sensitive Hashing (LSH) + edit distance and assume that every bin represents a different category? I understand that categorical data does not show any order and the bins in LSH are arranged according to a hash function. Finding the hash function that gives a meaningful number of bins sounds to me like learning a metric space.

Distance to nearest palindrome

I'd like an algorithm to provide some kind of measure of how symmetrical a string is. In looking through previous questions, I found one on finding the number of letters that need to be added to a string to turn it into a palindrome. This is close to what I'm looking for but too restrictive in the set of allowable editing operations.
My motivation for this is that I'd like to make an improved version of a video that I put on Youtube called "Numbers are Colorful". The video shows Golden Ratio bases and a couple of other related systems using irrational bases. Surprisingly, one system is completely symmetrical to begin with, but the others exhibit partial symmetry, which I would like to highlight.
Are you looking for repetition or symmetry? So far I have seen no example that points to symmetry, only repetition. 1001010.0010101 is not symmetrical. The two halves are related by a circular shift, i.e. take the first set of digits [1001010], shift it to the left by 1 [0010101] and now you have the right side.
Unless you make it clear what you are trying to identify, this question is too poorly defined to give a sensible answer. If you really mean symmetrical, show me an example of symmetry. You might as well mean "I can see some interesting pattern here" which is so poorly defined it's difficult to quantify.
That said, digital signal processing is the sort of area you might look into for identifying interesting patterns. For example, if you are looking for repetition then I suggest you attempt to use an algorithm designed for detecting repeating patterns.
Consider the digits in your number to be an input signal. Perform frequency analysis on this signal to detect repeating sections of numbers. If you have a strong repeating component in your series of digits this should relate to a strong frequency component in your analysis. You can measure the strength of this pattern from identifying the fundamental frequency by performing the Fourier transform, and summing all of the harmonics for the most significant frequency bin. Divide this by the total energy of the signal and this will give you a measure between 0 and 1 for how "repetitive" the signal is, and will also identify the periodicity of the signal. You may be better off using time-domain algorithms like Autocorrelation, AMDF, or the YIN estimator. (Particularly AMDF)
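As a rough sketch of the time-domain alternative mentioned above, the following computes a normalized autocorrelation of the digit sequence and reports the lag with the strongest match. It is plain autocorrelation, not AMDF or YIN, and the function name repetition_score is invented for this example.

```python
import numpy as np

def repetition_score(digits):
    """Return (score, period) from the normalized autocorrelation.

    score is at most 1 and reaches 1 when the sequence repeats exactly
    with the returned period; period is the lag of the best match.
    """
    x = np.asarray(digits, dtype=float)
    x = x - x.mean()
    best_score, best_lag = 0.0, 0
    for lag in range(1, x.size // 2 + 1):
        a, b = x[:-lag], x[lag:]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
        score = np.dot(a, b) / denom if denom > 0 else 0.0
        # The small margin makes the shortest period win ties.
        if score > best_score + 1e-12:
            best_score, best_lag = score, lag
    return best_score, best_lag
```

For a sequence such as 100100100100 this picks out a period of 3 with a score close to 1; sequences without a repeating pattern give a lower score.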
A similar approach can be adopted if you were to consider actual symmetry (i.e. the numbers are still very similar when you reverse them). Take your input number, create a new signal by reversing it, and then measure their "sameness" at each discrete phase. If you have a number of length N you could consider padding it with 0's to length 2N before comparing the signal with its reversed self, to allow for digits lying outside the length of the number.
The time-domain techniques are more likely to work because they are not affected so much by discontinuities. They do literally compare "sameness" of a signal by either computing the difference of all the points at each phase or multiplying the numbers together at each phase. In the subtraction case you hope to get to 0 when they are similar. In the multiplication case you hope to get a peak in the function when the numbers are back in phase. They are however more prone to noise (which in this context means the numbers which aren't quite right).
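The reverse-and-compare idea can be sketched as below. This is one possible reading of it: the reversed copy is slid across the original and, at each offset, the digits that coincide are counted. Unlike the zero-padding suggested above, positions outside the overlap are simply treated as non-matching, and the name symmetry_score is hypothetical.

```python
import numpy as np

def symmetry_score(digits):
    """Return (score, offset) for the best match with the reversed sequence.

    score is the largest fraction of positions (relative to the full
    length) at which the sequence and its shifted reverse agree; it is
    1.0 at offset 0 for a palindrome.
    """
    x = np.asarray(digits)
    r = x[::-1]
    n = x.size
    best_score, best_offset = 0.0, 0
    for offset in range(-(n - 1), n):
        if offset >= 0:
            a, b = x[offset:], r[:n - offset]
        else:
            a, b = x[:n + offset], r[-offset:]
        score = np.count_nonzero(a == b) / n
        if score > best_score:
            best_score, best_offset = score, offset
    return best_score, best_offset
```

Counting matches corresponds to the "subtraction" flavour described above (a difference of zero means the digits agree); a multiplication-based correlation could be substituted in the same loop.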

Process for comparing two datasets

I have two datasets at a time (in the form of vectors) and I plot them on the same axes to see how they relate to each other. I specifically look for places where both graphs have a similar shape (i.e. places where both have a seemingly positive/negative gradient over approximately the same intervals). Example:
So far I have been working through the data graphically, but since the amount of data is so large, plotting every time I want to check how two sets correlate takes far too much time.
Are there any ideas, scripts or functions that might help automate this process somewhat?
The first thing you have to think about is the nature of the criteria you want to apply to establish the similarity. There is a wide variety of ways to measure similarity, and the more precisely you can describe what you want "similar" to mean in your problem, the easier it will be to implement, regardless of the programming language.
Having said that, here are some of the things you could look at:
correlation of the two datasets
difference of the derivatives of the datasets (but I don't think it would be robust enough)
spectral analysis as mentioned by @thron of three
etc. ...
Knowing the origin of the datasets and their variability can also help a lot in formulating robust enough algorithms.
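As an example of the first item in the list above, a correlation computed over a sliding window shows where the two vectors move together and where they do not. The sketch uses Python with NumPy; the function name windowed_correlation and the default window length are arbitrary choices for illustration.

```python
import numpy as np

def windowed_correlation(a, b, window=20):
    """Pearson correlation of a and b over each sliding window.

    Entry i covers samples i .. i + window - 1. Windows in which either
    vector is constant are reported as NaN.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = max(min(a.size, b.size) - window + 1, 0)
    out = np.full(n, np.nan)
    for i in range(n):
        wa, wb = a[i:i + window], b[i:i + window]
        if wa.std() > 0 and wb.std() > 0:
            out[i] = np.corrcoef(wa, wb)[0, 1]
    return out
```

Values near +1 mark intervals where both curves rise and fall together, values near -1 mark intervals where they move in opposite directions.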
Sure. Call your two vectors A and B.
1) (Optional) Smooth your data either with a simple averaging filter (Matlab 'smooth'), or the 'filter' command. This will get rid of local changes in velocity ("gradient") that appear to be essentially noise (as in the ascending component of the red trace).
2) Differentiate both A and B. Now you are directly representing the velocity of each vector (Matlab 'diff').
3) Add the two differentiated vectors together (element-wise). Call this C.
4) Look for all points in C whose absolute value is above a certain threshold (you'll have to eyeball the data to get a good idea of what this should be). Points above this threshold indicate highly similar velocity.
5) Now look for where a high positive value in C is followed by a high negative value, or vice versa. In between these two points you will have similar curves in A and B.
Note: a) You could do the smoothing after step 3 rather than after step 1. b) Re 5), you could have a situation in which a 'hill' in your data is at the edge of the vector and so is 'cut in half', and the vectors descend to baseline before ascending in the next hill. Then 5) would misidentify the hill as coming between the initial descent and subsequent ascent. To avoid this, you could also require that the points in A and B in between the two points of velocity similarity have high absolute values.
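Steps 1) to 5) could be translated into Python with NumPy roughly as follows. This is a literal reading of the steps, not a tested method: the function names are invented, step 5 is interpreted as "a strong point immediately followed by the next strong point of opposite sign", and the extra check from note b) is left out.

```python
import numpy as np

def moving_average(x, width=5):
    """Step 1: simple averaging filter (edge values are attenuated)."""
    return np.convolve(x, np.ones(width) / width, mode="same")

def similar_velocity(a, b, threshold, smooth_width=1):
    """Return (strong, turns) for two equally long vectors a and b.

    strong: boolean mask over the differentiated signal where |C| exceeds
            the threshold (step 4).
    turns:  index pairs (i, j) where a strong value of C is followed by
            the next strong value of opposite sign (step 5).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if smooth_width > 1:
        a, b = moving_average(a, smooth_width), moving_average(b, smooth_width)
    c = np.diff(a) + np.diff(b)  # steps 2 and 3
    sign = np.zeros(c.size, dtype=int)
    sign[c > threshold] = 1
    sign[c < -threshold] = -1
    strong = sign != 0
    idx = np.flatnonzero(strong)
    turns = [(int(i), int(j)) for i, j in zip(idx[:-1], idx[1:])
             if sign[i] != sign[j]]
    return strong, turns
```

As in step 4, the threshold has to be picked by looking at the data. Indices refer to the differentiated signal, so index i describes the change between samples i and i + 1.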