Evolutionary Algorithms use a fitness function to select candidates for survival across generations ("survival of the fittest"). I believe all fitness functions assume that the closer the candidate's value is to the desired value, the closer its input ("key") must be to the desired input.
Cryptographic Hash Functions have the property that "it is infeasible to generate a message that has a given hash". I understand this to mean that there is little or no correlation between the "closeness" of values to the "closeness" of keys.
Putting these two together, doesn't that imply that the "survival of the fittest" assumption is wrong for Cryptographic Hash Functions? Meaning, if you wanted to use Evolutionary Algorithms to try to figure out the reverse of a Cryptographic Hash value, the fitness function would drive you in the wrong direction. Is the correlation between "closeness" of values and "closeness" of keys a prerequisite of Evolutionary Algorithms?
Yes, it's pretty much impossible to construct a fitness function that consistently tells you that candidate A is closer to the goal than candidate B based only on the hash values of A, B, and the goal, for a (good) cryptographic hash function. That follows from the property you mentioned. So evolutionary algorithms can't speed up reversing cryptographic hash functions in the average case. However, this shouldn't be a surprise: said property is only useful in the first place because it breaks precisely the approach of evolutionary algorithms (speeding up reversal by looking at hash-value similarity).
Generalizing this, evolutionary algorithms (like all other algorithms that rely on a heuristic to guide their search, e.g. A*) are only useful if you can define a meaningful fitness function (heuristic)†. Obviously, it is possible to construct problems that don't allow this (e.g. by giving too little information), and it's entirely probable that there are real-world applications with the same problem. Evolutionary algorithms don't cure cancer, but again that's no surprise (nothing does, else we'd have moved on to a different metaphor).
†On a side note, this fitness function doesn't have to be closeness to any particular value, there are many problems where the fitness can grow indefinitely, e.g. when optimizing code for performance the fitness could be the number of operations per second.
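To make the point concrete, here is a small Python sketch (my illustration, not part of the original answer): it measures the Hamming distance between SHA-256 digests of nearby inputs and shows that "close" inputs do not give "close" hashes, so a distance-based fitness offers no gradient to follow.

```python
# Illustrative sketch: distance between hash digests does not track distance
# between inputs, so it cannot serve as a useful fitness signal.
import hashlib

def bit_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

target = hashlib.sha256(b"secret-42").digest()

for candidate in (b"secret-41", b"secret-43", b"totally-unrelated"):
    d = bit_distance(hashlib.sha256(candidate).digest(), target)
    print(candidate, d)  # all distances hover around 128 of 256 bits
```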
Cryptographic Hash Functions have the property that "it is infeasible to generate a message that has a given hash". I understand this to mean that there is little or no correlation between the "closeness" of values to the "closeness" of keys.
Your understanding of the "closeness" of values vs. keys is correct. In fact, that is the primary purpose of hash functions, and it is why evolutionary algorithms won't work well here.
However, that is not why "it is infeasible to generate a message that has a given hash". It's because hash functions are not one-to-one. For example, it's possible that hash(a) = key = hash(b). So if you are given a key, there is no way of telling whether the original message was a or b.
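A toy sketch of that many-to-one point (mine, not part of the original answer): with a deliberately tiny 1-byte "hash", distinct messages quickly collide, so the digest alone cannot identify the original message. Real digests are of course far larger, but the pigeonhole argument is the same.

```python
# Toy sketch: a 1-byte "hash" has only 256 possible values, so among 1000
# distinct messages a collision is guaranteed and found quickly.
import hashlib

def tiny_hash(msg: bytes) -> bytes:
    return hashlib.sha256(msg).digest()[:1]  # keep only the first byte

seen = {}
for i in range(1000):
    msg = f"message-{i}".encode()
    h = tiny_hash(msg)
    if h in seen:
        print(f"collision: {seen[h]!r} and {msg!r} both hash to {h.hex()}")
        break
    seen[h] = msg
```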
I collected data from different sources (Facebook, Twitter, LinkedIn) and put it into a structured format. As a result I now have a CSV file with 10000 rows (10000 people), and the associated data is about their names, ages, interests, and buying habits.
I'm really stuck on this step: CLASSIFICATION or CLUSTERING. For classification I don't really have predefined classes or a model for classifying my users.
For clustering: I started calculating similarities and running k-means, but I still can't get the result I wanted. How can I decide which to choose before moving on to the next step, collaborative filtering?
First and foremost, you have to understand that clustering is a pre-processing activity/task. The idea in clustering is to identify objects with similar properties and group them. The process can be understood in terms of cattle herding, wherein a herder gathers loose cattle (read: data points) into groups.
Note: the partitioning family of clustering algorithms includes k-means, k-modes, k-prototypes, etc. K-means works only for numerical data, k-modes works only for categorical data, and k-prototypes works for both numerical and categorical data.
Question: Is the data preprocessed? If the answer is no, then you may try the following steps:
Is the data (the column values) all in categorical (i.e., text) format, all numerical, or mixed?
a. If all categorical, then discretize, bin, or interval-scale the values.
b. If mixed, then discretize, bin, or interval-scale the categorical values only.
c. Perform missing-value and outlier treatment for both numerical and categorical data. This will help in retaining maximum variance as well as reducing dimensionality.
d. Normalize the numerical values, e.g. to a median of zero.
Now apply a suitable clustering algorithm (based on your problem) to determine patterns. Once you have found the patterns, you may label them. Once the identified patterns are labelled, a classification algorithm can then be used to classify any new incoming data points into the appropriate class.
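For concreteness, here is a minimal sketch of that pipeline in Python with scikit-learn (not part of the original answer; the column names "age", "interests", "buying_habits" and the file name "users.csv" are assumptions for illustration):

```python
# Hedged sketch: scale numeric columns, one-hot encode categorical ones,
# then run k-means and attach the resulting cluster labels.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

df = pd.read_csv("users.csv")  # hypothetical file with the 10000 rows

numeric_cols = ["age"]
categorical_cols = ["interests", "buying_habits"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),                            # scale numeric features
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # encode categoricals
])

model = make_pipeline(preprocess, KMeans(n_clusters=5, n_init=10, random_state=0))
df["cluster"] = model.fit_predict(df)  # cluster labels you can then inspect and name
```

Once you have named the clusters, those names become the class labels for training a classifier on new incoming users.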
Matlab's optimproblem class allows users to define Integer Linear Programming (ILP) problems using symbolic variables. This is dubbed the "problem-based" formulation. Internal methods take care of setting up the detailed ILP formulation by assembling the coefficient arrays and matrices for the objective function, equality constraints, and inequality constraints. In Matlab, these details are referred to as the "structure" for the "solver-based" formulation.
Users can see the order in which the optimproblem.Variables are taken in setting up the solver-based formulation by using prob2struct to explicitly convert an optimizationproblem object into a solver-based structure. According to the Algorithms section of the prob2struct page, the variables are taken in the order in which they appear in the optimizationproblem.Variables property.
I haven't been able to find what determines this order. Is there any way to control the order, maybe even change it if necessary? This would allow one to control the order of the scalar variables in the archetypal ILP problem setup, i.e., the solver-based formulation.
Thanks.
Reason for this question
I'm using Matlab as a prototyping environment, and I may be relying on others to develop based on the prototype, possibly calling other solver engines. An uncontrolled ordering of variables makes the formulations hard to compare, especially if the downstream development arranges its variables in a deterministic way. Hence my wish to control the variable ordering. If this is not possible, it would be good to know; I would then turn my attention entirely to mitigating the challenge of disparately ordered variables.
I have read many tutorials and papers, and I understand the concept of a Genetic Algorithm, but I have some problems implementing it in Matlab.
In summary, I have:
A chromosome containing three genes [ a b c ] with each gene constrained by some different limits.
Objective function to be evaluated to find the best solution
What I did:
Generated random values of a, b and c for a population of, say, 20 individuals, i.e.
[a1 b1 c1] [a2 b2 c2]…..[a20 b20 c20]
For each solution, I evaluated the objective function and ranked the solutions from best to worst.
Difficulties I faced:
Now, why should we go for crossover and mutation? Is the best solution I found not enough?
I know the concept of doing crossover (generating random number, probability…etc) but which parents and how many of them will be selected to do crossover or mutation?
Should I do the crossover for the entire 20 solutions (parents) or only two of them?
Generally, a Genetic Algorithm is used to find a good solution to a problem with a huge search space, where finding an exact solution is either very difficult or impossible. Obviously I don't know the range of your values, but since you have only three genes it's likely that a good solution will be found by a Genetic Algorithm (or even a simpler search strategy) without any additional operators. Selection and crossover are usually carried out on all chromosomes in the population (although it's not uncommon to carry some of the best from each generation forward as-is). The general idea is that the fitter chromosomes are more likely to be selected and undergo crossover with each other.
Mutation is usually used to stop the Genetic Algorithm prematurely converging on a non-optimal solution. You should analyse the results without mutation to see if it's needed. Mutation is usually run on the entire population, at every generation, but with a very small probability. Giving every gene 0.05% chance that it will mutate isn't uncommon. You usually want to give a small chance of mutation, without it completely overriding the results of selection and crossover.
As has been suggested, I'd do a little bit more general background reading on Genetic Algorithms to get a better understanding of the concepts.
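To tie the pieces together, here is a minimal sketch (my illustration, not a reference implementation) of the loop described above: tournament selection over the whole population, single-point crossover, a small per-gene mutation rate, and keeping the best individual from each generation. The bounds and the objective function are placeholders you would replace with your own.

```python
# Minimal GA sketch for a 3-gene chromosome [a, b, c] with per-gene bounds.
import random

BOUNDS = [(0.0, 10.0), (-5.0, 5.0), (0.0, 1.0)]   # assumed limits for a, b, c
POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 100, 0.05

def objective(chrom):
    a, b, c = chrom
    return -((a - 3) ** 2 + (b + 1) ** 2 + (c - 0.5) ** 2)  # placeholder: higher is better

def random_chrom():
    return [random.uniform(lo, hi) for lo, hi in BOUNDS]

def tournament(pop):
    return max(random.sample(pop, 3), key=objective)   # fitter chromosomes more likely to win

def crossover(p1, p2):
    point = random.randint(1, len(p1) - 1)             # single-point crossover
    return p1[:point] + p2[point:]

def mutate(chrom):
    return [random.uniform(lo, hi) if random.random() < MUTATION_RATE else g
            for g, (lo, hi) in zip(chrom, BOUNDS)]

pop = [random_chrom() for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    best = max(pop, key=objective)                     # elitism: carry the best forward as-is
    children = [mutate(crossover(tournament(pop), tournament(pop)))
                for _ in range(POP_SIZE - 1)]
    pop = [best] + children

print(max(pop, key=objective))
```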
Sharing a bit of advice from the book 'Practical Neural Network Recipes in C++'... It is a good idea to have a significantly larger population for your first epoch; then you're likely to include features which will contribute to an acceptable solution. Later epochs, which can have smaller populations, will then tune and combine these favourable features, or render them obsolete.
And Eiben's handbook chapter on multi-parent recombination seems to indicate that four parents are better than two. However, bed manufacturers have not caught on to this yet and seem to only produce single and double beds.
I want to compute a Fréchet/Gateaux derivative of a function which is not entirely explicit, and my question is: what would be the most efficient way to do it? Which language would you recommend?
More precisely, my problem is that I have a function, say F, which is the square of the Euclidean norm of a sum of products of pairs of multidimensional functions (i.e. functions from R^n to R^k).
AFAIK, if I use Maple or Maxima, they will ask me to make the functions in the formula explicit, whereas I would like to keep them abstract. I therefore need to compute a Fréchet/Gateaux derivative so as to keep the expressions simple. Indeed, when I proceed the standard way, I start by expanding the square of the Euclidean norm into a sum of squares, and there are a lot of indices. Since my goal is a third-order Taylor expansion with integral remainder, the expression becomes, in my view, humanly unmanageable (the formula is more than one A4 page long).
So I would prefer to use a Fréchet/Gateaux derivative, which would allow me, among other things, to keep scalar products instead of sums.
As the functions involved have some similarities with their derivatives (due to the presence of exponentials), there is only a small set of rules to know. So I thought that I might build such a single-purpose computer algebra system myself.
And I started to learn LISP, as I read that it would be well suited to my problem, but I am a little bit lost now, since this language is very different and I am still used to thinking in terms of C/Python/Perl...
Here is another question : would you have some links to courses or articles about how an algebra system for symbolic computations is made (preferably in LISP) ? Any suggestions are welcome.
Thank you very much for your answers.
My advice is to use Maxima. Maxima is inspired by Lisp, and implemented in Lisp, so using Maxima will save you a tremendous amount of time and trouble. If Lisp is suitable for your problem, Maxima is even more so.
Maxima will allow you to use undefined terms in an expression; it is not necessary to define all terms.
Post a message to the Maxima mailing list (maxima@math.utexas.edu) to ask for specific advice. Please explain in detail what you are trying to accomplish.
I find this question a little tricky; maybe someone knows an approach to answering it. Imagine that you have a dataset (training data) and you don't know what it is about. Which features of the training data would you look at in order to infer a classification algorithm to classify this data? Can we say anything about whether we should use a non-linear or a linear classification algorithm?
By the way, I am using WEKA to analyze the data.
Any suggestions?
Thank you.
This is in fact two questions in one ;-)
Feature selection
Linear or not
add "algorithm selection", and you probably have three most fundamental questions of classifier design.
As a side note, it's a good thing that you do not have any domain expertise which would have allowed you to guide the selection of features and/or to assert the linearity of the feature space. That's the fun of data mining: inferring such information without a priori expertise. (BTW, while domain expertise is good for double-checking the outcome of the classifier, too much a priori insight may make you miss good mining opportunities.) Without any such a priori knowledge you are forced to establish sound methodologies and apply careful scrutiny to the results.
It's hard to provide specific guidance, in part because many details are left out of the question, and also because I'm somewhat BS-ing my way through this ;-). Nevertheless, I hope the following generic advice will be helpful.
For each algorithm you try (or more precisely for each set of parameters for a given algorithm), you will need to run many tests. Theory can be very helpful, but there will remain a lot of "trial and error". You'll find Cross-Validation a valuable technique.
In a nutshell, [and depending on the size of the available training data] you randomly split the training data into several parts, train the classifier on one [or several] of these parts, and then evaluate the classifier's performance on another [or several] parts. For each such run you measure various indicators of performance, such as the misclassification error (MCE). Aside from telling you how the classifier performs, these metrics, or rather their variability, will provide hints as to the relevance of the features selected and/or their lack of scale or linearity.
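As a concrete (and hedged) illustration of that idea, here is a scikit-learn sketch rather than a WEKA one: the same cross-validation splits are used to compare a linear and a non-linear classifier, and the spread of the fold scores gives you the variability mentioned above. The dataset is a synthetic stand-in.

```python
# Hedged sketch: 5-fold cross-validation comparing a linear and an RBF SVM.
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # stand-in data

for name, clf in [("linear", SVC(kernel="linear")), ("rbf", SVC(kernel="rbf"))]:
    scores = cross_val_score(clf, X, y, cv=5)   # one accuracy score per fold
    print(name, scores.mean(), scores.std())    # mean performance and its variability
```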
Independently of the linearity assumption, it is useful to normalize the values of numeric features. This helps with features which have an odd range etc.
Within each dimension, establish the range within, say, 2.5 standard deviations on either side of the median, and convert the feature values to a percentage on the basis of this range (see the sketch after this list).
Convert nominal attributes to binary ones, creating as many dimensions as there are distinct values of the nominal attribute. (I think many algorithm optimizers will do this for you.)
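A small sketch of my reading of these two steps (the median ± 2.5 sigma rescaling and the one-binary-column-per-value encoding); the toy columns "age" and "colour" are made up for illustration.

```python
# Sketch: clip each numeric feature to median +/- 2.5 standard deviations and
# rescale it to a percentage of that range; one-hot encode nominal attributes.
import pandas as pd

def to_percent_of_robust_range(col: pd.Series) -> pd.Series:
    lo = col.median() - 2.5 * col.std()
    hi = col.median() + 2.5 * col.std()
    return 100 * (col.clip(lo, hi) - lo) / (hi - lo)

df = pd.DataFrame({"age": [18, 25, 40, 90], "colour": ["red", "blue", "red", "green"]})
df["age"] = to_percent_of_robust_range(df["age"])
df = pd.get_dummies(df, columns=["colour"])   # one binary column per distinct value
print(df)
```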
Once you have identified one or a few classifiers with relatively decent performance (say 33% MCE), perform the same series of tests with such a classifier, modifying only one parameter at a time. For example, remove some features and see whether the resulting lower-dimensionality classifier improves or degrades.
The loss factor is a very sensitive parameter. Try to stick with one "reasonable" but possibly suboptimal value for the bulk of the tests, and fine-tune the loss at the end.
Learn to exploit the "dump" info provided by the SVM optimizers. These results provide very valuable info as to what the optimizer "thinks".
Remember that what worked very well with a given dataset in a given domain may perform very poorly with data from another domain...
Coffee's good, but not too much. When all else fails, make it Irish ;-)
Wow, so you have some training data and you don't know whether you are looking at features representing words in a document or genes in a cell, and you need to tune a classifier. Well, since you don't have any semantic information, you are going to have to do this solely by looking at statistical properties of the data sets.
First, to formulate the problem, this is more than just linear vs. non-linear. If you really want to classify this data, what you need to do is select a kernel function for the classifier, which may be linear or non-linear (Gaussian, polynomial, hyperbolic, etc.). In addition, each kernel function may take one or more parameters that need to be set. Determining an optimal kernel function and parameter set for a given classification problem is not really a solved problem; there are only useful heuristics, and if you google 'selecting a kernel function' or 'choose kernel function', you will be treated to many research papers proposing and testing various approaches. While there are many approaches, one of the most basic and well travelled is to do a gradient descent on the parameters: basically, you try a kernel method and a parameter set, train on half your data points and see how you do. Then you try a different set of parameters and see how you do. You move the parameters in the direction of the best improvement in accuracy until you get satisfactory results.
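Here is a hedged sketch of that "try a kernel and parameter set, measure, adjust" loop, done as a simple grid search (with cross-validation in place of a single half/half split) rather than literal gradient steps; the data is a synthetic stand-in.

```python
# Sketch: search over kernel functions and their parameters, scoring each
# candidate by cross-validated accuracy.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, n_features=10, random_state=1)  # stand-in data

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    {"kernel": ["poly"], "C": [1], "degree": [2, 3]},
]
search = GridSearchCV(SVC(), param_grid, cv=5)   # each candidate is cross-validated
search.fit(X, y)
print(search.best_params_, search.best_score_)
```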
If you don't need to go through all this complexity to find a good kernel function, and simply want an answer to linear or non-linear, then the question mainly comes down to two things: non-linear classifiers have a higher risk of overfitting (under-generalizing) since they have more degrees of freedom. They can suffer from the classifier merely memorizing sets of good data points rather than coming up with a good generalization. On the other hand, a linear classifier has less freedom to fit, and in the case of data that is not linearly separable it will fail to find a good decision function and will suffer from high error rates.
Unfortunately, I don't know a better mathematical solution to the question "is this data linearly separable?" other than to just try the classifier itself and see how it performs. For that you are going to need a smarter answer than mine.
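In that pragmatic spirit, here is a small sketch (mine, and only a heuristic): fit a linear classifier with a very large C so that it approximates a hard margin, and check whether it reaches zero training error; if it cannot, the sample is very likely not linearly separable. The dataset below is a synthetic stand-in.

```python
# Heuristic check: zero training error with a (near) hard-margin linear SVM
# implies the sample is linearly separable; failure strongly suggests it is not.
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=5, random_state=2)  # stand-in data

clf = LinearSVC(C=1e6, max_iter=100000)   # large C approximates a hard margin
clf.fit(X, y)
train_error = 1 - clf.score(X, y)
print("linearly separable (on this sample):", train_error == 0.0)
```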
Edit: This research paper describes an algorithm which looks like it should be able to determine how close a given data set comes to being linearly separable.
http://www2.ift.ulaval.ca/~mmarchand/publications/wcnn93aa.pdf