I am using generalised mixed effects models (family = binomial) to determine the difference between disturbed and undisturbed substrate - mixed-models

I am using a GLMM to determine differences in soil compaction across 3 locations and 2 seasons in undisturbed and disturbed sites. I used location and season as random effects. My teacher says to use the compaction reading divided by its upper bound as the Y value against the different sites (fixed effect). (I was previously using disturbed and undisturbed sites, coded 1 and 0, as Y against the compaction reading, so the opposite way around.) The random-effect variances are minimal. I was using both glmer and glmmPQL: glmer gives an AIC, and therefore a way to choose the best-fitting model, which glmmPQL does not, while glmmPQL reports all the variance components, which glmer does not. The outcomes are very similar (and match the graphs) when disturbed/undisturbed is Y, but when the proportion of the compaction reading is Y only glmmPQL agrees with the graphs; glmer gives totally different results. Additionally, my teacher says I need to validate my model choice with a chi-squared value and, if the model is over-dispersed, use a quasi-binomial. But I cannot find any way to do this in glmmPQL, and with glmer showing strange results using proportions as Y I am unsure if this is correct. I also cannot use quasi-binomial in either glmer or glmmPQL.
My response was the compaction reading, which is measured from 0 to 6 (kg per cm squared) inclusive. The explanatory variable was Type (different soil, either disturbed or not disturbed; 4 categories because some were artificially disturbed to pull out differences). All compaction readings were divided by 6 to make them a proportion, i.e. a continuous variable bounded by 0 and 1 but containing values of both 0 and 1. (I also tried the reverse: I coded disturbed as 1 and undisturbed as 0, compared these groups separately across all 4 Types, and left the compaction readings as original.) Using glmer with code:
model1 <- glmer(comp/6 ~ Type + (1|Loc/Seas), data = mydata, family = "binomial")
model2 <- glmer(comp/6 ~ Type + (1|Loc), data = mydata, family = "binomial")
and using glmmPQL:
mod1 <- glmmPQL(comp/6 ~ Type, random = ~1|Loc, family = binomial, data = mydata)
mod2 <- glmmPQL(comp/6 ~ Type, random = ~1|Loc/Seas, family = binomial, data = mydata)
I could compare models in glmer but not in glmmPQL, but the latter gave me the variance for all random effects plus the residual variance, whereas glmer did not provide the residual variance (so I was left wondering what the total variance was and what proportion of it the random effects were responsible for).
When I used glmer this way, the results were completely different from glmmPQL: there was no significant difference at all in glmer but very much a significant difference in glmmPQL. (However, if I do the reverse and code by disturbed and undisturbed, glmer and glmmPQL give similar results that also agree with the graphs, e.g. mod3 <- glmmPQL(Status~compaction, random=~1|Loc/Seas, family = binomial, data=mydata), where Status is 1 or 0 according to disturbed or undisturbed. But my supervisor said this is not strictly correct, and would also like me to provide a chi-squared goodness of fit for the model chosen, so can I only use glmer here?) Additionally, the random effects variance is minimal, and glmer model choice removes the random effects as non-significant (although keeping one in provides a smaller AIC). Removing them (as suggested by the chi-squared test, but not by AIC) and running only a glm is consistent with both the results from glmmPQL and what is observed on the graph. Sorry if this seems very pedantic, but I am trying to do what is correct for my supervisor and for the species I am researching. I know there are differences: they are seen and observed, eyeballing the data suggests so, and so do the graphs. Maybe I should just run the glm? Thank you for answering me. I will find some output to post.

Related

Genetic algorithm techniques for allocation of electric vehicles

The problem I'm trying to solve is about the best allocation for electric vehicles (EVs) in the electrical power grid. My grid has 20 possible positions (busbars), each allowed to receive one EV. Each chromosome has length 20 and its genes can be 0 or 1, where 0 means no EV and 1 means there's an EV at that position (busbar).
I start my population (100 individuals) with a fixed number of EVs (5, for instance) allocated randomly, and let them evolve through my GA. The GA uses tournament selection, 2-point crossover and bit-flip mutation. Each chromosome/individual is evaluated through a fitness function that calculates the power losses between bars (sum of RI^2). The best chromosome is the one with the lowest power losses.
The problem is that 2-point crossover and bit-flip mutation change the fixed number of EVs that must be in the grid. I would like to know what the best techniques are for my GA operations. Besides this, I get this weird-looking graph of the fittest chromosome throughout the generations.
I would appreciate any help/suggestions. Thanks.
You want to define your state space in such a way that the mutations you've chosen can't create an illegal configuration.
This is probably not a great fit for a genetic algorithm. If you want to pick 5 from 20, there are ~15k possibilities. Testing a population of 100 over 50 generations already gives you enough computations to have done 1/3 of the brute force work.
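To make the size comparison above concrete, here is a minimal Python sketch of the exhaustive search. The `power_loss` function is a hypothetical placeholder (a toy cost, not the real sum of RI^2 from a load-flow calculation), and the constants are taken from the question.

```python
from itertools import combinations
from math import comb

N_BUSBARS = 20
N_EVS = 5

def power_loss(positions):
    """Hypothetical stand-in for the real fitness (sum of R*I^2 from a load flow)."""
    # Toy cost for illustration only: pretend losses grow with the busbar index.
    return sum((p + 1) ** 2 for p in positions)

def brute_force_best(n_busbars=N_BUSBARS, n_evs=N_EVS, cost=power_loss):
    """Evaluate every placement of n_evs EVs on n_busbars busbars; return the cheapest."""
    return min(combinations(range(n_busbars), n_evs), key=cost)

n_candidates = comb(N_BUSBARS, N_EVS)   # number of distinct placements
ga_evaluations = 100 * 50               # population of 100 over 50 generations
best = brute_force_best()
```

With these numbers the GA budget of 5000 evaluations is roughly a third of the 15504 placements, so enumerating everything is a realistic alternative whenever one fitness evaluation is cheap.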
If you have N EVs to assign on your grid, you can use chromosomes of size N, each gene being an integer representing the position of an EV. For the crossover, first separate the values that appear in both parents from the rest, then apply a classic (1- or 2-point) crossover to the parts that differ. To mutate a gene, randomly pick a valid position that is still available.
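A minimal Python sketch of the position-based encoding described above, assuming each parent is a list of N distinct busbar indices. The function names and the one-point cut are illustrative choices, not part of the original answer.

```python
import random

def crossover(parent_a, parent_b, rng=random):
    """Swap only the positions the parents do not share, so the child keeps N distinct EVs."""
    shared = sorted(set(parent_a) & set(parent_b))
    only_a = [g for g in parent_a if g not in shared]
    only_b = [g for g in parent_b if g not in shared]
    # only_a and only_b have equal length and no values in common,
    # so a one-point crossover between them cannot create duplicates.
    cut = rng.randint(0, len(only_a))
    return shared + only_a[:cut] + only_b[cut:]

def mutate(chromosome, n_busbars, rng=random):
    """Move one EV to a randomly chosen busbar that is currently free."""
    free = [p for p in range(n_busbars) if p not in chromosome]
    child = list(chromosome)
    if free:
        child[rng.randrange(len(child))] = rng.choice(free)
    return child

rng = random.Random(0)
a = rng.sample(range(20), 5)
b = rng.sample(range(20), 5)
child = mutate(crossover(a, b, rng), 20, rng)
```

Because both operators only ever exchange or relocate positions, every child still contains exactly N EVs and no repair step is needed.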

Multiclass classification in SVM

I have been working on "Script identification from bilingual documents".
I want to classify the pages/blocks as either English (class 1), Hindi (class 2) or Mixed using libsvm in Matlab. The problem is that the training data I have consists of samples corresponding to Hindi and English pages/blocks only, with no mixed pages.
The test data I want to give may also contain Mixed pages/blocks, and in that case I want them to be classified as "Mixed". I am planning to do it using a confidence score or probability values: if the probability of class 1 is greater than a threshold (say 0.8) and the probability of class 2 is less than a threshold (say 0.05), then the sample is classified as class 1, and vice versa for class 2. If neither of these two conditions is satisfied, I want to classify it as "Mixed".
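The decision rule described above can be written down as a short Python sketch. It assumes the two class probabilities have already been obtained from the classifier; the function name and the example probabilities are hypothetical, and the thresholds are the ones mentioned in the question.

```python
def classify(p_english, p_hindi, high=0.8, low=0.05):
    """Label a block from two class probabilities using the thresholds from the question."""
    if p_english > high and p_hindi < low:
        return "English"
    if p_hindi > high and p_english < low:
        return "Hindi"
    return "Mixed"   # neither class is confident enough

labels = [classify(0.97, 0.03), classify(0.02, 0.98), classify(0.85, 0.15)]
```

Note that the third example is fairly confidently English (0.85) but still falls into "Mixed" because the second threshold is not met, which shows how sensitive the scheme is to the chosen cut-offs.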
The third return value from "libsvmpredict" is prob_values, and I was planning to use these prob_values to decide whether the test data is Hindi, English or Mixed. But in a few places I have read that "libsvmpredict" does not produce the actual prob_values.
Is there any way to classify the test data into 3 classes (Hindi, English, Mixed) with an SVM, using training data consisting of only 2 classes?
This is not the modus operandi for SVMs.
An SVM can in no way predict a class it does not know, i.e. one it has never learned to separate from all the other classes.
The function svmpredict() in LibSVM actually shows the probability estimates, and the greater this value is, the more confident you can be regarding your prediction. But you cannot rely on such values to predict a third class if you have trained on just two: indeed svmpredict() will return as many decision values as there are classes.
You can go on with your thresholding system (which, again, is not SVM-based), but it will most likely fail or perform badly. Think about it: you have to set up two thresholds and combine them with a logical AND. The chance of correctly classifying non-Mixed documents will indeed decrease drastically.
My suggestion is: instead of wasting time setting up thresholds, with a high chance of bad performance, join some of these texts together or create some new files with some Hindi and some English lines, so that you can add some proper Mixed documents to your training data and train a standard 3-class SVM.
To create such files you can use Matlab as well, which has pretty decent file I/O functions such as fread(), fwrite(), fprintf(), fscanf(), importdata() and so on.

What is the actual meaning implied by information gain in data mining?

Information Gain= (Information before split)-(Information after split)
Information gain can be found with the above equation. But what I don't understand is what exactly this information gain means. Does it mean how much information is gained or lost by splitting according to the given attribute, or something like that?
Link to the answer:
https://stackoverflow.com/a/1859910/740601
Information gain is the reduction in entropy achieved after splitting the data according to an attribute.
IG = Entropy(before split) - Entropy(after split).
See http://en.wikipedia.org/wiki/Information_gain_in_decision_trees
Entropy is a measure of the uncertainty present. By splitting the data, we are trying to reduce the entropy in it and gain information about it.
We want to maximize the information gain by choosing the attribute and split point which reduces the entropy the most.
If entropy = 0, then there is no further information which can be gained from it.
Correctly written it is
Information-gain = entropy-before-split - average entropy-after-split
The difference between entropy and information is the sign: entropy is high if you do not have much information about the data.
The intuition is that of statistical information theory. The rough idea is: how many bits per record do you need to encode the class label assignment? If you have only one class left, you need 0 bits per record. If you have a chaotic data set (two classes, equally frequent), you will need 1 bit for every record. And if the classes are unbalanced, you could get away with less than that, using a (theoretical!) optimal compression scheme, e.g. by encoding the exceptions only. To match this intuition, you should be using the base 2 logarithm, of course.
A split is considered good, if the branches have lower entropy on average afterwards. Then you have gained information on the class label by splitting the data set. The IG value is the average number of bits of information you gained for predicting the class label.
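As a small illustration of the formula above, the following Python sketch computes the entropy before a split, the size-weighted average entropy after it, and their difference. The ten labels and the two branches are made-up toy data.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits per record, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, branches):
    """Entropy before the split minus the size-weighted average entropy after it."""
    n = len(labels)
    after = sum(len(b) / n * entropy(b) for b in branches)
    return entropy(labels) - after

labels = ["+"] * 5 + ["-"] * 5                         # balanced: 1 bit per record
branches = [["+"] * 4 + ["-"], ["+"] + ["-"] * 4]      # each branch is 4-to-1
gain = information_gain(labels, branches)              # about 0.28 bits per record
```

Here each branch is still impure, so the split recovers only part of the 1 bit per record that was needed before it.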

How can I prevent my program from getting stuck at a local maximum (Feed forward artificial neural network and genetic algorithm)

I'm working on a feed forward artificial neural network (ffann) that will take input in the form of a simple calculation and return the result (acting as a pocket calculator). The outcome won't be exact.
The artificial network is trained using a genetic algorithm on the weights.
Currently my program gets stuck at a local maximum at:
5-6% correct answers, with 1% error margin
30 % correct answers, with 10% error margin
40 % correct answers, with 20% error margin
45 % correct answers, with 30% error margin
60 % correct answers, with 40% error margin
I currently use two different genetic operators:
The first is a basic selection: I pick two individuals at random from my population and name the one with the best fitness the winner and the other the loser. The loser receives one of the weights from the winner.
The second is mutation, where the loser from the selection receives a slight modification based on the number of resulting errors (the fitness is decided by correct answers and incorrect answers).
So if the network outputs a lot of errors, it will receive a big modification, whereas if it has many correct answers, we are close to an acceptable goal and the modification will be smaller.
So to the question: What are ways I can prevent my ffann from getting stuck at local maxima?
Should I modify my current genetic algorithm to something more advanced with more variables?
Should I create additional mutation or crossover?
Or should I maybe try to make my mutation variables bigger or smaller?
This is a big topic so if I missed any information that could be needed, please leave a comment
Edit:
Tweaking the mutation numbers to more suitable values has gotten me a better answer rate, but it is still far from acceptable:
10% correct answers, with 1% error margin
33 % correct answers, with 10% error margin
43 % correct answers, with 20% error margin
65 % correct answers, with 30% error margin
73 % correct answers, with 40% error margin
The network is currently a very simple 3 layered structure with 3 inputs, 2 neurons in the only hidden layer, and a single neuron in the output layer.
The activation function used is Tanh, placing values in between -1 and 1.
The selection type crossover is very simple working like the following:
[a1, b1, c1, d1] // Selected as winner due to most correct answers
[a2, b2, c2, d2] // Loser
The loser will end up receiving one of the values from the winner, moving the value straight down, since I believe the position in the array (of weights) matters to how it performs.
The mutation is very simple: it adds a very small value (currently somewhere between about 0.001 and 0.01) to a random weight in the loser's array of weights, with a 50/50 chance of the value being negative.
Here are a few examples of training data:
1, 8, -7 // the -7 represents + (1+8)
3, 7, -3 // -3 represents - (3-7)
7, 7, 3 // 3 represents * (7*7)
3, 8, 7 // 7 represents / (3/8)
Use a niching technique in the GA. The score of every solution (some form of quadratic error, I think) is adjusted to take into account its similarity to the rest of the population. This maintains diversity inside the population and avoids premature convergence and getting trapped in a local optimum.
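One common niching scheme is fitness sharing, sketched below in Python as an illustration. It is an assumption that this matches what the linked reference describes; the function name, the Euclidean distance on weight vectors and the radius `sigma` are all illustrative choices, and fitness is treated as "higher is better".

```python
import math

def shared_fitness(population, raw_fitness, sigma=1.0):
    """Divide each raw fitness (higher is better) by a niche count of nearby individuals."""
    shared = []
    for ind, fit in zip(population, raw_fitness):
        niche_count = 0.0
        for other in population:
            d = math.dist(ind, other)          # Euclidean distance between weight vectors
            if d < sigma:
                niche_count += 1.0 - d / sigma
        shared.append(fit / niche_count)       # niche_count >= 1 because d(ind, ind) = 0
    return shared

# Two near-identical individuals and one isolated individual, all equally fit.
population = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
scores = shared_fitness(population, [1.0, 1.0, 1.0])
```

The two crowded individuals end up with a lower shared score than the isolated one, so selection is nudged toward less explored regions of the weight space.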
Take a look here:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.100.7342
A common problem when using GAs to train ANNs is that the population becomes highly correlated as training progresses.
You could try increasing mutation chance and/or effect as the error-change decreases.
In plain English: the population becomes genetically similar due to crossover and fitness selection as a local minimum is approached. You can introduce variation by increasing the chance of mutation.
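A minimal Python sketch of the idea of raising the mutation rate when progress stalls. All parameter names and default values here are hypothetical and would need tuning for a real network.

```python
def adapt_mutation_rate(rate, prev_error, curr_error,
                        stall_tol=1e-3, grow=1.5, shrink=0.9,
                        min_rate=0.01, max_rate=0.5):
    """Raise the mutation rate when the error stops improving, lower it otherwise."""
    improvement = prev_error - curr_error
    if improvement < stall_tol:
        rate *= grow      # progress has stalled: inject more variation
    else:
        rate *= shrink    # still improving: mutate more gently
    return min(max_rate, max(min_rate, rate))

stalled = adapt_mutation_rate(0.05, prev_error=0.400, curr_error=0.3999)
improving = adapt_mutation_rate(0.05, prev_error=0.400, curr_error=0.300)
```

Calling this once per generation with the best error of the previous and current generation gives a rate that rises while the search is stuck and decays again once it starts improving.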
You can do a simple modification to the selection scheme: the population can be viewed as having a 1-dimensional spatial structure - a circle (consider the first and last locations to be adjacent).
The production of an individual for location i is permitted to involve only parents from i's local neighborhood, where the neighborhood is defined as all individuals within distance R of i. Aside from this restriction no changes are made to the genetic system.
It's only one or a few lines of code and it can help to avoid premature convergence.
References:
TRIVIAL GEOGRAPHY IN GENETIC PROGRAMMING (2005) - Lee Spector, Jon Klein
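The neighbourhood restriction described in this answer can be sketched in a few lines of Python. The function name is hypothetical, and how the two candidates are then compared (for example by a tournament on fitness) is left to the surrounding GA.

```python
import random

def pick_parents(i, pop_size, radius, rng=random):
    """Return two parent indices within distance `radius` of location i on a ring."""
    def neighbour():
        offset = rng.randint(-radius, radius)
        return (i + offset) % pop_size   # wrap so the first and last locations are adjacent
    return neighbour(), neighbour()

rng = random.Random(1)
parents = [pick_parents(0, pop_size=100, radius=3, rng=rng) for _ in range(50)]
```

For location 0 with radius 3, the candidates can only come from indices 97 to 99 and 0 to 3, so good weights spread gradually around the ring instead of taking over the whole population at once.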

Shannon's Entropy measure in Decision Trees

Why is Shannon's Entropy measure used in Decision Tree branching?
Entropy(S) = - p(+)log( p(+) ) - p(-)log( p(-) )
I know it is a measure of the no. of bits needed to encode information; the more uniform the distribution, the more the entropy. But I don't see why it is so frequently applied in creating decision trees (choosing a branch point).
Because you want to ask the question that will give you the most information. The goal is to minimize the number of decisions/questions/branches in the tree, so you start with the question that will give you the most information and then use the following questions to fill in the details.
For the sake of decision trees, forget about the number of bits and just focus on the formula itself. Consider a binary (+/-) classification task where you have an equal number of + and - examples in your training data. Initially, the entropy will be 1 since p(+) = p(-) = 0.5. You want to split the data on an attribute that most decreases the entropy (i.e., makes the distribution of classes least random). If you choose an attribute, A1, that is completely unrelated to the classes, then the entropy will still be 1 after splitting the data by the values of A1, so there is no reduction in entropy. Now suppose another attribute, A2, perfectly separates the classes (e.g., the class is always + for A2="yes" and always - for A2="no"). In this case, the entropy is zero, which is the ideal case.
In practical cases, attributes don't typically perfectly categorize the data (the entropy is greater than zero). So you choose the attribute that "best" categorizes the data (provides the greatest reduction in entropy). Once the data are separated in this manner, another attribute is selected for each of the branches from the first split in a similar manner to further reduce the entropy along that branch. This process is continued to construct the tree.
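The A1 and A2 cases above can be checked numerically with a short Python sketch. The counts (50 positive and 50 negative examples, split into two branches) are made up for illustration, and the logarithm is taken to base 2.

```python
from math import log2

def entropy(pos, neg):
    """Binary Shannon entropy (base 2) for pos positive and neg negative examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:                      # treat 0 * log(0) as 0
            p = count / total
            h -= p * log2(p)
    return h

def entropy_after_split(branches):
    """Size-weighted average entropy over branches given as (pos, neg) counts."""
    total = sum(p + n for p, n in branches)
    return sum((p + n) / total * entropy(p, n) for p, n in branches)

before = entropy(50, 50)                              # 1.0
after_a1 = entropy_after_split([(25, 25), (25, 25)])  # 1.0: A1 is unrelated to the class
after_a2 = entropy_after_split([(50, 0), (0, 50)])    # 0.0: A2 separates the classes
```

A tree builder would therefore prefer A2 (reduction of 1.0) over A1 (reduction of 0.0); real attributes fall somewhere in between.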
You seem to have an understanding of the math behind the method, but here is a simple example that might give you some intuition behind why this method is used: Imagine you are in a classroom that is occupied by 100 students. Each student is sitting at a desk, and the desks are organized such that there are 10 rows and 10 columns. 1 out of the 100 students has a prize that you can have, but you must guess which student it is to get the prize. The catch is that every time you guess, the prize is decremented in value. You could start by asking each student individually whether or not they have the prize. However, initially, you only have a 1/100 chance of guessing correctly, and it is likely that by the time you find the prize it will be worthless (think of every guess as a branch in your decision tree). Instead, you could ask broad questions that dramatically reduce the search space with each question. For example, "Is the student somewhere in rows 1 through 5?" Whether the answer is "Yes" or "No", you have reduced the number of potential branches in your tree by half.
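To put a number on the classroom example, this Python sketch counts how many yes/no questions are needed if every question halves the remaining candidates. The function name is illustrative.

```python
from math import ceil, log2

def questions_needed(n_students):
    """Count yes/no questions needed when each question halves the remaining candidates."""
    remaining, questions = n_students, 0
    while remaining > 1:
        remaining = ceil(remaining / 2)   # worst case: the larger half remains
        questions += 1
    return questions

halving = questions_needed(100)           # 100 -> 50 -> 25 -> 13 -> 7 -> 4 -> 2 -> 1
```

Seven halving questions always suffice for 100 students, which matches ceil(log2(100)), whereas asking students one at a time can take up to 99 questions.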