I have very limited cumulative probability information, as shown below:
x1=6 F(x1)=0
x2=7.25 F(x2)=0.1
x3=8 F(x3)=0.35
x4=9.5 F(x4)=1
I want to fit these data with a cumulative log-normal distribution curve. I know there is a method called percentile matching for estimating the parameters from this kind of information, but it normally uses only two data points, since there are only two unknown parameters.
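For illustration, plain two-point percentile matching would look something like this (sketched in Python/SciPy just to show the algebra; MATLAB's norminv plays the role of norm.ppf here, and only the two interior percentiles are usable because the inverse normal CDF of 0 and 1 is infinite):

import numpy as np
from scipy.stats import norm

# two usable interior percentiles: F(7.25) = 0.1, F(8) = 0.35
x_lo, p_lo = 7.25, 0.10
x_hi, p_hi = 8.00, 0.35

# For a lognormal, ln(x_p) = mu + sigma * Phi^{-1}(p), so two percentiles
# give two linear equations in mu and sigma.
z_lo, z_hi = norm.ppf(p_lo), norm.ppf(p_hi)
sigma = (np.log(x_hi) - np.log(x_lo)) / (z_hi - z_lo)
mu = np.log(x_hi) - sigma * z_hi
print(mu, sigma)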
But here I also have to accommodate the upper and lower threshold values. Is there any way to do this in MATLAB?
Any reference on estimating lognormal distribution parameters from percentile information would be appreciated. Thanks!
I have a question: in my simulation model I want to add some randomness to the quantities of arriving agents. In my model the chance is 25% that the quantity is 1, 50% that it is 2, and 25% that it is 3. I want to generate these quantities with the split block in AnyLogic, and in that split block I want to specify a distribution that produces values according to these probabilities.
But is there a distribution in AnyLogic that fits this kind of specification, or do I really have to do it manually? I was already thinking about rounding a value drawn from a normal distribution, but I'm not sure that is correct.
Thanks.
You can use a custom distribution to create any shape of distribution you want (it is under the Agent palette).
In your case, you would need to pick one of type discrete as shown below.
Then, in the split block you would specify the number of copies as
customDistribution()
Is it possible to train classifiers in sklearn with a cost matrix that assigns different costs to different mistakes? For example, in a two-class problem the cost matrix would be a 2 x 2 square matrix where A_ij is the cost of classifying class i as class j.
The main classifier I am using is a Random Forest.
Thanks.
The cost-sensitive framework you describe is not supported in scikit-learn, in any of the classifiers we have.
You could use a custom scoring function that accepts a matrix of per-class or per-instance costs. Here's an example of a scorer that calculates per-instance misclassification cost:
import pandas as pd

def financial_loss_scorer(y, y_pred, **kwargs):
    # y is assumed to be a pandas Series so its index lines up with `totals`
    totals = kwargs['totals']

    # Create an indicator - 0 if correct, 1 otherwise
    errors = pd.DataFrame((~(y == y_pred)).astype(int).rename('Result'))

    # Use the product totals dataset to create results
    results = errors.merge(totals, left_index=True, right_index=True, how='inner')

    # Calculate per-prediction loss
    loss = results.Result * results.SumNetAmount

    return loss.sum()
The scorer becomes:
make_scorer(financial_loss_scorer, totals=totals_data, greater_is_better=False)
where totals_data is a pandas.DataFrame whose index matches the training set index.
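A rough usage sketch, assuming X_train, y_train, and totals_data already exist with matching indexes: the scorer can then be plugged into a grid search so the forest's hyper-parameters are tuned against cost rather than accuracy.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

scorer = make_scorer(financial_loss_scorer, totals=totals_data,
                     greater_is_better=False)

# tune the forest against cost instead of plain accuracy
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={'n_estimators': [100, 300],
                                  'max_depth': [None, 5, 10]},
                      scoring=scorer, cv=5)
search.fit(X_train, y_train)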
You could always just look at your ROC curve. Each point on the ROC curve corresponds to a separate confusion matrix, so specifying the confusion matrix you want, by choosing your classifier's threshold, implies some sort of cost-weighting scheme. Then you just have to choose the confusion matrix that corresponds to the cost matrix you are looking for.
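As a rough illustration of that threshold idea (the helper below is hypothetical, not a standard sklearn API): scan thresholds over the predicted probabilities on a validation set and keep the one whose confusion matrix gives the smallest total cost under your 2 x 2 cost matrix.

import numpy as np
from sklearn.metrics import confusion_matrix

def pick_threshold(clf, X_val, y_val, cost):
    # cost[i][j] = cost of predicting class j when the true class is i
    proba = clf.predict_proba(X_val)[:, 1]
    thresholds = np.linspace(0.01, 0.99, 99)
    total_costs = []
    for t in thresholds:
        y_pred = (proba >= t).astype(int)
        cm = confusion_matrix(y_val, y_pred, labels=[0, 1])
        total_costs.append(np.sum(cm * cost))
    return thresholds[int(np.argmin(total_costs))]

# e.g. false negatives (true 1 predicted as 0) five times as costly as false positives
cost = np.array([[0, 1],
                 [5, 0]])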
On the other hand, if you really have your heart set on it and really want to "train" an algorithm using a cost matrix, you can "sort of" do it in sklearn.
Although it is impossible to directly train an algorithm to be cost-sensitive in sklearn, you can use a cost-matrix-style setup to tune your hyper-parameters. I've done something similar using a genetic algorithm. It doesn't do a great job, but it should give a modest boost to performance.
One way to circumvent this limitation is to use under- or oversampling. For example, if you are doing binary classification with an imbalanced dataset and want to make errors on the minority class more costly, you could oversample it. You may want to have a look at imbalanced-learn, which is a package from scikit-learn-contrib.
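A minimal sketch of that route, assuming X_train and y_train exist and imbalanced-learn is installed; the resampled data are then fed to an ordinary sklearn classifier:

from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier

ros = RandomOverSampler(random_state=0)            # duplicates minority-class rows
X_res, y_res = ros.fit_resample(X_train, y_train)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_res, y_res)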
This may not directly address your question (since you are asking about Random Forest), but for SVM in sklearn you can use the class_weight parameter to specify the weights of the different classes. Essentially, you pass in a dictionary.
You might want to refer to this page to see an example of using class_weight.
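As a small illustration (the data names are assumed), weighting class 1 ten times as heavily makes mistakes on that class ten times as costly in the SVM's objective:

from sklearn.svm import SVC

# mistakes on class 1 are weighted ten times as heavily as mistakes on class 0
clf = SVC(kernel='rbf', class_weight={0: 1, 1: 10})
clf.fit(X_train, y_train)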
I have a set of data in a vector. If I were to plot a histogram of the data, I could see (by clever inspection) that the data are distributed as the sum of three distributions:
One normal distribution centered around x_1 with variance s_1;
One normal distribution centered around x_2 with variance s_2;
One lognormal distribution.
My data is obviously a subset of the 'real' data.
What I would like to do is draw a random subset from my data while ensuring that the resulting subset is a reasonably representative sample of the original data.
I would like to do this as easily as possible in MATLAB, but I am new to both statistics and MATLAB and am unsure where to start.
Thank you for any help :)
If you can identify each of the three distributions (in the sense that you can estimate their parameters), one approach could be to select a random subset of your data, re-estimate the parameters of each distribution from that subset, and check whether they are close enough (according to your own definition of "close") to the parameters estimated from the original data. You should repeat this process several times and look at the average difference for a given subset size.
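A rough sketch of this check, written in Python/NumPy purely for illustration (randsample, mean, and std would play the same roles in MATLAB, and the file name is just a placeholder):

import numpy as np

rng = np.random.default_rng(0)
data = np.loadtxt('my_data.txt')        # placeholder: your vector of observations
subset_size = len(data) // 10

diffs = []
for _ in range(100):
    subset = rng.choice(data, size=subset_size, replace=False)
    # crude closeness check on the first two moments; a fuller version would
    # re-estimate the parameters of each of the three component distributions
    diffs.append([abs(subset.mean() - data.mean()),
                  abs(subset.std() - data.std())])
print(np.mean(diffs, axis=0))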
I am trying to deseasonalize a set of monthly water-quality data spanning 10 years. Since a Box-Cox transformation may be required to rectify anomalies such as heteroscedasticity and non-normality of the residuals, I tried applying this transformation before deseasonalization. I applied the transformation (the boxcox function in MATLAB) to each month's data set separately and used the Kolmogorov-Smirnov test (the kstest function in MATLAB) to check whether the result follows a normal distribution. However, even after the transformation the p-value is very small and the null hypothesis of kstest is rejected. So my questions are: am I doing this the right way (applying the transformation and kstest to each month separately), and why don't I get a normally distributed data set after boxcox?
Thanks
Boxcox transforms the data in order to reduce the nonnormality:
boxcox transforms nonnormally distributed data to a set of data that
has approximately normal distribution.
However, this unfortunately does not mean that it can take any dataset and transform it into a perfectly normal one.
My guess is that your data are too messy, so even after using boxcox they still cannot pass the Kolmogorov-Smirnov test.
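If you want to see how much the transform actually buys you, a quick before/after normality check is easy to script. The sketch below uses Python/SciPy with stand-in data purely for illustration (MATLAB's boxcox plus a normality test would be analogous):

import numpy as np
from scipy import stats

month_data = np.random.lognormal(mean=1.0, sigma=0.6, size=120)   # stand-in data

transformed, lam = stats.boxcox(month_data)    # requires strictly positive data
print(stats.normaltest(month_data).pvalue)     # before the transform
print(stats.normaltest(transformed).pvalue)    # after the transform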