Interpreting percentage units and a rate per 1000 in a regression - linear-regression

I want to examine the effect of the crime rate on school performance.
My dependent variable is the percentage of students who score at or above the 80th percentile of grades,
and the crime rate is measured per 1,000 students.
My estimation model is therefore:
% of students = a + b * crime rate + error.
If the estimate of b is -1.16, how can I interpret the result?
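For concreteness, here is a hedged sketch with simulated data (the numbers and the use of statsmodels are my own illustration, not from the question). With only the crime rate in the model, a slope of -1.16 means that one additional crime per 1,000 students is associated with a 1.16 percentage-point (not percent) decrease in the share of students scoring at or above the 80th percentile.

```python
# Illustration with simulated data (hypothetical numbers, not from the question).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
crime_rate = rng.uniform(0, 50, size=200)                       # crimes per 1,000 students
pct_top = 40 - 1.16 * crime_rate + rng.normal(0, 5, size=200)   # % scoring >= 80th percentile

X = sm.add_constant(crime_rate)
fit = sm.OLS(pct_top, X).fit()
print(fit.params)   # [intercept, slope]; slope ~ -1.16

# Interpretation: one extra crime per 1,000 students is associated with a
# 1.16 percentage-point drop in the share of students at or above the 80th
# percentile. There are no other covariates, so nothing is "held constant".
```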

Related

How do I determine the number of runs I need to carry out in NetLogo to reduce the influence of randomization?

[Box plot of the output variable against the number of runs]
I have plotted an output variable as a box plot against the number of runs, but I am unable to justify what the optimal number of runs to carry out should be.
If each simulation is considered one observation in a sample, your sample size (the number of simulations) should be large enough that the estimate of the parameter of interest approaches its true population value for the model (see Cowled, B.D., Garner, M.G., Negus, K., Ward, M.P., 2012. Controlling disease outbreaks in wildlife using limited culling: modelling classical swine fever incursions in wild pigs in Australia. Vet. Res. 43, 3).
This is what Cowled et al. did: "To estimate our sample size, we calculated the mean of the parameter-of-interest (after each simulation). We then determined the coefficient of variation of this mean. At the point when the coefficient of variation was less than 15% for 30 consecutive simulations we considered that convergence had occurred and that this number of simulations was adequate to estimate the parameter with precision."
I have used a similar approach to calculate the required number of model simulations: Belsare, A.V. and Gompper, M.E. 2015. A model-based approach for investigation and mitigation of disease spillover risks to wildlife: dogs, foxes and canine distemper in central India. Ecological Modelling 296, 102-112.
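One possible reading of that stopping rule, sketched in Python (the `run_model` placeholder and the interpretation of "coefficient of variation of this mean" as the relative standard error are my assumptions): track the cumulative mean after every run and stop once its CV has stayed below 15% for 30 consecutive runs.

```python
# Rough sketch of one reading of the Cowled et al. stopping rule.
# `run_model()` stands in for your own NetLogo/BehaviorSpace run.
import numpy as np

def enough_runs(run_model, cv_threshold=0.15, stable_runs=30, max_runs=5000):
    outputs, consecutive = [], 0
    for i in range(1, max_runs + 1):
        outputs.append(run_model())
        if i < 2:
            continue                              # need at least 2 runs for a std. dev.
        arr = np.asarray(outputs, dtype=float)
        mean = arr.mean()
        se = arr.std(ddof=1) / np.sqrt(i)         # standard error of the cumulative mean
        cv_of_mean = se / abs(mean) if mean != 0 else np.inf
        consecutive = consecutive + 1 if cv_of_mean < cv_threshold else 0
        if consecutive >= stable_runs:
            return i                              # this many runs looks adequate
    return max_runs                               # threshold never met

# Example with a dummy stochastic model:
rng = np.random.default_rng(1)
print(enough_runs(lambda: rng.normal(10, 3)))
```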

Evaluating ICC for Reliability Study

I am writing a research study on the reliability and validity of a foot scanning app. I want to evaluate the ICC (intraclass correlation coefficient, absolute agreement) for a couple of datasets I have. In one set, I measured each person only once (about 45 people). In another, I measured 3 people 15 times each. I have extracted 10 measurements from each person (5 different foot measurements of each foot). I also have some ICC calculation code to help with the calculations.
The code requires that I know what the "objects of measurement" are (rows) and the "judge or measurement" (columns). How should I arrange my data to calculate an ICC value for each foot measurement type?
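As a sketch of one common arrangement (the column names and the use of the pingouin package are assumptions, not part of the question): put the people, i.e. the "objects of measurement", in the rows and the repeated scans in the columns, and build one such table per foot-measurement type. For the 3-people-by-15-scans set this gives a 3 x 15 table per measurement.

```python
# Sketch: rows = subjects (objects of measurement), columns = repeated trials,
# one table per foot-measurement type. Names and values are made up.
import pandas as pd
import pingouin as pg

# Long-format records: one row per (person, trial) for one measurement type.
long = pd.DataFrame({
    "person": ["P1", "P1", "P2", "P2", "P3", "P3"],          # ... 3 people
    "trial":  [1, 2, 1, 2, 1, 2],                            # ... up to 15 trials each
    "left_foot_length": [24.1, 24.3, 26.0, 25.8, 23.5, 23.6] # one of the 10 measures
})

# Wide layout expected by many ICC routines: subjects in rows, trials in columns.
wide = long.pivot(index="person", columns="trial", values="left_foot_length")
print(wide)

# pingouin works on the long format directly; the ICC2 row corresponds to the
# two-way random-effects, absolute-agreement, single-measurement ICC.
icc = pg.intraclass_corr(data=long, targets="person", raters="trial",
                         ratings="left_foot_length")
print(icc)
```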

Predicting the difference or the quotient?

For a time series forecasting problem, I noticed some people tried to predict the difference or the quotient. For instance, in trading, we can try to predict the price difference P_{t-1} - P_t or the price quotient P_{t-1}/P_t. So we get a more stationary problem. With a recurrent neural network for a regression problem, trying to predict the price difference can be a real pain if the price does not change sufficiently fast because it will predict mostly zero at each step.
Questions:
What are the advantages and disadvantages of using the difference or the quotient instead of the whole quantity?
What would be a good tool for getting rid of the repetitive zeros in a problem like predicting the price movement?
If the assumption is that the price itself is stationary (P_t = constant), then predict the whole quantity.
If the assumption is that the price increase is stationary (P_t = P_{t-1} + constant), then predict the absolute difference P_t - P_{t-1}. (Note: this is an ARIMA model with degree of differencing d = 1.)
If the assumption is that the price growth (in percentage) is stationary (P_t = P_{t-1} + constant * P_{t-1}), then predict the relative difference P_t / P_{t-1}.
If the price changes rarely (i.e. the absolute or relative difference is most often zero), then try to predict the time interval between two changes rather than the price itself. A short sketch of these transforms follows below.
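A minimal pandas sketch of these targets (series values and names are made up for illustration):

```python
# Levels, absolute differences, relative changes, and waiting times between changes.
import numpy as np
import pandas as pd

price = pd.Series([100.0, 100.0, 100.5, 100.5, 100.5, 101.0, 100.8])

abs_diff = price.diff()          # P_t - P_{t-1}   (ARIMA d=1 target)
rel_diff = price.pct_change()    # P_t / P_{t-1} - 1 (relative change)
log_ret  = np.log(price).diff()  # log(P_t / P_{t-1}), a common alternative

# If the price changes rarely, model the waiting time between changes instead:
change_idx = price.index[price.diff().fillna(0) != 0]
gaps = pd.Series(change_idx).diff().dropna()   # steps between consecutive changes
print(gaps.tolist())
```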

Will an average of neural net weights be as effective as one humongous simulation?

I'm planning to write a neural network to predict the closing price on day n, using open, high, low, close, and volume for days n-10 to n-1, and doing this for approximately 800 days.
I was going to use an 'all but one' (leave-one-out) strategy for validation, but this would basically square the number of simulations I'd have to run. Very inefficient!
Would it be significantly less accurate if I ran the simulation just once for each of the 800 days, stored the weights, and then, for validation, averaged the weights for all dates except the one to be predicted and the 10 preceding dates?
If the transfer function were linear, it would make no difference, but of course the logistic function makes it non-linear.
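A tiny numerical check of that last point (arbitrary shapes and values, purely illustrative): with a logistic activation, the output of the averaged weights is generally not the average of the outputs. With more than one layer the mismatch is usually worse, because hidden units of independently trained nets can be permuted, so element-wise averaging may not even pair up equivalent units.

```python
# Averaging weights vs. averaging outputs under a logistic (sigmoid) activation.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
x = rng.normal(size=5)                 # one input vector
W1, W2 = rng.normal(size=(2, 5))       # weights of two separately trained "nets"

out_of_avg_weights = sigmoid((W1 + W2) / 2 @ x)
avg_of_outputs     = (sigmoid(W1 @ x) + sigmoid(W2 @ x)) / 2

print(out_of_avg_weights, avg_of_outputs)   # generally not equal
```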

kNN improvement on Spam Classification

Currently I'm trying to classify spam emails with kNN classification. The dataset is represented in bag-of-words form and contains approx. 10,000 observations with approx. 900 features. Matlab is the tool I use to process the data.
Over the last few days I have tried several machine learning approaches: SVM, naive Bayes, and kNN. In my view, kNN beats SVM and Bayes when it comes to minimizing the false positive rate. Checking with 10-fold cross-validation, I obtain a false positive rate of 0.0025 using k=9 and Manhattan distance; Hamming distance performs in the same region.
To further improve my FPR I tried preprocessing the data with PCA, but that blew up my FPR to 0.08, which is not acceptable.
Do you have any idea how to tune the dataset to get a better FPR?
PS: Yes, this is a task I have to do in order to pass a machine learning course.
Something to try: double-count the non-spam samples in your training data. Say 500 of the 1,000 samples were non-spam; after double-counting the non-spam ones you will have a training set of 1,500 samples. This might give the would-be false-positive test samples more non-spam nearest neighbours, so fewer of them get voted into the spam class. Note that overall performance might suffer.
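A sketch of that duplication idea in scikit-learn rather than Matlab (placeholder arrays X and y, with y == 0 marking non-spam; everything here is illustrative):

```python
# Duplicate the non-spam rows before fitting kNN, as suggested above.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fit_knn_with_doubled_ham(X, y, k=9):
    ham = (y == 0)                            # non-spam rows
    X_aug = np.vstack([X, X[ham]])            # append a second copy of them
    y_aug = np.concatenate([y, y[ham]])
    clf = KNeighborsClassifier(n_neighbors=k, metric="manhattan")
    return clf.fit(X_aug, y_aug)

# Usage with random placeholder data (replace with your bag-of-words matrix):
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 900)).astype(float)
y = rng.integers(0, 2, size=200)
model = fit_knn_with_doubled_ham(X, y)
print(model.predict(X[:5]))
```

An alternative with a similar intent is distance-weighted voting (`weights="distance"` in KNeighborsClassifier), which biases the vote without duplicating rows; which works better here would have to be checked with the same 10-fold cross-validation.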