Imputing missing values for a linear regression model, using linear regression

I scraped a real estate website and would like to impute missing data on total area (about 40% missing) using linear regression. I achieve the best results using price, number of rooms, bedrooms, bathrooms, and powder rooms.
[Image: correlation matrix]
Adding price to the room information makes a significant difference. This makes sense, since the number of rooms alone doesn't give you any information on how large those rooms may be; price can reduce some of that uncertainty. There is a 20-point difference between the R^2 scores of the model that includes price and the one that excludes it (0.82 vs. 0.62).
The problem I see is that my final model will likely also be a linear regression, with price as the target. Given that, it seems wrong to include price when predicting total area for imputation: my final model will look better as a consequence, but only because I will have engineered a synthetic correlation. This is especially critical since about 40% of the values need to be replaced.
Does anyone disagree with this? Should I keep price as a predictor to impute missing values even though it will be the target of my final model?
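For concreteness, here is a minimal sketch of the comparison described above, assuming a hypothetical DataFrame loaded from the scraped listings with columns total_area, price, rooms, bedrooms, bathrooms and powder_rooms (the file name and column names are illustrative, not from the original post):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("listings.csv")             # hypothetical scraped data
observed = df.dropna(subset=["total_area"])  # rows where total area is known
y = observed["total_area"]

without_price = ["rooms", "bedrooms", "bathrooms", "powder_rooms"]
with_price = without_price + ["price"]

# Compare R^2 with and without price (assumes the feature columns are complete).
for cols in (without_price, with_price):
    r2 = cross_val_score(LinearRegression(), observed[cols], y, cv=5, scoring="r2")
    print(cols, round(r2.mean(), 2))

# Impute the missing areas with the chosen feature set.
imputer = LinearRegression().fit(observed[without_price], y)
missing = df["total_area"].isna()
df.loc[missing, "total_area"] = imputer.predict(df.loc[missing, without_price])
```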

Based on the context, I think you're talking about hotel prices?
In my experience, imputing missing values in your predictors can give a significant boost to R^2 scores. However, the more of a predictor you impute, the fewer genuinely observed values you have, so it would be biased to extrapolate your conclusions to the bigger picture of hotel prices: you never know whether there exist unobserved hotel prices with more variation, right?

Related

Having a data set that contains ordered units, average price and discount

[Image: raw_data]
Having a dataset which contains variables such as ordered units, average price and discount (example attached), how could I find the optimal discount numerically?
Plotting ordered units vs. discount, it seems the optimal discount is around 10% (in terms of getting more units ordered). How can I support or reject this guess numerically? Of course, maximising ordered units need not maximise profit, but this is all the data I have to decide on the optimal discount.
[Plot: ordered units vs. discount]
Thank you!
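For illustration, one simple way to check the ~10% guess numerically is to fit a low-degree polynomial to ordered units vs. discount and locate its maximum. The sketch below uses made-up numbers, not the poster's data:

```python
import numpy as np

discount = np.array([0.00, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20])  # hypothetical
units    = np.array([  80,  110,  140,  155,  150,  135,  100])  # hypothetical

b2, b1, b0 = np.polyfit(discount, units, deg=2)  # units ~ b2*d^2 + b1*d + b0
optimal_discount = -b1 / (2 * b2)                # vertex of the fitted parabola
print(f"estimated optimal discount ~ {optimal_discount:.1%}")

# If profit matters, maximise e.g. units(d) * price * (1 - d) instead of units alone.
```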

Predicting the difference or the quotient?

For a time series forecasting problem, I noticed some people tried to predict the difference or the quotient. For instance, in trading, we can try to predict the price difference P_{t-1} - P_t or the price quotient P_{t-1}/P_t. So we get a more stationary problem. With a recurrent neural network for a regression problem, trying to predict the price difference can be a real pain if the price does not change sufficiently fast because it will predict mostly zero at each step.
Questions:
What are the advantages and disadvantages of predicting the difference or the quotient instead of the raw quantity?
What would be a good tool to get rid of the repetitive zeros in a problem like predicting the price movement?
If the assumption is that the price itself is stationary (P_t = const), then predict the whole quantity.
If the assumption is that the price increase is stationary (P_t = P_{t-1} + const), then predict the absolute difference P_t - P_{t-1}. (Note: this is the ARIMA model with a degree of differencing of 1.)
If the assumption is that the price growth (in percentage) is stationary (P_t = P_{t-1} + const * P_{t-1}), then predict the relative difference P_t / P_{t-1}.
If the price changes rarely (i.e. the absolute or relative difference is most often zero), then try to predict the time interval between two changes rather than the price itself.
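A small sketch of these targets, assuming a hypothetical pandas price series (the values are purely illustrative):

```python
import pandas as pd

prices = pd.Series([100.0, 100.0, 101.0, 101.0, 103.0, 102.5])

level      = prices                    # predict P_t directly (price assumed stationary)
difference = prices.diff()             # P_t - P_{t-1}  (ARIMA-style differencing, d=1)
quotient   = prices / prices.shift(1)  # P_t / P_{t-1}  (relative change / growth)

# If most differences are zero, model the waiting time between changes instead:
diffs = prices.diff()
change_idx = diffs[diffs.fillna(0) != 0].index
intervals = pd.Series(change_idx).diff().dropna()
print(intervals)
```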

Data normalisation for presenting to neural network

I'm experimenting with neural networks, and as an introduction I'm doing the popular stock market prediction exercise: feed in prices and volumes in order to predict the future price. I need to normalise my data before presenting it to the network, but I'm unsure about the methodology...
Each stock has a closing price and volume figure for each trading day; do I normalise the price data across the prices of all stocks for each day, or do I normalise it against the previous prices for that one stock?
I.e. if I'm presenting StockA to the NN, do I normalise the price data against the previous prices of StockA, or do I normalise it with the prices of StockA, B, C, D... for the date that's being presented?
In my opinion you should treat this issue as a hyperparameter, that is: try both and do what works best.
In the end this comes down to what the information in the stocks looks like, and how much stock data (quantity, characteristics) you have.
If you normalize over each single stock you'll probably get better generalization, especially if you only have little data available.
However, if you normalize over the whole stock data, you keep the overall information of the whole dataset in each stock's data (e.g. the magnitude of the stock price), which might help your model, since more expensive stocks might behave differently from less expensive ones.
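A minimal sketch of the two schemes discussed above, using synthetic prices purely for illustration (column names are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
prices = pd.DataFrame(
    rng.lognormal(mean=3, sigma=0.1, size=(250, 4)).cumsum(axis=0),
    columns=["StockA", "StockB", "StockC", "StockD"],
)

# Option 1: per-stock normalization (each column against its own mean/std);
# magnitude differences between stocks are removed.
per_stock = (prices - prices.mean()) / prices.std()

# Option 2: pooled normalization over all stocks and days; an expensive stock
# stays "large" after scaling, so the model can still see price magnitude.
pooled = (prices - prices.values.mean()) / prices.values.std()
```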

Shouldn't we take the average of the n models in cross validation in linear regression?

I have a question regarding cross validation in a linear regression model.
From my understanding, in cross validation we split the data into (say) 10 folds, train on 9 folds and use the remaining fold for testing. We repeat this process until every fold has been used for testing exactly once.
When we train the model on 9 folds, don't we get a different model each time (maybe slightly different from the model we would get using the whole dataset)? I know that we take the average of all n performance scores.
But what about the model? Shouldn't the resulting model also be the average of all n models? I see that the resulting model is the same as the model we created from the whole dataset before cross-validation. If we keep the overall model even after cross-validation (and do not take the average of the models), then what is the point of calculating the average performance over n different models (they are trained on different folds of data and are supposed to be different, right?)
I apologize if my question is not clear or too funny.
Thanks for reading, though!
I think there is some confusion in some of the proposed answers because of the use of the word "model" in the question. If I am guessing correctly, you are referring to the fact that in K-fold cross-validation we learn K different predictors (or decision functions), which you call "models" (this is a bad choice of word, because in machine learning we also do model selection, i.e. choosing between families of predictors, and that is something which can be done using cross-validation). Cross-validation is typically used for hyperparameter selection, or to choose between different algorithms or different families of predictors. Once these are chosen, the most common approach is to relearn a predictor with the selected hyperparameters and algorithm from all the data.
However, if the loss function being optimized is convex with respect to the predictor, then it is possible to simply average the different predictors obtained from each fold.
This is because, for a convex risk, the risk of the average of the predictors is at most the average of their individual risks.
The PROs and CONs of averaging (vs. retraining) are as follows:
PROs: (1) In each fold, the evaluation that you made on the held out set gives you an unbiased estimate of the risk for those very predictors that you have obtained, and for these estimates the only source of uncertainty is due to the estimate of the empirical risk (the average of the loss function) on the held out data.
This should be contrasted with the logic used when you retrain, which is that the cross-validation risk is an estimate of the "expected value of the risk of a given learning algorithm" (and not of a given predictor), so that if you relearn from data from the same distribution, you should on average get the same level of performance. But note that this holds only on average, and when retraining on the whole data the performance could go up or down. In other words, there is an additional source of uncertainty due to the fact that you will retrain.
(2) The hyperparameters have been selected exactly for the number of datapoints that you used in each fold to learn. If you relearn from the whole dataset, the optimal value of the hyperparameter is in theory and in practice not the same anymore, and so in the idea of retraining, you really cross your fingers and hope that the hyperparameters that you have chosen are still fine for your larger dataset.
If you used leave-one-out, there is obviously no concern there, and if the number of data points is large, with 10-fold CV you should be fine. But if you are learning from 25 data points with 5-fold CV, the hyperparameters for 20 points are not really the same as for 25 points...
CONs: Well, intuitively you don't benefit from training with all the data at once.
There is unfortunately very little thorough theory on this, but the following two papers, especially the second, consider precisely the averaging or aggregation of the predictors from K-fold CV.
Jung, Y. (2016). Efficient Tuning Parameter Selection by Cross-Validated Score in High Dimensional Models. International Journal of Mathematical and Computational Sciences, 10(1), 19-25.
Maillard, G., Arlot, S., & Lerasle, M. (2019). Aggregated Hold-Out. arXiv preprint arXiv:1909.04890.
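For illustration, here is a small sketch of the averaging idea: for a predictor that is linear in its parameters, averaging the K per-fold predictors amounts to averaging their coefficients. The data are synthetic and Ridge is chosen arbitrarily as an example learner:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=100)

coefs, intercepts = [], []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    coefs.append(model.coef_)
    intercepts.append(model.intercept_)

# The "averaged predictor": mean of the per-fold coefficients and intercepts.
w_bar, b_bar = np.mean(coefs, axis=0), np.mean(intercepts)
y_pred = X @ w_bar + b_bar
```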
The answer is simple: you use the process of (repeated) cross validation (CV) to obtain a relatively stable performance estimate for a model instead of improving it.
Think of trying out different model types and parametrizations which are more or less well suited to your problem. Using CV, you obtain many estimates of how each model type and parametrization would perform on unseen data. From those results you usually choose one well-suited model type + parametrization, which you then train again on all of the (training) data. The reason for doing this many times (repeats, each using different partition splits) is to get a stable estimate of the performance, which lets you look at, e.g., the mean/median performance and its spread (this tells you how well the model usually performs and how likely it is to get lucky/unlucky and produce better/worse results).
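A minimal sketch of that workflow, using synthetic data and scikit-learn purely for illustration: estimate performance with CV, then refit the chosen model on all the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=200)

# Stable performance estimate: mean and spread over 10 folds.
scores = cross_val_score(LinearRegression(), X, y, cv=10, scoring="r2")
print(f"mean R^2 = {scores.mean():.3f}, std = {scores.std():.3f}")

# The model you actually use: retrained on all of the training data.
final_model = LinearRegression().fit(X, y)
```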
Two more things:
Usually, using CV will improve your results in the end - simply because you take a model that is better suited for the job.
You mentioned taking the "average" model. This actually exists as "model averaging", where you average the results of multiple, possibly differently trained, models to obtain a single result. It's one way to use an ensemble of models instead of a single one. But even for those, you want to use CV in the end to choose a reasonable model.
I like your thinking. I think you have just accidentally discovered Random Forest:
https://en.wikipedia.org/wiki/Random_forest
Without repeated CV, your seemingly best model is likely to be only a mediocre model when you score it on new data...

How to find the "optimal" cut-off point (threshold)

I have a set of weighted features for machine learning. I'd like to reduce the feature set and just use those with a very large or very small weight.
So, given the image of sorted weights below, I'd only like to use the features whose weights lie above the upper or below the lower yellow line.
What I'm looking for is some kind of slope change detection so I can discard all the features until the first/last slope coefficient increase/decrease.
While I (think I) know how to code this myself (with first and second numerical derivatives), I'm interested in any established methods. Perhaps there's some statistic or index that computes something like that, or anything I can use from SciPy?
Edit:
At the moment, I'm using 1.8*positive.std() as positive and 1.8*negative.std() as negative threshold (fast and simple), but I'm not mathematician enough to determine how robust this is. I don't think it is, though. ⍨
If the data are (approximately) Gaussian distributed, then just using a multiple of the standard deviation is sensible. If you are worried about heavier tails, then you may want to base your analysis on order statistics.
Since you've plotted it, I'll assume you're willing to sort all of the data. Let N be the number of data points in your sample, and let x[i] be the i'th value in the sorted list of values. Then 0.5*(x[int(0.8413*N)] - x[int(0.1587*N)]) is an estimate of the standard deviation which is more robust against outliers. This estimate of the std can be used as you indicated above. (The magic numbers are the fractions of data that are less than [mean + 1 sigma] and [mean - 1 sigma] respectively.)
There are also conditions where just keeping the highest 10% and lowest 10% would be sensible as well; these cutoffs are easily computed if you have the sorted data on hand.
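A minimal sketch of that order-statistic estimate, applied to a hypothetical array of feature weights (the 1.8 multiplier mirrors the one mentioned in the question):

```python
import numpy as np

weights = np.random.default_rng(0).standard_t(df=3, size=1000)  # hypothetical heavy-tailed weights

x = np.sort(weights)
N = len(x)
# 0.8413 and 0.1587 are the Gaussian fractions below mean+1*sigma and mean-1*sigma.
robust_std = 0.5 * (x[int(0.8413 * N)] - x[int(0.1587 * N)])

k = 1.8                        # the multiplier currently used by the questioner
center = np.median(weights)
keep = (weights > center + k * robust_std) | (weights < center - k * robust_std)
selected = weights[keep]       # features with very large or very small weights
print(f"kept {keep.sum()} of {N} features")
```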
These are somewhat ad hoc approaches based on the content of your question. The general sense of what you're trying to do is (a form of) anomaly detection, and you can probably do a better job of it if you're careful in defining/estimating what the shape of the distribution is near the middle, so that you can tell when the features are getting anomalous.