Trying to get feature importance from a Random Forest (PySpark)

I have customer data with close to 15k columns.
I'm trying to run a Random Forest on the data to reduce the number of columns and then run other ML algorithms on it.
I am able to run RF in PySpark but am unable to extract the feature importance of the variables.
Does anyone know how to do this, or of any other technique that would help me reduce the 15k variables to some 200-odd variables?
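A minimal sketch of one way to do this with Spark ML (not the older RDD-based MLlib API): the fitted Random Forest model exposes a featureImportances vector that can be paired with the assembled column names. The DataFrame df and the "label" column name are assumptions about your data.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Assemble all non-label columns into a single feature vector.
feature_cols = [c for c in df.columns if c != "label"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled = assembler.transform(df)

rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)
model = rf.fit(assembled)

# featureImportances is a vector aligned with the assembler's input columns;
# pair it with the names and keep, say, the 200 most important features.
importances = list(zip(feature_cols, model.featureImportances.toArray()))
top_200 = sorted(importances, key=lambda x: x[1], reverse=True)[:200]

Use RandomForestRegressor instead if your target is continuous; the featureImportances attribute works the same way.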

Related

PySpark linear forecast

I am still new to the world of PySpark and big data.
My problem relates to a linear forecasting function and how to derive this data for a larger dataset in PySpark.
Below is the link to the data I use for the scenario value calculation:
Scenario_Data
Scenario Data with output using return
Based on the expected return, I calculate the scenario value.
For example, if the expected return is 3%, I manually identify the rows that provide the values for X and Y; in this case 3% falls between 1% and 5%. After identifying those rows manually, I calculate the scenario value in Excel with the FORECAST.LINEAR formula, so for 3% the computed scenario value is -162.5.
The objective is to calculate all of this within PySpark, with none of the manual effort mentioned above.
Let me know if you need any further details on this query.
Thanks a lot in advance for the help.
Note: I am using Databricks for this task.
Regards
Hitesh
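A hedged sketch of replicating Excel's FORECAST.LINEAR between the two bracketing rows in PySpark. The DataFrame name scenario_df and the column names expected_return and scenario_value are assumptions about your data layout.

from pyspark.sql import functions as F

def scenario_value(scenario_df, target_return):
    # Row just below (or equal to) the target return, and row just above it.
    lower = (scenario_df.filter(F.col("expected_return") <= target_return)
                        .orderBy(F.col("expected_return").desc()).first())
    upper = (scenario_df.filter(F.col("expected_return") >= target_return)
                        .orderBy(F.col("expected_return").asc()).first())
    x0, y0 = lower["expected_return"], lower["scenario_value"]
    x1, y1 = upper["expected_return"], upper["scenario_value"]
    if x0 == x1:
        return y0
    # Straight-line interpolation between the two bracketing points, which is
    # what FORECAST.LINEAR computes when given exactly those two points.
    return y0 + (y1 - y0) * (target_return - x0) / (x1 - x0)

For example, with points (1%, y0) and (5%, y1), a target of 3% returns the value halfway between y0 and y1.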

Using CP-SAT with 3 million boolean variables

Dear all,
I want to understand whether I am using the CP-SAT solver improperly. Basically, my code automatically builds a model by reading a CSV dataset. It calls model.NewBoolVar() for each record of the dataset, multiplied by the number of possible decisions the optimization problem has to make.
For example, if I have a dataset with 1 million records and I have to decide between 3 options, the model will contain 3 million boolean variables. The combination of those 3 million booleans is the solution to my optimization problem.
Currently, beyond about 100K variables the program becomes unstable and Python crashes. Do you think I'm trying to use CP-SAT improperly? Do you have experience with these kinds of volumes?
Thank you very much.
Cheers
You are aware that this is an NP-hard problem.
Thus, potentially, you are creating a search space of size 2^3,000,000.
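For reference, a minimal sketch of the variable-creation pattern described in the question, using OR-Tools CP-SAT. The record and option counts are illustrative placeholders; at millions of records this pattern creates millions of BoolVars, which is where memory and stability problems tend to appear.

from ortools.sat.python import cp_model

NUM_RECORDS = 1_000   # in the question this would be ~1,000,000
NUM_OPTIONS = 3

model = cp_model.CpModel()
choice = {}
for r in range(NUM_RECORDS):
    for o in range(NUM_OPTIONS):
        choice[r, o] = model.NewBoolVar(f"choice_r{r}_o{o}")
    # Each record must be assigned exactly one of the options.
    model.Add(sum(choice[r, o] for o in range(NUM_OPTIONS)) == 1)

solver = cp_model.CpSolver()
status = solver.Solve(model)

One common mitigation is to decompose the dataset into independent chunks and solve them separately, if the constraints allow it.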

Tableau Time Series Prediction using Python Integration

I need help regarding time series prediction in Tableau. So far, here is what I can do:
Connect to TabPY
Call / Run scripts on TabPy
My current issue is that Tableau doesn't seem to allow more output elements than input elements. Say I want to use the last 100 data points to predict the coming 10 points. Getting the data into Python isn't a problem. The problem comes when I want to return a list with 110 elements. I've also tried returning just the 10 predicted elements, and it complains that it expects a 100-element list.
Thanks for reading
I've found a workaround. You can see the post here for more information. Basically, you shift the original values by the prediction amount and then have the prediction return the same number of elements as the shifted original.
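A hedged sketch of that shifting workaround: TabPy's SCRIPT_* functions must return exactly as many elements as they receive, so the series is padded on the Tableau side by the forecast horizon and the script fills the padded tail with predictions. The simple trend extrapolation below is only a placeholder for whatever model you actually use, and the function name is hypothetical.

import numpy as np

def forecast_same_length(values, horizon=10):
    # 'values' arrives with `horizon` trailing placeholders (e.g. None)
    # because the date axis was shifted forward in Tableau.
    history = [v for v in values if v is not None][: len(values) - horizon]
    x = np.arange(len(history))
    slope, intercept = np.polyfit(x, history, 1)            # fit a line to the history
    future_x = np.arange(len(history), len(history) + horizon)
    predictions = (slope * future_x + intercept).tolist()   # extrapolate the trend
    return list(history) + predictions                      # same length as the input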

Some questions about the train_test_split() function

I am currently trying to use scikit-learn's LinearRegression() model to describe the relationship between two variables X and Y. Given a dataset with 8 columns and 1000 rows, I want to split this dataset into training and test sets using train_test_split.
My question: what is the difference between train_test_split(dataset, test_size, random_state=int) and train_test_split(dataset, test_size)? Does the second one (without setting random_state=int) give me a different test set and training set each time I re-run my program? Does the first one give me the same test set and training set every time I re-run my program? And what is the difference between setting random_state=42 vs random_state=43, for example?
In Python, scikit-learn's train_test_split will split your input data into two sets: i) train and ii) test. It has an argument random_state which controls how the data is shuffled before splitting.
If the argument is not set, a fresh random seed is used on each call, so you will generally get a different split every time you re-run your program.
Assume you want a reproducible split so that you can measure the performance of your regression on the same data across runs; you can use random_state to achieve this. Each random_state value gives you one fixed pseudo-random split of your initial data. To keep track of performance and reproduce it later on the same data, pass the same random_state value you used before. Different values (e.g. 42 vs 43) are both reproducible, but they produce different splits.
This is also useful for cross-validation in machine learning.
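A short sketch contrasting the two calls; the toy DataFrame shape matches the 8-column, 1000-row dataset mentioned in the question.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

dataset = pd.DataFrame(np.random.rand(1000, 8))

# No random_state: a different seed is used on every run, so the split changes.
train_a, test_a = train_test_split(dataset, test_size=0.2)

# Fixed random_state: the same rows land in train/test on every run.
train_b, test_b = train_test_split(dataset, test_size=0.2, random_state=42)

# random_state=43 is just a different fixed seed: also reproducible,
# but it produces a different split than 42.
train_c, test_c = train_test_split(dataset, test_size=0.2, random_state=43)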

Average results from multiple simulations

I have a simulation with a lot of random components, so I would like to run many simulations and average the results (the result is determined by a variable called score).
How would you do this in NetLogo?
Currently I'm working on a program that will export the results to CSV; I then plan to use Python/Excel to average them. I don't like this because I want to run 100+ simulations (so there would be 100+ files). I'm hoping there is a better solution.
EDIT: or an implementation of what I described above (I would have to relearn enough Python/VBA to solve this, so it's going to take me some time).
This should be simple enough if you use BehaviorSpace.
In your experiment definition, put score in the "Measure runs using these reporters" textbox and uncheck "Measure runs at every step".
When you run your experiment, save your results using Table output. It will produce a CSV that you can open in your spreadsheet application. From there, producing an average of the score column should be trivial.
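If you do end up averaging in Python instead of a spreadsheet, a hedged sketch with pandas follows. The skiprows value and the exact column name ("score") are assumptions: BehaviorSpace writes a few metadata lines before the real header in its Table output, so adjust to match your file.

import pandas as pd

# Skip the BehaviorSpace metadata header, then read the per-run table.
results = pd.read_csv("experiment-table.csv", skiprows=6)
print(results["score"].mean())   # average score across all runs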