PySpark linear forecast

I am still new to the world of PySpark and big data.
My problem relates to the linear forecasting function and how to derive these values for a larger dataset in PySpark.
Below is the link to the data I use for the scenario value calculation:
Scenario_Data
Scenario Data with output using return
Based on the expected return, I calculate the scenario value.
For example, if the expected return is 3%, I manually identify the rows that provide the values for X and Y; in this case 3% falls between 1% and 5%. After identifying those rows manually, I calculate the scenario value in Excel using FORECAST.LINEAR, which for 3% gives a scenario value of -162.5.
The objective is to compute all of this within PySpark, without the manual effort described above.
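To make the intent concrete, here is a rough sketch of one way the FORECAST.LINEAR step could be expressed in PySpark. The DataFrame and column names are assumptions (the linked data is not reproduced here), and the knot values are illustrative only, chosen so that the 3% case interpolates to -162.5 as in the example above.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative knot points (expected_return, scenario_value); replace with
# the real scenario data loaded from a table or file.
knots = spark.createDataFrame(
    [(0.01, -100.0), (0.05, -225.0), (0.10, -400.0)],
    ["expected_return", "scenario_value"],
)

target = 0.03  # the expected return to evaluate

# Closest knot at or below the target (x1, y1)
lower = (knots.filter(F.col("expected_return") <= target)
         .orderBy(F.col("expected_return").desc()).limit(1)
         .select(F.col("expected_return").alias("x1"),
                 F.col("scenario_value").alias("y1")))

# Closest knot at or above the target (x2, y2)
upper = (knots.filter(F.col("expected_return") >= target)
         .orderBy(F.col("expected_return").asc()).limit(1)
         .select(F.col("expected_return").alias("x2"),
                 F.col("scenario_value").alias("y2")))

# FORECAST.LINEAR with two points is plain linear interpolation:
# y = y1 + (target - x1) * (y2 - y1) / (x2 - x1)
result = (lower.crossJoin(upper)
          .withColumn("scenario_value",
                      F.col("y1") + (F.lit(target) - F.col("x1"))
                      * (F.col("y2") - F.col("y1"))
                      / (F.col("x2") - F.col("x1"))))
result.show()  # approximately -162.5 with the illustrative knots above

For many expected returns at once, the same bracketing logic can be done with a join plus window functions instead of per-value filters; an exact match on a knot would also need a small guard, which is omitted here.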
Let me know if you need any further details on this query
Thanks a lot in advance for the help
Note: I am using Databricks for this task.
Regards
Hitesh

Related

tbl_regression (gtsummary): ordering covariate levels and processing time

Originally in my df, I had BMI in numeric format (1-5), which I recoded (underweight to obese), converted to a factor, and set a specific reference level using relevel() (Normal, originally 3). I then fitted a logistic regression: y ~ BMI + other covariates. My questions are the following:
1 - When I plug my logistic model into tbl_regression(), the levels appear in an undesired order (underweight, obese 1, obese 2, overweight). Is there a way to rearrange the levels the way I want (underweight, overweight, obese 1, obese 2)?
2 - I used tbl_regression() on a small data set, which went fine. My new model, however, is based on 3M observations and 13 variables (the database is about 1 GB). This time tbl_regression() takes about an hour to process and output the table, which seems excessive given that I have a fast laptop. Is there a way to make this more efficient? I tried keeping only the model while using tbl_regression() and removing the database, but it is still extremely slow. With the trial data it was fine.
1 - I recommend using contrasts() to set the reference level; the relevel() function just moves a factor level to the first position. Examples here: Is there a way to relevel a variable in gtsummary after generating the beautiful table?
2 - I suspect that with such a large model, the confidence interval calculation is what is slowing you down. If you see a big difference in computation time between summary() or broom::tidy() with the CI calculation and tbl_regression(), please create an illustrative example (that anyone can run locally) and it can be looked into further.

How to tackle skewness and output file size in Apache Spark

I am facing a skew problem when trying to join 2 datasets. One of the partitions (on the column I am performing the join on) is far more skewed than the rest, and because of this one of the final output part files is 40 times larger than the other output part files.
I am using Scala and Apache Spark for my calculation, and the file format used is Parquet.
So I am looking for 2 solutions:
First, how can I tackle the skew, given that processing the skewed data takes a lot of time? (For the skewed data I have tried broadcasting, but it did not help.)
Second, how can I keep all the final output part files within a 256 MB range? I have tried the property spark.sql.files.maxPartitionBytes=268435456, but it is not making any difference.
Thanks,
Skew is a common problem when dealing with data.
To handle it, there is a technique called salting.
First, you may want to check out this video by Ted Malaska to get an intuition for salting.
Second, examine his repository on this theme.
Each skew issue tends to have its own best way to solve it.
I hope these materials help you.
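To give a feel for the idea, here is a minimal salting sketch in PySpark (the question uses Scala, but the same pattern translates directly to the Scala API). The table paths, the join column name "key", and the salt count are placeholders, not taken from the question.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
NUM_SALTS = 16  # tune this to the degree of skew

# Placeholder inputs: 'big' is the side that is skewed on 'key'.
big = spark.read.parquet("/path/to/big")
small = spark.read.parquet("/path/to/small")

# Spread each hot key of the skewed side over NUM_SALTS sub-keys.
big_salted = big.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# Replicate each row of the other side once per salt value so the join
# on (key, salt) still produces every match it produced before.
small_salted = small.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(NUM_SALTS)]))
)

joined = big_salted.join(small_salted, ["key", "salt"]).drop("salt")

The replication of the smaller side is the price paid for spreading the hot keys, so NUM_SALTS is worth tuning rather than setting it very high.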

How to pass a vector from tableau to R

I need to pass a vector of arguments from Tableau to Rserve. Specifically, I am doing IRR calculations in R (on Rserve), and I want to pass a vector of cash flows that are stored as columns in my table (instead of as rows/measures). So I want to collect all those cash flows into a vector and pass it to Rserve, since passing them one at a time slows down I/O.
SCRIPT_REAL("r_func(c(.arg1, .arg2, .arg3))",sum(cf1), sum(cf2), sum(cf3))
cf1..cfn are cash flows corresponding to various periods. The above code works well when there are few cash flows, but it takes a long time when I have a few hundred. Further, the time is spent not in calculation but in I/O when communicating with the remote Rserve. With a local Rserve, this calculation finishes in a few seconds, while on the remote one it takes well over a minute.
Also, I want to point out that Tableau/Rserve set one argument after another, and that takes time. My expectation is that once I have a vector, there would be just one transfer and one setting of arguments, which should speed things up.
The first step in understanding how Tableau interacts with R or Python is understanding how Tableau's table calcs work.
Tableau's SCRIPT_XXX() functions are table calculations, which means that you invoke them on a vector of aggregate query results, and the corresponding R or Python code needs to return a vector, usually of the same size. (I think you may be able to return a scalar or a smaller vector that gets replicated to appear like a vector of the same size as the argument, but I am not certain.)
You can control how your data is partitioned into vectors, and also the ordering of data in the vectors, by editing the table calc to specify the partitioning and addressing for that calc.
Partitioning determines how your aggregate query results are broken up into vectors for calculation purposes. Addressing determines how the elements of each vector are ordered. You can either do that based on the physical layout of the table structure, or (better) based on the specific dimensions.
See the Tableau online help for table calcs for more info, and look for online training videos from Tableau or blog entries (especially from anyone named Bora).
One way to test your understanding of these concepts is to create a Tableau table (i.e., a viz with a mark type of text) with several dimensions on the row and column shelves. Then create calculated fields for INDEX() and SIZE() and display them on text. Finally, change the partitioning and addressing in different ways by editing those table calcs. Try several different permutations. When you can confidently predict what those functions will produce for different settings, you're ready for more complex tasks, such as talking to R.
It is also instructive to experiment with FIRST(), LAST(), LOOKUP(), WINDOW_SUM(), etc., and finally dig into PREVIOUS_VALUE(). A warning: PREVIOUS_VALUE() is a bit odd and does not behave the way you probably assume it does. Still, it is a useful technique that can implement a recursive calculation, and it is about as close to a for loop as Tableau gets.

Some questions about the train_test_split() function

I am currently trying to use scikit-learn's LinearRegression() model to describe the relationship between two variables X and Y. Given a dataset with 8 columns and 1000 rows, I want to split this dataset into training and test sets using train_test_split.
My question: what is the difference between train_test_split(dataset, test_size, random_state=int) and train_test_split(dataset, test_size)? Also, does the second one (without setting random_state) give me a different test set and training set each time I re-run my program, and does the first one give me the same test set and training set on every run? What is the difference between setting random_state=42 vs random_state=43, for example?
In scikit-learn, train_test_split splits your input data into two sets: i) train and ii) test. It has a random_state argument that controls the random shuffling applied to the data before the split.
If random_state is not set, the split is seeded differently on each run, so you get a different train/test split of the same dataset every time (a stratified split is a separate feature, controlled by the stratify argument).
Suppose you want to measure the performance of your regression on the same data with different splits; you can vary random_state to achieve that, since each value gives you a different pseudo-random split of your initial data. To keep track of performance and reproduce it later on the same data, pass the same random_state value you used before.
This is useful for cross-validation in machine learning.
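A small sketch of the difference (the data below is made up; any dataset behaves the same way):

from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.rand(1000, 8)  # stand-in for the 8-column, 1000-row dataset
y = np.random.rand(1000)

# Without random_state: a different shuffle, and therefore a different
# split, every time the program runs.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)

# With random_state fixed: exactly the same split on every run.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# random_state=43 is just as valid; it simply yields a different,
# but again repeatable, split than random_state=42.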

Trying to get feature importance in Random Forest (PySpark)

I have customer data with close to 15k columns.
I'm trying to run RF on the data to reduce the number of columns and then run other ML algorithms on it.
I am able to run RF in PySpark but am unable to extract the feature importance of the variables.
Does anyone have any idea how to do this, or know of another technique that would help me reduce the 15k variables to some 200-odd variables?
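For what it's worth, Spark ML's fitted random forest models expose a featureImportances vector that can be mapped back to column names. A minimal sketch, assuming a DataFrame df with a label column named "label" (a classifier is shown, but RandomForestRegressor exposes the same attribute):

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Assumptions: 'df' is the customer DataFrame and 'label' is the target column.
feature_cols = [c for c in df.columns if c != "label"]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)
model = Pipeline(stages=[assembler, rf]).fit(df)

# featureImportances is a vector aligned with the assembler's input order,
# so it can be zipped back onto the original column names.
rf_model = model.stages[-1]
ranked = sorted(
    zip(feature_cols, rf_model.featureImportances.toArray()),
    key=lambda kv: kv[1],
    reverse=True,
)
top_200 = [name for name, score in ranked[:200]]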