How do you do stratified sampling across different groups, when creating train and test sets, in pyspark? [closed] - pyspark

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I am looking for a way to split my data into train and test sets such that every level of my categorical variable appears in both sets.
My variable has 200 levels and the data has 18 million records. I tried the sampleBy function with a fraction of 0.8 per level and got the training set, but I had difficulty getting the test set: there is no index in Spark, and even after creating a key, using a left join or subtract to get the test set is very slow.
I want to group by my categorical variable and randomly sample within each category; if a category has only one observation, that observation should go into the train set.
Is there a built-in function or library that helps with this operation?

A pretty hard problem.
I don't know of a built-in function that will do this for you. Using sampleBy and then subtracting the sample from the full data would work but, as you said, would be pretty slow.
Alternatively, I wonder if you could try this*:
1. Use window functions to add a row number within each category, and move every row with rownum=1 into a separate dataframe, which you will add to your training set at the end.
2. On the remaining data, use randomSplit (a dataframe function) to divide it into training and test sets.
3. Add the separated data from Step 1 to the training set.
This should work faster.
*(I haven't tried it before! Would be great if you can share what worked in the end!)
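To make the logic of the three steps concrete, here is a minimal sketch in plain Python rather than Spark. It only illustrates the idea on an in-memory list; the function name `stratified_split`, its parameters, and the sample rows are hypothetical, and in Spark the same steps would be expressed with a window function and randomSplit as described above.

```python
import random
from collections import defaultdict


def stratified_split(rows, key, train_fraction=0.8, seed=0):
    """Split rows so that every category has at least one row in train.

    rows: list of records; key: function returning a record's category.
    (Hypothetical helper, plain-Python illustration of the three steps.)
    """
    rng = random.Random(seed)

    # Group the records by category (the "partition by" of a window function).
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    train, test = [], []
    for members in groups.values():
        members = list(members)
        rng.shuffle(members)
        # Step 1: reserve one row per category for the training set, so a
        # category with a single observation always ends up in train.
        train.append(members[0])
        # Step 2: randomly split the remaining rows of the category.
        for row in members[1:]:
            if rng.random() < train_fraction:
                train.append(row)
            else:
                test.append(row)
    # Step 3 is implicit: the reserved rows were appended to train directly.
    return train, test


# Usage with made-up data: two larger categories and one singleton.
rows = [("a", i) for i in range(50)] + [("b", i) for i in range(30)] + [("solo", 0)]
train, test = stratified_split(rows, key=lambda r: r[0], train_fraction=0.8, seed=1)
print(sorted({category for category, _ in train}))
```

Note that this guarantees every category appears in train, but not in test: a category with a single observation cannot be in both sets.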

Related

How should I implement the function to choose the route i want? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I have a simple network that has several routes from start to end. Vehicles from a transporter fleet will carry agents from the left conveyor to the right conveyor using the moveByTransporter block. What syntax can I use to refer to paths/nodes on the network?
Also, what would a sample line of code look like for checking the number of vehicles on a specific path?
This is my sample network and my idea for a new routing instead of just the shortest path (the path I want to follow is the one highlighted in yellow).
The moveTo block will take the route with the shortest distance. If the agent's speed is the same across all choices, then the shortest distance will also be the fastest time.
In the past, I have used Dijkstra's algorithm and manually routed my agents from a to b, then b to c, etc. This way, I could use travel times instead of just distances. This also opens up the possibility of considering congestion by applying a penalty to some segments if there are too many other agents on them. You can also pick a route, but then when you get to the next node, re-calculate the rest of the route for updated congestion considerations.
This is all custom, and I would not recommend it for simple problems. You would be better off looking at other alternatives (assuming constant speeds, or considering the pedestrian library with walls, etc., depending on your problem).
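As a rough illustration of the custom routing described above, here is a small Dijkstra sketch in plain Python (not AnyLogic code). It works on travel times instead of distances and multiplies the travel time of a segment by a penalty when too many vehicles are on it. The graph layout, the function name `fastest_route`, and the `capacity` and `penalty` parameters are all assumptions made for the example.

```python
import heapq


def fastest_route(graph, start, goal, vehicles_on=None, capacity=2, penalty=2.0):
    """Dijkstra on travel times with a simple congestion penalty.

    graph: {node: {neighbor: travel_time}}
    vehicles_on: {(node, neighbor): number of vehicles currently on that segment}
    Returns (path, total_time), or (None, inf) if the goal is unreachable.
    """
    vehicles_on = vehicles_on or {}
    best = {start: 0.0}
    previous = {}
    queue = [(0.0, start)]

    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, travel_time in graph.get(node, {}).items():
            # Penalize segments that carry more vehicles than the capacity.
            if vehicles_on.get((node, neighbor), 0) > capacity:
                travel_time = travel_time * penalty
            new_cost = cost + travel_time
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))

    if goal not in best:
        return None, float("inf")

    # Walk back from the goal to rebuild the path.
    path = [goal]
    while path[-1] != start:
        path.append(previous[path[-1]])
    return path[::-1], best[goal]


# Made-up network: two routes from "a" to "d".
graph = {"a": {"b": 1.0, "c": 2.0}, "b": {"d": 1.0}, "c": {"d": 0.5}, "d": {}}
free_path, free_time = fastest_route(graph, "a", "d")
busy_path, busy_time = fastest_route(graph, "a", "d", vehicles_on={("b", "d"): 5})
print(free_path, free_time)
print(busy_path, busy_time)
```

To re-plan at each node, you would call the function again from the current node with updated vehicle counts and follow only the first segment of the returned path.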

ML model to transform words [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I am building a model that takes a correct word as input. The output is a plausible version of that word as written by a human (containing some errors). My training dataset looks like this:
input - output
hello - helo
hello - heelo
hello - hellou
between - betwen
between - beetween
between - beetwen
between - bettwen
between - bitween
etc.
During preprocessing I add a measure of the distortion of a word. Then I encode the letters as numbers.
My current model uses a CNN. The number of input neurons equals the length of the longest word in the training dataset, and the number of output neurons is the same.
This model doesn't work as I expected. The output word does not look the way I expect.
e.g.
input - output
house - gjrtdd
Question:
How can I build or improve a model for this task? Is a CNN a good idea? What other methods can I use for this task?

Difference between Array and Timeseries [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
I want to save a result with the To File block in my MATLAB model.
I just want to know what the difference is between Array and Timeseries in the Save format field.
Let's start with Array, since it's the easiest. If you use a To File or To Workspace block with the Array option, it writes just a column of your variable's values.
If you use Timeseries, it writes the values in timeseries format. This structure consists of several fields, the main ones being Time and Data. So you get not only the values but also the times corresponding to that data. It also contains some additional information, such as the interpolation method (see the help for details).
When should I use Array and when Timeseries?
Clearly, if the time points matter to you, you need to use Timeseries. For example, if your simulation uses a variable time step, the data will not be uniformly spaced in time, so it's helpful to get the times too.
Using an array is useful if the times of the data are not important, for example if I save only one value of my variable from an Enabled Subsystem.

Optimizing total number of cables [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I need lots of cables of different small sizes (under 100 meters), and cables are only sold in lengths of 100 meters.
So, to optimize my purchase, I would like code where I can input the lengths of all the pieces of cable that I need. The code should combine my inputs under the constraint that each sum is at most 100, while minimizing the total number of 100 m cables that I need to buy.
If anyone could help with code in VBA, Matlab or Python I would be very grateful.
This is known as a bin-packing problem, and it's actually very difficult (computationally speaking) to find the optimal solution.
However, it is a problem that is practically useful to solve (as you have seen for yourself) and so there are several approaches that seek to find an approximate solution--one that is "good enough" without guaranteeing that it's the best possible solution. I did a quick search and found this course website, which has some examples that may help you out.
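As one example of such a "good enough" approach, here is a minimal Python sketch of the first-fit decreasing heuristic, which is a common approximation for bin packing (my choice for illustration, not necessarily what the course website uses). The function name and the sample lengths are made up, and pieces are allowed to fill a cable exactly.

```python
def first_fit_decreasing(lengths, stock_length=100):
    """Assign pieces to stock cables with the first-fit decreasing heuristic.

    Returns a list of cables, each a list of the piece lengths cut from it.
    The result is a good approximation, not a guaranteed optimum.
    """
    cables = []
    # Place the longest pieces first; they are the hardest to fit.
    for piece in sorted(lengths, reverse=True):
        if piece > stock_length:
            raise ValueError(f"piece of length {piece} exceeds the stock length")
        for cable in cables:
            # Put the piece on the first cable that still has room for it.
            if sum(cable) + piece <= stock_length:
                cable.append(piece)
                break
        else:
            # No existing cable has room: buy a new one.
            cables.append([piece])
    return cables


# Made-up example: five pieces totalling 200 m.
plan = first_fit_decreasing([20, 60, 30, 50, 40])
for number, cable in enumerate(plan, start=1):
    print(f"cable {number}: {cable} (used {sum(cable)} m)")
```

For this input the heuristic needs two cables, which is also the minimum possible because the pieces total 200 m.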
If you are looking for an exact solution, you can ask the related question "will I be able to fit the cables I need into N 100-meter cables?". This feasibility problem can be expressed as a "binary program", which is a special case of a "mixed-integer linear program", for which MATLAB has a solver called intlinprog (requires the optimization toolbox).
I'm sorry that I don't have any code to solve your problem, but I hope that this at least gives you some keywords to help you find more resources!
I believe this is like the cutting stock problem. There are some very good methods to solve this. Here is an implementation and some background. It is not too difficult to write an Excel front-end for this (see here).
If you google for "cutting stock problem" you will find lots of references.

Which is the best clustering algorithm to find outliers? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
Basically I have some hourly and daily data like
Day 1
Hours,Measure
(1,21)
(2,22)
(3,27)
(4,24)
Day 2
hours,measure
(1,23)
(2,26)
(3,29)
(4,20)
Now I want to find outliers in the data by considering both the hourly and the daily variations, using bivariate analysis on hour and measure.
So which clustering algorithm is best suited to finding outliers in this scenario?
One 'good' piece of advice (:P) I can give you is that, in my experience, it is NOT a good idea to treat time like a spatial feature. So beware of solutions that do this. You can probably start by searching the literature on outlier detection for time-series data.
You really should use a different representation for your data.
Why don't you use an actual outlier detection method, if you want to detect outliers?
Other than that, just read through some literature. k-means for example is known to have problems with outliers. DBSCAN on the other hand is designed to be used on data with "Noise" (the N in DBSCAN), which essentially are outliers.
Still, the way you are representing your data will make none of these work very well.
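One possible alternative representation, purely as an assumption about what would suit this data, is a single time-ordered series instead of separate per-day tables. The sketch below flattens the sample data from the question into (hours since start, measure) pairs; the function name `to_series` and the 24-hours-per-day default are hypothetical.

```python
def to_series(days, hours_per_day=24):
    """Flatten {day_number: [(hour, measure), ...]} into one ordered series.

    Each point becomes (hours elapsed since the start of day 1, measure),
    so consecutive days line up on a single time axis.
    """
    series = []
    for day_number, readings in days.items():
        for hour, measure in readings:
            series.append(((day_number - 1) * hours_per_day + hour, measure))
    series.sort()
    return series


# The sample data from the question.
days = {
    1: [(1, 21), (2, 22), (3, 27), (4, 24)],
    2: [(1, 23), (2, 26), (3, 29), (4, 20)],
}
series = to_series(days)
print(series)
```

A series in this form can then be passed to a time-series outlier detection method rather than to a clustering algorithm.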
You should use a time-series-based outlier detection method because of the nature of your data (it has its own seasonality, trend, autocorrelation, etc.). Time series outliers come in different kinds (additive outliers or AO, innovational outliers or IO, etc.), and it's somewhat complicated, but there are applications that make it easy to implement.
Download the latest build of R from http://cran.r-project.org/. Install the packages "forecast" & "TSA".
Use the auto.arima function of the forecast package to derive the best model fit for your data, and pass those variables along with your data to the detectAO and detectIO functions of TSA. These functions will report any outliers present in the data together with their time indexes.
R is also easy to integrate with other applications, or you can simply run it as a batch job. Hope that helps.