Equivalent of sklearn's StratifiedGroupKFold for PySpark?

I have a dataframe for single-label binary classification with some class imbalance and I want to make a train-test split. Some observations are members of groups in the data that should only appear in either the test split or train split but not both.
Outside of PySpark, I could use StratifiedGroupKFold from sklearn. What is the easiest way to achieve the same effect with PySpark?
I looked at the sampleBy method from PySpark, but I'm not sure how to use it while keeping the groups separate.
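The closest I have come up with is to stratify over the groups themselves rather than over individual rows, roughly like the sketch below (column names, fractions, and the toy data are just placeholders), but I'm not sure this is the right way to do it:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data: each row belongs to a group and has a binary label
df = spark.createDataFrame(
    [(1, "g1", 0), (2, "g1", 0), (3, "g2", 1), (4, "g3", 0), (5, "g3", 1)],
    ["id", "group_id", "label"],
)

# One row per group, labelled with the group's majority class, so that
# stratification happens at the group level and groups stay intact
group_labels = (
    df.groupBy("group_id")
      .agg(F.round(F.avg("label")).cast("int").alias("group_label"))
)

# Keep ~80% of the groups within each class for the training split
train_groups = group_labels.sampleBy("group_label", fractions={0: 0.8, 1: 0.8}, seed=42)

train_df = df.join(train_groups.select("group_id"), on="group_id", how="inner")
test_df = df.join(train_groups.select("group_id"), on="group_id", how="left_anti")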
Documentation links:
StratifiedGroupKFold
sampleBy

Related

PySpark MLlib approximate nearest neighbour search for multiple keys

I want to use ANN from PySpark. I have a DataFrame of 100K keys for which I want to perform top-10 ANN searches on an already transformed Spark DataFrame. But it seems that the API of BucketedRandomProjectionLSH expects only one key at a time. I also want to avoid approxSimilarityJoin, because it only lets you set a threshold, which would lead to a variable k for each key (it also fails in my case, saying that for some records there are no NNs for the given threshold).
Currently, the best thing I have come up with is to .collect() the keys and call approxNearestNeighbors in a for loop on the driver, but that is terribly inefficient.
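For reference, the loop looks roughly like this (a sketch; model is assumed to be a fitted BucketedRandomProjectionLSHModel, keys_df holds the 100K keys, and transformed_df is the already transformed DataFrame):

# Pull all keys to the driver and query them one by one -- terribly slow
keys = keys_df.select("id", "features").collect()

results = []
for row in keys:
    # one sequential approxNearestNeighbors call per key
    nn = model.approxNearestNeighbors(transformed_df, row["features"], numNearestNeighbors=10)
    results.append((row["id"], nn.collect()))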
Does anyone know how I can get top-10 ANN searches for my 100K keys in parallel?
Thank you.

Combine vs ParDo in apache beam

May I know the exact difference between a ParDo and a Combine transformation in Apache Beam?
Can I see ParDo as the Map phase in the map/shuffle/reduce while Combine as the reduce phase?
Thank you!
As far as I understand Apache Beam, there are no explicit Map and Reduce phases.
You can apply several element-wise map functions in a row, where ParDo is the most general class that can be used for your own implementations.
The term reduce has been replaced by aggregation, and the corresponding class is Combine.
MapReduce is limited to graphs of the shape Map-Shuffle-Reduce, where Reduce is an elementwise operation, just like map, that is distinguished only by following the shuffle.
In Apache Beam, one can have arbitrary topologies, e.g.
Map-Map-Shuffle-Map-Shuffle-Map-Map-Shuffle-Map
so the notion of breaking phases down by that which follows shuffle no longer holds. (Beam calls Map/Shuffle ParDo and GroupByKey respectively.)
Combine operations are a special kind of map operation that is known to be associative (think sum, max, etc., though they can be much more complicated), which allows part of the work to be pushed to before the shuffle, e.g.
Shuffle-Sum
becomes
PartialSum-Shuffle-Sum
(Most MapReduce systems also have this notion, named combining or semi-reducing or similar.)
Note that Beam's CombinePerKey and CombineGlobally operations pair the shuffle with the CombineFn, so there is no need to GroupByKey first.
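To make the distinction concrete, here is a minimal sketch using the Beam Python SDK (the pipeline contents are made up for illustration; it runs on the local DirectRunner):

import apache_beam as beam

class SplitWords(beam.DoFn):
    # ParDo: arbitrary element-wise processing that may emit 0..n outputs per input
    def process(self, line):
        for word in line.split():
            yield (word, 1)

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create(["beam beam spark", "spark flink"])
        | "ParDo" >> beam.ParDo(SplitWords())
        # CombinePerKey: an associative aggregation; Beam can lift partial sums
        # to before the shuffle, and the GroupByKey happens inside the transform
        | "CountPerWord" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )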

featuretools dfs vs categorical_encoding

When I want to add categorical encoding, I can do it in two different ways:
With dfs, by setting the categorical feature up as a relationship and computing mean/std/skew statistics. In this case the categorical feature and the value(s) end up in the same dataframe.
With the categorical_encoding sub-library and fit_transform.
The only difference I see is that in the second case I have a wider range of parameters, e.g. setting method='leave_one_out', which can be more accurate than the plain mean used with dfs.
Am I right? And does categorical_encoding use parallel processing?
You can do the categorical encoding with DFS and also stack additional primitives to create new features. The library for categorical encoding does not use parallel processing, but does provide a wider range of encoders.
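For reference, the DFS route described in the question looks roughly like this (a sketch assuming the pre-1.0 featuretools EntitySet API; table and column names are made up):

import pandas as pd
import featuretools as ft

data = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "category": ["a", "b", "a", "b"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

es = ft.EntitySet(id="demo")
es = es.entity_from_dataframe(entity_id="rows", dataframe=data, index="id")
# Normalize the categorical column into its own entity so aggregations are computed per category
es = es.normalize_entity(base_entity_id="rows", new_entity_id="categories", index="category")

# mean/std/skew of value per category come back onto each row as stacked features
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity="rows",
    agg_primitives=["mean", "std", "skew"],
    trans_primitives=[],
)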

ELKI: How to Specify Feature Columns of CSV for K-Means

I am trying to run K-Means using ELKI MiniGUI. I have a CSV dataset of 15 features (columns) and a label column. I would like to do multiple runs of K-Means with different combinations of the feature columns.
Is there anywhere in the MiniGUI where I can specify the indices of the columns I would like to be used for clustering?
If not, what is the simplest way to achieve this by changing/extending ELKI in Java?
This is obviously easily achievable with Java code, or simply by preprocessing the data as necessary. Generate 10 variants, then launch ELKI via the command line.
But there is a filter to select columns: NumberVectorFeatureSelectionFilter. To only use columns 0,1,2 (in the numeric part; labels are treated separately at this point; this is a vector transformation):
-dbc.filter transform.NumberVectorFeatureSelectionFilter
-projectionfilter.selectedattributes 0,1,2
The filter could be extended using our newer IntRangeParameter to allow for specifications such as 1..3,5..8; but this has not been implemented yet.

Converting Dataframe from Spark to the type used by DL4j

Is there any convenient way to convert a Dataframe from Spark to the type used by DL4j? Currently, when using a Dataframe in algorithms with DL4j I get the error:
"type mismatch, expected: RDD[DataSet], actual: Dataset[Row]".
In general, we use datavec for that. I can point you at examples for that if you want. Dataframes make too many assumptions that make them too brittle to be used for real-world deep learning.
Beyond that, a data frame is not typically a good abstraction for representing linear algebra. (It falls down when dealing with images, for example.)
We have some interop with spark.ml here: https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/test/java/org/deeplearning4j/spark/ml/impl/SparkDl4jNetworkTest.java
But in general, a dataset is just a pair of ndarrays, just like numpy. If you have to use Spark tools and want to use ndarrays only on the last mile, then my advice would be to get the dataframe to match some form of purely numerical schema and map that to an ndarray "row".
In general, a big reason we do this is because all of our ndarrays are off heap.
Spark has many limitations when it comes to working with its data pipelines and using the JVM for things it shouldn't be used for (matrix math). We took a different approach that allows us to use GPUs and a bunch of other things efficiently.
When we do that conversion, it ends up being:
raw data -> numerical representation -> ndarray
What you could do is map dataframes onto a double/float array and then use Nd4j.create(float/doubleArray), or you could also do:
someRdd.map(inputFloatArray -> new DataSet(Nd4j.create(inputFloatArray), yourLabelINDArray))
That will give you a "dataset". You need a pair of ndarrays matching your input data and a label.
The label from there is relative to the kind of problem you're solving, whether that be classification or regression.