join datasets with tfx tensorflow transform - apache-beam

I have a few CSV files, which I joined and aggregated with pandas to produce a training dataset. Now as part of productionising the model I would like this preprocessing to be done at scale with apache beam and tensorflow transform. However it is not quite clear to me how I can reproduce the same data manipulation there. Let's look at two main operations: JOIN dataset a and dataset b to produce c and group by col1 on dataset c. This would be a quite straightforward operation in pandas, but how would I do this in tensorflow transform running on apache beam? Am I using the wrong tool for the job? What would be the right tool then?

You can use the Beam DataFrames API to do the join and other preprocessing exactly as you would have in pandas. You can then use to_pcollection to get a PCollection that you can pass directly to your TensorFlow Transform operations, or save it as a file to read in later.
For top-level functions (such as merge) one needs to do
from apache_beam.dataframe.pandas_top_level_functions import pd_wrapper as beam_pd
and then call beam_pd.func(...) in place of pd.func(...).
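A minimal sketch of what that could look like, assuming two CSV inputs with a shared key column (the file paths and the key/col1 column names below are placeholders):

```python
import apache_beam as beam
from apache_beam.dataframe.convert import to_pcollection
from apache_beam.dataframe.io import read_csv

with beam.Pipeline() as p:
    # Read the CSVs as deferred (Beam) dataframes.
    df_a = p | 'ReadA' >> read_csv('path/to/dataset_a*.csv')
    df_b = p | 'ReadB' >> read_csv('path/to/dataset_b*.csv')

    # JOIN a and b, then GROUP BY col1, using the familiar pandas-style API.
    df_c = df_a.merge(df_b, on='key')
    df_agg = df_c.groupby('col1').sum()

    # Hand the result to TensorFlow Transform as a PCollection,
    # or write it out (e.g. with to_csv) and read it in later.
    pcoll = to_pcollection(df_agg)
```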

Related

Spark ML API to convert a vector to a probability for multilabel classification

I'm a bit new to the Spark ML API. I'm trying to do multi-label classification for 160 labels by training 160 classifiers (logistic regression or random forest, etc.). Once I train on Dataset[LabeledPoint], I'm finding it hard to get an API where I get the probability for each class for a single example. I've read on SO that you can use the pipeline API and get the probabilities, but for my use case this is going to be hard because I'll have to replicate 160 RDDs for my evaluation features, get the probability for each class, and then do a join to rank the classes by their probabilities. Instead, I want to have just one copy of the evaluation features, broadcast the 160 models, and then do the predictions inside a map function. I find myself having to implement this myself, but I wonder if there's a convenience API in Spark that does the same for different classifiers like logistic regression/RF, converting a Vector of features into the probability of it belonging to a class. Please let me know if there's a better way to approach multi-label classification in Spark.
EDIT: I tried to create a function to transform a vector to a label for random forest, but it's super annoying because I now have to clone large pieces of the tree-traversal code in Spark, and almost everywhere I encountered dead ends because some function or variable was private or protected. Correct me if I'm wrong, but if this use case is not already implemented, I think it is at least well justified, because scikit-learn already has such APIs in place to do this.
Thanks
Found the culprit line in Spark MLLib code: https://github.com/apache/spark/blob/5ad644a4cefc20e4f198d614c59b8b0f75a228ba/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L224
The predict method is marked as protected but it should actually be public for such use cases to be supported.
This has been fixed in version 2.4 as seen here:
https://github.com/apache/spark/blob/branch-2.4/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala
So upgrading to version 2.4 should do the trick ... although I don't think 2.4 is out yet, so it's a matter of waiting.
EDIT: for people who are interested, apparently this is not only beneficial for multi-label prediction; a 3-4x improvement in latency has also been observed for regular classification/regression on single-instance/small-batch predictions (see https://issues.apache.org/jira/browse/SPARK-16198 for details).
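For reference, the same single-instance prediction is also exposed in PySpark, although there the public predict/predictProbability methods only arrived in Spark 3.0. A minimal PySpark sketch with toy data (in the real use case you would fit and reuse 160 such models):

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy training data; the real use case would fit 160 such models.
train = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.1)),
     (1.0, Vectors.dense(2.0, 1.0))],
    ["label", "features"])
model = LogisticRegression().fit(train)

# Single-instance prediction without building a DataFrame or pipeline
# (these methods are public in PySpark since Spark 3.0).
x = Vectors.dense(1.0, 0.5)
print(model.predict(x))             # predicted label
print(model.predictProbability(x))  # per-class probabilities as a Vector
```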

How to perform EDA and visualize it if my data cannot fit in memory? My dataset size is 200 GB

Performing exploratory data analysis is the first step in any machine learning project. I mostly use pandas for data exploration on datasets that fit in memory, but I would like to know how to perform data cleaning, handle missing data and outliers, produce single-variable plots and density plots of how a feature affects the label, compute correlations, and so on.
Pandas is easy and intuitive for doing data analysis in Python, but I have difficulty handling multiple large dataframes in pandas due to limited system memory.
How do I do this for datasets that are larger than RAM, i.e. hundreds of gigabytes?
I have seen tutorials where Spark is used to filter based on rules and generate a dataframe that fits in memory, so eventually there is always data that resides entirely in memory, but I want to know how to work with a big dataset and perform exploratory data analysis on it.
Another challenge is visualizing big data for exploratory data analysis. It is easy to do with packages like seaborn or matplotlib if the data fits in memory, but how do you do it for big data?
To offer something concrete:
normally you will want to reduce your data, by aggregation, sampling, etc., to something small enough that a direct visualisation makes sense;
some tools exist for dealing directly with bigger-than-memory data (e.g. Dask) to create visuals. One good link is this: http://pyviz.org/tutorial/10_Working_with_Large_Datasets.html
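As an illustration only, a hedged Dask sketch of that reduce-then-plot workflow (the file pattern and column names are placeholders):

```python
import dask.dataframe as dd
import matplotlib.pyplot as plt

# A directory of CSVs that together are far bigger than RAM.
df = dd.read_csv("data/part-*.csv")

# Summary statistics and missing-value counts are computed out of core.
print(df.describe().compute())
print(df.isnull().sum().compute())

# Reduce before plotting: aggregate down to something small...
counts = df.groupby("some_category").size().compute()
counts.plot(kind="bar")

# ...or sample down to a frame that fits in memory.
sample = df.sample(frac=0.001).compute()
sample["some_numeric_column"].plot(kind="hist", bins=50)
plt.show()
```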

How to specify multiple columns in xgboost.trainWithDataframe when using Spark?

[screenshot of the xgboost.trainWithDataframe API documentation]
This is the API doc from xgboost.com; it seems that I can set only one column as the "featureCol".
As with any ML Estimator in Spark, this one expects inputCol to be a Vector of assembled features. Before you apply the Estimator, you should use tools from org.apache.spark.ml.feature to extract, transform and assemble the feature vector.
You can check How to vectorize DataFrame columns for ML algorithms? for an example Pipeline.
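The question and the linked example use the Scala API; purely as an illustrative sketch, the same assembly step in PySpark (with made-up column names) looks like this:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up raw feature columns plus a label.
df = spark.createDataFrame(
    [(1.0, 2.0, 3.0, 0.0),
     (4.0, 5.0, 6.0, 1.0)],
    ["f1", "f2", "f3", "label"])

# Assemble the individual columns into one Vector column, which is
# what the estimator expects as its single feature column.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
assembled = assembler.transform(df)
# "features" can now be passed as the featureCol of the XGBoost estimator.
```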

Clustering data with categorical and numeric features in Apache Spark

I am currently looking for an Algorithm in Apache Spark (Scala/Java) that is able to cluster data that has numeric and categorical features.
As far as I have seen, there is an implementation for k-medoids and k-prototypes for pyspark (https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes), but I could not identify something similar for the Scala/Java version I am currently working with.
Is there another recommended algorithm to achieve something similar for Spark running Scala? Or am I overlooking something and could actually make use of the pyspark library in my Scala project?
If you need further information or clarification feel free to ask.
I think you first need to convert your categorical variables to numbers using OneHotEncoder; then you can apply your clustering algorithm from MLlib (e.g. k-means). I also recommend scaling or normalizing the features before applying the clustering algorithm, as it is distance-sensitive.
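A hedged PySpark sketch of that idea (the question asks about Scala/Java, where the org.apache.spark.ml API is analogous; the column names and k are made up, and the plural inputCols/outputCols form of OneHotEncoder assumes Spark 3.x):

```python
from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import (OneHotEncoder, StandardScaler,
                                StringIndexer, VectorAssembler)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up data: one categorical feature and two numeric features.
df = spark.createDataFrame(
    [("red", 1.0, 10.0), ("blue", 2.0, 20.0), ("red", 3.0, 30.0)],
    ["colour", "num1", "num2"])

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="colour", outputCol="colour_idx"),
    OneHotEncoder(inputCols=["colour_idx"], outputCols=["colour_vec"]),
    VectorAssembler(inputCols=["colour_vec", "num1", "num2"], outputCol="raw"),
    StandardScaler(inputCol="raw", outputCol="features"),  # k-means is distance-sensitive
    KMeans(featuresCol="features", k=2),
])
clustered = pipeline.fit(df).transform(df)  # adds a "prediction" cluster column
```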

Is it necessary to convert categorical attributes to numerical attributes to use LabeledPoint function in Pyspark?

I am new to PySpark. I have a dataset that contains categorical features, and I want to use regression models from PySpark to predict continuous values. I am stuck on the pre-processing of the data that is required for using MLlib models.
Yes, it is necessary. You not only have to convert the categorical attributes to numerical values but also encode them to make them useful for linear models. Both steps are implemented in pyspark.ml (not mllib) with:
pyspark.ml.feature.StringIndexer - indexing.
pyspark.ml.feature.OneHotEncoder - encoding.
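For illustration, a minimal sketch of those two steps plus assembling a features Vector, with made-up column names (the plural inputCols/outputCols form of OneHotEncoder assumes Spark 3.x):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up data: a categorical feature, a numeric feature, a continuous label.
df = spark.createDataFrame(
    [("a", 1.0, 10.0), ("b", 2.0, 20.0), ("a", 3.0, 15.0)],
    ["category", "num", "label"])

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="category", outputCol="category_idx"),            # indexing
    OneHotEncoder(inputCols=["category_idx"], outputCols=["category_vec"]),  # encoding
    VectorAssembler(inputCols=["category_vec", "num"], outputCol="features"),
])
prepared = pipeline.fit(df).transform(df)
# The "features" and "label" columns can now be fed to a pyspark.ml regression estimator.
```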