Apply scikit-learn model to pyspark dataframe column

I have a trained Scikit-learn LogisticRegression model in a sklearn.pipeline.Pipeline. This is an NLP task. The model is saved as a pkl file (actually in ML Studio models, but I download it to databricks dbfs).
I have a Hive table (delta-backed) containing some 1 million rows. The rows have, amongst other things, an id, a keyword_context column (containing the text), a modelled column (boolean, indicates the model has been run on this row), and a prediction column, which is an integer for the class output by the logistic regression.
My problem is how to update the prediction column.
Running locally, I can do:
def generatePredictions(data: pd.DataFrame, model: Pipeline) -> pd.DataFrame:
    data.loc[:, 'keyword_context'] = data.keyword_context.apply(lambda x: x.replace("\n", " "))
    data['prediction'] = model.predict(data.keyword_context)
    data['modelled'] = True
    return data
This actually runs fast enough (~20s), but running the UPDATEs back to Databricks via the databricks.sql.connector takes many hours. So I want to do the same in a pyspark notebook to bypass the lengthy upload.
The trouble is that the general advice is to use built-in functions (which this isn't), and where a UDF is unavoidable, the examples all use built-in types, not Pipelines. I'm wondering whether the model should be loaded within the function, and I presume the function takes a single row, which would mean a lot of loading. I'm really not sure how to code the function, or how to call it.

I work on the Fugue project, which aims to provide a simpler interface than the Spark one for porting Python/Pandas code. This is actually the first use case in our tutorial. Fugue will use the underlying Spark call (pandas_udf, udf, mapPartitions, applyInPandas, mapInPandas) based on the arguments you provide, with minimal overhead.
Here is what the code looks like.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
X = pd.DataFrame({"x_1": [1, 1, 2, 2], "x_2":[1, 2, 2, 3]})
y = np.dot(X, np.array([1, 2])) + 3
reg = LinearRegression().fit(X, y)
def predict(df: pd.DataFrame, model: LinearRegression) -> pd.DataFrame:
    return df.assign(predicted=model.predict(df))
input_df = pd.DataFrame({"x_1": [3, 4, 6, 6], "x_2":[3, 3, 6, 6]})
sdf = spark.createDataFrame(input_df)
# This is the start of the Fugue portion. It's minimal
from fugue import transform
result = transform(
    sdf,
    predict,
    schema="*,predicted:double",
    params=dict(model=reg),
    engine=spark
)
print(type(result))
result.show()
This code will be applied per partition. Schema is a requirement for Spark. I am not sure, but it sounds like you were using a row-wise UDF, so I think this will be faster. It also leaves your logic easy to unit test, because you don't need Spark to unit test it.
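For example, a minimal plain-pandas check of the predict function above, with no SparkSession involved:
# `predict` only uses pandas, so it can be tested without Spark
test_df = pd.DataFrame({"x_1": [3, 4], "x_2": [3, 3]})
local_result = predict(test_df, reg)
assert "predicted" in local_result.columns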
On loading the file inside the function
If you load the file inside the function, it gets executed on the workers. If you pass it in, it gets passed through the scheduler. This can create a lot of redundant passing of data. Loading it inside might speed things up.
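As a rough sketch of what that could look like for the original question (the pkl path is hypothetical, and the schema assumes the input dataframe does not already carry the prediction and modelled columns):
import pickle

_model = None  # cached per Python worker process, so the pkl is read once per worker, not once per row

def predict_loaded(df: pd.DataFrame) -> pd.DataFrame:
    global _model
    if _model is None:
        with open("/dbfs/path/to/model.pkl", "rb") as f:  # hypothetical path
            _model = pickle.load(f)
    df = df.assign(keyword_context=df["keyword_context"].str.replace("\n", " "))
    return df.assign(prediction=_model.predict(df["keyword_context"]), modelled=True)

# input_sdf: your Spark dataframe of unmodelled rows
result = transform(input_sdf, predict_loaded, schema="*,prediction:int,modelled:bool", engine=spark)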

Related

How can I union and repartition DataFrames stored in parquet files using fugue?

Say I have these 2 parquet files
import pandas as pd
pd.DataFrame([[0]], columns=["a"]).to_parquet("/tmp/1.parquet")
pd.DataFrame([[0],[2]], columns=["a"]).to_parquet("/tmp/2.parquet")
I would like to have a new parquet file that is a row-wise union of the two.
The resulting DataFrame should look like this
a
0 0
1 0
2 2
I also would like to repartition that new file with a pre-determined number of partitions.
You can certainly solve this problem in Pandas, Spark, or other computing frameworks, but each of them will require a different implementation. Using Fugue here, you can have one implementation for different computing backends; more importantly, the logic is unit testable without using any heavy backend.
from fugue import FugueWorkflow
def merge_and_save(file1, file2, file3, partition_num):
    dag = FugueWorkflow()
    df1 = dag.load(file1)
    df2 = dag.load(file2)
    df3 = df1.union(df2, distinct=False)
    df3.partition(num=partition_num).save(file3)
    return dag
To unit test this logic, just use small local files and use the default execution engine. Assume you have a function assert_eq:
merge_and_save(f1, f2, f3, 4).run()
assert_eq(pd.read_parquet(f3), expected_df)
And in real production, if the input files are large, you can switch to Spark:
merge_and_save(f4, f5, f6, 100).run(spark_session)
It's worth pointing out that partition_num is not respected by the default local execution engine, so we can't assert on the number of output files. But it takes effect when the backend is Spark or Dask.

Spark 2.2: Load org.apache.spark.ml.feature.LabeledPoint from file

The following line of code loads the (soon to be deprecated) mllib.regression.LabeledPoint from file to an RDD[LabeledPoint]:
MLUtils.loadLibSVMFile(spark.sparkContext, s"$path${File.separator}${fileName}_data_sparse").repartition(defaultPartitionSize)
I'm unable to find the equivalent function for ml.feature.LabeledPoint, which is not yet heavily used in the Spark documentation examples.
Can someone point me to the relevant function?
With the ml package you won't need to put the data into a LabeledPoint since you can specify which columns to use for labels/features in all transformations/algorithms. For example:
val gbt = new GBTClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
To load the LibSVM file as a dataframe, simply do:
val df = spark.read.format("libsvm").load(s"$path${File.separator}${fileName}_data_sparse")
This returns a dataframe with two columns: label, containing the labels stored as doubles, and features, containing the feature vectors stored as Vectors.
See the documentation for more information.
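For reference, since the main thread is PySpark: the Python equivalent is the same one-liner (the path here is hypothetical):
df = spark.read.format("libsvm").load("/path/to/data_sparse")  # hypothetical path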

Question Classification Using Support Vector Machines

I am trying to classify questions using SVM. I am following this link for reference:
https://shirishkadam.com/2017/07/03/nlp-question-classification-using-support-vector-machines-spacyscikit-learnpandas/
But they have used spaCy, scikit-learn, and pandas. I want to do the same thing using Spark MLlib.
I am using this code to create a DataFrame:
sc = SparkContext(conf=sconf)  # SparkContext
sqlContext = SQLContext(sc)
data = sc.textFile("<path_to_csv_file>")
header = data.first()
trainingDF = sqlContext.createDataFrame(data
    .filter(lambda line: line != header)
    .map(lambda line: line.split("|"))
    .map(lambda line: ([line[0]], [line[2]], [line[6]]))).toDF("Question", "WH-Bigram", "Class")
Printing the dataframe with trainingDF.show(3) gives the following result:
+--------------------+-------------------+------+
|            Question|          WH-Bigram| Class|
+--------------------+-------------------+------+
|[How did serfdom ...|          [How did]|[DESC]|
|[What films featu...|       [What films]|[ENTY]|
|[How can I find a...|          [How can]|[DESC]|
+--------------------+-------------------+------+
My sample csv file is -
#Question|WH|WH-Bigram|Class
How did serfdom develop in and then leave Russia ?|How|How did|DESC
I am using word2vec to create the training data for the SVM, and then trying to train it:
word2Vec1 = Word2Vec(vectorSize=2, minCount=0, inputCol="Question", outputCol="result1")
training = word2Vec1.fit(trainingDF).transform(trainingDF)
model = SVMWithSGD.train(training, iterations=100)
After using word2vec, my data is in this format:
[Row(Question=[u'How did serfdom develop in and then leave Russia ?'], WH-Bigram=[u'How did'], Class=[u'DESC'], result1=DenseVector([0.0237, -0.186])), Row(Question=[u'What films featured the character Popeye Doyle ?'], WH-Bigram=[u'What films'], Class=[u'ENTY'], result1=DenseVector([-0.2429, 0.0935]))]
But when I try to train on the dataframe with the SVM, I get the error TypeError: data should be an RDD of LabeledPoint, but got <class 'pyspark.sql.types.Row'>.
I am stuck here... I think the dataframe I have created is not correct.
Does anybody know how to create a suitable dataframe for training with SVM? Please let me know if I am doing something wrong.
Great that you are trying out one of the machine learning methods in Spark, but there are multiple problems with your approach:
1) Your data has multiple classes, while SVM in Spark is a binary classification model, so it won't work on this dataset (you can have a look at the source code here). You can try the one-class-vs-all-others approach and train as many models as there are classes in your data, but you would be better off using something like the MultilayerPerceptronClassifier or the multiclass logistic model in Spark.
2) MLlib is very unforgiving about the class labels you use: they can only be 0, 1, 2, ... or 0.0, 1.0, 2.0, ..., i.e. it does not automatically infer the number of classes from your output column. Even if you specify two classes as 1.0 and 2.0 it will not work; they have to be 0.0 and 1.0.
3) You need to use an RDD of LabeledPoint instead of a Spark dataframe; remember that spark.mllib is for use with RDDs, whereas spark.ml is for use with dataframes. For help creating a LabeledPoint RDD, you may refer to the Spark documentation here, where there are multiple examples (see also the sketch after this list).
4) On a feature engineering note, I don't think you want a vectorSize of 2 for your word2vec model (something like 10 would be a more appropriate starting point); that is simply too small to give a reasonable prediction.
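To illustrate points 2) and 3), here is a minimal sketch of building a LabeledPoint RDD from the word2vec output above; the label mapping is a hypothetical example (binary here, since SVMWithSGD is binary) and must start at 0.0:
from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

label_map = {"DESC": 0.0, "ENTY": 1.0}  # hypothetical mapping; labels must start at 0.0

training_rdd = training.rdd.map(lambda row: LabeledPoint(
    label_map[row["Class"][0]],        # row["Class"] is a one-element list, e.g. [u'DESC']
    Vectors.fromML(row["result1"])     # convert the ml vector produced by Word2Vec to an mllib vector
))
model = SVMWithSGD.train(training_rdd, iterations=100)  # binary only: exactly two label values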

Using MLUtils.convertVectorColumnsToML() inside a UDF?

I have a Dataset/Dataframe with a mllib.linalg.Vector (of Doubles) as one of the columns. I would like to add another column of type ml.linalg.Vector to this dataset (so I will have both types of Vectors). The reason is that I am evaluating a few algorithms, and some of those expect an mllib vector while some expect an ml vector. Also, I have to feed the output of one algorithm into another, and each uses a different type.
Can someone please help me convert the mllib.linalg.Vector to an ml.linalg.Vector and append it as a new column to the dataset in hand? I tried using MLUtils.convertVectorColumnsToML() inside a UDF and in regular functions, but was not able to get it working. I am trying to avoid creating a new dataset and then doing an inner join and dropping columns, as the dataset will eventually be huge and joins are expensive.
You can use the method asML to convert from an mllib vector to an ml vector. A UDF and usage example can look like this:
val convertToML = udf((mllibVec: org.apache.spark.mllib.linalg.Vector) => {
  mllibVec.asML
})
val df2 = df.withColumn("mlVector", convertToML($"mllibVector"))
Assuming df to be the original dataframe and the column with the mllib vector to be named mllibVector.
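Since the question title mentions MLUtils.convertVectorColumnsToML(): it also exists in PySpark, and it converts the named columns in place. So one way to keep both vector types, sketched here with the question's column names, is to duplicate the column first and convert only the duplicate:
from pyspark.mllib.util import MLUtils

# Duplicate the mllib column, then convert only the duplicate to an ml vector
df2 = MLUtils.convertVectorColumnsToML(df.withColumn("mlVector", df["mllibVector"]), "mlVector")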

How to use QuantileDiscretizer across groups in a DataFrame?

I have a DataFrame with the following columns.
scala> show_times.printSchema
root
|-- account: string (nullable = true)
|-- channel: string (nullable = true)
|-- show_name: string (nullable = true)
|-- total_time_watched: integer (nullable = true)
This is data about how much a customer has watched a particular show. I'm supposed to categorize the customers for each show based on total time watched.
The dataset has 133 million rows in total with 192 distinct show_names.
For each individual show I'm supposed to bin the customer into 3 categories (1,2,3).
I use Spark MLlib's QuantileDiscretizer. Currently, I loop through every show and run QuantileDiscretizer sequentially, as in the code below.
What I'd like to have in the end is for the following sample input to get the sample output.
Sample Input:
account,channel,show_name,total_time_watched
acct1,ESPN,show1,200
acct2,ESPN,show1,250
acct3,ESPN,show1,800
acct4,ESPN,show1,850
acct5,ESPN,show1,1300
acct6,ESPN,show1,1320
acct1,ESPN,show2,200
acct2,ESPN,show2,250
acct3,ESPN,show2,800
acct4,ESPN,show2,850
acct5,ESPN,show2,1300
acct6,ESPN,show2,1320
Sample Output:
account,channel,show_name,total_time_watched,Time_watched_bin
acct1,ESPN,show1,200,1
acct2,ESPN,show1,250,1
acct3,ESPN,show1,800,2
acct4,ESPN,show1,850,2
acct5,ESPN,show1,1300,3
acct6,ESPN,show1,1320,3
acct1,ESPN,show2,200,1
acct2,ESPN,show2,250,1
acct3,ESPN,show2,800,2
acct4,ESPN,show2,850,2
acct5,ESPN,show2,1300,3
acct6,ESPN,show2,1320,3
Is there a more efficient and distributed way to do it, using some groupBy-like operation, instead of looping through each show_name and binning them one after the other?
I know nothing about QuantileDiscretizer, but I think you're mostly concerned with the dataset to apply QuantileDiscretizer to. I think you want to figure out how to split your input dataset into smaller datasets per show_name (you said that there are 192 distinct show_names in the input dataset).
Solution 1: Partition Parquet Dataset
I've noticed that you use parquet as the input format. My understanding of the format is very limited, but I've noticed that people use partitioning schemes to split large datasets into smaller chunks that they can then process however they like.
In your case the partitioning scheme could include show_name.
That would make your case trivial, as the splitting would be done at write time (aka not my problem anymore).
See How to save a partitioned parquet file in Spark 2.1?
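A PySpark sketch of that idea (the output path is hypothetical):
# Write the dataset partitioned by show_name, so each show lands in its own directory
df.write.partitionBy("show_name").parquet("/tmp/shows_by_name")  # hypothetical path

# Each show can then be read back (and discretized) on its own
show1 = spark.read.parquet("/tmp/shows_by_name/show_name=show1")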
Solution 2: Scala's Future
Given your iterative solution, you could wrap every iteration into a Future that you'd submit to process in parallel.
Spark SQL's SparkSession (and Spark Core's SparkContext) are thread-safe.
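A sketch of the same idea in PySpark, using Python's concurrent.futures as a stand-in for Scala's Future (df and show_names are assumed from the question's setup, not part of the original answer):
from concurrent.futures import ThreadPoolExecutor
from pyspark.ml.feature import QuantileDiscretizer

def bin_one_show(show_name):
    subset = df.filter(df["show_name"] == show_name)
    qd = QuantileDiscretizer(numBuckets=3, inputCol="total_time_watched",
                             outputCol="Time_watched_bin")
    return qd.fit(subset).transform(subset)

# show_names: the 192 distinct names, e.g. collected once from df
with ThreadPoolExecutor(max_workers=8) as pool:
    binned = list(pool.map(bin_one_show, show_names))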
Solution 3: Dataset's filter and union operators
I would think twice before following this solution since it puts burden on your shoulders which I think could easily be sorted out by solution 1.
Given you've got one large 133-million-row parquet file, I'd first build the 192 datasets per show_name using the filter operator (as you did to build show_rdd, which goes against its name, as it's a DataFrame, not an RDD) and union them (again, as you did).
See Dataset API.
Solution 4: Use Window Functions
That's something I think could work, but I didn't check it out myself.
You could use window functions (see WindowSpec and Column's over operator).
Window functions would give you partitioning (windows), while over would somehow apply QuantileDiscretizer to a window/partition. That would, however, require "destructuring" QuantileDiscretizer into an Estimator to train a model and somehow applying the resulting model to the window again.
I think it's doable, but haven't done it myself. Sorry.
This is an older question; however, I'm answering it to help someone in the same situation in the future.
This can be achieved using a pandas UDF. Both the input and output of a grouped-map pandas UDF are dataframes. We need to provide the schema of the output dataframe in the annotation, as shown in the code sample below, which achieves the required result.
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, IntegerType
import pandas as pd

output_schema = StructType(df.schema.fields + [StructField('Time_watched_bin', IntegerType(), True)])

@pandas_udf(output_schema, PandasUDFType.GROUPED_MAP)
def get_buckets(pdf):
    # pdf: pandas dataframe holding all rows for one show_name group
    pdf['Time_watched_bin'] = pd.cut(pdf['total_time_watched'], 3, labels=False)
    return pdf

df = df.groupby('show_name').apply(get_buckets)
df will have a new column 'Time_watched_bin' with the bucket information.
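One caveat worth adding: pd.cut produces equal-width bins, whereas QuantileDiscretizer bins by quantiles. If you want the quantile behaviour, the closer pandas analogue (a suggestion, not what the answer above used) is:
pdf['Time_watched_bin'] = pd.qcut(pdf['total_time_watched'], 3, labels=False, duplicates='drop')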