Question Classification Using Support Vector Machines - pyspark

I am trying to classify questions using an SVM. I am following this post for reference:
https://shirishkadam.com/2017/07/03/nlp-question-classification-using-support-vector-machines-spacyscikit-learnpandas/
It uses spaCy, scikit-learn and pandas, but I want to do the same thing using Spark MLlib.
I am using this code to create a DataFrame:
sc = SparkContext(conf=sconf) # SparkContext
sqlContext = SQLContext(sc)
data = sc.textFile("<path_to_csv_file>")
header = data.first()
trainingDF = sqlContext.createDataFrame(
    data
    .filter(lambda line: line != header)
    .map(lambda line: line.split("|"))
    .map(lambda line: ([line[0]], [line[2]], [line[6]]))
).toDF("Question", "WH-Bigram", "Class")
Printing the DataFrame with trainingDF.show(3) gives the following result:
+--------------------+-------------------+------+
|            Question|          WH-Bigram| Class|
+--------------------+-------------------+------+
|[How did serfdom ...|          [How did]|[DESC]|
|[What films featu...|       [What films]|[ENTY]|
|[How can I find a...|          [How can]|[DESC]|
+--------------------+-------------------+------+
My sample CSV file is:
#Question|WH|WH-Bigram|Class
How did serfdom develop in and then leave Russia ?|How|How did|DESC
I am using Word2Vec to create the training data and then trying to train an SVM on it:
word2Vec1 = Word2Vec(vectorSize=2, minCount=0, inputCol="Question", outputCol="result1")
training = word2Vec1.fit(trainingDF).transform(trainingDF)
model = SVMWithSGD.train(training, iterations=100)
After applying Word2Vec, my data looks like this:
[Row(Question=[u'How did serfdom develop in and then leave Russia ?'], WH-Bigram=[u'How did'], Class=[u'DESC'], result1=DenseVector([0.0237, -0.186])), Row(Question=[u'What films featured the character Popeye Doyle ?'], WH-Bigram=[u'What films'], Class=[u'ENTY'], result1=DenseVector([-0.2429, 0.0935]))]
But when I try to train the SVM on this DataFrame, I get the error TypeError: data should be an RDD of LabeledPoint, but got <class 'pyspark.sql.types.Row'>
I am stuck here. I think the DataFrame I have created is not correct.
Does anybody know how to create a suitable DataFrame for training an SVM? Please let me know if I am doing something wrong.

It's great that you are trying out one of the machine learning methods in Spark, but there are several problems with your approach:
1) Your data has multiple classes, so this is not a binary classification problem, and the SVM in Spark won't work on this dataset (you can have a look at the source code here). You can try the one-vs-rest approach and train as many binary models as there are classes in your data. However, you would be better off using something like the MultilayerPerceptronClassifier or the multiclass logistic regression model in Spark.
2) MLlib is very unforgiving about the class labels you use: they must be 0, 1, 2 or 0.0, 1.0, 2.0 and so on, i.e. it does not automatically infer the number of classes from your output column. Even if you specify two classes as 1.0 and 2.0 it will not work; they have to be 0.0 and 1.0.
3) You need to use an RDD of LabeledPoint instead of a Spark DataFrame. Remember that spark.mllib works with RDDs, whereas spark.ml works with DataFrames. For help on creating a LabeledPoint RDD, you may refer to the Spark documentation here, which has multiple examples.
4) On a feature engineering note, I don't think you want a vectorSize of 2 for your Word2Vec model (something like 10 would be more appropriate as a starting point); two dimensions are simply too few to give a reasonable prediction.
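Points 1 and 2 can be made concrete with a minimal plain-Python sketch (no Spark required). It shows one way to map string classes such as DESC and ENTY to contiguous 0-based doubles, and how the binary targets for a single one-vs-rest model could be derived. The helper names index_labels and one_vs_rest_targets are hypothetical, and the frequency-based ordering is an assumption chosen to resemble what an indexer such as spark.ml's StringIndexer does by default.

```python
from collections import Counter


def index_labels(labels):
    """Map string labels to 0.0, 1.0, ... (most frequent label gets 0.0)."""
    counts = Counter(labels)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}


def one_vs_rest_targets(labels, positive):
    """Binary targets for one model: 1.0 for `positive`, 0.0 for all others."""
    return [1.0 if lab == positive else 0.0 for lab in labels]


# Class column of the three rows shown in the question.
classes = ["DESC", "ENTY", "DESC"]

mapping = index_labels(classes)                      # multiclass label encoding
indexed = [mapping[c] for c in classes]              # labels a multiclass model could use
binary_desc = one_vs_rest_targets(classes, "DESC")   # labels for the "DESC vs rest" model
```

With one such binary target list per class, each binary SVM sees only 0.0 and 1.0, which is the label format point 2 asks for.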

Related

IllegalArgumentException: 'Field "label" does not exist Spark MLlib

I'm trying to model some data with logistic regression from Spark MLlib. For the model creation I have the following columns:
ID,
features,
label
I can split it into training and test data via
(trainsample,testsample) = sample.randomSplit([0.7, 0.3], seed)
Also, I can define my model:
lr = LogisticRegression(featuresCol="features", labelCol="label",
                        predictionCol="prediction")
Then I can train and test it with:
lrmodel = lr.fit(trainsample)
result = lrmodel.transform(testsample)
All fine. But now I want to use my model to predict unlabeled data, and I always get the following error:
IllegalArgumentException: 'Field "label" does not exist
I tried to create a dummy label column (all values 999). But then all my predictions belong to one class (class 6 of 7 different classes). So the label seems to influence my predictions, even with a pretrained model.
Maybe "lrmodel.transform" is just for testing and there is other syntax for using the model, but I didn't find anything on this topic. Any help would be appreciated.
Found the issue: I had the label within my feature set. Thanks for your help.
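The root cause above (the label leaking into the features) can be guarded against when choosing which columns go into the feature vector. The following is a minimal plain-Python sketch; the column names age and income and the helper feature_columns are hypothetical, since the original feature columns are not shown.

```python
def feature_columns(all_columns, non_features=("ID", "label")):
    """Columns that may go into the feature vector (ID and label excluded)."""
    return [c for c in all_columns if c not in non_features]


train_columns = ["ID", "age", "income", "label"]   # hypothetical labeled data
unlabeled_columns = ["ID", "age", "income"]        # same data without a label

train_features = feature_columns(train_columns)
predict_features = feature_columns(unlabeled_columns)
```

If both lists match, the feature vector can be built the same way for labeled and unlabeled data, so prediction no longer depends on a label column being present.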

Applying transformations with filter or map which one is faster Scala spark

I am trying to do some transformations on a dataset with Spark using Scala. I currently use Spark SQL but want to move the code to native Scala code. I want to know whether to use filter or map for operations like matching the values in a column and getting a single column, after the transformation, into a different dataset.
SELECT * FROM TABLE WHERE COLUMN = ''
I used to write something like this in Spark SQL. Can someone tell me an alternative way to write the same thing using map or filter on the dataset, and which one is faster?
You can read the documentation on the Apache Spark website. The API documentation is at https://spark.apache.org/docs/2.3.1/api/scala/index.html#package.
Here is a small example:
val df = sc.parallelize(Seq((1,"ABC"), (2,"DEF"), (3,"GHI"))).toDF("col1","col2")
val df1 = df.filter("col1 > 1")
df1.show()
val df2 = df1.map(x => x.getInt(0) + 3)
df2.show()
If I understand your question correctly, you need to rewrite your SQL query using the DataFrame API. Your query reads all columns from the table TABLE and keeps the rows where COLUMN is an empty string. You can do this with a DataFrame in the following way:
spark.read.table("TABLE")
  .where($"COLUMN".eqNullSafe(""))
  .show(10)
Performance will be the same as in your SQL. Use the dataFrame.explain(true) method to understand what Spark will do.

Spark 2.2: Load org.apache.spark.ml.feature.LabeledPoint from file

The following line of code loads the (soon to be deprecated) mllib.regression.LabeledPoint from file to an RDD[LabeledPoint]:
MLUtils.loadLibSVMFile(spark.sparkContext, s"$path${File.separator}${fileName}_data_sparse").repartition(defaultPartitionSize)
I'm unable to find the equivalent function for ml.feature.LabeledPoint, which is not yet heavily used in the Spark documentation examples.
Can someone point me to the relevant function?
With the ml package you won't need to put the data into a LabeledPoint since you can specify which columns to use for labels/features in all transformations/algorithms. For example:
val gbt = new GBTClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
To load the LibSVM file as a dataframe, simply do:
val df = spark.read.format("libsvm").load(s"$path${File.separator}${fileName}_data_sparse")
According to the documentation:
The loaded DataFrame has two columns: label containing labels stored as doubles and features containing feature vectors stored as Vectors.
See the documentation for more information.
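To make the two columns more tangible, here is a minimal plain-Python sketch of how one line of a LIBSVM text file maps to a label and a sparse feature set. It only illustrates the file format and is not Spark's loader; the function name parse_libsvm_line and the sample line are hypothetical. It assumes the usual LIBSVM convention of 1-based feature indices, which are shifted to 0-based here.

```python
def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {zero_based_index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        # LIBSVM files use 1-based indices; shift them to 0-based.
        features[int(index) - 1] = float(value)
    return label, features


# Hypothetical sample line: label 1, feature 3 = 0.5, feature 10 = 2.0
label, features = parse_libsvm_line("1 3:0.5 10:2.0")
```

The label corresponds to the label column (a double), and the index/value pairs correspond to the sparse vector stored in the features column.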

Using MLUtils.convertVectorColumnsToML() inside a UDF?

I have a Dataset/DataFrame with a mllib.linalg.Vector (of Doubles) as one of the columns. I would like to add another column of type ml.linalg.Vector to this dataset (so I will have both types of vectors). The reason is that I am evaluating a few algorithms; some of them expect an mllib vector and some expect an ml vector. Also, I have to feed the output of one algorithm to another, and each uses a different type.
Can someone please help me convert mllib.linalg.Vector to ml.linalg.Vector and append a new column to the dataset at hand? I tried using MLUtils.convertVectorColumnsToML() inside a UDF and in regular functions but could not get it working. I am trying to avoid creating a new dataset and then doing an inner join and dropping the columns, as the dataset will eventually be huge and joins are expensive.
You can use the method asML to convert from an mllib to an ml vector. A UDF and usage example can look like this:
val convertToML = udf((mllibVec: org.apache.spark.mllib.linalg.Vector) => {
  mllibVec.asML
})
val df2 = df.withColumn("mlVector", convertToML($"mllibVector"))
This assumes df is the original DataFrame and the column with the mllib vector is named mllibVector.

How to use QuantileDiscretizer across groups in a DataFrame?

I have a DataFrame with the following columns.
scala> show_times.printSchema
root
|-- account: string (nullable = true)
|-- channel: string (nullable = true)
|-- show_name: string (nullable = true)
|-- total_time_watched: integer (nullable = true)
This is data about how much time each customer has watched a particular show. I'm supposed to categorize the customers for each show based on total time watched.
The dataset has 133 million rows in total with 192 distinct show_names.
For each individual show I'm supposed to bin the customers into 3 categories (1, 2, 3).
I use Spark MLlib's QuantileDiscretizer.
Currently I loop through every show and run QuantileDiscretizer sequentially, as in the code below.
What I'd like to have in the end is for the following sample input to get the sample output.
Sample Input:
account,channel,show_name,total_time_watched
acct1,ESPN,show1,200
acct2,ESPN,show1,250
acct3,ESPN,show1,800
acct4,ESPN,show1,850
acct5,ESPN,show1,1300
acct6,ESPN,show1,1320
acct1,ESPN,show2,200
acct2,ESPN,show2,250
acct3,ESPN,show2,800
acct4,ESPN,show2,850
acct5,ESPN,show2,1300
acct6,ESPN,show2,1320
Sample Output:
account,channel,show_name,total_time_watched,Time_watched_bin
acct1,ESPN,show1,200,1
acct2,ESPN,show1,250,1
acct3,ESPN,show1,800,2
acct4,ESPN,show1,850,2
acct5,ESPN,show1,1300,3
acct6,ESPN,show1,1320,3
acct1,ESPN,show2,200,1
acct2,ESPN,show2,250,1
acct3,ESPN,show2,800,2
acct4,ESPN,show2,850,2
acct5,ESPN,show2,1300,3
acct6,ESPN,show2,1320,3
Is there a more efficient and distributed way to do this using some groupBy-like operation, instead of looping through each show_name and binning them one after the other?
I know nothing about QuantileDiscretizer, but think you're mostly concerned with the dataset to apply QuantileDiscretizer to. I think you want to figure out how to split your input dataset into smaller datasets per show_name (you said that there are 192 distinct show_name in the input dataset).
Solution 1: Partition Parquet Dataset
I've noticed that you use parquet as the input format. My understanding of the format is very limited, but I've noticed that people use some partitioning scheme to split large datasets into smaller chunks that they can then process however they like (per that partitioning scheme).
In your case the partitioning scheme could include show_name.
That would make your case trivial, as the splitting would be done at write time (aka not my problem anymore).
See How to save a partitioned parquet file in Spark 2.1?
Solution 2: Scala's Future
Given your iterative solution, you could wrap every iteration into a Future that you'd submit to process in parallel.
Spark SQL's SparkSession (and Spark Core's SparkContext) are thread-safe.
Solution 3: Dataset's filter and union operators
I would think twice before following this solution, since it puts a burden on your shoulders that I think could easily be handled by solution 1.
Given you've got one large 133-million-row parquet file, I'd first build the 192 datasets per show_name using filter operator (as you did to build show_rdd which is against the name as it's a DataFrame not RDD) and union (again as you did).
See Dataset API.
Solution 4: Use Window Functions
That's something I think could work, but didn't check it out myself.
You could use window functions (see WindowSpec and Column's over operator).
Window functions would give you partitioning (windows) while over would somehow apply QuantileDiscretizer to a window/partition. That would however require "destructuring" QuantileDiscretizer into an Estimator to train a model and somehow fit the result model to the window again.
I think it's doable, but haven't done it myself. Sorry.
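The partition-then-rank idea behind solution 4 can be illustrated outside Spark with a minimal plain-Python sketch. It groups rows by show_name, orders each group by total_time_watched, and assigns 1-based buckets using the bucket-size rule of SQL's NTILE (the first buckets take one extra row when the group size is not divisible). This is rank-based binning, not the approximate-quantile algorithm that QuantileDiscretizer uses, and ntile_bins is a hypothetical helper. Spark SQL's ntile window function appears to be the closest built-in counterpart, though that is not verified here against the 133-million-row dataset.

```python
from collections import defaultdict


def ntile_bins(rows, key, value, buckets=3):
    """Return copies of rows with a 1-based 'Time_watched_bin', computed per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    result = []
    for group_rows in groups.values():
        ordered = sorted(group_rows, key=lambda r: r[value])
        base, extra = divmod(len(ordered), buckets)
        start = 0
        for bucket in range(1, buckets + 1):
            # The first `extra` buckets each take one additional row.
            size = base + (1 if bucket <= extra else 0)
            for row in ordered[start:start + size]:
                result.append({**row, "Time_watched_bin": bucket})
            start += size
    return result


# Sample input from the question: six accounts for each of two shows.
times = [200, 250, 800, 850, 1300, 1320]
rows = [
    {"account": f"acct{i + 1}", "show_name": show, "total_time_watched": t}
    for show in ("show1", "show2")
    for i, t in enumerate(times)
]

binned = ntile_bins(rows, "show_name", "total_time_watched")
bins = [r["Time_watched_bin"] for r in binned]
```

On the sample input this reproduces the bins 1, 1, 2, 2, 3, 3 for each show. Rows with equal watch times may land in different buckets, because the split depends only on rank.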
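This is an older question, but I am answering it to help anyone in the same situation in the future.
It can be achieved using a pandas UDF. Both the input and the output of a grouped-map pandas UDF are pandas DataFrames. We need to provide the schema of the output DataFrame in the decorator, as shown in the code sample below, which produces the required result.
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, IntegerType

output_schema = StructType(df.schema.fields + [StructField('Time_watched_bin', IntegerType(), True)])

@pandas_udf(output_schema, PandasUDFType.GROUPED_MAP)
def get_buckets(pdf):
    # pdf: pandas DataFrame holding all rows of one show_name
    # pd.cut with an integer splits the value range into equal-width bins; labels=False returns the codes 0, 1, 2
    pdf['Time_watched_bin'] = pd.cut(pdf['total_time_watched'], 3, labels=False)
    return pdf

df = df.groupby('show_name').apply(get_buckets)
df will then have a new column 'Time_watched_bin' with the bucket information. Note that pd.cut produces equal-width bins numbered from 0; pd.qcut gives quantile-based bins instead, and adding 1 to the result yields the labels 1 to 3 shown in the question.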