I'm trying to model some data with logistic regression, part of Spark MLlib. For the model creation I've got the following columns:
ID,
features,
label
I can split it into train and test data via
(trainsample,testsample) = sample.randomSplit([0.7, 0.3], seed)
Also, I can define my model:
lr = LogisticRegression(featuresCol="features", labelCol="label",
predictionCol="prediction")
Then I can train and test it with:
lrmodel = lr.fit(trainsample)
result = lrmodel.transform(testsample)
All fine. But now I want to use my model to predict unlabeled data, and I always get the following error:
IllegalArgumentException: 'Field "label" does not exist
I tried to create a dummy label column (all values 999). But then all my predictions belong to one class (class 6 out of 7 different classes). So the label seems to influence my predictions, even with a pretrained model.
Maybe "lrmodel.transform" is just for testing and there is other syntax for use the model. But I didn't find anything to this topic. Any help would be appreciated.
Found the issue... I had the label within my feature set x_x... Thanks for your help
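For anyone who hits the same error: once the label column is out of the feature set, the fitted model can score unlabeled data directly, because transform only reads the features column at prediction time. A minimal sketch (unlabeled_df is a hypothetical DataFrame with only the ID and features columns):

# Score new, unlabeled rows: the fitted model only needs "features",
# so no "label" column is required.
unlabeled = unlabeled_df.select("ID", "features")
predictions = lrmodel.transform(unlabeled)
predictions.select("ID", "prediction").show()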
I am using pyspark.ml.RandomForestClassifier and one of the steps here involves StringIndexer on the training data target variable to convert it into labels.
indexer = StringIndexer(inputCol = target_variable_name, outputCol = 'label').fit(df)
df = indexer.transform(df)
After fitting the final model I save it using mlflow.spark.log_model(). So, when applying the model to a new dataset in the future, I just load the model again and apply it to the new data:
model = mlflow.spark.load_model("models:/RandomForest_model/None")
predictions = model.transform(new_data)
In new_data the predictions come out as indexed labels, not the original values. So, to get the original values back I have to use IndexToString:
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",labels=indexer.labels)
predictions = labelConverter.transform(predictions)
So, the question is: the indexer.labels don't get saved, since only the model is logged. How do I save the indexer.labels from my training dataset and use them on any new dataset? Can this be saved and retrieved in mlflow?
Apologies if I am sounding naïve here, but getting back the original values in the new dataset is really confusing me.
Hope you got the answer; in case you haven't, here's the solution. StringIndexerModel has save and load methods, so you can use save to persist the fitted indexer model and reuse it later.
E.g.: stringIndexerModel.save(path)
Source:
StringIndexerModel — PySpark 3.3.1 documentation (apache.org)
I was searching for a quick answer but couldn't find one; on searching the documentation I found save as an option.
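A minimal sketch of the save/load round trip (the path is illustrative):

from pyspark.ml.feature import IndexToString, StringIndexer, StringIndexerModel

# At training time: fit the indexer and persist it next to the model
indexer = StringIndexer(inputCol=target_variable_name, outputCol="label").fit(df)
indexer.save("/models/rf_target_indexer")

# At scoring time: reload it and recover the original class values
indexer = StringIndexerModel.load("/models/rf_target_indexer")
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=indexer.labels)
predictions = labelConverter.transform(predictions)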
The following line of code loads the (soon to be deprecated) mllib.regression.LabeledPoint from file to an RDD[LabeledPoint]:
MLUtils.loadLibSVMFile(spark.sparkContext, s"$path${File.separator}${fileName}_data_sparse").repartition(defaultPartitionSize)
I'm unable to find the equivalent function for ml.feature.LabeledPoint, which is not yet heavily used in the Spark documentation examples.
Can someone point me to the relevant function?
With the ml package you won't need to put the data into a LabeledPoint since you can specify which columns to use for labels/features in all transformations/algorithms. For example:
val gbt = new GBTClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
To load the LibSVM file as a dataframe, simply do:
val df = spark.read.format("libsvm").load(s"$path${File.separator}${fileName}_data_sparse")
This will return a DataFrame with two columns: label, containing labels stored as doubles, and features, containing feature vectors stored as Vectors.
See the documentation for more information.
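For completeness, the equivalent load in PySpark is a one-liner as well (a sketch; path points at the same LibSVM file):

# Load a LibSVM-format file straight into a DataFrame with
# "label" (double) and "features" (vector) columns.
df = spark.read.format("libsvm").load(path)
df.printSchema()
# root
#  |-- label: double (nullable = true)
#  |-- features: vector (nullable = true)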
I am trying to classify Questions using SVM. I am following this link for reference -
https://shirishkadam.com/2017/07/03/nlp-question-classification-using-support-vector-machines-spacyscikit-learnpandas/
But they have used spaCy, scikit-learn and pandas. I want to do the same thing using Spark MLlib.
I am using this code to create a DataFrame:
sc = SparkContext(conf=sconf) # SparkContext
sqlContext = SQLContext(sc)
data = sc.textFile("<path_to_csv_file>")
header = data.first()
trainingDF = sqlContext.createDataFrame(
    data.filter(lambda line: line != header)
        .map(lambda line: line.split("|"))
        .map(lambda line: ([line[0]], [line[2]], [line[6]]))
).toDF("Question", "WH-Bigram", "Class")
And I am getting the following result by printing the dataframe with trainingDF.show(3):
+--------------------+-------------------+------+
| Question| WH-Bigram| Class|
+--------------------+-------------------+------+
|[How did serfdom ...| [How did]|[DESC]|
|[What films featu...| [What films]|[ENTY]|
|[How can I find a...| [How can]|[DESC]|
+--------------------+-------------------+------+
My sample csv file is -
#Question|WH|WH-Bigram|Class
How did serfdom develop in and then leave Russia ?|How|How did|DESC
I am using word2vec to create feature vectors for the SVM and then trying to train it:
word2Vec1 = Word2Vec(vectorSize=2, minCount=0, inputCol="Question", outputCol="result1")
training = word2Vec1.fit(trainingDF).transform(trainingDF)
model = SVMWithSGD.train(training, iterations=100)
After using word2vec my data is converted to this format:
[Row(Question=[u'How did serfdom develop in and then leave Russia ?'], WH-Bigram=[u'How did'], Class=[u'DESC'], result1=DenseVector([0.0237, -0.186])), Row(Question=[u'What films featured the character Popeye Doyle ?'], WH-Bigram=[u'What films'], Class=[u'ENTY'], result1=DenseVector([-0.2429, 0.0935]))]
But when I try to train on the dataframe using SVM, I get the error TypeError: data should be an RDD of LabeledPoint, but got <class 'pyspark.sql.types.Row'>
I am stuck here... I think the dataframe that I have created is not correct.
Does anybody know how to create a suitable dataframe for training with SVM? Please let me know if I am doing something wrong.
Great that you are trying out one of the machine learning methods in Spark, but there are multiple problems with your approach:
1) Your data has multiple classes; SVM in Spark is a binary classification model, so it won't work on this dataset (you can have a look at the source code here). You can try the one-class-vs-all-others approach and train as many models as there are classes in your data. However, you would be better off using something like the MultilayerPerceptronClassifier or the multiclass logistic model in Spark.
2) Secondly, MLlib is very unforgiving in terms of the class labels that you use: you can only specify 0, 1, 2 or 0.0, 1.0, 2.0 etc., i.e. it does not automatically infer the number of classes based on your output column. Even if you specify two classes as 1.0 and 2.0 it will not work; they have to be 0.0 and 1.0.
3) You need to use an RDD of LabeledPoint instead of a Spark dataframe; remember that spark.mllib is for use with RDDs, whereas spark.ml is for use with dataframes. For help on creating a LabeledPoint RDD you may refer to the Spark documentation here, where there are multiple examples.
4) On a feature-engineering note, you would not want a vectorSize of 2 for your word2vec model (something like 10 would be more appropriate as a starting point); that is simply too small to give a reasonable prediction.
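Putting points 1) to 3) together, a sketch with the spark.ml API could look like this, assuming "Question" holds the tokenized words and "Class" is a plain string column (i.e. the tuples above are built without the extra list wrapping):

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, Word2Vec

# StringIndexer maps DESC/ENTY/... to 0.0, 1.0, ... as required by point 2)
indexer = StringIndexer(inputCol="Class", outputCol="label")
word2Vec = Word2Vec(vectorSize=10, minCount=0,
                    inputCol="Question", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[indexer, word2Vec, lr])
model = pipeline.fit(trainingDF)

This stays entirely in the dataframe world, so no LabeledPoint RDD is needed.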
I have a Dataset/DataFrame with an mllib.linalg.Vector (of Doubles) as one of the columns. I would like to add another column of type ml.linalg.Vector to this dataset (so I will have both types of vectors). The reason is that I am evaluating a few algorithms, and some of them expect an mllib vector while others expect an ml vector. Also, I have to feed the output of one algorithm into another, and each uses a different type.
Can someone please help me convert mllib.linalg.Vector to ml.linalg.Vector and append it as a new column to the dataset in hand? I tried using MLUtils.convertVectorColumnsToML() inside a UDF and as a regular function, but was not able to get it working. I am trying to avoid creating a new dataset and then doing an inner join and dropping the columns, as the dataset will eventually be huge and joins are expensive.
You can use the method asML to convert from an mllib to an ml vector. A UDF and usage example can look like this:
val convertToML = udf((mllibVec: org.apache.spark.mllib.linalg.Vector) => {
  mllibVec.asML
})
val df2 = df.withColumn("mlVector", convertToML($"mllibVector"))
Assuming df to be the original dataframe and the column with the mllib vector to be named mllibVector.
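If you are working in PySpark instead, the pyspark.mllib vectors expose the same asML() conversion, so a sketch with the column names above would be:

from pyspark.sql.functions import udf
from pyspark.ml.linalg import VectorUDT

# Forward the mllib vector's own asML() through a UDF and append the
# result as a new column, keeping the original one in place.
as_ml = udf(lambda v: v.asML() if v is not None else None, VectorUDT())
df2 = df.withColumn("mlVector", as_ml("mllibVector"))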
How do I handle a missing numerical feature when using Decision Trees in Spark MLlib?
I am considering replacing the missing feature with the mean of the other values, but I'm not sure what the impact on model quality would be. Does Spark MLlib provide any support for this common issue?
Every DataFrame exposes DataFrameNaFunctions through .na: drop can drop the offending record (not the whole column), fill can fill the offending datum with static "dummy data", and replace can replace the offending datum with specified data.
https://spark.apache.org/docs/2.1.1/api/scala/#org.apache.spark.sql.DataFrameNaFunctions
scala> df.na
res20: org.apache.spark.sql.DataFrameNaFunctions = org.apache.spark.sql.DataFrameNaFunctions@e7e9006
scala> df.na.
drop   fill   replace
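Since the question also mentions replacing missing values with the mean: newer Spark versions (2.2+) ship an Imputer transformer that does exactly that. A sketch in PySpark, with illustrative column names:

from pyspark.ml.feature import Imputer

# Replace missing values in a numeric column with the column mean
# (strategy="median" is also supported).
imputer = Imputer(strategy="mean",
                  inputCols=["feature1"], outputCols=["feature1_imputed"])
df_imputed = imputer.fit(df).transform(df)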