Hi, could anyone suggest a mapping from the Scala CountVectorizer output ([label, (nVocab, [i1, i2, ...], [c1, c2, ...])]) to the libsvm format (label i1:c1 i2:c2 ...)?
For starters, if I take the input as a string, I am not sure where to split it to get the fields.
Alternatively, is there a Scala utility for this?
Thanks,
kvd
I figured this out. The CountVectorizer output can be cast to the SparseVector data type, which has [size, [indices], [values]]. The indices and values arrays can then be zipped and written out in the libsvm format.
val countVec = vec(1).asInstanceOf[SparseVector]
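For reference, the zip-and-format step might look roughly like this (a minimal sketch, assuming a Double label and the spark.ml SparseVector; note that libsvm indices are 1-based):

import org.apache.spark.ml.linalg.SparseVector

def toLibsvmLine(label: Double, v: SparseVector): String = {
  val terms = v.indices.zip(v.values).map { case (i, c) => s"${i + 1}:$c" }
  (label.toString +: terms).mkString(" ")
}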
Upon further exploration it turns out that I don't need this conversion. I can create a LabeledPoint from the classLabel and the SparseVector and pass it to the machine-learning object directly.
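A minimal sketch of that direct route (assuming the spark.ml API, with classLabel: Double and countVec as above):

import org.apache.spark.ml.feature.LabeledPoint

val point = LabeledPoint(classLabel, countVec)
// an RDD or Dataset of such points can be passed to the learner directly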
Thanks,
kvd
Related
I am trying to explain the predictions made by my XGBoost model using MMLSpark's LIME package for Scala.
This is my first time using the LIME library. I am able to perform a fit operation on the dataset, but when I try to perform the transform operation, the program stops with an exception:
Caused by: java.lang.ClassCastException: org.apache.spark.ml.linalg.SparseVector cannot be cast to org.apache.spark.ml.linalg.DenseVector
I have around 200 features, and many of them have zero as their feature value.
You are likely using VectorAssembler to create your feature vector column. Its transform function outputs a sparse vector when there are lots of zeros in your feature set, to save space, and that sparse output is what causes the error in LIME.
More info on the VectorAssembler output: "Spark ML VectorAssembler returns strange output".
The solution is to convert the column back to a dense vector so that mmlspark LIME can interpret it:
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.ml.linalg.Vector

// Convert the sparse "features" column back to a dense vector
val asDense = udf((v: Vector) => v.toDense)
val denseFeaturesDF = featuresDF.withColumn("features", asDense(col("features")))
Then you can fit your model on denseFeaturesDF.
It's my very first time trying to run a KMeans cluster analysis in Spark, so I am sorry for a stupid question.
I have a Spark DataFrame mydataframe with many columns. I want to run KMeans on only two of them, lat and long (latitude & longitude), using them as simple values, and extract 7 clusters based on just those two columns. I've tried:
from numpy import array
from math import sqrt
from pyspark.mllib.clustering import KMeans, KMeansModel
# Prepare a data frame with just 2 columns:
data = mydataframe.select('lat', 'long')
# Build the model (cluster the data)
clusters = KMeans.train(data, 7, maxIterations=15, initializationMode="random")
But I am getting an error:
'DataFrame' object has no attribute 'map'
What should be the object one feeds to KMeans.train?
Clearly, it doesn't accept a DataFrame.
How should I prepare my data frame for the analysis?
Thank you very much!
The method KMeans.train takes an RDD as input, not a DataFrame (data). So you just have to convert data to an RDD: data.rdd.
Hope it helps.
I have been running a series of topic modeling experiments in Spark, varying the number of topics. So, given an RDD docsWithFeatures, I'm doing something like this:
for (n_topics <- Range(65, 301, 5)) {
  val s = n_topics.toString
  val lda = new LDA().setK(n_topics).setMaxIterations(20) // .setAlpha(), .setBeta()
  val ldaModel = lda.run(docsWithFeatures)
  // now do some eval, save results to file, etc...
}
This has been working great, but I also want to compare results if I first normalize my data with TF-IDF. Now, to the best of my knowledge, LDA strictly expects a bag-of-words format where term frequencies are integer values. But in principle (and I've seen plenty of examples of this), the math works out fine if we first convert integer term frequencies to float TF-IDF values. My approach at the moment to do this is the following (again, given my docsWithFeatures RDD):
import org.apache.spark.mllib.feature.IDF

// Drop the document IDs, fit IDF on the term-frequency vectors, then re-key the weighted vectors by index
val index_reset = docsWithFeatures.map(_._2).cache()
val idf = new IDF().fit(index_reset)
val tfidf = idf.transform(index_reset).zipWithIndex.map(x => (x._2, x._1))
I can then run the same code as in the first block, substituting tfidf for docsWithFeatures. This works without any crashes, but my main question here is whether this is OK to do. That is, I want to make sure Spark isn't doing anything funky under the hood, like converting the float values coming out of the TF-IDF step back to integers or something.
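For concreteness, a minimal sketch of that substitution (the same loop as the first block, just fed tfidf):

for (n_topics <- Range(65, 301, 5)) {
  val lda = new LDA().setK(n_topics).setMaxIterations(20)
  val ldaModel = lda.run(tfidf) // tfidf is an RDD[(Long, Vector)], the same shape as docsWithFeatures
  // eval / save results as before
}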
All the examples in the tutorial use files in LibSVM format as input to Spark MLlib (http://spark.apache.org/docs/latest/mllib-ensembles.html):
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
But I have a file with millions of rows on HDFS that I want to feed to Spark MLlib using PySpark, and I do not want to convert it into libsvm format.
Can anyone please guide me on how to do this?
Generally, when you give input to an algorithm in MLlib, you create an RDD of a certain data type (say, LabeledPoint or a Vector); MLUtils.loadLibSVMFile simply converts your data into an RDD of LabeledPoint for you.
You can directly transform your data into whatever format the algorithm needs and then give the resultant RDD as input to your MLlib algorithm.
http://spark.apache.org/docs/latest/mllib-data-types.html
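For example (shown in Scala for illustration; this assumes each HDFS line looks like "label,f1,f2,...", so adjust the parsing to your file layout):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val raw = sc.textFile("hdfs:///path/to/data") // placeholder path
val parsed = raw.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.head, Vectors.dense(parts.tail))
}
// parsed is an RDD[LabeledPoint] and can be passed to an MLlib algorithm directly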
I agree with @Rishi, with a few additions to that -
The LibSVM format represents an org.apache.spark.mllib.regression.LabeledPoint, which contains a label and a feature vector. If you don't have data in LibSVM format, you can create that by building a DataFrame with a column of type LabeledPoint.
val trainingData = spark.read.text(<path to data folder or file>)
val trainingLabelPoints = trainingData.map { row =>
  // LabeledPoint(<label as a Double>, Vectors.sparse(....))
  LabeledPoint(row.getAs[Double]("column 1"), Vectors.sparse(row.getAs[Double]("column 2")...))
}.toDF("labelpoints")
// trainingLabelPoints can be used as input to an MLlib algorithm
Clustering algorithms like K-means don't need LabeledPoints; a Vector column is enough.
Some classification algorithms like LinearSVC can take two columns, label and feature vector, and a LabeledPoint would work too.
If you have words in your training documents, you can use org.apache.spark.ml.feature.Word2Vec to convert the words to vectors.
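A rough sketch of that route (spark.ml API; the DataFrame and column names here are just placeholders):

import org.apache.spark.ml.feature.Word2Vec

val word2Vec = new Word2Vec()
  .setInputCol("words") // a column of Seq[String] (tokenized documents)
  .setOutputCol("features")
  .setVectorSize(100)
  .setMinCount(1)
val w2vModel = word2Vec.fit(tokenizedDF) // tokenizedDF is a hypothetical DataFrame of tokenized docs
val vectorizedDF = w2vModel.transform(tokenizedDF)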
So you have quite a lot of choices.
I am trying to calculate TF-IDF for documents of strings, and I am referring to the http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf guide.
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF
val sc: SparkContext = ...
// Load documents (one per line).
val documents: RDD[Seq[String]] = sc.textFile("...").map(_.split(" ").toSeq)
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
output:
Array((1048576,[1088,35436,98482,1024805],[2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776]), (1048576,[49,34227,39165,114066,125344,240472,312955,388260,436506,469864,493361,496101,566174,747007,802226],[2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776]),...
With this I am getting an RDD of vectors, but I am not able to get any information from these vectors about the original strings. Can anyone help me out with mapping the indexes back to the original strings?
It is hard to answer your question without more information. My best guess is that you may want to extract the TF-IDF value of some term for some document.
The tfidf you get on the last line is an RDD of Vector: for every document in your corpus (which is an RDD[Seq[String]]), you get back a Vector representing that document. Every term in the document has a specific TF-IDF value in this vector.
To find the position of a term in the vector and retrieve its TF-IDF value:
val position = hashingTF.indexOf("term")
Then use it to retrieve the TF-IDF value for a given document by calling the apply method on the Vector (the first document in documents in this example):
tfidf.first.apply(position)
Raw frequencies may be extracted the same way using tf instead of tfidf in the line above.
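Putting the two lookups together (a minimal sketch, reusing hashingTF, tf and tfidf from the question):

val position = hashingTF.indexOf("term")
val rawFreq = tf.first.apply(position) // raw term frequency in the first document
val weight = tfidf.first.apply(position) // TF-IDF weight in the first document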
Since Spark's implementation uses the hashing trick (see the documentation and the Wikipedia article), my understanding is that it is not possible to retrieve the terms from the Vector: the hashing function is one-way by definition, and this "trick" may cause collisions (several terms can produce the same hash).