All the examples in the tutorial (http://spark.apache.org/docs/latest/mllib-ensembles.html) use files in LibSVM format as input to Spark MLlib:
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
But I have a file with millions of rows located on HDFS that I want to feed to Spark MLlib using PySpark, and I do not want to convert it into LibSVM format.
Can anyone please guide me on how to do this?
Generally, when you feed input to an algorithm in MLlib, you create an RDD of a certain data type (say, LabeledPoint or Vector). MLUtils.loadLibSVMFile simply converts your data into a LabeledPoint RDD for you.
You can transform your data directly into whatever format the algorithm needs and then pass the resulting RDD as input to your MLlib algorithm.
http://spark.apache.org/docs/latest/mllib-data-types.html
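For example, here is a minimal PySpark sketch of that idea, assuming a plain text file on HDFS where the first value on each line is the label and the rest are numeric features (the path and the space delimiter are placeholders you would adapt to your data):
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors

# Each line is assumed to look like "<label> <feature1> <feature2> ..."
raw = sc.textFile("hdfs:///path/to/your/data.txt")
parsed = raw.map(lambda line: [float(x) for x in line.split()]) \
            .map(lambda vals: LabeledPoint(vals[0], Vectors.dense(vals[1:])))
# parsed is now an RDD of LabeledPoint that MLlib algorithms accept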
I agree with @Rishi, with a few additions to that:
The LibSVM format represents an org.apache.spark.mllib.regression.LabeledPoint; it contains a label and a feature vector. If you don't have data in LibSVM format, you can create the equivalent by building a DataFrame with a column of LabeledPoints.
val trainingData = spark.read.text(<path to data folder or file>)
val trainingLabelPoints = trainingData.map { row =>
  // LabeledPoint(<label as a Double>, Vectors.sparse(...))
  LabeledPoint(row.getAs[Double]("column 1"), Vectors.sparse(row.getAs[Double]("column 2"), ...))
}.toDF("labelpoints")
// trainingLabelPoints can be used as input to an MLlib algorithm
Clustering algorithms like k-means don't need LabeledPoints; a Vector column is enough.
Some classification algorithms like LinearSVC take two columns, label and feature vector, so a LabeledPoint would work there too.
If your training documents contain words, you can use org.apache.spark.ml.feature.Word2Vec to convert the words to vectors.
So you have quite a lot of choices.
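For instance, here is a hedged sketch of the Word2Vec route in PySpark (PySpark rather than Scala, since the original question uses PySpark; docs and the column names are placeholder assumptions):
from pyspark.ml.feature import Word2Vec

# Assumes a DataFrame `docs` with an array-of-strings column "words"
word2vec = Word2Vec(vectorSize=50, minCount=1, inputCol="words", outputCol="features")
w2vModel = word2vec.fit(docs)
vectorized = w2vModel.transform(docs)  # adds a "features" Vector column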
I'm trying to rewrite some code I wrote (it's in Python), but now in Spark.
# pandas / sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

tfidf = TfidfVectorizer()
df_final = np.array(tfidf.fit_transform(df['sentence']).todense())
I read the Spark documentation; is it necessary to use Tokenizer, HashingTF and then IDF to model tf-idf in PySpark?
#pyspark
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
tokenizer = Tokenizer(inputCol = "sentence", outputCol = "words")
wordsData = tokenizer.transform(df)
hashingTF = HashingTF(inputCol = "words", outputCol="rawFeatures", numFeatures = 20)
tf = hashingTF.transform(wordsData)
idf = IDF(inputCol = "rawFeatures", outputCol = "features")
tf_idf = idf.fit(tf)
df_final = tf_idf.transform(tf)
I'm not sure you clearly understand how the tf-idf model works, since tokenizing is essential and fundamental to tf-idf no matter whether you use the sklearn or the spark.ml version. Your post actually covers two questions:
Why tf-idf needs to tokenize the sentence: I won't copy the mathematical equation since it's easy to search for on Google. In short, tf-idf is a statistical measure of how relevant a word is to a document in a collection of documents; it is calculated from how frequently the word appears in a document (tf) and the inverse frequency of the word across the set of documents (idf). Since the essence is the vocabulary and all calculations are based on the vocabulary, if your input is a sentence, as in your sklearn version, you must tokenize the sentence before the calculation, otherwise the whole methodology is no longer valid.
How tf-idf works in sklearn: If you understand how tf-idf works, then you should understand that the different steps in the example from the official Spark documentation are all essential. Thanks to the sklearn developers for creating such a convenient API, you can call .fit_transform() directly on a Series of sentences. In fact, if you check the source code of TfidfVectorizer in sklearn, you can see that it also does the "tokenization", just in a different way:
It inherits from CountVectorizer (https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/feature_extraction/text.py#L1717)
It uses the ._count_vocab() method from CountVectorizer to transform your sentences. (https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/feature_extraction/text.py#L1338)
In ._count_vocab(), it goes through each sentence and creates a sparse matrix storing the frequency of each vocabulary term in each sentence before the tf-idf calculation. (https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/feature_extraction/text.py#L1192)
To conclude, tokenizing the sentences is essential for the tf-idf calculation, and the example in the official Spark documentation is efficient enough for your model building. Remember to use the functions or methods Spark already provides rather than building user-defined functions/classes to achieve the same goal, otherwise you may hurt computing performance or trigger other issues such as out-of-memory errors.
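If you want the three stages as a single object, they can also be chained with spark.ml's Pipeline; here is a short sketch reusing the same df and column names as in your snippet:
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
idf = IDF(inputCol="rawFeatures", outputCol="features")

pipeline = Pipeline(stages=[tokenizer, hashingTF, idf])
model = pipeline.fit(df)        # learns the IDF weights
df_final = model.transform(df)  # adds the tf-idf "features" column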
I have developed a clustering model using PySpark and I want to predict the cluster of just one vector; here is the code:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.config("spark.sql.warehouse.dir",
        "file:///C:/temp").appName("Kmeans").getOrCreate()
vecAssembler = VectorAssembler(inputCols=FEATURES_COL, outputCol="features")
df_kmeans = vecAssembler.transform(df).select('LCLid', 'features')
k = 6
kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
model = kmeans.fit(df_kmeans)
centers = model.clusterCenters()
predictions = model.transform(df_kmeans)
transformed = model.transform(df_kmeans).select('LCLid', 'prediction')
rows = transformed.collect()
Say that I have a vector of features V and I want to predict which cluster it belongs to.
I tried a method that I found at this link: http://web.cs.ucla.edu/~zhoudiyu/tutorial/
but it doesn't work, since I'm working with a SparkSession, not a SparkContext.
I see that you have covered the basic steps of model creation. What you still need is to apply your k-means model to the vector that you want to cluster (just as you did with model.transform(df_kmeans) above) and then get your prediction; in other words, redo that same transform, but on the new vector of features V. To understand this better, I invite you to read this answer on Stack Overflow:
KMeans clustering in PySpark.
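Concretely, a sketch of that idea, reusing the vecAssembler and model built above (new_df is a hypothetical DataFrame holding your new observation with the same feature columns as the training data):
# Build the "features" column for the new data exactly as for training,
# then let the fitted model assign a cluster
new_points = vecAssembler.transform(new_df).select('LCLid', 'features')
new_predictions = model.transform(new_points).select('LCLid', 'prediction')
new_predictions.show()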
I also want to add that the problem in the example you are following is not due to using a SparkSession instead of a SparkContext, as both are just entry points to the Spark APIs; you can also access a SparkContext through a SparkSession, since the entry points were unified in Spark 2.0. The PySpark k-means API is similar to scikit-learn's; the only difference lies in the predefined functions of the Spark Python API (PySpark).
You can call the predict method of the kmeans model using a Spark ML Vector:
from pyspark.ml.linalg import Vectors
model.predict(Vectors.dense([1,0]))
Here [1,0] is just an example. It should have the same length as your feature vector.
It's my very first time trying to run a KMeans cluster analysis in Spark, so I am sorry for a stupid question.
I have a Spark DataFrame mydataframe with many columns. I want to run k-means on only two columns, lat and long (latitude & longitude), using them as plain numeric values. I want to extract 7 clusters based on just those two columns. I've tried:
from numpy import array
from math import sqrt
from pyspark.mllib.clustering import KMeans, KMeansModel
# Prepare a data frame with just 2 columns:
data = mydataframe.select('lat', 'long')
# Build the model (cluster the data)
clusters = KMeans.train(data, 7, maxIterations=15, initializationMode="random")
But I am getting an error:
'DataFrame' object has no attribute 'map'
What should be the object one feeds to KMeans.train?
Clearly, it doesn't accept a DataFrame.
How should I prepare my data frame for the analysis?
Thank you very much!
The method KMeans.train takes an RDD as input, not a DataFrame (data). So you just have to convert data to an RDD: data.rdd.
Hope it helps.
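A minimal sketch of that conversion (note that data.rdd yields Row objects, so you typically also map each Row to a plain numeric array before training):
from numpy import array
from pyspark.mllib.clustering import KMeans

rdd = data.rdd.map(lambda row: array([row['lat'], row['long']]))
clusters = KMeans.train(rdd, 7, maxIterations=15, initializationMode="random")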
I have gone through the link https://spark.apache.org/docs/latest/mllib-clustering.html regarding fitting a GMM in PySpark. I have carried out the same operation successfully in Python, but after several iterations I am unable to run it in PySpark.
The questions I have are as follows:
1. The link mentioned above, and another example of fitting a GMM in PySpark that I checked, take a txt file with no column headings. I have a CSV with 17 columns. The code is:
data = sc.textFile("..path/mydata.csv")
parsedData = data.map(lambda line: array([float(x) for x in line.strip().split(' ')]))
This worked, but when I try to fit GaussianMixture.train specifying some components, it does not work (the usual call shape is sketched after this post, for reference).
2. If the data used in the examples has no column headings, how can I judge which column comes from which distribution and how the change in pattern appears?
3. How can I get a heat-map from here, so that whenever new data comes in I can use my trained model's heat-map to judge the distribution pattern of the new test data and point out the mismatches?
Thanks.
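For reference, a sketch of the usual GaussianMixture.train call on an RDD of equal-length numeric arrays like parsedData above, following the MLlib guide (k=2 is only a placeholder for your number of components):
from pyspark.mllib.clustering import GaussianMixture

gmm = GaussianMixture.train(parsedData, k=2, maxIterations=100, seed=10)
for i in range(2):
    print("weight =", gmm.weights[i],
          "mu =", gmm.gaussians[i].mu,
          "sigma =", gmm.gaussians[i].sigma.toArray())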
Hi, could anyone suggest a mapping from the Scala CountVectorizer output, ([label, (nVocab, [i1, i2, ...], [c1, c2, ...])]), to the LibSVM format, (label i1:c1 i2:c2 ...)?
If you take the input as a string, I am not sure where to split it to get the fields, for starters.
Alternatively, is there a scala utility for this?
Thanks,
kvd
I figured this out. The CountVectorizer output can be cast to the SparseVector data type, which has [size, [indices], [values]]. The indices and values arrays can be zipped and written out in the LibSVM format.
val countVec = vec(1).asInstanceOf[SparseVector]
Upon further exploration it turns out that I don't need this conversion. I can create a LabeledPoint from the class label and the SparseVector and pass it to the machine learning object directly.
Thanks,
kvd