I am using Apache Spark 2.1.2 and I want to use Latent Dirichlet allocation (LDA).
Previously I was using the org.apache.spark.mllib package and could run this without any problems, but now, after switching to spark.ml, I am getting an error.
val lda = new LDA().setK(numTopics).setMaxIter(numIterations)
val docs = spark.createDataset(documents)
val ldaModel = lda.fit(docs)
As you may have noticed, I'm converting the documents RDD to a Dataset, and I'm not sure this is the correct way of doing it.
On the last line, the call to .fit gives me the following error:
java.lang.IllegalArgumentException: Field "features" does not exist.
My docs dataset looks like this:
scala> docs.take(2)
res28: Array[(Long, org.apache.spark.ml.linalg.Vector)] = Array((0,(7336,[1,2,4,5,12,13,19,24,26,42,48,49,57,59,63,73,81,89,99,106,113,114,141,151,157,160,177,181,198,261,266,267,272,297,307,314,315,359,383,385,410,416,422,468,471,527,564,629,717,744,763,837,890,928,932,951,961,1042,1134,1174,1305,1604,1653,1850,2119,2159,2418,2634,2836,3002,3132,3594,4103,4316,4852,5065,5107,5632,5945,6378,6597,6658],[1.0,1.0,1.0.......
My previous documents before converting them to a dataset:
documents: org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)] = MapPartitionsRDD[2520]
How do I get rid of the error above?
The main difference between spark mllib and spark ml is that spark ml operates on DataFrames (or Datasets), while mllib operates directly on RDDs with a very specific structure.
You don't need to do much to make your code work with spark ml, but I'd still suggest going through the documentation page and understanding the differences, because you will run into more and more of them as you shift towards spark ml. A good starting page with all the basics is https://spark.apache.org/docs/2.1.0/ml-pipeline.html.
As for your code, all that is needed is to give each column the correct name and it should work just fine. Probably the easiest way to do that is to use the implicit toDF method on the underlying RDD:
import spark.implicits._
import org.apache.spark.ml.clustering.LDA
val lda = new LDA().setK(numTopics).setMaxIter(numIterations)
val docs = documents.toDF("label", "features")
val ldaModel = lda.fit(docs)
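If you'd rather keep your own column names, spark.ml's LDA also lets you point it at the vector column explicitly via setFeaturesCol. A small variant of the same code, with illustrative column names (docId and termVector are just placeholders):

import spark.implicits._
import org.apache.spark.ml.clustering.LDA

// Name the columns whatever you like...
val docs = documents.toDF("docId", "termVector")

// ...and tell LDA which column holds the feature vectors (the default is "features").
val lda = new LDA()
  .setK(numTopics)
  .setMaxIter(numIterations)
  .setFeaturesCol("termVector")

val ldaModel = lda.fit(docs)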
Related
I am teaching myself scala (so as to use it with Apache Spark) and wanted to know if there would be some way to concatenate a series of transformations on a Spark DataFrame. E.g. let's assume we have a list of transformations
l: List[(String, String)] = List(("field1", "nonEmpty"), ("field2", "notNull"))
and a Spark DataFrame
df, such that the desired result would be
df.filter(df("field1") =!= "").filter(df("field2").isNotNull).
I was thinking perhaps this could be done using function composition or list folding or something, but I really don't know how. Any help would be greatly appreciated.
Thanks!
Yes, it is perfectly possible, but it depends on what you really want. Spark provides Pipelines, which let you compose your transformations into a pipeline that can be serialized. You can create your own custom transformers (here is an example) and put your "filter" stages into them, so you can reuse them later, for example in Spark Structured Streaming.
Another option is to use Spark Datasets and the transform API. That seems more functional and elegant.
Scala gives you a lot of ways to build your own API, but take a look at these approaches first.
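A minimal sketch of the transform approach, assuming the df from the question and two hypothetical helpers that wrap the filters:

import org.apache.spark.sql.DataFrame

// Hypothetical helpers: each takes a DataFrame and returns a filtered DataFrame.
def nonEmpty(field: String)(df: DataFrame): DataFrame = df.filter(df(field) =!= "")
def notNull(field: String)(df: DataFrame): DataFrame = df.filter(df(field).isNotNull)

// transform chains them while keeping the fluent style.
val result = df
  .transform(nonEmpty("field1"))
  .transform(notNull("field2"))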
Yes, you can fold over an existing DataFrame. You can keep all the filter conditions in a list and not bother with any other intermediary types:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

val df: DataFrame = ??? // your input DataFrame

val conditions = List(
  col("1") =!= "",
  col("2").isNotNull,
  col("3") > 10
)

val filtered = conditions.foldLeft(df)((acc, cond) => acc.filter(cond))
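To tie this back to the original List[(String, String)], you could first translate each (field, check) pair into a Column and then fold, again assuming the df from the question (only the two check names from the question are handled here):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

val l: List[(String, String)] = List(("field1", "nonEmpty"), ("field2", "notNull"))

// Translate each (field, check) pair into a Column predicate.
val predicates: List[Column] = l.map {
  case (field, "nonEmpty") => col(field) =!= ""
  case (field, "notNull")  => col(field).isNotNull
}

// Fold the predicates into successive filters, starting from the original DataFrame.
val filteredDf = predicates.foldLeft(df)((acc, p) => acc.filter(p))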
This question already has answers here: Spark 2.0 Dataset vs DataFrame.
What is the advantage of using a case class in a Spark DataFrame? I can define the schema with the "inferSchema" option or define StructType fields myself.
I referred to "https://docs.scala-lang.org/tour/case-classes.html" but could not understand what the advantages of using a case class are, apart from generating the schema via reflection.
inferSchema can be an expensive operation and defers error behavior unnecessarily. Consider the following pseudocode:
val df = loadDFWithSchemaInference
//doing things that takes time
df.map(row => row.getAs[String]("fieldName")).//more stuff
Now in this code you already have the assumption baked in that fieldName is of type String, but it is only expressed and enforced late in your processing, leading to unfortunate errors in case it wasn't actually a String.
now if you'd do this instead
val df = load.as[CaseClass]
or
val df = load.option("schema", predefinedSchema)
the fact that fieldName is a String will be a precondition and thus your code will be more robust and less error prone.
Schema inference is very handy if you are doing exploratory work in the REPL or e.g. Zeppelin, but it should not be used in operational code.
Edit Addendum:
I personally prefer to use case classes over schemas because I prefer the Dataset API to the Dataframe API (which is Dataset[Row]) for similar robustness reasons.
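A minimal sketch of the case-class route, with a hypothetical Record case class and input path (both are just placeholders for illustration):

import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical schema for illustration.
case class Record(fieldName: String, amount: Double)

val spark = SparkSession.builder().appName("case-class-example").getOrCreate()
import spark.implicits._

// Derive the schema from the case class (reflection happens once, on the driver),
// read with it, and get a typed Dataset back instead of a plain DataFrame.
val ds = spark.read
  .schema(Encoders.product[Record].schema)
  .option("header", "true")
  .csv("/path/to/data.csv")
  .as[Record]

// Field access is now checked at compile time, instead of row.getAs[String]("fieldName").
val names = ds.map(_.fieldName)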
I've just started learning Spark and Scala.
From what I understand, it's bad practice to use collect, because it gathers all the data in memory, and it's also bad practice to use a for loop, because the code inside the block is not executed concurrently by more than one node.
Now, I have a List of numbers from 1 to 10:
List(1,2,3,4,5,6,7,8,9,10)
and for each of these values I need to generate an RDD.
In such cases, how can I generate the RDD?
By doing
sc.parallelize(List(1,2,3,4,5,6,7,8,9,10)).map(number => generate_rdd(number))
I get an error because an RDD cannot be created inside another RDD.
What is the best workaround to this problem?
Assuming generate_rdd is defined like def generate_rdd(n: Int): RDD[Something], you cannot call it inside a transformation of another RDD (that is exactly the error you are seeing), and you don't need to: build the RDDs on the driver from the plain Scala list and combine them with sc.union.
val combined = sc.union(List(1,2,3,4,5,6,7,8,9,10).map(number => generate_rdd(number)))
This gives an RDD that is the concatenation of all the RDDs created for the numbers from 1 to 10.
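A self-contained sketch with a toy generate_rdd (just an illustration, not the asker's actual function):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("union-example").setMaster("local[*]"))

// Toy stand-in for the real generate_rdd: here it just produces n copies of n.
def generate_rdd(n: Int): RDD[Int] = sc.parallelize(Seq.fill(n)(n))

// Create all RDDs on the driver and union them into one.
val combined = sc.union((1 to 10).map(generate_rdd))

println(combined.count()) // 1 + 2 + ... + 10 = 55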
Assuming that the number of RDDs you would like to create is fairly low, so that the parallelization itself does not need to be done through an RDD, you can use Scala's parallel collections instead. For example, I tried to count the number of lines in about 40 HDFS files simultaneously using the following piece of code (ignore the setting of the delimiter; for newline-delimited text this could just as well have been sc.textFile):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "~^~")

// .par turns the list into a parallel collection, so the count jobs are submitted concurrently.
val parSeq = List("path of file1.xsv", "path of file2.xsv", ...).par
parSeq.map { x =>
  val rdd = sc.newAPIHadoopFile(x, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
  println(rdd.count())
}
Here is part of the output in Spark UI. As seen, most of the RDD count operations started at the same time.
I'm following the instructions of PMML model export - spark.mllib to create a K-means model.
val numClusters = 10
val numIterations = 10
val clusters = KMeans.train(data, numClusters, numIterations)
// Save and load model: export to PMML
println("PMML Model:\n" + clusters.toPMML)
clusters.toPMML("/kmeans.xml")
But I don't know how to load the PMML after that.
I'm trying
val sameModel = KMeansModel.load(sc, "/kmeans.xml")
and appears:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/kmeans.xml/metadata
Any idea?
Best regards
As stated in the documentation (for the version you seem to be interested in, 1.6.1, and also for the latest available, 2.1.0), Spark supports exporting to PMML only. The load method actually expects to retrieve a model saved in Spark's own format; that is why it expects a certain path layout to be there and why the exception was thrown.
If you trained the model with Spark, you can save it and load it later.
If you need to load a model that has not been trained in Spark and has been saved as PMML you can use jpmml-spark to load and evaluate it.
My limited experience with spark.mllib's KMeans suggests that it is not possible, but you could develop the feature yourself.
spark.mllib's KMeansModel is PMMLExportable:
class KMeansModel #Since("1.1.0") (#Since("1.0.0") val clusterCenters: Array[Vector])
extends Saveable with Serializable with PMMLExportable {
That's why you can use toPMML, which saves a model in the PMML XML format.
(Again, I've got very little experience with Spark MLlib.) My understanding is that KMeans is all about centroids, and that's what gets loaded when you do KMeansModel.load, which in turn uses KMeansModel.SaveLoadV1_0.load to read the centroids and create a KMeansModel:
new KMeansModel(localCentroids.sortBy(_.id).map(_.point))
For KMeansModel.toPMML, Spark MLlib uses pmml-model's PMML (as you can see here):
new PMML("4.2", header, null)
I'd recommend exploring pmml-model's PMML class to see how to do the saving and loading, as that's beyond Spark's realm.
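A rough sketch of reading the exported file back with the pmml-model library, outside Spark; PMMLUtil here is an assumption about the helper class shipped with the jpmml-model version you depend on, so double-check its API:

import java.io.FileInputStream
import org.dmg.pmml.PMML
import org.jpmml.model.PMMLUtil // assumption: available in your jpmml-model version

// Parse the PMML XML that Spark exported back into a pmml-model object tree.
val is = new FileInputStream("/kmeans.xml")
val pmml: PMML = try PMMLUtil.unmarshal(is) finally is.close()

// From here you can walk the document, e.g. read the centroids from the ClusteringModel,
// or hand the PMML object to a JPMML evaluator outside of Spark.
println(pmml.getModels.size())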
Side notes
Why would you even want to use Spark to host the model after you have trained it? It is indeed possible, but you may be wasting your cluster resources by keeping Spark around just to serve the model.
In my limited understanding, the sole purpose of Spark MLlib is to use Spark's features like distribution and parallelism to handle large datasets to build models and use them without the Spark machinery afterwards.
I must be missing something important in my narrow view...
You could use PMML4S-Spark to load a PMML model to evaluate it in Spark, for example:
import org.pmml4s.spark.ScoreModel
val model = ScoreModel.fromFile("/kmeans.xml")
The model is a Spark ML Transformer, so you can make predictions against a DataFrame:
val scoreDf = model.transform(df)
PMML files are actually XML files whose schema is defined by the Data Mining Group (DMG). For that reason you can either write a deserializer based on the contract given on the DMG/PMML web page here, or use 3rd-party libraries.
I have been researching the jpmml library for incorporating models prepared in Python into a Spring application.
Information here:
https://github.com/jpmml
http://dmg.org/pmml/v4-1/GeneralStructure.html
I am using Spark 1.5.0 (cdh5.5.2). I am running the FPGrowth algorithm on my transactions data and I get different results each time. I checked my transactions data using the linux diff command and found no difference. Is there any random seed involved in the FPGrowth function in Scala? Why am I getting a different number of frequent itemsets each time? Is there any tie that is broken randomly? Also, I use a support value that is really low; when I increase the support, the problem goes away. The support I use is 0.000459. When I increase it to 0.005 I do not see the issue. Is there any minimum threshold for the support that needs to be used?
Thanks for your help.
Here is the code that I used:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth
import scala.collection.mutable.{ArrayBuffer, ListBuffer}

val conf = new SparkConf()
conf.registerKryoClasses(Array(classOf[ArrayBuffer[String]], classOf[ListBuffer[String]]))
val sc = new SparkContext(conf)

val data = sc.textFile("path/test_rdd.txt")
val transactions = data.map(x => x.split('\t'))
val transactioncount = transactions.count()
println(transactioncount)
transactions.cache()

val fpg = new FPGrowth().setMinSupport(0.000459)
val model = fpg.run(transactions)
println(model.freqItemsets.collect().length)
I get the same transactioncount each time. However, when I print the length of the RDD returned by FPGrowth, I get different numbers on each run.
The problem was that Cloudera turns the Kryo serializer on by default, whereas a plain Spark download uses the Java serializer by default. When I ran FPGrowth with the Kryo serializer it asked me to register the Kryo classes. Once I did that, no errors popped up, but the results were incorrect. Once I changed back to the Java serializer, the results were correct and matched those from Spark 1.6.0. I still do not know whether the problem is in the FPGrowth function itself or whether Kryo serialization affects other functions/libraries as well.
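For reference, a minimal sketch of forcing the serializer explicitly in code, so the behavior does not depend on the cluster's defaults (spark.serializer and both class names are standard Spark settings):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("fpgrowth-serializer-check")
  // Force the Java serializer, overriding a cluster-wide Kryo default...
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
  // ...or, to test Kryo explicitly, switch to:
  // .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)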