I am new to Apache Spark. I ran the sample ALS algorithm code from the examples folder, giving a CSV file as input. When I use model.save(path) to save the model, it is stored as a gz.parquet file.
When I tried to open this file, I got these errors.
Now I want to store the generated recommendation model in a text or CSV file for use outside Spark.
I tried the following to save the model to a file, but it did not work:
model.saveAsTextFile("path")
Please suggest a way to overcome this issue.
Let's say you have trained your model with something like this:
val model = ALS.train(ratings, rank, numIterations, 0.01)
All that you have to do is:
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating
// Save
model.save(sc, "yourpath/yourmodel")
// Load Model
val sameModel = MatrixFactorizationModel.load(sc, "yourpath/yourmodel")
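Once loaded back, the model can be used just like the original one; for example (a small sketch, where userId and productId stand in for IDs from your ratings data):

// query the reloaded model
val score = sameModel.predict(userId, productId)     // predicted rating for one (user, product) pair
val top10 = sameModel.recommendProducts(userId, 10)  // top-10 recommendations as Array[Rating]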
As it turns out, saveAsTextFile() only writes out on the workers. Use collect() to gather the data from the workers so it can be saved locally on the master. A solution can be found here.
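For example, here is a minimal sketch (my own helper, not a Spark API) that collects the learned factor matrices to the driver and writes them out as plain CSV, assuming the factors fit in driver memory:

import java.io.PrintWriter

// write one line per id: id,f1,f2,...,fRank
def writeFactors(path: String, factors: Array[(Int, Array[Double])]): Unit = {
  val out = new PrintWriter(path)
  try factors.foreach { case (id, vec) => out.println(id + "," + vec.mkString(",")) }
  finally out.close()
}

writeFactors("userFeatures.csv", model.userFeatures.collect())
writeFactors("productFeatures.csv", model.productFeatures.collect())

These two CSV files contain the full factorization, which is all you need to compute predictions outside Spark.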
So, I made a Scala application to run on Spark and created the uber JAR using sbt assembly.
The file I load is a lookup needed by the application, so the idea is to package it along with the JAR. It works fine from within IntelliJ using the path "src/main/resources/lookup01.csv".
I am developing on Windows and testing locally before deploying it to a remote test server.
But when I call spark-submit on the Windows machine, I get this error:
"org.apache.spark.sql.AnalysisException: Path does not exist: file:/H:/dev/Spark/spark-2.4.3-bin-hadoop2.7/bin/src/main/resources/"
It seems to look for the file in the Spark home location instead of inside the JAR file.
How can I express the path so that the file is looked up from within the JAR package?
Here is example code of how I load the DataFrame. After loading it I transform it into other structures such as Maps.
val v_lookup = sparkSession.read.option( "header", true ).csv( "src/main/resources/lookup01.csv")
What I would like is a way to express the path so that it works in every environment where I run the JAR, ideally also from within IntelliJ while developing.
Edit: the Scala version is 2.11.12
Update:
It seems that to get hold of the file inside the JAR I have to read it as a stream. The code below worked, but I can't figure out a reliable way to extract the file's headers the way SparkSession.read.option("header", true) does.
import sparkSession.implicits._  // needed for .toDF
// read the resource from inside the JAR as a stream of lines
val fileStream = scala.io.Source.getClass.getResourceAsStream("/lookup01.csv")
val inputDF = sparkSession.sparkContext.makeRDD(scala.io.Source.fromInputStream(fileStream).getLines().toList).toDF
When makeRDD is applied I get the RDD and can then convert it to a DataFrame, but it seems I lose the ability to use the "read" options that parsed the headers into the schema.
Is there any way around this when using makeRDD?
Another problem is that it seems I will have to manually parse the lines into columns.
You have to get the correct path from the classpath.
Considering that your file is under src/main/resources:
val path = getClass.getResource("/lookup01.csv").getPath
val v_lookup = sparkSession.read.option("header", true).csv(path)
So, it all points to the fact that once the file is inside the JAR, it can only be accessed as an input stream to read the data from within the compressed archive.
I arrived at a solution. Even though it's not pretty, it does what I need: read a CSV file, take the first 2 columns, turn them into a DataFrame, and then load them into a key-value structure (in this case I created a case class to hold the pairs).
I am considering migrating these lookups to a HOCON file, which may make loading them less convoluted.
import sparkSession.implicits._

// load the resource from inside the JAR into a single-column DataFrame of raw lines
val fileStream = scala.io.Source.getClass.getResourceAsStream("/lookup01.csv")
val input = sparkSession.sparkContext.makeRDD(scala.io.Source.fromInputStream(fileStream).getLines().toList).toDF()

// split each line and keep the first two columns as key-value pairs
// (KeyValue is a small case class; splitCSVString is a helper in my utils package)
val keyValues = input.map { line =>
  val col = utils.Utils.splitCSVString(line.getString(0))
  KeyValue(col(0), col(1))
}

// collect the pairs into a Map on the driver
val lookupMap = keyValues.rdd.map(x => (x.key, x.value)).collectAsMap()

fileStream.close()
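A follow-up note: if you are on Spark 2.2 or later, you can keep the header handling of DataFrameReader even when reading from the classpath, because spark.read.csv also accepts a Dataset[String]. A sketch under that assumption, using the same lookup01.csv resource:

import sparkSession.implicits._

// read the resource from inside the JAR as plain lines
val stream = getClass.getResourceAsStream("/lookup01.csv")
val lines = scala.io.Source.fromInputStream(stream).getLines().toList
stream.close()

// Spark 2.2+ can parse CSV from a Dataset[String], so the header option still applies
val lookupDF = sparkSession.read
  .option("header", true)
  .csv(lines.toDS())

This avoids parsing the columns by hand, since the CSV parser infers the column names from the header line.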
From the spark-nlp GitHub page I downloaded a .zip file containing a pre-trained NerCrfModel. The zip contains three folders: embeddings, fields, and metadata.
How do I load that into a Scala NerCrfModel so that I can use it? Do I have to drop it into HDFS or the host where I launch my Spark Shell? How do I reference it?
You just need to provide the path to the folder that contains the folders you mentioned:
import com.johnsnowlabs.nlp.annotators.ner.crf.NerCrfModel
val path = "path/to/unziped/file/folder"
val model = NerCrfModel.read.load(path)
// use your model
model.setInputCols(someCol)
model.transform(yourData) // yourData must contain 'someCol'
As far as I remember, you can place the folder either on the local FS or on a distributed FS. Hope this helps other users as well!
best,
Alberto.
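To add to the above, here is a hedged sketch of loading the unzipped folder from HDFS rather than the local file system by passing an HDFS path to the same loader. The path and column names below are placeholders; the exact input columns depend on the upstream annotators in your pipeline.

import com.johnsnowlabs.nlp.annotators.ner.crf.NerCrfModel

// load the model folder from a distributed file system (placeholder path)
val hdfsModel = NerCrfModel.read.load("hdfs:///models/ner_crf")

// placeholder column names -- adjust to whatever your pipeline produces
hdfsModel.setInputCols(Array("sentence", "token", "pos", "embeddings"))
hdfsModel.setOutputCol("ner")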
In Spark MLlib, BisectingKMeansModel in pyspark has no save/load function.
Why?
How can I save or load a BisectingKMeans model to HDFS with Python?
It may be your Spark version. For bisecting k-means it is recommended to use version 2.1.0 or above.
You can find a complete example of the pyspark.ml.clustering.BisectingKMeans class here; hope it helps:
https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.clustering.BisectingKMeans%20featuresCol=%22features%22,%20predictionCol=%22prediction%22
The last part of the example code includes a model save/load:
model_path = temp_path + "/bkm_model"
model.save(model_path)
model2 = BisectingKMeansModel.load(model_path)
It works for HDFS as well, but make sure that the temp_path/bkm_model folder does not exist before saving the model, or it will give you an error:
(java.io.IOException: Path <temp_path>/bkm_model already exists)
In my project I have three input files, whose names I pass as args(0) to args(2); I also have an output filename as args(3). In the source code I use:
val sc = new SparkContext()
var log = sc.textFile(args(0))
for(i <- 1 until args.size - 1) log = log.union(sc.textFile(args(i)))
I do nothing with the log but save it as a text file using:
log.coalesce(1, true).saveAsTextFile(args(args.size - 1))
but it still saves to 3 files as part-00000, part-00001, and part-00002. So is there any way to save the three input files into one output file?
Having multiple output files is standard behavior of multi-machine frameworks like Hadoop or Spark. The number of output files depends on the number of reducers.
How to "solve" it in Hadoop:
merge output files after reduce phase
How to "solve" in Spark:
how to make saveAsTextFile NOT split output into multiple file?
You can also find good info here:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-Spark-merge-the-output-file-td322.html
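As a concrete example of the Hadoop-side merge mentioned above, here is a sketch using FileUtil.copyMerge (available in Hadoop 2.x; it was removed in Hadoop 3), with placeholder paths:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)

// merge every part-file under the saveAsTextFile output directory into one file
FileUtil.copyMerge(
  fs, new Path("outputDir"),          // directory written by saveAsTextFile
  fs, new Path("merged/output.txt"),  // single destination file
  false,                              // do not delete the source directory
  conf, null)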
So, you were right about coalesce(1, true). However, it is very inefficient. Interestingly (as #climbage mentioned in his comment), your code works if you run it locally.
What you might try is to read the files first and then save the output.
...
val sc = new SparkContext()
val str = new StringBuilder
for (i <- 0 until args.size - 1) {
  val file = sc.textFile(args(i))
  // collect() brings the lines to the driver; calling foreach on the RDD itself
  // would only mutate copies of str on the executors and have no effect here
  file.collect().foreach(line => str.append(line).append("\n"))
}
// and now you might save the content as a single file
sc.parallelize(Seq(str.toString), 1).saveAsTextFile("out")
Note: this code is also extremely inefficient and works for small files only! You need to come up with better code. I wouldn't try to reduce the number of files, but rather process the multiple output files.
As mentioned, your problem is somewhat unavoidable via the standard APIs, as the assumption is that you are dealing with large quantities of data. However, if I assume your data is manageable, you could try the following:
import java.nio.file.{Paths, Files}
import java.nio.charset.StandardCharsets
Files.write(Paths.get("./test_file"), data.collect.mkString("\n").getBytes(StandardCharsets.UTF_8))
What I am doing here is converting the RDD into a String by performing a collect and then mkString. I would suggest not doing this in production. It works fine for local data analysis (working with ~5 GB of local data).
I'm using the containers.Map function to store my data.
Is there an easy way to export the whole structure to a file and import it again at a later time?
A structure could be:
keys = {'six','seven','eight','nine'};
vals = {6,7,8,9};
Map = containers.Map(keys,vals);
and then say I want to export this structure and be able to import it at a later time (or in another script).
Thanks, regards
Rasmus
Use the save and load functions: http://www.mathworks.com/help/matlab/matlab_env/save-load-and-delete-workspace-variables.html:
save('MyFileName', 'Map');
load('MyFileName');