load CrossValidator model in Spark 2.2.1 - pyspark

I have Spark 2.2.1 on the cluster and Spark 2.4 on my laptop. I can train and save a CrossValidator model in both environments. But when I try to load it back, CrossValidatorModel.load works in Spark 2.4, while in Spark 2.2.1 CrossValidatorModel has no load method. How can I load it there? Sample code is below; the data file comes from the Spark GitHub repo.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
import numpy as np

training = spark.read.format("libsvm").load("sample_multiclass_classification_data.txt")
logreg = LogisticRegression(maxIter=200)
paramGrid_logreg = ParamGridBuilder()\
    .addGrid(logreg.regParam, np.linspace(0.0, 1, 11))\
    .addGrid(logreg.elasticNetParam, np.linspace(0, 1, 11))\
    .build()
crossval_logreg = CrossValidator(estimator=logreg,
                                 estimatorParamMaps=paramGrid_logreg,
                                 evaluator=BinaryClassificationEvaluator(),
                                 numFolds=10)
cvModel_logreg = crossval_logreg.fit(training)
cvModel_logreg.save("cvModel_logreg_numFolds10")
Now, with Spark 2.4, I can load it using:
CrossValidatorModel.load('cvModel_logreg_numFolds10')
But in Spark 2.2 CrossValidatorModel does not have a load method. Is there any way to read it?

Unfortunately I think you can't, at least not directly from PySpark 2.2. I also faced the same problem; in the end I switched to Spark 2.3, which has the load method.
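If upgrading the cluster is not an option, one possible workaround (untested here) is to load the saved directory through the Scala API instead: Scala's CrossValidatorModel has had a load method since well before 2.2, and Spark ML persistence uses the same on-disk format across languages. A minimal sketch, assuming the saved path is reachable from the cluster and was written by a Spark version no newer than the one doing the loading:

import org.apache.spark.ml.tuning.CrossValidatorModel
import org.apache.spark.sql.SparkSession

object LoadCvModel {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("LoadCvModel").getOrCreate()

    // Directory written by cvModel_logreg.save(...) in the PySpark code above
    val cvModel = CrossValidatorModel.load("cvModel_logreg_numFolds10")

    // Inspect or reuse the tuned model, e.g. the best underlying model
    println(cvModel.bestModel.explainParams())

    spark.stop()
  }
}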

Related

It seems that Spark RDD's cache doesn't work, since there is no RDD on Spark web UI

I am going to test Spark's RDD cache by running PythonPageRank on CentOS 7:
spark-submit --master yarn --deploy-mode cluster /usr/spark/examples/src/main/python/pagerank.py input/testpr.txt 10
As you can see, I am running PageRank, so testpr.txt (the input file) and 10 (the number of iterations) are the parameters.
The file pagerank.py contains the following code:
import sys
from operator import add
from pyspark.sql import SparkSession

# parseNeighbors() and computeContribs() are defined earlier in pagerank.py
spark = SparkSession\
    .builder\
    .appName("PythonPageRank")\
    .getOrCreate()
lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
links = lines.map(lambda urls: parseNeighbors(urls)).distinct().groupByKey().cache()
ranks = links.map(lambda url_neighbors: (url_neighbors[0], 1.0))
for iteration in range(int(sys.argv[2])):
    contribs = links.join(ranks).flatMap(
        lambda url_urls_rank: computeContribs(url_urls_rank[1][0], url_urls_rank[1][1]))
    ranks = contribs.reduceByKey(add).mapValues(lambda rank: rank * 0.85 + 0.15)
for (link, rank) in ranks.collect():
    print("%s has rank: %s." % (link, rank))
spark.stop()
As you can see, links = lines.map(lambda urls: parseNeighbors(urls)).distinct().groupByKey().cache() calls cache(). However, when I look at the Spark UI's Storage page, I can't find anything cached.
Here is the PageRank application; it runs fine.
Here is the Jobs page of the application; the collect() action generates a job.
Here is the Stages page of the application; it shows the many iterations PageRank goes through.
Here is the Storage page of the application, which should contain the cached RDDs. However, it is empty, which makes it look as if cache() doesn't work.
Why can't I see any cached RDDs on the Storage page? Why doesn't the cache() in pagerank.py work? Hope someone can help me.
You can add spark.eventLog.logBlockUpdates.enabled true to spark-defaults.conf, so that the Spark History Server's Storage tab is no longer blank. By default, block updates are not written to the event log, which is why the Storage page of a finished application shows nothing even though cache() did run.
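For completeness, the same property can also be set per application instead of globally, as long as it is in place before the SparkContext starts. A minimal sketch (in Scala here, although the question's job is Python; the property names are identical either way):

import org.apache.spark.sql.SparkSession

object BlockUpdateLogging {
  def main(args: Array[String]): Unit = {
    // Same effect as the spark-defaults.conf entry: block updates get written
    // to the event log, so the History Server can populate its Storage tab.
    val spark = SparkSession.builder()
      .appName("BlockUpdateLogging")
      .config("spark.eventLog.enabled", "true")
      .config("spark.eventLog.logBlockUpdates.enabled", "true")
      .getOrCreate()

    // cache() is lazy: the RDD only appears under Storage after an action runs.
    val links = spark.sparkContext.textFile("input/testpr.txt").cache()
    println(links.count())

    spark.stop()
  }
}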

Spark Scala file stream

I am new to Spark and Scala. I want to keep reading files from a folder and persist the file content in Cassandra. I have written a simple Scala program that uses file streaming to read the file content, but it is not reading files from the specified folder.
Can anybody correct my sample code below? I am using Windows 7.
Code:
val spark = SparkHelper.getOrCreateSparkSession()
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
val lines = ssc.textFileStream("file:///C:/input/")
lines.foreachRDD(file => {
  file.foreach(fc => {
    println(fc)
  })
})
ssc.start()
ssc.awaitTermination()
I think a normal Spark job is needed for this scenario rather than Spark Streaming. Spark Streaming is used when the source is something like Kafka or a socket with a continuous inflow of data. Also note that textFileStream only picks up files added to the directory after the stream has started, which is likely why your existing files are never read.
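If a one-off batch load is enough, here is a minimal sketch of that approach, assuming the DataStax spark-cassandra-connector is on the classpath and that the target keyspace and table already exist (the keyspace demo, table file_content and column content below are hypothetical names; the table schema has to match the DataFrame):

import org.apache.spark.sql.SparkSession

object FilesToCassandra {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FilesToCassandra")
      .config("spark.cassandra.connection.host", "127.0.0.1") // your Cassandra host
      .getOrCreate()

    // Read every file under the folder, one row per line
    val lines = spark.read.textFile("file:///C:/input/").toDF("content")

    // Persist to Cassandra through the connector's DataFrame data source
    lines.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "demo", "table" -> "file_content"))
      .mode("append")
      .save()

    spark.stop()
  }
}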

ingesting data in solr using spark scala

I am trying to ingest data into Solr using Scala and Spark; however, my code is missing something. I took the code below from a Hortonworks tutorial.
I am using Spark 1.6.2, Solr 5.2.1 and Scala 2.10.5.
Can anybody provide a working snippet to successfully insert data into Solr?
val input_file = "hdfs:///tmp/your_text_file"
case class Person(id: Int, name: String)
val people_df1 = sc.textFile(input_file).map(_.split(",")).map(p => Person(p(0).trim.toInt, p(1))).toDF()
val docs = people_df1.map { doc =>
  val docx = SolrSupport.autoMapToSolrInputDoc(doc.getAs[Int]("id").toString, doc, null)
  docx.setField("scala_s", "supercool")
  docx.setField("name_s", doc.getAs[String]("name"))
}
// the code below has a compilation issue somehow, although the jar file does contain these functions
SolrSupport.indexDocs("sandbox.hortonworks.com:2181", "testsparksolr", 10, docs)
val solrServer = com.lucidworks.spark.SolrSupport.getSolrServer("http://ambari.asiacell.com:2181")
solrServer.setDefaultCollection("testsparksolr")
solrServer.commit(false, false)
thanks in advance
Have you tried spark-solr?
The library's main focus is to provide a clean API for indexing documents to a Solr server, which is exactly your case.
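For example, here is a minimal sketch using spark-solr's DataFrame data source, assuming a spark-solr release built for Spark 1.6 / Scala 2.10 is on the classpath and the collection already exists (the ZooKeeper address and collection name are copied from your code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Top-level case class so toDF() can derive the schema
case class Person(id: Int, name: String)

object PeopleToSolr {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PeopleToSolr"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val people = sc.textFile("hdfs:///tmp/your_text_file")
      .map(_.split(","))
      .map(p => Person(p(0).trim.toInt, p(1)))
      .toDF()

    // spark-solr registers a "solr" data source: zkhost is the ZooKeeper
    // ensemble of the SolrCloud cluster, collection is the target collection
    people.write
      .format("solr")
      .option("zkhost", "sandbox.hortonworks.com:2181")
      .option("collection", "testsparksolr")
      .save()

    sc.stop()
  }
}

Documents become searchable once a (soft) commit happens on the Solr side, so make sure the collection's autoCommit/autoSoftCommit settings are what you expect.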

How to apply a simple filter with Flink in Scala

I was using an old version of Flink. I upgraded to 1.2.0 and now I have some issues with filters.
I have a DataStream of Log which works just fine:
val logs: DataStream[Log] = env.addSource(new LogSource(
data, delay, factor))
// DISPLAY TUPLE IN CONSOLE
logs.print()
// EXECUTE SCRIPT
env.execute("stream")
I have of course read the documentation, which shows:
dataStream.filter { _ != 0 }
I tried a bunch of things like this:
val cleanLogs = logs.filter { _.isComplete }
But I got the following error:
Type mismatch, expected: FilterFunction[Log], actual: (Any) => An
So I don't see the link between the documentation and this error.
Any help? Examples?
Thanks
The problem was first a wrong import of StreamExecutionEnvironment, which led to this problem with basic functions like filter.
Then, since I had been on an old version of Flink, I was still using the LocalExecutionEnvironment class, which is no longer available in Flink 1.x.
Instead, use: StreamExecutionEnvironment.createLocalEnvironment(1)
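For reference, a minimal sketch of a working filter in Flink 1.2 with the Scala API (Log here is a hypothetical case class standing in for the real one). The wildcard import org.apache.flink.streaming.api.scala._ brings in the implicit TypeInformation that lets filter accept a plain Scala function:

import org.apache.flink.streaming.api.scala._

case class Log(isComplete: Boolean, message: String)

object FilterExample {
  def main(args: Array[String]): Unit = {
    // Local environment in Flink 1.x, replacing the old LocalExecutionEnvironment
    val env = StreamExecutionEnvironment.createLocalEnvironment(1)

    val logs: DataStream[Log] = env.fromElements(
      Log(true, "ok"),
      Log(false, "incomplete"))

    // With the Scala API in scope, filter takes a function literal directly
    val cleanLogs = logs.filter { _.isComplete }
    cleanLogs.print()

    env.execute("stream")
  }
}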

Self-consistent and updated example of using Spark over ElasticSearch

This guy had a very small example showing how to integrate ElasticSearch and Spark, back when the whole ES ecosystem was around version 0.9. Nowadays it doesn't work anymore (and googling for a replacement doesn't seem an easy feat). Can someone give a small, self-contained Scala example of:
Opening a file in Spark (in the example above, it was /var/log/syslog);
Doing something with it;
Sending the result into ES;
Opening that result back in Spark.
... that works with ElasticSearch 1.3.4 and Spark 1.1.0.
I gave a talk a while back on Spark and Elasticsearch (around the 0.9 days), and I recently updated some of the examples for the present day (read: 1.1). I've posted the slides and the example code. Hope that helps!
I've also copied the relevant sections (from my own github repo) here:
import org.elasticsearch.spark.sql._
...
val tweetsAsCS =
createSchemaRDD(tweetRDD.map(SharedIndex.prepareTweetsCaseClass))
tweetsAsCS.saveToEs(esResource)
Note that we didn't specify any ES nodes. This will default to trying to save to a cluster on localhost. If we want to use a different cluster we can add:
// if we want to have a different es cluster we can add
import org.elasticsearch.hadoop.cfg.ConfigurationOptions
val config = new SparkConf()
config.set(ConfigurationOptions.ES_NODES, node) // set the node for discovery
// other config settings
val sc = new SparkContext(config)
So that will do the first part (indexing some data).
Querying ES from Spark has also gotten a lot simpler, although only if your data types are supported by the connector's mappings (the main one I ran into that wasn't supported was geolocation, but it's easy enough to extend the mapper if you run into that).
val query = "{\"query\": {\"filtered\" : {\"query\" : {\"match_all\" : {}},\"filter\" : { \"geo_distance\" : { \"distance\" : \""+ dist + "km\", \"location\" : { \"lat\" : "+ lat +", \"lon\" : "+ lon +" }}}}}}"
val tweets = sqlCtx.esRDD(esResource, query)
The esRDD function isn't normally on the SQLContext, but the implicit conversions we imported up above make it available to us. tweets is now a SchemaRDD and we can update it as desired and save the results back as we did in the first part of this example.
Hope this helps!