Add two block matrices in Scala Spark

We need to add two block matrices in Scala Spark without using MapReduce: take sparse matrices as input, convert them into block matrices, and then perform block matrix addition on them.
I am just trying to do this with some sample code in Spark Scala.
import org.apache.spark._
import org.apache.spark.rdd._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
object Add {
val rows = 100
val columns = 100
case class Block ( data: Array[Double] ) {
override def toString (): String = {
var s = "\n"
for ( i <- 0 until rows ) {
for ( j <- 0 until columns )
s += "\t%.3f".format(data(i*rows+j))
s += "\n"
}
s
}
}
/* Convert a list of triples (i,j,v) into a Block */
def toBlock ( triples: List[(Int,Int,Double)] ): Block = {
/* ... */
}
/* Add two Blocks */
def blockAdd ( m: Block, n: Block ): Block = {
/* ... */
}
/* Read a sparse matrix from a file and convert it to a block matrix */
def createBlockMatrix ( sc: SparkContext, file: String ): RDD[((Int,Int),Block)] = {
/* ... */
}
def main ( args: Array[String] ) {
/* ... */
}
}
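For reference, here is a minimal sketch of how the three stubs and main inside object Add could be filled in. It assumes the sparse input files contain one "i,j,v" triple per line and that the matrix dimensions are an exact multiple of the 100x100 block size; the comma-separated file format and the output path taken from args are assumptions, not something stated in the question.
/* Possible stub bodies (sketch only, relying on the assumptions above). */
def toBlock ( triples: List[(Int,Int,Double)] ): Block = {
  // Fill a dense, zero-initialized block with the non-zero entries.
  val data = Array.fill(rows * columns)(0.0)
  for ((i, j, v) <- triples)
    data(i * columns + j) = v
  Block(data)
}
def blockAdd ( m: Block, n: Block ): Block = {
  // Element-wise sum of two dense blocks.
  Block(Array.tabulate(rows * columns)(k => m.data(k) + n.data(k)))
}
def createBlockMatrix ( sc: SparkContext, file: String ): RDD[((Int,Int),Block)] = {
  sc.textFile(file)
    .map { line => val a = line.split(","); (a(0).toInt, a(1).toInt, a(2).toDouble) }
    // key by block coordinates, keep the local (row, column, value) triple as the value
    .map { case (i, j, v) => ((i / rows, j / columns), (i % rows, j % columns, v)) }
    .groupByKey()
    .mapValues(ts => toBlock(ts.toList))
}
def main ( args: Array[String] ) {
  val sc = new SparkContext(new SparkConf().setAppName("Add"))
  val a = createBlockMatrix(sc, args(0))
  val b = createBlockMatrix(sc, args(1))
  // cogroup keeps blocks that appear in only one of the two matrices
  val sum = a.cogroup(b).mapValues { case (ms, ns) => (ms ++ ns).reduce(blockAdd) }
  sum.saveAsTextFile(args(2))
  sc.stop()
}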

Related

Apache Flink (Scala) Rate-Balanced Source Function

In advance, I know this question is a little long, but I did my best to simplify it and make it approachable. Please provide feedback and I will try to act on it.
I am new to Flink, and I am trying to use it for a pipeline that will continuously update each item of a large (TBs) dataset. I wish to update some higher-priority items more often, but I want to update all items as fast as possible.
Would it be possible to have a different source for each priority tier (e.g. high, medium, low) to start the update process, and read from the higher-priority sources more often? The other approach I've thought of is a custom SourceFunction that would have a reader for each file and emit according to the rates I set. The first approach didn't seem feasible, so I am trying the second but am stuck.
This is what I've tried so far:
import scala.collection.mutable
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.hadoop.fs.{FileSystem, Path}
import java.util.concurrent.locks.ReadWriteLock
import java.util.concurrent.locks.ReentrantReadWriteLock
import java.util.concurrent.atomic.AtomicBoolean
/** Implements logic to ensure higher-priority values are pushed more often.
*
* The relative priority is the relative run frequency. For example, if tier C
* is half the priority of tier B, then C is pushed half as often.
*
* Type parameters:
* - T: tier type (e.g. an enum)
* - V: emitted value type (e.g. ItemToUpdate)
*
* How to read objects from CSV rows:
* - https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/connectors/datastream/formats/csv/
*
* How to read objects from Parquet:
* - https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/connectors/datastream/formats/parquet/
*/
abstract class TierPriorityEmitter[T, V](
val emitRates: mutable.Map[T, Float],
val updateCounts: mutable.Map[T, Long],
val iterators: mutable.Map[T, Iterator[V]]
) extends SourceFunction[V] {
@transient
private val running = new AtomicBoolean(false)
def this(
emitRates: Map[T, Float]
) {
this(
mutable.Map() ++= emitRates,
mutable.Map() ++= (for ((k, v) <- emitRates) yield (k, 0L)).toMap,
mutable.Map.empty[T, Iterator[V]]
)
for ((k, v) <- emitRates) {
val it = getTierIterator(k)
if (it.hasNext) {
iterators.put(k, it)
} else {
updateCounts.remove(k)
this.emitRates.remove(k)
}
}
// rescale rates in case any dropped or priorities given instead
val s = this.emitRates.foldLeft(0.0)(_ + _._2).asInstanceOf[Float] // sum
for ((k, v) <- this.emitRates) {
this.emitRates.put(k, v / s)
}
}
/** Return iterator for reading a specific priority tier from source file.
* Called again every pass over each priority tier.
*
* For example, may return an iterator to a specific file or an iterator that
* filters to the specified tier. Must avoid storing the entire file in memory
* as it is prohibitively large. TODO FIXME how to read?
*/
def getTierIterator(tier: T): Iterator[V] = ???
/** Determine which class of road should be emitted next.
*/
def nextToEmit(): T = {
/*
```python
total_updates = sum(updateCounts.values())
if total_updates == 0:
return max(emitRates.keys(), key=lambda k: emitRates[k])
actualEmitRates = {k: v / total_updates for k, v in updateCounts.items()}
return min(emitRates.keys(), key=lambda k: actualEmitRates[k] - emitRates[k])
```
*/
val totalUpdates = updateCounts.foldLeft(0.0)(_ + _._2) // sum
if (totalUpdates == 0) {
return emitRates.maxBy(_._2)._1
}
val actualEmitRates =
(for ((k, v) <- updateCounts) yield (k, v / totalUpdates)).toMap
return emitRates.minBy(t => actualEmitRates(t._1) - t._2)._1
}
/** Emit to trigger updates.
*
* @param ctx
*/
override def run(ctx: SourceFunction.SourceContext[V]): Unit = {
running.set(true)
while (running.get()) {
val tier = nextToEmit()
val it = iterators(tier)
// zero length iterators are removed in the constructor
ctx.collect(it.next())
updateCounts.put(tier, updateCounts(tier) + 1)
if (!it.hasNext) {
iterators.put(tier, getTierIterator(tier))
}
}
}
override def cancel(): Unit = {
running.set(false)
}
}
I have this unit test.
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers
import org.scalatest.BeforeAndAfter
import org.apache.flink.api.common.time.Time
import org.apache.flink.streaming.util.SourceOperatorTestHarness
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.common.typeinfo.TypeHint
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.operators.SourceOperatorFactory
import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.runtime.operators.testutils.MockEnvironmentBuilder
import org.apache.flink.api.connector.source.Source
import org.apache.flink.api.connector.source.SourceReader
import scala.collection.mutable.Queue
import org.apache.flink.test.util.MiniClusterWithClientResource
import org.apache.flink.test.util.MiniClusterResourceConfiguration
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
/** Class for testing abstract TierPriorityEmitter
*/
class StringPriorityEmitter(priorities: Map[Char, Float]) extends TierPriorityEmitter[Char, String](priorities) with Serializable {
override def getTierIterator(tier: Char): Iterator[String] = {
return (for (i <- (1 to 9)) yield tier + i.toString()).iterator
}
}
class TestTierPriorityEmitter
extends AnyFlatSpec
with Matchers with BeforeAndAfter {
val flinkCluster = new MiniClusterWithClientResource(
new MiniClusterResourceConfiguration.Builder()
.setNumberSlotsPerTaskManager(1)
.setNumberTaskManagers(1)
.build
)
before {
flinkCluster.before()
}
after {
flinkCluster.after()
}
"TierPriorityEmitter" should "emit according to priority" in {
// be sure emit rates are floats and sum to 1
val emitter: TierPriorityEmitter[Char, String] = new StringPriorityEmitter(Map('A' -> 3f, 'B' -> 2f, 'C' -> 1f))
// should set emit rates based on our priorities
(emitter.emitRates('A') > emitter.emitRates('B') && emitter.emitRates('B') > emitter.emitRates('C') && emitter.emitRates('C') > 0) shouldBe(true)
(emitter.emitRates('A') + emitter.emitRates('B') + emitter.emitRates('C')) shouldEqual(1.0)
(emitter.emitRates('A') / emitter.emitRates('C')) shouldEqual(3.0)
// should output according to the assigned rates
// TODO use a SourceOperatorTestHarness instead. I couldn't get one to work.
val res = new Queue[String]()
for (_ <- 1 to 15) {
val tier = emitter.nextToEmit()
val it = emitter.iterators(tier)
// zero length iterators are removed in the constructor
res.enqueue(it.next())
emitter.updateCounts.put(tier, emitter.updateCounts(tier) + 1)
if (!it.hasNext) {
emitter.iterators.put(tier, emitter.getTierIterator(tier))
}
}
res shouldEqual(Queue("A1", "B1", "C1", "A2", "B2", "A3", "B3", "A4", "C2", "A5", "B4", "A6", "B5", "A7", "C3"))
}
"TierPriorityEmitter" should "be a source" in {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1) // FIXME not enough resources for 2
val emitter: TierPriorityEmitter[Char, String] = new StringPriorityEmitter(Map('A' -> 3, 'B' -> 2, 'C' -> 1))
// FIXME not serializable
val source = env.addSource(
emitter
)
source.returns(TypeInformation.of(new TypeHint[String]() {}))
val sink = new TestSink[String]("Strings")
source.addSink(sink)
// Execute program, beginning computation.
env.execute("Test TierPriorityEmitter")
sink.getResults should contain allOf ("A1", "B1", "C1", "A2", "B2", "A3", "B3", "A4", "C2", "A5", "B4", "A6", "B5", "A7", "C3")
}
}
There are a couple of obvious problems that I can't figure out how to resolve:
How do I read the input files so that the TierPriorityEmitter works as a (serializable) SourceFunction?
I know that Iterators probably aren't the right class to use, but I am not sure what to try next. I am not very experienced with Scala.
Any optimization advice?
Although the mutex introduces some synchronization, rate balancing should still give the best downstream performance, since later steps will be slower. Any advice on how to improve this would be appreciated.
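To make the serialization constraint concrete, one pattern that might help (a sketch under assumptions, not a verified fix) is to serialize only the emit rates with the function and build the per-tier iterators lazily inside run(), so the file readers are created on the task manager and never have to be serialized:
import scala.collection.mutable
import org.apache.flink.streaming.api.functions.source.SourceFunction
/* Sketch: only emitRates travels with the function; iterators and counters are
   rebuilt on the task manager when run() starts. getTierIterator is still the
   open question of how to actually read each tier from its file. */
abstract class LazyTierPriorityEmitter[T, V](val emitRates: Map[T, Float])
    extends SourceFunction[V] {
  @volatile private var running = false
  @transient private var iterators: mutable.Map[T, Iterator[V]] = _
  @transient private var updateCounts: mutable.Map[T, Long] = _
  def getTierIterator(tier: T): Iterator[V]
  private def nextToEmit(): T = {
    val total = updateCounts.values.sum.toDouble
    if (total == 0) emitRates.maxBy(_._2)._1
    else emitRates.minBy { case (k, r) => updateCounts(k) / total - r }._1
  }
  override def run(ctx: SourceFunction.SourceContext[V]): Unit = {
    // Build per-tier state here, on the task manager, instead of in the constructor.
    iterators = mutable.Map(emitRates.keys.toSeq.map(k => k -> getTierIterator(k)): _*)
    updateCounts = mutable.Map(emitRates.keys.toSeq.map(k => k -> 0L): _*)
    running = true
    while (running) {
      val tier = nextToEmit()
      val it = iterators(tier)
      if (it.hasNext) {
        ctx.collect(it.next())
        updateCounts(tier) += 1
      }
      // re-open the tier when its iterator is exhausted
      // (unlike the constructor logic above, empty tiers are not removed here)
      if (!it.hasNext) iterators(tier) = getTierIterator(tier)
    }
  }
  override def cancel(): Unit = { running = false }
}
Whether this actually fixes the "not serializable" failure depends on what the concrete getTierIterator closes over: anything it captures (file handles, Hadoop FileSystem objects, etc.) should be created inside the method rather than stored as fields.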

StreamingQuery with ALS Recommendation: "requirement failed: Nothing has been added to this summarizer."

This is a repost from https://stackoverflow.com/a/64364555/14452959 .
I ran into a problem when running a Spark job for ALS recommendation. My application is written in Scala and Java.
/**
* <h1>Kafka Load Mode in Spark SQL / Spark Structured Streaming</h1>
*
* @author Dragon1573
*/
public enum KafkaLoadMode {
/** Original Spark SQL (what I call "Full-Batch") Mode */
BATCH,
/** Spark Structured Streaming (what I call "Micro-Batch") Mode */
STREAM
}
import KafkaLoadMode.BATCH
import KafkaLoadMode.STREAM
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession
/**
* <h1>ALS Model Training Job in Batch Mode with Kafka Topics</h1>
*
* @author Dragon1573
* */
object AlsModelTrainJob {
def main(args: Array[String]): Unit = {
val dataset: DataFrame = dataPreprocess("ALS Model Training", KafkaLoadMode.BATCH)
/* Model Training */
val Array(train, test) = dataset.randomSplit(Array(0.8, 0.2))
val model = new ALS().setMaxIter(10).setRank(10).setRegParam(0.001).setUserCol("user").
setItemCol("item").setRatingCol("rating").setPredictionCol("predict").
setColdStartStrategy("drop").fit(train)
/* Save the above trained ALSModel into HDFS */
model.write.overwrite().save("hdfs://<server>:<port>/user/root/C04/ALS.sml")
/* Save the test dataset into HDFS */
test.write.mode(SaveMode.Overwrite).save("hdfs://<server>:<port>/user/root/C04/test.csv")
}
/** <h2>Data preprocessing</h2> */
def dataPreprocess(appName: String, mode: KafkaLoadMode): DataFrame = {
/* Create a SparkSession object */
val spark = SparkSession.builder().appName(appName).getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
/* Data preprocessing */
var source: DataFrame = mode match {
case BATCH => spark.read.format("kafka").options(Map(
"kafka.bootstrap.servers" -> "<server>:<port>,<server>:<port>,<server>:<port>",
"subscribe" -> "<topic>"
)).load()
case STREAM => spark.readStream.format("kafka").options(Map(
"kafka.bootstrap.servers" -> "slave1:9092,slave2:9092,slave3:9092",
"subscribe" -> "shop"
)).load()
}
source.selectExpr("CAST(value AS STRING)").as[String].map(_.split(",")).
filter(row => row.nonEmpty && row.length == 11).map(row => (row(8).toLong, row(4).toLong, 1)).
toDF("user", "item", "rating").groupBy("user", "item").sum("rating").
withColumnRenamed("sum(rating)", "rating")
}
}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALSModel
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SparkSession
/**
* <h1>ALS Model Evaluation in Batch Mode</h1>
*
* @author Dragon1573
* */
object AlsModelCheckJob {
/** <h2>Evaluate the model with RMSE Value</h2> */
def rmseValue(trainedModel: ALSModel, dataset: DataFrame): Double = {
new RegressionEvaluator().setMetricName("rmse").setLabelCol("rating").
setPredictionCol("predict").evaluate(trainedModel.transform(dataset))
}
def main(args: Array[String]): Unit = {
/* Create SparkSession object */
val spark = SparkSession.builder().appName("ALS Model Evaluation").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
/* Load my pre-trained ALSModel from HDFS */
val trainedModel = ALSModel.read.load("hdfs://<server>:<port>/user/root/C04/ALS.sml")
/* Load evaluating datasets from HDFS */
val dataset = spark.read.load("hdfs://<server>:<port>/user/root/C04/test.csv").
map(row => (row.getLong(0), row.getLong(1), row.getLong(2))).
toDF("user", "item", "rating")
println(s"RMSE:${rmseValue(trainedModel, dataset)}")
}
}
import java.text.SimpleDateFormat
import java.util.Date
import AlsModelCheckJob.rmseValue
import AlsModelTrainJob.dataPreprocess
import KafkaLoadMode.STREAM
import org.apache.spark.ml.recommendation.ALSModel
import org.apache.spark.sql.SaveMode.Overwrite
import org.apache.spark.sql.streaming.OutputMode
/**
* <h1>ALS Model Real-time Recommendation Job</h1>
*
* @author Dragon1573
*/
object AlsModelStreamingJob {
def main(args: Array[String]): Unit = {
/* Load my pre-trained ALS Model */
val model = ALSModel.read.load("hdfs://<server>:<port>/user/root/C04/ALS.sml")
/* Data preprocessing */
val dataset = dataPreprocess("ALS Realtime Recommendation", STREAM)
/* Recommendation */
dataset.writeStream.outputMode(OutputMode.Complete()).foreachBatch((dataFrame, _) => {
// Get current time
val currentDate = new Date()
// Recommend 5 items for each user in the streaming DataFrame, and save the recommendations to HDFS
model.recommendForUserSubset(dataFrame, 5).write.mode(Overwrite).
save(s"hdfs://<server>:<port>/user/root/C04/recommendation/${
new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss").format(currentDate)
}")
println(s"[${
new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss").format(currentDate)
}]\tRMSE: ${rmseValue(model, dataFrame)}")
}).start().awaitTermination()
}
}
After submitting my application JAR to the Spark Standalone cluster with spark-submit --class AlsModelStreamingJob <JAR Name>, the cluster throws these exceptions to the console and exits.
java.lang.IllegalArgumentException: requirement failed: Nothing has been added to this summarizer.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.mllib.stat.MultivariateOnlineSummarizer.normL2(MultivariateOnlineSummarizer.scala:281)
at org.apache.spark.mllib.evaluation.RegressionMetrics.SSerr$lzycompute(RegressionMetrics.scala:65)
at org.apache.spark.mllib.evaluation.RegressionMetrics.SSerr(RegressionMetrics.scala:65)
at org.apache.spark.mllib.evaluation.RegressionMetrics.meanSquaredError(RegressionMetrics.scala:100)
at org.apache.spark.mllib.evaluation.RegressionMetrics.rootMeanSquaredError(RegressionMetrics.scala:109)
at org.apache.spark.ml.evaluation.RegressionEvaluator.evaluate(RegressionEvaluator.scala:86)
at AlsModelCheckJob$.rmseValue(AlsModelCheckJob.scala:15)
at AlsModelStreamingJob$$anonfun$main$1.apply(AlsModelStreamingJob.scala:35)
at AlsModelStreamingJob$$anonfun$main$1.apply(AlsModelStreamingJob.scala:25)
at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:534)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:532)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:531)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: requirement failed: Nothing has been added to this summarizer.
I also use a Python script that randomly generates records in the local filesystem, which are then collected automatically by Apache Flume.
# -*- coding:UTF-8 -*-
from random import choice
from random import randint
from random import sample
from string import ascii_letters
from string import ascii_lowercase
from string import capwords
from string import digits
from sys import argv
from time import time
from tqdm import tqdm
def get_random(instr, length):
res = sample(instr, length)
result = ''.join(res)
return result
# These are indexes previously generated
row_key_tmp_list = []
def get_random_row_key():
while True:
num = randint(0, 99)
timestamp = int(time())
pre_row_key = str(num).zfill(2) + str(timestamp)
if pre_row_key not in row_key_tmp_list:
row_key_tmp_list.append(pre_row_key)
break
pass
pass
return pre_row_key
def get_random_name(length):
return capwords(get_random(ascii_lowercase, length))
def get_random_age():
return str(randint(18, 60))
def get_random_sex():
return choice(("woman", "man"))
def get_random_goods_no():
return choice((
"220902", "430031", "550012", "650012", "532120", "230121",
"250983", "480071", "580016", "950013", "152121", "230121"
))
def get_random_goods_price():
# dollars
price_int = randint(1, 999)
# cents
price_decimal = randint(1, 99)
return str(price_int) + "." + str(price_decimal)
def get_random_store_id():
return choice([
"313012", "313013", "313014", "313015", "313016", "313017",
"313018", "313019", "313020", "313021", "313022", "313023"
])
def get_random_goods_type():
""" Random user actions """
return choice(["pv", "buy", "cart", "fav", "scan"])
def get_random_tel():
# Randomly pick a mobile carrier number prefix
isp_prefix = choice([
"130", "131", "132", "133", "134", "135",
"136", "137", "138", "139", "147", "150",
"151", "152", "153", "155", "156", "157",
"158", "159", "186", "187", "188"
])
# Randomly generate an 8-digit number
return isp_prefix + ''.join(sample(digits, 4))
def get_random_email(length):
return "#".join((
get_random(ascii_letters, length),
choice(["163.com", "126.com", "qq.com", "gmail.com", "huawei.com"])
))
def get_random_buy_time():
return choice(["2019-08-01", "2019-08-02", "2019-08-03", "2019-08-04", "2019-08-05", "2019-08-06", "2019-08-07"])
def get_random_record():
return ",".join((
get_random_row_key(), get_random_name(5), get_random_age(), get_random_sex(), get_random_goods_no(),
get_random_goods_price(), get_random_store_id(), get_random_goods_type(), get_random_tel(),
get_random_email(10), get_random_buy_time()
))
if __name__ == "__main__":
if len(argv) != 3:
raise SyntaxError("Usage:python query_generator.py <target> <recordCount>")
pass
else:
with open(argv[1], "w") as file:
for _ in range(int(argv[2])):
record = get_random_record()
file.write(record)
file.write('\n')
pass
pass
pass
pass
Are there any mistakes in my Scala/Java source code? How can I use my pre-trained ALSModel to recommend for every micro-batch from the Kafka topics and print the RMSE evaluation result at the same time?
I rewrote AlsModelStreamingJob.scala with the original Spark Streaming API as follows and executed it 7 times. The first 6 runs all showed the above exception, but the 7th run worked perfectly.
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALSModel
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.Minutes
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
/**
* <h1>Realtime Recommendation Job with ALS</h1>
*
* @author Dragon1573
*/
object AlsModelStreamingJob {
def main(args: Array[String]): Unit = {
/* Load ALS Model */
val model = ALSModel.read.load("hdfs://<server>:<port>/user/root/C04/ALS.sml")
/* Spark Application Startpoint */
val sparkConf = new SparkConf().setAppName("ALS Realtime Recommendation")
val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
import sparkSession.implicits._
val streamingContext = new StreamingContext(sparkSession.sparkContext, Seconds(5))
/* Kafka Configurations */
val kafkaConfigs = Map[String, Object](
"bootstrap.servers" -> "server_list",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "Streaming Recommendation"
)
val topics = Array("<topic>")
/* Get Kafka Source */
val source = KafkaUtils.createDirectStream(
streamingContext, PreferConsistent,
Subscribe[String, String](topics, kafkaConfigs)
)
/* Streaming Jobs */
source.map(_.value).map(_.split(",")). // Get contents from Kafka Topics, split them into columns
filter(row => row.nonEmpty && row.length == 11). // filter valid messages
map(row => ((row(8).toLong, row(4).toLong), 1)). // Convert into tuple
reduceByKeyAndWindow((rate1: Int, rate2: Int) => rate1 + rate2, Minutes(5), Seconds(5)). // Key-value aggregations
map(row => (row._1._1, row._1._2, row._2)). // Map into dataset
foreachRDD(
/* Apply for each RDD in the DStream */
rdd => {
// Get current time
val time = new Date
// Convert RDD into DataFrame
val dataset = rdd.toDF("user", "item", "rating")
// Recommend 5 items for each user at the moment and save to HDFS
model.recommendForUserSubset(dataset, 5).
write.mode(SaveMode.Overwrite).
save(s"hdfs://master:8020/user/root/C04/recommendation/${
new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss").format(time)
}")
/* The RMSE value of this recommendation */
val predicted = model.transform(dataset)
val rmse = new RegressionEvaluator().setMetricName("rmse").
setLabelCol("rating").
setPredictionCol("predict").
evaluate(predicted)
println(s"[${new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(time)}] RMSE:$rmse")
})
/* Launch the Spark Streaming Application and wait for termination */
streamingContext.start()
streamingContext.awaitTermination()
}
}
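As a side note on the exception itself: "Nothing has been added to this summarizer" is what RegressionMetrics throws when it is asked to evaluate an empty prediction set, which can happen in streaming when a window produces no valid rows or when coldStartStrategy = "drop" removes every prediction. A small guard like the sketch below (the SafeRmse object and rmseOption method names are illustrative, not from the original code) would avoid evaluating empty micro-batches:
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALSModel
import org.apache.spark.sql.DataFrame
/* Sketch: return None instead of evaluating when there is nothing to score. */
object SafeRmse {
  def rmseOption(model: ALSModel, dataset: DataFrame): Option[Double] = {
    val predicted = model.transform(dataset)
    // head(1) avoids a full count just to test for emptiness
    if (predicted.head(1).isEmpty) None
    else Some(
      new RegressionEvaluator()
        .setMetricName("rmse")
        .setLabelCol("rating")
        .setPredictionCol("predict")
        .evaluate(predicted)
    )
  }
}
The foreachBatch / foreachRDD bodies above could then call SafeRmse.rmseOption(model, dataset) and only print the RMSE when a value is present.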

How to persist a list made dynamically from a DataFrame in Scala Spark

def getAnimalName(dataFrame: DataFrame): List[String] = {
dataFrame.select("animal").
filter(col("animal").isNotNull && col("animal").notEqual("")).
rdd.map(r => r.getString(0)).distinct().collect.toList
}
I am basically calling this function twice, to get the list for different purposes. I just want to know whether there is a way to retain the list in memory, so that we don't have to call the same function again and again and only have to generate the list once in Scala Spark.
Try something like the code below; you can also check the performance using the time function.
The code explanation is inline.
import org.apache.spark.rdd
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, functions}
object HandleCachedDF {
var cachedAnimalDF : rdd.RDD[String] = _
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess
val df = spark.read.json("src/main/resources/hugeTest.json") // Load your Dataframe
val df1 = time[rdd.RDD[String]] {
getAnimalName(df)
}
val resultList = df1.collect().toList
val df2 = time{
getAnimalName(df)
}
val resultList1 = df2.collect().toList
println(resultList.equals(resultList1))
}
def getAnimalName(dataFrame: DataFrame): rdd.RDD[String] = {
if (cachedAnimalDF == null) { // Check if this the first initialization of your dataframe
cachedAnimalDF = dataFrame.select("animal").
filter(functions.col("animal").isNotNull && col("animal").notEqual("")).
rdd.map(r => r.getString(0)).distinct().cache() // Cache your dataframe
}
cachedAnimalDF // Return your cached dataframe
}
def time[R](block: => R): R = { // Compute the time taken by the function to execute
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")
result
}
}
You would have to persist or cache at this point:
val animals = dataFrame.select("animal").
filter(col("animal").isNotNull && col("animal").notEqual("")).
rdd.map(r => r.getString(0)).distinct().persist()
and then call the function as follows, passing the persisted RDD,
def getAnimalName(animals: RDD[String]): List[String] = {
animals.collect.toList
}
as many times as you need without repeating the process.
I hope it helps.
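Another option, if the distinct animal names comfortably fit in driver memory, is to compute the List itself only once and reuse it; a minimal sketch (the AnimalNames object name is illustrative, not from the original code):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
object AnimalNames {
  // Computed on first access only, then reused from driver memory.
  @volatile private var cached: List[String] = _
  def getAnimalName(dataFrame: DataFrame): List[String] = {
    if (cached == null) {
      cached = dataFrame.select("animal")
        .filter(col("animal").isNotNull && col("animal").notEqual(""))
        .rdd.map(_.getString(0)).distinct().collect().toList
    }
    cached
  }
}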

Why does spark-shell print thousands of lines of code after count on a DataFrame with 3000 columns? What are JaninoRuntimeException and the 64 KB limit?

(With the spark-2.1.0-bin-hadoop2.7 version from the official website, on a local machine)
When I execute a simple Spark command in spark-shell, it starts to print out thousands and thousands of lines of code before throwing an error. What is this "code"?
I was running Spark on my local machine. The command I ran was a simple df.count, where df is a DataFrame.
Please see the screenshot below (the code flies by so fast I could only take screenshots to see what's going on). More details are below the image.
More details:
I created the data frame df by
val df: DataFrame = spark.createDataFrame(rows, schema)
// rows: RDD[Row]
// schema: StructType
// There were about 3000 columns and 700 rows (testing set) of data in df.
// The following line ran successfully and returned the correct value
rows.count
// The following line threw exception after printing out tons of codes as shown in the screenshot above
df.count
The exception thrown after the "codes" is:
...
/* 181897 */ apply_81(i);
/* 181898 */ result.setTotalSize(holder.totalSize());
/* 181899 */ return result;
/* 181900 */ }
/* 181901 */ }
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:889)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:941)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:938)
at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
... 29 more
Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass;[Ljava/lang/Object;)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
at org.codehaus.janino.CodeContext.writeShort(CodeContext.java:959)
Edit: As @TzachZohar pointed out, this looks like one of the known bugs (https://issues.apache.org/jira/browse/SPARK-16845) that was fixed but not yet released by the Spark project.
I pulled the Spark master branch, built it from source, and retried my example. Now I get a new exception following the generated code:
/* 308608 */ apply_1560(i);
/* 308609 */ apply_1561(i);
/* 308610 */ result.setTotalSize(holder.totalSize());
/* 308611 */ return result;
/* 308612 */ }
/* 308613 */ }
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:941)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:998)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:995)
at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
... 29 more
Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection has grown past JVM limit of 0xFFFF
at org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
It looks like a pull request is addressing the second problem: https://github.com/apache/spark/pull/16648
This is a bug. It is related to the runtime code that Spark generates on the JVM, so it seems to be hard for the Spark developers to resolve (there is much discussion on the JIRA ticket).
The error occurred for me when doing row operations. Even df.head() on a DataFrame of 700 rows would cause the exception.
The workaround for me was to convert the DataFrame to a sparse data RDD (i.e., RDD[LabeledPoint]) and run row-wise operations on the RDD. It's much faster and more memory efficient. However, it only works with numeric data; categorical variables (factors, target, etc.) need to be converted to Double.
That said, I am new to Scala myself, so my code is probably a tad amateurish. But it works.
Convert a Row to a LabeledPoint:
@throws(classOf[Exception])
private def convertRowToLabeledPoint(rowIn: Row, fieldNameSeq: Seq[String], label: Int): LabeledPoint =
{
try
{
logger.info(s"fieldNameSeq $fieldNameSeq")
val values: Map[String, Long] = rowIn.getValuesMap(fieldNameSeq)
val sortedValuesMap = ListMap(values.toSeq.sortBy(_._1): _*)
//println(s"convertRowToLabeledPoint row values ${sortedValuesMap}")
print(".")
val rowValuesItr: Iterable[Long] = sortedValuesMap.values
var positionsArray: ArrayBuffer[Int] = ArrayBuffer[Int]()
var valuesArray: ArrayBuffer[Double] = ArrayBuffer[Double]()
var currentPosition: Int = 0
rowValuesItr.foreach
{
kv =>
if (kv > 0)
{
valuesArray += kv.toDouble;
positionsArray += currentPosition;
}
currentPosition = currentPosition + 1;
}
new LabeledPoint(label, org.apache.spark.mllib.linalg.Vectors.sparse(positionsArray.size, positionsArray.toArray, valuesArray.toArray))
}
catch
{
case ex: Exception =>
{
throw new Exception(ex)
}
}
}
private def castColumnTo(df: DataFrame, cn: String, tpe: DataType): DataFrame =
{
//println("castColumnTo")
df.withColumn(cn, df(cn).cast(tpe)
)
}
Provide a DataFrame and return an RDD of LabeledPoint:
@throws(classOf[Exception])
def convertToLibSvm(spark:SparkSession,mDF : DataFrame, targetColumnName:String): RDD[LabeledPoint] =
{
try
{
val fieldSeq: scala.collection.Seq[StructField] = mDF.schema.fields.toSeq.filter(f => f.dataType == IntegerType || f.dataType == LongType)
val fieldNameSeq: Seq[String] = fieldSeq.map(f => f.name)
val indexer = new StringIndexer()
.setInputCol(targetColumnName)
.setOutputCol(targetColumnName+"_Indexed")
val mDFTypedIndexed = indexer.fit(mDF).transform(mDF).drop(targetColumnName)
val mDFFinal = castColumnTo(mDFTypedIndexed, targetColumnName+"_Indexed", IntegerType)
//mDFFinal.show()
//only doubles accepted by sparse vector, so that's what we filter for
var positionsArray: ArrayBuffer[LabeledPoint] = ArrayBuffer[LabeledPoint]()
mDFFinal.collect().foreach
{
row => positionsArray += convertRowToLabeledPoint(row, fieldNameSeq, row.getAs(targetColumnName+"_Indexed"));
}
spark.sparkContext.parallelize(positionsArray.toSeq)
}
catch
{
case ex: Exception =>
{
throw new Exception(ex)
}
}
}
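For context, a hypothetical call site for this workaround might look like the following; the spark, df, and "label" names are assumptions, and the types come from the imports the workaround code already relies on (org.apache.spark.rdd.RDD and org.apache.spark.mllib.regression.LabeledPoint).
// df is the wide DataFrame (thousands of integer/long feature columns plus a
// categorical target column, here assumed to be called "label").
val labeledPoints: RDD[LabeledPoint] = convertToLibSvm(spark, df, "label")
// Row-wise work now happens on the plain RDD instead of the DataFrame, which
// sidesteps the generated-code size limits hit by df.count / df.head().
println(labeledPoints.count())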

Cannot call methods on a stopped SparkContext

When I run the following test, it throws "Cannot call methods on a stopped SparkContext". A possible problem is that I use TestSuiteBase and a Streaming Spark context. At the line val gridEvalsRDD = ssc.sparkContext.parallelize(gridEvals) I need to use the SparkContext that I access via ssc.sparkContext, and this is where I have the problem (see the warning and error messages below).
class StreamingTest extends TestSuiteBase with BeforeAndAfter {
test("Test 1") {
//...
val gridEvals = for (initialWeights <- gridParams("initialWeights");
stepSize <- gridParams("stepSize");
numIterations <- gridParams("numIterations")) yield {
val lr = new StreamingLinearRegressionWithSGD()
.setInitialWeights(initialWeights.asInstanceOf[Vector])
.setStepSize(stepSize.asInstanceOf[Double])
.setNumIterations(numIterations.asInstanceOf[Int])
ssc = setupStreams(inputData, (inputDStream: DStream[LabeledPoint]) => {
lr.trainOn(inputDStream)
lr.predictOnValues(inputDStream.map(x => (x.label, x.features)))
})
val output: Seq[Seq[(Double, Double)]] = runStreams(ssc, numBatches, numBatches)
val cvRMSE = calculateRMSE(output, nPoints)
println(s"RMSE = $cvRMSE")
(initialWeights, stepSize, numIterations, cvRMSE)
}
val gridEvalsRDD = ssc.sparkContext.parallelize(gridEvals)
}
}
16/04/27 10:40:17 WARN StreamingContext: StreamingContext has already been stopped
16/04/27 10:40:17 INFO SparkContext: SparkContext already stopped.
Cannot call methods on a stopped SparkContext
UPDATE:
This is the base class TestSuiteBase:
trait TestSuiteBase extends SparkFunSuite with BeforeAndAfter with Logging {
// Name of the framework for Spark context
def framework: String = this.getClass.getSimpleName
// Master for Spark context
def master: String = "local[2]"
// Batch duration
def batchDuration: Duration = Seconds(1)
// Directory where the checkpoint data will be saved
lazy val checkpointDir: String = {
val dir = Utils.createTempDir()
logDebug(s"checkpointDir: $dir")
dir.toString
}
// Number of partitions of the input parallel collections created for testing
def numInputPartitions: Int = 2
// Maximum time to wait before the test times out
def maxWaitTimeMillis: Int = 10000
// Whether to use manual clock or not
def useManualClock: Boolean = true
// Whether to actually wait in real time before changing manual clock
def actuallyWait: Boolean = false
// A SparkConf to use in tests. Can be modified before calling setupStreams to configure things.
val conf = new SparkConf()
.setMaster(master)
.setAppName(framework)
// Timeout for use in ScalaTest `eventually` blocks
val eventuallyTimeout: PatienceConfiguration.Timeout = timeout(Span(10, ScalaTestSeconds))
// Default before function for any streaming test suite. Override this
// if you want to add your stuff to "before" (i.e., don't call before { } )
def beforeFunction() {
if (useManualClock) {
logInfo("Using manual clock")
conf.set("spark.streaming.clock", "org.apache.spark.util.ManualClock")
} else {
logInfo("Using real clock")
conf.set("spark.streaming.clock", "org.apache.spark.util.SystemClock")
}
}
// Default after function for any streaming test suite. Override this
// if you want to add your stuff to "after" (i.e., don't call after { } )
def afterFunction() {
System.clearProperty("spark.streaming.clock")
}
before(beforeFunction)
after(afterFunction)
/**
* Run a block of code with the given StreamingContext and automatically
* stop the context when the block completes or when an exception is thrown.
*/
def withStreamingContext[R](ssc: StreamingContext)(block: StreamingContext => R): R = {
try {
block(ssc)
} finally {
try {
ssc.stop(stopSparkContext = true)
} catch {
case e: Exception =>
logError("Error stopping StreamingContext", e)
}
}
}
/**
* Run a block of code with the given TestServer and automatically
* stop the server when the block completes or when an exception is thrown.
*/
def withTestServer[R](testServer: TestServer)(block: TestServer => R): R = {
try {
block(testServer)
} finally {
try {
testServer.stop()
} catch {
case e: Exception =>
logError("Error stopping TestServer", e)
}
}
}
/**
* Set up required DStreams to test the DStream operation using the two sequences
* of input collections.
*/
def setupStreams[U: ClassTag, V: ClassTag](
input: Seq[Seq[U]],
operation: DStream[U] => DStream[V],
numPartitions: Int = numInputPartitions
): StreamingContext = {
// Create StreamingContext
val ssc = new StreamingContext(conf, batchDuration)
if (checkpointDir != null) {
ssc.checkpoint(checkpointDir)
}
// Setup the stream computation
val inputStream = new TestInputStream(ssc, input, numPartitions)
val operatedStream = operation(inputStream)
val outputStream = new TestOutputStreamWithPartitions(operatedStream,
new ArrayBuffer[Seq[Seq[V]]] with SynchronizedBuffer[Seq[Seq[V]]])
outputStream.register()
ssc
}
/**
* Set up required DStreams to test the binary operation using the sequence
* of input collections.
*/
def setupStreams[U: ClassTag, V: ClassTag, W: ClassTag](
input1: Seq[Seq[U]],
input2: Seq[Seq[V]],
operation: (DStream[U], DStream[V]) => DStream[W]
): StreamingContext = {
// Create StreamingContext
val ssc = new StreamingContext(conf, batchDuration)
if (checkpointDir != null) {
ssc.checkpoint(checkpointDir)
}
// Setup the stream computation
val inputStream1 = new TestInputStream(ssc, input1, numInputPartitions)
val inputStream2 = new TestInputStream(ssc, input2, numInputPartitions)
val operatedStream = operation(inputStream1, inputStream2)
val outputStream = new TestOutputStreamWithPartitions(operatedStream,
new ArrayBuffer[Seq[Seq[W]]] with SynchronizedBuffer[Seq[Seq[W]]])
outputStream.register()
ssc
}
/**
* Runs the streams set up in `ssc` on manual clock for `numBatches` batches and
* returns the collected output. It will wait until `numExpectedOutput` number of
* output data has been collected or timeout (set by `maxWaitTimeMillis`) is reached.
*
* Returns a sequence of items for each RDD.
*/
def runStreams[V: ClassTag](
ssc: StreamingContext,
numBatches: Int,
numExpectedOutput: Int
): Seq[Seq[V]] = {
// Flatten each RDD into a single Seq
runStreamsWithPartitions(ssc, numBatches, numExpectedOutput).map(_.flatten.toSeq)
}
/**
* Runs the streams set up in `ssc` on manual clock for `numBatches` batches and
* returns the collected output. It will wait until `numExpectedOutput` number of
* output data has been collected or timeout (set by `maxWaitTimeMillis`) is reached.
*
* Returns a sequence of RDD's. Each RDD is represented as several sequences of items, each
* representing one partition.
*/
def runStreamsWithPartitions[V: ClassTag](
ssc: StreamingContext,
numBatches: Int,
numExpectedOutput: Int
): Seq[Seq[Seq[V]]] = {
assert(numBatches > 0, "Number of batches to run stream computation is zero")
assert(numExpectedOutput > 0, "Number of expected outputs after " + numBatches + " is zero")
logInfo("numBatches = " + numBatches + ", numExpectedOutput = " + numExpectedOutput)
// Get the output buffer
val outputStream = ssc.graph.getOutputStreams.
filter(_.isInstanceOf[TestOutputStreamWithPartitions[_]]).
head.asInstanceOf[TestOutputStreamWithPartitions[V]]
val output = outputStream.output
try {
// Start computation
ssc.start()
// Advance manual clock
val clock = ssc.scheduler.clock.asInstanceOf[ManualClock]
logInfo("Manual clock before advancing = " + clock.getTimeMillis())
if (actuallyWait) {
for (i <- 1 to numBatches) {
logInfo("Actually waiting for " + batchDuration)
clock.advance(batchDuration.milliseconds)
Thread.sleep(batchDuration.milliseconds)
}
} else {
clock.advance(numBatches * batchDuration.milliseconds)
}
logInfo("Manual clock after advancing = " + clock.getTimeMillis())
// Wait until expected number of output items have been generated
val startTime = System.currentTimeMillis()
while (output.size < numExpectedOutput &&
System.currentTimeMillis() - startTime < maxWaitTimeMillis) {
logInfo("output.size = " + output.size + ", numExpectedOutput = " + numExpectedOutput)
ssc.awaitTerminationOrTimeout(50)
}
val timeTaken = System.currentTimeMillis() - startTime
logInfo("Output generated in " + timeTaken + " milliseconds")
output.foreach(x => logInfo("[" + x.mkString(",") + "]"))
assert(timeTaken < maxWaitTimeMillis, "Operation timed out after " + timeTaken + " ms")
assert(output.size === numExpectedOutput, "Unexpected number of outputs generated")
Thread.sleep(100) // Give some time for the forgetting old RDDs to complete
} finally {
ssc.stop(stopSparkContext = true)
}
output
}
/**
* Verify whether the output values after running a DStream operation
* is same as the expected output values, by comparing the output
* collections either as lists (order matters) or sets (order does not matter)
*/
def verifyOutput[V: ClassTag](
output: Seq[Seq[V]],
expectedOutput: Seq[Seq[V]],
useSet: Boolean
) {
logInfo("--------------------------------")
logInfo("output.size = " + output.size)
logInfo("output")
output.foreach(x => logInfo("[" + x.mkString(",") + "]"))
logInfo("expected output.size = " + expectedOutput.size)
logInfo("expected output")
expectedOutput.foreach(x => logInfo("[" + x.mkString(",") + "]"))
logInfo("--------------------------------")
// Match the output with the expected output
for (i <- 0 until output.size) {
if (useSet) {
assert(
output(i).toSet === expectedOutput(i).toSet,
s"Set comparison failed\n" +
s"Expected output (${expectedOutput.size} items):\n${expectedOutput.mkString("\n")}\n" +
s"Generated output (${output.size} items): ${output.mkString("\n")}"
)
} else {
assert(
output(i).toList === expectedOutput(i).toList,
s"Ordered list comparison failed\n" +
s"Expected output (${expectedOutput.size} items):\n${expectedOutput.mkString("\n")}\n" +
s"Generated output (${output.size} items): ${output.mkString("\n")}"
)
}
}
logInfo("Output verified successfully")
}
/**
* Test unary DStream operation with a list of inputs, with number of
* batches to run same as the number of expected output values
*/
def testOperation[U: ClassTag, V: ClassTag](
input: Seq[Seq[U]],
operation: DStream[U] => DStream[V],
expectedOutput: Seq[Seq[V]],
useSet: Boolean = false
) {
testOperation[U, V](input, operation, expectedOutput, -1, useSet)
}
/**
* Test unary DStream operation with a list of inputs
* @param input Sequence of input collections
* @param operation DStream operation to be applied to the input
* @param expectedOutput Sequence of expected output collections
* @param numBatches Number of batches to run the operation for
* @param useSet Compare the output values with the expected output values
* as sets (order does not matter) or as lists (order matters)
*/
def testOperation[U: ClassTag, V: ClassTag](
input: Seq[Seq[U]],
operation: DStream[U] => DStream[V],
expectedOutput: Seq[Seq[V]],
numBatches: Int,
useSet: Boolean
) {
val numBatches_ = if (numBatches > 0) numBatches else expectedOutput.size
withStreamingContext(setupStreams[U, V](input, operation)) { ssc =>
val output = runStreams[V](ssc, numBatches_, expectedOutput.size)
verifyOutput[V](output, expectedOutput, useSet)
}
}
/**
* Test binary DStream operation with two lists of inputs, with number of
* batches to run same as the number of expected output values
*/
def testOperation[U: ClassTag, V: ClassTag, W: ClassTag](
input1: Seq[Seq[U]],
input2: Seq[Seq[V]],
operation: (DStream[U], DStream[V]) => DStream[W],
expectedOutput: Seq[Seq[W]],
useSet: Boolean
) {
testOperation[U, V, W](input1, input2, operation, expectedOutput, -1, useSet)
}
/**
* Test binary DStream operation with two lists of inputs
* @param input1 First sequence of input collections
* @param input2 Second sequence of input collections
* @param operation Binary DStream operation to be applied to the 2 inputs
* @param expectedOutput Sequence of expected output collections
* @param numBatches Number of batches to run the operation for
* @param useSet Compare the output values with the expected output values
* as sets (order does not matter) or as lists (order matters)
*/
def testOperation[U: ClassTag, V: ClassTag, W: ClassTag](
input1: Seq[Seq[U]],
input2: Seq[Seq[V]],
operation: (DStream[U], DStream[V]) => DStream[W],
expectedOutput: Seq[Seq[W]],
numBatches: Int,
useSet: Boolean
) {
val numBatches_ = if (numBatches > 0) numBatches else expectedOutput.size
withStreamingContext(setupStreams[U, V, W](input1, input2, operation)) { ssc =>
val output = runStreams[W](ssc, numBatches_, expectedOutput.size)
verifyOutput[W](output, expectedOutput, useSet)
}
}
}
These are a few things that you should check:
Verify that the resources you are specifying in the Spark config are actually available.
Search for the stop() keyword in your codebase and make sure it is not called on the SparkContext prematurely.
Spark has the Spark UI component, where you can see which jobs ran and whether they failed or succeeded, along with their logs. That will tell you why it is failing.
"Cannot call methods on a stopped SparkContext" is a consequence of some error that happened earlier. Look at the logs in $SPARK_HOME$/logs and $SPARK_HOME$/work.
Restarting the Spark context from the interpreter binding panel worked for me: you just have to click the refresh button next to the interpreter and save.
The issue is that only one SparkSession or SparkContext is allowed per JVM. Make sure the instance being used is a singleton, for instance by wrapping the singleton SparkSession (or SparkContext) in a SharedSparkSession (or SharedSparkContext) object.
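A minimal sketch of such a shared singleton (the object name and configuration values here are illustrative, not a specific library API):
import org.apache.spark.sql.SparkSession
// One SparkSession per JVM, created lazily on first use and shared by all tests/jobs.
object SharedSparkSession {
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("shared-test-session")
    .getOrCreate()
  // Call once, e.g. from an afterAll hook, when the whole suite is finished.
  def stop(): Unit = spark.stop()
}
Tests then refer to SharedSparkSession.spark (or SharedSparkSession.spark.sparkContext) instead of creating and stopping their own contexts.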