I'm trying to create a string sampler using a rdd of string as a dictionnary and the class RandomDataGenerator from org.apache.spark.mllib.random package.
import org.apache.spark.mllib.random.RandomDataGenerator
import org.apache.spark.rdd.RDD
import scala.util.Random
class StringSampler(var dic: RDD[String], var seed: Long = System.nanoTime) extends RandomDataGenerator[String] {
require(dic != null, "Dictionary cannot be null")
require(!dic.isEmpty, "Dictionary must contains lines (words)")
Random.setSeed(seed)
var fraction: Long = 1 / dic.count()
//return a random line from dictionary
override def nextValue(): String = dic.sample(withReplacement = true, fraction).take(1)(0)
override def setSeed(newSeed: Long): Unit = Random.setSeed(newSeed)
override def copy(): StringSampler = new StringSampler(dic)
def setDictionary(newDic: RDD[String]): Unit = {
require(newDic != null, "Dictionary cannot be null")
require(!newDic.isEmpty, "Dictionary must contains lines (words)")
dic = newDic
fraction = 1 / dic.count()
}
}
val dictionaryName: String
val dictionaries: Broadcast[Map[String, RDD[String]]]
val dictionary: RDD[String] = dictionaries.value(dictionaryName) // dictionary.cache()
val sampler = new StringSampler(dictionary)
RandomRDDs.randomRDD(context, sampler, size, numPartitions)
But I encounter a SparkException saying that the dictionary is lacking of a SparkContext when I try to generate a random RDD of strings. It seems that spark is loosing context of the dictionary rdd when copying it to cluster nodes and I don't know how to fix it.
I tried to cache the dictionary before passing it to the StringSampler, but it didn't change anything...
I was thinking about linking it back to the original SparkContext, but I don't even know if it's possible. Anyone have an idea ?
Caused by: org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
I believe the issue is here:
val dictionaries: Broadcast[Map[String, RDD[String]]]
val dictionary: RDD[String] = dictionaries.value(dictionaryName)
You should not be broadcasting anything containing an RDD. RDDs are already parallelized and spread throughout the cluster. The error comes from trying to serialize and deserialize an RDD, which loses its context and is pointless anyway.
Just do this:
val dictionaries: Map[String, RDD[String]]
val dictionary: RDD[String] = dictionaries(dictionaryName)
Related
I am trying to compare java and kryo serialization and on saving the rdd on disk, it is giving the same size when saveAsObjectFile is used, but on persist it shows different in spark ui. Kryo one is smaller then java, but the irony is processing time of java is lesser then that of kryo which was not expected from spark UI?
val conf = new SparkConf()
.setAppName("kyroExample")
.setMaster("local[*]")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.registerKryoClasses(
Array(classOf[Person],classOf[Array[Person]])
)
val sparkContext = new SparkContext(conf)
val personList: Array[Person] = (1 to 100000).map(value => Person(value + "", value)).toArray
val rddPerson: RDD[Person] = sparkContext.parallelize(personList)
val evenAgePerson: RDD[Person] = rddPerson.filter(_.age % 2 == 0)
case class Person(name: String, age: Int)
evenAgePerson.saveAsObjectFile("src/main/resources/objectFile")
evenAgePerson.persist(StorageLevel.MEMORY_ONLY_SER)
evenAgePerson.count()
persist and saveAsObjectFile solve different needs.
persist has a misleading name. It is not supposed to be used to permanently persist the rdd result. Persist is used to temporarily persist a computation result of the rdd during a spark workfklow. The user has no control, over the location of the persisted dataframe. Persist is just caching with different caching strategy - memory, disk or both. In fact cache will just call persist with a default caching strategy.
e.g
val errors = df.filter(col("line").like("%ERROR%"))
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
// Runs again on the full dataframe of all the lines , repeats the above operation
errors.filter(col("line").like("%MySQL%")).count()
vs
val errors = df.filter(col("line").like("%ERROR%"))
errors.persist()
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
// Runs only on the errors tmp result containing only the filtered error lines
errors.filter(col("line").like("%MySQL%")).count()
saveAsObjectFile is for permanent persistance. It is used for serializing the final result of a spark job onto a persistent and usually distributed filessystem like hdfs or amazon s3
Spark persist and saveAsObjectFile is not the same at all.
persist - persist your RDD DAG to the requested StorageLevel, thats mean from now and on any transformation apply on this RDD will be calculated only from the persisted DAG.
saveAsObjectFile - just save the RDD into a SequenceFile of serialized object.
saveAsObjectFile doesn`t use "spark.serializer" configuration at all.
as you can see the below code:
/**
* Save this RDD as a SequenceFile of serialized objects.
*/
def saveAsObjectFile(path: String): Unit = withScope {
this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
.map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))
.saveAsSequenceFile(path)
}
saveAsObjectFile uses Utils.serialize to serialize your object, when serialize method defenition is:
/** Serialize an object using Java serialization */
def serialize[T](o: T): Array[Byte] = {
val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(o)
oos.close()
bos.toByteArray
}
saveAsObjectFile use the Java Serialization always.
On the other hand, persist will use the configured spark.serializer you configure him.
I want to replace the string "a" for an array of Strings making .contains() to check for every String in the array. Is that possible?
val filtered = stream.flatMap(status => status.getText.split(" ").filter(_.contains("a")))
Edit:
Also tried this (sc is sparkContext):
val ssc = new StreamingContext(sc, Seconds(15))
val stream = TwitterUtils.createStream(ssc, None)
val filtered = stream.flatMap(status => status.getText.split(" ").filter(a.contains(_)))
And got the following error:
java.io.NotSerializableException: Object of org.apache.spark.streaming.twitter.TwitterInputDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects.
Then I tried to broadcast the array before it is used:
val aBroadcast = sc.broadcast(a)
val filtered = stream.flatMap(status => status.getText.split(" ").filter(aBroadcast.value.contains(_)))
And got the same error.
Thanks
As I understand the question you want to see if the status text after being split contains a list of words which is a subset of a:
val a = Array("a1", "a2")
val filtered = stream.flatMap(status => status.getText.split(" ").filter(_.forall(a contains))
I am a newbie to functional programming language and I am trying to learn spark scala
The goal is to partition the rdf datset by subject
the code is below:
object SimpleApp {
def main(args: Array[String]): Unit = {
val sparkConf =
new SparkConf().
setAppName("SimpleApp").
setMaster("local[2]").
set("spark.executor.memory", "1g")
val sc = new SparkContext(sparkConf)
val data = sc.textFile("/home/hduser/Bureau/11.txt")
val subject = data.map(_.split("\\s+")(0)).distinct.collect
}
}
So I get to recover the subjects but it returns an array of string also mapPartitions(func) and mapPartitionsWithIndex(func) : the func need to be iterator
So how do I proceed?
Partitioning your RDD by subject would probably best be done by using a HashPartitioner. The HashPartitioner works by taking an RDD of N-tuples and sorting the data by key eg
myPairRDD:
("sub1", "desc1")
("sub2", "desc2")
("sub1", "desc3")
("sub2", "desc4")
myPairRDD.partitionBy(new HashPartitioner(2))
becomes:
partition 1:
("sub1", "desc1")
("sub1", "desc3")
partition 2:
("sub2", "desc2")
("sub2", "desc4")
Therefore, your subjects RDD should probably be created more like this (note the extra brackets which create a tuple/pair RDD):
val subjectTuples = data.map((_.split("\\s+")(0), _.split("\\s+")(1)))
See the diagrams here for more info: https://blog.knoldus.com/2015/06/19/shufflling-and-repartitioning-of-rdds-in-apache-spark/
I am facing a strange behaviour from Spark. Here's my code:
object MyJob {
def main(args: Array[String]): Unit = {
val sc = new SparkContext()
val sqlContext = new hive.HiveContext(sc)
val query = "<Some Hive Query>"
val rawData = sqlContext.sql(query).cache()
val aggregatedData = rawData.groupBy("group_key")
.agg(
max("col1").as("max"),
min("col2").as("min")
)
val redisConfig = new RedisConfig(new RedisEndpoint(sc.getConf))
aggregatedData.foreachPartition {
rows =>
writePartitionToRedis(rows, redisConfig)
}
aggregatedData.write.parquet(s"/data/output.parquet")
}
}
Against my intuition the spark scheduler yields two jobs for each data sink (Redis, HDFS/Parquet). The problem is the second job is also performing the hive query and doubling the work. I assumed both write operations would share the data from aggregatedData stage. Is something wrong or is it behaviour to be expected?
You've missed a fundamental concept of spark: Lazyness.
An RDD does not contain any data, all it is is a set of instructions that will be executed when you call an action (like writing data to disk/hdfs). If you reuse an RDD (or Dataframe), there's no stored data, just store instructions that will need to be evaluated everytime you call an action.
If you want to reuse data without needing to reevaluate an RDD, use .cache() or preferably persist. Persisting an RDD allows you to store the result of a transformation so that the RDD doesn't need to be reevaluated in future iterations.
I need to pass SparkContext to my function and please suggest me how to do that for below scenario.
I have a Sequence, each element refers to specific data source from which we gets RDD and process them. I have defined a function which takes spark context and the data source and does the necessary things. I am curretly using while loop. But, i would like to do it with foreach or map, so that i can imply parallel processing. I need to spark context for the function, but how can i pass it from the foreach.?
Just a SAMPLE code, as i cannot present the actual code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
object RoughWork {
def main(args: Array[String]) {
val str = "Hello,hw:How,sr:are,ws:You,re";
val conf = new SparkConf
conf.setMaster("local");
conf.setAppName("app1");
val sc = new SparkContext(conf);
val sqlContext = new SQLContext(sc);
val rdd = sc.parallelize(str.split(":"))
rdd.map(x => {println("==>"+x);passTest(sc, x)}).collect();
}
def passTest(context: SparkContext, input: String) {
val rdd1 = context.parallelize(input.split(","));
rdd1.foreach(println)
}
}
You cannot pass the SparkContext around like that. passTest will be run on an/the executor(s), while the SparkContext runs on the driver.
If I would have to do a double split like that, one approach would be to use flatMap:
rdd
.zipWithIndex
.flatMap(l => {
val parts = l._1.split(",");
List.fill(parts.length)(l._2) zip parts})
.countByKey
There may be prettier ways, but basically the idea is that you can use zipWithIndex to keep track which line an item came from and then use key-value pair RDD methods to work on your data.
If you have more than one key, or just more structured data in general, you can look into using Spark SQL with DataFrames (or DataSets in latest version) and explode instead of flatMap.