Task not serializable - foreach function spark - scala

I have a function getS3Object that gets a JSON object stored in S3:
def getS3Object(s3ObjectName: String): Unit = {
  val bucketName = "xyz"
  val object_to_write = s3client.getObject(bucketName, s3ObjectName)
  val file = new File(filename)
  val fileWriter = new FileWriter(file)
  val bw = new BufferedWriter(fileWriter)
  bw.write(object_to_write)
  bw.close()
  fileWriter.close()
}
My dataframe (df) contains one column where each row is the S3ObjectName
S3ObjectName
a1.json
b2.json
c3.json
d4.json
e5.json
When I execute the below logic I get an error saying "task is not serializable"
Method 1: df.foreach(x => getS3Object(x.getString(0)))
I tried converting the df to rdd but still get the same error
Method 2: df.rdd.foreach(x => getS3Object(x.getString(0)))
However it works with collect()
Method 3: df.collect.foreach(x => getS3Object(x.getString(0)))
I do not wish to use the collect() method, since all the elements of the dataframe are collected to the driver, which may result in an OutOfMemory error.
Is there a way to make the foreach() function work using Method 1?

The problem with your s3Client can be solved as follows. But you have to remember that these functions run on executor nodes (other machines), so writing to a local file with val file = new File(filename) is probably not going to work here.
You can put your files on some distributed file system like HDFS or S3.
object S3ClientWrapper extends Serializable {
  // s3Client must be created here.
  val s3Client = {
    val awsCreds = new BasicAWSCredentials("access_key_id", "secret_key_id")
    AmazonS3ClientBuilder.standard()
      .withCredentials(new AWSStaticCredentialsProvider(awsCreds))
      .build()
  }
}
def getS3Object(s3ObjectName: String): Unit = {
  val bucketName = "xyz"
  val object_to_write = S3ClientWrapper.s3Client.getObject(bucketName, s3ObjectName)
  // now you have to solve your file problem
}
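If the goal is to avoid local files entirely, here is a minimal sketch of one way to do that (assuming the AWS SDK v1 client built above and a hypothetical destination bucket xyz-output, which is not from the original post): copy each object named in the dataframe straight back to S3 from the executors, so nothing touches the local filesystem.
import org.apache.spark.sql.DataFrame

// Sketch only: copies every object named in the first column to a destination bucket.
def copyS3Objects(df: DataFrame): Unit = {
  df.rdd.foreachPartition { rows =>
    // S3ClientWrapper (and its client) is initialized once per executor JVM.
    val client = S3ClientWrapper.s3Client
    rows.foreach { row =>
      val key = row.getString(0)
      val obj = client.getObject("xyz", key)
      client.putObject("xyz-output", key, obj.getObjectContent, obj.getObjectMetadata)
      obj.close()
    }
  }
}
getObject and putObject with an InputStream plus ObjectMetadata are standard AWS SDK v1 calls; if the data really must land on disk, remember it would be each executor's disk, not the driver's.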

Related

Running Multiple Queries in Spark Structured Streaming with Watermarking and Windowed Aggregations

My aim is to read data from multiple Kafka topics, aggregate the data and write it to HDFS.
I looped through the list of Kafka topics to create multiple queries. The code runs fine with a single query but gives an error when running multiple queries. I've kept the checkpoint directories for all topics different, as I read in many posts that this can cause a similar issue.
The code is as follows:
object CombinedDcAggStreaming {
def main(args: Array[String]): Unit = {
val jobConfigFile = "configPath"
/* Read input configuration */
val jobProps = Util.loadProperties(jobConfigFile).asScala
val sparkConfigFile = jobProps.getOrElse("spark_config_file", throw new RuntimeException("Can't find spark property file"))
val kafkaConfigFile = jobProps.getOrElse("kafka_config_file", throw new RuntimeException("Can't find kafka property file"))
val sparkProps = Util.loadProperties(sparkConfigFile).asScala
val kafkaProps = Util.loadProperties(kafkaConfigFile).asScala
val topicList = Seq("topic_1", "topic_2")
val avroSchemaFile = jobProps.getOrElse("schema_file", throw new RuntimeException("Can't find schema file..."))
val checkpointLocation = jobProps.getOrElse("checkpoint_location", throw new RuntimeException("Can't find check point directory..."))
val triggerInterval = jobProps.getOrElse("triggerInterval", throw new RuntimeException("Can't find trigger interval..."))
val outputPath = jobProps.getOrElse("output_path", throw new RuntimeException("Can't find output directory..."))
val outputFormat = jobProps.getOrElse("output_format", throw new RuntimeException("Can't find output format...")) //"parquet"
val outputMode = jobProps.getOrElse("output_mode", throw new RuntimeException("Can't find output mode...")) //"append"
val partitionByCols = jobProps.getOrElse("partition_by_columns", throw new RuntimeException("Can't find partition by columns...")).split(",").toSeq
val spark = SparkSession.builder.appName("streaming").master("local[4]").getOrCreate()
sparkProps.foreach(prop => spark.conf.set(prop._1, prop._2))
topicList.foreach(
topicId => {
kafkaProps.update("subscribe", topicId)
val schemaPath = avroSchemaFile + "/" + topicId + ".avsc"
val dimensionMap = ConfigUtils.getDimensionMap(jobConfig)
val measureMap = ConfigUtils.getMeasureMap(jobConfig)
val source= Source.fromInputStream(Util.getInputStream(schemaPath)).getLines.mkString
val schemaParser = new Schema.Parser
val schema = schemaParser.parse(source)
val sqlTypeSchema = SchemaConverters.toSqlType(schema).dataType.asInstanceOf[StructType]
val kafkaStreamData = spark
.readStream
.format("kafka")
.options(kafkaProps)
.load()
val udfDeserialize = udf(deserialize(source), DataTypes.createStructType(sqlTypeSchema.fields))
val transformedDeserializedData = kafkaStreamData.select("value").as(Encoders.BINARY)
.withColumn("rows", udfDeserialize(col("value")))
.select("rows.*")
.withColumn("end_time", (col("end_time") / 1000).cast(LongType))
.withColumn("timestamp", from_unixtime(col("end_time"),"yyyy-MM-dd HH").cast(TimestampType))
.withColumn("year", from_unixtime(col("end_time"),"yyyy").cast(IntegerType))
.withColumn("month", from_unixtime(col("end_time"),"MM").cast(IntegerType))
.withColumn("day", from_unixtime(col("end_time"),"dd").cast(IntegerType))
.withColumn("hour",from_unixtime(col("end_time"),"HH").cast(IntegerType))
.withColumn("topic_id", lit(topicId))
val groupBycols: Array[String] = dimensionMap.keys.toArray[String] ++ partitionByCols.toArray[String]
val aggregatedData = AggregationUtils.aggregateDFWithWatermarking(transformedDeserializedData, groupBycols, "timestamp", "10 minutes", measureMap) //Watermarking time -> 10 minutes, window => window("timestamp", "5 minutes")
val query = aggregatedData
.writeStream
.trigger(Trigger.ProcessingTime(triggerInterval))
.outputMode("update")
.format("console")
.partitionBy(partitionByCols: _*)
.option("path", outputPath)
.option("checkpointLocation", checkpointLocation + "//" + topicId)
.start()
})
spark.streams.awaitAnyTermination()
def deserialize(source: String): Array[Byte] => Option[Row] = (data: Array[Byte]) => {
try {
val parser = new Schema.Parser
val schema = parser.parse(source)
val recordInjection: Injection[GenericRecord, Array[Byte]] = GenericAvroCodecs.toBinary(schema)
val record = recordInjection.invert(data).get
val objectArray = new Array[Any](record.asInstanceOf[GenericRecord].getSchema.getFields.size)
record.getSchema.getFields.asScala.foreach(field => {
val fieldVal = record.get(field.pos()) match {
case x: org.apache.avro.util.Utf8 => x.toString
case y: Any => y
case _ => None
}
objectArray(field.pos()) = fieldVal
})
Some(Row(objectArray: _*))
} catch {
case ex: Exception => {
log.info(s"Failed to parse schema with error: ${ex.printStackTrace()}")
None
}
}
}
}
}
I'm getting the following error while running the job:
java.lang.IllegalStateException: Race while writing batch 0
But the job runs normally when I run a single query instead of multiple. Any suggestions on how this issue can be solved?
It may be a late answer, but I also faced the same problem.
I was able to resolve it. The root cause was that both queries were trying to write to the same base path, so their _spark_metadata information overlapped. Spark Structured Streaming maintains checkpointing as well as a _spark_metadata directory to keep track of the batches being processed.
Source Spark Doc:
In order to correctly handle partial failures while maintaining exactly once semantics, the files for each batch are written out to a unique directory and then atomically appended to a metadata log. When a parquet based DataSource is initialized for reading, we first check for this log directory and use it instead of file listing when present.
Thus, for now, every query should be given a separate output path. There is no option to configure the _spark_metadata location, unlike with checkpointing.
Link to a similar question I asked.
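A minimal sketch of that fix against the code above, assuming the file sink is actually used for output (the posted snippet writes to the console): suffix both the output path and the checkpoint location with the topic id, so the _spark_metadata logs of the two queries never overlap.
val query = aggregatedData
  .writeStream
  .trigger(Trigger.ProcessingTime(triggerInterval))
  .outputMode(outputMode)                                   // e.g. "append"
  .format(outputFormat)                                     // e.g. "parquet"
  .partitionBy(partitionByCols: _*)
  .option("path", outputPath + "/" + topicId)               // unique base path per query
  .option("checkpointLocation", checkpointLocation + "/" + topicId)
  .start()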

NullPointerException applying a function to spark RDD that works on non-RDD

I have a function that I want to apply to a every row of a .csv file:
def convert(inString: Array[String]) : String = {
val country = inString(0)
val sellerId = inString(1)
val itemID = inString(2)
try{
val minidf = sqlContext.read.json( sc.makeRDD(inString(3):: Nil) )
.withColumn("country", lit(country))
.withColumn("seller_id", lit(sellerId))
.withColumn("item_id", lit(itemID))
val finalString = minidf.toJSON.collect().mkString(",")
finalString
} catch{
case e: Exception =>println("AN EXCEPTION "+inString.mkString(","))
("this is an exception "+e+" "+inString.mkString(","))
}
}
This function transforms an entry of the sort:
CA 112578240 132080411845 [{"id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
Where I have 4 columns, the 4th being a json blob, into
[{"country":"CA", "seller_id":"112578240", "item_id":"132080411845", "id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
which is the json object where the first 3 columns have been inserted into the fourth.
Now, this works:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).collect().map(x => convert(x))
or this:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).take(10).map(x => convert(x))
but this does not
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).map(x => convert(x))
The last one throws a java.lang.NullPointerException.
I included a try/catch clause to see exactly where this is failing, and it is failing for every single row.
What am I doing wrong here?
You cannot use sqlContext or sparkContext inside a Spark map, since those objects can only exist on the driver node. Essentially, they are in charge of distributing your tasks.
You could rewrite the JSON parsing bit using one of these libraries in pure Scala: https://manuel.bernhardt.io/2015/11/06/a-quick-tour-of-json-libraries-in-scala/
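A hedged sketch of that rewrite using json4s (which ships with Spark); the helper name convertLocal and the exact field handling are illustrative, not a drop-in for every input shape:
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Prepend country / seller_id / item_id to every object inside the 4th column's JSON,
// parsing on the executors without touching sqlContext or sc.
def convertLocal(inString: Array[String]): String = {
  val extra: List[JField] = List(
    "country"   -> JString(inString(0)),
    "seller_id" -> JString(inString(1)),
    "item_id"   -> JString(inString(2)))
  parse(inString(3)) match {
    case JArray(objs)    => compact(render(JArray(objs.collect { case JObject(fields) => JObject(extra ++ fields) })))
    case JObject(fields) => compact(render(JObject(extra ++ fields)))
    case other           => compact(render(other))
  }
}

val conv_string = sc.textFile(path_to_file).map(_.split('\t')).map(convertLocal)
Since nothing inside the map needs the SQL context any more, the plain .map(...) version works without collect().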

String filter using Spark UDF

input.csv:
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
What I want:
Read the input file and compare each line with the set "123,200,300"; if a match is found, output the matching data:
200,300 (from input line 1)
300 (from input line 2)
123 (from input line 4)
What I wrote:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
object sparkApp {
val conf = new SparkConf()
.setMaster("local")
.setAppName("CountingSheep")
val sc = new SparkContext(conf)
def parseLine(invCol: String) : RDD[String] = {
println(s"INPUT, $invCol")
val inv_rdd = sc.parallelize(Seq(invCol.toString))
val bs_meta_rdd = sc.parallelize(Seq("123,200,300"))
return inv_rdd.intersection(bs_meta_rdd)
}
def main(args: Array[String]) {
val filePathName = "hdfs://xxx/tmp/input.csv"
val rawData = sc.textFile(filePathName)
val datad = rawData.map{r => parseLine(r)}
}
}
I get the following exception:
java.lang.NullPointerException
Please suggest where I went wrong
Problem is solved. This is very simple.
val pfile = sc.textFile("/FileStore/tables/6mjxi2uz1492576337920/input.csv")
case class pSchema(id: Int, pName: String)
val pDF = pfile.map(_.split("\t")).map(p => pSchema(p(0).toInt,p(1).trim())).toDF()
pDF.select("id","pName").show()
Define the UDF (this assumes import org.apache.spark.sql.functions.udf and import spark.implicits._ are in scope):
val findP = udf((id: Int,
pName: String
) => {
val ids = Array("123","200","300")
var idsFound : String = ""
for (id <- ids){
if (pName.contains(id)){
idsFound = idsFound + id + ","
}
}
if (idsFound.length() > 0) {
idsFound = idsFound.substring(0,idsFound.length -1)
}
idsFound
})
Use the UDF in withColumn():
pDF.select("id","pName").withColumn("Found",findP($"id",$"pName")).show()
For a simpler answer, why are we making this so complex? In this case we don't require a UDF.
This is your input data:
200,300,889,767,9908,7768,9090|AAA
300,400,223,4456,3214,6675,333|BBB
234,567,890|CCC
123,445,667,887|DDD
and you have to match it with 123,200,300
val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\input.txt")
rawrdd.map(_.split('|'))
.map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(",") + "|" + arr(1))
.foreach(println)
Your output:
300,200|AAA
300|BBB
|CCC
123|DDD
What you are trying to do can't be done the way you are doing it.
Spark does not support nested RDDs or performing Spark actions inside of transformations (see SPARK-5063); this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow:
call of distinct and map together throws NPE in spark library
NullPointerException in Scala Spark, appears to be caused be collection type?
Graphx: I've got NullPointerException inside mapVertices
(those are just a sample of the ones that I've answered personally; there are many others).
I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. turn sc into a getter function); we can use this to add a better error message.
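To make the contrast concrete, here is a hedged sketch using the question's own names: the broken version ships sc inside a transformation, while the working version keeps the reference set on the driver (a broadcast variable is optional for a set this small) and does plain string work on the executors.
// Broken: parseLine calls sc.parallelize, but sc only exists on the driver,
// so the executors see a null SparkContext and throw a NullPointerException.
val datad = rawData.map(r => parseLine(r))

// Working: no SparkContext inside the transformation.
val matchSet = sc.broadcast("123,200,300".split(",").toSet)
val found = rawData.map(_.split(",").toSet.intersect(matchSet.value).mkString(","))
found.foreach(println)   // e.g. "200,300", "300", "", "123"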

Spark Streaming - Issue with Passing parameters

Please take a look at the following Spark Streaming code written in Scala:
object HBase {
var hbaseTable = ""
val hConf = new HBaseConfiguration()
hConf.set("hbase.zookeeper.quorum", "zookeeperhost")
def init(input: (String)) {
hbaseTable = input
}
def display() {
print(hbaseTable)
}
def insertHbase(row: (String)) {
val hTable = new HTable(hConf,hbaseTable)
}
}
object mainHbase {
def main(args : Array[String]) {
if (args.length < 5) {
System.err.println("Usage: MetricAggregatorHBase <zkQuorum> <group> <topics> <numThreads> <hbaseTable>")
System.exit(1)
}
val Array(zkQuorum, group, topics, numThreads, hbaseTable) = args
HBase.init(hbaseTable)
HBase.display()
val sparkConf = new SparkConf().setAppName("mainHbase")
val ssc = new StreamingContext(sparkConf, Seconds(10))
ssc.checkpoint("checkpoint")
val topicpMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap).map(_._2)
val storeStg = lines.foreachRDD(rdd => rdd.foreach(HBase.insertHbase))
lines.print()
ssc.start()
}
}
I am trying to initialize the parameter hbaseTable in the object HBase by calling the HBase.init method. It sets the parameter properly; I confirmed that by calling the HBase.display method on the next line.
However, when the HBase.insertHbase method in the foreachRDD is called, it throws an error saying that hbaseTable is not set.
Update with exception:
java.lang.IllegalArgumentException: Table qualifier must not be empty
org.apache.hadoop.hbase.TableName.isLegalTableQualifierName(TableName.java:179)
org.apache.hadoop.hbase.TableName.isLegalTableQualifierName(TableName.java:149)
org.apache.hadoop.hbase.TableName.<init>(TableName.java:303)
org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:339)
org.apache.hadoop.hbase.TableName.valueOf(TableName.java:426)
org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:156)
Can you please let me know how to make this code work?
"Where is this code running" - that's the question that we need to ask in order to understand what's going on.
HBase is a Scala object - by definition it's a singleton construct that gets initialized with 'only once' semantics in the JVM.
At the initialization point, HBase.init(hbaseTable) is executed in the driver of this Spark application, initializing this object with the given value in the VM of the driver.
But when we do rdd.foreach(HBase.insertHbase), the closure is executed as a task on each executor that hosts a partition of the given RDD. At that point, a fresh copy of the HBase object is created on each executor's VM, and no HBase.init call has ever happened on those copies.
There are two options:
We can add an "isInitialized" check to the HBase object and add the (now conditional) call to initialize on each call to foreach.
Another option would be to use
rdd.foreachPartition { partition =>
  HBase.initialize(...)
  partition.foreach(elem => HBase.insert(elem))
}
This construction will amortize any initialization cost over the number of elements in each partition. It's also possible to combine it with an initialization check to prevent unnecessary bootstrap work.
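A hedged sketch combining both ideas; the names initialize, isInitialized and insert below are illustrative rather than the exact API of the original HBase object:
object HBase {
  @volatile private var hbaseTable: String = _

  lazy val hConf = {
    val conf = new HBaseConfiguration()
    conf.set("hbase.zookeeper.quorum", "zookeeperhost")
    conf
  }

  def isInitialized: Boolean = hbaseTable != null

  def initialize(table: String): Unit = { hbaseTable = table }

  def insert(row: String): Unit = {
    val hTable = new HTable(hConf, hbaseTable)
    // ... build and execute the Put for this row ...
  }
}

// In the driver: the table name (a plain String from args) is captured by the closure,
// and each executor initializes its own copy of the HBase singleton once per partition.
lines.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    if (!HBase.isInitialized) HBase.initialize(hbaseTable)
    partition.foreach(HBase.insert)
  }
}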

Spark job not parallelising locally (using Parquet + Avro from local filesystem)

edit 2
Indirectly solved the problem by repartitioning the RDD into 8 partitions. Hit a roadblock with Avro objects not being "java serialisable"; found a snippet here to delegate Avro serialisation to Kryo. The original problem still remains.
edit 1: Removed local variable reference in map function
I'm writing a driver to run a compute-heavy job on Spark, using Parquet and Avro for IO/schema. I can't seem to get Spark to use all my cores. What am I doing wrong? Is it because I have set the keys to null?
I am just getting my head around how Hadoop organises files. AFAIK, since my file has a gigabyte of raw data, I should expect to see things parallelising with the default block and page sizes.
The function to ETL my input for processing looks as follows:
def genForum {
class MyWriter extends AvroParquetWriter[Topic](new Path("posts.parq"), Topic.getClassSchema) {
override def write(t: Topic) {
synchronized {
super.write(t)
}
}
}
def makeTopic(x: ForumTopic): Topic = {
// Omitted to save space
}
val writer = new MyWriter
val q =
DBCrawler.db.withSession {
Query(ForumTopics).filter(x => x.crawlState === TopicCrawlState.Done).list()
}
val sz = q.size
val c = new AtomicInteger(0)
q.par.foreach {
x =>
writer.write(makeTopic(x))
val count = c.incrementAndGet()
print(f"\r${count.toFloat * 100 / sz}%4.2f%%")
}
writer.close()
}
And my transformation looks as follows:
def sparkNLPTransformation() {
val sc = new SparkContext("local[8]", "forumAddNlp")
// io configuration
val job = new Job()
ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[Topic]])
ParquetOutputFormat.setWriteSupportClass(job,classOf[AvroWriteSupport])
AvroParquetOutputFormat.setSchema(job, Topic.getClassSchema)
// configure annotator
val props = new Properties()
props.put("annotators", "tokenize,ssplit,pos,lemma,parse")
val an = DAnnotator(props)
// annotator function
def annotatePosts(ann : DAnnotator, top : Topic) : Topic = {
val new_p = top.getPosts.map{ x=>
val at = new Annotation(x.getPostText.toString)
ann.annotator.annotate(at)
val t = at.get(classOf[SentencesAnnotation]).map(_.get(classOf[TreeAnnotation])).toList
val r = SpecificData.get().deepCopy[Post](x.getSchema,x)
if(t.nonEmpty) r.setTrees(t)
r
}
val new_t = SpecificData.get().deepCopy[Topic](top.getSchema,top)
new_t.setPosts(new_p)
new_t
}
// transformation
val ds = sc.newAPIHadoopFile("forum_dataset.parq", classOf[ParquetInputFormat[Topic]], classOf[Void], classOf[Topic], job.getConfiguration)
val new_ds = ds.map(x => (null, annotatePosts(an, x._2)))
new_ds.saveAsNewAPIHadoopFile("annotated_posts.parq",
classOf[Void],
classOf[Topic],
classOf[ParquetOutputFormat[Topic]],
job.getConfiguration
)
}
Can you confirm that the data is indeed in multiple blocks in HDFS? Check the total block count on the forum_dataset.parq file (for example with hdfs fsck).
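As a hedged follow-up to edit 2 (everything below reuses names from the question): the partition count Spark derived from the input can be checked directly, and forcing more partitions spreads the map over all of local[8]. Note that the Kryo registration for the Avro classes mentioned in edit 2 is still needed, because repartitioning shuffles the Topic objects.
val ds = sc.newAPIHadoopFile("forum_dataset.parq", classOf[ParquetInputFormat[Topic]], classOf[Void], classOf[Topic], job.getConfiguration)

// If this prints 1, the whole file is read as a single split and only one core does the work.
println(s"input partitions: ${ds.partitions.length}")

// Force the compute-heavy map onto all 8 local cores.
val new_ds = ds.repartition(8).map(x => (null, annotatePosts(an, x._2)))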