Spark job not parallelising locally (using Parquet + Avro from local filesystem) - scala

edit 2
Indirectly solved the problem by repartitioning the RDD into 8 partitions. Hit a roadblock with avro objects not being "java serialisable" found a snippet here to delegate avro serialisation to kryo. The original problem still remains.
edit 1: Removed local variable reference in map function
I'm writing a driver to run a compute heavy job on spark using parquet and avro for io/schema. I can't seem to get spark to use all my cores. What am I doing wrong ? Is it because I have set the keys to null ?
I am just getting my head around how hadoop organises files. AFAIK since my file has a gigabyte of raw data I should expect to see things parallelising with the default block and page sizes.
The function to ETL my input for processing looks as follows :
def genForum {
class MyWriter extends AvroParquetWriter[Topic](new Path("posts.parq"), Topic.getClassSchema) {
override def write(t: Topic) {
synchronized {
super.write(t)
}
}
}
def makeTopic(x: ForumTopic): Topic = {
// Ommited to save space
}
val writer = new MyWriter
val q =
DBCrawler.db.withSession {
Query(ForumTopics).filter(x => x.crawlState === TopicCrawlState.Done).list()
}
val sz = q.size
val c = new AtomicInteger(0)
q.par.foreach {
x =>
writer.write(makeTopic(x))
val count = c.incrementAndGet()
print(f"\r${count.toFloat * 100 / sz}%4.2f%%")
}
writer.close()
}
And my transformation looks as follows :
def sparkNLPTransformation() {
val sc = new SparkContext("local[8]", "forumAddNlp")
// io configuration
val job = new Job()
ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[Topic]])
ParquetOutputFormat.setWriteSupportClass(job,classOf[AvroWriteSupport])
AvroParquetOutputFormat.setSchema(job, Topic.getClassSchema)
// configure annotator
val props = new Properties()
props.put("annotators", "tokenize,ssplit,pos,lemma,parse")
val an = DAnnotator(props)
// annotator function
def annotatePosts(ann : DAnnotator, top : Topic) : Topic = {
val new_p = top.getPosts.map{ x=>
val at = new Annotation(x.getPostText.toString)
ann.annotator.annotate(at)
val t = at.get(classOf[SentencesAnnotation]).map(_.get(classOf[TreeAnnotation])).toList
val r = SpecificData.get().deepCopy[Post](x.getSchema,x)
if(t.nonEmpty) r.setTrees(t)
r
}
val new_t = SpecificData.get().deepCopy[Topic](top.getSchema,top)
new_t.setPosts(new_p)
new_t
}
// transformation
val ds = sc.newAPIHadoopFile("forum_dataset.parq", classOf[ParquetInputFormat[Topic]], classOf[Void], classOf[Topic], job.getConfiguration)
val new_ds = ds.map(x=> ( null, annotatePosts(x._2) ) )
new_ds.saveAsNewAPIHadoopFile("annotated_posts.parq",
classOf[Void],
classOf[Topic],
classOf[ParquetOutputFormat[Topic]],
job.getConfiguration
)
}

Can you confirm that the data is indeed in multiple blocks in HDFS? The total block count on the forum_dataset.parq file

Related

How to batch columns of spark dataframe, process with REST API and add it back?

I have a dataframe in spark and I need to process a particular column in that dataframe using a REST API. The API does some transformation to a string and returns a result string. The API can process multiple strings at a time.
I can iterate over the columns of the dataframe, collect n values of the column in a batch and call the api and then add it back to the dataframe, and continue with the next batch. But this seems like the normal way of doing it without taking advantage of spark.
Is there a better way to do this which can take advantage of spark sql optimiser and spark parallel processing?
For Spark parallel processing you can use mapPartitions
case class Input(col: String)
case class Output ( col : String,new_col : String )
val data = spark.read.csv("/a/b/c").as[Input].repartiton(n)
def declare(partitions: Iterator[Input]): Iterator[Output] ={
val url = ""
implicit val formats: DefaultFormats.type = DefaultFormats
var list = new ListBuffer[Output]()
val httpClient =
try {
while (partitions.hasNext) {
val x = partitions.next()
val col = x.col
val concat_url =""
val apiResp = HttpClientAcceptSelfSignedCertificate.call(httpClient, concat_url)
if (apiResp.isDefined) {
val json = parse(apiResp.get)
val new_col = (json \\"value_to_take_from_api").children.head.values.toString
val output = Output(col,new_col)
list+=output
}
else {
val new_col = "Not Found"
val output = Output(col,new_col)
list+=output
}
}
} catch {
case e: Exception => println("api Exception with : " + e.getMessage)
}
finally {
HttpClientAcceptSelfSignedCertificate.close(httpClient)
}
list.iterator
}
val dd:Dataset[Output] =data.mapPartitions(x=>declare(x))

Task not serializable - foreach function spark

I have a function getS3Object to get a json object stored in S3
def getS3Object (s3ObjectName) : Unit = {
val bucketName = "xyz"
val object_to_write = s3client.getObject(bucketName, s3ObjectName)
val file = new File(filename)
fileWriter = new FileWriter(file)
bw = new BufferedWriter(fileWriter)
bw.write(object_to_write)
bw.close()
fileWriter.close()
}
My dataframe (df) contains one column where each row is the S3ObjectName
S3ObjectName
a1.json
b2.json
c3.json
d4.json
e5.json
When I execute the below logic I get an error saying "task is not serializable"
Method 1:- df.foreach(x => getS3Object(x.getString(0))
I tried converting the df to rdd but still get the same error
Method 2:- df.rdd.foreach(x => getS3Object(x.getString(0))
However it works with collect()
Method 3:- df.collect.foreach(x => getS3Object(x.getString(0))
I do not wish to use the collect() method as all the elements of the dataframe are collected to the driver and potentially result in OutOfMemory error.
Is there a way to make the foreach() function work using Method 1?
The problem for your s3Client can be solved as following. But you have to remember that these functions run on executor nodes (other machines), so your whole val file = new File(filename) thing is probably not going to work here.
You can put your files on some distibuted file system like HDFS or S3.
object S3ClientWrapper extends Serializable {
// s3Client must be created here.
val s3Client = {
val awsCreds = new BasicAWSCredentials("access_key_id", "secret_key_id")
AmazonS3ClientBuilder.standard()
.withCredentials(new AWSStaticCredentialsProvider(awsCreds))
.build()
}
}
def getS3Object (s3ObjectName) : Unit = {
val bucketName = "xyz"
val object_to_write = S3ClientWrapper.s3Client.getObject(bucketName, s3ObjectName)
// now you have to solve your file problem
}

Running Multiple Queries in Spark Structured Streaming with Watermarking and Windowed Aggregations

My aim is to read data from multiple Kafka topics, aggregate the data and write into hdfs.
I looped through the list of kafka topics to create multiple queries. The code runs fine while running a single query but gives error while running multiple queries. I've kept the checkpoint directories for all topics different as I read in many posts that this can cause a similar issue.
The code is as follows:
object CombinedDcAggStreaming {
def main(args: Array[String]): Unit = {
val jobConfigFile = "configPath"
/* Read input configuration */
val jobProps = Util.loadProperties(jobConfigFile).asScala
val sparkConfigFile = jobProps.getOrElse("spark_config_file", throw new RuntimeException("Can't find spark property file"))
val kafkaConfigFile = jobProps.getOrElse("kafka_config_file", throw new RuntimeException("Can't find kafka property file"))
val sparkProps = Util.loadProperties(sparkConfigFile).asScala
val kafkaProps = Util.loadProperties(kafkaConfigFile).asScala
val topicList = Seq("topic_1", "topic_2")
val avroSchemaFile = jobProps.getOrElse("schema_file", throw new RuntimeException("Can't find schema file..."))
val checkpointLocation = jobProps.getOrElse("checkpoint_location", throw new RuntimeException("Can't find check point directory..."))
val triggerInterval = jobProps.getOrElse("triggerInterval", throw new RuntimeException("Can't find trigger interval..."))
val outputPath = jobProps.getOrElse("output_path", throw new RuntimeException("Can't find output directory..."))
val outputFormat = jobProps.getOrElse("output_format", throw new RuntimeException("Can't find output format...")) //"parquet"
val outputMode = jobProps.getOrElse("output_mode", throw new RuntimeException("Can't find output mode...")) //"append"
val partitionByCols = jobProps.getOrElse("partition_by_columns", throw new RuntimeException("Can't find partition by columns...")).split(",").toSeq
val spark = SparkSession.builder.appName("streaming").master("local[4]").getOrCreate()
sparkProps.foreach(prop => spark.conf.set(prop._1, prop._2))
topicList.foreach(
topicId => {
kafkaProps.update("subscribe", topicId)
val schemaPath = avroSchemaFile + "/" + topicId + ".avsc"
val dimensionMap = ConfigUtils.getDimensionMap(jobConfig)
val measureMap = ConfigUtils.getMeasureMap(jobConfig)
val source= Source.fromInputStream(Util.getInputStream(schemaPath)).getLines.mkString
val schemaParser = new Schema.Parser
val schema = schemaParser.parse(source)
val sqlTypeSchema = SchemaConverters.toSqlType(schema).dataType.asInstanceOf[StructType]
val kafkaStreamData = spark
.readStream
.format("kafka")
.options(kafkaProps)
.load()
val udfDeserialize = udf(deserialize(source), DataTypes.createStructType(sqlTypeSchema.fields))
val transformedDeserializedData = kafkaStreamData.select("value").as(Encoders.BINARY)
.withColumn("rows", udfDeserialize(col("value")))
.select("rows.*")
.withColumn("end_time", (col("end_time") / 1000).cast(LongType))
.withColumn("timestamp", from_unixtime(col("end_time"),"yyyy-MM-dd HH").cast(TimestampType))
.withColumn("year", from_unixtime(col("end_time"),"yyyy").cast(IntegerType))
.withColumn("month", from_unixtime(col("end_time"),"MM").cast(IntegerType))
.withColumn("day", from_unixtime(col("end_time"),"dd").cast(IntegerType))
.withColumn("hour",from_unixtime(col("end_time"),"HH").cast(IntegerType))
.withColumn("topic_id", lit(topicId))
val groupBycols: Array[String] = dimensionMap.keys.toArray[String] ++ partitionByCols.toArray[String]
)
val aggregatedData = AggregationUtils.aggregateDFWithWatermarking(transformedDeserializedData, groupBycols, "timestamp", "10 minutes", measureMap) //Watermarking time -> 10. minutes, window => window("timestamp", "5 minutes")
val query = aggregatedData
.writeStream
.trigger(Trigger.ProcessingTime(triggerInterval))
.outputMode("update")
.format("console")
.partitionBy(partitionByCols: _*)
.option("path", outputPath)
.option("checkpointLocation", checkpointLocation + "//" + topicId)
.start()
})
spark.streams.awaitAnyTermination()
def deserialize(source: String): Array[Byte] => Option[Row] = (data: Array[Byte]) => {
try {
val parser = new Schema.Parser
val schema = parser.parse(source)
val recordInjection: Injection[GenericRecord, Array[Byte]] = GenericAvroCodecs.toBinary(schema)
val record = recordInjection.invert(data).get
val objectArray = new Array[Any](record.asInstanceOf[GenericRecord].getSchema.getFields.size)
record.getSchema.getFields.asScala.foreach(field => {
val fieldVal = record.get(field.pos()) match {
case x: org.apache.avro.util.Utf8 => x.toString
case y: Any => y
case _ => None
}
objectArray(field.pos()) = fieldVal
})
Some(Row(objectArray: _*))
} catch {
case ex: Exception => {
log.info(s"Failed to parse schema with error: ${ex.printStackTrace()}")
None
}
}
}
}
}
I'm getting the following error while running the job:
java.lang.IllegalStateException: Race while writing batch 0
But the job runs normally when I run a single query instead of multiple. Any suggestions on how this issue can be solved?
It may be a late answer. But I also faced the same problem.
I was able to resolve the problem. The root cause was that both the queries were trying to write to the same base path. Thus there was an overlap of the _spark_meta information. Spark Structured Streaming maintain checkpointing, as well as _spark_metadata file to keep track of the batch being processed.
Source Spark Doc:
In order to correctly handle partial failures while maintaining exactly once semantics, the files for each batch are written out to a unique directory and then atomically appended to a metadata log. When a parquet based DataSource is initialized for reading, we first check for this log directory and use it instead of file listing when present.
Thus for now every query should be given a separate path. There is no option to configure the _spark_matadata location, unlike in checkpointing.
Link to same type of question which I asked.

spark job freeze when started in ParArray

I want to convert a set of time-serial data to Labeledpoint from multiple csv files and save to parquet file. Csv Files are small, usually < 10MiB
When I start it with ParArray, it submit 4 jobs a time and freeze . codes here
val idx = Another_DataFrame
ListFiles(new File("data/stock data"))
.filter(_.getName.contains(".csv")).zipWithIndex
.par //comment this line and code runs smoothly
.foreach{
f=>
val stk = spark_csv(f._1.getPath) //doing good
ColMerge(stk,idx,RESULT_PATH(f)) //freeze here
stk.unpersist()
}
and the freeze part:
def ColMerge(ori:DataFrame,index:DataFrame,PATH:String) = {
val df = ori.join(index,ori("date")===index("index_date")).drop("index_date").orderBy("date").cache
val head = df.head
val col = df.columns.filter(e=>e!="code"&&e!="date"&&e!="name")
val toMap = col.filter{
e=>head.get(head.fieldIndex(e)).isInstanceOf[String]
}.sorted
val toCast = col.diff(toMap).filterNot(_=="data")
val res: Array[((String, String, Array[Double]), Long)] = df.sort("date").map{
row=>
val res1= toCast.map{
col=>
row.getDouble(row.fieldIndex(col))
}
val res2= toMap.flatMap{
col=>
val mapping = new Array[Double](GlobalConfig.ColumnMapping(col).size)
row.getString(row.fieldIndex(col)).split(";").par.foreach{
word=>
mapping(GlobalConfig.ColumnMapping(col)(word)) = 1
}
mapping
}
(
row.getString(row.fieldIndex("code")),
row.getString(row.fieldIndex("date")),
res1++res2++row.getAs[Seq[Double]]("data")
)
}.zipWithIndex.collect
df.unpersist
val dataset = GlobalConfig.sctx.makeRDD(res.map{
day=>
(day._1._1,
day._1._2,
try{
new LabeledPoint(GetHighPrice(res(day._2.toInt+2)._1._3.slice(0,4))/GetLowPrice(res(day._2.toInt)._1._3.slice(0,4))*1.03,Vectors.dense(day._1._3))
}
catch {
case ex:ArrayIndexOutOfBoundsException=>
new LabeledPoint(-1,Vectors.dense(day._1._3))
}
)
}).filter(_._3.label != -1).toDF("code","date","labeledpoint")
dataset.write.mode(SaveMode.Overwrite).parquet(PATH)
}
The exact job that freezes is the DataFrame.sort() or zipWithIndex when generating res in ColMerge
Since most part of the job get done after collect I really want to use ParArray to accelerate ColMerge but this weird freeze stopped me from doing so. Do I need to new a thread pool to do this?

Spark Streaming using Scala to insert to Hbase Issue

I am trying to read records from Kafka message and put into Hbase. Though the scala script is running with out any issue, the inserts are not happening. Please help me.
Input:
rowkey1,1
rowkey2,2
Here is the code which I am using:
object Blaher {
def blah(row: Array[String]) {
val hConf = new HBaseConfiguration()
val hTable = new HTable(hConf, "test")
val thePut = new Put(Bytes.toBytes(row(0)))
thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(row(1)))
hTable.put(thePut)
}
}
object TheMain extends Serializable{
def run() {
val ssc = new StreamingContext(sc, Seconds(1))
val topicmap = Map("test" -> 1)
val lines = KafkaUtils.createStream(ssc,"127.0.0.1:2181", "test-consumer-group",topicmap).map(_._2)
val words = lines.map(line => line.split(",")).map(line => (line(0),line(1)))
val store = words.foreachRDD(rdd => rdd.foreach(Blaher.blah))
ssc.start()
}
}
TheMain.run()
From the API doc for HTable's flushCommits() method: "Executes all the buffered Put operations". You should call this at the end of your blah() method -- it looks like they're currently being buffered but never executed or executed at some random time.