I have a parquet file on HDFS. It is overwritten daily with a new one. My goal is to emit this parquet file continuously - when it changes - as a DataStream in a Flink Job using the DataStream API.
The end goal is to use the file content in a Broadcast State, but this is out of scope for this question.
To process a file continuously, Flink's data sources API is very useful, more specifically FileProcessingMode.PROCESS_CONTINUOUSLY: this is exactly what I need. It works fine for reading/monitoring text files, but not for Parquet files:
// Partial version 1: the raw file is processed continuously
val path: String = "hdfs://hostname/path_to_file_dir/"
val textInputFormat: TextInputFormat = new TextInputFormat(new Path(path))
// monitor the file continuously every minute
val stream: DataStream[String] = streamExecutionEnvironment.readFile(textInputFormat, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 60000)
To process Parquet files, I can use Hadoop input formats via Flink's Hadoop compatibility API (using-hadoop-inputformats). However, there is no FileProcessingMode parameter in this API, and it processes the file only once:
// Partial version 2: the parquet file is only processed once
val parquetPath: String = "/path_to_file_dir/parquet_0000"
// Parquet input format (via the Hive mapred wrapper)
val hadoopInputFormat: HadoopInputFormat[Void, ArrayWritable] = HadoopInputs.readHadoopFile(new MapredParquetInputFormat(), classOf[Void], classOf[ArrayWritable], parquetPath)
val stream: DataStream[(Void, ArrayWritable)] = streamExecutionEnvironment.createInput(hadoopInputFormat).map { record =>
  // process the record here ...
  record
}
I would like to somehow combine the two APIs, in order to continuously process Parquet files via the DataStream API. Has anyone tried something like this?
After browsing Flink's code, it looks like those two APIs are quite different, and it does not seem possible to merge them.
The other approach, which I will detail here, is to define your own SourceFunction that periodically reads the file:
class ParquetSourceFunction extends SourceFunction[Int] {
  private var isRunning = true

  override def run(ctx: SourceFunction.SourceContext[Int]): Unit = {
    while (isRunning) {
      val path = new Path("path_to_parquet_file")
      val conf = new Configuration()
      // read the footer to get the schema and the row group metadata
      val readFooter = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER)
      val metadata = readFooter.getFileMetaData
      val schema = metadata.getSchema
      val parquetFileReader = new ParquetFileReader(conf, metadata, path, readFooter.getBlocks, schema.getColumns)
      var pages: PageReadStore = null
      try {
        // iterate over the row groups and emit one record per row
        while ({ pages = parquetFileReader.readNextRowGroup(); pages != null }) {
          val rows = pages.getRowCount
          val columnIO = new ColumnIOFactory().getColumnIO(schema)
          val recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema))
          (0L until rows).foreach { _ =>
            val group = recordReader.read()
            val my_integer = group.getInteger("field_name", 0)
            ctx.collect(my_integer)
          }
        }
      } finally {
        parquetFileReader.close()
      }
      // do whatever logic suits you to stop "watching" the file
      Thread.sleep(60000)
    }
  }

  override def cancel(): Unit = isRunning = false
}
Then, use the streamExecutionEnvironment to register this source:
val dataStream: DataStream[Int] = streamExecutionEnvironment.addSource(new ParquetSourceFunction)
// do what you want with your new datastream
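The comment in run() about "watching" the file can be made concrete by polling the file's HDFS modification time and only re-reading when it changes. Below is a minimal sketch of that idea; ChangeAwareParquetSource and readAndEmit are hypothetical names, and the path is just a placeholder, so treat it as an outline rather than a drop-in implementation:
// Sketch only: re-read the Parquet file only when its HDFS modification time changes.
class ChangeAwareParquetSource extends SourceFunction[Int] {
  @volatile private var isRunning = true

  override def run(ctx: SourceFunction.SourceContext[Int]): Unit = {
    val path = new Path("hdfs://hostname/path_to_file_dir/parquet_0000")
    val fs = path.getFileSystem(new Configuration())
    var lastModified = -1L
    while (isRunning) {
      val status = fs.getFileStatus(path)
      if (status.getModificationTime != lastModified) {
        lastModified = status.getModificationTime
        readAndEmit(ctx) // hypothetical helper containing the ParquetFileReader loop shown above
      }
      Thread.sleep(60000) // poll every minute
    }
  }

  override def cancel(): Unit = isRunning = false

  private def readAndEmit(ctx: SourceFunction.SourceContext[Int]): Unit = {
    // same ParquetFileReader logic as in ParquetSourceFunction.run()
  }
}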
I have a function getS3Object to get a JSON object stored in S3:
def getS3Object(s3ObjectName: String): Unit = {
  val bucketName = "xyz"
  // getObjectAsString returns the object's content as a String, so it can be written out below
  val object_to_write = s3client.getObjectAsString(bucketName, s3ObjectName)
  val file = new File(filename) // filename is defined elsewhere
  val fileWriter = new FileWriter(file)
  val bw = new BufferedWriter(fileWriter)
  bw.write(object_to_write)
  bw.close()
  fileWriter.close()
}
My dataframe (df) contains one column where each row is the S3ObjectName
S3ObjectName
a1.json
b2.json
c3.json
d4.json
e5.json
When I execute the below logic I get an error saying "task is not serializable"
Method 1: df.foreach(x => getS3Object(x.getString(0)))
I tried converting the df to an RDD, but I still get the same error:
Method 2: df.rdd.foreach(x => getS3Object(x.getString(0)))
However, it works with collect():
Method 3: df.collect.foreach(x => getS3Object(x.getString(0)))
I do not wish to use collect(), since it brings all the elements of the dataframe to the driver and can potentially result in an OutOfMemory error.
Is there a way to make the foreach() function work using Method 1?
The problem with your s3Client can be solved as follows. But remember that these functions run on executor nodes (other machines), so your whole val file = new File(filename) part is probably not going to work there.
You can instead put the files on a distributed file system such as HDFS or S3.
object S3ClientWrapper extends Serializable {
  // s3Client must be created here, so it is instantiated on each executor JVM.
  val s3Client = {
    val awsCreds = new BasicAWSCredentials("access_key_id", "secret_key_id")
    AmazonS3ClientBuilder.standard()
      .withCredentials(new AWSStaticCredentialsProvider(awsCreds))
      .build()
  }
}
def getS3Object(s3ObjectName: String): Unit = {
  val bucketName = "xyz"
  val object_to_write = S3ClientWrapper.s3Client.getObject(bucketName, s3ObjectName)
  // now you have to solve your file problem
}
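To sketch what "solving the file problem" could look like when running on the executors, here is a hedged example that copies each object straight to HDFS instead of a local file. The target directory /landing/json/ and the use of getObjectAsString are assumptions for illustration, not part of the original question:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.nio.charset.StandardCharsets

// Sketch: runs on the executors; each row's S3 object is copied to HDFS.
df.rdd.foreachPartition { rows =>
  val fs = FileSystem.get(new Configuration()) // executor-side Hadoop FileSystem (defaults to HDFS on the cluster)
  rows.foreach { row =>
    val s3ObjectName = row.getString(0)
    // getObjectAsString is part of the AWS SDK's AmazonS3 interface
    val content = S3ClientWrapper.s3Client.getObjectAsString("xyz", s3ObjectName)
    val out = fs.create(new Path(s"/landing/json/$s3ObjectName")) // assumed target directory
    try out.write(content.getBytes(StandardCharsets.UTF_8))
    finally out.close()
  }
}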
Using Spark on GCP Dataproc, I successfully write an entire RDD to GCS like so:
rdd.saveAsTextFile(s"gs://$path")
This produces one file per partition, all under the same path.
How do I write a file for each partition, with a unique path based on information from that partition?
Below is an invented, non-working example of what I would like to do:
rdd.mapPartitionsWithIndex(
  (i, partition) => {
    partition.write(path = s"gs://partition_$i", data = partition_specific_data)
  }
)
When I call the function below from within a partition on my Mac, it writes to local disk; on Dataproc I get an error saying that gs is not recognized as a valid path.
def writeLocally(filePath: String, data: Array[Byte], errorMessage: String): Unit = {
  println("Juicy Platform")
  val path = new Path(filePath)
  var ofos: Option[FSDataOutputStream] = None
  try {
    println(s"\nTrying to write to $filePath\n")
    val conf = new Configuration()
    conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    // conf.addResource(new Path("/home/hadoop/conf/core-site.xml"))
    println(conf.toString)
    val fs = FileSystem.get(conf)
    val fos = fs.create(path)
    ofos = Option(fos)
    fos.write(data)
    println(s"\nWrote to $filePath\n")
  }
  catch {
    case e: Exception =>
      logError(errorMessage, s"Exception occurred writing to GCS:\n${ExceptionUtils.getStackTrace(e)}")
  }
  finally {
    ofos match {
      case Some(i) => i.close()
      case _ =>
    }
  }
}
This is the error:
java.lang.IllegalArgumentException: Wrong FS: gs://path/myFile.json, expected: hdfs://cluster-95cf-m
If running on a Dataproc cluster, you shouldn't need to explicitly populate "fs.gs.impl" in the Configuration; a new Configuration() should already contain the necessary mappings.
The main problem here is that val fs = FileSystem.get(conf) is using the fs.defaultFS property of the conf; it has no way of knowing whether you wanted to get a FileSystem instance specific to HDFS or to GCS. In general, in Hadoop and Spark, a FileSystem instance is fundamentally tied to a single URL scheme; you need to fetch a scheme-specific instance for each different scheme, such as hdfs://, gs://, or s3://.
The simplest fix to your problem is to always use Path.getFileSystem(Configuration) as opposed to FileSystem.get(Configuration). And make sure your path is fully-qualified with the scheme:
...
val path = new Path("gs://bucket/foo/data")
val fs = path.getFileSystem(conf)
val fos = fs.create(path)
ofos = Option(fos)
fos.write(data)
...
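Tying this back to the original goal of one output location per partition: below is a hedged sketch of how each partition could obtain a scheme-specific FileSystem for its own gs:// path and write its records there. The bucket name, path layout, and record serialization (toString plus newline) are assumptions for illustration:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import java.nio.charset.StandardCharsets

// Sketch: one output file per partition, each under its own gs:// path.
val writtenPaths = rdd.mapPartitionsWithIndex { (i, partition) =>
  val outPath = new Path(s"gs://my-bucket/output/partition_$i/data.txt") // assumed layout
  val fs = outPath.getFileSystem(new Configuration()) // scheme-specific FileSystem for gs://
  val out = fs.create(outPath)
  try partition.foreach(record => out.write((record.toString + "\n").getBytes(StandardCharsets.UTF_8)))
  finally out.close()
  Iterator.single(outPath.toString) // return the path so there is something to materialize
}.collect()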
I use Play Framework 2.6 (Scala) and Alpakka AWS S3 Connector to upload files asynchronously to S3 bucket. My code looks like this:
def richUpload(extension: String, checkFunction: (String, Option[String]) => Boolean, cannedAcl: CannedAcl, bucket: String) =
  userAction(parse.multipartFormData(handleFilePartAsFile)).async { implicit request =>
    val s3Filename = request.user.get.id + "/" + java.util.UUID.randomUUID.toString + "." + extension
    val fileOption = request.body.file("file").map {
      case FilePart(key, filename, contentType, file) =>
        Logger.info(s"key = ${key}, filename = ${filename}, contentType = ${contentType}, file = $file")
        if (checkFunction(filename, contentType)) {
          s3Service.uploadSink(s3Filename, cannedAcl, bucket).runWith(FileIO.fromPath(file.toPath))
        } else {
          throw new Exception("Upload failed")
        }
    }
    fileOption match {
      case Some(opt) => opt.map(o => Ok(s3Filename))
      case _ => Future.successful(BadRequest("ERROR"))
    }
  }
It works, but it returns the filename before the upload to S3 has finished. I want to return the value only after the upload to S3 completes. Is there any solution?
Also, is it possible to stream the file upload directly to S3, in order to show progress correctly and avoid the temporary disk file?
You need to flip around your source and sink to obtain the materialized value you are interested in.
You have:
a source that reads from your local files, and materializes to a Future[IOResult] upon completion of reading the file.
a sink that writes to S3 and materializes to Future[MultipartUploadResult] upon completion of writing to S3.
You are interested in the latter, but in your code you are using the former. This is because the runWith function always keeps the materialized value of the stage passed as a parameter.
The types in the sample snippet below should clarify this:
val fileSource: Source[ByteString, Future[IOResult]] = ???
val s3Sink: Sink[ByteString, Future[MultipartUploadResult]] = ???
val m1: Future[IOResult] = s3Sink.runWith(fileSource)
val m2: Future[MultipartUploadResult] = fileSource.runWith(s3Sink)
After you have obtained a Future[MultipartUploadResult], you can map on it in the same way and access the location field to get the file's URI, e.g.:
val location = fileSource.runWith(s3Sink).map(_.location) // a Future of the uploaded object's location
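Applied to the richUpload action from the question (same names; imports, implicit ExecutionContext and Materializer as in the original controller), a sketch of returning the response only after the upload has completed:
// Sketch, reusing the names from the question: flip source and sink so the
// Future completes only when the S3 multipart upload has finished.
val fileOption: Option[Future[MultipartUploadResult]] = request.body.file("file").map {
  case FilePart(key, filename, contentType, file) =>
    if (checkFunction(filename, contentType))
      FileIO.fromPath(file.toPath).runWith(s3Service.uploadSink(s3Filename, cannedAcl, bucket))
    else
      Future.failed(new Exception("Upload failed"))
}

fileOption match {
  // Ok(s3Filename) is now produced only after the upload to S3 has completed
  case Some(uploadResult) => uploadResult.map(_ => Ok(s3Filename))
  case None               => Future.successful(BadRequest("ERROR"))
}
Using Future.failed instead of throwing also keeps the failure inside the Future, so it can be recovered into an error response if desired.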
I have the configuration file myConfig.conf in which the path to the prediction model is defined as a parameter pathToModel. I am reading this file once in order to get pathToModel.
However, now I want to slightly update the logic - I want to be able to re-load the prediction model any time pathToModel is updated in the configuration file.
This is my current code:
val pathToModelPar = ConfigFactory.load("myConfig.conf")
val pathToModel = pathToModelPar.getString("pathToModel")
val model = GradientBoostedTreesModel.load(sc, pathToModel)

myDStream.foreachRDD(myRDD => {
  myRDD.foreachPartition({ partitionOfRecords =>
    //...
    val predictions = model.predict(...)
    //...
  })
})
I was thinking that myConfig.conf could be checked every hour, and then an IF-THEN rule applied: if the new pathToModel differs from the current one, the model is re-loaded.
However, I do not know how to implement this re-loading in the context of streaming data. As far as I understand, if I put the IF-THEN rule outside myDStream.foreachRDD(myRDD => {...}), it will be checked only once, at the start of the streaming process. If I put the IF-THEN rule inside myDStream.foreachRDD(myRDD => {...}), I get a task serialization error because load uses sc. That's the problem. Any idea?
I tried this approach, but it loads the model only once:
var pathToModelPar = ConfigFactory.load("myConfig.conf")
var pathToModel = pathToModelPar.getString("pathToModel")
var model = GradientBoostedTreesModel.load(sc, pathToModel)

val ex = new ScheduledThreadPoolExecutor(1)
val task = new Runnable {
  def run() = {
    pathToModelPar = ConfigFactory.load("myConfig.conf")
    pathToModel = pathToModelPar.getString("pathToModel")
    model = GradientBoostedTreesModel.load(sc, pathToModel)
  }
}
val f = ex.scheduleAtFixedRate(task, 1, 1, TimeUnit.HOURS)

myDStream.foreachRDD(myRDD => {
  myRDD.foreachPartition({ partitionOfRecords =>
    //...
    val predictions = model.predict(...)
    //...
  })
})
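One thing worth noting: the function passed to foreachRDD runs on the driver, so sc can be used there; only the code inside foreachPartition is shipped to the executors. A hedged sketch of exploiting that, under the assumption that the loaded model is serializable (MLlib models generally are) and that the config file is actually re-readable at runtime:
// Sketch only: re-check the config at the top of each batch, on the driver.
var currentPath = ""
var model: GradientBoostedTreesModel = null

myDStream.foreachRDD { myRDD =>
  // driver side: safe to use sc here
  ConfigFactory.invalidateCaches() // may be needed if Typesafe Config caches the file
  val newPath = ConfigFactory.load("myConfig.conf").getString("pathToModel")
  if (newPath != currentPath) {
    currentPath = newPath
    model = GradientBoostedTreesModel.load(sc, newPath)
  }
  val localModel = model // capture a stable local reference for the executor closure
  myRDD.foreachPartition { partitionOfRecords =>
    // executor side: only localModel is serialized into this closure, not sc
    partitionOfRecords.foreach { record =>
      // val prediction = localModel.predict(...)  // as in the original code
    }
  }
}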
Edit 2: Indirectly solved the problem by repartitioning the RDD into 8 partitions. Hit a roadblock with Avro objects not being "java serializable"; found a snippet here to delegate Avro serialization to Kryo. The original problem still remains.
Edit 1: Removed the local variable reference in the map function.
I'm writing a driver to run a compute-heavy job on Spark, using Parquet and Avro for IO/schema. I can't seem to get Spark to use all my cores. What am I doing wrong? Is it because I have set the keys to null?
I am just getting my head around how Hadoop organises files. AFAIK, since my file has a gigabyte of raw data, I should expect to see things parallelise with the default block and page sizes.
The function to ETL my input for processing looks as follows:
def genForum {
  class MyWriter extends AvroParquetWriter[Topic](new Path("posts.parq"), Topic.getClassSchema) {
    override def write(t: Topic) {
      synchronized {
        super.write(t)
      }
    }
  }

  def makeTopic(x: ForumTopic): Topic = {
    // Omitted to save space
  }

  val writer = new MyWriter
  val q =
    DBCrawler.db.withSession {
      Query(ForumTopics).filter(x => x.crawlState === TopicCrawlState.Done).list()
    }
  val sz = q.size
  val c = new AtomicInteger(0)
  q.par.foreach { x =>
    writer.write(makeTopic(x))
    val count = c.incrementAndGet()
    print(f"\r${count.toFloat * 100 / sz}%4.2f%%")
  }
  writer.close()
}
And my transformation looks as follows:
def sparkNLPTransformation() {
  val sc = new SparkContext("local[8]", "forumAddNlp")

  // io configuration
  val job = new Job()
  ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[Topic]])
  ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
  AvroParquetOutputFormat.setSchema(job, Topic.getClassSchema)

  // configure annotator
  val props = new Properties()
  props.put("annotators", "tokenize,ssplit,pos,lemma,parse")
  val an = DAnnotator(props)

  // annotator function (closes over the `an` defined above)
  def annotatePosts(top: Topic): Topic = {
    val new_p = top.getPosts.map { x =>
      val at = new Annotation(x.getPostText.toString)
      an.annotator.annotate(at)
      val t = at.get(classOf[SentencesAnnotation]).map(_.get(classOf[TreeAnnotation])).toList
      val r = SpecificData.get().deepCopy[Post](x.getSchema, x)
      if (t.nonEmpty) r.setTrees(t)
      r
    }
    val new_t = SpecificData.get().deepCopy[Topic](top.getSchema, top)
    new_t.setPosts(new_p)
    new_t
  }

  // transformation
  val ds = sc.newAPIHadoopFile("forum_dataset.parq", classOf[ParquetInputFormat[Topic]], classOf[Void], classOf[Topic], job.getConfiguration)
  val new_ds = ds.map(x => (null, annotatePosts(x._2)))
  new_ds.saveAsNewAPIHadoopFile("annotated_posts.parq",
    classOf[Void],
    classOf[Topic],
    classOf[ParquetOutputFormat[Topic]],
    job.getConfiguration
  )
}
Can you confirm that the data is indeed stored in multiple blocks in HDFS? You can check the total block count of the forum_dataset.parq file, for example with hdfs fsck -files -blocks.
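A hedged way to verify this from the driver, reusing the names from the question above (and mirroring the repartition(8) that "Edit 2" reports solved the problem indirectly):
// Sketch: check how many input splits/partitions Spark actually created for the file.
val ds = sc.newAPIHadoopFile("forum_dataset.parq",
  classOf[ParquetInputFormat[Topic]], classOf[Void], classOf[Topic], job.getConfiguration)

println(s"Input partitions: ${ds.partitions.length}")

// If there is only one partition (i.e. one HDFS block / input split), the map
// will run on a single core; repartitioning spreads the work at the cost of a shuffle.
val parallel = if (ds.partitions.length < 8) ds.repartition(8) else ds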