Spark Local File Streaming - Fault tolerance - scala

I am working on an application where every 30 seconds (it can also be 5 seconds) some files are dropped into a file system. I have to read each file, parse it, and push some records to Redis.
In each file all records are independent, and I am not doing any calculation that would require updateStateByKey.
My question: if due to some issue (e.g. a Redis connection issue, a data issue in a file, etc.) a file is not processed completely, I want to reprocess the file (say n times) and also keep track of the files already processed.
For testing purposes I am reading from a local folder. I am also not sure how to conclude that a file has been fully processed and mark it as completed (i.e. write to a text file or DB that this file was processed).
val lines = ssc.textFileStream("E:\\SampleData\\GG")
val words = lines.map(x => x.split("_"))
words.foreachRDD(
  x => {
    x.foreach(
      x => {
        var jedis = jPool.getResource();
        try {
          i = i + 1
          jedis.set("x" + i + "__" + x(0) + "__" + x(1), x(2))
        } finally {
          jedis.close()
        }
      }
    )
  }
)

Spark's streaming programming guide has a section on fault-tolerance semantics. Read more:
https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html#fault-tolerance-semantics
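
For the "reprocess n times" part of the question, here is a minimal sketch (an assumption on my part, not from the linked guide) that builds on the code above: take one pooled connection per partition and retry a failed Redis write a few times before letting the task fail, so Spark can re-run it. The retry count and the simplified key format are made up; jPool is assumed to be reachable from the executors, as in the question's code.

words.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val jedis = jPool.getResource()                  // one connection per partition
    try {
      partition.foreach { fields =>
        var attempts = 0
        var done = false
        while (!done && attempts < 3) {              // retry a failed write up to 3 times
          try {
            jedis.set(fields(0) + "__" + fields(1), fields(2))
            done = true
          } catch {
            case e: Exception =>
              attempts += 1
              if (attempts == 3) throw e             // give up so Spark fails and retries the task
          }
        }
      }
    } finally {
      jedis.close()                                  // return the connection to the pool
    }
  }
}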

Related

Scala Amazon S3 streaming object

I am looking to use Scala to get faster performance in accessing and downloading an Amazon S3 file. The file comes in as an InputStream and it is large (over 30 million rows).
I have tried this in Python (pandas), but it is too slow. I am hoping to increase the speed with Scala.
So far I am doing the following, but it is too slow. Have I hit a bottleneck in the stream, such that I cannot read data from it any faster than with the code below?
val obj = amazonS3Client.getObject(bucket_name, file_name)
val reader = new BufferedReader(new InputStreamReader(obj.getObjectContent()))
var list_of_lines: List[String] = List()
var line = reader.readLine
while (line != null) {
  list_of_lines = list_of_lines ::: List(line)
  line = reader.readLine
}
I'm looking for a serious speed improvement compared to the above approach. Thanks.
I suspect that your performance bottleneck is appending to a (linked) List in that loop:
list_of_lines = list_of_lines ::: List(line)
Appending to an immutable List is linear in its current length, so with 30 million lines the total work is roughly 30,000,000²/2, i.e. a few hundred trillion element copies. Even if each copy takes only a nanosecond, the loop works out to days of running time.
Switching to prepending to the List and then reversing at the end should improve the speed of your loop by a factor of more than a million:
while (line != null) {
  list_of_lines = line :: list_of_lines
  line = reader.readLine
}
list_of_lines = list_of_lines.reverse
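To see the difference concretely, here is a quick micro-benchmark sketch (not part of the original answer; timings are machine-dependent and JVM warm-up is ignored) at a much smaller size:

// Append (quadratic) vs. prepend-then-reverse (linear) on 20,000 elements.
def timeMs[A](body: => A): Double = {
  val start = System.nanoTime(); body; (System.nanoTime() - start) / 1e6
}
val n = 20000
println(f"append : ${timeMs((1 to n).foldLeft(List.empty[Int])((acc, i) => acc ::: List(i)))}%.1f ms")
println(f"prepend: ${timeMs((1 to n).foldLeft(List.empty[Int])((acc, i) => i :: acc).reverse)}%.1f ms")
// Even at this size the append version takes seconds; at 30 million lines it would never finish.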
List is also notoriously memory-inefficient, so for this many elements it may also be worth doing something like this (which is also more idiomatic Scala):
import scala.io.{ Codec, Source }
val obj = amazonS3Client.getObject(bucket_name, file_name)
val source = Source.fromInputStream(obj.getObjectContent())(Codec.defaultCharsetCodec)
val lines = source.getLines().toVector
Vector is more memory-efficient than List, which should dramatically reduce GC thrashing.
The best way to achieve better performance is to use the TransferManager provided by the AWS SDK for Java. It's a high-level file transfer manager that will automatically parallelise downloads. I'd recommend using SDK v2, but the same can be done with SDK v1. Be aware, though, that SDK v1 comes with limitations: only multipart files can be downloaded in parallel.
You need the following dependency (assuming you are using sbt with Scala). Note, however, that there's no benefit to using Scala over Java with the TransferManager.
libraryDependencies += "software.amazon.awssdk" % "s3-transfer-manager" % "2.17.243-PREVIEW"
Example (Java):
S3TransferManager transferManager = S3TransferManager.create();
FileDownload download =
    transferManager.downloadFile(b -> b.destination(Paths.get("myFile.txt"))
        .getObjectRequest(req -> req.bucket("bucket").key("key")));
download.completionFuture().join();
I recommend reading more on the topic here: Introducing Amazon S3 Transfer Manager in the AWS SDK for Java 2.x

FileSink in Apache Flink not generating logs in output folder

I am using Apache Flink to read data from a Kafka topic and store it in files on the server. I am using FileSink to store the files; it creates the directory structure by date and hour, but no log files are being created.
When I run the program it creates the directory structure as below, but the log files are not stored there.
/flink/testlogs/2021-12-08--07
/flink/testlogs/2021-12-08--06
I want a new log file to be written every 15 minutes.
Below is the code.
DataStream<String> kafkaTopicData = env.addSource(new FlinkKafkaConsumer<String>("MyTopic", new SimpleStringSchema(), p));

OutputFileConfig config = OutputFileConfig
    .builder()
    .withPartPrefix("prefix")
    .withPartSuffix(".ext")
    .build();

DataStream<Tuple6<String, String, String, String, String, Integer>> newStream = kafkaTopicData.map(new LogParser());

final FileSink<Tuple6<String, String, String, String, String, Integer>> sink = FileSink
    .forRowFormat(new Path("/flink/testlogs"),
        new SimpleStringEncoder<Tuple6<String, String, String, String, String, Integer>>("UTF-8"))
    .withRollingPolicy(DefaultRollingPolicy.builder()
        .withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
        .withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
        .withMaxPartSize(1024 * 1024 * 1024)
        .build())
    .withOutputFileConfig(config)
    .build();

newStream.sinkTo(sink);
env.execute("DataReader");
LogParser returns Tuple6.
When used in streaming mode, Flink's FileSink requires that checkpointing be enabled. To do this, you need to specify where you want checkpoints to be stored, and at what interval you want them to occur.
To configure this in flink-conf.yaml, you would do something like this:
state.checkpoints.dir: s3://checkpoint-bucket
execution.checkpointing.interval: 10s
Or in your application code you can do this:
env.getCheckpointConfig().setCheckpointStorage("s3://checkpoint-bucket");
env.enableCheckpointing(10000L);
Another important detail from the docs:
Given that Flink sinks and UDFs in general do not differentiate between normal job termination (e.g. finite input stream) and termination due to failure, upon normal termination of a job, the last in-progress files will not be transitioned to the “finished” state.
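Tying this back to the question's code, a minimal sketch (my own, not from the Flink docs; it assumes Flink 1.13+ where setCheckpointStorage is available, and the interval and checkpoint path are placeholders) would be to enable checkpointing before the job is started:

import java.util.concurrent.TimeUnit

// Sketch only: without checkpointing the FileSink never moves its in-progress
// part files to the finished state.
env.enableCheckpointing(TimeUnit.MINUTES.toMillis(1))                        // placeholder interval
env.getCheckpointConfig().setCheckpointStorage("file:///flink/checkpoints")  // placeholder path

newStream.sinkTo(sink)
env.execute("DataReader")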

TimeoutException when consuming files from S3 with akka streams

I'm trying to consume a bunch of files from S3 in a streaming manner using akka streams:
S3.listBucket("<bucket>", Some("<common_prefix>"))
  .flatMapConcat { r => S3.download("<bucket>", r.key) }
  .mapConcat(_.toList)
  .flatMapConcat(_._1)
  .via(Compression.gunzip())
  .via(Framing.delimiter(ByteString("\n"), Int.MaxValue))
  .map(_.utf8String)
  .runForeach { x => println(x) }
Without increasing akka.http.host-connection-pool.response-entity-subscription-timeout I get

java.util.concurrent.TimeoutException: Response entity was not subscribed after 1 second. Make sure to read the response entity body or call discardBytes() on it.

for the second file, just after printing the last line of the first file, when trying to access the first line of the second file.
I understand the nature of that exception. I don't understand why the request for the second file is already in progress while the first file is still being processed. I guess there's some buffering involved.
Any ideas how to get rid of that exception without having to increase akka.http.host-connection-pool.response-entity-subscription-timeout?
Instead of merging the processing of downloaded files into one stream with flatMapConcat, you could try materializing a stream within the outer stream and fully processing each file there before emitting your output downstream. Then you shouldn't begin downloading (and fully processing) the next object until you're ready.
Generally you want to avoid having too many stream materializations to reduce overhead, but I suspect that would be negligible for an app performing network I/O like this.
Let me know if something like this works: (warning: untested)
S3.listBucket("<bucket>", Some("<common_prefix>"))
  .mapAsync(1) { result =>
    val contents = S3.download("<bucket>", result.key)  // was `r.key`, which isn't in scope here
      .mapConcat(_.toList)                              // unwrap the Option, as in the question's code
      .flatMapConcat(_._1)
      .via(Compression.gunzip())
      .via(Framing.delimiter(ByteString("\n"), Int.MaxValue))
      .map(_.utf8String)
      .runWith(Sink.seq)                                // materialize and fully drain this object's stream
    contents
  }
  .mapConcat(identity)
  .runForeach { x => println(x) }

How to continuously read resources (configs) with Spark

The background is that we need to read a file that is used as global config for data calculations, and the file changes every hour, so it needs to be reloaded. Our confusion is about how to reload the config once the for-loop has finished, and how to notify the main process that the file has changed, or whether the Spark engine could handle this on its own. Sample code looks like this:
// init streamingContext
val alertConfigRDD: RDD[String] = sc.textFile("alert-config.json")
val alertConfigs = alertConfigRDD.collect()
for (config <- alertConfigs) {
  // spark streaming process: select window duration according to config details.
}
streamingContext.start()
streamingContext.awaitTermination()
Thanks in advance for any solutions.
If it's vital to have this resource updated for data processing, then load it from one place (HDFS, S3, or any other storage accessible from all executors) on each batch before the actual processing.
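As a minimal sketch of that suggestion (the stream name and path are hypothetical): re-read the config at the start of every micro-batch inside foreachRDD, which runs on the driver, so each batch sees the latest version from shared storage.

eventStream.foreachRDD { rdd =>
  // Reload the config for this batch; "eventStream" and the HDFS path are placeholders.
  val alertConfigs = rdd.sparkContext.textFile("hdfs:///configs/alert-config.json").collect()
  // ... apply alertConfigs to this batch's records ...
}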

Mirth is reading too slow from disk

I am using Mirth 3.0.1. I am reading a file (using a File Reader) that has 34,000 records. Every record has 45 columns and is pipe (|) separated. Mirth is taking too much time to read the file from disk. Mirth is installed on the same server where the file is located. Earlier I was facing a Java heap space issue, which I resolved by setting -Xms1024m -Xmx4096m in the files mcserver.vmoptions and mcservice.vmoptions. Now I have to solve the reading performance issue. Please find attached the channel for the same.
The answer to this problem depends heavily on the solution itself. For example, if you are doing transformations while you benchmark, the problem may not be reading the files but rather the large amount of filtering and transformation done in Mirth. Since Mirth converts everything you configure into what is essentially one gigantic JavaScript that executes on the server, it might just as well be that this is causing the performance problem. Pre-processor scripts might also create a problem if you do something that causes Mirth to read the whole file.
It might also be that the 34,000 lines in the file contain huge quantities of information, simply making the file very big and expensive to process. If every record in the file is supposed to create a new message within Mirth, you might also want to check the batch settings for the reader.
In addition, the performance of read operations from disk is of course affected a lot by the infrastructure and hardware of the platform itself. You did mention that you are reading the files locally and that you had to increase the memory for Mirth; all of this could be a problem in itself. To make a benchmark you would want to compare against something else, for example a small program that just reads the file, to measure read performance outside of Mirth.
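For instance, a minimal line-count benchmark along those lines could look like this (a sketch in Scala, since the rest of this page uses it; any JVM language works, and the file path is a placeholder):

import scala.io.Source

// Rough benchmark sketch: time how long raw line-by-line reading takes outside Mirth.
object ReadBenchmark extends App {
  val start  = System.nanoTime()
  val source = Source.fromFile("/path/to/input_file.txt") // placeholder path
  val lines  = try source.getLines().size finally source.close()
  println(s"Read $lines lines in ${(System.nanoTime() - start) / 1e6} ms")
}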
Thanks for the suggestions.
I have used router.routeMessage('channelName','PartOfMsg') to route the records in batches of 5000 (from one channel to a second channel) out of the file containing 34,000 records. This has helped to read from the file faster while processing the records at the same time.
For the Mirth community, below is the code to route a message from one channel to another; this solution also covers the requirement of processing a bulk of records in batches.
In the Source Transformer:
debug = "ON";
XML.ignoreWhitespace = true;
logger.debug('Inside source transformer "SplitFileIntoFiles" of channel: SplitFile');
var
subSegmentCounter = 0,
xmlMessageProcessCounter = 0,
singleFileLimit = 5000,
isError = false,
xmlMessageProcess = new XML(<delimited><row><column1></column1><column2></column2></row></delimited>),
newSubSegment = <row><column1></column1><column2></column2></row>,
totalPatientRecords = msg.children().length();
logger.debug('Total number of records found in the patient input file are: ');
logger.debug(totalPatientRecords);
try{
for each (seg in msg.children())
{
xmlMessageProcess.appendChild(newSubSegment);
xmlMessageProcess['row'][xmlMessageProcessCounter] = msg['row'][subSegmentCounter];
if (xmlMessageProcessCounter == singleFileLimit -1)
{
logger.debug('Now sending the 5000 records to the next channel from channel DOR Batch File Process IHI');
router.routeMessage('DOR SendPatientsToMedicare',xmlMessageProcess);
logger.debug('After sending the 5000 records to the next channel from channel DOR Batch File Process IHI');
xmlMessageProcessCounter = 0;
delete xmlMessageProcess['row'];
}
subSegmentCounter++;
xmlMessageProcessCounter++;
}// End of FOR loop
}// End of try block
catch (exception)
{
logger.error('The exception has been raised in source transformer "SplitFileIntoFiles" of channel: SplitFile');
logger.error(exception);
globalChannelMap.put('isFailed',true);
globalChannelMap.put('errDesc',exception);
return true;
}
if (xmlMessageProcessCounter > 1)
{
try
{
logger.debug('Now sending the remaining records to the next channel from channel DOR Batch File Process IHI');
router.routeMessage('DOR SendPatientsToMedicare',xmlMessageProcess);
logger.debug('After sending the remaining records to the next channel from channel DOR Batch File Process IHI');
delete xmlMessageProcess['row'];
}
catch (exception)
{
logger.error('The exception has been raised in source transformer "SplitFileIntoFiles" of channel: SplitFile');
logger.error(exception);
globalChannelMap.put('isFailed',true);
globalChannelMap.put('errDesc',exception);
return true;
}
}
return true;
// End of JavaScript
Hope this helps.