FileSink in Apache Flink not generating logs in output folder - apache-kafka

I am using Apache Flink to read data from a Kafka topic and store it in files on the server. I am using FileSink to store the files; it creates the directory structure by date and hour, but no log files are created inside.
When I run the program, it creates the directory structure shown below, but no log files are stored there.
/flink/testlogs/2021-12-08--07
/flink/testlogs/2021-12-08--06
I want the logs to be written to a new log file every 15 minutes.
Below is the code.
DataStream<String> kafkaTopicData = env.addSource(
        new FlinkKafkaConsumer<String>("MyTopic", new SimpleStringSchema(), p));

OutputFileConfig config = OutputFileConfig
        .builder()
        .withPartPrefix("prefix")
        .withPartSuffix(".ext")
        .build();

DataStream<Tuple6<String, String, String, String, String, Integer>> newStream =
        kafkaTopicData.map(new LogParser());

final FileSink<Tuple6<String, String, String, String, String, Integer>> sink =
        FileSink.forRowFormat(
                new Path("/flink/testlogs"),
                new SimpleStringEncoder<Tuple6<String, String, String, String, String, Integer>>("UTF-8"))
            .withRollingPolicy(DefaultRollingPolicy.builder()
                    .withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
                    .withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
                    .withMaxPartSize(1024 * 1024 * 1024)
                    .build())
            .withOutputFileConfig(config)
            .build();

newStream.sinkTo(sink);
env.execute("DataReader");
LogParser returns Tuple6.

When used in streaming mode, Flink's FileSink requires that checkpointing be enabled. To do this, you need to specify where you want checkpoints to be stored, and at what interval you want them to occur.
To configure this in flink-conf.yaml, you would do something like this:
state.checkpoints.dir: s3://checkpoint-bucket
execution.checkpointing.interval: 10s
Or in your application code you can do this:
env.getCheckpointConfig().setCheckpointStorage("s3://checkpoint-bucket");
env.enableCheckpointing(10000L);
Another important detail from the docs:
Given that Flink sinks and UDFs in general do not differentiate between normal job termination (e.g. finite input stream) and termination due to failure, upon normal termination of a job, the last in-progress files will not be transitioned to the “finished” state.

Related

How can kafka-connect-file-pulse be configured for continuous reading of a text file?

I have FilePulse correctly configured, so that when I create a file inside the reading folder, it reads it and ingests it into the topic.
Now I need to read each of the files in that folder continuously, since they are continually being updated.
Do I have to change any property in the properties file?
My filePulseTxtFile.properties:
name=connect-file-pulse-txt
connector.class=io.streamthoughts.kafka.connect.filepulse.source.FilePulseSourceConnector
topic=lineas-fichero
tasks.max=1
# File types
fs.scan.filters=io.streamthoughts.kafka.connect.filepulse.scanner.local.filter.RegexFileListFilter
file.filter.regex.pattern=.*\\.log$
task.reader.class=io.streamthoughts.kafka.connect.filepulse.reader.RowFileInputReader
# File scanning
fs.cleanup.policy.class=io.streamthoughts.kafka.connect.filepulse.clean.LogCleanupPolicy
fs.scanner.class=io.streamthoughts.kafka.connect.filepulse.scanner.local.LocalFSDirectoryWalker
fs.scan.directory.path=/home/ec2-user/parser/scanDirKafka
fs.scan.interval.ms=10000
# Internal Reporting
internal.kafka.reporter.bootstrap.servers=localhost:9092
internal.kafka.reporter.id=connect-file-pulse-txt
internal.kafka.reporter.topic=connect-file-pulse-status
# Track file by name
offset.strategy=name
Thanks a lot!
Continuous reading is only supported by the RowFileInputReader, which you can configure with the read.max.wait.ms property: the maximum time to wait, in milliseconds, for more bytes after hitting the end of the file.
For example, if you set that property to 10000, the reader will wait 10 seconds for new lines to be added to the file before considering it completed.
Also note that as long as there are tasks processing files, new files added to the source directory will not be selected. However, you can configure allow.tasks.reconfiguration.after.timeout.ms to force all tasks to be restarted after a given period so that new files get scheduled.
Finally, take care to set the tasks.max property correctly so that all files can be processed in parallel (a task can only process one file at a time).
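For example, the settings described above could be added to the properties file like this (the values here are illustrative, not recommendations):
# wait up to 10 seconds for new lines after reaching end-of-file before completing a file
read.max.wait.ms=10000
# force tasks to restart after 5 minutes so files added in the meantime get scheduled
allow.tasks.reconfiguration.after.timeout.ms=300000
# number of files processed in parallel (a task reads one file at a time)
tasks.max=4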

How to continuously reload resources (configs) in Spark

The background is that we need to read a file that is used as a global config for data calculations, and the file changes every hour, so it needs to be reloaded. Our confusion is about how to reload the config once the for-loop reaches its end, and how to notify the main process that the file has changed, or whether the Spark engine can handle this on its own. Sample code:
// init streamingContext
val alertConfigRDD: RDD[String] = sc.textFile("alert-config.json")
val alertConfigs = alertConfigRDD.collect()
for (config <- alertConfigs) {
  // spark streaming process: select window duration according to config details.
}
streamingContext.start()
streamingContext.awaitTermination()
Thanks for solutions in advance.
If it's vital to have this resource updated for data processing, then load it from one place (be it HDFS, S3, or any other storage accessible from all executors) on each batch before the actual processing.
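For example, a minimal sketch of that per-batch reload in Spark Streaming (the stream, the config path, and the filter logic are illustrative placeholders, not the original job):
import org.apache.spark.streaming.dstream.DStream

// inputStream is whatever DStream[String] the job already consumes (illustrative name)
def withFreshConfig(inputStream: DStream[String]): DStream[String] = {
  inputStream.transform { rdd =>
    // transform() is evaluated on the driver for every batch, so the config file on
    // shared storage (HDFS, S3, ...) is re-read at the start of each batch interval
    val alertConfigs = rdd.sparkContext.textFile("hdfs:///configs/alert-config.json").collect()
    // use the freshly loaded config to drive this batch's processing (illustrative filter)
    rdd.filter(line => alertConfigs.exists(cfg => line.contains(cfg)))
  }
}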

Spark Local File Streaming - Fault tolerance

I am working on an application where some files are dropped into a file system every 30 seconds (it could also be every 5 seconds). I have to read each file, parse it, and push some records to Redis.
Within each file all records are independent, and I am not doing any calculation that would require updateStateByKey.
My question: if, due to some issue (e.g. a Redis connection problem, a data issue in a file, etc.), a file is not processed completely, I want to reprocess the file (say n times) and also keep track of the files already processed.
For testing purposes I am reading from a local folder. I am also not sure how to conclude that one file is fully processed and mark it as completed (i.e. record in a text file or DB that this file has been processed).
val lines = ssc.textFileStream("E:\\SampleData\\GG")
val words = lines.map(x => x.split("_"))
words.foreachRDD { rdd =>
  rdd.foreach { record =>
    val jedis = jPool.getResource()
    try {
      i = i + 1
      jedis.set("x" + i + "__" + record(0) + "__" + record(1), record(2))
    } finally {
      jedis.close()
    }
  }
}
Spark has a fault-tolerance guide. Read more:
https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html#fault-tolerance-semantics
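The mechanism in that guide most relevant here is checkpoint-based driver recovery: the StreamingContext is recreated from a checkpoint directory after a failure so that unfinished batches can be re-run. A minimal sketch under that assumption (the checkpoint path, app name, and batch interval below are illustrative):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///spark/checkpoints/file-to-redis"  // illustrative path on reliable storage

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("FileToRedis")
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint(checkpointDir)
  // build the same pipeline as above here: textFileStream -> parse -> write to Redis
  ssc
}

// a clean start builds a new context; after a crash it is rebuilt from the checkpoint
// so batches that were scheduled but not finished can be processed again
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()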

Consuming RabbitMQ messages with Spark streaming

I'm new to Scala and am trying to hack my way around sending serialized Java objects over a RabbitMQ queue to a Spark Streaming application.
I can successfully enqueue my objects which have been serialized with an ObjectOutputStream. To receive my objects on the Spark end I have downloaded a custom RabbitMQ InputDStream and Receiver implementation from here - https://github.com/Stratio/rabbitmq-receiver
However, to my understanding that codebase only supports String messages, not binary ones. So I started hacking on that code to make it support reading a binary message and storing it as a byte array, so that I can deserialize it on the Spark end. That attempt is here - https://github.com/llevar/rabbitmq-receiver
I then have the following code in my Spark driver program:
val conf = new SparkConf().setMaster("local[6]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val receiverStream: ReceiverInputDStream[scala.reflect.ClassTag[AnyRef]] =
  RabbitMQUtils.createStreamFromAQueue(ssc,
    "localhost",
    5672,
    "mappingQueue",
    StorageLevel.MEMORY_AND_DISK_SER_2)

val parsedStream = receiverStream.map { m =>
  SerializationUtils.deserialize(m.asInstanceOf[Array[Byte]]).asInstanceOf[SAMRecord]
}
parsedStream.print()
ssc.start()
Unfortunately, this does not seem to work. The data is consumed off the queue and I don't get any errors, but I don't get any of the output I expect either.
This is all I get.
2015-07-24 23:33:38 WARN BlockManager:71 - Block input-0-1437795218845 replicated to only 0 peer(s) instead of 1 peers
2015-07-24 23:33:38 WARN BlockManager:71 - Block input-0-1437795218846 replicated to only 0 peer(s) instead of 1 peers
2015-07-24 23:33:38 WARN BlockManager:71 - Block input-0-1437795218847 replicated to only 0 peer(s) instead of 1 peers
2015-07-24 23:33:38 WARN BlockManager:71 - Block input-0-1437795218848 replicated to only 0 peer(s) instead of 1 peers
I was able to successfully deserialize my objects before calling the store() method here - https://github.com/llevar/rabbitmq-receiver/blob/master/src/main/scala/com/stratio/receiver/RabbitMQInputDStream.scala#L106
just by invoking SerializationUtils on the data from the delivery.getBody call, but I don't seem to be able to get the same data from the DStream in my main program.
Any help is appreciated.

Mirth is reading too slow from disk

I am using Mirth version 3.0.1. I am reading a file (using a File Reader) that has 34,000 records. Every record has 45 pipe (|) separated columns. Mirth is taking too much time to read the file from disk. Mirth is installed on the same server where the file is located. Earlier, I was facing a Java heap space issue, which I resolved by setting -Xms1024m -Xmx4096m in the files mcserver.vmoptions and mcservice.vmoptions. Now I have to solve the reading performance issue. Please find the channel attached.
The answer to this problem is highly dependent on the solution itself. As an example, if you are doing transformations when you benchmark, the problem might not be with reading the files, but rather with doing massive amounts of filtering and transformation in Mirth. Since Mirth converts everything you configure into basically one gigantic JavaScript that executes on the server, it might just as well be that this is causing the performance problem. Pre-processor scripts might also create a problem if you do something that causes Mirth to read the whole file.
It might also be that the 34,000 lines in the file contain huge quantities of information, simply making the file very big and expensive to process. If every record in the file is supposed to create a new message within Mirth, you might also want to check the batch settings for the reader.
In addition to this, the performance of read operations from disk is of course affected a lot by the infrastructure and hardware of the platform itself. You did mention that you are reading the files locally and that you had to increase the memory for Mirth; all of this could be a problem in itself. To get a benchmark you would want to compare against something else, for example a small Java program that just reads the file, so you can measure read performance outside of Mirth.
Thanks for the suggestions.
I used router.routeMessage('channelName','PartOfMsg') to route the records in chunks of 5,000 (from one channel to a second channel) out of the file of 34,000 records. This made it possible to read from the file faster while processing the records at the same time.
For the Mirth community, below is the code to route the message from one channel to another; this solution also covers the requirement of processing a bulk of records in batches.
In Source Transformer,
debug = "ON";
XML.ignoreWhitespace = true;
logger.debug('Inside source transformer "SplitFileIntoFiles" of channel: SplitFile');
var
subSegmentCounter = 0,
xmlMessageProcessCounter = 0,
singleFileLimit = 5000,
isError = false,
xmlMessageProcess = new XML(<delimited><row><column1></column1><column2></column2></row></delimited>),
newSubSegment = <row><column1></column1><column2></column2></row>,
totalPatientRecords = msg.children().length();
logger.debug('Total number of records found in the patient input file are: ');
logger.debug(totalPatientRecords);
try{
for each (seg in msg.children())
{
xmlMessageProcess.appendChild(newSubSegment);
xmlMessageProcess['row'][xmlMessageProcessCounter] = msg['row'][subSegmentCounter];
if (xmlMessageProcessCounter == singleFileLimit -1)
{
logger.debug('Now sending the 5000 records to the next channel from channel DOR Batch File Process IHI');
router.routeMessage('DOR SendPatientsToMedicare',xmlMessageProcess);
logger.debug('After sending the 5000 records to the next channel from channel DOR Batch File Process IHI');
xmlMessageProcessCounter = 0;
delete xmlMessageProcess['row'];
}
subSegmentCounter++;
xmlMessageProcessCounter++;
}// End of FOR loop
}// End of try block
catch (exception)
{
logger.error('The exception has been raised in source transformer "SplitFileIntoFiles" of channel: SplitFile');
logger.error(exception);
globalChannelMap.put('isFailed',true);
globalChannelMap.put('errDesc',exception);
return true;
}
if (xmlMessageProcessCounter > 1)
{
try
{
logger.debug('Now sending the remaining records to the next channel from channel DOR Batch File Process IHI');
router.routeMessage('DOR SendPatientsToMedicare',xmlMessageProcess);
logger.debug('After sending the remaining records to the next channel from channel DOR Batch File Process IHI');
delete xmlMessageProcess['row'];
}
catch (exception)
{
logger.error('The exception has been raised in source transformer "SplitFileIntoFiles" of channel: SplitFile');
logger.error(exception);
globalChannelMap.put('isFailed',true);
globalChannelMap.put('errDesc',exception);
return true;
}
}
return true;
// End of JavaScript
Hope this helps.