How to continuously read resources (configs) with Spark - Scala

The background is that we need to read a file that is used as a global config for data calculations, and the file changes every hour, so it has to be reloaded. Our confusion is about how to reload the config once the 'for-loop' reaches the end, and how to notify the main process that the file has changed; could the Spark engine handle this on its own? Sample code like this:
// init streamingContext
val alertConfigRDD: RDD[String] = sc.textFile("alert-config.json")
val alertConfigs = alertConfigRDD.collect()
for (config <- alertConfigs) {
  // spark streaming process: select window duration according to config details.
}
streamingContext.start()
streamingContext.awaitTermination()
Thanks in advance for any solutions.

If it's vital to have this resource updated for data processing, then load it from one place (be it HDFS, S3 or any other storage accessible from all executors) on each batch before the actual processing.
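A minimal sketch of that approach (the socket source and the 60-second batch interval are placeholders; only the "alert-config.json" path comes from the question). The point is that the config is re-read from shared storage at the start of every batch:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: replace the source and interval with whatever your job uses.
val ssc = new StreamingContext(sc, Seconds(60))
val events = ssc.socketTextStream("localhost", 9999)

events.foreachRDD { rdd =>
  // Re-read the (small) config file from shared storage on every batch,
  // so a file changed between batches is picked up without a restart.
  val alertConfigs = rdd.sparkContext.textFile("alert-config.json").collect()

  alertConfigs.foreach { config =>
    // ... apply the alert rule described by `config` to this batch's rdd ...
  }
}

ssc.start()
ssc.awaitTermination()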

Related

FileSink in Apache Flink not generating logs in output folder

I am using Apache Flink to read data from a kafka topic and store it in files on the server. I am using FileSink to store the files; it creates the directory structure by date and time, but no log files are being created.
When I run the program it creates the directory structure as below, but the log files are not stored there.
/flink/testlogs/2021-12-08--07
/flink/testlogs/2021-12-08--06
I want the logs to be written to a new log file every 15 minutes.
Below is the code.
DataStream<String> kafkaTopicData = env.addSource(new FlinkKafkaConsumer<String>("MyTopic", new SimpleStringSchema(), p));

OutputFileConfig config = OutputFileConfig
        .builder()
        .withPartPrefix("prefix")
        .withPartSuffix(".ext")
        .build();

DataStream<Tuple6<String, String, String, String, String, Integer>> newStream = kafkaTopicData.map(new LogParser());

final FileSink<Tuple6<String, String, String, String, String, Integer>> sink = FileSink
        .forRowFormat(new Path("/flink/testlogs"),
                new SimpleStringEncoder<Tuple6<String, String, String, String, String, Integer>>("UTF-8"))
        .withRollingPolicy(DefaultRollingPolicy.builder()
                .withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
                .withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
                .withMaxPartSize(1024 * 1024 * 1024)
                .build())
        .withOutputFileConfig(config)
        .build();

newStream.sinkTo(sink);
env.execute("DataReader");
LogParser returns Tuple6.
When used in streaming mode, Flink's FileSink requires that checkpointing be enabled. To do this, you need to specify where you want checkpoints to be stored, and at what interval you want them to occur.
To configure this in flink-conf.yaml, you would do something like this:
state.checkpoints.dir: s3://checkpoint-bucket
execution.checkpointing.interval: 10s
Or in your application code you can do this:
env.getCheckpointConfig().setCheckpointStorage("s3://checkpoint-bucket");
env.enableCheckpointing(10000L);
Another important detail from the docs:
Given that Flink sinks and UDFs in general do not differentiate between normal job termination (e.g. finite input stream) and termination due to failure, upon normal termination of a job, the last in-progress files will not be transitioned to the “finished” state.

Refresh cached values in spark streaming without reboot the batch

Maybe the question is too simple, or at least it looks that way, but I have the following problem:
A. I execute spark-submit for a spark streaming process.
ccc.foreachRDD(rdd => {
  rdd.repartition(20).foreachPartition(p => {
    val repo = getReposX
    p foreach (g => {
      .................
B. getReposX is a function that runs a query against mongoDB and recovers a Map with the key/values needed by every executor of the process.
C. Inside each g in the foreach I use this map as a "cache".
The problem, or the question, is that when anything changes in the mongo collection I don't detect the change, so I keep working with an outdated Map. My question is: how can I get the updated one? Yes, I know that if I restart the spark-submit the driver runs again and everything is OK, but otherwise I will never see the update in my Map.
Any ideas or suggestions?
Regards.
Finally I developed a solution. First let me explain the question in more detail, because what I really wanted to know is how to implement an object or "cache" that is refreshed every so often, or on some kind of trigger, without the need to restart the spark streaming process; in other words, one that refreshes while the job is alive.
In my case this "cache", or refreshed object, is a Singleton that connects to a mongoDB collection to recover a HashMap that is used by each executor and kept in memory, as a good Singleton should. The problem was that once the spark streaming submit was running, the object stayed cached in memory and was never refreshed unless the process was restarted. I thought of a broadcast variable with a counter to trigger a refresh, but broadcasts are read-only and cannot be modified; I also thought of an accumulator, but accumulators can only be read by the driver.
In the end my solution was to implement the following inside the initialization block of the object that loads the mongo collection and the cache:
// Initialization block
{
  val ex = new ScheduledThreadPoolExecutor(1)
  val task = new Runnable {
    def run() = {
      logger.info("Refresh - Initialization")
      initCache()
    }
  }
  val f = ex.scheduleAtFixedRate(task, 0, TIME_REFRES, TimeUnit.SECONDS)
}
initCache is nothing more than a function that connects to mongo and loads a collection:
var cache = mutable.HashMap[String, Description]()

def initCache(): mutable.HashMap[String, Description] = {
  val serverAddresses = Functions.getMongoServers(SERVER, PORT)
  val mongoConnectionFactory = new MongoCollectionFactory(serverAddresses, DATABASE, COLLECTION, USERNAME, PASSWORD)
  val collection = mongoConnectionFactory.getMongoCollection()
  val docs = collection.find.iterator()
  cache.clear()
  while (docs.hasNext) {
    val doc = docs.next
    cache.put(...............
  }
  cache
}
This way, once the spark streaming job has been submitted, the object schedules one extra task that refreshes the value of the singleton collection every X amount of time (1 or 2 hours in my case), and callers always recover the current instantiated value:
def getCache(): mutable.HashMap[String, Description] = {
  cache
}
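For context, a hypothetical sketch of how the executors might consume this (the object name RepoCache and the helper extractKey are made up; ccc and repartition(20) come from the question). Since the scheduled refresher lives in the singleton's initialization block, the first call to getCache() in each executor JVM also starts the refresher there:

// Hypothetical usage sketch: RepoCache is the singleton described above,
// extractKey is a made-up helper that derives the cache key from a record.
ccc.foreachRDD(rdd => {
  rdd.repartition(20).foreachPartition(p => {
    val repo = RepoCache.getCache() // always the latest refreshed HashMap
    p.foreach(g => {
      repo.get(extractKey(g)).foreach { description =>
        // ... process the record `g` using the cached `description` ...
      }
    })
  })
})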

Akka Sink never closes

I am uploading a single file to an SFTP server using Alpakka, but once the file is uploaded and I have gotten the success response the Sink stays open; how do I drain it?
I started off with this:
val sink = Sftp.toPath(path, settings, false)
val source = Source.single(ByteString(data))

source
  .toMat(sink)(Keep.right).run()
  .map(_.wasSuccessful)
But that ends up never leaving the map step.
I tried to add a killswitch, but that seems to have had no effect (neither with shutdown nor abort):
val sink = Sftp.toPath(path, settings, false)
val source = Source.single(ByteString(data))

val (killswitch, result) = source
  .viaMat(KillSwitches.single)(Keep.right)
  .toMat(sink)(Keep.both).run()

result.map { r =>
  killswitch.shutdown()
  r.wasSuccessful
}
Am I doing something fundamentally wrong? I just want one result.
EDIT: The settings passed in to toPath:
SftpSettings(InetAddress.getByName(host))
  .withCredentials(FtpCredentials.create(amexUsername, amexPassword))
  .withStrictHostKeyChecking(false)
By asking you to put Await.result(result, Duration.Inf) at the end I wanted to check the theory expressed by A. Gregoris. So it is either that:
- your app exits before the Future completes, or
- (if your app doesn't exit) the function in which you do this discards the result.
If your app doesn't exit you can try using result.onComplete to do the necessary work.
I cannot see your whole code, but it seems to me that the result value in the snippet you posted is a Future that never completes before your program ends, which is why the code in the map is not executed either.
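A rough sketch of that suggestion, reusing path, settings and data from the question and assuming Akka 2.6+ (where an implicit ActorSystem provides the materializer): handle the materialized Future in onComplete and keep the process alive until it finishes.

import akka.actor.ActorSystem
import akka.stream.alpakka.ftp.scaladsl.Sftp
import akka.stream.scaladsl.Source
import akka.util.ByteString

import scala.concurrent.Await
import scala.concurrent.duration._

implicit val system: ActorSystem = ActorSystem("sftp-upload")
import system.dispatcher

// `path`, `settings` and `data` are the values from the question
val result = Source.single(ByteString(data)).runWith(Sftp.toPath(path, settings, false))

result.onComplete { outcome =>
  println(s"Upload finished: $outcome") // Success(IOResult(...)) or Failure(...)
  system.terminate()                    // release resources once the single upload is done
}

Await.result(system.whenTerminated, 1.minute)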

Spark broadcast isn't being saved in the executors' memory

I used spark-shell on EMR - Spark version 2.2.0 / 2.1.0.
While trying to broadcast a simple object (my CSV file contains only 1 column and is less than 2 MB) I noticed it isn't kept in each executor's memory, only in the driver's memory, although it should be, as suggested in the documentation https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-TorrentBroadcast.html
Attached are screenshots taken before the broadcast (i.e. sc.broadcast(arr_collected)) and after it, which show my conclusion. Additionally I checked the worker machines' memory usage and, same as in the Spark UI, it doesn't change after broadcasting.
1- print screen before broadcast
2- print screen after broadcast
Attached is the log of the broadcasting process after adding 'log4j.logger.org.apache.spark.storage.BlockManager=TRACE' as suggested here -
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-blockmanager.html
3- print screen broadcast logging
Below is the code -
val input = "s3://bucketName/pathToFile.csv"
val df = spark.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter", ",").load(input)
val df_2 = df.withColumn("is_exist", lit("true").cast("Boolean"))
val arr_collected = df_2.collect()
val broadcast_map_fraud_locations4 = sc.broadcast(arr_collected)
Any ideas?
Can you try using the broadcast variable to join the data or do some other kind of operation? Broadcasts are lazy, so until the value is actually used on the executors it doesn't take up any memory there.
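A hedged sketch of that check, reusing broadcast_map_fraud_locations4 from the question (the test RDD is made up): the broadcast blocks are only fetched to an executor the first time .value is read there, so any action that dereferences the broadcast should make them show up in the executors' storage memory.

// Sketch only: a throwaway RDD that forces every executor to read the broadcast value.
val testRdd = sc.parallelize(1 to 1000000, 20)

val hits = testRdd.mapPartitions { iter =>
  val rows = broadcast_map_fraud_locations4.value // first access fetches and caches the blocks locally
  iter.filter(_ => rows.nonEmpty)
}

hits.count() // the action actually runs the tasks and materializes the broadcast on the executors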

Spark Local File Streaming - Fault tolerance

I am working on an application where every 30 sec (it can also be 5 sec) some files are dropped into a file system. I have to read them, parse them and push some records to REDIS.
In each file all records are independent and I am not doing any calculation that would require updateStateByKey.
My question is: if due to some issue (e.g. a REDIS connection issue, a data issue in a file, etc.) a file is not processed completely, I want to reprocess it (say n times) and also keep track of the files already processed.
For testing purposes I am reading from a local folder. Also, I am not sure how to conclude that a file has been fully processed and mark it as completed (i.e. write in a text file or db that this file was processed).
val lines = ssc.textFileStream("E:\\SampleData\\GG")
val words = lines.map(x => x.split("_"))

words.foreachRDD(rdd => {
  rdd.foreach(fields => {
    val jedis = jPool.getResource()
    try {
      i = i + 1
      jedis.set("x" + i + "__" + fields(0) + "__" + fields(1), fields(2))
    } finally {
      jedis.close()
    }
  })
})
Spark has a fault-tolerance guide for streaming; read more here:
https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html#fault-tolerance-semantics
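One part of that guide that maps directly to this setup is driver checkpointing. A sketch under assumptions (the checkpoint path, app name and batch interval are placeholders) of recreating the StreamingContext from a checkpoint, so that after a driver restart the job resumes from its recorded progress instead of starting over:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "E:\\SampleData\\checkpoint" // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("file-stream-to-redis")
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint(checkpointDir) // enables metadata checkpointing

  val lines = ssc.textFileStream("E:\\SampleData\\GG")
  // ... parse the lines and write the records to REDIS here ...
  ssc
}

// Recover from the checkpoint if it exists, otherwise build the context fresh.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()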