Spark: write to master's log from workers - scala

I got this general piece of code that I'm running:
df.rdd.foreachPartition( i => {
//some code
//writing to log
})
The problem is that the writing to log is performed on the workers themselves, not on the master - and therefore the log entries are scattered somewhere, and are very hard - or even impossible sometimes - to retrieve. Is there a way to write to the master's logs from the workers, or some other work around?

There's no immediate way to write into the master's log - distributed processing means your code runs on various machines and therefore any access to a machine's resources (e.g. file system) is going to be distributed too.
There are a few ways you can achieve what you want:
Treat logs as data: instead of using foreachPartition, you can use mapPartitions with a function that returns Iterator[String] with the log lines you want to write (or the data needed to construct them). Assuming the total number of log lines isn't huge, you can then collect them into driver machine and log them:
val logLines = df.rdd.mapPartitions( i => {
//some code
val log: Iterator[String] = //construct log lines
log
}).collect()
logLines.foreach(logger.info)
Use some log-aggregation framework: these frameworks collect logs from multiple machines and can display them as a single stream of log entries. This is very useful for distributed computing as it makes accessing a specific machine's log redundant.

Related

What is the proper use of localCheckpoint()

I am in need of using localCheckpoint() to resize shuffle partitions, but I am unsure how to use it properly. I searched through Github as well as Stackoverflow and found effectively everyone repeating this block of code (with no further context, situation, reasoning, etc) which is from this databricks presentation: https://www.youtube.com/watch?v=daXEp4HmS-E
df.localCheckpoint(...).repartition(n).write(...)
How do I use this properly? For instance, do I:
df = df.localCheckpoint()
or
df.localCheckpoint()
Under what conditions would I repartition? Is specifying the write necessary?
Let's say I have specified my number of shuffle partitions like so:
spark.conf.set("spark.sql.shuffle.partitions", 1200);
# bunch of work here
# local checkpoint here
But I have more work to do, but at a different shuffle scale, so then can I set my shuffle partitions again, like so?
spark.conf.set("spark.sql.shuffle.partitions", 400);
# more work here
# local checkpoint again
Finally, write to disk:
df.coalesce(n).write....
How should localCheckpoint() be used in work that we have to chain together at different shuffle scales?
Also, how can I use two localCheckpoint() operations? If I run one localCheckpoint() on df but then run another on df2; when I return to yet another localCheckpoint() on df, I get:
Checkpoint block rdd_322_17 not found! Either the executor that originally checkpointed this partition is no longer alive, or the original RDD is unpersisted

Spark configurations for Out of memory error [duplicate]

Cluster setup -
Driver has 28gb
Workers have 56gb each (8 workers)
Configuration -
spark.memory.offHeap.enabled true
spark.driver.memory 20g
spark.memory.offHeap.size 16gb
spark.executor.memory 40g
My job -
//myFunc just takes a string s and does some transformations on it, they are very small strings, but there's about 10million to process.
//Out of memory failure
data.map(s => myFunc(s)).saveAsTextFile(outFile)
//works fine
data.map(s => myFunc(s))
Also, I de-clustered / removed spark from my program and it completed just fine(successfully saved to a file) on a single server with 56gb of ram. This shows that it just a spark configuration issue. I reviewed https://spark.apache.org/docs/latest/configuration.html#memory-management and the configurations I currently have seem to be all that should be needed to be changed for my job to work. What else should I be changing?
Update -
Data -
val fis: FileInputStream = new FileInputStream(new File(inputFile))
val bis: BufferedInputStream = new BufferedInputStream(fis);
val input: CompressorInputStream = new CompressorStreamFactory().createCompressorInputStream(bis);
br = new BufferedReader(new InputStreamReader(input))
val stringArray = br.lines().toArray()
val data = sc.parallelize(stringArray)
Note - this does not cause any memory issues, even though it is incredibly inefficient. I can't read from it using spark because it throws some EOF errors.
myFunc, I can't really post the code for it because it's complex. But basically, the input string is a deliminated string, it does some deliminator replacement, date/time normalizing and things like that. The output string will be roughly the same size as an input string.
Also, it works fine for smaller data sizes, and the output is correct and roughly the same size as input data file, as it should be.
You current solution does not take advantage of spark. You are loading the entire file into an array in memory, then using sc.parallelize to distribute it into an RDD. This is hugely wasteful of memory (even without spark) and will of course cause out of memory problems for large files.
Instead, use sc.textFile(filePath) to create your RDD. Then spark is able to smartly read and process the file in chunks, so only a small portion of it needs to be in memory at a time. You are also able to take advantage of parallelism this way, as spark will be able to read and process the file in parallel, with however many executors and corse your have, instead of needing the read the entire file on a single thread on a single machine.
Assuming that myFunc can look at only a single line at a time, then this program should have a very small memory footprint.
Would help if you put more details of what going on in your program before and after the MAP.
Second command (only Map) does not do anything unless an action is triggered. Your file is probably not partitioned and driver is doing the work. Below should force data to workers evenly and protect OOM on a single node. It will cause shuffling of data though.
Updating solution after looking at your code, will be better if you do this
val data = sc.parallelize(stringArray).repartition(8)
data.map(s => myFunc(s)).saveAsTextFile(outFile)

Are two transformations on the same RDD executed in parallel in Apache Spark?

Lets say we have the following Scala program:
val inputRDD = sc.textFile("log.txt")
inputRDD.persist()
val errorsRDD = inputRDD.filter(lambda x: "error" in x)
val warningsRDD = inputRDD.filter(lambda x: "warning" in x)
println("Errors: " + errorsRDD.count() + ", Warnings: " + warningsRDD.count())
We create a simple RDD, persist it, perform two transformations on the RDD and finally have an action which uses the RDDs.
When the print is called, the transformations are executed, each transformation is of course parallel depending on the cluster management.
My main question is: Are the two actions and transformations executed in parallel or sequence? Or does errorsRDD.count() first execute and then warningsRDD.count(), in sequence?
I'm also wondering if there is any point in using persist in this example.
All standard RDD methods are blocking (with exception to AsyncRDDActions) so actions will be evaluated sequentially. It is possible to execute multiple actions concurrently using non-blocking submission (threads, Futures) with correct configuration of in-application scheduler or explicitly limited resources for each action.
Regarding cache it is impossible to answer without knowing the context. Depending on the cluster configuration, storage, and data locality it might be cheaper to load data from disk again, especially when resources are limited, and subsequent actions might trigger cache cleaner.
This will execute errorsRDD.count() first then warningsRDD.count().
The point of using persist here is when the first count is executed, inputRDD will be in memory.
The second count, spark won't need to re-read "whole" content of file from storage again, so execution time of this count would be much faster than the first.

Can we make the different transformation functions for the Spark Streaming running on different servers?

val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
For the above example, we know there are two transformation functions. Both of them must running at the same process\server, however I want to make the second transformation running on a different server from the first one to achieve scalability, is it possible?
To clear things up: a Spark transformation is not an actual execution. Transformations in Spark are lazy which means nothing gets executed until you call an action (e.g. save, collect). An action is a job in Spark.
So based on the above, you can control jobs but you cannot control transformations. A Spark's job will be distributed on multiple executors by splitting the processed data (RDD) among them. Each executor will apply the job (multiple transformations) on its split and then the results will be collected again. This will significantly reduce network usage.
If you can perform what your asking about, then the intermediate results (which you actually don't care about) should be transformed over the network which in turns will add a great network overhead.

Multiple Actors that writes to the same file + rotate

I have written a very simple webserver in Scala (based on Actors). The purpose
of it so to log events from our frontend server (such as if a user clicks a
button or a page is loaded). The file will need to be rotated every 64-100mb or so and
it will be send to s3 for later analysis with Hadoop. the amount of traffic will
be about 50-100 calls/s
Some questions that pops into my mind:
How do I make sure that all actors can write to one file in a thread safe way?
What is the best way to rotate the file after X amount of mb. Should I do this
in my code or from the filesystem (if I do it from the filesystem, how do I then verify
that the file isn't in the middle of a write or that the buffer is flushed)
One simple method would be to have a single file writer actor that serialized all writes to the disk. You could then have multiple request handler actors that fed it updates as they processed logging events from the frontend server. You'd get concurrency in request handling while still serializing writes to your log file. Having more than a single actor would open the possibility of concurrent writes, which would at best corrupt your log file. Basically, if you want something to be thread-safe in the actor model, it should be executed on a single actor. Unfortunately, your task is inherently serial at the point you write to disk. You could do something more involved like merge log files coming from multiple actors at rotation time but that seems like overkill. Unless you're generating that 64-100MB in a second or two, I'd be surprised if the extra threads doing I/O bought you anything.
Assuming a single writing actor, it's pretty trivial to calculate the amount that has been written since the last rotation and I don't think tracking in the actor's internal state versus polling the filesystem would make a difference one way or the other.
U can use Only One Actor to write every requests from different threads, since all of the requests go through this actor, there will be no concurrency problems.
As per file write rolling, if your write requests can be logged in line by line, then you can resort to log4j or logback's FileRollingAppender things. Otherwise, you can write your own which will be easy as long as remembering to lock the file before performing any delete or update operations.
The rolling usually means you rename the older files and current file to other names and then create a new file with current file name, at last, u can always write to the file with current file name.