How to read a continuously updated log file in Scala

I want to read a continuously updated file in Scala, the way tail -f does. I can't rely on an external tool such as tail alone because I need to do some extra processing on the records.
How can I keep track of the exact file contents each time the file changes?

There's an implementation of "tail -f" in Apache Commons IO (the Tailer class).
With a Google search, I found another implementation of "tail -f":

You can always call tail -f from Scala and then do your extra processing there. Using the scala.sys.process API:
import scala.sys.process._

def someProcessing(line: String): Unit = {
  // do whatever you want with that line
  print("[just read this line] ")
  println(line)
}

// the file to read
val file = "mylogfile.txt"

// the process to start
val tail = Seq("tail", "-f", file)

// continuously read lines from tail -f
// (lineStream is deprecated in Scala 2.13, where lazyLines replaces it)
tail.lineStream.foreach(someProcessing)
// careful: this never returns (unless tail is externally killed)
Edit: An advantage of this is that there is no polling involved. In exchange, it blocks the calling thread in a possibly non-interruptible way.
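For comparison, here is a minimal sketch of the polling alternative, which needs no external tail process. The names readNewLines and follow are hypothetical helpers, not part of any library, and the sketch assumes a UTF-8 file with \n line endings that is small enough between polls to read in one go.

```scala
import java.io.RandomAccessFile
import java.nio.charset.StandardCharsets

// Hypothetical helper: reads every complete line appended since `offset` and
// returns those lines together with the offset to resume from on the next poll.
def readNewLines(path: String, offset: Long): (Vector[String], Long) = {
  val raf = new RandomAccessFile(path, "r")
  try {
    // If the file shrank (for example after log rotation), start from the beginning.
    val start = if (raf.length() < offset) 0L else offset
    raf.seek(start)
    val bytes = new Array[Byte]((raf.length() - start).toInt)
    raf.readFully(bytes)
    // Consume only up to the last newline; a partial last line waits for the next poll.
    val lastNewline = bytes.lastIndexOf('\n'.toByte)
    if (lastNewline < 0) (Vector.empty, start)
    else {
      val text = new String(bytes, 0, lastNewline, StandardCharsets.UTF_8)
      (text.split("\n", -1).toVector, start + lastNewline + 1)
    }
  } finally {
    raf.close()
  }
}

// Hypothetical helper: polls forever and hands each new line to `process`.
def follow(path: String, pollMillis: Long)(process: String => Unit): Unit = {
  var offset = 0L
  while (true) {
    val (lines, next) = readNewLines(path, offset)
    lines.foreach(process)
    offset = next
    Thread.sleep(pollMillis)
  }
}
```

Something like follow("mylogfile.txt", 500)(someProcessing) would then play the role of the tail -f call, at the cost of a polling delay. Because the loop sleeps between polls, the thread running it can be interrupted.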


Creating temporary resource test files in Scala

I am currently writing tests for a function that takes file paths and loads a dataset from them. I am not able to change the function. To test it currently I am creating files for each run of the test function. I am worried that simply making files and then deleting them is a bad practice. Is there a better way to create temporary test files in Scala?
import java.io.{File, PrintWriter}
val testFile = new File("src/main/resources/temp.txt")
val pw = new PrintWriter(testFile)
val testLines = List("this is a text line", "this is the next text line")
testLines.foreach(pw.println)
pw.close()
// test logic here
testFile.delete()
I would generally prefer java.nio over java.io. You can create a temporary file like so:
import java.nio.file.Files
val tempFile = Files.createTempFile("resource-", ".txt")
You can delete it using Files.delete. To ensure that the file is deleted even in the case of an error, you should put the delete call into a finally block.
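A minimal sketch of that pattern, assuming the function under test only needs a path to read from. The helper name withTempFile and the "resource-" prefix are illustrative choices, not an existing API.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Hypothetical helper: creates a temp file holding `lines`, passes its path
// to `test`, and deletes the file afterwards even if `test` throws.
def withTempFile[A](lines: Seq[String])(test: Path => A): A = {
  val path = Files.createTempFile("resource-", ".txt")
  try {
    Files.write(path, lines.mkString("\n").getBytes(StandardCharsets.UTF_8))
    test(path)
  } finally {
    Files.deleteIfExists(path)
  }
}

// Usage: the block stands in for the real test logic.
val lineCount = withTempFile(List("this is a text line", "this is the next text line")) { path =>
  // call the function under test with path.toString here
  Files.readAllLines(path).size
}
```

The file lives in the system temp directory rather than under src/main/resources, so a failed run cannot leave stray files in the source tree.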

Read large file into StringBuilder

I've got text files sitting in HDFS, ranging in size from around 300 to 800 MB each. They are almost valid JSON files. I am trying to make them valid JSON files so I can save them as ORC files.
I am attempting to create a StringBuilder with the needed opening characters, then read the file in line by line, stripping off the newlines, append each line to the string builder, and then add the needed closing character.
import org.apache.hadoop.fs.{FileSystem,Path, PathFilter, RemoteIterator}
import scala.collection.mutable.StringBuilder
//create stringbuilder
var sb = new scala.collection.mutable.StringBuilder("{\"data\" : ")
//read in the file
val path = new Path("/path/to/crappy/file.json")
val stream =
//read the file line by line. This will strip off the newlines so we can append it to the string builder
def readLines = Stream.cons(stream.readLine, Stream.continually( stream.readLine))
readLines.takeWhile(_ != null).foreach(line => sb.append(line))
That works. But as soon as I try to append the closing }:
sb.append("}")
It crashes with an out-of-memory error:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(
at java.lang.AbstractStringBuilder.ensureCapacityInternal(
at java.lang.AbstractStringBuilder.append(
at java.lang.StringBuilder.append(
at scala.collection.mutable.StringBuilder.append(StringBuilder.scala
I've tried setting the initial size of the stringbuilder to be larger than the file I'm currently testing with, but that didn't help. I've also tried giving the driver more memory (spark-shell --driver-memory 3g), didn't help either.
Is there a better way to do this?
If that's all you need, you can just do it without Scala via hdfs command-line:
hadoop fs -cat /hdfs/path/prefix /hdfs/path/badjson /hdfs/path/suffix | hadoop fs -put - /hdfs/path/properjson
where the file prefix contains just {"data" : and the file suffix contains a single }.
1) Don't use Scala's Stream. It is just a broken abstraction. It's extremely difficult to use an infinite or huge stream without blowing up the heap. Stick with a plain old Iterator, or use the more principled approaches from fs2 / zio.
In your case the readLines object accumulates all entries even though it is expected to hold only one at a time.
2) The sb object leaks as well. It accumulates the entire file content in memory.
Consider writing the corrected content directly into some OutputStreamWriter.
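Both points can be sketched together: an Iterator instead of a Stream, and a Writer instead of a StringBuilder. The function name wrapAsJson is hypothetical, and the sketch uses plain java.io types. On HDFS the reader and writer would presumably wrap the streams returned by FileSystem.open and FileSystem.create.

```scala
import java.io.{BufferedReader, StringReader, StringWriter, Writer}

// Hypothetical helper: copies `in` to `out` line by line, dropping the newlines
// and wrapping the content in {"data" : ... }. Only one line is held at a time.
def wrapAsJson(in: BufferedReader, out: Writer): Unit = {
  out.write("{\"data\" : ")
  Iterator.continually(in.readLine()).takeWhile(_ != null).foreach(line => out.write(line))
  out.write("}")
  out.flush()
}

// Usage with in-memory stand-ins for the real input and output streams.
val sink = new StringWriter()
wrapAsJson(new BufferedReader(new StringReader("[1,\n2]\n")), sink)
```

Memory use then depends on the longest line rather than on the file size.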

Store execution plan of Spark's dataframe

I am currently trying to store the execution plan of a Spark dataframe in HDFS (using the dataframe.explain(true) command).
The issue is that when I call explain(true), I can see the output on the command line and in the logs, but if I create a file (let's say a .txt) with the content of the dataframe's explain, the file is empty.
I believe the issue relates to Spark's configuration, but I have been unable to find any information about this on the internet.
(for those who want to see more about the plan execution of the dataframes using the explain function please refer to
if I create a file (let's say a .txt) with the content of the dataframe's explain
How exactly did you try to achieve this?
explain writes its result to console, using println, and returns Unit, as can be seen in Dataset.scala:
def explain(extended: Boolean): Unit = {
  val explain = ExplainCommand(queryExecution.logical, extended = extended)
  sparkSession.sessionState.executePlan(explain).executedPlan.executeCollect().foreach {
    // scalastyle:off println
    r => println(r.getString(0))
    // scalastyle:on println
  }
}
So, unless you redirect the console output to write to your file (along with anything else printed to the console...), you won't be able to write explain's output to file.
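One way to redirect only that output from inside the program is Scala's Console.withOut, which swaps the stream that println writes to for the duration of a block. This is a sketch under the assumption that explain prints through Scala's println on the calling thread, as the snippet above suggests. captureOutput is a hypothetical helper name.

```scala
import java.io.{ByteArrayOutputStream, PrintStream}

// Hypothetical helper: runs `body` and returns everything it printed
// through Scala's println.
def captureOutput(body: => Unit): String = {
  val buffer = new ByteArrayOutputStream()
  val stream = new PrintStream(buffer, true, "UTF-8")
  try {
    Console.withOut(stream)(body)
  } finally {
    stream.close()
  }
  buffer.toString("UTF-8")
}

// Stand-in for a call such as captureOutput { df.explain(true) }.
val captured = captureOutput { println("== Physical Plan ==") }
```

The returned string can then be written to a local file or to HDFS like any other text. Output printed with System.out.println directly, or from other JVMs such as executors, would not be captured this way.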
The best way I have found is to redirect the output to a file when you run the job. I used the following command:
spark-shell --master yarn -i test.scala > getlogs.log
My Scala file has the following simple commands:
val df = sqlContext.sql("SELECT COUNT(*) FROM testtable")
df.explain(true)

Using the same object from different PyTest testfiles?

I'm working with pytest right now. My problem is that I need to use the same object, generated in one test file, in another test file. The two files are in different directories and are invoked separately from one another.
Here's the code:
# Starts the first testcases
returnValue = pytest.main(["-x", "--alluredir=%s" % test1_path, "--junitxml=%s" % test1_path+"\\JunitOut_test1.xml", test_file1])
# Starts the second testcases
pytest.main(["--alluredir=%s" % test2_path, "--junitxml=%s" % test2_path+"\\JunitOut_test2.xml", test_file2])
As you can see, the first one is critical, so I start it with -x to stop if there is an error. Also, --alluredir deletes the target directory before starting the new tests. That's why I decided to invoke pytest twice (maybe more often in the future).
Here are the test files:
$ test1_directory/
def object():
    # Generate reusable object from another file
def test_use_object(object):
    # use the object generated above
Note that the object is actually a class with parameters and functions.
$ test2_directory/
def test_use_object_from_file1():
    # reuse the object
I tried generating the object in a file and importing it into both test files. The problem was that the object was not exactly the same as in the or
My question now is whether it is possible to use exactly that one generated object, maybe with a global or something like that.
Thank you for your time!
By exactly the same you mean a similar object, right? The only way to do this is to marshal it in the first process and unmarshal it in the other process. One way to do it is by using json or pickle as marshaller, and pass the filename to use for the json/pickle file to be able to read the object back.
Here's some sample code, untested:
import os
import pickle
import pytest

def pytest_addoption(parser):
    parser.addoption("--marshalfile", help="file name to transfer files between processes")

@pytest.fixture
def object(request):
    filename = request.config.getoption("--marshalfile")
    if filename is None:
        raise pytest.UsageError('--marshalfile required')
    # dump object
    if not os.path.isfile(filename):
        obj = create_expensive_object()
        with open(filename, 'wb') as f:
            pickle.dump(obj, f)
    # load object, hopefully in the other process
    with open(filename, 'rb') as f:
        obj = pickle.load(f)
    return obj

How to run a Scalding test in local mode with a local input file

Scalding has a great utility for running an integration test of a job flow.
With it, the inputs and outputs are in-memory buffers:
val input = List("0" -> "This a a day")
val expectedOutput = List(("This", 1),("a", 2),("day", 1))
// WordCountJob is a hypothetical name; the job class is not given in the question
JobTest(new WordCountJob(_))
  .arg("input", "input-data")
  .arg("output", "output-data")
  .source(TextLine("input-data"), input)
  .sink(Tsv("output-data")) {
    buffer: mutable.Buffer[(String, Int)] => {
      buffer should equal(expectedOutput)
    }
  }
  .run
  .finish
How can I write a version that reads its input from, and writes its output to, real local files, like FileTap/LFS in Cascading, rather than using the in-memory approach?
You might check out HadoopPlatformJobTest and the TypedParquetTupleTest.scala example, which uses a local mini-cluster.
That unit test writes to a "MiniLocalCluster". The output is not directly a file, but it is accessible by reading the local mini-cluster through the Hadoop filesystem.
Given your local-file scenario, maybe you can read the files locally and copy them to the mini-HDFS.