Scalding has a great utility for running an integration test of a job flow.
With this approach the inputs and outputs are in-memory buffers:
val input = List("0" -> "This a a day")
val expectedOutput = List(("This", 1),("a", 2),("day", 1))
JobTest(classOf[WordCountJob].getName)
  .arg("input", "input-data")
  .arg("output", "output-data")
  .source(TextLine("input-data"), input)
  .sink(Tsv("output-data")) { buffer: mutable.Buffer[(String, Int)] =>
    buffer should equal(expectedOutput)
  }
  .run
  .finish
How can I write another test that reads the input from and writes the output to a real local file, like FileTap/LFS in Cascading, instead of the in-memory approach?
You might check out HadoopPlatformJobTest and the TypedParquetTupleTest.scala example, which uses a local mini-cluster.
That test writes to a local mini-cluster; while the output is not directly a plain local file, it is accessible by reading from the mini-cluster through the Hadoop FileSystem API.
Given your local-file scenario, you could copy your local files into the mini-HDFS before running the test.
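A minimal sketch of that copy step, assuming you have a Hadoop Configuration pointing at the mini-cluster's filesystem (the method and path names here are illustrative, not tied to any specific Scalding test API):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def copyLocalToMiniHdfs(conf: Configuration, localFile: String, hdfsDir: String): Unit = {
  // obtain the filesystem backing the mini-cluster from its configuration
  val fs = FileSystem.get(conf)
  // copy the local input file so the job under test can read it from "HDFS"
  fs.copyFromLocalFile(new Path(localFile), new Path(hdfsDir))
}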
What is the most efficient (or recommended) way to prepend a string or a file to another large file in Scala, preferably without using external libraries? The large file can be binary.
E.g.
if prepend string is:
header_information|123.45|xyz\n
and large file is:
abcdefghijklmnopqrstuvwxyz0123456789
abcdefghijklmnopqrstuvwxyz0123456789
abcdefghijklmnopqrstuvwxyz0123456789
...
I would expect to get:
header_information|123.45|xyz
abcdefghijklmnopqrstuvwxyz0123456789
abcdefghijklmnopqrstuvwxyz0123456789
abcdefghijklmnopqrstuvwxyz0123456789
...
I came up with the following solution:
Turn prepend string/file into InputStream
Turn large file into InputStream
"Combine" InputStreams together using java.io.SequenceInputStream
Use java.nio.file.Files.copy to write to target file
import java.io.{ByteArrayInputStream, FileInputStream, SequenceInputStream}
import java.nio.file.{Files, Paths, StandardCopyOption}

object FileAppender {
  def main(args: Array[String]): Unit = {
    val stringToPrepend = new ByteArrayInputStream("header_information|123.45|xyz\n".getBytes)
    val largeFile = new FileInputStream("big_file.dat")
    Files.copy(
      new SequenceInputStream(stringToPrepend, largeFile),
      Paths.get("output_file.dat"),
      StandardCopyOption.REPLACE_EXISTING
    )
  }
}
Tested on a ~30 GB file; it took ~40 seconds on a MacBook Pro (3.3 GHz/16 GB).
This approach can also be used, if necessary, to combine multiple partitioned files created by e.g. the Spark engine.
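For that multi-file case, here is a hedged sketch using the same SequenceInputStream technique over an enumeration of streams (Scala 2.13's scala.jdk.CollectionConverters; use scala.collection.JavaConverters on older versions). The part-file names and output path are illustrative assumptions:
import java.io.{FileInputStream, InputStream, SequenceInputStream}
import java.nio.file.{Files, Paths, StandardCopyOption}
import scala.jdk.CollectionConverters._

object PartFileCombiner {
  def main(args: Array[String]): Unit = {
    // illustrative part files produced by a partitioned writer
    val parts: Seq[InputStream] =
      Seq("part-00000", "part-00001", "part-00002").map(new FileInputStream(_))
    // chain the streams so they are read back-to-back as one logical stream
    val combined = new SequenceInputStream(parts.iterator.asJavaEnumeration)
    Files.copy(combined, Paths.get("combined_output.dat"), StandardCopyOption.REPLACE_EXISTING)
  }
}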
My application needs to read a configuration file either from the resources directory or from S3.
For local development I need to read it from the local resources directory, so when I build my project I don't put config.properties into the application jar. In that case it should read the configuration from S3. The way I can think of doing this in Scala is pretty much what I would do in Java:
val stream: InputStream = getClass.getResourceAsStream("/config.properties")
if (stream != null) {
  val lines = scala.io.Source.fromInputStream(stream).getLines
} else {
  /* read it from S3 */
}
But I think Scala offers a more functional programming syntax. Any advice?
There are probably better ways to go about what you're after, but here's a more-or-less straight translation of the posted code.
val lines: Iterator[String] =
  Option(getClass.getResourceAsStream("/config.properties"))
    .fold {
      /* read from S3 and return an Iterator[String] */
    }(io.Source.fromInputStream(_).getLines)
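An equivalent, arguably more readable sketch uses map/getOrElse; readFromS3 is a hypothetical helper you would implement with your S3 client of choice:
// hypothetical helper: fetch the object from S3 and expose its lines
def readFromS3(key: String): Iterator[String] = ???

val lines: Iterator[String] =
  Option(getClass.getResourceAsStream("/config.properties"))
    .map(scala.io.Source.fromInputStream(_).getLines)
    .getOrElse(readFromS3("config.properties"))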
I am currently trying to store the execution plan of a Spark DataFrame in HDFS (obtained via the dataframe.explain(true) command).
The issue I am finding is that when I use the explain(true) command I can see the output on the command line and in the logs, but if I create a file (let's say a .txt) with the content of the DataFrame's explain, the file appears empty.
I believe the issue relates to the configuration of Spark, but I am unable to find any information about this on the internet.
(For those who want to read more about the execution plan of DataFrames using the explain function, please refer to https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-dataset-operators.html#explain)
if I create a file (let's say a .txt) with the content of the DataFrame's explain
How exactly did you try to achieve this?
explain writes its result to the console using println and returns Unit, as can be seen in Dataset.scala:
def explain(extended: Boolean): Unit = {
  val explain = ExplainCommand(queryExecution.logical, extended = extended)
  sparkSession.sessionState.executePlan(explain).executedPlan.executeCollect().foreach {
    // scalastyle:off println
    r => println(r.getString(0))
    // scalastyle:on println
  }
}
So, unless you redirect the console output to write to your file (along with anything else printed to the console...), you won't be able to write explain's output to a file.
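One hedged workaround sketch: since println goes through Console, you can capture the output by temporarily redirecting Console.out, with df being your DataFrame (this assumes explain prints on the calling thread; the output path is illustrative). Another option worth checking is df.queryExecution.toString, which returns the plans as a String.
import java.io.{ByteArrayOutputStream, PrintStream}
import java.nio.file.{Files, Paths}

val buffer = new ByteArrayOutputStream()
// everything println'ed inside this block goes into `buffer` instead of the console
Console.withOut(new PrintStream(buffer, true, "UTF-8")) {
  df.explain(true)
}
// write the captured plan text to a local file (path is illustrative)
Files.write(Paths.get("/tmp/explain_plan.txt"), buffer.toByteArray)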
The best way I have found is to redirect the output to a file when you run the job. I have used the following command:
spark-shell --master yarn -i test.scala > getlogs.log
My Scala file contains the following simple commands:
val df = sqlContext.sql("SELECT COUNT(*) FROM testtable")
df.explain(true)
sys.exit(0)
I am receiving streaming data myDStream (DStream[String]) that I want to save in S3 (basically, for this question it doesn't matter where exactly I want to save the output, but I am mentioning it just in case).
The following code works well, but it saves folders with names like jsonFile-19-45-46.json, and inside those folders it saves the files _SUCCESS and part-00000.
Is it possible to save each RDD[String] (these are JSON strings) as a JSON file, not a folder? I thought that repartition(1) would do the trick, but it didn't.
myDStream.foreachRDD { rdd =>
  // datetimeString = ....
  rdd.repartition(1).saveAsTextFile("s3n://mybucket/keys/jsonFile-" + datetimeString + ".json")
}
AFAIK there is no option to save it as a single file, because Spark is a distributed processing framework and it's not good practice to write to a single file; rather, each partition writes its own file in the specified path.
We can only pass the output directory where we want the data saved. The output writer creates one or more files (depending on the number of partitions) inside that path with a part- file name prefix.
As an alternative to rdd.collect.mkString("\n"), you can use the Hadoop FileSystem API to clean up the output by moving the part-00000 file into its place. The code below works perfectly on the local filesystem and HDFS, but I'm unable to test it with S3:
val outputPath = "path/to/some/file.json"
rdd.saveAsTextFile(outputPath + "-tmp")
import org.apache.hadoop.fs.Path
val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.rename(new Path(outputPath + "-tmp/part-00000"), new Path(outputPath))
fs.delete(new Path(outputPath + "-tmp"), true)
For Java I implemented the following. Hope it helps:
FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
File dir = new File(System.getProperty("user.dir") + "/my.csv/");
File[] files = dir.listFiles((d, name) -> name.endsWith(".csv"));
fs.rename(new Path(files[0].toURI()), new Path(System.getProperty("user.dir") + "/csvDirectory/newData.csv"));
fs.delete(new Path(System.getProperty("user.dir") + "/my.csv/"), true);
I want to read a continuously updated file (as with tail -f) in Scala. I can't use other tools such as tail directly because I need to do some extra processing on the records.
So how can I keep track of the file's exact contents as it changes?
There's an implementation of "tail -f" in Apache Commons IO: http://commons.apache.org/proper/commons-io/javadocs/api-release/org/apache/commons/io/input/Tailer.html
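A hedged sketch of using that Tailer class (assumes commons-io is on the classpath; the file path and polling delay are illustrative):
import java.io.File
import org.apache.commons.io.input.{Tailer, TailerListenerAdapter}

val listener = new TailerListenerAdapter {
  override def handle(line: String): Unit = {
    // do the extra processing on each newly appended record here
    println(s"new line: $line")
  }
}
// poll the file for new lines every 500 ms
val tailer = Tailer.create(new File("mylogfile.txt"), listener, 500)
// ... later, stop tailing:
// tailer.stop()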
A Google search turned up another implementation of "tail -f": https://github.com/alaz/tailf/blob/master/src/main/scala/com/osinka/tailf/Tail.scala
You can always call tail -f from Scala and then do your extra processing there. Using the scala.sys.process API:
import scala.sys.process._
def someProcessing(line: String): Unit = {
  // do whatever you want with that line
  print("[just read this line] ")
  println(line)
}
// the file to read
val file = "mylogfile.txt"
// the process to start
val tail = Seq("tail", "-f", file)
// continuously read lines from tail -f
tail.lineStream.foreach(someProcessing)
// careful: this never returns (unless tail is externally killed)
Edit: an advantage of this approach is that there is no polling involved. In exchange, it blocks the calling thread in a possibly non-interruptible way.
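If that blocking matters, a hedged sketch is to run the tail loop on a separate thread, for example via a Future (the choice of ExecutionContext is an assumption):
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.sys.process._

// run the never-returning tail loop off the main thread
Future {
  Seq("tail", "-f", "mylogfile.txt").lineStream.foreach(someProcessing)
}
// the main thread is free to continue with other work here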