How to get disk usage or sizes of files in a directory in an HDFS filesystem using Scala

I am trying to get the size of the files in a HDFS directory in Scala. I can do the following in REPL:
Seq("/usr/bin/hdfs", "dfs", "-du", "-s", "/tmp/test").!
but I cannot store the result into a value. How can I get the size of the files in a directory in Scala?

The ! method you're using comes from ProcessBuilder.
(Seq[String] is being implicitly converted to a ProcessBuilder, thus granting you access to !).
/** Starts the process represented by this builder,
* blocks until it exits, and returns the exit code.
*/
abstract def !: Int
If you want the output, use a different method, like !!
/** Starts the process represented by this builder,
* blocks until it exits, and returns the output as a String.
*/
abstract def !!: String
I recommend checking out the other methods defined on ProcessBuilder. I'm sure at least one of them will suit your needs.
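As a minimal sketch of the !! approach (assuming a Unix-like system; echo stands in for the real hdfs dfs -du -s /tmp/test call so the snippet runs without a cluster, and parseDuSize is a hypothetical helper name), the captured output can be stored in a value and its first field parsed as a byte count:

```scala
import scala.sys.process._
import scala.util.Try

// Hypothetical helper: treats the first whitespace-separated field of the
// command output as the size in bytes (the layout `hdfs dfs -du -s` is assumed to use).
def parseDuSize(output: String): Option[Long] =
  output.trim.split("\\s+").headOption.flatMap(field => Try(field.toLong).toOption)

// `!!` blocks until the process exits and returns its standard output as a String.
// Stand-in command; replace with Seq("/usr/bin/hdfs", "dfs", "-du", "-s", "/tmp/test").
val output: String = Seq("echo", "1024  /tmp/test").!!

val sizeInBytes: Option[Long] = parseDuSize(output)
println(sizeInBytes)
```

Note that !! throws an exception when the command exits with a non-zero code, so wrapping the call in Try may be worthwhile.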

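I'd recommend the use of https://github.com/pathikrit/better-files. Note that it works on the local filesystem, so this only helps if the directory is reachable as a local path:
import better.files._
// Size in bytes; for a directory this is the total size of its contents
val size = File("/tmp/test").size
println(size)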

Related

Reading list of input textFiles where individual file-names contain commas

I have a folder on HDFS, which for whatever reason, contains part-files that contain commas in their name. For instance
hdfs://namespace/mypath/1-1,123
hdfs://namespace/mypath/1-2,124
hdfs://namespace/mypath/1-3,125
The issue is, I want to only read some of the part files at a time, to prevent over-loading my cluster, meaning that I want to read 1-1,123 and 1-2,124 files.
However, when path is fed to spark as:
sc.textFile("hdfs://namespace/mypath/1-1,123,hdfs://namespace/mypath/1-2,124")
Spark obviously seems to just tokenize on ",", thereby assuming I'm looking for 4 separate files.
Is there a way to escape the commas in the path?
Is the only option to rename the source files?
At some point, SparkContext.textFile calls FileInputFormat.setInputPaths(JobConf conf, String commaSeparatedPaths), which simply splits the input String on commas, as its documentation says:
Sets the given comma separated paths as the list of inputs for the map-reduce job.
One way to bypass this limitation is to use the alternative signature of setInputPaths: FileInputFormat.setInputPaths(JobConf conf, Path... inputPaths), which takes a vararg of Path objects. This way there is no need to split on commas, and thus no possible confusion.
To do that, we'll have to create our own textFile method which does the exact same thing as SparkContext.textFile, i.e. instantiating a HadoopRDD, but this time with the input provided as a Seq of Strings instead of a single String:
package org.apache.spark
import org.apache.spark.rdd.{RDD, HadoopRDD}
import org.apache.spark.util.SerializableConfiguration
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.fs.Path
object TextFileOverwrite {
  implicit class SparkContextExtension(val sc: SparkContext) extends AnyVal {
    def textFile(
      paths: Seq[String],
      minPartitions: Int = sc.defaultMinPartitions
    ): RDD[String] = {
      val confBroadcast =
        sc.broadcast(new SerializableConfiguration(sc.hadoopConfiguration))
      val setInputPathsFunc =
        (jobConf: JobConf) =>
          FileInputFormat.setInputPaths(jobConf, paths.map(p => new Path(p)): _*)
      new HadoopRDD(
        sc,
        confBroadcast,
        Some(setInputPathsFunc),
        classOf[TextInputFormat],
        classOf[LongWritable],
        classOf[Text],
        minPartitions
      ).map(pair => pair._2.toString)
    }
  }
}
which can be used this way:
import org.apache.spark.TextFileOverwrite.SparkContextExtension
sc.textFile(Seq("path/hello,world.txt", "path/hello_world.txt"))
Compared to SparkContext.textFile, the only difference in the implementation is the call to FileInputFormat.setInputPaths which takes Paths in input instead of a comma-separated String.
Note that I use the package org.apache.spark to store this function, because SerializableConfiguration has the visibility private[spark] in spark's code base.
Also note the use of an implicit class on SparkContext which allows us to implicitly attach this additional textFile method directly to the SparkContext object and thus to call it using sc.textFile() instead of having to pass the sparkContext as a parameter of the method.
Also note that I would have preferred giving Seq[Path] instead of Seq[String] as an input of this method, but Path is not yet Serializable in the current version of hadoop-common used by Spark (it will become Serializable starting version 3 of hadoop-common).
Use filename globbing, assuming that this gives you unique files:
sc.textFile("hdfs://namespace/mypath/1-1?123,hdfs://namespace/mypath/1-2?124")
Doesn't work if you only want the first one of these and not the other two:
hdfs://namespace/mypath/1-1,123,hdfs
hdfs://namespace/mypath/1-1:123,hdfs
hdfs://namespace/mypath/1-1.123,hdfs
I was going to suggest this:
sc.textFile("hdfs://namespace/mypath/1-1[,]123, ...
And I think that's supposed to work. Looking at the code for org.apache.hadoop.mapred.FileInputFormat#getPathStrings though makes me suspicious. It looks like that function specifically looks for commas inside curly braces, and will fail if you put a comma inside [,].
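To illustrate that suspicion, here is a rough sketch of the splitting behaviour described above: commas separate paths unless they sit inside curly braces. This is a simplified re-implementation for illustration only (the name splitPaths is made up), not Hadoop's actual code:

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical sketch: split on commas, except for commas nested inside { }.
// Square brackets get no special treatment, so a comma inside [,] still splits.
def splitPaths(commaSeparatedPaths: String): List[String] = {
  val paths = ListBuffer.empty[String]
  val current = new StringBuilder
  var curlyDepth = 0
  for (ch <- commaSeparatedPaths) ch match {
    case '{' => curlyDepth += 1; current += ch
    case '}' => curlyDepth -= 1; current += ch
    case ',' if curlyDepth == 0 => paths += current.toString; current.clear()
    case _ => current += ch
  }
  paths += current.toString
  paths.toList
}

println(splitPaths("mypath/1-1,123,mypath/1-2,124"))
println(splitPaths("mypath/1-1[,]123"))
println(splitPaths("mypath/{a,b},mypath/c"))
```

Under this logic the [,] form is cut in two at the comma, which matches the failure described above.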

How to make saveAsTextFile NOT split output into multiple files?

When using Scala in Spark, whenever I dump the results out using saveAsTextFile, it seems to split the output into multiple parts. I'm just passing a parameter(path) to it.
val year = sc.textFile("apat63_99.txt").map(_.split(",")(1)).flatMap(_.split(",")).map((_,1)).reduceByKey((_+_)).map(_.swap)
year.saveAsTextFile("year")
Does the number of outputs correspond to the number of reducers it uses?
Does this mean the output is compressed?
I know I can combine the output together using bash, but is there an option to store the output in a single text file, without splitting? I looked at the API docs, but they don't say much about this.
The reason it saves it as multiple files is because the computation is distributed. If the output is small enough such that you think you can fit it on one machine, then you can end your program with
val arr = year.collect()
And then save the resulting array as a file. Another way would be to use a custom partitioner with partitionBy so that everything goes to one partition, though that isn't advisable because you won't get any parallelization.
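For the collect() route, a minimal sketch of writing the collected array to a local file on the driver could look like the following. The helper name saveLines and the sample data are assumptions for illustration; in the real program arr would come from year.collect():

```scala
import java.io.{File, PrintWriter}

// Hypothetical helper: writes one element per line to a local file on the driver.
def saveLines(lines: Seq[String], path: String): Unit = {
  val writer = new PrintWriter(path, "UTF-8")
  try lines.foreach(line => writer.println(line))
  finally writer.close()
}

// Stand-in for `year.collect()`, which would yield an Array[(Int, String)].
val arr: Array[(Int, String)] = Array((3, "1963"), (5, "1964"))

// A temporary file is used here so the sketch has no lasting side effects.
val target: File = File.createTempFile("year", ".txt")
saveLines(arr.map(_.toString).toSeq, target.getPath)
println(s"Wrote ${arr.length} lines to ${target.getPath}")
```

This only makes sense when the collected result comfortably fits in the driver's memory.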
If you require the file to be saved with saveAsTextFile, you can use coalesce(1,true).saveAsTextFile(). This basically means: do the computation, then coalesce to 1 partition. You can also use repartition(1), which is just a wrapper for coalesce with the shuffle argument set to true. Looking through the source of RDD.scala is how I figured most of this out; you should take a look.
For those working with a larger dataset:
rdd.collect() should not be used in this case as it will collect all data as an Array in the driver, which is the easiest way to get out of memory.
rdd.coalesce(1).saveAsTextFile() should also not be used, as the parallelism of upstream stages is lost: the upstream work is performed on the single node where the data will be written.
rdd.coalesce(1, shuffle = true).saveAsTextFile() is the best simple option as it will keep the processing of upstream tasks parallel and then only perform the shuffle to one node (rdd.repartition(1).saveAsTextFile() is an exact synonym).
rdd.saveAsSingleTextFile() as provided below additionally allows one to store the rdd in a single file with a specific name while keeping the parallelism properties of rdd.coalesce(1, shuffle = true).saveAsTextFile().
Something that can be inconvenient with rdd.coalesce(1, shuffle = true).saveAsTextFile("path/to/file.txt") is that it actually produces a file whose path is path/to/file.txt/part-00000 and not path/to/file.txt.
The following solution rdd.saveAsSingleTextFile("path/to/file.txt") will actually produce a file whose path is path/to/file.txt:
package com.whatever.package
import org.apache.spark.rdd.RDD
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.hadoop.io.compress.CompressionCodec
object SparkHelper {
  // This is an implicit class so that saveAsSingleTextFile can be attached to
  // an RDD[String] and be called like this: rdd.saveAsSingleTextFile
  implicit class RDDExtensions(val rdd: RDD[String]) extends AnyVal {
    def saveAsSingleTextFile(path: String): Unit =
      saveAsSingleTextFileInternal(path, None)
    def saveAsSingleTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit =
      saveAsSingleTextFileInternal(path, Some(codec))
    private def saveAsSingleTextFileInternal(
      path: String, codec: Option[Class[_ <: CompressionCodec]]
    ): Unit = {
      // The interface with hdfs:
      val hdfs = FileSystem.get(rdd.sparkContext.hadoopConfiguration)
      // Classic saveAsTextFile in a temporary folder:
      hdfs.delete(new Path(s"$path.tmp"), true) // to make sure it's not there already
      codec match {
        case Some(codec) => rdd.saveAsTextFile(s"$path.tmp", codec)
        case None => rdd.saveAsTextFile(s"$path.tmp")
      }
      // Merge the folder of resulting part-xxxxx into one file:
      hdfs.delete(new Path(path), true) // to make sure it's not there already
      FileUtil.copyMerge(
        hdfs, new Path(s"$path.tmp"),
        hdfs, new Path(path),
        true, rdd.sparkContext.hadoopConfiguration, null
      )
      // Working with Hadoop 3?: https://stackoverflow.com/a/50545815/9297144
      hdfs.delete(new Path(s"$path.tmp"), true)
    }
  }
}
which can be used this way:
import com.whatever.package.SparkHelper.RDDExtensions
rdd.saveAsSingleTextFile("path/to/file.txt")
// Or if the produced file is to be compressed:
import org.apache.hadoop.io.compress.GzipCodec
rdd.saveAsSingleTextFile("path/to/file.txt.gz", classOf[GzipCodec])
This snippet:
First stores the rdd with rdd.saveAsTextFile("path/to/file.txt.tmp") in a temporary folder path/to/file.txt.tmp, as if we didn't want to store the data in one file (which keeps the processing of upstream tasks parallel).
Only then, using the Hadoop file system API, merges (FileUtil.copyMerge()) the different output files to create the final single output file path/to/file.txt.
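To make the merge step concrete, here is a local-filesystem analogue of what the merge does: it concatenates the part-xxxxx files of a folder, in name order, into one target file. It is only an illustrative sketch built on java.nio (the name mergeParts and the sample files are made up) and is not the Hadoop API itself:

```scala
import java.io.{File, FileOutputStream}
import java.nio.file.Files

// Hypothetical local analogue of the merge: append every part-* file, sorted by name.
def mergeParts(partsDir: String, target: String): Unit = {
  val parts = Option(new File(partsDir).listFiles())
    .getOrElse(Array.empty[File])
    .filter(f => f.isFile && f.getName.startsWith("part-"))
    .sortBy(_.getName)
  val out = new FileOutputStream(target)
  try parts.foreach(part => Files.copy(part.toPath, out))
  finally out.close()
}

// Simulate the folder that saveAsTextFile would produce.
val tmpDir: File = Files.createTempDirectory("file.txt.tmp").toFile
Files.write(new File(tmpDir, "part-00001").toPath, "second\n".getBytes("UTF-8"))
Files.write(new File(tmpDir, "part-00000").toPath, "first\n".getBytes("UTF-8"))
Files.write(new File(tmpDir, "_SUCCESS").toPath, Array.empty[Byte])

val merged: File = File.createTempFile("file", ".txt")
mergeParts(tmpDir.getPath, merged.getPath)
println(new String(Files.readAllBytes(merged.toPath), "UTF-8"))
```

Marker files such as _SUCCESS are skipped here because only names starting with part- are merged.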
You could call coalesce(1) and then saveAsTextFile() - but it might be a bad idea if you have a lot of data. Separate files per split are generated just like in Hadoop in order to let separate mappers and reducers write to different files. Having a single output file is only a good idea if you have very little data, in which case you could do collect() as well, as @aaronman said.
As others have mentioned, you can collect or coalesce your data set to force Spark to produce a single file. But this also limits the number of Spark tasks that can work on your dataset in parallel. I prefer to let it create a hundred files in the output HDFS directory, then use hadoop fs -getmerge /hdfs/dir /local/file.txt to extract the results into a single file in the local filesystem. This makes the most sense when your output is a relatively small report, of course.
In Spark 1.6.1 the format is as shown below. It creates a single output file. It is best practice to use it if the output is small enough to handle. Basically, it returns a new RDD that is reduced into numPartitions partitions. If you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1).
pair_result.coalesce(1).saveAsTextFile("/app/data/")
You can call repartition() like this:
val year = sc.textFile("apat63_99.txt").map(_.split(",")(1)).flatMap(_.split(",")).map((_,1)).reduceByKey((_+_)).map(_.swap)
val repartitioned = year.repartition(1)
repartitioned.saveAsTextFile("C:/Users/TheBhaskarDas/Desktop/wc_spark00")
You will be able to do it in the next version of Spark. In the current version, 1.0.0, it's not possible unless you do it manually somehow, for example, as you mentioned, with a bash script call.
I also want to mention that the documentation clearly states that users should be careful when calling coalesce with a very small number of partitions: this can cause upstream partitions to inherit this number of partitions.
I would not recommend using coalesce(1) unless really required.
Here's my answer to output a single file. I just added coalesce(1)
val year = sc.textFile("apat63_99.txt")
  .map(_.split(",")(1))
  .flatMap(_.split(","))
  .map((_,1))
  .reduceByKey((_+_)).map(_.swap)
year.saveAsTextFile("year")
Code:
year.coalesce(1).saveAsTextFile("year")

Get filename of the current file in scala

Is there a way to get the file name of the current file (the one where the code is written) in Scala?
For example, my class is in a file like com/mysite/app/myclass.scala and I want to call a method that will return "myclass.scala" (or the full path...)
Thank you!
This can be achieved with Scala macros, an experimental language feature available from version 2.10.
Macros make it possible to interact with the building of the AST during the source code parsing phase, and to modify trees of AST before the actual compilation is performed.
Since the information about the context of the compilation is available to the macro through the Context object, during the parsing phase it is possible to retrieve the source file name and to return it as a String literal, to be inserted in place of the macro call in the AST.
What follows is a working example of returning the source file name. The example is divided in two files:
A source file where the macro is defined and implemented:
// Contents of: "Macros.scala"
import scala.reflect.macros.Context
import scala.language.experimental.macros
object Macros {
  def sourceFile: String = macro sourceFileImpl
  def sourceFileImpl(c: Context) = {
    import c.universe._
    c.Expr[String](Literal(Constant(c.enclosingUnit.source.path.toString)))
  }
}
Another source file where the macro is used:
// Contents of: "Main.scala"
object Main extends App {
  val fileName = Macros.sourceFile
  println(fileName)
}
The macro implementation and the code using it must be in different source files. The file name returned is the correct one, i.e. the name of the source file with macro call.

Why is the main function not running in the REPL?

This is a simple program. I expected main to run in interpreted mode. But the presence of another object caused it to do nothing. If the QSort were not present, the program would have executed.
Why is main not called when I run this in the REPL?
object MainObject{
  def main(args: Array[String])={
    val unsorted = List(8,3,1,0,4,6,4,6,5)
    print("hello" + unsorted toString)
    //val sorted = QSort(unsorted)
    //sorted foreach println
  }
}
//this must not be present
object QSort{
  def apply(array: List[Int]):List[Int]={
    array
  }
}
EDIT: Sorry for causing confusion, I am running the script as scala filename.scala.
What's happening
If the parameter to scala is an existing .scala file, it will be compiled in-memory and run. When there is a single top level object a main method will be searched and, if found, executed. If that's not the case the top level statements are wrapped in a synthetic main method which will get executed instead.
This is why removing the top-level QSort object allows your main method to run.
If you're going to expand this to a full program, I advise compiling it (ideally with a build tool like sbt) and running the compiled .class files:
scalac main.scala && scala MainObject
If you're writing a single file script, just drop the main method (and its object) and write the statements you want executed in the outer scope, like:
// qsort.scala
object QSort{
  def apply(array: List[Int]):List[Int]={
    array
  }
}
val unsorted = List(8,3,1,0,4,6,4,6,5)
print("hello" + unsorted toString)
val sorted = QSort(unsorted)
sorted foreach println
and run with: scala qsort.scala
A little context
The scala command is meant for executing both scala "scripts" (single file programs) and complex java-like programs (with a main object and a bunch of classes in the classpath).
From man scala:
The scala utility runs Scala code using a Java runtime environment.
The Scala code to run is specified in one of three ways:
1. With no arguments specified, a Scala shell starts and reads commands interactively.
2. With -howtorun:object specified, the fully qualified name of a top-level Scala object may be specified. The object should previously have been compiled using scalac(1).
3. With -howtorun:script specified, a file containing Scala code may be specified.
If not explicitly specified, the howtorun mode is guessed from the arguments passed to the script.
When given a fully qualified name of an object, scala will guess -howtorun:object and expect a compiled object with that name on the path.
Otherwise, if the parameter to scala is an existing .scala file, -howtorun:script is guessed and the entry point is selected as described above.
Any method of an object can be run in the REPL by calling it explicitly and giving it the arguments it requires, if any. For example:
scala> object MainObject{
| def main(args: Array[String])={
| val unsorted = List(9,3,1,0,7,5,9,3,11)
| print("sorted: " + unsorted.sorted)
| }
| def fun = println("fun here")
| }
defined module MainObject
scala> MainObject.main(Array(""))
sorted: List(0, 1, 3, 3, 5, 7, 9, 9, 11)
scala> MainObject.fun
fun here
In some cases this can be useful for quick testing and troubleshooting.

Doing something like Python's "import" in Scala

Is it possible to use Scala's import without specifying a main function in an object, and without using the package keyword in the source file with the code you wish to import?
Some explanation: In Python, I can define some functions in some file "Lib.py", write
from Lib import *
in some other file "Run.py" in the same directory, use the functions from Lib in Run, and then run Run with the command python Run.py. This workflow is ideal for small scripts that I might write in an hour.
In Scala, it appears that if I want to include functions from another file, I need to start wrapping things in superfluous objects. I would rather not do this.
Writing Python in Scala is unlikely to yield satisfactory results. Objects are not "superfluous" -- it's your program that is not written in an object oriented way.
First, methods must be inside objects. You can place them inside a package object, and they'll then be visible to anything else that is inside the package of the same name.
Second, if one considers solely objects and classes, then all package-less objects and classes whose class files are present in the classpath, or whose scala files are compiled together, will be visible to each other.
This is as minimal as I could get it:
[$]> cat foo.scala
object Foo {
    def foo(): Boolean = {
        return true
    }
}
// vim: set ts=4 sw=4 et:
[$]> cat bar.scala
object Bar extends App {
    import Foo._
    println(foo)
}
// vim: set ts=4 sw=4 et:
[$]> fsc foo.scala bar.scala
[$]> export CLASSPATH=.:$CLASSPATH # Or else it can't find Bar.
[$]> scala Bar
true
When you just write simple scripts, use Scala's REPL. There, you can define functions and call them without having any enclosing object or package, and without a main method.
Objects/classes don't have to be in packages, though it's highly recommended. That said, you can also treat singleton objects like packages, i.e., as namespaces for standalone functions, and import their contents as if they were packages.
If you define your application as an object that extends App, then you don't have to define a main method. Just write your code in the body of the object, and the App trait (which extends the special DelayedInit trait) will provide a main method that will execute your code.
If you just want to write a script, you can forgo the object altogether and just write code without any container, then pass your source file to the interpreter (REPL) in non-interactive mode.
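The namespace-style object and the App trait described above can be combined into something close to the Python workflow from the question. This is only a sketch: the names Lib and Run mirror the question's Lib.py and Run.py, while double and greet are made-up example functions. Both objects are shown together here, but they could just as well live in separate files compiled together:

```scala
// Plays the role of Lib.py: a singleton object used purely as a namespace.
object Lib {
  def double(x: Int): Int = x * 2
  def greet(name: String): String = s"Hello, $name"
}

// Plays the role of Run.py: extending App supplies the main method.
object Run extends App {
  import Lib._ // comparable to `from Lib import *`
  println(double(21))
  println(greet("Scala"))
}
```

The import can sit inside the object or at the top of the file; either way the functions are then callable without the Lib. prefix.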