I'm new to scala/java and I'm trying to understand this code below which returns a list of files in a directory. Function source
It takes an argument called dir - is the type of dir a File or File object?
It returns an array of type file.
It calls the method listFiles on dir.
What does the last line do?
def getRecursiveListOfFiles(dir: File): Array[File] = {
val these = dir.listFiles
these ++ these.filter(_.isDirectory).flatMap(getRecursiveListOfFiles)
}
This code does a breadth-first search, using recursion.
A File can either be a file, or a directory.
The code dir.listFiles list all files in directory. Remember that this will be a list of files and directories!
We can then break down the last line into 3 things. These could easily be separate lines.
these.filter(_.isDirectory) will return a list of directories that need searching in. It filters out the files.
flatMap(getRecursiveListOfFiles) takes this list of directories, and calls getRecursiveListOfFiles for every single directory. It then adds flattens these results into one list.
++ adds together two arrays. We add these to the result of the recursive call.
flatMap is key here. Read up about it, and how it differs from the map function to fully understand what's going on.
In short:
these ++ these.filter(_.isDirectory).flatMap(getRecursiveListOfFiles)
is fundamentally:
val allSubDirectories:Array[Files] = these.filter(_.isDirectory)
allSubDirectories.flatMap(getRecursiveListOfFiles)
//i.e. for each sub-directory, again find all files in sub-directory
these ++ (files of all sub-directories)
//ultimately add files of sub-directory to the actual list
Another alternative way to understand would be:
def getAllFiles(dir: File): List[File] = {
val these = dir.listFiles.toList
these ::: these.filter(_.isDirectory).map(x => getAllFiles(x)).flatten
}
Fundamentally same control flow, that for each sub-directory, you get a list of all the files and then to the same list, you add files of sub-directory.
I'm going to try to walk you through how I would think of this as someone trying to read a piece of unknown code.
It should be pretty clear by context that File can either be a regular file or directory. It doesn't contain the file contents, but represents any entry in the file system, and you can open the file with other library commands. For the purposes of this function, File is just something that happens to contain more Files in the case that it's a directory, and these can be accessed via the method listFiles: List[Files]. Presumably, it also provides other info, so that the original caller of getRecursiveListOfFiles could do something with the resulting list.
Also by context, these is pretty clearly the entries in the current directory.
The last line is the most subtle. But to break it down, it augments these with the Files found in those entries in these which happen to be directories.
To explain this step, the signature of flatMap on a List[File] can be thought of as flatMap[B](f: File => List[B]): List[B], where B is a type variable. In this case, because the same function getRecursiveListOfFiles, which is of type File => List[File], is being passed recursively, B is just File, so we can think of this particular call as flatMap(f: File => List[File]): List[File].
Roughly speaking, flatMap applies a function f to each item in a container, where f is required to return the same type of container. The "flat" part is simply the fact that these individual containers get combined, instead of being nested, which is what map would do. This is what allows the function to recursively add all the files found in its subdirectories. Pretty slick.
What's the distinction you're drawing? The type of dir is File.
An array of Files, yes.
It's probably implicitly converting these to a scala List or Buffer, which is the confusing part (e.g. it may import scala.collection.JavaConversions._ - it's better to use JavaConverters which makes the conversions explicit calls to .asScala or asJava). You can find the definitions of ++, filter and flatMap in the scaladoc, which will hopefully be enough to understand what's happening.
(note: my 4. is coming out as a 3. because of markdown :()
Related
I have a folder on HDFS, which for whatever reason, contains part-files that contain commas in their name. For instance
hdfs://namespace/mypath/1-1,123
hdfs://namespace/mypath/1-2,124
hdfs://namespace/mypath/1-3,125
The issue is, I want to only read some of the part files at a time, to prevent over-loading my cluster, meaning that I want to read 1-1,123 and 1-2,124 files.
However, when path is fed to spark as:
sc.textFile("hdfs://namespace/mypath/1-1,123,hdfs://namespace/mypath/1-2,124")
Spark obviously seems to just tokenize on ",", thereby assuming I'm looking for 4 separate files.
Is there a way to escape the commas in the path?
Is the only option to rename the source files?
SparkContext.textFile calls at some point FileInputFormat.setInputPaths(Job job, String commaSeparatedPaths) which apparently simply splits on , the input String representing the comma-separated paths:
Sets the given comma separated paths as the list of inputs for the map-reduce job.
One way to bypass this limitation consists in using the alternative signature of setInputPaths: FileInputFormat.setInputPaths(Job job, Path... inputPaths) which takes a vararg of Path objects. This way, no need to split on , and thus no confusion possible.
To do that, we'll have to create our own textFile method which does the exact same thing as SparkContext.textFile: calling the HadoopRDD object but this time using an input provided as a List of Strings instead of a String:
package org.apache.spark
import org.apache.spark.rdd.{RDD, HadoopRDD}
import org.apache.spark.util.SerializableConfiguration
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.fs.Path
object TextFileOverwrite {
implicit class SparkContextExtension(val sc: SparkContext) extends AnyVal {
def textFile(
paths: Seq[String],
minPartitions: Int = sc.defaultMinPartitions
): RDD[String] = {
val confBroadcast =
sc.broadcast(new SerializableConfiguration(sc.hadoopConfiguration))
val setInputPathsFunc =
(jobConf: JobConf) =>
FileInputFormat.setInputPaths(jobConf, paths.map(p => new Path(p)): _*)
new HadoopRDD(
sc,
confBroadcast,
Some(setInputPathsFunc),
classOf[TextInputFormat],
classOf[LongWritable],
classOf[Text],
minPartitions
).map(pair => pair._2.toString)
}
}
}
which can be used this way:
import org.apache.spark.TextFileOverwrite.SparkContextExtension
sc.textFile(Seq("path/hello,world.txt", "path/hello_world.txt"))
Compared to SparkContext.textFile, the only difference in the implementation is the call to FileInputFormat.setInputPaths which takes Paths in input instead of a comma-separated String.
Note that I use the package org.apache.spark to store this function, because SerializableConfiguration has the visibility private[spark] in spark's code base.
Also note the use of an implicit class on SparkContext which allows us to implicitly attach this additional textFile method directly to the SparkContext object and thus to call it using sc.textFile() instead of having to pass the sparkContext as a parameter of the method.
Also note that I would have preferred giving Seq[Path] instead of Seq[String] as an input of this method, but Path is not yet Serializable in the current version of hadoop-common used by Spark (it will become Serializable starting version 3 of hadoop-common).
Use filename globbing, assuming that this gives you unique files:
sc.textFile("hdfs://namespace/mypath/1-1?123,hdfs://namespace/mypath/1-2?124")
Doesn't work if you only want the first one of these and not the other two:
hdfs://namespace/mypath/1-1,123,hdfs
hdfs://namespace/mypath/1-1:123,hdfs
hdfs://namespace/mypath/1-1.123,hdfs
I was going to suggest this:
sc.textFile("hdfs://namespace/mypath/1-1[,]123, ...
And I think that's supposed to work. Looking at the code for org.apache.hadoop.mapred.FileInputFormat#getPathStrings though makes me suspicious. It looks like that function specifically looks for commas inside curly braces, and will fail if you put a comma inside [,].
I'm trying to use scalaz-stream to process multiple files simultaneously, applying a single function to each line in the files, across all the files.
For concreteness, suppose I have a function that takes a list of strings
def f(lines: Seq[String]): Something = ???
And a couple of files, the first:
foo1
foo2
foo3
the second:
bar1
bar2
bar3
The result of the whole process should be:
List(
f(Seq("foo1", "bar1")),
f(Seq("foo2", "bar2")),
f(Seq("foo3", "bar3"))
)
(or more likely written directly into some other file)
The number of files is not known beforehand, and the number of lines may vary between the different files, but I'm okay with padding (at runtime) the ends of the shorter files with a default values, or cutting out the longer files.
So essentially, I need a way to combine a Seq[Process[Task, String]] (obtained via something like io.linesR) into a single Process[Task, Seq[String]].
What would be the simplest way to achieve that?
Or, more generally, how do I combine n instances of Process[F, I] into a single instance Process[F, Seq[I]]?
I'm sure there's some standard combinator for this purpose, but I wasn't able to figure it out...
Thanks.
This combinator doesn't exist yet, but you could add it. I think it will be something like:
def zipN[F[_], A](xs: Seq[Process[F,A]]): Process[F,Seq[A]] =
if (xs.isEmpty) Process.halt
else xs.map(_ map (Vector(_))).reduceLeft(_.zipWith(_)(_ ++ _))
You could also add zipAllN, which takes a value to pad the sequences with (and which uses zipAll, and alignN, which allows streams to 'drop out' of the output process when they are exhausted. (So the output sequence may get shorter.)
I would suggest you implement it as a 'balanced' reduce rather than a left or right reduce, since it will be more efficient that way.
Please do submit a pull request + tests if you end up implementing this for real!
I am working in a Scala embedded DSL and macros are becoming a main tool for achieving my purposes. I am getting an error while trying to reuse a subtree from the incoming macro expression into the resulting one. The situation is quite complex, but (I hope) I have simplified it for its understanding.
Suppose we have this code:
val y = transform {
val x = 3
x
}
println(y) // prints 3
where 'transform' is the involved macro. Although it could seem it does absolutely nothing, it is really transforming the shown block into this expression:
3 match { case x => x }
It is done with this macro implementation:
def transform(c: Context)(block: c.Expr[Int]): c.Expr[Int] = {
import c.universe._
import definitions._
block.tree match {
/* {
* val xNam = xVal
* xExp
* }
*/
case Block(List(ValDef(_, xNam, _, xVal)), xExp) =>
println("# " + showRaw(xExp)) // prints Ident(newTermName("x"))
c.Expr(
Match(
xVal,
List(CaseDef(
Bind(xNam, Ident(newTermName("_"))),
EmptyTree,
/* xExp */ Ident(newTermName("x")) ))))
case _ =>
c.error(c.enclosingPosition, "Can't transform block to function")
block // keep original expression
}
}
Notice that xNam corresponds with the variable name, xVal corresponds with its associated value and finally xExp corresponds with the expression containing the variable. Well, if I print the xExp raw tree I get Ident(newTermName("x")), and that is exactly what is set in the case RHS. Since the expression could be modified (for instance x+2 instead of x), this is not a valid solution for me. What I want to do is to reuse the xExp tree (see the xExp comment) while altering the 'x' meaning (it is a definition in the input expression but will be a case LHS variable in the output one), but it launches a long error summarized in:
symbol value x does not exist in org.habla.main.Main$delayedInit$body.apply); see the error output for details.
My current solution consists on the parsing of the xExp to sustitute all the Idents with new ones, but it is totally dependent on the compiler internals, and so, a temporal workaround. It is obvious that the xExp comes along with more information that the offered by showRaw. How can I clean that xExp for allowing 'x' to role the case variable? Can anyone explain the whole picture of this error?
PS: I have been trying unsuccessfully to use the substitute* method family from the TreeApi but I am missing the basics to understand its implications.
Disassembling input expressions and reassembling them in a different fashion is an important scenario in macrology (this is what we do internally in the reify macro). But unfortunately, it's not particularly easy at the moment.
The problem is that input arguments of the macro reach macro implementation being already typechecked. This is both a blessing and a curse.
Of particular interest for us is the fact that variable bindings in the trees corresponding to the arguments are already established. This means that all Ident and Select nodes have their sym fields filled in, pointing to the definitions these nodes refer to.
Here is an example of how symbols work. I'll copy/paste a printout from one of my talks (I don't give a link here, because most of the info in my talks is deprecated by now, but this particular printout has everlasting usefulness):
>cat Foo.scala
def foo[T: TypeTag](x: Any) = x.asInstanceOf[T]
foo[Long](42)
>scalac -Xprint:typer -uniqid Foo.scala
[[syntax trees at end of typer]]// Scala source: Foo.scala
def foo#8339
[T#8340 >: Nothing#4658 <: Any#4657]
(x#9529: Any#4657)
(implicit evidence$1#9530: TypeTag#7861[T#8341])
: T#8340 =
x#9529.asInstanceOf#6023[T#8341];
Test#14.this.foo#8339[Long#1641](42)(scala#29.reflect#2514.`package`#3414.mirror#3463.TypeTag#10351.Long#10361)
To recap, we write a small snippet and then compile it with scalac, asking the compiler to dump the trees after the typer phase, printing unique ids of the symbols assigned to trees (if any).
In the resulting printout we can see that identifiers have been linked to corresponding definitions. For example, on the one hand, the ValDef("x", ...), which represents the parameter of the method foo, defines a method symbol with id=9529. On the other hand, the Ident("x") in the body of the method got its sym field set to the same symbol, which establishes the binding.
Okay, we've seen how bindings work in scalac, and now is the perfect time to introduce a fundamental fact.
If a symbol has been assigned to an AST node,
then subsequent typechecks will never reassign it.
This is why reify is hygienic. You can take a result of reify and insert it into an arbitrary tree (that possibly defines variables with conflicting names) - the original bindings will remain intact. This works because reify preserves the original symbols, so subsequent typechecks won't rebind reified AST nodes.
Now we're all set to explain the error you're facing:
symbol value x does not exist in org.habla.main.Main$delayedInit$body.apply); see the error output for details.
The argument of the transform macro contains both a definition and a reference to a variable x. As we've just learned, this means that the corresponding ValDef and Ident will have their sym fields synchronized. So far, so good.
However unfortunately the macro corrupts the established binding. It recreates the ValDef, but doesn't clean up the sym field of the corresponding Ident. Subsequent typecheck assigns a fresh symbol to the newly created ValDef, but doesn't touch the original Ident that is copied to the result verbatim.
After the typecheck, the original Ident points to a symbol that no longer exists (this is exactly what the error message was saying :)), which leads to a crash during bytecode generation.
So how do we fix the error? Unfortunately there is no easy answer.
One option would be to utilize c.resetLocalAttrs, which recursively erases all symbols in a given AST node. Subsequent typecheck will then reestablish the bindings granted that the code you generated doesn't mess with them (if, for example, you wrap xExp in a block that itself defines a value named x, then you're in trouble).
Another option is to fiddle with symbols. For example, you could write your own resetLocalAttrs that only erases corrupted bindings and doesn't touch the valid ones. You could also try and assign symbols by yourself, but that's a short road to madness, though sometimes one is forced to walk it.
Not cool at all, I agree. We're aware of that and intend to try and fix this fundamental issue sometimes. However right now our hands are full with bugfixing before the final 2.10.0 release, so we won't be able to address the problem in the nearest future. upd. See https://groups.google.com/forum/#!topic/scala-internals/rIyJ4yHdPDU for some additional information.
Bottom line. Bad things happen, because bindings get messed up. Try resetLocalAttrs first, and if it doesn't work, prepare yourself for a chore.
Coming from a C/C++ background, I'm not very familiar with the functional style of programming so all my code tends to be very imperative, as in most cases I just can't see a better way of doing it.
I'm just wondering if there is a way of making this block of Scala code more "functional"?
var line:String = "";
var lines:String = "";
do {
line = reader.readLine();
lines += line;
} while (line != null)
How about this?
val lines = Iterator.continually(reader.readLine()).takeWhile(_ != null).mkString
Well, in Scala you can actually say:
val lines = scala.io.Source.fromFile("file.txt").mkString
But this is just a library sugar. See Read entire file in Scala? for other possiblities. What you are actually asking is how to apply functional paradigm to this problem. Here is a hint:
Source.fromFile("file.txt").getLines().foreach {println}
Do you get the idea behind this? foreach line in the file execute println function. BTW don't worry, getLines() returns an iterator, not the whole file. Now something more serious:
lines filter {_.startsWith("ab")} map {_.toUpperCase} foreach {println}
See the idea? Take lines (it can be an array, list, set, iterator, whatever that can be filtered and which contains an items having startsWith method) and filter taking only the items starting with "ab". Now take every item and map it by applying toUpperCase method. Finally foreach resulting item print it.
The last thought: you are not limited to a single type. For instance say you have a file containing integer number, one per line. If you want to read that file, parse the number and sum them up, simply say:
lines.map(_.toInt).sum
To add how the same can be achieved using the formerly new nio files which I vote to use because it has several advantages:
val path: Path = Paths.get("foo.txt")
val lines = Source.fromInputStream(Files.newInputStream(path)).getLines()
// Now we can iterate the file or do anything we want,
// e.g. using functional operations such as map. Or simply concatenate.
val result = lines.mkString
Don't forget to close the stream afterwards.
I find that Stream is a pretty nice approach: it create a re-traversible (if needed) sequence:
def loadLines(in: java.io.BufferedReader): Stream[String] = {
val line = in.readLine
if (line == null) Stream.Empty
else Stream.cons(line, loadLines(in))
}
Each Stream element has a value (a String, line, in this case), and calls a function (loadLines(in), in this example) which will yield the next element, lazily, on demand. This makes for a good memory usage profile, especially with large data sets -- lines aren't read until they're needed, and aren't retained unless something is actually still holding onto them. Yet you can also go back to a previous Stream element and traverse forward again, yielding the exact same result.
I'm a bit confused about interfaces vs. signatures in OCaml.
From what I've read, interfaces (the .mli files) are what govern what values can be used/called by the other programs. Signature files look like they're exactly the same, except that they name it, so that you can create different implementations of the interface.
For example, if I want to create a module that is similar to a set in Java:
I'd have something like this:
the set.mli file:
type 'a set
val is_empty : 'a set -> bool
val ....
etc.
The signature file (setType.ml)
module type Set = sig
type 'a set
val is_empty : 'a set -> bool
val ...
etc.
end
and then an implementation would be another .ml file, such as SpecialSet.ml, which includes a struct that defines all the values and what they do.
module SpecialSet : Set
struct
...
I'm a bit confused as to what exactly the "signature" does, and what purpose it serves. Isn't it acting like a sort of interface? Why is both the .mli and .ml needed? The only difference in lines I see is that it names the module.
Am I misunderstanding this, or is there something else going on here?
OCaml's module system is tied into separate compilation (the pairs of .ml and .mli files). So each .ml file implicitly defines a module, each .mli file defines a signature, and if there is a corresponding .ml file that signature is applied to that module.
It is useful to have an explicit syntax to manipulate modules and interfaces to one's liking inside a .ml or .mli file. This allows signature constraints, as in S with type t = M.t.
Not least is the possibility it gives to define functors, modules parameterized by one or several modules: module F (X : S) = struct ... end. All these would be impossible if the only way to define a module or signature was as a file.
I am not sure how that answers your question, but I think the answer to your question is probably "yes, it is as simple as you think, and the system of having .mli files and explicit signatures inside files is redundant on your example. Manipulating modules and signatures inside a file allows more complicated tricks in addition to these simple things".
This question is old but maybe this is useful to someone:
A file named a.ml appears as a module A in the program...
The interface of the module a.ml can be written in file named a.mli
slide link
This is from the OCaml MOOC from Université Paris Diderot.