Reading list of input textFiles where individual file-names contain commas - scala

I have a folder on HDFS, which for whatever reason, contains part-files that contain commas in their name. For instance
hdfs://namespace/mypath/1-1,123
hdfs://namespace/mypath/1-2,124
hdfs://namespace/mypath/1-3,125
The issue is, I want to read only some of the part-files at a time, to prevent overloading my cluster, meaning that I want to read the 1-1,123 and 1-2,124 files.
However, when the path is fed to Spark as:
sc.textFile("hdfs://namespace/mypath/1-1,123,hdfs://namespace/mypath/1-2,124")
Spark obviously seems to just tokenize on ",", thereby assuming I'm looking for 4 separate files.
Is there a way to escape the commas in the path?
Is the only option to rename the source files?

SparkContext.textFile eventually calls FileInputFormat.setInputPaths(Job job, String commaSeparatedPaths), which simply splits on , the input String representing the comma-separated paths:
Sets the given comma separated paths as the list of inputs for the map-reduce job.
One way to bypass this limitation is to use the alternative signature of setInputPaths: FileInputFormat.setInputPaths(Job job, Path... inputPaths), which takes a vararg of Path objects. This way, there is no need to split on , and thus no confusion is possible.
To do that, we have to create our own textFile method which does the exact same thing as SparkContext.textFile: it builds the HadoopRDD, but this time with the input provided as a Seq of Strings instead of a single comma-separated String:
package org.apache.spark

import org.apache.spark.rdd.{RDD, HadoopRDD}
import org.apache.spark.util.SerializableConfiguration
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.fs.Path

object TextFileOverwrite {

  implicit class SparkContextExtension(val sc: SparkContext) extends AnyVal {

    def textFile(
        paths: Seq[String],
        minPartitions: Int = sc.defaultMinPartitions
    ): RDD[String] = {

      val confBroadcast =
        sc.broadcast(new SerializableConfiguration(sc.hadoopConfiguration))

      val setInputPathsFunc =
        (jobConf: JobConf) =>
          FileInputFormat.setInputPaths(jobConf, paths.map(p => new Path(p)): _*)

      new HadoopRDD(
        sc,
        confBroadcast,
        Some(setInputPathsFunc),
        classOf[TextInputFormat],
        classOf[LongWritable],
        classOf[Text],
        minPartitions
      ).map(pair => pair._2.toString)
    }
  }
}
which can be used this way:
import org.apache.spark.TextFileOverwrite.SparkContextExtension
sc.textFile(Seq("path/hello,world.txt", "path/hello_world.txt"))
Compared to SparkContext.textFile, the only difference in the implementation is the call to FileInputFormat.setInputPaths, which takes Paths as input instead of a comma-separated String.
Note that I use the package org.apache.spark to store this function, because SerializableConfiguration has the visibility private[spark] in spark's code base.
Also note the use of an implicit class on SparkContext which allows us to implicitly attach this additional textFile method directly to the SparkContext object and thus to call it using sc.textFile() instead of having to pass the sparkContext as a parameter of the method.
Also note that I would have preferred to give Seq[Path] instead of Seq[String] as the input of this method, but Path is not yet Serializable in the version of hadoop-common used by Spark (it becomes Serializable starting with version 3 of hadoop-common).

Use filename globbing, assuming that this gives you unique files:
sc.textFile("hdfs://namespace/mypath/1-1?123,hdfs://namespace/mypath/1-2?124")
Doesn't work if you only want the first one of these and not the other two:
hdfs://namespace/mypath/1-1,123
hdfs://namespace/mypath/1-1:123
hdfs://namespace/mypath/1-1.123
I was going to suggest this:
sc.textFile("hdfs://namespace/mypath/1-1[,]123, ...
And I think that's supposed to work. Looking at the code for org.apache.hadoop.mapred.FileInputFormat#getPathStrings makes me suspicious, though: that function only treats commas inside curly braces specially, so a comma inside [,] will still be treated as a path separator and the glob will fail.

Related

Scala - Turning RDD[String] into a Map

I have a very large file that contains individual JSONs which I would like to iterate through, turning each one into a Map using the Jackson library:
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.ScalaObjectMapper
val mapper = new ObjectMapper() with ScalaObjectMapper
mapper.registerModule(DefaultScalaModule)
val lines = sc.textFile(fileName)
On a single JSON string, I can run the following without issues:
mapper.readValue[Map[String, Object]](JSONString)
to get my map.
However, if I iterate through an RDD[String] like so, I get the following error:
lines.foreach(line => mapper.readValue[Map[String, Object]](line))
org.apache.spark.SparkException: Task not serializable
I can do lines.take(10000) or so and then work on that, but the file is so huge that I can't "take" or "collect" it in one go, and I want to be able to use the same solution across files of all different sizes.
After the string becomes a Map, I need to perform functions on it and write to a string, so any solution that allows me to do that without going over my allocated memory will help. Thank you!
Managed to solve this with the below:
import scala.util.parsing.json._
val myMap = JSON.parseFull(jsonString).get.asInstanceOf[Map[String, Object]]
The above will work on an RDD[String]
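For instance, a minimal sketch of applying this per line of the RDD (assuming lines is the RDD[String] from the question; error handling is omitted, and since parseFull returns an Option, the .get assumes every line is valid JSON):
import scala.util.parsing.json._
// Parsing happens on the executors, so nothing needs to be collected to the driver
val maps = lines.map(line => JSON.parseFull(line).get.asInstanceOf[Map[String, Object]])
maps.take(2).foreach(println)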

How to convert Dataset to a Scala Iterable?

Is there a way to convert a org.apache.spark.sql.Dataset to a scala.collection.Iterable? It seems like this should be simple enough.
You can do myDataset.collect or myDataset.collectAsList.
But then it will no longer be distributed. If you want to be able to spread your computations out on multiple machines you need to use one of the distributed datastructures such as RDD, Dataframe or Dataset.
You can also use toLocalIterator if you just need to iterate the contents on the driver, as it has the advantage of only loading one partition at a time, instead of the entire dataset, into memory. An Iterator is not an Iterable (although it is a Traversable), but depending on what you are doing it may be what you want.
You could try something like this:
import org.apache.spark.sql.Dataset

def toLocalIterable[T](dataset: Dataset[T]): Iterable[T] = new Iterable[T] {
  def iterator = scala.collection.JavaConverters.asScalaIterator(dataset.toLocalIterator)
}
The conversion via JavaConverters.asScalaIterator is necessary because the toLocalIterator method of Dataset returns a java.util.Iterator instead of a scala.collection.Iterator (which is what the toLocalIterator on RDD returns.) I suspect this is a bug.
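A hypothetical usage of the helper above (ds stands in for whatever Dataset[String] you have; partitions are fetched to the driver one at a time):
val rows: Iterable[String] = toLocalIterable(ds)
rows.foreach(println)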
In Scala 2.11 you can do the following:
import scala.collection.JavaConverters._
dataset.toLocalIterator.asScala.toIterable

Scala io.source fromFile

I have these two lines (among all the others):
import scala.io.Source
val source = Source.fromFile(filename)
As I understand it, this is a way to read file content. I have read
http://www.scala-lang.org/api/2.12.x/scala/io/Source.html#iter:Iterator[Char]
I still do not get what Source.fromFile represents: one of the Type Members, or something else?
As stated in the Scala API, fromFile is a method defined on the Source companion object. It is a curried method: the first parameter list takes a single String representing the path of the file to be read, and the second parameter list takes a single implicit codec argument of type scala.io.Codec. The method returns a BufferedSource object.
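A small sketch of what that looks like in practice (the file name is just a placeholder):
import scala.io.{Codec, Source}
// Roughly: def fromFile(name: String)(implicit codec: Codec): BufferedSource
val source = Source.fromFile("data.txt")(Codec.UTF8)
try {
  source.getLines().foreach(println)
} finally {
  source.close()
}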

How to get disk usage or sizes of files in a directory in a HDFS filesystem using Scala

I am trying to get the size of the files in a HDFS directory in Scala. I can do the following in REPL:
Seq("/usr/bin/hdfs", "dfs", "-du", "-s", "/tmp/test").!
but I cannot store the result into a value. How can I get the size of the files in a directory in Scala?
The ! method you're using comes from ProcessBuilder.
(Seq[String] is being implicitly converted to a ProcessBuilder, thus granting you access to !).
/** Starts the process represented by this builder,
* blocks until it exits, and returns the exit code.
*/
abstract def !: Int
If you want the output, use a different method, like !!
/** Starts the process represented by this builder,
* blocks until it exits, and returns the output as a String.
*/
abstract def !!: String
I recommend checking out the other methods defined on ProcessBuilder. I'm sure at least one of them will suit your needs.
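For example, a minimal sketch of capturing the output and parsing out the size, assuming the hdfs CLI lives at that path and that the first whitespace-separated token of hdfs dfs -du -s is the size in bytes:
import sys.process._
val output: String = Seq("/usr/bin/hdfs", "dfs", "-du", "-s", "/tmp/test").!!
// The output looks roughly like: "<size> [<disk space consumed>] <path>"
val sizeInBytes: Long = output.trim.split("\\s+")(0).toLong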
I'd recommend the use of https://github.com/pathikrit/better-files
import better.files._
import java.io.{File => JFile}
val size = File("/usr/bin/hdfs").size
println(size)

how to make saveAsTextFile NOT split output into multiple files?

When using Scala in Spark, whenever I dump the results out using saveAsTextFile, it seems to split the output into multiple parts. I'm just passing a parameter (path) to it.
val year = sc.textFile("apat63_99.txt").map(_.split(",")(1)).flatMap(_.split(",")).map((_,1)).reduceByKey((_+_)).map(_.swap)
year.saveAsTextFile("year")
Does the number of outputs correspond to the number of reducers it uses?
Does this mean the output is compressed?
I know I can combine the output together using bash, but is there an option to store the output in a single text file, without splitting?? I looked at the API docs, but it doesn't say much about this.
The reason it saves it as multiple files is because the computation is distributed. If the output is small enough such that you think you can fit it on one machine, then you can end your program with
val arr = year.collect()
And then save the resulting array as a file. Another way would be to use a custom partitioner via partitionBy and make everything go to one partition, though that isn't advisable because you won't get any parallelization.
If you require the file to be saved with saveAsTextFile you can use coalesce(1, true).saveAsTextFile(). This basically means do the computation, then coalesce to 1 partition. You can also use repartition(1), which is just a wrapper for coalesce with the shuffle argument set to true. Looking through the source of RDD.scala is how I figured most of this out; you should take a look.
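A minimal sketch of that option (the output path below is a placeholder):
// Do the distributed computation first, then shuffle everything into a single partition for the write
year.coalesce(1, shuffle = true).saveAsTextFile("hdfs://namespace/out/year-single")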
For those working with a larger dataset:
rdd.collect() should not be used in this case, as it will collect all data as an Array in the driver, which is the easiest way to run out of memory.
rdd.coalesce(1).saveAsTextFile() should also not be used, as the parallelism of upstream stages is lost: they end up being performed on the single node the data will be written from.
rdd.coalesce(1, shuffle = true).saveAsTextFile() is the best simple option as it will keep the processing of upstream tasks parallel and then only perform the shuffle to one node (rdd.repartition(1).saveAsTextFile() is an exact synonym).
rdd.saveAsSingleTextFile() as provided below additionally allows one to store the rdd in a single file with a specific name while keeping the parallelism properties of rdd.coalesce(1, shuffle = true).saveAsTextFile().
Something that can be inconvenient with rdd.coalesce(1, shuffle = true).saveAsTextFile("path/to/file.txt") is that it actually produces a file whose path is path/to/file.txt/part-00000 and not path/to/file.txt.
The following solution rdd.saveAsSingleTextFile("path/to/file.txt") will actually produce a file whose path is path/to/file.txt:
package com.whatever.package

import org.apache.spark.rdd.RDD
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.hadoop.io.compress.CompressionCodec

object SparkHelper {

  // This is an implicit class so that saveAsSingleTextFile can be attached to
  // RDD[String] and be called like this: rdd.saveAsSingleTextFile(...)
  implicit class RDDExtensions(val rdd: RDD[String]) extends AnyVal {

    def saveAsSingleTextFile(path: String): Unit =
      saveAsSingleTextFileInternal(path, None)

    def saveAsSingleTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit =
      saveAsSingleTextFileInternal(path, Some(codec))

    private def saveAsSingleTextFileInternal(
        path: String, codec: Option[Class[_ <: CompressionCodec]]
    ): Unit = {

      // The interface with hdfs:
      val hdfs = FileSystem.get(rdd.sparkContext.hadoopConfiguration)

      // Classic saveAsTextFile in a temporary folder:
      hdfs.delete(new Path(s"$path.tmp"), true) // to make sure it's not there already
      codec match {
        case Some(codec) => rdd.saveAsTextFile(s"$path.tmp", codec)
        case None        => rdd.saveAsTextFile(s"$path.tmp")
      }

      // Merge the folder of resulting part-xxxxx into one file:
      hdfs.delete(new Path(path), true) // to make sure it's not there already
      FileUtil.copyMerge(
        hdfs, new Path(s"$path.tmp"),
        hdfs, new Path(path),
        true, rdd.sparkContext.hadoopConfiguration, null
      )
      // Working with Hadoop 3?: https://stackoverflow.com/a/50545815/9297144

      hdfs.delete(new Path(s"$path.tmp"), true)
    }
  }
}
which can be used this way:
import com.whatever.package.SparkHelper.RDDExtensions
rdd.saveAsSingleTextFile("path/to/file.txt")
// Or if the produced file is to be compressed:
import org.apache.hadoop.io.compress.GzipCodec
rdd.saveAsSingleTextFile("path/to/file.txt.gz", classOf[GzipCodec])
This snippet:
First stores the rdd with rdd.saveAsTextFile("path/to/file.txt") in a temporary folder path/to/file.txt.tmp, as if we didn't want to store the data in one file (which keeps the processing of upstream tasks parallel).
Only then, using the hadoop file system api, do we proceed with the merge (FileUtil.copyMerge()) of the different output files to create our final single output file, path/to/file.txt.
You could call coalesce(1) and then saveAsTextFile() - but it might be a bad idea if you have a lot of data. Separate files per split are generated just like in Hadoop in order to let separate mappers and reducers write to different files. Having a single output file is only a good idea if you have very little data, in which case you could do collect() as well, as #aaronman said.
As others have mentioned, you can collect or coalesce your data set to force Spark to produce a single file. But this also limits the number of Spark tasks that can work on your dataset in parallel. I prefer to let it create a hundred files in the output HDFS directory, then use hadoop fs -getmerge /hdfs/dir /local/file.txt to extract the results into a single file in the local filesystem. This makes the most sense when your output is a relatively small report, of course.
In Spark 1.6.1 the format is as shown below. It creates a single output file. It is best practice to use it if the output is small enough to handle. Basically, it returns a new RDD that is reduced into numPartitions partitions. If you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1).
pair_result.coalesce(1).saveAsTextFile("/app/data/")
You can call repartition() and proceed this way:
val year = sc.textFile("apat63_99.txt").map(_.split(",")(1)).flatMap(_.split(",")).map((_,1)).reduceByKey((_+_)).map(_.swap)
var repartitioned = year.repartition(1)
repartitioned.saveAsTextFile("C:/Users/TheBhaskarDas/Desktop/wc_spark00")
You will be able to do it in the next version of Spark; in the current version (1.0.0) it's not possible unless you do it manually somehow, for example, as you mentioned, with a bash script call.
I also want to mention that the documentation clearly states that users should be careful when calling coalesce with a really small number of partitions, as this can cause upstream partitions to inherit this number of partitions.
I would not recommend using coalesce(1) unless really required.
Here's my answer to output a single file. I just added coalesce(1)
val year = sc.textFile("apat63_99.txt")
.map(_.split(",")(1))
.flatMap(_.split(","))
.map((_,1))
.reduceByKey((_+_)).map(_.swap)
year.saveAsTextFile("year")
Code:
year.coalesce(1).saveAsTextFile("year")