I’ve been looking for a while for a way to get all filenames in a directory and its subdirectories in the Hadoop file system (HDFS).
I found that I can use these commands to get them:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
sc.wholeTextFiles(path).map(_._1)
Here is "wholeTextFiles" documentation:
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
Parameters:
path - Directory to the input data files, the path can be comma separated paths as the list of inputs.
minPartitions - A suggestion value of the minimal splitting number for input data.
Returns:
RDD representing tuples of file path and the corresponding file content
Note:
Small files are preferred, large file is also allowable, but may cause bad performance.
On some filesystems, .../path/* can be a more efficient way to read all files in a directory rather than .../path/ or .../path.
Partitioning is determined by data locality. This may result in too few partitions by default.
As you can see, "wholeTextFiles" returns a pair RDD with both the filenames and their content. So I tried mapping it and keeping only the file names, but I suspect it still reads the files.
The reason I suspect so: if I try to count (for example), I get the Spark equivalent of "out of memory" (executors are lost and the tasks cannot complete).
I would rather use Spark to achieve this goal the fastest way possible, however, if there are other ways with a reasonable performance I would be happy to give them a try.
EDIT:
To clarify: I want to do this using Spark. I know I can do it with HDFS commands and the like, but I would like to know how to do it with the tools that ship with Spark, and ideally get an explanation of how to make "wholeTextFiles" not read the text itself (similar to how transformations only run once an action is called, and some of the "commands" never really run).
Thank you very much!
This lists all files down to the deepest subdirectory without using wholeTextFiles.
It calls itself recursively for each subdirectory it finds.
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path, RemoteIterator}
import org.apache.spark.SparkContext
import scala.collection.mutable.ListBuffer

val lb = new ListBuffer[String] // variable to hold the final list of files
def getAllFiles(path: String, sc: SparkContext): ListBuffer[String] = {
  val conf = sc.hadoopConfiguration
  val fs = FileSystem.get(conf)
  val files: RemoteIterator[LocatedFileStatus] = fs.listLocatedStatus(new Path(path))
  while (files.hasNext) { // one entry per file or subdirectory directly under path
    val status = files.next()
    val filepath = status.getPath.toString
    // recurse into directories only; listing a plain file returns the file itself again
    if (status.isDirectory) getAllFiles(filepath, sc)
    else lb += filepath
  }
  lb
}
That's it. It was tested successfully, and you can use it as is.
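For comparison, Hadoop's FileSystem also has a built-in recursive listing, listFiles(path, recursive), which only asks the file system for metadata and does not open the files. A minimal sketch, assuming the hypothetical helper name listFilesRecursively and that conf would be sc.hadoopConfiguration inside a Spark job:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ListBuffer

// Hypothetical helper: returns every file (not directory) under root,
// descending into all subdirectories.
def listFilesRecursively(root: String, conf: Configuration): List[String] = {
  val rootPath = new Path(root)
  // Pick the file system that matches the path's scheme (hdfs://, file://, ...)
  val fs: FileSystem = rootPath.getFileSystem(conf)
  // Second argument true = recurse into subdirectories
  val it = fs.listFiles(rootPath, true)
  val found = ListBuffer.empty[String]
  while (it.hasNext) found += it.next().getPath.toString
  found.toList
}
```

If the list is needed as an RDD afterwards, something like sc.parallelize(listFilesRecursively(path, sc.hadoopConfiguration)) should work, since only the names are shipped to the cluster.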
I am trying to load a DataFrame from a list of paths in Spark. If a file exists in every one of the paths, the code works fine. If at least one path is empty, it throws an error.
This is my code:
val paths = List("path1", "path2")
val df = spark.read.json(paths: _*)
I looked at other options:
1. Build a single regex string which contains all the paths.
2. Build a list from the master list of paths by checking whether Spark can read each one or not:
for(path <- paths) {
if(Try(spark.read.json(path)).isSuccess) {
//add path to list
}
}
The first approach won't work in my case because I can't create a single regex out of the paths I have to read.
The second approach works, but I feel it is going to degrade performance because it has to read from every path twice.
Please suggest an approach to solve this issue.
Note:
All the paths are in hdfs
Each path is itself a regex string which will read from multiple files
As mentioned in the comments, you can use the HDFS FileSystem API to get a list of the paths that exist for your pattern (as long as it's a valid glob pattern, which is what globStatus expects).
import org.apache.hadoop.fs._
val path = Array("path_prefix/folder1[2-8]/*", "path_prefix/folder2[2-8]/*")
val fs: FileSystem = FileSystem.get(sc.hadoopConfiguration) // sc = SparkContext
val paths = path.flatMap(p => fs.globStatus(new Path(p)).map(_.getPath.toString))
This way, even if, say, path_prefix/folder13 is empty, its contents will not be listed in the variable paths, which will be an Array[String] containing all the available files matching the pattern.
Finally, you can do:
spark.read.json(paths : _*)
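One detail worth guarding against: globStatus returns null when a pattern contains no wildcard and the path does not exist. A minimal sketch of a null-safe variant, assuming the hypothetical helper name existingPaths and that conf would be sc.hadoopConfiguration:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper: expands each glob pattern to the paths that exist right now.
def existingPaths(patterns: Seq[String], conf: Configuration): Seq[String] =
  patterns.flatMap { p =>
    val path = new Path(p)
    val fs = path.getFileSystem(conf)
    // globStatus may return null for a missing non-glob path, so wrap it in Option
    Option(fs.globStatus(path))
      .map(_.toSeq)
      .getOrElse(Seq.empty)
      .map(_.getPath.toString)
  }
```

It is probably worth checking that the result is non-empty before passing it to spark.read.json, since a load with no input paths has nothing to infer a schema from.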
Adding or copying a zero-length dummy file into each directory in the path list is a pragmatic technical workaround that is functionally equivalent to what you want. I have run into the empty-directory problem before and worked around it this way, though it may not be possible in your case.
I'd like to print RDD data using Scala, as below:
res1.foreach{case(userid,tags)=>println(s"${userid}${"\t"}${tags.topicInterests.map(_.id).mkString(",")}")}
Now I want to save the details to a local file instead of using println. How can I implement that?
Use saveAsTextFile() method of the RDD as shown below:
val strRdd = res1.map{case(userid,tags)=>(s"${userid}${"\t"}${tags.topicInterests.map(_.id).mkString(",")}")}
strRdd.saveAsTextFile("/home/test_user/result")
Note that saveAsTextFile takes a path (absolute or relative) to a folder/directory, not to a file. The RDD data will be written as part files inside the given directory. In this case, a directory called result will be created with part files inside it.
There will be as many part files as there are partitions in strRdd. If the path /home/test_user/result already exists, your code will fail, so you have to use a directory that does not exist yet.
Bonus info: The same saveAsTextFile method also works on other file systems like HDFS, S3 etc by taking the URL to the target directories instead of just paths.
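If the job has to be re-run against the same output location, one option is to remove the old directory first. A minimal sketch, assuming the hypothetical helper name deleteIfExists (FileSystem.exists and FileSystem.delete are the real Hadoop calls):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper: deletes dir and everything under it if it exists.
// Returns true when something was actually removed.
def deleteIfExists(dir: String, conf: Configuration): Boolean = {
  val path = new Path(dir)
  val fs = path.getFileSystem(conf)
  if (fs.exists(path)) fs.delete(path, true) // true = recursive delete
  else false
}

// Hypothetical usage inside a Spark job:
// deleteIfExists("/home/test_user/result", sc.hadoopConfiguration)
// strRdd.saveAsTextFile("/home/test_user/result")
```

The delete is unconditional and recursive, so it is only appropriate when the previous output is really disposable.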
I have a directory with some subfolders which contain different parquet files. Something like this:
2017-09-05
    10-00
        part00000.parquet
        part00001.parquet
    11-00
        part00000.parquet
        part00001.parquet
    12-00
        part00000.parquet
        part00001.parquet
What I want is to pass the path to the directory 2017-09-05 and get a list of the names of all parquet files.
I was able to achieve it, but in a very inefficient way:
val allParquetFiles = sc.wholeTextFiles("C:/MyDocs/2017-09-05/*/*.parquet")
allParquetFiles.keys.foreach((k) => println("The path to the file is: "+k))
So each key is the name I am looking for, but this process requires me to load all files as well, which then I can't use, since I get them in binary (and I don't know how to convert them into a dataframe).
Once I have the keys (so the list of filePaths) I am planning to invoke:
val myParquetDF = sqlContext.read.parquet(filePath);
As you may have already guessed, I am quite new to Spark. So if there is a faster or easier approach to read a list of parquet files located in different folders, please let me know.
My Partial Solution: I wasn't able to get all paths for all filenames in a folder, but I was able to get the content of all files of that type into the same DataFrame, which was my ultimate goal. In case someone needs it in the future, I used the following line:
val df = sqlContext.read.parquet("C:/MyDocs/2017-09-05/*/*.parquet")
Thanks for your time
You can do it using the HDFS API like this:
import org.apache.hadoop.fs._
import org.apache.hadoop.conf._
val fs = FileSystem.get(new Configuration())
// globStatus expands the wildcards in the path; listStatus does not
val files = fs.globStatus(new Path("C:/MyDocs/2017-09-05/*/*.parquet")).map(_.getPath.toString)
First, it is better to avoid wholeTextFiles. This method reads each whole file at once. Try the textFile method instead.
Second, if you need to read all files recursively under one directory, you can do it with the textFile method after setting:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
This configuration enables recursive search (it works for Spark jobs just as it does for MapReduce jobs). After that, just invoke sc.textFile(path).
I need to save a DataFrame in CSV or parquet format (as a single file) and then open it again. The amount of data will not exceed 60 MB, so a single file is a reasonable solution. This simple task is giving me a lot of headaches. This is what I tried:
To read the file if it exists:
df = sqlContext
.read.parquet("s3n://bucket/myTest.parquet")
.toDF("key", "value", "date", "qty")
To write the file:
df.write.parquet("s3n://bucket/myTest.parquet")
This does not work because:
1) write creates the folder myTest.parquet with hadoopish files that I later cannot read with .read.parquet("s3n://bucket/myTest.parquet"). In fact I don't mind having multiple hadoopish files, as long as I can later read them easily into a DataFrame. Is that possible?
2) I am always working with the same file myTest.parquet, which I update and overwrite in S3. It tells me that the file cannot be saved because it already exists.
So, can someone show me the right way to do the read/write loop? The file format doesn't matter to me (csv, parquet, hadoopish files) as long as I can make the read and write loop work.
You can save your DataFrame with saveAsTable("TableName") and read it back with table("TableName"). The location can be set with spark.sql.warehouse.dir. You can overwrite existing data with mode(SaveMode.Overwrite); SaveMode.Ignore, by contrast, silently skips the write when the data already exists. The official documentation has more details.
In Java it would look like this:
SparkSession spark = ...
spark.conf().set("spark.sql.warehouse.dir", "hdfs://localhost:9000/tables");
Dataset<Row> data = ...
data.write().mode(SaveMode.Overwrite).saveAsTable("TableName");
Now you can read the data back with:
spark.read().table("TableName");
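The same save mode also applies when writing Parquet to a path directly, which is closer to the read/write loop in the question. A minimal Scala sketch, where overwriteParquet and readParquet are hypothetical helper names and the path is whatever location you use:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

// Hypothetical helper: writes df to path, replacing whatever is already there.
// The result is a directory of part files, not a single file.
def overwriteParquet(df: DataFrame, path: String): Unit =
  df.write.mode(SaveMode.Overwrite).parquet(path)

// Hypothetical helper: reads the whole directory of part files back as one DataFrame.
def readParquet(spark: SparkSession, path: String): DataFrame =
  spark.read.parquet(path)
```

One caveat, as far as I know: Spark refuses to overwrite a path that the same query is still reading from, so a DataFrame loaded from that path usually has to be written to a different location (or fully materialized elsewhere) before the original is replaced.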
In my project, I have three input files whose names are passed as args(0) to args(2), and an output filename passed as args(3). In the source code, I use
val sc = new SparkContext()
var log = sc.textFile(args(0))
for(i <- 1 until args.size - 1) log = log.union(sc.textFile(args(i)))
I do nothing to the log but save it as a text file by using
log.coalesce(1, true).saveAsTextFile(args(args.size - 1))
but it still saves three files, part-00000, part-00001 and part-00002. Is there any way to save the three input files into a single output file?
Having multiple output files is a standard behavior of multi-machine clusters like Hadoop or Spark. The number of output files depends on the number of reducers.
How to "solve" it in Hadoop:
merge output files after reduce phase
How to "solve" in Spark:
how to make saveAsTextFile NOT split output into multiple file?
You can also find good information here:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-Spark-merge-the-output-file-td322.html
So, you were right about coalesce(1, true). However, it is very inefficient. Interestingly (as @climbage mentioned in his remark), your code works if you run it locally.
What you might try is to read the files first and then save the output.
...
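val sc = new SparkContext()
var str = ""
for (i <- 0 until args.size - 1) {
  val file = sc.textFile(args(i))
  // collect() first: a foreach on the RDD runs on the executors and would never update the driver-side str
  file.collect().foreach(line => str += line + "\n")
}
// and now you might save the content as a single-partition RDD
sc.parallelize(Seq(str.stripLineEnd), 1).saveAsTextFile("out")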
Note: this code is also extremely inefficient and works for small files only! You need to come up with better code. I wouldn't try to reduce the number of files; I would process the multiple output files instead.
As mentioned, your problem is somewhat unavoidable with the standard APIs, because they assume you are dealing with large quantities of data. However, if I assume your data is manageable, you could try the following:
import java.nio.file.{Paths, Files}
import java.nio.charset.StandardCharsets
Files.write(Paths.get("./test_file"), data.collect.mkString("\n").getBytes(StandardCharsets.UTF_8))
What I am doing here is converting the RDD into a String by performing a collect and then mkString. I would suggest not doing this in production. It works fine for local data analysis (I was working with about 5 GB of local data).
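A variation on the same idea writes line by line instead of building one big String first. A minimal sketch, assuming the hypothetical helper name writeLines and that data is an RDD[String]; RDD.toLocalIterator hands the driver one partition at a time, so the whole data set does not have to fit in driver memory at once:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Hypothetical helper: writes each line followed by a line separator,
// pulling elements from the iterator one at a time.
def writeLines(target: String, lines: Iterator[String]): Unit = {
  val writer = Files.newBufferedWriter(Paths.get(target), StandardCharsets.UTF_8)
  try lines.foreach { line =>
    writer.write(line)
    writer.newLine()
  } finally writer.close()
}

// Hypothetical usage with an RDD[String] named data:
// writeLines("./test_file", data.toLocalIterator)
```

This still funnels everything through the driver, so it remains a convenience for modest outputs rather than a replacement for distributed writes.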