Read files recursively in Scala

I am trying to read a set of XML files nested in many folders into sequence files in Spark. I can list the file names using the recursiveListFiles function from "How do I list all files in a subdirectory in Scala?":
import java.io.File

def recursiveListFiles(f: File): Array[File] = {
  val these = f.listFiles
  these ++ these.filter(_.isDirectory).flatMap(recursiveListFiles)
}
But how do I read each file's content as a separate column here?

What about using Spark's wholeTextFiles method and parsing the XML yourself afterwards?
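A minimal sketch of that idea, assuming a SparkSession named spark and a purely hypothetical base path (the glob pattern is illustrative only and may need adjusting for deeper nesting): wholeTextFiles returns (path, content) pairs, which become a two-column DataFrame whose content column you can then parse as XML per row.

import org.apache.spark.sql.SparkSession

// Sketch only: `spark` and the glob path below are assumptions, not part of the original answer.
val spark = SparkSession.builder().appName("read-xml").getOrCreate()
import spark.implicits._

// wholeTextFiles yields (filePath, fileContent) pairs
val xmlDf = spark.sparkContext
  .wholeTextFiles("/path/to/base/dir/*/*.xml")
  .toDF("fileName", "content") // file name and file content as separate columns

xmlDf.show(false)
// each `content` value can then be parsed, e.g. with scala.xml.XML.loadString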

Related

Pass json file into JAR or read from spark session

I have a Spark UDF written in Scala. I'd like to use my function with some additional files.
import scala.io.Source
import org.json4s.jackson.JsonMethods.parse
import org.json4s.DefaultFormats

object consts {
  implicit val formats = DefaultFormats
  val my_map = parse(Source.fromFile("src/main/resources/map.json").mkString)
    .extract[Map[String, Map[String, List[String]]]]
}
Now I want to use the my_map object inside a UDF, so I basically do this:
import package.consts.my_map

object myUDFs {
  *bla-bla and use my_map*
}
I've already tested my function locally, and it works well.
Now I want to understand how to package the JAR file so that the .json file stays inside it.
Thank you.
If you manage your project with Maven, you can place your .json file(s) under src/main/resources as it's the default place where Maven looks for your project's resources.
You can also define a custom path for your resources as described here: https://maven.apache.org/plugins/maven-resources-plugin/examples/resource-directory.html
UPD: I managed to do this by creating a fat JAR and reading my resource file this way:
parse(
  Source
    .fromInputStream(
      getClass.getClassLoader.getResourceAsStream("map.json")
    )
    .mkString
).extract[Map[String, Map[String, List[String]]]]

How to convert a java.io list to a DataFrame in Scala?

I'm using this code to get the list of files in a directory, and I want to call the toDF method that works when converting lists to DataFrames. However, because this is a java.io list, it's saying it won't work.
val files = Option(new java.io.File("data").list).map(_.count(_.endsWith(".csv"))).getOrElse(0)
When I try to do
files.toDF.show()
I get this error:
How can I get this to work? Can someone help me with the code to convert this java.io List to a regular list?
Thanks
val files = Option(new java.io.File("data").list).map(_.count(_.endsWith(".csv"))).getOrElse(0)
The code above returns an Int, so you are trying to convert an Int value into a DataFrame, which is not possible. If I understand correctly, you want to convert the list of .csv files into a DataFrame. Please use the code below:
val files = Option(new java.io.File("data").list).get.filter(x => x.endsWith(".csv")).toList

import spark.implicits._
files.toDF().show()
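Note that since files is a List[String], the resulting DataFrame will have a single column (named value by default); if you want a more descriptive header you can pass a name, e.g. files.toDF("fileName").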

How to create a dynamic data source reader for different file formats in Scala Spark

I am trying to create a program in Spark Scala that reads data from different sources dynamically, based on a configuration setting.
The program should read data in different formats (CSV, Parquet, and sequence files) depending on that configuration.
I have tried several things without success; please help, I am new to Scala and Spark.
Please use a config file to specify your input file format and location. For example:
import java.io.File
import com.typesafe.config.{Config, ConfigFactory}

val configFile = System.getProperty("config.file")
val config = ConfigFactory.parseFile(new File(configFile))
val format = config.getString("inputDataFormat")
Based on the above format, write your conditional statements for reading files.
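A minimal sketch of what those conditionals could look like, assuming a SparkSession named spark and a hypothetical inputPath key in the same config file (both are illustrative, not part of the original answer):

// Sketch only: `spark` and the "inputPath" key are assumptions for illustration.
val path = config.getString("inputPath")

val df = format.toLowerCase match {
  case "csv" =>
    spark.read.option("header", "true").csv(path)
  case "parquet" =>
    spark.read.parquet(path)
  case "sequence" =>
    // sequence files come back as key/value pairs through the SparkContext
    import spark.implicits._
    spark.sparkContext.sequenceFile[String, String](path).toDF("key", "value")
  case other =>
    throw new IllegalArgumentException(s"Unsupported input format: $other")
}

df.show()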

Read only parquet file

I want to read multiple Parquet files from a folder that also contains some other file types (CSV, Avro) into a DataFrame. I want to read a file only if it is Parquet, and skip to the next one otherwise.
The problem is that a Parquet file might not have an extension, and the codec might also vary from file to file. Is there a way to do this in Spark Scala?
You can get the filenames beforehand in the following way:
import org.apache.spark.sql.DataFrame
import scala.sys.process._

val fileNames: List[String] = "hdfs dfs -ls /path/to/files/on/hdfs".!!
  .split("\n")
  .filter(_.endsWith(".parquet"))
  .map(_.split("\\s").last)
  .toList

val df: DataFrame = spark.read.parquet(fileNames: _*)
spark in the above code is the SparkSession object. This code should also work for Spark 1.x, since the method signature of parquet() is the same in Spark 1.x and 2.x.
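As an alternative to shelling out to hdfs dfs -ls, you could list the files through the Hadoop FileSystem API instead; this is only a sketch, and it still assumes the Parquet files are identifiable by a ".parquet" suffix (the directory path is a placeholder):

// Alternative sketch: list files via the Hadoop FileSystem API instead of a shell command.
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val parquetPaths = fs
  .listStatus(new Path("/path/to/files/on/hdfs")) // hypothetical directory
  .map(_.getPath.toString)
  .filter(_.endsWith(".parquet"))
  .toList

val parquetDf = spark.read.parquet(parquetPaths: _*)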

Source.fromFile not working for HDFS file path

I am trying to read file contents from HDFS, and for that I am using Source.fromFile(). It works fine when my file is on the local file system, but it throws an error when I try to read a file from HDFS.
import scala.io.Source

object CheckFile {
  def main(args: Array[String]): Unit = {
    for (line <- Source.fromFile("/user/cloudera/xxxx/File").getLines()) {
      println(line)
    }
  }
}
Error:
java.io.FileNotFoundException: hdfs:/quickstart.cloudera:8080/user/cloudera/xxxx/File (No such file or directory)
I searched but was not able to find any solution to this.
Please help.
If you are using Spark, you should use the SparkContext to load the files; Source.fromFile only reads from the local file system.
Say you have your SparkContext available as sc:
val fromFile = sc.textFile("hdfs://path/to/file.txt")
That should do the trick. You might have to specify the namenode address explicitly, though.
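For example, with an explicit namenode (the host and port below are placeholders, not values taken from your cluster):

// Placeholder namenode host/port; replace them with your cluster's actual values
val fromFile = sc.textFile("hdfs://namenode-host:8020/user/cloudera/xxxx/File")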
UPDATE:
To add to the comment: you want to read some data from HDFS and store it as a Scala collection. This is bad practice, as the file might contain millions of lines and the job will crash due to insufficient memory; you should use RDDs and not built-in Scala collections. Nevertheless, if this is what you want, you could do:
val fromFile = sc.textFile("hdfs://path/to/file.txt").toLocalIterator.toArray
which would produce a local collection of the desired type (an Array in this case).
sc.textFile("hdfs://path/to/file.txt").toLocalIterator.toArray.mkString will give the result as a single String.