Read only parquet file - scala

I want to read multiple parquet files from a folder that also contains other file types (csv, avro) into a dataframe. I want to read a file only if it is parquet, and otherwise skip it and go on to the next one.
The problem is that a parquet file might not have an extension, and the codec might also vary from file to file. Is there a way to do this in Spark with Scala?

You can get the filenames beforehand in the following way:
import org.apache.spark.sql.DataFrame
import scala.sys.process._
// List the directory, keep the entries ending in .parquet, and take the path (last column) of each line
val fileNames: List[String] = "hdfs dfs -ls /path/to/files/on/hdfs".!!
  .split("\n")
  .filter(_.endsWith(".parquet"))
  .map(_.split("\\s").last).toList
val df: DataFrame = spark.read.parquet(fileNames: _*)
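spark in the above code is the SparkSession object. The same call should also work in Spark 1.x through sqlContext.read, since the method signature of parquet() is the same in Spark 1.x and Spark 2.x. Note that the filter only matches files whose names end in .parquet, so it does not cover parquet files without an extension.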
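For files without an extension, one possible heuristic is to look at the file content instead of the name: the Parquet format puts the 4-byte magic number PAR1 at both the start and the end of a file. The sketch below checks for that. It assumes the files are on the local file system (for HDFS the same bytes would have to be read through the Hadoop FileSystem API), and the names ParquetMagic and looksLikeParquet are made up for illustration.

```scala
import java.io.RandomAccessFile
import java.nio.charset.StandardCharsets

object ParquetMagic {
  // The magic number that the Parquet format writes at the start and end of a file.
  private val Magic: Array[Byte] = "PAR1".getBytes(StandardCharsets.US_ASCII)

  /** Heuristic check: true if the file starts and ends with the Parquet magic bytes. */
  def looksLikeParquet(path: String): Boolean = {
    val file = new RandomAccessFile(path, "r")
    try {
      val len = file.length()
      // Header magic + 4-byte footer length + footer magic need at least 12 bytes.
      if (len < 12) false
      else {
        val head = new Array[Byte](4)
        val tail = new Array[Byte](4)
        file.readFully(head)   // first 4 bytes
        file.seek(len - 4)
        file.readFully(tail)   // last 4 bytes
        head.sameElements(Magic) && tail.sameElements(Magic)
      }
    } finally file.close()
  }
}
```

The paths that pass the check could then be handed to spark.read.parquet(paths: _*) in the same way as fileNames above. The check does not depend on the compression codec, because the codec is recorded inside the file and does not change the magic bytes.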

Related

How To create dynamic data source reader and different file format reader in scala spark

I am trying to create a Spark Scala program that reads data from different sources, chosen dynamically from a configuration setting. It should read data in different formats, such as CSV, Parquet and sequence files, depending on that setting.
I have tried a lot without success. Please help; I am new to Scala and Spark.
Please use a config file to specify your input file format and location. For example:
import java.io.File
import com.typesafe.config.{ Config, ConfigFactory }
val configFile= System.getProperty("config.file")
val config = ConfigFactory.parseFile(new File(configFile))
val format = config.getString("inputDataFormat")
Based on the above format, write your conditional statements for reading files.
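One way to write those conditional statements is a pattern match on the format string. This is only a sketch: the names InputReader and readInput, the accepted values "csv", "parquet" and "sequence", the choice to treat the first CSV line as a header, and the choice to read sequence files as String key/value pairs are all assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object InputReader {
  // Hypothetical helper: picks a reader based on the configured format string.
  def readInput(spark: SparkSession, format: String, path: String): DataFrame =
    format.toLowerCase match {
      case "csv" =>
        // Assumes the first line of each CSV file is a header.
        spark.read.option("header", "true").csv(path)
      case "parquet" =>
        spark.read.parquet(path)
      case "sequence" =>
        // Sequence files go through the RDD API; assumes Text keys and values.
        import spark.implicits._
        spark.sparkContext.sequenceFile[String, String](path).toDF("key", "value")
      case other =>
        throw new IllegalArgumentException(s"Unsupported input format: $other")
    }
}
```

With the config above, a call could look like InputReader.readInput(spark, format, config.getString("inputDataPath")), where inputDataPath is a hypothetical second key in the same config file.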

Read files recursively in scala

I am trying to read a set of XML files nested in many folders into sequence files in spark. I can read the file names using function recursiveListFiles from How do I list all files in a subdirectory in scala?.
import java.io.File
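def recursiveListFiles(f: File): Array[File] = {
  // listFiles returns null when f is not a directory or cannot be read
  val these = Option(f.listFiles).getOrElse(Array.empty[File])
  these ++ these.filter(_.isDirectory).flatMap(recursiveListFiles)
}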
But how can I read each file's content as a separate column?
What about using Spark's wholeTextFiles method, and parsing the XML yourself afterwards?
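A minimal sketch of that suggestion: wholeTextFiles returns one (path, content) pair per file, which can be turned into a two-column DataFrame. The names XmlFiles and filesToDF and the column names are made up here, and the XML parsing itself is left out. It also assumes every path is readable from the Spark executors, which holds for local files only when running in local mode.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object XmlFiles {
  // Hypothetical helper: one row per file, with its path and its full text content.
  def filesToDF(spark: SparkSession, paths: Seq[String]): DataFrame = {
    import spark.implicits._
    // wholeTextFiles accepts a comma-separated list of paths
    // and yields (path, content) pairs.
    spark.sparkContext
      .wholeTextFiles(paths.mkString(","))
      .toDF("path", "content")
  }
}
```

Combined with the function from the question, a call could look like XmlFiles.filesToDF(spark, recursiveListFiles(new File("/path/to/xml")).filter(_.isFile).map(_.getPath).toSeq), where the directory is a placeholder. Since each file is loaded into memory as a whole, this suits many small files better than a few very large ones.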

Source.fromFile not working for HDFS file path

I am trying to read file contents from HDFS using Source.fromFile(). It works fine when the file is on the local file system, but it throws an error when I try to read a file from HDFS.
import scala.io.Source

object CheckFile {
  def main(args: Array[String]): Unit = {
    for (line <- Source.fromFile("/user/cloudera/xxxx/File").getLines()) {
      println(line)
    }
  }
}
Error:
java.io.FileNotFoundException: hdfs:/quickstart.cloudera:8080/user/cloudera/xxxx/File (No such file or directory)
I searched but was not able to find any solution to this.
Please help.
If you are using Spark you should use SparkContext to load the files. Source.fromFile uses the local file system.
Say you have your SparkContext at sc,
val fromFile = sc.textFile("hdfs://path/to/file.txt")
That should do the trick. You might have to specify the node address in the URI, though.
UPDATE:
To add to the comment: you want to read some data from HDFS and store it as a Scala collection. This is bad practice, as the file might contain millions of lines and the program will crash when it runs out of memory; you should use RDDs rather than built-in Scala collections. Nevertheless, if this is what you want, you could do:
val fromFile = sc.textFile("hdfs://path/to/file.txt").toLocalIterator.toArray
This would produce a local collection of the desired type (Array in this case).
sc.textFile("hdfs://path/to/file.txt").toLocalIterator.toArray.mkString("\n") will give the result as a single string, with the line breaks preserved.

Spark 2.0 Scala - RDD.toDF()

I am working with Spark 2.0 Scala. I am able to convert an RDD to a DataFrame using the toDF() method.
val rdd = sc.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
But for the life of me I cannot find where this is in the API docs. It is not under RDD. But it is under DataSet (link 1). However I have an RDD not a DataSet.
Also I can't see it under implicits (link 2).
So please help me understand why toDF() can be called for my RDD. Where is this method being inherited from?
It's coming from here:
Spark 2 API
Explanation: if you import sqlContext.implicits._, you get an implicit method that converts an RDD to a DatasetHolder (rddToDatasetHolder); you then call toDF on the DatasetHolder.
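To make the mechanism concrete, here is a toy analogue in plain Scala. It is not Spark's code: ToyRdd, ToyHolder, ToyImplicits and rddToHolder are invented stand-ins that only mimic the shape of RDD, DatasetHolder and the implicit conversion described above.

```scala
import scala.language.implicitConversions

// Toy stand-ins, NOT Spark classes. ToyRdd itself has no toDF method.
final case class ToyRdd[T](items: Seq[T])

final case class ToyHolder[T](rdd: ToyRdd[T]) {
  // The only place where toDF is defined.
  def toDF(): String = s"DataFrame(${rdd.items.mkString(", ")})"
}

object ToyImplicits {
  // Plays the role of the implicit RDD-to-holder conversion.
  implicit def rddToHolder[T](rdd: ToyRdd[T]): ToyHolder[T] = ToyHolder(rdd)
}

object Demo {
  import ToyImplicits._ // without this import, the call below does not compile

  // The compiler rewrites this to rddToHolder(ToyRdd(...)).toDF()
  val df: String = ToyRdd(Seq(1, 2, 3)).toDF()
}
```

This is why toDF does not show up in the API docs for RDD: the method lives on the holder class, and the import brings the conversion into scope.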
Yes, you should import the sqlContext implicits like this before you call toDF on your RDDs:
val sqlContext = spark.sqlContext // e.g. from a SparkSession named spark, or create your own SQLContext
import sqlContext.implicits._
val df = rdd.toDF()
Yes, I finally found peace of mind on this issue. It was troubling me like hell; this post is a life saver. I was trying to generically load data from log files into case class objects collected in a mutable List, with the idea of finally converting the list into a DataFrame. However, because the list was mutable and Spark 2.1.1 has changed the toDF implementation, for whatever reason the list was not getting converted. I had even thought of saving the data to a file and loading it back using .read. Five minutes ago this post saved my day.
I did it exactly as described. After loading the data into the mutable list, I immediately used:
import spark.sqlContext.implicits._
val df = <mutable list object>.toDF
df.show()
I have done just this with Spark 2, and it worked.
val orders = sc.textFile("/user/gd/orders")
val ordersDF = orders.toDF()

Read Parquet files from Scala without using Spark

Is it possible to read parquet files from Scala without using Apache Spark?
I found a project which allows us to read and write avro files using plain scala.
https://github.com/sksamuel/avro4s
However, I can't find a way to read and write parquet files from a plain Scala program without using Spark.
It's straightforward enough to do using the parquet-mr project, which is the project Alexey Raga is referring to in his answer.
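Some sample code:
import org.apache.avro.generic.GenericRecord
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.ParquetReader
// path is an org.apache.hadoop.fs.Path pointing at the parquet file
val reader = AvroParquetReader.builder[GenericRecord](path).build().asInstanceOf[ParquetReader[GenericRecord]]
// iter is of type Iterator[GenericRecord]; read returns null once the file is exhausted
val iter = Iterator.continually(reader.read).takeWhile(_ != null)
// if you want a list then...
val list = iter.toList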
This will return standard Avro GenericRecords, but if you want to turn them into a Scala case class, you can use my Avro4s library, which you linked to in your question, to do the marshalling for you. Assuming you are using version 1.30 or higher:
import com.sksamuel.avro4s.RecordFormat
case class Bibble(name: String, location: String)
val format = RecordFormat[Bibble]
// then for a given record
val bibble = format.from(record)
We can obviously combine that with the original iterator in one step:
val reader = AvroParquetReader.builder[GenericRecord](path).build().asInstanceOf[ParquetReader[GenericRecord]]
val format = RecordFormat[Bibble]
// iter is now an Iterator[Bibble]
val iter = Iterator.continually(reader.read).takeWhile(_ != null).map(format.from)
// and list is now a List[Bibble]
val list = iter.toList
There is also a relatively new project called eel, a lightweight (non-distributed processing) toolkit for using some of the 'big data' technology in the small.
Yes, you don't have to use Spark to read/write Parquet.
Just use parquet lib directly from your Scala code (and that's what Spark is doing anyway): http://search.maven.org/#search%7Cga%7C1%7Cparquet