Skip the CSV header while reading multiple files into an RDD in Scala

I am trying to read multiple CSV files from a path into an RDD. The path contains many CSVs. Is there a way to skip the headers while reading all of them into the RDD? Or can I strip the headers from spotsRDD without using filter, and without processing each CSV individually and then unioning the results?
val path = "file:///home/work/csvs/*"
val spotsRDD = sc.textFile(path)

It is a pity that you are using Spark 1.0.0.
You could use the CSV Data Source for Apache Spark, but that library requires Spark 1.3+ (it was later inlined into Spark 2.x).
Still, we can analyse what it does and implement something similar.
When we look into the com/databricks/spark/csv/DefaultSource.scala there is
val useHeader = parameters.getOrElse("header", "false")
and then in the com/databricks/spark/csv/CsvRelation.scala there is
// If header is set, make sure firstLine is materialized before sending to executors.
val filterLine = if (useHeader) firstLine else null
baseRDD().mapPartitions { iter =>
  // When using header, any input line that equals firstLine is assumed to be header
  val csvIter = if (useHeader) {
    iter.filter(_ != filterLine)
  } else {
    iter
  }
  parseCSV(csvIter, csvFormat)
}
So, if we assume that the header line never also occurs as a data row in the RDD, we can do something like the example below.
CSV example file:
Latitude,Longitude,Name
48.1,0.25,"First point"
49.2,1.1,"Second point"
47.5,0.75,"Third point"
scala> val csvDataRdd = sc.textFile("test.csv")
csvDataRdd: org.apache.spark.rdd.RDD[String] = test.csv MapPartitionsRDD[24] at textFile at <console>:24
scala> val header = csvDataRdd.first
header: String = Latitude,Longitude,Name
scala> val csvDataWithoutHeaderRdd = csvDataRdd.mapPartitions{iter => iter.filter(_ != header)}
csvDataWithoutHeaderRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[25] at mapPartitions at <console>:28
scala> csvDataWithoutHeaderRdd.foreach(println)
49.2,1.1,"Second point"
48.1,0.25,"First point"
47.5,0.75,"Third point"
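Applied to the original question, the same idea might look like the sketch below. It assumes that every file matched by the glob starts with the same header line and that no data row is identical to that header. CsvHeaders, dropHeader and readWithoutHeaders are illustrative names, not part of Spark.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object CsvHeaders {
  // Pure helper: drops every line that equals the header.
  def dropHeader(lines: Iterator[String], header: String): Iterator[String] =
    lines.filter(_ != header)

  // Reads all files matching `path` and removes the header lines.
  // Assumption: all files share the header of the first file read.
  def readWithoutHeaders(sc: SparkContext, path: String): RDD[String] = {
    val lines = sc.textFile(path)
    val header = lines.first() // throws if the input is empty
    lines.mapPartitions(iter => dropHeader(iter, header))
  }
}
```

In the spark-shell this could be called as CsvHeaders.readWithoutHeaders(sc, "file:///home/work/csvs/*"). It still filters, but in a single pass over all files, so there is no need to handle each CSV separately and union the results.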


Spark Scala - textFile() and sequenceFile() RDDs

I'm successfully loading my sequence files into a DataFrame with some code like this:
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val jsonRecs = (String, String)) => new String(record._2))
val df =
I'd like to do the same with some text files. The text files have a format similar to the sequence files (a timestamp, a tab character, then the JSON). The problem is that textFile() returns an RDD[String], whereas sequenceFile() returns an RDD[(LongWritable, String)].
My goal is to be able to test the program with either sequence files or text files as input.
How could I convert the RDD[String] coming from textFile() into an RDD[(LongWritable, String)]? Or is there a better solution?
Assuming that your text file is a CSV file, you can use the following code to read it into a DataFrame, where spark is the SparkSession:
val df = spark.read.option("header", "false").csv("file.txt")
Besides header, there are several other options you can set depending on your requirements; the Spark documentation lists them in full.
Thanks for the responses. It's not a CSV but I guess it could be. It's just the text output of doing this on a sequence file in HDFS:
hdfs dfs -text /path/to/my/file > myFile.txt
Anyway, I found a solution that works for both sequence and text files for my use case. This code ends up setting the variable 'file' to an RDD[(String, String)] in both cases, and I can work with that.
var file = if (inputType.equalsIgnoreCase("text")) {
sc.textFile(src).map(line => (line.split("\t")(0), line.split("\t")(1)))
} else { // Default to assuming sequence files are input
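A minimal sketch of how both branches might be written so that each yields an RDD[(String, String)]. The key and value classes of the sequence file (LongWritable and Text) are assumptions, as are the names InputLoader, splitRecord and loadRecords; adjust them to the actual file format.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object InputLoader {
  // Splits "timestamp<TAB>json" at the first tab only, so tabs inside the JSON survive.
  def splitRecord(line: String): (String, String) =
    line.split("\t", 2) match {
      case Array(ts, json) => (ts, json)
      case _               => (line, "") // no tab found: keep the line as the key
    }

  def loadRecords(sc: SparkContext, src: String, inputType: String): RDD[(String, String)] =
    if (inputType.equalsIgnoreCase("text")) {
      sc.textFile(src).map(splitRecord)
    } else {
      // Assumed key/value types. The Writables are copied to Strings right away
      // because Hadoop reuses the same Writable instances between records.
      sc.sequenceFile(src, classOf[LongWritable], classOf[Text])
        .map { case (k, v) => (k.get.toString, v.toString) }
    }
}
```

Splitting once per line with a limit of 2 also avoids calling split twice, as the text branch above does.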

How to add source file name to each row in Spark?

I'm new to Spark and am trying to insert a column to each input row with the file name that it comes from.
I've seen others ask a similar question, but all the answers used wholeTextFiles, whereas I'm trying to do this for larger CSV files (read using the Spark-CSV library), JSON files, and Parquet files (not just small text files).
I can use the spark-shell to get a list of the filenames:
val df ="/blah/dir")
val names =
but that's a dataframe.
I am not sure how to add it as a column to each row, or whether that result is ordered the same way as the initial data (though I assume it always is). I would also like a general solution that works for all input types.
Another solution I just found adds the file name as one of the columns of the DataFrame:
val df ="/blah/dir")
val dfWithCol = df.withColumn("filename",input_file_name())
See also the related question "Spark load data and add filename as dataframe column".
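For completeness, a sketch of that approach with the imports it needs. It assumes Spark 2.x with a SparkSession and a file-based source such as CSV, JSON or Parquet; SourceFiles, withSourceFile and loadWithFileName are hypothetical names. Only input_file_name() itself, from org.apache.spark.sql.functions, is taken from the answer above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.input_file_name

object SourceFiles {
  // Adds a column holding the path of the file each row was read from.
  def withSourceFile(df: DataFrame, colName: String = "filename"): DataFrame =
    df.withColumn(colName, input_file_name())

  // Loads `path` with the given format (e.g. "csv", "json", "parquet")
  // and tags every row with its source file.
  def loadWithFileName(spark: SparkSession, format: String, path: String): DataFrame =
    withSourceFile(spark.read.format(format).load(path))
}
```

Because the column is computed per row while the file is scanned, it does not rely on the row order of a separately collected list of file names.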
When you create an RDD from a text file, you probably want to map the data into a case class, so you could add the input source at that stage:
case class Person(inputPath: String, name: String, age: Int)
val inputPath = "hdfs://localhost:9000/tmp/demo-input-data/persons.txt"
val rdd = sc.textFile(inputPath).map { l =>
  val tokens = l.split(",")
  Person(inputPath, tokens(0), tokens(1).trim().toInt)
}
If you do not want to mix "business data" with metadata:
case class InputSourceMetaData(path: String, size: Long)
case class PersonWithMd(name: String, age: Int, metaData: InputSourceMetaData)
// Fake the size, for demo purposes only
val md = InputSourceMetaData(inputPath, size = -1L)
val rdd = sc.textFile(inputPath).map { l =>
  val tokens = l.split(",")
  PersonWithMd(tokens(0), tokens(1).trim().toInt, md)
}
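and if you promote the RDD to a DataFrame and register it as a temporary table named x:
import sqlContext.implicits._
val df = rdd.toDF()
// Register the DataFrame so that the SQL queries can refer to it as "x"
df.registerTempTable("x")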
you can query it like
sqlContext.sql("select name, metadata from x").show()
sqlContext.sql("select name, metadata.path from x").show()
sqlContext.sql("select name, metadata.path, metadata.size from x").show()
You can read the files in HDFS using org.apache.hadoop.fs.FileSystem.listFiles() recursively.
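A minimal sketch of such a recursive listing, collecting the results into a Scala List. HdfsFiles and listFilesRecursively are illustrative names; the Hadoop calls used are Path.getFileSystem and FileSystem.listFiles(path, recursive), which returns a RemoteIterator[LocatedFileStatus].

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{LocatedFileStatus, Path}
import scala.collection.mutable.ListBuffer

object HdfsFiles {
  // Lists all files (not directories) below `dir`, descending into subdirectories.
  def listFilesRecursively(dir: String,
                           conf: Configuration = new Configuration()): List[LocatedFileStatus] = {
    val root = new Path(dir)
    val fs = root.getFileSystem(conf)  // resolves hdfs://, file://, ... from the path
    val it = fs.listFiles(root, true)  // true = recursive
    val files = ListBuffer.empty[LocatedFileStatus]
    while (it.hasNext()) files += it.next()
    files.toList
  }
}
```

In the spark-shell, sc.hadoopConfiguration could be passed as conf so that the listing uses the same settings as the Spark job.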
Given the listed files in a value files (a standard Scala collection of org.apache.hadoop.fs.LocatedFileStatus), you can create one RDD per file:
val rdds = files.map { f =>
  val md = InputSourceMetaData(f.getPath.toString, f.getLen)
  sc.textFile(md.path).map { l =>
    val tokens = l.split(",")
    PersonWithMd(tokens(0), tokens(1).trim().toInt, md)
  }
}
Now you can reduce the list of RDDs into a single one; the function passed to reduce concatenates the RDDs pairwise:
val rdd = rdds.reduce(_ ++ _)
This works, but I have not been able to test whether it distributes and performs well with large files.
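If the number of files is large, one alternative worth trying is SparkContext.union, which builds a single union over all inputs rather than a chain of pairwise unions. This is only a sketch (RddUnion and unionAll are illustrative names) and it has not been benchmarked against the reduce version.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object RddUnion {
  // Combines all RDDs in one step; an empty input yields an empty RDD
  // instead of the exception that reduce would throw on an empty list.
  def unionAll[T: ClassTag](sc: SparkContext, rdds: Seq[RDD[T]]): RDD[T] =
    if (rdds.isEmpty) sc.emptyRDD[T] else sc.union(rdds)
}
```

With the values from above this would be called as RddUnion.unionAll(sc, rdds.toSeq).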

