Spark scala read multiple files from S3 using Seq(paths) - scala

I have a Scala program that reads JSON files into a DataFrame using DataFrameReader, with a file pattern like "s3n://bucket/filepath/*.json" to specify the files. Now I need to read both ".json" and ".json.gz" (gzipped) files into the DataFrame.
The current approach uses a wildcard, like this:
session.read.json("s3n://bucket/filepath/*.json")
I want to read both JSON and gzipped JSON files, but I have not found documentation for the wildcard pattern syntax. I was tempted to compose a more complex wildcard, but the lack of documentation motivated me to consider another approach.
Reading the Spark documentation, the DataFrameReader has these relevant methods:
json(path: String): DataFrame
json(paths: String*): DataFrame
Which would produce code more like this:
// spark.isInstanceOf[SparkSession]
// val reader: DataFrameReader = spark.read
val df: DataFrame = spark.read.json(path: String)
// or
val df: DataFrame = spark.read.json(paths: String*)
I need to read JSON and gzipped JSON files now, but I may need to read other filename patterns later. The second method (above) accepts multiple paths, which means I could build a Seq(uence) of patterns and later add other filename wildcards to it.
// session.isInstanceOf[SparkSession]
val s3json: String = "s3n://bucket/filepath/*.json"
val s3gzip: String = "s3n://bucket/filepath/*.json.gz"
val paths: Seq[String] = Seq(s3json, s3gzip)
val df: DataFrame = session.read.json(paths)
Please comment on this approach: is it idiomatic?
I have also seen examples of the last line with the splat operator ("_*") added to the paths sequence. Is that needed? Can you explain what the ": _*" part does?
val df: DataFrame = session.read.json(paths: _*)
Examples of splat operator use are here:
How to read multiple directories in s3 in spark Scala?
How to pass a list of paths to spark.read.load?

Adding further to blackbishop's answer, you can use val df = spark.read.json(paths: _*) for reading files from entirely independent buckets/folders.
val paths = Seq("s3n://bucket1/filepath1/", "s3n://bucket2/filepath/2")
val df = spark.read.json(paths: _*)
The ": _*" converts the Seq into the variable-length arguments (String*) that the json(paths: String*) method expects.
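For illustration, here is a minimal sketch of how ": _*" behaves for any varargs method (takeAll is a made-up example, not part of Spark):
// Hypothetical varargs method, for illustration only.
def takeAll(xs: String*): Int = xs.length
val items: Seq[String] = Seq("a", "b", "c")
takeAll("a", "b", "c")  // direct varargs call => 3
takeAll(items: _*)      // ": _*" expands the Seq into varargs => 3
// takeAll(items)       // does not compile: a Seq[String] is not a String*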

You can use brace expansions in your path to include the 2 extensions:
val df = spark.read.json("s3n://bucket/filepath/{*.json,*.json.gz}")
If your bucket contains only .json and .json.gz files, you can actually read all the files:
val df = spark.read.json("s3n://bucket/filepath/")

Related

How to read a CSV with quotes using SparkContext

I've recently started to use Scala Spark; in particular, I'm trying to use GraphX to build a graph from a CSV. To read a CSV file with the Spark context I always do this:
val rdd = sc.textFile("file/path")
.map(line => line.split(","))
This way I obtain an RDD of Array[String] objects.
My problem is that the CSV file contains strings delimited by quotes ("") and numbers without quotes; an example of some lines inside the file is the following:
"Luke",32,"Rome"
"Mary",43,"London"
"Mario",33,"Berlin"
If I use split(","), I obtain String objects that still contain the quotes; for instance, the string Luke is stored as "Luke" and not as Luke.
How can I ignore the quotes and obtain the correct string values?
I hope I explained my problem clearly.
You can let Spark's DataFrame-level CSV parser resolve that for you:
val rdd = spark.read.csv("file/path").rdd.map(_.mkString(",")).map(_.split(","))
By the way, you can transform the Row directly into (VertexId, (String, String)) in the first map, based on the Row fields; a sketch follows.
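A minimal sketch of that idea, assuming the columns are name, age and city in that order (the column positions and the hash-based vertex id are illustrative assumptions, not taken from the question):
import org.apache.spark.graphx.VertexId
// Map each parsed Row straight to a (VertexId, (String, String)) vertex tuple.
// Assumed column order: 0 = name, 1 = age, 2 = city; quotes are already stripped by the CSV parser.
val vertices = spark.read.csv("file/path").rdd.map { row =>
  val id: VertexId = row.getString(0).hashCode.toLong
  (id, (row.getString(1), row.getString(2)))
}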
Try the example below:
import org.apache.spark.sql.SparkSession

object DataFrameFromCSVFile {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExample")
      .getOrCreate()

    val filePath = "C://zipcodes.csv"

    // Chaining multiple options
    val df2 = spark.read.options(Map("inferSchema" -> "true", "sep" -> ",", "header" -> "true")).csv(filePath)

    df2.show(false)
    df2.printSchema()
  }
}

Spark Scala - textFile() and sequenceFile() RDDs

I'm successfully loading my sequence files into a DataFrame with some code like this:
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val file = sc.sequenceFile[LongWritable, String](src) // requires import org.apache.hadoop.io.LongWritable
val jsonRecs = file.map((record: (LongWritable, String)) => new String(record._2))
val df = sqlContext.read.json(jsonRecs)
I'd like to do the same with some text files. The text files have a similar format to the sequence files (a timestamp, a tab character, then the JSON). But the problem is that textFile() returns an RDD[String] instead of an RDD[(LongWritable, String)] like the sequenceFile() method.
My goal is to be able to test the program with either sequence files or text files as input.
How could I convert the RDD[String] coming from textFile() into an RDD[(LongWritable, String)]? Or is there a better solution?
Assuming your text file is a CSV file, you can use the following code to read it into a DataFrame, where spark is the SparkSession:
val df = spark.read.option("header", "false").csv("file.txt")
Like the header option, there are multiple options you can provide depending on your requirements; check the DataFrameReader documentation for more details. A sketch for this case follows.
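For instance, a sketch for the asker's tab-separated "timestamp<TAB>json" layout (the column names here are illustrative assumptions):
// Read tab-separated lines into two string columns.
val df = spark.read
  .option("header", "false")
  .option("sep", "\t")              // tab delimiter instead of the default comma
  .csv("file.txt")
  .toDF("timestamp", "json")        // illustrative column names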
Thanks for the responses. It's not a CSV but I guess it could be. It's just the text output of doing this on a sequence file in HDFS:
hdfs dfs -text /path/to/my/file > myFile.txt
Anyway, I found a solution that works for both sequence files and text files for my use case. This code ends up setting the variable 'file' to an RDD[(String, String)] in both cases, and I can work with that.
var file = if (inputType.equalsIgnoreCase("text")) {
  sc.textFile(src).map { line =>
    val fields = line.split("\t", 2) // split once into (timestamp, json); limit 2 keeps tabs inside the JSON intact
    (fields(0), fields(1))
  }
} else { // Default to assuming sequence files are input
  sc.sequenceFile[String, String](src)
}
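To feed either branch back into the original DataFrame code, one could then do the following (a sketch that reuses the sqlContext from the question):
// Either way, 'file' is an RDD[(String, String)]: drop the key and load the JSON values.
val jsonRecs = file.map(_._2)
val df = sqlContext.read.json(jsonRecs)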

Spark: Write each record in RDD to individual files in HDFS directory

I have a requirement where I want to write each individual record in an RDD to an individual file in HDFS.
I did it for the local filesystem but, obviously, that doesn't work for HDFS.
stream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.foreach { msg =>
      val value = msg._2
      println(value)
      val fname = java.util.UUID.randomUUID.toString
      val path = dir + fname
      write(path, value)
    }
  }
}
where write is a function that writes to the filesystem.
Is there a way to do this within Spark so that, for each record, I can write natively to HDFS, without using any other tool like Kafka Connect or Flume?
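For reference, a minimal sketch of what such a write helper could look like when targeting HDFS through the Hadoop FileSystem API (this is an illustrative assumption, not the asker's actual code; the Configuration is built on the executor because it is not serializable):
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
// Write a single string value to one HDFS file at the given path.
def write(path: String, value: String): Unit = {
  val fs = FileSystem.get(new Configuration())
  val out = fs.create(new Path(path))
  try out.write(value.getBytes(StandardCharsets.UTF_8))
  finally out.close()
}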
EDIT: More Explanation
For example:
If my DStream RDD has the following records:
abcd
efgh
ijkl
mnop
I need a different file for each record: one file for "abcd", another for "efgh", and so on.
I tried creating an RDD within the stream RDD, but I learned it's not allowed, as RDDs are not serializable.
You can forcefully repartition the RDD so that the number of partitions equals the number of records, and then save it:
val rddCount = rdd.count()
rdd.repartition(rddCount.toInt).saveAsTextFile("your/hdfs/loc") // repartition takes an Int, so the count must fit in an Int
You can do it in a couple of ways.
From the rdd you can get the SparkContext; once you have the SparkContext, you can use the parallelize method and pass the String as a single-element sequence.
For example:
val sc = rdd.sparkContext
sc.parallelize(Seq("some string")).saveAsTextFile(path)
Also, you can use sqlContext to convert the string to a DataFrame and then write it to a file.
For example:
import sqlContext.implicits._
Seq("some string").toDF("value").write.text(path)

skip header of csv while reading multiple files into rdd in scala

I am trying to read multiple CSVs from a path into an RDD. The path contains many CSVs. Is there a way I can skip the headers while reading all the CSVs into the RDD, or use spotsRDD to omit the headers, without having to use filter or deal with each CSV individually and then union them?
val path ="file:///home/work/csvs/*"
val spotsRDD= sc.textFile(path)
println(spotsRDD.count())
Thanks
It is a pity you are using Spark 1.0.0.
You could use the CSV Data Source for Apache Spark, but that library requires Spark 1.3+ (and, by the way, it was inlined into Spark 2.x).
But we can analyse and implement something similar.
When we look into the com/databricks/spark/csv/DefaultSource.scala there is
val useHeader = parameters.getOrElse("header", "false")
and then in the com/databricks/spark/csv/CsvRelation.scala there is
// If header is set, make sure firstLine is materialized before sending to executors.
val filterLine = if (useHeader) firstLine else null
baseRDD().mapPartitions { iter =>
  // When using header, any input line that equals firstLine is assumed to be header
  val csvIter = if (useHeader) {
    iter.filter(_ != filterLine)
  } else {
    iter
  }
  parseCSV(csvIter, csvFormat)
}
So, if we assume the header line appears only once in the RDD (of CSV rows), we can do something like the example below.
CSV example file:
Latitude,Longitude,Name
48.1,0.25,"First point"
49.2,1.1,"Second point"
47.5,0.75,"Third point"
scala> val csvData = sc.textFile("test.csv")
csvData: org.apache.spark.rdd.RDD[String] = test.csv MapPartitionsRDD[24] at textFile at <console>:24
scala> val header = csvData.first
header: String = Latitude,Longitude,Name
scala> val csvDataWithoutHeaderRdd = csvData.mapPartitions{ iter => iter.filter(_ != header) }
csvDataWithoutHeaderRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[25] at mapPartitions at <console>:28
scala> csvDataWithoutHeaderRdd.foreach(println)
49.2,1.1,"Second point"
48.1,0.25,"First point"
47.5,0.75,"Third point"

Why don't keys() and values() work on a (String, String) pair RDD, while sortByKey() works?

I create an RDD using the README.md file in the Spark directory. The type of newRDD is (String, String):
val lines = sc.textFile("README.md")
val newRDD = lines.map(x => (x.split(" ")(0),x))
So, when I try to run newRDD.values() or newRDD.keys(), I get the error:
error: org.apache.spark.rdd.RDD[String] does not take parameters (for newRDD.values() and newRDD.keys(), respectively)
What I understand from the error is that maybe the String data type cannot be a key (and I think I am wrong about that). But if that's the case, why does
newRDD.sortByKey() work?
Note: I am trying the values() and keys() transformations because they're listed as valid transformations for pair RDDs.
Edit: I am using Apache Spark version 1.5.2 in Scala
It doesn't work because values (or keys) has no parameter list, and because of that it has to be called without parentheses:
val rdd = sc.parallelize(Seq(("foo", "bar")))
rdd.keys.first
// String = foo
rdd.values.first
// String = bar
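For completeness: the reason sortByKey() accepts the parentheses while keys and values do not is that sortByKey is declared with a (defaulted) parameter list, whereas keys and values are parameterless methods. A small sketch of the distinction:
// Declared roughly as: def sortByKey(ascending: Boolean = true, numPartitions: Int = ...): RDD[(K, V)]
rdd.sortByKey()   // OK: the empty parens match the (defaulted) parameter list
// Declared roughly as: def keys: RDD[K] and def values: RDD[V]
rdd.keys          // OK
// rdd.keys()     // error: keys has no parameter list, so () tries to call apply on the returned RDD[String]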