spark wholeTextFiles - many small files - scala

I want to ingest many small text files with Spark and write them to Parquet. Currently, I use wholeTextFiles and perform some additional parsing.
To be more precise, these small text files are ESRI ASCII Grid files, each with a maximum size of around 400 KB. GeoTools is used to parse them, as outlined below.
Do you see any optimization possibilities? Maybe something to avoid the creation of unnecessary objects, or something to handle the small files better? I wonder if it is better to get only the paths of the files and read them manually instead of going through String -> ByteArrayInputStream.
case class RawRecords(path: String, content: String)
case class GeometryId(idPath: String, value: Double, geo: String)

@transient lazy val extractor = new PolygonExtractionProcess()
@transient lazy val writer = new WKTWriter()

def readRawFiles(path: String, parallelism: Int, spark: SparkSession) = {
  import spark.implicits._
  spark.sparkContext
    .wholeTextFiles(path, parallelism)
    .toDF("path", "content")
}

def mapToSimpleTypes(iterator: Iterator[RawRecords]): Iterator[GeometryId] = iterator.flatMap(r => {
  val extractor = new PolygonExtractionProcess()
  val readRaster = new ArcGridReader(new ByteArrayInputStream(r.content.getBytes(StandardCharsets.UTF_8))).read(null)
  // TODO maybe consider optimization of known size instead of using growable data structure
  val vectorizedFeatures = extractor.execute(readRaster, 0, true, null, null, null, null).features
  val result: collection.Seq[GeometryId] with Growable[GeometryId] = mutable.Buffer[GeometryId]()
  while (vectorizedFeatures.hasNext) {
    val vectorizedFeature = vectorizedFeatures.next()
    val geomWKTLineString = vectorizedFeature.getDefaultGeometry match {
      case g: Geometry => writer.write(g)
    }
    val geomUserdata = vectorizedFeature.getAttribute(1).asInstanceOf[Double]
    result += GeometryId(r.path, geomUserdata, geomWKTLineString)
  }
  result
})

I have a few suggestions:
Use wholeTextFiles -> mapPartitions -> convert to Dataset. Why? If you call mapPartitions on a Dataset, every row is converted from Spark's internal format to an object, which causes additional serialization.
Run Java Mission Control and sample your application. It will show the JIT compilations and the execution times of methods.
Maybe you can use binaryFiles. It gives you a PortableDataStream per file, so you can parse each file inside mapPartitions without first reading it into a String.
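A minimal sketch of the binaryFiles suggestion, assuming a local Spark session and an input directory passed as the first argument. The object name and the countLines helper are hypothetical: countLines only stands in for the GeoTools parsing, which would read from the same InputStream.

```scala
import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets

import org.apache.spark.input.PortableDataStream
import org.apache.spark.sql.SparkSession

object BinaryFilesSketch {

  // Hypothetical stand-in for the GeoTools parsing step: counts the lines of one file.
  def countLines(in: InputStream): Long = {
    val reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))
    Iterator.continually(reader.readLine()).takeWhile(_ != null).size.toLong
  }

  // Opens the stream on the executor, parses it, and always closes it again.
  def parse(path: String, stream: PortableDataStream): (String, Long) = {
    val in = stream.open()
    try (path, countLines(in)) finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("binary-files-sketch").getOrCreate()
    import spark.implicits._

    val parsed = spark.sparkContext
      .binaryFiles(args(0), minPartitions = 8) // args(0): input directory (assumption)
      .mapPartitions(_.map { case (path, stream) => parse(path, stream) }) // still an RDD here
      .toDS() // converted to a Dataset only after parsing

    parsed.show(truncate = false)
    spark.stop()
  }
}
```

The parsing happens on the RDD, so rows are encoded into Spark's internal format only once, after the expensive work is done.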


Create Spark UDF of a function that depends on other resources

I have a code for tokenizing a string.
But that tokenization method uses some data which is loaded when my application starts.
val stopwords = getStopwords()
val tokens = tokenize("hello i am good", stopwords)

def tokenize(string: String, stopwords: List[String]): List[String] = {
  val splitted = string.split(" ")
  // Filter the split array with the stopwords, then return the remaining items.
  splitted.filterNot(w => stopwords.contains(w)).toList
}
Now I want to make the tokenize method a UDF for Spark. I want to use it to create a new column in DataFrame transformations.
I have created simple UDFs before, but they had no dependencies such as data that needs to be read from a text file.
Can someone tell me how to do this kind of operation?
This is what I have tried, and it's working.
val moviesDF = Seq(
val tokenizeUDF: UserDefinedFunction = udf(tokenize(_: String): List[String])
moviesDF.withColumn("tokenized", tokenizeUDF(col("column_name"))).show(100, false)
def tokenize(name: String): List[String] = {
val wordFreqMap: Map[String, Double] = DataProviderUtil.getWordFreqMap()
val stopWords: Set[String] = DataProviderUtil.getStopWordSet()
val maxLengthWord: Int = wordFreqMap.keys.maxBy(_.length).length
It's giving me the expected output:
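+----------------------------+--------------------------+
|columnname                  |tokenized                 |
+----------------------------+--------------------------+
|kingdomofheaven             |[kingdom, heaven]         |
|enemyatthegates             |[enemi, gate]             |
|salesinfointheyearofdecember|[sale, info, year, decemb]|
+----------------------------+--------------------------+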
Now my question is: will it work when it's deployed? Currently I am running it locally. My main concern is that this function reads from a file to get information like stopwords, word frequencies etc. to make the tokenization possible. Will registering it like this work properly?
At this point, if you deploy this code, Spark will try to serialize your DataProviderUtil, so you would need to mark that class as Serializable. Another possibility is to declare your logic inside an object. Functions inside objects are treated like static functions and are not serialized.
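The object-based option can be sketched as follows. TokenizerResources and its hard-coded stop-word set are hypothetical stand-ins for DataProviderUtil; in a real job the lazy val would read a file that is available on every executor, since an object is initialised once per JVM rather than shipped with the closure.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical holder for the shared resources. Members of an object are
// initialised once per JVM (so once per executor) and are not serialized.
object TokenizerResources {
  // Assumption: a real job would load this from a file present on every executor.
  lazy val stopWords: Set[String] = Set("i", "am", "of", "the")

  def tokenize(text: String): Seq[String] =
    text.split(" ").toSeq.filter(w => w.nonEmpty && !stopWords.contains(w))
}

object TokenizeUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("tokenize-udf-sketch").getOrCreate()
    import spark.implicits._

    // The lambda only refers to the object, so nothing but the lambda itself is serialized.
    val tokenizeUDF = udf((s: String) => TokenizerResources.tokenize(s))

    Seq("hello i am good").toDF("text")
      .withColumn("tokenized", tokenizeUDF(col("text")))
      .show(truncate = false)

    spark.stop()
  }
}
```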

How to efficiently read/parse loads of .gz files in a s3 folder with spark on EMR

I'm trying to read all files in a directory on s3 via a spark app that's executing on EMR.
The data is stored in a typical format like "s3a://Some/path/yyyy/mm/dd/hh/blah.gz"
If I use deeply nested wildcards (e.g. "s3a://SomeBucket/SomeFolder/*/*/*/*/*.gz"), the performance is terrible: it takes about 40 minutes to read a few tens of thousands of small gzipped JSON files.
It works, but losing 40 minutes to test some code is really bad.
I have two other approaches that my research has told me are much more performant.
Using the hadoop.fs library (2.8.5) I try to read each file path I provide it.
private def getEventDataHadoop(
eventsFilePaths: RDD[String]
)(implicit sqlContext: SQLContext): Try[RDD[String]] =
val conf = sqlContext.sparkContext.hadoopConfiguration => {
val p = new Path(eventsFilePath)
val fs = p.getFileSystem(conf)
val eventData: FSDataInputStream =
These file paths are generated by the below code:
private[disneystreaming] def generateInputBucketPaths(
s3Protocol: String,
bucketName: String,
service: String,
region: String,
yearsMonths: Map[String, Set[String]]
): Try[Set[String]] =
val days = 1 to 31
val hours = 0 to 23
val dateFormatter: Int => String = buildDateFormat("00")
yearsMonths.flatMap { yearMonth: (String, Set[String]) =>
for {
month: String <- yearMonth._2
day: Int <- days
hour: Int <- hours
} yield
s"$s3Protocol$bucketName/$service/$region/${dateFormatter(yearMonth._1.toInt)}/${dateFormatter(month.toInt)}/" +
The hadoop.fs code fails because the Path class is not serializable. I can't think of how I can get around that.
So this led me to another approach using AmazonS3Client, where I just ask the client to give me all the file paths in a folder (or prefix), then parse the files to a string, which will likely fail due to them being compressed:
private def getEventDataS3(bucketName: String, prefix: String)(
implicit sqlContext: SQLContext
): Try[RDD[String]] =
import com.amazonaws.services.s3._, model._
import scala.collection.JavaConverters._
val request = new ListObjectsRequest()
val s3 = new AmazonS3Client(new ProfileCredentialsProvider("default"))
val objs: ObjectListing = s3.listObjects(request) // Note that this method returns truncated data if longer than the "pageLength" above. You might need to deal with that.
.flatMap { key =>
scala.io.Source
.fromInputStream(s3.getObject(bucketName, key).getObjectContent: InputStream)
This code produces an exception because the profile file cannot be null ("java.lang.IllegalArgumentException: profile file cannot be null").
Remember this code is running on EMR within AWS, so how do I provide the credentials it wants? How are other people running spark jobs on EMR using this client?
Any help with getting any of these approaches working is much appreciated.
Path is serializable in later Hadoop releases, precisely because it is useful to be able to use it in Spark RDDs. Until then, convert the path to a URI, marshal that, and create a new Path from the URI inside your closure.
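A minimal sketch of the URI round trip, assuming a local Spark session. The bucket and key are hypothetical, and fileName is only a placeholder for whatever the task really does with the rebuilt Path (for example opening it through its FileSystem).

```scala
import java.net.URI

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

object PathAsUriSketch {

  // Runs inside the task: only the URI crosses the wire, the Path is rebuilt here.
  def fileName(uri: URI): String = new Path(uri).getName

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("path-as-uri-sketch").getOrCreate()

    // Hypothetical input path, created on the driver.
    val paths = Seq(new Path("s3a://some-bucket/some/prefix/events.gz"))

    // java.net.URI is Serializable, so it can travel inside an RDD.
    val uris: Seq[URI] = paths.map(_.toUri)

    val names = spark.sparkContext.parallelize(uris).map(fileName).collect()
    names.foreach(println)

    spark.stop()
  }
}
```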

Spark UDF with Maxmind Geo Data

I'm trying to use the Maxmind snowplow library to pull out geo data on each IP that I have in a dataframe.
We are using Spark SQL (Spark version 2.1.0) and I created a UDF in the following class:
class UdfDefinitions @Inject() extends Serializable with StrictLogging {
  val s3Config = configuration.databases.dataWarehouse.s3
  val lruCacheConst = 20000
  val ipLookups = IpLookups(geoFile = Some(SparkFiles.get(s3Config.geoIPFileName)),
    ispFile = None, orgFile = None, domainFile = None, memCache = false, lruCache = lruCacheConst)

  def lookupIP(ip: String): LookupIPResult = {
    val loc: Option[IpLocation] = ipLookups.getFile.performLookups(ip)._1
    loc match {
      case None => LookupIPResult("", "", "")
      case Some(x) => LookupIPResult(Option(x.countryName).getOrElse(""), x.city.getOrElse(""), x.regionName.getOrElse(""))
    }
  }

  val lookupIPUDF: UserDefinedFunction = udf(lookupIP _)
}
The intention is to create the pointer to the file (ipLookups) outside the UDF and use it inside, so as not to open files on each row. This results in a "Task not serializable" error, and when we call addFile inside the UDF instead, we get a "too many open files" error (on a large dataset; on a small dataset it does work).
This thread, "using maxmind geoip in spark serialized", shows how to solve the problem with RDDs, but we would like to use Spark SQL.
Any thoughts?
The problem here is that IpLookups is not Serializable. Yet it performs the lookups from a static file (from what I gathered), so you should be able to fix that. I would advise that you clone the repo and make IpLookups Serializable. Then, to make it work with Spark SQL, wrap everything in a class like you did. Then, in the main Spark job, you can write something like the following:
val IPResolver = new MySerializableIpResolver()
val resolveIP = udf((ip : String) => IPResolver.resolve(ip))
data.withColumn("Result", resolveIP($"IP"))
If you do not have that many distinct IP addresses, there is another solution: you could do everything in the driver.
val ipMap = data.select("IP").distinct.collect
.map(/* calls to the non serializable IpLookups but that's ok, we are in the driver*/)
val resolveIP = udf((ip : String) => ipMap(ip))
data.withColumn("Result", resolveIP($"IP"))
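The driver-side variant could look roughly like this. It is a sketch under assumptions: resolveOnDriver is a hypothetical stand-in for the non-serializable IpLookups call, and the sample addresses are made up. Only the resulting immutable Map, which is serializable, is captured by the UDF.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object DriverSideLookupSketch {

  // Hypothetical stand-in for the non-serializable lookup; it runs on the driver only.
  def resolveOnDriver(ip: String): String =
    if (ip.startsWith("10.")) "private" else "public"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("driver-lookup-sketch").getOrCreate()
    import spark.implicits._

    val data = Seq("10.0.0.1", "8.8.8.8", "10.0.0.1").toDF("IP") // made-up sample rows

    // Resolve every distinct IP once on the driver and keep the results in a Map.
    val ipMap: Map[String, String] = data.select("IP").distinct().collect()
      .map(row => row.getString(0))
      .map(ip => ip -> resolveOnDriver(ip))
      .toMap

    // The UDF closes over the Map only, never over the lookup object.
    val resolveIP = udf((ip: String) => ipMap.getOrElse(ip, ""))
    data.withColumn("Result", resolveIP($"IP")).show()

    spark.stop()
  }
}
```

This only suits cases where the distinct IPs fit comfortably in driver memory, as the answer already notes.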

Spark: How to get String value while generating output file

I have two files
The StudentId in the first file should be replaced with the StudentName and Course from the second file.
Once replaced I need to generate a new CSV with complete details like
Code used
val studentRDD = sc.textFile(file path);
val studentdetailsRDD = sc.textFile(file path);
val studentB = sc.broadcast(studentdetailsRDD.collect)
//Generating CSV
studentRDD.map{student =>
val name = getName(student.StudentId)
val course = getCourse(student.StudentId)
Array(name, course, student.City)
}.mapPartitions{data =>
val stringWriter = new StringWriter();
val csvWriter =new CSVWriter(stringWriter);
//Functions defined to get details
def getName(studentId : String) { studentB.value.map{stud =>if(studentId == stud.StudentId) stud.StudentName} }
def getCourse(studentId : String) { studentB.value.map{stud =>if(studentId == stud.StudentId) stud.Course} }
The file gets generated, but the values are object representations instead of String values.
How can I get the string values instead of objects?
As suggested in another answer, Spark's DataFrame API is especially suitable for this, as it easily supports joining two DataFrames, and writing CSV files.
However, if you insist on staying with the RDD API, the main issue with your code looks to be the lookup functions: getName and getCourse basically do nothing, because their return type is Unit. They are declared with procedure syntax (no = before the body), so whatever the body computes is discarded. On top of that, an if without an else yields the Unit value () for inputs that don't match, so the inner expression is not a String either.
To fix this, it's easier to get rid of them and simplify the lookup by broadcasting a Map:
// better to broadcast a Map instead of an Array, would make lookups more efficient
val studentB = sc.broadcast(studentdetailsRDD.keyBy(_.StudentId).collectAsMap())

// convert to RDD[String] with the wanted formatting
val resultStrings = studentRDD.map { student =>
  val details = studentB.value(student.StudentId)
  Array(details.StudentName, details.Course, student.City)
}
.map(_.mkString(",")) // naive CSV writing with no escaping etc., you can also use CSVWriter like you did
// save as text file
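The Unit problem described in this answer can be reproduced without Spark. The sketch below uses a hypothetical StudentDetails case class and made-up rows; nameOrUnit mirrors the if-without-else lookup, while name shows one way to return a real value.

```scala
object LookupReturnTypes {
  final case class StudentDetails(studentId: String, studentName: String, course: String)

  // Made-up rows standing in for the broadcast student details.
  val details: Seq[StudentDetails] = Seq(
    StudentDetails("101", "ABC", "C001"),
    StudentDetails("102", "XYZ", "C002")
  )

  // `if` without `else` is typed Any: a non-matching row evaluates to the Unit value ().
  def nameOrUnit(id: String): Seq[Any] =
    details.map(s => if (s.studentId == id) s.studentName)

  // Returning Option[String] makes the "not found" case explicit instead.
  def name(id: String): Option[String] =
    details.find(_.studentId == id).map(_.studentName)

  def main(args: Array[String]): Unit = {
    println(nameOrUnit("101")) // contains "ABC" and () rather than plain strings
    println(name("101"))
    println(name("999"))
  }
}
```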
Spark has great support for joining and for writing to files. The join takes only one line of code, and so does the write.
Hand-writing that code is error-prone, hard to read, and most likely very slow.
val df1 = Seq((101,"NDLS"),
).toDF("id", "city")
val df2 = Seq((101,"ABC","C001"),
).toDF("id", "name", "course")
val dfResult = df1.join(df2, "id").select("id", "city", "name")
A directory will be created. It contains only one file, which is the final result.
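A self-contained sketch of the join-and-write approach, assuming a local Spark session. The second row in each DataFrame, the selected columns, and the temporary output directory are assumptions made for illustration; coalesce(1) is what yields a single part file inside the output directory.

```scala
import java.io.File
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

object JoinAndWriteSketch {

  // Joins the two inputs on "id", writes the result as CSV, and returns
  // how many part files ended up in the output directory.
  def run(spark: SparkSession, outputDir: String): Int = {
    import spark.implicits._

    val df1 = Seq((101, "NDLS"), (102, "BOM")).toDF("id", "city") // second row is made up
    val df2 = Seq((101, "ABC", "C001"), (102, "XYZ", "C002")).toDF("id", "name", "course")

    df1.join(df2, "id")
      .select("name", "course", "city")
      .coalesce(1) // one partition, so one part file
      .write.option("header", "true").mode("overwrite").csv(outputDir)

    new File(outputDir).listFiles().count(_.getName.startsWith("part-"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("join-write-sketch").getOrCreate()
    val out = Files.createTempDirectory("students").resolve("result").toString
    println(s"part files written to $out: ${run(spark, out)}")
    spark.stop()
  }
}
```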

Scala - Remove header from Pair RDD

I am new to Scala and want to remove the header from my data. I have the data below
and I am using the code below to read it
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object PAN {
  def main(args: Array[String]): Unit = {
    case class income(recordid : Int, income : Int)
    val sc = new SparkContext(new SparkConf().setAppName("income").setMaster("local[2]"))
    val income_data = sc.textFile("file:///home/user/Documents/income_info.txt").map(_.split(","))
    val income_recs = income_data.map(r => (r(0).toInt, income(r(0).toInt, r(1).toInt)))
  }
}
I want to remove the header from the pair RDD but can't figure out how.
I was experimenting with the code below
val header = income_data.first()
val a = income_data.filter(row => row != header)
a.foreach { println }
but it returns the output below
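Your technique of removing the header by filtering it out is the right idea, with one caveat: after split(",") each row is an Array[String], and Scala arrays compare by reference, so row != header is true for every row, including the header. Filter the raw lines before splitting, or compare with sameElements. The other problem is how you are trying to print the array.
Arrays in Scala do not override toString, so when you print one you get the default string representation, which is just the type name and hash code and usually not very useful.
If you want to print an array, turn it into a string first using the mkString method on the array, or print its elements with foreach(println):
a.foreach {array => println(array.mkString("[",", ","]"))}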
a.foreach {array => array.foreach{println}}
Both will print out the elements of your array so you can see what they contain.
Keep in mind that when working with Spark, printing inside transformations and actions only works in local mode. Once you move to a cluster, the work is done on remote executors, so the output ends up in the executors' logs rather than on your console.
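The difference between the default representation and mkString can be seen in plain Scala, without Spark. The record values below are made up, and render is just a hypothetical helper name.

```scala
object ArrayPrintingSketch {

  // Joins the elements with a prefix, separator and suffix.
  def render(row: Array[String]): String = row.mkString("[", ", ", "]")

  def main(args: Array[String]): Unit = {
    val row = Array("1", "4500") // made-up record

    println(row)         // default toString: a type tag followed by a hash code
    println(render(row)) // [1, 4500]
    row.foreach(println) // one element per line
  }
}
```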
val income_data = sc.textFile("file:///home/user/Documents/income_info.txt")
Creating the RDD with textFile gives you an RDD[String]. Calling collect() on it returns an Array[String], and drop(n) is a method on Array that returns the array without its first n elements, so drop(1) removes the header. Note that collect() brings all rows to the driver, so this only suits small files.
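For larger inputs, one alternative that stays distributed is to filter the header out of the raw lines before splitting them. Strings compare by value, so the comparison behaves as intended. This is a sketch under assumptions: the sample lines (including the header text) are made up, and the case class is declared at object level so Spark can serialize it.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object DropHeaderSketch {
  final case class Income(recordId: Int, income: Int)

  // Turns one data line into the (key, record) pair used in the question.
  def parse(line: String): (Int, Income) = {
    val fields = line.split(",")
    val id = fields(0).trim.toInt
    (id, Income(id, fields(1).trim.toInt))
  }

  // Removes every line equal to the first line of the RDD.
  def withoutHeader(lines: RDD[String]): RDD[String] = {
    val header = lines.first()
    lines.filter(_ != header)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("drop-header-sketch").getOrCreate()

    // Made-up file content; sc.textFile(...) would produce the same kind of RDD[String].
    val lines = spark.sparkContext.parallelize(Seq("recordid,income", "1,4500", "2,3200"))

    withoutHeader(lines).map(parse).collect().foreach(println)
    spark.stop()
  }
}
```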