Scala - Turning RDD[String] into a Map

I have a very large file that contains individual JSONs which I would like to iterate through, turning each one into a Map using the Jackson library:
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.ScalaObjectMapper
val mapper = new ObjectMapper() with ScalaObjectMapper
mapper.registerModule(DefaultScalaModule)
val lines = sc.textFile(fileName)
On a single JSON string, I can run the following without issues:
mapper.readValue[Map[String, Object]](JSONString)
to get my map.
However, if I try to iterate through an RDD[String] like so:
lines.foreach(line => mapper.readValue[Map[String, Object]](line))
I get the following error:
org.apache.spark.SparkException: Task not serializable
I can do lines.take(10000) or so and then work on that, but the file is so huge that I can't "take" or "collect" it all in one go, and I want the same solution to work across files of all sizes.
After each string becomes a Map, I need to perform functions on it and write it back out as a string, so any solution that lets me do that without exceeding my allocated memory will help. Thank you!

Managed to solve this with the below:
import scala.util.parsing.json._
val myMap = JSON.parseFull(jsonString).get.asInstanceOf[Map[String, Object]]
The above will work on an RDD[String]
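For reference, the original Jackson approach can also be made to work by constructing the ObjectMapper inside mapPartitions, so that it is created on each executor rather than serialized from the driver. A minimal sketch, assuming the same lines RDD and the Jackson Scala module shown above:
// Build the mapper inside mapPartitions so it lives on the executors and
// never needs to be serialized (which is what caused "Task not serializable").
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.ScalaObjectMapper
val maps = lines.mapPartitions { iter =>
  val mapper = new ObjectMapper() with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
  iter.map(line => mapper.readValue[Map[String, Object]](line))
}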

Related

How to convert a java.io list to a DataFrame in Scala?

I'm using this code to get the list of files in a directory and want to call the toDF method that works when converting lists to DataFrames. However, because this is a java.io list, it's saying it won't work.
val files = Option(new java.io.File("data").list).map(_.count(_.endsWith(".csv"))).getOrElse(0)
When I try to do
files.toDF.show()
I get this error:
How can I get this to work? Can someone help me with the code to convert this java.io List to a regular list?
Thanks
val files = Option(new java.io.File("data").list).map(_.count(_.endsWith(".csv"))).getOrElse(0)
The above code returns an Int, and you are trying to convert an Int value to a DataFrame, which is not possible. If I understand correctly, you want to convert the list of .csv files into a DataFrame. Please use the code below:
val files = Option(new java.io.File("data").list).get.filter(x => x.endsWith(".csv")).toList
import spark.implicits._
files.toDF().show()
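A slightly more defensive variant of the same idea (a sketch only; File.list can return null, and spark is assumed to be an active SparkSession):
import spark.implicits._
// Keep the Option instead of calling .get, and fall back to an empty list
// when the directory does not exist or cannot be read.
val csvFiles: List[String] =
  Option(new java.io.File("data").list)
    .map(_.filter(_.endsWith(".csv")).toList)
    .getOrElse(Nil)
csvFiles.toDF("file_name").show()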

Reading list of input textFiles where individual file-names contain commas

I have a folder on HDFS which, for whatever reason, contains part-files with commas in their names. For instance:
hdfs://namespace/mypath/1-1,123
hdfs://namespace/mypath/1-2,124
hdfs://namespace/mypath/1-3,125
The issue is that I only want to read some of the part files at a time, to avoid overloading my cluster; in this case I want to read the files 1-1,123 and 1-2,124.
However, when the path is fed to Spark as:
sc.textFile("hdfs://namespace/mypath/1-1,123,hdfs://namespace/mypath/1-2,124")
Spark seems to just tokenize on ",", thereby assuming I'm looking for four separate files.
Is there a way to escape the commas in the path?
Is the only option to rename the source files?
SparkContext.textFile eventually calls FileInputFormat.setInputPaths(Job job, String commaSeparatedPaths), which simply splits the input String on "," to obtain the list of paths:
Sets the given comma separated paths as the list of inputs for the map-reduce job.
One way to bypass this limitation is to use the alternative signature of setInputPaths, FileInputFormat.setInputPaths(Job job, Path... inputPaths), which takes a vararg of Path objects. This way there is no need to split on "," and thus no possible confusion.
To do that, we have to create our own textFile method which does exactly the same thing as SparkContext.textFile: it builds a HadoopRDD, but this time with the input provided as a List of Strings instead of a single String:
package org.apache.spark

import org.apache.spark.rdd.{RDD, HadoopRDD}
import org.apache.spark.util.SerializableConfiguration
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.fs.Path

object TextFileOverwrite {

  implicit class SparkContextExtension(val sc: SparkContext) extends AnyVal {

    def textFile(
        paths: Seq[String],
        minPartitions: Int = sc.defaultMinPartitions
    ): RDD[String] = {

      val confBroadcast =
        sc.broadcast(new SerializableConfiguration(sc.hadoopConfiguration))

      val setInputPathsFunc =
        (jobConf: JobConf) =>
          FileInputFormat.setInputPaths(jobConf, paths.map(p => new Path(p)): _*)

      new HadoopRDD(
        sc,
        confBroadcast,
        Some(setInputPathsFunc),
        classOf[TextInputFormat],
        classOf[LongWritable],
        classOf[Text],
        minPartitions
      ).map(pair => pair._2.toString)
    }
  }
}
which can be used this way:
import org.apache.spark.TextFileOverwrite.SparkContextExtension
sc.textFile(Seq("path/hello,world.txt", "path/hello_world.txt"))
Compared to SparkContext.textFile, the only difference in the implementation is the call to FileInputFormat.setInputPaths, which takes Paths as input instead of a comma-separated String.
Note that I use the package org.apache.spark to store this function, because SerializableConfiguration has private[spark] visibility in Spark's code base.
Also note the use of an implicit class on SparkContext, which lets us attach this additional textFile method directly to the SparkContext object and thus call it as sc.textFile(...) instead of passing the SparkContext as a parameter of the method.
Also note that I would have preferred taking Seq[Path] instead of Seq[String] as the input of this method, but Path is not yet Serializable in the version of hadoop-common currently used by Spark (it becomes Serializable starting with version 3 of hadoop-common).
Use filename globbing, assuming that this gives you unique files:
sc.textFile("hdfs://namespace/mypath/1-1?123,hdfs://namespace/mypath/1-2?124")
This doesn't work if you only want the first of these and not the other two:
hdfs://namespace/mypath/1-1,123
hdfs://namespace/mypath/1-1:123
hdfs://namespace/mypath/1-1.123
I was going to suggest this:
sc.textFile("hdfs://namespace/mypath/1-1[,]123, ...
And I think that's supposed to work. Looking at the code for org.apache.hadoop.mapred.FileInputFormat#getPathStrings makes me suspicious, though. It looks like that function only protects commas inside curly braces, so it will still split if you put a comma inside [,].

How to convert Dataset to a Scala Iterable?

Is there a way to convert a org.apache.spark.sql.Dataset to a scala.collection.Iterable? It seems like this should be simple enough.
You can do myDataset.collect or myDataset.collectAsList.
But then it will no longer be distributed. If you want to be able to spread your computations out on multiple machines you need to use one of the distributed datastructures such as RDD, Dataframe or Dataset.
You can also use toLocalIterator if you just need to iterate the contents on the driver, as it has the advantage of only loading one partition at a time into memory, instead of the entire dataset. An Iterator is not an Iterable (although it is a TraversableOnce), but depending on what you are doing it may be what you want.
You could try something like this:
def toLocalIterable[T](dataset: Dataset[T]): Iterable[T] = new Iterable[T] {
  def iterator = scala.collection.JavaConverters.asScalaIterator(dataset.toLocalIterator)
}
The conversion via JavaConverters.asScalaIterator is necessary because the toLocalIterator method of Dataset returns a java.util.Iterator instead of a scala.collection.Iterator (which is what toLocalIterator on RDD returns). I suspect this is a bug.
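A hypothetical usage sketch of the helper above (spark is assumed to be an active SparkSession, and the Dataset here is made up for illustration):
import spark.implicits._
// Rows are pulled to the driver one partition at a time rather than all at once.
val people = Seq("alice", "bob", "carol").toDS()
toLocalIterable(people).foreach(println)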
In Scala 2.11 you can do the following:
import scala.collection.JavaConverters._
dataset.toLocalIterator.asScala.toIterable

Spark 2.0 Scala - RDD.toDF()

I am working with Spark 2.0 Scala. I am able to convert an RDD to a DataFrame using the toDF() method.
val rdd = sc.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
But for the life of me I cannot find it in the API docs. It is not under RDD, but it is under Dataset (link 1). However, I have an RDD, not a Dataset.
Also I can't see it under implicits (link 2).
So please help me understand why toDF() can be called for my RDD. Where is this method being inherited from?
It's coming from here:
Spark 2 API
Explanation: if you import sqlContext.implicits._, you get an implicit method that converts an RDD to a DatasetHolder (rddToDatasetHolder), and then you call toDF on the DatasetHolder.
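A sketch of that chain in code (spark and sc are assumed to be the active SparkSession and SparkContext; the method and class names come from Spark's SQLImplicits):
import spark.implicits._  // brings rddToDatasetHolder into scope
val rdd = sc.textFile("/pathtologfile/logfile.txt")
// rdd is implicitly wrapped: rddToDatasetHolder(rdd) yields a DatasetHolder[String]
val df = rdd.toDF()       // toDF is defined on DatasetHolder, not on RDD
df.show()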
Yes, you should import the sqlContext implicits like this:
val sqlContext = //create sqlContext
import sqlContext.implicits._
val df = rdd.toDF()
before you call toDF on your RDDs.
Yes, I finally have peace of mind on this issue. It was troubling me like hell, and this post is a life saver. I was trying to generically load data from log files into a case class object, collecting it in a mutable List, with the idea of finally converting the list into a DataFrame. However, since the list was mutable and Spark 2.1.1 changed the toDF implementation, it was not getting converted. I had even considered saving the data to a file and loading it back using .read, but five minutes ago this post saved my day.
I did it exactly the way described.
After loading the data into the mutable list, I immediately used:
import spark.sqlContext.implicits._
val df = <mutable list object>.toDF
df.show()
I have done just this with Spark 2, and it worked:
val orders = sc.textFile("/user/gd/orders")
val ordersDF = orders.toDF()

Run a read-only test in Spark

I want to compare the read performance of different storage systems using Spark, e.g. HDFS and S3N. I have written a small Scala program for this:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    val file = sc.textFile("s3n://test/wordtest")
    val splits = file.map(word => word)
    splits.saveAsTextFile("s3n://test/myoutput")
  }
}
My question is, is it possible to run a read-only test with Spark? For the program above, isn't saveAsTextFile() causing some write as well?
I am not sure if that is possible at all. In order to run a transformation, a subsequent action is necessary.
From the official Spark documentation:
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
Taking this into account, saveAsTextFile is not the lightest of the wide range of actions available. Several lightweight alternatives exist, for example count or first. These leave almost all of the work to the transformation phase, letting you measure the read performance of your solution.
You might want to check the available actions and choose the one that best fits your requirements.
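A minimal read-only sketch along those lines, assuming the same s3n://test/wordtest path as above: count() forces every line to be read but writes nothing back out.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object ReadOnlyTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ReadOnlyTest")
    val sc = new SparkContext(conf)

    // Time a full scan of the input without producing any output files.
    val start = System.nanoTime()
    val lineCount = sc.textFile("s3n://test/wordtest").count()
    val elapsedMs = (System.nanoTime() - start) / 1000000

    println(s"Read $lineCount lines in $elapsedMs ms")
    sc.stop()
  }
}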
Yes."saveAsTextFile" writes the RDD data to text file using given path.