How to convert Dataset to a Scala Iterable? - scala

Is there a way to convert a org.apache.spark.sql.Dataset to a scala.collection.Iterable? It seems like this should be simple enough.

You can do myDataset.collect or myDataset.collectAsList.
But then it will no longer be distributed. If you want to be able to spread your computations out on multiple machines you need to use one of the distributed datastructures such as RDD, Dataframe or Dataset.
You can also use toLocalIterator if you just need to iterate the contents on the driver as it has the advantage of only loading one partition at a time, instead of the entire dataset, into memory. Iterator is not an Iterable (although it is a Traverable) but depending on what you are doing it may be what you want.

You could try something like this:
def toLocalIterable[T](dataset: Dataset[T]): Iterable[T] = new Iterable[T] {
def iterator = scala.collection.JavaConverters.asScalaIterator(dataset.toLocalIterator)
}
The conversion via JavaConverters.asScalaIterator is necessary because the toLocalIterator method of Dataset returns a java.util.Iterator instead of a scala.collection.Iterator (which is what the toLocalIterator on RDD returns.) I suspect this is a bug.

In Scala 2.11 you can do the following:
import scala.collection.JavaConverters._
dataset.toLocalIterator.asScala.toIterable

Related

org.mockito.exceptions.misusing.WrongTypeOfReturnValue on spark test cases

I'm currently writing test cases for spark with mockito and I'm mocking a sparkContext which gets wholeTextFiles called on it. I have something like this
val rdd = sparkSession.sparkContext.makeRDD(Seq(("Java", 20000),
("Python", 100000), ("Scala", 3000)))
doReturn(rdd).when(mockContext).wholeTextFiles(testPath)
However, I keep getting an error saying wholeTextFiles is supposed to output an int
org.mockito.exceptions.misusing.WrongTypeOfReturnValue: ParallelCollectionRDD cannot be returned by wholeTextFiles$default$2()
wholeTextFiles$default$2() should return int
I know this isn't the case, the spark docs say that wholeTextFiles returns an RDD object, any tips on how I can fix this? I can't have my doReturn be of type int, because then the rest of my function fails, since I turn the wholeTextFiles output into a dataframe.
Resolved this with some alternative approaches. It works just fine if I do the Mockito.when() thenReturn pattern, but I decided to instead change the entire test case to start a local sparksession and load some sample files into an rdd instead, because that's a more in depth test of my function imo

Adding contents in an RDD[(Array[String], Long)] into a new array into a new RDD: RDD[Array[(Array[String], Long)]]

I have an RDD[Array[String]] which I zipWithIndex:
val dataWithIndex = data.zipWithIndex()
Now I have a RDD[(Array[String], Long)], I would like to add all the pairs in the RDD to an array and still have it in the RDD. Is there an efficient way to do so? My final datastructure should be RDD[Array[(Array[String], Long)]] where the RDD essentially only contains one element.
Right now I do the following, but it is very ineffective because of collect():
val dataWithIndex = data.zipWithIndex()
val dataNoRDD = dataWithIndex.collect()
val dataArr = ListBuffer[Array[(Array[String], Long)]]()
dataArr += dataNoRDD
val initData = sc.parallelize(dataArr)
The conclusion is that this seems to be extremely hard to do with standard functionality.
Instead, if the input comes from a Hadoop filesystem it is possible to do. This can be done by extending certain Hadoop classes.
First you need to implement WritableComparable<> and define a custom format that the RDD will contain. In order for this to work, you need to define a custom FileInputFormat and extend it in order to support your custom Writable. In order for FileInputFormat to know what to do with data being read, a custom RecordReader has to be written by extending it and here specifically the method nextKeyValue() has to be written which defines what each RDD element will contain. All of these three are written in Java, but with some simple tricks it is possible to do.

Spark Notebook: Does GeoPointsChart accept a Dataframe?

I have a Dataframe which has two columns latitude and longitude. I passed that to GeoPointsChart. The output is "showing 1000 rows" but it isn't actually showing me anything. Has anyone faced the same issue? Is this a syntactical mistake?
I have not worked with this notebook, but it looks like you have an API call somewhere producing a java.util.List. That class does not have a toSeq method. You want to convert java.util.List into its Scala equivalent.
First, import this:
import scala.collection.JavaConverters._
This import enriches (or pimps) the Java collections with an asScala method to do the conversion:
val testAsScala = test.asScala.toSeq
I would note though that the call to toSeq is unnecessary since the result already mixes in Seq. Still, with asScala you can now work entirely with Scala collections, which are so much easier.

Sort by dateTime in scala

I have an RDD[org.joda.time.DateTime]. I would like to sort records by date in scala.
Input - sample data after applying collect() below -
res41: Array[org.joda.time.DateTime] = Array(2016-10-19T05:19:07.572Z, 2016-10-12T00:31:07.572Z, 2016-10-18T19:43:07.572Z)
Expected Output
2016-10-12T00:31:07.572Z
2016-10-18T19:43:07.572Z
2016-10-19T05:19:07.572Z
I have googled and checked following link but could not understand it -
How to define an Ordering in Scala?
Any help?
If you collect the records of your RDD, then you can apply the following sorting:
array.sortBy(_.getMillis)
On the contrary, if your RDD is big and you do not want to collect it to the driver, you should consider:
rdd.sortBy(_.getMillis)
You can define an implicit ordering for org.joda.time.DateTime like so;
implicit def ord: Ordering[DateTime] = Ordering.by(_.getMillis)
Which looks at the milliseconds of a DateTime and sorts based on that.
You can then either ensure that the implicit is in your scope or just use it more explicitly:
arr.sorted(ord)

how to make saveAsTextFile NOT split output into multiple file?

When using Scala in Spark, whenever I dump the results out using saveAsTextFile, it seems to split the output into multiple parts. I'm just passing a parameter(path) to it.
val year = sc.textFile("apat63_99.txt").map(_.split(",")(1)).flatMap(_.split(",")).map((_,1)).reduceByKey((_+_)).map(_.swap)
year.saveAsTextFile("year")
Does the number of outputs correspond to the number of reducers it uses?
Does this mean the output is compressed?
I know I can combine the output together using bash, but is there an option to store the output in a single text file, without splitting?? I looked at the API docs, but it doesn't say much about this.
The reason it saves it as multiple files is because the computation is distributed. If the output is small enough such that you think you can fit it on one machine, then you can end your program with
val arr = year.collect()
And then save the resulting array as a file, Another way would be to use a custom partitioner, partitionBy, and make it so everything goes to one partition though that isn't advisable because you won't get any parallelization.
If you require the file to be saved with saveAsTextFile you can use coalesce(1,true).saveAsTextFile(). This basically means do the computation then coalesce to 1 partition. You can also use repartition(1) which is just a wrapper for coalesce with the shuffle argument set to true. Looking through the source of RDD.scala is how I figured most of this stuff out, you should take a look.
For those working with a larger dataset:
rdd.collect() should not be used in this case as it will collect all data as an Array in the driver, which is the easiest way to get out of memory.
rdd.coalesce(1).saveAsTextFile() should also not be used as the parallelism of upstream stages will be lost to be performed on a single node, where data will be stored from.
rdd.coalesce(1, shuffle = true).saveAsTextFile() is the best simple option as it will keep the processing of upstream tasks parallel and then only perform the shuffle to one node (rdd.repartition(1).saveAsTextFile() is an exact synonym).
rdd.saveAsSingleTextFile() as provided bellow additionally allows one to store the rdd in a single file with a specific name while keeping the parallelism properties of rdd.coalesce(1, shuffle = true).saveAsTextFile().
Something that can be inconvenient with rdd.coalesce(1, shuffle = true).saveAsTextFile("path/to/file.txt") is that it actually produces a file whose path is path/to/file.txt/part-00000 and not path/to/file.txt.
The following solution rdd.saveAsSingleTextFile("path/to/file.txt") will actually produce a file whose path is path/to/file.txt:
package com.whatever.package
import org.apache.spark.rdd.RDD
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.hadoop.io.compress.CompressionCodec
object SparkHelper {
// This is an implicit class so that saveAsSingleTextFile can be attached to
// SparkContext and be called like this: sc.saveAsSingleTextFile
implicit class RDDExtensions(val rdd: RDD[String]) extends AnyVal {
def saveAsSingleTextFile(path: String): Unit =
saveAsSingleTextFileInternal(path, None)
def saveAsSingleTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit =
saveAsSingleTextFileInternal(path, Some(codec))
private def saveAsSingleTextFileInternal(
path: String, codec: Option[Class[_ <: CompressionCodec]]
): Unit = {
// The interface with hdfs:
val hdfs = FileSystem.get(rdd.sparkContext.hadoopConfiguration)
// Classic saveAsTextFile in a temporary folder:
hdfs.delete(new Path(s"$path.tmp"), true) // to make sure it's not there already
codec match {
case Some(codec) => rdd.saveAsTextFile(s"$path.tmp", codec)
case None => rdd.saveAsTextFile(s"$path.tmp")
}
// Merge the folder of resulting part-xxxxx into one file:
hdfs.delete(new Path(path), true) // to make sure it's not there already
FileUtil.copyMerge(
hdfs, new Path(s"$path.tmp"),
hdfs, new Path(path),
true, rdd.sparkContext.hadoopConfiguration, null
)
// Working with Hadoop 3?: https://stackoverflow.com/a/50545815/9297144
hdfs.delete(new Path(s"$path.tmp"), true)
}
}
}
which can be used this way:
import com.whatever.package.SparkHelper.RDDExtensions
rdd.saveAsSingleTextFile("path/to/file.txt")
// Or if the produced file is to be compressed:
import org.apache.hadoop.io.compress.GzipCodec
rdd.saveAsSingleTextFile("path/to/file.txt.gz", classOf[GzipCodec])
This snippet:
First stores the rdd with rdd.saveAsTextFile("path/to/file.txt") in a temporary folder path/to/file.txt.tmp as if we didn't want to store data in one file (which keeps the processing of upstream tasks parallel)
And then only, using the hadoop file system api, we proceed with the merge (FileUtil.copyMerge()) of the different output files to create our final output single file path/to/file.txt.
You could call coalesce(1) and then saveAsTextFile() - but it might be a bad idea if you have a lot of data. Separate files per split are generated just like in Hadoop in order to let separate mappers and reducers write to different files. Having a single output file is only a good idea if you have very little data, in which case you could do collect() as well, as #aaronman said.
As others have mentioned, you can collect or coalesce your data set to force Spark to produce a single file. But this also limits the number of Spark tasks that can work on your dataset in parallel. I prefer to let it create a hundred files in the output HDFS directory, then use hadoop fs -getmerge /hdfs/dir /local/file.txt to extract the results into a single file in the local filesystem. This makes the most sense when your output is a relatively small report, of course.
In Spark 1.6.1 the format is as shown below. It creates a single output file.It is best practice to use it if the output is small enough to handle.Basically what it does is that it returns a new RDD that is reduced into numPartitions partitions.If you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1)
pair_result.coalesce(1).saveAsTextFile("/app/data/")
You can call repartition() and follow this way:
val year = sc.textFile("apat63_99.txt").map(_.split(",")(1)).flatMap(_.split(",")).map((_,1)).reduceByKey((_+_)).map(_.swap)
var repartitioned = year.repartition(1)
repartitioned.saveAsTextFile("C:/Users/TheBhaskarDas/Desktop/wc_spark00")
You will be able to do it in the next version of Spark, in the current version 1.0.0 it's not possible unless you do it manually somehow, for example, like you mentioned, with a bash script call.
I also want to mention that the documentation clearly states that users should be careful when calling coalesce with a real small number of partitions . this can cause upstream partitions to inherit this number of partitions.
I would not recommend using coalesce(1) unless really required.
Here's my answer to output a single file. I just added coalesce(1)
val year = sc.textFile("apat63_99.txt")
.map(_.split(",")(1))
.flatMap(_.split(","))
.map((_,1))
.reduceByKey((_+_)).map(_.swap)
year.saveAsTextFile("year")
Code:
year.coalesce(1).saveAsTextFile("year")