Sort by DateTime in Scala

I have an RDD[org.joda.time.DateTime]. I would like to sort the records by date in Scala.
Input (sample data after applying collect()):
res41: Array[org.joda.time.DateTime] = Array(2016-10-19T05:19:07.572Z, 2016-10-12T00:31:07.572Z, 2016-10-18T19:43:07.572Z)
Expected Output
2016-10-12T00:31:07.572Z
2016-10-18T19:43:07.572Z
2016-10-19T05:19:07.572Z
I have googled and checked the following link, but could not understand it:
How to define an Ordering in Scala?
Any help?

If you collect the records of your RDD, then you can apply the following sorting:
array.sortBy(_.getMillis)
If, on the other hand, your RDD is big and you do not want to collect it to the driver, you should consider:
rdd.sortBy(_.getMillis)
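For example, a minimal end-to-end sketch (assuming an existing SparkContext named sc and joda-time on the classpath; the sample values mirror the question's collect() output):
import org.joda.time.DateTime

// hypothetical sample data
val rdd = sc.parallelize(Seq(
  new DateTime("2016-10-19T05:19:07.572Z"),
  new DateTime("2016-10-12T00:31:07.572Z"),
  new DateTime("2016-10-18T19:43:07.572Z")
))

// distributed sort: stays an RDD, ordered by the underlying epoch millis
val sortedRdd = rdd.sortBy(_.getMillis)

// local sort after collecting to the driver
val sortedArray = rdd.collect().sortBy(_.getMillis)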

You can define an implicit ordering for org.joda.time.DateTime like so:
implicit def ord: Ordering[DateTime] = Ordering.by(_.getMillis)
This looks at the milliseconds of each DateTime and sorts based on them.
You can then either ensure that the implicit is in your scope or just use it more explicitly:
arr.sorted(ord)
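As a small self-contained sketch (the values are hypothetical): with the implicit in scope, sorted picks it up automatically, and an RDD[DateTime] can be sorted on the DateTime itself:
import org.joda.time.DateTime

implicit val ord: Ordering[DateTime] = Ordering.by(_.getMillis)

val arr = Array(
  new DateTime("2016-10-19T05:19:07.572Z"),
  new DateTime("2016-10-12T00:31:07.572Z")
)

val byImplicit = arr.sorted        // picks up the implicit ordering
val byExplicit = arr.sorted(ord)   // or pass it explicitly
// rdd.sortBy(identity) would pick up the same Ordering for an RDD[DateTime]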

Related

Spark Scala: functional difference in notation using $?

Is there a functional difference between the following two expressions? The result looks the same to me, but I'm curious if there's an unknown unknown. What does the $ symbol indicate, and how is it read?
df1.orderBy($"reasonCode".asc).show(10, false)
df1.orderBy(asc("reasonCode")).show(10, false)
Those two statements are equivalent and will lead to identical results.
The $ notation is specific to Spark's Scala API and refers to the implicit StringToColumn class, whose $ method interprets the subsequent string "reasonCode" as a Column:
implicit class StringToColumn(val sc: StringContext) {
  def $(args: Any*): ColumnName = {
    new ColumnName(sc.s(args: _*))
  }
}
In Scala Spark you have many ways to select a column. I have written down a full list of the syntax varieties in another answer on selecting specific columns from a Spark DataFrame.
Using different notations does not have any impact on performance, as they are all translated into the same set of RDD instructions by Spark's Catalyst optimizer.
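For completeness, a small sketch showing where $ comes from (it assumes a local SparkSession named spark; the spark.implicits._ import is what brings StringToColumn, and therefore $"...", into scope, and df1 here is just a made-up sample frame):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.asc

val spark = SparkSession.builder().master("local[*]").appName("dollar-notation").getOrCreate()
import spark.implicits._  // brings the StringToColumn implicit (and $) into scope

val df1 = Seq(("A", 3), ("B", 1), ("C", 2)).toDF("reasonCode", "count")

// both notations resolve to the same Column and produce the same plan
df1.orderBy($"reasonCode".asc).show(10, false)
df1.orderBy(asc("reasonCode")).show(10, false)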

Sorting by DateTime in Slick

I'm currently going through a rough point in Slick. I'm trying to sort a query on a table by a timestamp column:
TableName.filter(tableAttribute === 1).sortBy(_.tableTimestamp)
The timestamp is of type joda.DateTime within Slick. When I try to sort, I'm getting the following error:
No implicit view available from dao.Tables.profile.api.Rep[org.joda.time.DateTime] => slick.lifted.Ordered.
I'm assuming that this isn't built into Slick. Is there a quick and clean way to add an implicit view and solve this?
Thanks!
You might be looking for an implicit Ordering built with Ordering.fromLessThan, like below:
import org.joda.time.DateTime
implicit def datetimeOrdering: Ordering[DateTime] = Ordering.fromLessThan(_ isBefore _)
In case you want to reverse the ordering, simply replace isBefore with isAfter.
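As a quick sanity check of that Ordering (a sketch with made-up values; note this sorts DateTime values in memory rather than at the query level):
import org.joda.time.DateTime

implicit def datetimeOrdering: Ordering[DateTime] = Ordering.fromLessThan(_ isBefore _)

val stamps = Seq(
  new DateTime("2017-03-02T10:15:30.000Z"),
  new DateTime("2017-03-01T08:00:00.000Z")
)

val inOrder = stamps.sorted  // earliest first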

Adding the contents of an RDD[(Array[String], Long)] into a new array inside a new RDD: RDD[Array[(Array[String], Long)]]

I have an RDD[Array[String]] which I zipWithIndex:
val dataWithIndex = data.zipWithIndex()
Now I have an RDD[(Array[String], Long)]. I would like to add all the pairs in the RDD to an array and still have it in an RDD. Is there an efficient way to do so? My final data structure should be RDD[Array[(Array[String], Long)]], where the RDD essentially only contains one element.
Right now I do the following, but it is very inefficient because of collect():
import scala.collection.mutable.ListBuffer

val dataWithIndex = data.zipWithIndex()
val dataNoRDD = dataWithIndex.collect()
val dataArr = ListBuffer[Array[(Array[String], Long)]]()
dataArr += dataNoRDD
val initData = sc.parallelize(dataArr)
The conclusion is that this seems to be extremely hard to do with standard functionality.
However, if the input comes from a Hadoop filesystem, it is possible by extending certain Hadoop classes.
First you need to implement WritableComparable<> and define the custom format that the RDD will contain. For this to work, you also need to define a custom FileInputFormat and extend it to support your custom Writable. For the FileInputFormat to know what to do with the data being read, a custom RecordReader has to be written by extending it; specifically, its nextKeyValue() method defines what each RDD element will contain. All three of these Hadoop classes are written in Java, but with some simple tricks they can be extended from Scala, as sketched below.
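A hypothetical Scala skeleton of the FileInputFormat/RecordReader half (it emits one record per input file, so the resulting RDD gets one element per file; in the full setup described above, the Text value type would be replaced by your own WritableComparable):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IOUtils, NullWritable, Text}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

class WholeFileInputFormat extends FileInputFormat[NullWritable, Text] {
  // never split a file, so one split == one file == one RDD element
  override protected def isSplitable(context: JobContext, file: Path): Boolean = false

  override def createRecordReader(split: InputSplit, context: TaskAttemptContext): RecordReader[NullWritable, Text] =
    new WholeFileRecordReader
}

class WholeFileRecordReader extends RecordReader[NullWritable, Text] {
  private var split: FileSplit = _
  private var conf: Configuration = _
  private val value = new Text()
  private var processed = false

  override def initialize(s: InputSplit, context: TaskAttemptContext): Unit = {
    split = s.asInstanceOf[FileSplit]
    conf = context.getConfiguration
  }

  // nextKeyValue() decides what one RDD element contains: here, the whole file
  override def nextKeyValue(): Boolean = {
    if (processed) return false
    val path = split.getPath
    val in = path.getFileSystem(conf).open(path)
    try {
      val bytes = new Array[Byte](split.getLength.toInt)
      IOUtils.readFully(in, bytes, 0, bytes.length)
      value.set(bytes, 0, bytes.length)
    } finally {
      in.close()
    }
    processed = true
    true
  }

  override def getCurrentKey: NullWritable = NullWritable.get()
  override def getCurrentValue: Text = value
  override def getProgress: Float = if (processed) 1.0f else 0.0f
  override def close(): Unit = ()
}
It would then be wired into Spark with something like sc.newAPIHadoopFile(path, classOf[WholeFileInputFormat], classOf[NullWritable], classOf[Text]).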

How to convert Dataset to a Scala Iterable?

Is there a way to convert a org.apache.spark.sql.Dataset to a scala.collection.Iterable? It seems like this should be simple enough.
You can do myDataset.collect or myDataset.collectAsList.
But then it will no longer be distributed. If you want to be able to spread your computations out on multiple machines, you need to use one of the distributed data structures such as RDD, DataFrame, or Dataset.
You can also use toLocalIterator if you just need to iterate over the contents on the driver; it has the advantage of only loading one partition at a time into memory, instead of the entire dataset. An Iterator is not an Iterable (although it is a TraversableOnce), but depending on what you are doing it may be what you want.
You could try something like this:
import org.apache.spark.sql.Dataset

def toLocalIterable[T](dataset: Dataset[T]): Iterable[T] = new Iterable[T] {
  def iterator = scala.collection.JavaConverters.asScalaIterator(dataset.toLocalIterator)
}
The conversion via JavaConverters.asScalaIterator is necessary because the toLocalIterator method of Dataset returns a java.util.Iterator instead of a scala.collection.Iterator (which is what toLocalIterator on RDD returns). I suspect this is a bug.
In Scala 2.11 you can do the following:
import scala.collection.JavaConverters._
dataset.toLocalIterator.asScala.toIterable
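A short, self-contained usage sketch of that one-liner (local mode with toy data; the SparkSession setup is only for illustration):
import org.apache.spark.sql.{Dataset, SparkSession}
import scala.collection.JavaConverters._

val spark = SparkSession.builder().master("local[*]").appName("to-iterable").getOrCreate()
import spark.implicits._

val ds: Dataset[String] = Seq("a", "b", "c").toDS()

// Java iterator -> Scala iterator -> Iterable; partitions are pulled lazily on the driver
val asIterable: Iterable[String] = ds.toLocalIterator.asScala.toIterable
asIterable.foreach(println)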

Subtract an RDD from another RDD doesn't work correctly

I want to subtract an RDD from another RDD. I looked into the documentation and found that subtract can do that. However, when I tested subtract, the final RDD remained the same and the values were not removed!
Is there any other function to do that? Or am I using subtract incorrectly?
Here is the code that I used:
val vertexRDD: org.apache.spark.rdd.RDD[(VertexId, Array[Int])]
val clusters = vertexRDD.takeSample(false, 3)
val clustersRDD: RDD[(VertexId, Array[Int])] = sc.parallelize(clusters)
val result = vertexRDD.subtract(clustersRDD)
result.collect().foreach(println(_))
Performing set operations like subtract with mutable types (Array in this example) is usually unsupported, or at least not recommended.
Try using an immutable type instead.
I believe WrappedArray is the relevant container for storing arrays in sets, but I'm not sure.
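For example, a minimal sketch of that idea using the question's vertexRDD and sc (Vector has structural equality, so subtract can match elements by content):
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx.VertexId

// convert the mutable Array values to immutable Vectors before subtracting
val immutableRDD: RDD[(VertexId, Vector[Int])] = vertexRDD.mapValues(_.toVector)

val clusters = immutableRDD.takeSample(false, 3)
val clustersRDD = sc.parallelize(clusters)

val result = immutableRDD.subtract(clustersRDD)
result.collect().foreach(println)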
If your RDD is composed of mutable objects it won't work. The problem is that it won't show an error either, so this kind of problem is hard to identify; I had a similar one yesterday and used a workaround:
rdd.keyBy(someImmutableValue) // do this using the same key value for both of your RDDs
val resultRDD = rdd.subtractByKey(otherRDD).values
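Applied to the question's RDDs, which are already keyed by VertexId, the workaround reduces to a sketch like this (it compares only the immutable VertexId keys and ignores the mutable Array values):
// vertexRDD: RDD[(VertexId, Array[Int])], clustersRDD: a sampled subset of it
val resultRDD = vertexRDD.subtractByKey(clustersRDD)
resultRDD.collect().foreach(println)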
Recently I tried the subtract operation on two RDDs (of array lists) and it worked. The important note is the order of the call: a.subtract(b) keeps the elements of a that are not in b, so the RDD you call .subtract on should be the list you are subtracting from, not the other way around.
Correct: val result = fromList.subtract(theElementYouWantToSubtract)
Incorrect: val result = theElementYouWantToSubtract.subtract(fromList) (this gives no compile or runtime error, but silently returns the wrong result)