I'm trying to create a new dataset by taking intervals from another dataset, for example, consider dataset1 as input and dataset2 as output:
dataset1 = [1, 2, 3, 4, 5, 6]
dataset2 = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6]
I managed to do that using arrays, but for MLlib a DataSet is needed.
My code with array:
def generateSeries(values: Array[Double], n: Int): Seq[Array[Double]] = {
  // One window of length n starting at every position 0 .. values.length - n
  (0 to values.length - n).map(i => values.slice(i, i + n))
}
// generateSeries(Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0), 2)
// => Seq(Array(1.0, 2.0), Array(2.0, 3.0), Array(3.0, 4.0), Array(4.0, 5.0), Array(5.0, 6.0))
flatMap seems like the way to go, but how can a function look up the next value in the dataset?
The problem here is that an array is in no way similar to a DataSet. A DataSet is unordered and has no indices, so thinking in terms of arrays won't help you. Go for a Seq and treat it without using indices and positions at all.
So, to represent an array-like behaviour on a DataSet you need to create your own indices. This is simply done by pairing the value with the position in the "abstract array" we are representing.
So the type of your DataSet will be something like [(Int, Int)], where the first element is the index and the second is the value. The elements will arrive unordered, so you will need to rework your logic in a more functional way. It's not really clear what you're trying to achieve, but I hope I gave you a hint. Otherwise, explain the expected result in a comment on my answer and I will edit.
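For illustration, here is a minimal sketch of the index-pairing idea on plain Scala collections (the window size n = 2 and the use of zipWithIndex are assumptions matching your example; on a distributed DataSet you would need the framework's own indexing utility instead of relying on collection order):

val values = Seq(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
val n = 2 // window size from the question's example

// Pair each value with its position in the "abstract array":
val indexed: Seq[(Int, Double)] = values.zipWithIndex.map { case (v, i) => (i, v) }

// An element at position i belongs to the windows starting at
// max(0, i - n + 1) through min(i, length - n), so emit one copy per window:
val expanded = indexed.flatMap { case (i, v) =>
  val first = math.max(0, i - n + 1)
  val last = math.min(i, values.length - n)
  Seq.fill(last - first + 1)(v)
}
// expanded == Seq(1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 5.0, 6.0)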
A list can be iterated as follows:
scala> val thrill = "Will" :: "fill" :: "until" :: Nil
val thrill: List[String] = List(Will, fill, until)
scala> thrill.map(s => s + "y")
val res14: List[String] = List(Willy, filly, untily)
The above code first creates a list; the second command then calls map with a function whose parameter is s, and map builds a new list of strings from s by appending the character 'y' to each element.
However, I do not understand the following iteration process:
scala> thrill.sortWith((s,t) => s.charAt(0).toLower < t.charAt(0).toLower)
val res19: List[String] = List(fill, until, Will)
Does the tuple (s, t) take two elements of thrill at once and compare them? How exactly is sorting performed using this syntax/function?
Sorting is arranging data in ascending or descending order; sorted data makes searching easier.
Scala uses TimSort, which is a hybrid of Merge Sort and Insertion Sort.
Here is the signature of the sortWith function in Scala:
def sortWith(lt: (A, A) => Boolean): Repr
The sortWith function sorts a sequence according to a comparison function: it takes a comparator function and sorts according to it, so you can provide your own custom comparison.
Internally it performs TimSort on the input list to produce the sorted output.
Yes, the function compares two elements at a time to determine their ordering.
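For instance, with the question's list (the descending-by-length comparison is just an invented second example):

val words = List("Will", "fill", "until")

// The question's comparison: case-insensitive on the first character
words.sortWith((s, t) => s.charAt(0).toLower < t.charAt(0).toLower)
// List(fill, until, Will)

// Any binary "comes before" predicate works, e.g. descending by length
words.sortWith(_.length > _.length)
// List(until, Will, fill)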
The implementation of sortWith falls back on Java's java.util.Arrays.sort with the following signature
sort(T[] a, Comparator<? super T> c)
which takes an array of T and a comparator: a binary function that can compare any two values of type T and tell you how they relate to each other (whether the first should come before, be treated as equal to, or come after the second, based on your function).
Behind the scenes it is really just an iterative merge sort, details of which can be found on the Wikipedia page for merge sort.
Leaving the optimizations aside: merge sort is a divide-and-conquer algorithm. If you aren't familiar with it, it simply breaks down your collection until you have chunks of at most 2 elements, then merges the smaller sorted lists by comparing 2 elements at a time, which is where the binary comparison function you pass in comes into play.
i.e.
List(4, 11, 2, 1, 9, 0) //original list
Break down to chunks:
[4, 11], [2, 1], [9, 0]
Sort internally:
[4, 11], [1, 2], [0, 9]
Merge sorted chunks pairwise, comparing two heads at a time:
[4, 11] + [1, 2] -> [1, 2, 4, 11]
[1, 2, 4, 11] + [0, 9] -> [0, 1, 2, 4, 9, 11]
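The same idea as a minimal, non-optimized Scala sketch (this is not what the standard library actually runs, which is the TimSort mentioned above; it just mirrors the trace, parameterized by the same kind of binary function sortWith takes):

def mergeSort[A](xs: List[A])(lt: (A, A) => Boolean): List[A] = {
  // Merge two already-sorted lists by repeatedly comparing their heads
  def merge(left: List[A], right: List[A]): List[A] = (left, right) match {
    case (Nil, _) => right
    case (_, Nil) => left
    case (l :: ls, r :: rs) =>
      if (lt(l, r)) l :: merge(ls, right) else r :: merge(left, rs)
  }
  if (xs.length <= 1) xs // lists of 0 or 1 elements are already sorted
  else {
    val (left, right) = xs.splitAt(xs.length / 2) // divide ...
    merge(mergeSort(left)(lt), mergeSort(right)(lt)) // ... and conquer
  }
}

mergeSort(List(4, 11, 2, 1, 9, 0))(_ < _) // List(0, 1, 2, 4, 9, 11)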
P.S. A nit-picky detail: sortWith takes in a Function2 (a function of arity 2). The function you pass in is used to generate what we call an Ordering in Scala, which is then implicitly converted to a Comparator. Above, I have linked to the implementation of sorted, which is what sortWith calls once it generates this ordering, and which is where most of the sorting logic happens.
Here is a function which builds a map from a given array, where the key is an integer from the array and the value is that integer's frequency in the array.
I need to find the key which has the maximum frequency. If two keys have the same frequency, then I need to take the smaller key.
That's what I have written:
def findMinKeyWithMaxFrequency(arr: List[Int]): Int = {
  val ansMap: scala.collection.mutable.Map[Int, Int] = scala.collection.mutable.Map()
  arr.map(elem => ansMap += (elem -> arr.count(p => elem == p)))
  ansMap.filter(_._2 == ansMap.values.max).keys.min
}
val arr = List(1, 2 ,3, 4, 5, 4, 3, 2, 1, 3, 4)
val ans=findMinKeyWithMaxFrequency(arr) // output:3
How can I make it more efficient? It is giving me the right answer, but I don't think it's the most efficient way to solve the problem.
In the given example the frequency of 3 and 4 is 3 so the answer should be 3 as 3 is smaller than 4.
Edit 1:
That's what I have done to make it a bit more efficient: convert arr into a Set, since we need to find the frequency for the unique elements only.
def findMinKeyWithMaxFrequency(arr: List[Int]): Int = {
  val ansMap = arr.toSet.map { e: Int => (e, arr.count(x => x == e)) }.toMap
  ansMap.filter(_._2 == ansMap.values.max).keys.min
}
Can it be more efficient? Is it the most functional way of writing a solution to the given problem?
def findMinKeyWithMaxFrequency(arr: List[Int]): Int =
  arr.groupBy(identity).toSeq.maxBy(p => (p._2.length, -p._1))._1
Use groupBy() to get a count for each element (as the length of its group); then, after flattening to a sequence of tuples, encode the required rules in the maxBy key: the longest group wins, and ties go to the smaller key (hence the negation).
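A step-by-step trace on the question's input (intermediate values shown in comments; the Map's iteration order may differ in practice):

val arr = List(1, 2, 3, 4, 5, 4, 3, 2, 1, 3, 4)

arr.groupBy(identity)                // Map(1 -> List(1, 1), 2 -> List(2, 2), 3 -> List(3, 3, 3), 4 -> List(4, 4, 4), 5 -> List(5))
   .toSeq                            // Seq((1, List(1, 1)), (2, List(2, 2)), ...)
   .maxBy(p => (p._2.length, -p._1)) // (3, List(3, 3, 3)): 3 and 4 tie on length, but -3 > -4
   ._1                               // 3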
I currently have:
x.collect()
// Result: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
val a = x.reduce((x, y) => x + 1)
// a: Int = 6
val b = x.reduce((x, y) => y + 1)
// b: Int = 12
I have tried to follow what has been said here (http://www.python-course.eu/lambda.php) but still don't quite understand what the individual operations are that lead to these answers.
Could anyone please explain the steps being taken here?
Thank you.
The reason is that the function (x, y) => x + 1 is not associative. reduce requires an associative (and commutative) function. This is necessary to avoid indeterminacy when combining results across different partitions.
You can think of the reduce() method as grabbing two elements from the collection, applying them to a function that results in a new element, and putting that new element back in the collection, replacing the two it grabbed. This is done repeatedly until there is only one element left. In other words, that previous result is going to get re-grabbed until there are no more previous results.
So you can see where (x,y) => x+1 results in a new value (x+1) which would be different from (x,y) => y+1. Cascade that difference through all the iterations and ...
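Here is a plain-Scala sketch of what happens (the split of the 10 elements into two partitions is an assumption, but it reproduces the observed results):

val data = (1 to 10).toList
val partitions = List(data.take(5), data.drop(5)) // pretend split: (1..5) and (6..10)

// Step 1: reduce runs inside each partition, left to right.
// (x, y) => x + 1 ignores y, so each partition yields its first element
// incremented once per remaining element:
val partials = partitions.map(_.reduce((x, y) => x + 1)) // List(5, 10)

// Step 2: the partial results are combined with the same function:
val a = partials.reduce((x, y) => x + 1) // 5 + 1 = 6

// (x, y) => y + 1 instead yields each partition's last element plus 1:
val partials2 = partitions.map(_.reduce((x, y) => y + 1)) // List(6, 11)
val b = partials2.reduce((x, y) => y + 1) // 11 + 1 = 12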
What I want to do like this:
http://cn.mathworks.com/help/matlab/ref/median.html?requestedDomain=www.mathworks.com
Find the median value of each column.
It can be done by collecting the RDD to the driver, but for big data that becomes impossible.
I know Statistics.colStats() can calculate mean, variance... but median is not included.
Additionally, the vector is high-dimensional and sparse.
Well, I didn't understand the vector part; however, this is my approach (I bet there are better ones):
val a = sc.parallelize(Seq(1, 2, -1, 12, 3, 0, 3))
val n = a.count() / 2
println(n) // outputs 3
val b = a.sortBy(x => x).zipWithIndex()
val median = b.filter(x => x._2 == n).collect()(0)._1 // this part doesn't look nice, I hope someone tells me how to improve it, maybe zero?
println(median) // outputs 2
b.collect().foreach(println) // (-1,0) (0,1) (1,2) (2,3) (3,4) (3,5) (12,6)
The trick is to sort your dataset using sortBy, zip the entries with their index using zipWithIndex, and then get the middle entry. Note that I used an odd number of samples for simplicity, but the essence is there; also, you would have to do this for every column of your dataset.
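For what it's worth, a sketch of the even-count case (assumptions: the same SparkContext sc, and that the median of an even-sized sample is the mean of the two middle entries; lookup on the index-keyed RDD replaces the filter/collect step):

val c = sc.parallelize(Seq(1, 2, -1, 12, 3, 0, 3, 7)) // 8 elements
val count = c.count()
val byIndex = c.sortBy(identity).zipWithIndex().map(_.swap) // RDD[(Long, Int)]

val median =
  if (count % 2 == 1) byIndex.lookup(count / 2).head.toDouble
  else {
    val lower = byIndex.lookup(count / 2 - 1).head
    val upper = byIndex.lookup(count / 2).head
    (lower + upper) / 2.0
  }
// sorted: -1, 0, 1, 2, 3, 3, 7, 12 -> median = (2 + 3) / 2.0 = 2.5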
I am reading through Twitter's Scala School right now and was looking at the groupBy and partition methods for collections. And I am not exactly sure what the difference between the two methods is.
I did some testing on my own:
scala> List(1, 2, 3, 4, 5, 6).partition(_ % 2 == 0)
res8: (List[Int], List[Int]) = (List(2, 4, 6),List(1, 3, 5))
scala> List(1, 2, 3, 4, 5, 6).groupBy(_ % 2 == 0)
res9: scala.collection.immutable.Map[Boolean,List[Int]] = Map(false -> List(1, 3, 5), true -> List(2, 4, 6))
So does this mean that partition returns a pair of two lists and groupBy returns a Map with boolean keys and list values? Both have the same "effect" of splitting a list into two parts based on a condition. I am not sure why I would use one over the other. So, when would I use partition over groupBy and vice versa?
groupBy is better suited for lists of more complex objects.
Say, you have a class:
case class Beer(name: String, cityOfBrewery: String)
and a List of beers:
val beers = List(Beer("Bitburger", "Bitburg"), Beer("Frueh", "Cologne") ...)
you can then group beers by cityOfBrewery:
val beersByCity = beers.groupBy(_.cityOfBrewery)
Now you can get yourself a list of all beers brewed in any city you have in your data:
beersByCity("Cologne") // List(Beer("Frueh", "Cologne"), ...)
Neat, isn't it?
And I am not exactly sure what the difference between the two methods is.
The difference is in their signature. partition expects a function A => Boolean while groupBy expects a function A => K.
It appears that in your case the function you apply with groupBy is A => Boolean too, but you won't always want that; sometimes you want to group by a function that doesn't return a boolean.
For example, if you want to group a List of strings by their length, you need to do it with groupBy (see the sketch at the end of this answer).
So, when would I use partition over groupBy and vice-versa?
Use groupBy if the image of the function you apply is not in the boolean set (i.e., f(x) for an input x yields something other than a boolean). If the function does return a boolean, then you can use both, and it's up to you whether you prefer a Map or a (List, List) as output.
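A minimal sketch of both shapes (the word list is made up for the example):

val words = List("to", "be", "or", "not")

// A => K with K = Int: grouping by length is not expressible with partition
words.groupBy(_.length) // Map(2 -> List(to, be, or), 3 -> List(not))

// A => Boolean: both work; pick the output shape you prefer
words.partition(_.length == 2) // (List(to, be, or), List(not))
words.groupBy(_.length == 2)   // Map(true -> List(to, be, or), false -> List(not))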
partition is for when you need to split a collection into two based on yes/no logic (even/odd numbers, uppercase/lowercase letters, you name it). groupBy has a more general use: producing many groups based on some function. Say you want to split a corpus of words into bins by their first letter (resulting in up to 26 groups); that is simply not possible with .partition.
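A sketch of that first-letter split (the word list is invented for the example):

val corpus = List("apple", "avocado", "banana", "cherry", "date")

corpus.groupBy(_.head)
// Map(a -> List(apple, avocado), b -> List(banana), c -> List(cherry), d -> List(date))

// partition can only ever give you two bins:
corpus.partition(_.head == 'a')
// (List(apple, avocado), List(banana, cherry, date))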