Spark Scala: how to slice an array in an array of arrays

For an array of arrays:
val arrarr = Array(Array(0.37,1),Array(145.38,100),Array(149.37,100),Array(149.37,300),Array(119.37,5),Array(144.37,100))
For example, if the input value is 149.37, I want to do some sort of indexing to get 300. 149.37 occurs twice in arrarr (Array(149.37,100) and Array(149.37,300)), and I want to return the second element of the last occurrence using Spark Scala.
Can you please help? Thanks!

You can do it like this:
val result: Double = arrarr.filter(_(0) == 149.37).last(1)
or
val result: Option[Double] = arrarr.reverse.find(_(0) == 149.37).map(_(1))

val index = arrarr.lastIndexWhere(x => x(0) == input)
val result = arrarr(index)(1)
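A small variant that also covers the case where the input value is absent (lastIndexWhere returns -1 then, which would make arrarr(index) throw); the helper name lastValueFor is just for illustration:
// Returns the second element of the last sub-array whose first element equals
// `input`, or None when there is no match.
def lastValueFor(arrarr: Array[Array[Double]], input: Double): Option[Double] =
  arrarr.reverse.find(_(0) == input).map(_(1))

lastValueFor(arrarr, 149.37)  // Some(300.0)
lastValueFor(arrarr, 1.0)     // None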

Related

How to skip appending elements based on some condition in map in Scala?

I have a situation in which I need to create a map from a collection by applying some filter inside, as shown in the code below:
//Say I have a list
//I don't want to apply a filter function on top ...
val myList = List(2, 3, 4, 5)
val evenList = myList.map(x => {
  if (x % 2 == 0) x // i.e. "if x is even"
  else 0
})
//And the output is: List(2, 0, 4, 0)
//The output actually needed was List(2, 4), without applying a filter on top like myList.filter
//I have objects of a case class instead of numbers, so the output becomes: List(object1, None, object2, None)
//But the actual output needed was: List(object1, object2)
//The updated scenario
val basket = List(2, 4, 5, 6)
case class Apple(name: Option[String], size: Option[Int])
val listApples: List[Apple] = basket.map(x => {
  val r = new scala.util.Random
  val size = r.nextInt(10)
  if (x % 2 != 0) {
    Apple(None, None)
  }
  else Apple(Some("my-apple"), Some(size))
})
Current output:
Apple(Some(my-apple),Some(2))
Apple(Some(my-apple),Some(0))
Apple(None,None)
Apple(Some(my-apple),Some(4))
Expected was:
Apple(Some(my-apple),Some(2))
Apple(Some(my-apple),Some(0))
Apple(Some(my-apple),Some(4))
I believe collect best suits your case. It takes a partial function as an argument, and only when that function matches is the element transformed and added to the result:
val myList = List(2, 3, 4, 5)
case class Wrapper(i: Int)
val evenList = myList.collect {
  case x if x % 2 == 0 => Wrapper(x)
}
In this case only 2 and 4 will be wrapped inside Wrapper:
List(Wrapper(2), Wrapper(4))
I'm not sure if I understand you correctly, but why not just use a filter directly:
val myList = List(2,3,4,5)
myList.filter(_ % 2 == 0)
If you want to have the filter as a function:
def even(n:Int) = n % 2 == 0
myList.filter(even)
After the question update, here is the difference between filter and collect:
Filter:
myList
  .filter(even)
  .map(s => Apple(Some("my-apple"), Some(s)))
Collect:
myList
  .collect { case s if even(s) => Apple(Some("my-apple"), Some(s)) }
Both return List(Apple(Some(my-apple),Some(2)), Apple(Some(my-apple),Some(4)))
So the only difference is that you can do both steps at once with collect.
However, I usually find it more readable to keep these two steps separate.
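Applied to the updated scenario from the question, the two-step version could look roughly like this (the random size is kept from the question, so the exact numbers will vary):
val basket = List(2, 4, 5, 6)
case class Apple(name: Option[String], size: Option[Int])

val r = new scala.util.Random
// Filter first, then map: odd numbers never produce an Apple(None, None).
val listApples: List[Apple] =
  basket.filter(_ % 2 == 0).map(x => Apple(Some("my-apple"), Some(r.nextInt(10))))
// e.g. List(Apple(Some(my-apple),Some(2)), Apple(Some(my-apple),Some(7)), Apple(Some(my-apple),Some(4)))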

Finding length of string in Scala

I am a newbie to Scala.
I have a list of strings:
List[String]("alpha", "gamma", "omega", "zeta", "beta")
I want to count all the strings with length == 4,
i.e. I want to get the output 2.
You can do it like this:
val data = List("alpha", "gamma", "omega", "zeta", "beta")
data.filter(x => x.length == 4).size
res8: Int = 2
You can just use the count function:
val list = List[String] ("alpha", "gamma", "omega", "zeta", "beta")
println(list.count(x => x.length == 4))
//2 is printed
I hope the answer is helpful

How do I remove empty dataframes from a sequence of dataframes in Scala

How do I remove empty data frames from a sequence of data frames? In the code snippet below, there are many empty data frames in twoColDF. Another question about the for loop below: is there a way to make it more efficient? I tried rewriting it as the commented line, but it didn't work.
//finalDF2 = (1 until colCount).flatMap(j => groupCount(j).map(y => finalDF.map(a => a.filter(df(cols(j)) === y)))).toSeq.flatten
var twoColDF: Seq[Seq[DataFrame]] = null
if (colCount == 2) {
  val i = 0
  for (j <- i + 1 until colCount) {
    twoColDF = groupCount(j).map(y => {
      finalDF.map(x => x.filter(df(cols(j)) === y))
    })
  }
}
finalDF = twoColDF.flatten
Given a set of DataFrames, you can access each DataFrame's underlying RDD and use isEmpty to filter out the empty ones:
val input: Seq[DataFrame] = ???
val result = input.filter(!_.rdd.isEmpty())
As for your other question - I can't understand what your code tries to do, but I'd first try to convert it into something more functional (remove the use of vars and imperative conditionals). If I'm guessing the meaning of your inputs correctly, here's something that might be equivalent to what you're trying to do:
import org.apache.spark.sql.functions.col

val input: Seq[DataFrame] = ???
// map of column index to column values -
// for each combination we'd want a new DF where that column has that value
// I'm assuming values are Strings; they can be anything else
val groupCount: Map[Int, Seq[String]] = ???
// for each combination of DF + column + value, produce the filtered DF where this column has this value
val perValue: Seq[DataFrame] = for {
  df    <- input
  index <- groupCount.keySet
  value <- groupCount(index)
} yield df.filter(col(df.columns(index)) === value)
// remove empty results:
val result: Seq[DataFrame] = perValue.filter(!_.rdd.isEmpty())
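A side note, not from the original answer: depending on your Spark version, converting each DataFrame to an RDD just to call isEmpty can be relatively costly, so an alternative (assuming you only need an emptiness check) is to look at the first row instead:
// Alternative emptiness check that avoids the DataFrame-to-RDD conversion;
// head(1) returns an empty Array when the DataFrame has no rows.
val nonEmpty: Seq[DataFrame] = perValue.filter(_.head(1).nonEmpty)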

Splitting string in dataset Apache Spark

I am absolutely new to Spark.
I have a txt dataset with categorical attributes, looking like this:
10000,5,0,1,0,0,5,3,2,2,1,0,1,0,4,3,0,2,0,0,1,0,0,0,0,10,0,1,0,1,0,1,4,2,2,3,0,2,0,2,1,4,3,0,0,0,3,1,0,3,22,0,3,0,1,0,1,0,0,0,5,0,2,1,1,0,11,1,0
10001,6,1,1,0,0,7,5,2,2,0,0,3,0,1,1,0,1,0,0,0,0,1,0,0,4,0,2,0,0,0,1,4,1,2,2,0,2,0,2,2,4,2,1,0,0,1,1,0,2,10,0,1,0,1,0,1,0,0,0,1,0,2,1,1,0,5,1,0
10002,3,1,2,0,0,7,4,2,2,0,0,1,0,4,4,0,1,0,1,0,0,0,0,0,1,0,2,0,4,0,10,4,1,2,4,0,2,0,2,1,4,2,2,0,0,0,1,0,2,10,0,6,0,1,0,1,0,0,0,2,0,2,1,1,0,10,1,0
10003,4,1,2,0,0,1,3,2,2,0,0,3,0,3,3,0,1,0,0,0,0,0,0,1,4,0,2,0,2,0,1,4,1,2,2,0,2,0,2,1,2,2,0,0,0,1,1,0,2,10,0,4,0,1,0,1,0,0,0,1,0,1,1,1,0,10,1,0
10004,7,1,1,0,0,0,0,2,2,0,0,3,0,0,0,0,0,0,0,0,1,0,0,0,0,0,2,2,0,0,0,4,1,2,0,0,2,0,2,1,4,0,1,0,0,0,6,0,2,22,0,1,0,1,0,1,0,0,3,0,0,0,2,2,0,5,6,0
10005,1,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,4,0,0,0,1,0,0,0,0,0,2,0,4,0,2,0,121,0,0,1,0,10,1,0,0,2,0,1,0,0,0,0,0,0,0,0,0,4,0,0
10006,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,4,0,0,0,1,0,0,0,0,0,2,1,0,0,2,0,121,0,0,1,0,10,1,0,0,2,0,0,0,0,0,0,0,0,0,0,0,4,0,0
10007,4,1,2,0,0,6,0,2,2,0,0,4,0,5,5,0,2,1,0,0,0,0,0,0,9,0,2,0,0,0,11,4,1,2,3,0,2,0,2,1,2,3,1,0,0,0,1,0,3,10,0,1,0,1,0,1,0,0,0,0,0,2,1,1,0,11,1,0
10008,6,1,1,0,0,1,0,2,2,0,0,7,0,1,0,0,1,0,0,0,0,0,0,0,7,0,2,2,0,0,0,4,1,2,6,0,2,0,2,1,2,2,1,0,0,0,6,0,2,10,0,1,0,1,0,1,0,0,3,0,0,1,1,2,0,10,1,0
10009,3,1,12,0,0,1,0,2,2,0,0,0,0,3,0,0,1,0,0,0,0,0,0,0,4,0,2,2,4,0,0,2,1,2,6,0,2,0,2,1,0,2,2,0,0,0,3,0,2,10,0,6,1,1,1,0,0,0,1,0,0,1,1,2,0,8,1,1
10010,5,11,1,0,0,1,3,2,2,0,0,0,0,3,3,0,3,0,0,0,0,0,0,0,6,0,2,0,0,0,1,4,1,2,1,0,2,0,2,1,0,4,0,0,0,1,1,0,4,21,0,1,0,1,0,0,0,0,0,4,0,2,1,1,0,11,1,0
10011,4,0,1,0,0,1,5,2,2,0,0,3,0,1,1,0,1,0,0,0,0,0,0,0,7,0,2,0,0,0,1,4,1,2,1,0,2,0,2,1,3,2,1,0,0,1,1,0,2,10,0,1,0,1,0,1,0,0,0,2,0,2,1,1,0,10,1,0
10012,1,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,2,0,0,0,2,0,112,0,0,1,0,10,1,0,0,1,0,0,0,0,0,0,0,0,0,2,0,1,0,0
10013,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,4,0,0,0,1,0,0,0,0,0,2,1,4,0,2,0,121,0,0,1,0,10,1,0,0,2,0,1,0,0,0,0,0,0,0,0,0,4,0,0
10014,3,11,1,0,0,6,4,2,2,0,0,1,0,2,2,0,0,1,0,0,0,0,0,0,3,0,2,0,3,0,1,4,2,2,5,0,2,0,1,2,4,2,10,0,0,1,1,0,2,10,0,5,0,1,0,1,0,0,0,3,0,1,1,1,0,7,1,0
10015,4,3,1,0,0,1,3,2,2,1,0,0,0,3,5,0,3,0,0,1,0,0,0,0,4,0,1,0,0,1,1,2,2,2,2,0,2,0,2,0,0,4,0,0,0,1,1,0,4,10,0,1,3,1,1,0,0,0,0,3,0,2,1,1,0,11,1,1
10016,4,11,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,2,2,4,0,0,4,1,1,0,0,1,0,0,2,0,0,12,0,0,0,6,0,2,23,0,6,0,1,0,0,0,0,3,0,0,0,2,0,0,5,7,0
10017,7,1,1,0,0,0,0,2,2,0,0,3,0,0,0,0,0,0,0,1,1,0,1,0,0,0,2,2,0,0,0,4,1,2,0,0,2,0,2,1,4,0,1,0,0,0,6,0,2,10,0,1,0,1,0,1,0,0,3,0,0,0,2,2,0,6,6,0
The task is to get the number of lines where the numeral at the 57th position, marked as ((1)) in
10001,6,1,1,0,0,7,5,2,2,0,0,3,0,1,1,0,1,0,0,0,0,1,0,0,4,0,2,0,0,0,1,4,1,2,2,0,2,0,2,2,4,2,1,0,0,1,1,0,2,10,0,1,0,1,0,((1)),0,0,0,1,0,2,1,1,0,5,1,0
is 1. The problem is that the strings are the elements of the RDD, so I need to split each string into an array to get the position I need.
I tried to use
val censusText = sc.textFile("USCensus1990.data.txt")
val splitRDD = censusText.map(line=>line.split(","))
but it didn't help, and I have no idea how to do it.
Can you please help me?
You can try:
censusText.filter(l => l.split(",")(56) == "1").count
// res5: Long = 12
Or you can first split the RDD then do the filter / count:
val splitRDD = censusText.map(l => l.split(","))
splitRDD.filter(r => r(56) == "1").count
// res7: Long = 12
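If some lines might be malformed (fewer than 57 fields), a small defensive variant of the same filter avoids an ArrayIndexOutOfBoundsException; this is only an assumption about the data, the original file may well be perfectly regular:
// Defensive variant: skip lines that have fewer than 57 comma-separated fields.
censusText
  .map(_.split(","))
  .filter(r => r.length > 56 && r(56) == "1")
  .count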

Compare rows in RDD

How can I iterate through RDD rows and compare one row to the next one in the RDD?
I know I can use a for loop in the following way: for (x <- rddItems). Is there any way to do something like x.next() inside the for loop, or to use some index inside the for?
thanks
You can do something like this using mapPartitions:
rdd.mapPartitions { partition =>
  var previous = partition.next
  for (element <- partition) yield {
    val result = previous == element // Do your comparison.
    previous = element
    result
  }
}
But this does not compare the last element of partition N with the first element of partition N+1. It would be quite complicated to do that and would hurt performance. So I'm just crossing my fingers and hoping you're okay with missing some comparisons!
You can iterate through each individual partition of the RDD using mapPartitions, something like:
val rdd = sc.parallelize(List(1, 73, 5, 226))
rdd.mapPartitions { iter =>
  var last = 0
  var result = List[Boolean]()
  while (iter.hasNext) {
    val current = iter.next
    result = result ::: List(current > last)
    last = current
  }
  result.iterator
}.collect().foreach(println)
Gives:
true
true
false
true
This is done on a partition by partition basis, not through the entire RDD.
You need to create a key and then join the rdd to itself (applying your offset).
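A minimal sketch of that suggestion, assuming the RDD's ordering is meaningful and using zipWithIndex to build the key (variable names are illustrative only):
// Pair every element with its position, then shift the keys by one so that
// element i and element i + 1 meet in the same joined tuple.
val indexed = rdd.zipWithIndex.map { case (v, i) => (i, v) }   // (position, value)
val shifted = indexed.map { case (i, v) => (i - 1, v) }        // value at i keyed by i - 1
val pairs   = indexed.join(shifted)                            // (i, (current, next))
val comparisons = pairs.sortByKey().values.map { case (current, next) => current == next }
Unlike the mapPartitions approaches above, this also compares elements across partition boundaries.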
I have thought of this possibility, but I am unsure whether it is really a good one:
def diff_timestamp(liste):
    timestamps = liste
    r = []
    values = []
    for indice, valeur in enumerate(timestamps):
        values.append(float(valeur))
        if indice > 0:
            delta = values[indice] - values[indice - 1]
            r.append(delta)
    return r
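For reference, the same consecutive-difference idea written in Scala on a local collection (just a translation of the sketch above, not a distributed solution):
// Compute deltas between consecutive timestamps; mirrors the Python sketch above.
def diffTimestamps(timestamps: Seq[Double]): Seq[Double] =
  timestamps.sliding(2).collect { case Seq(a, b) => b - a }.toSeq

diffTimestamps(Seq(1.0, 3.5, 7.0))  // Seq(2.5, 3.5)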