Compare rows in RDD - scala

How can I iterate through RDD rows and compare one row to the next one in the RDD?
I know I can use for loop in the following way : for(x<-rddItems), is there any way to do something like x.next() inside the for loop? or to use some index inside the for?
thanks

You can do something like this using mapPartitions:
rdd.mapPartitions { partition =>
var previous = partition.next
for (element <- partition) yield {
val result = previous == element // Do your comparison.
previous = element
result
}
}
But this does not compare the last element of partition N with the first element of partition N+1. It would be quite complicated to do that and would hurt performance. So I'm just crossing my fingers and hope you're okay with missing some comparisons!

You can iterate through each individual partition of the RDD using mapPartitions, something like:
val rdd = sc.parallelize(List(1,73,5,226))
rdd.mapPartitions { iter =>
var last = 0
var result = List[Boolean]()
while (iter.hasNext) {
val current = iter.next
result = result ::: List(current > last)
last = current
}
result.iterator
}.collect().foreach(println)
Gives:
true
true
false
true
This is done on a partition by partition basis, not through the entire RDD.

You need to create a key and then join the rdd to itself (applying your offset).

I have thought of this possibility , I am unsure it is really a good one ?
def diff_timestamp(liste):
timestamps = liste
r = []
values = []
for indice, valeur in enumerate(timestamps):
values.append(float(valeur))
if indice>0:
delta = values[indice] - values[indice-1]
r.append(delta)
return r

Related

Spark Scala : how to slice an array in array of array

for an array of array:
val arrarr = Array(Array(0.37,1),Array(145.38,100),Array(149.37,100),Array(149.37,300),Array(119.37,5),Array(144.37,100))
For example, if the input value is 149.37, I want to do some sort of indexing to get 300. 149.37 occurs two times in arrarr(Array(149.37,100),Array(149.37,300). I want to return the last value using spark scala.
Can you please help? Thanks!
you can do it like this :
val result : Doulbe = arrarr.filter(_(0) == 149.37).last(1)
or
val result: Option[Double] = arrarr.reverse.find(_ (0) == 149.37).map(_ (1))
val index = arrarr.lastIndexWhere(x => x(0) == input)
val result = arrarr(index)(1)
Test it here

Attemping to parallelize a nested loop in Scala

I am comparing 2 dataframes in scala/spark using a nested loop and an external jar.
for (nrow <- dfm.rdd.collect) {
var mid = nrow.mkString(",").split(",")(0)
var mfname = nrow.mkString(",").split(",")(1)
var mlname = nrow.mkString(",").split(",")(2)
var mlssn = nrow.mkString(",").split(",")(3)
for (drow <- dfn.rdd.collect) {
var nid = drow.mkString(",").split(",")(0)
var nfname = drow.mkString(",").split(",")(1)
var nlname = drow.mkString(",").split(",")(2)
var nlssn = drow.mkString(",").split(",")(3)
val fNameArray = Array(mfname,nfname)
val lNameArray = Array (mlname,nlname)
val ssnArray = Array (mlssn,nlssn)
val fnamescore = Main.resultSet(fNameArray)
val lnamescore = Main.resultSet(lNameArray)
val ssnscore = Main.resultSet(ssnArray)
val overallscore = (fnamescore +lnamescore +ssnscore) /3
if(overallscore >= .95) {
println("MeditechID:".concat(mid)
.concat(" MeditechFname:").concat(mfname)
.concat(" MeditechLname:").concat(mlname)
.concat(" MeditechSSN:").concat(mlssn)
.concat(" NextGenID:").concat(nid)
.concat(" NextGenFname:").concat(nfname)
.concat(" NextGenLname:").concat(nlname)
.concat(" NextGenSSN:").concat(nlssn)
.concat(" FnameScore:").concat(fnamescore.toString)
.concat(" LNameScore:").concat(lnamescore.toString)
.concat(" SSNScore:").concat(ssnscore.toString)
.concat(" OverallScore:").concat(overallscore.toString))
}
}
}
What I'm hoping to do is add some parallelism to the outer loop so that I can create a threadpool of 5 and pull 5 records from the collection of the outerloop and compare them to the collection of the inner loop, rather than doing this serially. So the outcome would be I can specify the number of threads, have 5 records from the outerloop's collection processing at any given time against the collection in the inner loop. How would I go about doing this?
Let's start by analyzing what you are doing. You collect the data of dfm to the driver. Then, for each element you collect the data from dfn, transform it and compute a score for each pair of elements.
That's problematic in many ways. First even without considering parallel computing, the transformations on the elements of dfn are made as many times as dfm as elements. Also, you collect the data of dfn for every row of dfm. That's a lot of network communications (between the driver and the executors).
If you want to use spark to parallelize you computations, you need to use the API (RDD , SQL or Datasets). You seem to want to use RDDs to perform a cartesian product (this is O(N*M) so be careful, it may take a while).
Let's start by transforming the data before the Cartesian product to avoid performing them more than once per element. Also, for clarity, let's define a case class to contain your data and a function that transform your dataframes into RDDs of that case class.
case class X(id : String, fname : String, lname : String, lssn : String)
def toRDDofX(df : DataFrame) = {
df.rdd.map(row => {
// using pattern matching to convert the array to the case class X
row.mkString(",").split(",") match {
case Array(a, b, c, d) => X(a, b, c, d)
}
})
}
Then, I use filter to keep only the tuples whose score is more than .95 but you could use map, foreach... depending on what you intend to do.
val rddn = toRDDofX(dfn)
val rddm = toRDDofX(dfm)
rddn.cartesian(rddm).filter{ case (xn, xm) => {
val fNameArray = Array(xm.fname,xn.fname)
val lNameArray = Array(xm.lname,xn.lname)
val ssnArray = Array(xm.lssn,xn.lssn)
val fnamescore = Main.resultSet(fNameArray)
val lnamescore = Main.resultSet(lNameArray)
val ssnscore = Main.resultSet(ssnArray)
val overallscore = (fnamescore +lnamescore +ssnscore) /3
// and then, let's say we filter by score
overallscore > .95
}}
This is not a right way of iterating over spark dataframe. The major concern is the dfm.rdd.collect. If the dataframe is arbitrarily large, you would end up exception. This due to the fact that the collect function essentially brings all the data into the master node.
Alternate way would be use the foreach or map construct of the rdd.
dfm.rdd.foreach(x => {
// your logic
}
Now you are trying to iterate the second dataframe here. I am afraid that won't be possible. The elegant way is to join the dfm and dfn and iterate over the resulting dataset to compute your function.

Spark find previous value on each iteration of RDD

I've following code :-
val rdd = sc.cassandraTable("db", "table").select("id", "date", "gpsdt").where("id=? and date=? and gpsdt>? and gpsdt<?", entry(0), entry(1), entry(2) , entry(3))
val rddcopy = rdd.sortBy(row => row.get[String]("gpsdt"), false).zipWithIndex()
rddcopy.foreach { records =>
{
val previousRow = (records - 1)th row
val currentRow = records
// Some calculation based on both rows
}
}
So, Idea is to get just previous \ next row on each iteration of RDD. I want to calculate some field on current row based on the value present on previous row. Thanks,
EDIT II: Misunderstood question below is how to get tumbling window semantics but sliding window is needed. considering this is a sorted RDD
import org.apache.spark.mllib.rdd.RDDFunctions._
sortedRDD.sliding(2)
should do the trick. Note however that this is using a DeveloperAPI.
alternatively you can
val l = sortedRdd.zipWithIndex.map(kv => (kv._2, kv._1))
val r = sortedRdd.zipWithIndex.map(kv => (kv._2-1, kv._1))
val sliding = l.join(r)
rdd joins should be inner joins (IIRC) thus dropping the edge cases where the tuples would be partially null
OLD STUFF:
how do you do identify the previous row? RDDs do not have any sort of stable ordering by themselves. if you have an incrementing dense key you could add a new column that get's calculated the following way if (k % 2 == 0) k / 2 else (k-1)/2 this should give you a key that has the same value for two successive keys. Then you could just group by.
But to reiterate there is no really sensible notion of previous in most cases for RDDs (depending on partitioning, datasource etc.)
EDIT: so now that you have a zipWithIndex and an ordering in your set you can do what I mentioned above. So now you have an RDD[(Int, YourData)] and can do
rdd.map( kv => if (kv._1 % 2 == 0) (kv._1 / 2, kv._2) else ( (kv._1 -1) /2, kv._2 ) ).groupByKey.foreach (/* your stuff here /*)
if you reduce at any point consider using reduceByKey rather than groupByKey().reduce

Dropping the first and last row of an RDD with Spark

I'm reading in a text file using spark with sc.textFile(fileLocation) and need to be able to quickly drop the first and last row (they could be a header or trailer). I've found good ways of returning the first and last row, but no good one for removing them. Is this possible?
One way of doing this would be to zipWithIndex, and then filter out the records with indices 0 and count - 1:
// We're going to perform multiple actions on this RDD,
// so it's usually better to cache it so we don't read the file twice
rdd.cache()
// Unfortunately, we have to count() to be able to identify the last index
val count = rdd.count()
val result = rdd.zipWithIndex().collect {
case (v, index) if index != 0 && index != count - 1 => v
}
Do note that this might be be rather costly in terms of performance (if you cache the RDD - you use up memory; If you don't, you read the RDD twice). So, if you have any way of identifying these records based on their contents (e.g. if you know all records but these should contain a certain pattern), using filter would probably be faster.
This might be a lighter version:
val rdd = sc.parallelize(Array(1,2,3,4,5,6), 3)
val partitions = rdd.getNumPartitions
val rddFirstLast = rdd.mapPartitionsWithIndex { (idx, iter) =>
if (idx == 0) iter.drop(1)
else if (idx == partitions - 1) iter.sliding(2).map(_.head)
else iter
}
scala> rddFirstLast.collect()
res3: Array[Int] = Array(2, 3, 4, 5)
Here is my take on it, may require an action(count), expected results always and independent to number of partitions.
val rddRowCount = rdd.count()
val rddWithIndices = rdd.zipWithIndex()
val filteredRddWithIndices = rddWithIndices.filter(eachRow =>
if(eachRow._2 == 0) false
else if(eachRow._2 == rddRowCount - 1) false
else true
)
val finalRdd = filteredRddWithIndices.map(eachRow => eachRow._1)

My RDD change his values himself

I have a basic RDD[Object] on which i apply a map with a hashfunction on Object values using nextGaussian and nextDouble scala function. And when i print values there change at each print
def hashmin(x:Data_Object, w:Double) = {
val x1 = x.get_vector.toArray
var a1 = Array(0.0).tail
val b = Random.nextDouble * w
for( ind <- 0 to x1.size-1) {
val nG = Random.nextGaussian
a1 = a1 :+ nG
}
var sum = 0.0
for( ind <- 0 to x1.size-1) {
sum = sum + (x1(ind)*a1(ind))
}
val hash_val = (sum+b)/w
val hash_val1 = (x.get_id,hash_val)
hash_val1
}
val w = 8
val rddhash = parsedData.map(x => hashmin(x,w))
rddhash.foreach(println)
rddhash.foreach(println)
I don't understand why. Thank you in advance.
RDDs are merely a "pointer" to the data + operations to be applied to it. Actions materialize those operations by executing the RDD lineage.
So, RDDs are basically recomputed when an action is requested. In this case, the map function calling hashmin is being evaluated every time the foreach action is called.
There're few options:
Cache the RDD - this will cause the lineage to be broken and the results of the first transformation will be preserved:
val rddhash = parsedData.map(x => hashmin(x,w)).cache()
Use a seed for your random function, sothat the pseudo-random sequence generated is each time the same.
RDDs are lazy - they're computed when they're used. So the calls to Random.nextGaussian are made again each time you call foreach.
You can use persist() to store an RDD if you want to keep fixed values.