My RDD changes its values by itself - Scala

I have a basic RDD[Object] to which I apply a map with a hash function on the Object values, using Scala's nextGaussian and nextDouble functions. When I print the values, they change on each print.
def hashmin(x:Data_Object, w:Double) = {
  val x1 = x.get_vector.toArray
  var a1 = Array(0.0).tail
  val b = Random.nextDouble * w
  for( ind <- 0 to x1.size-1) {
    val nG = Random.nextGaussian
    a1 = a1 :+ nG
  }
  var sum = 0.0
  for( ind <- 0 to x1.size-1) {
    sum = sum + (x1(ind)*a1(ind))
  }
  val hash_val = (sum+b)/w
  val hash_val1 = (x.get_id,hash_val)
  hash_val1
}
val w = 8
val rddhash = parsedData.map(x => hashmin(x,w))
rddhash.foreach(println)
rddhash.foreach(println)
I don't understand why. Thank you in advance.

RDDs are merely a "pointer" to the data + operations to be applied to it. Actions materialize those operations by executing the RDD lineage.
So, RDDs are basically recomputed when an action is requested. In this case, the map function calling hashmin is being evaluated every time the foreach action is called.
There are a few options:
Cache the RDD - this keeps the results of the first evaluation so that later actions reuse them instead of recomputing the map. The lineage is still kept, so a partition that gets evicted from the cache would be recomputed with new random values:
val rddhash = parsedData.map(x => hashmin(x,w)).cache()
Use a seed for your random function, so that the generated pseudo-random sequence is the same each time.
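To illustrate the seed option, here is a minimal sketch in plain Scala (no Spark needed). Point and hashminSeeded are hypothetical names, Point only stands in for the question's Data_Object, and the seed 42 is arbitrary. Because a new Random is built from the same seed on every call, evaluating the map twice yields the same hashes:

```scala
import scala.util.Random

// Hypothetical stand-in for the question's Data_Object: an id plus a feature vector.
final case class Point(id: Int, vector: Array[Double])

// Seeded variant of hashmin: a Random built from the same seed always produces
// the same offset and the same Gaussian vector, so re-evaluation is repeatable.
def hashminSeeded(p: Point, w: Double, seed: Long): (Int, Double) = {
  val rng = new Random(seed)
  val b = rng.nextDouble() * w
  val a = Array.fill(p.vector.length)(rng.nextGaussian())
  val dot = p.vector.zip(a).map { case (x, y) => x * y }.sum
  (p.id, (dot + b) / w)
}

val points = Seq(Point(1, Array(1.0, 2.0)), Point(2, Array(3.0, 4.0)))

// Two independent evaluations, like two foreach actions on an uncached RDD.
val first = points.map(hashminSeeded(_, 8.0, seed = 42L))
val second = points.map(hashminSeeded(_, 8.0, seed = 42L))
```

The same function could be passed to parsedData.map. Note that with a fixed seed every row is projected onto the same Gaussian vector, which differs from the original code, where each row draws its own.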

RDDs are lazy - they're computed when they're used. So the calls to Random.nextGaussian are made again each time you call foreach.
You can use persist() to store an RDD if you want to keep fixed values.

Length of dataframe inside UDF function

I need to write a complex User Defined Function (UDF) that takes multiple columns as input. Something like:
val uudf = udf{(value:Int, lag:Int, cumsum_p:Double) => value + lag + cumsum_p} // actually a more complex function, but let's keep it simple
The third parameter, cumsum_p, is a cumulative sum of p, where p is based on the length of the group over which it is computed, because this UDF will then be used in a groupBy.
I came up with this solution, which is almost OK:
val uudf = udf{(value:Int, lag:Int, cumsum_p:Double) => value + lag + cumsum_p}
val w = Window.orderBy($"sale_qty")
df.withColumn("needThat",
uudf(col("sale_qty"),
lead("sale_qty",1).over(w), sum(lit(1/length_group)).over(w)
)
).show()
The problem is that if I replace lit(1/length_group) with lit(1/count("sale_qty")), the created column contains only one element, which leads to an error...
You should compute count("sale_qty") first:
val w = Window.orderBy($"sale_qty")
df
.withColumn("cnt",count($"sale_qty").over())
.withColumn("needThat",
uudf(col("sale_qty"),
lead("sale_qty",1).over(w), sum(lit(1)/$"cnt").over(w)
)
).show()

Attempting to parallelize a nested loop in Scala

I am comparing 2 dataframes in scala/spark using a nested loop and an external jar.
for (nrow <- dfm.rdd.collect) {
  var mid = nrow.mkString(",").split(",")(0)
  var mfname = nrow.mkString(",").split(",")(1)
  var mlname = nrow.mkString(",").split(",")(2)
  var mlssn = nrow.mkString(",").split(",")(3)
  for (drow <- dfn.rdd.collect) {
    var nid = drow.mkString(",").split(",")(0)
    var nfname = drow.mkString(",").split(",")(1)
    var nlname = drow.mkString(",").split(",")(2)
    var nlssn = drow.mkString(",").split(",")(3)
    val fNameArray = Array(mfname,nfname)
    val lNameArray = Array (mlname,nlname)
    val ssnArray = Array (mlssn,nlssn)
    val fnamescore = Main.resultSet(fNameArray)
    val lnamescore = Main.resultSet(lNameArray)
    val ssnscore = Main.resultSet(ssnArray)
    val overallscore = (fnamescore +lnamescore +ssnscore) /3
    if(overallscore >= .95) {
      println("MeditechID:".concat(mid)
        .concat(" MeditechFname:").concat(mfname)
        .concat(" MeditechLname:").concat(mlname)
        .concat(" MeditechSSN:").concat(mlssn)
        .concat(" NextGenID:").concat(nid)
        .concat(" NextGenFname:").concat(nfname)
        .concat(" NextGenLname:").concat(nlname)
        .concat(" NextGenSSN:").concat(nlssn)
        .concat(" FnameScore:").concat(fnamescore.toString)
        .concat(" LNameScore:").concat(lnamescore.toString)
        .concat(" SSNScore:").concat(ssnscore.toString)
        .concat(" OverallScore:").concat(overallscore.toString))
    }
  }
}
What I'm hoping to do is add some parallelism to the outer loop so that I can create a threadpool of 5 and pull 5 records from the collection of the outerloop and compare them to the collection of the inner loop, rather than doing this serially. So the outcome would be I can specify the number of threads, have 5 records from the outerloop's collection processing at any given time against the collection in the inner loop. How would I go about doing this?
Let's start by analyzing what you are doing. You collect the data of dfm to the driver. Then, for each element, you collect the data from dfn, transform it and compute a score for each pair of elements.
That's problematic in many ways. First, even without considering parallel computing, the transformations on the elements of dfn are repeated as many times as dfm has elements. Also, you collect the data of dfn for every row of dfm. That's a lot of network communication (between the driver and the executors).
If you want to use Spark to parallelize your computations, you need to use its API (RDD, SQL or Datasets). You seem to want to use RDDs to perform a Cartesian product (this is O(N*M), so be careful, it may take a while).
Let's start by transforming the data before the Cartesian product, so that each element is transformed only once. Also, for clarity, let's define a case class to contain your data and a function that transforms your dataframes into RDDs of that case class.
import org.apache.spark.sql.DataFrame
case class X(id : String, fname : String, lname : String, lssn : String)
def toRDDofX(df : DataFrame) = {
  df.rdd.map(row => {
    // using pattern matching to convert the array to the case class X
    // (assumes exactly four fields and no commas inside any field)
    row.mkString(",").split(",") match {
      case Array(a, b, c, d) => X(a, b, c, d)
    }
  })
}
Then, I use filter to keep only the tuples whose score is more than .95 but you could use map, foreach... depending on what you intend to do.
val rddn = toRDDofX(dfn)
val rddm = toRDDofX(dfm)
rddn.cartesian(rddm).filter{ case (xn, xm) => {
  val fNameArray = Array(xm.fname,xn.fname)
  val lNameArray = Array(xm.lname,xn.lname)
  val ssnArray = Array(xm.lssn,xn.lssn)
  val fnamescore = Main.resultSet(fNameArray)
  val lnamescore = Main.resultSet(lNameArray)
  val ssnscore = Main.resultSet(ssnArray)
  val overallscore = (fnamescore +lnamescore +ssnscore) /3
  // and then, let's say we filter by score
  overallscore > .95
}}
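The shape of this computation (parse each record once, pair everything with everything, keep the high scores) can be tried out locally on plain Scala collections. This is only a sketch: Person, similarity and overallScore are hypothetical names, the sample records are made up, and similarity is a toy stand-in for the external Main.resultSet, whose behavior is not known here.

```scala
final case class Person(id: String, fname: String, lname: String, ssn: String)

// Toy similarity in [0, 1]: the fraction of positions holding the same character.
// It only stands in for the external Main.resultSet.
def similarity(a: String, b: String): Double = {
  val longest = math.max(a.length, b.length)
  if (longest == 0) 1.0
  else a.zip(b).count { case (x, y) => x == y }.toDouble / longest
}

def overallScore(m: Person, n: Person): Double =
  (similarity(m.fname, n.fname) + similarity(m.lname, n.lname) + similarity(m.ssn, n.ssn)) / 3

val meditech = Seq(
  Person("m1", "John", "Smith", "123456789"),
  Person("m2", "Ann", "Lee", "987654321")
)
val nextgen = Seq(
  Person("n1", "John", "Smith", "123456789"),
  Person("n2", "Bob", "Stone", "555555555")
)

// Local equivalent of cartesian + filter: every pair is scored exactly once.
val matches = for {
  m <- meditech
  n <- nextgen
  score = overallScore(m, n)
  if score >= 0.95
} yield (m.id, n.id, score)
```

On an RDD, cartesian distributes the same pairing across the executors, so no thread pool has to be managed by hand.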
This is not the right way to iterate over a Spark dataframe. The major concern is dfm.rdd.collect. If the dataframe is arbitrarily large, you could end up with an exception, because collect brings all the data to the driver (master) node.
An alternative is to use the foreach or map construct of the RDD.
dfm.rdd.foreach(x => {
  // your logic
})
Now, you are also trying to iterate over the second dataframe inside this loop. I am afraid that won't be possible, since an RDD cannot be used inside another RDD's transformation. The elegant way is to join dfm and dfn and iterate over the resulting dataset to compute your function.

How to update an RDD

I have an RDD[(Int,Array[Double],Double, Double)].
val full_data = rdd.map(row => {
val label = row._1
val feature = row._2.map(_.toDouble)
val QD = k_function(feature)
val alpha = 0.0
(label,feature,QD,alpha)
})
Now I want to update the value of alpha in each record (say 10)
var tmp = full_data.map( x=> {
x._4 = 10
})
I got the error
Error: reassignment to val
x._4 = 10
I have changed all the vals to vars, but the error still occurs. How can I update the value of alpha? I would also like to know how to update a full row or a specific row in an RDD.
RDDs are immutable by nature, which makes them easy to cache, share and replicate. In a multi-threaded, distributed system like Spark, it is always safer to copy than to mutate, both for fault tolerance and for correctness. Immutable data is also much easier to recreate than mutable data.
A transformation is like copying the RDD data into another RDD, and every value is treated as a val, i.e. it is immutable. So if you want to replace the last double with 10, you can do:
val tmp = full_data.map( x => {
  (x._1, x._2, x._3, 10.0) // 10.0 rather than 10 keeps alpha a Double
})
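The copy-instead-of-mutate idea, including the "specific row" part of the question, can be sketched on a plain Scala collection. Record is a hypothetical case class replacing the 4-tuple and the sample values are made up. The same map calls would work on an RDD[Record].

```scala
// Hypothetical record type replacing the 4-tuple; field names mirror the question.
final case class Record(label: Int, feature: Array[Double], qd: Double, alpha: Double)

val fullData = Seq(
  Record(0, Array(1.0, 2.0), 0.5, 0.0),
  Record(1, Array(3.0, 4.0), 0.7, 0.0)
)

// Update every row: build a new collection whose records carry the new alpha.
val allUpdated = fullData.map(_.copy(alpha = 10.0))

// Update one specific row: copy only where the predicate holds, pass the rest through.
val oneUpdated = fullData.map(r => if (r.label == 1) r.copy(alpha = 10.0) else r)
```

With named fields, copy(alpha = ...) states which value changes, which is easier to read than rebuilding a tuple position by position. The original fullData is left untouched in both cases.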

How do I remove empty dataframes from a sequence of dataframes in Scala

How do I remove empty data frames from a sequence of data frames? In the code snippet below, there are many empty data frames in twoColDF. Another question about the for loop below: is there a way to make it more efficient? I tried rewriting it as the line below, but it didn't work.
//finalDF2 = (1 until colCount).flatMap(j => groupCount(j).map( y=> finalDF.map(a=>a.filter(df(cols(j)) === y)))).toSeq.flatten
var twoColDF: Seq[Seq[DataFrame]] = null
if (colCount == 2 )
{
val i = 0
for (j <- i + 1 until colCount) {
twoColDF = groupCount(j).map(y => {
finalDF.map(x => x.filter(df(cols(j)) === y))
})
}
}finalDF = twoColDF.flatten
Given a set of DataFrames, you can access each DataFrame's underlying RDD and use isEmpty to filter out the empty ones:
val input: Seq[DataFrame] = ???
val result = input.filter(!_.rdd.isEmpty())
As for your other question - I can't understand what your code tries to do, but I'd first try to convert it into something more functional (remove use of vars and imperative conditionals). If I'm guessing the meaning of your inputs, here's something that might be equivalent to what you're trying to do:
val input: Seq[DataFrame] = ???
// map of column index to column values -
// for each combination we'd want a new DF where that column has that value
// I'm assuming values are Strings, can be anything else
val groupCount: Map[Int, Seq[String]] = ???
// for each combination of DF + column + value - produce the filtered DF where this column has this value
val perValue: Seq[DataFrame] = for {
df <- input
index <- groupCount.keySet
value <- groupCount(index)
} yield df.filter(col(df.columns(index)) === value)
// remove empty results:
val result: Seq[DataFrame] = perValue.filter(!_.rdd.isEmpty())
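The for-comprehension above can be checked locally with plain collections. In this sketch, Table is a hypothetical alias that only imitates a DataFrame as a sequence of rows, and the sample data is made up:

```scala
// A "table" here is a Seq of rows, each row a Vector of String cells.
// It is a local stand-in for a DataFrame, not a Spark type.
type Table = Seq[Vector[String]]

val input: Seq[Table] = Seq(
  Seq(Vector("a", "x"), Vector("b", "y")),
  Seq(Vector("c", "z"))
)

// Column index -> values to split on (same role as groupCount in the answer).
val groupCount: Map[Int, Seq[String]] = Map(1 -> Seq("x", "y"))

// One filtered table per (table, column, value) combination.
val perValue: Seq[Table] = for {
  table <- input
  index <- groupCount.keys.toSeq
  value <- groupCount(index)
} yield table.filter(row => row(index) == value)

// Drop the combinations that matched no rows.
val result: Seq[Table] = perValue.filter(_.nonEmpty)
```

Here four combinations are produced and two of them are empty, so result keeps two tables. With real DataFrames each emptiness check launches a Spark job, so the number of combinations matters.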

How to compute cumulative sum using Spark

I have an rdd of (String,Int) which is sorted by key
val data = Array(("c1",6), ("c2",3),("c3",4))
val rdd = sc.parallelize(data).sortByKey
Now I want to start the value for the first key with zero and the subsequent keys as sum of the previous keys.
Eg: c1 = 0 , c2 = c1's value , c3 = (c1 value +c2 value) , c4 = (c1+..+c3 value)
expected output:
(c1,0), (c2,6), (c3,9)...
Is it possible to achieve this ?
I tried it with map but the sum is not preserved inside the map.
var sum = 0 ;
val t = keycount.map{ x => { val temp = sum; sum = sum + x._2 ; (x._1,temp); }}
Compute partial results for each partition:
val partials = rdd.mapPartitionsWithIndex((i, iter) => {
val (keys, values) = iter.toSeq.unzip
val sums = values.scanLeft(0)(_ + _)
Iterator((keys.zip(sums.tail), sums.last))
})
Collect the partial sums:
val partialSums = partials.values.collect
Compute cumulative sum over partitions and broadcast it:
val sumMap = sc.broadcast(
(0 until rdd.partitions.size)
.zip(partialSums.scanLeft(0)(_ + _))
.toMap
)
Compute final results:
val result = partials.keys.mapPartitionsWithIndex((i, iter) => {
val offset = sumMap.value(i)
if (iter.isEmpty) Iterator()
else iter.next.map{case (k, v) => (k, v + offset)}.toIterator
})
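The steps above can be traced on plain Scala collections, with each inner Seq playing the role of one partition. This is a sketch with made-up data, not Spark code. It also zips the keys with sums rather than sums.tail, which gives the exclusive running sum the question asks for (the first key gets 0), whereas sums.tail includes the current value.

```scala
// Each inner Seq stands for one RDD partition, already in key order.
val partitions: Seq[Seq[(String, Int)]] = Seq(
  Seq("c1" -> 6, "c2" -> 3),
  Seq.empty,
  Seq("c3" -> 4, "c4" -> 1)
)

// Step 1: exclusive running sum inside each partition, plus the partition total.
val partials: Seq[(Seq[(String, Int)], Int)] = partitions.map { part =>
  val (keys, values) = part.unzip
  val sums = values.scanLeft(0)(_ + _)
  (keys.zip(sums), sums.last) // zip stops at the shorter side, so the total is dropped
}

// Step 2: offset of a partition = sum of the totals of all earlier partitions.
val offsets: Seq[Int] = partials.map(_._2).scanLeft(0)(_ + _)

// Step 3: shift every partition-local sum by the offset of its partition.
val result: Seq[(String, Int)] = partials.map(_._1).zip(offsets).flatMap {
  case (rows, offset) => rows.map { case (k, v) => (k, v + offset) }
}
```

The empty middle partition contributes a total of 0, so it needs no special handling. Only one number per partition has to travel to the driver, which is why the Spark version scales.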
Spark has built-in support for Hive ANALYTICS/WINDOWING functions, and a cumulative sum can be computed easily with them.
Hive wiki: ANALYTICS/WINDOWING functions.
Example:
Assuming you have a sqlContext object:
val datardd = sqlContext.sparkContext.parallelize(Seq(("a",1),("b",2), ("c",3),("d",4),("d",5),("d",6)))
import sqlContext.implicits._
//Register as test table
datardd.toDF("id","val").createOrReplaceTempView("test")
//Calculate Cumulative sum
sqlContext.sql("select id,val, " +
"SUM(val) over ( order by id rows between unbounded preceding and current row ) cumulative_Sum " +
"from test").show()
This approach triggers the warning below. If an executor runs out of memory, tune the job's memory parameters accordingly to work with a huge dataset.
WARN WindowExec: No Partition Defined for Window operation! Moving
all data to a single partition, this can cause serious performance
degradation
I hope this helps.
Here is a solution in PySpark. Internally it's essentially the same as @zero323's Scala solution, but it provides a general-purpose function with a Spark-like API.
import numpy as np
def cumsum(rdd, get_summand):
    """Given an ordered rdd of items, computes cumulative sum of
    get_summand(row), where row is an item in the RDD.
    """
    def cumsum_in_partition(iter_rows):
        total = 0
        for row in iter_rows:
            total += get_summand(row)
            yield (total, row)
    rdd = rdd.mapPartitions(cumsum_in_partition)
    def last_partition_value(iter_rows):
        final = None
        for cumsum, row in iter_rows:
            final = cumsum
        return (final,)
    partition_sums = rdd.mapPartitions(last_partition_value).collect()
    partition_cumsums = list(np.cumsum(partition_sums))
    partition_cumsums = [0] + partition_cumsums
    partition_cumsums = sc.broadcast(partition_cumsums)
    def add_sums_of_previous_partitions(idx, iter_rows):
        return ((cumsum + partition_cumsums.value[idx], row)
                for cumsum, row in iter_rows)
    rdd = rdd.mapPartitionsWithIndex(add_sums_of_previous_partitions)
    return rdd
# test for correctness by summing numbers, with and without Spark
rdd = sc.range(10000,numSlices=10).sortBy(lambda x: x)
cumsums, values = zip(*cumsum(rdd,lambda x: x).collect())
assert all(cumsums == np.cumsum(values))
I came across a similar problem and implemented @Paul's solution. I wanted to compute a cumsum on an integer frequency table sorted by key (the integer), and there was a minor problem with np.cumsum(partition_sums), the error being unsupported operand type(s) for +=: 'int' and 'NoneType'.
If the range is big enough, every partition is likely to contain something (no None values). However, if the range is much smaller than the count and the number of partitions stays the same, some of the partitions will be empty and report None. Here is the modified solution:
import numpy as np
import pandas as pd
def cumsum(rdd, get_summand):
    """Given an ordered rdd of items, computes cumulative sum of
    get_summand(row), where row is an item in the RDD.
    """
    def cumsum_in_partition(iter_rows):
        total = 0
        for row in iter_rows:
            total += get_summand(row)
            yield (total, row)
    rdd = rdd.mapPartitions(cumsum_in_partition)
    def last_partition_value(iter_rows):
        final = None
        for cumsum, row in iter_rows:
            final = cumsum
        return (final,)
    partition_sums = rdd.mapPartitions(last_partition_value).collect()
    # partition_cumsums = list(np.cumsum(partition_sums))
    #----from here are the changed lines
    # an empty partition reports None; treat it as a partial sum of 0
    partition_sums = [x if x is not None else 0 for x in partition_sums]
    temp = np.cumsum(partition_sums)
    partition_cumsums = list(temp)
    #----
    partition_cumsums = [0] + partition_cumsums
    partition_cumsums = sc.broadcast(partition_cumsums)
    def add_sums_of_previous_partitions(idx, iter_rows):
        return ((cumsum + partition_cumsums.value[idx], row)
                for cumsum, row in iter_rows)
    rdd = rdd.mapPartitionsWithIndex(add_sums_of_previous_partitions)
    return rdd
# test on random integer frequency
x = np.random.randint(10, size=1000)
D = sqlCtx.createDataFrame(pd.DataFrame(x.tolist(),columns=['D']))
c = D.groupBy('D').count().orderBy('D')
c_rdd = c.rdd.map(lambda x:x['count'])
cumsums, values = zip(*cumsum(c_rdd,lambda x: x).collect())
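You can also try a window with rowsBetween. The frame has to start at Window.unboundedPreceding, since a fixed start such as -2 would only sum the two previous rows. I hope this is still helpful.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val data = Array(("c1",6), ("c2",3),("c3",4))
val df = sc.parallelize(data).sortByKey().toDF("c", "v")
val w = Window.orderBy("c")
// sum all rows before the current one; the first row has none, so its null is replaced by 0
val r = df.select( $"c", coalesce(sum($"v").over(w.rowsBetween(Window.unboundedPreceding, -1)), lit(0)).alias("cs"))
// display is notebook-specific (e.g. Databricks); use r.show() elsewhere
display(r)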