Why ArrayBuffer is deleted after RDD transform?

Why ArrayBuffer is deleted after RDD transform? - scala

Here is my scala code:
var calcUsers = new ArrayBuffer[Int]()
nRDD.foreach(item=>{
val arr = item.split(' ')
val currUserId = arr(1).toInt
calcUsers.+=(currUserId)
println("calcUsers",currUserId,calcUsers.length)
})
println("calcUsers",calcUsers)
In the first println, calcUsers's length is increased as it processes.
But in the second line, output always is 0.
Why? How to solve it?

Related

how to update an RDD

I have and an RDD[(Int,Array[Double],Double, Double)].
val full_data = rdd.map(row => {
val label = row._1
val feature = row._2.map(_.toDouble)
val QD = k_function(feature)
val alpha = 0.0
(label,feature,QD,alpha)
})
Now I want to update the value of alpha in each record (say 10)
var tmp = full_data.map( x=> {
x._4 = 10
})
I got the error
Error: reassignment to val
x._4 = 10
I have changed the all the val to var but still, the error occurs. How to update the value of alpha. and I would like to know how to update the full row or a specific row in an RDD.

RDD's are immutable in nature. They are made so for easy caching, sharing and replicating. Its always safe to copy than to mutate in a multi-threaded system like spark for fault tolerance and correctness in processing. Recreation of immutable data is much easier than mutable data.
Transformation is like copying the RDD data to another RDD every variables are treated as val i.e. they are immutable so if you are looking to replace the last double with 10, you can do is
var tmp = full_data.map( x=> {
(x._1, x._2, x._3, 10)
})

How do I remove empty dataframes from a sequence of dataframes in Scala

How do I remove empty data frames from a sequence of data frames? In this below code snippet, there are many empty data frames in twoColDF. Also another question for the below for loop, is there a way that I can make this efficient? I tried rewriting this to below line but didn't work
//finalDF2 = (1 until colCount).flatMap(j => groupCount(j).map( y=> finalDF.map(a=>a.filter(df(cols(j)) === y)))).toSeq.flatten
var twoColDF: Seq[Seq[DataFrame]] = null
if (colCount == 2 )
{
val i = 0
for (j <- i + 1 until colCount) {
twoColDF = groupCount(j).map(y => {
finalDF.map(x => x.filter(df(cols(j)) === y))
})
}
}finalDF = twoColDF.flatten

Given a set of DataFrames, you can access each DataFrame's underlying RDD and use isEmpty to filter out the empty ones:
val input: Seq[DataFrame] = ???
val result = input.filter(!_.rdd.isEmpty())
As for your other question - I can't understand what your code tries to do, but I'd first try to convert it into something more functional (remove use of vars and imperative conditionals). If I'm guessing the meaning of your inputs, here's something that might be equivalent to what you're trying to do:
var input: Seq[DataFrame] = ???
// map of column index to column values -
// for each combination we'd want a new DF where that column has that value
// I'm assuming values are Strings, can be anything else
val groupCount: Map[Int, Seq[String]] = ???
// for each combination of DF + column + value - produce the filtered DF where this column has this value
val perValue: Seq[DataFrame] = for {
df <- input
index <- groupCount.keySet
value <- groupCount(index)
} yield df.filter(col(df.columns(index)) === value)
// remove empty results:
val result: Seq[DataFrame] = perValue.filter(!_.rdd.isEmpty())

How to copy matrix to column array

I'm trying to copy a column of a matrix into an array, also I want to make this matrix public.
Heres my code:
val years = Array.ofDim[String](1000, 1)
val bufferedSource = io.Source.fromFile("Top_1_000_Songs_To_Hear_Before_You_Die.csv")
val i=0;
//println("THEME, TITLE, ARTIST, YEAR, SPOTIFY_URL")
for (line <- bufferedSource.getLines) {
val cols = line.split(",").map(_.trim)
years(i)=cols(3)(i)
}
I want the cols to be a global matrix and copy the column 3 to years, because of the method of that I get cols I dont know how to define it

There're three different problems in your attempt:
Your regexp will fail for this dataset. I suggest you change it to:
val regex = ",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))"
This will capture the blocks wrapped in double quotes but containing commas (courtesy of Luke Sheppard on regexr)
This val i=0; is not very scala-ish / functional. We can replace it by a zipWithIndex in the for comprehension:
for ((line, count) <- bufferedSource.getLines.zipWithIndex)
You can create the "global matrix" by extracting elements from each line (val Array (...)) and returning them as the value of the for-comprehension block (yield):
It looks like that:
for ((line, count) <- bufferedSource.getLines.zipWithIndex) yield {
val Array(theme,title,artist,year,spotify_url) = line....
...
(theme,title,artist,year,spotify_url)
}
And here is the complete solution:
val bufferedSource = io.Source.fromFile("/tmp/Top_1_000_Songs_To_Hear_Before_You_Die.csv")
val years = Array.ofDim[String](1000, 1)
val regex = ",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))"
val iteratorMatrix = for ((line, count) <- bufferedSource.getLines.zipWithIndex) yield {
val Array(theme,title,artist,year,spotify_url) = line.split(regex, -1).map(_.trim)
years(count) = Array(year)
(theme,title,artist,year,spotify_url)
}
// will actually consume the iterator and fill in globalMatrix AND years
val globalMatrix = iteratorMatrix.toList

Here's a function that will get the col column from the CSV. There is no error handling here for any empty row or other conditions. This is a proof of concept so add your own error handling as you see fit.
val years = (fileName: String, col: Int) => scala.io.Source.fromFile(fileName)
.getLines()
.map(_.split(",")(col).trim())
Here's a suggestion if you are looking to keep the contents of the file in a map. Again there's no error handling just proof of concept.
val yearColumn = 3
val fileName = "Top_1_000_Songs_To_Hear_Before_You_Die.csv"
def lines(file: String) = scala.io.Source.fromFile(file).getLines()
val mapRow = (row: String) => row.split(",").zipWithIndex.foldLeft(Map[Int, String]()){
case (acc, (v, idx)) => acc.updated(idx,v.trim())}
def mapColumns = (values: Iterator[String]) =>
values.zipWithIndex.foldLeft(Map[Int, Map[Int, String]]()){
case (acc, (v, idx)) => acc.updated(idx, mapRow(v))}
val parser = lines _ andThen mapColumns
val matrix = parser(fileName)
val years = matrix.flatMap(_.swap._1.get(yearColumn))
This will build a Map[Int,Map[Int, String]] which you can use elsewhere. The first index of the map is the row number and the index of the inner map is the column number. years is an Iterable[String] that contains the year values.

Consider adding contents to a collection at the same time as it is created, in contrast to allocate space first and then update it; for instance like this,
val rawSongsInfo = io.Source.fromFile("Top_Songs.csv").getLines
val cols = for (rsi <- rawSongsInfo) yield rsi.split(",").map(_.trim)
val years = cols.map(_(3))

My RDD change his values himself

I have a basic RDD[Object] on which i apply a map with a hashfunction on Object values using nextGaussian and nextDouble scala function. And when i print values there change at each print
def hashmin(x:Data_Object, w:Double) = {
val x1 = x.get_vector.toArray
var a1 = Array(0.0).tail
val b = Random.nextDouble * w
for( ind <- 0 to x1.size-1) {
val nG = Random.nextGaussian
a1 = a1 :+ nG
}
var sum = 0.0
for( ind <- 0 to x1.size-1) {
sum = sum + (x1(ind)*a1(ind))
}
val hash_val = (sum+b)/w
val hash_val1 = (x.get_id,hash_val)
hash_val1
}
val w = 8
val rddhash = parsedData.map(x => hashmin(x,w))
rddhash.foreach(println)
rddhash.foreach(println)
I don't understand why. Thank you in advance.

RDDs are merely a "pointer" to the data + operations to be applied to it. Actions materialize those operations by executing the RDD lineage.
So, RDDs are basically recomputed when an action is requested. In this case, the map function calling hashmin is being evaluated every time the foreach action is called.
There're few options:
Cache the RDD - this will cause the lineage to be broken and the results of the first transformation will be preserved:
val rddhash = parsedData.map(x => hashmin(x,w)).cache()
Use a seed for your random function, sothat the pseudo-random sequence generated is each time the same.

RDDs are lazy - they're computed when they're used. So the calls to Random.nextGaussian are made again each time you call foreach.
You can use persist() to store an RDD if you want to keep fixed values.

Efficient countByValue of each column Spark Streaming

I want to find countByValues of each column in my data. I can find countByValue() for each column (e.g. 2 columns now) in basic batch RDD as fallows:
scala> val double = sc.textFile("double.csv")
scala> val counts = sc.parallelize((0 to 1).map(index => {
double.map(x=> { val token = x.split(",")
(math.round(token(index).toDouble))
}).countByValue()
}))
scala> counts.take(2)
res20: Array[scala.collection.Map[Long,Long]] = Array(Map(2 -> 5, 1 -> 5), Map(4 -> 5, 5 -> 5))
Now I want to perform same with DStreams. I have windowedDStream and want to countByValue on each column. My data has 50 columns. I have done it as fallows:
val windowedDStream = myDStream.window(Seconds(2), Seconds(2)).cache()
ssc.sparkContext.parallelize((0 to 49).map(index=> {
val counts = windowedDStream.map(x=> { val token = x.split(",")
(math.round(token(index).toDouble))
}).countByValue()
counts.print()
}))
val topCounts = counts.map . . . . will not work
I get correct results with this, the only issue is that I want to apply more operations on counts and it's not available outside map.

You misunderstand what parallelize does. You think when you give it a Seq of two elements, those two elements will be calculated in parallel. That it not the case and it would be impossible for it to be the case.
What parallelize actually does is it creates an RDD from the Seq that you provided.
To try to illuminate this, consider that this:
val countsRDD = sc.parallelize((0 to 1).map { index =>
double.map { x =>
val token = x.split(",")
math.round(token(index).toDouble)
}.countByValue()
})
Is equal to this:
val counts = (0 to 1).map { index =>
double.map { x =>
val token = x.split(",")
math.round(token(index).toDouble)
}.countByValue()
}
val countsRDD = sc.parallelize(counts)
By the time parallelize runs, the work has already been performed. parallelize cannot retroactively make it so that the calculation happened in parallel.
The solution to your problem is to not use parallelize. It is entirely pointless.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Why ArrayBuffer is deleted after RDD transform? - scala

Related

how to update an RDD

How do I remove empty dataframes from a sequence of dataframes in Scala

How to copy matrix to column array

My RDD change his values himself

Efficient countByValue of each column Spark Streaming

Categories

Resources