Scala - append RDD to itself - scala

for (fordate <- 2 to 30) {
  val dataRDD = sc.textFile("s3n://mypath" + fordate + "/*")
  val a = 1
  val c = fordate - 1
  for (b <- a to c) {
    val cumilativeRDD1 = sc.textFile("s3n://mypath/" + b + "/*")
    val cumilativeRDD : org.apache.spark.rdd.RDD[String] = sc.union(cumilativeRDD1, cumilativeRDD)
    if (b == c) {
      val incrementalDEviceIDs = dataRDD.subtract(cumilativeRDD)
      val countofIDs = incrementalDEviceIDs.distinct().count()
      println(s"201611 $fordate $countofIDs")
    }
  }
}
I have a data set where I get deviceIDs on a daily basis. I need to figure out the incremental count per day, but when I append cumilativeRDD to itself it throws the following error:
forward reference extends over definition of value cumilativeRDD
How can I overcome this?

The problem is this line:
val cumilativeRDD : org.apache.spark.rdd.RDD[String] = sc.union(cumilativeRDD1, cumilativeRDD)
You're using cumilativeRDD before its declaration. Assignment works from right to left: the right-hand side of = defines the variable on the left, so you cannot use a variable inside its own definition, because on the right-hand side the variable does not yet exist.
You have to initialize cumilativeRDD on the first run; then you can use it in the following runs:
var cumilativeRDD: Option[org.apache.spark.rdd.RDD[String]] = None

for (fordate <- 2 to 30) {
  val DataRDD = sc.textFile("s3n://mypath" + fordate + "/*")
  val c = fordate - 1
  for (b <- 1 to c) {
    val cumilativeRDD1 = sc.textFile("s3n://mypath/" + b + "/*")
    if (cumilativeRDD.isEmpty) cumilativeRDD = Some(cumilativeRDD1)
    else cumilativeRDD = Some(sc.union(cumilativeRDD1, cumilativeRDD.get))
    if (b == c) {
      val IncrementalDEviceIDs = DataRDD.subtract(cumilativeRDD.get)
      val countofIDs = IncrementalDEviceIDs.distinct().count()
      println("201611" + fordate + " " + countofIDs)
    }
  }
}
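As an aside, since each day's cumulative set is just the union of all previous days, the Option bookkeeping can be avoided entirely. A minimal alternative sketch, assuming the same s3n://mypath paths as above:

for (fordate <- 2 to 30) {
  val dataRDD = sc.textFile("s3n://mypath" + fordate + "/*")
  // union all previous days in a single call instead of accumulating in the inner loop
  val cumulativeRDD = sc.union((1 to fordate - 1).map(b => sc.textFile("s3n://mypath/" + b + "/*")))
  val countofIDs = dataRDD.subtract(cumulativeRDD).distinct().count()
  println("201611" + fordate + " " + countofIDs)
}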

Related

Store each partition to file and load it on same partition in Scala Spark

I have a situation where I have to store each partition's data to a file and later load the stored data back on the same partition. Here is my code:
Base class
case class foo(posVals: Array[Double], velVals: Array[Double], f: Array[Double] => Double,
               fitnessVal: Double, LR1: Double, PR1: Double) extends Serializable {
  var position: Array[Double] = posVals
  var velocity: Array[Double] = velVals
  var fitness: Double = fitnessVal
  var PulseRate: Double = PR1
  var LoudnessRate: Double = LR1
}
Objective function
def sphere (ar : Array[Double]) : Double = ar.reduce((x,y) => x+y*y)
Store and read data inside each partition
def execute(RDD: RDD[foo], c_itr: Int): Array[(foo, Int)] = {
  val newRDD = RDD.mapPartitionsWithIndex {
    (index, Iterator) => {
      var arr: Array[foo] = Iterator.toArray
      if (c_itr != 0) {
        // Read data from the stored file whose name is the partition number (index)
        val bufferedSource = Source.fromFile("/result/" + index + ".txt")
        val lines = bufferedSource.getLines()
        val data: Array[BAT1] = lines.flatMap { line =>
          val p = line.split(",")
          Seq(BAT1(p(0).toArray.map(_.toDouble), p(1).toArray.map(_.toDouble), sphere, line(2).toDouble, p(3).toDouble, p(4).toDouble))
        }.toArray
      }
      arr = data.clone() // Replace arr with the data loaded from the file
      // Save to file
      val writer = new FileWriter(Path + index + ".txt")
      for (i <- 0 until arr.length) {
        writer.write(arr(i).position.toList + "," + arr(i).velocity.toList + "," + arr(i).fitness + "," +
          arr(i).LoudnessRate + "," + arr(i).PulseRate + "\n")
      }
      writer.close()
      val bests: Array[(foo, Int)] = res1.map(x => (x, index))
      bests.toIterator
    }
  }
  newRDD.persist().collect()
}
A sample of the data stored in the file:
List(86.6582767815429, -25.224569272200586, 90.52371028878218, -59.91851894060545, -37.12944037124118),List(-59.60155033146984, -8.927455672466586, -23.679516503590534, 87.58857469881022 ,-14.864361504195127),6.840659702736215E10,0.6012,0.04131580765457621
List(86.6582767815429, -25.224569272200586, 90.52371028878218, -59.91851894060545, -26.10553311409422),List(-66.83980088207335, 51.088426986986015, -109.74073303298485, 66.87095748811572, -22.941448024344268),9.195157603574039E10,0.9025,0.06132589765454988
This code does not read back the exact data when the data is read from the file. I tried a lot but was unable to find the issue. How can I read the stored data correctly into the data object?
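One thing worth checking is the on-disk format: position.toList and velocity.toList are written via toString (hence the List(...) wrappers in the sample above), and a plain line.split(",") cannot split that back into the original fields. Purely as an illustration (the writeRecords/readRecords names and the ";"/"|" delimiters are assumptions, not code from the question), a format that round-trips cleanly could look like this:

import java.io.PrintWriter
import scala.io.Source

// Hypothetical helpers: one record per line, "|" between fields and ";" inside
// the two Double arrays, so each line splits back unambiguously.
def writeRecords(path: String, records: Array[foo]): Unit = {
  val writer = new PrintWriter(path)
  records.foreach { r =>
    writer.println(Seq(
      r.position.mkString(";"),
      r.velocity.mkString(";"),
      r.fitness,
      r.LoudnessRate,
      r.PulseRate
    ).mkString("|"))
  }
  writer.close()
}

def readRecords(path: String, f: Array[Double] => Double): Array[foo] =
  Source.fromFile(path).getLines().map { line =>
    val p = line.split("\\|")
    foo(p(0).split(";").map(_.toDouble), // posVals
        p(1).split(";").map(_.toDouble), // velVals
        f,
        p(2).toDouble,                   // fitnessVal
        p(3).toDouble,                   // LR1 (LoudnessRate)
        p(4).toDouble)                   // PR1 (PulseRate)
  }.toArray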

Bubble sort of random integers in scala

I'm new to the Scala programming language, so for this bubble sort I need to generate 10 random integers instead of writing them out as in the code below.
Any suggestions?
object BubbleSort {
  def bubbleSort(array: Array[Int]) = {
    def bubbleSortRecursive(array: Array[Int], current: Int, to: Int): Array[Int] = {
      println(array.mkString(",") + " current -> " + current + ", to -> " + to)
      to match {
        case 0 => array
        case _ if (to == current) => bubbleSortRecursive(array, 0, to - 1)
        case _ =>
          if (array(current) > array(current + 1)) {
            val temp = array(current + 1)
            array(current + 1) = array(current)
            array(current) = temp
          }
          bubbleSortRecursive(array, current + 1, to)
      }
    }
    bubbleSortRecursive(array, 0, array.size - 1)
  }

  def main(args: Array[String]) {
    val sortedArray = bubbleSort(Array(10, 9, 11, 5, 2))
    println("Sorted Array -> " + sortedArray.mkString(","))
  }
}
Try this:
import scala.util.Random
val sortedArray = (1 to 10).map(_ => Random.nextInt).toArray
You can use scala.util.Random for generation. The nextInt method takes a maxValue argument, so with the code sample below you'll generate 10 Int values from 0 (inclusive) to 100 (exclusive).
val r = scala.util.Random
for (i <- 1 to 10) yield r.nextInt(100)
You can find more info in the scala.util.Random documentation.
You can also do it this way:
val solv1 = Random.shuffle( (1 to 100).toList).take(10)
val solv2 = Array.fill(10)(Random.nextInt)
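Either of these plugs straight into the main method from the question, for example:

import scala.util.Random

def main(args: Array[String]) {
  // 10 random Ints in [0, 100) instead of a hard-coded array
  val randomArray = Array.fill(10)(Random.nextInt(100))
  val sortedArray = bubbleSort(randomArray)
  println("Sorted Array -> " + sortedArray.mkString(","))
}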

Join two strings in Scala with one to one mapping

I have two strings in Scala
Input 1 : "a,c,e,g,i,k"
Input 2 : "b,d,f,h,j,l"
How do I join the two Strings in Scala?
Required output = "ab,cd,ef,gh,ij,kl"
I tried something like:
var columnNameSetOne:Array[String] = Array(); //v1 = "a,c,e,g,i,k"
var columnNameSetTwo:Array[String] = Array(); //v2 = "b,d,f,h,j,l"
After I get the input data as mentioned above
columnNameSetOne = v1.split(",")
columnNameSetTwo = v2.split(",");
val newColumnSet = IntStream
  .range(0, Math.min(columnNameSetOne.length, columnNameSetTwo.length))
  .mapToObj(j => (columnNameSetOne(j) + columnNameSetTwo(j)))
  .collect(Collectors.joining(","))
println(newColumnSet)
But I am getting an error on j. Also, I am not sure whether this would work at all!
object Solution1 extends App {
  val input1 = "a,c,e,g,i,k"
  val input2 = "b,d,f,h,j,l"

  val i1 = input1.split(",")
  val i2 = input2.split(",")

  val x = i1.zipAll(i2, "", "").map {
    case (a, b) => a + b
  }

  println(x.mkString(","))
}
// output: ab,cd,ef,gh,ij,kl
This is easy to do using the zip function on lists.
val v1 = "a,c,e,g,i,k"
val v2 = "b,d,f,h,j,l"
val list1 = v1.split(",").toList
val list2 = v2.split(",").toList
list1.zip(list2).mkString(",") // res0: String = (a,b),( c,d),( e,f),( g,h),( i,j),( k,l)
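Note that zip by itself gives you pairs (as the output above shows); to get the exact required output, map each pair to a concatenated string before joining:

list1.zip(list2).map { case (a, b) => a + b }.mkString(",") // res1: String = ab,cd,ef,gh,ij,kl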

org.apache.spark.SparkException: Task not serializable (scala)

I am new to Scala as well as to Spark. Please help me resolve this issue.
In the spark shell, when I load the functions below individually they run without any exception, but when I copy these functions into a Scala object and load the same file in the spark shell, they throw a task not serializable exception in the "processbatch" function when trying to parallelize.
PFB the code for the same:
import org.apache.spark.sql.Row
import org.apache.log4j.Logger
import org.apache.spark.sql.hive.HiveContext

object Process {
  val hc = new HiveContext(sc)

  def processsingle(wait: Int, patient: org.apache.spark.sql.Row, visits: Array[org.apache.spark.sql.Row]): String = {
    var out = new StringBuilder()
    val processStart = getTimeInMillis()
    for (x <- visits) {
      out.append(", " + x.getAs("patientid") + ":" + x.getAs("visitid"))
    }
  }

  def processbatch(batch: Int, wait: Int, patients: Array[org.apache.spark.sql.Row], visits: Array[org.apache.spark.sql.Row]) = {
    val out = sc.parallelize(patients, batch).map(r => processsingle(wait, r, visits.filter(f => f.getAs("patientid") == r.getAs("patientid")))).collect()
    for (x <- out) println(x)
  }

  def processmeasures(fetch: Int, batch: Int, wait: Int) = {
    val patients = hc.sql("SELECT patientid FROM tableName1 order by p_id").collect()
    val visit = hc.sql("SELECT patientid, visitid FROM tableName2")
    val count = patients.length
    val fetches = if (count % fetch > 0) (count / fetch + 1) else (count / fetch)

    for (i <- 0 to fetches.toInt - 1) {
      val startFetch = i * fetch
      val endFetch = math.min((i + 1) * fetch, count.toInt) - 1
      val fetchSize = endFetch - startFetch + 1
      val fetchClause = "patientid >= " + patients(startFetch).get(0) + " and patientid <= " + patients(endFetch).get(0)
      val fetchVisit = visit.filter(fetchClause).collect()
      val batches = if (fetchSize % batch > 0) (fetchSize / batch + 1) else (fetchSize / batch)

      for (j <- 0 to batches.toInt - 1) {
        val startBatch = j * batch
        val endBatch = math.min((j + 1) * batch, fetch.toInt) - 1
        println(s"Batch from $startBatch to $endBatch")
        val batchVisits = fetchVisit.filter(g => g.getAs[Long]("patientid") >= patients(i * fetch + startBatch).getLong(0) && g.getAs[Long]("patientid") <= patients(math.min(i * fetch + endBatch + 1, endFetch)).getLong(0))
        processbatch(batch, wait, patients.slice(i * fetch + startBatch, i * fetch + endBatch + 1), batchVisits)
      }
    }
    println("Processing took " + getExecutionTime(processStart) + " millis")
  }
}
You should make the Process object Serializable:
object Process extends Serializable {
...
}
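If the exception persists, a common follow-up step (an assumption on my part, not part of the answer above) is to keep non-serializable handles such as the HiveContext out of the serialized object, for example by marking them @transient lazy:

object Process extends Serializable {
  // @transient: do not serialize the HiveContext with the object;
  // lazy: the field is re-initialized on first access after deserialization.
  @transient lazy val hc = new HiveContext(sc)
  ...
}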

Spark: split rows and accumulate

I have this code:
val rdd = sc.textFile("sample.log")
val splitRDD = rdd.map(r => StringUtils.splitPreserveAllTokens(r, "\\|"))
val rdd2 = splitRDD.filter(...).map(row => createRow(row, fieldsMap))

sqlContext.createDataFrame(rdd2, structType).save(
  "org.apache.phoenix.spark", SaveMode.Overwrite, Map("table" -> table, "zkUrl" -> zkUrl))
def createRow(row: Array[String], fieldsMap: ListMap[Int, FieldConfig]): Row = {
  // add an additional index for invalidValues
  val arrSize = fieldsMap.size + 1
  val arr = new Array[Any](arrSize)
  var invalidValues = ""

  for ((k, v) <- fieldsMap) {
    val valid = ...
    var value: Any = null
    if (valid) {
      value = row(k)
      // if (v.code == "SOURCE_NAME") --> 5th column in the row
      //   sourceNameCount = row(k).split(",").size
    } else {
      invalidValues += v.code + " : " + row(k) + " | "
    }
    arr(k) = value
  }
  arr(arrSize - 1) = invalidValues

  Row.fromSeq(arr.toSeq)
}
fieldsMap contains the mapping of the input columns: (index, FieldConfig), where the FieldConfig class contains the "code" and "dataType" values.
TOPIC -> (0, v.code = "TOPIC", v.dataType = "String")
GROUP -> (1, v.code = "GROUP")
SOURCE_NAME1,SOURCE_NAME2,SOURCE_NAME3 -> (4, v.code = "SOURCE_NAME")
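(The FieldConfig class itself is not shown; a minimal assumed shape, consistent with how createRow uses it, would be:)

import scala.collection.immutable.ListMap

// Assumed definitions, only to make the mapping above concrete.
case class FieldConfig(code: String, dataType: String)

val fieldsMap: ListMap[Int, FieldConfig] = ListMap(
  0 -> FieldConfig("TOPIC", "String"),
  1 -> FieldConfig("GROUP", "String"),
  4 -> FieldConfig("SOURCE_NAME", "String")
)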
This is the sample.log:
TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME1,SOURCE_NAME2,SOURCE_NAME3|SOURCE_TYPE1,SOURCE_TYPE2,SOURCE_TYPE3|SOURCE_COUNT1,SOURCE_COUNT2,SOURCE_COUNT3|DEST_NAME1,DEST_NAME2,DEST_NAME3|DEST_TYPE1,DEST_TYPE2,DEST_TYPE3|DEST_COUNT1,DEST_COUNT2,DEST_COUNT3|
The goal is to split the input (sample.log) based on the number of source_name(s). In the example above, the output will have 3 rows:
TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME1|SOURCE_TYPE1|SOURCE_COUNT1|DEST_NAME1|DEST_TYPE1|DEST_COUNT1|
TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME2|SOURCE_TYPE2|SOURCE_COUNT2|DEST_NAME2|DEST_TYPE2|DEST_COUNT2|
TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME3|SOURCE_TYPE3|SOURCE_COUNT3|DEST_NAME3|DEST_TYPE3|DEST_COUNT3|
This is the new code I am working on (still using createRow defined above):
val rdd2 = splitRDD.filter(...).flatMap(row => {
  val srcName = row(4).split(",")
  val srcType = row(5).split(",")
  val srcCount = row(6).split(",")
  val destName = row(7).split(",")
  val destType = row(8).split(",")
  val destCount = row(9).split(",")

  var newRDD: ArrayBuffer[Row] = new ArrayBuffer[Row]()

  //if (srcName != null) {
  println("\n\nsrcName.size: " + srcName.size + "\n\n")
  for (i <- 0 to srcName.size - 1) {
    // missing column: destType can sometimes be null
    val splittedRow: Array[String] = Row.fromSeq(Seq((row(0), row(1), row(2), row(3),
      srcName(i), srcType(i), srcCount(i), destName(i), "", destCount(i)))).toSeq.toArray[String]
    newRDD = newRDD ++ Seq(createRow(splittedRow, fieldsMap))
  }
  //}

  Seq(Row.fromSeq(Seq(newRDD)))
})
Since I am getting an error converting my splittedRow to Array[String] (the ".toSeq.toArray[String]" part):
error: type arguments [String] do not conform to method toArray's type parameter bounds [B >: Any]
I decided to update my splittedRow to:
val rowArr: Array[String] = new Array[String](10)
for (j <- 0 to 3) {
  rowArr(j) = row(j)
}
rowArr(4) = srcName(i)
rowArr(5) = row(5).split(",")(i)
rowArr(6) = row(6).split(",")(i)
rowArr(7) = row(7).split(",")(i)
rowArr(8) = row(8).split(",")(i)
rowArr(9) = row(9).split(",")(i)
val splittedRow = rowArr
You could use a flatMap operation instead of a map operation to return multiple rows. Consequently, your createRow would be refactored to createRows(row: Array[String], fieldsMap: ListMap[Int, FieldConfig]): Seq[Row].
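A minimal sketch of that refactoring, reusing the createRow, fieldsMap and FieldConfig from the question (createRows and the column handling below are illustrative, not code from the answer):

import scala.collection.immutable.ListMap
import org.apache.spark.sql.Row

// One input row fans out into one Row per SOURCE_NAME entry.
def createRows(row: Array[String], fieldsMap: ListMap[Int, FieldConfig]): Seq[Row] = {
  val srcNames = row(4).split(",")
  (0 until srcNames.length).map { i =>
    // keep columns 0-3 as-is; pick the i-th value from each comma-separated column 4-9
    val splittedRow: Array[String] = row.take(4) ++ (4 to 9).map { c =>
      val parts = row(c).split(",")
      if (i < parts.length) parts(i) else "" // e.g. destType can be missing
    }
    createRow(splittedRow, fieldsMap)
  }
}

// and in the pipeline, flatMap instead of map:
val rdd2 = splitRDD.filter(...).flatMap(row => createRows(row, fieldsMap))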