Getting type mismatch exception in Scala

Hi, I am trying to write a UDAF with Spark in Scala, and I am getting the following compile error:
type mismatch; found: scala.collection.immutable.IndexedSeq[Any] required: String (resource: SumCalc.scala, path: /filter, location: line 63, type: Scala Problem)
This is my code:
override def evaluate(buffer: Row): Any = {
val in_array = buffer.getAs[WrappedArray[String]](0);
var finalArray = Array.empty[Array[String]]
import scala.util.control.Breaks._
breakable {
for (outerArray <- in_array) {
val currentTimeStamp = outerArray(1).toLong
var sum = 0.0
var count = 0
var check = false
var list = outerArray
for (array <- in_array) {
val toCheckTimeStamp = array(1).toLong
if (((currentTimeStamp - 10L) <= toCheckTimeStamp) && (currentTimeStamp >= toCheckTimeStamp)) {
sum += array(5).toDouble
count += 1
}
if ((currentTimeStamp - 10L) > toCheckTimeStamp) {
check = true
break
}
}
if (sum != 0.0 && check) list = list :+ (sum).toString // getting error on this line.
else list = list :+ list(5).toDouble.toString
finalArray ++= Array(list)
}
finalArray
}
}
Any help will be appreciated.

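There are a few mistakes in the evaluate function of your UDAF:
list is a String (in_array is a WrappedArray[String], so each element is a String), but you are treating it as an array. Calling :+ on a String with a String argument yields an IndexedSeq[Any], which is the type mismatch the compiler reports.
finalArray is initialized as Array.empty[Array[String]], but you then add Array(list) to it, which is an Array[String].
finalArray is not returned from the evaluate method, because it sits inside the breakable block instead of being the last expression of the method.
So the corrected version should look like this: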
override def evaluate(buffer: Row): Any = {
val in_array = buffer.getAs[WrappedArray[String]](0);
var finalArray = Array.empty[String]
import scala.util.control.Breaks._
breakable {
for (outerArray <- in_array) {
val currentTimeStamp = outerArray(1).toLong // timestamp values
var sum = 0.0
var count = 0
var check = false
var list = outerArray
for (array <- in_array) {
val toCheckTimeStamp = array(1).toLong
if (((currentTimeStamp - 10L) <= toCheckTimeStamp) && (currentTimeStamp >= toCheckTimeStamp)) {
sum += array(5).toDouble // RSSI weightage values
count += 1
}
if ((currentTimeStamp - 10L) > toCheckTimeStamp) {
check = true
break
}
}
if (sum != 0.0 && check) list = list + (sum).toString // calculate sum for the 10 secs difference
else list = list + (sum).toString // If 10 secs difference is not there take rssi weightage value
finalArray ++= Array(list)
}
}
finalArray // Final results for this function
}
Hope the answer is helpful
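To see the first mistake in isolation, here is a minimal sketch without Spark. The object name AppendDemo and the sample row string are made up for illustration; only the :+ and + behaviour comes from the standard library.

```scala
object AppendDemo {
  // A made-up row, standing in for one String element of the buffer
  val row: String = "tag,100,listener"

  // `:+` appends one *element* to a sequence. A String is a sequence of Char,
  // and "42.0" is not a Char, so the result widens to IndexedSeq[Any].
  val widened: IndexedSeq[Any] = row :+ "42.0"

  // This is the line the compiler rejects (found IndexedSeq[Any], required String):
  // val broken: String = row :+ "42.0"

  // Plain string concatenation keeps the type String
  val concatenated: String = row + "," + 42.0.toString

  // If an array of fields is what you want, split first and then append
  val fields: Array[String] = row.split(",") :+ 42.0.toString
}
```

Which of the last two forms fits depends on whether the buffer is supposed to hold whole lines or arrays of fields, which the question does not make fully clear.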

Related

Records are missing after creating the table from spark temp table in Spark2

I have created a dataframe from below sequence.
val df = sc.parallelize(Seq((100,23,9.50),
(100,23,9.51),
(100,24,9.52),
(100,25,9.54),
(100,23,9.55),
(101,21,8.51),
(101,23,8.52),
(101,24,8.55),
(101,20,8.56))).toDF("id", "temp","time")
I wanted to update the DF by adding a few more rows where data is missing for a given time, so I iterated over the DF with mapPartitions to add the new rows.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, Row, Column}
@transient val w = org.apache.spark.sql.expressions.Window.partitionBy("id").orderBy("time")
val leadDf = df.withColumn("time_diff", ((lead("time", 1).over(w) - df("time")).cast("Float")*100).cast("int"))
Dataframe iteration goes here:
val result = leadDf.rdd.mapPartitions(itr =>
new Iterator[Row] {
var prevRow = null: Row
var prevDone = true
var firstRow = true
var outputRow: Row = null: Row
var counter = 0
var currRecord = null :Row
var currRow: Row = if (itr.hasNext) {currRecord = itr.next; currRecord } else null
prevRow = currRow
override def hasNext: Boolean = {
if (!prevDone) {
prevRow = incrementValue(prevRow,2)
outputRow = prevRow
counter = counter -1
if(counter == 0) {
prevDone = true
}
true
} else if (itr.hasNext) {
prevRow = currRow
if(counter == 0 && prevRow.getAs[Int](3) != 1 && !isNullValue(prevRow,3 )){
outputRow = prevRow
counter = prevRow.getAs[Int](3) - 1
prevDone = false
}else if(counter > 0) {
counter = counter -1
prevDone = false
}
else {
outputRow = currRow
}
//if(counter == 0){
currRow = itr.next
true
} else if (currRow != null) {
outputRow = currRow
currRow =null
true
} else {
false
}
}
override def next(): Row = outputRow
})
val newDf = spark.createDataFrame(result,leadDf.schema)
After this, I can see 12 records in the dataframe, but I get only 10 records from the physical table created from the temp table that was registered on the "newDf" dataframe.
newDf.registerTempTable("test")
spark.sql("create table newtest as select * from test")
scala> newDf.count
res14: Long = 12
scala> spark.sql("select * from newtest").count
res15: Long = 10
The same code works fine in Spark 1.6 and final table count matches with dataframe record count.
Can someone explain why this is happening, and suggest a solution or workaround?
I found a workaround: calling the repartition method on the DataFrame newly created from the RDD[Row].
val newDf = spark.createDataFrame(result,leadDf.schema).repartition(result.getNumPartitions)

alternate way to proceed without list in scala

I have Scala code like this:
def avgCalc(buffer: Iterable[Array[String]], list: Array[String]) = {
val currentTimeStamp = list(1).toLong // loads the timestamp column
var sum = 0.0
var count = 0
var check = false
import scala.util.control.Breaks._
breakable {
for (array <- buffer) {
val toCheckTimeStamp = array(1).toLong // timestamp column
if (((currentTimeStamp - 10L) <= toCheckTimeStamp) && (currentTimeStamp >= toCheckTimeStamp)) { // to check the timestamp for 10 seconds difference
sum += array(5).toDouble // RSSI weightage values
count += 1
}
if ((currentTimeStamp - 10L) > toCheckTimeStamp) {
check = true
break
}
}
}
list :+ sum
}
I call the above function like this:
import spark.implicits._
val averageDF =
filterop.rdd.map(_.mkString(",")).map(line => line.split(",").map(_.trim))
.sortBy(array => array(1), false) // Sort by timestamp
.groupBy(array => (array(0), array(2))) // group by tag and listner
.mapValues(buffer => {
buffer.map(list => {
avgCalc(buffer, list) // calling the average function
})
})
.flatMap(x => x._2)
.map(x => findingavg(x(0).toString, x(1).toString.toLong, x(2).toString, x(3).toString, x(4).toString, x(5).toString.toDouble, x(6).toString.toDouble)) // defining the schema through case class
.toDF // converting to data frame
The above code works fine, but I need to get rid of list. My senior asked me to remove it because passing the list slows down execution. Any suggestions on how to proceed without it?
Any help will be appreciated.
The following solution should work, I guess. I have tried to avoid passing both the Iterable and a single array.
def avgCalc(buffer: Iterable[Array[String]]) = {
var finalArray = Array.empty[Array[String]]
import scala.util.control.Breaks._
breakable {
for (outerArray <- buffer) {
val currentTimeStamp = outerArray(1).toLong
var sum = 0.0
var count = 0
var check = false
var list = outerArray
for (array <- buffer) {
val toCheckTimeStamp = array(1).toLong
if (((currentTimeStamp - 10L) <= toCheckTimeStamp) && (currentTimeStamp >= toCheckTimeStamp)) {
sum += array(5).toDouble
count += 1
}
if ((currentTimeStamp - 10L) > toCheckTimeStamp) {
check = true
break
}
}
if (sum != 0.0 && check) list = list :+ (sum / count).toString
else list = list :+ list(5).toDouble.toString
finalArray ++= Array(list)
}
}
finalArray
}
and you can call it like this:
import sqlContext.implicits._
val averageDF =
filter_op.rdd.map(_.mkString(",")).map(line => line.split(",").map(_.trim))
.sortBy(array => array(1), false)
.groupBy(array => (array(0), array(2)))
.mapValues(buffer => {
avgCalc(buffer)
})
.flatMap(x => x._2)
.map(x => findingavg(x(0).toString, x(1).toString.toLong, x(2).toString, x(3).toString, x(4).toString, x(5).toString.toDouble, x(6).toString.toDouble))
.toDF
I hope this is the desired answer
I can see that you have accepted an answer, but I have to say that you have a lot of unnecessary code. As far as I can see, you have no reason to do the initial conversion to Array type in the first place and the sortBy is also unnecessary at this point. I would suggest you work directly on the Row.
Also you have a number of unused variables that should be removed and converting to a case-class only to be followed by a toDF seems excessive IMHO.
I would do something like this:
import org.apache.spark.sql.Row
def avgCalc(sortedList: List[Row]) = {
sortedList.indices.map(i => {
var sum = 0.0
val row = sortedList(i)
val currentTimeStamp = row.getString(1).toLong // loads the timestamp column
import scala.util.control.Breaks._
breakable {
for (j <- 0 until sortedList.length) {
if (j != i) {
val anotherRow = sortedList(j)
val toCheckTimeStamp = anotherRow.getString(1).toLong // timestamp column
if (((currentTimeStamp - 10L) <= toCheckTimeStamp) && (currentTimeStamp >= toCheckTimeStamp)) { // to check the timestamp for 10 seconds difference
sum += anotherRow.getString(5).toDouble // RSSI weightage values
}
if ((currentTimeStamp - 10L) > toCheckTimeStamp) {
break
}
}
}
}
(row.getString(0), row.getString(1), row.getString(2), row.getString(3), row.getString(4), row.getString(5), sum.toString)
})
}
val averageDF = filterop.rdd
.groupBy(row => (row(0), row(2)))
.flatMap{case(_,buffer) => avgCalc(buffer.toList.sortBy(_.getString(1).toLong))}
.toDF("Tag", "Timestamp", "Listner", "X", "Y", "RSSI", "AvgCalc")
And as a final comment, I'm pretty sure it's possible to come up with a nicer/cleaner implementation of the avgCalc function, but I'll leave it to you to play around with that :)
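For what it's worth, one possible direction is a single pass with a sliding window instead of the nested loop. This is only a sketch under assumptions: Reading is a hypothetical case class standing in for a Row, the input is sorted ascending by timestamp with distinct timestamps, and the current reading is counted in its own window (as in the question's code, unlike the j != i check above).

```scala
object WindowSum {
  // Hypothetical record type standing in for one input row
  final case class Reading(timestamp: Long, rssi: Double)

  // For each reading, sums rssi over the readings whose timestamp lies in
  // [t - window, t], where t is that reading's own timestamp.
  // Assumes `sorted` is ordered ascending by timestamp.
  def trailingSums(sorted: Vector[Reading], window: Long = 10L): Vector[Double] = {
    var start = 0      // index of the oldest reading still inside the window
    var running = 0.0  // sum of rssi over sorted(start) .. sorted(i)
    sorted.indices.map { i =>
      running += sorted(i).rssi
      // Drop readings that have fallen out of the window on the left
      while (sorted(start).timestamp < sorted(i).timestamp - window) {
        running -= sorted(start).rssi
        start += 1
      }
      running
    }.toVector
  }
}
```

Each reading is added once and removed at most once, so the work per group is linear rather than quadratic. Note that a running sum of doubles can accumulate rounding error over long inputs, so recomputing per window may still be preferable if exact agreement with the nested-loop version matters.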

Scala: Haar Wavelet Transform

I am trying to implement the Haar Wavelet Transform (HWT) in Scala. I am using a Python implementation of HWT on GitHub for reference.
I am also giving my Scala version here. I am new to Scala, so please forgive the not-so-good code.
/**
* Created by vipul vaibhaw on 1/11/2017.
*/
import scala.collection.mutable.{ListBuffer, MutableList,ArrayBuffer}
object HaarWavelet {
def main(args: Array[String]): Unit = {
var samples = ListBuffer(
ListBuffer(1,4),
ListBuffer(6,1),
ListBuffer(0,2,4,6,7,7,7,7),
ListBuffer(1,2,3,4),
ListBuffer(7,5,1,6,3,0,2,4),
ListBuffer(3,2,3,7,5,5,1,1,0,2,5,1,2,0,1,2,0,2,1,0,0,2,1,2,0,2,1,0,0,2,1,2)
)
for (i <- 0 to samples.length){
var ubound = samples(i).max+1
var length = samples(i).length
var deltas1 = encode(samples(i), ubound)
var deltas = deltas1._1
var avg = deltas1._2
println( "Input: %s, boundary = %s, length = %s" format(samples(i), ubound, length))
println( "Haar output:%s, average = %s" format(deltas, avg))
println("Decoded: %s" format(decode(deltas, avg, ubound)))
}
}
def wrap(value:Int, ubound:Int):Int = {
(value+ubound)%ubound
}
def encode(lst1:ListBuffer[Int], ubound:Int):(ListBuffer[Int],Int)={
//var lst = ListBuffer[Int]()
//lst1.foreach(x=>lst+=x)
var lst = lst1
var deltas = new ListBuffer[Int]()
var avg = 0
while (lst.length>=2) {
var avgs = new ListBuffer[Int]()
while (lst.nonEmpty) {
// getting first two element from the list and removing them
val a = lst.head
lst -= 1 // removing index 0 element from the list
val b = lst.head
lst -= 1 // removing index 0 element from the list
if (a<=b) {
avg = (a + b)/2
}
else{
avg = (a+b+ubound)/2
}
var delta = wrap(b-a,ubound)
avgs += avg
deltas += delta
}
lst = avgs
}
(deltas, avg%ubound)
}
def decode(deltas:ListBuffer[Int],avg:Int,ubound:Int):ListBuffer[Int]={
var avgs = ListBuffer[Int](avg)
var l = 1
while(deltas.nonEmpty){
for(i <- 0 to l ){
val delta = deltas.last
deltas -= -1
val avg = avgs.last
avgs -= -1
val a = wrap(math.ceil(avg-delta/2.0).toInt,ubound)
val b = wrap(math.ceil(avg+delta/2.0).toInt,ubound)
}
l*=2
}
avgs
}
def is_pow2(n:Int):Boolean={
(n & -n) == n
}
}
But the code gets stuck at "var deltas1 = encode(samples(i), ubound)" and doesn't produce any output. How can I improve my implementation? Thanks in advance!
Your error is on this line:
lst -= 1 // removing index 0 element from the list.
This doesn't remove the element at index 0. On a ListBuffer[Int], -= removes the first element whose value equals 1 (if there is one). Since the elements at the front are never removed, the list never becomes empty, and the loop while (lst.nonEmpty) never terminates.
To remove the first element of the list, simply use lst.remove(0).
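A small sketch of the difference, outside the wavelet code. The object and helper names (ListBufferRemoval, takePair) are made up for illustration; -= and remove are the standard ListBuffer methods. The same pattern appears in decode, where deltas -= -1 removes the value -1 rather than the last element.

```scala
import scala.collection.mutable.ListBuffer

object ListBufferRemoval {
  // Hypothetical helper: removes and returns the first two elements by position
  def takePair(lst: ListBuffer[Int]): (Int, Int) = {
    val a = lst.remove(0) // remove(index) returns the removed element
    val b = lst.remove(0)
    (a, b)
  }

  // `-=` works by value: the first element equal to 1 is removed
  val byValue = ListBuffer(6, 1, 7)
  byValue -= 1 // byValue is now ListBuffer(6, 7)

  // If no element equals 1, `-=` leaves the buffer unchanged
  val noMatch = ListBuffer(6, 3)
  noMatch -= 1 // noMatch is still ListBuffer(6, 3)

  // Removing by position shrinks the buffer from the front
  val byIndex = ListBuffer(6, 1, 7)
  val pair = takePair(byIndex) // pair is (6, 1), byIndex is now ListBuffer(7)
}
```

For removing from the end, lst.remove(lst.length - 1) is the positional equivalent.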

Inconsistent outputs from Scala Spark and pyspark job

I am converting my Scala code to pyspark like below, but got different counts for the final RDD.
My Scala code:
val scalaRDD = rowRDD.map {
row: Row =>
var rowList: ListBuffer[Row] = ListBuffer()
rowList.add(row)
(row.getString(1) + "_" + row.getString(6), rowList)
}.reduceByKey{ (list1,list2) =>
var rowList: ListBuffer[Row] = ListBuffer()
for (i <- 0 to list1.length -1) {
val row1 = list1.get(i);
var foundMatch = false;
breakable {
for (j <- 0 to list2.length -1) {
var row2 = list2.get(j);
val result = mergeRow(row1, row2)
if (result._1) {
list2.set(j, result._2)
foundMatch = true;
break;
}
} // for j loop
} // breakable for j
if(!foundMatch) {
rowList.add(row1);
}
}
list2.addAll(rowList);
list2
}.flatMap { t=> t._2 }
where
def mergeRow(row1:Row, row2:Row):(Boolean, Row)= {
var z:Array[String] = new Array[String](row1.length)
var hasDiff = false
for (k <- 1 to row1.length -2){
// k = 0 : ID, always different
// k = 43 : last field, which is not important
if (row1.getString(0) < row2.getString(0)) {
z(0) = row2.getString(0)
z(43) = row2.getString(43)
} else {
z(0) = row1.getString(0)
z(43) = row1.getString(43)
}
if (Option(row2.getString(k)).getOrElse("").isEmpty && !Option(row1.getString(k)).getOrElse("").isEmpty) {
z(k) = row1.getString(k)
hasDiff = true
} else if (!Option(row1.getString(k)).getOrElse("").isEmpty && !Option(row2.getString(k)).getOrElse("").isEmpty && row1.getString(k) != row2.getString(k)) {
return (false, null)
} else {
z(k) = row2.getString(k)
}
} // for k loop
if (hasDiff) {
(true, Row.fromSeq(z))
} else {
(true, row2)
}
}
I then tried to convert them to pyspark code as below:
pySparkRDD = rowRDD.map (
lambda row : singleRowList(row)
).reduceByKey(lambda list1,list2: mergeList(list1,list2)).flatMap(lambda x : x[1])
where I have:
def mergeRow(row1, row2):
z=[]
hasDiff = False
#for (k <- 1 to row1.length -2){
for k in xrange(1, len(row1) - 2):
# k = 0 : ID, always different
# k = 43 : last field, which is not important
if (row1[0] < row2[0]):
z[0] = row2[0]
z[43] = row2[43]
else:
z[0] = row1[0]
z[43] = row1[43]
if not(row2[k]) and row1[k]:
z[k] = row1[k].strip()
hasDiff = True
elif row1[k] and row2[k] and row1[k].strip() != row2[k].strip():
return (False, None)
else:
z[k] = row2[k].strip()
if hasDiff:
return (True, Row.fromSeq(z))
else:
return (True, row2)
and
def singleRowList(row):
myList=[]
myList.append(row)
return (row[1] + "_" + row[6], myList)
and
def mergeList(list1, list2):
rowList = []
for i in xrange(0, len(list1)-1):
row1 = list1[i]
foundMatch = False
for j in xrange(0, len(list2)-1):
row2 = list2[j]
resultBool, resultRow = mergeRow(row1, row2)
if resultBool:
list2[j] = resultRow
foundMatch = True
break
if foundMatch == False:
rowList.append(row1)
list2.extend(rowList)
return list2
BTW, rowRDD is converted from a data frame. i.e. rowRDD = myDF.rdd
However, I got different counts for scalaRDD and pySparkRDD. I checked the codes many times but couldn't figure out what I missed. Does anyone have any ideas? Thanks!
Consider this:
scala> (1 to 5).length
res1: Int = 5
and this:
>>> len(xrange(1, 5))
4
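To spell the hint out: Scala's to includes its upper bound, while Python's xrange (like Scala's until) excludes it. So "0 to list1.length - 1" visits every index, whereas xrange(0, len(list1) - 1) skips the last one. A sketch of the three variants, with a made-up three-element list:

```scala
object RangeBounds {
  val items = List("a", "b", "c")

  // Inclusive upper bound: visits indices 0, 1, 2
  val inclusive: List[Int] = (0 to items.length - 1).toList

  // Exclusive upper bound: also 0, 1, 2. This is what xrange(0, len(items)) visits.
  val exclusive: List[Int] = (0 until items.length).toList

  // Exclusive bound with the "- 1" kept: only 0, 1.
  // This is what xrange(0, len(items) - 1) visits, so the last element is never seen.
  val shortByOne: List[Int] = (0 until items.length - 1).toList
}
```

The same applies to the loop in mergeRow: "1 to row1.length - 2" corresponds to xrange(1, len(row1) - 1), not xrange(1, len(row1) - 2).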

scala priority queue not ordering properly?

I'm seeing some strange behavior with Scala's collection.mutable.PriorityQueue. I'm performing an external sort and testing it with 1M records. Each time I run the test and verify the results, between 10 and 20 records are not sorted properly. When I replace the Scala PriorityQueue with a java.util.PriorityQueue, it works 100% of the time. Any ideas?
Here's the code (sorry it's a bit long...). I test it using the tools gensort -a 1000000 and valsort from http://sortbenchmark.org/
def externalSort(inFileName: String, outFileName: String)
(implicit ord: Ordering[String]): Int = {
val MaxTempFiles = 1024
val TempBufferSize = 4096
val inFile = new java.io.File(inFileName)
/** Partitions input file and sorts each partition */
def partitionAndSort()(implicit ord: Ordering[String]):
List[java.io.File] = {
/** Gets block size to use */
def getBlockSize: Long = {
var blockSize = inFile.length / MaxTempFiles
val freeMem = Runtime.getRuntime().freeMemory()
if (blockSize < freeMem / 2)
blockSize = freeMem / 2
else if (blockSize >= freeMem)
System.err.println("Not enough free memory to use external sort.")
blockSize
}
/** Sorts and writes data to temp files */
def writeSorted(buf: List[String]): java.io.File = {
// Create new temp buffer
val tmp = java.io.File.createTempFile("external", "sort")
tmp.deleteOnExit()
// Sort buffer and write it out to tmp file
val out = new java.io.PrintWriter(tmp)
try {
for (l <- buf.sorted) {
out.println(l)
}
} finally {
out.close()
}
tmp
}
val blockSize = getBlockSize
var tmpFiles = List[java.io.File]()
var buf = List[String]()
var currentSize = 0
// Read input and divide into blocks
for (line <- io.Source.fromFile(inFile).getLines()) {
if (currentSize > blockSize) {
tmpFiles ::= writeSorted(buf)
buf = List[String]()
currentSize = 0
}
buf ::= line
currentSize += line.length() * 2 // 2 bytes per char
}
if (currentSize > 0) tmpFiles ::= writeSorted(buf)
tmpFiles
}
/** Merges results of sorted partitions into one output file */
def mergeSortedFiles(fs: List[java.io.File])
(implicit ord: Ordering[String]): Int = {
/** Temp file buffer for reading lines */
class TempFileBuffer(val file: java.io.File) {
private val in = new java.io.BufferedReader(
new java.io.FileReader(file), TempBufferSize)
private var curLine: String = ""
readNextLine() // prep first value
def currentLine = curLine
def isEmpty = curLine == null
def readNextLine() {
if (curLine == null) return
try {
curLine = in.readLine()
} catch {
case _: java.io.EOFException => curLine = null
}
if (curLine == null) in.close()
}
override protected def finalize() {
try {
in.close()
} finally {
super.finalize()
}
}
}
val wrappedOrd = new Ordering[TempFileBuffer] {
def compare(o1: TempFileBuffer, o2: TempFileBuffer): Int = {
ord.compare(o1.currentLine, o2.currentLine)
}
}
val pq = new collection.mutable.PriorityQueue[TempFileBuffer](
)(wrappedOrd)
// Init queue with item from each file
for (tmp <- fs) {
val buf = new TempFileBuffer(tmp)
if (!buf.isEmpty) pq += buf
}
var count = 0
val out = new java.io.PrintWriter(new java.io.File(outFileName))
try {
// Read each value off of queue
while (pq.size > 0) {
val buf = pq.dequeue()
out.println(buf.currentLine)
count += 1
buf.readNextLine()
if (buf.isEmpty) {
buf.file.delete() // don't need anymore
} else {
// re-add to priority queue so we can process next line
pq += buf
}
}
} finally {
out.close()
}
count
}
mergeSortedFiles(partitionAndSort())
}
My tests don't show any bugs in PriorityQueue.
import org.scalacheck._
import Prop._
import scala.collection.mutable.PriorityQueue
object PriorityQueueProperties extends Properties("PriorityQueue") {
def listToPQ(l: List[String]): PriorityQueue[String] = {
val pq = new PriorityQueue[String]
l foreach (pq +=)
pq
}
def pqToList(pq: PriorityQueue[String]): List[String] =
if (pq.isEmpty) Nil
else { val h = pq.dequeue; h :: pqToList(pq) }
property("Enqueued elements are dequeued in reverse order") =
forAll { (l: List[String]) => l.sorted == pqToList(listToPQ(l)).reverse }
property("Adding/removing elements doesn't break sorting") =
forAll { (l: List[String], s: String) =>
(l.size > 0) ==>
((s :: l.sorted.init).sorted == {
val pq = listToPQ(l)
pq.dequeue
pq += s
pqToList(pq).reverse
})
}
}
scala> PriorityQueueProperties.check
+ PriorityQueue.Enqueued elements are dequeued in reverse order: OK, passed
100 tests.
+ PriorityQueue.Adding/removing elements doesn't break sorting: OK, passed
100 tests.
If you could somehow reduce the input enough to make a test case, it would help.
I ran it with five million inputs several times, and the output always matched what was expected. My guess from looking at your code is that your Ordering is the problem (i.e. it is giving inconsistent answers).
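One more thing that may be worth checking, since the question does not show which Ordering is passed in: Scala's mutable.PriorityQueue dequeues the greatest element according to its Ordering, whereas java.util.PriorityQueue polls the least. For an ascending merge the Scala queue therefore needs a reversed ordering. A minimal in-memory sketch of the merge step follows; MergeSketch and mergeSorted are hypothetical names, and lists stand in for the temp files.

```scala
import scala.collection.BufferedIterator
import scala.collection.mutable

object MergeSketch {
  // Hypothetical helper: merges already-sorted runs into one ascending list
  def mergeSorted(runs: List[List[String]])(implicit ord: Ordering[String]): List[String] = {
    // Compare iterators by their current head. The ordering is reversed because
    // dequeue() returns the greatest element, and we want the smallest head first.
    val byHead: Ordering[BufferedIterator[String]] =
      Ordering.by[BufferedIterator[String], String](_.head)(ord).reverse
    val pq = new mutable.PriorityQueue[BufferedIterator[String]]()(byHead)

    // Seed the queue with every non-empty run
    runs.map(_.iterator.buffered).filter(_.hasNext).foreach(pq += _)

    val out = List.newBuilder[String]
    while (pq.nonEmpty) {
      // The iterator is taken out of the queue before it is advanced, so the
      // head of an element never changes while that element is inside the heap.
      val it = pq.dequeue()
      out += it.next()
      if (it.hasNext) pq += it
    }
    out.result()
  }
}
```

If an element's sort key changes while it is still inside the queue, the heap is not re-ordered, which would produce exactly the kind of occasional misordering described; the dequeue-then-advance pattern above (also used in the question) avoids that.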