Inconsistent outputs from Scala Spark and PySpark job

I am converting my Scala code to PySpark as shown below, but I get different counts for the final RDD.
My Scala code:
val scalaRDD = rowRDD.map { row: Row =>
  val rowList: ListBuffer[Row] = ListBuffer()
  rowList += row
  (row.getString(1) + "_" + row.getString(6), rowList)
}.reduceByKey { (list1, list2) =>
  val rowList: ListBuffer[Row] = ListBuffer()
  for (i <- 0 to list1.length - 1) {
    val row1 = list1(i)
    var foundMatch = false
    breakable {
      for (j <- 0 to list2.length - 1) {
        val row2 = list2(j)
        val result = mergeRow(row1, row2)
        if (result._1) {
          list2(j) = result._2
          foundMatch = true
          break
        }
      } // for j loop
    } // breakable for j
    if (!foundMatch) {
      rowList += row1
    }
  }
  list2 ++= rowList
  list2
}.flatMap { t => t._2 }
where
def mergeRow(row1: Row, row2: Row): (Boolean, Row) = {
  val z: Array[String] = new Array[String](row1.length)
  var hasDiff = false
  for (k <- 1 to row1.length - 2) {
    // k = 0 : ID, always different
    // k = 43 : last field, which is not important
    if (row1.getString(0) < row2.getString(0)) {
      z(0) = row2.getString(0)
      z(43) = row2.getString(43)
    } else {
      z(0) = row1.getString(0)
      z(43) = row1.getString(43)
    }
    if (Option(row2.getString(k)).getOrElse("").isEmpty && !Option(row1.getString(k)).getOrElse("").isEmpty) {
      z(k) = row1.getString(k)
      hasDiff = true
    } else if (!Option(row1.getString(k)).getOrElse("").isEmpty && !Option(row2.getString(k)).getOrElse("").isEmpty && row1.getString(k) != row2.getString(k)) {
      return (false, null)
    } else {
      z(k) = row2.getString(k)
    }
  } // for k loop
  if (hasDiff) {
    (true, Row.fromSeq(z))
  } else {
    (true, row2)
  }
}
I then tried to convert it to the PySpark code below:
pySparkRDD = rowRDD.map(
    lambda row: singleRowList(row)
).reduceByKey(
    lambda list1, list2: mergeList(list1, list2)
).flatMap(lambda x: x[1])
where I have:
def mergeRow(row1, row2):
    z = [None] * len(row1)   # mirror Scala's new Array[String](row1.length)
    hasDiff = False
    # for (k <- 1 to row1.length - 2) {
    for k in xrange(1, len(row1) - 2):
        # k = 0 : ID, always different
        # k = 43 : last field, which is not important
        if row1[0] < row2[0]:
            z[0] = row2[0]
            z[43] = row2[43]
        else:
            z[0] = row1[0]
            z[43] = row1[43]
        if not row2[k] and row1[k]:
            z[k] = row1[k].strip()
            hasDiff = True
        elif row1[k] and row2[k] and row1[k].strip() != row2[k].strip():
            return (False, None)
        else:
            z[k] = row2[k].strip()
    if hasDiff:
        return (True, Row(*z))   # build a Row from the merged values
    else:
        return (True, row2)
and
def singleRowList(row):
    myList = []
    myList.append(row)
    return (row[1] + "_" + row[6], myList)
and
def mergeList(list1, list2):
    rowList = []
    for i in xrange(0, len(list1)-1):
        row1 = list1[i]
        foundMatch = False
        for j in xrange(0, len(list2)-1):
            row2 = list2[j]
            resultBool, resultRow = mergeRow(row1, row2)
            if resultBool:
                list2[j] = resultRow
                foundMatch = True
                break
        if not foundMatch:
            rowList.append(row1)
    list2.extend(rowList)
    return list2
BTW, rowRDD is converted from a DataFrame, i.e. rowRDD = myDF.rdd
However, I got different counts for scalaRDD and pySparkRDD. I checked the code many times but couldn't figure out what I missed. Does anyone have any ideas? Thanks!

Consider this:
scala> (1 to 5).length
res1: Int = 5
and this:
>>> len(xrange(1, 5))
4
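In other words, Scala's a to b range includes the upper bound, while Python's xrange(a, b) stops at b - 1, so every translated loop runs one iteration short. A minimal sketch of the corrected bounds in mergeList (keeping the rest of your logic unchanged; only the two xrange calls move):

def mergeList(list1, list2):
    rowList = []
    for i in xrange(0, len(list1)):          # Scala's "0 to list1.length - 1"
        row1 = list1[i]
        foundMatch = False
        for j in xrange(0, len(list2)):      # Scala's "0 to list2.length - 1"
            row2 = list2[j]
            resultBool, resultRow = mergeRow(row1, row2)
            if resultBool:
                list2[j] = resultRow
                foundMatch = True
                break
        if not foundMatch:
            rowList.append(row1)
    list2.extend(rowList)
    return list2

The same off-by-one applies in mergeRow: Scala's 1 to row1.length - 2 corresponds to xrange(1, len(row1) - 1) in Python, not xrange(1, len(row1) - 2).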

Related

I was trying a bubble sort program in Scala and encountered an Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 8.

I know why that error is occurring here. Can anyone help me resolve it? Also, I don't want to use the two for loops here.
Here is my code:
object BubbleSort {
  def main(args: Array[String]) {
    val arr = Array(5, 3, 2, 5, 6, 77, 99, 88)
    var temp = 0
    var n = arr.length
    var fixed = false
    while (fixed == false) {
      fixed == true
      for (i <- 0 to n - 1) {
        if (arr(i) > arr(i + 1)) {
          temp = arr(i + 1)
          arr(i + 1) = arr(i)
          arr(i) = temp
          fixed = false
        }
      }
    }
    for (i <- 0 to n) {
      println("sorted numbers are:" + arr)
    }
  }
}
You forgot the comparison i + 1 < arr.length. Here is your code running correctly:
object BubbleSort {
  def main(args: Array[String]): Unit = {
    val arr = Array(5, 3, 2, 5, 6, 77, 99, 88)
    var temp = 0
    var fixed = false
    while (fixed == false) {
      fixed = true
      for (i <- 0 to arr.length - 1) {
        if (i + 1 < arr.length && arr(i) > arr(i + 1)) {
          temp = arr(i + 1)
          arr(i + 1) = arr(i)
          arr(i) = temp
          fixed = false
        }
      }
    }
    arr.foreach(println)
  }
}
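As a side note, since you said you would rather not use the two nested for loops: if the goal is just a sorted result, the standard library already covers it. A small sketch:

val arr = Array(5, 3, 2, 5, 6, 77, 99, 88)
// sorted returns a new sorted array and leaves arr untouched
arr.sorted.foreach(println)
// or sort the array in place
scala.util.Sorting.quickSort(arr)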

Records are missing after creating the table from spark temp table in Spark2

I have created a DataFrame from the sequence below.
val df = sc.parallelize(Seq(
  (100, 23, 9.50),
  (100, 23, 9.51),
  (100, 24, 9.52),
  (100, 25, 9.54),
  (100, 23, 9.55),
  (101, 21, 8.51),
  (101, 23, 8.52),
  (101, 24, 8.55),
  (101, 20, 8.56))).toDF("id", "temp", "time")
I wanted to update the DF by adding a few more rows where data is missing for the time. So I iterated over the DF with mapPartitions to add new rows.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, Row, Column}

@transient val w = org.apache.spark.sql.expressions.Window.partitionBy("id").orderBy("time")
val leadDf = df.withColumn("time_diff", ((lead("time", 1).over(w) - df("time")).cast("Float") * 100).cast("int"))
The DataFrame iteration goes here:
val result = leadDf.rdd.mapPartitions(itr =>
  new Iterator[Row] {
    var prevRow = null: Row
    var prevDone = true
    var firstRow = true
    var outputRow: Row = null: Row
    var counter = 0
    var currRecord = null: Row
    var currRow: Row = if (itr.hasNext) { currRecord = itr.next; currRecord } else null
    prevRow = currRow

    override def hasNext: Boolean = {
      if (!prevDone) {
        prevRow = incrementValue(prevRow, 2)
        outputRow = prevRow
        counter = counter - 1
        if (counter == 0) {
          prevDone = true
        }
        true
      } else if (itr.hasNext) {
        prevRow = currRow
        if (counter == 0 && prevRow.getAs[Int](3) != 1 && !isNullValue(prevRow, 3)) {
          outputRow = prevRow
          counter = prevRow.getAs[Int](3) - 1
          prevDone = false
        } else if (counter > 0) {
          counter = counter - 1
          prevDone = false
        } else {
          outputRow = currRow
        }
        //if(counter == 0){
        currRow = itr.next
        true
      } else if (currRow != null) {
        outputRow = currRow
        currRow = null
        true
      } else {
        false
      }
    }

    override def next(): Row = outputRow
  })
val newDf = spark.createDataFrame(result,leadDf.schema)
After this, I can see 12 records in the DataFrame, but I get only 10 records in the physical table created from the "newDf" temp table.
newDf.registerTempTable("test")
spark.sql("create table newtest as select * from test")
scala> newDf.count
res14: Long = 12
scala> spark.sql("select * from newtest").count
res15: Long = 10
The same code works fine in Spark 1.6, where the final table count matches the DataFrame record count.
Can someone explain why this is happening? And is there any solution or workaround for this problem?
I found a solution, or rather a workaround: calling the repartition method on the DataFrame newly created from the RDD[Row].
val newDf = spark.createDataFrame(result,leadDf.schema).repartition(result.getNumPartitions)

How to define multiple Custom Delimiter for input file in spark?

The default record delimiter when reading a file via Spark is the newline character (\n). It is possible to define a custom delimiter by using the "textinputformat.record.delimiter" property.
But is it possible to specify multiple delimiters for the same file?
Suppose a file has following content :
COMMENT,A,B,C
COMMENT,D,E,
F
LIKE,I,H,G
COMMENT,J,K,
L
COMMENT,M,N,O
I want to read this file with COMMENT and LIKE as delimiters instead of the newline character.
However, I came up with an alternative in case multiple delimiters are not allowed in Spark:
val ss = SparkSession.builder().appName("SentimentAnalysis").master("local[*]").getOrCreate()
val sc = ss.sparkContext
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "COMMENT")
val rdd = sc.textFile("<filepath>")
val finalRdd = rdd.flatMap(f => f.split("LIKE"))
But still, I think it would be better to have multiple custom delimiters. Is that possible in Spark, or do I have to use the above alternative?
Solved the above issue by creating a custom TextInputFormat class that splits on two types of delimiter strings. The post pointed to by @puhlen in the comments was a great help. Find below the code snippet I used:
import java.io.IOException

import scala.collection.mutable.MutableList

import org.apache.hadoop.fs.FSDataInputStream
import org.apache.hadoop.io.{DataOutputBuffer, LongWritable, Text}
import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}

class CustomInputFormat extends TextInputFormat {
  override def createRecordReader(inputSplit: InputSplit, taskAttemptContext: TaskAttemptContext): RecordReader[LongWritable, Text] = {
    return new ParagraphRecordReader();
  }
}

class ParagraphRecordReader extends RecordReader[LongWritable, Text] {
  var end: Long = 0L;
  var stillInChunk = true;

  var key = new LongWritable();
  var value = new Text();

  var fsin: FSDataInputStream = null;
  val buffer = new DataOutputBuffer();
  val tempBuffer1 = MutableList[Int]();
  val tempBuffer2 = MutableList[Int]();

  val endTag1 = "COMMENT".getBytes();
  val endTag2 = "LIKE".getBytes();

  @throws(classOf[IOException])
  @throws(classOf[InterruptedException])
  override def initialize(inputSplit: org.apache.hadoop.mapreduce.InputSplit, taskAttemptContext: org.apache.hadoop.mapreduce.TaskAttemptContext) {
    val split = inputSplit.asInstanceOf[FileSplit];
    val conf = taskAttemptContext.getConfiguration();
    val path = split.getPath();
    val fs = path.getFileSystem(conf);

    fsin = fs.open(path);
    val start = split.getStart();
    end = split.getStart() + split.getLength();
    fsin.seek(start);

    if (start != 0) {
      readUntilMatch(endTag1, endTag2, false);
    }
  }

  @throws(classOf[IOException])
  override def nextKeyValue(): Boolean = {
    if (!stillInChunk) return false;

    val status = readUntilMatch(endTag1, endTag2, true);

    value = new Text();
    value.set(buffer.getData(), 0, buffer.getLength());
    key = new LongWritable(fsin.getPos());
    buffer.reset();

    if (!status) {
      stillInChunk = false;
    }

    return true;
  }

  @throws(classOf[IOException])
  @throws(classOf[InterruptedException])
  override def getCurrentKey(): LongWritable = {
    return key;
  }

  @throws(classOf[IOException])
  @throws(classOf[InterruptedException])
  override def getCurrentValue(): Text = {
    return value;
  }

  @throws(classOf[IOException])
  @throws(classOf[InterruptedException])
  override def getProgress(): Float = {
    return 0;
  }

  @throws(classOf[IOException])
  override def close() {
    fsin.close();
  }

  @throws(classOf[IOException])
  def readUntilMatch(match1: Array[Byte], match2: Array[Byte], withinBlock: Boolean): Boolean = {
    var i = 0;
    var j = 0;
    while (true) {
      val b = fsin.read();
      if (b == -1) return false;

      if (b == match1(i)) {
        tempBuffer1.+=(b)
        i = i + 1;
        if (i >= match1.length) {
          tempBuffer1.clear()
          return fsin.getPos() < end;
        }
      } else if (b == match2(j)) {
        tempBuffer2.+=(b)
        j = j + 1;
        if (j >= match2.length) {
          tempBuffer2.clear()
          return fsin.getPos() < end;
        }
      } else {
        if (tempBuffer1.size != 0)
          tempBuffer1.foreach { x => if (withinBlock) buffer.write(x) }
        else if (tempBuffer2.size != 0)
          tempBuffer2.foreach { x => if (withinBlock) buffer.write(x) }
        tempBuffer1.clear()
        tempBuffer2.clear()
        if (withinBlock) buffer.write(b);
        i = 0;
        j = 0;
      }
    }
    return false;
  }
}
Use the following class when reading a file from the filesystem, and your file will be read with the two delimiters as required. :)
val rdd = sc.newAPIHadoopFile("<filepath>", classOf[CustomInputFormat], classOf[LongWritable], classOf[Text], sc.hadoopConfiguration)
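The resulting rdd holds (LongWritable, Text) pairs, where the key is just the reader's byte offset. If you only need the record text itself, a small follow-up on the rdd above could be:

// keep only the record contents as strings
val records = rdd.map { case (_, text) => text.toString }
records.collect().foreach(println)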

Getting type mismatch exception in scala

Hi, I am trying a UDAF with Spark Scala and I am getting the following exception:
type mismatch; found: scala.collection.immutable.IndexedSeq[Any], required: String (SumCalc.scala, line 63)
This is my code.
override def evaluate(buffer: Row): Any = {
  val in_array = buffer.getAs[WrappedArray[String]](0);
  var finalArray = Array.empty[Array[String]]
  import scala.util.control.Breaks._
  breakable {
    for (outerArray <- in_array) {
      val currentTimeStamp = outerArray(1).toLong
      var sum = 0.0
      var count = 0
      var check = false
      var list = outerArray
      for (array <- in_array) {
        val toCheckTimeStamp = array(1).toLong
        if (((currentTimeStamp - 10L) <= toCheckTimeStamp) && (currentTimeStamp >= toCheckTimeStamp)) {
          sum += array(5).toDouble
          count += 1
        }
        if ((currentTimeStamp - 10L) > toCheckTimeStamp) {
          check = true
          break
        }
      }
      if (sum != 0.0 && check) list = list :+ (sum).toString // getting error on this line.
      else list = list :+ list(5).toDouble.toString
      finalArray ++= Array(list)
    }
    finalArray
  }
}
Any help will be appreciated.
There are a couple of mistakes in the evaluate function of your UDAF:
The list variable is a String, but you are treating it as an array.
finalArray is initialized as Array.empty[Array[String]], but later on you are adding Array(list) to it.
You are not returning finalArray from the evaluate method, as it sits inside the breakable block rather than being the method's last expression.
So the correct way should be as below:
override def evaluate(buffer: Row): Any = {
  val in_array = buffer.getAs[WrappedArray[String]](0);
  var finalArray = Array.empty[String]
  import scala.util.control.Breaks._
  breakable {
    for (outerArray <- in_array) {
      val currentTimeStamp = outerArray(1).toLong // timestamp values
      var sum = 0.0
      var count = 0
      var check = false
      var list = outerArray
      for (array <- in_array) {
        val toCheckTimeStamp = array(1).toLong
        if (((currentTimeStamp - 10L) <= toCheckTimeStamp) && (currentTimeStamp >= toCheckTimeStamp)) {
          sum += array(5).toDouble // RSSI weightage values
          count += 1
        }
        if ((currentTimeStamp - 10L) > toCheckTimeStamp) {
          check = true
          break
        }
      }
      if (sum != 0.0 && check) list = list + (sum).toString // calculate sum for the 10 secs difference
      else list = list + (sum).toString // If 10 secs difference is not there take rssi weightage value
      finalArray ++= Array(list)
    }
  }
  finalArray // Final results for this function
}
Hope the answer is helpful

Spark: split rows and accumulate

I have this code:
val rdd = sc.textFile("sample.log")
val splitRDD = rdd.map(r => StringUtils.splitPreserveAllTokens(r, "\\|"))
val rdd2 = splitRDD.filter(...).map(row => createRow(row, fieldsMap))

sqlContext.createDataFrame(rdd2, structType).save("org.apache.phoenix.spark",
  SaveMode.Overwrite, Map("table" -> table, "zkUrl" -> zkUrl))
def createRow(row: Array[String], fieldsMap: ListMap[Int, FieldConfig]): Row = {
  //add additional index for invalidValues
  val arrSize = fieldsMap.size + 1
  val arr = new Array[Any](arrSize)
  var invalidValues = ""
  for ((k, v) <- fieldsMap) {
    val valid = ...
    var value: Any = null
    if (valid) {
      value = row(k)
      // if (v.code == "SOURCE_NAME") --> 5th column in the row
      // sourceNameCount = row(k).split(",").size
    } else {
      invalidValues += v.code + " : " + row(k) + " | "
    }
    arr(k) = value
  }
  arr(arrSize - 1) = invalidValues

  Row.fromSeq(arr.toSeq)
}
fieldsMap contains the mapping of the input columns: (index, FieldConfig), where the FieldConfig class contains the "code" and "dataType" values.
TOPIC -> (0, v.code = "TOPIC", v.dataType = "String")
GROUP -> (1, v.code = "GROUP")
SOURCE_NAME1,SOURCE_NAME2,SOURCE_NAME3 -> (4, v.code = "SOURCE_NAME")
This is the sample.log:
TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME1,SOURCE_NAME2,SOURCE_NAME3|SOURCE_TYPE1,SOURCE_TYPE2,SOURCE_TYPE3|SOURCE_COUNT1,SOURCE_COUNT2,SOURCE_COUNT3|DEST_NAME1,DEST_NAME2,DEST_NAME3|DEST_TYPE1,DEST_TYPE2,DEST_TYPE3|DEST_COUNT1,DEST_COUNT2,DEST_COUNT3|
The goal is to split the input (sample.log) based on the number of source_name values. In the example above, the output will have 3 rows:
TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME1|SOURCE_TYPE1|SOURCE_COUNT1|DEST_NAME1|DEST_TYPE1|DEST_COUNT1|
TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME2|SOURCE_TYPE2|SOURCE_COUNT2|DEST_NAME2|DEST_TYPE2|DEST_COUNT2|
TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME3|SOURCE_TYPE3|SOURCE_COUNT3|DEST_NAME3|DEST_TYPE3|DEST_COUNT3|
This is the new code I am working on (still using createRow defined above):
val rdd2 = splitRDD.filter(...).flatMap(row => {
  val srcName = row(4).split(",")
  val srcType = row(5).split(",")
  val srcCount = row(6).split(",")
  val destName = row(7).split(",")
  val destType = row(8).split(",")
  val destCount = row(9).split(",")

  var newRDD: ArrayBuffer[Row] = new ArrayBuffer[Row]()
  //if (srcName != null) {
  println("\n\nsrcName.size: " + srcName.size + "\n\n")
  for (i <- 0 to srcName.size - 1) {
    // missing column: destType can sometimes be null
    val splittedRow: Array[String] = Row.fromSeq(Seq((row(0), row(1), row(2), row(3),
      srcName(i), srcType(i), srcCount(i), destName(i), "", destCount(i)))).toSeq.toArray[String]
    newRDD = newRDD ++ Seq(createRow(splittedRow, fieldsMap))
  }
  //}
  Seq(Row.fromSeq(Seq(newRDD)))
})
Since I am getting an error converting my splittedRow to Array[String] (".toSeq.toArray[String]"):
error: type arguments [String] do not conform to method toArray's type parameter bounds [B >: Any]
I decided to update my splittedRow to:
val rowArr: Array[String] = new Array[String](10)
for (j <- 0 to 3) {
  rowArr(j) = row(j)
}
rowArr(4) = srcName(i)
rowArr(5) = row(5).split(",")(i)
rowArr(6) = row(6).split(",")(i)
rowArr(7) = row(7).split(",")(i)
rowArr(8) = row(8).split(",")(i)
rowArr(9) = row(9).split(",")(i)
val splittedRow = rowArr
You could use a flatMap operation instead of a map operation to return multiple rows. Consequently, your createRow would be refactored to createRows(row: Array[String], fieldsMap: ListMap[Int, FieldConfig]): Seq[Row].
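A rough sketch of that refactoring, reusing the createRow, FieldConfig and fieldsMap already defined in the question and assuming, as in the sample above, that indices 4-9 hold the comma-separated source/dest columns (the filter step is left out here); treat it as an outline rather than a drop-in implementation:

import scala.collection.immutable.ListMap
import org.apache.spark.sql.Row

def createRows(row: Array[String], fieldsMap: ListMap[Int, FieldConfig]): Seq[Row] = {
  // split the multi-valued columns once
  val srcName   = row(4).split(",")
  val srcType   = row(5).split(",")
  val srcCount  = row(6).split(",")
  val destName  = row(7).split(",")
  val destType  = row(8).split(",")
  val destCount = row(9).split(",")

  // one output row per source name, copying the leading scalar columns as-is
  (0 until srcName.length).map { i =>
    val rowArr = new Array[String](10)
    for (j <- 0 to 3) rowArr(j) = row(j)
    rowArr(4) = srcName(i)
    rowArr(5) = srcType(i)
    rowArr(6) = srcCount(i)
    rowArr(7) = destName(i)
    rowArr(8) = destType(i)
    rowArr(9) = destCount(i)
    createRow(rowArr, fieldsMap)
  }
}

// flatMap then flattens the Seq[Row] produced for each input line,
// so no wrapping in Row.fromSeq(Seq(newRDD)) is needed
val rdd2 = splitRDD.flatMap(row => createRows(row, fieldsMap))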