Iterating over cogrouped RDD - scala

I have used the cogroup function and obtained the following RDD:
org.apache.spark.rdd.RDD[(Int, (Iterable[(Int, Long)], Iterable[(Int, Long)]))]
Before the map operation, the joined object looks like this:
RDD[(Int, (Iterable[(Int, Long)], Iterable[(Int, Long)]))]
(-2095842000,(CompactBuffer((1504999740,1430096464017), (613904354,1430211912709), (-1514234644,1430288363100), (-276850688,1430330412225)),CompactBuffer((-511732877,1428682217564), (1133633791,1428831320960), (1168566678,1428964645450), (-407341933,1429009306167), (-1996133514,1429016485487), (872888282,1429031501681), (-826902224,1429034491003), (818711584,1429111125268), (-1068875079,1429117498135), (301875333,1429121399450), (-1730846275,1429131773065), (1806256621,1429135583312))))
(352234000,(CompactBuffer((1350763226,1430006650167), (-330160951,1430320010314)),CompactBuffer((2113207721,1428994842593), (-483470471,1429324209560), (1803928603,1429426861915))))
Now I want to do the following:
val globalBuffer = ListBuffer[Double]()
val joined = data1.cogroup(data2).map(x => {
val listA = x._2._1.toList
val listB = x._2._2.toList
for(tupleB <- listB) {
val localResults = ListBuffer[Double]()
val itemToTest = Set(tupleB._1)
val tempList = ListBuffer[(Int, Double)]()
for(tupleA <- listA) {
val tValue = someFunctionReturnDouble(tupleB._2, tupleA._2)
val i = (tupleA._1, tValue)
tempList += i
}
val sortList = tempList.sortWith(_._2 > _._2).slice(0,20).map(i => i._1)
val intersect = sortList.toSet.intersect(itemToTest)
if (intersect.size > 0)
localResults += 1.0
else localResults += 0.0
val normalized = sum(localResults.toList)/localResults.size
globalBuffer += normalized
}
})
// method sum
def sum(xs: List[Double]): Double = xs.sum // do the sum
At the end of this I was expecting joined to be a list of Double values, but when I looked at it, it was Unit. I also suspect this is not the Scala way of doing it. How do I obtain globalBuffer as the final result?

Hmm, if I understood your code correctly, it could benefit from these improvements:
val joined = data1.cogroup(data2).map(x => {
val listA = x._2._1.toList
val listB = x._2._2.toList
val localResults = listB.map {
case (intBValue, longBValue) =>
val itemToTest = intBValue // it's always one element
val tempList = listA.map {
case (intAValue, longAValue) =>
(intAValue, someFunctionReturnDouble(longBValue, longAValue))
}
val sortList = tempList.sortBy(-_._2).slice(0,20).map(i => i._1)
if (sortList.toSet.contains(itemToTest)) { 1.0 } else {0.0}
// no real need to convert to a set for 20 elements, by the way
}
sum(localResults)/localResults.size
})

Transformations of RDDs are not going to modify globalBuffer. Copies of globalBuffer are made and sent out to each of the workers, but any modifications to these copies on the workers will never modify the globalBuffer that exists on the driver (the one you have defined outside the map on the RDD). Here is what I would do (with a few additional modifications):
val joined = data1.cogroup(data2) map { x =>
val iterA = x._2._1
val iterB = x._2._2
var count, positiveCount = 0
val tempList = ListBuffer[(Int, Double)]()
for (tupleB <- iterB) {
tempList.clear
for(tupleA <- iterA) {
val tValue = someFunctionReturnDouble(tupleB._2, tupleA._2)
tempList += ((tupleA._1, tValue))
}
val sortList = tempList.sortWith(_._2 > _._2).iterator.take(20)
if (sortList.exists(_._1 == tupleB._1)) positiveCount += 1
count += 1
}
positiveCount.toDouble/count
}
At this point you can obtain a local copy of the proportions by using joined.collect.
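The point made above, that each group should return its value instead of appending to a driver-side buffer, can be sketched as a pure function over one cogrouped pair. This is only an illustration: `HitRate`, `hitRate`, and `score` are hypothetical names, and `score` is an arbitrary stand-in for someFunctionReturnDouble, whose real definition is not shown in the question.

```scala
object HitRate {
  // Hypothetical stand-in for someFunctionReturnDouble from the question:
  // here, closer values get a higher score.
  def score(b: Long, a: Long): Double = -math.abs(b - a).toDouble

  // For every element of groupB, rank groupA by score (highest first) and
  // check whether the B element's id is among the top `k` A ids.
  // Returns the fraction of B elements that were hits.
  def hitRate(groupA: Iterable[(Int, Long)],
              groupB: Iterable[(Int, Long)],
              k: Int = 20): Double = {
    val listA = groupA.toList
    val hits = groupB.toList.map { case (idB, valueB) =>
      val topIds = listA
        .map { case (idA, valueA) => (idA, score(valueB, valueA)) }
        .sortWith(_._2 > _._2)
        .take(k)
        .map(_._1)
      if (topIds.contains(idB)) 1.0 else 0.0
    }
    // An empty groupB has no defined proportion.
    if (hits.isEmpty) Double.NaN else hits.sum / hits.size
  }
}
```

With Spark this could presumably be applied as data1.cogroup(data2).mapValues { case (a, b) => HitRate.hitRate(a, b) }, which keeps the key next to each proportion; nothing is shared between tasks, so no buffer needs to travel back to the driver.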

Related

Flatten array in yield statement in Scala

I have the following piece of code
var splitDf = fullCertificateSourceDf.map(row => {
val ID = row.getAs[String]("ID")
val CertificateID = row.getAs[String]("CertificateID")
val CertificateTag = row.getAs[String]("CertificateTag")
val CertificateDescription = row.getAs[String]("CertificateDescription")
val WorkBreakdownUp1Summary = row.getAs[String]("WorkBreakdownUp1Summary")
val ProcessBreakdownSummaryList = row.getAs[String]("ProcessBreakdownSummaryList")
val ProcessBreakdownUp1SummaryList = row.getAs[String]("ProcessBreakdownUp1SummaryList")
val ProcessBreakdownUp2Summary = row.getAs[String]("ProcessBreakdownUp2Summary")
val ProcessBreakdownUp3Summary = row.getAs[String]("ProcessBreakdownUp3Summary")
val ActualStartDate = row.getAs[java.sql.Date]("ActualStartDate")
val ActualEndDate = row.getAs[java.sql.Date]("ActualEndDate")
val ApprovedDate = row.getAs[java.sql.Date]("ApprovedDate")
val CurrentState = row.getAs[String]("CurrentState")
val DataType = row.getAs[String]("DataType")
val PullDate = row.getAs[String]("PullDate")
val PullTime = row.getAs[String]("PullTime")
val split_ProcessBreakdownSummaryList = ProcessBreakdownSummaryList.split(",")
val split_ProcessBreakdownUp1SummaryList = ProcessBreakdownUp1SummaryList.split(",")
val Pattern = "^.*?(?= - *[a-zA-Z])".r
for{
subSystem : String <- split_ProcessBreakdownSummaryList
} yield(ID,
CertificateID,
CertificateTag,
CertificateDescription,
WorkBreakdownUp1Summary,
subSystem,
for{ system: String <- split_ProcessBreakdownUp1SummaryList if(system contains subSystem.trim().substring(0,11))}yield(system),
ProcessBreakdownUp2Summary,
ProcessBreakdownUp3Summary,
ActualStartDate,
ActualEndDate,
ApprovedDate,
CurrentState,
DataType,
PullDate,
PullTime
)
}).flatMap(identity(_))
display(splitDf)
How can I get the first matching element from the following portion of the above statement:
for{ system: String <- split_ProcessBreakdownUp1SummaryList if(system contains subSystem.trim().substring(0,11))}yield(system)
At the moment it returns an array with one element in it. I don't want the array, I just want the element.
Thank you in advance.
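One possible approach, sketched here under assumptions, is to replace the inner for/yield with find, which stops at the first match and returns an Option instead of an array. The names `FirstMatch` and `firstMatching` are hypothetical, and `take` is used in place of `substring(0,11)` so that inputs shorter than 11 characters do not throw.

```scala
object FirstMatch {
  // Returns the first entry of `systems` that contains the first
  // `prefixLen` characters of the trimmed `subSystem`, if any.
  def firstMatching(systems: Array[String],
                    subSystem: String,
                    prefixLen: Int = 11): Option[String] = {
    val prefix = subSystem.trim.take(prefixLen)
    systems.find(_.contains(prefix))
  }
}
```

Inside the yielded tuple this could be used as FirstMatch.firstMatching(split_ProcessBreakdownUp1SummaryList, subSystem).getOrElse(""), where the empty-string default is just one choice for rows that have no match.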

Evaluating multiple filters in a single pass

I have created the RDD below and need to perform a series of filters on the same dataset to derive different counters and aggregates.
Is there a way to apply these filters and compute the aggregates in a single pass, so that Spark does not go over the same dataset multiple times?
val res = df.rdd.map(row => {
// ............... Generate data here for each row.......
})
res.persist(StorageLevel.MEMORY_AND_DISK)
val all = res.count()
val stats1 = res.filter(row => row.getInt(1) > 0)
val stats1Count = stats1.count()
val stats1Agg = stats1.map(r => r.getInt(1)).mean()
val stats2 = res.filter(row => row.getInt(2) > 0)
val stats2Count = stats2.count()
val stats2Agg = stats2.map(r => r.getInt(2)).mean()
You can use aggregate:
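case class Stats(count: Int = 0, sum: Int = 0) {
// Floating-point division, so the mean is not truncated
def mean: Double = sum.toDouble / count
def +(s: Stats): Stats = Stats(count + s.count, sum + s.sum)
// Only positive values are counted, mirroring the filters in the question
def add(n: Int): Stats = if (n > 0) copy(count + 1, sum + n) else this
}
val (stats1, stats2) = res.aggregate(Stats() -> Stats()) (
{ (s, row) => (s._1.add(row.getInt(1)), s._2.add(row.getInt(2))) },
// Merge the per-partition pairs component-wise
{ (a, b) => (a._1 + b._1, a._2 + b._2) }
)
val (stats1Count, stats1Agg, stats2Count, stats2Agg) = (stats1.count, stats1.mean, stats2.count, stats2.mean)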
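To see how the two functions passed to aggregate cooperate, here is a small sketch that simulates the same idea on plain Scala lists, with no Spark involved. The nested list stands in for partitions, rows are simplified to (Int, Int) pairs, and the names `SinglePassStats`, `seqOp`, `combOp`, and `merge` are made up for the illustration.

```scala
object SinglePassStats {
  final case class Stats(count: Long = 0L, sum: Long = 0L) {
    def mean: Double = sum.toDouble / count
    // Count a value only when it passes the "> 0" filter.
    def add(n: Int): Stats = if (n > 0) Stats(count + 1, sum + n) else this
    def merge(o: Stats): Stats = Stats(count + o.count, sum + o.sum)
  }

  type Acc = (Stats, Stats)
  val zero: Acc = (Stats(), Stats())

  // Folds one row into the running pair of accumulators (within a partition).
  def seqOp(acc: Acc, row: (Int, Int)): Acc =
    (acc._1.add(row._1), acc._2.add(row._2))

  // Combines the results of two partitions.
  def combOp(a: Acc, b: Acc): Acc =
    (a._1.merge(b._1), a._2.merge(b._2))

  // Two hypothetical "partitions" of (column1, column2) rows.
  val partitions: List[List[(Int, Int)]] =
    List(List((1, 0), (3, 4)), List((0, 6), (-2, 0)))

  // One pass per partition, then one merge across partitions.
  val (stats1, stats2) =
    partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)
}
```

Each row is visited exactly once, and both counters are updated during that visit, which is what removes the need for a separate filter and count per statistic.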

spark partition.toList fails

What I want to do is group multiple elements in a partition and then perform some operations on the grouped elements in each partition. But I found that the conversion from partition to list fails. See the example below:
import scala.collection.mutable.ArrayBuffer
val rdd = sc.parallelize(Seq("a","b","c","d","e"), 2)
val mapped = rdd.mapPartitions( partition =>
{
val total = partition.size
var first = partition.toList match
{
case Nil => "EMPTYLIST"
case _ => partition.toList.head
}
var finalResult = ArrayBuffer[String]()
finalResult += "1:"+first;
finalResult += "2:"+first;
finalResult += "3:"+first;
finalResult.iterator
})
mapped.collect()
Result:
Array[String] = Array(1:EMPTYLIST, 2:EMPTYLIST, 3:EMPTYLIST,
1:EMPTYLIST, 2:EMPTYLIST, 3:EMPTYLIST)
Why is partition.toList always empty?
partition is an iterator, and computing its size consumes it, so by the time you convert it to a List it is already empty. To go through the partition more than once, convert it to a list at the beginning and then do what you need on that list:
val mapped = rdd.mapPartitions( partition =>
{
val partitionList = partition.toList
val total = partitionList.size
val first = partitionList match
{
case Nil => "EMPTYLIST"
case _ => partitionList.head
}
var finalResult = ArrayBuffer[String]()
finalResult += "1:"+first;
finalResult += "2:"+first;
finalResult += "3:"+first;
finalResult.iterator
})
mapped.collect
// res7: Array[String] = Array(1:a, 2:a, 3:a, 1:c, 2:c, 3:c)
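The same behaviour can be reproduced without Spark, since mapPartitions hands over an ordinary scala.collection.Iterator. The following is a minimal sketch; `IteratorOnce` and its field names are invented for the example.

```scala
object IteratorOnce {
  // An iterator can only be traversed once.
  val it: Iterator[String] = Iterator("a", "b", "c")
  val total: Int = it.size                  // walks to the end while counting
  val afterSize: List[String] = it.toList   // nothing is left to collect

  // Materializing first allows any number of passes over the data.
  val materialized: List[String] = Iterator("a", "b", "c").toList
  val totalAgain: Int = materialized.size
  val first: String = materialized.headOption.getOrElse("EMPTYLIST")
}
```

The trade-off is that toList holds the whole partition in memory at once, which is worth keeping in mind for large partitions.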

obtain a specific value from a RDD according to another RDD

I want to map an RDD by looking up values in another RDD, using this code:
val product = numOfT.map{case((a,b),c)=>
val h = keyValueRecords.lookup(b).take(1).mkString.toInt
(a,(h*c))
}
a and b are Strings and c is an Int. keyValueRecords is an RDD[(String, String)].
I get a type mismatch error. What is my mistake, and how can I fix it?
This is a sample of data:
userId,movieId,rating,timestamp
1,16,4.0,1217897793
1,24,1.5,1217895807
1,32,4.0,1217896246
2,3,2.0,859046959
3,7,3.0,8414840873
I'm trying with this code:
val lines = sc.textFile("ratings.txt").map(s => {
val substrings = s.split(",")
(substrings(0), (substrings(1),substrings(1)))
})
val shoppingList = lines.groupByKey()
val coOccurence = shoppingList.flatMap{case(k,v) =>
val arry1 = v.toArray
val arry2 = v.toArray
val pairs = for (pair1 <- arry1; pair2 <- arry2 ) yield ((pair1,pair2),1)
pairs.iterator
}
val numOfT = coOccurence.reduceByKey((a,b)=>(a+b)) // (((item,rate),(item,rate)),coccurence)
// produce recommendations for a specific user
val keyValueRecords = sc.textFile("ratings.txt").map(s => {
val substrings = s.split(",")
(substrings(0), (substrings(1),substrings(2)))
}).filter{case(k,v)=> k=="1"}.groupByKey().flatMap{case(k,v) =>
val arry1 = v.toArray
val arry2 = v.toArray
val pairs = for (pair1 <- arry1; pair2 <- arry2 ) yield ((pair1,pair2),1)
pairs.iterator
}
val numOfTForaUser = keyValueRecords.reduceByKey((a,b)=>(a+b))
val joined = numOfT.join(numOfTForaUser).map{case(k,v)=>(k._1._1,(k._2._2.toFloat*v._1.toFloat))}.collect.foreach(println)
The last RDD is not produced. Is something wrong?

Spark: split rows and accumulate

I have this code:
val rdd = sc.textFile("sample.log")
val splitRDD = rdd.map(r => StringUtils.splitPreserveAllTokens(r, "\\|"))
val rdd2 = splitRDD.filter(...).map(row => createRow(row, fieldsMap))
sqlContext.createDataFrame(rdd2, structType).save(
"org.apache.phoenix.spark", SaveMode.Overwrite, Map("table" -> table, "zkUrl" -> zkUrl))
def createRow(row: Array[String], fieldsMap: ListMap[Int, FieldConfig]): Row = {
//add additional index for invalidValues
val arrSize = fieldsMap.size + 1
val arr = new Array[Any](arrSize)
var invalidValues = ""
for ((k, v) <- fieldsMap) {
val valid = ...
var value : Any = null
if (valid) {
value = row(k)
// if (v.code == "SOURCE_NAME") --> 5th column in the row
// sourceNameCount = row(k).split(",").size
} else {
invalidValues += v.code + " : " + row(k) + " | "
}
arr(k) = value
}
arr(arrSize - 1) = invalidValues
Row.fromSeq(arr.toSeq)
}
fieldsMap contains the mapping of the input columns: (index, FieldConfig), where the FieldConfig class contains "code" and "dataType" values.
TOPIC -> (0, v.code = "TOPIC", v.dataType = "String")
GROUP -> (1, v.code = "GROUP")
SOURCE_NAME1,SOURCE_NAME2,SOURCE_NAME3 -> (4, v.code = "SOURCE_NAME")
This is the sample.log:
TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME1,SOURCE_NAME2,SOURCE_NAME3|
SOURCE_TYPE1,SOURCE_TYPE2,SOURCE_TYPE3|SOURCE_COUNT1,SOURCE_COUNT2,SOURCE_COUNT3|
DEST_NAME1,DEST_NAME2,DEST_NAME3|DEST_TYPE1,DEST_TYPE2,DEST_TYPE3|
DEST_COUNT1,DEST_COUNT2,DEST_COUNT3|
The goal is to split the input (sample.log) based on the number of source_name(s). In the example above, the output will have 3 rows:
TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME1|SOURCE_TYPE1|SOURCE_COUNT1|
|DEST_NAME1|DEST_TYPE1|DEST_COUNT1|
TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME2|SOURCE_TYPE2|SOURCE_COUNT2|
DEST_NAME2|DEST_TYPE2|DEST_COUNT2|
TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME3|SOURCE_TYPE3|SOURCE_COUNT3|
|DEST_NAME3|DEST_TYPE3|DEST_COUNT3|
This is the new code I am working on (still using createRow defined above):
val rdd2 = splitRDD.filter(...).flatMap(row => {
val srcName = row(4).split(",")
val srcType = row(5).split(",")
val srcCount = row(6).split(",")
val destName = row(7).split(",")
val destType = row(8).split(",")
val destCount = row(9).split(",")
var newRDD: ArrayBuffer[Row] = new ArrayBuffer[Row]()
//if (srcName != null) {
println("\n\nsrcName.size: " + srcName.size + "\n\n")
for (i <- 0 to srcName.size - 1) {
// missing column: destType can sometimes be null
val splittedRow: Array[String] = Row.fromSeq(Seq((row(0), row(1), row(2), row(3),
srcName(i), srcType(i), srcCount(i), destName(i), "", destCount(i)))).toSeq.toArray[String]
newRDD = newRDD ++ Seq(createRow(splittedRow, fieldsMap))
}
//}
Seq(Row.fromSeq(Seq(newRDD)))
})
Since I am getting an error when converting splittedRow to Array[String] (".toSeq.toArray[String]"):
error: type arguments [String] do not conform to method toArray's type parameter bounds [B >: Any]
I decided to update my splittedRow to:
val rowArr: Array[String] = new Array[String](10)
for (j <- 0 to 3) {
rowArr(j) = row(j)
}
rowArr(4) = srcName(i)
rowArr(5) = row(5).split(",")(i)
rowArr(6) = row(6).split(",")(i)
rowArr(7) = row(7).split(",")(i)
rowArr(8) = row(8).split(",")(i)
rowArr(9) = row(9).split(",")(i)
val splittedRow = rowArr
You could use a flatMap operation instead of a map operation to return multiple rows. Consequently, your createRow would be refactored to createRows(row: Array[String], fieldsMap: ListMap[Int, FieldConfig]): Seq[Row].
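As a rough sketch of the splitting step that such a flatMap would perform, the function below turns one already-split record into one record per source name, using only plain Scala arrays. It assumes the layout from sample.log (four fixed columns followed by six comma-separated columns); `SplitRecord` and `splitRow` are hypothetical names, and padding a missing value with an empty string is an assumption about how gaps such as a missing destType should be handled.

```scala
object SplitRecord {
  // fields: one input line already split on '|'.
  // Columns 0-3 are copied to every output row; columns 4-9 each hold a
  // comma-separated list, and output row i takes the i-th item of each list.
  def splitRow(fields: Array[String]): Seq[Array[String]] = {
    val fixed = fields.take(4)
    val multi = fields.slice(4, 10).map(_.split(",", -1))
    // The number of output rows follows the source-name column (index 4).
    val n = multi.headOption.map(_.length).getOrElse(0)
    (0 until n).map { i =>
      // Pad with "" when a list is shorter than the source-name list.
      fixed ++ multi.map(col => if (i < col.length) col(i) else "")
    }
  }
}
```

In the pipeline from the question this would presumably be wired in as splitRDD.filter(...).flatMap(SplitRecord.splitRow).map(row => createRow(row, fieldsMap)), so createRow itself can stay unchanged.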