I'm working with a Spark program that needs to continuously update some RDDs in a loop:
var totalRandomPath: RDD[String] = null
for (iter <- 0 until config.numWalks) {
var randomPath: RDD[String] = examples.map { case (nodeId, clickNode) =>
clickNode.path.mkString("\t")
}
for (walkCount <- 0 until config.walkLength) {
randomPath = edge2attr.join(randomPath.mapPartitions { iter =>
iter.map { pathBuffer =>
val paths: Array[String] = pathBuffer.split("\t")
(paths.slice(paths.size - 2, paths.size).mkString(""), pathBuffer)
}
}).mapPartitions { iter =>
iter.map { case (edge, (attr, pathBuffer)) =>
try {
if (pathBuffer != null && pathBuffer.nonEmpty && attr.dstNeighbors != null && attr.dstNeighbors.nonEmpty) {
val nextNodeIndex: PartitionID = GraphOps.drawAlias(attr.J, attr.q)
val nextNodeId: VertexId = attr.dstNeighbors(nextNodeIndex)
s"$pathBuffer\t$nextNodeId"
} else {
pathBuffer //add
}
} catch {
case e: Exception => throw new RuntimeException(e.getMessage)
}
}.filter(_ != null)
}
}
if (totalRandomPath != null) {
totalRandomPath = totalRandomPath.union(randomPath)
} else {
totalRandomPath = randomPath
}
}
In this program, the RDDs totalRandomPath and randomPath are constantly updated through many transformation operations: join and mapPartitions. The program ends with the action collect.
So do I need to persist these continuously updated RDDs (totalRandomPath, randomPath) to speed up my Spark program?
I also notice that this program runs fast on a single-node machine but slows down when run on a three-node cluster. Why does this happen?
Yes, you need to persist the updated RDD and also unpersist the older RDD:
var totalRandomPath:RDD[String] = spark.sparkContext.parallelize(List.empty[String]).cache()
for (iter <- 0 until config.numWalks){
// existing logic
val tempRDD = totalRandomPath.union(randomPath).cache()
tempRDD foreach { _ => } //this will trigger cache operation for tempRDD immediately
totalRandomPath.unpersist() //unpersist old RDD which is no longer needed
totalRandomPath = tempRDD // point totalRandomPath to updated RDD
}
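The same pattern can be applied to randomPath inside the inner walk loop. A rough sketch only: stepOnce below is a placeholder for the existing edge2attr.join(...).mapPartitions(...) step from the question, not a real function.
for (walkCount <- 0 until config.walkLength) {
  val nextPath = stepOnce(randomPath).cache()   // cache the new lineage step
  nextPath foreach { _ => }                     // materialize it, as above
  randomPath.unpersist()                        // safe even if the previous RDD was never cached
  randomPath = nextPath
}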
My code looks like this:
val doDatabaseWrites = true
val testMode = false
val memberRdd = getMembersToProcess() // returns RDD[String]
if (doDatabaseWrites && !testMode) {
val dbUpsertsRDD = memberRdd.mapPartitions(partitionOp())
writeToDB(dbUpsertsRDD)
} else if (testMode) {
// I want a different RDD here so I can do
writeToCSV(resultsRDD)
} else {
memberRdd.foreachPartition({ it => partitionOp(it) })
}
Is there a function other than mapPartitions that returns 2 different RDDs? Then I could just do:
val (dbUpsertsRDD, resultsRDD) = memberRdd.mapPartitions(partitionOp())
if (doDatabaseWrites && !testMode) {
writeToDB(dbUpsertsRDD)
} else if (testMode) {
writeToCSV(resultsRDD)
}
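A single Spark transformation returns one RDD, so there is no mapPartitions variant that returns two. A common workaround, as a sketch only (it assumes partitionOp emits a pair (dbUpsert, csvResult) per record, which is not stated in the question), is to run the per-partition work once, cache it, and derive both RDDs from the cached result:
val processed = memberRdd.mapPartitions(partitionOp()).cache()

val dbUpsertsRDD = processed.map(_._1)   // database upserts
val resultsRDD   = processed.map(_._2)   // rows for the CSV test output

if (doDatabaseWrites && !testMode) {
  writeToDB(dbUpsertsRDD)
} else if (testMode) {
  writeToCSV(resultsRDD)
}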
I want to remove the last line from an RDD using the .mapPartitionsWithIndex function.
I have tried the code below:
val withoutFooter = rdd.mapPartitionsWithIndex { (idx, iter) =>
if (idx == noOfTotalPartitions) {
iter.drop(size - 1)
}
else iter
}
But I am not able to get the correct result.
drop drops the first n elements and returns the remaining elements, so it removes from the front, not from the end. Note also that partition indices are zero-based, so the last partition has index getNumPartitions - 1.
Read more here https://stackoverflow.com/a/51792161/6556191
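For example, in plain Scala, drop keeps everything after the first n elements:
Iterator(1, 2, 3, 4).drop(2).toList   // List(3, 4)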
The code below works for me:
val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9),4)
val lastPartitionIndex = rdd.getNumPartitions - 1
rdd.mapPartitionsWithIndex { (idx, iter) =>
var reti = iter
if (idx == lastPartitionIndex) {
var lastPart = iter.toArray
reti = lastPart.slice(0, lastPart.length-1).toIterator
}
reti
}
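For reference, assigning the result and collecting it on the example RDD shows the last element removed (dropRight(1) is just an equivalent shorthand for the slice used above):
val withoutLast = rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == lastPartitionIndex) iter.toArray.dropRight(1).iterator
  else iter
}
withoutLast.collect()   // Array(1, 2, 3, 4, 5, 6, 7, 8)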
How do I write multiple cases in the filter() method in Spark using Scala? I have an RDD from cogroup, for example:
(1,(CompactBuffer(1,john,23),CompactBuffer(1,john,24))).filter(x => (x._2._1 != x._2._2)) // values not equal
(2,(CompactBuffer(),CompactBuffer(2,Arun,24))).filter(x => (x._2._1 == null)) // second tuple's first value is null
(3,(CompactBuffer(3,kumar,25),CompactBuffer())).filter(x => (x._2._2 == null)) // second tuple's second value is null
val a = source_primary_key.cogroup(destination_primary_key).filter(x => (x._2._1 != x._2._2))
val c= a.map { y =>
val key = y._1
val value = y._2
srcs = value._1.mkString(",")
destt = value._2.mkString(",")
if (srcs.equalsIgnoreCase(destt) == false) {
srcmis :+= srcs
destmis :+= destt
}
if (srcs == "") {
extraindest :+= destt.mkString("")
}
if (destt == "") {
extrainsrc :+= srcs.mkString("")
}
}
How can I store the results of each condition in 3 different Array[String]s?
I tried it as above, but that seems naive. Is there any way we can do it efficiently?
For testing purposes, I created the following RDDs:
val source_primary_key = sc.parallelize(Seq((1,(1,"john",23)),(3,(3,"kumar",25))))
val destination_primary_key = sc.parallelize(Seq((1,(1,"john",24)),(2,(2,"arun",24))))
Then I cogrouped them as you did:
val coGrouped = source_primary_key.cogroup(destination_primary_key)
Now comes the step of filtering the cogrouped RDD into three separate RDDs:
val a = coGrouped.filter(x => !x._2._1.isEmpty && !x._2._2.isEmpty)
val b = coGrouped.filter(x => x._2._1.isEmpty && !x._2._2.isEmpty)
val c = coGrouped.filter(x => !x._2._1.isEmpty && x._2._2.isEmpty)
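Collecting the three RDDs on the test data above shows one key in each group:
a.collect()   // key 1: present on both sides
b.collect()   // key 2: only in destination_primary_key
c.collect()   // key 3: only in source_primary_key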
I hope the answer is helpful
You can use collect on your RDD and then toList.
Example:
(1,(CompactBuffer(1,john,23),CompactBuffer(1,john,24))).filter(x => (x._2._1 != x._2._2)).collect().toList
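As a minimal sketch of that idea (reusing the coGrouped RDD defined in the other answer, which is an assumption here), the differing records can be brought to the driver as a List:
val differing = coGrouped
  .filter { case (_, (src, dst)) => src.toList != dst.toList }   // values not equal
  .collect()
  .toList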
I have some Scala code like this:
def avgCalc(buffer: Iterable[Array[String]], list: Array[String]) = {
val currentTimeStamp = list(1).toLong // loads the timestamp column
var sum = 0.0
var count = 0
var check = false
import scala.util.control.Breaks._
breakable {
for (array <- buffer) {
val toCheckTimeStamp = array(1).toLong // timestamp column
if (((currentTimeStamp - 10L) <= toCheckTimeStamp) && (currentTimeStamp >= toCheckTimeStamp)) { // to check the timestamp for 10 seconds difference
sum += array(5).toDouble // RSSI weightage values
count += 1
}
if ((currentTimeStamp - 10L) > toCheckTimeStamp) {
check = true
break
}
}
}
list :+ sum
}
I will call the above function like this
import spark.implicits._
val averageDF =
filterop.rdd.map(_.mkString(",")).map(line => line.split(",").map(_.trim))
.sortBy(array => array(1), false) // Sort by timestamp
.groupBy(array => (array(0), array(2))) // group by tag and listner
.mapValues(buffer => {
buffer.map(list => {
avgCalc(buffer, list) // calling the average function
})
})
.flatMap(x => x._2)
.map(x => findingavg(x(0).toString, x(1).toString.toLong, x(2).toString, x(3).toString, x(4).toString, x(5).toString.toDouble, x(6).toString.toDouble)) // defining the schema through case class
.toDF // converting to data frame
The above code is working fine, but I need to get rid of list. My senior asked me to remove the list because it reduces the execution speed. Any suggestions on how to proceed without list?
Any help will be appreciated.
The following solution should work, I guess; I have tried to avoid passing both the iterable and a separate array.
def avgCalc(buffer: Iterable[Array[String]]) = {
var finalArray = Array.empty[Array[String]]
import scala.util.control.Breaks._
breakable {
for (outerArray <- buffer) {
val currentTimeStamp = outerArray(1).toLong
var sum = 0.0
var count = 0
var check = false
var list = outerArray
for (array <- buffer) {
val toCheckTimeStamp = array(1).toLong
if (((currentTimeStamp - 10L) <= toCheckTimeStamp) && (currentTimeStamp >= toCheckTimeStamp)) {
sum += array(5).toDouble
count += 1
}
if ((currentTimeStamp - 10L) > toCheckTimeStamp) {
check = true
break
}
}
if (sum != 0.0 && check) list = list :+ (sum / count).toString
else list = list :+ list(5).toDouble.toString
finalArray ++= Array(list)
}
}
finalArray
}
and you can call it like this:
import sqlContext.implicits._
val averageDF =
filter_op.rdd.map(_.mkString(",")).map(line => line.split(",").map(_.trim))
.sortBy(array => array(1), false)
.groupBy(array => (array(0), array(2)))
.mapValues(buffer => {
avgCalc(buffer)
})
.flatMap(x => x._2)
.map(x => findingavg(x(0).toString, x(1).toString.toLong, x(2).toString, x(3).toString, x(4).toString, x(5).toString.toDouble, x(6).toString.toDouble))
.toDF
I hope this is the desired answer
I can see that you have accepted an answer, but I have to say that you have a lot of unnecessary code. As far as I can see, there is no reason to do the initial conversion to Array in the first place, and the sortBy is also unnecessary at this point. I would suggest you work directly on the Row.
Also, you have a number of unused variables that should be removed, and converting to a case class only to follow it with a toDF seems excessive, IMHO.
I would do something like this:
import org.apache.spark.sql.Row
def avgCalc(sortedList: List[Row]) = {
sortedList.indices.map(i => {
var sum = 0.0
val row = sortedList(i)
val currentTimeStamp = row.getString(1).toLong // loads the timestamp column
import scala.util.control.Breaks._
breakable {
for (j <- 0 until sortedList.length) {
if (j != i) {
val anotherRow = sortedList(j)
val toCheckTimeStamp = anotherRow.getString(1).toLong // timestamp column
if (((currentTimeStamp - 10L) <= toCheckTimeStamp) && (currentTimeStamp >= toCheckTimeStamp)) { // to check the timestamp for 10 seconds difference
sum += anotherRow.getString(5).toDouble // RSSI weightage values
}
if ((currentTimeStamp - 10L) > toCheckTimeStamp) {
break
}
}
}
}
(row.getString(0), row.getString(1), row.getString(2), row.getString(3), row.getString(4), row.getString(5), sum.toString)
})
}
val averageDF = filterop.rdd
.groupBy(row => (row(0), row(2)))
.flatMap{case(_,buffer) => avgCalc(buffer.toList.sortBy(_.getString(1).toLong))}
.toDF("Tag", "Timestamp", "Listner", "X", "Y", "RSSI", "AvgCalc")
And as a final comment, I'm pretty sure it's possible to come up with a nicer/cleaner implementation of the avgCalc function, but I'll leave it to you to play around with that :)
Using MongoSpark, running the same code on 2 different datasets of differing sizes causes one of them to throw the E11000 duplicate key error.
Before we proceed, here is the code:
object ScrapeHubCompanyImporter {
def importData(path: String, companyMongoUrl: String): Unit = {
val spark = SparkSession.builder()
.master("local[*]")
.config("spark.mongodb.input.uri", companyMongoUrl)
.config("spark.mongodb.output.uri", companyMongoUrl)
.config("spark.mongodb.input.partitionerOptions.partitionKey", "profileUrl")
.getOrCreate()
import spark.implicits._
val websiteToDomainTransformer = udf((website: String) => {
val tldExtract = SplitHost.fromURL(website)
if (tldExtract.domain == "") {
null
} else {
tldExtract.domain + "." + tldExtract.tld
}
})
val jsonDF =
spark
.read
.json(path)
.filter { row =>
row.getAs[String]("canonical_url") != null
}
.dropDuplicates(Seq("canonical_url"))
.select(
toHttpsUdf($"canonical_url").as("profileUrl"),
$"city",
$"country",
$"founded",
$"hq".as("headquartes"),
$"industry",
$"company_id".as("companyId"),
$"name",
$"postal",
$"size",
$"specialties",
$"state",
$"street_1",
$"street_2",
$"type",
$"website"
)
.filter { row => row.getAs[String]("website") != null }
.withColumn("domain", websiteToDomainTransformer($"website"))
.filter(row => row.getAs[String]("domain") != null)
.as[ScrapeHubCompanyDataRep]
val jsonColsSet = jsonDF.columns.toSet
val mongoData = MongoSpark
.load[LinkedinCompanyRep](spark)
.withColumn("companyUrl", toHttpsUdf($"companyUrl"))
.as[CompanyRep]
val mongoColsSet = mongoData.columns.toSet
val union = jsonDF.joinWith(
mongoData,
jsonDF("companyUrl") === mongoData("companyUrl"),
joinType = "left")
.map { t =>
val scrapeHub = t._1
val liCompanyRep = if (t._2 != null ) {
t._2
} else {
CompanyRep(domain = scrapeHub.domain)
}
CompanyRep(
_id = pickValue(liCompanyRep._id, None),
city = pickValue(scrapeHub.city, liCompanyRep.city),
country = pickValue(scrapeHub.country, liCompanyRep.country),
postal = pickValue(scrapeHub.postal, liCompanyRep.postal),
domain = scrapeHub.domain,
founded = pickValue(scrapeHub.founded, liCompanyRep.founded),
headquartes = pickValue(scrapeHub.headquartes, liCompanyRep.headquartes),
headquarters = liCompanyRep.headquarters,
industry = pickValue(scrapeHub.industry, liCompanyRep.industry),
linkedinId = pickValue(scrapeHub.companyId, liCompanyRep.companyId),
companyUrl = Option(scrapeHub.companyUrl),
name = pickValue(scrapeHub.name, liCompanyRep.name),
size = pickValue(scrapeHub.size, liCompanyRep.size),
specialties = pickValue(scrapeHub.specialties, liCompanyRep.specialties),
street_1 = pickValue(scrapeHub.street_1, liCompanyRep.street_1),
street_2 = pickValue(scrapeHub.street_2, liCompanyRep.street_2),
state = pickValue(scrapeHub.state, liCompanyRep.state),
`type` = pickValue(scrapeHub.`type`, liCompanyRep.`type`),
website = pickValue(scrapeHub.website, liCompanyRep.website),
updatedDate = None,
scraped = Some(true)
)
}
val idToMongoId = udf { st: String =>
if (st != null) {
ObjectId(st)
} else {
null
}
}
val saveReady =
union
.map { rep =>
rep.copy(
updatedDate = Some(new Timestamp(System.currentTimeMillis)),
scraped = Some(true),
headquarters = generateCompanyHeadquarters(rep)
)
}
.dropDuplicates(Seq("companyUrl"))
MongoSpark.save(
saveReady.withColumn("_id", idToMongoId($"_id")),
WriteConfig(Map(
"uri" -> companyMongoUrl
)))
}
def generateCompanyHeadquarters(companyRep: CompanyRep): Option[CompanyHeadquarters] = {
val hq = CompanyHeadquarters(
country = companyRep.country,
geographicArea = companyRep.state,
city = companyRep.city,
postalCode = companyRep.postal,
line1 = companyRep.street_1,
line2 = companyRep.street_2
)
CompanyHeadquarters
.unapply(hq)
.get
.productIterator.toSeq.exists {
case a: Option[_] => a.isDefined
case _ => false
} match {
case true => Some(hq)
case false => None
}
}
def pickValue(left: Option[String], right: Option[String]): Option[String] = {
def _noneIfNull(opt: Option[String]): Option[String] = {
if (opt != null) {
opt
} else {
None
}
}
val lOpt = _noneIfNull(left)
val rOpt = _noneIfNull(right)
lOpt match {
case Some(l) => Option(l)
case None => rOpt match {
case Some(r) => Option(r)
case None => None
}
}
}
}
This issue is around companyUrl, which is one of the unique keys in the collection, the other being the _id key. The issue is that there are tons of duplicates that Spark will attempt to save on a 700GB dataset, but if I run a very small dataset locally, I'm never able to replicate the issue. I'm trying to understand what's going on, and how I can make sure to group all the existing companies on companyUrl and make sure that duplicates really are removed globally across the dataset.
EDIT
Here are some scenarios that arise:
Company is in Mongo, and the file that's read has updated data -> a duplicate key error can occur here
Company is not in Mongo but is in the file -> a duplicate key error can occur here as well
EDIT 2
The duplication error occurs around the companyUrl field.
EDIT 3
I've narrowed this down to being an issue in the merging stage. Looking through records that have been marked as having duplicate companyUrls, some of those records are not in the target collection, yet somehow a duplicate record is still being written to the collection. In other situations, the _id field of the new record doesn't match the old record that had the same companyUrl.
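Not an answer to the root cause, but as an illustration of forcing exactly one record per companyUrl before the save: the sketch below reduces the Dataset by that key. Preferring the record that already carries an _id is an assumption about the desired merge behaviour, not something stated in the question.
// Sketch: collapse saveReady to one CompanyRep per companyUrl before MongoSpark.save.
val onePerUrl = saveReady
  .groupByKey(_.companyUrl.getOrElse(""))                 // key on the unique field
  .reduceGroups((a, b) => if (a._id.isDefined) a else b)  // keep one candidate per key (assumed rule)
  .map(_._2)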