spark groupBy operation hangs at 199/200 - scala

I have a Spark standalone cluster with a master and two executors. I have an RDD[LevelOneOutput], and below is the LevelOneOutput class:
class LevelOneOutput extends Serializable {
  @BeanProperty
  var userId: String = _
  @BeanProperty
  var tenantId: String = _
  @BeanProperty
  var rowCreatedMonth: Int = _
  @BeanProperty
  var rowCreatedYear: Int = _
  @BeanProperty
  var listType1: ArrayBuffer[TypeOne] = _
  @BeanProperty
  var listType2: ArrayBuffer[TypeTwo] = _
  @BeanProperty
  var listType3: ArrayBuffer[TypeThree] = _
  ...
  ...
  @BeanProperty
  var listType18: ArrayBuffer[TypeEighteen] = _
  @BeanProperty
  var groupbyKey: String = _
}
Now I want to group this RDD based on userId, tenantId, rowCreatedMonth, rowCreatedYear. For that I did this
val levelOneRDD = inputRDD.map(row => {
  row.setGroupbyKey(s"${row.getTenantId}_${row.getRowCreatedYear}_${row.getRowCreatedMonth}_${row.getUserId}")
  row
})
val groupedRDD = levelOneRDD.groupBy(row => row.getGroupbyKey)
This gives me the data with the key as a String and the value as an Iterable[LevelOneOutput].
Now I want to generate one single object of LevelOneOutput for that group key. For that I was doing something like below:
val rdd = groupedRDD.map(row => {
  val levelOneOutput = new LevelOneOutput
  val groupKey = row._1.split("_")
  levelOneOutput.setTenantId(groupKey(0))
  levelOneOutput.setRowCreatedYear(groupKey(1).toInt)
  levelOneOutput.setRowCreatedMonth(groupKey(2).toInt)
  levelOneOutput.setUserId(groupKey(3))

  var listType1 = new ArrayBuffer[TypeOne]
  var listType2 = new ArrayBuffer[TypeTwo]
  var listType3 = new ArrayBuffer[TypeThree]
  ...
  ...
  var listType18 = new ArrayBuffer[TypeEighteen]

  row._2.foreach(data => {
    if (data.getListType1 != null) listType1 = listType1 ++ data.getListType1
    if (data.getListType2 != null) listType2 = listType2 ++ data.getListType2
    if (data.getListType3 != null) listType3 = listType3 ++ data.getListType3
    ...
    ...
    if (data.getListType18 != null) listType18 = listType18 ++ data.getListType18
  })

  if (listType1.isEmpty) levelOneOutput.setListType1(null) else levelOneOutput.setListType1(listType1)
  if (listType2.isEmpty) levelOneOutput.setListType2(null) else levelOneOutput.setListType2(listType2)
  if (listType3.isEmpty) levelOneOutput.setListType3(null) else levelOneOutput.setListType3(listType3)
  ...
  ...
  if (listType18.isEmpty) levelOneOutput.setListType18(null) else levelOneOutput.setListType18(listType18)

  levelOneOutput
})
This works as expected for a small input, but when I run it on a larger input dataset the groupBy operation hangs at 199/200 tasks, and I don't see any specific error or warning in stdout/stderr.
Can someone point out why the job is not proceeding further?

Instead of using the groupBy operation, I created a paired RDD like below
val levelOnePairedRDD = inputRDD.map(row => {
  row.setGroupbyKey(s"${row.getTenantId}_${row.getRowCreatedYear}_${row.getRowCreatedMonth}_${row.getUserId}")
  (row.getGroupbyKey, row)
})
and updated the processing logic, which solved my issue.
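The updated processing logic isn't shown in the post, so the following is only a minimal sketch of the usual pattern, assuming the paired RDD above and the LevelOneOutput class from the question: merge values pairwise with reduceByKey instead of materializing whole groups. The mergeLists and mergeOutputs helpers are made-up names for this example.

import scala.collection.mutable.ArrayBuffer

// Hypothetical helper: combine two possibly-null list fields.
def mergeLists[T](a: ArrayBuffer[T], b: ArrayBuffer[T]): ArrayBuffer[T] =
  if (a == null) b else if (b == null) a else a ++ b

// Fold the right object into the left one, field by field. The scalar fields
// (userId, tenantId, ...) are the same for both sides of a key, so keeping
// left's values is enough. This mutates the left value, which is fine as long
// as the input RDD is not cached and reused elsewhere.
def mergeOutputs(left: LevelOneOutput, right: LevelOneOutput): LevelOneOutput = {
  left.setListType1(mergeLists(left.getListType1, right.getListType1))
  left.setListType2(mergeLists(left.getListType2, right.getListType2))
  // ... repeat for listType3 through listType18
  left
}

// reduceByKey combines values map-side, so no single task has to hold an
// entire Iterable[LevelOneOutput] for a hot key the way groupBy does.
val mergedRDD = levelOnePairedRDD.reduceByKey(mergeOutputs _)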

Function without var in scala

I have a function that takes a string and a case class as input and returns a string as output.
Different case classes get appended to a list, and a final case class that holds the list is returned.
I want to do this without using var. A val list would be immutable and no data could be added to it. Is there another way of doing this, the Scala way?
def getResult(eventName: Option[String], content: Content): String = {
  var list = List.empty[Json]
  val device = Device(
    DEVICE_SCHEMA,
    data = content.data.device
  )
  list = list :+ device.asJson
  val parser = Parser(
    PARSER_SCHEMA,
    data = content.data.parser
  )
  list = list :+ parser.asJson
  val res = Result(
    RESULT_SCHMEA,
    data = list
  )
  res.asJson.noSpaces
}
Try inlining list creation like so
def getResult(eventName: Option[String], content: Content): String = {
  val device = Device(
    DEVICE_SCHEMA,
    data = content.data.device
  )
  val parser = Parser(
    PARSER_SCHEMA,
    data = content.data.parser
  )
  Result(
    RESULT_SCHMEA,
    data = List(device.asJson, parser.asJson) // <== inline list creation
  ).asJson.noSpaces
}
Just a few small changes to the previous answer.
You don't need val res, and it's clearer to create the list outside Result for easier reading and later debugging:
def getResult(eventName: Option[String], content: Content): String = {
  val device = Device(
    DEVICE_SCHEMA,
    data = content.data.device
  )
  val parser = Parser(
    PARSER_SCHEMA,
    data = content.data.parser
  )
  val jsons = List(device.asJson, parser.asJson)
  Result(
    RESULT_SCHMEA,
    data = jsons
  ).asJson.noSpaces
}

Multiple random UUID but value uuid is the same

case class ConversationId(value: UUID)
case class CustomerRequestId(uuid: UUID)
class CustomerRequestReceiverController {
  def createCustomerRequest(req: Request, optConversationId: Option[ConversationId] = None): Task[Response] = {
    val NONE_REVISION = 0
    val customerRequestId = CustomerRequestId(RandomUUID.randomUUID)
    val result: Task[Response] = for {
      form <- req.as(jsonOf[CreateCustomerRequestFormInternalApiV2])
      _ = println(s"Conversation Id ${form.conversationId.getOrElse(ConversationId(RandomUUID.randomUUID))}")
      _ = appLogger.info(s"creating CustomerRequest ${form.customerId.value} with Conversation ${optConversationId.getOrElse(form.conversationId.getOrElse(RandomUUID.randomUUID))}")
      _ = logger.info(s"creating CustomerRequest ${form.customerId.value} with Conversation ${optConversationId.getOrElse(form.conversationId.getOrElse(RandomUUID.randomUUID))}")
      createCommand <- Task.delay(CreateCustomerRequest(
        customerRequestId = customerRequestId,
        communication = Communication.Chat(optConversationId.getOrElse(form.conversationId.getOrElse(ConversationId(RandomUUID.randomUUID)))),
      commandResult <- Task.delay(createCommandHandler.handle(createCommand))
      response <- responseFormat(commandResult)
    } yield response
    result.handleWith(invalidRequestBody)
  }
}
I expected the value obtained from UUID.randomUUID() to be the same UUID, but the actual output is a different value.
Replace line
customerRequestId = customerRequestId,
with
customerRequestId = CustomerRequestId(RandomUUID.randomUUID)
At the moment you generate the CustomerRequestId only once and reuse that same value; with this change a new one is generated each time.
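As a small standalone illustration of the difference (using java.util.UUID directly; the RandomUUID wrapper from the question is assumed to behave the same way):

import java.util.UUID

object UuidDemo extends App {
  // Captured once in a val: every use sees the same value.
  val fixedId = UUID.randomUUID()
  println(Seq(fixedId, fixedId))        // two identical UUIDs

  // Generated per call: every use sees a fresh value.
  def freshId(): UUID = UUID.randomUUID()
  println(Seq(freshId(), freshId()))    // two different UUIDs
}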

Split a list in several files scala

I have the following list,
10,44,22
10,47,12
15,38,3
15,41,30
16,44,15
16,47,18
22,38,21
22,41,42
34,44,40
34,47,36
40,38,39
40,41,42
45,38,27
45,41,30
46,44,45
46,47,48
Then I create one file with its content using the following code:
val fstream: FileWriter = new FileWriter("patSPO.csv")
var out: BufferedWriter = new BufferedWriter(fstream)
val sl = listSPO.sortBy(l => (l.sub, l.pre))
for (a <- 0 to listSPO.size - 1) {
  out.write(sl(a).sub.toString + "," + sl(a).pre.toString + "," + sl(a).obj.toString + "\n")
}
out.close()
However, I want to divide the content into n files, so I tried the following for 4 files:
val fstream: FileWriter = new FileWriter("patSPO.csv")
val fstream1: FileWriter = new FileWriter("patSPO1.csv")
val fstream2: FileWriter = new FileWriter("patSPO2.csv")
val fstream3: FileWriter = new FileWriter("patSPO3.csv")
val fstream4: FileWriter = new FileWriter("patSPO4.csv")
var out: BufferedWriter = new BufferedWriter(fstream)
var out1: BufferedWriter = new BufferedWriter(fstream1)
var out2: BufferedWriter = new BufferedWriter(fstream2)
var out3: BufferedWriter = new BufferedWriter(fstream3)
var out4: BufferedWriter = new BufferedWriter(fstream4)
val b: Int = listSPO.size / 4
val sl = listSPO.sortBy(l => (l.sub, l.pre))
for (a <- 0 to listSPO.size - 1) {
  out.write(sl(a).sub.toString + "," + sl(a).pre.toString + "," + sl(a).obj.toString + "\n")
}
for (a <- 0 to b - 1) {
  out1.write(sl(a).sub.toString + "," + sl(a).pre.toString + "," + sl(a).obj.toString + "\n")
}
for (a <- b to (b * 2) - 1) {
  out2.write(sl(a).sub.toString + "," + sl(a).pre.toString + "," + sl(a).obj.toString + "\n")
}
for (a <- b * 2 to (b * 3) - 1) {
  out3.write(sl(a).sub.toString + "," + sl(a).pre.toString + "," + sl(a).obj.toString + "\n")
}
for (a <- b * 3 to (b * 4) - 1) {
  out4.write(sl(a).sub.toString + "," + sl(a).pre.toString + "," + sl(a).obj.toString + "\n")
}
out.close()
out1.close()
out2.close()
out3.close()
out4.close()
My question is whether there is a more general approach where I just give the number of files to generate, for example 32, instead of writing the fstream, the out and the for loop 32 times.
First, let's make some utilities to eliminate ugly boilerplate:
def open(
  number: Int = 1,
  prefix: String,
  ext: String
) = (0 until number)
  .iterator
  .map { i =>
    "%s%s%s".format(prefix, if (i == 0) "" else i.toString, ext)
  }
  .map(new FileWriter(_))
  .map(new BufferedWriter(_))

def toString(l: WhateverTheType) = Seq(
  l.sub,
  l.pre,
  l.obj
).mkString(",") + "\n"
And now to the implementation:
def writeToFiles(
  listSPO: List[WhateverTheType],
  numFiles: Int = 1,
  prefix: String = "pasSPO",
  ext: String = ".csv"
) = listSPO
  .grouped((listSPO.size + numFiles - 1) / numFiles) // ceiling division, so at most numFiles groups
  .zip(open(numFiles, prefix, ext))
  .foreach { case (input, file) =>
    try {
      input.foreach(line => file.write(toString(line)))
    } finally {
      file.close()
    }
  }
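Assuming the list and the two helpers above, generating, say, 32 files is then a single call:

// Hypothetical usage: writes pasSPO.csv, pasSPO1.csv, ..., pasSPO31.csv
writeToFiles(listSPO.sortBy(l => (l.sub, l.pre)), numFiles = 32)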
Assuming the objects in the list have this type (otherwise, replace SPO in the code below with your type):
case class SPO(sub: Int, pre: Int, obj: Int)
This should do it:
val sl = listSPO.sortBy(l => (l.sub, l.pre))
val files = 5 // whatever number you want
// helper function to write a single list - similar to your implementation but shorter
def writeListToFile(name: String, list: List[SPO]): Unit = {
  val writer = new BufferedWriter(new FileWriter(name))
  list.foreach(spo => writer.write(s"${spo.sub},${spo.pre},${spo.obj}\n"))
  writer.close()
}
sl.grouped(sl.size / files) // split into sublists
  .zipWithIndex             // add index of sublists for file name
  .foreach {
    case (sublist, 0)     => writeListToFile("pasSPO.csv", sublist) // if indeed you want the first file name NOT to include the index
    case (sublist, index) => writeListToFile(s"pasSPO$index.csv", sublist)
  }

MongoSpark duplicate key error only on massive datasets

Using MongoSpark, running the same code on two datasets of differing sizes causes one of them to throw the E11000 duplicate key error.
Before we proceed, here is the code:
object ScrapeHubCompanyImporter {
def importData(path: String, companyMongoUrl: String): Unit = {
val spark = SparkSession.builder()
.master("local[*]")
.config("spark.mongodb.input.uri", companyMongoUrl)
.config("spark.mongodb.output.uri", companyMongoUrl)
.config("spark.mongodb.input.partitionerOptions.partitionKey", "profileUrl")
.getOrCreate()
import spark.implicits._
val websiteToDomainTransformer = udf((website: String) => {
val tldExtract = SplitHost.fromURL(website)
if (tldExtract.domain == "") {
null
} else {
tldExtract.domain + "." + tldExtract.tld
}
})
val jsonDF =
spark
.read
.json(path)
.filter { row =>
row.getAs[String]("canonical_url") != null
}
.dropDuplicates(Seq("canonical_url"))
.select(
toHttpsUdf($"canonical_url").as("profileUrl"),
$"city",
$"country",
$"founded",
$"hq".as("headquartes"),
$"industry",
$"company_id".as("companyId"),
$"name",
$"postal",
$"size",
$"specialties",
$"state",
$"street_1",
$"street_2",
$"type",
$"website"
)
.filter { row => row.getAs[String]("website") != null }
.withColumn("domain", websiteToDomainTransformer($"website"))
.filter(row => row.getAs[String]("domain") != null)
.as[ScrapeHubCompanyDataRep]
val jsonColsSet = jsonDF.columns.toSet
val mongoData = MongoSpark
.load[LinkedinCompanyRep](spark)
.withColumn("companyUrl", toHttpsUdf($"companyUrl"))
.as[CompanyRep]
val mongoColsSet = mongoData.columns.toSet
val union = jsonDF.joinWith(
mongoData,
jsonDF("companyUrl") === mongoData("companyUrl"),
joinType = "left")
.map { t =>
val scrapeHub = t._1
val liCompanyRep = if (t._2 != null ) {
t._2
} else {
CompanyRep(domain = scrapeHub.domain)
}
CompanyRep(
_id = pickValue(liCompanyRep._id, None),
city = pickValue(scrapeHub.city, liCompanyRep.city),
country = pickValue(scrapeHub.country, liCompanyRep.country),
postal = pickValue(scrapeHub.postal, liCompanyRep.postal),
domain = scrapeHub.domain,
founded = pickValue(scrapeHub.founded, liCompanyRep.founded),
headquartes = pickValue(scrapeHub.headquartes, liCompanyRep.headquartes),
headquarters = liCompanyRep.headquarters,
industry = pickValue(scrapeHub.industry, liCompanyRep.industry),
linkedinId = pickValue(scrapeHub.companyId, liCompanyRep.companyId),
companyUrl = Option(scrapeHub.companyUrl),
name = pickValue(scrapeHub.name, liCompanyRep.name),
size = pickValue(scrapeHub.size, liCompanyRep.size),
specialties = pickValue(scrapeHub.specialties, liCompanyRep.specialties),
street_1 = pickValue(scrapeHub.street_1, liCompanyRep.street_1),
street_2 = pickValue(scrapeHub.street_2, liCompanyRep.street_2),
state = pickValue(scrapeHub.state, liCompanyRep.state),
`type` = pickValue(scrapeHub.`type`, liCompanyRep.`type`),
website = pickValue(scrapeHub.website, liCompanyRep.website),
updatedDate = None,
scraped = Some(true)
)
}
val idToMongoId = udf { st: String =>
if (st != null) {
ObjectId(st)
} else {
null
}
}
val saveReady =
union
.map { rep =>
rep.copy(
updatedDate = Some(new Timestamp(System.currentTimeMillis)),
scraped = Some(true),
headquarters = generateCompanyHeadquarters(rep)
)
}
.dropDuplicates(Seq("companyUrl"))
MongoSpark.save(
saveReady.withColumn("_id", idToMongoId($"_id")),
WriteConfig(Map(
"uri" -> companyMongoUrl
)))
}
def generateCompanyHeadquarters(companyRep: CompanyRep): Option[CompanyHeadquarters] = {
val hq = CompanyHeadquarters(
country = companyRep.country,
geographicArea = companyRep.state,
city = companyRep.city,
postalCode = companyRep.postal,
line1 = companyRep.street_1,
line2 = companyRep.street_2
)
CompanyHeadquarters
.unapply(hq)
.get
.productIterator.toSeq.exists {
case a: Option[_] => a.isDefined
case _ => false
} match {
case true => Some(hq)
case false => None
}
}
def pickValue(left: Option[String], right: Option[String]): Option[String] = {
def _noneIfNull(opt: Option[String]): Option[String] = {
if (opt != null) {
opt
} else {
None
}
}
val lOpt = _noneIfNull(left)
val rOpt = _noneIfNull(right)
lOpt match {
case Some(l) => Option(l)
case None => rOpt match {
case Some(r) => Option(r)
case None => None
}
}
}
}
The issue is around the companyUrl, which is one of the unique keys in the collection (the other being the _id key). There are tons of duplicates that Spark will attempt to save on a 700 GB dataset, but if I run a very small dataset locally I'm never able to replicate the issue. I'm trying to understand what's going on, and how I can make sure to group all the existing companies on the companyUrl and make sure that duplicates really are removed globally across the dataset.
EDIT
Here are some scenarios that arise:
Company is in Mongo, and the file that is read has updated data -> duplicate key error can occur here
Company is not in Mongo but is in the file -> duplicate key error can occur here as well.
EDIT2
The duplication error occurs around the companyUrl field.
EDIT 3
I've narrowed this down to being an issue in the merging stage. Looking through records that have been flagged as having duplicate companyUrls, some of those records are not in the target collection, yet somehow a duplicate record is still being written to the collection. In other situations, the _id field of the new record doesn't match the old record that had the same companyUrl.
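Not part of the original post, but one way to surface the offending rows (a sketch reusing the saveReady and mongoData names from the code above) is to join what is about to be written against what is already in Mongo on companyUrl and list the rows whose _id differs from the existing document:

// Diagnostic sketch: rows about to be saved whose companyUrl already exists
// in Mongo under a different _id (these are the ones that raise E11000).
val existing = mongoData.select($"companyUrl", $"_id".as("existingId"))
saveReady
  .join(existing, Seq("companyUrl"))
  .filter($"_id".isNull || $"_id" =!= $"existingId")
  .select($"companyUrl", $"_id", $"existingId")
  .show(50, false)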

Edgetriplets are not getting broadcast properly

I created a graph using GraphX and now I need to extract sub-graphs from the original graph. In the following code I am trying to broadcast the edge triplets and filter them for each user id.
class VertexProperty(val id:Long) extends Serializable
case class User(val userId:Long, var offset:Int, val userCode:String, val Name:String, val Surname:String, val organizational_unit:String, val UME:String, val person_type:String, val SOD_HIGH:String, val SOD_MEDIUM:String, val SOD_LOW:String, val Under_mitigated:String) extends VertexProperty(userId)
case class Account(val accountId:Long, var offset:Int, val userCode:String, val userId:String, val account_creation_date:String, var disabled:String, var forcechangepwd:String, var pwdlife:String, var numberloginerror:String, var lastchangepwd:String, var lastlogin:String, var lastwronglogin:String, var state:String, var expire:String, var last_cert_time:String, var creation_date:String, var creation_user:String,var challenge_counter:String, var challenge_failed_attempt:String) extends VertexProperty(accountId) //Check if userCode is actually the code in this example.
case class Application(var applicationId:Long, var offset:Int, var Name:String, var Description:String, var Target:String, var Owner:String, var Ownercode:String, var Creation_date:String, var Creation_user:String) extends VertexProperty(applicationId)
case class Entitlement(val entitlementId:Long, var offset:Int, val Name:String, var Code:String, var Description:String, var Type:String, var Application:String, var Administrative:String, var Parent_ID:String, var Owner_code:String, var Scope_type:String, var Business_name:String, var Business_policy:String, var SOD_high:String, var SOD_medium:String, var SOD_low:String) extends VertexProperty(entitlementId)
def compute_user_triplets(uId: String, bcast_triplets: Broadcast[Array[EdgeTriplet[VertexProperty, String]]]): ArrayBuffer[EdgeTriplet[VertexProperty, String]] = {
  var user_triplets = ArrayBuffer[EdgeTriplet[VertexProperty, String]]()
  val triplets = bcast_triplets.value
  for (x <- triplets) {
    if (x.attr == uId) {
      user_triplets += x
    }
  }
  user_triplets
}
//Some code for computing vertexRDD and edges
val edges : RDD[Edge[String]] = sc.union(user_account_edges, account_application_edges, user_entitlement_edges)
val vertexRDD: RDD[(VertexId, VertexProperty)] = vertices.map(t => (t.id, t))
val graph: Graph[VertexProperty,String] = Graph(vertexRDD, edges, new VertexProperty(-1))
val triplets = graph.triplets
val temp = triplets.map(t => t.attr)
val distinct_users = temp.distinct.filter(t => t != "NULL")
val bcast_triplets = sc.broadcast(triplets.collect())
val users_triplets = distinct_users.map(uId => compute_user_triplets(uId, bcast_triplets))
But I get the error below after the last line of the code runs. Why am I getting this error?
org.apache.spark.SparkException: Task not serializable
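No answer is recorded here, but as a sketch of one alternative approach (an assumption on my part, not a diagnosis of the exact serialization failure): keep the grouping on the RDD instead of collecting and broadcasting the triplets, which also avoids pulling methods of the enclosing class into the closure.

// Sketch: group the triplets by their user-id attribute directly on the RDD,
// yielding the per-user triplet collections without any broadcast.
val usersTriplets: RDD[(String, Iterable[EdgeTriplet[VertexProperty, String]])] =
  graph.triplets
    .filter(t => t.attr != "NULL")
    .groupBy(t => t.attr)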