MongoSpark duplicate key error only on massive datasets - mongodb

Using MongoSpark, running the same code on two datasets of differing sizes causes one of them to throw an E11000 duplicate key error.
Before we proceed, here is the code:
object ScrapeHubCompanyImporter {
  def importData(path: String, companyMongoUrl: String): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.mongodb.input.uri", companyMongoUrl)
      .config("spark.mongodb.output.uri", companyMongoUrl)
      .config("spark.mongodb.input.partitionerOptions.partitionKey", "profileUrl")
      .getOrCreate()
    import spark.implicits._

    val websiteToDomainTransformer = udf((website: String) => {
      val tldExtract = SplitHost.fromURL(website)
      if (tldExtract.domain == "") {
        null
      } else {
        tldExtract.domain + "." + tldExtract.tld
      }
    })

    val jsonDF =
      spark
        .read
        .json(path)
        .filter { row =>
          row.getAs[String]("canonical_url") != null
        }
        .dropDuplicates(Seq("canonical_url"))
        .select(
          toHttpsUdf($"canonical_url").as("profileUrl"),
          $"city",
          $"country",
          $"founded",
          $"hq".as("headquartes"),
          $"industry",
          $"company_id".as("companyId"),
          $"name",
          $"postal",
          $"size",
          $"specialties",
          $"state",
          $"street_1",
          $"street_2",
          $"type",
          $"website"
        )
        .filter { row => row.getAs[String]("website") != null }
        .withColumn("domain", websiteToDomainTransformer($"website"))
        .filter(row => row.getAs[String]("domain") != null)
        .as[ScrapeHubCompanyDataRep]
    val jsonColsSet = jsonDF.columns.toSet

    val mongoData = MongoSpark
      .load[LinkedinCompanyRep](spark)
      .withColumn("companyUrl", toHttpsUdf($"companyUrl"))
      .as[CompanyRep]
    val mongoColsSet = mongoData.columns.toSet

    val union = jsonDF.joinWith(
        mongoData,
        jsonDF("companyUrl") === mongoData("companyUrl"),
        joinType = "left")
      .map { t =>
        val scrapeHub = t._1
        val liCompanyRep = if (t._2 != null) {
          t._2
        } else {
          CompanyRep(domain = scrapeHub.domain)
        }
        CompanyRep(
          _id = pickValue(liCompanyRep._id, None),
          city = pickValue(scrapeHub.city, liCompanyRep.city),
          country = pickValue(scrapeHub.country, liCompanyRep.country),
          postal = pickValue(scrapeHub.postal, liCompanyRep.postal),
          domain = scrapeHub.domain,
          founded = pickValue(scrapeHub.founded, liCompanyRep.founded),
          headquartes = pickValue(scrapeHub.headquartes, liCompanyRep.headquartes),
          headquarters = liCompanyRep.headquarters,
          industry = pickValue(scrapeHub.industry, liCompanyRep.industry),
          linkedinId = pickValue(scrapeHub.companyId, liCompanyRep.companyId),
          companyUrl = Option(scrapeHub.companyUrl),
          name = pickValue(scrapeHub.name, liCompanyRep.name),
          size = pickValue(scrapeHub.size, liCompanyRep.size),
          specialties = pickValue(scrapeHub.specialties, liCompanyRep.specialties),
          street_1 = pickValue(scrapeHub.street_1, liCompanyRep.street_1),
          street_2 = pickValue(scrapeHub.street_2, liCompanyRep.street_2),
          state = pickValue(scrapeHub.state, liCompanyRep.state),
          `type` = pickValue(scrapeHub.`type`, liCompanyRep.`type`),
          website = pickValue(scrapeHub.website, liCompanyRep.website),
          updatedDate = None,
          scraped = Some(true)
        )
      }

    val idToMongoId = udf { st: String =>
      if (st != null) {
        ObjectId(st)
      } else {
        null
      }
    }

    val saveReady =
      union
        .map { rep =>
          rep.copy(
            updatedDate = Some(new Timestamp(System.currentTimeMillis)),
            scraped = Some(true),
            headquarters = generateCompanyHeadquarters(rep)
          )
        }
        .dropDuplicates(Seq("companyUrl"))

    MongoSpark.save(
      saveReady.withColumn("_id", idToMongoId($"_id")),
      WriteConfig(Map(
        "uri" -> companyMongoUrl
      )))
  }

  def generateCompanyHeadquarters(companyRep: CompanyRep): Option[CompanyHeadquarters] = {
    val hq = CompanyHeadquarters(
      country = companyRep.country,
      geographicArea = companyRep.state,
      city = companyRep.city,
      postalCode = companyRep.postal,
      line1 = companyRep.street_1,
      line2 = companyRep.street_2
    )
    CompanyHeadquarters
      .unapply(hq)
      .get
      .productIterator.toSeq.exists {
        case a: Option[_] => a.isDefined
        case _ => false
      } match {
      case true => Some(hq)
      case false => None
    }
  }

  def pickValue(left: Option[String], right: Option[String]): Option[String] = {
    def _noneIfNull(opt: Option[String]): Option[String] = {
      if (opt != null) {
        opt
      } else {
        None
      }
    }
    val lOpt = _noneIfNull(left)
    val rOpt = _noneIfNull(right)
    lOpt match {
      case Some(l) => Option(l)
      case None => rOpt match {
        case Some(r) => Option(r)
        case None => None
      }
    }
  }
}
The issue is around companyUrl, which is one of the unique keys in the collection (the other being the _id key). On the 700 GB dataset there are many duplicates that Spark attempts to save, but when I run a very small dataset locally I am never able to replicate the issue. I'm trying to understand what is going on, and how I can group all existing companies on companyUrl so that duplicates really are removed globally across the dataset.
EDIT
Here are some scenarios that arise:
Company is in Mongo and the file that is read has updated data -> a duplicate key error can occur here.
Company is not in Mongo but is in the file -> a duplicate key error can occur here as well.
EDIT 2
The duplicate key error occurs on the companyUrl field.
EDIT 3
I've narrowed this down to an issue in the merging stage. Looking through records that have been flagged as having duplicate companyUrls, some of those records are not in the target collection, yet somehow a duplicate record is still written to the collection. In other cases, the _id field of the new record doesn't match the old record that had the same companyUrl.
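Note that dropDuplicates(Seq("companyUrl")) already de-duplicates across the whole Dataset, but it keeps an arbitrary row per key. One way to make the choice deterministic before the save (a minimal sketch, assuming the saveReady Dataset built above, and assuming that when several rows share a companyUrl the one already carrying a Mongo _id should win so that, with connector versions that upsert on _id, the save targets the existing document rather than inserting a new one):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Rank rows within each companyUrl and keep exactly one.
// Ordering by _id descending with nulls last prefers rows that already exist in Mongo.
val window = Window
  .partitionBy(col("companyUrl"))
  .orderBy(col("_id").desc_nulls_last)

val globallyDeduped = saveReady
  .withColumn("rowNum", row_number().over(window))
  .filter(col("rowNum") === 1)
  .drop("rowNum")

globallyDeduped can then be passed to MongoSpark.save exactly as saveReady is above.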

Related

why mapPartitions does not see my val - SCALA/SPARK?

I define the vals like this:
val config = Config(args)
val product_type = config.product_type
Then I pass product_type as "AA", and my code is this:
val scores = df.mapPartitions(iterator => {
  val inputStream =
    if (product_type == "AA") {
      getClass().getClassLoader().getResourceAsStream("my_aa.hdf5")
    } else {
      getClass().getClassLoader().getResourceAsStream("my_bb.hdf5")
    }
  val multiLayerNetwork: MultiLayerNetwork = KerasModelImport.importKerasSequentialModelAndWeights(inputStream, false)
  val wrapped: ParallelInference = new ParallelInference.Builder(multiLayerNetwork).build()
  val res = iterator.map(row => {
    wrapped.output(row).toDoubleVector
  })
  res
})
But my inputStream ends up being "my_bb.hdf5", which is not correct; the value comes from the else branch. So why can't my product_type variable be read inside mapPartitions?
I printed my product_type value before this code and checked it: it is "AA".
This happens because I get the variable from an argument passed in spark-submit.sh, and it cannot be read inside mapPartitions.
It works like this:
val scores =
  if (product_type == "AA") {
    df.mapPartitions(iterator => {
      val inputStream = getClass().getClassLoader().getResourceAsStream("AA.hdf5")
      val multiLayerNetwork: MultiLayerNetwork = KerasModelImport.importKerasSequentialModelAndWeights(inputStream, false)
      val wrapped: ParallelInference = new ParallelInference.Builder(multiLayerNetwork).build()
      val res = iterator.map(row => {
        wrapped.output(row).toDoubleVector
      })
      res
    })
  } else {
    df.mapPartitions(iterator => {
      val inputStream = getClass().getClassLoader().getResourceAsStream("BB.hdf5")
      val multiLayerNetwork: MultiLayerNetwork = KerasModelImport.importKerasSequentialModelAndWeights(inputStream, false)
      val wrapped: ParallelInference = new ParallelInference.Builder(multiLayerNetwork).build()
      val res = iterator.map(row => {
        wrapped.output(row).toDoubleVector
      })
      res
    })
  }
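If duplicating the mapPartitions body is undesirable, another option is to resolve the resource name on the driver and capture only that String in the closure. A minimal sketch, assuming the same DL4J imports and df as in the question:

val modelResource = if (product_type == "AA") "my_aa.hdf5" else "my_bb.hdf5"

val scores = df.mapPartitions { iterator =>
  // Only the plain String modelResource is captured and serialized into the task.
  val inputStream = getClass().getClassLoader().getResourceAsStream(modelResource)
  val multiLayerNetwork: MultiLayerNetwork =
    KerasModelImport.importKerasSequentialModelAndWeights(inputStream, false)
  val wrapped: ParallelInference = new ParallelInference.Builder(multiLayerNetwork).build()
  iterator.map(row => wrapped.output(row).toDoubleVector)
}

The branch is evaluated once on the driver, so nothing from the configuration object needs to be serialized to the executors.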

Spark doesn't conform to expected type TraversableOnce

val num_idf_pairs = rescaledData.select("item", "features")
  .rdd.map(x => { (x(0), x(1)) })

val itemRdd = rescaledData.select("item", "features").where("item = 1")
  .rdd.map(x => { (x(0), x(1)) })

val b_num_idf_pairs = sparkSession.sparkContext.broadcast(num_idf_pairs.collect())

val sims = num_idf_pairs.flatMap {
  case (key, value) =>
    val sv1 = value.asInstanceOf[SV]
    import breeze.linalg._
    val valuesVector = new SparseVector[Double](sv1.indices, sv1.values, sv1.size)
    itemRdd.map {
      case (id2, idf2) =>
        val sv2 = idf2.asInstanceOf[SV]
        val xVector = new SparseVector[Double](sv2.indices, sv2.values, sv2.size)
        val sim = valuesVector.dot(xVector) / (norm(valuesVector) * norm(xVector))
        (id2.toString, key.toString, sim)
    }
}
The error is "doesn't conform to expected type TraversableOnce".
When I modify it as follows:
val b_num_idf_pairs = sparkSession.sparkContext.broadcast(num_idf_pairs.collect())

val docSims = num_idf_pairs.flatMap {
  case (id1, idf1) =>
    val idfs = b_num_idf_pairs.value.filter(_._1 != id1)
    val sv1 = idf1.asInstanceOf[SV]
    import breeze.linalg._
    val bsv1 = new SparseVector[Double](sv1.indices, sv1.values, sv1.size)
    idfs.map {
      case (id2, idf2) =>
        val sv2 = idf2.asInstanceOf[SV]
        val bsv2 = new SparseVector[Double](sv2.indices, sv2.values, sv2.size)
        val cosSim = bsv1.dot(bsv2).asInstanceOf[Double] / (norm(bsv1) * norm(bsv2))
        (id1.toString(), id2.toString(), cosSim)
    }
}
It compiles, but it causes an OutOfMemoryError. I set --executor-memory 4G.
The first snippet:
num_idf_pairs.flatMap {
...
itemRdd.map { ...}
}
is not only invalid Spark code (nested transformations are not allowed), but also, as you already know, won't type check, because RDD is not TraversableOnce.
The second snippet likely fails because the data you are trying to collect and broadcast is too large.
It looks like you are trying to compute all pairwise item similarities, so you'll need a Cartesian product; structure your code roughly like this:
num_idf_pairs
  .cartesian(itemRdd)
  .filter { case ((id1, idf1), (id2, idf2)) => id1 != id2 }
  .map { case ((id1, idf1), (id2, idf2)) => {
    val cosSim = ??? // Compute similarity
    (id1.toString(), id2.toString(), cosSim)
  }}
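For reference, the similarity itself can be computed the same way as in the question's snippets. A sketch, assuming SV is the Spark MLlib SparseVector alias used above:

import breeze.linalg.{norm, SparseVector => BSV}

// Cosine similarity over the Cartesian pairs, reusing the formula from the question.
val sims = num_idf_pairs
  .cartesian(itemRdd)
  .filter { case ((id1, _), (id2, _)) => id1 != id2 }
  .map { case ((id1, idf1), (id2, idf2)) =>
    val sv1 = idf1.asInstanceOf[SV]
    val sv2 = idf2.asInstanceOf[SV]
    val bsv1 = new BSV[Double](sv1.indices, sv1.values, sv1.size)
    val bsv2 = new BSV[Double](sv2.indices, sv2.values, sv2.size)
    val cosSim = bsv1.dot(bsv2) / (norm(bsv1) * norm(bsv2))
    (id1.toString, id2.toString, cosSim)
  }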

Spark map and broadcast together issue in program

I am trying to join two datasets. Below is the first dataset:
1/2/2009 6:17,iphone,800,Mastercard,carolina
1/2/2009 4:53,cloth,200,Visa,Betina
1/2/2009 13:08,cloth,100,Mastercard,Federica e Andrea
1/3/2009 14:44,blender,160,Visa,Gouya
1/4/2009 12:56,samsung,3600,Visa,Gerd W
1/4/2009 13:19,htc,1200,Visa,LAURENCE
1/4/2009 20:11,iphone,999,Mastercard,Fleur
1/2/2009 20:09,tmobile,81,Mastercard,adam
1/4/2009 13:17,iphone,400,Cash,Renee Elisabeth
The other dataset is:
Mastercard,MS
Visa,VS
I want to join the two datasets and get output like below:
(htc,VS)
(iphone,MS)
(iphone,NULL)
Below is my approach:
def mapCard(cardname: String): String = {
  if (cardname.isEmpty()) {
    return "NONE"
  } else {
    return cardname
  }
}

def main(args: Array[String]): Unit = {
  val source = scala.io.Source.fromFile("bc.txt")
  val keymap = scala.collection.mutable.Map[String, String]()
  for (line <- source.getLines) {
    val Array(country, capital) = line.split(",").map { _.trim() }
    keymap += country -> capital
  }
  println(keymap)

  val conf = new SparkConf().setMaster("local[2]").setAppName("AAA")
  val sparkcontext = new SparkContext(conf)
  val countriesCache = sparkcontext.broadcast(keymap)
  val file = sparkcontext.textFile("salesdata.csv")

  val a = file.map { line => line.split(",") }
    .map { line => {
      var columns = line(3)
      if (countriesCache.value.contains(columns)) {
        columns.map { x => (line(1), countriesCache.value(columns)) }
      } else {
        columns.map { x => (line(1), "NULL") }
      }
    }}
  a.foreach(x => println(x.mkString(",")))
}}
This does not give me my expected output; instead it gives output like below. Please suggest what the issue is here.
(htc,VS),(htc,VS),(htc,VS),(htc,VS)
(iphone,MS),(iphone,MS),(iphone,MS),(iphone,MS),(iphone,MS),(iphone,MS),(iphone,MS),(iphone,MS),(iphone,MS),(iphone,MS)
(cloth,VS),(cloth,VS),(cloth,VS),(cloth,VS)
I think the problem is that you iterate over the characters of your string in these lines:
columns.map { x => ( line(1),countriesCache.value(columns) ) }
and
columns.map { x => (line(1),"NULL") }
just use
( line(1),countriesCache.value(columns) )
and
(line(1),"NULL")
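Putting the fix together, the mapping step could look like this (a sketch, assuming the same countriesCache broadcast and the column layout from the question: product name in column 1, card type in column 3):

val pairs = file
  .map(_.split(","))
  .map { cols =>
    val card = cols(3)
    // Look the card type up in the broadcast map; emit exactly one pair per input line.
    val code = countriesCache.value.getOrElse(card, "NULL")
    (cols(1), code)
  }

pairs.foreach(println)

Because each input line now produces a single tuple instead of one tuple per character, the output contains one (product, code) pair per sale.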

Slick3 with SQLite - autocommit seems to not be working

I'm trying to write some basic queries with Slick for an SQLite database.
Here is my code:
class MigrationLog(name: String) {
  val migrationEvents = TableQuery[MigrationEventTable]

  lazy val db: Future[SQLiteDriver.backend.DatabaseDef] = {
    val db = Database.forURL(s"jdbc:sqlite:$name.db", driver = "org.sqlite.JDBC")
    val setup = DBIO.seq(migrationEvents.schema.create)
    val createFuture = for {
      tables <- db.run(MTable.getTables)
      createResult <- if (tables.length == 0) db.run(setup) else Future.successful()
    } yield createResult
    createFuture.map(_ => db)
  }

  val addEvent: (String, String) => Future[String] = (aggregateId, eventType) => {
    val id = java.util.UUID.randomUUID().toString
    val command = DBIO.seq(migrationEvents += (id, aggregateId, None, eventType, "CREATED", System.currentTimeMillis, None))
    db.flatMap(_.run(command).map(_ => id))
  }

  val eventSubmitted: (String, String) => Future[Unit] = (id, batchId) => {
    val q = for { e <- migrationEvents if e.id === id } yield (e.batchId, e.status, e.updatedAt)
    val updateAction = q.update(Some(batchId), "SUBMITTED", Some(System.currentTimeMillis))
    db.map(_.run(updateAction))
  }

  val eventMigrationCompleted: (String, String, String) => Future[Unit] = (batchId, id, status) => {
    val q = for { e <- migrationEvents if e.batchId === batchId && e.id === id } yield (e.status, e.updatedAt)
    val updateAction = q.update(status, Some(System.currentTimeMillis))
    db.map(_.run(updateAction))
  }

  val allEvents = () => {
    db.flatMap(_.run(migrationEvents.result))
  }
}
Here is how I'm using it:
val migrationLog = MigrationLog("test")
val events = for {
  id <- migrationLog.addEvent("aggregateUserId", "userAccessControl")
  _ <- migrationLog.eventSubmitted(id, "batchID_generated_from_idam")
  _ <- migrationLog.eventMigrationCompleted("batchID_generated_from_idam", id, "Successful")
  events <- migrationLog.allEvents()
} yield events

events.map(_.foreach(event => event match {
  case (id, aggregateId, batchId, eventType, status, submitted, updatedAt) =>
    println(s"$id $aggregateId $batchId $eventType $status $submitted $updatedAt")
}))
The idea is to add an event first, then update it with a batchId (which also updates the status), and then update the status when the job is done. events should contain events with status Successful.
What happens is that after running this code it prints events with status SUBMITTED. If I wait a while and run the same allEvents query, or check the db from the command line using sqlite3, then it's updated correctly.
I'm waiting for each future to resolve before starting the next operation, and auto-commit should be enabled by default.
Am I missing something?
Turns out the problem was with db.map(_.run(updateAction)), which returns Future[Future[Int]], meaning the command was not finished by the time I tried to run another query.
Replacing it with db.flatMap(_.run(updateAction)) solved the issue.
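Applied to one of the methods above, the fix looks roughly like this (a sketch; the returned Future now completes only after the update has actually run):

val eventSubmitted: (String, String) => Future[Unit] = (id, batchId) => {
  val q = for { e <- migrationEvents if e.id === id } yield (e.batchId, e.status, e.updatedAt)
  val updateAction = q.update((Some(batchId), "SUBMITTED", Some(System.currentTimeMillis)))
  // flatMap avoids the nested Future[Future[Int]]; map to Unit discards the row count.
  db.flatMap(_.run(updateAction)).map(_ => ())
}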

handle spark state istimingout

I'm using the new mapWithState function in Spark Streaming (1.6) with a state timeout. I want to take the timed-out state and add it to another RDD in order to use it for calculations further down the road:
val aggedlogs = sc.emptyRDD[MyLog];

val mappingFunc = (key: String, newlog: Option[MyLog], state: State[MyLog]) => {
  val _newLog = newlog.getOrElse(null)
  if ((state.exists()) && (_newLog != null)) {
    val stateLog = state.get()
    val combinedLog = LogUtil.CombineLogs(_newLog, stateLog);
    state.update(combinedLog)
  }
  else if (_newLog != null) {
    state.update(_newLog);
  }
  if (state.isTimingOut()) {
    val stateLog = state.get();
    aggedlogs.union(sc.parallelize(List(stateLog), 1))
  }
  val stateLog = state.get();
  (key, stateLog);
}

val stateDstream = reducedlogs.mapWithState(StateSpec.function(mappingFunc).timeout(Seconds(10)))
But when I try to add it to an RDD in the StateSpec function, I get an error that the function is not serializable. Any thoughts on how I can get past this?
EDIT:
After drilling deeper I found that my approach was wrong. Before trying this solution I tried to get the timed-out logs from stateSnapshots(), but they were not there anymore. By changing the mapping function to:
def mappingFunc(key: String, newlog: Option[MyLog], state: State[KomoonaLog]): Option[(String, MyLog)] = {
  val _newLog = newlog.getOrElse(null)
  if ((state.exists()) && (_newLog != null)) {
    val stateLog = state.get()
    val combinedLog = LogUtil.CombineLogs(_newLog, stateLog);
    state.update(combinedLog)
    Some(key, combinedLog);
  }
  else if (_newLog != null) {
    state.update(_newLog);
    Some(key, _newLog);
  }
  if (state.isTimingOut()) {
    val stateLog = state.get();
    stateLog.timinigOut = true;
    System.out.println("timinigOut : " + key);
    Some(key, stateLog);
  }
  val stateLog = state.get();
  Some(key, stateLog);
}
I managed to filter the mapped-with-state DStream for the logs that are timing out in each batch:
val stateDstream = reducedlogs.mapWithState(
  StateSpec.function(mappingFunc _).timeout(Seconds(60)))

val tiningoutlogs = stateDstream.filter(filtertimingout)
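The filtertimingout predicate isn't shown above; a minimal sketch, assuming MyLog exposes the mutable timinigOut flag set in the mapping function and that the stream elements are the Option[(String, MyLog)] values mappingFunc returns:

// Hypothetical predicate: keep only entries whose state was flagged as timing out.
def filtertimingout(entry: Option[(String, MyLog)]): Boolean =
  entry.exists { case (_, log) => log.timinigOut }

// Usage, as above:
// val tiningoutlogs = stateDstream.filter(filtertimingout)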