Why does mapPartitions not see my val? (Scala/Spark)

I define a val like this:
val config = Config(args)
val product_type = config.product_type
Then I pass product_type as "AA", and my code is this:
val scores = df.mapPartitions(iterator => {
val inputStream =
if(product_type == "AA" ) {
getClass().getClassLoader().getResourceAsStream("my_aa.hdf5")
}
else {
getClass().getClassLoader().getResourceAsStream("my_bb.hdf5")
}
val multiLayerNetwork: MultiLayerNetwork = KerasModelImport.importKerasSequentialModelAndWeights(inputStream, false)
val wrapped: ParallelInference = new ParallelInference.Builder(multiLayerNetwork).build()
val res = iterator.map(row => {
wrapped.output(row).toDoubleVector
})
res
})
But my inputStream is loaded from "my_bb.hdf5", which is not correct; that value comes from the else branch. So why can't my product_type variable be read inside mapPartitions?
I print my product_type value right before this code and I checked it: it is "AA".

It occurs because I get this variable from an argument in spark-submit.sh, and it cannot be read inside mapPartitions.
It works like this:
val scores =
if (product_type == "AA") {
df.mapPartitions(iterator => {
val inputStream = getClass().getClassLoader().getResourceAsStream("AA.hdf5")
val multiLayerNetwork: MultiLayerNetwork = KerasModelImport.importKerasSequentialModelAndWeights(inputStream, false)
val wrapped: ParallelInference = new ParallelInference.Builder(multiLayerNetwork).build()
val res = iterator.map(row => {
wrapped.output(row).toDoubleVector
})
res
})
} else {
df.mapPartitions(iterator => {
val inputStream = getClass().getClassLoader().getResourceAsStream("BB.hdf5")
val multiLayerNetwork: MultiLayerNetwork = KerasModelImport.importKerasSequentialModelAndWeights(inputStream, false)
val wrapped: ParallelInference = new ParallelInference.Builder(multiLayerNetwork).build()
val res = iterator.map(row => {
wrapped.output(row).toDoubleVector
})
res
})
}
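A minimal sketch of an alternative that avoids duplicating the mapPartitions body: snapshot product_type into a local val so that only the string itself is captured by the closure, not the enclosing object (assuming that is the root cause; resource names taken from the original code):
val localProductType = product_type // plain String captured by the closure instead of the enclosing object
val scores = df.mapPartitions(iterator => {
  val resourceName = if (localProductType == "AA") "my_aa.hdf5" else "my_bb.hdf5"
  val inputStream = getClass().getClassLoader().getResourceAsStream(resourceName)
  val multiLayerNetwork: MultiLayerNetwork = KerasModelImport.importKerasSequentialModelAndWeights(inputStream, false)
  val wrapped: ParallelInference = new ParallelInference.Builder(multiLayerNetwork).build()
  iterator.map(row => wrapped.output(row).toDoubleVector)
})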

Related

Scala code generates type mismatch (ScalaPB)

I have a protobuf file which I transform to a Scala file using ScalaPB. This way I can then use it inside my Jupyter notebook* for transformation. Sadly, when I run the specific cell I get a type mismatch error and I don't know why.
As the protobuf file works with Python and the Scala code is generated, what is not right here? Could this be a bug?
*The notebook uses com.google.protobuf:protobuf-java:3.5.0,com.thesamet.scalapb:sparksql-scalapb_2.11:0.7.0 as imports
Sources & Error
protobuf file:
syntax = "proto2";
import "scalapb/scalapb.proto";
option (scalapb.options) = {
flat_package: true
single_file: true
};
message JSONEntry {
required uint64 ts = 1;
required string data = 2;
}
message JSONOutput {
optional string metadata = 1;
repeated JSONEntry entry = 2;
}
Scala (generated) code
// Generated by the Scala Plugin for the Protocol Buffer Compiler.
// Do not edit!
//
// Protofile syntax: PROTO2
@SerialVersionUID(0L)
final case class JSONEntry(
ts: _root_.scala.Long,
data: _root_.scala.Predef.String
) extends scalapb.GeneratedMessage with scalapb.Message[JSONEntry] with scalapb.lenses.Updatable[JSONEntry] {
@transient
private[this] var __serializedSizeCachedValue: _root_.scala.Int = 0
private[this] def __computeSerializedValue(): _root_.scala.Int = {
var __size = 0
__size += _root_.com.google.protobuf.CodedOutputStream.computeUInt64Size(1, ts)
__size += _root_.com.google.protobuf.CodedOutputStream.computeStringSize(2, data)
__size
}
final override def serializedSize: _root_.scala.Int = {
var read = __serializedSizeCachedValue
if (read == 0) {
read = __computeSerializedValue()
__serializedSizeCachedValue = read
}
read
}
def writeTo(`_output__`: _root_.com.google.protobuf.CodedOutputStream): Unit = {
_output__.writeUInt64(1, ts)
_output__.writeString(2, data)
}
def mergeFrom(`_input__`: _root_.com.google.protobuf.CodedInputStream): JSONEntry = {
var __ts = this.ts
var __data = this.data
var __requiredFields0: _root_.scala.Long = 0x3L
var _done__ = false
while (!_done__) {
val _tag__ = _input__.readTag()
_tag__ match {
case 0 => _done__ = true
case 8 =>
__ts = _input__.readUInt64()
__requiredFields0 &= 0xfffffffffffffffeL
case 18 =>
__data = _input__.readString()
__requiredFields0 &= 0xfffffffffffffffdL
case tag => _input__.skipField(tag)
}
}
if (__requiredFields0 != 0L) { throw new _root_.com.google.protobuf.InvalidProtocolBufferException("Message missing required fields.") }
JSONEntry(
ts = __ts,
data = __data
)
}
def withTs(__v: _root_.scala.Long): JSONEntry = copy(ts = __v)
def withData(__v: _root_.scala.Predef.String): JSONEntry = copy(data = __v)
def getFieldByNumber(__fieldNumber: _root_.scala.Int): scala.Any = {
(__fieldNumber: @_root_.scala.unchecked) match {
case 1 => ts
case 2 => data
}
}
def getField(__field: _root_.scalapb.descriptors.FieldDescriptor): _root_.scalapb.descriptors.PValue = {
require(__field.containingMessage eq companion.scalaDescriptor)
(__field.number: @_root_.scala.unchecked) match {
case 1 => _root_.scalapb.descriptors.PLong(ts)
case 2 => _root_.scalapb.descriptors.PString(data)
}
}
def toProtoString: _root_.scala.Predef.String = _root_.scalapb.TextFormat.printToUnicodeString(this)
def companion = JSONEntry
}
object JSONEntry extends scalapb.GeneratedMessageCompanion[JSONEntry] {
implicit def messageCompanion: scalapb.GeneratedMessageCompanion[JSONEntry] = this
def fromFieldsMap(__fieldsMap: scala.collection.immutable.Map[_root_.com.google.protobuf.Descriptors.FieldDescriptor, scala.Any]): JSONEntry = {
require(__fieldsMap.keys.forall(_.getContainingType() == javaDescriptor), "FieldDescriptor does not match message type.")
val __fields = javaDescriptor.getFields
JSONEntry(
__fieldsMap(__fields.get(0)).asInstanceOf[_root_.scala.Long],
__fieldsMap(__fields.get(1)).asInstanceOf[_root_.scala.Predef.String]
)
}
implicit def messageReads: _root_.scalapb.descriptors.Reads[JSONEntry] = _root_.scalapb.descriptors.Reads{
case _root_.scalapb.descriptors.PMessage(__fieldsMap) =>
require(__fieldsMap.keys.forall(_.containingMessage == scalaDescriptor), "FieldDescriptor does not match message type.")
JSONEntry(
__fieldsMap.get(scalaDescriptor.findFieldByNumber(1).get).get.as[_root_.scala.Long],
__fieldsMap.get(scalaDescriptor.findFieldByNumber(2).get).get.as[_root_.scala.Predef.String]
)
case _ => throw new RuntimeException("Expected PMessage")
}
def javaDescriptor: _root_.com.google.protobuf.Descriptors.Descriptor = DataProto.javaDescriptor.getMessageTypes.get(0)
def scalaDescriptor: _root_.scalapb.descriptors.Descriptor = DataProto.scalaDescriptor.messages(0)
def messageCompanionForFieldNumber(__number: _root_.scala.Int): _root_.scalapb.GeneratedMessageCompanion[_] = throw new MatchError(__number)
lazy val nestedMessagesCompanions: Seq[_root_.scalapb.GeneratedMessageCompanion[_]] = Seq.empty
def enumCompanionForFieldNumber(__fieldNumber: _root_.scala.Int): _root_.scalapb.GeneratedEnumCompanion[_] = throw new MatchError(__fieldNumber)
lazy val defaultInstance = JSONEntry(
ts = 0L,
data = ""
)
implicit class JSONEntryLens[UpperPB](_l: _root_.scalapb.lenses.Lens[UpperPB, JSONEntry]) extends _root_.scalapb.lenses.ObjectLens[UpperPB, JSONEntry](_l) {
def ts: _root_.scalapb.lenses.Lens[UpperPB, _root_.scala.Long] = field(_.ts)((c_, f_) => c_.copy(ts = f_))
def data: _root_.scalapb.lenses.Lens[UpperPB, _root_.scala.Predef.String] = field(_.data)((c_, f_) => c_.copy(data = f_))
}
final val TS_FIELD_NUMBER = 1
final val DATA_FIELD_NUMBER = 2
}
@SerialVersionUID(0L)
final case class JSONOutput(
metadata: scala.Option[_root_.scala.Predef.String] = None,
entry: _root_.scala.collection.Seq[JSONEntry] = _root_.scala.collection.Seq.empty
) extends scalapb.GeneratedMessage with scalapb.Message[JSONOutput] with scalapb.lenses.Updatable[JSONOutput] {
@transient
private[this] var __serializedSizeCachedValue: _root_.scala.Int = 0
private[this] def __computeSerializedValue(): _root_.scala.Int = {
var __size = 0
if (metadata.isDefined) { __size += _root_.com.google.protobuf.CodedOutputStream.computeStringSize(1, metadata.get) }
entry.foreach(entry => __size += 1 + _root_.com.google.protobuf.CodedOutputStream.computeUInt32SizeNoTag(entry.serializedSize) + entry.serializedSize)
__size
}
final override def serializedSize: _root_.scala.Int = {
var read = __serializedSizeCachedValue
if (read == 0) {
read = __computeSerializedValue()
__serializedSizeCachedValue = read
}
read
}
def writeTo(`_output__`: _root_.com.google.protobuf.CodedOutputStream): Unit = {
metadata.foreach { __v =>
_output__.writeString(1, __v)
};
entry.foreach { __v =>
_output__.writeTag(2, 2)
_output__.writeUInt32NoTag(__v.serializedSize)
__v.writeTo(_output__)
};
}
def mergeFrom(`_input__`: _root_.com.google.protobuf.CodedInputStream): JSONOutput = {
var __metadata = this.metadata
val __entry = (_root_.scala.collection.immutable.Vector.newBuilder[JSONEntry] ++= this.entry)
var _done__ = false
while (!_done__) {
val _tag__ = _input__.readTag()
_tag__ match {
case 0 => _done__ = true
case 10 =>
__metadata = Option(_input__.readString())
case 18 =>
__entry += _root_.scalapb.LiteParser.readMessage(_input__, JSONEntry.defaultInstance)
case tag => _input__.skipField(tag)
}
}
JSONOutput(
metadata = __metadata,
entry = __entry.result()
)
}
def getMetadata: _root_.scala.Predef.String = metadata.getOrElse("")
def clearMetadata: JSONOutput = copy(metadata = None)
def withMetadata(__v: _root_.scala.Predef.String): JSONOutput = copy(metadata = Option(__v))
def clearEntry = copy(entry = _root_.scala.collection.Seq.empty)
def addEntry(__vs: JSONEntry*): JSONOutput = addAllEntry(__vs)
def addAllEntry(__vs: TraversableOnce[JSONEntry]): JSONOutput = copy(entry = entry ++ __vs)
def withEntry(__v: _root_.scala.collection.Seq[JSONEntry]): JSONOutput = copy(entry = __v)
def getFieldByNumber(__fieldNumber: _root_.scala.Int): scala.Any = {
(__fieldNumber: @_root_.scala.unchecked) match {
case 1 => metadata.orNull
case 2 => entry
}
}
def getField(__field: _root_.scalapb.descriptors.FieldDescriptor): _root_.scalapb.descriptors.PValue = {
require(__field.containingMessage eq companion.scalaDescriptor)
(__field.number: @_root_.scala.unchecked) match {
case 1 => metadata.map(_root_.scalapb.descriptors.PString).getOrElse(_root_.scalapb.descriptors.PEmpty)
case 2 => _root_.scalapb.descriptors.PRepeated(entry.map(_.toPMessage)(_root_.scala.collection.breakOut))
}
}
def toProtoString: _root_.scala.Predef.String = _root_.scalapb.TextFormat.printToUnicodeString(this)
def companion = JSONOutput
}
object JSONOutput extends scalapb.GeneratedMessageCompanion[JSONOutput] {
implicit def messageCompanion: scalapb.GeneratedMessageCompanion[JSONOutput] = this
def fromFieldsMap(__fieldsMap: scala.collection.immutable.Map[_root_.com.google.protobuf.Descriptors.FieldDescriptor, scala.Any]): JSONOutput = {
require(__fieldsMap.keys.forall(_.getContainingType() == javaDescriptor), "FieldDescriptor does not match message type.")
val __fields = javaDescriptor.getFields
JSONOutput(
__fieldsMap.get(__fields.get(0)).asInstanceOf[scala.Option[_root_.scala.Predef.String]],
__fieldsMap.getOrElse(__fields.get(1), Nil).asInstanceOf[_root_.scala.collection.Seq[JSONEntry]]
)
}
implicit def messageReads: _root_.scalapb.descriptors.Reads[JSONOutput] = _root_.scalapb.descriptors.Reads{
case _root_.scalapb.descriptors.PMessage(__fieldsMap) =>
require(__fieldsMap.keys.forall(_.containingMessage == scalaDescriptor), "FieldDescriptor does not match message type.")
JSONOutput(
__fieldsMap.get(scalaDescriptor.findFieldByNumber(1).get).flatMap(_.as[scala.Option[_root_.scala.Predef.String]]),
__fieldsMap.get(scalaDescriptor.findFieldByNumber(2).get).map(_.as[_root_.scala.collection.Seq[JSONEntry]]).getOrElse(_root_.scala.collection.Seq.empty)
)
case _ => throw new RuntimeException("Expected PMessage")
}
def javaDescriptor: _root_.com.google.protobuf.Descriptors.Descriptor = DataProto.javaDescriptor.getMessageTypes.get(1)
def scalaDescriptor: _root_.scalapb.descriptors.Descriptor = DataProto.scalaDescriptor.messages(1)
def messageCompanionForFieldNumber(__number: _root_.scala.Int): _root_.scalapb.GeneratedMessageCompanion[_] = {
var __out: _root_.scalapb.GeneratedMessageCompanion[_] = null
(__number: @_root_.scala.unchecked) match {
case 2 => __out = JSONEntry
}
__out
}
lazy val nestedMessagesCompanions: Seq[_root_.scalapb.GeneratedMessageCompanion[_]] = Seq.empty
def enumCompanionForFieldNumber(__fieldNumber: _root_.scala.Int): _root_.scalapb.GeneratedEnumCompanion[_] = throw new MatchError(__fieldNumber)
lazy val defaultInstance = JSONOutput(
)
implicit class JSONOutputLens[UpperPB](_l: _root_.scalapb.lenses.Lens[UpperPB, JSONOutput]) extends _root_.scalapb.lenses.ObjectLens[UpperPB, JSONOutput](_l) {
def metadata: _root_.scalapb.lenses.Lens[UpperPB, _root_.scala.Predef.String] = field(_.getMetadata)((c_, f_) => c_.copy(metadata = Option(f_)))
def optionalMetadata: _root_.scalapb.lenses.Lens[UpperPB, scala.Option[_root_.scala.Predef.String]] = field(_.metadata)((c_, f_) => c_.copy(metadata = f_))
def entry: _root_.scalapb.lenses.Lens[UpperPB, _root_.scala.collection.Seq[JSONEntry]] = field(_.entry)((c_, f_) => c_.copy(entry = f_))
}
final val METADATA_FIELD_NUMBER = 1
final val ENTRY_FIELD_NUMBER = 2
}
object DataProto extends _root_.scalapb.GeneratedFileObject {
lazy val dependencies: Seq[_root_.scalapb.GeneratedFileObject] = Seq(
scalapb.options.ScalapbProto
)
lazy val messagesCompanions: Seq[_root_.scalapb.GeneratedMessageCompanion[_]] = Seq(
JSONEntry,
JSONOutput
)
private lazy val ProtoBytes: Array[Byte] =
scalapb.Encoding.fromBase64(scala.collection.Seq(
"""CgpkYXRhLnByb3RvGhVzY2FsYXBiL3NjYWxhcGIucHJvdG8iLwoJSlNPTkVudHJ5Eg4KAnRzGAEgAigEUgJ0cxISCgRkYXRhG
AIgAigJUgRkYXRhIkoKCkpTT05PdXRwdXQSGgoIbWV0YWRhdGEYASABKAlSCG1ldGFkYXRhEiAKBWVudHJ5GAIgAygLMgouSlNPT
kVudHJ5UgVlbnRyeUIH4j8EEAEoAQ=="""
).mkString)
lazy val scalaDescriptor: _root_.scalapb.descriptors.FileDescriptor = {
val scalaProto = com.google.protobuf.descriptor.FileDescriptorProto.parseFrom(ProtoBytes)
_root_.scalapb.descriptors.FileDescriptor.buildFrom(scalaProto, dependencies.map(_.scalaDescriptor))
}
lazy val javaDescriptor: com.google.protobuf.Descriptors.FileDescriptor = {
val javaProto = com.google.protobuf.DescriptorProtos.FileDescriptorProto.parseFrom(ProtoBytes)
com.google.protobuf.Descriptors.FileDescriptor.buildFrom(javaProto, Array(
scalapb.options.ScalapbProto.javaDescriptor
))
}
@deprecated("Use javaDescriptor instead. In a future version this will refer to scalaDescriptor.", "ScalaPB 0.5.47")
def descriptor: com.google.protobuf.Descriptors.FileDescriptor = javaDescriptor
}
Error
<console>:82: error: type mismatch;
found : JSONEntry.type
required: scalapb.GeneratedMessageCompanion[_]
def companion = JSONEntry
^
I was able to successfully compile your proto file with the following code
project/scalapb.sbt
addSbtPlugin("com.thesamet" % "sbt-protoc" % "0.99.16")
libraryDependencies += "com.thesamet.scalapb" %% "compilerplugin" % "0.7.0"
build.sbt
lazy val root = (project in file(".")).
settings(
inThisBuild(List(
organization := "com.example",
scalaVersion := "2.12.4",
version := "0.1.0-SNAPSHOT"
)),
name := "protobuf",
libraryDependencies ++= Seq(
"com.thesamet.scalapb" %% "scalapb-runtime" % scalapb.compiler.Version.scalapbVersion % "protobuf"
),
PB.targets in (Compile) := Seq(
scalapb.gen() -> (sourceManaged in Compile).value
)
)
Now copy and paste your proto file into src/main/protobuf as hello.proto and do an sbt clean compile.
The only thing I did differently is that I added a package to the proto file:
syntax = "proto2";
import "scalapb/scalapb.proto";
option (scalapb.options) = {
package_name: "com.abhi"
flat_package: true
single_file: true
};
message JSONEntry {
required uint64 ts = 1;
required string data = 2;
}
message JSONOutput {
optional string metadata = 1;
repeated JSONEntry entry = 2;
}
Now finally use the generated code in your app
package example
import com.abhi.JSONEntry
import java.io._
object Hello extends App {
val jsonEntry = JSONEntry(10L, "foo")
val target = new FileOutputStream(new File("foo.bin"))
jsonEntry.writeTo(target)
target.close()
}
The code compiles correctly and there is no compilation error.
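For completeness, a minimal read-back sketch (not part of the original answer), assuming ScalaPB's standard generated companion API, which exposes parseFrom(InputStream):
package example
import com.abhi.JSONEntry
import java.io._
object ReadBack extends App {
  // read back the message written by Hello above
  val source = new FileInputStream(new File("foo.bin"))
  val entry = JSONEntry.parseFrom(source)
  source.close()
  println(s"ts=${entry.ts}, data=${entry.data}")
}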

MongoSpark duplicate key error only on massive datasets

Using MongoSpark, running the same code on two datasets of differing sizes causes one of them to throw the E11000 duplicate key error.
Before we proceed, here is the code:
object ScrapeHubCompanyImporter {
def importData(path: String, companyMongoUrl: String): Unit = {
val spark = SparkSession.builder()
.master("local[*]")
.config("spark.mongodb.input.uri", companyMongoUrl)
.config("spark.mongodb.output.uri", companyMongoUrl)
.config("spark.mongodb.input.partitionerOptions.partitionKey", "profileUrl")
.getOrCreate()
import spark.implicits._
val websiteToDomainTransformer = udf((website: String) => {
val tldExtract = SplitHost.fromURL(website)
if (tldExtract.domain == "") {
null
} else {
tldExtract.domain + "." + tldExtract.tld
}
})
val jsonDF =
spark
.read
.json(path)
.filter { row =>
row.getAs[String]("canonical_url") != null
}
.dropDuplicates(Seq("canonical_url"))
.select(
toHttpsUdf($"canonical_url").as("profileUrl"),
$"city",
$"country",
$"founded",
$"hq".as("headquartes"),
$"industry",
$"company_id".as("companyId"),
$"name",
$"postal",
$"size",
$"specialties",
$"state",
$"street_1",
$"street_2",
$"type",
$"website"
)
.filter { row => row.getAs[String]("website") != null }
.withColumn("domain", websiteToDomainTransformer($"website"))
.filter(row => row.getAs[String]("domain") != null)
.as[ScrapeHubCompanyDataRep]
val jsonColsSet = jsonDF.columns.toSet
val mongoData = MongoSpark
.load[LinkedinCompanyRep](spark)
.withColumn("companyUrl", toHttpsUdf($"companyUrl"))
.as[CompanyRep]
val mongoColsSet = mongoData.columns.toSet
val union = jsonDF.joinWith(
mongoData,
jsonDF("companyUrl") === mongoData("companyUrl"),
joinType = "left")
.map { t =>
val scrapeHub = t._1
val liCompanyRep = if (t._2 != null ) {
t._2
} else {
CompanyRep(domain = scrapeHub.domain)
}
CompanyRep(
_id = pickValue(liCompanyRep._id, None),
city = pickValue(scrapeHub.city, liCompanyRep.city),
country = pickValue(scrapeHub.country, liCompanyRep.country),
postal = pickValue(scrapeHub.postal, liCompanyRep.postal),
domain = scrapeHub.domain,
founded = pickValue(scrapeHub.founded, liCompanyRep.founded),
headquartes = pickValue(scrapeHub.headquartes, liCompanyRep.headquartes),
headquarters = liCompanyRep.headquarters,
industry = pickValue(scrapeHub.industry, liCompanyRep.industry),
linkedinId = pickValue(scrapeHub.companyId, liCompanyRep.companyId),
companyUrl = Option(scrapeHub.companyUrl),
name = pickValue(scrapeHub.name, liCompanyRep.name),
size = pickValue(scrapeHub.size, liCompanyRep.size),
specialties = pickValue(scrapeHub.specialties, liCompanyRep.specialties),
street_1 = pickValue(scrapeHub.street_1, liCompanyRep.street_1),
street_2 = pickValue(scrapeHub.street_2, liCompanyRep.street_2),
state = pickValue(scrapeHub.state, liCompanyRep.state),
`type` = pickValue(scrapeHub.`type`, liCompanyRep.`type`),
website = pickValue(scrapeHub.website, liCompanyRep.website),
updatedDate = None,
scraped = Some(true)
)
}
val idToMongoId = udf { st: String =>
if (st != null) {
ObjectId(st)
} else {
null
}
}
val saveReady =
union
.map { rep =>
rep.copy(
updatedDate = Some(new Timestamp(System.currentTimeMillis)),
scraped = Some(true),
headquarters = generateCompanyHeadquarters(rep)
)
}
.dropDuplicates(Seq("companyUrl"))
MongoSpark.save(
saveReady.withColumn("_id", idToMongoId($"_id")),
WriteConfig(Map(
"uri" -> companyMongoUrl
)))
}
def generateCompanyHeadquarters(companyRep: CompanyRep): Option[CompanyHeadquarters] = {
val hq = CompanyHeadquarters(
country = companyRep.country,
geographicArea = companyRep.state,
city = companyRep.city,
postalCode = companyRep.postal,
line1 = companyRep.street_1,
line2 = companyRep.street_2
)
CompanyHeadquarters
.unapply(hq)
.get
.productIterator.toSeq.exists {
case a: Option[_] => a.isDefined
case _ => false
} match {
case true => Some(hq)
case false => None
}
}
def pickValue(left: Option[String], right: Option[String]): Option[String] = {
def _noneIfNull(opt: Option[String]): Option[String] = {
if (opt != null) {
opt
} else {
None
}
}
val lOpt = _noneIfNull(left)
val rOpt = _noneIfNull(right)
lOpt match {
case Some(l) => Option(l)
case None => rOpt match {
case Some(r) => Option(r)
case None => None
}
}
}
}
This issue is around the companyUrl, which is one of the unique keys in the collection, the other being the _id key. The issue is that there are tons of duplicates that Spark will attempt to save on a 700 GB dataset, but if I run a very small dataset locally, I'm never able to replicate the issue. I'm trying to understand what's going on, and how I can make sure to group all the existing companies on the companyUrl and make sure that duplicates really are removed globally across the dataset.
EDIT
Here are some scenarios that arise:
Company is in Mongo, and the file that's read has updated data -> duplicate key error can occur here
Company is not in Mongo but is in the file -> duplicate key error can occur here as well
EDIT 2
The duplication error occurs on the companyUrl field.
EDIT 3
I've narrowed this down to an issue in the merging stage. Looking through records that have been flagged as having duplicate companyUrls, some of those records are not in the target collection, yet somehow a duplicate record is still being written to the collection. In other situations, the _id field of the new record doesn't match the old record that had the same companyUrl.
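No answer is recorded here, but as a minimal sketch of the kind of global de-duplication the question asks for, assuming saveReady is a Dataset[CompanyRep] with companyUrl: Option[String] and _id: Option[String] (as the code above suggests) and that keeping one record per URL is acceptable:
// hypothetical normalizer so near-identical URLs collapse onto one key
def normalizeUrl(url: String): String = url.trim.toLowerCase.stripSuffix("/")
val globallyDeduped = saveReady
  .filter(_.companyUrl.isDefined)
  .groupByKey(rep => normalizeUrl(rep.companyUrl.get))
  .reduceGroups((a, b) => if (a._id.isDefined) a else b) // prefer the record that already has a Mongo _id
  .map(_._2)
// globallyDeduped could then replace saveReady in the MongoSpark.save call above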

Handle Spark state isTimingOut

I'm using the new mapWithState function in Spark Streaming (1.6) with a timing-out state. I want to take the timed-out state and add it to another RDD in order to use it for calculations further down the road:
val aggedlogs = sc.emptyRDD[MyLog];
val mappingFunc = (key: String, newlog: Option[MyLog], state: State[MyLog]) => {
val _newLog = newlog.getOrElse(null)
if ((state.exists())&&(_newLog!=null))
{
val stateLog = state.get()
val combinedLog = LogUtil.CombineLogs(_newLog, stateLog);
state.update(combinedLog)
}
else if (_newLog !=null) {
state.update(_newLog);
}
if (state.isTimingOut())
{
val stateLog = state.get();
aggedlogs.union(sc.parallelize(List(stateLog), 1))
}
val stateLog = state.get();
(key,stateLog);
}
val stateDstream = reducedlogs.mapWithState(StateSpec.function(mappingFunc).timeout(Seconds(10)))
But when I try to add it to an RDD in the StateSpec function, I get an error that the function is not serializable. Any thoughts on how I can get past this?
EDIT:
After drilling deeper I found that my approach was wrong. Before trying this solution I tried to get the timed-out logs from stateSnapshot(), but they were not there anymore. I changed the mapping function to:
def mappingFunc(key: String, newlog: Option[MyLog], state: State[KomoonaLog]) : Option[(String, MyLog)] = {
val _newLog = newlog.getOrElse(null)
if ((state.exists())&&(_newLog!=null))
{
val stateLog = state.get()
val combinedLog = LogUtil.CombineLogs(_newLog, stateLog);
state.update(combinedLog)
Some(key,combinedLog);
}
else if (_newLog !=null) {
state.update(_newLog);
Some(key,_newLog);
}
if (state.isTimingOut())
{
val stateLog = state.get();
stateLog.timinigOut = true;
System.out.println("timinigOut : " +key );
Some(key, stateLog);
}
val stateLog = state.get();
Some(key,stateLog);
}
I managed to filter the mapWithState DStream for the logs that are timing out in each batch:
val stateDstream = reducedlogs.mapWithState(
StateSpec.function(mappingFunc _).timeout(Seconds(60)))
val tiningoutlogs= stateDstream.filter (filtertimingout)
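The filtertimingout predicate is referenced but not shown; a minimal sketch of what it could look like, assuming the mapping function above returns Option[(String, MyLog)] and that MyLog carries the timinigOut flag set inside mappingFunc (field name kept as in the question):
def filtertimingout(mapped: Option[(String, MyLog)]): Boolean =
  mapped.exists { case (_, log) => log.timinigOut }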

Iterating over cogrouped RDD

I have used a cogroup function and obtained the following RDD:
org.apache.spark.rdd.RDD[(Int, (Iterable[(Int, Long)], Iterable[(Int, Long)]))]
Before the map operation, the joined object looks like this:
RDD[(Int, (Iterable[(Int, Long)], Iterable[(Int, Long)]))]
(-2095842000,(CompactBuffer((1504999740,1430096464017), (613904354,1430211912709), (-1514234644,1430288363100), (-276850688,1430330412225)),CompactBuffer((-511732877,1428682217564), (1133633791,1428831320960), (1168566678,1428964645450), (-407341933,1429009306167), (-1996133514,1429016485487), (872888282,1429031501681), (-826902224,1429034491003), (818711584,1429111125268), (-1068875079,1429117498135), (301875333,1429121399450), (-1730846275,1429131773065), (1806256621,1429135583312))))
(352234000,(CompactBuffer((1350763226,1430006650167), (-330160951,1430320010314)),CompactBuffer((2113207721,1428994842593), (-483470471,1429324209560), (1803928603,1429426861915))))
Now I want to do the following:
val globalBuffer = ListBuffer[Double]()
val joined = data1.cogroup(data2).map(x => {
val listA = x._2._1.toList
val listB = x._2._2.toList
for(tupleB <- listB) {
val localResults = ListBuffer[Double]()
val itemToTest = Set(tupleB._1)
val tempList = ListBuffer[(Int, Double)]()
for(tupleA <- listA) {
val tValue = someFunctionReturnDouble(tupleB._2, tupleA._2)
val i = (tupleA._1, tValue)
tempList += i
}
val sortList = tempList.sortWith(_._2 > _._2).slice(0,20).map(i => i._1)
val intersect = sortList.toSet.intersect(itemToTest)
if (intersect.size > 0)
localResults += 1.0
else localResults += 0.0
val normalized = sum(localResults.toList)/localResults.size
globalBuffer += normalized
}
})
//method sum
def sum(xs: List[Double]): Double = {//do the sum}
At the end of this I was expecting joined to be a list of double values, but when I looked at it, it was Unit. Also, I feel this is not the Scala way of doing it. How do I obtain globalBuffer as the final result?
Hmm, if I understood your code correctly, it could benefit from these improvements:
val joined = data1.cogroup(data2).map(x => {
val listA = x._2._1.toList
val listB = x._2._2.toList
val localResults = listB.map {
case (intBValue, longBValue) =>
val itemToTest = intBValue // it's always one element
val tempList = listA.map {
case (intAValue, longAValue) =>
(intAValue, someFunctionReturnDouble(longBValue, longAValue))
}
val sortList = tempList.sortBy(-_._2).slice(0,20).map(i => i._1)
if (sortList.toSet.contains(itemToTest)) { 1.0 } else {0.0}
// no real need to convert to a set for 20 elements, by the way
}
sum(localResults)/localResults.size
})
Transformations of RDDs are not going to modify globalBuffer. Copies of globalBuffer are made and sent out to each of the workers, but any modifications to these copies on the workers will never modify the globalBuffer that exists on the driver (the one you have defined outside the map on the RDD). Here's what I do (with a few additional modifications):
val joined = data1.cogroup(data2) map { x =>
val iterA = x._2._1
val iterB = x._2._2
var count, positiveCount = 0
val tempList = ListBuffer[(Int, Double)]()
for (tupleB <- iterB) {
tempList.clear
for(tupleA <- iterA) {
val tValue = someFunctionReturnDouble(tupleB._2, tupleA._2)
tempList += ((tupleA._1, tValue))
}
val sortList = tempList.sortWith(_._2 > _._2).iterator.take(20)
if (sortList.exists(_._1 == tupleB._1)) positiveCount += 1
count += 1
}
positiveCount.toDouble/count
}
At this point you can obtain a local copy of the proportions by using joined.collect.
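For example, a minimal usage sketch for pulling the per-key proportions back to the driver (Array[Double] already has a built-in sum, so the question's sum helper is not needed here):
val proportions: Array[Double] = joined.collect()
val overallProportion =
  if (proportions.nonEmpty) proportions.sum / proportions.length else 0.0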

Spark: split rows and accumulate

I have this code:
val rdd = sc.textFile("sample.log")
val splitRDD = rdd.map(r => StringUtils.splitPreserveAllTokens(r, "\\|"))
val rdd2 = splitRDD.filter(...).map(row => createRow(row, fieldsMap))
sqlContext.createDataFrame(rdd2, structType).save(
"org.apache.phoenix.spark", SaveMode.Overwrite, Map("table" -> table, "zkUrl" -> zkUrl))
def createRow(row: Array[String], fieldsMap: ListMap[Int, FieldConfig]): Row = {
//add additional index for invalidValues
val arrSize = fieldsMap.size + 1
val arr = new Array[Any](arrSize)
var invalidValues = ""
for ((k, v) <- fieldsMap) {
val valid = ...
var value : Any = null
if (valid) {
value = row(k)
// if (v.code == "SOURCE_NAME") --> 5th column in the row
// sourceNameCount = row(k).split(",").size
} else {
invalidValues += v.code + " : " + row(k) + " | "
}
arr(k) = value
}
arr(arrSize - 1) = invalidValues
Row.fromSeq(arr.toSeq)
}
fieldsMap contains the mapping of the input columns as (index, FieldConfig), where the FieldConfig class holds "code" and "dataType" values:
TOPIC -> (0, v.code = "TOPIC", v.dataType = "String")
GROUP -> (1, v.code = "GROUP")
SOURCE_NAME1,SOURCE_NAME2,SOURCE_NAME3 -> (4, v.code = "SOURCE_NAME")
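For illustration, a minimal sketch of how that mapping might be built, assuming a simple FieldConfig(code, dataType) case class (the question only says the class has "code" and "dataType" values, so this constructor is hypothetical):
import scala.collection.immutable.ListMap
case class FieldConfig(code: String, dataType: String = "String")
val fieldsMap: ListMap[Int, FieldConfig] = ListMap(
  0 -> FieldConfig("TOPIC"),
  1 -> FieldConfig("GROUP"),
  4 -> FieldConfig("SOURCE_NAME")
)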
This is the sample.log:
TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME1,SOURCE_NAME2,SOURCE_NAME3|
SOURCE_TYPE1,SOURCE_TYPE2,SOURCE_TYPE3|SOURCE_COUNT1,SOURCE_COUNT2,SOURCE_COUNT3|
DEST_NAME1,DEST_NAME2,DEST_NAME3|DEST_TYPE1,DEST_TYPE2,DEST_TYPE3|
DEST_COUNT1,DEST_COUNT2,DEST_COUNT3|
The goal is to split the input (sample.log) based on the number of source_name(s). In the example above, the output will have 3 rows:
TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME1|SOURCE_TYPE1|SOURCE_COUNT1|
|DEST_NAME1|DEST_TYPE1|DEST_COUNT1|
TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME2|SOURCE_TYPE2|SOURCE_COUNT2|
DEST_NAME2|DEST_TYPE2|DEST_COUNT2|
TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME3|SOURCE_TYPE3|SOURCE_COUNT3|
|DEST_NAME3|DEST_TYPE3|DEST_COUNT3|
This is the new code I am working on (still using createRow defined above):
val rdd2 = splitRDD.filter(...).flatMap(row => {
val srcName = row(4).split(",")
val srcType = row(5).split(",")
val srcCount = row(6).split(",")
val destName = row(7).split(",")
val destType = row(8).split(",")
val destCount = row(9).split(",")
var newRDD: ArrayBuffer[Row] = new ArrayBuffer[Row]()
//if (srcName != null) {
println("\n\nsrcName.size: " + srcName.size + "\n\n")
for (i <- 0 to srcName.size - 1) {
// missing column: destType can sometimes be null
val splittedRow: Array[String] = Row.fromSeq(Seq((row(0), row(1), row(2), row(3),
srcName(i), srcType(i), srcCount(i), destName(i), "", destCount(i)))).toSeq.toArray[String]
newRDD = newRDD ++ Seq(createRow(splittedRow, fieldsMap))
}
//}
Seq(Row.fromSeq(Seq(newRDD)))
})
Since I was getting an error converting my splittedRow to Array[String] (".toSeq.toArray[String]"):
error: type arguments [String] do not conform to method toArray's type parameter bounds [B >: Any]
I decided to update my splittedRow to:
val rowArr: Array[String] = new Array[String](10)
for (j <- 0 to 3) {
rowArr(j) = row(j)
}
rowArr(4) = srcName(i)
rowArr(5) = row(5).split(",")(i)
rowArr(6) = row(6).split(",")(i)
rowArr(7) = row(7).split(",")(i)
rowArr(8) = row(8).split(",")(i)
rowArr(9) = row(9).split(",")(i)
val splittedRow = rowArr
You could use a flatMap operation instead of a map operation to return multiple rows. Consequently, your createRow would be refactored to createRows(row: Array[String], fieldsMap: ListMap[Int, FieldConfig]): Seq[Row].
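A minimal sketch of what that createRows refactor could look like, reusing the createRow and FieldConfig definitions from the question and assuming the column layout of sample.log above (indices 4-9 hold the comma-separated source/destination fields); everything else here is illustrative:
import org.apache.spark.sql.Row
import scala.collection.immutable.ListMap
def createRows(row: Array[String], fieldsMap: ListMap[Int, FieldConfig]): Seq[Row] = {
  val srcNames   = row(4).split(",")
  val srcTypes   = row(5).split(",")
  val srcCounts  = row(6).split(",")
  val destNames  = row(7).split(",")
  val destTypes  = row(8).split(",")
  val destCounts = row(9).split(",")
  // one output Row per source entry, reusing the per-row validation in createRow
  srcNames.indices.map { i =>
    val expanded = Array(
      row(0), row(1), row(2), row(3),
      srcNames(i), srcTypes(i), srcCounts(i),
      destNames(i),
      destTypes.lift(i).getOrElse(""), // destType can sometimes be missing
      destCounts(i))
    createRow(expanded, fieldsMap)
  }
}
// and in the driver code:
// val rdd2 = splitRDD.filter(...).flatMap(row => createRows(row, fieldsMap))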