I want to share a HashMap across every node in Flink and allow the nodes to update that HashMap. I have this code so far:
object ParallelStreams {
  val env = StreamExecutionEnvironment.getExecutionEnvironment

  // Is there a way to attach a HashMap to this config variable?
  val config = new Configuration()
  config.setClass("HashMap", classOf[CustomGlobal])

  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

  class CustomGlobal extends ExecutionConfig.GlobalJobParameters {
    override def toMap: util.Map[String, String] = {
      new HashMap[String, String]()
    }
  }

  class MyCoMap extends RichCoMapFunction[String, String, String] {
    var users: HashMap[String, String] = null

    // How do I get access to the HashMap I attach to the global config here?
    override def open(parameters: Configuration): Unit = {
      super.open(parameters)
      val globalParams = getRuntimeContext.getExecutionConfig.getGlobalJobParameters
      val globalConf = globalParams.asInstanceOf[Configuration]
      val hashMap = globalConf.getClass
    }

    // Other functions to override here
  }
}
I was wondering: can you attach a custom object to the config variable created here with val config = new Configuration()? (Please see the comments in the code above.)
I noticed you can only attach primitive values. I created a custom class that extends ExecutionConfig.GlobalJobParameters and attached it by doing config.setClass("HashMap", classOf[CustomGlobal]), but I am not sure if that is how you are supposed to do it.
The common way to distribute parameters to operators is to have them as regular member variables in the function class. The function object that is created and assigned during plan construction is serialized and shipped to all workers. So you don't have to pass parameters via a configuration.
This would look as follows:
class MyMapper(map: HashMap[String, String]) extends MapFunction[String, String] {
  // class definition
}

val inStream: DataStream[String] = ???
val myHashMap: HashMap[String, String] = ???
val myMapper: MyMapper = new MyMapper(myHashMap)
val mappedStream: DataStream[String] = inStream.map(myMapper)
The myMapper object is serialized (using Java serialization) and shipped for execution. So the type of map must implement the Java Serializable interface.
EDIT: I missed the part that you want the map to be updatable from all parallel tasks. That is not possible with Flink. You would have to either fully replicate the map and all updates (by broadcasting them) or use an external system (a key-value store) for that.
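For the replication route, newer Flink versions (1.5+) offer broadcast state, which matches the "replicate the map and all updates" option. A minimal sketch, assuming a mainStream of keys and an updates stream of key/value pairs (both names are placeholders, not from the question):
import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.api.common.typeinfo.BasicTypeInfo
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.util.Collector

// Describes the replicated map every parallel task will hold.
val mapDescriptor = new MapStateDescriptor[String, String](
  "shared-map", BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO)

val result = mainStream
  .connect(updates.broadcast(mapDescriptor))
  .process(new BroadcastProcessFunction[String, (String, String), String] {
    override def processElement(
        key: String,
        ctx: BroadcastProcessFunction[String, (String, String), String]#ReadOnlyContext,
        out: Collector[String]): Unit = {
      // Regular elements get read-only access to the replicated map.
      out.collect(s"$key -> ${ctx.getBroadcastState(mapDescriptor).get(key)}")
    }

    override def processBroadcastElement(
        update: (String, String),
        ctx: BroadcastProcessFunction[String, (String, String), String]#Context,
        out: Collector[String]): Unit = {
      // Every parallel task receives each update and applies it locally.
      ctx.getBroadcastState(mapDescriptor).put(update._1, update._2)
    }
  })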
This is more of a Scala concept question than a Spark one. I have this Spark initialization code:
object EntryPoint {
  val spark = SparkFactory.createSparkSession(...
  val funcsSingleton = ContextSingleton[CustomFunctions] { new CustomFunctions(Some(hashConf)) }
  lazy val funcs = funcsSingleton.get
  // this part I want moved to another place since there are many many UDFs
  spark.udf.register("funcName", udf { funcName _ })
}
The other class, CustomFunctions, looks like this:
class CustomFunctions(val hashConfig: Option[HashConfig], spark: Option[SparkSession] = None) {
  val funcUdf = udf { funcName _ }
  def funcName(colValue: String) = withDefinedOpt(hashConfig) { c =>
    ...}
}
The class above is made Serializable by wrapping it in ContextSingleton, which is defined like so:
class ContextSingleton[T: ClassTag](constructor: => T) extends AnyRef with Serializable {
  val uuid = UUID.randomUUID.toString
  @transient private lazy val instance = ContextSingleton.pool.synchronized {
    ContextSingleton.pool.getOrElseUpdate(uuid, constructor)
  }
  def get = instance.asInstanceOf[T]
}

object ContextSingleton {
  private val pool = new TrieMap[String, Any]()
  def apply[T: ClassTag](constructor: => T): ContextSingleton[T] = new ContextSingleton[T](constructor)
  def poolSize: Int = pool.size
  def poolClear(): Unit = pool.clear()
}
Now to my problem: I don't want to have to explicitly register the UDFs as done in the EntryPoint app. I create all the UDFs as needed in my CustomFunctions class and want to dynamically register only the ones I read from a user-provided config. What would be the best way to achieve that? Also, I want to register the required UDFs outside the main app, but that throws the infamous TaskNotSerializable exception. Serializing the big CustomFunctions is not a good idea, hence I wrapped it in ContextSingleton, but my problem of registering UDFs outside cannot be solved that way. Please suggest the right approach.
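To make the intent concrete, here is a rough sketch of the dynamic registration the question describes (this is not an answer from the thread; udfByName and userProvidedNames are hypothetical placeholders):
import org.apache.spark.sql.expressions.UserDefinedFunction

// Hypothetical lookup from config name to the UDFs created in CustomFunctions.
val udfByName: Map[String, UserDefinedFunction] = Map(
  "funcName" -> funcs.funcUdf
)

// Register only the UDFs named in the user-provided config.
userProvidedNames.foreach { name =>
  udfByName.get(name).foreach(u => spark.udf.register(name, u))
}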
I want to write a unit test for a Scala class. The purpose of the class is to collect metrics and post them on a Kafka topic. I am trying to mock the producer in the unit test to ensure sanity of the rest of the code. Below is a simplified version of my class:
class MyEmitter(sparkConf: SparkConf) {
  <snip> -- member variables
  private val kafkaProducer = createProducer()

  def createProducer(): Producer[String, MyMetricClass] = {
    val props = new Properties()
    ...
    Code to initialize properties
    ...
    new KafkaProducer[String, MyMetricClass](props)
  }

  def initEmitter(metricName: String): SomeClass = {
    // Some implementation
  }

  def collect(key: String, value: String): Unit = {
    // Some implementation
  }

  def emit(): Unit = {
    val record = new ProducerRecord("<topic name>", "<key>", "<value>")
    kafkaProducer.send(record)
  }
}
What I would like to do in my unit test is mock out the producer and check whether the send() command has been called and, if so, whether the producer record matches the expectation. I have been unable to find a solution on my own, and searching for one has also been unfruitful. If anyone knows how the problem could be solved, I will be most grateful.
'new' is generally an enemy of testing, so you should extract the creation of that object so you can either pass a real KafkaProducer or a mock.
One way to do it without changing the interface could be:
def createProducer(
    producer: Properties => Producer[String, MyMetricClass] =
      props => new KafkaProducer[String, MyMetricClass](props)
): Producer[String, MyMetricClass] = {
  val props = new Properties()
  producer(props)
}
So then in real code you keep calling:
myEmitter.createProducer()
But in a test you'd do:
val producerMock = mock[KafkaProducer[String, MyMetricClass]]
myEmitter.createProducer(_ => producerMock)
Another good thing about this is that you could also stub the function itself, so you can verify that the props your method creates are the expected ones.
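For illustration, a minimal sketch of the verification side, assuming ScalaTest with MockitoSugar and that MyEmitter is adjusted so emit() actually uses the injected producer (none of this is from the original answer):
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.SparkConf
import org.mockito.ArgumentCaptor
import org.mockito.Mockito.verify
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatestplus.mockito.MockitoSugar

class MyEmitterSpec extends AnyFlatSpec with MockitoSugar {
  "emit" should "send the expected record" in {
    val producerMock = mock[KafkaProducer[String, MyMetricClass]]
    val emitter = new MyEmitter(new SparkConf())
    emitter.createProducer(_ => producerMock) // inject the mock

    emitter.emit()

    // Capture the record passed to send() and assert on its contents.
    val captor = ArgumentCaptor.forClass(classOf[ProducerRecord[String, MyMetricClass]])
    verify(producerMock).send(captor.capture())
    assert(captor.getValue.topic() == "<topic name>")
  }
}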
Hope it helps
Let's say I have a bunch of car objects in my project, for example:
object Porsche extends Car {
  override def start() {...}
  override def canStart(fuelInLitr: Int) = fuelInLitr > 5
  override val fuelInLitr = 45
  override val carId = 1234567
}
I'm extending Car, which is just a trait that sets the car structure:
trait Car {
  def start(): Unit
  def canStart(fuelInLitr: Int): Boolean
  val fuelInLitr: Int
  val carId: Int
}
Now, in the start() method I want to use some API service that will give me a car key based on the car's id so I can start the car.
So I have this CarApiService:
class CarApiService(wsClient: WSClient, configuration: Configuration) {
  implicit val formats: Formats = DefaultFormats

  def getCarkey(carId: String): Future[Option[CarKey]] = {
    val carInfoServiceApi = s"${configuration.get[String]("carsdb.carsInfo")}?carId=$carId"
    wsClient.url(carInfoServiceApi).withHttpHeaders(("Content-Type", "application/json")).get.map { response =>
      response.status match {
        case Status.OK => Some(parse(response.body).extract[CarKey])
        case Status.NO_CONTENT => None
        case _ => throw new Exception(s"carsdb failed to perform operation with status: ${response.status}, and body: ${response.body}")
      }
    }
  }
}
I want to be able to use getCarkey() in my car objects, so I created a CarsApiServicesModule which gives me access to the carApiService and its methods:
trait CarsApiServicesModule {
  // this supplies the carApiService its Configuration dependency
  lazy val configuration: Config = ConfigFactory.load()
  lazy val conf: Configuration = wire[Configuration]

  // this supplies the carApiService its WSClient dependency
  lazy val wsc: WSClient = wire[WSClient]

  lazy val carApiService: CarApiService = wire[CarApiService]
}
and now I want to mix this trait into my car object this way:
object Porsche extends Car with CarsApiServicesModule {
  // here I want to use carApiService
  // for example: carApiService.getCarkey(carId)...
}
but when compiling this I get an error.
Does anyone know what the issue is?
Also, does this design make sense?
You need to keep in mind that wire is just a helper macro which tries to generate the instance-creation code: it's quite dumb, in fact. Here, it would try to create a new instance of WSClient.
However, not all objects can be instantiated using a simple new call - sometimes you need to invoke a "factory" method.
In this case, if you take a look at the readme on GitHub, you'll see that to instantiate the WSClient, you need to create it through the StandaloneAhcWSClient() object.
So in this case, wire won't help you - you'll need to simply write the initialisation code by hand. Luckily it's not too large.
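For reference, a rough sketch of what that hand-written initialisation might look like, assuming the standalone AHC client from play-ws on Akka 2.6 (exact setup varies by version, and CarApiService would need to accept the standalone client type):
import akka.actor.ActorSystem
import akka.stream.Materializer
import play.api.libs.ws.ahc.StandaloneAhcWSClient

// The standalone AHC client needs an ActorSystem and Materializer to run on.
implicit val system: ActorSystem = ActorSystem("car-api")
implicit val materializer: Materializer = Materializer(system)

// Created through the factory method rather than new / wire.
lazy val wsc: StandaloneAhcWSClient = StandaloneAhcWSClient()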
Currently I'm using GridGain/Ignite in my project and have run into some problems:
As you may know, GridGain can hold any serializable object in a cache, like this:
val mycache = ignite.getOrCreateCache[String, MyClass]("MyName")
This means we can define our own class and extend it with the Dynamic trait - that's OK.
If we set the Ignite annotation (@QuerySqlField) on a specific class field, Ignite can run SQL queries against our classes like this:
val sql = "select * from MyClass"
mycache.query(new SqlFieldsQuery(sql))
And now my question:
How can I set Ignite annotations on dynamic fields in dynamic classes in Scala? I've attached my dynamic class definition and hope for some help.
class DynamicType extends Dynamic with Serializable {
  private val fields = mutable.Map.empty[String, Any].withDefault { key => throw new NoSuchFieldError(key) }

  def selectDynamic(key: String) = fields(key)
  def updateDynamic(key: String)(value: Any) = fields(key) = value
  def applyDynamic(key: String)(args: Any*) = fields(key)
}
As I understand it, your dynamic type implementation just represents a map of fields. In that case Ignite will serialize that map as a field of the DynamicType instance, so it's like any other object with a field of Map type. A Map's key/value pairs can't be annotated and can't be indexed by Ignite.
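For contrast, a sketch of the static shape Ignite's SQL support expects: annotations on concrete fields, with Scala's @field meta-annotation so the annotation lands on the underlying JVM field (the Person class is a hypothetical example):
import org.apache.ignite.cache.query.annotations.QuerySqlField
import scala.annotation.meta.field

// Fields annotated this way can be queried and (optionally) indexed by Ignite.
case class Person(
  @(QuerySqlField @field)(index = true) id: Int,
  @(QuerySqlField @field) name: String
) extends Serializable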
I know the benefits of lazy fields when a postponed evaluation of values is needed for some reason. I was wondering what the behavior of lazy fields is in terms of serialization.
Consider the following class.
class MyClass {
  lazy val myLazyVal = {...}
  ...
}
Questions:
1. If an instance of MyClass is serialized, does the lazy field get serialized too?
2. Does the behavior of serialization change depending on whether the field has been accessed before the serialization? I mean, if I don't cause the evaluation of the field, is it considered null?
3. Does the serialization mechanism provoke an implicit evaluation of the lazy field?
4. Is there a simple way to avoid serializing the variable and have the value recomputed lazily after deserialization? This should happen independently of the evaluation of the field.
Answers
1. Yes, if the field was already initialized; if not, you can treat it as a method. The value is not computed -> not serialized, but it is available after deserialization.
2. If you didn't touch the field, it is serialized almost as if it were a simple 'def' method; you don't need its type to be serializable itself, and it will be recalculated after deserialization.
3. No.
4. You can add @transient before the lazy val definition in my code example; as I understand it, that will do exactly what you want.
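For point 4, a minimal sketch of that change (the computation body is a placeholder):
class MyClass extends Serializable {
  // @transient keeps the cached value out of the serialized form;
  // the lazy val is recomputed on first access after deserialization.
  @transient lazy val myLazyVal = { /* expensive computation */ 42 }
}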
Code to prove it:
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.concurrent.atomic.AtomicInteger

object LazySerializationTest extends App {
  def serialize(obj: Any): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.toByteArray
  }

  def deSerialise(bytes: Array[Byte]): MyClass = {
    new ObjectInputStream(new ByteArrayInputStream(bytes)).
      readObject().asInstanceOf[MyClass]
  }

  def test(obj: MyClass): Unit = {
    val bytes = serialize(obj)
    val fromBytes = deSerialise(bytes)
    println(s"Original cnt = ${obj.x.cnt}")
    println(s"De Serialized cnt = ${fromBytes.x.cnt}")
  }

  object X {
    val cnt = new AtomicInteger()
  }

  class X {
    // Not Serializable
    val cnt = X.cnt.incrementAndGet
    println(s"Create instance of X #$cnt")
  }

  class MyClass extends Serializable {
    lazy val x = new X
  }

  // Not initialized
  val mc1 = new MyClass
  test(mc1)

  // Force lazy evaluation
  val mc2 = new MyClass
  mc2.x
  test(mc2) // Fails with NotSerializableException
}