How to change the key of a KStream and then write to a topic using Scala?

I am using Kafka Streams 1.0. I am reading a topic into a KStream[String, CustomObject], then I am trying to select a new key that comes from one member of the CustomObject. The code looks like this:
val myStream: KStream[String, CustomObject] = builder.stream("topic")
  .mapValues {
    ...
    // code to transform json to CustomObject
    customObject
  }
myStream.selectKey((k, v) => v.id)
  .to("outputTopic", Produced.`with`(Serdes.String(), customObjectSerde))
It gives this error:
Error:(109, 7) overloaded method value to with alternatives:
(x$1: String,x$2: org.apache.kafka.streams.kstream.Produced[?0(in value x$1),com.myobject.CustomObject])Unit <and>
(x$1: org.apache.kafka.streams.processor.StreamPartitioner[_ >: ?0(in value x$1), _ >: com.myobject.CustomObject],x$2: String)Unit
cannot be applied to (String, org.apache.kafka.streams.kstream.Produced[String,com.myobject.CustomObject])
).to("outputTopic", Produced.`with`(Serdes.String(),
I am not able to understand what is wrong.
Hopefully somebody can help me. Thanks!

The Kafka Streams API uses Java generics extensively, which makes it hard for the Scala compiler to infer types correctly. Thus, you need to specify types manually in some cases to avoid ambiguous method overloads.
Also compare: https://docs.confluent.io/current/streams/faq.html#scala-compile-error-no-type-parameter-java-defined-trait-is-invariant-in-type-t
A good way to avoid these issues is to not chain multiple operators, but to introduce a new, explicitly typed KStream variable after each operation:
// not this
myStream.selectKey((k, v) => v.id)
  .to("outputTopic", Produced.`with`(Serdes.String(), customObjectSerde))
// but this
val newStream: KStream[KeyType, ValueType] = myStream.selectKey((k, v) => v.id)
newStream.to("outputTopic", Produced.`with`(Serdes.String(), customObjectSerde))
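Applied to the code from the question, the fix might look like the sketch below. This is only a minimal sketch, not a drop-in solution: parseCustomObject is a hypothetical name standing in for the JSON parsing elided in the question, customObjectSerde is assumed to exist as in the question, and Scala 2.12 SAM conversion is assumed for the lambdas.
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.kstream.{KStream, Produced}

val builder = new StreamsBuilder()

val myStream: KStream[String, CustomObject] = builder.stream[String, String]("topic")
  .mapValues { json =>
    // code to transform json to CustomObject (as in the question)
    parseCustomObject(json)
  }

// the explicitly typed intermediate variable removes the overload ambiguity on to()
val rekeyed: KStream[String, CustomObject] = myStream.selectKey((k, v) => v.id)
rekeyed.to("outputTopic", Produced.`with`(Serdes.String(), customObjectSerde))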
Btw: Kafka 2.0 will offer a proper Scala API for Kafka Streams (https://cwiki.apache.org/confluence/display/KAFKA/KIP-270+-+A+Scala+Wrapper+Library+for+Kafka+Streams) that will fix those Scala issues.

Related

Ambiguous Implicit Values when using HMap

HMap seems to be the perfect data structure for my use case, however, I can't get it working:
case class Node[N](node: N)
class ImplVal[K, V]
implicit val iv1 = new ImplVal[Int, Node[Int]]
implicit val iv2 = new ImplVal[Int, Node[String]]
implicit val iv3 = new ImplVal[String, Node[Int]]
val hm = HMap[ImplVal](1 -> Node(1), 2 -> Node("two"), "three" -> Node(3))
My first question is whether it is possible to create those implicit vals automatically. For sure, for typical combinations I could create them manually, but I'm wondering if there is a more generic, less boilerplate way.
Next question is, how to get values out of the map:
val res1 = hm.get(1) // (1) ambiguous implicit values: both value iv2 [...] and value iv1 [...] match expected type ImplVal[Int,V]`
To me, Node[Int] (iv1) and Node[String] (iv2) look pretty different :) I thought, despite the JVM type erasure limitations, Scala could differentiate here. What am I missing? Do I have to use other implicit values to make the difference clear?
The explicit version works:
val res2 = hm.get[Int, Node[Int]](1) // (2) works
Of course, in this simple case, I could add the type information to the get call. But in the following case, where only the keys are known in advance, I don't know how to do it:
def get[T <: HList](keys: T): HList = // return associated values for keys
Is there any simple solution to this problem?
BTW, what documentation about Scala's type system (or Shapeless, or functional programming in general) would you recommend to understand this whole topic better? I have to admit I'm lacking some background here.
The type of the key determines the type of the value. You have Int keys corresponding to both Node[Int] and Node[String] values, hence the ambiguity. You might find this article helpful in explaining the general mechanism underlying this.
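The ambiguity disappears as soon as each key type maps to exactly one value type. Here is a minimal sketch (assuming shapeless's HMap and reusing the Node case class from the question) in which Int keys always hold Node[Int] and String keys always hold Node[String]:
import shapeless.HMap

case class Node[N](node: N) // as in the question

class ImplVal[K, V]
// one relation per key type, so lookups are unambiguous
implicit val intKeys: ImplVal[Int, Node[Int]] = new ImplVal[Int, Node[Int]]
implicit val stringKeys: ImplVal[String, Node[String]] = new ImplVal[String, Node[String]]

val hm = HMap[ImplVal](1 -> Node(1), "two" -> Node("two"))

val res1: Option[Node[Int]] = hm.get(1)        // Int keys can only hold Node[Int]
val res2: Option[Node[String]] = hm.get("two") // String keys can only hold Node[String]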

KafkaProducer for both keyed and unkeyed ProducerRecords

I'm using the 0.9 Kafka Java client in Scala.
scala> val kafkaProducer = new KafkaProducer[String, String](props)
ProducerRecord has several constructors that allow you to include or not include a key and/or partition.
scala> val keyedRecord = new ProducerRecord("topic", "key", "value")
scala> kafkaProducer.send(keyedRecord)
These should work without a problem.
However, an unkeyed ProducerRecord gives a type error.
scala> val unkeyedRecord = new ProducerRecord("topic", "value")
unkeyedRecord: org.apache.kafka.clients.producer.ProducerRecord[Nothing,String] =
ProducerRecord(topic=topic, partition=null, key=null, value=value)
scala> kafkaProducer.send(unkeyedRecord)
<console>:17: error: type mismatch;
 found   : org.apache.kafka.clients.producer.ProducerRecord[Nothing,String]
 required: org.apache.kafka.clients.producer.ProducerRecord[String,String]
Note: Nothing <: String, but Java-defined class ProducerRecord is invariant in type K.
You may wish to investigate a wildcard type such as `_ <: String`. (SLS 3.2.10)
       kafkaProducer.send(unkeyedRecord)
                          ^
Is this against Kafka's rules or could it be an unnecessary precaution that has come from using this Java API in Scala?
More fundamentally, is it poor form to put keyed and unkeyed messages in the same Kafka topic?
Thank you
Javadoc: http://kafka.apache.org/090/javadoc/org/apache/kafka/clients/producer/package-summary.html
Edit
Could changing the variance of parameter K in KafkaProducer fix this?
It looks like the answer is in the comments, but to spell it out, Scala uses type inference when types are not explicitly provided. Since you wrote:
val unkeyedRecord = new ProducerRecord("topic", "value")
The key is not provided, so it becomes null and there is nothing to constrain the key type parameter K, which Scala infers as Nothing. To fix that, declare the types explicitly:
val unkeyedRecord = new ProducerRecord[String,String]("topic", "value")
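For completeness, a minimal sketch of the whole round trip (the property values here are illustrative, not taken from the question):
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val kafkaProducer = new KafkaProducer[String, String](props)

// keyed record: K and V are inferred as String from the arguments
kafkaProducer.send(new ProducerRecord("topic", "key", "value"))

// unkeyed record: spell out the type parameters so K is String rather than Nothing
kafkaProducer.send(new ProducerRecord[String, String]("topic", "value"))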

Apache Spark join/cogroup on generic type RDD

I have a problem with the join and cogroup methods on RDD. In detail, I have to join two RDDs, and one of them is an RDD of a generic type used with a wildcard.
val indexedMeasures = measures.map(m => (m.id(), m)) // RDD[(String, Measure[_])]
val indexedRegistry = registry.map(r => (r.id, r)) // RDD[(String, Registry)]
indexedRegistry.cogroup(indexedMeasures)
The last statement gives a compile time error, which is the following:
no type parameters for method cogroup: (other: org.apache.spark.rdd.RDD[(String, W)])org.apache.spark.rdd.RDD[(String, (Iterable[Registry], Iterable[W]))]
exist so that it can be applied to arguments (org.apache.spark.rdd.RDD[(String, Measure[?0]) forSome { type ?0 }])
--- because ---
argument expression's type is not compatible with formal parameter type;
 found   : org.apache.spark.rdd.RDD[(String, Measure[?0]) forSome { type ?0 }]
 required: org.apache.spark.rdd.RDD[(String, ?W)]
Note: (String, Measure[?0]) forSome { type ?0 } >: (String, ?W), but class RDD is invariant in type T.
You may wish to define T as -T instead. (SLS 4.5)
What's going on here? Why can't I cogroup RDDs that use a generic wildcarded type?
Thanks for all of your responses.
The problem is stated in the paper Towards Equal Rights for Higher-kinded Types:
Generics are a very popular feature of contemporary OO languages, such as Java, C# or Scala. Their support for genericity is lacking, however. The problem is that they only support abstracting over proper types, and not over generic types. This limitation makes it impossible to, e.g., define a precise interface for Iterable, a core abstraction in Scala's collection API. We implemented "type constructor polymorphism" in Scala 2.5, which solves this problem at the root, thus greatly reducing the duplication of type signatures and code.

are there any tricks to working with overloaded methods in specs2?

I've been getting beat up attempting to match on an overloaded method.
I'm new to Scala and specs2, so that is likely one factor ;)
So I have a mock of this SchedulerDriver class, and I'm trying to verify the content of the arguments that are being passed to this signature of the launchTasks method:
http://mesos.apache.org/api/latest/java/org/apache/mesos/SchedulerDriver.html#launchTasks(java.util.Collection,%20java.util.Collection)
I have tried the answers style like so:
val mockSchedulerDriver = mock[SchedulerDriver]
mockSchedulerDriver.launchTasks(haveInterface[Collection[OfferID]], haveInterface[Collection[TaskInfo]]) answers { i => System.out.println(s"i=$i") }
and get:
ambiguous reference to overloaded definition, both method launchTasks in trait SchedulerDriver of type (x$1: org.apache.mesos.Protos.OfferID, x$2: java.util.Collection[org.apache.mesos.Protos.TaskInfo])org.apache.mesos.Protos.Status and method launchTasks in trait SchedulerDriver of type (x$1: java.util.Collection[org.apache.mesos.Protos.OfferID], x$2: java.util.Collection[org.apache.mesos.Protos.TaskInfo])org.apache.mesos.Protos.Status match argument types (org.specs2.matcher.Matcher[Any],org.specs2.matcher.Matcher[Any])
And I have tried the capture style like so:
val mockSchedulerDriver = mock[SchedulerDriver]
val offerIdCollectionCaptor = capture[Collection[OfferID]]
val taskInfoCollectionCaptor = capture[Collection[TaskInfo]]
there was one(mockSchedulerDriver).launchTasks(offerIdCollectionCaptor, taskInfoCollectionCaptor)
and get:
overloaded method value launchTasks with alternatives: (x$1: org.apache.mesos.Protos.OfferID,x$2: java.util.Collection[org.apache.mesos.Protos.TaskInfo])org.apache.mesos.Protos.Status <and> (x$1: java.util.Collection[org.apache.mesos.Protos.OfferID],x$2: java.util.Collection[org.apache.mesos.Protos.TaskInfo])org.apache.mesos.Protos.Status cannot be applied to (org.specs2.mock.mockito.ArgumentCapture[java.util.Collection[mesosphere.mesos.protos.OfferID]], org.specs2.mock.mockito.ArgumentCapture[java.util.Collection[org.apache.mesos.Protos.TaskInfo]])
Any guidance or suggestions on how to approach this are appreciated...!
Best,
Tony.
You can use the any matcher in that case:
val mockSchedulerDriver = mock[SchedulerDriver]
mockSchedulerDriver.launchTasks(
  any[Collection[OfferID]],
  any[Collection[TaskInfo]]) answers { i => System.out.println(s"i=$i") }
The difference is that any[T] is a Matcher[T] and the overloading resolution works in that case (whereas haveInterface is a Matcher[AnyRef] so it can't direct the overloading resolution).
I don't understand why the first alternative didn't work, but the second one fails because Scala doesn't consider implicit conversions when resolving which overloaded method to call, and the magic that lets you use a capture as though it were the thing you captured depends on an implicit conversion.
So what if you make it explicit?
val mockSchedulerDriver = mock[SchedulerDriver]
val offerIdCollectionCaptor = capture[Collection[OfferID]]
val taskInfoCollectionCaptor = capture[Collection[TaskInfo]]
there was one(mockSchedulerDriver).launchTasks(
offerIdCollectionCaptor.capture, taskInfoCollectionCaptor.capture)

Spurious ambiguous reference error in Scala 2.7.7 compiler/interpreter?

Can anyone explain the compile error below? Interestingly, if I change the return type of the get() method to String, the code compiles just fine. Note that the thenReturn method has two overloads: a unary method and a varargs method that takes at least one argument. It seems to me that if the invocation is ambiguous here, then it would always be ambiguous.
More importantly, is there any way to resolve the ambiguity?
import org.scalatest.mock.MockitoSugar
import org.mockito.Mockito._
trait Thing {
  def get(): java.lang.Object
}
new MockitoSugar {
  val t = mock[Thing]
  when(t.get()).thenReturn("a")
}
error: ambiguous reference to overloaded definition,
both method thenReturn in trait OngoingStubbing of type
(java.lang.Object,java.lang.Object*)org.mockito.stubbing.OngoingStubbing[java.lang.Object]
and method thenReturn in trait OngoingStubbing of type
(java.lang.Object)org.mockito.stubbing.OngoingStubbing[java.lang.Object]
match argument types (java.lang.String)
when(t.get()).thenReturn("a")
Well, it is ambiguous. I suppose Java semantics allow for it, and it might merit a ticket asking for Java semantics to be applied in Scala.
The source of the ambiguity is this: a vararg parameter may receive any number of arguments, including 0. So, when you write thenReturn("a"), do you mean to call the thenReturn which receives a single argument, or the thenReturn that receives one object plus a vararg, passing 0 arguments to the vararg?
Now, when this kind of thing happens, Scala tries to find which method is "more specific". Anyone interested in the details should look that up in Scala's specification, but here is an explanation of what happens in this particular case:
object t {
  def f(x: AnyRef) = 1 // A
  def f(x: AnyRef, xs: AnyRef*) = 2 // B
}
If you call f("foo"), both A and B are applicable. Which one is more specific?
It is possible to call B with parameters of type (AnyRef), so A is as specific as B.
It is possible to call A with parameters of type (AnyRef, Seq[AnyRef]): thanks to tuple conversion, Tuple2[AnyRef, Seq[AnyRef]] conforms to AnyRef. So B is as specific as A.
Since each is as specific as the other, the reference to f is ambiguous.
As to the "tuple conversion" thing, it is one of the most obscure syntactic sugars of Scala. If you make a call f(a, b), where a and b have types A and B, and there is no f accepting (A, B) but there is an f which accepts a Tuple2[A, B], then the parameters (a, b) will be converted into a tuple.
For example:
scala> def f(t: Tuple2[Int, Int]) = t._1 + t._2
f: (t: (Int, Int))Int
scala> f(1,2)
res0: Int = 3
Now, there is no tuple conversion going on when thenReturn("a") is called. That is not the problem. The problem is that, given that tuple conversion is possible, neither version of thenReturn is more specific, because any parameter passed to one could be passed to the other as well.
In the specific case of Mockito, it's possible to use the alternate API methods designed for use with void methods:
doReturn("a").when(t).get()
Clunky, but it'll have to do, as Martin et al don't seem likely to compromise Scala in order to support Java's varargs.
Well, I figured out how to resolve the ambiguity (seems kind of obvious in retrospect):
when(t.get()).thenReturn("a", Array[Object](): _*)
As Andreas noted, if the ambiguous method requires a null reference rather than an empty array, you can use something like
v.overloadedMethod(arg0, null.asInstanceOf[Array[Object]]: _*)
to resolve the ambiguity.
If you look at the standard library APIs you'll see this issue handled like this:
def meth(t1: Thing): OtherThing = { ... }
def meth(t1: Thing, t2: Thing, ts: Thing*): OtherThing = { ... }
By doing this, no call with at least one Thing argument is ambiguous, and no extra fluff like Array[Thing](): _* is needed.
I had a similar problem using Oval (oval.sf.net), trying to call its validate() method.
Oval defines two validate() methods:
public List<ConstraintViolation> validate(final Object validatedObject)
public List<ConstraintViolation> validate(final Object validatedObject, final String... profiles)
Trying this from Scala:
validator.validate(value)
produces the following compiler-error:
both method validate in class Validator of type (x$1: Any,x$2: <repeated...>[java.lang.String])java.util.List[net.sf.oval.ConstraintViolation]
and method validate in class Validator of type (x$1: Any)java.util.List[net.sf.oval.ConstraintViolation]
match argument types (T)
var violations = validator.validate(entity);
Oval needs the varargs parameter to be null, not an empty array, so I finally got it to work with this:
validator.validate(value, null.asInstanceOf[Array[String]]: _*)