How to understand `e:` and `columnName` in a Spark error message when using a window function? - scala

I have very simple code like this:
val win = Window.partitionBy("app").orderBy("date")
val appSpendChange = appSpend
  .withColumn("prevSpend", lag(col("Spend")).over(win))
  .withColumn("spendChange", when(isnull($"Spend" - "prevSpend"), 0)
    .otherwise($"spend" - "prevSpend"))
display(appSpendChange)
This should work, as I am referring to a PySpark example and converting it to Scala: Pyspark Column Transformation: Calculate Percentage Change for Each Group in a Column
However, I get this error:
error: overloaded method value lag with alternatives:
(e: org.apache.spark.sql.Column,offset: Int,defaultValue: Any,ignoreNulls: Boolean)org.apache.spark.sql.Column <and>
(e: org.apache.spark.sql.Column,offset: Int,defaultValue: Any)org.apache.spark.sql.Column <and>
(columnName: String,offset: Int,defaultValue: Any)org.apache.spark.sql.Column <and>
(columnName: String,offset: Int)org.apache.spark.sql.Column <and>
(e: org.apache.spark.sql.Column,offset: Int)org.apache.spark.sql.Column
cannot be applied to (org.apache.spark.sql.Column)
.withColumn("prevPctSpend", lag(col("pctCtvSpend")).over(win))
^
How should I understand it, especially the `e:` annotation? Thanks, and I appreciate any feedback.

You should understand this error as follows:
There are 5 overloaded methods named lag, defined with the following parameters and return types ((<parameters>)<return type>):
(e: org.apache.spark.sql.Column,offset: Int,defaultValue: Any,ignoreNulls: Boolean)org.apache.spark.sql.Column
(e: org.apache.spark.sql.Column,offset: Int,defaultValue: Any)org.apache.spark.sql.Column
(columnName: String,offset: Int,defaultValue: Any)org.apache.spark.sql.Column
(columnName: String,offset: Int)org.apache.spark.sql.Column
(e: org.apache.spark.sql.Column,offset: Int)org.apache.spark.sql.Column
None of these alternatives can be applied to an argument list of type (org.apache.spark.sql.Column), which is what your code passed.
In short, it means you called the method with missing or invalid parameters.
As @Dima said, you likely want to add a second argument (the offset) to your call to lag.
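For example, the call could be fixed like this (a sketch of the question's code; offset 1 means "the previous row" and matches the (e: Column, offset: Int) alternative; you likely also want the subtraction to use the column $"prevSpend" rather than the string literal "prevSpend"):
val win = Window.partitionBy("app").orderBy("date")
val appSpendChange = appSpend
  .withColumn("prevSpend", lag(col("Spend"), 1).over(win))
  .withColumn("spendChange", when(isnull($"Spend" - $"prevSpend"), 0)
    .otherwise($"Spend" - $"prevSpend"))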

Related

Spark-Cassandra-connector issue: value write is not a member of Unit in BOresultDf.write [duplicate]

The following code works fine until I add show after agg. Why is show not possible?
val tempTableB = tableB.groupBy("idB")
  .agg(first("numB").as("numB")) // when I add a .show here, it doesn't work
tableA.join(tempTableB, $"idA" === $"idB", "inner")
  .drop("idA", "numA").show
The error says:
error: overloaded method value join with alternatives:
(right: org.apache.spark.sql.Dataset[_],joinExprs: org.apache.spark.sql.Column,joinType: String)org.apache.spark.sql.DataFrame <and>
(right: org.apache.spark.sql.Dataset[_],usingColumns: Seq[String],joinType: String)org.apache.spark.sql.DataFrame
cannot be applied to (Unit, org.apache.spark.sql.Column, String)
tableA.join(tempTableB, $"idA" === $"idB", "inner")
^
Why is this behaving this way?
.show() is a function with what we call in Scala a side effect: it prints to stdout and returns Unit (the value ()), just like println.
Example:
val a = Array(1,2,3).foreach(println)
a: Unit = ()
In Scala, you can assume that every expression returns something. In your case, Unit (i.e. ()) is returned by show, and that is what gets stored in tempTableB.
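Applied to the code in the question, the resulting type of tempTableB makes the problem visible (a sketch using the question's names):
val tempTableB = tableB.groupBy("idB")
  .agg(first("numB").as("numB"))
  .show // show returns Unit, so tempTableB has type Unit
// tableA.join(tempTableB, ...) then fails: join expects a Dataset, not Unit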
As @philantrovert has already answered with a detailed explanation, I shall not repeat it.
If you want to see what is in tempTableB, you can call show after it has been assigned, as below.
val tempTableB = tableB.groupBy("idB")
  .agg(first("numB").as("numB"))
tempTableB.show
tableA.join(tempTableB, $"idA" === $"idB", "inner")
  .drop("idA", "numA").show
It should work then.

How to change the key of a KStream and then write to a topic using Scala?

I am using Kafka Streams 1.0. I am reading a topic into a KStream[String, CustomObject], then trying to select a new key that comes from one member of the CustomObject. The code looks like this:
val myStream: KStream[String, CustomObject] = builder.stream("topic")
  .mapValues {
    ...
    // code to transform json to CustomObject
    customObject
  }

myStream.selectKey((k, v) => v.id)
  .to("outputTopic", Produced.`with`(Serdes.String(), customObjectSerde))
It gives this error:
Error:(109, 7) overloaded method value to with alternatives:
(x$1: String,x$2: org.apache.kafka.streams.kstream.Produced[?0(in value x$1),com.myobject.CustomObject])Unit <and>
(x$1: org.apache.kafka.streams.processor.StreamPartitioner[_ >: ?0(in value x$1), _ >: com.myobject.CustomObject],x$2: String)Unit
cannot be applied to (String, org.apache.kafka.streams.kstream.Produced[String,com.myobject.CustomObject])
).to("outputTopic", Produced.`with`(Serdes.String(),
I am not able to understand what is wrong.
Hopefully somebody can help me. Thanks!
The Kafka Streams API uses Java generics extensively, which makes it hard for the Scala compiler to infer types correctly. Thus, you need to specify types manually in some cases to avoid ambiguous method overloads.
Also compare: https://docs.confluent.io/current/streams/faq.html#scala-compile-error-no-type-parameter-java-defined-trait-is-invariant-in-type-t
A good way to avoid these issues is to not chain multiple operators, but to introduce a new typed KStream variable after each operation:
// not this
myStream.selectKey((k, v) => v.id)
  .to("outputTopic", Produced.`with`(Serdes.String(), customObjectSerde))

// but this
val newStream: KStream[KeyType, ValueType] = myStream.selectKey((k, v) => v.id)
newStream.to("outputTopic", Produced.`with`(Serdes.String(), customObjectSerde))
Btw: Kafka 2.0 will offer a proper Scala API for Kafka Streams (https://cwiki.apache.org/confluence/display/KAFKA/KIP-270+-+A+Scala+Wrapper+Library+for+Kafka+Streams) that will fix those Scala issues.
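For reference, with that Scala API (the kafka-streams-scala module) the Produced instance is derived from implicit serdes, which sidesteps the ambiguous overload entirely. A rough sketch, assuming the topology is built with org.apache.kafka.streams.scala.StreamsBuilder and that a serde for CustomObject exists:
import org.apache.kafka.common.serialization.Serde
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._

// hypothetical: your existing serde for CustomObject, made implicit
implicit val customObjectSerde: Serde[CustomObject] = ???

val keyed = myStream.selectKey((k, v) => v.id)
keyed.to("outputTopic") // Produced[String, CustomObject] is resolved from the implicit serdes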

Scala - Passing function as parameter when function is overloaded

I'm writing a function that looks like this:
def func(str: String, logFunction: String => Unit) = {
  logFunction(s"message is: $str")
}
When I try to pass Logger.info from Play framework, I get this error:
type mismatch;
[error] found : (message: => String, error: => Throwable)Unit <and> (message: => String)Unit
[error] required: String => Unit
It seems like it found the function with two parameters, and tried to pass that to my function. How do I specify the one-parameter Logger.info to be passed to my function?
As you mentioned, there are two overloaded Logger.info methods in Play. To turn the method into a function and choose the overload you want, you can explicitly specify the type and add an underscore after the method name. The underscore turns a method into a function (eta-expansion), which is sometimes done automatically but in this case can be done explicitly. See also how to get a function from an overloaded method.
In this specific case try
val logger: String => Unit = Logger.info _
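With that value in place, it can be passed to your func; alternatively, the overload can be selected inline with a typed placeholder (a sketch using the func from the question):
func("something happened", logger)

// or, without the intermediate val:
func("something happened", Logger.info(_: String))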

are there any tricks to working with overloaded methods in specs2?

I've been getting beat up attempting to match on an overloaded method.
I'm new to Scala and specs2, so that is likely one factor ;)
So I have a mock of this SchedulerDriver class, and I'm trying to verify the content of the arguments that are being passed to this signature of the launchTasks method:
http://mesos.apache.org/api/latest/java/org/apache/mesos/SchedulerDriver.html#launchTasks(java.util.Collection,%20java.util.Collection)
I have tried the answers style like so:
val mockSchedulerDriver = mock[SchedulerDriver]
mockSchedulerDriver.launchTasks(haveInterface[Collection[OfferID]], haveInterface[Collection[TaskInfo]]) answers { i => System.out.println(s"i=$i") }
and get
ambiguous reference to overloaded definition, both method launchTasks in trait SchedulerDriver of type (x$1: org.apache.mesos.Protos.OfferID, x$2: java.util.Collection[org.apache.mesos.Protos.TaskInfo])org.apache.mesos.Protos.Status and method launchTasks in trait SchedulerDriver of type (x$1: java.util.Collection[org.apache.mesos.Protos.OfferID], x$2: java.util.Collection[org.apache.mesos.Protos.TaskInfo])org.apache.mesos.Protos.Status match argument types (org.specs2.matcher.Matcher[Any],org.specs2.matcher.Matcher[Any])
And I have tried the capture style like so:
val mockSchedulerDriver = mock[SchedulerDriver]
val offerIdCollectionCaptor = capture[Collection[OfferID]]
val taskInfoCollectionCaptor = capture[Collection[TaskInfo]]
there was one(mockSchedulerDriver).launchTasks(offerIdCollectionCaptor, taskInfoCollectionCaptor)
and get:
overloaded method value launchTasks with alternatives: (x$1: org.apache.mesos.Protos.OfferID,x$2: java.util.Collection[org.apache.mesos.Protos.TaskInfo])org.apache.mesos.Protos.Status <and> (x$1: java.util.Collection[org.apache.mesos.Protos.OfferID],x$2: java.util.Collection[org.apache.mesos.Protos.TaskInfo])org.apache.mesos.Protos.Status cannot be applied to (org.specs2.mock.mockito.ArgumentCapture[java.util.Collection[mesosphere.mesos.protos.OfferID]], org.specs2.mock.mockito.ArgumentCapture[java.util.Collection[org.apache.mesos.Protos.TaskInfo]])
Any guidance or suggestions on how to approach this appreciated...!
Best,
Tony
You can use the any matcher in that case:
val mockSchedulerDriver = mock[SchedulerDriver]
mockSchedulerDriver.launchTasks(
  any[Collection[OfferID]],
  any[Collection[TaskInfo]]) answers { i => System.out.println(s"i=$i") }
The difference is that any[T] is a Matcher[T] and the overloading resolution works in that case (whereas haveInterface is a Matcher[AnyRef] so it can't direct the overloading resolution).
I don't understand why the first alternative didn't work. The second alternative isn't working because Scala doesn't consider implicit conversions when resolving which overloaded method to call, and the magic that lets you use a capture as though it were the thing you captured depends on an implicit conversion.
So what if you make it explicit?
val mockSchedulerDriver = mock[SchedulerDriver]
val offerIdCollectionCaptor = capture[Collection[OfferID]]
val taskInfoCollectionCaptor = capture[Collection[TaskInfo]]
there was one(mockSchedulerDriver).launchTasks(
offerIdCollectionCaptor.capture, taskInfoCollectionCaptor.capture)
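If the explicit .capture calls resolve the ambiguity, the captured arguments can be inspected after the verification, e.g. (a sketch; specs2's ArgumentCapture exposes the last captured value via value):
offerIdCollectionCaptor.value.size must be_==(1)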

Error while writing the content of a file to another in scala

Hello, can anyone help me solve this error? I have tried many ways but in vain.
import scala.io.Source
import java.io._

object test1 {
  def main(args: Array[String]) {
    val a = Source.fromFile("pg1661.txt").mkString
    val count = a.split("\\s+").groupBy(x => x).mapValues(x => x.length)
    val writer = new PrintWriter(new File("output.txt"))
    writer.write(count)
    writer.close()
  }
}
but it is showing an error at write(count), and the error is:
Multiple markers at this line
- overloaded method value write with alternatives:
(java.lang.String)Unit <and> (Array[Char])Unit <and> (Int)Unit cannot be
applied to (scala.collection.immutable.Map[java.lang.String,Int])
Kindly help me. Thanks in advance.
writer.print(count.mkString)
This would work in the sense that the items get written to the file.
for (i <- count.keys.toList.sorted)
  writer.println(s"$i ${count(i)}")
This might look a bit nicer.
If you want to print in a more key-value style, then try:
count.foreach{ case (key, value) => writer.println(s"$key: $value") }
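Putting it together, a corrected version of the original program might look like this (a sketch that keeps the file names from the question):
import scala.io.Source
import java.io._

object test1 {
  def main(args: Array[String]): Unit = {
    val text = Source.fromFile("pg1661.txt").mkString
    val count = text.split("\\s+").groupBy(identity).mapValues(_.length)
    val writer = new PrintWriter(new File("output.txt"))
    // write one "word: count" line per entry instead of passing the whole Map to write()
    count.foreach { case (word, n) => writer.println(s"$word: $n") }
    writer.close()
  }
}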