Spark 2.2 Structured Streaming stream-static left outer join issue

I seem to be missing something about the stream-static join in Spark 2.2.
The manual states that such a join is possible, but I cannot get the syntax right. Odd. No watermark is being used.
val joinedDs = salesDs
.join(customerDs, "customerId", joinType="leftOuter")
The error I get is as follows, but I am pretty sure I have the sides right:
<console>:81: error: overloaded method value join with alternatives:
(right: org.apache.spark.sql.Dataset[_],joinExprs:
org.apache.spark.sql.Column,joinType: String)org.apache.spark.sql.DataFrame <and>
(right: org.apache.spark.sql.Dataset[_],usingColumns: Seq[String],joinType: String)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.sql.Dataset[Customer], String, joinType: String)
.join(customerDs, "customerId", joinType="left_Outer")
^

It turns out that when supplying a joinType I also needed to wrap the column name in a Seq, because there is no join overload taking (Dataset, String, String); only (Dataset, Seq[String], String) and (Dataset, Column, String) accept a join type.
.join(customerDs, Seq("customerId"), "left_Outer")
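For reference, a minimal sketch of the full corrected call (salesDs and customerDs are assumed to be the streaming and static Datasets from the question):
// Passing the join columns as a Seq selects the (Dataset, Seq[String], String) overload.
val joinedDs = salesDs
  .join(customerDs, Seq("customerId"), "left_outer")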

How to understand `e:` and `columnName` in a Spark error message when using a window function?

I have some very simple code like:
val win = Window.partitionBy("app").orderBy("date")
val appSpendChange = appSpend
  .withColumn("prevSpend", lag(col("Spend")).over(win))
  .withColumn("spendChange", when(isnull($"Spend" - "prevSpend"), 0)
    .otherwise($"spend" - "prevSpend"))
display(appSpendChange)
This should work, as I am adapting a PySpark example to Scala: Pyspark Column Transformation: Calculate Percentage Change for Each Group in a Column
However, I get this error:
error: overloaded method value lag with alternatives:
(e: org.apache.spark.sql.Column,offset: Int,defaultValue: Any,ignoreNulls: Boolean)org.apache.spark.sql.Column <and>
(e: org.apache.spark.sql.Column,offset: Int,defaultValue: Any)org.apache.spark.sql.Column <and>
(columnName: String,offset: Int,defaultValue: Any)org.apache.spark.sql.Column <and>
(columnName: String,offset: Int)org.apache.spark.sql.Column <and>
(e: org.apache.spark.sql.Column,offset: Int)org.apache.spark.sql.Column
cannot be applied to (org.apache.spark.sql.Column)
.withColumn("prevPctSpend", lag(col("pctCtvSpend")).over(win))
^
How should I understand it, especially the `e:` annotation? Thanks, I appreciate any feedback.
You should understand this error as follows:
There are 5 overloads of the method lag, each listed with its parameters and return type in the form (<parameters>)<return>:
(e: org.apache.spark.sql.Column,offset: Int,defaultValue: Any,ignoreNulls: Boolean)org.apache.spark.sql.Column
(e: org.apache.spark.sql.Column,offset: Int,defaultValue: Any)org.apache.spark.sql.Column
(columnName: String,offset: Int,defaultValue: Any)org.apache.spark.sql.Column
(columnName: String,offset: Int)org.apache.spark.sql.Column
(e: org.apache.spark.sql.Column,offset: Int)org.apache.spark.sql.Column
None of these overloads can be applied to an argument list of type (org.apache.spark.sql.Column), which is what your code supplies.
In the end it means you called the method with missing or invalid parameters.
As @Dima said, you likely want to add a second parameter (the offset) to your call to lag.
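For example, here is a minimal sketch of one way the corrected window expression could look, assuming an appSpend DataFrame with app, date, and Spend columns as in the question:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, when}

val win = Window.partitionBy("app").orderBy("date")

val appSpendChange = appSpend
  .withColumn("prevSpend", lag(col("Spend"), 1).over(win))  // explicit offset selects the (e: Column, offset: Int) overload
  .withColumn("spendChange",
    when(col("prevSpend").isNull, 0)
      .otherwise(col("Spend") - col("prevSpend")))           // subtract the column, not the string literal "prevSpend"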

Unit testing Scala using Mockito

I'm very new to unit testing in Scala with Mockito. I'm getting an error on the thenReturn statement.
it should "read null when readFromPostgresTarget is called with some
random driver" in {
Given("a null query is sent as query")
val query = ""
val pgObject = mock[PersistenceObject]
val postgresPersistenceObject =
mock[PostgressPersistenceServiceTrait]
val mockDF = mock[DataFrame]
When("it is passed to readFromPostgresTarget")
when(postgresPersistenceObject.readFromPostgresTarget(any[String],mock[Spark
Session], pgObject)).thenReturn(mockDF)
assert(postgresPersistenceObject.readFromPostgresTarget(query,
sparkSession, pgObject) === any[DataFrame])
Then("a null value should be returned")
verify(postgresPersistenceObject, times(1))
}
I'm getting the error:
overloaded method value thenReturn with alternatives:
(x$1: Unit,x$2: Unit*)org.mockito.stubbing.OngoingStubbing[Unit] <and>
(x$1: Unit)org.mockito.stubbing.OngoingStubbing[Unit]
cannot be applied to (org.apache.spark.sql.DataFrame)
.thenReturn(mockDF)
I tried changing thenReturn(mockDF) to thenReturn(any[DataFrame]), but that doesn't fix the issue.
I also tried passing a real SparkSession instead of the mock, and that doesn't work either.
I can't figure out what mistake I'm making.
To avoid these problems (which are related to Scala/Java interop) you should use mockito-scala.
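As a rough sketch of what the stub could look like with mockito-scala's idiomatic syntax. The trait, method, and type names (PostgressPersistenceServiceTrait, readFromPostgresTarget, PersistenceObject) are taken from the question and assumed to exist; the ScalaTest AnyFlatSpec base class is an assumption as well:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.mockito.{ArgumentMatchersSugar, IdiomaticMockito}
import org.scalatest.flatspec.AnyFlatSpec

class PostgresPersistenceSpec extends AnyFlatSpec with IdiomaticMockito with ArgumentMatchersSugar {

  "readFromPostgresTarget" should "return the stubbed DataFrame" in {
    val service  = mock[PostgressPersistenceServiceTrait]
    val pgObject = mock[PersistenceObject]
    val mockDF   = mock[DataFrame]

    // Idiomatic mockito-scala stubbing sidesteps the Java thenReturn overloads entirely.
    service.readFromPostgresTarget(any[String], any[SparkSession], any[PersistenceObject]) returns mockDF

    assert(service.readFromPostgresTarget("", mock[SparkSession], pgObject) eq mockDF)
  }
}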

Spark-Cassandra-connector issue: value write is not a member of Unit in BOresultDf.write [duplicate]

The following code works fine until I add show after agg. Why is show not possible?
val tempTableB = tableB.groupBy("idB")
  .agg(first("numB").as("numB")) // when I add a .show here, it doesn't work
tableA.join(tempTableB, $"idA" === $"idB", "inner")
  .drop("idA", "numA").show
The error says:
error: overloaded method value join with alternatives:
(right: org.apache.spark.sql.Dataset[_],joinExprs: org.apache.spark.sql.Column,joinType: String)org.apache.spark.sql.DataFrame <and>
(right: org.apache.spark.sql.Dataset[_],usingColumns: Seq[String],joinType: String)org.apache.spark.sql.DataFrame
cannot be applied to (Unit, org.apache.spark.sql.Column, String)
tableA.join(tempTableB, $"idA" === $"idB", "inner")
^
Why is this behaving this way?
.show() is a function with what we call in Scala a side effect: it prints to stdout and returns Unit (whose only value is ()), just like println.
Example:
val a = Array(1,2,3).foreach(println)
a: Unit = ()
In Scala, every expression evaluates to a value. In your case, show returns (), and that is what gets stored in tempTableB.
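The same thing happens with a DataFrame chain; a minimal sketch, assuming a SparkSession named spark:
val df = spark.range(3).toDF("id")
val bad  = df.groupBy("id").count().show()  // bad: Unit = () -- show already consumed the result
val good = df.groupBy("id").count()         // good: DataFrame -- can still be joined or shown later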
As @philantrovert has already answered with a detailed explanation, I shall not repeat it.
If you want to see what is in tempTableB, you can show it after it has been assigned, as below.
val tempTableB = tableB.groupBy("idB")
  .agg(first("numB").as("numB"))
tempTableB.show
tableA.join(tempTableB, $"idA" === $"idB", "inner")
  .drop("idA", "numA").show
It should work then.

How to change the key of a KStream and then write to a topic using Scala?

I am using Kafka Streams 1.0. I am reading a topic into a KStream[String, CustomObject], then trying to select a new key that comes from one member of the CustomObject. The code looks like this:
val myStream: KStream[String, CustomObject] = builder.stream("topic")
  .mapValues {
    ...
    // code to transform json to CustomObject
    customObject
  }
myStream.selectKey((k, v) => v.id)
  .to("outputTopic", Produced.`with`(Serdes.String(), customObjectSerde))
It gives this error:
Error:(109, 7) overloaded method value to with alternatives:
(x$1: String,x$2: org.apache.kafka.streams.kstream.Produced[?0(in value x$1),com.myobject.CustomObject])Unit <and>
(x$1: org.apache.kafka.streams.processor.StreamPartitioner[_ >: ?0(in value x$1), _ >: com.myobject.CustomObject],x$2: String)Unit
cannot be applied to (String, org.apache.kafka.streams.kstream.Produced[String,com.myobject.CustomObject])
).to("outputTopic", Produced.`with`(Serdes.String(),
I am not able to understand what is wrong.
Hopefully somebody can help me. Thanks!
The Kafka Streams API uses Java generic types extensively, which makes it hard for the Scala compiler to infer types correctly. Thus, you need to specify types manually in some cases to avoid ambiguous method overloads.
Also compare: https://docs.confluent.io/current/streams/faq.html#scala-compile-error-no-type-parameter-java-defined-trait-is-invariant-in-type-t
A good way to avoid these issues is to not chain multiple operators, but to introduce a new, explicitly typed KStream variable after each operation:
// not this
myStream.selectKey((k,v) => v.id)
  .to("outputTopic", Produced.`with`(Serdes.String(), customObjectSerde))

// but this
val newStream: KStream[KeyType, ValueType] = myStream.selectKey((k,v) => v.id)
newStream.to("outputTopic", Produced.`with`(Serdes.String(), customObjectSerde))
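Applied to the question's own types, the intermediate variable would look something like this (a sketch; CustomObject and customObjectSerde are taken from the question):
// An explicit KStream[String, CustomObject] type lets the compiler resolve the `to` overload.
val rekeyed: KStream[String, CustomObject] = myStream.selectKey((k, v) => v.id)
rekeyed.to("outputTopic", Produced.`with`(Serdes.String(), customObjectSerde))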
Btw: Kafka 2.0 will offer a proper Scala API for Kafka Streams (https://cwiki.apache.org/confluence/display/KAFKA/KIP-270+-+A+Scala+Wrapper+Library+for+Kafka+Streams) that will fix those Scala issues.

How to properly use groupBy on Spark RDD composed of case class instances?

I am trying to do groupBy on an RDD whose elements are instances of a simple case class, and I am getting a weird error that I don't know how to work around. The following code reproduces the problem in the Spark shell (Spark 0.9.0, Scala 2.10.3, Java 1.7.0):
case class EmployeeRec( name : String, position : String, salary : Double ) extends Serializable;
// I suspect extends Serializable is not needed for case classes, but just in case...
val data = sc.parallelize( Vector( EmployeeRec("Ana", "Analist", 200 ),
EmployeeRec("Maria", "Manager", 250.0 ),
EmployeeRec("Paul", "Director", 300.0 ) ) )
val groupFun = ( emp : EmployeeRec ) => emp.position
val dataByPos = data.groupBy( groupFun )
The resulting error from the last statement is:
val dataByPos = data.groupBy( groupFun )
<console>:21: error: type mismatch;
found : EmployeeRec => String
required: EmployeeRec => ?
val dataByPos = data.groupBy( groupFun )
So I tried:
val dataByPos = data.groupBy[String]( groupFun )
The error is a bit more scary now:
val dataByPos = data.groupBy[String]( groupFun )
<console>:18: error: overloaded method value groupBy with alternatives:
(f: EmployeeRec => String,p: org.apache.spark.Partitioner)(implicit evidence$8: scala.reflect.ClassTag[String])org.apache.spark.rdd.RDD[(String, Seq[EmployeeRec])] <and>
(f: EmployeeRec => String,numPartitions: Int)(implicit evidence$7: scala.reflect.ClassTag[String])org.apache.spark.rdd.RDD[(String, Seq[EmployeeRec])] <and>
(f: EmployeeRec => String)(implicit evidence$6: scala.reflect.ClassTag[String])org.apache.spark.rdd.RDD[(String, Seq[EmployeeRec])]
cannot be applied to (EmployeeRec => String)
val dataByPos = data.groupBy[String]( groupFun )
I tried to be more specific about the version of the overloaded method groupBy that I want to apply by adding the extra argument numPartions = 10 (of course my real dataset is much bigger than just 3 records)
val dataByPos = data.groupBy[String]( groupFun, 10 )
I get the exact same error as before.
Any ideas? I suspect the issue might be related to the implicit evidence argument... Unfortunately, this is one of the areas of Scala that I do not understand well.
Note 1: The analog of this code using tuples instead of the case class EmployeeRec works without any problem. However, I was hoping to use case classes instead of tuples for nicer, more maintainable code that doesn't require me to remember or handle fields by position instead of by name (in reality I have many more than 3 fields per employee).
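For reference, a sketch of the tuple analog from Note 1, with the same records as plain (name, position, salary) tuples (the question reports that this form compiles fine on Spark 0.9.0):
val dataT = sc.parallelize( Vector( ("Ana", "Analist", 200.0),
                                    ("Maria", "Manager", 250.0),
                                    ("Paul", "Director", 300.0) ) )
// Grouping by the second tuple element (the position) resolves without trouble.
val dataByPosT = dataT.groupBy( (rec: (String, String, Double)) => rec._2 )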
Note 2: It seems that this issue (observed when using the case class EmployeeRec) might be fixed in Spark 1.x, since all of the versions of the code above compile correctly with the Eclipse Scala plugin when using spark-core_2.10-1.0.0-cdh5.1.0.jar.
However, I am not sure how or whether I will be able to run that version of Spark on the cluster I have access to, and I was hoping to better understand the problem so as to come up with a workaround for Spark 0.9.0.