Spark-Cassandra-connector issue: value write is not a member of Unit in BOresultDf.write [duplicate] - scala

The following code works fine until I add show after agg. Why is show not possible?
val tempTableB = tableB.groupBy("idB")
  .agg(first("numB").as("numB")) // when I add a .show here, it doesn't work
tableA.join(tempTableB, $"idA" === $"idB", "inner")
  .drop("idA", "numA").show
The error says:
error: overloaded method value join with alternatives:
(right: org.apache.spark.sql.Dataset[_],joinExprs: org.apache.spark.sql.Column,joinType: String)org.apache.spark.sql.DataFrame <and>
(right: org.apache.spark.sql.Dataset[_],usingColumns: Seq[String],joinType: String)org.apache.spark.sql.DataFrame
cannot be applied to (Unit, org.apache.spark.sql.Column, String)
tableA.join(tempTableB, $"idA" === $"idB", "inner")
^
Why is this behaving this way?

.show() is a function with what we call in Scala a side effect: it prints to stdout and returns (), the only value of type Unit, just like println.
Example:
val a = Array(1,2,3).foreach(println)
a: Unit = ()
In Scala, every expression evaluates to a value. In your case, the () returned by show is what gets stored in tempTableB, so the subsequent join receives a Unit instead of a DataFrame.
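You can see the same thing on the DataFrame itself; here is a small sketch reusing the question's tableB (not code from the original post):
// show() prints the rows as a side effect and returns (), the only value of Unit,
// so there is no DataFrame left to assign or join on.
val shown: Unit = tableB.groupBy("idB")
  .agg(first("numB").as("numB"))
  .show()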

As philantrovert has already given a detailed explanation, I won't repeat it.
If you want to see what is in tempTableB, you can call show after it has been assigned, as below.
val tempTableB = tableB.groupBy("idB")
  .agg(first("numB").as("numB"))

tempTableB.show

tableA.join(tempTableB, $"idA" === $"idB", "inner")
  .drop("idA", "numA").show
It should work then.

Related

Unit testing Scala using mockito

I'm very new to unit testing in Scala using Mockito. I'm getting an error in the thenReturn statement.
it should "read null when readFromPostgresTarget is called with some
random driver" in {
Given("a null query is sent as query")
val query = ""
val pgObject = mock[PersistenceObject]
val postgresPersistenceObject =
mock[PostgressPersistenceServiceTrait]
val mockDF = mock[DataFrame]
When("it is passed to readFromPostgresTarget")
when(postgresPersistenceObject.readFromPostgresTarget(any[String],mock[Spark
Session], pgObject)).thenReturn(mockDF)
assert(postgresPersistenceObject.readFromPostgresTarget(query,
sparkSession, pgObject) === any[DataFrame])
Then("a null value should be returned")
verify(postgresPersistenceObject, times(1))
}
I'm getting the error-
overloaded method value thenReturn with alternatives:
(x$1: Unit,x$2: Unit*)org.mockito.stubbing.OngoingStubbing[Unit] <and>
(x$1: Unit)org.mockito.stubbing.OngoingStubbing[Unit]
cannot be applied to (org.apache.spark.sql.DataFrame)
.thenReturn(mockDF)
I tried changing thenReturn(mockDF) to thenReturn(any[DataFrame]), but that doesn't fix the issue.
I tried passing a real SparkSession instead of the mock; that doesn't work either.
I can't figure out what mistake I'm making.
To avoid those problems (related to Scala/Java interop) you should use mockito-scala.
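For example, with mockito-scala's MockitoSugar the stubbing could look roughly like this. This is only a sketch: PostgressPersistenceServiceTrait and PersistenceObject are the question's own names, and it assumes readFromPostgresTarget takes (String, SparkSession, PersistenceObject) and returns a DataFrame.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.mockito.{ArgumentMatchersSugar, MockitoSugar}
import org.scalatest.flatspec.AnyFlatSpec

class PostgresPersistenceSpec extends AnyFlatSpec with MockitoSugar with ArgumentMatchersSugar {

  "readFromPostgresTarget" should "return the stubbed DataFrame" in {
    val service      = mock[PostgressPersistenceServiceTrait]
    val pgObject     = mock[PersistenceObject]
    val sparkSession = mock[SparkSession]
    val mockDF       = mock[DataFrame]

    // Stub with a matcher for every argument; don't create a fresh mock inline inside when(...)
    when(service.readFromPostgresTarget(any[String], any[SparkSession], any[PersistenceObject]))
      .thenReturn(mockDF)

    // Compare against the stubbed value, not any[DataFrame] (matchers only belong inside when/verify)
    assert(service.readFromPostgresTarget("", sparkSession, pgObject) eq mockDF)
    verify(service).readFromPostgresTarget(any[String], any[SparkSession], any[PersistenceObject])
  }
}
The key points are that every argument is a matcher (no mock[SparkSession] created inline inside when), and that the assertion and verification are written against the stubbed call rather than against any[DataFrame].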

Spark 2.2 Structured Streaming Stream - Static left outer join issue

I seem to be missing something on the Stream - Static Join in Spark 2.2.
The manual states that such a join is possible, but I cannot get the syntax correct. Odd. No watermark is being used.
val joinedDs = salesDs
.join(customerDs, "customerId", joinType="leftOuter")
The error I get is as follows, but I am pretty sure I have the sides right:
<console>:81: error: overloaded method value join with alternatives:
(right: org.apache.spark.sql.Dataset[_],joinExprs:
org.apache.spark.sql.Column,joinType: String)org.apache.spark.sql.DataFrame <and>
(right: org.apache.spark.sql.Dataset[_],usingColumns: Seq[String],joinType: String)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.sql.Dataset[Customer], String, joinType: String)
.join(customerDs, "customerId", joinType="left_Outer")
^
When adding a joinType I also needed to wrap the column name in a Seq, because there is no join(Dataset, String, String) overload; the three-argument variants take either a Seq[String] or a Column:
.join(customerDs, Seq("customerId"), "left_outer")
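Alternatively, a sketch using an explicit join expression (reusing the question's Datasets) matches the other three-argument overload:
// Same left outer join expressed with a Column condition; note that unlike the
// Seq variant this keeps both customerId columns in the result.
val joinedDs = salesDs
  .join(customerDs, salesDs("customerId") === customerDs("customerId"), "left_outer")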

are there any tricks to working with overloaded methods in specs2?

I've been getting beat up attempting to match on an overloaded method. I'm new to Scala and specs2, so that is likely one factor ;)
I have a mock of this SchedulerDriver class, and I'm trying to verify the content of the arguments being passed to this launchTasks method:
http://mesos.apache.org/api/latest/java/org/apache/mesos/SchedulerDriver.html#launchTasks(java.util.Collection,%20java.util.Collection)
I have tried the answers style like so:
val mockSchedulerDriver = mock[SchedulerDriver]
mockSchedulerDriver.launchTasks(haveInterface[Collection[OfferID]], haveInterface[Collection[TaskInfo]]) answers { i => System.out.println(s"i=$i") }
and get
ambiguous reference to overloaded definition, both method launchTasks in trait SchedulerDriver of type (x$1: org.apache.mesos.Protos.OfferID, x$2: java.util.Collection[org.apache.mesos.Protos.TaskInfo])org.apache.mesos.Protos.Status and method launchTasks in trait SchedulerDriver of type (x$1: java.util.Collection[org.apache.mesos.Protos.OfferID], x$2: java.util.Collection[org.apache.mesos.Protos.TaskInfo])org.apache.mesos.Protos.Status match argument types (org.specs2.matcher.Matcher[Any],org.specs2.matcher.Matcher[Any])
And I have tried the capture style like so:
val mockSchedulerDriver = mock[SchedulerDriver]
val offerIdCollectionCaptor = capture[Collection[OfferID]]
val taskInfoCollectionCaptor = capture[Collection[TaskInfo]]
there was one(mockSchedulerDriver).launchTasks(offerIdCollectionCaptor, taskInfoCollectionCaptor)
and get:
overloaded method value launchTasks with alternatives: (x$1: org.apache.mesos.Protos.OfferID,x$2: java.util.Collection[org.apache.mesos.Protos.TaskInfo])org.apache.mesos.Protos.Status <and> (x$1: java.util.Collection[org.apache.mesos.Protos.OfferID],x$2: java.util.Collection[org.apache.mesos.Protos.TaskInfo])org.apache.mesos.Protos.Status cannot be applied to (org.specs2.mock.mockito.ArgumentCapture[java.util.Collection[mesosphere.mesos.protos.OfferID]], org.specs2.mock.mockito.ArgumentCapture[java.util.Collection[org.apache.mesos.Protos.TaskInfo]])
Any guidance or suggestions on how to approach this would be appreciated!
Best,
Tony
You can use the any matcher in that case:
val mockSchedulerDriver = mock[SchedulerDriver]
mockSchedulerDriver.launchTasks(
  any[Collection[OfferID]],
  any[Collection[TaskInfo]]) answers { i => System.out.println(s"i=$i") }
The difference is that any[T] is a Matcher[T] and the overloading resolution works in that case (whereas haveInterface is a Matcher[AnyRef] so it can't direct the overloading resolution).
I don't understand why the first alternative didn't work, but the second alternative isn't working because Scala doesn't consider implicit conversions when resolving which overloaded method to call, and the magic that lets you use a capture as though it were the thing you captured depends on an implicit conversion.
So what if you make it explicit?
val mockSchedulerDriver = mock[SchedulerDriver]
val offerIdCollectionCaptor = capture[Collection[OfferID]]
val taskInfoCollectionCaptor = capture[Collection[TaskInfo]]
there was one(mockSchedulerDriver).launchTasks(
offerIdCollectionCaptor.capture, taskInfoCollectionCaptor.capture)

How to properly use groupBy on Spark RDD composed of case class instances?

I am trying to do a groupBy on an RDD whose elements are instances of a simple case class, and I am getting a weird error that I don't know how to work around. The following code reproduces the problem in the Spark shell (Spark 0.9.0, Scala 2.10.3, Java 1.7.0):
case class EmployeeRec(name: String, position: String, salary: Double) extends Serializable
// I suspect extends Serializable is not needed for case classes, but just in case...

val data = sc.parallelize(Vector(
  EmployeeRec("Ana", "Analist", 200.0),
  EmployeeRec("Maria", "Manager", 250.0),
  EmployeeRec("Paul", "Director", 300.0)))

val groupFun = (emp: EmployeeRec) => emp.position
val dataByPos = data.groupBy(groupFun)
The resulting error from the last statement is:
val dataByPos = data.groupBy( groupFun )
<console>:21: error: type mismatch;
found : EmployeeRec => String
required: EmployeeRec => ?
val dataByPos = data.groupBy( groupFun )
So I tried:
val dataByPos = data.groupBy[String]( groupFun )
The error is a bit more scary now:
val dataByPos = data.groupBy[String]( groupFun )
<console>:18: error: overloaded method value groupBy with alternatives:
(f: EmployeeRec => String,p: org.apache.spark.Partitioner)(implicit evidence$8: scala.reflect.ClassTag[String])org.apache.spark.rdd.RDD[(String, Seq[EmployeeRec])] <and>
(f: EmployeeRec => String,numPartitions: Int)(implicit evidence$7: scala.reflect.ClassTag[String])org.apache.spark.rdd.RDD[(String, Seq[EmployeeRec])] <and>
(f: EmployeeRec => String)(implicit evidence$6: scala.reflect.ClassTag[String])org.apache.spark.rdd.RDD[(String, Seq[EmployeeRec])]
cannot be applied to (EmployeeRec => String)
val dataByPos = data.groupBy[String]( groupFun )
I tried to be more specific about which overloaded version of groupBy I want to apply by adding the extra argument numPartitions = 10 (of course my real dataset is much bigger than just 3 records):
val dataByPos = data.groupBy[String]( groupFun, 10 )
I get the exact same error as before.
Any ideas? I suspect the issue might be related to the implicit evidence argument... Unfortunately this is one of the areas of Scala that I do not understand well.
Note 1: The analogue of this code using tuples instead of the case class EmployeeRec works without any problem. However, I was hoping to use case classes instead of tuples for nicer, more maintainable code that doesn't require me to remember or handle fields by position instead of by name (in reality I have many more than 3 fields per employee).
Note 2: It seems that this issue (observed when using the case class EmployeeRec) might be fixed in Spark 1.x, since every version of the code above compiles correctly with the Eclipse Scala plugin when using spark-core_2.10-1.0.0-cdh5.1.0.jar.
However, I am not sure how or whether I will be able to run that version of Spark on the cluster I have access to, and I was hoping to better understand the problem so as to come up with a workaround for Spark 0.9.0.
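For reference, the tuple-based analogue mentioned in Note 1 (which the asker reports works fine on Spark 0.9.0) looks roughly like this:
// Sketch of the workaround from Note 1: plain tuples instead of the case class,
// grouping by the position field, which is now addressed by position (_._2).
val dataTuples = sc.parallelize(Vector(
  ("Ana", "Analist", 200.0),
  ("Maria", "Manager", 250.0),
  ("Paul", "Director", 300.0)))

val dataByPos = dataTuples.groupBy(_._2)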

Error while writing the content of a file to another in scala

Hello, can anyone help me solve this error? I have tried many ways but in vain.
import scala.io.Source
import java.io._
object test1 {
  def main(args: Array[String]) {
    val a = Source.fromFile("pg1661.txt").mkString
    val count = a.split("\\s+").groupBy(x => x).mapValues(x => x.length)
    val writer = new PrintWriter(new File("output.txt"))
    writer.write(count)
    writer.close()
  }
}
but it is showing an error at write(count), and the error is:
Multiple markers at this line
- overloaded method value write with alternatives:
(java.lang.String)Unit <and> (Array[Char])Unit <and> (Int)Unit cannot be
applied to (scala.collection.immutable.Map[java.lang.String,Int])
Kindly help me. Thanks in advance.
writer.print(count.mkString)
this would work in the sense that items get written to the file
for (word <- count.keys.toList.sorted)
  writer.println(s"$word ${count(word)}")
might look a bit nicer.
If you want to print in a more key-value style, then try:
count.foreach { case (key, value) => writer.println(s"$key: $value") }
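Putting it together, a minimal corrected version of the program might look like this (same file names as in the question; one "word: count" pair per line):
import scala.io.Source
import java.io.{File, PrintWriter}

object test1 {
  def main(args: Array[String]): Unit = {
    val text   = Source.fromFile("pg1661.txt").mkString
    val counts = text.split("\\s+").groupBy(identity).mapValues(_.length)

    // Write each word and its count on its own line, closing the writer even on failure.
    val writer = new PrintWriter(new File("output.txt"))
    try counts.foreach { case (word, n) => writer.println(s"$word: $n") }
    finally writer.close()
  }
}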