Difference between calling sql() and using Spark API calls - Scala

I am new to Spark/Scala/Hive. I am just wondering if there are any differences between calling
spark = new SparkSession(...).getHiveContext()
spark.sql("SELECR * FROM table")
and
spark = new SparkSession(...).getHiveContext() // not using
spark.read.table(table).select(from("*"))
??
In particular, is there any performance difference?

These two snippets have the same run-time performance.
The second API is safer: if you make a typo or try to use an unsupported operation, it will give you a quick and clear compilation error. It's funny that you wrote SELECR and not SELECT; that's a good illustration of this point :)
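To make the difference concrete, here is a minimal sketch (the table name my_table and the session setup are made up for illustration) of how the same kind of typo surfaces in each API:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sql-vs-dataframe-api")
  .enableHiveSupport()
  .getOrCreate()

// String-based SQL: the typo is only caught when the query string is parsed at run time
// spark.sql("SELECR * FROM my_table")        // fails at run time with a parse error

// DataFrame API: a typo in a method name does not even compile
// spark.read.table("my_table").selecr("*")   // compile error: value selecr is not a member
spark.read.table("my_table").select("*")      // compiles; column names are still only checked at run time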

Related

Getting Py4JJavaError in PySpark when using rdd

I am getting the below error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
on this line:
result = df.select('student_age').rdd.flatMap(lambda x: x).collect()
'student_age' is a column name. It was running fine until last week, but now I am getting this error.
Does anyone have any insights on that?
Using collect is dangerous for this very reason: it's prone to out-of-memory errors. I suggest removing it. You also do not need to use an RDD for this; you can do it with a DataFrame:
from pyspark.sql.functions import explode  # import needed for explode

result = df.select(explode(df['student_age']))  # returns a DataFrame
# write code that uses the DataFrame instead of an array
If nothing else changed, the data likely did, and finally outgrew what fits in memory.
It's also possible that you have new 'bad' data that is throwing an error.
Either way, you could likely prove the former by finding the OOM in the logs, or prove the data is bad by printing it.
def f(row):
    print(row.student_age)

result.foreach(f)  # used for simple stuff that doesn't require heavy initialization
If that works, you may want to break your code down to use foreachPartition. This lets you do work on each value in the memory of each executor. The only trick is that, inside f below, the code executes on the executors, so you cannot reference anything that uses the SparkContext (plain Python only, not PySpark).
def f(rows):
    # initialize a database connection here
    for row in rows:
        print(row.student_age)  # do stuff with student_age
    # close the database connection here

result.foreachPartition(f)  # used for things that need heavy initialization
Spark foreachPartition vs foreach | what to use?
This issue is solved; here is the answer:
result = [i[0] for i in df.select('student_age').toLocalIterator()]

Using the Dataset API with Spark Scala

Hi, I am very new to Spark/Scala and trying to implement some functionality. My requirement is very simple: I have to perform all the operations using the Dataset API.
Question 1:
I converted the CSV into a case class. Is this the correct way of converting a DataFrame to a Dataset? Am I doing it correctly?
Also, when I do transformations on orderItemFile1, I can access fields with _.order_id_order_item for filter/map operations, but the same is not working with groupBy.
case class orderItemDetails(order_id_order_item: Int, item_desc: String, qty: Int, sale_value: Int)

val orderItemFile1 = ss.read.format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load("src/main/resources/Order_ItemData.csv").as[orderItemDetails]

orderItemFile1.filter(_.order_id_order_item > 100) // works fine
orderItemFile1.map(_.order_id_order_item.toInt)    // works fine

// Error: inside groupBy I am unable to access it as _.order_id_order_item. Why so?
orderItemFile1.groupBy(_.order_id_order_item)

// Below works, but how does this provide the compile-time safety promised by the
// Dataset API? I can pass a wrong column name here too, and it will only be
// caught at run time.
orderItemFile1.groupBy(orderItemFile1("order_id_order_item")).agg(sum(orderItemFile1("item_desc")))
Perhaps the functionality you're looking for is groupByKey; see the sketch below.
As for your first question, basically yes, you're reading a CSV into a Dataset[A] where A is a case class you've declared.
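A minimal sketch of what that could look like with the orderItemDetails case class and the orderItemFile1 Dataset from the question (the aggregations here are just illustrations):
import ss.implicits._ // brings the required encoders into scope

// groupByKey takes a typed function, so a typo in the field name is a compile error
val grouped = orderItemFile1.groupByKey(_.order_id_order_item)

// typed operations on the resulting KeyValueGroupedDataset
val itemsPerOrder = grouped.count()                                             // Dataset[(Int, Long)]
val qtyPerOrder = grouped.mapGroups((id, items) => (id, items.map(_.qty).sum))  // Dataset[(Int, Int)]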

Not able to mock the getTimestamp method on mocked Row

I am writing an application that interacts with Cassandra using Scala. For unit testing I am using Mockito, and I am mocking the ResultSet and Row:
val mockedResultSet = mock[ResultSet]
val mockedRow = mock[Row]
Now while mocking the methods of the mockedRow, such as
doReturn("mocked").when(mockedRow).getString("ColumnName")
works fine. However, I am not able to mock the getTimestamp method of the mockedRow. I have tried 2 approaches but was not successful.
First approach
val testDate = "2018-08-23 15:51:12+0530"
val formatter = new SimpleDateFormat("yyyy-mm-dd HH:mm:ssZ")
val date: Date = formatter.parse(testDate)
doReturn(date).when(mockedRow).getTimestamp("ColumnName")
and second approach
when(mockedRow.getTimestamp("column")).thenReturn(Timestamp.valueOf("2018-08-23 15:51:12+0530"))
Both of them return null, i.e. the mock does not return the stubbed value of the getTimestamp method. I am using the cassandra-driver-core 3.0 dependency in my project.
Any help would be highly appreciated. Thanks in advance!
Mocking objects you don't own is usually considered a bad practice. That said, what you can do to see what's going on is verify the interactions with the mock, i.e.
verify(mockedRow).getTimestamp("column")
Given you are getting null out of the mock, that statement should fail, but the failure will show all the actual calls received by the mock (and their parameters), which should help you to debug.
A way to minimize this kind of problem is to use a Mockito session. In standard Mockito, sessions can only be used through a JUnit runner, but with mockito-scala you can use them manually like this:
MockitoScalaSession().run {
  val mockedRow = mock[Row]
  when(mockedRow.getTimestamp("column")).thenReturn(Timestamp.valueOf("2018-08-23 15:51:12+0530"))
  // execute your test
}
That code will check that the mock is not called with anything that hasn't been stubbed; it will also tell you if you provided stubs that weren't actually used, and a few more things.
If you like that behaviour (and you are using ScalaTest), you can apply it automatically to every test by using MockitoFixture.
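As a rough sketch of that setup (the trait and package names come from the mockito-scala-scalatest module as far as I remember them, so double-check them against the mockito-scala README for your version):
import org.mockito.scalatest.{MockitoFixture, MockitoSugar}
import org.scalatest.flatspec.AnyFlatSpec // org.scalatest.FlatSpec on older ScalaTest versions
import com.datastax.driver.core.Row
import java.sql.Timestamp

class RowReaderSpec extends AnyFlatSpec with MockitoSugar with MockitoFixture {

  "the row reader" should "read the timestamp column" in {
    // each test body runs inside its own Mockito session automatically
    val mockedRow = mock[Row]
    when(mockedRow.getTimestamp("column")).thenReturn(Timestamp.valueOf("2018-08-23 15:51:12"))
    // exercise the code under test here
  }
}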
I'm a developer of mockito-scala btw

How to COUNT(*) in Slick 3.0?

I've been using Slick for quite a while and now I'm migrating from Slick 2.1 to 3.0. Unfortunately, I got stuck on ordinary stuff like counting rows. My code worked perfectly in Slick 2.1 when I used to do this:
connection.withSession {
  implicit session => coffees.length.run
}
With the code above I would get my result as an Int, but I can't get it to work after moving to Slick 3.0.2, even though the documentation tells me the code should be the same.
I tried the following (I already removed the withSession deprecated call):
connection.createSession.withTransaction {
  coffees.length
}
But this code will return a slick.lifted.Rep[Int] which does not have any method to get the integer value. Am I missing some implicit import?
As you have probably realised, the run call produces a Future, which will resolve at some later point.
While this means that eventually somewhere in the code the future will need to be waited on in a manner like you show in your answer, this can, and should, be pushed back as late as possible. If you are working with, for example, the Play framework, use async Actions and let Play handle it for you.
In the meantime work with the Future as you would any other monadic construct (like Option) - calling map, flatMap, onSuccess and so on to chain your computations inside the propagated Future context.
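A small sketch of that style, assuming db is the Slick Database object that the question calls connection:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

// db.run turns the DBIO action into a Future[Int]; keep working inside the Future
val countF: Future[Int] = db.run(coffees.length.result)

countF.map { count =>
  println(s"coffees has $count rows")
}

// or chain it with further asynchronous steps instead of blocking
val summary: Future[String] = countF.map(count => s"$count rows")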
Please, someone tell me there is a better way to answer my question. I got it to work doing this, but this looks terrible:
import scala.concurrent.duration._
import scala.concurrent.Await
val timeout = Duration(10, SECONDS)
val count = Await.result(connection.run(coffees.length.result), timeout)

GraphX: I get a NullPointerException inside mapVertices

I want to use GraphX. For now I am just running it locally.
I get a NullPointerException in these few lines. The first println works fine, but the second one fails.
..........
val graph: Graph[Int, Int] = Graph(users, relationships)
println("graph.inDegrees = " + graph.inDegrees.count) // this line works well
graph.mapVertices((id, v) => {
  println("graph.inDegrees = " + graph.inDegrees.count) // but this one fails
  42 // doesn't mean anything
}).vertices.collect
It does not matter which method of the 'graph' object I call; it fails either way. But 'graph' itself is not null inside 'mapVertices'.
Exception failure in TID 2 on host localhost:
java.lang.NullPointerException
org.apache.spark.graphx.impl.GraphImpl.mapReduceTriplets(GraphImpl.scala:168)
org.apache.spark.graphx.GraphOps.degreesRDD(GraphOps.scala:72)
org.apache.spark.graphx.GraphOps.inDegrees$lzycompute(GraphOps.scala:49)
org.apache.spark.graphx.GraphOps.inDegrees(GraphOps.scala:48)
ololo.MyOwnObject$$anonfun$main$1.apply$mcIJI$sp(Twitter.scala:42)
Reproduced using GraphX 2.10 on Spark 1.0.2. I'll give you a workaround and then explain what I think is happening. This works for me:
val c = graph.inDegrees.count
graph.mapVertices((id, v) => {
  println("graph.inDegrees = " + c)
}).vertices.collect
In general, Spark gets prickly when you try to access an entire RDD or other distributed object (like a Graph) in code that's intended to execute in parallel on a single partition, like the function you're passing into mapVertices. But it's also usually a bad idea even when you can get it to work. (As a separate matter, as you've seen, when it doesn't work it tends to result in really unhelpful behavior.)
The vertices of a Graph are represented as an RDD, and the function you pass into mapVertices runs locally in the appropriate partitions, where it is given access to local vertex data: id and v. You really don't want the entire graph to be copied to each partition. In this case you just need to broadcast a scalar to each partition, so pulling it out solved the problem and the broadcast is really cheap.
There are tricks in the Spark APIs for accessing more complex objects in such a situation, but if you use them carelessly they will destroy your performance because they'll tend to introduce lots of communication. Often people are tempted to use them because they don't understand the computation model, rather than because they really need to, although that does happen too.
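For completeness, if the value needed on the executors were something heavier than a single Long, the usual trick is an explicit broadcast variable; a sketch along the lines of the workaround above, assuming sc is the SparkContext the graph was built with:
val inDegreeCount = graph.inDegrees.count       // computed once, on the driver
val bcCount = sc.broadcast(inDegreeCount)       // shipped once to each executor

graph.mapVertices((id, v) => {
  println("graph.inDegrees = " + bcCount.value) // read the broadcast value, never the graph itself
  42
}).vertices.collect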
Spark does not support nested RDDs or user-defined functions that refer to other RDDs, hence the NullPointerException; see this thread on the spark-users mailing list. In this case, you're attempting to call count() on a Graph (which performs an action on a Spark RDD) from inside of a mapVertices() transformation, leading to a NullPointerException when mapVertices() attempts to access data structures that are only callable by the Spark driver.
In a nutshell, only the Spark driver can launch new Spark jobs; you can't call actions on RDDs from inside of other RDD actions.
See https://stackoverflow.com/a/23793399/590203 for another example of this issue.