Unable to execute Code on Databricks Community Edition - scala

I am training on Event Hubs with Databricks on my Databricks Community Edition. I need to run the following code/Notebook, however I get the error message:
NoSuchElementException: head of empty list
However, when I run the same code/Notebook on my Azure Databricks the code run successfully. Does anyone clues why this won't run on Databricks Community, whereas the code runs without any issues on Azure Databricks?
%scala
val tags = com.databricks.logging.AttributionContext.current.tags
//*******************************************
// GET VERSION OF APACHE SPARK
//*******************************************
// Get the version of spark
val Array(sparkMajorVersion, sparkMinorVersion, _) = spark.version.split("""\.""")
// Set the major and minor versions
spark.conf.set("com.databricks.training.spark.major-version", sparkMajorVersion)
spark.conf.set("com.databricks.training.spark.minor-version", sparkMinorVersion)
//*******************************************
// GET VERSION OF DATABRICKS RUNTIME
//*******************************************
// Get the version of the Databricks Runtime
val runtimeVersion = tags.collect(
{ case (t, v) if t.name == "sparkVersion" => v }).head
val runtimeVersions = runtimeVersion.split("""-""")
val (dbrVersion, scalaVersion) = if (runtimeVersions.size == 3) {
val Array(dbrVersion, _, scalaVersion) = runtimeVersions
(dbrVersion, scalaVersion.replace("scala", ""))
} else {
val Array(dbrVersion, scalaVersion) = runtimeVersions
(dbrVersion, scalaVersion.replace("scala", ""))
}
val Array(dbrMajorVersion, dbrMinorVersion, _) = dbrVersion.split("""\.""")
// Set the the major and minor versions
spark.conf.set("com.databricks.training.dbr.major-version", dbrMajorVersion)
spark.conf.set("com.databricks.training.dbr.minor-version", dbrMinorVersion)
//*******************************************
// GET USERNAME AND USERHOME
//*******************************************
// Get the user's name
val username = tags.getOrElse(com.databricks.logging.BaseTagDefinitions.TAG_USER,
java.util.UUID.randomUUID.toString.replace("-", ""))
val userhome = s"dbfs:/user/$username"
// Set the user's name and home directory
spark.conf.set("com.databricks.training.username", username)
spark.conf.set("com.databricks.training.userhome", userhome)
//**********************************
// VARIOUS UTILITY FUNCTIONS
//**********************************
def assertSparkVersion(expMajor:Int, expMinor:Int):Unit = {
val major = spark.conf.get("com.databricks.training.spark.major-version")
val minor = spark.conf.get("com.databricks.training.spark.minor-version")
if ((major.toInt < expMajor) || (major.toInt == expMajor && minor.toInt < expMinor))
throw new Exception(s"This notebook must be ran on Spark version $expMajor.$expMinor or better, found Spark $major.$minor")
}
def assertDbrVersion(expMajor:Int, expMinor:Int):Unit = {
val major = spark.conf.get("com.databricks.training.dbr.major-version")
val minor = spark.conf.get("com.databricks.training.dbr.minor-version")
if ((major.toInt < expMajor) || (major.toInt == expMajor && minor.toInt < expMinor))
throw new Exception(s"This notebook must be ran on Databricks Runtime (DBR) version $expMajor.$expMinor or better, found $major.$minor.")
}
//*******************************************
// CHECK FOR REQUIRED VERIONS OF SPARK & DBR
//*******************************************
assertDbrVersion(4, 0)
assertSparkVersion(2, 3)
displayHTML("Initialized module variables & functions...")

Related

Scala: get data from scylla using spark

scala/spark newbie here. I have inherited an old code which I have refactored and been trying to use in order to retrieve data from Scylla. The code looks like:
val TEST_QUERY = s"SELECT user_id FROM test_table WHERE name = ? AND id_type = 'test_type';"
var selectData = List[Row]()
dataRdd.foreachPartition {
iter => {
// Build up a cluster that we can connect to
// Start a session with the cluster by connecting to it.
val cluster = ScyllaConnector.getCluster(clusterIpString, scyllaPreferredDc, scyllaUsername, scyllaPassword)
var batchCounter = 0
val session = cluster.connect(tableConfig.keySpace)
val preparedStatement: PreparedStatement = session.prepare(TEST_QUERY)
iter.foreach {
case (test_name: String) => {
// Get results
val testResults = session.execute(preparedStatement.bind(test_name))
if (testResults != null){
val testResult = testResults.one()
if(testResult != null){
val user_id = testResult.getString("user_id")
selectData ::= Row(user_id, test_name)
}
}
}
}
session.close()
cluster.close()
}
}
println("Head is =======> ")
println(selectData.head)
The above does not return any data and fails with null pointer exception because the selectedData list is empty although there is data in there for sure that matches the select statement. I feel like how I'm doing it is not correct but can't figure out what needs to change in order to get this fixed so any help is much appreciated.
PS: The whole idea of me using a list to keep the results is so that I can use that list to create a dataframe. I'd be grateful if you could point me to the right direction here.
If you look into the definition of the foreachPartition function, you will see that it's by definition can't return anything because its return type is void.
Anyway, it's a very bad way of querying data from Cassandra/Scylla from Spark. For that exists Spark Cassandra Connector that should be able to work with Scylla as well because of the protocol compatibility.
To read a dataframe from Cassandra just do:
spark.read
.format("cassandra")
.option("keyspace", "ksname")
.option("table", "tab")
.load()
Documentation is quite detailed, so just read it.

Spark: show and collect-println giving different outputs

I am using Spark 2.2
I feel like I have something odd going on here. Basic premise is that
I have a set of KIE/Drools rules running through a Dataset of profile objects
I am then trying to show/collect-print the resulting output
I then cast the output as a tuple to flatmap it later
Code below
implicit val mapEncoder = Encoders.kryo[java.util.HashMap[String, Any]]
implicit val recommendationEncoder = Encoders.kryo[Recommendation]
val mapper = new ObjectMapper()
val kieOuts = uberDs.map(profile => {
val map = mapper.convertValue(profile, classOf[java.util.HashMap[String, Any]])
val profile = Profile(map)
// setup the kie session
val ks = KieServices.Factory.get
val kContainer = ks.getKieClasspathContainer
val kSession = kContainer.newKieSession() //TODO: stateful session, how to do stateless?
// insert profile object into kie session
val kCmds = ks.getCommands
val cmds = new java.util.ArrayList[Command[_]]()
cmds.add(kCmds.newInsert(profile))
cmds.add(kCmds.newFireAllRules("outFired"))
// fire kie rules
val results = kSession.execute(kCmds.newBatchExecution(cmds))
val fired = results.getValue("outFired").toString.toInt
// collect the inserted recommendation objects and create uid string
import scala.collection.JavaConversions._
var gresults = kSession.getObjects
gresults = gresults.drop(1) // drop the inserted profile object which also gets collected
val recommendations = scala.collection.mutable.ListBuffer[Recommendation]()
gresults.toList.foreach(reco => {
val recommendation = reco.asInstanceOf[Recommendation]
recommendations += recommendation
})
kSession.dispose
val uIds = StringBuilder.newBuilder
if(recommendations.size > 0) {
recommendations.foreach(recommendation => {
uIds.append(recommendation.getOfferId + "_" + recommendation.getScore)
uIds.append(";")
})
uIds.deleteCharAt(uIds.size - 1)
}
new ORecommendation(profile.getAttributes().get("cId").toString.toLong, fired, uIds.toString)
})
println("======================Output#1======================")
kieOuts.show(1000, false)
println("======================Output#2======================")
kieOuts.collect.foreach(println)
//separating cid and and each uid into individual rows
val kieOutsDs = kieOuts.as[(Long, Int, String)]
println("======================Output#3======================")
kieOutsDs.show(1000, false)
(I have sanitized/shortened the id's below, they are much bigger but with a similar format)
What I am seeing as outputs
Output#1 has a set of uIds(as String) come up
+----+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|cId |rulesFired | eligibleUIds |
|842 | 17|123-25_2.0;12345678-48_9.0;28a-ad_5.0;123-56_10.0;123-27_2.0;123-32_3.0;c6d-e5_5.0;123-26_2.0;123-51_10.0;8e8-c1_5.0;123-24_2.0;df8-ad_5.0;123-36_5.0;123-16_2.0;123-34_3.0|
+----+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Output#2 has mostly a similar set of uIds show up(usually off by 1 element)
ORecommendation(842,17,123-36_5.0;123-24_2.0;8e8-c1_5.0;df8-ad_5.0;28a-ad_5.0;660-73_5.0;123-34_3.0;123-48_9.0;123-16_2.0;123-51_10.0;123-26_2.0;c6d-e5_5.0;123-25_2.0;123-56_10.0;123-32_3.0)
Output#3 is same as #Output1
+----+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|842 | 17 |123-32_3.0;8e8-c1_5.0;123-51_10.0;123-48_9.0;28a-ad_5.0;c6d-e5_5.0;123-27_2.0;123-16_2.0;123-24_2.0;123-56_10.0;123-34_3.0;123-36_5.0;123-6_2.0;123-25_2.0;660-73_5.0|
Every time I run it the difference between Output#1 and Output#2 is 1 element but never the same element (In the above example, Output#1 has 123-27_2.0 but Output#2 has 660-73_5.0)
Should they not be the same? I am still new to Scala/Spark and feel like I am missing something very fundamental
I think I figured this out, adding cache to kieOuts atleast got me identical outputs between show and collect.
I will be looking at why KIE gives me different output for every run of the same input but that is a different issue

Spark Graphx : class not found error on EMR cluster

I am trying to process Hierarchical Data using Grapghx Pregel and the code I have works fine on my local.
But when I am running on my Amazon EMR cluster it is giving me an error:
java.lang.NoClassDefFoundError: Could not initialize class
What would be the reason of this happening? I know the class is there in the jar file as it run fine on my local as well there is no build error.
I have included GraphX dependency on pom file.
Here is a snippet of code where error is being thrown:
def calcTopLevelHierarcy (vertexDF: DataFrame, edgeDF: DataFrame): RDD[(Any, (Int, Any, String, Int, Int))] =
{
val verticesRDD = vertexDF.rdd
.map { x => (x.get(0), x.get(1), x.get(2)) }
.map { x => (MurmurHash3.stringHash(x._1.toString).toLong, (x._1.asInstanceOf[Any], x._2.asInstanceOf[Any], x._3.asInstanceOf[String])) }
//create the edge RD top down relationship
val EdgesRDD = edgeDF.rdd.map { x => (x.get(0), x.get(1)) }
.map { x => Edge(MurmurHash3.stringHash(x._1.toString).toLong, MurmurHash3.stringHash(x._2.toString).toLong, "topdown") }
// create the edge RD top down relationship
val graph = Graph(verticesRDD, EdgesRDD).cache()
//val pathSeperator = """/"""
//initialize id,level,root,path,iscyclic, isleaf
val initialMsg = (0L, 0, 0.asInstanceOf[Any], List("dummy"), 0, 1)
val initialGraph = graph.mapVertices((id, v) => (id, 0, v._2, List(v._3), 0, v._3, 1, v._1))
val hrchyRDD = initialGraph.pregel(initialMsg, Int.MaxValue, EdgeDirection.Out)(setMsg, sendMsg, mergeMsg)
//build the path from the list
val hrchyOutRDD = hrchyRDD.vertices.map { case (id, v) => (v._8, (v._2, v._3, pathSeperator + v._4.reverse.mkString(pathSeperator), v._5, v._7)) }
hrchyOutRDD
}
I was able to narrow down the line that is causing an error:
val hrchyRDD = initialGraph.pregel(initialMsg, Int.MaxValue, EdgeDirection.Out)(setMsg, sendMsg, mergeMsg)
I had this exact same issue happening to me, where I was able to run it on spark-shell failing when executed from spark-submit. Here’s an example of the code I was trying to execute (looks like it's the same as yours)
The error that pointed me to the right solution was:
org.apache.spark.SparkException: A master URL must be set in your configuration
In my case, I was getting that error because I had defined the SparkContext outside the main function:
object Test {
val sc = SparkContext.getOrCreate
val sqlContext = new SQLContext(sc)
def main(args: Array[String]) {
...
}
}
I was able to solve it by moving SparkContext and sqlContext inside the main function as described in this other post

Spark - Prediction.io - scala.MatchError: null

I'm working on a template for prediction.io and I'm running into trouble with Spark.
I keep getting a scala.MatchError error: full gist here
scala.MatchError: null
at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:831)
at org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:66)
at org.template.prediction.ALSAlgorithm$$anonfun$predict$1$$anonfun$apply$1.apply(ALSAlgorithm.scala:86)
at org.template.prediction.ALSAlgorithm$$anonfun$predict$1$$anonfun$apply$1.apply(ALSAlgorithm.scala:79)
at scala.Option.map(Option.scala:145)
at org.template.prediction.ALSAlgorithm$$anonfun$predict$1.apply(ALSAlgorithm.scala:79)
at org.template.prediction.ALSAlgorithm$$anonfun$predict$1.apply(ALSAlgorithm.scala:78)
The code github source here
val usersWithCounts =
ratingsRDD
.map(r => (r.user, (1, Seq[Rating](Rating(r.user, r.item, r.rating)))))
.reduceByKey((v1, v2) => (v1._1 + v2._1, v1._2.union(v2._2)))
.filter(_._2._1 >= evalK)
// create evalK folds of ratings
(0 until evalK).map { idx =>
// start by getting this fold's ratings for each user
val fold = usersWithCounts
.map { userKV =>
val userRatings = userKV._2._2.zipWithIndex
val trainingRatings = userRatings.filter(_._2 % evalK != idx).map(_._1)
val testingRatings = userRatings.filter(_._2 % evalK == idx).map(_._1)
(trainingRatings, testingRatings) // split the user's ratings into a training set and a testing set
}
.reduce((l, r) => (l._1.union(r._1), l._2.union(r._2))) // merge all the testing and training sets into a single testing and training set
val testingSet = fold._2.map {
r => (new Query(r.user, r.item), new ActualResult(r.rating))
}
(
new TrainingData(sc.parallelize(fold._1)),
new EmptyEvaluationInfo(),
sc.parallelize(testingSet)
)
}
In order to do evaluation I need to split the ratings into a training and a testing group. To make sure each user has been included as part of the training, I group all the user's ratings together and then do the split on each user and then join the splits together.
Maybe there's a better way to do this?
The error means that the userFeatures of the MLlib MatrixFactorizationModel doesn't contain the user id (say, if the user is not in training data). MLlib doesn't check for this after the lookup (.head is used):
https://github.com/apache/spark/blob/v1.2.0/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L66
To debug if it's the case, you can implement a modified version of model.predict() to check if userId/itemId exists in model instead of calling the default one:
val itemScore = model.predict(userInt, itemInt)
(https://github.com/nickpoorman/template-scala-parallel-prediction/blob/master/src/main/scala/ALSAlgorithm.scala#L80):
Change to use .headOption:
val itemScore = model.userFeatures.lookup(userInt).headOption.map { userFeature =>
model.productFeatures.lookup(itemInt).headOption.map { productFeature =>
val userVector = new DoubleMatrix(userFeature)
val productVector = new DoubleMatrix(productFeature)
userVector.dot(productVector)
}.getOrElse{
logger.info(s"No itemFeature for item ${query.item}.")
0.0 // return default score
}
}.getOrElse{
logger.info(s"No userFeature for user ${query.user}.")
0.0 // return default score
}

Scala Slick filter two conditions - error session is not applicable

I am using Scala and Slick and I am trying to execute simple query with two conditions
import JiraData._
import org.scala_tools.time.Imports._
import scala.slick.driver.PostgresDriver.simple._
val today = new DateTime()
val yesterday = today.plusDays(-1)
implicit val session = Database.forURL("jdbc:postgresql://localhost/jira-performance-manager",
driver = "org.postgresql.Driver",
user = "jira-performance-manager",
password = "jira-performance-manager").withSession {
implicit session =>
val activeUsers = users.filter(_.active === true)
for (activeUser <- activeUsers) {
val activeUserWorkogs = worklogs.filter(x => x.username === activeUser.name && x.workDate === yesterday)
}
}
But I receive error:
Error:(20, 95) value === is not a member of scala.slick.lifted.Column[org.scala_tools.time.Imports.DateTime]
Note: implicit value session is not applicable here because it comes after the application point and it lacks an explicit result type
val activeUserWorkogs = worklogs.filter(x => x.username === activeUser.name && x.workDate === yesterday)
^
What's wrong here? How can I get list of results filtered by two conditions?
scala-tools.time uses JodaDateTime. See https://github.com/jorgeortiz85/scala-time/blob/master/src/main/scala/org/scala_tools/time/Imports.scala . Slick does not have built-in support for Joda. There is Slick Joda mapper: https://github.com/tototoshi/slick-joda-mapper . Or it is easy to add yourself: http://slick.typesafe.com/doc/2.1.0/userdefined.html#using-custom-scalar-types-in-queries
As a side-note: Something like
for (activeUser <- activeUsers) {
val activeUserWorkogs = worklogs.filter(...)
looks like going into the wrong direction. It will run another query for each active user. Better is to use a join or run a single accumulated query for the work logs of all active users.