Loading data using SparkJDBCDataSet with jars not working - PySpark

When using a SparkJDBCDataSet to load a table over a JDBC connection, I keep running into an error that Spark cannot find my driver. The driver definitely exists on the machine, and its directory is specified in the spark.yml file under config/base.
I've also followed the instructions and added the init_spark_session method to src/project_name/run.py. I'm suspicious, though, that the SparkSession defined there is not being picked up by the SparkJDBCDataSet class. When you look at the source code for creating the SparkSession and loading datasets inside SparkJDBCDataSet, it looks like a vanilla SparkSession with no configs is used to load and save the data; the configs defined inside spark.yml are not used to create it. Below is an excerpt from the source code:
@staticmethod
def _get_spark():
    return SparkSession.builder.getOrCreate()

def _load(self) -> DataFrame:
    return self._get_spark().read.jdbc(self._url, self._table, **self._load_args)
When I load data from a JDBC source outside of Kedro, with a SparkSession defined with spark.jars, the data loads as expected.
Is there a way to specify spark.jars, as well as other Spark configuration, when building the SparkSession that reads the data in?

SparkSession.builder.getOrCreate will actually do as it says and will get the existing SparkSession. However, you're correct that, if there is no existing session, then a vanilla session will be created.
The best place to call init_spark_session is in your run_package function in run.py, right after the context is loaded. That run.py gets called when you run the kedro run command.
If you wish to test your catalog alone, then the simple workaround here is to make sure that, in your testing code or what have you, you call init_spark_session manually before executing the JDBC connection code.
This can be done with the following:
from kedro.context import load_context

kedro_project_path = "./"
context = load_context(kedro_project_path)
context.init_spark_session()
Where kedro_project_path is appropriate.
Sorry for the formatting btw, am on mobile.

Related

Check whether a Spark format exists or not

Context
The Spark reader has a format function, which is used to specify the data source type, for example JSON, CSV, or a third-party source such as com.databricks.spark.redshift.
Help
How can I check whether a third-party format exists or not? Let me give a case:
In local Spark, to connect to Redshift there are two open-source libraries available: 1. com.databricks.spark.redshift and 2. io.github.spark_redshift_community.spark.redshift. How can I determine which of them the user has put on the classpath?
What I tried
Class.forName("com.databricks.spark.redshift"), which did not work
I checked the Spark code to see how it throws the error (here is the line), but unfortunately Utils is not available publicly
Instead of targeting the format option, I tried to target the JAR file via System.getProperty("java.class.path")
spark.read.format("..").load() in a try/catch
I'm looking for a proper and reliable solution.
Maybe this will help. To merely check whether a Spark format exists or not, wrapping
spark.read.format("..").load()
in a try/catch is enough; see the sketch below.
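For instance, a minimal probe might look like the following. This is only a sketch: the exact exception Spark throws for a missing data source can vary between versions, so treat the catch clauses as assumptions to verify against your Spark version.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").getOrCreate()

// Returns true if Spark can resolve the given data source format on the classpath.
// If the source exists but required options (URL, credentials, ...) are missing,
// load() fails with a different exception, which we interpret as "format exists".
def formatExists(fmt: String): Boolean =
  try {
    spark.read.format(fmt).load()
    true
  } catch {
    case _: ClassNotFoundException => false // "Failed to find data source" in recent Spark versions
    case scala.util.control.NonFatal(_) => true // the source was found but rejected the empty options
  }

// Example: decide which Redshift connector the user has on the classpath.
val redshiftFormat =
  if (formatExists("com.databricks.spark.redshift")) "com.databricks.spark.redshift"
  else "io.github.spark_redshift_community.spark.redshift"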
And since all data sources usually register themselves using the DataSourceRegister interface (and use shortName to provide their alias), you can use Java's ServiceLoader.load method to find all registered implementations of the DataSourceRegister interface:
import java.util.ServiceLoader
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.DataSourceRegister

val formats = ServiceLoader.load(classOf[DataSourceRegister])
formats.asScala.map(_.shortName).foreach(println)

How to process REPL-generated class files in Spark using Scala interpreters running in parallel?

In my company we are currently using the Spark interpreter with spark-jobserver to dynamically generate class files. Those class files are generated on our Spark cluster driver and saved into the directory (on that driver) defined via the "-Yrepl-outdir" option of the standard Scala Settings. It acts as a sort of cache for our executors, which load the class files from there.
Everything works fine with the standard setup of one interpreter per driver, but the problem occurs when I try to improve performance by introducing multiple interpreters running in parallel. I used the Akka router design pattern with a single interpreter per routee, where each routee runs in its own thread, and of course I hit a wall: the interpreters overwrite each other's results in the output directory when evaluating class files.
I tried to fix it by giving each interpreter a different output directory, but in that case those output directories were not recognized by Spark as directories to look for generated class files in. I defined a separate output directory for each interpreter using the "-Yrepl-outdir" option, but somehow it wasn't enough.
I also tried changing the class loader to modify the default names of the generated packages/classes, so that each one starts with a prefix unique to a certain interpreter, but I haven't found a working solution yet.
Since reproducing this issue requires a running Spark cluster and a programmatic setup of the Spark Scala interpreter, I'll just show a simplified method that illustrates how we create the Scala interpreter in general:
def addInterpreter(classpath: String, outputDir: File, loader: ClassLoader, conf: SparkConf): IMain = {
  val settings = new Settings()
  val writer = new java.io.StringWriter()
  settings.usejavacp.value = true
  settings.embeddedDefaults(loader)
  settings.classpath.value = (classpath.distinct mkString java.io.File.pathSeparator).replace("file:", "")
  SparkIMainServer.createInterpreter(conf, outputDir, settings, writer)
}
Here you can see some simplified output of my running interpreters, with packages on the left-side panel and the content of one of them ($line3) on the right side. What I think would solve my problem is to give custom names to those packages: instead of $line1, $line2, etc., something like p466234$line1, p198934$line2, etc., with a unique prefix for each interpreter.
So, what's the easiest way to rename those class-files/packages generated by Spark Scala interpreter? Is there any other solution to this problem?

Find name of currently running SparkContext

I swear I've done this before but I can't find the code or an answer. I want to get the name of a currently running SparkContext and read it into a variable or print it to the screen. Something along the lines of:
val myContext = SparkContext.getName
So, for example, if I was in spark-shell and ran it, it would return "sc". Does anyone know how to get that?
I'm not quite sure I follow... by name, do you mean the name of the application? If so, you would call appName. In spark-shell, for example: sc.appName.
If you're asking to get the name of the variable holding the context, then I'm not sure you can. sc is just the val used to access the context inside spark-shell, but you could name it anything you want in your own application.
[EDIT]
There's a getOrCreate method on the SparkContext which can return an existing created and registered context. Will this do what you want?
https://spark.apache.org/docs/1.5.1/api/java/org/apache/spark/SparkContext.html#getOrCreate()
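As a minimal illustration of both points (a sketch assuming a local setup; the app name you see will depend on how the context was created):

import org.apache.spark.{SparkConf, SparkContext}

// getOrCreate returns the already-registered SparkContext if one exists,
// otherwise it creates a new one from the supplied configuration.
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))

// The application name, e.g. "Spark shell" when run inside spark-shell.
println(sc.appName)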

Issues with reading xml file after creating jar

We are building an application using ScalaFX. When I run the project in IntelliJ IDEA, everything works fine. However, when I create a jar file and try to execute it, I get errors reading an XML file.
I tried various solutions posted on SO, but to no avail.
package com.app.adt

import scalafx.application.JFXApp
import scalafx.Includes._
import scalafx.scene.Scene
import scala.reflect.runtime.universe.typeOf
import scalafxml.core.{FXMLView, DependenciesByType}

object App extends JFXApp {
  val root = FXMLView(getClass.getResource("/com/app/adt/Home.fxml"),
    new DependenciesByType(Map(
      typeOf[TestDependency] -> new TestDependency("ADT"))))

  stage = new JFXApp.PrimaryStage() {
    title = "ADT"
    scene = new Scene(root)
  }
}
The XML file (Home.fxml) is placed in the com/app/adt package. I am creating the jar file using sbt-one-jar.
I have tried different combinations of the path, but it always gives the same error.
Error Stack:
Caused by: javafx.fxml.LoadException:
file:/adt-app_2.11-1.3-SNAPSHOT-one-jar.jar!/main/adt-app_2.11-1.3-SNAPSHOT.jar!/com/app/adt/Home.fxml
    at javafx.fxml.FXMLLoader.constructLoadException(FXMLLoader.java:2611)
    at javafx.fxml.FXMLLoader.loadImpl(FXMLLoader.java:2589)
    at javafx.fxml.FXMLLoader.loadImpl(FXMLLoader.java:2435)
    at javafx.fxml.FXMLLoader.load(FXMLLoader.java:2403)
    at scalafxml.core.FXMLView$.apply(FXMLView.scala:17)
Jar Structure:
adt-app_2.11-1.3-SNAPSHOT-one-jar.jar
|
+-- main
    |
    +-- adt-app_2.11-1.3-SNAPSHOT.jar
        |
        +-- com\app\adt
            |
            +-- App.scala
            +-- Home.fxml
I have also tried sbt-assembly instead of sbt-one-jar, but I'm still getting the same error. :(
I tried the answers from these SO questions:
Q1
Q2
The real problem is rather tricky. Firstly, one needs to realize that a JAR is an archive (similar to a ZIP, for example) and archives are regular files. Thus the archive itself is located somewhere in the file system and is therefore accessible via URL.
On the contrary, the "subfiles" (entries) are just data blocks within the archive. Neither the operating system nor the JVM knows that this particular file is an archive, so they treat it as a regular file.
If you're interested in deeper archive handling, try to figure out how ZipFile works. A JAR is basically a ZIP, so you can apply this class to it.
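A quick way to see this for yourself (just a sketch, using the jar name from the question; adjust the path as needed):

import java.util.jar.JarFile
import scala.collection.JavaConverters._

// A JAR is just a ZIP archive; listing its entries shows that Home.fxml is a
// data block inside the archive rather than a file the OS can hand out directly.
val jar = new JarFile("adt-app_2.11-1.3-SNAPSHOT-one-jar.jar")
jar.entries().asScala.map(_.getName).filter(_.endsWith(".fxml")).foreach(println)
jar.close()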
Java provides the Class.getResourceAsStream method, which enables the programmer to read files as streams. On its own this is useless in this particular example, since the ScalaFX method expects a file instead.
So basically you have three options:
1. Use the stream API to copy the XML into a temporary file, then pass that file to the method (a sketch follows below).
2. Deploy your resources separately so that they remain regular files.
3. Re-implement JavaFX so it accepts streams (this should probably happen anyway).
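A sketch of option 1, assuming the resource path from the question (error handling omitted for brevity):

import java.io.File
import java.nio.file.{Files, StandardCopyOption}

// Copy the classpath resource out of the JAR into a temporary file.
val in = getClass.getResourceAsStream("/com/app/adt/Home.fxml")
val tmp: File = Files.createTempFile("Home", ".fxml").toFile
tmp.deleteOnExit()
Files.copy(in, tmp.toPath, StandardCopyOption.REPLACE_EXISTING)
in.close()

// tmp.toURI.toURL can now be handed to the FXML loader in place of getClass.getResource(...).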

What is a good strategy for keeping global application state in Scala?

As a simple example, say I'm starting my application in a certain mode (e.g. test); then I want to be able to check in other parts of the application which mode I'm running in. This should be extremely simple, but I'm looking for the right Scala replacement for global variables. Please give me a bit more than: "Scala objects are like global variables".
The ideal solution is that at start-up, the application will create an object, and at creation time, that object's 'mode' is set. After that, other parts of the application will just be able to read the state of 'mode'. How can I do this without passing a reference to an object all over the application?
My real scenario actually includes things such as selecting the database name, or singleton database object at start-up, and not allowing anything else to change that object afterwards. The one problem is that I'm trying to achieve this without passing around that reference to the database.
UPDATE:
Here is a simple example of what I would like to do, and my current solution:
object DB {
  class PDB extends ProductionDB
  class TDB extends TestComplianceDB

  lazy val pdb = new PDB
  lazy val tdb = new TDB

  def db = tdb // (or pdb) How can I set this once at initialisation?
}
So, I've created different database configurations as traits. Depending on whether I'm running in Test or Production mode, I would like to use the correct configuration where configurations look something like:
trait TestDB extends DBConfig {
  val m = new Model("H2", new DAL(H2Driver),
    Database.forURL("jdbc:h2:mem:testdb", driver = "org.h2.Driver"))

  // This is an in-memory database, so it will not yet exist.
  dblogger.info("Using TestDB")
  m.createDB
}
So now, whenever I use the database, I could use it like this:
val m = DB.db.m
m.getEmployees(departmentId)
My question really is: is this style bad, good, or OK (using a singleton to hold a handle to the database)? I'm using Slick, and I think this relates to having just one instance of Slick running. Could this lead to scalability issues?
Is there a better way to solve the problem?
You can use the Typesafe Config library; it is also used in projects like Play and Akka, and both the Play and Akka documentation explain the basics of its usage. From the Play documentation (Additional configuration):
Specifying alternative configuration file
The default is to load the application.conf file from the classpath. You can specify an alternative configuration file if needed:
Using -Dconfig.resource
-Dconfig.resource=prod.conf
Using -Dconfig.file
-Dconfig.file=/opt/conf/prod.conf
Using -Dconfig.url
-Dconfig.url=http://conf.mycompany.com/conf/prod.conf
Note that you can always reference the original configuration file in a new prod.conf file using the include directive, such as:
include "application.conf"
key.to.override=blah
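Applied to the original question, a minimal sketch might look like the following. The app.mode key and the DBConfig stand-ins are assumptions for illustration, not part of the question's code or of the Typesafe Config API.

import com.typesafe.config.ConfigFactory

// Stand-ins for the question's ProductionDB / TestComplianceDB configurations.
trait DBConfig { def name: String }
class PDB extends DBConfig { val name = "production" }
class TDB extends DBConfig { val name = "test" }

object DB {
  // Read once from application.conf (or whatever -Dconfig.file / -Dconfig.resource points at).
  private val mode = ConfigFactory.load().getString("app.mode")

  lazy val pdb = new PDB
  lazy val tdb = new TDB

  // Resolved once from configuration at start-up instead of being hard-coded.
  val db: DBConfig = if (mode == "test") tdb else pdb
}

Any other part of the application can then read DB.db without a database reference being passed around, and nothing can reassign it afterwards, since db is a val.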