Find name of currently running SparkContext - scala

I swear I've done this before but I can't find the code or an answer. I want to get the name of a currently running SparkContext and read it into a variable or print it to the screen. Something along the lines of:
val myContext = SparkContext.getName
So, for example, if I were in spark-shell and ran it, it would return "sc". Does anyone know how to get that?

I'm not quite sure I follow... by name, do you mean the name of the application? If so, you would call appName. In spark-shell, for example: sc.appName.
If you're asking to get the name of the variable holding the context, then I'm not sure you can. sc is just the val used to access the context inside spark-shell, but you could name it anything you want in your own application.
[EDIT]
There's a getOrCreate method on SparkContext which can return an already created and registered context. Will this do what you want?
https://spark.apache.org/docs/1.5.1/api/java/org/apache/spark/SparkContext.html#getOrCreate()
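For illustration, a minimal spark-shell sketch of both suggestions; the printed application name assumes the shell's default.

// In spark-shell, sc is already bound to the running SparkContext.
import org.apache.spark.SparkContext

println(sc.appName)                 // the application name, e.g. "Spark shell"

// getOrCreate returns the already-registered context rather than a new one.
val existing = SparkContext.getOrCreate()
println(existing eq sc)             // true: same instance that sc refers to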

Related

Loading data using sparkJDBCDataset with jars not working

When using a sparkJDBCDataset to load a table using a JDBC connection, I keep running into the error that Spark cannot find my driver. The driver definitely exists on the machine and its directory is specified inside the spark.yml file under config/base.
I've also followed the instructions and added the init_spark_session method to src/project_name/run.py. I'm suspicious, though, that the SparkSession defined there is not being picked up by the sparkJDBCDataset class. When you look at the source code for creating the SparkSession and loading datasets inside sparkJDBCDataset, it looks like a vanilla SparkSession with no configs is used to load and save the data. The configs defined inside spark.yml are not used to create this SparkSession. Below is an excerpt from the source code:
@staticmethod
def _get_spark():
    return SparkSession.builder.getOrCreate()

def _load(self) -> DataFrame:
    return self._get_spark().read.jdbc(self._url, self._table, **self._load_args)
When I load data from a JDBC source outside of Kedro, with a SparkSession defined with spark.jars, the data loads as expected.
Is there a way to specify spark.jars, as well as other SparkConf settings, when building the SparkSession that reads the data in?
SparkSession.builder.getOrCreate will actually do as it says and get the existing Spark session. However, you're correct that, if there is no existing session, then a vanilla session will be created.
The best place to run init_spark_session is in your run_package function in run.py, right after the context is loaded. That run.py gets called when the kedro run command is invoked.
If you wish to test your catalog alone, then the simple workaround is to make sure that, in your testing code, you call init_spark_session manually before executing the JDBC connection code.
This can be done with the following:
from kedro.context import load_context

kedro_project_path = "./"
context = load_context(kedro_project_path)
context.init_spark_session()
Where kedro_project_path is appropriate.
Sorry for the formatting btw, am on mobile.

Scala + Spark: ways to pass parameters in a program. Is it possible to use the Context for this?

I am wondering if it is possible to pass parameters in a Scala Spark program using the context or something similar. I mean, I read some parameters from spark-submit inside my app, but those parameters are only needed "at the end" (let's say). So I have to pass them from the driver to another file, and then to another file, and so on... As a result, my method calls end up with a huge list of parameters.
Thank you in advance!
The key point to understand is that you give spark-submit the application jar file and any command-line parameters that you want spark-submit to provide when invoking the jar.
My understanding is that you only need some of those parameters at the very end of execution and do not want to carry all those arguments through nested function calls. I will say, there is definitely scope for refactoring the design.
In any case, one trick you can employ is to write those parameters to a JSON file and make it available for your Spark application to read when necessary (I would write those parameters to AWS S3 and read them when needed).
Or, you can create an implicit value and carry it throughout the code, which I believe would not be a good design.
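As a rough sketch of the implicit idea (the case class and parameter names here are invented for illustration, not anything your application defines): bundle the spark-submit arguments into one value so only that single value is threaded through the call chain.

// Hypothetical example: collect the spark-submit arguments into one case class.
case class AppParams(inputPath: String, outputPath: String, threshold: Double)

object Main {
  def main(args: Array[String]): Unit = {
    val params = AppParams(args(0), args(1), args(2).toDouble)
    run(params)                        // pass one value, not a long argument list
  }

  // Declaring the parameter implicit means intermediate methods that also take
  // an implicit AppParams never have to mention it at each call site.
  def run(implicit params: AppParams): Unit =
    finalStage                         // params is picked up implicitly here

  def finalStage(implicit params: AppParams): Unit =
    println(s"writing results to ${params.outputPath}")
}

Whether this is better than passing an explicit config object is a design choice; the case class alone already removes the long parameter lists.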

Spark word count in Scala (running in Apache Sandbox)

I am trying to do a word count lab in Spark on Scala. I am able to successfully load the text file into a variable (RDD), but when I do the .flatMap, .map, and .reduceByKey, I receive the attached error message. I am new to this, so any help would be greatly appreciated. Please let me know.
Your program is failing because it was not able to find the file on HDFS.
You need to specify the file in the following format:
sc.textFile("hdfs://namenodedetails:8020/input.txt")
You need to give the fully qualified path of the file. Since Spark builds a dependency graph and evaluates it lazily only when an action is called, you only see the error at the point where you call an action.
It is better to debug right after reading the file from HDFS, using the .first or .take(n) methods.
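A sketch of the word count with a fully qualified HDFS path; the namenode host, port, and file name are the placeholders used above, not values from your cluster.

val lines = sc.textFile("hdfs://namenodedetails:8020/input.txt")

// Force a small action early so a bad path fails here, not later in the job.
lines.take(2).foreach(println)

val counts = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.take(10).foreach(println)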

What is a good strategy for keeping global application state in Scala?

As a simple example, say I'm starting my application in a certain mode (e.g. test); then I want to be able to check in other parts of the application what mode I'm running in. This should be extremely simple, but I'm looking for the right Scala replacement for global variables. Please give me a bit more than: "Scala objects are like global variables".
The ideal solution is that at start-up, the application will create an object, and at creation time, that object's 'mode' is set. After that, other parts of the application will just be able to read the state of 'mode'. How can I do this without passing a reference to an object all over the application?
My real scenario actually includes things such as selecting the database name, or singleton database object at start-up, and not allowing anything else to change that object afterwards. The one problem is that I'm trying to achieve this without passing around that reference to the database.
UPDATE:
Here is a simple example of what I would like to do, and my current solution:
object DB {
  class PDB extends ProductionDB
  class TDB extends TestComplianceDB

  lazy val pdb = new PDB
  lazy val tdb = new TDB

  def db = tdb // (or pdb) How can I set this once at initialisation?
}
So, I've created different database configurations as traits. Depending on whether I'm running in Test or Production mode, I would like to use the correct configuration where configurations look something like:
trait TestDB extends DBConfig {
  val m = new Model("H2", new DAL(H2Driver),
    Database.forURL("jdbc:h2:mem:testdb", driver = "org.h2.Driver"))
  // This is an in-memory database, so it will not yet exist.
  dblogger.info("Using TestDB")
  m.createDB
}
So now, whenever I use the database, I could use it like this:
val m = DB.db.m
m.getEmployees(departmentId)
My question really is: is this style bad, good, or OK (using a singleton to hold a handle to the database)? I'm using Slick, and I think this relates to having just one instance of Slick running. Could this lead to scalability issues?
Is there a better way to solve the problem?
You can use the Typesafe Config library, which is also used in projects like Play and Akka. Both the Play and Akka documentation explain the basics of its usage. From the Play documentation (Additional configuration):
Specifying alternative configuration file
The default is to load the application.conf file from the classpath. You can specify an alternative configuration file if needed:
Using -Dconfig.resource
-Dconfig.resource=prod.conf
Using -Dconfig.file
-Dconfig.file=/opt/conf/prod.conf
Using -Dconfig.url
-Dconfig.url=http://conf.mycompany.com/conf/prod.conf
Note that you can always reference the original configuration file in a new prod.conf file using the include directive, such as:
include "application.conf"
key.to.override=blah
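A minimal sketch of reading such a setting with the Typesafe Config library; the app.mode and app.db.url keys are made-up names for illustration, not anything your application requires.

import com.typesafe.config.ConfigFactory

object Settings {
  // Loads application.conf from the classpath by default; overridden at
  // start-up via -Dconfig.resource, -Dconfig.file or -Dconfig.url as above.
  private val config = ConfigFactory.load()

  val mode: String  = config.getString("app.mode")   // e.g. "test" or "prod"
  val dbUrl: String = config.getString("app.db.url")
}

// Elsewhere in the application, read-only after start-up:
// if (Settings.mode == "test") useTestDatabase()

This keeps the selection of mode and database out of the code entirely: the object is set once from configuration at start-up and nothing else can change it afterwards.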

In Scala, can I call Source.reset() on resources read from the classpath?

Suppose I have a jarfile on my classpath. In that jarfile I have a file afile.txt.
I need to iterate on that file twice, once to count the lines and once to parse it. This is what I did:
val source = Source.fromInputStream(/*some magic to get the resource's InputStream*/)
source.getLines.foreach (/*count the lines*/)
source.getLines.reset.foreach (/*do something interesting*/)
But this doesn't work. In the debugger it looks like the call to reset() returns an empty iterator. The code above works fine when the Source refers to a file on the filesystem instead of on the classpath.
Am I doing something wrong, or is this a bug in Scala's io library?
I think this is a bug in the Scala library. I had a quick look at Source.scala in the 2.8 trunk, and reset seems to return a new wrapper around the original input stream, which would have no content left after the first pass. I think it should throw an exception instead. I can't think of a straightforward way to reset an arbitrary input stream.
I think you can simply call Source.fromInputStream a second time (e.g. val source2 = Source.fromInputStream(...)) and read again, as it seems reset does not do more than that anyway.
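A sketch of that workaround, creating a fresh Source for each pass over the classpath resource rather than relying on reset(); afile.txt is the resource name from the question.

import scala.io.Source

def freshSource() =
  Source.fromInputStream(getClass.getResourceAsStream("/afile.txt"))

val count = freshSource().getLines().size        // first pass: count the lines

freshSource().getLines().foreach { line =>       // second pass: parse
  // ... do something interesting with line ...
}

// Alternatively, read the lines once into memory and iterate that twice:
// val lines = freshSource().getLines().toVector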