How to get default property values in Spark - scala

I am using this version of Spark: spark-1.4.0-bin-hadoop2.6. I want to check a few default properties, so I ran the following statement in spark-shell:
scala> sqlContext.getConf("spark.sql.hive.metastore.version")
I was expecting the call to method getConf to return a value of 0.13.1 as described in this link. But I got the exception below:
java.util.NoSuchElementException: spark.sql.hive.metastore.version
at org.apache.spark.sql.SQLConf$$anonfun$getConf$1.apply(SQLConf.scala:283)
at org.apache.spark.sql.SQLConf$$anonfun$getConf$1.apply(SQLConf.scala:283)
Am I retrieving the properties in the right way?

You can use
sc.getConf.toDebugString
OR
sqlContext.getAllConfs
which will return all values that have been set; however, some defaults live only in the code. In your specific example, it is indeed in the code:
getConf(HIVE_METASTORE_VERSION, hiveExecutionVersion)
where the default is defined as:
val hiveExecutionVersion: String = "0.13.1"
So getConf will attempt to pull the metastore version from the config, falling back to that default, but the key is never listed in the conf itself.
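If you just want a value back instead of an exception, a minimal sketch (assuming the documented default of 0.13.1) is to use the getConf overload that takes a fallback:
// Returns the configured value, or the supplied default if the key is unset
val version = sqlContext.getConf("spark.sql.hive.metastore.version", "0.13.1")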

In Spark 2.x.x, if I wanted to know the default value of a Spark conf, I would do this:
The command below returns a Scala Map in spark-shell.
spark.sqlContext.getAllConfs
To find the value of a specific conf property, e.g. the default warehouse dir used by Spark, set in the conf as spark.sql.warehouse.dir:
spark.sqlContext.getAllConfs.get("spark.sql.warehouse.dir")
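Note that getAllConfs returns a plain Scala Map, so .get yields an Option[String]. A small sketch of both approaches in the 2.x shell (the fallback string here is just an illustration):
// Option[String] from the snapshot of all set values
val warehouseDir = spark.sqlContext.getAllConfs.get("spark.sql.warehouse.dir")
// Or ask the runtime conf directly, supplying a fallback for unset keys
val dir = spark.conf.get("spark.sql.warehouse.dir", "<not set>")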

Related

Setting GOOGLE_APPLICATION_CREDENTIALS environment variable in Scala

I am running spark-shell with Scala and I want to set an environment variable to load data into Google BigQuery. The environment variable is GOOGLE_APPLICATION_CREDENTIALS and it contains /path/to/service/account.json.
In a Python environment I can easily do
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "path/to/service/account.json"
However, I cannot do this in Scala. I can print out the system environment variables using,
scala> sys.env
or
scala> System.getenv()
which returns a map of String key/value pairs. However,
scala> System.getenv("GOOGLE_APPLICATION_CREDENTIALS") = "path/to/service/account.json"
returns an error
<console>:26: error: value update is not a member of java.util.Map[String,String]
I found a workaround for this problem, though I don't think it's best practice. Here is the two-step solution:
From the terminal/cmd, first create the environment variable:
export GOOGLE_APPLICATION_CREDENTIALS=path/to/service/account.json
From the same terminal window, open spark-shell and run:
System.getenv("GOOGLE_APPLICATION_CREDENTIALS")

Spark-SQL: access file in current worker node directory

I need to read a file using spark-sql, and the file is in the current directory.
I use this command to decompress a list of files I have stored on HDFS.
val decompressCommand = Seq(laszippath, "-i", inputFileName , "-o", "out.las").!!
The file is written to the current worker node directory, and I know this because executing "ls -a".!! through Scala I can see that the file is there. I then try to access it with the following command:
val dataFrame = sqlContext.read.las("out.las")
I assumed that the sql context would try to find the file in the current directory, but it doesn't. Also, it doesn't throw an error but a warning stating that the file could not be found (so spark continues to run).
I attempted to add the file using: sparkContext.addFile("out.las") and then access the location using: val location = SparkFiles.get("out.las") but this didn't work either.
I even ran the command val locationPt = "pwd".!! and then did val fullLocation = locationPt + "/out.las" and attempted to use that value, but it didn't work either.
The actual exception that gets thrown is the following:
User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: [];
org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: []
And this happens when I try to access column 'x' from a dataframe. I know that column 'x' exists because I've downloaded some of the files from HDFS, decompressed them locally and ran some tests.
I need to decompress files one by one because I have 1.6TB of data and so I cannot decompress it at one go and access them later.
Can anyone tell me what I can do to access files which are being outputted to the worker node directory? Or maybe should I be doing it some other way?
So I managed to do it now. What I'm doing is saving the file to HDFS and then retrieving it using the SQL context through HDFS. I overwrite "out.las" each time in HDFS so that it doesn't take up too much space.
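A minimal sketch of that round trip (the paths are hypothetical, and the .las reader assumes the same third-party data source used in the question):
import org.apache.hadoop.fs.{FileSystem, Path}
// Push the locally decompressed file to HDFS, overwriting any previous copy
val hdfs = FileSystem.get(sc.hadoopConfiguration)
hdfs.copyFromLocalFile(false /* keep local copy */, true /* overwrite */, new Path("out.las"), new Path("/user/me/dataForHDFS/out.las"))
// Then read it back through the SQL context from HDFS
val dataFrame = sqlContext.read.las("/user/me/dataForHDFS/out.las")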
I have used the Hadoop API before to get to files; I don't know if it will help you here.
import org.apache.hadoop.fs.{FileSystem, FSDataInputStream, Path}
val filePath = "/user/me/dataForHDFS/"
val fs: FileSystem = FileSystem.get(new java.net.URI(filePath + "out.las"), sc.hadoopConfiguration)
And I've not tested the below, but something like this should read the file into a byte array sized from its actual length:
val file = new Path(filePath + "out.las")
// Allocate a buffer matching the file length; readFully fills it completely
val readIn = new Array[Byte](fs.getFileStatus(file).getLen.toInt)
val fileIn: FSDataInputStream = fs.open(file)
fileIn.readFully(0, readIn)
fileIn.close()

No configuration setting found for key typesafe config

I'm trying to use the typesafehub/config configuration library.
I'm using this code:
val conf = ConfigFactory.load()
val url = conf.getString("add.prefix") + id + "/?" + conf.getString("add.token")
And the location of the property file is /src/main/resources/application.conf.
But for some reason I'm receiving:
com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'add'
File content
add {
token = "access_token=6235uhC9kG05ulDtG8DJDA"
prefix = "https://graph.facebook.com/v2.2/"
limit = "&limit=250"
comments="?pretty=0&limit=250&access_token=69kG05ulDtG8DJDA&filter=stream"
feed="/feed?limit=200&access_token=623501EuhC9kG05ulDtG8DJDA&pretty=0"
}
Everything looks configured correctly, so did I miss something?
Thanks,
miki
The error message is telling you that whatever configuration got read, it didn't include a top level setting named add. The ConfigFactory.load function will attempt to load the configuration from a variety of places. By default it will look for a file named application with a suffix of .conf or .json. It looks for that file as a Java resource on your class path. However, various system properties will override this default behavior.
So, it is likely that what you missed is one of these:
Is it possible that src/main/resources is not on your class path?
Are the config.file, config.resource or config.url properties set?
Is your application.conf file empty?
Do you have an application.conf that would be found earlier in your class path?
Is the key: add defined in the application.conf?
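One quick way to check several of these at once is to dump what actually got loaded. A small debugging sketch (standard ConfigFactory API; nothing project-specific assumed):
import com.typesafe.config.{ConfigFactory, ConfigRenderOptions}
val loaded = ConfigFactory.load()
// The default render includes origin comments showing where each value came from
println(loaded.root().render(ConfigRenderOptions.defaults()))
// Check whether an override property is redirecting the load
Seq("config.file", "config.resource", "config.url").foreach(k => println(s"$k = ${sys.props.get(k)}"))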
Are you using an IDE or sbt?
I had a similar problem while using Eclipse. It simply did not find the application.conf file at first and later on failed to notice edits.
However, once I ran my program via sbt, all worked just fine, including Eclipse. So I added main/resources to the libraries (Project -> Properties -> Java Build Path -> Libraries, "Add Class Folder"). That might help you as well.
Place your application.conf in the src folder and it should work.
I ran into this issue inside a Specs2 test that was driven by SBT. It turned out that the issue was caused by https://github.com/etorreborre/specs2/issues/556. In that case, the Thread's contextClassLoader wasn't using the correct classloader. If you run into a similar error, there are other versions of ConfigFactory.load() that allow you to pass the current class's ClassLoader instead. If you're using Specs2 and you're seeing this issue, use a version <= 3.8.6 or >= 4.0.1.
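A minimal sketch of that workaround, passing the current class's ClassLoader explicitly instead of relying on the thread's contextClassLoader:
import com.typesafe.config.ConfigFactory
val conf = ConfigFactory.load(getClass.getClassLoader)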
Check your path. In my case I got the same issue, having application.conf placed in src/main/resources/configuration/common/application.conf.
Incorrect:
val conf = ConfigFactory.load(s"/configuration/common/application.conf")
Correct:
val conf = ConfigFactory.load(s"configuration/common/application.conf")
It turned out to be a silly mistake I made.
For what it's worth, it does not matter whether you use ":" or "=" in the .conf file.
Getting the value from this example:
server{
proc {
max = "600"
}
}
conf.getString("server.proc.max")
You can even have the following conf:
proc {
max = "600"
}
proc {
main = "60000"
}
conf.getString("proc.max") //prints 600
conf.getString("proc.min") //prints 60000
I ran into this doing a getString on an integer in my configuration file.
I ran into exactly the same problem and the solution was to replace = with : in the application.conf. Try with the following content in your application.conf:
add {
token: "access_token=6235uhC9kG05ulDtG8DJDA"
prefix: "https://graph.facebook.com/v2.2/"
limit: "&limit=250"
comments: "?pretty=0&limit=250&access_token=69kG05ulDtG8DJDA&filter=stream"
feed: "/feed?limit=200&access_token=623501EuhC9kG05ulDtG8DJDA&pretty=0"
}
Strangely, IntelliJ doesn't detect any formatting or syntax error when using = for me.
In my case it was a silly mistake:
I changed the file name from "application.config" to "application.conf" and it worked.
If the application.conf is not getting discovered, you could add this to build.sbt:
unmanagedSourceDirectories in Compile += baseDirectory.value / "main/resources"
Please don't use this to include any custom path; follow the guidelines and best practices.
As mentioned by others, make sure the application.conf is placed in src/main/resources.
I placed the file there and the error went away.
Looking at these examples helped me as well:
https://github.com/lightbend/config/tree/main/examples/scala
Use ConfigFactory.parseFile for other locations.
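A sketch of that approach (the path below is hypothetical); parseFile bypasses the classpath lookup entirely, and resolve() applies any substitutions:
import java.io.File
import com.typesafe.config.ConfigFactory
val conf = ConfigFactory.parseFile(new File("/etc/myapp/application.conf")).resolve()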

Spark Tachyon: How to delete a file?

In Scala, as an experiment I create a sequence file on Tachyon using Spark and read it back in. I want to delete the file from Tachyon using the Spark script also.
val rdd = sc.parallelize(Array(("a",2), ("b",3), ("c",1)))
rdd.saveAsSequenceFile("tachyon://127.0.0.1:19998/files/123.sf2")
val rdd2 = sc.sequenceFile[String,Int]("tachyon://127.0.0.1:19998/files/123.sf2")
I don't understand the Scala language very well and I cannot find a reference about file path manipulation. I did find a way of somehow using Java in Scala to do this, but I cannot get it to work using Tachyon.
import java.io._
new File("tachyon://127.0.0.1:19998/files/123.sf2").delete()
There are different approaches, e.g.:
CLI:
./bin/tachyon tfs rm filePath
More info: http://tachyon-project.org/Command-Line-Interface.html
API:
TachyonFS sTachyonClient = TachyonFS.get(args[0]);
sTachyonClient.delete(filePath, true);
More info:
https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/examples/BasicOperations.java
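If you'd rather stay inside the Spark script, Tachyon also exposes a Hadoop-compatible FileSystem, so a sketch like the following should work (assuming the Tachyon client jar is on the classpath and the tachyon:// scheme is registered with Hadoop):
import org.apache.hadoop.fs.Path
val path = new Path("tachyon://127.0.0.1:19998/files/123.sf2")
val fs = path.getFileSystem(sc.hadoopConfiguration)
fs.delete(path, true) // true = delete recursively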

Typesafe ConfigFactory load error

I'm trying to load the application.conf that I have under my resources folder using the following line:
val config = ConfigFactory.load(getClass.getResource("application.conf").getPath)
However, it fails and the application.conf is not loaded. There is no error whatsoever. Any ideas as to what to look for?
ConfigFactory.load takes a resource name as its parameter, not a complete path. So it should be enough to just use "application.conf" as the argument, like this:
ConfigFactory.load("application.conf")
As "application.conf is the default name anyways it should actually be enough to just go without arguments:
ConfigFactory.load()
You can make the library produce a meaningful error by using this overload of ConfigFactory.load:
val config = ConfigFactory.load(configName,
  ConfigParseOptions.defaults().setAllowMissing(false),
  ConfigResolveOptions.defaults())
(I was fairly surprised that they didn't make this the default).