Spark properties file read - scala

I am trying to read a properties file in a Spark job. The file exists at the given location, but when I run the job I get the error below.
The code is:
object runEmpJob {
  def main(args: Array[String]): Unit = {
    println("starting emp job")
    val props = ConfigFactory.load()
    val envProps = props.getConfig("C:\\Users\\mmishra092815\\IdeaProjects\\use_case_1\\src\\main\\Resource\\filepath.properties")
    System.setProperty("hadoop.home.directory", "D:\\SHARED\\winutils-master\\hadoop-2.6.3\\bin")
    val spark = SparkSession.builder().
      appName("emp dept operation").
      master(envProps.getString("Dev.executionMode")).
      getOrCreate()
    val empObj = new EmpOperation
    empObj.runEmpOperation(spark, "String", fileType = "csv")
    val inPutPath = args(1)
    val outPutPath = args(2)
  }
}
I am getting this error:
Exception in thread "main"
com.typesafe.config.ConfigException$BadPath: path parameter: Invalid path 'C:\Users\mmishra092815\IdeaProjects\use_case_1\src\main\Resource\filepath.properties':
Token not allowed in path expression: ':' (you can double-quote this token if you really want it here)
at com.typesafe.config.impl.PathParser.parsePathExpression(PathParser.java:155)
at com.typesafe.config.impl.PathParser.parsePathExpression(PathParser.java:74)
at com.typesafe.config.impl.PathParser.parsePath(PathParser.java:61)
at com.typesafe.config.impl.Path.newPath(Path.java:230)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:192)
at com.typesafe.config.impl.SimpleConfig.getObject(SimpleConfig.java:268)
at com.typesafe.config.impl.SimpleConfig.getConfig(SimpleConfig.java:274)
at com.typesafe.config.impl.SimpleConfig.getConfig(SimpleConfig.java:41)
at executor.runEmpJob$.main(runEmpJob.scala:12)
at executor.runEmpJob.main(runEmpJob.scala)
Process finished with exit code 1

The loading happens in ConfigFactory.load(). If you want to load configuration from a specific file, pass it there. Note that ConfigFactory.load(name) resolves the name as a classpath resource; for an absolute path on disk, parse the file directly:
val props = ConfigFactory.parseFile(new File("C:\\Users\\mmishra092815\\IdeaProjects\\use_case_1\\src\\main\\Resource\\filepath.properties"))
As described in the API documentation, the getConfig method does not load configuration from a file - it returns a Config object for a given config path (not a filesystem path!).
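Putting that together, here is a minimal sketch of the corrected job, assuming the properties file defines the Dev.executionMode key used in the question:

import java.io.File
import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession

object runEmpJob {
  def main(args: Array[String]): Unit = {
    // Parse the properties file from its absolute location on disk.
    val envProps = ConfigFactory.parseFile(
      new File("C:\\Users\\mmishra092815\\IdeaProjects\\use_case_1\\src\\main\\Resource\\filepath.properties"))

    // Read the execution mode via a config path (e.g. "Dev.executionMode"), not a file path.
    val spark = SparkSession.builder()
      .appName("emp dept operation")
      .master(envProps.getString("Dev.executionMode"))
      .getOrCreate()
  }
}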


Task not serializable - foreach function spark

I have a function getS3Object to get a JSON object stored in S3:
def getS3Object(s3ObjectName: String): Unit = {
  val bucketName = "xyz"
  val object_to_write = s3client.getObject(bucketName, s3ObjectName)
  val file = new File(filename)
  val fileWriter = new FileWriter(file)
  val bw = new BufferedWriter(fileWriter)
  bw.write(object_to_write)
  bw.close()
  fileWriter.close()
}
My dataframe (df) contains one column where each row is the S3ObjectName
S3ObjectName
a1.json
b2.json
c3.json
d4.json
e5.json
When I execute the below logic I get an error saying "task is not serializable".
Method 1:- df.foreach(x => getS3Object(x.getString(0)))
I tried converting the df to an RDD but still get the same error.
Method 2:- df.rdd.foreach(x => getS3Object(x.getString(0)))
However, it works with collect().
Method 3:- df.collect.foreach(x => getS3Object(x.getString(0)))
I do not wish to use the collect() method, as all the elements of the dataframe are collected to the driver, which can potentially result in an OutOfMemory error.
Is there a way to make the foreach() function work using Method 1?
The problem with your s3Client can be solved as follows. But you have to remember that these functions run on executor nodes (other machines), so your whole val file = new File(filename) thing is probably not going to work here.
You can put your files on some distributed file system like HDFS or S3 instead.
import com.amazonaws.auth.{ AWSStaticCredentialsProvider, BasicAWSCredentials }
import com.amazonaws.services.s3.AmazonS3ClientBuilder

object S3ClientWrapper extends Serializable {
  // s3Client must be created here: an object's fields are initialized on first access
  // in each executor JVM, so the client is never serialized and shipped from the driver.
  val s3Client = {
    val awsCreds = new BasicAWSCredentials("access_key_id", "secret_key_id")
    AmazonS3ClientBuilder.standard()
      .withCredentials(new AWSStaticCredentialsProvider(awsCreds))
      .build()
  }
}

def getS3Object(s3ObjectName: String): Unit = {
  val bucketName = "xyz"
  val object_to_write = S3ClientWrapper.s3Client.getObject(bucketName, s3ObjectName)
  // now you have to solve your file problem
}
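With the client created inside the wrapper object, nothing unserializable is captured by the closure, so Method 1 from the question should work. A rough usage sketch (logging the object size instead of writing a local file, which is one hypothetical way around the executor-local File problem):

df.foreach { row =>
  val s3ObjectName = row.getString(0)
  // The S3 client is looked up lazily on the executor via S3ClientWrapper.
  val obj = S3ClientWrapper.s3Client.getObject("xyz", s3ObjectName)
  println(s"$s3ObjectName -> ${obj.getObjectMetadata.getContentLength} bytes")
}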

creating function for loading conf file and store all properties in case class

I have one app.conf file like below:
some_variable_a = some_infohere_a
some_variable_b = some_infohere_b
Now I need to write a Scala function to load this app.conf file and create a Scala case class to store all these properties, with try-catch, file-existence checks, and corner cases handled.
I am very new to Scala and do not have much knowledge of this, so please show me a correct way to do it.
What I have tried so far is below:
import java.io.File
import com.typesafe.config.{ Config, ConfigFactory }
import com.typesafe.config._
import java.nio.file.Paths

private def ReadConfFile(path: String) = {
  val fileTemp = new File(path)
  if (fileTemp.exists) {
    val confFile = Paths.get(Path).toFile
    val config = ConfigFactory.parseFile(confFile)
    val some_variable_a = config.getString("some_variable_a")
    val some_variable_b = config.getString("some_variable_b")
  }
}
Assuming that app.conf is on your application's classpath (for example in its resources folder), this should be enough to access those variables from the config file:
val config = ConfigFactory.load("app.conf")
val some_variable_a = config.getString("some_variable_a")
val some_variable_b = config.getString("some_variable_b")
In case you need to load it from a file using an absolute path, you can do that this way:
val myConfigFile = new File("/home/user/location/app.conf")
val config = ConfigFactory.parseFile(myConfigFile)
val some_variable_a = config.getString("some_variable_a")
val some_variable_b = config.getString("some_variable_b")
Or do something similar to what you did. In your code there is a typo, I guess, in Paths.get(Path).toFile: "Path" should be "path". If you don't have some variable named Path, you should get a compile error for that. If that is not the problem, then check that you are providing the correct path:
private def ReadConfFile(path: String) = {
  val fileTemp = new File(path)
  if (fileTemp.exists) {
    val confFile = Paths.get(path).toFile
    val config = ConfigFactory.parseFile(confFile)
    val some_variable_a = config.getString("some_variable_a")
    val some_variable_b = config.getString("some_variable_b")
  }
}

ReadConfFile("/home/user/location/app.conf")
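To cover the case-class part of the question, here is a rough sketch using the same keys from app.conf; AppConfig and loadAppConfig are names invented here, and the error handling uses Try rather than an explicit try-catch:

import java.io.File
import scala.util.{ Failure, Success, Try }
import com.typesafe.config.ConfigFactory

// Hypothetical case class holding the two properties from app.conf.
case class AppConfig(someVariableA: String, someVariableB: String)

object ConfigLoader extends App {
  def loadAppConfig(path: String): Try[AppConfig] = Try {
    val file = new File(path)
    require(file.exists, s"Config file not found: $path")
    val config = ConfigFactory.parseFile(file)
    AppConfig(
      config.getString("some_variable_a"),
      config.getString("some_variable_b"))
  }

  loadAppConfig("/home/user/location/app.conf") match {
    case Success(conf) => println(conf)
    case Failure(e)    => println(s"Could not load config: ${e.getMessage}")
  }
}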

Using Spark on Dataproc, how to write to GCS separately from each partition?

Using Spark on GCP Dataproc, I successfully write an entire RDD to GCS like so:
rdd.saveAsTextFile(s"gs://$path")
This produces one file per partition under the same path.
How do I write a file for each partition, with a unique path based on information from the partition?
Below is an invented, non-working, wishful code example:
rdd.mapPartitionsWithIndex(
  (i, partition) => {
    partition.write(path = s"gs://partition_$i", data = partition_specific_data)
  }
)
When I call the function below from within a partition it writes to local disk on my Mac, but on Dataproc I get an error saying that gs is not recognized as a valid path.
def writeLocally(filePath: String, data: Array[Byte], errorMessage: String): Unit = {
  println("Juicy Platform")
  val path = new Path(filePath)
  var ofos: Option[FSDataOutputStream] = null
  try {
    println(s"\nTrying to write to $filePath\n")
    val conf = new Configuration()
    conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    // conf.addResource(new Path("/home/hadoop/conf/core-site.xml"))
    println(conf.toString)
    val fs = FileSystem.get(conf)
    val fos = fs.create(path)
    ofos = Option(fos)
    fos.write(data)
    println(s"\nWrote to $filePath\n")
  }
  catch {
    case e: Exception =>
      logError(errorMessage, s"Exception occurred writing to GCS:\n${ExceptionUtils.getStackTrace(e)}")
  }
  finally {
    ofos match {
      case Some(i) => i.close()
      case _ =>
    }
  }
}
This is the error:
java.lang.IllegalArgumentException: Wrong FS: gs://path/myFile.json, expected: hdfs://cluster-95cf-m
If running on a Dataproc cluster, you shouldn't need to explicitly populate "fs.gs.impl" in the Configuration; a new Configuration() should already contain the necessary mappings.
The main problem here is that val fs = FileSystem.get(conf) uses the fs.defaultFS property of the conf; it has no way of knowing whether you wanted a FileSystem instance specific to HDFS or to GCS. In general, in Hadoop and Spark, a FileSystem instance is fundamentally tied to a single URL scheme; you need to fetch a scheme-specific instance for each different scheme, such as hdfs:// or gs:// or s3://.
The simplest fix is to always use Path.getFileSystem(Configuration) as opposed to FileSystem.get(Configuration), and make sure your path is fully qualified with the scheme:
...
val path = new Path("gs://bucket/foo/data")
val fs = path.getFileSystem(conf)
val fos = fs.create(path)
ofos = Option(fos)
fos.write(data)
...
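Combining the two points, here is a rough sketch of the original per-partition idea; the bucket name and the record formatting are invented here, and it assumes the default Dataproc Configuration already knows about the GCS connector:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

rdd.mapPartitionsWithIndex { (i, partition) =>
  // Fully qualified gs:// path, so getFileSystem returns a GCS-backed FileSystem.
  val path = new Path(s"gs://my-bucket/output/partition_$i")
  val fs = path.getFileSystem(new Configuration())
  val out = fs.create(path)
  try partition.foreach(record => out.write((record.toString + "\n").getBytes("UTF-8")))
  finally out.close()
  Iterator.single(i)
}.count() // forces the lazy mapPartitionsWithIndex to actually run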

Not able to read Configuration file Using scala typesafe API

I have a Spark/Scala project named Omega.
I have a conf file at Omega/conf/omega.config.
I use the Typesafe Config API to load the config file from conf/omega.config.
It was working fine and I was able to read the respective value for each key.
Today, for the first time, I added some more key-value pairs to my omega.config file and tried to retrieve them from my Scala code. It throws
Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'job_name'
This issue started happening after I added a new value for the key job_name to my omega.config file.
I am also not able to read the other newly added key-value pairs; I am still able to read all the old values using the config.getString method.
I am building my Spark/Scala application using Maven.
Omega.config
input_path="/user/cloudera/data"
user_name="surender"
job_name="SAMPLE"
I am not able to access the recently added key "job_name":
package com.pack1

import com.pack2.ApplicationUtil

object OmegaMain {
  val config_loc = "conf/omega.config"

  def main(args: Array[String]): Unit = {
    val config = ApplicationUtil.loadConfig(config_loc)
    val jobName = ApplicationUtil.getFromConfig(config, "job_name")
  }
}

package com.pack2

import java.io.File
import com.typesafe.config.{Config, ConfigFactory}

object ApplicationUtil {
  def loadConfig(filePath: String): Config = {
    val config = ConfigFactory.parseFile(new File(filePath))
    config
  }

  def getFromConfig(config: Config, jobName: String): String = {
    config.getString(jobName)
  }
}
Could someone help me understand what went wrong?
You can try something like:
def loadConfig(filename: String, syntax: ConfigSyntax): Config = {
  val in: InputStream = getClass.getResourceAsStream(filename)
  if (in == null) return null
  val file: File = File.createTempFile(String.valueOf(in.hashCode()), ".conf")
  file.deleteOnExit()
  val out: FileOutputStream = new FileOutputStream(file)
  val buffer: Array[Byte] = new Array(1024)
  var bytesRead: Int = in.read(buffer)
  while (bytesRead != -1) {
    out.write(buffer, 0, bytesRead)
    bytesRead = in.read(buffer)
  }
  out.close()
  val conf: Config = ConfigFactory.parseFile(
    file,
    ConfigParseOptions.defaults()
      .setSyntax(syntax)
      .setAllowMissing(false)
      .setOriginDescription("Merged with " + filename))
  conf
}
Here filename is a file path on the classpath. If you want to update this method to take an external file into account, change the line val file: File = File.createTempFile(...) to val file: File = new File("absolute path of the file").
I am guessing the file isn't on the classpath after you build with Maven.
Since you are using Maven to build a jar, you need omega.config to be on the classpath. This means you either have to put it into src/main/resources (the default resources directory) or explicitly tell Maven to add conf to the resources that end up on the classpath.
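Once the file is on the classpath (for example copied into src/main/resources), a minimal sketch of loading it as a resource instead of by filesystem path:

import com.typesafe.config.{ Config, ConfigFactory }

object ClasspathConfigExample {
  def main(args: Array[String]): Unit = {
    // parseResources resolves the name on the classpath, so it also works inside the Maven-built jar.
    val config: Config = ConfigFactory.parseResources("omega.config")
    val jobName = config.getString("job_name")
    println(jobName)
  }
}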

Reading a large file using Akka Streams

I'm trying out Akka Streams and here is a short snippet that I have:
override def main(args: Array[String]) {
  val filePath = "/Users/joe/Softwares/data/FoodFacts.csv" //args(0)

  val file = new File(filePath)
  println(file.getAbsolutePath)

  // read 1MB of the file as a stream
  val fileSource = SynchronousFileSource(file, 1 * 1024 * 1024)
  val shaFlow = fileSource.map(chunk => {
    println(s"the string obtained is ${chunk.toString}")
  })
  shaFlow.to(Sink.foreach(println(_))).run // fails with a null pointer

  def sha256(s: String) = {
    val messageDigest = MessageDigest.getInstance("SHA-256")
    messageDigest.digest(s.getBytes("UTF-8"))
  }
}
When I run this snippet, I get:
Exception in thread "main" java.lang.NullPointerException
at akka.stream.scaladsl.RunnableGraph.run(Flow.scala:365)
at com.test.api.consumer.DataScienceBoot$.main(DataScienceBoot.scala:30)
at com.test.api.consumer.DataScienceBoot.main(DataScienceBoot.scala)
It seems to me that the fileSource is just empty? Why is this? Any ideas? The FoodFacts.csv is 40MB in size and all I'm trying to do is create a 1MB stream of data!
Even using the defaultChunkSize of 8192 did not work!
Well, Akka Streams 1.0 is deprecated; if you can, use 2.x.
When I tried the 2.0.1 version, using FileIO.fromFile(file) instead of SynchronousFileSource, it was a compile failure where your snippet fails with the null pointer. This was simply because there was no implicit ActorMaterializer in scope. Including it makes it work:
import java.io.File
import scala.concurrent.Future
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{ FileIO, Source }

object TImpl extends App {
  implicit val system = ActorSystem("Sys")
  implicit val materializer = ActorMaterializer()

  val file = new File("somefile.csv")
  val fileSource = FileIO.fromFile(file, 1 * 1024 * 1024)
  val shaFlow: Source[String, Future[Long]] = fileSource.map { chunk =>
    s"the string obtained is ${chunk.toString()}"
  }

  shaFlow.runForeach(println(_))
}
This works for a file of any size. For more information on configuring the dispatcher, refer to the Akka Streams documentation.
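Since the original snippet defines a sha256 helper that is never wired in, here is a rough sketch of folding the streamed chunks into a single digest, assuming the same 2.0.x API as the snippet above:

import java.io.File
import java.security.MessageDigest
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{ FileIO, Sink }

object ShaOfFile extends App {
  implicit val system = ActorSystem("Sys")
  implicit val materializer = ActorMaterializer()
  import system.dispatcher

  val file = new File("somefile.csv")

  // Fold every ByteString chunk into one MessageDigest, so the whole file is never held in memory.
  val digestFuture = FileIO.fromFile(file, 1 * 1024 * 1024)
    .runWith(Sink.fold(MessageDigest.getInstance("SHA-256")) { (md, chunk) =>
      md.update(chunk.toArray)
      md
    })

  digestFuture.foreach { md =>
    println(md.digest().map("%02x".format(_)).mkString)
    system.terminate()
  }
}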