Iteration over typesafe files - scala

I have read this topic
Iterate over fields in typesafe config
and made some changes, but I still don't know how to iterate over conf files in the Play Framework.
Providers=[{1234 : "CProduct"},
{12345 : "ChProduct"},
{123 : "SProduct"}]
This is my conf file, called providers.conf. How can I iterate over the entries and create a dropdown box from them? I would like to read them as a map if possible, i.e. Map[Int, String].
I know I have to read them like this:
val config = ConfigFactory.load("providers.conf").getConfigList("Providers")
I can read the conf file like that, but I need to use the values from a template, so I have to convert them to a HashMap, a List, or some other functional collection.
Cheers,

I'm not sure if this is the most efficient way to do this, but this works:
1) Let's get our config file:
val config = ConfigFactory.load()
scala> val providers = config.getConfigList("providers")
providers: java.util.List[_ <: com.typesafe.config.Config] = [Config(SimpleConfigObject({"id":"1234","name":" Product2"})), Config(SimpleConfigObject({"id":"4523","name":"Product1"})), Config(SimpleConfigObject({"id":"432","name":" Product3"}))]
2) For this example, introduce a Provider entity:
case class Provider(id: String, name: String)
3) Now let's convert the list of configs to providers:
import scala.collection.JavaConversions._
providers.map(conf => Provider(conf.getString("id"), conf.getString("name"))).toList
res27: List[Provider] = List(Provider(1234, Product2), Provider(4523,Product1), Provider(432, Product3))
We need to convert it explicitly with toList, because by default a Java List converts to a Buffer.
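For the layout in the question, where each list element is a single-entry object such as {1234 : "CProduct"}, a Map[Int, String] could be built along the following lines. This is a minimal sketch, assuming Scala 2.13 or later (for scala.jdk.CollectionConverters) and the Typesafe Config library on the classpath. The object name ProviderMap and the inline parseString input are illustrative only; the inline string stands in for ConfigFactory.load("providers.conf").

```scala
import com.typesafe.config.{Config, ConfigFactory}
import scala.jdk.CollectionConverters._

object ProviderMap {
  // Hypothetical helper: flattens a list of single-entry objects into one Map.
  def fromConfig(config: Config): Map[Int, String] =
    config
      .getConfigList("Providers")
      .asScala
      .flatMap { provider =>
        // root() exposes the object; unwrapped() turns it into a java.util.Map
        // keyed by the raw key names, with plain Java values.
        provider.root().unwrapped().asScala.map {
          case (id, name) => id.toInt -> name.toString
        }
      }
      .toMap

  // Inline copy of the question's file, used here instead of providers.conf.
  val sample: Config = ConfigFactory.parseString(
    """Providers = [{1234 : "CProduct"},
      |{12345 : "ChProduct"},
      |{123 : "SProduct"}]""".stripMargin)
}
```

The resulting map can be handed to a template as-is; calling toSeq.sortBy(_._1) first gives the dropdown a stable order.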

Here is my solution:
val config = ConfigFactory.load("providers.conf").getConfigList("Providers")
var providerlist = new java.util.ArrayList[model.Provider]
val providers = (0 until config.size())
providers foreach {
count =>
val iterator = config.get(count).entrySet().iterator()
while(iterator.hasNext()) {
val entry = iterator.next()
val p = new Provider(entry.getKey(), entry.getValue().render())
providerlist.add(p);
}
}
println(providerlist.get(0).providerId+providerlist.get(0).providerName)
println(providerlist.get(33).providerId+providerlist.get(33).providerName)
and my Provider class:
package model
case class Provider(providerId: String, providerName: String)

Related

Create Spark UDF of a function that depends on other resources

I have a code for tokenizing a string.
But that tokenization method uses some data which is loaded when my application starts.
val stopwords = getStopwords();
val tokens = tokenize("hello i am good",stopwords)
def tokenize(string:String,stopwords: List[String]) : List[String] = {
val splitted = string.split(" ")
// I use this stopwords for filtering my splitted array.
// Then i return the items back.
}
Now I want to make the tokenize method a UDF for Spark. I want to use it to create a new column in DataFrame transformations.
I have created simple UDFs before, but they had no dependencies such as items that need to be read from a text file.
Can someone tell me how to do this kind of operation?
This is what I have tried, and it is working.
val moviesDF = Seq(
("kingdomofheaven"),
("enemyatthegates"),
("salesinfointheyearofdecember"),
).toDF("column_name")
val tokenizeUDF: UserDefinedFunction = udf(tokenize(_: String): List[String])
moviesDF.withColumn("tokenized", tokenizeUDF(col("column_name"))).show(100, false)
def tokenize(name: String): List[String] = {
val wordFreqMap: Map[String, Double] = DataProviderUtil.getWordFreqMap()
val stopWords: Set[String] = DataProviderUtil.getStopWordSet()
val maxLengthWord: Int = wordFreqMap.keys.maxBy(_.length).length
.................
.................
}
It's giving me the expected output:
+----------------------------+--------------------------+
|column_name                 |tokenized                 |
+----------------------------+--------------------------+
|kingdomofheaven             |[kingdom, heaven]         |
|enemyatthegates             |[enemi, gate]             |
|salesinfointheyearofdecember|[sale, info, year, decemb]|
+----------------------------+--------------------------+
Now my question is: will it work when it's deployed? Currently I am running it locally. My main concern is that this function reads from a file to get information like stop words and word frequencies to make the tokenization possible. Will registering it like this work properly?
At this point, if you deploy this code, Spark will try to serialize your DataProviderUtil, so you would need to mark that class as serializable. Another possibility is to declare your logic inside an object. Functions inside objects are treated like static functions, and they are not serialized.
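The object-based variant suggested above might look like the following. This is a sketch, not the asker's actual code: TokenizerResources, its hard-coded stop-word set, and the simplified tokenize body are placeholders for DataProviderUtil and the real logic. Because the resource is a lazy val inside an object, each executor JVM initializes it on first use instead of receiving it serialized from the driver, so any file it reads has to be reachable from the executors.

```scala
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

object TokenizerResources {
  // Placeholder data; the real code would read the stop-word file here.
  // lazy val: loaded once per JVM, on first access.
  lazy val stopWords: Set[String] = Set("i", "am", "the", "of")

  // Simplified stand-in for the real tokenizer: split on spaces, drop stop words.
  def tokenize(text: String): Seq[String] =
    text.split(" ").toSeq.filter(word => word.nonEmpty && !stopWords.contains(word))
}

object TokenizerUdf {
  // The closure only refers to the object, so no resource data travels with it.
  val tokenizeUDF: UserDefinedFunction =
    udf((text: String) => TokenizerResources.tokenize(text))
}
```

It can then be applied as in the question, for example moviesDF.withColumn("tokenized", TokenizerUdf.tokenizeUDF(col("column_name"))).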

Save Dataset elements to files with specified file path

I have a dataset of an Event case class, and I would like to save the JSON string element inside each event to a file on S3 with a path like bucketName/service/yyyy/mm/dd/hh/[SomeGuid].gz
So for example, the events case class looks like this:
case class Event(
hourPath: String, // e.g. bucketName/service/yyyy/mm/dd/hh/
json: String // The json line that represents this particular event.
... // Other properties used in earlier transformations.
)
Is there a way to save the dataset so that the events belonging to a particular hour are written to a file on S3?
Calling partitionBy on the DataFrameWriter is the closest I can get, but the file path isn't exactly what I want.
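You can iterate over each item and write it to a file in S3. Doing it with Spark is efficient because the writes are executed in parallel.
This code is working for me:
// foreach runs on the executors; collect would first pull every event to the driver.
// saveJSONtoS3 must live in an object or a serializable class for this to work.
eventsDS.rdd.foreach(x => saveJSONtoS3(x.hourPath, x.json))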
def saveJSONtoS3(path: String, jsonString: String) : Unit = {
val bucketName = path.substring(0,path.indexOf('/'));
val file = path.substring(bucketName.length()+1);
val creds = new BasicAWSCredentials(AWS_ACCESS_KEY, AWS_SECRET_KEY)
val amazonS3Client = new AmazonS3Client(creds)
val meta = new ObjectMetadata();
amazonS3Client.putObject(bucketName, file, new ByteArrayInputStream(jsonString.getBytes), meta)
}
You need to import:
import java.io.ByteArrayInputStream
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.model.ObjectMetadata
You need to include the aws-java-sdk library.
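The code above uploads the raw JSON under whatever key is derived from the path. To get closer to the bucketName/service/yyyy/mm/dd/hh/[SomeGuid].gz layout from the question, the key and the compressed payload could be prepared roughly as follows. This sketch uses only the JDK and the Scala standard library; GzipUpload, S3Target, splitHourPath, gzip, and gunzip are hypothetical names, and the resulting bytes would still have to be passed to putObject as shown above.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.UUID
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import scala.io.Source

object GzipUpload {
  // Hypothetical holder for the two pieces putObject needs.
  final case class S3Target(bucket: String, key: String)

  // Splits "bucket/service/yyyy/mm/dd/hh/" and appends a random "<guid>.gz" name.
  // Assumes the path contains at least one '/'.
  def splitHourPath(hourPath: String): S3Target = {
    val slash  = hourPath.indexOf('/')
    val bucket = hourPath.substring(0, slash)
    val prefix = hourPath.substring(slash + 1).stripSuffix("/")
    S3Target(bucket, s"$prefix/${UUID.randomUUID()}.gz")
  }

  // Compresses the JSON line into gzip bytes.
  def gzip(json: String): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out    = new GZIPOutputStream(buffer)
    try out.write(json.getBytes(UTF_8)) finally out.close()
    buffer.toByteArray
  }

  // Inverse of gzip, included only to check the round trip.
  def gunzip(bytes: Array[Byte]): String = {
    val in = new GZIPInputStream(new ByteArrayInputStream(bytes))
    try Source.fromInputStream(in, "UTF-8").mkString finally in.close()
  }
}
```

Note that this still produces one object per event; grouping the events of an hour into a single file would need an extra grouping step before the upload.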

How to efficiently read/parse loads of .gz files in a s3 folder with spark on EMR

I'm trying to read all files in a directory on s3 via a spark app that's executing on EMR.
The data is stored in a typical format like "s3a://Some/path/yyyy/mm/dd/hh/blah.gz"
If I use deeply nested wildcards (e.g. "s3a://SomeBucket/SomeFolder/*/*/*/*/*.gz"), the performance is terrible: it takes about 40 minutes to read a few tens of thousands of small gzipped JSON files.
It works, but losing 40 minutes to test some code is really bad.
I have two other approaches that my research has told me are much more performant.
Using the hadoop.fs library (2.8.5) I try to read each file path I provide it.
private def getEventDataHadoop(
eventsFilePaths: RDD[String]
)(implicit sqlContext: SQLContext): Try[RDD[String]] =
Try(
{
val conf = sqlContext.sparkContext.hadoopConfiguration
eventsFilePaths.map(eventsFilePath => {
val p = new Path(eventsFilePath)
val fs = p.getFileSystem(conf)
val eventData: FSDataInputStream = fs.open(p)
IOUtils.toString(eventData)
})
}
)
These file paths are generated by the below code:
private[disneystreaming] def generateInputBucketPaths(
s3Protocol: String,
bucketName: String,
service: String,
region: String,
yearsMonths: Map[String, Set[String]]
): Try[Set[String]] =
Try(
{
val days = 1 to 31
val hours = 0 to 23
val dateFormatter: Int => String = buildDateFormat("00")
yearsMonths.flatMap { yearMonth: (String, Set[String]) =>
for {
month: String <- yearMonth._2
day: Int <- days
hour: Int <- hours
} yield
s"$s3Protocol$bucketName/$service/$region/${dateFormatter(yearMonth._1.toInt)}/${dateFormatter(month.toInt)}/" +
s"${dateFormatter(day)}/${dateFormatter(hour)}/*.gz"
}.toSet
}
)
The hadoop.fs code fails because the Path class is not serializable. I can't think of how I can get around that.
So this led me to another approach using AmazonS3Client, where I just ask the client to give me all the file paths in a folder (or prefix), then parse the files to a string, which will likely fail due to them being compressed:
private def getEventDataS3(bucketName: String, prefix: String)(
implicit sqlContext: SQLContext
): Try[RDD[String]] =
Try(
{
import com.amazonaws.services.s3._, model._
import scala.collection.JavaConverters._
val request = new ListObjectsRequest()
request.setBucketName(bucketName)
request.setPrefix(prefix)
request.setMaxKeys(Integer.MAX_VALUE)
val s3 = new AmazonS3Client(new ProfileCredentialsProvider("default"))
val objs: ObjectListing = s3.listObjects(request) // Note that this method returns truncated data if longer than the "pageLength" above. You might need to deal with that.
sqlContext.sparkContext
.parallelize(objs.getObjectSummaries.asScala.map(_.getKey).toList)
.flatMap { key =>
Source
.fromInputStream(s3.getObject(bucketName, key).getObjectContent: InputStream)
.getLines()
}
}
)
This code produces an exception because the profile file cannot be null ("java.lang.IllegalArgumentException: profile file cannot be null").
Remember this code is running on EMR within AWS, so how do I provide the credentials it wants? How are other people running spark jobs on EMR using this client?
Any help with getting any of these approaches working is much appreciated.
Path is serializable in later Hadoop releases, because it is useful to be able to use it in Spark RDDs. Until then, convert the path to a URI, marshal that, and create a new path from that URI inside your closure.
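The URI workaround can be sketched as below, assuming Hadoop and Spark are on the classpath. PathShipping, toWire, fromWire, and readAll are illustrative names. Two further assumptions go beyond the answer: a fresh Configuration is created on each executor (the Hadoop Configuration object is not java.io.Serializable either), and each path names a single gzipped file rather than a glob.

```scala
import java.net.URI
import java.util.zip.GZIPInputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.io.Source

object PathShipping {
  // Driver side: turn the Path into a plain String, which serializes without trouble.
  def toWire(path: Path): String = path.toUri.toString

  // Executor side: rebuild the Path from the shipped string.
  def fromWire(uri: String): Path = new Path(new URI(uri))

  def readAll(sc: SparkContext, paths: Seq[Path]): RDD[String] =
    sc.parallelize(paths.map(toWire)).mapPartitions { uris =>
      // Assumption: executors find their filesystem settings on the classpath.
      val conf = new Configuration()
      uris.flatMap { uri =>
        val path = fromWire(uri)
        val in   = new GZIPInputStream(path.getFileSystem(conf).open(path))
        // Read eagerly so the stream can be closed before moving on.
        try Source.fromInputStream(in, "UTF-8").getLines().toList
        finally in.close()
      }
    }
}
```

Only strings cross the driver/executor boundary here, so the serializability of Path no longer matters.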

Iterate over fields in typesafe config file

In my Scala application I have a configuration like this:
datasets {
dataset1 = "path1"
dataset2 = "path2"
dataset3 = "path3"
}
How do I iterate over all the datasets to get a map [dataset, path]?
You can call entrySet() after getting the config with getConfig():
import scala.collection.JavaConversions._
val config = ConfigFactory.load()
val datasets = config.getConfig("datasets")
val configMap = datasets.entrySet().toList.map(
entry => (entry.getKey, entry.getValue)
).toMap
You will end up with a Map[String, ConfigValue].
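If plain strings are wanted rather than ConfigValue instances, each value can be unwrapped. A minimal sketch, assuming Scala 2.13 or later for scala.jdk.CollectionConverters; DatasetPaths and the inline parseString input are illustrative and stand in for the application's real configuration.

```scala
import com.typesafe.config.{Config, ConfigFactory}
import scala.jdk.CollectionConverters._

object DatasetPaths {
  // Converts every entry under "datasets" to a (name -> path) pair of Strings.
  def load(config: Config): Map[String, String] =
    config
      .getConfig("datasets")
      .entrySet()
      .asScala
      .map(entry => entry.getKey -> entry.getValue.unwrapped().toString)
      .toMap

  // Inline copy of the configuration from the question.
  val sample: Config = ConfigFactory.parseString(
    """datasets {
      |  dataset1 = "path1"
      |  dataset2 = "path2"
      |  dataset3 = "path3"
      |}""".stripMargin)
}
```

Keep in mind that entrySet() returns path expressions as keys, so values nested inside sub-objects show up under dotted keys rather than as nested maps.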
You can try my scala wrapper https://github.com/andr83/scalaconfig - it supports reading native scala types directly from config object:
val datasets = config.as[Map[String, String]]("datasets")

How do I provide basic configuration for a Scala application?

I am working on a small GUI application written in Scala. There are a few settings that the user will set in the GUI and I want them to persist between program executions. Basically I want a scala.collection.mutable.Map that automatically persists to a file when modified.
This seems like it must be a common problem, but I have been unable to find a lightweight solution. How is this problem typically solved?
I do a lot of this, and I use .properties files (it's idiomatic in Java-land). I keep my config pretty straight-forward by design, though. If you have nested config constructs you might want a different format like YAML (if humans are the main authors) or JSON or XML (if machines are the authors).
Here's some example code for loading props, manipulating as Scala Map, then saving as .properties again:
import java.io._
import java.util._
import scala.collection.JavaConverters._
val f = new File("test.properties")
// test.properties:
// foo=bar
// baz=123
val props = new Properties
// Note: in real code make sure all these streams are
// closed carefully in try/finally
val fis = new InputStreamReader(new FileInputStream(f), "UTF-8")
props.load(fis)
fis.close()
println(props) // {baz=123, foo=bar}
val map = props.asScala // Get to Scala Map via JavaConverters
map("foo") = "42"
map("quux") = "newvalue"
println(map) // Map(baz -> 123, quux -> newvalue, foo -> 42)
println(props) // {baz=123, quux=newvalue, foo=42}
val fos = new OutputStreamWriter(new FileOutputStream(f), "UTF-8")
props.store(fos, "")
fos.close()
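To get the "persists automatically when modified" behaviour the question asks for, the load and store steps above can be wrapped in a small class that rewrites the file on every change. This is a rough sketch using only the JDK; PersistentSettings and its method names are hypothetical, and error handling is left out.

```scala
import java.io.{File, FileInputStream, FileOutputStream, InputStreamReader, OutputStreamWriter}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Properties

final class PersistentSettings(file: File) {
  private val props = new Properties

  // Load existing settings, if the file is already there.
  if (file.exists()) {
    val in = new InputStreamReader(new FileInputStream(file), UTF_8)
    try props.load(in) finally in.close()
  }

  def get(key: String): Option[String] = Option(props.getProperty(key))

  // Enables the settings(key) = value syntax and writes through to disk.
  def update(key: String, value: String): Unit = {
    props.setProperty(key, value)
    save()
  }

  def remove(key: String): Unit = {
    props.remove(key)
    save()
  }

  // Rewrites the whole file; the stream is closed even if store fails.
  private def save(): Unit = {
    val out = new OutputStreamWriter(new FileOutputStream(file), UTF_8)
    try props.store(out, "settings") finally out.close()
  }
}
```

Because the class defines update, an assignment such as settings("theme") = "dark" is saved immediately. Rewriting the whole file on every change is fine for a handful of GUI settings but would be wasteful for large maps.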
Here's an example of using XML and a case class for reading a config. A real class can be nicer than a map. (You could also do what sbt and at least one project do, take the config as Scala source and compile it in; saving it is less automatic. Or as a repl script. I haven't googled, but someone must have done that.)
Here's the simpler code.
This version uses a case class:
import scala.xml.Node
case class PluginDescription(name: String, classname: String) {
def toXML: Node = {
<plugin>
<name>{name}</name>
<classname>{classname}</classname>
</plugin>
}
}
object PluginDescription {
def fromXML(xml: Node): PluginDescription = {
// extract one field
def getField(field: String): Option[String] = {
val text = (xml \\ field).text.trim
if (text == "") None else Some(text)
}
def extracted = {
val name = "name"
val claas = "classname"
val vs = Map(name -> getField(name), claas -> getField(claas))
if (vs.values exists (_.isEmpty)) fail()
else PluginDescription(name = vs(name).get, classname = vs(claas).get)
}
def fail() = throw new RuntimeException("Bad plugin descriptor.")
// check the top-level tag
xml match {
case <plugin>{_*}</plugin> => extracted
case _ => fail()
}
}
}
This code reflectively calls the apply of a case class. The use case is that fields missing from config can be supplied by default args. No type conversions here. E.g., case class Config(foo: String = "bar").
// isn't it easier to write a quick loop to reflect the field names?
import scala.reflect.runtime.{currentMirror => cm, universe => ru}
import ru._
def fromXML(xml: Node): Option[PluginDescription] = {
def extract[A]()(implicit tt: TypeTag[A]): Option[A] = {
// extract one field
def getField(field: String): Option[String] = {
val text = (xml \\ field).text.trim
if (text == "") None else Some(text)
}
val apply = ru.newTermName("apply")
val module = ru.typeOf[A].typeSymbol.companionSymbol.asModule
val ts = module.moduleClass.typeSignature
val m = (ts member apply).asMethod
val im = cm reflect (cm reflectModule module).instance
val mm = im reflectMethod m
def getDefault(i: Int): Option[Any] = {
val n = ru.newTermName("apply$default$" + (i+1))
val m = ts member n
if (m == NoSymbol) None
else Some((im reflectMethod m.asMethod)())
}
def extractArgs(pss: List[List[Symbol]]): List[Option[Any]] =
pss.flatten.zipWithIndex map (p => getField(p._1.name.encoded) orElse getDefault(p._2))
val args = extractArgs(m.paramss)
if (args exists (!_.isDefined)) None
else Some(mm(args.flatten: _*).asInstanceOf[A])
}
// check the top-level tag
xml match {
case <plugin>{_*}</plugin> => extract[PluginDescription]()
case _ => None
}
}
XML has loadFile and save; it's too bad there seems to be no one-liner for Properties.
$ scala
Welcome to Scala version 2.10.0-RC5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_06).
Type in expressions to have them evaluated.
Type :help for more information.
scala> import reflect.io._
import reflect.io._
scala> import java.util._
import java.util._
scala> import java.io.{StringReader, File=>JFile}
import java.io.{StringReader, File=>JFile}
scala> import scala.collection.JavaConverters._
import scala.collection.JavaConverters._
scala> val p = new Properties
p: java.util.Properties = {}
scala> p load new StringReader(
| (new File(new JFile("t.properties"))).slurp)
scala> p.asScala
res2: scala.collection.mutable.Map[String,String] = Map(foo -> bar)
As it all boils down to serializing a map / object to a file, your choices are:
classic Java serialization to a byte stream
serialization to XML
serialization to JSON (easy using Jackson, or Lift-JSON)
use of a properties file (ugly, no utf-8 support)
serialization to a proprietary format (ugly, why reinvent the wheel)
I suggest converting the Map to Properties and vice versa. "*.properties" files are the standard for storing configuration in the Java world, so why not use them for Scala?
The common formats are *.properties and *.xml. Since Scala supports XML natively, using an XML config is easier than it is in Java.