SCALA: Read YAML file using Scala with class constructor

I was trying to read a YAML file using Scala with a constructor call, as below:
import org.yaml.snakeyaml.constructor.Constructor
import java.io.{File, FileInputStream}
import scala.collection.mutable.ArrayBuffer
import org.yaml.snakeyaml.Yaml
object aggregation {
  def main(args: Array[String]): Unit = {
    //val conf = new SparkConf().setAppName("yaml test").setMaster("local[*]")
    //val sc = new SparkContext(conf)
    val yamlfile = "C:\\Users\\***\\Desktop\\mongoDB\\sparkTest\\project\\properties.yaml"
    val input1 = new FileInputStream(new File(yamlfile))
    val yaml = new Yaml(new Constructor(classOf[ReadProperties]))
    val e = yaml.load(input1).asInstanceOf[ReadProperties]
    println(e.file1)
  }
}
And I have a separate class so that I can have the YAML items as beans, as below:
import scala.beans.BeanProperty

class ReadProperties(@BeanProperty var file1: String, @BeanProperty var file2: String) {
  //constructor
}
And the content of my YAML file (properties.yaml) is as below:
file1: C:\\data\\names.txt
file2: C:\\data\\names2.txt
But I get the following error:
Can't construct a java object for tag:yaml.org,2002:ReadProperties; exception=java.lang.NoSuchMethodException: ReadProperties.<init>()
in 'reader', line 1, column 1:
file1: C:\\data\\names.txt
^
at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.construct(Constructor.java:350)
at org.yaml.snakeyaml.constructor.BaseConstructor.constructObject(BaseConstructor.java:182)
But if I use the code below (without the constructor class), it works:
val yaml = new Yaml
val obj = yaml.load(input)
val e = obj.asInstanceOf[java.util.HashMap[String,String]]
println(e)
Result:
{file1=C:\\data\\names.txt, file2=C:\\data\\names2.txt}
16/10/02 01:24:28 INFO SparkContext: Invoking stop() from shutdown hook
I want my constructor to work so that I can refer directly to the values of the parameters in the YAML properties file (for example, there are two parameters in the YAML file, "file1" and "file2", and I want to refer to them directly).
Any help would be appreciated. Thanks in advance!

This post suggests using:
val yaml = new Yaml
val obj = yaml.load(input1)
val e = obj.asInstanceOf[ReadProperties]
This is because SnakeYAML requires a no-argument constructor when deserializing directly into a custom class object.
It is also possible to provide a custom constructor for the ReadProperties class, but since I do not really know Scala, it is beyond my abilities to show code for that. The official documentation may help.

It worked; the actual issue seems to have been something wrong with the environment. Anyway, here is how I did it (yes, I know it's too late to comment here):
import scala.beans.BeanProperty

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("yaml test").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val yamlfile = "C:\\Users\\Chaitu-Padi\\Desktop\\mongoDB\\sparkTest\\project\\properties.yaml"
  val input1 = new FileInputStream(new File(yamlfile))
  val yaml = new Yaml(new Constructor(classOf[yamlProperties]))
  val e = yaml.load(input1)
  val d = e.asInstanceOf[yamlProperties]
  println(d)
}

class yamlProperties {
  @BeanProperty var file1: String = null
  @BeanProperty var file2: String = null
  @BeanProperty var file3: String = null

  override def toString: String = {
    //return "%s,%s".format(file1, file2)
    "%s,%s,%s".format(file1, file2, file3)
  }
}
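For reference, a properties.yaml matching this class would look like the file shown earlier, with a third entry added only because the class declares file3 (the value for file3 here is an assumption, not from the original file):
file1: C:\\data\\names.txt
file2: C:\\data\\names2.txt
file3: C:\\data\\names3.txt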

Related

Task not serializable - foreach function spark

I have a function getS3Object to get a JSON object stored in S3:
def getS3Object(s3ObjectName: String): Unit = {
  val bucketName = "xyz"
  val object_to_write = s3client.getObject(bucketName, s3ObjectName)
  val file = new File(filename)
  val fileWriter = new FileWriter(file)
  val bw = new BufferedWriter(fileWriter)
  bw.write(object_to_write)
  bw.close()
  fileWriter.close()
}
My dataframe (df) contains one column where each row is the S3ObjectName
S3ObjectName
a1.json
b2.json
c3.json
d4.json
e5.json
When I execute the logic below, I get an error saying "task is not serializable".
Method 1: df.foreach(x => getS3Object(x.getString(0)))
I tried converting the df to an RDD, but I still get the same error.
Method 2: df.rdd.foreach(x => getS3Object(x.getString(0)))
However, it works with collect().
Method 3: df.collect.foreach(x => getS3Object(x.getString(0)))
I do not wish to use the collect() method, as it brings all the elements of the dataframe to the driver and can potentially result in an OutOfMemory error.
Is there a way to make the foreach() function work using Method 1?
The problem with your s3Client can be solved as follows. But you have to remember that these functions run on executor nodes (other machines), so your whole val file = new File(filename) approach is probably not going to work there.
You can put your files on some distributed file system like HDFS or S3.
import com.amazonaws.auth.{AWSStaticCredentialsProvider, BasicAWSCredentials}
import com.amazonaws.services.s3.AmazonS3ClientBuilder

object S3ClientWrapper extends Serializable {
  // s3Client must be created here.
  val s3Client = {
    val awsCreds = new BasicAWSCredentials("access_key_id", "secret_key_id")
    AmazonS3ClientBuilder.standard()
      .withCredentials(new AWSStaticCredentialsProvider(awsCreds))
      .build()
  }
}

def getS3Object(s3ObjectName: String): Unit = {
  val bucketName = "xyz"
  val object_to_write = S3ClientWrapper.s3Client.getObject(bucketName, s3ObjectName)
  // now you have to solve your file problem
}
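With the client moved into a serializable wrapper, the original Method 1 call should be able to run on the executors. A minimal usage sketch, assuming the question's df and the getS3Object defined above:
// Sketch only: relies on S3ClientWrapper and getS3Object as defined above,
// and on df having the S3 object key in column 0 (as in the question).
df.foreach(row => getS3Object(row.getString(0)))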

Creating a function for loading a conf file and storing all properties in a case class

I have one app.conf file like below:
some_variable_a = some_infohere_a
some_variable_b = some_infohere_b
Now I need to write a Scala function to load this app.conf file and create a Scala case class to store all these properties, with try-catch, file-existence checks, and corner cases handled.
I am very new to Scala and do not have much knowledge of this, so please show me a correct way to do it.
What I have tried is below:
import java.io.File
import com.typesafe.config.{ Config, ConfigFactory }
import com.typesafe.config._
import java.nio.file.Paths
private def ReadConfFile(path: String) = {
  val fileTemp = new File(path)
  if (fileTemp.exists) {
    val confFile = Paths.get(Path).toFile
    val config = ConfigFactory.parseFile(confFile)
    val some_variable_a = config.getString("some_variable_a")
    val some_variable_b = config.getString("some_variable_b")
  }
}
Assuming that app.conf is in the root folder of your application, this should be enough to access those variables from the config file:
val config = ConfigFactory.load("app.conf")
val some_variable_a = config.getString("some_variable_a")
val some_variable_b = config.getString("some_variable_b")
In case you need to load it from a file using an absolute path, you can do it this way:
val myConfigFile = new File("/home/user/location/app.conf")
val config = ConfigFactory.parseFile(myConfigFile)
val some_variable_a = config.getString("some_variable_a")
val some_variable_b = config.getString("some_variable_b")
Or do something similar to what you did. In your code there is, I guess, a typo in Paths.get(Path).toFile: "Path" should be "path". If you don't have some variable named Path, you should get a compile error for that. If that is not the problem, then check whether you are providing the correct path:
private def ReadConfFile(path: String) = {
  val fileTemp = new File(path)
  if (fileTemp.exists) {
    val confFile = Paths.get(path).toFile
    val config = ConfigFactory.parseFile(confFile)
    val some_variable_a = config.getString("some_variable_a")
    val some_variable_b = config.getString("some_variable_b")
  }
}
ReadConfFile("/home/user/location/app.conf")

The generation of the parse tree with StanfordCoreNLP is stuck

When I use StanfordCoreNLP to generate parses over big data on Spark, one of the tasks gets stuck for a long time. I looked into the error; it shows as follows:
  at edu.stanford.nlp.ling.CoreLabel.<init>(CoreLabel.java:68)
  at edu.stanford.nlp.ling.CoreLabel$CoreLabelFactory.newLabel(CoreLabel.java:248)
  at edu.stanford.nlp.trees.LabeledScoredTreeFactory.newLeaf(LabeledScoredTreeFactory.java:51)
  at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:27)
  at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:34)
  at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:34)
  at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:34)
  at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:34)
The relevant code, I think, is as follows:
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation
import edu.stanford.nlp.util.CoreMap
import scala.collection.JavaConversions._
object CoreNLP {
  def transform(Content: String): String = {
    val v = new CoreNLP
    v.runEnglishAnnotators(Content)
    v.runChineseAnnotators(Content)
  }
}

class CoreNLP {
  def runEnglishAnnotators(inputContent: String): String = {
    var document = new Annotation(inputContent)
    val props = new Properties
    props.setProperty("annotators", "tokenize, ssplit, parse")
    val coreNLP = new StanfordCoreNLP(props)
    coreNLP.annotate(document)
    parserOutput(document)
  }

  def runChineseAnnotators(inputContent: String): String = {
    var document = new Annotation(inputContent)
    val props = new Properties
    val corenlp = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties")
    corenlp.annotate(document)
    parserOutput(document)
  }

  def parserOutput(document: Annotation): String = {
    val sentences = document.get(classOf[SentencesAnnotation])
    var result = ""
    for (sentence: CoreMap <- sentences) {
      val tree = sentence.get(classOf[TreeAnnotation])
      //output the tree to file
      result = result + "\n" + tree.toString
    }
    result
  }
}
My classmate said the data used for testing is recursive and thus the NLP runs endlessly. I don't know whether that's true.
If you add props.setProperty("parse.maxlen", "100"); to your code, that will set the parser to not parse sentences longer than 100 tokens. That can help prevent crash issues. You should experiment to find the best maximum sentence length for your application.
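Applied to the runEnglishAnnotators method from the question, the change would look roughly like this (a sketch; 100 is just the example limit from the answer above):
val props = new Properties
props.setProperty("annotators", "tokenize, ssplit, parse")
// Do not fully parse sentences longer than 100 tokens
props.setProperty("parse.maxlen", "100")
val coreNLP = new StanfordCoreNLP(props)
coreNLP.annotate(document)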

How to read serialized object with a method taking generic type by Scala/Kryo?

Using Kryo to read a serialized object is easy when I know the specific class type, but how do I write a method that takes a simple generic type?
I have code that does not compile:
def load[T](path: String): T = {
  val instantiator = new ScalaKryoInstantiator
  instantiator.setRegistrationRequired(false)
  val kryo = instantiator.newKryo()
  val input = new Input(FileUtils.readFileToByteArray(new File(path)))
  kryo.readObject[T](input, classOf[T])
}
The error I got is:
class type required but T found
kryo.readObject[T](input, classOf[T])
I know what the error means, but don't know the right way to fix it.
The code is modified from my original type-specific code:
def load(path: String): SomeClassType = {
  val instantiator = new ScalaKryoInstantiator
  instantiator.setRegistrationRequired(false)
  val kryo = instantiator.newKryo()
  val input = new Input(FileUtils.readFileToByteArray(new File(path)))
  kryo.readObject(input, classOf[SomeClassType])
}
I've found the answer; the key is ClassTag:
import java.io.File
import scala.reflect.ClassTag
import org.apache.commons.io.FileUtils
import com.esotericsoftware.kryo.io.Input
import com.twitter.chill.ScalaKryoInstantiator

def load[M](path: String)(implicit tag: ClassTag[M]): M = {
  val instantiator = new ScalaKryoInstantiator
  instantiator.setRegistrationRequired(false)
  val kryo = instantiator.newKryo()
  val input = new Input(FileUtils.readFileToByteArray(new File(path)))
  kryo.readObject(input, tag.runtimeClass.asInstanceOf[Class[M]])
}
In some threads, the last line is:
kryo.readObject(input, tag.runtimeClass)
This doesn't work in my case, it has to be:
tag.runtimeClass.asInstanceOf[Class[M]]
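A usage sketch of the fixed method; the path and the stored type here are hypothetical, assuming the file was written earlier with a matching Kryo setup:
// Hypothetical example: the file is assumed to contain one serialized String
val greeting = load[String]("/tmp/greeting.kryo")
println(greeting)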

What happens when you do Java data manipulations in Spark outside of an RDD

I am reading a CSV file from HDFS using Spark. It goes into an FSDataInputStream object. I can't use the textFile() method because it splits up the CSV file by line feed, and I am reading a CSV file that has line feeds inside the text fields. Opencsv from SourceForge handles line feeds inside the cells; it's a nice project, but it accepts a Reader as input. I need to convert the stream to a String so that I can pass it to opencsv as a StringReader. So: HDFS file -> FSDataInputStream -> String -> StringReader -> an opencsv list of strings. Below is the code...
import java.io._
import org.apache.spark.sql.SQLContext
import org.apache.hadoop.fs._
import org.apache.hadoop.conf._
import com.opencsv._
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import java.lang.StringBuilder
val conf = new Configuration()
val hdfsCoreSitePath = new Path("core-site.xml")
val hdfsHDFSSitePath = new Path("hdfs-site.xml")
conf.addResource(hdfsCoreSitePath)
conf.addResource(hdfsHDFSSitePath)
val fileSystem = FileSystem.get(conf)
val csvPath = new Path("/raw_data/project_name/csv/file_name.csv")
val csvFile = fileSystem.open(csvPath)
val fileLen = fileSystem.getFileStatus(csvPath).getLen().toInt
var b = Array.fill[Byte](2048)(0)
var j = 1
val stringBuilder = new StringBuilder()
var bufferString = ""
csvFile.seek(0)
csvFile.read(b)
var bufferString = new String(b,"UTF-8")
stringBuilder.append(bufferString)
while(j != -1) {b = Array.fill[Byte](2048)(0);j=csvFile.read(b);bufferString = new String(b,"UTF-8");stringBuilder.append(bufferString)}
val stringBuilderClean = new StringBuilder()
stringBuilderClean = stringBuilder.substring(0,fileLen)
val reader: Reader = new StringReader(stringBuilderClean.toString()).asInstanceOf[Reader]
val csv = new CSVReader(reader)
val javaContext = new JavaSparkContext(sc)
val sqlContext = new SQLContext(sc)
val javaRDD = javaContext.parallelize(csv.readAll())
//do a bunch of transformations on the RDD
It works, but I doubt it is scalable. It makes me wonder how big a limitation it is to have a driver program that pipes all the data through one JVM. My questions for anyone very familiar with Spark are:
What happens when you do data manipulations across your whole data set like this, before it even gets dropped into the input RDD? Is it just treated as any other program and would be swapping out like crazy, I guess?
How would you then make any Spark program scalable? Do you always NEED to extract the data directly into an input RDD?
Your code loads the data into memory on the driver, and then the Spark driver splits the data and sends each part to the executors; of course, this is not scalable.
There are two ways to resolve your problem.
The first is to write a custom InputFormat to support the CSV file format:
import java.io.{InputStreamReader, IOException}
import com.google.common.base.Charsets
import com.opencsv.{CSVParser, CSVReader}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Seekable, Path, FileSystem}
import org.apache.hadoop.io.compress._
import org.apache.hadoop.io.{ArrayWritable, Text, LongWritable}
import org.apache.hadoop.mapred._
class CSVInputFormat extends FileInputFormat[LongWritable, ArrayWritable] with JobConfigurable {
private var compressionCodecs: CompressionCodecFactory = _
def configure(conf: JobConf) {
compressionCodecs = new CompressionCodecFactory(conf)
}
protected override def isSplitable(fs: FileSystem, file: Path): Boolean = {
val codec: CompressionCodec = compressionCodecs.getCodec(file)
if (null == codec) {
return true
}
codec.isInstanceOf[SplittableCompressionCodec]
}
@throws(classOf[IOException])
def getRecordReader(genericSplit: InputSplit, job: JobConf, reporter: Reporter): RecordReader[LongWritable, ArrayWritable] = {
reporter.setStatus(genericSplit.toString)
val delimiter: String = job.get("textinputformat.record.delimiter")
var recordDelimiterBytes: Array[Byte] = null
if (null != delimiter) {
recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8)
}
new CsvLineRecordReader(job, genericSplit.asInstanceOf[FileSplit], recordDelimiterBytes)
}
}
class CsvLineRecordReader(job: Configuration, split: FileSplit, recordDelimiter: Array[Byte])
extends RecordReader[LongWritable, ArrayWritable] {
private val compressionCodecs = new CompressionCodecFactory(job)
private val maxLineLength = job.getInt(org.apache.hadoop.mapreduce.lib.input.
LineRecordReader.MAX_LINE_LENGTH, Integer.MAX_VALUE)
private var filePosition: Seekable = _
private val file = split.getPath
private val codec = compressionCodecs.getCodec(file)
private val isCompressedInput = codec != null
private val fs = file.getFileSystem(job)
private val fileIn = fs.open(file)
private var start = split.getStart
private var pos: Long = 0L
private var end = start + split.getLength
private var reader: CSVReader = _
private var decompressor: Decompressor = _
private lazy val CSVSeparator =
if (recordDelimiter == null)
CSVParser.DEFAULT_SEPARATOR
else
recordDelimiter(0).asInstanceOf[Char]
if (isCompressedInput) {
decompressor = CodecPool.getDecompressor(codec)
if (codec.isInstanceOf[SplittableCompressionCodec]) {
val cIn = (codec.asInstanceOf[SplittableCompressionCodec])
.createInputStream(fileIn, decompressor, start, end, SplittableCompressionCodec.READ_MODE.BYBLOCK)
reader = new CSVReader(new InputStreamReader(cIn), CSVSeparator)
start = cIn.getAdjustedStart
end = cIn.getAdjustedEnd
filePosition = cIn
}else {
reader = new CSVReader(new InputStreamReader(codec.createInputStream(fileIn, decompressor)), CSVSeparator)
filePosition = fileIn
}
} else {
fileIn.seek(start)
reader = new CSVReader(new InputStreamReader(fileIn), CSVSeparator)
filePosition = fileIn
}
@throws(classOf[IOException])
private def getFilePosition: Long = {
if (isCompressedInput && null != filePosition) {
filePosition.getPos
}else
pos
}
private def nextLine: Option[Array[String]] = {
if (getFilePosition < end){
//readNext automatical split the line to elements
reader.readNext() match {
case null => None
case elems => Some(elems)
}
} else
None
}
override def next(key: LongWritable, value: ArrayWritable): Boolean =
nextLine
.exists { elems =>
key.set(pos)
val lineLength = elems.foldRight(0)((a, b) => a.length + 1 + b)
pos += lineLength
value.set(elems.map(new Text(_)))
if (lineLength < maxLineLength) true else false
}
@throws(classOf[IOException])
def getProgress: Float =
if (start == end)
0.0f
else
Math.min(1.0f, (getFilePosition - start) / (end - start).toFloat)
override def getPos: Long = pos
override def createKey(): LongWritable = new LongWritable
override def close(): Unit = {
try {
if (reader != null) {
reader.close
}
} finally {
if (decompressor != null) {
CodecPool.returnDecompressor(decompressor)
}
}
}
override def createValue(): ArrayWritable = new ArrayWritable(classOf[Text])
}
Simple test example:
val arrayRdd = sc.hadoopFile("source path", classOf[CSVInputFormat], classOf[LongWritable], classOf[ArrayWritable],
sc.defaultMinPartitions).map(_._2.get().map(_.toString))
arrayRdd.collect().foreach(e => println(e.mkString(",")))
The other way, which I prefer, uses spark-csv written by Databricks, which supports the CSV file format well; you can find usage examples on its GitHub page.
Update for spark-csv: using univocity as the parserLib, it can handle multi-line cells:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .option("parserLib", "univocity")
  .option("inferSchema", "true") // Automatically infer data types
  .load("source path")
What happens when you do data manipulations across your whole data set like this, before it even gets dropped into the input RDD? Is it just treated as any other program and would be swapping out like crazy, I guess?
You load the whole dataset into local memory. So if you have the memory, it works.
How would you then make any Spark program scalable?
You have to select a data format that Spark can load, or change your application so that it can load the data format into Spark directly, or a bit of both.
In this case you could look at creating a custom InputFormat that splits on something other than newlines. I think you would also want to look at how you write your data, so that it is partitioned in HDFS at record boundaries rather than newlines.
However, I suspect the simplest answer is to encode the data differently: JSON Lines, or encode the newlines in the CSV file during the write, or Avro, or... anything that fits better with Spark and HDFS.
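For example, if the data were rewritten as JSON Lines (one record per line, with embedded newlines escaped inside string values), Spark could read it directly and in parallel; the path below is hypothetical:
// Each line is a complete JSON record, so splitting on newlines is safe again
val df = sqlContext.read.json("/raw_data/project_name/json/file_name.jsonl")
df.printSchema()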