Reading YAML in Scala without 'case class'

I am trying to read a YAML file from Scala, and I am able to read it using the code given below. One disadvantage I see is the need to create a case class that maps to the YAML file I am using. Every time I change the content of the YAML, I have to change the case class. Is there any way in Scala to read YAML without needing to create a case class? (I have also used Python to read YAML, where we do not have the constraint of mapping a class to the YAML structure, and I would like to do something similar in Scala.)
Note: when I add a new property to the YAML and my case class does not have a matching property, I get an UnrecognizedPropertyException.
package yamlexamples

import com.fasterxml.jackson.dataformat.yaml.YAMLFactory
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

object YamlTest extends App {

  case class Prop(country: String, state: List[String])

  val mapper: ObjectMapper = new ObjectMapper(new YAMLFactory())
  mapper.registerModule(DefaultScalaModule)

  val fileStream = getClass.getResourceAsStream("/sample.yaml")
  val prop: Prop = mapper.readValue(fileStream, classOf[Prop])
  println(prop.country + ", " + prop.state)
}
sample.yaml (this works with the code above)
country: US
state:
- TX
- VA
sample.yaml (this throws an exception)
country: US
state:
- TX
- VA
continent: North America

You could parse the YAML file and load it as a collections object instead of a case class, but this comes at the cost of losing type safety in your code. The code below uses the load function provided by SnakeYAML's Yaml class. Note that load has overloaded methods to read from an InputStream or Reader as well.
import scala.collection.JavaConverters._
import org.yaml.snakeyaml.Yaml

val yaml = new Yaml()
val data = yaml.load(
  """
    |country: US
    |state:
    | - TX
    | - VA
    |continent: North America
  """.stripMargin).asInstanceOf[java.util.Map[String, Any]].asScala
Now data is a Scala mutable collection instead of a case class:
data: scala.collection.mutable.Map[String,Any] = Map(country -> US, state -> [TX, VA], continent -> North America)
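If the YAML lives in a file rather than an inline string, the same approach works with one of the overloaded load methods mentioned above. A minimal sketch, assuming the same sample.yaml from the question is on the classpath:

import scala.collection.JavaConverters._
import org.yaml.snakeyaml.Yaml

val yaml = new Yaml()
val stream = getClass.getResourceAsStream("/sample.yaml")
val data = yaml.load(stream).asInstanceOf[java.util.Map[String, Any]].asScala

println(data("country"))       // US
println(data.get("continent")) // Some(North America) if present, None otherwise

Because the result is just a Map, adding a new key such as continent never throws an UnrecognizedPropertyException.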

You could parse the YAML file using Jackson or SnakeYAML. However, Jackson does not support references/anchors, while SnakeYAML does, so it is better to parse the YAML file with SnakeYAML and then access the data elements with the Jackson library.
import java.io.{File, FileInputStream}
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.yaml.snakeyaml.Yaml

// Parse the YAML file with SnakeYAML, since the Jackson parser does not support anchors and references
val ios = new FileInputStream(new File(yamlFilePath))
val yaml = new Yaml()
val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
val yamlObj = yaml.loadAs(ios, classOf[Any])

// Convert the parsed YAML to a Jackson tree, since it offers more flexibility
val jsonString = mapper.writerWithDefaultPrettyPrinter().writeValueAsString(yamlObj) // pretty-printed JSON string, easy to read
val jsonObj: JsonNode = mapper.readTree(jsonString)
Finally, you get a JsonNode object, which lets you convert the data to other types.
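As a rough sketch of what that access looks like, assuming the YAML has the same country/state shape as the question's sample.yaml:

import scala.collection.JavaConverters._

val country    = jsonObj.get("country").asText()   // "US"
val firstState = jsonObj.at("/state/0").asText()   // "TX" (JSON Pointer syntax)
val states     = jsonObj.get("state").elements().asScala.map(_.asText()).toList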

Related

Simple Spark Scala Post to External Rest API Example

I am new to Spark and Scala; I just want to read a JSON file and post the content to an external REST API server. Can anyone provide a simple example or some guidelines?
You probably do not want to use Spark for this. Spark is an analytical engine for processing large amounts of data; unless you are reading massive amounts of JSON from HDFS, this task is better suited to plain Scala. Look up ways to read a JSON file in Scala and to send that content to a server in Scala.
Here are some great places to get started:
Scala Read JSON file
https://alvinalexander.com/scala/how-to-send-json-post-data-to-restful-url-in-scala
The following is from the above URL:
import java.io._
import org.apache.commons._
import org.apache.http._
import org.apache.http.client._
import org.apache.http.client.methods.HttpPost
import org.apache.http.impl.client.DefaultHttpClient
import java.util.ArrayList
import org.apache.http.message.BasicNameValuePair
import org.apache.http.client.entity.UrlEncodedFormEntity
import com.google.gson.Gson

case class Person(firstName: String, lastName: String, age: Int)

object HttpJsonPostTest extends App {

  // create our object as a json string
  val spock = new Person("Leonard", "Nimoy", 82)
  val spockAsJson = new Gson().toJson(spock)

  // add name value pairs to a post object
  val post = new HttpPost("http://localhost:8080/posttest")
  val nameValuePairs = new ArrayList[NameValuePair]()
  nameValuePairs.add(new BasicNameValuePair("JSON", spockAsJson))
  post.setEntity(new UrlEncodedFormEntity(nameValuePairs))

  // send the post request
  val client = new DefaultHttpClient
  val response = client.execute(post)
  println("--- HEADERS ---")
  response.getAllHeaders.foreach(arg => println(arg))
}
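For the original question (posting the contents of an existing JSON file rather than a Gson-serialised object), here is a minimal sketch along the same lines; the input path is hypothetical and the endpoint reuses the test URL from the example above:

import scala.io.Source
import org.apache.http.client.methods.HttpPost
import org.apache.http.entity.{ContentType, StringEntity}
import org.apache.http.impl.client.HttpClients

// read the whole JSON file into a string (fine for small files)
val source = Source.fromFile("/path/to/input.json") // hypothetical path
val jsonContent = try source.mkString finally source.close()

// post it as the request body with a JSON content type
val post = new HttpPost("http://localhost:8080/posttest")
post.setEntity(new StringEntity(jsonContent, ContentType.APPLICATION_JSON))

val client = HttpClients.createDefault()
val response = client.execute(post)
println(response.getStatusLine)
client.close()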

Can we create an XML file with specific nodes with Spark and Scala?

I have another question about Spark and Scala. I want to use that technology to get data and generate an XML file.
Therefore, I want to know whether it is possible to create the nodes myself (not automatic creation) and which library we can use. I searched but found nothing very interesting (being new to this technology, I don't know many keywords).
I want to know whether there is something in Spark like the code below (I wrote it in Scala; it works locally, but I can't use new File() in Spark).
import java.io.File
import javax.xml.parsers.{DocumentBuilder, DocumentBuilderFactory}
import javax.xml.transform.{Transformer, TransformerFactory}
import javax.xml.transform.dom.DOMSource
import javax.xml.transform.stream.StreamResult
import org.w3c.dom.{Attr, Element}

val docBuilder: DocumentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
val document = docBuilder.newDocument()

var root: Element = document.createElement("<name Balise>")
var attr: Attr = document.createAttribute("<attr1>")
attr.setValue("<value attr1>")
root.setAttributeNode(attr)

attr = document.createAttribute("<attr2>")
attr.setValue("<value attr2>")
root.setAttributeNode(attr)

document.appendChild(root)
document.setXmlStandalone(true)

var transformerFactory: TransformerFactory = TransformerFactory.newInstance()
var transformer: Transformer = transformerFactory.newTransformer()
var domSource: DOMSource = new DOMSource(document)
var streamResult: StreamResult = new StreamResult(new File(destination))
transformer.transform(domSource, streamResult)
I want to know if it is possible to do that with Spark.
Thanks for your answer and have a good day.
Not exactly, but you can do something similar by using the Spark XML API or the XStream API on Spark.
First, try the Spark XML API, which is most useful when reading and writing XML files with Spark. However, at the time of writing, Spark XML has the following limitations:
1) Adding attributes to the root element is not supported.
2) It does not support the following structure, where you have header and footer elements:
<parent>
    <header></header>
    <dataset>
        <data attr="1">supports xml tags and data here</data>
        <data attr="2">value2</data>
    </dataset>
    <footer></footer>
</parent>
If you have one root element followed by the data, then Spark XML is the go-to API; a write might look like the sketch below.
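This is only a minimal sketch; the column layout of df, the tag names, and the output path are assumptions, not part of the original answer:

// assumes `df` is a DataFrame whose columns map to the child fields of each <data> row,
// and that the spark-xml package is on the classpath
df.write
  .format("com.databricks.spark.xml")
  .option("rootTag", "dataset") // single root element wrapping all rows
  .option("rowTag", "data")     // one element per DataFrame row
  .save("/path/to/output")      // hypothetical output path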
Alternatively, you can look at the XStream API. Below are the steps for using it to create custom XML structures.
1) First, create a Scala class similar to the structure you want in XML.
case class XMLData(name:String, value:String, attr:String)
2) Create an instance of this class
val data = XMLData("bookName","AnyValue", "AttributeValue")
3) Convert the data object to XML using the XStream API. If you already have data in a DataFrame, do a map transformation to convert the data to an XML string and store it back in the DataFrame. If you do so, you can skip step 4.
val xstream = new XStream(new DomDriver)
val xmlString = xstream.toXML(data)
4) Now convert xmlString to a DataFrame (with spark.implicits._ in scope)
val df = Seq(xmlString).toDF()
5) Finally, write to a file
df.write.text("file://filename")
Here is a full sample example with the XStream API:
import com.thoughtworks.xstream.XStream
import com.thoughtworks.xstream.io.xml.DomDriver
import org.apache.spark.sql.SparkSession

case class Animal(cri: String, taille: Int)

object SparkXMLUsingXStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder.master("local[*]")
      .appName("sparkbyexamples.com")
      .getOrCreate()

    var animal: Animal = Animal("Rugissement", 150)

    val xstream1 = new XStream(new DomDriver())
    xstream1.alias("testAni", classOf[Animal])
    xstream1.aliasField("cricri", classOf[Animal], "cri")
    val xmlString = Seq(xstream1.toXML(animal))

    import spark.implicits._
    val newDf = xmlString.toDF()
    newDf.show(false)
  }
}
Hope this helps!

Read specific column from Parquet without using Spark

I am trying to read Parquet files without using Apache Spark and I am able to do it, but I am finding it hard to read specific columns. I am not able to find any good resource on Google, as almost all the posts are about reading the Parquet file using Spark. Below is my code:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.avro.generic.GenericRecord
import org.apache.parquet.hadoop.ParquetReader
import org.apache.parquet.avro.AvroParquetReader

object parquetToJson {
  def main(args: Array[String]): Unit = {
    //case class Customer(key: Int, name: String, sellAmount: Double, profit: Double, state: String)
    val parquetFilePath = new Path("data/parquet/Customer/")
    val reader = AvroParquetReader.builder[GenericRecord](parquetFilePath).build() //.asInstanceOf[ParquetReader[GenericRecord]]
    val iter = Iterator.continually(reader.read).takeWhile(_ != null)
    val list = iter.toList
    list.foreach(record => println(record))
  }
}
The commented-out case class represents the schema of my file, and right now the code above reads all the columns from the file. I want to read specific columns.
If you just want to read specific columns, then you need to set a read schema on the Configuration that the ParquetReader builder accepts. (This is also known as a projection.)
In your case you should be able to call .withConf(conf) on the AvroParquetReader builder class and, in the conf you pass in, invoke conf.set(ReadSupport.PARQUET_READ_SCHEMA, schema), where schema is an Avro schema in String form.
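Putting that together, here is a sketch of a projection over the question's Customer file, following the conf.set approach described above; the two-column Avro schema is an assumption based on the commented-out case class:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.avro.generic.GenericRecord
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.api.ReadSupport

// hypothetical projection: only `key` and `name` from the Customer schema
val projection =
  """{"type":"record","name":"Customer","fields":[
    |  {"name":"key","type":"int"},
    |  {"name":"name","type":"string"}
    |]}""".stripMargin

val conf = new Configuration()
conf.set(ReadSupport.PARQUET_READ_SCHEMA, projection)

val reader = AvroParquetReader.builder[GenericRecord](new Path("data/parquet/Customer/"))
  .withConf(conf)
  .build()

Iterator.continually(reader.read).takeWhile(_ != null)
  .foreach(record => println(record.get("key") + "," + record.get("name")))
reader.close()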

How do I set Jackson parser features when using json4s?

I am receiving the following error while attempting to parse JSON with json4s:
Non-standard token 'NaN': enable JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS to allow
How do I enable this feature?
Assuming your ObjectMapper object is named mapper:
val mapper = new ObjectMapper()
// Configure NaN here
mapper.configure(JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS, true)
...
val json = ... //Get your json
val imported = mapper.readValue(json, classOf[Thing]) // Thing being whatever class you're importing to.
@Nathaniel Ford, thanks for setting me on the right path!
I ended up looking at the source code for the parse() method (which is what I should have done in the first place). This works:
import com.fasterxml.jackson.core.JsonParser
import com.fasterxml.jackson.databind.ObjectMapper
import org.json4s._
import org.json4s.jackson.Json4sScalaModule
val jsonString = """{"price": NaN}"""
val mapper = new ObjectMapper()
// Configure NaN here
mapper.configure(JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS, true)
mapper.registerModule(new Json4sScalaModule)
val json = mapper.readValue(jsonString, classOf[JValue])
While the answers above are still correct, it should be noted that since Jackson 2.10, JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS is deprecated.
The sustainable way to configure NaN handling is the following:
import com.fasterxml.jackson.core.json.JsonReadFeature
import com.fasterxml.jackson.databind.json.JsonMapper

val mapper = JsonMapper.builder().enable(JsonReadFeature.ALLOW_NON_NUMERIC_NUMBERS).build()
// now your parsing
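For completeness, a quick usage sketch with the builder-configured mapper, reusing the NaN payload from the answer above:

val json = mapper.readTree("""{"price": NaN}""")
println(json.get("price").asDouble()) // NaN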

How to save RandomForestClassifier Spark model in scala?

I built a random forest model using the following code:
import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.IndexToString

val rf = new RandomForestClassifier().setLabelCol("indexedLabel").setFeaturesCol("features")
val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
val training = labelIndexer.transform(df)
val model = rf.fit(training)
Now I want to save the model in order to predict later, using the following code:
val predictions: DataFrame = model.transform(testData)
I've looked into the Spark documentation here and didn't find any option to do that. Any idea?
It took me a few hours to build the model; if Spark crashes, I won't be able to get it back.
It is possible to save and reload tree-based models in HDFS with Spark 1.6 using saveAsObjectFile(), for both Pipeline-based and basic models.
Below is an example for a Pipeline-based model.
// model
val model = pipeline.fit(trainingData)

// Create an RDD using Seq
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs://filepath")

// Reload the model by using its class
// You can get the class of an object using object.getClass()
val sameModel = sc.objectFile[PipelineModel]("filepath").first()
For RandomForestClassifier save and load, this was tested with Spark 1.6.2 + Scala in ml (in Spark 2.0 you have a direct save option for the model; see the sketch after this code):
import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.classification.RandomForestClassifier // imports

val classifier = new RandomForestClassifier().setImpurity("gini").setMaxDepth(3).setNumTrees(20).setFeatureSubsetStrategy("auto").setSeed(5043)
val model = classifier.fit(trainingData)

sc.parallelize(Seq(model), 1).saveAsObjectFile(modelSavePath) // save model
val linRegModel = sc.objectFile[RandomForestClassificationModel](modelSavePath).first() // load model
val predictions1 = linRegModel.transform(testData) // predictions1 is a DataFrame
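For the Spark 2.0+ direct save mentioned above, a short sketch (the HDFS path is a placeholder); the model classes implement MLWritable/MLReadable, so no saveAsObjectFile() workaround is needed:

import org.apache.spark.ml.classification.RandomForestClassificationModel

model.write.overwrite().save("hdfs:///some/path/rfModel")

val sameModel = RandomForestClassificationModel.load("hdfs:///some/path/rfModel")
val predictions = sameModel.transform(testData)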
It is in the MLWriter interface, which is accessed via the write member on your model:
model.asInstanceOf[MLWritable].write.save(path)
Here is the interface:
abstract class MLWriter extends BaseReadWrite with Logging {

  protected var shouldOverwrite: Boolean = false

  /**
   * Saves the ML instances to the input path.
   */
  @Since("1.6.0")
  @throws[IOException]("If the input path already exists but overwrite is not enabled.")
  def save(path: String): Unit = {
This is a refactoring from earlier versions of mllib/spark.ml
Update: it appears that the model was not writable:
Exception in thread "main" java.lang.UnsupportedOperationException:
Pipeline write will fail on this Pipeline because it contains a stage
which does not implement Writable. Non-Writable stage:
rfc_4e467607406f of type class
org.apache.spark.ml.classification.RandomForestClassificationModel
So there may not be a straightforward solution for this.
Here is a PySpark v1.6 implementation corresponding to the Scala saveAsObjectFile() answer above.
It coerces the Python objects to/from Java objects to achieve serialisation with saveAsObjectFile().
Without the Java coercion I had weird Py4J errors on serialisation. If anyone has a simpler implementation, please edit or comment.
Save a trained RandomForestClassificationModel object:
# Save RandomForestClassificationModel to hdfs
gateway = sc._gateway
java_list = gateway.jvm.java.util.ArrayList()
java_list.add(rfModel._java_obj)
modelRdd = sc._jsc.parallelize(java_list)
modelRdd.saveAsObjectFile("hdfs:///some/path/rfModel")
Load a trained RandomForestClassificationModel object:
# Load RandomForestClassificationModel from hdfs
rfObjectFileLoaded = sc._jsc.objectFile("hdfs:///some/path/rfModel")
rfModelLoaded_JavaObject = rfObjectFileLoaded.first()
rfModelLoaded = RandomForestClassificationModel(rfModelLoaded_JavaObject)
predictions = rfModelLoaded.transform(test_input_df)