How do I set Jackson parser features when using json4s? - scala

I am receiving the following error while attempting to parse JSON with json4s:
Non-standard token 'NaN': enable JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS to allow
How do I enable this feature?

Assuming your ObjectMapper object is named mapper:
import com.fasterxml.jackson.core.JsonParser
import com.fasterxml.jackson.databind.ObjectMapper

val mapper = new ObjectMapper()
// Configure NaN handling here
mapper.configure(JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS, true)
...
val json = ... // Get your json
val imported = mapper.readValue(json, classOf[Thing]) // Thing being whatever class you're importing to.

@Nathaniel Ford, thanks for setting me on the right path!
I ended up looking at the source code for the parse() method (which is what I should have done in the first place). This works:
import com.fasterxml.jackson.core.JsonParser
import com.fasterxml.jackson.databind.ObjectMapper
import org.json4s._
import org.json4s.jackson.Json4sScalaModule
val jsonString = """{"price": NaN}"""
val mapper = new ObjectMapper()
// Configure NaN here
mapper.configure(JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS, true)
mapper.registerModule(new Json4sScalaModule)
val json = mapper.readValue(jsonString, classOf[JValue])

While the answers above still work, it should be noted that since Jackson 2.10, JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS is deprecated.
The sustainable way to configure NaN handling is the following:
val mapper = JsonMapper.builder().enable(JsonReadFeature.ALLOW_NON_NUMERIC_NUMBERS).build()
// now your parsing
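For reference, here is a minimal sketch (not from the original answer) that combines the builder-based configuration with the json4s module used above; the NaN sample string is reused from the earlier answer:
import com.fasterxml.jackson.core.json.JsonReadFeature
import com.fasterxml.jackson.databind.json.JsonMapper
import org.json4s._
import org.json4s.jackson.Json4sScalaModule

val mapper = JsonMapper.builder()
  .enable(JsonReadFeature.ALLOW_NON_NUMERIC_NUMBERS)
  .build()
mapper.registerModule(new Json4sScalaModule)

// Parses to a JValue; the NaN no longer triggers the non-standard token error
val json = mapper.readValue("""{"price": NaN}""", classOf[JValue])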

Related

Can we create a xml file with specific node with Spark Scala?

I have another question about Spark and Scala. I want to use this technology to get data and generate an XML file.
I want to know whether it is possible to create nodes myself (not automatic creation), and which library to use. I searched but found nothing very helpful (since I'm new to this technology, I don't know many keywords).
I want to know whether Spark has something like the code below (I wrote it in Scala; it works locally, but I can't use new File() in Spark).
import java.io.File
import javax.xml.parsers.{DocumentBuilder, DocumentBuilderFactory}
import javax.xml.transform.{Transformer, TransformerFactory}
import javax.xml.transform.dom.DOMSource
import javax.xml.transform.stream.StreamResult
import org.w3c.dom.{Attr, Element}

val docBuilder: DocumentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
val document = docBuilder.newDocument()
var root: Element = document.createElement("<tag name>")
var attr: Attr = document.createAttribute("<attr1>")
attr.setValue("<value attr1>")
root.setAttributeNode(attr)
attr = document.createAttribute("<attr2>")
attr.setValue("<value attr2>")
root.setAttributeNode(attr)
document.appendChild(root)
document.setXmlStandalone(true)
var transformerFactory: TransformerFactory = TransformerFactory.newInstance()
var transformer: Transformer = transformerFactory.newTransformer()
var domSource: DOMSource = new DOMSource(document)
var streamResult: StreamResult = new StreamResult(new File(destination))
transformer.transform(domSource, streamResult)
I want to know if it's possible to do that with Spark.
Thanks for your answers, and have a good day.
Not exactly, but you can do something similar by using the Spark XML API or the XStream API on Spark.
First try the Spark XML API, which is most useful when reading and writing XML files with Spark. However, at the time of writing, Spark XML has the following limitations:
1) Adding attributes to the root element is not supported.
2) The following structure, where you have header and footer elements, is not supported:
<parent>
  <header></header>
  <dataset>
    <data attr="1">supports xml tags and data here</data>
    <data attr="2">value2</data>
  </dataset>
  <footer></footer>
</parent>
If you have a single root element followed directly by row data, then Spark XML is the go-to API.
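For illustration, a rough sketch of writing a DataFrame with the Spark XML connector (the DataFrame contents, output path, and tag names here are assumptions, not part of the original answer):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("spark-xml-write").getOrCreate()
import spark.implicits._

// A hypothetical DataFrame; each row becomes one <data> element under the root
val df = Seq(("value1", 1), ("value2", 2)).toDF("value", "id")

df.write
  .format("com.databricks.spark.xml")
  .option("rootTag", "dataset") // the single enclosing root element
  .option("rowTag", "data")     // one element per DataFrame row
  .save("/tmp/output-xml")
Note that, as the limitations above say, you cannot add attributes to the root element this way.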
Alternatively, you can look at the XStream API. Below are the steps for using it to create custom XML structures:
1) First, create a Scala class similar to the structure you wanted in XML.
case class XMLData(name:String, value:String, attr:String)
2) Create an instance of this class
val data = XMLData("bookName","AnyValue", "AttributeValue")
3) Convert the data object to XML using the XStream API. If you already have data in a DataFrame, then do a map transformation to convert the data to an XML string and store it back in a DataFrame. If you do so, you can skip step #4.
val xstream = new XStream(new DomDriver)
val xmlString = xstream.toXML(data)
4) Now convert xmlString to a DataFrame (this requires spark.implicits._ in scope):
import spark.implicits._
val df = Seq(xmlString).toDF()
5) Finally, write to a file
df.write.text("file://filename")
Here is a full sample example using the XStream API:
import com.thoughtworks.xstream.XStream
import com.thoughtworks.xstream.io.xml.DomDriver
import org.apache.spark.sql.SparkSession

case class Animal(cri: String, taille: Int)

object SparkXMLUsingXStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder.master("local[*]")
      .appName("sparkbyexamples.com")
      .getOrCreate()

    var animal: Animal = Animal("Rugissement", 150)
    val xstream1 = new XStream(new DomDriver())
    xstream1.alias("testAni", classOf[Animal])
    xstream1.aliasField("cricri", classOf[Animal], "cri")
    val xmlString = Seq(xstream1.toXML(animal))

    import spark.implicits._
    val newDf = xmlString.toDF()
    newDf.show(false)
  }
}
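With those aliases, xstream1.toXML(animal) should produce roughly the following XML (a hedged expectation, not output quoted from the original answer):
<testAni>
  <cricri>Rugissement</cricri>
  <taille>150</taille>
</testAni>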
Hope this helps !!
Thanks

Converting from java.util.List to spark dataset

I am still very new to Spark and Scala, but very familiar with Java. I have a Java jar with a function that returns a List (java.util.List) of Integers, and I want to convert these to a Spark Dataset so I can append it to another column and then perform a join. Is there an easy way to do this? I've tried code similar to this:
val testDSArray : java.util.List[Integer] = new util.ArrayList[Integer]()
testDSArray.add(4)
testDSArray.add(7)
testDSArray.add(10)
val testDS : Dataset[Integer] = spark.createDataset(testDSArray, Encoders.INT())
but it gives me compiler errors (cannot resolve overloaded method)?
If you look at the type signature you will see that in Scala the encoder is passed in a second (and implicit) parameter list.
You may:
Pass it in another parameter list.
val testDS = spark.createDataset(testDSArray)(Encoders.INT)
Don't pass it, and let Scala's implicit mechanism resolve it.
import spark.implicits._
val testDS = spark.createDataset(testDSArray)
Convert the Java List to a Scala one first.
import collection.JavaConverters._
import spark.implicits._
val testDS = testDSArray.asScala.toDS()
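Putting one of these options together, a minimal self-contained sketch (the SparkSession setup here is an assumption, not part of the original answer):
import java.util

import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

object JavaListToDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("list-to-dataset").getOrCreate()

    val testDSArray: java.util.List[Integer] = new util.ArrayList[Integer]()
    testDSArray.add(4)
    testDSArray.add(7)
    testDSArray.add(10)

    // The encoder goes in the second (implicit) parameter list
    val testDS: Dataset[Integer] = spark.createDataset(testDSArray)(Encoders.INT)
    testDS.show()
  }
}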

How to convert RDD[GenericRecord] to dataframe in scala?

I get tweets from a Kafka topic with Avro (serializer and deserializer).
Then I create a Spark consumer which extracts the tweets into a DStream of RDD[GenericRecord].
Now I want to convert each RDD to a DataFrame to analyse these tweets via SQL.
Any solution to convert RDD[GenericRecord] to a DataFrame, please?
I spent some time trying to make this work (especially how to deserialize the data properly, but it looks like you have already covered that) ... UPDATED
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.StructType
import com.databricks.spark.avro.SchemaConverters

// Define a function to convert from GenericRecord to Row
def genericRecordToRow(record: GenericRecord, sqlType: SchemaConverters.SchemaType): Row = {
  val objectArray = new Array[Any](record.asInstanceOf[GenericRecord].getSchema.getFields.size)
  import scala.collection.JavaConversions._
  for (field <- record.getSchema.getFields) {
    objectArray(field.pos) = record.get(field.pos)
  }
  new GenericRowWithSchema(objectArray, sqlType.dataType.asInstanceOf[StructType])
}

// Inside your stream's foreachRDD
val yourGenericRecordRDD = ...
val schema = new Schema.Parser().parse(...) // your schema
val sqlType = SchemaConverters.toSqlType(schema)
var rowRDD = yourGenericRecordRDD.map(record => genericRecordToRow(record, sqlType))
val df = sqlContext.createDataFrame(rowRDD, sqlType.dataType.asInstanceOf[StructType])
As you can see, I am using SchemaConverters to get the DataFrame structure from the schema that you used to deserialize (this could be more painful with a schema registry). For this you need the following dependency:
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-avro_2.11</artifactId>
  <version>3.2.0</version>
</dependency>
You will need to change the version depending on your Spark version.
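If you build with sbt instead of Maven, the equivalent dependency line would presumably be:
libraryDependencies += "com.databricks" %% "spark-avro" % "3.2.0"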
UPDATE: the code above only works for flat Avro schemas.
For nested structures I used something different. You can copy the class SchemaConverters (it has to live inside com.databricks.spark.avro because it uses some protected classes from the databricks package), or you can try to use the spark-bigquery dependency. The class will not be accessible by default, so you will need to create a class inside the package com.databricks.spark.avro to access the factory method.
package com.databricks.spark.avro

import com.databricks.spark.avro.SchemaConverters.createConverterToSQL
import org.apache.avro.Schema
import org.apache.spark.sql.types.StructType

object SchemaConverterUtils {
  def converterSql(schema: Schema, sqlType: StructType) = {
    createConverterToSQL(schema, sqlType)
  }
}
After that you should be able to convert the data like this:
val schema = .. // your schema
val sqlType = SchemaConverters.toSqlType(schema).dataType.asInstanceOf[StructType]
....
//inside foreach RDD
var genericRecordRDD = deserializeAvroData(rdd)
///
var converter = SchemaConverterUtils.converterSql(schema, sqlType)
...
val rowRdd = genericRecordRDD.flatMap(record => {
  Try(converter(record).asInstanceOf[Row]).toOption
})
//To DataFrame
val df = sqlContext.createDataFrame(rowRdd, sqlType)
A combination of https://stackoverflow.com/a/48828303/5957143 and https://stackoverflow.com/a/47267060/5957143 works for me.
I used the following to create MySchemaConversions
package com.databricks.spark.avro

import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.DataType

object MySchemaConversions {
  def createConverterToSQL(avroSchema: Schema, sparkSchema: DataType): (GenericRecord) => Row =
    SchemaConverters.createConverterToSQL(avroSchema, sparkSchema).asInstanceOf[(GenericRecord) => Row]
}
And then I used
val myAvroType = SchemaConverters.toSqlType(schema).dataType
val myAvroRecordConverter = MySchemaConversions.createConverterToSQL(schema, myAvroType)
// unionedResultRdd is unionRDD[GenericRecord]
var rowRDD = unionedResultRdd.map(record => MyObject.myConverter(record, myAvroRecordConverter))
val df = sparkSession.createDataFrame(rowRDD , myAvroType.asInstanceOf[StructType])
The advantage of having myConverter in the object MyObject is that you will not encounter serialization issues (java.io.NotSerializableException).
object MyObject {
  def myConverter(record: GenericRecord,
                  myAvroRecordConverter: (GenericRecord) => Row): Row =
    myAvroRecordConverter.apply(record)
}
Alternatively, something like this may help you:
val stream = ...
val dfStream = stream.transform { rdd: RDD[GenericRecord] =>
  val df = rdd.map(_.toSeq)
    .map(seq => Row.fromSeq(seq))
    .toDF(col1, col2, ....)
  df
}
I'd like to suggest an alternative approach. With Spark 2.x you can skip the whole process of creating DStreams. Instead, you can do something like this with Structured Streaming:
val df = ss.readStream
.format("com.databricks.spark.avro")
.load("/path/to/files")
This will give you a single DataFrame which you can query directly. Here, ss is the SparkSession instance, and /path/to/files is the place where all your Avro files are being dumped from Kafka.
PS: You may need to add the spark-avro dependency:
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"
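As a rough sketch of what querying that streaming DataFrame could look like (the column name here is an assumption about your Avro schema, not something from the original answer):
import org.apache.spark.sql.functions._

// Count tweets per language and print the running result to the console
val query = df
  .groupBy(col("lang"))
  .count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()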
Hope this helped. Cheers
You can use createDataFrame(rowRDD: RDD[Row], schema: StructType), which is available in the SQLContext object. Example of converting the RDD of an old DataFrame:
import sqlContext.implicits._
val rdd = oldDF.rdd
val newDF = oldDF.sqlContext.createDataFrame(rdd, oldDF.schema)
Note that there is no need to explicitly set any schema column: we reuse the old DF's schema, which is of the StructType class and can be easily extended. However, this approach is sometimes not possible, and in some cases it can be less efficient than the first one.

Reading YAML in Scala without 'case class'

I am trying to read a YAML file from Scala, and I am able to read it using the code given below. One disadvantage I see here is the necessity to create a case class that maps to the YAML file I am using. Every time I change the content of the YAML, I have to change the case class as well. Is there any way in Scala to read YAML without needing to create the case class? (I have also used Python to read YAML, where we do not have the constraint of mapping a class to the YAML structure, and I would like to do a similar thing in Scala.)
Note: when I add a new property to the YAML and my case class does not have a matching property, I get UnrecognizedPropertyException.
package yamlexamples

import com.fasterxml.jackson.dataformat.yaml.YAMLFactory
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

object YamlTest extends App {
  case class Prop(country: String, state: List[String])

  val mapper: ObjectMapper = new ObjectMapper(new YAMLFactory())
  mapper.registerModule(DefaultScalaModule)

  val fileStream = getClass.getResourceAsStream("/sample.yaml")
  val prop: Prop = mapper.readValue(fileStream, classOf[Prop])
  println(prop.country + ", " + prop.state)
}
sample.yaml (this works with the code above):
country: US
state:
- TX
- VA
sample.yaml (this throws the exception):
country: US
state:
- TX
- VA
continent: North America
You could parse the YAML file and load it as a collections object instead of a case class, but this comes at the cost of losing type safety in your code. The code below uses the load function supported by SnakeYAML. Note that load has overloaded methods to read from an InputStream/Reader as well.
import org.yaml.snakeyaml.Yaml
import scala.collection.JavaConverters._

val yaml = new Yaml()
val data = yaml.load(
  """
    |country: US
    |state:
    | - TX
    | - VA
    |continent: North America
  """.stripMargin).asInstanceOf[java.util.Map[String, Any]].asScala
Now data is a Scala mutable collection instead of a case class:
data: scala.collection.mutable.Map[String,Any] = Map(country -> US, state -> [TX, VA], continent -> North America)
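In the same spirit, if you want to stay with the Jackson YAML mapper from the question but skip the case class, you can read into an untyped tree instead (a sketch; the resource path comes from the question):
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import com.fasterxml.jackson.dataformat.yaml.YAMLFactory
import com.fasterxml.jackson.module.scala.DefaultScalaModule

val mapper = new ObjectMapper(new YAMLFactory())
mapper.registerModule(DefaultScalaModule)

// Unknown properties such as "continent" are just extra nodes in the tree
val tree: JsonNode = mapper.readTree(getClass.getResourceAsStream("/sample.yaml"))
println(tree.get("country").asText())      // US
println(tree.get("state").get(0).asText()) // TX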
You could parse the YAML file using Jackson or SnakeYAML. However, Jackson does not support references/anchors, while SnakeYAML does, so it is better to parse the YAML file using SnakeYAML and access the data elements with the Jackson library.
import java.io.{File, FileInputStream}
import com.fasterxml.jackson.dataformat.yaml.YAMLFactory
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import org.yaml.snakeyaml.Yaml
// Parsing the YAML file with SnakeYAML - since Jackson Parser does not support Anchors and references
val ios = new FileInputStream(new File(yamlFilePath))
val yaml = new Yaml()
val mapper = new ObjectMapper().registerModules(DefaultScalaModule)
val yamlObj = yaml.loadAs(ios, classOf[Any])
// Converting the YAML to Jackson YAML - since it has more flexibility
val jsonString = mapper.writerWithDefaultPrettyPrinter().writeValueAsString(yamlObj) // Formats YAML to a pretty printed JSON string - easy to read
val jsonObj = mapper.readTree(jsonString)
Finally, you get a JsonNode object, which lets you convert the data to other types.
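For example, with the sample.yaml from the earlier question, the resulting JsonNode could be queried like this (a sketch; the field names follow that sample):
// "country: US"
val country = jsonObj.get("country").asText()

// First element of the "state" list, via a JSON Pointer
val firstState = jsonObj.at("/state/0").asText()

// Or convert the whole tree into an ordinary Map if you prefer
val asMap = mapper.convertValue(jsonObj, classOf[java.util.Map[String, Any]])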

How do I generate pretty JSON with json4s?

This snippet of code works very well, but it generates compact JSON (no line breaks / not very human readable).
import org.json4s.native.Serialization.write
implicit val jsonFormats = DefaultFormats
//snapshotList is a case class
val jsonString: String = write(snapshotList)
Is there an easy way to generate pretty JSON from this?
I have this workaround but I'm wondering if a more efficient way exists:
import org.json4s.jackson.JsonMethods._
val prettyJsonString = pretty(render(parse(jsonString)))
Use writePretty instead of write:
import org.json4s.jackson.Serialization.writePretty
val jsonString: String = writePretty(snapshotList)
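For completeness, a small self-contained sketch of the writePretty approach (the case class and values here are made up for illustration):
import org.json4s.DefaultFormats
import org.json4s.jackson.Serialization.writePretty

case class Snapshot(id: Int, name: String)

implicit val jsonFormats: DefaultFormats.type = DefaultFormats

val snapshotList = List(Snapshot(1, "first"), Snapshot(2, "second"))
val prettyJson: String = writePretty(snapshotList)
println(prettyJson) // prints the list with line breaks and indentation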
You can use the ObjectMapper's writerWithDefaultPrettyPrinter function:
val mapper = new ObjectMapper()
val pretty = mapper.writerWithDefaultPrettyPrinter()
  .writeValueAsString(mapper.readTree(jsonString))
writerWithDefaultPrettyPrinter returns an ObjectWriter, from which you can get a pretty-formatted string; note that the compact JSON string has to be parsed first (readTree here) before being re-serialised with indentation.