Mixed Content XML parsing using DataFrame

I have an XML document with mixed content, and I am parsing it into a DataFrame with a custom schema. The problem is that the schema only picks up the text for "Measure".
The XML looks like this:
<QData>
<Measure> some text here
<Answer>Answer1</Answer>
<Question>Question1</Question>
</Measure>
<Measure> some text here
<Answer>Answer1</Answer>
<Question>Question1</Question>
</Measure>
</QData>
My schema is as follows:
def getCustomSchema(): StructType = StructType(Array(
  StructField("QData",
    StructType(Array(
      StructField("Measure",
        StructType(Array(
          StructField("Answer", StringType, true),
          StructField("Question", StringType, true)
        )), true)
    )), true)
))
When I try to access the data in Measure, I only get "some text here", and it fails when I try to read Answer. I also get only one Measure.
EDIT: This is how I am trying to access the data:
val result = sc.read.format("com.databricks.spark.xml")
  .option("attributePrefix", "attr_")
  .schema(getCustomSchema)
  .load(filename.toString)
val qDfTemp = result.mapPartitions { partition =>
  val mapper = new QDMapper()
  partition.map(row => mapper(row)).flatMap(list => list)
}.toDF()
case class QDMapper() {
  def apply(row: Row): List[QData] = {
    val qDList = new ListBuffer[QData]()
    val qualData = row.getAs[Row]("QData") // When I print as list I get the first Measure text and that is it
    val measure = qualData.getAs[Row]("Measure") // This fails
  }
}

You can use the row tag as the root tag and access the other elements from there. In PySpark:
df_schema = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='<xml_tag_name>').load(schema_path)
See https://github.com/harshaltaware/Pyspark/blob/main/Spark-data-parsing/xmlparsing.py for a short code example.

Related

Add element to the xml string in scala

I have the following relatively simple scenario, but I can't get it working.
I need to append an element to my XML string. Here is the scenario:
val xmlStr = "<return> <numberPin> 123456 </numberPin> </return>"
I need some way to add the date element and return the string below. I would prefer a solution with a regular expression, if possible:
"<return> <numberPin> 123456 </numberPin> <date> 2019-09-04 00:00:00 </date> </return>"

You can first create a template XML string that is filled in at runtime.
For example:
def updateXml (xmlStr:String, dateContent: String) = {
xmlStr.replace("DATE_DATA", dateContent)
}
val xmlStr = "<return> <numberPin> 123456 </numberPin> DATE_DATA </return>"
val dateData = "<date> 2019-09-04 00:00:00 </date>"
updateXml(xmlStr, dateData)
Another option, if the XML content is large, is to keep the template in a file, read it in your code, and insert the required data at runtime as in the example above (where I placed DATE_DATA in the template and replaced it at runtime using the method).
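Since the question asks for a regular expression, here is a minimal sketch of that approach, offered as one possibility and not a definitive solution. The names XmlAppend and appendElement are made up for this illustration, and the sketch assumes the parent's closing tag is the last thing in the string. Regex edits like this are fragile for anything beyond small, predictable snippets.

```scala
import java.util.regex.{Matcher, Pattern}

// Hypothetical helper for illustration; not part of any library.
object XmlAppend {
  // Inserts `element` immediately before the closing tag of `parentTag`,
  // assuming that closing tag ends the string (optionally followed by whitespace).
  def appendElement(xml: String, parentTag: String, element: String): String = {
    val closing = ("\\s*</" + Pattern.quote(parentTag) + ">\\s*$").r
    // quoteReplacement keeps any '$' or '\' in the element from being
    // interpreted as a group reference in the replacement string.
    closing.replaceFirstIn(xml, Matcher.quoteReplacement(s" $element </$parentTag>"))
  }
}
```

Calling XmlAppend.appendElement(xmlStr, "return", "<date> 2019-09-04 00:00:00 </date>") on the xmlStr from the question yields the target string shown there. If the string has no matching closing tag at the end, it is returned unchanged.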

How to write a DataFrame schema to file in Scala

I have a DataFrame that loads from a huge json file and gets the schema from it. The schema is basically around 1000 columns. I want the same output of printSchema to be saved in a file instead of the console.
Any ideas?

You can do the following if you are working in a local environment:
import java.io.PrintWriter
val filePath = "/path/to/file/schema_file"
new PrintWriter(filePath) { write(df.schema.treeString); close() }
If you are on HDFS, you'll need to provide a URI.
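For the local case, a slightly more defensive variant could use java.nio, which opens and closes the file in a single call. This is only a sketch: SchemaFile and writeText are hypothetical names, and the text argument is assumed to be the result of df.schema.treeString from the snippet above.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}

// Hypothetical helper for illustration; not part of Spark.
object SchemaFile {
  // Writes `text` to `path` as UTF-8, creating the file or overwriting it.
  // Files.write opens and closes the file itself, so no handle is left open on failure.
  def writeText(path: String, text: String): Path =
    Files.write(Paths.get(path), text.getBytes(StandardCharsets.UTF_8))
}
```

Usage would look like SchemaFile.writeText("/path/to/file/schema_file", df.schema.treeString).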

This is the body of printSchema():
/**
 * Prints the schema to the console in a nice tree format.
 * @group basic
 * @since 1.3.0
 */
// scalastyle:off println
def printSchema(): Unit = println(schema.treeString)
// scalastyle:on println
So there is not much you can do with printSchema itself, but here is a workaround that may work in your case.
Set the output stream to a file stream so that the schema is printed to your file.
Something like this:
import java.io.{FileOutputStream, PrintStream}
val out = new PrintStream(new FileOutputStream("output.txt"))
System.setOut(out)
I hope this answers your question.
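One caveat, as far as I know: Scala's println writes to Console.out, which is captured when Console is first initialized, so calling System.setOut afterwards may not redirect it. A hedged alternative is Console.withOut, which redirects Scala's console output only for the duration of a block. The names ConsoleToFile and printTo are made up for this sketch; in the Spark case the block would be df.printSchema().

```scala
import java.io.{FileOutputStream, PrintStream}

// Hypothetical helper for illustration.
object ConsoleToFile {
  // Runs `body` with Scala's Console.out redirected to the file at `path`,
  // then closes the file whether or not `body` throws.
  def printTo(path: String)(body: => Unit): Unit = {
    val out = new PrintStream(new FileOutputStream(path), true, "UTF-8")
    try Console.withOut(out)(body)
    finally out.close()
  }
}
```

Usage would look like ConsoleToFile.printTo("output.txt") { df.printSchema() }. Console output returns to normal once the block ends.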

How to dump nested list using SnakeYAML

I'm parsing this yaml file
View:
from : 01.01.2007
to : 04.01.2007
driver : sun.jdbc.odbc.JdbcOdbcDriver
using SnakeYAML in Scala like this:
val stream = getClass.getResourceAsStream("/config_view.yml")
var configMap: Map[String, Any] = new Yaml().load(stream).asInstanceOf[java.util.Map[String, Any]].asScala
var view = configMap("View").asInstanceOf[java.util.LinkedHashMap[String, String]].asScala
view = view + ("from" -> "neu") // some test modifying
and I dump it like this:
val fileWriter = new FileWriter(System.getProperty("user.home") + "\\Desktop\\test.yml")
new Yaml().dump(Map[String, Any]("View" -> view.asJava).asJava, fileWriter)
which saves the new yaml file like this:
View: {driver: sun.jdbc.odbc.JdbcOdbcDriver, from: neu, to: 04.01.2007}
But I want it to save it like this:
View:
driver: sun.jdbc.odbc.JdbcOdbcDriver
from: neu
to: 04.01.2007
How can I tell SnakeYAML to save it in the desired format you see above?

By default SnakeYAML uses DumperOptions.FlowStyle.AUTO, which writes a collection that contains only scalars in flow style (the braces form in your output). Changing it to DumperOptions.FlowStyle.BLOCK will dump the data in the desired format.
An example in Kotlin:
val options = DumperOptions()
options.indent = 2
options.defaultFlowStyle = DumperOptions.FlowStyle.BLOCK
Yaml(options).dump(yourObject)

How about manually handling the indentation and key: value formatting:
view.map{ case (k,v) => s"\t$k: $v\n" }
In the case of nested maps you will want a method that:
- accepts the current "level" of nesting and places that many tabs in front of each line to give the proper output nesting
- checks each of the entries, and if a value is another collection type, recursively invokes itself, which increases the indentation level
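The recursive approach described above can be sketched as follows. This is a minimal illustration and not a replacement for a YAML library: BlockYaml and render are made-up names, only nested Scala Maps are handled, values are written with toString without any quoting or escaping, and two spaces are used per level because YAML does not allow tabs for indentation.

```scala
// Hypothetical helper for illustration; handles only nested Maps of simple values.
object BlockYaml {
  // Renders `data` in block style, indenting each nesting level by two spaces.
  def render(data: Map[String, Any], level: Int = 0): String = {
    val indent = "  " * level
    data.map {
      // A nested map: emit the key on its own line, then recurse one level deeper.
      case (key, nested: Map[_, _]) =>
        s"$indent$key:\n" + render(nested.map { case (k, v) => k.toString -> v }, level + 1)
      // Anything else is treated as a plain scalar.
      case (key, value) =>
        s"$indent$key: $value\n"
    }.mkString
  }
}
```

For the map from the question, BlockYaml.render(Map("View" -> view.toMap)) would produce a "View:" line followed by one indented "key: value" line per entry.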

How do I save the file name in the database

When I add a new record to a database using a form, I can also upload an image. The two are not linked: the record goes into the database and the image goes into a folder on my desktop. To know which image belongs to which record, I want to put the filename in a column. How do I approach this?
I'm using Play Framework 2.4, Scala, H2 Database and Anorm for my project.

In your html form you need to have an input tag of file type, something like:
<input type="file" name="picture">
And in the Scala controller method that receives the form submission, something like:
def save = Action(parse.multipartFormData) { request =>
  request.body.file("picture").map { picture =>
    val filename = picture.filename // the value to store in your filename column
    println(filename)
    Ok("saved")
  }.getOrElse(BadRequest("Missing file")) // no "picture" part in the request
}

You could retrieve the absolute path of the file like so:
scala> val filePath = getClass.getResource("myImage.png")
filePath: java.net.URL = file:/home/robert/myImage.png

store (binary) file - play framework using scala in heroku

I'm trying to store user-uploaded images in my application, which is written in Scala with Play Framework 2.2.x.
I've deployed my app on Heroku.
Heroku does not allow me to save files to the file system.
So I've tried to store the files in the database.
Here is the code that I use for storing an image:
def updateImage(id: Long, image: Array[Byte]) = {
val selected = getById(id)
DB.withConnection {
implicit c =>
SQL("update subcategory set image={image} where id = {id}").on('id -> id, 'image -> image).executeUpdate()
}
selected }
and here is the code that I use to retrieve my image:
def getImageById(id: Long): Array[Byte] = DB.withConnection {
implicit c =>
val all = SQL("select image from subcategory where id = {id}").on('id -> id)().map {
case Row(image: Array[Byte]) => image
case Row(Some(image: Array[Byte])) => image
case Row(image: java.sql.Blob) => image.getBytes(1, image.length().toInt) // Blob positions are 1-based
}
all.head
}
The problem is: when I use an H2 database and a blob column, I get a "Match Error" exception.
When I use PostgreSQL and a bytea column, I get no error, but when I retrieve the image it is in hex format and some of the bytes at the beginning of the array are missing.

According to the PostgreSQL documentation, bytea stores the length of the array in the four bytes at the beginning of the array. These are stripped when you read the row, so that's why they seem to be "missing" when you compare the data in Scala with the data in the DB.
You will have to set the response's content-type to the appropriate value if you want the web browser to display the image correctly, as otherwise it does not know it is receiving image data. The Ok.sendFile helper does it for you. Otherwise you will have to do it by hand:
def getPicture = Action {
SimpleResult(
header = ResponseHeader(200),
body = Enumerator(pictureByteArray))
.as(pictureContentType)
}
In the example above, pictureByteArray is the Array[Byte] containing the picture data from your database, and pictureContentType is a string with the appropriate content type (for example, image/jpeg).
This is all quite well explained in the Play documentation.
