I have the following relatively simple scenario, but I can't get it working.
I need to append to my XML string; here's the scenario:
val xmlStr = "<return> <numberPin> 123456 </numberPin> </return>"
I need some way to add the date element and return the string below. I would like a solution using regular expressions if possible:
"<return> <numberPin> 123456 </numberPin> <date> 2019-09-04 00:00:00 </date> </return>"
You can start from a template XML that is updated at runtime.
You can do something like the following:
def updateXml(xmlStr: String, dateContent: String): String =
  xmlStr.replace("DATE_DATA", dateContent)

val xmlStr = "<return> <numberPin> 123456 </numberPin> DATE_DATA </return>"
val dateData = "<date> 2019-09-04 00:00:00 </date>"
updateXml(xmlStr, dateData)
Another alternative is to keep the XML template in a file (if the XML content is large). Read it in your code and insert the required data at run-time, as shown in the example above (where I put a DATE_DATA placeholder in the template and replaced it at runtime with the method).
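Since you asked for a regular-expression solution: a minimal sketch, assuming the string contains exactly one closing </return> tag and the inserted element contains no $ or \ characters, that inserts the new element just before the closing tag:

def insertBeforeClose(xmlStr: String, element: String): String =
  // "</return>" is a plain literal regex; the match is replaced by the element plus the tag
  xmlStr.replaceAll("</return>", s"$element </return>")

val xmlStr = "<return> <numberPin> 123456 </numberPin> </return>"
insertBeforeClose(xmlStr, "<date> 2019-09-04 00:00:00 </date>")
// "<return> <numberPin> 123456 </numberPin> <date> 2019-09-04 00:00:00 </date> </return>"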
I am using Spark 2.3.2 with Python 3.7 to parse XML.
In an XML file (sample), I have concatenated 2 XML documents.
When I parse it with:
import os
import pyspark
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql import functions as F

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.7.0 pyspark-shell'
conf = pyspark.SparkConf()
sc = SparkSession.builder.config(conf=conf).getOrCreate()
spark = SQLContext(sc)
dfSample = (spark.read.format("xml").option("rowTag", "xocs:doc")
    .load(r"sample.xml"))
I can see both documents' data.
However, what I need is to extract the info under the "ref-info" tag (along with the corresponding key eids), so my code is:
(dfSample.
withColumn("metaExp", F.explode(F.array("xocs:meta"))).
withColumn("eid", F.col("metaExp.xocs:eid")).
select("eid","xocs:item").
withColumn("xocs:itemExp", F.explode(F.array("xocs:item"))).
withColumn("item", F.col("xocs:itemExp.item")).
withColumn("itemExp", F.explode(F.array("item"))).
withColumn("bibrecord", F.col("item.bibrecord")).
withColumn("bibrecordExp", F.explode(F.array("bibrecord"))).
withColumn("tail", F.col("bibrecord.tail")).
withColumn("tailExp", F.explode(F.array("tail"))).
withColumn("bibliography", F.col("tail.bibliography")).
withColumn("bibliographyExp", F.explode(F.array("bibliography"))).
withColumn("reference", F.col("bibliography.reference")).
withColumn("referenceExp", F.explode(F.array("reference"))).
withColumn("ref-infoExp", F.explode(F.col("reference.ref-info"))).
withColumn("authors", F.explode(F.col("ref-infoExp.ref-authors.author"))).
withColumn("py", (F.col("ref-infoExp.ref-publicationyear._first"))).
withColumn("so", (F.col("ref-infoExp.ref-sourcetitle"))).
withColumn("ti", (F.col("ref-infoExp.ref-title"))).
drop("xocs:item", "xocs:itemExp", "item", "itemExp", "bibrecord", "bibrecordExp", "tail", "tailExp", "bibliography",
"bibliographyExp", "reference", "referenceExp").show())
This extracts the info only from the XML with eid = 85082880163.
When I delete that one and keep only the one with eid = 85082880158, it works.
My file is an XML file containing those 2 lines from the link. I have also tried to merge the two into one XML but could not manage it.
What is wrong with my data/approach? (My ultimate plan is to create such a file containing thousands of different XMLs to be parsed.)
Your trouble lies in this piece of XML:
<xocs:item>
  <item>
    <bibrecord>
      <head>
        <abstracts>
          <abstract original="y" xml:lang="eng">
            <ce:para>
              Apple is often affected by [...] intercellular CO <inf>2</inf> concentration [...]
While this is correct XML (mixed content), the embedded <inf>2</inf> tag seems to break the spark-xml parser. If you remove it (or convert it to the corresponding HTML entities) you will get correct results.
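A minimal preprocessing sketch (in Scala, my own suggestion rather than part of the original answer; the same substitution works in any language) that escapes the embedded tags before handing the file to spark-xml:

import java.nio.file.{Files, Paths}

// Escape <inf>...</inf> so spark-xml sees plain text instead of a nested element.
// Assumes the <inf> tags carry no attributes.
val raw = new String(Files.readAllBytes(Paths.get("sample.xml")), "UTF-8")
val escaped = raw.replace("<inf>", "&lt;inf&gt;").replace("</inf>", "&lt;/inf&gt;")
Files.write(Paths.get("sample_escaped.xml"), escaped.getBytes("UTF-8"))
// then load "sample_escaped.xml" with spark.read.format("xml") as before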
I am trying to combine the two columns "Format Group" and "Format Subgroup" into a single column called Format.
The output in the final Format column should be in the form Format Group:Format Subgroup.
I need to create my own UDF using some given data, but I am not sure why my UDF doesn't like the input I have given it.
These are the first rows of the data I use:
checkoutDF:
BibNumber, ItemBarcode, ItemType, Collection, CallNumber, CheckoutDateTime
1842225, 0010035249209, acbk, namys, MYSTERY ELKINS1999, 05/23/2005 03:20:00 PM
dataDictionaryDF:
Code, Description, Code Type, Format Group, Format Subgroup
acdvd, DVD: Adult/YA, ItemType, Media, Video Disc
Updated the code: changed Seq[Seq[String]] to String.
def numberCheckoutRecordsPerFormat(checkoutDF: DataFrame, dataDictionaryDF: DataFrame): DataFrame = {
  val createFeatureVector = udf { (Format_Group: String, Format_Subgroup: String) =>
    dataDictionaryDF.map(x => if (Format_Group.flatten.contains(x)) 1.0 else 0.0) ++ Array(Format_Subgroup)
  }
  checkoutDF
    .na.drop()
    .join(dataDictionaryDF
      .select($"Format_Group", $"Format_Subgroup", $"Code".as("ItemType")),
      "ItemType")
    .withColumn("Format", createFeatureVector(dataDictionaryDF("Format_Group"), dataDictionaryDF("Format_Subgroup")))
    .groupBy("ItemBarCode")
    .agg(count("ItemBarCode"))
    .withColumnRenamed("count(ItemBarCode)", "CheckoutCount")
    .select($"Format", $"CheckoutCount")
}
Furthermore, numberCheckoutRecordsPerFormat should return a DataFrame of Format and number of checkouts for a given item, but I have that part covered myself.
The data set used is the Seattle Library Checkout Records from Kaggle.
Thanks, people!
Doomdaam, you can try the concat_ws built-in function (always prefer built-in functions when possible). Your code will look like:
checkoutDF
.na.drop()
.join(dataDictionaryDF
.select($"Format_Group", $"Format_Subgroup", $"Code".as("ItemType"))
, "ItemType")
.withColumn("Format", concat_ws(":",$"Format_Group", $"Format_Subgroup"))
.groupBy("ItemBarCode")
.agg(count("ItemBarCode"))
.withColumnRenamed("count(ItemBarCode)", "CheckoutCount")
.select($"Format", $"CheckoutCount")
Otherwise your UDF would have been:
val createFeatureVector = udf { (formatGroup: String, formatSubgroup: String) => Seq(formatGroup, formatSubgroup).mkString(":") }
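which you would then apply in place of the concat_ws line, e.g. (column names as in your join):

.withColumn("Format", createFeatureVector($"Format_Group", $"Format_Subgroup"))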
I have an XML document that has mixed content, and I am using a custom schema in a DataFrame to parse it. I am having an issue where the schema will only pick up the text for "Measure".
The XML looks like this:
<QData>
  <Measure> some text here
    <Answer>Answer1</Answer>
    <Question>Question1</Question>
  </Measure>
  <Measure> some text here
    <Answer>Answer1</Answer>
    <Question>Question1</Question>
  </Measure>
</QData>
My schema is as follows:
def getCustomSchema(): StructType = StructType(Array(
  StructField("QData", StructType(Array(
    StructField("Measure", StructType(Array(
      StructField("Answer", StringType, true),
      StructField("Question", StringType, true)
    )), true)
  )), true)
))
When I try to access the data in Measure, I only get "some text here", and it fails when I try to get info from Answer. I am also getting only one Measure.
EDIT: This is how I am trying to access the data:
val result = sc.read.format("com.databricks.spark.xml").option("attributePrefix", "attr_").schema(getCustomSchema)
  .load(filename.toString)
val qDfTemp = result.mapPartitions { partition =>
  val mapper = new QDMapper()
  partition.map(row => mapper(row)).flatMap(list => list)
}.toDF()
case class QDMapper() {
  def apply(row: Row): List[QData] = {
    val qDList = new ListBuffer[QData]()
    val qualData = row.getAs[Row]("QData") // when I print this as a list I get the first Measure text and that is it
    val measure = qualData.getAs[Row]("Measure") // this fails
  }
}
You can use the row tag as a root tag and access the other elements (note that rowTag takes the bare tag name, without angle brackets):
df_schema = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='xml_tag_name').load(schema_path)
Please see https://github.com/harshaltaware/Pyspark/blob/main/Spark-data-parsing/xmlparsing.py for a brief code example.
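For the mixed content above, a sketch of a schema that could work (my own assumption, using rowTag = "QData"): Measure repeats under QData, so it should be an ArrayType, and spark-xml exposes the free text of a mixed-content element under its default valueTag, _VALUE:

import org.apache.spark.sql.types._

// Hypothetical schema for rowTag = "QData"; _VALUE captures "some text here"
val measureType = StructType(Array(
  StructField("_VALUE", StringType, true),
  StructField("Answer", StringType, true),
  StructField("Question", StringType, true)
))
val qDataSchema = StructType(Array(
  StructField("Measure", ArrayType(measureType), true)
))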
I have a UDF like this:
import org.apache.spark.sql.functions.udf

case class bodyresults(text: String, code: String)
val bodyudf = udf { (body: String) =>
  // appending body tag explicitly to the xml before parsing
  val xmlElems = xml.XML.loadString(s"""<?xml version="1.0" encoding="utf-8"?> <!DOCTYPE body [<!ENTITY nbsp " ">]><body>${body}</body>""")
  // extract the code inside the req
  val code = (xmlElems \\ "body" \\ "code").text
  val text = (xmlElems \\ "body").text.replace(s"${code}", "")
  bodyresults(text, code)
}
I am trying to convert the Body string into code and text strings:
CODE: the content inside the xml elements named code.
TEXT: everything else.
The Body column's type is String, and the contents look like this:
<p>I want to use a track-bar to change a form's opacity.</p>
<p>This is my code:</p>
<pre><code>decimal trans = trackBar1.Value / 5000;
this.Opacity = trans;
</code></pre>
<p>When I build the application, it gives the following error:</p>
<blockquote>
<p>Cannot implicitly convert type 'decimal' to 'double'.</p>
</blockquote>
<p>I tried using <code>trans</code> and <code>double</code> but then the
control doesn't work. This code worked fine in a past VB.NET project. </p>
,While applying opacity to a form should we use a decimal or double value?
I am trying to use this UDF with the following command:
val posts5=posts4.withColumn("codetext",bodyudf(col("Body")))
posts5.select("codetext").show()
This causes an error:
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (string) => struct<text:string,code:string>)
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 129; The element type "body" must be terminated by the matching end-tag "</body>"
But as you can see in the UDF, I am appending the body tag and closing it.
Note: What's surprising is that it works fine if I execute the command below:
posts5.select("codetext").show(19)
+--------------------+
| codetext|
+--------------------+
|[Given a represe...|
|[Is there any sta...|
|[What is the diff...|
|[How do I store b...|
|[If I have a trig...|
|[How do you page ...|
|[Does anyone know...|
|[Does anybody kno...|
|[What are some gu...|
|[There are severa...|
|[I wrote a window...|
|[How do I format ...|
|[One may not alwa... |
|[Are PHP variable...|
|[What's the simpl...|
|[Does anyone know...|
|[I'm looking for ...|
|[What is the corr...|
|[I was wondering ...|
+--------------------+
But if I use any number greater than 19, it causes the error:
posts5.select("codetext").show(20)
or
posts5.select("codetext").show()
Just in case, I am attaching the body string in the 20th row:
<p>I have a Queue<T> object that I have initialised to a capacity of 2, but obviously that is just the capacity and it keeps expanding as I add items. Is there already an object that automatically dequeues an item when the limit is reached, or is the best solution to create my own inherited class?</p>,Limit size of Queue<T> in .NET?
I cannot figure out the reason for this error. I cannot find relevant information on the web, so please let me know what's causing it.
EDIT:
I dropped row 20 because that string contains a tag with no matching closing tag.
But now the error comes at row 19.
posts5.select("codetext").show(18) //18 or below works fine
posts5.select("codetext").show(19) // does not work
I took the string in the 19th row and passed it directly to the function, and it works fine.
But why does it not work when I pass the entire column to the UDF?
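A note on why show(19) succeeds: show(n) only evaluates the first n rows, so a malformed body is only parsed once you ask for enough rows to reach it. If you want the job to survive such rows, a defensive sketch (my own suggestion, not a confirmed fix) wraps the parse in Try:

import scala.util.Try
import org.apache.spark.sql.functions.udf

// Rows whose body is not well-formed XML (e.g. an unescaped <T> in "Queue<T>")
// yield empty strings instead of failing the whole job.
val bodyudfSafe = udf { (body: String) =>
  Try {
    val xmlElems = xml.XML.loadString(s"""<?xml version="1.0" encoding="utf-8"?> <!DOCTYPE body [<!ENTITY nbsp " ">]><body>${body}</body>""")
    val code = (xmlElems \\ "body" \\ "code").text
    val text = (xmlElems \\ "body").text.replace(code, "")
    bodyresults(text, code)
  }.getOrElse(bodyresults("", ""))
}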
I'm parsing this YAML file:
View:
  from : 01.01.2007
  to : 04.01.2007
  driver : sun.jdbc.odbc.JdbcOdbcDriver
using SnakeYAML in Scala like this:
import org.yaml.snakeyaml.Yaml
import scala.collection.JavaConverters._

val stream = getClass.getResourceAsStream("/config_view.yml")
var configMap = new Yaml().load(stream).asInstanceOf[java.util.Map[String, Any]].asScala
var view = configMap("View").asInstanceOf[java.util.LinkedHashMap[String, String]].asScala
view = view + ("from" -> "neu") // some test modification
and I dump it like this:
import java.io.FileWriter

val fileWriter = new FileWriter(System.getProperty("user.home") + "\\Desktop\\test.yml")
new Yaml().dump(Map[String, Any]("View" -> view.asJava).asJava, fileWriter)
which saves the new yaml file like this:
View: {driver: sun.jdbc.odbc.JdbcOdbcDriver, from: neu, to: 04.01.2007}
But I want it to save it like this:
View:
  driver: sun.jdbc.odbc.JdbcOdbcDriver
  from: neu
  to: 04.01.2007
How can I tell SnakeYAML to save it in the desired format you see above?
By default SnakeYAML uses DumperOptions.FlowStyle.FLOW, but it can be changed to DumperOptions.FlowStyle.BLOCK, which will dump the data in the desired format.
An example in Kotlin:
val options = DumperOptions()
options.indent = 2
options.defaultFlowStyle = DumperOptions.FlowStyle.BLOCK
Yaml(options).dump(yourObject)
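Since the question is in Scala, the equivalent there (a sketch reusing the question's own dump call) would be:

import java.io.FileWriter
import org.yaml.snakeyaml.{DumperOptions, Yaml}
import scala.collection.JavaConverters._

val options = new DumperOptions()
options.setIndent(2)
options.setDefaultFlowStyle(DumperOptions.FlowStyle.BLOCK)

// Dump with block style; `view` is the map from the question
val fileWriter = new FileWriter(System.getProperty("user.home") + "\\Desktop\\test.yml")
new Yaml(options).dump(Map[String, Any]("View" -> view.asJava).asJava, fileWriter)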
How about manually handling the indentation and the key: value formatting (using spaces for the indent, since YAML does not allow tabs):
view.map { case (k, v) => s"  $k: $v\n" }.mkString
In the case of nested maps you will want a method that:
accepts the current "level" of nesting and places that many indents in front of the output to give the proper nesting;
checks each of the entries, and if an entry is another collection type, recursively invokes itself, which increases the indentation level.
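A sketch of that recursion (a hypothetical helper for string-keyed maps, not tested against SnakeYAML's exact output):

// Recursively render nested maps as block-style YAML,
// indenting two spaces per nesting level.
def dumpBlock(m: Map[String, Any], level: Int = 0): String = {
  val indent = "  " * level
  m.map {
    case (k, v: Map[_, _]) => s"$indent$k:\n" + dumpBlock(v.asInstanceOf[Map[String, Any]], level + 1)
    case (k, v)            => s"$indent$k: $v\n"
  }.mkString
}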