How to parse a dataframe containing xml strings? - apache-spark-xml

How to parse an XML file in which one of the columns itself contains XML data?
In one of our projects, we receive XML files in which some of the columns store another XML document. When this data is loaded into a DataFrame, the inner XML is converted to StringType (which is not intended), so we cannot reach its nodes with the dot operator when querying the data.
I have searched the net extensively for answers, but with no luck. I found one open GitHub issue that matches my use case exactly:
https://github.com/databricks/spark-xml/issues/140
My XML source file looks like this:
+------+--------------------+
| id | xml |
+------+--------------------+
| 6723 |<?xml version="1....|
| 6741 |<?xml version="1....|
| 6774 |<?xml version="1....|
| 6735 |<?xml version="1....|
| 6828 |<?xml version="1....|
| 6764 |<?xml version="1....|
| 6732 |<?xml version="1....|
| 6792 |<?xml version="1....|
| 6754 |<?xml version="1....|
| 6833 |<?xml version="1....|
+------+--------------------+
SQL Server has an XML datatype for storing XML in a database column, but Spark SQL has no equivalent.
Has anyone come across the same issue and found a workaround? If so, please share. We're using Spark with Scala.

You can use something like the following:
df.withColumn("ID", split(col("xml"), ",").getItem(1))
Here ID is the new field name, and in
col("xml")
xml is the DataFrame field name.
"," is the delimiter to split on (change it as required); getItem(1) picks the second piece, since indexing starts at 0.

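import xml.etree.ElementTree as ET
from pyspark.sql import Row
row_counter = Row('id', 'abc')
def parser_xml(string_xml):
    # string_xml is a record whose first field holds the XML text
    root = ET.fromstring(string_xml[0])
    visitor = root.find('visitor')
    col1 = visitor.attrib['id']
    col2 = visitor.attrib['abc']
    # Return the parsed values (the original returned the undefined names id and abc)
    return row_counter(col1, col2)
data = rdd.map(lambda string_file: parser_xml(string_file))
df_xml = spark.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)
display(df_xml)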
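The parsing step in the snippet above can be tried without a cluster. The following is a minimal sketch using only Python's standard library. The sample payload, its root/visitor layout, and the helper name parse_visitor are assumptions for illustration, since the real XML in the question is truncated.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample payload; the real XML in the question is not shown in full.
SAMPLE_XML = '<?xml version="1.0"?><root><visitor id="6723" abc="first"/></root>'


def parse_visitor(xml_string):
    """Return (id, abc) from the first <visitor> child, or (None, None) if absent."""
    root = ET.fromstring(xml_string)
    visitor = root.find("visitor")
    if visitor is None:
        return (None, None)
    # .get() returns None instead of raising when an attribute is missing
    return (visitor.get("id"), visitor.get("abc"))


print(parse_visitor(SAMPLE_XML))
```

A function of this shape could then be passed to rdd.map, with the returned tuples turned into rows. Using .get() rather than .attrib[...] is a design choice here so that one malformed record does not fail the whole job.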

Related

How to decode HTML entities in Spark-scala?

I have Spark code that reads some data from a database.
One of the columns (of type string) named "title" contains the following data.
+-------------------------------------------------+
|title |
+-------------------------------------------------+
|Example sentence |
|Read the ‘Book’ |
|‘LOTR’ Is A Great Book |
+-------------------------------------------------+
I'd like to decode the HTML entities so that the column looks like this:
+-------------------------------------------+
|title |
+-------------------------------------------+
|Example sentence |
|Read the ‘Book’ |
|‘LOTR’ Is A Great Book |
+-------------------------------------------+
There is a library, "html-entities", for node.js that does exactly what I am looking for,
but I am unable to find something similar for Spark with Scala.
What would be a good approach to do this?
You can use org.apache.commons.lang.StringEscapeUtils with the help of a UDF to achieve this.
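import org.apache.commons.lang.StringEscapeUtils
import org.apache.spark.sql.functions.udf
// Plain Scala function that decodes HTML entities in a string
val decodeHtml = (html: String) => {
  StringEscapeUtils.unescapeHtml(html)
}
// Wrap it as a UDF so it can be applied to a column
val decodeHtmlUDF = udf(decodeHtml)
// The $"..." syntax needs spark.implicits._ (already imported in spark-shell)
df.withColumn("title", decodeHtmlUDF($"title")).show()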
/*
+--------------------+
| title|
+--------------------+
| Example sentence |
| Read the ‘Book’ |
|‘LOTR’ Is A Great...|
+--------------------+
*/
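For readers working in PySpark rather than Scala, the decoding step has a counterpart in the Python standard library. This is a minimal sketch: the encoded sample values are assumptions (the exact entities in the question's data are not shown), and decode_title is a hypothetical helper name.

```python
import html


def decode_title(title):
    """Decode HTML entities in a title; None is passed through unchanged."""
    if title is None:
        return None
    return html.unescape(title)


# Assumed sample values, using one named and one numeric entity form.
samples = [
    "Example sentence",
    "Read the &lsquo;Book&rsquo;",
    "&#8216;LOTR&#8217; Is A Great Book",
]
for s in samples:
    print(decode_title(s))
```

A function like this could be wrapped in a PySpark UDF in the same way the Scala function is wrapped above.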

Exporting Spark DataFrame to S3

So after certain operations I have some data in a Spark DataFrame, to be specific, org.apache.spark.sql.DataFrame = [_1: string, _2: string ... 1 more field]
Now when I do df.show(), I get the following output, which is expected.
+--------------------+--------------------+--------------------+
| _1| _2| _3|
+--------------------+--------------------+--------------------+
|industry_name_ANZSIC|'industry_name_AN...|.isComplete("indu...|
|industry_name_ANZSIC|'industry_name_AN...|.isContainedIn("i...|
|industry_name_ANZSIC|'industry_name_AN...|.isContainedIn("i...|
| rme_size_grp|'rme_size_grp' is...|.isComplete("rme_...|
| rme_size_grp|'rme_size_grp' ha...|.isContainedIn("r...|
| rme_size_grp|'rme_size_grp' ha...|.isContainedIn("r...|
| year| 'year' is not null| .isComplete("year")|
| year|'year' has type I...|.hasDataType("yea...|
| year|'year' has no neg...|.isNonNegative("y...|
|industry_code_ANZSIC|'industry_code_AN...|.isComplete("indu...|
|industry_code_ANZSIC|'industry_code_AN...|.isContainedIn("i...|
|industry_code_ANZSIC|'industry_code_AN...|.isContainedIn("i...|
| variable|'variable' is not...|.isComplete("vari...|
| variable|'variable' has va...|.isContainedIn("v...|
| unit| 'unit' is not null| .isComplete("unit")|
| unit|'unit' has value ...|.isContainedIn("u...|
| value| 'value' is not null|.isComplete("value")|
+--------------------+--------------------+--------------------+
The problem occurs when I try exporting the dataframe as a csv to my S3 bucket.
The code I have is: df.coalesce(1).write.mode("Append").csv("s3://<my path>")
But the csv generated in my S3 path is full of gibberish or rich text. Also, the Spark prompt doesn't reappear after execution (meaning the execution didn't finish?).
What am I doing wrong and how do I rectify this?
A short description of the S3 URI schemes.
Changing the letter in the URI scheme makes a big difference, because it causes different software to be used to interface with S3.
This is the difference between the three:
s3 is a block-based overlay on top of Amazon S3, whereas s3n and s3a are not; they are object-based.
s3n supports objects up to 5GB, while s3a supports objects up to 5TB and has higher performance. Note that s3a is the successor to s3n.

Spark: choose default value for MergeSchema fields

I have a parquet that has an old schema like this :
| name | gender | age |
| Tom | Male | 30 |
And as our schema got updated to :
| name | gender | age | office |
we used mergeSchema when reading from the old parquet :
val mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
But when reading from these old parquet files, I got the following output :
| name | gender | age | office |
| Tom | Male | 30 | null |
which is normal. But I would like to use a default value for office (e.g. "California") if and only if the field is not present in the old schema. Is it possible?
There is no simple way to set a default value when a column is missing from some parquet files but present in others.
In the Parquet file format, each file contains its own schema definition. By default, when reading parquet, Spark gets the schema from the parquet file. The only effect of the mergeSchema option is that, instead of retrieving the schema from one random parquet file, Spark reads the schemas of all the parquet files and merges them.
So you can't set a default value without modifying the parquet files.
The other possible method is to provide your own schema when reading the parquet files, by calling .schema() like this:
spark.read.schema(StructType(Array(StructField("name", StringType), ...))).parquet(...)
But in this case too, there is no option to set a default value.
So the only remaining solution is to add the column's default value manually.
If we have two parquets, first one containing the data with the old schema:
+----+------+---+
|name|gender|age|
+----+------+---+
|Tom |Male |30 |
+----+------+---+
and second one containing the data with the new schema:
+-----+------+---+------+
|name |gender|age|office|
+-----+------+---+------+
|Jane |Female|45 |Idaho |
|Roger|Male |22 |null |
+-----+------+---+------+
If you don't mind replacing every null value in the "office" column, you can use .na.fill as follows:
spark.read.option("mergeSchema", "true").parquet(path).na.fill("California", Array("office"))
And you get the following result:
+-----+------+---+----------+
|name |gender|age|office |
+-----+------+---+----------+
|Jane |Female|45 |Idaho |
|Roger|Male |22 |California|
|Tom |Male |30 |California|
+-----+------+---+----------+
If you want only the old data to get the default value, you have to read each parquet file into its own dataframe, add the column with the default value where it is missing, and union all the resulting dataframes:
import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
import org.apache.spark.sql.execution.datasources.v2.parquet.ParquetTable
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.util.CaseInsensitiveStringMap
// List every parquet file under path, then read each one separately
ParquetTable("my_table",
  sparkSession = spark,
  options = CaseInsensitiveStringMap.empty(),
  paths = Seq(path),
  userSpecifiedSchema = None,
  fallbackFileFormat = classOf[ParquetFileFormat]
).fileIndex.allFiles().map(file => {
  val dataframe = spark.read.parquet(file.getPath.toString)
  // Files written with the new schema already have the column: keep them as-is
  if (dataframe.columns.contains("office")) {
    dataframe
  } else {
    // Files written with the old schema get the default value
    dataframe.withColumn("office", lit("California"))
  }
}).reduce(_ unionByName _)
And you get the following result:
+-----+------+---+----------+
|name |gender|age|office |
+-----+------+---+----------+
|Jane |Female|45 |Idaho |
|Roger|Male |22 |null |
|Tom |Male |30 |California|
+-----+------+---+----------+
Note that the whole ParquetTable([...]).fileIndex.allFiles() part is only there to retrieve the list of parquet files. It can be simplified if you are on Hadoop or on a local file system.
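The difference between the two approaches is easiest to see on the sample rows. The sketch below is a plain-Python model of the logic only, with no Spark involved: each parquet file is represented as a list of dicts, and the helper names fill_all_nulls and default_for_missing_column are made up for illustration.

```python
# Each "file" is modelled as a list of row dicts, mirroring the two sample files.
old_file = [{"name": "Tom", "gender": "Male", "age": 30}]
new_file = [
    {"name": "Jane", "gender": "Female", "age": 45, "office": "Idaho"},
    {"name": "Roger", "gender": "Male", "age": 22, "office": None},
]


def fill_all_nulls(files, column, default):
    """Model of mergeSchema followed by na.fill: every missing or null value gets the default."""
    rows = [dict(row) for f in files for row in f]
    for row in rows:
        if row.get(column) is None:
            row[column] = default
    return rows


def default_for_missing_column(files, column, default):
    """Model of the per-file approach: only files that lack the column get the default."""
    rows = []
    for f in files:
        has_column = any(column in row for row in f)
        for row in f:
            new_row = dict(row)
            if not has_column:
                new_row[column] = default
            rows.append(new_row)
    return rows


files = [new_file, old_file]
print([r["office"] for r in fill_all_nulls(files, "office", "California")])
print([r["office"] for r in default_for_missing_column(files, "office", "California")])
```

In the first model Roger's null is overwritten along with Tom's missing value. In the second, only Tom, who comes from the file without the column, receives the default.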

"Enrich" Spark DataFrame from another DF (or from HBase)

I am not sure this is the right title so feel free to suggest an edit. Btw, I'm really new to Scala and Spark.
Basically, I have a DF df_1 looking like this:
| ID | name | city_id |
| 0 | "abc"| 123 |
| 1 | "cba"| 124 |
...
The city_id is a key in a huge HBase:
123; New York; .... 124; Los Angeles; .... etc.
The result should be df_1:
| ID | name | city_id |
| 0 | "abc"| New York|
| 1 | "cba"| Los Angeles|
...
My approach was to create an external Hive table on top of HBase with the columns I need. But then again I do not know how to join them in the most efficient manner.
I suppose there is a way to do it directly from HBase, but again I do not know how.
Any hint is appreciated. :)
There is no need to create an intermediate Hive table over HBase. Spark SQL can deal with many kinds of data sources directly. Just load the HBase data into a dataframe with the HBase data source.
Once you have the proper HBase dataframe, use the following sample Spark Scala code to get the joined dataframe:
val df=Seq((0,"abc",123),(1,"cda",124),(2,"dsd",125),(3,"gft",126),(4,"dty",127)).toDF("ID","name","city_id")
val hbaseDF=Seq((123,"New York"),(124,"Los Angeles"),(125,"Chicago"),(126,"Seattle")).toDF("city_id","city_name")
df.join(hbaseDF,Seq("city_id"),"inner").drop("city_id").show()

Spark explode multiple columns of row in multiple rows

I have a problem with converting one row with three columns into three rows.
For example:
| ID | String | colA | colB | colC |
| 1 | sometext | 1 | 2 | 3 |
I need to convert it into:
| ID | String | resultColumn |
| 1 | sometext | 1 |
| 1 | sometext | 2 |
| 1 | sometext | 3 |
I only have a DataFrame with the first schema (table):
val df: DataFrame
Note: I can do it using an RDD, but is there another way? Thanks
Assuming that df has the schema of your first snippet, I would try:
df.select($"ID", $"String", explode(array($"colA", $"colB",$"colC")).as("resultColumn"))
If you also want to keep the column names, you can use a trick that consists of creating a column of arrays, each holding a value and the name of the column it came from. First create your expression
val expr = explode(array(array($"colA", lit("colA")), array($"colB", lit("colB")), array($"colC", lit("colC"))))
then use getItem (since a generator such as explode cannot be nested inside another expression, you need two selects here)
df.select($"ID", $"String", expr.as("tmp")).select($"ID", $"String", $"tmp".getItem(0).as("resultColumn"), $"tmp".getItem(1).as("columnName"))
It is a bit verbose, though; there might be a more elegant way to do this.
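To make the intended reshaping concrete, here is a plain-Python model of what the two selects produce for one input row. It involves no Spark, the helper name unpivot is hypothetical, and it is only meant to show the row-to-rows mapping, not how Spark executes it.

```python
def unpivot(row, id_cols, value_cols):
    """Yield one output row per value column, keeping the identifier columns."""
    for col in value_cols:
        out = {k: row[k] for k in id_cols}
        out["resultColumn"] = row[col]
        out["columnName"] = col
        yield out


# The sample row from the question.
row = {"ID": 1, "String": "sometext", "colA": 1, "colB": 2, "colC": 3}
for r in unpivot(row, ["ID", "String"], ["colA", "colB", "colC"]):
    print(r)
```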