spark-xml: Crashing out of memory trying to parse single large XML file

I'm attempting to process bz2-compressed XML files with a nested schema into normalized tables: each level of the schema is stored as a row, and any child elements are stored as rows in a separate table with a foreign key back to the parent row they belong to.
These files can be rather large: 181 MB compressed, exploding into a huge number of rows from a single file. If we don't supply a schema to the DataFrameReader, it crashes with out-of-memory errors just trying to parse the file and infer the schema. I provided the schema to get past this, but...
I currently have a DataFrame at the parent level of the XML, selecting a nested schema out of the entire file, but when I try to write that parent DataFrame to Parquet, the single node doing the file load throws out-of-memory errors.
How can I traverse the XML and write the data into normalized Parquet files for each level of the schema?
The schema is:
val schema = StructType(Seq(
StructField("_xmlns", StringType, true),
StructField("includedCalendarYearCategory", StructType(Seq(
StructField("calendarYear", LongType, true),
StructField("includedPlanCategory", StructType(Seq(
StructField("includedBilltypeCategory", ArrayType(StructType(Seq(
StructField("billTypeCode", LongType, true),
StructField("claimsExcluded", LongType, true),
StructField("claimsIncluded", LongType, true)
)), true), true),
StructField("includedMedicalClaimCategory", ArrayType(StructType(Seq(
StructField("enrolleeIdentifier", StringType, true),
StructField("includedSupplementalRecordCategory", StructType(Seq(
StructField("addDeleteVoidCode", StringType, true),
StructField("enrolleeIdentifier", StringType, true),
StructField("originalMedicalClaimId", StringType, true),
StructField("reasonCode", StringType, true),
StructField("supplementalDetailRecordId", StringType, true)
)), true),
StructField("medicalClaimIdentifier", StringType, true),
StructField("raEligibleIndicator", LongType, true),
StructField("rxcEligibleIndicator", LongType, true),
StructField("serviceCode", LongType, true)
)), true), true),
StructField("includedPharmacyClaimCategory", ArrayType(StructType(Seq(
StructField("enrolleeIdentifier", StringType, true),
StructField("nationalDrugCode", LongType, true),
StructField("pharmacyClaimIdentifier", StringType, true),
StructField("policyPaidAmount", DoubleType, true),
StructField("raEligibleIndicator", LongType, true),
StructField("reasonCode", StringType, true)
)), true), true),
StructField("includedReasonCodeCategory", ArrayType(StructType(Seq(
StructField("medicalClaimsExcluded", LongType, true),
StructField("medicalReasonCode", StringType, true),
StructField("pharmacyClaimsExcluded", LongType, true),
StructField("pharmacyReasonCode", StringType, true)
)), true), true),
StructField("includedServiceCodeCategory", ArrayType(StructType(Seq(
StructField("claimsExcluded", LongType, true),
StructField("claimsIncluded", LongType, true),
StructField("serviceCode", StringType, true)
)), true), true),
StructField("includedUnlinkedSupplementalCategory", ArrayType(StructType(Seq(
StructField("addDeleteVoidCode", LongType, true),
StructField("enrolleeIdentifier", StringType, true),
StructField("originalMedicalClaimId", StringType, true),
StructField("supplementalDiagnosisDetailRecordId", LongType, true)
)), true), true),
StructField("medicalClaimsExcluded", LongType, true),
StructField("medicalClaimsIncluded", LongType, true),
StructField("pharmacyClaimsExcluded", LongType, true),
StructField("pharmacyClaimsIncluded", LongType, true),
StructField("planIdentifier", StringType, true),
StructField("supplementalRecordsExcluded", LongType, true),
StructField("supplementalRecordsIncluded", LongType, true),
StructField("totalEnrollees", LongType, true),
StructField("totalEnrolleesWRaEligibleclaims", LongType, true),
StructField("totalUniqueNDC", LongType, true)
)), true),
StructField("medicalClaimsExcluded", LongType, true),
StructField("medicalClaimsIncluded", LongType, true),
StructField("pharmacyClaimsExcluded", LongType, true),
StructField("pharmacyClaimsIncluded", LongType, true),
StructField("supplementalRecordsExcluded", LongType, true),
StructField("supplementalRecordsIncluded", LongType, true),
StructField("totalEnrollees", LongType, true),
StructField("totalEnrolleesWRaEligibleclaims", LongType, true),
StructField("totalUniqueNDC", LongType, true)
)), true),
StructField("includedFileHeader", StructType(Seq(
StructField("cmsBatchIdentifier", StringType, true),
StructField("cmsJobIdentifier", LongType, true),
StructField("edgeServerIdentifier", LongType, true),
StructField("edgeServerProcessIdentifier", LongType, true),
StructField("edgeServerVersion", StringType, true),
StructField("globalReferenceDataVersion", StringType, true),
StructField("interfaceControlReleaseNumber", StringType, true),
StructField("issuerIdentifier", LongType, true),
StructField("outboundFileGenerationDateTime", TimestampType, true),
StructField("outboundFileIdentifier", StringType, true),
StructField("outboundFileTypeCode", StringType, true),
StructField("snapShotFileHash", StringType, true),
StructField("snapShotFileName", StringType, true)
)), true)
))
We are reading the DataFrame like this:
sparkSession.read
.schema(schema) // schema from above code block
.format("xml")
.option("rootTag", "riskAdjustmentClaimSelectionDetailReport")
.option("rowTag", "riskAdjustmentClaimSelectionDetailReport")
.xml(path)
.repartition(200)
I thought repartitioning would help spread the single file across multiple nodes, but it makes sense that the XML has to be fully parsed before Spark can figure out how to chunk it. Is there something I can configure in Spark so that a massive XML file is loaded across the cluster instead of entirely on a single node, using the embedded spark-xml library?

It turns out that Spark can't handle a large single XML file like this: it must read the entire document on a single node in order to determine how to break it up, and if the file is too large to fit in memory uncompressed, it chokes.
I had to use Scala to parse it linearly without Spark, node by node in a recursive fashion, so that the entire file is never loaded into memory at once.
I wrote a recursive method that traverses the entire XML tree of the massive file and appends each element's children as a row to a CSV for that particular element type, referencing the parent node's ID, using Scala's built-in XML traversal support.
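For anyone who hits the same wall, the essence of the workaround is a streaming (pull) parse, so that only the current element is held in memory and each completed element is flushed straight to the CSV for its table. The sketch below is not the original code: it uses the JDK's StAX API (javax.xml.stream) instead of scala-xml, assumes the bz2 file has already been decompressed, and the tag names and claims.csv output are hypothetical stand-ins for the real report structure.

import java.io.{FileInputStream, PrintWriter}
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

object StreamingXmlToCsv {
  def main(args: Array[String]): Unit = {
    // args(0): path to the decompressed XML file
    val reader = XMLInputFactory.newInstance().createXMLStreamReader(new FileInputStream(args(0)))
    val out = new PrintWriter("claims.csv")                  // hypothetical child table
    out.println("planIdentifier,medicalClaimIdentifier,enrolleeIdentifier")

    var plan = ""                                            // parent key carried down to child rows
    var claimId = ""
    var enrollee = ""
    var current = ""                                         // element we are currently inside

    while (reader.hasNext) {
      reader.next() match {
        case XMLStreamConstants.START_ELEMENT =>
          current = reader.getLocalName
        case XMLStreamConstants.CHARACTERS if !reader.isWhiteSpace =>
          current match {                                    // capture only the fields this table needs
            case "planIdentifier"         => plan = reader.getText.trim
            case "medicalClaimIdentifier" => claimId = reader.getText.trim
            case "enrolleeIdentifier"     => enrollee = reader.getText.trim
            case _                        => ()
          }
        case XMLStreamConstants.END_ELEMENT =>
          // Flush one CSV row per completed claim element; nothing larger than
          // a single claim is ever kept in memory.
          if (reader.getLocalName == "includedMedicalClaimCategory")
            out.println(s"$plan,$claimId,$enrollee")
        case _ => ()
      }
    }
    reader.close()
    out.close()
  }
}

The per-element CSVs produced this way can then be read back into Spark, which splits plain CSV files without trouble, and written out as the normalized Parquet tables.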

Related

Creating Schema of JSON type and Reading it using Spark in Scala [Error : cannot resolve jsontostructs]

I have a JSON file like the one below:
{"Codes":[{"CName":"012","CValue":"XYZ1234","CLevel":"0","msg":"","CType":"event"},{"CName":"013","CValue":"ABC1234","CLevel":"1","msg":"","CType":"event"}}
I wanted to create the schema for this, and if the JSON file is empty ({}) it should be an empty String.
However, this is the output I get when I use df.show:
[[012, XYZ1234, 0, event, ], [013, ABC1234, 1, event, ]]
I created the schema like below:
val schemaF = ArrayType(
StructType(
Array(
StructField("CName", StringType),
StructField("CValue", StringType),
StructField("CLevel", StringType),
StructField("msg", StringType),
StructField("CType", StringType)
)
)
)
When I tried the following,
val df1 = df.withColumn("Codes", from_json('Codes, schemaF))
it gives an AnalysisException:
org.apache.spark.sql.AnalysisException: cannot resolve
'jsontostructs(Codes)' due to data type mismatch: argument 1
requires string type, however, 'Codes' is of
array<struct<CName:string,CValue:string,CLevel:string,CType:string,msg:string>>
type.;; 'Project [valid#51,
jsontostructs(ArrayType(StructType(StructField(CName,StringType,true),
StructField(CValue,StringType,true),
StructField(CLevel,StringType,true), StructField(msg,StringType,true),
StructField(CType,StringType,true)),true), Codes#8,
Some(America/Bogota)) AS errorCodes#77]
Can someone please tell me why and how to resolve this issue?
val schema =
StructType(
Array(
StructField("CName", StringType),
StructField("CValue", StringType),
StructField("CLevel", StringType),
StructField("msg", StringType),
StructField("CType", StringType)
)
)
val df0 = spark.read.schema(schema).json("/path/to/data.json")
Your schema does not correspond to the JSON file you're trying to read. It's missing the Codes field of array type; it should look like this:
val schema = StructType(
Array(
StructField(
"Codes",
ArrayType(
StructType(
Array(
StructField("CLevel", StringType, true),
StructField("CName", StringType, true),
StructField("CType", StringType, true),
StructField("CValue", StringType, true),
StructField("msg", StringType, true)
)
), true)
,true)
)
)
And you want to apply it when reading the JSON, not with the from_json function:
val df = spark.read.schema(schema).json("path/to/json/file")
df.printSchema
//root
// |-- Codes: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- CLevel: string (nullable = true)
// | | |-- CName: string (nullable = true)
// | | |-- CType: string (nullable = true)
// | | |-- CValue: string (nullable = true)
// | | |-- msg: string (nullable = true)
EDIT:
For the question in your comment, you can use this schema definition:
val schema = StructType(
Array(
StructField(
"Codes",
ArrayType(
StructType(
Array(
StructField("CLevel", StringType, true),
StructField("CName", StringType, true),
StructField("CType", StringType, true),
StructField("CValue", StringType, true),
StructField("msg", StringType, true)
)
), true)
,true),
StructField("lid", StructType(Array(StructField("idNo", StringType, true))), true)
)
)

How do I specify a schema when loading a csv from S3 in Spark with Scala?

I've googled through multiple syntax iterations on Stack Overflow, and none of them work for me. My code is as follows:
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DoubleType};
val schema1 = (new StructType)
.add("PASSENGERID", IntegerType, true)
.add("PCLASS", IntegerType, true)
.add("NAME", IntegerType, true)
.add("SEX", StringType, true)
.add("AGE", DoubleType, true)
.add("SIBSP", IntegerType, true)
.add("PARCH", IntegerType, true)
.add("TICKET", StringType, true)
.add("FARE", DoubleType, true)
.add("CABIN", StringType, true)
.add("EMBARKED", StringType, true)
val schema2 = StructType(
StructField("PASSENGERID", IntegerType, true) ::
StructField("PCLASS", IntegerType, true) ::
StructField("NAME", IntegerType, true) ::
StructField("SEX", StringType, true) ::
StructField("AGE", DoubleType, true) ::
StructField("SIBSP", IntegerType, true) ::
StructField("PARCH", IntegerType, true) ::
StructField("TICKET", StringType, true) ::
StructField("FARE", DoubleType, true) ::
StructField("CABIN", StringType, true) ::
StructField("EMBARKED", StringType, true) :: Nil)
val schema3 = StructType(Array(
StructField("PASSENGERID", IntegerType, true),
StructField("PCLASS", IntegerType, true),
StructField("NAME", IntegerType, true),
StructField("SEX", StringType, true),
StructField("AGE", DoubleType, true),
StructField("SIBSP", IntegerType, true),
StructField("PARCH", IntegerType, true),
StructField("TICKET", StringType, true),
StructField("FARE", DoubleType, true),
StructField("CABIN", StringType, true),
StructField("EMBARKED", StringType, true)))
val schema4 = StructType(Seq(
StructField("PASSENGERID", IntegerType, true),
StructField("PCLASS", IntegerType, true),
StructField("NAME", IntegerType, true),
StructField("SEX", StringType, true),
StructField("AGE", DoubleType, true),
StructField("SIBSP", IntegerType, true),
StructField("PARCH", IntegerType, true),
StructField("TICKET", StringType, true),
StructField("FARE", DoubleType, true),
StructField("CABIN", StringType, true),
StructField("EMBARKED", StringType, true)
))
val schema5 = StructType(
List(
StructField("PASSENGERID", IntegerType, true),
StructField("PCLASS", IntegerType, true),
StructField("NAME", IntegerType, true),
StructField("SEX", StringType, true),
StructField("AGE", DoubleType, true),
StructField("SIBSP", IntegerType, true),
StructField("PARCH", IntegerType, true),
StructField("TICKET", StringType, true),
StructField("FARE", DoubleType, true),
StructField("CABIN", StringType, true),
StructField("EMBARKED", StringType, true)
)
)
/*
val df = spark.read
.option("header", true)
.csv("s3a://mybucket/ybspark/input/PASSENGERS.csv")
.schema(schema)
*/
//this works
val df = spark.read.option("header", true).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv")
df.show(false)
df.printSchema()
//fun errors
val df1 = spark.read.option("header", true).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv").schema(schema1)
val df2 = spark.read.option("header", true).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv").schema(schema2)
val df3 = spark.read.option("header", true).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv").schema(schema3)
val df4 = spark.read.option("header", true).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv").schema(schema4)
val df5 = spark.read.option("header", true).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv").schema(schema5)
The data is the Kaggle Titanic survival set, with the header fields capitalized. I've tried this both as a script submitted to spark-shell and by running the commands in spark-shell manually. spark-shell -i spits out syntax errors on the dfX reads; if I load any of the schemas manually they seem fine, yet the reads all fail with the same error.
scala> val df4 = spark.read.option("header", true).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv").schema(schema4)
<console>:26: error: overloaded method value apply with alternatives:
(fieldIndex: Int)org.apache.spark.sql.types.StructField <and>
(names: Set[String])org.apache.spark.sql.types.StructType <and>
(name: String)org.apache.spark.sql.types.StructField
cannot be applied to (org.apache.spark.sql.types.StructType)
val df4 = spark.read.option("header", true).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv").schema(schema4)
I don't understand what I'm doing wrong. I'm on Spark version 2.4.4 on AWS EMR.
Set the inferSchema option to false so that Spark will not infer the schema while loading the data.
Move your .schema call before .csv: .schema(...) is a method on DataFrameReader, so it has to come before .csv(...), which returns a DataFrame.
Please check the code below.
scala> val df1 = spark.read.option("header", true).option("inferSchema", false).schema(schema1).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv")
df1: org.apache.spark.sql.DataFrame = [PASSENGERID: int, PCLASS: int ... 9 more fields]
scala> val df2 = spark.read.option("header", true).option("inferSchema", false).schema(schema2).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv")
df2: org.apache.spark.sql.DataFrame = [PASSENGERID: int, PCLASS: int ... 9 more fields]
scala> val df3 = spark.read.option("header", true).option("inferSchema", false).schema(schema3).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv")
df3: org.apache.spark.sql.DataFrame = [PASSENGERID: int, PCLASS: int ... 9 more fields]
scala> val df4 = spark.read.option("header", true).option("inferSchema", false).schema(schema4).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv")
df4: org.apache.spark.sql.DataFrame = [PASSENGERID: int, PCLASS: int ... 9 more fields]
scala> val df5 = spark.read.option("header", true).option("inferSchema", false).schema(schema5).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv")
df5: org.apache.spark.sql.DataFrame = [PASSENGERID: int, PCLASS: int ... 9 more fields]

Overloaded method value apply with alternatives:

I am new to Spark and I was trying to define a schema for some JSON data, and I ran into the following error in spark-shell:
<console>:28: error: overloaded method value apply with alternatives:
(fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: java.util.List[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
cannot be applied to (org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField)
val schema = StructType(Array(StructField("type", StructType(StructField("name", StringType, true), StructField("version", StringType, true)), true) :: StructField("value", StructType(StructField("answerBlacklistedEntities", StringType, true) :: StructField("answerBlacklistedPhrase", StringType, true) :: StructField("answerEntities", StringType, true) :: StructField("answerText", StringType, true) :: StructField("blacklistReason", StringType, true) :: StructField("blacklistedDomains", StringType, true) :: StructField("blacklistedEntities", ArrayType(StringType, true), true) :: StructField("customerId", StringType, true) :: StructField("impolitePhrase", StringType, true) :: StructField("isResponseBlacklisted", BooleanType, true) :: StructField("queryString", StringType, true) :: StructField("utteranceDomains", StringType, true) :: StructField("utteranceEntities", ArrayType(StringType, true), true) :: StructField("utteranceId", StructType(StructField("identifier", StringType, true)), true)) :: Nil)))
Can anybody guide me to what's going on here? :) I'd really appreciate your help!
This happens because of this:
val schema = StructType(Array(StructField("type",
StructType(StructField("name", StringType, true), ...))
You create a StructType and pass StructFields directly as its arguments, while it should be given a single sequence (Array, Seq, or List) of StructFields:
val schema = StructType(Array(StructField("type",
StructType(Array(StructField("name", StringType, true), ...)) ...)
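To make the fix concrete, here is a small self-contained sketch of the same pattern, using just a couple of the fields from the schema above; every nested StructType wraps its fields in an Array:

import org.apache.spark.sql.types._

// Each StructType takes one collection of StructFields, never bare StructFields.
val schema = StructType(Array(
  StructField("type", StructType(Array(
    StructField("name", StringType, true),
    StructField("version", StringType, true)
  )), true),
  StructField("value", StructType(Array(
    StructField("answerText", StringType, true),
    StructField("utteranceEntities", ArrayType(StringType, true), true)
  )), true)
))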

Turn many features in a data frame with spark ML

I was following this tutorial https://mapr.com/blog/churn-prediction-sparkml/
and I realized that the CSV schema had to be written out by hand, like this:
val schema = StructType(Array(
StructField("state", StringType, true),
StructField("len", IntegerType, true),
StructField("acode", StringType, true),
StructField("intlplan", StringType, true),
StructField("vplan", StringType, true),
StructField("numvmail", DoubleType, true),
StructField("tdmins", DoubleType, true),
StructField("tdcalls", DoubleType, true),
StructField("tdcharge", DoubleType, true),
StructField("temins", DoubleType, true),
StructField("tecalls", DoubleType, true),
StructField("techarge", DoubleType, true),
StructField("tnmins", DoubleType, true),
StructField("tncalls", DoubleType, true),
StructField("tncharge", DoubleType, true),
StructField("timins", DoubleType, true),
StructField("ticalls", DoubleType, true),
StructField("ticharge", DoubleType, true),
StructField("numcs", DoubleType, true),
StructField("churn", StringType, true)
However, I have a dataset with 335 features, so I don't want to write them all out... Is there a simple way to retrieve them and define the schema accordingly?
I found the solution here: https://dzone.com/articles/using-apache-spark-dataframes-for-processing-of-ta
It was easier than I thought.
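The linked article's exact code isn't reproduced here, but the usual trick is to derive the StructFields from the column names instead of typing out 335 of them; a sketch, assuming every feature can be read as a string (and cast to numeric types afterwards), with a hypothetical header and path:

import org.apache.spark.sql.types._

// Build the schema from the header line rather than writing each StructField by hand.
val header = "state,len,acode,intlplan,vplan,numvmail,tdmins"   // hypothetical: use the real 335-column header
val schema = StructType(header.split(",").map(name => StructField(name, StringType, nullable = true)))

val df = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("/path/to/churn-data.csv")                               // hypothetical path

Alternatively, .option("inferSchema", "true") lets Spark work out the column types itself, at the cost of an extra pass over the data.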

Spark: Programmatically creating dataframe schema in scala

I have a smallish dataset that will be the result of a Spark job. I am thinking about converting this dataset to a dataframe for convenience at the end of the job, but have struggled to correctly define the schema. The problem is the last field below (topValues); it is an ArrayBuffer of tuples -- keys and counts.
val innerSchema =
StructType(
Array(
StructField("value", StringType),
StructField("count", LongType)
)
)
val outputSchema =
StructType(
Array(
StructField("name", StringType, nullable=false),
StructField("index", IntegerType, nullable=false),
StructField("count", LongType, nullable=false),
StructField("empties", LongType, nullable=false),
StructField("nulls", LongType, nullable=false),
StructField("uniqueValues", LongType, nullable=false),
StructField("mean", DoubleType),
StructField("min", DoubleType),
StructField("max", DoubleType),
StructField("topValues", innerSchema)
)
)
val result = stats.columnStats.map{ c =>
Row(c._2.name, c._1, c._2.count, c._2.empties, c._2.nulls, c._2.uniqueValues, c._2.mean, c._2.min, c._2.max, c._2.topValues.topN)
}
val rdd = sc.parallelize(result.toSeq)
val outputDf = sqlContext.createDataFrame(rdd, outputSchema)
outputDf.show()
The error I'm getting is a MatchError: scala.MatchError: ArrayBuffer((10,2), (20,3), (8,1)) (of class scala.collection.mutable.ArrayBuffer)
When I debug and inspect my objects, I'm seeing this:
rdd: ParallelCollectionRDD[2]
rdd.data: "ArrayBuffer" size = 2
rdd.data(0): [age,2,6,0,0,3,14.666666666666666,8.0,20.0,ArrayBuffer((10,2), (20,3), (8,1))]
rdd.data(1): [gender,3,6,0,0,2,0.0,0.0,0.0,ArrayBuffer((M,4), (F,2))]
It seems to me that I've accurately described the ArrayBuffer of tuples in my innerSchema, but Spark disagrees.
Any idea how I should be defining the schema?
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val rdd = sc.parallelize(Array(Row(ArrayBuffer(1, 2, 3, 4))))
val df = sqlContext.createDataFrame(
  rdd,
  StructType(Seq(StructField("arr", ArrayType(IntegerType, false), false)))
)
df.printSchema
root
|-- arr: array (nullable = false)
| |-- element: integer (containsNull = false)
df.show
+------------+
| arr|
+------------+
|[1, 2, 3, 4]|
+------------+
As David pointed out, I needed to use an ArrayType. Spark is happy with this:
val outputSchema =
StructType(
Array(
StructField("name", StringType, nullable=false),
StructField("index", IntegerType, nullable=false),
StructField("count", LongType, nullable=false),
StructField("empties", LongType, nullable=false),
StructField("nulls", LongType, nullable=false),
StructField("uniqueValues", LongType, nullable=false),
StructField("mean", DoubleType),
StructField("min", DoubleType),
StructField("max", DoubleType),
StructField("topValues", ArrayType(StructType(Array(
StructField("value", StringType),
StructField("count", LongType)
))))
)
)
import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val searchPath = "/path/to/.csv"
val columns = "col1,col2,col3,col4,col5,col6,col7"
val fields = columns.split(",").map(fieldName => StructField(fieldName, StringType, nullable = true))
val customSchema = StructType(fields)
var dfPivot = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("inferSchema", "false")
  .schema(customSchema)
  .load(searchPath)
Loading the data with a custom schema like this is much faster than letting Spark infer the schema while loading.
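As a quick sanity check, the schema that gets applied is all strings (output sketched here assuming the seven hypothetical col1..col7 names above):

dfPivot.printSchema()
// root
//  |-- col1: string (nullable = true)
//  |-- col2: string (nullable = true)
//  |-- col3: string (nullable = true)
//  |-- col4: string (nullable = true)
//  |-- col5: string (nullable = true)
//  |-- col6: string (nullable = true)
//  |-- col7: string (nullable = true)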