Flattening a nested ORC file with Spark - Performance issue - scala

We are facing a severe performance issue when reading a nested ORC file.
This is our ORC schema:
|-- uploader: string (nullable = true)
|-- email: string (nullable = true)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- startTime: string (nullable = true)
| | |-- endTime: string (nullable = true)
| | |-- val1: string (nullable = true)
| | |-- val2: string (nullable = true)
| | |-- val3: integer (nullable = true)
| | |-- val4: integer (nullable = true)
| | |-- val5: integer (nullable = true)
| | |-- val6: integer (nullable = true)
The ‘data’ array could potentially contain 75K objects.
In our Spark application, we flatten this ORC as shown below:
val dataFrame = spark.read.orc(files: _*)
val withData = dataFrame.withColumn("data", explode(dataFrame.col("data")))
val withUploader = withData.select($"uploader", $"data")
val allData = withUploader
.withColumn("val_1", $"data.val1")
.withColumn("val_2", $"data.val2")
.withColumn("val_3", $"data.val3")
.withColumn("val_4", $"data.val4")
.withColumn("val_5", $"data.val5")
.withColumn("val_6", $"data.val6")
.withColumn("utc_start_time", timestampUdf($"data.startTime"))
.withColumn("utc_end_time", timestampUdf($"data.endTime"))
allData.drop("data")
The flattening process seems to be a very heavy operation:
Reading a 2MB ORC file with 20 records, each of which contains a data array with 75K objects, results in hours of processing time. Reading the file and collecting it without flattening it takes 22 seconds.
Is there a way to make spark process the data faster?

I'd try to avoid large explodes completely. With 75K elements in the array:
You create 75K Row objects per Row. This is a huge allocation effort.
You duplicate uploader and email 75K times. In the short term they will reference the same data, but once the data is serialized and deserialized with the internal format, they'll likely point to different objects, effectively multiplying the memory requirements.
Depending on the transformations you want to apply, it may be much more efficient to use a UDF that processes each array as a whole.
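For illustration, here is a minimal sketch of that idea, assuming the schema above: a single UDF call maps over the whole data array instead of exploding it into 75K rows first. parseTimestamp is a hypothetical stand-in for whatever timestampUdf does, and the integer fields are assumed non-null.

import java.time.Instant
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical stand-in for the logic inside timestampUdf; assumes ISO-8601 input
def parseTimestamp(s: String): Long = Instant.parse(s).toEpochMilli

// One UDF call per record: convert every element of the array in a single pass
val processData = udf { (data: Seq[Row]) =>
  data.map { e =>
    (parseTimestamp(e.getAs[String]("startTime")),
     parseTimestamp(e.getAs[String]("endTime")),
     e.getAs[String]("val1"),
     e.getAs[String]("val2"),
     e.getAs[Int]("val3"),
     e.getAs[Int]("val4"),
     e.getAs[Int]("val5"),
     e.getAs[Int]("val6"))
  }
}

// dataFrame is the DataFrame read from the ORC files in the question
val processed = dataFrame.withColumn("data", processData(col("data")))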

In case this helps someone, I found that flattening the data using flatMap is much faster than doing it with explode:
dataFrame.as[InputFormat].flatMap(r => r.data.map(v => OutputFormat(v, r.tenant)))
The improvement in performance was dramatic.
Processing a file with 20 records, each containing an array of 250K rows, took 8 hours with the explode implementation and 7 minutes with the flatMap implementation (!)
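For reference, the one-liner above assumes typed case classes roughly like the sketch below (it refers to r.tenant, which seems to play the role of uploader in the schema shown earlier; the names here are only illustrative):

import org.apache.spark.sql.SparkSession

// Element of the nested `data` array, matching the ORC schema above
case class DataElement(
  startTime: String, endTime: String,
  val1: String, val2: String,
  val3: Option[Int], val4: Option[Int], val5: Option[Int], val6: Option[Int])

// One input record with its nested array
case class InputFormat(uploader: String, email: String, data: Seq[DataElement])

// One flattened output row per array element
case class OutputFormat(element: DataElement, uploader: String)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// flatMap over typed rows: one output row per array element, no explode
val flattened = spark.read.orc("/path/to/orc")   // hypothetical path
  .as[InputFormat]
  .flatMap(r => r.data.map(v => OutputFormat(v, r.uploader)))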

Related

Spark Streaming | Write different data frames to multiple tables in parallel

I am reading data from Kafka and loading it into a data warehouse. From one Kafka topic I create a data frame, and after applying the required transformations I create multiple DFs out of it and load those DFs into different tables, but this load happens in sequence. Is there a way I can parallelize this table-load process?
root
|-- attribute1Formatted: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- accexecattributes: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- primary: boolean (nullable = true)
| | |-- accountExecUUID: string (nullable = true)
|-- attribute2Formatted: struct (nullable = true)
| |-- Jake-DOT-Sandler#xyz.com: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- primary: boolean (nullable = true)
I have created two different dataframes for attribute1Formatted and attribute2Formatted respectively, and these DFs are then saved into the database in different tables.
I don't have much knowledge of Spark Streaming, but I believe streaming is iterative micro-batching, and in a Spark batch execution each action has one sink/output. So you can't store it in different tables with one execution.
Now,
if you write it into one table, readers can simply read only the columns they require. I mean: do you really need to store it in different places?
You can write it twice, filtering out the fields that are not required:
both write actions will execute the computation of the full dataset, then drop the columns that are not required;
if the full dataset computation is long, you can cache it before the filtering + write (see the sketch below).
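A minimal sketch of those points, assuming a batch job (transformedDF and the table names below are hypothetical placeholders for the DF built from the Kafka topic and the actual target tables):

import org.apache.spark.sql.DataFrame

// `transformedDF` stands for the dataframe built from the Kafka topic
// after the required transformations.
def writeBoth(transformedDF: DataFrame): Unit = {
  // Cache once so the full computation is not repeated for each write
  val cached = transformedDF.cache()

  // First table: only the fields needed from attribute1Formatted
  cached.select("attribute1Formatted")
    .write.mode("append").saveAsTable("db.attribute1_table")   // hypothetical table

  // Second table: only the fields needed from attribute2Formatted
  cached.select("attribute2Formatted")
    .write.mode("append").saveAsTable("db.attribute2_table")   // hypothetical table

  cached.unpersist()
}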

Divide dataframe into batches Spark

I need to run a set of transformations on batches of hours of the dataframe. The number of hours should be parameterized so it can be changed - for example, run the transformations on 3 hours of the dataframe, then on the next 2 hours. In other words, each transformation step should operate on a parameterized number of hours.
The signature of the transformation looks like this:
def transform(wordsFeed: DataFrame)(filesFeed: DataFrame): Unit
So I want to do this division into batches and then call transform on each batch of the data feed. But I can't use groupBy, as that would turn the dataframe into a grouped dataset, while I need to preserve all the columns in the schema. How can I do that?
val groupedDf = df.srcHours.groupBy($"event_ts")
transform(keywords)(groupedDf)
Data schema looks like this:
root
|-- date_time: integer (nullable = true)
|-- user_id: long (nullable = true)
|-- order_id: string (nullable = true)
|-- description: string (nullable = true)
|-- hashed_user_id: string (nullable = true)
|-- event_date: date (nullable = true)
|-- event_ts: timestamp (nullable = true)
|-- event_hour: long (nullable = true)
The main reason to introduce this batching is that there's too much data to process at once.
Note: I still want to use batch data processing and not streaming in this case
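One possible shape for this batching, given the event_hour column in the schema and the transform signature above (a sketch only; it assumes event_hour uniquely identifies an hour across the whole feed and that the list of distinct hours fits on the driver):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Split filesFeed into consecutive windows of `hoursPerBatch` hours and
// call the transform (as defined in the question) on each window.
def transformInBatches(wordsFeed: DataFrame, filesFeed: DataFrame, hoursPerBatch: Int): Unit = {
  val hours = filesFeed.select("event_hour").distinct()
    .collect().map(_.getLong(0)).sorted

  hours.grouped(hoursPerBatch).foreach { window =>
    val batch = filesFeed.filter(col("event_hour").isin(window: _*))
    transform(wordsFeed)(batch)
  }
}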

How to perform general processing on spark StructType in Scala UDF?

I have a dataframe with the following schema
root
|-- name: integer (nullable = true)
|-- address: integer (nullable = true)
|-- cases: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- caseOpenTime: integer (nullable = true)
| | |-- caseDescription: string (nullable = true)
| | |-- caseSymptons: map (nullable = true)
| | | |-- key: string
| | | |-- value: struct (valueContainsNull = true)
| | | | |-- duration: integer (nullable = true)
| | | | |-- description: string (nullable = true)
I want to write a UDF that can take the "cases" column in the data frame and produce another column, "derivedColumn1", from it.
I want to write this derivation logic as general-purpose processing, without using the SQL constructs supported by Spark Dataframes. So the steps would be:
val deriveDerivedColumn1_with_MapType = udf((casesColumn: ArrayType) => {   // returns MapType
  val derivedMapSchema = DataTypes.createMapType(StringType, LongType)
  // 1. Convert casesColumn to a Scala object-X
  // 2. ScalaMap<String, Long> scalaMap = myScalaObjectProcessing(object-X)
  // 3. (return) scalaMap.convertToSparkMapType(schema = derivedMapSchema)
})
For specific use-cases, Dataframe SQL constructs can be used. But I am looking for general processing that is not constrained by SQL constructs, so I am specifically looking for ways to:
Convert a complex Spark StructType into a Scala datatype Object-X.
Perform "SOME" general-purpose processing on the Scala Object-X.
Convert the Scala Object-X back into a Spark MapType that can be added as a new column in the dataframe.
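One common pattern for this, as a sketch (not tied to this schema's exact semantics): an array<struct> column reaches a Scala UDF as Seq[Row], each Row's fields can be read with getAs, and a returned Scala Map[String, Long] is converted back to a Spark MapType automatically. myDerivation below is a hypothetical placeholder for the general-purpose processing.

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// Arbitrary Scala processing on plain Scala values extracted from the Rows
// (hypothetical placeholder for "myScalaObjectProcessing")
def myDerivation(cases: Seq[(Int, String)]): Map[String, Long] =
  cases.map { case (openTime, description) => description -> openTime.toLong }.toMap

// 1. Spark hands the array<struct<...>> to the UDF as Seq[Row]
// 2. Each Row is converted to ordinary Scala values
// 3. The returned Scala Map is mapped back to MapType(StringType, LongType)
val deriveColumn1 = udf { (cases: Seq[Row]) =>
  val scalaObjects = cases.map(c =>
    (c.getAs[Int]("caseOpenTime"), c.getAs[String]("caseDescription")))
  myDerivation(scalaObjects)
}

// val withDerived = df.withColumn("derivedColumn1", deriveColumn1(col("cases")))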

Take only a part of a MongoDB Document into a Spark Dataframe

I'm holding relatively large documents in my MongoDB, and I need only a small part of the information in each document to be loaded into a Spark Dataframe to work on. This is an example of a document (a lot of unnecessary fields have been removed for readability):
root
|-- _id: struct (nullable = true)
| |-- oid: string (nullable = true)
|-- customerInfo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- events: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- relevantField: integer (nullable = true)
| | | | |-- relevantField_2: string (nullable = true)
| | |-- situation: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- currentRank: integer (nullable = true)
| | |-- info: struct (nullable = true)
| | | |-- customerId: integer (nullable = true)
What I do now is explode "customerInfo":
val df = MongoSpark.load(sparksess)
val new_df = df.withColumn("customerInfo", explode(col("customerInfo")))
.select(col("_id"),
col("customerInfo.situation").getItem(13).getField("currentRank").alias("currentRank"),
col("customerInfo.info.customerId"),
col("customerInfo.events.relevantField"),
col("customerInfo.events.relevantField_2"))
Now, to my understanding this loads the whole "customerInfo" into memory to do actions over it, which is a waste of time and resources. How can I explode only the specific information I need? Thank you!
how can I explode only the specific information I need?
Use filters to filter the data in MongoDB first, before sending it to Spark.
The MongoDB Spark Connector will construct an aggregation pipeline so that only the filtered data is sent to Spark, reducing the amount of data transferred.
You could use a $project aggregation stage to project only certain fields. See also MongoDB Spark Connector: Filters and Aggregation.
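A sketch of that approach with the connector's RDD API (this assumes the 2.x MongoSpark API; the projected fields follow the schema above and are only an example):

import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession
import org.bson.Document

val sparksess = SparkSession.builder().getOrCreate()

// Push a $project stage down to MongoDB so only the needed fields are sent to Spark
val pipeline = Seq(Document.parse(
  """{ "$project": {
       "customerInfo.situation.currentRank": 1,
       "customerInfo.info.customerId": 1,
       "customerInfo.events.relevantField": 1,
       "customerInfo.events.relevantField_2": 1
     } }"""))

val df = MongoSpark.load(sparksess.sparkContext)
  .withPipeline(pipeline)
  .toDF()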

Pyspark RDD to DataFrame with Enforced Schema: Value Error

I am working with pyspark, with a schema commensurate with that shown at the end of this post (note the nested lists and unordered fields), initially imported from Parquet as a DataFrame. Fundamentally, the issue I am running into is the inability to process this data as an RDD and then convert it back to a DataFrame. (I have reviewed several related posts, but I still cannot tell where I am going wrong.)
Trivially, the following code works fine (as one would expect):
schema = deepcopy(tripDF.schema)
tripRDD = tripDF.rdd
tripDFNew = sqlContext.createDataFrame(tripRDD, schema)
tripDFNew.take(1)
Things do not work when I need to map the RDD (as would be the case to add a field, for instance).
schema = deepcopy(tripDF.schema)
tripRDD = tripDF.rdd
def trivial_map(row):
    rowDict = row.asDict()
    return pyspark.Row(**rowDict)
tripRDDNew = tripRDD.map(lambda row: trivial_map(row))
tripDFNew = sqlContext.createDataFrame(tripRDDNew, schema)
tripDFNew.take(1)
The code above gives the following exception where XXX is a stand-in for an integer, which changes from run to run (e.g., I've seen 1, 16, 23, etc.):
File "/opt/cloudera/parcels/CDH-5.8.3-
1.cdh5.8.3.p1967.2057/lib/spark/python/pyspark/sql/types.py", line 546, in
toInternal
raise ValueError("Unexpected tuple %r with StructType" % obj)
ValueError: Unexpected tuple XXX with StructType`
Given this information, is there a clear error in the second block of code? (I note that tripRDD is of class rdd.RDD while tripRDDNew is of class rdd.PipelinedRDD, but I don't think this should be a problem.) (I also note that the schema for tripRDD is not sorted by field name, while the schema for tripRDDNew is sorted by field name. Again, I don't see why this would be a problem.)
Schema:
root
|-- foo: struct (nullable = true)
| |-- bar_1: integer (nullable = true)
| |-- bar_2: integer (nullable = true)
| |-- bar_3: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- baz_1: integer (nullable = true)
| | | |-- baz_2: string (nullable = true)
| | | |-- baz_3: double (nullable = true)
| |-- bar_4: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- baz_1: integer (nullable = true)
| | | |-- baz_2: string (nullable = true)
| | | |-- baz_3: double (nullable = true)
|-- qux: integer (nullable = true)
|-- corge: integer (nullable = true)
|-- uier: integer (nullable = true)
As noted in the post, the original schema has fields that are not alphabetically ordered. Therein lies the problem. The use of .asDict() followed by Row(**rowDict) in the mapping function sorts the fields of the resulting rows by name. The field order of tripRDDNew conflicts with the schema at the call to createDataFrame. The ValueError results from an attempt to parse one of the integer fields (i.e., qux, corge, or uier in the example) as a StructType.
(As an aside: It is a little surprising that createDataFrame requires the schema fields to have the same order as the RDD fields. You should either need consistent field names OR consistent field ordering, but requiring both seems like overkill.)
(As a second aside: The existence of non-alphabetical fields in the DataFrame is somewhat abnormal. For instance, sc.parallelize() will automatically order fields alphabetically when distributing the data structure. It seems like the fields should be ordered when importing the DataFrame from the parquet file. It might be interesting to investigate why this is not the case.)