Divide dataframe into batches Spark - scala

I need to run a set of transformations on the dataframe in batches of hours. The number of hours should be parameterized so it can be changed: for example, run the transformations on 3 hours of the dataframe, then on the next 2 hours. In other words, each transformation step should process a parameterized number of hours.
The signature of the transformation looks like this:
def transform(wordsFeed: DataFrame)(filesFeed: DataFrame): Unit
So I want to do this division into batches and then call transform on each batch. But I can't use groupBy, as it would turn the DataFrame into a grouped dataset, while I need to preserve all the columns in the schema. How can I do that?
val groupedDf = df.srcHours.groupBy($"event_ts")
transform(keywords)(groupedDf)
Data schema looks like this:
root
|-- date_time: integer (nullable = true)
|-- user_id: long (nullable = true)
|-- order_id: string (nullable = true)
|-- description: string (nullable = true)
|-- hashed_user_id: string (nullable = true)
|-- event_date: date (nullable = true)
|-- event_ts: timestamp (nullable = true)
|-- event_hour: long (nullable = true)
The main reason to introduce this batching is that there's too much data to process at once.
Note: I still want to use batch data processing and not streaming in this case
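One way to do this (a sketch, not from the original thread): collect the distinct values of event_hour, split them into groups of the desired size, and filter the DataFrame once per group before handing the slice to transform. The helper below assumes the keywords DataFrame and the transform function from the question, a SparkSession reachable from the input DataFrame, and that event_hour alone identifies an hour (otherwise combine it with event_date).
import org.apache.spark.sql.DataFrame

def runInBatches(df: DataFrame, keywords: DataFrame, hoursPerBatch: Int): Unit = {
  val spark = df.sparkSession
  import spark.implicits._

  // Collect the distinct hours present in the data (assumed to be a small set).
  val hours = df.select($"event_hour")
    .distinct()
    .orderBy($"event_hour")
    .as[Long]
    .collect()

  // Process hoursPerBatch hours at a time; the last batch may be smaller.
  hours.grouped(hoursPerBatch).foreach { batch =>
    val slice = df.filter($"event_hour".isin(batch: _*))
    transform(keywords)(slice)
  }
}
If the step size has to vary between batches (for example 3 hours, then 2 hours), the same idea works with a list of step sizes instead of a single hoursPerBatch value.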

Related

scala: read csv of documents, create cosine similarity

I'm reading in dozens of documents. They seem to be read, both as an RDD and as a DataFrame, as a string of columns:
This is the schema:
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)...
|-- _c58: string (nullable = true)
|-- _c59: string (nullable = true)
This is the head of the df:
_c1           | _c2           | ...
V1            | V2            | ...
This text ... | This is an... | ...
I'm trying to create a cosine similarity matrix using this:
val contentRDD = spark.sparkContext.textFile("...documents_vector.csv").toDF()
val Row(coeff0: Matrix) = Correlation.corr(contentRDD, "features").head
println(s"Pearson correlation matrix:\n $coeff0")
This is another way I was doing it:
val df_4 = spark.read.csv("/document_vector.csv")
Here, features would be the name of the column created by converting the single row of 59 columns into a single column of 59 rows.
Is there a way to map each new element in the csv to a new row to complete the cosine similarity matrix? Is there another way I should be doing this?
Thank you to any who consider this.
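Not from the original thread, but one possible direction, assuming the _c columns already hold numeric values (if they are raw text you would first need to vectorise them, e.g. with TF-IDF): cast the columns to double, assemble them into a single features vector, L2-normalise each document vector, and multiply the resulting matrix by its transpose so that entry (i, j) is the cosine similarity between documents i and j. The path and column names come from the question; everything else is illustrative.
import org.apache.spark.ml.feature.{Normalizer, VectorAssembler}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

val raw = spark.read.csv("/document_vector.csv")

// Cast every _cN column to double (assumes no missing or non-numeric cells).
val numeric = raw.columns.foldLeft(raw)((df, c) => df.withColumn(c, df(c).cast("double")))

// Pack the columns into one "features" vector per document.
val assembled = new VectorAssembler()
  .setInputCols(raw.columns)
  .setOutputCol("features")
  .transform(numeric)

// L2-normalise each vector so that dot products become cosine similarities.
val normalized = new Normalizer()
  .setInputCol("features")
  .setOutputCol("norm_features")
  .setP(2.0)
  .transform(assembled)

// Build a distributed matrix of the normalised vectors and compute N * N^T.
val rows = normalized.select("norm_features").rdd.zipWithIndex.map {
  case (row, idx) => IndexedRow(idx, OldVectors.fromML(row.getAs[Vector](0)))
}
val mat = new IndexedRowMatrix(rows).toBlockMatrix()
val cosineSimilarities = mat.multiply(mat.transpose)
cosineSimilarities.toLocalMatrix() brings the full matrix to the driver, which is only sensible for a few dozen documents.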

Deequ - How to put validation on a subset of dataset?

I have a use case where I want to put certain validations on a subset of data that satisfies a specific condition.
For example, I have a dataframe which has 4 columns. colA, colB, colC, colD
df.printSchema
root
|-- colA: string (nullable = true)
|-- colB: integer (nullable = false)
|-- colC: string (nullable = true)
|-- colD: string (nullable = true)
I want to add a validation that, wherever colA == "x" and colB > 20, the combination of colC and colD should be unique (basically, hasUniqueness(Seq("colC", "colD"), Check.IsOne)).
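One simple way to do this (a sketch, not necessarily the only option): pre-filter the DataFrame to the subset the rule applies to and run the Deequ check on that subset only. Column names and the uniqueness check below come from the question; the check description is illustrative.
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}
import org.apache.spark.sql.functions.col

// Restrict the data to the rows the rule applies to, then check uniqueness there.
val subset = df.filter(col("colA") === "x" && col("colB") > 20)

val result = VerificationSuite()
  .onData(subset)
  .addCheck(
    Check(CheckLevel.Error, "colC/colD unique where colA = x and colB > 20")
      .hasUniqueness(Seq("colC", "colD"), Check.IsOne))
  .run()
The overall outcome is then available on result.status.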

How to compare 2 JSON schemas using pyspark?

I have 2 JSON schemas as below -
df1.printSchema()
# root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
df2.printSchema()
# root
# |-- name: array (nullable = true)
# |-- gender: integer (nullable = true)
# |-- age: long (nullable = true)
How can I compare these two schemas and highlight the differences using pyspark? I am using pyspark-sql to load the data from the JSON files into DataFrames.
While it is not clear what you mean by "compare", the following code will give you the fields (StructField objects) that are in df2 and not in df1.
set(df2.schema.fields) - set(df1.schema.fields)
Converting the list to a set removes duplicates and lets you take the difference.
I find the following one-liner useful and neat. It also gives you the two-directional differences at a column level:
set(df1.schema).symmetric_difference(set(df2.schema))

Flattening a nested ORC file with Spark - Performance issue

We are facing a severe performance issue when reading a nested ORC file.
This is our ORC schema:
|-- uploader: string (nullable = true)
|-- email: string (nullable = true)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- startTime: string (nullable = true)
| | |-- endTime: string (nullable = true)
| | |-- val1: string (nullable = true)
| | |-- val2: string (nullable = true)
| | |-- val3: integer (nullable = true)
| | |-- val4: integer (nullable = true)
| | |-- val5: integer (nullable = true)
| | |-- val6: integer (nullable = true)
The ‘data’ array could potentially contain 75K objects.
In our Spark application, we flatten this ORC as you can see below:
val dataFrame = spark.read.orc(files: _*)
val withData = dataFrame.withColumn("data", explode(dataFrame.col("data")))
val withUploader = withData.select($"uploader", $"data")
val allData = withUploader
.withColumn("val_1", $"data.val1")
.withColumn("val_2", $"data.val2")
.withColumn("val_3", $"data.val3")
.withColumn("val_4", $"data.val4")
.withColumn("val_5", $"data.val5")
.withColumn("val_6", $"data.val6")
.withColumn("utc_start_time", timestampUdf($"data.startTime"))
.withColumn("utc_end_time", timestampUdf($"data.endTime"))
allData.drop("data")
The flattening process seems to be a very heavy operation:
Reading a 2 MB ORC file with 20 records, each of which contains a data array with 75K objects, results in hours of processing time. Reading the file and collecting it without flattening takes 22 seconds.
Is there a way to make Spark process the data faster?
I'd try to avoid large explodes completely. With 75K elements in the array:
You create 75K Row objects per Row. This is a huge allocation effort.
You duplicate uploader and email 75K times. In the short term they will reference the same data, but once the data is serialized and deserialized with the internal format they will likely point to different objects, effectively multiplying the memory requirements.
Depending on the transformations you want to apply, it may be much more efficient to use a UDF that processes each array as a whole (a sketch follows below).
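To make that suggestion concrete, here is a minimal, illustrative sketch of processing the array with a UDF instead of exploding it. It just sums val3 per record, so replace the body with whatever per-array logic is actually needed (assumes spark.implicits._ is in scope, as in the question's code, and that val3 is never null).
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Aggregate each 75K-element array in place; no per-element rows are created.
val sumVal3 = udf { (data: Seq[Row]) => data.map(_.getAs[Int]("val3")).sum }

val summarised = dataFrame.select($"uploader", $"email", sumVal3($"data").as("total_val3"))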
In case this helps someone, I found that flattening the data using flatMap is much faster than doing it with explode:
dataFrame.as[InputFormat].flatMap(r => r.data.map(v => OutputFormat(v, r.tenant)))
The improvement in performance was dramatic.
Processing a file with 20 records, each containing an array with 250K rows, took 8 hours with the explode implementation and 7 minutes with the flatMap implementation (!)
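For reference, a minimal sketch of the case classes such a flatMap needs, adapted to the schema in the question (the tenant field in the answer above presumably comes from the answerer's own data, so uploader is used here; all names are illustrative):
case class DataElement(startTime: String, endTime: String,
                       val1: String, val2: String,
                       val3: Option[Int], val4: Option[Int],
                       val5: Option[Int], val6: Option[Int])
case class InputRecord(uploader: String, email: String, data: Seq[DataElement])
case class OutputRecord(element: DataElement, uploader: String)

import spark.implicits._  // product encoders for .as[...] and .flatMap

val flattened = dataFrame.as[InputRecord]
  .flatMap(r => r.data.map(v => OutputRecord(v, r.uploader)))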

How to efficiently join 2 DataFrames based on Timestamp difference?

I have two DataFrames with two different time-series data. For simplicity let's call them Events and Status.
events:
root
|-- timestamp: timestamp (nullable = true)
|-- event_type: string (nullable = true)
|-- event_id: string (nullable = true)
statuses:
root
|-- timestamp: timestamp (nullable = true)
|-- status: string (nullable = true)
|-- field1: string (nullable = true)
|-- field2: string (nullable = true)
|-- field3: string (nullable = true)
I'd like to join them so that every Event gets a list_statuses column containing all the status rows from the previous X hours relative to its own timestamp.
I can do it with a cartesian product of events and statuses followed by a filter on the time criterion, but that is (extremely) inefficient.
Is there any better way to do it? Anything off-the-shelf?
(I thought of grouping both dataframes by time window, self-joining the statuses so they appear in both the current and the previous time windows, then joining on the window and filtering, but if there is anything ready and bug-free, I'd happily use it...)
Thanks!
Almost 2 months later, but I thought it might help others if I post something I came up with:
http://zachmoshe.com/2016/09/26/efficient-range-joins-with-spark.html
It's basically a more efficient implementation of a range join between two Datasets based on a timestamp or a numeric field (Scala, with Spark 2.0).
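For anyone who prefers to stay with plain DataFrame operations, here is a sketch of the bucketing idea from the question itself (not the approach from the linked post), for X = 3 hours: duplicate each status into every hour bucket it can serve, join on the bucket, then apply the exact time filter. The sequence function requires Spark 2.4+; on older versions the bucket list can be produced with a small UDF. Column names follow the schemas above; assumes spark.implicits._ is in scope.
import org.apache.spark.sql.functions._

val hoursBack = 3
val secondsPerHour = 3600L

// Events fall into a single hour bucket.
val eventsB = events
  .withColumn("bucket", (unix_timestamp($"timestamp") / secondsPerHour).cast("long"))

// Each status is copied into its own bucket plus the next `hoursBack` buckets.
val statusesB = statuses
  .withColumnRenamed("timestamp", "status_ts")
  .withColumn("bucket", explode(sequence(
    (unix_timestamp($"status_ts") / secondsPerHour).cast("long"),
    (unix_timestamp($"status_ts") / secondsPerHour).cast("long") + hoursBack)))

// Equi-join on the bucket, then keep only statuses from the previous X hours.
val joined = eventsB.join(statusesB, Seq("bucket"))
  .where($"status_ts" <= $"timestamp" &&
         $"status_ts" >= $"timestamp" - expr(s"INTERVAL $hoursBack HOURS"))

// Collapse the matches back into one list_statuses column per event.
val withStatuses = joined
  .groupBy($"event_id", $"event_type", $"timestamp")
  .agg(collect_list(struct($"status_ts", $"status", $"field1", $"field2", $"field3"))
    .as("list_statuses"))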