Spark Dataset - "edit" parquet file for each row - scala

Context
I am trying to use Spark/Scala in order to "edit" multiple parquet files (potentially 50k+) efficiently. The only edit that needs to be done is deletion (i.e. deleting records/rows) based on a given set of row IDs.
The parquet files are stored in s3 as a partitioned DataFrame where an example partition looks like this:
s3://mybucket/transformed/year=2021/month=11/day=02/*.snappy.parquet
Each partition can contain upwards of 100 parquet files, each between 50 MB and 500 MB in size.
Inputs
We are given a spark Dataset[MyClass] called filesToModify which has 2 columns:
s3path: String = the complete s3 path to a parquet file in s3 that needs to be edited
ids: Set[String] = a set of IDs (rows) that need to be deleted in the parquet file located at s3path
Example input dataset filesToModify:
s3path | ids
s3://mybucket/transformed/year=2021/month=11/day=02/part-1.snappy.parquet | Set("a", "b")
s3://mybucket/transformed/year=2021/month=11/day=02/part-2.snappy.parquet | Set("b")
Expected Behaviour
Given filesToModify, I want to take advantage of parallelism in Spark to do the following for each row:
Load the parquet file located at row.s3path
Filter so that we exclude any row whose id is in the set row.ids
Count the number of deleted/excluded rows per id in row.ids (optional)
Save the filtered data back to the same row.s3path to overwrite the file
Return the number of deleted rows (optional)
What I have tried
I have tried using filesToModify.map(row => deleteIDs(row.s3path, row.ids)) where deleteIDs looks like this:
def deleteIDs(s3path: String, ids: Set[String]): Int = {
  import spark.implicits._
  val data = spark
    .read
    .parquet(s3path)
    .as[DataModel]
  val clean = data
    .filter(not(col("id").isInCollection(ids)))
  // write to a temp directory and then upload to s3 with same
  // prefix as original file to overwrite it
  writeToSingleFile(clean, s3path)
  1 // dummy output for simplicity (otherwise it should correspond to the number of deleted rows)
}
However this leads to NullPointerException when executed within the map operation. If I execute it alone outside of the map block then it works but I can't understand why it doesn't inside it (something to do with lazy evaluation?).

You get a NullPointerException because you try to retrieve your spark session from an executor.
It is not explicit, but to perform a Spark action, your deleteIDs function needs to retrieve the active Spark session. To do so, it calls the getActiveSession method of the SparkSession object. But when called from an executor, getActiveSession returns None, as stated in SparkSession's source code:
Returns the default SparkSession that is returned by the builder.
Note: Return None, when calling this function on executors
A NullPointerException is therefore thrown as soon as your code starts using this missing Spark session.
More generally, you can't recreate a dataset and use spark transformations/actions in transformations of another dataset.
So I see two solutions for your problem:
either rewrite the deleteIDs function without using Spark, and modify your parquet files with a library such as parquet4s,
or transform filesToModify into a Scala collection and use Scala's map instead of Spark's.
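The second option can be sketched as follows, written here in Python (PySpark) rather than Scala. This is a minimal sketch under stated assumptions, not a definitive implementation: run_on_driver, delete_ids and the output_path staging location are hypothetical names introduced for illustration. The idea is to collect the (s3path, ids) pairs to the driver and submit one Spark job per file from driver threads, so the Spark session is only ever used on the driver while several jobs can still run concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def delete_ids(spark, s3path, ids, output_path):
    """Hypothetical PySpark helper: drop the rows whose `id` is in `ids`.

    The filtered data is written to `output_path` (an assumed staging
    location), not back to `s3path`; swapping the staged file into place
    is left out of this sketch.
    """
    from pyspark.sql import functions as F  # imported lazily; requires pyspark

    id_list = list(ids)
    data = spark.read.parquet(s3path)
    deleted = data.filter(F.col("id").isin(id_list)).count()
    data.filter(~F.col("id").isin(id_list)).write.mode("overwrite").parquet(output_path)
    return deleted


def run_on_driver(files, delete_fn, max_workers=4):
    """Call delete_fn(s3path, ids) for each (s3path, ids) pair from driver threads.

    Returns a dictionary {s3path: number_of_deleted_rows}.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {path: pool.submit(delete_fn, path, ids) for path, ids in files}
        return {path: future.result() for path, future in futures.items()}
```

Assuming files_to_modify is the PySpark equivalent of filesToModify, the pairs would come from something like [(row.s3path, set(row.ids)) for row in files_to_modify.collect()], and delete_fn would wrap delete_ids with the session and a staging path of your choosing. Each file is still processed by the whole cluster; the thread pool only controls how many files are in flight at once.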

s3path and ids parameters that are passed to deleteIDs are not actually strings and sets respectively. They are instead columns.
In order to operate over these values you can instead create a UDF that accepts columns instead of intrinsic types, or you can collect your dataset if it is small enough so that you can use the values in the deleteIDs function directly. The former is likely your best bet if you seek to take advantage of Spark's parallelism.
You can read about UDFs here

Related

Populate a Properties Object from Spark Databricks File System

TL;DR
Is there a way to read a Scala/Java properties file from a Databricks file system?
Or, is there a way to convert a spark data frame Rows into a set of text key/value pairs (that Scala will understand)?
Full Problem:
The properties file is not local, it's on the Databricks cluster. Attempts to read a file from "dbfs:/" or "/dbfs" fail to find the file when using the scala.io.Source library. My guess is that Scala Source has no ability to recognize the URI for the Databricks file system(?).
I'm able to read the file into a Spark DataFrame, but attempts to populate a java.util.Properties object fail with an error saying that it doesn't accept the Spark DataFrame "Row" type. I've tried changing the data frame to an Array and a List, but I run into the same type mismatch. For example, java.util.List[org.apache.spark.sql.Row] is what I get when converting the data frame to a list. I'm guessing that means dataFrameObject.collectAsList() makes a list of Spark rows instead of a text list of key/value pairs.
Obviously I'm new to Scala... If there isn't a way to read/load my properties file directly from DBFS, is there a way to convert the spark Row to a key/value pairs - or a byteStream?
Cheers and thanks,
Simon
If you're using the full version of Databricks, not the Community Edition, then you should be able to access files on DBFS via /dbfs/_the_rest_of_your_path_without_dbfs:/_...
But if you can't access /dbfs/..., you can still load the properties as follows:
load the file into Spark using the text format, which converts every line in the file into an individual row
create text from those rows: first collect all rows to the driver node, then extract the string from each row (using .getString(0) to fetch the first element of the row), and then merge all lines together using mkString
create a reader for that text
create a properties object and load the data from the reader (don't forget to close the reader after use):
val path_to_file = "dbfs:/something...."
val df = spark.read.format("text").load(path_to_file)
val allText = df.collect().map(_.getString(0)).mkString("\n")
val reader = new java.io.StringReader(allText)
val props = new java.util.Properties()
props.load(reader)
reader.close()
and you can check that properties are loaded with
props.list(System.out)

Spark : Dynamic generation of the query based on the fields in s3 file

Oversimplified Scenario:
A process generates monthly data in an s3 file. The number of fields can differ in each monthly run. Based on this data in s3, we load the data into a table and manually run a SQL query for a few metrics (manually, because the number of fields can change in each run with the addition or deletion of a few columns). There are more calculations/transforms on this data, but as a starting point I'm presenting a simpler version of the use case.
Approach:
Given the schema-less nature of the data (the number of fields in the s3 file can differ in each run, which requires manual changes to the SQL every time), I'm planning to explore Spark/Scala so that we can read directly from s3 and dynamically generate the SQL based on the fields.
Query:
How can I achieve this in Scala/Spark SQL/the DataFrame API? The s3 file contains only the required fields from each run, so there is no issue reading the dynamic fields from s3: the dataframe takes care of that. The issue is how to generate the DataFrame API or Spark SQL code that handles them.
I can read the s3 file into a dataframe and register it with createOrReplaceTempView to write SQL against it, but that doesn't help if I still have to change the Spark SQL manually whenever a new field is added to s3 in the next run. What is the best way to dynamically generate the SQL, or is there a better way to handle the issue?
Usecase-1:
First-run
dataframe: customer,month_1_count (here dataframe directly points to s3, which has only required attributes)
--sample code
SELECT customer,sum(month_1_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count").show()
Second-Run - One additional column was added
dataframe: customer,month_1_count,month_2_count (here dataframe directly points to s3, which has only required attributes)
--Sample SQL
SELECT customer,sum(month_1_count),sum(month_2_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count","month_2_count").show()
I'm new to Spark/Scala; it would be helpful if you could point me in a direction so that I can explore further.
It sounds like you want to perform the same operation over and over again on new columns as they appear in the dataframe schema? This works:
from pyspark.sql import functions
#search for column names you want to sum, I put in "month"
column_search = lambda col_names: 'month' in col_names
#get column names of temp dataframe w/ only the columns you want to sum
relevant_columns = original_df.select(*filter(column_search, original_df.columns)).columns
#create dictionary with relevant column names to be passed to the agg function
columns = {col_names: "sum" for col_names in relevant_columns}
#apply agg function with your groupBy, passing in columns dictionary
grouped_df = original_df.groupBy("customer").agg(columns)
#show result
grouped_df.show()
Some important concepts can help you to learn:
DataFrames have data attributes stored in a list: dataframe.columns
Functions can be applied to lists to create new lists as in "column_search"
Agg function accepts multiple expressions in a dictionary as explained here which is what I pass into "columns"
Spark is lazy so it doesn't change data state or perform operations until you perform an action like show(). This means writing out temporary dataframes to use one element of the dataframe like column as I do is not costly even though it may seem inefficient if you're used to SQL.
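Since the question also asks about generating the SQL text itself, the same idea can be sketched with plain Python string building, driven only by the list of column names (what dataframe.columns returns). This is an illustrative sketch: the helper names columns_matching, build_agg_spec and build_sum_sql are made up here, and the "month" keyword and the view name "dataframe" are taken from the example in the question.

```python
def columns_matching(all_columns, keyword):
    """Return the column names that contain `keyword`, in their original order."""
    return [name for name in all_columns if keyword in name]


def build_agg_spec(all_columns, keyword, func="sum"):
    """Build the {column: function} dictionary that can be passed to agg()."""
    return {name: func for name in columns_matching(all_columns, keyword)}


def build_sum_sql(view_name, group_col, all_columns, keyword):
    """Generate the equivalent SQL text for a registered temp view."""
    selected = columns_matching(all_columns, keyword)
    if not selected:
        # Without this check the generated SELECT list would be malformed.
        raise ValueError(f"no column contains {keyword!r}")
    sums = ", ".join(f"sum({name})" for name in selected)
    return f"SELECT {group_col}, {sums} FROM {view_name} GROUP BY {group_col}"


# Column names as they would appear in the second run from the question.
second_run = ["customer", "month_1_count", "month_2_count"]
print(build_agg_spec(second_run, "month"))
print(build_sum_sql("dataframe", "customer", second_run, "month"))
```

The generated string could then be passed to spark.sql(...) after registering the dataframe with createOrReplaceTempView, so a new month column in the next run is picked up without editing the query by hand.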

Is there any way to capture the input file name of multiple parquet files read in with a wildcard in Spark?

I am using Spark to read multiple parquet files into a single RDD, using standard wildcard path conventions. In other words, I'm doing something like this:
val myRdd = spark.read.parquet("s3://my-bucket/my-folder/**/*.parquet")
However, sometimes these Parquet files will have different schemas. When I'm doing my transforms on the RDD, I can try and differentiate between them in the map functions, by looking for the existence (or absence) of certain columns. However a surefire way to know which schema a given row in the RDD uses - and the way I'm asking about specifically here - is to know which file path I'm looking at.
Is there any way, on an RDD level, to tell which specific parquet file the current row came from? So imagine my code looks something like this, currently (this is a simplified example):
val mapFunction = new MapFunction[Row, (String, Row)] {
  override def call(row: Row): (String, Row) = myJob.transform(row)
}
val pairRdd = myRdd.map(mapFunction, encoder = kryo[(String, Row)])
Within the myJob.transform( ) code, I'm decorating the result with other values, converting it to a pair RDD, and do some other transforms as well.
I make use of the row.getAs( ... ) method to look up particular column values, and that's a really useful method. I'm wondering if there are any similar methods (e.g. row.getInputFile( ) or something like that) to get the name of the specific file that I'm currently operating on?
Since I'm passing in wildcards to read multiple parquet files into a single RDD, I don't have any insight into which file I'm operating on. If nothing else, I'd love a way to decorate the RDD rows with the input file name. Is this possible?
You can add a new column for the file name as shown below
import org.apache.spark.sql.functions._
val myDF = spark.read.parquet("s3://my-bucket/my-folder/**/*.parquet").withColumn("inputFile", input_file_name())

Recursively adding rows to a dataframe

I am new to spark. I have some json data that comes as an HttpResponse. I'll need to store this data in hive tables. Every HttpGet request returns a json which will be a single row in the table. Due to this, I am having to write single rows as files in the hive table directory.
But I feel having too many small files will reduce the speed and efficiency. So is there a way I can recursively add new rows to the Dataframe and write it to the hive table directory all at once. I feel this will also reduce the runtime of my spark code.
Example:
for (i <- 1 to 10) {
  newDF = hiveContext.read.json("path")
  df = df.union(newDF)
}
df.write()
I understand that the dataframes are immutable. Is there a way to achieve this?
Any help would be appreciated. Thank you.
You are mostly on the right track. What you want to do is obtain the individual records as a Seq[DataFrame], and then reduce that Seq[DataFrame] to a single DataFrame by unioning them.
Going from the code you provided:
val BatchSize = 100
val HiveTableName = "table"
(0 until BatchSize).
  map(_ => hiveContext.read.json("path")).
  reduce(_ union _).
  write.insertInto(HiveTableName)
Alternatively, if you want to perform the HTTP requests as you go, we can do that too. Let's assume you have a function that does the HTTP request and converts it into a DataFrame:
def obtainRecord(...): DataFrame = ???
You can do something along the lines of:
val HiveTableName = "table"
val OtherHiveTableName = "other_table"
val jsonArray = ???
val batched: DataFrame =
  jsonArray.
    map { parameter =>
      obtainRecord(parameter)
    }.
    reduce(_ union _)
batched.write.insertInto(HiveTableName)
batched.select($"...").write.insertInto(OtherHiveTableName)
You are clearly misusing Spark. Apache Spark is analytical system, not a database API. There is no benefit of using Spark to modify Hive database like this. It will only bring a severe performance penalty without benefiting from any of the Spark features, including distributed processing.
Instead you should use Hive client directly to perform transactional operations.
If you can batch-download all of the data (for example with a script using curl or some other program) and store it in a file first (or in many files, since Spark can load an entire directory at once), you can then load that file (or those files) into Spark all at once to do your processing. I would also check whether the web API has any endpoints that fetch all the data you need instead of just one record at a time.

Spark: grouping during loading

Usually I load csv files and then run different kinds of aggregations with Spark, for example a "group by". I was wondering whether it is possible to start this sort of operation during the file loading (typically a few million rows) instead of running the two steps sequentially, and whether it would be worthwhile as a time saving.
Example:
val csv = sc.textFile("file.csv")
val data = csv.map(line => line.split(",").map(elem => elem.trim))
val header = data.take(1)
val rows = data.filter(line => line(0) != "id") // drop the header row
val trows = rows.map(row => (row(0), row))
trows.groupByKey() // group on row(0), etc.
As I understand how Spark works, the groupBy (or aggregate) will be "postponed" until the whole csv file has been loaded into memory. If this is correct, can the loading and the grouping run at the "same" time instead of sequencing the two steps?
the groupBy (or aggregate) will be "postponed" to the loading in memory of the whole file csv.
That is not the case. At the local (single partition) level Spark operates on lazy sequences, so operations belonging to a single task (this includes map-side aggregation) can be squashed together.
In other words, when you have a chain of methods, operations are performed line by line, not transformation by transformation: the first line will be mapped, filtered, mapped once again and passed to the aggregator before the next one is accessed.
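The line-by-line behaviour described above can be illustrated without Spark, using plain Python generators as a stand-in for the lazy per-partition iterators. This is only an analogy under that assumption, not Spark's actual implementation; traced_pipeline and the log list are names invented for the sketch. Each stage records when it touches a record, so the log shows the stages interleaving per line rather than running one full pass per transformation.

```python
def traced_pipeline(lines, log):
    """Chain map -> filter -> map lazily, recording the order in which steps run."""

    def split(rows):
        for line in rows:
            log.append(f"split:{line}")
            yield [field.strip() for field in line.split(",")]

    def drop_header(rows):
        for fields in rows:
            log.append(f"filter:{fields[0]}")
            if fields[0] != "id":
                yield fields

    def key_by_id(rows):
        for fields in rows:
            log.append(f"key:{fields[0]}")
            yield fields[0], fields

    return key_by_id(drop_header(split(lines)))


log = []
pairs = traced_pipeline(["id,value", "a,1", "b,2"], log)
assert log == []       # nothing has run yet: the pipeline is lazy
result = list(pairs)   # consuming the iterator plays the role of a Spark action
for entry in log:
    print(entry)
```

The log shows the record for "a" being split, filtered and keyed before the line for "b" is even read, which is the behaviour described for operations within a single task.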
To start a group-by during loading, you have two options:
Write your own loader and do the grouping inside it, combined with aggregateByKey. The downside is more code to write and maintain.
Use Parquet files as input together with DataFrames (see DataFrameReader). Because the format is columnar, Spark reads only the columns used in your groupBy, so it should be faster.
df = spark.read.parquet('file_path')
df = df.groupBy('column_a', 'column_b', '...').count()
df.show()
Because Spark is lazy, it won't load your file until you call an action such as show, collect or write. Spark therefore knows which columns to read and which to ignore during loading.