Combine rdd string lines without reduce - scala

When I retrieve data from a text file as an RDD in Spark, the retrieved lines are kept as separate elements.
What I want is for the RDD to combine them and treat them as if I had parallelized a single string.
e.g. going from:
sc.textFile("sample.txt") // content: List("abc", "def"), one element per line
To:
sc.parallelize(Seq("abcdef")) // content: "abcdef" as a single element
This is needed because the whole data set is too big to fit in memory with reduce, but it still needs to be processed as a whole (in parallel of course, but without the line separations).

Related

Spark Dataset - "edit" parquet file for each row

Context
I am trying to use Spark/Scala in order to "edit" multiple parquet files (potentially 50k+) efficiently. The only edit that needs to be done is deletion (i.e. deleting records/rows) based on a given set of row IDs.
The parquet files are stored in s3 as a partitioned DataFrame where an example partition looks like this:
s3://mybucket/transformed/year=2021/month=11/day=02/*.snappy.parquet
Each partition can have upwards of 100 parquet files that each are between 50mb and 500mb in size.
Inputs
We are given a spark Dataset[MyClass] called filesToModify which has 2 columns:
s3path: String = the complete s3 path to a parquet file in s3 that needs to be edited
ids: Set[String] = a set of IDs (rows) that need to be deleted in the parquet file located at s3path
Example input dataset filesToModify:
s3path                                                                     | ids
s3://mybucket/transformed/year=2021/month=11/day=02/part-1.snappy.parquet  | Set("a", "b")
s3://mybucket/transformed/year=2021/month=11/day=02/part-2.snappy.parquet  | Set("b")
Expected Behaviour
Given filesToModify, I want to take advantage of parallelism in Spark to do the following for each row:
Load the parquet file located at row.s3path
Filter so that we exclude any row whose id is in the set row.ids
Count the number of deleted/excluded rows per id in row.ids (optional)
Save the filtered data back to the same row.s3path to overwrite the file
Return the number of deleted rows (optional)
What I have tried
I have tried using filesToModify.map(row => deleteIDs(row.s3path, row.ids)) where deleteIDs looks like this:
def deleteIDs(s3path: String, ids: Set[String]): Int = {
  import org.apache.spark.sql.functions.{col, not}
  import spark.implicits._

  val data = spark
    .read
    .parquet(s3path)
    .as[DataModel]

  val clean = data
    .filter(not(col("id").isInCollection(ids)))

  // write to a temp directory and then upload to s3 with same
  // prefix as original file to overwrite it
  writeToSingleFile(clean, s3path)

  1 // dummy output for simplicity (otherwise it should correspond to the number of deleted rows)
}
However, this leads to a NullPointerException when executed within the map operation. If I execute it alone, outside of the map block, it works, but I can't understand why it doesn't work inside it (something to do with lazy evaluation?).
You get a NullPointerException because you try to retrieve your Spark session from an executor.
It is not explicit, but to perform a Spark action, your deleteIDs function needs to retrieve the active Spark session. To do so, it calls the getActiveSession method of the SparkSession object. But when called from an executor, getActiveSession returns None, as stated in SparkSession's source code:
Returns the default SparkSession that is returned by the builder.
Note: Return None, when calling this function on executors
And thus a NullPointerException is thrown when your code starts using this None Spark session.
More generally, you can't create a dataset or use Spark transformations/actions within the transformations of another dataset.
So I see two solutions for your problem:
either rewrite the deleteIDs function without using Spark, and modify your parquet files directly with a library such as parquet4s, for instance;
or collect filesToModify into a Scala collection and use Scala's map instead of Spark's (see the sketch below).
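A minimal sketch of the second option, assuming filesToModify is small enough to be collected to the driver (MyClass, deleteIDs, s3path and ids are taken from the question):

val deletedCounts = filesToModify
  .collect()                                   // Array[MyClass] on the driver
  .map(row => deleteIDs(row.s3path, row.ids))  // plain Scala map, so deleteIDs can safely use the SparkSession

val totalDeleted = deletedCounts.sum           // meaningful once deleteIDs returns real counts

The per-file work is then driven sequentially from the driver, but each spark.read / filter / write inside deleteIDs is still executed by the cluster.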
The s3path and ids parameters that are passed to deleteIDs are not actually a string and a set respectively; they are columns.
In order to operate over these values you can instead create a UDF that accepts columns instead of intrinsic types, or you can collect your dataset if it is small enough so that you can use the values in the deleteIDs function directly. The former is likely your best bet if you seek to take advantage of Spark's parallelism.
You can read about UDFs here

Number of rows increases and data corrupts in resulting csv

Converting dataframe from one form to another:
Parquet -> Parquet (number of rows remains same, NO PROBLEM)
Parquet -> CSV (number of rows INCREASES! Data corruption!)
First read the data from parquet, then convert it to csv/parquet using the command below:
<input_dataframe>.coalesce(1).write.option("header", true).csv/parquet(output_path)
scala> spark.read.option("header", true).parquet(input_path).count
Long = 8387913
scala> spark.read.option("header", true).csv(output_path).count
Long = 8387932
As a result of this, rows are getting mixed with one another and records are spilling over or getting corrupted.
There is a workaround, but ONLY if you have to read the csv back using Spark: you can pass the option multiline as true.
scala> spark.read.option("header", true).option("multiline", true).csv(output_path).count
Long = 8387913 << input parquet contains same number of records
But this is not what I want to do with the CSV. I need to read it without using spark.
How do I keep the structure intact while writing dataframe to csv?
NOTE: This might not be reproducible with all dataframes; my data hits it for some unknown reason. I noticed the problem when I found string-type fields storing integral values in the resulting CSV, and a whole set of records corrupted with arbitrary values. CSV size ~ 2.5 GB.
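One possible mitigation, sketched under the assumption (suggested by the multiline workaround) that the extra rows come from newline characters embedded in string fields: strip those characters from every string column before writing. The stripNewlines helper below is illustrative, not part of the original question; input_path and output_path are the same placeholders used above.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, regexp_replace}
import org.apache.spark.sql.types.StringType

// Replace embedded CR/LF characters in all string columns so that each
// record stays on a single physical line in the resulting CSV.
def stripNewlines(df: DataFrame): DataFrame =
  df.schema.fields.foldLeft(df) { (acc, field) =>
    if (field.dataType == StringType)
      acc.withColumn(field.name, regexp_replace(col(field.name), "[\\r\\n]", " "))
    else acc
  }

stripNewlines(spark.read.parquet(input_path))
  .coalesce(1)
  .write.option("header", true)
  .csv(output_path)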

Getting the value of a DataFrame column in Spark

I am trying to retrieve the value of a DataFrame column and store it in a variable. I tried this:
val name=df.select("name")
val name1=name.collect()
But none of the above is returning the value of column "name".
Spark version :2.2.0
Scala version :2.11.11
There are a couple of things here. If you want to see all the data, collect is the way to go. However, if your data is too huge, it will cause the driver to fail.
So the alternative is to check a few items from the dataframe. What I generally do is
df.limit(10).select("name").as[String].collect()
This will return 10 elements, but the output doesn't look pretty.
So the second alternative is
df.select("name").show(10)
This will print the first 10 elements. Sometimes, if the column values are big, it puts "..." instead of the actual value, which is annoying.
Hence there is a third option
df.select("name").take(10).foreach(println)
This takes 10 elements and prints them.
Now, in none of these cases will you get a fair sample of the data, as the first 10 rows are always picked. So to truly pick randomly from the dataframe you can use
df.select("name").sample(true, 0.2).show(10)
or
df.select("name").sample(.2, true).take(10).foreach(println)
You can check the "sample" function on dataframe
The first will do :)
val name = df.select("name") will return another DataFrame. You can do, for example, name.show() to show the content of the DataFrame. You can also do collect or collectAsMap to materialize results on the driver, but be aware that the amount of data should not be too big for the driver.
You can also do:
val names = df.select("name").as[String].collect()
This will return an array of the names in this DataFrame.
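If only a single value is needed rather than the whole column, a minimal variant of the same approach (assuming the DataFrame has at least one row):

val firstName: String = df.select("name").as[String].head() // first value of the "name" column as a plain String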

Spark: grouping during loading

Usually I load csv files and then I run different kinds of aggregations, for example "group by", with Spark. I was wondering whether it is possible to start this sort of operation during the file loading (typically a few million rows) instead of sequencing the two steps, and whether it would be worth it as a time saving.
Example:
val csv = sc.textFile("file.csv")
val data = csv.map(line => line.split(",").map(elem => elem.trim))
val header = data.first()                             // header row as Array[String]
val rows = data.filter(line => line(0) != header(0))  // drop the header row ("id", ...)
val trows = rows.map(row => (row(0), row))            // key each row by its first column
trows.groupBy(_._1) // group by id, etc.
From my understanding of how Spark works, the groupBy (or aggregate) will be "postponed" until the whole csv file has been loaded into memory. If this is correct, can the loading and the grouping run at the "same" time instead of sequencing the two steps?
the groupBy (or aggregate) will be "postponed" until the whole csv file has been loaded into memory.
This is not the case. At the local (single partition) level, Spark operates on lazy sequences, so operations belonging to a single task (this includes map-side aggregation) can be squashed together.
In other words, when you have a chain of methods, operations are performed line-by-line, not transformation-by-transformation: the first line is mapped, filtered, mapped once again and passed to the aggregator before the next line is accessed.
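As a concrete illustration, here is a minimal sketch of the asker's pipeline rewritten so the grouping benefits from map-side aggregation while the file is being read; reduceByKey below simply counts rows per id, and the real aggregation would take its place:

val counts = sc.textFile("file.csv")
  .map(line => line.split(",").map(_.trim))
  .filter(row => row(0) != "id")   // drop the header row
  .map(row => (row(0), 1L))        // key each row by its id
  .reduceByKey(_ + _)              // partially combined on each partition as lines are read, then shuffled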
To start the group-by as part of the load operation, you could proceed with 2 options:
Write your own loader and do your own group-by inside it with aggregateByKey. The con is writing more code and more maintenance.
Use Parquet files as input with DataFrames; since the format is columnar, only the columns used in your groupBy will be read, so it should be faster. See DataFrameReader.
val df = spark.read.parquet("file_path")
val counts = df.groupBy("column_a", "column_b").count()  // add further grouping columns as needed
counts.show()
Because Spark is lazy, it won't load your file until you call an action method like show/collect/write, so Spark knows which columns to read and which to ignore during the load.

Bulk File Conversion with Apache Spark

Spark/Scala n00bie here...
I've got a large set of documents whose pages are stored as individual tif images. I need to convert and concatenate the individual tifs into a single PDF document, i.e. 1.tif, 2.tif, 3.tif -> 123.pdf.
I've been investigating using Spark to accomplish this. I create the initial RDD like this:
val inputTifRDD = sc.binaryFiles("file:///some/path/to/lots/of/*tif")
with inputTifRDD comprised of tuples of the form:
(fullFilePath: String, data:org.apache.spark.input.PortableDataStream )
I then apply a custom map function that converts each tif into a pdf and returns an RDD comprised of tuples of the form:
(fullFilePath: String, data:com.itextpdf.text.Document)
I now want to apply an action to this RDD to concatenate the PDF's into a single PDF. I don't think reducing is possible because the concatenation is not commutative - the order of pages matters.
My question is: how do I effect the concatenation of the elements in the RDD in the correct order? The filenames contain the page number, so this information can be used. Alternatively, is there another, better/more efficient way to do the conversion/concatenation with Spark?
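For the ordering part, a minimal sketch under two assumptions that are not in the question: the page number can be parsed from the file name with the regex below, and convertedPdfRDD is a placeholder name for the RDD of (fullFilePath, Document) tuples produced by the custom map.

// Page number embedded in the file name, e.g. ".../7.tif" -> 7
val pageNumber = """(\d+)\.tif$""".r

def pageOf(path: String): Int =
  pageNumber.findFirstMatchIn(path).map(_.group(1).toInt).getOrElse(Int.MaxValue)

// Globally sort the converted pages by page number, then bring them back
// to the driver in that order for the final merge.
val pagesInOrder = convertedPdfRDD
  .sortBy { case (path, _) => pageOf(path) }
  .collect()

// pagesInOrder can now be concatenated sequentially on the driver
// (e.g. with iText), since collect() preserves the sorted order.

Whether the iText Document objects can actually be shuffled and collected like this depends on their serializability and total size, which this sketch does not address.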