Read, transform and write data within each partition in the DataFrame - scala

Language - Scala
Spark version - 2.4
I am new to both Scala and Spark. (I come from a Python background, so the whole JVM ecosystem is quite new to me.)
I want to write a Spark program that parallelizes the following steps:
Read data from S3 in dataframe
Transform each row of this dataframe
Write the updated dataframe back to S3 at a new location
Let's say I have 3 items, A, B and C. For each of these items, I want to do the above 3 steps.
I want to do this in parallel for all these 3 items.
I tried creating an RDD with 3 partitions, where each partition holds one item: A, B and C, respectively.
Then I tried to use the mapPartitions method to write my logic for each partition (the 3 steps mentioned above).
I am getting Task not serializable errors. Although I understand the meaning of this error, I don't know how to solve it.
val items = Array[String]("A", "B", "C")
val rdd = sc.parallelize(items, 3)

rdd.mapPartitions(partition => {
  val item = partition.next()
  val filePaths = new ListBuffer[String]()
  filePaths += s"$basePath/item=$item/*"

  val df = spark.read.format("parquet").option("basePath", s"$basePath").schema(schema).load(filePaths: _*)

  // Transform this dataframe
  val newDF = df.rdd.mapPartitions(partition => partition.map(row => methodToTransformAndReturnRow(row)))

  newDF.write.mode(SaveMode.Overwrite).parquet(path)
})
My use case is, for each item, read the data from S3, transform it (I am adding new columns directly to each row for our use case), and write the final result, for each item, back to S3.
Note - I could read the whole dataset, repartition by item, transform and write it back, but repartitioning results in a shuffle, which I am trying to avoid. The approach I am attempting instead is to read each item's data inside an executor itself, so that each executor works only on the data it gets and no shuffle is needed.
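For reference, a minimal sketch of what this per-item flow could look like as a plain loop on the driver (the paths and the addColumns transform below are placeholders, not part of the actual job):

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val spark = SparkSession.builder().getOrCreate()

val basePath = "s3a://my-bucket/in"    // placeholder input location
val outPath  = "s3a://my-bucket/out"   // placeholder output location
val items    = Seq("A", "B", "C")

// Placeholder for the real per-row transformation (adding new columns).
def addColumns(df: DataFrame): DataFrame = df

items.foreach { item =>
  val df = spark.read.parquet(s"$basePath/item=$item")   // read only this item's data
  addColumns(df).write.mode(SaveMode.Overwrite).parquet(s"$outPath/item=$item")
}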

I am not sure what you are trying to achieve with the approach you have shown, but I feel you might be going about it the hard way. Unless there is a good reason to do so, it is often best to let Spark (especially Spark 2.0+) do its own thing. In this case, just process all three partitions using a single operation. Spark will usually manage your dataset quite well. It may also automatically introduce optimisations that you did not think of, or optimisations that it cannot do if you try to control the process too much. Having said that, if it doesn't manage the process well, then you can start arguing with it by taking more control and doing things more manually. At least that is my experience so far.
For example, I once had a complex series of transformations that added more logic to each step/DataFrame. If I forced Spark to evaluate each intermediate one (e.g. ran a count or a show on the intermediate dataframes), I would eventually get to a point where it couldn't evaluate one DataFrame (i.e. it couldn't do the count) because of insufficient resources. However, if I ignored that and added more transformations, Spark was able to push some optimisations from the later steps to earlier ones. This meant that the subsequent DataFrames (and, importantly, my final DataFrame) could be evaluated correctly. This final evaluation was possible despite the fact that the intermediate DataFrame, which on its own could not be evaluated, was still part of the overall process.
Consider the following. I use CSV, but it will work the same for parquet.
Here is my input:
data
├── tag=A
│   └── data.csv
├── tag=B
│   └── data.csv
└── tag=C
└── data.csv
Here is an example of one of the data files (tag=A/data.csv)
id,name,amount
1,Fred,100
2,Jane,200
Here is a script that recognises the partitions within this structure (i.e. tag is one of the columns).
scala> val inDataDF = spark.read.option("header","true").option("inferSchema","true").csv("data")
inDataDF: org.apache.spark.sql.DataFrame = [id: int, name: string ... 2 more fields]
scala> inDataDF.show
+---+-------+------+---+
| id| name|amount|tag|
+---+-------+------+---+
| 31| Scott| 3100| C|
| 32|Barnaby| 3200| C|
| 20| Bill| 2000| B|
| 21| Julia| 2100| B|
| 1| Fred| 100| A|
| 2| Jane| 200| A|
+---+-------+------+---+
scala> inDataDF.printSchema
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- amount: integer (nullable = true)
|-- tag: string (nullable = true)
scala> inDataDF.write.partitionBy("tag").csv("outData")
scala>
Again, I used csv here rather than parquet; with parquet you can dispense with the options to read a header and infer the schema (parquet does that automatically). But apart from that, it will work the same way.
The above produces the following directory structure:
outData/
├── _SUCCESS
├── tag=A
│   └── part-00002-9e13ec13-7c63-4cda-b5af-e2d69cb76278.c000.csv
├── tag=B
│   └── part-00001-9e13ec13-7c63-4cda-b5af-e2d69cb76278.c000.csv
└── tag=C
└── part-00000-9e13ec13-7c63-4cda-b5af-e2d69cb76278.c000.csv
If you wanted to manipulate the contents, by all means add whatever map operation, join, filter or anything else you need between the read and the write.
For example, add 500 to the amount:
scala> val outDataDF = inDataDF.withColumn("amount", $"amount" + 500)
outDataDF: org.apache.spark.sql.DataFrame = [id: int, name: string ... 2 more fields]
scala> outDataDF.show(false)
+---+-------+------+---+
|id |name |amount|tag|
+---+-------+------+---+
|31 |Scott |3600 |C |
|32 |Barnaby|3700 |C |
|20 |Bill |2500 |B |
|21 |Julia |2600 |B |
|1 |Fred |600 |A |
|2 |Jane |700 |A |
+---+-------+------+---+
Then simply write outDataDF instead of inDataDF.
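Since the question reads and writes parquet on S3, the same read/transform/write pattern would look roughly like this (the bucket paths are placeholders, and the amount transformation just reuses the example above):

import org.apache.spark.sql.SaveMode
import spark.implicits._

// Partition discovery turns the item=A/B/C directories into an "item" column, just like tag above.
val inDataDF  = spark.read.parquet("s3a://my-bucket/inData")    // placeholder path
val outDataDF = inDataDF.withColumn("amount", $"amount" + 500)  // any per-row transformation

outDataDF.write
  .mode(SaveMode.Overwrite)
  .partitionBy("item")                      // preserves the item=X directory layout on write
  .parquet("s3a://my-bucket/outData")       // placeholder path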

Related

Passing column information to SparkSession within withColumn

I have a simple df with 2 columns, as shown below,
+------------+---+
|file_name |id |
+------------+---+
|file1.csv |1 |
|file2.csv |2 |
+------------+---+
root
|-- file_name: string (nullable = true)
|-- id: string (nullable = true)
I wish to add a third column with the count() from each file specified in the file_name column.
These are large files, so I wish to go for a Spark-based approach to get the count() from each file.
Assuming originalDF is the above df,
I have tried:
dfWithCounts = originalDF.withColumn("counts", lit(spark.read.csv(lit(col('file_name'))).count))
but this seems to throw an error:
Column is not iterable
Is there a way I can achieve this?
I'm using Spark 2.4.
You can't run a Spark job from within another Spark job. Assuming that the file list is not super huge you can collect originalDF to the driver and spawn individual jobs to count lines from there.
val dfWithCounts = originalDF.collect.map { r =>
  (r.getString(0), r.getString(1), spark.read.csv(r.getString(0)).count)   // id is a string per the schema above
}.toSeq.toDF("file_name", "id", "count")
Optionally you can use Scala parallel collections to run these jobs in parallel.
val dfWithCounts = originalDF.collect.par.map { r =>
  (r.getString(0), r.getString(1), spark.read.csv(r.getString(0)).count)
}.toSeq.seq.toDF("file_name", "id", "count")

Exporting Spark DataFrame to S3

So after certain operations I have some data in a Spark DataFrame, to be specific, org.apache.spark.sql.DataFrame = [_1: string, _2: string ... 1 more field]
Now when I do df.show(), I get the following output, which is expected.
+--------------------+--------------------+--------------------+
| _1| _2| _3|
+--------------------+--------------------+--------------------+
|industry_name_ANZSIC|'industry_name_AN...|.isComplete("indu...|
|industry_name_ANZSIC|'industry_name_AN...|.isContainedIn("i...|
|industry_name_ANZSIC|'industry_name_AN...|.isContainedIn("i...|
| rme_size_grp|'rme_size_grp' is...|.isComplete("rme_...|
| rme_size_grp|'rme_size_grp' ha...|.isContainedIn("r...|
| rme_size_grp|'rme_size_grp' ha...|.isContainedIn("r...|
| year| 'year' is not null| .isComplete("year")|
| year|'year' has type I...|.hasDataType("yea...|
| year|'year' has no neg...|.isNonNegative("y...|
|industry_code_ANZSIC|'industry_code_AN...|.isComplete("indu...|
|industry_code_ANZSIC|'industry_code_AN...|.isContainedIn("i...|
|industry_code_ANZSIC|'industry_code_AN...|.isContainedIn("i...|
| variable|'variable' is not...|.isComplete("vari...|
| variable|'variable' has va...|.isContainedIn("v...|
| unit| 'unit' is not null| .isComplete("unit")|
| unit|'unit' has value ...|.isContainedIn("u...|
| value| 'value' is not null|.isComplete("value")|
+--------------------+--------------------+--------------------+
The problem occurs when I try exporting the dataframe as a csv to my S3 bucket.
The code I have is: df.coalesce(1).write.mode("Append").csv("s3://<my path>")
But the csv generated in my S3 path is full of gibberish or rich text. Also, the spark prompt doesn't reappear after execution (meaning execution didn't finish?). Here's a sample of the generated csv in my S3 (screenshot not included):
What am I doing wrong and how do I rectify this?
S3 URI schemes: a short description.
Changing the letter in the URI scheme makes a big difference, because it causes different software to be used to interface with S3.
This is the difference between the three:
s3 is a block-based overlay on top of Amazon S3, whereas s3n/s3a are not; they are object-based.
s3n supports objects up to 5GB where size is a concern, while s3a supports objects up to 5TB and has higher performance. Note that s3a is the successor to s3n.
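If you switch to the s3a connector, the credentials and filesystem settings usually come from the Hadoop configuration. A rough sketch, assuming the hadoop-aws jars are on the classpath (the property keys are standard hadoop-aws settings; the values are placeholders):

// Standard hadoop-aws properties; replace the placeholder values with real credentials
// (or rely on instance profiles / environment variables instead).
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "<ACCESS_KEY>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "<SECRET_KEY>")

df.coalesce(1).write.mode("append").csv("s3a://<my path>")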

Operating in parallel on a Spark Dataframe Rows

Environment: Scala, Spark, Structured Streaming, Kafka
I have a DF coming from a Kafka stream with the following schema:
DF:
BATCH ID: 0
+-----------------------+-----+---------+------+
| value|topic|partition|offset|
+-----------------------+-----+---------+------+
|{"big and nested json"}| A | 0| 0|
|{"big and nested json"}| B | 0| 0|
+-----------------------+-----+---------+------+
I want to process each row in parallel using Spark, and I managed to split them across my executors using
DF.repartition(Number).foreach(row=> processRow(row))
I need to extract the value from the value column into its own dataframe to process it.
I'm having difficulties working with the DataFrame's generic Row object.
Is there a way to turn the single row in each executor into its very own DataFrame (using a fixed schema) and write it to a fixed location?
Is there a better approach to solve my problem?
EDIT + Clarification:
The DF I'm receiving comes as a batch via the foreachBatch function of the writeStream functionality that has existed since Spark 2.4.
Currently, splitting the DF into rows means the rows are distributed evenly across all my executors; I would like to turn a single GenericRow object into a DataFrame so I can process it using a function I made.
For example, I would send the row to the function
processRow(row: Row)
take the value and the topic and turn it back into a single-row DF
+-----------------------+-----+
| value|topic|
+-----------------------+-----+
|{"big and nested json"}| A |
+-----------------------+-----+
for further processing
I guess you are consuming data from multiple Kafka topics at a time.
First you need to prepare a schema for each Kafka topic; here, for example, I have used two different JSON payloads in the value column.
scala> val df = Seq(("""{"name":"Srinivas"}""","A"),("""{"age":20}""","B")).toDF("value","topic")
scala> df.show(false)
+-------------------+-----+
|value |topic|
+-------------------+-----+
|{"name":"Srinivas"}|A |
|{"age":20} |B |
+-------------------+-----+
scala> import org.apache.spark.sql.types._
Schema for topic A
scala> val topicASchema = DataType.fromJson("""{"type":"struct","fields":[{"name":"name","type":"string","nullable":true,"metadata":{}}]}""").asInstanceOf[StructType]
Schema for topic B
scala> val topicBSchema = DataType.fromJson("""{"type":"struct","fields":[{"name":"age","type":"long","nullable":true,"metadata":{}}]}""").asInstanceOf[StructType]
Combining topic & its schema.
scala> val topicSchema = Seq(("A",topicASchema),("B",topicBSchema)) // Adding Topic & Its Schema.
Processing DataFrame
scala> topicSchema
.par
.map(d => df.filter($"topic" === d._1).withColumn("value",from_json($"value",d._2)))
.foreach(_.show(false)) // Using .par & filtering dataframe based on topic & then applying schema to value column.
+----------+-----+
|value |topic|
+----------+-----+
|[Srinivas]|A |
+----------+-----+
+-----+-----+
|value|topic|
+-----+-----+
|[20] |B |
+-----+-----+
Writing to hdfs
scala> topicSchema
.par
.map(d => df.filter($"topic" === d._1).withColumn("value",from_json($"value",d._2)).write.format("json").save(s"/tmp/kafka_data/${d._1}"))
Final Data stored in hdfs
scala> import sys.process._
import sys.process._
scala> "tree /tmp/kafka_data".!
/tmp/kafka_data
├── A
│   ├── part-00000-1e854106-49de-44b3-ab18-6c98a126c8ca-c000.json
│   └── _SUCCESS
└── B
├── part-00000-1bd51ad7-cfb6-4187-a374-4e2d4ce9cc50-c000.json
└── _SUCCESS
2 directories, 4 files
In this case it is better to use .map instead of .foreach. The reason is that map returns a new Dataset, while foreach just applies a function and doesn't return anything.
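For illustration, a tiny sketch of that difference on a toy Dataset (assuming spark.implicits._ is in scope):

import spark.implicits._

val ds = Seq("a", "b", "c").toDS()   // Dataset[String]
val upper = ds.map(_.toUpperCase)    // map returns a new Dataset[String] that can be transformed further or written out
ds.foreach(s => println(s))          // foreach returns Unit -- side effects only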
One other thing that can help you is to parse the schema from the JSON itself.
I had a similar requirement recently.
My JSON object has a "similar" schema for both topic A and B. If that is not the case for you, you might need to create multiple dataframes in the solution below by grouping them by topic.
val sanitiseJson: String => String = value => value
.replace("\\\"", "\"")
.replace("\\\\", "\\")
.replace("\n", "")
.replace("\"{", "{")
.replace("}\"", "}")
val parsed = df.toJSON
.map(sanitiseJson)
This will give you something like:
{
"value": { ... },
"topic": "A"
}
Then you can pass that into a new read function:
var dfWithSchema = spark.read.json(parsed)
At this point you would access the value in the nested JSON:
dfWithSchema.select($"value.propertyInJson")
There are some optimizations you can do when it comes to sanitiseJson if needed.

Selecting a column from a dataset's row

I'd like to loop over a Spark dataset and save specific values in a Map depending on the characteristics of each row. I'm new to Spark and Scala, so I've included a simple example of what I'm trying to do in Python.
Minimal working example in python:
mydict = dict()
for row in data:
    if row['name'] == "Yan":
        mydict[row['id']] = row['surname']
    else:
        mydict[row['id']] = "Random lad"
Where data is a (big) spark dataset, of type org.apache.spark.sql.Dataset[org.apache.spark.sql.Row].
Do you know the Spark or Scala way of doing it?
You cannot loop over the contents of a Dataset because they are not accessible on the machine running this code; instead they are scattered over (possibly many) different worker nodes. That is a fundamental concept of distributed execution engines like Spark.
Instead you have to manipulate your data in a functional (where map, filter, reduce, ... operations are spread to the workers) or declarative (SQL queries that are performed on the workers) way.
To achieve your goal you could run a map over your data which checks whether the name equals "Yan" and go on from there. After this transformation you can collect your dataframe and transform it into a dict.
You should also reconsider your approach to using Spark and the map: it seems you want to create an entry in mydict for each element of data. This means your data is either small enough that you don't really need Spark, or the collect will probably fail because the data does not fit in your driver's memory.
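A minimal sketch of that map-then-collect approach (assuming id, name and surname are string columns, as in the Python example, and keeping in mind the caveat above about collecting to the driver):

import org.apache.spark.sql.functions.when
import spark.implicits._

// data: the Dataset[Row] from the question, assumed to have string columns id, name, surname
val mydict: Map[String, String] = data
  .select($"id", when($"name" === "Yan", $"surname").otherwise("Random lad").as("value"))
  .as[(String, String)]
  .collect()
  .toMap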
I think you are looking for something like this. If your final df is not big, you can collect it and store it as a map.
scala> df.show()
+---+----+--------+
| id|name|surrname|
+---+----+--------+
| 1| Yan| abc123|
| 2| Abc| def123|
+---+----+--------+
scala> df.select('id, when('name === "Yan", 'surrname).otherwise("Random lad")).toDF("K","V").show()
+---+----------+
| K| V|
+---+----------+
| 1| abc123|
| 2|Random lad|
+---+----------+
Here is a simple way to do it, but be careful with collect(), since it collects the data on the driver; the data should be able to fit in the driver's memory.
I don't recommend you do this.
var df: DataFrame = Seq(
  ("1", "Yan", "surname1"),
  ("2", "Yan1", "surname2"),
  ("3", "Yan", "surname3"),
  ("4", "Yan2", "surname4")
).toDF("id", "name", "surname")

val myDict = df.withColumn("newName", when($"name" === "Yan", $"surname").otherwise("RandomeName"))
  .rdd.map(row => (row.getAs[String]("id"), row.getAs[String]("newName")))
  .collectAsMap()
myDict.foreach(println)
Output:
(2,RandomeName)
(1,surname1)
(4,RandomeName)
(3,surname3)

How convert Spark dataframe column from Array[Int] to linalg.Vector?

I have a dataframe, df, that looks like this:
+--------+--------------------+
| user_id| is_following|
+--------+--------------------+
| 1|[2, 3, 4, 5, 6, 7] |
| 2|[20, 30, 40, 50] |
+--------+--------------------+
I can confirm this has the schema:
root
|-- user_id: integer (nullable = true)
|-- is_following: array (nullable = true)
| |-- element: integer (containsNull = true)
I would like to use Spark's ML routines such as LDA to do some machine learning on this, requiring me to convert the is_following column to a linalg.Vector (not a Scala vector). When I try to do this via
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val assembler = new VectorAssembler().setInputCols(Array("is_following")).setOutputCol("features")
val output = assembler.transform(df)
I then get the following error:
java.lang.IllegalArgumentException: Data type ArrayType(IntegerType,true) is not supported.
If I am interpreting that correctly, I take away from it that I need to convert types here from integer to something else. (Double? String?)
My question is what is the best way to convert this array to something that will properly vectorize for the ML pipeline?
EDIT: If it helps, I don't have to structure the dataframe this way. I could instead have it be:
+--------+------------+
| user_id|is_following|
+--------+------------+
| 1| 2|
| 1| 3|
| 1| 4|
| 1| 5|
| 1| 6|
| 1| 7|
| 2| 20|
| ...| ...|
+--------+------------+
A simple solution, which both converts the array into a linalg.Vector and at the same time converts the integers into doubles, would be to use a UDF.
Using your dataframe:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
import org.apache.spark.ml.linalg.Vectors

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val df = spark.createDataFrame(Seq((1, Array(2,3,4,5,6,7)), (2, Array(20,30,40,50))))
  .toDF("user_id", "is_following")

val convertToVector = udf((array: Seq[Int]) => {
  Vectors.dense(array.map(_.toDouble).toArray)
})

val df2 = df.withColumn("is_following", convertToVector($"is_following"))
spark.implicits._ is imported here to allow the use of $; col() or ' could be used instead.
Printing the df2 dataframe will give the wanted results:
+-------+-------------------------+
|user_id|is_following |
+-------+-------------------------+
|1 |[2.0,3.0,4.0,5.0,6.0,7.0]|
|2 |[20.0,30.0,40.0,50.0] |
+-------+-------------------------+
schema:
root
|-- user_id: integer (nullable = false)
|-- is_following: vector (nullable = true)
So your initial input might be better suited than your transformed input. Spark's VectorAssembler requires that all of the columns be Doubles, and not arrays of doubles. Since different users could follow different numbers of people, your current structure could be good; you just need to convert is_following into a Double, which you could in fact do with Spark's VectorIndexer https://spark.apache.org/docs/2.1.0/ml-features.html#vectorindexer or just manually in SQL.
So the tl;dr is: the type error occurs because Spark's Vectors only support Doubles (this is likely to change for image data in the not-so-distant future, but that isn't a good fit for your use case anyway), and your alternative structure might actually be better suited (the one without the grouping).
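For the ungrouped layout, the manual conversion mentioned above is essentially just a cast to double before assembling. A small sketch (the toy data below mirrors the edit in the question):

import org.apache.spark.ml.feature.VectorAssembler
import spark.implicits._

// Ungrouped (user_id, is_following) layout from the question's edit, built inline for the sketch.
val longDF = Seq((1, 2), (1, 3), (1, 4), (2, 20)).toDF("user_id", "is_following")

val casted = longDF.withColumn("is_following", $"is_following".cast("double"))

val features = new VectorAssembler()
  .setInputCols(Array("is_following"))
  .setOutputCol("features")
  .transform(casted)                 // each row now carries a one-element double vector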
You might find looking at the collaborative filtering example in the Spark documentation useful on your further adventure - https://spark.apache.org/docs/latest/ml-collaborative-filtering.html . Good luck and have fun with Spark ML :)
edit:
I noticed you said you're looking to do LDA on the inputs so let's also look at how to prepare the data for that format. For LDA input you might want to consider using the CountVectorizer (see https://spark.apache.org/docs/2.1.0/ml-features.html#countvectorizer)
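A rough sketch of that CountVectorizer route, treating each followed id as a token (CountVectorizer expects an array of strings, so the integer array is cast first; the toy data mirrors the question):

import org.apache.spark.ml.feature.CountVectorizer
import spark.implicits._

val df = Seq((1, Array(2, 3, 4, 5, 6, 7)), (2, Array(20, 30, 40, 50)))
  .toDF("user_id", "is_following")

// CountVectorizer works on array<string>, so cast the integer ids to strings first.
val tokens = df.withColumn("tokens", $"is_following".cast("array<string>"))

val cvModel = new CountVectorizer()
  .setInputCol("tokens")
  .setOutputCol("features")
  .fit(tokens)

val ldaInput = cvModel.transform(tokens)   // "features" is a sparse count vector suitable as LDA input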