Operating in parallel on Spark DataFrame rows - Scala

Environment: Scala, Spark, Structured Streaming, Kafka
I have a DataFrame coming from a Kafka stream with the following schema:
DF:
BATCH ID: 0
+-----------------------+-----+---------+------+
| value|topic|partition|offset|
+-----------------------+-----+---------+------+
|{"big and nested json"}| A | 0| 0|
|{"big and nested json"}| B | 0| 0|
+-----------------------+-----+---------+------+
I want to process each row in parallel using Spark, and I manage to distribute them across my executors using
DF.repartition(Number).foreach(row => processRow(row))
I need to extract the value from the value column into its own DataFrame to process it.
I'm having difficulties working with the DataFrame's generic Row object.
Is there a way to turn the single row in each executor into its very own DataFrame (using a fixed schema) and write it to a fixed location?
Is there a better approach to solve my problem?
EDIT + Clarification:
The DF I'm receiving arrives as a batch via the foreachBatch function of the writeStream functionality that exists since Spark 2.4.
Currently, splitting the DF into rows distributes them equally across all my executors. I would like to turn a single GenericRow object into a DataFrame so I can process it with a function I wrote.
For example, I would send the row to the function
processRow(row: Row)
take the value and the topic, and turn it back into a single-line DF:
+-----------------------+-----+
| value|topic|
+-----------------------+-----+
|{"big and nested json"}| A |
+-----------------------+-----+
for further processing

I guess you are consuming data from multiple Kafka topics at a time.
First you need to prepare schemas for all Kafka topics; here, for example, I have used two different JSON payloads in the value column.
scala> val df = Seq(("""{"name":"Srinivas"}""","A"),("""{"age":20}""","B")).toDF("value","topic")
scala> df.show(false)
+-------------------+-----+
|value |topic|
+-------------------+-----+
|{"name":"Srinivas"}|A |
|{"age":20} |B |
+-------------------+-----+
scala> import org.apache.spark.sql.types._
Schema for topic A
scala> val topicASchema = DataType.fromJson("""{"type":"struct","fields":[{"name":"name","type":"string","nullable":true,"metadata":{}}]}""").asInstanceOf[StructType]
Schema for topic B
scala> val topicBSchema = DataType.fromJson("""{"type":"struct","fields":[{"name":"age","type":"long","nullable":true,"metadata":{}}]}""").asInstanceOf[StructType]
Combining topic & its schema.
scala> val topicSchema = Seq(("A",topicASchema),("B",topicBSchema)) // Adding Topic & Its Schema.
Processing DataFrame
scala> topicSchema
.par
.map(d => df.filter($"topic" === d._1).withColumn("value",from_json($"value",d._2)))
.foreach(_.show(false)) // Using .par & filtering dataframe based on topic & then applying schema to value column.
+----------+-----+
|value |topic|
+----------+-----+
|[Srinivas]|A |
+----------+-----+
+-----+-----+
|value|topic|
+-----+-----+
|[20] |B |
+-----+-----+
Writing to HDFS
scala> topicSchema
.par
.map(d => df.filter($"topic" === d._1).withColumn("value",from_json($"value",d._2)).write.format("json").save(s"/tmp/kafka_data/${d._1}"))
Final data stored in HDFS:
scala> import sys.process._
import sys.process._
scala> "tree /tmp/kafka_data".!
/tmp/kafka_data
├── A
│   ├── part-00000-1e854106-49de-44b3-ab18-6c98a126c8ca-c000.json
│   └── _SUCCESS
└── B
    ├── part-00000-1bd51ad7-cfb6-4187-a374-4e2d4ce9cc50-c000.json
    └── _SUCCESS
2 directories, 4 files
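To tie this back to the streaming part of the question, here is a hedged sketch of how the same per-topic processing could be driven from foreachBatch (available since Spark 2.4); kafkaDF (the streaming DataFrame read from the Kafka source) and the output path are assumptions, and topicSchema is the Seq[(String, StructType)] defined above.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import spark.implicits._

// Sketch only: process each micro-batch, splitting it by topic and applying that topic's schema.
val query = kafkaDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    topicSchema.par.foreach { case (topic, schema) =>
      batchDF
        .filter($"topic" === topic)
        .withColumn("value", from_json($"value".cast("string"), schema)) // cast in case value is still binary
        .write
        .mode("append")
        .format("json")
        .save(s"/tmp/kafka_data/$topic")
    }
  }
  .start()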

In this case it is better to use .map instead of .foreach. The reason is that map returns a new Dataset you can keep working with, while foreach is just an action: it applies a function for its side effects and doesn't return anything.
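A minimal illustration of that difference (a sketch, assuming a SparkSession named spark):
import spark.implicits._

val ds = Seq(1, 2, 3).toDS()

val doubled = ds.map(_ * 2)  // map returns a new Dataset[Int] that you can keep transforming
doubled.show(false)

ds.foreach(v => println(v))  // foreach returns Unit; it only runs a side effect on the executors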
One other thing that can help you is to parse the schema located in JSON.
I had a similar requirement recently.
My JSON object has a "similar" schema for both topic A and B. If that is not the case for you, you might need to create multiple dataframes in the solution below by grouping them by topic.
val sanitiseJson: String => String = value => value
  .replace("\\\"", "\"")
  .replace("\\\\", "\\")
  .replace("\n", "")
  .replace("\"{", "{")
  .replace("}\"", "}")
val parsed = df.toJSON
  .map(sanitiseJson)
This will give you something like:
{
  "value": { ... },
  "topic": "A"
}
Then you can pass that into a new read function:
var dfWithSchema = spark.read.json(parsed)
At this point you would access the value in the nested JSON:
dfWithSchema.select($"value.propertyInJson")
There are some optimizations you can do when it comes to sanitiseJson if needed.

Related

Selecting a column from a dataset's row

I'd like to loop over a Spark Dataset and save specific values in a Map depending on the characteristics of each row. I'm new to Spark and Scala, so I've included a simple example of what I'm trying to do in Python.
Minimal working example in Python:
mydict = dict()
for row in data:
    if row['name'] == "Yan":
        mydict[row['id']] = row['surname']
    else:
        mydict[row['id']] = "Random lad"
Where data is a (big) Spark Dataset of type org.apache.spark.sql.Dataset[org.apache.spark.sql.Row].
Do you know the Spark or Scala way of doing it?
You cannot loop over the contents of a Dataset because they are not accessible on the machine running this code; they are scattered across (possibly many) different worker nodes. That is a fundamental concept of distributed execution engines like Spark.
Instead, you have to manipulate your data in a functional way (where map, filter, reduce, ... operations are distributed to the workers) or a declarative way (SQL queries that are executed on the workers).
To achieve your goal, you could run a map over your data which checks whether the name equals "Yan" and go on from there. After this transformation you can collect your DataFrame and turn it into a Map.
You should also reconsider your use of Spark here: it seems you want to create an entry in mydict for each element of data. This means your data is either small enough that you don't really need Spark, or the collect will probably fail because the result does not fit in your driver's memory.
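A minimal sketch of that map-and-collect route, assuming the columns are named id, name and surname and are all strings:
import spark.implicits._

val pairs = data.map { row =>
  val value =
    if (row.getAs[String]("name") == "Yan") row.getAs[String]("surname")
    else "Random lad"
  (row.getAs[String]("id"), value)
}

// Only collect if the result comfortably fits in the driver's memory.
val mydict: Map[String, String] = pairs.collect().toMap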
I think you are looking for something like this. If your final df is not big, you can collect it and store it as a map.
scala> df.show()
+---+----+--------+
| id|name|surrname|
+---+----+--------+
| 1| Yan| abc123|
| 2| Abc| def123|
+---+----+--------+
scala> df.select('id, when('name === "Yan", 'surrname).otherwise("Random lad")).toDF("K","V").show()
+---+----------+
| K| V|
+---+----------+
| 1| abc123|
| 2|Random lad|
+---+----------+
Here is a simple way to do it, but be careful with collect(), since it brings the data to the driver; the data has to fit in the driver's memory.
I don't recommend doing this.
var df: DataFrame = Seq(
  ("1", "Yan", "surname1"),
  ("2", "Yan1", "surname2"),
  ("3", "Yan", "surname3"),
  ("4", "Yan2", "surname4")
).toDF("id", "name", "surname")
val myDict = df.withColumn("newName", when($"name" === "Yan", $"surname").otherwise("RandomeName"))
  .rdd.map(row => (row.getAs[String]("id"), row.getAs[String]("newName")))
  .collectAsMap()
myDict.foreach(println)
Output:
(2,RandomeName)
(1,surname1)
(4,RandomeName)
(3,surname3)

Read, transform and write data within each partition in the DataFrame

Language - Scala
Spark version - 2.4
I am new to both Scala and Spark. (I come from a Python background, so the whole JVM ecosystem is quite new to me.)
I want to write a Spark program to parallelize the following steps:
Read data from S3 into a DataFrame
Transform each row of this dataframe
Write the updated dataframe back to S3 at a new location
Let's say I have 3 items, A, B and C. For each of these items, I want to do the above 3 steps.
I want to do this in parallel for all these 3 items.
I tried creating an RDD with 3 partitions, where each partition has one item: A, B and C, respectively.
Then I tried to use mapPartition method to write my logic for each partition (the 3 steps mentioned above).
I am getting Task not serializable errors. Although I understand the meaning of this error, I don't know how to solve it.
val items = Array[String]("A", "B", "C")
val rdd = sc.parallelize(items, 3)
rdd.mapPartitions(
  partition => {
    val item = partition.next()
    val filePaths = new ListBuffer[String]()
    filePaths += s"$basePath/item=$item/*"
    val df = spark.read.format("parquet").option("basePath", s"$basePath").schema(schema).load(filePaths: _*)
    // Transform this dataframe
    val newDF = df.rdd.mapPartitions(partition => partition.map(row => methodToTransformAndReturnRow(row)))
    newDF.write.mode(SaveMode.Overwrite).parquet(path)
  })
My use case is, for each item, read the data from S3, transform it (I am adding new columns directly to each row for our use case), and write the final result, for each item, back to S3.
Note: I could read the whole dataset, repartition by item, transform it and write it back, but repartitioning results in a shuffle, which I am trying to avoid. The approach I am trying instead is to read the data for each item on an executor itself, so that each executor works on whatever data it gets and no shuffle is needed.
I am not sure what you are trying to achieve with the approach you have shown, but I feel you might be going about it the hard way. Unless there is a good reason to do so, it is often best to let Spark (especially Spark 2.0+) do its own thing. In this case, just process all three partitions using a single operation. Spark will usually manage your dataset quite well. It may also introduce optimisations that you did not think of, or optimisations that it cannot apply if you try to control the process too tightly. Having said that, if it doesn't manage the process well, then you can start arguing with it by taking more control and doing things more manually. At least that is my experience so far.
For example, I once had a complex series of transformations that added more logic to each step/DataFrame. If I forced Spark to evaluate each intermediate one (e.g. by running a count or a show on the intermediate DataFrames), I would eventually get to a point where it couldn't evaluate one DataFrame (i.e. it couldn't do the count) because of insufficient resources. However, if I ignored that and added more transformations, Spark was able to push some optimisations from the later steps to earlier ones. This meant that the subsequent DataFrames (and importantly my final DataFrame) could be evaluated correctly. This final evaluation was possible despite the fact that the intermediate DataFrame, which on its own could not be evaluated, was still part of the overall process.
Consider the following. I use CSV, but it will work the same for parquet.
Here is my input:
data
├── tag=A
│   └── data.csv
├── tag=B
│   └── data.csv
└── tag=C
    └── data.csv
Here is an example of one of the data files (tag=A/data.csv)
id,name,amount
1,Fred,100
2,Jane,200
Here is a script that recognises the partitions within this structure (i.e. tag is one of the columns).
scala> val inDataDF = spark.read.option("header","true").option("inferSchema","true").csv("data")
inDataDF: org.apache.spark.sql.DataFrame = [id: int, name: string ... 2 more fields]
scala> inDataDF.show
+---+-------+------+---+
| id| name|amount|tag|
+---+-------+------+---+
| 31| Scott| 3100| C|
| 32|Barnaby| 3200| C|
| 20| Bill| 2000| B|
| 21| Julia| 2100| B|
| 1| Fred| 100| A|
| 2| Jane| 200| A|
+---+-------+------+---+
scala> inDataDF.printSchema
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- amount: integer (nullable = true)
|-- tag: string (nullable = true)
scala> inDataDF.write.partitionBy("tag").csv("outData")
scala>
Again, I used CSV rather than parquet; with parquet you can dispense with the options to read a header and infer the schema, since parquet handles that automatically. Apart from that, it will work the same way.
The above produces the following directory structure:
outData/
├── _SUCCESS
├── tag=A
│   └── part-00002-9e13ec13-7c63-4cda-b5af-e2d69cb76278.c000.csv
├── tag=B
│   └── part-00001-9e13ec13-7c63-4cda-b5af-e2d69cb76278.c000.csv
└── tag=C
    └── part-00000-9e13ec13-7c63-4cda-b5af-e2d69cb76278.c000.csv
If you want to manipulate the contents, by all means add whatever map operation, join, filter or anything else you need between the read and the write.
For example, add 500 to the amount:
scala> val outDataDF = inDataDF.withColumn("amount", $"amount" + 500)
outDataDF: org.apache.spark.sql.DataFrame = [id: int, name: string ... 2 more fields]
scala> outDataDF.show(false)
+---+-------+------+---+
|id |name |amount|tag|
+---+-------+------+---+
|31 |Scott |3600 |C |
|32 |Barnaby|3700 |C |
|20 |Bill |2500 |B |
|21 |Julia |2600 |B |
|1 |Fred |600 |A |
|2 |Jane |700 |A |
+---+-------+------+---+
Then simply write outDataDF instead of inDataDF.
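For a parquet-based version of the same flow, here is a minimal sketch, under the assumption that the input directory has the same tag= layout but contains parquet files; the read and write are even shorter since parquet carries its own schema:
import spark.implicits._

// Read all partitions at once; "tag" is discovered as a partition column.
val inDataDF = spark.read.parquet("data")

// Any per-row transformation goes here, e.g. adding 500 to amount.
val outDataDF = inDataDF.withColumn("amount", $"amount" + 500)

// Write back, again partitioned by tag; each tag ends up in its own directory.
outDataDF.write.partitionBy("tag").mode("overwrite").parquet("outData")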

check data size spark dataframes

I have the following question:
I am working with the following CSV file:
""job"";""marital"""
""management"";""married"""
""technician"";""single"""
I loaded it into a Spark DataFrame, as shown in the code below.
My aim is to check the length and type of each field in the DataFrame against the following set of rules:
col      type
job      char10
marital  char7
I started implementing the check of the length of each field, but I am getting a compilation error:
val data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:////home/user/Desktop/user/file.csv")
data.map(line => {
  val fields = line.toString.split(";")
  fields(0).size
  fields(1).size
})
The expected output should be:
List(10,10)
As for checking the types, I have no idea how to implement it since we are using DataFrames. Any idea about a function for verifying the data format?
Thanks a lot in advance for your replies.
I see you are trying to use a DataFrame, but since there are multiple double quotes, you can read the file as text, remove them, and convert it to a DataFrame as below:
import org.apache.spark.sql.functions._
import spark.implicits._
val raw = spark.read.textFile("path to file ")
  .map(_.replaceAll("\"", ""))
val header = raw.first
val data = raw.filter(row => row != header)
  .map { r => val x = r.split(";"); (x(0), x(1)) }
  .toDF(header.split(";"): _*)
With data.show(false) you get:
+----------+-------+
|job |marital|
+----------+-------+
|management|married|
|technician|single |
+----------+-------+
To calculate the size, you can use withColumn with the length function and adapt it as you need.
data.withColumn("jobSize", length($"job"))
.withColumn("martialSize", length($"marital"))
.show(false)
Output:
+----------+-------+-------+-----------+
|job |marital|jobSize|martialSize|
+----------+-------+-------+-----------+
|management|married|10 |7 |
|technician|single |10 |6 |
+----------+-------+-------+-----------+
All the column types are String.
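If you also want to check the declared column types and the length rules programmatically, here is a minimal sketch; the maximum lengths (job: 10, marital: 7) are taken from the question, and data is the cleaned DataFrame built above:
import org.apache.spark.sql.functions._

// Declared Spark types per column (all StringType here, since the DF was built from text).
data.dtypes.foreach { case (name, dataType) => println(s"$name: $dataType") }

// Flag values that are longer than the allowed size for each column.
val rules = Map("job" -> 10, "marital" -> 7)
val checked = rules.foldLeft(data) { case (df, (colName, maxLen)) =>
  df.withColumn(s"${colName}_ok", length(df(colName)) <= maxLen)
}
checked.show(false)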
Hope this helps!
You are using a DataFrame, so when you use the map method, you are processing a Row in your lambda.
So line is a Row.
Row.toString returns a string representation of the Row, in your case of the two struct fields typed as String.
If you want to use map and process your Row, you have to get the values of the fields manually, with getAs[String] or getString(index).
Usually when you use DataFrames, you should work with column logic as in SQL, using select, where, ... or directly the SQL syntax.
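A minimal sketch of the map route with manual field access, assuming a DataFrame named data with string columns job and marital (like the cleaned one above):
import spark.implicits._

val sizes = data.map { row =>
  val job = row.getAs[String]("job")
  val marital = row.getAs[String]("marital")
  (job.length, marital.length)
}
sizes.show(false)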

Reshape spark data frame of key-value pairs with keys as new columns

I am new to Spark and Scala. Let's say I have a data frame of lists that are key-value pairs. Is there a way to map the id variables of column ids to new columns?
df.show()
+-------------+----------------------+
|ids          |vals                  |
+-------------+----------------------+
|[id1,id2,id3]|null                  |
|[id2,id5,id6]|[WrappedArray(0,2,4)] |
|[id2,id4,id7]|[WrappedArray(6,8,10)]|
+-------------+----------------------+
Expected output:
+----+----+
|id1 | id2| ...
+----+----+
|null| 0 | ...
|null| 6 | ...
A possible way would be to compute the columns of the new DataFrame and use those columns to construct the rows.
import org.apache.spark.sql.functions._
val data = List((Seq("id1","id2","id3"),None),(Seq("id2","id4","id5"),Some(Seq(2,4,5))),(Seq("id3","id5","id6"),Some(Seq(3,5,6))))
val df = sparkContext.parallelize(data).toDF("ids","values")
val values = df.flatMap {
  case Row(t1: Seq[String], t2: Seq[Int]) => Some((t1 zip t2).toMap)
  case Row(_, null) => None
}
// get the unique names of the columns across the original data
val ids = df.select(explode($"ids")).distinct.collect.map(_.getString(0))
// map the values to the new columns (to Some value or None)
val transposed = values.map(entry => Row.fromSeq(ids.map(id => entry.get(id))))
// programmatically recreate the target schema with the columns we found in the data
import org.apache.spark.sql.types._
val schema = StructType(ids.map(id => StructField(id, IntegerType, nullable=true)))
// Create the new DataFrame
val transposedDf = sqlContext.createDataFrame(transposed, schema)
This process will pass through the data 2 times, although depending on the backing data source, calculating the column names can be rather cheap.
Also, this goes back and forth between DataFrames and RDD. I would be interested in seeing a "pure" DataFrame process.
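On newer Spark versions (2.4+), a DataFrame-only variant is possible with map_from_arrays; the following is a sketch under that assumption, reusing the ids/values column names from the code above:
import org.apache.spark.sql.functions._
import spark.implicits._

// Collect the distinct ids once to know the target column names.
val idNames = df.select(explode($"ids")).distinct.as[String].collect()

// Build a per-row map id -> value; a null values array yields a null map,
// so every lookup on those rows returns null, matching the expected output.
val withMap = df.withColumn("kv", map_from_arrays($"ids", $"values"))

val transposedDf = withMap.select(idNames.map(id => $"kv".getItem(id).as(id)): _*)
transposedDf.show(false)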

SPARK: What is the most efficient way to take a KV pair and turn it into a typed dataframe

Spark newbie here, attempting to use Spark to do some ETL and having trouble finding a clean way of unifying the data into the destination schema.
I have multiple DataFrames with these keys/values in a Spark context (streaming):
Dataframe of long values:
entry---------|long---------
----------------------------
alert_refresh |1446668689989
assigned_on |1446668689777
Dataframe of string values
entry---------|string-------
----------------------------
statusmsg |alert msg
url |http:/svr/pth
Dataframe of boolean values
entry---------|boolean------
----------------------------
led_1 |true
led_2 |true
Dataframe of integer values:
entry---------|int----------
----------------------------
id |789456123
I need to create a unified DataFrame based on these, where the key is the field name and each column keeps the type from its source DataFrame. It would look something like this:
id-------|led_1|led_2|statusmsg|url----------|alert_refresh|assigned_on
-----------------------------------------------------------------------
789456123|true |true |alert msg|http:/svr/pth|1446668689989|1446668689777
What is the most efficient way to do this in Spark?
BTW - I tried doing a matrix transform:
val seq_b = df_booleans.flatMap(row => row.toSeq.map(col => (col, row.toSeq.indexOf(col))))
  .map(v => (v._2, v._1))
  .groupByKey.sortByKey(true)
  .map(_._2)
val b_schema_names = seq_b.first.flatMap(r => Array(r))
val b_schema = StructType(b_schema_names.map(r => StructField(r.toString(), BooleanType, true)))
val b_data = seq_b.zipWithIndex.filter(_._2 == 1).map(_._1).first()
val boolean_df = sparkContext.createDataFrame(b_data, b_schema)
Issue: Takes 12 seconds and .sortByKey(true) does not always sort values last