I have a DataFrame with three columns: city_name, driver_name, and vehicles, where vehicles is a list.
I also have other details, such as driver hours and driver contact, for each driver in MySQL. Tables in the database are named in the format city_name.driver_name.
scala> val tables = """
[
{"vehicles" : ["subaru","mazda"], "city_name" : "seattle", "driver_name" : "x"},
{"city_name" : "seattle", "driver_name" : "y"},
{"city_name" : "newyork", "driver_name" : "x"},
{"city_name" : "dallas", "driver_name" : "y"}
]
""" | | | | | | |
tables: String =
"
[
{"vehicles" : ["subaru","mazda"], "city_name" : "seattle", "driver_name" : "x"},
{"city_name" : "seattle", "driver_name" : "y"},
{"city_name" : "newyork", "driver_name" : "x"},
{"city_name" : "dallas", "driver_name" : "y"}
]
"
scala> val metadataRDD = sc.parallelize(tables.split('\n').map(_.trim.filter(_ >= ' ')).mkString :: Nil)
metadataRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[5] at parallelize at <console>:30
scala> val metadataDF = spark.read.json(metadataRDD)
metadataDF: org.apache.spark.sql.DataFrame = [city_name: string, driver_name: string ... 1 more field]
scala> metadataDF.show
+---------+-----------+---------------+
|city_name|driver_name|       vehicles|
+---------+-----------+---------------+
|  seattle|          x|[subaru, mazda]|
|  seattle|          y|           null|
|  newyork|          x|           null|
|   dallas|          y|           null|
+---------+-----------+---------------+
For each of these drivers I need to apply a function and write the result to Parquet. What I am trying to do is use an inline function as below, but I can't get it to work:
metadataDF.map((e) => {
  val path = "s3://test/"
  val df = sparkJdbcReader.option("dbtable",
    e.city_name + "." + e.driver_name).load()
  val dir = path + e.driver_name + e.city_name
  if (e.vehicles)
    // do something
  else
    df.write.mode("overwrite").format("parquet").save(dir)
})
Basically the question is how to use that inline function.
A call to map() always transforms a given input collection of type A into another collection of type B using the supplied function. In your map call, however, you are saving a DataFrame to your storage layer. The save() method defined on the DataFrameWriter class has a return type of Unit (think of it as void in Java). Hence your function cannot work as written: it would transform your DataFrame into essentially two different types, whatever the if block returns and the Unit returned from the else block.
You can refactor your code and break it up in two blocks or so:
import org.apache.spark.sql.functions.{concat,concat_ws,lit,col}
import org.apache.spark.sql.DataFrame
import org.apache.spark.rdd.RDD
val metadataRDD: RDD[String] = sc.parallelize(tables.split('\n').map(_.trim.filter(_ >= ' ')).mkString :: Nil)
val metadataDF: DataFrame = spark.read.json(metadataRDD)
val df_new_col: DataFrame = metadataDF
.withColumn("city_driver",concat_ws(".",col("city_name"),col("driver_name")))
.withColumn("dir",concat(lit("s3://test/"),col("city_name"),col("driver_name")))
You now have two columns holding the table names and their target paths side by side. You can collect them and use them to read your tables as DataFrames and store them in Parquet format.
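For example, a minimal sketch of that last step (assuming sparkJdbcReader is the preconfigured JDBC reader from your question):
// Collect the (table, path) pairs to the driver and write each table out as Parquet.
df_new_col
  .select("city_driver", "dir")
  .collect()
  .foreach { row =>
    val table = row.getString(0) // e.g. "seattle.x"
    val dir   = row.getString(1) // e.g. "s3://test/seattlex"
    sparkJdbcReader
      .option("dbtable", table)
      .load()
      .write.mode("overwrite").format("parquet").save(dir)
  }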
I am trying, without success, to manipulate an array column in a DataFrame; for example, I would like to sum the values or get the max for each id.
My JSON file looks like this:
{
"prices": [
[
1601759212851,
0.1858011018283193
],
[
1601924861574,
0.1858011018283193
],
[
1601971658854,
0.1858011018283193
]
],
"vol": [
[
1606930725994,
351221.0671864218
]
],
"id": "myId1"
}
{
"prices": [
[
1606930723991,
0.002319862805425766
]
],
"vol": [
[
1606930723991,
651491.0171818669
]
],
"id": "myId2"
}
val tf = spark.read.json("myJsonFile")
val test = tf.select("prices")
val test: org.apache.spark.sql.DataFrame = [prices: array<array<double>>]
test.map(_._2).sum
I am getting this error:
<console>:28: error: value _2 is not a member of org.apache.spark.sql.Row
I've tried a UDF as:
val totalPrice = udf((xs: Seq[Row]) => xs.map(_.getAs[Double]("prices")).sum)
but it doesn't work!
There are a few different ways of doing this. The easiest way, as of Spark 3.0, is to use the transform function (which is analogous to map, working with Column elements) on your DataFrame:
df.select('id, array_max(transform('prices, x => x(1))).as("max_price")).show(false)
+-----+--------------------+
|id   |max_price           |
+-----+--------------------+
|myId1|0.1858011018283193  |
|myId2|0.002319862805425766|
+-----+--------------------+
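For the "addition" part of your question, a similar sketch (also Spark 3.0+; aggregate and transform come from org.apache.spark.sql.functions, and the symbol syntax assumes spark.implicits._ is imported):
// Sum the second element of each inner prices array, per id.
df.select(
  'id,
  aggregate(transform('prices, x => x(1)), lit(0.0), (acc, x) => acc + x).as("sum_price")
).show(false)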
Another way is to transform your DataFrame into a DataSet, in which case map, filter, etc. will behave as you expect because you are dealing with simple Scala types:
case class Result(id : String, maxPrice : Double)
val result = df.map{
case org.apache.spark.sql.Row(id : String, prices : Seq[Seq[Double]], _) => Result(id, prices.map(_(1)).max)
}
result.show
+-----+--------------------+
|   id|            maxPrice|
+-----+--------------------+
|myId1|  0.1858011018283193|
|myId2|0.002319862805425766|
+-----+--------------------+
If for whatever reason you can't use the DataSet and are on a 2.x version of Spark, you can explode the prices column and use other DF functions on the simpler arrays:
df.select('id, explode('prices).as("prices")).groupBy('id).agg(max($"prices"(1)).as("max_price")).show
+-----+--------------------+
|   id|           max_price|
+-----+--------------------+
|myId1|  0.1858011018283193|
|myId2|0.002319862805425766|
+-----+--------------------+
UDFs are also an option, but discouraged. I'd encourage you to check out https://databricks.com/blog/2017/05/24/working-with-nested-data-using-higher-order-functions-in-sql-on-databricks.html for a rundown of various ways to deal with nested arrays in Spark.
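If you do go the UDF route anyway, here is a rough sketch of a working version; note that prices arrives in the UDF as Seq[Seq[Double]], not Seq[Row], which is why the UDF in the question fails:
import org.apache.spark.sql.functions.{col, udf}

// Returning an Option makes empty arrays yield null instead of throwing.
val maxPriceUdf = udf((xs: Seq[Seq[Double]]) => if (xs.isEmpty) None else Some(xs.map(_(1)).max))
df.select(col("id"), maxPriceUdf(col("prices")).as("max_price")).show(false)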
I have a code snippet that reads a JSON array of file paths, unions the output, and gives me two different tables. I want to create a separate createOrReplaceTempView(name) for each of those tables, where the name comes from the JSON array below:
{
"source": [
{
"name": "testPersons",
"data": [
"E:\\dataset\\2020-05-01\\",
"E:\\dataset\\2020-05-02\\"
],
"type": "json"
},
{
"name": "testPets",
"data": [
"E:\\dataset\\2020-05-01\\078\\",
"E:\\dataset\\2020-05-02\\078\\"
],
"type": "json"
}
]
}
My output:
testPersons
+------+---+
|name  |age|
+------+---+
|John  |24 |
|Cammy |20 |
|Britto|30 |
|George|23 |
|Mikle |15 |
+------+---+
testPets
+------+---+
|name  |age|
+------+---+
|piku  |2  |
|jimmy |3  |
|rapido|1  |
+------+---+
Above is my output; my code iterates through each entry of the JSON array, reads its data section, and loads the data.
But how can I change my code below to create a temp view for each output table?
For example, I want to call .createOrReplaceTempView("testPersons") and .createOrReplaceTempView("testPets"),
with the view name taken from the JSON array.
if (dataArr(counter)("type").value.toString() == "json") {
  val name = dataArr(counter)("name").value.toString()
  val dataPath = dataArr(counter)("data").arr
  val input = dataPath.map(item => {
    val rdd = spark.sparkContext.wholeTextFiles(item.str).map(i => "[" + i._2.replaceAll("\\}.*\n{0,}.*\\{", "},{") + "]")
    spark
      .read
      .schema(Schema.getSchema(name))
      .option("multiLine", true)
      .json(rdd)
  })
  val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], Schema.getSchema(name))
  val finalDF = input.foldLeft(emptyDF)((x, y) => x.union(y))
  finalDF.show()
}
Expected output:
spark.sql("SELECT * FROM testPersons").show()
spark.sql("SELECT * FROM testPets").show()
It should give me the table for each one.
Since you already have your data wrangled into shape and have your rows in DataFrames and simply want to access them as temporary views, I suppose you are looking for the function(s):
createOrReplaceGlobalTempView
createOrReplaceTempView
They can be invoked from a DataFrame/Dataset.
df.createOrReplaceGlobalTempView("testPersons")
spark.sql("SELECT * FROM global_temp.testPersons").show()
df.createOrReplaceTempView("testPersons")
spark.sql("SELECT * FROM testPersons").show()
For an explanation of the difference between the two (a global temp view is registered in the global_temp database and shared across Spark sessions, while a plain temp view is scoped to the current session), you can take a look at this question.
If you are trying to read the JSON dynamically, you can parse it, load the files listed under data into DataFrames, and then register each of them as its own view:
import net.liftweb.json._
import net.liftweb.json.DefaultFormats
case class Source(name: String, data: List[String], `type`: String)
val file = scala.io.Source.fromFile("path/to/your/file").mkString
implicit val formats: DefaultFormats.type = DefaultFormats
val json = parse(file)
val sourceList = (json \ "source").children
for (source <- sourceList) {
  val s = source.extract[Source]
  // read every path for this source and union the results into one DataFrame
  val df = s.data.map(d => spark.read.json(d)).reduce(_ union _)
  df.createOrReplaceTempView(s.name)
}
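Assuming the reads succeed, the queries from your expected output should then work against the registered views:
spark.sql("SELECT * FROM testPersons").show()
spark.sql("SELECT * FROM testPets").show()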
I'm new to Scala and wonder what the best method would be to validate a CSV file,
preferably using a map function and adding a new column depending on whether the conditions were met.
I want to put this into a UDF for my DataFrame in Apache Spark.
Here is the schema:
Record Type val1 val2 val3
TYPE1 1 2 ZZ
TYPE2 2 555 KK
And the JSON definition I want to validate against:
"rows" :
{
"TYPE1" :
"fields" : [
{
"required" : "true",
"regex": "TYPE1",
},
{
"required" : true",
"regex" :"[a-zA-Z]{2}[a-zA-Z]{2}",
"allowed_values": null
},
{
"required" : true",
"regex" :"[a-zA-Z]{2}[a-zA-Z]{2}",
"allowed_values" : ["ZZ","KK"]
}
]
}
I'm not sure about your JSON definition (it's also missing some quotes and curly braces), and whether Record Type is a column in the CSV, but here's a simplification -- you can add "Record Type" logic around it if needed.
Assuming a file validator.json:
{
"fields" : [
{
"name" : "val1",
"regex": "[0-9]+"
},{
"name" : "val2",
"regex" :"[0-9]+"
},{
"name" : "val3",
"regex" :"[A-Z]{2}"
}
]
}
Generally, by default (without extra options regarding the schema) spark.read.format("csv").option("header", "true").load("file.csv") will use Strings for all of the columns in your file. Here, it is assumed you have a header val1,val2,val3, as the first line of your CSV. An equivalently defined DF inline:
val df = Seq(("1", "2", "ZZ"), ("2", "555", "KK")).toDF("val1", "val2", "val3")
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.spark.sql.functions.expr
import scala.io.Source
val mapper = new ObjectMapper
mapper.registerModule(DefaultScalaModule)
// read the validator as one long string
val jsonString = Source.fromFile("validator.json").getLines.mkString("")
// map the json string into an object (nested map)
val regexMap:Map[String,Seq[Map[String,String]]] = mapper.readValue(jsonString, classOf[Map[String, Seq[Map[String, String]]]])
//val1 rlike '[0-9]+' AND val2 rlike '[0-9]+' AND val3 rlike '[A-Z]{2}'
val exprStr:String = regexMap("fields").map((fieldDef:Map[String, String]) => s"${fieldDef("name")} rlike '${fieldDef("regex")}'").mkString(" AND ")
// this asks whether all rows match
val matchingRowCount:Long = df.filter(expr("val1 rlike '[0-9]+' AND val2 rlike '[0-9]+' AND val3 rlike '[A-Z][A-Z]'")).count
// if the counts match, then all of the rows follow the rules
df.count == matchingRowCount
// this adds a column about whether the row matches
df.withColumn("matches",expr(exprStr)).show
result:
+----+----+----+-------+
|val1|val2|val3|matches|
+----+----+----+-------+
|   1|   2|  ZZ|   true|
|   2| 555|  KK|   true|
+----+----+----+-------+
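If you specifically want this packaged as a UDF, as mentioned in the question, here is a rough sketch built on the same regexMap (treating null values as invalid is my own assumption):
import org.apache.spark.sql.functions.{array, col, udf}

// (column name, regex) pairs in the order they appear in validator.json
val rules: Seq[(String, String)] = regexMap("fields").map(f => (f("name"), f("regex")))

// true only when every column value matches its regex; nulls fail the check
val validateRow = udf((vals: Seq[String]) =>
  vals.zip(rules).forall { case (v, (_, regex)) => v != null && v.matches(regex) })

df.withColumn("matches", validateRow(array(rules.map { case (name, _) => col(name) }: _*))).show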
I have a JSON file which looks like this:
{{"name":"jonh", "food":"tomato", "weight": 1},
{"name":"jonh", "food":"carrot", "weight": 4},
{"name":"bill", "food":"apple", "weight": 1},
{"name":"john", "food":"tomato", "weight": 2},
{"name":"bill", "food":"taco", "weight": 2}},
{"name":"bill", "food":"taco", "weight": 4}},
I need to create a new JSON like this:
{
{"name":"jonh",
"buy": [{"tomato": 3},{"carrot": 4}]
},
{"name":"bill",
"buy": [{"apple": 1},{"taco": 6}]
}
}
This is my DataFrame:
val df = Seq(
("john", "tomato", 1),
("john", "carrot", 4),
("bill", "apple", 1),
("john", "tomato", 2),
("bill", "taco", 2),
("bill", "taco", 4)
).toDF("name", "food", "weight")
How can I get a DataFrame with the final structure? groupBy and agg give me the wrong structure:
import org.apache.spark.sql.functions._
df.groupBy("name", "food").agg(sum("weight").as("weight"))
.groupBy("name").agg(collect_list(struct("food", "weight")).as("acc"))
+----+------------------------+
|name|acc                     |
+----+------------------------+
|john|[[carrot,4], [tomato,3]]|
|bill|[[taco,6], [apple,1]]   |
+----+------------------------+
{"name":"john","acc":[{"food":"carrot","weight":4},{"food":"tomato","weight":3}]}
{"name":"bill","acc":[{"food":"taco","weight":6},{"food":"apple","weight":1}]}
Please point me in the right direction to solve this.
You can always convert the values manually by iterating over the Rows, assembling the food-weight pairs, and then converting them to a Map:
val step1 = df.groupBy("name", "food").agg(sum("weight").as("weight")).
groupBy("name").agg(collect_list(struct("food", "weight")).as("buy"))
val result = step1.map(row =>
(row.getAs[String]("name"), row.getAs[Seq[Row]]("buy").map(map =>
map.getAs[String]("food") -> map.getAs[Long]("weight")).toMap)
).toDF("name", "buy")
result.toJSON.show(false)
+---------------------------------------------+
|{"name":"john","buy":{"carrot":4,"tomato":3}}|
|{"name":"bill","buy":{"taco":6,"apple":1}} |
+---------------------------------------------+
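If you then need the result as an actual JSON file, one rough option (coalescing to a single file and the output path are assumptions):
// Write one JSON object per line; coalesce(1) produces a single part file,
// which is convenient for small data but slow for large data.
result.coalesce(1).write.mode("overwrite").json("/tmp/food_by_name")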
You can achieve your required JSON format by using replace techniques.
UDF way
A UDF works on primitive data types, so the replace function can be used to strip the "food": and "weight": keys from the JSON strings of the final DataFrame:
import org.apache.spark.sql.functions._

def replaceUdf = udf((json: String) => json.replace("\"food\":", "").replace("\"weight\":", ""))

val temp = df.groupBy("name", "food").agg(sum("weight").as("weight"))
  .groupBy("name").agg(collect_list(struct(col("food"), col("weight"))).as("buy"))
  .toJSON.withColumn("value", replaceUdf(col("value")))
You should get an output DataFrame like this:
+-------------------------------------------------+
|value                                            |
+-------------------------------------------------+
|{"name":"john","buy":[{"carrot",4},{"tomato",3}]}|
|{"name":"bill","buy":[{"taco",6},{"apple",1}]} |
+-------------------------------------------------+
regexp_replace function
The built-in regexp_replace function can be used to get the desired output as well:
val temp = df.groupBy("name", "food").agg(sum("weight").as("weight"))
.groupBy("name").agg(collect_list(struct(col("food"), col("weight"))).as("buy"))
.toJSON.withColumn("value", regexp_replace(regexp_replace(col("value"), "\"food\":", ""), "\"weight\":", ""))
I want to start by saying that I am forced to use Spark 1.6.
I am generating a DataFrame from a JSON file like this:
{"id" : "1201", "name" : "satish", "age" : "25"},
{"id" : "1202", "name" : "krishna", "age" : "28"},
{"id" : "1203", "name" : "amith", "age" : "28"},
{"id" : "1204", "name" : "javed", "age" : "23"},
{"id" : "1205", "name" : "mendy", "age" : "25"},
{"id" : "1206", "name" : "rob", "age" : "24"},
{"id" : "1207", "name" : "prudvi", "age" : "23"}
The DataFrame looks like:
+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 28|1203|  amith|
| 23|1204|  javed|
| 25|1205|  mendy|
| 24|1206|    rob|
| 23|1207| prudvi|
+---+----+-------+
What I do with this DataFrame is group by age, order by id, and filter all age groups with more than one student. I use the following script:
import sqlContext.implicits._
val df = sqlContext.read.json("students.json")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val arrLen = udf {a: Seq[Row] => a.length > 1 }
val mergedDF = df.withColumn("newCol", collect_set(struct("age","id","name")).over(Window.partitionBy("age").orderBy("id"))).select("newCol","age")
val filterd = mergedDF.filter(arrLen(col("newCol")))
And now the current result is:
[WrappedArray([28,1203,amith], [28,1202,krishna]),28]
[WrappedArray([25,1201,satish], [25,1205,mendy]),25]
[WrappedArray([23,1204,javed], [23,1207,prudvi]),23]
What I want now is to merge those two student rows inside the WrappedArray into one, taking, for example, the id of the first student and the name of the second student.
To achieve that I wrote the following function:
def PrintOne(List: Seq[Row], age: String): Row = {
  val studentsDetails = Array(age, List(0).getAs[String]("id"), List(1).getAs[String]("name"))
  val mergedStudent = new GenericRowWithSchema(studentsDetails.toArray, List(0).schema)
  mergedStudent
}
I know this function does the trick, because when I test it using a foreach it prints out the expected values:
filterd.foreach { x =>
  val student = PrintOne(x.getAs[Seq[Row]](0), x.getAs[String]("age"))
  println("merged student: " + student)
}
Output:
merged student: [28,1203,krishna]
merged student: [23,1204,prudvi]
merged student: [25,1201,mendy]
But when I try to do the same inside a map, in order to collect the returned values, the problems begin.
If I run it without an encoder:
val merged = filterd.map{row => (row.getAs[String]("age") , PrintOne(row.getAs[Seq[Row]](0), row.getAs[String]("age")))}
I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: No
Encoder found for org.apache.spark.sql.Row
- field (class: "org.apache.spark.sql.Row", name: "_2")
- root class: "scala.Tuple2"
And when I try to create an Encoder on my own, I fail as well:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
implicit val encoder = RowEncoder(filterd.schema)
val merged = filterd.map{row => (row.getAs[String]("age") , PrintOne(row.getAs[Seq[Row]](0), row.getAs[String]("age")))}(encoder)
type mismatch;
 found   : org.apache.spark.sql.catalyst.encoders.ExpressionEncoder[org.apache.spark.sql.Row]
 required: org.apache.spark.sql.Encoder[(String, org.apache.spark.sql.Row)]
How can I provide the correct encoder, or better yet, avoid it?
I have been told to avoid using map plus a custom function, but the logic I need to apply is more complex than just picking one field from each row. It will combine fields from several rows, check their ordering, and check whether values are null or not. As far as I know, only a custom function lets me do all of that.
The output of the map is of type (String, Row), so it cannot be encoded using RowEncoder alone. You have to provide a matching tuple encoder:
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
val encoder = Encoders.tuple(
Encoders.STRING,
RowEncoder(
// The same as df.schema in your case
StructType(Seq(
StructField("age", StringType),
StructField("id", StringType),
StructField("name", StringType)))))
filterd.map{row => (
row.getAs[String]("age"),
PrintOne(row.getAs[Seq[Row]](0), row.getAs[String]("age")))
}(encoder)
Overall this approach looks like an anti-pattern. If you want to use a more functional style you should avoid Dataset[Row]:
case class Person(age: String, id: String, name: String)
filterd.as[(Seq[Person], String)].map {
case (people, age) => (age, (age, people(0).id, people(1).name))
}
or udf.
Also please note that the o.a.s.sql.catalyst package, including GenericRowWithSchema, is intended mostly for internal use. Unless strictly necessary, prefer o.a.s.sql.Row.
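For example, a minimal sketch of PrintOne rewritten against the public API (assuming the same age, id, name column order as above):
import org.apache.spark.sql.Row

// Builds the merged row without touching catalyst internals; the schema comes
// from the encoder supplied to map rather than being attached to the Row itself.
def printOne(students: Seq[Row], age: String): Row =
  Row(age, students(0).getAs[String]("id"), students(1).getAs[String]("name"))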