How to change DataFrame headers based on a particular value? - Scala

Suppose we have a CSV file with the following header:
fname,age,class,dob
A second CSV file has a header like:
f_name,user_age,class,DataofBith
I am trying to give all CSV files a common header, so that each one yields a DataFrame with the same standard column names, like:
first_name,age,class,dob
val df2 = df.withColumnRenamed("DateOfBirth","DateOfBirth").withColumnRenamed("fname","name")
df2.printSchema()
But that approach is not generic. Can we do this dynamically for all CSV files according to a standard conversion, so that, for example, a column called fname or f_name is always converted to name?

You can keep a list of (old name, new name) pairs and iterate over it, as shown below.
Approach 1:
val df= Seq((1,"goutam","kumar"),(2,"xyz","kumar")).toDF("id","fname","lname")
val schema=Seq("id"->"sid","fname"->"sfname","lname"->"slname")
val mapedSchema = schema.map(x=>df(x._1).as(x._2))
df.select(mapedSchema :_*)
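If you read the CSV with option("header", false), you can drop the explicit mapping from old names to new names and assign the new names by position (the header line then arrives as an ordinary data row and has to be filtered out).
Approach 2: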
val schema=Seq("sid","sfname","slname")
val mapedSchema=df.columns.zip(schema)
val mapedSchemaWithDF = mapedSchema.map(x=>df(x._1).as(x._2))
df.select(mapedSchemaWithDF:_*)

withColumnRenamed is a no-op when the column is not present in the DataFrame. Hence you can read all the DataFrames, apply the same renaming logic to each of them, and union them later.
import org.apache.spark.sql.DataFrame
def renaming(df: DataFrame): DataFrame = {
df.withColumnRenamed("dob", "DateOfBirth")
.withColumnRenamed("fname", "name")
.withColumnRenamed("f_name", "name")
.withColumnRenamed("user_age", "age")
// append more renaming functions here
}
val df1 = renaming(spark.read.option("header", "true").csv("...your path here"))
val df2 = renaming(spark.read.option("header", "true").csv("...another path here"))
val result = df1.unionAll(df2)
In this case, result will have the same schema (DateOfBirth, name, age).
Edit:
Following your input, if I understand correctly what you need, what about assigning the standard names by position?
val df1 = spark.read.option("header", "true").csv("...your path here").toDF("name", "age", "class", "born_date")
val df2 = spark.read.option("header", "true").csv("...another path here").toDF("name", "age", "class", "born_date")
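The chain of withColumnRenamed calls earlier in this answer can also be driven by data. This is only a minimal sketch: the `renames` map, the `renameAll` helper and the in-memory sample are hypothetical names introduced for illustration, and it relies on withColumnRenamed silently skipping columns that are absent.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("rename-sketch").getOrCreate()
import spark.implicits._

// Hypothetical old-name -> standard-name table; extend it as new variants appear.
val renames: Map[String, String] = Map(
  "fname" -> "name",
  "f_name" -> "name",
  "user_age" -> "age",
  "dob" -> "DateOfBirth"
)

// Apply every rename in turn; names missing from `df` are skipped without error.
def renameAll(df: DataFrame, renames: Map[String, String]): DataFrame =
  renames.foldLeft(df) { case (acc, (oldName, newName)) =>
    acc.withColumnRenamed(oldName, newName)
  }

// Toy frame standing in for one of the CSV files.
val sample = Seq(("toto", 23, "g", "2010-06-09")).toDF("f_name", "user_age", "class", "dob")
val renamedColumns = renameAll(sample, renames).columns.toSeq
```

One caveat of this design: a file that contains both fname and f_name would end up with two columns called name, so the map should only list variants that never occur together.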

You can use a simple select in combination with a Scala Map. It is easier to handle the column transformations via a dictionary (Map) in which the key is the old name and the value is the new name.
Let's first create the two datasets as you described them:
val df1 = Seq(
("toto", 23, "g", "2010-06-09"),
("bla", 35, "s", "1990-10-01"),
("pino", 12, "a", "1995-10-05")
).toDF("fname", "age", "class", "dob")
val df2 = Seq(
("toto", 23, "g", "2010-06-09"),
("bla", 35, "s", "1990-10-01"),
("pino", 12, "a", "1995-10-05")
).toDF("f_name", "user_age", "class", "DataofBith")
Next we create a Scala function named transform which accepts two arguments, the target df and a mapping which contains the transformation details:
import org.apache.spark.sql.DataFrame
val mapping = Map(
  "fname" -> "first_name",
  "f_name" -> "first_name",
  "user_age" -> "age",
  "DataofBith" -> "dob"
)
def transform(df: DataFrame, mapping: Map[String, String]): DataFrame = {
  val keys = mapping.keySet
  // alias the columns found in the mapping, keep the others as they are
  val cols = df.columns.map { c =>
    if (keys.contains(c))
      df(c).as(mapping(c))
    else
      df(c)
  }
  df.select(cols: _*)
}
The function goes through the given columns, checking first whether the current column exists in mapping. If so, it renames the column using the corresponding value from the dictionary; otherwise the column remains untouched. Note that this just renames the column (via an alias), hence we don't expect it to affect performance.
Finally, some examples:
val newDF1 = transform(df1, mapping)
newDF1.show
// +----------+---+-----+----------+
// |first_name|age|class|       dob|
// +----------+---+-----+----------+
// |      toto| 23|    g|2010-06-09|
// |       bla| 35|    s|1990-10-01|
// |      pino| 12|    a|1995-10-05|
// +----------+---+-----+----------+
val newDF2 = transform(df2, mapping)
newDF2.show
// +----------+---+-----+----------+
// |first_name|age|class|       dob|
// +----------+---+-----+----------+
// |      toto| 23|    g|2010-06-09|
// |       bla| 35|    s|1990-10-01|
// |      pino| 12|    a|1995-10-05|
// +----------+---+-----+----------+
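The renaming rule in transform can be checked without Spark by applying it to the header alone. The sketch below is an illustration, not part of the answer above: `normalizeHeader` is a hypothetical helper, and the duplicate check is an added assumption about what you would want when two source columns map to the same standard name.

```scala
// Old-name -> standard-name table (same entries as `mapping` in the answer).
val mapping: Map[String, String] = Map(
  "fname" -> "first_name",
  "f_name" -> "first_name",
  "user_age" -> "age",
  "DataofBith" -> "dob"
)

// Map each incoming column name to its standard name; unknown names pass through.
def normalizeHeader(columns: Seq[String], mapping: Map[String, String]): Seq[String] = {
  val renamed = columns.map(c => mapping.getOrElse(c, c))
  // Fail fast if two source columns collapse onto the same target name.
  val duplicates = renamed.diff(renamed.distinct).distinct
  require(duplicates.isEmpty, s"Duplicate target columns: ${duplicates.mkString(", ")}")
  renamed
}

val header1 = normalizeHeader(Seq("fname", "age", "class", "dob"), mapping)
val header2 = normalizeHeader(Seq("f_name", "user_age", "class", "DataofBith"), mapping)
```

With a DataFrame at hand, the resulting names could be applied in one step with df.toDF(normalizeHeader(df.columns.toSeq, mapping): _*).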

Related

How to split text in each row when getting the data from kafka topic?

I retrieve data from a Kafka topic. After converting the data into a DataFrame with 10 columns, I select one of the columns. I want to split the string in each row so that I can convert the words to their pronunciation. Everything seems to be OK; the only problem is that I can't run the split method on a row.
Here is my code:
val df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "rocket-01.srvs.cloudkafka.com:9094")
.option("subscribe", "k-news")
.option("startingOffsets", "latest")
.option("kafka.security.protocol","SASL_SSL")
.option("kafka.sasl.mechanism", "SCRAM-SHA-256")
.option("kafka.sasl.jaas.config", "org.apache.kafka.common.security.scram.ScramLoginModule required username=" " password="";").load()
val newsStringDF = df.selectExpr("CAST(value AS STRING)")
val newsSchema = StructType( Array(
StructField("_c0",StringType,true),
StructField("id",StringType,true),
StructField("title",StringType,true),
StructField("publication",StringType,true),
StructField("author",StringType,true),
StructField("date",StringType,true),
StructField("year",StringType,true),
StructField("month",StringType,true),
StructField("url",StringType,true),
StructField("content",StringType,true)))
val lines = Source.fromFile("/Project/symbols/cmudict.dict").getLines()
val res = lines.map { line =>line.split(" ", 2) match { case Array(a, b) => a -> b }}.toMap
val newsDF = newsStringDF.select(from_json(col("value"),
newsSchema).as("data")).select("data.*")
val titleColumn = newsDF.select("title").as[String].foreach( message =>
message.split(" ").toList.map { s =>if(res.get(s).isDefined) {res(s)} else
{s}}.mkString(" ")).toDF("title")
val streaming = titleColumn.writeStream.format("console").outputMode("append").trigger(Trigger.ProcessingTime("10 seconds")).start().awaitTermination()
The output should be something like this:
I get a message (the string from the title column) from my Kafka topic and replace it with another string, e.g. "I like football" --> "I like ˈfo͝otˌbôl", and write the message to the console with .writeStream.
Thanks in advance.
You have to remember that a DataFrame is really just a Dataset[Row]. When working with simple types (and even some custom case classes) it's easy to convert between a typed Dataset and an untyped DataFrame. In this particular case, it should be as easy as:
val titleColumn = newsDF
.select("title")
.as[String]
.map( message => message.split(" ").toList.map { s =>
if(res.get(s).isDefined) {res(s)} else {s}}.mkString(" ")
)
.toDF("title")
The .as[String] call converts your DataFrame to a Dataset[String], and .toDF("title") converts it back. This allows your map to operate on Strings instead of Rows. Since this is a transformation, you'll also want to use .map instead of .foreach on the Dataset.
Another option is to retrieve the title String from the Row and split on that:
message.getAs[String]("title").split(" ")
I'm not sure which approach is more efficient. Here is a tested sample:
val df = Seq("aa 11", "bb 22", "cc 33", "dd 44").toDF("title")
val res = Map("aa" -> "AA", "bb" -> "BB", "cc" -> "CC", "dd" -> "DD")
df.as[String].map(m => {
m.split(" ").toList.map(s => {
if(res.get(s).isDefined) res(s) else s
}).mkString(" ")
}).toDF("title").show()
Which results in:
+-----+
|title|
+-----+
|AA 11|
|BB 22|
|CC 33|
|DD 44|
+-----+
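The word replacement itself is plain string logic, so it can be pulled out of the Dataset code and tested on its own. A minimal sketch, where `pronounce` and the toy dictionary are hypothetical stand-ins for the lambda and the parsed cmudict map used above:

```scala
// Replace each space-separated word that has a dictionary entry; keep the rest.
def pronounce(title: String, dict: Map[String, String]): String =
  title.split(" ").map(word => dict.getOrElse(word, word)).mkString(" ")

// Toy dictionary standing in for the map built from the cmudict file.
val dict = Map("aa" -> "AA", "bb" -> "BB")
val replaced = pronounce("aa 11 bb", dict)
```

The function could then be passed to the typed Dataset as .as[String].map(pronounce(_, res)). It assumes non-null titles; rows where the title failed to parse would need an Option guard first.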

create view for two different dataframe in scala spark

I have a code snippet that reads a JSON array of file paths, unions the output, and gives me two different tables. I want to call createOrReplaceTempView(name) for each of those two tables, with the name taken from the JSON array, as below:
{
"source": [
{
"name": "testPersons",
"data": [
"E:\\dataset\\2020-05-01\\",
"E:\\dataset\\2020-05-02\\"
],
"type": "json"
},
{
"name": "testPets",
"data": [
"E:\\dataset\\2020-05-01\\078\\",
"E:\\dataset\\2020-05-02\\078\\"
],
"type": "json"
}
]
}
My output:
testPersons
+------+---+
|name  |age|
+------+---+
|John  |24 |
|Cammy |20 |
|Britto|30 |
|George|23 |
|Mikle |15 |
+------+---+
testPets
+------+---+
|name  |age|
+------+---+
|piku  |2  |
|jimmy |3  |
|rapido|1  |
+------+---+
Above are my output and the JSON array. My code iterates through the array and, for each entry, reads the files listed in its data section.
How can I change the code below to create a temp view for each output table?
For example, I want to call .createOrReplaceTempView("testPersons") and .createOrReplaceTempView("testPets"),
with the view names taken from the JSON array.
if (dataArr(counter)("type").value.toString() == "json") {
val name = dataArr(counter)("name").value.toString()
val dataPath = dataArr(counter)("data").arr
val input = dataPath.map(item => {
val rdd = spark.sparkContext.wholeTextFiles(item.str).map(i => "[" + i._2.replaceAll("\\}.*\n{0,}.*\\{", "},{") + "]")
spark
.read
.schema(Schema.getSchema(name))
.option("multiLine", true)
.json(rdd)
})
val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], Schema.getSchema(name))
val finalDF = input.foldLeft(emptyDF)((x, y) => x.union(y))
finalDF.show()
Expected output:
spark.sql("SELECT * FROM testPersons").show()
spark.sql("SELECT * FROM testPets").show()
It should give me the table for each one.
Since you already have your data wrangled into shape and have your rows in DataFrames and simply want to access them as temporary views, I suppose you are looking for the function(s):
createOrReplaceGlobalTempView
createOrReplaceTempView
They can be invoked from a DataFrame/Dataset.
df.createOrReplaceGlobalTempView("testPersons")
spark.sql("SELECT * FROM global_temp.testPersons").show()
df.createOrReplaceTempView("testPersons")
spark.sql("SELECT * FROM testPersons").show()
For an explanation about the difference between the two, you can take a look at this question.
If you are trying to read the JSON dynamically, load the files listed in data into DataFrames, and then register each result under its own name, you can do something like this:
import net.liftweb.json._
import net.liftweb.json.DefaultFormats
case class Source(name: String, data: List[String], `type`: String)
val file = scala.io.Source.fromFile("path/to/your/file").mkString
implicit val formats: DefaultFormats.type = DefaultFormats
val json = parse(file)
val sourceList = (json \ "source").children
for (source <- sourceList) {
val s = source.extract[Source]
val df = s.data.map(d => spark.read.json(d)).reduce(_ union _)
df.createOrReplaceTempView(s.name)
}

Functional way of writing a huge when/rlike statement

I'm using regex to identify file type based on extension in DataFrame.
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, when}
val ignoreCase :String = "(?i)"
val ignoreExtension :String = "(?:\\.[_\\d]+)*(?:|\\.bck|\\.old|\\.orig|\\.bz2|\\.gz|\\.7z|\\.z|\\.zip)*(?:\\.[_\\d]+)*$"
val pictureFileName :String = "image"
val pictureFileType :String = ignoreCase + "^.+(?:\\.gif|\\.ico|\\.jpeg|\\.jpg|\\.png|\\.svg|\\.tga|\\.tif|\\.tiff|\\.xmp)" + ignoreExtension
val videoFileName :String = "video"
val videoFileType :String = ignoreCase + "^.+(?:\\.mod|\\.mp4|\\.mkv|\\.avi|\\.mpg|\\.mpeg|\\.flv)" + ignoreExtension
val otherFileName :String = "other"
def pathToExtension(cl: Column): Column = {
when(cl.rlike( pictureFileType ), pictureFileName ).
when(cl.rlike( videoFileType ), videoFileName ).
otherwise(otherFileName)
}
val df = List("file.jpg", "file.avi", "file.jpg", "file3.tIf", "file5.AVI.zip", "file4.mp4","afile" ).toDF("filename")
val df2 = df.withColumn("filetype", pathToExtension( col( "filename" ) ) )
df2.show
This is only a sample; I have 30 regexes and types to identify, so the function pathToExtension() gets really long because I have to add a new when statement for each type.
I can't find a proper way to write this code in a functional way, with a list or map containing the regexes and the names, like this:
val typelist = List((pictureFileName,pictureFileType),(videoFileName,videoFileType))
foreach [need help for this part]
All the code I've tried so far won't work properly.
You can use foldLeft to traverse your list of when conditions and chain them as shown below:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._
val default = "other"
def chainedWhen(c: Column, rList: List[(String, String)]): Column = rList.tail.
foldLeft(when(c rlike rList.head._2, rList.head._1))( (acc, t) =>
acc.when(c rlike t._2, t._1)
).otherwise(default)
Testing the method:
val df = Seq(
(1, "a.txt"), (2, "b.gif"), (3, "c.zip"), (4, "d.oth")
).toDF("id", "file_name")
val rList = List(("text", ".*\\.txt"), ("gif", ".*\\.gif"), ("zip", ".*\\.zip"))
df.withColumn("file_type", chainedWhen($"file_name", rList)).show
// +---+---------+---------+
// | id|file_name|file_type|
// +---+---------+---------+
// |  1|    a.txt|     text|
// |  2|    b.gif|      gif|
// |  3|    c.zip|      zip|
// |  4|    d.oth|    other|
// +---+---------+---------+
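chainedWhen above takes rList.head, so it cannot be called with an empty list. As a possible variant (not the answer's method), the same first-match-wins column can be built with foldRight and nested when/otherwise, starting from the default literal. The name `nestedWhen` and the sample data are assumptions made for this sketch.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{lit, when}

val spark = SparkSession.builder().master("local[*]").appName("when-sketch").getOrCreate()
import spark.implicits._

// Nest when/otherwise from the right so that the first rule in the list wins.
// With an empty rule list the result is just the default literal.
def nestedWhen(c: Column, rules: List[(String, String)], default: String): Column =
  rules.foldRight(lit(default)) { case ((name, regex), fallback) =>
    when(c.rlike(regex), name).otherwise(fallback)
  }

val rules = List(("text", ".*\\.txt"), ("gif", ".*\\.gif"))
val files = Seq("a.txt", "b.gif", "d.oth").toDF("file_name")

// Collect (file name -> detected type) pairs for inspection.
val types = files
  .select($"file_name", nestedWhen($"file_name", rules, "other").as("file_type"))
  .as[(String, String)]
  .collect()
  .toMap
```

The trade-off is a nested expression instead of one flat CASE WHEN chain; for a list of around 30 rules either form should be manageable.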

Aggregating all Column values within a Map after groupBy in Apache Spark

I've been trying this all day long with a DataFrame but have had no luck so far. I already did it with an RDD, but it isn't really readable, so this approach would be much better when it comes to code readability.
Take these initial and result DataFrames: the starting DF, and what I would like to obtain after performing .groupBy().
case class SampleRow(name:String, surname:String, age:Int, city:String)
case class ResultRow(name: String, surnamesAndAges: Map[String, (Int, String)])
val df = List(
SampleRow("Rick", "Fake", 17, "NY"),
SampleRow("Rick", "Jordan", 18, "NY"),
SampleRow("Sandy", "Sample", 19, "NY")
).toDF()
val resultDf = List(
ResultRow("Rick", Map("Fake" -> (17, "NY"), "Jordan" -> (18, "NY"))),
ResultRow("Sandy", Map("Sample" -> (19, "NY")))
).toDF()
What I've tried so far is performing the following .groupBy...
val resultDf = df
.groupBy(
Name
)
.agg(
functions.map(
selectColumn(Surname),
functions.array(
selectColumn(Age),
selectColumn(City)
)
)
)
However, the following is printed to the console:
Exception in thread "main" org.apache.spark.sql.AnalysisException: expression '`surname`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
However, doing that would result in a single entry per surname and I would like to accumulate those in a single Map as you can see in resultDf. Is there an easy way to achieve this using DFs?
You can achieve it with a single UDF that converts your data to a map:
val toMap = udf((keys: Seq[String], values1: Seq[String], values2: Seq[String]) => {
  keys.zip(values1.zip(values2)).toMap
})
// keep the DataFrame in the val and show it separately (show returns Unit)
val myResultDF = df.groupBy("name")
  .agg(collect_list("surname") as "surname", collect_list("age") as "age", collect_list("city") as "city")
  .withColumn("surnamesAndAges", toMap($"surname", $"age", $"city")).drop("age", "city", "surname")
myResultDF.show(false)
+-----+--------------------------------------+
|name |surnamesAndAges |
+-----+--------------------------------------+
|Sandy|[Sample -> [19, NY]] |
|Rick |[Fake -> [17, NY], Jordan -> [18, NY]]|
+-----+--------------------------------------+
If you are not concerned about typecasting the DataFrame to a Dataset (in this case of ResultRow), you could do something like this:
val grouped =df.withColumn("surnameAndAge",struct($"surname",$"age"))
.groupBy($"name")
.agg(collect_list("surnameAndAge").alias("surnamesAndAges"))
Then you could create a User defined function which would look like
import org.apache.spark.sql._
// age is an Int column, so match it as Int and convert it to String
val arrayToMap = udf[Map[String, String], Seq[Row]] {
  array => array.map {
    case Row(key: String, value: Int) => (key, value.toString) }.toMap
}
Now you could use a .withColumn and call this udf
val finalData = grouped.withColumn("surnamesAndAges",arrayToMap($"surnamesAndAges"))
The Dataframe would look something like this
finalData: org.apache.spark.sql.DataFrame = [name: string, surnamesAndAges: map<string,string>]
Since Spark 2.4, you don't need to use a Spark user-defined function:
import org.apache.spark.sql.functions.{col, collect_set, map_from_entries, struct}
df.withColumn("mapEntry", struct(col("surname"), struct(col("age"), col("city"))))
.groupBy("name")
.agg(map_from_entries(collect_set("mapEntry")).as("surnamesAndAges"))
Explanation
You first add a column containing a map entry built from the desired columns. A map entry is merely a struct containing two columns: the first column is the key and the second column is the value. You can put another struct as the value. So here your map entry will use column surname as key, and a struct of columns age and city as value:
struct(col("surname"), struct(col("age"), col("city")))
Then you collect all the map entries grouped by your groupBy key, which is column name, using the function collect_set, and you convert this collection of map entries to a map using the function map_from_entries.
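To make the target shape concrete, here is the same grouping expressed on plain Scala collections rather than on a DataFrame. It is only an illustration of what the aggregation is meant to produce, reusing the case classes from the question; the sort at the end is added just to give a stable order.

```scala
case class SampleRow(name: String, surname: String, age: Int, city: String)
case class ResultRow(name: String, surnamesAndAges: Map[String, (Int, String)])

val rows = List(
  SampleRow("Rick", "Fake", 17, "NY"),
  SampleRow("Rick", "Jordan", 18, "NY"),
  SampleRow("Sandy", "Sample", 19, "NY")
)

// Group by name, then turn each group into surname -> (age, city) entries.
val grouped: List[ResultRow] = rows
  .groupBy(_.name)
  .map { case (name, group) =>
    ResultRow(name, group.map(r => (r.surname, (r.age, r.city))).toMap)
  }
  .toList
  .sortBy(_.name)
```

Note that toMap keeps only the last entry when a surname repeats within a group, so duplicate keys are worth deciding on explicitly in either version.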

Spark update value in the second dataset based on the value from first dataset

I have two Spark datasets. The first has the columns accountid and key, where the key column is an array [key1, key2, key3, ...]. The second has the columns accountid and key values, where the values are JSON: accountid, {key:value, key:value, ...}. I need to update the value in the second dataset if the key appears for that accountid in the first dataset.
import org.apache.spark.sql.functions._
val df= sc.parallelize(Seq(("20180610114049", "id1","key1"),
("20180610114049", "id2","key2"),
("20180610114049", "id1","key1"),
("20180612114049", "id2","key1"),
("20180613114049", "id3","key2"),
("20180613114049", "id3","key3")
)).toDF("date","accountid", "key")
val gp=df.groupBy("accountid","date").agg(collect_list("key"))
+---------+--------------+-----------------+
|accountid| date|collect_list(key)|
+---------+--------------+-----------------+
| id2|20180610114049| [key2]|
| id1|20180610114049| [key1, key1]|
| id3|20180613114049| [key2, key3]|
| id2|20180612114049| [key1]|
+---------+--------------+-----------------+
val df2= sc.parallelize(Seq(("20180610114049", "id1","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
("20180610114049", "id2","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
("20180611114049", "id1","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
("20180612114049", "id2","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
("20180613114049", "id3","{'key1':'0.0','key2':'0.0','key3':'0.0'}")
)).toDF("date","accountid", "result")
+--------------+---------+----------------------------------------+
|date |accountid|result |
+--------------+---------+----------------------------------------+
|20180610114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180610114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180611114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180612114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180613114049|id3 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
+--------------+---------+----------------------------------------+
expected output
+--------------+---------+----------------------------------------+
|date |accountid|result |
+--------------+---------+----------------------------------------+
|20180610114049|id1 |{'key1':'1.0','key2':'0.0','key3':'0.0'}|
|20180610114049|id2 |{'key1':'0.0','key2':'1.0','key3':'0.0'}|
|20180611114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180612114049|id2 |{'key1':'1.0','key2':'0.0','key3':'0.0'}|
|20180613114049|id3 |{'key1':'0.0','key2':'1.0','key3':'1.0'}|
+--------------+---------+----------------------------------------+
You will most definitely need a UDF to do it cleanly here.
You can pass both the array and the JSON to the UDF after joining on date and accountid, parse the JSON inside the UDF using the parser of your choice (I'm using JSON4S in the example), check if the key exists in the array and then change the value, convert it to JSON again and return it from the UDF.
val gp=df.groupBy("accountid","date").agg(collect_list("key").as("key"))
val joined = df2.join(gp, Seq("date", "accountid") , "left_outer")
joined.show(false)
//+--------------+---------+----------------------------------------+------------+
//|date |accountid|result |key |
//+--------------+---------+----------------------------------------+------------+
//|20180610114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key2] |
//|20180613114049|id3 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key2, key3]|
//|20180610114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key1, key1]|
//|20180611114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|null |
//|20180612114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key1] |
//+--------------+---------+----------------------------------------+------------+
// the UDF that will do the most work
// it's important to declare `formats` inside the function
// to avoid object not Serializable exception
// Not all cases are covered, use with caution :D
val convertJsonValues = udf{(json: String, arr: Seq[String]) =>
import org.json4s.jackson.JsonMethods._
import org.json4s.JsonDSL._
implicit val format = org.json4s.DefaultFormats
// replace single quotes with double
val kvMap = parse(json.replaceAll("'", """"""")).values.asInstanceOf[Map[String,String]]
val updatedKV = kvMap.map{ case(k,v) => if(arr.contains(k)) (k,"1.0") else (k,v) }
compact(render(updatedKV))
}
// Use when-otherwise and send empty array where `key` is null
joined.select($"date",
$"accountid",
when($"key".isNull, convertJsonValues($"result", array()))
.otherwise(convertJsonValues($"result", $"key"))
.as("result")
).show(false)
//+--------------+---------+----------------------------------------+
//|date |accountid|result |
//+--------------+---------+----------------------------------------+
//|20180610114049|id2 |{"key1":"0.0","key2":"1.0","key3":"0.0"}|
//|20180613114049|id3 |{"key1":"0.0","key2":"1.0","key3":"1.0"}|
//|20180610114049|id1 |{"key1":"1.0","key2":"0.0","key3":"0.0"}|
//|20180611114049|id1 |{"key1":"0.0","key2":"0.0","key3":"0.0"}|
//|20180612114049|id2 |{"key1":"1.0","key2":"0.0","key3":"0.0"}|
//+--------------+---------+----------------------------------------+
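The core of the UDF, separated from the JSON parsing, is a small function over a Map, which makes it easy to test in isolation. A minimal sketch, assuming a hypothetical helper named `markKeys`; the null handling is an addition for rows that found no match in the left join:

```scala
// Set the value to "1.0" for every key listed in `keys`; leave the others as-is.
// `keys` may be null when the left join found no match, hence the Option wrapper.
def markKeys(kv: Map[String, String], keys: Seq[String]): Map[String, String] = {
  val present = Option(keys).map(_.toSet).getOrElse(Set.empty[String])
  kv.map { case (k, v) => if (present.contains(k)) (k, "1.0") else (k, v) }
}

val before = Map("key1" -> "0.0", "key2" -> "0.0", "key3" -> "0.0")
val updated = markKeys(before, Seq("key2", "key3"))
val untouched = markKeys(before, null)
```

If the null check lives inside the function like this, the when/otherwise wrapper around the UDF call might no longer be needed, though that is worth verifying on your Spark version.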
You can achieve your requirement with a udf function after you join both DataFrames. Of course, this involves things like converting JSON to a struct, converting the struct back to JSON, using a case class, and more (comments are provided for further explanation).
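import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StringType, StructField, StructType}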
//aliasing the collected key
val gp = df.groupBy("accountid","date").agg(collect_list("key").as("keys"))
//schema for converting json to struct
val schema = StructType(Seq(StructField("key1", StringType, true), StructField("key2", StringType, true), StructField("key3", StringType, true)))
//udf function to update the values of struct where result is a case class
def updateKeysUdf = udf((arr: Seq[String], json: Row) => Seq(json.schema.fieldNames.map(key => if(arr.contains(key)) "1.0" else json.getAs[String](key))).collect{case Array(a,b,c) => result(a,b,c)}.toList(0))
//changing json string to struct using the above schema
df2.withColumn("result", from_json(col("result"), schema))
.as("df2") //aliasing df2 for joining and selecting
.join(gp.as("gp"), col("df2.accountid") === col("gp.accountid"), "left") //aliasing gp dataframe and joining with accountid
.select(col("df2.accountid"), col("df2.date"), to_json(updateKeysUdf(col("gp.keys"), col("df2.result"))).as("result")) //selecting and calling above udf function and finally converting to json string
.show(false)
where result is a case class
case class result(key1: String, key2: String, key3: String)
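which should give you the output below. Note that the join above matches on accountid only, so rows are paired across dates: id2 appears four times and the id1 row for 20180611114049 gets key1 updated. Adding col("df2.date") === col("gp.date") to the join condition avoids this, but unmatched rows then carry a null keys array that the UDF has to handle.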
+---------+--------------+----------------------------------------+
|accountid|date |result |
+---------+--------------+----------------------------------------+
|id3 |20180613114049|{"key1":"0.0","key2":"1.0","key3":"1.0"}|
|id1 |20180610114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
|id1 |20180611114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
|id2 |20180610114049|{"key1":"0.0","key2":"1.0","key3":"0.0"}|
|id2 |20180610114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
|id2 |20180612114049|{"key1":"0.0","key2":"1.0","key3":"0.0"}|
|id2 |20180612114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
+---------+--------------+----------------------------------------+
I hope the answer is helpful