Fetch a partial value from a column holding key-value pairs and assign it to a new column in a Spark DataFrame - Scala

I have a data frame as below
+----+-----------------------------+
|id | att |
+----+-----------------------------+
| 25 | {"State":"abc","City":"xyz"}|
| 26 | null |
| 27 | {"State":"pqr"} |
+----+-----------------------------+
I want a dataframe with columns id and City: the City value if the att column has a City attribute, else null
+----+------+
|id | City |
+----+------+
| 25 | xyz |
| 26 | null |
| 27 | null |
+----+------+
Language : Scala

You can use from_json to parse and convert your JSON data to a Map. Then access the map item using one of:
the getItem method of the Column class
the default accessor, i.e. map("map_key")
the element_at function
import org.apache.spark.sql.functions.{element_at, from_json}
import org.apache.spark.sql.types.{MapType, StringType}
import sparkSession.implicits._
val df = Seq(
(25, """{"State":"abc","City":"xyz"}"""),
(26, null),
(27, """{"State":"pqr"}""")
).toDF("id", "att")
val schema = MapType(StringType, StringType)
df.select($"id", from_json($"att", schema).getItem("City").as("City"))
//or df.select($"id", from_json($"att", schema)("City").as("City"))
//or df.select($"id", element_at(from_json($"att", schema), "City").as("City"))
// +---+----+
// | id|City|
// +---+----+
// | 25| xyz|
// | 26|null|
// | 27|null|
// +---+----+
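If you only need the single City field, get_json_object with a JSON path expression is a lighter-weight alternative to building a full map (a minimal sketch, reusing the df defined above):
import org.apache.spark.sql.functions.get_json_object
// Missing paths and null att values both yield null, matching the expected output
df.select($"id", get_json_object($"att", "$.City").as("City")).show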

Related

Remove unwanted columns from a dataframe in scala

I am pretty new to Scala.
I have a situation where I have a dataframe with multiple columns, some of which have random null values in random places. I need to find any column that has even a single null value and drop it from the dataframe.
#### Input
| Column 1 | Column 2 | Column 3 | Column 4 | Column 5 |
| --------------| --------------| --------------| --------------| --------------|
|(123)-456-7890 | 123-456-7890 |(123)-456-789 | |(123)-456-7890 |
|(123)-456-7890 | 123-4567890 |(123)-456-7890 |(123)-456-7890 | null |
|(123)-456-7890 | 1234567890 |(123)-456-7890 |(123)-456-7890 | null |
#### Output
| Column 1 | Column 2 |
| --------------| --------------|
|(123)-456-7890 | 123-456-7890 |
|(123)-456-7890 | 123-4567890 |
|(123)-456-7890 | 1234567890 |
Please advise.
Thank you.
I would recommend a 2-step approach:
1. Exclude columns that are not nullable from the dataframe
2. Assemble a list of columns that contain at least one null and drop them altogether
Creating a sample dataframe with a mix of nullable/non-nullable columns:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._
import spark.implicits._

val df0 = Seq(
  (Some(1), Some("x"), Some("a"), None),
  (Some(2), Some("y"), None, Some(20.0)),
  (Some(3), Some("z"), None, Some(30.0))
).toDF("c1", "c2", "c3", "c4")

val newSchema = StructType(df0.schema.map { field =>
  if (field.name == "c1") field.copy(name = "c1_notnull", nullable = false) else field
})

// Revised dataframe with non-nullable `c1`
val df = spark.createDataFrame(df0.rdd, newSchema)
Carrying out steps 1 & 2:
val nullableCols = df.schema.collect{ case StructField(name, _, true, _) => name }
// nullableCols: Seq[String] = List(c2, c3, c4)
val colsWithNulls = nullableCols.filter(c => df.where(col(c).isNull).count > 0)
// colsWithNulls: Seq[String] = List(c3, c4)
df.drop(colsWithNulls: _*).show
// +----------+---+
// |c1_notnull| c2|
// +----------+---+
// | 1| x|
// | 2| y|
// | 3| z|
// +----------+---+
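A possible refinement (a sketch, reusing nullableCols from above): instead of running one counting job per column, you can compute all null counts in a single aggregation and then drop the offending columns:
import org.apache.spark.sql.functions.{col, count, when}
// count(when(isNull, 1)) counts only the rows where the column is null
val nullCounts = df
  .select(nullableCols.map(c => count(when(col(c).isNull, 1)).as(c)): _*)
  .first()
val colsToDrop = nullableCols.filter(c => nullCounts.getAs[Long](c) > 0)
df.drop(colsToDrop: _*).show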

How to use a dataframe inside a udf and parse the data in Spark Scala

I am new to Scala and Spark. I have a requirement to create a new dataframe using a udf.
I have 2 dataframes: df1 contains 3 columns, namely company, id, and name.
df2 contains 2 columns, namely company and message.
df2 JSON will be like this
{"company": "Honda", "message": ["19:[\"cost 500k\"],[\"colour blue\"]","20:[\"cost 600k\"],[\"colour white\"]"]}
{"company": "BMW", "message": ["19:[\"cost 1500k\"],[\"colour blue\"]"]}
df2 will be like this:
+-------+--------------------+
|company| message|
+-------+--------------------+
| Honda|[19:["cost 500k"]...|
| BMW|[19:["cost 1500k"...|
+-------+--------------------+
|-- company: string (nullable = true)
|-- message: array (nullable = true)
| |-- element: string (containsNull = true)
df1 will be like this:
+----------+---+-------+
|company | id| name|
+----------+---+-------+
| Honda | 19| city |
| Honda | 20| amaze |
| BMW | 19| x1 |
+----------+---+-------+
I want to create a new dataframe by replacing the id in df2 with the corresponding name from df1, e.g.
["city:[\"cost 500k\"],[\"colour blue\"]","amaze:[\"cost 600k\"],[\"colour white\"]"]
I tried a udf, passing message as Seq[String] along with company, but I was not able to look up the data in df1 from inside it.
I want the output like this:
+-------+----------------------+
|company| message |
+-------+----------------------+
| Honda|[city:["cost 500k"]...|
| BMW|[x1:["cost 1500k"... |
+-------+----------------------+
I tried the following udf, but I was facing errors while selecting the name:
def asdf(categories: Seq[String]): String = {
  var data = ""
  for (w <- categories) {
    if (w != null) {
      var id = w.toString().indexOf(":")
      var namea = df1.select("name").where($"id" === 20).map(_.getString(0)).collect()
      var name = namea(0)
      println(name)
      var ids = w.toString().substring(0, id)
      var li = w.toString().replace(ids, name)
      println(li)
      data = data + li
    }
  }
  data
}
You can't reference a DataFrame such as df1 inside a UDF: the UDF runs on the executors, where the driver-side DataFrame isn't available. Instead, join the two dataframes and apply a UDF to the joined columns. Please check the code below (note that in this answer df1 holds the messages and df2 holds the id/name mapping, the opposite of the naming in the question).
scala> df1.show(false)
+-------+---------------------------------------------------------------------+
|company|message |
+-------+---------------------------------------------------------------------+
|Honda |[19:["cost 500k"],["colour blue"], 20:["cost 600k"],["colour white"]]|
|BMW |[19:["cost 1500k"],["colour blue"]] |
+-------+---------------------------------------------------------------------+
scala> df2.show(false)
+-------+---+-----+
|company|id |name |
+-------+---+-----+
|Honda | 19|city |
|Honda | 20|amaze|
|BMW | 19|x1 |
+-------+---+-----+
val replaceFirst = udf((message: String, id: String, name: String) =>
  if (message.contains(s"""${id}:""")) message.replaceFirst(s"""${id}:""", s"${name}:") else ""
)

val jdf = df1
  .withColumn("message", explode($"message"))
  .join(df2, df1("company") === df2("company"), "inner")
  .withColumn(
    "message_data",
    replaceFirst($"message", trim($"id"), $"name")
  )
  .filter($"message_data" =!= "")
scala> jdf.show(false)
+-------+---------------------------------+-------+---+-----+------------------------------------+
|company|message |company|id |name |message_data |
+-------+---------------------------------+-------+---+-----+------------------------------------+
|Honda |19:["cost 500k"],["colour blue"] |Honda | 19|city |city:["cost 500k"],["colour blue"] |
|Honda |20:["cost 600k"],["colour white"]|Honda | 20|amaze|amaze:["cost 600k"],["colour white"]|
|BMW |19:["cost 1500k"],["colour blue"]|BMW | 19|x1 |x1:["cost 1500k"],["colour blue"] |
+-------+---------------------------------+-------+---+-----+------------------------------------+
df1.join(df2, df1("company") === df2("company"), "inner")
  .select(df1("company"), df1("message"), df2("id"), df2("name"))
  .withColumn("message", explode($"message"))
  .withColumn("message", replaceFirst($"message", trim($"id"), $"name"))
  .filter($"message" =!= "")
  .groupBy($"company")
  .agg(collect_list($"message").cast("string").as("message"))
  .show(false)
+-------+--------------------------------------------------------------------------+
|company|message |
+-------+--------------------------------------------------------------------------+
|Honda |[amaze:["cost 600k"],["colour white"], city:["cost 500k"],["colour blue"]]|
|BMW |[x1:["cost 1500k"],["colour blue"]] |
+-------+--------------------------------------------------------------------------+
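If you prefer to avoid the UDF altogether, the same join can be expressed with built-in functions only (a sketch, reusing this answer's df1/df2 and assuming Spark 2.3+, where regexp_replace accepts Column arguments):
import org.apache.spark.sql.functions.{collect_list, concat, explode, lit, regexp_replace, trim}
df1.join(df2, df1("company") === df2("company"), "inner")
  .select(df1("company"), df1("message"), df2("id"), df2("name"))
  .withColumn("message", explode($"message"))
  // keep only the message rows whose leading id matches, then swap the id for the name
  .filter($"message".startsWith(concat(trim($"id"), lit(":"))))
  .withColumn("message", regexp_replace($"message", concat(lit("^"), trim($"id"), lit(":")), concat($"name", lit(":"))))
  .groupBy($"company")
  .agg(collect_list($"message").as("message"))
  .show(false)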

scala spark dataframe: explode a string column to multiple strings

Any pointers on the below?
Input df (col1 is of type string):
+----------------------------------+
| col1|
+----------------------------------+
|[{a:1,g:2},{b:3,h:4},{c:5,i:6}] |
|[{d:7,j:8},{e:9,k:10},{f:11,l:12}]|
+----------------------------------+
Expected output (again, col1 is of type string):
+-------------+
| col1 |
+-------------+
| {a:1,g:2} |
| {b:3,h:4} |
| {c:5,i:6} |
| {d:7,j:8} |
| {e:9,k:10} |
| {f:11,l:12}|
+-------------+
Thanks!
You can use the Spark SQL explode function with a UDF:
import spark.implicits._
val df = spark.createDataset(Seq("[{a},{b},{c}]","[{d},{e},{f}]")).toDF("col1")
df.show()
+-------------+
| col1|
+-------------+
|[{a},{b},{c}]|
|[{d},{e},{f}]|
+-------------+
import org.apache.spark.sql.functions._
val stringToSeq = udf{s: String => s.drop(1).dropRight(1).split(",")}
df.withColumn("col1", explode(stringToSeq($"col1"))).show()
+----+
|col1|
+----+
| {a}|
| {b}|
| {c}|
| {d}|
| {e}|
| {f}|
+----+
Edit: for your new input data, the custom UDF can evolve as follows:
val stringToSeq = udf{s: String =>
val extractor = "[^{]*:[^}]*".r
extractor.findAllIn(s).map(m => s"{$m}").toSeq
}
New output:
+-----------+
| col1|
+-----------+
| {a:1,g:2}|
| {b:3,h:4}|
| {c:5,i:6}|
| {d:7,j:8}|
| {e:9,k:10}|
|{f:11,l:12}|
+-----------+
Spark provides a quite rich trim function which can be used to remove the leading and trailing characters, [] in your case. As @LeoC already mentioned, the required functionality can be implemented through built-in functions, which will perform much better than a UDF:
import org.apache.spark.sql.functions.{trim, explode, split}
val df = Seq(
("[{a},{b},{c}]"),
("[{d},{e},{f}]")
).toDF("col1")
df.select(
explode(
split(
trim($"col1", "[]"), ","))).show
// +---+
// |col|
// +---+
// |{a}|
// |{b}|
// |{c}|
// |{d}|
// |{e}|
// |{f}|
// +---+
EDIT:
For the new dataset the logic remains the same, except that you need to split on a different character than ,. You can achieve this by using regexp_replace to replace }, with }|, so that you can later split on | instead of ,:
import org.apache.spark.sql.functions.{trim, explode, split, regexp_replace}
val df = Seq(
("[{a:1,g:2},{b:3,h:4},{c:5,i:6}]"),
("[{d:7,j:8},{e:9,k:10},{f:11,l:12}]")
).toDF("col1")
df.select(
explode(
split(
regexp_replace(trim($"col1", "[]"), "},", "}|"), // gives: {a:1,g:2}|{b:3,h:4}|{c:5,i:6}
"\\|")
)
).show(false)
// +-----------+
// |col |
// +-----------+
// |{a:1,g:2} |
// |{b:3,h:4} |
// |{c:5,i:6} |
// |{d:7,j:8} |
// |{e:9,k:10} |
// |{f:11,l:12}|
// +-----------+
Note: with split(..., "\\|") we escape | which is a special regex character.
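Alternatively (a sketch on the same dataframe), you can skip the intermediate regexp_replace by splitting on a comma that is preceded by }, using a regex lookbehind:
df.select(
  explode(
    split(trim($"col1", "[]"), "(?<=\\}),")
  )
).show(false)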
You can do:
val newDF = df.as[String].flatMap(line=>line.replaceAll("\\[", "").replaceAll("\\]", "").split(","))
newDF.show()
Output:
+-----+
|value|
+-----+
| {a}|
| {b}|
| {c}|
| {d}|
| {e}|
| {f}|
+-----+
Just as a note, this process names the output column value; you can easily rename it (if needed) using select, withColumn, etc.
Finally what worked:
import spark.implicits._
val df = spark.createDataset(Seq("[{a:1,g:2},{b:3,h:4},{c:5,i:6}]","[{d:7,j:8},{e:9,k:10},{f:11,l:12}]")).toDF("col1")
df.show()
val toStr = udf((value : String) => value.split("},\\{").map(_.toString))
val addParanthesis = udf((value : String) => ("{" + value + "}"))
val removeParanthesis = udf((value : String) => (value.slice(2,value.length()-2)))
import org.apache.spark.sql.functions._
df
.withColumn("col0", removeParanthesis(col("col1")))
.withColumn("col2", toStr(col("col0")))
.withColumn("col3", explode(col("col2")))
.withColumn("col4", addParanthesis(col("col3")))
.show()
output:
+--------------------+--------------------+--------------------+---------+-----------+
| col1| col0| col2| col3| col4|
+--------------------+--------------------+--------------------+---------+-----------+
|[{a:1,g:2},{b:3,h...|a:1,g:2},{b:3,h:4...|[a:1,g:2, b:3,h:4...| a:1,g:2| {a:1,g:2}|
|[{a:1,g:2},{b:3,h...|a:1,g:2},{b:3,h:4...|[a:1,g:2, b:3,h:4...| b:3,h:4| {b:3,h:4}|
|[{a:1,g:2},{b:3,h...|a:1,g:2},{b:3,h:4...|[a:1,g:2, b:3,h:4...| c:5,i:6| {c:5,i:6}|
|[{d:7,j:8},{e:9,k...|d:7,j:8},{e:9,k:1...|[d:7,j:8, e:9,k:1...| d:7,j:8| {d:7,j:8}|
|[{d:7,j:8},{e:9,k...|d:7,j:8},{e:9,k:1...|[d:7,j:8, e:9,k:1...| e:9,k:10| {e:9,k:10}|
|[{d:7,j:8},{e:9,k...|d:7,j:8},{e:9,k:1...|[d:7,j:8, e:9,k:1...|f:11,l:12|{f:11,l:12}|
+--------------------+--------------------+--------------------+---------+-----------+

Spark scala create multiple columns from array column

Creating multiple columns from an array column.
Dataframe
Car name | details
Toyota | [[year,2000],[price,20000]]
Audi | [[mpg,22]]
Expected dataframe
Car name | year | price | mpg
Toyota | 2000 | 20000 | null
Audi | null | null | 22
You can try this
Let's define the data
scala> val carsDF = Seq(("toyota",Array(("year", 2000), ("price", 100000))), ("Audi", Array(("mpg", 22)))).toDF("car", "details")
carsDF: org.apache.spark.sql.DataFrame = [car: string, details: array<struct<_1:string,_2:int>>]
scala> carsDF.show(false)
+------+-----------------------------+
|car |details |
+------+-----------------------------+
|toyota|[[year,2000], [price,100000]]|
|Audi |[[mpg,22]] |
+------+-----------------------------+
Splitting the data & accessing the values in the data
scala> val weDF = carsDF.withColumn("split", explode($"details")).withColumn("col", $"split"("_1")).withColumn("val", $"split"("_2")).select("car", "col", "val")
scala> weDF.show
+------+-----+------+
| car| col| val|
+------+-----+------+
|toyota| year| 2000|
|toyota|price|100000|
| Audi| mpg| 22|
+------+-----+------+
Define the list of columns that are required
scala> val colNames = Seq("mpg", "price", "year", "dummy")
colNames: Seq[String] = List(mpg, price, year, dummy)
Pivoting on the column names defined above gives the required output.
Passing the column names explicitly to pivot keeps the set of output columns in one place (and avoids an extra job to infer them).
scala> weDF.groupBy("car").pivot("col", colNames).agg(avg($"val")).show
+------+----+--------+------+-----+
| car| mpg| price| year|dummy|
+------+----+--------+------+-----+
|toyota|null|100000.0|2000.0| null|
| Audi|22.0| null| null| null|
+------+----+--------+------+-----+
This seems a more elegant and easier way to achieve the output.
You can do it like this:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}
import spark.implicits._
val df: DataFrame = Seq(
("toyota",Array(("year", 2000), ("price", 100000))),
("toyota",Array(("year", 2001)))
).toDF("car", "details")
+------+-------------------------------+
|car |details |
+------+-------------------------------+
|toyota|[[year, 2000], [price, 100000]]|
|toyota|[[year, 2001]] |
+------+-------------------------------+
val newdf = df
  .withColumn("year",
    when(col("details")(0)("_1") === lit("year"), col("details")(0)("_2"))
      .otherwise(col("details")(1)("_2")))
  .withColumn("price",
    when(col("details")(0)("_1") === lit("price"), col("details")(0)("_2"))
      .otherwise(col("details")(1)("_2")))
  .drop("details")
newdf.show()
+------+----+------+
| car|year| price|
+------+----+------+
|toyota|2000|100000|
|toyota|2001| null|
+------+----+------+
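If you are on Spark 2.4+, another option (a sketch, reusing carsDF from the first answer) is to convert the array of key/value structs into a map with map_from_entries and then pick out the keys you need:
import org.apache.spark.sql.functions.map_from_entries
val mapped = carsDF.withColumn("m", map_from_entries($"details"))
// Missing keys (e.g. mpg for toyota) simply come back as null
mapped.select($"car", $"m"("year").as("year"), $"m"("price").as("price"), $"m"("mpg").as("mpg")).show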

Spark Dataframe: How to add an index column (aka distributed data index)

I read data from a CSV file, but it doesn't have an index column.
I want to add a column numbering the rows from 1 to the number of rows.
What should I do? Thanks (Scala).
With Scala you can use:
import org.apache.spark.sql.functions._
df.withColumn("id", monotonically_increasing_id())
See the Scala docs for details.
With Pyspark you can use:
from pyspark.sql.functions import monotonically_increasing_id
df_index = df.select("*").withColumn("id", monotonically_increasing_id())
monotonically_increasing_id - The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
"I want to add a column from 1 to row's number."
Let say we have the following DF
+--------+-------------+-------+
| userId | productCode | count |
+--------+-------------+-------+
| 25 | 6001 | 2 |
| 11 | 5001 | 8 |
| 23 | 123 | 5 |
+--------+-------------+-------+
To generate the IDs starting from 1:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val w = Window.orderBy("count")
val result = df.withColumn("index", row_number().over(w))
This would add an index column ordered by increasing value of count. Note that a window with no partitionBy moves all rows into a single partition, so this approach does not scale well to very large dataframes.
+--------+-------------+-------+-------+
| userId | productCode | count | index |
+--------+-------------+-------+-------+
| 25 | 6001 | 2 | 1 |
| 23 | 123 | 5 | 2 |
| 11 | 5001 | 8 | 3 |
+--------+-------------+-------+-------+
How to get a sequential id column id[1, 2, 3, 4...n]:
from pyspark.sql.functions import desc, row_number, monotonically_increasing_id
from pyspark.sql.window import Window
df_with_seq_id = df.withColumn('index_column_name', row_number().over(Window.orderBy(monotonically_increasing_id())) - 1)
Note that row_number() starts at 1, so subtract 1 if you want a 0-indexed column.
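The same pattern in Scala (a sketch; the column name is just illustrative):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}
val dfWithSeqId = df.withColumn(
  "index_column_name",
  row_number().over(Window.orderBy(monotonically_increasing_id())) - 1
)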
NOTE: the approaches above don't give a sequential number, but they do give an increasing id.
A simple way to do that while guaranteeing the order of the indexes is zipWithIndex, as shown below.
Sample data.
+-------------------+
| Name|
+-------------------+
| Ram Ghadiyaram|
| Ravichandra|
| ilker|
| nick|
| Naveed|
| Gobinathan SP|
|Sreenivas Venigalla|
| Jackela Kowski|
| Arindam Sengupta|
| Liangpi|
| Omar14|
| anshu kumar|
+-------------------+
package com.example

import org.apache.spark.internal.Logging
import org.apache.spark.sql.SparkSession._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row}

/**
 * DistributedDataIndex: Program to index a DataFrame with zipWithIndex
 */
object DistributedDataIndex extends App with Logging {

  val spark = builder
    .master("local[*]")
    .appName(this.getClass.getName)
    .getOrCreate()

  import spark.implicits._

  val df = spark.sparkContext.parallelize(
    Seq("Ram Ghadiyaram", "Ravichandra", "ilker", "nick",
      "Naveed", "Gobinathan SP", "Sreenivas Venigalla", "Jackela Kowski", "Arindam Sengupta", "Liangpi", "Omar14", "anshu kumar"
    )).toDF("Name")
  df.show

  logInfo("addColumnIndex here")
  // Add index now...
  val df1WithIndex = addColumnIndex(df)
    .withColumn("monotonically_increasing_id", monotonically_increasing_id)
  df1WithIndex.show(false)

  /**
   * Add a column index to each row of the dataframe
   */
  def addColumnIndex(df: DataFrame) = {
    spark.sqlContext.createDataFrame(
      df.rdd.zipWithIndex.map {
        case (row, index) => Row.fromSeq(row.toSeq :+ index)
      },
      // Create schema for index column
      StructType(df.schema.fields :+ StructField("index", LongType, false)))
  }
}
Result :
+-------------------+-----+---------------------------+
|Name |index|monotonically_increasing_id|
+-------------------+-----+---------------------------+
|Ram Ghadiyaram |0 |0 |
|Ravichandra |1 |8589934592 |
|ilker |2 |8589934593 |
|nick |3 |17179869184 |
|Naveed |4 |25769803776 |
|Gobinathan SP |5 |25769803777 |
|Sreenivas Venigalla|6 |34359738368 |
|Jackela Kowski |7 |42949672960 |
|Arindam Sengupta |8 |42949672961 |
|Liangpi |9 |51539607552 |
|Omar14 |10 |60129542144 |
|anshu kumar |11 |60129542145 |
+-------------------+-----+---------------------------+
As Ram said, zipWithIndex is better than monotonically_increasing_id if you need consecutive row numbers. Try this (PySpark environment):
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, LongType
new_schema = StructType(original_dataframe.schema.fields[:] + [StructField("index", LongType(), False)])
zipped_rdd = original_dataframe.rdd.zipWithIndex()
indexed = zipped_rdd.map(lambda ri: row_with_index(*list(ri[0]) + [ri[1]])).toDF(new_schema)
where original_dataframe is the dataframe you have to add index on and row_with_index is the new schema with the column index which you can write as
row_with_index = Row(
"calendar_date"
,"year_week_number"
,"year_period_number"
,"realization"
,"index"
)
Here, calendar_date, year_week_number, year_period_number and realization were the columns of my original dataframe. You can replace the names with the names of your columns. index is the new column name you had to add for the row numbers.
If you require a unique sequence number for each row, I have a slightly different approach, where a static column is added and used to compute the row number over that column. Note that partitioning the window by a constant column also puts all rows into a single partition, so this doesn't scale to very large data either.
val srcData = spark.read.option("header","true").csv("/FileStore/sample.csv")
srcData.show(5)
+--------+--------------------+
| Job| Name|
+--------+--------------------+
|Morpheus| HR Specialist|
| Kayla| Lawyer|
| Trisha| Bus Driver|
| Robert|Elementary School...|
| Ober| Judge|
+--------+--------------------+
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}

val srcDataModf = srcData.withColumn("sl_no", lit("1"))
val windowSpecRowNum = Window.partitionBy("sl_no").orderBy("sl_no")
srcDataModf.withColumn("row_num", row_number.over(windowSpecRowNum)).drop("sl_no").select("row_num", "Name", "Job").show(5)
+-------+--------------------+--------+
|row_num| Name| Job|
+-------+--------------------+--------+
| 1| HR Specialist|Morpheus|
| 2| Lawyer| Kayla|
| 3| Bus Driver| Trisha|
| 4|Elementary School...| Robert|
| 5| Judge| Ober|
+-------+--------------------+--------+
For SparkR:
(Assuming sdf is a Spark DataFrame)
sdf<- withColumn(sdf, "row_id", SparkR:::monotonically_increasing_id())