Spark scala create multiple columns from array column - scala

Creating a multiple columns from array column
Dataframe
Car name | details
Toyota | [[year,2000],[price,20000]]
Audi | [[mpg,22]]
Expected dataframe
Car name | year | price | mpg
Toyota | 2000 | 20000 | null
Audi | null | null | 22

You can try this
Let's define the data
scala> val carsDF = Seq(("toyota",Array(("year", 2000), ("price", 100000))), ("Audi", Array(("mpg", 22)))).toDF("car", "details")
carsDF: org.apache.spark.sql.DataFrame = [car: string, details: array<struct<_1:string,_2:int>>]
scala> carsDF.show(false)
+------+-----------------------------+
|car |details |
+------+-----------------------------+
|toyota|[[year,2000], [price,100000]]|
|Audi |[[mpg,22]] |
+------+-----------------------------+
Splitting the data & accessing the values in the data
scala> carsDF.withColumn("split", explode($"details")).withColumn("col", $"split"("_1")).withColumn("val", $"split"("_2")).select("car", "col", "val").show
+------+-----+------+
| car| col| val|
+------+-----+------+
|toyota| year| 2000|
|toyota|price|100000|
| Audi| mpg| 22|
+------+-----+------+
Define the list of columns that are required
scala> val colNames = Seq("mpg", "price", "year", "dummy")
colNames: Seq[String] = List(mpg, price, year, dummy)
Use pivoting on the above defined column names gives required output.
By giving new column names in the sequence makes it a single point input
scala> weDF.groupBy("car").pivot("col", colNames).agg(avg($"val")).show
+------+----+--------+------+-----+
| car| mpg| price| year|dummy|
+------+----+--------+------+-----+
|toyota|null|100000.0|2000.0| null|
| Audi|22.0| null| null| null|
+------+----+--------+------+-----+
This seems more elegant & easy way to achieve the output

you can do it like that
import org.apache.spark.functions.col
val df: DataFrame = Seq(
("toyota",Array(("year", 2000), ("price", 100000))),
("toyota",Array(("year", 2001)))
).toDF("car", "details")
+------+-------------------------------+
|car |details |
+------+-------------------------------+
|toyota|[[year, 2000], [price, 100000]]|
|toyota|[[year, 2001]] |
+------+-------------------------------+
val newdf = df
.withColumn("year", when(col("details")(0)("_1") === lit("year"), col("details")(0)("_2")).otherwise(col("details")(1)("_2")))
.withColumn("price", when(col("details")(0)("_1") === lit("price"), col("details")(0)("_2")).otherwise(col("details")(1)("_2")))
.drop("details")
newdf.show()
+------+----+------+
| car|year| price|
+------+----+------+
|toyota|2000|100000|
|toyota|2001| null|
+------+----+------+

Related

Spark Scala split column values in a dataframe to appended lists

I have data in a spark dataframe that I need to search for elements by name, append the values to a list, and split searched elements into separate columns of the dataframe.
I am using scala and the below is an example of my current code that works to get the first value but I need to append all values available not just the first.
I'm new to Scala (and python) so any help will be greatly appreciated!
val getNumber: (String => String) = (colString: String) => {
if (colString != null) {
raw"number:(\d+)".r
.findAllIn(colString)
.group(1)
}
else
null
}
val udfGetColumn = udf(getNumber)
val mydf = df.select(cols.....)
.withColumn("var_number", udfGetColumn($"var"))
Example Data:
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| key| var |
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1 |["[number:123456 rate:111970 position:1]","[number:123457 rate:662352 position:2]","[number:123458 rate:890 position:3]","[number:123459 rate:190 position:4]"] | |
|2 |["[number:654321 rate:211971 position:1]","[number:654322 rate:124 position:2]","[number:654323 rate:421 position:3]"] |
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
Desired Result:
+------+------------------------------------------------------------+
| key| var_number | var_rate | var_position |
+------+------------------------------------------------------------+
|1 | 123456 | 111970 | 1 |
|1 | 123457 | 662352 | 2 |
|1 | 123458 | 890 | 3 |
|1 | 123459 | 190 | 4 |
|2 | 654321 | 211971 | 1 |
|2 | 654322 | 124 | 2 |
|2 | 654323 | 421 | 3 |
+------+-----------------+---------------------+--------------------+
You don't need to use UDF here. You can easily transform the array column var by converting each element into a map using str_to_map after removing the square brackets ([]) with regexp_replace function. Finally, explode the transformed array and select the fields:
val df = Seq(
(1, Seq("[number:123456 rate:111970 position:1]", "[number:123457 rate:662352 position:2]", "[number:123458 rate:890 position:3]", "[number:123459 rate:190 position:4]")),
(2, Seq("[number:654321 rate:211971 position:1]", "[number:654322 rate:124 position:2]", "[number:654323 rate:421 position:3]"))
).toDF("key", "var")
val result = df.withColumn(
"var",
explode(expr(raw"transform(var, x -> str_to_map(regexp_replace(x, '[\\[\\]]', ''), ' '))"))
).select(
col("key"),
col("var").getField("number").alias("var_number"),
col("var").getField("rate").alias("var_rate"),
col("var").getField("position").alias("var_position")
)
result.show
//+---+----------+--------+------------+
//|key|var_number|var_rate|var_position|
//+---+----------+--------+------------+
//| 1| 123456| 111970| 1|
//| 1| 123457| 662352| 2|
//| 1| 123458| 890| 3|
//| 1| 123459| 190| 4|
//| 2| 654321| 211971| 1|
//| 2| 654322| 124| 2|
//| 2| 654323| 421| 3|
//+---+----------+--------+------------+
From you comment, it appears the column var is of type string not array. In this case, you can first transform it by removing [] and " characters then split by comma to get an array:
val df = Seq(
(1, """["[number:123456 rate:111970 position:1]", "[number:123457 rate:662352 position:2]", "[number:123458 rate:890 position:3]", "[number:123459 rate:190 position:4]"]"""),
(2, """["[number:654321 rate:211971 position:1]", "[number:654322 rate:124 position:2]", "[number:654323 rate:421 position:3]"]""")
).toDF("key", "var")
val result = df.withColumn(
"var",
split(regexp_replace(col("var"), "[\\[\\]\"]", ""), ",")
).withColumn(
"var",
explode(expr("transform(var, x -> str_to_map(x, ' '))"))
).select(
// select your columns as above...
)

Fetch the partial value from a column having key value pairs and assign it to new column in Spark Dataframe

I have a data frame as below
+----+-----------------------------+
|id | att |
+----+-----------------------------+
| 25 | {"State":"abc","City":"xyz"}|
| 26 | null |
| 27 | {"State":"pqr"} |
+----+-----------------------------+
I want a dataframe with columns id and city if the att column has city attribute else null
+----+------+
|id | City |
+----+------+
| 25 | xyz |
| 26 | null |
| 27 | null |
+----+------+
Language : Scala
You can use from_json to parse and convert your json data to Map. Then access the map item using one of:
getItem method of the Column class
default accessor, i.e map("map_key")
element_at function
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{MapType, StringType}
import sparkSession.implicits._
val df = Seq(
(25, """{"State":"abc","City":"xyz"}"""),
(26, null),
(27, """{"State":"pqr"}""")
).toDF("id", "att")
val schema = MapType(StringType, StringType)
df.select($"id", from_json($"att", schema).getItem("City").as("City"))
//or df.select($"id", from_json($"att", schema)("City").as("City"))
//or df.select($"id", element_at(from_json($"att", schema), "City").as("City"))
// +---+----+
// | id|City|
// +---+----+
// | 25| xyz|
// | 26|null|
// | 27|null|
// +---+----+

Remove unwanted columns from a dataframe in scala

I am pretty new to scala
I have a situation where I have a dataframe with multiple columns, some of these columns having random null values at random places. I need to find any such column having even a single null value and drop it from the dataframe.
#### Input
| Column 1 | Column 2 | Column 3 | Column 4 | Column 5 |
| --------------| --------------| --------------| --------------| --------------|
|(123)-456-7890 | 123-456-7890 |(123)-456-789 | |(123)-456-7890 |
|(123)-456-7890 | 123-4567890 |(123)-456-7890 |(123)-456-7890 | null |
|(123)-456-7890 | 1234567890 |(123)-456-7890 |(123)-456-7890 | null |
#### Output
| Column 1 | Column 2 |
| --------------| --------------|
|(123)-456-7890 | 123-456-7890 |
|(123)-456-7890 | 123-4567890 |
|(123)-456-7890 | 1234567890 |
Please advise.
Thank you.
I would recommend a 2-step approach:
Exclude columns that are not nullable from the dataframe
Assemble a list of columns that contain at least a null and drop them altogether
Creating a sample dataframe with a mix of nullable/non-nullable columns:
import org.apache.spark.sql.Column
import org.apache.spark.sql.types._
val df0 = Seq(
(Some(1), Some("x"), Some("a"), None),
(Some(2), Some("y"), None, Some(20.0)),
(Some(3), Some("z"), None, Some(30.0))
).toDF("c1", "c2", "c3", "c4")
val newSchema = StructType(df0.schema.map{ field =>
if (field.name == "c1") field.copy(name = "c1_notnull", nullable = false) else field
})
// Revised dataframe with non-nullable `c1`
val df = spark.createDataFrame(df0.rdd, newSchema)
Carrying out step 1 & 2:
val nullableCols = df.schema.collect{ case StructField(name, _, true, _) => name }
// nullableCols: Seq[String] = List(c2, c3, c4)
val colsWithNulls = nullableCols.filter(c => df.where(col(c).isNull).count > 0)
// colsWithNulls: Seq[String] = List(c3, c4)
df.drop(colsWithNulls: _*).show
// +----------+---+
// |c1_notnull| c2|
// +----------+---+
// | 1| x|
// | 2| y|
// | 3| z|
// +----------+---+

Spark Scala - Need to iterate over column in dataframe

Got the next dataframe:
+---+----------------+
|id |job_title |
+---+----------------+
|1 |ceo |
|2 |product manager |
|3 |surfer |
+---+----------------+
I want to get a column from a dataframe and to create another column with indication called 'rank':
+---+----------------+-------+
|id |job_title | rank |
+---+----------------+-------+
|1 |ceo |c-level|
|2 |product manager |manager|
|3 |surfer |other |
+---+----------------+-------+
--- UPDATED ---
What I tried to do by now is:
def func (col: column) : Column = {
val cLevel = List("ceo","cfo")
val managerLevel = List("manager","team leader")
when (col.contains(cLevel), "C-level")
.otherwise(when(col.contains(managerLevel),"manager").otherwise("other"))}
Currently I get a this error:
type mismatch;
found : Boolean
required: org.apache.spark.sql.Column
and I think I have also other problems within the code.Sorry but I'm on a starting level with Scala over Spark.
You can use when/otherwise inbuilt function for that case as
import org.apache.spark.sql.functions._
def func = when(col("job_title").contains("cheif") || col("job_title").contains("ceo"), "c-level")
.otherwise(when(col("job_title").contains("manager"), "manager")
.otherwise("other"))
and you can call the function by using withColumn as
df.withColumn("rank", func).show(false)
which should give you
+---+---------------+-------+
|id |job_title |rank |
+---+---------------+-------+
|1 |ceo |c-level|
|2 |product manager|manager|
|3 |surfer |other |
+---+---------------+-------+
I hope the answer is helpful
Updated
I see that you have updated your post with your tryings, and you have tried creating a list of levels and you want to validate against the list. For that case you will have to write a udf function as
val cLevel = List("ceo","cfo")
val managerLevel = List("manager","team leader")
import org.apache.spark.sql.functions._
def rankUdf = udf((jobTitle: String) => jobTitle match {
case x if(cLevel.exists(_.contains(x)) || cLevel.exists(x.contains(_))) => "C-Level"
case x if(managerLevel.exists(_.contains(x)) || managerLevel.exists(x.contains(_))) => "manager"
case _ => "other"
})
df.withColumn("rank", rankUdf(col("job_title"))).show(false)
which should give you your desired output
val df = sc.parallelize(Seq(
(1,"ceo"),
( 2,"product manager"),
(3,"surfer"),
(4,"Vaquar khan")
)).toDF("id", "job_title")
df.show()
//option 2
df.createOrReplaceTempView("user_details")
sqlContext.sql("SELECT job_title, RANK() OVER (ORDER BY id) AS rank FROM user_details").show
val df1 = sc.parallelize(Seq(
("ceo","c-level"),
( "product manager","manager"),
("surfer","other"),
("Vaquar khan","Problem solver")
)).toDF("job_title", "ranks")
df1.show()
df1.createOrReplaceTempView("user_rank")
sqlContext.sql("SELECT user_details.id,user_details.job_title,user_rank.ranks FROM user_rank JOIN user_details ON user_rank.job_title = user_details.job_title order by user_details.id").show
Results :
+---+---------------+
| id| job_title|
+---+---------------+
| 1| ceo|
| 2|product manager|
| 3| surfer|
| 4| Vaquar khan|
+---+---------------+
+---------------+----+
| job_title|rank|
+---------------+----+
| ceo| 1|
|product manager| 2|
| surfer| 3|
| Vaquar khan| 4|
+---------------+----+
+---------------+--------------+
| job_title| ranks|
+---------------+--------------+
| ceo| c-level|
|product manager| manager|
| surfer| other|
| Vaquar khan|Problem solver|
+---------------+--------------+
+---+---------------+--------------+
| id| job_title| ranks|
+---+---------------+--------------+
| 1| ceo| c-level|
| 2|product manager| manager|
| 3| surfer| other|
| 4| Vaquar khan|Problem solver|
+---+---------------+--------------+
df: org.apache.spark.sql.DataFrame = [id: int, job_title: string]
df1: org.apache.spark.sql.DataFrame = [job_title: string, ranks: string]
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

How to pivot dataset?

I use Spark 2.1.
I have some data in a Spark Dataframe, which looks like below:
**ID** **type** **val**
1 t1 v1
1 t11 v11
2 t2 v2
I want to pivot up this data using either spark Scala (preferably) or Spark SQL so that final output should look like below:
**ID** **t1** **t11** **t2**
1 v1 v11
2 v2
You can use groupBy.pivot:
import org.apache.spark.sql.functions.first
df.groupBy("ID").pivot("type").agg(first($"val")).na.fill("").show
+---+---+---+---+
| ID| t1|t11| t2|
+---+---+---+---+
| 1| v1|v11| |
| 2| | | v2|
+---+---+---+---+
Note: depending on the actual data, i.e. how many values there are for each combination of ID and type, you might choose a different aggregation function.
Here's one way to do it:
val df = Seq(
(1, "T1", "v1"),
(1, "T11", "v11"),
(2, "T2", "v2")
).toDF(
"id", "type", "val"
).as[(Int, String, String)]
val df2 = df.groupBy("id").pivot("type").agg(concat_ws(",", collect_list("val")))
df2.show
+---+---+---+---+
| id| T1|T11| T2|
+---+---+---+---+
| 1| v1|v11| |
| 2| | | v2|
+---+---+---+---+
Note that if there are different vals associated with a given type, they will be grouped (comma-delimited) under the type in df2.
This one should work
val seq = Seq((123,"2016-01-01","1"),(123,"2016-01-02","2"),(123,"2016-01-03","3"))
val seq = Seq((1,"t1","v1"),(1,"t11","v11"),(2,"t2","v2"))
val df = seq.toDF("id","type","val")
val pivotedDF = df.groupBy("id").pivot("type").agg(first("val"))
pivotedDF.show
Output:
+---+----+----+----+
| id| t1| t11| t2|
+---+----+----+----+
| 1| v1| v11|null|
| 2|null|null| v2|
+---+----+----+----+