scala spark dataframe: explode a string column to multiple strings

Any pointers on the below?
Input df (here col1 is of type string):
+----------------------------------+
| col1|
+----------------------------------+
|[{a:1,g:2},{b:3,h:4},{c:5,i:6}] |
|[{d:7,j:8},{e:9,k:10},{f:11,l:12}]|
+----------------------------------+
Expected output (again, col1 is of type string):
+-------------+
| col1 |
+-------------+
| {a:1,g:2} |
| {b:3,h:4} |
| {c:5,i:6} |
| {d:7,j:8} |
| {e:9,k:10} |
| {f:11,l:12}|
+-------------+
Thanks!

You can use the Spark SQL explode function with a UDF:
import spark.implicits._
val df = spark.createDataset(Seq("[{a},{b},{c}]","[{d},{e},{f}]")).toDF("col1")
df.show()
+-------------+
| col1|
+-------------+
|[{a},{b},{c}]|
|[{d},{e},{f}]|
+-------------+
import org.apache.spark.sql.functions._
val stringToSeq = udf{s: String => s.drop(1).dropRight(1).split(",")}
df.withColumn("col1", explode(stringToSeq($"col1"))).show()
+----+
|col1|
+----+
| {a}|
| {b}|
| {c}|
| {d}|
| {e}|
| {f}|
+----+
Edit: for your new input data, the custom UDF can evolve as below:
val stringToSeq = udf{s: String =>
  val extractor = "[^{]*:[^}]*".r
  extractor.findAllIn(s).map(m => s"{$m}").toSeq
}
New output:
+-----------+
| col1|
+-----------+
| {a:1,g:2}|
| {b:3,h:4}|
| {c:5,i:6}|
| {d:7,j:8}|
| {e:9,k:10}|
|{f:11,l:12}|
+-----------+

Spark provides a quite rich trim function which can be used to remove the leading and trailing chars, [] in your case. As @LeoC already mentioned, the required functionality can be implemented through the built-in functions, which will perform much better:
import org.apache.spark.sql.functions.{trim, explode, split}
val df = Seq(
  ("[{a},{b},{c}]"),
  ("[{d},{e},{f}]")
).toDF("col1")

df.select(
  explode(
    split(
      trim($"col1", "[]"), ","))).show
// +---+
// |col|
// +---+
// |{a}|
// |{b}|
// |{c}|
// |{d}|
// |{e}|
// |{f}|
// +---+
EDIT:
For the new dataset the logic remains the same, with the difference that you need to split on a character other than ,. You can achieve this by using regexp_replace to replace }, with }|, so that you can later split on | instead of ,:
import org.apache.spark.sql.functions.{trim, explode, split, regexp_replace}
val df = Seq(
  ("[{a:1,g:2},{b:3,h:4},{c:5,i:6}]"),
  ("[{d:7,j:8},{e:9,k:10},{f:11,l:12}]")
).toDF("col1")

df.select(
  explode(
    split(
      regexp_replace(trim($"col1", "[]"), "},", "}|"), // gives: {a:1,g:2}|{b:3,h:4}|{c:5,i:6}
      "\\|")
  )
).show(false)
// +-----------+
// |col |
// +-----------+
// |{a:1,g:2} |
// |{b:3,h:4} |
// |{c:5,i:6} |
// |{d:7,j:8} |
// |{e:9,k:10} |
// |{f:11,l:12}|
// +-----------+
Note: with split(..., "\\|") we escape |, which is a special regex character.
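As a small variation on the same approach (a sketch under the same assumptions as above): split accepts a regular expression, so you can split directly on commas that follow a closing brace using a lookbehind and skip the regexp_replace step:
import org.apache.spark.sql.functions.{trim, explode, split}
// split on "," only when it is immediately preceded by "}", keeping each {...} group intact
df.select(
  explode(
    split(
      trim($"col1", "[]"), "(?<=\\}),"))
).show(false)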

You can do:
val newDF = df.as[String].flatMap(line=>line.replaceAll("\\[", "").replaceAll("\\]", "").split(","))
newDF.show()
Output:
+-----+
|value|
+-----+
| {a}|
| {b}|
| {c}|
| {d}|
| {e}|
| {f}|
+-----+
Just as a note, this process will name the output column value, which you can easily rename (if needed) using select, withColumnRenamed, etc.
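For example, a minimal sketch of such a rename (the target name col1 is just illustrative):
// rename the default "value" column produced by flatMap
val renamedDF = newDF.withColumnRenamed("value", "col1")
// or equivalently: newDF.select(col("value").as("col1")), with org.apache.spark.sql.functions.col imported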

Finally what worked:
import spark.implicits._
val df = spark.createDataset(Seq("[{a:1,g:2},{b:3,h:4},{c:5,i:6}]","[{d:7,j:8},{e:9,k:10},{f:11,l:12}]")).toDF("col1")
df.show()
val toStr = udf((value : String) => value.split("},\\{").map(_.toString))
val addParanthesis = udf((value : String) => ("{" + value + "}"))
val removeParanthesis = udf((value : String) => (value.slice(2,value.length()-2)))
import org.apache.spark.sql.functions._
df
  .withColumn("col0", removeParanthesis(col("col1")))
  .withColumn("col2", toStr(col("col0")))
  .withColumn("col3", explode(col("col2")))
  .withColumn("col4", addParanthesis(col("col3")))
  .show()
output:
+--------------------+--------------------+--------------------+---------+-----------+
| col1| col0| col2| col3| col4|
+--------------------+--------------------+--------------------+---------+-----------+
|[{a:1,g:2},{b:3,h...|a:1,g:2},{b:3,h:4...|[a:1,g:2, b:3,h:4...| a:1,g:2| {a:1,g:2}|
|[{a:1,g:2},{b:3,h...|a:1,g:2},{b:3,h:4...|[a:1,g:2, b:3,h:4...| b:3,h:4| {b:3,h:4}|
|[{a:1,g:2},{b:3,h...|a:1,g:2},{b:3,h:4...|[a:1,g:2, b:3,h:4...| c:5,i:6| {c:5,i:6}|
|[{d:7,j:8},{e:9,k...|d:7,j:8},{e:9,k:1...|[d:7,j:8, e:9,k:1...| d:7,j:8| {d:7,j:8}|
|[{d:7,j:8},{e:9,k...|d:7,j:8},{e:9,k:1...|[d:7,j:8, e:9,k:1...| e:9,k:10| {e:9,k:10}|
|[{d:7,j:8},{e:9,k...|d:7,j:8},{e:9,k:1...|[d:7,j:8, e:9,k:1...|f:11,l:12|{f:11,l:12}|
+--------------------+--------------------+--------------------+---------+-----------+

Related

Fetch the partial value from a column having key value pairs and assign it to new column in Spark Dataframe

I have a data frame as below
+----+-----------------------------+
|id | att |
+----+-----------------------------+
| 25 | {"State":"abc","City":"xyz"}|
| 26 | null |
| 27 | {"State":"pqr"} |
+----+-----------------------------+
I want a dataframe with columns id and City: if the att column has a City attribute, use its value, otherwise null.
+----+------+
|id | City |
+----+------+
| 25 | xyz |
| 26 | null |
| 27 | null |
+----+------+
Language : Scala
You can use from_json to parse and convert your JSON data to a Map. Then access the map item using one of:
the getItem method of the Column class
the default accessor, i.e. map("map_key")
the element_at function
import org.apache.spark.sql.functions.{from_json, element_at}
import org.apache.spark.sql.types.{MapType, StringType}
import sparkSession.implicits._
val df = Seq(
  (25, """{"State":"abc","City":"xyz"}"""),
  (26, null),
  (27, """{"State":"pqr"}""")
).toDF("id", "att")
val schema = MapType(StringType, StringType)
df.select($"id", from_json($"att", schema).getItem("City").as("City"))
//or df.select($"id", from_json($"att", schema)("City").as("City"))
//or df.select($"id", element_at(from_json($"att", schema), "City").as("City"))
// +---+----+
// | id|City|
// +---+----+
// | 25| xyz|
// | 26|null|
// | 27|null|
// +---+----+
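If you only need the single attribute, a minimal alternative sketch uses the built-in get_json_object (it takes a JSON path and returns null when the key is missing or the value is null), so no schema has to be defined:
import org.apache.spark.sql.functions.get_json_object
df.select($"id", get_json_object($"att", "$.City").as("City")).show()
// same result as above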

base64 decoding of a dataframe

I have an encoded dataframe and I managed to decode it using the following code in PySpark. Is there a simple way to get the decoded values as an additional column in the dataframe itself, through Scala/PySpark?
import base64
import numpy as np
df = spark.read.parquet("file_path")
encodedColumn = base64.decodestring(df.take(1)[0].column2)
t1 = np.frombuffer(encodedColumn ,dtype='<f4')
I looked up multiple similar questions, but couldn't get them to work.
Edit:
Got it working with help from a colleague.
import java.util.Base64
import java.nio.{ByteBuffer, ByteOrder}
import org.apache.spark.sql.functions.{udf, col}

// decode a base64 string into an array of 2048 little-endian floats
def binaryToFloatArray(stringValue: String): Array[Float] = {
  val t: Array[Byte] = Base64.getDecoder().decode(stringValue)
  val b = ByteBuffer.wrap(t).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer()
  val copy = new Array[Float](2048)
  b.get(copy)
  copy
}

val binaryToFloatArrayUDF = udf(binaryToFloatArray _)
val finalResultDf = dftest.withColumn("myFloatArray", binaryToFloatArrayUDF(col("_2"))).drop("_2")
You have base64 and unbase64 functions for this.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=streaming#pyspark.sql.functions.base64
You could
from pyspark.sql.functions import unbase64,base64
got = spark.createDataFrame([(1, "Jon"), (2, "Danny"), (3, "Tyrion")], ("id", "name"))
+---+------+
| id| name|
+---+------+
| 1| Jon|
| 2| Danny|
| 3|Tyrion|
+---+------+
encoded_got = got.withColumn('encoded_base64_name', base64(got.name))
+---+------+-------------------+
| id| name|encoded_base64_name|
+---+------+-------------------+
| 1| Jon| Sm9u|
| 2| Danny| RGFubnk=|
| 3|Tyrion| VHlyaW9u|
+---+------+-------------------+
decoded_got = encoded_got.withColumn('decoded_base64', unbase64(encoded_got.encoded_base64_name).cast("string"))
# Need to use cast("string") to convert from binary to string
+---+------+-------------------+--------------+
| id|  name|encoded_base64_name|decoded_base64|
+---+------+-------------------+--------------+
|  1|   Jon|               Sm9u|           Jon|
|  2| Danny|           RGFubnk=|         Danny|
|  3|Tyrion|           VHlyaW9u|        Tyrion|
+---+------+-------------------+--------------+
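The same approach in Scala, as a minimal sketch (the example DataFrame is rebuilt here, since the snippet above is PySpark):
import org.apache.spark.sql.functions.{base64, unbase64, col}
import spark.implicits._
val got = Seq((1, "Jon"), (2, "Danny"), (3, "Tyrion")).toDF("id", "name")
val encoded = got.withColumn("encoded_base64_name", base64(col("name")))
// cast("string") converts the binary result of unbase64 back to a readable string
val decoded = encoded.withColumn("decoded_base64", unbase64(col("encoded_base64_name")).cast("string"))
decoded.show()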

How to transform a string column of a dataframe into a column of Array[String] with Apache Spark and Scala

I have a DataFrame with a column 'title_from'.
This column contains a sentence and I want to transform this column into an Array[String]. I have tried something like this but it does not work:
val newDF = df.select("title_from").map(x => x.split("\\s+"))
How can I achieve this? How can I transform a dataframe of strings into a dataframe of Array[String]? I want every line of newDF to be an array of words from df.
Thanks for any help!
You can use the withColumn function.
import org.apache.spark.sql.functions._
val newDF = df.withColumn("split_title_from", split(col("title_from"), "\\s+"))
.select("split_title_from")
You can try the following to get the list of all authors:
scala> val df = Seq((1,"a1,a2,a3"), (2,"a1,a4,a10")).toDF("id","author")
df: org.apache.spark.sql.DataFrame = [id: int, author: string]
scala> df.show()
+---+---------+
| id| author|
+---+---------+
| 1| a1,a2,a3|
| 2|a1,a4,a10|
+---+---------+
scala> df.select("author").show
+---------+
| author|
+---------+
| a1,a2,a3|
|a1,a4,a10|
+---------+
scala> df.select("author").flatMap( row => { row.get(0).toString().split(",")}).show()
+-----+
|value|
+-----+
| a1|
| a2|
| a3|
| a1|
| a4|
| a10|
+-----+
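Note that the flatMap above drops the id column; if you need to keep it, a minimal sketch using explode and split on the same df:
import org.apache.spark.sql.functions.{explode, split, col}
// one output row per author, with the id preserved
df.select(col("id"), explode(split(col("author"), ",")).as("author")).show()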

Sequential Dynamic filters on the same Spark Dataframe Column in Scala Spark

I have a column named root and need to filter dataframe based on the different values of a root column.
Suppose I have a values in root are parent,child or sub-child and I want to apply these filters dynamically through a variable.
val x = ("parent,child,sub-child").split(",")
x.map(eachvalue => {
  var df1 = df.filter(col("root").contains(eachvalue))
})
But when I do this, it keeps overwriting df1; instead I want to apply all 3 filters and get the result.
May be in future I may extend the list to any number of filter values and the code should work.
Thanks,
Bab
You should apply each subsequent filter to the result of the previous filter, not to df:
val x = ("parent,child,sub-child").split(",")
var df1 = df
x.foreach { eachvalue =>
  df1 = df1.filter(col("root").contains(eachvalue))
}
df1 after the loop will have all filters applied to it.
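The same chaining can also be written without a var, as a minimal sketch using foldLeft over the filter values:
import org.apache.spark.sql.functions.col
// apply each filter to the result of the previous one
val df1 = x.foldLeft(df)((acc, eachvalue) => acc.filter(col("root").contains(eachvalue)))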
Let's see an example with the spark shell. Hope it helps you.
scala> import spark.implicits._
import spark.implicits._
scala> val df0 = spark.sparkContext.parallelize(List(1,2,1,3,3,2,1)).toDF("number")
df0: org.apache.spark.sql.DataFrame = [number: int]
scala> val list = List(1,2,3)
list: List[Int] = List(1, 2, 3)
scala> val dfFiltered = for (number <- list) yield { df0.filter($"number" === number)}
dfFiltered: List[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] = List([number: int], [number: int], [number: int])
scala> dfFiltered(0).show
+------+
|number|
+------+
| 1|
| 1|
| 1|
+------+
scala> dfFiltered(1).show
+------+
|number|
+------+
| 2|
| 2|
+------+
scala> dfFiltered(2).show
+------+
|number|
+------+
| 3|
| 3|
+------+
AFAIK, isin can be used in this case; below is an example.
import spark.implicits._
import org.apache.spark.sql.functions.col

val colorStringArr = "red,yellow,blue".split(",")
val colorDF =
  List(
    "red",
    "yellow",
    "purple"
  ).toDF("color")

// to derive a column using a list
colorDF.withColumn(
  "is_primary_color",
  col("color").isin(colorStringArr: _*)
).show()

println("if you don't want derived column and directly want to filter using a list with isin then .. ")
colorDF.filter(col("color").isin(colorStringArr: _*)).show
Result :
+------+----------------+
| color|is_primary_color|
+------+----------------+
| red| true|
|yellow| true|
|purple| false|
+------+----------------+
if you don't want derived column and directly want to filter using a list with isin then ....
+------+
| color|
+------+
| red|
|yellow|
+------+
One more way, using array_contains and swapping the arguments:
scala> val x = ("parent,child,sub-child").split(",")
x: Array[String] = Array(parent, child, sub-child)
scala> val df = Seq(("parent"),("grand-parent"),("child"),("sub-child"),("cousin")).toDF("root")
df: org.apache.spark.sql.DataFrame = [root: string]
scala> df.show
+------------+
| root|
+------------+
| parent|
|grand-parent|
| child|
| sub-child|
| cousin|
+------------+
scala> df.withColumn("check", array_contains(lit(x),'root)).show
+------------+-----+
| root|check|
+------------+-----+
| parent| true|
|grand-parent|false|
| child| true|
| sub-child| true|
| cousin|false|
+------------+-----+
scala>
Here are my two cents
val filters = List(1,2,3)
val data = List(5,1,2,1,3,3,2,1,4)
val colName = "number"
val df = spark.
  sparkContext.
  parallelize(data).
  toDF(colName).
  filter(r => filters.contains(r.getAs[Int](colName)))
df.show()
which results in
+------+
|number|
+------+
| 1|
| 2|
| 1|
| 3|
| 3|
| 2|
| 1|
+------+

I want to add month to a Date using SqlContext

My date is '01-FEB-2013'. How can I get the result as '01-MAR-2013'?
SELECT DATE_ADD( '2011-01-01', INTERVAL 1 month );
This is possible in MySQL. I want the result using sqlContext in Scala. Is it possible?
You will want to use org.apache.spark.sql.functions.add_months:
def add_months(startDate: Column, numMonths: Int): Column
"Returns the date that is numMonths after startDate."
Here is an example of its usage:
scala> // `now` below is assumed to be a java.util.Date defined earlier in the session
scala> val df = sc.parallelize((0 to 6).map(i => {now.setMonth(i); (i, new java.sql.Date(now.getTime))}).toSeq).toDF("ID", "Dates")
df: org.apache.spark.sql.DataFrame = [ID: int, Dates: date]
scala> df.show
+---+----------+
| ID| Dates|
+---+----------+
| 0|2016-01-21|
| 1|2016-02-21|
| 2|2016-03-21|
| 3|2016-04-21|
| 4|2016-05-21|
| 5|2016-06-21|
| 6|2016-07-21|
+---+----------+
scala> df.withColumn("New Dates", add_months(df("Dates"),1)).show
+---+----------+----------+
| ID| Dates| New Dates|
+---+----------+----------+
| 0|2016-01-21|2016-02-21|
| 1|2016-02-21|2016-03-21|
| 2|2016-03-21|2016-04-21|
| 3|2016-04-21|2016-05-21|
| 4|2016-05-21|2016-06-21|
| 5|2016-06-21|2016-07-21|
| 6|2016-07-21|2016-08-21|
+---+----------+----------+
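Since the question mentions sqlContext, the same operation can also be expressed in SQL, as a sketch (Spark 2.x names shown; on older versions use registerTempTable and sqlContext.sql):
// add_months has been available in Spark SQL since 1.5
df.createOrReplaceTempView("dates")
spark.sql("SELECT ID, Dates, add_months(Dates, 1) AS `New Dates` FROM dates").show()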