Splitting a row into multiple rows in spark-shell - scala

I have imported data into a Spark DataFrame in spark-shell. The data looks like this:
Col1 | Col2 | Col3 | Col4
A1 | 11 | B2 | a|b;1;0xFFFFFF
A1 | 12 | B1 | 2
A2 | 12 | B2 | 0xFFF45B
In Col4 the values are of different kinds, and I want to separate them. Suppose "a|b" is of type alphabets, "1" or "2" is of type digits, and "0xFFFFFF" or "0xFFF45B" is of type hexadecimal number.
So the output should be:
Col1 | Col2 | Col3 | alphabets | digits | hexadecimal
A1 | 11 | B2 | a | 1 | 0xFFFFFF
A1 | 11 | B2 | b | 1 | 0xFFFFFF
A1 | 12 | B1 | | 2 |
A2 | 12 | B2 | | | 0xFFF45B
Hope I've made my query clear to you and I am using spark-shell. Thanks in advance.

Edit after getting this answer about how to make backreference in regexp_replace.
You can use regexp_replace with a backreference, then split twice and explode. It is, imo, cleaner than my original solution
val df = List(
("A1" , "11" , "B2" , "a|b;1;0xFFFFFF"),
("A1" , "12" , "B1" , "2"),
("A2" , "12" , "B2" , "0xFFF45B")
).toDF("Col1" , "Col2" , "Col3" , "Col4")
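val regExStr = "^([A-Za-z|]+)?;?(\\d+)?;?(0x.*)?$"
val res = df
  // rewrite Col4 as "alphabets;digits;hexadecimal", then split it into an array
  .withColumn("backrefReplace", split(regexp_replace('Col4, regExStr, "$1;$2;$3"), ";"))
  .select('Col1, 'Col2, 'Col3,
    explode(split('backrefReplace(0), "\\|")).as("letter"),
    'backrefReplace(1).as("digits"),
    'backrefReplace(2).as("hexadecimal"))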
| A1| 11| B2| a| 1| 0xFFFFFF|
| A1| 11| B2| b| 1| 0xFFFFFF|
| A1| 12| B1| | 2| |
| A2| 12| B2| | | 0xFFF45B|
You still need to replace the empty strings with null, though.
Previous Answer (somebody might still prefer it):
Here is a solution that sticks to DataFrames but is also quite messy. You can first use regexp_extract three times (possible to do less with backreference?), and finally split on "|" and explode. Note that you need a coalesce for explode to return everything (you still might want to change the empty strings in letter to null in this solution).
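val res = df
  .withColumn("alphabets", regexp_extract('Col4, "(^[A-Za-z|]+)?", 1))
  .withColumn("digits", regexp_extract('Col4, "^([A-Za-z|]+)?;?(\\d+)?;?(0x.*)?$", 2))
  .withColumn("hexadecimal", regexp_extract('Col4, "^([A-Za-z|]+)?;?(\\d+)?;?(0x.*)?$", 3))
  // coalesce keeps rows whose alphabets value is null, so explode still returns them
  .withColumn("letter", explode(split(coalesce('alphabets, lit("")), "\\|")))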
|Col1|Col2|Col3| Col4|alphabets|digits|hexadecimal|letter|
| A1| 11| B2|a|b;1;0xFFFFFF| a|b| 1| 0xFFFFFF| a|
| A1| 11| B2|a|b;1;0xFFFFFF| a|b| 1| 0xFFFFFF| b|
| A1| 12| B1| 2| null| 2| null| |
| A2| 12| B2| 0xFFF45B| null| null| 0xFFF45B| |
Note: The regexp part could be so much better with backreference, so if somebody knows how to do it, please comment!
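To see what the regular expression in these answers captures, it can be tried outside Spark. The following is a minimal plain-Python sketch (no Spark needed). The function names split_col4 and explode_letters are made up for this illustration, and the character class is written as [A-Za-z|] on the assumption that only ASCII letters and the pipe are meant.

```python
import re

# Three optional groups: letters with pipes, digits, and a hex literal.
# Assumption: [A-Za-z|] is the intended character class.
PATTERN = re.compile(r"^([A-Za-z|]+)?;?(\d+)?;?(0x.*)?$")


def split_col4(value):
    """Return (alphabets, digits, hexadecimal); a missing group becomes ''."""
    match = PATTERN.match(value)
    if match is None:
        return ("", "", "")
    return tuple(group or "" for group in match.groups())


def explode_letters(value):
    """Emit one (letter, digits, hexadecimal) tuple per '|'-separated letter."""
    alphabets, digits, hexadecimal = split_col4(value)
    # "".split("|") yields [""], so rows without letters are kept
    return [(letter, digits, hexadecimal) for letter in alphabets.split("|")]


for col4 in ["a|b;1;0xFFFFFF", "2", "0xFFF45B"]:
    print(col4, "->", explode_letters(col4))
```

Rows without letters come back with an empty string in the first position, which mirrors the empty cells in the Spark output.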

Not sure this is doable while staying 100% with Dataframes, here's a (somewhat messy?) solution using RDDs for the split itself:
import org.apache.spark.sql.functions._
import sqlContext.implicits._
// we switch to RDD to perform the split of Col4 into 3 columns
val rddWithSplitCol4 = input.rdd.map { r =>
  // tag each ';'-separated part with the index of its target column
  val indexToValue = r.getAs[String]("Col4").split(';').map {
    case s if s.startsWith("0x") => 2 -> s
    case s if s.matches("\\d+") => 1 -> s
    case s => 0 -> s
  }
  val newCols: Array[String] = indexToValue.foldLeft(Array.fill[String](3)("")) {
    case (arr, (index, value)) => arr.updated(index, value)
  }
  (r.getAs[String]("Col1"), r.getAs[Int]("Col2"), r.getAs[String]("Col3"), newCols(0), newCols(1), newCols(2))
}
// switch back to Dataframe and explode alphabets column
val result = rddWithSplitCol4
.toDF("Col1", "Col2", "Col3", "alphabets", "digits", "hexadecimal")
.withColumn("alphabets", explode(split(col("alphabets"), "\\|")))
result.show(truncate = false)
// +----+----+----+---------+------+-----------+
// |Col1|Col2|Col3|alphabets|digits|hexadecimal|
// +----+----+----+---------+------+-----------+
// |A1 |11 |B2 |a |1 |0xFFFFFF |
// |A1 |11 |B2 |b |1 |0xFFFFFF |
// |A1 |12 |B1 | |2 | |
// |A2 |12 |B2 | | |0xFFF45B |
// +----+----+----+---------+------+-----------+
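The per-row logic of the RDD version can also be sketched in plain Python. This is only an illustration, under the assumption that the parts of Col4 are always separated by ';'. The name split_col4_parts is a hypothetical helper, not part of any Spark API.

```python
import re


def split_col4_parts(col4):
    """Place each ';'-separated part into (alphabets, digits, hexadecimal)."""
    cols = ["", "", ""]
    for part in col4.split(";"):
        if part.startswith("0x"):
            cols[2] = part          # hexadecimal
        elif re.fullmatch(r"[0-9]+", part):
            cols[1] = part          # digits
        else:
            cols[0] = part          # everything else is treated as alphabets
    return tuple(cols)


for col4 in ["a|b;1;0xFFFFFF", "2", "0xFFF45B"]:
    print(col4, "->", split_col4_parts(col4))
```

Unlike the single regular expression, this approach does not depend on the order of the parts inside Col4.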


How to add a list of data into a dataframe in pyspark?

There is a dataframe that contains two columns, one is key and the other is value, like below:
| key| value |
|a |abcde |
I want to slice the value into multiple values with their positions and generate a new dataframe keyed by the key, like below:
| key| value|
|a |[a, 0] |
|a |[b, 1] |
|a |[c, 2] |
|a |[d, 3] |
|a |[e, 4] |
I have tried to use join() and StructType() but failed. Is there any way to do that? Thanks!
from pyspark.sql import SparkSession,Row
import pyspark.sql.functions as f  # needed for f.expr and f.col below
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("a","abcd",3000)]
df = spark.createDataFrame(data,["Name","Department","Salary"])
df.select( f.expr( "posexplode( filter (split(Department,''), x -> x != '' ) ) as (value, index)") , f.col("Name")).show()
| 0| a| a|
| 1| b| a|
| 2| c| a|
| 3| d| a|
# to explain the above functions more clearly.
f.expr( "..." ) #--> write a SQL expression
"posexplode( [..] )" #for an array, turn its elements into rows with their index
"filter( [..] , x -> x != '' )" #for an array, filter out ''
"split( Department, '' )" #split the column on the empty string (extracting the characters), which can add an empty string to the array that we need to filter out.
Here's the update to fit with your exact request, just a little more manipulation to put it into your required format:
df.select( f.expr( "posexplode( filter (split(Department,''), x -> x != '' ) ) as (myvalue, index)") , f.col("Name"),f.expr( "array(myvalue,index) as value ")).drop("index","myvalue").show()
|Name| value|
| a|[0, a]|
| a|[1, b]|
| a|[2, c]|
| a|[3, d]|
The following snippet transforms your data into your specified format:
import pyspark.sql.functions as F
df = spark.createDataFrame([("a", "abcde",)], ["key", "value"])
df_split = df.withColumn("split", F.array_remove(F.split("value", ""), ""))
df_exploded = df_split.select("key", F.posexplode("split"))
df_array = df_exploded.select("key", F.array("col", "pos").alias("value"))
|key|value| split|
| a|abcde|[a, b, c, d, e]|
| a| 0| a|
| a| 1| b|
| a| 2| c|
| a| 3| d|
| a| 4| e|
|key| value|
| a|[a, 0]|
| a|[b, 1]|
| a|[c, 2]|
| a|[d, 3]|
| a|[e, 4]|
First, the string is split into an array, using the empty string as the split pattern. This can leave an empty string as the last element, which therefore needs to be removed.
Then each element of the split column array is transformed into a row, with its position in the array as column pos.
Lastly, the columns are combined into an array.
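The three steps can be mimicked in plain Python to check the expected shape of the result. This is only a sketch without Spark. posexplode_chars is a made-up name, and the position is converted to a string on the assumption that the resulting array holds a single element type, as the [a, 0] output suggests.

```python
def posexplode_chars(key, value):
    """Return one (key, [character, position]) row per character of value."""
    chars = [ch for ch in value if ch != ""]   # split into characters
    rows = []
    for pos, ch in enumerate(chars):           # explode with position
        rows.append((key, [ch, str(pos)]))     # combine into one array
    return rows


for row in posexplode_chars("a", "abcde"):
    print(row)
```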

How do you split a column such that first half becomes the column name and the second the column value in Scala Spark?

I have a column which has value like
|UserId |col |
|1 |firstname=abc |
|2 |lastname=xyz |
|3 |firstname=pqr;lastname=zzz |
|4 |firstname=aaa;middlename=xxx;lastname=bbb|
and what I want is something like this:
|UserId |firstname | lastname| middlename|
|1 |abc | null | null |
|2 |null | xyz | null |
|3 |pqr | zzz | null |
|4 |aaa | bbb | xxx |
I have already done this:
var new_df = df.withColumn("temp_new", split(col("col"), "\\;")).select(
(0 until numCols).map(i => split(col("temp_new").getItem(i), "=").getItem(1).as(s"col$i")): _*
)
where numCols is the max length of col
but as you may have guessed I get something like this as the output:
|UserId |col0 | col1 | col2 |
|1 |abc | null | null |
|2 |xyz | null | null |
|3 |pqr | zzz | null |
|4 |aaa | xxx | bbb |
NOTE: The above is just an example. There could be more key/value pairs, like firstname=aaa;middlename=xxx;lastname=bbb;age=20;country=India and so on, for around 40-50 column names and values. They are dynamic and I don't know most of them in advance.
I am looking for a way to achieve the result with Scala in Spark.
You could apply groupBy/pivot to generate key columns after converting the key/value-pairs string column into a Map column via SQL function str_to_map, as shown below:
val df = Seq(
(1, "firstname=joe;age=33"),
(2, "lastname=smith;country=usa"),
(3, "firstname=zoe;lastname=cooper;age=44;country=aus"),
(4, "firstname=john;lastname=doe")
).toDF("user_id", "key_values")
df.
  select($"user_id", explode(expr("str_to_map(key_values, ';', '=')"))).
  groupBy("user_id").pivot("key").agg(first("value")).
  orderBy("user_id"). // only for ordered output
  show
|user_id| age|country|firstname|lastname|
| 1| 33| null| joe| null|
| 2|null| usa| null| smith|
| 3| 44| aus| zoe| cooper|
| 4|null| null| john| doe|
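The idea behind str_to_map followed by groupBy/pivot can be illustrated with plain Python dictionaries. This is a sketch without Spark. The helpers str_to_map and pivot below are local stand-ins written for this example, not the Spark functions themselves, and a pair without '=' gets an empty string here rather than null.

```python
def str_to_map(text, pair_delim=";", kv_delim="="):
    """Parse 'k1=v1;k2=v2' into a dict (a later duplicate key overwrites an earlier one)."""
    result = {}
    for pair in text.split(pair_delim):
        key, _, value = pair.partition(kv_delim)
        result[key] = value
    return result


def pivot(rows):
    """rows: list of (user_id, key_values). Returns (header, table rows)."""
    maps = [(user_id, str_to_map(kv)) for user_id, kv in rows]
    keys = sorted({key for _, mapping in maps for key in mapping})
    header = ["user_id"] + keys
    table = [[user_id] + [mapping.get(key) for key in keys] for user_id, mapping in maps]
    return header, table


rows = [
    (1, "firstname=joe;age=33"),
    (2, "lastname=smith;country=usa"),
    (3, "firstname=zoe;lastname=cooper;age=44;country=aus"),
    (4, "firstname=john;lastname=doe"),
]
header, table = pivot(rows)
print(header)
for line in table:
    print(line)
```

Because the column set is the union of all keys seen in the data, no key has to be known in advance.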
Since your data is split by ; and your key/value pairs are split by =, you may consider using str_to_map as follows:
Create a temporary view of your data, e.g. df.createOrReplaceTempView("my_table") (the view name my_table is only an assumed placeholder).
Run the following on your Spark session:
result_df = sparkSession.sql("<insert sql below here>")
WITH split_data AS (
SELECT UserId, str_to_map(col,';','=') full_name
FROM my_table -- assumed view name
)
SELECT UserId,
full_name['firstname'] as firstname,
full_name['lastname'] as lastname,
full_name['middlename'] as middlename
FROM split_data
This solution is proposed in accordance with the expanded requirement described in the other answer's comments section:
Existence of duplicate keys in column key_values
Only duplicate key columns will be aggregated as ArrayType
There are probably other approaches. The solution below uses groupBy/pivot with collect_list, followed by extracting the single element (null if empty) from the non-duplicate key columns.
val df = Seq(
(1, "firstname=joe;age=33;moviegenre=comedy"),
(2, "lastname=smith;country=usa;moviegenre=drama"),
(3, "firstname=zoe;lastname=cooper;age=44;country=aus"),
(4, "firstname=john;lastname=doe;moviegenre=drama;moviegenre=comedy")
).toDF("user_id", "key_values")
val mainCols = df.columns diff Seq("key_values")
val dfNew = df.
  withColumn("kv_arr", split($"key_values", ";")).
  withColumn("kv", explode(expr("transform(kv_arr, kv -> split(kv, '='))"))).
  groupBy("user_id").pivot($"kv"(0)).agg(collect_list($"kv"(1))) // one array column per key
val dupeKeys = Seq("moviegenre") // user-provided
val nonDupeKeys = dfNew.columns diff (mainCols ++ dupeKeys)
dfNew.select(
  mainCols.map(col) ++
  dupeKeys.map(col) ++
  nonDupeKeys.map(k => when(size(col(k)) > 0, col(k)(0)).as(k)): _*
  ).
  orderBy("user_id"). // only for ordered output
  show
|user_id| moviegenre| age|country|firstname|lastname|
| 1| [comedy]| 33| null| joe| null|
| 2| [drama]|null| usa| null| smith|
| 3| []| 44| aus| zoe| cooper|
| 4|[drama, comedy]|null| null| john| doe|
Note that higher-order function transform is used to handle the key/value split, as SQL function str_to_map (used in the original solution) can't handle duplicate keys.
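The handling of duplicate keys can be sketched in plain Python too. This is an illustration without Spark. collect_pairs and to_row are hypothetical helper names, and the list of duplicate keys is assumed to be supplied by the user, as in the answer.

```python
from collections import defaultdict


def collect_pairs(text):
    """Group the values of 'k=v;k=v' by key, keeping duplicates in order."""
    grouped = defaultdict(list)
    for pair in text.split(";"):
        key, _, value = pair.partition("=")
        grouped[key].append(value)
    return grouped


def to_row(text, all_keys, dupe_keys):
    """Keep lists for duplicate keys and a single value (or None) otherwise."""
    grouped = collect_pairs(text)
    row = {}
    for key in all_keys:
        values = grouped.get(key, [])
        if key in dupe_keys:
            row[key] = values
        else:
            row[key] = values[0] if values else None
    return row


all_keys = ["moviegenre", "age", "country", "firstname", "lastname"]
print(to_row("firstname=john;lastname=doe;moviegenre=drama;moviegenre=comedy",
             all_keys, {"moviegenre"}))
```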

Filter one data frame using other data frame in spark scala

I am going to demonstrate my question using following two data frames.
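val datF1= Seq((1,"everlasting",1.39),(1,"game", 2.7),(1,"life",0.69),(1,"learning",0.69),
(2,"living",1.38),(2,"worth",1.38),(2,"life",0.69),(3,"learning",0.69),(3,"never",1.38)).toDF("ID","token","value")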
| ID| token|value|
| 1|everlasting| 1.39|
| 1| game| 2.7|
| 1| life| 0.69|
| 1| learning| 0.69|
| 2| living| 1.38|
| 2| worth| 1.38|
| 2| life| 0.69|
| 3| learning| 0.69|
| 3| never| 1.38|
val dataF2= Seq(("life ",0.71),("learning",0.75)).toDF("token1","val2")
| token1|val2|
| life |0.71|
|learning|0.75|
I want to filter the ID and value of datF1 based on token1 of dataF2. For each word in token1 of dataF2, if an ID has a matching token, the result should hold that token's value from datF1; otherwise it should be zero.
In other words my desired output should be like this
| ID| val|val2|
| 1|0.69|0.69|
| 2| 0.0|0.69|
| 3|0.69| 0.0|
Since learning is not present for ID 2, val equals zero. Similarly, since life is not there for ID 3, val2 equals zero.
I did it manually as follows ,
val newQ61=datF1.filter($"token"==="learning")
val newQ7 =Seq(1,2,3).toDF("ID")
val newQ81 =newQ7.join(newQ61, Seq("ID"), "left")
val tf2=newQ81.select($"ID" ,when(col("value").isNull ,0).otherwise(col("value")) as "val" )
val newQ62=datF1.filter($"token"==="life")
val newQ71 =Seq(1,2,3).toDF("ID")
val newQ82 =newQ71.join(newQ62, Seq("ID"), "left")
val tf3=newQ82.select($"ID" ,when(col("value").isNull ,0).otherwise(col("value")) as "val2" )
val tf4 =tf2.join(tf3 ,Seq("ID"), "left")
| ID| val|val2|
| 1|0.69|0.69|
| 2| 0.0|0.69|
| 3|0.69| 0.0|
Instead of doing this manually, is there a more efficient way to do it, by accessing the entries of one data frame from within the other? In real-life situations there can be more than two words, so handling each word manually would be very hard.
Thank you
When I use a leftsemi join, my output looks like this:
datF1.join(dataF2, $"token"===$"token1", "leftsemi").show()
| ID| token|value|
| 1|learning| 0.69|
| 3|learning| 0.69|
I believe a left outer join and then pivoting on token can work here:
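val ans = df1.join(df2, $"token" === $"token1", "LEFT_OUTER")
  .filter($"token1".isNotNull).groupBy("ID").pivot("token").agg(first("value")) // drop unmatched rows, then pivot on token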
The result (without the null handling):
| ID|learning|life|
| 1| 0.69|0.69|
| 3| 0.69|0.0 |
| 2| 0.0 |0.69|
UPDATE: as the answer by Lamanus suggests, an inner join is possibly a better approach than an outer join + filter.
I think the inner join is enough. Btw, I found the typo in your test case, which makes the result wrong.
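val dataF1= Seq((1,"everlasting",1.39), (1,"game", 2.7), (1,"life",0.69), (1,"learning",0.69),
                (2,"living",1.38), (2,"worth",1.38), (2,"life",0.69),
                (3,"learning",0.69), (3,"never",1.38)).toDF("ID","token","value")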
// +---+-----------+-----+
// | ID| token|value|
// +---+-----------+-----+
// | 1|everlasting| 1.39|
// | 1| game| 2.7|
// | 1| life| 0.69|
// | 1| learning| 0.69|
// | 2| living| 1.38|
// | 2| worth| 1.38|
// | 2| life| 0.69|
// | 3| learning| 0.69|
// | 3| never| 1.38|
// +---+-----------+-----+
val dataF2= Seq(("life",0.71), // "life " -> "life"
                ("learning",0.75)).toDF("token1","val2")
// +--------+----+
// | token1|val2|
// +--------+----+
// | life|0.71|
// |learning|0.75|
// +--------+----+
val resultDF = dataF1.join(dataF2, $"token" === $"token1", "inner")
// +---+--------+-----+--------+----+
// | ID| token|value| token1|val2|
// +---+--------+-----+--------+----+
// | 1| life| 0.69| life|0.71|
// | 1|learning| 0.69|learning|0.75|
// | 2| life| 0.69| life|0.71|
// | 3|learning| 0.69|learning|0.75|
// +---+--------+-----+--------+----+
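Pivoting resultDF on token, for example with resultDF.groupBy("ID").pivot("token").agg(first("value")).na.fill(0).orderBy("ID"), will give you a result such as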
| ID|learning|life|
| 1| 0.69|0.69|
| 2| 0.0|0.69|
| 3| 0.69| 0.0|
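The effect of the inner join followed by a pivot, and of the trailing-space typo, can be reproduced in plain Python. This is a sketch without Spark. inner_join_pivot is a made-up helper, and filling missing cells with 0.0 is an assumption taken from the desired output.

```python
def inner_join_pivot(facts, tokens, fill=0.0):
    """facts: (ID, token, value) tuples; tokens: the lookup words to keep."""
    wanted = sorted(set(tokens))
    table = {}
    for row_id, token, value in facts:
        if token in wanted:  # inner join on token
            row = table.setdefault(row_id, {t: fill for t in wanted})
            row[token] = value  # pivot: one column per token
    return table


facts = [
    (1, "everlasting", 1.39), (1, "game", 2.7), (1, "life", 0.69), (1, "learning", 0.69),
    (2, "living", 1.38), (2, "worth", 1.38), (2, "life", 0.69),
    (3, "learning", 0.69), (3, "never", 1.38),
]
print(inner_join_pivot(facts, ["life", "learning"]))
# With the typo "life " nothing matches life, so ID 2 drops out entirely
print(inner_join_pivot(facts, ["life ", "learning"]))
```

An ID with no matching token at all disappears from the result, which is the difference from starting with a fixed list of IDs and left-joining onto it.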
Seems like you need a "left semi-join". It will filter one dataframe based on another one.
Try using it like
datF1.join(dataF2, $"token"===$"token1", "leftsemi")
You can find a bit more info here - https://medium.com/datamindedbe/little-known-spark-dataframe-join-types-cc524ea39fd5

How to dynamically add columns to a DataFrame?

I am trying to dynamically add columns to a DataFrame from a Seq of String.
Here's an example : the source dataframe is like:
|id | A | B | C | D |
|1 |toto|tata|titi| |
|2 |bla |blo | | |
|3 |b | c | a | d |
I also have a Seq of String which contains name of columns I want to add. If a column already exists in the source DataFrame, it must do some kind of difference like below :
The Seq looks like :
val columns = Seq("A", "B", "F", "G", "H")
The expectation is:
|id | A | B | C | D | F | G | H |
|1 |toto|tata|titi|tutu|null|null|null|
|2 |bla |blo | | |null|null|null|
|3 |b | c | a | d |null|null|null|
What I've done so far is something like this :
val difference = columns diff sourceDF.columns
val finalDF = difference.foldLeft(sourceDF)((df, field) => if (!sourceDF.columns.contains(field)) df.withColumn(field, lit(null)) else df)
.select(columns.head, columns.tail:_*)
But I can't figure out how to do this with Spark efficiently, in a simpler and easier-to-read way ...
Thanks in advance
Here is another way using Seq.diff, single select and map to generate your final column list:
import org.apache.spark.sql.functions.{lit, col}
val newCols = Seq("A", "B", "F", "G", "H")
val updatedCols = newCols.diff(df.columns).map{ c => lit(null).as(c)}
val selectExpr = df.columns.map(col) ++ updatedCols
df.select(selectExpr: _*).show(false)
// +---+----+----+----+----+----+----+----+
// | id| A| B| C| D| F| G| H|
// +---+----+----+----+----+----+----+----+
// | 1|toto|tata|titi|null|null|null|null|
// | 2| bla| blo|null|null|null|null|null|
// | 3| b| c| a| d|null|null|null|
// +---+----+----+----+----+----+----+----+
First we find the diff between newCols and df.columns this gives us: F, G, H. Next we transform each element of the list to lit(null).as(c) via map function. Finally, we concatenate the existing and the new list together to produce selectExpr which is used for the select.
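The same diff-then-append logic can be checked on plain Python lists. This is only a sketch of the column bookkeeping, not Spark code. add_missing_columns is a hypothetical name, and None stands in for the null literal.

```python
def add_missing_columns(columns, rows, requested):
    """Append every requested column that is not present yet, filled with None."""
    missing = [name for name in requested if name not in columns]
    new_columns = list(columns) + missing
    new_rows = [tuple(row) + (None,) * len(missing) for row in rows]
    return new_columns, new_rows


columns = ["id", "A", "B", "C", "D"]
rows = [(1, "toto", "tata", "titi", None), (3, "b", "c", "a", "d")]
new_columns, new_rows = add_missing_columns(columns, rows, ["A", "B", "F", "G", "H"])
print(new_columns)
print(new_rows)
```

Existing columns keep their order and the new ones follow in the order of the requested list.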
Below is an optimised version of your logic.
scala> df.show
| id| A| B| C| D|
| 1|toto|tata|titi|null|
| 2| bla| blo|null|null|
| 3| b| c| a| d|
scala> val Columns = Seq("A", "B", "F", "G", "H")
scala> val newCol = Columns filterNot df.columns.toSeq.contains
scala> val df1 = newCol.foldLeft(df)((df,name) => df.withColumn(name, lit(null)))
scala> df1.show()
| id| A| B| C| D| F| G| H|
| 1|toto|tata|titi|null|null|null|null|
| 2| bla| blo|null|null|null|null|null|
| 3| b| c| a| d|null|null|null|
If you do not want to use foldLeft, you can use a runtime mirror (ToolBox) to compile the chained withColumn calls at runtime, which may be faster. Check the code below.
scala> import scala.reflect.runtime.universe.runtimeMirror
scala> import scala.tools.reflect.ToolBox
scala> import org.apache.spark.sql.DataFrame
scala> df.show
| id| A| B| C| D|
| 1|toto|tata|titi|null|
| 2| bla| blo|null|null|
| 3| b| c| a| d|
scala> def compile[A](code: String): DataFrame => A = {
| val tb = runtimeMirror(getClass.getClassLoader).mkToolBox()
| val tree = tb.parse(
| s"""
| |import org.elasticsearch.spark.sql._
| |import org.apache.spark.sql.DataFrame
| |def wrapper(context:DataFrame): Any = {
| | $code
| |}
| |wrapper _
| """.stripMargin)
| val fun = tb.compile(tree)
| val wrapper = fun()
| wrapper.asInstanceOf[DataFrame => A]
| }
scala> def AddColumns(df:DataFrame,withColumnsString:String):DataFrame = {
| val code =
| s"""
| |import org.apache.spark.sql.functions._
| |import org.elasticsearch.spark.sql._
| |import org.apache.spark.sql.DataFrame
| |var data = context.asInstanceOf[DataFrame]
| |data = data
| """ + withColumnsString +
| """
| |
| |data
| """.stripMargin
| val fun = compile[DataFrame](code)
| val res = fun(df)
| res
| }
scala> val Columns = Seq("A", "B", "F", "G", "H")
scala> val newCol = Columns filterNot df.columns.toSeq.contains
scala> var cols = ""
scala> newCol.foreach{ name =>
| cols = ".withColumn(\""+ name + "\" , lit(null))" + cols
| }
scala> val df1 = AddColumns(df,cols)
scala> df1.show
| id| A| B| C| D| H| G| F|
| 1|toto|tata|titi|null|null|null|null|
| 2| bla| blo|null|null|null|null|null|
| 3| b| c| a| d|null|null|null|

Convert the map RDD into dataframe

I am using Spark 1.6.0. I have an input map RDD of (key, value) pairs and want to convert it to a dataframe.
Input format RDD:
((1, A, ABC), List(pz,A1))
((2, B, PQR), List(az,B1))
((3, C, MNR), List(cs,c1))
Output format:
| c1 | c2 | c3 | c4 | c5 |
| 1 | A | ABC | pz | A1 |
| 2 | B | PQR | az | B1 |
| 3 | C | MNR | cs | C1 |
Can someone help me with this?
I would suggest you go with datasets, as datasets are optimized and typesafe dataframes.
First you need to create a case class:
case class table(c1: Int, c2: String, c3: String, c4:String, c5:String)
then you would just need a map function to parse your data to the case class and call .toDS
rdd.map(x => table(x._1._1, x._1._2, x._1._3, x._2(0), x._2(1))).toDS().show()
you should have following output
| c1| c2| c3| c4| c5|
| 1| A|ABC| pz| A1|
| 2| B|PQR| az| B1|
| 3| C|MNR| cs| c1|
you can use dataframe as well, for that you can use .toDF() instead of .toDS().
val a = Seq(((1,"A","ABC"),List("pz","A1")),((2, "B", "PQR"),
List("az","B1")),((3,"C", "MNR"), List("cs","c1")))
val a1 = sc.parallelize(a);
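val a2 = a1.map(rec=>
(rec._1._1, rec._1._2, rec._1._3, rec._2(0), rec._2(1))).toDF() // flatten key tuple and list into five columns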
| _1| _2| _3| _4| _5|
|  1|  A|ABC| pz| A1|
|  2|  B|PQR| az| B1|
|  3|  C|MNR| cs| c1|
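The flattening step that both answers rely on can be shown on ordinary Python tuples. This is a sketch without Spark. flatten_record is a made-up name, and it assumes that every value list holds exactly two elements, as in the sample input.

```python
def flatten_record(record):
    """Turn ((c1, c2, c3), [c4, c5]) into the flat tuple (c1, c2, c3, c4, c5)."""
    (c1, c2, c3), values = record
    c4, c5 = values  # assumption: exactly two list elements
    return (c1, c2, c3, c4, c5)


records = [
    ((1, "A", "ABC"), ["pz", "A1"]),
    ((2, "B", "PQR"), ["az", "B1"]),
    ((3, "C", "MNR"), ["cs", "c1"]),
]
for record in records:
    print(flatten_record(record))
```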