Spark:join condition with Array (nullable ones) - scala

I have 2 Dataframes and would like to join them and would like to filter the data, i want to filter the
data where OrgTypeToExclude is matching with respect to each transactionid.
in single word my transactionId is join contiions and OrgTypeToExclude is exclude condition,sharing a simple example here
import org.apache.spark.sql.functions.expr
import spark.implicits._
val jsonstr ="""{
"id": "3b4219f8-0579-4933-ba5e-c0fc532eeb2a",
"Transactions": [
{
"TransactionId": "USAL",
"OrgTypeToExclude": ["A","B"]
},
{
"TransactionId": "USMD",
"OrgTypeToExclude": ["E"]
},
{
"TransactionId": "USGA",
"OrgTypeToExclude": []
}
]
}"""
val df = Seq((1, "USAL","A"),(4, "USAL","C"), (2, "USMD","B"),(5, "USMD","E"), (3, "USGA","C")).toDF("id", "code","Alp")
val json = spark.read.json(Seq(jsonstr).toDS).select("Transactions.TransactionId","Transactions.OrgTypeToExclude")
df.printSchema()
json.printSchema()
df.join(json,$"code"<=> $"TransactionId".cast("string") && !exp("array_contains(OrgTypeToExclude, Alp)") ,"inner" ).show()
--Expecting output
id Code Alp
4 "USAL" "C"
2 "USMD" "B"
3 "USGA" "C"
Thanks,
Manoj.

Transactions is an array type & you are accessing TransactionId & OrgTypeToExclude on that so you will be getting multiple arrays.
Instead of that You just explode root level Transactions array & extract the struct values that is OrgTypeToExclude & TransactionId next steps will be easy.
Please check below code.
scala> val jsonstr ="""{
|
| "id": "3b4219f8-0579-4933-ba5e-c0fc532eeb2a",
| "Transactions": [
| {
| "TransactionId": "USAL",
| "OrgTypeToExclude": ["A","B"]
| },
| {
| "TransactionId": "USMD",
| "OrgTypeToExclude": ["E"]
| },
| {
| "TransactionId": "USGA",
| "OrgTypeToExclude": []
| }
| ]
| }"""
jsonstr: String =
{
"id": "3b4219f8-0579-4933-ba5e-c0fc532eeb2a",
"Transactions": [
{
"TransactionId": "USAL",
"OrgTypeToExclude": ["A","B"]
},
{
"TransactionId": "USMD",
"OrgTypeToExclude": ["E"]
},
{
"TransactionId": "USGA",
"OrgTypeToExclude": []
}
]
}
scala> val df = Seq((1, "USAL","A"),(4, "USAL","C"), (2, "USMD","B"),(5, "USMD","E"), (3, "USGA","C")).toDF("id", "code","Alp")
df: org.apache.spark.sql.DataFrame = [id: int, code: string ... 1 more field]
scala> val json = spark.read.json(Seq(jsonstr).toDS).select(explode($"Transactions").as("Transactions")).select($"Transactions.*")
json: org.apache.spark.sql.DataFrame = [OrgTypeToExclude: array<string>, TransactionId: string]
scala> df.show(false)
+---+----+---+
|id |code|Alp|
+---+----+---+
|1 |USAL|A |
|4 |USAL|C |
|2 |USMD|B |
|5 |USMD|E |
|3 |USGA|C |
+---+----+---+
scala> json.show(false)
+----------------+-------------+
|OrgTypeToExclude|TransactionId|
+----------------+-------------+
|[A, B] |USAL |
|[E] |USMD |
|[] |USGA |
+----------------+-------------+
scala> df.join(jsondf,(df("code") === jsondf("TransactionId") && !array_contains(jsondf("OrgTypeToExclude"),df("Alp"))),"inner").select("id","code","alp").show(false)
+---+----+---+
|id |code|alp|
+---+----+---+
|4 |USAL|C |
|2 |USMD|B |
|3 |USGA|C |
+---+----+---+
scala>

First, it looks like you overlooked the fact that Transactions is also an array, which we can use explode to deal with:
val json = spark.read.json(Seq(jsonstr).toDS)
.select(explode($"Transactions").as("t")) // deal with Transactions array first
.select($"t.TransactionId", $"t.OrgTypeToExclude")
Also, array_contains wants a value rather than a column as its second argument. I'm not aware of a version that supports referencing a column, so we'll make a udf:
val arr_con = udf { (a: Seq[String], v: String) => a.contains(v) }
We can then modify the join condition like so:
df.join(json0, $"code" <=> $"TransactionId" && ! arr_con($"OrgTypeToExclude", $"Alp"), "inner").show()
And the expected result:
scala> df.join(json, $"code" <=> $"TransactionId" && ! arr_con($"OrgTypeToExclude", $"Alp"), "inner").show()
+---+----+---+-------------+----------------+
| id|code|Alp|TransactionId|OrgTypeToExclude|
+---+----+---+-------------+----------------+
| 4|USAL| C| USAL| [A, B]|
| 2|USMD| B| USMD| [E]|
| 3|USGA| C| USGA| []|
+---+----+---+-------------+----------------+

Related

Fixed length parsing in spark scala

I have created the dataframe and the input is like this:
+-----------------------------------+
|value |
+-----------------------------------+
|1 PRE123 21 |
|2 TEST 32 |
|7 XYZ .7 |
+-----------------------------------+
and on the basis on the below metadata information we need to split the above data frame and create a new dataframe, having columns name id,name and class and it start and index loction is given in this json meta data.
{
"columnName": "id",
"start": 1,
"end": 2
},
{
"columnName": "name",
"start": 5,
"end": 10
},
{
"columnName": "class",
"start": 20,
"end": 22
}
OUTPUT :
+---+------+-----+
| id| name|class|
+---+------+-----+
| 1|PRE123| 21|
| 2| TEST| 32|
| 7| XYZ| .7|
+---+------+-----+
For loading the df, I have created the list:
list.+=(loadedDF.col("value").substr(fixedLength.getStart, (fixedLength.getEnd - fixedLength.getStart)).alias(fixedLength.getColumnName))
and from this list, I have created the dataframe
var df: DataFrame = loadedDF.select(list: _*)
Need to know the order better approach for creating the dataframe from the metadata.
As the list created will bring all the data to the driver node.
If I understood correctly you requirements you are trying to extract the columns from a string separated by an arbitrary number of spaces.
Here is one solution with substr function:
val df = Seq(
("1 PRE123 21"),
("2 TEST 32"),
("7 XYZ .7"))
.toDF("value")
val colMetadata = Map("id" -> (1,2), "name" -> (5,10), "class" -> (20,22))
val columns = colMetadata.map { case (cname, meta) =>
val len = meta._2 - meta._1
$"value".substr(meta._1, len).as(cname)
}.toSeq
df.select(columns:_*).show
And a generic solution when you don't have the column boundaries available using the split function:
import org.apache.spark.sql.functions.split
val df = Seq(
("1 PRE123 21"),
("2 TEST 32"),
("7 XYZ .7"))
.toDF("value")
val colNames = Seq("id", "name", "class")
val columns = colNames.zipWithIndex.map { case (cname, idx) =>
split($"value", "\\s+").getItem(idx).as(cname)
}
df.select(columns:_*).show
Output:
+---+------+-----+
| id| name|class|
+---+------+-----+
| 1|PRE123| 21|
| 2| TEST| 32|
| 7| XYZ| .7|
+---+------+-----+
Notice that I used \\s+ as separator. This represents a regex for one or more spaces.

How to split Comma-separated multiple columns into multiple rows?

I have a data-frame with N fields as mentioned below. The number of columns and length of the value will vary.
Input Table:
+--------------+-----------+-----------------------+
|Date |Amount |Status |
+--------------+-----------+-----------------------+
|2019,2018,2017|100,200,300|IN,PRE,POST |
|2018 |73 |IN |
|2018,2017 |56,89 |IN,PRE |
+--------------+-----------+-----------------------+
I have to convert it into the below format with one sequence column.
Expected Output Table:
+-------------+------+---------+
|Date |Amount|Status| Sequence|
+------+------+------+---------+
|2019 |100 |IN | 1 |
|2018 |200 |PRE | 2 |
|2017 |300 |POST | 3 |
|2018 |73 |IN | 1 |
|2018 |56 |IN | 1 |
|2017 |89 |PRE | 2 |
+-------------+------+---------+
I have Tried using explode but explode only take one array at a time.
var df = dataRefined.withColumn("TOT_OVRDUE_TYPE", explode(split($"TOT_OVRDUE_TYPE", "\\"))).toDF
var df1 = df.withColumn("TOT_OD_TYPE_AMT", explode(split($"TOT_OD_TYPE_AMT", "\\"))).show
Does someone know how I can do it? Thank you for your help.
Here is another approach using posexplode for each column and joining all produced dataframes into one:
import org.apache.spark.sql.functions.{posexplode, monotonically_increasing_id, col}
val df = Seq(
(Seq("2019", "2018", "2017"), Seq("100", "200", "300"), Seq("IN", "PRE", "POST")),
(Seq("2018"), Seq("73"), Seq("IN")),
(Seq("2018", "2017"), Seq("56", "89"), Seq("IN", "PRE")))
.toDF("Date","Amount", "Status")
.withColumn("idx", monotonically_increasing_id)
df.columns.filter(_ != "idx").map{
c => df.select($"idx", posexplode(col(c))).withColumnRenamed("col", c)
}
.reduce((ds1, ds2) => ds1.join(ds2, Seq("idx", "pos")))
.select($"Date", $"Amount", $"Status", $"pos".plus(1).as("Sequence"))
.show
Output:
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019| 100| IN| 1|
|2018| 200| PRE| 2|
|2017| 300| POST| 3|
|2018| 73| IN| 1|
|2018| 56| IN| 1|
|2017| 89| PRE| 2|
+----+------+------+--------+
You can achieve this by using Dataframe inbuilt functions arrays_zip,split,posexplode
Explanation:
scala>val df=Seq((("2019,2018,2017"),("100,200,300"),("IN,PRE,POST")),(("2018"),("73"),("IN")),(("2018,2017"),("56,89"),("IN,PRE"))).toDF("date","amount","status")
scala>:paste
df.selectExpr("""posexplode(
arrays_zip(
split(date,","), //split date string with ',' to create array
split(amount,","),
split(status,","))) //zip arrays
as (p,colum) //pos explode on zip arrays will give position and column value
""")
.selectExpr("colum.`0` as Date", //get 0 column as date
"colum.`1` as Amount",
"colum.`2` as Status",
"p+1 as Sequence") //add 1 to the position value
.show()
Result:
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019| 100| IN| 1|
|2018| 200| PRE| 2|
|2017| 300| POST| 3|
|2018| 73| IN| 1|
|2018| 56| IN| 1|
|2017| 89| PRE| 2|
+----+------+------+--------+
Yes, I personally also find explode a bit annoying and in your case I would probably go with a flatMap instead:
import spark.implicits._
import org.apache.spark.sql.Row
val df = spark.sparkContext.parallelize(Seq((Seq(2019,2018,2017), Seq(100,200,300), Seq("IN","PRE","POST")),(Seq(2018), Seq(73), Seq("IN")),(Seq(2018,2017), Seq(56,89), Seq("IN","PRE")))).toDF()
val transformedDF = df
.flatMap{case Row(dates: Seq[Int], amounts: Seq[Int], statuses: Seq[String]) =>
dates.indices.map(index => (dates(index), amounts(index), statuses(index), index+1))}
.toDF("Date", "Amount", "Status", "Sequence")
Output:
df.show
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019| 100| IN| 1|
|2018| 200| PRE| 2|
|2017| 300| POST| 3|
|2018| 73| IN| 1|
|2018| 56| IN| 1|
|2017| 89| PRE| 2|
+----+------+------+--------+
Assuming the number of data elements in each column is the same for each row:
First, I recreated your DataFrame
import org.apache.spark.sql._
import scala.collection.mutable.ListBuffer
val df = Seq(("2019,2018,2017", "100,200,300", "IN,PRE,POST"), ("2018", "73", "IN"),
("2018,2017", "56,89", "IN,PRE")).toDF("Date", "Amount", "Status")
Next, I split the rows and added a sequence value, then converted back to a DF:
val exploded = df.rdd.flatMap(row => {
val buffer = new ListBuffer[(String, String, String, Int)]
val dateSplit = row(0).toString.split("\\,", -1)
val amountSplit = row(1).toString.split("\\,", -1)
val statusSplit = row(2).toString.split("\\,", -1)
val seqSize = dateSplit.size
for(i <- 0 to seqSize-1)
buffer += Tuple4(dateSplit(i), amountSplit(i), statusSplit(i), i+1)
buffer.toList
}).toDF((df.columns:+"Sequence"): _*)
I'm sure there are other ways to do it without first converting the DF to an RDD, but this will still result with a DF with the correct answer.
Let me know if you have any questions.
I took advantage of the transpose to zip all Sequences by position and then did a posexplode. Selects on dataFrames are dynamic to satisfy the condition: The number of columns and length of the value will vary in the question.
import org.apache.spark.sql.functions._
val df = Seq(
("2019,2018,2017", "100,200,300", "IN,PRE,POST"),
("2018", "73", "IN"),
("2018,2017", "56,89", "IN,PRE")
).toDF("Date", "Amount", "Status")
df: org.apache.spark.sql.DataFrame = [Date: string, Amount: string ... 1 more field]
scala> df.show(false)
+--------------+-----------+-----------+
|Date |Amount |Status |
+--------------+-----------+-----------+
|2019,2018,2017|100,200,300|IN,PRE,POST|
|2018 |73 |IN |
|2018,2017 |56,89 |IN,PRE |
+--------------+-----------+-----------+
scala> def transposeSeqOfSeq[S](x:Seq[Seq[S]]): Seq[Seq[S]] = { x.transpose }
transposeSeqOfSeq: [S](x: Seq[Seq[S]])Seq[Seq[S]]
scala> val myUdf = udf { transposeSeqOfSeq[String] _}
myUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(ArrayType(StringType,true),true),Some(List(ArrayType(ArrayType(StringType,true),true))))
scala> val df2 = df.select(df.columns.map(c => split(col(c), ",") as c): _*)
df2: org.apache.spark.sql.DataFrame = [Date: array<string>, Amount: array<string> ... 1 more field]
scala> df2.show(false)
+------------------+---------------+---------------+
|Date |Amount |Status |
+------------------+---------------+---------------+
|[2019, 2018, 2017]|[100, 200, 300]|[IN, PRE, POST]|
|[2018] |[73] |[IN] |
|[2018, 2017] |[56, 89] |[IN, PRE] |
+------------------+---------------+---------------+
scala> val df3 = df2.withColumn("allcols", array(df.columns.map(c => col(c)): _*))
df3: org.apache.spark.sql.DataFrame = [Date: array<string>, Amount: array<string> ... 2 more fields]
scala> df3.show(false)
+------------------+---------------+---------------+------------------------------------------------------+
|Date |Amount |Status |allcols |
+------------------+---------------+---------------+------------------------------------------------------+
|[2019, 2018, 2017]|[100, 200, 300]|[IN, PRE, POST]|[[2019, 2018, 2017], [100, 200, 300], [IN, PRE, POST]]|
|[2018] |[73] |[IN] |[[2018], [73], [IN]] |
|[2018, 2017] |[56, 89] |[IN, PRE] |[[2018, 2017], [56, 89], [IN, PRE]] |
+------------------+---------------+---------------+------------------------------------------------------+
scala> val df4 = df3.withColumn("ab", myUdf($"allcols")).select($"ab", posexplode($"ab"))
df4: org.apache.spark.sql.DataFrame = [ab: array<array<string>>, pos: int ... 1 more field]
scala> df4.show(false)
+------------------------------------------------------+---+-----------------+
|ab |pos|col |
+------------------------------------------------------+---+-----------------+
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|0 |[2019, 100, IN] |
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|1 |[2018, 200, PRE] |
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|2 |[2017, 300, POST]|
|[[2018, 73, IN]] |0 |[2018, 73, IN] |
|[[2018, 56, IN], [2017, 89, PRE]] |0 |[2018, 56, IN] |
|[[2018, 56, IN], [2017, 89, PRE]] |1 |[2017, 89, PRE] |
+------------------------------------------------------+---+-----------------+
scala> val selCols = (0 until df.columns.length).map(i => $"col".getItem(i).as(df.columns(i))) :+ ($"pos"+1).as("Sequence")
selCols: scala.collection.immutable.IndexedSeq[org.apache.spark.sql.Column] = Vector(col[0] AS `Date`, col[1] AS `Amount`, col[2] AS `Status`, (pos + 1) AS `Sequence`)
scala> df4.select(selCols:_*).show(false)
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019|100 |IN |1 |
|2018|200 |PRE |2 |
|2017|300 |POST |3 |
|2018|73 |IN |1 |
|2018|56 |IN |1 |
|2017|89 |PRE |2 |
+----+------+------+--------+
This is why I love spark-core APIs. Just with the help of map and flatMap you can handle many problems. Just pass your df and the instance of SQLContext to below method and it will give the desired result -
def reShapeDf(df:DataFrame, sqlContext: SQLContext): DataFrame ={
val rdd = df.rdd.map(m => (m.getAs[String](0),m.getAs[String](1),m.getAs[String](2)))
val rdd1 = rdd.flatMap(a => a._1.split(",").zip(a._2.split(",")).zip(a._3.split(",")))
val rdd2 = rdd1.map{
case ((a,b),c) => (a,b,c)
}
sqlContext.createDataFrame(rdd2.map(m => Row.fromTuple(m)),df.schema)
}

Add new record before another in Spark

I have a Dataframe:
| ID | TIMESTAMP | VALUE |
1 15:00:01 3
1 17:04:02 2
I want to add a new record with Spark-Scala before with the same time minus 1 second when the value is 2.
The output would be:
| ID | TIMESTAMP | VALUE |
1 15:00:01 3
1 17:04:01 2
1 17:04:02 2
Thanks
You need a .flatMap()
Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
val data = (spark.createDataset(Seq(
(1, "15:00:01", 3),
(1, "17:04:02", 2)
)).toDF("ID", "TIMESTAMP_STR", "VALUE")
.withColumn("TIMESTAMP", $"TIMESTAMP_STR".cast("timestamp").as("TIMESTAMP"))
.drop("TIMESTAMP_STR")
.select("ID", "TIMESTAMP", "VALUE")
)
data.as[(Long, java.sql.Timestamp, Long)].flatMap(r => {
if(r._3 == 2) {
Seq(
(r._1, new java.sql.Timestamp(r._2.getTime() - 1000L), r._3),
(r._1, r._2, r._3)
)
} else {
Some(r._1, r._2, r._3)
}
}).toDF("ID", "TIMESTAMP", "VALUE").show()
Which results in:
+---+-------------------+-----+
| ID| TIMESTAMP|VALUE|
+---+-------------------+-----+
| 1|2019-03-04 15:00:01| 3|
| 1|2019-03-04 17:04:01| 2|
| 1|2019-03-04 17:04:02| 2|
+---+-------------------+-----+
You can introduce a new column array - when value =2 then Array(-1,0) else Array(0), then explode that column and add it with the timestamp as seconds. The below one should work for you. Check this out:
scala> val df = Seq((1,"15:00:01",3),(1,"17:04:02",2)).toDF("id","timestamp","value")
df: org.apache.spark.sql.DataFrame = [id: int, timestamp: string ... 1 more field]
scala> val df2 = df.withColumn("timestamp",'timestamp.cast("timestamp"))
df2: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 1 more field]
scala> df2.show(false)
+---+-------------------+-----+
|id |timestamp |value|
+---+-------------------+-----+
|1 |2019-03-04 15:00:01|3 |
|1 |2019-03-04 17:04:02|2 |
+---+-------------------+-----+
scala> val df3 = df2.withColumn("newc", when($"value"===lit(2),lit(Array(-1,0))).otherwise(lit(Array(0))))
df3: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 2 more fields]
scala> df3.show(false)
+---+-------------------+-----+-------+
|id |timestamp |value|newc |
+---+-------------------+-----+-------+
|1 |2019-03-04 15:00:01|3 |[0] |
|1 |2019-03-04 17:04:02|2 |[-1, 0]|
+---+-------------------+-----+-------+
scala> val df4 = df3.withColumn("c_explode",explode('newc)).withColumn("timestamp2",to_timestamp(unix_timestamp('timestamp)+'c_explode))
df4: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 4 more fields]
scala> df4.select($"id",$"timestamp2",$"value").show(false)
+---+-------------------+-----+
|id |timestamp2 |value|
+---+-------------------+-----+
|1 |2019-03-04 15:00:01|3 |
|1 |2019-03-04 17:04:01|2 |
|1 |2019-03-04 17:04:02|2 |
+---+-------------------+-----+
scala>
If you want the time part alone, then you can do like
scala> df4.withColumn("timestamp",from_unixtime(unix_timestamp('timestamp2),"HH:mm:ss")).select($"id",$"timestamp",$"value").show(false)
+---+---------+-----+
|id |timestamp|value|
+---+---------+-----+
|1 |15:00:01 |3 |
|1 |17:04:01 |2 |
|1 |17:04:02 |2 |
+---+---------+-----+

Scala : Passing elements of a Dataframe from every row and get back the result in separate rows

In My requirment , i come across a situation where i have to pass 2 strings from my dataframe's 2 column and get back the result in string and want to store it back to a dataframe.
Now while passing the value as string, it is always returning the same value. So in all the rows the same value is being populated. (In My case PPPP is being populated in all rows)
Is there a way to pass element (for those 2 columns) from every row and get the result in separate rows.
I am ready to modify my function to accept Dataframe and return Dataframe OR accept arrayOfString and get back ArrayOfString but i dont know how to do that as i am new to programming. Can someone please help me.
Thanks.
def myFunction(key: String , value :String ) : String = {
//Do my functions and get back a string value2 and return this value2 string
value2
}
val DF2 = DF1.select (
DF1("col1")
,DF1("col2")
,DF1("col5") )
.withColumn("anyName", lit(myFunction ( DF1("col3").toString() , DF1("col4").toString() )))
/* DF1:
/*+-----+-----+----------------+------+
/*|col1 |col2 |col3 | col4 | col 5|
/*+-----+-----+----------------+------+
/*|Hello|5 |valueAAA | XXX | 123 |
/*|How |3 |valueCCC | YYY | 111 |
/*|World|5 |valueDDD | ZZZ | 222 |
/*+-----+-----+----------------+------+
/*DF2:
/*+-----+-----+--------------+
/*|col1 |col2 |col5| anyName |
/*+-----+-----+--------------+
/*|Hello|5 |123 | PPPPP |
/*|How |3 |111 | PPPPP |
/*|World|5 |222 | PPPPP |
/*+-----+-----+--------------+
*/
After you define the function, you need to register them as udf(). The udf() function is available in org.apache.spark.sql.functions. check this out
scala> val DF1 = Seq(("Hello",5,"valueAAA","XXX",123),
| ("How",3,"valueCCC","YYY",111),
| ("World",5,"valueDDD","ZZZ",222)
| ).toDF("col1","col2","col3","col4","col5")
DF1: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 3 more fields]
scala> val DF2 = DF1.select ( DF1("col1") ,DF1("col2") ,DF1("col5") )
DF2: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 1 more field]
scala> DF2.show(false)
+-----+----+----+
|col1 |col2|col5|
+-----+----+----+
|Hello|5 |123 |
|How |3 |111 |
|World|5 |222 |
+-----+----+----+
scala> DF1.select("*").show(false)
+-----+----+--------+----+----+
|col1 |col2|col3 |col4|col5|
+-----+----+--------+----+----+
|Hello|5 |valueAAA|XXX |123 |
|How |3 |valueCCC|YYY |111 |
|World|5 |valueDDD|ZZZ |222 |
+-----+----+--------+----+----+
scala> def myConcat(a:String,b:String):String=
| return a + "--" + b
myConcat: (a: String, b: String)String
scala>
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val myConcatUDF = udf(myConcat(_:String,_:String):String)
myConcatUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,StringType,Some(List(StringType, StringType)))
scala> DF1.select ( DF1("col1") ,DF1("col2") ,DF1("col5"), myConcatUDF( DF1("col3"), DF1("col4"))).show()
+-----+----+----+---------------+
| col1|col2|col5|UDF(col3, col4)|
+-----+----+----+---------------+
|Hello| 5| 123| valueAAA--XXX|
| How| 3| 111| valueCCC--YYY|
|World| 5| 222| valueDDD--ZZZ|
+-----+----+----+---------------+
scala>

Spark Scala - Need to iterate over column in dataframe

Got the next dataframe:
+---+----------------+
|id |job_title |
+---+----------------+
|1 |ceo |
|2 |product manager |
|3 |surfer |
+---+----------------+
I want to get a column from a dataframe and to create another column with indication called 'rank':
+---+----------------+-------+
|id |job_title | rank |
+---+----------------+-------+
|1 |ceo |c-level|
|2 |product manager |manager|
|3 |surfer |other |
+---+----------------+-------+
--- UPDATED ---
What I tried to do by now is:
def func (col: column) : Column = {
val cLevel = List("ceo","cfo")
val managerLevel = List("manager","team leader")
when (col.contains(cLevel), "C-level")
.otherwise(when(col.contains(managerLevel),"manager").otherwise("other"))}
Currently I get a this error:
type mismatch;
found : Boolean
required: org.apache.spark.sql.Column
and I think I have also other problems within the code.Sorry but I'm on a starting level with Scala over Spark.
You can use when/otherwise inbuilt function for that case as
import org.apache.spark.sql.functions._
def func = when(col("job_title").contains("cheif") || col("job_title").contains("ceo"), "c-level")
.otherwise(when(col("job_title").contains("manager"), "manager")
.otherwise("other"))
and you can call the function by using withColumn as
df.withColumn("rank", func).show(false)
which should give you
+---+---------------+-------+
|id |job_title |rank |
+---+---------------+-------+
|1 |ceo |c-level|
|2 |product manager|manager|
|3 |surfer |other |
+---+---------------+-------+
I hope the answer is helpful
Updated
I see that you have updated your post with your tryings, and you have tried creating a list of levels and you want to validate against the list. For that case you will have to write a udf function as
val cLevel = List("ceo","cfo")
val managerLevel = List("manager","team leader")
import org.apache.spark.sql.functions._
def rankUdf = udf((jobTitle: String) => jobTitle match {
case x if(cLevel.exists(_.contains(x)) || cLevel.exists(x.contains(_))) => "C-Level"
case x if(managerLevel.exists(_.contains(x)) || managerLevel.exists(x.contains(_))) => "manager"
case _ => "other"
})
df.withColumn("rank", rankUdf(col("job_title"))).show(false)
which should give you your desired output
val df = sc.parallelize(Seq(
(1,"ceo"),
( 2,"product manager"),
(3,"surfer"),
(4,"Vaquar khan")
)).toDF("id", "job_title")
df.show()
//option 2
df.createOrReplaceTempView("user_details")
sqlContext.sql("SELECT job_title, RANK() OVER (ORDER BY id) AS rank FROM user_details").show
val df1 = sc.parallelize(Seq(
("ceo","c-level"),
( "product manager","manager"),
("surfer","other"),
("Vaquar khan","Problem solver")
)).toDF("job_title", "ranks")
df1.show()
df1.createOrReplaceTempView("user_rank")
sqlContext.sql("SELECT user_details.id,user_details.job_title,user_rank.ranks FROM user_rank JOIN user_details ON user_rank.job_title = user_details.job_title order by user_details.id").show
Results :
+---+---------------+
| id| job_title|
+---+---------------+
| 1| ceo|
| 2|product manager|
| 3| surfer|
| 4| Vaquar khan|
+---+---------------+
+---------------+----+
| job_title|rank|
+---------------+----+
| ceo| 1|
|product manager| 2|
| surfer| 3|
| Vaquar khan| 4|
+---------------+----+
+---------------+--------------+
| job_title| ranks|
+---------------+--------------+
| ceo| c-level|
|product manager| manager|
| surfer| other|
| Vaquar khan|Problem solver|
+---------------+--------------+
+---+---------------+--------------+
| id| job_title| ranks|
+---+---------------+--------------+
| 1| ceo| c-level|
| 2|product manager| manager|
| 3| surfer| other|
| 4| Vaquar khan|Problem solver|
+---+---------------+--------------+
df: org.apache.spark.sql.DataFrame = [id: int, job_title: string]
df1: org.apache.spark.sql.DataFrame = [job_title: string, ranks: string]
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html