spark - conditional statements inside select - scala

I am selecting the sum of two columns, a and b, from a DataFrame:
df.select((col("a")+col("b")).as("sum_col"))
Now the user wants sum_col padded to a fixed width of 4 characters.
a and b each have length 2, so the sum can be less than 100 (two digits) or 100 or more (three digits); the padding therefore has to be conditional, adding one or two spaces.
Can anyone tell me how to handle this within the select block, with conditional logic that casts the column, concatenates, and decides whether one or two spaces are added?

Just use the format_string function:
import org.apache.spark.sql.functions.{concat, format_string, lit}
val df = Seq(1, 10, 100).toDF("sum_col")
val result = df.withColumn("sum_col_fmt", format_string("%4d", $"sum_col"))
And proof that it works:
result.withColumn("proof", concat(lit("'"), $"sum_col_fmt", lit("'"))).show
// +-------+-----------+------+
// |sum_col|sum_col_fmt| proof|
// +-------+-----------+------+
// |      1|          1|'   1'|
// |     10|         10|'  10'|
// |    100|        100|' 100'|
// +-------+-----------+------+
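The padding rule behind `%4d` can be checked without Spark, since format_string takes printf-style patterns. Below is a minimal plain-Scala sketch (the names `PadSketch`, `pad4` and `quoted` are made up for illustration) suggesting that one pattern covers both the two-digit and the three-digit case, so no conditional is needed:

```scala
object PadSketch {
  // Right-align an integer in a field of width 4, padding with spaces on the left.
  def pad4(n: Int): String = "%4d".format(n)

  // Wrap the padded value in quotes so the leading spaces are visible when printed.
  def quoted(n: Int): String = "'" + pad4(n) + "'"
}
```

For example, `PadSketch.quoted(10)` gives `'  10'` and `PadSketch.quoted(100)` gives `' 100'`. Values wider than four characters are not truncated; they are returned at their full length.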

UDF with String.format:
val df = List((1, 2)).toDF("a", "b")
val leadingZeroes = (value: Integer) => String.format("%04d", value)
val leadingZeroesUDF = udf(leadingZeroes)
val result = df.withColumn("sum_col", leadingZeroesUDF($"a" + $"b"))
result.show(false)
Output:
+---+---+-------+
|a  |b  |sum_col|
+---+---+-------+
|1  |2  |0003   |
+---+---+-------+

Define a function and turn it into a UDF. I added a dot in front of the format so that the padding is visible in the output. Check this out:
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val df = spark.range(1,20).toDF("col1")
df: org.apache.spark.sql.DataFrame = [col1: bigint]
scala> val df2 = df.withColumn("newcol", 'col1 + 'col1)
df2: org.apache.spark.sql.DataFrame = [col1: bigint, newcol: bigint]
scala> def myPadding(a:String):String =
| return ".%4s".format(a)
myPadding: (a: String)String
scala> val myUDFPad = udf( myPadding(_:String):String)
myUDFPad: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> df2.select(myUDFPad(df2("newcol"))).show
+-----------+
|UDF(newcol)|
+-----------+
|      .   2|
|      .   4|
|      .   6|
|      .   8|
|      .  10|
|      .  12|
|      .  14|
|      .  16|
|      .  18|
|      .  20|
|      .  22|
|      .  24|
|      .  26|
|      .  28|
|      .  30|
|      .  32|
|      .  34|
|      .  36|
|      .  38|
+-----------+

Related

string manipulations using Spark scala

I have the following Spark scala dataframe.
val someDF = Seq(
(1, "bat",1.3222),
(4, "cbat",1.40222),
(3, "horse",1.501212)
).toDF("number", "word","value")
I created a user-defined function (UDF) to compute a new column as follows.
Logic: if word equals bat, then value, else zero.
import org.apache.spark.sql.functions.{col, udf}
val func1 = udf( (s:String ,y:Double) => if(s.contains("bat")) y else 0 )
someDF.select(col("number"), col("word"), func1(col("word"),col("value")).as("cal_var")).show
+------+-----+-------+
|number| word|cal_var|
+------+-----+-------+
| 1| bat| 1.3222|
| 4| cbat|1.40222|
| 3|horse| 0.0|
+------+-----+-------+
Here I used the contains function to check equality, which is why I am getting incorrect output.
My desired output is:
+------+-----+-------+
|number| word|cal_var|
+------+-----+-------+
| 1| bat| 1.3222|
| 4| cbat| 0.0|
| 3|horse| 0.0|
+------+-----+-------+
Can anyone help me figure out the correct string function to use for the equality check?
Thank you
Try to avoid UDFs where a built-in function will do, as UDFs usually perform worse.
Another approach:
val someDF = Seq(
(1, "bat",1.3222),
(4, "cbat",1.40222),
(3, "horse",1.501212)
).toDF("number", "word","value")
import org.apache.spark.sql.functions._
someDF.show
+------+-----+--------+
|number| word| value|
+------+-----+--------+
| 1| bat| 1.3222|
| 4| cbat| 1.40222|
| 3|horse|1.501212|
+------+-----+--------+
someDF.withColumn("value",when('word === "bat",'value).otherwise(0)).show()
+------+-----+------+
|number| word| value|
+------+-----+------+
| 1| bat|1.3222|
| 4| cbat| 0.0|
| 3|horse| 0.0|
+------+-----+------+
The solution is to use the equals method rather than contains. contains checks whether the string bat occurs anywhere in the given string s; it does not test equality. The code is shown below:
scala> someDF.show
+------+-----+--------+
|number| word| value|
+------+-----+--------+
| 1| bat| 1.3222|
| 4| cbat| 1.40222|
| 3|horse|1.501212|
+------+-----+--------+
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val func1 = udf( (s:String ,y:Double) => if(s.equals("bat")) y else 0 )
func1: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,DoubleType,Some(List(StringType, DoubleType)))
scala> someDF.withColumn("col_var", func1(col("word"),col("value"))).drop("value").show
+------+-----+-------+
|number| word|col_var|
+------+-----+-------+
| 1| bat| 1.3222|
| 4| cbat| 0.0|
| 3|horse| 0.0|
+------+-----+-------+
Let me know if it helps!!
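The difference between the two checks can be seen without Spark, because the body of the UDF is ordinary Scala. This is a small sketch only; `EqualityCheck`, `byContains` and `byEquals` are illustrative names, not part of the original answers:

```scala
object EqualityCheck {
  // Substring test: matches "bat" but also "cbat", which is the bug in the question.
  def byContains(word: String, value: Double): Double =
    if (word.contains("bat")) value else 0.0

  // Exact match: only "bat" keeps its value.
  def byEquals(word: String, value: Double): Double =
    if (word == "bat") value else 0.0
}
```

In Scala, `==` on strings delegates to `equals` and also tolerates a null on the left-hand side, so it is a common alternative to calling `equals` directly.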

How to dynamically add columns to a DataFrame?

I am trying to dynamically add columns to a DataFrame from a Seq of String.
Here's an example : the source dataframe is like:
+---+----+----+----+---+
|id |A   |B   |C   |D  |
+---+----+----+----+---+
|1  |toto|tata|titi|   |
|2  |bla |blo |    |   |
|3  |b   |c   |a   |d  |
+---+----+----+----+---+
I also have a Seq of String that contains the names of the columns I want to add. If a column already exists in the source DataFrame, it must not be added again; only the difference is added, as below:
The Seq looks like:
val columns = Seq("A", "B", "F", "G", "H")
The expectation is:
+---+----+----+----+----+----+----+----+
|id |A   |B   |C   |D   |F   |G   |H   |
+---+----+----+----+----+----+----+----+
|1  |toto|tata|titi|tutu|null|null|null|
|2  |bla |blo |    |    |null|null|null|
|3  |b   |c   |a   |d   |null|null|null|
+---+----+----+----+----+----+----+----+
What I've done so far is something like this:
val difference = columns diff sourceDF.columns
val finalDF = difference.foldLeft(sourceDF)((df, field) => if (!sourceDF.columns.contains(field)) df.withColumn(field, lit(null)) else df)
.select(columns.head, columns.tail:_*)
But I can't figure out how to do this efficiently in Spark, in a simpler and more readable way.
Thanks in advance
Here is another way using Seq.diff, single select and map to generate your final column list:
import org.apache.spark.sql.functions.{lit, col}
val newCols = Seq("A", "B", "F", "G", "H")
val updatedCols = newCols.diff(df.columns).map{ c => lit(null).as(c)}
val selectExpr = df.columns.map(col) ++ updatedCols
df.select(selectExpr:_*).show
// +---+----+----+----+----+----+----+----+
// | id| A| B| C| D| F| G| H|
// +---+----+----+----+----+----+----+----+
// | 1|toto|tata|titi|null|null|null|null|
// | 2| bla| blo|null|null|null|null|null|
// | 3| b| c| a| d|null|null|null|
// +---+----+----+----+----+----+----+----+
First we find the difference between newCols and df.columns, which gives us F, G, H. Next we transform each element of that list into lit(null).as(c) via the map function. Finally, we concatenate the existing columns and the new list to produce selectExpr, which is passed to select.
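The column-name arithmetic in this answer is plain Scala and can be tried on its own. The sketch below uses hypothetical names (`ColumnPlan`, `missing`, `finalOrder`) and works on names only; in Spark each missing name would still have to be turned into `lit(null).as(name)`:

```scala
object ColumnPlan {
  // Names from `requested` that are not already present in `existing`, in request order.
  def missing(existing: Seq[String], requested: Seq[String]): Seq[String] =
    requested.diff(existing)

  // Final column order: all existing columns first, followed by the missing ones.
  def finalOrder(existing: Seq[String], requested: Seq[String]): Seq[String] =
    existing ++ missing(existing, requested)
}
```

One thing to keep in mind: `diff` is a multiset difference, so a name listed twice in `requested` and once in `existing` survives once. Calling `.distinct` on the request list first avoids that if duplicates are possible.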
Below is an optimised version of your logic.
scala> df.show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi|null|
| 2| bla| blo|null|null|
| 3| b| c| a| d|
+---+----+----+----+----+
scala> val Columns = Seq("A", "B", "F", "G", "H")
scala> val newCol = Columns filterNot df.columns.toSeq.contains
scala> val df1 = newCol.foldLeft(df)((df,name) => df.withColumn(name, lit(null)))
scala> df1.show()
+---+----+----+----+----+----+----+----+
| id| A| B| C| D| F| G| H|
+---+----+----+----+----+----+----+----+
| 1|toto|tata|titi|null|null|null|null|
| 2| bla| blo|null|null|null|null|null|
| 3| b| c| a| d|null|null|null|
+---+----+----+----+----+----+----+----+
If you do not want to use foldLeft, you can build the withColumn chain as a string and compile it at runtime with a runtime mirror (ToolBox), which may be faster. Check the code below.
scala> import scala.reflect.runtime.universe.runtimeMirror
scala> import scala.tools.reflect.ToolBox
scala> import org.apache.spark.sql.DataFrame
scala> df.show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi|null|
| 2| bla| blo|null|null|
| 3| b| c| a| d|
+---+----+----+----+----+
scala> def compile[A](code: String): DataFrame => A = {
| val tb = runtimeMirror(getClass.getClassLoader).mkToolBox()
| val tree = tb.parse(
| s"""
| |import org.elasticsearch.spark.sql._
| |import org.apache.spark.sql.DataFrame
| |def wrapper(context:DataFrame): Any = {
| | $code
| |}
| |wrapper _
| """.stripMargin)
|
| val fun = tb.compile(tree)
| val wrapper = fun()
| wrapper.asInstanceOf[DataFrame => A]
| }
scala> def AddColumns(df:DataFrame,withColumnsString:String):DataFrame = {
| val code =
| s"""
| |import org.apache.spark.sql.functions._
| |import org.elasticsearch.spark.sql._
| |import org.apache.spark.sql.DataFrame
| |var data = context.asInstanceOf[DataFrame]
| |data = data
| """ + withColumnsString +
| """
| |
| |data
| """.stripMargin
|
| val fun = compile[DataFrame](code)
| val res = fun(df)
| res
| }
scala> val Columns = Seq("A", "B", "F", "G", "H")
scala> val newCol = Columns filterNot df.columns.toSeq.contains
scala> var cols = ""
scala> newCol.foreach{ name =>
| cols = ".withColumn(\""+ name + "\" , lit(null))" + cols
| }
scala> val df1 = AddColumns(df,cols)
scala> df1.show
+---+----+----+----+----+----+----+----+
| id| A| B| C| D| H| G| F|
+---+----+----+----+----+----+----+----+
| 1|toto|tata|titi|null|null|null|null|
| 2| bla| blo|null|null|null|null|null|
| 3| b| c| a| d|null|null|null|
+---+----+----+----+----+----+----+----+

Building up a dataframe

I am trying to build a DataFrame of 10k records and then save it to a parquet file on Spark 2.4.3 standalone.
The following works at a small scale, up to 1000 records, but takes forever when ramping up to 10k:
scala> import spark.implicits._
import spark.implicits._
scala> var someDF = Seq((0, "item0")).toDF("x", "y")
someDF: org.apache.spark.sql.DataFrame = [x: int, y: string]
scala> for ( i <- 1 to 1000 ) {someDF = someDF.union(Seq((i,"item"+i)).toDF("x", "y")) }
scala> someDF.show
+---+------+
| x| y|
+---+------+
| 0| item0|
| 1| item1|
| 2| item2|
| 3| item3|
| 4| item4|
| 5| item5|
| 6| item6|
| 7| item7|
| 8| item8|
| 9| item9|
| 10|item10|
| 11|item11|
| 12|item12|
| 13|item13|
| 14|item14|
| 15|item15|
| 16|item16|
| 17|item17|
| 18|item18|
| 19|item19|
+---+------+
only showing top 20 rows
scala> var someDF = Seq((0, "item0")).toDF("x", "y")
someDF: org.apache.spark.sql.DataFrame = [x: int, y: string]
scala> someDF.show
+---+-----+
| x| y|
+---+-----+
| 0|item0|
+---+-----+
scala> for ( i <- 1 to 10000 ) {someDF = someDF.union(Seq((i,"item"+i)).toDF("x", "y")) }
I just want to save someDF to a parquet file and then load it into Impala.
// declare the range that you want
scala> val r = 1 to 10000
// create a DataFrame from the range
scala> val df = sc.parallelize(r).toDF("x")
// add the new column "y"
scala> import org.apache.spark.sql.functions.{col, concat, lit}
scala> val final_df = df.select(col("x"),concat(lit("item"),col("x")).alias("y"))
scala> final_df.show
+---+------+
| x| y|
+---+------+
| 1| item1|
| 2| item2|
| 3| item3|
| 4| item4|
| 5| item5|
| 6| item6|
| 7| item7|
| 8| item8|
| 9| item9|
| 10|item10|
| 11|item11|
| 12|item12|
| 13|item13|
| 14|item14|
| 15|item15|
| 16|item16|
| 17|item17|
| 18|item18|
| 19|item19|
| 20|item20|
+---+------+
scala> final_df.count
res17: Long = 10000
//Write final_df to path in parquet format
scala> final_df.write.format("parquet").save(<path to write>)
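The loop in the question creates a one-row DataFrame per record and chains thousands of unions, which is most likely what makes it slow. An alternative in the same spirit as the answer above is to build all rows as a local collection first and convert once. The sketch below covers only the local part; `RowBuilder` and `rows` are illustrative names, and the conversion itself (for example `RowBuilder.rows(0, 10000).toDF("x", "y")` in spark-shell with `spark.implicits._` in scope) is assumed rather than shown:

```scala
object RowBuilder {
  // Build every (x, y) pair in a single pass; no per-row DataFrame and no union.
  def rows(from: Int, to: Int): Seq[(Int, String)] =
    (from to to).map(i => (i, s"item$i"))
}
```

This keeps the item0 row from the question, since the range can start at 0.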

how to apply joins in spark scala when we have multiple values in the join column

I have data in two text files as
file 1:(patient id,diagnosis code)
+----------+-------+
|patient_id|diag_cd|
+----------+-------+
| 1| y,t,k|
| 2| u,t,p|
| 3| u,t,k|
| 4| f,o,k|
| 5| e,o,u|
+----------+-------+
file2(diagnosis code,diagnosis description) Time T1
+-------+---------+
|diag_cd|diag_desc|
+-------+---------+
| y| yen|
| t| ten|
| k| ken|
| u| uen|
| p| pen|
| f| fen|
| o| oen|
| e| een|
+-------+---------+
The data in file 2 is not fixed and keeps changing: at one point in time diagnosis code y can have the description yen, and at another point it can have the description ten. For example:
file2 at Time T2
+-------+---------+
|diag_cd|diag_desc|
+-------+---------+
| y| ten|
| t| yen|
| k| uen|
| u| oen|
| p| ken|
| f| pen|
| o| een|
| e| fen|
+-------+---------+
I have to read the data from these two files in Spark and want only the ids of patients diagnosed with uen.
This can be done with either Spark SQL or Scala.
I tried to read file1 in spark-shell. The two columns in file1 are pipe delimited.
scala> val tes1 = sc.textFile("file1.txt").map(x => x.split('|')).filter(y => y(1).contains("u")).collect
tes1: Array[Array[String]] = Array(Array(2, u,t,p), Array(3, u,t,k), Array(5, e,o,u))
But because the diagnosis code for a given description is not constant in file2, I will have to use a join. I don't know how to apply a join when the diag_cd column in file1 holds multiple values.
Any help would be appreciated.
Please find the answer below
//Read the file1 into a dataframe
val file1DF = spark.read.format("csv").option("delimiter","|")
.option("header",true)
.load("file1PATH")
//Read the file2 into a dataframe
val file2DF = spark.read.format("csv").option("delimiter","|")
.option("header",true)
.load("file2path")
//get the patient id dataframe for the diag_desc as uen
file1DF.join(file2DF,file1DF.col("diag_cd").contains(file2DF.col("diag_cd")),"inner")
.filter(file2DF.col("diag_desc") === "uen")
.select("patient_id").show
Convert the table t1 from format1 to format2 using explode method.
Format1:
file 1:(patient id,diagnosis code)
+----------+-------+
|patient_id|diag_cd|
+----------+-------+
| 1| y,t,k|
| 2| u,t,p|
+----------+-------+
to
file 1:(patient id,diagnosis code)
+----------+-------+
|patient_id|diag_cd|
+----------+-------+
| 1| y |
| 1| t |
| 1| k |
| 2| u |
| 2| t |
| 2| p |
+----------+-------+
Code:
scala> val data = Seq("1|y,t,k", "2|u,t,p")
data: Seq[String] = List(1|y,t,k, 2|u,t,p)
scala> val df1 = sc.parallelize(data).toDF("c1").withColumn("patient_id", split(col("c1"), "\\|").getItem(0)).withColumn("col2", split(col("c1"), "\\|").getItem(1)).select("patient_id", "col2").withColumn("diag_cd", explode(split($"col2", "\\,"))).select("patient_id", "diag_cd")
df1: org.apache.spark.sql.DataFrame = [patient_id: string, diag_cd: string]
scala> df1.collect()
res4: Array[org.apache.spark.sql.Row] = Array([1,y], [1,t], [1,k], [2,u], [2,t], [2,p])
I have created dummy data here for illustration. Note how we are exploding the particular column above using
scala> val df1 = sc.parallelize(data).toDF("c1").
| withColumn("patient_id", split(col("c1"), "\\|").getItem(0)).
| withColumn("col2", split(col("c1"), "\\|").getItem(1)).
| select("patient_id", "col2").
| withColumn("diag_cd", explode(split($"col2", "\\,"))).
| select("patient_id", "diag_cd")
df1: org.apache.spark.sql.DataFrame = [patient_id: string, diag_cd: string]
Now you can create df2 for file 2 using -
scala> val df2 = sc.textFile("file2.txt").map(x => (x.split(",")(0),x.split(",")(1))).toDF("diag_cd", "diag_desc")
df2: org.apache.spark.sql.DataFrame = [diag_cd: string, diag_desc: string]
Join df1 with df2 and filter as per the requirement.
df1.join(df2, df1.col("diag_cd") === df2.col("diag_cd")).filter(df2.col("diag_desc") === "uen").select(df1.col("patient_id")).collect()
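The explode-then-join idea can be sketched on plain Scala collections to check the expected result for the sample data. This is an illustration only, not the Spark code; `DiagJoin`, `explode` and `patientsWith` are hypothetical names, and file2 is modelled as a `Map` from code to its current description:

```scala
object DiagJoin {
  // "Explode": one (patientId, code) pair per comma-separated diagnosis code.
  def explode(patients: Seq[(String, String)]): Seq[(String, String)] =
    patients.flatMap { case (id, codes) => codes.split(",").map(c => (id, c.trim)) }

  // Ids of patients having at least one code whose current description equals `desc`.
  def patientsWith(patients: Seq[(String, String)],
                   descriptions: Map[String, String],
                   desc: String): Seq[String] =
    explode(patients)
      .filter { case (_, code) => descriptions.get(code).contains(desc) }
      .map(_._1)
      .distinct
}
```

Because the lookup goes through the description table on every call, swapping in the file2 contents from time T2 changes the answer without touching file1, which is the behaviour the question asks for.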

How to use NOT IN clause in filter condition in spark

I want to filter a column of an RDD source :
val source = sql("SELECT * from sample.source").rdd.map(_.mkString(","))
val destination = sql("select * from sample.destination").rdd.map(_.mkString(","))
val source_primary_key = source.map(rec => (rec.split(",")(0)))
val destination_primary_key = destination.map(rec => (rec.split(",")(0)))
val src = source_primary_key.subtractByKey(destination_primary_key)
I want to use an IN clause in a filter condition to keep only the values from source that are present in src, something like below (EDITED):
val source = spark.read.csv(inputPath + "/source").rdd.map(_.mkString(","))
val destination = spark.read.csv(inputPath + "/destination").rdd.map(_.mkString(","))
val source_primary_key = source.map(rec => (rec.split(",")(0)))
val destination_primary_key = destination.map(rec => (rec.split(",")(0)))
val extra_in_source = source_primary_key.filter(rec._1 != destination_primary_key._1)
The equivalent SQL code is:
SELECT * FROM SOURCE WHERE ID IN (select ID from src)
Thank you
Since your code isn't reproducible, here is a small example using spark-sql on how to select * from t where id in (...) :
// create a DataFrame for a range 'id' from 1 to 9.
scala> val df = spark.range(1,10).toDF
df: org.apache.spark.sql.DataFrame = [id: bigint]
// values to exclude
scala> val f = Seq(5,6,7)
f: Seq[Int] = List(5, 6, 7)
// select * from df where id is not in the values to exclude
scala> df.filter(!col("id").isin(f : _*)).show
+---+
| id|
+---+
| 1|
| 2|
| 3|
| 4|
| 8|
| 9|
+---+
// select * from df where id is in the values to exclude
scala> df.filter(col("id").isin(f : _*)).show
Here is the RDD version of the not isin :
scala> val rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24
scala> val f = Seq(5,6,7)
f: Seq[Int] = List(5, 6, 7)
scala> val rdd2 = rdd.filter(x => !f.contains(x))
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[3] at filter at <console>:28
Nevertheless, I still believe this is overkill since you are already using spark-sql.
It seems that in your case you are actually dealing with DataFrames, so the solutions mentioned above don't work.
You can use the left anti join approach:
scala> val source = spark.read.format("csv").load("source.file")
source: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 9 more fields]
scala> val destination = spark.read.format("csv").load("destination.file")
destination: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 9 more fields]
scala> source.show
+---+------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
|_c0| _c1| _c2| _c3| _c4|_c5|_c6| _c7| _c8| _c9| _c10|
+---+------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
| 1| Ravi kumar| Ravi | kumar| MSO | 1| M|17-01-1994| 74.5| 24000.78| Alabama |
| 2|Shekhar shudhanshu| Shekhar|shudhanshu| Manulife | 2| M|18-01-1994|76.34| 250000| Alaska |
| 3|Preethi Narasingam| Preethi|Narasingam| Retail | 3| F|19-01-1994|77.45|270000.01| Arizona |
| 4| Abhishek Nair|Abhishek| Nair| Banking | 4| M|20-01-1994|78.65| 345000| Arkansas |
| 5| Ram Sharma| Ram| Sharma|Infrastructure | 5| M|21-01-1994|79.12| 45000| California |
| 6| Chandani Kumari|Chandani| Kumari| BNFS | 6| F|22-01-1994|80.13| 43000.02| Colorado |
| 7| Balaji Kumar| Balaji| Kumar| MSO | 1| M|23-01-1994|81.33| 1234678|Connecticut |
| 8| Naveen Shekrappa| Naveen| Shekrappa| Manulife | 2| M|24-01-1994| 100| 789414| Delaware |
| 9| Milind Chavan| Milind| Chavan| Retail | 3| M|25-01-1994|83.66| 245555| Florida |
| 10| Raghu Rajeev| Raghu| Rajeev| Banking | 4| M|26-01-1994|87.65| 235468| Georgia|
+---+------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
scala> destination.show
+---+-------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
|_c0| _c1| _c2| _c3| _c4|_c5|_c6| _c7| _c8| _c9| _c10|
+---+-------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
| 1| Ravi kumar| Revi | kumar| MSO | 1| M|17-01-1994| 74.5| 24000.78| Alabama |
| 1| Ravi1 kumar| Revi | kumar| MSO | 1| M|17-01-1994| 74.5| 24000.78| Alabama |
| 1| Ravi2 kumar| Revi | kumar| MSO | 1| M|17-01-1994| 74.5| 24000.78| Alabama |
| 2| Shekhar shudhanshu| Shekhar|shudhanshu| Manulife | 2| M|18-01-1994|76.34| 250000| Alaska |
| 3|Preethi Narasingam1| Preethi|Narasingam| Retail | 3| F|19-01-1994|77.45|270000.01| Arizona |
| 4| Abhishek Nair1|Abhishek| Nair| Banking | 4| M|20-01-1994|78.65| 345000| Arkansas |
| 5| Ram Sharma| Ram| Sharma|Infrastructure | 5| M|21-01-1994|79.12| 45000| California |
| 6| Chandani Kumari|Chandani| Kumari| BNFS | 6| F|22-01-1994|80.13| 43000.02| Colorado |
| 7| Balaji Kumar| Balaji| Kumar| MSO | 1| M|23-01-1994|81.33| 1234678|Connecticut |
| 8| Naveen Shekrappa| Naveen| Shekrappa| Manulife | 2| M|24-01-1994| 100| 789414| Delaware |
| 9| Milind Chavan| Milind| Chavan| Retail | 3| M|25-01-1994|83.66| 245555| Florida |
| 10| Raghu Rajeev| Raghu| Rajeev| Banking | 4| M|26-01-1994|87.65| 235468| Georgia|
+---+-------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
You'll just need to do the following :
scala> val res1 = source.join(destination, Seq("_c0"), "leftanti")
scala> val res2 = destination.join(source, Seq("_c0"), "leftanti")
It's the same logic I mentioned in my answer here.
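To make the semantics of a left anti join concrete, here is a plain-Scala sketch on in-memory rows. It is not Spark code: `AntiJoinSketch` and `leftAnti` are made-up names, and each row is reduced to a (key, payload) pair where the key plays the role of _c0:

```scala
object AntiJoinSketch {
  // Keep the rows of `left` whose key has no match among the keys of `right`.
  def leftAnti(left: Seq[(String, String)],
               right: Seq[(String, String)]): Seq[(String, String)] = {
    val rightKeys = right.map(_._1).toSet
    left.filterNot { case (key, _) => rightKeys.contains(key) }
  }
}
```

Note that with the sample tables shown above, every key from 1 to 10 appears on both sides, so both anti joins on _c0 would come back empty. Rows only show up when a key exists on one side and not on the other.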
In PySpark you can try:
df.filter(~df.Dept.isin("30","20")).show()
# This will list all the rows of df where Dept is NOT IN ("30", "20")
You can try something similar in Java:
ds = ds.filter(functions.not(functions.col(COLUMN_NAME).isin(exclusionSet.toArray())));
where exclusionSet is a set of the values to exclude from your dataset; it is converted to an array because isin takes varargs.