String manipulations using Spark Scala

I have the following Spark scala dataframe.
val someDF = Seq(
(1, "bat",1.3222),
(4, "cbat",1.40222),
(3, "horse",1.501212)
).toDF("number", "word","value")
I created a user-defined function (UDF) to add a new column as follows.
Logic: if word equals "bat" then value, else zero.
import org.apache.spark.sql.functions.{col, udf}
val func1 = udf( (s:String ,y:Double) => if(s.contains("bat")) y else 0 )
someDF.withColumn("cal_var", func1(col("word"),col("value"))).drop("value").show
+------+-----+-------+
|number| word|cal_var|
+------+-----+-------+
| 1| bat| 1.3222|
| 4| cbat|1.40222|
| 3|horse| 0.0|
+------+-----+-------+
Here I used the contains function to check equality, which is why I am getting incorrect output.
My desired output should be like this :
+------+-----+-------+
|number| word|cal_var|
+------+-----+-------+
| 1| bat| 1.3222|
| 4| cbat| 0.0|
| 3|horse| 0.0|
+------+-----+-------+
Can anyone help me figure out the correct string function to use for checking equality?
Thank you

Try to avoid UDFs where a built-in function can do the job: a UDF is opaque to Spark's optimizer and usually performs worse.
Another approach, using when/otherwise:
val someDF = Seq(
(1, "bat",1.3222),
(4, "cbat",1.40222),
(3, "horse",1.501212)
).toDF("number", "word","value")
import org.apache.spark.sql.functions._
someDF.show
+------+-----+--------+
|number| word| value|
+------+-----+--------+
| 1| bat| 1.3222|
| 4| cbat| 1.40222|
| 3|horse|1.501212|
+------+-----+--------+
someDF.withColumn("value",when('word === "bat",'value).otherwise(0)).show()
+------+-----+------+
|number| word| value|
+------+-----+------+
| 1| bat|1.3222|
| 4| cbat| 0.0|
| 3|horse| 0.0|
+------+-----+------+

The solution is to use the equals method rather than contains. contains checks whether the string "bat" occurs anywhere in the given string s, so it also matches "cbat"; equals checks for an exact match. The code is shown below:
scala> someDF.show
+------+-----+--------+
|number| word| value|
+------+-----+--------+
| 1| bat| 1.3222|
| 4| cbat| 1.40222|
| 3|horse|1.501212|
+------+-----+--------+
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val func1 = udf( (s:String ,y:Double) => if(s.equals("bat")) y else 0 )
func1: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,DoubleType,Some(List(StringType, DoubleType)))
scala> someDF.withColumn("col_var", func1(col("word"),col("value"))).drop("value").show
+------+-----+-------+
|number| word|col_var|
+------+-----+-------+
| 1| bat| 1.3222|
| 4| cbat| 0.0|
| 3|horse| 0.0|
+------+-----+-------+
Let me know if it helps!!
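To make the difference concrete outside of Spark, here is a minimal plain-Scala sketch of the two predicates applied to the rows from the question. The object and method names (ContainsVsEquals, withContains, withEquals) are made up for illustration and are not part of any Spark API.

```scala
object ContainsVsEquals {
  // Substring test: true for "bat" and also for "cbat".
  def withContains(word: String, value: Double): Double =
    if (word.contains("bat")) value else 0.0

  // Exact match: true only for "bat". Scala's == on strings compares
  // contents and does not throw when word is null.
  def withEquals(word: String, value: Double): Double =
    if (word == "bat") value else 0.0

  // Same rows as in the question.
  val rows: Seq[(Int, String, Double)] =
    Seq((1, "bat", 1.3222), (4, "cbat", 1.40222), (3, "horse", 1.501212))

  def main(args: Array[String]): Unit =
    rows.foreach { case (n, w, v) =>
      println(s"$n $w contains=${withContains(w, v)} equals=${withEquals(w, v)}")
    }
}
```

The same body can be dropped into the udf(...) call if you really need a UDF; otherwise the when/otherwise version avoids the UDF altogether.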

Related

Split multiple array columns into rows

This question is identical to
Pyspark: Split multiple array columns into rows
but I want to know how to do it in Scala.
For a dataframe like this,
+---+---------+---------+---+
| a| b| c| d|
+---+---------+---------+---+
| 1|[1, 2, 3]|[, 8, 9] |foo|
+---+---------+---------+---+
I want to have it in the following format
+---+---+-------+------+
| a| b| c | d |
+---+---+-------+------+
| 1| 1| None | foo |
| 1| 2| 8 | foo |
| 1| 3| 9 | foo |
+---+---+-------+------+
In Scala, I know there's an explode function, but I don't think it's applicable here.
I tried
import org.apache.spark.sql.functions.arrays_zip
but I get an error saying arrays_zip is not a member of org.apache.spark.sql.functions, although it's clearly listed at https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html
The answer below might be helpful to you:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for the $"..." column syntax
val arrayData = Seq(
Row(1,List(1,2,3),List(0,8,9),"foo"))
val arraySchema = new StructType().add("a",IntegerType).add("b", ArrayType(IntegerType)).add("c", ArrayType(IntegerType)).add("d",StringType)
val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayData),arraySchema)
// explode takes a single column, so explode($"b", $"c") does not compile; zip the two arrays first and explode the result.
val zip = udf((x: Seq[Int], y: Seq[Int]) => x.zip(y))
df.withColumn("vars", explode(zip($"b", $"c"))).select($"a", $"d",$"vars._1".alias("b"), $"vars._2".alias("c")).show()
/*
+---+---+---+---+
| a| d| b| c|
+---+---+---+---+
| 1|foo| 1| 0|
| 1|foo| 2| 8|
| 1|foo| 3| 9|
+---+---+---+---+
*/
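The row-level logic of the zip-then-explode approach can be sketched in plain Scala, which also shows one way to carry the missing element from the question (the empty first entry of c) as an Option. This is only an illustration of the idea, not Spark code; the names ZipExplodeSketch, Wide and explodeZipped are hypothetical.

```scala
object ZipExplodeSketch {
  // One input row: two parallel arrays plus two scalar columns.
  // c uses Option[Int] so a missing element can be represented as None.
  final case class Wide(a: Int, b: Seq[Int], c: Seq[Option[Int]], d: String)

  // Pair up b and c element by element, then emit one output row per pair.
  // Note that zip stops at the shorter of the two sequences.
  def explodeZipped(r: Wide): Seq[(Int, Int, Option[Int], String)] =
    r.b.zip(r.c).map { case (b, c) => (r.a, b, c, r.d) }

  def main(args: Array[String]): Unit = {
    val row = Wide(1, Seq(1, 2, 3), Seq(None, Some(8), Some(9)), "foo")
    explodeZipped(row).foreach(println)
  }
}
```

If the arrays can have different lengths and you want to keep the extra elements, zipAll with a default value would be the plain-Scala counterpart.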

How to replace all numbers and "." by "" of a column in Dataframes spark scala

How can I replace all decimal numbers (digits, sign and ".") in a column of a Spark Scala DataFrame with ""?
E.g. +56.5 or -64.83 should be replaced by the empty string "".
I am currently using
regexp_replace(col("col1"),"\\+|\\-|\\.|0|1|2|3|4|5|6|7|8|9", "")
Is there a better way of doing this?
Thanks
It looks like a regexp for decimal numbers is required; the "regex" tag could be added to the question.
Such regexp can be used:
// this is pattern to use
val decimalNumbersPattern = "[-+]?[0-9]+\\.[0-9]+"
val df = Seq("Replaced: +56.5", "Replaced: -64.83", "Remains: 44").toDF()
df
.select(regexp_replace($"value", decimalNumbersPattern, "").alias("result"))
Output:
+-----------+
|result |
+-----------+
|Replaced: |
|Replaced: |
|Remains: 44|
+-----------+
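Since regexp_replace uses Java regular expressions, the pattern can be tried out in plain Scala with String.replaceAll before running it on a DataFrame. The sketch below compares the decimal-number pattern from this answer with a character class that behaves like the long alternation in the question. The names DecimalStripSketch, stripDecimals and stripNumericChars are made up for illustration.

```scala
object DecimalStripSketch {
  // Pattern from the answer: optional sign, digits, a dot, digits.
  val decimalNumbersPattern = "[-+]?[0-9]+\\.[0-9]+"

  // Character class equivalent to the question's alternation
  // "\\+|\\-|\\.|0|1|...|9": it removes every sign, dot and digit.
  val numericCharsPattern = "[-+.0-9]"

  def stripDecimals(s: String): String = s.replaceAll(decimalNumbersPattern, "")

  def stripNumericChars(s: String): String = s.replaceAll(numericCharsPattern, "")

  def main(args: Array[String]): Unit =
    Seq("Replaced: +56.5", "Replaced: -64.83", "Remains: 44").foreach { s =>
      println(s"[${stripDecimals(s)}] [${stripNumericChars(s)}]")
    }
}
```

Which one you want depends on whether integers such as 44 should survive: the first pattern keeps them, the character class removes them.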
import org.apache.spark.sql.functions._
val df = meta.core.DataCore.spark.createDataFrame(Seq(
(0, "+56.5"),
(1, "-64.83"),
(2, "+12.1234"),
(3, "13"),
(4, "-10.0"),
(5, "2"),
(6, "0")
)).toDF("id", "all_digitals")
df
.withColumn("not_decimals", when(col("all_digitals").contains("."), "").otherwise(col("all_digitals")))
.show()
Result is :
+---+------------+------------+
| id|all_digitals|not_decimals|
+---+------------+------------+
| 0| +56.5| |
| 1| -64.83| |
| 2| +12.1234| |
| 3| 13| 13|
| 4| -10.0| |
| 5| 2| 2|
| 6| 0| 0|
+---+------------+------------+

Building up a dataframe

I am trying to build a dataframe of 10k records and then save it to a parquet file on Spark 2.4.3 standalone.
The following works at small scale, up to 1000 records, but takes forever when ramping up to 10k:
scala> import spark.implicits._
import spark.implicits._
scala> var someDF = Seq((0, "item0")).toDF("x", "y")
someDF: org.apache.spark.sql.DataFrame = [x: int, y: string]
scala> for ( i <- 1 to 1000 ) {someDF = someDF.union(Seq((i,"item"+i)).toDF("x", "y")) }
scala> someDF.show
+---+------+
| x| y|
+---+------+
| 0| item0|
| 1| item1|
| 2| item2|
| 3| item3|
| 4| item4|
| 5| item5|
| 6| item6|
| 7| item7|
| 8| item8|
| 9| item9|
| 10|item10|
| 11|item11|
| 12|item12|
| 13|item13|
| 14|item14|
| 15|item15|
| 16|item16|
| 17|item17|
| 18|item18|
| 19|item19|
+---+------+
only showing top 20 rows
[Stage 2:=========================================================(20 + 0) / 20]
scala> var someDF = Seq((0, "item0")).toDF("x", "y")
someDF: org.apache.spark.sql.DataFrame = [x: int, y: string]
scala> someDF.show
+---+-----+
| x| y|
+---+-----+
| 0|item0|
+---+-----+
scala> for ( i <- 1 to 10000 ) {someDF = someDF.union(Seq((i,"item"+i)).toDF("x", "y")) }
Just want to save someDF to a parquet file to then load into Impala
//declare Range that you want
scala> val r = 1 to 10000
//create DataFrame with range
scala> val df = sc.parallelize(r).toDF("x")
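//Add new column "y" (col, concat and lit come from org.apache.spark.sql.functions)
scala> import org.apache.spark.sql.functions.{col, concat, lit}
scala> val final_df = df.select(col("x"),concat(lit("item"),col("x")).alias("y"))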
scala> final_df.show
+---+------+
| x| y|
+---+------+
| 1| item1|
| 2| item2|
| 3| item3|
| 4| item4|
| 5| item5|
| 6| item6|
| 7| item7|
| 8| item8|
| 9| item9|
| 10|item10|
| 11|item11|
| 12|item12|
| 13|item13|
| 14|item14|
| 15|item15|
| 16|item16|
| 17|item17|
| 18|item18|
| 19|item19|
| 20|item20|
+---+------+
scala> final_df.count
res17: Long = 10000
//Write final_df to path in parquet format
scala> final_df.write.format("parquet").save(<path to write>)
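The loop in the question adds one union to the query plan per iteration, which is most likely why it slows down so badly at 10k. An alternative to the range-plus-concat approach is to build all the rows as a local collection first and convert it once. The sketch below only builds the collection, so it runs without Spark; BuildRowsSketch and buildRows are hypothetical names. In spark-shell, with spark.implicits._ in scope, the result could then be converted in a single step with .toDF("x", "y").

```scala
object BuildRowsSketch {
  // Build rows (0, "item0") .. (n, "itemN") locally, matching the
  // shape of the DataFrame in the question (x: Int, y: String).
  def buildRows(n: Int): Seq[(Int, String)] =
    (0 to n).map(i => (i, s"item$i"))

  def main(args: Array[String]): Unit = {
    val rows = buildRows(10000)
    println(rows.take(3))
    println(rows.length)
  }
}
```

This is fine for 10k rows that fit comfortably in driver memory; for much larger data, generating the rows on the cluster is the better fit.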

How to use NOT IN clause in filter condition in spark

I want to filter a column of an RDD source :
val source = sql("SELECT * from sample.source").rdd.map(_.mkString(","))
val destination = sql("select * from sample.destination").rdd.map(_.mkString(","))
val source_primary_key = source.map(rec => (rec.split(",")(0)))
val destination_primary_key = destination.map(rec => (rec.split(",")(0)))
val src = source_primary_key.subtractByKey(destination_primary_key)
I want to use IN clause in filter condition to filter out only the values present in src from source, something like below(EDITED):
val source = spark.read.csv(inputPath + "/source").rdd.map(_.mkString(","))
val destination = spark.read.csv(inputPath + "/destination").rdd.map(_.mkString(","))
val source_primary_key = source.map(rec => (rec.split(",")(0)))
val destination_primary_key = destination.map(rec => (rec.split(",")(0)))
val extra_in_source = source_primary_key.filter(rec._1 != destination_primary_key._1)
equivalent SQL code is
SELECT * FROM SOURCE WHERE ID IN (select ID from src)
Thank you
Since your code isn't reproducible, here is a small example using spark-sql on how to select * from t where id in (...) :
// create a DataFrame for a range 'id' from 1 to 9.
scala> val df = spark.range(1,10).toDF
df: org.apache.spark.sql.DataFrame = [id: bigint]
// values to exclude
scala> val f = Seq(5,6,7)
f: Seq[Int] = List(5, 6, 7)
// select * from df where id is not in the values to exclude
scala> df.filter(!col("id").isin(f : _*)).show
+---+
| id|
+---+
| 1|
| 2|
| 3|
| 4|
| 8|
| 9|
+---+
// select * from df where id is in the values to exclude
scala> df.filter(col("id").isin(f : _*)).show
Here is the RDD version of the not isin :
scala> val rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24
scala> val f = Seq(5,6,7)
f: Seq[Int] = List(5, 6, 7)
scala> val rdd2 = rdd.filter(x => !f.contains(x))
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[3] at filter at <console>:28
Nevertheless, I still believe this is an overkill since you are already using spark-sql.
It seems in your case that you are actually dealing with DataFrames, thus the solutions mentioned above don't work.
You can use the left anti join approach :
scala> val source = spark.read.format("csv").load("source.file")
source: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 9 more fields]
scala> val destination = spark.read.format("csv").load("destination.file")
destination: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 9 more fields]
scala> source.show
+---+------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
|_c0| _c1| _c2| _c3| _c4|_c5|_c6| _c7| _c8| _c9| _c10|
+---+------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
| 1| Ravi kumar| Ravi | kumar| MSO | 1| M|17-01-1994| 74.5| 24000.78| Alabama |
| 2|Shekhar shudhanshu| Shekhar|shudhanshu| Manulife | 2| M|18-01-1994|76.34| 250000| Alaska |
| 3|Preethi Narasingam| Preethi|Narasingam| Retail | 3| F|19-01-1994|77.45|270000.01| Arizona |
| 4| Abhishek Nair|Abhishek| Nair| Banking | 4| M|20-01-1994|78.65| 345000| Arkansas |
| 5| Ram Sharma| Ram| Sharma|Infrastructure | 5| M|21-01-1994|79.12| 45000| California |
| 6| Chandani Kumari|Chandani| Kumari| BNFS | 6| F|22-01-1994|80.13| 43000.02| Colorado |
| 7| Balaji Kumar| Balaji| Kumar| MSO | 1| M|23-01-1994|81.33| 1234678|Connecticut |
| 8| Naveen Shekrappa| Naveen| Shekrappa| Manulife | 2| M|24-01-1994| 100| 789414| Delaware |
| 9| Milind Chavan| Milind| Chavan| Retail | 3| M|25-01-1994|83.66| 245555| Florida |
| 10| Raghu Rajeev| Raghu| Rajeev| Banking | 4| M|26-01-1994|87.65| 235468| Georgia|
+---+------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
scala> destination.show
+---+-------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
|_c0| _c1| _c2| _c3| _c4|_c5|_c6| _c7| _c8| _c9| _c10|
+---+-------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
| 1| Ravi kumar| Revi | kumar| MSO | 1| M|17-01-1994| 74.5| 24000.78| Alabama |
| 1| Ravi1 kumar| Revi | kumar| MSO | 1| M|17-01-1994| 74.5| 24000.78| Alabama |
| 1| Ravi2 kumar| Revi | kumar| MSO | 1| M|17-01-1994| 74.5| 24000.78| Alabama |
| 2| Shekhar shudhanshu| Shekhar|shudhanshu| Manulife | 2| M|18-01-1994|76.34| 250000| Alaska |
| 3|Preethi Narasingam1| Preethi|Narasingam| Retail | 3| F|19-01-1994|77.45|270000.01| Arizona |
| 4| Abhishek Nair1|Abhishek| Nair| Banking | 4| M|20-01-1994|78.65| 345000| Arkansas |
| 5| Ram Sharma| Ram| Sharma|Infrastructure | 5| M|21-01-1994|79.12| 45000| California |
| 6| Chandani Kumari|Chandani| Kumari| BNFS | 6| F|22-01-1994|80.13| 43000.02| Colorado |
| 7| Balaji Kumar| Balaji| Kumar| MSO | 1| M|23-01-1994|81.33| 1234678|Connecticut |
| 8| Naveen Shekrappa| Naveen| Shekrappa| Manulife | 2| M|24-01-1994| 100| 789414| Delaware |
| 9| Milind Chavan| Milind| Chavan| Retail | 3| M|25-01-1994|83.66| 245555| Florida |
| 10| Raghu Rajeev| Raghu| Rajeev| Banking | 4| M|26-01-1994|87.65| 235468| Georgia|
+---+-------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
You'll just need to do the following :
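// rows of source whose _c0 has no match in destination
scala> val res1 = source.join(destination, Seq("_c0"), "leftanti")
// rows of destination whose _c0 has no match in source
scala> val res2 = destination.join(source, Seq("_c0"), "leftanti")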
It's the same logic I mentioned in my answer here.
In PySpark you can try:
df.filter(~df.Dept.isin("30","20")).show()
# This lists all the rows of df where Dept is NOT IN ("30", "20")
You can try something similar in Java (isin takes varargs, so the set is converted to an array first):
ds = ds.filter(functions.not(functions.col(COLUMN_NAME).isin(exclusionSet.toArray())));
where exclusionSet is a set of objects that needs to be removed from your dataset.
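For readers who want to see what the NOT IN / left anti join logic does row by row, here is a plain-Scala sketch over local collections. It is not Spark code, the sample rows are invented, and the names AntiJoinSketch and leftAnti are hypothetical.

```scala
object AntiJoinSketch {
  // Keep the rows of `left` whose key does not appear among the keys of `right`.
  // Duplicate keys on the left are kept; the right side only contributes keys.
  def leftAnti[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, V)] = {
    val rightKeys = right.map(_._1).toSet
    left.filterNot { case (k, _) => rightKeys.contains(k) }
  }

  // Invented sample rows: (primary key, name).
  val source: Seq[(String, String)] =
    Seq("1" -> "Ravi kumar", "2" -> "Shekhar shudhanshu", "11" -> "Extra row")
  val destination: Seq[(String, String)] =
    Seq("1" -> "Ravi kumar", "1" -> "Ravi1 kumar", "2" -> "Shekhar shudhanshu")

  def main(args: Array[String]): Unit = {
    println(leftAnti(source, destination)) // rows only in source
    println(leftAnti(destination, source)) // rows only in destination
  }
}
```

Collecting the keys into a Set is the local analogue of isin with a small list of values; when the key list is itself a large dataset, the join form is the one that scales.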

Spark dataframe filter

val df = sc.parallelize(Seq((1,"Emailab"), (2,"Phoneab"), (3, "Faxab"),(4,"Mail"),(5,"Other"),(6,"MSL12"),(7,"MSL"),(8,"HCP"),(9,"HCP12"))).toDF("c1","c2")
+---+-------+
| c1| c2|
+---+-------+
| 1|Emailab|
| 2|Phoneab|
| 3| Faxab|
| 4| Mail|
| 5| Other|
| 6| MSL12|
| 7| MSL|
| 8| HCP|
| 9| HCP12|
+---+-------+
I want to filter out records which have first 3 characters of column 'c2' either 'MSL' or 'HCP'.
So the output should be like below.
+---+-------+
| c1| c2|
+---+-------+
| 1|Emailab|
| 2|Phoneab|
| 3| Faxab|
| 4| Mail|
| 5| Other|
+---+-------+
Can anyone please help with this?
I know that df.filter($"c2".rlike("MSL")) selects the matching records, but how do I exclude them?
Version: Spark 1.6.2
Scala : 2.10
This works too. Concise and very similar to SQL.
df.filter("c2 not like 'MSL%' and c2 not like 'HCP%'").show
+---+-------+
| c1| c2|
+---+-------+
| 1|Emailab|
| 2|Phoneab|
| 3| Faxab|
| 4| Mail|
| 5| Other|
+---+-------+
df.filter(not(
substring(col("c2"), 0, 3).isin("MSL", "HCP"))
)
I used the code below to filter rows from a dataframe and it worked for me on Spark 2.2.
val spark = new org.apache.spark.sql.SQLContext(sc)
val data = spark.read.format("csv").
option("header", "true").
option("delimiter", "|").
option("inferSchema", "true").
load("D:\\test.csv")
import spark.implicits._
val filter=data.filter($"dept" === "IT" )
OR
val filter=data.filter($"dept" =!= "IT" )
val df1 = df.filter(not(df("c2").rlike("MSL"))&&not(df("c2").rlike("HCP")))
This worked.
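One caveat about the rlike variants: rlike looks for a match anywhere in the string unless the pattern is anchored, while the question asks about the first 3 characters only. On the sample data both give the same result, but they differ for values that contain MSL or HCP later in the string. A plain-Scala sketch of the three predicates (the names PrefixFilterSketch, keepByPrefix, keepAnchored and keepUnanchored, and the value "XMSL", are made up for illustration):

```scala
object PrefixFilterSketch {
  private val excludedPrefixes = Set("MSL", "HCP")
  private val anchored = "^(MSL|HCP)".r
  private val unanchored = "MSL|HCP".r

  // Compare the first 3 characters against the excluded prefixes.
  def keepByPrefix(c2: String): Boolean = !excludedPrefixes.contains(c2.take(3))

  // Regex anchored at the start: same result as the prefix check.
  def keepAnchored(c2: String): Boolean = anchored.findFirstIn(c2).isEmpty

  // Unanchored regex: also drops values that contain MSL or HCP anywhere.
  def keepUnanchored(c2: String): Boolean = unanchored.findFirstIn(c2).isEmpty

  // Same rows as in the question.
  val rows: Seq[(Int, String)] = Seq(
    1 -> "Emailab", 2 -> "Phoneab", 3 -> "Faxab", 4 -> "Mail", 5 -> "Other",
    6 -> "MSL12", 7 -> "MSL", 8 -> "HCP", 9 -> "HCP12")

  def main(args: Array[String]): Unit =
    rows.filter { case (_, c2) => keepByPrefix(c2) }.foreach(println)
}
```

In the DataFrame version, the corresponding change would be to anchor the pattern, for example rlike("^(MSL|HCP)"), or to stick with the like 'MSL%' form, which is a prefix match by construction.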