Remove bad character if at beginning of column - scala

I have a DF of emails where the first character in the email is occasionally a symbol. I am trying to remove it if it exists.
+--------------------+
| Email |
+--------------------+
|bob#gmail.com |
|*steve#yahoo.com |
|leeroy#hotmail.com |
|#grant#gmail.com |
+--------------------+
The final df would look like this:
+--------------------+
| Email |
+--------------------+
|bob#gmail.com |
|steve#yahoo.com |
|leeroy#hotmail.com |
|grant#gmail.com |
+--------------------+
Is there a way to do this efficiently?

Using the regexp_replace function should do the job.
You can adapt the regex if needed. Here it removes a single non-alphanumeric character from the beginning of the string.
import org.apache.spark.sql.functions.{col, regexp_replace}
val df1 = df.withColumn("Email", regexp_replace(col("Email"), "^[^a-zA-Z0-9]", ""))
df1.show()
//+------------------+
//| Email|
//+------------------+
//| bob#gmail.com|
//| steve#yahoo.com|
//|leeroy#hotmail.com|
//| grant#gmail.com|
//+------------------+
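If more than one such symbol can appear at the start of an address, a + quantifier removes the whole leading run in one pass. A minimal sketch, assuming the same df and imports as above:
// strip every leading character that is not a letter or a digit
val df2 = df.withColumn("Email", regexp_replace(col("Email"), "^[^a-zA-Z0-9]+", ""))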

Related

Blank spaces in string | Spark Scala

I have a non-breaking trailing space in a string in a column. I have tried the solutions below but cannot get rid of the space.
df.select(
  col("city"),
  regexp_replace(col("city"), " ", ""),
  regexp_replace(col("city"), "[\\r\\n]", ""),
  regexp_replace(col("city"), "\\s+$", ""),
  rtrim(col("city"))
).show()
Is there any other possible solution I can try to remove the blank space?
You can use the ltrim, rtrim or trim functions from org.apache.spark.sql.functions:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = Seq(
  "Bengaluru ",
  " Bengaluru",
  " Bengaluru "
).toDF("city")
df.show
+----------------+
| city|
+----------------+
| Bengaluru |
| Bengaluru|
| Bengaluru |
+----------------+
df.withColumn("city", ltrim(col("city"))).show
+-------------+
| city|
+-------------+
| Bengaluru |
| Bengaluru|
|Bengaluru |
+-------------+
df.withColumn("city", rtrim(col("city"))).show
+------------+
| city|
+------------+
| Bengaluru|
| Bengaluru|
| Bengaluru|
+------------+
df.withColumn("city", trim(col("city"))).show
+---------+
| city|
+---------+
|Bengaluru|
|Bengaluru|
|Bengaluru|
+---------+
Choose ltrim, rtrim or trim depending on whether you want to remove leading spaces, trailing spaces, or both.
Hope this helps!
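Note that ltrim, rtrim and trim only strip ordinary ASCII spaces, so a non-breaking space (U+00A0), which the question mentions, can survive all of them. A hedged sketch that strips trailing whitespace including U+00A0 with regexp_replace, assuming the same city column:
import org.apache.spark.sql.functions.{col, regexp_replace}
// remove ordinary whitespace and non-breaking spaces at the end of the string
df.withColumn("city", regexp_replace(col("city"), "[\\s\\u00A0]+$", "")).show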

Conditional string manipulation in Pyspark

I have a Pyspark dataframe, among others, a column of MSNs (of string type) like the following:
+------+
| Col1 |
+------+
| 654- |
| 1859 |
| 5875 |
| 784- |
| 596- |
| 668- |
| 1075 |
+------+
As you can see, those entries with a value of less than 1000 (i.e. three characters) have a - character at the end to make a total of 4 characters.
I want to get rid of that - character, so that I end up with something like:
+------+
| Col2 |
+------+
| 654 |
| 1859 |
| 5875 |
| 784 |
| 596 |
| 668 |
| 1075 |
+------+
I have tried the following code (where df is the dataframe containing the column), but it does not appear to work:
if df.Col1[3] == "-":
    df = df.withColumn('Col2', df.series.substr(1, 3))
    return df
else:
    return df
Does anyone know how to do it?
You can replace the - in the column with an empty string ("") using F.regexp_replace. With this data the - only ever appears at the end; use the pattern "-$" if you want to remove a trailing dash only.
See the code below,
from pyspark.sql import functions as F
df.withColumn("Col2", F.regexp_replace("Col1", "-", "")).show()
+----+----+
|Col1|Col2|
+----+----+
|589-| 589|
|1245|1245|
|145-| 145|
+----+----+
Here is a solution using the .substr() method:
df.withColumn("Col2", F.when(F.col("Col1").substr(4, 1) == "-",
F.col("Col1").substr(1, 3)
).otherwise(
F.col("Col1"))).show()
+----+----+
|Col1|Col2|
+----+----+
|654-| 654|
|1859|1859|
|5875|5875|
|784-| 784|
|596-| 596|
|668-| 668|
|1075|1075|
+----+----+

Spark (scala) dataframes - Check whether strings in column exist in a column of another dataframe

I have a Spark dataframe, and I wish to check whether each string in a particular column exists in a pre-defined column of another dataframe.
I have found a similar problem in Spark (scala) dataframes - Check whether strings in column contain any items from a set
but I want to check whether the strings in a column exist in a column of another dataframe, not in a List or a Set as in that question. I don't know how to convert a column to a set or a list, and I don't know of an "exists" method on a dataframe. Who can help me?
My data is similar to this:
df1:
+---+-----------------+
| id| url |
+---+-----------------+
| 1|google.com |
| 2|facebook.com |
| 3|github.com |
| 4|stackoverflow.com|
+---+-----------------+
df2:
+-----+------------+
| id | urldetail |
+-----+------------+
| 11 |google.com |
| 12 |yahoo.com |
| 13 |facebook.com|
| 14 |twitter.com |
| 15 |youtube.com |
+-----+------------+
Now, I am trying to create a third column with the result of a comparison, to see whether each string in the $"urldetail" column exists in $"url":
+---+------------+-------------+
| id| urldetail | check |
+---+------------+-------------+
| 11|google.com | 1 |
| 12|yahoo.com | 0 |
| 13|facebook.com| 1 |
| 14|twitter.com | 0 |
| 15|youtube.com | 0 |
+---+------------+-------------+
I want to use a UDF, but I don't know how to check whether a string exists in a column of a dataframe. Please help me!
I have a spark dataframe, and I wish to check whether each string in a
particular column contains any number of words from a pre-defined
column of another dataframe.
Here is one way, using = or like:
package examples

import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CompareColumns extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark = SparkSession.builder()
    .appName(this.getClass.getName)
    .config("spark.master", "local").getOrCreate()
  import spark.implicits._

  val df1 = Seq(
    (1, "google.com"),
    (2, "facebook.com"),
    (3, "github.com"),
    (4, "stackoverflow.com")).toDF("id", "url").as("first")
  df1.show

  val df2 = Seq(
    (11, "google.com"),
    (12, "yahoo.com"),
    (13, "facebook.com"),
    (14, "twitter.com")).toDF("id", "url").as("second")
  df2.show

  val df3 = df2.join(df1, expr("first.url like second.url"), "full_outer").select(
    col("first.url"),
    col("first.url").contains(col("second.url")).as("check")
  ).filter("url is not null")

  df3.na.fill(Map("check" -> false))
    .show
}
Result :
+---+-----------------+
| id| url|
+---+-----------------+
| 1| google.com|
| 2| facebook.com|
| 3| github.com|
| 4|stackoverflow.com|
+---+-----------------+
+---+------------+
| id| url|
+---+------------+
| 11| google.com|
| 12| yahoo.com|
| 13|facebook.com|
| 14| twitter.com|
+---+------------+
+-----------------+-----+
| url|check|
+-----------------+-----+
| google.com| true|
| facebook.com| true|
| github.com|false|
|stackoverflow.com|false|
+-----------------+-----+
With a full outer join we can achieve this.
For more details on the different joins, see my LinkedIn post.
Note: instead of 0 for false and 1 for true I have used boolean values here; you can translate them into whatever you want.
UPDATE: if the second dataframe keeps growing, you can use the following instead; it won't miss any rows from the second dataframe:
val df3 = df2.join(df1, expr("first.url like second.url"), "full").select(
  col("second.*"),
  col("first.url").contains(col("second.url")).as("check")
).filter("url is not null")
df3.na.fill(Map("check" -> false))
  .show
One more thing: you can also try regexp_extract, as shown in the post below:
https://stackoverflow.com/a/53880542/647053
Read in your data and use the trim operation on the strings, just to be conservative about whitespace when joining:
val df= Seq((1,"google.com"), (2,"facebook.com"), ( 3,"github.com "), (4,"stackoverflow.com")).toDF("id", "url").select($"id", trim($"url").as("url"))
val df2 =Seq(( 11 ,"google.com"), (12 ,"yahoo.com"), (13 ,"facebook.com"),(14 ,"twitter.com"),(15,"youtube.com")).toDF( "id" ,"urldetail").select($"id", trim($"urldetail").as("urldetail"))
df.join(df2.withColumn("flag", lit(1)).drop("id"), (df("url")===df2("urldetail")), "left_outer").withColumn("contains_bool",
when($"flag"===1, true) otherwise(false)).drop("flag","urldetail").show
+---+-----------------+-------------+
| id| url|contains_bool|
+---+-----------------+-------------+
| 1| google.com| true|
| 2| facebook.com| true|
| 3| github.com| false|
| 4|stackoverflow.com| false|
+---+-----------------+-------------+
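To get exactly the output shape from the question (every row of df2 plus a 1/0 check column), a plain left join from df2 to df1 also works, without a UDF. A minimal sketch, assuming the df1/df2 column names shown in the question:
import org.apache.spark.sql.functions.{col, when}
// keep only the url column of df1 so the two id columns don't clash after the join
val urls = df1.select(col("url"))
df2.join(urls, df2("urldetail") === urls("url"), "left_outer")
  .withColumn("check", when(col("url").isNotNull, 1).otherwise(0))
  .drop("url")
  .show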

How to apply joins in Spark Scala when we have multiple values in the join column

I have data in two text files, as follows.
File 1 (patient id, diagnosis code):
+----------+-------+
|patient_id|diag_cd|
+----------+-------+
| 1| y,t,k|
| 2| u,t,p|
| 3| u,t,k|
| 4| f,o,k|
| 5| e,o,u|
+----------+-------+
File 2 (diagnosis code, diagnosis description) at time T1:
+-------+---------+
|diag_cd|diag_desc|
+-------+---------+
| y| yen|
| t| ten|
| k| ken|
| u| uen|
| p| pen|
| f| fen|
| o| oen|
| e| een|
+-------+---------+
The data in file 2 is not fixed and keeps changing: at one point in time diagnosis code y can have the description yen, and at another point in time it can have the description ten. For example:
File 2 at time T2:
+-------+---------+
|diag_cd|diag_desc|
+-------+---------+
| y| ten|
| t| yen|
| k| uen|
| u| oen|
| p| ken|
| f| pen|
| o| een|
| e| fen|
+-------+---------+
I have to read the data from these two files in Spark and want only those patient ids that are diagnosed with uen.
It can be done using either Spark SQL or Scala.
I tried to read file1 in spark-shell. The two columns in file1 are pipe delimited.
scala> val tes1 = sc.textFile("file1.txt").map(x => x.split('|')).filter(y => y(1).contains("u")).collect
tes1: Array[Array[String]] = Array(Array(2, u,t,p), Array(3, u,t,k), Array(5, e,o,u))
But since the diagnosis code related to a diagnosis description is not constant in file2, I will have to use a join condition. I don't know how to apply joins when the diag_cd column in file1 has multiple values.
Any help would be appreciated.
Please find the answer below
// Read file1 into a dataframe
val file1DF = spark.read.format("csv").option("delimiter", "|")
  .option("header", true)
  .load("file1PATH")
// Read file2 into a dataframe
val file2DF = spark.read.format("csv").option("delimiter", "|")
  .option("header", true)
  .load("file2path")
// Get the patient ids whose diag_desc is uen
file1DF.join(file2DF, file1DF.col("diag_cd").contains(file2DF.col("diag_cd")), "inner")
  .filter(file2DF.col("diag_desc") === "uen")
  .select("patient_id").show
Convert the table from format 1 to format 2 using the explode method.
Format1:
file 1:(patient id,diagnosis code)
+----------+-------+
|patient_id|diag_cd|
+----------+-------+
| 1| y,t,k|
| 2| u,t,p|
+----------+-------+
to
file 1:(patient id,diagnosis code)
+----------+-------+
|patient_id|diag_cd|
+----------+-------+
| 1| y |
| 1| t |
| 1| k |
| 2| u |
| 2| t |
| 2| p |
+----------+-------+
Code:
scala> val data = Seq("1|y,t,k", "2|u,t,p")
data: Seq[String] = List(1|y,t,k, 2|u,t,p)
scala> val df1 = sc.parallelize(data).toDF("c1").withColumn("patient_id", split(col("c1"), "\\|").getItem(0)).withColumn("col2", split(col("c1"), "\\|").getItem(1)).select("patient_id", "col2").withColumn("diag_cd", explode(split($"col2", "\\,"))).select("patient_id", "diag_cd")
df1: org.apache.spark.sql.DataFrame = [patient_id: string, diag_cd: string]
scala> df1.collect()
res4: Array[org.apache.spark.sql.Row] = Array([1,y], [1,t], [1,k], [2,u], [2,t], [2,p])
I have created dummy data here for illustration. Note how we are exploding the particular column above using
scala> val df1 = sc.parallelize(data).toDF("c1").
| withColumn("patient_id", split(col("c1"), "\\|").getItem(0)).
| withColumn("col2", split(col("c1"), "\\|").getItem(1)).
| select("patient_id", "col2").
| withColumn("diag_cd", explode(split($"col2", "\\,"))).
| select("patient_id", "diag_cd")
df1: org.apache.spark.sql.DataFrame = [patient_id: string, diag_cd: string]
Now you can create df2 for file 2 using -
scala> val df2 = sc.textFile("file2.txt").map(x => (x.split(",")(0),x.split(",")(1))).toDF("diag_cd", "diag_desc")
df2: org.apache.spark.sql.DataFrame = [diag_cd: string, diag_desc: string]
Join df1 with df2 and filter as per the requirement.
df1.join(df2, df1.col("diag_cd") === df2.col("diag_cd"))
  .filter(df2.col("diag_desc") === "uen")
  .select(df1.col("patient_id"))
  .collect()
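Since the question allows Spark SQL as well, the same logic can be written against temp views once diag_cd is split into an array. A hedged sketch, assuming file1DF and file2DF are loaded as in the first answer (the view names patients and diagnoses are made up here):
import org.apache.spark.sql.functions.{col, split}
// expose the dataframes to SQL, splitting the comma-separated codes into an array
file1DF.withColumn("diag_cd_arr", split(col("diag_cd"), ",")).createOrReplaceTempView("patients")
file2DF.createOrReplaceTempView("diagnoses")
spark.sql("""
  SELECT DISTINCT p.patient_id
  FROM patients p
  JOIN diagnoses d ON array_contains(p.diag_cd_arr, d.diag_cd)
  WHERE d.diag_desc = 'uen'
""").show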

Replace words in Data frame using List of words in another Data frame in Spark Scala

I have two dataframes in Spark Scala, let's say df1 and df2.
df1 has two fields, 'ID' and 'Text', where 'Text' contains a description (multiple words). I have already removed all special characters and numeric characters from the 'Text' field, leaving only letters and spaces.
df1 Sample
+--------------++--------------------+
|ID ||Text |
+--------------++--------------------+
| 1 ||helo how are you |
| 2 ||hai haiden |
| 3 ||hw are u uma |
--------------------------------------
df2 contains a list of words and corresponding replacement words
df2 Sample
+--------------++--------------------+
|Word ||Replace |
+--------------++--------------------+
| helo ||hello |
| hai ||hi |
| hw ||how |
| u ||you |
--------------------------------------
I need to find all occurrences of the words in df2("Word") within df1("Text") and replace them with df2("Replace").
With the sample dataframes above, I would expect the resulting dataframe, df3, to be as given below.
df3 Sample
+--------------++--------------------+
|ID ||Text |
+--------------++--------------------+
| 1 ||hello how are you |
| 2 ||hi haiden |
| 3 ||how are you uma |
--------------------------------------
Your help is greatly appreciated in doing the same in Spark using Scala.
It'd be easier to accomplish this if you convert your df2 to a Map. Assuming it's not a huge table, you can do the following:
val keyVal = df2.map(r => (r(0).toString, r(1).toString)).collect.toMap
This will give you a Map to refer to:
scala.collection.immutable.Map[String,String] = Map(helo -> hello, hai -> hi, hw -> how, u -> you)
Now you can use a UDF to create a function that will use the keyVal Map to replace values:
val getVal = udf[String, String](x => x.split(" ").map(w => keyVal.getOrElse(w, w)).mkString(" "))
Now, you can call the udf getVal on your dataframe to get the desired result.
df1.withColumn("text" , getVal(df1("text")) ).show
+---+-----------------+
| id| text|
+---+-----------------+
| 1|hello how are you|
| 2| hi haiden|
| 3| how are you uma|
+---+-----------------+
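If df1 is large, it can be worth broadcasting the lookup map so each executor keeps a single read-only copy instead of the map being serialised with every task. A hedged sketch built on the keyVal map above (it assumes a SparkSession named spark, as in spark-shell):
import org.apache.spark.sql.functions.udf
// ship the replacement map to the executors once
val keyValBc = spark.sparkContext.broadcast(keyVal)
val getValBc = udf((text: String) =>
  Option(text).map(_.split(" ").map(w => keyValBc.value.getOrElse(w, w)).mkString(" ")).orNull)
df1.withColumn("text", getValBc(df1("text"))).show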
I will demonstrate only for the first id, and assume that you cannot do a collect action on your df2. First you need to be sure that the text column of your df1 has an array schema:
+---+--------------------+
| id| text|
+---+--------------------+
| 1|[helo, how, are, ...|
+---+--------------------+
with schema like this:
|-- id: integer (nullable = true)
|-- text: array (nullable = true)
| |-- element: string (containsNull = true)
After that you can do an explode on the text column
res1.withColumn("text", explode(res1("text")))
+---+----+
| id|text|
+---+----+
| 1|helo|
| 1| how|
| 1| are|
| 1| you|
+---+----+
Assuming your replace dataframe looks like this:
+----+-------+
|word|replace|
+----+-------+
|helo| hello|
| hai| hi|
+----+-------+
Joining the two dataframes will look like this:
res6.join(res8, res6("text") === res8("word"), "left_outer")
+---+----+----+-------+
| id|text|word|replace|
+---+----+----+-------+
| 1| you|null| null|
| 1| how|null| null|
| 1|helo|helo| hello|
| 1| are|null| null|
+---+----+----+-------+
Do a select, coalescing the null values:
res26.select(res26("id"), coalesce(res26("replace"), res26("text")).as("replaced_text"))
+---+-------------+
| id|replaced_text|
+---+-------------+
| 1| you|
| 1| how|
| 1| hello|
| 1| are|
+---+-------------+
and then group by id and aggregate with the collect_list function:
res33.groupBy("id").agg(collect_list("replaced_text"))
+---+---------------------------+
| id|collect_list(replaced_text)|
+---+---------------------------+
| 1| [you, how, hello,...|
+---+---------------------------+
Keep in mind that you should preserve the initial order of the text elements; see the sketch below.
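To actually keep that order, posexplode carries each word's position through the join so the sentence can be rebuilt in its original sequence. A hedged sketch, assuming the ID/Text and Word/Replace column names from the question:
import org.apache.spark.sql.functions._
// posexplode emits (pos, word), so the position survives the join
val exploded = df1.select(col("ID"), posexplode(split(col("Text"), " ")).as(Seq("pos", "word")))
val replaced = exploded
  .join(df2, exploded("word") === df2("Word"), "left_outer")
  .select(col("ID"), col("pos"), coalesce(col("Replace"), exploded("word")).as("w"))
// sort each group by position before gluing the words back together
replaced.groupBy("ID")
  .agg(sort_array(collect_list(struct(col("pos"), col("w")))).as("words"))
  .select(col("ID"), concat_ws(" ", col("words.w")).as("Text"))
  .show(false)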
I suppose the code below should solve your problem.
I have solved this using the RDD API.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, LongType, StringType, StructField, StructType}
val wordRdd = df1.rdd.flatMap { row =>
  val wordList = row.getAs[String]("Text").split(" ").toList
  wordList.map(word => Row.fromTuple((row.getAs[Int]("ID"), word)))
}.zipWithIndex()
val wordDf = sqlContext.createDataFrame(wordRdd.map(x => Row.fromSeq(x._1.toSeq ++ Seq(x._2))),
  StructType(List(StructField("id", IntegerType), StructField("word", StringType), StructField("index", LongType))))
val opRdd = wordDf.join(df2, wordDf("word") === df2("Word"), "left_outer").drop(df2("Word")).rdd
  .groupBy(_.getAs[Int]("id"))
  .map(x => Row.fromTuple((x._1, x._2.toList.sortBy(_.getAs[Long]("index"))
    .map(row => Option(row.getAs[String]("Replace")).getOrElse(row.getAs[String]("word")))
    .mkString(" "))))
val opDF = sqlContext.createDataFrame(opRdd, StructType(List(StructField("id", IntegerType), StructField("Text", StringType))))