Need to remove string "_value" in Spark DataFrame column names using Scala

I have the DataFrame shown below. I need to remove the last occurrence of the string "_value" from every column name.
import spark.implicits._
import org.apache.spark.sql.functions._
val simpledata = Seq(("file1","name1","101"),
("file1","name1","101"),
("file1","name1","101"),
("file1","name1","101"),
("file1","name1","101"))
val df = simpledata.toDF("filename_value","name_value_value","serialNo_value")
df.show()
If I use replaceAll:
val renamedColumnsDf = df.columns.map(c => df(c).as(c.replaceAll("_value", "")))
it removes every "_value", but I need to remove only the last occurrence.
I need help removing the string based on its occurrence in the column name.
My output should be:
+--------+----------+--------+
|filename|name_value|serialNo|
+--------+----------+--------+
|   file1|     name1|     101|
|   file1|     name1|     101|
|   file1|     name1|     101|
|   file1|     name1|     101|
|   file1|     name1|     101|
+--------+----------+--------+

If you want to remove the _value substring only when it is the suffix of the column name, you can do the following:
import org.apache.spark.sql.DataFrame
val simpleDf: DataFrame = simpledata.toDF("filename_value", "name_value_value", "serialNo_value")
val suffix: String = "_value"
// Fold over the column names, renaming only those that end with the suffix
val renamedDf: DataFrame = simpleDf.columns.foldLeft(simpleDf) { (df, c) =>
  if (c.endsWith(suffix)) df.withColumnRenamed(c, c.substring(0, c.length - suffix.length))
  else df
}
renamedDf.show()
The output will be:
+--------+----------+--------+
|filename|name_value|serialNo|
+--------+----------+--------+
|   file1|     name1|     101|
|   file1|     name1|     101|
|   file1|     name1|     101|
|   file1|     name1|     101|
|   file1|     name1|     101|
+--------+----------+--------+

A simpler option is to pattern match on the column name inside your map transformation (matching on an s"..." interpolator requires Scala 2.13 or later):
val newName = columnName match {
  case s"${something}_value" => something
  case other => other
}
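Since both answers only ever strip a trailing "_value", the renaming rule can also be sketched with Scala's built-in stripSuffix on plain strings. This is a minimal sketch rather than a definitive implementation: the names oldNames and newNames are made up for illustration, and the Spark call in the final comment assumes a DataFrame named df as in the question.

```scala
// Column names from the question; stripSuffix removes the suffix only
// when the string ends with it, so inner "_value" parts are kept.
val suffix = "_value"
val oldNames = Seq("filename_value", "name_value_value", "serialNo_value")
val newNames = oldNames.map(_.stripSuffix(suffix))

println(newNames.mkString(", ")) // filename, name_value, serialNo

// Assumption: with a Spark DataFrame `df`, the same names could be applied
// in one call via df.toDF(df.columns.map(_.stripSuffix(suffix)): _*)
```

Working on the names first keeps the string logic easy to test before any DataFrame is involved.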

Related

How to select rows using Window function? [duplicate]

I have the following DataFrame df in Spark:
+-------+-----+---+
|OrderID| Type|Qty|
+-------+-----+---+
| 571936|62800|  1|
| 571936|62800|  1|
| 571936|62802|  3|
| 661455|72800|  1|
| 661455|72801|  1|
+-------+-----+---+
I need to select the row with the largest Qty for each unique OrderID, or the last row for that OrderID if all Qty values are the same (as for 661455). The expected result:
+-------+-----+---+
|OrderID| Type|Qty|
+-------+-----+---+
| 571936|62802|  3|
| 661455|72801|  1|
+-------+-----+---+
Any ideas how to get it?
This is what I tried:
val partitionWindow = Window.partitionBy(col("OrderID")).orderBy(col("Qty").asc)
val result = df.over(partitionWindow)
scala> val w = Window.partitionBy("OrderID").orderBy("Qty")
scala> val w1 = Window.partitionBy("OrderID")
scala> df.show()
+-------+-----+---+
|OrderID| Type|Qty|
+-------+-----+---+
| 571936|62800| 1|
| 571936|62800| 1|
| 571936|62802| 3|
| 661455|72800| 1|
| 661455|72801| 1|
+-------+-----+---+
scala> df.withColumn("rn", row_number.over(w)).withColumn("mxrn", max("rn").over(w1)).filter($"mxrn" === $"rn").drop("mxrn","rn").show
+-------+-----+---+
|OrderID| Type|Qty|
+-------+-----+---+
| 661455|72801| 1|
| 571936|62802| 3|
+-------+-----+---+
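The selection rule from the question (largest Qty per OrderID, with ties going to the last row) can be checked on plain Scala collections. This sketch does not use Spark; the Order case class and the use of the input position as the tie-breaker are assumptions made for illustration, since a DataFrame has no inherent row order.

```scala
// Hypothetical row type mirroring the question's columns
case class Order(orderId: Int, tpe: Int, qty: Int)

val orders = Seq(
  Order(571936, 62800, 1),
  Order(571936, 62800, 1),
  Order(571936, 62802, 3),
  Order(661455, 72800, 1),
  Order(661455, 72801, 1)
)

// Per OrderID, keep the row with the highest (qty, position) pair:
// the largest Qty wins, and among equal Qty the later row wins.
val picked = orders.zipWithIndex
  .groupBy { case (o, _) => o.orderId }
  .values
  .map(group => group.maxBy { case (o, i) => (o.qty, i) }._1)
  .toSeq
  .sortBy(_.orderId)

picked.foreach(println)
// Order(571936,62802,3)
// Order(661455,72801,1)
```

In Spark, row_number over rows with equal Qty is not guaranteed to follow the input order, so an explicit ordering column would be needed to make "last" well defined there.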

How to truncate the values of a column of a spark dataframe? [duplicate]

I would like to remove the last two characters of each string in a single column of a Spark DataFrame. I would like to do this in Spark itself, not by moving the data to pandas and back.
An example dataframe would be below,
# +----+-------+
# | age| name|
# +----+-------+
# | 350|Michael|
# | 290| Andy|
# | 123| Justin|
# +----+-------+
where the dtype of the age column is a string.
# +----+-------+
# | age| name|
# +----+-------+
# | 3|Michael|
# | 2| Andy|
# | 1| Justin|
# +----+-------+
This is the expected output. The last two characters of the string have been removed.
The Scala/Spark SQL way of doing this is very simple:
val result = originalDF.withColumn("age", substring(col("age"),0,1))
result.show
You can probably adapt the syntax for PySpark.
substring, length, col and expr from pyspark.sql.functions can be used for this purpose.
The expression substring(age, 1, length(age)-2) drops the last two characters. It was chosen because the ages here have 3 digits (logically, a person won't live more than 100 years :-) ); the OP can change the substring arguments to suit the requirement.
from pyspark.sql.functions import substring, length, col, expr
# df = your DataFrame here
df = df.withColumn("age", expr("substring(age, 1, length(age)-2)"))
df.show()
Result :
+----+-------+
| age| name|
+----+-------+
| 3|Michael|
| 2| Andy|
| 1| Justin|
+----+-------+
Scala answer :
val originalDF = Seq(
(350, "Michael"),
(290, "Andy"),
(123, "Justin")
).toDF("age", "name")
println(" originalDF " )
originalDF.show
println("modified")
originalDF.selectExpr("substring(age,0,1) as age", "name " ).show
Result :
originalDF
+---+-------+
|age| name|
+---+-------+
|350|Michael|
|290| Andy|
|123| Justin|
+---+-------+
modified
+---+-------+
|age| name|
+---+-------+
| 3|Michael|
| 2| Andy|
| 1| Justin|
+---+-------+
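For comparison, the "drop the last two characters" rule on plain Scala strings is a one-liner with dropRight. The sketch below uses the ages from the question as strings and is independent of Spark; the names ages and trimmed are illustrative.

```scala
val ages = Seq("350", "290", "123")

// dropRight(2) removes the last two characters regardless of string length
val trimmed = ages.map(_.dropRight(2))

println(trimmed) // List(3, 2, 1)
```

Unlike taking only the first character, this keeps "10" for an input such as "1000", which matches the length(age)-2 form of the substring expression.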

how to apply joins in spark scala when we have multiple values in the join column

I have data in two text files as
file 1:(patient id,diagnosis code)
+----------+-------+
|patient_id|diag_cd|
+----------+-------+
| 1| y,t,k|
| 2| u,t,p|
| 3| u,t,k|
| 4| f,o,k|
| 5| e,o,u|
+----------+-------+
file2(diagnosis code,diagnosis description) Time T1
+-------+---------+
|diag_cd|diag_desc|
+-------+---------+
| y| yen|
| t| ten|
| k| ken|
| u| uen|
| p| pen|
| f| fen|
| o| oen|
| e| een|
+-------+---------+
The data in file 2 is not fixed and keeps changing: at one point in time diagnosis code y can have the description yen, and at another point it can have the description ten. For example:
file2 at Time T2
+-------+---------+
|diag_cd|diag_desc|
+-------+---------+
| y| ten|
| t| yen|
| k| uen|
| u| oen|
| p| ken|
| f| pen|
| o| een|
| e| fen|
+-------+---------+
I have to read the data from these two files in Spark and keep only the IDs of patients who are diagnosed with uen.
This can be done using either Spark SQL or Scala.
I tried to read file1 in spark-shell. The two columns in file1 are pipe-delimited.
scala> val tes1 = sc.textFile("file1.txt").map(x => x.split('|')).filter(y => y(1).contains("u")).collect
tes1: Array[Array[String]] = Array(Array(2, u,t,p), Array(3, u,t,k), Array(5, e,o,u))
But because the diagnosis code for a given description is not constant in file2, I will have to use a join. I don't know how to apply a join when the diag_cd column in file1 holds multiple values.
Any help would be appreciated.
Please find the answer below
//Read the file1 into a dataframe
val file1DF = spark.read.format("csv").option("delimiter","|")
.option("header",true)
.load("file1PATH")
//Read the file2 into a dataframe
val file2DF = spark.read.format("csv").option("delimiter","|")
.option("header",true)
.load("file2path")
//get the patient id dataframe for the diag_desc as uen
file1DF.join(file2DF,file1DF.col("diag_cd").contains(file2DF.col("diag_cd")),"inner")
.filter(file2DF.col("diag_desc") === "uen")
.select("patient_id").show
Convert the table t1 from format1 to format2 using explode method.
Format1:
file 1:(patient id,diagnosis code)
+----------+-------+
|patient_id|diag_cd|
+----------+-------+
| 1| y,t,k|
| 2| u,t,p|
+----------+-------+
to
file 1:(patient id,diagnosis code)
+----------+-------+
|patient_id|diag_cd|
+----------+-------+
| 1| y |
| 1| t |
| 1| k |
| 2| u |
| 2| t |
| 2| p |
+----------+-------+
Code:
scala> val data = Seq("1|y,t,k", "2|u,t,p")
data: Seq[String] = List(1|y,t,k, 2|u,t,p)
scala> val df1 = sc.parallelize(data).toDF("c1").withColumn("patient_id", split(col("c1"), "\\|").getItem(0)).withColumn("col2", split(col("c1"), "\\|").getItem(1)).select("patient_id", "col2").withColumn("diag_cd", explode(split($"col2", "\\,"))).select("patient_id", "diag_cd")
df1: org.apache.spark.sql.DataFrame = [patient_id: string, diag_cd: string]
scala> df1.collect()
res4: Array[org.apache.spark.sql.Row] = Array([1,y], [1,t], [1,k], [2,u], [2,t], [2,p])
I have created dummy data here for illustration. Note how the column is exploded above; here is the same statement split across lines:
scala> val df1 = sc.parallelize(data).toDF("c1").
| withColumn("patient_id", split(col("c1"), "\\|").getItem(0)).
| withColumn("col2", split(col("c1"), "\\|").getItem(1)).
| select("patient_id", "col2").
| withColumn("diag_cd", explode(split($"col2", "\\,"))).
| select("patient_id", "diag_cd")
df1: org.apache.spark.sql.DataFrame = [patient_id: string, diag_cd: string]
Now you can create df2 for file 2 using -
scala> val df2 = sc.textFile("file2.txt").map(x => (x.split(",")(0),x.split(",")(1))).toDF("diag_cd", "diag_desc")
df2: org.apache.spark.sql.DataFrame = [diag_cd: string, diag_desc: string]
Join df1 with df2 and filter as per the requirement (here, the description uen from the question).
df1.join(df2, df1.col("diag_cd") === df2.col("diag_cd")).filter(df2.col("diag_desc") === "uen").select(df1.col("patient_id")).collect()
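The explode-then-join idea can be traced on plain Scala collections, which makes the intermediate shape easy to inspect. This is a sketch, not Spark code: the names patients, descriptions, exploded and uenPatients are invented for illustration, and the lookup table is the Time T1 version of file2 from the question.

```scala
// file1: patient id -> comma-separated diagnosis codes
val patients = Seq(1 -> "y,t,k", 2 -> "u,t,p", 3 -> "u,t,k", 4 -> "f,o,k", 5 -> "e,o,u")

// file2 at time T1: diagnosis code -> description
val descriptions = Map(
  "y" -> "yen", "t" -> "ten", "k" -> "ken", "u" -> "uen",
  "p" -> "pen", "f" -> "fen", "o" -> "oen", "e" -> "een"
)

// "Explode": one (patient, code) pair per code
val exploded = for {
  (id, codes) <- patients
  code <- codes.split(",").toSeq
} yield (id, code.trim)

// "Join" on the code, then keep patients whose description is "uen"
val uenPatients = exploded.collect {
  case (id, code) if descriptions.get(code).contains("uen") => id
}.distinct

println(uenPatients) // List(2, 3, 5)
```

Matching on whole codes after the split avoids the false positives that a substring test on the comma-separated string could produce once codes are longer than one character.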

Spark dataframe convert string to timestamp - returns null for empty value

I have a Spark app that needs to convert a string to a timestamp. Below is my code.
val df = sc.parallelize(Seq("09/18/2017","")).toDF("sDate")
+----------+
| sDate|
+----------+
|09/18/2017|
| |
+----------+
val ts = unix_timestamp($"sDate","MM/dd/yyyy").cast("timestamp")
df.withColumn("ts", ts).show()
+----------+--------------------+
| sDate| ts|
+----------+--------------------+
|09/18/2017|2017-09-18 00:00:...|
| | null|
+----------+--------------------+
The conversion works, but if the value is empty I get null after casting.
Is there any way to return an empty value if the source value is empty?
You can use the when function as below:
import org.apache.spark.sql.functions._
val ts = unix_timestamp($"sDate","MM/dd/yyyy").cast("timestamp")
df.withColumn("ts", when(ts.isNotNull, ts).otherwise(lit("empty"))).show()
which gives the following output (the ts column now holds strings, since a timestamp column cannot hold the text "empty"):
+----------+-------------------+
| sDate| ts|
+----------+-------------------+
|09/18/2017|2017-09-18 00:00:00|
| | empty|
+----------+-------------------+
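The when/otherwise rule (the parsed timestamp if there is one, otherwise a placeholder) can be mirrored outside Spark with java.time. This is only a sketch of the logic: toTsOrEmpty is a hypothetical helper, and the result is text rather than a real timestamp type.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

val inFmt  = DateTimeFormatter.ofPattern("MM/dd/yyyy")
val outFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// Hypothetical helper: format the date at midnight if it parses,
// otherwise fall back to the literal "empty"
def toTsOrEmpty(s: String): String =
  Try(LocalDate.parse(s, inFmt))
    .map(_.atStartOfDay().format(outFmt))
    .getOrElse("empty")

println(toTsOrEmpty("09/18/2017")) // 2017-09-18 00:00:00
println(toTsOrEmpty(""))           // empty
```

Note that any unparseable input, not only the empty string, ends up as "empty" here; the same holds for a rule that tests only whether the converted value is null.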

Replace words in Data frame using List of words in another Data frame in Spark Scala

I have two DataFrames, say df1 and df2, in Spark Scala.
df1 has two fields, 'ID' and 'Text', where 'Text' holds a description (multiple words). I have already removed all special and numeric characters from 'Text', leaving only letters and spaces.
df1 Sample
+---+----------------+
|ID |Text            |
+---+----------------+
|1  |helo how are you|
|2  |hai haiden      |
|3  |hw are u uma    |
+---+----------------+
df2 contains a list of words and corresponding replacement words
df2 Sample
+----+-------+
|Word|Replace|
+----+-------+
|helo|hello  |
|hai |hi     |
|hw  |how    |
|u   |you    |
+----+-------+
I need to find every occurrence of the words in df2("Word") within df1("Text") and replace each with the corresponding df2("Replace").
With the sample DataFrames above, I would expect the resulting DataFrame, df3, shown below.
df3 Sample
+---+-----------------+
|ID |Text             |
+---+-----------------+
|1  |hello how are you|
|2  |hi haiden        |
|3  |how are you uma  |
+---+-----------------+
Your help in doing this in Spark using Scala is greatly appreciated.
It'd be easier to accomplish this if you convert your df2 to a Map. Assuming it's not a huge table, you can do the following :
val keyVal = df2.map( r =>( r(0).toString, r(1).toString ) ).collect.toMap
This will give you a Map to refer to :
scala.collection.immutable.Map[String,String] = Map(helo -> hello, hai -> hi, hw -> how, u -> you)
Now you can define a UDF that uses the keyVal Map to replace values:
val getVal = udf[String, String](x => x.split(" ").map(w => keyVal.getOrElse(w, w)).mkString(" "))
Now, you can call the udf getVal on your dataframe to get the desired result.
df1.withColumn("text" , getVal(df1("text")) ).show
+---+-----------------+
| id| text|
+---+-----------------+
| 1|hello how are you|
| 2| hi haiden|
| 3| how are you uma|
+---+-----------------+
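The body of the UDF is ordinary Scala and can be tried without Spark. The sketch below hard-codes the Map from the sample df2; replaceWords is a hypothetical name for the function that the UDF would wrap.

```scala
// Lookup built from the sample df2 (Word -> Replace)
val keyVal = Map("helo" -> "hello", "hai" -> "hi", "hw" -> "how", "u" -> "you")

// Replace whole words only: split on spaces, look each word up,
// and fall back to the original word when there is no entry
def replaceWords(text: String): String =
  text.split(" ").map(w => keyVal.getOrElse(w, w)).mkString(" ")

println(replaceWords("helo how are you")) // hello how are you
println(replaceWords("hai haiden"))       // hi haiden
println(replaceWords("hw are u uma"))     // how are you uma
```

Because the lookup works on whole words, "haiden" is left alone even though it starts with "hai", and "uma" keeps its "u".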
I will demonstrate only for the first id, and assume that you cannot do a collect action on your df2. First, make sure that the text column of df1 is an array:
+---+--------------------+
| id| text|
+---+--------------------+
| 1|[helo, how, are, ...|
+---+--------------------+
with schema like this:
|-- id: integer (nullable = true)
|-- text: array (nullable = true)
| |-- element: string (containsNull = true)
After that you can do an explode on the text column
res1.withColumn("text", explode(res1("text")))
+---+----+
| id|text|
+---+----+
| 1|helo|
| 1| how|
| 1| are|
| 1| you|
+---+----+
Assuming your replacement DataFrame looks like this:
+----+-------+
|word|replace|
+----+-------+
|helo| hello|
| hai| hi|
+----+-------+
Joining the two dataframe will look like this:
res6.join(res8, res6("text") === res8("word"), "left_outer")
+---+----+----+-------+
| id|text|word|replace|
+---+----+----+-------+
| 1| you|null| null|
| 1| how|null| null|
| 1|helo|helo| hello|
| 1| are|null| null|
+---+----+----+-------+
Do a select with coalescing null values:
res26.select(res26("id"), coalesce(res26("replace"), res26("text")).as("replaced_text"))
+---+-------------+
| id|replaced_text|
+---+-------------+
| 1| you|
| 1| how|
| 1| hello|
| 1| are|
+---+-------------+
and then group by id and aggregate with collect_list:
res33.groupBy("id").agg(collect_list("replaced_text"))
+---+---------------------------+
| id|collect_list(replaced_text)|
+---+---------------------------+
| 1| [you, how, hello,...|
+---+---------------------------+
Keep in mind that you should preserve the initial order of the text elements, since collect_list does not guarantee the order of the collected values after a shuffle.
I suppose the code below should solve your problem. I solved this using RDDs:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val wordRdd = df1.rdd.flatMap{ row =>
val wordList = row.getAs[String]("Text").split(" ").toList
wordList.map{word => Row.fromTuple(row.getAs[Int]("id"),word)}
}.zipWithIndex()
val wordDf = sqlContext.createDataFrame(wordRdd.map(x => Row.fromSeq(x._1.toSeq++Seq(x._2))),StructType(List(StructField("id",IntegerType),StructField("word",StringType),StructField("index",LongType))))
val opRdd = wordDf.join(df2,wordDf("word")===df2("word"),"left_outer").drop(df2("word")).rdd.groupBy(_.getAs[Int]("id")).map(x => Row.fromTuple(x._1,x._2.toList.sortBy(x => x.getAs[Long]("index")).map(row => if(row.getAs[String]("Replace")!=null) row.getAs[String]("Replace") else row.getAs[String]("word")).mkString(" ")))
val opDF = sqlContext.createDataFrame(opRdd,StructType(List(StructField("id",IntegerType),StructField("Text",StringType))))