Remove words from column if present in list - scala

I have a dataframe with a column 'text' which has many rows consisting of English sentences.
text
It is evening
Good morning
Hello everyone
What is your name
I'll see you tomorrow
I have a variable of type List which has some words such as
val removeList = List("Hello", "evening", "because", "is")
I want to remove all those words from column text which are present in removeList.
So my output should be
It
Good morning
everyone
What your name
I'll see you tomorrow
How can I do this using Spark Scala?
I wrote some code like this:
val stopWordsList = List("Hello", "evening", "because", "is");
val df3 = sqlContext.sql("SELECT text FROM table");
val df4 = df3.map(x => cleanText(x.mkString, stopWordsList));

def cleanText(x: String, stopWordsList: List[String]): Any = {
  for(str <- stopWordsList) {
    if(x.contains(str)) {
      x.replaceAll(str, "")
    }
  }
}
But I am getting these errors:
Error:(44, 12) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val df4 = df3.map(x => cleanText(x.mkString, stopWordsList));
Error:(44, 12) not enough arguments for method map: (implicit evidence$6: org.apache.spark.sql.Encoder[String])org.apache.spark.sql.Dataset[String].
Unspecified value parameter evidence$6.
val df4 = df3.map(x => cleanText(x.mkString, stopWordsList));
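Both errors come from the same root cause: cleanText is declared to return Any, so Spark cannot find an Encoder for the mapped Dataset (and the replaceAll results are discarded anyway, since String is immutable). A minimal sketch of a fix, assuming the implicits import is available:

import spark.implicits._ // or sqlContext.implicits._ on Spark 1.x; provides the Encoder[String]

// Return String instead of Any, and keep each replacement result via foldLeft
def cleanText(x: String, stopWordsList: List[String]): String =
  stopWordsList.foldLeft(x)((acc, w) => acc.replaceAll("\\b" + w + "\\b", ""))

val df4 = df3.map(row => cleanText(row.mkString, stopWordsList)) // Dataset[String]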

Check this DataFrame and RDD way.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField}

val df = Seq(("It is evening"),("Good morning"),("Hello everyone"),("What is your name"),("I'll see you tomorrow")).toDF("data")
val removeList = List("Hello", "evening", "because", "is")

val rdd2 = df.rdd.map { x =>
  val p = x.getAs[String]("data")
  val k = removeList.foldLeft(p)((p, t) => p.replaceAll("\\b" + t + "\\b", ""))
  Row(x(0), k)
}

spark.createDataFrame(rdd2, df.schema.add(StructField("new1", StringType))).show(false)
Output:
+---------------------+---------------------+
|data |new1 |
+---------------------+---------------------+
|It is evening |It |
|Good morning |Good morning |
|Hello everyone | everyone |
|What is your name |What your name |
|I'll see you tomorrow|I'll see you tomorrow|
+---------------------+---------------------+
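If you would rather stay in the DataFrame API than round-trip through the RDD, the same foldLeft can be wrapped in a UDF — a minimal sketch, assuming spark.implicits._ is in scope for the $ syntax:

import org.apache.spark.sql.functions.udf

// Remove whole words from the "data" column using the same word-boundary regex
val removeWords = udf { (s: String) =>
  removeList.foldLeft(s)((acc, w) => acc.replaceAll("\\b" + w + "\\b", ""))
}
df.withColumn("new1", removeWords($"data")).show(false)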

This code works for me.
Spark version 2.3.0, Scala version 2.11.8.
Using Datasets
import org.apache.spark.sql.SparkSession

val data = List(
  "It is evening",
  "Good morning",
  "Hello everyone",
  "What is your name",
  "I'll see you tomorrow"
)
val removeList = List("Hello", "evening", "because", "is")

val spark = SparkSession.builder.master("local[*]").appName("test").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._

def cleanText(text: String, removeList: List[String]): String =
  removeList.fold(text) {
    case (text, termToRemove) => text.replaceAllLiterally(termToRemove, "")
  }

val df1 = sc.parallelize(data).toDS // Dataset[String]
val df2 = df1.map(text => cleanText(text, removeList)) // Dataset[String]
Using DataFrames
import org.apache.spark.sql.SparkSession

val data = List(
  "It is evening",
  "Good morning",
  "Hello everyone",
  "What is your name",
  "I'll see you tomorrow"
)
val removeList = List("Hello", "evening", "because", "is")

val spark = SparkSession.builder.master("local[*]").appName("test").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._

def cleanText(text: String, removeList: List[String]): String =
  removeList.fold(text) {
    case (text, termToRemove) => text.replaceAllLiterally(termToRemove, "")
  }

// Creates a temp table.
sc.parallelize(data).toDF("text").createTempView("table")

val df1 = spark.sql("SELECT text FROM table") // DataFrame = [text: string]
val df2 = df1.map(row => cleanText(row.getAs[String](fieldName = "text"), removeList)).toDF("text") // DataFrame = [text: string]
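Note that replaceAllLiterally removes the stop word as a raw substring, so removing "is" would also strip it out of the middle of longer words and can leave double spaces, whereas the "\b" regex in the RDD answer only matches whole words. If whole-word removal on whitespace-separated tokens is enough, a simpler variant of cleanText could look like this (an alternative sketch, not the answer's code):

// Split on whitespace, drop tokens that are in removeList, and re-join with single spaces
def cleanTextWholeWords(text: String, removeList: List[String]): String =
  text.split("\\s+").filterNot(w => removeList.contains(w)).mkString(" ")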

Related

scala spark dataframe modify column with udf return value

I have a Spark dataframe which has a timestamp field and I want to convert it to the long datatype. I used a UDF and the standalone code works fine, but when I plug it into a generic piece of logic where any timestamp needs to be converted, I am not able to get it working. The issue is how I can assign the return value from the UDF back to the dataframe column.
Below is the code snippet:
val spark: SparkSession = SparkSession.builder().master("local[*]").appName("Test3").getOrCreate()
import org.apache.spark.sql.functions._
val sqlContext = spark.sqlContext

val df2 = sqlContext.jsonRDD(spark.sparkContext.parallelize(Array(
  """{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": "","manufacture_ts":"2017-10-16 00:00:00"}""",
  """{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": "","manufacture_ts":"2017-10-16 00:00:00"}"""
)))

val convertTimeStamp = udf { (manTs: java.sql.Timestamp) =>
  manTs.getTime
}

df2.withColumn("manufacture_ts", convertTimeStamp(df2("manufacture_ts"))).show
+-----+----------+-----+--------------+-----+----+
|blank|   comment| make|manufacture_ts|model|year|
+-----+----------+-----+--------------+-----+----+
|     |No Comment|Tesla| 1508126400000|    S|2012|
|     |   Get one| Ford| 1508126400000| E350|1997|
|     |          |Chevy| 1508126400000| Volt|2015|
+-----+----------+-----+--------------+-----+----+
Now I want to invoke this from a dataframe so it is called on all columns which are of type long.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object Test4 extends App {
  val spark: SparkSession = SparkSession.builder().master("local[*]").appName("Test").getOrCreate()
  import spark.implicits._
  import scala.collection.JavaConversions._

  val long: Long = "1508299200000".toLong
  val data = Seq(Row("10000020_LUX_OTC", long, "2020-02-14"))
  val schema = List(StructField("rowkey", StringType, true)
    , StructField("order_receipt_dt", LongType, true)
    , StructField("maturity_dt", StringType, true))

  val dataDF = spark.createDataFrame(spark.sparkContext.parallelize(data), StructType(schema))

  val modifedDf2 = schema.foldLeft(dataDF) { case (newDF, StructField(name, dataType, flag, metadata)) =>
    newDF.withColumn(name, DataTypeUtil.transformLong(newDF, name, dataType.typeName))
  }
  modifedDf2.show
}
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.udf

val convertTimeStamp = udf { (manTs: java.sql.Timestamp) =>
  manTs.getTime
}

def transformLong(dataFrame: DataFrame, name: String, fieldType: String): Column = {
  import org.apache.spark.sql.functions._
  fieldType.toLowerCase match {
    case "timestamp" => convertTimeStamp(dataFrame(name))
    case _           => dataFrame.col(name)
  }
}
Maybe your UDF crashed because the timestamp is null. You can:
use unix_timestamp instead of a UDF, or make your UDF null-safe
only apply it on fields which need to be converted.
Given the data:
import java.sql.Timestamp
import java.time.LocalDateTime

import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType

val df = Seq(
  (1L, Timestamp.valueOf(LocalDateTime.now()), Timestamp.valueOf(LocalDateTime.now()))
).toDF("id", "ts1", "ts2")
you can do:
val newDF = df.schema.fields.filter(_.dataType == TimestampType).map(_.name)
.foldLeft(df)((df,field) => df.withColumn(field,unix_timestamp(col(field))))
newDF.show()
which gives:
+---+----------+----------+
| id| ts1| ts2|
+---+----------+----------+
| 1|1589109282|1589109282|
+---+----------+----------+
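If you would rather keep a UDF, the null-safe variant mentioned above could wrap the input in Option — a sketch, not tested against the original schema:

import org.apache.spark.sql.functions.{col, udf}

// Option(null) becomes None, which Spark writes back as a null value
// instead of throwing a NullPointerException inside the UDF
val convertTimeStampSafe = udf { (manTs: java.sql.Timestamp) =>
  Option(manTs).map(_.getTime)
}
df.withColumn("ts1", convertTimeStampSafe(col("ts1"))).show()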

spark Scala RDD to DataFrame Date format

Would you be able to help with this Spark problem statement?
Data -
empno|ename|designation|manager|hire_date|sal|deptno
7369|SMITH|CLERK|9902|2010-12-17|800.00|20
7499|ALLEN|SALESMAN|9698|2011-02-20|1600.00|30
Code:
val rawrdd = spark.sparkContext.textFile("C:\\Users\\cmohamma\\data\\delta scenarios\\emp_20191010.txt")
val refinedRDD = rawrdd.map( lines => {
  val fields = lines.split("\\|")
  (fields(0).toInt, fields(1), fields(2), fields(3).toInt, fields(4).toDate, fields(5).toFloat, fields(6).toInt)
})
Problem statement - fields(4).toDate is not working. What is the alternative, or what is the correct usage?
What have I tried?
Tried replacing it with to_date(col(fields(4)), "yyy-MM-dd") - not working.
2.
Step 1.
val refinedRDD = rawrdd.map( lines => {
val fields = lines.split("\\|")
(fields(0),fields(1),fields(2),fields(3),fields(4),fields(5),fields(6))
})
Now these tuples are all strings.
Step 2.
val mySchema = StructType(Seq(StructField("empno", IntegerType, true), StructField("ename", StringType, true), StructField("designation", StringType, true), StructField("manager", IntegerType, true), StructField("hire_date", DateType, true), StructField("sal", DoubleType, true), StructField("deptno", IntegerType, true)))
Step 3. converting the string tuples to Rows
val rowRDD = refinedRDD.map(attributes => Row(attributes._1, attributes._2, attributes._3, attributes._4, attributes._5 , attributes._6, attributes._7))
Step 4.
val empDF = spark.createDataFrame(rowRDD, mySchema)
This is also not working and gives an error related to types. To solve this I changed step 1 to:
(fields(0).toInt,fields(1),fields(2),fields(3).toInt,fields(4),fields(5).toFloat,fields(6).toInt)
Now this is giving an error for the date type column and I am back at the main problem.
Use case - use the textFile API and convert this to a dataframe using a custom schema (StructType) on top of it.
This can be done using a case class, but with a case class I would also be stuck where I would need to do fields(4).toDate (I know I can cast string to date later in the code, but I would prefer a solution to the above problem if possible).
You can use the following code snippet
import org.apache.spark.sql.functions.to_timestamp
scala> val df = spark.read.format("csv").option("header", "true").option("delimiter", "|").load("gs://otif-etl-input/test.csv")
df: org.apache.spark.sql.DataFrame = [empno: string, ename: string ... 5 more fields]
scala> val ts = to_timestamp($"hire_date", "yyyy-MM-dd")
ts: org.apache.spark.sql.Column = to_timestamp(`hire_date`, 'yyyy-MM-dd')
scala> val enriched_df = df.withColumn("ts", ts).show(2, false)
+-----+-----+-----------+-------+----------+-------+----------+-------------------+
|empno|ename|designation|manager|hire_date |sal |deptno |ts |
+-----+-----+-----------+-------+----------+-------+----------+-------------------+
|7369 |SMITH|CLERK |9902 |2010-12-17|800.00 |20 |2010-12-17 00:00:00|
|7499 |ALLEN|SALESMAN |9698 |2011-02-20|1600.00|30 |2011-02-20 00:00:00|
+-----+-----+-----------+-------+----------+-------+----------+-------------------+
enriched_df: Unit = ()
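Since the custom schema in the question declares hire_date as DateType rather than TimestampType, to_date may be a closer fit than to_timestamp — a minimal variant of the same idea, assuming Spark 2.2+ for the two-argument to_date:

import org.apache.spark.sql.functions.{col, to_date}

// Parse the yyyy-MM-dd string into a proper DateType column
val withDate = df.withColumn("hire_date_dt", to_date(col("hire_date"), "yyyy-MM-dd"))
withDate.printSchema // hire_date_dt: date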
There are multiple ways to cast your data to proper data types.
First : use InferSchema
val df = spark.read.option("delimiter", "|").option("header", true).option("inferSchema", "true").csv(path)
df.printSchema
Sometimes it does not work as expected.
Second : provide your own Datatype conversion template
val rawDF = Seq(("7369", "SMITH" , "2010-12-17", "800.00"), ("7499", "ALLEN","2011-02-20", "1600.00")).toDF("empno", "ename","hire_date", "sal")
//define schema in DF , hire_date as Date
val schemaDF = Seq(("empno", "INT"), ("ename", "STRING"), ("hire_date", "date"), ("sal", "double")).toDF("columnName", "columnType")
rawDF.printSchema
//fetch schema details
val dataTypes = schemaDF.select("columnName", "columnType")
val listOfElements = dataTypes.collect.map(_.toSeq.toList)
//creating a map friendly template
val validationTemplate = (c: Any, t: Any) => {
  val column = c.asInstanceOf[String]
  val typ = t.asInstanceOf[String]
  col(column).cast(typ)
}
//Apply datatype conversion template on rawDF
val convertedDF = rawDF.select(listOfElements.map(element => validationTemplate(element(0), element(1))): _*)
println("Conversion done!")
convertedDF.show()
convertedDF.printSchema
Third : Case Class
Create schema from caseclass with ScalaReflection and provide this customized schema while loading DF.
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types._
case class MySchema(empno: Int, ename: String, hire_date: java.sql.Date, sal: Double)
val schema = ScalaReflection.schemaFor[MySchema].dataType.asInstanceOf[StructType]
val rawDF = spark.read.schema(schema).option("header", "true").load(path)
rawDF.printSchema
Hope this will help.
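As a side note, if you want to keep the textFile/Row approach from the question, String has no toDate, but java.sql.Date.valueOf parses yyyy-MM-dd strings directly — a minimal sketch, assuming the header line is dropped and mySchema is the StructType from the question (DateType for hire_date, DoubleType for sal):

import java.sql.Date
import org.apache.spark.sql.Row

val rowRDD = rawrdd
  .filter(!_.startsWith("empno")) // drop the header line
  .map { line =>
    val f = line.split("\\|")
    Row(f(0).toInt, f(1), f(2), f(3).toInt, Date.valueOf(f(4)), f(5).toDouble, f(6).toInt)
  }
val empDF = spark.createDataFrame(rowRDD, mySchema)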

How to convert matrix to DataFrame with scala/spark?

I have a matrix, and the number of columns and rows is unknown.
One example Matrix is:
[5,1.3]
[1,5.2]
I want to convert it to a DataFrame; the column names can be arbitrary. How do I achieve it?
This is my expected result:
+-------------+----+
| _1          | _2 |
+-------------+----+
| 5           | 1.3|
| 1           | 5.2|
+-------------+----+
I suggest you convert the matrix to an RDD and then convert the RDD to a DataFrame. It is not an elegant way, but it works fine in Spark 2.0.0.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.mllib.linalg._
import org.apache.spark.rdd.RDD

object mat2df {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("mat2df").setMaster("local[1]")
    val sc = new SparkContext(conf)
    val values = Array(5, 1, 1.3, 5.2)
    val mat = Matrices.dense(2, 2, values).asInstanceOf[DenseMatrix]

    def toRDD(m: Matrix): RDD[Vector] = {
      val columns = m.toArray.grouped(m.numRows)
      val rows = columns.toSeq.transpose
      val vectors = rows.map(row => new DenseVector(row.toArray))
      sc.parallelize(vectors)
    }

    val mat_rows = toRDD(mat) // matrix to rdd
    val mat_rdd = mat_rows.map(_.toArray).map { case Array(p0, p1) => (p0, p1) }

    val spark: SparkSession = SparkSession.builder.master("local").getOrCreate
    val df = spark.createDataFrame(mat_rdd) // rdd to dataframe
    df.show()
  }
}
Another approach is a generic helper that builds one DataFrame row per matrix column:
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

def matrixToDataFrame(sc: SparkContext, matrix: Matrix, m_nodeColName: String): DataFrame = {
  // each matrix column becomes one Row
  val rdd = sc.parallelize(matrix.colIter.toSeq).map(x => Row.fromSeq(x.toArray.toSeq))
  val sqlContext = new SQLContext(sc)
  var schema = new StructType()
  val ids = ArrayBuffer[String]()
  for (i <- 0 until matrix.rowIter.size) {
    schema = schema.add(StructField(m_nodeColName + "_" + i.toString, DoubleType, true))
    ids.append(m_nodeColName + "_" + i.toString)
  }
  sqlContext.sparkSession.createDataFrame(rdd, schema)
}
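A hypothetical call of the helper above, assuming a SparkSession named spark and a 2x2 DenseMatrix like the one in the question:

import org.apache.spark.mllib.linalg.Matrices

val m = Matrices.dense(2, 2, Array(5.0, 1.0, 1.3, 5.2)) // column-major: columns (5, 1) and (1.3, 5.2)
val matDF = matrixToDataFrame(spark.sparkContext, m, "col")
matDF.show() // one row per matrix column, columns col_0 and col_1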

Scala - Groupby and Max on pair RDD

I am new to Spark Scala and want to find the max salary in each department.
Dept,Salary
Dept1,1000
Dept2,2000
Dept1,2500
Dept2,1500
Dept1,1700
Dept2,2800
I implemented the code below:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object MaxSalary {
  val sc = new SparkContext(new SparkConf().setAppName("Max Salary").setMaster("local[2]"))

  case class Dept(dept_name: String, Salary: Int)

  val data = sc.textFile("file:///home/user/Documents/dept.txt").map(_.split(","))
  val recs = data.map(r => (r(0), Dept(r(0), r(1).toInt)))
  val a = recs.max()???????
}
but I am stuck on how to implement the group by and max functions. I am using a pair RDD.
Thanks
This can be done using RDDs with the following code:
val emp = sc.textFile("file:///home/user/Documents/dept.txt")
.mapPartitionsWithIndex( (idx, row) => if(idx==0) row.drop(1) else row )
.map(x => (x.split(",")(0).toString, x.split(",")(1).toInt))
val maxSal = emp.reduceByKey(math.max(_,_))
Should give you:
Array[(String, Int)] = Array((Dept1,2500), (Dept2,2800))
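If you want to keep the (dept, Dept) pair RDD from the question, reduceByKey can compare the case class values directly — a minimal sketch, assuming the header line has been dropped as in the answer above:

// Keep, per department, the Dept record with the highest Salary
val maxByDept = recs.reduceByKey((a, b) => if (a.Salary >= b.Salary) a else b)
maxByDept.collect() // e.g. (Dept1, Dept(Dept1, 2500)), (Dept2, Dept(Dept2, 2800))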
If you are using Datasets, here is the solution:
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.max

case class Dept(dept_name: String, Salary: Int)

val sc = new SparkContext(new SparkConf().setAppName("Max Salary").setMaster("local[2]"))
val sq = new SQLContext(sc)
import sq.implicits._

val file = "resources/ip.csv"
val data = sc.textFile(file).map(_.split(","))
val recs = data.map(r => Dept(r(0), r(1).toInt)).toDS()

recs.groupBy($"dept_name").agg(max("Salary").alias("max_solution")).show()
Output:
+---------+------------+
|dept_name|max_solution|
+---------+------------+
| Dept2| 2800|
| Dept1| 2500|
+---------+------------+

Spark - Make dataframe with multi column csv

origin.csv
no,key1,key2,key3,key4,key5,...
1,A1,B1,C1,D1,E1,..
2,A2,B2,C2,D2,E2,..
3,A3,B3,C3,D3,E3,..
WhatIwant.csv
1,A1,key1
1,B1,key2
1,C1,key3
...
3,A3,key1
3,B3,key2
...
I loaded the csv with the read method (the origin.csv dataframe), but I am unable to convert it.
val df = spark.read
.option("header", true)
.option("charset", "euc-kr")
.csv(csvFilePath)
Any idea how to do this?
Try this.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
val df = Seq((1,"A1","B1","C1","D1"), (2,"A2","B2","C2","D2"), (3,"A3","B3","C3","D2")).toDF("no", "key1", "key2","key3","key4")
df.show
def myUDF(df: DataFrame, by: Seq[String]): DataFrame = {
  val (columns, types) = df.dtypes.filter { case (clm, _) => !by.contains(clm) }.unzip
  require(types.distinct.size == 1)

  val keys = explode(array(
    columns.map(clm => struct(lit(clm).alias("key"), col(clm).alias("val"))): _*
  ))

  val byValue = by.map(col(_))

  df.select(byValue :+ keys.alias("_key"): _*).select(byValue ++ Seq($"_key.val", $"_key.key"): _*)
}
val df1 = myUDF(df, Seq("no"))
df1.show
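The same unpivot can also be written with Spark SQL's stack function via selectExpr — an alternative sketch for the fixed four key columns:

// stack(n, name1, col1, ...) emits one output row per (name, value) pair
val df2 = df.selectExpr(
  "no",
  "stack(4, 'key1', key1, 'key2', key2, 'key3', key3, 'key4', key4) as (key, value)"
).select("no", "value", "key") // matches the no,value,key layout of WhatIwant.csv
df2.show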