I have a DataFrame with a column 'text' that has many rows of English sentences.
It is evening
Good morning
Hello everyone
What is your name
I'll see you tomorrow
I have a variable of type List which has some words such as
val removeList = List("Hello", "evening", "because", "is")
I want to remove all those words from column text which are present in removeList.
So my output should be
Good morning
What your name
I'll see you tomorrow
How can I do this using Spark Scala?
I wrote some code like this:
val stopWordsList = List("Hello", "evening", "because", "is");
val df3 = sqlContext.sql("SELECT text FROM table");
val df4 = df3.map(x => cleanText(x.mkString, stopWordsList));
def cleanText(x:String, stopWordsList:List[String]):Any = {
  for(str <- stopWordsList) {
    if(x.contains(str)) {
      x.replaceAll(str, "")
    }
  }
}
But I am getting this error:
Error:(44, 12) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val df4 = df3.map(x => cleanText(x.mkString, stopWordsList));
Error:(44, 12) not enough arguments for method map: (implicit evidence$6: org.apache.spark.sql.Encoder[String])org.apache.spark.sql.Dataset[String].
Unspecified value parameter evidence$6.
val df4 = df3.map(x => cleanText(x.mkString, stopWordsList));
Try this approach, which uses both the DataFrame and its underlying RDD.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField}
import spark.implicits._
val df = Seq("It is evening", "Good morning", "Hello everyone", "What is your name", "I'll see you tomorrow").toDF("data")
val removeList = List("Hello", "evening", "because", "is")
// Strip each listed word, matched on word boundaries, from the "data" column.
val rdd2 = df.rdd.map { x =>
  val p = x.getAs[String]("data")
  val k = removeList.foldLeft(p)((acc, t) => acc.replaceAll("\\b" + t + "\\b", ""))
  Row(x(0), k)
}
spark.createDataFrame(rdd2, df.schema.add(StructField("new1", StringType))).show(false)
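+---------------------+---------------------+
|data                 |new1                 |
+---------------------+---------------------+
|It is evening        |It                   |
|Good morning         |Good morning         |
|Hello everyone       | everyone            |
|What is your name    |What your name       |
|I'll see you tomorrow|I'll see you tomorrow|
+---------------------+---------------------+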
This code works for me.
Spark version 2.3.0, Scala version 2.11.8.
Using Datasets
import org.apache.spark.sql.SparkSession
val data = List(
"It is evening",
"Good morning",
"Hello everyone",
"What is your name",
"I'll see you tomorrow"
)
val removeList = List("Hello", "evening", "because", "is")
val spark = SparkSession.builder.master("local[*]").appName("test").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._
def cleanText(text: String, removeList: List[String]): String =
  removeList.fold(text) {
    case (text, termToRemove) => text.replaceAllLiterally(termToRemove, "")
  }
val df1 = sc.parallelize(data).toDS // Dataset[String]
val df2 = df1.map(text => cleanText(text, removeList)) // Dataset[String]
Using DataFrames
import org.apache.spark.sql.SparkSession
val data = List(
"It is evening",
"Good morning",
"Hello everyone",
"What is your name",
"I'll see you tomorrow"
)
val removeList = List("Hello", "evening", "because", "is")
val spark = SparkSession.builder.master("local[*]").appName("test").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._
def cleanText(text: String, removeList: List[String]): String =
  removeList.fold(text) {
    case (text, termToRemove) => text.replaceAllLiterally(termToRemove, "")
  }
// Creates a temp table.
sc.parallelize(data).toDF("text").createOrReplaceTempView("table")
val df1 = spark.sql("SELECT text FROM table") // DataFrame = [text: string]
val df2 = df1.map(row => cleanText(row.getAs[String](fieldName = "text"), removeList)).toDF("text") // DataFrame = [text: string]
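One difference between the answers above is worth spelling out: replaceAllLiterally removes substrings, so a term such as "is" would also be cut out of "This", while the \\b version only removes whole words but leaves extra spaces behind. The following is a minimal sketch in plain Scala (no Spark) of a helper that combines word-boundary matching with whitespace cleanup. The object name CleanText and the whitespace normalization are assumptions for illustration, not part of the original answers. Because it returns String rather than Any, it can be passed to Dataset.map once spark.implicits._ is imported.

```scala
import java.util.regex.Pattern

object CleanText {
  // Removes every term in removeList from text, matching whole words only,
  // then collapses the whitespace left behind by the removed words.
  def cleanText(text: String, removeList: List[String]): String = {
    val stripped = removeList.foldLeft(text) { (acc, term) =>
      // Pattern.quote keeps terms containing regex metacharacters literal.
      acc.replaceAll("\\b" + Pattern.quote(term) + "\\b", "")
    }
    stripped.trim.replaceAll("\\s+", " ")
  }
}
```

With the removeList from the question, CleanText.cleanText("What is your name", removeList) yields "What your name", and a sentence such as "This island" is left unchanged.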
I have a Spark DataFrame with a timestamp field that I want to convert to the long data type. I used a UDF, and the standalone code works fine, but when I plug it into generic logic where any timestamp column needs to be converted, I am not able to get it working. The issue is how to assign the return value from the UDF back to the DataFrame column.
Below is the code snippet:
val spark: SparkSession = SparkSession.builder().master("local[*]").appName("Test3").getOrCreate();
import org.apache.spark.sql.functions._
val sqlContext = spark.sqlContext
val df2 = sqlContext.jsonRDD(spark.sparkContext.parallelize(Array(
"""{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": "","manufacture_ts":"2017-10-16 00:00:00"}""",
"""{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": "","manufacture_ts":"2017-10-16 00:00:00"}""",
val convertTimeStamp = udf { (manTs :java.sql.Timestamp) =>
| |No Comment|Tesla| 1508126400000| S|2012|
| | Get one| Ford| 1508126400000| E350|1997|
| | |Chevy| 1508126400000| Volt|2015|
Now I want to invoke this on a DataFrame so that it is called on all columns which are of type long:
object Test4 extends App{
val spark: SparkSession = SparkSession.builder().master("local[*]").appName("Test").getOrCreate();
import spark.implicits._
import scala.collection.JavaConversions._
val long : Long = "1508299200000".toLong
val data = Seq(Row("10000020_LUX_OTC",long,"2020-02-14"))
val schema = List( StructField("rowkey",StringType,true)
val dataDF = spark.createDataFrame(spark.sparkContext.parallelize(data),StructType(schema))
val modifedDf2= schema.foldLeft(dataDF) { case (newDF,StructField(name,dataType,flag,metadata)) =>
val convertTimeStamp = udf { (manTs :java.sql.Timestamp) =>
def transformLong(dataFrame: DataFrame,name:String, fieldType:String):Column = {
import org.apache.spark.sql.functions._
fieldType.toLowerCase match {
case "timestamp" => convertTimeStamp(dataFrame(name))
case _ => dataFrame.col(name)
Maybe your UDF crashes when the timestamp is null. You can:
use unix_timestamp instead of a UDF, or make your UDF null-safe;
apply the conversion only to the fields that need to be converted.
Given the data:
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType
val df = Seq(
you can do:
val newDF = df.schema.fields.filter(_.dataType == TimestampType).map(_.name)
.foldLeft(df)((df,field) => df.withColumn(field,unix_timestamp(col(field))))
which gives:
| id| ts1| ts2|
| 1|1589109282|1589109282|
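If you prefer to keep a UDF, the null-safe variant mentioned above could look roughly like the sketch below. It is plain Scala so the logic can be checked without a cluster; the object and method names are hypothetical. Note that it returns epoch milliseconds (as in the question's output), whereas unix_timestamp returns seconds.

```scala
import java.sql.Timestamp

object NullSafeTimestamp {
  // Epoch milliseconds for a timestamp, or None when the value is null.
  // Option(ts) turns a null reference into None instead of throwing.
  def toEpochMillis(ts: Timestamp): Option[Long] =
    Option(ts).map(_.getTime)
}
```

Wrapped with udf(NullSafeTimestamp.toEpochMillis _), a None result becomes a null in the resulting column, so rows with a missing timestamp no longer crash the job.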
Could you help with this Spark problem?
Data -
val rawrdd = spark.sparkContext.textFile("C:\\Users\\cmohamma\\data\\delta scenarios\\emp_20191010.txt")
val refinedRDD = rawrdd.map( lines => {
val fields = lines.split("\\|")
(fields(0).toInt,fields(1),fields(2),fields(3).toInt,fields(4).toDate,fields(5).toFloat,fields(6).toInt)
})
Problem statement: fields(4).toDate does not work. What is the alternative, or what is the correct usage?
What have I tried?
I tried replacing it with to_date(col(fields(4)), "yyy-MM-dd"), which does not work either.
Step 1.
val refinedRDD = rawrdd.map( lines => {
val fields = lines.split("\\|")
Now these tuples are all strings.
Step 2.
mySchema = StructType(StructField(empno,IntegerType,true), StructField(ename,StringType,true), StructField(designation,StringType,true), StructField(manager,IntegerType,true), StructField(hire_date,DateType,true), StructField(sal,DoubleType,true), StructField(deptno,IntegerType,true))
Step 3. converting the string tuples to Rows
val rowRDD = refinedRDD.map(attributes => Row(attributes._1, attributes._2, attributes._3, attributes._4, attributes._5 , attributes._6, attributes._7))
Step 4.
val empDF = spark.createDataFrame(rowRDD, mySchema)
This also does not work and gives a type-related error. To solve this, I changed step 1 to:
Now this gives an error for the date-type column, and I am back at the main problem.
Use case: read the file with the textFile API and convert it to a DataFrame by applying a custom schema (StructType) on top of it.
This could be done with a case class, but there too I would be stuck needing fields(4).toDate. (I know I can cast the string to a date later in the code, but I would like to know whether the problem above can be solved directly.)
You can use the following code snippet
import org.apache.spark.sql.functions.to_timestamp
scala> val df = spark.read.format("csv").option("header", "true").option("delimiter", "|").load("gs://otif-etl-input/test.csv")
df: org.apache.spark.sql.DataFrame = [empno: string, ename: string ... 5 more fields]
scala> val ts = to_timestamp($"hire_date", "yyyy-MM-dd")
ts: org.apache.spark.sql.Column = to_timestamp(`hire_date`, 'yyyy-MM-dd')
scala> val enriched_df = df.withColumn("ts", ts).show(2, false)
|empno|ename|designation|manager|hire_date |sal |deptno |ts |
|7369 |SMITH|CLERK |9902 |2010-12-17|800.00 |20 |2010-12-17 00:00:00|
|7499 |ALLEN|SALESMAN |9698 |2011-02-20|1600.00|30 |2011-02-20 00:00:00|
enriched_df: Unit = ()
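If you want to stay with textFile and parse each line yourself, one option is java.sql.Date.valueOf, which parses a yyyy-MM-dd string into the java.sql.Date type that Spark maps to DateType. The sketch below is plain Scala; the Emp case class and the sample line are assumptions built from the field names and values shown in this thread, and sal is parsed with toDouble so that it would match a DoubleType column.

```scala
import java.sql.Date

object ParseEmpLine {
  // Hypothetical record type mirroring the seven pipe-delimited fields.
  final case class Emp(empno: Int, ename: String, designation: String,
                       manager: Int, hireDate: Date, sal: Double, deptno: Int)

  // Splits one pipe-delimited line and converts each field to its target type.
  def parse(line: String): Emp = {
    val f = line.split("\\|")
    Emp(f(0).toInt, f(1), f(2), f(3).toInt, Date.valueOf(f(4)), f(5).toDouble, f(6).toInt)
  }
}
```

In the RDD pipeline this would be used as rawrdd.map(ParseEmpLine.parse). Date.valueOf throws IllegalArgumentException on input that is not in yyyy-MM-dd form, so malformed lines need separate handling.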
There are multiple ways to cast your data to proper data types.
First : use InferSchema
val df = spark.read.option("delimiter", "|").option("header", true).option("inferSchema", "true").csv(path)
Sometimes it doesn't work as expected.
Second : provide your own Datatype conversion template
val rawDF = Seq(("7369", "SMITH" , "2010-12-17", "800.00"), ("7499", "ALLEN","2011-02-20", "1600.00")).toDF("empno", "ename","hire_date", "sal")
//define schema in DF , hire_date as Date
val schemaDF = Seq(("empno", "INT"), ("ename", "STRING"), ("hire_date", "date"), ("sal", "double")).toDF("columnName", "columnType")
//fetch schema details
val dataTypes = schemaDF.select("columnName", "columnType")
val listOfElements = dataTypes.collect.map(_.toSeq.toList)
//creating a map friendly template
val validationTemplate = (c: Any, t: Any) => {
  val column = c.asInstanceOf[String]
  val typ = t.asInstanceOf[String]
  // Cast the named column to the type given in the template.
  col(column).cast(typ)
}
//Apply datatype conversion template on rawDF
val convertedDF = rawDF.select(listOfElements.map(element => validationTemplate(element(0), element(1))): _*)
println("Conversion done!")
Third : Case Class
Create the schema from a case class with ScalaReflection and provide this customized schema while loading the DataFrame.
import java.sql.Date
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types._
case class MySchema(empno: Int, ename: String, hire_date: Date, sal: Double)
val schema = ScalaReflection.schemaFor[MySchema].dataType.asInstanceOf[StructType]
val rawDF = spark.read.schema(schema).option("header", "true").csv(path)
Hope this will help.
I have a matrix whose number of rows and columns is unknown.
One example matrix is:
I want to convert it to a DataFrame; the column names can be arbitrary. How can I achieve this?
This is my expected result:
| _1 | _2 |
|5 |1.3 |
|1 |5.2 |
I suggest you convert the matrix to an RDD and then convert the RDD to a DataFrame. It is not an elegant way, but it works fine in Spark 2.0.0.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.mllib.linalg._
import org.apache.spark.rdd.RDD
object mat2df {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("mat2df").setMaster("local[1]")
val sc = new SparkContext(conf)
val values = Array(5, 1, 1.3, 5.2)
val mat = Matrices.dense(2, 2, values).asInstanceOf[DenseMatrix]
def toRDD(m: Matrix): RDD[Vector] = {
  // Matrix.toArray is column-major: group by numRows to get the columns.
  val columns = m.toArray.grouped(m.numRows)
  val rows = columns.toSeq.transpose
  val vectors = rows.map(row => new DenseVector(row.toArray))
  sc.parallelize(vectors)
}
val mat_rows = toRDD(mat)// matrix to rdd
val mat_rdd = mat_rows.map(_.toArray).map{case Array(p0, p1) => (p0, p1)}
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate
val df = spark.createDataFrame(mat_rdd) // rdd to dataframe
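The key step in toRDD is turning the column-major value array into rows. As a rough illustration, here is the same grouped-then-transpose logic in plain Scala without Spark; the object and method names are made up for this sketch.

```scala
object ColumnMajorRows {
  // values holds the matrix in column-major order (as Matrices.dense expects):
  // grouping by numRows yields the columns, and transposing yields the rows.
  def rows(values: Array[Double], numRows: Int): Seq[Seq[Double]] =
    values.grouped(numRows).map(_.toSeq).toSeq.transpose
}
```

For the values Array(5, 1, 1.3, 5.2) with two rows, this gives the rows (5.0, 1.3) and (1.0, 5.2), which matches the expected result in the question.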
def matrixToDataFrame(sc:SparkContext, matrix:Matrix, m_nodeColName:String):DataFrame={
val rdd = sc.parallelize(matrix.colIter.toSeq).map(x => {
val sc = new SQLContext(nodeContext.getSparkCtx())
var schema = new StructType()
val ids = ArrayBuffer[String]()
for (i <- 0 until matrix.rowIter.size) {
schema = schema.add(StructField(m_nodeColName +"_"+ i.toString(), DoubleType, true))
ids.append(m_nodeColName +"_"+ i.toString())
sc.sparkSession.createDataFrame(rdd, schema)
I am new to Spark with Scala and want to find the max salary in each department.
I implemented the code below:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object MaxSalary {
val sc = new SparkContext(new SparkConf().setAppName("Max Salary").setMaster("local[2]"))
case class Dept(dept_name : String, Salary : Int)
val data = sc.textFile("file:///home/user/Documents/dept.txt").map(_.split(","))
val recs = data.map(r => (r(0), Dept(r(0), r(1).toInt)))
val a = recs.max()???????
But I am stuck on how to implement the group by and max functions. I am using a pair RDD.
This can be done using RDDs with the following code:
val emp = sc.textFile("file:///home/user/Documents/dept.txt")
.mapPartitionsWithIndex( (idx, row) => if(idx==0) row.drop(1) else row )
.map(x => (x.split(",")(0).toString, x.split(",")(1).toInt))
val maxSal = emp.reduceByKey(math.max(_,_))
Should give you:
Array[(String, Int)] = Array((Dept1,2500), (Dept2,2800))
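To see what reduceByKey(math.max(_,_)) computes, it may help to look at the same reduction on an ordinary Scala collection. This is only a local sketch of the semantics, not how Spark executes it, and the names below are hypothetical.

```scala
object MaxByKey {
  // Keeps, for every key, the largest value seen so far, which is the
  // result reduceByKey(math.max(_, _)) produces on a pair RDD.
  def maxByKey(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.foldLeft(Map.empty[String, Int]) { case (acc, (dept, salary)) =>
      acc.updated(dept, acc.get(dept).fold(salary)(math.max(_, salary)))
    }
}
```

For example, with made-up input Seq(("Dept1", 1000), ("Dept1", 2500), ("Dept2", 2800), ("Dept2", 1500)), the result maps Dept1 to 2500 and Dept2 to 2800.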
If you are using a Dataset, here is a solution:
case class Dept(dept_name : String, Salary : Int)
val sc = new SparkContext(new SparkConf().setAppName("Max Salary").setMaster("local[2]"))
val sq = new SQLContext(sc)
import sq.implicits._
val file = "resources/ip.csv"
val data = sc.textFile(file).map(_.split(","))
val recs = data.map(r => Dept(r(0), r(1).toInt )).toDS()
| Dept2| 2800|
| Dept1| 2500|
I loaded a CSV with the read method (the origin.csv DataFrame), but I am unable to convert it.
val df = spark.read
.option("header", true)
.option("charset", "euc-kr")
Any ideas?
Try this.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
val df = Seq((1,"A1","B1","C1","D1"), (2,"A2","B2","C2","D2"), (3,"A3","B3","C3","D2")).toDF("no", "key1", "key2","key3","key4")
def myUDF(df: DataFrame, by: Seq[String]): DataFrame = {
  // Columns to unpivot: everything that is not an identifier column.
  val (columns, types) = df.dtypes.filter{ case (clm, _) => !by.contains(clm)}.unzip
  // All unpivoted columns must share one type to fit into a single array.
  require(types.distinct.size == 1)
  val keys = explode(array(
    columns.map(clm => struct(lit(clm).alias("key"),col(clm).alias("val"))): _*
  ))
  val byValue = by.map(col(_))
  df.select(byValue :+ keys.alias("_key"): _*).select(byValue ++ Seq($"_key.val", $"_key.key"): _*)
}
val df1 = myUDF(df, Seq("no"))
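The function above reshapes the data from wide to long: each non-identifier column becomes its own row, carrying the value and the name of the column it came from. As an illustration of that reshaping only, here is a plain Scala sketch; the Wide and LongRow types are hypothetical and are not part of the answer.

```scala
object Melt {
  // One wide record: an id plus its named columns, in column order.
  final case class Wide(no: Int, cols: Seq[(String, String)])
  // One long record: the id, a single value, and the column it came from.
  final case class LongRow(no: Int, value: String, key: String)

  // Emits one LongRow per (record, column) pair, preserving column order.
  def melt(rows: Seq[Wide]): Seq[LongRow] =
    for {
      r <- rows
      (k, v) <- r.cols
    } yield LongRow(r.no, v, k)
}
```

Applied to the first sample row (1, A1, B1, C1, D1), this produces four rows: (1, A1, key1), (1, B1, key2), (1, C1, key3), and (1, D1, key4), in the same no, val, key column order that myUDF selects.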