I have a variable of type Map[String, Set[String]]:
val metadata = Map("a" -> Set("b", "c"))
val colToUse = "existingcol" // Option[String]
I am trying to add a new column to my DataFrame using metadata and colToUse, which is the name of an existing column in the DataFrame.
Each key of metadata is a string that appears as a value of that column, and each value of metadata is a Set of Strings.
For example:
val metadata = Map("mike" -> Set("physics", "chemistry"))
val colToUse = "student_name" // student_name is a column name in df
"mike" will be a value of the "student_name" column.
I am trying to add a new column to the existing DF that holds the subjects of each student, looked up from metadata by student_name:
myDF.withColumn("subjects", metadata.getOrElse(col(colToUse), Set.empty))
The above will not work in Scala, because withColumn accepts only a Column as its second argument.
I tried using a UDF:
def logic: (Map[String, Set[String]], String) => Set[String] =
  (metadata: Map[String, Set[String]], colToUse: String) => {
    metadata.getOrElse(colToUse, Set("a"))
  }
def myUDF = udf(logic)
def getVal: Column = { myUDF(metadata, col(colToUse.get)) }
and using it in withColumn:
myDF.withColumn("newCol", getVal(metadata, colToUse))
I get the error: Unsupported literal type class scala.Tuple2
What is the simplest way to approach this?
Issue 2: In getVal, a Column is expected for the metadata argument, but I am passing a Map.
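Is something like this what you need?
val spark = SparkSession.builder().master("local[1]").getOrCreate()
// a single-row test DataFrame (Row is org.apache.spark.sql.Row)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("mike"))),
  StructType(List(StructField("student_name", StringType)))
)
First, the test dataframe:
+------------+
|student_name|
+------------+
|        mike|
+------------+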
And now, create the udf that uses the map:
val metadata = Map("mike" -> Set("physics", "chemistry"))
val colToUse = "student_name"
def createUdf =
udf((key: String) => metadata.getOrElse(key, Set.empty))
and use it in the withColumn function:
df.withColumn("subjects", createUdf(col(colToUse))).show()
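It gives:
+------------+--------------------+
|student_name|            subjects|
+------------+--------------------+
|        mike|[physics, chemistry]|
+------------+--------------------+
Am I missing something?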
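The lookup that the UDF wraps is ordinary Scala, so it can be sketched and checked without Spark. The following is a minimal sketch, not part of the answer above: the object name SubjectLookup, the choice to return a sorted Seq instead of a Set, and the handling of a null key as "no subjects" are all assumptions made for illustration.

```scala
object SubjectLookup {
  // Same example data as in the answer above.
  val metadata: Map[String, Set[String]] =
    Map("mike" -> Set("physics", "chemistry"))

  // Look up the subjects for one student name.
  // Option(name) turns a null key (a null cell in the column) into None,
  // and unknown names fall back to an empty list.
  def subjectsFor(name: String): Seq[String] =
    Option(name).flatMap(metadata.get).map(_.toSeq.sorted).getOrElse(Seq.empty)
}
```

With Spark on the classpath, this function could be wrapped as udf(SubjectLookup.subjectsFor _) and applied to col(colToUse) in the same way as createUdf. Because the map is captured by the function rather than passed as an argument, no Column is needed for it.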
Could you help with this Spark problem?
Data:
val rawrdd = spark.sparkContext.textFile("C:\\Users\\cmohamma\\data\\delta scenarios\\emp_20191010.txt")
val refinedRDD = rawrdd.map( lines => {
  val fields = lines.split("\\|")
  (fields(0).toInt, fields(1), fields(2), fields(3).toInt, fields(4).toDate, fields(5).toFloat, fields(6).toInt)
})
Problem statement: fields(4).toDate does not work. What is the alternative, or what is the correct usage?
What have I tried?
I tried replacing it with to_date(col(fields(4)), "yyy-MM-dd"), which does not work either.
Step 1.
val refinedRDD = rawrdd.map( lines => {
  val fields = lines.split("\\|")
  (fields(0), fields(1), fields(2), fields(3), fields(4), fields(5), fields(6))
})
Now the tuples are all strings.
Step 2.
val mySchema = StructType(Seq(StructField("empno", IntegerType, true), StructField("ename", StringType, true), StructField("designation", StringType, true), StructField("manager", IntegerType, true), StructField("hire_date", DateType, true), StructField("sal", DoubleType, true), StructField("deptno", IntegerType, true)))
Step 3. Convert the string tuples to Rows:
val rowRDD = refinedRDD.map(attributes => Row(attributes._1, attributes._2, attributes._3, attributes._4, attributes._5 , attributes._6, attributes._7))
Step 4.
val empDF = spark.createDataFrame(rowRDD, mySchema)
This also does not work and gives type errors. To solve this I changed step 1 to convert each field to its target type, as in the first snippet above.
That again fails on the date column, and I am back at the main problem.
Use case: read the file with the textFile API and convert it to a DataFrame with a custom schema (StructType) on top of it.
This can be done with a case class, but there too I would be stuck needing something like fields(4).toDate. (I know I can cast the string to a date later in the code, but I would like to know whether the approach above can be made to work.)
You can use the following code snippet
import org.apache.spark.sql.functions.to_timestamp
scala> val df = spark.read.format("csv").option("header", "true").option("delimiter", "|").load("gs://otif-etl-input/test.csv")
df: org.apache.spark.sql.DataFrame = [empno: string, ename: string ... 5 more fields]
scala> val ts = to_timestamp($"hire_date", "yyyy-MM-dd")
ts: org.apache.spark.sql.Column = to_timestamp(`hire_date`, 'yyyy-MM-dd')
scala> val enriched_df = df.withColumn("ts", ts).show(2, false)
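+-----+-----+-----------+-------+----------+-------+------+-------------------+
|empno|ename|designation|manager|hire_date |sal    |deptno|ts                 |
+-----+-----+-----------+-------+----------+-------+------+-------------------+
|7369 |SMITH|CLERK      |9902   |2010-12-17|800.00 |20    |2010-12-17 00:00:00|
|7499 |ALLEN|SALESMAN   |9698   |2011-02-20|1600.00|30    |2011-02-20 00:00:00|
+-----+-----+-----------+-------+----------+-------+------+-------------------+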
enriched_df: Unit = ()
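If you prefer to stay with textFile and build typed rows yourself, one option is to parse the date inside the map with java.sql.Date.valueOf, which accepts strings in yyyy-MM-dd form and produces the value type Spark uses for DateType. The following is a minimal sketch in plain Scala, assuming seven pipe-separated fields in the order shown in the table above; the names EmpParser, Emp and parse are hypothetical, and returning None for malformed lines is a design choice made for the example.

```scala
import java.sql.Date
import scala.util.Try

object EmpParser {
  // Field names follow the column names shown in the output above.
  final case class Emp(empno: Int, ename: String, designation: String,
                       manager: Int, hireDate: Date, sal: Double, deptno: Int)

  // Parse one pipe-delimited line; malformed lines yield None instead of throwing.
  def parse(line: String): Option[Emp] = {
    val f = line.split("\\|", -1).map(_.trim)
    if (f.length != 7) None
    else Try(
      Emp(f(0).toInt, f(1), f(2), f(3).toInt,
          Date.valueOf(f(4)), // expects yyyy-MM-dd
          f(5).toDouble, f(6).toInt)
    ).toOption
  }
}
```

In the RDD version this would be used as rawrdd.flatMap(line => EmpParser.parse(line)), after which each Emp can be turned into a Row matching the custom schema.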
There are multiple ways to cast your data to proper data types.
First: use inferSchema
val df = spark.read.option("delimiter", "|").option("header", true).option("inferSchema", "true").csv(path)
Sometimes it doesn't work as expected.
Second : provide your own Datatype conversion template
val rawDF = Seq(("7369", "SMITH" , "2010-12-17", "800.00"), ("7499", "ALLEN","2011-02-20", "1600.00")).toDF("empno", "ename","hire_date", "sal")
//define schema in DF , hire_date as Date
val schemaDF = Seq(("empno", "INT"), ("ename", "STRING"), ("hire_date", "date"), ("sal", "double")).toDF("columnName", "columnType")
//fetch schema details
val dataTypes = schemaDF.select("columnName", "columnType")
val listOfElements = dataTypes.collect.map(_.toSeq.toList)
//creating a map friendly template
val validationTemplate = (c: Any, t: Any) => {
  val column = c.asInstanceOf[String]
  val typ = t.asInstanceOf[String]
  col(column).cast(typ) // cast the named column to its target type
}
//Apply datatype conversion template on rawDF
val convertedDF = rawDF.select(listOfElements.map(element => validationTemplate(element(0), element(1))): _*)
println("Conversion done!")
Third : Case Class
Create schema from caseclass with ScalaReflection and provide this customized schema while loading DF.
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types._
case class MySchema(empno: Int, ename: String, hire_date: java.sql.Date, sal: Double)
val schema = ScalaReflection.schemaFor[MySchema].dataType.asInstanceOf[StructType]
val rawDF = spark.read.schema(schema).option("header", "true").csv(path)
Hope this will help.
I am quite new to Spark. I have an input JSON file which I am reading as
val df = spark.read.json("/Users/user/Desktop/resource.json");
The contents of resource.json look like this:
Is there any way to process this dataframe and aggregate the result as
Map<key, List<data>>
where data is each JSON object in which the key is present?
For example, the expected result is
Map<key1 =[{"path":"path1","key":"key1","region":"region1"}, {"path":"path112","key":"key1","region":"region1"}] ,
key2 = [{"path":"path22","key":"key2","region":"region1"}]>
Any reference, document, or link to proceed further would be a great help.
Thank you.
Here is what you can do:
import org.json4s._
import org.json4s.jackson.Serialization.read
case class cC(path: String, key: String, region: String)
val df = spark.read.json("/Users/user/Desktop/resource.json");
scala> df.show
+----+-------+-------+
| key|   path| region|
+----+-------+-------+
|key1|  path1|region1|
|key2| path22|region1|
+----+-------+-------+
// Note that the original JSON structure is gone. Use .toJSON to get the JSON back, extract the key from each JSON string, and build an RDD[(String, String)] of (key, json) pairs.
val rdd = df.toJSON.rdd.map(m => {
  implicit val formats = DefaultFormats
  val parsedObj = read[cC](m)
  (parsedObj.key, m)
})
scala> rdd.collect.groupBy(_._1).map(m => (m._1,m._2.map(_._2).toList))
res39: scala.collection.immutable.Map[String,List[String]] = Map(key2 -> List({"key":"key2","path":"path22","region":"region1"}), key1 -> List({"key":"key1","path":"path1","region":"region1"}, {"key":"key1","path":"path112","region":"region1"}))
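The grouping step runs on the driver after collect, so it is a plain Scala collection operation and can be tried without Spark. The sketch below shows the same shape of result on an in-memory list; the names GroupRecords, Resource and byKey are hypothetical and the records are the ones from the question's expected output.

```scala
object GroupRecords {
  // One parsed JSON object, with the three fields used in the question.
  final case class Resource(path: String, key: String, region: String)

  // Group records by key; within each group the input order is preserved.
  def byKey(records: Seq[Resource]): Map[String, List[Resource]] =
    records.groupBy(_.key).map { case (k, rs) => k -> rs.toList }
}
```

Because collect brings every record to the driver, this only suits data that fits in driver memory.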
You can use groupBy with collect_list, which is an aggregation function that collects all matching values into a list per key.
Notice that the original JSON strings are already "gone" (Spark parses them into individual columns), so if you really want a list of all records (with all their columns, including the key), you can use the struct function to combine columns into one column:
import org.apache.spark.sql.functions._
import spark.implicits._
df.groupBy($"key")
  .agg(collect_list(struct($"path", $"key", $"region")) as "value")
The result would be:
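+----+--------------------------------------------------+
|key |value                                             |
+----+--------------------------------------------------+
|key1|[[path1, key1, region1], [path112, key1, region1]]|
|key2|[[path22, key2, region1]]                         |
+----+--------------------------------------------------+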
I can load data from a database and do some processing on it.
The problem is that some tables have the date column as a String, while others treat it as a Timestamp.
I cannot know the type of the date column until the data is loaded.
x.getAs[String]("date") // may fail when the date column is a timestamp
x.getAs[Timestamp]("date") // may fail when the date column is a string
This is how I load the data in Spark:
.option("url", url)
.option("dbtable", table)
.option("user", user)
.option("password", password)
Is there any way to treat them the same way, or to always convert the column to a string?
You can pattern-match on the type of the column (using the DataFrame's schema) to decide whether to parse the String into a Timestamp or just use the Timestamp as is - and use the unix_timestamp function to do the actual conversion:
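import java.sql.Timestamp
import java.text.SimpleDateFormat
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
import spark.implicits._ // assumes a SparkSession named spark; needed for toDF and $"..."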
// preparing some example data - df1 with String type and df2 with Timestamp type
val df1 = Seq(("a", "2016-02-01"), ("b", "2016-02-02")).toDF("key", "date")
val df2 = Seq(
("a", new Timestamp(new SimpleDateFormat("yyyy-MM-dd").parse("2016-02-01").getTime)),
("b", new Timestamp(new SimpleDateFormat("yyyy-MM-dd").parse("2016-02-02").getTime))
).toDF("key", "date")
// If column is String, converts it to Timestamp
def normalizeDate(df: DataFrame): DataFrame = {
  df.schema("date").dataType match {
    case StringType => df.withColumn("date", unix_timestamp($"date", "yyyy-MM-dd").cast("timestamp"))
    case _ => df
  }
}
// after "normalizing", you can assume date has Timestamp type -
// both would print the same thing:
normalizeDate(df1).rdd.map(r => r.getAs[Timestamp]("date")).foreach(println)
normalizeDate(df2).rdd.map(r => r.getAs[Timestamp]("date")).foreach(println)
Here are a few things you can try:
(1) Use the inferSchema option during load if your version supports it. Spark will then try to figure out the data type of each column, although this doesn't work in all scenarios. Also look at the input data: if it contains quotes, I advise adding an extra option to account for them during the load.
val inputDF = spark.read.format("csv").option("header","true").option("inferSchema","true").load(fileLocation)
(2) To identify the data type of each column you can use the code below; it places the column names and data types into their own Arrays of Strings.
val columnNames : Array[String] = inputDF.columns
val columnDataTypes : Array[String] = inputDF.schema.fields.map(x=>x.dataType).map(x=>x.toString)
Row has an easy way to address this: get(i: Int): Any. It maps Spark SQL types to the corresponding return types automatically, e.g.
val fieldIndex = row.fieldIndex("date")
val date = row.get(fieldIndex)
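Since get returns Any, the caller still has to decide what to do with each possible runtime type. A minimal sketch of that step in plain Scala is shown below, assuming string dates are either yyyy-MM-dd or yyyy-MM-dd HH:mm:ss; the names DateValue and toTimestamp are hypothetical, and returning None for anything unparseable is a choice made for the example.

```scala
import java.sql.Timestamp
import scala.util.Try

object DateValue {
  // Normalize whatever row.get returned for the date column into a Timestamp.
  def toTimestamp(value: Any): Option[Timestamp] = value match {
    case ts: Timestamp    => Some(ts)
    case d: java.sql.Date => Some(new Timestamp(d.getTime))
    case s: String =>
      val trimmed = s.trim
      // Timestamp.valueOf needs a time part, so add midnight to date-only strings.
      val full = if (trimmed.length == 10) trimmed + " 00:00:00" else trimmed
      Try(Timestamp.valueOf(full)).toOption
    case _ => None // null or an unexpected type
  }
}
```

Usage would be DateValue.toTimestamp(row.get(fieldIndex)), which gives the same result whether the source table stored the column as a string or as a timestamp.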
def parseLocationColumn(df: DataFrame): DataFrame = {
  df.schema("location").dataType match {
    case StringType => df.withColumn("locationTemp", $"location")
      .withColumn("countryTemp", lit("Unknown"))
      .withColumn("regionTemp", lit("Unknown"))
      .withColumn("zoneTemp", lit("Unknown"))
    case _ => df.withColumn("locationTemp", $"location.location")
      .withColumn("countryTemp", $"location.country")
      .withColumn("regionTemp", $"location.region")
      .withColumn("zoneTemp", $"location.zone")
  }
}
I am trying to figure out how to map a SchemaRDD object that I retrieved from a sql HiveContext over to a PairRDDFunctions[String, Vector] object where the string value is the name column in the schemaRDD and the rest of the columns (BytesIn, BytesOut, etc...) are the vector.
Assuming you have columns: "name", "bytesIn", "bytesOut"
val schemaRDD: SchemaRDD = ...
val pairs: RDD[(String, (Long, Long))] =
  schemaRDD.select("name", "bytesIn", "bytesOut").rdd.map {
    case Row(name: String, bytesIn: Long, bytesOut: Long) =>
      (name, (bytesIn, bytesOut))
  }
// To import PairRDDFunctions via implicits
import org.apache.spark.SparkContext._
pairs.groupByKey ... etc
I am trying to run random forest classification using the Spark ML API, but I am having trouble creating the right DataFrame input for the pipeline.
Here is sample data:
age and hours_per_week are integers, while the other features and the label salaryRange are categorical (String).
Loading this csv file (let's call it sample.csv) can be done with the Spark CSV library like this:
val data = sqlContext.csvFile("/home/dusan/sample.csv")
By default all columns are imported as string so we need to change "age" and "hours_per_week" to Int:
val toInt = udf[Int, String]( _.toInt)
val dataFixed = data.withColumn("age", toInt(data("age"))).withColumn("hours_per_week",toInt(data("hours_per_week")))
Just to check how schema looks now:
scala> dataFixed.printSchema
|-- age: integer (nullable = true)
|-- hours_per_week: integer (nullable = true)
|-- education: string (nullable = true)
|-- sex: string (nullable = true)
|-- salaryRange: string (nullable = true)
Then let's set up the cross validator and pipeline:
val rf = new RandomForestClassifier()
val pipeline = new Pipeline().setStages(Array(rf))
val cv = new CrossValidator().setNumFolds(10).setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator)
Error shows up when running this line:
val cmModel = cv.fit(dataFixed)
java.lang.IllegalArgumentException: Field "features" does not exist.
It is possible to set the label column and the feature column in RandomForestClassifier; however, I have 4 columns as predictors (features), not just one.
How should I organize my data frame so that it has correctly organized label and features columns?
For your convenience, here is the full code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.CrossValidator
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.linalg.{Vector, Vectors}
object SampleClassification {
def main(args: Array[String]): Unit = {
//set spark context
val conf = new SparkConf().setAppName("Simple Application").setMaster("local");
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import com.databricks.spark.csv._
//load data by using databricks "Spark CSV Library"
val data = sqlContext.csvFile("/home/dusan/sample.csv")
//by default all columns are imported as string so we need to change "age" and "hours_per_week" to Int
val toInt = udf[Int, String]( _.toInt)
val dataFixed = data.withColumn("age", toInt(data("age"))).withColumn("hours_per_week",toInt(data("hours_per_week")))
val rf = new RandomForestClassifier()
val pipeline = new Pipeline().setStages(Array(rf))
val cv = new CrossValidator().setNumFolds(10).setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator)
    // this fails with error
    //java.lang.IllegalArgumentException: Field "features" does not exist.
    val cmModel = cv.fit(dataFixed)
  }
}
Thanks for help!
As of Spark 1.4, you can use the Transformer org.apache.spark.ml.feature.VectorAssembler.
Just provide the names of the columns you want to use as features.
val assembler = new VectorAssembler()
  .setInputCols(Array("col1", "col2", "col3"))
  .setOutputCol("features")
and add it to your pipeline.
You simply need to make sure that you have a "features" column in your dataframe that is of type VectorUDT. Renaming an existing integer column is not enough, as the following attempt shows:
scala> val df2 = dataFixed.withColumnRenamed("age", "features")
df2: org.apache.spark.sql.DataFrame = [features: int, hours_per_week: int, education: string, sex: string, salaryRange: string]
scala> val cmModel = cv.fit(df2)
java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.mllib.linalg.VectorUDT#1eef but was actually IntegerType.
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:37)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:50)
at org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:118)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:164)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:164)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:164)
at org.apache.spark.ml.tuning.CrossValidator.transformSchema(CrossValidator.scala:142)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:59)
at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:107)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:67)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:72)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:74)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:76)
Essentially, there need to be two fields in your data frame: "features" for the feature vector and "label" for the instance labels. The label must be of type Double.
To create a "features" field with Vector type, first create a udf as shown below:
val toVec4 = udf[Vector, Int, Int, String, String] { (a, b, c, d) =>
  val e3 = c match {
    case "hs-grad" => 0
    case "bachelors" => 1
    case "masters" => 2
  }
  val e4 = d match { case "male" => 0 case "female" => 1 }
  Vectors.dense(a, b, e3, e4)
}
Now to also encode the "label" field, create another udf as shown below:
val encodeLabel = udf[Double, String]( _ match { case "A" => 0.0 case "B" => 1.0} )
Now we transform the original dataframe using these two udfs:
val df = dataFixed.withColumn(
  "features",
  toVec4(dataFixed("age"), dataFixed("hours_per_week"), dataFixed("education"), dataFixed("sex"))
).withColumn("label", encodeLabel(dataFixed("salaryRange"))).select("features", "label")
Note that there can be extra columns / fields present in the dataframe, but in this case I have selected only features and label:
scala> df.show()
+-------------------+-----+
|           features|label|
+-------------------+-----+
|[38.0,40.0,0.0,0.0]|  0.0|
|[28.0,40.0,1.0,1.0]|  0.0|
|[52.0,45.0,0.0,0.0]|  1.0|
|[31.0,50.0,2.0,1.0]|  1.0|
|[42.0,40.0,1.0,0.0]|  1.0|
+-------------------+-----+
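The pattern matches inside the two udfs throw a MatchError on any category that is not listed. One way to make the encoding explicit and testable outside Spark is to keep the category-to-index tables as Maps, as in the sketch below. This is an illustration only: the names FeatureEncoding, encode and encodeLabel are hypothetical, and returning None for unknown categories is an assumption about how you want to treat them.

```scala
object FeatureEncoding {
  // Category tables matching the cases used in the udfs above.
  val educationIndex: Map[String, Double] =
    Map("hs-grad" -> 0.0, "bachelors" -> 1.0, "masters" -> 2.0)
  val sexIndex: Map[String, Double] = Map("male" -> 0.0, "female" -> 1.0)
  val labelIndex: Map[String, Double] = Map("A" -> 0.0, "B" -> 1.0)

  // Build the four-element feature array, or None if a category is unknown.
  def encode(age: Int, hoursPerWeek: Int, education: String, sex: String): Option[Array[Double]] =
    for {
      e <- educationIndex.get(education)
      s <- sexIndex.get(sex)
    } yield Array(age.toDouble, hoursPerWeek.toDouble, e, s)

  // Encode the label the same way.
  def encodeLabel(salaryRange: String): Option[Double] = labelIndex.get(salaryRange)
}
```

Inside a udf the resulting array could be passed to Vectors.dense, with rows that encode to None filtered out or mapped to a default beforehand.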
Now it's up to you to set the correct parameters for your learning algorithm to make it work.
According to the Spark documentation on MLlib random forests, it seems to me that you should define the features map that you are using, and the points should be LabeledPoint instances.
This tells the algorithm which column should be used as the prediction target and which ones are the features.