Create spark data frame from custom data format - scala

I have a text file that uses the string REC as the record delimiter and a line break as the column delimiter. Each value has its column name attached, separated by a comma. Below is a sample of the data format:
REC
Id,19048
Term,milk
Rank,1
REC
Id,19049
Term,corn
Rank,5
REC is used as the record delimiter. Now I want to create a Spark data frame with the column names Id, Term and Rank. Please assist me with this.

Here is working code:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
object RecordSeparator extends App {
  val conf = new SparkConf().setAppName("test").setMaster("local[1]")
  val sc = new SparkContext(conf)
  // Split the input on "REC" instead of on line breaks
  val hconf = new Configuration
  hconf.set("textinputformat.record.delimiter", "REC")
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._ // required for toDF()
  val data = sc.newAPIHadoopFile("data.txt",
      classOf[TextInputFormat], classOf[LongWritable],
      classOf[Text], hconf).map(x => x._2.toString.trim).filter(x => x != "")
    .map(x => getRecord(x)).map(x => x.split(","))
    .map(x => record(x(0), x(1), x(2))) // Id, Term, Rank
  val df = data.toDF()
  df.printSchema()
  df.show(false)
  // Turns "Id,19048\nTerm,milk\nRank,1" into "19048,milk,1"
  def getRecord(in: String): String = {
    val ar = in.split("\n").mkString(",").split(",")
    val data = Array(ar(1), ar(3), ar(5))
    data.mkString(",")
  }
}
case class record(Id: String, Term: String, Rank: String)
Output:
root
 |-- Id: string (nullable = true)
 |-- Term: string (nullable = true)
 |-- Rank: string (nullable = true)
+-----+----+----+
|Id   |Term|Rank|
+-----+----+----+
|19048|milk|1   |
|19049|corn|5   |
+-----+----+----+

Supposing you have your file on the "normal" filesystem (not HDFS), you can write a file parser, use sc.parallelize to create an RDD, and then convert it to a DataFrame:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable
object Demo extends App {
val conf = new SparkConf().setMaster("local[1]").setAppName("Demo")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
case class Record(
var id:Option[Int] = None,
var term:Option[String] = None,
var rank:Option[Int] = None)
val filename = "data.dat"
val records = readFile(filename)
val df = sc.parallelize(records).toDF
df.printSchema()
df.show()
def readFile(filename:String) : Seq[Record] = {
import scala.io.Source
val records = mutable.ArrayBuffer.empty[Record]
var currentRecord: Record = null
for (line <- Source.fromFile(filename).getLines) {
val tokens = line.split(',')
currentRecord = tokens match {
case Array("REC") => Record()
case Array("Id", id) => {
currentRecord.id = Some(id.toInt); currentRecord
}
case Array("Term", term) => {
currentRecord.term = Some(term); currentRecord
}
case Array("Rank", rank) => {
currentRecord.rank = Some(rank.toInt); records += currentRecord;
null
}
}
}
return records
}
}
This gives:
root
|-- id: integer (nullable = true)
|-- term: string (nullable = true)
|-- rank: integer (nullable = true)
+-----+----+----+
| id|term|rank|
+-----+----+----+
|19048|milk| 1|
|19049|corn| 5|
+-----+----+----+
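The parsing step itself does not depend on Spark. As a minimal sketch, assuming the whole file fits in memory as one string and that REC always appears alone on its own line, the records can also be built without the mutable currentRecord. The names Rec and RecParser are hypothetical and exist only for this illustration.

```scala
// Hypothetical record type for this sketch (not part of the original answers)
final case class Rec(id: Int, term: String, rank: Int)

object RecParser {
  // Splits the text on lines that contain only "REC", then turns each
  // "name,value" line of a block into an entry of a Map.
  def parse(text: String): Seq[Rec] =
    text.split("(?m)^REC$").toSeq
      .map(_.trim)
      .filter(_.nonEmpty)
      .map { block =>
        val fields = block.split("\n").map(_.trim).filter(_.nonEmpty).map { line =>
          val Array(key, value) = line.split(",", 2)
          key -> value
        }.toMap
        // Throws if a block lacks one of the three expected names
        Rec(fields("Id").toInt, fields("Term"), fields("Rank").toInt)
      }
}
```

The resulting sequence could then be handed to sc.parallelize(...).toDF in the same way as the records value above. A block with a missing or non-numeric field makes this sketch throw, so real input may need extra validation.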

How to get Schema as a Spark Dataframe from a Nested Structured Spark DataFrame

I have a sample Dataframe that I create using below code
val data = Seq(
Row(20.0, "dog"),
Row(3.5, "cat"),
Row(0.000006, "ant")
)
val schema = StructType(
List(
StructField("weight", DoubleType, true),
StructField("animal_type", StringType, true)
)
)
val df = spark.createDataFrame(
spark.sparkContext.parallelize(data),
schema
)
val actualDF = df.withColumn(
"animal_interpretation",
struct(
(col("weight") > 5).as("is_large_animal"),
col("animal_type").isin("rat", "cat", "dog").as("is_mammal")
)
)
actualDF.show(false)
+------+-----------+---------------------+
|weight|animal_type|animal_interpretation|
+------+-----------+---------------------+
|20.0 |dog |[true,true] |
|3.5 |cat |[false,true] |
|6.0E-6|ant |[false,false] |
+------+-----------+---------------------+
The schema of this Spark DF can be printed using -
scala> actualDF.printSchema
root
|-- weight: double (nullable = true)
|-- animal_type: string (nullable = true)
|-- animal_interpretation: struct (nullable = false)
| |-- is_large_animal: boolean (nullable = true)
| |-- is_mammal: boolean (nullable = true)
However, I would like to get this schema in the form of a dataframe that has 3 columns: field, type and nullable. The output dataframe built from the schema would look something like this:
+-------------------------------------+-------+--------+
|field                                |type   |nullable|
+-------------------------------------+-------+--------+
|weight                               |double |true    |
|animal_type                          |string |true    |
|animal_interpretation                |struct |false   |
|animal_interpretation.is_large_animal|boolean|true    |
|animal_interpretation.is_mammal      |boolean|true    |
+-------------------------------------+-------+--------+
How can I achieve this in Spark? I am using Scala for coding.
Here is a complete example including your code. I used the somewhat common flattenSchema method for matching like Shankar did to traverse the Struct but rather than having this method return the flattened schema I used an ArrayBuffer to aggregate the datatypes of the StructType and returned the ArrayBuffer. I then turned the ArrayBuffer into a Sequence and finally, using Spark, converted the Sequence to a DataFrame.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, DoubleType, StringType}
import org.apache.spark.sql.functions.{struct, col}
import scala.collection.mutable.ArrayBuffer
val data = Seq(
Row(20.0, "dog"),
Row(3.5, "cat"),
Row(0.000006, "ant")
)
val schema = StructType(
List(
StructField("weight", DoubleType, true),
StructField("animal_type", StringType, true)
)
)
val df = spark.createDataFrame(
spark.sparkContext.parallelize(data),
schema
)
val actualDF = df.withColumn(
"animal_interpretation",
struct(
(col("weight") > 5).as("is_large_animal"),
col("animal_type").isin("rat", "cat", "dog").as("is_mammal")
)
)
var fieldStructs = new ArrayBuffer[(String, String, Boolean)]()
def flattenSchema(schema: StructType, fieldStructs: ArrayBuffer[(String, String, Boolean)], prefix: String = null): ArrayBuffer[(String, String, Boolean)] = {
schema.fields.foreach(field => {
val col = if (prefix == null) field.name else (prefix + "." + field.name)
field.dataType match {
case st: StructType => {
fieldStructs += ((col, field.dataType.typeName, field.nullable))
flattenSchema(st, fieldStructs, col)
}
case _ => {
fieldStructs += ((col, field.dataType.simpleString, field.nullable))
}
}}
)
fieldStructs
}
val foo = flattenSchema(actualDF.schema, fieldStructs).toSeq.toDF("field", "type", "nullable")
foo.show(false)
If you run the above you should get the following.
+-------------------------------------+-------+--------+
|field |type |nullable|
+-------------------------------------+-------+--------+
|weight |double |true |
|animal_type |string |true |
|animal_interpretation |struct |false |
|animal_interpretation.is_large_animal|boolean|true |
|animal_interpretation.is_mammal |boolean|true |
+-------------------------------------+-------+--------+
You could do something like this
def flattenSchema(schema: StructType, prefix: String = null) : Seq[(String, String, Boolean)] = {
schema.fields.flatMap(field => {
val col = if (prefix == null) field.name else (prefix + "." + field.name)
field.dataType match {
case st: StructType => flattenSchema(st, col)
case _ => Array((col, field.dataType.simpleString, field.nullable))
}
})
}
flattenSchema(actualDF.schema).toDF("field", "type", "nullable").show()
Hope this helps!
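Note that the flatMap version above emits only the leaf fields, so the animal_interpretation struct row from the requested output is missing. A minimal sketch that keeps the recursive, buffer-free style but also emits a row for each struct could look like this. The object name SchemaRows and the Option-based prefix are illustrative choices, not part of either answer.

```scala
import org.apache.spark.sql.types.{StructField, StructType}

object SchemaRows {
  // One (field, type, nullable) row per field; a struct contributes its own
  // row first, followed by the rows of its children.
  def flatten(schema: StructType, prefix: Option[String] = None): Seq[(String, String, Boolean)] =
    schema.fields.toSeq.flatMap { (field: StructField) =>
      val name = prefix.fold(field.name)(p => s"$p.${field.name}")
      field.dataType match {
        case st: StructType =>
          (name, st.typeName, field.nullable) +: flatten(st, Some(name))
        case other =>
          Seq((name, other.simpleString, field.nullable))
      }
    }
}
```

With spark.implicits._ in scope, SchemaRows.flatten(actualDF.schema).toDF("field", "type", "nullable") should then give the same five rows as the ArrayBuffer-based answer.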

Best approach for parsing a large structured file with Apache Spark

I have a huge text file (several GB) with plain text data on each line, which needs to be parsed and extracted into a structure for further processing. Each line is 200 characters long, and I have a regular expression that parses each line and splits it into groups, which will later be saved as flat column data.
data sample
1759387ACD06JAN1910MAR191234567ACRT
RegExp
(.{7})(.{3})(.{7})(.{7})(.{7})(.{4})
Data Structure
Customer ID, Code, From Date, To Date, TransactionId, Product code
1759387, ACD, 06JAN19, 10MAR19, 1234567, ACRT
Please suggest the best approach to parse this huge data set and push it to an in-memory grid, which will be used again by Spark jobs for further processing when the respective APIs are invoked.
You can use the DataFrame approach. Copy the file to HDFS using the -copyFromLocal command
and use the code below to parse each record.
I'm assuming gireesh.txt contains the sample records below:
1759387ACD06JAN1910MAR191234567ACRT
2759387ACD08JAN1910MAY191234567ACRY
3759387ACD03JAN1910FEB191234567ACRZ
The spark code
import org.apache.log4j.{Level, Logger}
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.Encoders._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
object Gireesh {
def main(args: Array[String]) {
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession.builder().appName("Operations..").master("local[*]").getOrCreate()
import spark.implicits._
val pat="""(.{7})(.{3})(.{7})(.{7})(.{7})(.{4})""".r
val headers = List("custid","code","fdate","tdate","tranid","prdcode")
val rdd = spark.sparkContext.textFile("in/gireesh.txt")
.map( x => {
val y = scala.collection.mutable.ArrayBuffer[String]()
pat.findAllIn(x).matchData.foreach( m=> y.appendAll(m.subgroups))
(y(0).toLong,y(1),y(2),y(3),y(4).toLong,y(5))
}
)
val df = rdd.toDF(headers:_*)
df.printSchema()
df.show(false)
}
}
gives the below results.
root
|-- custid: long (nullable = false)
|-- code: string (nullable = true)
|-- fdate: string (nullable = true)
|-- tdate: string (nullable = true)
|-- tranid: long (nullable = false)
|-- prdcode: string (nullable = true)
+-------+----+-------+-------+-------+-------+
|custid |code|fdate |tdate |tranid |prdcode|
+-------+----+-------+-------+-------+-------+
|1759387|ACD |06JAN19|10MAR19|1234567|ACRT |
|2759387|ACD |08JAN19|10MAY19|1234567|ACRY |
|3759387|ACD |03JAN19|10FEB19|1234567|ACRZ |
+-------+----+-------+-------+-------+-------+
EDIT1:
You can have the map "transformation" in a separate function like below.
def parse(record:String) = {
val y = scala.collection.mutable.ArrayBuffer[String]()
pat.findAllIn(record).matchData.foreach( m=> y.appendAll(m.subgroups))
(y(0).toLong,y(1),y(2),y(3),y(4).toLong,y(5))
}
val rdd = spark.sparkContext.textFile("in/gireesh.txt")
.map( x => parse(x) )
val df = rdd.toDF(headers:_*)
df.printSchema()
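The parse function above throws on any line that does not match, because y stays empty. As a hedged alternative, the same regular expression can be used as a pattern-match extractor that returns an Option, so malformed lines can be skipped. Note that a Scala regex pattern match succeeds only when the pattern covers the whole line, so for the 200-character lines mentioned in the question the expression would have to describe all 200 characters. The names Txn and FixedWidth are hypothetical.

```scala
import scala.util.Try

// Hypothetical target type for one parsed line
final case class Txn(custId: Long, code: String, fromDate: String,
                     toDate: String, tranId: Long, productCode: String)

object FixedWidth {
  // Same field widths as in the question: 7, 3, 7, 7, 7, 4
  private val Pat = """(.{7})(.{3})(.{7})(.{7})(.{7})(.{4})""".r

  // Some(txn) for a well-formed line, None when the line has the wrong
  // length or a numeric field cannot be converted
  def parse(line: String): Option[Txn] = line match {
    case Pat(cust, code, from, to, tran, prod) =>
      Try(Txn(cust.toLong, code, from, to, tran.toLong, prod)).toOption
    case _ => None
  }
}
```

On the RDD this could be applied as textFile(...).flatMap(line => FixedWidth.parse(line)), which drops the lines that fail to parse instead of failing the job.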
You need to tell spark which file to read and how to process the content while reading it.
Here is an example:
import org.apache.spark.rdd.RDD
val numberOfPartitions = 5 // tune this based on the size of the file and the available resources (e.g. memory)
val someObjectsRDD: RDD[SomeObject] =
  sparkContext.textFile("/path/to/your/file", numberOfPartitions)
    .mapPartitions(
      { stringsFromFileIterator =>
        // process each raw string here and return the result;
        // parseLine is a placeholder for your own String => SomeObject function
        stringsFromFileIterator.map(stringFromFile => parseLine(stringFromFile))
      },
      preservesPartitioning = true
    )
In the code snippet, SomeObject is a class with the data structure from the question.

Use Spark Scala to transform flat data into nested object

I need help converting a flat dataset into a nested format using Apache Spark / Scala.
Is it possible to automatically create a nested structure derived from input column namespaces
[level 1].[level 2]? In my example, the nesting level is determined by the period symbol '.' within the column headers.
I am assuming this is possible to achieve using a map function. I am open to alternative solutions, particularly if there is a more elegant way of achieving the same outcome.
package org.acme.au
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SQLContext
import scala.collection.Seq
object testNestedObject extends App {
// Configure spark
val spark = SparkSession.builder()
.appName("Spark batch demo")
.master("local[*]")
.config("spark.driver.host", "localhost")
.getOrCreate()
// Start spark
val sc = spark.sparkContext
sc.setLogLevel("ERROR")
val sqlContext = new SQLContext(sc)
// Define schema for input data
val flatSchema = new StructType()
.add(StructField("id", StringType, false))
.add(StructField("name", StringType, false))
.add(StructField("custom_fields.fav_colour", StringType, true))
.add(StructField("custom_fields.star_sign", StringType, true))
// Create a row with dummy data
val row1 = Row("123456", "John Citizen", "Blue", "Scorpio")
val row2 = Row("990087", "Jane Simth", "Green", "Taurus")
val flatData = Seq(row1, row2)
// Convert into dataframe
val dfIn = spark.createDataFrame(spark.sparkContext.parallelize(flatData), flatSchema)
// Print to console
dfIn.printSchema()
dfIn.show()
// Convert flat data into nested structure as either Parquet or JSON format
val dfOut = dfIn.rdd
.map(
row => ( /* TODO: Need help with mapping flat data to nested structure derived from input column namespaces
*
* For example:
*
* <id>12345<id>
* <name>John Citizen</name>
* <custom_fields>
* <fav_colour>Blue</fav_colour>
* <star_sign>Scorpio</star_sign>
* </custom_fields>
*
*/ ))
// Stop spark
sc.stop()
}
This solution is for the revised requirement that the JSON output would consist of an array of {K:valueK, V:valueV} rather than {valueK1: valueV1, valueK2: valueV2, ...}. For example:
// FROM:
"custom_fields":{"fav_colour":"Blue", "star_sign":"Scorpio"}
// TO:
"custom_fields":[{"key":"fav_colour", "value":"Blue"}, {"key":"star_sign", "value":"Scorpio"}]
Sample code below:
import org.apache.spark.sql.functions._
val dfIn = Seq(
(123456, "John Citizen", "Blue", "Scorpio"),
(990087, "Jane Simth", "Green", "Taurus")
).toDF("id", "name", "custom_fields.fav_colour", "custom_fields.star_sign")
val structCols = dfIn.columns.filter(_.contains("."))
// structCols: Array[String] =
// Array(custom_fields.fav_colour, custom_fields.star_sign)
val structColsMap = structCols.map(_.split("\\.")).
groupBy(_(0)).mapValues(_.map(_(1)))
// structColsMap: scala.collection.immutable.Map[String,Array[String]] =
// Map(custom_fields -> Array(fav_colour, star_sign))
val dfExpanded = structColsMap.foldLeft(dfIn){ (accDF, kv) =>
val cols = kv._2.map( v =>
struct(lit(v).as("key"), col("`" + kv._1 + "." + v + "`").as("value"))
)
accDF.withColumn(kv._1, array(cols: _*))
}
val dfResult = structCols.foldLeft(dfExpanded)(_ drop _)
dfResult.show(false)
// +------+------------+----------------------------------------+
// |id |name |custom_fields |
// +------+------------+----------------------------------------+
// |123456|John Citizen|[[fav_colour,Blue], [star_sign,Scorpio]]|
// |990087|Jane Simth |[[fav_colour,Green], [star_sign,Taurus]]|
// +------+------------+----------------------------------------+
dfResult.printSchema
// root
// |-- id: integer (nullable = false)
// |-- name: string (nullable = true)
// |-- custom_fields: array (nullable = false)
// | |-- element: struct (containsNull = false)
// | | |-- key: string (nullable = false)
// | | |-- value: string (nullable = true)
dfResult.toJSON.show(false)
// +-------------------------------------------------------------------------------------------------------------------------------+
// |value |
// +-------------------------------------------------------------------------------------------------------------------------------+
// |{"id":123456,"name":"John Citizen","custom_fields":[{"key":"fav_colour","value":"Blue"},{"key":"star_sign","value":"Scorpio"}]}|
// |{"id":990087,"name":"Jane Simth","custom_fields":[{"key":"fav_colour","value":"Green"},{"key":"star_sign","value":"Taurus"}]} |
// +-------------------------------------------------------------------------------------------------------------------------------+
Note that we cannot make value type Any to accommodate a mix of different types, as Spark DataFrame API doesn't support type Any. As a consequence, the value in the array must be of a given type (e.g. String). Like the previous solution, this also handles only up to one nested level.
This can be solved with a dedicated case class and a UDF that converts the input data into case class instances. For example:
Define the case class
case class NestedFields(fav_colour: String, star_sign: String)
Define the UDF that takes the original column values as input and returns an instance of NestedFields:
private val asNestedFields = udf((fc: String, ss: String) => NestedFields(fc, ss))
Transform the original DataFrame and drop the flat columns:
val res = dfIn.withColumn("custom_fields", asNestedFields($"`custom_fields.fav_colour`", $"`custom_fields.star_sign`"))
.drop($"`custom_fields.fav_colour`")
.drop($"`custom_fields.star_sign`")
It produces
root
|-- id: string (nullable = false)
|-- name: string (nullable = false)
|-- custom_fields: struct (nullable = true)
| |-- fav_colour: string (nullable = true)
| |-- star_sign: string (nullable = true)
+------+------------+---------------+
| id| name| custom_fields|
+------+------------+---------------+
|123456|John Citizen|[Blue, Scorpio]|
|990087| Jane Simth|[Green, Taurus]|
+------+------------+---------------+
Here's a generalized solution that first assembles a Map of the column names that contain a '.', traverses the Map to add the converted struct columns to the DataFrame, and finally drops the original columns that contain the '.'. A slightly more generalized dfIn is used as the sample data.
import org.apache.spark.sql.functions._
val dfIn = Seq(
(123456, "John Citizen", "Blue", "Scorpio", "a", 1),
(990087, "Jane Simth", "Green", "Taurus", "b", 2)
).
toDF("id", "name", "custom_fields.fav_colour", "custom_fields.star_sign", "s.c1", "s.c2")
val structCols = dfIn.columns.filter(_.contains("."))
// structCols: Array[String] =
// Array(custom_fields.fav_colour, custom_fields.star_sign, s.c1, s.c2)
val structColsMap = structCols.map(_.split("\\.")).
groupBy(_(0)).mapValues(_.map(_(1)))
// structColsMap: scala.collection.immutable.Map[String,Array[String]] =
// Map(s -> Array(c1, c2), custom_fields -> Array(fav_colour, star_sign))
val dfExpanded = structColsMap.foldLeft(dfIn){ (accDF, kv) =>
val cols = kv._2.map(v => col("`" + kv._1 + "." + v + "`").as(v))
accDF.withColumn(kv._1, struct(cols: _*))
}
val dfResult = structCols.foldLeft(dfExpanded)(_ drop _)
dfResult.show
// +------+------------+-----+--------------+
// |id |name |s |custom_fields |
// +------+------------+-----+--------------+
// |123456|John Citizen|[a,1]|[Blue,Scorpio]|
// |990087|Jane Simth |[b,2]|[Green,Taurus]|
// +------+------------+-----+--------------+
dfResult.printSchema
// root
// |-- id: integer (nullable = false)
// |-- name: string (nullable = true)
// |-- s: struct (nullable = false)
// | |-- c1: string (nullable = true)
// | |-- c2: integer (nullable = false)
// |-- custom_fields: struct (nullable = false)
// | |-- fav_colour: string (nullable = true)
// | |-- star_sign: string (nullable = true)
Note that this solution handles only up to one nested level.
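For deeper nesting, one possible extension is to build the struct columns recursively from the dotted names. The following is only a sketch under two assumptions: no column name is at the same time a leaf and a prefix of another column (for example both "a" and "a.b"), and no column name contains a backtick. NestColumns, build and nest are hypothetical names.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, struct}

object NestColumns {
  // Each entry pairs the remaining path segments with the original flat column name.
  private def build(entries: Seq[(List[String], String)]): Seq[Column] =
    entries.map(_._1.head).distinct.map { head =>
      val group = entries.filter(_._1.head == head)
      group match {
        // A single entry with one remaining segment is a leaf column
        case Seq((_ :: Nil, full)) => col(s"`$full`").as(head)
        // Otherwise strip the first segment and nest the rest in a struct
        case _ =>
          struct(build(group.map { case (path, full) => (path.tail, full) }): _*).as(head)
      }
    }

  def nest(df: DataFrame): DataFrame =
    df.select(build(df.columns.toSeq.map(c => (c.split("\\.").toList, c))): _*)
}
```

Calling NestColumns.nest(dfIn) on the sample data should produce the same id, name, custom_fields and s columns as dfResult, in the order in which the prefixes first appear.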
To convert each row to JSON format, consider using toJSON as follows:
dfResult.toJSON.show(false)
// +---------------------------------------------------------------------------------------------------------------------+
// |value |
// +---------------------------------------------------------------------------------------------------------------------+
// |{"id":123456,"name":"John Citizen","s":{"c1":"a","c2":1},"custom_fields":{"fav_colour":"Blue","star_sign":"Scorpio"}}|
// |{"id":990087,"name":"Jane Simth","s":{"c1":"b","c2":2},"custom_fields":{"fav_colour":"Green","star_sign":"Taurus"}} |
// +---------------------------------------------------------------------------------------------------------------------+

How to extract values from json string?

I have a file which has bunch of columns and one column called jsonstring is of string type which has json strings in it… let's say the format is the following:
{
"key1": "value1",
"key2": {
"level2key1": "level2value1",
"level2key2": "level2value2"
}
}
I want to parse this column something like this: jsonstring.key1,jsonstring.key2.level2key1 to return value1, level2value1
How can I do that in scala or spark sql.
With Spark 2.2 you could use the function from_json which does the JSON parsing for you.
from_json(e: Column, schema: String, options: Map[String, String]): Column parses a column containing a JSON string into a StructType or ArrayType of StructTypes with the specified schema.
With the support for flattening nested columns by using * (star) that seems the best solution.
// the input dataset (just a single JSON blob)
val jsonstrings = Seq("""{
"key1": "value1",
"key2": {
"level2key1": "level2value1",
"level2key2": "level2value2"
}
}""").toDF("jsonstring")
// define the schema of JSON messages
import org.apache.spark.sql.types._
val key2schema = new StructType()
.add($"level2key1".string)
.add($"level2key2".string)
val schema = new StructType()
.add($"key1".string)
.add("key2", key2schema)
scala> schema.printTreeString
root
|-- key1: string (nullable = true)
|-- key2: struct (nullable = true)
| |-- level2key1: string (nullable = true)
| |-- level2key2: string (nullable = true)
val messages = jsonstrings
.select(from_json($"jsonstring", schema) as "json")
.select("json.*") // <-- flattening nested fields
scala> messages.show(truncate = false)
+------+---------------------------+
|key1 |key2 |
+------+---------------------------+
|value1|[level2value1,level2value2]|
+------+---------------------------+
scala> messages.select("key1", "key2.*").show(truncate = false)
+------+------------+------------+
|key1 |level2key1 |level2key2 |
+------+------------+------------+
|value1|level2value1|level2value2|
+------+------------+------------+
You can use withColumn + udf + json4s:
import org.json4s.{DefaultFormats, MappingException}
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.functions._
def getJsonContent(jsonstring: String): (String, String) = {
implicit val formats = DefaultFormats
val parsedJson = parse(jsonstring)
val value1 = (parsedJson \ "key1").extract[String]
val level2value1 = (parsedJson \ "key2" \ "level2key1").extract[String]
(value1, level2value1)
}
val getJsonContentUDF = udf((jsonstring: String) => getJsonContent(jsonstring))
df.withColumn("parsedJson", getJsonContentUDF(df("jsonstring")))
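If only a few values are needed and defining a schema feels like too much, Spark's built-in get_json_object function takes a JSONPath-like expression and returns the matching value as a string. A minimal sketch, assuming the DataFrame has the jsonstring column from the question (JsonPaths and extract are hypothetical names):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, get_json_object}

object JsonPaths {
  // Pulls two values out of the JSON text without a schema or a UDF;
  // a path that is absent in a row yields null for that row.
  def extract(df: DataFrame): DataFrame =
    df.select(
      get_json_object(col("jsonstring"), "$.key1").as("key1"),
      get_json_object(col("jsonstring"), "$.key2.level2key1").as("level2key1")
    )
}
```

Every extracted value comes back as a string, so numeric fields would still need a cast afterwards.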

How to create correct data frame for classification in Spark ML

I am trying to run random forest classification using the Spark ML API, but I am having trouble creating the right data frame as input to the pipeline.
Here is sample data:
age,hours_per_week,education,sex,salaryRange
38,40,"hs-grad","male","A"
28,40,"bachelors","female","A"
52,45,"hs-grad","male","B"
31,50,"masters","female","B"
42,40,"bachelors","male","B"
age and hours_per_week are integers, while the other features, including the label salaryRange, are categorical (String).
Loading this CSV file (let's call it sample.csv) can be done with the Spark CSV library like this:
val data = sqlContext.csvFile("/home/dusan/sample.csv")
By default all columns are imported as string so we need to change "age" and "hours_per_week" to Int:
val toInt = udf[Int, String]( _.toInt)
val dataFixed = data.withColumn("age", toInt(data("age"))).withColumn("hours_per_week",toInt(data("hours_per_week")))
Just to check how schema looks now:
scala> dataFixed.printSchema
root
|-- age: integer (nullable = true)
|-- hours_per_week: integer (nullable = true)
|-- education: string (nullable = true)
|-- sex: string (nullable = true)
|-- salaryRange: string (nullable = true)
Then let's set up the cross validator and pipeline:
val rf = new RandomForestClassifier()
val pipeline = new Pipeline().setStages(Array(rf))
val cv = new CrossValidator().setNumFolds(10).setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator)
Error shows up when running this line:
val cmModel = cv.fit(dataFixed)
java.lang.IllegalArgumentException: Field "features" does not exist.
It is possible to set the label column and the features column in RandomForestClassifier; however, I have 4 columns as predictors (features), not just one.
How should I organize my data frame so that the label and features columns are set up correctly?
For your convenience, here is the full code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.CrossValidator
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.linalg.{Vector, Vectors}
object SampleClassification {
def main(args: Array[String]): Unit = {
//set spark context
val conf = new SparkConf().setAppName("Simple Application").setMaster("local");
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import com.databricks.spark.csv._
//load data by using databricks "Spark CSV Library"
val data = sqlContext.csvFile("/home/dusan/sample.csv")
//by default all columns are imported as string so we need to change "age" and "hours_per_week" to Int
val toInt = udf[Int, String]( _.toInt)
val dataFixed = data.withColumn("age", toInt(data("age"))).withColumn("hours_per_week",toInt(data("hours_per_week")))
val rf = new RandomForestClassifier()
val pipeline = new Pipeline().setStages(Array(rf))
val cv = new CrossValidator().setNumFolds(10).setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator)
// this fails with error
//java.lang.IllegalArgumentException: Field "features" does not exist.
val cmModel = cv.fit(dataFixed)
}
}
Thanks for help!
As of Spark 1.4, you can use Transformer org.apache.spark.ml.feature.VectorAssembler.
Just provide column names you want to be features.
val assembler = new VectorAssembler()
.setInputCols(Array("col1", "col2", "col3"))
.setOutputCol("features")
and add it to your pipeline.
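Applied to the sample data from the question, the string columns have to be turned into numbers before they can be assembled. One way to sketch this, assuming a Spark version that ships StringIndexer and VectorAssembler in org.apache.spark.ml.feature, is to index the categorical columns, index salaryRange into "label", and assemble the rest into "features". The object name SalaryPipeline and the *_idx column names are hypothetical.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

object SalaryPipeline {
  def build(): Pipeline = {
    // Map each categorical string column to a numeric index column
    val educationIdx = new StringIndexer().setInputCol("education").setOutputCol("education_idx")
    val sexIdx = new StringIndexer().setInputCol("sex").setOutputCol("sex_idx")
    // The classifier's default label column is "label"
    val labelIdx = new StringIndexer().setInputCol("salaryRange").setOutputCol("label")
    // Combine the numeric predictors into the single "features" vector column
    val assembler = new VectorAssembler()
      .setInputCols(Array("age", "hours_per_week", "education_idx", "sex_idx"))
      .setOutputCol("features")
    val rf = new RandomForestClassifier()
    new Pipeline().setStages(Array[PipelineStage](educationIdx, sexIdx, labelIdx, assembler, rf))
  }
}
```

The resulting pipeline can be passed to CrossValidator.setEstimator in place of the single-stage pipeline from the question. With only five rows, ten folds are unlikely to give meaningful results, so a larger data set is assumed.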
You simply need to make sure that you have a "features" column in your dataframe that is of type VectorUDT. Renaming an integer column to "features" is not enough, as shown below:
scala> val df2 = dataFixed.withColumnRenamed("age", "features")
df2: org.apache.spark.sql.DataFrame = [features: int, hours_per_week: int, education: string, sex: string, salaryRange: string]
scala> val cmModel = cv.fit(df2)
java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.mllib.linalg.VectorUDT#1eef but was actually IntegerType.
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:37)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:50)
at org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:118)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:164)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:164)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:164)
at org.apache.spark.ml.tuning.CrossValidator.transformSchema(CrossValidator.scala:142)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:59)
at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:107)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:67)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:72)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:74)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:76)
EDIT1
Essentially there need to be two fields in your data frame: "features" for the feature vector and "label" for the instance labels. Labels must be of type Double.
To create a "features" field of Vector type, first create a udf as shown below:
val toVec4 = udf[Vector, Int, Int, String, String] { (a,b,c,d) =>
val e3 = c match {
case "hs-grad" => 0
case "bachelors" => 1
case "masters" => 2
}
val e4 = d match {case "male" => 0 case "female" => 1}
Vectors.dense(a, b, e3, e4)
}
Now to also encode the "label" field, create another udf as shown below:
val encodeLabel = udf[Double, String]( _ match { case "A" => 0.0 case "B" => 1.0} )
Now we transform original dataframe using these two udf:
val df = dataFixed.withColumn(
"features",
toVec4(
dataFixed("age"),
dataFixed("hours_per_week"),
dataFixed("education"),
dataFixed("sex")
)
).withColumn("label", encodeLabel(dataFixed("salaryRange"))).select("features", "label")
Note that there can be extra columns / fields present in the dataframe, but in this case I have selected only features and label:
scala> df.show()
+-------------------+-----+
| features|label|
+-------------------+-----+
|[38.0,40.0,0.0,0.0]| 0.0|
|[28.0,40.0,1.0,1.0]| 0.0|
|[52.0,45.0,0.0,0.0]| 1.0|
|[31.0,50.0,2.0,1.0]| 1.0|
|[42.0,40.0,1.0,0.0]| 1.0|
+-------------------+-----+
Now it's up to you to set the correct parameters for your learning algorithm to make it work.
According to the Spark documentation on MLlib decision trees, it seems to me that you should define the feature map you are using, and the data points should be LabeledPoint instances.
This tells the algorithm which column is the label to predict and which ones are the features.
https://spark.apache.org/docs/latest/mllib-decision-tree.html