I want to print data of employees who joined before 1991. Below is my sample data:
69062,FRANK,ANALYST,5646,1991-12-03,3100.00,,2001
63679,SANDRINE,CLERK,69062,1990-12-18,900.00,,2001
Initial RDD for loading data:
val rdd=sc.textFile("file:////home/hduser/Desktop/Employees/employees.txt").filter(p=>{p!=null && p.trim.length>0})
UDF for converting the string column to a date column:
import java.util.Date
import java.text.SimpleDateFormat

def convertStringToDate(s: String): Date = {
  val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
  dateFormat.parse(s)
}
Mapping each column to its datatype:
val dateRdd = rdd.map(_.split(",")).map(p => (
  if (p(0).length > 0) p(0).toLong else 0L,
  p(1),
  p(2),
  if (p(3).length > 0) p(3).toLong else 0L,
  convertStringToDate(p(4)),
  if (p(5).length > 0) p(5).toDouble else 0D,
  if (p(6).length > 0) p(6).toDouble else 0D,
  if (p(7).length > 0) p(7).toInt else 0
))
Now I get data in tuples as below:
(69062,FRANK,ANALYST,5646,Tue Dec 03 00:00:00 IST 1991,3100.0,0.0,2001)
(63679,SANDRINE,CLERK,69062,Tue Dec 18 00:00:00 IST 1990,900.0,0.0,2001)
Now when I execute the following command, I get the error below:
scala> dateRdd.map(p=>(!(p._5.before("1991")))).foreach(println)
<console>:36: error: type mismatch;
found : String("1991")
required: java.util.Date
dateRdd.map(p=>(!(p._5.before("1991")))).foreach(println)
^
So where am I going wrong?
Since you are working with RDDs rather than DataFrames, and you only need a simple date check on date strings, the following uncomplicated approach works for an RDD:
val rdd = sc.parallelize(Seq((69062,"FRANK","ANALYST",5646, "1991-12-03",3100.00,2001),(63679,"SANDRINE","CLERK",69062,"1990-12-18",900.00,2001)))
rdd.filter(p=>(p._5 < "1991-01-01")).foreach(println)
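The same comparison also works directly on the file-based rdd from the question (a sketch, assuming the comma-separated layout shown above with the joining date at index 4):
// Sketch: lexicographic comparison is safe here because the dates are ISO-formatted strings
rdd.filter(line => line.split(",")(4) < "1991-01-01").foreach(println)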
No need to convert the date with the legacy SimpleDateFormat API. Use java.time. Since the date column (index 4) is already in the expected ISO format, you can simply use the rdd step below.
Check this out
val rdd=spark.sparkContext.textFile("in\\employees.txt").filter( x => {val y = x.split(","); java.time.LocalDate.parse(y(4)).isBefore(java.time.LocalDate.parse("1991-01-01")) } )
Then
rdd.collect.foreach(println)
gave the result below:
63679,SANDRINE,CLERK,69062,1990-12-18,900.00,,2001
Hope this answers your question.
EDIT1:
Using Java 7 and the SimpleDateFormat API:
import java.util.Date
import java.text.SimpleDateFormat
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
object DTCheck {
  def main(args: Array[String]): Unit = {

    def convertStringToDate(s: String): Date = {
      val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
      dateFormat.parse(s)
    }

    Logger.getLogger("org").setLevel(Level.ERROR)
    val spark = SparkSession.builder().appName("Employee < 1991").master("local[*]").getOrCreate()

    val sdf = new SimpleDateFormat("yyyy-MM-dd")
    val dt_1991 = sdf.parse("1991-01-01")

    import spark.implicits._
    val rdd = spark.sparkContext.textFile("in\\employees.txt").filter(x => { val y = x.split(","); convertStringToDate(y(4)).before(dt_1991) })

    rdd.collect.foreach(println)
  }
}
I am trying to create a DataFrame. It seems that Spark is unable to create a DataFrame from a scala.Tuple2 type. How can I do it? I am new to Scala and Spark.
Below is part of the error trace from the run:
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row
- field (class: "org.apache.spark.sql.Row", name: "_1")
- root class: "scala.Tuple2"
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:666)
..........
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:299)
at SparkMapReduce$.runMapReduce(SparkMapReduce.scala:46)
at Entrance$.queryLoader(Entrance.scala:64)
at Entrance$.paramsParser(Entrance.scala:43)
at Entrance$.main(Entrance.scala:30)
at Entrance.main(Entrance.scala)
Below is the relevant part of the program. The problem occurs on the line above the exclamation marks in the comment.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.split
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame
object SparkMapReduce {

  Logger.getLogger("org.spark_project").setLevel(Level.WARN)
  Logger.getLogger("org.apache").setLevel(Level.WARN)
  Logger.getLogger("akka").setLevel(Level.WARN)
  Logger.getLogger("com").setLevel(Level.WARN)

  def runMapReduce(spark: SparkSession, pointPath: String, rectanglePath: String): DataFrame = {
    var pointDf = spark.read.format("csv").option("delimiter", ",").option("header", "false").load(pointPath)
    pointDf = pointDf.toDF()
    pointDf.createOrReplaceTempView("points")
    pointDf = spark.sql("select ST_Point(cast(points._c0 as Decimal(24,20)),cast(points._c1 as Decimal(24,20))) as point from points")
    pointDf.createOrReplaceTempView("pointsDf")
    // pointDf.show()

    var rectangleDf = spark.read.format("csv").option("delimiter", ",").option("header", "false").load(rectanglePath)
    rectangleDf = rectangleDf.toDF()
    rectangleDf.createOrReplaceTempView("rectangles")
    rectangleDf = spark.sql("select ST_PolygonFromEnvelope(cast(rectangles._c0 as Decimal(24,20)),cast(rectangles._c1 as Decimal(24,20)), cast(rectangles._c2 as Decimal(24,20)), cast(rectangles._c3 as Decimal(24,20))) as rectangle from rectangles")
    rectangleDf.createOrReplaceTempView("rectanglesDf")
    // rectangleDf.show()

    val joinDf = spark.sql("select rectanglesDf.rectangle as rectangle, pointsDf.point as point from rectanglesDf, pointsDf where ST_Contains(rectanglesDf.rectangle, pointsDf.point)")
    joinDf.createOrReplaceTempView("joinDf")
    // joinDf.show()

    import spark.implicits._
    val joinRdd = joinDf.rdd
    val resmap = joinRdd.map(x => (x, 1))
    val reduced = resmap.reduceByKey(_ + _)
    val final_datablock = reduced.collect()
    val trying: List[Float] = List()
    print(final_datablock)
    // .toDF("rectangles", "count")

    // val dataframe_final1 = spark.createDataFrame(reduced)
    val dataframe_final2 = spark.createDataFrame(reduced).toDF("rectangles", "count")
    // ^ !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!Line above creates problem

    // You need to complete this part
    var result = spark.emptyDataFrame
    return result // You need to change this part
  }
}
The first element of reduced has type Row, and you did not specify its schema when converting the RDD to a DataFrame. A DataFrame must have a schema, so you need to use the following method and define the right schema for your RDD before converting it to a DataFrame.
createDataFrame(RDD<Row> rowRDD, StructType schema)
For example:
val schema = new StructType()
  .add(StructField("_1a", IntegerType))
  .add(StructField("_1b", ArrayType(StringType)))
  .add(StructField("count", IntegerType, true))
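For the code in the question, a hedged sketch of this approach (reduced is the RDD[(Row, Int)] built from joinDf) is to flatten each key Row, append its count, and reuse joinDf's schema:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Reuse the schema of the joined DataFrame and append a "count" column
val countSchema = StructType(joinDf.schema.fields :+ StructField("count", IntegerType, nullable = false))

// reduced is an RDD[(Row, Int)]: flatten each key Row and append its count
val rowRdd = reduced.map { case (row, cnt) => Row.fromSeq(row.toSeq :+ cnt) }

val dataframe_final2 = spark.createDataFrame(rowRdd, countSchema)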
I have a collection of rowkeys (plants, as shown below) from an HBase table, and I want to write a fetchData function which returns the RDD data for the rowkeys in the collection. The goal is to get the union of the RDDs returned by fetchData for each element of the plants collection. I have given the relevant part of the code below. My issue is that the code gives a compilation error for the return type of fetchData:
println("PartB: "+ hBaseRDD.getNumPartitions)
error: value getNumPartitions is not a member of Option[org.apache.spark.rdd.RDD[it.nerdammer.spark.test.sys.Record]]
I am using Scala 2.11.8, Spark 2.2.0, and Maven for compilation.
import it.nerdammer.spark.hbase._
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
import org.apache.log4j.Level
import org.apache.log4j.Logger
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object sys {
  case class systems(rowkey: String, iacp: Option[String], temp: Option[String])

  val spark = SparkSession.builder().appName("myApp").config("spark.executor.cores", 4).getOrCreate()
  import spark.implicits._

  type Record = (String, Option[String], Option[String])

  def fetchData(plant: String): RDD[Record] = {
    val start_index = plant
    val end_index = plant + "z"
    // The command below works fine if I run it in the main function, but to get multiple rows from HBase, I am using it in a separate function
    spark.sparkContext.hbaseTable[Record]("test_table").select("iacp", "temp").inColumnFamily("pp").withStartRow(start_index).withStopRow(end_index)
  }

  def main(args: Array[String]) {
    // The elements below are prefixes of the relevant rowkeys in the HBase table ("test_table")
    val plants = Vector("a8", "cu", "aw", "fx")
    val hBaseRDD = plants.map(pp => fetchData(pp))
    println("Part: " + hBaseRDD.getNumPartitions)

    /*
    rest of the code
    */
  }
}
Here is a working version of the code. The problem here is that I am using a for loop, so I have to request the data for each rowkey prefix in the plants vector from HBase in every iteration, instead of getting all the data first and then executing the rest of the code.
import it.nerdammer.spark.hbase._
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
import org.apache.log4j.Level
import org.apache.log4j.Logger
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object sys {
  case class systems(rowkey: String, iacp: Option[String], temp: Option[String])

  def main(args: Array[String]) {
    val spark = SparkSession.builder().appName("myApp").config("spark.executor.cores", 4).getOrCreate()
    import spark.implicits._

    type Record = (String, Option[String], Option[String])

    val plants = Vector("a8", "cu", "aw", "fx")

    for (plant <- plants) {
      val start_index = plant
      val end_index = plant + "z"
      val hBaseRDD = spark.sparkContext.hbaseTable[Record]("test_table").select("iacp", "temp").inColumnFamily("pp").withStartRow(start_index).withStopRow(end_index)
      println("Part: " + hBaseRDD.getNumPartitions)

      /*
      rest of the code
      */
    }
  }
}
After more attempts, this is where I am stuck now. How can I cast the result to the required type?
scala> def fetchData(plant: String) = {
| val start_index = plant
| val end_index = plant + "~"
| val x1 = spark.sparkContext.hbaseTable[Record]("test_table").select("iacp","temp").inColumnFamily("pp").withStartRow(start_index).withStopRow(end_index)
| x1
| }
Defining the function in the REPL and running it:
scala> val hBaseRDD = plants.map( pp => fetchData(pp)).reduceOption(_ union _)
<console>:39: error: type mismatch;
found : org.apache.spark.rdd.RDD[(String, Option[String], Option[String])]
required: it.nerdammer.spark.hbase.HBaseReaderBuilder[(String, Option[String], Option[String])]
val hBaseRDD = plants.map( pp => fetchData(pp)).reduceOption(_ union _)
Thanks in Advance!
The type of hBaseRDD is Vector[_] and not RDD[_], so you can't call getNumPartitions on it. If I understand correctly, you want to union the fetched RDDs. You can do that with plants.map(pp => fetchData(pp)).reduceOption(_ union _) (I recommend reduceOption because it won't fail on an empty list, but you can use reduce if you are confident the list is not empty).
Also, the declared return type of fetchData is RDD[U], but I didn't find any definition of U. That is probably why the compiler infers Vector[Nothing] instead of Vector[RDD[Record]]. To avoid subsequent errors you should also change RDD[U] to RDD[Record].
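As an illustration of the union-then-inspect pattern with plain RDDs (hypothetical in-memory data standing in for the HBase-backed RDDs, and a SparkSession named spark assumed):
import org.apache.spark.rdd.RDD

type Record = (String, Option[String], Option[String])

// Hypothetical stand-in for the HBase read, returning an ordinary RDD[Record]
def fetchData(prefix: String): RDD[Record] =
  spark.sparkContext.parallelize(Seq((prefix + "001", Some("iacp"), Some("temp"))))

val plants = Vector("a8", "cu", "aw", "fx")
val hBaseRDD = plants.map(fetchData).reduceOption(_ union _)

// reduceOption yields an Option[RDD[Record]], so inspect it with foreach
hBaseRDD.foreach(rdd => println("Part: " + rdd.getNumPartitions))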
While inserting data into a Hive TimestampType column from Spark, the nanoseconds are truncated. Does anyone have a solution for this? I have tried writing in both ORC and CSV format to Hive.
CSV: it appeared as 2018-03-20T13:04:20.123Z
ORC: 2018-03-20 13:04:20.123456
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types._
import org.apache.spark.sql.types.StructField
import java.util.Date
import org.apache.spark.sql.Row
import java.math.{BigDecimal,MathContext,RoundingMode}
/**
 * Main class to read Order, Route and Trade records and convert them to ORC File format
 * @author Shefali.Nema
 * @since 1.0.0
 */
object testDateAndDecimal {

  def main(args: Array[String]): Unit = {
    execute
  }

  private def execute: Unit = {

    val sparkConf = new SparkConf().setAppName("Test")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Define DataTypes
    val datetimestring: String = "2018-03-20 13:04:20.123456789"
    val dt = java.sql.Timestamp.valueOf(datetimestring)
    //val DecimalType = DataTypes.createDecimalType(18, 8)

    // Define Values
    val id = 1
    //System.out.println(new BigDecimal("135.69")); // 135.69
    val price = new BigDecimal("1234567890.1234567899")

    System.out.println("\n###################################################price###################################" + price + "\n")
    System.out.println("\n###################################################dt###################################" + dt + "\n")

    val schema = StructType(StructField("id", IntegerType, true) :: StructField("name", TimestampType, true) :: StructField("price", DecimalType(18, 8), true) :: Nil)
    val values = List(id, dt, price)
    val row = Row.fromSeq(values)

    // Create `RDD` from `Row`
    val rdd = sc.makeRDD(List(row))
    val orcFolderName = "testDecimal"
    val hiveRowsDF = sqlContext.createDataFrame(rdd, schema)
    hiveRowsDF.write.mode(org.apache.spark.sql.SaveMode.Append).orc(orcFolderName)
  }
}
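For what it's worth, a quick check outside Spark suggests the JVM Timestamp object itself keeps the nanoseconds, so the loss happens when the value is stored as Spark's TimestampType, which to my knowledge carries only microsecond precision:
val ts = java.sql.Timestamp.valueOf("2018-03-20 13:04:20.123456789")
println(ts.getNanos) // 123456789 -- the nanoseconds are intact on the Timestamp object
println(ts)          // 2018-03-20 13:04:20.123456789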
I have a date key such as 20170501, which is in yyyyMMdd format. How can I get a date x days back from this date in Scala?
This is what I have in the program:
val runDate = 20170501
Now I want a date say 30 days back from this date.
Using Scala/JVM/Java 8...
scala> import java.time._
import java.time._
scala> import java.time.format._
import java.time.format._
scala> val formatter = DateTimeFormatter.ofPattern("yyyyMMdd")
formatter: java.time.format.DateTimeFormatter = Value(YearOfEra,4,19,EXCEEDS_PAD)Value(MonthOfYear,2)Value(DayOfMonth,2)
scala> val runDate = 20170501
runDate: Int = 20170501
scala> val runDay = LocalDate.parse(runDate.toString, formatter)
runDay: java.time.LocalDate = 2017-05-01
scala> val runDayMinus30 = runDay.minusDays(30)
runDayMinus30: java.time.LocalDate = 2017-04-01
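If the result needs to go back into the same yyyyMMdd integer shape, a small follow-up in the same REPL session might look like this:
scala> val runDateMinus30 = runDayMinus30.format(formatter).toInt
runDateMinus30: Int = 20170401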
You can also use the Joda-Time API, which has really handy functions like
date.minusMonths
date.minusYear
date.minusDays
date.minusHours
date.minusMinutes
Here is a simple example using the Joda-Time API:
import org.joda.time.format.DateTimeFormat
val dtf = DateTimeFormat.forPattern("yyyyMMdd")
val dt= "20170531"
val date = dtf.parseDateTime(dt)
println(date.minusDays(30))
Output:
2017-05-01T00:00:00.000+05:45
For this you need to use a UDF: create a date object from your "yyyyMMdd" input and do the operations on it, as sketched below.
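Here is a hedged sketch of that UDF approach, using java.time instead of Joda-Time and a hypothetical run_date column holding yyyyMMdd integers (a SparkSession named spark is assumed):
import org.apache.spark.sql.functions.{lit, udf}
import spark.implicits._

// UDF: parse the yyyyMMdd integer, subtract the days, format back to a yyyyMMdd integer
val minusDays = udf { (yyyymmdd: Int, days: Int) =>
  val fmt = java.time.format.DateTimeFormatter.ofPattern("yyyyMMdd")
  java.time.LocalDate.parse(yyyymmdd.toString, fmt).minusDays(days).format(fmt).toInt
}

val df = Seq(20170501, 20170531).toDF("run_date")
df.withColumn("run_date_minus_30", minusDays($"run_date", lit(30))).show()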
Hope this helps!
I have a dataset of 2002 variables. All variables are numeric. I first read the dataset into Spark 1.5.0 and created a DoubleType DataFrame following the instructions here. Then I converted the DataFrame to a LabeledPoint following the instructions here and here. However, when I tried to print out sample rows of the generated LabeledPoint, I got the "java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double" error. Below is the Scala code I used. Sorry for the long code, but I hope it will help the debugging.
Could anyone please tell me where the error is coming from and how to resolve the problem? Thank you very much for your help!
Below is the Scala code I used:
// Read in dataset but drop the header row
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val trainRDD = sc.textFile("train.txt").filter(line => !line.contains("target"))
// Read in header file to get column names. Store in an Array.
import scala.io.Source
val dictFile = "header.txt"
var arrName = new Array[String](2002)
for (line <- Source.fromFile(dictFile).getLines) {
arrName = line.split('\t').map(_.trim).toArray
}
// Create dataframe using programmatically specifying the schema method
// Encode schema in a string
var schemaString = arrName.mkString(" ")
// Import Row
import org.apache.spark.sql.Row
// Import RDD
import org.apache.spark.rdd.RDD
// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType,LongType,FloatType,DoubleType}
// Generate the Double Type schema based on the string of schema
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, DoubleType, true)))
// Create rowRDD and convert String type to Double type
val arrVar = sc.broadcast(0 to 2001 toArray)
def createRowRDD(rdd:RDD[String], anArray:org.apache.spark.broadcast.Broadcast[Array[Int]]) : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = {
val rowRDD = rdd.map(_.split("\t")).map(_.map({y => y.toDouble})).map(p => Row.fromSeq(anArray.value map p))
return rowRDD
}
val rowRDDTrain = createRowRDD(trainRDD, arrVar)
// Apply the schema to the RDD.
val trainDF = sqlContext.createDataFrame(rowRDDTrain, schema)
trainDF.printSchema
// Verified all 2002 variables are in "double (nullable = true)" format
// Define toLabeledPoint( ) to convert dataframe to LabeledPoint format
// Reference: https://stackoverflow.com/questions/31638770/rdd-to-labeledpoint-conversion
def toLabeledPoint(dataDF:org.apache.spark.sql.DataFrame) : org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = {
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val targetInd = dataDF.columns.indexOf("target")
val ignored = List("target")
val featInd = dataDF.columns.diff(ignored).map(dataDF.columns.indexOf(_))
val dataLP = dataDF.rdd.map(r => LabeledPoint(r.getDouble(targetInd),
Vectors.dense(featInd.map(r.getDouble(_)).toArray)))
return dataLP
}
// Create LabeledPoint from dataframe
val trainLP = toLabeledPoint(trainDF)
// Print out sample rows of the generated LabeledPoint
trainLP.take(5).foreach(println)
// Failed: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double
Update:
Thanks a lot to David Griffin and zero323 for their comments below. David is correct: I found that the exception is indeed caused by null values in the data. I replaced the following original code:
def createRowRDD(rdd:RDD[String], anArray:org.apache.spark.broadcast.Broadcast[Array[Int]]) : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = {
val rowRDD = rdd.map(_.split("\t")).map(_.map({y => y.toDouble})).map(p => Row.fromSeq(anArray.value map p))
return rowRDD
}
with this one, which imputes the null values to 0.0, and the problem is gone:
def createRowRDD(rdd:RDD[String], anArray:org.apache.spark.broadcast.Broadcast[Array[Int]]) : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = {
val rowRDD = rdd.map(_.split("\t")).map(_.map({y => try {y.toDouble} catch {case _ : Throwable => 0.0}})).map(p => Row.fromSeq(anArray.value map p))
return rowRDD
}
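An equivalent, slightly tidier variant of the same imputation using scala.util.Try (just a sketch; it behaves the same as the try/catch version above):
import scala.util.Try
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

def createRowRDD(rdd: RDD[String], anArray: org.apache.spark.broadcast.Broadcast[Array[Int]]): RDD[Row] = {
  rdd.map(_.split("\t"))
     .map(_.map(y => Try(y.toDouble).getOrElse(0.0)))   // impute unparsable/null values to 0.0
     .map(p => Row.fromSeq(anArray.value map p))
}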