Scala : Reading data from csv with columns have null values - scala

Enviornment - spark-3.0.1-bin-hadoop2.7, ScalaLibraryContainer 2.12.3, Scala, SparkSQL, eclipse-jee-oxygen-2-linux-gtk-x86_64
I have a csv file having 3 columns with data-type :String,Long,Date. I have converted csv file to datafram and want to show it.
But it is giving following error
java.lang.ArrayIndexOutOfBoundsException: 2
at org.apache.spark.examples.sql.SparkSQLExample5$.$anonfun$runInferSchemaExample$2(SparkSQLExample5.scala:30)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:448)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:448)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at scala code
.map(attributes => Person(attributes(0), attributes(1),attributes(2))).toDF();
Error comes, if subsequent rows have less values than number of values present in header. Basically I am trying to read data from csv using Scala and Spark with columns have null values.
Rows dont have the same number of columns.
It is running successfully if all the rows have 3 column values.
package org.apache.spark.examples.sql
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import java.sql.Date
import org.apache.spark.sql.functions._
import java.util.Calendar;
object SparkSQLExample5 {
case class Person(name: String, age: String, birthDate: String)
def main(args: Array[String]): Unit = {
val fromDateTime=java.time.LocalDateTime.now;
val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.master", "local").getOrCreate();
import spark.implicits._
runInferSchemaExample(spark);
spark.stop()
}
private def runInferSchemaExample(spark: SparkSession): Unit = {
import spark.implicits._
println("1. Creating an RDD of 'Person' object and converting into 'Dataframe' "+
" 2. Registering the DataFrame as a temporary view.")
println("1. Third column of second row is not present.Last value of second row is comma.")
val peopleDF = spark.sparkContext
.textFile("examples/src/main/resources/test.csv")
.map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1),attributes(2))).toDF();
val finalOutput=peopleDF.select("name","age","birthDate")
finalOutput.show();
}
}
csv file
col1,col2,col3
row21,row22,
row31,row32,

Try PERMISSIVE mode when reading csv file, it will add NULL for missing fields
val df = spark.sqlContext.read.format("csv").option("mode", "PERMISSIVE") .load("examples/src/main/resources/test.csv")
you can find more information
https://docs.databricks.com/data/data-sources/read-csv.html

Input: csv file
col1,col2,col3
row21,row22,
row31,row32,
Code:
import org.apache.spark.sql.SparkSession
object ReadCsvFile {
case class Person(name: String, age: String, birthDate: String)
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.master", "local").getOrCreate();
readCsvFileAndInferCustomSchema(spark);
spark.stop()
}
private def readCsvFileAndInferCustomSchema(spark: SparkSession): Unit = {
val df = spark.read.csv("C:/Users/Ralimili/Desktop/data.csv")
val rdd = df.rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
val mapRdd = rdd.map(attributes => {
Person(attributes.getString(0), attributes.getString(1),attributes.getString(2))
})
val finalDf = spark.createDataFrame(mapRdd)
finalDf.show(false);
}
}
output
+-----+-----+---------+
|name |age |birthDate|
+-----+-----+---------+
|row21|row22|null |
|row31|row32|null |
+-----+-----+---------+
If you want to fill some values instead of null values use below code
val customizedNullDf = finalDf.na.fill("No data")
customizedNullDf.show(false);
output
+-----+-----+---------+
|name |age |birthDate|
+-----+-----+---------+
|row21|row22|No data |
|row31|row32|No data |
+-----+-----+---------+

Related

UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row

I am trying to create a dataFrame. It seems that spark is unable to create a dataframe from a scala.Tuple2 type. How can I do it? I am new to scala and spark.
Below is a part of the error trace from the code run
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row
- field (class: "org.apache.spark.sql.Row", name: "_1")
- root class: "scala.Tuple2"
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:666)
..........
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:299)
at SparkMapReduce$.runMapReduce(SparkMapReduce.scala:46)
at Entrance$.queryLoader(Entrance.scala:64)
at Entrance$.paramsParser(Entrance.scala:43)
at Entrance$.main(Entrance.scala:30)
at Entrance.main(Entrance.scala)
Below is the code that is a part of the entire program. The problem occurs in the line above the exclamation marks in a comment
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.split
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame
object SparkMapReduce {
Logger.getLogger("org.spark_project").setLevel(Level.WARN)
Logger.getLogger("org.apache").setLevel(Level.WARN)
Logger.getLogger("akka").setLevel(Level.WARN)
Logger.getLogger("com").setLevel(Level.WARN)
def runMapReduce(spark: SparkSession, pointPath: String, rectanglePath: String): DataFrame =
{
var pointDf = spark.read.format("csv").option("delimiter",",").option("header","false").load(pointPath);
pointDf = pointDf.toDF()
pointDf.createOrReplaceTempView("points")
pointDf = spark.sql("select ST_Point(cast(points._c0 as Decimal(24,20)),cast(points._c1 as Decimal(24,20))) as point from points")
pointDf.createOrReplaceTempView("pointsDf")
// pointDf.show()
var rectangleDf = spark.read.format("csv").option("delimiter",",").option("header","false").load(rectanglePath);
rectangleDf = rectangleDf.toDF()
rectangleDf.createOrReplaceTempView("rectangles")
rectangleDf = spark.sql("select ST_PolygonFromEnvelope(cast(rectangles._c0 as Decimal(24,20)),cast(rectangles._c1 as Decimal(24,20)), cast(rectangles._c2 as Decimal(24,20)), cast(rectangles._c3 as Decimal(24,20))) as rectangle from rectangles")
rectangleDf.createOrReplaceTempView("rectanglesDf")
// rectangleDf.show()
val joinDf = spark.sql("select rectanglesDf.rectangle as rectangle, pointsDf.point as point from rectanglesDf, pointsDf where ST_Contains(rectanglesDf.rectangle, pointsDf.point)")
joinDf.createOrReplaceTempView("joinDf")
// joinDf.show()
import spark.implicits._
val joinRdd = joinDf.rdd
val resmap = joinRdd.map(x=>(x, 1))
val reduced = resmap.reduceByKey(_+_)
val final_datablock = reduced.collect()
val trying : List[Float] = List()
print(final_datablock)
// .toDF("rectangles", "count")
// val dataframe_final1 = spark.createDataFrame(reduced)
val dataframe_final2 = spark.createDataFrame(reduced).toDF("rectangles", "count")
// ^ !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!Line above creates problem
// You need to complete this part
var result = spark.emptyDataFrame
return result // You need to change this part
}
}
Your first column of reduced has a type of ROW and you do not specified it when converting from RDD to DF. A dataframe must have a schema. So you need to use the following method by defining a right schema for your RDD to covert to DataFrame.
createDataFrame(RDD<Row> rowRDD, StructType schema)
for example:
val schema = new StructType()
.add(Array(
StructField("._1a",IntegerType),
StructField("._1b", ArrayType(StringType))
))
.add(StructField("count", IntegerType, true))

How to parse a JSON with unknown key-value pairs in a Spark DataFrame to multiple rows of values

Input DataFrame is coming from Kafka in key value pair -
Input :
{
"CBT-POSTED-TXN": {
"eventTyp": "TXN-NEW",
"eventCatgry": "CBT-POSTED-TXN"
},
"CBT-BALANCE-CHG": {
"eventTyp": "TXN-NEW",
"eventCatgry": "CBT-BALANCE-CHG",
"enablerEventVer": "1.0.0"
}
}
The output DataFrame needs to be -
ROW 1
{
"eventTyp": "TXN-NEW",
"eventCatgry": "CBT-POSTED-TXN"
}
ROW 2
{
"eventTyp": "TXN-NEW",
"eventCatgry": "CBT-BALANCE-CHG",
"enablerEventVer": "1.0.0"
}
Below is how i'm trying to parse -
override def translateSource(df: DataFrame, spark: SparkSession = SparkUtil.getSparkSession("")): DataFrame = {
import spark.implicits._
val k3ProcessingSchema: StructType = Encoders.product[k3SourceSchema].schema
val dfTranslated=df.selectExpr("CAST(key AS STRING) key", "cast(value as string) value", "CAST(partition as String)", "CAST(offset as String)", "CAST(timestamp as String)").as[(String, String, String, String, String)]
.select(from_json($"value", k3ProcessingSchema).as("k3Clubbed"), $"value", $"key", $"partition", $"offset", $"timestamp")
.select($"k3Clubbed", $"value", $"key", $"partition", $"offset", $"timestamp").filter("k3Clubbed is not null")
.as[(k3SourceSchema, String, String, String, String, String)]
dfTranslated.collect()
df
}
case class k3SourceSchema (
k3SourceValue: Map[String,String]
)
However the code is not able to parse the df(value) column which contains the JSON Format with multiple events into a map of String (key) and String (Value)
You are talking about manipulating rows of a data frame and outputting multiple rows. The best way I see it is by using explode function over data frame but you would like to enrich your data in a format where you can use the explode function.
Refer the below code for better understanding -
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.explode
import scala.collection.JavaConverters._
import java.util.HashMap
import scala.collection.mutable.ArrayBuffer
import org.codehaus.jackson.map.ObjectMapper
import com.lbg.pas.alerts.realtime.notifications.common.entity.K3FLDEntity
def translateSource(df: DataFrame, spark: SparkSession = SparkUtil.getSparkSession("")): DataFrame = {
import spark.implicits._
val dataSetSource = df.selectExpr("CAST(key AS STRING) key", "cast(value as string) value", "CAST(partition as String)", "CAST(offset as String)", "CAST(timestamp as String)")
.select( $"value", $"key", $"partition", $"offset", $"timestamp").as[(String, String, String, String, String)]
val dataSetTranslated = dataSetSource.map(row=>{
val jsonInput= row._1
val key = row._2
val partition = row._3
val offset = row._4
val timestamp = row._5
//Converting Original JSON to ArrayBuffer of JSON(s)
val mapperObj = new ObjectMapper()
val jsonMap = mapperObj.readValue(jsonInput,classOf[HashMap[String,HashMap[String,String]]])
val JsonList = new ArrayBuffer[String]()
for ((k,v) <- jsonMap.asScala){
val jsonResp = mapperObj.writeValueAsString(v)
JsonList+=jsonResp
}
K3FLDEntity(JsonList,key,partition,offset,timestamp)
}).as[(ArrayBuffer[String],String,String,String,String)]
val dataFrameTranslated=dataSetTranslated.withColumn("value",explode($"JsonList"))
.drop("JsonList")
.toDF()
dataFrameTranslated
}
I am using the below case class -
import scala.collection.mutable.ArrayBuffer
case class K3FLDEntity(
JsonList : ArrayBuffer[String],
key: String,
partition: String,
offset: String,
timestamp:String
)

How to set default value to 'null' in Dataset parsed from RDD[String] applying Case Class as schema

I am parsing JSON strings from a given RDD[String] and try to convert it into a Dataset with a given case class. However, when the JSON string does not contain all required fields of the case class I get an Exception that the missing column could not be found.
How can I define default values for such cases?
I tried defining default values in the case class but that did not solve the problem. I am working with Spark 2.3.2 and Scala 2.11.12.
This code is working fine
import org.apache.spark.rdd.RDD
case class SchemaClass(a: String, b: String)
val jsonData: String = """{"a": "foo", "b": "bar"}"""
val jsonRddString: RDD[String] = spark.sparkContext.parallelize(List(jsonData))
import spark.implicits._
val ds = spark.read.json(jsonRddString).as[SchemaClass]
When I run this code
val jsonDataIncomplete: String = """{"a": "foo"}"""
val jsonIncompleteRddString: RDD[String] = spark.sparkContext.parallelize(List(jsonDataIncomplete))
import spark.implicits._
val dsIncomplete = spark.read.json(jsonIncompleteRddString).as[SchemaClass]
dsIncomplete.printSchema()
dsIncomplete.show()
I get the following Exception
org.apache.spark.sql.AnalysisException: cannot resolve '`b`' given input columns: [a];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:92)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:89)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$11.apply(TreeNode.scala:335)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)
[...]
Interestingly, the default value "null" is applied when json strings are parsed from a file as the example given in the Spark documentation on Datasets is shown:
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
Content of the json file
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
You can now skip loading json as RDD and then reading as DF to directly
val dsIncomplete = spark.read.json(Seq(jsonDataIncomplete).toDS) if you are using Spark 2.2+
Load your JSON data
Extract your schema from case class or define it manually
Get the missing field list
Default the value to lit(null).cast(col.dataType) for missing column.
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StructField, StructType}
object DefaultFieldValue {
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess
import spark.implicits._
val jsonDataIncomplete: String = """{"a": "foo"}"""
val dsIncomplete = spark.read.json(Seq(jsonDataIncomplete).toDS)
val schema: StructType = Encoders.product[SchemaClass].schema
val fields: Array[StructField] = schema.fields
val outdf = fields.diff(dsIncomplete.columns).foldLeft(dsIncomplete)((acc, col) => {
acc.withColumn(col.name, lit(null).cast(col.dataType))
})
outdf.printSchema()
outdf.show()
}
}
case class SchemaClass(a: String, b: Int, c: String, d: Double)
package spark
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Column, Encoders, SparkSession}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.{col, lit}
object JsonDF extends App {
val spark = SparkSession.builder()
.master("local")
.appName("DataFrame-example")
.getOrCreate()
import spark.implicits._
case class SchemaClass(a: String, b: Int)
val jsonDataIncomplete: String = """{"a": "foo", "m": "eee"}"""
val jsonIncompleteRddString: RDD[String] = spark.sparkContext.parallelize(List(jsonDataIncomplete))
val dsIncomplete = spark.read.json(jsonIncompleteRddString) // .as[SchemaClass]
lazy val schema: StructType = Encoders.product[SchemaClass].schema
lazy val fields: Array[String] = schema.fieldNames
lazy val colNames: Array[Column] = fields.map(col(_))
val sch = dsIncomplete.schema
val schemaDiff = schema.diff(sch)
val rr = schemaDiff.foldLeft(dsIncomplete)((acc, col) => {
acc.withColumn(col.name, lit(null).cast(col.dataType))
})
val schF = dsIncomplete.schema
val schDiff = schF.diff(schema)
val rrr = schDiff.foldLeft(rr)((acc, col) => {
acc.drop(col.name)
})
.select(colNames: _*)
}
It will work the same way if you have different json strings in the same RDD. When you have only one which is not matching with the schema then it will throw error.
Eg.
val jsonIncompleteRddString: RDD[String] = spark.sparkContext.parallelize(List(jsonDataIncomplete, jsonData))
import spark.implicits._
val dsIncomplete = spark.read.json(jsonIncompleteRddString).as[SchemaClass]
dsIncomplete.printSchema()
dsIncomplete.show()
scala> dsIncomplete.show()
+---+----+
| a| b|
+---+----+
|foo|null|
|foo| bar|
+---+----+
One way you can do is instead converting it as[Person] you can build schema(StructType) from it and apply it while reading the json files,
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Person].schema
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.schema(schema).json(path).as[Person]
peopleDS.show
+-------+----+
| name| age|
+-------+----+
|Michael|null|
+-------+----+
Content of the code file is,
{"name":"Michael"}
The answer from #Sathiyan S led me to the following solution (presenting it here as it did not completely solved my problems but served as the pointer to the right direction):
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.{StructField, StructType}
// created expected schema
val schema = Encoders.product[SchemaClass].schema
// convert all fields as nullable
val newSchema = StructType(schema.map {
case StructField( c, t, _, m) ⇒ StructField( c, t, nullable = true, m)
})
// apply expected and nullable schema for parsing json string
session.read.schema(newSchema).json(jsonIncompleteRddString).as[SchemaClass]
Benefits:
All missing fields are set to null, independent of data type
Additional fields in the json string, which are not part of the case class will be ignored

Error in returning spark rdd from a function called inside the map function

I have a collection of rowkeys (plants as shown below) from hbase table and I want to make a fetchData function which returns rdd data for rowkeys from the collection. Goal is to get union of RDDs from fetchData method for each element from plants collection. I have given the relevant part of code below. My issue is, the code is giving compilation error for return type of fetchData:
println("PartB: "+ hBaseRDD.getNumPartitions)
error: value getNumPartitions is not a member of Option[org.apache.spark.rdd.RDD[it.nerdammer.spark.test.sys.Record]]
I am using scala 2.11.8 spark 2.2.0 and maven compilation
import it.nerdammer.spark.hbase._
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
import org.apache.log4j.Level
import org.apache.log4j.Logger
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object sys {
case class systems( rowkey: String, iacp: Option[String], temp: Option[String])
val spark = SparkSession.builder().appName("myApp").config("spark.executor.cores",4).getOrCreate()
import spark.implicits._
type Record = (String, Option[String], Option[String])
def fetchData(plant: String): RDD[Record] = {
val start_index = plant
val end_index = plant + "z"
//The below command works fine if I run it in main function, but to get multiple rows from hbase, I am using it in a separate function
spark.sparkContext.hbaseTable[Record]("test_table").select("iacp","temp").inColumnFamily("pp").withStartRow(start_index).withStopRow(end_index)
}
def main(args: Array[String]) {
//the below elements in the collection are prefix of relevant rowkeys in hbase table ("test_table")
val plants = Vector("a8","cu","aw","fx")
val hBaseRDD = plants.map( pp => fetchData(pp))
println("Part: "+ hBaseRDD.getNumPartitions)
/*
rest of the code
*/
}
}
Here is the working version of code. The problem here is that I am using for loop and I have to request data corresponding to rowkey (plants) vector from HBase in each loop instead of getting all the data first and then executing rest of the codes
import it.nerdammer.spark.hbase._
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
import org.apache.log4j.Level
import org.apache.log4j.Logger
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object sys {
case class systems( rowkey: String, iacp: Option[String], temp: Option[String])
def main(args: Array[String]) {
val spark = SparkSession.builder().appName("myApp").config("spark.executor.cores",4).getOrCreate()
import spark.implicits._
type Record = (String, Option[String], Option[String])
val plants = Vector("a8","cu","aw","fx")
for (plant <- plants){
val start_index = plant
val end_index = plant + "z"
val hBaseRDD = spark.sparkContext.hbaseTable[Record]("test_table").select("iacp","temp").inColumnFamily("pp").withStartRow(start_index).withStopRow(end_index)
println("Part: "+ hBaseRDD.getNumPartitions)
/*
rest of the code
*/
}
}
}
After trying, this is where I am stuck now. So how can I cast the type to required.
scala> def fetchData(plant: String) = {
| val start_index = plant
| val end_index = plant + "~"
| val x1 = spark.sparkContext.hbaseTable[Record]("test_table").select("iacp","temp").inColumnFamily("pp").withStartRow(start_index).withStopRow(end_index)
| x1
| }
Define the function in REPL and running it
scala> val hBaseRDD = plants.map( pp => fetchData(pp)).reduceOption(_ union _)
<console>:39: error: type mismatch;
found : org.apache.spark.rdd.RDD[(String, Option[String], Option[String])]
required: it.nerdammer.spark.hbase.HBaseReaderBuilder[(String, Option[String], Option[String])]
val hBaseRDD = plants.map( pp => fetchData(pp)).reduceOption(_ union _)
Thanks in Advance!
The type of hBaseRDD is Vector[_] and not RDD[_], so you can't execute method getNumPartitions on it. If I understand correctly you want to union fetched RDDs. You can do it by plants.map( pp => fetchData(pp)).reduceOption(_ union _) (I recommend to use reduceOption because it won't fail on empty list, but you can use reduce if you confident that list is not empty)
Also the returned type of fetchData is RDD[U], but I didn't find any definition of U. Probably this is the reason why compiler infer Vector[Nothing] instead of Vector[RDD[Record]]. In order to avoid subsequent errors you should also change RDD[U] to RDD[Record].

Spark read csv with commented headers

I have the following file which I need to read using spark in scala -
#Version: 1.0
#Fields: date time location timezone
2018-02-02 07:27:42 US LA
2018-02-02 07:27:42 UK LN
I am currently trying to extract the fields using the following the -
spark.read.csv(filepath)
I am new to spark+scala and wanted to know know is there a better way to extract fields based on the # Fields row at the top of the file.
You should be using sparkContext's textFile api to read the text file and then filter the header line
val rdd = sc.textFile("filePath")
val header = rdd
.filter(line => line.toLowerCase.contains("#fields:"))
.map(line => line.split(" ").tail)
.first()
That should be it.
Now if you want to create a dataframe then you should parse it to form schema and then filter the data lines to form Rows. And finally use SQLContext to create a dataframe
import org.apache.spark.sql.types._
val schema = StructType(header.map(title => StructField(title, StringType, true)))
val dataRdd = rdd.filter(line => !line.contains("#")).map(line => Row.fromSeq(line.split(" ")))
val df = sqlContext.createDataFrame(dataRdd, schema)
df.show(false)
This should give you
+----------+--------+--------+--------+
|date |time |location|timezone|
+----------+--------+--------+--------+
|2018-02-02|07:27:42|US |LA |
|2018-02-02|07:27:42|UK |LN |
+----------+--------+--------+--------+
Note: if the file is tab delimited, instead of doing
line.split(" ")
you should be using \t
line.split("\t")
Sample input file "example.csv"
#Version: 1.0
#Fields: date time location timezone
2018-02-02 07:27:42 US LA
2018-02-02 07:27:42 UK LN
Test.scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession.Builder
import org.apache.spark.sql._
import scala.util.Try
object Test extends App {
// create spark session and sql context
val builder: Builder = SparkSession.builder.appName("testAvroSpark")
val sparkSession: SparkSession = builder.master("local[1]").getOrCreate()
val sc: SparkContext = sparkSession.sparkContext
val sqlContext: SQLContext = sparkSession.sqlContext
case class CsvRow(date: String, time: String, location: String, timezone: String)
// path of your csv file
val path: String =
"sample.csv"
// read csv file and skip firs two lines
val csvString: Seq[String] =
sc.textFile(path).toLocalIterator.drop(2).toSeq
// try to read only valid rows
val csvRdd: RDD[(String, String, String, String)] =
sc.parallelize(csvString).flatMap(r =>
Try {
val row: Array[String] = r.split(" ")
CsvRow(row(0), row(1), row(2), row(3))
}.toOption)
.map(csvRow => (csvRow.date, csvRow.time, csvRow.location, csvRow.timezone))
import sqlContext.implicits._
// make data frame
val df: DataFrame =
csvRdd.toDF("date", "time", "location", "timezone")
// display dataf frame
df.show()
}