I am new to Scala programming and am using IntelliJ IDE. I am getting the below exception when I run my Scala sample code. Not sure if I am missing any dependency.
Sample code
package com.assessments.example
object Example extends App {
//Create a spark context, using a local master so Spark runs on the local machine
val spark = SparkSession.builder().master("local[*]").appName("ScoringModel").getOrCreate()
//importing spark implicits allows functions such as dataframe.as[T]
import spark.implicits._
//Set logger level to Warn
Logger.getRootLogger.setLevel(Level.WARN)
case class CustomerData(
customerId: String,
forename: String,
surname: String
)
case class FullName(
firstName: String,
surname: String
)
case class CustomerModel(
customerId: String,
forename: String,
surname: String,
fullname: FullName
)
val customerData = spark.read.option("header","true").csv("src/main/resources/customer_data.csv").as[CustomerData]
val customerModel = customerData
.map(
customer =>
CustomerModel(
customerId = customer.customerId,
forename = customer.forename,
surname = customer.surname,
fullname = FullName(
firstName = customer.forename,
surname = customer.surname))
)
customerModel.show(truncate = false)
customerModel.write.mode("overwrite").parquet("src/main/resources/customerModel.parquet")
}
Exception message:
Exception in thread "main" java.lang.NoSuchMethodError: scala.collection.mutable.Buffer$.empty()Lscala/collection/GenTraversable;
at org.apache.spark.sql.SparkSessionExtensions.<init>(SparkSessionExtensions.scala:103)
at org.apache.spark.sql.SparkSession$Builder.<init>(SparkSession.scala:793)
at org.apache.spark.sql.SparkSession$.builder(SparkSession.scala:984)
at com.assessments.example.Example$.delayedEndpoint$com$assessments$example$Example$1(Example.scala:10)
at com.assessments.example.Example$delayedInit$body.apply(Example.scala:6)
at scala.Function0.apply$mcV$sp(Function0.scala:39)
at scala.Function0.apply$mcV$sp$(Function0.scala:39)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1(App.scala:76)
at scala.App.$anonfun$main$1$adapted(App.scala:76)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561)
at scala.collection.AbstractIterable.foreach(Iterable.scala:926)
at scala.App.main(App.scala:76)
at scala.App.main$(App.scala:74)
at com.assessments.example.Example$.main(Example.scala:6)
at com.assessments.example.Example.main(Example.scala)
I am using spark version 3.1.2 and Scala version of 2.12.10. When I checked this version of Scala seems to support spark.
Appreciate any guidance on how to get this resolved. Thanks
To solve your problem I would not use case-class as schema. Also I would go with spark dataframes. You define your schema as follows:
val dataSchema = StructType(Array(
StructField("customerId", StringType, true),
StructField("forename", StringType, true),
StructField("surname", StringType, true)
))
// Load data
val rawDf = context.read.format("csv")
.option("delimiter", ",") //edit accordingly
.option("escape", "\"") //edit accordingly
.option("header", "true")
.option("mode", "PERMISSIVE")
.schema(dataSchema)
.load("src/main/resources") // Will read all the csv in directory
rawDf.show()
Once you can see your data then move on to transformations. Create a Struct or a Map data type as suggested here
PySpark - Combine DF columns into named StructType
This is in pySpark but the idea is the same. You play with sparkSQL.
Once you do this simply write to parquet.
finalDf.write.mode("overwrite").parquet("src/main/resources/customerModel")
Notice the output path, there is no file name there. Spark will writeout the data in customerModel directory.
Related
I want to dump data into the HBase table from dataframe using the Spark scala code. I tried using HBaseTableCatalog.
Added following Dependencies :
shc-core-1.1.0.3.1.5.6-1.jar
hbase libraries(hbase-client.jar,hbase-common.jar,hbase-protocol.jar,hbase-server.jar,hbase-spark.jar,hbase-shaeded*.jar,htrace-core,hbase-mapreduce.jar,hadoop-mapreduce-client-core-{version}.jar)
Below is the code:
case class HBaseRecord(col0: String, col1: String, col2: String)
val catalog = s"""{
"table":{"namespace":"default", "name":"shcExampleTable", "tableCoder":"PrimitiveType"},
"rowkey":"key",
"columns":{
"col0":{"cf":"rowkey", "col":"key", "type":"string"},
"col1":{"cf":"cf1", "col":"col1", "type":"string"},
"col2":{"cf":"cf2", "col":"col2", "type":"string"}
}
}""".stripMargin
val AFINN = sc.textFile("hdfs://sandbox-hdp.hortonworks.com:8020/Input/AFINN1.txt").map(x=> x.split("\t")).map(x => HBaseRecord(x(0).toString,x(1).toString,x(2).toString))
val AFINNDF = AFINN.toDF("col0","col1","col2")
AFINNDF.createOrReplaceTempView("rating")
val DF = AFINNDF.select($"col0",$"col1",$"col2")
DF.write.options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "4")).format("org.apache.spark.sql.execution.datasources.hbase").save()
It throws below error:
java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/mapreduce/TableOutputFormat
at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.insert(HBaseRelation.scala:230)
at org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:65)
I already added hadoop-mapreduce*.jar still it's throwing error.
Which library is needed for TableOutputFormat?
Thanks,
I am kind of newbie to big data world. I have a initial CSV which has a data size of ~40GB but in some kind of shifted order. I mean if you see initial CSV, for Jenny there is no age, so sex column value is shifted to age and remaining column value keeps shifting till the last element in the row.
I want clean/process this CVS using dataframe with Spark in Scala. I tried quite a few solution with withColumn() API and all, but nothing worked for me.
If anyone can suggest me some sort of logic or API available which is out there to solve this in a cleaner way. I might not need proper solution but pointers will also do. Help much appreciated!!
Initial CSV/Dataframe
Required CSV/Dataframe
EDIT:
This is how I'm reading the data:
val spark = SparkSession .builder .appName("SparkSQL")
.master("local[*]") .config("spark.sql.warehouse.dir", "file:///C:/temp")
.getOrCreate()
import spark.implicits._
val df = spark.read.option("header", true").csv("path/to/csv.csv")
This pretty much looks like the data is flawed. To handle this, I would suggest reading each line of the csv file as a single string and the applying a map() function to handle the data
case class myClass(name: String, age: Integer, sex: String, siblings: Integer)
val myNewDf = myDf.map(row => {
val myRow: String = row.getAs[String]("MY_SINGLE_COLUMN")
val myRowValues = myRow.split(",")
if (4 == myRowValues.size()) {
//everything as expected
return myClass(myRowValues[0], myRowValues[1], myRowValues[2], myRowValues[3])
} else {
//do foo to guess missing values
}
}
As in your case Data is not properly formatted. To handle this first data has to be cleansed, i.e all rows of CSV should have same Schema or same no of delimiter/columns.
Basic approach to do this in spark could be:
Load data as Text
Apply map operation on loaded DF/DS to clean it
Create Schema manually
Apply Schema on the cleansed DF/DS
Sample Code
//Sample CSV
John,28,M,3
Jenny,M,3
//Sample Code
val schema = StructType(
List(
StructField("name", StringType, nullable = true),
StructField("age", IntegerType, nullable = true),
StructField("sex", StringType, nullable = true),
StructField("sib", IntegerType, nullable = true)
)
)
import spark.implicits._
val rawdf = spark.read.text("test.csv")
rawdf.show(10)
val rdd = rawdf.map(row => {
val raw = row.getAs[String]("value")
//TODO: Data cleansing has to be done.
val values = raw.split(",")
if (values.length != 4) {
s"${values(0)},,${values(1)},${values(2)}"
} else {
raw
}
})
val df = spark.read.schema(schema).csv(rdd)
df.show(10)
You can try to define a case class with Optional field for age and load your csv with schema directly into a Dataset.
Something like that :
import org.apache.spark.sql.{Encoders}
import sparkSession.implicits._
case class Person(name: String, age: Option[Int], sex: String, siblings: Int)
val schema = Encoders.product[Person].schema
val dfInput = sparkSession.read
.format("csv")
.schema(schema)
.option("header", "true")
.load("path/to/csv.csv")
.as[Person]
I'm trying to write some data into bigtable using a SparkSession
val spark = SparkSession
.builder
.config(conf)
.appName("my-job")
.getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration
import spark.implicits._
case class BestSellerRecord(skuNbr: String, slsQty: String, slsDollar: String, dmaNbr: String, productId: String)
val seq: DataFrame = Seq(("foo", "1", "foo1"), ("bar", "2", "bar1")).toDF("key", "value1", "value2")
val bigtablePuts = seq.toDF.rdd.map((row: Row) => {
val put = new Put(Bytes.toBytes(row.getString(0)))
put.addColumn(Bytes.toBytes(columnFamily), Bytes.toBytes("nbr"), Bytes.toBytes(row.getString(0)))
(new ImmutableBytesWritable(), put)
})
bigtablePuts.saveAsNewAPIHadoopDataset(hadoopConf)
But this gives me the following exception.
Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:138)
at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.assertConf(SparkHadoopWriter.scala:391)
at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:71)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1083)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1081)
which is coming from
bigtablePuts.saveAsNewAPIHadoopDataset(hadoopConf)
this line. Also I tried to set the different configurations using hadoopConf.set such as conf.set("spark.hadoop.validateOutputSpecs", "false") but this gives me a NullPointerException.
How may I fix this issue?
Can you try to upgrade to the mapreduce api, as the mapred is deprecated.
This question here shows an example of rewriting this code segment: Output directory not set exception when save RDD to hbase with spark
Hope this is helpful.
I am trying to convert a csv file to a dataframe in Spark 1.5.2 with Scala without the use of the library databricks, as it is a community project and this library is not available. My approach was the following:
var inputPath = "input.csv"
var text = sc.textFile(inputPath)
var rows = text.map(line => line.split(",").map(_.trim))
var header = rows.first()
var data = rows.filter(_(0) != header(0))
var df = sc.makeRDD(1 to data.count().toInt).map(i => (data.take(i).drop(i-1)(0)(0), data.take(i).drop(i-1)(0)(1), data.take(i).drop(i-1)(0)(2), data.take(i).drop(i-1)(0)(3), data.take(i).drop(i-1)(0)(4))).toDF(header(0), header(1), header(2), header(3), header(4))
This code, even though it is quite a mess, works without returning any error messages. The problem comes when trying to display the data inside dfin order to verify the correctness of this method and later try to do some queries in df. The error code I am getting after executing df.show() is SPARK-5063. My questions are:
1) Why is it not possible to print the content of df?
2) Is there any other more straightforward method to convert a csv to a dataframe in Spark 1.5.2 without using the library databricks?
For spark 1.5.x can be used code snippet below to convert input into DF
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the DataClass interface with 5 fields.
case class DataClass(id: Int, name: String, surname: String, bdate: String, address: String)
// Create an RDD of DataClass objects and register it as a table.
val peopleData = sc.textFile("input.csv").map(_.split(",")).map(p => DataClass(p(0).trim.toInt, p(1).trim, p(2).trim, p(3).trim, p(4).trim)).toDF()
peopleData.registerTempTable("dataTable")
val peopleDataFrame = sqlContext.sql("SELECT * from dataTable")
peopleDataFrame.show()
Spark 1.5
You can create like this:
SparkSession spark = SparkSession
.builder()
.appName("RDDtoDF_Updated")
.master("local[2]")
.config("spark.some.config.option", "some-value")
.getOrCreate();
StructType schema = DataTypes
.createStructType(new StructField[] {
DataTypes.createStructField("eid", DataTypes.IntegerType, false),
DataTypes.createStructField("eName", DataTypes.StringType, false),
DataTypes.createStructField("eAge", DataTypes.IntegerType, true),
DataTypes.createStructField("eDept", DataTypes.IntegerType, true),
DataTypes.createStructField("eSal", DataTypes.IntegerType, true),
DataTypes.createStructField("eGen", DataTypes.StringType,true)});
String filepath = "F:/Hadoop/Data/EMPData.txt";
JavaRDD<Row> empRDD = spark.read()
.textFile(filepath)
.javaRDD()
.map(line -> line.split("\\,"))
.map(r -> RowFactory.create(Integer.parseInt(r[0]), r[1].trim(),Integer.parseInt(r[2]),
Integer.parseInt(r[3]),Integer.parseInt(r[4]),r[5].trim() ));
Dataset<Row> empDF = spark.createDataFrame(empRDD, schema);
empDF.groupBy("edept").max("esal").show();
Using Spark with Scala.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
var hiveCtx = new HiveContext(sc)
var inputPath = "input.csv"
var text = sc.textFile(inputPath)
var rows = text.map(line => line.split(",").map(_.trim)).map(a => Row.fromSeq(a))
var header = rows.first()
val schema = StructType(header.map(fieldName => StructField(fieldName.asInstanceOf[String],StringType,true)))
val df = hiveCtx.createDataframe(rows,schema)
This should work.
But for creating dataframe, would recommend you to use Spark-CSV.
I am trying to read parquet file in dataframe with following code:
val data = spark.read.schema(schema)
.option("dateFormat", "YYYY-MM-dd'T'hh:mm:ss").parquet(<file_path>)
data.show()
Here's the schema:
def schema: StructType = StructType(Array[StructField](
StructField("id", StringType, false),
StructField("text", StringType, false),
StructField("created_date", DateType, false)
))
When I try to execute data.show(), it throws the following exception:
Caused by: java.lang.ClassCastException: [B cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at org.apache.spark.sql.catalyst.expressions.MutableInt.update(SpecificInternalRow.scala:74)
at org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.update(SpecificInternalRow.scala:240)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.set(ParquetRowConverter.scala:159)
at org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addBinary(ParquetRowConverter.scala:89)
at org.apache.parquet.column.impl.ColumnReaderImpl$2$6.writeValue(ColumnReaderImpl.java:324)
at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:372)
at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:198)
Apparently, it's because of date format and DateType in my schema. If I change DateType to StringType, it works fine and outputs the following:
+--------------------+--------------------+----------------------+
| id| text| created_date|
+--------------------+--------------------+----------------------+
|id..................|text................|2017-01-01T00:08:09Z|
I want to read created_date into DateType, do I need to change anything else?
The following works under Spark 2.1. Note the change of the date format and the usage of TimestampType instead of DateType.
val schema = StructType(Array[StructField](
StructField("id", StringType, false),
StructField("text", StringType, false),
StructField("created_date", TimestampType, false)
))
val data = spark
.read
.schema(schema)
.option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss'Z'")
.parquet("s3a://thisisnotabucket")
In older versions of Spark (I can confirm this works under 1.5.2), you can create a UDF to do the conversion for you in SQL.
def cvtDt(d: String): java.sql.Date = {
val fmt = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")
new java.sql.Date(fmt.parseDateTime(d).getMillis)
}
def cvtTs(d: String): java.sql.Timestamp = {
val fmt = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")
new java.sql.Timestamp(fmt.parseDateTime(d).getMillis)
}
sqlContext.udf.register("mkdate", cvtDt(_: String))
sqlContext.udf.register("mktimestamp", cvtTs(_: String))
sqlContext.read.parquet("s3a://thisisnotabucket").registerTempTable("dttest")
val query = "select *, mkdate(created_date), mktimestamp(created_date) from dttest"
sqlContext.sql(query).collect.foreach(println)
NOTE: I did this in the REPL, so I had to create the DateTimeFormat pattern on every call to the cvt* methods to avoid serialization issues. If you're doing this is an application, I recommend extracting the formatter into an object.
object DtFmt {
val fmt = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'Tââ'HH:mm:ss'Z'")
}