Getting NullPointerException when running saveAsNewAPIHadoopDataset in Scala Spark 2 to HBase

I am saving an RDD of Puts to HBase using saveAsNewAPIHadoopDataset. Below is my job creation and submission.
val outputTableName = "test3"
val conf2 = HBaseConfiguration.create()
conf2.set("hbase.zookeeper.quorum", "xx.xx.xx.xx")
conf2.set("hbase.mapred.outputtable", outputTableName)
conf2.set("mapreduce.outputformat.class", "org.apache.hadoop.hbase.mapreduce.TableOutputFormat")
val job = createJob(outputTableName, conf2)
val outputTable = sc.broadcast(outputTableName)
val hbasePuts = simpleRdd.map(k => convertToPut(k, outputTable))
hbasePuts.saveAsNewAPIHadoopDataset(job.getConfiguration)
This is my job creation function
def createJob(table: String, conf: Configuration): Job = {
  conf.set(TableOutputFormat.OUTPUT_TABLE, table)
  val job = Job.getInstance(conf, this.getClass.getName.split('$')(0))
  job.setOutputFormatClass(classOf[TableOutputFormat[String]])
  job
}
This function converts the data into HBase format:
def convertToPut(k: (String, String, String), outputTable: Broadcast[String]): (ImmutableBytesWritable, Put) = {
  val rowkey = k._1
  val put = new Put(Bytes.toBytes(rowkey))
  val one = Bytes.toBytes("cf1")
  val two = Bytes.toBytes("cf2")
  put.addColumn(one, Bytes.toBytes("a"), Bytes.toBytes(k._2))
  put.addColumn(two, Bytes.toBytes("a"), Bytes.toBytes(k._3))
  (new ImmutableBytesWritable(Bytes.toBytes(outputTable.value)), put)
}
This is the error I am getting at line 125, which is: hbasePuts.saveAsNewAPIHadoopDataset(job.getConfiguration)
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.hbase.security.UserProvider.instantiate(UserProvider.java:122)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:214)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:119)
at org.apache.hadoop.hbase.mapreduce.TableOutputFormat.checkOutputSpecs(TableOutputFormat.java:177)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1099)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1085)
at ScalaSpark$.main(ScalaSpark.scala:125)
at ScalaSpark.main(ScalaSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I have encountered the same problem.
I think there is a bug in the org.apache.hadoop.hbase.mapreduce.TableOutputFormat class.
The original TableOutputFormat code is below:
public void checkOutputSpecs(JobContext context) throws IOException,
    InterruptedException {
  try (Admin admin = ConnectionFactory.createConnection(getConf()).getAdmin()) {
    TableName tableName = TableName.valueOf(this.conf.get(OUTPUT_TABLE));
    if (!admin.tableExists(tableName)) {
      throw new TableNotFoundException("Can't write, table does not exist:" +
          tableName.getNameAsString());
    }
    if (!admin.isTableEnabled(tableName)) {
      throw new TableNotEnabledException("Can't write, table is not enabled: " +
          tableName.getNameAsString());
    }
  }
}
If I fix it as below:
public void checkOutputSpecs(JobContext context) throws IOException,
    InterruptedException {
  // set conf by context parameter
  setConf(context.getConfiguration());
  try (Admin admin = ConnectionFactory.createConnection(getConf()).getAdmin()) {
    TableName tableName = TableName.valueOf(this.conf.get(OUTPUT_TABLE));
    if (!admin.tableExists(tableName)) {
      throw new TableNotFoundException("Can't write, table does not exist:" +
          tableName.getNameAsString());
    }
    if (!admin.isTableEnabled(tableName)) {
      throw new TableNotEnabledException("Can't write, table is not enabled: " +
          tableName.getNameAsString());
    }
  }
}
My problem is resolved.
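If patching and rebuilding the HBase jar is not practical, a workaround in the same spirit (a sketch added here, not part of the original fix, and untested) is to subclass TableOutputFormat, propagate the configuration from the JobContext before delegating to the parent's checks, and register the subclass as the job's output format:

import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.mapreduce.JobContext

// Hypothetical subclass: sets the configuration from the JobContext before the
// parent class runs its table-existence checks, so getConf() is no longer null.
class ConfiguredTableOutputFormat[K] extends TableOutputFormat[K] {
  override def checkOutputSpecs(context: JobContext): Unit = {
    setConf(context.getConfiguration)
    super.checkOutputSpecs(context)
  }
}

// Register it instead of TableOutputFormat when building the job, e.g.:
// job.setOutputFormatClass(classOf[ConfiguredTableOutputFormat[ImmutableBytesWritable]])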
spark.hadoop.validateOutputSpecs
Another solution is to turn spark.hadoop.validateOutputSpecs off when creating the SparkSession:
val session = SparkSession.builder()
.config("spark.hadoop.validateOutputSpecs", false)
.getOrCreate()
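For reference, a minimal sketch (not from the original answer) of how that setting fits into the write path from the question; the quorum, table name, and hbasePuts RDD are the placeholders used in the question:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession

// With validateOutputSpecs disabled, Spark skips the checkOutputSpecs call,
// so the NPE path shown in the stack trace is never reached.
val session = SparkSession.builder()
  .appName("HBaseWrite")                              // hypothetical app name
  .config("spark.hadoop.validateOutputSpecs", false)
  .getOrCreate()

val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "xx.xx.xx.xx")     // placeholder quorum
conf.set(TableOutputFormat.OUTPUT_TABLE, "test3")

val job = Job.getInstance(conf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

// hbasePuts: RDD[(ImmutableBytesWritable, Put)] as built in the question
hbasePuts.saveAsNewAPIHadoopDataset(job.getConfiguration)

Note that this only skips the up-front validation; if the table is really missing or disabled, the failure will surface later during the actual write.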

Related

How to create a Scala JDBC program using Option to handle null while returning a connection?

I am trying to write a Scala JDBC program which will run an analyze statement on tables present in our database. To do that, I wrote the code below.
object Trip {
  def main(args: Array[String]): Unit = {
    val gs = new GetStats(args(0))
    gs.run_analyze()
  }
}
-----------------------------------------------------------------
class GetStats {
  var tables = ""
  def this(tables:String){
    this
    this.tables = tables
  }
  def run_analyze():Unit = {
    val tabList = tables.split(",")
    val gpc = new GpConnection()
    val con = gpc.getGpCon()
    val statement = con.get.createStatement()
    try {
      for(t<-tabList){
        val rs = statement.execute(s"analyze ${t}")
        if(rs.equals(true)) println(s"Analyzed ${t}")
        else println(s"Analyze failed ${t}")
      }
    } catch {
      case pse:PSQLException => pse.printStackTrace()
      case e:Exception => e.printStackTrace()
    }
  }
}
-----------------------------------------------------------------
class GpConnection {
  var gpCon: Option[Connection] = None
  def getGpCon():Option[Connection] = {
    val url = "jdbc:postgresql://.."
    val driver = "org.postgresql.Driver"
    val username = "user"
    val password = "1239876"
    Class.forName(driver)
    if(gpCon==None || gpCon.get.isClosed) {
      gpCon = DriverManager.getConnection(url, username, password).asInstanceOf[Option[Connection]]
      gpCon
    } else gpCon
  }
}
I create a jar file in my IDE (IntelliJ) and submit the jar as below.
scala -cp /home/username/jars/postgresql-42.1.4.jar analyzetables_2.11-0.1.jar schema.table
When I submit the jar file, I see the exception ClassCastException as given below.
java.lang.ClassCastException: org.postgresql.jdbc.PgConnection cannot be cast to scala.Option
at com.db.manager.GpConnection.getGpCon(GpConnection.scala:15)
at com.gp.analyze.GetStats.run_analyze(GetStats.scala:19)
at com.runstats.Trip$.main(Trip.scala:8)
at com.runstats.Trip.main(Trip.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.reflect.internal.util.ScalaClassLoader.$anonfun$run$2(ScalaClassLoader.scala:98)
at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:32)
at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:30)
at scala.reflect.internal.util.ScalaClassLoader.run$(ScalaClassLoader.scala:98)
at scala.reflect.internal.util.ScalaClassLoader.run(ScalaClassLoader.scala:90)
at scala.tools.nsc.CommonRunner.run$(ObjectRunner.scala:22)
The exception says that the connection cannot be cast to scala.Option, but if I don't use Option, I cannot use null to initialize the connection object, and I see a NullPointerException when I run the code.
Could anyone let me know what mistake I am making here and how I can fix it?
asInstanceOf[] doesn't work that way. It won't just create an Option[] for you.
val x:Option[Int] = 5.asInstanceOf[Option[Int]] //not gonna happen
You have to create the Option[] explicitly.
val x:Option[Int] = Option(5)
You can use an uninitialized var as:
var gpCon: Connection = _
But since you are using scala.Option, which is the better approach, do it in a functional way instead of writing imperative Java-style code in Scala, like this:
// a singleton object (Scala provided)
object GpConnection {
  private var gpCon: Option[Connection] = None

  // returns a Connection (no option - as we need it!)
  def getOrCreateCon(): Connection = gpCon match {
    case conOpt if conOpt.isEmpty || conOpt.get.isClosed =>
      // connection not present or is closed
      val url = "jdbc:postgresql://.."
      val driver = "org.postgresql.Driver"
      val username = "user"
      val password = "1239876"
      // may throw an exception - you can even handle this
      Class.forName(driver)
      // may throw an exception - you can even handle this
      gpCon = Option(DriverManager.getConnection(url, username, password).asInstanceOf[Connection])
      gpCon.getOrElse(throw new RuntimeException("Cannot create connection"))
    case Some(con) => con
  }
}
use it like:
val con = GpConnection.getOrCreateCon
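A hypothetical adaptation of run_analyze on top of this singleton (Statement.execute's Boolean return only indicates whether a result set was produced, so it is not treated as a success flag here):

// Sketch only: reuses the GpConnection singleton above; comma-separated table list as in the question.
def run_analyze(tables: String): Unit = {
  val statement = GpConnection.getOrCreateCon().createStatement()
  try {
    tables.split(",").foreach { t =>
      statement.execute(s"analyze $t")
      println(s"Analyzed $t")
    }
  } finally {
    statement.close()
  }
}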

Solr add document error

I am trying to load a CSV file into Solr documents using Scala. I am new to Scala. With the case class structure, if I pass one set of values it works fine, but if I want to send all the values read from the CSV, it gives an error. I am not sure how to do this in Scala; any help is greatly appreciated.
object BasicParseCsv {
  case class Person(id: String, name: String,age: String, addr: String )
  val schema = ArrayBuffer[Person]()

  def main(args: Array[String]) {
    val master = args(0)
    val inputFile = args(1)
    val outputFile = args(2)
    val sc = new SparkContext(master, "BasicParseCsv", System.getenv("SPARK_HOME"))
    val params = new ModifiableSolrParams
    val Solr = new HttpSolrServer("http://localhost:8983/solr/person1")

    //Preparing the Solr document
    val doc = new SolrInputDocument()
    val input = sc.textFile(inputFile)
    val result = input.map{ line =>
      val reader = new CSVReader(new StringReader(line));
      reader.readNext();
    }

    def getSolrDocument(person: Person): SolrInputDocument = {
      val document = new SolrInputDocument()
      document.addField("id",person.id)
      document.addField("name", person.name)
      document.addField("age",person.age)
      document.addField("addr", person.addr)
      document
    }

    def send(persons:List[Person]){
      persons.foreach(person=>Solr.add(getSolrDocument(person)))
      Solr.commit()
    }

    val people = result.map(x => Person(x(0), x(1),x(2),x(3)))
    val book1 = new Person("101","xxx","20","abcd")
    send(List(book1))
    people.map(person => send(List(Person(person.id, person.name, person.age,person.addr))))
    System.out.println("Documents added")
  }
}
people.map(person => send(List(Person(person.id, person.name, person.age,person.addr)))) ==> gives error
val book1 = new Person("101","xxx","20","abcd") ==> works fine
Update: I get the error below:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2067)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:324)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:323)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.map(RDD.scala:323)
at BasicParseCsv$.main(BasicParseCsv.scala:90)
at BasicParseCsv.main(BasicParseCsv.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
Caused by: java.io.NotSerializableException: org.apache.http.impl.client.SystemDefaultHttpClient
Serialization stack:
- object not serializable (class: org.apache.http.impl.client.SystemDefaultHttpClient, value: org.apache.http.impl.client.SystemDefaultHttpClient#1dbd580)
- field (class: org.apache.solr.client.solrj.impl.HttpSolrServer, name: httpClient, type: interface org.apache.http.client.HttpClient)
- object (class org.apache.solr.client.solrj.impl.HttpSolrServer, org.apache.solr.client.solrj.impl.HttpSolrServer#17e0827)
- field (class: BasicParseCsv$$anonfun$main$1, name: Solr$1, type: class org.apache.solr.client.solrj.impl.HttpSolrServer)
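As an aside (not part of the original post): the serialization stack above shows the driver-side HttpSolrServer being captured by the closure passed to map. A common workaround is to create the client on the executors, for example inside foreachPartition, so that nothing non-serializable has to cross the closure boundary. A rough sketch:

// Hypothetical sketch: one HttpSolrServer per partition, created on the executor,
// building the SolrInputDocument inline so no driver-side objects are captured.
people.foreachPartition { persons =>
  val solr = new HttpSolrServer("http://localhost:8983/solr/person1")
  persons.foreach { p =>
    val doc = new SolrInputDocument()
    doc.addField("id", p.id)
    doc.addField("name", p.name)
    doc.addField("age", p.age)
    doc.addField("addr", p.addr)
    solr.add(doc)
  }
  solr.commit()
  solr.shutdown()
}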

Spark application got the error of "Task not serializable"?

The following code gets a "Task not serializable" error.
The error:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2101)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.map(RDD.scala:369)
at ConnTest$.main(main.scala:41)
at ConnTest.main(main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: DoWork
Serialization stack:
- object not serializable (class: DoWork, value: DoWork#655621fd)
- field (class: ConnTest$$anonfun$2, name: doWork$1, type: class DoWork)
- object (class ConnTest$$anonfun$2, )
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 20 more
Code:
object ConnTest extends App {
  override def main(args: scala.Array[String]): Unit = {
    super.main(args)
    val date = args(0)
    val conf = new SparkConf()
    val sc = new SparkContext(conf.setAppName("Test").setMaster("local[*]"))
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val jdbcSqlConn = "jdbc:sqlserver://......;"
    val listJob = new ItemListJob(sqlContext, jdbcSqlConn)
    val list = listJob.run(date).select("id").rdd.map(r => r(0).asInstanceOf[Int]).collect()
    // It returns about 3000 rows
    val doWork = new DoWork(sqlContext, jdbcSqlConn)
    val processed = sc.parallelize(list).map(d => {
      doWork.run(d, date)
    })
  }
}

class ItemList(sqlContext: org.apache.spark.sql.SQLContext, jdbcSqlConn: String) {
  def run(date: LocalDate) = {
    sqlContext.read.format("jdbc").options(Map(
      "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
      "url" -> jdbcSqlConn,
      "dbtable" -> s"dbo.GetList('$date')"
    )).load()
  }
}

class DoWork(sqlContext: org.apache.spark.sql.SQLContext, jdbcSqlConn: String) {
  def run(id: Int, date: LocalDate) = {
    // ...... read the data from database for id, and create a text file
    val data = sqlContext.read.format("jdbc").options(Map(
      "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
      "url" -> jdbcSqlConn,
      "dbtable" -> s"someFunction('$id', $date)"
    )).load()
    // .... create a text file with content of data
    (id, date)
  }
}
Update:
I changed the .map() call to the following,
val processed = sc.parallelize(dealList).toDF.map(d => {
doWork.run(d(0).asInstanceOf[Int], rc)
})
Now I get the following error:
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for java.time.LocalDate
- field (class: "java.time.LocalDate", name: "_2")
- root class: "scala.Tuple2"
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:602)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:596)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:587)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
The issue is in the following closure:
val processed = sc.parallelize(list).map(d => {
doWork.run(d, date)
})
The closure in map will run in executors, so Spark needs to serialize doWork and send it to the executors. DoWork must be serializable. However, I saw that DoWork contains sc and sqlContext, so you cannot just make DoWork implement Serializable, because you cannot use them in executors.
I guess you probably want to store data into the database in DoWork. If so, you can convert the RDD to a DataFrame and save it via the jdbc method, such as:
sc.parallelize(list).toDF.write.jdbc(...)
I cannot give more suggestions since you haven't provided the code in DoWork.
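A minimal sketch of that approach (the target table name and write mode below are hypothetical; jdbcSqlConn and list come from the question):

import java.util.Properties
import sqlContext.implicits._   // for toDF on the RDD

val props = new Properties()
props.setProperty("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

sc.parallelize(list)
  .toDF("id")                                    // DataFrame with a single "id" column
  .write
  .mode("append")
  .jdbc(jdbcSqlConn, "dbo.ProcessedIds", props)  // hypothetical target table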

Exception in thread "main" java.lang.AbstractMethodError

I am trying to use Spark GraphX.
I have tried the following code:
/**
 * Created by xyz on 5/7/16.
 */
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

object simple {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("untitled1")
    val sc = new SparkContext(conf)
    val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt", true).partitionBy(PartitionStrategy.RandomVertexCut)
    val triCounts = graph.triangleCount().vertices
    val users = sc.textFile("graphx/data/users.txt").map { line =>
      val fields = line.split(",")
      (fields(0).toLong, fields(1))
    }
    val triCountByUsername = users.join(triCounts).map { case (id, (username, tc)) =>
      (username, tc)
    }
    println(triCountByUsername.collect().mkString("\n"))
  }
}
But it's giving me the following error:
Exception in thread "main" java.lang.AbstractMethodError
at org.apache.spark.Logging$class.log(Logging.scala:52)
at org.apache.spark.graphx.GraphLoader$.log(GraphLoader.scala:26)
at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
at org.apache.spark.graphx.GraphLoader$.logInfo(GraphLoader.scala:26)
at org.apache.spark.graphx.GraphLoader$.edgeListFile(GraphLoader.scala:84)
at simple$.main(simple.scala:18)
at simple.main(simple.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
I have a data file with the following content:
2 1
4 1
1 2
6 3
7 3
7 6
6 7
3 7

Spark Hadoop Failed to get broadcast

Running a spark-submit job and receiving a "Failed to get broadcast_58_piece0..." error. I'm really not sure what I'm doing wrong. Am I overusing UDFs? Too complicated a function?
As a summary of my objective, I am parsing text from PDFs, which are stored as base64-encoded strings in JSON objects. I'm using Apache Tika to get the text, and trying to make copious use of data frames to make things easier.
I had written a piece of code that ran the text extraction through Tika as a function outside of "main" on the data as an RDD, and that worked flawlessly. When I try to bring the extraction into main as a UDF on data frames, though, it borks in various different ways. Before I got here I was actually trying to write the final data frame as:
valid.toJSON.saveAsTextFile(hdfs_dir)
This was giving me all sorts of "File/Path already exists" headaches.
Current code:
object Driver {
  def main(args: Array[String]):Unit = {
    val hdfs_dir = args(0)
    val spark_conf = new SparkConf().setAppName("Spark Tika HDFS")
    val sc = new SparkContext(spark_conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    // load json data into dataframe
    val df = sqlContext.read.json("hdfs://hadoophost.com:8888/user/spark/data/in/*")

    val extractInfo: (Array[Byte] => String) = (fp: Array[Byte]) => {
      val parser:Parser = new AutoDetectParser()
      val handler:BodyContentHandler = new BodyContentHandler(Integer.MAX_VALUE)
      val config:TesseractOCRConfig = new TesseractOCRConfig()
      val pdfConfig:PDFParserConfig = new PDFParserConfig()
      val inputstream:InputStream = new ByteArrayInputStream(fp)
      val metadata:Metadata = new Metadata()
      val parseContext:ParseContext = new ParseContext()
      parseContext.set(classOf[TesseractOCRConfig], config)
      parseContext.set(classOf[PDFParserConfig], pdfConfig)
      parseContext.set(classOf[Parser], parser)
      parser.parse(inputstream, handler, metadata, parseContext)
      handler.toString
    }
    val extract_udf = udf(extractInfo)

    val df2 = df.withColumn("unbased_media", unbase64($"media_file")).drop("media_file")
    val dfRenamed = df2.withColumn("media_corpus", extract_udf(col("unbased_media"))).drop("unbased_media")

    val depuncter: (String => String) = (corpus: String) => {
      val r = corpus.replaceAll("""[\p{Punct}]""", "")
      val s = r.replaceAll("""[0-9]""", "")
      s
    }
    val depuncter_udf = udf(depuncter)

    val withoutPunct = dfRenamed.withColumn("sentence", depuncter_udf(col("media_corpus")))

    val model = sc.objectFile[org.apache.spark.ml.PipelineModel]("hdfs://hadoophost.com:8888/user/spark/hawkeye-nb-ml-v2.0").first()
    val with_predictions = model.transform(withoutPunct)

    val fullNameChecker: ((String, String, String, String, String) => String) = (fname: String, mname: String, lname: String, sfx: String, text: String) =>{
      val newtext = text.replaceAll(" ", "").replaceAll("""[0-9]""", "").replaceAll("""[\p{Punct}]""", "").toLowerCase
      val new_fname = fname.replaceAll(" ", "").replaceAll("""[0-9]""", "").replaceAll("""[\p{Punct}]""", "").toLowerCase
      val new_mname = mname.replaceAll(" ", "").replaceAll("""[0-9]""", "").replaceAll("""[\p{Punct}]""", "").toLowerCase
      val new_lname = lname.replaceAll(" ", "").replaceAll("""[0-9]""", "").replaceAll("""[\p{Punct}]""", "").toLowerCase
      val new_sfx = sfx.replaceAll(" ", "").replaceAll("""[0-9]""", "").replaceAll("""[\p{Punct}]""", "").toLowerCase
      val name_full = new_fname.concat(new_mname).concat(new_lname).concat(new_sfx)
      val c = name_full.r.findAllIn(newtext).length
      c match {
        case 0 => "N"
        case _ => "Y"
      }
    }
    val fullNameChecker_udf = udf(fullNameChecker)

    val stringChecker: ((String, String) => String) = (term: String, text: String) => {
      val termLower = term.replaceAll("""[\p{Punct}]""", "").toLowerCase
      val textLower = text.replaceAll("""[\p{Punct}]""", "").toLowerCase
      val c = termLower.r.findAllIn(textLower).length
      c match {
        case 0 => "N"
        case _ => "Y"
      }
    }
    val stringChecker_udf = udf(stringChecker)

    val stringChecker2: ((String, String) => String) = (term: String, text: String) => {
      val termLower = term takeRight 4
      val textLower = text
      val c = termLower.r.findAllIn(textLower).length
      c match {
        case 0 => "N"
        case _ => "Y"
      }
    }
    val stringChecker2_udf = udf(stringChecker)

    val valids = with_predictions.withColumn("fname_valid", stringChecker_udf(col("first_name"), col("media_corpus")))
      .withColumn("lname_valid", stringChecker_udf(col("last_name"), col("media_corpus")))
      .withColumn("fname2_valid", stringChecker_udf(col("first_name_2"), col("media_corpus")))
      .withColumn("lname2_valid", stringChecker_udf(col("last_name_2"), col("media_corpus")))
      .withColumn("camt_valid", stringChecker_udf(col("chargeoff_amount"), col("media_corpus")))
      .withColumn("ocan_valid", stringChecker2_udf(col("original_creditor_account_nbr"), col("media_corpus")))
      .withColumn("dpan_valid", stringChecker2_udf(col("debt_provider_account_nbr"), col("media_corpus")))
      .withColumn("full_name_valid", fullNameChecker_udf(col("first_name"), col("middle_name"), col("last_name"), col("suffix"), col("media_corpus")))
      .withColumn("full_name_2_valid", fullNameChecker_udf(col("first_name_2"), col("middle_name_2"), col("last_name_2"), col("suffix_2"), col("media_corpus")))

    valids.write.mode(SaveMode.Overwrite).format("json").save(hdfs_dir)
  }
}
Full stack trace starting with error:
16/06/14 15:02:01 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 53, hdpd11n05.squaretwofinancial.com): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:272)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_58_piece0 of broadcast_58
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.ml.feature.CountVectorizerModel$$anonfun$9$$anonfun$apply$7.apply(CountVectorizer.scala:222)
at org.apache.spark.ml.feature.CountVectorizerModel$$anonfun$9$$anonfun$apply$7.apply(CountVectorizer.scala:221)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at org.apache.spark.ml.feature.CountVectorizerModel$$anonfun$9.apply(CountVectorizer.scala:221)
at org.apache.spark.ml.feature.CountVectorizerModel$$anonfun$9.apply(CountVectorizer.scala:218)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalExpr43$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:263)
... 8 more
Caused by: org.apache.spark.SparkException: Failed to get broadcast_58_piece0 of broadcast_58
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:137)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:175)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
... 25 more
I encountered a similar error.
It turns out to be caused by the broadcast usage in CountVectorizerModel. The following is the detailed cause in my case:
When model.transform() is called, the vocabulary is broadcast and implicitly saved as an attribute (broadcastDict) in the model. Therefore, if the CountVectorizerModel is saved after calling model.transform(), the private var attribute broadcastDict is also saved. But unfortunately, in Spark a broadcast object is context-sensitive, which means it is tied to a SparkContext. If that CountVectorizerModel is loaded in a different SparkContext, it will fail to find the previously saved broadcastDict.
So the solution is either to avoid calling model.transform() before saving the model, or to clone the model with model.copy().
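A minimal sketch of the first option (assuming, as in the question, that the fitted model is persisted as an object file; the variable names and path are hypothetical):

// Persist the fitted model *before* any transform() call, so the lazily created
// broadcast dictionary is not part of the serialized object.
sc.parallelize(Seq(fittedPipelineModel), numSlices = 1)
  .saveAsObjectFile("hdfs://hadoophost.com:8888/user/spark/models/nb-clean")  // hypothetical path

val scored = fittedPipelineModel.transform(trainingDF)  // safe to call after saving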
For anyone coming across this, it turns out the model I was loading was malformed. I found out by using spark-shell in yarn-client mode and stepping through the code. When I tried to load the model it was fine, but running it against the data frame (model.transform) threw errors about not finding a metadata directory.
I went back and found a good model, ran against that, and it worked fine. The code itself is actually sound.