H2O implicit conversion throws compilation error - scala

The code below throws an error when assigning the H2OFrame; most likely something is wrong with the implicit conversion. The error is:
type mismatch; found : org.apache.spark.h2o.RDD[Int] (which expands
to) org.apache.spark.rdd.RDD[Int] required:
org.apache.spark.h2o.H2OFrame (which expands to) water.fvec.H2OFrame
and the code:
import org.apache.spark.h2o._
import org.apache.spark._
import org.apache.spark.SparkContext._
object App1 extends App {
  val conf = new SparkConf()
  conf.setAppName("Test")
  conf.setMaster("local[1]")
  conf.set("spark.executor.memory", "1g")

  val sc = new SparkContext(conf)
  val rawData = sc.textFile("c:\\spark\\data.csv")
  val data = rawData.map(line => line.split(',').map(_.toDouble))
  val response: RDD[Int] = data.map(row => row(0).toInt)

  val h2oResponse: H2OFrame = response // <-- this line throws the error

  sc.stop
}

All you are missing is the h2oContext implicits:
import h2oContext.implicits._
val h2oResponse: H2OFrame = response.toDF()
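For completeness, a minimal sketch of the whole conversion, assuming a recent Sparkling Water release is on the classpath (H2OContext.getOrCreate and asH2OFrame may be named differently in older versions):
import org.apache.spark.h2o._

// Create the H2OContext first; its implicits provide the RDD-to-H2OFrame conversions.
val h2oContext = H2OContext.getOrCreate(sc)
import h2oContext.implicits._

// Rely on the implicit conversion, or call asH2OFrame explicitly:
val h2oResponse: H2OFrame = h2oContext.asH2OFrame(response)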

Related

Error while running the spark scala code to do bulk load

I am using the following code in the REPL to create HFiles and bulk load them into HBase. I used the same code with spark-submit and it worked fine with no errors, but when I run it in the REPL it throws the error below.
import org.apache.spark._
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.client.{ConnectionFactory, HTable}
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.StringType
import scala.collection.mutable.ArrayBuffer
import org.apache.hadoop.hbase.KeyValue
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
val cdt = "dt".getBytes
val ctemp="temp".getBytes
val ctemp_min="temp_min".getBytes
val ctemp_max="temp_max".getBytes
val cpressure="pressure".getBytes
val csea_level="sea_level".getBytes
val cgrnd_level="grnd_level".getBytes
val chumidity="humidity".getBytes
val ctemp_kf="temp_kf".getBytes
val cid="id".getBytes
val cweather_main="weather_main".getBytes
val cweather_description="weather_description".getBytes
val cweather_icon="weather_icon".getBytes
val cclouds_all="clouds_all".getBytes
val cwind_speed="wind_speed".getBytes
val cwind_deg="wind_deg".getBytes
val csys_pod="sys_pod".getBytes
val cdt_txt="dt_txt".getBytes
val crain="rain".getBytes
val COLUMN_FAMILY = "data".getBytes
val cols = ArrayBuffer(cdt,ctemp,ctemp_min,ctemp_max,cpressure,csea_level,cgrnd_level,chumidity,ctemp_kf,cid,cweather_main,cweather_description,cweather_icon,cclouds_all,cwind_speed,cwind_deg,csys_pod,cdt_txt,crain)
val rowKey = new ImmutableBytesWritable()
val conf = HBaseConfiguration.create()
val ZOOKEEPER_QUORUM = "address"
conf.set("hbase.zookeeper.quorum", ZOOKEEPER_QUORUM);
val connection = ConnectionFactory.createConnection(conf)
val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferschema","true").load("Hbasedata/Weatherdata.csv")
val rdd = df.flatMap(x => { //Error when i run this
  rowKey.set(x(0).toString.getBytes)
  for (i <- 0 to cols.length - 1) yield {
    val index = x.fieldIndex(new String(cols(i)))
    val value = if (x.isNullAt(index)) "".getBytes else x(index).toString.getBytes
    (rowKey, new KeyValue(rowKey.get, COLUMN_FAMILY, cols(i), value))
  }
})
It is throwing the following error
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2067)
at org.apache.spark.rdd.RDD$$anonfun$flatMap$1.apply(RDD.scala:333)
at org.apache.spark.rdd.RDD$$anonfun$flatMap$1.apply(RDD.scala:332)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.flatMap(RDD.scala:332)
at org.apache.spark.sql.DataFrame.flatMap(DataFrame.scala:1418)
The error is thrown when I try to create the RDD. I used the same code with spark-submit and it worked fine.
The issue is in:
val rowKey = new ImmutableBytesWritable()
ImmutableBytesWritable is not serializable and it is declared outside the "flatMap" function; please check the full stack trace of the exception. You can move that statement inside the "flatMap" function, at least to check, as in the sketch below.
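A minimal sketch of that change, constructing the ImmutableBytesWritable inside the closure so nothing non-serializable is captured from the driver:
val rdd = df.flatMap { x =>
  // Created per row, inside the closure, so it never has to be serialized from the driver.
  val rowKey = new ImmutableBytesWritable(x(0).toString.getBytes)
  for (i <- 0 until cols.length) yield {
    val index = x.fieldIndex(new String(cols(i)))
    val value = if (x.isNullAt(index)) "".getBytes else x(index).toString.getBytes
    (rowKey, new KeyValue(rowKey.get, COLUMN_FAMILY, cols(i), value))
  }
}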

do not want string as type when using foreach in scala spark streaming?

code snippet :
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
val write2hdfs = lines.filter(x => x._1 == "lineitem").map(_._2)
write2hdfs.foreachRDD(rdd => {
  rdd.foreach(avroRecord => {
    println(avroRecord)
    //val rawByte = avroRecord.getBytes("UTF-8")
  })
})
Issue faced:
avroRecord holds the Avro-encoded messages received from the Kafka stream. With the code above, avroRecord arrives as a String by default, and Strings on the JVM use UTF-16 internally. Because of this the deserialization is not correct and I am running into issues. The messages were Avro-encoded with UTF-8 when they were sent to the Kafka stream.
I need avroRecord as raw bytes instead of getting a String and converting it back to bytes (internally the String round-trip applies UTF-16 encoding), or a way to get avroRecord itself in UTF-8. I am stuck at a dead end here.
Need a way forward for this problem. Thanks in advance.
UPDATE:
Code snippet changed:
val ssc = new StreamingContext(sparkConf, Seconds(5))
//val ssc = new JavaStreamingContext(sparkConf, Seconds(5))
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val kafkaParams = Map[String, String](
  "zookeeper.connect" -> zkQuorum,
  "group.id" -> group,
  "zookeeper.connection.timeout.ms" -> "10000")
//val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
val lines = KafkaUtils.createStream[String, Message, StringDecoder, DefaultDecoder](
  ssc, kafkaParams, topics, StorageLevel.NONE)
Imports done:
import org.apache.spark.streaming._
import org.apache.spark.streaming.api.java.JavaStreamingContext
import org.apache.spark.streaming.kafka._
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream.toPairDStreamFunctions
import org.apache.avro
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord, GenericDatumWriter, GenericData}
import org.apache.avro.io.{DecoderFactory, DatumReader, DatumWriter, BinaryDecoder}
import org.apache.avro.file.{DataFileReader, DataFileWriter}
import java.io.{File, IOException}
//import java.io.*
import org.apache.commons.io.IOUtils;
import _root_.kafka.serializer.{StringDecoder, DefaultDecoder}
import _root_.kafka.message.Message
import scala.reflect._
Compilation error :
Compiling 1 Scala source to /home/spark_scala/spark_stream_project/target/scala-2.10/classes...
[error] /home/spark_scala/spark_stream_project/src/main/scala/sparkStreaming.scala:34: overloaded method value createStream with alternatives:
[error] (jssc: org.apache.spark.streaming.api.java.JavaStreamingContext,keyTypeClass: Class[String],valueTypeClass: Class[kafka.message.Message],keyDecoderClass: Class[kafka.serializer.StringDecoder],valueDecoderClass: Class[kafka.serializer.DefaultDecoder],kafkaParams: java.util.Map[String,String],topics: java.util.Map[String,Integer],storageLevel: org.apache.spark.storage.StorageLevel)org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream[String,kafka.message.Message]
[error] (ssc: org.apache.spark.streaming.StreamingContext,kafkaParams: scala.collection.immutable.Map[String,String],topics: scala.collection.immutable.Map[String,Int],storageLevel: org.apache.spark.storage.StorageLevel)(implicit evidence$1: scala.reflect.ClassTag[String], implicit evidence$2: scala.reflect.ClassTag[kafka.message.Message], implicit evidence$3: scala.reflect.ClassTag[kafka.serializer.StringDecoder], implicit evidence$4: scala.reflect.ClassTag[kafka.serializer.DefaultDecoder])org.apache.spark.streaming.dstream.ReceiverInputDStream[(String, kafka.message.Message)]
[error] cannot be applied to (org.apache.spark.streaming.StreamingContext, scala.collection.immutable.Map[String,String], String, org.apache.spark.storage.StorageLevel)
[error] val lines = KafkaUtils.createStream[String,Message,StringDecoder,DefaultDecoder]
[error] ^
[error] one error found
What is wrong here?
Also, I don't see the suggested constructor defined in the KafkaUtils API doc. The API doc I am referring to:
https://spark.apache.org/docs/1.3.0/api/java/index.html?org/apache/spark/streaming/kafka/KafkaUtils.html
Looking forward to your support. Thanks.
UPDATE 2:
Tried with the suggested corrections. Code snippet:
val lines = KafkaUtils.createStream[String, Message, StringDecoder, DefaultDecoder](
  ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK_2)
val write2hdfs = lines.filter(x => x._1 == "lineitem").map(_._2)
Facing a runtime exception:
java.lang.ClassCastException: [B cannot be cast to kafka.message.Message
on the line:
KafkaUtils.createStream[String, Message, StringDecoder, DefaultDecoder](
  ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK_2)
val write2hdfs = lines.filter(x => x._1 == "lineitem").map(_._2)
Ideally, filtering this DStream[(String, Message)] should also work, right? Do I need to extract the payload from Message before mapping?
Need inputs please. Thanks.
You could do something like this:
import kafka.serializer.{StringDecoder, DefaultDecoder}
import kafka.message.Message
val kafkaParams = Map[String, String](
"zookeeper.connect" -> zkQuorum, "group.id" -> group,
"zookeeper.connection.timeout.ms" -> "10000")
val lines = KafkaUtils.createStream[String, Message, StringDecoder, DefaultDecoder](
ssc, kafkaParams, topics, storageLevel)
This should get you a DStream[(String, kafka.message.Message)], and you should be able to retrieve the raw bytes and convert to Avro from there.
This worked for me:
val lines = KafkaUtils.createStream[String, Array[Byte], StringDecoder, DefaultDecoder](
  ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK_2)
My requirement was to get the byte array, so I changed the value type to Array[Byte] instead of kafka.message.Message (DefaultDecoder hands back the raw bytes).
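For reference, a minimal sketch of decoding those raw bytes into Avro records; schemaJson, a String holding the writer schema, is a hypothetical placeholder and not part of the original code:
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

val write2hdfs = lines.filter(x => x._1 == "lineitem").map(_._2)
write2hdfs.foreachRDD { rdd =>
  rdd.foreach { bytes =>
    // Parse the schema and build the reader inside the closure so only the
    // schemaJson String is shipped to the executors, with no String round-trip on the payload.
    val schema = new Schema.Parser().parse(schemaJson)
    val reader = new GenericDatumReader[GenericRecord](schema)
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    val record = reader.read(null, decoder)
    println(record)
  }
}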

VertexRDD giving me type mismatch error

I am running the following code attempting to create a Graph in GraphX in Apache Spark.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.graphx.Graph
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx.VertexId
//loads file from the array
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/data/google-plus/2309.graph");
//maps lines and takes the first 21 characters of each line which is the node.
val result = lines.map( line => line.substring(0,20))
//creates a new variable with each node followed by a long .
val result2 = result.map(word => (word,1L).toLong)
//where i am getting an error
val vertexRDD: RDD[(Long,Long)] = sc.parallelize(result2)
I am getting the following error:
error: type mismatch;
found : org.apache.spark.rdd.RDD[(Long, Long)]
required: Seq[?]
Error occurred in an application involving default arguments.
val vertexRDD: RDD[(Long, Long)] = sc.parallelize(result2)
First, your maps can be simplified to the following code:
val vertexRDD: RDD[(Long, Long)] =
lines.map(line => (line.substring(0, 17).toLong, 1L))
Now, to your error: you cannot call sc.parallelize on something that is already an RDD. Your vertex RDD is already defined by result2, so you can create your graph directly from result2 and your edges RDD:
val g = Graph(result2, edgesRDD)
or, if using my suggestion:
val g = Graph(vertexRDD, edgesRDD)
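Putting it together, a minimal sketch; the edge file path and its space-separated "srcId dstId" format are hypothetical, since the edges RDD is not shown in the question:
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

val vertexRDD: RDD[(Long, Long)] =
  lines.map(line => (line.substring(0, 17).toLong, 1L))

// Hypothetical edge input: one "srcId dstId" pair per line.
val edgesRDD: RDD[Edge[Int]] =
  sc.textFile("hdfs://moonshot-ha-nameservice/data/google-plus/edges.txt").map { line =>
    val fields = line.split(" ")
    Edge(fields(0).toLong, fields(1).toLong, 1)
  }

val g = Graph(vertexRDD, edgesRDD)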

Cannot resolve reference StructField with such signature

I've copied a working example and changed it a little, but the core is the same, and I always get this error at the StructField call:
cannot resolve reference StructField with such signature
It also gives me this one, inside the signature:
Type mismatch, expected: Datatype, actual StringType
Here is the part of my code where I have problems:
import org.apache.avro.generic.GenericData.StringType
import org.apache.spark
import org.apache.spark.sql.types.StructField
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.types._
object Test {
  def main(args: Array[String]): Unit = {
    val file = "/home/ubuntu/spark/MyFile"
    val conf = new SparkConf().setAppName("Test")
    val sc = new SparkContext(conf)
    val read = sc.textFile(file)
    val header = read.first().toString

    //generate schema from first csv row
    val fields = header.split(";").map(fieldName => StructField(fieldName.trim, StringType, true))
    val schema = StructType(fields)
  }
}
I cannot understand where I'm wrong.
I'm using Spark version 2.0.0.
Thanks.
It looks like GenericData.StringType is an issue. Use an alias:
import org.apache.avro.generic.GenericData.{StringType => AvroStringType}
or remove this import (you don't use it).
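For illustration, a sketch of how the file could look with the conflicting Avro import removed, so that org.apache.spark.sql.types.StringType is the one in scope:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object Test {
  def main(args: Array[String]): Unit = {
    val file = "/home/ubuntu/spark/MyFile"
    val conf = new SparkConf().setAppName("Test")
    val sc = new SparkContext(conf)
    val read = sc.textFile(file)
    val header = read.first().toString

    // generate the schema from the first csv row; StringType now resolves to the Spark SQL type
    val fields = header.split(";").map(fieldName => StructField(fieldName.trim, StringType, nullable = true))
    val schema = StructType(fields)
  }
}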

Wiki xml parser - org.apache.spark.SparkException: Task not serializable

I am a newbie to both Scala and Spark and am trying some of the tutorials; this one is from Advanced Analytics with Spark. The following code is supposed to work:
import com.cloudera.datascience.common.XmlInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io._
val path = "/home/petr/Downloads/wiki/wiki"
val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, "<page>")
conf.set(XmlInputFormat.END_TAG_KEY, "</page>")
val kvs = sc.newAPIHadoopFile(path, classOf[XmlInputFormat],
classOf[LongWritable], classOf[Text], conf)
val rawXmls = kvs.map(p => p._2.toString)
import edu.umd.cloud9.collection.wikipedia.language._
import edu.umd.cloud9.collection.wikipedia._
def wikiXmlToPlainText(xml: String): Option[(String, String)] = {
  val page = new EnglishWikipediaPage()
  WikipediaPage.readPage(page, xml)
  if (page.isEmpty) None
  else Some((page.getTitle, page.getContent))
}
val plainText = rawXmls.flatMap(wikiXmlToPlainText)
But it gives
scala> val plainText = rawXmls.flatMap(wikiXmlToPlainText)
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1622)
at org.apache.spark.rdd.RDD.flatMap(RDD.scala:295)
...
Running Spark v1.3.0 locally (and I have loaded only about 21 MB of the wiki articles, just to test it).
None of https://stackoverflow.com/search?q=org.apache.spark.SparkException%3A+Task+not+serializable gave me any clue...
Thanks.
Try:
import com.cloudera.datascience.common.XmlInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io._
val path = "/home/terrapin/Downloads/enwiki-20150304-pages-articles1.xml-p000000010p000010000"
val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, "<page>")
conf.set(XmlInputFormat.END_TAG_KEY, "</page>")
val kvs = sc.newAPIHadoopFile(path, classOf[XmlInputFormat],
classOf[LongWritable], classOf[Text], conf)
val rawXmls = kvs.map(p => p._2.toString)
import edu.umd.cloud9.collection.wikipedia.language._
import edu.umd.cloud9.collection.wikipedia._
val plainText = rawXmls.flatMap { line =>
  val page = new EnglishWikipediaPage()
  WikipediaPage.readPage(page, line)
  if (page.isEmpty) None
  else Some((page.getTitle, page.getContent))
}
The first guess that comes to mind: all your code is wrapped in the object where the SparkContext is defined. Spark tries to serialize that whole object in order to ship the wikiXmlToPlainText function to the nodes. Try creating a separate object that contains only the wikiXmlToPlainText function.
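A sketch of that approach; the object name WikiParser is arbitrary:
import edu.umd.cloud9.collection.wikipedia.WikipediaPage
import edu.umd.cloud9.collection.wikipedia.language.EnglishWikipediaPage

// A standalone, serializable object: Spark ships only this small object,
// not the enclosing one that holds the SparkContext.
object WikiParser extends Serializable {
  def wikiXmlToPlainText(xml: String): Option[(String, String)] = {
    val page = new EnglishWikipediaPage()
    WikipediaPage.readPage(page, xml)
    if (page.isEmpty) None
    else Some((page.getTitle, page.getContent))
  }
}

val plainText = rawXmls.flatMap(WikiParser.wikiXmlToPlainText)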