spark-hbase-connector : ClusterId read in ZooKeeper is null - scala

I'am trying to run a simple program that copies the content of an rdd into a Hbase table. I'am using spark-hbase-connector by nerdammer https://github.com/nerdammer/spark-hbase-connector. I'am running the code using spark-submit on a local cluster on my machine. Spark version is 2.1.
this is the code i'am trying tu run :
import org.apache.spark.{SparkConf, SparkContext}
import it.nerdammer.spark.hbase._
object HbaseConnect {
def main(args: Array[String]) {
val sparkConf = new SparkConf()
sparkConf.set("spark.hbase.host", "hostname")
sparkConf.set("zookeeper.znode.parent", "/hbase-unsecure")
val sc = new SparkContext(sparkConf)
val rdd = sc.parallelize(1 to 100)
.map(i => (i.toString, i+1, "Hello"))
rdd.toHBaseTable("mytable").toColumns("column1", "column2")
.inColumnFamily("mycf")
.save()
sc.stop
}}
Here is my build.sbt:
name := "HbaseConnect"
version := "0.1"
scalaVersion := "2.11.8"
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs # _*) => MergeStrategy.discard
case x => MergeStrategy.first}
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
"it.nerdammer.bigdata" % "spark-hbase-connector_2.10" % "1.0.3")
the execution gets stuck showing the following info:
17/11/22 10:20:34 INFO ZooKeeperRegistry: ClusterId read in ZooKeeper is null
17/11/22 10:20:34 INFO TableOutputFormat: Created table instance for mytable
I am unable to indentify the problem with zookeeper. The HBase clients will discover the running HBase cluster using the following two properties:
1.hbase.zookeeper.quorum: is used to connect to the zookeeper cluster
2.zookeeper.znode.parent. tells which znode keeps the data (and address for HMaster) for the cluster
I overridden these two properties in the code. with
sparkConf.set("spark.hbase.host", "hostname")
sparkConf.set("zookeeper.znode.parent", "/hbase-unsecure")
Another question is that there is no spark-hbase-connector_2.11. can the provided version spark-hbase-connector_2.10 support scala 2.11 ?

Problem is solved. I had to override the Hmaster port to 16000 (wich is my Hmaster port number. I'am using ambari). Default value that sparkConf uses is 60000.
sparkConf.set("hbase.master", "hostname"+":16000").

Related

spark streaming save base64 rdd to json on s3

The scala application below cannot save an rdd in json format onto S3
I have :-
a kinesis stream that has complex objects placed on the stream. This object has had JSON.stringify() applied to it before being placed on the stream as part of the Kinesis PutRecord method.
A scala spark stream job reads these items off the stream,
I cannot seem to save the rdd record that comes off the stream into json file onto an S3 bucket.
In the code i've attempted to convert the RDD[Bytes] to RDD[String] then load with spark.read.json but no luck. I've tried various other combinations and can't seem to output the onto S3 in it's raw format.
import org.apache.spark._
import org.apache.spark.sql._
import java.util.Base64
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.kinesis._
import org.apache.spark.streaming.kinesis.KinesisInputDStream
import org.apache.spark.streaming.kinesis.KinesisInitialPositions.Latest
object ScalaStream {
def main(args: Array[String]): Unit = {
val appName = "ScalaStreamExample"
val batchInterval = Milliseconds(2000)
val outPath = "s3://xxx-xx--xxx/xxxx/"
val spark = SparkSession
.builder()
.appName(appName)
.getOrCreate()
val sparkContext = spark.sparkContext
val streamingContext = new StreamingContext(sparkContext, batchInterval)
// Populate the appropriate variables from the given args
val checkpointAppName = "xxx-xx-xx--xx"
val streamName = "cc-cc-c--c--cc"
val endpointUrl = "https://kinesis.xxx-xx-xx.amazonaws.com"
val regionName = "cc-xxxx-xxx"
val initialPosition = new Latest()
val checkpointInterval = batchInterval
val storageLevel = StorageLevel.MEMORY_AND_DISK_2
val kinesisStream = KinesisInputDStream.builder
.streamingContext(streamingContext)
.endpointUrl(endpointUrl)
.regionName(regionName)
.streamName(streamName)
.initialPosition(initialPosition)
.checkpointAppName(checkpointAppName)
.checkpointInterval(checkpointInterval)
.storageLevel(StorageLevel.MEMORY_AND_DISK_2)
.build()
kinesisStream.foreachRDD { rdd =>
if (!rdd.isEmpty()){
//**************** . <---------------
// This is where i'm trying to save the raw json object to s3 as json file
// tried various combinations here but no luck.
val dataFrame = rdd.map(record=>new String(record)) // convert bytes to string
dataFrame.write.mode(SaveMode.Append).json(outPath + "/" + rdd.id.toString())
//**************** <----------------
}
}
// Start the streaming context and await termination
streamingContext.start()
streamingContext.awaitTermination()
}
}
Any ideas what i'm missing?
So it was complete red herring why it failed to work. Turns out it was a scala version conflict with what is available on EMR.
Many similar questions asked on SO that suggested this may be the issue but whilst the spark documentation lists 2.12.4 is compatible with spark 2.4.4, the EMR instance does not appear to support scala version 2.12.4. So i've updated my build.sbt and deploy script from
build.sbt:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.12.8"
ibraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-streaming-kinesis-asl_2.12" % "2.4.4"
to:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.4.4" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "2.4.4"
deploy.sh
aws emr add-steps --cluster-id j-xxxxx --steps Type=spark,Name=ScalaStream,Args=[\
--class,"ScalaStream",\
--deploy-mode,cluster,\
--master,yarn,\
--packages,\'org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.4\',\
--conf,spark.yarn.submit.waitAppCompletion=false,\
--conf,yarn.log-aggregation-enable=true,\
--conf,spark.dynamicAllocation.enabled=true,\
--conf,spark.cores.max=4,\
--conf,spark.network.timeout=300,\
s3://ccc.xxxx/simple-project_2.11-1.0.jar\
],ActionOnFailure=CONTINUE

Proxy Issues while Dumping Data into CosmosDB from Dataframe using intelliJ

I have to dump json data into cosmosDB from spark dataframe using scala and intelliJ.
I am reading a csv file from my local machine and converting it into json format. Now I have to dump this json data into cosmosDB collection.
Spark version is 2.2.0 and scala version is 2.11.8
Below is the code which I wrote in IntelliJ with scala for fetching a csv file from my local machine and convert it into a json file.
import org.apache.spark.sql.SparkSession
import com.microsoft.azure.cosmosdb.spark.config.Config
object DataLoadConversion {
def main(args: Array[String]): Unit = {
System.setProperty("spark.sql.warehouse.dir", "file:///C:/spark-warehouse")
val spark = SparkSession.builder().master("local").appName("DataConversion").getOrCreate()
val df = spark.read.format("com.databricks.spark.csv")
.option("quote", "\"")
.option("escape", "\"")
.option("delimiter", ",")
.option("header", "true")
.option("mode", "FAILFAST")
.option("inferSchema","true")
.load("file:///C:/Users/an/Desktop/ct_temp.csv")
val finalDf = df.select(df("history_temp_id").as("NUM"),df("history_temp_time").as("TIME"))
val jsonData = finalDf.select("NUM", "TIME").toJSON
jsonData.show(2)
// COSMOS DB Write configuration
val writeConfig = Config(Map(
"Endpoint" -> "https://cosms.documents.azure.com:443/",
"Masterkey" -> "YOUR-KEY-HERE", //provided primary key
"Database" -> "DBName", //provided with DB name
"Collection" -> "Collection", //provided with collection name
))
// Write to Cosmos DB from the DataFrame
import org.apache.spark.sql.SaveMode
jsonData.write.mode(SaveMode.Overwrite).cosmosDB(writeConfig)
}
Below is the build.sbt file
scalaVersion := "2.11.8"
val sparkVersion = "2.2.0"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"com.databricks" %% "spark-csv" % "1.5.0",
)
libraryDependencies += "com.microsoft.azure" % "azure-cosmosdb-spark_2.2.0_2.11" % "1.1.1" % "provided" exclude("org.apache.spark", "spark-core_2.10")
Added cosmosDB dependency to the build.sbt file.
I am new to Spark and Scala. please let me know what all steps to be followed to get connected with cosmos DB from intelliJ with spark and scala?
Build is successful but I am getting below error while running the code.
19/07/10 16:32:41 INFO DocumentClient: Initializing DocumentClient with serviceEndpoint [https://cosms.documents.azure.com/], ConnectionPolicy [ConnectionPolicy [requestTimeout=60, mediaRequestTimeout=300, connectionMode=Gateway, mediaReadMode=Buffered, maxPoolSize=400, idleConnectionTimeout=60, userAgentSuffix= SparkConnector/2.2.0_2.11-1.1.1, retryOptions=com.microsoft.azure.documentdb.RetryOptions#1ef5cde4, enableEndpointDiscovery=true, preferredLocations=[Japan East]]], ConsistencyLevel [Session]
19/07/10 16:33:03 WARN DocumentClient: Failed to retrieve database account information. org.apache.http.conn.HttpHostConnectException: Connect to cosms.documents.azure.com:443 [cosms.documents.azure.com/13.78.51.35] failed: Connection timed out: connect
......
Exception in thread "main" java.lang.IllegalStateException: Http client execution failed.
at com.microsoft.azure.documentdb.internal.GatewayProxy.performGetRequest(GatewayProxy.java:244)
at com.microsoft.azure.documentdb.internal.GatewayProxy.doRead(GatewayProxy.java:93)
If I am connecting out of my office network this is working, but when I my machine is connected with office network i am getting above error.
I have tried with configuring proxy settings in below shown page. Settings>>> Proxy settings.
If i try the same end point in chrome i am getting below error.
{"code":"Unauthorized","message":"Required Header authorization is missing. Ensure a valid Authorization token is passed.\r\nActivityId: 54999e41-179e-4877-b8bf-f2c2a33280fd, Microsoft.Azure.Documents.Common/2.5.1"}
how to resolve this? how to bypass proxy from office network?

issue running spark-submit : java.lang.NoSuchMethodError: com.couchbase.spark.streaming.Mutation.key() [duplicate]

This question already has answers here:
Resolving dependency problems in Apache Spark
(7 answers)
Closed 4 years ago.
I have the following scala code and am using sbt to compile and run this. sbt run works as expected.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{StreamingContext, Seconds}
import com.couchbase.spark.streaming._
object StreamingExample {
def main(args: Array[String]): Unit = {
// Create the Spark Config and instruct to use the travel-sample bucket
// with no password.
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("StreamingExample")
.set("com.couchbase.bucket.travel-sample", "")
// Initialize StreamingContext with a Batch interval of 5 seconds
val ssc = new StreamingContext(conf, Seconds(5))
// Consume the DCP Stream from the beginning and never stop.
// This counts the messages per interval and prints their count.
ssc
.couchbaseStream(from = FromBeginning, to = ToInfinity)
.foreachRDD(rdd => {
rdd.foreach(message => {
//println(message.getClass());
message.getClass();
if(message.isInstanceOf[Mutation]) {
val document = message.asInstanceOf[Mutation].key.map(_.toChar).mkString
println("mutated: " + document);
} else if( message.isInstanceOf[Deletion]) {
val document = message.asInstanceOf[Deletion].key.map(_.toChar).mkString
println("deleted: " + document);
}
})
})
// Start the Stream and await termination
ssc.start()
ssc.awaitTermination()
}
}
but this fails when run as a spark job like below :
spark-submit --class "StreamingExample" --master "local[*]" target/scala-2.11/spark-samples_2.11-1.0.jar
The error is java.lang.NoSuchMethodError: com.couchbase.spark.streaming.Mutation.key()
Following is my build.sbt
lazy val root = (project in file(".")).
settings(
name := "spark-samples",
version := "1.0",
scalaVersion := "2.11.12",
mainClass in Compile := Some("StreamingExample")
)
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.4.0",
"org.apache.spark" %% "spark-streaming" % "2.4.0",
"org.apache.spark" %% "spark-sql" % "2.4.0",
"com.couchbase.client" %% "spark-connector" % "2.2.0"
)
// META-INF discarding
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs # _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
The spark version running on my machine is 2.4.0 using scala 2.11.12.
Observations:
I do not see com.couchbase.client_spark-connector_2.11-2.2.0 in my spark jars ( /usr/local/Cellar/apache-spark/2.4.0/libexec/jars ), but the older version com.couchbase.client_spark-connector_2.10-1.2.0.jar exists.
Why is spark-submit not working?
how does sbt manage to run this? where does it download the
dependencies?
Please ensure that both the Scala version and the spark connector library version used by SBT and your spark installation are the same.
I had run into a similar problem when I was trying to run a sample Flink job on my system. It was being caused by version mismatch.

How does DataStax Spark Cassandra connector create SparkContext?

I have run the following Spark test program successfully. In this program I notice the "cassandraTable" method and "getOrCreate" method in SparkContext class. But I am not able to find it in the Spark Scala API docs for this class. What am I missing in understanding this code? I am trying to understand how this SparkContext is different when Datastax Connector is in sbt.
Code -
import org.apache.spark.{SparkContext, SparkConf}
import com.datastax.spark.connector._
object CassandraInt {
def main(args:Array[String]){
val SparkMasterHost = "127.0.0.1"
val CassandraHost = "127.0.0.1"
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", CassandraHost)
.set("spark.cleaner.ttl", "3600")
.setMaster("local[12]")
.setAppName(getClass.getSimpleName)
// Connect to the Spark cluster:
lazy val sc = SparkContext.getOrCreate(conf)
val rdd = sc.cassandraTable("test", "kv")
println(rdd.count)
println(rdd.map(_.getInt("value")).sum)
}}
The build.sbt file I used is -
name := "Test Project"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
addCommandAlias("c1", "run-main CassandraInt")
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.0-M3"
fork in run := true
It is not different. Spark supports only one active SparkContext and getOrCreate is a method defined on the companion object:
This function may be used to get or instantiate a SparkContext and register it as a singleton object. Because we can only have one active SparkContext per JVM, this is useful when applications may wish to share a SparkContext.
This method allows not passing a SparkConf (useful if just retrieving).
To summarize:
If there is an active context it returns it.
Otherwise it creates a new one.
cassandraTable is a method of the SparkContextFunctions exposed using an implicit conversion.

Apache spark-streaming application output not forwarded to the master

I'm trying to run the FlumeEvent example which is the following
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.flume._
import org.apache.spark.util.IntParam
import org.apache.spark.streaming.flume.FlumeUtils
object FlumeEventCount {
def main(args: Array[String]) {
val batchInterval = Milliseconds(2000)
// Create the context and set the batch size
val sparkConf = new SparkConf().setAppName("FlumeEventCount")
.set("spark.cleaner.ttl","3")
val ssc = new StreamingContext(sparkConf, batchInterval)
// Create a flume stream
var stream = FlumeUtils.createStream(ssc, "192.168.1.5",3564, StorageLevel.MEMORY_ONLY_SER_2)
// Print out the count of events received from this server in each batch
stream.count().map(cnt => "Received flume events." + cnt ).print()
stream.count.print()
stream.print()
ssc.start()
ssc.awaitTermination()
}
}
My sbt file is the following
import AssemblyKeys._
assemblySettings
name := "flume-test"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.0.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-flume" % "1.0.0" exclude("org.apache.spark","spark-core") exclude("org.apache.spark", "spark-streaming_2.10")
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
I run the programm with the following command
/tmp/spark-1.0.0-bin-hadoop2/bin/spark-submit --class FlumeEventCount --master local --deploy-mode client /tmp/fooproj/target/scala-2.10/cert-log-manager-assembly-1.0.jar
On the other side, the flume application is sending everything correctly and I can see in the logs that it's received.
I haven't made any changes to spark's configuration nor setup any environment variables, I just downloaded and unpacked the program.
Can someone tell me what am I doing wrong?
//edit: When I execute spark's FlumeEventCount example, it works
//edit2: If I remove the awaiTermination and add an ssc.stop it prints everything one single time, I guess this happens because something is getting flushed
....I should have learned to rtfm more carefully by now,
quoting from this page: https://spark.apache.org/docs/latest/streaming-programming-guide.html
// Spark Streaming needs at least two working thread
val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
I've been launching spark with only one thread
also the following works fine
stream.map(event=>"Event: header:"+ event.event.get(0).toString+" body:"+ new String(event.event.getBody.array) ).print