Spark read job from GCS object stuck - Scala

I'm trying to read an object with a Spark job locally; I previously created it with another Spark job, also run locally.
Looking at the logs I see nothing weird, and in the Spark UI the job is just stuck.
Before I kick off the read job I update the Spark config as follows:
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hc.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
hc.set("fs.gs.project.id", credential.projectId)
hc.set("fs.gs.auth.service.account.enable", "true")
hc.set("fs.gs.auth.service.account.email", credential.email)
hc.set("fs.gs.auth.service.account.private.key.id", credential.keyId)
hc.set("fs.gs.auth.service.account.private.key", credential.key)
Then I simply read like this:
val path = "gs://mybucket/data.csv"
val options = Map("credentials" -> credential.base64ServiceAccount, "parentProject" -> credential.projectId)
spark.read.format("csv")
.options(options)
.load(path)
My service account has the following permissions; I literally added every permission I could find for object storage:
Storage Admin
Storage Object Admin
Storage Object Creator
Storage Object Viewer
This is how I previously wrote the object:
val path = "gs://mybucket/data.csv"
val options = Map("credentials" -> credential.base64ServiceAccount, "parentProject" -> credential.projectId, "header" -> "true")
val writer = df.write.format("csv").options(options)
writer.save(path)
These are my dependencies:
Seq(
"org.apache.spark" %% "spark-core" % "3.1.1",
"org.apache.hadoop" % "hadoop-client" % "3.3.1",
"com.google.cloud.spark" %% "spark-bigquery-with-dependencies" % "0.23.0",
"com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-2.2.4",
"com.google.cloud" % "google-cloud-storage" % "2.2.1"
)
Any idea why the write would succeed but the read would get stuck like this?

I was using versions of the dependencies that were not the latest. Once I updated the Google connector dependencies to the latest versions (as of December 2021), the read from Google Storage worked as well as the write.
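For reference, a dependency block along these lines is roughly what that update amounts to; the connector versions below are my assumption for "latest as of December 2021", so check Maven Central for the actual current releases:
Seq(
  "org.apache.spark" %% "spark-core" % "3.1.1",
  "org.apache.hadoop" % "hadoop-client" % "3.3.1",
  // assumed "latest as of December 2021" connector versions; verify on Maven Central
  "com.google.cloud.spark" %% "spark-bigquery-with-dependencies" % "0.23.1",
  "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-2.2.4",
  "com.google.cloud" % "google-cloud-storage" % "2.2.1"
)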

Related

Using upgrade.from config in Kafka Streams is causing a "BindException: Address already in use" error in tests using embedded-kafka-schema-registry

I've got a Scala application that uses Kafka Streams, with Embedded Kafka Schema Registry in its integration tests.
I'm currently trying to upgrade Kafka Streams from 2.5.1 to 3.3.1 - and everything is working locally as expected, with all unit and integration tests passing.
However, according to the upgrade guide on the Kafka Streams documentation, when upgrading Kafka Streams, "if upgrading from 3.2 or below, you will need to do two rolling bounces, where during the first rolling bounce phase you set the config upgrade.from="older version" (possible values are "0.10.0" - "3.2") and during the second you remove it".
I've therefore added this upgrade.from config to my code as follows:
val propsMap = Map(
...
UPGRADE_FROM_CONFIG -> "2.5.1"
)
val props = new Properties()
props.putAll(asJava(propsMap))
val streams = new KafkaStreams(topology, props);
However, doing this causes my integration tests to start failing with the following error:
[info] java.net.BindException: Address already in use
[info] at sun.nio.ch.Net.bind0(Native Method)
[info] at sun.nio.ch.Net.bind(Net.java:461)
[info] at sun.nio.ch.Net.bind(Net.java:453)
[info] at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:222)
[info] at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:85)
[info] at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:78)
[info] at org.apache.zookeeper.server.NIOServerCnxnFactory.configure(NIOServerCnxnFactory.java:676)
[info] at org.apache.zookeeper.server.ServerCnxnFactory.configure(ServerCnxnFactory.java:109)
[info] at org.apache.zookeeper.server.ServerCnxnFactory.configure(ServerCnxnFactory.java:105)
[info] at io.github.embeddedkafka.ops.ZooKeeperOps.startZooKeeper(zooKeeperOps.scala:26)
Does anyone know why that might be happening and how to resolve it? And additionally, is this use of the upgrade.from config correct?
For additional context, my previous versions of the relevant libraries were:
"org.apache.kafka" %% "kafka-streams-scala" % "2.5.1"
"org.apache.kafka" % "kafka-clients" % "5.5.1-ccs"
"io.confluent" % "kafka-avro-serializer" % "5.5.1"
"io.confluent" % "kafka-schema-registry-client" % "5.5.1"
"org.apache.kafka" %% "kafka" % "2.5.1"
"io.github.embeddedkafka" %% "embedded-kafka-schema-registry" % "5.5.1"
And my updated versions are:
"org.apache.kafka" %% "kafka-streams-scala" % "3.3.1"
"org.apache.kafka" % "kafka-clients" % "7.3.0-ccs"
"io.confluent" % "kafka-avro-serializer" % "7.3.0"
"io.confluent" % "kafka-schema-registry-client" % "7.3.0"
"org.apache.kafka" %% "kafka" % "3.3.1"
"io.github.embeddedkafka" %% "embedded-kafka-schema-registry" % "7.3.0"
My integration tests use Embedded Kafka Schema Registry as follows in their test setup, with specific ports specified for Kafka, Zookeeper and Schema Registry:
class MySpec extends AnyWordSpec
with BeforeAndAfterAll
with EmbeddedKafkaConfig
with EmbeddedKafka {
override protected def beforeAll(): Unit = {
super.beforeAll()
EmbeddedKafka.start()
...
}
override protected def afterAll(): Unit = {
...
EmbeddedKafka.stop()
super.afterAll()
}
}
I'm not quite sure what to try to resolve this issue.
In searching online, I did find this open GitHub issue on scalatest-embedded-kafka, the precursor to Embedded Kafka Schema Registry, which seems to be a similar issue. However, it doesn't appear to have been resolved.
Your upgrade.from config value is not valid.
See https://kafka.apache.org/documentation/#streamsconfigs_upgrade.from
It should be "2.5", not "2.5.1".
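In Scala that is a small change to the props map (minimal sketch, assuming the StreamsConfig constant is imported as in your snippet):
import org.apache.kafka.streams.StreamsConfig.UPGRADE_FROM_CONFIG

val propsMap = Map(
  // ...
  UPGRADE_FROM_CONFIG -> "2.5" // major.minor only, not "2.5.1"
)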

Unable to connect to MinIO S3 from Spark

I am trying to connect to S3 provided by MinIO using Spark, but it says the bucket minikube does not exist (the bucket has already been created).
val spark = SparkSession.builder().appName("AliceProcessingTwentyDotTwo")
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer").master("local[1]")
.getOrCreate()
val sc= spark.sparkContext
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "http://localhost:9000")
sc.hadoopConfiguration.set("fs.s3a.access.key", "minioadmin")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "minioadmin")
sc.hadoopConfiguration.set("fs.s3`a`.path.style.access", "true")
sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled","false")
sc.textFile("""s3a://minikube/data.json""").collect()
I am using the following guide to connect.
https://github.com/minio/cookbook/blob/master/docs/apache-spark-with-minio.md
These are the dependencies I used in scala.
"org.apache.spark" %% "spark-core" % "2.4.0", "org.apache.spark" %%
"spark-sql" % "2.4.0", "com.amazonaws" % "aws-java-sdk" % "1.11.712",
"org.apache.hadoop" % "hadoop-aws" % "2.7.3",
Try Spark 2.4.3 without bundled Hadoop and use Hadoop 2.8.2 or 3.1.2. After following the steps in the link below, I was able to connect to MinIO using the CLI.
https://www.jitsejan.com/setting-up-spark-with-minio-as-object-storage.html
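For example, a dependency set along these lines; the version pairings are an assumption on my part, and in particular the AWS SDK artifact and version must match what your hadoop-aws release was built against (check its POM):
// Illustrative versions only: hadoop-aws must match the Hadoop on your classpath,
// and aws-java-sdk-bundle must match the version referenced in that hadoop-aws POM.
"org.apache.spark" %% "spark-core" % "2.4.3",
"org.apache.spark" %% "spark-sql" % "2.4.3",
"org.apache.hadoop" % "hadoop-aws" % "3.1.2",
"com.amazonaws" % "aws-java-sdk-bundle" % "1.11.271"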

How to save Spark streaming data in Cassandra

build.sbt
Below are the contents of my build.sbt file:
val sparkVersion = "1.6.3"
scalaVersion := "2.10.5"
resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.spark" %% "spark-streaming-kafka" % sparkVersion)
libraryDependencies +="datastax" % "spark-cassandra-connector" % "1.6.3-s_2.10"
libraryDependencies +="org.apache.spark" %% "spark-sql" % "1.1.0"
Command to initialize shell:
The command below is the shell initialization procedure I followed:
/usr/hdp/2.6.0.3-8/spark/bin/spark-shell --packages datastax:spark-cassandra-connector:1.6.3-s_2.10 --conf spark.cassandra.connection.host=127.0.0.1 --jars spark-streaming-kafka-assembly_2.10-1.6.3.jar
Note:
Here I specified the jar explicitly because SBT couldn't fetch the spark-streaming-kafka libraries required to create the kafkaStream in later sections.
Import required libraries:
This section lists the libraries imported and used throughout the REPL session:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
Setting up Spark Streaming Configuration:
Here I am setting up the configuration required for the Spark streaming context:
val conf = new SparkConf().setMaster("local[*]").setAppName("KafkaReceiver")
conf.set("spark.driver.allowMultipleContexts", "true"); // Required to set this to true because during // shell initialization or starting we a spark context is created with configurations of highlighted
conf.setMaster("local"); // then we are assigning those cofigurations locally
Creation of SparkStreamingContext using above configurations:
Using the configuration defined above, we create a Spark streaming context as follows:
val ssc = new StreamingContext(conf, Seconds(1)) // Seconds(1) is the batch interval at which data is fetched
Creating a Kafka stream using above Spark Streaming Context aka SSC:
Here ssc is the Spark streaming context created above,
"localhost:2181" is the ZooKeeper quorum,
"spark-streaming-consumer-group" is the consumer group, and
Map("test3" -> 5) is Map("topic" -> number of partitions).
val kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181","spark-streaming-consumer-group", Map("test3" -> 5)).map(_._2)
Note
The values fetched when the kafkaStream object is printed using kafkaStream.print() are shown below:
85052,19,960.00,0,2017-08-29 14:52:41,17,VISHAL_GWY01_HT1,26,VISHAL_GTWY17_PRES_01,1,2,4
85053,19,167.00,0,2017-08-29 14:52:41,17,VISHAL_GWY01_HT1,25,VISHAL_GTWY1_Temp_01,1,2,4
85054,19,960.00,0,2017-08-29 14:52:41,17,VISHAL_GWY01_HT1,26,VISHAL_GTWY17_PRES_01,1,2,4
85055,19,167.00,0,2017-08-29 14:52:54,17,VISHAL_GWY01_HT1,25,VISHAL_GTWY1_Temp_01,1,2,4
85056,19,960.00,0,2017-08-29 14:52:54,17,VISHAL_GWY01_HT1,26,VISHAL_GTWY17_PRES_01,1,2,4
85057,19,167.00,0,2017-08-29 14:52:55,17,VISHAL_GWY01_HT1,25,VISHAL_GTWY1_Temp_01,1,2,4
85058,19,960.00,0,2017-08-29 14:52:55,17,VISHAL_GWY01_HT1,26,VISHAL_GTWY17_PRES_01,1,2,4
17/09/02 18:25:25 INFO JobScheduler: Finished job streaming job 1504376716000 ms.0 from job set of time 1504376716000 ms
17/09/02 18:25:25 INFO JobScheduler: Total delay: 9.661 s for time 1504376716000 ms (execution: 0.021 s)
17/09/02 18:25:25 INFO JobScheduler: Starting job streaming job 1504376717000 ms.0 from job set of time 1504376717000 ms
Transforming the kafkaStream and saving in Cassandra:
kafkaStream.foreachRDD( rdd => {
if (! rdd.isEmpty()) {
rdd.map( line => {
val arr = line.split(",");
(arr(0), arr(1), arr(2), arr(3), arr(4), arr(5), arr(6), arr(7), arr(8), arr(9), arr(10), arr(11))
}).saveToCassandra("test", "sensorfeedVals", SomeColumns(
"tableid", "ccid", "paramval", "batVal", "time", "gwid", "gwhName", "snid", "snhName", "snStatus", "sd", "MId")
)
} else {
println("No records to save")
}
}
)
Start ssc:
Using ssc.start() you can start the streaming.
The issues I am facing here are:
1. Printing of the stream contents happens only after I enter exit or Ctrl+C.
2. When I use ssc.start in the REPL, does it start streaming immediately, without giving me time to enter ssc.awaitTermination?
3. The main issue: when I try to save normally with the procedure below,
val collection = sc.parallelize(Seq(("key3", 3), ("key4", 4)))
collection.saveToCassandra("test", "kv", SomeColumns("key", "value"))
I am able to save to Cassandra, but whenever I try to save using the logic shown in "Transforming the kafkaStream and saving in Cassandra" above, I can't extract each value from the string and save it into the respective columns of the Cassandra table!
java.lang.NoClassDefFoundError: Could not initialize class com.datastax.spark.connector.cql.CassandraConnector
This means the classpath has not been correctly set up for your application. Make sure you are using the --packages option when launching your application, as noted in the SCC docs.
For your other issues:
You don't need awaitTermination in the REPL, because the REPL will not instantly quit after starting the streaming context. That call is there for applications which may have no further instructions, to prevent the main thread from exiting.
ssc.start will start the streaming immediately.
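For example, a launch along these lines (the package coordinates are taken from your shell command; the class and jar names are placeholders for your own):
spark-submit --packages datastax:spark-cassandra-connector:1.6.3-s_2.10 --conf spark.cassandra.connection.host=127.0.0.1 --class your.main.Class your-app-assembly.jar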
A line or two of code related to contexts was causing the issue here!
I found the solution when I walked through the topic of contexts.
Here I was running multiple contexts, but they were independent of each other.
I had initialized the shell with the command below:
/usr/hdp/2.6.0.3-8/spark/bin/spark-shell --packages datastax:spark-cassandra-connector:1.6.3-s_2.10 --conf spark.cassandra.connection.host=127.0.0.1 --jars spark-streaming-kafka-assembly_2.10-1.6.3.jar
So when the shell starts, a Spark context is initialized with the DataStax connector properties.
Later I created some configuration and used it to create a Spark streaming context, and from that context I created the kafkaStream. That kafkaStream only carried the properties of the streaming context, not of the original Spark context, which is what caused the problem when storing to Cassandra.
I resolved it as shown below, and it worked:
val sc = new SparkContext(new SparkConf().setAppName("Spark-Kafka-Streaming").setMaster("local[*]").set("spark.cassandra.connection.host", "127.0.0.1"))
val ssc = new StreamingContext(sc, Seconds(10))
Thanks to everyone who came forward to help!
Let me know if there are better ways to achieve this!
A very simple approach is to convert the stream's RDDs to DataFrames inside the foreachRDD API and save them to Cassandra using the Spark SQL Cassandra data source API. Below is a simple code snippet where I save Twitter tweets to a Cassandra table:
stream.foreachRDD(rdd => {
if (rdd.count() > 0) {
val data = rdd.filter(status => status.getLang.equals("en")).map(status => TweetsClass(status.getId,
status.getCreatedAt.toGMTString(),
status.getUser.getLocation,
status.getText)).toDF()
//Save the data to Cassandra
data.write.
format("org.apache.spark.sql.cassandra").
options(Map(
"table" -> "sentiment_tweets",
"keyspace" -> "My Keyspace",
"cluster" -> "My Cluster")).mode(SaveMode.Append).save()
}
})

Flink, Kafka and ZooKeeper with a URI

I am trying to connect to Kafka from my local machine:
kafkaParams.setProperty("bootstrap.servers", Defaults.BROKER_URL)
kafkaParams.setProperty("metadata.broker.list", Defaults.BROKER_URL)
kafkaParams.setProperty("group.id", "group_id")
kafkaParams.setProperty("auto.offset.reset", "earliest")
Perfectly fine, but my BROKER_URL is defined as follows: my-server.com:1234/my/subdirectory.
I figured out that this kind of path is called a chroot path.
It throws the following error: Caused by: org.apache.kafka.common.config.ConfigException: Invalid url in bootstrap.servers: my-server.com:1234/my/subdirectory
How do I solve this?
These are my dependencies:
val flinkVersion = "1.0.3"
"org.apache.flink" %% "flink-scala" % flinkVersion % "provided",
"org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided",
"org.apache.flink" %% "flink-connector-kafka-0.9" % flinkVersion,
Just try the host:port format, without the path and the slashes. If you have more than one server, it would be a comma-separated list: host1:port1,host2:port2
Reference: http://kafka.apache.org/documentation.html
bootstrap.servers should be a comma-separated list like the following: address1:port1,address2:port2,...,addressn:portn. If you only have one Kafka broker you should input something like localhost:9092 (unless you configured Kafka to run on another port).
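For example, reusing the kafkaParams Properties object from your snippet (the broker hostnames here are placeholders):
kafkaParams.setProperty("bootstrap.servers", "broker1:9092,broker2:9092")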
You may refer to this post from dataArtisans for more details on how to make Flink and Kafka work together.
Stupid mistake: ZooKeeper != Kafka. As you can see in the code, I used the same URL twice, but it turned out they should be different.
The corrected configuration:
kafkaParams.setProperty("bootstrap.servers", Defaults.KAFKA_URL)
kafkaParams.setProperty("metadata.broker.list", Defaults.ZOOKEEPER_URL)
kafkaParams.setProperty("group.id", "group_id")
kafkaParams.setProperty("auto.offset.reset", "earliest")

Apache Spark Throws java.lang.IllegalStateException: unread block data

What we are doing is:
Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs.
Building a fat jar of a Spark app with sbt, then trying to run it on the cluster.
I've also included code snippets and sbt deps at the bottom.
When I Googled this, there seemed to be two somewhat vague responses:
a) Mismatching spark versions on nodes/user code
b) Need to add more jars to the SparkConf
Now I know that (b) is not the problem, having successfully run the same code on other clusters while only including one jar (it's a fat jar).
But I have no idea how to check for (a): it appears Spark doesn't have any version checks or anything. It would be nice if it checked versions and threw a "mismatching version exception: you have user code using version X and node Y has version Z".
I would be very grateful for advice on this. I've submitted a bug report, because there must be something wrong with the Spark documentation: I've seen two independent sysadmins get the exact same problem with different versions of CDH on different clusters. https://issues.apache.org/jira/browse/SPARK-1867
The exception:
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to java.lang.IllegalStateException: unread block data [duplicate 59]
My code snippet:
val conf = new SparkConf()
.setMaster(clusterMaster)
.setAppName(appName)
.setSparkHome(sparkHome)
.setJars(SparkContext.jarOfClass(this.getClass))
println("count = " + new SparkContext(conf).textFile(someHdfsPath).count())
My SBT dependencies:
// relevant
"org.apache.spark" % "spark-core_2.10" % "0.9.1",
"org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0",
// standard, probably unrelated
"com.github.seratch" %% "awscala" % "[0.2,)",
"org.scalacheck" %% "scalacheck" % "1.10.1" % "test",
"org.specs2" %% "specs2" % "1.14" % "test",
"org.scala-lang" % "scala-reflect" % "2.10.3",
"org.scalaz" %% "scalaz-core" % "7.0.5",
"net.minidev" % "json-smart" % "1.2"
Changing
"org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0",
to
"org.apache.hadoop" % "hadoop-common" % "2.3.0-cdh5.0.0"
in my application code seemed to fix this. I'm not entirely sure why; we have hadoop-yarn on the cluster, so maybe the "mr1" artifact broke things.
I recently ran into this issue with CDH 5.2 + Spark 1.1.0.
It turns out the problem was in my spark-submit command: I was using
--master yarn
instead of the new
--master yarn-cluster
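For example (the class and jar names are placeholders for your own application):
spark-submit --class com.example.MyApp --master yarn-cluster my-app-assembly.jar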