I'm looking for a way to save my Kafka streaming RDD to Redis as a sorted set (with zscore) in append mode. Is there any connector for this? I tried the spark-redis connector by RedisLabs, but it is only compatible with Scala 2.10. There is another project by Anchormen, https://github.com/Anchormen/spark-redis-connector, but it lacks documentation and its jar is not available on Maven.
collection.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // save the RDD to Redis as a sorted set with zscore
  }
}
So, any suggestions please? Thanks.
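For what it's worth, here is a minimal sketch of what that foreachRDD body could look like using plain Jedis (no connector), writing each record into a Redis sorted set with ZADD. The host, port, key name, and the (member, score) record shape are assumptions, not part of the original question:

import org.apache.spark.streaming.dstream.DStream
import redis.clients.jedis.Jedis

def saveToRedisSortedSet(stream: DStream[(String, Double)]): Unit = {
  stream.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      rdd.foreachPartition { records =>
        // One connection per partition; Jedis clients are not serializable
        val jedis = new Jedis("redis-host", 6379) // assumption: your Redis endpoint
        try {
          records.foreach { case (member, score) =>
            // ZADD adds or updates members in the sorted set, which effectively appends new entries
            jedis.zadd("my-zset-key", score, member) // assumption: key name
          }
        } finally {
          jedis.close()
        }
      }
    }
  }
}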
I am using the Cassandra DataStax driver 4.x in Scala. I built a minimal session as DataStax explains in its documentation:
ref. https://github.com/datastax/java-driver/tree/4.x/manual/core#cqlsession
val session: CqlSession = CqlSession.builder.build()
I have a few keyspaces and tables in my Cassandra instance (Apache Cassandra 3.11). However, when I try to get all keyspaces with the getMetadata method, it returns an empty list.
val keyspaces = session.getMetadata.getKeyspaces.values()
Does anyone have the same problem or know what happens here?
Thanks so much!
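For reference, a slightly expanded sketch of the same metadata call with the 4.x driver, listing the keyspace names it returns; it assumes the driver's default contact point (127.0.0.1:9042) actually resolves to the intended cluster:

import com.datastax.oss.driver.api.core.CqlSession
import scala.collection.JavaConverters._

val session: CqlSession = CqlSession.builder().build()

// getKeyspaces returns a java.util.Map[CqlIdentifier, KeyspaceMetadata]
val keyspaceNames = session.getMetadata.getKeyspaces.keySet().asScala.map(_.asInternal())
keyspaceNames.foreach(println)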
I am creating a DataFrame from a Kafka topic using Spark Streaming.
I want to write the DataFrame to a Kinesis producer.
I understand that there is no official API for this as of now. There are multiple APIs available on the internet, but sadly none of them worked for me.
Spark version : 2.2
Scala : 2.11
I tried using https://github.com/awslabs/kinesis-kafka-connector and built the jar, but I get errors due to conflicting package names between this jar and the Spark API. Please help.
Here is the code for others:
spark-shell --jars spark-sql-kinesis_2.11-2.2.0.jar,spark-sql-kafka-0-10_2.11-2.1.0.jar,spark-streaming-kafka-0-10-assembly_2.10-2.1.0.jar \
  --files kafka_client_jaas_spark.conf \
  --properties-file gobblin_migration.conf \
  --conf spark.port.maxRetries=100 \
  --driver-java-options "-Djava.security.auth.login.config=kafka_client_jaas_spark.conf" \
  --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=kafka_client_jaas_spark.conf" \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=kafka_client_jaas_spark.conf"
import java.io.File
import org.apache.commons.lang3.exception.ExceptionUtils
import org.apache.spark.SparkException
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import scala.sys.process._
import org.apache.log4j.{ Logger, Level, LogManager, PropertyConfigurator }
import org.apache.spark.sql.streaming.Trigger
val streamingInputDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "bootstrap server")
  .option("subscribe", "<kafkatopic>")
  .option("startingOffsets", "latest")
  .option("failOnDataLoss", "false")
  .option("kafka.security.protocol", "SASL_PLAINTEXT")
  .load()

val xdf = streamingInputDF.select(
  col("partition").cast("String").alias("partitionKey"),
  col("value").alias("data"))

xdf.writeStream
  .format("kinesis")
  .option("checkpointLocation", "<hdfspath>")
  .outputMode("Append")
  .option("streamName", "kinesisstreamname")
  .option("endpointUrl", "kinesisendpoint")
  .option("awsAccessKeyId", "accesskey")
  .option("awsSecretKey", "secretkey")
  .start()
  .awaitTermination()
For the jar spark-sql-kinesis_2.11-2.2.0.jar, go to Qubole, download the package for your Spark version, and build the jar.
If you are behind a corporate network, set the proxy before launching spark.
export http_proxy=http://server-ip:port/
export https_proxy=https://server-ip:port/
Kafka Connect is a service to which you can POST your connector specifications (kinesis in this case), which then takes care of running the connector. It supports quite a few transformations as well while processing the records. Kafka Connect plugins are not intended to be used with Spark applications.
If your use case requires you to do some business logic while processing the records, then you could go with either Spark Streaming or Structured Streaming approach.
If you want to take a Spark-based approach, below are the two options I can think of.
Use Structured Streaming. You could use a Structured Streaming connector for Kinesis. You can find one here. There may be others too. This is the only stable and open-source connector I am aware of. You can find an example of using Kinesis as a sink here.
Use the Kinesis Producer Library (KPL) or the aws-java-sdk-kinesis library to publish records from your Spark Streaming application. Using the KPL is the preferred approach here. You could do mapPartitions, create a Kinesis client per partition, and publish the records using these libraries (a sketch of the plain SDK approach follows below). There are plenty of examples in the AWS docs for these two libraries.
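To illustrate the second option, here is a hedged sketch of publishing from foreachPartition with the plain aws-java-sdk-kinesis (v1) client; the region, stream name, and the (key, payload) record shape are placeholders:

import java.nio.ByteBuffer
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.services.kinesis.model.PutRecordRequest
import org.apache.spark.streaming.dstream.DStream

def publishToKinesis(stream: DStream[(String, Array[Byte])]): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // Build one client per partition; AWS clients should not be serialized from the driver
      val kinesis = AmazonKinesisClientBuilder.standard()
        .withRegion("us-east-1") // assumption: set your own region
        .build()
      records.foreach { case (key, payload) =>
        val request = new PutRecordRequest()
          .withStreamName("my-kinesis-stream") // assumption: your stream name
          .withPartitionKey(key)
          .withData(ByteBuffer.wrap(payload))
        kinesis.putRecord(request)
      }
    }
  }
}

The KPL variant would swap this client for a KinesisProducer, which batches and aggregates records for you; that batching is what makes the KPL the preferred choice for throughput.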
Is there any best practice for Spark to process a Kafka stream that is serialized in Avro with a schema registry, especially for Spark Structured Streaming?
I found an example at https://github.com/ScalaConsultants/spark-kafka-avro/blob/master/src/main/scala/io/scalac/spark/AvroConsumer.scala, but I failed to load the AvroConverter class. I cannot find an artifact named io.confluent:kafka-avro-serializer on mvnrepository.com.
You need to add the Confluent repo in your build.sbt:
val repositories = Seq(
"confluent" at "http://packages.confluent.io/maven/",
Resolver.sonatypeRepo("public")
)
See: https://github.com/ScalaConsultants/spark-kafka-avro/blob/master/build.sbt
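For completeness, a minimal sketch of how those resolvers might be wired up in the same build.sbt, together with a dependency on the Confluent serializer; the version number is an assumption and should match your Confluent Platform release:

resolvers ++= repositories

// The artifact is hosted in the Confluent repo rather than Maven Central
libraryDependencies += "io.confluent" % "kafka-avro-serializer" % "3.1.1"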
I'm working on Apache Spark Streaming (NOT Spark SQL) and I would like to store the results of my script in MongoDB.
How can I do that? I found some tutorials, but they only work with Apache Spark SQL.
Thanks
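In case it helps, a minimal sketch of writing Spark Streaming output straight to MongoDB with the plain Java driver (no Spark SQL involved); the URI, database, collection, and the (word, count) record shape are assumptions:

import com.mongodb.client.MongoClients
import org.apache.spark.streaming.dstream.DStream
import org.bson.Document

def saveToMongo(results: DStream[(String, Long)]): Unit = {
  results.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // One client per partition; MongoDB clients are not serializable
      val client = MongoClients.create("mongodb://localhost:27017") // assumption: your MongoDB URI
      try {
        val collection = client.getDatabase("mydb").getCollection("results") // assumption: db/collection
        records.foreach { case (word, count) =>
          // Explicit boxing because the Document API takes java.lang.Object values
          collection.insertOne(new Document("word", word).append("count", java.lang.Long.valueOf(count)))
        }
      } finally {
        client.close()
      }
    }
  }
}

The MongoDB Spark connector also exposes RDD-level helpers, but the plain driver keeps Spark SQL out of the picture entirely.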
I am trying to connect Apache Spark to MongoDB using Mesos. Here is my architecture:
MongoDB: MongoDB Cluster of 2 shards, 1 config server and 1 query server.
Mesos: 1 Mesos Master, 4 Mesos slaves
Now I have installed Spark on just one node. There is not much information available on this out there, so I just wanted to pose a few questions:
As far as I understand, I can connect Spark to MongoDB via Mesos; in other words, I end up using MongoDB as the storage layer. Do I really need Hadoop? Is it mandatory to pull all the data into Hadoop just for Spark to read it?
Here is why I am asking: the Spark install expects the HADOOP_HOME variable to be set, which seems like very tight coupling! Most of the posts on the net talk about the MongoDB-Hadoop connector, which doesn't make sense if I'm being forced to move everything into Hadoop.
Does anyone have an answer?
Regards
Mario
Spark itself takes a dependency on Hadoop and data in HDFS can be used as a datasource.
However, if you use the Mongo Spark Connector you can use MongoDB as a datasource for Spark without going via Hadoop at all.
The Spark-Mongo connector is a good idea; moreover, if you are executing Spark in a Hadoop cluster, you need to set HADOOP_HOME.
Check your requirements and test it (tutorial).
Basic working knowledge of MongoDB and Apache Spark. Refer to the MongoDB documentation and Spark documentation.
Running MongoDB instance (version 2.6 or later).
Spark 1.6.x.
Scala 2.10.x if using the mongo-spark-connector_2.10 package
Scala 2.11.x if using the mongo-spark-connector_2.11 package
The new MongoDB Connector for Apache Spark provides higher performance, greater ease of use, and access to more advanced Spark functionality than the MongoDB Connector for Hadoop. The MongoDB documentation includes a table comparing the capabilities of both connectors.
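As a rough illustration of the connector (not the original poster's code), here is a sketch of writing a DataFrame with mongo-spark-connector; the URI, database, and collection are placeholders, and it assumes the Spark 2.x SparkSession API rather than the 1.6.x API listed above:

import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("mongo-connector-example")
  .config("spark.mongodb.output.uri", "mongodb://mongo-host:27017/mydb.mycollection") // assumption
  .getOrCreate()

import spark.implicits._
val df = Seq(("alice", 1), ("bob", 2)).toDF("name", "value")

// Writes the DataFrame to the collection configured in spark.mongodb.output.uri
MongoSpark.save(df)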
Then you need to configure Spark with Mesos:
Connecting Spark to Mesos
To use Mesos from Spark, you need a Spark binary package available in a place accessible by Mesos, and a Spark driver program configured to connect to Mesos.
Alternatively, you can also install Spark in the same location in all the Mesos slaves, and configure spark.mesos.executor.home (defaults to SPARK_HOME) to point to that location.
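To make that last point concrete, a hedged sketch of pointing the driver at a Mesos master and telling executors where Spark lives; the master URL and paths are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-on-mesos")
  .master("mesos://mesos-master-host:5050") // assumption: your Mesos master
  .config("spark.mesos.executor.home", "/opt/spark") // assumption: Spark install path on the agents
  // Alternatively, point executors at a Spark binary package reachable by Mesos:
  // .config("spark.executor.uri", "hdfs://namenode:8020/spark/spark-2.2.0-bin-hadoop2.7.tgz")
  .getOrCreate()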