I want to prepare a Java class that reads a collection from MongoDB through an SQLContext, so that I get a Dataset to process in Spark. My code is as follows:
SparkConf conf = new SparkConf().setAppName("Aggregation").setMaster("local");
conf.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
conf.set("mongo.input.uri", uri);
conf.set("mongo.output.uri", uri);
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);
Map<String, String> options = new HashMap<String, String>();
options.put("host", mongoHost +":27017");
options.put("inferSchema", "true");
options.put("database", database);
Dataset df = sqlContext.read().format("com.stratio.datasource.mongodb").options(options).option("collection", "kpi_aggregator").load();
I use the following Maven dependencies:
<!-- Spark -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.0.0</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>log4j</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.0.0</version>
</dependency>
<!-- MongoDB -->
<dependency>
<groupId>org.mongodb.spark</groupId>
<artifactId>mongo-spark-connector_2.11</artifactId>
<version>2.0.0-rc0</version>
</dependency>
<dependency>
<groupId>org.mongodb</groupId>
<artifactId>mongo-java-driver</artifactId>
<version>3.3.0</version>
</dependency>
<dependency>
<groupId>com.stratio.datasource</groupId>
<artifactId>spark-mongodb_2.11</artifactId>
<version>0.12.0</version>
</dependency>
However, when I execute the code I get the following exception:
Caused by: java.lang.IllegalArgumentException: state should be: w >= 0
at com.mongodb.assertions.Assertions.isTrueArgument(Assertions.java:99)
at com.mongodb.WriteConcern.<init>(WriteConcern.java:316)
at com.mongodb.WriteConcern.<init>(WriteConcern.java:227)
at com.mongodb.casbah.WriteConcern$.<init>(WriteConcern.scala:41)
at com.mongodb.casbah.WriteConcern$.<clinit>(WriteConcern.scala)
... 43 more
Any ideas what the problem could be?
I managed to solve this by removing the com.stratio.datasource dependency (the stack trace points at Casbah, which appears to be pulled in by that data source) and changing my code as follows:
SparkConf conf = new SparkConf().setAppName("Aggregation").setMaster("local");
conf.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
conf.set("mongo.input.uri", uri);
conf.set("mongo.output.uri", uri);
conf.set("spark.mongodb.input.partitioner","MongoPaginateBySizePartitioner");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);
Map<String, String> options = new HashMap<String, String>();
options.put("uri", uri);
options.put("database", database);
Dataset df = MongoSpark.read(sqlContext).options(options).option("collection", "kpi_aggregator").load();
I'm trying a simple test that uses Kafka Connect and Spark.
I wrote a custom Kafka Connect connector that creates this source record:
SourceRecord sr = new SourceRecord(null,
null,
destTopic,
Schema.STRING_SCHEMA,
cleanPath);
In Spark I receive the message like this:
val kafkaConsumerParams = Map[String, String](
"metadata.broker.list" -> prop.getProperty("kafka_host"),
"zookeeper.connect" -> prop.getProperty("zookeeper_host"),
"group.id" -> prop.getProperty("kafka_group_id"),
"schema.registry.url" -> prop.getProperty("schema_registry_url"),
"auto.offset.reset" -> prop.getProperty("auto_offset_reset")
)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaConsumerParams, topicsSet)
val ds = messages.foreachRDD(rdd => {
val toPrint = rdd.map(t => {
val file_path = t._2
val startTime = DateTime.now()
Thread.sleep(1000 * 60)
1
}).sum()
LogUtils.getLogger(classOf[DeviceManager]).info(" toPrint = " + toPrint +" (number of flows calculated)")
})
}
When I use the connector to send multiple messages to the desired topic (in my test it has 6 partitions), the sleeping task receives all the messages but processes them synchronously instead of asynchronously.
When I create a simple test producer, the sleeps run asynchronously.
I also created 2 simple consumers and tried both the connector and a producer, and in both cases the messages were consumed asynchronously,
which means my problem lies in the way Spark receives the messages sent from the connector.
I can't figure out why the tasks don't behave the same way as they do when I send the messages from a producer.
I even printed the records that Spark receives, and they are exactly the same:
Records sent by the producer:
1: {partition=2, offset=11, value=something, key=null}
2: {partition=5, offset=9, value=something2, key=null}
Record sent by the connector:
1: {partition=3, offset=9, value=something, key=null}
The versions used in my project are:
<scala.version>2.11.7</scala.version>
<confluent.version>4.0.0</confluent.version>
<kafka.version>1.0.0</kafka.version>
<java.version>1.8</java.version>
<spark.version>2.0.0</spark.version>
Dependencies:
<dependency>
<groupId>io.confluent</groupId>
<artifactId>kafka-avro-serializer</artifactId>
<version>${confluent.version}</version>
</dependency>
<dependency>
<groupId>io.confluent</groupId>
<artifactId>kafka-schema-registry-client</artifactId>
<version>${confluent.version}</version>
</dependency>
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>1.8.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.11</artifactId>
<version>1.6.3</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-graphx_2.11</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector_2.11</artifactId>
<version>2.0.0-RC1</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.8.0</version>
</dependency>
<dependency>
<groupId>io.confluent</groupId>
<artifactId>kafka-avro-serializer</artifactId>
<version>${confluent.version}</version>
<scope>${global.scope}</scope>
</dependency>
<dependency>
<groupId>io.confluent</groupId>
<artifactId>kafka-connect-avro-converter</artifactId>
<version>${confluent.version}</version>
<scope>${global.scope}</scope>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>connect-api</artifactId>
<version>${kafka.version}</version>
</dependency>
Spark-Kafka streaming jobs cannot be run asynchronously, but they can be run in parallel, as Kafka consumers do. To do that, set the following configuration on the SparkConf:
sparkConf.set("spark.streaming.concurrentJobs","4")
Its default value is "1", but it can be overridden with a higher value.
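As a rough analogy for what a higher concurrency setting changes, here is a minimal plain-JDK sketch. It is not Spark code: the class name ConcurrentJobsSketch and the method runJobs are made up for illustration, and a fixed thread pool merely stands in for the number of jobs allowed to run at once. With one worker the sleeping jobs run back to back; with four workers they overlap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class ConcurrentJobsSketch {

    /** Runs `jobs` sleeping tasks on a pool of `threads` workers and returns the elapsed milliseconds. */
    public static long runJobs(int threads, int jobs, long sleepMillis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        // Each job only sleeps, like the Thread.sleep call in the question.
        Callable<Void> job = () -> {
            Thread.sleep(sleepMillis);
            return null;
        };
        long start = System.nanoTime();
        try {
            List<Future<Void>> pending = new ArrayList<>();
            for (int i = 0; i < jobs; i++) {
                pending.add(pool.submit(job));
            }
            // Wait for every job to finish before stopping the clock.
            for (Future<Void> f : pending) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("1 worker : " + runJobs(1, 4, 200) + " ms");
        System.out.println("4 workers: " + runJobs(4, 4, 200) + " ms");
    }
}
```

Note that this only models the scheduling effect. Within a single Spark job, records that land in the same Kafka partition are still handled by one task, so it may also be worth checking how the connector distributes records across partitions.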
I hope this helps!
I know that in the Mongo shell you use dot notation to get the field you want from a document.
How is dot notation achieved with the MongoDB Scala driver? I'm confused as to how it works. Here is the code that fetches a document from a collection:
val record = collection.find().projection(fields(include("offset"), excludeId())).limit(1)
EDIT:
I'm trying to build a mechanism that resumes consuming Kafka records from the point where the consumer was shut down. To do this, I store my Kafka records in an external database, then try to fetch the most recent offset from there and start consuming from that point. Here is my Scala method that should do that:
def getLatestCommitOffsetFromDB(collectionName: String): Long = {
import com.mongodb.Block
import org.bson.Document
val printBlock = new Block[Document]() {
override def apply(document: Document): Unit = {
println(document.toJson)
}
}
import com.mongodb.async.SingleResultCallback
val callbackWhenFinished = new SingleResultCallback[Void]() {
override def onResult(result: Void, t: Throwable): Unit = {
System.out.println("Latest offset fetched from database.")
}
}
var obj: String = " "
try {
val record = collection.find().projection(fields(include("offset"), excludeId())).limit(1)
//TODO FIND A WAY TO GET THE VALUE AND STORE IT IN A VARIABLE
} catch {
case e: RuntimeException =>
logger.error(s"MongoDB Server Error : Unable to fetch data from collection : $collection")
logger.error(e.printStackTrace().toString())
}
obj.toLong
}
The problem isn't that I can't fetch documents from Mongo; it's that I don't know how to access a particular field. The document has four fields: topic, partition, message, and offset. I want to get the "offset" field and store it in a variable, so I can use it as a restarting point to re-consume Kafka records.
Where do I go from here?
POM.xml
<?xml version="1.0" encoding="UTF-8"?>
<modelVersion>4.0.0</modelVersion>
<groupId>OffsetManagementPoC</groupId>
<artifactId>OffsetManagementPoC</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>1.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.12</artifactId>
<version>1.0.0</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-compiler</artifactId>
<version>2.11.8</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.25</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-streams</artifactId>
<version>0.10.0.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-core -->
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>
<version>2.6.5</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.6.5</version>
</dependency>
<!-- https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-annotations -->
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-annotations</artifactId>
<version>2.6.5</version>
</dependency>
<dependency>
<groupId>org.mongodb</groupId>
<artifactId>casbah_2.12</artifactId>
<version>3.1.1</version>
<type>pom</type>
</dependency>
<dependency>
<groupId>com.typesafe</groupId>
<artifactId>config</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>org.mongodb.scala</groupId>
<artifactId>mongo-scala-driver_2.12</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-compiler</artifactId>
<version>2.11.8</version>
</dependency>
<dependency>
<groupId>org.mongodb</groupId>
<artifactId>mongo-java-driver</artifactId>
<version>3.4.2</version>
</dependency>
<dependency>
<groupId>org.mongodb.scala</groupId>
<artifactId>mongo-scala-driver_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.mongodb</groupId>
<artifactId>bson</artifactId>
<version>3.3.0</version>
</dependency>
<dependency>
<groupId>org.mongodb</groupId>
<artifactId>mongodb-driver-async</artifactId>
<version>3.4.3</version>
</dependency>
<dependency>
<groupId>org.mongodb.scala</groupId>
<artifactId>mongo-scala-bson_2.11</artifactId>
<version>2.1.0</version>
</dependency>
</dependencies>
You can modify your query this way:
import com.mongodb.MongoClient
import com.mongodb.client.MongoCollection
import com.mongodb.client.model.Projections
def getLatestCommitOffsetFromDB(
    databaseName: String,
    collectionName: String
): Long = {
  val mongoClient = new MongoClient("localhost", 27017)
  val collection =
    mongoClient.getDatabase(databaseName).getCollection(collectionName)
  val record = collection
    .find()
    .projection(
      Projections
        .fields(Projections.include("offset"), Projections.excludeId()))
    .first
  record.get("offset").asInstanceOf[Double].toLong
}
I think you were missing the com.mongodb.client.model.Projections import, which is needed to use fields, include and excludeId.
I used first instead of limit(1) to make it easier to extract the result.
first returns a Document object, on which you can call get to retrieve the value of the requested field.
In fact, since you just want one record and one field, you can remove the projection altogether:
val record = collection.find().first
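The cast to Double in the snippet above only works if the offset happens to be stored as a double. As a minimal sketch of a more tolerant read, the following uses a plain java.util.Map as a stand-in for the driver's Document (which also implements Map<String, Object>). The class name OffsetFieldSketch and the helper readLong are hypothetical, not part of the MongoDB driver.

```java
import java.util.HashMap;
import java.util.Map;

public class OffsetFieldSketch {

    /** Reads a numeric field as a long, whichever numeric type it was stored as. */
    public static long readLong(Map<String, Object> document, String field) {
        Object value = document.get(field);
        if (!(value instanceof Number)) {
            // Covers both a missing field (null) and a non-numeric value.
            throw new IllegalStateException(
                "Field '" + field + "' is missing or not numeric: " + value);
        }
        return ((Number) value).longValue();
    }

    public static void main(String[] args) {
        // Stand-in for the document returned by find().first; values are made up.
        Map<String, Object> record = new HashMap<>();
        record.put("topic", "demo");
        record.put("offset", 42.0);
        System.out.println(readLong(record, "offset"));
    }
}
```

Going through Number avoids a ClassCastException when the value was written as an Int32 or Int64 rather than a double.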
According to the documentation, collection.find() accepts a com.mongodb.DBObject.
One implementation of that interface is BasicDBObject, which behaves much like a mutable.Map[String, Object]. It has a constructor that accepts a java.util.Map, so convert the Scala map first:
import scala.collection.JavaConverters._
val query = new com.mongodb.BasicDBObject(Map[String, Object](
"foo.bar" -> "value1",
"bar.foo" -> "value2"
).asJava)
val record = collection.find(query)....
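In a query, a dotted key such as "foo.bar" is interpreted by the server as "field bar inside the embedded document foo". On the client side, a fetched document is a nested map, so the same path has to be walked one key at a time. The following is a minimal sketch of that walk using plain java.util.Map; the class DotPathSketch and the method resolve are hypothetical helpers, and array elements (which server-side dot notation also supports) are deliberately ignored.

```java
import java.util.HashMap;
import java.util.Map;

public class DotPathSketch {

    /** Follows a dotted path through nested maps; returns null if any step is missing. */
    public static Object resolve(Map<String, Object> document, String dottedPath) {
        Object current = document;
        for (String key : dottedPath.split("\\.")) {
            if (!(current instanceof Map)) {
                return null; // tried to descend into a non-document value
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }

    public static void main(String[] args) {
        // Made-up nested document: { foo: { bar: "value1" } }
        Map<String, Object> inner = new HashMap<>();
        inner.put("bar", "value1");
        Map<String, Object> doc = new HashMap<>();
        doc.put("foo", inner);

        System.out.println(resolve(doc, "foo.bar"));
        System.out.println(resolve(doc, "foo.missing"));
    }
}
```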
I'm trying to run a simple Apache Flink program with Kafka integration, but I keep having problems with the execution.
The program should read messages coming from a Kafka producer, process them, and then send the result of the processing to another topic.
I took this example from here:
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Simple-Flink-Kafka-Test-td4828.html
The error I get is:
Exception in thread "main" java.lang.NoSuchFieldError:ALL
at org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createJobGraph(StreamingJobGraphGenerator.java:86)
at org.apache.flink.streaming.api.graph.StreamGraph.getJobGraph(StreamGraph.java:429)
at org.apache.flink.streaming.api.environment.LocalStreamEnvironment.execute(LocalStreamEnvironment.java:46)
at org.apache.flink.streaming.api.environment.LocalStreamEnvironment.execute(LocalStreamEnvironment.java:33)
This is my code:
public class App {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
//properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "javaflink");
DataStream<String> messageStream = env.addSource(new FlinkKafkaConsumer010<String>("test", new SimpleStringSchema(), properties));
System.out.println("Step D");
messageStream.map(new MapFunction<String, String>(){
public String map(String value) throws Exception {
// TODO Auto-generated method stub
return "Blablabla " + value;
}
}).addSink(new FlinkKafkaProducer010("localhost:9092", "demo2", new SimpleStringSchema()));
env.execute();
}
}
These are the pom.xml dependencies:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-core</artifactId>
<version>1.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java_2.11</artifactId>
<version>0.10.2</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients_2.11</artifactId>
<version>1.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-core</artifactId>
<version>0.9.1</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_2.11</artifactId>
<version>1.3.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka-0.10_2.11</artifactId>
<version>1.3.1</version>
</dependency>
What could cause this kind of error?
Thanks
Luca
The problem is most likely caused by the mixture of different Flink versions you have defined in your pom.xml. In order to run this program, it should be enough to include the following dependencies:
<!-- Streaming API -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_2.11</artifactId>
<version>1.3.1</version>
</dependency>
<!-- In order to execute the program from within your IDE -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients_2.11</artifactId>
<version>1.3.1</version>
</dependency>
<!-- Kafka connector dependency -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka-0.10_2.11</artifactId>
<version>1.3.1</version>
</dependency>
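To make the "mixture of versions" concrete, here is a small plain-Java sketch that groups dependency coordinates by groupId and reports every group that appears with more than one version. The class VersionMixSketch and the method mixedVersions are hypothetical; in a real project, running mvn dependency:tree gives the authoritative picture, including transitive dependencies.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class VersionMixSketch {

    /**
     * Takes "groupId:artifactId:version" strings and returns, per groupId,
     * the set of versions, keeping only groups that use more than one version.
     */
    public static Map<String, Set<String>> mixedVersions(List<String> coordinates) {
        Map<String, Set<String>> byGroup = new TreeMap<>();
        for (String coordinate : coordinates) {
            String[] parts = coordinate.split(":");
            if (parts.length != 3) {
                throw new IllegalArgumentException("Expected group:artifact:version, got " + coordinate);
            }
            byGroup.computeIfAbsent(parts[0], k -> new TreeSet<>()).add(parts[2]);
        }
        // Drop groups whose artifacts all share a single version.
        byGroup.values().removeIf(versions -> versions.size() < 2);
        return byGroup;
    }

    public static void main(String[] args) {
        // Coordinates copied from the pom.xml in the question.
        List<String> deps = Arrays.asList(
            "org.apache.flink:flink-core:1.3.1",
            "org.apache.flink:flink-java_2.11:0.10.2",
            "org.apache.flink:flink-clients_2.11:1.3.1",
            "org.apache.flink:flink-streaming-core:0.9.1",
            "org.apache.flink:flink-streaming-java_2.11:1.3.1",
            "org.apache.flink:flink-connector-kafka-0.10_2.11:1.3.1");
        System.out.println(mixedVersions(deps));
    }
}
```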
According to the Storm documentation, the supported implementation of KafkaSpout is based on the old consumer API. I noticed that the external package has another implementation named storm-kafka-client:
https://github.com/apache/storm/tree/master/external/storm-kafka-client
It is unclear whether the new client released in 1.0.1 is production ready. Does anyone have experience running it?
I posted the same question to the Storm mailing list.
The new API is production ready; we should use the 1.x branch.
I plan to test with:
<!-- https://mvnrepository.com/artifact/org.apache.storm/storm-kafka-client -->
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-kafka-client</artifactId>
<version>1.0.1</version>
</dependency>
Will update on the progress.
The code below works fine for me:
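public TopologyBuilder myTopology() {
    TopologyBuilder builder = new TopologyBuilder();
    try {
        KafkaSpoutConfig<String, String> kafkaSpoutConfig = getKafkaSpoutConfig("KAFKA_IP:9092", KAFKA_TOPIC);
        KafkaSpout<String, String> kafkaSpout = new KafkaSpout<>(kafkaSpoutConfig);
        builder.setSpout("kafkaSpout", kafkaSpout, 2 * 2);
        // The record translator emits on a stream named after the topic, so subscribe to that stream.
        builder.setBolt("Bolt-1", new TestBolt(), parallelism).shuffleGrouping("kafkaSpout", KAFKA_TOPIC);
    } catch (Exception ex) {
        // Do not swallow configuration errors silently.
        throw new IllegalStateException("Failed to build topology", ex);
    }
    return builder;
}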
Configure the spout:
protected KafkaSpoutConfig<String, String> getKafkaSpoutConfig(String bootstrapServers ,String topic) {
ByTopicRecordTranslator<String, String> trans = new ByTopicRecordTranslator<>(
(r) -> new Values(r.topic(), r.partition(), r.offset(), r.key(), r.value()),
new Fields("topic", "partition", "offset", "key", "value"), topic);
Builder<String, String> builder = KafkaSpoutConfig.builder(bootstrapServers, new String[]{topic});
return builder.setProp(ConsumerConfig.GROUP_ID_CONFIG, topic)
.setProcessingGuarantee(ProcessingGuarantee.AT_LEAST_ONCE)
.setRetry(getRetryService())
.setRecordTranslator(trans)
.setOffsetCommitPeriodMs(10_000)
.setFirstPollOffsetStrategy(UNCOMMITTED_EARLIEST)
.setMaxUncommittedOffsets(1000)
.build();
}
To configure the retry logic for failed messages:
protected KafkaSpoutRetryService getRetryService() {
return new KafkaSpoutRetryExponentialBackoff(TimeInterval.microSeconds(500),
TimeInterval.milliSeconds(2), Integer.MAX_VALUE, TimeInterval.seconds(10));
}
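To get a feel for what such a retry configuration means, here is a minimal sketch of a generic capped exponential backoff, reading the four arguments above as initial delay (500 µs), delay period (2 ms), maximum number of retries and maximum delay (10 s). The doubling schedule is an assumption made for illustration and is not necessarily the exact formula Storm uses; the class BackoffSketch and the method delayNanos are hypothetical.

```java
import java.util.concurrent.TimeUnit;

public class BackoffSketch {

    /**
     * Delay before retry number failCount (1-based), in nanoseconds.
     * Assumed schedule: the first retry waits initialDelay; later retries wait
     * period, 2 * period, 4 * period, ... and never more than maxDelay.
     */
    public static long delayNanos(int failCount, long initialDelay, long period, long maxDelay) {
        if (failCount <= 1) {
            return Math.min(initialDelay, maxDelay);
        }
        long delay = period;
        // Stop doubling once the cap is reached, which also avoids overflow.
        for (int i = 2; i < failCount && delay < maxDelay; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxDelay);
    }

    public static void main(String[] args) {
        long initial = TimeUnit.MICROSECONDS.toNanos(500);
        long period = TimeUnit.MILLISECONDS.toNanos(2);
        long max = TimeUnit.SECONDS.toNanos(10);
        for (int n = 1; n <= 16; n++) {
            long micros = TimeUnit.NANOSECONDS.toMicros(delayNanos(n, initial, period, max));
            System.out.println("retry " + n + ": " + micros + " us");
        }
    }
}
```

The cap matters here because the retry limit is Integer.MAX_VALUE: without a maximum delay, a message that keeps failing would wait longer and longer without bound.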
You can use the following Maven dependencies for Storm 1.1.0:
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-core</artifactId>
<version>1.1.0</version>
<scope>provided</scope>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>log4j-over-slf4j</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.10.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-kafka</artifactId>
<version>1.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.10</artifactId>
<version>0.9.0.0</version>
<exclusions>
<exclusion>
<groupId>org.apache.zookeeper</groupId>
<artifactId>zookeeper</artifactId>
</exclusion>
<exclusion>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
</exclusion>
</exclusions>
</dependency>
You may run into further dependency issues, which you can resolve by adding the required JARs.
Also, the imports in the Java code change from backtype.storm.XXXXX to org.apache.storm.XXXXX.
I'm trying to connect to Hive (in the Hortonworks sandbox) and I'm receiving the message below:
Exception in thread "main" java.sql.SQLException: No suitable driver found for jdbc:hive2://sandbox.hortonworks.com:10000/default
Maven dependencies:
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.10</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
</dependencies>
Code:
// **** SetMaster is Local only to test *****
// Set context
val sparkConf = new SparkConf().setAppName("process").setMaster("local")
val sc = new SparkContext(sparkConf)
val hiveContext = new HiveContext(sc)
// Set HDFS
System.setProperty("HADOOP_USER_NAME", "hdfs")
val hdfsconf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
hdfsconf.set("fs.defaultFS", "hdfs://sandbox.hortonworks.com:8020")
val hdfs = FileSystem.get(hdfsconf)
// Set Hive Connector
val url = "jdbc:hive2://sandbox.hortonworks.com:10000/default"
val user = "username"
val password = "password"
hiveContext.read.format("jdbc").options(Map("url" -> url,
"user" -> user,
"password" -> password,
"dbtable" -> "tablename")).load()
You need to have the Hive JDBC driver on your application classpath:
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>1.2.1</version>
<scope>provided</scope>
</dependency>
Also, specify the driver explicitly in the options:
"driver" -> "org.apache.hive.jdbc.HiveDriver"
However, it's better to skip JDBC and use Spark's native Hive integration, since it makes it possible to use the Hive metastore. See http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
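If you do stay with JDBC, a quick way to confirm the "No suitable driver" diagnosis is to check, from the same classpath the application runs with, whether the driver class can be loaded and whether DriverManager knows a driver for the URL. This is a minimal plain-JDK sketch; the class DriverCheckSketch and its two methods are hypothetical helpers.

```java
import java.sql.DriverManager;
import java.sql.SQLException;

public class DriverCheckSketch {

    /** Returns true if the named class can be loaded from the current classpath. */
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Returns true if a registered JDBC driver accepts the given URL. */
    public static boolean hasDriverFor(String jdbcUrl) {
        try {
            DriverManager.getDriver(jdbcUrl);
            return true;
        } catch (SQLException e) {
            // DriverManager throws when no registered driver accepts the URL.
            return false;
        }
    }

    public static void main(String[] args) {
        String driver = "org.apache.hive.jdbc.HiveDriver";
        String url = "jdbc:hive2://sandbox.hortonworks.com:10000/default";
        System.out.println("driver class loadable: " + isOnClasspath(driver));
        System.out.println("driver registered for URL: " + hasDriverFor(url));
    }
}
```

If the first check fails, the hive-jdbc JAR is not on the classpath at run time. Keep in mind that a provided scope means the JAR is expected to be supplied by the runtime environment rather than bundled with the application.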