Alpakka Consumer not consuming messages from Kafka running via Docker compose - scala

I've got Kafka and Zookeeper running via Docker Compose. I'm able to send and consume messages on a topic using the Kafka console tools, and I can monitor everything via Conduktor. Unfortunately, I'm not able to consume messages from my Scala app using the Alpakka connector. The app connects to the topic, but whenever I send a message to the topic nothing happens.
Only Kafka and Zookeeper run via docker-compose; I'm running the Scala consumer app directly on the host machine.
Docker Compose
version: '3'
services:
  zookeeper:
    image: wurstmeister/zookeeper
    ports:
      - "2181:2181"
  kafka:
    image: wurstmeister/kafka
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      HOSTNAME_COMMAND: "route -n | awk '/UG[ \t]/{print $$2}'"
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    depends_on:
      - zookeeper
Scala App
import akka.actor.ActorSystem
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.scaladsl.Sink
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer

import scala.concurrent.duration._
import scala.util.{Failure, Success}

object Main extends App {
  implicit val actorSystem = ActorSystem()
  import actorSystem.dispatcher

  val kafkaConsumerSettings =
    ConsumerSettings(actorSystem, new StringDeserializer, new StringDeserializer)
      .withGroupId("new_id")
      .withCommitRefreshInterval(1.seconds)
      .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
      .withBootstrapServers("localhost:9092")

  Consumer
    .plainSource(kafkaConsumerSettings, Subscriptions.topics("test1"))
    .map(msg => msg.value())
    .runWith(Sink.foreach(println))
    .onComplete {
      case Failure(exception) => exception.printStackTrace()
      case Success(value)     => println("done")
    }
}
App - Console output
16:58:33.877 INFO [akka.event.slf4j.Slf4jLogger] Slf4jLogger started
16:58:34.470 INFO [akka.kafka.internal.SingleSourceLogic] [1955f] Starting. StageActor Actor[akka://default/system/Materializers/StreamSupervisor-0/$$a#-591284224]
16:58:34.516 INFO [org.apache.kafka.clients.consumer.ConsumerConfig] ConsumerConfig values:
allow.auto.create.topics = true
auto.commit.interval.ms = 5000
auto.offset.reset = earliest
bootstrap.servers = [localhost:9092]
check.crcs = true
client.dns.lookup = default
client.id =
client.rack =
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = novo_id
group.instance.id = null
heartbeat.interval.ms = 3000
interceptor.classes = []
internal.leave.group.on.close = true
isolation.level = read_uncommitted
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 500
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 30000
retry.backoff.ms = 100
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = GSSAPI
security.protocol = PLAINTEXT
security.providers = null
send.buffer.bytes = 131072
session.timeout.ms = 10000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = https
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
16:58:34.701 INFO [org.apache.kafka.common.utils.AppInfoParser] Kafka version: 2.4.0
16:58:34.702 INFO [org.apache.kafka.common.utils.AppInfoParser] Kafka commitId: 77a89fcf8d7fa018
16:58:34.702 INFO [org.apache.kafka.common.utils.AppInfoParser] Kafka startTimeMs: 1585256314699
16:58:34.715 INFO [org.apache.kafka.clients.consumer.KafkaConsumer] [Consumer clientId=consumer-novo_id-1, groupId=novo_id] Subscribed to topic(s): test1
16:58:35.308 INFO [org.apache.kafka.clients.Metadata] [Consumer clientId=consumer-novo_id-1, groupId=novo_id] Cluster ID: c2XBuDIJTI-gBs9guTvG

Export KAFKA_ADVERTISED_LISTENERS
This describes how the advertised host name can be reached by clients. The value is published to ZooKeeper for clients to use.
If using the SSL or SASL protocol, the endpoint value must specify the protocol in one of the following formats:
SSL: SSL:// or SASL_SSL://
SASL: SASL_PLAINTEXT:// or SASL_SSL://
KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://localhost:29092
And now your consumer can use port 29092:
.withBootstrapServers("localhost:29092")
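For reference, here is a minimal sketch of how this could be wired into the docker-compose file above. The listener layout and the published 29092 port are assumptions for illustration, not part of the original setup:
  kafka:
    image: wurstmeister/kafka
    ports:
      - "29092:29092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # advertise an address the host-side Scala app can resolve
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:29092
    depends_on:
      - zookeeper
With this, the consumer on the host connects to localhost:29092, and the broker advertises an address that is reachable from the host machine.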

Related

Error while connecting Flume with Kafka - Could not find a 'KafkaClient' entry in the JAAS configuration

Currently we are using CDP 7.1.7 and the client wants to use Flume. Since CDP has removed Flume, we need to install it as a separate application. I have installed Flume on one of the data nodes.
Here are the config files:
flume-env.sh
export JAVA_OPTS="$JAVA_OPTS -Djava.security.krb5.conf=/etc/krb5.conf
-Djava.security.auth.login.config=/opt/cloudera/security/flafka_jaas.conf "
flafka_jaas.conf
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
keyTab="flume.keytab"
principal="flume/hostname#realm";
};
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
serviceName="kafka"
keyTab="flume.keytab"
principal="flume/hostname#realm";
};
Flume.conf
KafkaAgent.sources = source_kafka
KafkaAgent.channels = MemChannel
KafkaAgent.sinks = LoggerSink
#Configuring Source
KafkaAgent.sources.source_kafka.type = org.apache.flume.source.kafka.KafkaSource
KafkaAgent.sources.source_kafka.kafka.bootstrap.servers = hostn1:9092,host2:9092,host3:9092
KafkaAgent.sources.source_kafka.kafka.topics = cim
KafkaAgent.sources.source_kafka.kafka.consumer.group.id = flume
KafkaAgent.sources.source_kafka.channels = MemChannel
#KafkaAgent.sources.source_kafka.kafka.consumer.timeout.ms = 100
KafkaAgent.sources.source_kafka.agent-principal=flume/hostname#realm
KafkaAgent.sources.source_kafka.agent-keytab=flume.keytab
KafkaAgent.sources.source_kafka.kafka.consumer.security.protocol = SASL_PLAINTEXT
KafkaAgent.sources.source_kafka.kafka.consumer.sasl.kerberos.service.name = kafka
KafkaAgent.sources.source_kafka.kafka.consumer.sasl.mechanism = GSSAPI
KafkaAgent.sources.source_kafka.kafka.consumer.security.protocol = SASL_PLAINTEXT
#Configuring Sink
KafkaAgent.sinks.LoggerSink.type = logger
#Configuring Channel
KafkaAgent.channels.MemChannel.type = memory
KafkaAgent.channels.MemChannel.capacity = 10000
KafkaAgent.channels.MemChannel.transactionCapacity = 1000
#bind source and sink to channel
KafkaAgent.sinks.LoggerSink.channel = MemChannel
After running this command:
`flume-ng agent -n KafkaAgent -c -conf /opt/cdpdeployment/apache-flume-1.9.0-bin/conf/ -f /opt/cdpdeployment/apache-flume-1.9.0-bin/conf/kafka-flume.conf -Dflume.root.logger=DEBUG,console`
I get the error below:
Caused by: java.lang.IllegalArgumentException: Could not find a 'KafkaClient' entry in the JAAS configuration. System property 'java.security.auth.login.config' is not set
at org.apache.kafka.common.security.JaasContext.defaultContext(JaasContext.java:133)
at org.apache.kafka.common.security.JaasContext.load(JaasContext.java:98)
at org.apache.kafka.common.security.JaasContext.loadClientContext(JaasContext.java:84)
at org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:119)
at org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:65)
at org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:88)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:713)
Can someone tell me what I am missing as far as the configs are concerned?
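Since the exception complains that the java.security.auth.login.config system property is not set, one hedged way to rule out flume-env.sh not being picked up is to pass the property directly on the flume-ng command line, the same way -Dflume.root.logger is passed. This is only an illustration reusing the paths already shown above, not a confirmed fix:
flume-ng agent -n KafkaAgent \
  --conf /opt/cdpdeployment/apache-flume-1.9.0-bin/conf/ \
  -f /opt/cdpdeployment/apache-flume-1.9.0-bin/conf/kafka-flume.conf \
  -Djava.security.auth.login.config=/opt/cloudera/security/flafka_jaas.conf \
  -Dflume.root.logger=DEBUG,console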

CDC with WSO2 Streaming Integrator and Postgres DB

I am trying to setup Change Data Capture (CDC) between WSO2 Streaming Integrator and a local Postgres DB.
I have added the Postgres Driver (v42.2.5) to SI_HOME/lib and I am able to read data from the database from a Siddhi application.
I am following the CDCWithListeningMode example to implement CDC and I am using pgoutput as the logical decoding plugin. But when I run the application I get the following log.
[2020-04-23_19-02-37_460] INFO {org.apache.kafka.connect.json.JsonConverterConfig} - JsonConverterConfig values:
converter.type = key
schemas.cache.size = 1000
schemas.enable = true
[2020-04-23_19-02-37_461] INFO {org.apache.kafka.connect.json.JsonConverterConfig} - JsonConverterConfig values:
converter.type = value
schemas.cache.size = 1000
schemas.enable = false
[2020-04-23_19-02-37_461] INFO {io.debezium.embedded.EmbeddedEngine$EmbeddedConfig} - EmbeddedConfig values:
access.control.allow.methods =
access.control.allow.origin =
bootstrap.servers = [localhost:9092]
header.converter = class org.apache.kafka.connect.storage.SimpleHeaderConverter
internal.key.converter = class org.apache.kafka.connect.json.JsonConverter
internal.value.converter = class org.apache.kafka.connect.json.JsonConverter
key.converter = class org.apache.kafka.connect.json.JsonConverter
listeners = null
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
offset.flush.interval.ms = 60000
offset.flush.timeout.ms = 5000
offset.storage.file.filename =
offset.storage.partitions = null
offset.storage.replication.factor = null
offset.storage.topic =
plugin.path = null
rest.advertised.host.name = null
rest.advertised.listener = null
rest.advertised.port = null
rest.host.name = null
rest.port = 8083
ssl.client.auth = none
task.shutdown.graceful.timeout.ms = 5000
value.converter = class org.apache.kafka.connect.json.JsonConverter
[2020-04-23_19-02-37_516] INFO {io.debezium.connector.common.BaseSourceTask} - offset.storage = io.siddhi.extension.io.cdc.source.listening.InMemoryOffsetBackingStore
[2020-04-23_19-02-37_517] INFO {io.debezium.connector.common.BaseSourceTask} - database.server.name = localhost_5432
[2020-04-23_19-02-37_517] INFO {io.debezium.connector.common.BaseSourceTask} - database.port = 5432
[2020-04-23_19-02-37_517] INFO {io.debezium.connector.common.BaseSourceTask} - table.whitelist = SweetProductionTable
[2020-04-23_19-02-37_517] INFO {io.debezium.connector.common.BaseSourceTask} - cdc.source.object = 1716717434
[2020-04-23_19-02-37_517] INFO {io.debezium.connector.common.BaseSourceTask} - database.hostname = localhost
[2020-04-23_19-02-37_518] INFO {io.debezium.connector.common.BaseSourceTask} - database.password = ********
[2020-04-23_19-02-37_518] INFO {io.debezium.connector.common.BaseSourceTask} - name = CDCWithListeningModeinsertSweetProductionStream
[2020-04-23_19-02-37_518] INFO {io.debezium.connector.common.BaseSourceTask} - server.id = 6140
[2020-04-23_19-02-37_519] INFO {io.debezium.connector.common.BaseSourceTask} - database.history = io.debezium.relational.history.FileDatabaseHistory
[2020-04-23_19-02-38_103] INFO {io.debezium.connector.postgresql.PostgresConnectorTask} - user 'user_name' connected to database 'db_name' on PostgreSQL 11.5, compiled by Visual C++ build 1914, 64-bit with roles:
role 'user_name' [superuser: false, replication: true, inherit: true, create role: false, create db: false, can log in: true] (Encoded)
[2020-04-23_19-02-38_104] INFO {io.debezium.connector.postgresql.PostgresConnectorTask} - No previous offset found
[2020-04-23_19-02-38_104] INFO {io.debezium.connector.postgresql.PostgresConnectorTask} - Taking a new snapshot of the DB and streaming logical changes once the snapshot is finished...
[2020-04-23_19-02-38_105] INFO {io.debezium.util.Threads} - Requested thread factory for connector PostgresConnector, id = localhost_5432 named = records-snapshot-producer
[2020-04-23_19-02-38_105] INFO {io.debezium.util.Threads} - Requested thread factory for connector PostgresConnector, id = localhost_5432 named = records-stream-producer
[2020-04-23_19-02-38_293] INFO {io.debezium.connector.postgresql.connection.PostgresConnection} - Obtained valid replication slot ReplicationSlot [active=false, latestFlushedLSN=null]
[2020-04-23_19-02-38_704] ERROR {io.siddhi.core.stream.input.source.Source} - Error on 'CDCWithListeningMode'. Connection to the database lost. Error while connecting at Source 'cdc' at 'insertSweetProductionStream'. Will retry in '5 sec'. (Encoded)
io.siddhi.core.exception.ConnectionUnavailableException: Connection to the database lost.
at io.siddhi.extension.io.cdc.source.CDCSource.lambda$connect$1(CDCSource.java:424)
at io.debezium.embedded.EmbeddedEngine.run(EmbeddedEngine.java:793)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.kafka.connect.errors.ConnectException: Cannot create replication connection
at io.debezium.connector.postgresql.connection.PostgresReplicationConnection.<init>(PostgresReplicationConnection.java:87)
at io.debezium.connector.postgresql.connection.PostgresReplicationConnection.<init>(PostgresReplicationConnection.java:38)
at io.debezium.connector.postgresql.connection.PostgresReplicationConnection$ReplicationConnectionBuilder.build(PostgresReplicationConnection.java:362)
at io.debezium.connector.postgresql.PostgresTaskContext.createReplicationConnection(PostgresTaskContext.java:65)
at io.debezium.connector.postgresql.RecordsStreamProducer.<init>(RecordsStreamProducer.java:81)
at io.debezium.connector.postgresql.RecordsSnapshotProducer.<init>(RecordsSnapshotProducer.java:70)
at io.debezium.connector.postgresql.PostgresConnectorTask.createSnapshotProducer(PostgresConnectorTask.java:133)
at io.debezium.connector.postgresql.PostgresConnectorTask.start(PostgresConnectorTask.java:86)
at io.debezium.connector.common.BaseSourceTask.start(BaseSourceTask.java:45)
at io.debezium.embedded.EmbeddedEngine.run(EmbeddedEngine.java:677)
... 3 more
Caused by: io.debezium.jdbc.JdbcConnectionException: ERROR: could not access file "decoderbufs": No such file or directory
at io.debezium.connector.postgresql.connection.PostgresReplicationConnection.initReplicationSlot(PostgresReplicationConnection.java:145)
at io.debezium.connector.postgresql.connection.PostgresReplicationConnection.<init>(PostgresReplicationConnection.java:79)
... 12 more
Caused by: org.postgresql.util.PSQLException: ERROR: could not access file "decoderbufs": No such file or directory
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2440)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2183)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:308)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:441)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:365)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:307)
at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:293)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:270)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:266)
at org.postgresql.replication.fluent.logical.LogicalCreateSlotBuilder.make(LogicalCreateSlotBuilder.java:48)
at io.debezium.connector.postgresql.connection.PostgresReplicationConnection.initReplicationSlot(PostgresReplicationConnection.java:108)
... 13 more
Debezium defaults to the decoderbufs plugin, hence "could not access file "decoderbufs": No such file or directory".
According to this answer, the issue is due to the configuration of the decoderbufs plugin.
Details
Postgres - 11.4
siddhi-cdc-io - 2.0.3
Debezium - 0.8.3
How do I configure the embedded debezium engine to use the pgoutput plugin? Will changing this configuration fix the error?
Please help me with this issue. I have not found any resources that can help me.
You either need to update Debezium to the latest 1.1 version - this will let you use the pgoutput plugin via the plugin.name config option - or you need to deploy (and maybe build) the decoderbufs.so library on your PostgreSQL server.
I'd recommend the former, as 0.8.3 is a very old version.
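As a rough illustration only (assuming Debezium 1.1+, where the PostgreSQL connector accepts this option; how the Siddhi extension surfaces it to the embedded engine may differ), the relevant connector property would be:
# illustrative addition to the connector properties already visible in the log above
plugin.name=pgoutput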
I observed this behavior with PostgreSQL 12 when I tried to do CDC with the pgoutput logical decoding output plug-in. It seems that even though I configured the database with pgoutput, the Siddhi extension is trying to make the connection using "decoderbufs" as the decoding plug-in.
When I configured decoderbufs as the logical decoding output plug-in at the database level, I was able to use the Siddhi io extension without any issue.
It seems that, for now, Siddhi io CDC only supports the decoderbufs logical decoding output plug-in with PostgreSQL.

How to provide Kafka Streams properties via Spring Cloud Stream in YAML?

I'd like to move spring.kafka.streams.* under spring.cloud.stream - is this possible? I thought of a streams-properties section, analogous to consumer-properties or producer-properties, but it doesn't work.
spring:
  cloud:
    config:
      override-system-properties: false
      server:
        health:
          enabled: false
    stream:
      bindings:
        input_technischerplatz:
          destination: technischerplatz
        output_technischerplatz:
          destination: technischerplatz
      default:
        group: '${spring.application.name}'
        consumer:
          max-attempts: 5
      kafka:
        binder:
          auto-add-partitions: false
          auto-create-topics: false
          brokers: '${values.spring.kafka.bootstrap-servers}'
          configuration:
            header.mode: headers
          consumer-properties:
            allow.auto.create.topics: false
            auto.offset.reset: '${values.spring.kafka.consumer.auto-offset-reset}'
            enable.auto.commit: false
            isolation.level: read_committed
            max.poll.interval.ms: 300000
            max.poll.records: 100
            session.timeout.ms: 300000
          header-mapper-bean-name: defaultKafkaHeaderMapper
          producer-properties:
            acks: all
            key.serializer: org.apache.kafka.common.serialization.StringSerializer
            max.in.flight.requests.per.connection: 1
            max.block.ms: '${values.spring.kafka.producer.max-block-ms}'
            retries: 10
          required-acks: -1
  kafka:
    streams:
      applicationId: '${spring.application.name}_streams'
      properties:
        default.key.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
        default.timestamp.extractor: org.apache.kafka.streams.processor.LogAndSkipOnInvalidTimestamp
        state.dir: '${values.spring.kafka.streams.properties.state.dir}'
You can bind the Kafka Streams properties under spring.cloud.stream in the following manner:
spring.cloud.stream.kafka.streams.binder.applicationId: my-application-id
spring.cloud.stream.kafka.streams.binder.configuration:
default.key.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
default.value.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
For more details, refer to the documentation:
https://cloud.spring.io/spring-cloud-static/spring-cloud-stream-binder-kafka/3.0.0.M3/reference/html/spring-cloud-stream-binder-kafka.html#_kafka_streams_binder
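Nested as YAML (the same keys as above; the application id value is just the placeholder used in the answer), this would look like:
spring:
  cloud:
    stream:
      kafka:
        streams:
          binder:
            applicationId: my-application-id
            configuration:
              default.key.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
              default.value.serde: org.apache.kafka.common.serialization.Serdes$StringSerde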

Can I set Kafka Stream consumer group.id?

I'm using the Kafka Streams library for a streaming application, and I wanted to set the Kafka consumer group id.
So I put the Kafka Streams configuration below:
streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "JoinTestApp");
streamsConfiguration.put(StreamsConfig.CLIENT_ID_CONFIG, "JonTestClientId1");
streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10 * 1000);
streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServer);
streamsConfiguration.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.Bytes().getClass().getName());
streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
streamsConfiguration.put(StreamsConfig.consumerPrefix("group.id"), "groupId1");
// streamsConfiguration.put(ConsumerConfig.GROUP_ID_CONFIG, "groupId1");
But in the console log, the consumer configuration does not have my group.id value - it shows just the stream application id:
2019-03-19 17:17:03,206 [main] INFO org.apache.kafka.clients.consumer.ConsumerConfig - ConsumerConfig values:
auto.commit.interval.ms = 5000
auto.offset.reset = earliest
bootstrap.servers = [....]
check.crcs = true
client.dns.lookup = default
client.id = JonTestClientId111-StreamThread-1-consumer
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = JoinTestApp
heartbeat.interval.ms = 3000
interceptor.classes = []
internal.leave.group.on.close = false
Can I set the Kafka Streams consumer group.id?
No, you can't. For this purpose, Kafka Streams uses application.id.
The Kafka Streams application.id is used in various places to isolate the resources used by one application from others.
application.id is used as the Kafka consumer group.id for coordination; that's why you cannot set group.id explicitly.
From the Kafka Streams official documentation, application.id is also used in the following places:
- As the default Kafka consumer and producer client.id prefix
- As the name of the subdirectory in the state directory (cf. state.dir)
- As the prefix of internal Kafka topic names

Spark + Flume : "Unable to create Rpc client"

I am trying to connect a Scala spark-shell with a Flume agent.
I launch the shell:
 ./bin/spark-shell
--jars /Users/romain/Informatique/zoo/spark-1.6.0-bin-hadoop2.4/lib/spark-streaming-flume-sink_2.10-1.6.0.jar
--jars /Users/romain/Informatique/zoo/spark-1.6.0-bin-hadoop2.4/lib/spark-streaming-flume_2.10-1.6.0.jar
--jars /Users/romain/Informatique/zoo/spark-1.6.0-bin-hadoop2.4/lib/spark-streaming-flume-assembly_2.10-1.6.0.jar
And launch a Scala script to listen on port 10000 of localhost:
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.flume._
import org.apache.spark.util.IntParam
 
 
val host = "localhost"
val port = 10000
val batchInterval = Milliseconds(5000)
 
// Create the context and set the batch size
val sparkConf = new SparkConf().setAppName("FlumePollingEventCount")
val ssc = new StreamingContext(sc, batchInterval)
 
// Create a flume stream that polls the Spark Sink running in a Flume agent
val stream = FlumeUtils.createPollingStream(ssc, host, port)
 
// Print out the count of events received from this server in each batch
stream.count().map(cnt => "Received " + cnt + " flume events." ).print()
 
ssc.start()
ssc.awaitTermination()
 
Then I configure and start a Flume agent:
1) Configuration: I tail -f a file that another running script appends to (I think the details don't matter here)
agent1.sources = source1
agent1.channels = channel1
agent1.sinks = spark
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -f /Users/romain/Informatique/notebooks/spark_scala/velib/logs/trajets.csv
agent1.sources.source1.channels = channel1
agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 2000000
agent1.channels.channel1.transactionCapacity = 1000000
agent1.sinks = avroSink
agent1.sinks.avroSink.type = avro
agent1.sinks.avroSink.channel = channel1
agent1.sinks.avroSink.hostname = localhost
agent1.sinks.avroSink.port = 10000
2) Start:
./bin/flume-ng agent --conf conf --conf-file ./conf/avro_velib.conf --name agent1 -Dflume.root.logger=INFO,console
Things go OK until an error occurs:
2017-02-01 15:09:55,688 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:173)] Starting Sink avroSink
2017-02-01 15:09:55,688 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.sink.AbstractRpcSink.start(AbstractRpcSink.java:289)] Starting RpcSink avroSink { host: localhost, port: 10000 }...
2017-02-01 15:09:55,688 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:184)] Starting Source source1
2017-02-01 15:09:55,689 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.ExecSource.start(ExecSource.java:169)] Exec source starting with command:tail -f /Users/romain/Informatique/notebooks/spark_scala/velib/logs/trajets.csv
2017-02-01 15:09:55,689 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: SINK, name: avroSink: Successfully registered new MBean.
2017-02-01 15:09:55,689 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: SINK, name: avroSink started
2017-02-01 15:09:55,689 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.sink.AbstractRpcSink.createConnection(AbstractRpcSink.java:206)] Rpc sink avroSink: Building RpcClient with hostname: localhost, port: 10000
2017-02-01 15:09:55,689 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.sink.AvroSink.initializeRpcClient(AvroSink.java:126)] Attempting to create Avro Rpc client.
2017-02-01 15:09:55,691 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: SOURCE, name: source1: Successfully registered new MBean.
2017-02-01 15:09:55,692 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: SOURCE, name: source1 started
2017-02-01 15:09:55,710 (lifecycleSupervisor-1-1) [WARN - org.apache.flume.api.NettyAvroRpcClient.configure(NettyAvroRpcClient.java:634)] Using default maxIOWorkers
2017-02-01 15:10:15,802 (lifecycleSupervisor-1-1) [WARN - org.apache.flume.sink.AbstractRpcSink.start(AbstractRpcSink.java:294)] Unable to create Rpc client using hostname: localhost, port: 10000
org.apache.flume.FlumeException: NettyAvroRpcClient { host: localhost, port: 10000 }: RPC connection error
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:182)
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:121)
at org.apache.flume.api.NettyAvroRpcClient.configure(NettyAvroRpcClient.java:638)
at org.apache.flume.api.RpcClientFactory.getInstance(RpcClientFactory.java:89)
at org.apache.flume.sink.AvroSink.initializeRpcClient(AvroSink.java:127)
at org.apache.flume.sink.AbstractRpcSink.createConnection(AbstractRpcSink.java:211)
at org.apache.flume.sink.AbstractRpcSink.start(AbstractRpcSink.java:292)
at org.apache.flume.sink.DefaultSinkProcessor.start(DefaultSinkProcessor.java:46)
at org.apache.flume.SinkRunner.start(SinkRunner.java:79)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:251)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Error connecting to localhost/127.0.0.1:10000
at org.apache.avro.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:261)
at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:203)
at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:152)
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:168)
... 16 more
Any idea is welcome.
You should start the "listener" (the Spark app in your case) first and then launch the "writer" (flume-ng).
Instead of localhost, use your machine name (i.e. the machine name used in the Spark conf) in both the Scala script and the Avro sink conf:
agent1.sinks.avroSink.hostname = <Machine name>
agent1.sinks.avroSink.port = 10000
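Correspondingly, a small sketch of the matching change in the Scala script above (the placeholder host name is illustrative):
val host = "<machine name>"  // must match agent1.sinks.avroSink.hostname
val port = 10000
val stream = FlumeUtils.createPollingStream(ssc, host, port)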