We are testing Kafka Connect in distributed mode to pull topic records from Kafka to HDFS. We have two boxes: one running the Kafka and ZooKeeper daemons, where we have kept one Kafka Connect instance, and another hosting the HDFS namenode, where we have kept a second Kafka Connect instance.
We started Kafka, ZooKeeper and Kafka Connect on the first box, and Kafka Connect on the second box as well. As per the Confluent documentation, we have to start the HDFS connector (or any other connector, for that matter) using the REST API. So, after starting Kafka Connect on both boxes, we tried starting the connector through the REST API with the command below:
curl -X POST -H "HTTP/1.1 Host: ip-10-16-34-57.ec2.internal:9092 Content-Type: application/json Accept: application/json" --data '{"name": "hdfs-sink", "config": {"connector.class":"io.confluent.connect.hdfs.HdfsSinkConnector", "format.class":"com.qubole.streamx.SourceFormat", "tasks.max":"1", "hdfs.url":"hdfs://ip-10-16-37-124:9000", "topics":"Prd_IN_TripAnalysis,Prd_IN_Alerts,Prd_IN_GeneralEvents", "partitioner.class":"io.confluent.connect.hdfs.partitioner.DailyPartitioner", "locale":"", "timezone":"Asia/Calcutta" }}' http://ip-10-16-34-57.ec2.internal:8083/connectors
As soon as we press enter, we get the response below:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 415 </title>
</head>
<body>
<h2>HTTP ERROR: 415</h2>
<p>Problem accessing /connectors. Reason:
<pre> Unsupported Media Type</pre></p>
<hr /><i><small>Powered by Jetty://</small></i>
</body>
</html>
The connect-distributed.properties file at etc/kafka/ is shown below and is the same on both Kafka Connect nodes. We have also created the three required topics (connect-offsets, connect-configs, connect-status).
bootstrap.servers=ip-10-16-34-57.ec2.internal:9092
group.id=connect-cluster
key.converter=com.qubole.streamx.ByteArrayConverter
value.converter=com.qubole.streamx.ByteArrayConverter
enable.auto.commit=true
auto.commit.interval.ms=1000
offset.flush.interval.ms=1000
key.converter.schemas.enable=true
value.converter.schemas.enable=true
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
offset.storage.topic=connect-offsets
rest.port=8083
config.storage.topic=connect-configs
status.storage.topic=connect-status
offset.flush.interval.ms=10000
What is the issue here? Are we missing something to make Kafka Connect in distributed mode work with the HDFS connector? Kafka Connect in standalone mode works fine.
To upload a connector configuration, use a PUT rather than a POST: http://docs.confluent.io/3.1.1/connect/restapi.html#put--connectors-(string-name)-config
On a side note, I believe that your curl command might be wrong:
you need one -H switch per header; putting all the headers into a single -H parameter is not how it works (I think).
I do not think that the port is part of the Host header.
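Putting these together, a corrected call might look roughly like this (just a sketch: it reuses the exact connector config from your command as the body of a PUT to /connectors/hdfs-sink/config, and lets curl set the Host header itself):
curl -X PUT \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  --data '{"connector.class":"io.confluent.connect.hdfs.HdfsSinkConnector", "format.class":"com.qubole.streamx.SourceFormat", "tasks.max":"1", "hdfs.url":"hdfs://ip-10-16-37-124:9000", "topics":"Prd_IN_TripAnalysis,Prd_IN_Alerts,Prd_IN_GeneralEvents", "partitioner.class":"io.confluent.connect.hdfs.partitioner.DailyPartitioner", "locale":"", "timezone":"Asia/Calcutta"}' \
  http://ip-10-16-34-57.ec2.internal:8083/connectors/hdfs-sink/config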
I am trying to gradually enable ACLs on an existing cluster (3.1.0 Bitnami Helm chart), which is configured like this:
listeners=INTERNAL://:9093,CLIENT://:9092
listener.security.protocol.map=INTERNAL:PLAINTEXT,CLIENT:PLAINTEXT
advertised.listeners=CLIENT://$(MY_POD_NAME)-k8s.dev.host.com:4430,INTERNAL://$(MY_POD_NAME).message-broker-dev-kafka-headless.message-broker-dev.svc.cluster.local:9093
The kafka-k8s.dev.host.com:4430 is internally forwarded to the CLIENT listener on 9092
For now, we are doing TLS termination on the LB, hence PLAINTEXT on the CLIENT listener, while clients connect with security.protocol=SSL:
kafkacat -b kafka-k8s.dev.host.com:4430 -X security.protocol=SSL -L
The plan is to add 2 new listeners that require SASL auth, migrate the clients to the new listeners, and deprecate the existing ones. The new configuration will look like this:
listeners=INTERNAL://:9093,CLIENT://:9092,SASL_INTERNAL://:9095,SASL_CLIENT://:9094
listener.security.protocol.map=INTERNAL:PLAINTEXT,CLIENT:PLAINTEXT,SASL_INTERNAL:SASL_PLAINTEXT,SASL_CLIENT:SASL_PLAINTEXT
advertised.listeners=CLIENT://$(MY_POD_NAME)-k8s.dev.host.com:4430,INTERNAL://$(MY_POD_NAME).message-broker-dev-kafka-headless.message-broker-dev.svc.cluster.local:9093,SASL_CLIENT://$(MY_POD_NAME)-sasl-k8s.dev.host.com:4430,SASL_INTERNAL://$(MY_POD_NAME).message-broker-dev-kafka-headless.message-broker-dev.svc.cluster.local:9095
allow.everyone.if.no.acl.found=true
authorizer.class.name=kafka.security.authorizer.AclAuthorizer
sasl.enabled.mechanisms=PLAIN,SCRAM-SHA-256,SCRAM-SHA-512
sasl.mechanism.inter.broker.protocol=PLAIN
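For reference, the SCRAM users and ACLs mentioned below were created with the standard Kafka CLI tools, roughly along these lines (a sketch only; the bootstrap address is illustrative, and the user and topic names match the example below):
kafka-configs.sh --bootstrap-server localhost:9092 --alter --add-config 'SCRAM-SHA-512=[password=secret]' --entity-type users --entity-name demo-user
kafka-acls.sh --bootstrap-server localhost:9092 --add --allow-principal User:demo-user --operation Read --topic protected-topic-v1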
After creating the SCRAM-SHA-512 users and applying ACLs to the existing topics, everything works fine on the SASL_INTERNAL listener but not on SASL_CLIENT:
$ kafkacat -b message-broker-dev-kafka-headless.message-broker-dev:9095 -C -t protected-topic-v1 -X security.protocol=SASL_PLAINTEXT -X sasl.mechanisms=SCRAM-SHA-512 -X sasl.username=demo-user -X sasl.password=secret
{"userId":"1225"}
% Reached end of topic protected-topic-v1 [0] at offset 22
$ kafkacat -b kafka-sasl-k8s.dev.host.com:4430 -C -t protected-topic-v1 -X security.protocol=SASL_SSL -X sasl.mechanisms=SCRAM-SHA-512 -X sasl.username=demo-user -X sasl.password=secret
%3|1669825033.516|FAIL|rdkafka#consumer-1| [thrd:sasl_ssl://kafka-sasl-k8s.dev.host.com:4430/bootstrap]: sasl_ssl://kafka-sasl-k8s.dev.host.com:4430/bootstrap: SASL SCRAM-SHA-512 mechanism handshake failed: Broker: Request not valid in current SASL state: broker's supported mechanisms: (after 44ms in state AUTH_HANDSHAKE)
kafka-sasl-k8s.dev.host.com:4430 is internally forwarded to the SASL_CLIENT listener on 9094 (again with TLS termination on the LB, hence SASL_SSL instead of SASL_PLAINTEXT).
For now, I'm not totally sure if I missed a Kafka configuration or messed up a network configuration.
Thanks in advance.
Answering my own question: it was a network issue.
kafka-sasl-k8s.dev.host.com:4430 was sending traffic to 9092 and not to 9094 as expected.
I have created a simple stream in Spring Cloud Data Flow with http as the source and Kafka as the sink.
stream create --definition "http --port=<yyyy> --path-pattern=/test > :streamtest1" --name ingest_to_broker_from_http --deploy --properties app.spring.cloud.stream.bindings.output.producer.headerMode=raw
Even after using app.spring.cloud.stream.bindings.output.producer.headerMode=raw,
I am still receiving Kafka messages with the contentType header embedded as a string.
cURL Command:
curl -X POST -H "Content-Type: application/json" --data '{"name":"test6"}' http://<xxxx>:<yyyy>/test
Kafka Message:
contentType "text/plain"originalContentType "application/json;charset=UTF-8"{"name":"test6"}
Am I passing the headerMode property in the correct way?
What should I do to receive only the message (without headers) in the Kafka topic?
Resolved.
Changed the property to include the app name (http) in the prefix:
app.http.spring.cloud.stream.bindings.output.producer.headerMode=raw
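For completeness, the full stream definition with the corrected property prefix looks like this (identical to the original definition, only the property name changes):
stream create --definition "http --port=<yyyy> --path-pattern=/test > :streamtest1" --name ingest_to_broker_from_http --deploy --properties app.http.spring.cloud.stream.bindings.output.producer.headerMode=raw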
I am new to Confluent/Kafka and I want to find metadata information from Kafka.
I want to know:
list of producers
list of topics
schema information for topic
Confluent version is 5.0
What are the classes (methods) that can give this information?
Are there any REST APIs for the same?
Also, is a ZooKeeper connection necessary to get this information?
1) I don't think that Kafka brokers are aware of the producers that produce messages to topics, and therefore there is no command-line tool for listing them. However, an answer to this SO question suggests that you can list producers by viewing the MBeans over JMX.
2) In order to list the topics you need to run:
kafka-topics --zookeeper localhost:2181 --list
Otherwise, if you want to list the topics using a Java client, you can call the listTopics() method of KafkaConsumer.
You can also fetch the list of topics directly through ZooKeeper (this uses the Scala ZkUtils helper and the I0Itec ZkClient):
ZkClient zkClient = new ZkClient("zkHost:zkPort");
List<String> topics = JavaConversions.asJavaList(ZkUtils.getAllTopics(zkClient));
3) To get the schema information for a topic, you can use the Schema Registry REST API.
In particular, you can fetch all subjects by calling:
GET /subjects HTTP/1.1
Host: schemaregistry.example.com
Accept: application/vnd.schemaregistry.v1+json, application/vnd.schemaregistry+json, application/json
which should give a response similar to the one below:
HTTP/1.1 200 OK
Content-Type: application/vnd.schemaregistry.v1+json
["subject1", "subject2"]
You can then get all the versions of a particular subject:
GET /subjects/subject-name/versions HTTP/1.1
Host: schemaregistry.example.com
Accept: application/vnd.schemaregistry.v1+json, application/vnd.schemaregistry+json, application/json
And finally, you can get a specific version of the schema registered under this subject
GET /subjects/subject-name/versions/1 HTTP/1.1
Host: schemaregistry.example.com
Accept: application/vnd.schemaregistry.v1+json, application/vnd.schemaregistry+json, application/json
Or just the latest registered schema:
GET /subjects/subject-name/versions/latest HTTP/1.1
Host: schemaregistry.example.com
Accept: application/vnd.schemaregistry.v1+json, application/vnd.schemaregistry+json, application/json
In order to perform such actions in Java, you can either prepare your own GET requests (see how to do it here) or use Confluent's Schema Registry Java Client. You can see the implementation and the available methods in their Github repo.
Regarding your question about Zookeeper, note that ZK is a requirement for Kafka.
Kafka uses ZooKeeper so you need to first start a ZooKeeper server if
you don't already have one. You can use the convenience script
packaged with kafka to get a quick-and-dirty single-node ZooKeeper
instance.
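For reference, that convenience script (paths as in the Apache Kafka quickstart, relative to the Kafka installation directory) is run like this:
bin/zookeeper-server-start.sh config/zookeeper.properties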
I set up a single-node Hadoop cluster to perform some experiments with HDFS. Via web access all looks good: I created a dedicated folder and copied a file from the local system to it using the command line, and it all appeared in the web UI. After that I tried to get access to it via WebHDFS.
For example:
curl -i "http://127.0.0.1:50075/webhdfs/v1/?op=LISTSTATUS"
But after that I get:
HTTP/1.1 400 Bad Request
Content-Type: application/json; charset=utf-8
Content-Length: 154
Connection: close
{
"RemoteException":
{
"exception":"IllegalArgumentException",
"javaClassName":"java.lang.IllegalArgumentException",
"message":"Invalid operation LISTSTATUS"
}
}
I receive the same error with any other command.
I have no idea what went wrong here. Could it be caused, for example, by missing some components or something else during setup?
For HDP you can use the following URL (with the default NameNode web UI port; LISTSTATUS has to go to the NameNode, not to the DataNode port 50075 used in your command):
http://x.x.x.x:50070/webhdfs/v1/?op=LISTSTATUS
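For example, rerunning your original request against the NameNode HTTP port on a default single-node setup:
curl -i "http://127.0.0.1:50070/webhdfs/v1/?op=LISTSTATUS"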
For a MapR cluster (with the default port):
http://x.x.x.x:14000/webhdfs/v1/user?op=LISTSTATUS&user.name=YOUR_USER
We are using twitter4j for streaming Twitter data, but we are getting a "connect timed out" error. However, if we use curl with the authorization information and execute it from the same system, it works fine and we get the tweets.
Below is the Java code snippet:
ArrayList<String> track = new ArrayList<String>();
track.add("#usa");
String[] trackArray = track.toArray(new String[track.size()]);
twitterStream.filter(new FilterQuery(0, null, trackArray));
Below is the stack trace; the Google links provided in the error message do not give much information.
Relevant discussions can be found on the Internet at:
http://www.google.co.jp/search?q=944a924a or
http://www.google.co.jp/search?q=24fd66dc
TwitterException{exceptionCode=[944a924a-24fd66dc 944a924a-24fd66b2], statusCode=-1, message=null, code=-1, retryAfter=-1, rateLimitStatus=null, version=3.0.5}
at twitter4j.internal.http.HttpClientImpl.request(HttpClientImpl.java:177)
at twitter4j.internal.http.HttpClientWrapper.request(HttpClientWrapper.java:61)
at twitter4j.internal.http.HttpClientWrapper.post(HttpClientWrapper.java:98)
at twitter4j.TwitterStreamImpl.getFilterStream(TwitterStreamImpl.java:304)
at twitter4j.TwitterStreamImpl$7.getStream(TwitterStreamImpl.java:292)
at twitter4j.TwitterStreamImpl$TwitterStreamConsumer.run(TwitterStreamImpl.java:462)
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163)
at java.net.Socket.connect(Socket.java:546)
at sun.net.NetworkClient.doConnect(NetworkClient.java:169)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:409)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:530)
at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:289)
at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:346)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:755)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:858)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getOutputStream(HttpsURLConnectionImpl.java:250)
at twitter4j.internal.http.HttpClientImpl.request(HttpClientImpl.java:135)
... 5 more
20254 [Twitter Stream consumer-2[Establishing connection]] INFO twitter4j.TwitterStreamImpl - Waiting for 250 milliseconds
Below is the curl command
curl --get 'https://stream.twitter.com/1.1/statuses/filter.json' --data 'track=usa' --header 'Authorization: OAuth oauth_consumer_key="hidden_value", oauth_nonce="hidden_value", oauth_signature="hidden_value", oauth_signature_method="HMAC-SHA1", oauth_timestamp="1388654077", oauth_token="hidden_value", oauth_version="1.0"'
We use a proxy server to connect to the internet, and we are using cntlm with the user credentials and proxy server details configured in it.
The issue is resolved.
Earlier I was setting the proxy host and port using the code below, which didn't work:
System.getProperties().put("http.proxyHost", "proxy");
System.getProperties().put("http.proxyPort", "8080");
But it works when setting the proxy using twitter4j's ConfigurationBuilder:
ConfigurationBuilder cb = new ConfigurationBuilder();
cb.setHttpProxyHost("proxy");
cb.setHttpProxyPort(8080);
TwitterStream twitterStream = new TwitterStreamFactory(cb.build()).getInstance();
You can also set the proxy host and port using twitter4j system properties in the flume-ng command, as below:
$ flume-ng agent -n TwitAgent -c conf -f /usr/hadoop/FLUME/apache-flume-1.3.1-bin/conf/flume.conf -Dtwitter4j.http.proxyHost=www-proxy.example.com -Dtwitter4j.http.proxyPort=80