HDFS-sink connector: No FileSystem for scheme: http - apache-kafka

I'm following Confluent's documentation, but when I add an hdfs-sink connector, I get this error:
Caused by: java.io.IOException: No FileSystem for scheme: http
Could anyone help me, please?

The HDFS Sink connector doesn't work with HTTP URLs (such as HttpFS).
You need to give it a supported Hadoop-compatible FileSystem scheme, such as:
hdfs://
file:// (writes to the local disk of each individual Connect worker in distributed mode, so it works best in standalone mode)
s3a:// (assuming hadoop-aws is on the CLASSPATH)
wasb:// (assuming hadoop-azure is on the CLASSPATH)
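As a sketch, a working configuration would point hdfs.url at one of the supported schemes; the hostname and port below are placeholders for your NameNode:

```properties
# Sketch of an HDFS sink config using a supported hdfs:// scheme
# (namenode-host:8020 is a placeholder for your environment)
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=test_hdfs
hdfs.url=hdfs://namenode-host:8020
flush.size=3
```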

Related

Confluent Control Center Upload connector Error

I am using Windows 10 and a Docker container to run Confluent Control Center.
I am trying to upload one of the pre-built connectors found on the Confluent Hub: https://www.confluent.io/product/connectors/
I am getting the following error: "Invalid connector class. Check the connector configuration file."
I am trying to upload the connector with the following .properties file
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=test_hdfs
hdfs.url=hdfs://localhost:9000
flush.size=3
You cannot upload a properties file that references a connector class which isn't already installed and listed on that page.
The HDFS 2 Sink connector no longer ships with Confluent Platform due to security issues.
If you're using Docker, you need to install the connector into the container (manually build your own Connect Docker image, use a volume mount, or use confluent-hub); then you should see it in the list of connectors on that page.
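One way to install the connector into the container is to bake it into a custom Connect image with confluent-hub; a minimal sketch (the image tag and connector version are assumptions, pick ones matching your platform):

```dockerfile
# Sketch: custom Connect image with the HDFS 2 sink installed via confluent-hub
FROM confluentinc/cp-kafka-connect:6.1.0
RUN confluent-hub install --no-prompt confluentinc/kafka-connect-hdfs:10.0.0
```

After rebuilding and restarting Connect with this image, the connector class should appear on the upload page.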

HBase: MasterNotRunningException: the node /hbase is not in zookeeper

I'm building a pipeline with StreamSets to read data from a Kafka topic and write it to an HBase table. I am able to write to an HDFS file, but when I try to use an HBase destination I get the MasterNotRunningException from the title.
I'm using cloudera to manage the services, and I configured the following properties on the HBase destination:
Zookeeper quorum: (my ZooKeeper server IP)
Zookeeper client port: 2181
Zookeeper parent znode: /hbase
I've the following configuration on the HBase Cloudera service:
zookeeper.znode.parent: /hbase
so there isn't a mismatch between the indicated parameters.
What can be happening?
Thank you in advance.
Check your ZooKeeper server IP address. You should give the IP of the ZooKeeper that is in the same cluster as HBase. If you have multiple clusters managed by Cloudera Manager, you may have multiple ZooKeeper services in different clusters.
It is fine to use a ZooKeeper service for Kafka from one cluster and a different one for HBase from another cluster, as long as StreamSets is configured accordingly.
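A quick way to confirm you are pointing at the right ZooKeeper is to check that the parent znode actually exists there; a sketch using the standard ZooKeeper CLI (the hostname is a placeholder for your HBase cluster's ZooKeeper server):

```shell
# Sketch: verify the /hbase znode exists on the ZooKeeper you configured
echo "ls /hbase" | zkCli.sh -server my-zk-host:2181
```

If the node is missing, you are most likely talking to a ZooKeeper from a different cluster.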

Error running multiple kafka standalone hdfs connectors

We are trying to launch multiple standalone kafka hdfs connectors on a given node.
For each connector, we are setting the rest.port and offset.storage.file.filename to different ports and path respectively.
The Kafka broker's JMX port is 9999.
When I start the Kafka standalone connector, I get this error:
Error: Exception thrown by the agent : java.rmi.server.ExportException: Port already in use: 9999; nested exception is:
java.net.BindException: Address already in use (Bind failed)
even though rest.port is set to 9100.
kafka version: 2.12-0.10.2.1
kafka-connect-hdfs version: 3.2.1
Please help.
We are trying to launch multiple standalone kafka hdfs connectors on a given node.
Have you considered running these multiple connectors within a single instance of Kafka Connect? This might make things easier.
Kafka Connect itself can handle running multiple connectors within a single worker process. Kafka Connect in distributed mode can run on a single node, or across multiple ones.
For those who are trying to use the rest.port flag and still getting the Address already in use error: that flag was deprecated in KIP-208 and later removed entirely.
Since then, the listeners property is used to change the default REST port.
Examples from Javadoc
listeners=HTTP://myhost:8083
listeners=HTTP://:8083
Configuring and Running Workers - Standalone mode
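Putting it together, two standalone workers on one node need distinct listener ports and offset files; a sketch (ports and paths are placeholders):

```properties
# Sketch: worker-1.properties
listeners=HTTP://:8083
offset.storage.file.filename=/tmp/connect-1.offsets

# Sketch: worker-2.properties
listeners=HTTP://:8084
offset.storage.file.filename=/tmp/connect-2.offsets
```

Start each worker with its own properties file, and neither should collide on the REST port.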
You may have open Kafka Connect connections that you don't know about. You can check this with:
ps -ef | grep connect
If you find any, kill those processes.

How to use yarn to run a self-contained Spark app remotely

I am creating a self-contained Scala program that uses Spark for parallelization in some parts. In my specific situation, the Spark cluster is available through YARN.
I need my Spark job to load an HDFS file located on a Hadoop cluster that is not directly accessible from my local machine. So I create a SOCKS proxy through an SSH tunnel by including these properties in hdfs-site.xml.
<property>
<name>hadoop.socks.server</name>
<value>localhost:7070</value>
</property>
<property>
<name>hadoop.rpc.socket.factory.class.default</name>
<value>org.apache.hadoop.net.SocksSocketFactory</value>
</property>
<property>
<name>dfs.client.use.legacy.blockreader</name>
<value>true</value>
</property>
where 7070 is the dynamic SOCKS port forwarded to the Hadoop gateway machine:
ssh -fCND 7070 <hadoop-gateway-machine>
This allows me to access hdfs files locally, when I am using Spark in local[*] master configuration for testing.
However, when I run a real Spark job on YARN on the same Hadoop cluster (configured by yarn-site.xml, hdfs-site.xml, and core-site.xml on the classpath), I see errors like:
java.lang.IllegalStateException: Library directory '<project-path>/assembly/target/scala-2.11/jars' does not exist; make sure Spark is built.
So I set the spark.yarn.jars property directly on SparkConf. This at least starts a YARN application. When I go to the application URL, I just keep seeing this message in one of the worker logs:
Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher
And this message in another Hadoop worker log (apparently from the Spark executor that could not connect to the driver):
org.apache.spark.SparkException: Failed to connect to driver!
at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:484)
at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:345)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:187)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:653)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:651)
at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:674)
at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
My question is: what is the right way to run self-contained Spark apps on a YARN cluster? How do you do it so you don't have to specify spark.yarn.jars and other properties manually? Should you include spark-defaults.conf on the classpath as well?
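The usual approach is to hand the app to YARN via spark-submit, letting HADOOP_CONF_DIR supply the cluster config and spark.yarn.jars point at Spark jars staged on HDFS; a sketch (all paths, versions, and the class name are placeholders):

```shell
# Sketch: submitting a self-contained app to YARN via spark-submit
export HADOOP_CONF_DIR=/path/to/hadoop-conf   # yarn-site.xml, hdfs-site.xml, core-site.xml
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.jars="hdfs:///spark-jars/*.jar" \
  --class com.example.MyApp \
  my-app-assembly.jar
```

Staging the Spark jars on HDFS once avoids both the "make sure Spark is built" error and shipping the jars on every submission.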

HBase for file I/O, and a way to connect to HDFS from a remote client

Please be aware that I'm not fluent in English.
I'm new to NoSQL and am now trying to use HBase for file storage - I'll store files in HBase as binary.
I don't need any statistics. All I need is file storage.
Is that recommended?
I am worried about I/O speed.
Actually, because I couldn't find any way to connect to HDFS without Hadoop, I want to try HBase for file storage. I can't set up Hadoop on the client computer. I was trying to find some library - like JDBC for an RDBMS - that would help the client connect to HDFS to get files, but I couldn't find anything, so I chose HBase instead of a connection library.
Can I get any help from someone?
It really depends on your file sizes. In HBase it is generally not recommended to store files or LOBs; the default maximum KeyValue size is 10 MB. I have raised that limit and run tests with >100 MB values, but you do risk OOMEs on your RegionServers since they have to hold the entire value in memory - configure your JVM memory with care.
When this type of question is asked on the hbase-users mailing list, the usual response is to recommend using HDFS if your files can be large.
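The size-based rule of thumb above can be sketched as a tiny helper; the threshold mirrors HBase's default hbase.client.keyvalue.maxsize, and the function name is made up for illustration:

```python
# Hypothetical helper illustrating the rule of thumb above:
# values up to HBase's default max KeyValue size go to HBase,
# anything larger belongs in HDFS.

# Default value of hbase.client.keyvalue.maxsize (10 MB)
HBASE_MAX_VALUE_BYTES = 10 * 1024 * 1024

def choose_store(size_bytes: int) -> str:
    """Return which store a blob of the given size should go to."""
    return "hbase" if size_bytes <= HBASE_MAX_VALUE_BYTES else "hdfs"
```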
You should be able to use Thrift to connect to HDFS, bypassing the need to install the Hadoop client on your client computer.