Using the Beam Python SDK and PortableRunner to connect to Kafka with SSL - apache-kafka

I have the code below for connecting to Kafka using the Python Beam SDK. I know that the ReadFromKafka transform runs in a Java SDK harness (Docker container), but I have not been able to figure out how to make ssl.truststore.location and ssl.keystore.location accessible inside the SDK harness' Docker environment. The job_endpoint argument points to java -jar beam-runners-flink-1.10-job-server-2.27.0.jar --flink-master localhost:8081
pipeline_args.extend([
    '--job_name=paul_test',
    '--runner=PortableRunner',
    '--sdk_location=container',
    '--job_endpoint=localhost:8099',
    '--streaming',
    "--environment_type=DOCKER",
    f"--sdk_harness_container_image_overrides=.*java.*,{my_beam_sdk_docker_image}:{my_beam_docker_tag}",
])
with beam.Pipeline(options=PipelineOptions(pipeline_args)) as pipeline:
    kafka = pipeline | ReadFromKafka(
        consumer_config={
            "bootstrap.servers": "bootstrap-server:17032",
            "security.protocol": "SSL",
            "ssl.truststore.location": "/opt/keys/client.truststore.jks",  # how do I make this available to the Java SDK harness
            "ssl.truststore.password": "password",
            "ssl.keystore.type": "PKCS12",
            "ssl.keystore.location": "/opt/keys/client.keystore.p12",  # how do I make this available to the Java SDK harness
            "ssl.keystore.password": "password",
            "group.id": "group",
            "basic.auth.credentials.source": "USER_INFO",
            "schema.registry.basic.auth.user.info": "user:password"
        },
        topics=["topic"],
        max_num_records=2,
        # expansion_service="localhost:56938"
    )
    kafka | beam.Map(lambda x: print(x))
I tried specifying the image override option as --sdk_harness_container_image_overrides='.*java.*,beam_java_sdk:latest' - where beam_java_sdk:latest is a Docker image I based on apache/beam_java11_sdk:2.27.0 and that pulls the credentials in its entrypoint.sh. But Beam does not appear to use it; I see
INFO org.apache.beam.runners.fnexecution.environment.DockerEnvironmentFactory - Still waiting for startup of environment apache/beam_java11_sdk:2.27.0 for worker id 1-1
in the logs, which is soon inevitably followed by
Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: Failed to load SSL keystore /opt/keys/client.keystore.p12 of type PKCS12
In conclusion, my question is this: in Apache Beam, is it possible to make files available inside the Java SDK harness Docker container from the Python Beam SDK? If so, how might it be done?
Many thanks.

Currently, there is no straightforward way to achieve this. There are ongoing discussions and tracking issues to provide support for this kind of expansion service customization (see here, here, BEAM-12538 and BEAM-12539). That is the short answer.
The long answer is yes, you can do that. You would have to copy & paste ExpansionService.java into your codebase and build your custom expansion service, where you specify the default environment (DOCKER) and the default environment config (your image) here. You then have to run this expansion service manually and specify its address using the expansion_service parameter of ReadFromKafka.
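By way of illustration, here is a minimal Python-side sketch of that last step. It assumes you have already built and started such a customized expansion service; the port 8097 and the jar name are placeholders, not real artifacts, and the consumer config is trimmed.
from apache_beam.io.kafka import ReadFromKafka

# Assumed: a custom expansion service, built from a copy of ExpansionService.java
# with your image set as the default Docker environment, is already running, e.g.:
#   java -jar my-custom-expansion-service.jar 8097
kafka = pipeline | ReadFromKafka(
    consumer_config={
        "bootstrap.servers": "bootstrap-server:17032",
        "security.protocol": "SSL",
        # these paths must exist inside *your* image, which the custom
        # expansion service advertises as the default environment
        "ssl.truststore.location": "/opt/keys/client.truststore.jks",
        "ssl.keystore.location": "/opt/keys/client.keystore.p12",
    },
    topics=["topic"],
    expansion_service="localhost:8097",  # address of the manually started expansion service
)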

Related

Beam SDK harness still trying to launch docker when I set environment_type to be `PROCESS`

According to the Beam harness documentation:
PROCESS: User code is executed by processes that are automatically started by the runner on each worker node.
args = [
    "--runner=portableRunner",
    "--streaming",
    "--sdk_worker_parallelism=2",
    "--environment_type=PROCESS",
    "--environment_config={\"command\": \"/opt/apache/beam/boot\"}",
]
consumer_config = {
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "AWS_MSK_IAM",
    "sasl.jaas.config": "software.amazon.msk.auth.iam.IAMLoginModule required;",
    "sasl.client.callback.handler.class": "software.amazon.msk.auth.iam.IAMClientCallbackHandler",
    "bootstrap.servers": bootstrap_servers,
}
with beam.Pipeline(options=PipelineOptions(args)) as p:
    data = p | "Reading messages from Kafka" >> ReadFromKafka(
        consumer_config=consumer_config,
        topics=topics,
        with_metadata=True
    )
    data | 'Writing to stdout' >> beam.Map(logging.info)
But when I run the code (deployed to k8s using flinkk8soperator), it is complaining:
Caused by: java.io.IOException: Cannot run program "docker": error=2, No such file or directory
Wondering if I misunderstand anything? Thanks!
After some digging, I finally made the cross-language pipeline work without using DinD or DooD. Here are the steps:
Ensure both the job manager and the task manager mount a shared volume for artifact staging. (This is required; otherwise the task manager will complain that it is unable to find the submitted jar.)
Ensure your Docker image can run both Java and Python Beam code; here's what I did:
# python SDK
COPY --from=apache/beam_python3.7_sdk:2.41.0 /opt/apache/beam/ /opt/apache/beam/
# java SDK
COPY --from=apache/beam_java8_sdk:2.41.0 /opt/apache/beam/ /opt/apache/beam_java/
In the job, you'll need to start the expansion service with extra args, for example for KafkaIO:
from apache_beam.io.kafka import ReadFromKafka, default_io_expansion_service

ReadFromKafka(
    consumer_config=consumer_config,
    topics=[topic],
    with_metadata=False,
    expansion_service=default_io_expansion_service(
        append_args=[
            '--defaultEnvironmentType=PROCESS',
            "--defaultEnvironmentConfig={\"command\":\"/opt/apache/beam_java/boot\"}",
        ]
    )
)
Your portable execution relies on xLang support, which requires starting a Java SDK harness with Docker. Your cluster doesn't have Docker installed.

Custom Spring cloud application - Kafka embeddedheader issue

I am trying to build a custom transformer application using the guidelines provided here
https://docs.spring.io/spring-cloud-dataflow/docs/current/reference/htmlsingle/#streams-dev-guide
I have started Kafka on my Windows machine.
I have an http source running on the Windows machine; it writes to the destination transformData.
Command: java -Dserver.port=8123 -Dhttp.path-pattern=/data -Dspring.cloud.stream.bindings.output.destination=transformData -jar http-source-kafka-10-1.3.1.RELEASE.jar
I have a transform application running that reads input from transformData and outputs to the destination transformedData
Command
java -Dserver.port=8090 -Dspring.cloud.stream.bindings.input.destination=transformData -Dspring.cloud.stream.bindings.output.destination=transformedData -jar transformer-0.0.1-SNAPSHOT.jar
I have a log sink running that reads from the destination transformedData
Command
java -Dserver.port=8888 -Dspring.cloud.stream.bindings.input.destination=transformedData -jar log-sink-kafka-10-1.3.1.RELEASE.jar
Problem:
When I try to send this curl request:
curl -H "Content-Type: application/json" -X POST -d '{"id":"1", "temp":"400"}' http://172.20.24.47:8123/data
On the custom Transformer console I see errors:
Caused by: com.fasterxml.jackson.core.JsonParseException: Unrecognized
token '▒': was expecting ('true', 'false' or 'null') at [Source:
(byte[])"?
contentType
"text/plain"originalContentType "application/json;charset=UTF-8"{"id":"1", "temp":"400"}"; line: 1,
column: 4]
at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1804)
~[jackson-core-2.9.6.jar!/:2.9.6]
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:679)
~[jackson-core-2.9.6.jar!/:2.9.6]
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidToken(UTF8StreamJsonParser.java:3526)
~[jackson-core-2.9.6.jar!/:2.9.6]
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._handleUnexpectedValue(UTF8StreamJsonParser.java:2621)
~[jackson-core-2.9.6.jar!/:2.9.6]
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._nextTokenNotInObject(UTF8StreamJsonParser.java:826)
~[jackson-core-2.9.6.jar!/:2.9.6]
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:723)
~[jackson-core-2.9.6.jar!/:2.9.6]
at com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:4141)
~[jackson-databind-2.9.6.jar!/:2.9.6]
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4000)
~[jackson-databind-2.9.6.jar!/:2.9.6]
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3121)
~[jackson-databind-2.9.6.jar!/:2.9.6]
at org.springframework.cloud.stream.converter.ApplicationJsonMessageMarshallingConverter.convertParameterizedType(ApplicationJsonMessageMarshallingConverter.java:114)
~[spring-cloud-stream-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
... 37 common frames omitted
Can anyone help?
I finally got this to work. When building the custom application with the Spring Initializr, instead of selecting the 2.0.4 release as the starter I reverted to the 1.5.15 release. Now I no longer need to pass properties on the subscriber end (that is, the custom app and the logger sink app) with headerMode set to embeddedHeaders.
It appears that you are using Spring Cloud Stream 2.0.0.RELEASE, but your app is 1.3.x. Can you set spring.cloud.stream.bindings.input.consumer.headerMode to embeddedHeaders in the processor app where it is failing? In 2.0, the default is not to embed headers as Kafka supports it out of the box. However, in versions prior to 2.0 (which is used by 1.3.x apps), the default is to embed the headers. You need to explicitly set that when using that combination.
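For example, reusing the transformer command from the question, that property could be passed as a -D flag (a sketch; only the headerMode flag is new here):
java -Dserver.port=8090 -Dspring.cloud.stream.bindings.input.consumer.headerMode=embeddedHeaders -Dspring.cloud.stream.bindings.input.destination=transformData -Dspring.cloud.stream.bindings.output.destination=transformedData -jar transformer-0.0.1-SNAPSHOT.jar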

Starting KsqlRestApplication form scala and getting NoSuchMethodError org.apache.kafka.streams.StreamsConfig.getConsumerConfigs

I am trying to write a program that enables me to run predefined KSQL operations on Kafka topics in Scala, but I don't want to open the KSQL CLI every time. Therefore I want to start the KSQL "server" from within my Scala program. If I understand the KSQL source code right, I have to build and start a KsqlRestApplication:
def restServer = KsqlRestApplication.buildApplication(
  new KsqlRestConfig(defaultServerProperties),
  true,
  new VersionCheckerAgent {
    override def start(ksqlModuleType: KsqlModuleType, properties: Properties): Unit = ???
  }
)
But when I try doing that, I get the following error:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.kafka.streams.StreamsConfig.getConsumerConfigs(Ljava/lang/String;Ljava/lang/String;)Ljava/util/Map;
at io.confluent.ksql.rest.server.BrokerCompatibilityCheck.create(BrokerCompatibilityCheck.java:62)
at io.confluent.ksql.rest.server.KsqlRestApplication.buildApplication(KsqlRestApplication.java:241)
I looked into the function call in BrokerCompatibilityCheck and in the create function it calls the StreamsConfig.getConsumerConfigs() with 2 Strings as parameters instead of the parameters defined in
https://kafka.apache.org/0102/javadoc/org/apache/kafka/streams/StreamsConfig.html#getConsumerConfigs(StreamThread,%20java.lang.String,%20java.lang.String).
Are my KSQL and Kafka version simply not compatible or am I doing something wrong?
I am using KSQL version 4.1.0-SNAPSHOT and Kafka version 1.0.0.
Yes, NoSuchMethodError typically indicates a version incompatibility between libraries.
The link you posted is to the javadoc for Kafka 0.10.2. The method hasn't changed in 1.0, but in the upcoming 1.1 it indeed only takes 2 Strings:
https://kafka.apache.org/11/javadoc/org/apache/kafka/streams/StreamsConfig.html#getConsumerConfigs(java.lang.String,%20java.lang.String)
That suggests the version of KSQL you're using (4.1.0-SNAPSHOT) depends on version 1.1 of Kafka Streams, which is currently in the release-candidate phase and which I believe should be out soon:
https://lists.apache.org/thread.html/780c4458b16590e99261b69d7b41b9ec374a3226d72c8d38885a008a#%3Cusers.kafka.apache.org%3E
As per that email, you can find the latest (1.1.0-rc2) artifacts in the Apache staging repo:
https://repository.apache.org/content/groups/staging/
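If you want to try the release candidate before 1.1.0 ships, here is a hedged sbt sketch; the exact version string and the decision to override are assumptions based on the links above, not a confirmed fix:
// build.sbt -- sketch only
resolvers += "Apache Staging" at "https://repository.apache.org/content/groups/staging/"
// force the Kafka Streams version that KSQL 4.1.0-SNAPSHOT appears to compile against
dependencyOverrides += "org.apache.kafka" % "kafka-streams" % "1.1.0"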

CXF DOSGI ZOOKEEPER

Good morning,
I'm looking for help please. I'm only a beginner.
I am using cxf-dosgi (from the DOSGi Apache Karaf Feature distribution).
I want to make transparent use of services between two remote machines, so I have a Karaf container on each of these two machines.
I tested this example: first with two Karaf containers hosted on the same machine, then I changed the configuration to test with two containers hosted on two different remote machines. And it works great!
So I want to do the same thing to export my web services. I am using Spring DM. So I do this on the server side:
<osgi:service id="osgi-service" ref="myservice" interface="org.apache.camel.Endpoint">
  <osgi:service-properties>
    <entry key="name" value="service"/>
    <entry key="service.exported.interfaces" value="*"/>
  </osgi:service-properties>
</osgi:service>
I did the installations as in the tutorial, with CXF DOSGi version 1.6, but I get this error:
16:25:53,256 | ERROR | pool-21-thread-3 | w.service.RemoteServiceAdminCore 193 | 184 - cxf-dosgi-ri-dsw-cxf - 1.6.0 | failed to create server for interface org.apache.camel.Endpoint
java.lang.NullPointerException
at java.lang.reflect.Array.newArray(Native Method)[:1.7.0_79]
at java.lang.reflect.Array.newInstance(Array.java:70)[:1.7.0_79]
at org.apache.cxf.aegis.type.TypeUtil.getTypeRelatedClass(TypeUtil.java:259)
at org.apache.cxf.aegis.type.AbstractTypeCreator.createTypeForClass(AbstractTypeCreator.java:108)
at org.apache.cxf.aegis.type.AbstractTypeCreator.createType(AbstractTypeCreator.java:402)
at org.apache.cxf.aegis.type.basic.BeanTypeInfo.getType(BeanTypeInfo.java:192)
at org.apache.cxf.aegis.type.basic.BeanType.getDependencies(BeanType.java:547)
at org.apache.cxf.aegis.databinding.AegisDatabinding.addDependencies(AegisDatabinding.java:394)
at org.apache.cxf.aegis.databinding.AegisDatabinding.addDependencies(AegisDatabinding.java:399)
at org.apache.cxf.aegis.databinding.AegisDatabinding.addDependencies(AegisDatabinding.java:399)
at org.apache.cxf.aegis.databinding.AegisDatabinding.initializeMessage(AegisDatabinding.java:371)
at org.apache.cxf.aegis.databinding.AegisDatabinding.initializeOperation(AegisDatabinding.java:283)
at org.apache.cxf.aegis.databinding.AegisDatabinding.initialize(AegisDatabinding.java:242)
at org.apache.cxf.service.factory.AbstractServiceFactoryBean.initializeDataBindings(AbstractServiceFactoryBean.java:86)
at org.apache.cxf.service.factory.ReflectionServiceFactoryBean.buildServiceFromClass(ReflectionServiceFactoryBean.java:490)
at org.apache.cxf.service.factory.ReflectionServiceFactoryBean.initializeServiceModel(ReflectionServiceFactoryBean.java:550)
at org.apache.cxf.service.factory.ReflectionServiceFactoryBean.create(ReflectionServiceFactoryBean.java:265)
at org.apache.cxf.frontend.AbstractWSDLBasedEndpointFactory.createEndpoint(AbstractWSDLBasedEndpointFactory.java:102)
at org.apache.cxf.frontend.ServerFactoryBean.create(ServerFactoryBean.java:159)
at org.apache.cxf.dosgi.dsw.handlers.AbstractPojoConfigurationTypeHandler.createServerFromFactory(AbstractPojoConfigurationTypeHandler.java:191)
at org.apache.cxf.dosgi.dsw.handlers.PojoConfigurationTypeHandler.createServer(PojoConfigurationTypeHandler.java:119)
at org.apache.cxf.dosgi.dsw.service.RemoteServiceAdminCore.exportInterfaces(RemoteServiceAdminCore.java:184)
at org.apache.cxf.dosgi.dsw.service.RemoteServiceAdminCore.exportService(RemoteServiceAdminCore.java:140)
at org.apache.cxf.dosgi.dsw.service.RemoteServiceAdminInstance$1.run(RemoteServiceAdminInstance.java:59)
at org.apache.cxf.dosgi.dsw.service.RemoteServiceAdminInstance$1.run(RemoteServiceAdminInstance.java:57)
at java.security.AccessController.doPrivileged(Native Method)[:1.7.0_79]
at org.apache.cxf.dosgi.dsw.service.RemoteServiceAdminInstance.exportService(RemoteServiceAdminInstance.java:57)[184:cxf-dosgi-ri-dsw-cxf:1.6.0]
at org.apache.cxf.dosgi.dsw.service.RemoteServiceAdminInstance.exportService(RemoteServiceAdminInstance.java:41)[184:cxf-dosgi-ri-dsw-cxf:1.6.0]
at org.apache.cxf.dosgi.topologymanager.exporter.TopologyManagerExport.exportServiceUsingRemoteServiceAdmin(TopologyManagerExport.java:185)[183:cxf-dosgi-ri-topology-manager:1.6.0]
at org.apache.cxf.dosgi.topologymanager.exporter.TopologyManagerExport.doExportService(TopologyManagerExport.java:168)[183:cxf-dosgi-ri-topology-manager:1.6.0]
at org.apache.cxf.dosgi.topologymanager.exporter.TopologyManagerExport$3.run(TopologyManagerExport.java:143)[183:cxf-dosgi-ri-topology-manager:1.6.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)[:1.7.0_79]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)[:1.7.0_79]
at java.lang.Thread.run(Thread.java:745)[:1.7.0_79]
Do you have an idea of what is wrong?
Ouch... what are you doing with 1.4-SNAPSHOT? First, it is not a release, and second, it is quite old.
Another thing that looks suspicious is service.exported.interfaces=myInterface. It should be a fully qualified Java interface name. To start, try service.exported.interfaces=* for it.
You should start with my CXF DOSGi tutorial. The code there should work out of the box. You can then add your changes to the config. That is easier than starting completely from scratch.
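For reference, a small sketch of the two variants of that service property (the interface name used here is just the one from the question):
<!-- export only the named, fully qualified interface -->
<entry key="service.exported.interfaces" value="org.apache.camel.Endpoint"/>
<!-- or, to start with, export all interfaces of the service -->
<entry key="service.exported.interfaces" value="*"/>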

Error running hadoop application in Eclipse on Windows

I'm trying to set up an Eclipse environment for developing and debugging Hadoop. I'm following Tom White's Hadoop: The Definitive Guide, 3rd ed. What I would like to do is get the MaxTemperature app working locally on my Windows machine within Eclipse before moving it to my Hortonworks sandbox VM. The comment on page 158 about using the local job runner seems to be what I want. I don't want to set up a full Hadoop installation on Windows. I'm hoping that with the right config params I can convince it to run as a Java application inside Eclipse.
Windows: 7
Eclipse: Luna
Hadoop: 2.4.0
JDK: 7
When I set the Run configuration for MaxTemperatureDriver (Source code on page 157) to
inputfile outputdir foo (deliberate bogus 3rd parameter)
I get the usage message so I know I'm running my program with those params.
If I remove the bogus third param I get
Exception in thread "main" java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:120)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1279)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
at mark.MaxTemperatureDriver.run(MaxTemperatureDriver.java:52)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at mark.MaxTemperatureDriver.main(MaxTemperatureDriver.java:56)
I've tried inserting -conf but it seems to be ignored. There is no error message if I specify a nonexistent path.
I've tried inserting -fs file:/// -jt local, but it makes no difference
I've tried inserting -D mapreduce.framework.name=local
I've tried specifying the input and output with the file: format
Note: I'm not asking how to configure Eclipse to connect to a remote Hadoop installation. I want the application to run within Eclipse.
Is this possible? Any ideas?
Additional info:
I turned on debugging. I saw:
582 [main] DEBUG org.apache.hadoop.mapreduce.Cluster - Trying ClientProtocolProvider : org.apache.hadoop.mapred.YarnClientProtocolProvider
583 [main] DEBUG org.apache.hadoop.mapreduce.Cluster - Cannot pick org.apache.hadoop.mapred.YarnClientProtocolProvider as the ClientProtocolProvider - returned null protocol
I'm wondering not why YarnClientProtocolProvider failed, but why it didn't try LocalClientProtocolProvider.
New info:
It seems that this is an issue with Hadoop 2.4.0. I recreated my environment with Hadoop 1.2.1, followed the instructions in
http://gerrymcnicol.com/index.php/2014/01/02/hadoop-and-cassandra-part-4-writing-your-first-mapreduce-job/
added the Windows hack from
http://bigdatanerd.wordpress.com/2013/11/14/mapreduce-running-mapreduce-in-windows-file-system-debug-mapreduce-in-eclipse
and it all started working.
The following blog will be useful:
Running mapreduce in Windows filesystem
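As a related, hedged sketch (this is not from the blog, and it did not by itself fix the asker's 2.4.0 problem): with a Tool-based driver such as MaxTemperatureDriver, the local job runner can also be requested programmatically, which is equivalent to the -fs file:/// -D mapreduce.framework.name=local flags tried above:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

public class LocalRunnerLauncher {
    public static void main(String[] args) throws Exception {
        // Force the local job runner and the local filesystem before the job is submitted.
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local");
        conf.set("fs.defaultFS", "file:///");
        // MaxTemperatureDriver is the Tool from the book (page 157).
        System.exit(ToolRunner.run(conf, new MaxTemperatureDriver(), args));
    }
}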