Custom Spring Cloud application - Kafka embedded header issue

I am trying to build a custom transformer application using the guidelines provided here
https://docs.spring.io/spring-cloud-dataflow/docs/current/reference/htmlsingle/#streams-dev-guide
I have started Kafka on my Windows machine.
I have an HTTP source running on the Windows machine; it writes to the destination transformData.
Command: java -Dserver.port=8123 -Dhttp.path-pattern=/data -Dspring.cloud.stream.bindings.output.destination=transformData -jar http-source-kafka-10-1.3.1.RELEASE.jar
I have a transform application running that reads input from the destination transformData and writes output to the destination transformedData.
Command:
java -Dserver.port=8090 -Dspring.cloud.stream.bindings.input.destination=transformData -Dspring.cloud.stream.bindings.output.destination=transformedData -jar transformer-0.0.1-SNAPSHOT.jar
I have a log sink running that reads from the destination transformedData.
Command:
java -Dserver.port=8888 -Dspring.cloud.stream.bindings.input.destination=transformedData -jar log-sink-kafka-10-1.3.1.RELEASE.jar
Problem:
When I try to send this curl request:
curl -H "Content-Type: application/json" -X POST -d '{"id":"1", "temp":"400"}' http://172.20.24.47:8123/data
On the custom Transformer console I see errors:
Caused by: com.fasterxml.jackson.core.JsonParseException: Unrecognized
token '▒': was expecting ('true', 'false' or 'null') at [Source:
(byte[])"?
contentType
"text/plain"originalContentType "application/json;charset=UTF-8"{"id":"1", "temp":"400"}"; line: 1,
column: 4]
at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1804)
~[jackson-core-2.9.6.jar!/:2.9.6]
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:679)
~[jackson-core-2.9.6.jar!/:2.9.6]
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidToken(UTF8StreamJsonParser.java:3526)
~[jackson-core-2.9.6.jar!/:2.9.6]
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._handleUnexpectedValue(UTF8StreamJsonParser.java:2621)
~[jackson-core-2.9.6.jar!/:2.9.6]
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._nextTokenNotInObject(UTF8StreamJsonParser.java:826)
~[jackson-core-2.9.6.jar!/:2.9.6]
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:723)
~[jackson-core-2.9.6.jar!/:2.9.6]
at com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:4141)
~[jackson-databind-2.9.6.jar!/:2.9.6]
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4000)
~[jackson-databind-2.9.6.jar!/:2.9.6]
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3121)
~[jackson-databind-2.9.6.jar!/:2.9.6]
at org.springframework.cloud.stream.converter.ApplicationJsonMessageMarshallingConverter.convertParameterizedType(ApplicationJsonMessageMarshallingConverter.java:114)
~[spring-cloud-stream-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
... 37 common frames omitted
Can anyone help?

I finally got this to work. When building the custom application with the Spring Initializr, instead of selecting the 2.0.4 release as the starter I reverted to the 1.5.15 release. Now there is no longer any need to pass properties setting headerMode to embeddedHeaders on the subscriber end, that is, in the custom app and the log sink app.

It appears that your custom app is using Spring Cloud Stream 2.0.0.RELEASE, but the out-of-the-box apps are 1.3.x. Can you set spring.cloud.stream.bindings.input.consumer.headerMode to embeddedHeaders in the processor app where it is failing? In 2.0, the default is not to embed headers, since Kafka supports headers out of the box. However, in versions prior to 2.0 (which is what the 1.3.x apps use), the default is to embed the headers. You need to set that explicitly when mixing those versions.
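For example, the transformer from the question could be restarted with that extra consumer property (a sketch based on the command above; adjust to your actual setup):
java -Dserver.port=8090 -Dspring.cloud.stream.bindings.input.consumer.headerMode=embeddedHeaders -Dspring.cloud.stream.bindings.input.destination=transformData -Dspring.cloud.stream.bindings.output.destination=transformedData -jar transformer-0.0.1-SNAPSHOT.jar
If the 1.3.x log sink then fails to parse the processor's output, the matching producer-side property (spring.cloud.stream.bindings.output.producer.headerMode=embeddedHeaders) may be needed on the transformer as well.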

Related

Using the Beam Python SDK and PortableRunner to connect to Kafka with SSL

I have the code below for connecting to Kafka using the Python Beam SDK. I know that the ReadFromKafka transform is run in a Java SDK harness (Docker container), but I have not been able to figure out how to make ssl.truststore.location and ssl.keystore.location accessible inside the SDK harness's Docker environment. The job_endpoint argument is pointing to java -jar beam-runners-flink-1.10-job-server-2.27.0.jar --flink-master localhost:8081.
pipeline_args.extend([
    '--job_name=paul_test',
    '--runner=PortableRunner',
    '--sdk_location=container',
    '--job_endpoint=localhost:8099',
    '--streaming',
    "--environment_type=DOCKER",
    f"--sdk_harness_container_image_overrides=.*java.*,{my_beam_sdk_docker_image}:{my_beam_docker_tag}",
])
with beam.Pipeline(options=PipelineOptions(pipeline_args)) as pipeline:
    kafka = pipeline | ReadFromKafka(
        consumer_config={
            "bootstrap.servers": "bootstrap-server:17032",
            "security.protocol": "SSL",
            "ssl.truststore.location": "/opt/keys/client.truststore.jks",  # how do I make this available to the Java SDK harness
            "ssl.truststore.password": "password",
            "ssl.keystore.type": "PKCS12",
            "ssl.keystore.location": "/opt/keys/client.keystore.p12",  # how do I make this available to the Java SDK harness
            "ssl.keystore.password": "password",
            "group.id": "group",
            "basic.auth.credentials.source": "USER_INFO",
            "schema.registry.basic.auth.user.info": "user:password"
        },
        topics=["topic"],
        max_num_records=2,
        # expansion_service="localhost:56938"
    )
    kafka | beam.Map(lambda x: print(x))
I tried specifying the image override option as --sdk_harness_container_image_overrides='.*java.*,beam_java_sdk:latest', where beam_java_sdk:latest is a Docker image I based on apache/beam_java11_sdk:2.27.0 and that pulls the credentials in its entrypoint.sh. But Beam does not appear to use it; I see
INFO org.apache.beam.runners.fnexecution.environment.DockerEnvironmentFactory - Still waiting for startup of environment apache/beam_java11_sdk:2.27.0 for worker id 1-1
in the logs, which is soon inevitably followed by:
Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: Failed to load SSL keystore /opt/keys/client.keystore.p12 of type PKCS12
In conclusion, my question is this: in Apache Beam, is it possible to make files available inside the Java SDK harness Docker container from the Python Beam SDK? If so, how might it be done?
Many thanks.
Currently, there is no straightforward way to achieve this. There are ongoing discussions and tracking issues about supporting this kind of expansion service customization (see here, here, BEAM-12538 and BEAM-12539). That is the short answer.
The long answer is yes, you can do that. You would have to copy and paste ExpansionService.java into your codebase and build your own custom expansion service, in which you specify the default environment (DOCKER) and the default environment config (your image). You then have to run this expansion service manually and specify its address using the expansion_service parameter of ReadFromKafka.
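As a rough sketch (the port is hypothetical and assumes your custom expansion service is already running there), the commented-out parameter from the question would then be used like this:
kafka = pipeline | ReadFromKafka(
    consumer_config={"bootstrap.servers": "bootstrap-server:17032"},  # plus the SSL settings from the question
    topics=["topic"],
    expansion_service="localhost:8097",  # address of your manually started custom expansion service
)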

Could not instantiate EventHubSourceProvider for Azure Databricks

Using the steps documented in structured streaming pyspark, I'm unable to create a dataframe in pyspark from the Azure Event Hub I have set up in order to read the stream data.
Error message is:
java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.eventhubs.EventHubsSourceProvider could not be instantiated
I have installed the Maven libraries (com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.12 is unavailable) but none appear to work:
com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.15
com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.6
I have also tried ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString), but the error message returned is:
java.lang.NoSuchMethodError: org.apache.spark.internal.Logging.$init$(Lorg/apache/spark/internal/Logging;)V
The connection string is correct as it is also used in a console application that writes to the Azure Event Hub and that works.
Can someone point me in the right direction, please? The code in use is as follows:
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Event Hub Namespace Name
NAMESPACE_NAME = "*myEventHub*"
KEY_NAME = "*MyPolicyName*"
KEY_VALUE = "*MySharedAccessKey*"
# The connection string to your Event Hubs Namespace
connectionString = "Endpoint=sb://{0}.servicebus.windows.net/;SharedAccessKeyName={1};SharedAccessKey={2};EntityPath=ingestion".format(NAMESPACE_NAME, KEY_NAME, KEY_VALUE)
ehConf = {}
ehConf['eventhubs.connectionString'] = connectionString
# For 2.3.15 version and above, the configuration dictionary requires that connection string be encrypted.
# ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
df = spark \
    .readStream \
    .format("eventhubs") \
    .options(**ehConf) \
    .load()
To resolve the issue, I did the following:
Uninstall the existing Azure Event Hubs library versions
Install the com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.15 library from Maven Central
Restart the cluster
Validate by re-running the code provided in the question
I received this same error when installing libraries with the version number com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.* on a Spark cluster running Spark 3.0 with Scala 2.12.
For anyone else finding this via Google: check that you have the correct Scala library version. In my case, my cluster is Spark v3 with Scala 2.12.
Changing the "2.11" in the library version from the tutorial I was using to "2.12", so that it matches my cluster runtime version, fixed the issue.
I had to take this a step further: in the format method I had to specify .format("org.apache.spark.sql.eventhubs.EventHubsSourceProvider") directly.
Check the cluster Scala version and the library version. Uninstall the older libraries and install:
com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.17
in the shared workspace (right-click and install the library) and also on the cluster.

Intermittent problems with querying data from druid using SQL

I query data from Druid via SQL. Sometimes it succeeds, but sometimes it fails. My query uses curl; it is:
curl --negotiate -u:srvadmin -X POST -H'Content-Type: application/json' http://du-s12-idc:8082/druid/v2/sql -d #query.json.
When it fails, I get this response:
{"error":"Unknown exception","errorMessage":"Failure getting results for query[6639c357-441f-456c-9a01-0f7ffd0758b7] url[http://du-s28-idc:8083/druid/v2/] because of [Invalid type marker byte 0x3c for expected value token\n at [Source: (SequenceInputStream); line: -1, column:
1]]","errorClass":"io.druid.java.util.common.RE","host":null}
The file query.json is simple:
{"query":"select * from bds_dsp_media_run_info_h_1016 limit 3"}
The data was loaded from Hadoop into Druid successfully. My Druid version is 0.11, built as a cluster with Kerberos.
Has anyone else had this problem?
I think the Invalid type marker byte 0x3c... exception is just an uninformative response telling you that the server had an internal error, but it doesn't give you a clue as to what is actually happening (0x3c is the ASCII code for '<', so the broker is probably returning an HTML error page rather than JSON). It would help a lot if you could check the broker logs while the request is happening.
But, to play a guessing game, I would expect it to be a Kerberos issue. Do you have the KRB5_CLIENT_KTNAME environment variable populated with the path to your keytab file?
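For example, before issuing the curl request (the keytab path is just a placeholder):
export KRB5_CLIENT_KTNAME=/etc/security/keytabs/client.keytab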

Error running Hadoop application in Eclipse on Windows

I'm trying to set up an Eclipse environment for developing and debugging Hadoop. I'm following Tom White's Hadoop: The Definitive Guide, 3rd edition. What I would like to do is get the MaxTemperature app working locally on my Windows machine within Eclipse before moving it to my Hortonworks sandbox VM. The comment on page 158 about using the local job runner seems to be what I want. I don't want to set up a full Hadoop installation on Windows. I'm hoping that with the right config params I can convince it to run as a Java application inside Eclipse.
Windows: 7
Eclipse: Luna
Hadoop: 2.4.0
JDK: 7
When I set the Run configuration for MaxTemperatureDriver (Source code on page 157) to
inputfile outputdir foo (deliberate bogus 3rd parameter)
I get the usage message so I know I'm running my program with those params.
If I remove the bogus third param I get
Exception in thread "main" java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:120)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1279)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
at mark.MaxTemperatureDriver.run(MaxTemperatureDriver.java:52)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at mark.MaxTemperatureDriver.main(MaxTemperatureDriver.java:56)
I've tried inserting -conf but it seems to be ignored. There is no error message if I specify a nonexistent path.
I've tried inserting -fs file:/// -jt local, but it makes no difference
I've tried inserting -D mapreduce.framework.name=local
I've tried specifying the input and output with the file: format
Note: I'm not asking how to configure Eclipse to connect to a remote Hadoop installation. I want the application to run within Eclipse.
Is this possible? Any ideas?
Additional info:
I turned on debugging. I saw:
582 [main] DEBUG org.apache.hadoop.mapreduce.Cluster - Trying ClientProtocolProvider : org.apache.hadoop.mapred.YarnClientProtocolProvider
583 [main] DEBUG org.apache.hadoop.mapreduce.Cluster - Cannot pick org.apache.hadoop.mapred.YarnClientProtocolProvider as the ClientProtocolProvider - returned null protocol
I'm wondering not why YarnClientProtocolProvider failed, but why it didn't try LocalClientProtocolProvider.
New info:
It seems that this is an issue with Hadoop 2.4.0. I recreated my environment with Hadoop 1.2.1, followed the instructions in
http://gerrymcnicol.com/index.php/2014/01/02/hadoop-and-cassandra-part-4-writing-your-first-mapreduce-job/
added the Windows hack from
http://bigdatanerd.wordpress.com/2013/11/14/mapreduce-running-mapreduce-in-windows-file-system-debug-mapreduce-in-eclipse
and it all started working.
The following blog post will be useful:
Running mapreduce in Windows filesystem

How to set up the messaging subsystem using the CLI in WildFly

Does anyone have an example script for setting up the messaging subsystem in WildFly using the CLI?
The perfect example would be the CLI needed to take a server running standalone.xml and, after running the CLI script, leave it with the messaging subsystem as defined in standalone-full.xml.
The examples I've found so far all start with the assumption the messaging subsystem is already in place.
Here's the script to add messaging. It adds the messaging subsystem and makes it look like the subsystem you get when running standalone-full.xml.
/extension=org.jboss.as.messaging:add()
batch
/subsystem=messaging:add
/subsystem=messaging/hornetq-server=default:add
/subsystem=messaging/hornetq-server=default/:write-attribute(name=journal-file-size, value=102400L)
/subsystem=messaging/hornetq-server=default/address-setting=#:add(address-full-policy="PAGE", \
dead-letter-address="jms.queue.DLQ", expiry-address="jms.queue.ExpiryQueue", expiry-delay=-1L, \
last-value-queue=false, max-delivery-attempts=10, max-size-bytes=10485760L, message-counter-history-day-limit=10, \
page-max-cache-size=5, page-size-bytes=2097152L, redelivery-delay=0L, redistribution-delay=-1L, send-to-dla-on-no-route=false)
/subsystem=messaging/hornetq-server=default/in-vm-connector=in-vm:add(server-id=0)
/subsystem=messaging/hornetq-server=default/in-vm-acceptor=in-vm:add(server-id=0)
/subsystem=messaging/hornetq-server=default/http-connector=http-connector:add(socket-binding="http", param={http-upgrade-endpoint="http-acceptor"})
/subsystem=messaging/hornetq-server=default/http-connector=http-connector-throughput:add(socket-binding="http", param={http-upgrade-endpoint="http-acceptor-throughput", batch-delay=50})
/subsystem=messaging/hornetq-server=default/http-acceptor=http-acceptor:add(http-listener="default")
/subsystem=messaging/hornetq-server=default/http-acceptor=http-acceptor-throughput:add(http-listener="default", param={batch-delay=50, direct-deliver=false})
/subsystem=messaging/hornetq-server=default/connection-factory=InVmConnectionFactory:add(connector={"in-vm"=>undefined}, entries = ["java:/ConnectionFactory"])
/subsystem=messaging/hornetq-server=default/connection-factory=RemoteConnectionFactory:add(connector={"http-connector"=>undefined}, entries = ["java:jboss/exported/jms/RemoteConnectionFactory"])
/subsystem=messaging/hornetq-server=default/pooled-connection-factory=hornetq-ra:add(connector={"in-vm"=>undefined}, entries=["java:/JmsXA","java:jboss/DefaultJMSConnectionFactory"])
/subsystem=messaging/hornetq-server=default/security-setting=#:add()
/subsystem=messaging/hornetq-server=default/security-setting=#/role=guest:add(consume=true, create-durable-queue=false, create-non-durable-queue=true, delete-durable-queue=false, delete-non-durable-queue=true, manage=false, send=true)
jms-queue add --queue-address=ExpiryQueue --durable=true --entries=["java:/jms/queue/ExpiryQueue"]
jms-queue add --queue-address=DLQ --durable=true --entries=["java:/jms/queue/DLQ"]
run-batch
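Save the script to a file and run it with the CLI, for example (the file name is just a placeholder):
$JBOSS_HOME/bin/jboss-cli.sh --connect --file=add-messaging.cli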
Here is an updated CLI script for the newer WildFly 10 (ActiveMQ Artemis):
>> ADD MESSAGING SUBSYSTEM
/extension=org.wildfly.extension.messaging-activemq:add()
/subsystem=messaging-activemq:add
/:reload
/subsystem=messaging-activemq/server=default:add
/subsystem=messaging-activemq/server=default/security-setting=#:add
/subsystem=messaging-activemq/server=default/address-setting=#:add(dead-letter-address="jms.queue.DLQ", expiry-address="jms.queue.ExpiryQueue", expiry-delay="-1L", max-delivery-attempts="10", max-size-bytes="10485760", page-size-bytes="2097152", message-counter-history-day-limit="10")
/subsystem=messaging-activemq/server=default/http-connector=http-connector:add(socket-binding="http", endpoint="http-acceptor")
/subsystem=messaging-activemq/server=default/http-connector=http-connector-throughput:add(socket-binding="http", endpoint="http-acceptor-throughput" ,params={batch-delay="50"})
/subsystem=messaging-activemq/server=default/in-vm-connector=in-vm:add(server-id="0")
/subsystem=messaging-activemq/server=default/http-acceptor=http-acceptor:add(http-listener="default")
/subsystem=messaging-activemq/server=default/http-acceptor=http-acceptor-throughput:add(http-listener="default", params={batch-delay="50", direct-deliver="false"})
/subsystem=messaging-activemq/server=default/in-vm-acceptor=in-vm:add(server-id="0")
/subsystem=messaging-activemq/server=default/jms-queue=ExpiryQueue:add(entries=["java:/jms/queue/ExpiryQueue"])
/subsystem=messaging-activemq/server=default/jms-queue=DLQ:add(entries=["java:/jms/queue/DLQ"])
>> Refresh needed at this point
/subsystem=messaging-activemq/server=default/connection-factory=InVmConnectionFactory:add(connectors=["in-vm"], entries=["java:/ConnectionFactory"])
/subsystem=messaging-activemq/server=default/connection-factory=RemoteConnectionFactory:add(connectors=["http-connector"], entries = ["java:jboss/exported/jms/RemoteConnectionFactory"])
/subsystem=messaging-activemq/server=default/pooled-connection-factory=activemq-ra:add(transaction="xa", connectors=["in-vm"], entries=["java:/JmsXA java:jboss/DefaultJMSConnectionFactory"])
/subsystem=ee/service=default-bindings/:write-attribute(name="jms-connection-factory", value="java:jboss/DefaultJMSConnectionFactory")
/subsystem=ejb3:write-attribute(name="default-resource-adapter-name", value="${ejb.resource-adapter-name:activemq-ra.rar}")
/subsystem=ejb3:write-attribute(name="default-mdb-instance-pool", value="mdb-strict-max-pool")
>> ADD MESSAGE QUEUE
/subsystem=messaging-activemq/server=default/jms-queue=MyQueue:add(entries=[java:/jms/queue/MyQueue])
All of the commands may be run as a batch or separately like this:
$SERVER_CLI_PATH --connect --user=$SERVER_USER --password=$SERVER_PASSW --command="{{line with command}}"
To set up messaging in WildFly 14, I had to do the configuration with separate CLI script files; otherwise jboss-cli would fail with JBTHR00004: Operation was cancelled exceptions, probably due to incomplete reloads. If you still encounter these errors, add sleep commands between the calls in the shell script that runs the CLI scripts (see the wrapper script sketch at the end of this answer).
Add the messaging extension and subsystem, 1-add-messaging-extension-and-subsystem.cli:
batch
# Add messaging extension
/extension=org.wildfly.extension.messaging-activemq:add()
# Add messaging subsystem
/subsystem=messaging-activemq:add
run-batch
/:reload
Add the messaging server allowing only in-VM connectors, 2-add-messaging-server.cli:
batch
# Add messaging server with default configuration, allow only in-VM connectors
/subsystem=messaging-activemq/server=default:add
/subsystem=messaging-activemq/server=default/security-setting=#:add
/subsystem=messaging-activemq/server=default/address-setting=#:add( \
dead-letter-address="jms.queue.DLQ", \
expiry-address="jms.queue.ExpiryQueue", \
max-size-bytes="10485760", \
message-counter-history-day-limit="10", \
page-size-bytes="2097152")
/subsystem=messaging-activemq/server=default/in-vm-connector=in-vm:add( \
server-id="0",params=buffer-pooling=false)
/subsystem=messaging-activemq/server=default/in-vm-acceptor=in-vm:add( \
server-id="0",params=buffer-pooling=false)
/subsystem=messaging-activemq/server=default/jms-queue=ExpiryQueue:add( \
entries=["java:/jms/queue/ExpiryQueue"])
/subsystem=messaging-activemq/server=default/jms-queue=DLQ:add( \
entries=["java:/jms/queue/DLQ"])
/subsystem=messaging-activemq/server=default/connection-factory=InVmConnectionFactory:add( \
connectors=["in-vm"], \
entries=["java:/ConnectionFactory"])
/subsystem=messaging-activemq/server=default/pooled-connection-factory=activemq-ra:add( \
transaction="xa", \
connectors=["in-vm"], \
entries=["java:/JmsXA java:jboss/DefaultJMSConnectionFactory"])
# Configure default connection factory in the EE subsystem
/subsystem=ee/service=default-bindings/:write-attribute(name="jms-connection-factory", value="java:jboss/DefaultJMSConnectionFactory")
# Configure message-driven beans in the EJB subsystem
/subsystem=ejb3:write-attribute(name="default-resource-adapter-name", value="${ejb.resource-adapter-name:activemq-ra.rar}")
/subsystem=ejb3:write-attribute(name="default-mdb-instance-pool", value="mdb-strict-max-pool")
run-batch
/:reload
In case you need HTTP connectors as well, see #petr-hunka's answer.
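For reference, a wrapper shell script along the lines described above might look like this (the sleep duration is only a placeholder):
$JBOSS_HOME/bin/jboss-cli.sh --connect --file=1-add-messaging-extension-and-subsystem.cli
sleep 10
$JBOSS_HOME/bin/jboss-cli.sh --connect --file=2-add-messaging-server.cli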