Store data required by a custom NiFi processor - apache-zookeeper

HDP-2.5.3.0, NiFi 1.1.1
I am writing a custom processor in NiFi. There are several String and Timestamp fields that I need to store somewhere so that they are available on all/any nodes.
#Tags({ "example" })
#CapabilityDescription("Provide a description")
#SeeAlso({})
#ReadsAttributes({ #ReadsAttribute(attribute = "", description = "") })
#WritesAttributes({ #WritesAttribute(attribute = "", description = "") })
public class MyProcessor extends AbstractProcessor {
.
.
.
private List<PropertyDescriptor> descriptors;
private Set<Relationship> relationships;
/* Persist these, probably, in ZK */
private Timestamp lastRunAt;
private String startPoint;
.
.
.
#Override
public void onTrigger(final ProcessContext context,final ProcessSession session) throws ProcessException {FlowFile flowFile = session.get();
/*Retrieve lastRunAt & startPoint and use*/
lastRunAt ;
startPoint ;
.
.
.
}
}
Note that HDFS is NOT an option, as NiFi may run without any Hadoop installation in the picture.
What are the options to do this? I was wondering if ZooKeeper can be used to store this data, since it is small in size and NiFi is backed by ZooKeeper. I tried to find ways to use the ZooKeeper API to persist these fields, in vain.

NiFi exposes a concept called a "state manager" for processors to store information like this. When running standalone NiFi there is a local state manager, and when running clustered there is a ZooKeeper state manager.
Take a look at the developer guide here:
https://nifi.apache.org/docs/nifi-docs/html/developer-guide.html#state_manager
Also, many of the source processors in NiFi make use of this so you can look for examples in the code:
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-hadoop-bundle/nifi-hdfs-processors/src/main/java/org/apache/nifi/processors/hadoop/ListHDFS.java#L249
Admin Guide for configuration of state providers:
https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#state_management
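As a rough sketch of what that could look like in onTrigger (assuming cluster-scoped state, storing lastRunAt as epoch milliseconds and startPoint as a plain string; the state keys are illustrative, and the relevant imports are org.apache.nifi.components.state.Scope, StateManager and StateMap):

StateManager stateManager = context.getStateManager();
try {
    // Scope.CLUSTER is backed by ZooKeeper when NiFi runs clustered,
    // and by the local state provider when running standalone.
    StateMap stateMap = stateManager.getState(Scope.CLUSTER);
    String storedLastRun = stateMap.get("lastRunAt");
    String storedStartPoint = stateMap.get("startPoint");
    Timestamp lastRunAt = storedLastRun == null ? null : new Timestamp(Long.parseLong(storedLastRun));

    // ... use lastRunAt / storedStartPoint to do the actual work ...

    // State is persisted as a Map<String, String>
    Map<String, String> newState = new HashMap<>();
    newState.put("lastRunAt", String.valueOf(System.currentTimeMillis()));
    newState.put("startPoint", "new-start-point");
    stateManager.setState(newState, Scope.CLUSTER);
} catch (IOException e) {
    getLogger().error("Failed to read/write processor state", e);
    context.yield();
}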

Related

Migrate Spring cloud stream listener (kafka) from declarative to functional model

I'm trying to migrate an implementation of Spring Cloud Stream (Kafka) from the declarative model to the recommended functional model.
In this blog post they say:
...a functional programming model in Spring Cloud Stream (SCSt). It’s
less code, less configuration. Most importantly, though, your code is
completely decoupled and independent from the internals of SCSt
My current implementation:
Declaring the MessageChannel:
@Input(PRODUCT_INPUT_TOPIC)
MessageChannel productInputChannel();
Using @StreamListener, which is deprecated now:
@StreamListener(StreamConfig.PRODUCT_INPUT_TOPIC)
public void addProduct(@Payload Product product, @Header Long header1, @Header String header2)
Here it is:
@Bean
public Consumer<Product> addProduct() {
    return product -> {
        // your code
    };
}
I am not sure what is the value of PRODUCT_INPUT_TOPIC, but let's assume input.
So Spring Cloud Stream will automatically create a binding for you with the name addProduct-in-0. Here are the details. You can use it as is, but if you still want to use the custom name, you can set spring.cloud.stream.function.bindings.addProduct-in-0=input - see more here.
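For illustration, in application.yml that rename, together with pointing the renamed binding at a topic, could look like this (the topic name product-input-topic is a placeholder):

spring:
  cloud:
    stream:
      function:
        bindings:
          addProduct-in-0: input
      bindings:
        input:
          destination: product-input-topic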
If you need access to headers, you can just pass a Message as the input argument.
Here it is:
@Bean
public Consumer<Message<Product>> addProduct() {
    return message -> {
        Product product = message.getPayload();
        // your code
    };
}
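The header values from the original @Header parameters can then be read off the message; a small sketch, with the header names carried over from the question as assumptions:

Long header1 = message.getHeaders().get("header1", Long.class);
String header2 = message.getHeaders().get("header2", String.class);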

Cannot get custom store connected to a Transformer with Spring Cloud Stream Binder Kafka 3.x

Cannot get custom store connected to my Transformer in Spring Cloud Stream Binder Kafka 3.x (functional style) following examples from here.
I am defining a KeyValueStore as a @Bean with type StoreBuilder<KeyValueStore<String,Long>>:
@Bean
public StoreBuilder<KeyValueStore<String, Long>> myStore() {
    return Stores.keyValueStoreBuilder(
            Stores.persistentKeyValueStore("my-store"), Serdes.String(), Serdes.Long());
}

@Bean
@DependsOn({"myStore"})
public MyTransformer myTransformer() {
    return new MyTransformer("my-store");
}
In the debugger I can see that the beans get initialised.
Then, in my stream processor function:
return myStream -> {
    return myStream
        .peek(..)
        .transform(() -> myTransformer())
        ...
MyTransformer is declared as:
public class MyTransformer implements Transformer<String, MyEvent, KeyValue<KeyValue<String, Long>, MyEvent>> {
    ...
    @Override
    public void init(final ProcessorContext context) {
        this.context = context;
        this.myStore = context.getStateStore(storeName);
    }
Getting the following error when application context starts up from my unit test:
Caused by: org.apache.kafka.streams.errors.StreamsException: Processor KSTREAM-TRANSFORM-0000000002 has no access to StateStore my-store as the store is not connected to the processor. If you add stores manually via '.addStateStore()' make sure to connect the added store to the processor by providing the processor name to '.addStateStore()' or connect them via '.connectProcessorAndStateStores()'. DSL users need to provide the store name to '.process()', '.transform()', or '.transformValues()' to connect the store to the corresponding operator, or they can provide a StoreBuilder by implementing the stores() method on the Supplier itself. If you do not add stores manually, please file a bug report at https://issues.apache.org/jira/projects/KAFKA.
In the application startup logs when running my unit test, I can see that the store seems to get created:
2021-04-06 00:44:43.806 INFO [ main] .k.s.AbstractKafkaStreamsBinderProcessor : state store my-store added to topology
I'm already using pretty much every feature of the Spring Cloud Stream Binder Kafka in my app and from my unit test, everything works very well. Unexpectedly, I got stuck at adding the custom KeyValueStore to my Transformer. It would be great, if you could spot an error in my setup.
The versions I'm using right now:
org.springframework.boot:spring-boot:jar:2.4.4
org.springframework.kafka:spring-kafka:jar:2.6.7
org.springframework.kafka:spring-kafka-test:jar:2.6.7
org.springframework.cloud:spring-cloud-stream-binder-kafka-streams:jar:3.0.4.RELEASE
org.apache.kafka:kafka-streams:jar:2.7.0
I've just tried with
org.springframework.cloud:spring-cloud-stream-binder-kafka-streams:jar:3.1.3-SNAPSHOT
and the issue seems to persist.
In your processor function, when you call .transform(() -> myTransformer()), you also need to provide the state store names in order for the store to be connected to that transformer. There are overloaded transform methods in the KStream API that take state store names as a vararg. I wonder if this is the issue that you are running into. You may want to change that call to .transform(() -> myTransformer(), "my-store").
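Applied to the processor function from the question, the change is just the extra vararg on KStream.transform(TransformerSupplier, String... stateStoreNames); a sketch:

return myStream -> {
    return myStream
        .peek(..)
        // pass the store name so Kafka Streams connects "my-store" to the transformer
        .transform(() -> myTransformer(), "my-store")
        ...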

Why doesn't Apache Kafka use standard Javadoc?

public class CommonClientConfigs {
    private static final Logger log = LoggerFactory.getLogger(CommonClientConfigs.class);

    /*
     * NOTE: DO NOT CHANGE EITHER CONFIG NAMES AS THESE ARE PART OF THE PUBLIC API AND CHANGE WILL BREAK USER CODE.
     */

    public static final String BOOTSTRAP_SERVERS_CONFIG = "bootstrap.servers";
    public static final String BOOTSTRAP_SERVERS_DOC = "A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. This list should be in the form "
            + "<code>host1:port1,host2:port2,...</code>. Since these servers are just used for the initial connection to "
            + "discover the full cluster membership (which may change dynamically), this list need not contain the full set of "
            + "servers (you may want more than one, though, in case a server is down).";

    public static final String CLIENT_DNS_LOOKUP_CONFIG = "client.dns.lookup";
    public static final String CLIENT_DNS_LOOKUP_DOC = "Controls how the client uses DNS lookups. If set to <code>use_all_dns_ips</code> then, when the lookup returns multiple IP addresses for a hostname,"
            + " they will all be attempted to connect to before failing the connection. Applies to both bootstrap and advertised servers."
            + " If the value is <code>resolve_canonical_bootstrap_servers_only</code> each entry will be resolved and expanded into a list of canonical names.";
}
I just don't understand why they don't use standard Javadoc for these comments.
Does anyone have an idea about this?

Kafka Streams - ACLs for Internal Topics

I'm trying to set up a secure Kafka cluster and having a bit of difficulty with ACLs.
The Confluent security guide for Kafka Streams (https://docs.confluent.io/current/streams/developer-guide/security.html) simply states that the Cluster Create ACL has to be given to the principal... but it doesn't say anything about how to actually handle the internal topics.
Through research and experimentation, I've determined (for Kafka version 1.0.0):
Wildcards cannot be used along with text for topic names in ACLs. For example, since all internal topics are prefixed with the application id, my first thought was to apply an ACL to topics matching '<application.id>-*'. This doesn't work.
Topics created by the Streams API do not get read/write access granted to the creator automatically.
Are the exact names of the internal topics predictable and consistent? In other words, if I run my application on a dev server, will the exact same topics be created on the production server when run? If so, then I can just add ACLs derived from dev before deploying. If not, how should the ACLs be added?
Are the exact names of the internal topics predictable and consistent? In other words, if I run my application on a dev server, will the exact same topics be created on the production server when run?
Yes, you'll get the exact same topic names from run to run. The DSL generates processor names with a function that looks like this:
public String newProcessorName(final String prefix) {
    return prefix + String.format("%010d", index.getAndIncrement());
}
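Here index is just an incrementing counter, so, for instance, String.format("%010d", 7) produces "0000000007" and a prefix of KSTREAM-FILTER- would yield the processor name KSTREAM-FILTER-0000000007 (illustrative values, not taken from a real topology).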
Those processor names are then used to create repartition topics with a function that looks like this (the parameter name is a processor name generated as above):
static <K1, V1> String createReparitionedSource(final InternalStreamsBuilder builder,
                                                final Serde<K1> keySerde,
                                                final Serde<V1> valSerde,
                                                final String topicNamePrefix,
                                                final String name) {
    Serializer<K1> keySerializer = keySerde != null ? keySerde.serializer() : null;
    Serializer<V1> valSerializer = valSerde != null ? valSerde.serializer() : null;
    Deserializer<K1> keyDeserializer = keySerde != null ? keySerde.deserializer() : null;
    Deserializer<V1> valDeserializer = valSerde != null ? valSerde.deserializer() : null;

    String baseName = topicNamePrefix != null ? topicNamePrefix : name;
    String repartitionTopic = baseName + REPARTITION_TOPIC_SUFFIX;
    String sinkName = builder.newProcessorName(SINK_NAME);
    String filterName = builder.newProcessorName(FILTER_NAME);
    String sourceName = builder.newProcessorName(SOURCE_NAME);

    builder.internalTopologyBuilder.addInternalTopic(repartitionTopic);
    builder.internalTopologyBuilder.addProcessor(filterName, new KStreamFilter<>(new Predicate<K1, V1>() {
        @Override
        public boolean test(final K1 key, final V1 value) {
            return key != null;
        }
    }, false), name);

    builder.internalTopologyBuilder.addSink(sinkName, repartitionTopic, keySerializer, valSerializer,
            null, filterName);
    builder.internalTopologyBuilder.addSource(null, sourceName, new FailOnInvalidTimestamp(),
            keyDeserializer, valDeserializer, repartitionTopic);

    return sourceName;
}
If you don't change your topology (e.g., you don't change the order in which it's built), you'll get the same results no matter where the topology is constructed (presuming you're using the same version of Kafka Streams).
If so, then I can just add ACLs derived from dev before deploying. If not, how should the ACLs be added?
I have not used ACLs, but I imagine that since these are just regular topics, then yeah, you can apply ACLs to them. The security guide does mention:
When applications are run against a secured Kafka cluster, the principal running the application must have the ACL --cluster --operation Create set so that the application has the permissions to create internal topics.
I've been wondering about this myself, though, so if I am wrong I am guessing someone from Confluent will correct me.
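For what it's worth, granting that cluster-level Create permission with the standard CLI would look roughly like this (the principal name and ZooKeeper address are placeholders):

bin/kafka-acls.sh --authorizer-properties zookeeper.connect=zk-host:2181 \
    --add --allow-principal User:my-streams-app \
    --operation Create --cluster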

spring cloud programmatic metadata generation

Is there any way that I can generate some metadata to add to the service when it registers?
We are moving from Eureka to Consul and I need to add a UUID value to the registered metadata when a service starts, so that later I can get this metadata value when I retrieve the service instances by name.
Some background: We were using this excellent front end UI from https://github.com/VanRoy/spring-cloud-dashboard. It is set to use the Eureka model for services in which you have an Application with a name. Each application will have multiple instances each with an instance id.
So with the Eureka model there is a two-level service description, whereas the Spring Cloud model is a flat one with n instances, each of which has a service id.
The flat model won't work with the UI that I referenced above, since there is no distinction between application name and instance id; in the Spring model these are the same.
So if I generate my own instance id and handle it through metadata, then I can preserve some of the behaviour without rewriting the UI.
See the documentation on metadata and tags in Spring Cloud Consul. Consul doesn't support metadata on service discovery yet, but Spring Cloud has a metadata abstraction (just a map of strings). In Consul, tags created with key=value style are parsed into that metadata map.
For example, in application.yml:
spring:
  cloud:
    consul:
      discovery:
        tags: foo=bar, baz
The above configuration will result in a map with foo→bar and baz→baz.
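For completeness, a sketch of reading that metadata back on the client side (assuming a DiscoveryClient has been injected and the service is registered as "my-service"):

ServiceInstance instance = discoveryClient.getInstances("my-service").get(0);
String foo = instance.getMetadata().get("foo"); // "bar"
String baz = instance.getMetadata().get("baz"); // "baz"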
Based on Spencer's answer I added an EnvironmentPostProcessor to my code.
It works and I am able to add the metadata tag I want programmatically, but rather than complementing the "tags: foo=bar, baz" element it overrides it. I will probably figure out a way around it in the next day or so, but I thought I would add what I did for others who look at this answer and ask: so what did you do?
First, add a class as follows:
import java.util.LinkedHashMap;
import java.util.UUID;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.context.config.ConfigFileApplicationListener;
import org.springframework.boot.env.EnvironmentPostProcessor;
import org.springframework.core.Ordered;
import org.springframework.core.env.ConfigurableEnvironment;
import org.springframework.core.env.MapPropertySource;

import lombok.extern.slf4j.Slf4j;

@Slf4j
public class MetaDataEnvProcessor implements EnvironmentPostProcessor, Ordered {

    // Run before ConfigFileApplicationListener
    private int order = ConfigFileApplicationListener.DEFAULT_ORDER - 1;

    private UUID instanceId = UUID.randomUUID();

    @Override
    public int getOrder() {
        return this.order;
    }

    @Override
    public void postProcessEnvironment(ConfigurableEnvironment environment, SpringApplication application) {
        LinkedHashMap<String, Object> map = new LinkedHashMap<>();
        map.put("spring.cloud.consul.discovery.tags", "instanceId=" + instanceId.toString());
        MapPropertySource propertySource = new MapPropertySource("springCloudConsulTags", map);
        environment.getPropertySources().addLast(propertySource);
    }
}
Then add a spring.factories file in resources/META-INF with the following line to register this processor:
org.springframework.boot.env.EnvironmentPostProcessor=com.example.consul.MetaDataEnvProcessor
This works fine, except that it overrides whatever is in your application.yml file for tags.