Best practice for implementing Micronaut/Kafka-Streams with more than one KStream/KTable? - apache-kafka

There are several details about the example Micronaut/Kafka Streams application which I don't understand. Here is the example class from the documentation (original link: https://micronaut-projects.github.io/micronaut-kafka/latest/guide/#kafkaStreams).
My questions are:
Why are we returning only the source stream?
If we have multiple source KStream objects, e.g. to do a join, do we also need to make them beans?
Do we also need to make each source KTable a bean?
What happens if we don't make a source KStream or KTable a bean? We currently have at least one project that does this, with no apparent problems.
import io.micronaut.configuration.kafka.streams.ConfiguredStreamBuilder;
import io.micronaut.context.annotation.Factory;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import javax.inject.Named;
import javax.inject.Singleton;
import java.util.Arrays;
import java.util.Locale;
import java.util.Properties;
@Factory
public class WordCountStream {

    public static final String STREAM_WORD_COUNT = "word-count";
    public static final String INPUT = "streams-plaintext-input";
    public static final String OUTPUT = "streams-wordcount-output";
    public static final String WORD_COUNT_STORE = "word-count-store";

    @Singleton
    @Named(STREAM_WORD_COUNT)
    KStream<String, String> wordCountStream(ConfiguredStreamBuilder builder) {
        // set default serdes
        Properties props = builder.getConfiguration();
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        KStream<String, String> source = builder.stream(INPUT);

        KTable<String, Long> groupedByWord = source
                .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
                .groupBy((key, word) -> word, Grouped.with(Serdes.String(), Serdes.String()))
                // store the result in a store for lookup later
                .count(Materialized.as(WORD_COUNT_STORE));

        groupedByWord
                // convert to stream
                .toStream()
                // send to the output topic using specific serdes
                .to(OUTPUT, Produced.with(Serdes.String(), Serdes.Long()));

        return source;
    }
}
Edit: Here's a version of our service with multiple streams, edited to remove identifying info.
@Factory
public class TopologyCopy {

    private static class DataOut {}
    private static class DataInOne {}
    private static class DataInTwo {}
    private static class DataInThree {}

    @Singleton
    @Named("data")
    KStream<Integer, DataOut> dataStream(ConfiguredStreamBuilder builder) {
        Properties props = builder.getConfiguration();
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG, LogAndContinueExceptionHandler.class);

        KStream<Integer, DataInOne> dataOneStream = builder.stream("data-one",
                Consumed.with(TextualIntSerde.INSTANCE, new JsonSerde<>(DataInOne.class)));
        KStream<Integer, DataInTwo> dataTwoStream = builder.stream("data-two",
                Consumed.with(TextualIntSerde.INSTANCE, new JsonSerde<>(DataInTwo.class)));
        GlobalKTable<Integer, DataInThree> signalTable = builder.globalTable("data-three",
                Consumed.with(TextualIntSerde.INSTANCE, new JsonSerde<>(DataInThree.class)),
                Materialized.as("data-three-store"));

        KTable<Integer, DataInTwo> dataTwoTable = dataTwoStream
                .groupByKey()
                .aggregate(() -> null, (key, device, storedDevice) -> device,
                        Materialized.with(TextualIntSerde.INSTANCE, new JsonSerde<>(DataInTwo.class)));

        dataOneStream
                .transformValues(() -> /* MAGIC */)
                .join(dataTwoTable, (data1, data2) -> /* MAGIC */)
                .selectKey((something, msg) -> /* MAGIC */)
                .to("topic-out", Produced.with(Serdes.UUID(), new JsonSerde<>(OutMessage.class)));

        return dataOneStream;
    }
}

Related

How to implement security in Spring Cloud API Gateway with Keycloak for a multi-tenant application

I'm working on a project in which there is a requirement to convert a single-tenant application to a multi-tenant application. How do I implement multi-tenant support for the gateway using Keycloak with Spring Security OAuth2?
Please share any references.
Use a ReactiveClientRegistrationRepository to register your clients dynamically at runtime. You can then plug this implementation into a custom WebFilter, which will use your custom repository to compute the client details based on the logged-in user's realm (which can be the hostname, email, etc.).
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import reactor.core.publisher.Mono;
import org.springframework.security.oauth2.client.registration.ClientRegistration;
import org.springframework.security.oauth2.client.registration.ClientRegistrations;
import org.springframework.security.oauth2.client.registration.ReactiveClientRegistrationRepository;
import org.springframework.security.oauth2.core.ClientAuthenticationMethod;
import org.springframework.stereotype.Component;
@Component
public class TenantClientRegistrationRepository implements ReactiveClientRegistrationRepository {
private final Map<String, String> tenants = new HashMap<>();
private final Map<String, Mono<ClientRegistration>> clients = new HashMap<>();
public TenantClientRegistrationRepository() {
this.tenants.put("tenant1", "http://localhost:8080/realms/tenant1");
this.tenants.put("tenant2", "http://localhost:8080/realms/tenant2");
}
@Override
public Mono<ClientRegistration> findByRegistrationId(String registrationId) {
return this.clients.computeIfAbsent(registrationId, this::fromTenant);
}
private Mono<ClientRegistration> fromTenant(String registrationId) {
return Optional.ofNullable(this.tenants.get(registrationId))
.map(uri -> Mono.defer(() -> clientRegistration(uri, registrationId)).cache())
.orElse(Mono.error(new IllegalArgumentException("unknown tenant")));
}
private Mono<ClientRegistration> clientRegistration(String uri, String registrationId) {
return Mono.just(ClientRegistrations.fromIssuerLocation(uri)
.registrationId(registrationId)
.clientId("web-client")//fetch client creds via rest or some other means
.clientSecret("********")
.scope("openid")
.build());
}
@KafkaListener(topics = "tenants")
//TODO: Here we will populate tenants(realm) and respective client details based on tenant creation event
public void action(Map<String, Map<String, Object>> action) {
if (action.containsKey("created")) {
Map<String, Object> tenant = action.get("created");
String alias = (String) tenant.get("alias");
String issuerUri = (String) tenant.get("issuerUri");
this.tenants.put(alias, issuerUri);
this.clients.remove(alias);
}
}
}
import static org.springframework.security.config.Customizer.withDefaults;
import java.util.HashMap;
import java.util.Map;
import org.springframework.beans.BeansException;
import org.springframework.context.ApplicationContext;
import org.springframework.context.ApplicationContextAware;
import org.springframework.security.config.web.server.ServerHttpSecurity;
import org.springframework.security.oauth2.client.oidc.web.server.logout.OidcClientInitiatedServerLogoutSuccessHandler;
import org.springframework.security.oauth2.client.registration.ClientRegistration;
import org.springframework.security.oauth2.client.registration.ReactiveClientRegistrationRepository;
import org.springframework.security.web.server.ServerAuthenticationEntryPoint;
import org.springframework.security.web.server.WebFilterChainProxy;
import org.springframework.security.web.server.authentication.RedirectServerAuthenticationEntryPoint;
import org.springframework.web.server.ServerWebExchange;
import org.springframework.web.server.WebFilter;
import org.springframework.web.server.WebFilterChain;
import org.springframework.web.util.UriComponentsBuilder;
import reactor.core.publisher.Mono;
import reactor.util.context.Context;
public class TenantFilterChain implements WebFilter, ApplicationContextAware {
private final Map<String, Mono<WebFilter>> tenants = new HashMap<>();
private final ReactiveClientRegistrationRepository clients;
private ApplicationContext context;
public TenantFilterChain(ReactiveClientRegistrationRepository clients) {
this.clients = clients;
}
@Override
public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
Mono<ClientRegistration> tenant = toTenant(exchange);
Mono<WebFilter> filter = tenant.flatMap(this::fromTenant);
return Mono.zip(tenant, filter)
.flatMap(tuple -> tuple.getT2().filter(exchange, chain)
.contextWrite(Context.of(ClientRegistration.class, tuple.getT1()))
.thenReturn(exchange))
.switchIfEmpty(chain.filter(exchange).thenReturn(exchange))
.then(Mono.empty());
}
private Mono<ClientRegistration> toTenant(ServerWebExchange exchange) {
String host = UriComponentsBuilder.fromUri(exchange.getRequest().getURI())
.build().getHost();
return this.clients.findByRegistrationId(host);
}
private Mono<WebFilter> fromTenant(ClientRegistration registration) {
return this.tenants.computeIfAbsent(registration.getRegistrationId(), tenant -> {
ServerHttpSecurity http = new ContextAwareServerHttpSecurity(this.context);
OidcClientInitiatedServerLogoutSuccessHandler handler =
new OidcClientInitiatedServerLogoutSuccessHandler(this.clients);
handler.setPostLogoutRedirectUri("http://localhost:8282");
ServerAuthenticationEntryPoint entryPoint =
new RedirectServerAuthenticationEntryPoint("/oauth2/authorization/" + tenant);
// @formatter:off
http
.authorizeExchange(e -> e
.pathMatchers("/jwks").permitAll()
.anyExchange().authenticated())
.logout(l -> l.logoutSuccessHandler(handler))
.oauth2Login(withDefaults())
.exceptionHandling(e -> e.authenticationEntryPoint(entryPoint));
// @formatter:on
return Mono.just((WebFilter) new WebFilterChainProxy(http.build())).cache();
});
}
@Override
public void setApplicationContext(ApplicationContext context) throws BeansException {
this.context = context;
}
private static class ContextAwareServerHttpSecurity extends ServerHttpSecurity {
ContextAwareServerHttpSecurity(ApplicationContext context) {
super.setApplicationContext(context);
}
}
}
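The answer does not show how TenantFilterChain is wired into the application. A minimal sketch of one way to register it, assuming WebFlux picks up WebFilter beans automatically and that this custom filter replaces the default security filter chain (the class and bean names here are illustrative):
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.oauth2.client.registration.ReactiveClientRegistrationRepository;

@Configuration
public class TenantSecurityConfig {

    // Expose the per-tenant filter chain as a WebFilter bean so WebFlux applies it to incoming requests.
    // TenantClientRegistrationRepository (annotated @Component above) is injected as the repository.
    @Bean
    public TenantFilterChain tenantFilterChain(ReactiveClientRegistrationRepository clients) {
        return new TenantFilterChain(clients);
    }
}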
Referenced codebase
Talk by Josh Cummings on multi-tenancy with Spring Security OAuth2
Hope this helps! :)

Flink Streaming Event time Window

I am running a simple example to test windows based on event time. I am able to generate output with processing time, but when I use event time, no output is produced. Please help me understand what I am doing wrong.
I am creating a sliding window of size 10 seconds which slides every 5 seconds; at the end of each window, the system should emit the number of messages received during that time.
Input:
a,1513695853 (generated at 13th second, received at 13th second)
a,1513695853 (generated at 13th second, received at 13th second)
a,1513695856 (generated at 16th second, received at 19th second)
a,1513695859 (generated at 19th second, received at 19th second)
The 2nd field is the event timestamp, representing the 13th, 13th, 16th, and 19th second of a minute.
If I use a processing-time window, the output is:
(a,1)
(a,3)
(a,2)
But when I use event time, no output is printed.
package org.apache.flink.window.training;
import java.io.InputStream;
import java.util.Properties;
import org.apache.flink.api.common.functions.FoldFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
import com.fasterxml.jackson.databind.ObjectMapper;
public class SocketStream {
private static Properties properties = new Properties();
public static void main(String args[]) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
InputStream inputStream =
SocketStream.class.getClassLoader().getResourceAsStream("local-kafka-server.properties");
properties.load(inputStream);
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
FlinkKafkaConsumer010<String> consumer =
new FlinkKafkaConsumer010<>("test-topic", new SimpleStringSchema(), properties);
DataStream<Element> socketStockStream =
env.addSource(consumer).map(new MapFunction<String, Element>() {
@Override
public Element map(String value) throws Exception {
String split[] = value.split(",");
Element element = new Element(split[0], Long.parseLong(split[1]));
return element;
}
}).assignTimestampsAndWatermarks(new TimestampExtractor());
socketStockStream.map(new MapFunction<Element, Tuple2<String, Integer>>() {
@Override
public Tuple2<String, Integer> map(Element value) throws Exception {
return new Tuple2<String, Integer>(value.getId(), 1);
}
}).keyBy(0).timeWindow(Time.seconds(10), Time.seconds(5))
.sum(1).
print();
env.execute();
}
public static class TimestampExtractor implements AssignerWithPunctuatedWatermarks<Element> {
private static final long serialVersionUID = 1L;
@Override
public long extractTimestamp(Element element, long previousElementTimestamp) {
return element.getTimestamp();
}
@Override
public Watermark checkAndGetNextWatermark(Element lastElement, long extractedTimestamp) {
// TODO Auto-generated method stub
return null;
}
}
}
Event-time processing requires properly generated timestamps and watermarks.
The TimestampExtractor in your code does not generate watermarks: checkAndGetNextWatermark always returns null, so the event-time windows are never triggered.
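A minimal sketch of an assigner that does emit watermarks, assuming the second CSV field is a Unix timestamp in seconds (so it is converted to milliseconds here) and tolerating a few seconds of out-of-orderness; the numbers are illustrative:
public static class TimestampExtractor implements AssignerWithPunctuatedWatermarks<Element> {
    private static final long serialVersionUID = 1L;
    // assumed maximum out-of-orderness of events, in milliseconds
    private static final long MAX_OUT_OF_ORDERNESS_MS = 3000L;

    @Override
    public long extractTimestamp(Element element, long previousElementTimestamp) {
        // the input carries timestamps in seconds; Flink expects milliseconds
        return element.getTimestamp() * 1000L;
    }

    @Override
    public Watermark checkAndGetNextWatermark(Element lastElement, long extractedTimestamp) {
        // emit a watermark after every element so that event-time windows can fire
        return new Watermark(extractedTimestamp - MAX_OUT_OF_ORDERNESS_MS);
    }
}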

Writing to Google Cloud Storage from PubSub using Cloud Dataflow using DoFn

I am trying to write Google PubSub messages to Google Cloud Storage using Google Cloud Dataflow. I know that TextIO/AvroIO do not support streaming pipelines. However, I read in a comment by the author in [1] that it is possible to write to GCS in a streaming pipeline from a ParDo/DoFn. I constructed a pipeline by following their article as closely as I could.
I was aiming for this behaviour:
Messages written out in batches of up to 100 to objects in GCS (one per window pane), under a path that corresponds to the time the message was published: dataflow-requests/[isodate-time]/[paneIndex].
I get different results:
There is only a single pane in every hourly window. I therefore only get one file in every hourly 'bucket' (it's really an object path in GCS). Reducing MAX_EVENTS_IN_FILE to 10 made no difference, still only one pane/file.
There is only a single message in every GCS object that is written out
The pipeline occasionally raises a CRC error when writing to GCS.
How do I fix these problems and get the behaviour I'm expecting?
Sample log output:
21:30:06.977 writing pane 0 to blob dataflow-requests/2016-04-08T20:59:59.999Z/0
21:30:06.977 writing pane 0 to blob dataflow-requests/2016-04-08T20:59:59.999Z/0
21:30:07.773 successfully write pane 0 to blob dataflow-requests/2016-04-08T20:59:59.999Z/0
21:30:07.846 successfully write pane 0 to blob dataflow-requests/2016-04-08T20:59:59.999Z/0
21:30:07.847 writing pane 0 to blob dataflow-requests/2016-04-08T20:59:59.999Z/0
Here is my code:
package com.example.dataflow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.transforms.windowing.*;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.gcloud.storage.BlobId;
import com.google.gcloud.storage.BlobInfo;
import com.google.gcloud.storage.Storage;
import com.google.gcloud.storage.StorageOptions;
import org.joda.time.Duration;
import org.joda.time.format.ISODateTimeFormat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
public class PubSubGcsSSCCEPipepline {
private static final Logger LOG = LoggerFactory.getLogger(PubSubGcsSSCCEPipepline.class);
public static final String BUCKET_PATH = "dataflow-requests";
public static final String BUCKET_NAME = "myBucketName";
public static final Duration ONE_DAY = Duration.standardDays(1);
public static final Duration ONE_HOUR = Duration.standardHours(1);
public static final Duration TEN_SECONDS = Duration.standardSeconds(10);
public static final int MAX_EVENTS_IN_FILE = 100;
public static final String PUBSUB_SUBSCRIPTION = "projects/myProjectId/subscriptions/requests-dataflow";
private static class DoGCSWrite extends DoFn<String, Void>
implements DoFn.RequiresWindowAccess {
public transient Storage storage;
{ init(); }
public void init() { storage = StorageOptions.defaultInstance().service(); }
private void readObject(java.io.ObjectInputStream in)
throws IOException, ClassNotFoundException {
init();
}
@Override
public void processElement(ProcessContext c) throws Exception {
String isoDate = ISODateTimeFormat.dateTime().print(c.window().maxTimestamp());
String blobName = String.format("%s/%s/%s", BUCKET_PATH, isoDate, c.pane().getIndex());
BlobId blobId = BlobId.of(BUCKET_NAME, blobName);
LOG.info("writing pane {} to blob {}", c.pane().getIndex(), blobName);
storage.create(BlobInfo.builder(blobId).contentType("text/plain").build(), c.element().getBytes());
LOG.info("successfully write pane {} to blob {}", c.pane().getIndex(), blobName);
}
}
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
options.as(DataflowPipelineOptions.class).setStreaming(true);
Pipeline p = Pipeline.create(options);
PubsubIO.Read.Bound<String> readFromPubsub = PubsubIO.Read.named("ReadFromPubsub")
.subscription(PUBSUB_SUBSCRIPTION);
PCollection<String> streamData = p.apply(readFromPubsub);
PCollection<String> windows = streamData.apply(Window.<String>into(FixedWindows.of(ONE_HOUR))
.withAllowedLateness(ONE_DAY)
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterPane.elementCountAtLeast(MAX_EVENTS_IN_FILE))
.withLateFirings(AfterFirst.of(AfterPane.elementCountAtLeast(MAX_EVENTS_IN_FILE),
AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(TEN_SECONDS))))
.discardingFiredPanes());
windows.apply(ParDo.of(new DoGCSWrite()));
p.run();
}
}
[1] https://labs.spotify.com/2016/03/10/spotifys-event-delivery-the-road-to-the-cloud-part-iii/
Thanks to Sam McVeety for the solution. Here is the corrected code for anyone reading:
package com.example.dataflow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.*;
import com.google.cloud.dataflow.sdk.transforms.windowing.*;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.gcloud.WriteChannel;
import com.google.gcloud.storage.BlobId;
import com.google.gcloud.storage.BlobInfo;
import com.google.gcloud.storage.Storage;
import com.google.gcloud.storage.StorageOptions;
import org.joda.time.Duration;
import org.joda.time.format.ISODateTimeFormat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Iterator;
public class PubSubGcsSSCCEPipepline {
private static final Logger LOG = LoggerFactory.getLogger(PubSubGcsSSCCEPipepline.class);
public static final String BUCKET_PATH = "dataflow-requests";
public static final String BUCKET_NAME = "myBucketName";
public static final Duration ONE_DAY = Duration.standardDays(1);
public static final Duration ONE_HOUR = Duration.standardHours(1);
public static final Duration TEN_SECONDS = Duration.standardSeconds(10);
public static final int MAX_EVENTS_IN_FILE = 100;
public static final String PUBSUB_SUBSCRIPTION = "projects/myProjectId/subscriptions/requests-dataflow";
private static class DoGCSWrite extends DoFn<Iterable<String>, Void>
implements DoFn.RequiresWindowAccess {
public transient Storage storage;
{ init(); }
public void init() { storage = StorageOptions.defaultInstance().service(); }
private void readObject(java.io.ObjectInputStream in)
throws IOException, ClassNotFoundException {
init();
}
@Override
public void processElement(ProcessContext c) throws Exception {
String isoDate = ISODateTimeFormat.dateTime().print(c.window().maxTimestamp());
long paneIndex = c.pane().getIndex();
String blobName = String.format("%s/%s/%s", BUCKET_PATH, isoDate, paneIndex);
BlobId blobId = BlobId.of(BUCKET_NAME, blobName);
LOG.info("writing pane {} to blob {}", paneIndex, blobName);
WriteChannel writer = storage.writer(BlobInfo.builder(blobId).contentType("text/plain").build());
LOG.info("blob stream opened for pane {} to blob {} ", paneIndex, blobName);
int i=0;
for (Iterator<String> it = c.element().iterator(); it.hasNext();) {
i++;
writer.write(ByteBuffer.wrap(it.next().getBytes()));
LOG.info("wrote {} elements to blob {}", i, blobName);
}
writer.close();
LOG.info("successfully write pane {} to blob {}", paneIndex, blobName);
}
}
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
options.as(DataflowPipelineOptions.class).setStreaming(true);
Pipeline p = Pipeline.create(options);
PubsubIO.Read.Bound<String> readFromPubsub = PubsubIO.Read.named("ReadFromPubsub")
.subscription(PUBSUB_SUBSCRIPTION);
PCollection<String> streamData = p.apply(readFromPubsub);
PCollection<KV<String, String>> keyedStream =
streamData.apply(WithKeys.of(new SerializableFunction<String, String>() {
public String apply(String s) { return "constant"; } }));
PCollection<KV<String, Iterable<String>>> keyedWindows = keyedStream
.apply(Window.<KV<String, String>>into(FixedWindows.of(ONE_HOUR))
.withAllowedLateness(ONE_DAY)
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterPane.elementCountAtLeast(MAX_EVENTS_IN_FILE))
.withLateFirings(AfterFirst.of(AfterPane.elementCountAtLeast(MAX_EVENTS_IN_FILE),
AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(TEN_SECONDS))))
.discardingFiredPanes())
.apply(GroupByKey.create());
PCollection<Iterable<String>> windows = keyedWindows
.apply(Values.<Iterable<String>>create());
windows.apply(ParDo.of(new DoGCSWrite()));
p.run();
}
}
There's a gotcha here, which is that you'll need a GroupByKey in order for the panes to be aggregated appropriately. The Spotify example references this as "Materialization of panes is done in “Aggregate Events” transform which is nothing else than a GroupByKey transform", but it's a subtle point. You'll need to provide a key in order to do this, and in your case, it appears a constant value will work.
PCollection<String> streamData = p.apply(readFromPubsub);
PCollection<KV<String, String>> keyedStream =
streamData.apply(WithKeys.of(new SerializableFunction<String, String>() {
public String apply(String s) { return "constant"; } }));
At this point, you can apply your windowing function, and then a final GroupByKey to get the desired behavior:
PCollection<KV<String, Iterable<String>>> keyedWindows = keyedStream.apply(...)
.apply(GroupByKey.create());
PCollection<Iterable<String>> windows = keyedWindows
.apply(Values.<Iterable<String>>create());
Now the elements in processElement will be Iterable<String>, with size 100 or more.
We've filed https://issues.apache.org/jira/browse/BEAM-184 to make this behavior clearer.
As of Beam 2.0, TextIO/AvroIO do support writing unbounded collections; see the documentation. In particular, you have to specify withWindowedWrites().
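A minimal sketch of that approach with org.apache.beam.sdk.io.TextIO (Beam 2.x), assuming a windowed, unbounded PCollection<String> named windowedLines and an illustrative output prefix:
// Windowed writes to an unbounded collection need withWindowedWrites()
// and an explicit shard count; the prefix and shard count here are illustrative.
windowedLines.apply("WriteWindowedFiles", TextIO.write()
        .to("gs://myBucketName/dataflow-requests/output")
        .withWindowedWrites()
        .withNumShards(1));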

How to solve the chainmapper is not applicable for the arguments error while doing job chaining in Mapreduce?

I'm using Hadoop 1.2.1 and Eclipse Juno. I'm trying to chain three map tasks in a single MapReduce job. While writing the MapReduce code in Eclipse, I'm getting an error that ChainMapper is not applicable for the arguments, and I also can't set the input path. Following is my MapReduce code:
package org.myorg;
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.util.StringTokenizer;
import javax.security.auth.login.Configuration;
import org.apache.hadoop.classification.InterfaceAudience.Private;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.net.StaticMapping;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class Recommand extends Configured implements Tool {
public static class IdIndexMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text>{
public void map(LongWritable key, Text val, OutputCollector<Text, Text> output,Reporter reporter)throws IOException{
String[] ids;
String ln=val.toString();
ids=ln.split("\t");
output.collect(new Text(ids[0]),new Text(ids[1]));
}
}
public static class FtrMapper extends MapReduceBase implements Mapper<Text, Text, Text, Text>{
public void map(Text key, Text val, OutputCollector<Text, Text>output, Reporter reporter) throws IOException{
String[] str;
String lne=val.toString();
while(lne.contains("M1024")){
str=lne.split(",");
String[] str1=new String[str.length];
for(int i=0;i<str.length;i++){
if(str[i]=="M1024"){ // here we need to give the id on which we need to split
continue;
}
str1[i]=str[i];
output.collect(key,new Text(str1[i]));
// System.out.println("str1 out:"+str[i]);
}
}
}
}
public static class CntMapper extends MapReduceBase implements Mapper<Text, Text, Text, IntWritable>{
private final static IntWritable one=new IntWritable(1);
private Text word=new Text();
public void map(Text key, Text val, OutputCollector<Text, IntWritable>output, Reporter reporter)throws IOException{
String line = val.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>{
public void reduce(Text key, Iterable<IntWritable>values, OutputCollector<Text, IntWritable>output, Reporter reporter)throws IOException{
int sum=0;
for(IntWritable val:values){
sum+=val.get();
}
output.collect(key,new IntWritable(sum));
}
}
static int printUsage() {
System.out.println("recommand ");
ToolRunner.printGenericCommandUsage(System.out);
return -1;
}
public int run(String[] args) throws Exception {
JobConf conf = new JobConf(getConf(), Recommand.class);
conf.setJobName("wordcount");
if (args.length != 2) {
System.out.println("ERROR: Wrong number of parameters: " +
args.length + " instead of 2.");
return printUsage();
}
FileInputFormat.setInputPaths(conf, args[0]);
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
JobConf mapAConf = new JobConf(false);
ChainMapper.addMapper(conf, IdIndexMapper.class, LongWritable.class, Text.class, Text.class, Text.class, true, mapAConf);
JobConf mapBConf = new JobConf(false);
ChainMapper.addMapper(conf, FtrMapper.class, Text.class, Text.class, Text.class, Text.class, true, mapBConf);
JobConf mapCConf = new JobConf(false);
ChainMapper.addMapper(conf, CntMapper.class, Text.class, Text.class, Text.class, IntWritable.class, true, mapBConf);
JobConf reduceConf = new JobConf(false);
ChainReducer.setReducer(conf, Reduce.class, Text.class, IntWritable.class, Text.class, IntWritable.class, true, reduceConf);
JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new org.apache.hadoop.conf.Configuration(), new Recommand(), args);
System.exit(res);
}
}
Can anyone help me to solve this problem please?
Make sure of the following to avoid this error:
Both mapper classes extend/implement the Mapper class from the API you are actually using.
The ChainMapper class being used is from the same API as the rest of your code, whichever is applicable:
org.apache.hadoop.mapreduce.lib.chain.ChainMapper (new API) or org.apache.hadoop.mapred.lib.ChainMapper (old API).
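For the code shown above, which otherwise uses the old mapred API (MapReduceBase, JobConf, org.apache.hadoop.mapred.lib.ChainMapper), a sketch of a consistent set of imports would be the following; this assumes staying on the old API rather than migrating to org.apache.hadoop.mapreduce:
// Old (mapred) API only; mixing in new-API classes breaks ChainMapper's expected signatures.
import org.apache.hadoop.conf.Configuration;      // not javax.security.auth.login.Configuration
import org.apache.hadoop.mapred.FileInputFormat;  // not org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;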

Create Scalding Source like TextLine that combines multiple files into single mappers

We have many small files that need combining. In Scalding you can use TextLine to read files as text lines. The problem is we get 1 mapper per file, but we want to combine multiple files so that they are processed by 1 mapper.
I understand we need to change the input format to an implementation of CombineFileInputFormat, and this may involve using Cascading's CombinedHfs. We cannot work out how to do this, but it should be just a handful of lines of code to define our own Scalding source called, say, CombineTextLine.
Many thanks to anyone who can provide the code to do this.
As a side question, we have some data that is in S3; it would be great if the given solution works for S3 files. I guess it depends on whether CombineFileInputFormat or CombinedHfs works for S3.
You get the idea in your question, so here is a possible solution for you.
Create your own input format that extends CombineFileInputFormat and uses your own custom RecordReader. I am showing you Java code, but you could easily convert it to Scala if you want.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;
public class CombinedInputFormat<K, V> extends CombineFileInputFormat<K, V> {
public static class MyKeyValueLineRecordReader implements RecordReader<LongWritable,Text> {
private final RecordReader<LongWritable,Text> delegate;
public MyKeyValueLineRecordReader(CombineFileSplit split, Configuration conf, Reporter reporter, Integer idx) throws IOException {
FileSplit fileSplit = new FileSplit(split.getPath(idx), split.getOffset(idx), split.getLength(idx), split.getLocations());
delegate = new LineRecordReader(conf, fileSplit);
}
@Override
public boolean next(LongWritable key, Text value) throws IOException {
return delegate.next(key, value);
}
@Override
public LongWritable createKey() {
return delegate.createKey();
}
@Override
public Text createValue() {
return delegate.createValue();
}
@Override
public long getPos() throws IOException {
return delegate.getPos();
}
@Override
public void close() throws IOException {
delegate.close();
}
@Override
public float getProgress() throws IOException {
return delegate.getProgress();
}
}
@Override
public RecordReader getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException {
return new CombineFileRecordReader(job, (CombineFileSplit) split, reporter, (Class) MyKeyValueLineRecordReader.class);
}
}
Then you need to extend the TextLine class and make it use the input format you just defined (Scala code from now on).
import cascading.scheme.hadoop.TextLine
import cascading.flow.FlowProcess
import org.apache.hadoop.mapred.{OutputCollector, RecordReader, JobConf}
import cascading.tap.Tap
import com.twitter.scalding.{FixedPathSource, TextLineScheme}
import cascading.scheme.Scheme
class CombineFileTextLine extends TextLine{
override def sourceConfInit(flowProcess: FlowProcess[JobConf], tap: Tap[JobConf, RecordReader[_, _], OutputCollector[_, _]], conf: JobConf) {
super.sourceConfInit(flowProcess, tap, conf)
conf.setInputFormat(classOf[CombinedInputFormat[String, String]])
}
}
Create a scheme for your combined input.
trait CombineFileTextLineScheme extends TextLineScheme{
override def hdfsScheme = new CombineFileTextLine().asInstanceOf[Scheme[JobConf,RecordReader[_,_],OutputCollector[_,_],_,_]]
}
Finally, create your source class:
case class CombineFileMultipleTextLine(p : String*) extends FixedPathSource(p :_*) with CombineFileTextLineScheme
If you want to use a single path instead of multiple ones, the change to your source class is trivial.
I hope that helps.
This should do the trick: https://wiki.apache.org/hadoop/HowManyMapsAndReduces