Reading files and folders in order with Apache Beam

I have a folder structure of the type year/month/day/hour/*, and I'd like Beam to read this as an unbounded source in chronological order. Specifically, this means reading in all the files in the first hour on record and adding their contents for processing. Then, add the file contents of the next hour for processing, and so on up until the current time, where it waits for new files to arrive in the latest year/month/day/hour folder.
Is it possible to do this with apache beam?

What I would do is add timestamps to each element according to the file path. As a test, I used the following example.
First of all, as explained in this answer, you can use FileIO to continuously match a file pattern. This helps because, per your use case, once you have finished with the backfill you want to keep reading newly arriving files within the same job. In this case I provide gs://BUCKET_NAME/data/** because my files will look like gs://BUCKET_NAME/data/year/month/day/hour/filename.extension:
p
    .apply(FileIO.match()
        .filepattern(inputPath)
        .continuously(
            // Check for new files every minute
            Duration.standardMinutes(1),
            // Never stop checking for new files
            Watch.Growth.<String>never()))
    .apply(FileIO.readMatches())
Watch frequency and timeout can be adjusted at will.
Then, in the next step, we'll receive the matched file. I use ReadableFile.getMetadata().resourceId() to get the full path and split it by "/" to build the corresponding timestamp. I round it down to the hour and do not account for timezone correction here. With readFullyAsUTF8String we read the whole file (be careful: if the whole file does not fit into memory, you should shard your input) and split it into lines. With ProcessContext.outputWithTimestamp we emit downstream a KV of filename and line (the filename is not needed anymore, but it helps to see where each file comes from) together with the timestamp derived from the path. Note that we're shifting timestamps "back in time", so this can mess with the watermark heuristics and you will get a message such as:
Cannot output with timestamp 2019-03-17T00:00:00.000Z. Output timestamps must be no earlier than the timestamp of the current input (2019-06-05T15:41:29.645Z) minus the allowed skew (0 milliseconds). See the DoFn#getAllowedTimestampSkew() Javadoc for details on changing the allowed skew.
To overcome this I set getAllowedTimestampSkew to Long.MAX_VALUE but take into account that this is deprecated. ParDo code:
.apply("Add Timestamps", ParDo.of(new DoFn<ReadableFile, KV<String, String>>() {
#Override
public Duration getAllowedTimestampSkew() {
return new Duration(Long.MAX_VALUE);
}
#ProcessElement
public void processElement(ProcessContext c) {
ReadableFile file = c.element();
String fileName = file.getMetadata().resourceId().toString();
String lines[];
String[] dateFields = fileName.split("/");
Integer numElements = dateFields.length;
String hour = dateFields[numElements - 2];
String day = dateFields[numElements - 3];
String month = dateFields[numElements - 4];
String year = dateFields[numElements - 5];
String ts = String.format("%s-%s-%s %s:00:00", year, month, day, hour);
Log.info(ts);
try{
lines = file.readFullyAsUTF8String().split("\n");
for (String line : lines) {
c.outputWithTimestamp(KV.of(fileName, line), new Instant(dateTimeFormat.parseMillis(ts)));
}
}
catch(IOException e){
Log.info("failed");
}
}}))
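Note that dateTimeFormat is not defined in the snippet above; it is assumed to be a Joda-Time formatter whose pattern matches the ts string built from the path, for example something like:
private static final DateTimeFormatter dateTimeFormat = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss");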
Finally, I window into 1-hour FixedWindows and log the results:
.apply(Window
    .<KV<String, String>>into(FixedWindows.of(Duration.standardHours(1)))
    .triggering(AfterWatermark.pastEndOfWindow())
    .discardingFiredPanes()
    .withAllowedLateness(Duration.ZERO))
.apply("Log results", ParDo.of(new DoFn<KV<String, String>, Void>() {
    @ProcessElement
    public void processElement(ProcessContext c, BoundedWindow window) {
        String file = c.element().getKey();
        String value = c.element().getValue();
        String eventTime = c.timestamp().toString();
        String logString = String.format("File=%s, Line=%s, Event Time=%s, Window=%s",
            file, value, eventTime, window.toString());
        Log.info(logString);
    }
}));
For me it worked with .withAllowedLateness(Duration.ZERO), but depending on the order in which your files arrive you might need to set it higher. Keep in mind that a value that is too high will cause windows to stay open longer and use more persistent storage.
I set the $BUCKET and $PROJECT variables and upload two files:
gsutil cp file1 gs://$BUCKET/data/2019/03/17/00/
gsutil cp file2 gs://$BUCKET/data/2019/03/18/22/
And run the job with:
mvn -Pdataflow-runner compile -e exec:java \
-Dexec.mainClass=com.dataflow.samples.ChronologicalOrder \
-Dexec.args="--project=$PROJECT \
--path=gs://$BUCKET/data/** \
--stagingLocation=gs://$BUCKET/staging/ \
--runner=DataflowRunner"
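Since the job is launched with a custom --path option, the pipeline presumably defines an options interface along these lines (a minimal sketch; the interface and class names are assumptions):
public interface ChronologicalOrderOptions extends DataflowPipelineOptions {
    @Description("Path of the files to read from, e.g. gs://bucket/data/**")
    @Validation.Required
    String getPath();
    void setPath(String path);
}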
Let me know how this works. This was just an example to get started; you might need to adjust the windowing and triggering strategies, allowed lateness, etc. to suit your use case.

Related

Trigger Beam ParDo at window closing only

I have a pipeline that reads events from Kafka. I want to count and log the event count only when the window closes. By doing this I will only have one output log per Kafka partition/shard in each window. I use a timestamp in the header, which I truncate to the hour to create a collection of hourly timestamps. I group the timestamps by hour and log the hourly timestamp and count. This log will be sent to Grafana to create a dashboard with the counts.
Below is how I fetch the data from Kafka and where the window duration is defined:
int windowDuration = 5;

p.apply("Read from Kafka", KafkaIO.<byte[], GenericRecord>read()
        .withBootstrapServers(options.getSourceBrokers().get())
        .withTopics(srcTopics)
        .withKeyDeserializer(ByteArrayDeserializer.class)
        .withValueDeserializer(ConfluentSchemaRegistryDeserializerProvider
            .of(options.getSchemaRegistryUrl().get(), options.getSubject().get()))
        .commitOffsetsInFinalize())
    .apply("Windowing of " + windowDuration + " seconds",
        Window.<KafkaRecord<byte[], GenericRecord>>into(
            FixedWindows.of(Duration.standardSeconds(windowDuration))));
The next step in the pipeline is to produce two collections from the above collection, one with the events as GenericRecord and the other with the hourly timestamps, see below. I want a trigger (I believe) to be applied only to the collection holding the counts, so that it only prints the count once per window. Currently, as is, it prints a count every time it reads from Kafka, creating a large number of entries.
tuplePCollection.get(createdOnTupleTag)
    .apply(Count.perElement())
    .apply(MapElements.into(TypeDescriptors.strings())
        .via((KV<Long, Long> recordCount) -> recordCount.getKey() +
            ": " + recordCount.getValue()))
    .apply(ParDo.of(new LoggerFn.logRecords<String>()));
Here is the DoFn I use to log the counts:
class LoggerFn<T> extends DoFn<T, T> {
    @ProcessElement
    public void process(ProcessContext c) {
        T e = (T) c.element();
        LOGGER.info(e);
        c.output(e);
    }
}
You can use Window.ClosingBehavior, which specifies under which conditions a final pane will be created when a window is permanently closed. There are two options:
FIRE_ALWAYS: always fire the last pane.
FIRE_IF_NON_EMPTY: only fire the last pane if there is new data since the previous firing.
For example:
// We first specify to never emit any panes
.triggering(Never.ever())
// We then specify to fire always when closing the window. This will emit a
// single final pane at the end of allowedLateness
.withAllowedLateness(allowedLateness, Window.ClosingBehavior.FIRE_ALWAYS)
.discardingFiredPanes())
You can find more information about this behavior in the Window.ClosingBehavior Javadoc.
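As a rough, self-contained sketch of how that closing behavior fits into a window transform (the class name, the in-memory test input, and the one-hour window and five-minute lateness values are assumptions, not your pipeline):
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Never;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class FinalPaneOnlyExample {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("Hourly timestamps", Create.of(1561999200000L, 1561999200000L, 1562002800000L))
         .apply("Single final pane per window",
             Window.<Long>into(FixedWindows.of(Duration.standardHours(1)))
                 // Never fire early or on-time panes...
                 .triggering(Never.ever())
                 // ...but always emit one final pane when the window closes,
                 // i.e. at the end of the allowed lateness
                 .withAllowedLateness(Duration.standardMinutes(5), Window.ClosingBehavior.FIRE_ALWAYS)
                 .discardingFiredPanes())
         // One count per distinct hourly timestamp, emitted once per window
         .apply(Count.perElement());

        p.run().waitUntilFinish();
    }
}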

How to use Kafka time window for historical aggregation?

I need to create a state store with the number of authenticated users per day, so I can get the number of authenticated users in the last day, the last 7 days and the last 30 days.
In order to achieve this, every authentication event is sent to the auth-event topic.
I am streaming this topic and creating a window for every day.
Code:
KStream<String, GenericRecord> authStream = builder.stream("auth-event", Consumed.with(stringSerde, valueSerde)
.withOffsetResetPolicy(Topology.AutoOffsetReset.EARLIEST)
.withTimestampExtractor(new TransactionTimestampExtractor()));
authStream
.groupBy(( String key, GenericRecord value) -> value.get("tenantId").toString(), Grouped.with(Serdes.String(), valueSerde))
.windowedBy(TimeWindows.of(Duration.ofDays(1)))
.count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("auth-result-store")
.withKeySerde(stringSerde)
.withValueSerde(longSerde))
.suppress(Suppressed.untilWindowCloses(unbounded()))
.toStream()
.to("auth-result-topic", Produced.with(timeWindowedSerdeFrom(String.class), Serdes.Long()));
After that I am inserting records into the topic.
I also have a REST controller, and I am reading the store using ReadOnlyWindowStore.
The days parameter is sent from the UI and can be 1, 7 or 30, meaning I would like to read, for example, the last 7 windows.
Code:
final ReadOnlyWindowStore<String, Long> dayStore = kafkaStreams.store(KStreamsLdapsExample.authResultTable, QueryableStoreTypes.windowStore());
Instant timeFrom = (Instant.now().minus(Duration.ofDays(days)));
LocalDate currentDate = LocalDate.now();
LocalDateTime currentDayTime = currentDate.atTime(23, 59, 59);
Instant timeTo = Instant.ofEpochSecond(currentDayTime.toEpochSecond(ZoneOffset.UTC));
try(WindowStoreIterator<Long> it1 = dayStore.fetch(tenant, timeFrom, timeTo)) {
Long count = 0L;
JsonObject jsonObject = new JsonObject();
while (it1.hasNext())
{
final KeyValue<Long, Long> next = it1.next();
Date resultDate = new Date(next.key);
jsonObject.addProperty(resultDate.toString(), next.value);
count += next.value;
}
jsonObject.addProperty("tenant", tenant);
jsonObject.addProperty("Total number of events", count);
return ResponseEntity.ok(jsonObject.toString());
}
The problem is that I can only get results for 1-2 days. After that, older windows are lost.
The other problem is the information written in the output topic "auth-result-topic":
I am reading the results with the console consumer, and there are a lot of empty records with no key and no value, and some records with random numbers.
Any idea what is going on with my store? How do I read the past N windows?
Thanks
You will need to increase the store retention time (the default is 1 day) via Materialized.as(...).withRetention(...), which you can pass into the count() operator.
You may also want to increase the window grace period via TimeWindows.of(Duration.ofDays(1)).grace(...).
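Applied to the snippet in the question, that could look roughly like this (the 35-day retention and 1-hour grace values are assumptions; retention must be at least the window size plus the grace period, and large enough to cover your 30-day queries):
authStream
    .groupBy((String key, GenericRecord value) -> value.get("tenantId").toString(),
        Grouped.with(Serdes.String(), valueSerde))
    // Allow some grace for late records before the window is finally closed
    .windowedBy(TimeWindows.of(Duration.ofDays(1)).grace(Duration.ofHours(1)))
    .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("auth-result-store")
        .withKeySerde(stringSerde)
        .withValueSerde(longSerde)
        // Keep window data long enough to answer the 30-day queries
        .withRetention(Duration.ofDays(35)))
    .suppress(Suppressed.untilWindowCloses(unbounded()))
    .toStream()
    .to("auth-result-topic", Produced.with(timeWindowedSerdeFrom(String.class), Serdes.Long()));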
For reading the data with the console consumer: you will need to specify the correct deserializers. The windowed serde and long serde that you use to write into the output topic use binary formats, while the console consumer assumes string data by default. There are command line parameters you can specify to set key and value deserializers, and they must match the serializers you used when writing into the topic.
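For example, something along these lines should at least decode the Long counts on the value side (the broker address is a placeholder; the windowed key will still show up as raw bytes unless you also configure a matching windowed key deserializer):
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic auth-result-topic \
  --from-beginning \
  --property print.key=true \
  --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer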

Flink: join file with Kafka stream

I have a problem I can't really figure out.
I have a Kafka stream that contains some data like this:
{"adId":"9001", "eventAction":"start", "eventType":"track", "eventValue":"", "timestamp":"1498118549550"}
And I want to replace 'adId' with another value 'bookingId'.
This value is located in a CSV file, but I can't really figure out how to get it working.
Here is my mapping CSV file:
9001;8
9002;10
So my output would ideally be something like
{"bookingId":"8", "eventAction":"start", "eventType":"track", "eventValue":"", "timestamp":"1498118549550"}
This file can be refreshed at least once every hour, so the job should pick up changes to it.
I currently have this code which doesn't work for me:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(30000); // create a checkpoint every 30 seconds
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
DataStream<String> adToBookingMapping = env.readTextFile(parameters.get("adToBookingMapping"));
DataStream<Tuple2<Integer,Integer>> input = adToBookingMapping.flatMap(new Tokenizer());
//Kafka Consumer
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", parameters.get("bootstrap.servers"));
properties.setProperty("group.id", parameters.get("group.id"));
FlinkKafkaConsumer010<ObjectNode> consumer = new FlinkKafkaConsumer010<>(parameters.get("inbound_topic"), new JSONDeserializationSchema(), properties);
consumer.setStartFromGroupOffsets();
consumer.setCommitOffsetsOnCheckpoints(true);
DataStream<ObjectNode> logs = env.addSource(consumer);
DataStream<Tuple4<Integer,String,Integer,Float>> parsed = logs.flatMap(new Parser());
// output -> bookingId, action, impressions, sum
DataStream<Tuple4<Integer, String,Integer,Float>> joined = runWindowJoin(parsed, input, 3);
public static DataStream<Tuple4<Integer, String, Integer, Float>> runWindowJoin(
        DataStream<Tuple4<Integer, String, Integer, Float>> parsed,
        DataStream<Tuple2<Integer, Integer>> input, long windowSize) {
    return parsed.join(input)
        .where(new ParsedKey())
        .equalTo(new InputKey())
        .window(TumblingProcessingTimeWindows.of(Time.of(windowSize, TimeUnit.SECONDS)))
        //.window(TumblingEventTimeWindows.of(Time.milliseconds(30000)))
        .apply(new JoinFunction<Tuple4<Integer, String, Integer, Float>, Tuple2<Integer, Integer>, Tuple4<Integer, String, Integer, Float>>() {
            private static final long serialVersionUID = 4874139139788915879L;

            @Override
            public Tuple4<Integer, String, Integer, Float> join(
                    Tuple4<Integer, String, Integer, Float> first,
                    Tuple2<Integer, Integer> second) {
                return new Tuple4<Integer, String, Integer, Float>(second.f1, first.f1, first.f2, first.f3);
            }
        });
}
The code only runs once and then stops, so it doesn't convert new entries in Kafka using the CSV file. Any ideas on how I could process the stream from Kafka with the latest values from my CSV file?
Kind regards,
darkownage
Your goal appears to be to join streaming data with a slow-changing catalog (i.e. a side input). I don't think the join operation is useful here because it doesn't store the catalog entries across windows. Also, the text file is a bounded input whose lines are read once.
Consider using connect to create a connected stream, and store the catalog data as managed state to perform lookups against. The operator's parallelism would need to be 1.
You may find a better solution by researching 'side inputs', looking at the solutions that people use today. See FLIP-17 and Dean Wampler's talk at Flink Forward.
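Here is a minimal sketch of the connect approach, in this case using keyed state on the ad id (which avoids the parallelism-1 restriction of plain operator state). It reuses the parsed and input streams from your code; the state name and structure are illustrative assumptions, not a drop-in implementation:
// Key both streams by adId, connect them, and keep the latest mapping in keyed state.
DataStream<Tuple4<Integer, String, Integer, Float>> joined = input
    .keyBy(mapping -> mapping.f0)             // (adId, bookingId) entries from the csv
    .connect(parsed.keyBy(event -> event.f0)) // events keyed by adId
    .flatMap(new RichCoFlatMapFunction<Tuple2<Integer, Integer>,
                                       Tuple4<Integer, String, Integer, Float>,
                                       Tuple4<Integer, String, Integer, Float>>() {
        private transient ValueState<Integer> bookingId;

        @Override
        public void open(Configuration parameters) {
            bookingId = getRuntimeContext().getState(
                new ValueStateDescriptor<>("bookingId", Integer.class));
        }

        @Override
        public void flatMap1(Tuple2<Integer, Integer> mapping,
                             Collector<Tuple4<Integer, String, Integer, Float>> out) throws Exception {
            // A new or updated catalog entry for this adId
            bookingId.update(mapping.f1);
        }

        @Override
        public void flatMap2(Tuple4<Integer, String, Integer, Float> event,
                             Collector<Tuple4<Integer, String, Integer, Float>> out) throws Exception {
            Integer id = bookingId.value();
            if (id != null) {
                // Replace adId with bookingId and emit
                out.collect(new Tuple4<>(id, event.f1, event.f2, event.f3));
            }
        }
    });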

Compute file content hash with Scala

In our app, we need to compute a file hash so we can later check whether the file was updated.
The way I am doing it right now is with this little method:
protected[services] def computeMigrationHash(toVersion: Int): String = {
val migrationClassName = MigrationClassNameFormat.format(toVersion, toVersion)
val migrationClass = Class.forName(migrationClassName)
val fileName = migrationClass.getName.replace('.', '/') + ".class"
val resource = getClass.getClassLoader.getResource(fileName)
logger.debug("Migration file - " + resource.getFile)
val file = new File(resource.getFile)
val hc = Files.hash(file, Hashing.md5())
logger.debug("Calculated migration file hash - " + hc.toString)
hc.toString
}
It all works perfectly until the code gets deployed into a different environment and the file is located at a different absolute path. I guess the hashing takes the path into account as well.
What is the best way to calculate some sort of reliable hash of a file's content that will produce the same result for as long as the content of the file stays the same?
Thanks,
Having perused the source code (https://github.com/google/guava/blob/master/guava/src/com/google/common/io/Files.java), only the file contents are hashed; the path does not come into play.
public static HashCode hash(File file, HashFunction hashFunction) throws IOException {
return asByteSource(file).hash(hashFunction);
}
Therefore you need not worry about the locality of the file. As for why you end up with a different hash on a different filesystem: compare the sizes/contents of the two copies to make sure, for example, that no differing line endings were introduced.
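As an aside, if the class file can end up inside a jar, going through new File(resource.getFile()) may break; Guava can hash the resource's bytes straight from the URL, which keeps the result purely content-based. A small sketch (shown in Java; the class and method names are made up for illustration):
import java.net.URL;
import com.google.common.hash.HashCode;
import com.google.common.hash.Hashing;
import com.google.common.io.Resources;

class ResourceHash {
    static String md5OfResource(String fileName) throws Exception {
        // Locate the resource on the classpath and hash its bytes; only the content matters.
        URL resource = ResourceHash.class.getClassLoader().getResource(fileName);
        HashCode hc = Resources.asByteSource(resource).hash(Hashing.md5());
        return hc.toString();
    }
}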

Change date column to integer

I have a large CSV file as below:
DATE status code value value2
2014-12-13 Shipped 105732491-20091002165230 0.000803398 0.702892835
2014-12-14 Shipped 105732491-20091002165231 0.012925206 1.93748834
2014-12-15 Shipped 105732491-20091002165232 0.000191278 0.004772389
2014-12-16 Shipped 105732491-20091002165233 0.007493046 0.44883348
2014-12-17 Shipped 105732491-20091002165234 0.022015049 3.081006137
2014-12-18 Shipped 105732491-20091002165235 0.001894693 0.227268466
2014-12-19 Shipped 105732491-20091002165236 0.000312871 0.003113062
2014-12-20 Shipped 105732491-20091002165237 0.001754068 0.105016053
2014-12-21 Shipped 105732491-20091002165238 0.009773315 0.585910214
:
:
What I need to do is remove the header and change the date format to an integer yyyymmdd (e.g. 20141217).
I am using opencsv to read and write the file.
Is there a way I can change all the dates at once without parsing them one by one?
Below is my code to remove the header and create a new file:
void formatCsvFile(String fileToChange) throws Exception {
CSVReader reader = new CSVReader(new FileReader(new File(fileToChange)), CSVParser.DEFAULT_SEPARATOR, CSVParser.NULL_CHARACTER, CSVParser.NULL_CHARACTER, 1)
info "Read all rows at once"
List<String[]> allRows = reader.readAll();
CSVWriter writer = new CSVWriter(new FileWriter(fileToChange), CSVWriter.DEFAULT_SEPARATOR, CSVWriter.NO_QUOTE_CHARACTER)
info "Write all rows at once"
writer.writeAll(allRows)
writer.close()
}
Can someone please help?
Thanks
You don't need to parse the dates, but you do need to process each line in the file and convert the data on each line as needed. Java/Groovy doesn't have anything like awk, where you can work with file data as columns, for example the first 10 "columns" (usually characters) of every line in a file. Java/Groovy only deals with "rows" of data in a file, not "columns".
You could try something like this: (in Groovy)
reader.eachLine { String theLine ->
    int idx = theLine.indexOf(' ')
    String oldDate = theLine.substring(0, idx)
    String newDate = oldDate.replaceAll('-', '')
    String newLine = newDate + theLine.substring(idx)
    writer.writeLine(newLine)
}
Edit:
If your CSVReader class is not derived from File, then you can't use Groovy's eachLine method on it. And if the CSVReader class's readAll() method really returns a List of String arrays, then the above code could change to this:
allRows.each { String[] theLine ->
    theLine[0] = theLine[0].replaceAll('-', '')  // strip the dashes from the date
    writer.writeNext(theLine)                    // write the row back out
}
Ignore the first line (the header):
List<String[]> allRows = reader.readAll()[1..-1];
and replace the '-' in the dates by editing the first column of each row:
allRows = allRows.collect { row ->
    row[0] = row[0].replace('-', '')  // the date, with dashes removed
    row                               // the rest of the row is unchanged
}
I don't know what you mean by "all dates at once"; as far as I can tell, they can only be iterated over.
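Putting the pieces together, a complete version of formatCsvFile could look roughly like this (shown in plain Java, since opencsv is a Java library; it assumes the date is always the first column in yyyy-MM-dd form):
import java.io.FileReader;
import java.io.FileWriter;
import java.util.List;

// opencsv 3.x+ package names; older releases use au.com.bytecode.opencsv
import com.opencsv.CSVReader;
import com.opencsv.CSVWriter;

class CsvDateFormatter {
    static void formatCsvFile(String fileToChange) throws Exception {
        List<String[]> allRows;
        try (CSVReader reader = new CSVReader(new FileReader(fileToChange))) {
            allRows = reader.readAll();
        }
        allRows.remove(0); // drop the header row
        for (String[] row : allRows) {
            row[0] = row[0].replace("-", ""); // 2014-12-17 -> 20141217
        }
        try (CSVWriter writer = new CSVWriter(new FileWriter(fileToChange),
                CSVWriter.DEFAULT_SEPARATOR, CSVWriter.NO_QUOTE_CHARACTER,
                CSVWriter.DEFAULT_ESCAPE_CHARACTER, CSVWriter.DEFAULT_LINE_END)) {
            writer.writeAll(allRows);
        }
    }
}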