TextIO.Read().From() vs TextIO.ReadFiles() over withHintMatchesManyFiles() - apache-beam

In my usecase getting set of matching filepattern from Kafka,
PCollection<String> filepatterns = p.apply(KafkaIO.read()...);
Here each pattern could match upto 300+ files.
Q1. How can I use TextIO.Read() to match data from PCollection, as withHintMatchesManyFiles() available only for TextIO.Read() not for TextIO.ReadFiles().
Q2. If path via FileIO.Match->FileIO.ReadMatch()->TextIO.ReadFiles() is used, withHintMatchesManyFiles() isn't available in this path, how it will impact the read performance?
Q3. Is there any other optimized path for above usecase?

Yes, you can't have withHintMatchesManyFiles() with TextIO.ReadFiles() out of the box. Actually, TextIO.Read().withHintMatchesManyFiles() is implemented via FileIO transforms + TextIO.ReadFiles() (see details). In this way, FileIO.readMatches() should distribute the files reading over the workers.
So, I think you can use the same approach while reading file names from Kafka topic.

How can I use TextIO.Read() to match data from PCollection, as withHintMatchesManyFiles() available only for TextIO.Read() not for TextIO.ReadFiles().
My very limited understanding of Apache Beam in general and PTransforms in particular is that TextIO.read() creates a root PTransform that can only be used at the very beginning of the pipeline. In other words, TextIO.Read cannot be used after a PTransform of any kind.

Related

How do you send the record matching certain condition/s to specific/multiple output topics in Spring apache-kafka?

I have referred this. but, this is an old post so i'm looking for a better solution if any.
I have an input topic that contains 'userActivity' data. Now I wish to gather different analytics based on userInterest, userSubscribedGroup, userChoice, etc... produced to distinct output topics from the same Kafka-streams-application.
Could you help me achieve this... ps: This my first time using Kafka-streams so I'm unaware of any other alternatives.
edit:
It's possible that One record matches multiple criteria, in which case the same record should go into those output topics as well.
if(record1 matches criteria1) then... output to topic1;
if(record1 matches criteria2) then ... output to topic2;
and so on.
note: i'm not looking elseIf kind of solution.
For dynamically choosing which topic to send to at runtime based on each record's key-value pairs. Apache Kafka version 2.0 or later introduced a feature called: Dynamic routing
And this is an example of it: https://kafka-tutorials.confluent.io/dynamic-output-topic/confluent.html

Reusing PCollection from output of a Transform in another Transform which is later stage of pipeline

In java or any other Programming we can save state of a variable and refer the variable value later as needed. This seems to be not possible with Apache beam, can someone confirm? If it is possible please point me to some samples or documentation.
I am trying to solve below which needs context of my previous transform output.
I am new to Apache Beam so finding it hard to understand how to solve the above.
Approach#1:
PCollection config = p.apply(ReadDBConfigFn(options.getPath()));
PCollection<Record> records = config.apply(FetchRecordsFn());
PCollection<Users> users = config.apply(FetchUsersFn());
// Now Process using both 'records' and 'users', How can this be done with beam?
Approach#2:
PCollection config = p.apply(ReadDBConfigFn(options.getPath()));
PCollection<Record> records = config.apply(FetchRecordsFn()).apply(FetchUsersAndProcessRecordsFn());
// Above line 'FetchUsersAndProcessRecordsFn' needs 'config' so it can fetch Users but there is seems to be no possible way?
If I understand correctly, you want to use elements from the two collections records and users in a processing step? There are two commonly used patterns in Beam to accomplish this:
If you are looking to join the two collections, you probably want to use a CoGroupByKey to group related records and users together for processing.
If one of the collections (records or users) is "small" and the entire set needs to be available during processing, you might want to send it as a side input to your processing step.
It isn't clear what might be in the PCollection config in your example, so I may have misinterpreted... Does this meet your use case?

Filtering GCS uris before job execution

I have a frequent use case I couldn't solve.
Let's say I have a filepattern like gs://mybucket/mydata/*/files.json where * is supposed to match a date.
Imagine I want to keep 251 dates (this is an example, let's say a big number of dates but without a meta-pattern to match them like 2019* or else).
For now, I have two options :
create a TextIO for every single file, which is overkill and fails almost everytime (graph too large)
read ALL data and then filter it within my job from data : which is also overkill when you have 10 TB of data while you only need 10 Gb for instance
In my case, I would like to just do something like that (pseudo code) :
Read(LIST[uri1,uri2,...,uri251])
And that this instruction actually spawn a single TextIO task on the graph.
I am sorry if I missed something, but I couldn't find a way to do it.
Thanks
Ok I found it, the naming was mileading me :
Example 2: reading a PCollection of filenames.
Pipeline p = ...;
// E.g. the filenames might be computed from other data in the pipeline, or
// read from a data source.
PCollection<String> filenames = ...;
// Read all files in the collection.
PCollection<String> lines =
filenames
.apply(FileIO.matchAll())
.apply(FileIO.readMatches())
.apply(TextIO.readFiles());
(Quoted from Apache Beam documentation https://beam.apache.org/releases/javadoc/2.13.0/org/apache/beam/sdk/io/TextIO.html)
So we need to generate a PCollection of URIS (with Create/of) or to read it from the pipeline, then to match all the uris (or patterns I guess) and the to read all files.

Reading csv file by Flink, scala, addSource and readCsvFile

I'd want to read csv file using by Flink, Scala-language and addSource- and readCsvFile-functions. I have not found any simple examples about that. I have only found: https://github.com/dataArtisans/flink-training-exercises/blob/master/src/main/scala/com/dataartisans/flinktraining/exercises/datastream_scala/cep/LongRides.scala and this too complex for my purpose.
In definition: StreamExecutionEnvironment.addSource(sourceFunction) should i only use readCsvFile as sourceFunction ?
After reading i'd want to use CEP (Complex Event Processing).
readCsvFile() is only available as part of Flink's DataSet (batch) API, and cannot be used with the DataStream (streaming) API. Here's a pretty good example of readCsvFile(), though it's probably not relevant to what you're trying to do.
readTextFile() and readFile() are methods on StreamExecutionEnvironment, and do not implement the SourceFunction interface -- they are not meant to be used with addSource(), but rather instead of it. Here's an example of using readTextFile() to load a CSV using the DataStream API.
Another option is to use the Table API, and a CsvTableSource. Here's an example and some discussion of what it does and doesn't do. If you go this route, you'll need to use StreamTableEnvironment.toAppendStream() to convert your table stream to a DataStream before using CEP.
Keep in mind that all of these approaches will simply read the file once and create a bounded stream from its contents. If you want a source that reads in an unbounded CSV stream, and waits for new rows to be appended, you'll need a different approach. You could use a custom source, or a socketTextStream, or something like Kafka.
If you have a CSV file with 3 fields - String,Long,Integer
then do below
val input=benv.readCsvFile[(String,Long,Integer)]("hdfs:///path/to/your_csv_file")
PS:-I am using flink shell that is why I have benv

Google cloud storage - Split files by value in the file

I need to read a file from Google Cloud storage and split it into multiple files based on transaction_date which is a field in the file. File is about 6 TB in size (broken in to multiple files). What's the most effective ways to achieve this? Do I have to use Dataflow or Dataproc, any other simple way to do this?
I take it to mean that you want to write a separate (sharded) file per transaction_date. There isn't any direct support for this in the TextIO.Write that ships with Dataflow, but since it sounds like you have a special case where you know the date range, so you manually create ~11 different filtered TextIO.Write transforms.
PCollection<Record> input = ...
for (Date transaction_date : known_transaction_dates) {
input.apply(Filter.by(<record has this date>)
.apply(TextIO.Write.to(
String.format("gs://my-bucket/output/%s", transaction_date)));
}
This is certainly not ideal. For BigQueryIO there is a feature to write to a different table based on the windowing of the data - similar functionality added to TextIO might address your use case. Otherwise, data-dependent writes of various sorts are on our radar and include cases like yours.