How can you write to multiple outputs, dependent on the key, using Scalding (/Cascading) in a single MapReduce job? I could of course use .filter for all the possible keys, but that is a horrible hack which will fire up many jobs.
There is TemplatedTsv in Scalding (from version 0.9.0rc16 and up), exactly the same as Cascading's TemplateTsv.
Tsv(args("input"), ('COUNTRY, 'GDP))
.read
.write(TemplatedTsv(args("output"), "%s", 'COUNTRY))
// it will create a directory for each country under "output" path in Hadoop mode.
Use MultipleOutputFormat and extrapolate from these other SO questions to write a custom output class using the output format:
Create Scalding Source like TextLine that combines multiple files into single mappers,
Compress Output Scalding / Cascading TsvCompressed
This suggestion on the Cascading user group suggests using Cascading's TemplateTap. I'm not sure how to connect this to Scalding, though.
Related: I have referred to this, but it is an old post, so I'm looking for a better solution, if any.
I have an input topic that contains 'userActivity' data. Now I wish to gather different analytics based on userInterest, userSubscribedGroup, userChoice, etc., each produced to a distinct output topic from the same Kafka Streams application.
Could you help me achieve this? P.S.: This is my first time using Kafka Streams, so I'm unaware of any other alternatives.
Edit:
It's possible that one record matches multiple criteria, in which case the same record should go to each of those output topics as well.
if(record1 matches criteria1) then... output to topic1;
if(record1 matches criteria2) then ... output to topic2;
and so on.
Note: I'm not looking for an if/else-if kind of solution.
Apache Kafka 2.0 and later let you dynamically choose which topic to send to at runtime, based on each record's key and value; the feature is called dynamic routing.
Here is an example of it: https://kafka-tutorials.confluent.io/dynamic-output-topic/confluent.html
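To make that concrete, here is a minimal sketch of the dynamic-routing approach in Java, built around the TopicNameExtractor that backs this feature in Kafka Streams 2.0+. The application id, bootstrap server, serdes, output topic names, and the chooseTopic rule are all placeholders for your own setup and criteria:

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class UserActivityRouter {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "user-activity-router");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> activity =
                builder.stream("userActivity", Consumed.with(Serdes.String(), Serdes.String()));

        // Dynamic routing: the TopicNameExtractor lambda is invoked per record
        // and returns the destination topic based on the record's key/value.
        activity.to(
                (key, value, recordContext) -> chooseTopic(value),
                Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }

    // Placeholder routing rule -- replace with your userInterest /
    // userSubscribedGroup / userChoice logic.
    private static String chooseTopic(String value) {
        return value.contains("subscribedGroup") ? "group-analytics" : "interest-analytics";
    }
}

Note that a TopicNameExtractor sends each record to exactly one topic. If a single record can match several criteria and must land in several topics, the usual approach is to fan out instead, e.g. one filter(...).to(topic) branch per criterion on the same KStream; each branch is evaluated independently, so there is no if/else-if chain.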
In Java or any other programming language, we can save the state of a variable and refer to its value later as needed. This seems not to be possible with Apache Beam; can someone confirm? If it is possible, please point me to some samples or documentation.
I am trying to solve the problem below, which needs the output of a previous transform as context.
I am new to Apache Beam, so I am finding it hard to understand how to solve this.
Approach #1:
PCollection config = p.apply(ReadDBConfigFn(options.getPath()));
PCollection<Record> records = config.apply(FetchRecordsFn());
PCollection<Users> users = config.apply(FetchUsersFn());
// Now process using both 'records' and 'users'. How can this be done with Beam?
Approach #2:
PCollection config = p.apply(ReadDBConfigFn(options.getPath()));
PCollection<Record> records = config.apply(FetchRecordsFn()).apply(FetchUsersAndProcessRecordsFn());
// The line above: 'FetchUsersAndProcessRecordsFn' needs 'config' so it can fetch Users, but there seems to be no way to pass it in?
If I understand correctly, you want to use elements from the two collections, records and users, in a processing step? There are two commonly used patterns in Beam to accomplish this:
If you are looking to join the two collections, you probably want to use a CoGroupByKey to group related records and users together for processing.
If one of the collections (records or users) is "small" and the entire set needs to be available during processing, you might want to send it as a side input to your processing step.
It isn't clear what might be in the PCollection config in your example, so I may have misinterpreted... Does this meet your use case?
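As a rough sketch of the side-input pattern applied to your Approach #1, in Java and assuming users is small enough to be broadcast to every worker; the Result type and the process(...) helper are placeholders for whatever your processing step actually does:

import java.util.List;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

// Materialize the 'users' collection as a side input (a view over the whole collection).
PCollectionView<List<Users>> usersView = users.apply(View.asList());

PCollection<Result> results = records.apply(
        ParDo.of(new DoFn<Record, Result>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                // The full 'users' list is available here, alongside each record.
                List<Users> allUsers = c.sideInput(usersView);
                Record record = c.element();
                c.output(process(record, allUsers)); // placeholder processing logic
            }
        }).withSideInputs(usersView));

If neither collection is small, the CoGroupByKey route is the usual alternative: key both records and users by a common key and process the grouped values together.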
I'd like to read a CSV file with Flink in Scala, using the addSource and readCsvFile functions. I have not found any simple examples of that. I have only found https://github.com/dataArtisans/flink-training-exercises/blob/master/src/main/scala/com/dataartisans/flinktraining/exercises/datastream_scala/cep/LongRides.scala, and it is too complex for my purpose.
Given the definition StreamExecutionEnvironment.addSource(sourceFunction), should I just use readCsvFile as the sourceFunction?
After reading, I'd like to use CEP (Complex Event Processing).
readCsvFile() is only available as part of Flink's DataSet (batch) API, and cannot be used with the DataStream (streaming) API. Here's a pretty good example of readCsvFile(), though it's probably not relevant to what you're trying to do.
readTextFile() and readFile() are methods on StreamExecutionEnvironment, and do not implement the SourceFunction interface -- they are not meant to be used with addSource(), but rather instead of it. Here's an example of using readTextFile() to load a CSV using the DataStream API.
Another option is to use the Table API, and a CsvTableSource. Here's an example and some discussion of what it does and doesn't do. If you go this route, you'll need to use StreamTableEnvironment.toAppendStream() to convert your table stream to a DataStream before using CEP.
Keep in mind that all of these approaches will simply read the file once and create a bounded stream from its contents. If you want a source that reads in an unbounded CSV stream, and waits for new rows to be appended, you'll need a different approach. You could use a custom source, or a socketTextStream, or something like Kafka.
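As a minimal illustration of the readTextFile() route described above (sketched in Java; the Scala DataStream API is analogous), assuming a simple comma-separated file with three fields and a placeholder path:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CsvToDataStream {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // readTextFile() yields a (bounded) DataStream<String>, one element per line;
        // each line is then parsed into a Tuple3 of (String, Long, Integer).
        DataStream<Tuple3<String, Long, Integer>> rows = env
                .readTextFile("file:///path/to/your_csv_file")
                .map(new MapFunction<String, Tuple3<String, Long, Integer>>() {
                    @Override
                    public Tuple3<String, Long, Integer> map(String line) {
                        String[] fields = line.split(",");
                        return Tuple3.of(fields[0], Long.parseLong(fields[1]), Integer.parseInt(fields[2]));
                    }
                });

        // 'rows' is an ordinary DataStream, so it can be handed to CEP.pattern(rows, pattern).
        rows.print();
        env.execute("csv-to-datastream");
    }
}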
If you have a CSV file with 3 fields (String, Long, Integer), then do the following:
val input = benv.readCsvFile[(String, Long, Integer)]("hdfs:///path/to/your_csv_file")
P.S.: I am using the Flink shell; that is why I have benv.
I am trying to load a CSV file in Scala using Spark. I see that we can do it using the two different syntaxes below:
sqlContext.read.format("csv").options(option).load(path)
sqlContext.read.options(option).csv(path)
What is the difference between these two, and which gives better performance?
Thanks
There's no difference.
So why do both exist?
The .format(fmt).load(path) method is a flexible, pluggable API that allows adding more formats without having to re-compile Spark: you can register aliases for custom Data Source implementations and have Spark use them. "csv" used to be such a custom implementation (outside the packaged Spark binaries), but it is now part of the project.
There are shorthand methods for "built-in" data sources (like csv, parquet, json...) which make the code a bit simpler (and are verified at compile time).
Eventually, they both create a CSV Data Source and use it to load the data.
Bottom line, for any supported format, you should opt for the "shorthand" method, e.g. csv(path).
I need to read a file from Google Cloud Storage and split it into multiple files based on transaction_date, which is a field in the file. The file is about 6 TB in size (broken into multiple files). What's the most effective way to achieve this? Do I have to use Dataflow or Dataproc, or is there another simple way to do this?
I take it to mean that you want to write a separate (sharded) file per transaction_date. There isn't any direct support for this in the TextIO.Write that ships with Dataflow, but it sounds like you have a special case where you know the date range, so you can manually create ~11 different filtered TextIO.Write transforms:
PCollection<Record> input = ...
for (Date transaction_date : known_transaction_dates) {
  input
      .apply(Filter.by(<record has this date>))
      .apply(TextIO.Write.to(
          String.format("gs://my-bucket/output/%s", transaction_date)));
}
This is certainly not ideal. For BigQueryIO there is a feature to write to a different table based on the windowing of the data - similar functionality added to TextIO might address your use case. Otherwise, data-dependent writes of various sorts are on our radar and include cases like yours.