Side input for a pcollection - apache-beam

I want to set a string value (the JSON schema) dynamically in a pipeline.
The value comes from another pipeline, as a PCollection<String>.
row.apply("Inserting to BQ table", BigQueryIO.writeTableRows()
//facing issues here
.withJsonSchema("")
.withCreateDisposition(
BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.withWriteDisposition(
BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
// .withExtendedErrorInfo()
// .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
.withMethod(BigQueryIO.Write.Method.FILE_LOADS)
.withTriggeringFrequency(Duration.standardMinutes(1))
// .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
.withOptimizedWrites().withNumFileShards(5)
.to(options.getBqTargetTableName()));


How to sequence a task to execute once all Single's in a collection complete

I am using Helidon DBClient transactions and have found myself in a situation where I end up with a list of Singles, List<Single<T>> and want to perform the next task only after completing all of the singles.
I am looking for something equivalent to CompletableFuture.allOf() but for Single.
I could map each of the single toCompletableFuture() and then do a CompletableFuture.allOf() on top, but is there a better way? Could someone point me in the right direction with this?
--
Why did I end up with a List<Single>?
I have a collection of POJOs which I turn into named insert .execute() all within an open transaction. Since I .stream() the original collection and perform inserts using the .map() operator, I end up with a List when I terminate the stream to collect a List. None of the inserts might have actually been executed. At this point, I want to wait until all of the Singles have been completed before I proceed to the next stage.
This is something I would naturally do with a CompletableFuture.allOf(), but I do not want to change the API dialect for just this and stick to Single/Multi.
Single.flatMap, Single.flatMapSingle, Multi.flatMap will effectively inline the future represented by the publisher passed as argument.
You can convert a List<Single<T>> to Single<List<T>> like this:
List<Single<Integer>> listOfSingle = List.of(Single.just(1), Single.just(2));
Single<List<Integer>> singleOfList = Multi.just(listOfSingle)
.flatMap(Function.identity())
.collectList();
Things can be tricky when you are dealing with Single<Void> as Void cannot be instantiated and null is not a valid value (i.e. Single.just(null) throws a NullPointerException).
// convert List<Single<Void>> to Single<List<Void>>
Single<List<Void>> listSingle =
Multi.just(List.of(Single.<Void>empty(), Single.<Void>empty()))
.flatMap(Function.identity())
.collectList();
// convert Single<List<Void>> to Single<Void>
// Void cannot be instantiated, it needs to be casted from null
// BUT null is not a valid value...
Single<Void> single = listSingle.toOptionalSingle()
// convert Single<List<Void>> to Single<Optional<List<Void>>>
// then Use Optional.map to convert Optional<List<Void>> to Optional<Void>
.map(o -> o.map(i -> (Void) null))
// convert Single<Optional<Void>> to Single<Void>
.flatMapOptional(Function.identity());
// Make sure it works
single.forSingle(o -> System.out.println("ok"))
.await();
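For comparison, the CompletableFuture.allOf() route mentioned in the question can be sketched with the JDK alone. This is only an illustration: the class name AllOfSketch and the helper sequence are made up for the example, and the futures here are created with supplyAsync rather than obtained from Single.toCompletableFuture().

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

final class AllOfSketch {

    // Turns a list of futures into one future that completes with all results,
    // in the same order as the input list. (Hypothetical helper name.)
    static <T> CompletableFuture<List<T>> sequence(List<CompletableFuture<T>> futures) {
        CompletableFuture<Void> allDone =
                CompletableFuture.allOf(futures.toArray(new CompletableFuture<?>[0]));
        // join() does not block here: allDone only completes once every future has completed.
        return allDone.thenApply(ignored ->
                futures.stream().map(CompletableFuture::join).collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        List<CompletableFuture<Integer>> inserts = List.of(
                CompletableFuture.supplyAsync(() -> 1),
                CompletableFuture.supplyAsync(() -> 2));
        System.out.println(sequence(inserts).join()); // prints [1, 2]
    }
}
```

If any of the futures fails, the combined future completes exceptionally, so the next stage only runs when every insert has succeeded.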

How to export a CSV file to a BigQuery table using Java Dataflow?

I want to read a CSV file from a cloud bucket and write it to a BigQuery table with columns, using Dataflow in Java. How can I apply the headers of the CSV file when writing to BigQuery?
There are two issues to solve here:
1. Skipping the header when reading the data, and
2. Using the header to correctly populate the BigQuery table columns.
For (1) this is, as of June 2019, not implemented natively, though you could try the options listed at Skipping header rows - is it possible with Cloud DataFlow?. For (2) the easiest would be to read the first line of your CSV in your main program, and pass the list of column names in the constructor to a DoFn that converts CSV lines into TableRow objects ready to write to Bigquery.
Your final program would look something like
public void CsvToBigquery(String csvInputPattern, String bigqueryTable) {
// Read the header once, in the main program, before the pipeline is built
final String[] columns = readAndSplitFirstLineOfFirstFile(csvInputPattern);
Pipeline p = Pipeline.create(...);
p
.apply(TextIO.read().from(csvInputPattern))
// Drop the header line from the data
.apply(Filter.by(new MatchIfNonHeader()))
.apply(ParDo.of(new DoFn<String, TableRow>() {
... // use columns here to build TableRows
}))
.apply(BigQueryIO.writeTableRows().to(bigqueryTable)...);
}
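The part elided inside the DoFn above, pairing the column names with the values of each line, could look roughly like the following. This is a sketch under assumptions: a plain Map stands in for TableRow so that the example runs without Beam on the classpath, the names CsvRowMapper, toRow and isNotHeader are invented for illustration, and the naive split does not handle quoted fields that contain commas.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CsvRowMapper {

    // Pairs each header name with the value at the same position in the line.
    // Naive split: quoted fields containing commas are NOT supported.
    static Map<String, String> toRow(String[] columns, String line) {
        String[] values = line.split(",", -1); // -1 keeps trailing empty fields
        if (values.length != columns.length) {
            throw new IllegalArgumentException(
                    "Expected " + columns.length + " fields but got " + values.length);
        }
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            row.put(columns[i], values[i].trim());
        }
        return row;
    }

    // True for every line except the header itself (the role of the filter step).
    static boolean isNotHeader(String headerLine, String line) {
        return !line.equals(headerLine);
    }

    public static void main(String[] args) {
        String header = "id,name,city";
        String[] columns = header.split(",");
        System.out.println(toRow(columns, "1,Ada,London")); // prints {id=1, name=Ada, city=London}
    }
}
```

In the real pipeline the same loop would call TableRow.set(column, value) instead of filling a Map.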
I've done a similar task and used the Apache Commons CSV library in a ParDo function to extract the data from the CSV files, then converted the records to TableRow objects for BQ.
String fileData = c.element();
BufferedReader fileReader = new BufferedReader(new InputStreamReader(
new ByteArrayInputStream(fileData.getBytes("UTF-8")), "UTF-8"));
CSVParser csvParser = new CSVParser(fileReader,CSVFormat.DEFAULT.withFirstRecordAsHeader().withIgnoreHeaderCase().withTrim());
Iterable<CSVRecord> csvRecords = csvParser.getRecords();
for (CSVRecord csvRecord : csvRecords) {
TableRow row = new TableRow();
checkAndConvertIntoBqDataType(csvRecord.toMap());
c.output(row);
}

reading files and folders in order with apache beam

I have a folder structure of the type year/month/day/hour/*, and I'd like Beam to read this as an unbounded source in chronological order. Specifically, this means reading in all the files in the first hour on record and adding their contents for processing. Then, add the file contents of the next hour for processing, up until the current time, where it waits for new files to arrive in the latest year/month/day/hour folder.
Is it possible to do this with apache beam?
So what I would do is to add timestamps to each element according to the file path. As a test I used the following example.
First of all, as explained in this answer, you can use FileIO to match continuously a file pattern. This will help as, per your use case, once you have finished with the backfill you want to keep reading new arriving files within the same job. In this case I provide gs://BUCKET_NAME/data/** because my files will be like gs://BUCKET_NAME/data/year/month/day/hour/filename.extension:
p
.apply(FileIO.match()
.filepattern(inputPath)
.continuously(
// Check for new files every minute
Duration.standardMinutes(1),
// Never stop checking for new files
Watch.Growth.<String>never()))
.apply(FileIO.readMatches())
Watch frequency and timeout can be adjusted at will.
Then, in the next step we'll receive the matched file. I will use ReadableFile.getMetadata().resourceId() to get the full path and split it by "/" to build the corresponding timestamp. I round it to the hour and do not account for timezone correction here. With readFullyAsUTF8String we'll read the whole file (be careful if the whole file does not fit into memory, it is recommended to shard your input if needed) and split it into lines. With ProcessContext.outputWithTimestamp we'll emit downstream a KV of filename and line (the filename is not needed anymore but it will help to see where each file comes from) and the timestamp derived from the path. Note that we're shifting timestamps "back in time" so this can mess up with the watermark heuristics and you will get a message such as:
Cannot output with timestamp 2019-03-17T00:00:00.000Z. Output timestamps must be no earlier than the timestamp of the current input (2019-06-05T15:41:29.645Z) minus the allowed skew (0 milliseconds). See the DoFn#getAllowedTimestampSkew() Javadoc for details on changing the allowed skew.
To overcome this I set getAllowedTimestampSkew to Long.MAX_VALUE but take into account that this is deprecated. ParDo code:
.apply("Add Timestamps", ParDo.of(new DoFn<ReadableFile, KV<String, String>>() {
@Override
public Duration getAllowedTimestampSkew() {
return new Duration(Long.MAX_VALUE);
}
@ProcessElement
public void processElement(ProcessContext c) {
ReadableFile file = c.element();
String fileName = file.getMetadata().resourceId().toString();
String lines[];
String[] dateFields = fileName.split("/");
Integer numElements = dateFields.length;
String hour = dateFields[numElements - 2];
String day = dateFields[numElements - 3];
String month = dateFields[numElements - 4];
String year = dateFields[numElements - 5];
String ts = String.format("%s-%s-%s %s:00:00", year, month, day, hour);
Log.info(ts);
try{
lines = file.readFullyAsUTF8String().split("\n");
for (String line : lines) {
c.outputWithTimestamp(KV.of(fileName, line), new Instant(dateTimeFormat.parseMillis(ts)));
}
}
catch(IOException e){
Log.info("failed");
}
}}))
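The path-to-timestamp step can be tried in isolation, without Beam. The sketch below uses java.time instead of the Joda-Time formatter from the snippet above, assumes UTC (as noted, no timezone correction is applied), and the class name PathTimestamp is made up for the example.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

final class PathTimestamp {

    // Expects .../year/month/day/hour/filename and returns the start of that hour in UTC.
    static Instant fromPath(String path) {
        String[] parts = path.split("/");
        int n = parts.length;
        if (n < 5) {
            throw new IllegalArgumentException("Path too short: " + path);
        }
        // Same indexing as in the DoFn: the last element is the file name.
        int hour = Integer.parseInt(parts[n - 2]);
        int day = Integer.parseInt(parts[n - 3]);
        int month = Integer.parseInt(parts[n - 4]);
        int year = Integer.parseInt(parts[n - 5]);
        return LocalDateTime.of(year, month, day, hour, 0).toInstant(ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        // Hypothetical bucket name, following the layout used in this answer.
        System.out.println(fromPath("gs://my-bucket/data/2019/03/17/00/file1"));
        // prints 2019-03-17T00:00:00Z
    }
}
```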
Finally, I window into 1-hour FixedWindows and log the results:
.apply(Window
.<KV<String,String>>into(FixedWindows.of(Duration.standardHours(1)))
.triggering(AfterWatermark.pastEndOfWindow())
.discardingFiredPanes()
.withAllowedLateness(Duration.ZERO))
.apply("Log results", ParDo.of(new DoFn<KV<String, String>, Void>() {
@ProcessElement
public void processElement(ProcessContext c, BoundedWindow window) {
String file = c.element().getKey();
String value = c.element().getValue();
String eventTime = c.timestamp().toString();
String logString = String.format("File=%s, Line=%s, Event Time=%s, Window=%s", file, value, eventTime, window.toString());
Log.info(logString);
}
}));
For me it worked with .withAllowedLateness(Duration.ZERO), but depending on the order in which the files arrive you might need to set a larger value. Keep in mind that a value that is too high will keep windows open for longer and use more persistent storage.
I set the $BUCKET and $PROJECT variables and I just upload two files:
gsutil cp file1 gs://$BUCKET/data/2019/03/17/00/
gsutil cp file2 gs://$BUCKET/data/2019/03/18/22/
And run the job with:
mvn -Pdataflow-runner compile -e exec:java \
-Dexec.mainClass=com.dataflow.samples.ChronologicalOrder \
-Dexec.args="--project=$PROJECT \
--path=gs://$BUCKET/data/** \
--stagingLocation=gs://$BUCKET/staging/ \
--runner=DataflowRunner"
Let me know how this works. This was just an example to get started, and you might need to adjust the windowing and triggering strategies, lateness, etc. to suit your use case.

Spring Batch - Comma separated values - Save in Data Base

I have a file which contains list of values (user IDs) separated by comma(“,”) as follows.
111, 222, 333, 444, 555, 777 …………
The file contains millions of such records and I wanted to save these values into a single column in a table in RDBMS.
I tried to use DelimitedLineTokenizer for parsing data.
The issue is that DelimitedLineTokenizer considers only one entry per line, and the rest of the values are ignored. The first entry ("111") is saved and the remaining values on the same line are dropped. If there is a second line, its first element is saved and the rest are again ignored.
Is there a way to tokenize all the comma separated values from a single line and save all of them into DB?
The query is as follows.
INSERT INTO users (id) VALUES (:userid)
I used the following code to parse the file and save it in DB.
public FlatFileItemReader<User> reader() {
FlatFileItemReader<User> reader = new FlatFileItemReader<User>();
DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer(",");
tokenizer.setNames(new String[] {"userid"});
// blah…blah….blah….
reader.setLineMapper(new DefaultLineMapper<User>() {
{
setLineTokenizer(tokenizer);
setFieldSetMapper(new BeanWrapperFieldSetMapper<User>() {
{
setTargetType(User.class);
}
});
}
});
return reader;
}
@Bean
public UserItemProcessor processor() {
return new UserItemProcessor();
}
@Bean
public Job importUserJob(JobCompletionNotificationListener listener) {
return jobBuilderFactory.get("importUserJob").incrementer(new RunIdIncrementer()).listener(listener)
.flow(step1()).end().build();
}
@Bean
public Step step1() {
return stepBuilderFactory.get("step1").<User, User> chunk(5).reader(reader()).processor(processor())
.writer(writer()).build();
}
Basically, you have two delimiters for the target object: comma and newline. So either you write a custom reader that works with both delimiters, or you preprocess your file to bring it into a standard format.
In my opinion, you are better off preprocessing your file to replace every comma with a newline character.
You can retain the original file as is and create the preprocessed data in a new temporary file.
You can either do that as a separate Spring Batch step (not recommended due to the file size) or, if it's going to be a scheduled job, in your kick-off script.
Replace comma with newline in java
How to break lines at a specific character in Notepad++?
Notepad++ find and replace string with a new-line
Replace comma with new line in a text file using tr in Linux
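If the preprocessing is done in Java, a streaming pass keeps memory use flat even for millions of IDs. The following is a minimal sketch using only the JDK; the class name CommaToNewline and the helper roundTrip (which exists only to demonstrate the result on temporary files) are assumptions made for the example.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Scanner;

final class CommaToNewline {

    // Streams tokens from source and writes one trimmed, non-empty token per line.
    static void preprocess(Path source, Path target) throws IOException {
        try (Scanner scanner = new Scanner(source, StandardCharsets.UTF_8.name());
             BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            scanner.useDelimiter("[,\\r\\n]+"); // commas and line breaks both end a token
            while (scanner.hasNext()) {
                String token = scanner.next().trim();
                if (!token.isEmpty()) {
                    writer.write(token);
                    writer.newLine();
                }
            }
        }
    }

    // Demo helper: writes content to a temp file, preprocesses it, returns the output lines.
    static List<String> roundTrip(String content) {
        try {
            Path source = Files.createTempFile("ids", ".csv");
            Path target = Files.createTempFile("ids", ".txt");
            Files.write(source, content.getBytes(StandardCharsets.UTF_8));
            preprocess(source, target);
            return Files.readAllLines(target, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("111, 222, 333\n444, 555")); // prints [111, 222, 333, 444, 555]
    }
}
```

The resulting file has one user ID per line, which is the shape a FlatFileItemReader with a single-field tokenizer expects.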

Empty data while reading data from kafka using Trident Topology

I am new to Trident. I am writing a trident topology which reads data from kafka. Topic name is 'test'. I have local kafka setup. I started zookeeper, kafka in local. And created a topic 'test' in kafka and opened the producer and typed the message 'Hello Kafka!'.
I want to read the message 'Hello Kafka' from the 'test' topic using trident.
Below is my code. I am getting empty tuple.
TridentTopology topology = new TridentTopology();
BrokerHosts brokerHosts = new ZkHosts("localhost:2181");
TridentKafkaConfig kafkaConfig = new TridentKafkaConfig(brokerHosts, "test");
kafkaConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
kafkaConfig.bufferSizeBytes = 1024 * 1024 * 4;
kafkaConfig.fetchSizeBytes = 1024 * 1024 * 4;
kafkaConfig.forceFromStart = false;
OpaqueTridentKafkaSpout opaqueTridentKafkaSpout = new OpaqueTridentKafkaSpout(kafkaConfig);
topology.newStream("TestSpout", opaqueTridentKafkaSpout).parallelismHint(1)
.each(new Fields(), new TestFilter()).parallelismHint(1)
.each(new Fields(), new Utils.PrintFilter());
and this is my TestFilter class code
public TestFilter()
{
//
}
@Override
public boolean isKeep(TridentTuple tuple) {
boolean isKeep=true;
System.out.println("TestFilter is called...");
if (tuple != null && tuple.getValues().size()>0) {
System.out.println("data from kafka ::: "+tuple.getValues());
}
return isKeep;
}
Whenever I type a message in the Kafka producer for the 'test' topic, the first sysout is printed, but execution never enters the if block. I only get the message 'TestFilter is called...' and nothing more.
I want to get the actual data I produced to the 'test' topic. How?
The problem lies in the parameters to Stream.each. The relevant portion of the javadoc for the method is:
each(Fields inputFields, Filter filter)
The documentation isn't too clear about it, but the semantics are that you should specify all the fields used by your filter via the inputFields parameter.
Storm will then apply a projection on the input tuple and forward it to the filter.
Given that you didn't specify any input fields, the projection resulted in an empty tuple, which is why the tuple.getValues().size()>0 condition inside the filter fails.
It's also worth mentioning the other variants of each:
each(Fields inputFields, Function function, Fields functionFields)
each(Function function, Fields functionFields)
These apply the provided function to the projection of the input tuple and append the resulting tuple to the original input tuple, naming the new fields as given by functionFields (i.e. the projection is only used for applying the function).
In particular, the second version is equivalent to invoking each with inputFields set to null (or new Fields()) and results in an empty tuple being passed to the function.
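The projection behaviour described above can be illustrated with a toy model in plain Java. This is not Storm's implementation: a Map stands in for the tuple, and the class name ProjectionModel is invented for the example. The field name "str" is an assumption as well; I believe it is what storm-kafka's StringScheme declares, but check the scheme's getOutputFields() for your version.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ProjectionModel {

    // Toy stand-in for the projection step: returns the values of the requested
    // fields, in the requested order. No fields requested means an empty result.
    static List<Object> project(Map<String, ?> tuple, List<String> inputFields) {
        List<Object> projected = new ArrayList<>();
        for (String field : inputFields) {
            projected.add(tuple.get(field));
        }
        return projected;
    }

    public static void main(String[] args) {
        Map<String, Object> tuple = new LinkedHashMap<>();
        tuple.put("str", "Hello Kafka!"); // "str" is an assumed field name

        // Like each(new Fields(), filter): the filter sees nothing.
        System.out.println(project(tuple, List.of()));      // prints []
        // Like each(new Fields("str"), filter): the filter sees the message.
        System.out.println(project(tuple, List.of("str"))); // prints [Hello Kafka!]
    }
}
```

Under that assumption, the fix in the topology would be to pass new Fields("str") instead of new Fields() to each.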