Vert.x merge contents of multiple files into a single file - vert.x

What is the best way to append the contents of multiple files into a single file in Vert.x? I have tried the Vert.x FileSystem and AsyncFile APIs, but neither has an option to append to a file, or I did not find one. Is there an alternative approach to merge or append files asynchronously in Vert.x?
The only solution I could find is to build a list of buffers and, in a loop, write each buffer at the offset where the previous one ended.

Indeed, as of Vert.x 3.4, there is no helper method on FileSystem to append a file to another file.
You could do it with AsyncFile and Pump as follows.
First create a utility method to open files:
Future<AsyncFile> openFile(FileSystem fileSystem, String path, OpenOptions openOptions) {
  Future<AsyncFile> future = Future.future();
  fileSystem.open(path, openOptions, future);
  return future;
}
Then another one to append a file to another file with a Pump:
Future<AsyncFile> append(AsyncFile source, AsyncFile destination) {
  Future<AsyncFile> future = Future.future();
  Pump pump = Pump.pump(source, destination);
  source.exceptionHandler(future::fail);
  destination.exceptionHandler(future::fail);
  source.endHandler(v -> future.complete(destination));
  pump.start();
  return future;
}
Now you can combine those with sequential composition:
void merge(FileSystem fileSystem, String output, List<String> sources) {
  openFile(fileSystem, output, new OpenOptions().setCreate(true).setTruncateExisting(true).setWrite(true)).compose(outFile -> {
    Future<AsyncFile> mergeFuture = null;
    for (String source : sources) {
      if (mergeFuture == null) {
        mergeFuture = openFile(fileSystem, source, new OpenOptions()).compose(sourceFile -> {
          return append(sourceFile, outFile);
        });
      } else {
        mergeFuture = mergeFuture.compose(v -> {
          return openFile(fileSystem, source, new OpenOptions()).compose(sourceFile -> {
            return append(sourceFile, outFile);
          });
        });
      }
    }
    return mergeFuture;
  }).setHandler(ar -> {
    System.out.println("Done");
  });
}
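For completeness, here is a minimal usage sketch; the verticle, the output name merged.txt, and the part file names are placeholders, and it assumes the three helper methods above are defined in the same class:
import java.util.Arrays;
import io.vertx.core.AbstractVerticle;

public class MergeVerticle extends AbstractVerticle {
  @Override
  public void start() {
    // Calls the merge() helper defined above; file names are placeholders.
    merge(vertx.fileSystem(), "merged.txt",
        Arrays.asList("part1.txt", "part2.txt", "part3.txt"));
  }
}
Depending on your needs, you may also want to close outFile once the composed future completes.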

Related

FlowableEmitter fails to emit the items

I have a program which polls two directories, dir1 and dir2, and emits the file as soon as it arrives in either of the two directories. But when files arrive at the same time in both directories, some files from one or the other directory fail to get emitted.
The code is as follows:
Flowable.create((FlowableEmitter<Path> em) -> pollDirectory(Arrays.asList(dir1,dir2), em),
BackpressureStrategy.BUFFER).subscribe();
The method pollDirectory has the below code:
for (Path directory : pathList) {
  FileAlterationObserver fao = new FileAlterationObserver(directory);
  fao.addListener(new FileAlterationListenerImpl(emitter));
  final FileAlterationMonitor monitor = new FileAlterationMonitor(5000);
  monitor.addObserver(fao);
  try {
    monitor.start();
  } catch (Exception e) {
    handleException(e);
  }
}
The FileAlterationListenerImpl class emits the path on file creation and code is as follows:
public class FileAlterationListenerImpl implements FileAlterationListener {
  FlowableEmitter<Path> source;

  public FileAlterationListenerImpl(FlowableEmitter<Path> emitter) {
    super();
    this.source = emitter;
  }

  @Override
  public void onFileCreate(final File file) {
    source.onNext(file.toPath());
  }
}
Is there any way in RxJava to handle this scenario so that the emitter emits files from both directories even if the files arrive at the same time?
Each directory gets its own FileAlterationMonitor, and each monitor runs a thread that triggers the emissions, so with two or more directories you end up with concurrent onNext invocations.
The FlowableEmitter javadoc indicates this is not safe, and you need to use serialize() to ensure thread safety.
Flowable.create((FlowableEmitter<Path> em) ->
pollDirectory(Arrays.asList(dir1,dir2), em.serialize()),
BackpressureStrategy.BUFFER
)
.subscribe();

How to generate output files for each input in Apache Flink

I'm using Flink to process my streaming data.
The stream comes from some other middleware, such as Kafka, Pravega, etc.
Say that Pravega is sending a word stream, such as hello world my name is....
What I need is a three-step process:
Map each word to my custom class object MyJson.
Map the object MyJson to String.
Write Strings to files: one String is written to one file.
For example, for the stream hello world my name is, I should get five files.
Here is my code:
// init Pravega connector
PravegaDeserializationSchema<String> adapter = new PravegaDeserializationSchema<>(String.class, new JavaSerializer<>());
FlinkPravegaReader<String> source = FlinkPravegaReader.<String>builder()
    .withPravegaConfig(pravegaConfig)
    .forStream(stream)
    .withDeserializationSchema(adapter)
    .build();

// map stream to MyJson
DataStream<MyJson> jsonStream = env.addSource(source).name("Pravega Stream")
    .map(new MapFunction<String, MyJson>() {
      @Override
      public MyJson map(String s) throws Exception {
        MyJson myJson = JSON.parseObject(s, MyJson.class);
        return myJson;
      }
    });

// map MyJson to String
DataStream<String> valueInJson = jsonStream
    .map(new MapFunction<MyJson, String>() {
      @Override
      public String map(MyJson myJson) throws Exception {
        return myJson.toString();
      }
    });

// output
valueInJson.print();
This code outputs all of the results to the Flink log files.
My question is: how do I write one word to one output file?
I think the easiest way to do this would be with a custom sink.
stream.addSink(new WordFileSink());
public static class WordFileSink implements SinkFunction<String> {
  @Override
  public void invoke(String value, Context context) {
    // generate a unique name for the new file and open it
    // write the word to the file
    // close the file
  }
}
Note that this implementation won't necessarily provide exactly once behavior. You might want to take care that the file naming scheme is both unique and deterministic (rather than depending on processing time), and be prepared for the case that the file may already exist.
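For illustration, here is one hedged way the stub above could be filled in; the /tmp/words output directory, the word-based naming scheme, and the use of java.nio are assumptions, not part of the original answer:
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

public class WordFileSink implements SinkFunction<String> {
  @Override
  public void invoke(String value, Context context) throws Exception {
    // Deterministic name derived from the word itself (placeholder scheme;
    // adjust it if the same word can occur more than once).
    Path path = Paths.get("/tmp/words", value + ".txt");
    Files.createDirectories(path.getParent());
    // Overwrite if the file already exists, so a replay produces the same result.
    Files.write(path, value.getBytes(StandardCharsets.UTF_8));
  }
}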

VS Code task process stdout/stderr

I'm attempting to write some tests for a VS Code extension.
The extension basically creates some tasks, using ShellExecution to run a local executable file, for example:
new Task(
  definition,
  folder,
  name,
  source,
  new ShellExecution('./runme', { cwd })
);
I would like to be able to test the shell process, but don't have access to this process and so cannot attach to any of the output streams nor get the exit code.
In my tests, I execute the task like so: await vscode.tasks.executeTask(task); which runs successfully regardless of the exit code of the process created by ShellExecution.
Is there any way I can get access to the child process generated from executing a task?
With Node.js' child_process this is simple to do. I use it to run an external Java jar and capture its output to get the errors. The main part is:
let java = child_process.spawn("java", parameters, spawnOptions);
let buffer = "";

java.stderr.on("data", (data) => {
  let text = data.toString();
  if (text.startsWith("Picked up _JAVA_OPTIONS:")) {
    let endOfInfo = text.indexOf("\n");
    if (endOfInfo == -1) {
      text = "";
    } else {
      text = text.substr(endOfInfo + 1, text.length);
    }
  }
  if (text.length > 0) {
    buffer += "\n" + text;
  }
});

java.on("close", (code) => {
  let parser = new ErrorParser(dependencies);
  if (parser.convertErrorsToDiagnostics(buffer)) {
    thisRef.setupInterpreters(options.outputDir);
    resolve(fileList);
  } else {
    reject(buffer); // Treat this as non-grammar error (e.g. Java exception).
  }
});

How to do Async Http Call with Apache Beam (Java)?

The input PCollection is HTTP requests, which is a bounded dataset. I want to make an async HTTP call (Java) in a ParDo, parse the response, and put the results into an output PCollection. My code is below. I am getting the following exception.
I couldn't figure out the reason and need some guidance.
java.util.concurrent.CompletionException: java.lang.IllegalStateException: Can't add element ValueInGlobalWindow{value=streaming.mapserver.backfill.EnrichedPoint#2c59e, pane=PaneInfo.NO_FIRING} to committed bundle in PCollection Call Map Server With Rate Throttle/ParMultiDo(ProcessRequests).output [PCollection]
Code:
public class ProcessRequestsFn extends DoFn<PreparedRequest, EnrichedPoint> {

  private static AsyncHttpClient _HttpClientAsync;
  private static ExecutorService _ExecutorService;

  static {
    AsyncHttpClientConfig cg = config()
        .setKeepAlive(true)
        .setDisableHttpsEndpointIdentificationAlgorithm(true)
        .setUseInsecureTrustManager(true)
        .addRequestFilter(new RateLimitedThrottleRequestFilter(100, 1000))
        .build();
    _HttpClientAsync = asyncHttpClient(cg);
    _ExecutorService = Executors.newCachedThreadPool();
  }

  @DoFn.ProcessElement
  public void processElement(ProcessContext c) {
    PreparedRequest request = c.element();
    if (request == null)
      return;
    _HttpClientAsync.prepareGet(request.getRequest())
        .execute()
        .toCompletableFuture()
        .thenApply(response -> {
          if (response.getStatusCode() == HttpStatusCodes.STATUS_CODE_OK) {
            return response.getResponseBody();
          }
          return null;
        })
        .thenApply(responseBody -> {
          List<EnrichedPoint> resList = new ArrayList<>();
          /* some processing logic here */
          System.out.printf("%d enriched points back\n", resList.size());
          return resList;
        })
        .thenAccept(resList -> {
          for (EnrichedPoint enrichedPoint : resList) {
            c.output(enrichedPoint);
          }
        })
        .exceptionally(ex -> {
          System.out.println(ex);
          return null;
        });
  }
}
The Scio library implements a DoFn which deals with asynchronous operations. The BaseAsyncDoFn might provide you the handling you need. Since you're dealing with CompletableFuture also take a look at the JavaAsyncDoFn.
Note that you don't necessarily need to use the Scio library; you can take the main idea of the BaseAsyncDoFn, since it's independent of the rest of the Scio library.
The issue that you're hitting is that you're outputting outside the context of a processElement or finishBundle call.
You'll want to gather all your outputs in memory, output them eagerly during future processElement calls, and, at the end, output the rest within finishBundle after blocking until all your calls finish, as in the sketch below.
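A minimal sketch of that buffering pattern, assuming the PreparedRequest/EnrichedPoint types from the question and a placeholder enrichAsync() call; this illustrates the idea rather than being a tested implementation:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.GlobalWindow;
import org.joda.time.Instant;

public class BufferingAsyncFn extends DoFn<PreparedRequest, EnrichedPoint> {

  private transient ConcurrentLinkedQueue<EnrichedPoint> buffered;
  private transient List<CompletableFuture<Void>> pending;

  @StartBundle
  public void startBundle() {
    buffered = new ConcurrentLinkedQueue<>();
    pending = new ArrayList<>();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    // Emit results of earlier async calls while we are inside processElement,
    // where output is allowed.
    flush(c::output);
    PreparedRequest request = c.element();
    pending.add(enrichAsync(request).thenAccept(buffered::addAll));
  }

  @FinishBundle
  public void finishBundle(FinishBundleContext c) {
    // Block until every outstanding call completes, then emit what is left.
    CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
    flush(point -> c.output(point, Instant.now(), GlobalWindow.INSTANCE));
  }

  private void flush(Consumer<EnrichedPoint> out) {
    EnrichedPoint point;
    while ((point = buffered.poll()) != null) {
      out.accept(point);
    }
  }

  private CompletableFuture<List<EnrichedPoint>> enrichAsync(PreparedRequest request) {
    // Placeholder: plug in the AsyncHttpClient call and response parsing here.
    return CompletableFuture.completedFuture(new ArrayList<>());
  }
}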

Read large file using vertx

I am new to Vert.x and I am using the Vert.x filesystem API to read a file of large size.
vertx.fileSystem().readFile("target/classes/readme.txt", result -> {
  if (result.succeeded()) {
    System.out.println(result.result());
  } else {
    System.err.println("Oh oh ..." + result.cause());
  }
});
But all the RAM is consumed while reading, and the memory is not even released after use. The Vert.x filesystem API documentation also warns:
Do not use this method to read very large files or you risk running out of available RAM.
Is there any alternative to this?
To read a large file you should open it as an AsyncFile:
OpenOptions options = new OpenOptions();
fileSystem.open("myfile.txt", options, res -> {
  if (res.succeeded()) {
    AsyncFile file = res.result();
  } else {
    // Something went wrong!
  }
});
An AsyncFile is a ReadStream, so you can use it together with a Pump to copy the bits to a WriteStream:
Pump.pump(file, output).start();
file.endHandler((r) -> {
  System.out.println("Copy done");
});
There are different kinds of WriteStream, such as AsyncFile, net sockets, HTTP server responses, etc.
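For example, a minimal sketch that streams one file into another without loading it fully into RAM; the file names are placeholders and error handling is kept to a minimum:
FileSystem fs = vertx.fileSystem();
fs.open("source.txt", new OpenOptions(), srcRes -> {
  if (srcRes.succeeded()) {
    AsyncFile source = srcRes.result();
    fs.open("copy.txt", new OpenOptions().setCreate(true).setWrite(true), dstRes -> {
      if (dstRes.succeeded()) {
        AsyncFile destination = dstRes.result();
        Pump.pump(source, destination).start();
        source.endHandler(v -> {
          source.close();
          // Flush any queued writes before closing the destination.
          destination.flush(fr -> destination.close());
          System.out.println("Copy done");
        });
      } else {
        source.close();
        System.err.println("Oh oh ..." + dstRes.cause());
      }
    });
  } else {
    System.err.println("Oh oh ..." + srcRes.cause());
  }
});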
To read/process a large file in chunks you need to use the open() method which will return an AsyncFile on success. On this AsyncFile you setReadBufferSize() (or not, the default is 8192), and attach a handler() which will be passed a Buffer of at most the size of the read buffer you just set.
In the example below I have also attached an endHandler() to print a final newline to stay in line with the sample code you provided in the question:
vertx.fileSystem().open("target/classes/readme.txt", new OpenOptions().setWrite(false).setCreate(false), result -> {
  if (result.succeeded()) {
    result.result()
        .setReadBufferSize(READ_BUFFER_SIZE)
        .handler(data -> System.out.print(data.toString()))
        .endHandler(v -> System.out.println());
  } else {
    System.err.println("Oh oh ..." + result.cause());
  }
});
You need to define READ_BUFFER_SIZE somewhere of course.
The reason for that is that internally .readFile calls Files.readAllBytes.
What you should do instead is create a stream out of your file and pass it to a Vert.x handler:
try (InputStream stream = new FileInputStream("target/classes/readme.txt")) {
  // Your handling here
}
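For instance, a hedged sketch of doing that handling in chunks on a worker thread with executeBlocking (Vert.x 3.x style; the chunk size and file name are placeholders):
vertx.executeBlocking(future -> {
  // Read the file in small chunks so only one chunk is held in memory at a time.
  try (InputStream stream = new BufferedInputStream(new FileInputStream("target/classes/readme.txt"))) {
    byte[] chunk = new byte[8192];
    int read;
    while ((read = stream.read(chunk)) != -1) {
      // Your handling of the chunk here
    }
    future.complete();
  } catch (Exception e) {
    future.fail(e);
  }
}, ar -> {
  if (ar.succeeded()) {
    System.out.println("Done");
  } else {
    System.err.println("Oh oh ..." + ar.cause());
  }
});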