Google Cloud Storage atomic creation of a Blob - google-cloud-dataproc

I'm using the hadoop-connectors
project for writing BLOBs to Google Cloud Storage.
I'd like to make sure that a BLOB with a specific target name that is written in a concurrent context is either written in full or not visible at all in case an exception occurs while writing.
In the code below, if an I/O exception occurs, the BLOB will still appear on GCS because the stream is closed in the finally block:
val stream = fs.create(path, overwrite)
try {
  actions.map(_ + "\n").map(_.getBytes(UTF_8)).foreach(stream.write)
} finally {
  stream.close()
}
The other possibility would be to not close the stream and let it "leak" so that the BLOB does not get created. However, this is not really a valid option.
val stream = fs.create(path, overwrite)
actions.map(_ + "\n").map(_.getBytes(UTF_8)).foreach(stream.write)
stream.close()
Can anybody share a recipe for writing a BLOB to GCS atomically, either with hadoop-connectors or the Cloud Storage client?

I have used reflection within hadoop-connectors to retrieve an instance of com.google.api.services.storage.Storage from the GoogleHadoopFileSystem instance:
GoogleCloudStorage googleCloudStorage = ghfs.getGcsFs().getGcs();
Field gcsField = googleCloudStorage.getClass().getDeclaredField("gcs");
gcsField.setAccessible(true);
Storage gcs = (Storage) gcsField.get(googleCloudStorage);
This gives me the ability to issue an insert call backed by an input stream over the data in memory:
private static StorageObject createBlob(URI blobPath, byte[] content, GoogleHadoopFileSystem ghfs, Storage gcs)
    throws IOException
{
  CreateFileOptions createFileOptions = new CreateFileOptions(false);
  CreateObjectOptions createObjectOptions = objectOptionsFromFileOptions(createFileOptions);
  PathCodec pathCodec = ghfs.getGcsFs().getOptions().getPathCodec();
  StorageResourceId storageResourceId = pathCodec.validatePathAndGetId(blobPath, false);
  StorageObject object =
      new StorageObject()
          .setContentEncoding(createObjectOptions.getContentEncoding())
          .setMetadata(encodeMetadata(createObjectOptions.getMetadata()))
          .setName(storageResourceId.getObjectName());
  InputStream inputStream = new ByteArrayInputStream(content, 0, content.length);
  Storage.Objects.Insert insert = gcs.objects().insert(
      storageResourceId.getBucketName(),
      object,
      new InputStreamContent(createObjectOptions.getContentType(), inputStream));
  // The operation succeeds only if there are no live versions of the blob.
  insert.setIfGenerationMatch(0L);
  insert.getMediaHttpUploader().setDirectUploadEnabled(true);
  insert.setName(storageResourceId.getObjectName());
  return insert.execute();
}
/**
* Helper for converting from a Map<String, byte[]> metadata map that may be in a
* StorageObject into a Map<String, String> suitable for placement inside a
* GoogleCloudStorageItemInfo.
*/
@VisibleForTesting
static Map<String, String> encodeMetadata(Map<String, byte[]> metadata) {
  return Maps.transformValues(metadata, QuickstartParallelApiWriteExample::encodeMetadataValues);
}
// A function to encode metadata map values
private static String encodeMetadataValues(byte[] bytes) {
  return bytes == null ? Data.NULL_STRING : BaseEncoding.base64().encode(bytes);
}
Note that in the example above, even if there are multiple callers trying to create a blob with the same name in parallel, one and only one will succeed in creating the blob; the other callers will receive 412 Precondition Failed.
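A caller can treat that 412 as the signal that it lost the race. A minimal sketch of such handling around the createBlob method above (the exception type is com.google.api.client.googleapis.json.GoogleJsonResponseException; how the losing caller reacts is up to the application):
try {
  StorageObject created = createBlob(blobPath, content, ghfs, gcs);
  // This caller won the race: the blob now exists exactly as written.
} catch (GoogleJsonResponseException e) {
  if (e.getStatusCode() == 412) {
    // Another caller created the blob first; treat the write as already done, or report a conflict.
  } else {
    throw e;
  }
}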

GCS objects (blobs) are immutable [1], which means they can be created, deleted or replaced, but not appended.
The Hadoop GCS connector provides the HCFS interface, which gives the illusion of appendable files. But under the hood it is just one blob creation; GCS doesn't know whether the content is complete from the application's perspective, just as you mentioned in the example. There is no way to cancel a file creation.
There are two options you can consider:
Create a temp blob/file, copy it to the final blob/file, then delete the temp blob/file, see [2]. Note that there is no atomic rename operation in GCS; rename is implemented as copy-then-delete.
If your data fits into memory, first read the stream and buffer the bytes in memory, then create the blob/file in a single request, see [3].
The GCS connector should also work with the two options above, but I think the GCS client library gives you more control; a sketch of the second option with the client library follows.
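A minimal sketch of the second option with the google-cloud-storage client library, assuming the whole payload is already buffered in memory (bucket and object names are placeholders). The doesNotExist() precondition corresponds to if-generation-match = 0, so the object either appears fully written or, on failure, not at all:
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.nio.charset.StandardCharsets;

public class AtomicBlobWriteSketch {
  public static void main(String[] args) {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    byte[] content = "line1\nline2\n".getBytes(StandardCharsets.UTF_8); // data buffered in memory
    BlobInfo blobInfo = BlobInfo.newBuilder(BlobId.of("my-bucket", "path/to/blob")).build();
    // Fails with 412 Precondition Failed if a live version of the object already exists,
    // so concurrent readers never observe a partially written object.
    storage.create(blobInfo, content, Storage.BlobTargetOption.doesNotExist());
  }
}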

Related

Spring Batch: InputStream is getting closed in Spring Batch reader if writer takes more than 5 minutes in processing

What needs to be achieved: read a CSV from an SFTP location, write it again to a different path, and also save it to a DB, with Spring Batch in a Spring Boot app.
Issue: 1. The reader is executed only once while the writer runs per chunk size; for example, a print statement in the reader gets printed only one time, while in the writer it prints once per chunk execution. This seems to be the default behaviour of FlatFileItemReader.
2. I am using an SFTP channel to read the file from the SFTP location, and the channel gets closed after the read if the writer's processing time is long.
So is there a way I can always pass a new SFTP connection for each chunk, or is there a way I can extend this reader's input stream timeout, as I don't see any timeout option? In the SFTP configuration, I already tried increasing the timeout and idle time, but to no avail.
I have tried creating a new SFTP connection in the reader and passing it to the stream, but as the reader is only initialized once, this does not help.
Reader snippet:
private Step step(FileInputDTO input, Map<String, Float> ratelist) throws SftpException {
  return stepBuilderFactory.get("Step").<DTO, DTO>chunk(chunkSize)
      .reader(buildReader(input))
      .writer(new Writer(input, fileUtil, ratelist, mapper, service))
      .taskExecutor(taskExecutor)
      .listener(stepListener)
      .build();
}
private FlatFileItemReader<? extends DTO> buildReader(FileInputDTO input) throws SftpException {
  // Create reader instance
  FlatFileItemReader<DTO> reader = new FlatFileItemReader<>();
  log.info("reading file :starts");
  // Set input file location
  reader.setResource(new InputStreamResource(input.getChannel().get(input.getPath())));
  // Set number of lines to skip. Use it if the file has header rows.
  reader.setLinesToSkip(1);
  // Other code
  return reader;
}
SFTP configuration:
public SFTPUtil(Environment env, String sftpPassword) throws JSchException {
  JSch jsch = new JSch();
  log.debug("Creating SFTP channelSftp");
  Session session = jsch.getSession(env.getProperty("sftp.remoteUserName"),
      env.getProperty("sftp.remoteHost"), Integer.parseInt(env.getProperty("sftp.remotePort")));
  session.setConfig(Constants.STRICT_HOST_KEY_CHECKING, Constants.NO);
  session.setPassword(sftpPassword);
  session.connect(Integer.parseInt(env.getProperty("sftp.sessionTimeout")));
  Channel sftpchannel = session.openChannel(Constants.SFTP);
  sftpchannel.connect(Integer.parseInt(env.getProperty("sftp.channelTimeout")));
  this.channel = (ChannelSftp) sftpchannel;
  log.debug("SFTP channelSftp connected");
}
public ChannelSftp get() throws CustomException {
  if (channel == null) throw new CustomException("Channel creation failed");
  return channel;
}

How to read a PDF file in Vert.x?

I am new to Vert.x and I want to read a PDF using the "GET" method. I know that a Buffer will be used, but there are no resources on the internet on how to do that.
Omitting the details of how you would get the file from your data store (Couchbase DB), it is fair to assume the data is read correctly into a byte[].
Once the data is read, you can feed it into an io.vertx.core.buffer.Buffer that can be used to send the data to the HttpServerResponse as follows:
public void sendPDFFile(byte[] fileBytes, HttpServerResponse response) {
  Buffer buffer = Buffer.buffer(fileBytes);
  response.putHeader("Content-Type", "application/pdf")
      .putHeader("Content-Length", String.valueOf(buffer.length()))
      .setStatusCode(200)
      .end(buffer);
}
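For completeness, a minimal sketch of exposing this over a GET endpoint with vertx-web (assuming a recent Vert.x where a Router can be passed to requestHandler directly; the route path and the way the bytes are obtained are placeholders, and the helper above is repeated so the snippet is self-contained):
import io.vertx.core.Vertx;
import io.vertx.core.buffer.Buffer;
import io.vertx.core.http.HttpServerResponse;
import io.vertx.ext.web.Router;

public class PdfServerSketch {
  public static void main(String[] args) {
    Vertx vertx = Vertx.vertx();
    Router router = Router.router(vertx);
    // GET /document: load the PDF bytes (placeholder) and send them back as a Buffer.
    router.get("/document").handler(ctx -> {
      byte[] fileBytes = new byte[0]; // placeholder: fetch the PDF content from your data store
      sendPDFFile(fileBytes, ctx.response());
    });
    vertx.createHttpServer().requestHandler(router).listen(8080);
  }

  static void sendPDFFile(byte[] fileBytes, HttpServerResponse response) {
    Buffer buffer = Buffer.buffer(fileBytes);
    response.putHeader("Content-Type", "application/pdf")
        .putHeader("Content-Length", String.valueOf(buffer.length()))
        .setStatusCode(200)
        .end(buffer);
  }
}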

Getting 409 from the schema registry while saving records to a local state store where multiple state stores are associated with a single processor

Long story short: I am in the middle of implementing a processor topology. The processor is to store the received records into corresponding local state stores and do event-based processing as a record arrives. The related code looks like this:
@Override
public void configureBuilder(StreamsBuilder builder) {
  final Map<String, String> serdeConfig =
      Collections.singletonMap("schema.registry.url", processorConfig.getSchemaRegistryUrl());
  final Serde<GenericRecord> valueSerde = new GenericAvroSerde();
  valueSerde.configure(serdeConfig, false); // `true` for record keys
  final Serde<EventKey> keySerde = new SpecificAvroSerde<>();
  keySerde.configure(serdeConfig, true); // `true` for record keys
  Map<String, String> stateStoreConfigMap = new HashMap<>();
  //stateStoreConfigMap.put(KafkaAvroSerializerConfig.VALUE_SUBJECT_NAME_STRATEGY, RecordNameStrategy.class.getName());
  StoreBuilder<KeyValueStore<EventKey, GenericRecord>> aggSequenceStateStoreBuilder =
      Stores.keyValueStoreBuilder(
              Stores.persistentKeyValueStore(processStateStore), keySerde, valueSerde)
          .withLoggingEnabled(stateStoreConfigMap)
          .withCachingEnabled();
  final Serde<EnrichedSmcHeatData> enrichedSmcHeatDataSerde = new SpecificAvroSerde<>();
  enrichedSmcHeatDataSerde.configure(serdeConfig, false); // `true` for record keys
  StoreBuilder<KeyValueStore<EventKey, EnrichedSmcHeatData>> enrichedSmcHeatStateStoreBuilder =
      Stores.keyValueStoreBuilder(
              Stores.persistentKeyValueStore("enriched-smc-heat-state-store"), keySerde, enrichedSmcHeatDataSerde)
          .withLoggingEnabled(stateStoreConfigMap)
          .withCachingEnabled();
  Topology topology = builder.build();
  topology
      .addSource(
          PROCESS_EVENTS_SOURCE,
          keySerde.deserializer(),
          valueSerde.deserializer(),
          processorConfig.getInputCcmProcessEvents())
      .addSource(
          SCHEDULED_SEQUENCES_SOURCE,
          keySerde.deserializer(),
          valueSerde.deserializer(),
          processorConfig.getScheduledCastSequences())
      .addSource(
          SMC_HEAT_EVENTS_SOURCE,
          keySerde.deserializer(),
          valueSerde.deserializer(),
          processorConfig.getInputSmcHeatEvents())
      .addProcessor(
          PROCESS_STATE_AGGREGATOR,
          () -> new ProcessStateProcessor(processStateStore, processorConfig),
          PROCESS_EVENTS_SOURCE,
          SCHEDULED_SEQUENCES_SOURCE,
          SMC_HEAT_EVENTS_SOURCE)
      .addStateStore(aggSequenceStateStoreBuilder, PROCESS_STATE_AGGREGATOR)
      .addStateStore(enrichedSmcHeatStateStoreBuilder, PROCESS_STATE_AGGREGATOR);
}
If there are updates for the store created by aggSequenceStateStoreBuilder, the record values can be saved to the store without problems. However, if updates come for the second store, the following error is thrown:
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException:
Schema being registered is incompatible with an earlier schema for subject
"ccm-process-events-processor-ccm-process-state-store-changelog-value";
error code: 409
My use case: the state processor accepts inbound records from multiple source topics and does event handling (including storing the modified values to the corresponding stores) when a record arrives from any of the source topics.
It appears that only one schema can be registered with the schema registry for the same processor. Is that by design, am I missing something, or what alternative options do I have?
Thanks in advance!
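For reference, the commented-out line in the snippet above points at the setting usually involved in this kind of 409: the value subject name strategy. Below is a heavily hedged sketch of applying it to the serde configuration rather than to the changelog topic config (KafkaAvroSerializerConfig and RecordNameStrategy are the Confluent classes already referenced in the question; whether this resolves this particular topology is an assumption):
// Sketch only: derive the changelog value subject from the Avro record name rather than
// from the changelog topic name, so stores holding different record types do not compete
// for a single subject.
final Map<String, String> serdeConfig = new HashMap<>();
serdeConfig.put("schema.registry.url", processorConfig.getSchemaRegistryUrl());
serdeConfig.put(
    KafkaAvroSerializerConfig.VALUE_SUBJECT_NAME_STRATEGY,
    RecordNameStrategy.class.getName());
valueSerde.configure(serdeConfig, false);
enrichedSmcHeatDataSerde.configure(serdeConfig, false);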

Not Reading bytes properly during FTP transfer in Spring Batch

I am doing a project where I have to efficiently transfer data (any file) from one endpoint (HTTP, FTP, SFTP) to another. I want to use Spring Batch's concurrency and parallelism features for jobs. In my case, one file will be one job. So I am trying to read a file (any extension) from FTP (running locally) and write it to the same FTP server in a different folder.
My Reader has:
FlatFileItemReader<byte[]> reader = new FlatFileItemReader<>();
reader.setResource(new UrlResource("ftp://localhost:2121/source/1.txt"));
reader.setLineMapper((line, lineNumber) -> {
  return line.getBytes();
});
And Writer has:
URL url = new URL("ftp://localhost:2121/dest/tempOutput/TransferTest.txt");
URLConnection conn = url.openConnection();
DataOutputStream out = new DataOutputStream(conn.getOutputStream());
for (byte[] b : bytes) { // I am getting List<byte[]> in my writer
  out.write(b);
}
out.close();
In the case of a text file, all the content shows up on one line (the newline characters are omitted), and in the case of a video file, bytes are missing/corrupted, as the video does not play at the destination.
What am I doing wrong, or is there a better way to transfer a file (irrespective of its extension)?
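A note on the single-line symptom: FlatFileItemReader is line-oriented, so the LineMapper receives each line with its terminator already stripped, which is why line.getBytes() never contains a newline. A minimal sketch of re-appending a separator in the mapper (illustrative only; it would not make a line-based reader suitable for binary content such as video):
reader.setLineMapper((line, lineNumber) -> {
  // The reader strips the line terminator, so add one back before converting to bytes.
  return (line + System.lineSeparator()).getBytes();
});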

How do I collect from a flux without closing the stream

My use case is to create a reactive endpoint like this:
public Flux<ServerEvent> getEventFlux(Long forId) {
  ServicePoller poller = new ServicePollerImpl();
  Map<String, Object> params = new HashMap<>();
  params.put("id", forId);
  Flux<Long> interval = Flux.interval(Duration.ofMillis(pollDuration));
  Flux<ServerEvent> serverEventFlux =
      Flux.fromStream(
          poller.getEventStream(url, params) // poll a given endpoint after a fixed duration.
      );
  Flux<ServerEvent> sourceFlux = Flux.zip(interval, serverEventFlux)
      .map(Tuple2::getT2); // Zip the two streams.
  /* Here I want to store data from sourceFlux into a collection whenever some data arrives,
     without disturbing the downstream processing in Spring, so that I can access the
     collection later on without polling again. */
This sends the data back to the front end as soon as it is available. However, my second use case is to pool that data, as it arrives, into a separate collection, so that if a similar request arrives later on, I can offload the whole data from the pool without hitting the service again.
I tried to subscribe to the flux, buffer, cache and collect the flux before returning the original flux from the controller, but all of that seems to close the stream, hence Spring can't process it.
What are my options to tap into the flux and store values into a collection as and when they arrive, without closing the flux stream?
Exception encountered:
java.lang.IllegalStateException: stream has already been operated upon or closed
    at java.util.stream.AbstractPipeline.spliterator(AbstractPipeline.java:343) ~[na:1.8.0_171]
    at java.util.stream.ReferencePipeline.iterator(ReferencePipeline.java:139) ~[na:1.8.0_171]
    at reactor.core.publisher.FluxStream.subscribe(FluxStream.java:57) ~[reactor-core-3.1.7.RELEASE.jar:3.1.7.RELEASE]
    at reactor.core.publisher.Flux.subscribe(Flux.java:6873) ~[reactor-core-3.1.7.RELEASE.jar:3.1.7.RELEASE]
    at reactor.core.publisher.FluxZip$ZipCoordinator.subscribe(FluxZip.java:573) ~[reactor-core-3.1.7.RELEASE.jar:3.1.7.RELEASE]
    at reactor.core.publisher.FluxZip.handleBoth(FluxZip.java:326) ~[reactor-core-3.1.7.RELEASE.jar:3.1.7.RELEASE]
poller.getEventStream returns a Java 8 stream that can be consumed only once. You can either convert the stream to a collection first or defer the execution of poller.getEventStream by using a supplier:
Flux.fromStream(
    () -> poller.getEventStream(url, params)
);
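The other alternative mentioned above (converting the stream to a collection first) would look roughly like this; it gives up the laziness of the stream but yields data that can be replayed for later requests (assuming getEventStream returns a Stream<ServerEvent>):
// Consume the one-shot stream once and keep the results.
List<ServerEvent> events = poller.getEventStream(url, params).collect(Collectors.toList());
// A Flux built from a collection can be subscribed to repeatedly.
Flux<ServerEvent> serverEventFlux = Flux.fromIterable(events);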
Solution that worked for me, as suggested by @a better oliver:
public Flux<ServerEvent> getEventFlux(Long forId) {
  ServicePoller poller = new ServicePollerImpl();
  Map<String, Object> params = new HashMap<>();
  params.put("id", forId);
  Flux<Long> interval = Flux.interval(Duration.ofMillis(pollDuration));
  Flux<ServerEvent> serverEventFlux =
      Flux.fromStream(
          () -> {
            return poller.getEventStream(url, params).peek((se) -> { reactSink.addtoSink(forId, se); });
          }
      );
  Flux<ServerEvent> sourceFlux = Flux.zip(interval, serverEventFlux)
      .map(Tuple2::getT2);
  return sourceFlux;
}