How to write Avro files from Apache Beam Row

In my Apache Beam pipeline I have a PCollection of Row objects (org.apache.beam.sdk.values.Row) that I want to write to Avro files. Here is a simplified version of my code:
Pipeline p = Pipeline.create();
Schema inputSchema = Schema.of(
Schema.Field.of("MyField1", Schema.FieldType.INT32)
);
Row row = Row.withSchema(inputSchema).addValues(1).build();
PCollection<Row> myRow = p.apply(Create.of(row)).setCoder(RowCoder.of(inputSchema));
myRow.apply(
"WriteToAvro",
AvroIO.write(Row.class)
.to("src/tmp/my_files")
.withWindowedWrites()
.withNumShards(10));
p.run();
The file gets created, but it looks like this (in JSON form):
"schema" : {
"fieldIndices" : {
"MyField1" : 0
},
"encodingPositions" : {
"MyField1" : 0
},
"fields" : [
{
}
],
"hashCode" : 545134948,
"uuid" : {
},
"options" : {
"options" : {
}
}
}
So only the schema is there, along with a bunch of useless metadata. What is the right way to write Row objects to Avro so that the file contains the data and not just the schema? And can I get rid of the metadata?

AvroUtils has utilities for converting Rows and Beam schemas to their Avro counterparts. You can do something like this:
Pipeline p = Pipeline.create();
Schema inputSchema = Schema.of(
Schema.Field.of("MyField1", Schema.FieldType.INT32)
);
// Derive the Avro schema from the Beam schema.
org.apache.avro.Schema avroSchema = AvroUtils.toAvroSchema(inputSchema);
// Convert each Row to an Avro GenericRecord before writing.
class ConvertToGenericRecords extends DoFn<Row, GenericRecord> {
@ProcessElement
public void process(ProcessContext c) {
c.output(AvroUtils.toGenericRecord(c.element(), avroSchema));
}
}
Row row = Row.withSchema(inputSchema).addValues(1).build();
PCollection<Row> myRow = p.apply(Create.of(row)).setCoder(RowCoder.of(inputSchema));
myRow.apply(ParDo.of(new ConvertToGenericRecords()))
.setCoder(AvroCoder.of(avroSchema))
.apply(
"WriteToAvro",
AvroIO.writeGenericRecords(avroSchema)
.to("src/tmp/my_files")
.withWindowedWrites()
.withNumShards(10));
p.run();

Related

How to add the new field to Object in Purescript

I am new to PureScript.
I want to add a new field to an object (record) and pass it as a function parameter,
but I cannot find a good way to do this.
For example:
oldFiled = {
title : "title",
description : "d"
}
newField = {
time : "time"
}
// result after adding the new field
oldFiled = {
title : "title",
description : "d",
time : "time"
}
How can I do it?

If you only need to add a single field, you can use https://pursuit.purescript.org/packages/purescript-record/2.0.1/docs/Record#v:insert like so:
import Record as Record
import Data.Symbol (SProxy(..))
oldFiled = {
title : "title",
description : "d"
}
newFiled = Record.insert (SProxy :: _ "time") "time" oldFiled
If you are merging records, look at the merge, union, and disjointUnion functions in the Record module.

How to compare two collections and archive documents which are not common

I have two collections, CollectionA and CollectionB. Both have a common field, hostname.
Collection A:
{
"hostname": "vm01",
"id": "1",
"status": "online"
}
Collection B:
{
"hostname": "vm01",
"id": "string",
"installedversion": "string"
}
{
"hostname": "vm02",
"id": "string",
"installedversion": "string"
}
This is what I want to achieve when I receive a POST message for Collection B.
First, I want to check whether the record exists in Collection B based on hostname and, if it does, update all of its values; if not, insert the new record. (I have read that this can be done with an upsert, but I am still working out how to use it.)
Second, I want to check whether the hostname is present in Collection A. If it is not, the record should be moved from Collection B to another collection, Collection C, as an archived record. In the example above, the hostname=vm02 record should be moved from Collection B to Collection C.
How can I achieve this using Spring Boot and MongoDB? Any help is appreciated. Below is the code I currently use to save to Collection B, which I want to change to get the result described above:
public RscInstalltionStatusDTO save(RscInstalltionStatusDTO rscInstalltionStatusDTO) {
log.debug("Request to save RscInstalltionStatus : {}", rscInstalltionStatusDTO);
RscInstalltionStatus rscInstalltionStatus = rscInstalltionStatusMapper.toEntity(rscInstalltionStatusDTO);
rscInstalltionStatus = rscInstalltionStatusRepository.save(rscInstalltionStatus);
return rscInstalltionStatusMapper.toDto(rscInstalltionStatus);
}
Update 1: The code below works as I expected, but I think there should be a better way to do this.
public RscInstalltionStatusDTO save(RscInstalltionStatusDTO rscInstalltionStatusDTO) {
log.debug("Request to save RscInstalltionStatus : {}", rscInstalltionStatusDTO);
RscInstalltionStatus rscInstalltionStatus = rscInstalltionStatusMapper.toEntity(rscInstalltionStatusDTO);
System.out.print(rscInstalltionStatus.getHostname());
Query query = new Query(Criteria.where("hostname").is(rscInstalltionStatus.getHostname()));
Update update = new Update();
update.set("configdownload",rscInstalltionStatus.getConfigdownload());
update.set("rscpkgdownload",rscInstalltionStatus.getRscpkgdownload());
update.set("configextraction",rscInstalltionStatus.getConfigextraction());
update.set("rscpkgextraction",rscInstalltionStatus.getRscpkgextraction());
update.set("rscstartup",rscInstalltionStatus.getRscstartup());
update.set("installedversion",rscInstalltionStatus.getInstalledversion());
mongoTemplate.upsert(query, update,RscInstalltionStatus.class);
rscInstalltionStatus = rscInstalltionStatusRepository.findByHostname(rscInstalltionStatus.getHostname());
return rscInstalltionStatusMapper.toDto(rscInstalltionStatus);
}
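To make the upsert behaviour in Update 1 concrete, here is a minimal in-memory sketch in JavaScript. It is not Spring Data code: the Map stands in for Collection B, keying by hostname is an assumption taken from the query above, and the function name upsertByHostname and the version values are made up for illustration.

```javascript
// In-memory stand-in for a collection, keyed by hostname (an assumption made
// for illustration; a real MongoDB collection is keyed by _id).
function upsertByHostname(collection, incoming) {
  const existing = collection.get(incoming.hostname);
  // Update: overwrite the stored fields with the incoming ones.
  // Insert: nothing stored yet, so the incoming document is stored as-is.
  const merged = existing ? { ...existing, ...incoming } : { ...incoming };
  collection.set(incoming.hostname, merged);
  return merged;
}

const collectionB = new Map();
// Hypothetical version values, only there to show the update taking effect.
upsertByHostname(collectionB, { hostname: "vm01", installedversion: "1.0" });
upsertByHostname(collectionB, { hostname: "vm01", installedversion: "1.1" });
upsertByHostname(collectionB, { hostname: "vm02", installedversion: "1.0" });

console.log(collectionB.size);
console.log(collectionB.get("vm01").installedversion);
```

The second call for vm01 updates the existing entry instead of adding a new one, which is the behaviour mongoTemplate.upsert is being used for above.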
Update 2: With the code below I am able to move the records to another collection:
String query = "{$lookup:{ from: \"vmdetails\",let: {rschostname: \"$hostname\"},pipeline:[{$match:{$expr:{$ne :[\"$hostname\",\"$$rschostname\"]}}}],as: \"rscInstall\"}},{$unwind:\"$rscInstall\"},{$project:{\"_id\":0,\"rscInstall\":0}}";
AggregationOperation rscInstalltionStatusTypedAggregation = new CustomProjectAggregationOperation(query);
LookupOperation lookupOperation = LookupOperation.newLookup().from("vmdetails").localField("hostname").foreignField("hostname").as("rscInstall");
UnwindOperation unwindOperation = Aggregation.unwind("$rscInstall");
ProjectionOperation projectionOperation = Aggregation.project("_id","rscInstall").andExclude("_id","rscInstall");
OutOperation outOperation = Aggregation.out("RscInstallArchive");
Aggregation aggregation = Aggregation.newAggregation(rscInstalltionStatusTypedAggregation,unwindOperation,projectionOperation,outOperation);
List<BasicDBObject> results = mongoTemplate.aggregate(aggregation,"rsc_installtion_status",BasicDBObject.class).getMappedResults();
The issue I have here is that it returns multiple records.

I found a solution. There may be better ones, but this one worked for me.
Create a class CustomProjectAggregationOperation (found in SO answers and extended to match my needs):
public class CustomProjectAggregationOperation implements AggregationOperation {
private String jsonOperation;
public CustomProjectAggregationOperation(String jsonOperation) {
this.jsonOperation = jsonOperation;
}
@Override
public Document toDocument(AggregationOperationContext aggregationOperationContext) {
return aggregationOperationContext.getMappedObject(Document.parse(jsonOperation));
}
}
String lookupquery = "{$lookup :{from:\"vmdetails\",localField:\"hostname\",foreignField:\"hostname\",as:\"rscinstall\"}}";
String matchquery = "{ $match: { \"rscinstall\": { $eq: [] } }}";
String projectquery = "{$project:{\"rscinstall\":0}}";
AggregationOperation lookupOpertaion = new CustomProjectAggregationOperation(lookupquery);
AggregationOperation matchOperation = new CustomProjectAggregationOperation(matchquery);
AggregationOperation projectOperation = new CustomProjectAggregationOperation(projectquery);
Aggregation aggregation = Aggregation.newAggregation(lookupOpertaion, matchOperation, projectOperation);
ArrayList<Document> results1 = (ArrayList<Document>) mongoTemplate.aggregate(aggregation, "rsc_installtion_status", Document.class).getRawResults().get("result");
// System.out.println(results1);
// Copy each unmatched document to the archive, then delete it from the source collection.
for (Document doc : results1) {
// System.out.print(doc.get("_id").toString());
mongoTemplate.insert(doc, "RscInstallArchive");
delete(doc.get("_id").toString());
}
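The $lookup followed by a $match on an empty array is effectively an anti-join: it keeps the documents whose hostname has no partner in vmdetails. As a rough in-memory illustration (plain JavaScript, not the MongoDB driver; the function name partitionByKnownHosts is an assumption, and the sample documents are trimmed versions of the ones in the question), the same selection looks like this:

```javascript
// Split collection B into documents to keep and documents to archive,
// depending on whether their hostname also appears in collection A.
function partitionByKnownHosts(collectionA, collectionB) {
  const knownHosts = new Set(collectionA.map((doc) => doc.hostname));
  const keep = [];
  const archive = [];
  for (const doc of collectionB) {
    (knownHosts.has(doc.hostname) ? keep : archive).push(doc);
  }
  return { keep, archive };
}

const collectionA = [{ hostname: "vm01", id: "1", status: "online" }];
const collectionB = [
  { hostname: "vm01", id: "string", installedversion: "string" },
  { hostname: "vm02", id: "string", installedversion: "string" },
];

const { keep, archive } = partitionByKnownHosts(collectionA, collectionB);
console.log(keep.map((doc) => doc.hostname));
console.log(archive.map((doc) => doc.hostname));
```

Only vm02 ends up in the archive list, which matches the record that should move to Collection C.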

Translate postgres query to sequelize

I am having trouble setting up the relationships in Sequelize to reproduce this query:
SELECT * from actors
JOIN "actorStatuses"
on "actorStatuses".actor_id = actors.id
JOIN movies
on movies.id = "actorStatuses".actor_id
WHERE movies.date = '7/8/2017';

Here you go:
model.Actor.findAll({ // change the model name to match yours
include : [
{
model : model.ActorStatuses, // change the model name to match yours
required : true
},{
model : model.Movies, // change the model name to match yours
required : true,
where : { date : 'your_date' }
}
]
});
This should produce the same query and result, provided the associations between these models are defined.

Defining Query for OrientDB - using JSON data

I have a question about using OrientDB for JSON data. In my scenario,
I receive entities which are serialized into JSON. This JSON data should
be stored in OrientDB. The relevant code part to create documents
in OrientDB looks as follows:
//JSON string created by using Jackson
String jsonStr = "...";
//here some dummy name "test"
ODocument doc = new ODocument("test");
doc.fromJSON(jsonStr);
doc.save();
In the following, I give an example for classes I'm working with
(I left out constructors, getters and setters and other fields which
are not relevant for the example):
class AbstractEntity {
private String oidString;
}
class A extends AbstractEntity {
private List<B> bs;
}
class B extends AbstractEntity {
private List<C> cs;
}
class C extends AbstractEntity {
private List<D> ds;
}
class D extends AbstractEntity {
private int type;
}
As the classes use type List, Jackson needs to store
additional type information in the JSON representation, to be able
to deserialize the data properly.
{
"oidString" : "AAA_oid1",
"bs" : [ "java.util.ArrayList", [ {
"oidString" : "b_oid1",
"cs" : null
}, {
"oidString" : "b_oid2",
"cs" : [ "java.util.ArrayList", [ {
"oidString" : "c_oid1",
"ds" : [ "java.util.ArrayList", [ ] ]
}, {
"oidString" : "c_oid2",
"ds" : [ "java.util.ArrayList", [ {
"oidString" : "d_oid1",
"type" : 2
} ] ]
} ] ]
} ] ]
}
However, I have problems querying such a document if I try to e.g. find all D instances that contain a certain type. I tried to simplify my query by first listing all D instances that can be found for a specific A:
OQuery<?> query = new OSQLSynchQuery<ODocument>(
"select bs.cs.ds from test where oidString = 'AAA_oid1'"
);
This returns: {"@type":"d","@rid":"#-2:0","@version":0,"bs":[null,null]}
The additional type information ("java.util.ArrayList") seems to cause problems for OrientDB. If I rewrite my example to use ArrayList directly, so that the additional type information is omitted, the query above returns a result.
Is there a general solution to this problem? I have to work with JSON data, and that JSON data will contain additional type information (it has to).
Can OrientDB not handle this situation?
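One plausible reading of the [null,null] result is that the type wrapper turns "bs" into a two-element list (a string and a nested list), and neither of those elements is an object with a "cs" field. The sketch below is a toy model in JavaScript, not OrientDB's query engine; the helper fieldOfEach is made up for illustration, and the sample data is a shortened version of the JSON above.

```javascript
// Toy model (an assumption, not OrientDB's implementation): for every direct
// element of a list, return the named field, or null if the element is not
// an object that carries that field.
function fieldOfEach(list, field) {
  return list.map((item) =>
    item !== null && typeof item === "object" && !Array.isArray(item) && field in item
      ? item[field]
      : null
  );
}

// "bs" as Jackson writes it with type information: [typeName, actualList].
const wrappedBs = [
  "java.util.ArrayList",
  [
    { oidString: "b_oid1", cs: null },
    { oidString: "b_oid2", cs: ["java.util.ArrayList", []] },
  ],
];

// "bs" as a plain JSON list, without the type wrapper.
const plainBs = [
  { oidString: "b_oid1", cs: null },
  { oidString: "b_oid2", cs: [{ oidString: "c_oid1", ds: [] }] },
];

const wrappedResult = fieldOfEach(wrappedBs, "cs");
const plainResult = fieldOfEach(plainBs, "cs");
console.log(JSON.stringify(wrappedResult)); // [null,null]
console.log(JSON.stringify(plainResult));
```

In this model the wrapped list yields two nulls, the same shape as the query result, while the plain list exposes the nested "cs" values.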

Removing documents while preserving at least one

I have a MongoDB collection containing history data with id and timestamp.
I want to delete data from the collection older than a specific
timestamp. But for every id at least one
document (the newest) must stay in the collection.
Suppose I have the following documents in my collection ...
{"id" : "11", "timestamp" : ISODate("2011-09-09T10:27:34.785Z")} //1
{"id" : "11", "timestamp" : ISODate("2011-09-08T10:27:34.785Z")} //2
{"id" : "22", "timestamp" : ISODate("2011-09-05T10:27:34.785Z")} //3
{"id" : "22", "timestamp" : ISODate("2011-09-01T10:27:34.785Z")} //4
... and I want to delete documents having a timestamp older than
2011-09-07 then
1 and 2 should not be deleted because they are newer.
4 should be deleted because it is older, but 3 should not be deleted
(although it is older) because
at least one document per id should stay in the collection.
Does anyone know how I can do this with casbah and/or on the mongo
console?
Regards,
Christian

I can think of a couple of ways. First, try this:
var cutoff = new ISODate("2011-09-07T00:00:00.000Z");
db.testdata.find().forEach(function(data) {
if (data.timestamp.valueOf() < cutoff.valueOf()) {
// A candidate for deletion
if (db.testdata.find({"id": data.id, "timestamp": { $gt: data.timestamp }}).count() > 0) {
db.testdata.remove({"_id" : data._id});
}
}
});
This does what you want. Alternatively, you can use a MapReduce job. Save the following to a text file:
var map = function() {
emit(this.id, {
ref: this._id,
timestamp: this.timestamp
});
};
var reduce = function(key, values) {
var cutoff = new ISODate("2011-09-07T00:00:00.000Z");
var newest = null;
var ref = null;
var i;
for (i = 0; i < values.length; ++i) {
if (values[i].timestamp.valueOf() < cutoff.valueOf()) {
// falls into the delete range
if (ref == null) {
ref = values[i].ref;
newest = values[i].timestamp;
} else if (values[i].timestamp.valueOf() > newest.valueOf()) {
// This one is newer than the one we are currently saving.
// delete ref
db.testdata.remove({_id : ref});
ref = values[i].ref;
newest = values[i].timestamp;
} else {
// This one is older
// delete values[i].ref
db.testdata.remove({_id : values[i].ref});
}
} else if (ref == null) {
ref = values[i].ref;
newest = values[i].timestamp;
}
}
return { ref: ref, timestamp: newest };
};
Load the above file into the shell: load("file.js");
Then run it: db.testdata.mapReduce(map, reduce, {out: "results"});
Then remove the mapReduce output: db.results.drop();
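The first snippet applies this rule: a document may be deleted only if it is older than the cutoff and a newer document with the same id exists. A minimal sketch of that rule on plain in-memory objects (JavaScript, no MongoDB involved; selectDeletable and the n label, which mirrors the //1 to //4 comments in the question, are names made up here) can be handy for checking the expected outcome before running anything destructive:

```javascript
// Return the documents that may be deleted: older than the cutoff AND not the
// newest document for their id.
function selectDeletable(docs, cutoff) {
  // Pass 1: find the newest timestamp per id.
  const newestById = new Map();
  for (const doc of docs) {
    const current = newestById.get(doc.id);
    if (current === undefined || doc.timestamp > current) {
      newestById.set(doc.id, doc.timestamp);
    }
  }
  // Pass 2: keep only old documents that are not the newest for their id.
  return docs.filter(
    (doc) => doc.timestamp < cutoff && doc.timestamp < newestById.get(doc.id)
  );
}

// Sample data from the question; "n" is a hypothetical label for readability.
const docs = [
  { n: 1, id: "11", timestamp: new Date("2011-09-09T10:27:34.785Z") },
  { n: 2, id: "11", timestamp: new Date("2011-09-08T10:27:34.785Z") },
  { n: 3, id: "22", timestamp: new Date("2011-09-05T10:27:34.785Z") },
  { n: 4, id: "22", timestamp: new Date("2011-09-01T10:27:34.785Z") },
];
const cutoff = new Date("2011-09-07T00:00:00.000Z");

const deletable = selectDeletable(docs, cutoff);
console.log(deletable.map((doc) => doc.n)); // only document 4
```

Document 3 is older than the cutoff but survives because it is the newest entry for id 22.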
