google-cloud-datastore java client: Is there a way to infer schema and/or retrieve results as Json? - scala

I am working on a Datastore data source for Apache Spark, based on the Spark DataSource V2 API. I was able to implement it for a single hard-coded entity but couldn't generalize it. Either I need to infer the entity schema and translate each entity record into a Spark Row, or read the entity record as JSON and let the user translate it into a Scala Product (the Datastore Java client is REST-based, so the payload is pulled as JSON). In the IntelliJ debugger I can see "entity.properties" as JSON key-values, which include everything I need (column name, value, type, etc.), but I can't use entity.properties due to access restrictions. I'd appreciate any ideas.

Fixed by switching to the low-level API: https://github.com/GoogleCloudPlatform/google-cloud-datastore
Full source for spark-datastore-connector: https://github.com/sgireddy/spark-datastore-connector

Related

Kafka Source Connector - How to pass the schema from String (json)

I've developed a custom source connector for an external REST service.
I get JSONs, convert them to org.apache.kafka.connect.data.Struct with manually defined schema (SchemaBuilder) and wrap all this to SourceRecord.
All of this is for one entity only, but there are a dozen of them.
My new goal is to make this connector universal and parametrize the schema. The idea is to get the schema as a String (JSON) from configs or external files and pass it to SourceRecord, but it only accepts Schema objects.
Are there any simple/good ways to convert a String/JSON to a Schema, or even to pass the String schema directly?
There is a JSON to Avro converter; however, if you are already building a Struct/Schema combination, you shouldn't need to do anything else, as the Converter classes in Kafka Connect can handle the conversion for you:
https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained/

Random Avro data generator in Java/Scala

Is it possible to generate random Avro data for a specified schema using the org.apache.avro library?
I need to produce this data with Kafka.
I tried to find some kind of random data generator for tests; however, I have only come across standalone tools for such data generation, or plain GenericRecord usage. The tools are not very suitable for me because they have a specific file dependency (reading the schema from a file and so on), and GenericRecords have to be generated one by one, as far as I understand.
Are there any other solutions for Java/Scala?
UPDATE: I have found this class, but it does not seem to be accessible from org.apache.avro version 1.8.2.
The reason you need to read a file is that it contains a Schema, which defines the fields that need to be created and their types.
That is not a hard requirement, and there is nothing preventing the creation of random Generic or Specific Records whose schema is built in code via Avro's SchemaBuilder class.
See this repo, for example, which uses a Java class (POJO) generated from an AVSC schema (which, again, could be done with SchemaBuilder instead).
Even the class you linked to uses a schema file.
I personally would probably use avro4s (https://github.com/sksamuel/avro4s) in conjunction with ScalaCheck's (https://www.scalacheck.org) Gen to model such tests.
You could use scalacheck to generate random instances of case classes and avro4s to convert them to generic records, extract their schema etc etc.
There's also avro-mocker https://github.com/speedment/avro-mocker though I don't know how easy it is to hook into the code.
I'd just use Podam http://mtedone.github.io/podam/ to generate POJOs and then just output them to Avro using Java Avro library https://avro.apache.org/docs/1.8.1/gettingstartedjava.html#Serializing

How to evolve Avro schema with Akka Persistence without sending schema or using a registry?

We are considering a serialization approach for our scala-based Akka Persistence app. We consider it likely that our persisted events will "evolve" over time, so we want to support schema evolution, and are considering Avro first.
We'd like to avoid including the full schema with every message. However, for the foreseeable future, this Akka Persistence app is the only app that will be serializing and deserializing these messages, so we don't see a need for a separate schema registry.
Checking the docs for avro and the various scala libs, I see ways to include the schema with messages, and also how to use it "schema-less" by using a schema registry, but what about the in-between case? What's the correct approach for going schema-less, but somehow including an identifier to be able to look up the correct schema (available in the local deployed codebase) for the deserialized object? Would I literally just create a schema that represents my case class, but with an additional "identifier" field for schema version, and then have some sort of in-memory map of identifier->schema at runtime?
Also, is the correct approach to have one serializer/deserializer class for each version of the schema, so it knows how to translate every version to/from the most recent version?
Finally, are there recommendations on how to unit-test schema evolutions? For instance, store a message in akka-persistence, then actually change the definition of the case class, and then kill the actor and make sure it properly evolves. (I don't see how to change the definition of the case class at runtime.)
After spending more time on this, here are the answers I came up with.
Using avro4s, you can use the default data output stream to include the schema with every serialized message. Or, you can use the binary output stream, which simply omits the schema when serializing each message. ('binary' is a bit of a misnomer here since all it does is omit the schema. In either case it is still an Array[Byte].)
Akka itself supplies a Serializer trait or a SerializerWithStringManifest trait, which will automatically include a field for a "schema identifier" in the object of whatever you serialize.
So when you create your custom serializer, you can extend the appropriate trait, define your schema identifier, and use the binary output stream. When those techniques are combined, you'll successfully be using schema-less serialization while including a schema identifier.
One common technique is to "fingerprint" your schema - treat it as a string and then calculate its digest (MD5, SHA-256, whatever). If you construct an in-memory map of fingerprint to schema, that can serve as your application's in-memory schema registry.
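The fingerprint-to-schema map described above can be sketched with nothing but the JDK. This is a minimal illustration, not the code from the prototype: the class name `SchemaFingerprintRegistry` and its methods are hypothetical, and it hashes the raw schema text, so two schemas that differ only in whitespace get different fingerprints. Avro's own `SchemaNormalization` class can fingerprint a canonical form of the schema instead, which is probably preferable in a real serializer.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory "schema registry": SHA-256 fingerprint -> schema text. */
class SchemaFingerprintRegistry {
    private final Map<String, String> schemasByFingerprint = new HashMap<>();

    /** Returns the hex-encoded SHA-256 digest of the schema text (assumption: raw text, not canonical form). */
    static String fingerprint(String schemaJson) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(schemaJson.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be provided by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    /** Stores the schema and returns the identifier to write next to each serialized event. */
    String register(String schemaJson) {
        String id = fingerprint(schemaJson);
        schemasByFingerprint.put(id, schemaJson);
        return id;
    }

    /** Finds the writer schema for an identifier read back from a persisted event. */
    String lookup(String id) {
        String schema = schemasByFingerprint.get(id);
        if (schema == null) {
            throw new IllegalArgumentException("Unknown schema fingerprint: " + id);
        }
        return schema;
    }
}
```

At startup the application would call `register` once for every schema version it still has to read, and the custom serializer would use the returned identifier as the "schema identifier" stored with each event.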
So then when deserializing, your incoming object will have the schema identifier of the schema that was used to serialize it (the "writer"). While deserializing, you should know the identifier of the schema to use to deserialize it (the "reader"). Avro4s supports a way for you to specify both using a builder pattern, so avro can translate the object from the old format to the new. That's how you support "schema evolution". Because of how that works, you don't need a separate serializer for each schema version. Your custom serializer will know how to evolve your objects, because that's the part that Avro gives you for free.
As for unit testing, your best bet is exploratory testing. Actually define multiple versions of a case class in your test, and multiple accompanying versions of its schema, and then explore how Avro works by writing tests that will evolve an object between different versions of that schema.
Unfortunately that won't be directly relevant to the code you are writing, because it's hard to simulate actually changing the code you are testing as you test it.
I developed a prototype that demonstrates several of these answers, and it's available on GitHub. It uses Avro, avro4s, and Akka Persistence. For this one, I demonstrated a changing codebase by actually changing it across commits: you'd check out commit #1, run the code, then move to commit #2, etc. It runs against Cassandra, so it demonstrates replaying events that need to be evolved using a new schema, all without using an external schema registry.

OData REST API where table has columns unique to customer

We would like to create an OData REST API. Our data model is such that each customer has their own database. All database objects have the same definition across all customer databases, with the exception of a single table.
The customer specific table we will call Contact. When a customer adds a column the system creates a column with a standardised name with a definition translated from options selected by the user in the UI. The user only refers to the column data by a field name they have specified to enable the user to be able to generate friendly queries.
It seems to me that the following approaches could be used to enable OData for the model described:
1) Create an OData open type to cater for the dynamic properties. This has the disadvantage that the service gives a customer's users no indication of which dynamic properties can be queried, even though these are known for the user (via token authentication). Also, because dynamic properties are a dictionary, some data pivoting and inefficient query writing would be required. I am not sure how to implement the IQueryable handling of query options for the dynamic properties to enable our own custom field querying.
2) Create a POCO class with e.g. 50 properties; CustomField1, CustomField2... Then somehow control which fields are exposed for use in OData calls. We would then include a separate API call to expose the custom field mapping. E.g. custom field friendly name of MobileNumber = CustomField12.
3) At runtime, check whether the column definitions of the table have changed since the last check. If they have, generate a class specific to the customer using CodeDom and register it with OData, aiming for a unique URL for each customer, e.g. http://domain.name/{customer guid}/odata
I think the ideal for us is option 2. However, since CustomField1 could have an underlying SQL data type of nvarchar, int, decimal, datetime, etc., there are added complications.
Has anyone a working example of how to achieve what has been described, satisfactorily?
Thanks in advance for any help.
Rik
We have run into a similar situation but with our entire dataset being unknown until runtime. Using the ODataConventionModelBuilder and EdmModel classes, you can add properties dynamically to the model at runtime.
I'm not sure whether you will have to manually add all of the properties for this object type even though only some of them are unknown or whether you can add your main object and then add your dynamic ones afterwards, but I guess either would be workable.
If you can get hold of which type of user it is on the server, you could then add only the properties that you are interested in (like option 3 but not having to CodeDom).
There is an example of this kind of untyped OData server in the OData samples here that should get you started: https://github.com/OData/ODataSamples/tree/master/WebApi/v4/ODataUntypedSample
The research we carried out actually suggested Option 1 as the most suitable approach for some operations, i.e. creating an SQL view that unpivots the data in a table into key/value pairs of column name/column value for each column in the table. This was suitable for queries returning small datasets. It was far less effort than Option 3 and less confusing for the user than Option 2. The unpivot query converted the field values to nvarchar (string) values, which meant that filtering in the UI by column value data type was not simple to achieve. (If we decide to implement this ability, I believe it can be achieved by creating a custom attribute that derives from EnableQueryAttribute, marking the controller action with it, and manipulating the IQueryable before execution.)
However, we wanted to expose a /Contacts/Export endpoint that when called would output the columns from a table with a fixed schema joined on a table with a client specific schema and output to a CSV file. All the while utilising the OData supported filter syntax. One of our customer databases has more than 12 million rows of data and is made up of approximately 30 columns.
To achieve this, it looks like our best bet would have been to work with the Microsoft.OData.Core.UriParser.UriQueryExpressionParser class. Unfortunately, Microsoft, in their wisdom, have declared this as internal, along with many of its dependents.
Walking an abstract syntax tree built from OData supported query options and applying our own visitor to each node to build some dynamic Linq query/SQL seems like a possible solution.
For the time being, we will simply implement a cut-down set of supported $filter criteria, without support for grouping parentheses.
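To make the cut-down $filter idea concrete, here is a rough sketch of what such a translator might look like. It is written in Java for illustration only (the project discussed here is .NET), and everything in it is an assumption rather than the implementation we ended up with: the class name `SimpleODataFilter`, the whitelist parameter, and the supported subset (comparisons of the form `field op literal` joined by `and`/`or`, with no parentheses, functions, or `null` literals). Because `and` binds tighter than `or` in both OData and SQL, a flat left-to-right translation keeps the original precedence.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical translator: flat OData-style $filter -> parameterized SQL predicate. */
class SimpleODataFilter {
    private static final Map<String, String> OPERATORS = Map.of(
            "eq", "=", "ne", "<>", "gt", ">", "ge", ">=", "lt", "<", "le", "<=");

    // A token is a single-quoted string ('' escapes a quote) or a run of non-space characters.
    private static final Pattern TOKEN = Pattern.compile("'(?:[^']|'')*'|\\S+");

    final String sql;              // predicate with ? placeholders
    final List<Object> parameters; // values to bind, in placeholder order

    private SimpleODataFilter(String sql, List<Object> parameters) {
        this.sql = sql;
        this.parameters = parameters;
    }

    static SimpleODataFilter parse(String filter, Set<String> allowedFields) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(filter);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        StringBuilder sql = new StringBuilder();
        List<Object> parameters = new ArrayList<>();
        int i = 0;
        while (true) {
            if (i + 3 > tokens.size()) {
                throw new IllegalArgumentException("Incomplete comparison in: " + filter);
            }
            String field = tokens.get(i);
            String sqlOperator = OPERATORS.get(tokens.get(i + 1));
            // Field names end up in the SQL text, so only whitelisted names are accepted.
            if (!allowedFields.contains(field)) {
                throw new IllegalArgumentException("Unknown field: " + field);
            }
            if (sqlOperator == null) {
                throw new IllegalArgumentException("Unsupported operator: " + tokens.get(i + 1));
            }
            sql.append(field).append(' ').append(sqlOperator).append(" ?");
            parameters.add(parseLiteral(tokens.get(i + 2)));
            i += 3;
            if (i == tokens.size()) {
                break;
            }
            String joiner = tokens.get(i++);
            if (!joiner.equals("and") && !joiner.equals("or")) {
                throw new IllegalArgumentException("Expected 'and' or 'or' but found: " + joiner);
            }
            sql.append(' ').append(joiner.toUpperCase()).append(' ');
        }
        return new SimpleODataFilter(sql.toString(), parameters);
    }

    private static Object parseLiteral(String token) {
        if (token.length() >= 2 && token.startsWith("'") && token.endsWith("'")) {
            return token.substring(1, token.length() - 1).replace("''", "'");
        }
        if (token.equals("true") || token.equals("false")) {
            return Boolean.valueOf(token);
        }
        try {
            return Long.valueOf(token);
        } catch (NumberFormatException notAnInteger) {
            // fall through and try a decimal
        }
        try {
            return new BigDecimal(token);
        } catch (NumberFormatException notANumber) {
            throw new IllegalArgumentException("Unsupported literal: " + token);
        }
    }
}
```

The resulting `sql` fragment would be appended to a WHERE clause and the `parameters` bound in order, so literal values never reach the SQL text. Mapping the user's friendly field names to the standardised column names could happen at the whitelist step.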

Data Mapper pattern implementation with zend

I am implementing the Data Mapper pattern in my Zend Framework 1.12 project and it's working as expected. Now I want to optimize it further in the following way.
When fetching data, what if I want to fetch only 3 of the 10 fields in my model table? The current issue is that if I fetch only the required values, the other values in the domain object remain blank, and when saving I save the whole model object rather than a single field value.
Can anyone suggest an efficient way of doing this, so that I can fetch/update only the required values without fetching all field data to update the record?
If a property is NULL, ignore it when crafting the update? If NULLs are valid values, then I think you would need to track loaded/dirty states per property.
How do you go about white-listing the fields to retrieve when making the call to the mapper? If you can persist that information I think it would make sense to leverage that knowledge when going to craft the update.
I don't typically go down this path. I will lazy-load certain fields on a model when it makes sense, but I don't allow loading parts of the object like this; rather, I create an alternate object for rendering a list when loading the full object is too resource-intensive. It's a generic dummy list object that I use only with tabular data, populated from SQL or stored-procedure result sets, usually by my generic table mapper.
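For anyone who does want the per-property loaded/dirty tracking mentioned above, here is a rough sketch of the idea. It is in Java rather than PHP and is not tied to Zend Framework; the names `PartialRecord` and `PartialUpdateBuilder` are made up for illustration. The point is that "loaded" and "changed" are recorded separately from the value, so a NULL that was set on purpose is still written, while a field that was never fetched is left alone.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringJoiner;

/** Hypothetical domain object that remembers which fields were loaded and which were changed. */
class PartialRecord {
    private final Map<String, Object> values = new LinkedHashMap<>();
    private final Set<String> dirty = new LinkedHashSet<>();

    /** Used by the mapper when hydrating from a result row; does not mark the field dirty. */
    void load(String field, Object value) {
        values.put(field, value);
    }

    /** Used by application code; a null value still counts as a change. */
    void set(String field, Object value) {
        values.put(field, value);
        dirty.add(field);
    }

    boolean isLoaded(String field) {
        return values.containsKey(field);
    }

    Object get(String field) {
        if (!values.containsKey(field)) {
            throw new IllegalStateException("Field not loaded: " + field);
        }
        return values.get(field);
    }

    Set<String> dirtyFields() {
        return Collections.unmodifiableSet(dirty);
    }
}

/** Hypothetical mapper helper that writes only the dirty fields. */
class PartialUpdateBuilder {
    /** Builds e.g. "UPDATE t SET a = ?, b = ? WHERE id = ?" (names are assumed to be trusted). */
    static String buildUpdate(String table, String idColumn, PartialRecord record) {
        if (record.dirtyFields().isEmpty()) {
            throw new IllegalStateException("Nothing to update");
        }
        StringJoiner assignments = new StringJoiner(", ");
        for (String field : record.dirtyFields()) {
            assignments.add(field + " = ?");
        }
        return "UPDATE " + table + " SET " + assignments + " WHERE " + idColumn + " = ?";
    }

    /** Values to bind, in the same order as the placeholders in buildUpdate. */
    static List<Object> parameters(PartialRecord record, Object id) {
        List<Object> parameters = new ArrayList<>();
        for (String field : record.dirtyFields()) {
            parameters.add(record.get(field));
        }
        parameters.add(id);
        return parameters;
    }
}
```

The mapper would call `load` for whichever 3 of the 10 columns were selected, and the save path would then touch only the columns that were actually `set`.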