Generating GraphQL schema from external .graphqls file in scala (sangria) - scala

In graphql-scala (Sangria), can we parse and use a GraphQL schema supplied externally in a .graphqls file, so that I don't have to create case classes for each of the object types? I know this is very much possible in graphql-java, which picks up the schema from a .graphqls file without my having to create POJOs for each object.
So basically, my requirement is this: we will have more than 100 tables, and the schema for each is stored in HBase as key-value pairs in JSON format. A program would read each schema from HBase and generate a .graphqls file, which would then be used by my GraphQL server.
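Something along the lines of Sangria's schema materialization (Schema.buildFromAst) may be what is needed here; a minimal sketch, assuming the generated file is called schema.graphqls and leaving resolver wiring aside:

import scala.io.Source

import sangria.parser.QueryParser
import sangria.schema.Schema

// Load the IDL produced from the HBase-stored schemas (file name is an assumption)
val idl = Source.fromFile("schema.graphqls").mkString

// Parse the IDL into an AST document (returns a Try by default)
val astDocument = QueryParser.parse(idl)

// Materialize a schema from the AST. Schema.buildFromAst also accepts an
// AstSchemaBuilder, which is where field resolvers would be attached.
val schema = astDocument.map(doc => Schema.buildFromAst(doc))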

Related

Map JSON CDC data in cache value to Ignite sqlline thin client tables

Basically, I am sending CDC JSON data from Kafka topics to an Ignite cache via connectors. For transformation purposes I want to query this JSON data as a table in Ignite. How can I map the JSON fields to the columns of an Ignite SQL (sqlline) table?
If I understood your question correctly, it should be fairly straightforward.
You can leverage Java classes for that: define a class whose fields correspond to your JSON structure. Together with the cache configuration property CacheConfiguration.setIndexedTypes(...), that makes the cache SQL-enabled.
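A rough Scala sketch of that setup (class, field, and cache names are illustrative only; note the @field meta-annotation so the Java annotation lands on the underlying field):

import scala.annotation.meta.field

import org.apache.ignite.Ignition
import org.apache.ignite.cache.query.annotations.QuerySqlField
import org.apache.ignite.configuration.CacheConfiguration

// Class mirroring the JSON structure of the CDC records (fields are made up)
class Person(
  @(QuerySqlField @field)(index = true) val id: java.lang.Long,
  @(QuerySqlField @field) val name: String)

val cacheCfg = new CacheConfiguration[java.lang.Long, Person]("person-cache")
// Registering key/value types makes the cache queryable with SQL
cacheCfg.setIndexedTypes(classOf[java.lang.Long], classOf[Person])

val ignite = Ignition.start()
val cache = ignite.getOrCreateCache(cacheCfg)
cache.put(1L, new Person(1L, "Alice"))
// Now SELECT id, name FROM Person works from sqlline / the SQL API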

Is there a way to setup code generation in JOOQ for multiple schemas with the same table structure?

We have a multi-tenant database, where each tenant has their own dedicated schema. The schemas always have identical table structures. What I'm trying to figure out is whether there's a way to pass the schema to JOOQ at query time when using code generation to track the schema. Something like:
dslContext.useSchema("schema1").select(A.id).from(A).fetch()
It seems like the schema is always tied to the table object and the only option for mapping at runtime is statically via an input schema and an output schema.
Environment: Java/Kotlin, Maven, Spring Boot, Postgres, Flyway
The features you are looking for are:
Code generation time schema mapping
Runtime schema mapping
See also the FAQ
The simplest solution here is to just turn off the generation of schema information in the code generator:
<outputSchemaToDefault>true</outputSchemaToDefault>
Or at runtime:
new Settings().withRenderSchema(false);
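For switching schemas per tenant at run time, the runtime schema mapping mentioned above could look roughly like this; the schema names are placeholders and this is only a sketch against jOOQ's Settings API:

import org.jooq.SQLDialect
import org.jooq.conf.{MappedSchema, RenderMapping, Settings}
import org.jooq.impl.DSL

// Map the schema the code was generated against ("tenant_template" here)
// onto the tenant's actual schema, chosen at run time
def dslFor(connection: java.sql.Connection, tenantSchema: String) = {
  val settings = new Settings()
    .withRenderMapping(new RenderMapping()
      .withSchemata(new MappedSchema()
        .withInput("tenant_template")
        .withOutput(tenantSchema)))
  DSL.using(connection, SQLDialect.POSTGRES, settings)
}

// Usage, mirroring the pseudocode from the question:
// dslFor(conn, "schema1").select(A.ID).from(A).fetch()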

How to capture schema of a JSON file using Talend

I'm trying to capture the schema of a JSON file that I am generating from a SQL database using Talend. I need to store this schema in a separate file. Does anyone know of a way to capture this?
In the Metadata section of the Repository, you can create a JSON file schema. There you can import an example JSON file; Talend will then generate a schema that you can reuse in the output of your job, for example in a tWriteJSONField component.

How to update table schema when there is new Avro schema for Kafka data in Flink?

We are consuming a Kafka topic in the Flink application using Flink Table API.
When we first submit the application, we read the latest schema from our custom registry, then create a Kafka DataStream and Table using that Avro schema. My serializer implementation works similarly to the Confluent schema registry: it checks the schema ID and then looks the schema up in the registry, so we can apply the correct schema at runtime.
However, I do not know how to update the table schema and re-execute the SQL without redeploying the job. Is there a way to have a background thread that checks for schema changes and, if there are any, pauses the current execution, updates the table schema, and re-executes the SQL?
This will be particularly useful for the continuous delivery of schema changes to the applications. We already have a compatibility check in place.
TL;DR you don't need to change anything to get it working in most cases.
In Avro, there is the concept of reader and writer schemas. The writer schema is the schema that was used to produce the Avro record, and it is referenced in the payload (in most cases as an id).
The reader schema is the one your application uses to make sense of the data: for a particular calculation you rely on a specific set of fields of an Avro record.
Now the good part: Avro transparently resolves the writer schema against the reader schema as long as the two are compatible. So as long as your schemas are fully compatible, there is always a way to transform data written with the writer schema into your reader schema.
So if the schema of the records changes in the background while the application is running, the DeserializationSchema fetches the new writer schema and infers a new mapping to the reader schema. Your query will not notice any change.
This approach falls short if you actually want to enrich the schema in your application; for example, if you always want to add one calculated field and return all other fields, a newly added field will not be picked up, because effectively your reader schema changes. In that case, you either need to restart the job or use a generic record schema.
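To make the reader/writer resolution concrete, here is roughly what it looks like with plain Avro in Scala (schemas and payload are placeholders; Flink's Avro deserialization does the equivalent internally):

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

// writerSchemaJson: fetched from the registry using the id embedded in the payload
// readerSchemaJson: the schema your application was built against
def decode(payload: Array[Byte], writerSchemaJson: String, readerSchemaJson: String): GenericRecord = {
  val writerSchema = new Schema.Parser().parse(writerSchemaJson)
  val readerSchema = new Schema.Parser().parse(readerSchemaJson)

  // Avro resolves writer -> reader as long as the two schemas are compatible
  val reader = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
  val decoder = DecoderFactory.get().binaryDecoder(payload, null)
  reader.read(null, decoder)
}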

AWS Glue: How to handle nested JSON with varying schemas

Objective:
We're hoping to use the AWS Glue Data Catalog to create a single table for JSON data residing in an S3 bucket, which we would then query and parse via Redshift Spectrum.
Background:
The JSON data is from DynamoDB Streams and is deeply nested. The first level of JSON has a consistent set of elements: Keys, NewImage, OldImage, SequenceNumber, ApproximateCreationDateTime, SizeBytes, and EventName. The only variation is that some records do not have a NewImage and some don't have an OldImage. Below this first level, though, the schema varies widely.
Ideally, we would like to use Glue to only parse this first level of JSON, and basically treat the lower levels as large STRING objects (which we would then parse as needed with Redshift Spectrum). Currently, we're loading the entire record into a single VARCHAR column in Redshift, but the records are nearing the maximum size for a data type in Redshift (maximum VARCHAR length is 65535). As a result, we'd like to perform this first level of parsing before the records hit Redshift.
What we've tried/referenced so far:
Pointing the AWS Glue Crawler at the S3 bucket results in hundreds of tables with a consistent top-level schema (the attributes listed above) but varying schemas at deeper levels in the STRUCT elements. We have not found a way to create a Glue ETL Job that would read from all of these tables and load them into a single table.
Creating a table manually has not been fruitful. We tried setting each column to a STRING data type, but the job did not succeed in loading data (presumably since this would involve some conversion from STRUCTs to STRINGs). When setting columns to STRUCT, it requires a defined schema - but this is precisely what varies from one record to another, so we are not able to provide a generic STRUCT schema that works for all the records in question.
The AWS Glue Relationalize transform is intriguing, but not what we're looking for in this scenario (since we want to keep some of the JSON intact, rather than flattening it entirely). Redshift Spectrum supports scalar JSON data as of a couple weeks ago, but this does not work with the nested JSON we're dealing with. Neither of these appear to help with handling the hundreds of tables created by the Glue Crawler.
Question:
How would we use Glue (or some other method) to allow us to parse just the first level of these records - while ignoring the varying schemas below the elements at the top level - so that we can access it from Spectrum or load it physically into Redshift?
I'm new to Glue. I've spent quite a bit of time in the Glue documentation and looking through (the somewhat sparse) info on forums. I could be missing something obvious - or perhaps this is a limitation of Glue in its current form. Any recommendations are welcome.
Thanks!
I'm not sure you can do this with a table definition, but you can accomplish it with an ETL job by using a mapping function to cast the top-level values to JSON strings. Documentation: [link]
import json
from awsglue.context import GlueContext
from awsglue.transforms import Map
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Mapping function: serialize each top-level value back into a JSON string
def flatten(rec):
    for key in rec:
        rec[key] = json.dumps(rec[key])
    return rec

# Read the raw JSON from S3 into a DynamicFrame
old_df = glueContext.create_dynamic_frame.from_options(
    's3',
    {"paths": ['s3://...']},
    "json")

# Apply mapping function flatten to all DynamicRecords in the DynamicFrame
new_df = Map.apply(frame=old_df, f=flatten)
From here you have the option of exporting to S3 (perhaps in Parquet or some other columnar format to optimize querying) or, as I understand it, loading directly into Redshift, although I haven't tried that.
This is a limitation of Glue as of now. Have you taken a look at Glue Classifiers? It's the only piece I haven't used yet, but might suit your needs. You can define a JSON path for a field or something like that.
Other than that, Glue Jobs are the way to go. It's Spark in the background, so you can do pretty much everything. Set up a development endpoint and play around with it. I've run into various roadblocks over the last three weeks and decided to forgo Glue-specific functionality entirely and use only Spark; that way it's both portable and actually works.
One thing you might need to keep in mind when setting up the dev endpoint is that the IAM role must have a path of "/", so you will most probably need to create a separate role manually that has this path. The one automatically created has a path of "/service-role/".
You should add a Glue classifier, preferably $[*].
When you crawl the JSON file in S3, it will read the first line of the file.
You can then create a Glue job to load the Data Catalog table for this JSON file into Redshift.
My only problem here is that Redshift Spectrum has trouble reading JSON tables in the Data Catalog.
Let me know if you have found a solution.
The procedure I found useful for shallowly nested JSON (a Scala/Spark sketch of the main steps follows this list):
1. ApplyMapping for the first level as datasource0;
2. Explode struct or array objects to get rid of the element level: df1 = datasource0.toDF().select(id, col1, col2, ..., explode(coln).alias(coln)), where explode requires from pyspark.sql.functions import explode;
3. Select the JSON objects that you would like to keep intact: intact_json = df1.select(id, itct1, itct2, ..., itctm);
4. Transform df1 back to a DynamicFrame, Relationalize it, and drop the intact columns with dataframe.drop_fields(itct1, itct2, ..., itctm);
5. Join the relationalized table with the intact table on the 'id' column.
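A rough Scala/Spark sketch of the explode, keep-intact, and join steps above (column names are the same placeholders used in the list; the Relationalize step itself is left to Glue):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder().getOrCreate()

// datasource0 after ApplyMapping, read here as a plain DataFrame (placeholder path)
val df0 = spark.read.json("s3://bucket/first-level-mapped/")

// Step 2: explode the nested array column to remove the element level
val df1 = df0.select(col("id"), col("col1"), explode(col("coln")).alias("coln"))

// Step 3: the JSON objects to keep intact
val intact = df1.select(col("id"), col("itct1"), col("itct2"))

// Step 4 (partial): drop the intact columns before handing the rest to Relationalize
val toRelationalize = df1.drop("itct1", "itct2")

// Step 5: after relationalizing, join the result back to the intact table on id
val joined = toRelationalize.join(intact, Seq("id"))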
As of 12/20/2018, I was able to manually define a table with the first-level JSON fields as columns of type STRING. Then, in the Glue script, the DynamicFrame has those columns as strings. From there, you can do an Unbox operation of type json on the fields; this will JSON-parse the fields and derive the real schema. Combining Unbox with Filter allows you to loop through and process heterogeneous JSON schemas from the same input, provided you can loop through a list of schemas.
However, one word of caution: this is incredibly slow. I think Glue is downloading the source files from S3 during each iteration of the loop. I've been trying to find a way to persist the initial source data, but it looks like .toDF derives the schema of the string JSON fields even if you specify them as Glue StringType. I'll add a comment here if I can figure out a solution with better performance.