Use TableProvider to generate a table and run an SQL query in Apache Beam - apache-beam

I want to generate an unbounded collection of rows and run an SQL query on it using the Apache Beam Calcite SQL dialect and the Apache Flink runner. Based on the Apache Beam source code and documentation, it looks like this can be done with a table provider such as GenerateSequenceTableProvider. But I don't understand how to use it outside of the Beam SQL CLI; I'd like to use it in my regular Java code.
I was trying to do something like this:
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
Pipeline pipeline = Pipeline.create(options);
GenerateSequenceTableProvider tableProvider = new GenerateSequenceTableProvider();
tableProvider.createTable(Table.builder()
.name("sequence")
.schema(Schema.of(Schema.Field.of("sequence", Schema.FieldType.INT64), Schema.Field.of("event_time", Schema.FieldType.DATETIME)))
.type(tableProvider.getTableType())
.build()
);
PCollection<Row> res = PCollectionTuple.empty(pipeline).apply(SqlTransform.query("select * from sequenceSchema.sequence limit 5").withTableProvider("sequenceSchema", tableProvider));
pipeline.run().waitUntilFinish();
But I'm getting Object 'sequence' not found within 'sequenceSchema' errors, so I guess I'm not actually creating the table. So how do I create the table? If I understand correctly, the values should be provided automatically by the table provider.
Basically, how do I use Beam SQL table providers if I want to execute queries on the tables that these providers are (I think?) supposed to generate?

The TableProvider interface is a bit difficult to work with directly. The problem you're running into is that GenerateSequenceTableProvider, like many other TableProviders, doesn't have any way to store table metadata on its own, so calling its createTable method is actually a no-op! What you'll want to do is wrap it in an InMemoryMetaStore, something like this:
GenerateSequenceTableProvider tableProvider = new GenerateSequenceTableProvider();
InMemoryMetaStore metaStore = new InMemoryMetaStore();
metaStore.registerProvider(tableProvider);
metaStore.createTable(Table.builder()
.name("sequence")
.schema(Schema.of(Schema.Field.of("sequence", Schema.FieldType.INT64), Schema.Field.of("event_time", Schema.FieldType.DATETIME)))
.type(tableProvider.getTableType())
.build()
);
PCollection<Row> res = PCollectionTuple.empty(pipeline)
.apply(SqlTransform.query("select * from sequenceSchema.sequence limit 5")
.withTableProvider("sequenceSchema", metaStore));
(Note I haven't tested this, but I think something like it should work)
As robertwb pointed out, another option would be to just avoid the TableProvider interface and use GenerateSequence directly. You'd just need to make sure that your PCollection has a schema. Then you could process it with SqlTransform, like this:
pc.apply(SqlTransform.query("select * from PCOLLECTION limit 5"))
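For example, here is a rough, untested sketch of that approach, continuing from the pipeline created in the question: build a schema'd PCollection<Row> from GenerateSequence (Duration and Instant are the Joda-Time classes Beam uses), then query it as the implicit PCOLLECTION table. The one-element-per-second rate is just an illustration.
Schema schema = Schema.builder()
    .addInt64Field("sequence")
    .addDateTimeField("event_time")
    .build();

PCollection<Row> rows = pipeline
    .apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)))
    .apply(MapElements.into(TypeDescriptor.of(Row.class))
        .via((Long n) -> Row.withSchema(schema)
            .addValues(n, Instant.now()) // event_time as processing time, for illustration
            .build()))
    .setRowSchema(schema); // SqlTransform needs the row schema to be set

PCollection<Row> result = rows.apply(SqlTransform.query("select * from PCOLLECTION limit 5"));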

If you can't get TableProviders to work, you could read this as an ordinary PCollection and then apply a SqlTransform to the result.

Related

Selecting identically named columns in jOOQ

I'm currently using jOOQ to build my SQL (with code generation via the mvn plugin).
Executing the created query is not done by jOOQ, though (I'm using Vert.x SqlClient for that).
Let's say I want to select all columns of two tables which share some identical column names, e.g. UserAccount(id, name, ...) and Product(id, name, ...). When executing the following code
val userTable = USER_ACCOUNT.`as`("u")
val productTable = PRODUCT.`as`("p")
create().select().from(userTable).join(productTable).on(userTable.ID.eq(productTable.AUTHOR_ID))
the build method query.getSQL(ParamType.NAMED) returns me a query like
SELECT "u"."id", "u"."name", ..., "p"."id", "p"."name", ... FROM ...
The problem here is that the result set will contain the columns id and name twice, without the prefix "u." or "p.", so I can't map/parse them correctly.
Is there a way to tell jOOQ to alias these columns like the following, without any further manual effort?
SELECT "u"."id" AS "u.id", "u"."name" AS "u.name", ..., "p"."id" AS "p.id", "p"."name" AS "p.name" ...
I'm using the holy Postgres database :)
EDIT: My current approach is something like
val productFields = productTable.fields().map { it.`as`(name("p.${it.name}")) }
val userFields = userTable.fields().map { it.`as`(name("u.${it.name}")) }
create().select(productFields, userFields, ...)...
This feels really hacky, though.
How to correctly dereference tables from records
You should always use the column references that you passed to the query to dereference values from records in your result. If you didn't pass column references explicitly, then the ones from your generated table via Table.fields() are used.
In your code, that would correspond to:
userTable.NAME
productTable.NAME
So, in a resulting record, do this:
val rec = ...
rec[userTable.NAME]
rec[productTable.NAME]
Using Record.into(Table)
Since you seem to be projecting all the columns (do you really need all of them?) into the generated POJO classes, you can still do this intermediate step if you want:
val rec = ...
val userAccount: UserAccount = rec.into(userTable).into(UserAccount::class.java)
val product: Product = rec.into(productTable).into(Product::class.java)
Because the generated table has all the necessary meta data, it can decide which columns belong to it, and which ones don't. The POJO doesn't have this meta information, which is why it can't disambiguate the duplicate column names.
Using nested records
You can always use nested records directly in SQL as well in order to produce one of these 2 types:
Record2<Record[N], Record[N]> (e.g. using DSL.row(table.fields()))
Record2<UserAccountRecord, ProductRecord> (e.g. using DSL.row(table.fields()).mapping(...), or, starting from jOOQ 3.17, directly using a Table<R> as a SelectField<R>)
The second jOOQ 3.17 solution would look like this:
// Using an implicit join here, for convenience
create().select(productTable.userAccount(), productTable)
.from(productTable)
.fetch();
The above is using implicit joins, for additional convenience
Auto aliasing all columns
There are a ton of flavours that users might want when "auto-aliasing" columns in SQL. Any solution offered by jOOQ would be no better than the one you've already found, so if you still want to auto-alias all columns, just do what you did.
But usually, the desire to auto-alias is a feature request derived from a misunderstanding of what the best approach is in jOOQ (see the options above), so ideally you don't go down the auto-aliasing road.

Case-when or if-then to control table creation in Redshift

I have a handful of data sources that I'd like to apply the same analyses to and eventually load into a larger table (uniformtable). Different sources contain different columns, and sometimes sources involve crosswalk files that I need to join. I'd like to have one query that converts each source's data into uniformtable formatting, based on a unique key for each source. Something along the lines of this:
case when source.sourceid = 1 then
create uniformtable as
select column1a as uniforma, column1b as uniformb, sourceid from source
else
when source.sourceid = 2 then
create uniformtable as
select column2a as uniforma, column2b as uniformb, sourceid from source
end;
I've tried using if-then and case-when to accomplish this, but I get syntax errors pointing to the very start of my query. Does Redshift allow you to use if logic for this kind of control?
No, this logic is not permitted.
CASE is an expression that can only appear inside a statement (such as a SELECT); it is not a control-flow construct that can choose which statement gets executed.
You would need to perform this logic external to Amazon Redshift, and then just send the final SQL to create the table.
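For example, a minimal sketch of doing the branching in a Java client over JDBC and sending only the finished CREATE TABLE AS statement to Redshift; the table and column names come from the question, and the connection details are placeholders:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class UniformTableLoader {
    // Sketch only: the branch on sourceid happens in application code,
    // and Redshift just receives the final CREATE TABLE AS statement.
    public static void createUniformTable(String jdbcUrl, String user, String password,
                                          int sourceId) throws Exception {
        final String sql;
        if (sourceId == 1) {
            sql = "create table uniformtable as "
                + "select column1a as uniforma, column1b as uniformb, sourceid from source";
        } else if (sourceId == 2) {
            sql = "create table uniformtable as "
                + "select column2a as uniforma, column2b as uniformb, sourceid from source";
        } else {
            throw new IllegalArgumentException("Unknown sourceid: " + sourceId);
        }

        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             Statement stmt = conn.createStatement()) {
            stmt.execute(sql);
        }
    }
}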

Is it possible to run a SQL query with EntityFramework that joins three tables between two databases?

So I've got a SQL query that is called from an API that I'm trying to write an integration test for. I have the method that prepares the data totally working, but I realized that I don't know how to actually execute the query to check that data (and run the test). Here is what the query looks like (slightly redacted to protect confidential data):
SELECT HeaderQuery.[headerid],
kaq.[applicationname],
HeaderQuery.[usersession],
HeaderQuery.[username],
HeaderQuery.[referringurl],
HeaderQuery.[route],
HeaderQuery.[method],
HeaderQuery.[logdate],
HeaderQuery.[logtype],
HeaderQuery.[statuscode],
HeaderQuery.[statusdescription],
DetailQuery.[detailid],
DetailQuery.[name],
DetailQuery.[value]
FROM [DATABASE1].[dbo].[apilogheader] HeaderQuery
LEFT JOIN [DATABASE1].[dbo].[apilogdetails] DetailQuery
ON HeaderQuery.[headerid] = DetailQuery.[headerid]
INNER JOIN [DATABASE2].[dbo].[apps] kaq
ON HeaderQuery.[applicationid] = kaq.[applicationid]
WHERE HeaderQuery.[applicationid] = #applicationid1
AND HeaderQuery.[logdate] >= #logdate2
AND HeaderQuery.[logdate] <= #logdate3
For the sake of the test, and considering I already have the SQL script, I was hoping to be able to just execute that script above (providing the where clause programmatically) using context.Database.SqlQuery<string>(QUERY) but since I have two different contexts, I'm not sure how to do that.
The short answer is no, EF doesn't support cross-database queries. However, there are a few things you can try:
Use two different database contexts (one for each database), run your respective queries, and then merge/massage the data after the query returns.
Create a database view and query the view through EF.
Use a SYNONYM: https://rachel53461.wordpress.com/2011/05/22/tricking-ef-to-span-multiple-databases/
If the databases are on the same server, you can try using a DbCommandInterceptor.
I've had this requirement before and personally like the view option.

Apache Spark Multiple Aggregations

I am using Apache Spark in Scala to run aggregations on multiple columns in a DataFrame, for example:
select column1, sum(1) as count from df group by column1
select column2, sum(1) as count from df group by column2
The actual aggregation is more complicated than just sum(1), but that's beside the point.
Query strings such as the examples above are compiled for each variable that I would like to aggregate, and I execute each string through a Spark SQL context to create a corresponding DataFrame that represents the aggregation in question.
The nature of my problem is that I would have to do this for thousands of variables.
My understanding is that Spark will have to "read" the main DataFrame each time it executes an aggregation.
Is there maybe an alternative way to do this more efficiently?
Thanks for reading my question, and thanks in advance for any help.
Go ahead and cache the DataFrame after you build it from your source data. Also, to avoid writing all the queries in the code, put them in a file and pass the file in at run time, then have something in your code that reads the file and runs the queries. The best part about this approach is that you can change your queries by updating the file rather than the application. Just make sure you find a way to give each output a unique name.
In PySpark, it would look something like this.
dataframe = sqlContext.read.parquet("/path/to/file.parquet")
# do your manipulations/filters here
dataframe.cache()
# register the DataFrame so the queries can refer to it as "df"
dataframe.registerTempTable("df")
queries = ...  # however you want to read/parse the query file
for i, query in enumerate(queries):
    output = sqlContext.sql(query)
    # give each output a unique name so the results don't overwrite each other
    output.write.parquet("/path/to/output_%d.parquet" % i)

How to call 'like any' PostgreSQL function in JPQL

I have the following issue:
I have a list of names that I want to filter on. The problem is that I don't have the full names (because I'm receiving them from the UI); I have, for example, this array = ['Joh', 'Michae'].
So I want to filter based on this array.
I wrote this query in PostgreSQL:
select * from q_ob_person where name like any (array['%Хомяченко%', '%Вартопуз%']);
And I want to ask how to write a JPQL query for this.
Is there a way to call PostgreSQL's like any from JPQL?
JPA 2.1 allows invocation of any SQL function using
FUNCTION(sqlFuncName, sqlArgs)
So you could likely do something like the following (note: I've never tried this LIKE ANY you refer to, so just play around with it):
FUNCTION("LIKE", FUNCTION("ANY", arrayField))
Obviously by invoking SQL functions specific to a particular RDBMS you lose database independence (in case that's of importance).
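For reference, a minimal, untested sketch of where a FUNCTION() call sits inside a JPQL query string, using an ordinary SQL function just to show the mechanism; the QObPerson entity and its name field are assumptions based on the question, and whether the nested LIKE/ANY form above is accepted at all depends on your JPA provider:
import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.TypedQuery;

public class PersonSearch {
    // Sketch only: QObPerson and its "name" field are assumed from the question.
    public List<QObPerson> findByNameFragment(EntityManager em, String fragment) {
        TypedQuery<QObPerson> query = em.createQuery(
            "select p from QObPerson p where FUNCTION('lower', p.name) like :pattern",
            QObPerson.class);
        query.setParameter("pattern", "%" + fragment.toLowerCase() + "%");
        return query.getResultList();
    }
}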