What are the relevant rules for Flink Window TVF and CEP SQL? - flink-sql

I am trying to parse column-level lineage from Flink windowing TVF SQL. I initialize a custom FlinkChainedProgram and set some optimizer rules.
This mostly works fine, except for Window TVF SQL and CEP SQL.
For example, for the following statement I get a logical plan as shown below:
insert into sink_table(f1, f2, f3, f4)
SELECT cast(window_start as String),
cast(window_start as String),
user_id,
cast(SUM(price) as Bigint)
FROM TABLE(TUMBLE(TABLE source_table, DESCRIPTOR(event_time), INTERVAL '10' MINUTES))
GROUP BY window_start, window_end, GROUPING SETS ((user_id), ());
rel#1032:FlinkLogicalCalc.LOGICAL.any.None: 0.[NONE].[NONE](input=FlinkLogicalAggregate#1030,select=CAST(window_start) AS EXPR$0, CAST(window_start) AS EXPR$1, null:BIGINT AS EXPR$2, user_id, null:VARCHAR(2147483647) CHARACTER SET "UTF-16LE" AS EXPR$4, CAST($f4) AS EXPR$5)
As we can see, the optimized RelNode tree contains null columns, so the MetadataQuery can't get the origin column info.
What rules should I set in the logical optimization phase to parse Window TVF SQL and CEP SQL? Thanks.

I solved the column lineage ("field blood relationship") for Flink CEP SQL by adding a getColumnOrigins(Match rel, RelMetadataQuery mq, int iOutputColumn) method to org.apache.calcite.rel.metadata.RelMdColumnOrigins:
/**
 * Supports column lineage ("field blood relationship") for CEP.
 * The first column is the field after PARTITION BY, and the other columns
 * come from the measures in Match.
 */
public Set<RelColumnOrigin> getColumnOrigins(Match rel, RelMetadataQuery mq, int iOutputColumn) {
    if (iOutputColumn == 0) {
        return mq.getColumnOrigins(rel.getInput(), iOutputColumn);
    }
    final RelNode input = rel.getInput();
    RexNode rexNode = rel.getMeasures().values().asList().get(iOutputColumn - 1);
    RexPatternFieldRef rexPatternFieldRef = searchRexPatternFieldRef(rexNode);
    if (rexPatternFieldRef != null) {
        return mq.getColumnOrigins(input, rexPatternFieldRef.getIndex());
    }
    return null;
}

private RexPatternFieldRef searchRexPatternFieldRef(RexNode rexNode) {
    if (rexNode instanceof RexCall) {
        RexNode operand = ((RexCall) rexNode).getOperands().get(0);
        if (operand instanceof RexPatternFieldRef) {
            return (RexPatternFieldRef) operand;
        } else {
            // recursive search
            return searchRexPatternFieldRef(operand);
        }
    }
    return null;
}
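To illustrate the index mapping the method above implements, here is a self-contained sketch (plain Java, no Calcite dependency; the names partitionField and measures are stand-ins for this illustration): output column 0 resolves through the PARTITION BY field of the input, and output column i (for i >= 1) resolves through measure i - 1.

```java
import java.util.List;

public class MeasureIndexMapping {
    /**
     * Illustrative stand-in for the getColumnOrigins(Match, ...) logic:
     * output column 0 maps to the PARTITION BY field of the input, and
     * output column i (i >= 1) maps to measure i - 1. The real code
     * resolves origins through Calcite's RelMetadataQuery instead of
     * returning names directly.
     */
    public static String resolve(String partitionField, List<String> measures, int iOutputColumn) {
        if (iOutputColumn == 0) {
            return partitionField;
        }
        return measures.get(iOutputColumn - 1);
    }

    public static void main(String[] args) {
        List<String> measures = List.of("startTime", "endTime", "avgPrice");
        System.out.println(resolve("user_id", measures, 0));  // user_id
        System.out.println(resolve("user_id", measures, 2));  // endTime
    }
}
```

This is only the index arithmetic; the recursive descent into RexCall operands is what locates the RexPatternFieldRef whose index feeds the second branch.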
Source address: https://github.com/HamaWhiteGG/flink-sql-lineage/blob/main/src/main/java/org/apache/calcite/rel/metadata/RelMdColumnOrigins.java
I have also provided detailed test cases, including Flink CEP SQL test cases; you can refer to: https://github.com/HamaWhiteGG/flink-sql-lineage/blob/main/src/test/java/com/dtwave/flink/lineage/cep/CepTest.java

Related

Force Entity Framework to read each line with all details

I am having trouble with an EF method returning duplicate rows of data. When I run it, in my example, it returns four rows from a database view, but the fourth row contains details from the third row.
The same query in SSMS returns four individual rows with the correct details. I have read somewhere about EF having optimization problems when there is no identity column. But is there any way to alter the code below to force EF to read all records with all their details?
public List<vs_transactions> GetTransactionList(int cID)
{
    using (StagingDataEntities db = new StagingDataEntities())
    {
        var res = from trans in db.vs_transactions
                  where trans.CreditID == cID
                  orderby trans.ActionDate descending
                  select trans;
        return res.ToList();
    }
}
Found the solution :) MergeOption.NoTracking
public List<vs_transactions> GetTransactionList(int cID)
{
    using (StagingDataEntities db = new StagingDataEntities())
    {
        // Set the merge option after the context is created, inside the using block
        db.vs_transactions.MergeOption = MergeOption.NoTracking;
        var res = from trans in db.vs_transactions
                  where trans.CreditID == cID
                  orderby trans.ActionDate descending
                  select trans;
        return res.ToList();
    }
}

LINQ contains and fix

I have a LINQ query
var age = new int[] { 1, 2, 3 };
dbContext.TA.Where(x => age.Contains(x.age)).ToList();
Mistake #11 in an online article (https://medium.com/swlh/entity-framework-common-performance-mistakes-cdb8861cf0e7) mentions that this is not good practice, as it creates many execution plans on the SQL server.
In this case, how should the LINQ be revised so that I can do the same thing but minimize the number of execution plans generated?
(Note that I have no intention of converting it into a stored procedure and passing & joining with a UDT, as again that requires too much effort.)
That article offers some good things to keep in mind when writing expressions for EF. As a general rule, that example is something to keep in mind, not a hard "never do this" kind of rule. It is a warning against writing queries that allow for multi-selects, and to avoid this when possible, as it will be on the more expensive side.
In your example with something like "Ages", having a hard-coded list of values does not cause a problem, because every execution uses the same list (until the app is re-compiled with a new list, or you have code that changes the list for some reason). An example where it can be perfectly valid to use this is with something like statuses, where you have a status enum. If there is a small number of valid statuses that a record can have, then declaring a common array of valid statuses to use in a Contains clause is fine:
public void DeleteEnquiry(int enquiryId)
{
    var allowedStatuses = new[] { Statuses.Pending, Statuses.InProgress, Statuses.UnderReview };
    var enquiry = context.Enquiries
        .Where(x => x.EnquiryId == enquiryId && allowedStatuses.Contains(x.Status))
        .SingleOrDefault();
    try
    {
        if (enquiry != null)
        {
            enquiry.IsActive = false;
            context.SaveChanges();
        }
        else
        {
            // Enquiry not found or invalid status.
        }
    }
    catch (Exception ex) { /* handle exception */ }
}
The statuses in the list aren't going to change, so the execution plan is static for that context.
The problem arises when you accept something like a parameter with criteria that include a list for a Contains clause.
It is highly unlikely that someone would want to load data where a user could select ages "2, 4, and 6"; rather, they would want to select something like ">= 2", "<= 6", or "between 2 and 6". So rather than creating a method that accepts a list of acceptable ages:
public IEnumerable<Children> GetByAges(int[] ages)
{
    return _dbContext.Children.Where(x => ages.Contains(x.Age)).ToList();
}
you would probably be better served by ranged parameters:
private IEnumerable<Children> GetByAgeRange(int? minAge = null, int? maxAge = null)
{
    var query = _dbContext.Children.AsQueryable();
    if (minAge.HasValue)
        query = query.Where(x => x.Age >= minAge.Value);
    if (maxAge.HasValue)
        query = query.Where(x => x.Age <= maxAge.Value);
    return query.ToList();
}

private IEnumerable<Children> GetByAge(int age)
{
    return _dbContext.Children.Where(x => x.Age == age).ToList();
}

How to ensure the returned StringList will be ordered : Scala

I am using Scala 2.11.8.
I am trying to read queries from my property file. Each query set has multiple parts (explained below),
and there is a certain sequence in which these queries must execute.
Code:
import com.typesafe.config.ConfigFactory

object ReadProperty {
  def main(args: Array[String]): Unit = {
    val queryRead = ConfigFactory.load("testqueries.properties").getConfig("select").getStringList("caseInc").toArray()
    val localRead = ConfigFactory.load("testqueries.properties").getConfig("select").getStringList("caseLocal").toArray.toSet
    queryRead.foreach(println)
    localRead.foreach(println)
  }
}
Property file content:
select.caseInc.2 = Select emp_salary, emp_dept_id from employees
select.caseLocal.1 = select one
select.caseLocal.3 = select three
select.caseRemote.2 = Select e1.emp_name, d1.dept_name, e1.salary from emp_1 e1 join dept_1 d1 on(e1.emp_dept_id = d1.dept_id)
select.caseRemote.1 = Select * from departments
select.caseInc.1 = Select emp_id, emp_name from employees
select.caseLocal.2 = select two
select.caseLocal.4 = select four
Output:
Select emp_id, emp_name from employees
Select emp_salary, emp_dept_id from employees
select one
select two
select three
select four
As we can see in the output, the result is sorted. In the property file, I have numbered the queries in the sequence in which they should run (passing caseInc, caseLocal as arguments).
With getStringList() I always get the list sorted by the sequence number I provide.
Even when I tried using toArray() and toArray().toSet, I got sorted output.
So far so good.
But how can I be sure that it will always return the sorted order I provided in the property file? I am confused because I cannot find any API documentation which says that the returned list will be sorted.
I think you can rely on this fact. Looking into the code of DefaultTransformer, you can see the following piece of logic:
} else if (requested == ConfigValueType.LIST && value.valueType() == ConfigValueType.OBJECT) {
    // attempt to convert an array-like (numeric indices) object to a
    // list. This would be used with .properties syntax for example:
    // -Dfoo.0=bar -Dfoo.1=baz
    // To ensure we still throw type errors for objects treated
    // as lists in most cases, we'll refuse to convert if the object
    // does not contain any numeric keys. This means we don't allow
    // empty objects here though :-/
    AbstractConfigObject o = (AbstractConfigObject) value;
    Map<Integer, AbstractConfigValue> values = new HashMap<Integer, AbstractConfigValue>();
    for (String key : o.keySet()) {
        int i;
        try {
            i = Integer.parseInt(key, 10);
            if (i < 0)
                continue;
            values.put(i, o.get(key));
        } catch (NumberFormatException e) {
            continue;
        }
    }
    if (!values.isEmpty()) {
        ArrayList<Map.Entry<Integer, AbstractConfigValue>> entryList =
                new ArrayList<Map.Entry<Integer, AbstractConfigValue>>(values.entrySet());
        // sort by numeric index
        Collections.sort(entryList,
                new Comparator<Map.Entry<Integer, AbstractConfigValue>>() {
                    @Override
                    public int compare(Map.Entry<Integer, AbstractConfigValue> a,
                                       Map.Entry<Integer, AbstractConfigValue> b) {
                        return Integer.compare(a.getKey(), b.getKey());
                    }
                });
        // drop the indices (we allow gaps in the indices, for better or
        // worse)
        ArrayList<AbstractConfigValue> list = new ArrayList<AbstractConfigValue>();
        for (Map.Entry<Integer, AbstractConfigValue> entry : entryList) {
            list.add(entry.getValue());
        }
        return new SimpleConfigList(value.origin(), list);
    }
}
Note how the keys are parsed as integer values and then sorted using Integer.compare.
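The same numeric-index behavior can be reproduced in a few lines of plain Java, mimicking the quoted DefaultTransformer logic (this is an illustrative stand-in, not the library code): keys that parse as non-negative integers are kept, and the values are returned sorted by that index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NumericKeySort {
    /**
     * Mimics the DefaultTransformer behavior: collect values under keys
     * that parse as non-negative integers, then return the values sorted
     * by that numeric index. Non-numeric keys are skipped, and gaps in
     * the indices are allowed, as in the library code.
     */
    public static List<String> toSortedList(Map<String, String> props) {
        Map<Integer, String> values = new HashMap<>();
        for (Map.Entry<String, String> e : props.entrySet()) {
            try {
                int i = Integer.parseInt(e.getKey(), 10);
                if (i >= 0) {
                    values.put(i, e.getValue());
                }
            } catch (NumberFormatException ex) {
                // skip non-numeric keys
            }
        }
        List<Integer> indices = new ArrayList<>(values.keySet());
        indices.sort(Integer::compare);
        List<String> result = new ArrayList<>();
        for (int i : indices) {
            result.add(values.get(i));
        }
        return result;
    }
}
```

So even if the entries appear out of order in the .properties file, as in the question, the list comes back ordered by the numeric suffix.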

How can I generate SQL with a WHERE clause in an EF Core query

I am trying to write an EF query with a filter, but the generated SQL does not use a WHERE clause on SQL Server. It extracts all the data and applies the filter in the client's memory. I am quite worried about performance and would like to apply the filter on SQL Server.
The requirement is to count the number of records in an hour time slot.
public async Task<int> GetNumberofSchedules(DateTime dt)
{
    return await _context.Schedules.CountAsync(
        s => s.state == 0
          && s.AppointmentDate.Value.Date == dt.Date
          && s.AppointmentTime.Value.Hours >= dt.Hour
          && s.AppointmentTime.Value.Hours < dt.AddHours(1).Hour
    );
}
Sample Data
If the given DateTime is 07/04/2017 08:20, it should return the number of records between 07/04/2017 08:00 and 07/04/2017 09:00.
The count does return 6, which is correct, but when I trace the generated SQL in SQL Profiler, it retrieves all the data.
Generated SQL in SQL Profiler
SELECT
[s].[EnrolmentRecordID], [s].[AF1], [s].[AcademicYearID], [s].[AdminName], [s].[AppForm], [s].[AppointmentDate],
[s].[AppointmentTime], [s].[BDT], [s].[BKSB], [s].[DateCreated],
[s].[DateModified], [s].[Employer], [s].[EmployerInvited],
[s].[EmployerReps], [s].[MIAPP], [s].[NI], [s].[ProposedQual],
[s].[SMT], [s].[StudentInvited], [s].[StudentName]
FROM
[dbo].[EN_Schedules] AS [s]
I would like to amend my EF code to generate a WHERE clause and filter on the server side. How can I achieve this?
Update1:
If I remove the filters on the TimeSpan value, it generates the correct SQL statement, as follows. So it seems that I need to apply the filter differently for the TimeSpan field.
exec sp_executesql N'SELECT COUNT(*) FROM [dbo].[EN_Schedules] AS [s]
WHERE ([s].[state] = 0) AND (CONVERT(date, [s].[AppointmentDate]) = @__dt_Date_0)',
N'@__dt_Date_0 datetime2(7)',@__dt_Date_0='2017-04-07 00:00:00'
Update2:
By using Ivan's solution, I ended up with this:
var startTime = new TimeSpan(dt.Hour, 0, 0);
var endTime = new TimeSpan(dt.Hour + 1, 0, 0);
return await _context.Schedules.CountAsync(
    s => s.state == 0
      && s.AppointmentDate.Value.Date == dt.Date
      && s.AppointmentTime.Value >= startTime
      && s.AppointmentTime.Value < endTime
);
It is indeed converted to client evaluation; it looks like many TimeSpan operations are still not well supported by EF Core.
After a lot of trial and error, the only way currently to make it translate to SQL is to prepare the TimeSpan limits as variables and use them inside the query:
var startTime = new TimeSpan(dt.Hour, 0, 0);
var endTime = startTime + new TimeSpan(1, 0, 0);
return await _context.Schedules.CountAsync(
    s => s.state == 0
      && s.AppointmentDate.Value.Date == dt.Date
      && s.AppointmentTime.Value >= startTime
      && s.AppointmentTime.Value < endTime
);
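The slot-boundary computation that this workaround hoists out of the query can be checked in isolation. A small sketch (Java used here for illustration, with java.time.Duration standing in for .NET's TimeSpan; HourSlot is a made-up name):

```java
import java.time.Duration;
import java.time.LocalDateTime;

public class HourSlot {
    /**
     * Computes the [start, end) bounds of the hour slot containing the
     * given timestamp, precomputed once outside the query as in the
     * workaround above: 08:20 falls in the slot [08:00, 09:00).
     */
    public static Duration[] slotFor(LocalDateTime dt) {
        Duration start = Duration.ofHours(dt.getHour());
        Duration end = start.plusHours(1);
        return new Duration[] { start, end };
    }
}
```

Because the bounds are plain values by the time the query is built, the provider only has to translate two simple comparisons rather than TimeSpan arithmetic.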
It looks like client evaluation is the reason.
Disable it with a ConfigureWarnings call. It will then throw an exception if a LINQ statement can't be translated to SQL:
public class FooContext : DbContext
{
    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
    {
        optionsBuilder
            .UseSqlServer("foo_connstr")
            .ConfigureWarnings(warnings => warnings.Throw(RelationalEventId.QueryClientEvaluationWarning));
    }
}

Column must appear in the GROUP BY clause or be used in an aggregate function

I'm updating a Qt application to make it compatible with both SQLite and PostgreSQL.
I have a C++ method that is used to count the elements of a given table with given clauses.
In SQLite, the following worked and gave me a number N (the count):
SELECT COUNT(*) FROM table_a
INNER JOIN table_b
ON table_b.fk_table_a = table_a.id
WHERE table_a.start_date_time <> 0
ORDER BY table_a.creation_date_time DESC
With PostgreSQL (I'm using 9.3), I get the following error:
ERROR: column "table_a.creation_date_time" must appear in the
GROUP BY clause or be used in an aggregate function
LINE 5: ORDER BY
table_a.creation_date_time DESC
If I add GROUP BY table_a.creation_date_time, it gives me a table with N rows.
I've read a lot about how different DBMSs allow you to omit columns in the GROUP BY clause. Now I'm just confused.
For those who are curious, the C++ method is:
static int count(const QString &table, const QString &clauses = QString(""))
{
    int success = -1;
    if (!table.isEmpty())
    {
        QString statement = QString("SELECT COUNT(*) FROM ");
        statement.append(table);
        if (!clauses.isEmpty())
        {
            statement.append(" ").append(clauses);
        }
        QSqlQuery query;
        if (!query.exec(statement))
        {
            qWarning() << query.lastError();
            qWarning() << statement;
        }
        else
        {
            if (query.isActive() && query.isSelect() && query.first())
            {
                bool ok = false;
                success = query.value(0).toInt(&ok);
                if (ok == false)
                {
                    success = -1;
                    return success;
                }
            }
        }
    }
    return success;
}
If you're just doing a COUNT(*) on the table in order to get a single scalar result, then surely the ORDER BY is redundant?
Solution
Remove the redundant ORDER BY to get "standard" query behavior across multiple DBMSs.