I'm currently working on a report generation servlet that agglomerates information from several tables and generates a report. In addition to returning the resulting rows, I'm also storing them into a reports table so they won't need to be regenerated later, and will persist if the tables they're drawn from are wiped. To do the latter I have a statement of the form (NB: x is externally generated and actually a constant in this statement):
INSERT INTO reports
(report_id, col_a, col_b, col_c)
SELECT x as report_id, foo.a, bar.b, bar.c
FROM foo, bar
This works fine, but then I need a second query to actually return the resulting rows back, e.g.
SELECT col_a, col_b, col_c
FROM reports
WHERE report_id = x
This works fine, and since it only involves the single table it shouldn't be expensive, but it seems like I should be able to return the results of the insertion directly and avoid the second query. Is there some syntax for doing this that I've not been able to find? (I should note that I'm fairly new at DB work, so if the right answer is to just run the second query, it being only slightly slower, so be it.)
In PostgreSQL 8.2 or later, you can use this construct:
INSERT INTO reports (report_id, col_a, col_b, col_c)
SELECT x as report_id, foo.a, bar.b, bar.c
FROM foo, bar
RETURNING col_a, col_b, col_c
Or without select:
INSERT INTO distributors (did, dname) VALUES (DEFAULT, 'XYZ Widgets')
RETURNING did;
See the PostgreSQL documentation for details.
You could also use an SRF (set-returning function), although that may be overkill. It depends on what you are trying to do. For example, if you are only returning the information in order to perform a piece of logic that will go directly back to the database to perform more queries, it may make sense to use an SRF.
http://wiki.postgresql.org/wiki/Return_more_than_one_row_of_data_from_PL/pgSQL_functions
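For illustration, here is a minimal sketch of such a function, assuming a PostgreSQL version recent enough to support RETURN QUERY with a RETURNING query (newer than the 8.2 mentioned above). The function name generate_report is made up; the table and column names come from the question.

CREATE OR REPLACE FUNCTION generate_report(p_report_id integer)
RETURNS SETOF reports AS $$
BEGIN
    -- Insert the generated rows and hand them straight back to the caller.
    RETURN QUERY
        INSERT INTO reports (report_id, col_a, col_b, col_c)
        SELECT p_report_id, foo.a, bar.b, bar.c
        FROM foo, bar
        RETURNING *;
END;
$$ LANGUAGE plpgsql;

-- Usage: SELECT col_a, col_b, col_c FROM generate_report(42);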
Related
I've got a SQL Server 2005 database. I need to get distinct values in addition to calling a function on those distinct values. I'm not sure how the distinct works when there is a function call involved. For example, I have this query:
SELECT DISTINCT a, b, c, fcn_DoSomething(a, b, c) AS z FROM users
I'm guessing that the function (fcn_DoSomething) is being called for all of the values in the table, not the distinct values. Am I correct? If so, how can I write the query to call the function only on distinct values of a,b,c? I know one option is to use a temporary table, but if anyone has better ideas that would be great.
Thanks
This got me curious, so I did a bit of basic testing. I created a small table with some distinct and some repeating values, a function that just does string concatenation, and then looked at the execution plans for:
Go
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
select distinct cola, colb, dbo.sillyfunc(cola, colb)
from distincttest
--Clear the cache
Go
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
select cola, colb, dbo.sillyfunc(cola, colb)
from (select distinct cola, colb from distincttest) as t
In this case, the execution plans showed clearly that the first one ran the concatenation function for every single row, but the second did the sort for distinct values first, then ran the function. But for a small number of rows, they had the same execution time, and when run together they showed each one using 50% of the total query resources.
So, I added a few hundred thousand repeating rows and tried again. This changed the query plan so that it was doing a hash match to get distinctness rather than the earlier sort, and now the second version, which forced it to select for distinctness first, executed more than ten times faster.
Finally, I thought there was a chance that this might just be because SQL Server had my sillyfunc marked as nondeterministic (select OBJECTPROPERTYEX(object_id('dbo.sillyfunc'), 'isdeterministic') returned 0), so I switched to patindex which was a builtin function and considered deterministic. This gave me the same results with the function being called for every row in the first version and just for the few distinct ones in the second version.
So, it's possible that further testing would find situations that would coax the optimizer into something more sophisticated, but it appears that if you want to apply the distinct before the function is called, then you need to use something like a subquery, CTE, or temp table to limit what the function has access to.
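For example, a CTE form of the second query above looks like this (same test table and function as above; SQL Server 2005 supports CTEs):

WITH distinct_vals AS (
    SELECT DISTINCT cola, colb
    FROM distincttest
)
SELECT cola, colb, dbo.sillyfunc(cola, colb)
FROM distinct_vals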
This would ensure that the function only got called on distinct values.
select *, fcn_DoSomething(a, b, c)
from
(select distinct a,b,c FROM users) v
However, I believe that the function call will be optimised, so it may not make a difference. Give it a try.
Could anyone offer a solution to speed up one of our processes? We have a view used for reporting that is a union all of 10 tables. The view has 180 million rows. We would like to generate a list of the distinct values of individual columns. The current SQL generated by the reporting tool does a select distinct on the view which takes 10 minutes. Preferably the solution would be automatically updated. We have been trying to create a MQT in DB2 udb V8 as a union all, refresh immediate with little success. Any suggestions would be greatly appreciated.
Charles.
There are a lot of restrictions in DB2 8.2 for refresh immediate MQTs, and they can have a significant performance impact on applications that write to the base tables. That said, you may be able to use an MQT. However, instead of using SELECT DISTINCT, try making the query look something like:
select yourcolumn, count(*) as ignore
from union_all_view
group by yourcolumn
The column (yourcolumn) must be defined as NOT NULL for this to work (in DB2 8.2). The optimizer may not select this MQT if you still issue SELECT DISTINCT against the union all view, so you may need to query the MQT (or a view defined on top of it) directly. Ignore the column "ignore" in the MQT -- it is there only for DB2; if you really don't want to see it, you can create a view on top of the MQT.
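In DDL terms, the MQT could be created along these lines. This is a hedged sketch of DB2's MQT syntax; the names are placeholders, and whether REFRESH IMMEDIATE is accepted over your union all view still depends on the restrictions mentioned above.

CREATE TABLE yourcolumn_mqt AS (
    SELECT yourcolumn, COUNT(*) AS ignore
    FROM union_all_view
    GROUP BY yourcolumn
)
DATA INITIALLY DEFERRED REFRESH IMMEDIATE;

-- Populate it once; DB2 then maintains it as the base tables change.
REFRESH TABLE yourcolumn_mqt;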
However, this is really a database design issue. Why do you need to scan 180 million rows of data to find the unique values in a particular column? Why don't these values already reside in their own table, with foreign keys defined against it from each of the 10 base tables?
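The design being suggested would look roughly like this (a sketch; table, column, and constraint names are placeholders, and each of the 10 base tables would carry the foreign key):

CREATE TABLE yourcolumn_lookup (
    yourcolumn VARCHAR(50) NOT NULL PRIMARY KEY
);

ALTER TABLE base_table_1
    ADD CONSTRAINT fk_base1_yourcolumn
    FOREIGN KEY (yourcolumn) REFERENCES yourcolumn_lookup;

-- The distinct values are then simply:
-- SELECT yourcolumn FROM yourcolumn_lookup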
I'm using Sybase 12.5.3 (ASE); I'm new to Sybase though I've worked with MSSQL pretty extensively. I'm running into a scenario where a stored procedure is really very slow. I've traced the issue to a single SELECT stmt for a relatively large table. Modifying that statement dramatically improves the performance of the procedure (and reverting it drastically slows it down; i.e., the SELECT stmt is definitely the culprit).
-- Sybase optimizes and uses multi-column index... fast!
SELECT ID,status,dateTime
FROM myTable
WHERE status in ('NEW','SENT')
ORDER BY ID
-- Sybase does not use index and does very slow table scan
SELECT ID,status,dateTime
FROM myTable
WHERE status in (select status from allowableStatusValues)
ORDER BY ID
The code above is an adapted/simplified version of the actual code. Note that I've already tried recompiling the procedure, updating statistics, etc.
I have no idea why Sybase ASE would choose an index only when strings are hard-coded and choose a table scan when choosing from another table. Someone please give me a clue, and thank you in advance.
1. The issue here is poor coding. In your release, poor code and poor table design are the main reasons (98%) that the optimiser makes incorrect decisions (the two go hand in hand; I have not figured out the proportion of each). Both:
WHERE status IN ('NEW','SENT')
and
WHERE status IN (SELECT status FROM allowableStatusValues)
are substandard, because in both cases they cause ASE to create a worktable for the contents between the brackets, which can easily be avoided (and all consequential issues avoided with it). There is no possibility of statistics on a worktable; since the statistics on either t.status or s.status are effectively missing (AdamH is correct on that point), the optimiser correctly chooses a table scan.
Subqueries have their place, but never as a substitute for a pure (the tables are related) join. The corrections are:
WHERE status = "NEW" OR status = "SENT"
and
FROM myTable t,
allowableStatusValues s
WHERE t.status = s.status
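Put together with the SELECT from the question, the corrected join form is (column names as in the question):

SELECT t.ID, t.status, t.dateTime
FROM myTable t,
     allowableStatusValues s
WHERE t.status = s.status
ORDER BY t.ID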
2. The statement
"Now you don't have to add an index to get statistics on a column, but it's probably the best way."
is incorrect. Never create indices that you will not use. If you want statistics updated on a column, simply:
UPDATE STATISTICS myTable (status)
3. It is important to ensure that you have current statistics on (a) all indexed columns and (b) all join columns.
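For the join in question, that means keeping the statistics on the status column current on both sides, for example (same statement as in point 2):

UPDATE STATISTICS myTable (status)
UPDATE STATISTICS allowableStatusValues (status)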
4. Yes, there is no substitute for SHOWPLAN on every code segment that is intended for release, doubly so for any code with questionable performance. You can also SET NOEXEC ON to avoid execution, e.g. for large result sets.
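A sketch of that pattern, using the problem query from the question (showplan must be switched on before noexec, and noexec switched off again before anything else will run):

SET showplan ON
GO
SET noexec ON
GO
SELECT ID, status, dateTime
FROM myTable
WHERE status in (select status from allowableStatusValues)
ORDER BY ID
GO
SET noexec OFF
GO
SET showplan OFF
GO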
An index hint will work around it, but is probably not the solution.
Firstly, I'd like to know if there is an index on allowableStatusValues.status; if there is, then Sybase will have stats on it and will have a good idea of the number of values in there.
If not, then the optimiser probably won't have a good idea how many different values status may take. It then has to assume that you're going to be extracting almost all of the rows from myTable, and the best way of doing that is a table scan (if there is no covering index).
Now you don't have to add an index to get statistics on a column, but it's probably the best way.
If you do have an index on allowableStatusValues.status, then I'd wonder how good your stats are. Get yourself a copy of sp__optdiag. You probably also need to tune the values of "histogram tuning factor" and "number of histogram steps"; increasing these slightly from the defaults will give you more detailed statistics, which always helps the optimiser.
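Assuming these are exposed as sp_configure parameters on your ASE version (worth verifying against sp_configure's output first), adjusting them would look something like this; the values shown are only illustrative:

-- example values only; check your current settings first
sp_configure "histogram tuning factor", 20
go
sp_configure "number of histogram steps", 100
go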
Does it still do a table scan if you replace the subquery with a join:
SELECT m.ID,m.status,m.dateTime
FROM myTable m
JOIN allowableStatusValues a on m.status = a.status
ORDER BY ID
Rather than relying on experimental observations of how long a query takes to run, I would highly recommend getting Sybase to show you the execution plans for each query, for example:
SET showplan ON
GO
-- query/procedure call goes here
SELECT id, status, datetime
FROM myTable
WHERE status IN('NEW','SENT')
ORDER BY id
GO
SET showplan OFF
GO
With SET showplan ON, Sybase generates execution plans for every statement it executes. These can be invaluable in helping to identify where queries are not making use of appropriate indexes. For stored procedures in Sybase, the execution plan for the entire procedure is generated when the stored procedure is first executed after being compiled.
If you post the plans for each of your queries we might be able to shed more light on the problem.
Amazingly, using an index hint resolves the issue (see the (index myIndexName) clause in the rewritten/simplified code below):
-- using INDEX HINT
SELECT ID,status,dateTime
FROM myTable (index myIndexName)
WHERE status in (select status from allowableStatusValues)
ORDER BY ID
Weird that I have to use this technique to avoid a table scan, but there ya go.
Garrett, by showing only the simplified code, you have likely stripped out exactly the information that would illuminate the source of the problem.
My first guess would be a type mismatch between allowableStatusValues.status and myTable.status. However, that is not the only possibility. As ninesided stated, the complete query plans (using showplan and fmtonly flags), as well as the actual table definitions and stored procedure source, are much more likely to produce a useful answer.
SELECT DISTINCT tblJobReq.JobReqId
, tblJobReq.JobStatusId
, tblJobClass.JobClassId
, tblJobClass.Title
, tblJobReq.JobClassSubTitle
, tblJobAnnouncement.JobClassDesc
, tblJobAnnouncement.EndDate
, tblJobAnnouncement.AgencyMktgVerbage
, tblJobAnnouncement.SpecInfo
, tblJobAnnouncement.Benefits
, tblSalary.MinRateSal
, tblSalary.MaxRateSal
, tblSalary.MinRateHour
, tblSalary.MaxRateHour
, tblJobClass.StatementEval
, tblJobReq.ApprovalDate
, tblJobReq.RecruiterId
, tblJobReq.AgencyId
FROM ((tblJobReq
LEFT JOIN tblJobAnnouncement ON tblJobReq.JobReqId = tblJobAnnouncement.JobReqId)
INNER JOIN tblJobClass ON tblJobReq.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary ON tblJobClass.SalaryCode = tblSalary.SalaryCode
WHERE (tblJobReq.JobClassId in (SELECT JobClassId
from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
When I try to execute the query, it results in the following error.
Cannot sort a row of size 8130, which is greater than the allowable maximum of 8094
I checked and didn't find any solution. The only way I found is to truncate (with substring()) the tblJobAnnouncement.JobClassDesc column in the query, which has a column size of around 8000.
Is there any workaround so that I need not truncate the values? Or can this query be optimised? Is there any setting in SQL Server 2000?
The [non obvious] reason why SQL needs to SORT is the DISTINCT keyword.
Depending on the data and underlying table structures, you may be able to do away with this DISTINCT, and hence not trigger this error.
You readily found the alternative solution which is to truncate some of the fields in the SELECT list.
Edit: Answering "Can you please explain how DISTINCT would be the reason here?"
Generally, the fashion in which the DISTINCT requirement is satisfied varies with
the data context (expected number of rows, presence/absence of index, size of row...)
the version/make of the SQL implementation (the query optimizer in particular receives new or modified heuristics with each new version, sometimes resulting in alternate query plans for various constructs in various contexts)
Yet, all the possible plans associated with a "DISTINCT query" involve *some form* of sorting of the qualifying records. In its simplest form, the plan first produces the list of qualifying rows (the records which satisfy the WHERE/JOINs/etc. parts of the query) and then sorts this list (which possibly includes some duplicates), retaining only the very first occurrence of each distinct row. In other cases, for example when only a few columns are selected and some index(es) covering these columns are available, no explicit sorting step is used in the query plan, but the reliance on an index implicitly implies the "sortability" of the underlying columns. In yet other cases, steps involving various forms of merging or hashing are selected by the query optimizer, and these too, eventually, imply the ability to compare two rows.
Bottom line: DISTINCT implies some sorting.
In the specific case of the question, the error reported by SQL Server and preventing the completion of the query is that "Sorting is not possible on rows bigger than..." AND, the DISTINCT keyword is the only apparent reason for the query to require any sorting (BTW many other SQL constructs imply sorting: for example UNION) hence the idea of removing the DISTINCT (if it is logically possible).
In fact you should remove it, for test purposes, to confirm that, without DISTINCT, the query completes OK (even if it then includes some duplicates). Once this is confirmed, and if the query can indeed produce duplicate rows, look into ways of producing a duplicate-free query without the DISTINCT keyword; constructs involving subqueries can sometimes be used for this purpose.
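For instance, since tblJobClass is already INNER JOINed, the IN subquery can be folded into the WHERE clause, and you can then check whether the result still contains duplicate rows before deciding whether DISTINCT is really needed (a sketch, with the column list shortened):

SELECT tblJobReq.JobReqId, tblJobReq.JobStatusId,
       tblJobClass.JobClassId, tblJobClass.Title,
       tblJobAnnouncement.JobClassDesc
       -- ...remaining columns as in the original query
FROM ((tblJobReq
    LEFT JOIN tblJobAnnouncement ON tblJobReq.JobReqId = tblJobAnnouncement.JobReqId)
    INNER JOIN tblJobClass ON tblJobReq.JobClassId = tblJobClass.JobClassId)
    LEFT JOIN tblSalary ON tblJobClass.SalaryCode = tblSalary.SalaryCode
WHERE tblJobClass.Title LIKE '%Family Therapist%'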
An unrelated hint is to use table aliases: a short string avoids repeating these long table names. For example (I only did a few tables, but you get the idea...):
SELECT DISTINCT JR.JobReqId, JR.JobStatusId,
tblJobClass.JobClassId, tblJobClass.Title,
JR.JobClassSubTitle, JA.JobClassDesc, JA.EndDate, JA.AgencyMktgVerbage,
JA.SpecInfo, JA.Benefits,
S.MinRateSal, S.MaxRateSal, S.MinRateHour, S.MaxRateHour,
tblJobClass.StatementEval,
JR.ApprovalDate, JR.RecruiterId, JR.AgencyId
FROM (
(tblJobReq AS JR
LEFT JOIN tblJobAnnouncement AS JA ON JR.JobReqId = JA.JobReqId)
INNER JOIN tblJobClass ON JR.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary AS S ON tblJobClass.SalaryCode = S.SalaryCode
WHERE (JR.JobClassId in
(SELECT JobClassId from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
FYI, running this SQL command on your DB can fix the problem if it is caused by space that needs to be reclaimed after dropping variable length columns:
DBCC CLEANTABLE (0,[dbo.TableName])
See: http://msdn.microsoft.com/en-us/library/ms174418.aspx
This is a limitation of SQL Server 2000. You can:
Split it into two queries and combine elsewhere
SELECT ID, ColumnA, ColumnB FROM TableA JOIN TableB
SELECT ID, ColumnC, ColumnD FROM TableA JOIN TableB
Truncate the columns appropriately
SELECT LEFT(LongColumn,2000)...
Remove any redundant columns from the SELECT
SELECT ColumnA, ColumnB --, IDColumnNotUsedInOutput
FROM TableA
Migrate off of SQL Server 2000
What I want to do is an outer join to a table, where I exclude records from the joined table based on matching a constant, however keep records from the main table. For example:
SELECT a.id, a.other, b.baz
FROM a
LEFT OUTER JOIN b
ON a.id = b.id
AND b.bar = 'foo'
Expected results:
id other baz
-- ---------- -------
1 Has foo Include
2 Has none (null)
3 Has foobar (null)
I can't get the same results by putting the condition in the filter (WHERE clause) instead. If I use the following:
SELECT a.id, a.other, b.baz
FROM a
LEFT OUTER JOIN b
ON a.id = b.id
WHERE (b.bar IS NULL OR b.bar = 'foo')
I get these incorrect results:
id other baz
-- -------- -------
1 Has foo Include
2 Has none (null)
Here it has excluded records of A that happen to match a record of B where bar = 'foobar'. I don't want that; I want A to be present, but B's columns to be null in that case.
Table B will have multiple records that need excluding, so I don't think I can filter this on the Crystal side without doing a lot of messing around to avoid problems from duplicate records from table A.
I cannot use a SQL command object, as the third party application that we are running the reports from seems to choke on SQL command objects.
I cannot use views, as our support contract does not permit database modifications, and our vendor considers adding views a database modification.
I am working with Crystal Reports XI, specifically version 11.0.0.895. In case it makes a difference, I am running against a Progress 9.1E04 database using the SQL-92 ODBC driver.
The sample tables and data used in the examples can be created with the following:
CREATE TABLE a (id INTEGER, other VARCHAR(32));
CREATE TABLE b (id INTEGER, bar VARCHAR(32), baz VARCHAR(32));
insert into A (id, other) values ('1', 'Has foo');
insert into A (id, other) values ('2', 'Has none');
insert into A (id, other) values ('3', 'Has foobar');
insert into B (id, bar, baz) values ('1', 'foo', 'Include');
insert into B (id, bar, baz) values ('1', 'foobar', 'Exclude');
insert into B (id, bar, baz) values ('1', 'another', 'Exclude');
insert into B (id, bar, baz) values ('1', 'More', 'Exclude');
insert into B (id, bar, baz) values ('3', 'foobar', 'Exclude');
Crystal reports can't generate that commonly used SQL statement based on its links and report selection criteria. You have to use a "command" or build a view.
In short, Crystal sucks.
Is a stored procedure an option for you? If so you could pre-select the data sets that way without having to resort to the command option, and one can import a stored procedure as one would a table.
I would propose a stored procedure which does select * from b where bar = 'foo', and then join to that, so that the b table is pre-filtered and all you have to do is join on the other join field.
Hope that helps.
Not sure if you can do this in Crystal but how about joining to a Select?
SELECT a.id, x.baz
FROM a
LEFT OUTER JOIN
(SELECT id, baz FROM b WHERE bar = 'foo') As x ON a.id = x.id
Can't you create appropriate views in the database and base your report on those views? I'm using Crystal Reports on MSSQL, and often I just create views to avoid similar problems.
I can see two solutions:
a) accept the presence of multiple (unneeded) rows in B (and repeated values in A) and calculate totals using running total fields and/or formulas - not an easy way, but almost always possible;
b) move B into a subreport (where you can set the filter easily) and communicate the needed values between the main report and the subreport using shared variables.
Subreports are a powerful tool for solving this kind of problem, unless you need to nest them (not possible) or export reports to Excel (this adds empty lines, at least in CR 9).
Adding
(Isnull({b.bar}) OR {b.bar} = "foo")
to the record-selection formula should act as you expect.
** edit **
A couple of other things to try:
Use a different database driver--the native driver (that avoids ODBC) may act differently. I first noticed this using the WITH syntax--the SQL Server ODBC driver didn't work, but the SQL Server native driver did.
While it sacrifices some flexibility, embed the query in a Command, assuming you can get the 3rd-party's product to comply. Added for completeness.
Seems to me you don't want to accept anyone's suggestions but here's one last-ditch shot at it anyway. The solution I've used recently where the db has to remain intact is as follows:
Set up Tomcat server so I can run some JSP and Hibernate goodness.
Grab Crystal Reports for Eclipse
Build report in crystal reports designer with faked data on a local db conforming to how I'd have the data in an ideal world
Using a Java servlet, pass a List to each of the table aliases so that the report's data is supplied directly from POJOs. The POJOs can of course be composed entirely in the Java code by pulling in content from various db tables and mashing them up as you see fit, often letting you provide a thoroughly flattened dataset that Crystal Reports is only too happy to work with.
You should not add a filter condition for table b such as b.bar IS NULL OR b.bar = 'foo'; instead, avoid accessing table b's attributes directly. Pull each attribute from b through a formula that only uses the value when b.bar = 'foo'.