We have a requirement in our project to store millions of records (~100 million) in a database.
And we know that SQL Server 2012 Express Edition can accommodate at most 10 GB of data.
I am using these queries to get the actual size of the database. Is this right?
use [Bio Lambda8R32S50X]
SELECT DB_NAME(database_id) AS DatabaseName,
Name AS Logical_Name,
Physical_Name, (size*8)/1024 SizeMB
FROM sys.master_files
WHERE DB_NAME(database_id) = 'Bio Lambda8R32S50X'
GO
SET NOCOUNT ON
DBCC UPDATEUSAGE(0)
-- Table row counts and sizes.
CREATE TABLE #t
(
[name] NVARCHAR(128),
[rows] CHAR(11),
reserved VARCHAR(18),
data VARCHAR(18),
index_size VARCHAR(18),
unused VARCHAR(18)
)
INSERT #t EXEC sp_msForEachTable 'EXEC sp_spaceused ''?'''
SELECT *
FROM #t
-- # of rows.
SELECT SUM(CAST([rows] AS int)) AS [rows]
FROM #t
DROP TABLE #t
The second question: does this restriction apply only to the size of the primary filegroup, or does it include the log files as well?
If we do a lot of deletes and inserts, or delete and re-insert the same number of records, does the database size change or stay the same?
This is crucial, since it will decide whether we can go ahead with SQL Server 2012 Express Edition.
Thanks and regards
Subasish
I can see that the first query is to get the overall size of the database for the data and logs. The second one is for each table. So I would say yes to both.
Based on my experience seeing databases over 40 GB, and on this link (maximum DB size limits), the limit on SQL Server Express applies to the .mdf and .ndf data files, not the .ldf log file.
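As a rough sketch of the arithmetic behind the first query: `size` in sys.master_files is counted in 8 KB pages, so pages * 8 / 1024 gives megabytes. Assuming, as this answer suggests, that only data files count toward the cap, a check might look like the Python below. The file names and page counts are invented for illustration; the 10 GB figure is the one from the question.

```python
# Hypothetical rows shaped like (physical_name, type_desc, size_in_8KB_pages),
# mirroring columns of sys.master_files. All numbers here are made up.
EXPRESS_LIMIT_MB = 10 * 1024  # 10 GB cap quoted in the question


def size_mb(pages: int) -> float:
    """Convert a count of 8 KB pages to megabytes (pages * 8 KB / 1024)."""
    return pages * 8 / 1024


def data_size_mb(files) -> float:
    """Sum data files only (type_desc 'ROWS'); log files ('LOG') are skipped."""
    return sum(size_mb(pages) for _, type_desc, pages in files if type_desc == "ROWS")


files = [
    ("bio.mdf", "ROWS", 1_000_000),
    ("bio_2.ndf", "ROWS", 200_000),
    ("bio_log.ldf", "LOG", 500_000),
]
used = data_size_mb(files)           # MB used by data files
headroom = EXPRESS_LIMIT_MB - used   # MB left before the assumed cap
```

Note that the T-SQL expression `(size*8)/1024` uses integer division, so it rounds down to whole megabytes, while this sketch keeps fractions.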
You might be safer, however, going with SQL Server Standard and CAL licensing in case your database starts growing.
Good Luck!
Related
In SQL Server 2008 R2, I am trying to insert 30 million records from a source table into a target table. Of these 30 million records, a few have bad data that exceeds the length of the target field. Because of this bad data, the whole insert aborts with a "String or binary data would be truncated" error without loading any rows into the target table, and SQL Server does not say which row caused the problem. Is there a way to insert the rest of the rows and catch the bad rows without a big impact on performance (performance is the main concern in this case)?
You can use the len function in your where condition to filter out long values:
select ...
from ...
where len(yourcolumn) <= 42
gives you the "good" records
select ...
from ...
where len(yourcolumn) > 42
gives you the "bad" records. You can use such where conditions in an insert select syntax as well.
You can also truncate your strings, like:
select
left(col, 42) col
from yourtable
In the examples I assumed that 42 is your character limit.
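The two where conditions can be combined into a "load the good rows, set the bad rows aside" pattern. The sketch below runs it against an in-memory SQLite database from Python purely so that it is self-contained; the table names (source_rows, target_rows, rejected_rows) are hypothetical, and SQLite spells the function length() where SQL Server uses len().

```python
import sqlite3

LIMIT = 42  # assumed width of the target column, as in the answer above

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE source_rows   (id INTEGER PRIMARY KEY, col TEXT);
    CREATE TABLE target_rows   (id INTEGER PRIMARY KEY, col TEXT);
    CREATE TABLE rejected_rows (id INTEGER PRIMARY KEY, col TEXT);
""")
con.executemany(
    "INSERT INTO source_rows (id, col) VALUES (?, ?)",
    [(1, "short"), (2, "x" * 50), (3, "y" * 42)],
)

# "Good" rows go to the target table.
con.execute(
    "INSERT INTO target_rows SELECT id, col FROM source_rows WHERE length(col) <= ?",
    (LIMIT,),
)
# "Bad" rows are kept in a separate table for later inspection.
con.execute(
    "INSERT INTO rejected_rows SELECT id, col FROM source_rows WHERE length(col) > ?",
    (LIMIT,),
)
con.commit()

loaded = [row[0] for row in con.execute("SELECT id FROM target_rows ORDER BY id")]
rejected = [row[0] for row in con.execute("SELECT id FROM rejected_rows ORDER BY id")]
```

Rows where the column is NULL match neither condition, so they would need their own handling if the source can contain them.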
You did not mention how you are inserting the data, i.e. bulk insert or SSIS.
In this situation I prefer SSIS, which gives you control and also solves your issue: you can insert the proper data as @Lajos suggests, and for the bad data you can create a temporary table to capture the bad rows.
You can define the flow of your logic via transformations and error handling. You can search for more on this too.
https://www.simple-talk.com/sql/reporting-services/using-sql-server-integration-services-to-bulk-load-data/
https://www.mssqltips.com/sqlservertip/2149/capturing-and-logging-data-load-errors-for-an-ssis-package/
http://www.techbrothersit.com/2013/07/ssis-how-to-redirect-invalid-rows-from.html
I am migrating a large database from Oracle 11g to SQL Server 2008 R2 using SSIS. How can data integrity be validated for numeric data using a checksum?
In the past, I've always done this using a few application controls. It should be something that is easy to compute on both platforms.
Frequently, the end result is a query like:
select count(*)
, count(distinct col1) col1_dist_cnt
...
, count(distinct col99) col99_dist_cnt
, sum(col1) col1_sum
...
, sum(col99) col99_sum
from table
Spool to file, Excel or database and compare outcome. Save for project management and auditors.
Please read more on application control here. I wrote it for checks between various financial reporting systems for regulatory reporting, so this approach serves most cases.
If exactly one field value is wrong, it will always show up. Two errors might compensate each other. For example row 1 col 1 gets the value from row 2 col 1.
To detect that, multiply each value by something unique to the row. For instance, if you have a unique ID column or identity that is included in the migration too:
, sum(ID * col1) col1_sum
...
, sum(ID * col99) col99_sum
When you get numeric overflows, first try using the largest available precision (sometimes difficult, especially on SQL Server). If that is no longer feasible, use the mod function. Only a few types of error are hidden by mod.
The icing on the cake is to auto-generate these statements. On Oracle, look at user_tables and user_tab_columns. On SQL Server, look at syscolumns etc.
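To illustrate why the ID-weighted sum catches errors that a plain sum misses, here is a small Python sketch of the same control totals computed over in-memory rows. The function name control_totals and the sample data are hypothetical; in a real migration these figures would come from the SQL queries above, run on both platforms.

```python
def control_totals(rows, modulus=None):
    """rows: list of (id, value) pairs.

    Returns (row count, distinct value count, plain sum, id-weighted sum).
    If modulus is given, both sums are reduced with mod to keep them small.
    """
    plain = sum(value for _, value in rows)
    weighted = sum(row_id * value for row_id, value in rows)
    if modulus is not None:
        plain %= modulus
        weighted %= modulus
    return (len(rows), len({value for _, value in rows}), plain, weighted)


source = [(1, 10), (2, 20), (3, 30)]
swapped = [(1, 20), (2, 10), (3, 30)]  # values of rows 1 and 2 exchanged

src = control_totals(source)
bad = control_totals(swapped)
```

Here count, distinct count and plain sum are identical for both data sets, while the weighted sums differ (140 versus 130), so only the weighted control exposes the swap.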
I'm trying to connect from Microsoft SQL Server to an AS/400 so I can pull data from the AS/400 and then flag the data as being pulled.
I've successfully created an OLE DB "IBMDASQL" connection and am able to pull some data, but I'm running into an issue when I try to pull data from a very large table.
This runs fine, and returns a count of 170 million:
select count(*)
from transactions
This query executed for 15 hours before I gave up on it. (It should return zero since I haven't flagged anything as 'In process' yet.)
select count(*)
from transactions
where processed = 'In process'
I'm a Microsoft guy, but my AS/400 guy says that there is an index on the 'processed' column and that locally, that query runs instantaneously.
Any thoughts on what I might be doing wrong? I found a table with only 68 records in it, and was able to run this query in about a second:
select count(*)
from smallTable
where RandomColumn = 'randomValue'
So I know that the AS/400 is at least able to understand that type of query.
I have had to fight this battle many times.
There are two ways of approaching this.
1) Stage your data from the AS400 into SQL server where you can optimize your indexes
2) Ask the AS400 folks to create logical views which speed up data retrieval. Your AS400 programmer is correct that an index will help, but I forget the term they use for a "view" similar to a SQL Server view. I believe it's something like "physical" vs. "logical". Logical is what you want.
Also, 170 million is a lot of records, even for a relational database like SQL Server. Have you considered running an SSIS package nightly that stages your data into your own SQL table to see if it improves performance?
I would suggest this approach for good performance. I assume you have at least SQL Server 2005. I haven't tested it yet, but here is a tip:
Let the AS400 perform the select natively by creating a stored procedure on the AS400:
Open an AS400 session
Launch STRSQL
Create an AS400 stored procedure like this to get the recordset
CREATE PROCEDURE MYSELECT (IN PARAM CHAR(10))
LANGUAGE SQL
DYNAMIC RESULT SETS 1
BEGIN
DECLARE C1 CURSOR FOR SELECT * FROM MYLIB.MYFILE WHERE MYFIELD=PARAM;
OPEN C1;
RETURN;
END
create an AS400 stored procedure to update the recordset
CREATE PROCEDURE MYUPDATE (IN PARAM CHAR(10))
LANGUAGE SQL
RESULT SETS 0
BEGIN
UPDATE MYLIB.MYFILE SET MYFIELD='newvalue' WHERE MYFIELD=PARAM;
END
Call those AS400 SPs from SQL Server
declare @myParam char(10)
set @myParam = 'In process'
-- get the recordset
EXEC ('CALL NAME_AS400.MYLIB.MYSELECT(?) ', @myParam) AT AS400 -- < AS400 = name of linked server
-- update
EXEC ('CALL NAME_AS400.MYLIB.MYUPDATE(?) ', @myParam) AT AS400
Hope it helps
I recommend following the suggestions in the IBM Redbook SQL Performance Diagnosis on IBM DB2 Universal Database for iSeries to determine what's really happening.
IBM technical support can also be extremely helpful in diagnosing issues such as these. Don't be afraid to get in touch with them as the software support is generally included as part of the maintenance contract and there is no charge to talk to them.
I've seen OLEDB connections eat up 100% CPU for hours when the same query, run through Visual Explain (query analyzer), is estimated to take mere seconds to execute.
We found that running the query like this performed as expected:
SELECT *
FROM OpenQuery( LinkedServer,
'select count(*)
from transactions
where processed = ''In process''')
GO
Could this be a collation problem? Your WHERE clause tests a text field, and if the collations of the two servers don't match, the clause will be applied client-side rather than server-side. In that case you first pull all 170 million records down to the client and then apply the WHERE clause there.
Based on the past interactions I have had, the query should take about the same amount of time no matter how you access the data. Another thought would be if you could create a view on the table to get the data you need or use a stored procedure.
I have a stored procedure which creates and works with a temporary #table
Some of the queries would be tremendously optimized if that temporary #table had an index created on it.
However, creating an index within the stored procedure fails:
create procedure test1 as
SELECT f1, f2, f3
INTO #table1
FROM main_table
WHERE 1 = 2
-- insert rows into #table1
create index my_idx on #table1 (f1)
SELECT f1, f2, f3 FROM #table1 (index my_idx) WHERE f1 = 11 -- "QUERY X"
When I call the above, the query plan for "QUERY X" shows a table scan.
If I simply run the code above outside the stored procedure, the messages show the following warning:
Index 'my_idx' specified as optimizer hint in the FROM clause of table '#table1' does not exist. Optimizer will choose another index instead.
This can be resolved when running ad hoc (outside the stored procedure) by splitting the code above into two batches by adding "go" after the index creation:
create index my_idx on #table1 (f1)
go
Now, "QUERY X" query plan shows the use of index "my_idx".
QUESTION: How do I mimic running the "create index" in a separate batch when it's inside the stored procedure? I can't insert a "go" there like I do with the ad hoc copy above. Please note that I'm aware of the solution of "split up the 'QUERY X' into a separate stored procedure" and am looking for a solution that will avoid that.
P.S. If it matters, this is on Sybase 12 (ASE 12.5.4)
UPDATE:
I have been seeing several references to "schema bumping" during my Googling before posing the question. But that doesn't seem to happen in my case.
You can create a table, populate it, create an index on it and select values
from it in the same proc and have the optimizer fully cost it based on
accurate information. This is called 'schema bumping' and has been in place
since 11.5.1.
The Sybase documentation says that you create and use a temporary index in the same stored procedure:
http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.dc20023_1251/html/optimizer/X26029.htm
I think to get around this you will need to split your stored procedure into at least two parts, one to create and populate the table then build the index, and then a second one to run the select query.
I am not sure how you are getting this problem; it might occur in an older version of Sybase. With version 12.5.4 I tried executing the same thing you describe, and in my case the optimizer correctly used the index created in the stored procedure. Usually we do not need to break SQL into batches inside a stored procedure, because otherwise we would need a separate batch for the create table command as well.
If we try to create the index within a single batch (not in a stored procedure), we do get the error you describe, because we are creating an index on a table and then trying to use it within the same batch. The Sybase server usually compiles the whole batch in one go, hence the problem. But as far as stored procedures are concerned, in Sybase 12.5.4 there should be no problem.
After much fiddling, I've managed to install the right ODBC driver and have successfully created a linked server on SQL Server 2008, by which I can access my PostgreSQL db from SQL server.
I'm copying all of the data from some of the tables in the PgSQL DB into SQL Server using merge statements that take the following form:
with mbRemote as
(
select
*
from
openquery(someLinkedDb,'select * from someTable')
)
merge into someTable mbLocal
using mbRemote on mbLocal.id=mbRemote.id
when matched
/*edit*/
/*clause below really speeds things up when many rows are unchanged*/
/*can you think of anything else?*/
and not (mbLocal.field1=mbRemote.field1
and mbLocal.field2=mbRemote.field2
and mbLocal.field3=mbRemote.field3
and mbLocal.field4=mbRemote.field4)
/*end edit*/
then
update
set
mbLocal.field1=mbRemote.field1,
mbLocal.field2=mbRemote.field2,
mbLocal.field3=mbRemote.field3,
mbLocal.field4=mbRemote.field4
when not matched then
insert
(
id,
field1,
field2,
field3,
field4
)
values
(
mbRemote.id,
mbRemote.field1,
mbRemote.field2,
mbRemote.field3,
mbRemote.field4
)
WHEN NOT MATCHED BY SOURCE then delete;
After this statement completes, the local (SQL Server) copy is fully in sync with the remote (PgSQL server).
A few questions about this approach:
is it sane?
It strikes me that an update will be run over all fields in local rows that haven't necessarily changed. The only prerequisite is that the local and remote id fields match. Is there a more fine-grained approach, or a way of constraining the merge statement to only update rows that have actually changed?
That looks like a reasonable method if you're unable or unwilling to use a tool like SSIS.
You could add in a check on the when matched line to check if changes have occurred, something like:
when matched and mbLocal.field1 <> mbRemote.field1 then
This may be unwieldy if you have more than a couple of columns to check, so you could add a check column (like LastUpdatedDate, for example) to make this easier.
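The three branches of the merge (insert, update only when changed, delete) can be sketched outside the database to make the change check explicit. The Python below is an illustration only: plan_sync and the sample rows are hypothetical, with each table modelled as a dict from id to a tuple of field values.

```python
def plan_sync(local, remote):
    """local/remote: dicts mapping id -> tuple of field values.

    Returns (ids to insert, ids to update, ids to delete), mirroring the
    NOT MATCHED, MATCHED-and-changed and NOT MATCHED BY SOURCE branches.
    """
    to_insert = sorted(k for k in remote if k not in local)
    to_delete = sorted(k for k in local if k not in remote)
    # Python compares None like any other value, so a field going from
    # None to a value (or back) is counted as a change here.
    to_update = sorted(k for k in remote if k in local and local[k] != remote[k])
    return to_insert, to_update, to_delete


local = {1: ("a", None), 2: ("b", "x"), 3: ("c", "y")}
remote = {1: ("a", "new"), 2: ("b", "x"), 4: ("d", "z")}
plan = plan_sync(local, remote)  # insert id 4, update id 1, delete id 3
```

One caveat when carrying this over to SQL: a comparison such as mbLocal.field1 <> mbRemote.field1 evaluates to unknown when either side is NULL, so if the columns are nullable, a change to or from NULL needs an explicit IS NULL check to be picked up.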