sql server split mirror db onto multiple devices - sql-server-2008-r2

Say I have a large, mirrored 1 TB production DB that resides on a single MDF device, and I would like to split it up into, say, five 200 GB devices.
I want to do this without interruption to Production.
I thought I could break the mirror and use the RESTORE process for creating a mirror to achieve the split to multiple devices quickly and without interruption to Production. Doing this twice would allow me to get this done in a few hours.
Has anyone done this? Is it the preferred method seeing as we are mirroring anyways?
What are my other alternatives, their pros and cons, and any gotchas?
Also, I recall another, more organic process where one would create the 5 new devices and somehow, over time, get the objects to move over to them. I'm not sure of the process for this, but I seem to recall it being discussed. It sounds like this could take a long time and possibly cause some blocking at times?
Thanks
...Ray

This isn't quite as simple a process as it first looks. The reason is that just adding the files to SQL Server isn't enough: even if you were to add 4 new files, they would all be empty space. You would have one file with 1 TB of data in it and 4 empty ones, which would eventually fill up, since SQL Server uses a proportional fill method for the files, but most of your queries would still be hitting the single file.
I take it you are doing this to improve performance? If so, you will need to move data around into different files in order to actually split the data up. Whether you can do this online depends on whether you are running Enterprise Edition (which allows you to rebuild indexes online).
An easy way to move a table (or, more accurately, a clustered index, which is pretty much the same thing as the table for all intents and purposes) is to add a new filegroup with a new data file and then rebuild the clustered index on the new filegroup. You can use the following to do this:
CREATE CLUSTERED INDEX Existing_Index_Name ON schema_name.table_name(column_name)
WITH (DROP_EXISTING = ON, ONLINE = ON) ON [new_filegroup_name]
GO
This code will create the new index on the new filegroup and get rid of the old one, and if you are running Enterprise Edition, it will do it all without blocking users.
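For completeness, the new filegroup and its data file have to exist before you can rebuild onto them. A minimal sketch of that step (the database name, filegroup name, file name, path and size below are placeholders, not taken from your environment):
-- Create the filegroup and add a data file to it (repeat for each new file you want)
ALTER DATABASE MyDatabase ADD FILEGROUP new_filegroup_name
GO
ALTER DATABASE MyDatabase
ADD FILE (NAME = MyDatabase_Data2, FILENAME = 'E:\SQLData\MyDatabase_Data2.ndf', SIZE = 200GB)
TO FILEGROUP new_filegroup_name
GO
Repeat for each additional file/filegroup, then rebuild the relevant clustered indexes onto them as shown above.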
See the following link for more methods of moving the data between filegroups:
Move data between SQL Server database filegroups
You should also look into partitioning your tables to help improve performance too:
Partitioning Tables and Indexes
With regard to your mirroring setup, you should break the mirror, then add all your files/filegroups on the principal, move the data between the filegroups, back up the modified database on the principal, restore it on the mirror (so all the files are set up the same there), and then re-establish your mirroring.
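As a rough sketch of that last part (server names, paths and the port are placeholders, and this assumes the mirroring endpoints from your existing setup are still in place):
-- On the principal:
BACKUP DATABASE MyDatabase TO DISK = 'E:\Backups\MyDatabase_full.bak'
BACKUP LOG MyDatabase TO DISK = 'E:\Backups\MyDatabase_log.trn'
-- On the mirror (add WITH MOVE clauses if the file paths differ there):
RESTORE DATABASE MyDatabase FROM DISK = 'E:\Backups\MyDatabase_full.bak' WITH NORECOVERY
RESTORE LOG MyDatabase FROM DISK = 'E:\Backups\MyDatabase_log.trn' WITH NORECOVERY
ALTER DATABASE MyDatabase SET PARTNER = 'TCP://principal_server:5022'
-- Back on the principal:
ALTER DATABASE MyDatabase SET PARTNER = 'TCP://mirror_server:5022'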

Related

Archive old data in Postgresql

I'm hoping somebody can advise me on the process I'm going to take forward for DB archiving.
I have a database (DB-1) which has 2 very large tables, one holding 25 GB of data and the other 20 GB. These cause major performance issues even though I have indexes.
So we are considering archiving the old data with the process below:
Clone a new database (DB-2) from the existing database (DB-1).
Delete the old data from DB-1, so it will have only the last 2 years of records. If I need old data, I can connect to DB-2.
Every month, move old data from DB-1 to DB-2 and delete the moved rows from DB-1.
That is the wrong approach.
What you are looking for is partitioning.
You can create range partitions covering one year each. To remove old data all you need to do is to drop the partition for the year(s) no longer needed.
If you need to keep the data for some reason, you can also just detach the partition from the table. The data is then still "lying around" but no longer shows up in the (partitioned) table. You could query the (detached) partition directly to access that data, or even move it to a slower hard disk to free up space on your fast disks if you have more than one.
You might even find that partitioning alone already improves performance, but that depends a lot on your queries.
Note that you should use Postgres 11 for that, as partitioning wasn't that sophisticated in older versions.
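A minimal sketch of what yearly range partitioning looks like in Postgres 11 (the table and column names here are made up for illustration):
CREATE TABLE measurements (
    id          bigint      NOT NULL,
    recorded_at timestamptz NOT NULL,
    payload     jsonb
) PARTITION BY RANGE (recorded_at);

CREATE TABLE measurements_2018 PARTITION OF measurements
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');
CREATE TABLE measurements_2019 PARTITION OF measurements
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');

-- Remove a year you no longer need:
DROP TABLE measurements_2018;
-- Or keep the data but take it out of the partitioned table:
ALTER TABLE measurements DETACH PARTITION measurements_2019;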
You should no doubt upgrade your current version (I'd suggest moving away from the EDB system you are working on now and going to community-based Postgres 11), but even if you can't upgrade, partitioning is still a much better answer than creating a second database.
By recreating your table as a set of partitions within the same database, you will be able to add and remove data in a much cleaner fashion, and it will make dealing with vacuums much easier. Even in 9.5, you can take advantage of table inheritance to build out partitions: first add partitions for incoming data, then create partitions at various intervals (probably monthly, since you want to run monthly cleanup) and move the existing data into them. This can be accomplished atomically with a series of INSERT INTO partition SELECT * FROM table WHERE <timestamp> style statements.
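A hedged sketch of that inheritance-based approach on 9.5 (the parent table, column and date range are hypothetical, and the parent here is an ordinary table rather than a declaratively partitioned one):
-- Child table holding one month of data:
CREATE TABLE events_2019_01 (
    CHECK (created_at >= '2019-01-01' AND created_at < '2019-02-01')
) INHERITS (events);

-- Move that month's existing rows into the child in a single transaction:
BEGIN;
INSERT INTO events_2019_01
    SELECT * FROM ONLY events
    WHERE created_at >= '2019-01-01' AND created_at < '2019-02-01';
DELETE FROM ONLY events
    WHERE created_at >= '2019-01-01' AND created_at < '2019-02-01';
COMMIT;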
I suspect you can probably manage this yourself (you need basic SQL and the ability to write simple triggers/functions... here is a link to the 9.5 docs), but if you need help, you can engage with one of the Postgres chat communities, or contact a support company if you want a deeper dive.

Neo4j's MERGE command on big datasets

Currently, I am working on a project implementing a Neo4j (v2.2.0) database in the field of web analytics. After loading some samples, I'm trying to load a big data set (>1 GB, >4M lines). The problem I am facing is that the MERGE command takes exponentially more time as the data size grows. Online sources are ambiguous about the best way to load big sets of data when not every line has to be loaded as a node, and I would like some clarity on the subject. To emphasize, in this situation I am just loading the nodes; relationships are the next step.
Basically there are three methods
i) Set a uniqueness constraint for a property, and create all nodes. This method was used mainly before the MERGE command was introduced.
CREATE CONSTRAINT ON (book:Book) ASSERT book.isbn IS UNIQUE
followed by
USING PERIODIC COMMIT 250
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
CREATE (:Book {isbn: row.isbn, title: row.title, etc})
In my experience, this will return an error if a duplicate is found, which stops the query.
ii) Merging the nodes with all their properties.
USING PERIODIC COMMIT 250
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
MERGE (:Book {isbn: row.isbn, title: row.title, etc})
I have tried loading my set in this manner, but after letting the process run for over 36 hours and coming to a grinding halt, I figured there should be a better alternative, as ~200K of my eventual ~750K nodes were loaded.
iii) Merging nodes based on one property, and setting the rest after that.
USING PERIODIC COMMIT 250
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
MERGE (b:Book {isbn: row.isbn})
ON CREATE SET b.title = row.title
ON CREATE SET b.author = row.author
etc
I am running a test now (~20K nodes) to see if switching from method ii to method iii will improve execution time, as a smaller sample gave conflicting results. Are there methods I am overlooking that could improve execution time? If I am not mistaken, the batch inserter only works with the CREATE command, and not the MERGE command.
I have permitted Neo4j to use 4GB of RAM, and judging from my task manager this is enough (uses just over 3GB).
Method iii) should be the fastest solution since you MERGE against a single property. Do you create the uniqueness constraint before you do the MERGE? Without an index (constraint or normal index), the process will take a long time with a growing number of nodes.
CREATE CONSTRAINT ON (book:Book) ASSERT book.isbn IS UNIQUE
Followed by:
USING PERIODIC COMMIT 20000
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
MERGE (b:Book {isbn: row.isbn})
ON CREATE SET b.title = row.title
ON CREATE SET b.author = row.author
This should work; you can also increase the PERIODIC COMMIT value.
I can add a few hundred thousand nodes within minutes this way.
In general, make sure you have indexes in place. Merge a node first on the basis of the properties that are indexed (to exploit fast lookup) and then modify that node's properties as needed with SET.
Beyond that, both of your approaches are going through the transaction layer. If you need to jam a lot of data into the DB really quickly, you probably don't want to use transactions to do that, because they're giving you functionality you might not need, and they require overhead that's slowing you down. So a larger solution would be to not insert data with LOAD CSV but go another route entirely.
If you're using the 2.2 series of Neo4j, you can go for the batch inserter via Java, or the neo4j-import tool (sadly not available prior to 2.2). What they both have in common is that they don't use transactions.
Finally, whichever way you go, you should read Michael Hunger's article on importing data into Neo4j, as it provides a good conceptual discussion of what's happening and why you need to skip transactions if you're going to load huge piles of data into Neo4j.

MongoDB: Switch database/collection referenced by a given name on the fly

My application needs only read access to all of its databases. One of those databases (db_1) hosts a collection coll_1 whose entire contents* need to be replaced periodically**.
My goal is to have no or very little effect on read performance for servers currently connected to the database.
Approaches I could think of with so far:
1. renameCollection
Build a temporary collection coll_tmp, then use renameCollection with dropTarget: true to move its contents over to coll_1. The downside of this approach is that as far as I can tell, renameCollection does not copy indexes, so once the collection is renamed, coll_1 would need reindexing. While I don't have a good estimate of how long this would take, I would think that query-performance will be significantly affected until reindexing is complete.
2. TTL Index
Instead of straight up replacing, use a time-to-live index to expire documents after the chosen replacement period. Insert new data every time period. This seems like a decent solution to me, except that for our specific application, old data is better than no data. In this scenario, if the cron job to repopulate the database fails for whatever reason, we could potentially be left with an empty coll_1 which is undesirable. I think this might have a negligible effect, but this solution also requires on-the-fly indexing as every document is inserted.
3. Communicate current database to read-clients
Simply use two different databases (or collections?) and inform connected clients which one is more recent. This solution would allow for finishing indexing the new coll_1_alt (and then coll_1 again) before making it available. I personally dislike the solution since it couples the read clients very closely to the database itself, and of course communication channels are always imperfect.
4. copyDatabase
Use copyDatabase to rename (designate) an alternate database db_tmp to db_1. db_tmp would also have a collection coll_1. Once reindexing is complete on db_tmp.coll_1, copyDatabase could be used to simply rename db_tmp to db_1. It seems that this would require dropping db_1 before renaming, leaving a window in which data won't be accessible.
Ideally (and naively), I'd just set db_1 to be something akin to a symlink, switching to the most current database as needed.
Anyone has good suggestions on how to achieve the desired effect?
*There are about 10 million documents in coll_1.
** The current plan is to replace the collection once every 24 hours. The replacement interval might get as low as once every 30 minutes, but not lower.
The problem you point out in option 4 you will also have with option 1: dropTarget also means that the collection is not available.
Another alternative could be to just have both the old and the new data in the same collection, and use a "version ID" that you then still have to communicate to your clients to do a query on. That at least stops you from having to do reindexing like you pointed out for option 1.
I think your best bet is actually option 3, and it's the most equivalent to changing a symlink, except it is on the client side.

Is there a way to configure Heroku PostgreSQL to not bother loading a particular column into RAM?

This may be a long shot, but I thought I'd ask anyway.
I am looking at using Heroku's new Crane Postgres DB (400 MB RAM cache) in conjunction with an app I'm deploying on Heroku. The 400 MB cache size should be plenty for our needs... except for one column of one table, in which we store a cached PDF file as a string. The PDFs could easily use up the 400 MB of RAM pretty quickly if Heroku uses its cache for them.
If I were on an actual server, I'd just store the PDF as a file, but given Heroku's ephemeral file system, my life is much simpler if I just store the PDF in the DB rather than rigging up a connection to S3 just for this one thing. (It further complicates things that we're looking at deploying multiple Heroku instances, one for each client, so using the DBs is simpler than creating a new bucket for each one.) I don't really care about the speed on this. If people are getting the file, they will expect speeds as if it were coming from a file system anyhow, since that's how most file downloads are done. Is there any way to tell Postgres not to bother caching this column?
Or maybe I'm asking the wrong question, and there is some other way to solve the problem or design alternatives that make it irrelevant.
You don't have to do anything. PostgreSQL will automatically use TOAST on values larger than 8 kB.
From http://www.postgresql.org/docs/9.1/static/storage-toast.html
PostgreSQL uses a fixed page size (commonly 8 kB), and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. To overcome this limitation, large field values are compressed and/or broken up into multiple physical rows. This happens transparently to the user, with only small impact on most of the backend code. The technique is affectionately known as TOAST (or "the best thing since sliced bread").
PostgreSQL caching is also done at the page level so TOAST does not have to be cached with the rest of the row (http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCache.pdf).
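If you want to see how much of a table actually lives out of line, a quick sketch (the table and column names here are hypothetical):
-- Main heap size versus everything else (TOAST plus free-space/visibility maps):
SELECT pg_size_pretty(pg_relation_size('documents')) AS main_heap,
       pg_size_pretty(pg_table_size('documents') - pg_relation_size('documents')) AS toast_and_maps;

-- Optionally force the PDF column to be stored out of line and uncompressed:
ALTER TABLE documents ALTER COLUMN pdf_data SET STORAGE EXTERNAL;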
The fact that Postgres can TOAST large field values doesn't mean it's the best thing to do.
If you store big fields in your main database, it will make many things harder, such as creating forks or followers, and creating and restoring backups in particular. I would strongly reconsider utilizing S3 to store the PDF files, and simply invest in automated onboarding of new clients (create heroku app, provision database, provision/create S3 bucket).
I'm not quite sure how you're managing to store large PDFs, since Postgres imposes a maximum field size (or at least a maximum page size). However, you might be able to get around this by using TOAST. TOASTed items are stored in a separate (physical) table, so if you're not selecting them frequently they shouldn't be cached.
If you are selecting them frequently, then I'm not sure if what you want is possible. Remember that Postgres only supplies one "level" of caching - the Linux VFS does caching also.
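If you want to verify what is actually sitting in shared buffers, the pg_buffercache extension can show you. A sketch, assuming you are able to install the extension (which may not be possible on every Heroku plan):
CREATE EXTENSION pg_buffercache;

-- Top relations by number of buffered pages in the current database:
SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
 AND b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;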

how to merge Tokyo Cabinet hash-table db's (.tch files) (no duplicate keys)

Is this possible? I couldn't find an answer anywhere.
Basically I'm looking at a setup where I have multiple workers (boxes) which must all eventually store their data into a Tokyo Cabinet index/db (I'm using Tokyo Tyrant over the memcached protocol, btw; not that it matters, but still).
Basically, I tried pushing the data directly to another box which runs Tokyo Tyrant, but TT can't handle it after a while. Inserts get really slow, and workers sit there idle wanting to offload data to the TT server. (I tried all sorts of things to improve performance: more RAM, RAID configs, multiple TT servers on the box, etc.), but the major drop in performance (inserts/sec) comes sooner or later.
Now, I'm looking at the option of letting each worker store its own data in a local Tokyo Tyrant db and merging the dbs of all workers afterwards (no duplicate keys, guaranteed).
Any help appreciated (suggestions for other ways to distribute load on TT are also appreciated).
btw: the config for TT: #bnum=20000000#opts=l#xmsiz=162000000
I set bnum to the upperbound of items expected: 20 mil.
Thanks, Geert-Jan
Check out kchashmgr. You could dump the files out into flat data files and then load them into a new .kch file created with a bigger bnum.