who in general writes a DDL for a table/view creation? Is it the requesting developer (to use it) or is the DBA? anyways the DBA will be running the DDL.
In my personal opinion, part of the DBA responsibilites is to work with the development team to design and develop the database, as indicated in Wikipedia.
Designing a table has two parts, first the physical design (column names, column types, constraints, relationships, etc.) that is mainly responsability of the development team. However, the table is stored in a tablespace with tons of characteristics, and the DBA should provide the mechanisms to adapt that table to the database structure.
Taking into account these elements, the DDL should be written in two phases, with many interactions of the actors, in order to write a DDL that is good for the application, and the database keeps its performance, security, availability, reliability, etc.
In my opinion the DDL contains many instructions for the table's design as well as table's functionality, such as the tablespace it will occupy, data partitioning, index location, etc. These are all part of the database design which the developer shouldn't know. A DBA should never give a developer direct access to the table, instead functions and views should be established from needs. But this is based on the Organization's requirements. The requirements of that of a bank will differ than that of a video store.
Related
I know this question has been asked before at PostgreSQL to Data-Warehouse: Best approach for near-real-time ETL / extraction of data.
But I want to rephrase this question.
I am attempting a real-time data warehouse. The difference between real-time and near real-time is huge.
I find real-time data warehouse to be event-driven and transactional in approach.
While near real-time would do the same batch mode application but would poll data more frequently.
It would put so much extra load on the production server and would certainly kill the production system.
Being a batch approach it would scan through all the tables for changes and would take rows which have changed from
a cut-off time stamp.
I mean by event driven, it would be specific to tables which have undergone changes are focus only on transaction
which are happening currently.
But the source system is an elephant of system, SAP, assuming which has 25,000 tables. It is not easy to model that,
not easy to write database triggers on each table to capture each change. I want impact on the production server to be minimal.
Is there any trigger at database level so that I could capture all changes happening in database in one trigger.
Is there any way to write that database trigger on a different database server so that production server goes untouched.
I have not been keeping pace with changes happening to database technology and am sure some nice new technologies would have come by to capture these changes easily.
I know of Log miners and Change data captures but it would be difficult to filter out the information which I need from redo logs.
Alternate ways to capture database write operations on the go.
Just for completeness sake let us assume databases are a heterogeneous mix of Oracle, SQL Server and DB2. But my contention is
the concepts we want to develop.
This is a universal problem, every company is looking for easy to implement solution. So a good discussion would benefit all.
Don't ever try to access SAP directly. Use the APIs of SAP Data Services (http://help.sap.com/bods). Look for the words "Integrator Guide" on that page for documentation.
This document should give you a good hint about where to look for your data sources (http://wiki.scn.sap.com/wiki/display/EIM/What+Extractors+to+use). Extractors are kind-of-somewhat like views in a DBMS, they're abstracting all the SAP stuff into somethin human readable.
As far as near-real-time, think in terms of micro-batches. Run your extract jobs every 5 (?) minutes, or longer if necessary.
Check the Lambda Architecture from Nathan Marz (I provide no link, but you'll find the resources easily). The implementation is all Java and No SQl, but the main ideas are applicable to the classical relational databases as well. In the nutshell: you have two implementations, one real time but responsible for only limited time interval. The "long tail" is maintained with classical best practice batch implementation.
The real time part is always discarded after the batch refresh, effectively blocking the propagation of the problems of the real time processing in the historical data.
As of now I can see only two solutions:
Write services on the source systems. If source is COBOL, put those in services. Put all services in a service bus and
some how trap when changes happen to database. This needs to be explored how that trap will work.
But from outset it appears to be a very
expensive proposition and uncertain. Convincing management for a three year lag time would be difficult. Services are not easy.
Log Shippers: This a trusted database solution. Logs would be available on another server, production server need not be
burdened. There are good number of tools as well available.
But the spirit does not match. Event driven is missing so the action when things
are happening is not captured. I will settle down for this.
As Ron pointed out NEVER TOUCH SAP TABLE DIRECTLY. There are adapters and adapters to access SAP tables. This will build another layer in between but it is unavoidable. One good news I want to share is a customer did a study of SAP tables and found that only 14% of the tables are actually populated or touched by SAP system. Even then 14% of 25,000 tables is coming to huge data model of 2000+ entities. Again micro-batches are like dividing the system into Purchase, Receivables, Payables etc., which is heading for a data mart and not an EDW. I want to focus on a Enterprise Data Warehouse.
I am planning a large project that is going to use Neo4j as main storage to model the data, but I was wondering if I should additionally use something like MySQL as a parallel structure to back the user administration subsystem. It would have the advantage that I could use something with which I have much more experience (18 Years) to do the very tedious job of login and account administration. The disadvantage would be the overhead of keeping two databases in sync, since I do plan to have the users also represented as nodes in Neo4j with a small subset of the user information. I know that it depends a lot on the details of the project, but if anybody has experience in the field with a similar set up it would be very appreciated.
Rather than building replication, stick to your chosen database & build administration tools and UI. Replication is complex & difficult, admin tools hopefully less so.
If you can't get the data out at least in CSV for reporting, maybe reconsider choice of DB.
I will be constructing an ecommerce site, and would like to use a no-sql database, which will fit well with the plans for the app. But when it comes to which database would fit the job, im not sure. After comparing various DB's, the ones that seem best might be either mongo, couch, or even orientdb. I have seen arguments for all of them to be used or not used compared to something like MySQL. But between themselves (nosql databases), which one would fit well with an ecommerce solution?
Note, for the use case, i wont be having thousands of transactions a second. Or similarly high write rates. they will be moderate sure, but at a level that any established database could handle.
CouchDB: Has master to master replication, which I could really use. If not, I will still have to implement the same functionality in code anyways. I need to be able to have a users database, sync with the mothership. (users will have their own, potentially localhost database, that could sync with the main domains server). Couch is also fast, once your queries have been stored in the db.As i will probably have a higher need for read performance. Though not by a lot.
MongoDB: queries are very easy and user friendly. Also, with the fact that end users may need to query for certain things at a given time that I may not be able to account for ahead of time, this seems like it may be a better fit. I dont have to pre-store my queries in the db. Does support atomic transactions, though only when writing to a single document at a time.
OrientDB: A graph database. much different that most people are used to, but with the needs, it could fit very well too. Orient has the benefits of being schemaless, as well as having support for ACID transactions. There is a lot of customer, and product relationships that a graph database could be great with. Orient also support master to master replication, similar to couchdb.
Dont get me wrong, I can see how to build this traditionally with something like MySQL, but the ease and simplicity of a nosql solution, is very attractive. Although, in my case, needing a schemaless solution, would be much easier in nosql rather than mysql. a given product may have more or less items, than another. and avoiding recreating a table whenever a new field is added, is preferrable.
So between these 3 (or even others you think may be better), what features in each could potentially work for, or against me in regards to an ecommerce based site, when dealing with customer transactions?
Edit: The reason I am not using an existing solution, is because with the integrated features I need, there are no solutions available out there. We are also aiming to use this as a full product for our company. There will be a handful of other integrations than just sales. It is also going to be working with a store's POS system.
Since e-commerce can encompass everything from shopping carts through to membership and recurring subscriptions, it is hard to guess exactly what requirements and complexity you are envisioning.
When constructing an e-commerce site, one of the early considerations should be investigating whether there is already an established e-commerce product or toolkit that could meet your requirements. There are many subtleties to processes like ordering, invoicing, payments, products, and customer relationships even when your use case appears to be straightforward. It may also be possible to separate your application into the catalog management aspects (possibly more custom) versus the billing (potentially third party, perhaps even via a hosted billing/payment API).
Another consideration should be who you are developing the e-commerce site for: is this to scratch your own itch, or for a client? Time, budget, and features for a custom build can be difficult to estimate and schedule .. and a niche choice of technology may make it difficult to find/hire additional development expertise.
A third consideration is what your language(s) of choice are for developing your application. Some languages will have more complete/mature/documented drivers and/or framework abstractions for the different databases.
That said, writing an e-commerce system appears to be a rite of passage for many developers ;-).
Edit: a lot has changed since this answer was originally posted in 2012 and you should definitely refer to current product information. For example, MongoDB has had support for Decimal128 values since MongoDB 3.4 (2016) and multi-document transactions since MongoDB 3.6 (2017).
Check the comparison of different available NoSql databases here. Suit your requirement as per that.
MongoDB 4 now multi-document ACID transactions! That makes it suitable for e-Commerce!
Check out: https://www.mongodb.com/transactions
I'm developing a cms for a company that has multiple regional sites (us, uk, china, russia, etc..). Should I use a separate database for each of these sites or use a single database with a 'site' field in each table? My main concern is the table language encoding (ie, can storing strings in different langauges in the same table cause problems, such as sorting issues).
That depends. If you store separate data on the different sites, you should use separate databases. It is much faster and safer, though more expensive. You should also use separate databases if you want to share the same data over the sites, but you expect a heavy load. However, in this case you need a way to synchronise the data between the sites. If you store the same data and your application is not speed critical, then a centralised storage may suffice (but only experience will show if it is fast enough or not).
From what you wrote, I suspect that the first case is true (you store separate data per site), but I can't be sure.
Edit: You may also ask this on Server Fault, there are more experienced administrators there.
The benefits of a non-relational database (such as a key-value pair storage) are evident when used in large scale datasets (google, facebook, linkedin). How do you think small to medium sized applications can benefit from using non-relational databases?
IBM Mainframes have had "non-relational" databases since the 60s (hierarchial databases such as IMS + variants). These databases are still in use because they are extremely fast and handle huge scale well.
The point of relational databases was to provide a regular, relatively abstract method for storing and retrieving data in which the tuning can be done relatively independently of the data model (not true for IMS). They were designed rather in reaction to the inability to reorganize hiearchical databases easily. The upside is nice organization; the downside is medium, not high performance.
Google provides scalable storage and MapReduce to handle scale. It isn't relational.
There was a huge push early in the last decade to store data in XML, in essentially hiearchical form because XML is implicitly hierarchical. That was a huge mistake IMHO, because it repeated the inconvenience of heirarchical databases, but had none of the performance. I'm not very surprised this movement seems to have pretty much died.
Most of the practical push to non-relational seems to me to be towards performance and scale. I don't see how this helps "small" applications much.
People have proposed, but not done a lot of practical data management using knowledge-based schemes. Doug Lenat's CYC comes to mind here. The ability of the database
to help an application draw non-obvious conclusions strikes me a very interesting for "small" applications that are trying be "smart". But there aren't a lot of these yet.
The sweet spot of using a NoSQL database at that scale is when the database model (key-value, document, etc.) is a good match to the application's needs and the advanced relational functionality is not needed.
At the small end of the spectrum, performance is a non issue because just about everything is fast. Storage engines are a non issue, if you don't need a sophisticated query engine, the lack of SQL support is a non issue.
You are left with how well it fits and how easy it is to use. Honestly though, tooling does become an issue. Relational database tooling is mature, NoSQL tooling is less feature rich and less battle hardened. Too often it is roll-your-own tooling. Definitely consider what tools you'd be giving up and how much you need them.
There is an additional slate of advantages for smaller projects when considering a NoSQL service (like Amazon SimpleDB and Microsoft Azure) as compared to a product. If you only have to pay for what you use and you don't use much, it can be cheaper than running a dedicated server, going all the way down to free for something like the SimpleDB free usage tier.
You also avoid some of the server and database maintenance costs. This can be a big win if you don't have a DBA, or when your DBAs are already over worked. Of course you'll still have admin work to do, but it is significantly reduced, and typically simpler.
When it comes to graph databases (like Neo4j - a project I'm involved in) they excel at scaling to complexity. This means, they provide "better substrates for modeling business domains" (see The State of NoSQL, also by Ben Scofield, too). As I see it, this is very important in small to medium sized apps.
This may be better explained through examples, so here's some links to example apps/domain modeling:
Access control lists the graph
database way
Social networks in the database: using a graph database
Domain modeling gallery
The question perhaps requires a bit more context... assuming a Python environment, consider the tutorial at the y_serial project: http://yserial.sourceforge.net/
NoSQL is not merely adopted for reasons of scalability. Serialization (of any arbitrary Python object) and persistence are very convenient at any scale -- so consider the key-value system as one approach.
Well one of the problems with a RDBMS is that you need to spend effort mapping your programming languages domain models to the relational schema of your RDBMS. This effort is usually spent configuring your ORM layer.
With NoSQL databases you are not forced to map your objects to a relational model and in most cases your objects are serialized as-is. Because of the lack of an intermediary schema, data migrations and versioning become easier.
Another benefit is scalability and performance. Since most of the time your data is received by 'keys' effectively everything uses and index. Trivial sharding is possible by doing a % (MOD) on the key against the number of your available NoSQL instances providing natural data partitioning which is crucial for sharding.
If you're interested in seeing how developing with a NoSQL differs from a RDBMS, I have a tutorial where I show how to go about designing a simple blog application using Redis.
If you match up a few common PaaS cloud services like a Key-Value store, a BLOB store, and a Message Queue store you have some handy tools that can free small application developers from the tyranny of the DBA and the infrastructure folks.
Today small developers often resort to Jet MDBs. Why? Easy, shared access is as easy as storing the MDB file on a file share visible to the entire application community. When they can get away with it (i.e. get the necessary support from the gatekeepers) they might use SQL Server Express, MySQL, etc.
Sadly those gatekeepers can be pretty hostile to deal with in a large organization. Mention a "database" and suddenly you face the DBA gang and associated delays, application reviews, prioritization, etc. Mention needing a server and you face that other firing squad.
Using a NoSQL solution and related cloud services can eliminate a ton of this if you don't need an RDBMS.
For one thing, all that's really required is an account with a public cloud provider. This is something that becomes fairly easy once the concept has been approved. And easier for you as a developer once you've been approved and assigned an account, though of course there are the usual bookkeeping issues.
But let's even set that aside. What if your organization implemented a private cloud for such uses? Lots of the issues of outside billing go away, data insecurity worries go away, etc.
Such a thing could be implemented and provisioned in a semi-anonymous fashion, almost as easily as administering file shares. The anonymity comes in because once you've been approved to develop on the in-house cloud nobody needs to nitpick the details of your activities using it any more than they need to examine a request before you can create a file on an existing file share.
Obviously there would be storage and CPU quotas to manage. Nobody can afford to just keep scaling up indefinately. Rogue applications might consume vast quantities of resources. So what you need is some sort of quota system to cap usage. Whether this is monitored by infrastructure folks is an implementation decision, or it might be treated just like file share use: run out and somebody yells at the programmer who in turn looks into it and requests more if appropriate (or fixes his bugs).
But you end up with "utility computing" and by "using no SQL" you don't incur the cost (and issues) of dealing with DBAs. They can still sit quietly surfing the Web in their big offices while you get some work done.
Amazon SimpleDB can be useful for those who need a non-relational database for storage of smaller, non-structural data. Amazon SimpleDB has restricted storage size to 10GB per domain. Amazon SimpleDB offers simplicity and flexibility. SimpleDB automatically indexes all data. Amazon SimpleDB pricing is based on your actual box usage. You can store any UTF-8 string data in Amazon SimpleDB.