What is a Historian? - opc

What is the function of a historian in terms of OPC and PLC?

Based upon your tags opc and plc, you're referring to a Historian in the SCADA context.
Basically, a historian is a service that collects data from various devices in a SCADA network and logs it to a database.
A proprietary (time-series) database is normally used. Typical marketing makes claims such as:
Faster Speeds
In contrast, a plant-wide historian provides a much faster read/write performance over a relational database and “down to the millisecond” resolution for true real-time data. This capability enables better responsiveness by quickly providing the granularity of data needed to analyze and solve intense process applications.
Higher Data Compression
The powerful compression algorithms of Proficy Historian enable you to store years of data easily and securely online, which enhances performance, reduces maintenance and lowers costs. You can configure GE Intelligent Platforms' Proficy Historian without the active maintenance and back-up routines that a traditional RDB requires. Archives can be automatically created, backed up, and purged—enabling extended use without the need for a database administrator.
These marketing bullet points are often optimistic comparisons against poor RDBMS implementations. "Faster Speeds" conflates raw performance with the precision of the timestamp datatype and proper indexing of data in a relational database. "Higher Data Compression" claims are realized by using Swinging Door algorithms that can also be implemented for most RDBMSs; their use is explained in the Chevron whitepaper Data Compression for Process Historians. A newer trend is to use a classical time-series database for the historical data and a relational database for analysis and reporting. An example of this hybrid configuration is OSIsoft using Microsoft SQL Server for its Analysis Framework, which holds the asset management hierarchy and relational-type batch history data, shifting from a tag-centric application to an asset-centric application with a relational database at the core.
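For a feel of what swinging-door compression does, here is a minimal, simplified sketch of the idea in Python (this is not any vendor's actual implementation, just the basic "keep a point only when the deviation band can no longer be held" behaviour):

```python
def swinging_door(points, deviation):
    """Simplified swinging-door style compression.

    points    -- iterable of (timestamp, value) tuples, timestamps increasing
    deviation -- the compression deviation (engineering units)

    Returns the subset of points to archive; dropped points can be
    reconstructed to approximately within `deviation` by interpolating
    between the archived points.
    """
    points = list(points)
    if len(points) <= 2:
        return points

    archived = [points[0]]
    t0, v0 = points[0]
    slope_max = float("-inf")   # steepest lower "door"
    slope_min = float("inf")    # shallowest upper "door"
    held = points[0]            # last point seen but not yet archived

    for t, v in points[1:]:
        dt = t - t0
        lo = (v - deviation - v0) / dt
        hi = (v + deviation - v0) / dt
        slope_max = max(slope_max, lo)
        slope_min = min(slope_min, hi)
        if slope_max > slope_min:
            # Doors have closed: no single line from the archived point stays
            # within the deviation band, so archive the held point and
            # restart the doors from it.
            archived.append(held)
            t0, v0 = held
            dt = t - t0
            slope_max = (v - deviation - v0) / dt
            slope_min = (v + deviation - v0) / dt
        held = (t, v)

    archived.append(points[-1])
    return archived
```

On slowly changing process values this typically drops the vast majority of samples, which is where the "years of data online" claims come from.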
Popular options are Proficy Historian from General Electric, or OSIsoft's PI Historian.

As an automation integrator, my absolute favorite Historian is Wonderware (now AVEVA). It integrates with their other products (SCADA, Archestra, Batch, etc.) to provide integrated data from a manufacturing environment. It can also be integrated with "competitors'" products.
The Wonderware Historian product does have great compression relative to Rockwell's FactoryTalk Historian, and SIGNIFICANTLY better configuration.
It is quite scalable and, IMHO, has excellent documentation, views, and stored procedures for querying, including abundant examples.
They offer tiered historians as well as cloud and SaaS Historian offerings.
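To give a rough idea of what querying it from code looks like: the Historian exposes its data through a SQL Server interface, so something along these lines works. This is only a sketch using pyodbc; the server name, tag name, and the Runtime database / History view with its wwRetrievalMode / wwResolution parameters reflect a typical install and may differ between versions.

```python
import pyodbc

# Placeholder connection details for a typical Wonderware/AVEVA Historian,
# which exposes its data through SQL Server.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=HISTORIAN01;DATABASE=Runtime;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# Ask the History view for cyclic (resampled) data, one value per minute.
sql = """
SELECT TagName, DateTime, Value
FROM History
WHERE TagName = 'Reactor1.Temperature'
  AND DateTime >= ? AND DateTime < ?
  AND wwRetrievalMode = 'Cyclic'
  AND wwResolution = 60000   -- resolution in milliseconds
"""
cursor.execute(sql, ("2024-01-01 00:00", "2024-01-02 00:00"))
for row in cursor:
    print(row.TagName, row.DateTime, row.Value)
```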

Related

Network performance requirements for cloud-SQL spatial data?

Assume you're planning to have a PostgreSQL PaaS database server hosted in Azure/AWS/GCP. This PostgreSQL server will contain GBs of spatial data (national land parcel polygons, address points, etc.) stored in PostGIS-enabled SQL tables. All tables have good spatial and non-spatial indexes. The database server SKU/config is powerful enough for heavy GIS data usage. The client computer (e.g. staff laptops) connects to these cloud databases via a corporate VPN; in the office, the speedtest.net results while using this VPN are: 60 Mbps down, 50 Mbps up, 16 ms latency.
What is the minimum network requirement (latency, bandwidth etc.) for smooth rendering and querying of these PostGIS tables in QGIS/ArcGIS?
Any general guidelines would be useful here too. For example, is bandwidth more important than latency when rendering spatial data in GIS software?
It's hard to pose an exact question here as the use cases vary, and the network requirement shrinks the more zoomed in you are and/or the fewer layers you have turned on (less data to show on the map). I've struggled to find any online articles which cover this topic.
Check the performance guidelines published by Azure.
The choice of using the cloud for GIS software is optional and is purely based on the application requirements. Check the pricing and compute performance and validate them against your application requirements.
https://learn.microsoft.com/en-us/azure/azure-sql/virtual-machines/windows/performance-guidelines-best-practices-checklist
https://azure.microsoft.com/en-in/pricing/details/bandwidth/
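One practical way to see whether latency or bandwidth dominates for your data is to compare the server-side execution time with the total time including transfer. Here is a sketch with psycopg2; the connection string, table, and bounding box are placeholders:

```python
import time
import psycopg2

conn = psycopg2.connect("host=mydb.example.com dbname=gis user=gis_reader password=changeme")
cur = conn.cursor()

bbox_query = ("SELECT geom FROM land_parcels "
              "WHERE geom && ST_MakeEnvelope(174.7, -36.9, 174.8, -36.8, 4326)")

# Server-side cost only: EXPLAIN ANALYZE runs the query on the server and
# reports its execution time without shipping the result rows.
cur.execute("EXPLAIN ANALYZE " + bbox_query)
print("\n".join(row[0] for row in cur.fetchall()))

# Total cost as the client sees it: execution plus network transfer.
start = time.time()
cur.execute(bbox_query)
rows = cur.fetchall()
print(f"{len(rows)} rows in {time.time() - start:.2f}s including transfer")
```

If the gap between the two numbers is large for a typical map extent, bandwidth (geometry payload size) is your bottleneck; if both are small but the map still feels sluggish, round-trip latency across many small requests is the more likely culprit.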

What are the pros and cons of DynamoDB with respect to other NoSQL databases?

We use MongoDB database add-on on Heroku for our SaaS product. Now that Amazon launched DynamoDB, a cloud database service, I was wondering how that changes the NoSQL offerings landscape?
Specifically for cloud based services or SaaS vendors, how will using DynamoDB be better or worse as compared to say MongoDB? Are there any cost, performance, scalability, reliability, drivers, community etc. benefits of using one versus the other?
For starters, it will be fully managed by Amazon's expert team, so you can bet that it will scale very well with virtually no input from the end user (developer).
Also, since it's built and managed by Amazon, you can assume that they have designed it to work very well with their infrastructure, so performance will be top notch. In addition to being specifically built for their infrastructure, they have chosen to use SSDs as storage, so right from the start disk throughput will be significantly higher than other data stores on AWS that are HDD-backed.
I haven't seen any drivers yet and I think it's too early to tell how the community will react to this, but I suspect that Amazon will have drivers for all of the most popular languages, and the community will likely receive this well - and in turn create additional drivers and tools.
Using MongoDB through an add-on for Heroku effectively turns MongoDB into a SaaS product as well.
In reality one would be comparing whatever service a chosen provider has with what Amazon can offer, rather than comparing one persistence solution to another.
This is very hard to do. Each provider will have varying levels of service at different price points, and being able to run the database on your own hardware locally for development purposes is a welcome option.
I think the key difference to consider is that MongoDB is software you can install anywhere (including at AWS, at another cloud service, or in-house), whereas DynamoDB is a SaaS available exclusively as a hosted service from Amazon (AWS). If you want to retain the option of hosting your application in-house, DynamoDB is not an option. If hosting outside of AWS is not a consideration, then DynamoDB should be your default choice unless very specific features are a higher priority.
There's a table in the following link that summarizes the attributes of DynamoDB and Cassandra:
http://www.datastax.com/dev/blog/amazon-dynamodb
Something that DynamoDB needs in order to become more usable is the ability to index attributes other than the primary key.
UPDATE 1 (06/04/2013)
On 04/18/2013, Amazon announced support for Local Secondary Indexes, which made DynamoDB f***ing great:
http://aws.amazon.com/about-aws/whats-new/2013/04/18/amazon-dynamodb-announces-local-secondary-indexes/
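For context, with the current AWS SDK for Python (boto3, which postdates this answer), declaring a local secondary index at table creation looks roughly like this; the table and attribute names are made up for illustration:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# A local secondary index shares the table's hash key but lets you query on
# an alternative range key (here: by send date instead of message id).
dynamodb.create_table(
    TableName="Messages",
    AttributeDefinitions=[
        {"AttributeName": "UserId", "AttributeType": "S"},
        {"AttributeName": "MessageId", "AttributeType": "S"},
        {"AttributeName": "SentAt", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "UserId", "KeyType": "HASH"},
        {"AttributeName": "MessageId", "KeyType": "RANGE"},
    ],
    LocalSecondaryIndexes=[
        {
            "IndexName": "SentAtIndex",
            "KeySchema": [
                {"AttributeName": "UserId", "KeyType": "HASH"},
                {"AttributeName": "SentAt", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "KEYS_ONLY"},
        }
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
```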
I have to be honest; I was very excited when I heard about the new DynamoDB and did attend the webinar yesterday. However, it's difficult to make a decision right now as everything they said was still very vague; I have no idea what functionality is going to be allowed / exposed through their service.
The one thing I do know is that scaling is automatically handled, which is pretty awesome, yet there are still so many unknowns that it's tough to make a solid analysis until all the facts are in and we can start using it.
Thus far I still see Mongo as working much better for me (personally) in the project I've been working on.
Like most DB decisions, it's really going to come down to a project by project decision of what's best for your need.
I anxiously await more information on the product; for now, though, it is in beta and I wouldn't jump ship to adopt the latest and greatest only to be a tester :)
I think one of the key differences between DynamoDB and other NoSQL offerings is the provisioned throughput - you pay for a specific throughput level on a table, and provided you keep your data well-partitioned you can always expect that throughput to be met. So as your application load grows you can scale up and keep your performance more-or-less constant.
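To make that concrete, raising (or lowering) the provisioned throughput is a single API call. A sketch with boto3, using a hypothetical table name and made-up numbers:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Raise the table's provisioned capacity ahead of an expected load increase.
# Requests beyond the provisioned rate are throttled, so you scale these
# numbers with your application's load rather than re-architecting storage.
dynamodb.update_table(
    TableName="Messages",
    ProvisionedThroughput={"ReadCapacityUnits": 200, "WriteCapacityUnits": 100},
)
```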
Amazon DynamoDB seems like a pretty decent NoSQL solution. It is fast, and it is pretty easy to use. Other than having an AWS account, there really isn't any setup or maintenance required. The feature set and API are fairly small right now compared to MongoDB/CouchDB/Cassandra, but I would expect them to grow over time as feedback from the developer community is received. Right now, all of the official AWS SDKs include a DynamoDB client.
Pros
Lightning Fast (uses SSDs internally)
Really (really) reliable (chances of write failures are lower)
Seamless scaling (no need to do manual sharding)
Works as a web service (no server, no configuration, no installation)
Easily integrated with other AWS features (you can export the whole table to S3 or use EMR, etc.)
Replication is managed internally, so the chance of accidental data loss is negligible.
Cons
Very (very) limited querying.
Scanning is painful (I remember a scan through Java once ran for 6 hours)
Pre-defined throughput, which means a sudden increase beyond the set throughput will be throttled.
Throughput is partitioned as the table is sharded internally (which means if you provisioned a throughput of 1000 and it's partitioned in two, and you are reading only the latest data (from one partition), then your read throughput is only 500).
No joins; limited indexing allowed (basically 2).
No views, triggers, scripts, or stored procedures.
It's really good as an alternative to session storage in a scalable application. Another good use would be logging/auditing in an extensive system. It is NOT preferable for a feature-rich application with frequent enhancements or changes.
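To illustrate the difference the "limited querying" and "scanning is painful" points are making: a Query is addressed by key and touches only matching items, while a Scan reads the entire table and filters afterwards. A sketch with boto3, reusing the hypothetical Messages table and SentAtIndex from the earlier example:

```python
import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource("dynamodb").Table("Messages")

# Query: addressed by the hash key (plus an optional range-key condition),
# so DynamoDB only reads the matching partition -- fast and cheap.
recent = table.query(
    IndexName="SentAtIndex",
    KeyConditionExpression=Key("UserId").eq("user-123") & Key("SentAt").gt("2013-01-01"),
)

# Scan: reads every item in the table and filters afterwards -- this is the
# "painful" operation above, and it also burns read throughput on items
# that end up filtered out.
flagged = table.scan(FilterExpression=Attr("Flagged").eq(True))

print(len(recent["Items"]), len(flagged["Items"]))
```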

How to store large amounts of email message bodies

I'm creating a project with web-mail functionality among others. We have MongoDB as our main DBMS, but with huge amounts of email it becomes overloaded with message bodies.
We've tried storing message bodies on the server's hard drive and on an S3 node, but it's not very efficient.
Is there any good solution for key-value storage of a huge number of files (possibly cloud storage, some NoSQL DBMS, or anything else)?
You may be over-thinking/over-designing the DBMS component. You may want to consider Berkeley DB as your data store. It supports several APIs, including a key/value API (NoSQL). It's highly scalable, reliable, and very fast. Berkeley DB is used heavily in commercial and open source email projects, including OpenWave, Critical Path, Postfix, SendMail, and others. Because of its embedded nature, small footprint, developer-friendly key/value pair API, and the fact that it is totally configurable from within the embedding application, it's a frequent choice for email data management.
Disclaimer: I'm the Product Manager for Berkeley DB, so I'm a little biased. That said, Berkeley DB is used by those products and many more for email data management.
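For a feel of how simple the key/value API is from Python, storing message bodies by key looks roughly like this. This is only a sketch using the bsddb3 bindings; the file name and key scheme are made up:

```python
import bsddb3

# Open (or create) a B-tree keyed store on disk; Berkeley DB manages the
# file format, caching, and recovery underneath this dict-like API.
store = bsddb3.btopen("message_bodies.db", "c")

# Keys and values are plain bytes, so a natural layout is one record per
# message, keyed by mailbox + message id.
store[b"inbox/msg-000001"] = b"Subject: hello\r\n\r\nMessage body goes here"

body = store[b"inbox/msg-000001"]
print(body.decode())

store.close()
```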

Non-relational databases (NoSQL) for small to medium sized applications

The benefits of a non-relational database (such as a key-value pair storage) are evident when used in large scale datasets (google, facebook, linkedin). How do you think small to medium sized applications can benefit from using non-relational databases?
IBM mainframes have had "non-relational" databases since the 60s (hierarchical databases such as IMS and its variants). These databases are still in use because they are extremely fast and handle huge scale well.
The point of relational databases was to provide a regular, relatively abstract method for storing and retrieving data in which the tuning can be done relatively independently of the data model (not true for IMS). They were designed largely in reaction to the inability to reorganize hierarchical databases easily. The upside is nice organization; the downside is medium, not high, performance.
Google provides scalable storage and MapReduce to handle scale. It isn't relational.
There was a huge push early in the last decade to store data in XML, in essentially hierarchical form, because XML is implicitly hierarchical. That was a huge mistake IMHO, because it repeated the inconvenience of hierarchical databases but had none of the performance. I'm not very surprised this movement seems to have pretty much died.
Most of the practical push to non-relational seems to me to be towards performance and scale. I don't see how this helps "small" applications much.
People have proposed, but not done, a lot of practical data management using knowledge-based schemes. Doug Lenat's CYC comes to mind here. The ability of the database to help an application draw non-obvious conclusions strikes me as very interesting for "small" applications that are trying to be "smart". But there aren't a lot of these yet.
The sweet spot of using a NoSQL database at that scale is when the database model (key-value, document, etc.) is a good match to the application's needs and the advanced relational functionality is not needed.
At the small end of the spectrum, performance is a non-issue because just about everything is fast. Storage engines are a non-issue, and if you don't need a sophisticated query engine, the lack of SQL support is a non-issue.
You are left with how well it fits and how easy it is to use. Honestly, though, tooling does become an issue. Relational database tooling is mature; NoSQL tooling is less feature-rich and less battle-hardened. Too often it is roll-your-own tooling. Definitely consider what tools you'd be giving up and how much you need them.
There is an additional slate of advantages for smaller projects when considering a NoSQL service (like Amazon SimpleDB and Microsoft Azure) as compared to a product. If you only have to pay for what you use and you don't use much, it can be cheaper than running a dedicated server, going all the way down to free for something like the SimpleDB free usage tier.
You also avoid some of the server and database maintenance costs. This can be a big win if you don't have a DBA, or when your DBAs are already overworked. Of course you'll still have admin work to do, but it is significantly reduced, and typically simpler.
When it comes to graph databases (like Neo4j - a project I'm involved in), they excel at scaling to complexity. This means they provide "better substrates for modeling business domains" (see The State of NoSQL by Ben Scofield). As I see it, this is very important in small to medium sized apps.
This may be better explained through examples, so here are some links to example apps/domain modeling:
Access control lists the graph database way
Social networks in the database: using a graph database
Domain modeling gallery
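As a tiny taste of what that modeling looks like in practice, here is a sketch of a social-graph query using the official Python driver and Cypher (the URI, credentials, labels, and property names are made up, and the driver itself postdates this answer):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as session:
    # Relationships are first-class citizens, so "friend of a friend" is a
    # single graph pattern rather than a chain of join tables.
    session.run(
        "MERGE (a:Person {name: $a}) MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:KNOWS]->(b)",
        a="Alice", b="Bob",
    )
    result = session.run(
        "MATCH (me:Person {name: $name})-[:KNOWS*2]->(fof) "
        "RETURN DISTINCT fof.name AS name",
        name="Alice",
    )
    for record in result:
        print(record["name"])

driver.close()
```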
The question perhaps requires a bit more context... assuming a Python environment, consider the tutorial at the y_serial project: http://yserial.sourceforge.net/
NoSQL is not merely adopted for reasons of scalability. Serialization (of any arbitrary Python object) and persistence are very convenient at any scale -- so consider the key-value system as one approach.
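To make the "serialize any Python object and persist it" point concrete, here is a generic sketch in the spirit of y_serial (not its actual API): a tiny key-value store over SQLite where any picklable object can be stashed under a string key.

```python
import pickle
import sqlite3

# A minimal key-value store on top of SQLite.
conn = sqlite3.connect("objects.db")
conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, blob BLOB)")

def put(key, obj):
    # Serialize the object and upsert it under the given key.
    conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)",
                 (key, pickle.dumps(obj)))
    conn.commit()

def get(key):
    row = conn.execute("SELECT blob FROM kv WHERE key = ?", (key,)).fetchone()
    return pickle.loads(row[0]) if row else None

put("config", {"retries": 3, "hosts": ["a", "b"]})
print(get("config"))
```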
Well, one of the problems with an RDBMS is that you need to spend effort mapping your programming language's domain models to the relational schema of your RDBMS. This effort is usually spent configuring your ORM layer.
With NoSQL databases you are not forced to map your objects to a relational model and in most cases your objects are serialized as-is. Because of the lack of an intermediary schema, data migrations and versioning become easier.
Another benefit is scalability and performance. Since most of the time your data is retrieved by 'keys', effectively everything uses an index. Trivial sharding is possible by doing a % (MOD) on the key against the number of your available NoSQL instances, providing natural data partitioning, which is crucial for sharding.
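A minimal sketch of that MOD-based routing (the instance list and key format are made up for illustration):

```python
import hashlib

# The available NoSQL instances (hosts are placeholders).
INSTANCES = ["nosql-0.internal", "nosql-1.internal", "nosql-2.internal"]

def instance_for(key: str) -> str:
    # Hash the key and take it modulo the number of instances, so every
    # key deterministically maps to exactly one shard.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return INSTANCES[digest % len(INSTANCES)]

print(instance_for("user:42"))   # always routes to the same instance
```

Note that plain MOD re-maps most keys whenever the number of instances changes, which is why consistent hashing is usually preferred once resharding becomes a concern.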
If you're interested in seeing how developing with a NoSQL differs from a RDBMS, I have a tutorial where I show how to go about designing a simple blog application using Redis.
If you match up a few common PaaS cloud services like a Key-Value store, a BLOB store, and a Message Queue store you have some handy tools that can free small application developers from the tyranny of the DBA and the infrastructure folks.
Today small developers often resort to Jet MDBs. Why? Easy: shared access is as simple as storing the MDB file on a file share visible to the entire application community. When they can get away with it (i.e. get the necessary support from the gatekeepers) they might use SQL Server Express, MySQL, etc.
Sadly those gatekeepers can be pretty hostile to deal with in a large organization. Mention a "database" and suddenly you face the DBA gang and associated delays, application reviews, prioritization, etc. Mention needing a server and you face that other firing squad.
Using a NoSQL solution and related cloud services can eliminate a ton of this if you don't need an RDBMS.
For one thing, all that's really required is an account with a public cloud provider. This is something that becomes fairly easy once the concept has been approved. And easier for you as a developer once you've been approved and assigned an account, though of course there are the usual bookkeeping issues.
But let's even set that aside. What if your organization implemented a private cloud for such uses? Lots of the issues of outside billing go away, data insecurity worries go away, etc.
Such a thing could be implemented and provisioned in a semi-anonymous fashion, almost as easily as administering file shares. The anonymity comes in because once you've been approved to develop on the in-house cloud nobody needs to nitpick the details of your activities using it any more than they need to examine a request before you can create a file on an existing file share.
Obviously there would be storage and CPU quotas to manage. Nobody can afford to just keep scaling up indefinitely. Rogue applications might consume vast quantities of resources. So what you need is some sort of quota system to cap usage. Whether this is monitored by infrastructure folks is an implementation decision, or it might be treated just like file share use: run out and somebody yells at the programmer, who in turn looks into it and requests more if appropriate (or fixes his bugs).
But you end up with "utility computing" and by "using no SQL" you don't incur the cost (and issues) of dealing with DBAs. They can still sit quietly surfing the Web in their big offices while you get some work done.
Amazon SimpleDB can be useful for those who need a non-relational database for storage of smaller, non-structured data. Amazon SimpleDB restricts storage to 10 GB per domain. Amazon SimpleDB offers simplicity and flexibility, and it automatically indexes all data. Amazon SimpleDB pricing is based on your actual usage. You can store any UTF-8 string data in Amazon SimpleDB.
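For a sense of how little ceremony is involved, a sketch with the classic boto SDK (SimpleDB is not part of boto3); the domain, item, and attribute names are made up:

```python
import boto

# Credentials are picked up from the environment / boto config.
sdb = boto.connect_sdb()
domain = sdb.create_domain("invoices")

# Every attribute is automatically indexed; all values are UTF-8 strings,
# so numbers are usually zero-padded to make string comparison sort correctly.
domain.put_attributes("inv-001", {"customer": "acme", "total": "00042.50"})

# Query with SimpleDB's SQL-like select syntax.
for item in domain.select('select * from `invoices` where customer = "acme"'):
    print(item.name, dict(item))
```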

NO-SQL reliable for small business app?

I'm deciding between going for a NoSQL engine or a regular SQL one for a document management system for small businesses.
I have experience with Firebird/SQL Server and have found a good track record of reliability (especially with Firebird).
This market is full of crappy "servers" (clone-built PCs, the majority), cheap hard disks, rarely any use of RAID or anything like that; some are in locations where a power-off is normal, some don't have a UPS, etc. (I will include off-site auto-backup to external servers, but that doesn't change the internal setup.) (I know about educating end users on proper setups, but it's stupid to depend on that, so let's stick to the point.)
From the design point of view, a schema-less database is the way to go for my system, but I worry whether any of the current solutions (MongoDB, Tokyo Cabinet, etc.) are like Firebird and survive crashes, malfunctions and abuse, so that data corruption is very rare.
The plan is to store the office documents there and provide a central repository.
Check out Neo4j. It is a graph database (schema-free) that can be used like a document or key/value store.
Neo4j has been in production for many years in environments like you describe. Unlike many other NOSQL databases Neo4j actually flushes data to disk and uses a transaction log to recover from an inconsistent state. It also has real transactions (full ACID) that can span multiple operations and treat them as a single unit (which also seems to be a feature that is frequently left out in many other NOSQL stores).
-Johan
(Disclaimer: I am part of the Neo4j team)
CouchDB has the reliability you need:
The CouchDB file layout and commitment system features all Atomic Consistent Isolated Durable (ACID) properties. On-disk, CouchDB never overwrites committed data or associated structures, ensuring the database file is always in a consistent state.
Look at the ACID Properties section here for more info.
With CouchDB you also get easy backup and replication.
I've no code in production using CouchDB yet, but so far I'm very happy with the tests and the development process with CouchDB.
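For anyone curious what working with it looks like: everything goes over plain HTTP, so a sketch with the requests library is enough to show document storage and replication (host, credentials, database, and document names are placeholders):

```python
import requests

couch = "http://admin:secret@localhost:5984"

# Create a database (a PUT to the database name), then store a document.
requests.put(f"{couch}/invoices")
requests.put(f"{couch}/invoices/inv-001",
             json={"customer": "acme", "total": 42.5})

# Read it back; CouchDB adds _id and _rev (the revision used by its
# append-only, never-overwrite storage model).
doc = requests.get(f"{couch}/invoices/inv-001").json()
print(doc)

# Replication to another node is a single POST to _replicate.
requests.post(f"{couch}/_replicate",
              json={"source": "invoices",
                    "target": "http://other-host:5984/invoices",
                    "create_target": True})
```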