How to choose between resource templates in IBM SPSS Modeler? - spss-modeler

I have been researching this for some time but have found few, if any, articles on SPSS Modeler. How do we know which resource template is better? Does a higher number of extracted types and concepts indicate a better template?

Related

How to transform System Requirement Specification documents to class and functions?

I'm currently preparing the SRS documents for a project, and I wonder how these documents and flow charts are transformed into classes, packages and functions.
Any suggestions for books or tutorials that explain this transformation process?
What you might be looking for is called "Object Oriented Design".
This generally describes the process of modeling classes and structures out of real world or business objects (described in your specification).
This is quite a wide topic, but as a good starting point I would recommend the book
Object-Oriented Analysis and Design.
For some other suggestions see this question.

Confusion about NoSQL Design

I know that NoSQL databases are not relational, so I cannot draw an ERD or use other methods that apply only to relational databases.
My confusion is: what kind of method or diagram should I use to design a NoSQL database?
Thanks.
Here's an abstract from a recent 10gen event presentation suggesting that mind maps are the most logical tool for the job. I expect more specialized tools to emerge, but in general, mind maps align well with non-relational schema design.
"Most of us are visual learners. Often, visual learners will find that information "clicks" when it is explained with the aid of a chart or picture. For MongoDB that picture is a leaf representing a natural approach to databases. In the RDBMS world a database schema is "visualized" through an Entity Relationship (ER) diagram. An ER diagram is the primary communication tool about an RDBMS data model. MongoDB provides a powerful dynamic database schema. However it is sometimes difficult to visualize. An accurate visualization of a MongoDB schema dramatically increases the ability to communicate the flexibility and power MongoDB between developers, architects, DBAs and end users. A mind map is a visual thinking tool that helps structure information, do better analysis, comprehend, synthesize and generate new ideas. Its power lies in its simplicity, much like MongoDB. Using a mind mapping open source tool, a clear and vibrant visualization of a dynamic MongoDB schema can be created that "clicks." Further, it works the other way around - mind maps can be used to create a dynamic schema in MongoDB. The mind mapping process allows non-technical business users to visually develop their requirements on the fly. During design process the mind map provides a flexible visual tool which changes in a fluid manner."
You can fairly easily use the standard tools; however, it depends on your specific scenario and the problem you are trying to solve. I recently had a conversation about this: https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/xZCwEm06eU4 which might help, though that conversation is also quite specialised.
I have been thinking ever since that instead of repeating myself on this I should actually write a manual for drawing UML diagrams, MongoDB style.
Maybe if you explain which UML diagram you wish to draw, we could provide a more detailed answer on how to produce a NoSQL representation of it.

NoSQL for time series/logged instrument reading data that is also versioned

My Data
It's primarily monitoring data, passed in the form of Timestamp: Value, for each monitored value, on each monitored appliance. It's regularly collected over many appliances and many monitored values.
Additionally, it has the quirky feature of many of these data values being derived at the source, with the calculation changing from time to time. This means that my data is effectively versioned, and I need to be able to simply call up only data from the most recent version of the calculation. Note: This is not versioning where the old values are overwritten. I simply have timestamp cutoffs, beyond which the data changes its meaning.
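To make the "timestamp cutoff" idea concrete, here is a rough Python sketch of what is meant. The sample readings, the cutoff values and the function names are all made up for illustration; the only point is that each cutoff starts a new version of the calculation, and "latest version" means everything at or after the last cutoff.

```python
from bisect import bisect_right

# Hypothetical sample data: (timestamp, value) readings for one monitored value,
# kept in time order, plus the timestamps at which the calculation changed.
readings = [(100, 1.0), (110, 1.2), (120, 5.0), (130, 5.1), (140, 9.9)]
cutoffs = [120, 140]  # sorted; each entry starts a new version of the calculation


def version_of(ts, cutoffs):
    """Return the 0-based version index that a timestamp falls into."""
    return bisect_right(cutoffs, ts)


def latest_version_readings(readings, cutoffs):
    """Keep only the readings produced by the most recent calculation."""
    if not cutoffs:
        return list(readings)
    start = cutoffs[-1]
    return [(ts, value) for ts, value in readings if ts >= start]


print(latest_version_readings(readings, cutoffs))
```

Whatever store is chosen would ideally make this kind of "everything since the last cutoff" read cheap.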
My Usage
Downstream, I'm going to have various undefined data mining/machine learning uses for the data. It's not really clear yet what those uses are, but it is clear that I will be writing all of the downstream code in Python. Also, we are a very small shop, so I can really only deal with so much complexity in setup, maintenance, and interfacing to downstream applications. We just don't have that many people.
The Choice
I am not allowed to use a SQL RDBMS to store this data, so I have to find the right NoSQL solution. Here's what I've found so far:
Cassandra
Looks totally fine to me, but it seems like some of the major users have moved on. It makes me wonder if it's just not going to be that much of a vibrant ecosystem. This SE post seems to have good things to say: Cassandra time series data
Accumulo
Again, this seems fine, but I'm concerned that this is not a major, actively developed platform. It seems like this would leave me a bit starved for tools and documentation.
MongoDB
I have a, perhaps irrational, intense dislike for the Mongo crowd, and I'm looking for any reason to discard this as a solution. It seems to me like the data model of Mongo is all wrong for things with such a static, regular structure. My data even comes in (and has to stay in) order. That said, everybody and their mother seems to love this thing, so I'm really trying to evaluate its applicability. See this and many other SE posts: What NoSQL DB to use for sparse Time Series like data?
HBase
This is where I'm currently leaning. It seems like the successor to Cassandra with a totally usable approach for my problem. That said, it is a big piece of technology, and I'm concerned about really knowing what it is I'm signing up for, if I choose it.
OpenTSDB
This is basically a time-series specific database, built on top of HBase. Perfect, right? I don't know. I'm trying to figure out what another layer of abstraction buys me.
My Criteria
Open source
Works well with Python
Appropriate for a small team
Very well documented
Has specific features to take advantage of ordered time series data
Helps me solve some of my versioned data problems
So, which NoSQL database actually can help me address my needs? It can be anything, from my list or not. I'm just trying to understand what platform actually has code, not just usage patterns, that support my super specific, well understood needs. I'm not asking which one is best or which one is cooler. I'm trying to understand which technology can most natively store and manipulate this type of data.
Any thoughts?
It sounds like you are describing one of the most common use cases for Cassandra. Time series data in general is often a very good fit for the cassandra data model. More specifically many people store metric/sensor data like you are describing. See:
http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
http://engineering.rockmelt.com/post/17229017779/modeling-time-series-data-on-top-of-cassandra
As for your concerns about the community, I'm not sure what is giving you that impression; there is quite a large community (see IRC, mailing lists) as well as a growing number of Cassandra users.
http://www.datastax.com/cassandrausers
Regarding your criteria:
Open source
Yes
Works well with Python
http://pycassa.github.com/pycassa/
Appropriate for a small team
Yes
Very well documented
http://www.datastax.com/docs/1.1/index
Has specific features to take advantage of ordered time series data
See above links
Helps me solve some of my versioned data problems
If I understand your description correctly you could solve this multiple ways. You could start writing a new row when the version changes. Alternatively you could use composite columns to store the version along with the timestamp/value pair.
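As a rough illustration of the composite-column idea, the sketch below uses a plain Python dict as a stand-in for a wide row. It does not use pycassa or any Cassandra API; the row key, the function names and the sample values are assumptions made up for the example. Column names are (version, timestamp) pairs kept in sorted order, so all readings for one version sit next to each other and can be read as one contiguous slice.

```python
from bisect import bisect_left, insort

# A plain-Python stand-in for wide rows: row key -> columns sorted by name.
# Column names are (version, timestamp) tuples, mimicking a composite column.
rows = {}


def write(row_key, version, ts, value):
    """Insert one reading, keeping the row's columns sorted by (version, ts)."""
    insort(rows.setdefault(row_key, []), ((version, ts), value))


def read_version(row_key, version):
    """Return (ts, value) pairs for one version as a contiguous slice of the row."""
    columns = rows.get(row_key, [])
    lo = bisect_left(columns, ((version, float("-inf")),))
    hi = bisect_left(columns, ((version + 1, float("-inf")),))
    return [(name[1], value) for name, value in columns[lo:hi]]


def latest_version(row_key):
    """The highest version is the version of the last column in the sorted row."""
    return rows[row_key][-1][0][0]


write("cpu:host1", 1, 100, 0.5)
write("cpu:host1", 2, 120, 0.7)
write("cpu:host1", 1, 110, 0.6)
print(read_version("cpu:host1", latest_version("cpu:host1")))
```

The other option mentioned above, starting a new row per version, would correspond to putting the version into the row key instead of the column name.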
I'll also note that Accumulo, HBase, and Cassandra all have essentially the same data model. You will still find small differences around the data model in regards to specific features that each database offers, but the basics will be the same.
The bigger difference between the three will be the architecture of the system. Cassandra takes its architecture from Amazon's Dynamo. Every server in the cluster is the same and it is quite simple to set up. HBase and Accumulo are more direct clones of BigTable. These have more moving parts and will require more setup and more types of servers, for example HDFS, Zookeeper, and HBase/Accumulo-specific server types.
Disclaimer: I work for DataStax (we work with Cassandra)
I only have experience in Cassandra and MongoDB but my experience might add something.
So you're basically doing time-based metrics?
OK, if I understand correctly, you use the timestamp as a versioning mechanism, so that you query by a certain timestamp. Say, to get the latest calculation used, you look up by the metric ID or whatever, sort by ts DESC and take the first row?
It sounds like a versioned key value store at times.
With this in mind I probably would not recommend either of the two I have used.
Cassandra is too rigid and too hierarchical, too tied to how you query, to the point where you can only make one pivot of graph data (I presume you would want to graph these metrics) from the column family, which is crazy and is why I dropped it. As for searching (which Facebook uses it for, and only that), it's not that impressive either.
MongoDB: well, I love MongoDB and I am an elite of the user group, and it could work here if you didn't use a key-value storage policy. But at the end of the day, if your mind is not set and you don't like the tech, then let me be the very first to say: don't use it! You will be no good at a tech that you don't like, so stay away from it.
Though I would picture this happening in Mongo much like:
{
    _id: ObjectId(),
    metricId: 'AvailableMessagesInQueue',
    formula: '4+5/10.01',
    result: NaN,
    ts: ISODate()
}
And you query for the latest version of your calculation by:
// sort newest first and keep only the first document
var results = db.metrics.find({ 'metricId': 'AvailableMessagesInQueue' }).sort({ ts: -1 }).limit(1);
var latest = results.next();
Which would output the doc structure you see above. Without knowing more about exactly how you wish to query, the general server and app scenario, etc., that's the best I can come up with.
I found this thread on HBase though: http://mail-archives.apache.org/mod_mbox/hbase-user/201011.mbox/%3C5A76F6CE309AD049AAF9A039A39242820F0C20E5#sc-mbx04.TheFacebook.com%3E
which might be of interest; it seems to support the argument that HBase is a good time-based key-value store.
I have not personally used HBase so do not take anything I say about it seriously....
I hope I have added something, if not you could try narrowing your criteria so we can answer more dedicated questions.
Hope it helps a little,
Not a plug for any particular technology but this article on Time Series storage using MongoDB might provide another way of thinking about the storage of large amounts of "sensor" data.
http://www.10gen.com/presentations/mongodc-2011/time-series-data-storage-mongodb
Axibase Time-Series Database
Open source
There is a free Community Edition
Works well with Python
https://github.com/axibase/atsd-api-python. There are also other language wrappers, for example ATSD R client.
Appropriate for a small team
Built-in graphics and rule engine make it productive for building an in-house reporting, dashboarding, or monitoring solution with less coding.
Very well documented
It's hard to beat IBM redbooks, but we're trying. API, configuration, and administration are documented in detail and with examples.
Has specific features to take advantage of ordered time series data
It's a time-series database from the ground-up so aggregation, filtering and non-parametric ARIMA and HW forecasts are available.
Helps me solve some of my versioned data problems
ATSD supports versioned time-series data natively in SE and EE editions. Versions keep track of status, change-time and source changes for the same timestamp for audit trails and reconciliations. It's a useful feature to have if you need clean, verified data with tracing. Think energy metering, PHMR records. ATSD schema also supports series tags, which you could use to store versioning columns manually if you're on CE edition or you need to extend default versioning columns: status, source, change-time.
Disclosure - I work for the company that develops ATSD.

Relationship between standards DocBook DITA OpenDocument and CMIS, MoReq2

Can anybody explain (for dummies :)) the relationship between these (mostly OASIS) standards?
DocBook, DITA, OpenDocument
CMIS
MoReq2
As I understand it so far:
DITA, DocBook and OpenDocument - are standards for file formats of documents
CMIS is something I need explained
MoReq2 - is a standard for digital archives for storing metadata about the documents (record management standard)
So, for a portable solution one needs to
store documents in the above formats (which one, and when?)
and describe them with MoReq2 schemas
but where does CMIS come in?
Or am I totally wrong?
PS: I understand that it is a complex question, but I can't find a simple explanation of their relationship anywhere.
PS2: bonus question - do any of the above have support in Perl?
The topics I know best are the first three (DocBook, DITA, OpenDocument).
DocBook and DITA are standards for writing potentially long technical documents, in which you do not specify any style or presentation. Rather, you just write text, and then you can tag the text with information about its role (whether it is a keyword, whether it is a warning note, etc). This way, you can then use stylesheets to apply consistent style to all of your text, and you can produce multiple publication formats from it.
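To illustrate the idea of tagging text by role and styling it later, here is a small Python sketch. The XML fragment is DocBook-flavoured but made up and not schema-valid, and the two "stylesheets" are just hypothetical dictionaries; real DocBook and DITA toolchains use proper stylesheets (typically XSLT) rather than anything like this. It also skips HTML escaping to stay short.

```python
import xml.etree.ElementTree as ET

# A made-up, DocBook-flavoured fragment: tags describe roles, not appearance.
SOURCE = (
    "<para>Run <command>make install</command>. "
    "<warning>This overwrites existing files.</warning></para>"
)

# Two hypothetical "stylesheets": each maps a role to opening and closing strings.
HTML_STYLE = {
    "para": ("<p>", "</p>"),
    "command": ("<code>", "</code>"),
    "warning": ('<strong class="warning">', "</strong>"),
}
TEXT_STYLE = {
    "para": ("", "\n"),
    "command": ("`", "`"),
    "warning": ("WARNING: ", ""),
}


def render(element, style):
    """Recursively render an element using the given role-to-markup mapping."""
    opening, closing = style[element.tag]
    parts = [opening, element.text or ""]
    for child in element:
        parts.append(render(child, style))
        parts.append(child.tail or "")
    parts.append(closing)
    return "".join(parts)


root = ET.fromstring(SOURCE)
print(render(root, HTML_STYLE))
print(render(root, TEXT_STYLE))
```

The same source produces both outputs; changing how every warning looks means editing one entry in a stylesheet, not every document.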
DocBook focuses more on providing a large set of tags that covers every common case, while DITA focuses on a bare minimum that is easy to extend. Another difference is that DocBook encourages you to think in terms of long documents, whereas DITA encourages you to think in reusable "modular" documents.
Both DocBook and DITA documents would be stored in multiple files. A single document could be from tens to thousands of files.
OpenDocument is a standard for specific office documents. As such, an OpenDocument document would often be a single file. An OpenDocument document is more specific than DocBook or DITA. It is less likely to be a book, and more likely to be a letter, a specification, a spreadsheet or a presentation. Also, unlike DocBook and DITA, OpenDocument will very likely contain style information (colours, numbering, etc), because the text is not necessarily related to any other document and is only used once.
DocBook, DITA and OpenDocument are all formats used to store text in files. Usually these are XML files.
CMIS. I have never heard of this before today, but I do know about content management systems. I can therefore tell you that it is a headache to try to manage the path that a certain piece of text is supposed to take from the repository, disk or database where it is stored, up to the book, webpage, help system or blog where it is supposed to be published. Content management systems help you specify data for large sets of files; this data can then be used by a tool to decide where to publish a document, or just a piece of information. A content management system can be as simple as two folders on your hard drive: any files put in one folder should be published, for example, as PDFs in Chinese, whereas files put into the second folder should be published as blog entries in German and Turkish.
Now, content management systems are usually much more sophisticated than that, and there are many of them. I imagine that CMIS is an abstraction layer that allows different content management systems to interoperate, if by chance you have invested in more than one of them.
Finally, MoReq2. Again, I only discovered this today, and unlike CMIS, I don't even have experience with record keeping. However, you have two answers from @Tasha and @Marc Fresko which should give you a good start.
What I imagine about MoReq2 is that it can help you manage the lifecycle of your documents. For example, you may want to specify that a certain policy document is only valid until 2010, or that it has been deprecated already. I also imagine that MoReq2 is much, much more than that.
To sum up, all of these standards concern document management. DocBook, DITA and OpenDocument are about writing and storing documents. CMIS is about managing where the documents go. And MoReq2 seems to be about how long they live.
On CMIS, try this link. MoReq2 is not about digital archives, and it's not about "storing metadata". It is a set of typical functional requirements for a decent Electronic Records Management System. Both documents are in the public domain - get them and read the introductions.
Tasha's reply is 100% accurate. I'll add that the metadata model in MoReq2 is the weakest part of MoReq2, and arguably the least important - it probably contains many errors. I say this on the basis of having been the leader of the MoReq2 project.

Are there any data warehouse frameworks?

I've got a lot of MySQL data that I need to generate reports from. It's mostly historic data so it won't be changing much, but it weighs in at 20-30 gigabytes easily and is expected to grow. I currently have a collection of PHP scripts that will do some complex queries and output CSV and Excel files. I also use phpMyAdmin with bookmarked queries. I manually edit them to change the parameters. The amount of data is growing and the number of people who need access to it is also growing, so I'm making the time to improve this situation.
I started reading about data warehousing the other day and it seems that this is an area that relates to what I need to do. I've read some good articles and am even waiting on a book. I think I'm getting a handle on what these sorts of systems do and what's possible.
Creating a reporting system for my data has always been on a todo list, but until recently I figured it would be a highly niche programming venture. Since I now know data warehousing is a common thing, I figure there must be some sort of reporting/warehousing frameworks available to ease the development. I'd gladly skip writing interfaces and scripts to schedule and email reports and the like and stick to writing queries and setting up relations.
I've mostly been a LAMP guy, but I'm not above switching languages or platforms. I just need a more robust solution as my one-off scripts don't scale well.
So where's a good place to get started?
I'll discuss a few points on the {budget, business utility function, time frame} spectrum out there. For convenience, let's follow the architecture conceptualization you linked to at
WikipediaDataWarehouseArticle
Operational database layer
The source data for the data warehouse - Normalized for In One Place Only data maintenance
Data access layer
The transformation of your source data into your informational access layer. ETL tools to extract, transform, load data into the warehouse fall into this layer.
Informational access layer
• Report-facilitating Data Structure
Data is not maintained here. It is merely a reflection of your source data
Hence, denormalized structures (containing duplicate, but systematically derived data)
are usually most effective here
• Reporting tools
How do you actually allow your users access to the data
• pre-canned reports (simple)
• more dynamic slice-and-dice access methods
The data accessed for reporting and analyzing and the tools for reporting and analyzing data
fall into this layer. And the Inmon-Kimball differences about design methodology,
discussed later in the Wikipedia article, have to do with this layer.
Metadata layer (facilitates automation, organization, etc)
Roll your own (low-end)
For very little out-of-pocket cost, just recognizing the need for the denormalized structures can buy those who are not using them some efficiencies.
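A minimal sketch of such a denormalized reporting structure, using Python's built-in sqlite3 module so that it runs anywhere. The tables, columns and sample rows are hypothetical and only stand in for your own source data; the same pattern applies in MySQL or SQL Server. The reporting table is derived from the normalized source tables and is never maintained by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized source tables (hypothetical schema for illustration)
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         order_date TEXT, amount REAL);
    INSERT INTO customers VALUES (1, 'Acme', 'North'), (2, 'Globex', 'South');
    INSERT INTO orders VALUES (1, 1, '2010-01-05', 100.0),
                              (2, 1, '2010-02-10', 50.0),
                              (3, 2, '2010-01-20', 75.0);

    -- Denormalized reporting table: derived from the source, never edited by hand
    CREATE TABLE report_orders AS
        SELECT o.id AS order_id, o.order_date,
               substr(o.order_date, 1, 7) AS month,
               c.name AS customer, c.region, o.amount
        FROM orders o JOIN customers c ON c.id = o.customer_id;
""")

# Reports then query the flat table without any joins.
rows = conn.execute(
    "SELECT region, month, SUM(amount) FROM report_orders "
    "GROUP BY region, month ORDER BY region, month"
).fetchall()
print(rows)
```

In practice the reporting table would be rebuilt or incrementally refreshed on a schedule, which is the job of the "Data access layer" described above.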
Get in the ballgame (some outlays required)
You don't need to use all the functionality of a platform right off the bat.
IMO, however, you want to be on a platform that you know will grow, and in the highly competitive and consolidating BI environment, that seems to be one of the four enterprise mega-vendors (my opinion)
Microsoft (the platform of our 110 employee firm)
SAP
Oracle
IBM
BiMarketStateArticle
My firm is at this stage. In the "Data Access Layer" we use some of the ETL capability offered by SQL Server Integration Services (SSIS), plus some use of Talend, which is open source but in practice requires a license. We also have a denormalized reporting structure (implemented completely in the basic SQL Server database) and use SQL Server Reporting Services (SSRS) to largely automate (depending on your skill) the production of pre-specified reports. Note that an SSRS "report" is merely a (scalable) XML configuration/specification that gets rendered at runtime via the SSRS engine. Choices such as export to an Excel file are simple options.
Serious Commitment (some significant human commitment required)
Notice above that we have yet to utilize the data mining/dynamic slicing/dicing capabilities of SQL Server Analysis Services. We are working toward that, but are now focused on improving the quality of our data cleansing in the "Data Access Layer".
I hope this helps you to get a sense of where to start looking.
Pentaho has put together a pretty comprehensive suite of products. The products are "free", but be prepared for the usual heavy sell once you fork over your identifying information.
I haven't had a chance to really stretch them as we're a Microsoft shop from one sad end to the other.
I think you should first check out Kimball and Inmon and see if you want to approach your data warehouse in a particular way. Kimball, in particular, lays out a very good framework for the modelling and construction of the warehouse.
There are a number of tools which try to ease the process of designing, implementing and managing/operating a data warehouse; each has its strengths and weaknesses, and they often have vastly differing price points. Under the covers, you are always going to be best off if you have a good knowledge of warehousing principles from the Kimball and/or Inmon camps.
As well as tools like Kalido and Wherescape RED (which do similar things in very different ways), many of the ETL platforms now have good built-in support for the donkey work of implementation: SCD components, lineage tracking, etc.
Best, though, to view all these as tools to be used in the hands of you, the craftsman. They make certain easy things even easier (or even trivial) and some hard things easier, but with some things they just get in the way, IMHO ;) Learn the methodology and principles first and get a good understanding of them, and then you will know which tools to apply from your kitbag and when...
It hasn't been updated in a while but there's a nice Data Warehousing/ETL Ruby package called ActiveWarehouse.
But I would check out the Pentaho products like Nick mentioned in another answer. It should easily handle the volume of data you have and may provide you with more ways to slice and dice your data than you could have ever imagined.
The best framework you can currently get is Anchor Modeling.
It might look quite complex because of its generic structure and built-in capability to historize data.
Also, the modeling technique is quite different from ERD.
But you end up with SQL code to generate all DB objects, including 3NF views, and:
insert/update handled by triggers
query any point/range in history
your application developers will not see the underlying 6NF anchor model.
The technology is open source and at the moment is unbeatable.
If you have an AM question, you may want to ask it under the anchor-modeling tag.
Kimball is the simpler method for data warehousing.
We use Informatica for moving data around, but it doesn't do DW things like indexing by default.
I like the idea of Wherescape RED, as a DW tool and using MS SQL's Linked Servers to obviate the need for an ETL tool.