What makes the '’The Algebra of Data' a better choice for data processing? - algebraixlib

I have read the book ’The Algebra of Data: A Foundation for the Data Economy’ and white paper ‘Data Algebra Hiding in Plain Sight’.
I would like to know people comments on ‘The Algebra of Data’ on following questions
What makes the ‘The Algebra of Data’ a good choice for defining data object and data processing?
What are the key benefits of using ‘The Algebra of Data’’ over other data storing and processing applications specifically SQL databases?
What are the general benefits of using ‘The Algebra of Data’?

(I am one of the authors of the book.) In response to: What makes "Algebra of Data" a good choice for defining data objects and data processing:
Pragmatically, right now, there are a relatively small number of developers employing data algebra. If I understand the situation at Algebraix Data Corp correctly, they are using it to build a SQL accelerator for the Spark environment. Aside from that activity a set of Python libraries (see http://algebraixdata.github.io/algebraixlib/) have been created for programmers to experiment with using the algebra programmatically. So right now there is not much in the way of software tools for implementing data algebra.
Realistically, the existence of data algebra has only just been made public and thus it is early days. It is not possible for me to know how fast it will get adopted, but it is out there and anyone who wishes to build software that employs it can do so.
The key benefits of using Data Algebra are simply the benefits of mathematics applied to any area. Many of the problems that data algebra could help with have been reasonably well dealt with by programmers, often quite a while ago. You might be able to improve on what's been done with, say a PC database, but there's probably little to be gained. (I cannot know that for sure, but it seems probable).
However Mathematics scales almost indefinitely - and does so accurately. Consequently the bigger the problem (data volumes, data speed, data variability and so on) the more useful it proves to be. So this is where it will make its initial impact, I expect.
At some point there will be a need to define an algebraic query language (probably a specialization of and extension of SQL) but right now I don't think anyone is working on that. If you consider the relational model of data - which was a failed attempt at a data algebra, it took quite a while for SQL to develop from it and for the model to come into general use.
I hope this helps

Related

Key value oriented database vs document oriented database

I have recently started learning NO SQL databases and I came across Key-Value oriented databases and Document oriented databases. Since they have a similar structure, aren't they saved and retrieved the exact same way? And if that is the case then why do we define them as separate types? Otherwise, how they are saved in the file system?
To get started it is better to pin point the least wrong vocabulary. What used to be called nosql is too broad in scope, and often there is no intersection feature-wise between two database that are dubbed nosql except for the fact that they somehow deal with "data". What program does not deal with data?! In the same spirit, I avoid the term Relational Database Management System (RDBMS). It is clear to most speakers and listeners that RDBMS is something among SQL Server, some kind of Oracle database, MySQL, PostgreSQL. It is fuzzy whether that includes SQLite, that is already an indicator, that "relational database" ain't the perfect word to describe the concept behind it. Even more so, what people usually call nosql never forbid relations. Even on top of "key-value" stores, one can build relations. In a Resource Description Framework database, the equivalent of SQL rows are called tuple, triple, quads and more generally and more simply: relations. Another example of relational database are database powered by datalog. So RDBMS and relational database is not a good word to describe the intended concepts, and when used by someone, only speak about the narrow view they have about the various paradigms that exists in the data(base) world.
In my opinion, it is better to speak of "SQL databases" that describe the databases that support a subset or superset of SQL programming language as defined by the ISO standard.
Then, the NoSQL wording makes sense: database that do not provide support for SQL programming language. In particular, that exclude Cassandra and Neo4J, that can be programmed with a language (respectivly CQL and Cypher / GQL) which surface syntax looks like SQL, but does not have the semantic of SQL (neither a superset, nor a subset of SQL). Remains Google BigQuery, which feels a lot like SQL, but I am not familiar enough with it to be able to draw a line.
Key-value store is also fuzzy. memcached, REDIS, foundationdb, wiredtiger, dbm, tokyo cabinet et. al are very different from each other and are used in verrrrrrrrrrry different use-cases.
Sorry, document-oriented database is not precise enough. Historically, they were two main databases, so called document database: ElasticSearch and MongoDB. And those yet-another-time, are very different software, and when used properly, do not solve the same problems.
You might have guessed it already, your question shows a lack of work, and as phrased, and even if I did not want to shave a yak regarding vocabulary related to databases, is too broad.
Since they have a similar structure,
No.
aren't they saved and retrieved the exact same way?
No.
And if that is the case then why do we define them as separate types?
Their programming interface, their deployment strategy and their internal structure, and intended use-cases are much different.
Otherwise, how they are saved in the file system?
That question alone is too broad, you need to ask a specific question at least explain your understanding of how one or more database work, and ask a question about where you want to go / what you want to understand. "How to go from point A-understanding (given), to point B-understanding (question)". In your question point A is absent, and point B is fuzzy or too-broad.
Moar:
First, make sure you have solid understanding of an SQL database, at the very least the SQL language (then dive into indices and at last fine-tuning). Without SQL knowledge, your are worthless on the job market. If you already have a good grasp of SQL, my recommendation is to forgo everything else but FoundationDB.
If you still want "benchmark" databases, first set a situation (real or imaginary) ie. a project that you know well, that requires a database. Try to fit several databases to solve the problems of that project.
Lastly, if you have a precise project in mind, try to answer the following questions, prior to asking another question on database-design:
What guarantees do you need. Question all the properties of ACID: Atomic, Consistent, Isolation, Durability. Look into BASE. You do not necessarily need ACID or BASE, but it is a good basis that is well documented to know where you want / need to go.
What is size of the data?
What is the shape of the data? Are they well defined types? Are they polymorphic types (heterogeneous shapes)?
Workload: Write-once then Read-only, mostly reads, mostly writes, a mix of both. Answer also the question how fast or slow can be writes or reads.
Querying: How queries look like: recursive / deep, columns or rows, or neighboor hood queries (like graphql and SQL without recursive queries do). Again what is the expected time to response.
Do not forgo to at least the review deployement and scaling strategies prior to commit to a particular solution.
On my side, I picked up foundationdb because it is the most versatile in those regards, even if at the moment it requires some code to be a drop-in replacement for all postgresql features.

What is the obvious advantage of using AMPL?

I am doing a project using CPLEX solver, on Netbeans with Java. We have several optimization problems to solve, I have already solved one of them by coding in Java all the constraints, objective and variables, without using AMPL. However, some people in my team want to use AMPL.
Thus, as I don't want to read all the AMPL book to find the answer, is there an obvious reason to rather use AMPL than coding all the constraints "manually"? Moreover, can AMPL be integrated in Netbeans ? I did not find any documentation about that.
Is AMPL useful when the constraints need to be "flexible" (I mean, we can't guess in advance the exact number of constraints, it depends on the parameters fixed by the user, modularity is a high importance factor...)
I am really curious to hear about that soon !
Thanks for help
AMPL is an algebraic modeling language and quoting from that link:
One advantage of AMPL is the similarity of its syntax to the
mathematical notation of optimization problems.
For example, this can allow you to define groups of constraints without knowing in advance the dimensions of the model. And, perhaps, you can make big changes to your model more quickly. (You'll have to think about how often you will actually do that.)
However, one could argue that the "obvious advantage" of AMPL is that it supports dozens of different solvers. You can create your model and solve it with CPLEX, but then decide that you want to use a different solver (e.g., Gurobi, Xpress, etc.). On the AMPL Solvers web page, they have the following recommendation:
We recommend that you then test alternative solvers to determine which
offers the best tradeoff of price and performance for your needs.
The AMPL API web page says that there is a Java API, so that should allow you to include it in a Netbeans project, but I have no experience with that.
At the end of the day, you could also argue that these "advantages" are a matter of taste. Using the CPLEX Java API directly, as you have already done, is certainly a valid solution if it meets your requirements. It may allow you to build the model more efficiently, use solver-specific/advanced features that might not be supported by AMPL, and to have more fine-grained control over the model formulation.
You have just coded an optimisation model to optimise your company's production of widgets. Your company got a really good deal on $SOLVER1 so that's what you're using.
Over the next ten years, you improve and extend that model as your bosses throw new requirements at you. By the end of that time, you may have tens of thousands of lines of optimisation code as part of a system that, by now, is absolutely critical to your company's operations.
Your company's original licensing deal has expired, and the manufacturers of $SOLVER1 have massively increased the licensing fees, so you're now paying hundreds of thousands a year in licensing costs.
Meanwhile, the boffins at a rival company have just released a new version of $SOLVER2. It has fancy new algorithms that could solve the widget optimisation problem 20% faster and find better solutions than $SOLVER1 is giving you. It doesn't cost any more than $SOLVER1 and the performance is better.
Meanwhile, the open-source community has released $FREESOLVER. It might not be quite as powerful as the top commercial options, but it's as good as $SOLVER1 was ten years ago, and if you weren't paying $100k/year for licensing you could rent an awful lot of server time to make up for it.
...so, did you write your optimisation model on a platform that lets you switch to a new solver and take advantage of these opportunities without having to jettison ten years' worth of code?
There are huge advantages to being able to switch solvers quickly and easily. I know of one company who uses three different solvers for their work: they try two different open-source solvers both running in the cloud, and if neither of those can find an adequate solution then they throw it to an expensive solver with smarter algorithms. The open-source solvers handle 90% of their problems, so they only have to use the commercial solver for the last 10%, which allows them to make significant savings on their licensing costs.
One option we've discussed at my work is to use a commercial solver for mission-critical work, and open-source alternatives for applications like training or small-scale prototyping where we don't have the same requirements. That way we can minimise the number of concurrent users we need to license for the commercial solver.
(And, yes, there is still an issue of lock-in with the platform, but platforms like AMPL are significantly cheaper than a high-end commercial solver.)
Totally agree with everything that rkersh says. Also note that you should never write your model in a way that hard-codes details of your problem sizes etc. whether you write in an algebraic modelling language or through one of the more direct APIs.
Also, working with a modelling language gives you an extra level/layer of abstraction which can help, especially in sharing or explaining your model to others, comparing with a range of standard problem types etc., but I prefer the more nuts-and-bolts 'feel' of working with the more direct APIs, and almost never need (or have time & budget) to reformulate my models that deeply.
Even GPL means "general" yet newer and newer GPLs coming to life, so a given GPL is "more general" to somet tasks than others... :-) In theory writing a compiler the most efficiently for Pascal or Perl should not matter, so in fact you could write in whatever language you want and yet you should not lose expressivity or efficiency (e.g. for C# which is in the same league for Java now, MS writes a better compiler than the opensource equivalent).
Humans are specializing - this is why we have gotten this far :-) . No different when it comes to achieve a given task to convert a business problem to a math model (aka modeling). The whole idea of having a given modeling layer is that
A. you have the outmost expressivity for that particular task (aka math modeling)
B. it enforces some best practicies for modeling what in GPL you are not "forced" to do (1. you are free to do 2. it is marketed to you as such = flexibility). E.g. AMPL, GAMS, others are mixing declarative code (aka model code) and procedural code (aka flow-control-like) which is not a good practice. On the other hand e.g. separating data and an abstract model is getting to ALL modeling languages but interestingly enough very slowly...
C. thru no.A you can maintain the code more efficiently than otherwise (contrary to API modeling - I have clients who say they turned to modelinglanguage becuase API modeling is a liability for rapid model revamp)
D. in theory you could be solver independent.
If you look around all modeling languages are trying to maintain no.C except OPL (that's for historical reasons). But even in case of OPL, you get constraint-programming and constraint-based scheduling (beside math-programming) what with AMPL/GAMS you don't, however solverindependent they are...
the $Solver1 and $Solver2 + $Freesolver comparison is a bit broken for 4 reasons
A. opensolvers are still very far away from commercial solvers in term of performance when it comes to large/complex problems (probably LP is getting to the exception) - I have clients - the fastest ever sales in my memory - when they tested commercial solvers after their "free-ride".
B. while indeed the scenario described in relation with $Solver1 and $Solver2 seems plausible ($Solver1, the incumbent is getting more expensive over time), we could witness just the other way around where the $Solver2 (a new comer) actualy increased its pricing 4x in 7 years and in some cases doubled it, while $Solver1 (the incumbent) has had no change.
C. mixing up modeling capabilities and solvers is a mistake. The whole idea is that somebody writes models in APIs IS the way to stick to a solver much more than thru modeling languages. At a minimum, as the Hungarians say "what you gain on the custom you lose it on the ferry", in other words, "freedom (i.e. flexibility) comes with using it responsibly"
D. owning a solver for development is NOT expensive at all, i.e. a company can maintain large # of solvers (for less than 10k$ a company could have +4 solvers for development) to test which is the fastest for any given model and then choose the best suited for deployment.
in addition, solver is just one piece of the puzzle. E.g. I have a client who has disparate data sources and it takes 8hours to create a model and 4hours to solve it. Would this client welcome a more efficient data handling suite or would it insist that the solver should be faster? Modelers are too isolated from the business in most cases and while in their mind a given model is perfect, how it is populated by data is secondary, yet it makes or breaks a good performance.
I witness that API modelers are moving to modeling languages, not the other way around for various reasons...
but as somebody wrote above, there are lots of "tastes in the game", so eventually if you feel more confortable with a given approach then nobody can blame you to choose so... :-) after all it is very difficult to compare the/an other approach since it's almost never there on a given case... so eventually what counts is speed from business problem to a model which solve fast in the given application context :-)
phew, it was long... but I gave all my shots... :-)
To keep it short to illustrate advantage/disadvantage of using AMPL just compare using Java(AMPL) instead of assembly language(CPLEX).

recommendations for a dbms for an EAV system with mostly insert and select operations needs on .net stack

In the project I have been working on, the data modeling requirements are:
A system consisting of N number of clients with each having N number of events. An event is an entity with a required name and timestamp at which it occurs. Optionally, an event may have N number of properties (key/value pares) defining attributes that a client want to store with the particular instance of that event.
The system will have mostly:
inserts – events are logged but never updated.
selects – reports/actions will be generated/executed based on events and properties of any possible combinations.
The requirements reflect an entity-attribute-value (EAV) data model. After researching for sometimes, I feel that a relational dbms like Sql Server might not be a good fit for this. (correct me if I'm wrong!)
So I'm leaning toward NoSql option like MongoDb/CouchDb/RavenDb etc.
My questions are:
What is the best fit in available NoSql solutions keeping in view of my system's heavy insert/select needs?
I'm also open for relational option if these requirements can be translated into relational schema. Although I personally doubt this, but after reading performance DBA answers (like referenced here), I got curious. However, I couldn't figure out myself an optimal relational model for my requirements, perhaps the system being rather generic.
thanks!
MongoDB really shines when you write unstructured data to it (like your event). Also, it is able to sustain pretty heavy write load. However, it's not very good for reporting. At least, for reporting in the traditional sense.
So, if your reporting needs are simple, you might get away with some simple map-reduce jobs. Otherwise you can export data to a relational database (nightly job, for example) and report the hell out of it.
Such hybrid solution is pretty common (in my experience).

Why “Set based approaches” are better than the “Procedural approaches”?

I am very eager to know the real cause though earned some knowledge from googling.
Thanks in adavnce
Because SQL is a really poor language for writing procedural code, and because the SQL engine, storage, and optimizer are designed to make it efficient to assemble and join sets of records.
(Note that this isn't just applicable to SQL Server, but I'll leave your tags as they are)
Because, in general, the hundreds of man-years of development time that have gone into the database engine and optimizer, and the fact that it has access to real-time statistics about the data, have resulted in it being better than the user in working out the best way to process the data, for a given request.
Therefore by saying what we want to achieve (with a set-based approach), and letting it decide how to do it, we generally achieve better results than by spelling out exactly how to provess the data, line by line.
For example, suppose we have a simple inner join from table A to table B. At design time, we generally don't know 'which way round' will be most efficient to process: keep a list of all the values on the A side, and go through B matching them, or vice versa. But the query optimizer will know at runtime both the numbers of rows in the tables, and also the most recent statistics may provide more information about the values themselves. So this decision is obviously better made at runtime, by the optimizer.
Finally, note that I have put a number of 'generally's in this post - there will always be times when we know better than the optimizer will, and for such times we can provide hints (NOLOCK etc).
Set based approaches are declarative, so you don't describe the way the work will be done, only what you want the result to look like. The server can decide between several strategies how to complay with your request, and hopefully choose one that is efficient.
If you write procedural code, that code will at best be less then optimal in some situation.
Because using a set-based approach to SQL development conforms to the design of the data model. SQL is a very set-based language, used to build sets, subsets, unions, etc, from data. Keeping that in mind while developing in TSQL will generally lead to more natural algorithms. TSQL makes many procedural commands available that don't exist in plain SQL, but don't let that switch you to a procedural methodology.
This makes me think of one of my favorite quotes from Rob Pike in Notes on Programming C:
Data dominates. If you have chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
SQL databases and the way we query them are largely set-based. Thus, so should our algorithms be.
From an even more tangible standpoint, SQL servers are optimized with set-based approaches in mind. Indexing, storage systems, query optimizers, and other optimizations made by various SQL database implmentations will do a much better job if you simply tell them the data you need, through a set-based approach, rather than dictating how you want to get it procedurally. Let the SQL engine worry about the best way to get you the data, you just worry about telling it what data you want.
As each one has explained, let the SQL engine help you, believe, it is very smart.
If you do not use to write set based solution and use to develop procedural code, you will have to spend some time until write well formed set based solutions. This is a barrier for most people. A tip if you wish to start coding set base solutions is, stop thinking what you can do with rows, and start thinking what you can do with collumns, and do practice functional languages.

Are there any data warehouse frameworks?

I've got a lot of mysql data that I need to generate reports from. It's mostly historic data so it won't be changing much, but it weighs in at 20-30 gigabytes easily and is expected to grow. I currently have a collection of php scripts that will do some complex queries and output csv and excel files. I also use phpMyAdmin with bookmarked queries. I manually edit them to change the parameters. The amount of data is growing and the number of people who need access to it is also growing, so I'm making the time to improve this situation.
I started reading about data warehousing the other day and it seems that this an area that relates to what I need to do. I've read some good articles and am even waiting on a book. I think I'm getting a handle on what these sorts of systems do and what's possible.
Creating a reporting system for my data has always been on a todo list, but until recently I figured it would be a highly niche programing venture. Since I now know data warehousing is a common thing, I figure there must be some sort of reporting/warehousing frames available to ease in the development. I'd gladly skip writing interfaces and scripts to schedule and email reports and the like and stick to writing queries and setting up relations.
I've mostly been a lamp guy, but I'm not above switching languages or platforms. I just need a more robust solution as my one off scripts don't scale well.
So where's a good place to get started?
I'll discuss a few points on the {budget, business utility function, time frame} spectrum out there. For convenience, let's follow the architecture conceptualization you linked to at
WikipediaDataWarehouseArticle
Operational database layer
The source data for the data warehouse - Normalized for In One Place Only data maintenance
Data access layer
The transformation of your source data into your informational access layer. ETL tools to extract, transform, load data into the warehouse fall into this layer.
Informational access layer
• Report-facilitating Data Structure
Data is not maintained here. It is merely a reflection of your source data
Hence, denormalized structures (containing duplicate, but systematically derived data)
are usually most effective here
• Reporting tools
How do you actually allow your users access to the data
• pre-canned reports (simple)
• more dynamic slice-and-dice access methods
The data accessed for reporting and analyzing and the tools for reporting and analyzing data
fall into this layer. And the Inmon-Kimball differences about design methodology,
discussed later in the Wikipedia article, have to do with this layer.
Metadata layer (facilitates automation, organization, etc)
Roll your own (low-end)
For very little out-of-pocket cost, just recognizing the need for the denormalized structures can buy those that are not using it some efficiencies
Get in the ballgame (some outlays required)
You don't need to use all the functionality of a platform right off the bat.
IMO, however, you want to be on a platform that you know will grow, and in the highly competitive and consolidating BI environment, that seems to be one of the four enterprise mega-vendors (my opinion)
Microsoft (the platform of our 110 employee firm)
SAP
Oracle
IBM
BiMarketStateArticle
My firm is at this stage, using some of the ETL capability offered by SQL Server Integration Services (SSIS) and some alternate usage of the open source, but in practice license requiring Talend product in the "Data Access Layer", a denormalized reporting structure (implemented completely in the basic SQL Server database), and SQL Server Reporting Services (SSRS) to largely automate (based on your skill) the production of pre-specified reports. Note that an SSRS "report" is merely a (scalable) XML configuration/specification that gets rendered at runtime via the SSRS engine. Choices such as export to an excel file are simple options.
Serious Commitment (some significant human commitment required)
Notice above that we have yet to utilize the data mining/dynamic slicing/dicing
capabilities of SQL Server Analysis Services. We are working toward that,
but now focused on improving the quality of our data cleansing in the "Data Access Layer".
I hope this helps you to get a sense of where to start looking.
Pentaho has put together a pretty comprehensive suite of products. The products are "free", but be prepared for the usual heavy sell once you fork over your identifying information.
I haven't had a chance to really stretch them as we're a Microsoft shop from one sad end to the other.
I think you should first check out Kimball and Inmon and see if you want to approach your data warehouse in a particular way. Kimball, in particular, lays out a very good framework for the modelling and construction of the warehouse.
There are a number of tools which try to make the process of designing, implementing and managing/operating a Data Warehouse and they each have their strengths and weaknesses and often vastly differing price points. Under the covers you are always going to be best off if you have a good knowledge of warsehousing principles from the Kimball and/or Inmon camps.
As well as tools like Kalido and Wherescape RED (which do similar thing in very different ways), many of the ETL platforms now have good in-built support for the donkey work of implementation - SCD components etc and lineage tracking.
Best though to view all these as tools to be used in the hands of you, the craftsman, they make certain easy things even easier (or even trivial), some hard things easier but some things they just get in they way of IMHO ;) Learn the methodology and principles first and get a good understanding of them and then you will know which tools to apply from your kitbag and when...
It hasn't been updated in a while but there's a nice Data Warehousing/ETL Ruby package called ActiveWarehouse.
But I would check out the Pentaho products like Nick mentioned in another answer. It should easily handle the volume of data you have and may provide you with more ways to slice and dice your data than you could have ever imagined.
The best framework you can currently get is Anchor Modeling.
It might look quite complex because of it's generic structure and built-in capability to historize data.
Also modeling technique is quite different than ERD.
But you end-up with sql code to generate all db objects including 3NF views and:
insert/update handled by triggers
query any point/range in history
you application developers will not see underlying 6NF anchor model.
The technology is open sourced and at the moment is unbeatable.
If you would have AM question you may want to ask on that tag anchor-modeling.
Kimball is the simpler method for data warehousing.
We use Informatica for moving data around, but it doesn't do DW things like indexing by default.
I like the idea of Wherescape RED, as a DW tool and using MS SQL's Linked Servers to obviate the need for an ETL tool.