Trying to normalize tables to 2NF and 3NF

I have this dependency diagram I'm trying to make into 2NF and then 3NF. I'm not sure if I'm doing it right:
Here's the dependency:
This is how I tried to model it:

If I understand the diagram correctly, your "1b" is the right decomposition. All the tables in "1b" are in at least 5NF, though.
The notion that you can normalize to 2NF and no higher, or to 3NF and no higher, is a common misunderstanding of how normalization works. It's quite common to start with a relation that's in 1NF, and in a single step end up with all the relations in 5NF.

Related

Is Breeze worth the dependency?

I am programming a machine learning algorithm in Scala. For it, I don't think I will need matrices, but I will need vectors. However, the vectors won't even need a dot product, just element-wise operations. I see two options:
use a linear algebra library like Breeze
implement my vectors as Scala collections like List or Vector and work on them in a functional manner
The advantage of using a linear algebra library might be less programming for me... but will it, once I factor in the learning curve? I have already started learning and using Breeze, and it does not seem that straightforward (the documentation is so-so). Another disadvantage is having an extra dependency. I don't have much experience writing my own projects (so far I have programmed on the job, where library usage was dictated).
So, in my particular case, is a linear algebra library, e.g. Breeze, worth the learning curve and the dependency, compared to programming the needed functionality myself?

Best Method to Intersect Huge HyperLogLogs in Redis

The problem is simple: I need to find the optimal strategy to implement accurate HyperLogLog intersections based on Redis' representation thereof; this includes handling their sparse/dense representations if the data structure is exported for use elsewhere.
Two Strategies
There are two strategies, one of which seems vastly simpler. I've looked at the actual Redis source and I'm having a bit of trouble (I'm not big on C, myself) figuring out whether it's better, from a precision and efficiency perspective, to use their built-in structures/routines or to develop my own. For what it's worth, I'm willing to sacrifice space and, to some degree, accuracy (stdev ±2%) in the pursuit of efficiency with extremely large sets.
1. Inclusion Principle
By far the simpler of the two: essentially, I would just use the lossless union (PFMERGE) in combination with the inclusion-exclusion principle to calculate an estimate of the overlap. Tests seem to show this running reliably in many cases, although I'm having trouble getting an accurate handle on in-the-wild efficiency and accuracy (some cases can produce errors of 20-40%, which is unacceptable in this use case).
Basically:
|A ∩ B| = |A| + |B| - |A ∪ B|
or, in the case of multiple sets, the full inclusion-exclusion expansion, e.g. for three sets:
|A ∩ B ∩ C| = |A| + |B| + |C| - |A ∪ B| - |A ∪ C| - |B ∪ C| + |A ∪ B ∪ C|
This seems to work in many cases with good accuracy, but I don't know if I trust it. While Redis has many built-in low-cardinality modifiers designed to circumvent known HLL issues, I don't know whether the issue of wild inaccuracy (using inclusion/exclusion) is still present with sets of highly disparate sizes...
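The arithmetic above can be sketched as a tiny helper. This is a minimal illustration, not Redis code; the key names in the comments are hypothetical, and the cardinalities would come from PFCOUNT calls (with the union key produced by PFMERGE):

```python
def intersection_estimate(card_a: int, card_b: int, card_union: int) -> int:
    """Pairwise inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|.
    Clamped at zero, since HLL noise can push the raw estimate slightly
    negative for near-disjoint sets."""
    return max(0, card_a + card_b - card_union)

# With a live redis-py client, the calls would look roughly like
# (key names hypothetical):
#   r.pfmerge("tmp:union", "hll:a", "hll:b")
#   est = intersection_estimate(r.pfcount("hll:a"),
#                               r.pfcount("hll:b"),
#                               r.pfcount("tmp:union"))
```

Note that every term here is itself an HLL estimate, which is exactly why the error compounds for small intersections of large sets.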
2. Jaccard Index Intersection/MinHash
This way seems more interesting, but a part of me feels like it may computationally overlap with some of Redis' existing optimizations (i.e., I'm not implementing my own HLL algorithm from scratch).
With this approach I'd use a random sampling of bins with a MinHash algorithm (I don't think an LSH implementation is worth the trouble). This would be a separate structure, but by using MinHash to get the Jaccard index of the sets, you can then effectively multiply the union cardinality by that index for a more accurate count.
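For concreteness, here is a minimal, self-contained MinHash sketch in plain Python (salted hashes stand in for random permutations; a production version would fix the hash family up front and store the signature alongside the HLL). Multiplying the estimated Jaccard index by the union cardinality from PFMERGE/PFCOUNT then gives the intersection estimate:

```python
import hashlib

def minhash_signature(items, k=128):
    """k-slot MinHash signature: slot i keeps the minimum salted
    64-bit hash value seen over all items."""
    sig = [2 ** 64] * k
    for item in items:
        data = str(item).encode()
        for i in range(k):
            h = hashlib.blake2b(data, digest_size=8, salt=str(i).encode())
            v = int.from_bytes(h.digest(), "big")
            if v < sig[i]:
                sig[i] = v
    return sig

def jaccard_estimate(sig_a, sig_b):
    """Estimated Jaccard index: the fraction of slots on which
    the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# intersection ≈ jaccard_estimate(sig_a, sig_b) * union_cardinality,
# where union_cardinality comes from PFCOUNT after PFMERGE.
```

The standard error of the Jaccard estimate is roughly sqrt(J(1-J)/k), so k trades space for accuracy.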
Problem is, I'm not very well versed in HLLs and, while I'd love to dig into the Google paper, I need a viable implementation in short order. Chances are I'm overlooking some basic considerations, either in Redis' existing optimizations or in the algorithm itself, that would allow for computationally cheap intersection estimates with pretty lax confidence bounds.
Thus, my question:
How do I most effectively get a computationally cheap intersection estimate of N huge (billions of members) sets, using Redis, if I'm willing to sacrifice space (and, to a small degree, accuracy)?
Read this paper some time back. It will probably answer most of your questions. The inclusion principle inevitably compounds error margins with a large number of sets. The MinHash approach would be the way to go.
http://tech.adroll.com/media/hllminhash.pdf
There is a third strategy to estimate the intersection size of any two sets given as HyperLogLog sketches: Maximum likelihood estimation.
For more details see the paper available at
http://oertl.github.io/hyperloglog-sketch-estimation-paper/.

The connection between Abstract Algebra and programming

I'm a computer science student, and among the things I'm learning is Abstract Algebra, especially group theory.
I've been programming for about 5 years, and I've never used the things I'm learning in Abstract Algebra.
What is the connection between programming and abstract algebra? I really want to know.
Group theory is very important in cryptography, for instance, especially finite groups in asymmetric encryption schemes such as RSA and ElGamal. These use finite groups based on modular multiplication of integers. However, there are also other, less obvious kinds of groups that are applied in cryptography, such as elliptic curves.
Another application of group theory, or, to be more specific, finite fields, is checksums. The widely used checksum mechanism CRC is based on polynomial arithmetic over the finite field GF(2).
Another, more abstract application of group theory is in functional programming. In fact, all of these applications exist in any programming language, but functional programming languages, especially Haskell and Scala(z), embrace it by providing type classes for algebraic structures such as monoids, groups, rings, fields, vector spaces and so on. The advantage of this is, obviously, that functions and algorithms can be specified in a very generic, high-level way.
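To make the type-class idea concrete in a language-neutral way, here is a minimal Python sketch (Python has no type classes, so a plain record bundling the identity element and the associative operation stands in for a Monoid instance):

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable

@dataclass(frozen=True)
class Monoid:
    """An associative binary operation together with its identity element."""
    empty: Any
    combine: Callable[[Any, Any], Any]

def mconcat(m: Monoid, xs):
    """Fold any sequence with any monoid: the generic algorithm is
    written once and reused for every instance."""
    return reduce(m.combine, xs, m.empty)

# The same generic fold, specialised three ways:
addition = Monoid(0, lambda a, b: a + b)
joining  = Monoid("", lambda a, b: a + b)
union    = Monoid(frozenset(), lambda a, b: a | b)
```

In Haskell or Scala the compiler picks the right instance for you; the point here is only that one generic function serves every structure satisfying the monoid laws.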
On a meta level, I would also say that an understanding of basic mathematics such as this is essential for any computer scientist (not so much for a computer programmer, but for a computer scientist, definitely), as it shapes your entire way of thinking and is necessary for more advanced mathematics. If you want to do 3D graphics or program an industrial robot, you will need Linear Algebra, and for Linear Algebra you should know at least some Abstract Algebra.
I don't think there's any inherent connection between group theory and programming... or rather, your question doesn't quite make sense as posed. There are applications of programming to algebra and vice versa, but they are not intrinsically tied together or mutually beneficial to one another, so to speak.
If you are a computer scientist trying to solve some fun abstract algebra problems, there are numerous enumeration and classification problems that could benefit from a computational approach, particularly in geometric group theory, which is a hot topic at the moment. Here's a pretty comprehensive list of researchers and problems (as of 3 years ago, at least):
http://www.math.ucsb.edu/~jon.mccammond/geogrouptheory/people.html
Popular problems include finitely presented groups, classification of transitive permutation groups, Möbius functions, and polycyclic generating systems,
as well as these:
http://en.wikipedia.org/wiki/Schreier–Sims_algorithm
http://en.wikipedia.org/wiki/Todd–Coxeter_algorithm
and a problem that gave me many sleepless nights:
http://en.wikipedia.org/wiki/Word_problem_for_groups
Existing computer algebra systems include GAP and Magma.
Finally, an excellent reference:
http://books.google.com/books?id=k6joymrqQqMC&printsec=frontcover&dq=finitely+presented+groups+book&hl=en&sa=X&ei=WBWUUqjsHI6-sQTR8YKgAQ&ved=0CC0Q6AEwAA#v=onepage&q=finitely%20presented%20groups%20book&f=false

Solve a multiobjective optimization problem: CPLEX or MATLAB?

I have to solve a multiobjective problem, but I don't know whether I should use CPLEX or MATLAB. Can you explain the advantages and disadvantages of both tools?
Thank you very much!
This is really a question about choosing the most suitable modeling approach in the presence of multiple objectives, rather than deciding between CPLEX and MATLAB.
Multi-criteria decision making is a whole subfield in itself. Take a look at: http://en.wikipedia.org/wiki/Multi-objective_optimization.
Once you have decided on the approach and formulated your problem (either by collapsing your multiple objectives into a single weighted one, or as a series of linear programs), either tool will do the job for you.
Since you are familiar with MATLAB, you can start by using it to solve a series of linear programs (a goal programming approach). This page by MathWorks has a few examples with step-by-step details to get you started: http://www.mathworks.com/discovery/multiobjective-optimization.html
This question is probably no longer of current concern to you. However, my answer is rather universal, so let me post it here.
If solving a multiobjective problem means deriving a specific Pareto optimal solution, then you need to solve a single-objective problem obtained by scalarizing (aggregating) the objectives. The type of scalarization and values of its parameters (if any) depend on decision maker's preferences, e.g. how he/she/you want(s) to prioritize different objectives when they conflict with each other. Weighted sum, achievement scalarization (a.k.a. weighted Chebyshev), and lexicographic optimization are the most widespread types. They have different advantages and disadvantages, so there is no universal recommendation here.
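To make the weighted-sum idea concrete, here is a toy sketch over a finite set of candidate solutions (the plan names and objective values are invented for illustration). Note how different weight vectors select different Pareto-optimal solutions, which is exactly how the decision maker's preferences enter the scalarization:

```python
def weighted_sum_optimum(solutions, weights):
    """Pick the solution minimising the weighted sum of its objectives.
    `solutions` maps a label to its objective vector (minimisation)."""
    def scalarised(item):
        _, objectives = item
        return sum(w * f for w, f in zip(weights, objectives))
    return min(solutions.items(), key=scalarised)[0]

# Hypothetical bi-objective trade-off: (cost, delivery time).
candidates = {
    "plan_a": (100.0, 9.0),   # cheap but slow
    "plan_b": (180.0, 3.0),   # fast but expensive
    "plan_c": (140.0, 5.0),   # a compromise
}
```

With weights (1, 1) the cheap plan wins; weighting time heavily, e.g. (1, 30), flips the choice to the fast plan. A weighted sum cannot, however, reach Pareto points in non-convex regions of the front, which is one reason the Chebyshev scalarization exists.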
CPLEX is preferred in the case where (A) your scalarized problem belongs to a class solved by CPLEX (obviously), e.g. it is a [mixed-integer] linear/quadratic problem, and (B) the problem is complex enough for computational time to be essential. CPLEX specializes in a narrow class of problems and should be much faster than MATLAB in complex cases.
You do not have to limit the choice of multiobjective methods to the ones offered by MATLAB/CPLEX or other solvers (which are usually narrow). It is easy to formulate a scalarized problem by yourself and then run an appropriate single-objective optimization (source: it is one of my main research fields; see e.g. an implementation for the class of knapsack problems). The issue boils down to finding a suitable single-objective solver.
If you want to obtain some general information about the whole Pareto optimal set, I recommend starting by deriving the nadir and the ideal objective vectors.
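For a finite set of candidate objective vectors, the ideal and nadir points can be sketched like this (a minimal illustration assuming minimization; for continuous problems the nadir point is generally much harder to compute exactly):

```python
def pareto_front(solutions):
    """Keep only the non-dominated objective vectors (minimisation)."""
    def dominated(x):
        return any(all(o <= f for o, f in zip(other, x)) and other != x
                   for other in solutions)
    return [x for x in solutions if not dominated(x)]

def ideal_and_nadir(solutions):
    """Componentwise best (ideal) and worst (nadir) objective values
    taken over the Pareto-optimal set."""
    front = pareto_front(solutions)
    dim = len(front[0])
    ideal = tuple(min(f[i] for f in front) for i in range(dim))
    nadir = tuple(max(f[i] for f in front) for i in range(dim))
    return ideal, nadir
```

Together the two vectors bound the ranges that each objective takes over the Pareto set, which is useful for normalizing objectives before scalarizing.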
If you want to derive a representation of the Pareto optimal set, besides the mentioned population-based heuristics such as GAs, there are exact methods developed for specific classes of problems. Examples: a library implemented in Julia, and a recently published method.
All concepts mentioned here are described in the comprehensive book by Miettinen (1999).
Can CPLEX solve a Pareto-type multiobjective problem? All I know is that it can solve simple goal programming by defining lexicographic objectives, or it can use the weighted sum, changing the weights gradually with sensitivity information to "enumerate" the Pareto front, which depends strongly on the weights and looks very subjective.
You can refer here for how CPLEX solves the bi-objective case, which seems not that good.
For a true Pareto approach that includes ranking, I only know that some GA variants, like NSGA-II, can do it.
A different approach would be to use a domain-specific modeling language for mathematical optimization like YALMIP (or JuMP.jl if you would like to give Julia a try). There you can write your optimization problem in MATLAB with some extra YALMIP functionality and use CPLEX (or any other supported solver) as a backend, without being restricted to one solver.

Text classification, preprocessing included

Which is the best method for document classification if time is not a factor and we don't know how many classes there are?
To my (incomplete) knowledge, hierarchical agglomerative clustering is the best approach if you don't know the number of classes. All of the other clustering algorithms either require prior knowledge of the number of buckets or some sort of cross-validation or other experimentation to determine a sensible number of buckets.
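A minimal sketch of the bottom-up idea (toy 2-D points with Euclidean distance for brevity; for documents you would use e.g. TF-IDF vectors with cosine distance): the number of clusters is not fixed in advance, it emerges from a distance threshold.

```python
from math import dist  # Python 3.8+

def single_link(c1, c2):
    """Single-linkage distance: the closest pair across two clusters."""
    return min(dist(p, q) for p in c1 for q in c2)

def agglomerative(points, threshold):
    """Bottom-up clustering: start with singleton clusters and keep
    merging the two closest clusters until every inter-cluster
    distance exceeds the threshold."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]),
        )
        if single_link(clusters[i], clusters[j]) > threshold:
            break
        clusters[i].extend(clusters.pop(j))
    return clusters
```

The threshold still has to be chosen, but cutting a dendrogram at a distance is usually easier to justify than guessing k up front.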
A cross-link: see how-do-i-determine-k-when-using-k-means-clustering on SO.