clustering l1 norm, k-medians - cluster-analysis

I would like to know how to show that the prototype $\mu_k$ that minimizes the cost function can be interpreted as the elementwise median (not mean) of the inputs assigned to the cluster.

Have you tried to prove this yourself?
As far as I can tell, it should be straightforward.
Prove that it is sufficient to prove this in one dimension
Prove it for 1-dimensional data (= use the regular proof for the median)

Related

Certified calculations in a proof assistant

Symbolic calculations performed manually or by a computer algebra system may be faulty or hold only subject to certain assumptions. A classical example is sqrt(x^2) == x which is not true in general but it does hold if x is real and non-negative.
Are there examples where proof assistants/checkers such as Coq, Isabelle, HOL, Metamath, or others are used to certify correctness of symbolic calculations? In particular, I am interested in calculus and linear algebra examples such as solving definite or indefinite integrals, differential equations, and matrix equations.
Update:
To be more concrete, it would be interesting to know whether there are examples of undergraduate assignments in calculus and linear algebra that could be formally solved (possibly with the help of a proof assistant) such that the solution can be automatically verified by a proof checker. A very simple example assignment for Lean is here.
For the Coq proof assistant there are several libraries to help with that. One matching your request quite well is Coquelicot (https://gitlab.inria.fr/coquelicot/coquelicot). The Coquelicot team made an exercise and participated in the French baccalauréat - I would say comparable more to a college than a high school math exam - and finished proofs for a good part of the exercises. The proofs can be found in the examples here (https://gitlab.inria.fr/coquelicot/coquelicot/-/tree/master/examples). I thought about translating the exercises and solutions to English.
But this was quite a few years ago and meanwhile there are very powerful tools for specific applications. E.g. there is coq-interval (https://gitlab.inria.fr/coqinterval/interval) which fully automatically does Coq proofs of rather complicated inequalities, say that a high order polynomial matches a sine function in a certain interval with a certain maximum deviation. It does this by Taylor decomposition and computing upper bounds for the residual. It can also do error proofs for a wide range of numerical integrals. A new feature added recently is the ability to do proven correct plots.
A tool for proving in Coq the error between infinite precision real and floating point computations is Gappa (https://gitlab.inria.fr/gappa/gappa).
Another very interesting Coq development is CoRN (https://github.com/coq-community/corn), a formalization of constructive reals in Coq. Constructive Reals are true real numbers which do compute. Essentially a constructive real number is an algorithm to compute a number to any desired precision together with a proof that this algorithm converges. One can prove that such numbers fulfill all usual properties of real numbers. An interesting side effect of constructive reals is that they need only LPO as axiom, while in classical reals the existence of the real numbers itself is an axiom. Any computation you do in CoRN, say pi>3, is automatically proven correct.
All these tools are included in Coq Platform, a common distribution of the Coq proof assistant.
There is more and this is steadily increasing. I would say it is not that far in the future that we have a usable proven correct CAS.
The only thing that comes to my mind is that Isabelle/HOL can replay SMT proofs (as produced e.g. by Z3 or CVC4), e.g. involving integer and real arithmetic. For computer algebra systems, I don't know of any comparable examples.
The problem is that computer algebra systems tend not to be set up in a way where they can output a detailed certificate for their simplifications – if they were able to do that, one could attempt to replay that in a theorem prover. But it would have to go beyond purely equational reasoning, since many rules (such as your example) require proving inequalities as preconditions.
If computer algebra systems were able to output a trace of their computations as a list of rewrite rules that were used, including how to prove each of their preconditions, one could in principle replay such a trace in a theorem prover – but that would of course require that every rule used by the CAS has a corresponding rule in the theorem prover (this is roughly how replaying SMT proofs works in Isabelle). However, I do not know of any projects like this.
There are, on the other hand, various examples where CASs are used to compute some easily verifiable (but hard to compute) result, e.g. factoring a polynomial, isolating the roots of a real polynomial, Wilf–Zeilberger witnesses, and then verifying that this is really a valid result in a theorem prover. However, this does not involve certifying the computation process of the CAS, just the result.
for demonstration purposes, I prepared a small "fake exercise" both to illustrate what it means to verify a calculation and to illustrate the most graphical approaches available in Coq (this shows some of the things you can do in Nov. 2021).
It can be seen on github at github.com:ybertot/osxp_demos_coq, especially the file sin_properties.v.
The demonstration follows this path:
Show that we can state and prove automatically a statement giving a "safe approximation" of PI (that's the name of the mathematical constant in the Coq library).
Show that Coq can be used to plot a known mathematical function, in this case the sin function between 0 and PI. This relies on a connexion to gnuplot for the graphical display. I am afraid that gnuplot will not be included in the Coq platform mentioned by M. Soegtrop in another answer.
Show that we can also plot the function sin(1/x) using Coq
(the plot is actually preserved as a pdf file on the github repository)
Show that a generic function plotter actually returns a misleading result in that case (the generic function plotter is gnuplot).
The misleading plot is also given in the github repository as a pdf file.
The next step is to show that we can prove guaranties of intervals for some computations, sometimes automatically using the interval tactic, and sometimes the interval tactic fails to conclude. The important point here is that the command fails to conclude, instead of giving an answer that cannot be trusted. When this happens, users can rely on knowledge and mathematical reasoning to obtain the desired result. The demo shows how to prove that for any x in a certain range, the sin function is guaranteed to have a positive value.
The next step is about proving that sin x < x for every positive x, it shows that mathematical reasoning can rely on various techniques of mathematics:
decomposing the interval in two parts,
using the mean value theorem,
computing the derivative of x - sin x (and this can be done automatically in Coq),
relying on the fact that cos is known to be strictly decreasing between 0 and PI.
This is just a short demo, which is also meant to explain how a theorem prover has to be used differently from a pocket calculator, because just returning an approximation without qualification for the value of a mathematical formula is a process that cannot really be trusted.
The original question also includes questions about computing integrals. The interval package also contains facilities for this.

Why is l2 regularization always an addition?

I am reading through info about the l2 regularization of neural network weights. So far I understood, the intention is that weights get pushed towards zero the larger they become i.e. large weights receive a high penalty while lower ones are less severely punished.
The formulae is usually:
new_weight = weight * update + lambda * sum(squared(weights))
My question: Why is this always positive? If the weight is already positive the l2 will never decrease it but makes things worse and pushes the weight away from zero. This is the case in almost all formulae I saw so far, why is that?
The formula you presented is very vague about what an 'update' is.
First, what is regularization? Generally speaking, the formula for L2 regularization is:
(n is traing set size, lambda scales the influence of the L2 term)
You add an extra term to your original cost function , which will be also partially derived for the update of the weights. Intuitively, this punishes big weights, so the algorithm tries to find the best tradeoff between small weights and the chosen cost function. Small weights are associated with finding a simpler model, as the behavior of the network does not change much when given some random outlying values. This means it filters out the noise of the data and comes down to learn the simplest possible solution. In other words, it reduces overfitting.
Going towards your question, let's derive the update rule. For any weight in the graph, we get
Thus, the update formula for the weights can be written as (eta is the learning rate)
Considering only the first term, the weight seems to be driven towards zero regardless of what's happening. But the second term can add to the weight, if the partial derivative is negative. All in all, weights can be positive or negative, as you cannot derive a constraint from this expression. The same applies to the derivatives. Think of fitting a line with a negative slope: the weight has to be negative. To answer your question, neither the derivative of regularized cost nor the weights have to be positive all the time.
If you need more clarification, leave a comment.

Matlab cannot compute an infinite integral?

I am studying Stochastic calculus, and occasionally we need to compute an integral (from -infinity to +infinity) for some complex distribution. In this case, it was
with the answer on the right. This is the code I put into Matlab (and I have the symbolic math toolbox), which Matlab simply cannot process:
>> syms x t
>> f = exp(1+2*x)*(1/((2*pi*t)^0.5))*exp(-(x^2)/(2*t))
>> int(f,-inf,inf)
ans =
-((2^(1/2)*pi^(1/2)*exp(2*t + 1)*limit(erf((2^(1/2)*((x*1i)/t - 2i))/(2*(-1/t)^(1/2))), x, -Inf)*1i)/(2*(-1/t)^(1/2)) - (2^(1/2)*pi^(1/2)*exp(2*t + 1)*limit(erf((2^(1/2)*((x*1i)/t - 2i))/(2*(-1/t)^(1/2))), x, Inf)*1i)/(2*(-1/t)^(1/2)))/(2*pi*t)^(1/2)
This answer at the end looks like nonsense, while Wolfram (via their free tool), gives me the answer that the picture above has. Am I missing something fundamental about doing such integrations in Matlab that the basic Mathworks pages don't cover? Where am I proceeding incorrectly?
In order to explain what is happening, we need some theory:
Symbolic systems such as Matlab or Mathematica calculate integrals symbolically by the Risch algorithm (yes, there is a method to mechanically calculate integrals, just like derivatives).
However, the Risch algorithms works differently than using derivation rules. Strictly spoken, it is not an algorithm but a semi-algorithm. This is, it is not deterministic one (as algorithms are).
This (semi) algorithm makes a series of transformations on the input expression (the one to be integrated), and in a specific point, it requires to ask if the transformed expression is equals to zero, because if it were zero, it cannot continue (the input is not integrable using a finite set of terms).
The problem (and the reason of the "semi-algoritmicity") is that, the (apparently simple) equation:
E = 0
Is indecidable (it is also called the constant problem). It means that there cannot exist a formal method to solve the constant problem, for any expression E. Of course, we know to solve the constant problem for specific forms of the expression E (i.e. polynomials), but it is impossible to solve the problem for the general case.
It also means that the Risch algorithm cannot be perfect (being able to solve any integral -integrable in finite terms-). In other words, the Risch algorithm will be as powerful as our ability to solve the constant problem for as many forms of the expression E as we can, but losing any hope of solving for the general case.
Different symbolic systems have similar, but different methods to try to solve equations (and therefore the constant problem), it explains why some of them can "solve" different sets of integrals than others.
And generalizing, because no symbolic system will never be able to solve the constant problem for the general case, it will neither be able to solve any integral (integrable in finite terms).
The second parameter of int() needs to be the variable you're integrating over (which looks like t in this case):
syms x t
f = exp(1+2*x)*(1/((2*pi*t)^0.5))*exp(-(x^2)/(2*t))
int(f,'t',-inf,inf) % <- integrate over t

Is an algorithm with asymptotic runtime complexity of θ(n) always faster runtime than a similar algorithm with runtime complexity of θ(n^2 )?

If so can you provide explicit examples? I understand that an algorithm like Quicksort can have O(n log n) expected running time, but O(n^2) in the worse case. I presume that if the same principle of expected/worst case applies to theta, then the above question could be false. Understanding how theta works will help me to understand the relationship between theta and big-O.
When $n$ is large enough, the algorithm with complexity $\theta(n)$ will run faster than the algorithm with complexity $\theta(n^2)$. In fact $\theta(n) / \theta(n^2)\to 0$ as $\theta \to \infty$. However there might be values of $n$ where $\theta(n) > \theta(n^2)$.
It's not always faster, only asymptotically faster (when n grows infinitely). But after some n — yes, it is always faster.
For example, for little n a bubble sort may operate faster than quick sort just because it's simpler (its θ has lower constants).
This has nothing to do with expected/worst cases: selecting a case is another problem that is not related to theta or big-O.
And about the relationship between theta and big-O: in computer science, big-O is often (mis)used in sense of θ, but in its strict meaning big-O is a more wide class than θ: it limits only the upper bound of a growing function while theta limits both bounds. E.g. when somebody says that Quicksort has a complexity of O(n log n), he actually means θ(n log n).
You are on the right track of thought.
Actual runtime of program can be quite different from asymptotic bounds.This is a fundamental concept that arises from the way asymptotic notation is defined.
You can read my answer here to clarify.

Why doesn't k-means give the global minima?

I read that the k-means algorithm only converges to a local minima and not to a global minima. Why is this? I can logically think of how initialization could affect the final clustering and there is a possibility of sub-optimum clustering, but I did not find anything that will mathematically prove that.
Also, why is k-means an iterative process?
Can't we just partially differentiate the objective function w.r.t. to the centroids, equate it to zero to find the centroids that minimizes this function? Why do we have to use gradient descent to reach the minimum step by step?
Consider:
. c .
. c .
Where c is a cluster centroid. The algorithm will stop, but a better solution is:
. .
c c
. .
With regards to a proof - You don't require a mathematical proof to prove that something isn't always true, you just need a single counter-example, as provided above. You can probably convert the above into a mathematical proof, but this is unnecessary and generally requires a lot of work; even in academia it is accepted to merely give a counter-example to disprove something.
The k-means algorithm is by definition an iterative process, it's simply the way it works. The problem of clustering is NP-hard, thus using an exact algorithm to calculate the centroids would take immensely long.
Don't mix the problem and the algorithm.
The k-means problem is finding the least-squares assignment to centroids.
There are multiple algorithms for finding a solution.
There is an obvious approach to find the global optimum: enumerating all k^n possible assignments - that will yield a global minimum, but in exponential runtime.
Much more attention was put to finding an approximate solution in faster time.
The Lloyd/Forgy algorithm is an EM-style iterative model refinement approach, that is guaranteed to converge to a local minimum simply because there is a finite number of states, and the objective function must decrease in every step. This algorithm runs in O(n*k*i) where i << n usually, but it may find a local minimum only.
The MacQueens method is technically not iterative. It's a single-pass, one-element-at-a-time algorithm that will not even find a local minimum in the Lloyd sense. (You can however run it multiple passes over the data set, until convergence, to get a local minimum too!) If you do a single pass, its in O(n*k), for multiple passes add i. It may or may not take more passes than Lloyd.
Then there is Hartigan and Wong. I don't remember the details, IIRC it was a clever, more lazy, variant of Lloyd/Forgy, so probably in O(n*k*i), too (although probably not recomputing all n*k distances for later iterations?)
You could also do a randomized alogrithm that just tests l random assignments. It probably won't find a minimum at all, but run in "linear" time O(n*l).
Oh, and you can try different random initializations, to improve your chances of finding the global minimum. Add a factor t for the number of trials...