Proving Lossless or Lossy Decomposition

I was trying to solve a question regarding proving whether the given decomposition is lossy or lossless:
Verify which among the following decompositions of R(ABCDE) are
lossless or lossy by applying the test for lossless join property. F =
{AB→C, C→E, B→D, E→A} is the set of functional dependencies on R.
R(ABCDE) is decomposed as R1(AB), R2(ADE) and R3(BCD)
R(ABCDE) is decomposed as R1(BCD), R2(ACE) and R3(BD)
My attempt at the solution:
A decomposition can only be lossless if there are common attributes between the given tables and the common attributes are candidate key/super key.
In the first part, we have no candidate key in R1 and none in R2, so can we call it a lossy decomposition?
In the second part, C is the candidate key for R2 and B is the candidate key for R3, but R1 has no key, and R2 and R3 also share no common attributes, so is it also not a lossless decomposition?
Is my explanation valid, or are there more criteria?

The instructions are clear: "by applying the test for lossless join property". Presumably that is the important theorem that says exactly when a binary join/decomposition is lossless, but you don't quote it. Look it up in your textbook. (Beware, the Wikipedia article has problems.)

You do say "can only be lossless if", but that only states a necessary condition, the "if" in one direction; you haven't said what is also sufficient, the "if" in the other direction, which is the direction you actually need, so your reasoning is not sound, unless you are just using poor language and mean "iff". But even read as "iff", your condition is wrong. You need to say that the common columns form a superkey of at least one of the components. And "there are common attributes between" doesn't belong: there don't need to be common attributes for a join to be lossless. No common attributes means the common attribute set is {} and the join is a cross join; but if {} is a CK of a component (that component has at most one row), the join is lossless. You don't need to know that, though; you just have to use the theorem.

Also, you need to be told that F is a cover (then tell us), or CKs can't be found. Another problem is "we have no CK": every schema has a CK, since the set of all columns forms a superkey, and the trivial FDs always hold. Again, you don't need to know that; you just have to use the definitions, theorems & algorithms for finding CKs.

You also need to find all the FDs in the closure of the cover you were given, and then calculate the new CKs for each component using the FDs that have all their attributes in it. Another theorem says that those FDs of the closure are exactly the FDs that hold in the components; there can be FDs that are implied by the given ones and have all their attributes in a component even though none of the given ones does. Again, you don't need to know that theorem; you just need to justify that you have the CKs by justifying that you have the FDs, using definitions, theorems & algorithms.

Finally, the "property" theorem involves binary decompositions, so have you been told to assume (but not told us) that R = R1 JOIN (R2 JOIN R3) in that order? This is all clear from the theorems, algorithms & definitions of FD, cover, closure & CK, but you need to memorize the definitions & theorems exactly and then apply them precisely & tediously. That's all I just did.
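For concreteness, here is a rough sketch of the attribute-closure computation and the binary lossless-join test described above (Python, with ad hoc names; it is only an illustration under the assumption that F is a cover, not code from any particular textbook):

def closure(attrs, fds):
    # Closure of a set of attributes under a list of FDs,
    # where each FD is a (lhs, rhs) pair of attribute sets.
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def binary_lossless(r1, r2, fds):
    # Binary test: the decomposition of r1 union r2 into r1 and r2 is
    # lossless iff the common attributes determine all of r1 or all of r2,
    # i.e. the common columns are a superkey of at least one component.
    common = set(r1) & set(r2)
    c = closure(common, fds)
    return set(r1) <= c or set(r2) <= c

fds = [(set("AB"), set("C")), (set("C"), set("E")),
       (set("B"), set("D")), (set("E"), set("A"))]

# If the intended order is R1 JOIN (R2 JOIN R3), apply the binary test twice:
print(binary_lossless("ADE", "BCD", fds))    # the inner join R2 JOIN R3
print(binary_lossless("AB", "ABCDE", fds))   # R1 joined with that result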
Why do you have doubts in your questions in the 2 parts? Why are you not applying the "property" theorem? Why are you not addressing the fact that it is about binary joins?
If you are ever unsure how to do an exercise that you have seen in your textbook and that you know doesn't require creativity, then you literally don't know what you are doing, so go back to your textbook & find out.
Usually I just close vote questions like this with this comment:
Re "is this right": Show the steps of your work following your reference/textbook, with justification--you may find mistakes that make your question unnecessary & we don't know exactly what algorithm you are following & we want to check your work but not redo it & we need your choices when an algorithm allows them & otherwise we can't tell you where you went right or wrong & we don't want to rewrite your textbook. Please see How to Ask, hits googling 'stackexchange homework' & the voting arrow mouseover texts.
Usually when I see good ol' F or other set or list of FDs in a quoted database normalization assignment without explanation I try to work towards an answerable post by commenting a variant of this:
What does "I have these FDs" mean? "These are all the FDs that hold"?--Not possible. "These are all the non-trivial FDs that hold"?--Not possible. "These are some FDs that hold"?--Question can't be answered. Find out what a cover is & what the exact conditions are to apply a particular definition/rule/algorithm. To determine CKs & NFs we must be given FDs that form a cover. Sometimes a minimal/irreducible cover. And the set of all attributes must given. See this answer.
So when you redo this exercise, if you're not certain you know what to do at some point & post another question, please be sure to follow those comments.


How are proof assistants implemented?

What are the main blocks of a proof assistant?
I am interested only in the internal logic of proof checking; for example, topics about the graphical user interfaces of such assistants do not interest me.
A similar question to mine has been asked for compilers:
https://softwareengineering.stackexchange.com/questions/165543/how-to-write-a-very-basic-compiler
My concern is the same but for proof checking systems.
I'm hardly an expert on the matter (I'm only a user of these systems; I don't worry too much about their internals) and this will probably only be a vague partial answer, but the two main approaches that I know of are:
Dependently-typed systems (e.g. Coq, Lean, Agda) that use the Curry–Howard isomorphism. Statements are just types, and proofs are terms that have that type, so checking the validity of a proof is essentially just a special case of type checking a term. I don't want to say too much about this approach because I don't know too much about it and am afraid I'll get something wrong. Théo Winterhalter linked something in the comments above that may provide more context on this approach.
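As a tiny illustration of that correspondence (in Lean, chosen only because it is one of the systems named above; the proposition is an arbitrary example, not something from the question): the statement is a type, the term after := is the proof, and accepting the proof is just type checking that term.

-- Roughly "from p and p-implies-q, conclude q": the statement is the type,
-- the term on the right is the proof, and the checker type-checks it.
example (p q : Prop) (hp : p) (hpq : p → q) : q := hpq hp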
LCF-style theorem provers (e.g. Isabelle, HOL Light, HOL 4). Here a theorem is (roughly speaking) an opaque value of type thm in the implementation language. Only the comparatively small ‘proof kernel’ can create these thm values, and all other parts of the system interact with this proof kernel. The kernel offers an interface consisting of various small functions that implement small inference steps such as modus ponens (if you have a theorem A ⟹ B and a theorem A, you can get the theorem B) or ∀-introduction (if you have the theorem P x for a fixed variable x, you can get the theorem ∀x. P x), etc. The kernel also offers an interface for defining new constants. In principle, as long as you can trust that these functions faithfully implement the basic inference steps of the underlying logic, you can trust that any thm value you can produce really corresponds to a theorem in your logic. For LCF-style provers, the question of what the actual proof is is a bit more difficult to answer, because they usually don't build proof terms (e.g. Isabelle has them, but they are disabled by default and not widely used). I think one could say that the history of how the kernel primitives are called constitutes the proof, and if one were to record it, it could, in principle, be replayed and checked in another system.
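To make the LCF idea a bit more concrete, here is a toy sketch in Python (purely illustrative: it imitates no real prover's API, and Python can only approximate the abstract thm type that ML enforces). The point is that code outside the kernel can only obtain Thm values by calling the kernel's inference functions.

# Toy LCF-style kernel (illustrative only). Formulas are plain strings here;
# a real kernel would use a proper term datatype and check side conditions.

class Thm:
    # In ML the constructor would be hidden behind an abstract type; here we
    # simply agree that only the kernel functions below build Thm values.
    def __init__(self, concl):
        self.concl = concl
    def __repr__(self):
        return "|- " + self.concl

def refl_implication(a):
    # Trivial rule for the sketch: for any formula A, derive |- A ==> A.
    return Thm(a + " ==> " + a)

def modus_ponens(ab, a):
    # From |- A ==> B and |- A, conclude |- B.
    left, sep, right = ab.concl.partition(" ==> ")
    if sep and left == a.concl:
        return Thm(right)
    raise ValueError("modus ponens does not apply")

# Tactics and automation live outside the kernel: they may search however they
# like, but every Thm they hand back was produced by calls like the ones above.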
In both cases the idea is that you have a kernel (the type checker in the former case and the inference kernel in the latter) that you have to trust, and then you have a large ecosystem of additional procedures around this to provide more convenience layers. Since they have to interact with the kernel in order to actually produce theorems, however, you do not have to trust that code.
All these different systems have various trade-offs about what parts of the system are in the kernel and what parts are not. In general, I think it is fair to say that the dependently-typed systems tend to have considerably larger kernels than the LCF-based ones (e.g. HOL Light has a particularly small and simple kernel).
There are also other systems that I believe do not fit into these two categories (e.g. Mizar, ACL2, PVS, Metamath, NuPRL) but I don't know anything about how these are implemented.
In the case of LCF, HOL and Isabelle, you'll find an extensive answer to your question in the journal article "From LCF to Isabelle/HOL". (It's open access.)
Most dependently typed systems, such as Coq, are also LCF-style theorem provers, as described in the article and in Eberl's answer. One significant difference is that such calculi incorporate full proof objects, so that one of the objectives of the LCF approach — to save space by not storing proofs — is abandoned. However, the objective of soundness is still met.

How to leverage auto's searching and hint databases in custom tactics?

In my Coq development I am learning how to create new tactics tailored to my problem domain, à la Prof. Adam Chlipala. On that page he describes how to create powerful custom tactics by e.g. combining repeat with match.
Now, I already have a powerful one-shot tactic in use, auto. It strings together chains of steps found from hint databases. I have invested some effort in curating those hint databases, so I'd like to continue using it as well.
However this presents a problem. It isn't clear what the "right" way is to incorporate auto's functionality into customized tactics.
For example, since (per its page) auto always either solves the goal or does nothing, putting it inside a loop is no more powerful than calling it once after the loop.
To see why this isn't ideal, consider a hypothetical way to directly call a single "step" of auto, which succeeds if it could make a change (as opposed to only when it solved the goal) and fails otherwise. Such single-steps could be interleaved with custom behavior in a match repeat loop, allowing us to e.g. try contradiction or try congruence at intermediate points within the search tree.
Are there good design patterns for incorporating auto's functionality into custom tactics?
Can auto's behavior be decomposed into "single step" tactics that we can use?
What I would do instead is incorporate other tactics within auto.
You can do so using the command Hint Extern num pat => mytactic : mybase, where num is a priority number (0 being the highest priority), pat is a pattern that filters when the hint should be used, and mytactic and mybase are of course the tactic you want to apply and the base you want to add the hint to. Do not use the default core base; build up your own custom base instead and call it with auto with mybase. If you do not want to include the lemmas from the core base in the search, add the fake base nocore: auto with mybase nocore.
If you start relying on auto very much, I would switch instead to the almost equivalent but better behaved typeclasses eauto with mybase. Contrary to what its name suggests, it is a general-purpose tactic that has nothing to do with type classes (as long as you explicitly provide the hint base on which it should be working). One of the main behavioral differences to know about is that the search depth is unbounded by default, so beware of possible infinite loops, or fix a finite limit with the variant typeclasses eauto num with mybase.

How could I encode "implies" logic in LogicBlox?

I would like to encode "implies" logic in LogicBlox.
I have a predicate:
Number(n),hasNumberName(n:i)->int(i).
isTrue[n] = i -> Number(n), boolean(i).
And I add some data in that predicate:
+Number(1).
Now I want to create number 2 and number 3, and the truth values for these two numbers, following this logic rule:
If isTrue[1] is true, then isTrue[2] is true or isTrue[3] is true. (isTrue[1] implies (isTrue[2] or isTrue[3]))
So I create a predicate:
implies[n1,n2,n3] = e -> Number(n1), Number(n2), Number(n3),boolean(e).
Then I try to create a rule like this:
isTrue[n2] = true;isTrue[n3] = true <- isTrue[n1] = true,implies[n1,n2,n3] = true.
But LogicBlox reports: "error: disjunction is not supported in the head of a rule".
So how can I encode this "implies" logic in LogicBlox?
From your question it looks like you're asking this question with a Prolog background. If so, then it might be helpful to read a Datalog introduction, for example "What you always wanted to know about Datalog (and never dared to ask)".
The logic you want to express is deliberately not allowed in Datalog, because it requires a solving or search strategy. As opposed to Prolog, Datalog is deliberately restricted in the computational complexity of the programs you can express. As a result of these restrictions it meets important requirements for use in a database management system, most importantly supporting very large data sets. The computational complexity restrictions will be clearer after reading a good introduction to Datalog.
People have studied extensions of Datalog that allow more programs to be expressed (without going to full Prolog, which would result in a more procedural semantics). This particular example falls under "Disjunctive Datalog"; the hits on Google look good if you want to read more. LogicBlox does not (at least currently) implement Disjunctive Datalog, because our primary objective is to be a scalable database management system.
LogicBlox does support using a solver for specific programs. A typical example is the knapsack problem. If your problem is expressible as an optimization problem (it almost certainly is, but the formulation usually requires some creativity for things that are not conventional optimization problems), then you could use this feature. The solver functionality is not very well documented in publicly available material yet. Please reach out to us directly if you would like to give this a try.
I assume you are trying to enforce a constraint that 1 -> 2 or 3? If so, trying to derive a value using <- is not going to work: if neither 2 nor 3 is present, which one(s) are you telling the system to create? Instead, just write the constraint using -> syntax. Constraints are implications, after all (the right arrow syntax is no accident!), and that puts the disjunction on the right-hand side, where the language allows it. Then, if you ever try to create 1 while neither 2 nor 3 exists, the system will report a constraint failure because the implication was not found to hold.
Also, you don't usually need boolean-valued functions in logic languages; isTrue(x) can just be the set of x which you consider to be "true" (and any not present are "false").

Definition of a certified program

I see a couple of different research groups, and at least one book, that talk about using Coq for designing certified programs. Is there a consensus on what the definition of a certified program is? From what I can tell, all it really means is that the program was proved total and type correct. Now, the program's type may be something really exotic, such as a list with a proof that it's nonempty, sorted, with all elements >= 5, etc. However, ultimately, is a certified program just one that Coq shows is total and type safe, where all the interesting questions boil down to what was included in the final type?
Edit 1
Based on wjedynak's answer, I had a look at Xavier Leroy's paper "Formal Verification of a Realistic Compiler", which is linked in the answers below. I think it contains some good information, but the more informative treatment in this line of research is the paper Mechanized Semantics for the Clight Subset of the C Language by Sandrine Blazy and Xavier Leroy. Clight is the language to which the "Formal Verification" paper adds optimizations. In it, Blazy and Leroy present the syntax and semantics of the Clight language and then discuss the validation of these semantics in section 5. That section lists the different strategies used for validating the compiler, which in some sense provides an overview of strategies for creating a certified program. These are:
Manual reviews
Proving properties of the semantics
Verified translations
Testing executable semantics
Equivalence with alternate semantics
In any case, there are probably points that could be added and I'd certainly like to hear about more.
Going back to my original question of what the definition of a certified program is, it's still a little unclear to me. Wjedynak sort of provides an answer, but really the work by Leroy involved creating a compiler in Coq and then, in some sense, certifying it. In theory, this makes it possible to prove things about C programs, since we can now go C -> Coq -> proof. In that sense, it seems like there's a workflow where we could:
Write a program in X language
Form a model of the program from step 1 in Coq or some other proof assistant. This could involve creating a model of the programming language in Coq, or it could involve creating a model of the program directly (i.e. rewriting the program itself in Coq).
Prove some property about the model. Maybe it's a proof about the values. Maybe it's the proof of the equivalence of statements (stuff like 3=1+2 or f(x,y)=f(y,x), whatever.)
Then, based on these proofs, call the original program certified.
Alternatively, we could create a specification of a program in a proof assistant tool and then prove properties about the specification, but not the program itself.
In any case, I'm still interested in hearing alternative definitions if anyone has them.
I agree that the notion seems vague, but in my understanding a certified program is a program equipped with, or accompanied by, a proof of correctness. Now, by using rich and expressive type signatures you can make it so there is no need for a separate proof, but this is often only a matter of convenience. The real issue is what we mean by correctness: this is a matter of specification. You can take a look at e.g. Xavier Leroy, Formal verification of a realistic compiler.
First note that the phrase "certified" has a slightly French bias: elsewhere the expression "verified" or "proven" is often used.
In any case it is important to ask what that actually means. X. Leroy and CompCert are a very good starting point: it is a big project about C compiler verification, and Leroy is always keen to explain to his audience why verification matters, especially when talking to people from "certification agencies", who usually mean testing, not proving.
Another big verification project is L4.verified, which uses Isabelle/HOL. This part of the exposition explains a bit of what is actually stated and proven, and what the consequences are. Unfortunately, the actual proof is top secret, so it cannot be checked publicly.
A certified program is a program that is paired with a proof that the program satisfies its specification, i.e., a certificate. The key is that there exists a proof object that can be checked independently of the tool that produced the proof.
A verified program has undergone verification, which in this context may typically mean that its specification has been formalized and proven correct in a system like Coq, but the proof is not necessarily certified by an external tool.
This distinction is well attested in the scientific literature and is not specific to Francophones. Xavier Leroy describes it very clearly in Section 2.2 of A formally verified compiler back-end.
My understanding is that "certified" in this sense is, as Makarius pointed out, an English word chosen by Francophones where native speakers might instead have used "formally verified". Coq was developed in France, and has many Francophone developers and users.
As to what "formal verification" means, Wikipedia notes (license: CC BY-SA 3.0) that it:
is the act of proving ... the correctness of intended algorithms underlying a system with respect to a certain formal specification or property, using formal methods of mathematics.
(I realise you would like a much more precise definition than this. I hope to update this answer in future, if I find one.)
Wikipedia especially notes the difference between verification and validation:
Validation: "Are we trying to make the right thing?", i.e., is the product specified to the user's actual needs?
Verification: "Have we made what we were trying to make?", i.e., does the product conform to the specifications?
The landmark paper seL4: Formal Verification of an OS Kernel (Klein, et al., 2009) corroborates this interpretation:
A cynic might say that an implementation proof only shows that the implementation has precisely the same bugs that the specification contains. This is true: the proof does not guarantee that the specification describes the behaviour the user expects. The difference [in a verified approach compared to a non-verified one] is the degree of abstraction and the absence of whole classes of bugs.
Which classes of bugs are those? The Agda tutorial gives some idea:
no runtime errors (inevitable errors like I/O errors are handled; others are excluded by design).
no non-productive infinite loops.
It may mean free of runtime errors (numeric overflow, invalid references, …), which is already good compared to most developed software, though still weak. The other meaning is proved to be correct according to a domain formalization; that is, it does not only have to be formally free of runtime errors, it also has to be proved to do what it's expected to do (which must have been precisely defined).

Pocketsphinx - Adding words and Improving accuracy

I've managed to finally build and run pocketsphinx (pocketsphinx_continuous). The problem I'm running into is how to improve accuracy. From what I understand, you can specify a dictionary file (-dict test.dic). So I took the default dictionary file and added some more pronunciations of the same words, for example:
pencil P EH N S AH L
pencil(2) P EH N S IH L
spaghetti S P AH G EH T IY
spaghetti(2) S P UH G EH T IY
Yet pocketsphinx still does not recognize either word at all. I know there is a JSGF file you can specify as well, but that seems to be more for phrases and grammar. How can I get pocketsphinx to recognize common words such as pencil and spaghetti?
thanks
-Mike
With something like this, you can't be certain, but I can offer the following suggestions:
Perhaps the language model somehow has low probabilities for "spaghetti" and "pencil". As you suggested, you could use a JSGF grammar to test how recognition does when it doesn't use the N-gram model but instead a simple grammar (give it, say, twenty words, including spaghetti and pencil). That way you can see whether it is the language model that makes these words hard to recognize, and whether it does okay when all words are considered equally probable.
Perhaps you simply pronounce these words poorly, even with the alternative dictionary entries. Try either A. testing other people's voices, or B. adapting the acoustic model to your voice (see http://cmusphinx.sourceforge.net/wiki/tutorialam).
Also, what is it recognizing them as when it fails? If possible, remove the words it misrecognizes them as from the dictionary.
Again, for overall accuracy, only three things are really going to help you: restricting the grammar, adapting the acoustic model, and perhaps getting higher quality recording input.
To improve accuracy you may want to try adapting the acoustic model to your voice.
http://cmusphinx.sourceforge.net/wiki/tutorialadapt
To learn how to add new words: http://ghatage.com/tech/2012/12/13/Make-Pocketsphinx-recognize-new-words/
Make sure you put a tab (not a space) after the word and before the start of the pronunciation.
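If it helps, here is a rough sketch using the pocketsphinx Python bindings (all paths and file names below are placeholders, and this assumes the pocketsphinx-python package is installed) showing how to point the decoder at your own dictionary and language model so that added words like "pencil" and "spaghetti" are actually part of the search:

# Illustrative sketch: decode a 16 kHz, 16-bit mono raw file with a custom
# dictionary and language model (all paths are placeholders).
from pocketsphinx.pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', '/path/to/acoustic-model')  # acoustic model directory
config.set_string('-lm', '/path/to/my.lm')            # language model containing the new words
config.set_string('-dict', '/path/to/my.dic')         # dictionary with the added pronunciations

decoder = Decoder(config)
decoder.start_utt()
with open('/path/to/test.raw', 'rb') as f:
    while True:
        buf = f.read(1024)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
decoder.end_utt()

if decoder.hyp() is not None:
    print(decoder.hyp().hypstr)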
Maybe the problem is with Pocketsphinx. I too was not getting good results with Pocketsphinx, but I was getting very good accuracy with Sphinx4 (for a US speaker with a noise-cancelling microphone), so I did a comparison between the two using the same audio recordings. For pocketsphinx I used pocketsphinx_batch with the WSJ audio model and a small-vocabulary language model and dictionary (created online with the CMU-Cambridge language modelling toolkit). For Sphinx4 I wrote a small Java program using the Sphinx4 library. The result was that Sphinx4 was much more accurate. All the gory details are at http://www.jaivox.com/pocketsphinx.html.
To achieve good accuracy with pocketsphinx:
Important! Check that your mic, audio device, and files support 16 kHz, since the general model is trained on 16 kHz acoustic examples (see the quick check after this list).
You should create your own limited dictionary; you cannot use cmusphinx-voxforge-de.dic, or accuracy will drop dramatically.
You should create your own language model.
You can search for the Jasper project on GitLab to see how it's implemented.
Also, please check the documentation.
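As a quick check of the 16 kHz point above, something like this (the file name is a placeholder; it only covers recorded files, not the microphone itself) prints the sample rate, sample width and channel count of a WAV file:

import wave

with wave.open('test.wav', 'rb') as w:
    # The general acoustic model expects 16 kHz, 16-bit, mono audio.
    print(w.getframerate(), w.getsampwidth() * 8, w.getnchannels())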
This is on the CMUSphinx website
"There are various phonesets to represent phones, such as IPA or SAMPA. CMUSphinx does not yet require you to use any well-known phoneset, moreover, it prefers to use letter-only phone names without special symbols. This requirement simplifies some processing algorithms, for example, you can create files with phone names as part of the filenames without any violating of the OS filename requirements.
A dictionary should contain all the words you are interested in, otherwise the recognizer will not be able to recognize them. However, it is not sufficient to have the words in the dictionary. The recognizer looks for a word in both the dictionary and the language model. Without the language model, a word will not be recognized, even if it is present in the dictionary."
https://cmusphinx.github.io/wiki/tutorialdict/