kdb - issue with reading floating numbers

I am reading covariance data from flat files. Because the floating-point numbers are not read in full, the resulting covariance matrix does not satisfy the positive semi-definite requirement.
For instance, this is one of the inputs from the raw text:
"-0.581050672" - no, actually the raw text is this: -5.801050672E-01
When I read this into kdb and cast with F, it results in -0.50810507. When I do this for all values and check the covariance, unfortunately it does not satisfy the PSD constraint. The other hack I have been using is to add a small amount of noise via the identity matrix...
Apart from that hack, is there a way to read the above data into a proper floating-point number up to the 9th digit? I tried \P and .Q.f, but these only seem to affect the display.
Thank you

Sorry, this does not seem to be a kdb issue. I was exporting the data into different software and the floating-point digits were lost during that process. Thanks for the pointer.
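For reference, a quick round-trip check (sketched in Python purely as an illustration) shows that an IEEE-754 double comfortably carries 9-10 significant digits, which supports the conclusion that the parse itself was never the problem:

s = "-5.801050672E-01"
x = float(s)                 # parse the raw text into an IEEE-754 double
print(repr(x))               # -0.5801050672: all ten significant digits survive
print("%.12g" % x)           # -0.5801050672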

Related

how to solve floating point errors in matlab

I have a question that, somewhat surprisingly, I cannot find an answer to. The issue is floating-point errors. It is not a "why does a==b give 0" question, but rather whether there is a way to fix floating-point errors. The part of the code I am working on is sensitive, and I need to find a way to remove these errors. I have tried
round(100000*myDouble)/100000;
and with
double(int64(100000*myDouble))/100000
but the output still has floating-point errors for some numbers (not all of them, but a few are enough to mess up my code). The problem is that the function I use in MATLAB is a polygon clipper, which I use to compute the union of many polygons. The function looks for common points, and even a small difference will mess everything up. This should not really be a problem, since it is a union and partly overlapping polygons should not cause trouble. However, due to some issues with the function, I need to make sure these overlaps do not occur. The function works fine in most cases, but to speed things up I have added a vector of NaN-separated polygons with holes, and since the function is not expected to handle cases like this, there are sometimes problems.
There is no point in using int64 for the calculation instead, since the function ends up in a MEX file and thus cannot be made to work with int64.
There is no single answer that solves all floating-point precision issues. Some strategies are:
Use vpa (or symbolic variables).
Make the errors deterministic. If the same formula is used to calculate the same number twice, the result should be equal.
Don't round using decimal factors. Use 2^x instead. In this case, try 2^16 instead of 100000. That way you round the fraction and keep the exponent (see the sketch below).
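To illustrate that last point (a Python sketch only because it is compact; the MATLAB equivalent is round(x*2^16)/2^16), snapping values to a binary grid means two numbers that differ only by rounding error collapse onto exactly the same representable double:

a = 0.1 + 0.2          # 0.30000000000000004
b = 0.3                # the nearest double to 0.3, not the same as a

def snap(x, bits=16):
    # Round x to the nearest multiple of 2**-bits. Because the scale factor
    # is a power of two, multiplying and dividing only shifts the binary
    # exponent, so the snapped value is exactly representable and inputs that
    # differ by less than 2**-(bits+1) map to the same double.
    scale = 2.0 ** bits
    return round(x * scale) / scale

print(a == b)                # False: the raw doubles differ in the last bits
print(snap(a) == snap(b))    # True: both snap to the same grid point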

Numerical Integral of large numbers in Fortran 90

So I have the following integral that I need to evaluate numerically:
Int[ Exp( 0.5*(a*Cos(x) + b*Sin(x) + c*Cos(2x) + d*Sin(2x)) ) ], x = 0..2*Pi
The problem is that the integrand at any given value of x can be extremely large, say e^2000, so larger than I can deal with in double precision.
I haven't had much luck googling for this: how do you deal with large numbers in Fortran? Not high precision; I don't care about knowing the value beyond double precision, and at the end I'll just be taking the log, but I need to be able to handle the large numbers until I can take the log.
Are there integration packages that can handle arbitrarily large numbers? Mathematica clearly can, so there must be something like this out there.
Cheers
This is probably an extended comment rather than an answer but here goes anyway ...
As you've already observed, Fortran isn't equipped, out of the box, to handle numbers as large as e^2000. I think you have three options.
Use mathematics to reduce your problem to one which does (or a number of related ones which do) fall within the numerical range that your Fortran compiler can compute.
Use Mathematica or one of the other computer algebra systems (e.g. Maple, SAGE, Maxima). All (I think) of these can be integrated with a Fortran program, with varying degrees of difficulty.
Use a library for high-precision arithmetic (often also called arbitrary-precision or multiple-precision). Your favourite search engine will turn up a number of these for you, some written in Fortran (and therefore easy to integrate), some written in C/C++ or other languages (and therefore slightly harder to integrate). You might start your search at Lawrence Berkeley or with the GNU bignum library.
(Yes, I know I wrote that you have three options, but your question suggests you aren't ready to consider this one yet.) You could write your own high-/arbitrary-/multiple-precision functions. Fortran provides everything you need to construct such a library, there is a lot of existing work in the field to learn from, and it might be something of interest to you.
In practice it generally makes sense to apply as much mathematics as possible to a problem before resorting to a computer; that process can not only help solve the problem but also guide your selection or construction of a program to solve what's left of it.
I agree with High Performance Mark that the best option here numerically is to use analytics to scale or simplify the result first.
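Concretely, the scaling trick looks like this (sketched in Python purely for illustration; the same few lines translate directly to Fortran). Write log(I) = M + log( Int[ Exp(f(x) - M) dx ] ), where M is the maximum of the exponent f(x) over the interval; everything actually computed then stays within double precision. The coefficients below are placeholders.

import math

a, b, c, d = 40.0, -25.0, 1000.0, 3000.0      # placeholder coefficients only

def f(x):
    return 0.5 * (a*math.cos(x) + b*math.sin(x) + c*math.cos(2*x) + d*math.sin(2*x))

n = 20000                                      # simple midpoint rule on [0, 2*pi]
h = 2.0 * math.pi / n
xs = [(i + 0.5) * h for i in range(n)]

M = max(f(x) for x in xs)                      # largest exponent on the grid
scaled_sum = sum(math.exp(f(x) - M) for x in xs)   # every term is <= 1, no overflow

log_integral = M + math.log(h * scaled_sum)    # log of the original integral
print(log_integral)                            # finite, well within double range
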
I will mention that if you do want to brute force it, gfortran (as of 4.6, with the libquadmath library) has support for quadruple precision reals, which you can use by selecting the appropriate kind. As long as your answers (and the intermediate results!) don't get too much bigger than what you're describing, that may work, but it will generally be much slower than double precision.
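For completeness, here is what the brute-force route looks like with a multiple-precision library (sketched with Python's mpmath purely as an illustration; one of the Fortran multiple-precision packages mentioned above would play the same role). Such floats have an essentially unbounded exponent, so e^2000 is representable directly. The coefficients are again placeholders.

from mpmath import mp, mpf, exp, log, cos, sin, pi

mp.dps = 30                                    # work with 30 significant digits

a, b, c, d = map(mpf, (40, -25, 1000, 3000))   # placeholder coefficients only

def integrand(x):
    return exp(mpf('0.5') * (a*cos(x) + b*sin(x) + c*cos(2*x) + d*sin(2*x)))

n = 20000                                      # same midpoint rule on [0, 2*pi]
h = 2 * pi / n
total = h * sum(integrand((i + mpf('0.5')) * h) for i in range(n))

print(total)                                   # enormous, but representable
print(log(total))                              # the log you actually want
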
This requires looking deeper at the problem you are trying to solve and the behavior of the underlying mathematics. To add to the good advice already provided by Mark and Jonathan, consider expanding the exponential and trig functions into Taylor series and truncating to the desired level of precision.
Also, take a step back and ask what you are trying to accomplish by calculating this value. As an example, I recently had to debug why I was getting outlandish results from a property correlation which was calculating the vapor pressure of a fluid to see if condensation was occurring. I spent a long time trying to understand what was wrong with the temperature being fed into the correlation until I realized the case causing the error was a simulation of vapor detonation. The problem was not in the numerics but in the logic of checking for condensation during a literal explosion; physically, a condensation check made no sense. The real problem was that the code was asking an unnecessary question; it already had the answer.
I highly recommend Forman Acton's Numerical Methods That (Usually) Work and Real Computing Made Real. Both focus on problems like this and suggest techniques to tame ill-mannered computations.

Near Duplicate Detection in Data Streams

I am currently working on a streaming API that generates a lot of textual content. As expected, the API emits a lot of duplicates, and we also have a business requirement to filter near-duplicate data.
I did a bit of research on duplicate detection in data streams and read about Stable Bloom Filters. Stable Bloom filters are data structures for duplicate detection in data streams with an upper bound on the false positive rate.
But I want to identify near duplicates, so I also looked at hashing algorithms like LSH and MinHash that are used in nearest-neighbour problems and near-duplicate detection.
I am kind of stuck and looking for pointers on how to proceed, and for papers/implementations that I could look at.
First, normalize the text: convert everything to lowercase (or uppercase), replace all non-letters with a space, compress runs of whitespace to a single space, and remove leading and trailing whitespace; for speed I would perform all these operations in one pass over the text. Next, take the MD5 hash (or something faster) of the resulting string. Look up the MD5 hash (as two 64-bit integers) in a database table: if it exists, the text is an exact duplicate; if not, add it to the table and proceed to the next step. You will want to age off old hashes based either on time or memory usage.
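For illustration, the exact-duplicate pass might look roughly like this (a Python sketch, with an in-memory set standing in for the database table and its age-off policy):

import hashlib
import re

seen = set()   # stands in for the database table of previously seen hashes

def normalize(text):
    # Lowercase, replace non-letters with spaces, collapse whitespace, strip.
    text = text.lower()
    text = re.sub(r"[^a-z]+", " ", text)
    return text.strip()

def is_exact_duplicate(text):
    digest = hashlib.md5(normalize(text).encode("utf-8")).digest()
    if digest in seen:
        return True
    seen.add(digest)
    return False

print(is_exact_duplicate("Hello, World!"))   # False: first time seen
print(is_exact_duplicate("hello   world"))   # True: same normalized form
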
To find near duplicates, the normalized string needs to be converted into potential signatures (hashes of substrings); see the SpotSigs paper and the blog post by Greg Linden. Suppose the routine Sigs() does that for a given string, that is, given the normalized string x, Sigs(x) returns a small (1-5) set of 64-bit integers. You could use something like the SpotSigs algorithm to select the substrings in the text for the signatures, but making your own selection method could perform better if you know something about your data. You may also want to look at the simhash algorithm (the code is here).
Given Sigs(), the problem of efficiently finding the near duplicates is commonly called the set similarity join problem. The SpotSigs paper outlines some heuristics to trim the number of sets a new set needs to be compared against, as does the simhash method.
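As a rough sketch of how the pieces fit together (Python again; sigs_of() below is just a stand-in built from hashes of word 3-grams, not the real SpotSigs or simhash selection), each signature acts as a bucket key, so a new document is only compared against documents that share at least one signature with it:

import hashlib
from collections import defaultdict

def sigs_of(normalized_text, k=3, max_sigs=5):
    # Stand-in for Sigs(): hash the word k-grams and keep a small,
    # deterministic subset of the hashes as the document's signatures.
    words = normalized_text.split()
    shingles = [" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))]
    hashes = sorted(int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")
                    for s in shingles)
    return set(hashes[:max_sigs])

buckets = defaultdict(set)   # signature hash -> ids of docs carrying it
docs = {}                    # doc id -> that doc's signature set

def near_duplicates(doc_id, text, threshold=0.7):
    # Return ids of previously seen docs whose signature overlap (Jaccard)
    # with this one reaches the threshold, then index this doc.
    sigs = sigs_of(text)
    candidates = set().union(*(buckets[s] for s in sigs)) if sigs else set()
    matches = [other for other in candidates
               if len(sigs & docs[other]) / len(sigs | docs[other]) >= threshold]
    docs[doc_id] = sigs
    for s in sigs:
        buckets[s].add(doc_id)
    return matches
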
http://micvog.com/2013/09/08/storm-first-story-detection/ has some nice implementation notes

Problem with block matching in matlab

I have written MATLAB code for two different block matching algorithms, exhaustive search and three-step search, but I am not sure how I can check whether I am getting correct results. Is there any standard way to check these, or any standard code which I can run and compare my results with? I read somewhere that the JM software can be used, but I didn't find a way to use it.
You can always use the results produced by your algorithms to create the next frame of video and then analyze its quality, either by visually inspecting it (which is rather subjective, and we like to deal in numbers) or by calculating the mean squared error between the produced image and the one you're trying to estimate. The mean squared error of the exhaustive search should be lower than (or equal to) the one the three-step search gives you.
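As a minimal sketch of that check (Python/NumPy purely as an illustration; in MATLAB the same thing is mean(mean((A - B).^2)) or, in recent versions, mean((A - B).^2, 'all')), where frame_exhaustive and frame_three_step are hypothetical names for the frames reconstructed from each algorithm's motion vectors and target is the real next frame:

import numpy as np

def mse(a, b):
    # Mean squared error between two equally sized frames.
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))

# Exhaustive search evaluates every candidate block, so it can never do worse:
#   assert mse(frame_exhaustive, target) <= mse(frame_three_step, target)
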
Well, did you try to plot it? I mean, after the block matching you have a new image, right?
A way to know whether your result is plausible is to check the sum of the absolute differences between two frames.
A - pre_frame
B - post_frame
C - compensated frame
If sum(sum(abs(B - C))) is lower than sum(sum(abs(B - A))), your result could be correct: the compensated frame should be closer to the post frame than the raw pre frame is.
Next time, try to specify your algorithm. Also, post your code here so we can help you more.

Problem with very small numbers?

I tried to assign a very small number to a double value, like so:
double verySmall = 0.000000001;
9 fractional digits. For some reason, when I multiply this value by 10, I get something like 0.000000007. I vaguely remember there being problems with writing numbers like this as plain literals in source code. Do I have to wrap the value in some function or directive in order to feed it correctly to the compiler? Or is it fine to type such small numbers directly into the source?
The problem is with floating-point arithmetic, not with writing literals in source code. It is not designed to be exact. The best way around it is to not use the built-in double: use integers only (if possible) with power-of-10 coefficients, sum everything up, and display the final figure after rounding.
Standard floating-point numbers are not stored in a perfect format; they're stored in a format that is fairly compact and fairly easy to do math on. They are imprecise at surprisingly small precision levels. But fast. More here.
If you're dealing with very small numbers, you'll want to see if Objective-C or Cocoa provides something analogous to Java's java.math.BigDecimal class. That class exists precisely for dealing with numbers where precision is more important than speed. If there isn't one, you may need to port it (the source to BigDecimal is available and fairly straightforward).
EDIT: iKenndac points out the NSDecimalNumber class, which is the analogue for java.math.BigDecimal. No port required.
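To see the difference concretely, here is the same idea in Python's decimal module (an analogue of BigDecimal and NSDecimalNumber, used here only as an illustration): a binary double cannot represent 0.000000001 exactly, but a decimal type stores it as written.

from decimal import Decimal

very_small = 0.000000001
print("%.25f" % very_small)        # 0.0000000010000000000000001 (nearest binary double)

exact = Decimal("0.000000001")     # construct from the string, not from the float
print(exact * 10)                  # 1.0E-8, exactly
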
As usual, you need to read stuff like this in order to learn more about how floating-point numbers work on computers. You cannot expect to store any arbitrary fraction with perfect accuracy, just as you can't expect to store any arbitrarily large integer. There are bits at the bottom, and their number is limited.