iText 7 (pdfSweep) throws IllegalStateException when redacting a PDF - itext

When I use pdfAutoSweep.cleanUp(pdf) on a specific PDF, I get this exception:
java.lang.IllegalStateException: Coordinate outside allowed range
at com.itextpdf.kernel.pdf.canvas.parser.clipper.ClipperBase.rangeTest(ClipperBase.java:76)
at com.itextpdf.kernel.pdf.canvas.parser.clipper.ClipperBase.rangeTest(ClipperBase.java:78)
at com.itextpdf.kernel.pdf.canvas.parser.clipper.ClipperBase.addPath(ClipperBase.java:149)
at com.itextpdf.kernel.pdf.canvas.parser.clipper.ClipperBase.addPaths(ClipperBase.java:321)
at com.itextpdf.kernel.pdf.canvas.parser.clipper.ClipperOffset.execute(ClipperOffset.java:404)
at com.itextpdf.pdfcleanup.PdfCleanUpFilter.filterStrokePath(PdfCleanUpFilter.java:454)
at com.itextpdf.pdfcleanup.PdfCleanUpFilter.filterStrokePath(PdfCleanUpFilter.java:223)
at com.itextpdf.pdfcleanup.PdfCleanUpProcessor.writePath(PdfCleanUpProcessor.java:763)
at com.itextpdf.pdfcleanup.PdfCleanUpProcessor.filterContent(PdfCleanUpProcessor.java:481)
at com.itextpdf.pdfcleanup.PdfCleanUpProcessor.invokeOperator(PdfCleanUpProcessor.java:402)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.processContent(PdfCanvasProcessor.java:281)
at com.itextpdf.pdfcleanup.PdfCleanUpProcessor.processContent(PdfCleanUpProcessor.java:377)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.processPageContent(PdfCanvasProcessor.java:302)
at com.itextpdf.pdfcleanup.PdfCleanUpProcessor.processPageContent(PdfCleanUpProcessor.java:186)
at com.itextpdf.pdfcleanup.PdfCleanUpTool.cleanUpPage(PdfCleanUpTool.java:304)
at com.itextpdf.pdfcleanup.PdfCleanUpTool.cleanUp(PdfCleanUpTool.java:275)
at com.itextpdf.pdfcleanup.autosweep.PdfAutoSweep.cleanUp(PdfAutoSweep.java:190)
at com.q1d.insider.redaction.PDFRedactor.removeContent(PDFRedactor.java:98)
at com.q1d.insider.redaction.PDFRedactor.main(PDFRedactor.java:250)
You can download the PDF using this link: https://drive.google.com/open?id=106xgE0CcGjGqEovPauUfHF-eyO0XJIYL

Your exception is caused by a constant in pdfSweep.
Whenever pdfSweep needs to redact something, it may need to modify the underlying content (e.g. line drawing operations, tables, images, etc.).
As you can imagine, a lot of geometry is involved. Internally, iText prefers to work with integer numbers for coordinates. However, PDF documents work with floating point numbers.
Or, to quote the API:
When a document with line arts is being cleaned up, there are lot of
calculations with floating point numbers. All of them are translated
into fixed point numbers by multiplying by this coefficient. Vary it
to adjust the preciseness of the calculations.
There is a specific constant in pdfSweep that handles this conversion. With its default value, an intermediate result can sometimes end up as an infinite float value halfway through the calculations.
The way to solve it is to change the constant.
The constant is floatMultiplier in PdfCleanUpTool.
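The mechanism can be sketched outside iText. The following Python snippet is only an illustration under stated assumptions: `to_fixed`, `fits` and `ASSUMED_LIMIT` are hypothetical names, the limit is an assumed 62-bit bound rather than a value taken from the library, and the multipliers are illustrative rather than the pdfSweep defaults. It shows why a smaller multiplier can bring a large coordinate back into range, at the cost of precision.

```python
# Hypothetical illustration of fixed-point scaling; this is not iText code.
# ASSUMED_LIMIT is an assumed bound chosen only to demonstrate the mechanism.
ASSUMED_LIMIT = 2**62 - 1


def to_fixed(coord: float, multiplier: float) -> int:
    """Scale a floating point coordinate to an integer and range-check it."""
    scaled = round(coord * multiplier)  # raises OverflowError if the product is inf
    if abs(scaled) > ASSUMED_LIMIT:
        raise OverflowError("Coordinate outside allowed range")
    return scaled


def fits(coord: float, multiplier: float) -> bool:
    """Return True when the scaled coordinate stays inside the assumed range."""
    try:
        to_fixed(coord, multiplier)
        return True
    except OverflowError:
        return False


if __name__ == "__main__":
    for coord, multiplier in [(612.0, 1e14), (1e6, 1e14), (1e6, 1e8)]:
        print(coord, multiplier, fits(coord, multiplier))
```

A coordinate that fails with a large multiplier passes once the multiplier is reduced, which is the trade-off the API note describes: fewer fractional digits survive, but the scaled value stays representable.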

Related

kdb - issue with reading floating numbers

I am reading covariance data from flat files. Because the floating point numbers are not read at full precision, the resulting covariance matrix does not satisfy the positive semi-definite (PSD) requirement.
For instance, this is one of the inputs from the raw text:
-5.801050672E-01
When I read this into kdb and cast with F, it results in -0.58010507. When I do this for all values and check the covariance, unfortunately it does not satisfy the PSD constraints. The other hack I have been using is to add small noise to the identity matrix…
Apart from this hack, is there a way to read the above data into a proper floating point number up to the 9th digit? I tried \P and .Q.f, but these only seem to affect display.
Thank you
Sorry, does not seem like a kdb issue. Was exporting these data into different software and floating points were lost during this process. Thanks for pointer.
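The difference between the stored value and the displayed value, which is what the question turned out to hinge on, can be sketched in Python (not q). This is only an illustration: the format widths are arbitrary choices, and nothing here reflects how kdb parses or prints floats.

```python
# Parsing keeps every digit of the input; only the formatting truncates.
raw = "-5.801050672E-01"
value = float(raw)

short = f"{value:.8f}"   # what an 8-decimal display would show
full = f"{value:.10f}"   # all ten decimals of the original input

# Re-parsing the shortened text loses digits; that is where precision
# goes missing when data is exported through a low-precision text format.
reparsed = float(short)

if __name__ == "__main__":
    print(short, full, reparsed == value)
```

If the digits disappear only after an export step, checking the export's number format is usually more productive than changing how the file is read.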

how to solve floating point errors in matlab

I have a question that, surprisingly, I cannot find an answer to. The issue is about floating point errors. It is not a "why does a==b give 0" question, but rather whether there is a way to fix floating point errors. The part of the code I am working on right now is sensitive, and I need to find a way to eliminate floating point errors. I have tried with
round(100000*myDouble)/100000;
and with
double(int64(100000*myDouble))/100000
but the output still has floating point errors for some numbers (not all of them, but a few are enough to mess up my code). The problem is that the function I use in MATLAB is a polygon clipper, which I use to calculate the union of many polygons. The function looks for common points, and if there is even a small difference, this messes everything up. This should not really be a problem, since it is a union and partly overlapping polygons should not cause trouble. However, due to some issues with the function, I need to make sure not to have these overlaps. The function works fine in most cases, but to speed things up I pass in a vector of NaN-separated polygons with holes, and since the function is not expected to handle cases like this, there are sometimes problems.
There is no point in using int64 for the calculation instead, since the function ends up in a mex file and so cannot be made to work with int64.
There is not a single answer to solve all floating point precision issues. Some strategies are:
Usage of vpa (or symbolic variables).
Make the errors deterministic. If the same formula is used to calculate the same number twice, the result should be equal.
Don't round using decimal factors. Use 2^x instead. In this case, try 2^16 instead of 100000. This way you round the fraction and keep the exponent.
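The last strategy can be sketched in Python rather than MATLAB; the idea carries over because both use IEEE 754 doubles. The names `snap_binary` and `snap_decimal` are hypothetical. Scaling by a power of two only changes the exponent, so every snapped value is exactly k/2^16 and sums of snapped values stay on the same grid, whereas values rounded with a decimal factor are generally not exactly representable.

```python
def snap_binary(x: float, bits: int = 16) -> float:
    """Round x to the nearest multiple of 2**-bits (exactly representable)."""
    scale = 2.0 ** bits
    return round(x * scale) / scale


def snap_decimal(x: float, factor: int = 100000) -> float:
    """Round x to the nearest multiple of 1/factor (usually not exact in binary)."""
    return round(x * factor) / factor


if __name__ == "__main__":
    # On the binary grid, arithmetic between snapped values is exact.
    print(snap_binary(0.1) + snap_binary(0.2) == snap_binary(0.3))
    # On the decimal grid, the usual representation error comes back.
    print(snap_decimal(0.1) + snap_decimal(0.2) == snap_decimal(0.3))
```

The binary grid is coarser or finer depending on `bits`; 16 matches the 2^16 suggested above and gives a resolution of roughly 1.5e-5.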

Near Duplicate Detection in Data Streams

I am currently working on a streaming API that generates a lot of textual content. As expected, the API gives out a lot of duplicates and we also have a business requirement to filter near duplicate data.
I did a bit of research on duplicate detection in data streams and read about Stable Bloom Filters. Stable bloom filters are data structures for duplicate detection in data streams with an upper bound on the false positive rate.
But, I want to identify near duplicates and I also looked at Hashing Algorithms like LSH and MinHash that are used in Nearest Neighbour problems and Near Duplicate Detection.
I am kind of stuck and looking for pointers as to how to proceed and papers/implementations that I could look at?
First, normalize the text to all lowercase (or uppercase) characters, replace all non-letters with a white space, compress all multiple white spaces to one, remove leading and trailing white space; for speed I would perform all these operations in one pass of the text. Next take the MD5 hash (or something faster) of the resulting string. Do a database lookup of the MD5 hash (as two 64 bit integers) in a table, if it exists, it is an exact duplicate, if not, add it to the table and proceed to the next step. You will want to age off old hashes based either on time or memory usage.
To find near duplicates the normalized string needs to be converted into potential signatures (hashes of substrings), see the SpotSigs paper and blog post by Greg Linden. Suppose the routine Sigs() does that for a given string, that is, given the normalized string x, Sigs(x) returns a small (1-5) set of 64 bit integers. You could use something like the SpotSigs algorithm to select the substrings in the text for the signatures, but making your own selection method could perform better if you know something about your data. You may also want to look at the simhash algorithm (the code is here).
Given the Sigs() the problem of efficiently finding the near duplicates is commonly called the set similarity joins problem. The SpotSigs paper outlines some heuristics to trim the number of sets a new set needs to be compared to as does the simhash method.
http://micvog.com/2013/09/08/storm-first-story-detection/ has some nice implementation notes
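The two-stage approach described above (exact duplicates via a hash of the normalized text, then near duplicates via a small set of signatures) can be sketched as follows. This is a minimal sketch under assumptions: `sigs` is a hypothetical stand-in for Sigs() that keeps the few smallest hashes of word shingles, which is not the SpotSigs selection method; in-memory containers replace the database table; normalization treats only ASCII letters as letters; and the aging-off of old hashes is omitted.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase, turn non-letters into spaces, collapse and trim whitespace."""
    return " ".join(re.sub(r"[^a-z]+", " ", text.lower()).split())


def exact_key(text: str) -> str:
    """MD5 of the normalized text, used to detect exact duplicates."""
    return hashlib.md5(normalize(text).encode("utf-8")).hexdigest()


def sigs(text: str, shingle_words: int = 3, count: int = 4) -> set:
    """Hypothetical Sigs(): the `count` smallest 64-bit hashes of word shingles."""
    words = normalize(text).split()
    last = max(1, len(words) - shingle_words + 1)
    shingles = {" ".join(words[i:i + shingle_words]) for i in range(last)}
    hashes = sorted(
        int.from_bytes(hashlib.md5(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles
    )
    return set(hashes[:count])


class DuplicateFilter:
    """Classifies each incoming text as 'exact', 'near' or 'new'."""

    def __init__(self):
        self.seen_exact = set()
        self.seen_sigs = set()

    def classify(self, text: str) -> str:
        key = exact_key(text)
        if key in self.seen_exact:
            return "exact"
        self.seen_exact.add(key)
        signatures = sigs(text)
        is_near = bool(signatures & self.seen_sigs)
        self.seen_sigs |= signatures
        return "near" if is_near else "new"


if __name__ == "__main__":
    f = DuplicateFilter()
    for line in ["Hello, World!", "hello world",
                 "breaking news storm hits the coast tonight",
                 "breaking news storm hits the coast tonight again"]:
        print(f.classify(line), "-", line)
```

Treating any shared signature as a near duplicate is the crudest possible join; a real system would compare the candidate sets with a similarity threshold, as the set similarity join literature describes.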

Mixing sound files of different size

I want to mix audio files of different sizes into a single .wav file without cutting any file short, i.e. the resulting file should be as long as the largest of the input files.
There is a sample that shows how to mix files of the same size:
Example 4 (http://www.modejong.com/iOS/#ex4).
I modified the code to get the mixed file as a .wav file.
But I cannot work out how to modify this code for files of unequal size.
If someone can help me out with a code snippet, I'll be really thankful.
It should be as easy as sending all the files to the mixer simultaneously. When any single file gets to the end, just treat it as if the remainder is filled with zeroes. When all files get to the end, you are done.
Note that the example code says it returns an error if there would be clipping (the sum of the waves is greater than the maximum representable value). This condition is more likely if you are mixing multiple inputs. The best way around it is to create some "headroom" in the input waves. You can do this in preprocessing, by ensuring that each wave's volume is no more than X% of maximum (~80-90%, depending on the number of inputs). The other way is to do it dynamically in the mixer code, by multiplying each sample by some value <1.0 as you add it to the mix.
If you are selecting the waves to mix at runtime and failure due to clipping is unacceptable, you will need to modify the sample code to pin the values at max/min instead of returning an error. Don't just let them overflow or you will get noisy artifacts.
Clipping creates artifacts as well, but when you haven't created enough headroom before mixing, it is definitely preferable to overflow. It is a more familiar-sounding type of distortion, similar to what you get when you overdrive your speakers. See the Wikipedia article on clipping:
Clipping is preferable to the alternative in digital systems—wrapping—which occurs if the digital hardware is allowed to "overflow", ignoring the most significant bits of the magnitude, and sometimes even the sign of the sample value, resulting in gross distortion of the signal.
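The advice above (treat a finished file as zeroes, optionally scale each sample for headroom, and pin the sum at max/min instead of letting it wrap) can be sketched in Python. This is a minimal sketch assuming 16-bit signed samples already decoded into lists of integers; `mix_clamped` and its `gain` parameter are hypothetical names, and reading or writing the .wav container is left out.

```python
from itertools import zip_longest

INT16_MIN, INT16_MAX = -32768, 32767


def mix_clamped(tracks, gain=1.0):
    """Mix tracks of unequal length, clamping the sum to the int16 range.

    Shorter tracks are treated as if padded with zeroes, so the result is
    as long as the longest input. `gain` (< 1.0) creates headroom.
    """
    mixed = []
    for samples in zip_longest(*tracks, fillvalue=0):
        total = round(sum(s * gain for s in samples))
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))  # pin, never wrap
    return mixed


if __name__ == "__main__":
    print(mix_clamped([[1000, 2000, 3000], [500]]))
    print(mix_clamped([[30000], [30000]]))
    print(mix_clamped([[30000], [30000]], gain=0.5))
```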
How I'd do it:
Much like the mix_buffers function that you linked to, but pass in 2 parameters for mixbufferNumSamples. Iterate over the whole of the longer of the two buffers. When the index has gone beyond the end of the shorter buffer, simply set the sample from that buffer to 0 for the rest of the function.
If you must avoid clipping and do it in real-time and you know nothing else about the two sounds, you must provide enough headroom. The simplest method is by halving each of the samples before mixing:
mixed = s1/2 + s2/2;
This ensures that the resultant mixed sample won't overflow an int16_t. It will have the side effect of making everything quieter though.
If you can run it offline, you can calculate a scale factor to apply to both waveforms which will keep the peaks when summed below the maximum allowed value.
Or you could mix them all at full volume to an int32_t buffer, keeping track of the largest (magnitude) mixed sample and then go back through the buffer multiplying each sample by a scale factor which will make that extreme sample just reach the +32767/-32768 limits.
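The offline variant (mix at full volume into a wider buffer, find the largest magnitude, then rescale) can be sketched in Python. The name `mix_normalized` is hypothetical, samples are assumed to be 16-bit signed integers held in lists, and Python's unbounded integers stand in for the int32_t buffer. As a simplification, this sketch only scales down when the peak exceeds the 16-bit range and leaves quieter mixes untouched.

```python
INT16_MAX = 32767


def mix_normalized(tracks):
    """Sum tracks of unequal length, then rescale so the peak fits in int16."""
    length = max((len(t) for t in tracks), default=0)
    wide = [0] * length              # plays the role of the int32_t buffer
    for track in tracks:
        for i, sample in enumerate(track):
            wide[i] += sample        # shorter tracks simply stop contributing
    peak = max((abs(v) for v in wide), default=0)
    if peak <= INT16_MAX:
        return wide                  # nothing would clip, keep full volume
    scale = INT16_MAX / peak
    return [round(v * scale) for v in wide]


if __name__ == "__main__":
    print(mix_normalized([[1, 2], [3]]))
    print(mix_normalized([[30000, 100], [30000]]))
```

Unlike halving every sample, this keeps the mix as loud as the format allows, but it needs the whole signal up front, so it does not suit real-time mixing.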

Arbitrary precision Float numbers on JavaScript

I have some inputs on my site representing floating point numbers with up to ten digits of precision (in decimal). At some point, in the client-side validation code, I need to compare a couple of those values to see whether they are equal, and here, as you would expect, the intricacies of IEEE 754 make that simple check fail with things like (2.0000000000==2.0000000001) = true.
I may break the floating point number in two longs for each side of the dot, make each side a 64 bit long and do my comparisons manually, but it looks so ugly!
Any decent Javascript library to handle arbitrary (or at least guaranteed) precision float numbers on Javascript?
Thanks in advance!
PS: A GWT based solution has a ++
There is the GWT-MATH library at http://code.google.com/p/gwt-math/.
However, I warn you: it's a GWT JSNI overlay of an automated Java-to-JavaScript conversion of java.math.BigDecimal (actually the old com.ibm.math.BigDecimal).
It works, but speedy it is not. (Nor lean. It will pad on a good 70k into your project).
At my workplace, we are working on a fixed point simple decimal, but nothing worth releasing yet. :(
Use an arbitrary precision integer library such as silentmatt’s javascript-biginteger, which can store and calculate with integers of any arbitrary size.
Since you want ten decimal places, you’ll need to store the value n as n×10^10. For example, store 1 as 10000000000 (ten zeroes), 1.5 as 15000000000 (nine zeroes), etc. To display the value to the user, simply place a decimal point in front of the tenth-last character (and then cut off any trailing zeroes if you want).
Alternatively you could store a numerator and a denominator as bigintegers, which would then allow you arbitrarily precise fractional values (but beware – fractional values tend to get very big very quickly).
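The scaled-integer idea can be sketched in Python, whose built-in integers are already arbitrary precision and so stand in for a big-integer library. The names `to_scaled` and `to_text` are hypothetical, and the parser assumes plain decimal strings such as "1.5" or "-0.25" with at most ten fractional digits.

```python
SCALE_DIGITS = 10  # ten decimal places, i.e. the value n is stored as n * 10**10


def to_scaled(text: str) -> int:
    """Parse a decimal string into an integer scaled by 10**SCALE_DIGITS."""
    text = text.strip()
    sign = -1 if text.startswith("-") else 1
    whole, _, frac = text.lstrip("+-").partition(".")
    if len(frac) > SCALE_DIGITS:
        raise ValueError("too many fractional digits")
    return sign * int((whole or "0") + frac.ljust(SCALE_DIGITS, "0"))


def to_text(scaled: int) -> str:
    """Format a scaled integer back to a decimal string without trailing zeroes."""
    sign = "-" if scaled < 0 else ""
    digits = str(abs(scaled)).rjust(SCALE_DIGITS + 1, "0")
    whole = digits[:-SCALE_DIGITS]
    frac = digits[-SCALE_DIGITS:].rstrip("0")
    return sign + whole + ("." + frac if frac else "")


if __name__ == "__main__":
    print(to_scaled("1.5"))
    print(to_scaled("2.0000000000") == to_scaled("2.0000000001"))
    print(to_text(to_scaled("-0.25")))
```

Parsing from the input string, rather than from an already converted float, is what keeps the comparison exact.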