iText PDFSweep RegexBasedCleanupStrategy does not work in some cases

I'm trying to use the iText PDFSweep RegexBasedCleanupStrategy to redact some words from a PDF. However, I only want to redact a word when it stands alone, not when it appears inside another word.
For example, I want to redact "al" as a single word, but I don't want to redact the "al" in "mineral".
So I added the word boundary ("\b") to the regex passed as the parameter to RegexBasedCleanupStrategy:
new RegexBasedCleanupStrategy("\\bal\\b")
However, pdfAutoSweep.cleanUp does not work if the word is at the end of a line.

In short
The cause of this issue is that the routine that flattens the extracted text chunks into a single String for applying the regular expression does not insert any indicator for a line break. Thus, in that String the last letter from one line is immediately followed by the first letter of the next which hides the word boundary. One can fix the behavior by adding an appropriate character to the String in case of a line break.
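The effect can be reproduced without iText. The following is a minimal sketch, assuming two extracted lines where the first ends in "al" and the next starts with a letter; the class and method names are hypothetical and only mimic the flattening step in simplified form:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the flattening step; not part of iText.
class LineBreakBoundaryDemo {

    // Joins the line texts with the given separator ("" mimics the reported behavior).
    static String flatten(List<String> lines, String separator) {
        return String.join(separator, lines);
    }

    // Returns true if the word occurs in the text delimited by word boundaries.
    static boolean containsWord(String text, String word) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text).find();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("redact the word al", "so it disappears");

        // Without a separator the text reads "...word also it...", so \b after "al" is lost.
        System.out.println(containsWord(flatten(lines, ""), "al"));   // false

        // With a newline between the lines the boundary is visible again.
        System.out.println(containsWord(flatten(lines, "\n"), "al")); // true
    }
}
```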
The problematic code
The routine that flattens the extracted text chunks into a single String is CharacterRenderInfo.mapString(List<CharacterRenderInfo>) in the package com.itextpdf.kernel.pdf.canvas.parser.listener. In case of a merely horizontal gap this routine inserts a space character but in case of a vertical offset, i.e. a line break, it adds nothing extra to the StringBuilder in which the String representation is generated:
if (chunk.sameLine(lastChunk)) {
    // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
    if (chunk.getLocation().isAtWordBoundary(lastChunk.getLocation()) && !chunk.getText().startsWith(" ") && !chunk.getText().endsWith(" ")) {
        sb.append(' ');
    }
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
} else {
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
}
A possible fix
One can extend the code above to insert a newline character in case of a line break:
if (chunk.sameLine(lastChunk)) {
    // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
    if (chunk.getLocation().isAtWordBoundary(lastChunk.getLocation()) && !chunk.getText().startsWith(" ") && !chunk.getText().endsWith(" ")) {
        sb.append(' ');
    }
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
} else {
    sb.append('\n');
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
}
This CharacterRenderInfo.mapString method is only called from the RegexBasedLocationExtractionStrategy method getResultantLocations() (package com.itextpdf.kernel.pdf.canvas.parser.listener), and only for the task mentioned, i.e. applying the regular expression in question. Thus, enabling it to properly allow recognition of word boundaries should not break anything but indeed should be considered a fix.
Alternatively, one might add a different character for a line break, e.g. a plain space ' ', if one does not want to treat vertical gaps any differently from horizontal ones. A general fix might therefore make this character a settable property of the strategy.
Versions
I tested with iText 7.1.4-SNAPSHOT and PDFSweep 2.0.3-SNAPSHOT.

Related

Converting numbers into timestamps (inserting colons at specific places)

I'm using AutoHotkey for this as the code is the most understandable to me. So I have a document with numbers and text, for example like this
120344 text text text
234000 text text
and the desired output is
12:03:44 text text text
23:40:00 text text
I'm sure StrReplace can be used to insert the colons in, but I'm not sure how to specify the position of the colons or ask AHK to 'find' specific strings of 6 digit numbers. Before, I would have highlighted the text I want to apply StrReplace to and then press a hotkey, but I was wondering if there is a more efficient way to do this that doesn't need my interaction. Even just pointing to the relevant functions I would need to look into to do this would be helpful! Thanks so much, I'm still very new to programming.
hfontanez's answer was very helpful in figuring out that for this problem, I had to use a loop and substring function. I'm sure there are much less messy ways to write this code, but this is the final version of what worked for my purposes:
Loop, read, C:\[location of input file]
{
    if (A_LoopReadLine = "")  ; skip the blank lines in the file
        Continue
    one := A_LoopReadLine
    x := SubStr(one, 1, 2)  ; first pair of digits
    y := SubStr(one, 3, 2)  ; second pair of digits
    z := SubStr(one, 5)     ; third pair of digits plus the rest of the line
    two := x . ":" . y . ":" . z
    FileAppend, %two%`r`n, C:\[location of output file]
}
return
Assuming that the "timestamp" component is always 6 characters long and always at the beginning of the string, this solution should work just fine.
String test = "012345 test test test";
test = test.substring(0, 2) + ":" + test.substring(2, 4) + ":" + test.substring(4, test.length());
This outputs 01:23:45 test test test
Why? Because you are temporarily creating a String object that is two characters long, and then you insert the colon before taking the next pair. Lastly, you append the rest of the String and assign the result to whichever String variable you want. Remember, the substring method doesn't modify the String object you call it on; it returns a new String object. Therefore, the variable test is unmodified until the assignment takes effect at the end.
Alternatively, you can use a StringBuilder and append each component like this:
StringBuilder sbuff = new StringBuilder();
sbuff.append(test.substring(0,2));
sbuff.append(":");
sbuff.append(test.substring(2,4));
sbuff.append(":");
sbuff.append(test.substring(4,test.length()));
test = sbuff.toString();
You could also use a "fancy" loop to do this, but I think for something this simple, looping is just overkill. Oh, I almost forgot, this should work with both of your test strings because after the last colon insert, the code takes the substring from index position 4 all the way to the end of the string indiscriminately.
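If the six-digit run should be found automatically rather than assumed, a regular expression can do the detection and the insertion in one step. This is only a sketch under the assumption that the timestamp is exactly six digits at the start of a line; the class and method names are made up for illustration:

```java
import java.util.regex.Pattern;

// Hypothetical helper: turns a leading "HHMMSS" into "HH:MM:SS" on every line.
class TimestampFormatter {

    // (?m) lets ^ match at the start of each line; (?!\d) rejects runs longer than six digits.
    private static final Pattern SIX_DIGITS =
            Pattern.compile("(?m)^(\\d{2})(\\d{2})(\\d{2})(?!\\d)");

    static String insertColons(String text) {
        // $1, $2 and $3 refer to the three captured pairs of digits.
        return SIX_DIGITS.matcher(text).replaceAll("$1:$2:$3");
    }

    public static void main(String[] args) {
        String input = "120344 text text text\n234000 text text";
        System.out.println(insertColons(input));
        // 12:03:44 text text text
        // 23:40:00 text text
    }
}
```

Lines that do not start with exactly six digits, including blank lines, pass through unchanged.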

Carriage return character not being matched in Swift

I'm trying to parse a file that (apparently) ends its lines with carriage returns, but they aren't being matched as such in Swift, despite having the same UTF8 value. I can see possible fixes for the problem, but I'm curious as to what these characters actually are.
Here's some sample code, with the output below. (CR is set using Character("\r"), although I've tried it using "\r" as well.)
try f.forEach() { c in
    print(c, terminator:" ") // DBG
    if (c == "\r") {
        print("Carriage return found!")
    }
    print(String(c).utf8.first!, terminator:" ")//DBG
    print(String(describing:pstate)) // DBG
    ...
    case .field:
        switch c {
        case CR,LF :
            self.endline()
            pstate = .eol
When it reaches the end of line (which shows up as such in my text editors), I get this:
. 46 field
0 48 field
13 field
I 73 field
It doesn't seem to be matching using == or in the switch statement. Is there another approach I should be using for this character?
(I'll note that the parsing works fine with files that terminate in newlines.)
I determined what the problem was. By looking at c.unicodeScalars I discovered that the end-of-line character was in fact "\r\n", not just "\r". Swift treats the "\r\n" pair as a single Character, so it compares equal to neither "\r" nor "\n". As seen in my code, I was only taking the first UTF-8 byte when printing it out, which is why it showed up as 13. I don't know if that's something from String.forEach or in the file itself.
I know that there are tests to determine if something is a newline. Swift 5 has them directly (c.isNewline), and there is also the CharacterSet approach as noted by Bill Nattaner.
I'm happier with something that will work in my switch statement (and thus I'll define each one explicitly), but that might change if I expect to deal with a wider variety of files.
I'm a little hazy as to what the f.forEach represents, but if your variable c is of type Character then you could replace your if statement with:
if "\(c)".rangeOfCharacter(from: CharacterSet.newlines) != nil {
    print("Carriage return found!")
}
That way you won't have to invent a list of all possible newline characters.

How can I obtain only word without All Punctuation Marks when I read text file?

The text file abc.txt is an arbitrary article that has been scraped from the web. For example, it is as follows:
His name is "Donald" and he likes burger. On December 11, he married.
I want to extract only the words (in lower case, except for proper nouns) and numbers, leaving out all punctuation such as periods and quotes. For the above example, the result should be:
{his, name, is, Donald, and, he, likes, burger, on, December, 11, he, married}
My code is as follows:
filename = 'abc.txt';
fileID = fopen(filename,'r');
C = textscan(fileID,'%s','delimiter',{',','.',':',';','"',''''});
fclose(fileID);
Cstr = C{:};
Cstr = Cstr(~cellfun('isempty',Cstr));
Is there any simple code to extract only alphabetic words and numbers, excluding all symbols?
Two steps are necessary, as you want to convert certain words to lowercase.
First, regexprep converts to lower case the initial letter of each word that is either at the start of the string or follows a full stop and whitespace.
In the regexprep function, we use the following pattern:
(?<=^|\. )([A-Z])
to indicate that:
(?<=^|\. ) We want to assert that before the word of interest either the start of string (^), or (|) a full stop (.) followed by whitespace are found. This type of construct is called a lookbehind.
([A-Z]) This part of the expression matches and captures (stores the match of) an upper case letter (A-Z).
The ${lower($0)} component in the replacement is called a dynamic expression, and converts the matched upper case letter to lower case. This syntax is specific to the MATLAB language.
Once the lower case conversions have occurred, regexp finds all occurrences of one or more digits, lower case and upper case letters.
The pattern [a-zA-Z0-9]+ matches lower case letters, upper case letters and digits.
text = fileread('abc.txt')
data = {regexp(regexprep(text,'(?<=^|\. )([A-Z])','${lower($0)}'),'[a-zA-Z0-9]+','match')'}
>>data{1}
13×1 cell array
{'his' }
{'name' }
{'is' }
{'Donald' }
{'and' }
{'he' }
{'likes' }
{'burger' }
{'on' }
{'December'}
{'11' }
{'he' }
{'married' }
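For readers outside MATLAB, the same two-step idea (lower-case the sentence-initial letter, then collect alphanumeric runs) might be sketched in Java as follows. The class and method names are hypothetical, and the patterns are carried over from the answer on the assumption that sentences are separated by a full stop and a single space:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper mirroring the regexprep + regexp combination above.
class WordExtractor {

    // An upper case letter at the start of the text or right after ". ".
    private static final Pattern SENTENCE_START = Pattern.compile("(?<=^|\\. )[A-Z]");
    // One or more letters or digits.
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z0-9]+");

    static List<String> extract(String text) {
        // Step 1: lower-case every sentence-initial letter.
        Matcher start = SENTENCE_START.matcher(text);
        StringBuffer lowered = new StringBuffer();
        while (start.find()) {
            String replacement = start.group().toLowerCase();
            start.appendReplacement(lowered, Matcher.quoteReplacement(replacement));
        }
        start.appendTail(lowered);

        // Step 2: collect all alphanumeric runs, dropping punctuation.
        List<String> tokens = new ArrayList<>();
        Matcher token = TOKEN.matcher(lowered);
        while (token.find()) {
            tokens.add(token.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "His name is \"Donald\" and he likes burger. On December 11, he married.";
        System.out.println(extract(text));
        // [his, name, is, Donald, and, he, likes, burger, on, December, 11, he, married]
    }
}
```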

How can I add a string to the last line in multiline EditText, Matlab?

I often use this way to add a string to the last line in multiline editText.
Example: The before editText: (handles.txtLine)
line 1
line 2
line 3
and I want to add the string "line 4" to it. So I do:
msg = get(handles.txtLine,'string');
msg_i = sprintf('\nline 4');
msg = [msg msg_i];
set(handles.txtLine,'string',msg)
Result:
line 1
line 2
line 3
line 4
Are there other ways to do the same thing?
The String property of a multiline edit control can be set in three ways:
1. a multiline character array, e.g. txt1 = ['line 1'; 'line 2']. Here txt1 has size 2x6.
2. a single line character array containing newline characters, e.g. txt2 = sprintf('line 1\nline 2'). Here txt2 has size 1x13.
3. a cell array of strings, e.g. txt3 = {'line 1', 'line 2'}
You would add or remove text from the string in each case in different ways, and each method has advantages and disadvantages.
1 is usually inconvenient, as all your lines have to have exactly the same length, or be padded with spaces. But if that's the case, then it's easy to add or remove lines.
2 (basically the way you're doing it now) is also usually less convenient, as while it's easy to append lines, it's less easy to remove them from the middle unless you parse the string looking for newlines. But if you only ever need to add lines, it's probably fine.
I would modify the way you're using sprintf and then concatenating:
msg = sprintf('%s\n%s', msg, 'line 4');
is a simpler and more flexible syntax.
Your general method of getting, modifying and setting the String property is fine, although if you wanted you could combine it all into one statement, such as:
set(handles.txtLine, 'String', sprintf('%s\n%s', get(handles.txtLine, 'String'), 'line 4'))
3 would typically be the most convenient, as long as you're comfortable with cell arrays. Each line can be whatever you like, and it's easy to add or remove items.

How do I print a tab character in Pascal?

I'm trying to figure out in all the Internets what's the special character for printing a simple tab in Pascal. I have to format a table in a CLI program and that would be handy.
Single non-printable characters can be constructed using their ASCII code prefixed with #.
Since the ASCII value for tab is 9, a tab is #9. Characters constructed this way must be outside string literals, but don't need + to concatenate.
E.g.
const
  sometext = 'firstfield'#9'secondfield'#13#10;
contains two fields separated by a tab, ended by a carriage return (#13) and a linefeed (#10).
The ' character can be produced either via this route (#39) or, shorter, by doubling it inside the literal (in effect ending the literal and reopening it):
const
  some2 = '''bla'''; // will contain 'bla' with the ticks.
  some3 = 'start''bla''end'; // will contain start'bla'end
write( ^i );
:-)