itextsharp: words are broken when splitting textchunk into words - itext

I want to highlight several keywords in a set of PDF files. First, I have to identify the individual words and match them against my keywords. I found this example:
class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
    // Holds the rectangle and text of each chunk
    public List<RectAndText> myPoints = new List<RectAndText>();
    List<string> topicTerms;
    public MyLocationTextExtractionStrategy(List<string> topicTerms)
    {
        this.topicTerms = topicTerms;
    }
    // Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo)
    {
        // Get the bounding box for the chunk of text
        var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        var topRight = renderInfo.GetAscentLine().GetEndPoint();
        // Create a rectangle from the two corners (lower-left x, y, upper-right x, y)
        var rect = new iTextSharp.text.Rectangle(
            bottomLeft[Vector.I1], bottomLeft[Vector.I2],
            topRight[Vector.I1], topRight[Vector.I2]);
        // Add this to our main collection
        // (filter the meaningless words)
        string text = renderInfo.GetText();
        this.myPoints.Add(new RectAndText(rect, text));
    }
}
However, I found that many words are broken apart. For example, "stop" comes back as "st" and "op". Is there another way to identify a single word and its position?

When you want to collect single words and their coordinates, the better way is to override the existing LocationTextExtractionStrategy. Here is my code:
public virtual String GetResultantText(ITextChunkFilter chunkFilter) {
    List<TextChunk> filteredTextChunks = filterTextChunks(locationalResult, chunkFilter);
    List<RectAndText> tmpList = new List<RectAndText>();
    StringBuilder sb = new StringBuilder();
    TextChunk lastChunk = null;
    foreach (TextChunk chunk in filteredTextChunks) {
        if (lastChunk == null) {
            sb.Append(chunk.Text); // as in the stock implementation
            var startLocation = chunk.StartLocation;
            var endLocation = chunk.EndLocation;
            var rect = new iTextSharp.text.Rectangle(startLocation[0], startLocation[1], endLocation[0], endLocation[1]);
            tmpList.Add(new RectAndText(rect, chunk.Text));
        } else {
            if (chunk.SameLine(lastChunk)) {
                // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
                if (IsChunkAtWordBoundary(chunk, lastChunk) && !StartsWithSpace(chunk.Text) && !EndsWithSpace(lastChunk.Text)) {
                    sb.Append(' ');
                    if (tmpList.Count > 0) {
                        mergeAndStoreChunk(tmpList); // assumed call: flush the collected chunks as one word
                        tmpList.Clear();
                    }
                }
                sb.Append(chunk.Text);
                var startLocation = chunk.StartLocation;
                var endLocation = chunk.EndLocation;
                var rect = new iTextSharp.text.Rectangle(startLocation[0], startLocation[1], endLocation[0], endLocation[1]);
                ////var topRight = renderInfo.GetAscentLine().GetEndPoint();
                tmpList.Add(new RectAndText(rect, chunk.Text));
            } else {
                sb.Append('\n'); // as in the stock implementation
                sb.Append(chunk.Text);
            }
        }
        lastChunk = chunk;
    }
    return sb.ToString();
}
private void mergeAndStoreChunk(List<RectAndText> tmpList) {
    // Start from the first chunk and extend it to the right with each following chunk
    RectAndText mergedChunk = tmpList[0];
    int tmpListCount = tmpList.Count();
    for (int i = 1; i < tmpListCount; i++) {
        RectAndText nowChunk = tmpList[i];
        mergedChunk.Rect.Right = nowChunk.Rect.Right;
        mergedChunk.Text += nowChunk.Text;
    }
    this.myPoints.Add(mergedChunk); // assumed: myPoints collects the merged words
}
myPoints is the list that ends up holding every merged word together with its rectangle.
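
The merging idea can also be sketched independently of iTextSharp. The following is a minimal JavaScript sketch, not the library's API: the chunk shape ({ text, left, right, line }) and the gapThreshold value are assumptions made purely for illustration. Chunks on the same line are glued together until a gap or a space marks a word boundary. Unlike the listing above, the sketch also flushes the pending word on a line change and at the end of the input.

```javascript
// Hypothetical chunk shape: { text, left, right, line }.
// gapThreshold is an assumed tuning value, not something iTextSharp defines.
function mergeChunksIntoWords(chunks, gapThreshold = 1.0) {
  const words = [];
  let current = null;

  // Store the word collected so far (if it is not just whitespace).
  const flush = () => {
    if (current && current.text.trim() !== "") {
      words.push({ ...current, text: current.text.trim() });
    }
    current = null;
  };

  for (const chunk of chunks) {
    const boundary =
      current !== null &&
      (chunk.line !== current.line ||
        chunk.left - current.right > gapThreshold ||
        current.text.endsWith(" ") ||
        chunk.text.startsWith(" "));

    if (boundary) flush();

    if (current === null) {
      current = { text: chunk.text, left: chunk.left, right: chunk.right, line: chunk.line };
    } else {
      // Same word: append the text and extend the box to the right.
      current.text += chunk.text;
      current.right = chunk.right;
    }
  }
  flush();
  return words;
}

const words = mergeChunksIntoWords([
  { text: "st", left: 10, right: 18, line: 0 },
  { text: "op", left: 18, right: 27, line: 0 },
  { text: "here", left: 31, right: 50, line: 0 },
  { text: "now", left: 10, right: 25, line: 1 },
]);
console.log(words.map((w) => w.text)); // [ 'stop', 'here', 'now' ]
```

In the real strategy, IsChunkAtWordBoundary and SameLine play the role of the gap and line checks used here.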


How to get list from docx file?

How can I determine whether a list is bulleted or numbered? I use OpenXML.
As far as I understand, the kind of list is defined in the NumberingDefinitionsPart. I tried to look up the Numbering for a given element, but that approach did not work.
I process the list in the recommended way, but I need to know which kind of list it is.
public void ParagraphHandle(Elements.Paragraph paragraph, StringBuilder text)
var docPart = paragraph.DocumentPart;
var element = paragraph.Element;
var r = element.Descendants<Numbering>().ToArray();
var images = GetImages(docPart, element);
if (images.Count > 0)
foreach (var image in images)
if (image.Id != null)
string filePath = _saveResources.SaveImage(image);
_handler.ImageHandle(filePath, text);
var paragraphProperties = element.GetFirstChild<ParagraphProperties>();
var numberingProperties = paragraphProperties?.GetFirstChild<NumberingProperties>();
if (numberingProperties != null)
var numberingId = numberingProperties.GetFirstChild<NumberingId>()?.Val?.Value;
if (numberingId != null && !paragraph.IsList)
paragraph.IsList = true;
paragraph.List = new List();
_htmlGenerator.GenerateList(paragraph, text);
_htmlGenerator.GenerateList(paragraph, text);
if (paragraph.IsList)
paragraph.IsList = false;
_handler.ParagraphHandle(element, text);

How set image to a Rectangle by OpenXML?

I have a Rectangle in a PPTX template file, with its name set to "Img".
I want to set an image on that Rectangle.
This is my code, but I can't call shape.Append(part):
// Open the source document as read/write.
using (var presentationDocument = PresentationDocument.Open(strFile, true))
var presentationPart = presentationDocument.PresentationPart;
var templatePart = GetSlidePartsInOrder( presentationPart).Last();
for(int i = 0; i < 2; i++)
int ifile = i + 1;
string path = @"F:\AUTOM\t"+ ifile+".png";
var newSlidePart = CloneSlide(templatePart);
// Get the shape tree that contains the shape to change.
P.ShapeTree tree = newSlidePart.Slide.CommonSlideData.ShapeTree;
var shapes = from shape in newSlidePart.Slide.Descendants < P.Shape>()
select shape;
foreach (var shape in shapes)
var part = newSlidePart.AddImagePart(ImageExtension(path));
using (var stream = File.OpenRead(path))
// Specify the text of the title shape.
foreach (Paragraph paragraph in shape.Descendants().OfType<Paragraph>())
foreach (Run run in paragraph.Elements<Run>())
run.Text = new Text("Your new text");
AppendSlide(presentationPart, newSlidePart);
// Save the modified presentation.

Open Xml - File Needs To Be Repaired To Open

I am having an issue where the file generation process works as expected, but when I open the file, Excel says that it is corrupt and needs to be repaired. When the repair is complete, the file opens and all of the data is there.
The error message that I am receiving is as follows:
Removed Records: Cell information from /xl/worksheets/sheet1.xml part
My code is as follows:
using (var workbookDocument = SpreadsheetDocument.Create(staging, DocumentFormat.OpenXml.SpreadsheetDocumentType.Workbook))
var count = query.Count();
var worksheetNumber = 1;
var worksheetCapacity = Convert.ToInt32(100000);
var worksheetCount = Convert.ToInt32(Math.Ceiling(Convert.ToDouble(count) / worksheetCapacity));
var workbookPart = workbookDocument.AddWorkbookPart();
var worksheetInfo = new List<WorksheetData>();
OpenXmlWriter worksheetWriter;
while (worksheetNumber <= worksheetCount)
var worksheetLine = 1;
var worksheetName = sheet + "_" + Convert.ToString(worksheetNumber);
var worksheetPart = workbookDocument.WorkbookPart.AddNewPart<WorksheetPart>();
var worksheetId = workbookDocument.WorkbookPart.GetIdOfPart(worksheetPart);
var worksheetKey = Convert.ToUInt32(worksheetNumber);
var worksheetAttributes = new List<OpenXmlAttribute>();
worksheetAttributes.Add(new OpenXmlAttribute("r", null, worksheetLine.ToString()));
worksheetInfo.Add(new WorksheetData() { Id = worksheetId, Key = worksheetKey, Name = worksheetName });
worksheetWriter = OpenXmlWriter.Create(worksheetPart);
worksheetWriter.WriteStartElement(new Worksheet());
worksheetWriter.WriteStartElement(new SheetData());
worksheetWriter.WriteStartElement(new Row(), worksheetAttributes);
for (var i = 0; i < headers.Count; i++)
var worksheetCell = new DocumentFormat.OpenXml.Spreadsheet.Cell();
var worksheetValue = new DocumentFormat.OpenXml.Spreadsheet.CellValue(headers[i]);
worksheetAttributes.Add(new OpenXmlAttribute("t", null, "str"));
worksheetAttributes.Add(new OpenXmlAttribute("r", null, GetColumnReference(worksheetLine, i)));
worksheetWriter.WriteStartElement(worksheetCell, worksheetAttributes);
var skip = ((worksheetNumber - 1) * worksheetCapacity);
var results = query.SelectProperties(columns).Skip(skip).Take(worksheetCapacity).ToList();
for (var j = 0; j < results.Count; j++)
worksheetAttributes.Add(new OpenXmlAttribute("r", null, worksheetLine.ToString()));
worksheetWriter.WriteStartElement(new Row());
for (var k = 0; k < columns.Count(); k++)
var column = columns[k].Split((".").ToCharArray()).Last();
var value = results[j].GetType().GetField(column).GetValue(results[j]);
var type = value?.GetType().Name;
var text = ExportFormatter.Format(type, value);
worksheetAttributes.Add(new OpenXmlAttribute("t", null, "str"));
worksheetAttributes.Add(new OpenXmlAttribute("r", null, GetColumnReference(worksheetLine, j)));
worksheetWriter.WriteStartElement(new Cell());
worksheetWriter.WriteElement(new CellValue(text));
worksheetWriter = OpenXmlWriter.Create(workbookDocument.WorkbookPart);
worksheetWriter.WriteStartElement(new Workbook());
worksheetWriter.WriteStartElement(new Sheets());
for (var i = 0; i < worksheetInfo.Count; i++)
worksheetWriter.WriteElement(new Sheet()
Name = worksheetInfo[i].Name,
SheetId = worksheetInfo[i].Key,
Id = worksheetInfo[i].Id
I use the below class to track the worksheet information:
private class WorksheetData
public String Id { get; set; }
public UInt32 Key { get; set; }
public String Name { get; set; }
Can anyone identify why this is happening?
Perhaps an extra ending tag, or some that are missing?
I finally got this to work; there were a few issues:
The cell references (A1, A2, A3, etc.) were not correct in the code.
The row numbers were not being tracked correctly.
The attributes applied to the cell elements were not correct because the attribute list wasn't being cleared prior to writing.
The usage of CellValue was not functioning as expected. I am not sure exactly why, but when the CellValue element is used the file doesn't open properly in Excel. Simply creating the Cell and setting its DataType and CellValue properties did work. Note: the underlying XML looked exactly the same between the two, but only the second worked.
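
The first issue in the list, building correct cell references, comes down to turning a zero-based column index and a one-based row number into an A1-style reference. The GetColumnReference helper used in the question is not shown, so the following JavaScript sketch is only an assumption about what such a helper has to compute; the names columnLetters and cellReference are hypothetical.

```javascript
// Convert a zero-based column index to spreadsheet letters:
// 0 -> A, 25 -> Z, 26 -> AA, 701 -> ZZ, 702 -> AAA.
function columnLetters(index) {
  let letters = "";
  let n = index + 1; // work in "bijective base 26", where there is no zero digit
  while (n > 0) {
    const rem = (n - 1) % 26;
    letters = String.fromCharCode(65 + rem) + letters;
    n = Math.floor((n - 1) / 26);
  }
  return letters;
}

// Build an A1-style reference from a one-based row and a zero-based column index.
function cellReference(row, columnIndex) {
  return columnLetters(columnIndex) + String(row);
}

console.log(cellReference(1, 0));  // A1
console.log(cellReference(2, 27)); // AB2
```

Note that the column argument has to be the counter of the inner (column) loop. In the question's data loop, the call passes j, the row counter, while the columns are iterated with k, so every cell in a row would get the same reference.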
The final code from this is as follows:
public static ExportInfo Export<T>(this IQueryable<T> query, String temp, String path, List<ExportField> fields)
var worker = new ExportWorker();
return worker.Export<T>(query, temp, path, fields);
public static class ExportFormatter
{
    // Converts a value to the text written into the cell, based on its type name
    public static String Format(String type, Object value)
    {
        if (value == null)
        {
            return "";
        }
        var text = "";
        switch (type)
        {
            case "Decimal":
                var decimalValue = (Decimal)value;
                text = decimal.Round(decimalValue, 2, MidpointRounding.AwayFromZero).ToString();
                break;
            case "DateTimeOffset":
                var dateTimeOffset = (DateTimeOffset)value;
                text = dateTimeOffset.ToUniversalTime().ToString("MM/dd/yyyy");
                break;
            case "DateTime":
                var dateTime = (DateTime)value;
                text = dateTime.ToUniversalTime().ToString("MM/dd/yyyy");
                break;
            default:
                text = Convert.ToString(value);
                break;
        }
        return text;
    }
}
public class ExportWorker
String record;
String staging;
String destination;
Thread thread;
Timer timer;
public ExportInfo Export<T>(IQueryable<T> query, String temp, String path, List<ExportField> fields)
var selections = from a in fields group a by new { a.Field } into g select new { g.Key.Field, Header = g.Max(x => x.Header) };
var headers = (from a in selections select a.Header).ToList();
var columns = (from a in selections select a.Field).Distinct().ToList();
var entity = query.ElementType.ToString();
var array = entity.Split((".").ToCharArray());
var sheet = array[array.Length - 1];
var key = Guid.NewGuid().ToString().Replace("-", "_");
var name = key + ".xlsx";
var log = key + ".txt";
var timeout = 60 * 60000;
staging = temp + name;
destination = path + name;
record = path + log;
thread = new Thread(
new ThreadStart(() =>
using (var workbookDocument = SpreadsheetDocument.Create(staging, DocumentFormat.OpenXml.SpreadsheetDocumentType.Workbook))
var count = query.Count();
var worksheetNumber = 1;
var worksheetCapacity = Convert.ToInt32(100000);
var worksheetCount = Convert.ToInt32(Math.Ceiling(Convert.ToDouble(count) / worksheetCapacity));
var workbookPart = workbookDocument.AddWorkbookPart();
var worksheetInfo = new List<WorksheetData>();
OpenXmlWriter worksheetWriter;
while (worksheetNumber <= worksheetCount)
var worksheetLine = 1;
var worksheetThrottle = 0;
var worksheetName = sheet + "_" + Convert.ToString(worksheetNumber);
var worksheetPart = workbookDocument.WorkbookPart.AddNewPart<WorksheetPart>();
var worksheetId = workbookDocument.WorkbookPart.GetIdOfPart(worksheetPart);
var worksheetKey = Convert.ToUInt32(worksheetNumber);
var worksheetAttributes = new List<OpenXmlAttribute>();
worksheetAttributes.Add(new OpenXmlAttribute("r", null, worksheetLine.ToString()));
worksheetInfo.Add(new WorksheetData() { Id = worksheetId, Key = worksheetKey, Name = worksheetName });
worksheetWriter = OpenXmlWriter.Create(worksheetPart);
worksheetWriter.WriteStartElement(new Worksheet());
worksheetWriter.WriteStartElement(new SheetData());
worksheetWriter.WriteStartElement(new Row(), worksheetAttributes);
for (var i = 0; i < headers.Count; i++)
var worksheetCell = new Cell();
worksheetCell.DataType = CellValues.String;
worksheetCell.CellValue = new CellValue(headers[i]);
var skip = ((worksheetNumber - 1) * worksheetCapacity);
var results = query.SelectProperties(columns).Skip(skip).Take(worksheetCapacity).ToList();
for (var j = 0; j < results.Count; j++)
if (worksheetThrottle >= 5) { worksheetThrottle = 0; System.Threading.Thread.Sleep(1); }
worksheetAttributes.Add(new OpenXmlAttribute("r", null, worksheetLine.ToString()));
worksheetWriter.WriteStartElement(new Row(), worksheetAttributes);
for (var k = 0; k < columns.Count(); k++)
var column = columns[k].Split((".").ToCharArray()).Last();
var value = results[j].GetType().GetField(column).GetValue(results[j]);
var type = value?.GetType().Name;
var text = (String)ExportFormatter.Format(type, value);
var worksheetCell = new Cell();
worksheetCell.DataType = CellValues.String;
worksheetCell.CellValue = new CellValue(text);
worksheetWriter = OpenXmlWriter.Create(workbookDocument.WorkbookPart);
worksheetWriter.WriteStartElement(new Workbook());
worksheetWriter.WriteStartElement(new Sheets());
for (var i = 0; i < worksheetInfo.Count; i++)
worksheetWriter.WriteElement(new Sheet()
Name = worksheetInfo[i].Name,
SheetId = worksheetInfo[i].Key,
Id = worksheetInfo[i].Id
var logsfile = File.CreateText(record);
var datafile = (new DirectoryInfo(temp)).GetFiles().FirstOrDefault(a => a.Name == name);
catch (Exception ex)
try { File.Delete(staging); } catch (Exception) { }
var logsfile = File.CreateText(record);
timer = new Timer(Expire, null, timeout, Timeout.Infinite);
return new ExportInfo() { File = destination, Log = record };
void Expire(object state)
try { File.Delete(staging); } catch (Exception) { }
var logsfile = File.CreateText(record);
private class WorksheetData
public String Id { get; set; }
public UInt32 Key { get; set; }
public String Name { get; set; }
After making those adjustments, the export works beautifully.
Also, Open XML solved many of the memory-management problems I was having.
Using the above approach, I was able to export 3 files, each with 1.5 million rows (40 columns), in about 10 minutes.
During the export process, CPU utilization never exceeded 35% and memory use never went above about a tenth of a gigabyte. Bravo...

VS Code extension how to edit in context?

Here is the class I use to automatically capitalize true, false, ...
import { window, workspace, WorkspaceEdit, Range, Position } from 'vscode';
export class StUpdater {
    private _lines: number;
    private _strings: Array<string>;
    constructor() {
        this._lines = 0;
        this._strings = ['true', 'false', 'exit', 'continue', 'return'];
    }
    Update(Cntx: boolean = false) {
        let editor = window.activeTextEditor;
        if (!editor || (editor.document.languageId != 'st')) {
            window.showErrorMessage('No editor!')
            return; // assumed early return
        }
        let doc = editor.document;
        if (Cntx == false) {
            if (this._lines >= doc.lineCount) {
                this._lines = doc.lineCount;
                return; // assumed early return
            }
            this._lines = doc.lineCount;
        }
        let AutoFormat = workspace.getConfiguration('st').get('autoFormat');
        if (!AutoFormat) {
            return; // assumed early return
        }
        let edit = new WorkspaceEdit();
        for (let line = 0; line < doc.lineCount; line++) {
            const element = doc.lineAt(line);
            for (let i = 0; i < this._strings.length; i++) {
                let str = this._strings[i];
                let last_char = 0;
                // Find every occurrence of the keyword on this line
                while (element.text.indexOf(str, last_char) >= 0) {
                    let char = element.text.indexOf(str, last_char);
                    last_char = char + str.length;
                    // assumed call: replace the match with its upper-case form
                    edit.replace(
                        doc.uri,
                        new Range(
                            new Position(line, char),
                            new Position(line, last_char)
                        ),
                        str.toUpperCase()
                    );
                }
            }
        }
        return workspace.applyEdit(edit);
    }
    public dispose() {
    }
}
This code works fine, but I do not want it to replace anything inside a string or a comment. How do I do that? I cannot find a regex version of replace, and even if I did, looking at a single line I cannot tell whether it is part of a multi-line comment.
If I understand you correctly, you want to capitalize only certain elements (probably identifiers), but not words in comments or strings, correct? That requires identifying the lexical elements in the text, that is, mapping each range of characters to a lexical type. This is usually done by a lexer. It's not difficult to write one by hand that walks over the characters, on top of your current processing, and finds those ranges that can be manipulated.
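
A hand-written lexer of that kind could look roughly like the sketch below. It is only an illustration under stated assumptions: it assumes Structured Text uses (* ... *) block comments, // line comments, and strings in single or double quotes, and it ignores escape sequences inside strings. The keyword list is taken from the question's _strings array; capitalizeKeywords is a hypothetical name. It walks the whole document text rather than one line at a time, which is what makes multi-line comments tractable.

```javascript
// Keyword list taken from the question's _strings array.
const KEYWORDS = ["true", "false", "exit", "continue", "return"];

// Assumed syntax: (* block comments *), // line comments,
// strings in single or double quotes (escape sequences are not handled).
function capitalizeKeywords(text) {
  let out = "";
  let i = 0;
  while (i < text.length) {
    // Block comment: copy through the closing "*)" (or to the end if unterminated).
    if (text.startsWith("(*", i)) {
      const end = text.indexOf("*)", i + 2);
      const stop = end === -1 ? text.length : end + 2;
      out += text.slice(i, stop);
      i = stop;
      continue;
    }
    // Line comment: copy up to (not including) the line break.
    if (text.startsWith("//", i)) {
      const end = text.indexOf("\n", i);
      const stop = end === -1 ? text.length : end;
      out += text.slice(i, stop);
      i = stop;
      continue;
    }
    // String literal: copy through the matching closing quote.
    const ch = text[i];
    if (ch === "'" || ch === '"') {
      const end = text.indexOf(ch, i + 1);
      const stop = end === -1 ? text.length : end + 1;
      out += text.slice(i, stop);
      i = stop;
      continue;
    }
    // Identifier or keyword: read the whole word, then decide.
    if (/[A-Za-z_]/.test(ch)) {
      let j = i + 1;
      while (j < text.length && /[A-Za-z0-9_]/.test(text[j])) j++;
      const word = text.slice(i, j);
      out += KEYWORDS.includes(word.toLowerCase()) ? word.toUpperCase() : word;
      i = j;
      continue;
    }
    // Anything else (operators, digits, whitespace) is copied unchanged.
    out += ch;
    i++;
  }
  return out;
}

const sample = "x := true; (* keep true\n here *) s := 'return'; // exit\nreturn;";
console.log(capitalizeKeywords(sample));
```

Because whole words are read before they are compared, an identifier such as trueValue is left alone, which the indexOf approach would not do. Inside the extension, the same walk could record the position of each keyword and pass it to edit.replace instead of building a new string.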

CreateJS swapping display list containers with the use of classes

This is a 2D Jenga game.
I am currently making a Jenga game in CreateJS. When users take a piece out of the Jenga building they can move it around; ultimately they are supposed to be able to move the piece to the top, as in a typical Jenga game. The problem is that when you take a piece out and move it towards the bottom, it appears in front of the Jenga building, but once you move it towards the top it goes behind the building. I have a piece class which creates one piece; it looks like this:
var GamepeiceComponent = (function() {
    var assets = {};
    var offset;
    var gamePeice;
    var currentX;
    var currentY;
    var newContainer;
    this.makePiece = function(ingredient) {
        gamePeice = new createjs.Container();
        assets.peice = new createjs.Bitmap(queue.getResult(ingredient));
        gamePeice.on('pressmove', handlePieceDrag);
        gamePeice.on("pressup", handlePieceUp);
        gamePeice.on("mousedown", handleMouseDown);
        gamePeice.cursor = 'pointer';
    };
    // NOTE: the object whose x/y is read and assigned in the three handlers below
    // was lost from the post; e.currentTarget is an assumption.
    function handleMouseDown(e) {
        // Game.block.swapChildren(e.currentTarget, Game.block);
        for(var i = 0; i < Game.stage.children.length; i++){
            //Game.stage.swapChildren(e.currentTarget, Game.stage.children[i])
        }
        offset = {x: e.currentTarget.x - e.stageX, y: e.currentTarget.y - e.stageY};
    }
    function handlePieceDrag(e) {
        e.currentTarget.x = e.stageX + offset.x;
        e.currentTarget.y = e.stageY + offset.y;
    }
    function handlePieceUp(e) {
        e.currentTarget.x = 0;
        e.currentTarget.y = 0;
    }
    this.addPiece = function() {
        return gamePeice;
    };
    return this;
});
I then have a block class which builds a block from the piece class, creating 3 pieces per block (just like Jenga). Here is what it looks like:
var GameblockComponent = (function() {
    var gameBlock;
    this.makeBlock = function(ingredient, yOffset, xOffset) {
        gameBlock = new createjs.Container();
        for(var i=0;i<3;i++) {
            var gamePieces = new GamepeiceComponent();
            var makePiece = gamePieces.makePiece(ingredient);
            gamePieces.addPiece().y = yOffset * i;
            gamePieces.addPiece().x = xOffset * i;
            gamePieces.addPiece().on('pieceClicked', handleClick);
        }
    };
    function handleClick(e) {
        console.log('Game Piece Clicked');
    }
    this.addBlock = function() {
        return gameBlock;
    };
    return this;
});
Lastly I have a building which organizes all the blocks in order:
var GamebuildingComponent = (function(game) {
    var jengaContainer;
    var left = ['burger_l', 'cheese_l', 'egg_l', 'ham_l', 'lettuce_l', 'onion_l', 'pickle_l', 'salmon_l', 'sausage_l', 'tomato_l'];
    var right = ['burger_r', 'cheese_r', 'egg_r', 'ham_r', 'lettuce_r', 'onion_r', 'pickle_r', 'salmon_r', 'sausage_r', 'tomato_r'];
    var bread = ['bread_l', 'bread_r'];
    var seed = [];
    var offsets = {
        xOffsetLeft: 15,
        yOffsetLeft: -33,
        xPosLeft: 170,
        xOffsetRight: 17,
        yOffsetRight: 33,
    };
    function init() {
        jengaContainer = new createjs.Container();
    }
    function createBread(yOffset) {
        var block = new GameblockComponent();
        var breadLeft = block.makeBlock(bread[0], offsets.xOffsetLeft, offsets.yOffsetLeft);
        block.addBlock().x = 170;
        block.addBlock().y = yOffset;
        jengaContainer.addChildAt(block.addBlock(), 0);
    }
    // LEFT: left side facing to left
    // RIGHT: right side facing to right
    function createSubBlock(yOffset) {
        for(var i=0;i<5;i++) {
            var block = new GameblockComponent();
            var random = Math.floor(Math.random()*left.length);
            // prevents duplicates
            while(seed.indexOf(random) > -1) {
                var random = Math.floor(Math.random()*left.length);
            }
            if(i%2 != 0) {
                var ingredient = block.makeBlock(left[random], offsets.xOffsetLeft, offsets.yOffsetLeft);
                block.addBlock().x = 170;
            } else {
                var ingredient = block.makeBlock(right[random], offsets.xOffsetRight, offsets.yOffsetRight);
                block.addBlock().x = 105;
            }
            block.addBlock().y = 23 * i + yOffset;
            jengaContainer.addChildAt(block.addBlock(), 0);
        }
    }
    this.addBuilding = function() {
        return jengaContainer;
    };
    return this;
});
It all works fine, except that when you move a lower piece towards the top, the piece goes behind the Jenga building. Of course, that is how the display list works, but how and where would I swap the piece correctly? I listed all the child elements on the stage and got one child (the Jenga building). That child in turn has 13 children (one per block).
Then I just add the Jenga building to a view, and that view gets called from a controller.
You're probably looking for the setChildIndex method of the Container object.
function handleMouseDown(e) {
    Game.block.setChildIndex(e.currentTarget, Game.block.children.length - 1);
}
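
To see why this helps: a container draws its children in array order, so the last child ends up on top. The sketch below uses a tiny stand-in class (MiniContainer, hypothetical) instead of the real createjs.Container, purely so that it runs without EaselJS; it mimics only addChild, setChildIndex, children and parent. Since each piece sits inside a block and each block inside the building, the sketch assumes that both levels have to be raised, which bringToFront (also a hypothetical helper) does by walking up the parent chain.

```javascript
// Minimal stand-in for a display-list container (NOT the real createjs.Container).
class MiniContainer {
  constructor(name) {
    this.name = name;
    this.children = [];
    this.parent = null;
  }
  addChild(child) {
    child.parent = this;
    this.children.push(child);
    return child;
  }
  // Move an existing child to the given index; later indices are drawn on top.
  setChildIndex(child, index) {
    const from = this.children.indexOf(child);
    if (from === -1 || index < 0 || index >= this.children.length) return;
    this.children.splice(from, 1);
    this.children.splice(index, 0, child);
  }
}

// Raise an object to the top of every container between it and the root.
function bringToFront(obj) {
  for (let node = obj; node.parent; node = node.parent) {
    node.parent.setChildIndex(node, node.parent.children.length - 1);
  }
}

const building = new MiniContainer("building");
const blockA = building.addChild(new MiniContainer("blockA"));
const blockB = building.addChild(new MiniContainer("blockB"));
const piece = blockA.addChild(new MiniContainer("piece1"));
blockA.addChild(new MiniContainer("piece2"));

bringToFront(piece);
console.log(building.children.map((c) => c.name)); // [ 'blockB', 'blockA' ]
console.log(blockA.children.map((c) => c.name));   // [ 'piece2', 'piece1' ]
```

With the real library, the same loop should be usable on e.currentTarget inside handleMouseDown, since EaselJS display objects expose parent and containers expose children and setChildIndex.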