How to reject numeric values in Lucene.net? - lucene.net

I want to know whether is it possible to reject numeric phrases or numeric values while indexing or searching in Lucene.net.
For example (this is one line),
Hi all my no is 4756396
Now, when I index or search it should reject the numeric value 4756396 to be indexed or searched. I tried making a custom stop word list with 1, 2, 3, 4, 5, 6, etc, but I guess it will only ignore if a single number will appears.

You can copy the StandardAnalyzer and customize the grammar (simple JFlex stuff) to reject number tokens. If you do that, you'll need to port back the analyzer to Java since JFlex will generate java code, tho you could give it a try with C# Flex.
You could also write a TokenFilter that scans tokens one by one and rejects them if they are numbers. If you wanna filter only whole numbers and still retain numbers that are for example separate by hyphens, the filter could simply attempt a double.TryParse() and if it fails you accept the Token. A more robust and customizable solution would still use a lexical parser.
Edit:
Heres a quick sample of what I mean, with a little main method that shows how to use it. In this I used a TryParse() to filter out tokens, if it were for a more complex production system I'd use a lexical parser system. (take a look at C# Flex for that)
public class NumericFilter : TokenFilter
{
private ITermAttribute termAtt ;
public NumericFilter(TokenStream tokStream)
: base(tokStream)
{
termAtt = AddAttribute<ITermAttribute>();
}
public override bool IncrementToken()
{
while (base.input.IncrementToken())
{
string term = termAtt.Term;
double res ;
if(double.TryParse(term, out res))
{
// skip this token
continue;
}
// accept this token
return true;
}
// no more token in the stream
return false;
}
}
static void Main(string[] args)
{
RAMDirectory dir = new RAMDirectory();
IndexWriter iw = new IndexWriter(dir, new KeywordAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
Document d = new Document();
Field f = new Field("text", "", Field.Store.YES, Field.Index.ANALYZED);
d.Add(f);
// use our Filter here
f.SetTokenStream(new NumericFilter(new LowerCaseFilter(new WhitespaceTokenizer(new StringReader("I have 300 dollars")))));
iw.AddDocument(d);
iw.Commit();
IndexReader reader = iw.GetReader();
// print all terms in the text field
TermEnum terms = reader.Terms(new Term("text", ""));
do
{
Console.WriteLine(terms.Term.Text);
}
while (terms.Next());
reader.Dispose();
iw.Dispose();
Console.ReadLine();
Environment.Exit(42);
}

Related

Salesforce trigger-Not able to understand

Below is the code written by my collegue who doesnt work in the firm anymore. I am inserting records in object with data loader and I can see success message but I do not see any records in my object. I am not able to understand what below trigger is doing.Please someone help me understand as I am new to salesforce.
trigger DataLoggingTrigger on QMBDataLogging__c (after insert) {
Map<string,Schema.RecordTypeInfo> recordTypeInfo = Schema.SObjectType.QMB_Initial_Letter__c.getRecordTypeInfosByName();
List<QMBDataLogging__c> logList = (List<QMBDataLogging__c>)Trigger.new;
List<Sobject> sobjList = (List<Sobject>)Type.forName('List<'+'QMB_Initial_Letter__c'+'>').newInstance();
Map<string, QMBLetteTypeToVfPage__c> QMBLetteTypeToVfPage = QMBLetteTypeToVfPage__c.getAll();
Map<String,QMBLetteTypeToVfPage__c> mapofLetterTypeRec = new Map<String,QMBLetteTypeToVfPage__c>();
set<Id>processdIds = new set<Id>();
for(string key : QMBLetteTypeToVfPage.keyset())
{
if(!mapofLetterTypeRec.containsKey(key)) mapofLetterTypeRec.put(QMBLetteTypeToVfPage.get(Key).Letter_Type__c, QMBLetteTypeToVfPage.get(Key));
}
for(QMBDataLogging__c log : logList)
{
Sobject logRecord = (sobject)log;
Sobject QMBLetterRecord = new QMB_Initial_Letter__c();
if(mapofLetterTypeRec.containskey(log.Field1__c))
{
string recordTypeId = recordTypeInfo.get(mapofLetterTypeRec.get(log.Field1__c).RecordType__c).isAvailable() ? recordTypeInfo.get(mapofLetterTypeRec.get(log.Field1__c).RecordType__c).getRecordTypeId() : recordTypeInfo.get('Master').getRecordTypeId();
string fieldApiNames = mapofLetterTypeRec.containskey(log.Field1__c) ? mapofLetterTypeRec.get(log.Field1__c).FieldAPINames__c : '';
//QMBLetterRecord.put('Letter_Type__c',log.Name);
QMBLetterRecord.put('RecordTypeId',tgh);
processdIds.add(log.Id);
if(string.isNotBlank(fieldApiNames) && fieldApiNames.contains(','))
{
Integer i = 1;
for(string fieldApiName : fieldApiNames.split(','))
{
string logFieldApiName = 'Field'+i+'__c';
fieldApiName = fieldApiName.trim();
system.debug('fieldApiName=='+fieldApiName);
Schema.DisplayType fielddataType = getFieldType('QMB_Initial_Letter__c',fieldApiName);
if(fielddataType == Schema.DisplayType.Date)
{
Date dateValue = Date.parse(string.valueof(logRecord.get(logFieldApiName)));
QMBLetterRecord.put(fieldApiName,dateValue);
}
else if(fielddataType == Schema.DisplayType.DOUBLE)
{
string value = (string)logRecord.get(logFieldApiName);
Double dec = Double.valueOf(value.replace(',',''));
QMBLetterRecord.put(fieldApiName,dec);
}
else if(fielddataType == Schema.DisplayType.CURRENCY)
{
Decimal decimalValue = Decimal.valueOf((string)logRecord.get(logFieldApiName));
QMBLetterRecord.put(fieldApiName,decimalValue);
}
else if(fielddataType == Schema.DisplayType.INTEGER)
{
string value = (string)logRecord.get(logFieldApiName);
Integer integerValue = Integer.valueOf(value.replace(',',''));
QMBLetterRecord.put(fieldApiName,integerValue);
}
else if(fielddataType == Schema.DisplayType.DATETIME)
{
DateTime dateTimeValue = DateTime.valueOf(logRecord.get(logFieldApiName));
QMBLetterRecord.put(fieldApiName,dateTimeValue);
}
else
{
QMBLetterRecord.put(fieldApiName,logRecord.get(logFieldApiName));
}
i++;
}
}
}
sobjList.add(QMBLetterRecord);
}
if(!sobjList.isEmpty())
{
insert sobjList;
if(!processdIds.isEmpty()) DeleteDoAsLoggingRecords.deleteTheProcessRecords(processdIds);
}
Public static Schema.DisplayType getFieldType(string objectName,string fieldName)
{
SObjectType r = ((SObject)(Type.forName('Schema.'+objectName).newInstance())).getSObjectType();
DescribeSObjectResult d = r.getDescribe();
return(d.fields.getMap().get(fieldName).getDescribe().getType());
}
}
You might be looking in the wrong place. Check if there's an unit test written for this thing (there should be one, especially if it's deployed to production), it should help you understand how it's supposed to be used.
You're inserting records of QMBDataLogging__c but then it seems they're immediately deleted in DeleteDoAsLoggingRecords.deleteTheProcessRecords(processdIds). Whether whatever this thing was supposed to do succeeds or not.
This seems to be some poor man's CSV parser or generic "upload anything"... that takes data stored in QMBDataLogging__c and creates QMB_Initial_Letter__c out of it.
QMBLetteTypeToVfPage__c.getAll() suggests you could go to Setup -> Custom Settings, try to find this thing and examine. Maybe it has some values in production but in your sandbox it's empty and that's why essentially nothing works? Or maybe some values that are there are outdated?
There's some comparison if what you upload into Field1__c can be matched to what's in that custom setting. I guess you load some kind of subtype of your QMB_Initial_Letter__c in there. Record Type name and list of fields to read from your log record is also fetched from custom setting based on that match.
Then this thing takes what you pasted, looks at the list of fields in from the custom setting and parses it.
Let's say the custom setting contains something like
Name = XYZ, FieldAPINames__c = 'Name,SomePicklist__c,SomeDate__c,IsActive__c'
This thing will look at first record you inserted, let's say you have the CSV like that
Field1__c,Field2__c,Field3__c,Field4__c
XYZ,Closed,2022-09-15,true
This thing will try to parse and map it so eventually you create record that a "normal" apex code would express as
new QMB_Initial_Letter__c(
Name = 'XYZ',
SomePicklist__c = 'Closed',
SomeDate__c = Date.parse('2022-09-15'),
IsActive__c = true
);
It's pretty fragile, as you probably already know. And because parsing CSV is an art - I expect it to absolutely crash and burn when text with commas in it shows up (some text,"text, with commas in it, should be quoted",more text).
In theory admin can change mapping in setup - but then they'd need to add new field anyway to the loaded file. Overcomplicated. I guess somebody did it to solve issue with Record Type Ids - but there are better ways to achieve that and still have normal CSV file with normal columns and strong type matching, not just chucking everything in as strings.
In theory this lets you have "jagged" csv files (row 1 having 5 fields, row 2 having different record type and 17 fields? no problem)
Your call whether it's salvageable or you'd rather ditch it and try normal loading of QMB_Initial_Letter__c records. (get back to your business people and ask for requirements?) If you do have variable number of columns at source - you'd need to standardise it or group the data so only 1 "type" of records (well, whatever's in that "Field1__c") goes into each file.

What's wrong with my Meteor publication?

I have a publication, essentially what's below:
Meteor.publish('entity-filings', function publishFunction(cik, queryArray, limit) {
if (!cik || !filingsArray)
console.error('PUBLICATION PROBLEM');
var limit = 40;
var entityFilingsSelector = {};
if (filingsArray.indexOf('all-entity-filings') > -1)
entityFilingsSelector = {ct: 'filing',cik: cik};
else
entityFilingsSelector = {ct:'filing', cik: cik, formNumber: { $in: filingsArray} };
return SB.Content.find(entityFilingsSelector, {
limit: limit
});
});
I'm having trouble with the filingsArray part. filingsArray is an array of regexes for the Mongo $in query. I can hardcode filingsArray in the publication as [/8-K/], and that returns the correct results. But I can't get the query to work properly when I pass the array from the router. See the debugged contents of the array in the image below. The second and third images are the client/server debug contents indicating same content on both client and server, and also identical to when I hardcode the array in the query.
My question is: what am I missing? Why won't my query work, or what are some likely reasons it isn't working?
In that first screenshot, that's a string that looks like a regex literal, not an actual RegExp object. So {$in: ["/8-K/"]} will only match literally "/8-K/", which is not the same as {$in: [/8-K/]}.
Regexes are not EJSON-able objects, so you won't be able to send them over the wire as publish function arguments or method arguments or method return values. I'd recommend sending a string, then inside the publish function, use new RegExp(...) to construct a regex object.
If you're comfortable adding new methods on the RegExp prototype, you could try making RegExp an EJSON-able type, by putting this in your server and client code:
RegExp.prototype.toJSONValue = function () {
return this.source;
};
RegExp.prototype.typeName = function () {
return "regex";
}
EJSON.addType("regex", function (str) {
return new RegExp(str);
});
After doing this, you should be able to use regexes as publish function arguments, method arguments and method return values. See this meteorpad.
/8-K/.. that's a weird regex. Try /8\-K/.
A minus (-) sign is a range indicator and usually used inside square brackets. The reason why it's weird because how could you even calculate a range between 8 and K? If you do not escape that, it probably wouldn't be used to match anything (thus your query would not work). Sometimes, it does work though. Better safe than never.
/8\-K/ matches the string "8-K" anywhere once.. which I assume you are trying to do.
Also it would help if you would ensure your publication would always return something.. here's a good area where you could fail:
if (!cik || !filingsArray)
console.error('PUBLICATION PROBLEM');
If those parameters aren't filled, console.log is probably not the best way to handle it. A better way:
if (!cik || !filingsArray) {
throw "entity-filings: Publication problem.";
return false;
} else {
// .. the rest of your publication
}
This makes sure that the client does not wait unnecessarily long for publications statuses as you have successfully ensured that in any (input) case you returned either false or a Cursor and nothing in between (like surprise undefineds, unfilled Cursors, other garbage data.

Getting line locations with iText

How can one find where are lines located in a document with iText?
Suppose say I have a table in a PDF document, and want to read its contents; I would like to find where exactly the cells are located. In order to do that I thought I might find the intersections of lines.
I think your only option using iText will be to parse the PDF tokens manually. Before doing that I would have a copy of the PDF spec handy.
(I'm a .Net guy so I use iTextSharp but other than some capitalization differences and property declarations they're almost 100% the same.)
You can get the individual tokens using the PRTokeniser object which you feed bytes into from calling getPageContent(pageNum) on your PdfReader.
//Get bytes for page 1
byte[] pageBytes = reader.getPageContent(1);
//Get the tokens for page 1
PRTokeniser tokeniser = new PRTokeniser(pageBytes);
Then just loop through the PRTokeniser:
PRTokeniser.TokType tokenType;
string tokenValue;
while (tokeniser.nextToken()) {
tokenType = tokeniser.tokenType;
tokenValue = tokeniser.stringValue;
//...check tokenValue, do something with it
}
As far a tokenValue, you'd want to probably look for re and l values for rectangle and line. If you see an re then you want to look at the previous 4 values and if you see an l then previous 2 values. This also means that you need to store each tokenValue in an array so you can look back later.
Depending on what you used to create the PDF with you might get some interesting results. For instance, I created a 4 cell table with Microsoft Word and saved as a PDF. For some reason there are two sets of 10 rectangles with many duplicates, but the general idea still works.
Below is C# code targeting iTextSharp 5.1.1.0. You should be able to convert it to Java and iText very easily, I noted the one line that has .Net-specific code that needs to be adjusted from a Generic List (List<string>) to a Java equivalent, probably an ArrayList. You'll also need to adjust some casing, .Net uses Object.Method() whereas Java uses Object.method(). Lastly, .Net accesses properties without gets and sets, so Object.Property is both the getter and setter compared to Java's Object.getProperty and Object.setProperty.
Hopefully this gets you started at least!
//Source file to read from
string sourceFile = "c:\\Hello.pdf";
//Bind a reader to our PDF
PdfReader reader = new PdfReader(sourceFile);
//Create our buffer for previous token values. For Java users, List<string> is a generic list, probably most similar to an ArrayList
List<string> buf = new List<string>();
//Get the raw bytes for the page
byte[] pageBytes = reader.GetPageContent(1);
//Get the raw tokens from the bytes
PRTokeniser tokeniser = new PRTokeniser(pageBytes);
//Create some variables to set later
PRTokeniser.TokType tokenType;
string tokenValue;
//Loop through each token
while (tokeniser.NextToken()) {
//Get the types and value
tokenType = tokeniser.TokenType;
tokenValue = tokeniser.StringValue;
//If the type is a numeric type
if (tokenType == PRTokeniser.TokType.NUMBER) {
//Store it in our buffer for later user
buf.Add(tokenValue);
//Otherwise we only care about raw commands which are categorized as "OTHER"
} else if (tokenType == PRTokeniser.TokType.OTHER) {
//Look for a rectangle token
if (tokenValue == "re") {
//Sanity check, make sure we have enough items in the buffer
if (buf.Count < 4) throw new Exception("Not enough elements in buffer for a rectangle");
//Read and convert the values
float x = float.Parse(buf[buf.Count - 4]);
float y = float.Parse(buf[buf.Count - 3]);
float w = float.Parse(buf[buf.Count - 2]);
float h = float.Parse(buf[buf.Count - 1]);
//..do something with them here
}
}
}

What is the better way to do the below program(c#3.0)

Consider the below program
private static bool CheckFactorPresent(List<FactorReturn> factorReturnCol)
{
bool IsPresent = true;
StringBuilder sb = new StringBuilder();
//Get the exposure names from Exposure list.
//Since this will remain same , so it has been done outside the loop
List<string> lstExposureName = (from item in Exposures
select item.ExposureName).ToList<string>();
foreach (FactorReturn fr in factorReturnCol)
{
//Build the factor names from the ReturnCollection dictionary
List<string> lstFactorNames = fr.ReturnCollection.Keys.ToList<string>();
//Check if all the Factor Names are present in ExposureName list
List<string> result = lstFactorNames.Except(lstExposureName).ToList();
if (result.Count() > 0)
{
result.ForEach(i =>
{
IsPresent = false;
sb.AppendLine("Factor" + i + "is not present for week no: " + fr.WeekNo.ToString());
});
}
}
return IsPresent;
}
Basically I am checking if all the FactorNames[lstFactorNames] are present in
ExposureNames[lstExposureName] list by using lstFactorNames.Except(lstExposureName).
And then by using the Count() function(if count() > 0), I am writing the error
messages to the String Builder(sb)
I am sure that someone can definitely write a better implementation than the one presented.
And I am looking forward for the same to learn something new from that program.
I am using c#3.0 and dotnet framework 3.5
Thanks
Save for some naming convention issues, I'd say that looks fine (for what I can figure out without seeing the rest of the code, or the purpose in the effort. The naming conventions though, need some work. A sporadic mix of ntnHungarian, PascalCase, camelCase, and abbrv is a little disorienting. Try just naming your local variables camelCase exclusively and things will look a lot better. Best of luck to you - things are looking good so far!
- EDIT -
Also, you can clean up the iteration at the end by just running a simple foreach:
...
foreach (var except in result)
{
isPresent = false;
builder.AppendFormat("Factor{0} is not present for week no: {1}\r\n", except, fr.WeekNo);
}
...

How Do I detect Text and Cursor position changes in Word using VSTO

I want to write a word addin that does some computations and updates some ui whenever the user types something or moves the current insertion point. From looking at the MSDN docs, I don't see any obvious way such as an TextTyped event on the document or application objects.
Does anyone know if this is possible without polling the document?
Actually there is a way to run some code when a word has been typed, you can use SmartTags, and override the Recognize method, this method will be called whenever a word is type, which means whenever the user typed some text and hit the space, tab, or enter keys.
one problem with this however is that if you change the text using "Range.Text" it will detect it as a word change and call the function so it can cause infinite loops.
Here is some code I used to achieve this:
public class AutoBrandSmartTag : SmartTag
{
Microsoft.Office.Interop.Word.Document cDoc;
Microsoft.Office.Tools.Word.Action act = new Microsoft.Office.Tools.Word.Action("Test Action");
public AutoBrandSmartTag(AutoBrandEngine.AutoBrandEngine _engine, Microsoft.Office.Interop.Word.Document _doc)
: base("AutoBrandTool.com/SmartTag#AutoBrandSmartTag", "AutoBrand SmartTag")
{
this.cDoc = _doc;
this.Actions = new Microsoft.Office.Tools.Word.Action[] { act };
}
protected override void Recognize(string text, Microsoft.Office.Interop.SmartTag.ISmartTagRecognizerSite site,
Microsoft.Office.Interop.SmartTag.ISmartTagTokenList tokenList)
{
if (tokenList.Count < 1)
return;
int start = 0;
int length = 0;
int index = tokenList.Count > 1 ? tokenList.Count - 1 : 1;
ISmartTagToken token = tokenList.get_Item(index);
start = token.Start;
length = token.Length;
}
}
As you've probably discovered, Word has events, but they're for really coarse actions like a document open or a switch to another document. I'm guessing MS did this intentionally to prevent a crappy macro from slowing down typing.
In short, there's no great way to do what you want. A Word MVP confirms that in this thread.