Finding bad words from large list of email addressess using PHP -Mongodb

Finding bad words from large list of email addressess using PHP -Mongodb - mongodb

I have large list of email addressses from a file. It comes around 1 million email ids. I have list of bad words like spam,junk etc, it consist of 20,000+ bad words.
I need to validate email ids. If bad words is present any where in email id it will be marked as invalid.
For example;
testspam#gmail.com - invalid
newuser#desspam.com - invalid
I would like to know which will be fastest comparison method as array looping will take time.
I tried following methods
//$keyword_list- array of bad words;
//$check_key- the email id which need to validate
$arrays = array_chunk($keyword_list, 2000);
for($i=0;$i<count($arrays);$i++)
{
if (preg_match('/'.implode('|', $arrays[$i]).'/', $check_key, $matches)){
return 1;
}
}
The above method is taking more time when comparing 1 million data.
Next we tried with the following code and this also takes more time
//$contain = bad words separated by '|'
// $str - the email id which need to validate
if(stripos($contain,"|") !== false)
{
$s = preg_split('/[|]+/i',$contain);
$len = sizeof($s);
for($i=0;$i < $len;$i++)
{
if(stripos($str,$s[$i]) !== false)
{
return(true);
}
}
}
if(stripos($str,$contain) !== false)
{
return(true);
}
return(false);
Finally I had tried Mongodb Text Search. It works fast with the following issues
If 'Hell' is the word in my bad list and my email id is like
head#e-hellinglysussex.sch.uk, then the Mongodb Text Search wont matches it.
Here is the code I used;
$ret = $db->command( array("text" =>$section, "search" => $keyword_string, "limit"=>$cnt_finalnonmatch));
where $section = Collection name,
$keyword_string = Comparing keywords string separated by space, for eg "Hell Spam Junk" etc,
$cnt_finalnonmatch = total number of comparing email ids
Please help me to solve this issue.

I am not entirely sure, but I suspect that the problem is that 'Hell' is not equal to 'hell' when you search for text since mongodb is case sensitive.
The solution should be to force all the strings and word to be lowercase (or uppercase)

We have used Mongodb 'like' to solve this issue;
$keywords = $key['keyword']; // Keywords need to compare
$regexObj = new MongoRegex("/".$keywords."/i"); // MongoRegex function declration
$where = array($section => $regexObj); // $section is the collection name
$resultset = $info->find($where);

Related

Salesforce trigger-Not able to understand

Below is the code written by my collegue who doesnt work in the firm anymore. I am inserting records in object with data loader and I can see success message but I do not see any records in my object. I am not able to understand what below trigger is doing.Please someone help me understand as I am new to salesforce.
trigger DataLoggingTrigger on QMBDataLogging__c (after insert) {
Map<string,Schema.RecordTypeInfo> recordTypeInfo = Schema.SObjectType.QMB_Initial_Letter__c.getRecordTypeInfosByName();
List<QMBDataLogging__c> logList = (List<QMBDataLogging__c>)Trigger.new;
List<Sobject> sobjList = (List<Sobject>)Type.forName('List<'+'QMB_Initial_Letter__c'+'>').newInstance();
Map<string, QMBLetteTypeToVfPage__c> QMBLetteTypeToVfPage = QMBLetteTypeToVfPage__c.getAll();
Map<String,QMBLetteTypeToVfPage__c> mapofLetterTypeRec = new Map<String,QMBLetteTypeToVfPage__c>();
set<Id>processdIds = new set<Id>();
for(string key : QMBLetteTypeToVfPage.keyset())
{
if(!mapofLetterTypeRec.containsKey(key)) mapofLetterTypeRec.put(QMBLetteTypeToVfPage.get(Key).Letter_Type__c, QMBLetteTypeToVfPage.get(Key));
}
for(QMBDataLogging__c log : logList)
{
Sobject logRecord = (sobject)log;
Sobject QMBLetterRecord = new QMB_Initial_Letter__c();
if(mapofLetterTypeRec.containskey(log.Field1__c))
{
string recordTypeId = recordTypeInfo.get(mapofLetterTypeRec.get(log.Field1__c).RecordType__c).isAvailable() ? recordTypeInfo.get(mapofLetterTypeRec.get(log.Field1__c).RecordType__c).getRecordTypeId() : recordTypeInfo.get('Master').getRecordTypeId();
string fieldApiNames = mapofLetterTypeRec.containskey(log.Field1__c) ? mapofLetterTypeRec.get(log.Field1__c).FieldAPINames__c : '';
//QMBLetterRecord.put('Letter_Type__c',log.Name);
QMBLetterRecord.put('RecordTypeId',tgh);
processdIds.add(log.Id);
if(string.isNotBlank(fieldApiNames) && fieldApiNames.contains(','))
{
Integer i = 1;
for(string fieldApiName : fieldApiNames.split(','))
{
string logFieldApiName = 'Field'+i+'__c';
fieldApiName = fieldApiName.trim();
system.debug('fieldApiName=='+fieldApiName);
Schema.DisplayType fielddataType = getFieldType('QMB_Initial_Letter__c',fieldApiName);
if(fielddataType == Schema.DisplayType.Date)
{
Date dateValue = Date.parse(string.valueof(logRecord.get(logFieldApiName)));
QMBLetterRecord.put(fieldApiName,dateValue);
}
else if(fielddataType == Schema.DisplayType.DOUBLE)
{
string value = (string)logRecord.get(logFieldApiName);
Double dec = Double.valueOf(value.replace(',',''));
QMBLetterRecord.put(fieldApiName,dec);
}
else if(fielddataType == Schema.DisplayType.CURRENCY)
{
Decimal decimalValue = Decimal.valueOf((string)logRecord.get(logFieldApiName));
QMBLetterRecord.put(fieldApiName,decimalValue);
}
else if(fielddataType == Schema.DisplayType.INTEGER)
{
string value = (string)logRecord.get(logFieldApiName);
Integer integerValue = Integer.valueOf(value.replace(',',''));
QMBLetterRecord.put(fieldApiName,integerValue);
}
else if(fielddataType == Schema.DisplayType.DATETIME)
{
DateTime dateTimeValue = DateTime.valueOf(logRecord.get(logFieldApiName));
QMBLetterRecord.put(fieldApiName,dateTimeValue);
}
else
{
QMBLetterRecord.put(fieldApiName,logRecord.get(logFieldApiName));
}
i++;
}
}
}
sobjList.add(QMBLetterRecord);
}
if(!sobjList.isEmpty())
{
insert sobjList;
if(!processdIds.isEmpty()) DeleteDoAsLoggingRecords.deleteTheProcessRecords(processdIds);
}
Public static Schema.DisplayType getFieldType(string objectName,string fieldName)
{
SObjectType r = ((SObject)(Type.forName('Schema.'+objectName).newInstance())).getSObjectType();
DescribeSObjectResult d = r.getDescribe();
return(d.fields.getMap().get(fieldName).getDescribe().getType());
}
}

You might be looking in the wrong place. Check if there's an unit test written for this thing (there should be one, especially if it's deployed to production), it should help you understand how it's supposed to be used.
You're inserting records of QMBDataLogging__c but then it seems they're immediately deleted in DeleteDoAsLoggingRecords.deleteTheProcessRecords(processdIds). Whether whatever this thing was supposed to do succeeds or not.
This seems to be some poor man's CSV parser or generic "upload anything"... that takes data stored in QMBDataLogging__c and creates QMB_Initial_Letter__c out of it.
QMBLetteTypeToVfPage__c.getAll() suggests you could go to Setup -> Custom Settings, try to find this thing and examine. Maybe it has some values in production but in your sandbox it's empty and that's why essentially nothing works? Or maybe some values that are there are outdated?
There's some comparison if what you upload into Field1__c can be matched to what's in that custom setting. I guess you load some kind of subtype of your QMB_Initial_Letter__c in there. Record Type name and list of fields to read from your log record is also fetched from custom setting based on that match.
Then this thing takes what you pasted, looks at the list of fields in from the custom setting and parses it.
Let's say the custom setting contains something like
Name = XYZ, FieldAPINames__c = 'Name,SomePicklist__c,SomeDate__c,IsActive__c'
This thing will look at first record you inserted, let's say you have the CSV like that
Field1__c,Field2__c,Field3__c,Field4__c
XYZ,Closed,2022-09-15,true
This thing will try to parse and map it so eventually you create record that a "normal" apex code would express as
new QMB_Initial_Letter__c(
Name = 'XYZ',
SomePicklist__c = 'Closed',
SomeDate__c = Date.parse('2022-09-15'),
IsActive__c = true
);
It's pretty fragile, as you probably already know. And because parsing CSV is an art - I expect it to absolutely crash and burn when text with commas in it shows up (some text,"text, with commas in it, should be quoted",more text).
In theory admin can change mapping in setup - but then they'd need to add new field anyway to the loaded file. Overcomplicated. I guess somebody did it to solve issue with Record Type Ids - but there are better ways to achieve that and still have normal CSV file with normal columns and strong type matching, not just chucking everything in as strings.
In theory this lets you have "jagged" csv files (row 1 having 5 fields, row 2 having different record type and 17 fields? no problem)
Your call whether it's salvageable or you'd rather ditch it and try normal loading of QMB_Initial_Letter__c records. (get back to your business people and ask for requirements?) If you do have variable number of columns at source - you'd need to standardise it or group the data so only 1 "type" of records (well, whatever's in that "Field1__c") goes into each file.

How would I match the correct hash pair based on a specific string?

I have a simple page hit tracking script that allows for the output to display friendly names instead of urls by using a hash.
UPDATE: I used php to generate the hash below, but used the wrong dynamic page name of item.html. When changed to the correct name, the script returns the desired results. Sorry for wasting anyone's time.
my %LocalAddressTitlePairs = (
'https://www.mywebsite.com/index.html' => 'HOME',
'https://www.mywebsite.com/art_gallery.html' => 'GALLERY',
'https://www.mywebsite.com/cart/item.html?itemID=83&cat=26' => 'Island Life',
'https://www.mywebsite.com/cart/item.html?itemID=11&cat=22' => 'Castaways',
'https://www.mywebsite.com/cart/item.html?itemID=13&cat=29' => 'Pelicans',
and so on..
);
The code for returning the page hits:
sub url_format {
local $_ = $_[0] || '';
if ((m!$PREF{'My_Web_Address'}!i) and (m!^https://(.*)!i) ) {
if ($UseLocalAddressTitlePairs == 1) {
foreach my $Address (keys %LocalAddressTitlePairs) {
return "<a title=\"$Address\" href=\"$_\">$LocalAddressTitlePairs{$Address}</A>" if (m!$_$! eq m!$Address$!);
}
}
my $stub =$1;
return $stub;
}
}
Displaying the log hits will show
HOME with the correct link, GALLERY with the correct url link, but https://www.mywebsite.com/cart/item.html?itemID=83&cat=26
will display a random name instead of what it should be, Island Life for this page.. it has the correct link,-- a different name displays every time the page is loaded.
And, the output for all pages with query strings will display the exact same name. I know the links are correct by clicking thru site pages and checking the log script for my own page visits.
I tried -
while (my($mykey, $Value) = each %LocalAddressTitlePairs) {
return "<a title=\"$mykey\" href=\"$_\">$Value</a>" if(m!$_$! eq m!$mykey$!);
but again, the link is correct but the mykey/Value associated is random too. Way too new to perl to figure this out but I'm doing a lot of online research.

m!$Address$! does not work as expected, because the expression contains special characters such as ?
You need to add escape sequences \Q and \E
m!\Q$Address\E$!
it’s even better to add a check at the beginning of the line, otherwise
my $url = "https://www.mywebsite.com/?foo=bar"
my $bad_url = "https://bad.com?u=https://www.mywebsite.com/?foo=bar"
$bad_url =~ m!\Q$url\E$! ? 1 : 0 # 1, pass
$bad_url =~ m!^\Q$url\E$! ? 1 : 0 # 0, fail

Search removing comma using Entity Framework

I want to search a text that contains comma in database, but, there is not comma in the reference.
For example. In database I have the following value:
"Development of computer programs, including electronic games"
So, I try to search the data using the following string as reference:
"development of computer programs including electronic games"
NOTE that the only difference is that in database I have a comma in the text, but, in my reference for search, I have not.
Here is my code:
public async Task<ActionResult>Index(string nomeServico)
{
using (MyDB db = new MyDB())
{
// 1st We receive the following string:"development-of-computer-programs-including-electronic-games"
// but we remove all "-" characters
string serNome = nomeServico.RemoveCaractere("-", " ");
// we search the service that contains (in the SerName field) the value equal to the parameter of the Action.
Servicos servico = db.Servicos.FirstOrDefault(c => c.SerNome.ToLower().Equals(serNome, StringComparison.OrdinalIgnoreCase));
}
}
The problem is that, in the database, the data contains comma, and in the search value, don't.

In you code you are replacing "-" with "" and that too in your search string. But as per your requirement you need to change "," with "" for your DB entry.
Try doing something like this:
string serNome = nomeServico.ToLower();
Servicos servico = db.Servicos.FirstOrDefault(c => c.SerNome.Replace(",","").ToLower() == serNome);

What's wrong with my Meteor publication?

I have a publication, essentially what's below:
Meteor.publish('entity-filings', function publishFunction(cik, queryArray, limit) {
if (!cik || !filingsArray)
console.error('PUBLICATION PROBLEM');
var limit = 40;
var entityFilingsSelector = {};
if (filingsArray.indexOf('all-entity-filings') > -1)
entityFilingsSelector = {ct: 'filing',cik: cik};
else
entityFilingsSelector = {ct:'filing', cik: cik, formNumber: { $in: filingsArray} };
return SB.Content.find(entityFilingsSelector, {
limit: limit
});
});
I'm having trouble with the filingsArray part. filingsArray is an array of regexes for the Mongo $in query. I can hardcode filingsArray in the publication as [/8-K/], and that returns the correct results. But I can't get the query to work properly when I pass the array from the router. See the debugged contents of the array in the image below. The second and third images are the client/server debug contents indicating same content on both client and server, and also identical to when I hardcode the array in the query.
My question is: what am I missing? Why won't my query work, or what are some likely reasons it isn't working?

In that first screenshot, that's a string that looks like a regex literal, not an actual RegExp object. So {$in: ["/8-K/"]} will only match literally "/8-K/", which is not the same as {$in: [/8-K/]}.
Regexes are not EJSON-able objects, so you won't be able to send them over the wire as publish function arguments or method arguments or method return values. I'd recommend sending a string, then inside the publish function, use new RegExp(...) to construct a regex object.
If you're comfortable adding new methods on the RegExp prototype, you could try making RegExp an EJSON-able type, by putting this in your server and client code:
RegExp.prototype.toJSONValue = function () {
return this.source;
};
RegExp.prototype.typeName = function () {
return "regex";
}
EJSON.addType("regex", function (str) {
return new RegExp(str);
});
After doing this, you should be able to use regexes as publish function arguments, method arguments and method return values. See this meteorpad.

/8-K/.. that's a weird regex. Try /8\-K/.
A minus (-) sign is a range indicator and usually used inside square brackets. The reason why it's weird because how could you even calculate a range between 8 and K? If you do not escape that, it probably wouldn't be used to match anything (thus your query would not work). Sometimes, it does work though. Better safe than never.
/8\-K/ matches the string "8-K" anywhere once.. which I assume you are trying to do.
Also it would help if you would ensure your publication would always return something.. here's a good area where you could fail:
if (!cik || !filingsArray)
console.error('PUBLICATION PROBLEM');
If those parameters aren't filled, console.log is probably not the best way to handle it. A better way:
if (!cik || !filingsArray) {
throw "entity-filings: Publication problem.";
return false;
} else {
// .. the rest of your publication
}
This makes sure that the client does not wait unnecessarily long for publications statuses as you have successfully ensured that in any (input) case you returned either false or a Cursor and nothing in between (like surprise undefineds, unfilled Cursors, other garbage data.

Facing Issue in zend_search_lucene

I am using Zend Lucene Search:
......
$results = $test->fetchAll();
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');
Zend_Search_Lucene_Analysis_Analyzer::setDefault(new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
foreach ($results as $result) {
$doc = new Zend_Search_Lucene_Document();
// add Fields
$doc->addField(
Zend_Search_Lucene_Field::Text('testid', $result->id));
$doc->addField(
Zend_Search_Lucene_Field::Keyword('testemail', strtolower(($result->email))));
$doc->addField(
Zend_Search_Lucene_Field::Text('testconfirmdate', $result->confirmdate));
$doc->addField(
Zend_Search_Lucene_Field::Text('testcreateddate', $result->createddate));
// Add document to the index
$index->addDocument($doc);
}
// Optimize index.
$index->optimize();
// Search by query
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');
if(strlen($Data['name']) > 2){
//$query = Zend_Search_Lucene_Search_QueryParser::parse($Data['name'].'*');
$pattern = new Zend_Search_Lucene_Index_Term($Data['name'].'*');
$query = new Zend_Search_Lucene_Search_Query_Wildcard($pattern);
$this->view->hits = $index->find(strtolower($query));
}
else{
$query = $Data['name'];
$this->view->hits = $index->find($query);
}
............
Works fine here:
It works when I give complete word, first 3 character, case insensitive words
My issues are:
When I search for email, i got error like "Wildcard search is supported only for non-multiple word terms "
When I search for number/date like "1234" or 09/06/2011, I got error like "At least 3 non-wildcard characters are required at the beginning of pattern"
I want to search date, email, number here.

In file zend/search/Lucene/search/search/query/wildcard a parameter is set,
private static $_minPrefixLength = 3;
chnage it and it may work..!

Based on NaanuManu's suggestion, I did a little more digging to figure this out - I posted my answer on a related question here, but repeating for convenience:
Taken directly from the Zend Reference documentation, you can use:
Zend_Search_Lucene_Search_Query_Wildcard::getMinPrefixLength() to
query the minimum required prefix length and
use Zend_Search_Lucene_Search_Query_Wildcard::setMinPrefixLength() to
set it.
So my suggestion would be either of two things:
Set the prefixMinLength to 0 using Zend_Search_Lucene_Search_Query_Wildcard::setMinPrefixLength(0)
Validate all search queries using javascript or otherwise to ensure there is a minimum of Zend_Search_Lucene_Search_Query_Wildcard::getMinPrefixLength() before any wildcards used (I recommend querying that instead of assuming the default of "3" so the validation is flexible)

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse