Creating custom analyzers using Whoosh

I am trying to implement a semantic search engine with a deep NLP pipeline using Whoosh. Currently I just have a stemming analyzer, but I need to add lemmatization and POS tagging to my analyzers.
schema = Schema(id=ID(stored=True, unique=True), stem_text=TEXT(stored= True, analyzer=StemmingAnalyzer()))
I want to know how to add custom analyzers to my schema.

You can write a custom lemmatization filter and integrate it into an existing Whoosh analyzer. Quoting from the Whoosh docs:
Whoosh does not include any lemmatization functions, but if you have
separate lemmatizing code you could write a custom
whoosh.analysis.Filter to integrate it into a Whoosh analyzer.
You can create an analyzer by combining a tokenizer with filters:
my_analyzer = RegexTokenizer() | LowercaseFilter() | StopFilter() | LemmatizationFilter()
or by adding a filter to an existing analyzer:
my_analyzer = StandardAnalyzer() | LemmatizationFilter()
You can define a filter by subclassing whoosh.analysis.Filter and overriding __call__:
class LemmatizationFilter(Filter):
    def __call__(self, tokens):
        for token in tokens:
            # replace token.text with its lemma here
            yield token

Related

Is it possible to deactivate the autogenerated Drop?

I'm testing Alembic for a Python project. The autogeneration is really nice, but the autogenerated drops are not helpful if you need to work on customer databases with many different versions, for example.
Being able to activate or deactivate dropping for different scenarios would be the best solution.
I made my own configuration in env.py so I can use more than one base script. But if I add a new script (defining a new table) and autogenerate a migration script, the migration contains an autogenerated drop of all previously migrated tables.
I already looked at the mako file. Is it possible to integrate a restriction in the mako file?
I found a way to filter the migration operations list: hand a filter function over to the "process_revision_directives" config flag (all configuration lives in env.py).
from alembic.operations import ops
def process_revision_directives(context, revision, directives):
    script = directives[0]
    # process both "def upgrade()" and "def downgrade()"
    for directive in (script.upgrade_ops, script.downgrade_ops):
        # rewrite the list of "ops" so that DropColumnOp and DropTableOp
        # are removed; needs a recursive function
        directive.ops = list(_filter_drop_elm(directive.ops))

def _filter_drop_elm(directives):
    # filter Drop ops out of the list of directives and yield the result
    for directive in directives:
        if isinstance(directive, ops.DropTableOp):
            continue
        elif isinstance(directive, ops.ModifyTableOps):
            # ModifyTableOps is a container of ALTER TABLE types of
            # commands; process those in place recursively
            directive.ops = list(_filter_drop_elm(directive.ops))
            if not directive.ops:
                continue
        elif isinstance(directive, ops.AlterTableOp) and isinstance(directive, ops.DropColumnOp):
            continue
        # otherwise, if not filtered, yield out the directive
        yield directive
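The hook is registered in env.py via context.configure; a sketch of the relevant part (the connection and metadata names are placeholders that depend on your setup):

```python
# in env.py -- pass the hook to context.configure (sketch; other
# arguments depend on your configuration)
context.configure(
    connection=connection,
    target_metadata=target_metadata,
    process_revision_directives=process_revision_directives,
)
```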

How to turn OFF schema validation dynamically using the MongoDB Java API

I have created a collection with schema validation as below,
ValidationOptions collOptions = new ValidationOptions();
collOptions.validator(sdoc);
collOptions.validationLevel(ValidationLevel.MODERATE);
collOptions.validationAction(ValidationAction.WARN);
srdmDatabase.createCollection(collectionName,new CreateCollectionOptions().validationOptions(collOptions));
My collection is created successfully with schema validation.
In some cases, I want to turn OFF the validation check dynamically.
I found that there is an option to turn OFF validation (ValidationLevel.OFF) in the mongo-java-driver, but I have no idea how to use it.
Could someone please help me turn off the validation check programmatically?
We are using MongoDB-4.0 and mongo-java-driver-3.10.2.
Thanks in advance.
You can try using the following code to bypass the validation.
For updates
collection.updateOne(
    Filters.eq("_id", 1),
    Updates.set("name", "Fresh Breads and Tulips"),
    new UpdateOptions().upsert(true).bypassDocumentValidation(true));
Similarly, for inserts you can use InsertOneOptions or InsertManyOptions with bypassDocumentValidation(true).
Refer to this link for more info: https://docs.mongodb.com/manual/core/schema-validation/#bypass-document-validation
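To actually switch validation off at runtime (rather than bypassing it per operation), the server-side collMod command sets validationLevel to "off". A driver-agnostic sketch that just builds the command document (the collection name is hypothetical); in the Java driver you would pass the equivalent Document to database.runCommand(...):

```python
# Build the collMod command that disables schema validation
# for an existing collection.
def coll_mod_validation_off(collection_name):
    return {"collMod": collection_name, "validationLevel": "off"}

cmd = coll_mod_validation_off("myCollection")
# In Java (hypothetical wiring): database.runCommand(new Document(...))
```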

How to specify Scalastyle rules for a specific file/directory?

I have some Cucumber step-definition files with lines that are more than 120 characters long, and I want to exclude all stepDef files from the Scalastyle warning.
Is there a way to exclude specific files/directories, using an XML tag, only for the FileLineLengthChecker condition?
Wrapping the entire file in the following comment filter effectively excludes it from the FileLineLengthChecker rule:
// scalastyle:off line.size.limit
val foobar = 134
// scalastyle:on line.size.limit
line.size.limit is the ID of the FileLineLengthChecker rule.
Multiple rules can be switched off simultaneously like so:
// scalastyle:off line.size.limit
// scalastyle:off line.contains.tab
// ...

How to define source fields in Scalding

I was working with Cascading until a month ago. Now we are trying to implement the same in Scalding. I have one basic question:
How can I define my source and sink schemas in Scalding?
Below is the procedure that we followed in Cascading
SrcFields sourcefields = new SrcFields();
SinkFields sinkfields = new SinkFields();
Fields source = sourcefields.sourceFields();
Fields sink = sinkfields.sinkfields();
Scheme sourceScheme = new TextDelimited(source,",");
Scheme sinkScheme = new TextDelimited(sink,",");
In Scalding, you can use either the Fields-based or the Typed interface, as described in the Source documentation. In the former, you would use the Csv or Tsv classes to read or write.
For the Typed interface, you would use the TypedCsv or TypedTsv classes.
You can find examples in the Scalding tutorial: https://github.com/twitter/scalding/blob/develop/tutorial/Tutorial6.scala and https://github.com/twitter/scalding/blob/develop/tutorial/TypedTutorial.scala

Get assembly metadata in NDepend

I am trying to create a CQLinq query that selects the assemblies created by my company.
In my opinion the easiest way is to check the data generated in AssemblyInfo, but I cannot find how to access it in CQL.
What about the code query:
from a in Application.Assemblies
where a.Name.StartsWith("YourCompany.YourProduct")
select a
Or do you need something more sophisticated?
Ok, what about getting inspiration from this default rule:
// <Name>UI layer shouldn't use directly DB types</Name>
warnif count > 0

// UI layer is made of types in namespaces using a UI framework
let uiTypes = Application.Namespaces.UsingAny(Assemblies.WithNameIn("PresentationFramework", "System.Windows", "System.Windows.Forms", "System.Web")).ChildTypes()

// You can easily customize this line to define what the DB types are.
let dbTypes = ThirdParty.Assemblies.WithNameIn("System.Data", "EntityFramework", "NHibernate").ChildTypes()
   // Ideally even DataSet and associated usage should be forbidden from the UI layer:
   // http://stackoverflow.com/questions/1708690/is-list-better-than-dataset-for-ui-layer-in-asp-net
   .Except(ThirdParty.Types.WithNameIn("DataSet", "DataTable", "DataRow"))

from uiType in uiTypes.UsingAny(dbTypes)
let dbTypesUsed = dbTypes.Intersect(uiType.TypesUsed)
select new { uiType, dbTypesUsed }
Of course the sets uiTypes and dbTypes must be refined with assemblies from level N and assemblies from level N+1.