I need to formulate a set of rules and regular expressions in WKS to identify the song title and artist in sentences like "Play Locomotive Breath" or "Play the song Locomotive Breath by Jethro Tull" (I actually do this for German, so I can't use the built-in entities in NLU).
My problem is that the "by " part is optional. I have set up regular expressions for the German equivalents of "Play [the song]" (mapping to a class PLAY) and "by" (mapping to a class BY) and tried to add two rules, one matching
PLAY (any tokens) BY (any tokens)
and the other matching PLAY (any tokens).
The problem is that the second rule also matches when the first one does, so that in the sentence "Play Locomotive Breath by Jethro Tull" the title is recognized as "Locomotive Breath by Jethro Tull".
I tried to define a regular expression with negative lookahead, i.e. (\w* (?!(by)))* to match the text up to the "by", but this does not seem to work in WKS.
Any ideas how I can extract song titles and artists using WKS rules?
Regular expressions with negative lookahead should work fine in WKS.
I am not sure if your regular expression (\w* (?!(by)))* expresses what you intended. Is it working as you expected outside of WKS?
Maybe you meant something like ((?!(by))\w)* ?
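If you want to sanity-check the lookahead idea outside WKS first, here is a rough Python sketch. It uses English "play"/"by" as stand-ins for the German trigger words and a slightly different pattern that stops the title before a standalone "by", so treat it as illustrative only:

import re

# Hypothetical stand-in for the WKS rules: PLAY part, title, optional BY part.
pattern = re.compile(
    r"(?i)play(?:\s+the\s+song)?\s+"      # PLAY ("Play" / "Play the song")
    r"(?P<title>(?:(?!\bby\b).)+?)"       # title: stop before a standalone "by"
    r"(?:\s+by\s+(?P<artist>.+))?$"       # optional BY part with the artist
)

for text in ["Play Locomotive Breath",
             "Play the song Locomotive Breath by Jethro Tull"]:
    m = pattern.match(text)
    print(m.group("title"), "|", m.group("artist"))
# Locomotive Breath | None
# Locomotive Breath | Jethro Tull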
I'm deobfuscating some code and I've forgotten the operator to use a wildcard while searching for text in VSCode. By this I mean: in VSCode, whenever you search for code (CMD/CTRL + F), what is the character for a wildcard (i.e. searching for "date{WILDCARD HERE}" would return "date1", "date2", "date", etc.)?
I don't recall a wildcard option (I've never used it at least). But the search feature supports using regular expressions.
Given your examples of date1, date2, date, etc., and assuming they follow a pattern of date<n> where n is a number (or nothing in the case of just "date"), the regular expression date[1-9]* should achieve what you want.
You can test the expression out on this site. Input the regular expression and some sample data and see how it matches.
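If you want to check it locally first, the same pattern behaves identically in a quick Python test (illustrative only; the syntax used here is also valid in VSCode's regex search):

import re

pattern = re.compile(r"date[1-9]*")
for sample in ["date", "date1", "date2", "other"]:
    print(sample, bool(pattern.search(sample)))
# date True, date1 True, date2 True, other False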
Disclaimer: I have no engineering background whatsoever - please don't hold it against me ;)
What I'm trying to do:
Scan a bunch of text strings and find the ones that
are more than one word
contain title case (at least one capitalized word after the first one)
but exclude specific proper nouns that don't get checked for title case
and disregard any parameters in curly brackets
Example: Today, a Man walked his dogs named {FIDO} and {Fifi} down the Street.
Expectation: Flag the string for title capitalization because of Man and Street, not because of Today, {FIDO} or {Fifi}
Example: Don't post that video on TikTok.
Expectation: No flag because TikTok is a proper noun
I have bits and pieces, none of them error-free from what https://www.regextester.com/ keeps telling me, so I'm really hoping for help from this community.
What I've tried (piecemeal, but not all together):
(?=([A-Z][a-z]+\s+[A-Z][a-z]+))
^(?!(WordA|WordB)$)
^((?!{*}))
I think your problem is not really solvable solely with regex...
My recommendation would be to split the input via [\s\W]+ (e.g. with Python's re.split; if you really need strings with more than one word, you can check the length of the result), keep each resulting word whose first character is uppercase (e.g. with Python's str.isupper), and finally filter against a dictionary.
[\s\W]+ matches all whitespace and non-word characters, yielding words...
The reasoning behind this different approach: compiling all "proper nouns" into a regex is pretty much impossible, and using "isupper" also works with non-Latin letters (e.g. when your strings are Unicode, [A-Z] won't be sufficient to detect uppercase). Filtering against a dictionary is a much more straightforward approach and much easier to maintain (I would recommend using a set or another data type suited for fast lookups).
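A minimal sketch of that approach in Python might look like this (PROPER_NOUNS is a hypothetical placeholder you would fill with the names that should not trigger a flag):

import re

PROPER_NOUNS = {"TikTok"}

def flag_title_case(text):
    # Disregard anything inside curly brackets first.
    text = re.sub(r"\{[^}]*\}", " ", text)
    words = [w for w in re.split(r"[\s\W]+", text) if w]
    if len(words) < 2:
        return False
    # Flag if any word after the first starts with an uppercase letter
    # and is not a known proper noun.
    return any(w[0].isupper() and w not in PROPER_NOUNS for w in words[1:])

print(flag_title_case("Today, a Man walked his dogs named {FIDO} and {Fifi} down the Street."))  # True
print(flag_title_case("Don't post that video on TikTok."))  # False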
Maybe if you can define your use case more clearly, we can work out a pure regex solution...
I have a strange issue with lucene.net 2.9:
If I search for high-quality, it doesn't find any results. I found that the hyphenation character (-) is a problem for Lucene, so I searched for high quality instead and it worked perfectly.
When I search for 30-40 it shows results, but for 30 40 it doesn't show any.
The second scenario contradicts the first one.
I guess it is related to the text being numeric, but I didn't find anything about this on the web.
I'm guessing you are using StandardAnalyzer when indexing your terms, and then are searching without analysis in some form, or with a different form of analysis.
The 2.9 StandardAnalyzer (ClassicAnalyzer, as of version 3.1) has some interesting behavior around hyphens. To quote the StandardTokenizer documentation:
Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
So two hyphenated words (or any collection of letters) will be split into separate tokens, while any number thrown into the mix causes the whole thing to be interpreted as a product number and indexed as a single token, hyphens and all, so:
"high-quality" --> "high" and "quality"
"ab-cd" ---------> "ab" and "cd"
"30-40" ---------> "30-40"
"ab-c4" ---------> "ab-c4"
"30 40" ---------> "30" and "40"
So, if you construct a TermQuery for "high-quality" on such an analyzed field, you will get no results (though you would if using the QueryParser with the same analyzer). When searching for "30-40", the TermQuery for "30-40" will be an exact match, but matches will be found for neither "30" nor "40".
So I'm not sure how you are querying to run into the mismatch there (perhaps using StandardAnalyzer when indexing and WhitespaceAnalyzer when querying?), but hopefully that points you in the right direction.
You need to encode the "-" sign in the URL parameter. I think it will work fine.
Is there a Unicode symbol for "n/a"? There are some fractions like ½, but an n/a symbol seems to be missing.
If there is none, what would be the most appropriate Unicode symbol to use for n/a on a website (it should be contained in common fonts, to avoid needing a webfont)?
Looking at the Unicode code charts, I do not see a single N/A symbol. I do, however, see ⁿ (U+207F) and ₐ (U+2090), which you could separate with / (U+002F) eg: ⁿ/ₐ, or ̷ (U+0337), eg: ⁿ̷ₐ, or ̸ (U+0338), eg: ⁿ̸ₐ. Probably not what you are hoping for, though. And I don't know if "common" fonts implement them, either.
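If you want to check how those combinations render in a given font or environment, the strings are easy to produce, e.g. in Python (purely illustrative):

print("\u207f" + "/" + "\u2090")        # superscript n, solidus, subscript a
print("\u207f" + "\u0338" + "\u2090")   # using the combining long solidus overlay instead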
For future reference, the fastest way I know to answer questions like the OP's when I have them myself is to go to unicodelookup.com, because of the way it works: there's a search bar at the top, and you just type a string and it will return any and all Unicode characters whose names contain that string (this is also a great way to discover new and useful symbols). So in the OP's case, he could proceed like this:
first try entering "not" (without the quotes) in the search field
visually scan through the results... doing so would not reveal a "not applicable" character in this case
try again but this time entering "applic" in the search field
again, doing so would not turn up anything along the lines of what he's looking for
At that point he would be reasonably confident the current Unicode standard does not have a "n/a" symbol.
If you use Firefox you can define a keyword like "uni" to search that site from the URL bar, meaning any time the browser is open and regardless of what page or site is currently showing, you could do this:
hit [F6]... this moves the cursor to the URL bar at the top
type something like "uni applic" and hit [Enter]... this brings up the unicodelookup.com website with the search results for "applic" already showing
For the above to work you would need to define your keyword ("uni" or wtv you prefer) to point to location http://unicodelookup.com/#%s.
There's a Negative Acknowledge icon...
␕ SYMBOL FOR NEGATIVE ACKNOWLEDGE, U+2415 (decimal 9237, octal 022025)
Found by searching for "negative" on the Unicode Lookup site.
I'm not a fan, and for my purposes have just gone with __N/A__ (Markdown..)
I see lots of answers going head-on at the "Not Applicable" abbreviation, without exploring what a symbol is. A quick search for the equivalent phrase "out of scope" brings up a couple of variations on the No symbol: ⃠ – this seems to fit the bill (and since I was looking for a way to represent inapplicability, I'll be using it in my technical document).
Per the Wikipedia article, the Unicode codepoint U+20E0 is a combining character, so it is superimposed on the preceding character; e.g. ! ⃠ overlays an exclamation point. To get it to appear isolated, use a non-breaking space as the base character.
If you don't want to bother with the combining symbol, the article mentions there's also an emoji U+1F6AB 🚫 but it's typically going to be colored red, or won't render!
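If you do go the combining-character route, the non-breaking-space trick mentioned above is easy to try out (a quick illustration):

# U+20E0 is combining, so give it a harmless base character; a
# non-breaking space (U+00A0) keeps the enclosure off the real text.
print("\u00a0\u20e0")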
There's actually a single character that could be repurposed for this: the "Square Na" character ㎁ (U+3381), which is used to represent the nanoampere in fullwidth (CJK) scripts.
What about the "SYMBOL FOR NULL" ␀ (U+2400)?
Hi, I'm building a RESTful app and can't find the recommended way to express optional fuzzy or LIKE queries. For example, a strict query might be:
/place?city=New+York&state=NY
This corresponds to the SQL ... WHERE city="New York" AND state="NY".
But what if I wanted to search the city field for any row with "York" in the city name?
"... WHERE city LIKE "%{parameter}%" AND state="{parameter2}"
I'm thinking about just adding some kind of url-valid character to the request like this:
/place?city=*York*&state=NY
Is there an established or recommended pattern I should use? Thanks!
It's fine to use the query string for searching, but it's a little bit weird to use wildcard characters like "*" or "?" in the query string (unless you decide to build a really powerful search engine like Google). More importantly, search is usually considered fuzzy by default, so it's redundant to append/prepend "*" to the keyword. If you do need an exact search, you could surround the exact (or strict) keyword with double quotes. Namely, instead of using /place?city=*York*&state=NY, I recommend /place?city=York&state="NY".
In fact, Google uses quotes to search for an exact word or set of words, and I also found that this site follows this pattern.
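As a rough sketch of how a server might interpret that convention (names are hypothetical, and real code should use parameterized queries rather than string concatenation like this):

from urllib.parse import parse_qs

def build_where(query_string):
    # Quoted value = exact match, unquoted value = fuzzy LIKE match.
    clauses = []
    for field, values in parse_qs(query_string).items():
        value = values[0]
        if value.startswith('"') and value.endswith('"'):
            exact = value.strip('"')
            clauses.append(f"{field} = '{exact}'")
        else:
            clauses.append(f"{field} LIKE '%{value}%'")
    return " AND ".join(clauses)

print(build_where('city=York&state="NY"'))
# city LIKE '%York%' AND state = 'NY'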