Fuzzy searching with MongoDB, why does /1.0/ match 100.0? [duplicate] - mongodb

This question already has an answer here:
Why do escape characters in regex mismatch?
(1 answer)
Closed 3 years ago.
I have implemented fuzzy searching using a regex, as shown below. I only want to get '1.0' and '1.01', but the results also include values such as '100', '100.10' and '110.11'. Why does 1.0 match 100 and 100.10? How can I get only 1.0 and 1.01?
db.getCollection("CE").find({
    "ID": /1.0/
});

In a regex, . matches any character, so 1.0 means a 1, followed by any character, followed by a 0. So 100, 1.0, 1a0, etc. are all valid matches.
What you need to do is escape the dot with a backslash: replace your regex with 1\.0, or with ^1\.0 if you only want to match strings that start with 1.0.
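The difference between the unescaped and the escaped dot can be checked outside MongoDB. The following is a minimal sketch in Python, assuming the five ID values quoted in the question; the list and variable names are illustrative only, and Python's `re` module stands in for the regex engine MongoDB uses.

```python
import re

# ID values quoted in the question (assumed sample data).
ids = ["1.0", "1.01", "100", "100.10", "110.11"]

unescaped = re.compile(r"1.0")   # '.' matches any single character
escaped = re.compile(r"^1\.0")   # literal dot, anchored at the start

# search() looks for the pattern anywhere in the string, like an unanchored query regex.
loose = [s for s in ids if unescaped.search(s)]
strict = [s for s in ids if escaped.search(s)]

print(loose)   # every value contains "1<any char>0"
print(strict)  # only the values that start with a literal "1.0"
```

With the escaped, anchored pattern only '1.0' and '1.01' remain, which is the result the question asks for.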
This question has probably been answered many times, please feel free to delete the question.

Related

How to match the first word after a specific expressions with regex? [duplicate]

This question already has answers here:
Python Regex Engine - "look-behind requires fixed-width pattern" Error
(3 answers)
Regex to get the word after specific match words
(5 answers)
Closed 8 months ago.
My regex below matches the first word after a single keyword, "let":
(?<=\blet\s)(\w+)
What I need is to match the first word after any of several keywords: "let", "var" or "func".
Input text:
let name: String
var age: Int
func foo() {
//...
Expected:
name
age
foo
Some regex flavors do not allow groups inside a lookbehind, or alternatives of different lengths inside one. It is therefore safer to put three separate lookbehinds into a non-capturing group as alternatives:
(?:(?<=\blet\s)|(?<=\bvar\s)|(?<=\bfunc\s))\w+
Here, (?:...|...|...) is a non-capturing group matching one of the three alternatives: (?<=\blet\s), (?<=\bvar\s) and (?<=\bfunc\s).
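Python's `re` module is one of the flavors with this restriction, so it makes a convenient test bed. The sketch below is illustrative only: it runs the pattern above on the sample input from the question, then tries a single lookbehind with alternatives of different lengths to show that it is rejected.

```python
import re

# Sample input from the question.
text = "let name: String\nvar age: Int\nfunc foo() {\n"

# Three fixed-width lookbehinds as alternatives inside a non-capturing group.
pattern = re.compile(r"(?:(?<=\blet\s)|(?<=\bvar\s)|(?<=\bfunc\s))\w+")
words = pattern.findall(text)
print(words)

# One lookbehind containing alternatives of different lengths (3 vs 4 characters).
try:
    re.compile(r"(?<=\b(?:let|var|func)\s)\w+")
    single_lookbehind_ok = True
except re.error:
    # Python reports "look-behind requires fixed-width pattern".
    single_lookbehind_ok = False
print(single_lookbehind_ok)
```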

How to capture text within a negate class character using perl [duplicate]

This question already has answers here:
missing last character in perl regex
(3 answers)
Closed 7 years ago.
My first problem is, I need to search for two consecutive sets of parenthesis, for example,
(log dkdkdkd) (log edksks)
This code below solves this first problem:
^\([^)]*\) \([^)]*\)$
My second problem is that, in addition to the solution above, I need to capture the text after "log", something like this:
^\(log (.*)\) \(log (.*)\)$
But this solution does not work, because it also matches input with more than two sets of parentheses, for example:
(log dkdkdkd) (log edksks) (log riwqoq)
What I really need is to match exactly two consecutive sets of parentheses while capturing the text after "log" in each.
You can tell Perl that the text after "log" doesn't contain (:
/^\(log ([^(]*)\)\s*\(log ([^(]*)\)$/
Or, if ( is possible and you only want to exclude (log, check that in two steps:
use feature 'say';

for my $s ('(log this) (log matches)',
           '(log these) (log do) (log not)'
          ) {
    my @matches = $s =~ /^\(log (.*)\)\s*\(log (.*)\)$/;
    next if $matches[0] =~ /\(log /;  # More than 2 logs, skip.
    say "($_)" for @matches;
}
You can use a non-greedy search:
/^\(log (.*?)\) \(log (.*?)\)/
Saying .*? instead of .* makes the regex engine try to match the pattern with the minimum number of characters necessary.
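To compare the suggestions above, here is a small sketch in Python (the patterns use the same syntax in Python's `re` as in Perl; the variable names are illustrative only). It runs the greedy pattern from the question, the negated-class pattern, and the non-greedy pattern against the two-set and three-set inputs.

```python
import re

two = "(log dkdkdkd) (log edksks)"
three = "(log dkdkdkd) (log edksks) (log riwqoq)"

greedy = re.compile(r"^\(log (.*)\) \(log (.*)\)$")            # pattern from the question
negated = re.compile(r"^\(log ([^(]*)\)\s*\(log ([^(]*)\)$")   # captures may not contain "("
lazy = re.compile(r"^\(log (.*?)\) \(log (.*?)\)")             # non-greedy, no end anchor

# Greedy: the first capture swallows the middle set of parentheses.
greedy_groups = greedy.match(three).groups()

# Negated class: two sets match, three sets are rejected outright.
negated_two = negated.match(two).groups()
negated_three = negated.match(three)

# Non-greedy without "$": captures are clean, but the three-set input still
# matches because the pattern stops after the second set.
lazy_groups = lazy.match(three).groups()

print(greedy_groups)
print(negated_two, negated_three)
print(lazy_groups)
```

In this sketch only the negated-class pattern rejects the three-set input. The non-greedy pattern yields clean captures but, lacking an end anchor, accepts the first two sets of a longer line, so it fits best when longer input is acceptable or is filtered separately.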

Print a substring of an array value [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 8 years ago.
I have an array whose third element, dataArr[2], contains a 10-digit phone number. I need to read or print only the first 6 digits. For instance, if the phone number is 8329001111, I need to print 832900. I tried using substr, but I keep getting the full value. Do I need to dereference it?
Try this:
$dataArr[2] =~ s/\s//g; # ensure there are no spaces
print substr($dataArr[2], 0, 6);
#            ^            ^  ^
#            variable     |  |
#                offset start|
#                            |
#                 substring length
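For comparison, the same two steps (strip whitespace, then take the first six characters) can be sketched in Python. The function name `first_six_digits` is hypothetical and only serves the illustration.

```python
import re

def first_six_digits(phone: str) -> str:
    """Remove all whitespace, then return the first six characters."""
    digits = re.sub(r"\s", "", phone)  # same effect as s/\s//g in the Perl answer
    return digits[:6]                  # same effect as substr($x, 0, 6)

print(first_six_digits("8329001111"))
```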

Formatting dates [duplicate]

This question already has answers here:
Pattern matching dates
(4 answers)
Closed 9 years ago.
April 9, 2012 can be written in any of these ways:
4912
4/9/12
4-9-12
4 9 12
04-9-12
04-09-12
4 9 2012
4 09 2012
(I think you get the point)
For those of you that don't understand, the rules are:
1. Dates may or may not have ` `, `-` or `/` between them
2. The year can be written as 2 digits (assumed to be dates in the range of [2000, 2099] inclusive) or 4 digits
3. One digit month/days may or may not have leading zeroes.
How would you go about problem solving this to format the dates into 04/09/12?
I know the dates can be ambiguous, i.e., 12112 can be 12/1/12 or 1/21/12, but assume the smallest month possible.
This actually is something that regexes are good at; making an assumption, moving forward with it, then backtracking if necessary to get a successful match.
s{
    \A
    ( 1[0-2] | 0?[1-9] )
    [-/ ]?
    ( 3[01] | [12][0-9] | 0?[1-9] )
    [-/ ]?
    ( (?: [0-9]{2} ){1,2} )
    \z
}
{
    sprintf '%02u/%02u/%04u', $1, $2, ( length $3 == 4 ? $3 : 2000+$3 )
}xe;
The range checks in the pattern (months 1 to 12, days 1 to 31) do not depend on the value of the month, but they should be sufficient to pick a good date from the ambiguous cases (where there is a good date).
Note that it is important to try two digit month and days first; otherwise 111111 becomes 1-1-1111, not the presumably intended 11-11-11. But this means 11111 will prefer to be 11-1-11, not 1-11-11.
If a valid day of month check is needed, it should be performed after reformatting.
Notes:
s{}{} is a substitution using curly braces instead of / to delimit the parts of the regex to avoid having to escape the /, and also because using paired delimiters allows opening and closing both the pattern and replacement parts, which looks nice to me.
\A matches the start of the string being matched; \z matches the end. ^ and $ are often used for this, but can have slightly different meanings in some cases; I prefer these since they always only mean one thing.
The x flag on the end says this is an extended regex that can have extra whitespace or comments that are ignored, so that it is more readable. (Whitespace inside a character class isn't ignored.) The e flag says the replacement part isn't a string, it is code to execute.
'%02u/%02u/%04u' is a printf format, used for taking values and formatting them in a particular way; see http://perldoc.perl.org/functions/sprintf.html. Note that it writes the year with four digits.
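The backtracking behaviour described above can be tried out with a rough Python port of the substitution. This is a sketch under stated assumptions, not the answer's own code: the names `DATE_RE` and `normalize` are hypothetical, `re.VERBOSE` plays the role of Perl's x flag, and `\Z` is Python's spelling of Perl's `\z`. Like the Perl version, it emits a four-digit year.

```python
import re

DATE_RE = re.compile(
    r"""
    \A
    ( 1[0-2] | 0?[1-9] )              # month: two-digit form is tried first
    [-/ ]?                            # optional separator (space kept inside the class)
    ( 3[01] | [12][0-9] | 0?[1-9] )   # day: two-digit forms are tried first
    [-/ ]?
    ( (?: [0-9]{2} ){1,2} )           # year: two or four digits
    \Z
    """,
    re.VERBOSE,
)

def normalize(text):
    """Return MM/DD/YYYY, or None if the text does not look like a date."""
    m = DATE_RE.match(text)
    if m is None:
        return None
    month, day, year = (int(g) for g in m.groups())
    if len(m.group(3)) == 2:
        year += 2000  # two-digit years are assumed to lie in 2000-2099
    return f"{month:02d}/{day:02d}/{year:04d}"

for sample in ["4912", "4/9/12", "04-09-12", "4 09 2012", "111111", "11111"]:
    print(sample, "->", normalize(sample))
```

As in the Perl original, no check is made that the day exists in the given month; that validation would have to happen after reformatting.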
Install Date::Calc (on Ubuntu, the package is libdate-calc-perl).
It should be able to read in all of those dates (except 4912, 4 9 2012 and 4 09 2012) and then output them in a common format.

Should I use sed, awk, perl, for altering text spanning multiple lines and selecting only the info needed? [closed]

Closed 9 years ago.
I'm working on a project for class where we take a file full of lines describing classes like the one below
CSC 1010 - COMPUTERS & APPLICATIONS
Computers and Applications. Prerequisite: high school Algebra II. History of computers, hardware components, operating systems, applications software, data communication.
3.000 Credit hours
and turn it into
CSC1010,COMPUTERS & APPLICATIONS,3
I used:
sed -n 's/^CSC /CSC/p' courses.txt > practice.txt
which outputs:
CSC1010 - COMPUTERS & APPLICATIONS
CSC1310 - INTRO COMP PROGRAMMING NON-MAJ
CSC2010 - INTRO TO COMPUTER SCIENCE
CSC2310 - PRIN OF COMPUTER PROGRAMMING
CSC2320 - FUND OF WEBSITE DEVELOPMENT
CSC2510 - THEOR FOUNDATIONS OF COMP SCI
CSC3010 - HISTORY OF COMPUTING
CSC3210 - COMPUTER ORG & PROGRAMMING
CSC3320 - SYSTEM-LEVEL PROGRAMMING
CSC3330 - C++ PROGRAMMING
CSC3410 - DATA STRUCTURES-CTW
CSC4110 - EMBEDDED SYSTEMS
CSC4120 - INTRODUCTION TO ROBOTICS
and I also used:
sed -n 's/\.000 Credit hours//p' courses.txt > courses10.txt
which outputs:
3
3
3
3
3
3
3
3
3
3
3
4
4
4
4
4
4
4
My problem is deciding whether sed, awk, or Perl would be better. So far I've used sed to eliminate the lines that are neither a course title nor a number of credit hours, as shown above. I was hoping to use a regular expression to go through the file and pick out each line that starts with "CSC" or contains ".000 Credit hours". I figured that once I had that output, I could use a sed command to remove the newline from the end of each line starting with CSC and replace it with a comma. After that I would replace the " - " separator with a comma. However, to do that I think I would need an extended regular expression, so sed would probably be out. The regular expression I was considering is (^CSC |[0-9]\.000). So, should I be doing this in sed, awk, or Perl? Please include your reasoning as to why the method you suggest would be more efficient.
In Perl:
while (<>) {
chomp;
print if s/^CSC\s+/CSC/ and s/\s+-\s+/,/;
printf ",%.0f\n", $1 if /^([\d.]+)\s+Credit hours/;
}
I'd go with awk because you want to match and reformat lines and awk is perfect for this:
/CSC/ {                            # Lines that match CSC
    split($0,a,"- ")               # Split the line around the hyphen and following space
    gsub(/ /,"",a[1])              # Remove the spaces from the first part of the split
    printf "%s,%s", a[1], a[2]     # Print the line in required format (data is not used as the format string)
}
/Credit hours/ {                   # Lines that match Credit hours
    printf ",%i\n",$1              # Print the integer value of credit hours
}
Demo:
awk '/CSC/{split($0,a,"- ");gsub(/ /,"",a[1]);printf "%s,%s",a[1],a[2]}/Credit hours/{printf ",%i\n",$1}' file
CSC1010,COMPUTERS & APPLICATIONS,3
I prefer awk to Perl here, although Perl has no real advantage (or disadvantage) for this task. Using sed would be a regexp hack, so I'd stay away from a sed solution.
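The logic shared by the Perl and awk answers (remember the most recent course line, then emit it when the matching credit-hours line arrives) can also be written out step by step. The sketch below uses Python purely as an illustration; it is not one of the tools the question asks about, and the function name `convert` and its patterns are assumptions modelled on the sample record.

```python
import re

def convert(lines):
    """Turn course records into 'CSC1010,TITLE,3' style rows."""
    rows = []
    current = None  # most recent course line, waiting for its credit hours
    for line in lines:
        line = line.strip()
        title = re.match(r"CSC\s+(\d+)\s+-\s+(.*)$", line)
        credit = re.match(r"(\d+(?:\.\d+)?)\s+Credit hours", line)
        if title:
            current = f"CSC{title.group(1)},{title.group(2)}"
        elif credit and current is not None:
            rows.append(f"{current},{float(credit.group(1)):.0f}")
            current = None
    return rows

# Sample record from the question (description shortened).
sample = [
    "CSC 1010 - COMPUTERS & APPLICATIONS",
    "Computers and Applications. Prerequisite: high school Algebra II.",
    "3.000 Credit hours",
]
print(convert(sample))
```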