Refine a string to only certain values scala - scala

Is there any way to refine a string to only a certain subset of values? For example, I have a list of 500 keys in a hash map. But I only want certain keys to be inserted. For example, "abcd" and "aaaa" are valid keys but "abdc" is invalid. Is there any way to refine the String to only one of the given 500 keys?
I'm guessing the way to do this is just a very long regexp that matches abcd|aaaa?
Edit: Using the fthomas/refined library specifically the MatchesRegex function. Want to know if there is a better approach that I'm missing out on.

Scala 3 seems to Allow Singletons in Unions #6299 like so
val refinedString: "abcd" | "aaaa" = "aaaa"
whilst abdc would result in the following error
val refinedString: "abcd" | "aaaa" = "abdc"
^^^^^^
Found: String("abdc")
Required: String("abcd") | String("aaaa")
It worked for me with Dotty Scala version 0.15.0-bin-20190517-fb6667b-NIGHTLY.

I ended up actually just using generated source code that has every known key inside a MatchesRegex (a|b..) construct for 500 keys. It works. It's not pretty but it's also generated source code that I don't have to deal with so it's okay, I guess.

Related

Match part of a string with regex

I have two arrays of strings and I want to check if a string of array a matches a string from array b. Those strings are phone numbers that might come in different formats. For example:
Array a might have a phone number with prefix like so +44123123123 or 0044123123123
Array b have a standard format without prefixes like so 123123123
So I'm looking for a regex that can match a part of a string like +44123123123 with 123123123
Btw I'm using Swift but I don't think there's a native way to do it (at least a more straightforward solution)
EDIT
I decided to reactivate the question after experimenting with the library #Larme mentioned because of inconsistent results.
I'd prefer a simper solution as I've stated earlier.
SOLUTION
Thanks guys for the responses. I saw many comments saying that Regex is not the right solution for this problem. And this is partly true. It could be true (or false) depending on my current setup/architecture ( which thinking about it now I realise that I should've explained better).
So I ended up using the native solution (hasSuffix/contains) but to do that I had to do some refactoring on the way the entire flow was structured. In the end I think it was the least complicated solution and more performant of the two. I'll give the bounty to #Alexey Inkin for being the first to mention the native solution and the right answer to #Ωmega for providing a more complete solution.
I believe regex is not the right approach for this task.
Instead, you should do something like this:
var c : [String] = b.filter ({ (short : String) -> Bool in
var result = false
for full in a {
result = result || full.hasSuffix(short)
}
return result
})
Check this demo.
...or similar solution like this:
var c : [String] = b.filter ({ (short : String) -> Bool in
for full in a {
if full.hasSuffix(short) { return true }
}
return false
})
Check this demo.
As you do not mention requirements to prefixes, the simplest solution is to check if string in a ends with a string in b. For this, take a look at https://developer.apple.com/documentation/swift/string/1541149-hassuffix
Then, if you have to check if the prefix belongs to a country, you may replace ^00 with + and then run a whitelist check against known prefixes. And the prefix itself can be obtained as a substring by cutting b's length of characters. Not really a regex's job.
I agree with Alexey Inkin that this can also nicely be solved without regex. If you really want a regex, you can try something like the following:
(?:(\+|00)(93|355|213|1684|376))?(\d+)
^^^^^^^^^^^^^^^^^^^^^ Add here all your expected country prefixes (see below)
^^^ ^^ Match a country prefix if it exists but don't give it a group number
^^^^^^^ Match the "prefix-prefix" (+ or 00)
^^^^ Match the local phone number
Unfortunatly with this regex, you have to provide all the expected country prefixes. But you can surely get this list online, e.g. here: https://www.countrycode.org
With this regex above you will get the local phone number in matching group 3 (and the "prefix-prefix" in group 1 and the country code in group 2).

Removing Data Type From Tuple When Printing In Scala

I currently have two maps: -
mapBuffer = Map[String, ListBuffer[(Int, String, Float)]
personalMapBuffer = Map[mapBuffer, String]
The idea of what I'm trying to do is create a list of something, and then allow a user to create a personalised list which includes a comment, so they'd have their own list of maps.
I am simply trying to print information as everything is good from the above.
To print the Key from mapBuffer, I use: -
mapBuffer.foreach(line => println(line._1))
This returns: -
Sample String 1
Sample String 2
To print the same thing from personalMapBuffer, I am using: -
personalMapBuffer.foreach(line => println(line._1.map(_._1)))
However, this returns: -
List(Sample String 1)
List(Sample String 2)
I obviously would like it to just return "Sample String" and remove the List() aspect. I'm assuming this has something to do with the .map function, although this was the only way I could find to access a tuple within a tuple. Is there a simple way to remove the data type? I was hoping for something simple like: -
line._1.map(_._1).removeDataType
But obviously no such pre-function exists. I'm very new to Scala so this might be something extremely simple (which I hope it is haha) or it could be a bit more complex. Any help would be great.
Thanks.
What you see if default List.toString behaviour. You build your own string with mkString operation :
val separator = ","
personalMapBuffer.foreach(line => println(line._1.map(_._1.mkString(separator))))
which will produce desired result of Sample String 1 or Sample String 1, Sample String 2 if there will be 2 strings.
Hope this helps!
I have found a way to get the result I was looking for, however I'm not sure if it's the best way.
The .map() method just returns a collection. You can see more info on that here:- https://www.geeksforgeeks.org/scala-map-method/
By using any sort of specific element finder at the end, I'm able to return only the element and not the data type. For example: -
line._1.map(_._1).head
As I was writing this Ivan Kurchenko replied above suggesting I use .mkString. This also works and looks a little bit better than .head in my mind.
line._1.map(_._1).mkString("")
Again, I'm not 100% if this is the most efficient way but if it is necessary for something, this way has worked for me for now.

Gremlin: Is there a way to find the character based on the index of a string?

I have vertex "office" and property "name" on OrientDB. I want to find the offices, by name, where the name does not have a "-" as the third character of the string. I imagine this would require some java code within the gremlin query.This is my best attempt, but it is resulting in office names that do in fact have a "-" as their third character.
g.V().hasLabel('office')
.where(values('name').map{it.get().charAt(2)}.is(neq('-')))
.project('Office Name')
.by(values('name'))
Since Gremlin doesn't support String operations (like split, charAt, etc.), your only chance is a lambda. Seems like you figured that out already, but your solution looks too overcomplicated to me. You can use something much simpler, like:
g.V().hasLabel('office').
has('name', filter {it.get()[2] != '-'}).
project('Office Name').
by('name')
However, note, that this filter will throw an exception if the office namer has less than 3 characters. Thus, you should better check that the String is long enough:
g.V().hasLabel('office').
has('name', filter {it.get().length() > 2 && it.get()[2] != '-'}).
project('Office Name').
by('name')
...or use RegEx pattern matching (which is pretty nice and easy in Groovy):
g.V().hasLabel('office').
has('name', filter {it.get() ==~ /.{2}-.*/}).
project('Office Name').
by('name')
The main reason why your traversal didn't work though, is that charAt returns a Character which is then compared to the String -, hence every office name will pass the neq filter.

dataFrame keying using pandas groupby method

I new to pandas and trying to learn how to work with it. Im having a problem when trying to use an example I saw in one of wes videos and notebooks on my data. I have a csv file that looks like this:
filePath,vp,score
E:\Audio\7168965711_5601_4.wav,Cust_9709495726,-2
E:\Audio\7168965711_5601_4.wav,Cust_9708568031,-80
E:\Audio\7168965711_5601_4.wav,Cust_9702445777,-2
E:\Audio\7168965711_5601_4.wav,Cust_7023544759,-35
E:\Audio\7168965711_5601_4.wav,Cust_9702229339,-77
E:\Audio\7168965711_5601_4.wav,Cust_9513243289,25
E:\Audio\7168965711_5601_4.wav,Cust_2102513187,18
E:\Audio\7168965711_5601_4.wav,Cust_6625625104,-56
E:\Audio\7168965711_5601_4.wav,Cust_6073165338,-40
E:\Audio\7168965711_5601_4.wav,Cust_5105831247,-30
E:\Audio\7168965711_5601_4.wav,Cust_9513082770,-55
E:\Audio\7168965711_5601_4.wav,Cust_5753907026,-79
E:\Audio\7168965711_5601_4.wav,Cust_7403410322,11
E:\Audio\7168965711_5601_4.wav,Cust_4062144116,-70
I loading it to a data frame and the group it by "filePath" and "vp", the code is:
res = df.groupby(['filePath','vp']).size()
res.index
and the output is:
[E:\Audio\7168965711_5601_4.wav Cust_2102513187,
Cust_4062144116, Cust_5105831247,
Cust_5753907026, Cust_6073165338,
Cust_6625625104, Cust_7023544759,
Cust_7403410322, Cust_9513082770,
Cust_9513243289, Cust_9702229339,
Cust_9702445777, Cust_9708568031,
Cust_9709495726]
Now Im trying to approach the index like a dict, as i saw in examples, but when im doing
res['Cust_4062144116']
I get an error:
KeyError: 'Cust_4062144116'
I do succeed to get a result when im putting the filepath, but as i understand and saw in previouse examples i should be able to use the vp keys as well, isnt is so?
Sorry if its a trivial one, i just cant understand why it is working in one example but not in the other.
Rutger you are not correct. It is possible to "partial" index a multiIndex series. I simply did it the wrong way.
The index first level is the file name (e.g. E:\Audio\7168965711_5601_4.wav above) and the second level is vp. Meaning, for each file name i have multiple vps.
Now, this is correct:
res['E:\Audio\7168965711_5601_4.wav]
and will return:
Cust_2102513187 2
Cust_4062144116 8
....
but trying to index by the inner index (the Cust_ indexes) will fail.
You groupby two columns and therefore get a MultiIndex in return. This means you also have to slice using those to columns, not with a single index value.
Your .size() on the groupby object converts it into a Series. If you force it in a DataFrame you can use the .xs method to slice a single level:
res = pd.DataFrame(df.groupby(['filePath','vp']).size())
res.xs('Cust_4062144116', level=1)
That works. If you want to keep it as a series, boolean indexing can help, something like:
res[res.index.get_level_values(1) == 'Cust_4062144116']
The last option is a bit less readable, but sometimes also more flexibile, you could test for multiple values at once for example:
res[res.index.get_level_values(1).isin(['Cust_4062144116', 'Cust_6073165338'])]

whoosh doesn't search for short words like "C#"

i am using whoosh to index over 200,000 books. but i have encountered some problems with it.
the whoosh query parser returns NullQuery for words like "C#", "C++" with meta-characters in them and also for some other short words. this words are used in the title and body of some documents so i am not using keyword type for them. i guess the problem is in the analysis or query-parsing phase of searching or indexing but i can't touch my data blindly. can anyone help me to correct this issue. Tnx.
i fixed the problem by creating a StandardAnalyzer with a regex pattern that meets my requirements,here is the regex pattern:
'\w+[#+.\w]*'
this will make tokenizing of fields to be done successfully, and also the searching goes well.
but when i use queries like "some query++*" or "some##*" the parsed query will be a single Every query, just the '*'. also i found that this is not related to my analyzer and this is the Whoosh's default behavior. so here is my new question: is this behavior correct or it is a bug??
note: removing the WildcardPlugin from the query-parser solves this problem but i also need the WildcardPlugin.
now i am using the following code:
from whoosh.util import rcompile
#for matching words like: '.NET', 'C++' and 'C#'
word_pattern = rcompile('(\.|[\w]+)(\.?\w+|#|\+\+)*')
#i don't need words shorter that two characters so i don't change the minsize default
analyzer = analysis.StandardAnalyzer(expression=word_pattern)
... now in my schema:
...
title = fields.TEXT(analyzer=analyzer),
...
this will solve my first problem, yes. but the main problem is in searching. i don't want to let users to search using the Every query or *. but when i parse queries like C++* i end up an Every(*) query. i know that there is some problem but i can't figure out what it is.
I had the same issue and found out that StandardAnalyzer() uses minsize=2 by default. So in your schema, you have to tell it otherwise.
schema = whoosh.fields.Schema(
name = whoosh.fields.TEXT(stored=True, analyzer=whoosh.analysis.StandardAnalyzer(minsize=1)),
# ...
)