Using ORACLE-LIKE like feature between Spark DataFrame and a List of words - Scala - scala

My requirement is similar to one in :
LINK
Instead of direct match I need LIKE type match on a list. i.e Want to LIKE match COMMENTS with List
ID,COMMENTS
1,bad is he
2,hell thats good
3,sick !thats hell
4,That was good
List = ('good','horrible','hell')
I want to get output like
ID, COMMENTS,MATCHED_WORD,NUM_OF_MATCHES
1,bad is he,,
2,hell thats good,(hell,good),2
3,sick !thats hell,hell,1
4,That was good,good,1
In simpler terms I need : ( rlike isn't matching values from a list instead expects one single string , as far I know it)
file.select($"COMMENTS",$"ID").filter($"COMMENTS".rlike(List_ :_*)).show()
I tried isin , that works but matches WHOLE WORDS ONLY.
file.select($"COMMENTS",$"ID").filter($"COMMENTS".isin(List_ :_*)).show()
Kindly help or please re-direct to me any links as I tried lot of searching !

With simple words I'd use an alternative:
val xs = Seq("good", "horrible", "hell")
df.filter($"COMMENTS".rlike(xs.mkString("|"))
otherwise:
df.filter(xs.foldLeft(lit(false))((acc, x) => acc || $"COMMENTS".rlike(x)))

Related

Check for list of substrings inside string column in PySpark

For checking if a single string is contained in rows of one column. (for example, "abc" is contained in "abcdef"), the following code is useful:
df_filtered = df.filter(df.columnName.contains('abc'))
The result would be for example "_wordabc","thisabce","2abc1".
How can I check for multiple strings (for example ['ab1','cd2','ef3']) at the same time?
I'm ideally searching for something like this:
df_filtered = df.filter(df.columnName.contains(['word1','word2','word3']))
The result would be for example "x_ab1","_cd2_","abef3".
Please, post scalable solutions (no for loops, for example) because the aim is to check a big list around 1000 elements.
All you need is isin
df_filtered = df.filter(df['columnName'].isin('word1','word2','word3')
Edit
You need rlike function to achieve your result
words="(aaa|bbb|ccc)"
df.filter(df['columnName'].rlike(words))

Match part of a string with regex

I have two arrays of strings and I want to check if a string of array a matches a string from array b. Those strings are phone numbers that might come in different formats. For example:
Array a might have a phone number with prefix like so +44123123123 or 0044123123123
Array b have a standard format without prefixes like so 123123123
So I'm looking for a regex that can match a part of a string like +44123123123 with 123123123
Btw I'm using Swift but I don't think there's a native way to do it (at least a more straightforward solution)
EDIT
I decided to reactivate the question after experimenting with the library #Larme mentioned because of inconsistent results.
I'd prefer a simper solution as I've stated earlier.
SOLUTION
Thanks guys for the responses. I saw many comments saying that Regex is not the right solution for this problem. And this is partly true. It could be true (or false) depending on my current setup/architecture ( which thinking about it now I realise that I should've explained better).
So I ended up using the native solution (hasSuffix/contains) but to do that I had to do some refactoring on the way the entire flow was structured. In the end I think it was the least complicated solution and more performant of the two. I'll give the bounty to #Alexey Inkin for being the first to mention the native solution and the right answer to #Ωmega for providing a more complete solution.
I believe regex is not the right approach for this task.
Instead, you should do something like this:
var c : [String] = b.filter ({ (short : String) -> Bool in
var result = false
for full in a {
result = result || full.hasSuffix(short)
}
return result
})
Check this demo.
...or similar solution like this:
var c : [String] = b.filter ({ (short : String) -> Bool in
for full in a {
if full.hasSuffix(short) { return true }
}
return false
})
Check this demo.
As you do not mention requirements to prefixes, the simplest solution is to check if string in a ends with a string in b. For this, take a look at https://developer.apple.com/documentation/swift/string/1541149-hassuffix
Then, if you have to check if the prefix belongs to a country, you may replace ^00 with + and then run a whitelist check against known prefixes. And the prefix itself can be obtained as a substring by cutting b's length of characters. Not really a regex's job.
I agree with Alexey Inkin that this can also nicely be solved without regex. If you really want a regex, you can try something like the following:
(?:(\+|00)(93|355|213|1684|376))?(\d+)
^^^^^^^^^^^^^^^^^^^^^ Add here all your expected country prefixes (see below)
^^^ ^^ Match a country prefix if it exists but don't give it a group number
^^^^^^^ Match the "prefix-prefix" (+ or 00)
^^^^ Match the local phone number
Unfortunatly with this regex, you have to provide all the expected country prefixes. But you can surely get this list online, e.g. here: https://www.countrycode.org
With this regex above you will get the local phone number in matching group 3 (and the "prefix-prefix" in group 1 and the country code in group 2).

Removing Data Type From Tuple When Printing In Scala

I currently have two maps: -
mapBuffer = Map[String, ListBuffer[(Int, String, Float)]
personalMapBuffer = Map[mapBuffer, String]
The idea of what I'm trying to do is create a list of something, and then allow a user to create a personalised list which includes a comment, so they'd have their own list of maps.
I am simply trying to print information as everything is good from the above.
To print the Key from mapBuffer, I use: -
mapBuffer.foreach(line => println(line._1))
This returns: -
Sample String 1
Sample String 2
To print the same thing from personalMapBuffer, I am using: -
personalMapBuffer.foreach(line => println(line._1.map(_._1)))
However, this returns: -
List(Sample String 1)
List(Sample String 2)
I obviously would like it to just return "Sample String" and remove the List() aspect. I'm assuming this has something to do with the .map function, although this was the only way I could find to access a tuple within a tuple. Is there a simple way to remove the data type? I was hoping for something simple like: -
line._1.map(_._1).removeDataType
But obviously no such pre-function exists. I'm very new to Scala so this might be something extremely simple (which I hope it is haha) or it could be a bit more complex. Any help would be great.
Thanks.
What you see if default List.toString behaviour. You build your own string with mkString operation :
val separator = ","
personalMapBuffer.foreach(line => println(line._1.map(_._1.mkString(separator))))
which will produce desired result of Sample String 1 or Sample String 1, Sample String 2 if there will be 2 strings.
Hope this helps!
I have found a way to get the result I was looking for, however I'm not sure if it's the best way.
The .map() method just returns a collection. You can see more info on that here:- https://www.geeksforgeeks.org/scala-map-method/
By using any sort of specific element finder at the end, I'm able to return only the element and not the data type. For example: -
line._1.map(_._1).head
As I was writing this Ivan Kurchenko replied above suggesting I use .mkString. This also works and looks a little bit better than .head in my mind.
line._1.map(_._1).mkString("")
Again, I'm not 100% if this is the most efficient way but if it is necessary for something, this way has worked for me for now.

How to convert "like each" into a functional form?

Let's say I have a column of a table whose data type is a character array. I want to pass in a functional select where clause, where the column is in a list of given strings. However, I cannot simply use (in; `col; myList) for reasons. Instead, I need to do the equivalent of:
max col like/: myList
which effectively gives the same result. However, I have tried to put this in functional form
(max; (like/:; `col; myList))
And I am getting a type error. Any ideas on how I could make this work?
A nice trick when dealing with this problem is using parse on a string of the select statement you want to functionalize. For example:
q)parse"select from t where max col like/: myList"
?
`t
,,(max;((/:;like);`col;`myList))
0b
()
Or specifically in your case you want the 3rd element of the result list (the where clause):
q)(parse"select from t where max col like/: myList")2
max ((/:;like);`col;`myList)
I even think using this pattern in your actual code can be a good idea, as functionalized statements like max ((/:;like);`col;`myList) can get pretty unreadable pretty quickly!
Hope that helps!
(any; ((/:;like); `col; enlist,myList))
it should be: (max;((/:;like);`col;`mylist))

Is there a way to filter a field not containing something in a spark dataframe using scala?

Hopefully I'm stupid and this will be easy.
I have a dataframe containing the columns 'url' and 'referrer'.
I want to extract all the referrers that contain the top level domain 'www.mydomain.com' and 'mydomain.co'.
I can use
val filteredDf = unfilteredDf.filter(($"referrer").contains("www.mydomain."))
However, this pulls out the url www.google.co.uk search url that also contains my web domain for some reason. Is there a way, using scala in spark, that I can filter out anything with google in it while keeping the correct results I have?
Thanks
Dean
You can negate predicate using either not or ! so all what's left is to add another condition:
import org.apache.spark.sql.functions.not
df.where($"referrer".contains("www.mydomain.") &&
not($"referrer".contains("google")))
or separate filter:
df
.where($"referrer".contains("www.mydomain."))
.where(!$"referrer".contains("google"))
You may use a Regex. Here you can find a reference for the usage of regex in Scala. And here you can find some hints about how to create a proper regex for URLs.
Thus in your case you will have something like:
val regex = "PUT_YOUR_REGEX_HERE".r // something like (https?|ftp)://www.mydomain.com?(/[^\s]*)? should work
val filteredDf = unfilteredDf.filter(regex.findFirstIn(($"referrer")) match {
case Some => true
case None => false
} )
This solution requires a bit of work but is the safest one.