Filter PySpark DataFrame by partial string match

I would like to check for a substring match between the comments and keywords columns, and find out whether any of the keywords is present in that particular row.
name: ['paul', 'john', 'max', 'peter']
comments: ['account is active', 'account is activated', 'account is activateds', 'account is activ']
keywords: ['active,activated,activ', 'active,activated,activ', 'active,activated,activ', 'null']
Expected output:
match = ['True', 'True', 'True', 'True']
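One possible approach, as a minimal sketch: split the comma-separated keywords into an array, then test whether any element occurs as a substring of comments. The question is tagged pyspark, but the expr string below is plain Spark SQL and works unchanged with pyspark.sql.functions.expr; the surrounding code is shown with the Scala API used elsewhere on this page. Note that with this logic the last row would come out false, since 'null' is not a substring of 'account is activ'; if 'null' is meant to mark missing keywords, that case needs its own rule.
import org.apache.spark.sql.functions.expr

// assumes spark.implicits._ is in scope (for toDF)
val df = Seq(
  ("paul",  "account is active",     "active,activated,activ"),
  ("john",  "account is activated",  "active,activated,activ"),
  ("max",   "account is activateds", "active,activated,activ"),
  ("peter", "account is activ",      "null")
).toDF("name", "comments", "keywords")

// exists() (Spark 2.4+) is true if any keyword is a substring of comments
val result = df.withColumn(
  "match",
  expr("exists(split(keywords, ','), k -> comments LIKE concat('%', k, '%'))")
)
result.show(false)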

Related

In Spark Scala, how to create a column with substring() using locate() as a parameter?

I have a dataset that is like the following:
val df = Seq("samb id 12", "car id 13", "lxu id 88").toDF("list")
I want to create a column that will be a string containing only the values after id. The result would be something like:
val df_result = Seq(("samb id 12",12), ("car id 13",13), ("lxu id 88",88)).toDF("list", "id_value")
For that, I am trying to use substring. For the parameter of the starting position to extract the substring, I am trying to use locate. But it gives me an error saying that it should be an Int and not a Column type.
What I am trying is like:
df
.withColumn("id_value", substring($"list", locate("id", $"list") + 2, 2))
The error I get is:
error: type mismatch;
found : org.apache.spark.sql.Column
required: Int
.withColumn("id_value", substring($"list", locate("id", $"list") + 2, 2))
^
How can I fix this and continue using locate() as a parameter?
UPDATE
Updating to give an example in which @wBob's answer doesn't work for my real-world data: my data is indeed a bit more complicated than the examples above.
It is something like this:
val df = Seq(":option car, lorem :ipsum: :ison, ID R21234, llor ip", "lst ID X49329xas ipsum :ion: ip_s-")
The values are very long strings that don't have a specific pattern.
Somewhere in the string there is always a part written as ID XXXXX. The XXXXX varies, but it is always the same size (5 characters) and always comes after "ID ".
I have not been able to use either split or regexp_extract to get something matching this pattern.
It is not clear if you want the third item or the first number from the list, but here are a couple of examples which should help:
// Assign sample data to dataframe
val df = Seq("samb id 12", "car id 13", "lxu id 88").toDF("list")
df
.withColumn("t1", split($"list", "\\ ")(2))
.withColumn("t2", regexp_extract($"list", "\\d+", 0))
.withColumn("t3", regexp_extract($"list", "(id )(\\d+)", 2))
.withColumn("t4", regexp_extract($"list", "ID [A-Z](\\d{5})", 1))
.show()
You can use functions like split and regexp_extract with withColumn to create new columns based on existing values. split breaks the list column into an array based on the delimiter you pass in; I have used space here, escaped with two backslashes. The array is zero-based, hence specifying 2 gets the third item in the array. regexp_extract uses regular expressions to extract from strings: here I've used \\d, which represents a digit, and +, which matches the digit one or more times. The third column, t3, again uses regexp_extract with a similar regex, but uses brackets to group up sections and 2 to get the second group from the regex, i.e. the (\\d+). NB I'm using the additional backslash in the regex to escape the backslash used in \d.
If your real data is more complicated please post a few simple examples where this code does not work and explain why.
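As a side note on the original question: if you specifically want to keep using locate(), one way around the type mismatch is the Column.substr method, which (unlike the substring function) accepts Column arguments, so locate's result can be passed in directly. A minimal sketch; the +3 offset is an assumption that skips over "id " to reach the digits:
import org.apache.spark.sql.functions.{locate, lit}

// assumes spark.implicits._ is in scope (for toDF and $)
val df = Seq("samb id 12", "car id 13", "lxu id 88").toDF("list")

// locate is 1-based, so locate("id", ...) + 3 points just past "id "
val df_result = df.withColumn("id_value", $"list".substr(locate("id", $"list") + 3, lit(2)))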

Check that a Spark DataFrame column matches a regex for all occurrences using Scala

I am using Scala.
I have a dataframe with a column date which looks like this:
| date |
|2017-09-24T11:05:52.647+02:00|
|2018-09-24T11:05:52.647+02:00|
|2018-10-24T11:05:52.647+02:00|
I have a regex to check the date format:
val pattern = """([12]\d{3}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])T\d{2}:\d{2}:\d{2}\.\d{3}\+\d{2}:\d{2})""".r
I want to check whether each row in the dataframe matches the regex: if yes return true, and if not return false. I need to return just true or false, not a list.
Any help is welcome and many thanks for your help.
This should work, but it turns the problem around: find the first non-match.
import scala.util.Try
val result = Try(Option(df.filter($"cityid" rlike "[^0-9]").first)).toOption.flatten
if (result.isEmpty) { println("Empty")}
I use a DataFrame as the outcome, so you can just check whether it is empty or not.
Please tailor it to your own situation, e.g. not empty, your own regex.
Without the Try and such, .first throws an error when there is no result. With it, None is returned when nothing matches, and you can do the empty check.
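Applied to the date column from the question, a minimal sketch that yields the single Boolean asked for (assuming Spark 2.4+ for Dataset.isEmpty; the pattern is anchored with ^ and $ so each value must match the format in full):
import org.apache.spark.sql.functions.col

val datePattern = """[12]\d{3}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])T\d{2}:\d{2}:\d{2}\.\d{3}\+\d{2}:\d{2}"""

// keep only rows that do NOT match; allMatch is true exactly when none exist
val allMatch = df.filter(!col("date").rlike("^" + datePattern + "$")).isEmpty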

Text Indexes MongoDB, Minimum length of search string

I have created a text index for collection X from the mongo shell:
db.X.ensureIndex({name: 'text', cusines: 'text', 'address.city': 'text'})
Now, suppose a document's name property has the value seasons (its length is 7).
If I run the find query with a search string of length <= 5:
db.X.find({$text: {$search: 'seaso'}})
it does not return any value. If I change the search string to season (length 6), then it returns the document.
My question is: does the search string have some minimum length constraint for fetching the records?
If yes, is there any way to change it?
MongoDB $text searches do not support partial matching. MongoDB supports text search queries on string content, with case insensitivity and stemming.
Looking at your examples:
// this returns nothing because there is no inferred association between
// the value: 'seasons' and your input: 'seaso'
db.X.find({$text: {$search: 'seaso'}})
// this returns a match because 'season' is seen as a stem of 'seasons'
db.X.find({$text: {$search: 'season'}})
So, this is not an issue with the length of your input. Searching on seaso returns no matches because:
Your text index does not contain the whole word: seaso
Your text index does not contain a whole word for which seaso is a recognised stem
This presumes that the language of your text index is English. You can confirm this by running db.X.getIndexes(); you'll see this in the definition of your text index:
"default_language" : "english"
FWIW, if your index is case insensitive then the following will also return matches:
db.X.find({$text: {$search: 'SEaSON'}})
db.X.find({$text: {$search: 'SEASONs'}})
Update 1: in response to the question "is it possible to use RegExp".
Assuming the name attribute contains the value seasons and you are searching with seaso, then the following will match your document:
db.X.find({name: {$regex: /^seaso/}})
More details in the docs but ...
This will not use your text index, so if you proceed with the $regex operator then you won't need the text index.
Index coverage with the $regex operator is probably not what you expect. The brief summary is this: if your search value is anchored to the start of the string (i.e. /^seaso/ rather than /easons/) then MongoDB can use an index, but otherwise it cannot.

Spark: filter a dataframe column with words from a collection

I have been searching for a while but I haven't found how to do it.
I have a dataframe that references a table, and one of its columns contains a string.
Dataframe schema: name string, lastname string, interests string
I have a list of interests like so:
val sports: List[String] = List("football", "basketball", "soccer")
I want to filter all the people in my dataframe who have one of the sports above in their interests.
val peopledata = sqlContext.sql("select * from learning.people")
I have tried to do it like this:
for (sport <- sports) peopledata.filter(peopledata("interests").contains(sport))
but I asked a pro at the company I work for, and he told me there is a better and prettier way to do it.
Execute the collect() function to get an Array[Row] of results, and filter the elements of this array with sports.contains():
peopledata.collect().filter(row => sports contains row.getString(2))
2 here is the index of the interests field in your schema.
Usage of string interpolation will solve your problem:
val interest = sports.mkString("('","','","')")
val peopledata = sqlContext.sql(s"select * from learning.people where interests in $interest")
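Note that both answers above match whole interests values rather than substrings, and collect() pulls every row to the driver. A minimal sketch of a DataFrame-only alternative that keeps the substring semantics of the original for-loop attempt (names taken from the question): build one contains predicate per sport and OR them together into a single filter condition.
import org.apache.spark.sql.functions.col

val sports: List[String] = List("football", "basketball", "soccer")

// one predicate per sport, combined with || into a single condition
// that is evaluated on the executors, with no collect() needed
val condition = sports.map(sport => col("interests").contains(sport)).reduce(_ || _)
val filtered = peopledata.filter(condition)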

JPQL: is sorting the query result by "best matches" possible?

I have the following question/problem:
I'm using JPQL (JPA 2.0 and EclipseLink) and I want to create a query that gives me the results sorted the following way:
first the results, sorted ascending by best match; after those should appear the inferior matches.
My objects are based on a simple class called 'Person' with the attributes:
{String Id,
String forename,
String name}
For example if I'm searching for "Picol" the result should look like:
[{129, Picol, Newman}, {23, Johnny, Picol},{454, Picolori, Newta}, {4774, Picolatus, Larimus}...]
PS: I already thought about using two queries, the first searching with "equals" and the second with "like", although I'm not quite sure how to connect the two query results.
Hope for your help and thanks in advance,
Florian
If, as your question seems to imply, you only have two groups (first group: forename or name equals the searched string; second group: forename or name contains the searched string), and if all the persons in a given group have the same "match score", then using two queries is indeed a good solution.
First query:
select p from Person p where p.foreName = :param or p.name = :param
Second query:
select p from Person p where (p.foreName like :paramSurroundedWithPercent
or p.name like :paramSurroundedWithPercent)
and p.foreName != :param
and p.name != :param
Execute both queries (each returning a List<Person>), then add all the elements of the second list to the first one (using the addAll() method).