Check for list of substrings inside string column in PySpark

To check whether a single string is contained in the rows of one column (for example, "abc" is contained in "abcdef"), the following code is useful:
df_filtered = df.filter(df.columnName.contains('abc'))
The result would be, for example, "_wordabc", "thisabce", "2abc1".
How can I check for multiple strings (for example ['ab1','cd2','ef3']) at the same time?
I'm ideally searching for something like this:
df_filtered = df.filter(df.columnName.contains(['ab1','cd2','ef3']))
The result would be, for example, "x_ab1", "_cd2_", "abef3".
Please post scalable solutions (no for loops, for example), because the aim is to check a big list of around 1,000 elements.

All you need is isin:
df_filtered = df.filter(df['columnName'].isin('word1', 'word2', 'word3'))
Edit
You need the rlike function to achieve your result:
words="(aaa|bbb|ccc)"
df.filter(df['columnName'].rlike(words))
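Since the question mentions a list of around 1,000 elements, the pattern can be built programmatically from the list. The sketch below is one way to do that; substrings and columnName are assumed names, and re.escape is used in case any substring contains regex metacharacters:
import re
substrings = ['ab1', 'cd2', 'ef3']  # in practice, a list of ~1000 elements
pattern = "(" + "|".join(re.escape(s) for s in substrings) + ")"
df_filtered = df.filter(df['columnName'].rlike(pattern))
Note that rlike matches the pattern anywhere inside the value, so this behaves like a substring check, whereas isin only matches whole values.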

Related

Is it possible to use isin() and wildcard searches together in pyspark?

I have a dataframe whose rows I want to filter based on a list of conditions. This seems to work if I know the exact values - using .isin() - but when I want to use a wildcard - similar to .like('%condition%') - the filtering does not seem to work. Does anyone know if this is possible? Otherwise I will have to loop through the conditions and add a like filter for each. I have tried both with and without * on the list of conditions to unpack it:
filter_out_conditions=['condition_1', 'condition_2']
df.where(~col(check_col).isin(*filter_out_conditions))
df.where(~col(check_col).isin(filter_out_conditions))
You can build the condition as a SQL expression string, as per your requirement:
cons = ['%1%','%3%']
cod = ' or '.join([f"col1 like '{i}'" for i in cons])
df.filter(cod)
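If you prefer to stay in the DataFrame API instead of passing a SQL expression string, one alternative (a sketch, assuming the same cons list and a column named col1) is to OR together Column.like conditions with reduce:
from functools import reduce
from pyspark.sql import functions as F

cons = ['%1%', '%3%']
# builds col1 LIKE '%1%' OR col1 LIKE '%3%' as a single Column expression
cond = reduce(lambda a, b: a | b, [F.col('col1').like(c) for c in cons])
df_filtered = df.filter(cond)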

Using distinct on a slice of a StringBuilder

buffer.slice(mouse,highlight).distinct
Now when I perform this, it seems to apply .distinct to the whole string rather than the selection I take with slice (mouse and highlight are just index positions, and buffer is a StringBuilder). I'm just wondering what the reason for this is.
Your approach is correct. Please see the code below for clarification.
The slice() function gives you the sub-string, so your expression first takes the sub-string and then applies distinct to it.
Follow the step-by-step walkthrough below:
val buffer=new StringBuilder
buffer.append("bbbaabbbcccbdbcdbd")
val sl=buffer.slice(2,10)
The variable sl contains
sl= baabbbcc
Now you can apply distinct on sl variable
val result=sl.distinct
Finally your output
result= bac
This is how your single line of code works.

Using an ORACLE-like LIKE feature between a Spark DataFrame and a list of words - Scala

My requirement is similar to the one in: LINK
Instead of a direct match I need a LIKE-type match on a list, i.e. I want to LIKE-match COMMENTS against the list:
ID,COMMENTS
1,bad is he
2,hell thats good
3,sick !thats hell
4,That was good
List = ('good','horrible','hell')
I want to get output like
ID, COMMENTS,MATCHED_WORD,NUM_OF_MATCHES
1,bad is he,,
2,hell thats good,(hell,good),2
3,sick !thats hell,hell,1
4,That was good,good,1
In simpler terms, I need this (rlike doesn't match values from a list; it expects one single string, as far as I know):
file.select($"COMMENTS",$"ID").filter($"COMMENTS".rlike(List_ :_*)).show()
I tried isin, which works but matches WHOLE WORDS ONLY.
file.select($"COMMENTS",$"ID").filter($"COMMENTS".isin(List_ :_*)).show()
Kindly help, or please redirect me to any relevant links, as I have done a lot of searching!
With simple words I'd use regex alternation:
val xs = Seq("good", "horrible", "hell")
df.filter($"COMMENTS".rlike(xs.mkString("|")))
otherwise:
df.filter(xs.foldLeft(lit(false))((acc, x) => acc || $"COMMENTS".rlike(x)))
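Since the rest of this page is PySpark-oriented, here is a rough PySpark translation of the fold-based version (a sketch; df and the COMMENTS column are assumed to match the example data above):
from functools import reduce
from pyspark.sql import functions as F

xs = ["good", "horrible", "hell"]
# OR together one rlike condition per word, starting from a literal False
cond = reduce(lambda acc, w: acc | F.col("COMMENTS").rlike(w), xs, F.lit(False))
df.filter(cond).show()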

How to find most frequent string in List of strings

I have a list of strings (List[String]) and I want to obtain the most frequent string from this list:
val list1 = List("a", "a", "0", "b", "b", "a")
The answer should be:
freq_list1 = a
I was thinking of using list1.sliding(2).count... to get the count of each unique string, but I don't know how to turn that into finding the most frequent one.
list1.groupBy(identity).mapValues(_.size).maxBy(_._2)._1
EDIT: See the comment below; this can be made shorter by using maxBy(_._2.size) without mapping beforehand. Thanks @kawty.

DataFrame keying using the pandas groupby method

I'm new to pandas and trying to learn how to work with it. I'm having a problem when trying to apply an example I saw in one of Wes's videos and notebooks to my data. I have a CSV file that looks like this:
filePath,vp,score
E:\Audio\7168965711_5601_4.wav,Cust_9709495726,-2
E:\Audio\7168965711_5601_4.wav,Cust_9708568031,-80
E:\Audio\7168965711_5601_4.wav,Cust_9702445777,-2
E:\Audio\7168965711_5601_4.wav,Cust_7023544759,-35
E:\Audio\7168965711_5601_4.wav,Cust_9702229339,-77
E:\Audio\7168965711_5601_4.wav,Cust_9513243289,25
E:\Audio\7168965711_5601_4.wav,Cust_2102513187,18
E:\Audio\7168965711_5601_4.wav,Cust_6625625104,-56
E:\Audio\7168965711_5601_4.wav,Cust_6073165338,-40
E:\Audio\7168965711_5601_4.wav,Cust_5105831247,-30
E:\Audio\7168965711_5601_4.wav,Cust_9513082770,-55
E:\Audio\7168965711_5601_4.wav,Cust_5753907026,-79
E:\Audio\7168965711_5601_4.wav,Cust_7403410322,11
E:\Audio\7168965711_5601_4.wav,Cust_4062144116,-70
I load it into a DataFrame and then group it by "filePath" and "vp"; the code is:
res = df.groupby(['filePath','vp']).size()
res.index
and the output is:
[E:\Audio\7168965711_5601_4.wav Cust_2102513187,
Cust_4062144116, Cust_5105831247,
Cust_5753907026, Cust_6073165338,
Cust_6625625104, Cust_7023544759,
Cust_7403410322, Cust_9513082770,
Cust_9513243289, Cust_9702229339,
Cust_9702445777, Cust_9708568031,
Cust_9709495726]
Now I'm trying to access the index like a dict, as I saw in the examples, but when I do
res['Cust_4062144116']
I get an error:
KeyError: 'Cust_4062144116'
I do succeed in getting a result when I put in the file path, but as I understand it, and saw in previous examples, I should be able to use the vp keys as well - isn't that so?
Sorry if it's a trivial one; I just can't understand why it works in one example but not in the other.
Rutger, you are not correct. It is possible to partially index a MultiIndex Series; I simply did it the wrong way.
The first level of the index is the file name (e.g. E:\Audio\7168965711_5601_4.wav above) and the second level is vp, meaning that for each file name I have multiple vps.
Now, this is correct:
res['E:\Audio\7168965711_5601_4.wav']
and will return:
Cust_2102513187 2
Cust_4062144116 8
....
but trying to index by the inner index (the Cust_ indexes) will fail.
You group by two columns and therefore get a MultiIndex in return. This means you also have to slice using those two columns, not with a single index value.
Your .size() on the groupby object converts it into a Series. If you force it into a DataFrame, you can use the .xs method to slice a single level:
res = pd.DataFrame(df.groupby(['filePath','vp']).size())
res.xs('Cust_4062144116', level=1)
That works. If you want to keep it as a Series, boolean indexing can help, something like:
res[res.index.get_level_values(1) == 'Cust_4062144116']
The last option is a bit less readable but sometimes more flexible; you could test for multiple values at once, for example:
res[res.index.get_level_values(1).isin(['Cust_4062144116', 'Cust_6073165338'])]
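Putting the two options together, here is a minimal runnable sketch (the toy data below is my own and only mirrors the structure of the CSV above):
import pandas as pd

df = pd.DataFrame({
    'filePath': ['E:/a.wav'] * 3 + ['E:/b.wav'] * 2,
    'vp': ['Cust_1', 'Cust_2', 'Cust_1', 'Cust_2', 'Cust_3'],
    'score': [-2, -80, 25, 11, -70],
})

# Force the grouped sizes into a DataFrame and slice the second index level with .xs
res = pd.DataFrame(df.groupby(['filePath', 'vp']).size())
print(res.xs('Cust_2', level=1))

# Or keep the Series and use boolean indexing on the level values
sizes = df.groupby(['filePath', 'vp']).size()
print(sizes[sizes.index.get_level_values(1).isin(['Cust_1', 'Cust_3'])])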