Is it possible to use isin() and wildcard searches together in pyspark? - pyspark

I have a dataframe that I want to filter rows based on a list of conditions.This seems to work if I know the exact values - using .isin() - but when I want to use a wildcard - similar to the .like('%condition%') - the filtering does not seem to work. Does anyone know if this is possible? Otherwise I will have to loop through the conditions and add a like filter for each. I have tried both with and without * on the list of conditions to unpack it:
filter_out_conditions=['condition_1', 'condition_2']
df.where(~col(check_col).isin(*filter_out_conditions))
df.where(~col(check_col).isin(filter_out_conditions))

You can create the condition as per requirement
cons = ['%1%','%3%']
cod = ' or '.join([f"col1 like '{i}'" for i in cons])
df.filter(cod)

Related

Check for list of substrings inside string column in PySpark

For checking if a single string is contained in rows of one column. (for example, "abc" is contained in "abcdef"), the following code is useful:
df_filtered = df.filter(df.columnName.contains('abc'))
The result would be for example "_wordabc","thisabce","2abc1".
How can I check for multiple strings (for example ['ab1','cd2','ef3']) at the same time?
I'm ideally searching for something like this:
df_filtered = df.filter(df.columnName.contains(['word1','word2','word3']))
The result would be for example "x_ab1","_cd2_","abef3".
Please, post scalable solutions (no for loops, for example) because the aim is to check a big list around 1000 elements.
All you need is isin
df_filtered = df.filter(df['columnName'].isin('word1','word2','word3')
Edit
You need rlike function to achieve your result
words="(aaa|bbb|ccc)"
df.filter(df['columnName'].rlike(words))

prometheus doesn't match regex query

I'm trying to write a prometheus query in grafana that will select visits_total{route!~"/api/docs/*"}
What I'm trying to say is that it should select all the instances where the route doesn't match /api/docs/* (regex) but this isn't working. It's actually just selecting all the instances. I tried to force it to select others by doing this:
visits_total{route=~"/api/order/*"} but it doesn't return anything. I found these operators in the querying basics page of prometheus. What am I doing wrong here?
May be because you have / in the regex. Try with something like visits_total{route=~".*order.*"} and see if the result is generated or not.
Try this also,
visits_total{route!~"\/api\/docs\/\*"}
If you want to exclude all the things that has the word docs you can use below,
visits_total{route!~".*docs.*"}
The main problem with your original query is that /api/docs/* will only match things like /api/docs and /api/docs//////; i.e. the * in your query will match 0 or more / characters.
I think what you meant to use was /api/docs/.*.

Using ORACLE-LIKE like feature between Spark DataFrame and a List of words - Scala

My requirement is similar to one in :
LINK
Instead of direct match I need LIKE type match on a list. i.e Want to LIKE match COMMENTS with List
ID,COMMENTS
1,bad is he
2,hell thats good
3,sick !thats hell
4,That was good
List = ('good','horrible','hell')
I want to get output like
ID, COMMENTS,MATCHED_WORD,NUM_OF_MATCHES
1,bad is he,,
2,hell thats good,(hell,good),2
3,sick !thats hell,hell,1
4,That was good,good,1
In simpler terms I need : ( rlike isn't matching values from a list instead expects one single string , as far I know it)
file.select($"COMMENTS",$"ID").filter($"COMMENTS".rlike(List_ :_*)).show()
I tried isin , that works but matches WHOLE WORDS ONLY.
file.select($"COMMENTS",$"ID").filter($"COMMENTS".isin(List_ :_*)).show()
Kindly help or please re-direct to me any links as I tried lot of searching !
With simple words I'd use an alternative:
val xs = Seq("good", "horrible", "hell")
df.filter($"COMMENTS".rlike(xs.mkString("|"))
otherwise:
df.filter(xs.foldLeft(lit(false))((acc, x) => acc || $"COMMENTS".rlike(x)))

Laravel WhereIn with multiple options in the field itself

Normally a whereIn in Eloquent compares a value from a field to an array with options. I like to reverse that and compare a option to multiple options in the field:
field contains 'option1,option2,option3'
Model::whereIn('field', 'option1')->get();
Is this possible?
You can make your query using LIKE:
Model::where('field', 'LIKE', '%option1%')->get();
Documentation on the syntax is available here: http://dev.mysql.com/doc/refman/5.7/en/pattern-matching.html
If you always add a comma , even after the last choice, like option1,option2,option3, you can use a bit of a more robust filter:
Model::where('field', 'LIKE', '%option1,%')->get();
And a comma at the start (or any other separator if that matters) would make it even better:
Model::where('field', 'LIKE', '%,option1,%')->get();
Otherwise you can have issues if one of your option is similar to another one at the end (if you have fish and goldfish as possible categories, using LIKE ',fish,' will guarantee that you don't match goldfish, while LIKE 'fish,' would match both fish and goldfish).
I'd recommend to store your categories like that: /fish/goldfish/water/ and then filter using LIKE '%/yourcategory/%'

dataFrame keying using pandas groupby method

I new to pandas and trying to learn how to work with it. Im having a problem when trying to use an example I saw in one of wes videos and notebooks on my data. I have a csv file that looks like this:
filePath,vp,score
E:\Audio\7168965711_5601_4.wav,Cust_9709495726,-2
E:\Audio\7168965711_5601_4.wav,Cust_9708568031,-80
E:\Audio\7168965711_5601_4.wav,Cust_9702445777,-2
E:\Audio\7168965711_5601_4.wav,Cust_7023544759,-35
E:\Audio\7168965711_5601_4.wav,Cust_9702229339,-77
E:\Audio\7168965711_5601_4.wav,Cust_9513243289,25
E:\Audio\7168965711_5601_4.wav,Cust_2102513187,18
E:\Audio\7168965711_5601_4.wav,Cust_6625625104,-56
E:\Audio\7168965711_5601_4.wav,Cust_6073165338,-40
E:\Audio\7168965711_5601_4.wav,Cust_5105831247,-30
E:\Audio\7168965711_5601_4.wav,Cust_9513082770,-55
E:\Audio\7168965711_5601_4.wav,Cust_5753907026,-79
E:\Audio\7168965711_5601_4.wav,Cust_7403410322,11
E:\Audio\7168965711_5601_4.wav,Cust_4062144116,-70
I loading it to a data frame and the group it by "filePath" and "vp", the code is:
res = df.groupby(['filePath','vp']).size()
res.index
and the output is:
[E:\Audio\7168965711_5601_4.wav Cust_2102513187,
Cust_4062144116, Cust_5105831247,
Cust_5753907026, Cust_6073165338,
Cust_6625625104, Cust_7023544759,
Cust_7403410322, Cust_9513082770,
Cust_9513243289, Cust_9702229339,
Cust_9702445777, Cust_9708568031,
Cust_9709495726]
Now Im trying to approach the index like a dict, as i saw in examples, but when im doing
res['Cust_4062144116']
I get an error:
KeyError: 'Cust_4062144116'
I do succeed to get a result when im putting the filepath, but as i understand and saw in previouse examples i should be able to use the vp keys as well, isnt is so?
Sorry if its a trivial one, i just cant understand why it is working in one example but not in the other.
Rutger you are not correct. It is possible to "partial" index a multiIndex series. I simply did it the wrong way.
The index first level is the file name (e.g. E:\Audio\7168965711_5601_4.wav above) and the second level is vp. Meaning, for each file name i have multiple vps.
Now, this is correct:
res['E:\Audio\7168965711_5601_4.wav]
and will return:
Cust_2102513187 2
Cust_4062144116 8
....
but trying to index by the inner index (the Cust_ indexes) will fail.
You groupby two columns and therefore get a MultiIndex in return. This means you also have to slice using those to columns, not with a single index value.
Your .size() on the groupby object converts it into a Series. If you force it in a DataFrame you can use the .xs method to slice a single level:
res = pd.DataFrame(df.groupby(['filePath','vp']).size())
res.xs('Cust_4062144116', level=1)
That works. If you want to keep it as a series, boolean indexing can help, something like:
res[res.index.get_level_values(1) == 'Cust_4062144116']
The last option is a bit less readable, but sometimes also more flexibile, you could test for multiple values at once for example:
res[res.index.get_level_values(1).isin(['Cust_4062144116', 'Cust_6073165338'])]