Filter specific word in PySpark

Filter specific word in PySpark - pyspark

I'm pretty new to PySpark so I'm sorry if my question seems too simple but I've been bogged by it for some time now.
Given the following text is parallelized
Alice has recently been troubled by recent events. It has come to Alice's attention that her grades were dipping down which worried Alice a lot. If Alice's parents were to find out about this, they would flip out. What should Alice do?
I know that to find the frequency of the word 'Alice' uses the code
rdd.filter(lambda x: "Alice" in x).count()
which is equal to 5. However, how do I exclude the the ones that are 'Alice's' and merely just get the word count of 'Alice', so my desired count should be 3 instead of 5.

You can filter only Alice using .filter(lambda x:x[0]=='Alice')
from operator import add
rdd.collect()
#["Alice has recently been troubled by recent events. It has come to Alice's attention that her grades were dipping down which worried Alice a lot. If Alice's parents were to find out about this, they would flip out. What should Alice do?"]
#filter only Alice
rdd.flatMap(lambda x:x.split(" ")).map(lambda x:(x,1)).reduceByKey(add).filter(lambda x:x[0]=='Alice').collect()
#[('Alice', 3)]

Related

How to do a name similarity using clustering

I have a very big -super big- database of names.
The task is to find all the similar names (of the same person per se) despite some diffrences like :
first name, second name inversed --> John Doe & Doe John
two names or more (same ones) with light changes, maybe some
letters misplaced or something else --> Jonh Doe & John Deo
two names with some letters added --> Johhn Doe & Johnn Doee &
John Doe
names where another middle name inserted --> John Blair Campbell Doe & John Blair Doe
And so on..
I tried using the classical methods like soundex and leveshtein but the results were not very good, had results like : Amine depi and Amina dope are in the same group while they're diffrent
and It would take very long to perform the task on just a fraction on the data, as for my database, it would directly crash after a long time
I also thought of using another approach like cosine which uses numerical values and I though of finding a way of representing the names in a numerical way, or convert them (something like word2vec), I actually though of using directly word2vec with the whole database of namems as the text, but as expected it didn't work. Tried to codify the names in a low level way, like code ASCII for exemple, but the results weren't good neither.
So I thought of Clustering.
So I tried using DBSCAN. I found a way to use DBSCAN clustering with a custom distance metric and used leveshtein distance. (If you ask me why DBSCAN? It is because I don't know the numbers of similar groups of names which are in the database in the beginning)
I did have some results, but very poor performance overall. It would either give the same exact ones, John Doe and John Doe int he same cluster, or nothing at all, and would even skip some exact ones.
Do you have a suggestion for performing this task ? preferbly using clsutering or another smart way since the database is very big (more than 500 000 line and up to millions ) so I cannot iterate alot.
I am open to suggestions or propositions !
Especially if you worked on something like this previously or similar to this, Thank you in advance.

Try AgglomerativeClustering.
Sample code:
clustering = AgglomerativeClustering(
n_clusters=None,
distance_threshold=0.3 # smaller threshold meaning more strict similarities, and more clusters
).fit(your_vectorized_name_list)
print(f'total clusters: {clustering.n_clusters_}')

I need help in data sanitization problem in tableau

I trying doing the manual sanitization, however I am getting a type mismatch error in performing the calculations.
I also need help in sanitizing the data and getting the insight as per the below instructions:
The column sellerproductcount gives you the count of products in the
form '1-16 of over 100,000 results' , and you can parse out the product count 100,000.
sellerratings - this columns gives you the % and count of positive ratings (e.g. 88% positive
in the last 12 months (118 ratings) ) if parsed correctly
sellerdetails - you can use this text to parse out phone numbers, and email IDs of
merchants, where available, so our team can reach out to them.
businessaddress - this will give you the business locations of the sellers. You can parse them
to identify if a seller is registered in the US , Germany (DE), or China (CN).
Hero Product 1 #ratings and Hero Product 2 #ratings - these 2 columns give you the number of
ratings of the 2 'hero products' or bestselling products of this seller.
I have attached the dataset for the same.
https://docs.google.com/spreadsheets/d/1PSqRCnmFgq7v7RzZaCXXoV0Edp_vM7QO/edit?usp=sharing&ouid=115547990006782902200&rtpof=true&sd=true

Most of this type of data prep can be done with string & RegEx functions like REGEX_MATCH(). Here are a few examples based on the data you shared:
Seller Product Count
INT(REGEXP_EXTRACT([Sellerproductcount], '(\d*,?\d*) results'))
1-16 of over 6,000 results >> 6000
Seller Rating (Percentage)
INT(REGEXP_EXTRACT([Sellerratings], '(\d*)% positive'))
92% positive in the last 12 months (181 ratings) >> 92
Seller Rating (Count)
INT(REGEXP_EXTRACT([Sellerratings], '(\d*) (?:total )?ratings'))
92% positive in the last 12 months (181 ratings) >> 181
Business Country Code
RIGHT([Businessaddress],2)
AM Treptower Park28-30Berlin12435DE >> DE
These examples all have very straightforward patterns that are present in all rows so they can be done pretty easily with one simple calculation. However, something like sellerdetails which is unstructured, inconsistent, and sometimes incomplete will be a bit more of a challenge. You will need to use a couple of different calculations and techniques combined together to find what you are looking for, as well as some manual data prep. Here's an example of how you can pull out email but it won't work for everything:
Email
REGEXP_EXTRACT([Sellerdetails], '([a-zA-Z0-9.!#$%&’*+/=?^_`{|}~-]+#[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*)')
Good luck with your data cleaning, I suggest using sites like https://regex101.com/ and https://regexr.com/ to learn more about and help test regular expressions.

Send email from Tableau: how to split email's body into multiple lines?

Hello Stack Overflow community,
I am trying to send an email from Tableau, but I can't figure it out how to split the body of the email into multiple rows.
I have three fields: , and .
The email must be structured in this way:
with $ in total revenue in the last 7 days
with $ in total revenue in the last 7 days
with $ in total revenue in the last 7 days
I've created the following URL action : mailto:XYZ?Business&body= with total <SUM(Revenue)> revenue in the last 7 days
When selecting multiple sellers at the same time, I get this:
"Seller A-Seller B-Seller C - with total $15 946-$3 412-$31 505 PD-PD-PA revenue in the last 7 days".
Do you know how (if possible) to split that text in multiple lines?
And if so, how can I send the email automatically, without having to manually click on the seller's name?
Sorry for the silly question, I am relatively new to Tableau.
Many thanks!

It used to be possible to send a formatted email from Tableau. However, I believe this was made not possible as a security precaution back in 2020.
I haven't tried in later versions of Tableau to know if it's possible again.

Grouping By with missing data

Image of Data and desired result:
I'm trying to aggregate volunteer hours from a Google spreadsheet a non-profit I volunteer for. We collect volunteer e-mail information and the time that each volunteer has contributed. Each volunteer only puts in their e-mail the first time. I've found examples online on how to send e-mails, but I'm having trouble aggregating the data. I think the trouble might be that not every row has an e-mail address associated with it.
I've been able to get the sum of hours worked by volunteer using QUERY(data, "select A, sum(C) Group By A", ) but can't figure out how to get the e-mail associated with each individual.

Thanks for the advice! The VLOOKUP and ArrayFormula functions were new to me. Here's how I solved it:
QUERY(data, "select A, B where B <>'' ", -1)
This allowed me to get the Key-Value pair (Name, Email) for each volunteer (solving the problem of people who volunteered multiple times, but only left their e-mail once). From there, I was able to generate the 'Name:Hours Worked' table off to the right with:
QUERY(data, "select A, sum(C) Group By A", ).
Then, I used VLOOKUP to query my Name-Email table to get the desired result of:
Name-Email-aggregatedHours
Thanks!

You can't achieve this with query. But you could apply vlookup to sorted table:
=ArrayFormula(VLOOKUP(UNIQUE(FILTER(A2:A,A2:A<>"")),SORT(A2:B,2,0),2,0))
and get email list for unique names.

First, clean up your data. You shoud be certain that at least one column has no typos an that this column appropiate identify which data corresponds to each volunteer. This is called key value. This also could be done by, but not limited to, filling up the missing values for each row. If this will be hard, then
Create a volunteer list without missing data.
Calculate the time contributed by each volunteer. If you was able to fill up the missing values, then you could use QUERY, I this case the QUERY formula should have to group by name and email, if not, then use SUMIF

Counting multiple values from one column in Tableau

I have a field from the data I am reading in that can contain multiple values. They are essentially tags.
For example, there could be a column called "persons responsible". This could read "Joe; Bob; Sue" or "Sue" for a given row.
Is it possible from within Tableau to read these in as separate categories? So that for this sample data:
Project | Persons
---------------------------
Zeta | Bob; Sue; Joe
Enne | Sue
Doble Ve | Bob
There could be a count of Bob (2), Sue (2), Joe (1)?
I am working on getting better data inputs, but I was wondering if there was a temporary solution at this level.

I would definitely work towards normalizing your schema.
In the meantime, there is a workaround that is almost reasonable if there is a small set of possible values for the tags (persons in your example).
If Bob, Sue and Joe are the only people in the system, you can use the contains() function to define a boolean calculated field for each person -- e.g. Bob_Is_Responsible = contains(Persons, 'Bob"), and similar fields for Sue and Joe. Then you could use those as building blocks, possibly with sets, to break the data up in different ways.
Of course, this approach gets cumbersome fast if the number of tags grows, or if it is unconstrained. But you asked for a temporary solution ...

If the number of elements is small, you write and union several queries with each one having the project and nth element.
Ideally, you'd reshape your data to look like this either in the database or with the above mentioned union technique. Then you could count() or countd() the elements by project.
Project | Persons
---------------------------
Zeta | Bob
Zeta | Sue
Zeta | Joe
Enne | Sue
Doble Ve | Bob

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse