I want to use the function **"doublemetaphone"** on my dataset in **PySpark**.
The expected result is a boolean: TRUE or FALSE.
But *@udf("bool")* doesn't work; is there another way?
from metaphone import doublemetaphone
from pyspark.sql.functions import col, udf

@udf("bool")
def udf_doublemetaphone(a, b):
    return doublemetaphone(a) == doublemetaphone(b)

data_set_doublemetaphone = (data_set.withColumn("doublemetaphone",
    udf_doublemetaphone(col("A"), col("B"))))
I found it!
It is @udf(BooleanType()).
So you can use it like this:
from metaphone import doublemetaphone
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

@udf(BooleanType())
def udf_doublemetaphone(a, b):
    return doublemetaphone(a) == doublemetaphone(b)

data_set_doublemetaphone = (data_set.withColumn("doublemetaphone",
    udf_doublemetaphone(col("A"), col("B"))))
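A related option, in case the string form is preferred: on reasonably recent PySpark versions (2.3+), the return type can also be passed as a DDL-formatted type string, where the recognized name is "boolean" rather than "bool" (which is likely why @udf("bool") was rejected). A minimal sketch of that variant:

from metaphone import doublemetaphone
from pyspark.sql.functions import col, udf

@udf("boolean")  # DDL type string; equivalent to BooleanType()
def udf_doublemetaphone_ddl(a, b):
    return doublemetaphone(a) == doublemetaphone(b)

data_set_doublemetaphone = data_set.withColumn(
    "doublemetaphone", udf_doublemetaphone_ddl(col("A"), col("B"))
)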
My goal is to clean the data in a column of a PySpark DataFrame. I have written a function for cleaning.
def preprocess(text):
    text = text.lower()
    text = text.strip()
    text = re.compile('<.*?>').sub('', text)
    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'\[[0-9]*\]', ' ', text)
    text = re.sub(r'[^\w\s]', '', text.lower().strip())
    text = re.sub(r'\d', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text
# LEMMATIZATION
# Initialize the lemmatizer
wl = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    text = [i for i in text.split() if i not in stop_words]
    return text
# This is a helper function to map NLTK POS tags to WordNet POS tags
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
# Tokenize the sentence
def lemmatizer(string):
    word_pos_tags = nltk.pos_tag(word_tokenize(string))  # Get POS tags
    a = [wl.lemmatize(tag[0], get_wordnet_pos(tag[1])) for idx, tag in enumerate(word_pos_tags)]  # Map the POS tag and lemmatize the word/token
    return " ".join(a)

# Final function
def finalpreprocess(string):
    return lemmatizer(' '.join(remove_stopwords(preprocess(string))))
The function seems to work fine when I test it. When I do
text = 'Ram and Bheem are buddies. They (both) like <b>running</b>. They got better at it over the weekend'
print(finalpreprocess(text))
I see the exact result I want.
ram bheem buddy like run get well weekend
However, when I try to apply this function finalpreprocess() to a column in a PySpark DataFrame, I am getting errors.
Here is what I did.
udf_txt_clean = udf(lambda x: finalpreprocess(x), StringType())
df.withColumn("cleaned_text", udf_txt_clean(col("reason"))).select("reason", "cleaned_text").show(10, False)
Then I am getting the error:
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 473, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 563, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.RLock' object
PicklingError: Could not serialize object: TypeError: cannot pickle '_thread.RLock' object
So far, here is what I did. In my finalpreprocess() I am using three different functions: preprocess(), remove_stopwords(), and lemmatizer(). I changed my udf_txt_clean accordingly, like:
udf_txt_clean = udf(lambda x: preprocess(x), StringType())
udf_txt_clean = udf(lambda x: remove_stopwords(x), StringType())
These two run fine, but
udf_txt_clean = udf(lambda x: lemmatizer(x), StringType())
is the one that gives me the error. I am not able to understand why this function gives the error but not the other two. From my limited understanding, I see that it's having trouble trying to pickle this function, but I am not able to understand why it's trying to pickle this in the first place, or whether there is a workaround for it.
It would help if the example were more reproducible next time; it took a bit to re-create this. No worries, though, I have a solution here.
First, cloudpickle is the mechanism Spark uses to move a function from the driver to the workers: functions are pickled and then sent to the workers for execution. So something you are using can't be pickled. To test quickly, you can just use:
import cloudpickle
cloudpickle.dumps(x)
where x is whatever you want to test for cloudpickle-ability. In this case, I tried a couple of things and found that wordnet is not serializable. You can test with:
cloudpickle.dumps(wordnet)
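If it helps, the quick test above can be wrapped in a small reusable check; a minimal sketch (the helper name is just illustrative):

import cloudpickle

def is_cloudpickleable(obj):  # hypothetical helper, just for local debugging
    try:
        cloudpickle.dumps(obj)
        return True
    except Exception as e:
        print(f"Cannot pickle {obj!r}: {e}")
        return False

# Example: reproduces the issue described above
# from nltk.corpus import wordnet
# is_cloudpickleable(wordnet)  # expected to print the pickling error and return False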
and it will reproduce the issue. You can get around this by importing the stuff that can't be pickled inside your function. Here is an end-to-end example for you.
import re
import string
import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType, StringType
def preprocess(text):
    text = text.lower()
    text = text.strip()
    text = re.compile('<.*?>').sub('', text)
    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'\[[0-9]*\]', ' ', text)
    text = re.sub(r'[^\w\s]', '', text.lower().strip())
    text = re.sub(r'\d', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text
# LEMMATIZATION
# Initialize the lemmatizer
wl = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    text = [i for i in text.split() if i not in stop_words]
    return text
def lemmatizer(string):
    # Import here so that wordnet is not captured (and pickled) with the UDF
    from nltk.corpus import wordnet

    def get_wordnet_pos(tag):
        if tag.startswith('J'):
            return wordnet.ADJ
        elif tag.startswith('V'):
            return wordnet.VERB
        elif tag.startswith('N'):
            return wordnet.NOUN
        elif tag.startswith('R'):
            return wordnet.ADV
        else:
            return wordnet.NOUN

    word_pos_tags = nltk.pos_tag(word_tokenize(string))  # Get POS tags
    a = [wl.lemmatize(tag[0], get_wordnet_pos(tag[1])) for idx, tag in enumerate(word_pos_tags)]  # Map the POS tag and lemmatize the word/token
    return " ".join(a)

# Final function
def finalpreprocess(string):
    return lemmatizer(' '.join(remove_stopwords(preprocess(string))))
spark = SparkSession.builder.getOrCreate()

text = 'Ram and Bheem are buddies. They (both) like <b>running</b>. They got better at it over the weekend'
test = pd.DataFrame({"test": [text]})
sdf = spark.createDataFrame(test)

udf_txt_clean = udf(lambda x: finalpreprocess(x), StringType())
sdf.withColumn("cleaned_text", udf_txt_clean(col("test"))).select("test", "cleaned_text").show(10, False)
Is there any way to update a list/variable inside a UDF?
Let's consider this scenario:
studentsWithNewId = []

def changeStudentId(studentId):
    if condition:
        newStudentId = computeNewStudentId()  # this function is based on the studentsWithNewId list contents
        studentsWithNewId.append(newStudentId)
        return newStudentId
    return studentId

udfChangeStudentId = udf(changeStudentId, IntegerType())

studentsDF.select(udfChangeStudentId("studentId"))
Is this possible and safe in a cluster environment?
The code above is just an example, so maybe it can be rewritten in some other, better way.
I would like to run a test several times according to a given list.
I build the list from a given file in the 'setup_module' section.
Is it possible to do something like this?
data = []

def setup_module(module):
    with open('data.json') as config_file:
        configData = json.load(config_file)
        data = fillData(configData)

@pytest.mark.parametrize("data", data)
def test_data(data):
    for d in data:
        .
        .
        .
Thanks,
Avi
I am not sure about the format of your data, but you could do something like this:
import pytest

scenarios = [('first', {'attribute': 'value'}), ('second', {'attribute': 'value'})]

@pytest.mark.parametrize("test_id,scenario", scenarios)
def test_scenarios(test_id, scenario):
    assert scenario["attribute"] == "value"
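One detail worth noting for the original question: @pytest.mark.parametrize arguments are evaluated at collection time, so the list has to exist when the module is imported rather than being filled in setup_module. A minimal sketch of that approach, assuming data.json is the file from the question (and that fillData, if you keep it, is your own helper):

import json
import pytest

def load_data():
    # Runs at import/collection time, i.e. before any setup_module would run
    with open('data.json') as config_file:
        configData = json.load(config_file)
    return configData  # or fillData(configData), as in the question

data = load_data()

@pytest.mark.parametrize("d", data)
def test_data(d):
    assert d is not None  # placeholder assertion; real checks go here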
I want to apply a UDF which returns the original value only if it can be converted to int. So far I have tried two functions:
def nb_int(s):
    try:
        val = int(s)
        return s
    except:
        return "ERROR"

def nb_digit(s):
    if (s.isdigit() == True):
        return s
    else:
        return "ERROR"

nb_udf = F.udf(nb_digit, StringType())
df_corrected = df.withColumn("IntValue", nb_udf("nb_value"))
I applied this function to "nb_value", but it's not working:
df_corrected.filter(df_corrected["IntValue"] == "ERROR").select("nb_value").dropDuplicates(subset=["nb_value"]).collect()
The results of the last line should only give me values which are not convertible, but I still have 1, 2, 4, etc.:
[Row(nb_value=u'MS'),
Row(nb_value=u'286'),
Row(nb_value=u'TB'),
Row(nb_value=u'GF'),
Row(nb_value=u'287'),
Row(nb_value=u'MU'),
Row(nb_value=u'170'),
Row(nb_value=u'A9'),
Row(nb_value=u'288'),
Row(nb_value=u'171'),
Row(nb_value=u'333'),....
Any help to fix it is welcome! Thanks
Both your UDFs work for me. You'd better use the second one, because raising exceptions can slow Spark down. The == True is redundant in nb_digit, but apart from that it's fine.
First let's create the example dataframe:
df = hc.createDataFrame(sc.parallelize([['MS'],['286'],['TB'],['GF'],['287'],['MU'],['170'],['A9'],['288'],['171'],['333']]), ["nb_value"])
Using any of your UDFs:
df_corrected = df.withColumn("IntValue", nb_udf("nb_value"))
df_corrected.filter(df_corrected["IntValue"] == "ERROR").select("nb_value").dropDuplicates(subset=["nb_value"]).collect()
[Row(nb_value=u'MS'),
Row(nb_value=u'TB'),
Row(nb_value=u'GF'),
Row(nb_value=u'A9'),
Row(nb_value=u'MU')]
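As a side note, the same result can be obtained without a UDF by relying on cast, which yields null for values that cannot be converted to int; a minimal sketch of that alternative:

from pyspark.sql import functions as F

# cast("int") returns null for non-convertible strings, so flag those as "ERROR"
df_corrected = df.withColumn(
    "IntValue",
    F.when(F.col("nb_value").cast("int").isNull(), "ERROR")
     .otherwise(F.col("nb_value"))
)

df_corrected.filter(F.col("IntValue") == "ERROR").select("nb_value").dropDuplicates(subset=["nb_value"]).collect()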
I need to update the value, and if the value is zero, drop that row. Here is a snapshot of the code.
val net = sc.accumulator(0.0)
df1.foreach(x => { net += calculate(df2, x) })

def calculate(df2: DataFrame, x: Row): Double = {
  var pro: Double = 0.0
  df2.foreach(y => {
    if (xxx) { /* do some stuff and update the y.getLong(2) value */ }
    else if (yyy) { /* do some stuff and update the y.getLong(2) value */ }
    if (y.getLong(2) == 0) { /* drop this row from df2 */ }
  })
  return pro
}
Any suggestions? Thanks.
You cannot change a DataFrame or RDD; they are read-only for a reason. But you can create a new one and use all the available transformations. So when you want to change, for example, the contents of a column in a DataFrame, just add a new column with the updated contents by using functions like:
df.withColumn(...)
DataFrames are immutable: you cannot update a value in place, but rather create a new DataFrame every time.
Can you reframe your use case? It's not very clear what you are trying to achieve with the above snippet (I am not able to understand the use of the accumulator).
You can instead try df2.withColumn(...) and use your UDF there.
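To make the "update, then drop the zero rows" idea concrete in immutable terms, here is a minimal PySpark sketch of the pattern (the column name value, the condition, and the update rule are placeholders for your own logic; the same shape applies in Scala with withColumn and filter):

from pyspark.sql import functions as F

# Hypothetical column "value" standing in for y.getLong(2) in the question
updated = df2.withColumn(
    "value",
    F.when(F.col("value") < 0, 0)      # placeholder for "do some stuff and update the value"
     .otherwise(F.col("value") * 2)    # placeholder for the other branch
)

# "Dropping" rows means keeping everything else in a new DataFrame
result = updated.filter(F.col("value") != 0)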