I am performing reverse geocoding, i.e., converting coordinates to a city name, using geopy's Nominatim locator in PySpark. I wrapped it in a function getCity() that takes a coordinate pair as a string and returns the city. For example:
from geopy.geocoders import Nominatim
locator = Nominatim(user_agent="myGeocoder")
def getCity(coords):
    loc = locator.reverse(coords, language="en")
    try:
        city = loc.raw["address"]["city"]
    except:
        # If the city can't be found, return the state; otherwise just an empty string
        city = '' if loc is None else loc.raw["address"].get('state', '')
    return city
This function works fine when mapped over a pandas DataFrame column but throws an exception in PySpark. I have tried withColumn, Spark UDFs, and even converting to an RDD for this mapping, but the same exception is raised every time:
UDF_getCity = F.udf(getCity, StringType())
cityDF_spark = cityDF_spark.select('geom', UDF_getCity('geom').alias('CITY'))
cityDF_spark.show()
PythonException: An exception was thrown from a UDF: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
AttributeError: 'RequestsHTTPWithSSLContextAdapter' object has no attribute '_RequestsHTTPWithSSLContextAdapter__ssl_context'
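One thing worth trying, as a sketch rather than a confirmed fix: the traceback suggests the module-level locator is serialized together with the UDF and cannot be rebuilt on the workers, so the version below constructs the locator inside the function instead. The name get_city and the make_locator parameter are hypothetical; make_locator exists only so the function can be exercised with a stand-in object and without network access.

```python
def get_city(coords, make_locator=None):
    """Reverse-geocode a "lat, lon" string to a city, else state, else ''."""
    if make_locator is None:
        # Import and construct on the worker, so no geopy object is
        # captured by the UDF and pickled on the driver.
        from geopy.geocoders import Nominatim
        locator = Nominatim(user_agent="myGeocoder")
    else:
        locator = make_locator()  # hypothetical hook, used for testing

    loc = locator.reverse(coords, language="en")
    if loc is None:
        return ""
    address = loc.raw.get("address", {})
    # Fall back to the state when no city is present.
    return address.get("city") or address.get("state", "")
```

It would be registered the same way as before, e.g. F.udf(get_city, StringType()). Note that this builds a new locator for every row, which is simple but wasteful; creating one locator per partition would be a natural refinement.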
I have a CSV file with no header row, and I want to supply the header from a list:
import polars as pl
header = ["col1", "col2"]
df = pl.scan_csv(file_path, has_header = False, with_column_names = header,)
But I get the error: 'list' object is not callable.
I tried:
with_column_names=header[:]
with_column_names=list(header)
I also wrote a function that returns the list:
with_column_names=def_fun(header)
All of these still fail with the same error.
And with
with_column_names=list[header]
there is no error, but the header is still auto-generated as column_1 and so on.
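According to the docs, with_column_names is a Callable[[list[str]], list[str]], so it has to be a function that receives the auto-generated names and returns the new ones, not the list itself.
You can try passing with_column_names = lambda _: header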
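As a minimal sketch of that suggestion, the callable can also be a named function, which makes it easy to check on its own. The names use_header and scan_with_header are hypothetical helpers, not part of polars, and polars is imported lazily inside scan_with_header so that use_header can be tried without it.

```python
header = ["col1", "col2"]


def use_header(auto_names):
    """Callable for with_column_names: receives the auto-generated
    names (column_1, column_2, ...) and returns the replacement names."""
    if len(auto_names) != len(header):
        raise ValueError(
            f"expected {len(header)} columns, file has {len(auto_names)}"
        )
    return header


def scan_with_header(file_path):
    """Lazily scan a headerless CSV and rename its columns via use_header."""
    import polars as pl  # imported here so use_header works without polars

    return pl.scan_csv(file_path, has_header=False, with_column_names=use_header)
```

The length check is an extra safeguard added for this sketch; the one-line lambda does the same renaming without it.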
My goal is to clean the data in a column of a PySpark DataFrame. I have written a function for cleaning:
def preprocess(text):
    text = text.lower()
    text = text.strip()
    text = re.compile('<.*?>').sub('', text)
    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'\[[0-9]*\]', ' ', text)
    text = re.sub(r'[^\w\s]', '', text.lower().strip())
    text = re.sub(r'\d', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text
# LEMMATIZATION
# Initialize the lemmatizer
wl = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    text = [i for i in text.split() if i not in stop_words]
    return text
# This is a helper function to map NLTK part-of-speech tags
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
# Tokenize the sentence
def lemmatizer(string):
    word_pos_tags = nltk.pos_tag(word_tokenize(string))  # Get part-of-speech tags
    a = [wl.lemmatize(tag[0], get_wordnet_pos(tag[1])) for idx, tag in enumerate(word_pos_tags)]  # Map the tag and lemmatize the word/token
    return " ".join(a)
# Final function
def finalpreprocess(string):
    return lemmatizer(' '.join(remove_stopwords(preprocess(string))))
The functions seem to work fine when I test them. When I run
text = 'Ram and Bheem are buddies. They (both) like <b>running</b>. They got better at it over the weekend'
print(finalpreprocess(text))
I see the exact result I want.
ram bheem buddy like run get well weekend
However, when I try to apply finalpreprocess() to a column in a PySpark DataFrame, I get errors.
Here is what I did:
udf_txt_clean = udf(lambda x: finalpreprocess(x),StringType())
df.withColumn("cleaned_text",udf_txt_clean(col("reason"))).select("reason","cleaned_text").show(10,False)
Then I get this error:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 473, in dumps
return cloudpickle.dumps(obj, pickle_protocol)
File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 73, in dumps
cp.dump(obj)
File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 563, in dump
return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.RLock' object
PicklingError: Could not serialize object: TypeError: cannot pickle '_thread.RLock' object
So far, here is what I have tried. finalpreprocess() uses three different functions: preprocess(), remove_stopwords(), and lemmatizer(). I changed udf_txt_clean to call each one in turn:
udf_txt_clean = udf(lambda x: preprocess(x),StringType())
udf_txt_clean = udf(lambda x: remove_stopwords(x),StringType())
These two run fine, but
udf_txt_clean = udf(lambda x: lemmatizer(x),StringType())
is the one that gives me the error. I don't understand why this function fails when the other two don't. From my limited understanding, Spark is having trouble pickling this function, but I don't understand why it is trying to pickle it in the first place, or whether there is a workaround.
It would help if the example were more reproducible next time; it took a bit to re-create this. No worries, though, I have a solution here.
First, cloudpickle is the mechanism Spark uses to move a function from the driver to the workers: functions are pickled and then sent to the workers for execution. So something you are using can't be pickled. To test this quickly, you can use:
import cloudpickle
cloudpickle.dumps(x)
where x is the object you want to check for cloudpickle-ability. In this case, I tried a few candidates and found that wordnet is not serializable. You can test with:
cloudpickle.dumps(wordnet)
and it will reproduce the issue. You can get around this by importing whatever can't be pickled inside your function. Here is an end-to-end example:
import re
import string
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
def preprocess(text):
    text = text.lower()
    text = text.strip()
    text = re.compile('<.*?>').sub('', text)
    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'\[[0-9]*\]', ' ', text)
    text = re.sub(r'[^\w\s]', '', text.lower().strip())
    text = re.sub(r'\d', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text
# LEMMATIZATION
# Initialize the lemmatizer
wl = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    text = [i for i in text.split() if i not in stop_words]
    return text
def lemmatizer(string):
    # wordnet can't be pickled, so import it inside the function that runs on the workers
    from nltk.corpus import wordnet
    def get_wordnet_pos(tag):
        if tag.startswith('J'):
            return wordnet.ADJ
        elif tag.startswith('V'):
            return wordnet.VERB
        elif tag.startswith('N'):
            return wordnet.NOUN
        elif tag.startswith('R'):
            return wordnet.ADV
        else:
            return wordnet.NOUN
    word_pos_tags = nltk.pos_tag(word_tokenize(string))  # Get part-of-speech tags
    a = [wl.lemmatize(tag[0], get_wordnet_pos(tag[1])) for idx, tag in enumerate(word_pos_tags)]  # Map the tag and lemmatize the word/token
    return " ".join(a)
# Final function
def finalpreprocess(string):
    return lemmatizer(' '.join(remove_stopwords(preprocess(string))))
spark = SparkSession.builder.getOrCreate()
text = 'Ram and Bheem are buddies. They (both) like <b>running</b>. They got better at it over the weekend'
test = pd.DataFrame({"test": [text]})
sdf = spark.createDataFrame(test)
udf_txt_clean = udf(lambda x: finalpreprocess(x), StringType())
sdf.withColumn("cleaned_text", udf_txt_clean(col("test"))).select("test", "cleaned_text").show(10, False)
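The pickling check described above can be turned into a small helper that tries each candidate object in turn and reports the ones that fail. This is only a sketch: find_unpicklable and HoldsLock are hypothetical names, and the standard-library pickle module is used as a stand-in so the example runs without Spark installed. On a cluster you would pass cloudpickle.dumps instead, since that is what Spark uses. The RLock imitates the kind of object behind the error in the question.

```python
import pickle
import threading


def find_unpicklable(candidates, dumps=pickle.dumps):
    """Return {name: error message} for every object that `dumps` rejects.

    `candidates` maps a label to the object to test. Pass
    dumps=cloudpickle.dumps to mirror what Spark does (assumption:
    cloudpickle is installed in your environment).
    """
    failures = {}
    for name, obj in candidates.items():
        try:
            dumps(obj)
        except Exception as exc:  # pickling errors come in several types
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


class HoldsLock:
    """Imitates a library object that keeps a lock internally."""

    def __init__(self):
        self._lock = threading.RLock()


report = find_unpicklable(
    {
        "stop_words": {"a", "the"},  # plain set: picklable
        "resource": HoldsLock(),     # contains an RLock: not picklable
    }
)
print(report)
```

Running it over the globals your UDF touches (for example wl, stop_words, and wordnet) narrows down which one needs to be imported or created inside the function.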
I've got an issue trying to replicate the example I saw here: https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-load-data-run-query.
It fails at: hvacTable = sqlContext.createDataFrame(hvac)
and the error it returns is:
'PipelinedRDD' object has no attribute '_get_object_id'
Traceback (most recent call last):
File "/usr/hdp/current/spark2-client/python/pyspark/sql/context.py", line 333, in createDataFrame
return self.sparkSession.createDataFrame(data, schema, samplingRatio, verifySchema)
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1124, in __call__
args_command, temp_args = self._build_args(*args)
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1094, in _build_args
[get_command_part(arg, self.pool) for arg in new_args])
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 289, in get_command_part
command_part = REFERENCE_TYPE + parameter._get_object_id()
AttributeError: 'PipelinedRDD' object has no attribute '_get_object_id'
I'm following the example to a T; it's a PySpark notebook in Jupyter.
Why is this error occurring?
You are probably running it on a newer cluster. Please change "sqlContext" to "spark" to get it to work. We'll update this doc article as well.
Also, in Spark 2.x you can now do this operation with DataFrames, which is simpler. You can replace the snippet that creates the hvac table with the following equivalent:
csvFile = spark.read.csv('wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv', header=True, inferSchema=True)
csvFile.write.saveAsTable("hvac")
When I run this part of my Python code in Azure Databricks:
class customTransformations(Transformer):
    <code>
custom_transformer = customTransformations()
....
pipeline = Pipeline(stages=[custom_transformer, assembler, scaler, rf])
pipeline_model = pipeline.fit(sample_data)
pipeline_model.save(<your path>)
When I attempt to save the pipeline, I get this:
AttributeError: 'customTransformations' object has no attribute '_to_java'
Any workarounds?
It seems there is no easy workaround other than implementing the _to_java method yourself, as suggested here for StopWordsRemover:
Serialize a custom transformer using python to be used within a Pyspark ML pipeline
def _to_java(self):
    """
    Convert this instance to a dill dump, then to a list of strings with the integer value of each byte.
    Use this list as a set of dummy stopwords and store it in a StopWordsRemover instance.
    :return: Java object equivalent to this instance.
    """
    dmp = dill.dumps(self)
    pylist = [str(b) for b in bytearray(dmp)]  # convert bytes to a list of integer strings
    pylist.append(PysparkObjId._getPyObjId())  # add our id so PysparkPipelineWrapper can identify us
    sc = SparkContext._active_spark_context
    java_class = sc._gateway.jvm.java.lang.String
    java_array = sc._gateway.new_array(java_class, len(pylist))
    for i in range(len(pylist)):
        java_array[i] = pylist[i]
    _java_obj = JavaParams._new_java_obj(PysparkObjId._getCarrierClass(javaName=True), self.uid)
    _java_obj.setStopWords(java_array)
    return _java_obj
I am running an AWS Glue job to load a pipe-delimited file on S3 into an RDS Postgres instance, using the auto-generated PySpark script from Glue.
Initially, it complained about NULL values in some columns:
pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null"
After some googling and reading on SO, I tried to replace the NULLs in my file by converting my AWS Glue DynamicFrame to a Spark DataFrame, calling fillna(), and converting back to a DynamicFrame.
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "xyz_catalog", table_name = "xyz_staging_files", transformation_ctx = "datasource0")
custom_df = datasource0.toDF()
custom_df2 = custom_df.fillna(-1)
custom_df3 = custom_df2.fromDF()
applymapping1 = ApplyMapping.apply(frame = custom_df3, mappings = [("id", "string", "id", "int"),........more code
References:
https://github.com/awslabs/aws-glue-samples/blob/master/FAQ_and_How_to.md#3-there-are-some-transforms-that-i-cannot-figure-out
How to replace all Null values of a dataframe in Pyspark
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.fillna
Now, when I run my job, it throws the following error:
Log Contents:
Traceback (most recent call last):
File "script_2017-12-20-22-02-13.py", line 23, in <module>
custom_df3 = custom_df2.fromDF()
AttributeError: 'DataFrame' object has no attribute 'fromDF'
End of LogType:stdout
I am new to Python and Spark and have tried a lot, but can't make sense of this. I would appreciate some expert help.
I tried changing my reconvert command to this:
custom_df3 = glueContext.create_dynamic_frame.fromDF(frame = custom_df2)
But still got the error:
AttributeError: 'DynamicFrameReader' object has no attribute 'fromDF'
UPDATE:
I suspect this is not about NULL values. The message "Can't get JDBC type for null" seems not to refer to a NULL value, but some data/type that JDBC is unable to decipher.
I created a file with only 1 record, no NULL values, changed all Boolean types to INT (and replaced values with 0 and 1), but still get the same error:
pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null"
UPDATE:
Make sure DynamicFrame is imported (from awsglue.context import DynamicFrame), since fromDF / toDF are part of DynamicFrame.
Refer to https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html
You are calling .fromDF on the wrong class. It should look like this:
from awsglue.dynamicframe import DynamicFrame
DynamicFrame.fromDF(custom_df2, glueContext, 'label')
For this error, pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null"
you should drop the null fields.
I was getting similar errors while loading into Redshift tables. After using the command below, the issue was resolved:
Loading= DropNullFields.apply(frame = resolvechoice3, transformation_ctx = "Loading")
In pandas, DataFrame.fillna() is used to fill null values with other specified values. DropNullFields, however, drops all fields in a DynamicFrame whose type is NullType, i.e. fields with missing or null values in every record of the DynamicFrame data set.
In your specific situation, you need to make sure you are using the right class for the appropriate dataset.
Here is the edited version of your code:
from awsglue.dynamicframe import DynamicFrame
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "xyz_catalog", table_name = "xyz_staging_files", transformation_ctx = "datasource0")
custom_df = datasource0.toDF()
custom_df2 = custom_df.fillna(-1)
custom_df3 = DynamicFrame.fromDF(custom_df2, glueContext, 'your_label')
applymapping1 = ApplyMapping.apply(frame = custom_df3, mappings = [("id", "string", "id", "int"),........more code
This is what you are doing: 1. Read the file into a DynamicFrame, 2. Convert it to a DataFrame, 3. Replace null values, 4. Convert back to a DynamicFrame, and 5. ApplyMapping. You were getting the error because step 4 was wrong, so you were feeding a DataFrame to ApplyMapping, which does not work. ApplyMapping is designed for DynamicFrames.
I would suggest reading your data into a DynamicFrame and sticking to the same data type. It would look like this (one way to do it):
from awsglue.dynamicframe import DynamicFrame
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "xyz_catalog", table_name = "xyz_staging_files", transformation_ctx = "datasource0")
custom_df = DropNullFields.apply(frame=datasource0)
applymapping1 = ApplyMapping.apply(frame = custom_df, mappings = [("id", "string", "id", "int"),........more code