I used an AWS Glue custom code transform and wrote the following code:
def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    from pyspark.sql.functions import col, lit
    selected = dfc.select(list(dfc.keys())[0]).toDF()
    selected = selected.withColumn('item2', lit(col('item1')))
    results = DynamicFrame.fromDF(selected, glueContext, "results")
    return DynamicFrameCollection({"results": results}, glueContext)
I got the following error.
AnalysisException: cannot resolve '`item1`' given input columns: [];
You don't need to wrap col('item1') in lit(). lit() creates a column from a literal static value, while col() refers to an existing column. It depends on what exactly you're trying to achieve, but this change will at least fix your error:
selected = selected.withColumn('item2', col('item1'))
I have a CSV with no header, and I want to provide the header from a list:
import polars as pl
header = ["col1", "col2"]
df = pl.scan_csv(file_path, has_header=False, with_column_names=header)
But I get the error: TypeError: 'list' object is not callable.
I tried:
with_column_names=header[:]
with_column_names=list(header)
and a function that returns the list:
with_column_names=def_fun(header)
All of these still raise 'list' object is not callable. With
with_column_names=list[header]
there is no "not callable" error, but the header is still auto-generated as column_1 and so on.
According to the docs, with_column_names is a Callable[[list[str]], list[str]], i.e. a function, not a list.
You can try passing with_column_names = lambda _: header
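Concretely, Polars calls whatever you pass with the auto-generated names and uses the returned list as the header, so any callable that returns your list works. A minimal sketch of the idea (the scan_csv call is shown as a comment since it needs a real file path; the name use_my_header is made up for illustration):

```python
header = ["col1", "col2"]

# Polars calls this with the auto-generated names (e.g. ["column_1", "column_2"])
# and uses whatever list it returns as the header.
def use_my_header(generated_names):
    return header

# df = pl.scan_csv(file_path, has_header=False,
#                  with_column_names=use_my_header).collect()

print(use_my_header(["column_1", "column_2"]))  # ['col1', 'col2']
```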
My goal is to clean the data in a column of a PySpark DataFrame. I have written a function for the cleaning.
import re
import string
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize

def preprocess(text):
    text = text.lower()
    text = text.strip()
    text = re.compile('<.*?>').sub('', text)  # strip HTML tags
    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'\[[0-9]*\]', ' ', text)   # remove bracketed numbers
    text = re.sub(r'[^\w\s]', '', text.lower().strip())
    text = re.sub(r'\d', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text

# LEMMATIZATION
# Initialize the lemmatizer
wl = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    return [i for i in text.split() if i not in stop_words]

# Helper function to map NLTK POS tags to WordNet POS tags
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# Tokenize the sentence, POS-tag it, and lemmatize each token
def lemmatizer(string):
    word_pos_tags = nltk.pos_tag(word_tokenize(string))
    a = [wl.lemmatize(word, get_wordnet_pos(tag)) for word, tag in word_pos_tags]
    return " ".join(a)

# Final function
def finalpreprocess(string):
    return lemmatizer(' '.join(remove_stopwords(preprocess(string))))
The functions seem to work fine when I test them. When I do
text = 'Ram and Bheem are buddies. They (both) like <b>running</b>. They got better at it over the weekend'
print(finalpreprocess(text))
I see the exact result I want.
ram bheem buddy like run get well weekend
However, when I try to apply finalpreprocess() to a column in a PySpark DataFrame, I get errors. Here is what I did:
udf_txt_clean = udf(lambda x: finalpreprocess(x), StringType())
df.withColumn("cleaned_text", udf_txt_clean(col("reason"))).select("reason", "cleaned_text").show(10, False)
Then I get this error:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 473, in dumps
return cloudpickle.dumps(obj, pickle_protocol)
File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 73, in dumps
cp.dump(obj)
File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 563, in dump
return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.RLock' object
PicklingError: Could not serialize object: TypeError: cannot pickle '_thread.RLock' object
So far, here is what I tried. finalpreprocess() uses three different functions: preprocess(), remove_stopwords(), and lemmatizer(). I changed udf_txt_clean accordingly, like:
udf_txt_clean = udf(lambda x: preprocess(x), StringType())
udf_txt_clean = udf(lambda x: remove_stopwords(x), StringType())
These two run fine, but
udf_txt_clean = udf(lambda x: lemmatizer(x), StringType())
is the one that gives the error. I cannot understand why this function fails while the other two do not. From my limited understanding, Spark is having trouble pickling this function, but I do not understand why it needs to pickle it in the first place, or whether there is a workaround.
It would help if the example were more reproducible next time; it took a bit to re-create this. No worries, though, I have a solution here.
First, cloudpickle is the mechanism Spark uses to move a function from the driver to the workers: functions are pickled and then sent to the workers for execution. So something you are using cannot be pickled. For a quick test, you can just use:
import cloudpickle
cloudpickle.dumps(x)
where x is whatever you want to check for picklability. In this case, I tried a couple of things and found that wordnet is not serializable. You can test that with:
cloudpickle.dumps(wordnet)
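If cloudpickle is not at hand, the same class of failure can be reproduced with the standard-library pickle and a plain threading.RLock, the exact type named in the traceback above (a sketch of the failure mode, not your Spark code):

```python
import pickle
import threading

# Locks are a classic example of unpicklable state; an object that holds
# one anywhere in its attributes will fail to serialize the same way.
lock = threading.RLock()
try:
    pickle.dumps(lock)
except TypeError as exc:
    print(exc)  # cannot pickle '_thread.RLock' object
```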
and it will reproduce the issue. You can get around this by importing the things that cannot be pickled inside your function. Here is an end-to-end example for you.
import re
import string
import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

def preprocess(text):
    text = text.lower()
    text = text.strip()
    text = re.compile('<.*?>').sub('', text)  # strip HTML tags
    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'\[[0-9]*\]', ' ', text)   # remove bracketed numbers
    text = re.sub(r'[^\w\s]', '', text.lower().strip())
    text = re.sub(r'\d', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text

# LEMMATIZATION
# Initialize the lemmatizer
wl = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    return [i for i in text.split() if i not in stop_words]

def lemmatizer(string):
    # Import wordnet inside the function so the pickled UDF closure
    # does not capture the unpicklable module object
    from nltk.corpus import wordnet

    # Helper function to map NLTK POS tags to WordNet POS tags
    def get_wordnet_pos(tag):
        if tag.startswith('J'):
            return wordnet.ADJ
        elif tag.startswith('V'):
            return wordnet.VERB
        elif tag.startswith('N'):
            return wordnet.NOUN
        elif tag.startswith('R'):
            return wordnet.ADV
        else:
            return wordnet.NOUN

    word_pos_tags = nltk.pos_tag(word_tokenize(string))  # get POS tags
    a = [wl.lemmatize(word, get_wordnet_pos(tag)) for word, tag in word_pos_tags]
    return " ".join(a)

# Final function
def finalpreprocess(string):
    return lemmatizer(' '.join(remove_stopwords(preprocess(string))))

spark = SparkSession.builder.getOrCreate()
text = 'Ram and Bheem are buddies. They (both) like <b>running</b>. They got better at it over the weekend'
test = pd.DataFrame({"test": [text]})
sdf = spark.createDataFrame(test)
udf_txt_clean = udf(lambda x: finalpreprocess(x), StringType())
sdf.withColumn("cleaned_text", udf_txt_clean(col("test"))).select("test", "cleaned_text").show(10, False)
I am running the code on the Databricks platform, but it is written using Pandas. When I run the following code I get the error:
TypeError: 'DataFrame' object does not support item assignment
Can someone let me know if the error is related to the Spark / Databricks platform not supporting the code?
import numpy as np
import pandas as pd

def matchSchema(df):
    df['active'] = df['active'].astype('boolean')
    df['price'] = df['counts'] / 100
    df.drop('counts', axis=1, inplace=True)
    return df, df.head(3)

(dataset, sample) = matchSchema(df)
print(dataset)
print(sample)
The error is:
TypeError: 'DataFrame' object does not support item assignment
Use 'bool' instead of 'boolean' as the dtype:
df['active'] = df['active'].astype('bool')
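As a quick check, here is a minimal sketch with a made-up frame (column names borrowed from the question, values invented) showing the function going through with astype('bool'):

```python
import pandas as pd

# Invented sample data for illustration
df = pd.DataFrame({"active": [1, 0, 1], "counts": [250, 100, 75]})

def matchSchema(df):
    df['active'] = df['active'].astype('bool')  # 'bool', not 'boolean'
    df['price'] = df['counts'] / 100
    df.drop('counts', axis=1, inplace=True)
    return df, df.head(3)

dataset, sample = matchSchema(df)
print(dataset.dtypes)
```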
I am running an AWS Glue job to load a pipe delimited file on S3 into an RDS Postgres instance, using the auto-generated PySpark script from Glue.
Initially, it complained about NULL values in some columns:
pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null"
After some googling and reading on SO, I tried to replace the NULLs in my file by converting my AWS Glue DynamicFrame to a Spark DataFrame, executing fillna(), and converting back to a DynamicFrame:
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="xyz_catalog",
    table_name="xyz_staging_files",
    transformation_ctx="datasource0")
custom_df = datasource0.toDF()
custom_df2 = custom_df.fillna(-1)
custom_df3 = custom_df2.fromDF()
applymapping1 = ApplyMapping.apply(frame=custom_df3, mappings=[("id",
    "string", "id", "int"), ........more code
References:
https://github.com/awslabs/aws-glue-samples/blob/master/FAQ_and_How_to.md#3-there-are-some-transforms-that-i-cannot-figure-out
How to replace all Null values of a dataframe in Pyspark
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.fillna
Now, when I run my job, it throws the following error:
Log Contents:
Traceback (most recent call last):
File "script_2017-12-20-22-02-13.py", line 23, in <module>
custom_df3 = custom_df2.fromDF()
AttributeError: 'DataFrame' object has no attribute 'fromDF'
End of LogType:stdout
I am new to Python and Spark and have tried a lot, but cannot make sense of this. I would appreciate some expert help on this.
I tried changing my reconvert command to this:
custom_df3 = glueContext.create_dynamic_frame.fromDF(frame = custom_df2)
But still got the error:
AttributeError: 'DynamicFrameReader' object has no attribute 'fromDF'
UPDATE:
I suspect this is not about NULL values. The message "Can't get JDBC type for null" seems not to refer to a NULL value, but to some data/type that JDBC is unable to decipher.
I created a file with only 1 record, no NULL values, changed all Boolean types to INT (and replaced values with 0 and 1), but still get the same error:
pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null"
UPDATE:
Make sure DynamicFrame is imported (from awsglue.dynamicframe import DynamicFrame), since fromDF / toDF are part of DynamicFrame.
Refer to https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html
You are calling .fromDF on the wrong class. It should look like this:
from awsglue.dynamicframe import DynamicFrame
DynamicFrame.fromDF(custom_df2, glueContext, 'label')
For this error, pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null",
you should drop the null fields.
I was getting similar errors while loading to Redshift DB Tables. After using the below command, the issue got resolved
Loading= DropNullFields.apply(frame = resolvechoice3, transformation_ctx = "Loading")
In Pandas, DataFrame.fillna() is used to fill null values with other specified values. DropNullFields, however, drops all null fields in a DynamicFrame whose type is NullType, i.e. fields with missing or null values in every record of the DynamicFrame data set.
In your specific situation, you need to make sure you are using the right class for the appropriate dataset.
Here is the edited version of your code:
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="xyz_catalog",
    table_name="xyz_staging_files",
    transformation_ctx="datasource0")
custom_df = datasource0.toDF()
custom_df2 = custom_df.fillna(-1)
custom_df3 = DynamicFrame.fromDF(custom_df2, glueContext, 'your_label')
applymapping1 = ApplyMapping.apply(frame=custom_df3, mappings=[("id",
    "string", "id", "int"), ........more code
This is what you are doing: 1. read the file into a DynamicFrame, 2. convert it to a DataFrame, 3. fill the null values, 4. convert back to a DynamicFrame, and 5. apply the mapping. You were getting the error because step 4 was wrong: you were feeding a DataFrame to ApplyMapping, which does not work. ApplyMapping is designed for DynamicFrames.
I would suggest reading your data into a DynamicFrame and sticking to the same data type. It would look like this (one way to do it):
from awsglue.dynamicframe import DynamicFrame
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="xyz_catalog",
    table_name="xyz_staging_files",
    transformation_ctx="datasource0")
custom_df = DropNullFields.apply(frame=datasource0)
applymapping1 = ApplyMapping.apply(frame=custom_df, mappings=[("id",
    "string", "id", "int"), ........more code
I need to update a value and, if the value is zero, drop that row. Here is the snippet:
val net = sc.accumulator(0.0)
df1.foreach(x => { net += calculate(df2, x) })

def calculate(df2: DataFrame, x: Row): Double = {
  var pro: Double = 0.0
  df2.foreach(y => {
    if (xxx) { /* do some stuff and update the y.getLong(2) value */ }
    else if (yyy) { /* do some stuff and update the y.getLong(2) value */ }
    if (y.getLong(2) == 0) { /* drop this row from df2 */ }
  })
  return pro
}
Any suggestions? Thanks.
You cannot change a DataFrame or RDD; they are read-only for a reason. But you can create a new one and use transformations by all the means available. So when you want to change, for example, the contents of a column in a dataframe, just add a new column with the updated contents using functions like this:
df.withColumn(...)
DataFrames are immutable: you cannot update a value in place, but rather create a new DataFrame each time.
Can you reframe your use case? It is not very clear what you are trying to achieve with the above snippet (I am not able to understand the use of the accumulator).
You could rather try df2.withColumn(...) and use your udf there.