from pyspark.sql import *
from pyspark import SQLContext
sqc=SQLContext(sc)
input=sc.textFile("file:///home/cloudera/Desktop/uber.txt")
df=input.map(lambda x:x.split(","))
df=sqc.createDataFrame(input.map(lambda x:x.split(","))
input.map(lambda r:Row(basedid=r[0],dt=r[1],nveh=int(r[2]),ncus=int(r[3])))))
When I executed the above code, I got the following error:
TypeError: 'PipelinedRDD' object is not callable
The last line of your code should be
input.map(lambda r: r.split(",")).map(lambda r:Row(basedid=r[0],dt=r[1],nveh=int(r[2]),ncus=int(r[3])))
and remove the extra parentheses at the end of the last line.
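For reference, here is a minimal sketch of the corrected snippet end to end, assuming the same comma-separated input file and column layout as in the question:
from pyspark.sql import Row, SQLContext

sqc = SQLContext(sc)  # sc is the existing SparkContext

# split each comma-separated line, build Row objects, then the DataFrame
input = sc.textFile("file:///home/cloudera/Desktop/uber.txt")
rows = input.map(lambda r: r.split(",")).map(
    lambda r: Row(basedid=r[0], dt=r[1], nveh=int(r[2]), ncus=int(r[3])))
df = sqc.createDataFrame(rows)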
The issue
When running my UDF (listed below), an exception is thrown from a Python worker. The exception is:
File "C:\PATH\SparkInstallation\spark-3.3.1-bin-hadoop3\python\lib\pyspark.zip\pyspark\serializers.py", line 471, in loads
return cloudpickle.loads(obj, encoding=encoding)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: code() argument 13 must be str, not int
I do not know what causes this error, and it seems like it might be caused by something other than my code.
The code
from pyspark.sql.types import StringType, ArrayType
import re
from pyspark.sql import SparkSession
import pyspark.sql.functions as pysparkfunctions
spark = SparkSession.builder.appName('test').getOrCreate()
savedTweets = spark.read.csv("testData/")
def getHashtags(string):
    return re.findall(r"#(\w+)", string)
getHashtagsUDF = pysparkfunctions.udf(getHashtags, ArrayType(StringType()))
savedTweets = savedTweets.withColumn("hashtags", getHashtagsUDF(savedTweets['tweet']))
savedTweets.show()
Where savedTweets has one column called 'tweet' that contains the text of a tweet. The expected outcome would be a second column that gives an array of strings that lists the used hashtags.
Example of UDF
Given the input " #a #b #c", outputs ['a', 'b', 'c']
Given the input " a #b #c", outputs ['c']
I had the same error. I solved it by downgrading the Python version to 3.7.
Edit: it also works in 3.10.9.
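If changing the Python version fixes it, the underlying cause is often a driver/worker interpreter mismatch: code objects pickled under one Python version fail to deserialize under another. A hedged sketch for pinning both sides to the same interpreter before the session is created (the path is a placeholder for your own environment):
import os

# placeholder path: point both driver and workers at the same interpreter
os.environ["PYSPARK_PYTHON"] = "C:/Python310/python.exe"
os.environ["PYSPARK_DRIVER_PYTHON"] = "C:/Python310/python.exe"

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('test').getOrCreate()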
I'm following a textbook example about creating a music recommender and when trying to set up the alias connection to link incorrect names to the correct names in the artist name column, I'm getting a syntax error.
The error is on line 3, on the broadcast call. It has an issue with the how='left'.
from pyspark.sql.functions import broadcast, when
# the error is raised on this first line, at the trailing .\
train_data = user_artist_df.join(broadcast(artist_alias), 'artist', how='left').\
train_data = train_data.withColumn('artist',
                                   when(col('alias').isNull(), col('artist')).\
                                   otherwise(col('alias')))
train_data = train_data.withColumn('artist', col('artist').\
                                   cast(IntegerType())).\
                        drop('alias')
train_data.cache()
train_data.count()
You have a .\ at the end of line 3 that is causing the error. This should work:
from pyspark.sql.functions import broadcast, when, col
from pyspark.sql.types import IntegerType

train_data = user_artist_df.join(broadcast(artist_alias), 'artist', how='left')
train_data = train_data.withColumn('artist',
                                   when(col('alias').isNull(), col('artist'))
                                   .otherwise(col('alias')))
train_data = train_data.withColumn('artist',
                                   col('artist').cast(IntegerType())).drop('alias')
train_data.cache()
train_data.count()
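As a side note, backslash continuations are fragile; wrapping the whole expression in parentheses lets you chain calls across lines without them. A small sketch of the same chain in that style, reusing the imports above:
train_data = (
    user_artist_df
    .join(broadcast(artist_alias), 'artist', how='left')
    .withColumn('artist',
                when(col('alias').isNull(), col('artist'))
                .otherwise(col('alias')))
    .withColumn('artist', col('artist').cast(IntegerType()))
    .drop('alias')
)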
D = # came from numpy.int64 via pandas
E = # came from numpy.int64 via pandas
import pyspark.sql.functions as F
output_df.withColumn("c", F.col("A") - F.log(F.lit(D) - F.lit(E)))
I tried to use multiple lit calls inside a PySpark withColumn operation, but I keep getting errors like
*** AttributeError: 'numpy.int64' object has no attribute '_get_object_id'
But these ones work
D=2
output_df.withColumn("c", F.lit(D))
output_df.withColumn("c", F.lit(2))
Try this
df.withColumn("c", F.col("A") - F.log(F.lit(int(D - E))))
D = int(D)
E = int(E)
Just add these two lines before the withColumn call and it will work. The issue is that PySpark doesn't know how to handle numpy.int64.
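Alternatively, assuming the values really are NumPy scalars, .item() converts any NumPy scalar to the matching built-in Python type. A small sketch with hypothetical stand-in values:
import numpy as np
import pyspark.sql.functions as F

D = np.int64(10)  # hypothetical stand-ins for the pandas-derived values
E = np.int64(3)

# .item() yields a plain Python int, which F.lit() accepts
output_df = output_df.withColumn("c", F.col("A") - F.log(F.lit(D.item() - E.item())))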
I am learning PySpark and encountered a problem that I can't fix. Following this video, I copied code from the PySpark documentation to load data for linear regression. The code I got from the documentation was spark.read.format('libsvm').load('file.txt'). (I had created a Spark DataFrame before this, by the way.) When I run this code in a Jupyter notebook it keeps giving me a Java error, even though the person in the video did exactly the same thing and didn't get it. Can someone help me resolve this issue, please?
Many thanks!
I think I solved this issue by setting the "numFeatures" in the option method:
training = spark.read.format('libsvm').option("numFeatures","10").load('sample_linear_regression_data.txt', header=True)
You can use this custom function to read libsvm file.
from pyspark.sql import Row
from pyspark.ml.linalg import SparseVector
def read_libsvm(filepath, spark_session):
    '''
    A utility function that takes in a libsvm file and turns it into a pyspark dataframe.
    Args:
        filepath (str): The file path to the data file.
        spark_session (object): The SparkSession object to create dataframe.
    Returns:
        A pyspark dataframe that contains the data loaded.
    '''
    with open(filepath, 'r') as f:
        raw_data = [x.split() for x in f.readlines()]

    outcome = [int(x[0]) for x in raw_data]
    index_value_dict = list()
    for row in raw_data:
        index_value_dict.append(dict([(int(x.split(':')[0]), float(x.split(':')[1]))
                                      for x in row[1:]]))

    max_idx = max([max(x.keys()) for x in index_value_dict])
    rows = [
        Row(
            label=outcome[i],
            feat_vector=SparseVector(max_idx + 1, index_value_dict[i])
        )
        for i in range(len(index_value_dict))
    ]
    df = spark_session.createDataFrame(rows)
    return df
Usage:
my_data = read_libsvm(filepath="sample_libsvm_data.txt", spark_session=spark)
You can try loading it via:
from pyspark.mllib.util import MLUtils
df = MLUtils.loadLibSVMFile(sc,"data.libsvm",numFeatures=781).toDF()
sc is the Spark context and df is the resulting data frame.
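One caveat: loadLibSVMFile comes from the RDD-based mllib API, so the vectors in the resulting DataFrame are old-style mllib vectors, which pyspark.ml estimators cannot consume directly. A hedged sketch of the conversion, assuming the vector column is named features (which is what toDF() produces here):
from pyspark.mllib.util import MLUtils

df = MLUtils.loadLibSVMFile(sc, "data.libsvm", numFeatures=781).toDF()
# convert the mllib vector column to the new ml vector type
df = MLUtils.convertVectorColumnsToML(df, "features")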
I am trying to implement a gradient boosting algorithm on a Kaggle dataset in PySpark, for learning purposes. I am facing the error given below:
Traceback (most recent call last):
File "C:/SparkCourse/Gradientboost.py", line 29, in <module>
output=assembler.transform(data)
File "C:\spark\python\lib\pyspark.zip\pyspark\ml\base.py", line 105, in transform
File "C:\spark\python\lib\pyspark.zip\pyspark\ml\wrapper.py", line 281, in _transform
AttributeError: 'OneHotEncoder' object has no attribute '_jdf'
The corresponding code is:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer,VectorIndexer,OneHotEncoder,VectorAssembler
spark=SparkSession.builder.config("spark.sql.warehouse.dir", "file:///C:/temp").appName("Gradientboostapp").enableHiveSupport().getOrCreate()
data= spark.read.csv("C:/Users/codemen/Desktop/Timeseries Analytics/liver_patient.csv",header=True, inferSchema=True)
#data.show()
print(data.count())
#data.printSchema()
print("After deleting null values")
data=data.na.drop()
print(data.count())
data=StringIndexer(inputCol="Gender",outputCol="GenderIndex").fit(data)
#let onehot encode the data
data=OneHotEncoder(inputCol="GenderIndex",outputCol="gendervec")
usedfeature=["Age","gendervec","Total_Bilirubin","Direct_Bilirubin","Alkaline_Phosphotase","Alamine_Aminotransferase","Aspartate_Aminotransferase","Total_Protiens","Albumin","Albumin_and_Globulin_Ratio"]
#
assembler=VectorAssembler(inputCols=usedfeature,outputCol="features")
output=assembler.transform(data)
output.select("features","category").show()
I converted the Gender category into numerical form using StringIndexer, then tried to perform one-hot encoding on the GenderIndex value. The error appears when VectorAssembler runs. Maybe I am missing a very silly concept here; kindly help me figure it out.
This line of code is incorrect: data=OneHotEncoder(inputCol="GenderIndex",outputCol="gendervec"). You are setting data equal to the OneHotEncoder() object, not transforming the data. You need to call transform to encode it. It should look like this:
encoder=OneHotEncoder(inputCol="GenderIndex",outputCol="gendervec")
data = encoder.transform(data)
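Note that the StringIndexer line has the same shape of bug: .fit(data) returns a model, not a DataFrame, so a transform is needed there too. A hedged sketch of the whole corrected feature pipeline, keeping the original column names (in Spark 3.x, OneHotEncoder must also be fit before transforming):
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

# fit() returns a model; transform() applies it to add the new column
indexer = StringIndexer(inputCol="Gender", outputCol="GenderIndex")
data = indexer.fit(data).transform(data)

# in Spark 2.x OneHotEncoder is a plain transformer, as in the answer above
encoder = OneHotEncoder(inputCol="GenderIndex", outputCol="gendervec")
data = encoder.transform(data)

assembler = VectorAssembler(inputCols=usedfeature, outputCol="features")
output = assembler.transform(data)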