Why is there a syntax error on a pyspark broadcast call?

I'm following a textbook example about building a music recommender. When I try to set up the alias join that links incorrect names to the correct names in the artist name column, I get a syntax error.
The error is reported on line 3, on the broadcast call; it seems to take issue with the how='left' argument.
from pyspark.sql.functions import broadcast, when

train_data = user_artist_df.join(broadcast(artist_alias), 'artist', how='left').\  //here is the error
train_data = train_data.withColumn('artist',
    when(col('alias').isNull(), col('artist')).\
    otherwise(col('alias')))
train_data = train_data.withColumn('artist', col('artist').\
    cast(IntegerType())).\
    drop('alias')
train_data.cache()
train_data.count()

You have a trailing .\ at the end of line 3 that is causing the error: the backslash continues the statement onto the next line, which starts a new assignment. Remove it so the join statement ends on that line. This should work:
from pyspark.sql.functions import broadcast, when, col
from pyspark.sql.types import IntegerType

train_data = user_artist_df.join(broadcast(artist_alias), 'artist', how='left')
train_data = train_data.withColumn('artist',
    when(col('alias').isNull(), col('artist'))
    .otherwise(col('alias')))
train_data = train_data.withColumn('artist',
    col('artist').cast(IntegerType())).drop('alias')
train_data.cache()
train_data.count()
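As a side note (my addition, not from the original answer): wrapping the chained calls in parentheses avoids backslash line continuations altogether, which makes this kind of stray-backslash error much harder to hit. A minimal sketch of the same logic:

from pyspark.sql.functions import broadcast, when, col
from pyspark.sql.types import IntegerType

# Parenthesized chaining: no backslashes needed, and trailing comments are safe.
train_data = (
    user_artist_df
    .join(broadcast(artist_alias), 'artist', how='left')
    .withColumn('artist',
                when(col('alias').isNull(), col('artist')).otherwise(col('alias')))
    .withColumn('artist', col('artist').cast(IntegerType()))
    .drop('alias')
)
train_data.cache()
train_data.count()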

Related

PySpark UDF crashing at return cloudpickle.loads(obj, encoding=encoding) with code() argument 13 must be str, not int as error

The issue
When running my UDF (listed below), an exception gets thrown from a Python worker:
File "C:\PATH\SparkInstallation\spark-3.3.1-bin-hadoop3\python\lib\pyspark.zip\pyspark\serializers.py", line 471, in loads
return cloudpickle.loads(obj, encoding=encoding)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: code() argument 13 must be str, not int
I do not know what would cause this error, and it seems like it might be caused by something other than my code.
The code
from pyspark.sql.types import StringType, ArrayType
import re
from pyspark.sql import SparkSession
import pyspark.sql.functions as pysparkfunctions

spark = SparkSession.builder.appName('test').getOrCreate()

savedTweets = spark.read.csv("testData/")

def getHashtags(string):
    return re.findall(r"#(\w+)", string)

getHashtagsUDF = pysparkfunctions.udf(getHashtags, ArrayType(StringType()))

savedTweets = savedTweets.withColumn("hashtags", getHashtagsUDF(savedTweets['tweet']))
savedTweets.show()
Here savedTweets has one column called 'tweet' that contains the text of a tweet. The expected outcome would be a second column containing an array of strings listing the hashtags used.
Example of UDF
Given the input " #a #b #c", outputs ['a', 'b', 'c']
Given the input " a #b #c", outputs ['c']
I had the same error and solved it by downgrading the Python version to 3.7.
Edit: It also works in 3.10.9.
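As additional context (my note, not part of the original answer): this kind of cloudpickle deserialization failure usually points to the Spark workers running a different Python version than the driver, so it is worth making sure both use the same interpreter. A minimal sketch, where the interpreter path is a hypothetical example:

import os

# Point both the driver and the executors at the same interpreter before
# the SparkSession is created. The path below is a hypothetical example;
# use the Python environment you actually start the driver with.
os.environ["PYSPARK_PYTHON"] = "/opt/conda/envs/spark-env/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/opt/conda/envs/spark-env/bin/python"

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('test').getOrCreate()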

How can I reuse the dataframe and use an alternative to iloc to run an iterative imputer in Azure Databricks

I am running an iterative imputer in a Jupyter notebook: I first mark the known incorrect values as NaN and then run the iterative imputer to impute corrected values and achieve the required sharpness in the data. The sample code is given below:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import numpy as np
import pandas as pd

idx = [761, 762, 763, 764]
cols = ['11', '12', '13', '14']

def fit_imputer():
    for i in range(len(idx)):
        for col in cols:
            dfClean.iloc[idx[i], col] = np.nan
            print('Index = {} Col = {} Defiled Value is: {}'.format(idx[i], col, dfClean.iloc[idx[i], col]))
            # Run Imputer for Individual row
            tempOut = imp.fit_transform(dfClean)
            print("Imputed Value = ", tempOut[idx[i], col])
            dfClean.iloc[idx[i], col] = tempOut[idx[i], col]
            print("new dfClean Value = ", dfClean.iloc[idx[i], col])
            origVal.append(dfClean_Orig.iloc[idx[i], col])
I get an error when I try to run this code on Azure Databricks using pyspark or scala, because dataframes in Spark are immutable, and I also cannot use iloc the way I did with a pandas dataframe.
Is there a way, or a better way, of implementing such imputation in Databricks?
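No answer is recorded in this thread. As a hedged sketch of one common workaround (my addition, assuming the data fits in driver memory, that sdf is the Spark dataframe holding the numeric columns, and that spark is the active SparkSession), you can collect to pandas, impute there, and convert back:

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Bring the data to the driver (only viable for modest dataset sizes).
pdf = sdf.toPandas()

idx = [761, 762, 763, 764]          # same rows and columns as in the question
cols = ['11', '12', '13', '14']
pdf.loc[idx, cols] = np.nan         # mark known-bad values; .loc takes the column labels

imp = IterativeImputer()
imputed = pd.DataFrame(imp.fit_transform(pdf), columns=pdf.columns)

# Back to an (immutable) Spark DataFrame.
sdf_clean = spark.createDataFrame(imputed)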

pyspark: How can I use multiple `lit(val)` in a computation where `val` is a variable in python?

D = # came from numpy.int64 via pandas
E = # came from numpy.int64 via pandas
import pyspark.sql.functions as F
output_df.withColumn("c", F.col("A") - F.log(F.lit(D) - F.lit(E)))
I tried to use multiple lit calls inside a pyspark column operation, but I keep getting errors like
*** AttributeError: 'numpy.int64' object has no attribute '_get_object_id'
But these work:
D=2
output_df.withColumn("c", F.lit(D))
output_df.withColumn("c", F.lit(2))
Try this
df.withColumn("c", F.col("A") - F.log(F.lit(int(D - E))))
D = int(D)
E = int(E)
Just add these two lines and it will work. The issue is that pyspark doesn't know how to handle numpy.int64.
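If you would rather keep D and E as separate literals instead of computing the difference in Python first, a variant that should also work (my addition) is converting each numpy scalar to a plain Python number, e.g. with int() or numpy's .item():

import numpy as np
import pyspark.sql.functions as F

D = np.int64(10)  # hypothetical stand-ins for the pandas-derived values
E = np.int64(3)

# .item() turns a numpy scalar into the equivalent built-in Python type,
# which F.lit() accepts.
output_df = output_df.withColumn("c", F.col("A") - F.log(F.lit(D.item()) - F.lit(E.item())))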

spark.read.format('libsvm') not working with python

I am learning PySpark and encountered a problem that I can't fix. I followed this video to copy code from the PySpark documentation to load data for linear regression. The code I got from the documentation was spark.read.format('libsvm').load('file.txt'). I had created a Spark dataframe before this, by the way. When I run this code in a Jupyter notebook it keeps giving me a Java error, yet the person in the video did exactly the same thing and didn't get this error. Can someone help me resolve this issue, please?
Many thanks!
I think I solved this issue by setting the "numFeatures" in the option method:
training = spark.read.format('libsvm').option("numFeatures","10").load('sample_linear_regression_data.txt', header=True)
You can use this custom function to read a libsvm file.
from pyspark.sql import Row
from pyspark.ml.linalg import SparseVector

def read_libsvm(filepath, spark_session):
    '''
    A utility function that takes in a libsvm file and turns it into a pyspark dataframe.
    Args:
        filepath (str): The file path to the data file.
        spark_session (object): The SparkSession object to create the dataframe.
    Returns:
        A pyspark dataframe that contains the data loaded.
    '''
    with open(filepath, 'r') as f:
        raw_data = [x.split() for x in f.readlines()]

    outcome = [int(x[0]) for x in raw_data]
    index_value_dict = list()
    for row in raw_data:
        index_value_dict.append(dict([(int(x.split(':')[0]), float(x.split(':')[1]))
                                      for x in row[1:]]))

    max_idx = max([max(x.keys()) for x in index_value_dict])
    rows = [
        Row(
            label=outcome[i],
            feat_vector=SparseVector(max_idx + 1, index_value_dict[i])
        )
        for i in range(len(index_value_dict))
    ]
    df = spark_session.createDataFrame(rows)
    return df
Usage:
my_data = read_libsvm(filepath="sample_libsvm_data.txt", spark_session=spark)
You can try to load via:
from pyspark.mllib.util import MLUtils
df = MLUtils.loadLibSVMFile(sc,"data.libsvm",numFeatures=781).toDF()
Here sc is the Spark context and df is the resulting data frame.
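One caveat worth adding (my note, not part of the original answer): MLUtils.loadLibSVMFile produces the old mllib vector type, so if the goal is to feed the result to a DataFrame-based estimator such as pyspark.ml's LinearRegression, the feature column may need converting first. A sketch, assuming the default column names label and features:

from pyspark.mllib.util import MLUtils
from pyspark.ml.regression import LinearRegression

df = MLUtils.loadLibSVMFile(sc, "data.libsvm", numFeatures=781).toDF()

# Convert the mllib vector column into the ml vector type that
# DataFrame-based estimators expect.
df_ml = MLUtils.convertVectorColumnsToML(df, "features")

lr = LinearRegression(featuresCol="features", labelCol="label")
model = lr.fit(df_ml)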

PySpark: Error "Cannot pickle standard input" on function map

I'm trying to learn to use Pyspark.
I'm using spark-2.2.0 with Python 3.
I'm facing a problem now and I can't find where it comes from.
My project is to adapt an algorithm written by a data scientist so that it runs in a distributed way. The code below is what I have to use to extract the features from images, and I have to adapt it to extract those features with pyspark.
import json
import sys

# Dependencies can be installed by running:
# pip install keras tensorflow h5py pillow

# Run script as:
# ./extract-features.py images/*.jpg

from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
import numpy as np

def main():
    # Load model VGG16 as described in https://arxiv.org/abs/1409.1556
    # This is going to take some time...
    base_model = VGG16(weights='imagenet')
    # Model will produce the output of the 'fc2' layer, which is the penultimate neural network layer
    # (see the paper above for more details)
    model = Model(input=base_model.input, output=base_model.get_layer('fc2').output)

    # For each image, extract the representation
    for image_path in sys.argv[1:]:
        features = extract_features(model, image_path)
        with open(image_path + ".json", "w") as out:
            json.dump(features, out)

def extract_features(model, image_path):
    img = image.load_img(image_path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    features = model.predict(x)
    return features.tolist()[0]

if __name__ == "__main__":
    main()
I have written the beginning of the code:
rdd = sc.binaryFiles(PathImages)
base_model = VGG16(weights='imagenet')
model = Model(input=base_model.input, output=base_model.get_layer('fc2').output)
rdd2 = rdd.map(lambda x : (x[0], extract_features(model, x[0][5:])))
rdd2.collect()[0]
When I try to extract the features, there is an error:
~/Code/spark-2.2.0-bin-hadoop2.7/python/pyspark/cloudpickle.py in
save_file(self, obj)
623 return self.save_reduce(getattr, (sys,'stderr'), obj=obj)
624 if obj is sys.stdin:
--> 625 raise pickle.PicklingError("Cannot pickle standard input")
626 if hasattr(obj, 'isatty') and obj.isatty():
627 raise pickle.PicklingError("Cannot pickle files that map to tty objects")
PicklingError: Cannot pickle standard input
I tried multiple things, and here is my first result. I know that the error comes from the line below in the method extract_features:
features = model.predict(x)
When I run this line outside of a map function or pyspark, it works fine.
I think the problem comes from the "model" object and its serialization with pyspark.
Maybe I'm not distributing this with pyspark the right way; if you have any clue to help me, I will gladly take it.
Thanks in advance.
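No answer was posted for this question. As a hedged sketch (my addition), a common pattern for this kind of pickling error is to avoid capturing the Keras model in the driver-side closure and instead build it on the executors, for example with mapPartitions, so that only the image paths get shipped to the workers:

def extract_partition(paths):
    # Build the model inside the partition, on the executor, so the driver
    # never has to pickle the Keras model (or the file handles it holds).
    from keras.applications.vgg16 import VGG16
    from keras.models import Model
    base_model = VGG16(weights='imagenet')
    model = Model(input=base_model.input, output=base_model.get_layer('fc2').output)
    for path in paths:
        # Same path handling as in the original snippet ([5:] strips the URI prefix).
        yield (path, extract_features(model, path[5:]))

rdd = sc.binaryFiles(PathImages)
rdd2 = rdd.keys().mapPartitions(extract_partition)
rdd2.collect()[0]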