How can we fix this AttributeError: 'bytes' object has no attribute 'map'? - pyspark

I'm trying to run this code on an AWS EMR instance to extract features from images with a transfer learning model, but I get this error: AttributeError: 'bytes' object has no attribute 'map'. The error comes from these lines:
input = np.stack(content_series.map(preprocess))
preds = model.predict(input)
Here is the whole code:
def preprocess(content):
    img = Image.open(io.BytesIO(content)).resize([224, 224])
    arr = img_to_array(img)
    return preprocess_input(arr)

def featurize_series(model, content_series):
    """
    Featurize a pd.Series of raw images using the input model.
    :return: a pd.Series of image features
    """
    input = np.stack(content_series.map(preprocess))
    preds = model.predict(input)
    output = [p.flatten() for p in preds]
    return pd.Series(output)
@pandas_udf('array<float>', PandasUDFType.SCALAR_ITER)
def featurize_udf(content_series_iter):
    '''
    This method is a Scalar Iterator pandas UDF wrapping our featurization function.
    The decorator specifies that this returns a Spark DataFrame column of type ArrayType(FloatType).

    :param content_series_iter: This argument is an iterator over batches of data, where each batch
                                is a pandas Series of image data.
    '''
    # With Scalar Iterator pandas UDFs, we can load the model once and then re-use it
    # for multiple data batches. This amortizes the overhead of loading big models.
    model = model_fn()
    for content_series in content_series_iter:
        yield featurize_series(model, content_series)
Do you have any suggestions? Thanks.
I tried changing the instance configuration.

The error occurs because content_series.map tries to apply the preprocess function to each element of content_series, which is assumed to be a pandas Series. The error message, however, indicates that content_series is a bytes object, which has no map attribute.
The content_series_iter argument to featurize_udf is an iterator over batches of image data, where each batch should be a pandas Series. To resolve the error, you can modify featurize_udf so that any batch arriving as a raw bytes object is wrapped in a pandas Series before preprocess is applied:
@pandas_udf('array<float>', PandasUDFType.SCALAR_ITER)
def featurize_udf(content_series_iter):
    model = model_fn()
    for content_series in content_series_iter:
        # If a batch arrives as a raw bytes object rather than a pandas Series,
        # wrap it in a one-element Series so .map(preprocess) works.
        if isinstance(content_series, bytes):
            content_series = pd.Series([content_series])
        yield featurize_series(model, content_series)
This should resolve the AttributeError and allow you to run the code successfully on the AWS EMR instance.
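For reference, here is a minimal sketch of how such a UDF is usually applied so that each batch arrives as a pandas Series of raw image bytes; the binaryFile reader, the S3 path, and the column names are illustrative assumptions, not details taken from the question:

from pyspark.sql.functions import col

# Hypothetical setup: read the images as binary files; the "content" column holds raw bytes.
images = (spark.read.format("binaryFile")
          .option("pathGlobFilter", "*.jpg")
          .load("s3://my-bucket/images/"))  # placeholder path

# Spark feeds the scalar-iterator pandas UDF an iterator of pandas Series, one Series per batch.
features = images.select(col("path"), featurize_udf(col("content")).alias("features"))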

Related

NullPointerException when reading a file with RowCsvInputFormat in Flink

I am a beginner with Flink streaming.
When reading a file with RowCsvInputFormat, the Row that the Kryo serializer creates does not work properly.
The code is below.
val readLocalCsvFile = new RowCsvInputFormat(
  new Path("flink-test/000000_1"),
  Array(Types.STRING, Types.STRING, Types.STRING),
  "\n",
  ","
)
val read = env.readFile(
  readLocalCsvFile,
  "flink-test/000000_1",
  FileProcessingMode.PROCESS_CONTINUOUSLY,
  1000000)
read.print()
env.execute("test")
The contents of the file 000000_1 are as follows.
aa,bb,cc
aaa,bbb,ccc
As a result of debugging, I can see the split values aa, bb, and cc correctly. But when I put those values into the Row's fields one by one, a NullPointerException is raised because the fields are null.
A screenshot from the debugger showed that the fields of the Row are null.
The code that creates the Row when the above code runs is shown below; the KryoSerializer generates the Row.
val kryo = new EmptyFlinkScalaKryoInstantiator().newKryo
val Row = kryo.newInstance(classOf[Row])
The output error is as follows.
java.lang.NullPointerException
at org.apache.flink.types.Row.setField(Row.java:140)
at org.apache.flink.api.java.io.RowCsvInputFormat.fillRecord(RowCsvInputFormat.java:162)
at org.apache.flink.api.java.io.RowCsvInputFormat.fillRecord(RowCsvInputFormat.java:33)
at org.apache.flink.api.java.io.CsvInputFormat.readRecord(CsvInputFormat.java:113)
at org.apache.flink.api.common.io.DelimitedInputFormat.nextRecord(DelimitedInputFormat.java:551)
at org.apache.flink.api.java.io.CsvInputFormat.nextRecord(CsvInputFormat.java:80)
at org.apache.flink.streaming.api.functions.source.ContinuousFileReaderOperator.readAndCollectRecord(ContinuousFileReaderOperator.java:387)
at
Maybe you can post the complete code.
Judging from the task's error report, it may be that the number of fields does not match.

How to union multiple dynamic inputs in Palantir Foundry?

I want to union multiple datasets in Palantir Foundry. The names of the datasets are dynamic, so I would not be able to give the dataset names in transform_df() statically. Is there a way I can dynamically take multiple inputs into transform_df and union all of those dataframes?
I tried looping over the datasets like:
li = ['dataset1_path', 'dataset2_path']
union_df = None
for p in li:
    @transform_df(
        my_input=Input(p),
        Output(p + "_output")
    )
    def my_compute_function(my_input):
        return my_input

    if union_df is None:
        union_df = my_compute_function
    else:
        union_df = union_df.union(my_compute_function)
But, this doesn't generate the unioned output.
This should work for you with some changes. It is an example of a dynamic dataset built from JSON files, so your situation may differ only a little. It shows a generalized way of handling dynamic JSON input datasets that should be adaptable to any dynamic input file type, or to any internal Foundry dataset you can specify. The generic example works on a set of JSON files uploaded to a dataset node in the platform and is fully dynamic. Doing a union after this should be a simple matter.
There's some bonus logging going on here as well.
Hope this helps.
from transforms.api import Input, Output, transform
from pyspark.sql import functions as F
import json
import logging


def transform_generator():
    transforms = []
    transf_dict = {## enter your dynamic mappings here ##}

    for value in transf_dict:
        @transform(
            out=Output(' path to your output here '.format(val=value)),
            inpt=Input(" path to input here ".format(val=value)),
        )
        def update_set(ctx, inpt, out):
            spark = ctx.spark_session
            sc = spark.sparkContext

            filesystem = list(inpt.filesystem().ls())
            file_dates = []
            for files in filesystem:
                with inpt.filesystem().open(files.path) as fi:
                    data = json.load(fi)
                file_dates.append(data)

            logging.info('info logs:')
            logging.info(file_dates)
            json_object = json.dumps(file_dates)
            df_2 = spark.read.option("multiline", "true").json(sc.parallelize([json_object]))
            df_2 = df_2.withColumn('upload_date', F.current_date())
            df_2 = df_2.drop_duplicates()
            out.write_dataframe(df_2)

        transforms.append(update_set)

    return transforms


TRANSFORMS = transform_generator()
So this question breaks down into two questions.
How to handle transforms with programmatic input paths
To handle transforms with programmatic inputs, it is important to remember two things:
1st - Transforms determine your inputs and outputs at CI time. This means you can have Python code that generates transforms, but you cannot read paths from a dataset; they need to be hardcoded into the Python code that generates the transform.
2nd - Your transforms will be created once, during the CI execution. This means you can't have incremental or special logic to generate different paths whenever the dataset builds.
With these two premises, as in your example or @jeremy-david-gamet's (thanks for the reply, gave you a +1), you can have Python code that generates your paths at CI time.
dataset_paths = ['dataset1_path', 'dataset2_path']

for path in dataset_paths:
    @transform_df(
        Output(f"{path}_output"),
        my_input=Input(path),
    )
    def my_compute_function(my_input):
        return my_input
However, to union them you'll need a second transform that executes the union. You'll need to pass it multiple inputs, so you can use *args or **kwargs for this:
dataset_paths = ['dataset1_path', 'dataset2_path']
all_args = [Input(path) for path in dataset_paths]
all_args.append(Output("path/to/unioned_dataset"))

@transform_df(*all_args)
def my_compute_function(*args):
    input_dfs = []
    for arg in args:
        # There are other arguments like ctx in the args list, so we need to check
        # the type. You can also use kwargs for more determinism.
        if isinstance(arg, pyspark.sql.DataFrame):
            input_dfs.append(arg)

    # Now that you have your dfs in a list you can union them.
    # Note I didn't test this code, but it should be something like this:
    ...
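One way to finish that sketch (this completion is mine, not part of the original answer, and it assumes the input datasets share a compatible schema) is to fold the collected DataFrames together with functools.reduce:

from functools import reduce

def union_all(dfs):
    # Pairwise union of a list of Spark DataFrames; unionByName matches
    # columns by name rather than by position.
    return reduce(lambda df1, df2: df1.unionByName(df2), dfs)

# e.g. at the end of my_compute_function:
# return union_all(input_dfs)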
How to union datasets with different schemas.
For this part there are plenty of Q&As out there on how to union different DataFrames in Spark. Here is a short code example copied from https://stackoverflow.com/a/55461824/26004:
from pyspark.sql import SparkSession, HiveContext
from pyspark.sql.functions import lit
from pyspark.sql import Row


def customUnion(df1, df2):
    cols1 = df1.columns
    cols2 = df2.columns
    total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))

    def expr(mycols, allcols):
        def processCols(colname):
            if colname in mycols:
                return colname
            else:
                return lit(None).alias(colname)
        cols = map(processCols, allcols)
        return list(cols)

    appended = df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))
    return appended
Since inputs and outputs are determined at CI time, we cannot form truly dynamic inputs. We will have to somehow point to specific datasets in the code. Assuming the paths of the datasets share the same root, the following seems to require minimal maintenance:
from transforms.api import transform_df, Input, Output
from functools import reduce

datasets = [
    'dataset1',
    'dataset2',
    'dataset3',
]

inputs = {f'inp{i}': Input(f'input/folder/path/{x}') for i, x in enumerate(datasets)}

kwargs = {
    **{'output': Output('output/folder/path/unioned_dataset')},
    **inputs
}

@transform_df(**kwargs)
def my_compute_function(**inputs):
    unioned_df = reduce(lambda df1, df2: df1.unionByName(df2), inputs.values())
    return unioned_df
Regarding unions of different schemas, since Spark 3.1 one can use this:
df1.unionByName(df2, allowMissingColumns=True)
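As a quick illustration (the two toy DataFrames below are made up for the example), the columns missing on either side come back as nulls:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical frames with partially overlapping schemas.
df1 = spark.createDataFrame([(1, "a")], ["id", "name"])
df2 = spark.createDataFrame([(2, 3.5)], ["id", "score"])

# Columns missing on either side are filled with null (requires Spark >= 3.1).
df1.unionByName(df2, allowMissingColumns=True).show()
# Resulting rows: (1, "a", null) and (2, null, 3.5) with columns id, name, score.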

spark.read.format('libsvm') not working with python

I am learning PySpark and encountered a problem that I can't fix. I followed a video to copy code from the PySpark documentation to load data for linear regression. The code I got from the documentation was spark.read.format('libsvm').load('file.txt'). I created a Spark DataFrame before this, by the way. When I run this code in a Jupyter notebook it keeps giving me a Java error, while the person in the video did exactly the same thing I did and didn't get this error. Can someone help me resolve this issue, please?
Many thanks!
I think I solved this issue by setting the "numFeatures" in the option method:
training = spark.read.format('libsvm').option("numFeatures","10").load('sample_linear_regression_data.txt', header=True)
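For context, here is a minimal sketch of how that load typically feeds into the linear regression example; the LinearRegression step and the file name follow the standard Spark ML example and are assumptions on my part, not details from the question:

from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.getOrCreate()

# numFeatures tells the libsvm reader the feature-vector length up front.
training = (spark.read.format('libsvm')
            .option('numFeatures', '10')
            .load('sample_linear_regression_data.txt'))

# Fit a basic linear regression on the (label, features) columns the reader produces.
lr = LinearRegression(featuresCol='features', labelCol='label')
model = lr.fit(training)
print(model.coefficients)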
You can use this custom function to read a libsvm file:
from pyspark.sql import Row
from pyspark.ml.linalg import SparseVector


def read_libsvm(filepath, spark_session):
    '''
    A utility function that takes in a libsvm file and turns it into a pyspark dataframe.

    Args:
        filepath (str): The file path to the data file.
        spark_session (object): The SparkSession object to create dataframe.

    Returns:
        A pyspark dataframe that contains the data loaded.
    '''
    with open(filepath, 'r') as f:
        raw_data = [x.split() for x in f.readlines()]

    outcome = [int(x[0]) for x in raw_data]
    index_value_dict = list()
    for row in raw_data:
        index_value_dict.append(dict([(int(x.split(':')[0]), float(x.split(':')[1]))
                                      for x in row[1:]]))

    max_idx = max([max(x.keys()) for x in index_value_dict])
    rows = [
        Row(
            label=outcome[i],
            feat_vector=SparseVector(max_idx + 1, index_value_dict[i])
        )
        for i in range(len(index_value_dict))
    ]

    df = spark_session.createDataFrame(rows)
    return df
Usage:
my_data = read_libsvm(filepath="sample_libsvm_data.txt", spark_session=spark)
You can try to load via:
from pyspark.mllib.util import MLUtils
df = MLUtils.loadLibSVMFile(sc,"data.libsvm",numFeatures=781).toDF()
sc is the Spark context and df is the resulting data frame.

Converting DataFrame String column containing missing values to Date in Julia

I'm trying to convert a DataFrame String column to Date format in Julia, but if the column contains missing values an error is produced:
ERROR: MethodError: no method matching Int64(::Missing)
The code I've tried to run (which works for columns with no missing data) is:
df_pp[:tod] = Date.(df_pp[:tod], DateFormat("d/m/y"));
Other lines of code I have tried are:
df_pp[:tod] = Date.(passmissing(df_pp[:tod]), DateFormat("d/m/y"));
df_pp[.!ismissing.(df_pp[:tod]), :tod] = Date.(df_pp[:tod], DateFormat("d/m/y"));
The code relates to a column named tod in a data frame named df_pp. Both the DataFrames & Dates packages have been loaded prior to attempting this.
The passmissing way is
df_pp.tod = passmissing(x->Date(x, DateFormat("d/m/y"))).(df_pp.tod)
What happens here is this: passmissing takes a function and returns a new function that handles missings (by returning missing). Inside the brackets, in x->Date(x, DateFormat("d/m/y")), I define a new anonymous function that calls the Date function with the appropriate DateFormat.
Finally, I use the function returned by passmissing immediately on df_pp.tod, using a . to broadcast along the column.
It's easier to see the syntax if I split it up:
myDate(x) = Date(x, DateFormat("d/m/y"))
Date_accepting_missing = passmissing(myDate)
df_pp[:tod] = Date_accepting_missing.(df_pp[:tod])

How to add a string as data to a dataset?

I use the following code to create a simple dataset and add the first two rows:
data = dataset([1; 2],[3; 4],'VarNames', {'A', 'B'})
After that, I would like to replace the value 4 with 'test':
data(1,2) = 'test'
Since this throws the following exception:
Error using dataset/subsasgnParens (line 198)
Right hand side must be a dataset array.
Error in dataset/subsasgn (line 79)
a = subsasgnParens(a,s,b,creating);
I also tried:
data(1,2) = dataset('test');
But this is also not working. Therefore my question: how can I add a string to a dataset like the one I have created, using the method I'm using (I have to specify the row and column)?
You can't do
data(1,2) = dataset('test');
because 'test' is of char type while the rest of your data are doubles, and because the string 'test' has four elements, which you're trying to put into a single element of an array.
You need to use cell arrays. If you want to use the dataset function capabilities, see the cell2dataset and dataset2cell functions. For example:
data = dataset([1; 2],[3; 4],'VarNames',{'A', 'B'})
data2 = dataset2cell(data);
data2{3,1} = 'test';
data3 = cell2dataset(data2,'ReadVarNames',true);