Extracting features from VectorAssembler - PySpark

When I try to extract the features from the VectorAssembler output, I get an error saying the job was aborted due to a stage failure.
cols_to_keep_unscaled = ['Type_Index']
cols_to_scale = ['AirtemperatureK', 'ProcesstemperatureK', 'Rotationalspeedrpm', 'TorqueNm', 'Toolwearmin']

def extract(row):
    return (row.Type_index) + tuple(row.scaled_features.toArray().tolist())

clean_df = clean_df.select(*cols_to_keep_unscaled, "scaled_features").rdd.map(extract).toDF(cols_to_keep_unscaled + cols_to_scale)
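This question has no posted answer, but a likely culprit (an assumption, not something confirmed in the thread) is that (row.Type_index) is not a tuple and the attribute name does not match the Type_Index column. A minimal sketch of extract with both points fixed:

def extract(row):
    # Wrap the single unscaled value in a one-element tuple and match the column name exactly
    return (row.Type_Index,) + tuple(row.scaled_features.toArray().tolist())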

Py4JJavaError: An error occurred while calling o771.save. Azure Synapse Analytics Notebook

Here is my PySpark code used in the notebook:
from delta.tables import DeltaTable

data_lake_container = 'abfss://abc.dfs.core.windows.net'
stage_folder = 'abc'
delta_lake_folder = 'abc'
source_folder = 'abc'
source_wildcard = 'abc.parquet'
key_column = 'Id'
key_column1 = 'LastModifiedDate'

source_path = data_lake_container + '/' + stage_folder + '/' + source_folder + '/' + source_wildcard
delta_table_path = data_lake_container + '/' + delta_lake_folder + '/' + source_folder

sdf = spark.read.format('parquet').option("recursiveFileLookup", "true").load(source_path)

if DeltaTable.isDeltaTable(spark, delta_table_path):
    delta_table = DeltaTable.forPath(spark, delta_table_path)
    delta_table.alias("existing").merge(
        source=sdf.alias("updates"),
        condition=("existing." + key_column + " = updates." + key_column +
                   " and existing." + key_column1 + " = updates." + key_column1)  # match on both key columns
    ).whenMatchedUpdateAll(
    ).whenNotMatchedInsertAll(
    ).execute()
else:
    sdf.write.format('delta').save(delta_table_path)
While executing the above code, I'm getting the below error:
Py4JJavaError: An error occurred while calling o771.save.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231)
at org.apache.spark.sql.delta.files.TransactionalWrite.$anonfun$writeFiles$1(TransactionalWrite.scala:216)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:107)
Kindly help me resolve this error.
Py4JJavaError: An error occurred while calling o771.save.
: org.apache.spark.SparkException: Job aborted.
The above error generally occurs because of incompatible versions of the Spark connector and Spark.
Refer to: org.apache.spark.SparkException: Job aborted due to stage failure: Task from application
If the above solution does not work for you, please share the full stack trace of the error. It is difficult to identify the issue with the information shared so far.
@AbhishekKhandave, when I looked into the full error, there was a date column with values earlier than '1900-01-01'. That was the issue. I was finally able to run the script. Thank you for your response.
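For readers hitting the same thing, here is a minimal sketch (not from the original thread) of how one might locate and clamp such out-of-range dates before the Delta write; date_col is a hypothetical column name standing in for whichever column holds the offending values:

from pyspark.sql import functions as F

# "date_col" is a hypothetical column name for whichever column is out of range
bad_rows = sdf.filter(F.col("date_col") < F.lit("1900-01-01"))
print(bad_rows.count())  # inspect how many rows fall before 1900-01-01

# One option: clamp those values to 1900-01-01 before writing to Delta
sdf = sdf.withColumn(
    "date_col",
    F.when(F.col("date_col") < F.lit("1900-01-01"), F.to_date(F.lit("1900-01-01")))
     .otherwise(F.col("date_col")))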

Using numba in a method to randomize matrices

I'm still not very familiar with Numba, and my problem is with the piece of code below that I use to randomize the edges of graphs.
The code simply swaps some edges in a connectivity matrix, given the number of desired swaps and a seed for the random number generator.
My problem is that when I try to use Numba to speed it up, I can't manage to get it to run. The error it returns is also pasted below.
import numpy as np
import numba as nb

@nb.jit(nopython=True)
def _randomize_adjacency_wei(A, n_swaps, seed):
    np.random.seed(seed)
    # Number of nodes
    n_nodes = A.shape[0]
    # Copy the adj. matrix
    Arnd = A.copy()
    # Choose edges that will be swapped
    edges = np.random.choice(n_nodes, size=(4, n_swaps), replace=True).T
    #itr = range(n_swaps)
    #for it in tqdm(itr) if verbose else itr:
    it = 0
    for it in range(n_swaps):
        i,j,k,l = edges[it,:]
        if len(np.unique([i,j,k,l]))<4:
            continue
        else:
            # Old values of weights
            w_ij,w_il,w_kj,w_kl=Arnd[i,j],Arnd[i,l],Arnd[k,j],Arnd[k,l]
            # Swapping edges
            Arnd[i,j]=Arnd[j,i]=w_il
            Arnd[k,l]=Arnd[l,k]=w_kj
            Arnd[i,l]=Arnd[l,i]=w_ij
            Arnd[k,j]=Arnd[j,k]=w_kl
    return Arnd
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<function unique at 0x7f1a1c03b0d0>) found for signature:
>>> unique(list(int64)<iv=None>)
There are 2 candidate implementations:
- Of which 2 did not match due to:
Overload in function 'np_unique': File: numba/np/arrayobj.py: Line 1915.
With argument(s): '(list(int64)<iv=None>)':
Rejected as the implementation raised a specific error:
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Unknown attribute 'ravel' of type list(int64)<iv=None>
File "../../../home/vinicius/anaconda3/lib/python3.8/site-packages/numba/np/arrayobj.py", line 1918:
def np_unique_impl(a):
b = np.sort(a.ravel())
^
During: typing of get attribute at /home/vinicius/anaconda3/lib/python3.8/site-packages/numba/np/arrayobj.py (1918)
File "../../../home/vinicius/anaconda3/lib/python3.8/site-packages/numba/np/arrayobj.py", line 1918:
def np_unique_impl(a):
b = np.sort(a.ravel())
^
raised from /home/vinicius/anaconda3/lib/python3.8/site-packages/numba/core/typeinfer.py:1071
During: resolving callee type: Function(<function unique at 0x7f1a1c03b0d0>)
During: typing of call at <ipython-input-165-90ffd30fe0e8> (19)
File "<ipython-input-165-90ffd30fe0e8>", line 19:
def _randomize_adjacency_wei(A, n_swaps, seed):
<source elided>
i,j,k,l = edges[it,:]
if len(np.unique([i,j,k,l]))<4:
^
Thanks in advance,
Vinicius
According to the comments, you are passing a list to np.unique() but this is not supported by Numba.
Modifying the code this way:
i, j, k, l = e = edges[it, :]
if len(np.unique(e)) < 4:
    ...
The following example doesn't produce any errors:
>>> A = np.random.randint(0, 5, (8,8))
>>> r = _randomize_adjacency_wei(A, 4, 33)
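For completeness, here is a sketch of the full function with that single change applied (everything else kept as in the question):

import numpy as np
import numba as nb

@nb.jit(nopython=True)
def _randomize_adjacency_wei(A, n_swaps, seed):
    np.random.seed(seed)
    n_nodes = A.shape[0]
    Arnd = A.copy()
    # Choose edges that will be swapped
    edges = np.random.choice(n_nodes, size=(4, n_swaps), replace=True).T
    for it in range(n_swaps):
        # The only change: call np.unique on the array slice, not on a Python list
        i, j, k, l = e = edges[it, :]
        if len(np.unique(e)) < 4:
            continue
        # Old values of weights
        w_ij, w_il, w_kj, w_kl = Arnd[i, j], Arnd[i, l], Arnd[k, j], Arnd[k, l]
        # Swapping edges
        Arnd[i, j] = Arnd[j, i] = w_il
        Arnd[k, l] = Arnd[l, k] = w_kj
        Arnd[i, l] = Arnd[l, i] = w_ij
        Arnd[k, j] = Arnd[j, k] = w_kl
    return Arnd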

Not able to pass StringIndexer as list to the model pipeline stage

The PySpark Pipeline API is quite new to me. I am trying to create the stages in the pipeline by passing the list below:
pipeline = Pipeline().setStages([indexer,assembler,dtc_model])
where I am applying feature indexing on multiple columns:
cat_col = ['Gender','Habit','Mode']
indexer = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(training_data_0) for column in cat_col ]
On running fit on the pipeline, I am getting the below error:
model_pipeline = pipeline.fit(train_df)
How can we pass the list to the stages, or is there any workaround or better way to do this?
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<command-3999694668013877> in <module>
----> 1 model_pipeline = pipeline.fit(train_df)
/databricks/spark/python/pyspark/ml/base.py in fit(self, dataset, params)
130 return self.copy(params)._fit(dataset)
131 else:
--> 132 return self._fit(dataset)
133 else:
134 raise ValueError("Params must be either a param map or a list/tuple of param maps, "
/databricks/spark/python/pyspark/ml/pipeline.py in _fit(self, dataset)
95 if not (isinstance(stage, Estimator) or isinstance(stage, Transformer)):
96 raise TypeError(
---> 97 "Cannot recognize a pipeline stage of type %s." % type(stage))
98 indexOfLastEstimator = -1
99 for i, stage in enumerate(stages):
TypeError: Cannot recognize a pipeline stage of type <class 'list'>.
Try the below:
cat_col = ['Gender','Habit','Mode']
indexer = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(training_data_0) for column in cat_col ]
assembler = VectorAssembler...
dtc_model = DecisionTreeClassifier...
# Create pipeline using transformers and estimators
stages = indexer
stages.append(assembler)
stages.append(dtc_model)
pipeline = Pipeline().setStages(stages)
model_pipeline = pipeline.fit(train_df)
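As a side note (not part of the original answer), the same pipeline can be built by concatenating the lists, which also avoids mutating the indexer list in place:

# Build the stages list in one expression instead of appending
stages = indexer + [assembler, dtc_model]
pipeline = Pipeline(stages=stages)
model_pipeline = pipeline.fit(train_df)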

Py4JJavaError: wrong column type when calling PCA from pyspark.ml.feature

I am trying to visualize word2vec words using PySpark's PCA function, but I'm getting an unhelpful error message saying that the features column is of the wrong type, even though it doesn't appear to be. (Full message below.)
Background
spark-2.4.0-bin-hadoop2.7
Scala 2.12.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191).
Python 3.6.5 (Anaconda, Inc.)
Ubuntu 16.04
My Code
maxWordsVis = 15
Feat = np.load('Gab_ai_posts_W2Vmatrix.npy')
words = np.load('Gab_ai_posts_WordList.npy')
# to rdd, avoid this with big matrices by reading them directly from hdfs
Feat = sc.parallelize(Feat)
Feat = Feat.map(lambda vec: (Vectors.dense(vec),))
# to dataframe
dfFeat = sqlContext.createDataFrame(Feat,["features"])
dfFeat.head()
Row(features=DenseVector([-0.1282, 0.0699, -0.0891, -0.0437, -0.0915, -0.0557, 0.1432, -0.1564, 0.0058, -0.0603, 0.1383, -0.0359, -0.0306, -0.0415, -0.0191, 0.058, 0.0119, -0.0302, 0.0362, -0.0466, 0.0403, -0.1035, 0.0456, 0.0892, 0.0548, -0.0735, 0.1094, -0.0299, -0.0549, -0.1235, 0.0062, 0.1381, -0.0082, 0.085, -0.0083, -0.0346, -0.0226, -0.0084, -0.0463, -0.0448, 0.0285, -0.0013, 0.0343, -0.0056, 0.0756, -0.0068, 0.0562, 0.0638, 0.023, -0.0224, -0.0228, 0.0281, -0.0698, -0.0044, 0.0395, -0.021, 0.0228, 0.0666, 0.0362, 0.0116, -0.0088, 0.0949, 0.0265, -0.0293, -0.007, -0.0746, 0.0891, 0.0145, 0.0532, -0.0084, -0.0853, 0.0037, -0.055, -0.0706, -0.0296, 0.0321, 0.0495, -0.0776, -0.1339, -0.065, 0.0856, 0.0328, 0.0821, 0.036, -0.0179, -0.0006, -0.036, 0.0438, -0.0077, -0.0012, 0.0322, 0.0354, 0.0513, 0.0436, 0.0002, -0.0578, 0.1062, 0.019, 0.0346, -0.1261]))
numComponents = 3
pca = PCA(k = numComponents, inputCol = "features", outputCol = "pcaFeatures")
Error Message
Py4JJavaError: An error occurred while calling o4583.fit. : java.lang.IllegalArgumentException: requirement failed:
Column features must be of type
struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually
struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
at scala.Predef$.require(Predef.scala:224)
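The thread does not include an accepted fix, but a likely cause (offered here as an assumption, not something stated in the post) is that the DenseVectors were built with pyspark.mllib.linalg while pyspark.ml.feature.PCA expects pyspark.ml.linalg vectors; the two UDTs serialize to structurally identical structs, which is why the error message looks self-contradictory. A sketch using the matching import:

from pyspark.ml.linalg import Vectors   # note: pyspark.ml, not pyspark.mllib
from pyspark.ml.feature import PCA

Feat = np.load('Gab_ai_posts_W2Vmatrix.npy')
feat_rdd = sc.parallelize(Feat).map(lambda vec: (Vectors.dense(vec),))
dfFeat = sqlContext.createDataFrame(feat_rdd, ["features"])
pca = PCA(k=numComponents, inputCol="features", outputCol="pcaFeatures")
dfPCA = pca.fit(dfFeat).transform(dfFeat)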

Retrying celery failed tasks that are part of a chain

I have a celery chain that runs some tasks. Each of the tasks can fail and be retried. Please see below for a quick example:
from celery import task

@task(ignore_result=True)
def add(x, y, fail=True):
    try:
        if fail:
            raise Exception('Ugly exception.')
        print '%d + %d = %d' % (x, y, x+y)
    except Exception as e:
        raise add.retry(args=(x, y, False), exc=e, countdown=10)

@task(ignore_result=True)
def mul(x, y):
    print '%d * %d = %d' % (x, y, x*y)
and the chain:
from celery.canvas import chain
chain(add.si(1, 2), mul.si(3, 4)).apply_async()
Running the two tasks (and assuming that nothing fails), you would see printed:
1 + 2 = 3
3 * 4 = 12
However, when the add task fails the first time and then succeeds on a subsequent retry, the rest of the tasks in the chain do not run: the add task fails, the other tasks in the chain are not run, and after a few seconds the add task runs again and succeeds, but the rest of the chain (in this case mul.si(3, 4)) still does not run.
Does celery provide a way to continue a failed chain from the task that failed onwards? If not, what would be the best approach to accomplish this, making sure that a chain's tasks run in the specified order and only after the previous task has executed successfully, even if that task is retried a few times?
Note 1: The issue can be solved by doing
add.delay(1, 2).get()
mul.delay(3, 4).get()
but I am interested in understanding why chains do not work with failed tasks.
You've found a bug :)
Fixed in https://github.com/celery/celery/commit/b2b9d922fdaed5571cf685249bdc46f28acacde3; it will be part of 3.0.4.
I'm also interested in understanding why chains do not work with failed tasks.
I dug into some celery code and here is what I've found so far:
The implementation is in app.builtins:
@shared_task
def add_chain_task(app):
    from celery.canvas import chord, group, maybe_subtask
    _app = app

    class Chain(app.Task):
        app = _app
        name = 'celery.chain'
        accept_magic_kwargs = False

        def prepare_steps(self, args, tasks):
            steps = deque(tasks)
            next_step = prev_task = prev_res = None
            tasks, results = [], []
            i = 0
            while steps:
                # First task get partial args from chain.
                task = maybe_subtask(steps.popleft())
                task = task.clone() if i else task.clone(args)
                i += 1
                tid = task.options.get('task_id')
                if tid is None:
                    tid = task.options['task_id'] = uuid()
                res = task.type.AsyncResult(tid)
                # automatically upgrade group(..) | s to chord(group, s)
                if isinstance(task, group):
                    try:
                        next_step = steps.popleft()
                    except IndexError:
                        next_step = None
                if next_step is not None:
                    task = chord(task, body=next_step, task_id=tid)
                if prev_task:
                    # link previous task to this task.
                    prev_task.link(task)
                    # set the results parent attribute.
                    res.parent = prev_res
                results.append(res)
                tasks.append(task)
                prev_task, prev_res = task, res
            return tasks, results

        def apply_async(self, args=(), kwargs={}, group_id=None, chord=None,
                        task_id=None, **options):
            if self.app.conf.CELERY_ALWAYS_EAGER:
                return self.apply(args, kwargs, **options)
            options.pop('publisher', None)
            tasks, results = self.prepare_steps(args, kwargs['tasks'])
            result = results[-1]
            if group_id:
                tasks[-1].set(group_id=group_id)
            if chord:
                tasks[-1].set(chord=chord)
            if task_id:
                tasks[-1].set(task_id=task_id)
                result = tasks[-1].type.AsyncResult(task_id)
            tasks[0].apply_async()
            return result

        def apply(self, args=(), kwargs={}, **options):
            tasks = [maybe_subtask(task).clone() for task in kwargs['tasks']]
            res = prev = None
            for task in tasks:
                res = task.apply((prev.get(), ) if prev else ())
                res.parent, prev = prev, res
            return res

    return Chain
You can see that at the end of prepare_steps, prev_task is linked to the next task.
When prev_task fails, the next task is not called.
I'm testing adding a link_error from the previous task to the next:
if prev_task:
    # link and link_error previous task to this task.
    prev_task.link(task)
    prev_task.link_error(task)
    # set the results parent attribute.
    res.parent = prev_res
But then the next task must take care of both cases (except, maybe, when it's configured to be immutable, i.e. it does not accept more arguments).
I think chain could support that by allowing syntax like this:
c = chain(t1, (t2, t1e), (t3, t2e))
which means:
t1 links to t2 and link_error to t1e
t2 links to t3 and link_error to t2e
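To make the proposed meaning concrete, here is a rough sketch (t1, t2, t3, t1e, t2e are assumed to be existing task signatures, e.g. built with .si()) of what that chain would expand to using the link/link_error primitives shown above:

t1.link(t2)          # on success of t1, run t2
t1.link_error(t1e)   # on failure of t1, run t1e
t2.link(t3)          # on success of t2, run t3
t2.link_error(t2e)   # on failure of t2, run t2e
t1.apply_async()     # kick off the chain at its head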