corruption loop for a data frame in PySpark


I'm writing a loop to corrupt my dataset in PySpark, and I want to control which errors are introduced.
First, I made a list of error types: 9 errors and 1 with no error.
#code_erreur= [ "replace","inverse","inserte","delete","espace","NA"]
code_erreur= ["replace"]
varibale =["VARIABLEA","VARIABLEB"]
I started by coding just one error: replace.
It selects a random letter in a random "varibale" and replaces it with another random letter.
My input:
VARIABLEA | VARIABLEB
BLUE      | WHITE
PINK      | DARK
My expected output:
VARIABLEA | VARIABLEB
BLTE      | WHITE
PINK      | DARM
And I made a loop:
def algo_corruption(lettre,code_erreur,nombre_erreur,varibale,data):
    alp=(list(string.ascii_uppercase))
    table_corruption=[]
    for i in range(1,data.count()):
        code_erreur_choisi =random.choice(code_erreur)
        varibale_choisie =random.choice(varibale)
        (table_corruption.append((code_erreur_choisi,
                                  varibale_choisie)))
    cols=["code_erreur_choisi","varibale_choisie"]
    result = spark.createDataFrame(table_corruption, cols)
    result= result.withColumn("id", monotonically_increasing_id())
    data= data.withColumn("id", monotonically_increasing_id())
    data_join_result= data.join(result, "id","inner").drop("id")
    for j in range(1,data_join_result.count()):
        if data_join_result.filter(col("code_erreur_choisi") == "replace"):
            data_corrp = (data_join_result[varibale_choisie].replace(random.choice(data_join_result.collect()[j][varibale_choisie]),
                          random.choice(alp)))
            display(data_corrp)
        else:
            # "error not coded yet"
            print("erreur pas encore codée")
But it doesn't work. I always get errors like:
ValueError: R
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<command-3614001298541202> in <module>()
----> 1 algo_corruption(code_erreur,varibale,extrac_base_train)
<command-1181487391956055> in algo_corruption(code_erreur, varibale, data)
29 if data_join_result.filter(col("code_erreur_choisi") == "replace"):
30 data_corrp = (data_join_result[varibale_choisie].replace(random.choice(data_join_result.collect()[j][varibale_choisie]),
---> 31 data_join_result.collect()[j][lettre_choisie]))
32 display(data_corrp)
33
/databricks/spark/python/pyspark/sql/types.py in __getitem__(self, item)
1517 raise KeyError(item)
1518 except ValueError:
-> 1519 raise ValueError(item)
1520
1521 def __getattr__(self, item):
ValueError: R
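Judging from the traceback, the ValueError: R seems to come from indexing a Row with the letter "R" (the value of lettre_choisie) instead of a column name. Separately, the "replace" error itself can be sketched in plain Python, without collect() or a per-row loop. This is only a sketch under assumptions: the helper names replace_random_letter and corrupt_row are hypothetical, the rows are plain dicts standing in for Spark rows, and the replacement alphabet is uppercase ASCII as in the question.

```python
import random
import string


def replace_random_letter(value, rng=random):
    """Replace one randomly chosen character of `value` with a different uppercase letter."""
    if not value:
        return value
    pos = rng.randrange(len(value))
    # Exclude the current character so the value always changes
    candidates = [c for c in string.ascii_uppercase if c != value[pos]]
    return value[:pos] + rng.choice(candidates) + value[pos + 1:]


def corrupt_row(row, variables, rng=random):
    """Corrupt one randomly chosen variable of a dict-like row and return a new dict."""
    target = rng.choice(variables)
    corrupted = dict(row)
    corrupted[target] = replace_random_letter(row[target], rng)
    return corrupted


# Hypothetical data mirroring the input table in the question
rng = random.Random(0)  # seeded only to make the run repeatable
rows = [
    {"VARIABLEA": "BLUE", "VARIABLEB": "WHITE"},
    {"VARIABLEA": "PINK", "VARIABLEB": "DARK"},
]
corrupted_rows = [corrupt_row(r, ["VARIABLEA", "VARIABLEB"], rng) for r in rows]
```

In PySpark, a function like replace_random_letter could probably be wrapped with pyspark.sql.functions.udf and applied with withColumn, which would keep the work on the executors instead of calling collect() inside a loop.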

Related

Pyspark Cosine similarity Invalid argument, not a string or column

I am trying to calculate the cosine distance between the title and headline columns using a pre-trained BERT model, as shown below:
title | headline | title_array | headline_array | arrayed
Dance Gavin Dance bass player Tim Feerick dead at 34 | Prince Harry and Meghan Markle make secret visit to see Queen ahead of Invictus Games | ["Dance Gavin Dance bass player Tim Feerick dead at 34"] | ["Prince Harry and Meghan Markle make secret visit to see Queen ahead of Invictus Games"] | ["Dance Gavin Dance bass player Tim Feerick dead at 34", "Prince Harry and Meghan Markle make secret visit to see Queen ahead of Invictus Games"]
from sentence_transformers import SentenceTransformer
import numpy as np
from pyspark.sql.types import FloatType
import pyspark.sql.functions as f
# downloading bert
model = SentenceTransformer('bert-base-nli-mean-tokens')
#udf(FloatType())
def cosine_similarity(sentence_embeddings, ind_a, ind_b):
    s = sentence_embeddings
    return np.dot(s[ind_a], s[ind_b]) / (np.linalg.norm(s[ind_a]) * np.linalg.norm(s[ind_b]))
#udf_bert = udf(cosine_similarity, FloatType())
''''
s0 = "our president is a good leader he will not fail"
s1 = "our president is not a good leader he will fail"
s2 = "our president is a good leader"
s3 = "our president will succeed"
sentences = [s0, s1, s2, s3]
sentence_embeddings = model.encode(sentences)
s = sentence_embeddings
print(f"{s0} <--> {s1}: {udf_bert(sentence_embeddings, 0, 1)}")
print(f"{s0} <--> {s2}: {cosine_similarity(sentence_embeddings, 0, 2)}")
print(f"{s0} <--> {s3}: {cosine_similarity(sentence_embeddings, 0, 3)}")
'''''
test_df = test_df.withColumn("Similarities", cosine_similarity(model.encode(test_df.arrayed), 0, 1))
As the example shows, the algorithm takes the concatenation of two arrays of strings and calculates the cosine distance between them.
When I run the function on its own with the sample texts (commented out above), it works. But when I register it as a UDF and apply it to my dataframe, I get the error below:
TypeError Traceback (most recent call last)
<command-757165186581086> in <module>
26 '''''
27
---> 28 test_df = test_df.withColumn("Similarities", f.lit(cosine_similarity(model.encode(test_df.arrayed), 0, 1)))
/databricks/spark/python/pyspark/sql/udf.py in wrapper(*args)
197 #functools.wraps(self.func, assigned=assignments)
198 def wrapper(*args):
--> 199 return self(*args)
200
201 wrapper.__name__ = self._name
/databricks/spark/python/pyspark/sql/udf.py in __call__(self, *cols)
177 judf = self._judf
178 sc = SparkContext._active_spark_context
--> 179 return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
180
181 # This function is for improving the online help system in the interactive interpreter.
/databricks/spark/python/pyspark/sql/column.py in _to_seq(sc, cols, converter)
60 """
61 if converter:
---> 62 cols = [converter(c) for c in cols]
63 return sc._jvm.PythonUtils.toSeq(cols)
64
/databricks/spark/python/pyspark/sql/column.py in <listcomp>(.0)
60 """
61 if converter:
---> 62 cols = [converter(c) for c in cols]
63 return sc._jvm.PythonUtils.toSeq(cols)
64
/databricks/spark/python/pyspark/sql/column.py in _to_java_column(col)
44 jcol = _create_column_from_name(col)
45 else:
---> 46 raise TypeError(
47 "Invalid argument, not a string or column: "
48 "{0} of type {1}. "
TypeError: Invalid argument, not a string or column: [-0.29246375 0.02216947 0.610355 -0.02230968 0.61386955 0.15291359]

The input of a UDF must be a Column or a column name, which is why Spark complains "Invalid argument, not a string or column: [-0.29246375 0.02216947 0.610355 -0.02230968 0.61386955 0.15291359]". You need to pass only arrayed and reference the model inside your UDF. Something like this:
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
@udf(FloatType())
def cosine_similarity(arrayed):
    # Load the model inside the UDF so it is available on the executors
    # (note: as written, this reloads the model for every row)
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('bert-base-nli-mean-tokens')
    s = model.encode(arrayed)
    # Return a plain Python float, which FloatType expects
    return float(np.dot(s[0], s[1]) / (np.linalg.norm(s[0]) * np.linalg.norm(s[1])))
test_df = test_df.withColumn("Similarities", cosine_similarity(test_df.arrayed))
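For checking results by hand, the quantity the function above returns is the dot product of two vectors divided by the product of their norms. Here is a dependency-free sketch of that formula. The vectors are toy values made up for illustration, not real BERT embeddings, and this standalone cosine_similarity is a stand-in that takes two plain sequences rather than a column.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors for illustration only
parallel = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])          # perpendicular
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])           # opposite direction
```

Vectors pointing the same way score close to 1, perpendicular ones 0, and opposite ones -1, which is a quick way to sanity-check the values that end up in the Similarities column.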

Python Jupyter Notebook scipy

For a long time I was able to add data and fit, then plot the curve with data. But recently I get this:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-6-6f645a2744bc> in <module>
1 poland = prepare_data(europe_data, 'Poland')
----> 2 plot_all(poland, max_y=400000)
3 poland
~/Pulpit/library.py in plot_all(country, max_x, max_y)
43 def plot_all(country, max_x = 1000, max_y = 500000):
44
---> 45 parameters_logistic = scipy.optimize.curve_fit(func_logistic, country['n'], country['all'])[0]
46 parameters_expo = scipy.optimize.curve_fit(func_expo, country['n'], country['all'])[0]
47
/usr/local/lib64/python3.6/site-packages/scipy/optimize/minpack.py in curve_fit(f, xdata, ydata, p0, sigma, absolute_sigma, check_finite, bounds, method, jac, **kwargs)
787 cost = np.sum(infodict['fvec'] ** 2)
788 if ier not in [1, 2, 3, 4]:
--> 789 raise RuntimeError("Optimal parameters not found: " + errmsg)
790 else:
791 # Rename maxfev (leastsq) to max_nfev (least_squares), if specified.
RuntimeError: Optimal parameters not found: Number of calls to function has reached maxfev = 800.
Here are all Python Jupyter Notebook files: https://files.fm/u/zj7cc6ne#sign_up
How to solve this?

scipy.optimize.curve_fit takes a keyword argument p0.
Initial guess for the parameters (length N). If None, then the initial
values will all be 1 (if the number of parameters for the function can
be determined using introspection, otherwise a ValueError is raised).
If the default values of 1 are too far off from the result, the algorithm may not converge. Try passing some values that make sense for your problem.
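One way to get values that "make sense" is to derive them from the data itself. The sketch below assumes a three-parameter logistic of the form L / (1 + exp(-k * (x - x0))). The original func_logistic is not shown, so this form, the parameter order, and the helper name initial_guess are all assumptions. It uses the fact that a logistic curve has slope L * k / 4 at its midpoint.

```python
import math


def logistic(x, L, k, x0):
    """Assumed model: plateau L, growth rate k, midpoint x0."""
    return L / (1.0 + math.exp(-k * (x - x0)))


def initial_guess(xs, ys):
    """Rough starting values [L, k, x0] taken from the data (needs at least 3 points)."""
    y_max = max(ys)
    L = y_max  # assumes the data has roughly reached its plateau
    # Index of the first point at or above half of the maximum
    mid = next(i for i, y in enumerate(ys) if y >= y_max / 2.0)
    mid = min(max(mid, 1), len(xs) - 2)  # keep both neighbours in range
    x0 = xs[mid]
    # Central finite difference approximates the slope at the midpoint
    slope = (ys[mid + 1] - ys[mid - 1]) / (xs[mid + 1] - xs[mid - 1])
    k = 4.0 * slope / L
    return [L, k, x0]


# Synthetic data generated from known parameters, for illustration only
xs = [float(x) for x in range(100)]
ys = [logistic(x, 400000.0, 0.15, 50.0) for x in xs]
p0 = initial_guess(xs, ys)
```

The result could then be passed as scipy.optimize.curve_fit(func_logistic, country['n'], country['all'], p0=p0), provided the order of the guesses matches the parameter order of func_logistic. If the fit still stops early, raising maxfev is another option.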

Sympy .coeff_all() returned list is not readable by scipy

I have a question about the data type of the result returned by Sympy Poly.all_coeffs(). I have started to use Sympy just recently.
My Sympy transfer function is the following:
Then I run this code:
n,d = fraction(Gs)
num = Poly(n,s)
den = Poly(d,s)
num_c = num.all_coeffs()
den_c = den.all_coeffs()
I get:
Then I run this code:
from scipy import signal
#nu = [5000000.0]
#de = [4.99, 509000.0]
nu = num_c
de = den_c
sys = signal.lti(nu, de)
w,mag,phase = signal.bode(sys)
plt.plot(w/(2*np.pi), mag)
and the result is:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-131-fb960684259c> in <module>
4 nu = num_c
5 de = den_c
----> 6 sys = signal.lti(nu, de)
But if I use the commented-out 'nu' and 'de' lines, which are plain Python lists, the program works. So what is wrong here?

Why did you show only part of the error? Why not the full message, or even the full traceback?
In [60]: sys = signal.lti(num_c, den_c)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-60-21f71ecd8884> in <module>
----> 1 sys = signal.lti(num_c, den_c)
/usr/local/lib/python3.6/dist-packages/scipy/signal/ltisys.py in __init__(self, *system, **kwargs)
590 self._den = None
591
--> 592 self.num, self.den = normalize(*system)
593
594 def __repr__(self):
/usr/local/lib/python3.6/dist-packages/scipy/signal/filter_design.py in normalize(b, a)
1609 leading_zeros = 0
1610 for col in num.T:
-> 1611 if np.allclose(col, 0, atol=1e-14):
1612 leading_zeros += 1
1613 else:
<__array_function__ internals> in allclose(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/numpy/core/numeric.py in allclose(a, b, rtol, atol, equal_nan)
2169
2170 """
-> 2171 res = all(isclose(a, b, rtol=rtol, atol=atol, equal_nan=equal_nan))
2172 return bool(res)
2173
<__array_function__ internals> in isclose(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/numpy/core/numeric.py in isclose(a, b, rtol, atol, equal_nan)
2267 y = array(y, dtype=dt, copy=False, subok=True)
2268
-> 2269 xfin = isfinite(x)
2270 yfin = isfinite(y)
2271 if all(xfin) and all(yfin):
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Now look at the elements of the num_c list (same for den_c):
In [55]: num_c[0]
Out[55]: 500000.000000000
In [56]: type(_)
Out[56]: sympy.core.numbers.Float
The scipy code is doing numpy testing on the inputs. So it's first turned the lists into arrays:
In [61]: np.array(num_c)
Out[61]: array([500000.000000000], dtype=object)
This array contains sympy object(s). It can't cast that to numpy float with 'safe'. But an explicit astype uses unsafe as the default:
In [63]: np.array(num_c).astype(float)
Out[63]: array([500000.])
So let's convert both lists into valid numpy float arrays:
In [64]: sys = signal.lti(np.array(num_c).astype(float), np.array(den_c).astype(float))
In [65]: sys
Out[65]:
TransferFunctionContinuous(
array([100200.4008016]),
array([1.00000000e+00, 1.02004008e+05]),
dt: None
)
Conversion in a list comprehension also works:
sys = signal.lti([float(i) for i in num_c],[float(i) for i in den_c])

You likely need to convert sympy objects to floats / lists of floats.

AttributeError: 'NoneType' object has no attribute 'setCallSite' pyspark after indexedRowMatrix columnSimilarities()

I'm working on a code that was correctly executed with the dataframe before, but this time when I execute it, I get an error. (The only difference is that I used persist() on the dataframe this time.)
simMat = IndexedRMat.columnSimilarities()
executes correctly, but then this part:
columns = ['product1', 'product2', 'sim']
vals = simMat.entries.map(lambda e: (e.i, e.j, e.value)).collect()
dfsim = spark.createDataFrame(vals, columns)
generates this error:
AttributeError Traceback (most recent call last)
<ipython-input-100-11502084c71b> in <module>()
1 columns = ['product1', 'product2', 'sim']
----> 2 vals = simMat.entries.map(lambda e: (e.i, e.j, e.value)).collect()
3 dfsim = spark.createDataFrame(vals, columns)
/opt/spark-2.3.0-SNAPSHOT-bin-spark-master/python/pyspark/rdd.pyc in collect(self)
806 to be small, as all the data is loaded into the driver's memory.
807 """
--> 808 with SCCallSiteSync(self.context) as css:
809 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
810 return list(_load_from_socket(port, self._jrdd_deserializer))
/opt/spark-2.3.0-SNAPSHOT-bin-spark-master/python/pyspark/traceback_utils.pyc in __enter__(self)
70 def __enter__(self):
71 if SCCallSiteSync._spark_stack_depth == 0:
---> 72 self._context._jsc.setCallSite(self._call_site)
73 SCCallSiteSync._spark_stack_depth += 1
74
AttributeError: 'NoneType' object has no attribute 'setCallSite'
What does it mean? I'm new to Spark and couldn't find an explanation for this type of error.

Using rpy2 in Jupyter/IPython run_line_magic error

In the IPython and Jupyter documentation it says that get_ipython().magic() is deprecated. But when I changed my code to use run_line_magic it is failing to push to R (see below). Might be related to this problem
https://bitbucket.org/rpy2/rpy2/issues/184/valueerror-call-stack-is-not-deep-enough
I'm on Mac Yosemite, using Anaconda with Python 2.7. I just updated both Anaconda and rpy2 yesterday. The code below is from a Jupyter notebook.
%load_ext rpy2.ipython
import pandas as pd
'''Two test functions with rpy2.
The only difference between them is that
rpy2fun_magic uses 'magic' to push variable to R and
rpy2fun_linemagic uses 'run_line_magic' to push variable.
'magic' works fine. 'run_line_magic' returns an error.'''
def rpy2fun_magic(df):
    get_ipython().magic('R -i df')
    get_ipython().run_line_magic('R','df_cor <- cor(df)')
    get_ipython().run_line_magic('R','-o df_cor')
    return (df_cor)
def rpy2fun_linemagic(df):
    get_ipython().run_line_magic('R','-i df')
    get_ipython().run_line_magic('R','df_cor <- cor(df)')
    get_ipython().run_line_magic('R','-o df_cor')
    return (df_cor)
dataframetest = pd.DataFrame([[1,2,3,4],[6,3,4,5],[9,1,7,3]])
df_cor_magic = rpy2fun_magic(dataframetest)
print 'Using magic to push variable works fine\n'
print df_cor_magic
print '\nBut using run_line_magic returns an error\n'
df_cor_linemagic = rpy2fun_linemagic(dataframetest)
Using magic to push variable works fine
[[ 1. -0.37115374 0.91129318 -0.37115374]
[-0.37115374 1. -0.72057669 1. ]
[ 0.91129318 -0.72057669 1. -0.72057669]
[-0.37115374 1. -0.72057669 1. ]]
But using run_line_magic returns an error
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-1-e418b72a8621> in <module>()
28 print '\nBut using run_line_magic returns an error\n'
29
---> 30 df_cor_linemagic = rpy2fun_linemagic(dataframetest)
<ipython-input-1-e418b72a8621> in rpy2fun_linemagic(df)
15
16 def rpy2fun_linemagic(df):
---> 17 get_ipython().run_line_magic('R','-i df')
18 get_ipython().run_line_magic('R','df_cor <- cor(df)')
19 get_ipython().run_line_magic('R','-o df_cor')
/Users/alexmillner/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in run_line_magic(self, magic_name, line)
2255 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
2256 with self.builtin_trap:
-> 2257 result = fn(*args,**kwargs)
2258 return result
2259
/Users/alexmillner/anaconda/lib/python2.7/site-packages/rpy2/ipython/rmagic.pyc in R(self, line, cell, local_ns)
/Users/alexmillner/anaconda/lib/python2.7/site-packages/IPython/core/magic.pyc in <lambda>(f, *a, **k)
191 # but it's overkill for just that one bit of state.
192 def magic_deco(arg):
--> 193 call = lambda f, *a, **k: f(*a, **k)
194
195 if callable(arg):
/Users/alexmillner/anaconda/lib/python2.7/site-packages/rpy2/ipython/rmagic.pyc in R(self, line, cell, local_ns)
657 val = self.shell.user_ns[input]
658 except KeyError:
--> 659 raise NameError("name '%s' is not defined" % input)
660 if args.converter is None:
661 ro.r.assign(input, self.pyconverter(val))
NameError: name 'df' is not defined

Some discussion of the same issue with %timeit first, followed by workaround answers at the bottom. I'm using IPython 3.1.0 with Anaconda 2.7.10, so my observations below could be different based on version differences alone.
This is not unique to the R extension, you can reproduce this with something simpler like %timeit:
In [47]: dfrm
Out[47]:
A B C
0 0.690466 0.370793 0.963782
1 0.478427 0.358897 0.689173
2 0.189277 0.268237 0.570624
3 0.735665 0.342549 0.509810
4 0.929736 0.090079 0.384444
5 0.210941 0.347164 0.852408
6 0.241940 0.187266 0.961489
7 0.768143 0.548450 0.604004
8 0.055765 0.842224 0.668782
9 0.717827 0.047011 0.948673
In [48]: def run_timeit(df):
   ....:     get_ipython().run_line_magic('timeit', 'df.sum()')
   ....:
In [49]: run_timeit(dfrm)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-49-1e62302232b6> in <module>()
----> 1 run_timeit(dfrm)
<ipython-input-48-0a3e09ec1e0c> in run_timeit(df)
1 def run_timeit(df):
----> 2 get_ipython().run_line_magic('timeit', 'df.sum()')
3
/home/ely/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in run_line_magic(self, magic_name, line)
2226 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
2227 with self.builtin_trap:
-> 2228 result = fn(*args,**kwargs)
2229 return result
2230
/home/ely/anaconda/lib/python2.7/site-packages/IPython/core/magics/execution.pyc in timeit(self, line, cell)
/home/ely/anaconda/lib/python2.7/site-packages/IPython/core/magic.pyc in <lambda>(f, *a, **k)
191 # but it's overkill for just that one bit of state.
192 def magic_deco(arg):
--> 193 call = lambda f, *a, **k: f(*a, **k)
194
195 if callable(arg):
/home/ely/anaconda/lib/python2.7/site-packages/IPython/core/magics/execution.pyc in timeit(self, line, cell)
1034 number = 1
1035 for _ in range(1, 10):
-> 1036 time_number = timer.timeit(number)
1037 worst_tuning = max(worst_tuning, time_number / number)
1038 if time_number >= 0.2:
/home/ely/anaconda/lib/python2.7/site-packages/IPython/core/magics/execution.pyc in timeit(self, number)
130 gc.disable()
131 try:
--> 132 timing = self.inner(it, self.timer)
133 finally:
134 if gcold:
<magic-timeit> in inner(_it, _timer)
NameError: global name 'df' is not defined
The issue is that the line magics are set to look for variable names at global scope, not at function scope. If the argument to your function rpy2fun_linemagic happened to coincide with a global variable name, the interior code would pick that up, for example:
In [52]: def run_timeit(dfrm):
   ....:     get_ipython().run_line_magic('timeit', 'dfrm.sum()')
   ....:
In [53]: run_timeit(dfrm)
The slowest run took 5.67 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 99.1 µs per loop
But this only works by accident, because the interior string passed to run_line_magic contains a name that is found globally.
However, I do get the same error even if using the plain magic function:
In [58]: def run_timeit(df):
   ....:     get_ipython().magic('timeit df.sum()')
   ....:
In [59]: run_timeit(dfrm)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-59-1e62302232b6> in <module>()
----> 1 run_timeit(dfrm)
<ipython-input-58-e98c720ea7e8> in run_timeit(df)
1 def run_timeit(df):
----> 2 get_ipython().magic('timeit df.sum()')
3
/home/ely/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in magic(self, arg_s)
2305 magic_name, _, magic_arg_s = arg_s.partition(' ')
2306 magic_name = magic_name.lstrip(prefilter.ESC_MAGIC)
-> 2307 return self.run_line_magic(magic_name, magic_arg_s)
2308
2309 #-------------------------------------------------------------------------
/home/ely/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in run_line_magic(self, magic_name, line)
2226 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
2227 with self.builtin_trap:
-> 2228 result = fn(*args,**kwargs)
2229 return result
2230
/home/ely/anaconda/lib/python2.7/site-packages/IPython/core/magics/execution.pyc in timeit(self, line, cell)
/home/ely/anaconda/lib/python2.7/site-packages/IPython/core/magic.pyc in <lambda>(f, *a, **k)
191 # but it's overkill for just that one bit of state.
192 def magic_deco(arg):
--> 193 call = lambda f, *a, **k: f(*a, **k)
194
195 if callable(arg):
/home/ely/anaconda/lib/python2.7/site-packages/IPython/core/magics/execution.pyc in timeit(self, line, cell)
1034 number = 1
1035 for _ in range(1, 10):
-> 1036 time_number = timer.timeit(number)
1037 worst_tuning = max(worst_tuning, time_number / number)
1038 if time_number >= 0.2:
/home/ely/anaconda/lib/python2.7/site-packages/IPython/core/magics/execution.pyc in timeit(self, number)
130 gc.disable()
131 try:
--> 132 timing = self.inner(it, self.timer)
133 finally:
134 if gcold:
<magic-timeit> in inner(_it, _timer)
NameError: global name 'df' is not defined
One (super bad) way to get around this is to use globals to locate the item that is the same as the argument that was passed to your function, and then you'll have a global name for it.
For example:
In [68]: def run_timeit(df):
   ....:     for var_name, var_val in globals().iteritems():
   ....:         if df is var_val:
   ....:             get_ipython().run_line_magic('timeit', '%s.sum()'%(var_name))
   ....:             break
   ....:
In [69]: run_timeit(dfrm)
The slowest run took 5.72 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 99.2 µs per loop
But this is very fragile, since it relies on finding the passed object under some name in the global namespace. If I passed an object like an integer or string, I would have to check whether it was interned or something, but otherwise couldn't find it "by name" in the global namespace.
Another way to do it that might be slightly better is to use the user_ns namespace dict that IPython stores. Then at least you're not looking at globals, and there is more stability over specific variables that have been named when assigned by the user in IPython:
In [71]: def run_timeit(df):
   ....:     g = get_ipython()
   ....:     for var_name, var_val in g.user_ns.iteritems():
   ....:         if df is var_val:
   ....:             g.run_line_magic('timeit', '%s.sum()'%(var_name))
   ....:             break
   ....:
In [72]: run_timeit(dfrm)
The slowest run took 5.58 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 99 µs per loop
In the case of your specific R function call, I would try:
def rpy2fun_linemagic(df):
    g = get_ipython()
    for var_name, var_val in g.user_ns.iteritems():
        if df is var_val:
            g.run_line_magic('R', '-i %s'%(var_name))
            g.run_line_magic('R', 'df_cor <- cor(%s)'%(var_name))
            g.run_line_magic('R', '-o df_cor')
            return df_cor
You might also have to be careful on the return statement. You might need to use return g.user_ns['df_cor'] or something if the result of the output conversion back to Python is to create the variable at global scope as well, rather than function scope. Or, if that variable gets created as a side effect, you may not want to return anything at all. I'm not a big fan of relying on implicit mutation like that, but it could work for you.
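The scoping problem described above can be reproduced without IPython at all. In this sketch, the dict user_ns and the function run_in_user_ns are hypothetical stand-ins for IPython's user namespace and a line magic, so none of this is the actual IPython or rpy2 code. It also shows a variant of the workaround: instead of searching the namespace for the object, temporarily store the argument there under a known name (the name _tmp_values is made up).

```python
# Stand-in for IPython's user namespace (assumption, not the real object)
user_ns = {}


def run_in_user_ns(expr):
    """Stand-in for a line magic: evaluates expr against user_ns only."""
    return eval(expr, user_ns)


def total_fails(values):
    # 'values' is local to this function, so the lookup in user_ns fails
    return run_in_user_ns("sum(values)")


def total_works(values):
    # Temporarily expose the argument under a known name, then clean up
    user_ns["_tmp_values"] = values
    try:
        return run_in_user_ns("sum(_tmp_values)")
    finally:
        del user_ns["_tmp_values"]


try:
    total_fails([1, 2, 3])
    failed_with_name_error = False
except NameError:
    failed_with_name_error = True

result = total_works([1, 2, 3])
```

Applied to the R case, the same idea would be to assign df into get_ipython().user_ns under a fixed name before calling run_line_magic('R', '-i ...') with that name. The traceback in the question shows the rpy2 magic reading its inputs from self.shell.user_ns, so this may work, but I have not tried it with rpy2.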

I suspect that the code example you are providing is there only to demonstrate the issue with run_line_magic(), but for reference I am adding a way to do the same without ipython being involved.
from rpy2.robjects import globalenv
def rpy2cor(df):
    fun = globalenv.get('cor', wantfun=True)
    df_cor = fun(df)
    return df_cor
