Python converting itertools.chain to map - pyspark

I am new to pyspark and trying to understand the code below from some prod code.
from itertools import chain
from pyspark.sql.functions import create_map, lit

test_map = {"A": 1, "B": 2, "C": 3, "D": 4}
test_mapping = create_map([lit(ele) for ele in chain(*test_map.items())])
The code above gives me this error:
AttributeError: 'NoneType' object has no attribute '_jvm'
I'm not sure what is wrong with it. Can somebody explain?

There is nothing wrong with the code above. In fact, it doesn't "run" anything, because Spark transformations are lazy. (If you do see AttributeError: 'NoneType' object has no attribute '_jvm', that usually means the pyspark functions were called before a SparkSession was started, not that the expression itself is wrong.)
This is the actual result of your code:
print(test_mapping)
# Column<'map(A, 1, B, 2, C, 3, D, 4)'>
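To see what the chain(*test_map.items()) idiom is doing, here is a plain-Python sketch (no Spark required) of how it flattens the dict into the alternating key/value sequence that create_map expects:

```python
from itertools import chain

test_map = {"A": 1, "B": 2, "C": 3, "D": 4}

# chain(*test_map.items()) unpacks [("A", 1), ("B", 2), ...] and
# chains the pairs into one flat alternating sequence.
flattened = list(chain(*test_map.items()))
print(flattened)  # ['A', 1, 'B', 2, 'C', 3, 'D', 4]
```

create_map then pairs the wrapped literals back up into key/value entries of a map column.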

Related

pyspark.sql.utils.ParseException error when filtering the df

I want to select all rows from a pyspark df except the rows where the array column contains a certain value. It works with the code below in the notebook:
<pyspark df>.filter(~exists("<col name>", lambda x: x=="hello"))
But when I write it as this:
cond = '~exists("<col name>", lambda x: x=="hello")'
df = df.filter(cond)
I get the error below:
pyspark.sql.utils.ParseException:
extraneous input 'x' expecting {')', ','}(line 1, pos 32)
I really can't spot any typo. Could someone give me a hint if I missed something?
Thanks, J
To pass the condition in through a variable, it needs to be written as a Spark SQL expression string (the form accepted by expr and by filter when given a string), not as Python lambda syntax. So it can be modified to:
cond = '!exists(col_name, x -> x == "hello")'

Getting a wrong answer when implementing merge sort in Python

I am getting a wrong answer for particular input arrays in this merge sort implementation.
I tried with the code below in Python.
Python code -
a=[100,3,4]
b=[]
for i in range(len(a)):
    b.append(0)

def ms(a, lb, ub):
    if lb < ub:
        mid=int((lb+ub)/2)
        ms(a, lb, mid)
        ms(a, mid+1, ub)
        merge(a, lb, mid, ub)

def merge(a, lb, mid, ub):
    i=lb
    j=mid+1
    k=lb
    while i<=mid and j<=ub:
        if a[i]<=a[j]:
            b[k]=a[i]
            i+=1
            k+=1
        else:
            b[k]=a[j]
            j+=1
            k+=1
    if i>mid:
        while j<=ub:
            b[k]=a[j]
            j+=1
            k+=1
    elif j>ub:
        while i<=mid:
            b[k]=a[i]
            i+=1
            k+=1

ms(a, 0, len(a)-1)
print(b)
I am getting the wrong output.
Please go through this.
There are several problems with this code. You don't ask for a fix, and I can imagine at least two ways to go about fixing it, so I'll leave it to you, but essentially the fundamental problem in your implementation is that you're merging into b twice, both times at the beginning. That overwrites what the first one does.
If you add a print statement right before ms calls merge, you'll see that one call to ms turns b into the list [3, 0, 0], and a second call turns it into the list [4, 100, 0]. In other words, you've lost information. This happens because merge always initializes k=lb.
IMHO you should not try to perform merge sort using a global list in this manner.
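One possible fix, sketched below (a rewrite, not the only way): pass the scratch buffer explicitly and copy the merged run from the buffer back into a at the end of merge, so that later merges see already-sorted halves:

```python
def merge_sort(a, lb, ub, b):
    # b is a scratch buffer the same length as a
    if lb < ub:
        mid = (lb + ub) // 2
        merge_sort(a, lb, mid, b)
        merge_sort(a, mid + 1, ub, b)
        merge(a, lb, mid, ub, b)

def merge(a, lb, mid, ub, b):
    i, j, k = lb, mid + 1, lb
    while i <= mid and j <= ub:
        if a[i] <= a[j]:
            b[k] = a[i]
            i += 1
        else:
            b[k] = a[j]
            j += 1
        k += 1
    while i <= mid:          # drain the left half
        b[k] = a[i]; i += 1; k += 1
    while j <= ub:           # drain the right half
        b[k] = a[j]; j += 1; k += 1
    # The crucial step the original is missing:
    # copy the merged run back into a.
    a[lb:ub + 1] = b[lb:ub + 1]

a = [100, 3, 4]
merge_sort(a, 0, len(a) - 1, [0] * len(a))
print(a)  # [3, 4, 100]
```

Without the copy-back, each merge reads the still-unsorted contents of a, which is exactly the information loss described above.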

How to translate this code into python 3?

This code was originally written in Python 2 and I need to translate it to Python 3!
I'm sorry for not sharing enough information:
Also, here's the part where self.D was first assigned:
def __init__(self,instance,transformed,describe1,describe2):
    self.D=[]
    self.instance=instance
    self.transformed=transformed
    self.describe1,self.describe2=describe1,describe2
    self.describe=self.describe1+', '+self.describe2 if self.describe2 else self.describe1
    self.column_num=self.tuple_num=self.view_num=0
    self.names=[]
    self.types=[]
    self.origins=[]
    self.features=[]
    self.views=[]
    self.classify_id=-1
    self.classify_num = 1
    self.classes=[]

def generateViews(self):
    T=map(list,zip(*self.D))
    if self.transformed==0:
        s=int(self.column_num)
        for column_id in range(s):
            f = Features(self.names[column_id],self.types[column_id],self.origins[column_id])
            #calculate min,max for numerical,temporal
            if f.type==Type.numerical or f.type==Type.temporal:
                f.min,f.max=min(T[column_id]),max(T[column_id])
                if f.min==f.max:
                    self.types[column_id]=f.type=Type.none
                    self.features.append(f)
                    continue
            d={}
            #calculate distinct,ratio for categorical,temporal
            if f.type == Type.categorical or f.type == Type.temporal:
                for i in range(self.tuple_num):
                    print([type(self.D[i]) for i in range(self.tuple_num)])
                    if self.D[i][column_id] in d:
                        d[self.D[i][column_id]]+=1
                    else:
                        d[self.D[i][column_id]]=1
                f.distinct = len(d)
                f.ratio = 1.0 * f.distinct / self.tuple_num
                f.distinct_values=[(k,d[k]) for k in sorted(d)]
                if f.type==Type.temporal:
                    self.getIntervalBins(f)
            self.features.append(f)
TypeError: 'map' object is not subscriptable
The snippet you have given is not enough to solve the problem. The problem lies in self.D, which you are trying to subscript with self.D[i]. Look into your code where self.D is instantiated and make sure that it's a list-like (subscriptable) object.
Edit
Based on your edit, please confirm whether self.D[i] is also array-like for all i in the range mentioned in the code. You can check that with:
print([type(self.D[i]) for i in range(self.tuple_num)])
Share the output of this code, so that I may help further.
Edit-2
As per your comments and the edited code snippet, it seems that self.D is the output of some map call. In Python 2, map is a function that returns a list. In Python 3, however, map returns a lazy map object, which is not subscriptable.
The simplest way to resolve this is to find out the line where self.D was first assigned, and wrap whatever code is on the right-hand side with list(...).
Alternately, just before this line
T=map(list,zip(*self.D))
add the following
self.D = list(self.D)
It has to come before, because zip(*self.D) unpacks and exhausts the map object, so converting afterwards would leave an empty list. Note that T here is also a map object, so T[column_id] will raise the same error unless you wrap it too: T = list(map(list, zip(*self.D))).
Hope this will resolve the issue
We don't have quite enough information to answer the question, but in Python 3, generator and map objects are not subscriptable. I think it may be in your
self.D[i]
variable, because you claim that self.D is a list, but it is possible that self.D[i] is a map object.
In your case, to access the indexes, you can convert it to a list:
list(self.D)[i]
Or use unpacking to implicitly convert to a list (this may be more condensed, but remember that explicit is better than implicit):
[*self.D[i]]
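A minimal, self-contained illustration of the Python 2 → 3 difference behind this error, using a toy stand-in for self.D:

```python
# In Python 3, map() returns a lazy iterator, not a list.
rows = [[1, 2], [3, 4]]
D = map(list, zip(*rows))

try:
    D[0]  # worked in Python 2, where map() returned a list
except TypeError as e:
    message = str(e)
    print(message)  # 'map' object is not subscriptable

# Materialize it once, then index freely.
D = list(map(list, zip(*rows)))
print(D[0])  # [1, 3]
```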

Error when invoking macro

My task is to compute the harmonic mean using macros.
So I wrote something like this:
macro mean(arr)
    ex = Expr(:call, :/, 1, arr[1])
    for i = 2:length(arr)
        ex = Expr(:call, :+, ex, Expr(:call, :/, 1, arr[i]))
    end
    println(arr[1])
    Expr(:call, :/, length(arr), ex)
end
and then executed it with 4 arguments
@mean(2, 2, 5, 7)
which caused error:
MethodError: no method matching @mean(::Int64, ::Int64, ::Int64, ::Int64)
So here comes my question: what is wrong and how should I correct this?
It is worth mentioning that this program works for my friend, but not for me.
The problem here is that you passed the values as multiple arguments rather than as a single array. You should call @mean([2, 2, 5, 7])
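For reference, the harmonic mean the macro is meant to expand to is n divided by the sum of reciprocals. A quick sketch of the same computation in plain Python (not Julia), just to check the expected value:

```python
def harmonic_mean(xs):
    # n / (1/x1 + 1/x2 + ... + 1/xn)
    return len(xs) / sum(1 / x for x in xs)

print(harmonic_mean([2, 2, 5, 7]))  # 140/47, roughly 2.9787
```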

Assignment within Spark Scala foreach Loop

I'm new to Scala/Spark and am trying to loop through a dataframe and assign the results as the loop progresses. The following code works, but it can only print the results to the screen.
traincategory.columns.foreach { x =>
  val test1 = traincategory.select("Id", x)
  import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
  // CODE TO PERFORM ONEHOT TRANSFORMATION
  val encoded = encoder.transform(indexed)
  encoded.show()
}
As val is immutable, I have attempted to append the vectors from this transformation to another variable, as might be done in R.
//var ended = traincategory.withColumn(x, encoded(0))
I suspect Scala has a more idiomatic way of handling this.
Thank you in advance for your help.
A solution was available at:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/Correlations.scala
If anyone has similar issues with Scala MLlib, there is great example code at:
https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples/mllib