PySpark: Count every element in flatmap

I am having trouble counting every element in a list that I have created in PySpark.
Here is what I am working with:
test2 = words.filter(lambda line: re.match(r'^[AEIOU]', line)).take(10)
test2
[u'EBook', u'Author:', u'English', u'OF', u'EBOOK', u'Inc.,', u'Etext', u'Inc.,', u'Etexts', u'Etext']
Now I want to confirm that the count of test2 is 10. But every time I use test2.count(), it gives me an error:
Traceback (most recent call last):
File "", line 1, in
TypeError: count() takes exactly one argument (0 given)
Can someone help me learn how to count the elements properly?
Thank you!

test2 is a plain Python list, not an RDD (take(10) returns its results to the driver as a list), so use len(test2) to get the number of elements. When called on a list, count() returns the number of occurrences of whatever value you pass as its single required argument, which is why calling it with no arguments raises the TypeError above.
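For example, with the list shown above:
test2 = [u'EBook', u'Author:', u'English', u'OF', u'EBOOK',
         u'Inc.,', u'Etext', u'Inc.,', u'Etexts', u'Etext']

print(len(test2))             # 10, the number of elements in the list
print(test2.count(u'Etext'))  # 2, how many times the value u'Etext' occurs
# test2.count()               # TypeError: count() takes exactly one argument (0 given)
If you want Spark itself to do the counting, call count() on the RDD before take(), e.g. words.filter(lambda line: re.match(r'^[AEIOU]', line)).count(); RDD.count() takes no arguments and returns the number of elements in the RDD.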

Related

pyspark Run bootstrap parallel

I have a function that takes two Spark dataframes and some other arguments and outputs a scalar value.
I would like to bootstrap n times (to fill missing values in the dataframes above), running the whole process each time, and return the output as n rows.
I tried the below for a simple problem:
def sum_fn(a,b):
    return a+b

rdd=spark.sparkContext.parallelize(list(range(1, 9+1)))
df = rdd.map(lambda x: (x,sum_fn(1,x))).toDF()
display(df)
This works fine; however, when I pass my own function, which takes Spark dataframes (sdf) as input, instead of sum_fn,
I get an error:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 476, in dumps
return cloudpickle.dumps(obj, pickle_protocol)
File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 72, in dumps
cp.dump(obj)
File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.RLock' object
PicklingError: Could not serialize object: TypeError: cannot pickle '_thread.RLock' object
Could someone please suggest how I could do this?
Thanks
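One way to read the traceback: a Spark DataFrame holds a reference to the SparkContext (and its internal locks), so any function that closes over it cannot be pickled and shipped to the executors, which is what rdd.map needs to do. Below is a minimal sketch of the failing pattern and one common driver-side alternative; boot_fn, sdf1, sdf2 and n are hypothetical stand-ins, not names from the original post:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for the poster's objects:
n = 10                                  # number of bootstrap iterations
sdf1 = spark.range(5).toDF("a")         # placeholder dataframes
sdf2 = spark.range(5).toDF("b")

def boot_fn(df1, df2, seed):            # placeholder for the real bootstrap function
    return float(df1.count() + df2.count() + seed)

# Failing pattern: the lambda closes over sdf1/sdf2, so Spark would have to
# pickle the DataFrames (and the '_thread.RLock' inside their SparkContext):
# spark.sparkContext.parallelize(range(n)).map(
#     lambda i: boot_fn(sdf1, sdf2, i)).collect()

# Driver-side alternative: loop on the driver and collect the n scalar results;
# each boot_fn call can still use Spark internally.
results = [(i, boot_fn(sdf1, sdf2, i)) for i in range(n)]
out_df = spark.createDataFrame(results, ["iteration", "value"])
out_df.show()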

Scala - Not enough arguments for method count

I am fairly new to Scala and Spark RDD programming. The dataset I am working with is a CSV file containing a list of movies (one row per movie) and their associated user ratings (a comma-delimited list of ratings). Each column in the CSV represents a distinct user and the rating he/she gave the movie. Thus, user 1's ratings for each movie are in the 2nd column from the left:
Sample Input:
Spiderman,1,2,,3,3
Dr.Sleep, 4,4,,,1
I am getting the following error:
Task4.scala:18: error: not enough arguments for method count: (p: ((Int, Int)) => Boolean)Int.
Unspecified value parameter p.
var moviePairCounts = movieRatings.reduce((movieRating1, movieRating2) => (movieRating1, movieRating2, movieRating1._2.intersect(movieRating2._2).count()
when I execute the few lines below. In the program below, the second line of code splits the values delimited by "," and produces this:
( Spiderman, [[1,0],[2,1],[-1,2],[3,3],[3,4]] )
( Dr.Sleep, [[4,0],[4,1],[-1,2],[-1,3],[1,4]] )
On the third line, calling count() throws the error. For each pair of movies (rows), I am trying to get the number of common elements. In the above example, [-1, 2] is clearly a common element shared by both Spiderman and Dr.Sleep.
val textFile = sc.textFile(args(0))
var movieRatings = textFile.map(line => line.split(","))
.map(movingRatingList => (movingRatingList(0), movingRatingList.drop(1)
.map(ranking => if (ranking.isEmpty) -1 else ranking.toInt).zipWithIndex));
var moviePairCounts = movieRatings.reduce((movieRating1, movieRating2) => (movieRating1, movieRating2, movieRating1._2.intersect(movieRating2._2).count() )).saveAsTextFile(args(1));
My target output of line 3 is as follows:
( Spiderman, Dr.Sleep, 1 ) --> Between these 2 movies, there is 1 common entry.
Can somebody please advise?
To get the number of elements in a collection, use length or size. count() returns the number of elements that satisfy an additional condition (a predicate).
Or you could avoid building the complete intersection by using count to count the elements of the first collection which the second contains:
movieRating1._2.count(movieRating2._2.contains(_))
The error message seems pretty clear: count takes one argument, but in your call, you are passing an empty argument list, i.e. zero arguments. You need to pass one argument to count.

Converting DataFrame String column containing missing values to Date in Julia

I'm trying to convert a DataFrame String column to Date format in Julia, but if the column contains missing values an error is produced:
ERROR: MethodError: no method matching Int64(::Missing)
The code I've tried to run (which works for columns with no missing data) is:
df_pp[:tod] = Date.(df_pp[:tod], DateFormat("d/m/y"));
Other lines of code I have tried are:
df_pp[:tod] = Date.(passmissing(df_pp[:tod]), DateFormat("d/m/y"));
df_pp[.!ismissing.(df_pp[:tod]), :tod] = Date.(df_pp[:tod], DateFormat("d/m/y"));
The code relates to a column named tod in a data frame named df_pp. Both the DataFrames & Dates packages have been loaded prior to attempting this.
The passmissing way is
df_pp.tod = passmissing(x->Date(x, DateFormat("d/m/y"))).(df_pp.tod)
What happens here is this: passmissing takes a function and returns a new function that handles missings (by returning missing). Inside the brackets, with x->Date(x, DateFormat("d/m/y")), I define a new anonymous function that calls the Date function with the appropriate DateFormat.
Finally, I use the function returned by passmissing immediately on df_pp.tod, using a . to broadcast along the column.
It's easier to see the syntax if I split it up:
myDate(x) = Date(x, DateFormat("d/m/y"))
Date_accepting_missing = passmissing(myDate)
df_pp[:tod] = Date_accepting_missing.(df_pp[:tod])

Assign a value to a single SFrame element

I want to assign a value to a single element (i.e. single row and column) in an SFrame.
I am using the Python Notebook and importing graphlab.
I created an SFrame with dimensions 16364 rows x 37 columns.
The column 'test' contains zeros.
I have used the following syntax to set the value:
sf[1]['test'] = 3;
If I then type:
sf[1]['test']
then I see the correct value, i.e. "3"
But if I type:
sf
then I just see values of zero for all rows of column 'test'
Also same for sf.head() or sf['test'] or sf['test'].head()
I don't understand why one syntax shows the value "3" while the alternative does not. Is the value in sf[1]['test'] 3 or 0?
SFrames are immutable, so they don't actually support item assignment. The reason for the difference you see here is that
sf[1]['test']
isn't actually referring to the SFrame at all. "sf[1]" returns a dictionary whose keys match the SFrame's column names and whose values match the second row of the SFrame. When you assign a number to "sf[1]['test']", you are changing the value of the "test" key in the dictionary that was returned, so the SFrame "sf" is not involved in the assignment. The correct way to reference only the second value of the column "test" and assign it the value "3" would be this:
sf['test'][1] = 3
which would return this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-c52dab41d5dd> in <module>()
----> 1 sf['test'][1] = 3
TypeError: 'SArray' object does not support item assignment
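To make that dictionary-copy behaviour concrete, here is a small sketch (it assumes sf is the SFrame from the question, with the all-zero 'test' column):
row = sf[1]           # sf[1] materializes the second row as a plain Python dict
row['test'] = 3       # this mutates only that dict, not the SFrame
print(row['test'])    # 3
print(sf['test'][1])  # still 0, because the SFrame itself was never changed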

Storing the query result in a list variable in plpython function

I am very new to PostgreSQL and to writing functions, so bear with me. I need to transform a Python script into a PostgreSQL function, and I intend to use PL/Python for the purpose. However, I am having some problems doing so. When executing the function I receive an error:
ERROR: TypeError: unsupported operand type(s) for +: 'int' and 'dict'
SQL state: XX000
Context: Traceback (most recent call last):
PL/Python function "ellipse", line 5, in
meanX=float(sum(Xarray))/len(Xarray) if len(Xarray) > 0 else float('nan')
PL/Python function "ellipse"
To my knowledge, the query stores the result in a dictionary, which then leads to this error (since I am trying to operate on a list in the script). At least I think this could be the problem. So my question is: is there a way to store the query result in a list variable?
CREATE OR REPLACE FUNCTION ellipse()
returns setof ellipse_param as $$
Xarray=plpy.execute("select laius from proov")
Yarray=plpy.execute("select pikkus from proov")
meanX=float(sum(Xarray))/len(Xarray) if len(Xarray) > 0 else float('nan')
meanY=float(sum(Yarray))/len(Yarray) if len(Yarray) > 0 else float('nan')
Xdevs=[]
Ydevs=[]
for x in Xarray:
    dev=x-meanX
    Xdevs.append(dev)
    dev=0
for y in Yarray:
    dev=y-meanY
    Ydevs.append(dev)
    dev=0
sumX=0
sumY=0
for x in Xdevs:
    sumX+=x**2
for y in Ydevs:
    sumY+=y**2
Xaxes=sqrt(sumX/len(Xdevs))
Yaxes=sqrt(sumY/len(Ydevs))
A=sumX-sumY
B=sqrt(A**2+(((float(sum([a*b for a,b in zip(Xdevs,Ydevs)])))**2)*4))
C=float(sum([a*b for a,b in zip(Xdevs,Ydevs)]))*2
rotation=(atan(((A+B)/C)))
Sx=sqrt(((float(sum([(a*cos(rotation)-b*sin(rotation))**2 for a,b in zip(Xdevs,Ydevs)])))/(len(Xdevs)-2))*2)
Sy=sqrt(((float(sum([(c*sin(rotation)+d*cos(rotation))**2 for c,d in zip(Xdevs,Ydevs)])))/(len(Xdevs)-2))*2)
return meanX, meanY, rotation, Xaxes, Yaxes
$$ LANGUAGE plpython3u;
plpy.execute will give you a list of dicts, so you want something like
sum([x['laius'] for x in Xarray])
More info in the docs here http://www.postgresql.org/docs/devel/static/plpython-database.html
Edit: I read too quickly and skimmed over your entire function - you may want to put the list constructor higher up, probably right after executing your queries, so that you have a list of values to use later on (I didn't notice how much of the later code assumes the data are in a simple list).
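Concretely, that would mean building plain lists right after the queries, for example (a sketch reusing the column and table names from the question):
Xarray = [row['laius'] for row in plpy.execute("select laius from proov")]
Yarray = [row['pikkus'] for row in plpy.execute("select pikkus from proov")]
# Xarray and Yarray are now plain lists of numbers, so sum(), len() and the
# later list comprehensions work as written.
(The body would also need the math functions it calls, e.g. from math import sqrt, atan, sin, cos, but that is a separate issue from the dict-versus-list one.)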