pyspark: where to import unpackArray from?

I am doing the course UCSanDiegoX: DSE230x on edX. In the part about user-defined functions, this code is used:
def count_nan(V):
    A = unpackArray(V, data_type=np.float16)
    return int(sum(np.isnan(A)))

Count_nan_udf = udf(count_nan, IntegerType())
Though they don't explain where these functions are coming from, i.e. how to import them into the namespace.
I found udf here:
from pyspark.sql.functions import udf
And IntegerType:
from pyspark.sql.types import IntegerType
However, I can't find unpackArray. Do I need to import it at all?

I am also taking the same course. packArray() and unpackArray() are user-defined functions which are defined in the lib/numpy_pack.py file.
packArray() is used to pack a numpy array into a bytearray so it can be stored as a single field in a Spark dataframe. unpackArray() is the reverse operation.
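If you don't have the course files handy, these helpers are essentially thin wrappers around numpy's byte conversion. Here is a minimal sketch of what they could look like; this is an assumption based on the description above, not the actual course code:

import numpy as np

def packArray(a):
    # Pack a numpy array into a bytearray so it can be stored
    # as a single binary field in a Spark dataframe row.
    return bytearray(a.tobytes())

def unpackArray(x, data_type=np.float16):
    # Reverse of packArray: reinterpret the raw bytes as a numpy array.
    return np.frombuffer(x, dtype=data_type)

With helpers like these in scope (or imported from lib/numpy_pack.py), count_nan can be registered as a UDF exactly as shown in the question.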


apply map partitions on pyspark dataframe to run python logic

I would like to apply spaCy NLP on my PySpark dataframe. I am using the mapPartitions concept on my PySpark dataframe to apply Python logic that uses spaCy.
Spark version: 3.2.0
Below is the sample pyspark dataframe:
                      token  id
0   [This, is, java, world]   0
1  [This, is, spark, world]   1
Below is the code where I pass the data to the Python function and return a dictionary:
def get_spacy_doc_parallel_map_func(partitionData):
    import spacy
    from tqdm import tqdm
    import pandas as pd
    nlp = spacy.load('en_core_web_sm')
    nlp.tokenizer = nlp.tokenizer.tokens_from_list
    from spacy.tokens import Doc
    if not Doc.has_extension("text_id"):
        Doc.set_extension("text_id", default=None)
    columnNames = broadcasted_source_columns.value
    partitionData = pd.DataFrame(partitionData, columns=columnNames)

    def get_spacy_doc_parallel(data):
        '''
        This function creates a mapper of review id and review spacy.doc.Doc type
        '''
        text_tuples = []
        dodo = data[['token', 'id']].drop_duplicates(['id'])
        for _, i in dodo.iterrows():
            text_tuples.append((i['token'], {'text_id': i['id']}))
        doc_tuples = nlp.pipe(text_tuples, as_tuples=True, n_process=8, disable=['tagger', 'parser', 'ner'])
        docsf = []
        for doc, context in tqdm(doc_tuples, total=len(text_tuples)):
            doc._.text_id = context["text_id"]
            docsf.append(doc)
        vv = {}
        for doc in docsf:
            vv[doc._.text_id] = doc
        return vv

    id_spacy_doc_mapper = get_spacy_doc_parallel(partitionData)
    partitionData['spacy_doc'] = id_spacy_doc_mapper
    partitionData.reset_index(inplace=True)
    partitionData_dict = partitionData.to_dict("index")
    result = []
    for key in partitionData_dict:
        result.append(partitionData_dict[key])
    return iter(result)

resultDF_tokens = data.rdd.mapPartitions(get_spacy_doc_parallel_map_func)
df = spark.createDataFrame(resultDF_tokens)
The issue I am getting here is that the length of the dictionary values does not match the length of the dataframe. Below is the error:
Error:
ValueError: Length of values (954) does not match length of index (1438)
Output:
{0: This is java word, 1: This is spark world }
The above dictionary is assigned as a column to the pandas dataframe after applying spaCy (partitionData['spacy_doc'] = id_spacy_doc_mapper).
I don't have enough experience with spaCy to figure out what the intent is here, and I'm confused by the input and output because the input already looks tokenized, but I'll take a stab at it and list my assumptions and the problems I ran into.
First off, I think Fugue can make this type of transformation significantly easier. It will use the underlying Spark UDF, pandas_udf, mapPartitions, or mapInPandas depending on what parameters you supply. The point is that Fugue will handle that boilerplate. For you, it seems you have a Pandas DataFrame going in (that part is clear), but the output is less clear. I think you are passing back some iterable of lists to make Spark happy, but a Pandas DataFrame output might be simpler. I'm guessing here.
So first we set some stuff up. This is all native Python. The tokens_from_list portion was removed from the original because it seems the latest spaCy versions deprecated it. It shouldn't matter for the example.
import pandas as pd
from typing import List, Any, Iterable, Dict
import spacy

nlp = spacy.load('en_core_web_sm')

from spacy.tokens import Doc
if not Doc.has_extension("text_id"):
    Doc.set_extension("text_id", default=None)

data = pd.DataFrame({"token": ["This is java world", "This is spark world"],
                     "id": [0, 1]})
and then you define your logic for one partition. I am assuming a Pandas DataFrame in and a Pandas DataFrame out, but Fugue can actually support many other types, such as Pandas DataFrame in and Iterable[List] out. The important thing is just that you annotate your logic so Fugue knows how to handle it. Note this code is still native Python. I edited the logic a bit just to get it to work. Again, I am pretty sure I butchered the logic somewhere, but the example can still work. I really couldn't find a way to get the original working (because I don't know spaCy well enough).
def get_spacy_doc(data: pd.DataFrame) -> pd.DataFrame:
    text_tuples = []
    dodo = data[['token', 'id']].drop_duplicates(['id'])
    for _, i in dodo.iterrows():
        text_tuples.append((i['token'], {'text_id': i['id']}))
    doc_tuples = nlp.pipe(text_tuples, as_tuples=True, n_process=1, disable=['tagger', 'parser', 'ner'])
    docsf = []
    for doc, context in doc_tuples:
        doc._.text_id = context["text_id"]
        docsf.append(doc)
    vv = {}
    for doc in docsf:
        vv[doc._.text_id] = doc
    id_spacy_doc_mapper = vv.copy()
    data['space_doc'] = id_spacy_doc_mapper
    return data
Now to bring this to Spark, all you have to do with Fugue is:
from fugue import transform
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(data)
sdf = transform(sdf, get_spacy_doc, schema="*, space_doc:int", engine=spark)
sdf.show()
and the Fugue transform will handle it. This runs it on Spark, but you can also run on Pandas if you don't supply an engine, like this:
df = transform(data, get_spacy_doc, schema="*, space_doc:int")
This allows you to test the logic cleanly without relying on Spark; it will then work when you bring it to Spark. The schema is a requirement here because it is a requirement for Spark.
On partitioning
The Fugue transform can take a partitioning strategy. For example:
transform(df, func, schema="*", partition={"by":"col1"}, engine=spark)
but in this case I don't think you partition on anything, so you can just use the default partitions, which is what will happen.
On parallelization
You have code like this:
nlp.pipe(text_tuples, as_tuples=True,n_process=8,disable=['tagger','parser','ner'])
This is two-stage parallelism. The first stage is Spark mapping over partitions, and the second stage is this pipe operation (I assume). Two-stage parallelism is an anti-pattern in distributed computing because the first stage will already occupy all the available resources on the cluster; the parallelism should be done at the partition level. If you do something like this, it's very common to run into resource deadlocks when the second stage also tries to grab resources. I would recommend setting n_process=1.
On tqdm
I may be wrong on this one, but I don't think tqdm plays well with Spark, because I don't think you can get a real-time progress bar for work that happens on the worker machines. It can only work on the driver machine; the workers don't send logs to the driver for the functions they run.
If the example is clearer, I can certainly help you port this logic to Spark. Feel free to reach out. I hope at least some bit of this was useful.

How do I calculate a simple one-sample t-statistic in Scala-Spark in an AWS EMR cluster?

I'm a data scientist and still relatively new to Scala. I'm trying to understand the Scala documentation and run a t-test from any existing package. I am looking for sample Scala code on a dummy data set that will work, and for insight into how to read the documentation.
I'm working in an EMR Notebook (basically a Jupyter notebook) in an AWS EMR cluster environment. I tried referring to this documentation but apparently I am not able to understand it: https://commons.apache.org/proper/commons-math/javadocs/api-3.6/org/apache/commons/math3/stat/inference/TTest.html#TTest()
Here's what I've tried, using multiple import statements for two different packages that have t-test functions. I have multiple lines for the math3.stat.inference package since I'm not entirely certain of the differences between them and wanted to make sure this part wasn't the problem.
import org.apache.commons.math3.stat.inference
import org.apache.commons.math3.stat.inference._ // not sure if this means: import all classes/methods/functions
import org.apache.commons.math3.stat.inference.TTest._
import org.apache.commons.math3.stat.inference.TTest
import org.apache.spark.mllib.stat.test
No errors there.
import org.apache.asdf
Returns an error, as expected.
The documentation for math3.stat.inference says there is a TTest() constructor and then shows a bunch of methods. How does this tell me how to use these functions/methods/classes? I can see that the following "method" does what I'm looking for:
t(double m, double mu, double v, double n)
Computes t test statistic for 1-sample t-test.
but I don't know how to use it. Here are just several of the things I've tried:
inference.t
inference.StudentTTest
test.student
test.TTest
TTest.t
etc.
But I get errors like the following:
An error was encountered:
<console>:42: error: object t is not a member of package org.apache.spark.mllib.stat.test
test.t
An error was encountered:
<console>:42: error: object TTest is not a member of package org.apache.spark.mllib.stat.test
test.TTest
...etc.
So how do I fix these issues/calculate a simple, one-sample t-statistic in Scala with a Spark kernel? Any instructions/guidance on how to understand the documentation will be helpful for the long-term as well.
The formula for computing a one-sample t test is quite straightforward to implement as a UDF (user defined function).
UDFs are how we write custom functions to apply to different rows of a DataFrame. I assume you are okay with generating the aggregated values using standard groupBy and agg functions.
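For reference, the statistic being computed for each row is

t = (sample_mean - mu) / (sample_sd / sqrt(sample_size))

which is exactly what the UDF below implements.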
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.expressions.UserDefinedFunction
val data = Seq((310, 40, 300.0, 18.5), (310, 41, 320.0, 14.5)).toDF("mu", "sample_size", "sample_mean", "sample_sd")
+---+-----------+-----------+---------+
| mu|sample_size|sample_mean|sample_sd|
+---+-----------+-----------+---------+
|310| 40| 300.0| 18.5|
|310| 41| 320.0| 14.5|
+---+-----------+-----------+---------+
val testStatisticUdf: UserDefinedFunction = udf {
  (sample_mean: Double, mu: Double, sample_sd: Double, sample_size: Int) =>
    (sample_mean - mu) / (sample_sd / math.sqrt(sample_size.toDouble))
}
val result = data.withColumn("testStatistic", testStatisticUdf(col("sample_mean"), col("mu"), col("sample_sd"), col("sample_size")))
+---+-----------+-----------+---------+-------------------+
| mu|sample_size|sample_mean|sample_sd| testStatistic|
+---+-----------+-----------+---------+-------------------+
|310| 40| 300.0| 18.5|-3.4186785515333833|
|310| 41| 320.0| 14.5| 4.4159477499536886|
+---+-----------+-----------+---------+-------------------+

Why is there a problem with the code? I am connected to the clusters

I was trying to apply a UDF to round the pct values; maybe there are better ways, and I'm open to them because I am new to PySpark.
When I removed the UDF and gave up on rounding the numbers, it worked, so I am confident the dataframe itself is fine.
So please help me, love & peace.
I tried spark.sql in Databricks to get this dataframe and it looked good.
Here are the codes:
from pyspark.sql.types import IntegerType
round_func = udf(lambda x:round(x,2), IntegerType())
q2_res = q2_res.withColumn('pct_DISREGARD', round_func(col('pct')))
display(q2_res)
ERROR:
AttributeError: 'NoneType' object has no attribute '_jvm'
Apparently we can't use any of pyspark.sql.functions inside a udf; a detailed explanation is given in this thread. You're trying to use the round function from that module, so it's not going to work, as it only works on columns. We can achieve the same functionality in a much easier way:
from pyspark.sql.types import IntegerType
import pyspark.sql.functions as f
q2_res = q2_res.withColumn('pct_DISREGARD', f.round('pct', 2).astype(IntegerType()))
It's usually advisable to avoid UDFs as much as possible, because they are usually slower than native dataframe operations.
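If you do still want a UDF for some reason, the trick is to make sure the lambda calls Python's built-in round rather than pyspark.sql.functions.round, which is what appears to have been picked up in your code. A minimal sketch, assuming pct is a numeric column (the choice of DoubleType here is mine, not from your code):

import builtins
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

# Explicitly use the Python built-in so it cannot be shadowed by
# a star import from pyspark.sql.functions. DoubleType is used because
# rounding to 2 decimals produces a float, not an integer.
round_func = udf(lambda x: builtins.round(float(x), 2) if x is not None else None, DoubleType())

q2_res = q2_res.withColumn('pct_DISREGARD', round_func(col('pct')))

But the native f.round version above remains the better choice.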

Using a module with udf defined inside freezes pyspark job - explanation?

Here is the situation:
We have a module where we define some functions that return a pyspark.sql.DataFrame (DF). To get those DFs we use some pyspark.sql.functions.udf defined either in the same file or in helper modules. When we actually write a job for PySpark to execute, we only import functions from the modules (we provide a .zip file to --py-files) and then just save the dataframe to HDFS.
The issue is that when we do this, the udf function freezes our job. The nasty fix we found was to define the udf functions inside the job and pass them to the imported functions from our module.
The other fix I found here is to define a class:
from pyspark.sql.functions import udf

class Udf(object):
    def __init__(s, func, spark_type):
        s.func, s.spark_type = func, spark_type

    def __call__(s, *args):
        return udf(s.func, s.spark_type)(*args)
Then I use this to define my udf in the module. This works!
Can anybody explain why we have this problem in the first place? And why this fix (the last one with the class definition) works?
Additional info: PySpark 2.1.0. Deploying job on yarn in cluster mode.
Thanks!
The accepted answer in the link you posted above says, "My work around is to avoid creating the UDF until Spark is running and hence there is an active SparkContext." It looks like your issue is with serializing the UDF.
Make sure the UDF functions in your helper classes are static methods or global functions, and define the udf inside the public functions that you import elsewhere.
class Helperclass(object):
    @staticmethod
    def my_udf_todo(...):
        ...

def public_function_that_is_imported_elsewhere(...):
    todo_udf = udf(Helperclass.my_udf_todo, RETURN_SCHEMA)
    ...
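For completeness, here is a minimal, self-contained sketch of that pattern; all of the names below are hypothetical and not from your module:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType


class HelperClass(object):
    @staticmethod
    def clean_value(v):
        # Plain Python, safe to define at module import time.
        return v.strip().lower() if v is not None else None


def add_clean_column(df):
    # The udf object is only created here, at call time, when the job
    # is running and an active SparkContext/SparkSession already exists.
    clean_udf = udf(HelperClass.clean_value, StringType())
    return df.withColumn("value_clean", clean_udf(col("value")))


if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("  Foo ",), ("BAR",)], ["value"])
    add_clean_column(df).show()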

Aggregate function in spark-sql not found

I am new to Spark and I am trying to make use of some aggregate features, like sum or avg. My query in spark-shell works perfectly:
val somestats = pf.groupBy("name").agg(sum("days")).show()
When I try to run it from a Scala project it is not working though, throwing an error message:
not found: value sum
I have tried to add
import sqlContext.implicits._
import org.apache.spark.SparkContext._
just before the command, but it does not help. My Spark version is 1.4.1. Am I missing anything?
You need this import:
import org.apache.spark.sql.functions._
You can also use the sum method directly on GroupedData (groupBy returns this type):
val somestats = pf.groupBy("name").sum("days").show()