How to develop and test python transforms code locally? - pyspark

What is the recommended way to develop and test python transforms code locally, given that the input datasets fit into memory of the local machine?

The simplest way that doesn't require you to mock the transforms package, would be to just extract your logic into a pure python with pyspark function that receives dataframes as input and returns the dataframe.
i.e.:
# yourtransform.py
from my_business_logic import magic_super_complex_computation
#transform_df(
Output("/foo/bar/out_dataset"),
input1=Input("/foo/bar/input1"),
input2=Input("/foo/bar/input2"))
def my_transform(input1, input2):
return magic_super_complex_computation(input1, input2)
You can now import in your test the magic_super_complex_computation and test it just with pyspark.
i.e:
from my_business_logic import magic_super_complex_computation
def test_magic_super_complex_computation(spark_context):
df1 = load_my_data_as_df(spark_context, "input1")
df2 = load_my_data_as_df(spark_context, "input2")
result = magic_super_complex_computation(input1, input2).collect()
assert len(result) == 123
Do note that this requires you to provide a valid spark context as a fixture in your pytest (or whatever testing framework you are using)

Related

apply map partitions on pyspark dataframe to run python logic

I would like to apply spacy nlp on my pyspark dataframe. I am using map partitions concept on my pyspark dataframe to apply python logic that consists of spacy.
Spark version: 3.2.0
Below is the sample pyspark dataframe:
token id
0 [This, is, java, world] 0
1 [This, is, spark, world] 1
Below is the code where I am passing a data to the python function and returning a dictionary
def get_spacy_doc_parallel_map_func(partitionData):
import spacy
from tqdm import tqdm
import pandas as pd
nlp=spacy.load('en_core_web_sm')
nlp.tokenizer=nlp.tokenizer.tokens_from_list
from spacy.tokens import Doc
if not Doc.has_extension("text_id"):
Doc.set_extension("text_id", default=None)
columnNames = broadcasted_source_columns.value
partitionData = pd.DataFrame(partitionData, columns=columnNames)
'''
This function creates a mapper of review id and review spacy.doc.Doc type
'''
def get_spacy_doc_parallel(data):
text_tuples = []
dodo = data[['token','id']].drop_duplicates(['id'])
for _,i in dodo.iterrows():
text_tuples.append((i['token'],{'text_id':i['id']}))
doc_tuples = nlp.pipe(text_tuples, as_tuples=True,n_process=8,disable=['tagger','parser','ner'])
docsf = []
for doc, context in tqdm(doc_tuples,total=len(text_tuples)):
doc._.text_id = context["text_id"]
docsf.append(doc)
vv={}
for doc in docsf:
vv[doc._.text_id] = doc
return vv
id_spacy_doc_mapper = get_spacy_doc_parallel(partitionData)
partitionData['spacy_doc'] = id_spacy_doc_mapper
partitionData.reset_index(inplace=True)
partitionData_dict = partitionData.to_dict("index")
result = []
for key in partitionData_dict:
result.append(partitionData_dict[key])
return iter(result)
resultDF_tokens = data.rdd.mapPartitions(get_spacy_doc_parallel_map_func)
df = spark.createDataFrame(resultDF_tokens)
The issue I am getting here is that the length of dictionary values does not match with length of the dataframe. Below is the error
Error:
ValueError: Length of values (954) does not match length of index (1438)
Output:
{0: This is java word, 1: This is spark world }
The above dictionary is assigned as a column to the python dataframe after applying spacy (partitionData['spacy_doc'] = id_spacy_doc_mapper)
I don't have enough experience with spacy to figure out what the intent is here and I'm very confused by the input and output because the input looks tokenized, but I'll take a stab at it and list my assumptions and the problems I ran into.
First off, I think Fugue can make this type of transformation significantly easier. It will use the underlying Spark UDF, pandas_udf, mapPartition, or mapInPandas depending what parameters you supply. The point is that Fugue will handle that boilerplate. For you, it seems you have Pandas DataFrame in (that part is clear), but the output is less clear. I think you are passing some iterable of list to make Spark happy, but I think Pandas DataFrame output might be simpler. I'm guessing here.
So first we set some stuff up. This is all native Python. The tokens_from_list portion was removed from the original because it seems like the latest versions deprecated it. Shouldn't matter for the example.
import pandas as pd
from typing import List, Any, Iterable, Dict
import spacy
nlp=spacy.load('en_core_web_sm')
from spacy.tokens import Doc
if not Doc.has_extension("text_id"):
Doc.set_extension("text_id", default=None)
data = pd.DataFrame({"token": ["This is java world", "This is spark world"],
"id": [0, 1]})
and then you define your logic for one partition. I am assuming Pandas DataFrame in and Pandas DataFrame out, but Fugue can actually support many other types such as Pandas DataFrame in and Iterable[List] out. The important thing is just you annotate your logic so Fugue knows how to handle it. Note this code is still native Python. I edited the logic a bit to just get it to work. Again, I am pretty sure I butchered the logic somewhere, but the example can still work. I really couldn't find a way for the original to work (because I don't know spacy enough)
def get_spacy_doc(data: pd.DataFrame) -> pd.DataFrame:
text_tuples = []
dodo = data[['token','id']].drop_duplicates(['id'])
for _,i in dodo.iterrows():
text_tuples.append((i['token'],{'text_id':i['id']}))
doc_tuples = nlp.pipe(text_tuples, as_tuples=True,n_process=1,disable=['tagger','parser','ner'])
docsf = []
for doc, context in doc_tuples:
doc._.text_id = context["text_id"]
docsf.append(doc)
vv={}
for doc in docsf:
vv[doc._.text_id] = doc
id_spacy_doc_mapper = vv.copy()
data['space_doc'] = id_spacy_doc_mapper
return data
Now to bring this to Spark, all you have to do with Fugue is:
from fugue import transform
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(data)
sdf = transform(sdf, get_spacy_doc, schema="*, space_doc:int", engine=spark)
sdf.show()
and the Fugue transform will handle it. This is to run on Spark, but you can also run on Pandas if you don't supply an engine like this:
df = transform(data, get_spacy_doc, schema="*, space_doc:int")
This allows you to test the logic clearly without relying on Spark. It will then work when you bring it to Spark. Schema is a requirement because it is a requirement for Spark.
On partitioning
The Fugue transform can take partitioning strategy. For example:
transform(df, func, schema="*", partition={"by":"col1"}, engine=spark)
but for this case, I don't think you partition on anything so you can just use the default partitions, which is what will happen.
On parallelization
You have this code like:
nlp.pipe(text_tuples, as_tuples=True,n_process=8,disable=['tagger','parser','ner'])
This is two-stage parallelism. The first stage is Spark mapping over partitions, and the second stage is this pipe operation (I assume). Two stage parallelism is an anti-pattern in distributed computing because the first stage will already occupy all the available resources on the cluster. The parallelism should be done on the partition level. If you do something like this, it's very common to run into resource deadlocks when the 2nd stage tries to occupy resources also. I would recommend setting the n_process=1.
On tqdm
I may be wrong on this one but I don't think tqdm plays well with Spark because I don't think you can get a real time progress bar for work that happens on worker machines. It can only work on the driver machine. The workers don't send logs to the driver for the functions it runs.
If the example is clearer, I can certainly help you port this logic to Spark. Feel free to reach out. I hope at least some bit of this was useful.

How can I parametrize my fixture and get testdata to parametreize my tests

I'm a starter with pytest. Just learned about fixture and tried to do this:
My tests call functions I wrote, and get test data from a code practicing website.
Each test is from a particular page and has several sets of test data.
So, I want to use #pytest.mark.parametrize to parametrize my single test func.
Also, as the operations of the tests are samelike, I want to made the pageObject instantiation and the steps to get test data from page as a fixture.
# content of conftest.py
#pytest.fixture
def get_testdata_from_problem_page():
def _get_testdata_from_problem_page(problem_name):
page = problem_page.ProblemPage(problem_name)
return page.get_sample_data()
return _get_testdata_from_problem_page
# content of test_problem_a.py
import pytest
from page_objects import problem_page
from problem_func import problem_a
#pytest.mark.parametrize('input,expected', test_data)
def test_problem_a(get_testdata_from_problem_page):
input, expected = get_testdata_from_problem_page("problem_a")
assert problem_a.problem_a(input) == expected
Then I realized, as above, I can't parametrize the test using pytest.mark as the test_data should be given outside the test function....
Are there solutions for this? Thanks very much~~
If I understand you correctly, you want to write one parameterized test per page. In this case you just have to write a function instead of a fixture and use that for parametrization:
import pytest
from page_objects import problem_page
from problem_func import problem_a
def get_testdata_from_problem_page(problem_name):
page = problem_page.ProblemPage(problem_name)
# returns a list of (input, expected) tuples
return page.get_sample_data()
#pytest.mark.parametrize('input,expected',
get_testdata_from_problem_page("problem_a"))
def test_problem_a(input, expected):
assert problem_a.problem_a(input) == expected
As you wrote, a fixture can only be used as a parameter to a test function or to another fixture, not in a decorator.
If you want to use the function to get test data elsewhere, just move it to a common module and import it. This can be just some custom utility module, or you could put it into contest.py, though you still have to import it.
Note also that the fixture you wrote does not do anything - it defines a local function that is not called and returns.

Using a module with udf defined inside freezes pyspark job - explanation?

Here is the situation:
We have a module where we define some functions that return pyspark.sql.DataFrame (DF). To get those DF we use some pyspark.sql.functions.udf defined either in the same file or in helper modules. When we actually write job for pyspark to execute we only import functions from modules (we provide a .zip file to --py-files) and then just save the dataframe to hdfs.
Issue is that when we do this, the udf function freezes our job. The nasty fix we found was to define udf functions inside the job and provide them to imported functions from our module.
The other fix I found here is to define a class:
from pyspark.sql.functions import udf
class Udf(object):
def __init__(s, func, spark_type):
s.func, s.spark_type = func, spark_type
def __call__(s, *args):
return udf(s.func, s.spark_type)(*args)
Then use this to define my udf in module. This works!
Can anybody explain why we have this problem in the first place? And why this fix (the last one with the class definition) works?
Additional info: PySpark 2.1.0. Deploying job on yarn in cluster mode.
Thanks!
The accepted answer to the link you posted above, says, "My work around is to avoid creating the UDF until Spark is running and hence there is an active SparkContext." Looks like your issue is with serializing the UDF.
Make sure the UDF functions in your helper classes are static methods or global functions. And inside the public functions that you import elsewhere, you can define the udf.
class Helperclass(object):
#staticmethod
def my_udf_todo(...):
...
def public_function_that_is_imported_elsewhere(...):
todo_udf = udf(Helperclass.my_udf_todo, RETURN_SCHEMA)
...

Register UDF to SqlContext from Scala to use in PySpark

Is it possible to register a UDF (or function) written in Scala to use in PySpark ?
E.g.:
val mytable = sc.parallelize(1 to 2).toDF("spam")
mytable.registerTempTable("mytable")
def addOne(m: Integer): Integer = m + 1
// Spam: 1, 2
In Scala, the following is now possible:
val UDFaddOne = sqlContext.udf.register("UDFaddOne", addOne _)
val mybiggertable = mytable.withColumn("moreSpam", UDFaddOne(mytable("spam")))
// Spam: 1, 2
// moreSpam: 2, 3
I would like to use "UDFaddOne" in PySpark like
%pyspark
mytable = sqlContext.table("mytable")
UDFaddOne = sqlContext.udf("UDFaddOne") # does not work
mybiggertable = mytable.withColumn("+1", UDFaddOne(mytable("spam"))) # does not work
Background: We are a team of developpers, some coding in Scala and some in Python, and would like to share already written functions. It would also be possible to save it into a library and import it.
As far as I know PySpark doesn't provide any equivalent of the callUDF function and because of that it is not possible to access registered UDF directly.
The simplest solution here is to use raw SQL expression:
mytable.withColumn("moreSpam", expr("UDFaddOne({})".format("spam")))
## OR
sqlContext.sql("SELECT *, UDFaddOne(spam) AS moreSpam FROM mytable")
## OR
mytable.selectExpr("*", "UDFaddOne(spam) AS moreSpam")
This approach is rather limited so if you need to support more complex workflows you should build a package and provide complete Python wrappers. You'll find and example UDAF wrapper in my answer to Spark: How to map Python with Scala or Java User Defined Functions?
The following worked for me (basically a summary of multiple places including the link provided by zero323):
In scala:
package com.example
import org.apache.spark.sql.functions.udf
object udfObj extends Serializable {
def createUDF = {
udf((x: Int) => x + 1)
}
}
in python (assume sc is the spark context. If you are using spark 2.0 you can get it from the spark session):
from py4j.java_gateway import java_import
from pyspark.sql.column import Column
jvm = sc._gateway.jvm
java_import(jvm, "com.example")
def udf_f(col):
return Column(jvm.com.example.udfObj.createUDF().apply(col))
And of course make sure the jar created in scala is added using --jars and --driver-class-path
So what happens here:
We create a function inside a serializable object which returns the udf in scala (I am not 100% sure Serializable is required, it was required for me for more complex UDF so it could be because it needed to pass java objects).
In python we use access the internal jvm (this is a private member so it could be changed in the future but I see no way around it) and import our package using java_import.
We access the createUDF function and call it. This creates an object which has the apply method (functions in scala are actually java objects with the apply method). The input to the apply method is a column. The result of applying the column is a new column so we need to wrap it with the Column method to make it available to withColumn.

Run a read-only test in Spark

I want to compare the read performance of different storage systems using Spark ,e.g. HDFS/S3N. I have written a small Scala program for this:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
object SimpleApp {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("WordCount")
val sc = new SparkContext(conf)
val file = sc.textFile("s3n://test/wordtest")
val splits = file.map(word => word)
splits.saveAsTextFile("s3n://test/myoutput")
}
}
My question is, is it possible to run a read-only test with Spark? For the program above, isn't saveAsTextFile() causing some write as well?
I am not sure if that is possible at all. In order to run a transformation, a posterior action is necessary.
From the official Spark documentation:
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
Taking this into account, saveAsTextFile might not be considered the lightest from the wide range of actions available. Several lightweight alternatives exists, actions like count or first, for example. These would leverage almost the totality of the work to the transformations phase, making you able to measure the performance of your solution.
You might want to check the available actions and choose the one that best fits your requirements.
Yes."saveAsTextFile" writes the RDD data to text file using given path.