Create pyspark column from large # of case statements w/ regex - pyspark

I'm trying to transform a complicated text field into one of ~2000 possible values based on regular expressions and conditions.
Example: if VAL1 in ('3025','4817') and re.match('foo', VAL2) then (123, "GROUP_ABX")
elif ... (repeat about 2000 unique scenarios)
I put this bunch of conditions into a massive PySpark UDF. The problem is that once there are more than a few hundred conditions, performance grinds to a halt.
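For illustration only, a minimal sketch of what such a UDF might look like (the name FOOTag, the VAL1/VAL2 inputs and the first condition come from the question; the second branch and the fallback values are hypothetical):
import re

def FOOTag(VAL1, VAL2):
    # first condition taken from the example above
    if VAL1 in ('3025', '4817') and re.match('foo', VAL2):
        return (123, "GROUP_ABX")
    # hypothetical second branch; the real function has roughly 2000 of these
    elif VAL1 in ('5012',) and re.match('bar', VAL2):
        return (456, "GROUP_XYZ")
    # ... ~2000 branches in total ...
    return (-1, "NO_MATCH")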
The UDF is registered like so:
schema = StructType([
    StructField("FOO_ID", IntegerType(), False),
    StructField("FOO_NAME", StringType(), False)])
spark.udf.register("FOOTagging", FOOTag, schema)
test_udf = udf(FOOTag, schema)
the dataframe is updated like:
df1 = spark.read.csv(file)\
    .toDF(*Fields)\
    .select(*FieldList)\
    .withColumn("FOO_TAG_STRUCT", test_udf('VAL1','VAL2'))
When I run with <200 conditions, I process the 23k row file in a couple seconds. When I get over 500 or so, it takes forever.
It seems the UDF doesn't handle large functions well. Is there another solution out there?

Related

drop all df2.columns from another df (pyspark.sql.dataframe.DataFrame specific)

I have a large DF (pyspark.sql.dataframe.DataFrame) that is the result of multiple joins, plus new columns created by using a combination of inputs from different DFs, including DF2.
I want to drop all DF2 columns from DF after I'm done with the join/creating new columns based on DF2 input.
drop() doesn't accept a list - only a string or a Column.
I know that df.drop("col1", "col2", "coln") will work but I'd prefer not to crowd the code (if I can) by listing those 20 columns.
Is there a better way of doing this in pyspark dataframe specifically?
drop_cols = df2.columns
df = df.drop(*drop_cols)
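A quick self-contained sketch of the same idea (toy column names, just to show the star-unpacking into drop()):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a", 10)], ["id", "c1_from_df2", "c2_from_df2"])
drop_cols = ["c1_from_df2", "c2_from_df2"]   # stand-in for df2.columns

# drop() takes *cols, so unpack the list into separate arguments
df = df.drop(*drop_cols)
df.show()   # only "id" remains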

Reading and appending files into a spark dataframe

I have created an empty dataframe and started adding to it by reading each file. But one of the files has more columns than the previous ones. How can I select only the columns in the first file for all the other files?
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType
import os, glob
spark = SparkSession.builder.\
    config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")\
    .enableHiveSupport().getOrCreate()
fpath=''
schema = StructType([])
sc = spark.sparkContext
df_spark=spark.createDataFrame(sc.emptyRDD(), schema)
files = glob.glob(fpath + '*.sas7bdat')
for i, f in enumerate(files):
    if i == 0:
        df = spark.read.format('com.github.saurfang.sas.spark').load(f)
        df_spark = df
    else:
        df = spark.read.format('com.github.saurfang.sas.spark').load(f)
        df_spark = df_spark.union(df)
You can provide your own schema while creating a dataframe.
For example, I have two files, emp1.csv and emp2.csv, with different schemas:
emp1.csv:
id,empname,empsalary
1,Vikrant,55550

emp2.csv:
id,empname,empsalary,age,country
2,Raghav,10000,32,India
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("salary", IntegerType(), True)])
file_path="file:///home/vikct001/user/vikrant/inputfiles/testfiles/emp*.csv"
df=spark.read.format("com.databricks.spark.csv").option("header", "true").schema(schema).load(file_path)
Specifying a schema not only addresses data type and format issues, it also improves performance.
There are other options as well if you need to drop malformed records, but note that this will also drop records that have nulls or don't fit the schema provided.
It may also skip records with multiple delimiters or junk characters, or an empty file.
.option("mode", "DROPMALFORMED")
FAILFAST mode will throw an exception as soon as it finds a malformed record.
.option("mode", "FAILFAST")
You can also use a map function to select the elements of your choice and exclude the others while building the dataframe.
df = spark.read.format('com.databricks.spark.csv').option("header", "true").load(file_path) \
    .rdd.map(lambda x: (x[0], x[1], x[2])).toDF(["id", "name", "salary"])
You need to set header to 'true' in both cases, otherwise the CSV header will be included as the first record of your dataframe.
You can get the field names from the schema of the first file and then use that array of field names to select the columns from all the other files.
val fields = df.schema.fieldNames
You can use the fields array to select the columns from all other datasets. Following is the Scala code for that:
val df = spark.read.format("com.github.saurfang.sas.spark").load(f).select(fields(0), fields.drop(1): _*)
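Since the question itself is in PySpark, a rough Python equivalent of the same idea (take the column list from the first file and trim every later file to it before the union) could be:
files = glob.glob(fpath + '*.sas7bdat')

# the columns of the first file define the set we keep
df_spark = spark.read.format('com.github.saurfang.sas.spark').load(files[0])
fields = df_spark.columns

for f in files[1:]:
    df = spark.read.format('com.github.saurfang.sas.spark').load(f).select(*fields)
    df_spark = df_spark.union(df)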

Applying transformations with filter or map: which one is faster? (Scala Spark)

I am trying to do some transformations on a dataset with Spark using Scala. I'm currently using Spark SQL but want to move the code to native Scala code. I want to know whether to use filter or map for operations like matching values in a column and getting a single column after the transformation into a different dataset.
SELECT * FROM TABLE WHERE COLUMN = ''
I used to write something like this in Spark SQL. Can someone tell me an alternative way to write the same using map or filter on the dataset, and which one is faster?
You can read the documentation on the Apache Spark website. This is the link to the API documentation: https://spark.apache.org/docs/2.3.1/api/scala/index.html#package
Here is a little example -
val df = sc.parallelize(Seq((1,"ABC"), (2,"DEF"), (3,"GHI"))).toDF("col1","col2")
val df1 = df.filter("col1 > 1")
df1.show()
val df2 = df1.map(x => x.getInt(0) + 3)
df2.show()
If I understand your question correctly, you need to rewrite your SQL query using the DataFrame API. Your query reads all columns from table TABLE and filters rows where COLUMN is empty. You can do this with a DF in the following way:
spark.read.table("TABLE")
.where($"COLUMN".eqNullSafe(""))
.show(10)
Performance will be the same as in your SQL. Use the dataFrame.explain(true) method to understand what Spark will do.

How to use QuantileDiscretizer across groups in a DataFrame?

I have a DataFrame with the following columns.
scala> show_times.printSchema
root
|-- account: string (nullable = true)
|-- channel: string (nullable = true)
|-- show_name: string (nullable = true)
|-- total_time_watched: integer (nullable = true)
This is data about how much time a customer has spent watching a particular show. I'm supposed to categorize the customers for each show based on total time watched.
The dataset has 133 million rows in total with 192 distinct show_names.
For each individual show I'm supposed to bin the customer into 3 categories (1,2,3).
I use Spark MLlib's QuantileDiscretizer
Currently I loop through every show and run QuantileDiscretizer in the sequential manner as in the code below.
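(The loop itself isn't reproduced here; purely as a sketch, a sequential per-show pass with QuantileDiscretizer typically looks something like the following, with hypothetical variable names.)
from pyspark.ml.feature import QuantileDiscretizer

show_names = [r.show_name for r in df.select("show_name").distinct().collect()]
binned_parts = []
for show in show_names:
    subset = df.where(df.show_name == show)
    qd = QuantileDiscretizer(numBuckets=3, inputCol="total_time_watched",
                             outputCol="Time_watched_bin")
    # fit a per-show model and bin that show's rows
    binned_parts.append(qd.fit(subset).transform(subset))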
What I'd like to have in the end is for the following sample input to get the sample output.
Sample Input:
account,channel,show_name,total_time_watched
acct1,ESPN,show1,200
acct2,ESPN,show1,250
acct3,ESPN,show1,800
acct4,ESPN,show1,850
acct5,ESPN,show1,1300
acct6,ESPN,show1,1320
acct1,ESPN,show2,200
acct2,ESPN,show2,250
acct3,ESPN,show2,800
acct4,ESPN,show2,850
acct5,ESPN,show2,1300
acct6,ESPN,show2,1320
Sample Output:
account,channel,show_name,total_time_watched,Time_watched_bin
acct1,ESPN,show1,200,1
acct2,ESPN,show1,250,1
acct3,ESPN,show1,800,2
acct4,ESPN,show1,850,2
acct5,ESPN,show1,1300,3
acct6,ESPN,show1,1320,3
acct1,ESPN,show2,200,1
acct2,ESPN,show2,250,1
acct3,ESPN,show2,800,2
acct4,ESPN,show2,850,2
acct5,ESPN,show2,1300,3
acct6,ESPN,show2,1320,3
Is there a more efficient and distributed way to do it using some groupBy-like operation instead of looping through each show_name and bin it one after other?
I know nothing about QuantileDiscretizer, but I think you're mostly concerned with the dataset to apply QuantileDiscretizer to. I think you want to figure out how to split your input dataset into smaller datasets per show_name (you said that there are 192 distinct show_names in the input dataset).
Solution 1: Partition Parquet Dataset
I've noticed that you use parquet as the input format. My understanding of the format is very limited, but I've noticed that people use some partitioning scheme to split large datasets into smaller chunks that they can then process however they like (per that partitioning scheme).
In your case the partitioning scheme could include show_name.
That would make your case trivial, as the splitting would be done at write time (aka not my problem anymore).
See How to save a partitioned parquet file in Spark 2.1?
Solution 2: Scala's Future
Given your iterative solution, you could wrap every iteration into a Future that you'd submit to process in parallel.
Spark SQL's SparkSession (and Spark Core's SparkContext) are thread-safe.
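The answer above is written in terms of Scala Futures; a rough PySpark analogue (not from the answer) would be to submit each per-show job from a Python thread pool. process_one_show below is a hypothetical stand-in for the existing per-show QuantileDiscretizer logic:
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from pyspark.sql import DataFrame

def process_one_show(show_name):
    # hypothetical: filter one show and run the existing QuantileDiscretizer step on it
    subset = df.where(df.show_name == show_name)
    return run_quantile_discretizer(subset)   # placeholder for the per-show logic

show_names = [r.show_name for r in df.select("show_name").distinct().collect()]

# SparkSession/SparkContext are thread-safe, so per-show jobs can run concurrently
with ThreadPoolExecutor(max_workers=8) as pool:
    parts = list(pool.map(process_one_show, show_names))

result = reduce(DataFrame.union, parts)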
Solution 3: Dataset's filter and union operators
I would think twice before following this solution since it puts the burden on your shoulders, which I think could easily be sorted out by solution 1.
Given you've got one large 133-million-row parquet file, I'd first build the 192 per-show_name datasets using the filter operator (as you did to build show_rdd, whose name is misleading since it's a DataFrame, not an RDD) and then union them (again as you did).
See Dataset API.
Solution 4: Use Window Functions
That's something I think could work, but didn't check it out myself.
You could use window functions (see WindowSpec and Column's over operator).
Window functions would give you partitioning (windows), while over would somehow apply QuantileDiscretizer to a window/partition. That would however require "destructuring" QuantileDiscretizer into an Estimator to train a model and somehow fit the resulting model to the window again.
I think it's doable, but haven't done it myself. Sorry.
This is an older question; however, I'm answering it to help someone in the same situation in the future.
It can be achieved using a pandas UDF. Both the input and output of a grouped-map pandas UDF are pandas dataframes. We need to provide the schema of the output dataframe in the annotation, as shown in the code sample below, which achieves the required result.
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, IntegerType

output_schema = StructType(df.schema.fields + [StructField('Time_watched_bin', IntegerType(), True)])

@pandas_udf(output_schema, PandasUDFType.GROUPED_MAP)
def get_buckets(pdf):
    # pdf: pandas dataframe holding all rows for one show_name
    pdf['Time_watched_bin'] = pd.cut(pdf['total_time_watched'], 3, labels=False)
    return pdf

df = df.groupby('show_name').apply(get_buckets)
df will have a new column 'Time_watched_bin' with the bucket information.
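For what it's worth, on Spark 3.x the same grouped-map pattern is normally written with applyInPandas instead of a GROUPED_MAP pandas_udf; in that case get_buckets stays a plain Python function (no decorator) and the schema is passed explicitly:
df = df.groupby('show_name').applyInPandas(get_buckets, schema=output_schema)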

What is the performance impact of select statements on Spark DataFrames?

Using many select statements or expressions on Spark DataFrames, I wonder what their performance impact is on subsequent transformations once triggered by an action.
Given a dataframe df with 10 columns a to j.
What is the impact if I use as for column renaming on each column?
df.select( df("a").as("1"), ..., df("j").as("10"))
What if I select a subset (e.g. 5 columns)?
val df2 = df.select( df("a"), ..., df("e") )
How does Spark handle this projection? Is df still kept (as df2 is a projection), so df could serve as a kind of reference? Or is df2 created freshly and df discarded? (neglecting any persist here)
What is the impact of general Column expressions used in a select?
Are performance tests for the above cases available? Are performance measurements in general available somewhere? If not, what is the best way to measure the performance?