What is the performance impact of select statements on Spark DataFrames? - scala

When using many select statements or expressions on Spark DataFrames, I wonder what their performance impact is on subsequent transformations once an action is triggered.
Given a dataframe df with 10 columns a to j.
What is the impact if I use as for column renaming on each column?
df.select( df("a").as("1"), ..., df("j").as("10"))
What if I select a subset (e.g. 5 columns)?
val df2 = df.select( df("a"), ..., df("e") )
How does Spark handle this projection? Is df still kept (since df2 is a projection), so that df could serve as a kind of reference? Or is df2 instead created freshly and df discarded? (Neglecting any persist here.)
What is the impact of general Column expressions used in a select?
Are performance tests for the above cases available? Are performance measurements in general available somewhere? If not, how is performance best measured?
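For reference, one way to at least inspect how Spark treats such selects is to look at the query plans and to time individual actions; a minimal sketch (column names a..j as above, only a few columns shown):
val renamed = df.select(df("a").as("1"), df("b").as("2"))  // remaining columns analogous
renamed.explain(true)     // prints the parsed, analyzed, optimized and physical plans
val df2 = df.select(df("a"), df("b"), df("c"), df("d"), df("e"))
df2.explain(true)         // shows how the projection is handled in the physical plan
spark.time(df2.count())   // crude timing of a single action via SparkSession.time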

Related

drop all df2.columns from another df (pyspark.sql.dataframe.DataFrame specific)

I have a large DF (pyspark.sql.dataframe.DataFrame) that is the result of multiple joins, plus new columns created by using a combination of inputs from different DFs, including DF2.
I want to drop all DF2 columns from DF after I'm done with the join/creating new columns based on DF2 input.
drop() doesn't accept a list - only a string or a Column.
I know that df.drop("col1", "col2", "coln") will work but I'd prefer not to crowd the code (if I can) by listing those 20 columns.
Is there a better way of doing this in pyspark dataframe specifically?
drop_cols = df2.columns
df = df.drop(*drop_cols)

Is df.schema an action or a transformation?

I have a schema, say myschema, that I created manually for building a dataframe.
Now my dataframe, say df, is created.
Now I did some operations on df and some of the columns were dropped.
Say the original myschema consists of 500 columns.
After dropping some columns, my df consists of 450 columns.
Now somewhere in my code I need the schema back, but only the schema after the dataframe has had those operations applied (i.e. having 450 columns).
Now:
Q1. How efficient is calling df.schema and using it? Is it an action or a transformation?
Q2. Should I create another myschema2 by filtering out of myschema those columns which will be dropped, and use that?
Quick answers:
to Q1: schema is neither an action nor a transformation, in the sense that it doesn't modify the data frame and doesn't trigger any computation.
to Q2:
If I understand correctly, you have something like this:
val myschema = StructType(someSchema)
val df = spark.createDataFrame(someData, myschema)
// do some transformation (drop, add columns etc)
val df2 = df.drop("column1", "column2").withColumn("new", $"c1" + $"c2")
and you want to get the schema of df2. If so, you can just use
val myschema2 = df2.schema
Long Answer:
Informally speaking, a DataFrame is an abstraction over distributed datasets and, as you already pointed out, there are transformations and actions defined on them.
When you make some transformation on a data frame, what happens under the hood is that Spark just builds a Directed Acyclic Graph (DAG) describing those transformations.
That DAG is analyzed and used to build an execution plan to get the work done.
Actions, on the other hand, trigger the execution of the plan, which is what actually transforms the data.
The schema of a transformed data frame is derived from the schema of the initial data frame, basically by walking along the DAG. The impact of such derivations is negligible: it doesn't depend on the size of the data, only on how big the DAG is, and in all practical cases you can ignore the time required to get the schema. The schema is just metadata attached to a dataframe.
So to respond to Q2: no, you should not maintain a myschema2 keeping track of the modifications yourself. Just call df2.schema and Spark will do that for you.
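For example, a minimal sketch reusing df2 from the snippet above:
val myschema2 = df2.schema                    // no Spark job is triggered; this is just a metadata lookup
println(myschema2.fieldNames.mkString(", "))  // column names after the drop/withColumn
println(myschema2.fields.length)              // number of columns, without touching the data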
hope this clears your doubts

Applying transformations with filter or map - which one is faster? Scala Spark

I am trying to do some transformations on a dataset with Spark using Scala. I am currently using Spark SQL but want to shift the code to the native Scala API. I want to know whether to use filter or map for operations like matching the values in a column and getting a single column, after the transformation, into a different dataset.
SELECT * FROM TABLE WHERE COLUMN = ''
I used to write something like this earlier in Spark SQL. Can someone tell me an alternative way to write the same using map or filter on the dataset, and which one is faster when compared?
You can read the documentation on the Apache Spark website. This is the link to the API documentation: https://spark.apache.org/docs/2.3.1/api/scala/index.html#package
Here is a little example -
import spark.implicits._  // needed for toDF and for the Int encoder used by map
val df = sc.parallelize(Seq((1, "ABC"), (2, "DEF"), (3, "GHI"))).toDF("col1", "col2")
// filter keeps only the rows matching the predicate
val df1 = df.filter("col1 > 1")
df1.show()
// map turns each Row into a new value, here an Int, yielding a Dataset[Int]
val df2 = df1.map(x => x.getInt(0) + 3)
df2.show()
If I understand your question correctly, you need to rewrite your SQL query using the DataFrame API. Your query reads all columns from the table TABLE and filters the rows where COLUMN is empty. You can do this with a DataFrame in the following way:
spark.read.table("TABLE")
.where($"COLUMN".eqNullSafe(""))
.show(10)
Performance will be the same as for your SQL. Use the dataFrame.explain(true) method to understand what Spark will do.
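For example (a rough sketch, with TABLE and COLUMN as the placeholders from the question), you can compare the plans of the SQL version and the DataFrame version:
spark.sql("SELECT * FROM TABLE WHERE COLUMN = ''").explain(true)
spark.read.table("TABLE").where($"COLUMN".eqNullSafe("")).explain(true)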

Converting Pandas into Pyspark

So I'm trying to convert my python algorithm to Spark friendly code and I'm having trouble with this one:
indexer = recordlinkage.SortedNeighbourhoodIndex \
(left_on=column1, right_on=column2, window=41)
pairs = indexer.index(df_1,df_2)
It basically compares one column against the other and generates index pairs for those likely to be the same (Record Matching).
My code:
df1 = spark.read.load(*.csv)
df2 = spark.read.load(*.csv)
func_udf = udf(index.indexer) ????
df = df.withColumn('column1',func_udf(df1.column1,df2.column2)) ???
I've been using udf for transformations involving just one dataframe and one column, but how do I run a function that requires two arguments, one column from one dataframe and the other from another dataframe? I can't join both dataframes as they have different lengths.
That's not how udfs work. UserDefinedFunctions can operate only on data that comes from a single DataFrame:
Standard udf on data from a single row.
pandas_udf on data from a single partition or single group.
I can't join both dataframes as they have different lengths.
A join is exactly what you should do (standard or manual broadcast). There is no need for the objects to be of the same length - a Spark join is a relational join, not a row-wise merge.
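A minimal sketch of the join-based approach (shown in Scala for illustration; df1, df2, column1 and column2 stand for the dataframes and columns from the question):
import org.apache.spark.sql.functions.broadcast
// relational join on the compared columns; the two dataframes may have different row counts
val pairs = df1.join(df2, df1("column1") === df2("column2"))
// or hint Spark to broadcast the smaller side explicitly
val pairsBroadcast = df1.join(broadcast(df2), df1("column1") === df2("column2"))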
For similarity joins you can use built-in approx join tools:
Efficient string matching in Apache Spark
Optimize Spark job that has to calculate each to each entry similarity and output top N similar items for each

Spark DataFrame row count is inconsistent between runs

When I run my Spark job (version 2.1.1) on EMR, each run counts a different number of rows on a dataframe. I first read data from S3 into 4 different dataframes; these counts are always consistent. Then, after joining the dataframes, the result of the join has a different count on each run. Afterwards I also filter the result, and that also has a different count on each run. The variations are small, a 1-5 row difference, but it's still something I would like to understand.
This is the code for the join:
val impJoinKey = Seq("iid", "globalVisitorKey", "date")
val impressionsJoined: DataFrame = impressionDsNoDuplicates
.join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
.join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
.join(chartSiteInstance, impJoinKey, "left")
.withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
.withColumn("url", coalesce($"realUrl", $"url"))
and this is for the filter:
val impressionsJoined: Dataset[ImpressionJoined] = impressionsJoinedFullDay.where($"timestamp".geq(new Timestamp(start.getMillis))).cache()
I have also tried using the filter method instead of where, but with the same results.
Any thoughts?
Thanks
Nir
Is it possible that one of the data sources changes over time?
Since impressionsJoined is not cached, Spark will re-evaluate it from scratch on every action, and that includes reading the data again from the source.
Try caching impressionsJoined after the join.
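A minimal sketch, reusing the names from the question:
// cache right after the join, then force materialization with an action
impressionsJoined.cache()
impressionsJoined.count()  // the join is evaluated once and kept in memory/disk
// subsequent actions (the where/filter, further counts) reuse the cached result
// instead of re-reading and re-joining the sources from S3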