When I clean big data with pandas, I have two methods: one is to use pandas_udf (from PySpark 2.3+) to clean the data; the other is to convert the Spark DataFrame (sdf) to a pandas DataFrame (pdf) with toPandas() and then clean it with pandas.
I'm confused: how do these two methods differ?
I hope someone can explain the differences in terms of distribution, speed, and other aspects.
TL;DR: pandas_udf and toPandas are very different.
pandas_udf
Creates a vectorized user-defined function (UDF) that leverages pandas' vectorization and serves as a faster alternative to a plain udf, and it works on the distributed dataset. To learn more about pandas_udf performance, you can read the pandas_udf vs. udf performance benchmark here.
toPandas, on the other hand, collects the distributed Spark DataFrame into a pandas DataFrame; a pandas DataFrame is localized and resides in the driver's memory, so:
this method should only be used if the resulting Pandas's DataFrame is expected to be small, as all the data is loaded into the driver's memory.
So if your data is large, you can't use toPandas; pandas_udf, udf, or other built-in methods would be your only options.
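As a rough illustration, here is a minimal sketch of both approaches (the column name, sample data, and the doubling "clean" step are made-up placeholders; the decorator-with-type-hints form requires Spark 3+):

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
import pandas as pd

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1.0,), (2.0,), (None,)], ["value"])

# Distributed: runs on the executors, one pandas Series per batch,
# so the full dataset never has to fit on the driver.
@pandas_udf("double")
def clean(v: pd.Series) -> pd.Series:
    return v.fillna(0) * 2

sdf.withColumn("cleaned", clean("value")).show()

# Local: toPandas() first pulls every row into the driver's memory.
pdf = sdf.toPandas()
pdf["cleaned"] = pdf["value"].fillna(0) * 2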
I am trying to acquaint myself with RAPIDS Accelerator-based computation using Spark (3.3) with Scala. The primary obstacle to using the GPU appears to be the black-box nature of UDFs. An automatic solution would be the Scala UDF compiler, but it won't work in cases that contain loops.
Doubt: would I be able to get a GPU contribution if my dataframe has only one column and produces another column, since this is a trivial case? If so, at least in some cases, the GPU performance benefit could be attained even with no change to the Spark code, and even where the size of the data is much larger than GPU memory. This would be great, because sometimes it would be easy to simply merge all columns into one, making a single column of WrappedArray using concat_ws, which a UDF can simply convert into an Array. For all practical purposes the data would then already be in columnar fashion for the GPU, and only negligible row (on CPU) to column (on GPU) conversion overhead would be needed. The case I am referring to would look like:
val newDf = df.withColumn("colB", opaqueUdf(col("colA")))
Resources: I tried to find good sources/examples for learning the Spark-based approach to using RAPIDS, but it seems to me that only Python-based examples are given. Is there any resource/tutorial that gives sample examples of converting Spark UDFs to make them RAPIDS-compatible?
Yes @Quiescent, you are right. The Scala UDF -> Catalyst compiler can be used for simple UDFs that have a direct translation to Catalyst. Supported operations can be found here: https://nvidia.github.io/spark-rapids/docs/additional-functionality/udf-to-catalyst-expressions.html. Loops are definitely not supported in this automatic translation, because there is no direct expression we can translate them to.
It all depends on how heavy opaqueUdf is and how many rows are in your column. The GPU is going to be really good if there are many rows and the operation in the UDF is costly (say it performs many arithmetic or string operations on that column in succession). I am not sure why you want to "merge all columns into one", so can you clarify why you want to do that? On the conversion to Array: is that the purpose of the UDF, or do you want to take in N columns -> perform some operation (likely involving loops) -> produce an Array?
Another approach to accelerating UDFs with GPUs is to use our RAPIDS Accelerated UDFs. These are Java or Scala UDFs that you implement purposely, and they use the cuDF API directly. The Accelerated UDF documentation also links to our spark-rapids-examples repo, which has information on how to write Java or Scala UDFs in this way; please take a look there as well.
df = df_full[df_full.part_col.isin(['part_a', 'part_b'])]
df = df[df.some_other_col == 'some_value']
#df has shape of roughly 240k,200
#df_full has shape of roughly 30m, 200
df.to_pandas().reset_index().to_csv('testyyy.csv',index=False)
If I do any groupby operation it is amazingly fast. The issue arises, however, when I try to export a small subset of this large dataset to CSV. I am eventually able to export the dataframe to CSV, but it takes far too long.
Warnings:
2022-05-08 13:01:15,948 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
df[column_name] = series
Note: part_a and part_b are stored as two separate partitioned parquet files. Also, I am using pyspark.pandas on Spark 3+.
So the question is: what is happening? And what is the most efficient way to export the filtered dataframe to CSV?
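One thing worth trying, sketched below with assumed paths (I can't verify this against your data), is to keep both the filter and the write inside Spark rather than converting to a local pandas frame, so the matching rows are written in parallel by the executors instead of being funneled through the driver:

import pyspark.pandas as ps

df_full = ps.read_parquet("/path/to/df_full")  # partitioned on part_col
df = df_full[df_full.part_col.isin(["part_a", "part_b"])]
df = df[df.some_other_col == "some_value"]

# Distributed write: produces a directory of CSV part files instead of
# collecting ~240k rows onto the driver via to_pandas().
df.to_csv("/path/to/testyyy_csv_dir")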
To load a large dataset into Polars efficiently one can use the lazy API and the scan_* functions. This works well when we are performing an aggregation (so we have a big input dataset but a small result). However, if I want to process a big dataset in its entirety (for example, change a value in each row of a column), it seems that there is no way around using collect and loading the whole (result) dataset into memory.
Is it instead possible to write a LazyFrame to disk directly, and have the processing operate on chunks of the dataset sequentially, in order to limit memory usage?
Edit (2023-01-08)
Polars has growing support for streaming/out-of-core processing.
To run a query in streaming mode, collect your LazyFrame with collect(streaming=True).
If the result does not fit into memory, try to sink it to disk with sink_parquet.
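For example (the input file and the doubling transformation are placeholders):

import polars as pl

lf = pl.scan_csv("big_input.csv").with_columns(
    (pl.col("value") * 2).alias("value")
)

# Streaming execution, but the result is still materialized in RAM.
df = lf.collect(streaming=True)

# Streams batches straight to disk; the full result never sits in memory.
lf.sink_parquet("big_output.parquet")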
Old answer (not true anymore).
Polars' algorithms are not streaming, so they need all data in memory for operations like join, groupby, aggregations, etc. So writing to disk directly would still keep those intermediate DataFrames in memory.
There are of course things you can do. Depending on the type of query, it may lend itself to embarrassingly parallel computation. A sum, for instance, can easily be computed in chunks (see the sketch below).
You could also process columns in smaller chunks. This allows you to still compute harder aggregations/computations.
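As a toy sketch of that chunked-sum idea (the file, column name, and chunk size are assumptions):

import polars as pl

lf = pl.scan_csv("big.csv")
chunk_size = 1_000_000
total = 0
offset = 0
while True:
    # Materialize one slice at a time instead of the whole frame.
    part = lf.slice(offset, chunk_size).collect()
    if part.height == 0:
        break
    total += part["value"].sum()
    offset += chunk_size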
Use lazy
If your query contains many filters and polars is able to apply them at the scan, your memory pressure is reduced to the selectivity ratio.
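For instance (file and column names assumed), a selective filter expressed against a scan means only the matching rows are ever materialized:

import polars as pl

result = (
    pl.scan_parquet("events.parquet")      # nothing is read yet
    .filter(pl.col("country") == "NL")     # pushed down to the scan
    .select(["user_id", "amount"])
    .collect()                             # only matching rows in memory
)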
I just encountered a case where Polars manages memory much better using the lazy API. When using the join function, I highly recommend using scan_csv/scan_parquet/scan_ipc if memory is an issue.
import polars as pl
# combine datasets
PATH_1 = "/.../big_dataset.feather"
PATH_2 = "/.../other_big_dataset.feather"
big_dataset_1 = pl.scan_ipc(PATH_1)
big_dataset_2 = pl.scan_ipc(PATH_2)
big_dataset_expanded = big_dataset_1.join(
    big_dataset_2, right_on="id_1", left_on="id_2", how="left"
)
big_dataset_expanded = big_dataset_expanded.collect()
Is it possible to pass a pyspark dataframe to an XGBClassifier as:
from xgboost import XGBClassifier
model1 = XGBClassifier()
model1.fit(df.select(features), df.select('label'))
If not, what is the best way to fit a pyspark dataframe to xgboost?
Many thanks
I believe there are two ways to skin this particular cat.
You can either:
Move your pyspark dataframe to pandas using the toPandas() method (or even better, using pyarrow). pandas dataframes will work just fine with xgboost. However, your data needs to fit in memory, so you might need to subsample if you're working with GBs or even TBs of data. (A minimal sketch follows after this list.)
Have a look at the xgboost4j and xgboost4j-spark packages. In the same way that pyspark is a wrapper using py4j, these can leverage SparkML built-ins, albeit typically for Scala Spark. For example, the XGBoostEstimator from these packages can be used as a stage in a SparkML Pipeline() object.
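A minimal sketch of the first option (here df is your pyspark dataframe and features a list of feature column names; both are assumptions about your setup):

from xgboost import XGBClassifier

# Collect the (suitably small or subsampled) Spark DataFrame to the driver.
pdf = df.select(features + ["label"]).toPandas()

model1 = XGBClassifier()
model1.fit(pdf[features], pdf["label"])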
Hope this helps.
I am trying to understand how PySpark uses pickle for RDDs and avoids it for Spark SQL and DataFrames. The basis of the question is slide #30 in this link. I am quoting it below for reference:
"[PySpark] RDDs are generally RDDs of pickled objects. Spark SQL (and DataFrames) avoid some of this".
How is pickle used in Spark SQL?
In the original Spark RDD model, RDDs described distributed collections of Java objects or pickled Python objects. However, SparkSQL "dataframes" (including Dataset) represent queries against one or more sources/parents.
To evaluate a query and produce some result, Spark does need to process records and fields, but these are represented internally in a binary, language-neutral format (called "encoded"). Spark can decode these formats to any supported language (e.g., Python, Scala, R) when needed, but will avoid doing so if it's not explicitly required.
For example: if I have a text file on disk, and I want to count the rows, and I use a call like:
spark.read.text("/path/to/file.txt").count()
there is no need for Spark to ever convert the bytes in the text to Python strings -- Spark just needs to count them.
Or, if we did a spark.read.text("...").show() from PySpark, then Spark would need to convert a few records to Python strings -- but only the ones required to satisfy the query, and show() implies a LIMIT so only a few records are evaluated and "decoded."
In summary, with the SQL/DataFrame/Dataset APIs, the language you use to manipulate the query (Python/R/SQL/...) is just a "front-end" control language; it is not the language in which the actual computation is performed, nor does it require converting the original data sources to the language you are using. This approach enables higher performance across all language front ends.
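To make the contrast concrete, here is a small sketch (the file path is a placeholder) of the same count through both APIs:

# RDD API: each line surfaces as a Python string in the Python workers,
# pickled whenever it moves between stages.
rdd_count = spark.sparkContext.textFile("/path/to/file.txt").count()

# DataFrame API: rows stay in Spark's internal binary format end to end;
# no Python objects are created just to count them.
df_count = spark.read.text("/path/to/file.txt").count()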