Nowadays, data often comes with a large number of features. To get a quick summary, people load the data into a data frame and display it with the head() method. It's pretty common to run experiments in Jupyter Notebooks (with Toree for Scala).
Spark (Scala) is good at handling large amounts of data, but its head() method doesn't show column headers in a horizontally scrollable notebook.
Pandas Dataframe head
Spark Scala Dataframe head
I know you can get the column headers of a Scala DataFrame with .columns, but printing them doesn't line the headers up with the data columns, which makes the output hard to read.
Instead of df.head(20), try df.show(20, false) in Scala (the PySpark equivalent is df.show(n=20, truncate=False)). See the documentation of show for details.
I want to train a regression prediction model with Azure Databricks AutoML using the GUI. The training data is very wide. All of the columns except for the response variable will be used as features.
To use the Databricks AutoML GUI I have to store the data as a table in the Hive metastore. I have a large DataFrame df with more than 40,000 columns.
print((df.count(), len(df.columns)))
(33030, 45502)
This data is written to a table in Hive using the following PySpark command (I believe this is standard):
df.write.mode('overwrite').saveAsTable("WIDE_TABLE")
Unfortunately, this job does not finish within an 'acceptable' time (10 hours). I cancelled it, so I don't have an error message.
When I reduce the number of columns with
df.select(df.columns[:500]).write.mode('overwrite').saveAsTable("WIDE_TABLE")
it fares better and finishes in 9.87 minutes, so the method should work.
Can this be solved:
With a better compute instance?
With a better script?
Not at all and if so, is there another approach?
[EDIT to address questions in comments]
Runtime and driver summary:
2-16 Workers 112-896 GB Memory 32-256 Cores (Standard_DS5_v2)
1 Driver 56 GB Memory, 16 Cores (Same as worker)
Runtime: 10.4.x-scala2.12
To give an impression of the timings I've added a table below.
columns    time (mins)
10         1.94
100        1.92
200        3.04
500        9.87
1000       25.91
5000       938.4
Data type of the remaining columns is Integer.
As far as I know I'm writing the table on the same environment that I am working on. Data flow: Azure Blob CSV -> Data read and wrangling -> PySpark DataFrame -> Hive Table. Last three steps are on the same cloud machine.
Hope this helps!
I think your case is not related to either Spark resource configuration or the network connection; it's related to Spark's design itself.
Long story short, Spark is designed for long and narrow data, which is exactly the opposite of your DataFrame. In your experiment, the elapsed time grows much faster than linearly as the number of columns increases: going from 1,000 to 5,000 columns (5x) took about 36x as long. Although it's about reading a CSV rather than writing a table, you can check this post for a good explanation of why Spark is not good at handling wide DataFrames: Spark csv reading speed is very slow although I increased the number of nodes
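The growth in the question's timing table can be checked with a few lines of arithmetic. The sketch below assumes a power-law model, time ~ columns^k, between each pair of consecutive measurements; the function name is made up for illustration, and six data points are far too few to fit anything rigorous.

```python
import math

# (columns, minutes) pairs copied from the timing table in the question.
TIMINGS = [(10, 1.94), (100, 1.92), (200, 3.04),
           (500, 9.87), (1000, 25.91), (5000, 938.4)]


def scaling_exponents(timings):
    """Return (cols_from, cols_to, k) per consecutive pair, where time ~ columns**k.

    k close to 1 means linear growth, k close to 2 means quadratic growth.
    """
    result = []
    for (c0, t0), (c1, t1) in zip(timings, timings[1:]):
        k = math.log(t1 / t0) / math.log(c1 / c0)
        result.append((c0, c1, k))
    return result


for c0, c1, k in scaling_exponents(TIMINGS):
    print(f"{c0:>5} -> {c1:>5} columns: time ~ columns^{k:.2f}")
```

For the last interval (1,000 to 5,000 columns) the exponent comes out slightly above 2, so over the measured range the cost looks closer to quadratic than linear in the number of columns.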
Although I haven't used Azure AutoML before, based on your dataset and your goal, I think you can try:
Use a Python pandas DataFrame and a Hive connection library to see whether performance improves
Concatenate all your columns into a single Array / Vector before you write to Hive
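The second suggestion could look roughly like the sketch below. It rests on several assumptions: the helper names, the label column name, and the output column name `features` are hypothetical; every feature column is assumed to share one type (the question says Integer); and whether the AutoML GUI accepts an array column has not been verified here.

```python
def feature_columns(all_columns, label_col):
    """Every column except the label, in the original order."""
    return [c for c in all_columns if c != label_col]


def pack_features(df, label_col, packed_col="features"):
    """Collapse all feature columns of a Spark DataFrame into one array column.

    pyspark is imported lazily so that feature_columns() stays usable
    (and testable) on a machine without Spark installed.
    """
    from pyspark.sql import functions as F

    cols = feature_columns(df.columns, label_col)
    packed = F.array(*[F.col(c) for c in cols]).alias(packed_col)
    # The result has only two columns: the label and the packed array.
    return df.select(F.col(label_col), packed)


# Hypothetical usage; "y" and "WIDE_TABLE_PACKED" are placeholder names:
# pack_features(df, "y").write.mode("overwrite").saveAsTable("WIDE_TABLE_PACKED")
```

The array loses the original column names, so the list returned by feature_columns(df.columns, label_col) would need to be stored somewhere if the positions must be mapped back to names later.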
How can I implement the scikit-learn QuantileTransformer in PySpark? Due to the size of my data set (~68 million rows with 100+ columns), I am forced to attempt this in PySpark rather than converting it to pandas. I am on PySpark 2.4.
I've seen PySpark has scalers such as StandardScaler, MinMaxScaler, etc. But I would like to use an equivalent of QuantileTransformer. Is there anything off the shelf that exists for this purpose?
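No off-the-shelf answer is given here, but one possible approach is to estimate quantile boundaries with DataFrame.approxQuantile and then map each value onto [0, 1] by linear interpolation, which is roughly what QuantileTransformer does with a uniform output distribution. The following is a minimal sketch, not a drop-in equivalent: all function names are hypothetical, ties are not averaged the way scikit-learn does it, and a plain Python UDF is used for simplicity rather than speed.

```python
from bisect import bisect_right


def quantile_map(value, quantiles):
    """Map value to [0, 1] by interpolating between sorted quantile boundaries.

    quantiles[i] is the estimated value at probability i / (len(quantiles) - 1).
    """
    if value <= quantiles[0]:
        return 0.0
    if value >= quantiles[-1]:
        return 1.0
    n = len(quantiles) - 1
    i = bisect_right(quantiles, value) - 1  # quantiles[i] <= value < quantiles[i + 1]
    lo, hi = quantiles[i], quantiles[i + 1]
    return (i + (value - lo) / (hi - lo)) / n


def fit_quantiles(df, column, n_quantiles=100, relative_error=0.001):
    """Estimate evenly spaced quantiles of one numeric column on the cluster."""
    probabilities = [i / (n_quantiles - 1) for i in range(n_quantiles)]
    return df.approxQuantile(column, probabilities, relative_error)


def transform_column(df, column, quantiles, out_column):
    """Add out_column holding the quantile-mapped values of column."""
    from pyspark.sql import functions as F  # lazy import: only needed on Spark
    from pyspark.sql.types import DoubleType

    mapper = F.udf(
        lambda v: None if v is None else quantile_map(float(v), quantiles),
        DoubleType(),
    )
    return df.withColumn(out_column, mapper(F.col(column)))
```

Fitting once on the training data and reusing the same quantile list on new data keeps the mapping consistent between the two.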
Is it possible to pass a PySpark DataFrame to an XGBClassifier like this:
from xgboost import XGBClassifier
model1 = XGBClassifier()
model1.fit(df.select(features), df.select('label'))
If not, what is the best way to fit a pyspark dataframe to xgboost?
Many thanks
I believe there are two ways to skin this particular cat.
You can either:
Move your PySpark DataFrame to pandas using the toPandas() method (or even better, using pyarrow). pandas DataFrames will work just fine with xgboost. However, your data needs to fit in memory, so you might need to subsample if you're working with TB or even GB of data.
Have a look at the xgboost4j and xgboost4j-spark packages. In the same way that pyspark is a wrapper using py4j, these can leverage SparkML built-ins, albeit typically for Scala Spark. For example, the XGBoostEstimator from these packages can be used as a stage in a SparkML Pipeline() object.
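The first option, including the subsampling step, might be sketched as follows. The function names, the default row cap, and the seed are assumptions made for illustration, `features` is assumed to be a list of column names, and the label column is assumed to already hold integer class indices.

```python
def sample_fraction(total_rows, max_rows):
    """Fraction of rows to keep so that roughly max_rows remain (never above 1.0)."""
    if total_rows <= 0:
        return 1.0
    return min(1.0, max_rows / total_rows)


def fit_on_driver(df, features, label, max_rows=1_000_000, seed=42):
    """Subsample a Spark DataFrame, pull it to the driver, and fit XGBClassifier.

    Arrow-based conversion is controlled by the Spark setting
    spark.sql.execution.arrow.pyspark.enabled in Spark 3.x
    (spark.sql.execution.arrow.enabled in Spark 2.3/2.4).
    """
    from xgboost import XGBClassifier  # lazy import: only needed when fitting

    fraction = sample_fraction(df.count(), max_rows)
    sampled = df if fraction >= 1.0 else df.sample(False, fraction, seed)
    # toPandas() loads everything into driver memory, hence the cap above.
    pdf = sampled.select(features + [label]).toPandas()
    model = XGBClassifier()
    model.fit(pdf[features], pdf[label])
    return model
```

Spark's sample() is approximate, so the number of rows that reach the driver will be close to max_rows rather than exactly equal to it.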
Hope this helps.
When I clean big data with pandas, I have two options: one is to use @pandas_udf from PySpark 2.3+ to clean the data; the other is to convert the Spark DataFrame (sdf) to a pandas DataFrame (pdf) with toPandas() and then clean it with pandas.
I'm confused about how these methods differ.
I hope someone can explain the difference in terms of distribution, speed, and other aspects.
TL;DR: @pandas_udf and toPandas are very different.
@pandas_udf
Creates a vectorized user defined function (UDF).
It leverages the vectorization feature of pandas, serves as a faster alternative to udf, and works on a distributed dataset. To learn more about pandas_udf performance, you can read the pandas_udf vs udf performance benchmark here.
toPandas, by contrast, collects the distributed Spark DataFrame into a pandas DataFrame, which is local and resides in the driver's memory, so:
this method should only be used if the resulting Pandas’s DataFrame is expected to be small, as all the data is loaded into the driver’s memory.
So if your data is large, you can't use toPandas; @pandas_udf, udf, or other built-in methods would be your only option.
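To make the contrast concrete, here is a minimal sketch that applies the same cleaning rule both ways. The cleaning rule (trim and lowercase) and all function names are made up for illustration, and the decorator uses the Spark 3.x type-hint style; on 2.3/2.4 the UDF type would instead be passed explicitly as PandasUDFType.SCALAR.

```python
def normalize(value):
    """Trim and lowercase one value; anything that is not a string becomes None."""
    if not isinstance(value, str):
        return None
    return value.strip().lower()


def clean_distributed(df, column):
    """Clean one column on the executors; the data never leaves the cluster."""
    import pandas as pd  # lazy imports: only needed when Spark is available
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("string")
    def clean(values: pd.Series) -> pd.Series:
        # Called once per batch of rows, with the batch as a pandas Series.
        return values.map(normalize)

    return df.withColumn(column, clean(df[column]))


def clean_on_driver(df, column):
    """Collect the whole DataFrame to the driver, then clean it with pandas."""
    pdf = df.toPandas()  # everything must fit in the driver's memory
    pdf[column] = pdf[column].map(normalize)
    return pdf
```

clean_distributed returns a Spark DataFrame that can keep flowing through the cluster, while clean_on_driver returns a local pandas DataFrame.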
I have a data set as a CSV file. It has around 50 columns, most of which are categorical. I am planning to run a RandomForest multi-class classification with a new test data set.
The pain point is handling the categorical variables. What would be the best way to handle them? I read the Pipeline guide on the Spark website, http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline, which creates a DataFrame from a hard-coded sequence, with the features given as a space-delimited string. This looks very specific, and I want to do what they do with HashingTF for the features, but using the CSV file I have.
In short, I want to achieve the same thing as in the link, but using a CSV file.
Any suggestions?
EDIT:
Data -> 50 features, 100k rows, most of it alphanumeric categorical
I am pretty new to MLlib and am struggling to find the proper pipeline for my CSV data. I tried creating a DataFrame from the file, but I am confused about how I should encode the categorical columns. My doubts are as follows:
1. The example in the link above tokenizes the data and uses it, but I have a DataFrame.
2. Even if I use a StringIndexer, should I write an indexer for every column? Shouldn't there be one method that accepts multiple columns?
3. How will I get the label back from the StringIndexer to show the prediction?
4. For new test data, how will I keep the encoding consistent for every column?
I would suggest having a look at the feature transformers http://spark.apache.org/docs/ml-features.html and in particular the StringIndexer and VectorAssembler.
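A minimal sketch of how those transformers could be combined for a CSV-backed DataFrame is shown below. The helper names and the `_idx` suffix are hypothetical, the label and the categorical columns are assumed to be read as strings, and all other columns are assumed to be numeric.

```python
def split_columns(dtypes, label_col):
    """Split (name, type) pairs, as returned by df.dtypes, into two name lists."""
    categorical = [n for n, t in dtypes if t == "string" and n != label_col]
    numeric = [n for n, t in dtypes if t != "string" and n != label_col]
    return categorical, numeric


def build_pipeline(train_df, label_col, max_bins=32):
    """Assemble indexers, a vector assembler, and a random forest into a Pipeline."""
    from pyspark.ml import Pipeline  # lazy imports: only needed on Spark
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import IndexToString, StringIndexer, VectorAssembler

    categorical, numeric = split_columns(train_df.dtypes, label_col)
    # One StringIndexer per categorical column, built in a loop.
    indexers = [
        StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
        for c in categorical
    ]
    # Fit the label indexer up front so its labels can be reused to decode predictions.
    label_indexer = StringIndexer(inputCol=label_col, outputCol="label_idx").fit(train_df)
    assembler = VectorAssembler(
        inputCols=[c + "_idx" for c in categorical] + numeric,
        outputCol="features",
    )
    # max_bins must be at least the largest number of categories in any column.
    forest = RandomForestClassifier(
        labelCol="label_idx", featuresCol="features", maxBins=max_bins
    )
    decoder = IndexToString(
        inputCol="prediction", outputCol="predicted_label", labels=label_indexer.labels
    )
    return Pipeline(stages=indexers + [label_indexer, assembler, forest, decoder])
```

Calling fit on the training DataFrame (for example, one loaded with spark.read.csv(path, header=True, inferSchema=True)) returns a PipelineModel, and using that same fitted model to transform the test data keeps the encoding of every column consistent. If the test data has no label column, the fitted label indexer stage would need to be left out at prediction time.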