AttributeError: 'GroupedData' object has no attribute 'applyInPandas' in Azure Databricks - pyspark

I got stuck on this error while running some jobs on a DBR 6.6 ML cluster on Azure Databricks.
I've gone through the Spark Grouped Map documentation and everything seems fine on my end.

The issue was easily resolved by upgrading the cluster to DBR 7.0 ML: applyInPandas was only added to GroupedData in Spark 3.0, and DBR 6.6 ML still runs Spark 2.4.
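For anyone hitting the same thing, here is a minimal sketch of the grouped-map call that works on DBR 7.0 ML and above; the example data, column names, and the subtract_mean function are made up for illustration, not taken from the question:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("a", 1.0), ("a", 2.0), ("b", 3.0)], ["key", "value"])

def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # Center each group's values around the group mean.
    return pdf.assign(value=pdf.value - pdf.value.mean())

# applyInPandas exists on GroupedData only from Spark 3.0 (DBR 7.0) onward; on
# Spark 2.4 (DBR 6.x) the equivalent is a GROUPED_MAP pandas_udf used with
# df.groupBy(...).apply(...).
result = df.groupBy("key").applyInPandas(subtract_mean, schema="key string, value double")
result.show()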

Related

How to integrate Eclipse IDE with Databricks Cluster

I am trying to integrate my Scala Eclipse IDE with my Azure Databricks cluster so that I can run my Spark programs on the cluster directly from Eclipse.
I followed the official Databricks Connect documentation (https://docs.databricks.com/dev-tools/databricks-connect.html).
I have:
Installed Anaconda.
Installed Python 3.7 and the Databricks Connect library 6.0.1.
Completed the Databricks Connect configuration (the CLI part).
Added the client libraries to the Eclipse IDE.
Set the SPARK_HOME environment variable to the path returned by running 'databricks-connect get-jar-dir' in Anaconda.
I have not set any other environment variables apart from the one mentioned above.
I need help figuring out what else is required to complete this integration, for example how the connection-related environment variables work when running through the IDE.
If someone has already done this successfully, please guide me.
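Not a full answer, but for reference: as far as I know, once 'databricks-connect configure' has been run, the connection details (workspace URL, token, cluster ID, org ID, port) are read from the ~/.databricks-connect file, so no extra environment variables are strictly required; with SPARK_HOME pointing at the jar dir and the client jars on the IDE classpath, a plain SparkSession gets routed to the remote cluster. A minimal connectivity check, shown in Python for illustration (the Scala flow from Eclipse is analogous: just build a normal SparkSession):

from pyspark.sql import SparkSession

# Assumes 'databricks-connect configure' has already been run, so host, token,
# cluster ID, org ID and port are picked up from ~/.databricks-connect.
spark = SparkSession.builder.getOrCreate()

# If this prints 100, the session is executing on the remote Databricks cluster.
print(spark.range(100).count())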

XGBoost on Databricks - outdated Scala version

I am trying to follow along with the XGBoost example on Databricks found here.
Everything seems to work fine until I get to the actual training part:
val xgboostModelRDD = XGBoost.trainWithRDD(trainRDD, ...)
At this point I get an error. Since the stacktrace is rather short I'll paste it here:
java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.overrideParamsAccordingToTaskCPUs(XGBoost.scala:232)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainWithRDD(XGBoost.scala:293)
After doing some research, it appears that the reason for the error is an incompatible Scala version. The Databricks Community Edition cluster comes preconfigured with Scala 2.10, and this cannot be modified.
Does that mean it is impossible to run XGBoost on the Community Edition, or is there a way to resolve this issue?
I think the forum post you linked to is slightly outdated. Databricks Community Edition actually does allow you to choose the cluster's Scala version.
First, navigate to the clusters page and click the blue "Create Cluster" button.
From the "Databricks Runtime Version" dropdown menu, pick a runtime version that contains your desired Scala and Spark versions.

Spark setup on Windows

I am trying to setup Spark on my Windows 10 PC. After executing the spark-shell command, I got the following error:
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect
Installing Spark on a Windows machine is not very difficult; we just need to take care of some permissions and configuration during the installation. Please follow the link below for a step-by-step Spark and Scala installation and configuration on a Windows machine.
Apache Spark Installation on windows10
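In my experience the HiveSessionState failure on Windows usually comes down to a missing winutils.exe or an unwritable C:\tmp\hive scratch directory. Here is a rough sanity check you can run before retrying spark-shell; the paths below are the usual defaults, not something taken from the question:

import os
from pathlib import Path

spark_home = os.environ.get("SPARK_HOME")
hadoop_home = os.environ.get("HADOOP_HOME")
print("SPARK_HOME :", spark_home)
print("HADOOP_HOME:", hadoop_home)

# Spark's Hive support on Windows needs winutils.exe under %HADOOP_HOME%\bin.
if hadoop_home:
    print("winutils.exe found:", (Path(hadoop_home) / "bin" / "winutils.exe").exists())

# If this directory exists but Hive can't write to it, the usual fix is:
#   %HADOOP_HOME%\bin\winutils.exe chmod 777 C:\tmp\hive
print("C:\\tmp\\hive exists:", Path(r"C:\tmp\hive").exists())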

Spark Job Server for Spark 1.6.0

Is there any specific Spark Job Server version matching with Spark 1.6.0 ?
As per the version information in https://github.com/spark-jobserver/spark-jobserver, I see SJS is available only for 1.6.1, not for 1.6.0.
Our Cloudera-hosted Spark is running 1.6.0.
I deployed SJS with its Spark home configured for 1.6.1. When I submit jobs, I can see job IDs being generated, but I can't see the job results.
Any inputs?
No, there is no SJS version tied to Spark 1.6.0, but it should be easy for you to compile against 1.6.0. Maybe you could modify https://github.com/spark-jobserver/spark-jobserver/blob/master/project/Versions.scala#L10 and try.

Not able to find ./bin/spark-class for launching a Spark cluster in standalone mode

While launching a standalone cluster for Spark Streaming, I am not able to find the ./bin/spark-class command.
Please let me know if I need to do any additional configuration to get "spark-class".
Which version of Spark are you using? Starting with Spark 0.9.0, spark-class is located in the bin folder, but in earlier versions it was at the root of SPARK_HOME.
Perhaps you're following instructions for Spark 0.9.0+ even though you've installed an earlier version of Spark? You can find documentation for older releases of Spark on the Spark documentation overview page.
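A quick way to see which layout your install has (a trivial sketch; SPARK_HOME is assumed to point at your Spark installation):

import os
from pathlib import Path

spark_home = Path(os.environ.get("SPARK_HOME", "."))

# Spark 0.9.0+ puts the script under bin/; older releases kept it at the top level.
for candidate in (spark_home / "bin" / "spark-class", spark_home / "spark-class"):
    print(candidate, "exists:", candidate.exists())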