Kedro on Databricks: Cannot import SparkDataset - pyspark

Cannot import SparkDataset in Databricks using:
from kedro.extras.datasets.spark import SparkDataSet

Have you done pip install "kedro[spark.SparkDataSet]"?
A new Kedro project needs the connector dependencies installed before the datasets can be used.
Also, dataset types are case sensitive, so make sure your catalog entry says SparkDataSet and not Sparkdataset, etc.
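If the install succeeded, a quick sanity check is to use the dataset directly in a notebook cell. This is only a sketch, assuming Spark is available on the cluster; the DBFS path is hypothetical:
from kedro.extras.datasets.spark import SparkDataSet

# Hypothetical parquet file on DBFS; replace with a path from your project
dataset = SparkDataSet(filepath="dbfs:/tmp/example.parquet", file_format="parquet")
df = dataset.load()   # returns a pyspark.sql.DataFrame
df.show(5)
In the catalog, the corresponding type would be spark.SparkDataSet (or the fully qualified kedro.extras.datasets.spark.SparkDataSet, depending on your Kedro version), with the casing exactly as shown.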

Related

NoSuchMethodError on com.fasterxml.jackson.dataformat.xml.XmlMapper.coercionConfigDefaults()

I'm parsing an XML string to convert it to a JsonNode in Scala, using an XmlMapper from the Jackson library. I code in a Databricks notebook, so compilation is done on a cloud cluster. When compiling my code I get the error java.lang.NoSuchMethodError: com.fasterxml.jackson.dataformat.xml.XmlMapper.coercionConfigDefaults()Lcom/fasterxml/jackson/databind/cfg/MutableCoercionConfig; followed by a hundred lines of "at com.databricks. ..."
Maybe I forgot to import something, but this looks fine to me (tell me if I'm wrong):
import ch.qos.logback.classic._
import com.typesafe.scalalogging._
import com.fasterxml.jackson._
import com.fasterxml.jackson.core._
import com.fasterxml.jackson.databind.{ObjectMapper, JsonNode}
import com.fasterxml.jackson.dataformat.xml._
import com.fasterxml.jackson.module.scala._
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import java.io._
import java.time.Instant
import java.util.concurrent.TimeUnit
import javax.xml.parsers._
import okhttp3.{Headers, OkHttpClient, Request, Response, RequestBody, FormBody}
import okhttp3.OkHttpClient.Builder._
import org.apache.spark._
import org.xml.sax._
As I'm using Databricks, there's no SBT file for dependencies. Instead, I installed the libraries I need directly on the cluster. Here are the ones I'm using:
com.squareup.okhttp:okhttp:2.7.5
com.squareup.okhttp3:okhttp:4.9.0
com.squareup.okhttp3:okhttp:3.14.9
org.scala-lang.modules:scala-swing_3:3.0.0
ch.qos.logback:logback-classic:1.2.6
com.typesafe:scalalogging-slf4j_2.10:1.1.0
cc.spray.json:spray-json_2.9.1:1.0.1
com.fasterxml.jackson.module:jackson-module-scala_3:2.13.0
javax.xml.parsers:jaxp-api:1.4.5
org.xml.sax:2.0.1
The code causing the error is simply the following (from https://www.baeldung.com/jackson-convert-xml-json, Chapter 5):
val xmlMapper: XmlMapper = new XmlMapper()
val jsonNode: JsonNode = xmlMapper.readTree(responseBody.getBytes())
with responseBody being a String containing an XML document (I previously checked the integrity of the XML). When I remove those two lines, the code works fine.
I've read tons of articles and forum posts but I can't figure out what's causing my issue. Can someone please help me? Thanks a lot! :)
Welcome to dependency hell and breaking changes in libraries.
This usually happens when various libraries bring in different versions of the same dependency. In this case it is Jackson.
java.lang.NoSuchMethodError: com.fasterxml.jackson.dataformat.xml.XmlMapper.coercionConfigDefaults()Lcom/fasterxml/jackson/databind/cfg/MutableCoercionConfig; means that one library probably requires a Jackson version that has this method, but the version on the classpath either does not have it yet, or the method was removed because it was deprecated or renamed.
In a case like this it is good to print the dependency tree and check which Jackson version each library requires, and, if possible, use newer versions of the required libraries.
Solution: use libraries that depend on compatible versions of Jackson. There is no other shortcut.
Upgrading to version 2.12.5 fixed my issue.
This issue may also appear when there are multiple versions of the Jackson JARs in the project lib directory; you should remove the older versions.

No module named 'gcp_sql_operator' in cloud composer

I am not able to use this import statement:
from airflow.contrib.operators.gcp_sql_operator import CloudSqlQueryOperator
I want to import this in my DAG file, which will run in Cloud Composer on Airflow version 1.10.0 (not 1.9.0). Just to check, I tried to import gcs_to_gcs:
from airflow.contrib.operators.gcs_to_gcs import GoogleCloudStorageToGoogleCloudStorageOperator
I am able to import this but not gcp_sql_operator.
The CloudSqlQueryOperator operator was released in Airflow 1.10.2, which is not yet supported in Composer versions. Support for 1.10.2 should be available in Composer soon; until then, you may want to manually copy gcp_sql_operator.py and its dependencies into the Composer dags folder, following the instructions here.
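Once the operator is importable (either on Airflow 1.10.2+ or via the copied module), a DAG using it would look roughly like the sketch below. This is only an illustration: the DAG name, connection id, and query are hypothetical, and if you copied gcp_sql_operator.py into the dags folder the import would point at that local module instead of airflow.contrib.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.gcp_sql_operator import CloudSqlQueryOperator

with DAG(
    dag_id="cloudsql_query_example",          # hypothetical DAG name
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
) as dag:
    run_query = CloudSqlQueryOperator(
        task_id="run_query",
        gcp_cloudsql_conn_id="my_cloudsql_conn",  # hypothetical Cloud SQL connection id
        sql="SELECT 1",                           # hypothetical query
    )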

Error when running pyspark

I tried to run PySpark from the terminal. In my terminal, I run snotebook and it automatically loads Jupyter. After that, when I select Python 3, this error comes up in the terminal.
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file
/Users/simon/spark-1.6.0-bin-hadoop2.6/python/pyspark/shell.py
Here's my .bash_profile setting:
export PATH="/Users/simon/anaconda/bin:$PATH"
export SPARK_HOME=~/spark-1.6.0-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_PYTHON=python3
alias snotebook='$SPARK_HOME/bin/pyspark'
Please let me know if you have any ideas, thanks.
You need to set the line below in your .bash_profile instead:
PYSPARK_DRIVER_PYTHON=ipython
or
PYSPARK_DRIVER_PYTHON=ipython3
Hope it helps.
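If you want to confirm which interpreter the driver and the executors actually pick up after changing these variables, a small check like the sketch below can help (assuming a working Spark install and a fresh shell so the new environment variables are loaded):
import sys
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
print("driver python:", sys.executable, sys.version)  # controlled by PYSPARK_DRIVER_PYTHON

def worker_python(_):
    import sys
    return sys.version  # controlled by PYSPARK_PYTHON on the executors

print("worker python:", sc.parallelize([0], 1).map(worker_python).first())
sc.stop()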
In my case, I was using a virtual environment and forgot to install Jupyter, so it was using some version that it found in the $PATH. Installing it inside the environment fixed this issue.
Spark now includes PySpark as part of the install, so remove the PySpark library unless you really need it.
Remove the old Spark and install the latest version.
Install the findspark library with pip.
In Jupyter, import and use findspark:
import findspark
findspark.init()
Quick PySpark / Python 3 Check
import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext()
print(sc)
sc.stop()

How to import libraries in Spark Notebook

I'm having trouble importing magellan-1.0.4-s_2.11 in Spark Notebook. I've downloaded the JAR from https://spark-packages.org/package/harsha2010/magellan and have tried placing SPARK_HOME/bin/spark-shell --packages harsha2010:magellan:1.0.4-s_2.11 in the Start of Customized Settings section of the spark-notebook file in the bin folder.
Here are my imports:
import magellan.{Point, Polygon, PolyLine}
import magellan.coord.NAD83
import org.apache.spark.sql.magellan.MagellanContext
import org.apache.spark.sql.magellan.dsl.expressions._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
And my errors...
<console>:71: error: object Point is not a member of package org.apache.spark.sql.magellan
import magellan.{Point, Polygon, PolyLine}
^
<console>:72: error: object coord is not a member of package org.apache.spark.sql.magellan
import magellan.coord.NAD83
^
<console>:73: error: object MagellanContext is not a member of package org.apache.spark.sql.magellan
import org.apache.spark.sql.magellan.MagellanContext
I then tried to import the new library like any other library by placing it into the main script like so:
"$lib_dir/magellan-1.0.4-s_2.11.jar"
This didn't work and I'm left scratching my head wondering what I've done wrong. How do I import libraries such as magellan into spark notebook?
Try evaluating something like
:dp "harsha2010" % "magellan" % "1.0.4-s_2.11"
It will load the library into Spark, allowing it to be imported - assuming it can be obtained through the Maven repo. In my case it failed with a message:
failed to load 'harsha2010:magellan:jar:1.0.4-s_2.11 (runtime)' from ["Maven2 local (file:/home/dev/.m2/repository/, releases+snapshots) without authentication", "maven-central (http://repo1.maven.org/maven2/, releases+snapshots) without authentication", "spark-packages (http://dl.bintray.com/spark-packages/maven/, releases+snapshots) without authentication", "oss-sonatype (https://oss.sonatype.org/content/repositories/releases/, releases+snapshots) without authentication"] into /tmp/spark-notebook/aether/b2c7d8c5-1f56-4460-ad39-24c4e93a9786
I think the file was too big and the connection was interrupted before the whole file could be downloaded.
Workaround
So I downloaded the JAR manually from:
http://dl.bintray.com/spark-packages/maven/harsha2010/magellan/1.0.4-s_2.11/
and copied it into the:
/tmp/spark-notebook/aether/b2c7d8c5-1f56-4460-ad39-24c4e93a9786/harsha2010/magellan/1.0.4-s_2.11
And then the :dp command worked. Try calling it first, and if it fails, copy the JAR into the right path to make things work.
Better solution
I should investigate why the download failed so I can fix it properly... or put that library in my local M2 repo. But that should get you going.
I would suggest checking these:
https://github.com/spark-notebook/spark-notebook/blob/master/docs/metadata.md#import-download-dependencies
and
https://github.com/spark-notebook/spark-notebook/blob/master/docs/metadata.md#add-spark-packages
I think the :dp magic command is deprecated; instead you should add your custom dependencies in the notebook metadata. Go to the menu Edit > Edit notebook metadata and add something like:
"customDeps": [
"harsha2010 % magellan % 1.0.4-s_2.11"
]
Once done, you will need to restart the kernel. You can check in the browser console whether the package is being downloaded properly.
The easy way: set or add the EXTRA_CLASSPATH environment variable to point to the .jar file you downloaded:
export EXTRA_CLASSPATH=</link/to/your.jar> (or set EXTRA_CLASSPATH=</link/to/your.jar> on Windows). You can find the detailed solution here.

The import com.interwoven cannot be resolved

I am new to TeamSite and not sure which JAR file to use to resolve the issue I get while trying to write a Java data source in Eclipse.
The imports I am using are:
import com.interwoven.datasource.MapDataSource;
import com.interwoven.datasource.core.DataSourceContext;
import com.interwoven.livesite.dom4j.Dom4jUtils;
import com.interwoven.serverutils100.InstalledLocations;
import com.interwoven.cssdk.filesys.CSVPath;
All these imports show the same error:
The import com.interwoven cannot be resolved.
Can anyone please tell me which JAR files I should add?