I have a question about Hive metastore support for Delta Lake.
I've defined a metastore on a standalone Spark session with the following configuration:
pyspark --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension"
spark = SparkSession \
.builder \
.appName("Python Spark SQL Hive integration example") \
.config("spark.sql.warehouse.dir", '/mnt/data/db/medilake/') \
.config("spark.hadoop.datanucleus.autoCreateSchema", 'true') \
.enableHiveSupport() \
.getOrCreate()
It all worked while I was in the session, following the doc https://docs.delta.io/latest/delta-batch.html#-control-data-location&language-python.
Then, when I opened a new session, I got this error:
>>> spark.sql("SELECT * FROM BRONZE.user").show()
20/09/04 20:30:27 WARN ObjectStore: Failed to get database bronze, returning NoSuchObjectException
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/spark-3.0.0-bin-hadoop3.2/python/pyspark/sql/session.py", line 646, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/opt/spark-3.0.0-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/opt/spark-3.0.0-bin-hadoop3.2/python/pyspark/sql/utils.py", line 137, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: Table or view not found: BRONZE.user; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation [BRONZE, user]
I can still see the data arranged in the folders, but the databases are now missing (there used to be bronze/silver/gold):
>>> spark.catalog.listDatabases()
[Database(name='default', description='Default Hive database', locationUri='file:/mnt/data/db/medilake/spark-warehouse')]
folders:
root#m:/mnt/data/db/medilake# ll
total 16
drwxr-xr-x 9 root root 4096 Sep 4 19:05 bronze.db
-rw-r--r-- 1 root root 708 Sep 4 19:45 derby.log
drwxr-xr-x 4 root root 4096 Sep 4 19:36 gold.db
drwxr-xr-x 5 root root 4096 Sep 4 19:45 metastore_db
conf:
>>> spark.conf.get("spark.sql.warehouse.dir")
'/mnt/data/db/medilake/'
How can I set the property so that it works from any working directory?
.config("spark.sql.warehouse.dir", '/mnt/data/db/medilake/')
Barak, maybe I can help. You can register the table with Spark SQL syntax in the following way:
spark.sql("CREATE TABLE BRONZE.user USING DELTA LOCATION '/mnt/data/db/medilake/spark-warehouse'")
I remember that to read a Delta table it must first have been saved as Delta (for example converted from a Parquet or CSV file); for more details, consult the documentation:
Delta Lake
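For completeness, a minimal PySpark sketch of that flow (the source path is a placeholder, and the Delta location reuses the bronze.db folder from the listing above); it assumes the session was started with the Delta configs from the question:

# Sketch: write a DataFrame as Delta, then register it in the metastore so it can be
# queried by name in later sessions. The source path is a placeholder, not from the question.
df = spark.read.parquet("/mnt/data/raw/user")
df.write.format("delta").mode("overwrite").save("/mnt/data/db/medilake/bronze.db/user")

spark.sql("CREATE DATABASE IF NOT EXISTS BRONZE")
spark.sql("CREATE TABLE IF NOT EXISTS BRONZE.user USING DELTA LOCATION '/mnt/data/db/medilake/bronze.db/user'")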
I just needed to follow the instructions.
Setting spark.sql.warehouse.dir in spark-defaults.conf did the trick.
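For reference, a minimal sketch of what that spark-defaults.conf could contain; the last line is an assumption, only relevant if you keep the embedded Derby metastore rather than an external Hive metastore:

# spark-defaults.conf (sketch)
spark.sql.warehouse.dir                       /mnt/data/db/medilake/
spark.sql.extensions                          io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog               org.apache.spark.sql.delta.catalog.DeltaCatalog
# Assumption: pin the embedded Derby metastore to an absolute path as well
spark.hadoop.javax.jdo.option.ConnectionURL   jdbc:derby:;databaseName=/mnt/data/db/medilake/metastore_db;create=true

The last line pins the Derby metastore_db to an absolute path, so it no longer depends on the working directory pyspark is launched from.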
Related
I had been dealing with a previous error while trying to perform some Named Entity Recognition with spaCy on Dataproc + PySpark. I created a brand-new cluster to deal with the "insufficient local disk space" issue mentioned in the comments of that case:
gcloud dataproc clusters create spacy_tests \
--autoscaling-policy policy-dbeb \
--enable-component-gateway \
--region us-central1 \
--zone us-central1-c \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 500 \
--num-workers 2 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 500 \
--num-secondary-workers 2 \
--secondary-worker-boot-disk-size 500 \
--num-secondary-worker-local-ssds 0 \
--image-version 1.5-debian10 \
--properties dataproc:pip.packages=spacy==3.2.1,numpy==1.19.5,dataproc:efm.spark.shuffle=primary-worker \
--optional-components ANACONDA,JUPYTER,DOCKER \
--project mentor-pilot-project
Nevertheless, I have stumbled upon a new error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-1-9a88da6b7731> in <module>
39
40 # loading spaCy model and broadcasting it
---> 41 broadcasted_nlp = spark.sparkContext.broadcast(load_spacy_model())
42
43 print("DATA READING (OR MANUAL DATA GENERATION)...")
<ipython-input-1-9a88da6b7731> in load_spacy_model()
20
21 def load_spacy_model():
---> 22 import spacy
23 print("\tLoading spacy model...")
24 return spacy.load("./spacy_model") # This model exists locally
/opt/conda/anaconda/lib/python3.7/site-packages/spacy/__init__.py in <module>
9
10 # These are imported as part of the API
---> 11 from thinc.api import prefer_gpu, require_gpu, require_cpu # noqa: F401
12 from thinc.api import Config
13
/opt/conda/anaconda/lib/python3.7/site-packages/thinc/api.py in <module>
1 from .config import Config, registry, ConfigValidationError
----> 2 from .initializers import normal_init, uniform_init, glorot_uniform_init, zero_init
3 from .initializers import configure_normal_init
4 from .loss import CategoricalCrossentropy, L2Distance, CosineDistance
5 from .loss import SequenceCategoricalCrossentropy
/opt/conda/anaconda/lib/python3.7/site-packages/thinc/initializers.py in <module>
2 import numpy
3
----> 4 from .backends import Ops
5 from .config import registry
6 from .types import FloatsXd, Shape
/opt/conda/anaconda/lib/python3.7/site-packages/thinc/backends/__init__.py in <module>
6
7 from .ops import Ops
----> 8 from .cupy_ops import CupyOps, has_cupy
9 from .numpy_ops import NumpyOps
10 from ._cupy_allocators import cupy_tensorflow_allocator, cupy_pytorch_allocator
/opt/conda/anaconda/lib/python3.7/site-packages/thinc/backends/cupy_ops.py in <module>
17 from .. import registry
18 from .ops import Ops
---> 19 from .numpy_ops import NumpyOps
20 from . import _custom_kernels
21 from ..types import DeviceTypes
/opt/conda/anaconda/lib/python3.7/site-packages/thinc/backends/numpy_ops.pyx in init thinc.backends.numpy_ops()
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
The ValueError I am getting is most likely related to "the package sources being different" (quote), but I am not sure whether that applies to my case. After googling the error, it seems to go away by uninstalling and reinstalling a certain package, or by upgrading it; needless to say, I don't know which package that would be in my case, if any. Nevertheless, I would like to stress that the default cluster packages shipped with Dataproc (so to speak) cannot be controlled from my side (or so I think).
NOTES:
I tried to include the output of pip list and conda list, but it was too lengthy; if you need a specific package version, please ask for it in the comments. Leave any other requests there too, and I will keep this section updated.
What could be happening here?
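A small diagnostic sketch that may help narrow this down: errors like "numpy.ndarray size changed, may indicate binary incompatibility" usually mean a compiled extension (here thinc's numpy_ops) was built against a different numpy than the one it imports at runtime. The snippet below (app name and partition count are arbitrary) only prints which numpy/thinc builds the driver and the executors actually pick up, to spot a version or install-path mismatch between the conda base env and the pip-installed packages:

# Diagnostic sketch: report numpy/thinc versions and install paths on driver and executors.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("env-check").getOrCreate()

def env_info(_):
    import sys
    import numpy
    info = {"python": sys.executable,
            "numpy": numpy.__version__,
            "numpy_path": numpy.__file__}
    try:
        import thinc
        info["thinc"] = thinc.__version__
        info["thinc_path"] = thinc.__file__
    except Exception as exc:  # a failing import here is itself useful information
        info["thinc"] = repr(exc)
    return [info]

print("driver:", env_info(None))
for row in spark.sparkContext.parallelize(range(4), 4).mapPartitions(env_info).collect():
    print("executor:", row)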
I'm trying to run PySpark from a notebook in a conda environment.
$ which python
inside the environment 'env' returns:
/Users/<username>/anaconda2/envs/env/bin/python
and outside the environment:
/Users/<username>/anaconda2/bin/python
My .bashrc file has:
export PATH="/Users/<username>/anaconda2/bin:$PATH"
export JAVA_HOME=`/usr/libexec/java_home`
export SPARK_HOME=/usr/local/Cellar/apache-spark/3.1.2
export PYTHONPATH=$SPARK_HOME/libexec/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
But still, when I run:
import findspark
findspark.init()
I'm getting the error:
Exception: Unable to find py4j, your SPARK_HOME may not be configured correctly
Any ideas?
Full traceback
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
~/anaconda2/envs/ai/lib/python3.7/site-packages/findspark.py in init(spark_home, python_path, edit_rc, edit_profile)
142 try:
--> 143 py4j = glob(os.path.join(spark_python, "lib", "py4j-*.zip"))[0]
144 except IndexError:
IndexError: list index out of range
During handling of the above exception, another exception occurred:
Exception Traceback (most recent call last)
/var/folders/dx/dfb8h2h925l7vmm7y971clpw0000gn/T/ipykernel_72686/1796740182.py in <module>
1 import findspark
2
----> 3 findspark.init()
~/anaconda2/envs/ai/lib/python3.7/site-packages/findspark.py in init(spark_home, python_path, edit_rc, edit_profile)
144 except IndexError:
145 raise Exception(
--> 146 "Unable to find py4j, your SPARK_HOME may not be configured correctly"
147 )
148 sys.path[:0] = [spark_python, py4j]
Exception: Unable to find py4j, your SPARK_HOME may not be configured correctly
EDIT:
If I run the following in the notebook:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
I get the error:
/usr/local/Cellar/apache-spark/3.1.2/bin/load-spark-env.sh: line 2: /usr/local/Cellar/apache-spark/3.1.2/libexec/bin/load-spark-env.sh: Permission denied
/usr/local/Cellar/apache-spark/3.1.2/bin/load-spark-env.sh: line 2: exec: /usr/local/Cellar/apache-spark/3.1.2/libexec/bin/load-spark-env.sh: cannot execute: Undefined error: 0
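As the findspark traceback above shows, findspark.init() globs for py4j under $SPARK_HOME/python/lib. With a Homebrew install, the actual Spark payload usually sits under a libexec subdirectory, so one sketch of a workaround (the libexec path is an assumption based on the SPARK_HOME in the question) is to pass spark_home explicitly:

import findspark
# Assumption: point findspark at the directory that actually contains python/lib/py4j-*.zip
# (for a Homebrew install that is typically the libexec subdirectory).
findspark.init("/usr/local/Cellar/apache-spark/3.1.2/libexec")

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()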
In Apache Airflow, I wrote a PythonOperator that uses PySpark to run a job in YARN cluster mode. I initialize the SparkSession object as follows:
spark = SparkSession \
.builder \
.appName("test python operator") \
.master("yarn") \
.config("spark.submit.deployMode","cluster") \
.getOrCreate()
However, when I run my DAG, I get an exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/airflow/models/taskinstance.py", line 983, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.8/dist-packages/airflow/operators/python_operator.py", line 113, in execute
return_value = self.execute_callable()
File "/usr/local/lib/python3.8/dist-packages/airflow/operators/python_operator.py", line 118, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/catfish/dags/dags_dag_test_python_operator.py", line 39, in print_count
spark = SparkSession \
File "/usr/local/lib/python3.8/dist-packages/pyspark/sql/session.py", line 186, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/usr/local/lib/python3.8/dist-packages/pyspark/context.py", line 371, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/usr/local/lib/python3.8/dist-packages/pyspark/context.py", line 128, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "/usr/local/lib/python3.8/dist-packages/pyspark/context.py", line 320, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "/usr/local/lib/python3.8/dist-packages/pyspark/java_gateway.py", line 105, in launch_gateway
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
I also set PYSPARK_SUBMIT_ARGS, but it doesn't work for me!
You need to install Spark in your Ubuntu container:
RUN apt-get -y install default-jdk scala git curl wget
RUN wget --no-verbose https://downloads.apache.org/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz
RUN tar xvf spark-2.4.6-bin-hadoop2.7.tgz
RUN mv spark-2.4.6-bin-hadoop2.7 /opt/spark
ENV SPARK_HOME=/opt/spark
And unfortunately you cannot run Spark on YARN with a PythonOperator. I suggest you use SparkSubmitOperator or BashOperator instead.
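A minimal sketch of the SparkSubmitOperator route (Airflow 1.10.x import path assumed from the traceback; the DAG id, application path, and connection id are placeholders, not taken from the question):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

# Placeholder DAG wiring; master=yarn and deploy-mode=cluster are configured on the
# "spark_default" connection rather than inside the operator itself.
with DAG(dag_id="dag_test_spark_submit",
         start_date=datetime(2021, 1, 1),
         schedule_interval=None) as dag:

    print_count = SparkSubmitOperator(
        task_id="print_count",
        application="/catfish/jobs/print_count.py",  # placeholder path to the PySpark script
        conn_id="spark_default",
        name="test python operator",
    )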
Just over one week ago, I was able to read a BigQuery table into an RDD for a Spark job running on a Dataproc cluster, using the guide at https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example as a template. Since then, I have been encountering missing-class issues, despite having changed nothing on my side.
I have tried to track down the missing class, com/google/cloud/hadoop/repackaged/bigquery/com/google/common/collect/ImmutableList, but I cannot find any information on whether or not this class is now excluded from gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar.
The job submission request is as follows:
gcloud dataproc jobs submit pyspark \
--cluster $CLUSTER_NAME \
--jars gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar \
--bucket gs://$BUCKET_NAME \
--region europe-west2 \
--py-files $PYSPARK_PATH/main.py
The PySpark code breaks at the following point:
bq_table_rdd = spark_context.newAPIHadoopRDD(
'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
'org.apache.hadoop.io.LongWritable',
'com.google.gson.JsonObject',
conf=conf)
where conf is a Python dict structured as follows:
conf = {
'mapred.bq.project.id': project_id,
'mapred.bq.gcs.bucket': gcs_staging_bucket,
'mapred.bq.temp.gcs.path': input_staging_path,
'mapred.bq.input.project.id': bq_input_project_id,
'mapred.bq.input.dataset.id': bq_input_dataset_id,
'mapred.bq.input.table.id': bq_input_table_id,
}
When my output indicates that the code has reached the above spark_context.newAPIHadoopRDD function, the following is printed to stdout:
class com.google.cloud.hadoop.repackaged.bigquery.com.google.common.flogger.backend.system.DefaultPlatform: cannot cast result of calling 'com.google.cloud.hadoop.repackaged.gcs.com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance' to 'com.google.cloud.hadoop.repackaged.bigquery.com.google.common.flogger.backend.system.BackendFactory': java.lang.ClassCastException: Cannot cast com.google.cloud.hadoop.repackaged.gcs.com.google.common.flogger.backend.log4j.Log4jBackendFactory to com.google.cloud.hadoop.repackaged.bigquery.com.google.common.flogger.backend.system.BackendFactory
Traceback (most recent call last):
File "/tmp/0af805a2dd104e46b087037f0790691f/main.py", line 31, in <module>
sc)
File "/tmp/0af805a2dd104e46b087037f0790691f/extract.py", line 65, in bq_table_to_rdd
conf=conf)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 749, in newAPIHadoopRDD
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.NoClassDefFoundError: com/google/cloud/hadoop/repackaged/bigquery/com/google/common/collect/ImmutableList
This had not been an issue as recently as last week. I am concerned that even the hello world example on the GCP website is not stable in the short term. If anyone could shed some light on this issue, it would be greatly appreciated. Thanks.
I reproduced the problem
$ gcloud dataproc clusters create test-cluster --image-version=1.4
$ gcloud dataproc jobs submit pyspark wordcount_bq.py \
--cluster test-cluster \
--jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar
then exactly the same error happened:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.NoClassDefFoundError: com/google/cloud/hadoop/repackaged/bigquery/com/google/common/collect/ImmutableList
I noticed there was a new release 1.0.0 on Aug 23:
$ gsutil ls -l gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-**
...
4038762 2018-10-03T20:59:35Z gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-0.13.8.jar
4040566 2018-10-19T23:32:19Z gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-0.13.9.jar
14104522 2019-06-28T21:08:57Z gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-1.0.0-RC1.jar
14104520 2019-07-01T20:38:18Z gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-1.0.0-RC2.jar
14149215 2019-08-23T21:08:03Z gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-1.0.0.jar
14149215 2019-08-24T00:27:49Z gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar
Then I tried version 0.13.9, and it worked:
$ gcloud dataproc jobs submit pyspark wordcount_bq.py \
--cluster test-cluster \
--jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-0.13.9.jar
It is a problem with 1.0.0; there is already an issue filed on GitHub. We'll fix it and improve the tests.
I was going through Scala's ProcessBuilder API in order to run shell commands the way we run them in a shell script. I could run a few commands, but I'm having an issue with one type of Hive query execution.
The commands below ran successfully, but one format fails:
Running Shell Command (Successful):
scala> import sys.process._
scala> "ls -lrt /home/cloudera/Desktop".!
total 164
-rwxrwxr-x 1 cloudera cloudera 237 Apr 5 2016 Parcels.desktop
-rwxrwxr-x 1 cloudera cloudera 238 Apr 5 2016 Kerberos.desktop
-rwxrwxr-x 1 cloudera cloudera 259 Apr 5 2016 Express.desktop
Running Hive Query With File Option (Successful):
scala> "hive -f /home/cloudera/hi.hql" !!
warning: there was one feature warning; re-run with -feature for
details
ls: cannot access /usr/lib/spark/lib/spark-assembly-*.jar: No such
file or directory
2017-09-03 23:20:34,392 WARN [main] mapreduce.TableMapReduceUtil: The
hbase-prefix-tree module jar containing PrefixTreeCodec is not
present. Continuing without it.
Logging initialized using configuration in
file:/etc/hive/conf.dist/hive-log4j.properties
OK
Time taken: 0.913 seconds, Fetched: 2 row(s)
res20: String =
"100 Amit 12000 10
101 Allen 22000 20 .
"
Running Hive Query With -e Option (Failed):
If I run the query below directly on the terminal, I can run it in the given format:
bash$ hive -e "select * from staging.employee_canada;"
The problem is that running the same query from the Scala REPL fails because of the double quotes ("") around the SELECT query. How can I escape them and run it successfully? I tried triple quotes as well as the escape sequence "\", but it still failed to execute.
scala> "hive -e select * from staging.employee_canada; "!!
Below is the Error:
FAILED: ParseException line 1:6 cannot recognize input near '<EOF>'
'<EOF>' '<EOF>' in select clause
java.lang.RuntimeException: Nonzero exit value: 64
at scala.sys.package$.error(package.scala:27)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.slurp
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang$bang(
ProcessBuilderImpl.scala:102)
... 32 elided
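A minimal sketch of one workaround (not verified on that exact cluster): build the command as a Seq, so each element, including the whole query, is passed to the process as a single argument and no shell-style quoting is needed:

import sys.process._

// The whole query is one argument to hive, so no quote escaping is required.
val query = "select * from staging.employee_canada;"
val output = Seq("hive", "-e", query).!!
println(output)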