I'm trying to run the PyPostal Python bindings for the C library libpostal to perform address standardisation on Spark (AWS Glue, to be precise). I have followed the guide at https://github.com/openvenues/pypostal/issues/58, which links to https://github.com/openvenues/libpostal/issues/153, but to no avail.
The steps I have taken:
1. Compiled libpostal into a single tar file, joint.tar.gz.
2. Packaged the libpostal data files into a single tar file, libpostal_datadir.tar.gz.
3. Built an egg of the PyPostal Python bindings, postal-1.1.9-py2.7-linux-x86_64.egg.
4. Configured the PySpark shell with the following command (a build sketch for the artifacts in steps 1-3 follows it):
gluepyspark3 \
--conf "spark.yarn.dist.archives=s3://{some_S3_location}/libpostal/joint.tar.gz,s3://{some_S3_location}/libpostal/libpostal_datadir.tar.gz" \
--conf "spark.executorEnv.LD_LIBRARY_PATH=./joint.tar.gz" \
--conf "spark.executorEnv.LIBPOSTAL_DATA_DIR=./libpostal_datadir.tar.gz" \
--conf "spark.driver.extraLibraryPath=./joint.tar.gz" \
--conf "spark.driver.LibraryPath=./joint.tar.gz" \
--conf "spark.yarn.appMasterEnv.LD_LIBRARY_PATH=./joint.tar.gz" \
--conf "spark.yarn.appMasterEnv.LIBPOSTAL_DATA_DIR=./libpostal_datadir.tar.gz" \
--py-files "s3://{some_s3_location}/libpostal/postal-1.1.9-py2.7-linux-x86_64.egg"
Inside the Spark shell I then ran from postal.parser import *, and the following error is returned:
Python 3.6.12 (default, Aug 31 2020, 18:56:18)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
Type "help", "copyright", "credits" or "license" for more information.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/share/aws/glue/etl/jars/glue-assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/02/19 15:12:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/02/19 15:12:19 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
21/02/19 15:12:32 WARN Client: Same path resource s3://safari-dev-assets/glue/jobs/kw_libpostal/postal-1.1.9-py2.7-linux-x86_64.egg added multiple times to distributed cache.
21/02/19 15:12:54 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.4.3
/_/
Using Python version 3.6.12 (default, Aug 31 2020 18:56:18)
SparkSession available as 'spark'.
>>> from postal.parser import *
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/mnt/tmp/spark-57e88948-952a-4a7b-b104-f7832af5f9c8/userFiles-cf4d4719-3866-4d63-8ef6-fbe3b4dd08ad/postal-1.1.9-py2.7-linux-x86_64.egg/postal/parser.py", line 2, in <module>
File "/mnt/tmp/spark-57e88948-952a-4a7b-b104-f7832af5f9c8/userFiles-cf4d4719-3866-4d63-8ef6-fbe3b4dd08ad/postal-1.1.9-py2.7-linux-x86_64.egg/postal/_parser.py", line 7, in <module>
File "/mnt/tmp/spark-57e88948-952a-4a7b-b104-f7832af5f9c8/userFiles-cf4d4719-3866-4d63-8ef6-fbe3b4dd08ad/postal-1.1.9-py2.7-linux-x86_64.egg/postal/_parser.py", line 6, in __bootstrap__
File "/usr/lib64/python3.6/imp.py", line 343, in load_dynamic
return _load(spec)
ImportError: libpostal.so.1: cannot open shared object file: No such file or directory
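One thing worth checking from that same shell, before the import, is what the executors actually see: on YARN, archives listed in spark.yarn.dist.archives are normally unpacked into a directory named after the archive (or its # alias) in each container's working directory, which is what ./joint.tar.gz on LD_LIBRARY_PATH relies on. A small standard-library-only debugging sketch (not part of the original setup) shows the environment and working directory on an executor:

import os

def probe(_):
    # What the native loader and libpostal will see on this executor
    return (os.environ.get("LD_LIBRARY_PATH"),
            os.environ.get("LIBPOSTAL_DATA_DIR"),
            sorted(os.listdir(".")))

print(spark.sparkContext.parallelize([0], 1).map(probe).collect())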
Related
I am running a spark-submit command that will do some database work via a Scala class.
spark-submit \
--verbose \
--class mycompany.MyClass \
--conf spark.driver.extraJavaOptions=-Dconfig.resource=dev-test.conf \
--conf spark.executor.extraJavaOptions=-Dconfig.resource=dev-test.conf \
--master yarn \
--driver-library-path /usr/lib/hadoop-lzo/lib/native/ \
--jars /home/hadoop/mydir/dbp.spark-utils-1.1.0-SNAPSHOT.jar,/usr/lib/phoenix/phoenix-client-hbase-2.4-5.1.2.jar,/usr/lib/hadoop-lzo/lib/hadoop-lzo.jar,/usr/lib/hadoop/lib/commons-compress-1.18.jar,/usr/lib/hadoop/hadoop-aws-3.2.1-amzn-5.jar \
--files /home/hadoop/mydir/dev-test.conf \
--num-executors 1 \
--executor-memory 3g \
--driver-memory 3g \
--queue default /home/hadoop/mydir/dbp.spark-utils-1.1.0-SNAPSHOT.jar \
<<args to MyClass>>
When I run, I get an exception:
Caused by: java.sql.SQLException: No suitable driver found for jdbc:phoenix:host1,host2,host3:2181:/hbase;
at java.sql.DriverManager.getConnection(DriverManager.java:689)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at org.apache.phoenix.util.QueryUtil.getConnection(QueryUtil.java:422)
at org.apache.phoenix.util.QueryUtil.getConnection(QueryUtil.java:414)
Here are the relevant parts of my Scala code:
val conf: SerializableHadoopConfiguration =
  new SerializableHadoopConfiguration(sc.hadoopConfiguration)
Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")
val tableRowKeyPairs: RDD[(Cell, ImmutableBytesWritable)] =
  df.rdd.mapPartitions(partition => {
    val configuration = conf.get()
    val partitionConn: JavaConnection = QueryUtil.getConnection(configuration)
    // ...
  })
My spark-submit command includes /usr/lib/phoenix/phoenix-client-hbase-2.4-5.1.2.jar using the --jars argument. When I search that file for "org.apache.phoenix.jdbc.PhoenixDriver", I find it:
$ jar -tf /usr/lib/phoenix/phoenix-client-hbase-2.4-5.1.2.jar | grep -i driver
...
org/apache/phoenix/jdbc/PhoenixDriver.class
...
So why can't my program locate the driver?
I was able to get the program to find the driver by adding the following argument to the spark-submit command shown in the question:
--conf "spark.executor.extraClassPath=/usr/lib/phoenix/phoenix-client-hbase-2.4-5.1.2.jar"
This StackOverflow article has great explanations for what the various arguments do.
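The usual explanation is that the executor-side java.sql.DriverManager needs the Phoenix driver on the executor JVM's classpath, which spark.executor.extraClassPath provides (it prepends entries to the executor classpath), whereas --jars only distributes the jar and exposes it through Spark's application classloader. If the driver process also opens Phoenix connections, the driver-side counterpart can be added as well; that second line is an assumption for symmetry, not something the fix above required:
--conf "spark.executor.extraClassPath=/usr/lib/phoenix/phoenix-client-hbase-2.4-5.1.2.jar" \
--conf "spark.driver.extraClassPath=/usr/lib/phoenix/phoenix-client-hbase-2.4-5.1.2.jar"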
I am getting an IllegalArgumentException while submitting a Spark job.
C:\spark\spark-2.2.1-bin-hadoop2.7\hadoop\bin>pyspark
Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:25:58) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/C:/spark/spark-2.2.1-bin-hadoop2.7/jars/hadoop-auth-2.7.3.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
17/12/21 15:48:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
File "C:\spark\spark-2.2.1-bin-hadoop2.7\python\pyspark\shell.py", line 45, in <module>
spark = SparkSession.builder\
File "C:\spark\spark-2.2.1-bin-hadoop2.7\python\pyspark\sql\session.py", line 183, in getOrCreate
session._jsparkSession.sessionState().conf().setConfString(key, value)
File "C:\spark\spark-2.2.1-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
File "C:\spark\spark-2.2.1-bin-hadoop2.7\python\pyspark\sql\utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"
Please advise how to resolve these errors.
Check your permissions on /tmp/hive/: use sudo chmod -R 777 /tmp/hive/. Also check whether you are passing queries through sqlContext; if so, you can switch to the SparkSession instead.
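Since this particular shell is running on Windows (note the C:\spark\... prompt), sudo chmod is not available there. The usual Windows equivalent, assuming winutils.exe is installed for the matching Hadoop build and is on the PATH (or referenced via HADOOP_HOME\bin), is something like:
winutils.exe chmod 777 \tmp\hive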
I am struggling to make JUnit work in my Spark shell.
When trying to import Assert from JUnit, I get the following error message:
scala> import org.junit.Assert._
<console>:23: error: object junit is not a member of package org
import org.junit.Assert._
Is there any way to fix this? Any idea how I can download org.junit from the Scala shell?
EDIT:
After following the recommendation from zsxwing, I used spark-shell --packages junit:junit:4.12, with the following output:
C:\spark>spark-shell --packages junit:junit:4.12
Ivy Default Cache set to: C:\xxx\.ivy2\cache
The jars for the packages stored in: C:\xxxx\.ivy2\jars
:: loading settings :: url = jar:file:/C:/spark/jars/ivy-2.4.0.jar!/org/apache/i
vy/core/settings/ivysettings.xml
junit#junit added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found junit#junit;4.12 in central
found org.hamcrest#hamcrest-core;1.3 in central
:: resolution report :: resolve 365ms :: artifacts dl 7ms
:: modules in use:
junit#junit;4.12 from central in [default]
org.hamcrest#hamcrest-core;1.3 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 2 | 0 | 0 | 0 || 2 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 2 already retrieved (0kB/20ms)
Setting default log level to "WARN".
Spark context Web UI available at http://xxxxx
Spark context available as 'sc' (master = local[*], app id = local-xxx
).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.0.0
/_/
However, I am still facing the same issue when trying to import org.junit.Assert._
JUnit is a test dependency and is not included in the Spark shell's classpath. You can use the --packages parameter to add any dependency, such as:
bin/spark-shell --packages junit:junit:4.12
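If the packages resolve but the import still fails in a particular environment (as in the edit above), an equivalent sketch is to hand the shell an already-downloaded jar directly; the path below is an assumption:
spark-shell --jars C:\path\to\junit-4.12.jar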
I am following the instructions laid out in the Kubernetes Spark example. I can get as far as launching the PySpark shell. However, I need to use PySpark with JDBC to connect to my Postgres database. Before I tried Kubernetes, I got JDBC working with Spark using the spark-defaults.conf file:
spark.driver.extraClassPath /spark/postgresql-9.4.1209.jre7.jar
spark.executor.extraClassPath /spark/postgresql-9.4.1209.jre7.jar
I also had to download the driver to that location first. How do I achieve the same thing with Kubernetes? I don't think I can do
kubectl exec zeppelin-controller-xzlrf -it pyspark --jars /spark/postgresql-9.4.1209.jre7.jar
because the jar would have to be inside the container first. So maybe I can get it working if I can get the jar file into the container, but how do I do that? Any thoughts or help are greatly appreciated.
UPDATE: I tried following #LostInOverflow's solution but encountered the following:
kubectl exec zeppelin-controller-2p3ew -it -- pyspark --packages org.postgresql:postgresql:9.4.1209.jre7.jar
which appears to start up and recognize the --packages argument, but still fails:
Python 2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/opt/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.postgresql#postgresql added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
:: resolution report :: resolve 2294ms :: artifacts dl 0ms
:: modules in use:
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 0 | 0 | 0 || 0 | 0 |
---------------------------------------------------------------------
:: problems summary ::
:::: WARNINGS
module not found: org.postgresql#postgresql;9.4.1209.jre7.jar
==== local-m2-cache: tried
file:/root/.m2/repository/org/postgresql/postgresql/9.4.1209.jre7.jar/postgresql-9.4.1209.jre7.jar.pom
-- artifact org.postgresql#postgresql;9.4.1209.jre7.jar!postgresql.jar:
file:/root/.m2/repository/org/postgresql/postgresql/9.4.1209.jre7.jar/postgresql-9.4.1209.jre7.jar.jar
==== local-ivy-cache: tried
/root/.ivy2/local/org.postgresql/postgresql/9.4.1209.jre7.jar/ivys/ivy.xml
==== central: tried
https://repo1.maven.org/maven2/org/postgresql/postgresql/9.4.1209.jre7.jar/postgresql-9.4.1209.jre7.jar.pom
-- artifact org.postgresql#postgresql;9.4.1209.jre7.jar!postgresql.jar:
https://repo1.maven.org/maven2/org/postgresql/postgresql/9.4.1209.jre7.jar/postgresql-9.4.1209.jre7.jar.jar
==== spark-packages: tried
http://dl.bintray.com/spark-packages/maven/org/postgresql/postgresql/9.4.1209.jre7.jar/postgresql-9.4.1209.jre7.jar.pom
-- artifact org.postgresql#postgresql;9.4.1209.jre7.jar!postgresql.jar:
http://dl.bintray.com/spark-packages/maven/org/postgresql/postgresql/9.4.1209.jre7.jar/postgresql-9.4.1209.jre7.jar.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: org.postgresql#postgresql;9.4.1209.jre7.jar: not found
::::::::::::::::::::::::::::::::::::::::::::::
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: org.postgresql#postgresql;9.4.1209.jre7.jar: not found]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1011)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
File "/opt/spark/python/pyspark/shell.py", line 43, in <module>
sc = SparkContext(pyFiles=add_files)
File "/opt/spark/python/pyspark/context.py", line 110, in __init__
SparkContext._ensure_initialized(self, gateway=gateway)
File "/opt/spark/python/pyspark/context.py", line 234, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
File "/opt/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway
raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number
>>>
You can use --packages with Maven coordinates in place of --jars. A coordinate is group:artifact:version with no .jar suffix (the unresolved dependency above is Ivy looking for version 9.4.1209.jre7.jar, which does not exist):
--packages org.postgresql:postgresql:9.4.1209.jre7
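With the pod name from the update above, the full command would then be:
kubectl exec zeppelin-controller-2p3ew -it -- pyspark --packages org.postgresql:postgresql:9.4.1209.jre7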
How do I reduce the amount of trace info the Spark runtime produces?
The default is too verbose.
How do I turn it off, and turn it back on when I need it?
Thanks
Verbose mode
scala> val la = sc.parallelize(List(12,4,5,3,4,4,6,781))
scala> la.collect
15/01/28 09:57:24 INFO SparkContext: Starting job: collect at <console>:15
15/01/28 09:57:24 INFO DAGScheduler: Got job 3 (collect at <console>:15) with 1 output
...
15/01/28 09:57:24 INFO Executor: Running task 0.0 in stage 3.0 (TID 3)
15/01/28 09:57:24 INFO Executor: Finished task 0.0 in stage 3.0 (TID 3). 626 bytes result sent to driver
15/01/28 09:57:24 INFO DAGScheduler: Stage 3 (collect at <console>:15) finished in 0.002 s
15/01/28 09:57:24 INFO DAGScheduler: Job 3 finished: collect at <console>:15, took 0.020061 s
res5: Array[Int] = Array(12, 4, 5, 3, 4, 4, 6, 781)
Silent mode (expected)
scala> val la = sc.parallelize(List(12,4,5,3,4,4,6,781))
scala> la.collect
res5: Array[Int] = Array(12, 4, 5, 3, 4, 4, 6, 781)
Spark 1.4.1
sc.setLogLevel("WARN")
From comments in source code:
Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
Spark 2.x - 2.3.1
sparkSession.sparkContext().setLogLevel("WARN")
Spark 2.3.2
sparkSession.sparkContext.setLogLevel("WARN")
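In the PySpark shell, where the session is exposed as spark, the equivalent is:
spark.sparkContext.setLogLevel("WARN")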
Quoting from the 'Learning Spark' book:
You may find the logging statements that get printed in the shell distracting. You can control the verbosity of the logging. To do this, you can create a file in the conf directory called log4j.properties. The Spark developers already include a template for this file called log4j.properties.template. To make the logging less verbose, make a copy of conf/log4j.properties.template called conf/log4j.properties and find the following line:
log4j.rootCategory=INFO, console
Then lower the log level so that we only show WARN messages and above by changing it to the following:
log4j.rootCategory=WARN, console
When you re-open the shell, you should see less output.
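A minimal conf/log4j.properties along those lines; the console appender settings mirror the shipped template:
# Only WARN and above on the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n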
Logging configuration at the Spark application level
With this approach, no code change is needed in the Spark application running on the cluster.
Create a new file log4j.properties from log4j.properties.template, then change the verbosity with the log4j.rootCategory property.
Say we only need the ERRORs from a given jar; then set log4j.rootCategory=ERROR, console.
The spark-submit command would be:
spark-submit \
... # other Spark props go here \
--files prop/file/location \
--conf 'spark.executor.extraJavaOptions=-Dlog4j.configuration=prop/file/location' \
--conf 'spark.driver.extraJavaOptions=-Dlog4j.configuration=prop/file/location' \
jar/location \
[application arguments]
Now you will see only the log messages categorised as ERROR.
Plain Log4j way, without Spark (but needs a code change)
Set logging for the org and akka packages to ERROR:
import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("akka").setLevel(Level.ERROR)
If you are invoking a command from a shell, there is a lot you can do without changing any configurations. That is by design.
Below are a couple of Unix examples using pipes, but you could do similar filters in other environments.
To completely silence the log (at your own risk)
Pipe stderr to /dev/null, i.e.:
run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999 2> /dev/null
To ignore INFO messages
run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999 | awk '{if ($3 != "INFO") print $0}'