Does Scala have "Options" to parse command-line arguments in spark-submit just like in Java? [duplicate] - scala

This question already has answers here:
Best way to parse command-line parameters? [closed]
(26 answers)
Closed 3 years ago.
In order to parse command line arguments while using spark-submit:
SPARK_MAJOR_VERSION=2 spark-submit --class com.partition.source.Pickup --master=yarn --conf spark.ui.port=0000 --driver-class-path /home/hdpusr/jars/postgresql-42.1.4.jar --conf spark.jars=/home/hdpusr/jars/postgresql-42.1.4.jar,/home/hdpusr/jars/postgresql-42.1.4.jar --executor-cores 4 --executor-memory 4G --keytab /home/hdpusr/hdpusr.keytab --principal hdpusr@DEVUSR.COM --files /usr/hdp/current/spark2-client/conf/hive-site.xml,testconnection.properties --name Spark_APP --conf spark.executor.extraClassPath=/home/hdpusr/jars/greenplum.jar sparkload_2.11-0.1.jar ORACLE
I am passing a database name, ORACLE, which I parse in the code as:
def main(args: Array[String]): Unit = {
  val dbtype = args(0).toString
  ...
}
Is there a way I can give it a name like "--dbname" and then check for that option in spark-submit to get the option's value?
Ex:
SPARK_MAJOR_VERSION=2 spark-submit --class com.partition.source.Pickup --master=yarn --conf spark.ui.port=0000 --driver-class-path /home/hdpusr/jars/postgresql-42.1.4.jar --conf spark.jars=/home/hdpusr/jars/postgresql-42.1.4.jar,/home/hdpusr/jars/postgresql-42.1.4.jar --executor-cores 4 --executor-memory 4G --keytab /home/hdpusr/hdpusr.keytab --principal hdpusr@DEVUSR.COM --files /usr/hdp/current/spark2-client/conf/hive-site.xml,testconnection.properties --name Spark_APP --conf spark.executor.extraClassPath=/home/hdpusr/jars/greenplum.jar sparkload_2.11-0.1.jar --dbname ORACLE
In Java, there are classes in the org.apache.commons.cli package which can be used to do the same:
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;

public static void main(String[] args) {
    Options options = new Options();
    Option input = new Option("s", "ssn", true, "source system names");
    input.setRequired(false);
    options.addOption(input);
    CommandLineParser parser = new DefaultParser();
    HelpFormatter formatter = new HelpFormatter();
    CommandLine cmd = null;
    try {
        cmd = parser.parse(options, args);
        if (cmd.hasOption("s")) { // Checks if there is an argument '-s'/'--ssn' in the CLI. Runs the Recon only for the received SSNs.
        }
    } catch (ParseException e) {
        formatter.printHelp("utility-name", options);
        e.printStackTrace();
        System.exit(1);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Could anyone let me know if it is possible to name the command-line arguments and parse them accordingly?

If you use --dbname=ORACLE, for example:
import com.typesafe.config.ConfigException

val pattern = """--dbname=(.*)""".r
val params = args.map {
  case pattern(value) => value
  case arg => throw new ConfigException.Generic(s"""unable to parse command-line argument "$arg"""")
}
\s matches whitespace; you could use it to support the --dbname ORACLE form as well, but it's easier if you just stick to the single --dbname=ORACLE string.
Here you can see all the possibilities.
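Putting this approach together, here is a minimal, self-contained sketch (the object name, the Option-based return type, and the sys.error fallback are my own choices for illustration, not from the original answer):

```scala
// Minimal sketch: recognize a --dbname=<value> argument among the program args.
object ArgParseSketch {
  private val DbName = """--dbname=(.*)""".r

  // Returns Some(value) for the first --dbname=... argument, None if absent.
  def dbName(args: Array[String]): Option[String] =
    args.collectFirst { case DbName(value) => value }

  def main(args: Array[String]): Unit = {
    val dbtype = dbName(args).getOrElse(
      sys.error("missing required argument --dbname=<value>"))
    println(s"dbtype = $dbtype")
  }
}
```

Everything after the application jar on the spark-submit command line is passed through to main unchanged, so sparkload_2.11-0.1.jar --dbname=ORACLE would make dbName return Some("ORACLE") here.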

If we are not particular about the key name, we can prefix the key with spark., in this case spark.dbname, and pass it as a conf argument like spark-submit --conf spark.dbname=<> ..., or add it to spark-defaults.conf.
In the user code, we can access the key as sparkContext.getConf.get("spark.dbname")
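For example (a sketch; the spark session value and the "ORACLE" default fallback are assumptions on my part):

```scala
// Submitted as: spark-submit --conf spark.dbname=ORACLE ... app.jar
// Custom --conf keys must start with "spark." or spark-submit ignores them
// with a warning.
val dbtype: String = spark.sparkContext.getConf.get("spark.dbname", "ORACLE")
```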

Related

Update/Replace value in Mongo Database using Mongo Spark Connector (Pyspark) v10x

I am using the following versions. Details:
mongo-spark-connector:10.0.5
Spark version 3.1.3
And I configure the spark-mongo-connector as follows:
spark = SparkSession.builder \
    .appName("hello") \
    .master("yarn") \
    .config("spark.executor.memory", "4g") \
    .config('spark.driver.memory', '2g') \
    .config('spark.driver.cores', '4') \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector:10.0.5') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.27.1.jar') \
    .enableHiveSupport() \
    .getOrCreate()
I want to ask how to update and replace a value in a Mongo database.
I read the question Updating mongoData with MongoSpark, but that works for mongo-spark v2.x; with mongo-spark v10 and above it fails.
Example:
I have these following attributes:
from bson.objectid import ObjectId

data = {
    '_id' : ObjectId("637367d5262dc89a8e318d09"),
    'database' : database_name,
    "table" : table,
    "latestSyncAt": lastestSyncAt,
    "lastest_id" : str(lastest_id)
}
df = spark.createDataFrame([data])  # createDataFrame expects a list of rows, not a bare dict
How do I update or replace _id attribute value in Mongo Database by using Mongo-spark-connector?
Thank you very much for your support.

PySpark-streaming: How to access files sent using --files

I am running a pyspark-streaming client with Kafka. I want to send files to the cluster.
I am using the --files option:
spark-submit --master yarn \
--deploy-mode client \
--jars "/home/aiman/testing_aiman/spark-sql-kafka-0-10_2.11-2.4.0-cdh6.3.4.jar" \
--files /home/aiman/testing_aiman/kafka.keystore.uat.jks#keystore.jks,/home/aiman/testing_aiman/kafka.truststore.uat.jks#truststore.jks \
sparkStreamingTest.py
and trying to access the files using SparkFiles.get():
from pyspark.sql import SparkSession
from pyspark import SparkFiles
spark = SparkSession.builder.appName("Test Streaming").getOrCreate()
# Get the Keystore File and Truststore File
keystore = str(SparkFiles.get('keystore.jks'))
truststore = str(SparkFiles.get('truststore.jks'))
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka.server.com:9093") \
    .option("subscribe", "TEST_TOPIC") \
    .option("startingOffsets", "earliest") \
    .option("kafka.security.protocol", "SSL") \
    .option("kafka.ssl.keystore.location", keystore) \
    .option("kafka.ssl.keystore.password", "abcd") \
    .option("kafka.ssl.key.password", "abcd") \
    .option("kafka.ssl.truststore.type", "JKS") \
    .option("kafka.ssl.truststore.location", truststore) \
    .option("kafka.ssl.truststore.password", "abcd") \
    .option("kafka.ssl.enabled.protocols", "TLSv1") \
    .option("kafka.ssl.endpoint.identification.algorithm", "") \
    .load()
...
but I am still getting a NoSuchFileException:
Caused by: org.apache.kafka.common.KafkaException: Failed to load SSL keystore /tmp/spark-4578a498-f96d-4c8a-a716-e128d90531fb/userFiles-5792bc5c-d513-4aa3-9014-26df66ace1db/keystore.jks of type JKS
at org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.load(SslFactory.java:357)
at org.apache.kafka.common.security.ssl.SslFactory.createSSLContext(SslFactory.java:240)
at org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:141)
... 55 more
Caused by: java.nio.file.NoSuchFileException: /tmp/spark-4578a498-f96d-4c8a-a716-e128d90531fb/userFiles-5792bc5c-d513-4aa3-9014-26df66ace1db/keystore.jks
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
at java.nio.file.Files.newInputStream(Files.java:152)
at org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.load(SslFactory.java:350)
... 57 more
Where am I going wrong?
Instead of using the SparkFiles.get() method to get the absolute path, use the file names directly, and remove the #keystore.jks and #truststore.jks fragments from the --files option in the spark-submit command:
spark-submit --master yarn \
--deploy-mode client \
--jars "/home/aiman/testing_aiman/spark-sql-kafka-0-10_2.11-2.4.0-cdh6.3.4.jar" \
--files /home/aiman/testing_aiman/kafka.keystore.uat.jks,/home/aiman/testing_aiman/kafka.truststore.uat.jks \
sparkStreamingTest.py
Using the actual file names:
# Commenting out the SparkFiles.get() calls
# keystore = str(SparkFiles.get('keystore.jks'))
# truststore = str(SparkFiles.get('truststore.jks'))
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka.server.com:9093") \
    .option("subscribe", "TEST_TOPIC") \
    .option("startingOffsets", "earliest") \
    .option("kafka.security.protocol", "SSL") \
    .option("kafka.ssl.keystore.location", "kafka.keystore.uat.jks") \
    .option("kafka.ssl.keystore.password", "abcd") \
    .option("kafka.ssl.key.password", "abcd") \
    .option("kafka.ssl.truststore.type", "JKS") \
    .option("kafka.ssl.truststore.location", "kafka.truststore.uat.jks") \
    .option("kafka.ssl.truststore.password", "abcd") \
    .option("kafka.ssl.enabled.protocols", "TLSv1") \
    .option("kafka.ssl.endpoint.identification.algorithm", "") \
    .load()

Read / Write data from HBase using Pyspark

I am trying to read data from HBase using Pyspark and I am getting many weird errors. Below is a sample snippet of my code.
Please suggest a solution.
empdata = ''.join("""
{
    'table': {
        'namespace': 'default',
        'name': 'emp'
    },
    'rowkey': 'key',
    'columns': {
        'emp_id': {'cf': 'rowkey', 'col': 'key', 'type': 'string'},
        'emp_name': {'cf': 'personal data', 'col': 'name', 'type': 'string'}
    }
}
""".split())
df = sqlContext \
    .read \
    .options(catalog=empdata) \
    .format('org.apache.spark.sql.execution.datasources.hbase') \
    .load()
df.show()
I have used the below versions:
HBase 2.1.6, Pyspark 2.3.2, Hadoop 3.1
I ran the code as follows:
pyspark --master local --packages com.hortonworks:shc-core:1.1.1-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/conf/hbase-site.xml
The error is:
An error occurred while calling o71.load. : java.lang.NoClassDefFoundError: org/apache/spark/Logging

Dataflow template is not taking input parameters

I have a Dataflow template created with the below command:
mvn compile exec:java \
-Dexec.mainClass=com.StarterPipeline \
-Dexec.args="--runner=DataflowRunner \
--project=jason-qa \
--stagingLocation=gs://jason_test/dataflow/staging \
--region=asia-east1 \
--zone=asia-east1-a \
--subnetwork=regions/asia-east1/subnetworks/dmz-asia-east1 \
--templateLocation=gs://jason_test/dataflow/Template \
--campaignId="
I execute the template with the below command:
gcloud dataflow jobs run jason203 \
--project=jason-qa \
--region=asia-east1 \
--gcs-location gs://jason_test/dataflow/Template \
--parameters campaignId=run-test323,output=gs://jason_test/dataflow/counts
The code is copied from the Count example with a few changes:
public interface MyOptions extends PipelineOptions {
    @Description("campaign id")
    @Required
    @Default.String("default-test123")
    ValueProvider<String> getCampaignId();
    void setCampaignId(ValueProvider<String> campaignId);
}
static void run(MyOptions options) {
    Pipeline p = Pipeline.create(options);
    String id = options.getCampaignId().get();
    p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
        .apply(new Count())
        .apply(MapElements.via(new FormatAsTextFn()))
        .apply("WriteCounts", TextIO.write().to(options.getOutput() + id));
    p.run();
}
From the Dataflow job summary -> Pipeline options, I can find the info: campaignId run-test323.
But the resulting file name in the bucket is -00000-of-00003 (it should be run-test323-00000-of-00003).
You should remove the .get() call in the code; then it will work. At template-construction time the runtime value of a ValueProvider is not available yet, so the provider itself must be passed to the transform and resolved at execution time.

Connecting to DSE Graph running on a Docker Container results into No host found

I am running my DSE Graph inside a Docker container with this command:
docker run -e DS_LICENSE=accept -p 9042:9042 --name my-dse -d datastax/dse-server -g -k -s
In my Scala code, I reference it as follows:
object GrDbConnection {
  val dseCluster = DseCluster.builder()
    .addContactPoints("127.0.0.1").withPort(9042)
    .build()
  val graphName = "graphName"
  val graphOptions = new GraphOptions()
    .setGraphName(graphName)
  var graph: ScalaGraph = null
  try {
    val session = dseCluster.connect()
    // The following uses the DSE graph schema API, which is currently only supported by the string-based
    // execution interface. Eventually there will be a programmatic API for making schema changes, but until
    // then this needs to be used.
    // Create graph
    session.executeGraph("system.graph(name).ifNotExists().create()", ImmutableMap.of("name", graphName))
    // Clear the schema to drop any existing data and schema
    session.executeGraph(new SimpleGraphStatement("schema.clear()").setGraphName(graphName))
    // Note: typically you would not want to use development mode and allow scans, but it is good for convenience
    // and experimentation during development.
    // Enable development mode and allow scans
    session.executeGraph(new SimpleGraphStatement("schema.config().option('graph.schema_mode').set('development')")
      .setGraphName(graphName))
    session.executeGraph(new SimpleGraphStatement("schema.config().option('graph.allow_scan').set('true')")
      .setGraphName(graphName))
    // Create a ScalaGraph from a remote Traversal Source using withRemote
    // See: http://tinkerpop.apache.org/docs/current/reference/#connecting-via-remotegraph for more details
    val connection = DseRemoteConnection.builder(session)
      .withGraphOptions(graphOptions)
      .build()
    graph = EmptyGraph.instance().asScala
      .configure(_.withRemote(connection))
  } finally {
    dseCluster.close()
  }
}
Then, in one of my controllers, I use this to invoke a query against the DSE graph:
def test = Action {
  val r = GrDbConnection.graph.V().count()
  print(r.iterate())
  Ok
}
This returns an error:
play.api.http.HttpErrorHandlerExceptions$$anon$1: Execution exception[[NoHostAvailableException: All host(s) tried for query failed (no host was tried)]]
at play.api.http.HttpErrorHandlerExceptions$.throwableToUsefulException(HttpErrorHandler.scala:251)
at play.api.http.DefaultHttpErrorHandler.onServerError(HttpErrorHandler.scala:178)
at play.core.server.AkkaHttpServer$$anonfun$1.applyOrElse(AkkaHttpServer.scala:363)
at play.core.server.AkkaHttpServer$$anonfun$1.applyOrElse(AkkaHttpServer.scala:361)
at scala.concurrent.Future.$anonfun$recoverWith$1(Future.scala:413)
at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:37)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)
at com.datastax.driver.core.RequestHandler.reportNoMoreHosts(RequestHandler.java:221)
at com.datastax.driver.core.RequestHandler.access$1000(RequestHandler.java:41)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.findNextHostAndQuery(RequestHandler.java:292)
at com.datastax.driver.core.RequestHandler.startNewExecution(RequestHandler.java:109)
at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:89)
at com.datastax.driver.core.SessionManager.executeAsync(SessionManager.java:124)
at com.datastax.driver.dse.DefaultDseSession.executeGraphAsync(DefaultDseSession.java:123)
at com.datastax.dse.graph.internal.DseRemoteConnection.submitAsync(DseRemoteConnection.java:74)
at org.apache.tinkerpop.gremlin.process.remote.traversal.step.map.RemoteStep.promise(RemoteStep.java:89)
at org.apache.tinkerpop.gremlin.process.remote.traversal.step.map.RemoteStep.processNextStart(RemoteStep.java:65)
It turns out all I needed to do was to remove
finally {
  dseCluster.close()
}
The finally block was closing the cluster connection as soon as GrDbConnection was initialized, so no host was available by the time the controller ran its query.