I use the following test case to write data to a PostgreSQL table, and it works fine.
test("SparkSQLTest") {
val session = SparkSession.builder().master("local").appName("SparkSQLTest").getOrCreate()
val url = "jdbc:postgresql://dbhost:12345/db1"
val table = "schema1.table1"
val props = new Properties()
props.put("user", "user123")
props.put("password", "pass#123")
props.put(JDBCOptions.JDBC_DRIVER_CLASS, "org.postgresql.Driver")
session.range(300, 400).write.mode(SaveMode.Append).jdbc(url, table, props)
}
Then I use spark-sql -f sql_script_file.sql with the following script to write Hive data into a PostgreSQL table.
CREATE OR REPLACE TEMPORARY VIEW tmp_v1
USING org.apache.spark.sql.jdbc
OPTIONS (
  driver 'org.postgresql.Driver',
  url 'jdbc:postgresql://dbhost:12345/db1',
  dbtable 'schema1.table2',
  user 'user123',
  password 'pass#123',
  batchsize '2000'
);

INSERT INTO tmp_v1
SELECT
  name,
  age
FROM test.person; -- test.person is the Hive db.table
But when I run the above script with spark-sql -f sql_script.sql, it complains that the PostgreSQL user/password is invalid; the exception is shown below. I think the two methods are basically the same, so I would like to ask where the problem is. Thanks.
org.postgresql.util.PSQLException: FATAL: Invalid username/password,login denied.
at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:375)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:189)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:64)
at org.postgresql.jdbc2.AbstractJdbc2Connection.<init>(AbstractJdbc2Connection.java:124)
at org.postgresql.jdbc3.AbstractJdbc3Connection.<init>(AbstractJdbc3Connection.java:28)
at org.postgresql.jdbc3g.AbstractJdbc3gConnection.<init>(AbstractJdbc3gConnection.java:20)
at org.postgresql.jdbc4.AbstractJdbc4Connection.<init>(AbstractJdbc4Connection.java:30)
at org.postgresql.jdbc4.Jdbc4Connection.<init>(Jdbc4Connection.java:22)
at org.postgresql.Driver.makeConnection(Driver.java:392)
at org.postgresql.Driver.connect(Driver.java:266)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:59)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:50)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:58)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:114)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:76)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:59)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:57)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:75)
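As a sanity check (my own suggestion, not part of the original post), it can help to verify the credentials outside Spark with plain JDBC, so that a quoting or escaping problem in the SQL script can be separated from a genuinely wrong password. A minimal sketch, reusing the connection details from the question:

// Plain JDBC connection test, independent of Spark and of how spark-sql parses the OPTIONS clause.
import java.sql.DriverManager
import java.util.Properties

val props = new Properties()
props.put("user", "user123")
props.put("password", "pass#123")
Class.forName("org.postgresql.Driver")
val conn = DriverManager.getConnection("jdbc:postgresql://dbhost:12345/db1", props)
println(conn.getMetaData.getDatabaseProductVersion)
conn.close()

If this succeeds, the credentials themselves are fine, and the problem is more likely in how the OPTIONS values (for example the '#' in the password) are being passed through the SQL script.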
While executing insert and update commands against Oracle 11g, I'm getting an error.
val stmt = con.createStatement()
//Insert
val query1 = "insert into audit values('D','abc','T','01-NOV-18','Inprogress')"
stmt.executeUpdate(query1)
//Update
val query2 = "Update audit Set status='test' where where product= = 'D'"
stmt.executeUpdate(query2)
This is the error:
//Error while updating record
Exception in thread "main" java.sql.SQLSyntaxErrorException: ORA-00936: missing expression
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:447)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:396)
at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:951)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:513)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:227)
at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:531)
at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:195)
at oracle.jdbc.driver.T4CStatement.executeForRows(T4CStatement.java:1036)
at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1336)
at oracle.jdbc.driver.OracleStatement.executeUpdateInternal(OracleStatement.java:1845)
at oracle.jdbc.driver.OracleStatement.executeUpdate(OracleStatement.java:1810)
at oracle.jdbc.driver.OracleStatementWrapper.executeUpdate(OracleStatementWrapper.java:294)
at com.oracle.OracleConnection$.main(OracleConnection.scala:21)
at com.oracle.OracleConnection.main(OracleConnection.scala)
Process finished with exit code 1
Any help will be appreciated. Thanks in advance.
You have two where keywords and two = signs in your update query. It should be:
val query2 = "Update audit Set status='test' where product='D'"
I have this csv file which contains description of several cities:
Cities_information_extract.csv
I can parse this file just fine using Python's pandas.read_csv or R's read.csv. Both return 693 rows and 25 columns.
I am trying, unsuccessfully, to load the CSV using Spark 1.6.0 and Scala.
For this I am using spark-csv and commons-csv (which I have included in the Spark jars path).
This is what I have tried:
var cities_info = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "false").
  option("delimiter", ";").
  load("path/to/Cities_information_extract.csv")
cities_info.count()
// ERROR
java.io.IOException: (line 1) invalid char between encapsulated token and delimiter
at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:275)
at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:498)
at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:365)
Then I tried using the univocity parser:
var cities_info = sqlContext.read.
  format("com.databricks.spark.csv").
  option("parserLib", "univocity").
  option("header", "true").
  option("inferSchema", "false").
  option("delimiter", ";").
  load("path/to/Cities_information_extract.csv")
cities_info.count()
// ERROR
Caused by: java.lang.ArrayIndexOutOfBoundsException: 100000
at com.univocity.parsers.common.input.DefaultCharAppender.append(DefaultCharAppender.java:103)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:115)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:108)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
[...]
Inspecting the file, I noticed several HTML tags in the description fields, with embedded quotes, like:
<div id="someID">
I tried, using Python, to remove all HTML tags with a regular expression:
import io
import re

pattern = re.compile("<[^>]*>")  # find all html tags <..>

with io.open("Cities_information_extract.csv", "r", encoding="utf-8") as infile:
    text = infile.read()

text = re.sub(pattern, " ", text)  # replace each tag with a space

with io.open("cities_info_clean.csv", "w", encoding="utf-8") as outfile:
    outfile.write(text)
Next I tried again with the new file, without the HTML tags:
var cities_info = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "false").
  option("delimiter", ";").
  load("path/to/cities_info_clean.csv")
cities_info.count()
// ERROR
java.io.IOException: (startline 1) EOF reached before encapsulated token finished
at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:282)
at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:498)
at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:365)
at com.databricks.spark.csv.CsvRelation$$anonfun$com$databricks$spark$csv$CsvRelation$$parseCSV$1.apply(CsvRelation.scala:304)
[...]
And with the univocity parser:
var cities_info = sqlContext.read.
  format("com.databricks.spark.csv").
  option("parserLib", "univocity").
  option("header", "true").
  option("inferSchema", "false").
  option("delimiter", ";").
  load("path/to/cities_info_clean.csv")
cities_info.count()
// ERROR
Caused by: java.lang.ArrayIndexOutOfBoundsException: 100000
at com.univocity.parsers.common.input.DefaultCharAppender.append(DefaultCharAppender.java:103)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:115)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:108)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:159)
[...]
Both Python and R are able to parse both files correctly, while spark-csv still fails. Any suggestion for correctly parsing this CSV file using Spark/Scala?
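One thing that may be worth trying (a hedged sketch on my part, not a verified fix for this particular file): since the description fields contain embedded quotes, explicitly telling spark-csv which characters act as quote and escape, and letting it drop rows it still cannot parse, sometimes gets past this kind of error:

var cities_info = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "false").
  option("delimiter", ";").
  option("quote", "\"").           // character that encloses fields
  option("escape", "\"").          // treat a doubled quote inside a quoted field as escaped
  option("mode", "DROPMALFORMED"). // skip rows the parser still rejects instead of failing
  load("path/to/Cities_information_extract.csv")
cities_info.count()

The quote, escape and mode options are documented spark-csv options; whether these particular values match how the file was actually written is an assumption.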
I am new to Spark/Scala development. I am using Maven to build my project and my IDE is IntelliJ. I am trying to query a Hive table and then iterate over the resulting DataFrame (using foreach). Here's my code:
try {
  val DF_1 = hiveContext.sql(
    "select distinct(address) from test_table where trim(address) != ''")
  println("number of rows: " + DF_1.count)
  DF_1.foreach(x => {
    val y = hiveContext.sql(
      "select place from test_table where address = '" + x(0).toString + "'")
    if (y.count > 1) {
      println("Multiple place values for address: " + x(0).toString)
      y.foreach(r => println(r))
      println("*************")
    }
  })
} catch {
  case e: Exception => e.printStackTrace()
}
With each iteration, I am querying the same table to get another column, trying to see whether there are multiple place values for each address in test_table. I have no compilation errors and the application builds successfully. But when I run the above code, I get the following error:
java.lang.NoClassDefFoundError: Could not initialize class xxxxxxxx
The application launches successfully, prints the count of rows in DF_1, and then fails with the above error at the foreach loop. I did a jar xvf on my jar and can see the main class, driver.class:
com/.../driver$$anonfun$1$$anonfun$apply$1.class
com/.../driver$$anonfun$1.class
com/.../driver$$anonfun$2.class
com/.../driver$$anonfun$3.class
com/.../driver$$anonfun$4.class
com/.../driver$$anonfun$5.class
com/.../driver$$anonfun$main$1$$anonfun$apply$1.class
com/.../driver$$anonfun$main$1$$anonfun$apply$2.class
com/.../driver$$anonfun$main$1$$anonfun$apply$3.class
com/.../driver$$anonfun$main$1.class
com/.../driver$$anonfun$main$10$$anonfun$apply$9.class
com/.../driver$$anonfun$main$10.class
com/.../driver$$anonfun$main$11.class
com/.../driver$$anonfun$main$12.class
com/.../driver$$anonfun$main$13.class
com/.../driver$$anonfun$main$14.class
com/.../driver$$anonfun$main$15.class
com/.../driver$$anonfun$main$16.class
com/.../driver$$anonfun$main$17.class
com/.../driver$$anonfun$main$18.class
com/.../driver$$anonfun$main$19.class
com/.../driver$$anonfun$main$2$$anonfun$apply$4.class
com/.../driver$$anonfun$main$2$$anonfun$apply$5.class
com/.../driver$$anonfun$main$2$$anonfun$apply$6.class
com/.../driver$$anonfun$main$2.class
com/.../driver$$anonfun$main$20.class
com/.../driver$$anonfun$main$21.class
com/.../driver$$anonfun$main$22.class
com/.../driver$$anonfun$main$23.class
com/.../driver$$anonfun$main$3$$anonfun$apply$7.class
com/.../driver$$anonfun$main$3$$anonfun$apply$8.class
com/.../driver$$anonfun$main$3.class
com/.../driver$$anonfun$main$4$$anonfun$apply$9.class
com/.../driver$$anonfun$main$4.class
com/.../driver$$anonfun$main$5.class
com/.../driver$$anonfun$main$6$$anonfun$apply$1.class
com/.../driver$$anonfun$main$6$$anonfun$apply$2.class
com/.../driver$$anonfun$main$6$$anonfun$apply$3.class
com/.../driver$$anonfun$main$6$$anonfun$apply$4.class
com/.../driver$$anonfun$main$6$$anonfun$apply$5.class
com/.../driver$$anonfun$main$6.class
com/.../driver$$anonfun$main$7$$anonfun$apply$1.class
com/.../driver$$anonfun$main$7$$anonfun$apply$2.class
com/.../driver$$anonfun$main$7$$anonfun$apply$3.class
com/.../driver$$anonfun$main$7$$anonfun$apply$4.class
com/.../driver$$anonfun$main$7$$anonfun$apply$5.class
com/.../driver$$anonfun$main$7$$anonfun$apply$6.class
com/.../driver$$anonfun$main$7$$anonfun$apply$7.class
com/.../driver$$anonfun$main$7$$anonfun$apply$8.class
com/.../driver$$anonfun$main$7.class
com/.../driver$$anonfun$main$8$$anonfun$apply$10.class
com/.../driver$$anonfun$main$8$$anonfun$apply$4.class
com/.../driver$$anonfun$main$8$$anonfun$apply$5.class
com/.../driver$$anonfun$main$8$$anonfun$apply$6.class
com/.../driver$$anonfun$main$8$$anonfun$apply$7.class
com/.../driver$$anonfun$main$8$$anonfun$apply$8.class
com/.../driver$$anonfun$main$8$$anonfun$apply$9.class
com/.../driver$$anonfun$main$8.class
com/.../driver$$anonfun$main$9$$anonfun$apply$11.class
com/.../driver$$anonfun$main$9$$anonfun$apply$7.class
com/.../driver$$anonfun$main$9$$anonfun$apply$8.class
com/.../driver$$anonfun$main$9$$anonfun$apply$9.class
com/.../driver$$anonfun$main$9.class
com/.../driver$.class
com/.../driver.class
I am not facing the error when I launch the job in local mode instead of YARN. What is causing the issue and how can it be corrected?
Any help would be appreciated. Thank you.
It looks like your jar or some of its dependencies aren't being distributed to the worker nodes. In local mode it works because the jars are already in place; in YARN mode you need to build a fat jar with all dependencies, including the Hive and Spark libraries, in it.
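For reference (a sketch under the assumption that the project follows standard Maven conventions, since the question mentions Maven; the plugin version is only an example), one common way to produce such a fat jar is the maven-shade-plugin:

<!-- pom.xml, inside <build><plugins> -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.4</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
</plugin>

Running mvn package then produces a shaded jar that can be submitted to YARN with spark-submit.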
I'm trying to connect to IBM Cloud Object Storage from IBM Data Science Experience:
access_key = 'XXX'
secret_key = 'XXX'
bucket = 'mybucket'
host = 'lon.ibmselect.objstor.com'
service = 'mycos'
sqlCxt = SQLContext(sc)
hconf = sc._jsc.hadoopConfiguration()
hconf.set('fs.cos.myCos.access.key', access_key)
hconf.set('fs.cos.myCos.endpoint', 'http://' + host)
hconf.set('fs.cose.myCos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
obj = 'mydata.tsv.gz'
rdd = sc.textFile('cos://{0}.{1}/{2}'.format(bucket, service, obj))
print(rdd.count())
This returns:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: No FileSystem for scheme: cos
I'm guessing I need to use the 'cos' scheme based on the Stocator docs. However, the error suggests Stocator isn't available, or is an old version?
Any ideas?
Update 1:
I have also tried the following:
sqlCxt = SQLContext(sc)
hconf = sc._jsc.hadoopConfiguration()
hconf.set('fs.cos.impl', 'com.ibm.stocator.fs.ObjectStoreFileSystem')
hconf.set('fs.stocator.scheme.list', 'cos')
hconf.set('fs.stocator.cos.impl', 'com.ibm.stocator.fs.cos.COSAPIClient')
hconf.set('fs.stocator.cos.scheme', 'cos')
hconf.set('fs.cos.mycos.access.key', access_key)
hconf.set('fs.cos.mycos.endpoint', 'http://' + host)
hconf.set('fs.cos.mycos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
service = 'mycos'
obj = 'mydata.tsv.gz'
rdd = sc.textFile('cos://{0}.{1}/{2}'.format(bucket, service, obj))
print(rdd.count())
However, this time the response was:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: No object store for: cos
at com.ibm.stocator.fs.ObjectStoreVisitor.getStoreClient(ObjectStoreVisitor.java:121)
...
Caused by: java.lang.ClassNotFoundException: com.ibm.stocator.fs.cos.COSAPIClient
The latest version of Stocator (v1.0.9), which supports the fs.cos scheme, is not yet deployed on Spark as a Service (it will be soon). Please use the Stocator scheme "fs.s3d" to connect to your COS.
Example:
endpoint = 'endpointXXX'
access_key = 'XXX'
secret_key = 'XXX'
prefix = "fs.s3d.service"
hconf = sc._jsc.hadoopConfiguration()
hconf.set(prefix + ".endpoint", endpoint)
hconf.set(prefix + ".access.key", access_key)
hconf.set(prefix + ".secret.key", secret_key)
bucket = 'mybucket'
obj = 'mydata.tsv.gz'
rdd = sc.textFile('s3d://{0}.service/{1}'.format(bucket, obj))
rdd.count()
Alternatively, you can use ibmos2spark. The lib is already installed on our service. Example:
import ibmos2spark
credentials = {
    'endpoint': 'endpointXXXX',
    'access_key': 'XXXX',
    'secret_key': 'XXXX'
}
configuration_name = 'os_configs' # any string you want
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name)
bucket = 'mybucket'
obj = 'mydata.tsv.gz'
rdd = sc.textFile(cos.url(obj, bucket))
rdd.count()
Stocator is on the classpath for Spark 2.0 and 2.1 kernels, but the cos scheme is not configured. You can access the config by executing the following in a Python notebook:
!cat $SPARK_CONF_DIR/core-site.xml
Look for the property fs.stocator.scheme.list. What I currently see is:
<property>
<name>fs.stocator.scheme.list</name>
<value>swift2d,swift,s3d</value>
</property>
I recommend that you raise a feature request against DSX to support the cos scheme.
It looks like the cos driver is not properly initialized. Try this configuration:
hconf.set('fs.cos.impl', 'com.ibm.stocator.fs.ObjectStoreFileSystem')
hconf.set('fs.stocator.scheme.list', 'cos')
hconf.set('fs.stocator.cos.impl', 'com.ibm.stocator.fs.cos.COSAPIClient')
hconf.set('fs.stocator.cos.scheme', 'cos')
hconf.set('fs.cos.mycos.access.key', access_key)
hconf.set('fs.cos.mycos.endpoint', 'http://' + host)
hconf.set('fs.cos.mycos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
UPDATE 1:
You also need to ensure the Stocator classes are on the classpath. You can use the packages system by executing pyspark in the following way:
./bin/pyspark --packages com.ibm.stocator:stocator:1.0.24
This works with both the swift2d and cos schemes.
UPDATE 2:
Just follow the Stocator documentation (https://github.com/CODAIT/stocator). It contains all the details on how to install it, which branch to use, etc.
I found the same issue, and to solve it I just changed environments:
Within IBM Watson Studio, if you start a Jupyter notebook in an environment without a pre-configured Spark cluster, then you get that error. Installing PySpark is not enough.
If instead you start a notebook with a Spark cluster available, you will be fine.
You have to set .config("spark.hadoop.fs.stocator.scheme.list", "cos") along with some other fs.cos... configurations.
Here's an end-to-end code example that works (tested with pyspark==2.3.2 and Python 3.7.3):
from pyspark.sql import SparkSession
stocator_jar = '/path/to/stocator-1.1.2-SNAPSHOT-IBM-SDK.jar'
cos_instance_name = '<myCosInstanceName>'
bucket_name = '<bucketName>'
s3_region = '<region>'
cos_iam_api_key = '*******'
iam_service_id = 'crn:v1:bluemix:public:iam-identity::<****************>'
spark_builder = (
    SparkSession
    .builder
    .appName('test_app'))
spark_builder.config('spark.driver.extraClassPath', stocator_jar)
spark_builder.config('spark.executor.extraClassPath', stocator_jar)
spark_builder.config(f"fs.cos.{cos_instance_name}.iam.api.key", cos_iam_api_key)
spark_builder.config(f"fs.cos.{cos_instance_name}.endpoint", f"s3.{s3_region}.cloud-object-storage.appdomain.cloud")
spark_builder.config(f"fs.cos.{cos_instance_name}.iam.service.id", iam_servicce_id)
spark_builder.config("spark.hadoop.fs.stocator.scheme.list", "cos")
spark_builder.config("spark.hadoop.fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")
spark_builder.config("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient")
spark_builder.config("fs.stocator.cos.scheme", "cos")
spark_sess = spark_builder.getOrCreate()
dataset = spark_sess.range(1, 10)
dataset = dataset.withColumnRenamed('id', 'user_idx')
dataset.repartition(1).write.csv(
    f'cos://{bucket_name}.{cos_instance_name}/test.csv',
    mode='overwrite',
    header=True)
spark_sess.stop()
print('done!')