SAP HANA Vora failing to see the table contents in Scala - scala

I am failing to see the data in Scala using Vora.
VORA: 1.2
Spark: 1.5.2 / Spark controller: 1.5.8
The HDFS file content shows up fine:
hdfs dfs -cat /user/vora/XXXXXXXX/part-00000
AB05,560
CD06,340
EF07,590
GH08,230
The table shows up fine in the "show datasourcetables" command:
scala> vc.sql(s"""SHOW DATASOURCETABLES USING com.sap.spark.vora""".stripMargin ).show
Output
Showing the table, however, fails in Scala:
scala> vc.sql("select * from VVCSV").show
java.lang.RuntimeException: Table Not Found: VVCSV
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:139)
at org.apache.spark.sql.extension.ExtendableSQLContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(ExtendableSQLContext.scala:52)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
at scala.Option.getOrElse(Option.scala:120)

The show datasourcetables command is deprecated with Vora 1.2 and has been replaced by show tables using com.sap.spark.vora. However, that command only shows what is persisted in the Vora catalog. To load the tables into the current Spark context (e.g. after restarting the spark-shell) you need to run the register tables command:
vc.sql("register all tables using com.sap.spark.vora").show
To check what is in the current Spark context you can use the show tables command (without the 'using' clause). For more details you can check the Vora Developer Guide and the Spark documentation.
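Putting the answer together, a typical sequence after restarting the spark-shell might look like this (a sketch only, assuming vc is the Vora SQL context used in the question):
// load the tables persisted in the Vora catalog into the current Spark context
vc.sql("register all tables using com.sap.spark.vora").show
// list the tables known to the current Spark context (no 'using' clause)
vc.sql("show tables").show
// the original query should now resolve the table
vc.sql("select * from VVCSV").show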

Related

Spark MongoDB Connector unable to df.join - Unspecialised MongoConfig

Using the latest MongoDB connector for Spark (v10) and trying to join two dataframes yields the following unhelpful error.
Py4JJavaError: An error occurred while calling o64.showString.
: java.lang.UnsupportedOperationException: Unspecialised MongoConfig. Use `mongoConfig.toReadConfig()` or `mongoConfig.toWriteConfig()` to specialize
at com.mongodb.spark.sql.connector.config.MongoConfig.getDatabaseName(MongoConfig.java:201)
at com.mongodb.spark.sql.connector.config.MongoConfig.getNamespace(MongoConfig.java:196)
at com.mongodb.spark.sql.connector.MongoTable.name(MongoTable.java:99)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation.name(DataSourceV2Relation.scala:66)
at org.apache.spark.sql.execution.datasources.v2.V2ScanRelationPushDown$$anonfun$pushDownFilters$1.$anonfun$applyOrElse$2(V2ScanRelationPushDown.scala:65)
The PySpark code simply pulls in two tables and runs a join:
dfa = spark.read.format("mongodb").option("uri", "mongodb://127.0.0.1/people.contacts").load()
dfb = spark.read.format("mongodb").option("uri", "mongodb://127.0.0.1/people.accounts").load()
dfa.join(dfb, 'PKey').count()
SQL gives the same error:
dfa.createOrReplaceTempView("usr")
dfb.createOrReplaceTempView("ast")
spark.sql("SELECT count(*) FROM ast JOIN usr on usr._id = ast._id").show()
Document structures are flat.
Have you tried using the latest version (10.0.2) of mongo-spark-connector? You can find it here.
I had a similar problem and solved it by replacing 10.0.1 with 10.0.2.
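If you launch through pyspark or spark-submit, one way to pin the newer connector is the --packages flag; a rough sketch (the exact Maven coordinates depend on your Spark and Scala versions, so check Maven Central):
pyspark --packages org.mongodb.spark:mongo-spark-connector:10.0.2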

error in initSerDe : java.lang.ClassNotFoundException class org.apache.hive.hcatalog.data.JsonSerDe not found

I am trying to read data from a Hive table using Spark SQL (Scala) and it throws the following error:
ERROR hive.log: error in initSerDe: java.lang.ClassNotFoundException Class org.apache.hive.hcatalog.data.JsonSerDe not found
java.lang.ClassNotFoundException: Class org.apache.hive.hcatalog.data.JsonSerDe not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2255)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:392)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:274)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:256)
at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:607)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$7.apply(HiveClientImpl.scala:358)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$7.apply(HiveClientImpl.scala:355)
at scala.Option.map(Option.scala:146)
The Hive table is stored as:
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
I added /opt/cloudera/parcels/CDH/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar using :require /opt/cloudera/parcels/CDH/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar and could see "Added to classpath".
I also tried adding the JAR file via SparkSession.config(). Neither approach worked. I checked some answers on Stack Overflow, but they did not help resolve my issue. The full table definition is:
CREATE EXTERNAL TABLE `test.record`(
`test_id` string COMMENT 'from deserializer',
`test_name` string COMMENT 'from deserializer',
`processed_datetime` timestamp COMMENT 'from deserializer'
)
PARTITIONED BY (
`filedate` date)
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
I expect to be able to read the data from the Hive table into a DataFrame, so that
var tempDF = sql("SELECT * FROM test.record WHERE filedate = '2019-06-03' LIMIT 5")
tempDF.show()
should work.
One quick way to solve this is to copy the jar file into Spark's jars directory.
The source file is hive-hcatalog-core-3.1.2.jar from the Hive lib directory; copy it into the jars directory under your Spark installation.
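A sketch of that copy step, assuming $SPARK_HOME points at your Spark installation and the jar sits in Hive's lib directory:
cp /path/to/hive/lib/hive-hcatalog-core-3.1.2.jar $SPARK_HOME/jars/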
I also tried modifying the hive.aux.jars.path config in hive-site.xml, but it doesn't work. If anyone knows of a configuration for Spark to load extra jars, please comment.
Add the required jar as follows rather than copying the jar on all the nodes in the cluster:
In conf/spark-defaults.conf, add the following configs:
spark.driver.extraClassPath /fullpath/hive-hcatalog-core-3.1.2.jar
spark.executor.extraClassPath /fullpath/hive-hcatalog-core-3.1.2.jar
Or, in spark-sql, execute an ADD JAR statement before querying:
ADD JAR /fullpath/hive-hcatalog-core-3.1.2.jar
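If you are working interactively in spark-shell (as in the question), the same two settings can also be passed at launch time instead of editing spark-defaults.conf; a minimal sketch using the jar path from above:
spark-shell --conf spark.driver.extraClassPath=/fullpath/hive-hcatalog-core-3.1.2.jar \
  --conf spark.executor.extraClassPath=/fullpath/hive-hcatalog-core-3.1.2.jar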

Pyspark connection to Postgres database in ipython notebook

I've read previous posts on this, but I still cannot pinpoint why I am unable to connect my ipython notebook to a Postgres db.
I am able to launch pyspark in an ipython notebook, and SparkContext is loaded as 'sc'.
I have the following in my .bash_profile for finding the Postgres driver:
export SPARK_CLASSPATH=/path/to/downloaded/jar
Here's what I am doing in the ipython notebook to connect to the db (based on this post):
from pyspark.sql import DataFrameReader as dfr
sqlContext = SQLContext(sc)
table= 'some query'
url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'username', 'password': 'password'}
df = dfr(sqlContext).jdbc(
url='jdbc:%s' % url, table=table, properties=properties
)
The error:
Py4JJavaError: An error occurred while calling o156.jdbc.
: java.sql.SQLException: No suitable driver.
I understand it's an error with finding the driver I've downloaded, but I don't understand why I am getting this error when I've added the path to it in my .bash_profile.
I also tried to set driver via pyspark --jars, but I get a "no such file or directory" error.
This blogpost also shows how to connect to Postgres data sources, but the following also gives me a "no such directory" error:
./bin/spark-shell --packages org.postgresql:postgresql:42.1.4
Additional info:
spark version: 2.2.0
python version: 3.6
java: 1.8.0_25
postgres driver: 42.1.4
I am not sure why the above answer did not work for me, but I thought I could also share what actually worked for me when running pyspark from a Jupyter notebook (Spark 2.3.1, Python 3.6.3):
from pyspark.sql import SparkSession
spark = SparkSession.builder.config('spark.driver.extraClassPath', '/path/to/postgresql.jar').getOrCreate()
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
They've changed how this works several times in Apache Spark. Looking at my setup, this is what I have in my .bashrc (a.k.a. .bash_profile on Mac), so you could try it:
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/absolute/path/to/your/driver.jar
Edit: I'm using Spark 1.6.1.
And, as always, make sure you use a new shell or source the script so you have the updated envvar (verify with echo $SPARK_CLASSPATH in your shell before you run ipython notebook).
I followed directions in this post. SparkContext is already set as sc for me, so all I had to do was remove the SPARK_CLASSPATH setting from my .bash_profile, and use the following in my ipython notebook:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /path/to/postgresql-42.1.4.jar --jars /path/to/postgresql-42.1.4.jar pyspark-shell'
I added a 'driver' setting to properties as well, and it worked. As stated elsewhere in this post, this is likely because SPARK_CLASSPATH is deprecated and it is preferable to use --driver-class-path.
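For reference, the standard class name for the Postgres JDBC driver is org.postgresql.Driver, so the extra entry in the properties dict would look like 'driver': 'org.postgresql.Driver' alongside the existing 'user' and 'password' keys.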

export data from mongo to hive

My input: a collection ("demo1") in MongoDB (version 3.4.4)
My output: my data imported into a database in Hive ("demo2") (version 1.2.1.2.3.4.7-4)
Purpose: create a connector between Mongo and Hive
Error:
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/mongodb/util/JSON
I tried two solutions, following these steps (but the error remains):
1) I created a local collection in Mongo (via Robomongo) connected to Docker.
2) I uploaded these versions of the jars and added them in Hive:
ADD JAR /home/.../mongo-hadoop-hive-2.0.2.jar;
ADD JAR /home/.../mongo-hadoop-core-2.0.2.jar;
ADD JAR /home/.../mongo-java-driver-3.4.2.jar;
Unfortunately the error doesn't change, so I uploaded these versions instead (I was not sure which versions were right for my export):
ADD JAR /home/.../mongo-hadoop-hive-1.3.0.jar;
ADD JAR /home/.../mongo-hadoop-core-1.3.0.jar;
ADD JAR /home/.../mongo-java-driver-2.13.2.jar;
3) I created an external table:
CREATE EXTERNAL TABLE demo2
(
id INT,
name STRING,
password STRING,
email STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH
SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","name":"name","password":"password","email":"email"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/local.demo1');
Error returned in Hive:
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/mongodb/util/JSON
How can I resolve this problem?
Copying the correct jar files (mongo-hadoop-core-2.0.2.jar, mongo-hadoop-hive-2.0.2.jar, mongo-java-driver-3.2.2.jar) on ALL the nodes of the cluster did the trick for me.
Other points to take care of:
Follow all steps mentioned here religiously - https://github.com/mongodb/mongo-hadoop/wiki/Hive-Usage#installation
Adhere to the requirements given here - https://github.com/mongodb/mongo-hadoop#requirements
Other useful links
https://github.com/mongodb/mongo-hadoop/wiki/FAQ#i-get-a-classnotfoundexceptionnoclassdeffounderror-when-using-the-connector-what-do-i-do
https://groups.google.com/forum/#!topic/mongodb-user/xMVoTSePgg0

HiveContext setting in scala+spark project to access existing HDFS

I am trying to access my existing Hadoop setup from my Spark + Scala project.
Spark Version 1.4.1
Hadoop 2.6
Hive 1.2.1
From the Hive console I am able to create a table and access it without any issue; I can also see the same table from the Hadoop URL.
The problem is that when I try to create a table from the project, the system shows this error:
ERROR Driver: FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:file:/user/hive/warehouse/src is not a directory
or unable to create one)
Following is the code I wrote:
Imports
import org.apache.spark._
import org.apache.spark.sql.hive._
Code
val sparkContext = new SparkContext("local[2]", "HiveTable")
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
hiveContext.setConf("hive.metastore.warehouse.dir", "hdfs://localhost:54310/user/hive/warehouse")
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
Edit:
What if, instead of CREATE TABLE, I had to execute an INSERT statement like:
hiveContext.sql("INSERT INTO TABLE default.src SELECT 'username','password' FROM foo;")
Any help to resolve this issue would be highly appreciated.