This question already has answers here:
How to connect Spark SQL to remote Hive metastore (via thrift protocol) with no hive-site.xml?
Hive 0.14
Spark 1.6
I am trying to connect to a Hive table from Spark programmatically. I have already put my hive-site.xml in Spark's conf folder, but every time I run this code it connects to the underlying default metastore, i.e. Derby. I have googled a lot, and everywhere the suggestion is to put hive-site.xml in the Spark configuration folder, which I have already done. Please suggest a solution; my code is below.
FYI: my existing Hive installation uses MySQL as the metastore.
I am running this code directly from Eclipse, not via the spark-submit utility.
package org.scala.spark

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.hive.HiveContext

object HiveToHdfs {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("HDFS to Local").setMaster("local")
    val sc = new SparkContext(conf)
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    import hiveContext.implicits._
    hiveContext.sql("load data local inpath '/home/cloudera/Documents/emp_table.txt' into table employee")
    sc.stop()
  }
}
Below is my Eclipse error log:
16/11/18 22:09:03 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/11/18 22:09:03 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/11/18 22:09:06 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/11/18 22:09:06 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
**16/11/18 22:09:06 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY**
16/11/18 22:09:06 INFO ObjectStore: Initialized ObjectStore
16/11/18 22:09:06 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/11/18 22:09:06 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/11/18 22:09:07 INFO HiveMetaStore: Added admin role in metastore
16/11/18 22:09:07 INFO HiveMetaStore: Added public role in metastore
16/11/18 22:09:07 INFO HiveMetaStore: No user is added in admin role, since config is empty
16/11/18 22:09:07 INFO HiveMetaStore: 0: get_all_databases
16/11/18 22:09:07 INFO audit: ugi=cloudera ip=unknown-ip-addr cmd=get_all_databases
16/11/18 22:09:07 INFO HiveMetaStore: 0: get_functions: db=default pat=*
16/11/18 22:09:07 INFO audit: ugi=cloudera ip=unknown-ip-addr cmd=get_functions: db=default pat=*
16/11/18 22:09:07 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx------
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:194)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462)
at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461)
at org.apache.spark.sql.UDFRegistration.<init>(UDFRegistration.scala:40)
at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:330)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
at org.scala.spark.HiveToHdfs$.main(HiveToHdfs.scala:15)
at org.scala.spark.HiveToHdfs.main(HiveToHdfs.scala)
Caused by: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx------
at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:612)
at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
... 12 more
16/11/18 22:09:07 INFO SparkContext: Invoking stop() from shutdown hook
Please let me know if any other information is also needed to rectify it.
Check this link -> https://issues.apache.org/jira/browse/SPARK-15118
The metastore might be using the MySQL DB.
The above error comes from this property:
<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/hive</value>
  <description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/<username> is created, with ${hive.scratch.dir.permission}.</description>
</property>
Give write permission to /tmp/hive (for example, hdfs dfs -chmod -R 777 /tmp/hive).
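If the goal is also to make the Eclipse run hit the MySQL-backed metastore rather than Derby, the duplicate linked at the top shows that the metastore URI can be set programmatically when hive-site.xml is not picked up from the IDE classpath. A minimal sketch for Spark 1.6 (the thrift://localhost:9083 address is an assumption; it requires a running Hive metastore service, e.g. started with hive --service metastore, and a writable /tmp/hive as above):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveToHdfs {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("HDFS to Local").setMaster("local")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)
    // Point Spark SQL at the remote (MySQL-backed) metastore service instead of a local Derby one.
    // The host/port below are assumptions; match them to your metastore service.
    hiveContext.setConf("hive.metastore.uris", "thrift://localhost:9083")
    hiveContext.sql("load data local inpath '/home/cloudera/Documents/emp_table.txt' into table employee")
    sc.stop()
  }
}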
Related
I am trying to stream the contents of a local directory to HDFS. This local directory is modified by a script, and new contents are added every 5 seconds. My Spark program should stream the contents of this local directory and save them to HDFS. However, when I start streaming, nothing happens.
I checked the logs but didn't find any hint.
Let me explain the scenario. A shell script moves a file with some data into the local directory every 5 seconds. The batch duration of the streaming context is also 5 seconds. Since the script moves a new file in each time, atomicity is maintained here, if I am not wrong. Every five seconds the receiver should process the data and create a DStream. I searched about streaming local directories and found that the path should be provided as "file:///my/path"; I haven't tried that format yet. But if that is the case, how would the Spark executors on the different nodes maintain a common view of the local path provided?
import org.apache.spark._
import org.apache.spark.streaming._
import java.sql.Timestamp
import java.text.SimpleDateFormat

val ssc = new StreamingContext(sc, Seconds(5))
val filestream = ssc.textFileStream("/home/karteekkhadoop/ch06input")

case class Order(time: java.sql.Timestamp, orderId: Long, clientId: Long, symbol: String, amount: Int, price: Double, buy: Boolean)

val orders = filestream.flatMap(line => {
  val dateFormat = new SimpleDateFormat("yyyy-MM-dd hh:mm:ss")
  val s = line.split(",")
  try {
    assert(s(6) == "B" || s(6) == "S")
    List(Order(new Timestamp(dateFormat.parse(s(0)).getTime()), s(1).toLong, s(2).toLong, s(3), s(4).toInt, s(5).toDouble, s(6) == "B"))
  } catch {
    case e: Throwable =>
      println("Wrong line format (" + e + "): " + line)
      List()
  }
})

val numPerType = orders.map(o => (o.buy, 1L)).reduceByKey((x, y) => x + y)
numPerType.repartition(1).saveAsTextFiles("/user/karteekkhadoop/ch06output/output", "txt")

// the streaming computation has to be started before awaiting termination
ssc.start()
ssc.awaitTermination()
The paths given are absolute and exist. I am also including the following logs:
[karteekkhadoop@gw03 stream]$ yarn logs -applicationId application_1540458187951_12531
18/11/21 11:12:35 INFO client.RMProxy: Connecting to ResourceManager at rm01.itversity.com/172.16.1.106:8050
18/11/21 11:12:35 INFO client.AHSProxy: Connecting to Application History server at rm01.itversity.com/172.16.1.106:10200
Container: container_e42_1540458187951_12531_01_000001 on wn02.itversity.com:45454
LogAggregationType: LOCAL
==================================================================================
LogType:stderr
LogLastModifiedTime:Wed Nov 21 10:52:00 -0500 2018
LogLength:5320
LogContents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hdp01/hadoop/yarn/local/filecache/2693/spark2-hdp-yarn-archive.tar.gz/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.6.5.0-292/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/11/21 10:51:57 INFO SignalUtils: Registered signal handler for TERM
18/11/21 10:51:57 INFO SignalUtils: Registered signal handler for HUP
18/11/21 10:51:57 INFO SignalUtils: Registered signal handler for INT
18/11/21 10:51:57 INFO SecurityManager: Changing view acls to: yarn,karteekkhadoop
18/11/21 10:51:57 INFO SecurityManager: Changing modify acls to: yarn,karteekkhadoop
18/11/21 10:51:57 INFO SecurityManager: Changing view acls groups to:
18/11/21 10:51:57 INFO SecurityManager: Changing modify acls groups to:
18/11/21 10:51:57 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, karteekkhadoop); groups with view permissions: Set(); users with modify permissions: Set(yarn, karteekkhadoop); groups with modify permissions: Set()
18/11/21 10:51:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/11/21 10:51:58 INFO ApplicationMaster: Preparing Local resources
18/11/21 10:51:59 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
18/11/21 10:51:59 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1540458187951_12531_000001
18/11/21 10:51:59 INFO ApplicationMaster: Waiting for Spark driver to be reachable.
18/11/21 10:51:59 INFO ApplicationMaster: Driver now available: gw03.itversity.com:38932
18/11/21 10:51:59 INFO TransportClientFactory: Successfully created connection to gw03.itversity.com/172.16.1.113:38932 after 90 ms (0 ms spent in bootstraps)
18/11/21 10:51:59 INFO ApplicationMaster:
===============================================================================
YARN executor launch context:
env:
CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>/usr/hdp/2.6.5.0-292/hadoop/conf<CPS>/usr/hdp/2.6.5.0-292/hadoop/*<CPS>/usr/hdp/2.6.5.0-292/hadoop/lib/*<CPS>/usr/hdp/current/hadoop-hdfs-client/*<CPS>/usr/hdp/current/hadoop-hdfs-client/lib/*<CPS>/usr/hdp/current/hadoop-yarn-client/*<CPS>/usr/hdp/current/hadoop-yarn-client/lib/*<CPS>/usr/hdp/current/ext/hadoop/*<CPS>$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/2.6.5.0-292/hadoop/lib/hadoop-lzo-0.6.0.2.6.5.0-292.jar:/etc/hadoop/conf/secure:/usr/hdp/current/ext/hadoop/*<CPS>{{PWD}}/__spark_conf__/__hadoop_conf__
SPARK_YARN_STAGING_DIR -> *********(redacted)
SPARK_USER -> *********(redacted)
command:
LD_LIBRARY_PATH="/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64:$LD_LIBRARY_PATH" \
{{JAVA_HOME}}/bin/java \
-server \
-Xmx1024m \
-Djava.io.tmpdir={{PWD}}/tmp \
'-Dspark.history.ui.port=18081' \
'-Dspark.driver.port=38932' \
'-Dspark.port.maxRetries=100' \
-Dspark.yarn.app.container.log.dir=<LOG_DIR> \
-XX:OnOutOfMemoryError='kill %p' \
org.apache.spark.executor.CoarseGrainedExecutorBackend \
--driver-url \
spark://CoarseGrainedScheduler@gw03.itversity.com:38932 \
--executor-id \
<executorId> \
--hostname \
<hostname> \
--cores \
1 \
--app-id \
application_1540458187951_12531 \
--user-class-path \
file:$PWD/__app__.jar \
1><LOG_DIR>/stdout \
2><LOG_DIR>/stderr
resources:
__spark_libs__ -> resource { scheme: "hdfs" host: "nn01.itversity.com" port: 8020 file: "/hdp/apps/2.6.5.0-292/spark2/spark2-hdp-yarn-archive.tar.gz" } size: 202745446 timestamp: 1533325894570 type: ARCHIVE visibility: PUBLIC
__spark_conf__ -> resource { scheme: "hdfs" host: "nn01.itversity.com" port: 8020 file: "/user/karteekkhadoop/.sparkStaging/application_1540458187951_12531/__spark_conf__.zip" } size: 248901 timestamp: 1542815515889 type: ARCHIVE visibility: PRIVATE
===============================================================================
18/11/21 10:51:59 INFO RMProxy: Connecting to ResourceManager at rm01.itversity.com/172.16.1.106:8030
18/11/21 10:51:59 INFO YarnRMClient: Registering the ApplicationMaster
18/11/21 10:51:59 INFO Utils: Using initial executors = 0, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
18/11/21 10:52:00 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
End of LogType:stderr.This log file belongs to a running container (container_e42_1540458187951_12531_01_000001) and so may not be complete.
What is wrong with the code? Please help. Thank you.
You cannot use a local directory like that. As with any Spark reader, the input (and output) storage has to be accessible from each node (driver and executors), and all nodes have to see exactly the same state.
Additionally, please remember that for file-system sources, changes to files have to be atomic (like a file-system move); non-atomic operations (like appending to a file) won't work.
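For illustration, here is a minimal sketch of the same job against a directory that every node can see (the HDFS path below is an assumption; adjust it to your cluster), with files moved into it atomically, e.g. written elsewhere first and then renamed with hdfs dfs -mv:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamSharedDir {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("StreamSharedDir")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Monitor a directory visible to the driver and all executors (HDFS, not a node-local path).
    // New files must appear atomically, i.e. be moved/renamed into the directory, not appended in place.
    val filestream = ssc.textFileStream("hdfs:///user/karteekkhadoop/ch06input")

    // Simple sanity check that batches are arriving.
    filestream.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}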
I have set up the Hive metastore in MySQL, and I can access it through Hive and create databases and tables. If I access a Hive table through spark-shell, I am able to get the table info correctly, fetched from the MySQL Hive metastore. But it does not fetch from MySQL when I execute from Eclipse.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.0.0
/_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@2b9e69fb
scala> sqlContext.sql("show databases");
res0: org.apache.spark.sql.DataFrame = [databaseName: string]
But if I try to access it through Eclipse, it does not point to MySQL; instead it uses Derby. Please see the log and hive-site.xml below for an idea.
Note: hive-site.xml is the same in both the hive/conf and spark/conf paths.
Spark code that is executed from Eclipse:
package events

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql

object DataContent {
  def main(args: Array[String]) {
    val conf = new SparkConf()
    conf.setAppName("Word Count2").setMaster("local")
    val sc = new SparkContext(conf)
    println("Hello to Spark World")

    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    val query = sqlContext.sql("show databases")
    query.collect()

    println("Bye to Spark example2")
  }
}
Spark output log:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/02/17 12:03:15 INFO SparkContext: Running Spark version 2.0.0
18/02/17 12:03:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/02/17 12:03:18 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.189.136 instead (on interface ens33)
18/02/17 12:03:18 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/02/17 12:03:18 INFO SecurityManager: Changing view acls to: vm4learning
18/02/17 12:03:18 INFO SecurityManager: Changing modify acls to: vm4learning
18/02/17 12:03:18 INFO SecurityManager: Changing view acls groups to:
18/02/17 12:03:18 INFO SecurityManager: Changing modify acls groups to:
18/02/17 12:03:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(vm4learning); groups with view permissions: Set(); users with modify permissions: Set(vm4learning); groups with modify permissions: Set()
18/02/17 12:03:20 INFO Utils: Successfully started service 'sparkDriver' on port 45637.
18/02/17 12:03:20 INFO SparkEnv: Registering MapOutputTracker
18/02/17 12:03:20 INFO SparkEnv: Registering BlockManagerMaster
18/02/17 12:03:20 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-37db59bd-c12a-4603-ba9f-8e8fec88cc29
18/02/17 12:03:20 INFO MemoryStore: MemoryStore started with capacity 881.4 MB
18/02/17 12:03:21 INFO SparkEnv: Registering OutputCommitCoordinator
18/02/17 12:03:23 INFO Utils: Successfully started service 'SparkUI' on port 4040.
18/02/17 12:03:23 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.189.136:4040
18/02/17 12:03:23 INFO Executor: Starting executor ID driver on host localhost
18/02/17 12:03:23 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38313.
18/02/17 12:03:23 INFO NettyBlockTransferService: Server created on 192.168.189.136:38313
18/02/17 12:03:23 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.189.136, 38313)
18/02/17 12:03:23 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.189.136:38313 with 881.4 MB RAM, BlockManagerId(driver, 192.168.189.136, 38313)
18/02/17 12:03:23 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.189.136, 38313)
Hello to Spark World
18/02/17 12:03:30 INFO HiveSharedState: Warehouse path is 'file:/home/vm4learning/workspace/Acumen/spark-warehouse'.
18/02/17 12:03:30 INFO SparkSqlParser: Parsing command: show databases
18/02/17 12:03:32 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
18/02/17 12:03:34 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
18/02/17 12:03:34 INFO ObjectStore: ObjectStore, initialize called
18/02/17 12:03:36 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
18/02/17 12:03:36 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
18/02/17 12:03:41 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
18/02/17 12:03:47 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
18/02/17 12:03:47 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
18/02/17 12:03:49 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
18/02/17 12:03:49 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
18/02/17 12:03:49 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery#0" since the connection used is closing
18/02/17 12:03:49 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
18/02/17 12:03:49 INFO ObjectStore: Initialized ObjectStore
18/02/17 12:03:50 INFO HiveMetaStore: Added admin role in metastore
18/02/17 12:03:50 INFO HiveMetaStore: Added public role in metastore
18/02/17 12:03:50 INFO HiveMetaStore: No user is added in admin role, since config is empty
18/02/17 12:03:51 INFO HiveMetaStore: 0: get_all_databases
18/02/17 12:03:51 INFO audit: ugi=vm4learning ip=unknown-ip-addr cmd=get_all_databases
18/02/17 12:03:51 INFO HiveMetaStore: 0: get_functions: db=default pat=*
18/02/17 12:03:51 INFO audit: ugi=vm4learning ip=unknown-ip-addr cmd=get_functions: db=default pat=*
18/02/17 12:03:51 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
18/02/17 12:03:52 INFO SessionState: Created local directory: /tmp/5c825a73-be72-4bd1-8bfc-966d8a095919_resources
18/02/17 12:03:52 INFO SessionState: Created HDFS directory: /tmp/hive/vm4learning/5c825a73-be72-4bd1-8bfc-966d8a095919
18/02/17 12:03:52 INFO SessionState: Created local directory: /tmp/vm4learning/5c825a73-be72-4bd1-8bfc-966d8a095919
18/02/17 12:03:52 INFO SessionState: Created HDFS directory: /tmp/hive/vm4learning/5c825a73-be72-4bd1-8bfc-966d8a095919/_tmp_space.db
18/02/17 12:03:52 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is file:/home/vm4learning/workspace/Acumen/spark-warehouse
18/02/17 12:03:52 INFO SessionState: Created local directory: /tmp/32b99842-2ac2-491e-934d-9726a6213c37_resources
18/02/17 12:03:52 INFO SessionState: Created HDFS directory: /tmp/hive/vm4learning/32b99842-2ac2-491e-934d-9726a6213c37
18/02/17 12:03:52 INFO SessionState: Created local directory: /tmp/vm4learning/32b99842-2ac2-491e-934d-9726a6213c37
18/02/17 12:03:52 INFO SessionState: Created HDFS directory: /tmp/hive/vm4learning/32b99842-2ac2-491e-934d-9726a6213c37/_tmp_space.db
18/02/17 12:03:52 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is file:/home/vm4learning/workspace/Acumen/spark-warehouse
18/02/17 12:03:53 INFO HiveMetaStore: 0: create_database: Database(name:default, description:default database, locationUri:file:/home/vm4learning/workspace/Acumen/spark-warehouse, parameters:{})
18/02/17 12:03:53 INFO audit: ugi=vm4learning ip=unknown-ip-addr cmd=create_database: Database(name:default, description:default database, locationUri:file:/home/vm4learning/workspace/Acumen/spark-warehouse, parameters:{})
18/02/17 12:03:55 INFO HiveMetaStore: 0: get_databases: *
18/02/17 12:03:55 INFO audit: ugi=vm4learning ip=unknown-ip-addr cmd=get_databases: *
18/02/17 12:03:56 INFO CodeGenerator: Code generated in 1037.20509 ms
Bye to Spark example2
18/02/17 12:03:56 INFO SparkContext: Invoking stop() from shutdown hook
18/02/17 12:03:57 INFO SparkUI: Stopped Spark web UI at http://192.168.189.136:4040
18/02/17 12:03:57 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/02/17 12:03:57 INFO MemoryStore: MemoryStore cleared
18/02/17 12:03:57 INFO BlockManager: BlockManager stopped
18/02/17 12:03:57 INFO BlockManagerMaster: BlockManagerMaster stopped
18/02/17 12:03:57 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/02/17 12:03:57 INFO SparkContext: Successfully stopped SparkContext
18/02/17 12:03:57 INFO ShutdownHookManager: Shutdown hook called
18/02/17 12:03:57 INFO ShutdownHookManager: Deleting directory /tmp/spark-6addf0da-f076-4dd1-a5eb-38dca93a2ad6
hive-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/hcatalog?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>vm4learning</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>hive.hwi.listen.host</name>
    <value>localhost</value>
  </property>
  <property>
    <name>hive.hwi.listen.port</name>
    <value>9999</value>
  </property>
  <property>
    <name>hive.hwi.war.file</name>
    <value>lib/hive-hwi-0.11.0.war</value>
  </property>
</configuration>
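Since this run is on Spark 2.0, a minimal sketch of wiring the metastore URI programmatically (assuming the thrift service from the hive-site.xml above is actually running on localhost:9083, and that the spark-hive module is on the Eclipse classpath) would be:

import org.apache.spark.sql.SparkSession

object DataContent {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .appName("Word Count2")
      .master("local")
      // Same URI as hive.metastore.uris in hive-site.xml; adjust if the metastore service runs elsewhere.
      .config("hive.metastore.uris", "thrift://localhost:9083")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("show databases").show()
    spark.stop()
  }
}

When hive-site.xml is not on the application classpath, Spark's Hive support falls back to an embedded Derby metastore in the working directory, which matches the log above; adding the conf directory to the Eclipse classpath or setting hive.metastore.uris as sketched are the usual fixes.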
I am trying to create a HiveContext, but it is throwing an error.
Is it just because I don't have winutils.exe, or how can I solve this issue?
The reason I want to create a HiveContext is that I am planning to use the collect_set function, which is available as a UDF in Hive.
Error log:
16/11/23 14:12:09 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
16/11/23 14:12:10 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/11/23 14:12:10 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/11/23 14:12:14 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/11/23 14:12:14 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/11/23 14:12:15 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
16/11/23 14:12:15 INFO ObjectStore: Initialized ObjectStore
16/11/23 14:12:15 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/11/23 14:12:16 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/11/23 14:12:16 WARN : Your hostname, INNR90GLH5G resolves to a loopback/non-reachable address: fe80:0:0:0:0:5efe:c0a8:1584%net6, but we couldn't find any external IP address!
16/11/23 14:12:17 INFO HiveMetaStore: Added admin role in metastore
16/11/23 14:12:17 INFO HiveMetaStore: Added public role in metastore
16/11/23 14:12:17 INFO HiveMetaStore: No user is added in admin role, since config is empty
16/11/23 14:12:18 INFO HiveMetaStore: 0: get_all_databases
16/11/23 14:12:18 INFO audit: ugi=1554161 ip=unknown-ip-addr cmd=get_all_databases
16/11/23 14:12:18 INFO HiveMetaStore: 0: get_functions: db=default pat=*
16/11/23 14:12:18 INFO audit: ugi=1554161 ip=unknown-ip-addr cmd=get_functions: db=default pat=*
16/11/23 14:12:18 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
Exception in thread "main" java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:204)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462)
at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461)
at org.apache.spark.sql.UDFRegistration.<init>(UDFRegistration.scala:40)
at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:330)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
at com.scb.cnc.payments.CockpitChannelStatus$.main(CockpitChannelStatus.scala:13)
at com.scb.cnc.payments.CockpitChannelStatus.main(CockpitChannelStatus.scala)
Caused by: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:774)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:572)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:547)
at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:599)
at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
... 12 more
If the issue is a missing winutils.exe, despite having Hadoop and Spark on your workstation, then try this article to solve your problem:
solution for spark
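As an illustration, a minimal sketch of the usual Windows workaround (the C:\hadoop path is an assumption; winutils.exe must sit in its bin subdirectory, and the local \tmp\hive directory may additionally need winutils.exe chmod 777 \tmp\hive):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveContextOnWindows {
  def main(args: Array[String]) {
    // Tell Hadoop where to find bin\winutils.exe before any Hadoop/Hive code runs (path is an assumption).
    System.setProperty("hadoop.home.dir", "C:\\hadoop")

    val conf = new SparkConf().setAppName("HiveContextOnWindows").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    // Once the HiveContext comes up, Hive functions such as collect_set are available through HiveQL.
    hiveContext.sql("show databases").collect().foreach(println)

    sc.stop()
  }
}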
I am trying to import data from Postgres 9.3 into Hadoop 2.7.2 using Sqoop 1.4.6 on Linux.
When I use the following command, it returns the correct result:
sqoop-list-databases --connect jdbc:postgresql://localhost:5432/ --username postgres --password "baixinghehe"
The following output shows that the username and password are both OK.
Warning: /usr/local/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /usr/local/sqoop/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
16/07/09 15:39:50 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
16/07/09 15:39:50 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/07/09 15:39:50 INFO manager.SqlManager: Using default fetchSize of 1000
template1
template0
postgres
learnflask
Then I try to run a Sqoop import like the following:
sqoop import --connect jdbc:postgresql://localhost:5432/learnflask --username postgres --password baixinghehe --table employee --target-dir /data/employee -m 1
It runs OK at first, but fails in the map phase. Here is the error. It feels strange to me, because the program should not even reach the MapReduce part if the password were wrong. Can someone help? Thanks very much.
Warning: /usr/local/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /usr/local/sqoop/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
16/07/09 15:35:02 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
16/07/09 15:35:02 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/07/09 15:35:02 INFO manager.SqlManager: Using default fetchSize of 1000
16/07/09 15:35:02 INFO tool.CodeGenTool: Beginning code generation
16/07/09 15:35:03 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "employee" AS t LIMIT 1
16/07/09 15:35:03 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/local/hadoop
Note: /tmp/sqoop-chenxuyuan/compile/b54e25a7f52c8eb496b2b84945ecd05a/employee.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/07/09 15:35:06 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-chenxuyuan/compile/b54e25a7f52c8eb496b2b84945ecd05a/employee.jar
16/07/09 15:35:06 WARN manager.PostgresqlManager: It looks like you are importing from postgresql.
16/07/09 15:35:06 WARN manager.PostgresqlManager: This transfer can be faster! Use the --direct
16/07/09 15:35:06 WARN manager.PostgresqlManager: option to exercise a postgresql-specific fast path.
16/07/09 15:35:06 INFO mapreduce.ImportJobBase: Beginning import of employee
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hbase/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/07/09 15:35:06 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
16/07/09 15:35:07 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16/07/09 15:35:07 INFO client.RMProxy: Connecting to ResourceManager at cxy10/192.168.0.110:8032
16/07/09 15:35:33 INFO db.DBInputFormat: Using read commited transaction isolation
16/07/09 15:35:33 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN("id"), MAX("id") FROM "employee"
16/07/09 15:35:33 INFO mapreduce.JobSubmitter: number of splits:2
16/07/09 15:35:34 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1468047672859_0003
16/07/09 15:35:35 INFO impl.YarnClientImpl: Submitted application application_1468047672859_0003
16/07/09 15:35:35 INFO mapreduce.Job: The url to track the job: http://cxy10:8088/proxy/application_1468047672859_0003/
16/07/09 15:35:35 INFO mapreduce.Job: Running job: job_1468047672859_0003
16/07/09 15:35:46 INFO mapreduce.Job: Job job_1468047672859_0003 running in uber mode : false
16/07/09 15:35:46 INFO mapreduce.Job: map 0% reduce 0%
16/07/09 15:35:55 INFO mapreduce.Job: Task Id : attempt_1468047672859_0003_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: java.lang.RuntimeException: org.postgresql.util.PSQLException: FATAL: password authentication failed for user "postgres"
at org.apache.sqoop.mapreduce.db.DBInputFormat.setConf(DBInputFormat.java:167)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:76)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:749)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.RuntimeException: org.postgresql.util.PSQLException: FATAL: password authentication failed for user "postgres"
at org.apache.sqoop.mapreduce.db.DBInputFormat.getConnection(DBInputFormat.java:220)
at org.apache.sqoop.mapreduce.db.DBInputFormat.setConf(DBInputFormat.java:165)
... 9 more
Caused by: org.postgresql.util.PSQLException: FATAL: password authentication failed for user "postgres"
at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:415)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:188)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:64)
at org.postgresql.jdbc2.AbstractJdbc2Connection.<init>(AbstractJdbc2Connection.java:143)
at org.postgresql.jdbc3.AbstractJdbc3Connection.<init>(AbstractJdbc3Connection.java:29)
at org.postgresql.jdbc3g.AbstractJdbc3gConnection.<init>(AbstractJdbc3gConnection.java:21)
at org.postgresql.jdbc3g.Jdbc3gConnection.<init>(Jdbc3gConnection.java:24)
at org.postgresql.Driver.makeConnection(Driver.java:412)
at org.postgresql.Driver.connect(Driver.java:280)
at java.sql.DriverManager.getConnection(DriverManager.java:571)
at java.sql.DriverManager.getConnection(DriverManager.java:215)
at org.apache.sqoop.mapreduce.db.DBConfiguration.getConnection(DBConfiguration.java:302)
at org.apache.sqoop.mapreduce.db.DBInputFormat.getConnection(DBInputFormat.java:213)
... 10 more
I'm parsing a JSON file with Spark SQL and it works really well: it infers the schema and I can run queries against it.
Now I need to "flatten" the JSON, and I have read on the forum that the best way is to use Hive's explode (lateral view), so I am trying to do the same. But I can't even create the context; Spark gives me an error and I can't figure out how to fix it.
As I have said, at this point I'm only trying to create the context:
println ("Create Spark Context:")
val sc = new SparkContext( "local", "Simple", "$SPARK_HOME")
println ("Create Hive context:")
val hiveContext = new HiveContext(sc)
And it gives me this error:
Create Spark Context:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/12/26 15:13:44 INFO Remoting: Starting remoting
15/12/26 15:13:45 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.80.136:40624]
Create Hive context:
15/12/26 15:13:50 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/12/26 15:13:50 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/12/26 15:13:56 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/12/26 15:13:56 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/12/26 15:13:58 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/12/26 15:13:58 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/12/26 15:13:59 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
15/12/26 15:14:01 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/12/26 15:14:01 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:183)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.<init>(IsolatedClientLoader.scala:179)
at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:226)
at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:392)
at org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:174)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:177)
at pebd.emb.Bicing$.main(Bicing.scala:73)
at pebd.emb.Bicing.main(Bicing.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
Caused by: java.lang.OutOfMemoryError: PermGen space
Process finished with exit code 1
I know it is a very simple question, but I don't really know the reason for that error.
Thank you all in advance.
Here's the relevant part of the exception:
Caused by: java.lang.OutOfMemoryError: PermGen space
You need to increase the amount of PermGen memory that you give to the JVM. By default (SPARK-1879), Spark's own launch scripts increase this to 128 MB, so I think you'll have to do something similar in your IntelliJ run configuration. Try adding -XX:MaxPermSize=128m to the "VM options" list.