Not able to insert data into Hive Elasticsearch index using Spark SQL - Scala

I have used the following steps in the Hive terminal to insert into an Elasticsearch index.
Create a Hive table pointing to the Elasticsearch index:
CREATE EXTERNAL TABLE test_es(
id string,
name string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'test/person', 'es.mapping.id' = 'id');
Create a staging table and load data into it:
Create table emp(id string,name string) row format delimited fields terminated by ',';
load data local inpath '/home/monami/data.txt' into table emp;
Insert data from the staging table into the Hive Elasticsearch index:
insert overwrite table test_es select * from emp;
I could browse the Hive Elasticsearch index successfully after following the above steps in the Hive CLI. But whenever I try to insert the same way using the Spark SQL HiveContext object, I get the following error:
java.lang.RuntimeException: java.lang.RuntimeException: class org.elasticsearch.hadoop.mr.EsOutputFormat$EsOutputCommitter not org.apache.hadoop.mapred.OutputCommitter
Can you please let me know the reason for this error? If it is not possible to insert the same way using Spark, what is the method to insert into a Hive Elasticsearch index using Spark?
Versions used - Spark 1.6, Scala 2.10, Elasticsearch 6.4, Hive 1.1
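For reference, a commonly suggested workaround (not from this question, so treat it as an assumption) is to skip the EsStorageHandler on the write path and push the rows to Elasticsearch directly with the elasticsearch-spark connector. A minimal sketch, assuming the elasticsearch-spark artifact matching Spark 1.6 / Scala 2.10 is on the classpath; the node and port values are placeholders:
import org.apache.spark.sql.hive.HiveContext
import org.elasticsearch.spark.sql._   // adds saveToEs to DataFrame

val hiveContext = new HiveContext(sc)

// Read the staging table that was loaded through Hive
val emp = hiveContext.sql("SELECT id, name FROM emp")

// Write straight to the index behind the external table above ("test/person"),
// bypassing the Hive storage handler
emp.saveToEs("test/person", Map(
  "es.nodes"      -> "localhost",   // placeholder: your Elasticsearch host
  "es.port"       -> "9200",
  "es.mapping.id" -> "id"
))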

Related

Inserting values with multiple lines behaves differently in PySpark and a plain SQL query via JDBC (Hive)

If I run this SQL query (via JDBC, Hive server):
--create table test.testing (a string) stored as ORC;
insert into test.testing values ("line1\nline2");
I expect to get 1 record, but I get 2 records in the table.
If I run the same query using PySpark:
spark.sql("""insert into test.testing values ('pysparkline1\npysparkline2')""")
I get 1 record in the table.
How can I insert data containing newlines into a table column via JDBC using an "INSERT ... VALUES (...)" statement?
P.S. An "INSERT ... FROM SELECT" query is not suitable, and I cannot change the line delimiter in the CREATE query.

Hive create partitioned table based on Spark temporary table

I have a Spark temporary table spark_tmp_view with a DATE_KEY column. I am trying to create a Hive table from it (without writing the temp table to a Parquet location first). What I have tried to run is spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS mydb.result AS SELECT * FROM spark_tmp_view PARTITIONED BY(DATE_KEY DATE)")
The error I got is mismatched input 'BY' expecting <EOF>. I tried to search but still haven't been able to figure out how to do it from a Spark app, and how to insert data afterwards. Could someone please help? Many thanks.
PARTITIONED BY is part of the definition of the table being created, so it should precede ... AS SELECT ...; see the Spark SQL syntax.
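A minimal sketch of the corrected statement (names taken from the question; the USING PARQUET clause and the managed table are assumptions, since an EXTERNAL table in Spark also requires a LOCATION):
// PARTITIONED BY is part of the table definition and must come before AS SELECT
spark.sql("""
  CREATE TABLE IF NOT EXISTS mydb.result
  USING PARQUET
  PARTITIONED BY (DATE_KEY)
  AS SELECT * FROM spark_tmp_view
""")
The partition column is listed by name only; its type is taken from the SELECT output.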

Spark Scala create external Hive table not working with location as a variable

I am trying to create a Hive external table from a Spark application, passing the location as a variable to the SQL command. It doesn't create the Hive table, and I don't see any errors.
val location = "/home/data"
hiveContext.sql(s"""CREATE EXTERNAL TABLE IF NOT EXISTS TestTable(id STRING,name STRING) PARTITIONED BY (city string) STORED AS PARQUET LOCATION '${location}' """)
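One way to check whether the statement failed silently or the table simply ended up somewhere unexpected (a diagnostic sketch reusing the names from the snippet above):
// Was the table registered in the metastore at all?
hiveContext.tableNames().filter(_.equalsIgnoreCase("TestTable")).foreach(println)

// If it was, check which location and partitioning Hive recorded for it
hiveContext.sql("DESCRIBE FORMATTED TestTable").show(100, false)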
Spark only supports creating managed tables, and even then there are severe restrictions: it does not support dynamically partitioned tables.
TL;DR: you cannot create external tables through Spark; Spark can only read them.
Not sure which version had these limitations.
I am using Spark 1.6, Hive 1.1.
I am able to create the external table; please see below:
var query = """CREATE EXTERNAL TABLE avro_hive_table
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  TBLPROPERTIES ('avro.schema.url'='hdfs://localdomain/user/avro/schemas/activity.avsc')
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION '/user/avro/applog_avro'"""
var hiveContext = new org.apache.spark.sql.hive.HiveContext(sc);
hiveContext.sql(query);
var df = hiveContext.sql("select count(*) from avro_hive_table");

Apache Spark - Error persisting DataFrame to MemSQL database using JDBC driver

I'm currently facing an issue while trying to save an Apache Spark DataFrame, loaded from a Spark temp table, into a distributed MemSQL database.
The catch is that I cannot use the MemSQLContext connector for the moment, so I'm using the JDBC driver.
Here is my code:
//store suppliers data from temp table into a dataframe
val suppliers = sqlContext.read.table("tmp_SUPPLIER")
//append data to the target table
suppliers.write.mode(SaveMode.Append).jdbc(url_memsql, "R_SUPPLIER", prop_memsql)
Here is the error message (occurring during the suppliers.write statement):
java.sql.SQLException: Distributed tables must either have a PRIMARY or SHARD key.
Note:
The R_SUPPLIER table has exactly the same fields and data types as the temp table and has a primary key set.
FYI, here are some clues:
R_SUPPLIER script:
CREATE TABLE R_SUPPLIER
(
SUP_ID INT NOT NULL PRIMARY KEY,
SUP_CAGE_CODE CHAR(5) NULL,
SUP_INTERNAL_SAP_CODE CHAR(5) NULL,
SUP_NAME VARCHAR(255) NULL,
SHARD KEY(SUP_ID)
);
The suppliers.write statement has worked once, but in that case the data had been loaded into the DataFrame with a sqlContext.read.jdbc command rather than sqlContext.sql (the data came from a remote database, not from a Spark local temp table).
Did anyone face the same issue, please?
Are you getting that error when you run the create table, or when you run the suppliers.write code? That is an error that you should only get when creating a table. Therefore if you are hitting it when running suppliers.write, your code is probably trying to create and write to a new table, not the one you created before.
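If that is what is happening, one option (a sketch, not from the thread; it reuses the url_memsql and prop_memsql values from the question) is to create R_SUPPLIER with its SHARD key up front over plain JDBC, so the Append write only inserts rows:
import java.sql.DriverManager

// Create the sharded table once, outside Spark, so that the JDBC writer
// never has to create it (and therefore never omits the SHARD key)
val conn = DriverManager.getConnection(url_memsql, prop_memsql)
try {
  conn.createStatement().execute(
    """CREATE TABLE IF NOT EXISTS R_SUPPLIER (
      |  SUP_ID INT NOT NULL PRIMARY KEY,
      |  SUP_CAGE_CODE CHAR(5) NULL,
      |  SUP_INTERNAL_SAP_CODE CHAR(5) NULL,
      |  SUP_NAME VARCHAR(255) NULL,
      |  SHARD KEY (SUP_ID)
      |)""".stripMargin)
} finally {
  conn.close()
}

// Now the write appends into the existing table instead of creating a new one
suppliers.write.mode(SaveMode.Append).jdbc(url_memsql, "R_SUPPLIER", prop_memsql)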

Errors while saving a Spark DataFrame to HBase using Apache Phoenix

I'm trying to save a JSON RDD into HBase using the Apache Phoenix Spark plugin: df.saveToPhoenix(tableName, zkUrl = Some(quorumAddress)). The table looks like:
CREATE TABLE IF NOT EXISTS person (
ID BIGINT NOT NULL PRIMARY KEY,
NAME VARCHAR,
SURNAME VARCHAR) SALT_BUCKETS = 40, COMPRESSION='GZ';
I have about 100,000 - 2,000,000 records in tables of this kind. Some of them are saved normally, but some of them fail with the error:
java.lang.RuntimeException: org.apache.phoenix.exception.PhoenixIOException:
callTimeout=1200000, callDuration=2902566: row 'PERSON' on table 'SYSTEM.CATALOG' at
region=SYSTEM.CATALOG,,1443172839381.a593d4dbac97863f897bca469e8bac66.,
hostname=hadoop-02,16020,1443292360474, seqNum=339
at org.apache.phoenix.mapreduce.PhoenixRecordWriter.close(PhoenixRecordWriter.java:62)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$5.apply$mcV$sp(PairRDDFunctions.scala:1043)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1294)
What could that mean? Are there any other ways to bulk insert data from a DataFrame into HBase?
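The stack trace points at a client timeout while the PhoenixRecordWriter closes against SYSTEM.CATALOG, so one thing worth trying (a sketch, assuming the phoenix-spark saveToPhoenix overload that accepts a Hadoop Configuration is available in your version) is to raise the relevant client timeouts for the write; the values below are placeholders, not recommendations:
import org.apache.hadoop.conf.Configuration
import org.apache.phoenix.spark._

// Pass a Configuration with larger HBase/Phoenix client timeouts to the save
val writeConf = new Configuration()
writeConf.set("hbase.rpc.timeout", "600000")
writeConf.set("hbase.client.operation.timeout", "600000")
writeConf.set("phoenix.query.timeoutMs", "600000")

df.saveToPhoenix(tableName, writeConf, zkUrl = Some(quorumAddress))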