Delta Lake 1.1.0 location in Spark local mode not working properly - Scala

I've updated some ETLs to Spark 3.2.1 and Delta Lake 1.1.0. After doing this, my local tests started to fail. After some debugging, I found that when I create an empty table with a specified location, it is registered in the metastore with a prefix.
Say I try to create a table in the bronze DB with spark-warehouse/users as the specified location:
spark.sql("""CREATE DATABASE IF NOT EXISTS bronze""")
spark.sql("""CREATE TABLE bronze.users (
| name string,
| active boolean
|)
|USING delta
|LOCATION 'spark-warehouse/users'""".stripMargin)
I end up with:
spark-warehouse/bronze.db/spark-warehouse/users is registered in the metastore, but the actual files end up in spark-warehouse/users! This makes any query against the table fail.
I generated a sample repository: https://github.com/adrianabreu/delta-1.1.0-table-location-error-example/blob/master/src/test/scala/example/HelloSpec.scala
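A minimal workaround sketch, assuming the data really should live under spark-warehouse/users: resolve the location to an absolute path before creating the table, so that the metastore entry and the actual files agree (the path handling below is illustrative, not part of the sample repository).
// Sketch only: make the location absolute so it is not resolved relative to bronze.db.
// The java.nio path resolution here is an assumption, not from the original code.
import java.nio.file.Paths

val usersLocation = Paths.get("spark-warehouse/users").toAbsolutePath.toUri.toString

spark.sql("""CREATE DATABASE IF NOT EXISTS bronze""")
spark.sql(s"""CREATE TABLE bronze.users (
             |  name string,
             |  active boolean
             |)
             |USING delta
             |LOCATION '$usersLocation'""".stripMargin)
Running spark.sql("DESCRIBE DETAIL bronze.users") afterwards shows which location was actually registered, which makes it easy to check whether the prefix is still being added.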

Related

DataJpaTest: Numeric scale default seems to be 0 with spring-boot-starter 2.7.1

I have a DataJpaTest with a schema.sql and data.sql for preparing the in-memory database (H2, standing in for PostgreSQL). I've just upgraded spring-boot-starter-parent from 2.6.3 to 2.7.1, and now the test fails.
schema:
CREATE TABLE IF NOT EXISTS some_table(
id BIGSERIAL,
name TEXT,
problematic_number NUMERIC NOT NULL
);
data:
INSERT INTO some_table (name, problematic_number) VALUES ('something', 1.4321);
For some reason a test is failing now with:
org.opentest4j.AssertionFailedError:
Expected :1.4321
Actual :1
I also connected to the H2 database directly, and the value stored there really is "1" instead of "1.4321". Before the Spring upgrade, the test was fine.
Did the default scale for NUMERIC maybe change? If I change my schema.sql to NUMERIC(10,4), the test succeeds.

Error when creating external table in Redshift Spectrum with dbt: cross-database reference not supported

I want to create an external table in Redshift Spectrum from CSV files. When I try doing so with dbt, I get a strange error. But when I manually remove some double quotes from the SQL generated by dbt and run it directly, I get no such error.
First I run this in Redshift Query Editor v2 on the default database dev in my cluster:
CREATE EXTERNAL SCHEMA example_schema
FROM DATA CATALOG
DATABASE 'example_db'
REGION 'us-east-1'
IAM_ROLE 'iam_role'
CREATE EXTERNAL DATABASE IF NOT EXISTS
;
Database dev now has an external schema named example_schema (and the Glue catalog registers example_db).
I then upload example_file.csv to the S3 bucket s3://example_bucket. The file looks like this:
col1,col2
1,a,
2,b,
3,c
Then I run dbt run-operation stage_external_sources in my local dbt project and get this output with an error:
21:03:03 Running with dbt=1.0.1
21:03:03 [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.example_project.example_models
21:03:03 1 of 1 START external source example_schema.example_table
21:03:03 1 of 1 (1) drop table if exists "example_db"."example_schema"."example_table" cascade
21:03:04 Encountered an error while running operation: Database Error
cross-database reference to database "example_db" is not supported
I try running the generated SQL in Query Editor:
DROP TABLE IF EXISTS "example_db"."example_schema"."example_table" CASCADE
and get the same error message:
ERROR: cross-database reference to database "example_db" is not supported
But when I run this SQL in Query Editor, it works:
DROP TABLE IF EXISTS "example_db.example_schema.example_table" CASCADE
Note that the only change is the quoting: the whole dotted name is quoted as a single identifier instead of each part separately.
What's going on here? Is this a bug in dbt-core, dbt-redshift, or dbt_external_tables, or just a mistake on my part?
To confirm, I can successfully create the external table by running this in Query Editor:
DROP SCHEMA IF EXISTS example_schema
DROP EXTERNAL DATABASE
CASCADE
;
CREATE EXTERNAL SCHEMA example_schema
FROM DATA CATALOG
DATABASE 'example_db'
REGION 'us-east-1'
IAM_ROLE 'iam_role'
CREATE EXTERNAL DATABASE IF NOT EXISTS
;
CREATE EXTERNAL TABLE example_schema.example_table (
col1 SMALLINT,
col2 CHAR(1)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 's3://example_bucket'
TABLE PROPERTIES ('skip.header.line.count'='1')
;
dbt config files
models/example/schema.yml (modeled after this example):
version: 2
sources:
  - name: example_source
    database: dev
    schema: example_schema
    loader: S3
    tables:
      - name: example_table
        external:
          location: 's3://example_bucket'
          row_format: >
            serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
            with serdeproperties (
              'strip.outer.array'='false'
            )
        columns:
          - name: col1
            data_type: smallint
          - name: col2
            data_type: char(1)
dbt_project.yml:
name: 'example_project'
version: '1.0.0'
config-version: 2
profile: 'example_profile'
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
target-path: "target"
clean-targets:
  - "target"
  - "dbt_packages"
models:
  example_project:
    example:
      +materialized: view
packages.yml:
packages:
  - package: dbt-labs/dbt_external_tables
    version: 0.8.0

Import MongoDB data into Hive Error: Splitter implementation is incompatible

I'm trying to import MongoDB data into Hive.
The jar versions that I have used are:
ADD JAR /root/HDL/mongo-java-driver-3.4.2.jar;
ADD JAR /root/HDL/mongo-hadoop-hive-2.0.2.jar;
ADD JAR /root/HDL/mongo-hadoop-core-2.0.2.jar;
And my cluster versions are:
Ambari - Version 2.6.0.0, HDFS 2.7.3, Hive 1.2.1000, HBase 1.1.2, Tez 0.7.0
MongoDB server version: 3.6.5
Hive script:
CREATE TABLE sampletable
( ID STRING,
EmpID STRING,
BeginDate DATE,
EndDate DATE,
Time TIMESTAMP,
Type STRING,
Location STRING,
Terminal STRING)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"ID":"_id","EmpID":"emp_id","BeginDate":"begin_date","EndDate":"end_date","Time":"time","Type":"time_event_type","Location":"location","Terminal":"terminal"}')
TBLPROPERTIES('mongo.uri'='mongodb://username:password@10.10.170.43:27017/testdb.testtable');
Output:
hive> select * from sampletable;
OK
Failed with exception java.io.IOException:java.io.IOException: Failed to aggregate sample documents. Note that this Splitter implementation is incompatible with MongoDB versions prior to 3.2.
Please suggest how I can solve this.
Thanks,
Mohan V
Try setting the input split size in the Hive session before querying the table:
set mongo.input.split_size=50;

Cannot run tests on H2 in-memory database; they run on PostgreSQL instead

(I have multiple related questions, so I have highlighted them in bold.)
I have a Play app.
play: 2.6.19
scala: 2.12.6
h2: 1.4.197
postgresql: 42.2.5
play-slick/play-slick-evolutions: 3.0.1
slick-pg: 0.16.3
I am adding a test for a DAO, and I believe it should run on an H2 in-memory database that is created when the tests start and cleared when they end.
However, my test always runs on the PostgreSQL database I configure and use:
# application.conf
slick.dbs.default.profile="slick.jdbc.PostgresProfile$"
slick.dbs.default.db.driver="org.postgresql.Driver"
slick.dbs.default.db.url="jdbc:postgresql://localhost:5432/postgres"
Here is my test, test/dao/TodoDAOImplSpec.scala:
package dao

import play.api.inject.guice.GuiceApplicationBuilder
import play.api.test.{Injecting, PlaySpecification, WithApplication}

class TodoDAOImplSpec extends PlaySpecification {
  val conf = Map(
    "slick.dbs.test.profile" -> "slick.jdbc.H2Profile$",
    "slick.dbs.test.db.driver" -> "org.h2.Driver",
    "slick.dbs.test.db.url" -> "jdbc:h2:mem:test;MODE=PostgreSQL;DB_CLOSE_DELAY=-1;DATABASE_TO_UPPER=FALSE"
  )
  val fakeApp = new GuiceApplicationBuilder().configure(conf).build()
  //val fakeApp = new GuiceApplicationBuilder().configure(inMemoryDatabase()).build()
  //val fakeApp = new GuiceApplicationBuilder().configure(inMemoryDatabase("test")).build()

  "TodoDAO" should {
    "returns current state in local pgsql table" in new WithApplication(fakeApp) with Injecting {
      val todoDao = inject[TodoDAOImpl]
      val result = await(todoDao.index())
      result.size should_== 0
    }
  }
}
For fakeApp, I try all three variants, but none of them works as expected: my test still runs against my local PostgreSQL table (which contains 3 todo items), so the test fails.
What I have tried/found:
First, inMemoryDatabase() simply returns a Map("db.<name>.driver" -> "org.h2.Driver", "db.<name>.url" -> "jdbc:h2:mem:play-test-xxx"), which looks very similar to my own conf map. However, there is a key difference:
inMemoryDatabase uses db.<name>.xxx while my conf map uses slick.dbs.<name>.db.xxx. Which one is correct?
Second, renaming the conf map's keys to "slick.dbs.default.profile", "slick.dbs.default.db.driver" and "slick.dbs.default.db.url" throws an error:
[error] p.a.d.e.DefaultEvolutionsApi - Unknown data type: "status_enum"; SQL statement:
ALTER TABLE todo ADD COLUMN status status_enum NOT NULL [50004-197] [ERROR:50004, SQLSTATE:HY004]
cannot create an instance for class dao.TodoDAOImplSpec
caused by #79bg46315: Database 'default' is in an inconsistent state!
This finding is interesting: is it related to my use of the PostgreSQL ENUM type and slick-pg? (See the slick-pg issue with H2.) Does it mean this is the right configuration for running H2 in-memory tests? If so, the question becomes how to fake a PostgreSQL ENUM in H2.
Third, following this thread, I run sbt '; set javaOptions += "-Dconfig.file=conf/application-test.conf"; test' with a test configuration file conf/application-test.conf:
include "application.conf"
slick.dbs.default.profile="slick.jdbc.H2Profile$"
slick.dbs.default.db.driver="org.h2.Driver"
slick.dbs.default.db.url="jdbc:h2:mem:test;MODE=PostgreSQL;DB_CLOSE_DELAY=-1;DATABASE_TO_UPPER=FALSE"
Not surprisingly, I get the same error as in the second trial.
It seems to me that the second and third trials point in the right direction (I will keep working on this). But why must the name be set to default? Is there a better approach?
In Play the default database is named default. You can, however, change it to any other database name you want, but then you need to reference the database by that name as well. For example, say I want a comment database that has a User table:
CREATE TABLE comment.User(
id int(250) NOT NULL AUTO_INCREMENT,
username varchar(255),
comment varchar(255),
PRIMARY KEY (id));
Then I need the configuration to connect to it (add it to the application.conf file):
db.comment.url="jdbc:mysql://localhost/comment"
db.comment.username=admin-username
db.comment.password="admin-password"
You could define a test database in the same way, as mentioned above, and use it within your tests.
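As a rough sketch of how the named database is then used (the repository class, its query, and the table contents are illustrative assumptions, not from the original answer):
// Minimal sketch: injecting the "comment" database by name; the name must match
// the db.comment.* keys above. CommentRepository and its query are made up.
import javax.inject.Inject
import play.api.db.Database
import play.db.NamedDatabase

class CommentRepository @Inject()(@NamedDatabase("comment") db: Database) {
  def countUsers(): Int =
    db.withConnection { conn =>
      val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM User")
      rs.next()
      rs.getInt(1)
    }
}
With play-slick, as used in the question, the analogous approach is (as far as I know) to inject a DatabaseConfigProvider annotated with @NamedDatabase for the matching slick.dbs.<name> entry.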
Database tests locally: why not run the same database locally as you have in production? The data is not there, and running the tests locally does not touch the production database, so why do you need an extra database?
Inconsistent state: this is when the SQL you wrote changes the state of the current database, for example by creating a new table or deleting one.
Also, status_enum is obviously not recognizable as a MySQL type. Try the commands you want to use in a MySQL console if you are not sure about them.

Export data from Mongo to Hive

My input: a collection ("demo1") in MongoDB (version 3.4.4)
My output: my data imported into a database in Hive ("demo2") (version 1.2.1.2.3.4.7-4)
Purpose: create a connector between Mongo and Hive
Error:
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/mongodb/util/JSON
I tried two solutions following these steps (but the error remains):
1) I create a local collection in Mongo (via Robomongo) connected to Docker.
2) I upload these versions of the jars and add them in Hive:
ADD JAR /home/.../mongo-hadoop-hive-2.0.2.jar;
ADD JAR /home/.../mongo-hadoop-core-2.0.2.jar;
ADD JAR /home/.../mongo-java-driver-3.4.2.jar;
Unfortunately the error doesn't change, and since I hesitate over which versions are right for my export, I also try these:
ADD JAR /home/.../mongo-hadoop-hive-1.3.0.jar;
ADD JAR /home/.../mongo-hadoop-core-1.3.0.jar;
ADD JAR /home/.../mongo-java-driver-2.13.2.jar;
3) I create an external table:
CREATE EXTERNAL TABLE demo2
(
id INT,
name STRING,
password STRING,
email STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH
SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","name":"name","password":"password","email":"email"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/local.demo1');
Error returned in Hive:
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/mongodb/util/JSON
How can I resolve this problem?
Copying the correct jar files (mongo-hadoop-core-2.0.2.jar, mongo-hadoop-hive-2.0.2.jar, mongo-java-driver-3.2.2.jar) onto ALL the nodes of the cluster did the trick for me.
Other points to take care of:
Follow all the steps mentioned here religiously: https://github.com/mongodb/mongo-hadoop/wiki/Hive-Usage#installation
Adhere to the requirements given here: https://github.com/mongodb/mongo-hadoop#requirements
Other useful links:
https://github.com/mongodb/mongo-hadoop/wiki/FAQ#i-get-a-classnotfoundexceptionnoclassdeffounderror-when-using-the-connector-what-do-i-do
https://groups.google.com/forum/#!topic/mongodb-user/xMVoTSePgg0