How can I specify a different database and schema for creating temporary tables in Great Expectations? - data-quality

Great Expectations creates temporary tables when it profiles data. Profiling worked in my Snowflake lab because the role I was using could create tables in the schema that contained the tables being profiled.
I tried to profile a table in a Snowflake share, where we can't create objects, and it failed:
(snowflake.connector.errors.ProgrammingError) 002003 (02000): SQL compilation error:
Schema 'OUR_DATABASE.SNOWFLAKE_SHARE_SCHEMA' does not exist or not authorized.
[SQL: CREATE OR REPLACE TEMPORARY TABLE ge_temp_3eb6c50b AS SELECT *
FROM "SNOWFLAKE_SHARE_SCHEMA"."INTERESTING_TABLE"
WHERE true]
(Background on this error at: https://sqlalche.me/e/14/f405)
Here's the output from the CLI:
% great_expectations suite new
Using v3 (Batch Request) API
How would you like to create your Expectation Suite?
1. Manually, without interacting with a sample batch of data (default)
2. Interactively, with a sample batch of data
3. Automatically, using a profiler
: 3
A batch of data is required to edit the suite - let's help you to specify it.
Select data_connector
1. default_runtime_data_connector_name
2. default_inferred_data_connector_name
3. default_configured_data_connector_name
: 3
Which data asset (accessible by data connector "default_configured_data_connector_name") would you like to use?
1. INTERESTING_TABLE
Type [n] to see the next page or [p] for the previous. When you're ready to select an asset, enter the index.
: 1
Name the new Expectation Suite [INTERESTING_TABLE.warning]:
Great Expectations will create a notebook, containing code cells that select from available columns in your dataset and
generate expectations about them to demonstrate some examples of assertions you can make about your data.
When you run this notebook, Great Expectations will store these expectations in a new Expectation Suite "INTERESTING_TABLE.warning" here:
file:///path/to-my-repo/great_expectations/expectations/INTERESTING_TABLE/warning.json
Would you like to proceed? [Y/n]: Y
Here's the datasources section from great_expectations.yml:
datasources:
  our_snowflake:
    class_name: Datasource
    module_name: great_expectations.datasource
    execution_engine:
      module_name: great_expectations.execution_engine
      credentials:
        host: xyz92716.us-east-1
        username: MYUSER
        query:
          schema: MYSCHEMA
          warehouse: MY_WAREHOUSE
          role: RW_ROLE
        password: password1234
        drivername: snowflake
      class_name: SqlAlchemyExecutionEngine
    data_connectors:
      default_runtime_data_connector_name:
        class_name: RuntimeDataConnector
        batch_identifiers:
          - default_identifier_name
        module_name: great_expectations.datasource.data_connector
      default_inferred_data_connector_name:
        include_schema_name: true
        class_name: InferredAssetSqlDataConnector
        introspection_directives:
          schema_name: SNOWFLAKE_SHARE_SCHEMA
        module_name: great_expectations.datasource.data_connector
      default_configured_data_connector_name:
        assets:
          INTERESTING_TABLE:
            schema_name: SNOWFLAKE_SHARE_SCHEMA
            class_name: Asset
            module_name: great_expectations.datasource.data_connector.asset
        class_name: ConfiguredAssetSqlDataConnector
        module_name: great_expectations.datasource.data_connector
How can I tweak great_expectations.yml so that temporary objects are created in a separate database and schema from the datasource?
As a workaround, we created a view in a schema where we have read/write access that points to the data in the read-only share. That adds an extra step, so I'm hoping there's a simple config option that creates temporary objects outside the schema being profiled.
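For reference, the workaround looks roughly like this (a sketch with a hypothetical view name, assuming MYSCHEMA from the config above is the schema RW_ROLE can write to):
-- Create a view in the read/write schema that points at the read-only share.
-- Great Expectations then profiles the view, so its temporary tables land in
-- a schema where RW_ROLE is allowed to create objects.
CREATE VIEW MYSCHEMA.INTERESTING_TABLE_VW AS
SELECT *
FROM SNOWFLAKE_SHARE_SCHEMA.INTERESTING_TABLE;
The configured asset then has to point at MYSCHEMA.INTERESTING_TABLE_VW instead of the share, which is the extra step I'd like to avoid.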

Related

Error in connection to MongoDB Atlas from Siddhi

I am new to Siddhi, and I am trying to connect to MongoDB Atlas to insert a record into a collection. When I configure the parameters and run the code in the Siddhi editor, there is no error in the console, but the record is not added to MongoDB.
Here is the code:
@App:name("ConectionMongoDBAtlas")
@App:description("Description of conection to MongoDB Atlas")
@sink(type='mongodb',
-- mongodb.uri='mongodb://username:password@ac-qe2xpea-shard-00-00.cs3wyqb.mongodb.net:27017,ac-qe2xpea-shard-00-01.cs3wyqb.mongodb.net:27017,ac-qe2xpea-shard-00-02.cs3wyqb.mongodb.net:27017/siddhi?ssl=true&replicaSet=atlas-4drk5v-shard-0&authSource=admin&retryWrites=true&w=majority',
uri='mongodb+srv://username:password@cluster0.cs3wyqb.mongodb.net/siddhi?retryWrites=true&w=majority',
collection.name = 'siddhiCollection',
database.name = 'siddhi'
-- secure.connection = 'true',
-- trust.store = 'C:/Users/luis.ortega/Downloads/siddhi-tooling-5.1.0/resources/security/client-truststore.jks',
-- key.store.password = 'mongodb',
-- sslEnabled = 'true',
-- trustStore = 'C:/Users/luis.ortega/Downloads/siddhi-tooling-5.1.0/resources/security/cloud.mongodb2',
-- keyStorePassword = 'mongodb',
-- @map(type='json')
-- @payload('{"name":"{{name}}", "age":{{age}}}')
)
@primaryKey("name")
@index('age')
define table siddhiCollection(name string, age int);
@sink(type = 'log')
define stream BarStream(message string);
@info(name= 'query1')
define stream InsertStream (name string, age int);
from InsertStream
insert into MongoCollection;
I tried to configure MongoDB with the store annotation as shown in the documentation, and also with the sink annotation.
We don't know whether an SSL certificate issue with the database is the problem; I even added the certificate to client-truststore.jks.
I have tried connecting to MongoDB (though not to Atlas) with the following steps, and it was successful.
Copy the mongo driver[1] to the /lib directory.
Create a database named 'test' and add the admin user to it:
db.createUser({user: "admin", pwd: "admin", roles : [{role: "readWrite", db: "test"}]});
Then simulate the Siddhi application using the tooling UI after deploying the Siddhi application[2].
[1] https://mvnrepository.com/artifact/org.mongodb/mongo-java-driver/3.4.2
[2]
#App:name("store-mongodb")
#Store(type="mongodb",mongodb.uri='mongodb://admin:admin#localhost:27017/test') #PrimaryKey("name") #Index("amount:1", "{background:true}") define table SweetProductionTable (name string, amount double);
/* Inserting event into the mongo store */ #info(name='query1') from insertSweetProductionStream insert into SweetProductionTable;
If there are connection issues to the database, there should be error logs in the tooling's carbon log file. You can check 'carbon.log' under /wso2/server/logs. Check this only if you have already observed the logs in the browser console.

Error when creating external table in Redshift Spectrum with dbt: cross-database reference not supported

I want to create an external table in Redshift Spectrum from CSV files. When I try doing so with dbt, I get a strange error. But when I manually remove some double quotes from the SQL generated by dbt and run it directly, I get no such error.
First I run this in Redshift Query Editor v2 on default database dev in my cluster:
CREATE EXTERNAL SCHEMA example_schema
FROM DATA CATALOG
DATABASE 'example_db'
REGION 'us-east-1'
IAM_ROLE 'iam_role'
CREATE EXTERNAL DATABASE IF NOT EXISTS
;
Database dev now has an external schema named example_schema (and the Glue catalog registers example_db).
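A quick side check (not something dbt requires) is to query Redshift's SVV_EXTERNAL_SCHEMAS system view, which maps each external schema to the Glue database it points at:
-- Each row maps an external schema in the cluster database (dev here)
-- to the external (Glue/Spectrum) database it references.
SELECT schemaname, databasename
FROM svv_external_schemas;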
I then upload example_file.csv to the S3 bucket s3://example_bucket. The file looks like this:
col1,col2
1,a,
2,b,
3,c
Then I run dbt run-operation stage_external_sources in my local dbt project and get this output with an error:
21:03:03 Running with dbt=1.0.1
21:03:03 [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.example_project.example_models
21:03:03 1 of 1 START external source example_schema.example_table
21:03:03 1 of 1 (1) drop table if exists "example_db"."example_schema"."example_table" cascade
21:03:04 Encountered an error while running operation: Database Error
cross-database reference to database "example_db" is not supported
I try running the generated SQL in Query Editor:
DROP TABLE IF EXISTS "example_db"."example_schema"."example_table" CASCADE
and get the same error message:
ERROR: cross-database reference to database "example_db" is not supported
But when I run this SQL in Query Editor, it works:
DROP TABLE IF EXISTS "example_db.example_schema.example_table" CASCADE
Note that I just removed some quotes.
What's going on here? Is this a bug in dbt-core, dbt-redshift, or dbt_external_tables--or just a mistake on my part?
To confirm, I can successfully create the external table by running this in Query Editor:
DROP SCHEMA IF EXISTS example_schema
DROP EXTERNAL DATABASE
CASCADE
;
CREATE EXTERNAL SCHEMA example_schema
FROM DATA CATALOG
DATABASE 'example_db'
REGION 'us-east-1'
IAM_ROLE 'iam_role'
CREATE EXTERNAL DATABASE IF NOT EXISTS
;
CREATE EXTERNAL TABLE example_schema.example_table (
col1 SMALLINT,
col2 CHAR(1)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 's3://example_bucket'
TABLE PROPERTIES ('skip.header.line.count'='1')
;
dbt config files
models/example/schema.yml (modeled after this example):
version: 2
sources:
  - name: example_source
    database: dev
    schema: example_schema
    loader: S3
    tables:
      - name: example_table
        external:
          location: 's3://example_bucket'
          row_format: >
            serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
            with serdeproperties (
              'strip.outer.array'='false'
            )
        columns:
          - name: col1
            data_type: smallint
          - name: col2
            data_type: char(1)
dbt_project.yml:
name: 'example_project'
version: '1.0.0'
config-version: 2
profile: 'example_profile'
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
target-path: "target"
clean-targets:
  - "target"
  - "dbt_packages"
models:
  example_project:
    example:
      +materialized: view
packages.yml:
packages:
  - package: dbt-labs/dbt_external_tables
    version: 0.8.0

How to define schema name in @MappedEntity annotation for r2dbc

I have a Kotlin & Micronaut application connecting to PostgreSQL using R2DBC for a reactive approach:
r2dbc:
  datasources:
    default:
      schema-generate: NONE
      dialect: POSTGRES
      url: r2dbc:postgresql://localhost:5434/mydb
      username: postgres
      password: postgres
I have a table called Customer inside database mydb and schema myschema, but with @MappedEntity we can only define the table name. Since the table is inside myschema, the application throws an error saying the entity does not exist:
15:26:15.455 [reactor-tcp-nio-1] ERROR i.m.h.n.stream.HttpStreamsHandler - Error occurred writing stream response: relation "customer" does not exist
io.r2dbc.postgresql.ExceptionFactory$PostgresqlBadGrammarException: relation "customer" does not exist
How do I define the schema name in the @MappedEntity annotation?
One way to do it is to define the current schema in the URL using a query parameter:
url: r2dbc:postgresql://localhost:5434/mydb?currentSchema=myschema
You can also use JPA's @Table annotation as a workaround.
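If you would rather not change the URL or the entity mapping, another option is to give the connecting role a default search_path on the database side (a sketch, assuming the role and database names from the config above):
-- Unqualified names such as "customer" will then resolve to myschema
-- for new connections opened by the postgres role against mydb.
ALTER ROLE postgres IN DATABASE mydb SET search_path TO myschema;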

Cannot run tests on H2 in-memory database; they run on PostgreSQL instead

(I have multiple related questions, so I have highlighted them in bold.)
I have a play app.
play: 2.6.19
scala: 2.12.6
h2: 1.4.197
postgresql: 42.2.5
play-slick/play-slick-evolutions: 3.0.1
slick-pg: 0.16.3
I am adding a test for a DAO, and I believe it should run on an H2 in-memory database that is created when tests start and cleared when tests end.
However, my test always runs on the PostgreSQL database I configure and use:
# application.conf
slick.dbs.default.profile="slick.jdbc.PostgresProfile$"
slick.dbs.default.db.driver="org.postgresql.Driver"
slick.dbs.default.db.url="jdbc:postgresql://localhost:5432/postgres"
Here is my test test/dao/TodoDAOImplSpec.scala.
package dao

import play.api.inject.guice.GuiceApplicationBuilder
import play.api.test.{Injecting, PlaySpecification, WithApplication}

class TodoDAOImplSpec extends PlaySpecification {
  val conf = Map(
    "slick.dbs.test.profile" -> "slick.jdbc.H2Profile$",
    "slick.dbs.test.db.driver" -> "org.h2.Driver",
    "slick.dbs.test.db.url" -> "jdbc:h2:mem:test;MODE=PostgreSQL;DB_CLOSE_DELAY=-1;DATABASE_TO_UPPER=FALSE"
  )
  val fakeApp = new GuiceApplicationBuilder().configure(conf).build()
  //val fakeApp = new GuiceApplicationBuilder().configure(inMemoryDatabase()).build()
  //val fakeApp = new GuiceApplicationBuilder().configure(inMemoryDatabase("test")).build()

  "TodoDAO" should {
    "returns current state in local pgsql table" in new WithApplication(fakeApp) with Injecting {
      val todoDao = inject[TodoDAOImpl]
      val result = await(todoDao.index())
      result.size should_== 0
    }
  }
}
For fakeApp, I try all three, but none of them work as expected - my test still runs on my local PostgreSQL table (in which there are 3 todo items), so the test fails.
What I have tried/found:
First, inMemoryDatabase() simply returns a Map("db.<name>.driver" -> "org.h2.Driver", "db.<name>.url" -> "jdbc:h2:mem:play-test-xxx"), which looks very similar to my own conf map. However, there are 2 main differences:
inMemoryDatabase uses db.<name>.xxx while my conf map uses slick.dbs.<name>.db.xxx. Which one should be correct?
Second, renaming the conf map's keys to "slick.dbs.default.profile", "slick.dbs.default.db.driver", and "slick.dbs.default.db.url" throws an error:
[error] p.a.d.e.DefaultEvolutionsApi - Unknown data type: "status_enum"; SQL statement:
ALTER TABLE todo ADD COLUMN status status_enum NOT NULL [50004-197] [ERROR:50004, SQLSTATE:HY004]
cannot create an instance for class dao.TodoDAOImplSpec
caused by @79bg46315: Database 'default' is in an inconsistent state!
This finding is interesting - is it related to my use of the PostgreSQL ENUM type and slick-pg? (See the slick-pg issue with h2.) Does it mean this is the right configuration for running h2 in-memory tests? If so, the question becomes: how to fake a PostgreSQL ENUM in h2.
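What I have in mind for faking it is something like this (an untested sketch; it just treats the enum as text, so none of the real enum constraints are enforced):
-- Pre-create a domain with the enum's name so H2 accepts the evolution's
-- "ADD COLUMN status status_enum" statement; values are stored as plain VARCHAR.
CREATE DOMAIN IF NOT EXISTS status_enum AS VARCHAR(255);
H2's INIT URL parameter (e.g. ;INIT=RUNSCRIPT FROM 'conf/test-init.sql', a hypothetical path) is one place such a statement could run before evolutions.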
Third, following this thread, I run sbt '; set javaOptions += "-Dconfig.file=conf/application-test.conf"; test' with a test configuration file conf/application-test.conf:
include "application.conf"
slick.dbs.default.profile="slick.jdbc.H2Profile$"
slick.dbs.default.db.driver="org.h2.Driver"
slick.dbs.default.db.url="jdbc:h2:mem:test;MODE=PostgreSQL;DB_CLOSE_DELAY=-1;DATABASE_TO_UPPER=FALSE"
Not surprisingly, I get the same error as in the 2nd trial.
It seems to me that the 2nd and 3rd trials point in the right direction (I will keep working on this). But why must we set the name to default? Is there a better approach?
In Play the default database is default. You could, however, change that to any other database name you want, but then you need to add the database name as well. For example, I want to have a comment database that has the User table:
CREATE TABLE comment.User(
id int(250) NOT NULL AUTO_INCREMENT,
username varchar(255),
comment varchar(255),
PRIMARY KEY (id));
Then I need the configuration to connect to it (add it to the application.conf file):
db.comment.url="jdbc:mysql://localhost/comment"
db.comment.username=admin-username
db.comment.password="admin-password"
You could have the test database for your testing as mentioned above and use it within your test.
Database tests locally: why not have the same database locally as you have in production? The data is not there, and running the tests locally does not touch the production database, so why do you need an extra database?
Inconsistent state: this happens when the SQL you wrote changes the state of the current database, for example by creating a new table or deleting one.
Also, status_enum is obviously not a type that MySQL recognizes. Try the commands you want to use in the MySQL console if you are not sure about them.

Export data from Mongo to Hive

My input: a collection ("demo1") in MongoDB (version 3.4.4)
My output: my data imported into a database in Hive ("demo2") (version 1.2.1.2.3.4.7-4)
Purpose: create a connector between Mongo and Hive
Error:
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/mongodb/util/JSON
I tried 2 solutions following these steps (but the error remains):
1) I create a local collection in Mongo (via Robomongo) connected to Docker.
2) I upload these versions of the jars and add them in Hive:
ADD JAR /home/.../mongo-hadoop-hive-2.0.2.jar;
ADD JAR /home/.../mongo-hadoop-core-2.0.2.jar;
ADD JAR /home/.../mongo-java-driver-3.4.2.jar;
Unfortunately the error doesn't change, so I upload other versions; I hesitated in choosing the right versions for my export, so I try these:
ADD JAR /home/.../mongo-hadoop-hive-1.3.0.jar;
ADD JAR /home/.../mongo-hadoop-core-1.3.0.jar;
ADD JAR /home/.../mongo-java-driver-2.13.2.jar;
3) I create an external table:
CREATE EXTERNAL TABLE demo2
(
id INT,
name STRING,
password STRING,
email STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH
SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","name":"name","password":"password","email":"email"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/local.demo1');
Error returned in Hive:
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/mongodb/util/JSON
How can I resolve this problem?
Copying the correct jar files (mongo-hadoop-core-2.0.2.jar, mongo-hadoop-hive-2.0.2.jar, mongo-java-driver-3.2.2.jar) on ALL the nodes of the cluster did the trick for me.
Other points to take care of:
Follow all steps mentioned here religiously - https://github.com/mongodb/mongo-hadoop/wiki/Hive-Usage#installation
Adhere to the requirements given here - https://github.com/mongodb/mongo-hadoop#requirements
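It may also be worth confirming the jars are actually registered in the current Hive session before running the CREATE EXTERNAL TABLE (a small sketch; LIST JARS just prints the resources added with ADD JAR):
ADD JAR /home/.../mongo-hadoop-hive-2.0.2.jar;  -- same jars as above, present on every node
LIST JARS;  -- the three mongo-* jars should show up here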
Other useful links
https://github.com/mongodb/mongo-hadoop/wiki/FAQ#i-get-a-classnotfoundexceptionnoclassdeffounderror-when-using-the-connector-what-do-i-do
https://groups.google.com/forum/#!topic/mongodb-user/xMVoTSePgg0