Set TBLProperties using Scala API - scala

I am writing data into a table with mode overwrite.
Since I have special characters in my column names, I need to set the three properties below for column mapping:
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '5',
'delta.columnMapping.mode' = 'name'
So I would like to know: is there a way to set TBLPROPERTIES using the Scala API?
I tried this, but it is not working:
myDf.write.mode("Overwrite")
.option("delta.minReaderVersion", "2")
.option("delta.minWriterVersion", "5")
.option("delta.columnMapping.mode", "name")
.saveAsTable("testDB.employees")

I have not tested this with Databricks, but it may help:
df.write.format("delta")
.mode("overwrite")
.option("TBLPROPERTIES", "key1=value1, key2=value2")
.save("/path/to/table")

Related

Flink SQL CLI: bring header records

I'm new to the Flink SQL CLI and I want to create a sink from my Kafka cluster.
I've read the documentation and, as I understand it, the headers are of type MAP<STRING, BYTES> and they carry all the important information.
When I'm using the SQL CLI, I try to create a sink table with the following command:
CREATE TABLE KafkaSink (
`headers` MAP<STRING, BYTES> METADATA
) WITH (
'connector' = 'kafka',
'topic' = 'MyTopic',
'properties.bootstrap.servers' ='LocalHost',
'properties.group.id' = 'MyGroypID',
'scan.startup.mode' = 'earliest-offset',
'value.format' = 'json'
);
But when I try to read the data with select * from KafkaSink limit 10; it returns null records.
I've tried to run queries like
select headers.col1 from a limit 10;
I've also tried to create the sink table with different structures in the column-definition part:
...
`headers` STRING
...
...
`headers` MAP<STRING, STRING>
...
...
`headers` ROW(COL1 VARCHAR, COL2 VARCHAR...)
...
But it returns nothing; however, when I bring in the offset column from the Kafka cluster it gives me the offset but not the headers.
Can someone explain my error?
I want to create a Kafka sink with the Flink SQL CLI.
OK, as far as I could see, when I changed to
'format' = 'debezium-json'
I could see the JSON more clearly.
I followed the JSON schema, which in my case was:
{
"data": {...},
"metadata":{...}
}
So instead of bringing in the headers, I'm bringing in the data with all the columns I need: the data as a string and the columns as, for example,
data.col1, data.col2
To see the records, just use a
select
json_value(data, '$.Col1') as Col1
from Table;
it works!

Scala, Quill - how to compare values case-insensitively?

I created a Quill query which should find some data in the database by a given parameter:
val toFind = "SomeName"
val query = query.find(value => infix"$value = ${lift(toFind)}".as[Boolean])
It works fine when, for example, the database contains "SomeName", but if I pass "somename" expecting the same results, I find nothing. The problem is case sensitivity.
Is it possible to always match values in a case-insensitive way? I have not found anything about this in the Quill docs.
OK, I found a solution. It is enough to add the LOWER() SQL function to the infix:
val query = query.find(value => infix"LOWER($value) = ${lift(toFind.toLowerCase)}".as[Boolean])

How to save dataframe to Elasticsearch with mappings

I have the following code to save a dataframe to Elasticsearch. It works well.
val conf = new SparkConf(true).set("spark.cassandra.connection.host", host)
conf.set("spark.es.index.auto.create", "true")
conf.set("spark.es.nodes", host)
val features = sqlContext.read.parquet(input)
features.write.format("org.elasticsearch.spark.sql")
.mode(SaveMode.Append)
.option("es.resource","{ts}/log").save()
It auto-creates the index when it is not there. But when I try to query on some field, it shows the following error:
Set fielddata=true on [country] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.
I am aware of the mapping that makes text fields keywords:
{
  "your_field": {
    "type": "keyword",
    "index": true
  }
}
But I couldn't find out how to apply these mappings when creating the index with this code.
In my experience, Elasticsearch for Hadoop also creates a .keyword sub-field with the keyword type for you already!
Try using country.keyword.
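A different, hedged option if you need full control over the mapping: create the index (or an index template, since the resource name here is the dynamic {ts}/log) yourself before the Spark job runs, so auto-creation never has to guess field types. The host, index name, and typeless mapping body below are placeholders, older Elasticsearch versions need the type name (log) nested under "mappings", and the snippet uses Java 11's built-in HTTP client only for convenience; any HTTP client or curl works the same way.

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Sketch only: pre-create the index with an explicit mapping before Spark writes to it,
// so "country" is indexed as keyword instead of analyzed text.
object CreateIndexWithMapping {
  def main(args: Array[String]): Unit = {
    // Placeholder mapping; adjust field names and add a type level for ES < 7.
    val mapping =
      """{
        |  "mappings": {
        |    "properties": {
        |      "country": { "type": "keyword" }
        |    }
        |  }
        |}""".stripMargin

    val request = HttpRequest.newBuilder()
      .uri(URI.create("http://eshost:9200/logs-2017.01.01")) // placeholder host and index
      .header("Content-Type", "application/json")
      .PUT(HttpRequest.BodyPublishers.ofString(mapping))
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
    println(s"${response.statusCode()} ${response.body()}")
  }
}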

Default value doesn't work in SQLAlchemy + PostgreSQL + aiopg + psycopg2

I've found an unexpected behavior in SQLAlchemy. I'm using the following versions:
SQLAlchemy (0.9.8)
PostgreSQL (9.3.5)
psycopg2 (2.5.4)
aiopg (0.5.1)
This is the table definition for the example:
import asyncio
from aiopg.sa import create_engine
from sqlalchemy import (
    MetaData,
    Column,
    Integer,
    Table,
    String,
)

metadata = MetaData()

users = Table('users', metadata,
    Column('id_user', Integer, primary_key=True, nullable=False),
    Column('name', String(20), unique=True),
    Column('age', Integer, nullable=False, default=0),
)
Now if I try to execute a simple insert into the table, populating just id_user and name, the age column should be auto-generated, right? Let's see...
@asyncio.coroutine
def go():
    engine = yield from create_engine('postgresql://USER@localhost/DB')
    data = {'id_user': 1, 'name': 'Jimmy'}
    stmt = users.insert(values=data, inline=False)
    with (yield from engine) as conn:
        result = yield from conn.execute(stmt)

loop = asyncio.get_event_loop()
loop.run_until_complete(go())
This is the resulting statement with the corresponding error:
INSERT INTO users (id_user, name, age) VALUES (1, 'Jimmy', null);
psycopg2.IntegrityError: null value in column "age" violates not-null constraint
I didn't provide the age column, so where is that age = null value coming from? I was expecting something like this:
INSERT INTO users (id_user, name) VALUES (1, 'Jimmy');
Or, if the default flag actually works, it should be:
INSERT INTO users (id_user, name, age) VALUES (1, 'Jimmy', 0);
Could you shed some light on this?
This issue has been confirmed as an aiopg bug. It seems that, at the moment, it ignores the default argument on data manipulation.
I've fixed the issue using server_default instead:
users = Table('users', metadata,
    Column('id_user', Integer, primary_key=True, nullable=False),
    Column('name', String(20), unique=True),
    Column('age', Integer, nullable=False, server_default='0'))
I think you need to use inline=True in your insert. This turns off 'pre-execution'.
The docs are a bit cryptic about what exactly this 'pre-execution' entails, but they do mention default parameters:
:param inline:
if True, SQL defaults present on :class:`.Column` objects via
the ``default`` keyword will be compiled 'inline' into the statement
and not pre-executed. This means that their values will not
be available in the dictionary returned from
:meth:`.ResultProxy.last_updated_params`.
This piece of docstring is from the Update class, but Insert shares the same behavior.
Besides, that's the only way they test it:
https://github.com/zzzeek/sqlalchemy/blob/rel_0_9/test/sql/test_insert.py#L385

Concatenating databases with Squeryl

I'm trying to use Squeryl to take the contents of a table from one database and append it to the equivalent table in another database. The primary key will have to be reassigned in the process, but I'm getting the error NULL not allowed for column "SIMID". Why is this?
import java.util.Date

import org.squeryl.{KeyedEntity, Schema, Session}
import org.squeryl.adapters.H2Adapter
import org.squeryl.annotations.Column
import org.squeryl.PrimitiveTypeMode._

object Concatenator {
  def main(args: Array[String]) {
    Class.forName("org.h2.Driver")
    val seshA = Session.create(
      java.sql.DriverManager.getConnection("jdbc:h2:file:data/resultsA", "sa", "password"),
      new H2Adapter
    )
    val seshB = Session.create(
      java.sql.DriverManager.getConnection("jdbc:h2:file:data/resultsB", "sa", "password"),
      new H2Adapter
    )
    using(seshA) {
      import Library._
      from(sims) { s => select(s) }.foreach { item =>
        using(seshB) {
          sims.insert(item)
        }
      }
    }
  }

  case class Simulation(
    @Column("SIMID")
    var id: Long,
    val date: Date
  ) extends KeyedEntity[Long]

  object Library extends Schema {
    val sims = table[Simulation]

    on(sims)(s => declare(
      s.id is (unique, indexed, autoIncremented)
    ))
  }
}
Update:
I think it might be something to do with the DBs. They were created in a Java project using JPA/EclipseLink, and in addition to generating tables for my entities it also created a table called SEQUENCE, presumably for primary key generation.
I've found that I can create a brand new table in Squeryl and manually put the contents of both databases into that, thus achieving the same effect. Interestingly, this new table did not have any SEQUENCE table auto-generated. So I'm guessing it comes down to how JPA/EclipseLink was generating my primary keys?
Update 2:
As requested, I appended trace_level_file=3 to the URL and the files are here: resultsA.trace.db and resultsB.trace.db. B is the more interesting one, I think. Also, I've put a simplified version of the database here, which has had unnecessary tables removed (the same database is used for resultsA and resultsB).
Just got a moment to look at this more closely. It turns out you were on the right track. While I gather that EclipseLink uses sequences to generate the PK value, Squeryl defines the column as something like:
simid bigint not null primary key auto_increment
Without the auto_increment flag, a value is never placed in the column and you end up with the constraint violation you mentioned. It sounds like you've already worked around the issue, but hopefully this will help you or someone else in the future.
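For what it's worth, a rough sketch of that idea, reusing seshA, seshB, and Library from the question: let Squeryl create the target table itself (so SIMID gets the auto_increment flag from the schema declaration) and zero the id on each copied row so H2 assigns fresh keys. The Library.create call and the copy are my assumptions, not code from the thread.

// Sketch: recreate the target table through Squeryl so its DDL carries
// auto_increment on SIMID, then copy rows letting H2 assign new ids.
using(seshB) {
  Library.create                      // emits CREATE TABLE ... auto_increment
}
val rows = using(seshA) {
  import Library._
  from(sims)(s => select(s)).toList   // materialise before switching sessions
}
using(seshB) {
  import Library._
  rows.foreach(item => sims.insert(item.copy(id = 0L)))
}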
Not really a solution, but my workaround is to create a new database
val seshNew = Session.create(
  java.sql.DriverManager.getConnection("jdbc:h2:file:data/resultsNew", "sa", "password"),
  new H2Adapter
)
and then just write all the data from the other databases into it
using(seshNew) {
  sims.insert(new Simulation(0, item.date))
}
The primary key of 0 gets overwritten as appropriate.