case when in merge statement databricks - pyspark

I am trying to upsert into a Delta table in Databricks using a merge statement in PySpark. I want to know whether expressions (e.g. adding two columns, or a CASE WHEN) are allowed in the whenMatchedUpdate part. For example, I want to do something like this:
deltaTableTarget = DeltaTable.forPath(spark, delta_table_path)

(deltaTableTarget.alias('TargetTable')
    .merge(
        broadcast(df_transformed.alias('DeltaSource')),
        "DeltaSource.primary_key == TargetTable.primary_key"
    )
    .whenMatchedUpdate(set =
        {
            "aggcount": "DeltaSource.count + TargetTable.count",
            "max_date": "case when DeltaSource.max_date > TargetTable.max_date then DeltaSource.max_date else TargetTable.max_date end"
        }
    )
    .whenNotMatchedInsertAll()
    .execute())

If I understand your logic correctly, you just want the greater of the two date columns, right? In that case you can use GREATEST:
deltaTableTarget = DeltaTable.forPath(spark, delta_table_path)

(deltaTableTarget.alias('TargetTable')
    .merge(
        broadcast(df_transformed.alias('DeltaSource')),
        "DeltaSource.primary_key == TargetTable.primary_key"
    )
    .whenMatchedUpdate(set =
        {
            "aggcount": "DeltaSource.count + TargetTable.count",
            "max_date": "GREATEST(DeltaSource.max_date, TargetTable.max_date)"
        }
    )
    .whenNotMatchedInsertAll()
    .execute())
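As a quick standalone sanity check that GREATEST keeps the later of two dates (the literal values below are hypothetical, not taken from the question):

# Prints 2021-06-01: GREATEST returns the larger (later) value
spark.sql("SELECT GREATEST(DATE'2021-01-01', DATE'2021-06-01') AS max_date").show()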
If that is not what you need, you can instead chain multiple whenMatchedUpdate() clauses, each with its own condition; for a matched row, the first clause whose condition is satisfied is the one that gets applied:
deltaTableTarget = DeltaTable.forPath(spark, delta_table_path)

(deltaTableTarget.alias('TargetTable')
    .merge(
        broadcast(df_transformed.alias('DeltaSource')),
        "DeltaSource.primary_key == TargetTable.primary_key"
    )
    .whenMatchedUpdate(condition = 'DeltaSource.max_date > TargetTable.max_date',
        set =
        {
            "aggcount": "DeltaSource.count + TargetTable.count",
            "max_date": "DeltaSource.max_date"
        }
    )
    .whenMatchedUpdate(set =
        {
            "aggcount": "DeltaSource.count + TargetTable.count",
            "max_date": "TargetTable.max_date"
        }
    )
    .whenNotMatchedInsertAll()
    .execute())
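Either way, the values in the set dictionary are evaluated as Spark SQL expression strings, so the CASE WHEN from the question should also be accepted as-is. If you prefer to express the whole upsert in SQL, the same logic can be written as a MERGE INTO statement and run through spark.sql. A rough sketch, assuming df_transformed is registered as a temp view and delta_table_path points at the target Delta table (names are illustrative, and the broadcast hint from the Python version is not carried over):

df_transformed.createOrReplaceTempView("DeltaSource")

spark.sql(f"""
    MERGE INTO delta.`{delta_table_path}` AS TargetTable
    USING DeltaSource
    ON DeltaSource.primary_key = TargetTable.primary_key
    WHEN MATCHED THEN UPDATE SET
        aggcount = DeltaSource.count + TargetTable.count,
        max_date = CASE WHEN DeltaSource.max_date > TargetTable.max_date
                        THEN DeltaSource.max_date
                        ELSE TargetTable.max_date END
    WHEN NOT MATCHED THEN INSERT *
""")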

Related

Process runs shell successfully, but does not upload data into database

I'm creating a process that basically reads a CSV file from a given path, generates a DataFrame, and applies a filter to fit the different fields. This DataFrame only needs to be written to an Oracle database table; it does not create a destination HDFS path. The process was created in IntelliJ as an sbt project, in Scala. When executed directly in the spark-shell, the process runs and the DataFrame loads (append mode) successfully. However, when the jar is built and the shell script is executed with bash and NiFi, the job finishes as SUCCEEDED but no data is loaded into the database. I tried to run it in overwrite mode, but the answer was that I did not have privileges. I mention this because, at the top of the code, the Spark configuration refers to an overwrite mode (I don't set this Spark configuration when running directly in the spark-shell), and I want to know whether my code is OK for writing to a database, or whether this is related to a lack of privileges.
Here is the code of the process:
import com.tchile.bigdata.hdfs.Hdfs
import com.tchile.bigdata.utilities.Utilities
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import java.sql.DriverManager
import java.util.concurrent.TimeUnit
class Process {

  def process(shell_variable: String): Unit = {

    val startTotal = TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis())
    println("[INFO] Inicio de proceso Logistica Carga Input Simple Data")

    // Production Spark configuration
    val sparkConf = new SparkConf()
      .setAppName("Logistica_Carga_Input_SimpleData")
      .set("spark.sql.sources.partitionOverwriteMode", "dynamic")
      .set("spark.sql.parquet.output.committer.class", "org.apache.parquet.hadoop.ParquetOutputCommitter")
      .set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")

    val spark = SparkSession
      .builder()
      .config(sparkConf)
      .enableHiveSupport()
      .getOrCreate()

    val sc = spark.sparkContext
    import spark.implicits._

    val hdfs = new Hdfs
    val utils = new Utilities

    println("[INFO] Obteniendo variables desde shell/NIFI")
    // Values passed in from NiFi
    val daysAgo = shell_variable.split(":")(0).toInt   // number of days to reprocess: start
    val daysAhead = shell_variable.split(":")(1).toInt // number of days to reprocess: limit
    val repartition = shell_variable.split(":")(2).toInt

    // Auxiliary variables for reprocessing in HDFS
    var pathToProcess = ""
    var pathToDelete = ""
    var deleteStatus = false

    println("[INFO] Obteniendo variables desde Parametros.conf")
    // Values taken from Parametros.conf > exadata
    val driver_jdbc = sc.getConf.get("spark.exadata.driver_jdbc")
    val url_jdbc = sc.getConf.get("spark.exadata.url_jdbc")
    val user_jdbc = sc.getConf.get("spark.exadata.user_jdbc")
    val pass_jdbc = sc.getConf.get("spark.exadata.pass_jdbc")
    val table_name = sc.getConf.get("spark.exadata.table_name")
    val table_owner = sc.getConf.get("spark.exadata.table_owner")
    val table = sc.getConf.get("spark.exadata.table")

    // Values taken from Parametros.conf > path
    //val pathCsv = sc.getConf.get("spark.path.Csv")

    // Compute the start date (daysAgo days back)
    val startDate = utils.getCalculatedDate("yyyy-MM-dd", -daysAgo)
    // Compute the limit date
    val endDate = utils.getCalculatedDate("yyyy-MM-dd", -daysAhead)

    // Split year, month and day from the start date
    val startDateYear = startDate.substring(0, 4)
    val startDateMonth = startDate.substring(5, 7)
    val startDateDay = startDate.substring(8, 10)

    // Split year, month and day from the limit date
    val endDateYear = endDate.substring(0, 4)
    val endDateMonth = endDate.substring(5, 7)
    val endDateDay = endDate.substring(8, 10)

    // Log information
    println("[INFO] Reproceso de: " + daysAgo + " días")
    println("[INFO] Fecha inicio: " + startDate)
    println("[INFO] Fecha límite: " + endDate)

    try {
      // ================= START OF PROCESS LOGIC =================
      // Build the DataFrame from the daily input file at the indicated path
      val df_csv = spark.read.format("csv")
        .option("header", "true")
        .option("sep", ";")
        .option("mode", "dropmalformed")
        .load("/applications/recup_remozo_equipos/equipos_por_recuperar/output/agendamientos_sin_pet_2")

      val df_final = df_csv.select(
        $"RutSinDV".as("RUT_SIN_DV"),
        $"dv".as("DV"),
        $"Agendado".as("AGENDADO"),
        to_date(col("Dia_Agendado"), "yyyyMMdd").as("DIA_AGENDADO"),
        $"Horario_Agendado".as("HORARIO_AGENDADO"),
        $"Nombre_Agendamiento".as("NOMBRE_AGENDAMIENTO"),
        $"Telefono_Agendamiento".as("TELEFONO_AGENDAMIENTO"),
        $"Email".substr(0, 49).as("EMAIL"),
        $"Region_Agendamiento".substr(0, 29).as("REGION_AGENDAMIENTO"),
        $"Comuna_Agendamiento".as("COMUNA_AGENDAMIENTO"),
        $"Direccion_Agendamiento".as("DIRECCION_AGENDAMIENTO"),
        $"Numero_Agendamiento".substr(0, 5).as("NUMERO_AGENDAMIENTO"),
        $"Depto_Agendamiento".substr(0, 9).as("DEPTO_AGENDAMIENTO"),
        to_timestamp(col("fecha_registro")).as("FECHA_REGISTRO"),
        to_timestamp(col("Fecha_Proceso")).as("FECHA_PROCESO")
      )
      // ================== END OF PROCESS LOGIC ==================

      // Cleanup in EXADATA
      println("[INFO] Se inicia la limpieza por reproceso en EXADATA")
      val query_particiones = "(SELECT * FROM (WITH DATA AS (select table_name,partition_name,to_date(trim('''' " +
        "from regexp_substr(extractvalue(dbms_xmlgen.getxmltype('select high_value from all_tab_partitions " +
        "where table_name='''|| table_name|| ''' and table_owner = '''|| table_owner|| ''' and partition_name = '''" +
        "|| partition_name|| ''''),'//text()'),'''.*?''')),'syyyy-mm-dd hh24:mi:ss') high_value_in_date_format " +
        "FROM all_tab_partitions WHERE table_name = '" + table_name + "' AND table_owner = '" + table_owner + "')" +
        "SELECT partition_name FROM DATA WHERE high_value_in_date_format > DATE '" + startDateYear + "-" + startDateMonth + "-" + startDateDay + "' " +
        "AND high_value_in_date_format <= DATE '" + endDateYear + "-" + endDateMonth + "-" + endDateDay + "') A)"

      Class.forName(driver_jdbc)
      val db = DriverManager.getConnection(url_jdbc, user_jdbc, pass_jdbc)
      val st = db.createStatement()

      try {
        val consultaParticiones = spark.read.format("jdbc")
          .option("url", url_jdbc)
          .option("driver", driver_jdbc)
          .option("dbTable", query_particiones)
          .option("user", user_jdbc)
          .option("password", pass_jdbc)
          .load()
          .collect()

        for (partition <- consultaParticiones) {
          st.executeUpdate("call " + table_owner + ".DO_THE_TRUNCATE_PARTITION('" + table + "','" + partition.getString(0) + "')")
        }
      } catch {
        case e: Exception =>
          println("[ERROR TRUNCATE] " + e)
      }

      st.close()
      db.close()

      println("[INFO] Se inicia la inserción en EXADATA")
      df_final.filter($"DIA_AGENDADO" >= "2022-08-01")
        .repartition(repartition).write.mode("append")
        .jdbc(url_jdbc, table, utils.jdbcProperties(driver_jdbc, user_jdbc, pass_jdbc))
      println("[INFO] Inserción en EXADATA completada con éxito")
      println("[INFO] Proceso Logistica Carga Input SimpleData")

      val endTotal = TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis()) - startTotal
      println("[INFO] TIEMPO TOTAL EJECUCIÓN: " + utils.secondsToMinutes(endTotal))
    }
    catch {
      case e: Exception =>
        println("[EXCEPTION] " + e)
        val endTotal = TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis()) - startTotal
        println("[INFO] TIEMPO TOTAL EJECUCIÓN (CON ERROR): " + utils.secondsToMinutes(endTotal))
        throw e
    }
  }
}
Now I want to show the shell script:
#!/bin/bash
args=("$#")
export SPARK_MAJOR_VERSION=2;
in=${args[0]}
process_name2=${args[1]}
dir_hdfs=$(echo $in | awk -F ":" '{print$1}')
process_name=Logistica_Carga_Input_SimpleData
if [ -z "$process_name2" ]; then
process_name2=$process_name
fi
var_log=$(echo $dir_hdfs"/log/part*")
spark-submit \
--master yarn \
--deploy-mode cluster \
--name "Logistica_Carga_Input_SimpleData" \
--class com.tchile.bigdata.App \
--num-executors 2 \
--properties-file "/home/TCHILE.LOCAL/srv_remozo_equip/ingesta/Logistica_Carga_Input_SimpleData/Parametros.conf" \
--conf "spark.executor.cores=16" \
--conf "spark.default.parallelism=120" \
--conf "spark.driver.memory=20gb" \
--conf "spark.executor.memory=30gb" \
--conf "spark.ui.acls.enable=true" \
--conf "spark.sql.sources.partitionOverwriteMode=dynamic" \
--conf "spark.ui.view.acls=srv_remozo_equip" \
--conf "spark.admin.acls=srv_remozo_equip" \
--conf "spark.sql.warehouse.dir=/etc/hive/conf/hive-site.xml" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./kafka_client_jaas.conf" \
--conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=./kafka_client_jaas.conf" \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
--conf "spark.driver.extraClassPath=/usr/share/java/ojdbc7.jar" \
--jars "/usr/share/java/ojdbc7.jar" \
hdfs://nn:8020/applications/recup_remozo_equipos/Logistica_Carga_Input_SimpleData/logistica_carga_input_simpledata_2.11-1.0.jar "${args[0]}"
Some comments: the field names are in Spanish, but I don't think that is relevant for the analysis. I also do not include the parameter configuration file. I appreciate any comments on this.
Problem solved: in the df_final DataFrame, the cast of Dia_Agendado (to_date(col("Dia_Agendado"), "yyyyMMdd").as("DIA_AGENDADO")) used the wrong format; the source data is actually in yyyy-MM-dd. I had not changed this format in the IntelliJ version of the code; the copy I had been editing in Sublime had the right format, which is why that one worked and I had not noticed the difference. With the wrong pattern, to_date returns null, so the DIA_AGENDADO >= "2022-08-01" filter dropped every row, and that is why nothing was uploaded to the database.

Adding a custom function in SPARK NLP Pipeline

I am trying to add a custom function (to remove names from a document), and I want this UDF to be incorporated into the Spark NLP pipeline. Can you please tell me where I am going wrong?
Here is the UDF I created to remove names:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def remove_names(str):
    #All_names = list(map(lambda x: x.lower(), All_names))
    All_names = ['josh','alice']
    split_str = str.split()
    found_names = [name for name in split_str if name in All_names]
    for name in found_names:
        str = str.replace(name,'')
    str = ' '.join(str.split())
    return str

udf_txt_clean = udf(lambda x: remove_names(x), StringType())
My pipeline looks like this:
documentAssembler = DocumentAssembler()\
    .setInputCol("combined_text")\
    .setOutputCol("document")

remove_names = udf_txt_clean('document') \
    .setInputCols(["document"]) \
    .setOutputCol("noname")

tokenizer = Tokenizer() \
    .setInputCols(["noname"]) \
    .setOutputCol("token")

spellModel = ContextSpellCheckerModel\
    .pretrained()\
    .setInputCols("token")\
    .setOutputCol("checked")

normalizer = Normalizer() \
    .setInputCols(["checked"]) \
    .setOutputCol("normalized")\
    .setLowercase(True)

stopwords_cleaner = StopWordsCleaner()\
    .setInputCols("normalized")\
    .setOutputCol("cleanTokens")\
    .setCaseSensitive(False)

tokenassembler = TokenAssembler()\
    .setInputCols(["document", "cleanTokens"]) \
    .setOutputCol("clean_text")

nlpPipeline = Pipeline(stages=[
    documentAssembler,
    remove_names,
    tokenizer,
    spellModel,
    normalizer,
    stopwords_cleaner,
    tokenassembler
])

empty_df = spark.createDataFrame([['']]).toDF("text")
pipelineModel = nlpPipeline.fit(empty_df)
result = pipelineModel.transform(sdf)
result.select('text', explode('clean_text.result').alias('clean_text')).display()
Sample data to recreate the issue:
import pandas as pd

text = 'josh and sam are buddies. They (both) like <b>running</b>. They got better at it over the weekend'
test = pd.DataFrame({"test": [text]})
sdf = spark.createDataFrame(test)
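For what it's worth, the UDF as defined above can be exercised on a plain DataFrame column outside the Spark NLP pipeline; a minimal standalone check against the sample DataFrame sdf above (column name test as in the sample):

from pyspark.sql.functions import col

# Expected output: 'and sam are buddies. They (both) like <b>running</b>. They got better at it over the weekend'
sdf.withColumn("noname", udf_txt_clean(col("test"))).select("noname").show(truncate=False)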

Flink table sink doesn't work with debezium-avro-confluent source

I'm using Flink SQL to read Debezium Avro data from Kafka and store it as Parquet files in S3. Here is my code:
import os

from pyflink.datastream import StreamExecutionEnvironment, FsStateBackend
from pyflink.table import TableConfig, DataTypes, BatchTableEnvironment, StreamTableEnvironment, \
    ScalarFunction

exec_env = StreamExecutionEnvironment.get_execution_environment()
exec_env.set_parallelism(1)
# start a checkpoint every 12 s
exec_env.enable_checkpointing(12000)

t_config = TableConfig()
t_env = StreamTableEnvironment.create(exec_env, t_config)

INPUT_TABLE = 'source'
KAFKA_TOPIC = os.environ['KAFKA_TOPIC']
KAFKA_BOOTSTRAP_SERVER = os.environ['KAFKA_BOOTSTRAP_SERVER']

OUTPUT_TABLE = 'sink'
S3_BUCKET = os.environ['S3_BUCKET']
OUTPUT_S3_LOCATION = os.environ['OUTPUT_S3_LOCATION']

ddl_source = f"""
    CREATE TABLE {INPUT_TABLE} (
        `event_time` TIMESTAMP(3) METADATA FROM 'timestamp' VIRTUAL,
        `id` BIGINT,
        `price` DOUBLE,
        `type` INT,
        `is_reinvite` INT
    ) WITH (
        'connector' = 'kafka',
        'topic' = '{KAFKA_TOPIC}',
        'properties.bootstrap.servers' = '{KAFKA_BOOTSTRAP_SERVER}',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'debezium-avro-confluent',
        'debezium-avro-confluent.schema-registry.url' = 'http://kafka-production-schema-registry:8081'
    )
"""

ddl_sink = f"""
    CREATE TABLE {OUTPUT_TABLE} (
        `event_time` TIMESTAMP,
        `id` BIGINT,
        `price` DOUBLE,
        `type` INT,
        `is_reinvite` INT
    ) WITH (
        'connector' = 'filesystem',
        'path' = 's3://{S3_BUCKET}/{OUTPUT_S3_LOCATION}',
        'format' = 'parquet'
    )
"""

t_env.sql_update(ddl_source)
t_env.sql_update(ddl_sink)

t_env.execute_sql(f"""
    INSERT INTO {OUTPUT_TABLE}
    SELECT *
    FROM {INPUT_TABLE}
""")
When I submit the job, I get the following error message,
pyflink.util.exceptions.TableException: Table sink 'default_catalog.default_database.sink' doesn't support consuming update and delete changes which is produced by node TableSourceScan(table=[[default_catalog, default_database, source]], fields=[id, price, type, is_reinvite, timestamp])
I'm using Flink 1.12.1. The source is working properly; I have tested it using the 'print' connector as the sink. Here is a sample data set extracted from the task manager logs when using the 'print' connector in the table sink:
-D(2021-02-20T17:07:27.298,14091764,26.0,9,0)
-D(2021-02-20T17:07:27.298,14099765,26.0,9,0)
-D(2021-02-20T17:07:27.299,14189806,16.0,9,0)
-D(2021-02-20T17:07:27.299,14189838,37.0,9,0)
-D(2021-02-20T17:07:27.299,14089840,26.0,9,0)
-D(2021-02-20T17:07:27.299,14089847,26.0,9,0)
-D(2021-02-20T17:07:27.300,14189859,26.0,9,0)
-D(2021-02-20T17:07:27.301,14091808,37.0,9,0)
-D(2021-02-20T17:07:27.301,14089911,37.0,9,0)
-D(2021-02-20T17:07:27.301,14099937,26.0,9,0)
-D(2021-02-20T17:07:27.302,14091851,37.0,9,0)
How can I make my table sink work with the filesystem connector?
What happens is that:
When receiving the Debezium records, Flink maintains a logical table by adding, updating and removing rows based on their primary key.
The only sinks that can consume that kind of changelog are those that have a concept of update by key. JDBC would be a typical example: it is straightforward for Flink to translate "the Flink row with key foo has been updated to bar" into "the JDBC row with key foo should be updated to bar". The filesystem sink does not support that kind of operation, since files are append-only.
See also the Flink documentation on append and update queries.
In practice, to do the conversion we first have to decide what exactly we want to have in this append-only file.
If what we want is for the file to contain the latest version of each item every time an id is updated, then to my knowledge the way to go is to convert the table to a retract stream first and then write that stream out with a FileSink. Note that in that case each record carries a boolean saying whether the row is an upsert or a deletion, and we have to decide how we want that information to appear in the resulting file.
Note: I used this other CDC example from the Flink SQL cookbook to reproduce a similar setup:
// assuming a Flink retract table of claims built from a CDC stream:
tableEnv.executeSql("" +
    " CREATE TABLE accident_claims (\n" +
    " claim_id INT,\n" +
    " claim_total FLOAT,\n" +
    " claim_total_receipt VARCHAR(50),\n" +
    " claim_currency VARCHAR(3),\n" +
    " member_id INT,\n" +
    " accident_date VARCHAR(20),\n" +
    " accident_type VARCHAR(20),\n" +
    " accident_detail VARCHAR(20),\n" +
    " claim_date VARCHAR(20),\n" +
    " claim_status VARCHAR(10),\n" +
    " ts_created VARCHAR(20),\n" +
    " ts_updated VARCHAR(20)" +
    ") WITH (\n" +
    " 'connector' = 'postgres-cdc',\n" +
    " 'hostname' = 'localhost',\n" +
    " 'port' = '5432',\n" +
    " 'username' = 'postgres',\n" +
    " 'password' = 'postgres',\n" +
    " 'database-name' = 'postgres',\n" +
    " 'schema-name' = 'claims',\n" +
    " 'table-name' = 'accident_claims'\n" +
    " )"
);

// convert it to a stream
Table accidentClaims = tableEnv.from("accident_claims");
DataStream<Tuple2<Boolean, Row>> accidentClaimsStream = tableEnv
    .toRetractStream(accidentClaims, Row.class);

// and write to file
final FileSink<Tuple2<Boolean, Row>> sink = FileSink
    // TODO: adapt the output format here:
    .forRowFormat(new Path("/tmp/flink-demo"),
        (Encoder<Tuple2<Boolean, Row>>) (element, stream) -> stream.write((element.toString() + "\n").getBytes(StandardCharsets.UTF_8)))
    .build();

accidentClaimsStream.sinkTo(sink);
streamEnv.execute();
Note that during the conversion you obtain a boolean telling you whether that row is a new value for that accident claim or a deletion of such a claim. My basic FileSink configuration there simply includes that boolean in the output; how to handle deletions has to be decided case by case.
The result in the file then looks like this:
head /tmp/flink-demo/2021-03-09--09/.part-c7cdb74e-893c-4b0e-8f69-1e8f02505199-0.inprogress.f0f7263e-ec24-4474-b953-4d8ef4641998
(true,1,4153.92,null,AUD,412,2020-06-18 18:49:19,Permanent Injury,Saltwater Crocodile,2020-06-06 03:42:25,IN REVIEW,2021-03-09 06:39:28,2021-03-09 06:39:28)
(true,2,8940.53,IpsumPrimis.tiff,AUD,323,2019-03-18 15:48:16,Collision,Blue Ringed Octopus,2020-05-26 14:59:19,IN REVIEW,2021-03-09 06:39:28,2021-03-09 06:39:28)
(true,3,9406.86,null,USD,39,2019-04-28 21:15:09,Death,Great White Shark,2020-03-06 11:20:54,INITIAL,2021-03-09 06:39:28,2021-03-09 06:39:28)
(true,4,3997.9,null,AUD,315,2019-10-26 21:24:04,Permanent Injury,Saltwater Crocodile,2020-06-25 20:43:32,IN REVIEW,2021-03-09 06:39:28,2021-03-09 06:39:28)
(true,5,2647.35,null,AUD,74,2019-12-07 04:21:37,Light Injury,Cassowary,2020-07-30 10:28:53,REIMBURSED,2021-03-09 06:39:28,2021-03-09 06:39:28)

Error detected while processing function SwitchFlowOrTsLsps nvim windows 10

Hello, I tried to open a .js file and upon opening it I got this error:
Error detected while processing function SwitchFlowOrTsLsps:
line 4:
E121: Undefined variable: state
E15: Invalid expression: (tsserver.state == 'disabled')
I tried to find the line it refers to in the various files (maps, plugin, plugin config); the last one is where I found it. I did not run into this error before because I was only opening files with .py, .cpp and .java extensions. I have coc installed but have not downloaded anything for this, and I also have Kite installed as an autocompletion plugin.
" HTML, JSX
let g:closetag_filenames = '*.html,*.js,*.jsx,*.ts,*.tsx'
" Lightlane
let g:lightline = {
\ 'active': {
\ 'left': [['mode', 'paste'], [], ['relativepath', 'modified']],
\ 'right': [['kitestatus'], ['filetype', 'percent', 'lineinfo'], ['gitbranch']]
\ },
\ 'inactive': {
\ 'left': [['inactive'], ['relativepath']],
\ 'right': [['bufnum']]
\ },
\ 'component': {
\ 'bufnum': '%n',
\ 'inactive': 'inactive'
\ },
\ 'component_function': {
\ 'gitbranch': 'fugitive#head',
\ 'cocstatus': 'coc#status'
\ },
\ 'colorscheme': 'gruvbox',
\ 'subseparator': {
\ 'left': '',
\ 'right': ''
\ }
\}
" nerdtree
let NERDTreeShowHidden=1
let NERDTreeQuitOnOpen=1
let NERDTreeAutoDeleteBuffer=1
let NERDTreeMinimalUI=1
let NERDTreeDirArrows=1
let NERDTreeShowLineNumbers=1
let NERDTreeMapOpenInTab='\t'
let g:javascript_plugin_flow = 1
" Trigger configuration. Do not use <tab> if you use https://github.com/Valloric/YouCompleteMe.
let g:UltiSnipsSnippetDirectories=[$HOME.'/config/.vim/UltiSnips']
let g:UltiSnipsExpandTrigger="<tab>"
let g:UltiSnipsJumpForwardTrigger="<tab>"
let g:UltiSnipsJumpBackwardTrigger="<S-tab>"
" deoplete
let g:deoplete#enable_at_startup = 1
let g:neosnippet#enable_completed_snippet = 1
" kite
let g:kite_supported_lenguages = ['javascript', 'python']
" coc
autocmd FileType python let b:coc_suggest_disable = 1
autocmd FileType javascript let b:coc_suggest_disable = 1
autocmd FileType scss setl iskeyword+=#-#
command! -bang -nargs=? -complete=dir GFiles
\ call fzf#vim#gitfiles(<q-args>, fzf#vim#with_preview(), <bang>0)
command! -bang -nargs=* Ag
\ call fzf#vim#ag(<q-args>, fzf#vim#with_preview(), <bang>0)
command! -bang -nargs=? -complete=dir Files
\ call fzf#vim#files(<q-args>, fzf#vim#with_preview(), <bang>0)
" if hidden is not set, TextEdit might fail.
set hidden
" Some servers have issues with backup files, see #649
set nobackup
set nowritebackup
" Better display for messages
set cmdheight=2
" You will have bad experience for diagnostic messages when it's default 4000.
set updatetime=300
" don't give |ins-completion-menu| messages.
set shortmess+=c
" always show signcolumns
set signcolumn=yes
" fugitive always vertical diffing
set diffopt+=vertical
" Use <c-space> to trigger completion.
inoremap <silent><expr> <c-space> coc#refresh()
" Remap keys for gotos
nmap <silent> gd <Plug>(coc-definition)
nmap <silent> gy <Plug>(coc-type-definition)
nmap <silent> gi <Plug>(coc-implementation)
nmap <silent> gr <Plug>(coc-references)
" Highlight symbol under cursor on CursorHold
autocmd CursorHold * silent call CocActionAsync('highlight')
autocmd BufEnter *.js :silent let myIndex = SearchPatternInFile("#flow") | call SwitchFlowOrTsLsps(myIndex)
autocmd BufEnter *.jsx :silent let myIndex = SearchPatternInFile("#flow") | call SwitchFlowOrTsLsps(myIndex)
function! SwitchFlowOrTsLsps(flowIndex)
  silent let stats = CocAction("extensionStats")
  silent let tsserver = get(filter(copy(stats), function('FindTsServer')), 0)
  if(a:flowIndex == 0)
    if(tsserver.state == 'disabled')
      call CocActionAsync("toggleExtension", "coc-tsserver")
    endif
  else
    if(tsserver.state == 'activated')
      call CocActionAsync("toggleExtension", "coc-tsserver")
    endif
  endif
endfunction

function! FindTsServer(idx, value)
  return a:value.id == 'coc-tsserver'
endfunction
let $FZF_DEFAULT_OPTS='--layout=reverse'
let g:fzf_layout = { 'window': 'call FloatingFZF()' }
function! FloatingFZF()
  let buf = nvim_create_buf(v:false, v:true)
  call setbufvar(buf, '&signcolumn', 'no')
  let height = float2nr((&lines - 3) / 2)
  let width = float2nr(&columns - (&columns * 2 / 10))
  let col = float2nr((&columns - width) / 2)
  let row = float2nr((&lines - height) / 2)
  let opts = {
        \ 'relative': 'editor',
        \ 'row': row,
        \ 'col': col,
        \ 'width': width,
        \ 'height': height
        \ }
  call nvim_open_win(buf, v:true, opts)
endfunction
function! SearchPatternInFile(pattern)
  " Save cursor position.
  let save_cursor = getcurpos()
  " Set cursor position to beginning of file.
  call cursor(0, 0)
  " Search for the pattern with the 'c' flag, which means a match at the
  " cursor position is accepted.
  let search_result = search(a:pattern, "c")
  " Set the cursor back at the saved position. setpos is used here because
  " the return value of getcurpos can be passed to it directly, unlike the
  " cursor function.
  call setpos('.', save_cursor)
  " If the search function didn't find the pattern, it returns 0; any other
  " number means an instance has been found.
  return search_result
endfunction
How can I solve this? Thanks, everyone.
Can you try running this command in Vim and see if it works?
:CocInstall coc-json coc-tsserver

Filter on Oid using spark-mongo connector

I would like to filter on the ObjectId of the Mongo document from a Spark program. I have tried the following:
case class _id(oid: String)

val str_start: _id = _id((start.getMillis() / 1000).toHexString + "0000000000000000")
val str_end: _id = _id((end.getMillis() / 1000).toHexString + "0000000000000000")

val filteredDF = df.filter(
  $"_timestamp".isNotNull
    .and($"_timestamp".between(new Timestamp(start.getMillis()), new Timestamp(end.getMillis())))
    .and($"_id").between(str_start, str_end)
)
or
val str_start = (start.getMillis() / 1000).toHexString + "0000000000000000"
val str_end = (end.getMillis() / 1000).toHexString + "0000000000000000"

val filteredDF = df.filter(
  $"_timestamp".isNotNull
    .and($"_timestamp".between(new Timestamp(start.getMillis()), new Timestamp(end.getMillis())))
    .and($"_id.oid").between(str_start, str_end)
)
Both give me an error in analysis:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve
'(((`_timestamp` IS NOT NULL) AND ((`_timestamp` >= TIMESTAMP('2017-07-31 00:22:00.0'))
AND (`_timestamp` <= TIMESTAMP('2017-08-01 00:22:00.0')))) AND `_id`)' due to data type
mismatch: differing types in '(((`_timestamp` IS NOT NULL) AND ((`_timestamp` >=
TIMESTAMP('2017-07-31 00:22:00.0')) AND (`_timestamp` <= TIMESTAMP('2017-08-01 00:22:00.0'))))
AND `_id`)' (boolean and struct<oid:string>).;;
'Filter (((((isnotnull(_timestamp#40) && ((_timestamp#40 >= 1501449720000000) && (_timestamp#40 <= 1501536120000000))) && _id#38) >= 597e4df80000000000000000)
&& (((isnotnull(_timestamp#40) && ((_timestamp#40 >= 1501449720000000) && (_timestamp#40 <= 1501536120000000))) && _id#38) <= 597f9f780000000000000000))
How can I query on the oid?
Thanks
Nir
I think you have misplaced a parenthesis. It should be something like:
.and($"_id.oid".between(str_start, str_end))
In your version the and(...) call closes right after the column, so the column itself (rather than a between(...) comparison) ends up inside the AND, and between is then applied to the whole boolean expression. That is why the analyzer complains about:
(boolean and struct<oid:string>)