Process runs the shell successfully, but does not upload data into the database - Scala

I'm creating a process that reads a path from a CSV file, builds a DataFrame, and applies filters to the different fields. The DataFrame only needs to be written to an Oracle database table; it does not create a destination HDFS path. The process was created in IntelliJ as an sbt project, in Scala. When executed directly in the spark-shell, the process runs and the DataFrame loads successfully in append mode. However, when the jar is built and the shell script is executed through bash and NiFi, the run finishes as SUCCEEDED but no data is loaded into the database. I tried executing it in overwrite mode, but the answer was that I did not have privileges. I mention this because, at the top of the code, the Spark configuration refers to an overwrite mode (I don't set that Spark configuration when running directly in the spark-shell), and I want to know whether my code is correct for writing to a database, or whether this is related to a lack of privileges.
Next, I share the code of the process:
import com.tchile.bigdata.hdfs.Hdfs
import com.tchile.bigdata.utilities.Utilities
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import java.sql.DriverManager
import java.util.concurrent.TimeUnit
class Process {

  def process(shell_variable: String): Unit = {

    val startTotal = TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis())

    println("[INFO] Inicio de proceso Logistica Carga Input Simple Data")

    // Production (Spark configuration)
    val sparkConf = new SparkConf()
      .setAppName("Logistica_Carga_Input_SimpleData")
      .set("spark.sql.sources.partitionOverwriteMode", "dynamic")
      .set("spark.sql.parquet.output.committer.class", "org.apache.parquet.hadoop.ParquetOutputCommitter")
      .set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")

    val spark = SparkSession
      .builder()
      .config(sparkConf)
      .enableHiveSupport()
      .getOrCreate()

    val sc = spark.sparkContext
    import spark.implicits._

    val hdfs = new Hdfs
    val utils = new Utilities

    println("[INFO] Obteniendo variables desde shell/NIFI")

    // Values extracted from NiFi
    val daysAgo = shell_variable.split(":")(0).toInt     // Number of reprocessing days: start
    val daysAhead = shell_variable.split(":")(1).toInt   // Number of reprocessing days: limit
    val repartition = shell_variable.split(":")(2).toInt

    // Auxiliary variables for reprocessing in HDFS
    var pathToProcess = ""
    var pathToDelete = ""
    var deleteStatus = false

    println("[INFO] Obteniendo variables desde Parametros.conf")
    // Values extracted from Parametros.conf > exadata
    val driver_jdbc = sc.getConf.get("spark.exadata.driver_jdbc")
    val url_jdbc = sc.getConf.get("spark.exadata.url_jdbc")
    val user_jdbc = sc.getConf.get("spark.exadata.user_jdbc")
    val pass_jdbc = sc.getConf.get("spark.exadata.pass_jdbc")
    val table_name = sc.getConf.get("spark.exadata.table_name")
    val table_owner = sc.getConf.get("spark.exadata.table_owner")
    val table = sc.getConf.get("spark.exadata.table")

    // Values extracted from Parametros.conf > path
    // val pathCsv = sc.getConf.get("spark.path.Csv")

    // Calculation of the start date (days back)
    val startDate = utils.getCalculatedDate("yyyy-MM-dd", -daysAgo)
    // Calculation of the limit date
    val endDate = utils.getCalculatedDate("yyyy-MM-dd", -daysAhead)

    // Year, month and day are split out of the start date
    val startDateYear = startDate.substring(0, 4)
    val startDateMonth = startDate.substring(5, 7)
    val startDateDay = startDate.substring(8, 10)

    // Year, month and day are split out of the limit date
    val endDateYear = endDate.substring(0, 4)
    val endDateMonth = endDate.substring(5, 7)
    val endDateDay = endDate.substring(8, 10)

    // Logging information
    println("[INFO] Reproceso de: " + daysAgo + " días")
    println("[INFO] Fecha inicio: " + startDate)
    println("[INFO] Fecha límite: " + endDate)
    try {
      // ================= START OF PROCESS LOGIC =================

      // The DataFrame is created from the daily input file at the indicated path
      val df_csv = spark.read.format("csv")
        .option("header", "true")
        .option("sep", ";")
        .option("mode", "dropmalformed")
        .load("/applications/recup_remozo_equipos/equipos_por_recuperar/output/agendamientos_sin_pet_2")

      val df_final = df_csv.select(
        $"RutSinDV".as("RUT_SIN_DV"),
        $"dv".as("DV"),
        $"Agendado".as("AGENDADO"),
        to_date(col("Dia_Agendado"), "yyyyMMdd").as("DIA_AGENDADO"),
        $"Horario_Agendado".as("HORARIO_AGENDADO"),
        $"Nombre_Agendamiento".as("NOMBRE_AGENDAMIENTO"),
        $"Telefono_Agendamiento".as("TELEFONO_AGENDAMIENTO"),
        $"Email".substr(0, 49).as("EMAIL"),
        $"Region_Agendamiento".substr(0, 29).as("REGION_AGENDAMIENTO"),
        $"Comuna_Agendamiento".as("COMUNA_AGENDAMIENTO"),
        $"Direccion_Agendamiento".as("DIRECCION_AGENDAMIENTO"),
        $"Numero_Agendamiento".substr(0, 5).as("NUMERO_AGENDAMIENTO"),
        $"Depto_Agendamiento".substr(0, 9).as("DEPTO_AGENDAMIENTO"),
        to_timestamp(col("fecha_registro")).as("FECHA_REGISTRO"),
        to_timestamp(col("Fecha_Proceso")).as("FECHA_PROCESO")
      )

      // ================== END OF PROCESS LOGIC ==================

      // Cleanup in EXADATA
      println("[INFO] Se inicia la limpieza por reproceso en EXADATA")
      val query_particiones = "(SELECT * FROM (WITH DATA AS (select table_name,partition_name,to_date(trim('''' " +
        "from regexp_substr(extractvalue(dbms_xmlgen.getxmltype('select high_value from all_tab_partitions " +
        "where table_name='''|| table_name|| ''' and table_owner = '''|| table_owner|| ''' and partition_name = '''" +
        "|| partition_name|| ''''),'//text()'),'''.*?''')),'syyyy-mm-dd hh24:mi:ss') high_value_in_date_format " +
        "FROM all_tab_partitions WHERE table_name = '" + table_name + "' AND table_owner = '" + table_owner + "')" +
        "SELECT partition_name FROM DATA WHERE high_value_in_date_format > DATE '" + startDateYear + "-" + startDateMonth + "-" + startDateDay + "' " +
        "AND high_value_in_date_format <= DATE '" + endDateYear + "-" + endDateMonth + "-" + endDateDay + "') A)"
      Class.forName(driver_jdbc)
      val db = DriverManager.getConnection(url_jdbc, user_jdbc, pass_jdbc)
      val st = db.createStatement()

      try {
        val consultaParticiones = spark.read.format("jdbc")
          .option("url", url_jdbc)
          .option("driver", driver_jdbc)
          .option("dbTable", query_particiones)
          .option("user", user_jdbc)
          .option("password", pass_jdbc)
          .load()
          .collect()

        for (partition <- consultaParticiones) {
          st.executeUpdate("call " + table_owner + ".DO_THE_TRUNCATE_PARTITION('" + table + "','" + partition.getString(0) + "')")
        }
      } catch {
        case e: Exception =>
          println("[ERROR TRUNCATE] " + e)
      }

      st.close()
      db.close()

      println("[INFO] Se inicia la inserción en EXADATA")
      df_final.filter($"DIA_AGENDADO" >= "2022-08-01")
        .repartition(repartition)
        .write
        .mode("append")
        .jdbc(url_jdbc, table, utils.jdbcProperties(driver_jdbc, user_jdbc, pass_jdbc))
      println("[INFO] Inserción en EXADATA completada con éxito")
      println("[INFO] Proceso Logistica Carga Input SimpleData")

      val endTotal = TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis()) - startTotal
      println("[INFO] TIEMPO TOTAL EJECUCIÓN: " + utils.secondsToMinutes(endTotal))
    }
    catch {
      case e: Exception =>
        println("[EXCEPTION] " + e)
        val endTotal = TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis()) - startTotal
        println("[INFO] TIEMPO TOTAL EJECUCIÓN (CON ERROR): " + utils.secondsToMinutes(endTotal))
        throw e
    }
  }
}
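The Hdfs and Utilities classes are not included in the question, so the helpers used above (getCalculatedDate, jdbcProperties, secondsToMinutes) are only visible through their call sites. For readability, here is a minimal, hypothetical sketch of what such helpers typically look like; this is inferred from how they are called and is not the actual implementation used by the process:

import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Properties

// Hypothetical reconstruction of the unshown Utilities class, for reference only.
class Utilities {

  // Today's date shifted by offsetDays, formatted with the given pattern
  // (called above as getCalculatedDate("yyyy-MM-dd", -daysAgo)).
  def getCalculatedDate(pattern: String, offsetDays: Int): String =
    LocalDate.now().plusDays(offsetDays).format(DateTimeFormatter.ofPattern(pattern))

  // Builds the java.util.Properties expected by DataFrameWriter.jdbc(url, table, props).
  def jdbcProperties(driver: String, user: String, password: String): Properties = {
    val props = new Properties()
    props.setProperty("driver", driver)
    props.setProperty("user", user)
    props.setProperty("password", password)
    props
  }

  // Formats a duration in seconds for the log messages.
  def secondsToMinutes(seconds: Long): String =
    f"${seconds / 60}%d min ${seconds % 60}%d s"
}

Only the signatures matter for reading the code above; the real implementations may differ.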
Now I want to show the shell file:
#!/bin/bash
args=("$#")
export SPARK_MAJOR_VERSION=2;
in=${args[0]}
process_name2=${args[1]}
dir_hdfs=$(echo $in | awk -F ":" '{print$1}')
process_name=Logistica_Carga_Input_SimpleData
if [ -z "$process_name2" ]; then
process_name2=$process_name
fi
var_log=$(echo $dir_hdfs"/log/part*")
spark-submit \
--master yarn \
--deploy-mode cluster \
--name "Logistica_Carga_Input_SimpleData" \
--class com.tchile.bigdata.App \
--num-executors 2 \
--properties-file "/home/TCHILE.LOCAL/srv_remozo_equip/ingesta/Logistica_Carga_Input_SimpleData/Parametros.conf" \
--conf "spark.executor.cores=16" \
--conf "spark.default.parallelism=120" \
--conf "spark.driver.memory=20gb" \
--conf "spark.executor.memory=30gb" \
--conf "spark.ui.acls.enable=true" \
--conf "spark.sql.sources.partitionOverwriteMode=dynamic" \
--conf "spark.ui.view.acls=srv_remozo_equip" \
--conf "spark.admin.acls=srv_remozo_equip" \
--conf "spark.sql.warehouse.dir=/etc/hive/conf/hive-site.xml" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./kafka_client_jaas.conf" \
--conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=./kafka_client_jaas.conf" \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
--conf "spark.driver.extraClassPath=/usr/share/java/ojdbc7.jar" \
--jars "/usr/share/java/ojdbc7.jar" \
hdfs://nn:8020/applications/recup_remozo_equipos/Logistica_Carga_Input_SimpleData/logistica_carga_input_simpledata_2.11-1.0.jar "${args[0]}"
Some comments: the field names are in Spanish, but I don't think that is relevant for the analysis. I also do not include the parameter/configuration file with the variables. I appreciate any comments on this.

Problem solved: in the DataFrame df_final, where Dia_Agendado is cast (to_date(col("Dia_Agendado"), "yyyyMMdd").as("DIA_AGENDADO")), the date pattern was wrong; it needed to be yyyy-MM-dd. I had not updated this format in the IntelliJ project. I was running a version of the code in Sublime that did work, but I hadn't noticed the difference, and that is why nothing was loaded into the database.
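For clarity, only the date pattern in df_final changes:

// Before (IntelliJ version): the values in the CSV did not match this pattern
to_date(col("Dia_Agendado"), "yyyyMMdd").as("DIA_AGENDADO")

// After: pattern matches the yyyy-MM-dd values in the file
to_date(col("Dia_Agendado"), "yyyy-MM-dd").as("DIA_AGENDADO")

Most likely, with the wrong pattern to_date returned null for every row, so the later filter $"DIA_AGENDADO" >= "2022-08-01" discarded everything and the append wrote zero rows, which would explain why the job could finish as SUCCEEDED without loading any data.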

Related

Pyspark to connect to postgresql

I'm trying to connect to a PostgreSQL database in Databricks and I'm using the following code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.jars", "path_to_postgresDriver/postgresql_42_2_14.jar").getOrCreate()
driver = "org.postgresql.Driver"
database_host = "my_host"
database_port = "5432"
database_name = "db_name"
table = "db_table"
user = "my_user"
password = "my_pwd"
url = f"jdbc:postgresql://{database_host}:{database_port}/{database_name}"
remote_table = (spark.read
    .format("jdbc")
    .option("driver", driver)
    .option("url", url)
    .option("dbtable", table)
    .option("user", user)
    .option("password", password)
    .load()
)
Unfortunately I get back the following error:
Any idea what might be happening?
Thank you in advance!

case when in merge statement databricks

I am trying to upsert in Databricks using a merge statement in PySpark. I wanted to know if using expressions (e.g. adding two columns, case when) is allowed in the whenMatchedUpdate part. For example, I want to do something like this:
deltaTableTarget = DeltaTable.forPath(spark, delta_table_path)

(deltaTableTarget.alias('TargetTable')
    .merge(
        broadcast(df_transformed.alias('DeltaSource')),
        "DeltaSource.primary_key == TargetTable.primary_key"
    )
    .whenMatchedUpdate(set =
        {
            "aggcount": "DeltaSource.count + TargetTable.count",
            "max_date": "case when DeltaSource.max_date > TargetTable.max_date then DeltaSource.max_date else TargetTable.max_date end"
        }
    )
    .whenNotMatchedInsertAll()
    .execute()
)
If I understand your logic well, you can just take the max value of the 2 columns, right?
deltaTableTarget = DeltaTable.forPath(spark, delta_table_path)

(deltaTableTarget.alias('TargetTable')
    .merge(
        broadcast(df_transformed.alias('DeltaSource')),
        "DeltaSource.primary_key == TargetTable.primary_key"
    )
    .whenMatchedUpdate(set =
        {
            "aggcount": "DeltaSource.count + TargetTable.count",
            "max_date": "GREATEST(DeltaSource.max_date, TargetTable.max_date)"
        }
    )
    .whenNotMatchedInsertAll()
    .execute()
)
If this is not correct, something you could do is use multiple whenMatchedUpdate() functions with a condition.
deltaTableTarget = DeltaTable.forPath(spark, delta_table_path)

(deltaTableTarget.alias('TargetTable')
    .merge(
        broadcast(df_transformed.alias('DeltaSource')),
        "DeltaSource.primary_key == TargetTable.primary_key"
    )
    .whenMatchedUpdate(condition='DeltaSource.max_date > TargetTable.max_date',
        set =
        {
            "aggcount": "DeltaSource.count + TargetTable.count",
            "max_date": "DeltaSource.max_date"
        }
    )
    .whenMatchedUpdate(set =
        {
            "aggcount": "DeltaSource.count + TargetTable.count",
            "max_date": "TargetTable.max_date"
        }
    )
    .whenNotMatchedInsertAll()
    .execute()
)

Adding a custom function in SPARK NLP Pipeline

I am trying to add a custom function (to remove names from a document) and I want this UDF to be incorporated into the Spark NLP pipeline. Can you please tell me where I am going wrong?
Here is the UDF I created to remove names:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def remove_names(str):
    # All_names = list(map(lambda x: x.lower(), All_names))
    All_names = ['josh', 'alice']
    split_str = str.split()
    found_names = [name for name in split_str if name in All_names]
    for name in found_names:
        str = str.replace(name, '')
    str = ' '.join(str.split())
    return str

udf_txt_clean = udf(lambda x: remove_names(x), StringType())

My pipeline looks like this:
documentAssembler = DocumentAssembler() \
    .setInputCol("combined_text") \
    .setOutputCol("document")

remove_names = udf_txt_clean('document') \
    .setInputCols(["document"]) \
    .setOutputCol("noname")

tokenizer = Tokenizer() \
    .setInputCols(["noname"]) \
    .setOutputCol("token")

spellModel = ContextSpellCheckerModel \
    .pretrained() \
    .setInputCols("token") \
    .setOutputCol("checked")

normalizer = Normalizer() \
    .setInputCols(["checked"]) \
    .setOutputCol("normalized") \
    .setLowercase(True)

stopwords_cleaner = StopWordsCleaner() \
    .setInputCols("normalized") \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

tokenassembler = TokenAssembler() \
    .setInputCols(["document", "cleanTokens"]) \
    .setOutputCol("clean_text")

nlpPipeline = Pipeline(stages=[
    documentAssembler,
    remove_names,
    tokenizer,
    spellModel,
    normalizer,
    stopwords_cleaner,
    tokenassembler
])
empty_df = spark.createDataFrame([['']]).toDF("text")
pipelineModel = nlpPipeline.fit(empty_df)
result = pipelineModel.transform(sdf)
result.select('text', explode('clean_text.result').alias('clean_text')).display()
Sample Data just to recreate the issue
text = 'josh and sam are buddies. They (both) like <b>running</b>. They got better at it over the weekend'
test = pd.DataFrame({"test": [text]})
sdf = spark.createDataFrame(test)

Spark stream stops abruptly - "the specified path does not exist"

I am working with Spark Structured Streaming. My stream works fine, but after some time it just stops because of the issue below.
Any suggestions on what could be the reason and how to resolve this issue?
java.io.FileNotFoundException: Operation failed: "The specified path does not exist.", 404, GET, https://XXXXXXXX.dfs.core.windows.net/output?upn=false&resource=filesystem&maxResults=5000&directory=XXXXXXXX&timeout=90&recursive=true, PathNotFound, "The specified path does not exist. RequestId:d1b7c77f-e01f-0027-7f09-4646f7000000 Time:2022-04-01T20:47:30.1791444Z"
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:1290)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.listKeysWithPrefix(AzureBlobFileSystem.java:530)
at com.databricks.tahoe.store.EnhancedAzureBlobFileSystemUpgrade.listKeysWithPrefix(EnhancedFileSystem.scala:605)
at com.databricks.tahoe.store.EnhancedDatabricksFileSystemV2.$anonfun$listKeysWithPrefix$1(EnhancedFileSystem.scala:374)
at com.databricks.backend.daemon.data.client.DBFSV2.$anonfun$listKeysWithPrefix$1(DatabricksFileSystemV2.scala:247)
at com.databricks.logging.UsageLogging.$anonfun$recordOperation$1(UsageLogging.scala:395)
at com.databricks.logging.UsageLogging.executeThunkAndCaptureResultTags$1(UsageLogging.scala:484)
at com.databricks.logging.UsageLogging.$anonfun$recordOperationWithResultTags$4(UsageLogging.scala:504)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:266)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:261)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:258)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.withAttributionContext(DatabricksFileSystemV2.scala:510)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:305)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:297)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.withAttributionTags(DatabricksFileSystemV2.scala:510)
at com.databricks.logging.UsageLogging.recordOperationWithResultTags(UsageLogging.scala:479)
at com.databricks.logging.UsageLogging.recordOperationWithResultTags$(UsageLogging.scala:404)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.recordOperationWithResultTags(DatabricksFileSystemV2.scala:510)
at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:395)
at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:367)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.recordOperation(DatabricksFileSystemV2.scala:510)
at com.databricks.backend.daemon.data.client.DBFSV2.listKeysWithPrefix(DatabricksFileSystemV2.scala:240)
at com.databricks.tahoe.store.EnhancedDatabricksFileSystemV2.listKeysWithPrefix(EnhancedFileSystem.scala:374)
at com.databricks.tahoe.store.AzureLogStore.listKeysWithPrefix(AzureLogStore.scala:54)
at com.databricks.tahoe.store.DelegatingLogStore.listKeysWithPrefix(DelegatingLogStore.scala:251)
at com.databricks.sql.fileNotification.autoIngest.FileEventBackfiller$.listFiles(FileEventWorkerThread.scala:967)
at com.databricks.sql.fileNotification.autoIngest.FileEventBackfiller.runInternal(FileEventWorkerThread.scala:876)
at com.databricks.sql.fileNotification.autoIngest.FileEventBackfiller.run(FileEventWorkerThread.scala:809)
Caused by: Operation failed: "The specified path does not exist.", 404, GET, https://XXXXXXXXXX.dfs.core.windows.net/output?upn=false&resource=filesystem&maxResults=5000&directory=XXXXXXXX&timeout=90&recursive=true, PathNotFound, "The specified path does not exist. RequestId:02ae07cf-901f-0001-080e-46dd43000000 Time:2022-04-01T21:21:40.2136657Z"
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:241)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsClient.listPath(AbfsClient.java:235)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listFiles(AzureBlobFileSystemStore.java:1112)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.access$200(AzureBlobFileSystemStore.java:143)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore$1.fetchMoreResults(AzureBlobFileSystemStore.java:1052)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore$1.<init>(AzureBlobFileSystemStore.java:1033)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listKeysWithPrefix(AzureBlobFileSystemStore.java:1029)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.listKeysWithPrefix(AzureBlobFileSystem.java:527)
... 27 more
Below is my code:
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.types import StructType, StringType
from pyspark.sql import functions as F
from delta.tables import *
spark.sql("set spark.sql.files.ignoreMissingFiles=true")
filteredRawDF = ""
try:
filteredRawDF = spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "json") \
.option("cloudFiles.schemaLocation", landingcheckPointFilePath) \
.option("cloudFiles.inferColumnTypes", "true") \
.load(landingFilePath) \
.select(from_json('body', schema).alias('temp')) \
.select(explode("temp.report.data").alias("details")) \
.select("details",
explode("details.breakdown").alias("inner_breakdown")) \
.select("details","inner_breakdown",
explode("inner_breakdown.breakdown").alias("outer_breakdown"))\
.select(to_timestamp(col("details.name"), "yyyy-MM-
dd'T'HH:mm:ss+SSSS").alias('datetime'),
col("details.year"),
col("details.day"),
col("details.hour"),
col("details.minute"),
col("inner_breakdown.name").alias("hotelName"),
col("outer_breakdown.name").alias("checkindate"),
col("outer_breakdown.counts")[0].cast("int").alias("HdpHits"))
except Exception as e:
print(e)
query = filteredRawDF \
.writeStream \
.format("delta") \
.option("mergeSchema", "true") \
.outputMode("append") \
.option("checkpointLocation", checkPointPath) \
.trigger(processingTime='50 seconds') \
.start(savePath) '''
Thanks

Flink SQL Client connect to secured kafka cluster

I want to execute a query on a Flink SQL table backed by a Kafka topic of a secured Kafka cluster. I'm able to execute the query programmatically but unable to do the same through the Flink SQL client. I'm not sure how to pass the JAAS config (java.security.auth.login.config) and other system properties through the Flink SQL client.
Flink SQL query programmatically
private static void simpleExec_auth() {
    // Create the execution environment.
    final EnvironmentSettings settings = EnvironmentSettings.newInstance()
            .inStreamingMode()
            .withBuiltInCatalogName("default_catalog")
            .withBuiltInDatabaseName("default_database")
            .build();

    System.setProperty("java.security.auth.login.config", "client_jaas.conf");
    System.setProperty("sun.security.jgss.native", "true");
    System.setProperty("sun.security.jgss.lib", "/usr/libexec/libgsswrap.so");
    System.setProperty("javax.security.auth.useSubjectCredsOnly", "false");

    TableEnvironment tableEnvironment = TableEnvironment.create(settings);
    String createQuery = "CREATE TABLE test_flink11 ( " + "`keyid` STRING, " + "`id` STRING, "
            + "`name` STRING, " + "`age` INT, " + "`color` STRING, " + "`rowtime` TIMESTAMP(3) METADATA FROM 'timestamp', " + "`proctime` AS PROCTIME(), " + "`address` STRING) " + "WITH ( "
            + "'connector' = 'kafka', "
            + "'topic' = 'test_flink10', "
            + "'scan.startup.mode' = 'latest-offset', "
            + "'properties.bootstrap.servers' = 'kafka01.nyc.com:9092', "
            + "'value.format' = 'avro-confluent', "
            + "'key.format' = 'avro-confluent', "
            + "'key.fields' = 'keyid', "
            + "'value.fields-include' = 'EXCEPT_KEY', "
            + "'properties.security.protocol' = 'SASL_PLAINTEXT', 'properties.sasl.kerberos.service.name' = 'kafka', 'properties.sasl.kerberos.kinit.cmd' = '/usr/local/bin/skinit --quiet', 'properties.sasl.mechanism' = 'GSSAPI', "
            + "'key.avro-confluent.schema-registry.url' = 'http://kafka-schema-registry:5037', "
            + "'key.avro-confluent.schema-registry.subject' = 'test_flink6', "
            + "'value.avro-confluent.schema-registry.url' = 'http://kafka-schema-registry:5037', "
            + "'value.avro-confluent.schema-registry.subject' = 'test_flink4')";
    System.out.println(createQuery);
    tableEnvironment.executeSql(createQuery);

    TableResult result = tableEnvironment
            .executeSql("SELECT name,rowtime FROM test_flink11");
    result.print();
}
This is working fine.
Flink SQL query through SQL client
Running the statements below gives the following error.
Flink SQL> CREATE TABLE test_flink11 (`keyid` STRING,`id` STRING,`name` STRING,`address` STRING,`age` INT,`color` STRING) WITH('connector' = 'kafka', 'topic' = 'test_flink10','scan.startup.mode' = 'earliest-offset','properties.bootstrap.servers' = 'kafka01.nyc.com:9092','value.format' = 'avro-confluent','key.format' = 'avro-confluent','key.fields' = 'keyid', 'value.avro-confluent.schema-registry.url' = 'http://kafka-schema-registry:5037', 'value.avro-confluent.schema-registry.subject' = 'test_flink4', 'value.fields-include' = 'EXCEPT_KEY', 'key.avro-confluent.schema-registry.url' = 'http://kafka-schema-registry:5037', 'key.avro-confluent.schema-registry.subject' = 'test_flink6', 'properties.security.protocol' = 'SASL_PLAINTEXT', 'properties.sasl.kerberos.service.name' = 'kafka', 'properties.sasl.kerberos.kinit.cmd' = '/usr/local/bin/skinit --quiet', 'properties.sasl.mechanism' = 'GSSAPI');
Flink SQL> select * from test_flink11;
[ERROR] Could not execute SQL statement. Reason:
java.lang.IllegalArgumentException: Could not find a 'KafkaClient' entry in the JAAS configuration. System property 'java.security.auth.login.config' is /tmp/jaas-6309821891889949793.conf
There is nothing in /tmp/jaas-6309821891889949793.conf except the following comment
# We are using this file as an workaround for the Kafka and ZK SASL implementation
# since they explicitly look for java.security.auth.login.config property
# Please do not edit/delete this file - See FLINK-3929
SQL client run command
bin/sql-client.sh embedded --jar flink-sql-connector-kafka_2.11-1.12.0.jar --jar flink-sql-avro-confluent-registry-1.12.0.jar
Flink cluster command
bin/start-cluster.sh
How can I pass java.security.auth.login.config and the other system properties (that I'm setting in the Java code snippet above) to the SQL client?
flink-conf.yaml
security.kerberos.login.use-ticket-cache: true
security.kerberos.login.principal: XXXXX#HADOOP.COM
security.kerberos.login.use-ticket-cache: false
security.kerberos.login.keytab: /path/to/kafka.keytab
security.kerberos.login.principal: XXXX#HADOOP.COM
security.kerberos.login.contexts: Client,KafkaClient
I haven't really tested whether this solution is feasible; you can try it out. I hope it helps.