The topic test already exists in a remote Kafka cluster, and the session details are stored in the client variable. The table has three columns: FEATURE, ACCOUNTHOLDER, and CLASS.
Now I create sample_table:
client.create_table(table_name='sample_table',
                    columns_type=['FEATURE double', 'ACOUNTHOLDER string', 'CLASS int'],
                    topic='test',
                    value_format='DELIMITED',
                    key='ACCOUNTHOLDER')
The result is True, so the table is created successfully.
Now I want to insert data into sample_table with:
client.ksql("""INSERT INTO sample_table (FEATURE, ACCOUNTHOLDER, CLASS) VALUES (6.1, 'C1', 1)""")
I receive a NullPointerException.
KSQLError: ('java.lang.NullPointerException', 50000, ['io.confluent.ksql.rest.server.execution.InsertValuesExecutor.extractRow(InsertValuesExecutor.java:164)', 'io.confluent.ksql.rest.server.execution.InsertValuesExecutor.execute(InsertValuesExecutor.java:98)', 'io.confluent.ksql.rest.server.validation.CustomValidators.validate(CustomValidators.java:109)', 'io.confluent.ksql.rest.server.validation.RequestValidator.validate(RequestValidator.java:143)', 'io.confluent.ksql.rest.server.validation.RequestValidator.validate(RequestValidator.java:115)', 'io.confluent.ksql.rest.server.resources.KsqlResource.handleKsqlStatements(KsqlResource.java:163)', 'sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)', 'sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)', 'java.lang.reflect.Method.invoke(Method.java:498)', 'org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:76)', 'org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:148)', 'org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:191)', 'org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:200)', 'org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:103)', 'org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:493)', 'org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:415)', 'org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:104)', 'org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:277)', 'org.glassfish.jersey.internal.Errors$1.call(Errors.java:272)', 'org.glassfish.jersey.internal.Errors$1.call(Errors.java:268)', 'org.glassfish.jersey.internal.Errors.process(Errors.java:316)', 'org.glassfish.jersey.internal.Errors.process(Errors.java:298)', 'org.glassfish.jersey.internal.Errors.process(Errors.java:268)', 'org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:289)', 'org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:256)', 'org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:703)', 'org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:416)', 'org.glassfish.jersey.servlet.ServletContainer.serviceImpl(ServletContainer.java:409)', 'org.glassfish.jersey.servlet.ServletContainer.doFilter(ServletContainer.java:584)', 'org.glassfish.jersey.servlet.ServletContainer.doFilter(ServletContainer.java:525)', 'org.glassfish.jersey.servlet.ServletContainer.doFilter(ServletContainer.java:462)', 'org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)', 'org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)', 'org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)', 'org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1700)', 'org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)', 'org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)', 'org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)', 
'org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)', 'org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1667)', 'org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)', 'org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)', 'org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)', 'org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:152)', 'org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:174)', 'org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)', 'org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753)', 'org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)', 'org.eclipse.jetty.server.Server.handle(Server.java:505)', 'org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)', 'org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)', 'org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)', 'org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)', 'org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)', 'org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)', 'org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)', 'org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)', 'org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)', 'org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)', 'org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)', 'org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)', 'java.lang.Thread.run(Thread.java:748)'])
Any ideas what I am doing wrong?
CREATE EXTERNAL TABLE `table_name`(
  data string COMMENT 'from deserializer')
PARTITIONED BY (
  date string)
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'paths'='')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://'
TBLPROPERTIES (
  'classification'='json')
I am trying to read this table using PySpark but get the error below:
table = spark.table('table_name')
table.show(5, False)
: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.openx.data.jsonserde.JsonSerDe
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:79)
at org.apache.spark.sql.hive.execution.HiveTableScanExec.addColumnMetadataToConf(HiveTableScanExec.scala:123)
at org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf$lzycompute(HiveTableScanExec.scala:101)
at org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf(HiveTableScanExec.scala:98)
at org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader$lzycompute(HiveTableScanExec.scala:110)
at org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader(HiveTableScanExec.scala:105)
at org.apache.spark.sql.hive.execution.HiveTableScanExec.$anonfun$doExecute$2(HiveTableScanExec.scala:210)
at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2524)
at org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:210)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:326)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:444)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:430)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3733)
at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2762)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3724)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
But this table is accessible from Athena.
ClickHouse version 21.12.3.32. I am following this PR (https://github.com/ClickHouse/ClickHouse/pull/21850) to handle malformed messages from a Kafka topic, but after some investigation I have found that if a single message contains broken data, the whole batch of received messages cannot be parsed, which can lead to data loss.
Kafka engine table:
CREATE TABLE default.kafka_engine (message String)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'topic',
    kafka_group_name = 'group',
    kafka_format = 'JSONAsString',
    kafka_row_delimiter = '\n',
    kafka_num_consumers = 1,
    kafka_handle_error_mode = 'stream';
Example of broken message: [object Object]
First message error: JSON object must begin with '{'.: (at row 1).
Other messages error: Cannot parse input: expected ']' at end of stream..
Is it possible to skip just that broken message and correctly parse the other messages in a batch received from the Kafka topic?
Changing the kafka_format to RawBLOB fixed my issue.
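For reference, a minimal sketch of what that change looks like, reusing the broker, topic, and group names from the table above. With RawBLOB each Kafka message arrives unparsed as a single String value, so one malformed payload no longer breaks parsing of the rest of the batch, and any JSON extraction can be done downstream (for example in a materialized view):

CREATE TABLE default.kafka_engine (message String)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'topic',
    kafka_group_name = 'group',
    -- RawBLOB delivers the raw message contents without any format parsing
    kafka_format = 'RawBLOB',
    kafka_num_consumers = 1,
    kafka_handle_error_mode = 'stream';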
I am trying to run a test query against AWS Redshift from AWS Managed Airflow:
QUERY:
AWS_GET_DATA_FROM_REDSHIFT = """('SELECT * FROM information_schema.tables;')"""
stage_data_from_redshift_to_s3 = FromRedshiftToS3TransferOperator(
    task_id=f'Stage_unload_{SCHEMA}_{TABLE}_from_redshift_to_s3_{S3_BUCKET}',
    dag=dag,
    table=TABLE,
    s3_bucket=S3_BUCKET,
    s3_prefix=f'{SCHEMA}_{TABLE}',
    select_query=AWS_GET_DATA_FROM_REDSHIFT,
    unload_options=['CSV']
)
class FromRedshiftToS3TransferOperator(BaseOperator):
    """
    Executes an UNLOAD command to s3 as a CSV with headers

    :param schema: reference to a specific schema in redshift database
    :type schema: str
    :param table: reference to a specific table in redshift database
    :type table: str
    :param s3_bucket: reference to a specific S3 bucket
    :type s3_bucket: str
    :param s3_key: reference to a specific S3 key
    :type s3_key: str
    :param redshift_conn_id: reference to a specific redshift database
    :type redshift_conn_id: str
    :param aws_conn_id: reference to a specific S3 connection
    :type aws_conn_id: str
    :param verify: Whether or not to verify SSL certificates for S3 connection.
        By default SSL certificates are verified.
        You can provide the following values:
        - ``False``: do not validate SSL certificates. SSL will still be used
          (unless use_ssl is False), but SSL certificates will not be verified.
        - ``path/to/cert/bundle.pem``: A filename of the CA cert bundle to use.
          You can specify this argument if you want to use a different
          CA cert bundle than the one used by botocore.
    :type verify: bool or str
    :param unload_options: reference to a list of UNLOAD options
    :type unload_options: list
    :param autocommit: If set to True it will automatically commit the UNLOAD statement.
        Otherwise it will be committed right before the redshift connection gets closed.
    :type autocommit: bool
    :param include_header: If set to True the s3 file contains the header columns.
    :type include_header: bool
    """

    ui_color = '#8EB6D4'

    @apply_defaults
    def __init__(self,
                 table,
                 s3_bucket,
                 s3_prefix,
                 select_query,
                 redshift_conn_id='redshift',
                 aws_conn_id='aws_credentials',
                 unload_options=tuple(),
                 autocommit=False,
                 include_header=False,
                 *args, **kwargs):
        super(FromRedshiftToS3TransferOperator, self).__init__(*args, **kwargs)
        self.table = table
        self.s3_bucket = s3_bucket
        self.s3_prefix = s3_prefix
        self.select_query = select_query
        self.redshift_conn_id = redshift_conn_id
        self.aws_conn_id = aws_conn_id
        self.unload_options = unload_options
        self.autocommit = autocommit
        self.include_header = include_header

        if self.include_header and 'HEADER' not in [uo.upper().strip() for uo in self.unload_options]:
            self.unload_options = list(self.unload_options) + ['HEADER', ]

    def execute(self, context):
        aws_hook = AwsHook("aws_credentials")
        credentials = aws_hook.get_credentials()
        redshift_hook = PostgresHook("redshift")

        self.log.info(f'Preparing to stage data from {self.select_query} to {self.s3_bucket}/{self.s3_prefix}...')

        unload_query = """
            UNLOAD {select_query}
            TO 's3://{s3_bucket}/{s3_prefix}/{table}_'
            with credentials
            'aws_access_key_id={access_key};aws_secret_access_key={secret_key}'
            {unload_options};
        """.format(select_query=self.select_query,
                   s3_bucket=self.s3_bucket,
                   s3_prefix=self.s3_prefix,
                   table=self.table,
                   access_key=credentials.access_key,
                   secret_key=credentials.secret_key,
                   unload_options='\n\t\t\t'.join(self.unload_options))

        self.log.info(f'{credentials.access_key}')
        self.log.info(f'{credentials.secret_key}')
        self.log.info('Executing UNLOAD command...')
        redshift_hook.run(unload_query, self.autocommit)
        self.log.info("UNLOAD command complete.")
And I get this error:
[2021-08-17 10:40:50,186] {{taskinstance.py:1150}} ERROR - Specified types or functions (one per INFO message) not supported on Redshift tables.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/airflow/dags/ownoperators/aws_from_redshift_to_s3_operator.py", line 95, in execute
redshift_hook.run(unload_query, self.autocommit)
File "/usr/local/lib/python3.7/site-packages/airflow/hooks/dbapi_hook.py", line 175, in run
cur.execute(s)
psycopg2.errors.FeatureNotSupported: Specified types or functions (one per INFO message) not supported on Redshift tables.
This error is generated by Redshift, and in most cases it occurs when your query uses a leader-node-only function (such as generate_series(); there are a number of these). Look at your select_query and check whether the functions it calls are valid on compute nodes (run the query in a workbench). Happy to help if you post the query. The issue is the SQL, not the code you posted.
The root of the issue is that leader-node information is needed by the compute nodes during execution, and this route isn't supported. There can be several causes for this, and each has workarounds.
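In this case the select_query shown above, SELECT * FROM information_schema.tables, is most likely the trigger: information_schema is a leader-node-only catalog, so UNLOAD cannot run that query on the compute nodes. Below is a hedged sketch of an UNLOAD that stays on the compute nodes; the table, bucket, and credential values are placeholders, not the asker's actual objects:

-- Hypothetical example: unload from a regular user table rather than a
-- leader-node-only catalog view such as information_schema.tables.
-- Table, bucket, and credential values are placeholders.
UNLOAD ('SELECT col1, col2 FROM schema1.table1')
TO 's3://my-bucket/schema1_table1/table1_'
CREDENTIALS 'aws_access_key_id=<access_key>;aws_secret_access_key=<secret_key>'
CSV;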
I am trying to connect to a Kafka server. Authentication is based on GSSAPI.
/opt/app-root/src/server/node_modules/node-rdkafka/lib/error.js:411
return new LibrdKafkaError(e);
^
Error: broker transport failure
at Function.createLibrdkafkaError (/opt/app-root/src/server/node_modules/node-rdkafka/lib/error.js:411:10)
at /opt/app-root/src/server/node_modules/node-rdkafka/lib/client.js:350:28
This is my test_kafka.js:
const Kafka = require('node-rdkafka');

const kafkaConf = {
    'group.id': 'espdev2',
    'enable.auto.commit': true,
    'metadata.broker.list': 'br01',
    'security.protocol': 'SASL_SSL',
    'sasl.kerberos.service.name': 'kafka',
    'sasl.kerberos.keytab': 'svc_esp_kafka_nonprod.keytab',
    'sasl.kerberos.principal': 'svc_esp_kafka_nonprod@INT.LOCAL',
    'debug': 'all',
    'enable.ssl.certificate.verification': true,
    //'ssl.certificate.location': 'some-root-ca.cer',
    'ssl.ca.location': 'some-root-ca.cer',
    //'ssl.key.location': 'svc_esp_kafka_nonprod.keytab',
};

const topics = 'hello1';
console.log(Kafka.features);

let readStream = new Kafka.KafkaConsumer.createReadStream(kafkaConf, { "auto.offset.reset": "earliest" }, { topics });

readStream.on('data', function (message) {
    const messageString = message.value.toString();
    console.log(`Consumed message on Stream: ${messageString}`);
});
You can look at this issue for the explanation of this error:
https://github.com/edenhill/librdkafka/issues/1987
Taken from @edenhill:
As a general rule for librdkafka-based clients: given that the cluster and client are correctly configured, all errors can be ignored as they are most likely temporary and librdkafka will attempt to recover automatically. In this specific case; if a group coordinator request fails it will be retried (using any broker in state Up) within 500ms. The current assignment and group membership will not be affected, if a new coordinator is found before the missing heartbeats times out the membership (session.timeout.ms).
Auto offset commits will be stalled until a new coordinator is found. In a future version we'll extend the error type to include a severity, allowing applications to happily ignore non-terminal errors. At this time an application should consider all errors informational, and not terminal.
I use the following test case to write data to a PostgreSQL table, and it works fine.
test("SparkSQLTest") {
val session = SparkSession.builder().master("local").appName("SparkSQLTest").getOrCreate()
val url = "jdbc:postgresql://dbhost:12345/db1"
val table = "schema1.table1"
val props = new Properties()
props.put("user", "user123")
props.put("password", "pass#123")
props.put(JDBCOptions.JDBC_DRIVER_CLASS, "org.postgresql.Driver")
session.range(300, 400).write.mode(SaveMode.Append).jdbc(url, table, props)
}
Then I run spark-sql -f sql_script_file.sql with the following script to write Hive data into a PostgreSQL table.
CREATE OR REPLACE TEMPORARY VIEW tmp_v1
USING org.apache.spark.sql.jdbc
OPTIONS (
  driver 'org.postgresql.Driver',
  url 'jdbc:postgresql://dbhost:12345/db1',
  dbtable 'schema1.table2',
  user 'user123',
  password 'pass#123',
  batchsize '2000'
);

insert into tmp_v1
select
  name,
  age
from test.person;  -- test.person is the Hive db.table
But when I run the above script using spark-sql -f sql_script.sql, it complains that the PostgreSQL user/password is invalid; the exception is shown below. I think the two methods are basically the same, so I would ask where the problem is. Thanks.
org.postgresql.util.PSQLException: FATAL: Invalid username/password,login denied.
at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:375)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:189)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:64)
at org.postgresql.jdbc2.AbstractJdbc2Connection.<init>(AbstractJdbc2Connection.java:124)
at org.postgresql.jdbc3.AbstractJdbc3Connection.<init>(AbstractJdbc3Connection.java:28)
at org.postgresql.jdbc3g.AbstractJdbc3gConnection.<init>(AbstractJdbc3gConnection.java:20)
at org.postgresql.jdbc4.AbstractJdbc4Connection.<init>(AbstractJdbc4Connection.java:30)
at org.postgresql.jdbc4.Jdbc4Connection.<init>(Jdbc4Connection.java:22)
at org.postgresql.Driver.makeConnection(Driver.java:392)
at org.postgresql.Driver.connect(Driver.java:266)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:59)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:50)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:58)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:114)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:76)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:59)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:57)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:75)