How to switch to using Zeta SQL with DirectRunner? - apache-beam

Q: While using DirectRunner to test Beam SQL, how do I switch to using Zeta instead of Calcite?
Back story:
I am studying Apache Beam and am now playing with SqlTransform. I understand that there are two flavors of SQL planner available: ZetaSQL and Calcite, and that the default is Calcite. Since I eventually plan on running my pipeline in Dataflow, I wanted to switch to ZetaSQL for testing. I read the docs here, which said:
To change dialects, pass the dialect’s full package name to the setPlannerName method in the PipelineOptions interface.
That was rather vague but I ended up coding:
var options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
options
    .as(org.apache.beam.sdk.extensions.sql.impl.BeamSqlPipelineOptions.class)
    .setPlannerName("org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner");
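For reference, here is a minimal sketch of the kind of pipeline this feeds into (the class name, schema, data, and query below are illustrative placeholders, not the actual notebook code):
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.extensions.sql.impl.BeamSqlPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class ZetaSqlSwitchExample {
    public static void main(String[] args) {
        var options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        // Switch the SQL planner from the default (Calcite) to ZetaSQL.
        options.as(BeamSqlPipelineOptions.class)
               .setPlannerName("org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner");

        Pipeline p = Pipeline.create(options);

        // Illustrative schema and row; the real notebook uses its own input.
        Schema schema = Schema.builder().addStringField("name").addInt64Field("score").build();
        PCollection<Row> input = p.apply(
                Create.of(Row.withSchema(schema).addValues("a", 1L).build())
                      .withRowSchema(schema));

        // SqlTransform is where the configured planner (and hence ZetaSQL) kicks in.
        input.apply(SqlTransform.query("SELECT name, score FROM PCOLLECTION"));

        p.run().waitUntilFinish();
    }
}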
When I now run my pipeline, I fail with:
com.google.zetasql.io.grpc.StatusRuntimeException: INTERNAL: Panic! This is a bug!
at com.google.zetasql.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
at com.google.zetasql.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
at com.google.zetasql.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
at com.google.zetasql.ZetaSqlLocalServiceGrpc$ZetaSqlLocalServiceBlockingStub.getLanguageOptions(ZetaSqlLocalServiceGrpc.java:1562)
at com.google.zetasql.LanguageOptions.getDefaultFeatures(LanguageOptions.java:57)
at com.google.zetasql.LanguageOptions.<init>(LanguageOptions.java:65)
at com.google.zetasql.AnalyzerOptions.<init>(AnalyzerOptions.java:56)
at org.apache.beam.sdk.extensions.sql.zetasql.SqlAnalyzer.baseAnalyzerOptions(SqlAnalyzer.java:138)
at org.apache.beam.sdk.extensions.sql.zetasql.SqlAnalyzer.getAnalyzerOptions(SqlAnalyzer.java:162)
at org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLPlannerImpl.rel(ZetaSQLPlannerImpl.java:80)
at org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner.convertToBeamRelInternal(ZetaSQLQueryPlanner.java:199)
at org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner.convertToBeamRel(ZetaSQLQueryPlanner.java:187)
at org.apache.beam.sdk.extensions.sql.impl.BeamSqlEnv.parseQuery(BeamSqlEnv.java:112)
at org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:171)
at org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:110)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:548)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:499)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:373)
at .(#47:1)
It should be noted that the exact same pipeline works just fine without switching to Zeta.
In my maven dependencies I have (among others):
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-extensions-sql</artifactId>
  <version>2.43.0</version>
</dependency>
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-extensions-sql-zetasql</artifactId>
  <version>2.43.0</version>
</dependency>
Here is a Gist showing the full Java notebook including the error.

Related

ehcache 3 migration of net.sf.ehcache.hibernate.SingletonEhCacheProvider?

(also posted here: https://github.com/ehcache/ehcache3/issues/3129 )
I'm trying to upgrade from Ehcache 2 to 3, and the (large) codebase contains:
net.sf.ehcache.hibernate.SingletonEhCacheProvider
inside xml-based bean containers:
<prop key="hibernate.cache.provider_class">net.sf.ehcache.hibernate.SingletonEhCacheProvider</prop>
and I don't see any hints in the migration guide on an acceptable way to achieve this:
https://www.ehcache.org/documentation/3.3/migration-guide.html
I'm using Spring 3.2.18 and Hibernate versions as low as 3.3:
./WEB-INF/lib/hibernate-3.2.3.ga.jar
./WEB-INF/lib/hibernate-annotations-3.3.0.ga.jar
./WEB-INF/lib/hibernate-commons-annotations-4.0.1.Final.jar
./WEB-INF/lib/hibernate-validator-5.1.3.Final.jar
./WEB-INF/lib/spring-aop-3.2.18.RELEASE.jar
./WEB-INF/lib/spring-beans-3.2.18.RELEASE.jar
./WEB-INF/lib/spring-context-3.2.18.RELEASE.jar
./WEB-INF/lib/spring-context-support-3.2.18.RELEASE.jar
./WEB-INF/lib/spring-core-3.2.18.RELEASE.jar
./WEB-INF/lib/spring-expression-3.2.18.RELEASE.jar
./WEB-INF/lib/spring-jdbc-3.2.18.RELEASE.jar
./WEB-INF/lib/spring-jms-3.0.3.RELEASE.jar
./WEB-INF/lib/spring-orm-3.2.18.RELEASE.jar
./WEB-INF/lib/spring-oxm-3.2.18.RELEASE.jar
./WEB-INF/lib/spring-security-config-3.1.2.RELEASE.jar
./WEB-INF/lib/spring-security-core-3.2.10.RELEASE.jar
./WEB-INF/lib/spring-security-saml2-core-1.0.0.RC2.jar
./WEB-INF/lib/spring-security-web-3.2.10.RELEASE.jar
./WEB-INF/lib/spring-test-3.2.18.RELEASE.jar
./WEB-INF/lib/spring-tx-3.2.18.RELEASE.jar
./WEB-INF/lib/spring-web-3.2.18.RELEASE.jar
./WEB-INF/lib/spring-webmvc-3.2.18.RELEASE.jar
./WEB-INF/lib/spring-ws-core-2.1.4.RELEASE-all.jar
What is the easiest way to use ehcache 3 with code that currently uses net.sf.ehcache.hibernate.SingletonEhCacheProvider?
Is there a compatibility matrix? I see lots of search results about Hibernate 4+ and Spring/Spring Boot at higher versions than my code uses.
(note: this is legacy code not written by me :) And we do have plans to modernize but there's a more immediate security concern with ehcache 2 that I need to address)

How to do effective logging in Spark application

I have a Spark application written in Scala that runs a series of Spark SQL statements. The results are computed by calling a single 'count' action at the end against the final DataFrame. I would like to know the best way to do logging from within a Spark/Scala application job. Since all the DataFrames (around 20 of them) are computed using a single action at the end, what are my options when it comes to logging the outputs/sequence/success of some of the statements?
The question is a little generic in nature. Since Spark uses lazy evaluation, the execution plan is decided by Spark, and I want to know up to what point the application statements ran successfully and what the intermediate results were at that stage.
The intention is to monitor the long-running job, see up to which point things were fine, and where the problems crept in.
If we put logging before/after transformations, it gets printed when the code is read, not when it actually runs. So the logging has to be done with custom messages during the actual execution (when the action is called at the end of the Scala code). If I put count/take/first etc. in between the code, the execution of the job slows down a lot.
I understand the problem that you are facing. Let me put out a simple solution for this.
You need to make use of org.apache.log4j.Logger. Use the following lines of code to generate log messages:
org.apache.log4j.Logger logger = org.apache.log4j.Logger.getRootLogger();
logger.error(errorMessage);
logger.info(infoMessage);
logger.debug(debugMessage);
Now, in order to redirect these messages to a log file, you need to create a log4j properties file with the following contents:
# Root logger option
# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=OFF
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=OFF
log4j.logger.org.spark-project.jetty.servlet.ServletHandler=OFF
log4j.logger.org.spark-project.jetty.server=OFF
log4j.logger.org.spark-project.jetty=OFF
log4j.category.org.spark_project.jetty=OFF
log4j.logger.Remoting=OFF
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# Setting properties to have logger logs in local file system
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.encoding=UTF-8
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.conversionPattern=[%d] %p %m (%c)%n
log4j.appender.rolling.maxBackupIndex=5
log4j.appender.rolling.maxFileSize=50MB
log4j.logger.org.apache.spark=OFF
log4j.logger.org.spark-project=OFF
log4j.logger.org.apache.hadoop=OFF
log4j.logger.io.netty=OFF
log4j.logger.org.apache.zookeeper=OFF
log4j.rootLogger=INFO, rolling
log4j.appender.rolling.file=/tmp/logs/application.log
You can set the log file path in the last line. Make sure the folders exist on every node with the appropriate permissions.
Now we need to pass these configurations while submitting the Spark job, as follows:
--conf spark.executor.extraJavaOptions=-Dlog4j.configuration=spark-log4j.properties --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=spark-log4j.properties
And,
--files "location of spark-log4j.properties file"
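For illustration, a complete spark-submit invocation might look like the following (the main class, application JAR, master, and file path are placeholders):
spark-submit \
  --class com.example.MySparkJob \
  --master yarn \
  --files /path/to/spark-log4j.properties \
  --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=spark-log4j.properties \
  --conf spark.executor.extraJavaOptions=-Dlog4j.configuration=spark-log4j.properties \
  my-spark-job.jar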
Hope this helps!
You can use the log4j library from Maven:
<dependency>
  <groupId>org.apache.logging.log4j</groupId>
  <artifactId>log4j-api</artifactId>
  <version>${log4j.version}</version>
</dependency>
For logging, first you need to create a logger object, and then you can log at different levels such as info, error, and warning. Below is an example of logging an info message in Spark Scala using log4j:
import org.apache.logging.log4j.LogManager
val logger = LogManager.getLogger(this.getClass.getName)
logger.info("logging message")
So, to add information at certain points, call logger.info("logging message") at those points.

Orient DB 3.0.18: Compression with name 'snappy' is absent

I am trying to open a database created with OrientDB v2.2.37 using OrientDB 3.0.18; however, the error "Compression with name 'snappy' is absent" is output. How does one register snappy compression with OrientDB v3? I tried adding org.xerial.snappy to the Maven POM, but no joy. Thanks in advance.
2019-04-22 21:44:20 t.c.s.a.d.Services [DEBUG] error:stop:com.orientechnologies.orient.core.exception.OSecurityException: Compression with name 'snappy' is absent
com.orientechnologies.orient.core.exception.OSecurityException: Compression with name 'snappy' is absent
at com.orientechnologies.orient.core.compression.OCompressionFactory.getCompression(OCompressionFactory.java:79)
at com.orientechnologies.orient.core.storage.cluster.v0.OPaginatedClusterV0.init(OPaginatedClusterV0.java:1547)
at com.orientechnologies.orient.core.storage.cluster.v0.OPaginatedClusterV0.configure(OPaginatedClusterV0.java:154)
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.createClusterFromConfig(OAbstractPaginatedStorage.java:4804)
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.openClusters(OAbstractPaginatedStorage.java:519)
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.open(OAbstractPaginatedStorage.java:388)
at com.orientechnologies.orient.core.db.OrientDBEmbedded.open(OrientDBEmbedded.java:281)
at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.open(ODatabaseDocumentTx.java:903)
at com.orientechnologies.orient.core.db.OPartitionedDatabasePool$DatabaseDocumentTxPooled.internalOpen(OPartitionedDatabasePool.java:440)
at com.orientechnologies.orient.core.db.OPartitionedDatabasePool.openDatabase(OPartitionedDatabasePool.java:303)
at com.orientechnologies.orient.core.db.OPartitionedDatabasePool.acquire(OPartitionedDatabasePool.java:261)
at com.tinkerpop.blueprints.impls.orient.OrientBaseGraph.<init>(OrientBaseGraph.java:172)
at com.tinkerpop.blueprints.impls.orient.OrientTransactionalGraph.<init>(OrientTransactionalGraph.java:78)
at com.tinkerpop.blueprints.impls.orient.OrientGraph.<init>(OrientGraph.java:137)
at com.tinkerpop.blueprints.impls.orient.OrientGraphFactory$1.getGraph(OrientGraphFactory.java:87)
at com.tinkerpop.blueprints.impls.orient.OrientGraphFactory.getTx(OrientGraphFactory.java:224)
The resolution for those using OrientDB in embedded mode is as follows:
1) download OSnappyCompression.java from the OrientDB GitHub repo and incorporate it into your project
2) add the following library to your Maven pom.xml:
<dependency>
  <groupId>org.xerial.snappy</groupId>
  <artifactId>snappy-java</artifactId>
  <version>1.1.7.3</version>
</dependency>
3) register the Snappy compression method as follows prior to starting OrientDB:
OCompressionFactory.INSTANCE.register(new OSnappyCompression());
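Putting the three steps together, a minimal sketch for embedded mode (the class name, connection URL, and credentials are placeholders; the OSnappyCompression import must match wherever you placed the class copied in step 1):
import com.orientechnologies.orient.core.compression.OCompressionFactory;
import com.orientechnologies.orient.core.compression.impl.OSnappyCompression; // adjust to your copy's package
import com.tinkerpop.blueprints.impls.orient.OrientGraph;
import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory;

public class LegacyDbOpener {
    public static void main(String[] args) {
        // Register Snappy before any database with snappy-compressed clusters is opened.
        OCompressionFactory.INSTANCE.register(new OSnappyCompression());

        OrientGraphFactory factory =
                new OrientGraphFactory("plocal:/path/to/legacy-db", "admin", "admin");
        OrientGraph graph = factory.getTx();
        try {
            // ... work with the graph ...
        } finally {
            graph.shutdown();
            factory.close();
        }
    }
}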
Unfortunately, this compression was removed. You have to convert the database into a non-compressed version.

TreeTagger can't find Charsetname when used in Uima Pipeline

I would like to use TreeTagger for chunking inside a UIMA pipeline for German text. The chunking works fine when I start the tagger from cmd, but it causes the following error when used in the pipeline:
org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:401)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:308)
at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:570)
at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:412)
at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:344)
at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:265)
at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:269)
at org.apache.uima.fit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:150)
at de.fraunhofer.fkie.re_analysis.RA_pipeline.main(RA_pipeline.java:107)
Caused by: java.lang.NullPointerException: charsetName
at java.io.InputStreamReader.<init>(InputStreamReader.java:99)
at org.annolab.tt4j.TreeTaggerWrapper$Reader.<init>(TreeTaggerWrapper.java:946)
at org.annolab.tt4j.TreeTaggerWrapper.process(TreeTaggerWrapper.java:598)
at de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerChunker.process(TreeTaggerChunker.java:293)
at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)
... 8 more
I suppose I should specify the parameter "Chunk_Mapping_Location", but I don't know which file it should point to. The chunker is initialised as follows:
AnalysisEngineDescription chunker = AnalysisEngineFactory.createEngineDescription(
        TreeTaggerChunker.class,
        TreeTaggerChunker.PARAM_EXECUTABLE_PATH, "C:/TreeTagger/bin/tree-tagger.exe",
        TreeTaggerChunker.PARAM_MODEL_LOCATION, "C:/TreeTagger/lib/german-chunker-utf8.par",
        TreeTaggerChunker.PARAM_PERFORMANCE_MODE, true,
        TreeTaggerChunker.PARAM_PRINT_TAGSET, true,
        TreeTaggerChunker.PARAM_LANGUAGE, "de");
Looks like TreeTaggerChunker is missing PARAM_MODEL_ENCODING, which prevents it from being usable with directly specified models. I have opened an issue.
You can get around this by packaging the TreeTagger models as JARs using the build.xml Ant script included with DKPro Core. The process is described in the DKPro Core developer documentation.
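For illustration, once the German chunker model has been packaged as a JAR (via DKPro Core's build.xml) and added to the classpath, the descriptor should no longer need PARAM_MODEL_LOCATION at all; a minimal sketch, assuming DKPro Core can then resolve the model from PARAM_LANGUAGE:
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerChunker;

// The packaged model JAR on the classpath supplies the model file and its encoding.
AnalysisEngineDescription chunker = AnalysisEngineFactory.createEngineDescription(
        TreeTaggerChunker.class,
        TreeTaggerChunker.PARAM_EXECUTABLE_PATH, "C:/TreeTagger/bin/tree-tagger.exe",
        TreeTaggerChunker.PARAM_LANGUAGE, "de");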
Disclosure: I am one of the DKPro Core developers.

Starting KsqlRestApplication from Scala and getting NoSuchMethodError org.apache.kafka.streams.StreamsConfig.getConsumerConfigs

I am trying to write a program that enables me to run predefined KSQL operations on Kafka topics from Scala, but I don't want to open the KSQL CLI every time. Therefore I want to start the KSQL "server" from within my Scala program. If I understand the KSQL source code right, I have to build and start a KsqlRestApplication:
def restServer = KsqlRestApplication.buildApplication(
  new KsqlRestConfig(defaultServerProperties),
  true,
  new VersionCheckerAgent {
    override def start(ksqlModuleType: KsqlModuleType, properties: Properties): Unit = ???
  })
But when I try doing that, I get the following error:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.kafka.streams.StreamsConfig.getConsumerConfigs(Ljava/lang/String;Ljava/lang/String;)Ljava/util/Map;
at io.confluent.ksql.rest.server.BrokerCompatibilityCheck.create(BrokerCompatibilityCheck.java:62)
at io.confluent.ksql.rest.server.KsqlRestApplication.buildApplication(KsqlRestApplication.java:241)
I looked into the function call in BrokerCompatibilityCheck, and in the create function it calls StreamsConfig.getConsumerConfigs() with two Strings as parameters instead of the parameters defined in
https://kafka.apache.org/0102/javadoc/org/apache/kafka/streams/StreamsConfig.html#getConsumerConfigs(StreamThread,%20java.lang.String,%20java.lang.String).
Are my KSQL and Kafka versions simply not compatible, or am I doing something wrong?
I am using KSQL version 4.1.0-SNAPSHOT and Kafka version 1.0.0.
Yes, NoSuchMethodError typically indicates a version incompatibility between libraries.
The link you posted is to the javadoc for Kafka 0.10.2. The method hasn't changed in 1.0, but in the upcoming 1.1 it does take only two Strings:
https://kafka.apache.org/11/javadoc/org/apache/kafka/streams/StreamsConfig.html#getConsumerConfigs(java.lang.String,%20java.lang.String)
That suggests the version of KSQL you're using (4.1.0-SNAPSHOT) depends on version 1.1 of Kafka Streams, which is currently in the release-candidate phase and should be out soon:
https://lists.apache.org/thread.html/780c4458b16590e99261b69d7b41b9ec374a3226d72c8d38885a008a#%3Cusers.kafka.apache.org%3E
As per that email, you can find the latest (1.1.0-rc2) artifacts in the Apache staging repo:
https://repository.apache.org/content/groups/staging/
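For illustration, assuming a Maven build, you could point the build at that staging repository and pin Kafka Streams 1.1.0 explicitly; this is a sketch, not a verified fix:
<repositories>
  <repository>
    <id>apache-staging</id>
    <url>https://repository.apache.org/content/groups/staging/</url>
  </repository>
</repositories>
...
<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-streams</artifactId>
  <version>1.1.0</version>
</dependency>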