I'm trying to use a log4j2 RegexFilter to filter the Spark warning Truncated the string representation of a plan since it was too long. Spark logs this warning because I set spark.sql.maxPlanStringLength=0, since I don't want query plan output in the application logs.
Here is my Spark app, which triggers the warning:
package sparklog4j2

import org.apache.spark.sql.SparkSession
import org.apache.logging.log4j.core.LoggerContext
import org.apache.logging.log4j.core.config.LoggerConfig
import org.apache.logging.log4j.{Logger, LogManager, Level}

object Demo {
  def main(args: Array[String]): Unit = {
    val ctx: LoggerContext = LogManager.getContext().asInstanceOf[LoggerContext]
    val conf = ctx.getConfiguration()
    println(s"CONFIG NAME: ${conf.getName}")

    val spark = SparkSession.builder().appName("log4j2 demo").getOrCreate()
    import spark.implicits._
    spark.createDataset[String](Seq("foo", "bar")).show
  }
}
I build a fat jar with sbt assembly:
scalaVersion := "2.12.15"
version := "1.0.0"
libraryDependencies ++= Seq(
"org.apache.logging.log4j" % "log4j-api" % "2.13.2",
"org.apache.logging.log4j" % "log4j-core" % "2.13.2",
"org.apache.logging.log4j" % "log4j-slf4j-impl" % "2.13.2",
"org.apache.logging.log4j" % "log4j-1.2-api" % "2.13.2" % "provided",
"org.apache.spark" %% "spark-core" % "3.2.1" % "provided",
"org.apache.spark" %% "spark-sql" % "3.2.1" % "provided",
)
Here is my log4j2.json, which defines the configuration-level RegexFilter:
{
  "configuration": {
    "name": "sparklog4j2-demo",
    "RegexFilter": {
      "regex": ".*Truncated.*",
      "onMatch": "DENY",
      "onMismatch": "NEUTRAL"
    },
    "loggers": {
      "logger": [
        {
          "name": "org.apache.spark.*",
          "level": "error",
          "includeLocation": true
        }
      ],
      "root": {
        "level": "error",
        "includeLocation": true
      }
    }
  }
}
And here is how I run the app:
spark-submit \
--verbose \
--class sparklog4j2.Demo \
--jars ./jars/log4j-1.2-api-2.13.2.jar \
--driver-java-options "-Dlog4j.configurationFile=files/log4j2.json -Dlog4j2.debug=true -DLog4jDefaultStatusLevel=trace" \
--conf "spark.sql.maxPlanStringLength=0" \
--files ./files/log4j2.json \
target/scala-2.12/log4j-spark-assembly-1.0.0.jar
As the app runs, this linkage error is emitted:
INFO StatusLogger Plugin [org.apache.hadoop.hive.ql.log.HiveEventCounter] could not be loaded due to linkage error.
java.lang.NoClassDefFoundError: org/apache/logging/log4j/core/appender/AbstractAppender
Even though I've packaged log4j-core, this error suggests it's missing.
However, the app runs fine, and I see CONFIG NAME: sparklog4j2-demo, which proves the app has loaded my log4j2.json config.
Yet spark emits this:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
...
WARN StringUtils: Truncated the string representation of a plan since it was too long.
So my filter is not working, and it appears that Spark is not even using my log4j config.
Instead of using a RegexFilter, I'm able to stop this warning by raising the priority threshold of the relevant loggers: I added these lines to spark-3.2.1-bin-hadoop3.2/conf/log4j.properties.template:
log4j.logger.org.apache.spark.sql.catalyst.util.StringUtils=ERROR
log4j.logger.org.apache.spark.sql.catalyst.util=ERROR
And modified my submit command to load the properties file:
spark-submit \
--class sparklog4j2.Demo \
--jars ./jars/log4j-1.2-api-2.13.2.jar \
--driver-java-options "-Dlog4j.configuration=File:$HOME/lib/spark-3.2.1-bin-hadoop3.2/conf/log4j.properties.template" \
--conf "spark.sql.maxPlanStringLength=0" \
target/scala-2.12/log4j-spark-assembly-1.0.0.jar
This doesn't resolve the issues I was having with log4j 2 but it does stop the warning.
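For reference, the same logger-level workaround written in the log4j2 JSON syntax I was originally trying to use would look roughly like this (untested here, since Spark isn't picking up my log4j2 configuration at all):
{
  "configuration": {
    "name": "sparklog4j2-demo",
    "loggers": {
      "logger": [
        {
          "name": "org.apache.spark.sql.catalyst.util",
          "level": "error"
        }
      ],
      "root": {
        "level": "error"
      }
    }
  }
}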
Related
I'm trying to add the finagle-http library to my new Bazel project as an external Maven dependency, but I'm getting the following errors. I assume I'm doing something wrong in creating the build without fully understanding it; I'm trying to learn. I appreciate any help on this.
error: object Service is not a member of package com.twitter.finagle
error: object util is not a member of package com.twitter
error: type Request is not a member of package com.twitter.finagle.http
error: object Response is not a member of package com.twitter.finagle.http
error: Symbol 'type com.twitter.finagle.Client' is missing from the classpath. This symbol is required by 'object com.twitter.finagle.Http'.
error: not found: value Await
The same code works when built with sbt. Below is the code.
import com.twitter.finagle.{Http, Service}
import com.twitter.finagle.http
import com.twitter.util.{Await, Future}
object HelloWorld extends App {
  val service = new Service[http.Request, http.Response] {
    def apply(req: http.Request): Future[http.Response] =
      Future.value(http.Response(req.version, http.Status.Ok))
  }

  val server = Http.serve(":8080", service)
  Await.ready(server)
}
WORKSPACE file
maven_install(
    artifacts = [
        "org.apache.spark:spark-core_2.11:2.4.4",
        "org.apache.spark:spark-sql_2.11:2.4.1",
        "org.apache.spark:spark-unsafe_2.11:2.4.1",
        "org.apache.spark:spark-tags_2.11:2.4.1",
        "org.apache.spark:spark-catalyst_2.11:2.4.1",
        "com.twitter:finagle-http_2.12:21.8.0",
    ],
    repositories = [
        "https://repo.maven.apache.org/maven2/",
        "https://repo1.maven.org/maven2/",
    ],
)
BUILD file
load("#io_bazel_rules_scala//scala:scala.bzl", "scala_binary")
package(default_visibility = ["//visibility:public"])
scala_binary(
name="helloworld",
main_class="microservices.HelloWorld",
srcs=[
"Main.scala",
],
deps = ["spark],
)
java_library(
name = "spark",
exports = [
"#maven//:com_twitter_finagle_http_2_12_21_8_0",
],
)
The sbt dependency that was working in my initial sbt project:
libraryDependencies += "com.twitter" %% "finagle-http" % "21.8.0"
Figured out the issue: unlike with sbt, in Bazel I had to individually add the related transitive dependencies. I modified the WORKSPACE as below.
maven_install(
    artifacts = [
        "com.twitter:finagle-http_2.12:21.8.0",
        "com.twitter:util-core_2.12:21.8.0",
        "com.twitter:finagle-core_2.12:21.8.0",
        "com.twitter:finagle-base-http_2.12:21.8.0",
        "com.fasterxml.jackson.module:jackson-module-scala_2.12:2.11.2",
        "com.fasterxml.jackson.core:jackson-databind:2.11.2",
    ],
    repositories = [
        "https://repo.maven.apache.org/maven2/",
        "https://repo1.maven.org/maven2/",
    ],
)
BUILD file
java_library(
    name = "finagletrial",
    exports = [
        "@maven//:com_twitter_finagle_http_2_12_21_8_0",
        "@maven//:com_twitter_util_core_2_12_21_8_0",
        "@maven//:com_twitter_finagle_core_2_12_21_8_0",
        "@maven//:com_twitter_finagle_base_http_2_12_21_8_0",
        "@maven//:com_fasterxml_jackson_module_jackson_module_scala_2_12_2_11_2",
        "@maven//:com_fasterxml_jackson_core_jackson_databind_2_11_2",
    ],
)
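The scala_binary target then just depends on this library, mirroring the BUILD file from my question:
scala_binary(
    name = "helloworld",
    main_class = "microservices.HelloWorld",
    srcs = ["Main.scala"],
    deps = [":finagletrial"],
)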
I am developing a Play application with Slick and HikariCP for connection pooling. After a few restarts of Play, my development Postgres server runs out of connections and shows:
db_1 | 2019-11-19 21:06:46.583 GMT [395] FATAL: remaining connection slots are reserved for non-replication superuser connections
db_1 | 2019-11-19 21:06:46.886 GMT [396] FATAL: remaining connection slots are reserved for non-replication superuser connections
db_1 | 2019-11-19 21:06:48.167 GMT [397] FATAL: remaining connection slots are reserved for non-replication superuser connections
I monitored with the SQL query SELECT state, COUNT(*) FROM pg_stat_activity GROUP BY state; and the count of idle connections does increase rapidly. I would like to resolve this issue so that I don't leak connections in development or production.
Any suggestions on how I can fix my leaking idle connections?
Setup
build.sbt
My build.sbt has the following dependencies:
"com.typesafe.play" %% "play-slick" % "4.0.2",
"com.typesafe.play" %% "play-slick-evolutions" % "4.0.2",
"com.typesafe.slick" %% "slick-codegen" % "3.3.2",
"com.typesafe.slick" %% "slick" % "3.3.2",
"org.slf4j" % "slf4j-nop" % "1.7.26",
"com.typesafe.slick" %% "slick-hikaricp" % "3.3.2",
"org.postgresql" % "postgresql" % "42.2.8",
Application.conf
My Postgres configuration is stored in my application.conf:
slick {
  dbs {
    default {
      profile = "slick.jdbc.PostgresProfile$"
      db {
        connectionPool = "HikariCP" // use HikariCP for our connection pool
        profile = "org.postgresql.Driver"
        dataSourceClass = "org.postgresql.ds.PGSimpleDataSource" // simple datasource with no connection pooling; the connection pool has already been specified with HikariCP
        properties = {
          serverName = "localhost"
          portNumber = "5432"
          databaseName = "website"
          user = "websiteserver"
          password = "397c9140fb0e2424396510b8d6e29a07aa1a92420027d3750ef1faed87bb617a"
        }
      }
      numThreads = 10
      connectionTimeout = 6000 // in the hope that this resolves the connection errors
      leakDetectionThreshold = 60000 // in the hope that this resolves the connection errors
    }
  }
}
Play app
Within my Play 2.7.3 app I load the database configuration using:
@Singleton
class PersonRepositoryImpl @Inject() ()(implicit ec: PersonExecutionContext)
    extends PersonRepository {

  // We want the JdbcProfile for this provider
  private val db = Database.forConfig("slick.dbs.default.db")
  private val persons = TableQuery[PersonTable]

  def create(p: Person)(implicit mc: MarkerContext): Future[PersonData] = db.run {
    // Some operations on persons
  }
}
I tried many different configurations but none seem to resolve the leaking connection issue that I'm facing.
You're calling Database.forConfig as a private val, when you need it to be an injected dependency. Every call to Database.forConfig creates a brand-new connection pool, and Play's dev-mode restarts instantiate your repository again without ever closing the old pool, which is why idle connections pile up. You should be leveraging play-slick to dependency-inject the database config provider:
@Singleton
class PersonRepository @Inject() (dbConfigProvider: DatabaseConfigProvider)(implicit ec: ExecutionContext) {
  // We want the JdbcProfile for this provider
  private val dbConfig = dbConfigProvider.get[JdbcProfile]
  ...
}
https://github.com/playframework/play-samples/blob/2.8.x/play-scala-slick-example/app/models/PersonRepository.scala#L15
Also see the documentation:
While you can get the DatabaseConfig instance manually by accessing the SlickApi, we’ve provided some helpers for runtime DI users (Guice, Scaldi, Spring, etc.) for obtaining specific instances within your controller.
Here is an example of how to inject a DatabaseConfig instance for the default database (i.e., the database named default in your configuration):
class Application @Inject() (protected val dbConfigProvider: DatabaseConfigProvider, cc: ControllerComponents)(
    implicit ec: ExecutionContext
) extends AbstractController(cc)
    with HasDatabaseConfigProvider[JdbcProfile] {
}
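Applied to the repository from the question, a minimal sketch could look like the following; PersonTable and PersonData are the question's own classes, and the query body is only a placeholder:
import javax.inject.{Inject, Singleton}
import play.api.db.slick.{DatabaseConfigProvider, HasDatabaseConfigProvider}
import slick.jdbc.JdbcProfile
import scala.concurrent.{ExecutionContext, Future}

@Singleton
class PersonRepositoryImpl @Inject() (protected val dbConfigProvider: DatabaseConfigProvider)(
    implicit ec: ExecutionContext
) extends HasDatabaseConfigProvider[JdbcProfile] {

  import profile.api._

  // PersonTable is the Slick table mapping from the question
  private val persons = TableQuery[PersonTable]

  // db comes from HasDatabaseConfigProvider and is backed by Play's single HikariCP pool,
  // so no extra pool is created per repository instance
  def list(): Future[Seq[PersonData]] = db.run {
    persons.result
  }
}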
I have a web service built using the Scala Play framework (version 2.5.12). I'm trying to capture metrics using Kamon and Prometheus.
Below is what I have done so far.
Dependencies:
"io.kamon" %% "kamon-play-2.5" % "1.1.0",
"io.kamon" %% "kamon-core" % "1.1.0",
"org.aspectj" % "aspectjweaver" % "1.9.2",
"io.kamon" %% "kamon-prometheus" % "1.1.1"
conf/application.conf
kamon {
  metric {
    tick-interval = 1 second
  }
  metric {
    filters {
      trace.includes = [ "**" ]
      akka-dispatcher.includes = [ "**" ]
    }
  }
  modules {
    kamon-log-reporter.auto-start = no
  }
}
I have initialized the Kamon reporter in one of my config files:
import kamon.Kamon
import kamon.prometheus.PrometheusReporter
Kamon.addReporter( new PrometheusReporter() )
I am adding tracing in one of my controllers:
import kamon.play.action.OperationName
override def test(userName: Option[String]): Action[JsValue] = OperationName("test-access") {
  Action.async(parse.json) {
    ......
  }
}
I build the jar and run it locally with the command below:
/bin/example-app -J-javaagent:./lib/org.aspectj.aspectjweaver-1.9.2.jar -Dorg.aspectj.tracing.factory=default
The application runs, and I can see in the logs that the reporter has started.
Below is the log:
2018-12-07 12:06:20,556 level=[INFO] logger=[kamon.prometheus.PrometheusReporter] thread=[kamon.prometheus.PrometheusReporter] rid=[] user=[] message=[Started the embedded HTTP server on http://0.0.0.0:9095]
But I don't see anything at http://localhost:9095/metrics; it is empty.
There is no error, and I'm unable to debug this. Is there anything I am missing here?
The documentation says that the metrics are exposed at http://localhost:9095/. There is no /metrics endpoint.
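If you need the embedded server on a different interface or port, it can be configured in application.conf; this is a sketch assuming the kamon-prometheus 1.x setting keys:
kamon.prometheus {
  embedded-server {
    hostname = 0.0.0.0
    port = 9095
  }
}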
I'd like to be able to write Scala in my local IDE and then deploy it to AWS Glue as part of a build process, but I'm having trouble finding the libraries required to build the GlueApp skeleton generated by AWS.
The aws-java-sdk-glue artifact doesn't contain the imported classes, and I can't find those libraries anywhere else. They must exist somewhere; perhaps they are just a Java/Scala port of this library: aws-glue-libs
The template scala code from AWS:
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._
object GlueApp {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    // @type: DataSource
    // @args: [database = "raw-tickers-oregon", table_name = "spark_delivery_2_1", transformation_ctx = "datasource0"]
    // @return: datasource0
    // @inputs: []
    val datasource0 = glueContext.getCatalogSource(database = "raw-tickers-oregon", tableName = "spark_delivery_2_1", redshiftTmpDir = "", transformationContext = "datasource0").getDynamicFrame()
    // @type: ApplyMapping
    // @args: [mapping = [("exchangeid", "int", "exchangeid", "int"), ("data", "struct", "data", "struct")], transformation_ctx = "applymapping1"]
    // @return: applymapping1
    // @inputs: [frame = datasource0]
    val applymapping1 = datasource0.applyMapping(mappings = Seq(("exchangeid", "int", "exchangeid", "int"), ("data", "struct", "data", "struct")), caseSensitive = false, transformationContext = "applymapping1")
    // @type: DataSink
    // @args: [connection_type = "s3", connection_options = {"path": "s3://spark-ticker-oregon/target", "compression": "gzip"}, format = "json", transformation_ctx = "datasink2"]
    // @return: datasink2
    // @inputs: [frame = applymapping1]
    val datasink2 = glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions("""{"path": "s3://spark-ticker-oregon/target", "compression": "gzip"}"""), transformationContext = "datasink2", format = "json").writeDynamicFrame(applymapping1)
    Job.commit()
  }
}
And the build.sbt I have started putting together for a local build:
name := "aws-glue-scala"
version := "0.1"
scalaVersion := "2.11.12"
updateOptions := updateOptions.value.withCachedResolution(true)
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1"
The documentation for the AWS Glue Scala API seems to outline functionality similar to what is available in the AWS Glue Python library. So perhaps all that is required is to download and build the PySpark AWS Glue library and add it to the classpath? That may be possible, since the Glue Python library uses Py4J.
@Frederic gave a very helpful hint to get the dependency from s3://aws-glue-jes-prod-us-east-1-assets/etl/jars/glue-assembly.jar.
Unfortunately that version of glue-assembly.jar is already outdated and brings in Spark version 2.1.
It's fine if you're using backward-compatible features, but if you rely on the latest Spark version (and possibly the latest Glue features), you can get the appropriate jar from a Glue dev endpoint under /usr/share/aws/glue/etl/jars/glue-assembly.jar.
Provided you have a dev-endpoint named my-dev-endpoint you can copy the current jar from it:
export DEV_ENDPOINT_HOST=`aws glue get-dev-endpoint --endpoint-name my-dev-endpoint --query 'DevEndpoint.PublicAddress' --output text`
scp -i dev-endpoint-private-key \
  glue@$DEV_ENDPOINT_HOST:/usr/share/aws/glue/etl/jars/glue-assembly.jar .
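Since sbt puts any jar found in the project's lib/ directory on the classpath automatically, copying the jar straight into lib/ is enough for a local build (same copy command, just with lib/ as the destination):
mkdir -p lib
scp -i dev-endpoint-private-key \
  glue@$DEV_ENDPOINT_HOST:/usr/share/aws/glue/etl/jars/glue-assembly.jar lib/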
Unfortunately, there are no libraries available for the Scala Glue API. I already contacted Amazon support and they are aware of this problem. However, they didn't provide any ETA for delivering the API jar.
As a workaround you can download the jar from S3. The S3 URI is s3://aws-glue-jes-prod-us-east-1-assets/etl/jars/glue-assembly.jar
See https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-repl.html
It is now supported; see a recent release from AWS:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html
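That page documents Maven coordinates for the Glue Scala ETL library. In sbt it would look roughly like the following; the repository URL, artifact name, and version here are from memory of that page, so verify them against the docs before relying on them:
// build.sbt -- Glue ETL library as a provided dependency (coordinates per the AWS docs page above)
resolvers += "aws-glue-etl-artifacts" at "https://aws-glue-etl-artifacts.s3.amazonaws.com/release/"
libraryDependencies += "com.amazonaws" % "AWSGlueETL" % "1.0.0" % Provided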
So I'm trying to use sbt-release, and I'm having issues where it's publishing the artifact to my snapshot repository, and not the release repository.
val Organization = "com.mycompany"
val Name = "My Service"
val Version = "0.1-SNAPSHOT"
...
settings = Defaults.coreDefaultSettings ++ Seq(
  name := Name,
  organization := Organization,
  version := Version,
  scalaVersion := ScalaVersion
  ...
  assemblyJarName in assembly := s"my-service-${Version}.jar",
  ...
)

publishTo := {
  val nexus = "my.nexus.url.com/repositories/"
  if (isSnapshot.value)
    Some("snapshots" at nexus + "snapshots/")
  else
    Some("releases" at nexus + "releases/")
},

credentials += Credentials(Path.userHome / ".ivy2" / ".credentials")
If I remove the -SNAPSHOT from the version, then it publishes it to the correct repository, but shouldn't sbt-release be telling it to do that by itself?
Also, if I get rid of the if (isSnapshot.value) check, then sbt publish will also publish to the wrong repository.
If I could get some help on this, I would really appreciate it.
It was the version I had hard-coded here. It was overriding version.sbt, which is where 0.1-SNAPSHOT should be stored.
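With that hard-coded value removed, the version lives only in version.sbt, which sbt-release bumps automatically during a release:
// version.sbt -- managed by sbt-release
version in ThisBuild := "0.1-SNAPSHOT"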