I have a Spark job written in Scala that ultimately writes out to AWS DynamoDB. I want to write some unit tests around it, but the only problem is I don't have a clue how to go about mocking the bit that writes to DynamoDB. I'm making use of their emr-dynamodb-connector class, which means I'm not using any dependency injection (otherwise this would be easy).
After I read in some RDD data using Spark, I do some simple transforms on it into a pair RDD of type (org.apache.hadoop.io.Text, org.apache.hadoop.dynamodb.DynamoDBItemWritable). So my code's only contact with Dynamo is creating DynamoDBItemWritable objects. That class doesn't inherently contain any logic to use the AWS SDK to save anything; it's essentially just a data object. My code then calls this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.JobConf

// Hadoop configuration consumed by the emr-dynamodb-connector input/output formats
val conf = new Configuration()
conf.set("dynamodb.servicename", "dynamodb")
conf.set("dynamodb.input.tableName", "MyInputTable")
conf.set("dynamodb.output.tableName", "MyOutputTable")
conf.set("dynamodb.endpoint", "https://dynamodb.us-east-1.amazonaws.com")
conf.set("dynamodb.regionid", "us-east-1")
conf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
conf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")

myTransformedRdd.saveAsHadoopDataset(new JobConf(conf))
...and the connector magically registers the right classes and makes the right calls so that it effectively saves the results to DynamoDB accordingly.
I can't mock SparkSession because it has a private constructor (that would be extremely messy anyway). And I don't have any direct way, as far as I know, to mock the DynamoDB client. Is there some magic syntax in Scala (or Scalatest, or Scalamock) to allow me to tell it that if it ever wants to instantiate a Dynamo client class, that it should use a mocked version instead?
If not, how would I go about testing this code? I suppose theoretically, perhaps there's a way to set up a local, in-memory instance of Dynamo and then change the value of dynamodb.endpoint but that sounds horribly messy just to get a unit test working. Plus I'm not sure it's possible anyway.
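To make it concrete, here's roughly the kind of test I'd like to end up with, asserting on the (Text, DynamoDBItemWritable) pairs rather than on the actual write (just a sketch; myTransform and the attribute name are placeholders for my real code, and the AnyFunSuite import assumes a recent ScalaTest):

import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.io.Text
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class TransformSpec extends AnyFunSuite {

  test("transform produces the expected DynamoDB items") {
    // Local SparkSession, no DynamoDB involved anywhere.
    val spark = SparkSession.builder().master("local[2]").appName("transform-test").getOrCreate()
    try {
      val input = spark.sparkContext.parallelize(Seq("some,raw,record"))
      // myTransform stands in for the real transformation under test.
      val result: Array[(Text, DynamoDBItemWritable)] = myTransform(input).collect()

      // Assert on the DynamoDBItemWritable contents instead of mocking a client.
      val item = result.head._2.getItem
      assert(item.get("someAttribute").getS == "raw")
    } finally {
      spark.stop()
    }
  }
}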
Take a look at LocalStack. It provides an easy-to-use test/mocking framework for developing AWS-related applications by spinning up AWS-compatible APIs on your local machine or in Docker. It supports a couple of dozen AWS APIs, and DynamoDB is among them. It is a great tool for functional testing without needing a separate environment in AWS.
If you need only DynamoDB, there is another tool: DynamoDB Local, a Docker image with Amazon DynamoDB on board.
Both are as simple as starting a Docker container:
docker run -p 8000:8000 amazon/dynamodb-local
docker run -P localstack/localstack
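For the DynamoDB case in the question, the test run can then reuse the same Hadoop configuration and simply point the endpoint at the local container instead of AWS (a sketch; port 8000 assumes the mapping shown above, and the tables need to exist in the local instance first):

// Test configuration: same properties as in the question, but targeting DynamoDB Local.
// Assumes the container is mapped to port 8000 as above and the tables were created locally.
conf.set("dynamodb.endpoint", "http://localhost:8000")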
And if you're using JUnit 5 for the tests, let me recommend JUnit 5 extensions for AWS: a few JUnit 5 extensions that can be useful for testing AWS-related code. They can inject clients for the AWS services provided by tools like LocalStack (or the real ones). Both AWS Java SDK v2.x and v1.x are supported.
There are already similar questions about this error and suggested solutions; e.g. increasing max_connections in postgresql.conf and / or adapting the max number of connections your app requests. However, my question is more specific to using jOOQ in a Spring Boot application.
I integrated jOOQ into my application as in the example on GitHub. Namely, I am using DataSourceConnectionProvider with TransactionAwareDataSourceProxy to handle database connections, and I inject the DSLContext in the classes that need it.
My application provides various web services to front-ends, and I've never encountered that PSQLException in the dev or test environments so far. I only started getting that error when running all integration tests (around 1000) locally. I don't suspect a leak in connection handling, as Spring and jOOQ manage the resources; nevertheless, the error got me worried that it could also happen in production.
Long story short, is there a better alternative to using DataSourceConnectionProvider to manage connections? Note that I already tried using DefaultConnectionProvider as well, and tried to make spring.datasource.max-active less than max_connections allowed by Postgres. Neither fixed my problem so far.
Since your question seems not to be about the generally best way to work with PostgreSQL connections / data sources, I'll answer the part about jOOQ and using its DataSourceConnectionProvider:
Using DataSourceConnectionProvider
There is no better alternative in general. In order to understand DataSourceConnectionProvider (the implementation), you have to understand ConnectionProvider (its specification). It is an SPI that jOOQ uses for two things:
to acquire() a connection prior to running a statement or a transaction
to release() a connection after running a statement (and possibly, fetching results) or a transaction
The DataSourceConnectionProvider does so by acquiring a connection from your DataSource through DataSource.getConnection() and by releasing it through Connection.close(). This is the most common way to interact with data sources, in order to let the DataSource implementation handle transaction and/or pooling semantics.
Whether this is a good idea in your case may depend on individual configurations that you have made. It generally is a good idea because you usually don't want to manually manage connection lifecycles.
Using DefaultConnectionProvider
This can certainly be done instead, in which case jOOQ does not close() your connection for you; you'll do that yourself. I expect this to have no effect in your particular case, as you would just be implementing the DataSourceConnectionProvider semantics manually, e.g.
try (Connection c = ds.getConnection()) {
// Implicitly using a DefaultConnectionProvider
DSL.using(c).select(...).fetch();
// Implicit call to c.close()
}
In other words: this is likely not a problem related to jOOQ, but to your data source.
For a microservice I need the functionality to persist state (changes). Essentially, the following happens:
case class Item(i: Int)
val item1 = Item(0)
val item2 = exec(item1)
Where exec is user defined and hence not known in advance. As an example, let's assume this implementation:
def exec(item: Item) = item.copy(i = item.i + 1)
After each call to exec, I want to log the state changes (here: item.i: 0 -> 1) so that:
there is a history (e.g. a list of tuples like (timestamp, what has changed, old value, new value); see the sketch after this list)
state changes and snapshots can be persisted efficiently to a local file system and sent to a journal
arbitrary consumers (not only the specific producer where the changes originated) can be restored from the journal/snapshots
there are as few dependencies on libraries and infrastructure as possible (it is a small project; complex infrastructure/server installation & maintenance is not an option)
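To make the first point concrete, the kind of change record I have in mind looks roughly like this (illustrative only; StateChange and diff are just names made up for this question):

// A single logged change plus a naive diff for the Item example above.
case class StateChange(timestamp: Long, field: String, oldValue: Any, newValue: Any)

def diff(before: Item, after: Item): Seq[StateChange] =
  if (before.i != after.i)
    Seq(StateChange(System.currentTimeMillis(), "i", before.i, after.i))
  else
    Seq.empty

val history: Seq[StateChange] = diff(item1, item2) // here: i: 0 -> 1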
I know that EventStoreDB is probably the best solution; however, in the given environment (a huge enterprise with a lot of policies), it is not possible for me to install & run it. The only infrastructural options are an RDBMS or Kafka. I'd like to go with Kafka, as it seems to be the natural fit for this event sourcing use case.
I also noticed that Akka Persistence seems to handle all of these requirements well (roughly along the lines of the sketch after the questions below). But I have a couple of questions:
Are there any alternatives I missed?
Akka Persistence's Kafka integration is only available through a community plugin that is not maintained regularly, which suggests to me that this is not a common use case. Is there any reason the outlined architecture is not widespread?
Is cloning possible? In the Akka documentation it says:
"So, if two different entities share the same persistenceId,
message-replaying behavior is corrupted."
So, let's assume two application instances, one and two, both have unique persistenceIds. Could two be restored (cloned) from one's journal? Even if they don't share the same Id (which is not allowed)?
Are there any complete examples of this architecture available?
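For reference, the sketch mentioned above of how I picture the Akka Persistence side (based on the classic PersistentActor API; the event type is made up, and Item/exec are taken from the example above):

import akka.persistence.PersistentActor

// Event persisted for every state change of Item (as defined above).
case class ItemChanged(timestamp: Long, oldValue: Int, newValue: Int)

class ItemActor(override val persistenceId: String) extends PersistentActor {

  private var state = Item(0)

  override def receiveCommand: Receive = {
    case "exec" =>
      val next = exec(state) // exec as defined above
      persist(ItemChanged(System.currentTimeMillis(), state.i, next.i)) { _ =>
        state = next
      }
  }

  override def receiveRecover: Receive = {
    case ItemChanged(_, _, newValue) => state = Item(newValue)
  }
}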
We are using appliance-based Mirth Connect version 3.4.2.
We have a few transformers that are common to all the channels, but they are still duplicated under each channel. Any time we have to modify something, we have to make the change in every channel.
We have transformers for
some functions with javascript and java code
some mappings
some database operations like inserts etc
Can we put this code somewhere where it is shared across channels, so we don't need to write the transformers under each channel?
A good way to do this is to move common code (functions, database operations, etc) into code templates.
some functions with JavaScript - Edit Code Templates is the place where you can put common code that should be available to all channels.
some database operations like inserts - I believe (and consider it good practice) that these should stay specific to channels. If a function is specific to a certain channel and used in many places within it, declare it in whichever channel script needs it: deploy, preprocessor, postprocessor or undeploy.
some mappings - I'm not sure about this one. If you do the mapping in JavaScript, you can achieve it by making it a global variable in the global scripts or in code templates.
some Java code - If it is Java code, with a library built so that you invoke a script on top of it, then give the Java library getters and setters so that you can traverse to any depth from your Mirth script to access the Java objects.
For example, if you are building XML there are many libraries you can use, like the StAX parser, JDOM, etc., but using a DocumentBuilderFactory to build the XML allows you to access the Java objects at any depth from the Mirth script.
I need to start a Gatling simulation from a main application. The use case is as follows:
The application reads a specification and generates test cases based on it.
The test cases are converted into Gatling scenarios.
The scenarios are run in a Gatling simulation.
So far I have managed to do this via the sbt plugin. However, this is inconvenient if we want to reuse the tool I'm developing in other contexts (imagine non-Scala projects, for instance).
I'm also generating the Gatling scenarios dynamically, which means I cannot simply pass a Scala class to the Gatling binary.
I was able to run the simulation as follows:
Gatling.fromArgs(args, Some(classOf[Simulation]), _ => new ValidationTest)
Where ValidationTest is the class that generates the scenarios dynamically. However, I'm not sure that is the proper way of using Gatling in a standalone application.
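For comparison, the other programmatic entry point would go through Gatling's properties builder, roughly like this (a sketch only; the exact GatlingPropertiesBuilder API differs between Gatling versions, and the results directory is just an assumption):

import io.gatling.app.Gatling
import io.gatling.core.config.GatlingPropertiesBuilder

// Launch Gatling by simulation class name; ValidationTest is the same scenario-generating class as above.
object Runner {
  def main(args: Array[String]): Unit = {
    val props = new GatlingPropertiesBuilder()
      .simulationClass(classOf[ValidationTest].getName)
      .resultsDirectory("target/gatling-results") // assumed output location
    Gatling.fromMap(props.build)
  }
}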
I am developing a search function for a web application with Lucene.Net and NHibernate.Search. The application is used by a lot of companies but runs as a single service, using a different database for each company. Therefore I need an index directory per database rather than one directory for the entire application. Is there a way to achieve this in Lucene.Net?
I have also considered storing the indexes for each company in their respective databases but haven't found any satisfying components for this. I have read about Compass and JdbcDirectory for Java, but I need something for C# or NHibernate. Does anyone know if there is a port of JdbcDirectory or something similar for C#?
Hmm, it looks like you can't change anything at the session factory level using normal nhibernate.search. You may need separate instances of a configuration, or maybe try something along the lines of Fluent NHibernate Search to ease the pain.
Piecing it together from the project's wiki it appears you could do something like this to spin up separate session factories pointing to different databases / index directories:
var sessionFactory = Fluently.Configure()
.Database(SQLiteConfiguration.Standard.InMemory())
.Search(s => s.DefaultAnalyzer().Standard()
.DirectoryProvider().FSDirectory()
.IndexBase("~/Index")
.IndexingStrategy().Event()
.MappingClass<LibrarySearchMapping>())
.BuildConfiguration()
.BuildSessionFactory();
The "IndexBase" property and the connection are the parts you'll need to define per customer. Once you get the session factories set up you could resolve them using whatever strategy you use currently.