AspectJ used to profile query execution makes application too slow - aspectj

I am using AspectJ to capture the query being executed in each of the forms and to show the time each query takes to execute. We are using Spring JDBC and my aspect looks like this:
import javax.sql.DataSource;

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Pointcut;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.simple.SimpleJdbcTemplate;

@Aspect
public class QueryProfilerAspect {

    @Pointcut("call(* org.springframework.jdbc.core.simple.SimpleJdbcTemplate.query*(..))")
    public void profileQuery() {
    }

    @Around("profileQuery()")
    public Object profile(ProceedingJoinPoint thisJoinPoint) throws Throwable {
        // System.out.println("Inside join point execution");
        SimpleJdbcTemplate template = (SimpleJdbcTemplate) thisJoinPoint.getTarget();
        JdbcTemplate jdbcTemplate = (JdbcTemplate) template
                .getNamedParameterJdbcOperations().getJdbcOperations();
        DataSource ds = jdbcTemplate.getDataSource();
        // System.out.println("Datasource name URL =="
        //         + ds.getConnection().getMetaData().getURL());
        // System.out.println("Datasource name ==" + schemaName);
        String sqlQuery = thisJoinPoint.getArgs()[0].toString();
        final long start, end;
        start = System.nanoTime();
        Object result = thisJoinPoint.proceed();
        end = System.nanoTime();
        long executionTime = ((end - start) / 1000) / 1000;
        System.out.println("execution_time=" + executionTime + " sqlquery=" + sqlQuery);
        return result;
    }
}
Functionality-wise this works; however, putting it into my application makes the application too slow. I am using compile-time weaving, and the aspect finds 1683 query* method calls within the application.
Is there anything I can do to optimize this? Any suggestion/help will be really appreciated.

First, I would advise against using System.nanoTime(). Its precision when used cold like that is atrocious and pretty much useless for measuring timespans - certainly not better than System.currentTimeMillis() if you divide the result down to milliseconds anyway.
What probably slows you down most, though, are the String operations performed at various places in your aspect. If you have to concatenate Strings, at least use a StringBuilder to do so before outputting the result. That might be done by the optimizer already, but you can never be too sure. ;)
And... System.out isn't exactly the way to go if you want logging - looking into one of the various logging implementations (slf4j with logback is my personal favourite, but there are others as well) will be worth your time.
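To illustrate those points, here is a minimal sketch of the timing and logging part of such an aspect, assuming slf4j is on the classpath and keeping your original pointcut. It's just one way to do it, not a drop-in replacement for your class:

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Pointcut;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@Aspect
public class QueryProfilerAspect {

    private static final Logger LOG = LoggerFactory.getLogger(QueryProfilerAspect.class);

    @Pointcut("call(* org.springframework.jdbc.core.simple.SimpleJdbcTemplate.query*(..))")
    public void profileQuery() {
    }

    @Around("profileQuery()")
    public Object profile(ProceedingJoinPoint thisJoinPoint) throws Throwable {
        long start = System.currentTimeMillis();
        Object result = thisJoinPoint.proceed();
        long executionTime = System.currentTimeMillis() - start;
        // Parameterized logging: the message is only assembled when DEBUG is enabled,
        // so there is no String concatenation on the hot path in production.
        if (LOG.isDebugEnabled()) {
            LOG.debug("execution_time={} sqlquery={}", executionTime, thisJoinPoint.getArgs()[0]);
        }
        return result;
    }
}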
Especially since Spring already has the feature you are trying to build, as asked and answered here before: Seeing the underlying SQL in the Spring JdbcTemplate?
(Edit: I know that only covers the query, not the time measurement, but not having to worry about it should shave some time off your overhead as well.)
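For what it's worth, turning that built-in logging on usually just means raising the JdbcTemplate log category to DEBUG in whatever logging configuration you use - for example, a minimal logback.xml along these lines:

<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <!-- JdbcTemplate logs every SQL statement it executes at DEBUG level -->
  <logger name="org.springframework.jdbc.core.JdbcTemplate" level="DEBUG"/>
  <root level="INFO">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>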

Related

EF Core raw query with Like clause

I want to create queries using EF Core's FromSqlInterpolated or FromSqlRaw that allow me to use LIKE clauses, but I don't know the right way to do it without opening the application to SQL injection attacks.
A first approach took me to the following code:
var results = _context.Categories.FromSqlInterpolated(
$"Select * from Category where name like {"%" + partialName + "%"}");
The first test worked fine: it returns results when providing expected strings, and returns nothing when I provide something like ';select * from Category Where name='Notes'--%';.
Still, I don't know much about SQL injection, at least not enough to feel safe with the query shown above.
Does someone know if the query is safe, or if there is a right way to do it?
Thanks
From this document:
The FromSqlInterpolated and ExecuteSqlInterpolated methods allow using
string interpolation syntax in a way that protects against SQL injection attacks.
var results = _context.Categories.FromSqlInterpolated(
$"Select * from Category where name like {"%" + partialName + "%"}");
Or you can change your query to LINQ-to-Entities, like this:
var results = _context.Categories.Where(p => p.name.Contains(partialName));

Cloud Dataflow GlobalWindow trigger ignored

Using the AfterPane.elementCountAtLeast trigger does not work when run using the Dataflow runner, but works correctly when run locally. When run on Dataflow, it produces only a single pane.
The goal is to extract data from Cloud SQL, transform it and write it into Cloud Storage. However, there is too much data to keep in memory, so it needs to be split up and written to Cloud Storage in chunks. That's what I hoped this would do.
The complete code is:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
  .applyTransform(ParDo.of(new Translator()))
  .map(row => row.mkString("|"))
  // produce one global window with one pane per ~500 records
  .withGlobalWindow(WindowOptions(
    trigger = Repeatedly.forever(AfterPane.elementCountAtLeast(500)),
    accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES
  ))

val out = TextIO
  .write()
  .to("gs://test-bucket/staging")
  .withSuffix(".txt")
  .withNumShards(1)
  .withShardNameTemplate("-P-S")
  .withWindowedWrites() // gets us one file per window & pane

pipe.saveAsCustomOutput("writer", out)
I think the root of the problem may be that the JdbcIO class is implemented as a PTransform<PBegin,PCollection> and a single call to processElement outputs the entire SQL query result:
public void processElement(ProcessContext context) throws Exception {
    try (PreparedStatement statement =
            connection.prepareStatement(
                query.get(), ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
        statement.setFetchSize(fetchSize);
        parameterSetter.setParameters(context.element(), statement);
        try (ResultSet resultSet = statement.executeQuery()) {
            while (resultSet.next()) {
                context.output(rowMapper.mapRow(resultSet));
            }
        }
    }
}
In the end, I had two problems to resolve:
1. The process would run out of memory, and 2. the data was written to a single file.
There is no way to work around problem 1 with Beam's JdbcIO and Cloud SQL because of the way it uses the MySQL driver: by default the driver loads the entire result set within a single call to executeQuery. There is a way to get the driver to stream results instead, but I had to implement my own code to do that. Specifically, I implemented a BoundedSource for JDBC.
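For reference, the streaming mode is a MySQL Connector/J specific behaviour: a forward-only, read-only statement with a fetch size of Integer.MIN_VALUE streams rows one at a time instead of buffering the whole result. A minimal plain-JDBC sketch, outside of Beam, with a hypothetical URL, credentials and table:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class StreamingJdbcSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/testdb", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                "SELECT id, payload FROM big_table",
                ResultSet.TYPE_FORWARD_ONLY,
                ResultSet.CONCUR_READ_ONLY)) {
            // MySQL Connector/J: this combination switches the driver to row-by-row
            // streaming instead of materializing the full result set in memory.
            stmt.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + "|" + rs.getString("payload"));
                }
            }
        }
    }
}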
For the second problem, I used the row number to set the timestamp of each element. That allows me to explicitly control how many rows are in each window using FixedWindows.
elementCountAtLeast is a lower bound, so producing only one pane is a valid option for a runner.
You have a couple of options when doing this for a batch pipeline:
Option 1: Allow the runner to decide how big the files are and how many shards are written:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
  .applyTransform(ParDo.of(new Translator()))
  .map(row => row.mkString("|"))

val out = TextIO
  .write()
  .to("gs://test-bucket/staging")
  .withSuffix(".txt")

pipe.saveAsCustomOutput("writer", out)
This is typically the fastest option when the TextIO is preceded by a GroupByKey or by a source that supports splitting. To my knowledge JDBC doesn't support splitting, so your best option is to add a Reshuffle after the jdbcSelect, which will enable parallelization of processing after reading the data from the database.
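As a rough illustration of where that reshuffle goes, here is a minimal, self-contained Java Beam sketch, with Create standing in for the real JDBC read and a hypothetical output path:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Reshuffle;

public class ReshuffleSketch {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply("FakeJdbcRead", Create.of("row1|a", "row2|b", "row3|c"))
         // Break fusion after the single-threaded read so the downstream
         // transforms can be spread across workers.
         .apply("BreakFusion", Reshuffle.<String>viaRandomKey())
         .apply("Write", TextIO.write().to("gs://test-bucket/staging").withSuffix(".txt"));
        p.run().waitUntilFinish();
    }
}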
Option 2: Manually group into batches using the GroupIntoBatches transform:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
  .applyTransform(ParDo.of(new Translator()))
  .map(row => row.mkString("|"))
  .apply(GroupIntoBatches.ofSize(500))

val out = TextIO
  .write()
  .to("gs://test-bucket/staging")
  .withSuffix(".txt")
  .withNumShards(1)

pipe.saveAsCustomOutput("writer", out)
In general, this will be slower than option #1, but it does allow you to choose how many records are written per file.
There are a few other ways to do this, each with their pros and cons, but the above two are likely the closest to what you want. If you add more details to your question, I may revise this answer further.

Spark caching strategy

I have a Spark driver that goes like this:
EDIT - earlier version of the code was different & didn't work
var totalResult = ... // RDD[(key, value)]
var stageResult = totalResult

do {
  stageResult = stageResult.flatMap(
    // Some code that returns zero or more outputs per input,
    // and updates `acc` to number of outputs
    ...
  ).reduceByKey((x, y) => x.sum(y))
  totalResult = totalResult.union(stageResult)
} while (stageResult.count() > 0)
I know from properties of my data that this will eventually terminate (I'm essentially aggregating up the nodes in a DAG).
I'm not sure of a reasonable caching strategy here - should I cache stageResult each time through the loop? Am I setting up a horrible tower of recursion, since each totalResult depends on all previous incarnations of itself? Or will Spark figure that out for me? Or should I put each RDD result in an array and take one big union at the end?
Suggestions will be welcome here, thanks.
I would rewrite this as follows:
do {
  stageResult = stageResult.flatMap(
    // Some code that returns zero or more outputs per input
  ).reduceByKey(_ + _).cache
  totalResult = totalResult.union(stageResult)
} while (stageResult.count > 0)
I am fairly certain (95%) that the stageResult DAG used in the union will be the correct reference (especially since count should trigger it), but this might need to be double-checked.
Then when you call totalResult.ACTION, it will put all of the cached data together.
ANSWER BASED ON UPDATED QUESTION
As long as you have the memory space, I would indeed cache everything along the way, as it stores the data of each stageResult, unioning all of those data points at the end. In fact, each union does not rely on the past, as that is not the semantics of RDD.union; it merely puts them together at the end. You could just as easily change your code to use a val, due to RDD immutability.
As a final note, maybe the DAG visualization in the Spark UI will help in understanding why there would not be recursive ramifications.

How do I Benchmark RESTful Service with Variable Parameters?

I'm currently working on benchmarking a RESTful service I've made, and part of that is making sure it runs in a reasonable amount of time for a large array of parameters. For example, let's say I have a RESTful API of the form some_site.com/item?item_id=y. In that case, to be sure my service works as fast as I'd like it to, I want to try out many values for y one by one, preferably coming from some text file. I can't figure out any way of doing this in ab or httperf. I'm open to using a different benchmarking program if I have to, but would prefer something simple and light. What I want to do seems like something pretty standard, so I'm guessing there must already be a program that lets me do it, but an hour or so of googling hasn't gotten me an answer. Ideas?
Answer: JMeter (which is apparently awesome). This FAQ explains how to do it. Hopefully this helps someone else, as it took me about a day of searching to figure it out.
I have just had some good experience with using JavaScript (via BSF/Rhino) in JMeter.
I put one thread group in my test plan and stuck a 'Simple Controller' with two elements under it - an 'HTTP Request' sampler and a 'BSF PreProcessor'.
Set the BSF language to 'javascript' and either type the code into the text box or point it to a file (use a full path, or one relative to the CWD of the JMeter process).
/* Since `Math.random()` gives us a float, we use `java.util.Random()`
 * see: http://docs.oracle.com/javase/7/docs/api/java/util/Random.html */
var Random = new Packages.java.util.Random();
var min = 2;
var max = 10;
// pick a line count in [min, max)
var maxLines = min + Random.nextInt(max - min);
var s = '';
for (var d = 0; d < maxLines; d++) {
    s += d.toString() + ',' + Random.nextInt(1000).toString() + '\n';
}
// s => '0,312\n1,104\n2,608\n'
vars.put('PAYLOAD', s);
Now I can refer to ${PAYLOAD} in the HTTP request!
You can generate JSON, but you will need to upgrade jakarta-jmeter-2.5.1/lib/js-1.6R5.jar to the newest version of Rhino to get JSON.stringify and JSON.parse. That worked perfectly for me as well, though I thought I'd keep the example here simple.
You can use BSF pre-processor for URL params as well, just set another variable with vars.put('X', 'some value') and pass it as ${X} in the request parameter.
This blog post helped quite a bit, by the way.

ObjectContext, Entities and loading performance

I am writing a RIA service, which is also exposed using SOAP.
One of its methods needs to read data from a very big table.
At the beginning I was doing something like:
public IQueryable<MyItem> GetMyItems()
{
    return this.ObjectContext.MyItems.Where(x => x.StartDate >= start && x.EndDate <= end);
}
But then I stopped because I was worried about the performance.
As far as I understand, MyItems is fully loaded and "Where" just filters the elements that were loaded on the first access of the MyItems property. Because MyItems will have a really large number of rows, I don't think this is the right approach.
I tried to google the question a bit, but no interesting results came up.
So I was thinking I could create a new instance of the context inside the GetMyItems method and load MyItems selectively. Something like:
public IQueryable<MyItems> GetMyItems(string Username, DateTime Start, DateTime End)
{
    using (MyEntities ctx = new MyEntities())
    {
        var objQuery = ctx.CreateQuery<MyItems>(
            "SELECT * FROM MyItems WHERE Username = @Username AND Timestamp >= @Start AND Timestamp <= @End",
            new ObjectParameter("Username", Username),
            new ObjectParameter("Start", Start),
            new ObjectParameter("End", End));
        return objQuery.AsQueryable();
    }
}
But I am not at all sure this is the correct way to do it.
Could you please assist me and point out the right approach?
Thanks in advance,
Cheers,
Gianluca.
As far as I understand, MyItems is fully loaded and "Where" just filters the elements that were loaded on the first access of the MyItems property.
No, that's entirely wrong. GetMyItems returns an IQueryable, so nothing is loaded until the query is enumerated, and the Where clause is translated into SQL and executed by the database; only the matching rows ever come back. Don't fix "performance problems" until you actually have them. The code you already have is likely to perform better than the code you propose replacing it with, and the replacement certainly won't behave in the way you describe. But don't take my word for it. Use the performance profiler. Use SQL Profiler. And test!