In short: Redis set time ≈ get time (strange?)
I ran some tests: insert 30,000 records and then read them back 30,000 times (Redis).
import time

import redis

# assumed: a local Redis instance on the default port
redis_conn = redis.StrictRedis(host='localhost', port=6379)

def redis_set(data):
    for k, v in data.iteritems():
        redis_conn.set(k, v)

def redis_get(data):
    for k in data.iterkeys():
        val = redis_conn.get(k)

def do_tests(num, tests):
    # set up a dict with keys/values to store and retrieve
    data = {'key' + str(i): 'val' + str(i) * 100 for i in range(num)}
    # run the tests
    for test in tests:
        start = time.time()
        print "Starting test .. %s" % test.__name__
        test(data)
        elapsed = time.time() - start
        print "%s: %d ops in %.2f seconds : %.1f ops/sec" % (test.__name__, num, elapsed, num / elapsed)

tests = [redis_set, redis_get]
do_tests(30000, tests)
Results
Redis:
redis_set: 30000 ops in 106.21 seconds : 282.4 ops/sec
redis_get: 30000 ops in 94.94 seconds : 316.0 ops/sec
Is this OK?
Nothing is going wrong here.
Since Redis is single-threaded, there is no lock penalty for read or write operations. Both GET and SET boil down to a handful of memory operations, and both are very fast.
According to your benchmark, SET is a little slower than GET. That is also reasonable, since a SET has to allocate memory for the newly added item, and memory allocation costs more than the other memory operations involved.
MongoDB, on the other hand, reads much faster than it writes, because it does a lot of optimization for reads, such as caching, and the intent locks it uses are much friendlier to readers: multiple readers can read the same data at the same time, while writers are exclusive.
I have a function whose purpose is to divide a dataset into arrays of a given size.
For example, given a dataset with 123 objects of type Foo and an arraysSize of 10, the result should be a Dataset[Array[Foo]] with 12 arrays of 10 Foos and 1 array of 3 Foos.
Right now the function works on collected data. I would like to change it to work on the Dataset directly, for performance reasons, but I don't know how.
This is my current solution:
private def mapToFooArrays(data: Dataset[Foo],
                           arraysSize: Int): Dataset[Array[Foo]] = {
  data.collect().grouped(arraysSize).toSeq.toDS()
}
The reason for this transformation is that the data will be sent as events. Instead of sending 1 million events each carrying information about 1 object, I would rather send, for example, 10 thousand events each carrying information about 100 objects.
IMO, this is a weird use case. I cannot think of a truly efficient way to do it, as it is going to require a lot of shuffling no matter how we do it.
But the following is still better, as it avoids collecting to the driver node and will thus be more scalable.
Things to keep in mind -
what is the value of data.count() ?
what is the size of a single Foo ?
what is the value of arraySize ?
what is your executor configuration ?
Based on these factors you will be able to come up with the desiredArraysPerPartition value.
val desiredArraysPerPartition = 50

private def mapToFooArrays(
    data: Dataset[Foo],
    arraysSize: Int
): Dataset[Array[Foo]] = {
  val size = data.count()
  val numArrays = (size.toDouble / arraysSize).ceil
  val numPartitions = (numArrays / desiredArraysPerPartition).ceil.toInt

  data
    .repartition(numPartitions)
    .mapPartitions(_.grouped(arraysSize).map(_.toArray))
}
After reading the edited part, I think the size of 100 in "10 thousand events with information about 100 objects" is not really strict, since it is referred to as "about 100". There can be more than one event with fewer than 100 Foos.
If we are not very strict about that size of 100, then there is no need to reshuffle at all.
We can locally group the Foos present in each partition. Since this grouping is done locally rather than globally, it may produce more than one array (potentially one per partition) with fewer than 100 Foos.
private def mapToFooArrays(
    data: Dataset[Foo],
    arraysSize: Int
): Dataset[Array[Foo]] =
  data
    .mapPartitions(_.grouped(arraysSize).map(_.toArray))
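For context, here is a minimal, self-contained sketch of how the local-grouping variant could be wired up end to end. Foo, the SparkSession setup, and the batch size of 100 are illustrative assumptions, not part of the original question:

import org.apache.spark.sql.{Dataset, SparkSession}

// Foo is a stand-in for the question's element type.
case class Foo(id: Long, payload: String)

object FooBatchingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("foo-batching")
      .master("local[*]") // local run for illustration only
      .getOrCreate()
    import spark.implicits._

    val foos: Dataset[Foo] =
      spark.range(0, 1000000).as[Long].map(i => Foo(i, s"payload-$i"))

    // Group locally within each partition: most arrays hold 100 Foos,
    // but the last group of every partition may hold fewer.
    val batches: Dataset[Array[Foo]] =
      foos.mapPartitions(_.grouped(100).map(_.toArray))

    // Each array would back one outgoing event; here we only inspect the size distribution.
    batches.map(_.length).groupBy("value").count().show()

    spark.stop()
  }
}

Because the grouping is purely local, the number of undersized arrays is bounded by the number of partitions, which is exactly the trade-off described above.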
I'm training Doc2Vec, and I'm using callbacks to try to see whether alpha is decreasing over training time, using this code:
import multiprocessing
import os

from gensim.models.callbacks import CallbackAny2Vec
from gensim.models.doc2vec import Doc2Vec
from gensim.test.utils import get_tmpfile


class EpochSaver(CallbackAny2Vec):
    '''Callback to save the model after each epoch.'''

    def __init__(self, path_prefix):
        self.path_prefix = path_prefix
        self.epoch = 0
        os.makedirs(self.path_prefix, exist_ok=True)

    def on_epoch_end(self, model):
        savepath = get_tmpfile(
            '{}_epoch{}.model'.format(self.path_prefix, self.epoch)
        )
        model.save(savepath)
        print(
            "Model alpha: {}".format(model.alpha),
            "Model min_alpha: {}".format(model.min_alpha),
            "Epoch saved: {}".format(self.epoch + 1),
            "Start next epoch"
        )
        self.epoch += 1


def train():
    workers = multiprocessing.cpu_count() * 4
    model = Doc2Vec(
        DocIter(),
        vector_size=600, alpha=0.03, min_alpha=0.00025, epochs=20,
        min_count=10, dm=1, hs=1, negative=0, workers=workers,
        callbacks=[EpochSaver("./checkpoints")]
    )
    print(
        "HS", model.hs, "Negative", model.negative, "Epochs",
        model.epochs, "Workers: ", model.workers,
        "Model alpha: {}".format(model.alpha)
    )
While training, I see that alpha is not changing over time; on each callback I see alpha = 0.03.
Is it possible to check whether alpha is decreasing, or is it really not decreasing at all during training?
One more question:
How can I benefit from all my cores while training Doc2Vec?
As we can see, each core is loaded to no more than about 30%.
The model.alpha property only holds the initially-configured starting-alpha – it's not updated to the effective learning-rate through training.
So, even if the value is being decreased properly (and I expect that it is), you wouldn't see it in the logging you've added.
Separate observations about your code:
in gensim versions at least through 3.5.0, maximum training throughput is most often reached with a workers value somewhere between 3 and the number of cores – but usually not the full number of cores (if it's higher than 12) or larger. So workers=multiprocessing.cpu_count()*4 is likely to be much slower than what you could achieve with a lower number.
if your corpus is large enough to support 600-dimensional vectors even after discarding words with fewer than min_count=10 occurrences, negative sampling may work faster and get better results than the hs mode. (The pattern in published work seems to be to prefer negative sampling for larger corpora.)
import java.util.concurrent.{Executors, TimeUnit}
import scala.annotation.tailrec
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.util.{Failure, Success}

object Fact extends App {

  def time[R](block: => R): Long = {
    val t0 = System.nanoTime()
    val result = block // call-by-name
    val t1 = System.nanoTime()
    val t: Long = TimeUnit.SECONDS.convert(t1 - t0, TimeUnit.NANOSECONDS)
    //println("Time taken seconds: " + t)
    t
  }

  def factorial(n: BigInt): BigInt = {
    @tailrec
    def process(n: BigInt, acc: BigInt): BigInt = {
      //println(acc)
      if (n <= 0) acc
      else process(n - 1, n * acc)
    }
    process(n, 1)
  }

  implicit val ec =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(2))

  val f1: Future[Stream[Long]] =
    Future.sequence(
      (1 to 50).toStream.map(x => Future { time(factorial(100000)) }))

  f1.onComplete {
    case Success(s) => println("Success : " + s.foldLeft(0L)(_ + _) + " seconds!")
    case Failure(f) => println("Fails " + f)
  }

  import scala.concurrent.duration._
  Await.ready(Future { 10 }, 10000 minutes)
}
I have the above factorial code, which needs to use multiple cores to complete the program faster and should utilize more cores.
So I change:
Executors.newFixedThreadPool(1) to utilize 1 core
Executors.newFixedThreadPool(2) to utilize 2 cores
With 1 core, the result appears in 127 seconds.
But with 2 cores, I get 157 seconds.
My question is: when I increase the number of cores (parallelism), it should give better performance, but it does not. Why?
Please correct me if I am wrong or missing something.
Thanks in advance.
How are you measuring the time? The result you are printing out is not the time the execution took, but the sum of the times of each individual call.
Running Fact.time(Fact.main(Array.empty)) in the REPL, I get 90 and 178 with two threads and one thread respectively. That seems to make sense...
First of all, Dima is right that what you print is the total execution time of all tasks rather than the total time until the last task is finished. The difference is that the former sums the time for all the work done in parallel, while only the latter shows the actual speed-up from multi-threading.
However, there is another important effect. When I run this code with 1, 2 and 3 threads and measure both the total time (time until f1 is ready) and the total parallel time (the one that you print), I get the following data (I also reduced the number of calculations from 50 to 20 to speed up my tests):
threads - total time - total parallel time
1 - 70 - 70
2 - 47 - 94
3 - 43 - 126
At first glance this looks OK, as the parallel time divided by the real time is about the same as the number of threads. But if you look a bit closer, you may notice that the speed-up going from 1 thread to 2 is only about 1.5x, and only 1.1x for the third thread. These figures also mean that the total time across all tasks actually goes up when you add threads. This might seem puzzling.
The answer to this puzzle is that your calculation is actually not CPU-bound. The thing is that the result (factorial(100000)) is a pretty big number. In fact, it is so big that it takes about 185 KB of memory to store. This means that at the later stages of the computation your factorial method becomes more memory-bound than CPU-bound, because that size is big enough to overflow the fastest CPU caches. And this is the reason why adding more threads slows down each individual calculation: yes, you do the calculations faster, but memory doesn't get any faster. So once you saturate the CPU caches and then the memory channel, adding more threads (cores) doesn't improve performance that much.
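To make the difference concrete, here is a minimal, self-contained sketch (not the original program) that measures both quantities at once: the wall-clock time until every future has completed, and the sum of the per-task times that the question's code adds up. The pool size of 2 and the 20 tasks are illustrative:

import java.util.concurrent.Executors

import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

object WallClockVsSummed extends App {
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(2))

  // The same kind of work as in the question: a factorial producing a very large BigInt.
  def factorial(n: Int): BigInt = (BigInt(1) to BigInt(n)).product

  val wallStart = System.nanoTime()

  // Each future reports how long its own task took, in nanoseconds.
  val perTaskNanos: Seq[Long] = Await.result(
    Future.sequence((1 to 20).map { _ =>
      Future {
        val t0 = System.nanoTime()
        factorial(100000)
        System.nanoTime() - t0
      }
    }),
    1.hour
  )

  val wallClockSeconds = (System.nanoTime() - wallStart) / 1e9 // time until the last task finished
  val summedTaskSeconds = perTaskNanos.sum / 1e9               // what the question's code adds up

  println(f"wall-clock: $wallClockSeconds%.1f s, summed task time: $summedTaskSeconds%.1f s")
  sys.exit() // the fixed thread pool uses non-daemon threads, so exit explicitly
}

With more threads, the wall-clock figure should shrink (up to the memory-bandwidth limit described above), while the summed task time tends to grow, which is exactly the effect seen in the measurements above.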
I would like to know how to measure the throughput rate of a production line in AnyLogic.
Question: are there any methods to measure the time between departures of agents at the sink block? (I will calculate the throughput rate by inverting the time-between-departures value.)
At the moment, I simply calculate the throughput using Little's law, from the average lead time and the WIP level of the line. I am not sure whether the throughput value from this calculation will be equal to the inverse of the time between departures.
I hope you guys could help me figure it out.
Thanks in advance!
There is a function "time()" that returns the current model time in model time units. Using this function, you can record the times when agent A and agent B left the system and calculate the difference between them. You can do this by writing code like the following in the "On exit" field of the "sink" block:
statistic.add(time() - TimeOfPreviousAgent);
TimeOfPreviousAgent = time();
"TimeOfPreviousAgent" is a variable of "double" type;
"statistic" is a "Statistic" element used to collect the measurements
This approach of measuring time in the process flow is described in the tutorial Bank Office.
As an alternative, you can store leaving time of each agent into a collection. Then, you will need to iterate over the samples stored in the collection to find the difference between each pair of samples.
Not sure if this will help, but it builds on Tatiana's answer. In the agent's statechart you can create the variables TimeIn, TimeOut, and TimeInSystem. Then, at the statechart entry point, have:
TimeIn = time();
And at the final state have:
TimeOut = time();
TimeInSystem = TimeOut - TimeIn;
To observe these times for each individual agent, you can use the following code:
System.out.println("I came in at " + TimeIn + " and exited at " + TimeOut + " and spent " + TimeInSystem + " seconds in the system");
Then, for statistical analysis, you can calculate the min, average, and max time in system across all agents by creating, in Main, the variables TotalTime, TotalAgentsServiced, AvgServiceTime, MaxServiceTime, and MinServiceTime, and then adding a function, say TrackAvgTimeInSystem, with an argument NextAgent of type double. In the function body have:
TotalTime += NextAgent;
TotalAgentsServiced += 1;
AvgServiceTime = TotalTime / TotalAgentsServiced;

if (MinServiceTime == 0) {
    MinServiceTime = NextAgent;
} else if (NextAgent < MinServiceTime) {
    MinServiceTime = NextAgent;
}

if (NextAgent > MaxServiceTime) {
    MaxServiceTime = NextAgent;
}
Then, within your agent's statechart, in the final state, call the function:
get_Main().TrackAvgTimeInSystem(TimeInSystem);
This then calculates the min, max, and average time in system across all agents.
I'm trying to "profile" an expensive method simply by printing the system time. I've written a small method that prints the current time in seconds, relative to the start time:
object Bechmark extends App {
  var starttime = 0L

  def printTime(): Unit = {
    if (starttime == 0L) {
      starttime = System.currentTimeMillis()
    }
    println((System.currentTimeMillis() - starttime) / 1000.0)
  }

  printTime()
  Thread.sleep(100)
  printTime()
}
I expect therefore that the first call to printTime prints something close to 0. But the output I get is
0.117
0.221
I don't understand why the first call already gives me ~120 milliseconds. What is the correct implementation for my purpose?
As others have mentioned, the running time of your application does not necessarily represent the actual wall-clock time elapsed. There are several factors that affect it: JVM warm-up time, JVM garbage collection, whether the JVM has reached a steady state suitable for accurate measurement, and OS process scheduling.
For Scala-related purposes I suggest ScalaMeter, which allows you to tune all of the aforementioned variables and measure the time quite accurately.
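As a rough illustration, an inline ScalaMeter measurement could look like the sketch below; expensiveMethod is a stand-in for whatever you actually want to profile, and the configuration values are just examples:

import org.scalameter._

object ExpensiveMethodBenchmark extends App {
  // expensiveMethod is a placeholder for the method being profiled.
  def expensiveMethod(): Unit = Thread.sleep(100)

  // Warm the JVM up first, then average over several measured runs.
  val time = config(
    Key.exec.benchRuns -> 20
  ) withWarmer {
    new Warmer.Default
  } withMeasurer {
    new Measurer.Default
  } measure {
    expensiveMethod()
  }

  println(s"Measured time: $time")
}

The returned value is a quantity with its unit attached (milliseconds for the default measurer), and the warmer runs the block several times before measuring, so JIT compilation and caching effects are largely excluded from the reported result.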