Is Pipeline in gremlin run faster and can save from out of memory errors? - titan

Here is the code snippet without pipeline,
map = [:]
edges = g.V('type', 'update').inE('createupdate')
for(edge in edges)
{
date_long = edge.since.toLong()
incrValInMap(map,date_long) // this function add\increment value in map
}
And Here is one with pipleline doing same thing
map = [:]
g.V('type', 'update').inE('createupdate').since.groupCount(map)
I have couple of questions
Is pipleline a lazy evaluation? mean it does not need all vertex gathered before it moves to next pipe?
Is second code snippet will run faster than first one?
Is second code snippet will result in less memory consumption

I don't see any real difference between your two snippets of code. Both use a Pipeline...the only difference is the way you are building your group count and how you iterate it. To me, the second code snippet is better as it is easier to read and requires less code. The groupCount step from your second code snippet is doing basically the same thing that your doing manually in your first snippet.

Related

Is it possible to remove column from the GetTableSql() in BIML

I use the GetTableSql() function in BIML a lot, but I often need to remove some columns from this function before it executes. Is this possible?
You'd have to write your own Extension method to do so. Looking through the code, you might be better off just scrubbing the column(s) from results of the method call - it just depends on what you're looking to do.
The current GetTableSql method is an extension method that chains a call to EmitTableScript which in turn calls a number of methods to build out the returning SQL. At least in the BimlStudio product, EmitTableScript is in Varigence.Biml.CoreLowerer.Capabilities.TableToPackageLowerer class in BimlExtensions.dll
After a bit more thinking, what might be an even better, less support headache would be to create a clone of the table node and then remove the columns you don't want.
Code approximately
var table0 = this.RootNode.Tables[0];
var tablePrime = table0;
// I don't have a biml project handy so this section is a guess
tablePrime.Columns.Clear();
// Might be AddRange if this method exists
// Remove all the columns that start with ignore, as an example of filtering columns
tablePrime.Columns.Add(table0.Columns.Where(x => !x.name.StartsWith("ignore"));
// end guess block
var sql = tablePrime.GetTableSql();

scalaz-stream consume stream based on computed value

I've got two streams and I want to be able to consume only one based on a computation that I run every x seconds.
I think I basically need to create a third tick stream - something like every(3.seconds) - that does the computation and then come up with sort of a switch between the other two.
I'm kind of stuck here (and I've only just started fooling around with scalaz-stream).
Thanks!
There are several ways we can approach this problem. One way to approach it is using awakeEvery. For concrete example, see here.
To describe the example briefly, consider that we would like to query twitter in every 5 sec and get the tweets and perform sentiment analysis. We can compose this pipeline as follows:
val source =
awakeEvery(5 seconds) |> buildTwitterQuery(query) through queryChannel flatMap {
Process emitAll _
}
Note that the queryChannel can be stated as follows.
def statusTask(query: Query): Task[List[Status]] = Task {
twitterClient.search(query).getTweets.toList
}
val queryChannel: Channel[Task, Query, List[Status]] = channel lift statusTask
Let me know if you have any question. As stated earlier, for the complete example, see this.
I hope it helps!

Get a value from RichPipe

I have a RichPipe with 3 fields: name: String, time: Long and value: Int. I need to get the value for a specific name, time pair. How can I do it? I can't figure it out from scalding documentation, as it is very cryptic and can't find any examples that do this.
Well a RichPipe is not a Key-Value store, that's why there is no documentation on using as a key-value store :) A RichPipe should be thought of as a pipe - so you can't get at data in the middle without first going in at one end and traversing the pipe till you find the element your looking for. Furthermore this is a little painful in Scalding because you have to write your results to disk (because it's built on top of Hadoop) and then read the result from disk in order to use it in your application. So the code will be something like:
myPipe.filter[String, Long](('name, 'time))(_ == (specificName, specificTime))
.write(Tsv("tmp/location"))
Then you'll need some higher level code to run the job and read the data back into memory to get at the result. Rather than write out all the code to do this (it's pretty straightforward), why don't you give some more context about what your use case is and what you are trying to do - maybe you can solve your problem under the Map-Reduce programming model.
Alternatively, use Spark, you'll have the same problem of having to traverse a distributed dataset, but you don't have the faff of writting to disk and reading back again. Furthermore you can use custom partitioner is Spark that could result in near key-value store like behaviour. But anyway naively, the code would be:
val theValueYouWant =
myRDD.filter {
case (`specificName`, `specificTime`, _) => true
case _ => false
}
.toArray.head._3

How to correctly get current loop count from a Iterator in scala

I am looping over the following lines from a csv file to parse them. I want to identify the first line since its the header. Whats the best way of doing this instead of making a var counter holder.
var counter = 0
for (line <- lines) {
println(CsvParser.parse(line, counter))
counter++
}
I know there is got to be a better way to do this, newbie to Scala.
Try zipWithIndex:
for (line <- lines.zipWithIndex) {
println(CsvParser.parse(line._1, line._2))
}
#tenshi suggested the following improvement with pattern matching:
for ((line, count) <- lines.zipWithIndex) {
println(CsvParser.parse(line, count))
}
I totally agree with the given answer, still that I've to point something important out and initially I planned to put in a simple comment.
But it would be quite long, so that, leave me set it as a variant answer.
It's prefectly true that zip* methods are helpful in order to create tables with lists, but they have the counterpart that they loop the lists in order to create it.
So that, a common recommendation is to sequence the actions required on the lists in a view, so that you combine all of them to be applied only producing a result will be required. Producing a result is considered when the returnable isn't an Iterable. So is foreach for instance.
Now, talking about the first answer, if you have lines to be the list of lines in a very big file (or even an enumeratee on it), zipWithIndex will go through all of 'em and produce a table (Iterable of tuples). Then the for-comprehension will go back again through the same amount of items.
Finally, you've impacted the running lenght by n, where n is the length of lines and added a memory footprint of m + n*16 (roughtly) where m is the lines' footprint.
Proposition
lines.view.zipWithIndex map Function.tupled(CsvParser.parse) foreach println
Some few words left (I promise), lines.view will create something like scala.collection.SeqView that will hold all further "mapping" function producing new Iterable, as are zipWithIndex and map.
Moreover, I think the expression is more elegant because it follows the reader and logical.
"For lines, create a view that will zip each item with its index, the result as to be mapped on the result of the parser which must be printed".
HTH.

In javascript, what are the trade-offs for defining a function inline versus passing it as a reference?

So, let's say I have a large set of elements to which I want to attach event listeners. E.g. a table where I want each row to turn red when clicked.
So my question is which of these is the fastest, and which uses the least memory. I understand that it's (usually) a tradeoff, so I would like to know my best options for each.
Using the table example, let's say there's a list of all the row elements, "rowList":
Option 1:
for(var r in rowList){
rowList[r].onclick = function(){ this.style.backgroundColor = "red" };
}
My gut feeling is that this is the fastest, since there is one less pointer call, but the most memory intensive, since each rowlist will have its own copy of the function, which might get serious if the onclick function is large.
Option 2:
function turnRed(){
this.style.backgroundColor = "red";
}
for(var r in rowList){
rowList[r].onclick = turnRed;
}
I'm guessing this is going to be only a teensy bit slower than the one above (oh no, one more pointer dereference!) but a lot less memory intensive, since the browser only needs to keep track of one copy of the function.
Option 3:
var turnRed = function(){
this.style.backgroundColor = "red";
}
for(var r in rowList){
rowList[r].onclick = turnRed;
}
I assume this is the same as option 2, but I just wanted to throw it out there. For those wondering what the difference between this and option 2 is: JavaScript differences defining a function
Bonus Section: Jquery
Same question with:
$('tr').click(function(){this.style.backgroundColor = "red"});
Versus:
function turnRed(){this.style.backgroundColor = "red"};
$('tr').click(turnRed);
And:
var turnRed = function(){this.style.backgroundColor = "red"};
$('tr').click(turnRed);
Here's your answer:
http://jsperf.com/function-assignment
Option 2 is way faster and uses less memory. The reason is that Option 1 creates a new function object for every iteration of the loop.
In terms of memory usage, your Option 1 is creating a distinct function closure for each row in your array. This approach will therefore use more memory than Option 2 and Option 3, which only create a single function and then pass around a reference to it.
For this same reason I would also expect Option 1 to be the slowest of the three. Of course, the difference in terms of real-world performance and memory usage will probably be quite small, but if you want the most efficient one then pick either Option 2 or Option 3 (they are both pretty much the same, the only real difference between the two is the scope at which turnRed is visible).
As for jQuery, all three options will have the same memory usage and performance characteristics. In every case you are creating and passing a single function reference to jQuery, whether you define it inline or not.
And one important note that is not brought up in your question is that using lots of inline functions can quickly turn your code into an unreadable mess and make it more difficult to maintain. It's not a big deal here since you only have a single line of code in your function, but as a general rule if your function contains more than 2-3 lines it is a good idea to avoid defining it inline. Instead define it as in Option 2 or Option 3 and then pass around a reference to it.