Lazy filtering of a range? - scala

I was trying to calculate the first x prime numbers by using the following line of code:
(1 to Int.MaxValue).filter(is_prime _).take(x)
However, the program just didn't stop and I had to kill it (I didn't want to wait until Int.MaxValue was reached). How could I rewrite this to run in reasonable time while maintaining the simplicity?

You can also use a Stream (or an Iterator, which has the advantage of not memoizing its elements):
Stream.from(1).filter(is_prime).take(x)

A Range is strict, so filter will traverse the whole collection before take is ever applied. Try using a view instead, which makes the intermediate operations lazy:
(1 to Int.MaxValue).view.filter(is_prime _).take(x)
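For completeness, here is a minimal runnable sketch of both approaches; the naive isPrime below is only a stand-in for your own is_prime predicate:
object LazyPrimes extends App {
  // Naive primality test, a stand-in for the question's is_prime.
  def isPrime(n: Int): Boolean =
    n > 1 && (2 to math.sqrt(n).toInt).forall(n % _ != 0)

  val x = 10

  // Lazy view over the range: filter and take are evaluated on demand.
  val viaView = (1 to Int.MaxValue).view.filter(isPrime).take(x).toList

  // Iterator variant: also lazy, and it does not keep computed elements in memory.
  val viaIterator = Iterator.from(1).filter(isPrime).take(x).toList

  println(viaView)     // List(2, 3, 5, 7, 11, 13, 17, 19, 23, 29)
  println(viaIterator) // same result
}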

Gremlin: Calculate division based on two counts in one line of code

I have two counts, calculated as follows:
1) g.V().hasLabel('brand').where(__.inE('client_brand').count().is(gt(0))).count()
2) g.V().hasLabel('brand').count()
and I want to get one line of code that results in the first count divided by the second.
Here's one way to do it:
g.V().hasLabel('brand').
  fold().as('a','b').
  math('a/b').
    by(unfold().where(inE('client_brand')).count()).
    by(unfold().count())
Note that I simplified the first traversal to just .where(inE('client_brand')).count(), since you only care that there is at least one edge; there's no need to count them all and do a compare.
You could also use union(), like:
g.V().hasLabel('brand').
  union(where(inE('client_brand')).count(),
        count()).
  fold().as('a','b').
  math('a/b').
    by(limit(local,1)).
    by(tail(local))
While the first one is a bit easier to read/follow, the second is arguably nicer because it only stores a list of the two counts, whereas the first stores a list of all the "brand" vertices, which is more memory intensive.
Yet another approach, provided by Daniel Kuppitz, uses groupCount() in an interesting way:
g.V().hasLabel('brand').
  groupCount().
    by(choose(inE('client_brand'),
              constant('a'),
              constant('b'))).
  math('a/(a+b)')
The following solution, which uses the sack() step, shows why we have the math() step:
g.V().hasLabel('brand').
  groupCount().
    by(choose(inE('client_brand'),
              constant('a'),
              constant('b'))).
  sack(assign).
    by(coalesce(select('a'), constant(0))).
  sack(mult).
    by(constant(1.0)). /* we need a double */
  sack(div).
    by(select(values).sum(local)).
  sack()
If you can use lambdas then:
g.V().hasLabel('brand').
  union(where(inE('client_brand')).count(),
        count()).
  fold().
  map{ it.get()[0] / it.get()[1] }
This is what worked for me:
g.V().limit(1).project('client_brand_count','total_brands')
  .by(g.V().hasLabel('brand')
    .where(__.inE('client_brand').count().is(gt(0))).count())
  .by(g.V().hasLabel('brand').count())
  .map{it.get().values()[0] / it.get().values()[1]}
  .project('brand_client_pct')

How to use OSRM's match service

As stated in the header: how can I use the match call?
I tried
http://router.project-osrm.org/match/v1/driving/8.610048,46.99917;8.530232,47.051?overview=full&radiuses=49;49
I am not sure whether the list of radiuses is given correctly.
I can't get it to work. I also tried [49;49] and {49;49}. The same coordinates work with the route service:
http://router.project-osrm.org/route/v1/driving/8.610048,46.99917;8.530232,47.051?overview=full
For background, see here.
Edit: If you look at the example here, it seems the timestamps are not needed: /match/v1/{profile}/{coordinates}?steps={true|false}&geometries={polyline|polyline6|geojson}&overview={simplified|full|false}&annotations={true|false}
From the docs:
Large jumps in the timestamps (> 60s) or improbable transitions lead to trace splits if a complete matching could not be found.
I think that's the problem with your request. The two given points are more than 60s apart, so OSRM cannot match them successfully. The radiuses are specified correctly.
The following query works for me:
http://router.project-osrm.org/match/v1/driving/8.610048,46.99917;8.620048,46.99917?overview=full&radiuses=49;49
This returns:
{"tracepoints":[{"location":[8.610971,46.998963],"name":"Alte Kantonstrasse","hint":"GKUFgJEhBwAAAAAAHQAAAAAAAAC5AAAAAAAAAB0AAAAAAAAAuQAAAPsCAACbZIMAsyXNAgBhgwCCJs0CAAAPABki8hY=","matchings_index":0,"waypoint_index":0,"alternatives_count":0},{"location":[8.620295,46.999681],"name":"Schönenbuchstrasse","hint":"nIEFAJ7IFIA3AAAAZAAAAAAAAADYAAAANwAAAGQAAAAAAAAA2AAAAPsCAAAHiYMAgSjNAhCIgwCCJs0CAAAPABki8hY=","matchings_index":0,"waypoint_index":1,"alternatives_count":5}],"matchings":[{"distance":922.3,"duration":114.1,"weight":114.1,"weight_name":"routability","geometry":"onz}Gqyps#Wg#S_#aCaFMUYo#c#w#OKOCWmAWs#aBiDsAsCMYH[HY\\_#h#ObBW^w#BQAUKu#ASF[ZaABOFYpAyIf#mD","confidence":0.000982,"legs":[{"distance":922.3,"duration":114.1,"weight":114.1,"summary":"","steps":[]}]}],"code":"Ok"}
So the two given input points 8.610048,46.99917 and 8.620048,46.99917 are matched to 8.610971,46.998963 and 8.620295,46.999681.
So as far as I can see, if you want to implement something like that, you need to give OSRM more input points along the way, spaced less than 60s apart.
See also here for an explanation about the differences between route and match service.
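For illustration, here is a minimal Scala sketch of building such a denser match request; the intermediate coordinate is hypothetical:
import scala.io.Source

object OsrmMatchExample extends App {
  // Hypothetical trace: (lon, lat) points sampled close together,
  // so that consecutive points stay under the 60s split threshold.
  val trace = Seq(
    (8.610048, 46.99917),
    (8.615000, 46.99940), // hypothetical intermediate point
    (8.620048, 46.99917)
  )

  val coords   = trace.map { case (lon, lat) => s"$lon,$lat" }.mkString(";")
  val radiuses = Seq.fill(trace.size)(49).mkString(";")
  val url = s"http://router.project-osrm.org/match/v1/driving/$coords" +
            s"?overview=full&radiuses=$radiuses"

  // Fire the request and print the raw JSON response.
  println(Source.fromURL(url).mkString)
}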

Using distinct on a slice of a StringBuilder

buffer.slice(mouse,highlight).distinct
Now when I perform this, it seems to apply .distinct to the whole string rather than to the selection I made with slice (mouse and highlight are just index positions, and buffer is a StringBuilder). I'm just wondering what the reason for this is.
Your approach is correct; see the code below for clarification.
The slice() function gives you the substring, so your expression first takes the substring and then applies distinct to it.
Here is a step-by-step walkthrough:
val buffer = new StringBuilder
buffer.append("bbbaabbbcccbdbcdbd")
val sl = buffer.slice(2, 10)
The variable sl now contains:
sl = baabbbcc
Now you can apply distinct to sl:
val result = sl.distinct
Finally, your output:
result = bac
This is how your single line of code works.
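Putting it back into your one-liner, with hypothetical values standing in for mouse and highlight:
val buffer = new StringBuilder("bbbaabbbcccbdbcdbd")
val mouse = 2      // hypothetical selection start
val highlight = 10 // hypothetical selection end

// distinct runs on the 8-character slice, not on the whole buffer.
println(buffer.slice(mouse, highlight).distinct) // prints "bac"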

DataFrame keying using the pandas groupby method

I'm new to pandas and trying to learn how to work with it. I'm having a problem when trying to use an example I saw in one of Wes's videos and notebooks on my data. I have a csv file that looks like this:
filePath,vp,score
E:\Audio\7168965711_5601_4.wav,Cust_9709495726,-2
E:\Audio\7168965711_5601_4.wav,Cust_9708568031,-80
E:\Audio\7168965711_5601_4.wav,Cust_9702445777,-2
E:\Audio\7168965711_5601_4.wav,Cust_7023544759,-35
E:\Audio\7168965711_5601_4.wav,Cust_9702229339,-77
E:\Audio\7168965711_5601_4.wav,Cust_9513243289,25
E:\Audio\7168965711_5601_4.wav,Cust_2102513187,18
E:\Audio\7168965711_5601_4.wav,Cust_6625625104,-56
E:\Audio\7168965711_5601_4.wav,Cust_6073165338,-40
E:\Audio\7168965711_5601_4.wav,Cust_5105831247,-30
E:\Audio\7168965711_5601_4.wav,Cust_9513082770,-55
E:\Audio\7168965711_5601_4.wav,Cust_5753907026,-79
E:\Audio\7168965711_5601_4.wav,Cust_7403410322,11
E:\Audio\7168965711_5601_4.wav,Cust_4062144116,-70
I load it into a DataFrame and then group it by "filePath" and "vp"; the code is:
res = df.groupby(['filePath','vp']).size()
res.index
and the output is:
[E:\Audio\7168965711_5601_4.wav Cust_2102513187,
Cust_4062144116, Cust_5105831247,
Cust_5753907026, Cust_6073165338,
Cust_6625625104, Cust_7023544759,
Cust_7403410322, Cust_9513082770,
Cust_9513243289, Cust_9702229339,
Cust_9702445777, Cust_9708568031,
Cust_9709495726]
Now I'm trying to access the index like a dict, as I saw in examples, but when I do
res['Cust_4062144116']
I get an error:
KeyError: 'Cust_4062144116'
I do succeed in getting a result when I use the filepath, but as I understand from previous examples, I should be able to use the vp keys as well, shouldn't I?
Sorry if it's a trivial one; I just can't understand why it works in one example but not in the other.
Rutger, you are not correct. It is possible to partially index a MultiIndex Series; I simply did it the wrong way.
The index's first level is the file name (e.g. E:\Audio\7168965711_5601_4.wav above) and the second level is vp, meaning that for each file name I have multiple vps.
Now, this is correct:
res['E:\Audio\7168965711_5601_4.wav']
and will return:
Cust_2102513187 2
Cust_4062144116 8
....
but trying to index by the inner index (the Cust_ indexes) will fail.
You group by two columns and therefore get a MultiIndex in return. This means you also have to slice using those two columns, not with a single index value.
Your .size() on the groupby object converts it into a Series. If you force it into a DataFrame you can use the .xs method to slice a single level:
res = pd.DataFrame(df.groupby(['filePath','vp']).size())
res.xs('Cust_4062144116', level=1)
That works. If you want to keep it as a Series, boolean indexing can help; something like:
res[res.index.get_level_values(1) == 'Cust_4062144116']
The last option is a bit less readable but sometimes more flexible; for example, you could test for multiple values at once:
res[res.index.get_level_values(1).isin(['Cust_4062144116', 'Cust_6073165338'])]

Sum of DOM elements using XPath

I am using MSXML v3.0 in a VB 6.0 application. The application calculates the sum of the Amount value of all Transaction nodes using a For Each loop, as shown below:
Set subNodes = docXML.selectNodes("//Transaction")
For Each subNode In subNodes
    total = total + Val(subNode.selectSingleNode("Amount").nodeTypedValue)
Next
This loop is taking too much time; sometimes it takes 15-20 minutes for 60 thousand nodes.
I am looking for an XPath/DOM solution to eliminate this loop, something like
docXML.selectNodes("//Transaction").Sum("Amount")
or
docXML.selectNodes("Sum(//Transaction/Amount)")
Any suggestion to get this sum faster is welcome.
using System;
using System.Xml.XPath;

// Open the XML.
XPathDocument docNav = new XPathDocument(@"c:\books.xml");
// Create a navigator to query with XPath.
XPathNavigator nav = docNav.CreateNavigator();
// Find the sum.
// This expression uses standard XPath syntax.
string strExpression = "sum(/bookstore/book/price)";
// Use the Evaluate method to return the evaluated expression.
Console.WriteLine("The price sum of the books is {0}", nav.Evaluate(strExpression));
source: http://support.microsoft.com/kb/308333
Any solution that uses the XPath // pseudo-operator on an XML document with 60000+ nodes is going to be quite slow, because //x causes a complete traversal of the tree starting at the root of the document.
The solution can be sped up significantly if a more exact XPath expression is used that doesn't include the // pseudo-operator.
If you know the structure of the XML document, always use a specific chain of location steps -- never //.
If you provide a small example, showing the specific structure of the document, then many people will be able to provide a faster solution than any solution that uses //.
For example, if it is known that all Transaction elements can be selected using this XPath expression:
/x/y/Transaction
then the evaluation of
sum(/x/y/Transaction/Amount)
is likely to be significantly faster than sum(//Transaction/Amount).
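As an illustration of evaluating such a sum() expression in a single call, here is a minimal sketch using the JVM's javax.xml.xpath API rather than MSXML; the file name and the /x/y structure are assumptions:
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.{XPathConstants, XPathFactory}

object SumAmounts extends App {
  // Parse the document once.
  val doc = DocumentBuilderFactory.newInstance()
    .newDocumentBuilder()
    .parse("transactions.xml") // hypothetical file name

  // Let the XPath engine compute the sum in one evaluation,
  // with no per-node loop in user code.
  val total = XPathFactory.newInstance().newXPath()
    .evaluate("sum(/x/y/Transaction/Amount)", doc, XPathConstants.NUMBER)
    .asInstanceOf[Double]

  println(s"Total Amount: $total")
}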
Update:
The OP has revealed in a comment that the structure of the XML file is quite simple.
Accordingly, I tried the following on an XML document with 60,000 Transaction nodes:
/*/*/Amount
With .NET's XslCompiledTransform (yes, I used XSLT as the host for the XPath engine), producing the sum took 220 ms, i.e. 0.22 seconds.
With MSXML3 it takes 334 seconds.
With MSXML6 it takes 76 seconds -- still quite slow.
Conclusion: This is a bug in MSXML3 -- try to upgrade to another XPath engine, such as the one offered by .NET.