groupBy toList element order - scala

I have a RichPipe with several fields, let's say:
'sex
'weight
'age
I need to group by 'sex and then get a list of tuples ('weight and 'age). I then want to do a scanLeft operation on the list for each group and get a pipe with 'sex and 'result. I currently do this by doing
pipe.groupBy('sex) { _.toList[Int]('weight -> 'weights).toList[Int]('age -> 'ages) }
and then zipping the two lists together. I'm not sure this is the best approach, and I'm also not sure whether the order of the values in the two lists is the same; if it isn't, zipping them would pair values incorrectly. I found nothing about this in the documentation.
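For concreteness, the zip workaround would look roughly like this (a sketch; the Int field types and the 'pairs field name are assumptions):

pipe
  .groupBy('sex) { _.toList[Int]('weight -> 'weights).toList[Int]('age -> 'ages) }
  // Zip the two per-group lists back into a single list of (weight, age) pairs.
  .map(('weights, 'ages) -> 'pairs) { wa: (List[Int], List[Int]) => wa._1 zip wa._2 }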

Ok, so it looks like I've answered my own question.
You can simply do
pipe.groupBy('sex) {_.toList[(Int, Int)](('weight, 'age) -> 'list)}
which results in a list of tuples. Would've saved me a lot of time if the Fields API Reference mentioned this.
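For completeness, a sketch of the full pipeline including the scanLeft step (the running computation and the 'list/'result field names are placeholders, not from the original question):

pipe
  .groupBy('sex) { _.toList[(Int, Int)](('weight, 'age) -> 'list) }
  // Run the per-group scanLeft over the collected (weight, age) pairs.
  .map('list -> 'result) { list: List[(Int, Int)] =>
    list.scanLeft(0) { case (acc, (weight, age)) => acc + weight * age }
  }
  .project('sex, 'result)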

Related

Using dplyr correctly to combine shared values of a row to a new column of a table

How do I combine data from two tables based on certain shared values in a row?
I already tried using the which() function and it didn't work.
I think you will have the best luck using the dplyr package. Specifically, you can use right_join(). You can write it like this: right_join(df1, df2, by = "specification")
This will combine the columns from df2 with df1, matching rows according to the shared specification column.
For future reference, it would help a lot if you included a screenshot of your code, just so it is easier to know exactly what you are asking.
Anyway, let me know if this answers your question!
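For reference, a minimal sketch with made-up data (the shared column name "specification" comes from the answer above; the values are invented):

library(dplyr)

df1 <- data.frame(specification = c("a", "b", "c"), x = 1:3)
df2 <- data.frame(specification = c("b", "c", "d"), y = 4:6)

# Keeps every row of df2 and attaches the matching columns from df1;
# rows of df2 with no match in df1 get NA.
right_join(df1, df2, by = "specification")
#>   specification  x y
#> 1             b  2 4
#> 2             c  3 5
#> 3             d NA 6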

`set_sorted` when a dataframe is sorted on multiple columns

I have some panel data in polars. The dataframe is sorted by its id column and then its date column (basically it's a bunch of time series concatenated together).
I've seen that polars has a .set_sorted method for expressions. I can of course set pl.col("id").set_sorted(), but I want it to be aware that the frame is actually sorted by both the id and date columns. In pandas, I know the Index has an .is_monotonic_increasing property that is aware of whether all the columns of the Index are sorted; is there a way to do something similar with polars?
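For reference, the single-column flag mentioned above can be set like this (a sketch; the flags output shown is an assumption based on recent polars versions):

import polars as pl

df = pl.DataFrame({"id": [1, 1, 2], "date": [1, 2, 1]})
# Mark only the id column as sorted:
df = df.with_columns(pl.col("id").set_sorted())
print(df.get_column("id").flags)  # {'SORTED_ASC': True, 'SORTED_DESC': False}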
Have you tried
df.get_column('id').is_sorted()
and
df.get_column('date').is_sorted()
to see if they're each already known to be sorted?
For instance, if I do:
df = pl.DataFrame({'a': [1, 1, 2, 2], 'b': [1, 2, 3, 4]})
df.get_column('a').is_sorted()
df.get_column('b').is_sorted()
then I get two Trues, even though I never told it that the columns are sorted.
In general, I don't think you want to be manually setting columns as sorted. Just sort them and it'll keep track of the fact that they're sorted.
If you do:
df = pl.DataFrame({'a': [1, 2, 1, 2], 'b': [1, 3, 2, 4]})
df.get_column('a').is_sorted()
df.get_column('b').is_sorted()
then you get False twice, as you'd hope. If you then do df = df.sort(['a', 'b']) and check the sortedness of a and b again, you'll see that it knows they're sorted.
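Putting that together as a runnable check (a sketch of the behavior described above):

import polars as pl

# Unsorted columns: polars correctly reports False for both.
df = pl.DataFrame({'a': [1, 2, 1, 2], 'b': [1, 3, 2, 4]})
print(df.get_column('a').is_sorted())  # False
print(df.get_column('b').is_sorted())  # False

# After sorting, no manual set_sorted is needed:
df = df.sort(['a', 'b'])
print(df.get_column('a').is_sorted())  # True
print(df.get_column('b').is_sorted())  # True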

Remove rows from search expression solr

I'm trying to search for the items whose attribute matches the given function below in my large dataset, but I'm facing a problem.
The rows parameter only selects the first 300 documents, which the function then filters for matching results, but I want to search the whole index, not just the first few. How can I rewrite this to achieve that?
having(
  select(
    search(myIndex, q="*:*", fl="*", rows=300),
    id,
    dotProduct(ATTRIBUTE, array(4,5,2)) as prod,
    l1norm(array(1,2,3)) as a,
    l1norm(ATTRIBUTE) as b,
    div(prod, add(a, sub(b, prod))) as c
  ),
  and(gteq(c, 5), lteq(c, 8)))
The simplest option would be to increase rows to cover the total number of entries in the index.
However, if this number is huge, you should probably use the /export request handler instead of a regular select-like handler.
The /export request handler allows a fully sorted result set to be streamed out of Solr using a special rank query parser and response writer. These have been specifically designed to work together to handle scenarios that involve sorting and exporting millions of records.
Depending on your needs, you could also issue multiple paginated queries using the start and rows parameters, or, if the number of entries is not known to the client code, use a cursorMark.
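For example, the original expression pointed at the /export handler would look roughly like this (a sketch; note that /export requires an explicit sort and that every field listed in fl has docValues enabled):

having(
  select(
    search(myIndex, q="*:*", fl="id,ATTRIBUTE", sort="id asc", qt="/export"),
    id,
    dotProduct(ATTRIBUTE, array(4,5,2)) as prod,
    l1norm(array(1,2,3)) as a,
    l1norm(ATTRIBUTE) as b,
    div(prod, add(a, sub(b, prod))) as c
  ),
  and(gteq(c, 5), lteq(c, 8)))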

How do you count both filters through RDD operations?

I have two RDDs. One looks like this:
increase
rose
die
bear
contracted
own
eyes
lights
making
Then I count the words in the first RDD:
(float,2)
(agree,20)
(healing,2)
(shot,45)
(guide,24)
(opening,11)
(urging,9)
(practises,1)
(surge,9)
(maintained,2)
I have another RDD, which is a dictionary of different forms of verbs, like this:
abash,abash,abashed,abashed,abashes,abashing
abate,abate,abated,abated,abates,abating
abide,abide,abode,abode,abides,abiding
absorb,absorb,absorbed,absorbed,absorbs,absorbing
accept,accept,accepted,accepted,accepts,accepting
accompany,accompany,accompanied,accompanied,accompanies,accompanying
ache,ache,ached,ached,aches,aching
achieve,achieve,achieved,achieved,achieves,achieving
Now I need to count the words in the first RDD and merge the counts of words that are different forms of the same verb, according to the second RDD. E.g. (work, 100), (works, 50), (working, 150) -> (work, 300).
I tried counting the first RDD and then figuring out which element of the first RDD belongs to which entry in the second RDD, but I don't know how to do this part through RDD operations.
Is this homework or something? The same question (targeting the same task) has been asked and answered here.
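For reference, a minimal sketch of one standard approach in Scala Spark (the RDD names and the shape of the dictionary parsing are assumptions, not taken from the linked answer):

import org.apache.spark.rdd.RDD

// counts: (word, count) pairs, e.g. ("works", 50)
// dict: lines like "abash,abash,abashed,abashed,abashes,abashing"
def mergeVerbForms(counts: RDD[(String, Int)], dict: RDD[String]): RDD[(String, Int)] = {
  // Map every form of a verb to its base form: ("abashing", "abash"), ...
  val formToBase: RDD[(String, String)] = dict.flatMap { line =>
    val forms = line.split(",")
    forms.map(form => (form, forms.head))
  }.distinct()

  counts
    .leftOuterJoin(formToBase)  // (word, (count, Option[base]))
    .map { case (word, (count, base)) => (base.getOrElse(word), count) }
    .reduceByKey(_ + _)         // sum the counts per base form
}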

Merge vertex list in gremlin orientDb

I am a newbie in the graph database world. I wrote a query to get the leaves of a tree, and I also have a list of IDs. I want to merge both lists of vertices, remove duplicates, and sum a property over the combined set. I cannot merge the first two sets of vertices:
g.V().hasLabel('Group').has('GroupId','G001').
  repeat(outE().inV()).emit().hasLabel('User').as('UsersList1').
  V().has('UserId', within('001','002')).as('UsersList2').
  select('UsersList1','UsersList2').dedup().values('petitions').sum().unfold()
There are several things wrong with your query:
you call V().has('UserId', within('001','002')) for every user found by the first part of the traversal
the traversal could emit more than just the leaves
select('UsersList1','UsersList2') creates pairs of users
values('petitions') tries to access the property petitions of each pair, which will always fail
The correct approach would be:
g.V().has('User', 'UserId', within('001','002')).fold().
  union(unfold(),
        V().has('Group','GroupId','G001').
          repeat(out()).until(hasLabel('User'))).
  dedup().
  values('petitions').sum()
I didn't test it, but I think the following will do:
g.V().union(
    hasLabel('Group').has('GroupId','G001').
      repeat(outE().inV()).until(hasLabel('User')),
    has('UserId', within('001','002'))).
  dedup().values('petitions').sum()
In order to get only the tree leaves, it is better to use until. Using emit will output all inner tree nodes as well.
union merges the two inner traversals.