Is there a way to know if a column of a Polars Lazy/DataFrame is set as sorted in the python api? - python-polars

You can use df.with_columns(pl.col('A').set_sorted()) to tell polars that column A is sorted.
I assume that internally there is some metadata associated with this column. Is there a way to read it?
I see that polars algorithms are much faster if dataframes are sorted, and sometimes I want to be sure that I am taking these fast paths.
Proposal: Is it possible to have a Lazy/DataFrame attribute like metadata that would store this type of information?
Something like that:
df.metadata
{'A' : {'is_sorted' : True}}

This information is stored on the Series.
>>> s = pl.Series([1, 3, 2]).sort()
>>> s.flags
{'SORTED_ASC': True, 'SORTED_DESC': False}
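For a DataFrame you can pull the column out as a Series and inspect its flags the same way. A minimal sketch (the column name 'A' is just an example, and the exact contents of the flags dict may vary by polars version):
>>> df = pl.DataFrame({'A': [1, 2, 3]}).with_columns(pl.col('A').set_sorted())
>>> df.get_column('A').flags
{'SORTED_ASC': True, 'SORTED_DESC': False}
For a LazyFrame there is, as far as I know, no public way to inspect this without materializing, so you would have to collect() first and then check the flags on the resulting Series.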

Related

unnest string array in druid in ingestion phase for better rollup

I am trying to define a druid ingestion spec for the following case.
I have a field of type string array and I want it to be unnested and rolled up by druid during the ingestion. For example, if I have the following two entries in the raw data:
["a","b","c"]
["a", "c"]
In the rollup table I would like to see three entries:
"a" 2
"b" 1
"c" 2
If I just define this column as a dimension, the array is kept as is and the individual values are not extracted. I've looked into possible solutions with transformSpec and expressions, but with no luck.
I know how to use GROUP BY at query time to get what I need, but I'd like to have this functionality at ingestion time. Is there some way to define it in the dataSchema?
Thank you.

What is the best way to store column oriented table in MongoDB for optimal query of data

I have a large table where the columns are user_id, user_feature_1, user_feature_2, ...., user_feature_n
So each row corresponds to a user and his or her features.
I stored this table in MongoDB by storing each column's values as an array, e.g.
{
    'name': 'user_feature_1',
    'values': [
        15,
        10,
        ...
    ]
}
I am using Meteor to pull data from MongoDB, and this way of storage facilitates fast and easy retrieval of the whole column's values for graph plotting.
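(For what it's worth, this layout also allows fetching only part of the array via MongoDB's $slice projection. A rough pymongo sketch, assuming the column documents live in a collection called columns and a database called mydb, both hypothetical names:)
from pymongo import MongoClient

client = MongoClient()
db = client['mydb']  # hypothetical database name

# Fetch only the first 1000 values of the 'user_feature_1' column document.
doc = db.columns.find_one(
    {'name': 'user_feature_1'},
    {'values': {'$slice': 1000}},
)
first_1000 = doc['values'] if doc else []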
However, this way of storing has a major drawback: I can't store arrays larger than 16 MB.
There are a couple of possible solutions, but none of them seems good enough:
Store each column's values using GridFS. I am not sure if Meteor supports GridFS, and it lacks support for slicing of the data, i.e., I may need to get just the top 1000 values of a column.
Store the table in row oriented format. E.g.
{
    'user_id': 1,
    'user_feature_1': 10,
    'user_feature_2': 0.9,
    ....
    'user_feature_n': 42
}
But I think this way of storing data is inefficient for querying a feature column's values.
Or is MongoDB not suitable at all and SQL is the way to go? But Meteor does not support SQL.
Update 1:
I found this interesting article which explains why storing large arrays in MongoDB is inefficient: https://www.mongosoup.de/blog-entry/Storing-Large-Lists-In-MongoDB.html
The following explanation is from http://bsonspec.org/spec.html:
Array - The document for an array is a normal BSON document with integer values for the keys, starting with 0 and continuing sequentially. For example, the array ['red', 'blue'] would be encoded as the document {'0': 'red', '1': 'blue'}. The keys must be in ascending numerical order.
This means that we can store at most about 1 million values in a document, if the values and keys are of float type (16 MB / 128 bits ≈ 1 million entries).
There is also a third option. A separate document for each user and feature:
{ u:"1", f:"user_feature_1", v:10 },
{ u:"1", f:"user_feature_2", v:11 },
{ u:"1", f:"user_feature_3", v:52 },
{ u:"2", f:"user_feature_1", v:4 },
{ u:"2", f:"user_feature_2", v:13 },
{ u:"2", f:"user_feature_3", v:12 },
You will have no document growth problems and you can query both "all values for user x" and "all values for feature x" without also accessing any unrelated data.
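A rough pymongo sketch of how you might index and query that layout (the collection name features and the database name are assumptions):
from pymongo import ASCENDING, MongoClient

client = MongoClient()
db = client['mydb']  # hypothetical database name

# One index per access pattern.
db.features.create_index([('u', ASCENDING), ('f', ASCENDING)])
db.features.create_index([('f', ASCENDING)])

# All values for user "1".
user_docs = list(db.features.find({'u': '1'}))

# All values for feature "user_feature_1", without unrelated fields.
feature_docs = list(db.features.find({'f': 'user_feature_1'}, {'u': 1, 'v': 1, '_id': 0}))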
16 MB / 64-bit float = 2,000,000 uncompressed data points. What kind of graph requires a minimum of 2 million points per column? Instead try:
Saving a picture on an s3 server
Using a map-reduce solution like hadoop (probably your best bet)
Reducing numbers to small ints if they're currently floats
Computing the data on the fly, on the client (preferred, if possible)
Using a compression algo so you can save a subset & interpolate the rest
That said, a document-based DB would outperform a SQL DB in this use case because a SQL DB would do exactly as Philipp suggested. Either way, you cannot send multiple 16 MB files to a client; if the client doesn't leave you over the poor UX, you'll go broke on server costs :-).

How can I return an array of mongodb objects in pymongo (without a cursor)? Can MapReduce do this?

I have a db set up in mongo that I'm accessing with pymongo.
I'd like to be able to pull a small set of fields into a list of dictionaries. So, something like what I get in the mongo shell when I type...
db.find({},{"variable1_of_interest":1, "variable2_of_interest":1}).limit(2).pretty()
I'd like a python statement like:
x = db.find({},{"variable1_of_interest":1, "variable2_of_interest":1})
where x is an array structure of some kind rather than a cursor; that is, instead of iterating like:
data = []
x = db.find({},{"variable1_of_interest":1, "variable2_of_interest":1})
for i in x:
data.append(i)
Is it possible that I could use MapReduce to bring this into a one-liner? Something like
db.find({},{"variable1_of_interest":1, "variable2_of_interest":1}).map_reduce(mapper, reducer, "data")
I intend to output this dataset to R for some analysis, but I'd like to concentrate the IO in Python.
You don't need to call mapReduce; you can just turn the cursor into a list like so:
>>> data = list(col.find({},{"a":1,"b":1,"_id":0}).limit(2))
>>> data
[{u'a': 1.0, u'b': 2.0}, {u'a': 2.0, u'b': 3.0}]
where col is your db.collection object.
But be cautious with large/huge results, because everything is loaded into memory.
What you can do is call mapReduce in pymongo and pass it the find query as an argument; it could look like this:
db.yourcollection.map_reduce(map_function, reduce_function, "data", query={})
As for the projections, I think you would need to do them in the map/reduce functions, since query only specifies the selection criteria, as the MongoDB documentation says.
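A slightly fuller sketch of that call, for reference; this uses the legacy Collection.map_reduce helper (available in PyMongo 3.x, removed in PyMongo 4), and the map/reduce bodies are placeholders:
from bson.code import Code
from pymongo import MongoClient

client = MongoClient()
db = client['mydb']  # hypothetical database name

# Emit only the fields of interest; the trivial reducer just passes values through.
mapper = Code("""
    function () {
        emit(this._id, {variable1_of_interest: this.variable1_of_interest,
                        variable2_of_interest: this.variable2_of_interest});
    }
""")
reducer = Code("function (key, values) { return values[0]; }")

# Writes the results into a collection called 'data'; query={} selects everything.
db.yourcollection.map_reduce(mapper, reducer, "data", query={})
results = list(db.data.find())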
Building off of Asya's answer:
If you wanted a list of just one value from each entry, as opposed to a list of objects, using a list comprehension worked for me.
I.e. if each object represents a user and the database stores their email, and say you wanted the emails of all the users who are 15 years old:
user_emails = [user['email'] for user in db.people.find( {'age' : 15} )]
More here

Should I use sparse index for boolean flags in mongodb?

I have a boolean flag :finished. Should I
A: index({ finished: 1 })
B: index({ finished: 1 }, {sparse: true})
C: use flag :unfinished instead, to query by that
D: other?
Ruby mongoid syntax. Most of my records will have the flag finished=true, and most operations fetch the unfinished ones, obviously. I'm not sure if I understand when to use sparse and when not to. Thanks!
The sparse flag is a little weird. To understand when to use it, you have to understand why "sparse" exists in the first place.
When you create a simple index on one field, there is an entry for each document, even documents that don't have that field.
For example, if you have an index on {rarely_set_field : 1}, you will have an index that is filled mostly with null because that field doesn't exist in most cases. This is a waste of space and it's inefficient to search.
The {sparse: true} option will get rid of the null values, so you get an index that only contains entries where rarely_set_field is defined.
Back to your case.
You are asking about using a boolean + sparse. But sparse doesn't really affect "boolean"; sparse affects "is set vs. is not set".
In your case, you are trying to fetch unfinished. To leverage sparse, the key is not the boolean value, but the fact that unfinished entries have that key and that "finished" entries have no key at all.
{ _id: 1, data: {...}, unfinished: true }
{ _id: 2, data: {...} } // this entry is finished
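A small pymongo sketch of that pattern (the collection name tasks and the database name are assumptions):
from pymongo import ASCENDING, MongoClient

client = MongoClient()
db = client['mydb']  # hypothetical database name

# Sparse index: only documents that actually have the 'unfinished' key get an entry.
db.tasks.create_index([('unfinished', ASCENDING)], sparse=True)

# A new task carries the key...
result = db.tasks.insert_one({'data': '...', 'unfinished': True})

# ...and finishing it removes the key entirely, rather than setting it to False.
db.tasks.update_one({'_id': result.inserted_id}, {'$unset': {'unfinished': ''}})

# This query can be served from the small sparse index.
pending = list(db.tasks.find({'unfinished': True}))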
It sounds like you are using a Queue
You can definitely leverage the information above to implement a sparse index. However, it actually sounds like you are using a Queue. MongoDB is serviceable as a Queue; here are two examples.
However, if you look at those Queue implementations, they are not doing it the way you are doing it. I'm personally using MongoDB as a Queue for some production systems and it runs pretty well, but test your expected load, as a dedicated Queue will perform much better.
Sparse is only helpful if the field is not set at all, not if it is false. When you say "most will have finished=true", I'm guessing that finished is set on most documents, making sparse not very beneficial.
And since most documents share a single value, I doubt an index on this field alone would help much if your other query criteria are already specific enough.

MongoDB - forcing stored value to uppercase and searching

in SQL world I could do something to the effect of:
SELECT name FROM table WHERE UPPER(name) = UPPER('Smith');
and this would match a search for "Smith", "SMITH", "SmiTH", etc... because it forces the query and the value to be the same case.
However, MongoDB doesn't seem to have this capability without using a RegEx, which won't use indexes and would be slow for a large amount of data.
Is there a way to convert a stored value to a particular case before doing a search against it in MongoDB?
I've come across the $toUpper aggregate, but I can't figure out how that would be used in this particular case.
If there's no way to convert stored values before searching, is it possible to have MongoDB convert a value when it's created in Mongo? So when I add a document to the collection, it would force the "name" attribute to a particular case? Something like a callback in the Rails world.
It looks like there's also the ability to create stored JS for MongoDB, similar to a stored procedure. Would that be a feasible solution?
Mostly looking for a push in the right direction; I can figure out the particular code once I know what I'm looking for, but so far I'm not even sure if my desired functionality is doable.
You have to normalize your data before storing them. There is no support for performing normalization as part of a query at runtime.
The simplest thing to do is probably to save both a case-normalized (i.e. all-uppercase) and display version of the field you want to search by. Suppose you are storing users and want to do a case-insensitive search on last name. You might store:
{
    _id: ObjectId(...),
    first_name: "Dan",
    last_name: "Crosta",
    last_name_upper: "CROSTA"
}
You can then create an index on last_name_upper, and query like:
> db.users.find({last_name_upper: "CROSTA"})
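A minimal pymongo sketch of the same idea, with the normalization done in application code before the insert (the collection and database names are assumptions):
from pymongo import ASCENDING, MongoClient

client = MongoClient()
db = client['mydb']  # hypothetical database name

db.users.create_index([('last_name_upper', ASCENDING)])

def save_user(first_name, last_name):
    # Store both the display version and the case-normalized version.
    db.users.insert_one({
        'first_name': first_name,
        'last_name': last_name,
        'last_name_upper': last_name.upper(),
    })

save_user('Dan', 'Crosta')
# Normalize the search term the same way, so the index on last_name_upper can be used.
matches = list(db.users.find({'last_name_upper': 'smith'.upper()}))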