`set_sorted` when a dataframe is sorted on multiple columns - python-polars

I have some panel data in polars. The dataframe is sorted by its id column and then its date column (basically it's a bunch of time series concatenated together).
I've seen that polars has a .set_sorted method for working with expressions. I can of course use pl.col("id").set_sorted(), but I want it to be aware that it's actually sorted by both the id and date columns. In pandas I know the Index has an .is_monotonic_increasing property that is aware of whether all the columns of the Index are sorted, but is there a way to do something similar with polars?
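For reference, here's roughly what I mean, with made-up toy data (the pandas check I'm used to, and the single-column polars call I've tried):
import pandas as pd
import polars as pl

# Toy panel data, already sorted by id and then date (dates simplified to ints).
pdf = pd.DataFrame({"id": [1, 1, 2, 2], "date": [1, 2, 1, 2], "value": [4.0, 3.0, 2.0, 1.0]})
# pandas: the MultiIndex knows whether the index as a whole is sorted.
print(pdf.set_index(["id", "date"]).index.is_monotonic_increasing)  # True

# polars: I can flag a single column as sorted, but that says nothing about (id, date) jointly.
df = pl.DataFrame({"id": [1, 1, 2, 2], "date": [1, 2, 1, 2]}).with_columns(pl.col("id").set_sorted())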

Have you tried
df.get_column('id').is_sorted()
and
df.get_column('date').is_sorted()
to see if they're each already known to be sorted?
For instance if I do:
df=pl.DataFrame({'a':[1,1,2,2], 'b':[1,2,3,4]})
df.get_column('a').is_sorted()
df.get_column('b').is_sorted()
Then I get True twice, even though I never told it that the columns are sorted.
In general, I don't think you want to be manually setting columns as sorted. Just sort them and it'll keep track of the fact that they're sorted.
If you do:
df=pl.DataFrame({'a':[1,2,1,2], 'b':[1,3,2,4]})
df.get_column('a').is_sorted()
df.get_column('b').is_sorted()
then you get False twice, as you'd hope. If you then do df = df.sort(['a','b']) and check the sortedness of a and b again, you'll see that it knows they're sorted.
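Here's that last point as a small runnable sketch. The flags attribute is, as far as I know, how recent polars versions expose the per-column sortedness flags that sort() maintains; check your version's docs:
import polars as pl

df = pl.DataFrame({'a': [1, 2, 1, 2], 'b': [1, 3, 2, 4]})
print(df.get_column('a').is_sorted())  # False
print(df.get_column('b').is_sorted())  # False

df = df.sort(['a', 'b'])
print(df.get_column('a').is_sorted())  # True
print(df.get_column('b').is_sorted())  # True

# Assumed to exist on recent polars versions: the sortedness flags that sort() maintains.
print(df.get_column('a').flags)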

Related

PostgreSQL - Compare ts_vector fields

I have two tables in which I have data coming from two different sources. One of the field of each table contains the title of a movie, but for some reason out of my control, the titles are not always exactly the same.
So I use the ts_vector to get rid of all the minor differences (stop words, plurals and so on).
See an example here: http://sqlfiddle.com/#!17/5ccbc/3
My problem is how to compare the two ts_vectors without taking into account the numeric values, just the text content. If I compare the two fields directly, I only get exact matches between values, including the position of each word. The only solution I have found is to use the strip() function, which removes positions and weights from a tsvector, leaving only the text content.
I was wondering if there is a faster way to compare ts_vectors.
You could create an index on the stripped vector:
create index on tbl1 (strip(ts_title));
create index on tbl2 (strip(ts_title));
But given that your query has to fetch every row of each table, it is unlikely this would serve much of a point. Doing a merge join between the precomputed stripped vectors could be faster, but probably not once you include the overhead of building and maintaining the indexes. If the real WHERE clause is more restrictive (selecting only a few rows from one or the other of the tables) then please share the real query.

Transpose data using Talend

I have this kind of data:
I need to transpose this data into something like this using Talend:
Help would be much appreciated.
dbh's suggestion should indeed work, but I did not try it.
However, I have another solution which doesn't require changing the input format and is not too complicated to implement. Indeed, the job has only 2 transformation components (tDenormalize and tMap).
The job looks like the following:
Explanation:
Your input is read from a CSV file (could be a database or any other kind of input)
The tDenormalize component will denormalize your value column (column 2), grouped by the id column (column 1), separating the fields with a specific delimiter (";" in my case), resulting in 2 rows as shown.
tMap: split the aggregated column back into multiple columns using Java's String.split() method, spreading the resulting array across multiple columns (a rough sketch of this two-step logic, in plain code, follows the notes below). The tMap should look like this:
Since Talend doesn't let you store Array objects directly, make sure to store the split String as an Object. Then cast that object to an Array on the right side of the map.
That approach should give you the expected result.
IMPORTANT:
tDenormalize might shuffle the rows, meaning that for bigger inputs you might get unsorted output. Make sure to sort it if needed, or use tDenormalizeSortedRow instead.
tDenormalize behaves like an aggregation component, meaning it scans the whole input before processing, which can cause performance issues with particularly big inputs (tens of millions of records).
Your input is probably wrong (you have 5 entries with 1 as id, and 6 entries with 2 as id). 6 columns are expected, meaning you should always have 6 lines per id. If not, then you should implement dbh's solution, and you probably HAVE TO add a key column.
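Not Talend code, but to make the two-step logic concrete, here is a rough Python sketch of the same denormalize-then-split transformation; the sample rows, the column roles, and the ";" delimiter are only illustrative assumptions:
# Rough illustration of what tDenormalize + tMap do; not actual Talend code.
from collections import defaultdict

rows = [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'd'), (2, 'e'), (2, 'f')]  # (id, value)

# Step 1 (tDenormalize): aggregate the value column per id, joined with ";".
grouped = defaultdict(list)
for row_id, value in rows:
    grouped[row_id].append(value)
denormalized = {row_id: ';'.join(values) for row_id, values in grouped.items()}

# Step 2 (tMap): split the aggregated string back into separate columns.
transposed = {row_id: joined.split(';') for row_id, joined in denormalized.items()}
print(transposed)  # {1: ['a', 'b', 'c'], 2: ['d', 'e', 'f']}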
You can use Talend's tPivotToColumnsDelimited component to achieve this. You will most likely need an additional column in your data to represent the field name,
like "identifier, field name, value".
Then you can use this component to pivot the data and write a file as output. If you need to process the data further, read the resulting file with tFileInputDelimited.
See docs and an example at
https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide521EN/13.43+tPivotToColumnsDelimited

Is there any way for Access 2016 to sort the numbers that are part of a "text" data type formatted field as though they are numeric values?

I am working on a database that (hopefully) will end up using a primary key with both numbers and letters in the values to track lots of agricultural product. Due to the way the weighing of product takes place at more than one facility, I have no option but to keep the same base number and add letters to it to denote split portions of each lot of product. The problem is, after I create record number 99, the number 100 suddenly floats up underneath 10. This makes it difficult to maintain consistency and forces me to replace the alphanumeric lot ID with a strictly numeric value in order to keep it sorted (for which I use "AutoNumber" as the data type). Either way, I need the alphanumeric lot ID, so having 2 IDs for the same lot can be confusing for anyone entering values into the form. Is there a way around this that I am just not seeing?
If you're using a query as the data source, then you may try to sort by the string converted to a number, something like
SELECT id, field1, field2, ..
FROM YourTable
ORDER BY CLng(YourAlphaNumericField)
Edit: you may also try the Val function instead of CLng - it should not fail on non-numeric input.
Why not format your key properly before saving? E.g. "0000099". You will avoid a costly conversion later.
Alternatively, you could use 2 fields as the composite PK. One with the Number (as Long) and one with the Location (as String).
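Not Access syntax, but here is a quick Python illustration of why plain text sorting misplaces 100 under 10, and how both suggestions above (converting to a number in the ORDER BY, or zero-padding the stored key) restore the expected order; the sample IDs are made up:
from itertools import takewhile

ids = ['9', '10', '99', '100', '100A']

# Sorted as text, 100 lands right under 10 and before 9:
print(sorted(ids))  # ['10', '100', '100A', '9', '99']

def numeric_prefix(s):
    # Leading digits of the lot ID, or 0 if there are none.
    return int(''.join(takewhile(str.isdigit, s)) or 0)

# Fix 1: sort by the numeric part (what CLng/Val do in the ORDER BY), then by the suffix.
print(sorted(ids, key=lambda s: (numeric_prefix(s), s)))  # ['9', '10', '99', '100', '100A']

# Fix 2: zero-pad the base number when saving, so plain text order is already correct.
padded = [f"{numeric_prefix(s):07d}" + s[len(str(numeric_prefix(s))):] for s in ids]
print(sorted(padded))  # ['0000009', '0000010', '0000099', '0000100', '0000100A']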

100 columns vs Array of length 100

I have a table with 100+ values corresponding to each row, so I'm exploring different ways to store them.
Without any indexes, would I lose anything if I store these 100 values in an integer[] column in postgresql? As compared to storing them in separate columns.
Plus, since we can add indexes to array elements,
CREATE INDEX test_index on test ((foo[1]));
Would there be a performance difference between queries using such an index and queries using a regular index on a column?
As far as I've read, this performance difference comes into the picture with arrays of variable-length elements, but I'm not sure about fixed-length ones.
Don't go for the lazy way.
If you need to store 100 or more values as an array, that is OK, as long as treating them as an array makes sense for your application and your data.
If you need to query for a specific element of the array, then this design is not good, regardless of performance, and you must use columns. That will also help the moment you have to delete a "column" in the middle or redesign it.
Anyway, as Frank wrote in the comments, if the values are all the same type, consider modeling them as another table (if the meaning is also the same).

Deleting multiple rows from all fields of a given structure array using Matlab

I'm working on a structure with 12 fields, each consisting of a 1350x1 double.
I want to delete all rows of all fields, given a condition on one of those fields, and come up with a new structure, still with 12 fields but only 1150 rows.
I don't mind the change of index, since one of these fields is a "Time" field to index the others.
To clarify, my structure is called sens, and the fields are sens(1).Time, sens(1).Temp, sens(1).Humidity and so on.
I want to find a value in sens(1).Time, define it as a threshold value, and delete all rows smaller than this value, in all fields.
I know it's possible by going through all the field names, but I've been looking for a faster way, since the names of the fields and/or the structure vary.
Thank you.
Check out MATLAB's structfun function; that should solve your problem.
The documentation has some examples that can help you with what you want, but in short it should look something like this:
% A is your struct
% cond is a logical or linear index of the rows you want to keep,
% e.g. cond = A.Time >= threshold;   % keep only rows at or above the threshold
A_clean = structfun(@(x) x(cond), A, 'UniformOutput', false);