Excel: Select the newest date from a list that contains multiple rows with the same ID - date

In Excel, I have a list with multiple rows of the same ID (column A), each with various dates recorded (Column B). I need to extract one row for each ID that contains the newest date. See below for example:
| Column A (ID) | Column B (Date) |                     |
|---------------|-----------------|---------------------|
| 00001         | 01/01/2022      |                     |
| 00001         | 02/01/2022      |                     |
| 00001         | 03/01/2022      | <-- I need this one |
| 00002         | 01/02/2022      |                     |
| 00002         | 02/02/2022      |                     |
| 00002         | 03/02/2022      | <-- I need this one |
| 00003         | 01/03/2022      |                     |
| 00003         | 02/03/2022      |                     |
| 00003         | 03/03/2022      | <-- I need this one |
| 00004         | 01/04/2022      |                     |
| 00004         | 02/04/2022      |                     |
| 00004         | 03/04/2022      | <-- I need this one |
| 00005         | 01/05/2022      |                     |
| 00005         | 02/05/2022      |                     |
| 00005         | 03/05/2022      | <-- I need this one |
I need to extract the above rows, where the row with the newest date is extracted for each unique ID. It needs to look like this:
| Column A (ID) | Column B (Date) |
|---------------|-----------------|
| 00001         | 03/01/2022      |
| 00002         | 03/02/2022      |
| 00003         | 03/03/2022      |
| 00004         | 03/04/2022      |
| 00005         | 03/05/2022      |
I'm totally stumped and I can't seem to find the right answer (probably because of how I'm wording the question!)
Thank you!
Google searches for the answer - no joy. I don't know where to start in Excel with this; I thought perhaps DISTINCT or similar...

Assuming you have an Office 365-compatible version of Excel, you could do something like this (screenshot refers):
=INDEX(SORTBY(A2:B11,B2#,-1),SEQUENCE(1,1,1,1),SEQUENCE(1,2,1,1))
This formula is somewhat superfluous, albeit convenient - you don't really need the first SEQUENCE (only one row is being returned). However, as you can see in the screenshot, the self-same formula with a leading 2 in the first argument of that SEQUENCE returns the top two dates (in descending order), and so forth.
For those with Office 365, you could also do something like this:
=LARGE(B2#+(ROW(B2#)-ROW(B2))/1000,1)
i.e. adding a "little bit" to the dates that we can strip off later and use as a unique reference (the row number in the original, unsorted list).
As mentioned, reverse-engineer it, throw it into an INDEX, and voila:
=INDEX(A2:A11,ROUND((H2-ROUND(H2,0))*1000,6))
Caveats:
the ROUND(<>,6) is purely there to eliminate Excel's irritating lack-of-precision issues.
this can also work if you're looking up text strings (i.e. attempting to sort alphabetically), except LARGE doesn't work with strings (no problem, just use UNICODE - but good luck with expanding out the string etc. ☺ with MID(<>,ROW(A1:OFFSET(A1,LEN(<>)-1)),1)...).
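Alternatively, here is a minimal sketch of the per-ID extraction using Office 365's LET/BYROW/FILTER functions, assuming the IDs are in A2:A16 and the dates in B2:B16 (adjust the ranges to your data):
=LET(ids, UNIQUE(A2:A16),
     latest, BYROW(ids, LAMBDA(id, MAX(FILTER(B2:B16, A2:A16=id)))),
     HSTACK(ids, latest))
UNIQUE returns one row per ID, BYROW finds the newest date for each ID, and HSTACK spills the two columns side by side. Format the second spilled column as a date, since MAX returns the underlying date serial number.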

Related

create JSONB array grouped from column values with incrementing integers

For a PostgreSQL table, suppose the following data is in table A:
key_path        | key     | value
----------------+---------+------------
foo[1]__scrog   | scrog   | apple
foo[2]__scrog   | scrog   | orange
bar             | bar     | peach
baz[1]__biscuit | biscuit | watermelon
The goal is to group data when there is an incrementing number present for an otherwise identical value for column key_path.
For context, key_path is a JSON key path and key is the leaf key. The desired outcome would be:
key_path_group                 | key     | values
-------------------------------+---------+-----------------
[foo[1]__scrog, foo[2]__scrog] | scrog   | [apple, orange]
bar                            | bar     | peach
[baz[1]__biscuit]              | biscuit | [watermelon]
Also noting that for key_path=baz[1]__biscuit even though there is only a single incrementing value, it still triggers casting to an array of length 1.
Any tips or suggestions much appreciated!
May have answered my own question (sometimes just typing it out helps). The following gets very close to, if not exactly, what I'm looking for:
select
    regexp_replace(key_path, '(.*)\[(\d+)\](.*)', '\1[x]\3') as key_path_group,
    key,
    jsonb_agg(value) as values
from A
group by key_path_group, key;
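If you also want the matching key_paths collected into an array (as in the desired output above), a sketch along the same lines, grouping on the normalised path, could be:
select
    -- group rows whose key_path differs only by the [n] index
    jsonb_agg(key_path) as key_path_group,
    key,
    jsonb_agg(value) as values
from A
group by regexp_replace(key_path, '(.*)\[(\d+)\](.*)', '\1[x]\3'), key;
Note this would also turn a lone path like bar into a one-element array, which differs slightly from the desired output.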

How to parse month-year string using Presto

I have a column that contains a Month-Year string that I would like to convert to an actual date representing the first day of the Month and Year combination. For example
+----------+------------+
| Original | Desired |
+----------+------------+
| Aug-19 | 08/01/2019 |
+----------+------------+
| Sep-20 | 09/01/2020 |
+----------+------------+
| May-22 | 05/01/2022 |
+----------+------------+
I have tried breaking apart the Month-Year string using split_part, but when I try to pass the month as a parameter into date_parse it throws an error (INVALID_FUNCTION_ARGUMENT). I could break the Month-Year apart into strings and then recombine them, hard-coding the 01; however, the problem seems to be that the three-letter month cannot be parsed into an actual month by Presto. I also want to avoid a 12-line CASE WHEN statement to parse the month if possible.
The two-digit suffix is the year (%y) and the three-letter prefix is the abbreviated month (%b); the day is absent from the input, so it defaults to the first of the month. The query will be like this:
select date_format(date_parse('May-22', '%b-%y'), '%m/%d/%Y')
https://trino.io/docs/current/functions/datetime.html?mysql-date-functions
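Applied to a column rather than a literal (the table and column names below are just placeholders), that would look something like:
select date_format(date_parse(month_year, '%b-%y'), '%m/%d/%Y') as first_of_month
from my_table
date_parse returns a timestamp at midnight on the first of the month, and date_format turns it back into the MM/DD/YYYY string shown in the desired output.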

Calculate time range in org-mode table

Given a table that has a column of time ranges e.g.:
| <2015-10-02>--<2015-10-24> |
| <2015-10-05>--<2015-10-20> |
....
how can I create a column showing the results of org-evaluate-time-range?
If I attempt something like:
#+TBLFM: $2='(org-evaluate-time-range $1)
the 2nd column is populated with
Time difference inserted
in every row.
It would also be nice to generate the same result from two different columns with, say, start date and end date instead of creating one column of time ranges out of those two.
If you have your date range split into 2 columns, a simple subtraction works and returns number of days:
| <2015-10-05> | <2015-10-20> | 15 |
| <2013-10-02 08:30> | <2015-10-24> | 751.64583 |
#+TBLFM: $3=$2-$1
Using org-evaluate-time-range is also possible, and you get a nice formatted output:
| <2015-10-02>--<2015-10-24> | 22 days |
| <2015-10-05>--<2015-10-20> | 15 days |
| <2015-10-22 Thu 21:08>--<2015-08-01> | 82 days 21 hours 8 minutes |
#+TBLFM: $2='(org-evaluate-time-range)
Note that the only optional argument org-evaluate-time-range accepts is a flag indicating that the result should be inserted into the current buffer, which you don't want here.
Now, how this function gets the correct time range when evaluated without arguments is a complete mystery to me; pure magic(!)

Sane way to store different data types within same column in postgres?

I'm currently attempting to modify an existing API that interacts with a postgres database. Long story short, it essentially stores descriptors/metadata to determine where an actual 'asset' (typically a file of some sort) is stored on the server's hard disk.
Currently, its possible to 'tag' these 'assets' with any number of undefined key-value pairs (i.e. uploadedBy, addedOn, assetType, etc.) These tags are stored in a separate table with a structure similar to the following:
+---------------+----------------+-------------+
|assetid (text) | tagid(integer) | value(text) |
|---------------+----------------+-------------|
|someStringValue| 1234 | someValue |
|---------------+----------------+-------------|
|aDiffStringKey | 1235 | a username |
|---------------+----------------+-------------|
|aDiffStrKey | 1236 | Nov 5, 1605 |
+---------------+----------------+-------------+
assetid and tagid are foreign keys from other tables. Think of the assetid representing a file and the tagid/value pair is a map of descriptors.
Right now, the API (which is in Java) creates all these key-value pairs as a Map object. This includes things like timestamps/dates. What we'd like to do is somehow store different types of data for the value in the key-value pair - or at least store it differently within the database - so that, if we needed to, we could run queries such as date-range checks on these tags. However, if they're stored as text items in the db, then we'd have to a) know that a given item is actually a date/time/timestamp, and b) convert it into something we could actually run such a query on.
There is only one idea I could think of thus far without completely changing the layout of the db: expand the assettag table (shown above) with additional columns for the various types (numeric, text, timestamp), allow them to be null, and then on insert check the corresponding 'key' to figure out what type of data it really is. However, I can see a lot of problems with that sort of implementation.
Can any PostgreSQL-Ninjas out there offer a suggestion on how to approach this problem? I'm only recently getting thrown back into the deep-end of database interactions, so I admit I'm a bit rusty.
You've basically got two choices:
Option 1: A sparse table
Have one column for each data type, but only use the column that matches the data type you want to store. Of course this leads to most columns being null - a waste of space, but purists like it because of the strong typing. It's a bit clunky having to check each column for null to figure out which data type applies. Also, too bad if you actually want to store a null - then you must choose a specific value that "means null" - more clunkiness.
Option 2: Two columns - one for content, one for type
Everything can be expressed as text, so have a text column for the value and another column (int or text) for the type, so your app code can restore the value into an object of the correct type. The good thing is you don't have lots of nulls; importantly, you can easily extend the types beyond SQL data types to application classes by storing their value as JSON and their type as the class name.
I have used option 2 several times in my career and it was always very successful.
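A minimal sketch of option 2 (the table and column names here are made up for illustration, not taken from the question's schema):
-- One text column for the value, one for its declared type.
create table assettag_typed (
    assetid    text    not null,
    tagid      integer not null,
    value      text,
    value_type text    not null   -- e.g. 'text', 'numeric', 'timestamp', or an application class name
);

-- Date-range query over one tag, assuming every row for that tagid
-- really does hold a parseable timestamp in value:
select assetid
from assettag_typed
where tagid = 1236
  and value_type = 'timestamp'
  and value::timestamp between '2022-01-01' and '2022-12-31';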
Another option, depending on what you're doing, could be to just have one value column but store some JSON around the value...
This could look something like:
{
  "type": "datetime",
  "value": "2019-05-31 13:51:36"
}
You could even go a step further and use a json or XML column.
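For instance, with a jsonb column (hypothetical names again), the type travels with the value and you can filter on it directly:
create table assettag_jsonb (
    assetid text    not null,
    tagid   integer not null,
    value   jsonb             -- e.g. {"type": "datetime", "value": "2019-05-31 13:51:36"}
);

-- Assuming rows tagged 'datetime' always carry a parseable timestamp:
select assetid
from assettag_jsonb
where value ->> 'type' = 'datetime'
  and (value ->> 'value')::timestamp >= '2019-01-01';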
I'm not in any way a PostgreSQL ninja, but I think that instead of two columns (one for the name and one for the type) you could look at the hstore data type:
data type for storing sets of key/value pairs within a single PostgreSQL value. This can be useful in various scenarios, such as rows with many attributes that are rarely examined, or semi-structured data. Keys and values are simply text strings.
Of course, you have to check how dates/timestamps convert into and out of this type and see if it is good for you.
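A rough sketch of what that could look like (hypothetical table and tag names; hstore ships as an extension):
create extension if not exists hstore;

create table asset_meta (
    assetid text primary key,
    tags    hstore            -- all key/value tags for one asset in a single column
);

insert into asset_meta values
    ('someStringValue', 'uploadedBy => "a username", addedOn => "2019-05-31"');

-- hstore values come back as text, so date-range checks still need a cast:
select assetid
from asset_meta
where (tags -> 'addedOn')::timestamp >= '2019-01-01';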
You can use two different techniques:
If the type can vary for every tagid-assetid combination
Store a table name and row ID for every tagid-assetid combination, together with the actual data tables:
maintable:
+---------------+----------------+-----------------+---------------+
|assetid (text) | tagid(integer) | tablename(text) | table_id(int) |
|---------------+----------------+-----------------+---------------|
|someStringValue| 1234 | tablebool | 123 |
|---------------+----------------+-----------------+---------------|
|aDiffStringKey | 1235 | tablefloat | 123 |
|---------------+----------------+-----------------+---------------|
|aDiffStrKey | 1236 | tablestring | 123 |
+---------------+----------------+-----------------+---------------+
tablebool
+-------------+-------------+
| id(integer) | value(bool) |
|-------------+-------------|
| 123 | False |
+-------------+-------------+
tablefloat
+-------------+--------------+
| id(integer) | value(float) |
|-------------+--------------|
| 123 | 12.345 |
+-------------+--------------+
tablestring
+-------------+---------------+
| id(integer) | value(string) |
|-------------+---------------|
| 123 | 'text' |
+-------------+---------------+
In case every tagid has a fixed type
Create a tagid descriptor table:
tag descriptors
+---------------+----------------+-----------------+
|assetid (text) | tagid(integer) | tablename(text) |
|---------------+----------------+-----------------|
|someStringValue| 1234 | tablebool |
|---------------+----------------+-----------------|
|aDiffStringKey | 1235 | tablefloat |
|---------------+----------------+-----------------|
|aDiffStrKey | 1236 | tablestring |
+---------------+----------------+-----------------+
and the corresponding data tables
tablebool
+-------------+----------------+-------------+
| id(integer) | tagid(integer) | value(bool) |
|-------------+----------------+-------------|
| 123 | 1234 | False |
+-------------+----------------+-------------+
tablefloat
+-------------+----------------+--------------+
| id(integer) | tagid(integer) | value(float) |
|-------------+----------------+--------------|
| 123 | 1235 | 12.345 |
+-------------+----------------+--------------+
tablestring
+-------------+----------------+---------------+
| id(integer) | tagid(integer) | value(string) |
|-------------+----------------+---------------|
| 123 | 1236 | 'text' |
+-------------+----------------+---------------+
All this is just to give the general idea; you should adapt it to your needs.
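For example, reading a float value back out of the first layout could look roughly like this (reusing the table names from above; the literal values are only illustrative):
-- Use the main table to find which data table and row hold the value,
-- then join against that data table (here: tablefloat).
select m.assetid, f.value
from maintable m
join tablefloat f on f.id = m.table_id
where m.tagid = 1235
  and m.tablename = 'tablefloat';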

Org mode spreadsheet programmatic remote references

I keep my budget in org-mode and have been pleased with how simple it is. The simplicity fails, however, as I perform the same formulas on many cells; for instance, my year summary table performs the same grab-and-calculate formulas for each month. I end up with a massive line in my #+TBLFM. This would be dramatically shorter if I could programmatically pass arguments to the formula. I'm looking for something like this, but working:
| SEPT |
| #ERROR |
#+TBLFM: #2$1=remote(#1,$tf)
Elsewhere I have a table named SEPT, and it has a field named "tf". This formula works if I replace "#1" with "SEPT", but then I would need a new entry in the formula for every column.
Is there a way to get this working, where the table itself can specify what remote table to call (such as the SEPT in my example)?
You can't do this with the built-in remote; you need to use org-table-get-remote-range instead. Hopefully this suits your needs better than the answer given by artscan (I used his/her example):
| testname1 | testname2 |
|-----------+-----------|
| 1 | 2 |
#+TBLFM: #2='(org-table-get-remote-range #<$0 (string ?# ?1 ?$ ?1))
#+TBLNAME: testname1
| 1 |
#+TBLNAME: testname2
| 2 |
Note the (string ?# ?1 ?$ ?1): this is necessary because all substitutions are done before the table formula is evaluated. If you used "#1$1" directly, it would trigger the substitution mechanism and be replaced by the contents of the first cell in this table.
There is an ugly hack to get the same effect without using remote:
1) it needs a named variable for the remote address
(setq eab/test-remote "#1$1")
2) it uses an elisp expression (from org-table.el) instead of remote(tablename,#1$1)
(defun eab/test-remote (x)
  `(car (read
         (org-table-make-reference
          (org-table-get-remote-range ,x eab/test-remote)
          't 't nil))))
3) worked example
| testname1 | testname2 |
|-----------+-----------|
| | |
#+TBLFM: #2='(eval (eab/test-remote #1))
#+TBLNAME: testname1
| 1 |
#+TBLNAME: testname2
| 2 |
4) result
| testname1 | testname2 |
|-----------+-----------|
| 1 | 2 |