I have a bunch of dates in a database written in full worded text and I need to convert them into a useable date field. It's a one time deal but there's a few million lines so doing it manually is unthinkable.
"January the twenty sixth, nineteen eighty nine" would become "1989/01/26"
The format isn't always the same; I could also have "The nineteenth of August, nineteen hundred ninety"
Ideally doing it in SQL would be easier, but I could run a script and update the database after.
Any suggestions?
My code is too specific for my case to post here, but here's a pseudo-code of what I did for the numbers so maybe someone will be saved the headache, especially with numbers in French.
NumberString = 'Nineteen hundred ninety nine'
-- Replace each word by its value: 'Nineteen hundred ninety nine' becomes 19, 100, 90, 9 (or 10, 9, 100, 4, 20, 10, 9 in French)
NumberString = '19 100 90 9'
-- For French only, replace '4 20' by 80 (easier for the next step)
Total = 0
NumberArray = NumberString.SplitOnSpaces
FOREACH NumberArray as Number
    IF Number is not a power of 100 THEN
        Total += Number
    ELSE IF Total % Number = 0 THEN
        Total += Number
    ELSE
        -- Multiply only the part below the hundred, e.g. the 19 in 'nineteen hundred'
        ModuloTemp = Total % Number
        Total -= ModuloTemp
        Total += ModuloTemp * Number
    END IF
END FOREACH
It won't work when the number is written in parts, like 2020 would be 'twenty twenty', but it was good enough for my needs.
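For anyone who prefers something runnable, here is a minimal Python sketch of the pseudo-code above. The WORD_VALUES table and the function names are my own assumptions for illustration (English only, no "thousand"), and it keeps the same limitation with numbers written in parts.

```python
# Hypothetical word table for illustration; extend it for your own data.
WORD_VALUES = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50, "sixty": 60,
    "seventy": 70, "eighty": 80, "ninety": 90, "hundred": 100,
}


def is_power_of_100(n):
    """True for 100, 10000, ... (the multipliers in the pseudo-code)."""
    if n < 100:
        return False
    while n % 100 == 0:
        n //= 100
    return n == 1


def words_to_number(text):
    """Convert e.g. 'Nineteen hundred ninety nine' to 1999."""
    total = 0
    for word in text.lower().replace("-", " ").split():
        if word == "and":
            continue
        number = WORD_VALUES[word]
        if not is_power_of_100(number):
            total += number
        elif total % number == 0:
            total += number
        else:
            # Multiply only the part below the multiplier (19 -> 1900).
            remainder = total % number
            total -= remainder
            total += remainder * number
    return total
```

With this table, words_to_number('Nineteen hundred ninety nine') gives 1999, while 'twenty twenty' just sums to 40, which is the limitation mentioned above.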
Then for the date I just find the month's position. If it's day, month, year you can build back the date using the month as a separator. If it's month, day, year then you have to find the ordinal number (first, second, tenth, etc.) to distinguish the year from the day.
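The splitting step could look roughly like this in Python. This is only a sketch under my own assumptions: the MONTHS, ORDINALS and FILLER tables and the split_date_words name are made up for the example, and it only separates the words; converting each part to a number is a separate step.

```python
MONTHS = {
    name: index
    for index, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}
# Ordinal words that can end a day of the month (assumed list).
ORDINALS = {
    "first", "second", "third", "fourth", "fifth", "sixth", "seventh",
    "eighth", "ninth", "tenth", "eleventh", "twelfth", "thirteenth",
    "fourteenth", "fifteenth", "sixteenth", "seventeenth", "eighteenth",
    "nineteenth", "twentieth", "thirtieth",
}
FILLER = {"the", "of", "and"}


def split_date_words(text):
    """Return (day_words, month_number, year_words) for a worded date."""
    cleaned = text.lower().replace(",", " ").replace("-", " ")
    words = [w for w in cleaned.split() if w not in FILLER]
    # Raises StopIteration if no month name is present.
    month_pos = next(i for i, w in enumerate(words) if w in MONTHS)
    month = MONTHS[words[month_pos]]
    before, after = words[:month_pos], words[month_pos + 1:]
    if before:
        # Day, month, year: the month itself is the separator.
        return before, month, after
    # Month, day, year: the day ends at the first ordinal word.
    ordinal_pos = next(i for i, w in enumerate(after) if w in ORDINALS)
    return after[:ordinal_pos + 1], month, after[ordinal_pos + 1:]
```

For "The nineteenth of August, nineteen hundred ninety" this returns (['nineteenth'], 8, ['nineteen', 'hundred', 'ninety']).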
I am very new at all of this, and I don't know if what I want to do is even possible, but I'm hoping someone can assist me with some formulas if it is.
I am trying to create a spreadsheet for my business's scheduling purposes. I have created a spreadsheet that lists my POs, start date, end date, location, project hours, and total days.
I currently have a couple of formulas on the sheet. When I enter the project hours in column E, the formula =roundup(E2/24) puts the expected total days of work into column F.
I have a starting date of 7/1/2022 entered in B2, then a formula that looks at column C (end date) and adds the number of days from column F (total days) to the end date. Each line thereafter copies the end date from the row above to the start date and then adds the total days from F to that to complete the next row.
What I would like to do is have the dates only reflect workdays (M-F) instead of returning all dates. Is this even possible?
Take a look at spreadsheet example, but it is pretty basic.
Thank you in advance for any help you can provide!
I understand that you want to add "Total days" to the "Start Date" to get the "End Date" excluding weekends, in this case Sundays and Saturdays.
Paste this formula in cell C2 "End Date" column.
=ArrayFormula(IF(E2:E="",,WORKDAY.INTL(B2:B,F2:F,"0000011",)))
Breakdown:
1 - ArrayFormula outputs values into several rows and/or columns, which makes the formula dynamic.
2 - WORKDAY.INTL determines the date after a given number of workdays, skipping the specified weekend days and holidays.
3 - The [weekend] argument uses the string method: seven ones and zeros, one per day from Monday to Sunday, where 0 is a workday and 1 is a weekend day. "0000011" therefore treats Saturday and Sunday as the weekend.
Here is the link to the spreadsheet; I hope that answers your question.
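To make the weekend mask concrete, here is a small Python sketch of the same idea. It is not how Sheets implements WORKDAY.INTL; add_workdays is a hypothetical helper that ignores holidays and only handles a non-negative number of days.

```python
from datetime import date, timedelta


def add_workdays(start, workdays, weekend_mask="0000011"):
    """Return the date `workdays` working days after `start`.

    weekend_mask has one character per day, Monday to Sunday:
    "0" is a workday, "1" is a weekend day.
    """
    if len(weekend_mask) != 7 or "0" not in weekend_mask:
        raise ValueError("mask needs 7 characters and at least one workday")
    current = start
    remaining = workdays
    while remaining > 0:
        current += timedelta(days=1)
        # date.weekday() is 0 for Monday, matching the mask order.
        if weekend_mask[current.weekday()] == "0":
            remaining -= 1
    return current


# 7/1/2022 is a Friday, so 3 workdays later is Wednesday 7/6/2022.
print(add_workdays(date(2022, 7, 1), 3))  # 2022-07-06
```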
I tried to calculate the week number within the month using the code below, but I got weird numbers like -48. (I know my logic is weird lol)
Could you point out what is wrong with the code below for getting the week number of the month?
I need sensei's help.
I used Dataprep
WEEKNUM(date)-WEEKNUM(DATE(YEAR(date),MONTH(date),1))
No error, but some values are -48, 47, ...
Your logic is mostly sound, except you're only thinking of WEEKNUM in the context of a single year. In order to have non-overlapping weeks, the week containing January 1 is always week 1 (regardless of the period), so in this case December 29–31, 2019 are all going to be week 1, just like the following 4 days of January will be. It makes sense, but only when you think about it in context of multiple years.
You can work around this for your case by checking the WEEKNUM and MONTH values conditionally, and then outputting a different value:
IF(AND(MONTH(date) == 12,WEEKNUM(date) == 1),53,WEEKNUM(date)) - WEEKNUM(DATE(YEAR(date),MONTH(date),1))
While hard-coding the week number of 53 is a little hacky, the whole point of week numbers is to always have 52 subdivisions—so I don't really see any concerns there.
You may also want to add 1 to the resulting formula (unless you want your week numbers to start with 0 each month).
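If you can compute this outside Dataprep, another way to avoid the year boundary is to count weeks from the first day of the month instead of subtracting two WEEKNUM values. This is a different technique from the formula above, sketched in Python; week_of_month is a hypothetical helper and it assumes weeks start on Sunday (which is what the December 29 example suggests), with a flag for Monday.

```python
from datetime import date


def week_of_month(d, week_starts_sunday=True):
    """Return the 1-based week of the month that contains date d."""
    first = d.replace(day=1)
    # date.weekday() is 0 for Monday ... 6 for Sunday.
    if week_starts_sunday:
        offset = (first.weekday() + 1) % 7
    else:
        offset = first.weekday()
    # offset = how many days of the first week fall in the previous month.
    return (d.day - 1 + offset) // 7 + 1


print(week_of_month(date(2019, 12, 29)))  # 5
print(week_of_month(date(2020, 1, 1)))    # 1
```

Since it never looks at the week number of the year, December 29-31, 2019 need no special case here.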
I have a list of tasks with certain date ranges, and each task is broken into weekly chunks of hours (i.e. 30 hours from 2018-12-31 to 2019-01-06, etc., with weeks starting on Monday).
The kind of operations I would like to do are
Display all the weekly hours of all the tasks for a list of users
Sum the weekly hours for a user for all his tasks for the week
When the duration of the task is modified, create/destroy the weekly hour chunks.
Would it be more efficient to store these weekly records as
start date/end date/hours, or
year/week number/hours?
Storing start/end dates probably gives more flexibility to the table, as it could potentially store hours that are not aligned to weeks.
Storing week number means given a date range, creating the weekly chunks is as simple as finding the week number of the start date and the week number of the end date, and populating the weeks in between (without converting to date ranges). Also easier validation for updating the hours for a week, as long as the week number is 1-53.
Wondering if anyone has tried out either option and can give any pointers on their preferred option.
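For reference, the year/week approach described above could be sketched like this in Python. It assumes ISO weeks (Monday start), and weekly_chunks is just a name I made up for the example.

```python
from datetime import date, timedelta


def weekly_chunks(start, end):
    """Return the (iso_year, iso_week) pairs overlapping [start, end]."""
    # Step back to the Monday of the week that contains the start date.
    monday = start - timedelta(days=start.weekday())
    chunks = []
    while monday <= end:
        iso = monday.isocalendar()
        chunks.append((iso[0], iso[1]))
        monday += timedelta(days=7)
    return chunks


print(weekly_chunks(date(2018, 12, 31), date(2019, 1, 13)))
# [(2019, 1), (2019, 2)]
```

Note that 2018-12-31 belongs to ISO week 1 of 2019, so the year stored next to the week number has to be the ISO year, not the calendar year.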
I would probably go for a daterange column.
That gives you the flexibility to have differently sized chunks and allows you to define an exclusion constraint to prevent overlapping ranges.
Finding the row for a given week is still quite simple using the "contains" operator @>, e.g. where the_column @> to_date('2019-24', 'iyyy-iw') finds the row(s) that contain week number 24 in 2019.
The expression to_date('2019-24', 'iyyy-iw') returns the first day (Monday) of the specified week.
Finding all rows that are between two weeks can also be done, however constructing the corresponding date range looks a bit ugly. You can either construct an inclusive range with the first and last day: daterange(to_date('2019-24', 'iyyy-iw'), to_date('2019-24', 'iyyy-iw') + 6, '[]')
Or you can create a range with an exclusive upper range with the next week's first day: daterange(to_date('2019-24', 'iyyy-iw'), to_date('2019-25', 'iyyy-iw'), '[)')
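If the ranges are built in application code instead, the same two bounds can be computed there. A rough Python equivalent (3.8+ for date.fromisocalendar; iso_week_range is a hypothetical helper name) of the half-open '[)' variant:

```python
from datetime import date, timedelta


def iso_week_range(iso_year, iso_week):
    """Return (first_day, next_monday) for an ISO week, upper bound exclusive."""
    # Day 1 of an ISO week is its Monday.
    start = date.fromisocalendar(iso_year, iso_week, 1)
    return start, start + timedelta(days=7)


start, end = iso_week_range(2019, 24)
print(start, end)  # 2019-06-10 2019-06-17
```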
While ranges can be indexed quite efficiently, the required GiST indexes are a bit more expensive to maintain than a B-Tree index on two integer columns.
Another downside of using ranges (if you don't really need the flexibility) is that they take up more space than two integer columns (14 bytes instead of 8, or even 4 with two smallint). So if the size of the table is of any concern, then your current solution with the year/week columns is more efficient.
"Storing week number means given a date range, creating the weekly chunks is as simple as finding the week number of the start date and the week number of the end date"
If your input is a start and end date to begin with (rather than a "week number"), then I would definitely go for a daterange column. If that start and end date cover more than one week, then you store only one row, rather than multiple rows.
I have a date column which I am trying to query to return only the largest date per month.
What I currently have, albeit very simple, returns 99% of what I am looking for. For example, If I list the column in ascending order the first entry is 2016-10-17 and ranges up to 2017-10-06.
A point to note is that the last day of every month may not be present in the data, so I'm really just looking to pull back whatever is the "largest" date present for any existing month.
The query I'm running at the moment looks like
SELECT MAX(date_col)
FROM schema_name.table_name
WHERE <condition1>
AND <condition2>
GROUP BY EXTRACT (MONTH FROM date_col)
ORDER BY max;
This does actually return most of what I'm looking for - what I'm actually getting back is
"2016-11-30"
"2016-12-30"
"2017-01-31"
"2017-02-28"
"2017-03-31"
"2017-04-28"
"2017-05-31"
"2017-06-30"
"2017-07-31"
"2017-08-31"
"2017-09-29"
"2017-10-06"
which are indeed the maximal values present for every month in the column. However, the result set doesn't seem to include the maximum date value from October 2016 (the first month's worth of data in the column). There are multiple values in the column for that month, ranging up to 2016-10-31.
If anyone could point out why the max value for this month isn't being returned, I'd much appreciate it.
You are grouping by month (1 to 12) rather than by month and year. Since 2017-10-06 is greater than any day in October 2016, that's what you get for the "October" group.
You should instead use
GROUP BY date_trunc('month', date_col)
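To see the difference outside the database, here is a small Python sketch with made-up sample dates (max_date_per_month is a hypothetical helper). Keying on (year, month) plays the role of date_trunc('month', ...), whereas keying on the month alone would merge both Octobers into one group.

```python
from datetime import date


def max_date_per_month(dates):
    """Return the latest date seen for each (year, month), in order."""
    latest = {}
    for d in dates:
        key = (d.year, d.month)  # month alone would mix 2016 and 2017
        if key not in latest or d > latest[key]:
            latest[key] = d
    return [latest[key] for key in sorted(latest)]


sample = [date(2016, 10, 17), date(2016, 10, 31),
          date(2016, 11, 30), date(2017, 10, 6)]
for d in max_date_per_month(sample):
    print(d)
# 2016-10-31
# 2016-11-30
# 2017-10-06
```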
As you can see from the picture above, I am trying to add a new column that calculates the difference between 2014 and 2017.
Is there any way to do this? Tableau's "Table Calculation" option doesn't work for me.
Working out the difference between the first and last periods with table calculations:
First you need to get the minimum year's values (I'm calling the field "Min Year Select"):
IF DATETRUNC('year',[Order Date]) =
{FIXED: MIN(DATETRUNC('year',[Order Date]))}
THEN 1 END
The above field, named Min Year Select, returns a 1 if the year of the order date is the minimum year in your date range.
Now that we are flagging the smallest years, we can create a field to get the values (I'll call this "Min Year Segment"):
IF SUM([Min Year Select]) >= 1 THEN SUM([Sales]) END
Here we're saying that if the year is flagged as the smallest (as classified by the previous calc field we made), then get the value.
But before we can compare the two values, you have to work out the number of time periods between the min and current year so that the difference calculation lookup field is comparing the right values (I'll call this "Number Years in Range"):
{FIXED [Segment]: COUNTD(DATETRUNC('year',[Order Date]))}
What we're doing is fixing the query at the category level (segment). Think of this as removing the date pill from your report, then performing a calculation. Here it's COUNT DISTINCT years, so if a segment has data for 2011, 2012 and 2013, the query returns 3.
We can now get the difference between your latest and your minimum Segments (called: "Difference from First Last Segment"):
[Segment] -
LOOKUP([Min Year Segment],
-1 * ([Number Years in Range] - 1))
First we get the first year's sales for each segment. Min Year Segment will be null for all years that aren't the first, so we need to look up the first by going backwards by the number of years in our range.
We multiply by -1 because we want the lookup to go backwards, and we use ("Number Years in Range" - 1) as the offset because we want to reach the period that had the earliest data. We subtract one so that we exclude the current/latest year in your dataset.
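If it helps to check the numbers outside Tableau, the same first-versus-last comparison can be sketched in plain Python. The rows and the first_last_difference name are made up for illustration, and this is not a translation of the calculated fields, just the result they are aiming for.

```python
def first_last_difference(rows):
    """rows: iterable of (segment, year, sales) tuples.

    Returns {segment: sales in latest year - sales in earliest year}.
    """
    totals = {}
    for segment, year, sales in rows:
        by_year = totals.setdefault(segment, {})
        by_year[year] = by_year.get(year, 0) + sales
    result = {}
    for segment, by_year in totals.items():
        first_year, last_year = min(by_year), max(by_year)
        result[segment] = by_year[last_year] - by_year[first_year]
    return result


# Hypothetical sample data.
rows = [
    ("Consumer", 2014, 100), ("Consumer", 2015, 150), ("Consumer", 2017, 180),
    ("Corporate", 2014, 50), ("Corporate", 2017, 40),
]
print(first_last_difference(rows))  # {'Consumer': 80, 'Corporate': -10}
```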
This is a lot to digest, I think it's easier to present as a picture too:
Here we calculate the difference between the first and last month, with the value in the last month
If this helped or you have any more questions, please vote on my answer/let me know