Stata, string variable to quarterly time series - date

I have a data set with this type of observations:
"2015_1"
"2015_2"
"2015_3"
I want to convert these to a quarterly time series, like:
2015q1
2015q2
2015q3

This is a standard conversion task. See help datetime and help datetime display formats for the details.
* Example generated by -dataex-. To install: ssc install dataex
clear
input str6 have
"2015_1"
"2015_2"
"2015_3"
end
gen wanted = quarterly(have, "YQ")
format wanted %tq
list
+-----------------+
| have wanted |
|-----------------|
1. | 2015_1 2015q1 |
2. | 2015_2 2015q2 |
3. | 2015_3 2015q3 |
+-----------------+
describe
Contains data
obs: 3
vars: 2
---------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
---------------------------------------------------------------------------------------------
have str6 %6s
wanted float %tq
-------------------------------------------------------------------------------------
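To see what the conversion stores, here is a minimal Python sketch of the same arithmetic. It assumes, as Stata's datetime documentation describes, that a %tq value is the number of quarters elapsed since 1960q1. The names to_stata_quarter and format_tq are illustrative only and are not part of Stata.

```python
def to_stata_quarter(text: str) -> int:
    """Convert a string such as "2015_1" to quarters since 1960q1."""
    year_part, quarter_part = text.split("_")
    year, quarter = int(year_part), int(quarter_part)
    if not 1 <= quarter <= 4:
        raise ValueError(f"quarter must be 1-4, got {quarter}")
    return (year - 1960) * 4 + (quarter - 1)


def format_tq(value: int) -> str:
    """Render a quarter count in the style of the %tq format, e.g. 2015q1."""
    year_offset, quarter_index = divmod(value, 4)
    return f"{1960 + year_offset}q{quarter_index + 1}"


for raw in ["2015_1", "2015_2", "2015_3"]:
    print(raw, format_tq(to_stata_quarter(raw)))
```

The underlying integer is what makes the variable usable with tsset and lag operators; the %tq format only changes how it is displayed.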

Related

Excel: Select the newest date from a list that contains multiple rows with the same ID

In Excel, I have a list with multiple rows of the same ID (column A), each with various dates recorded (Column B). I need to extract one row for each ID that contains the newest date. See below for example:
|Column A | Column B|
|(ID) | (Date) |
|-----------|-----------|
|00001 | 01/01/2022|
|00001 | 02/01/2022|
|00001 | 03/01/2022| <-- I Need this one
|00002 | 01/02/2022|
|00002 | 02/02/2022|
|00002 | 03/02/2022| <-- I Need this one
|00003 | 01/03/2022|
|00003 | 02/03/2022|
|00003 | 03/03/2022| <-- I Need this one
|00004 | 01/04/2022|
|00004 | 02/04/2022|
|00004 | 03/04/2022| <-- I Need this one
|00005 | 01/05/2022|
|00005 | 02/05/2022|
|00005 | 03/05/2022| <-- I Need this one
I need to extract the above rows, where the row with the newest date is extracted for each unique ID. It needs to look like this:
|Column A | Column B |
|(ID) | (Date) |
|----------|--------------|
|00001 | 03/01/2022 |
|00002 | 03/02/2022 |
|00003 | 03/03/2022 |
|00004 | 03/04/2022 |
|00005 | 03/05/2022 |
I'm totally stumped and I can't seem to find the right answer (probably because of how I'm wording the question!)
Thank you!
Google searches for the answer - no joy. I don't know where to start in excel with this function, I thought perhaps DISTINCT or similar...
Assuming you have Office 365 compatible version of Excel, you could do something like this:
(screenshot/here refers):
=INDEX(SORTBY(A2:B11,B2#,-1),SEQUENCE(1,1,1,1),SEQUENCE(1,2,1,1))
The first SEQUENCE is superfluous, albeit convenient: you don't really need it, because only one row is being returned. However, as you can see in the screenshot, the same formula with a leading 2 in the first argument of that SEQUENCE returns the top two dates (in descending order), and so forth.
For those with Office 365, you could also do something like this:
=LARGE(B2#+(ROW(B2#)-ROW(B2))/1000,1)
i.e. adding a "little bit" to the dates that we can subtract later and use as a unique reference (row number, original unsorted list)
As mentioned, reverse engineer, throw into an index, and voila!
=INDEX(A2:A11,ROUND((H2-ROUND(H2,0))*1000,6))
caveats:
the round(<>,6) is purely to eliminate Excel's irritating lack of precision issue.
can work if you're looking up text strings (i.e. attempting to sort alphabetically) EXCEPT large doesn't work with string (no prob, just use unicode - but good luck with expanding out the string etc. ☺ with mid(<>,row(a1:offset(a1,len(<>)-1)..,1)..
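Outside Excel, the underlying "newest date per ID" logic can be sketched in a few lines of Python. This is only an illustration of the grouping step, not a replacement for the formulas above. It assumes the dates are dd/mm/yyyy strings (the question does not say which convention is used), and newest_per_id is a hypothetical helper name.

```python
from datetime import date, datetime


def newest_per_id(rows):
    """Return {id: newest date} for an iterable of (id, "dd/mm/yyyy") pairs."""
    newest = {}
    for row_id, text in rows:
        # Assumption: day comes first; change the format string if it does not.
        parsed = datetime.strptime(text, "%d/%m/%Y").date()
        if row_id not in newest or parsed > newest[row_id]:
            newest[row_id] = parsed
    return newest


rows = [
    ("00001", "01/01/2022"), ("00001", "02/01/2022"), ("00001", "03/01/2022"),
    ("00002", "01/02/2022"), ("00002", "02/02/2022"), ("00002", "03/02/2022"),
]

for row_id, newest in newest_per_id(rows).items():
    print(row_id, newest.strftime("%d/%m/%Y"))
```

Keeping the IDs as strings preserves the leading zeros, which a numeric conversion would drop.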

Convert variable of type double to date and time

How do I do the following in Stata:
I have a variable of type double with values similar to the ones below:
20180405013331
20160107085521
How can I convert it to a date/time (YYYYMMDDhhmmss) variable like the following:
2018April5 01:33:31
2016January7 08:55:21
help datetime explains the basics here. The only extra twist is that your date-times arrive as a double, so you need to convert to a string, which can be done on the fly.
clear
input double mydate
20180405013331
20160107085521
end
format mydate %14.0f
gen double wanted = clock(string(mydate, "%14.0f"), "YMDhms")
format wanted %tc
list
+-------------------------------------+
| mydate wanted |
|-------------------------------------|
1. | 20180405013331 05apr2018 01:33:31 |
2. | 20160107085521 07jan2016 08:55:21 |
+-------------------------------------+
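The same two steps (turn the number into a string, then parse it) can be sketched in Python for comparison. The sketch also shows why the Stata variable is generated as a double: a %tc value counts milliseconds since 01jan1960 00:00:00, which is far too large for a float to hold exactly. The function names are illustrative, not part of any library.

```python
from datetime import datetime

STATA_EPOCH = datetime(1960, 1, 1)  # origin of Stata's %tc clock


def parse_packed_datetime(value: float) -> datetime:
    """Parse a number like 20180405013331 (YYYYMMDDhhmmss) into a datetime."""
    text = f"{value:.0f}"  # same idea as string(mydate, "%14.0f") in Stata
    return datetime.strptime(text, "%Y%m%d%H%M%S")


def to_stata_clock(moment: datetime) -> int:
    """Milliseconds elapsed since 01jan1960 00:00:00, ignoring leap seconds."""
    delta = moment - STATA_EPOCH
    return (delta.days * 86_400 + delta.seconds) * 1000


for packed in [20180405013331.0, 20160107085521.0]:
    moment = parse_packed_datetime(packed)
    print(moment.isoformat(sep=" "), to_stata_clock(moment))
```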

Concatenate Dataframe rows based on timestamp value

I have a Dataframe with text messages and a timestamp value for each row.
Like so:
+--------------------------+---------------------+
| message | timestamp |
+--------------------------+---------------------+
| some text from message 1 | 2019-08-03 01:00:00 |
+--------------------------+---------------------+
| some text from message 2 | 2019-08-03 01:01:00 |
+--------------------------+---------------------+
| some text from message 3 | 2019-08-03 01:03:00 |
+--------------------------+---------------------+
I need to concatenate the messages by creating time windows of X number of minutes so that for example they look like this:
+---------------------------------------------------+
| message |
+---------------------------------------------------+
| some text from message 1 some text from message 2 |
+---------------------------------------------------+
| some text from message 3 |
+---------------------------------------------------+
After doing the concatenation I have no use for the timestamp column so I can drop it or keep it with any value.
I have been able to do this by iterating through the entire Dataframe, adding timestamp diffs and inserting into a new Dataframe when the time window is achieved. It works but it's ugly and I am looking for some pointers into how to accomplish this in Scala in a more functional/elegant way.
I looked at the Window functions but since I am not doing aggregations it appears that I do not have a way to access the content of the groups once the WindowSpec is created so I didn't get very far.
I also looked at the lead and lag functions but I couldn't figure out how to use them without also having to go into a for loop.
I appreciate any ideas or pointers you can provide.
Any thoughts or pointers into how to accomplish this?
You can use the window datetime function (not to be confused with Window functions) to generate time windows, followed by a groupBy to aggregate messages using concat_ws:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
("message1", "2019-08-03 01:00:00"),
("message2", "2019-08-03 01:01:00"),
("message3", "2019-08-03 01:03:00")
).toDF("message", "timestamp")
val duration = "2 minutes"
df.
groupBy(window($"timestamp", duration)).
agg(concat_ws(" ", collect_list($"message")).as("message")).
show(false)
// +------------------------------------------+-----------------+
// |window |message |
// +------------------------------------------+-----------------+
// |[2019-08-03 01:00:00, 2019-08-03 01:02:00]|message1 message2|
// |[2019-08-03 01:02:00, 2019-08-03 01:04:00]|message3 |
// +------------------------------------------+-----------------+
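To make the bucketing explicit, here is a small plain-Python sketch of the same tumbling-window idea, without Spark. It assumes windows are aligned to 1970-01-01 00:00:00 (which is how Spark's window function aligns them by default) and treats the timestamps as naive values with no time zone. The names window_start and concat_by_window are illustrative.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def window_start(moment: datetime, width: timedelta) -> datetime:
    """Floor a timestamp to the start of its fixed-width window."""
    elapsed = moment - EPOCH
    return EPOCH + (elapsed // width) * width


def concat_by_window(rows, width: timedelta):
    """Join messages whose timestamps fall into the same window."""
    groups = defaultdict(list)
    for message, stamp in rows:
        moment = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        groups[window_start(moment, width)].append(message)
    # Emit windows in chronological order; messages keep their input order.
    return [" ".join(groups[start]) for start in sorted(groups)]


rows = [
    ("message1", "2019-08-03 01:00:00"),
    ("message2", "2019-08-03 01:01:00"),
    ("message3", "2019-08-03 01:03:00"),
]
print(concat_by_window(rows, timedelta(minutes=2)))
```

Each window covers its start time up to, but not including, its end time, so a message stamped exactly 01:02:00 would land in the second window.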

fastText: How to process a corpus with fastText?

I'm new to fastText and NLP. I have a corpus CSV in French, structured as follows:
| value | sentence | pivot |
|-------|--------------------------------|----------|
| 1 | My first [sentence] | sentence |
| 0 | My second [word] in a sentence | word |
| .. | ... | ... |
I want to know how to tell fastText to process the pivot words between brackets ([pivot]) to build my model. Or is there a built-in feature in fastText that knows which word to process? I really want to understand the mechanics of fastText; the documentation I found is limited. Thanks.
You can extract word vectors for the pivot column using fastText in this way:
!pip install fasttext
import fasttext
import fasttext.util
import pandas as pd
# Download the pretrained French vectors (cc.fr.300.bin) unless already present.
fasttext.util.download_model('fr', if_exists='ignore')
model = fasttext.load_model('cc.fr.300.bin')
vectors = []
# Adjust sep to match the delimiter of your file.
dataset = pd.read_csv('path to csv file', sep='\t')
# Use dataset['pivot']: dataset.pivot is a DataFrame method, not the column.
for word in dataset['pivot']:
    vectors.append(model.get_word_vector(word))
https://fasttext.cc/docs/en/crawl-vectors.html

How do I convert Epoch time to Date in Open Refine?

I don't care which language I use (as long as it's one of the three available in Open Refine), but I need to convert a timestamp returned from an API from epoch time to a regular date (see Expression box in the screenshot below). Not too picky about the output date format, just that it retains the date down to the second. Thanks!
Can use: GREL, Jython, or Clojure.
If you have to stick to GREL you can use the following one-liner:
inc(toDate("01/01/1970 00:00:00","dd/MM/yyyy H:m:s"),value.toNumber(),"seconds").toString('yyyy-MM-dd HH:mm:ss')
Breaking it down:
inc(date d, number value, string unit) as defined in the GREL documentation: Returns a date changed by the given amount in the given unit of time. Unit defaults to 'hour'.
toDate(o, string format): Returns o converted to a date object. (More complex uses of toDate() are shown in the GREL documentation.)
We use the string "01/01/1970 00:00:00" as input for toDate() to get the start of the Unix Epoch (January 1st 1970, midnight). Note that the year pattern is lowercase yyyy; uppercase YYYY means week-based year and can shift the result by a few days.
We pass the newly created date object into inc(). The second parameter is the result of value.toNumber() (assuming value is a string representation of the number of seconds since the start of the Unix Epoch), and the third parameter is the string "seconds", which tells inc() the unit of the second parameter.
We finally convert the resulting date object into a string using the format: yyyy-MM-dd HH:mm:ss
Test Data
The following is the result of using the function described above to turn a series of timestamps grabbed from the Timestamp Generator into date strings.
| Name | Value | Date String |
|-----------|------------|---------------------|
| Timestamp | 1491998962 | 2017-04-12 12:09:22 |
| +1 Hour | 1492002562 | 2017-04-12 13:09:22 |
| +1 Day | 1492085362 | 2017-04-13 12:09:22 |
| +1 Week | 1492603762 | 2017-04-19 12:09:22 |
| +1 Month | 1494590962 | 2017-05-12 12:09:22 |
| +1 Year | 1523534962 | 2018-04-12 12:09:22 |
Unfortunately, I do not think you can do it with a GREL statement like this or somesuch, but I might be pleasantly surprised by someone else that can make it work somehow:
value.toDate().toString("dd/MM/yyy")
So in the meantime, use this Jython / Python Code:
import time
# Convert 'value' to an integer, since the time functions work with numbers.
# time.localtime() expects seconds; if the timestamp were in milliseconds, divide it by 1000 first.
# The result uses the machine's local time zone, which can affect the date.
epochlong = int(float(value))
datetimestamp = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(epochlong))
return datetimestamp
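If the local time zone is a concern, one option is to format the timestamp in UTC instead. The following is a minimal sketch for standard Python 3, not for OpenRefine's Jython expression box (which runs Python 2 and lacks datetime.timezone). The function name epoch_to_utc_string is illustrative.

```python
from datetime import datetime, timezone


def epoch_to_utc_string(value) -> str:
    """Format a Unix timestamp in seconds (number or numeric string) as UTC."""
    seconds = int(float(value))
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


print(epoch_to_utc_string("0"))
print(epoch_to_utc_string("1491998962"))
```

Because the conversion is pinned to UTC, the output does not depend on the machine that runs it.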