Pandas - Convert HH:MM:SS.F string to seconds - Caveat: HH sometimes goes over 24H

I have the following dataframe:
**flashtalking_df =**
+--------------+--------------------------+------------------------+
| Placement ID | Average Interaction Time | Total Interaction Time |
+--------------+--------------------------+------------------------+
| 2041083      | 00:01:04.12182           | 24:29:27.500           |
| 2041083      | 00:00:54.75043           | 52:31:48.89108         |
+--------------+--------------------------+------------------------+
where 00:01:04.12182 = HH:MM:SS.F
I need to convert both columns, Average Interaction Time and Total Interaction Time, into seconds.
The problem is that Total Interaction Time goes over 24h.
I found the following code which works for the most part. However, when the Total Interaction Time goes over 24h, it gives me
ValueError: time data '24:29:27.500' does not match format '%H:%M:%S.%f'
This is the function I am currently using, which I grabbed from another Stack Overflow question, for both Average Interaction Time and Total Interaction Time:
import datetime
import numpy as np

flashtalking_df['time'] = flashtalking_df['Total Interaction Time'].apply(lambda x: datetime.datetime.strptime(x, '%H:%M:%S.%f'))
flashtalking_df['timedelta'] = flashtalking_df['time'] - datetime.datetime.strptime('00:00:00.00000', '%H:%M:%S.%f')
flashtalking_df['Total Interaction Time'] = flashtalking_df['timedelta'].apply(lambda x: x / np.timedelta64(1, 's'))
If there's an easier way, please let me know.
Thank you for all your help.

I think you need to first convert with to_timedelta and then to seconds by astype:
df['Average Interaction Time'] = (pd.to_timedelta(df['Average Interaction Time'])
                                    .astype('timedelta64[s]')
                                    .astype(int))
df['Total Interaction Time'] = (pd.to_timedelta(df['Total Interaction Time'])
                                  .astype('timedelta64[s]')
                                  .astype(int)
                                  .map('{:,.2f}'.format))
print (df)
   Placement ID Average Interaction Time Total Interaction Time
0       2041083                       64              88,167.00
1       2041083                       54             189,108.00
Solution with total_seconds, thanks to NickilMaveli:
df['Average Interaction Time'] = (pd.to_timedelta(df['Average Interaction Time'])
                                    .dt.total_seconds()
                                    .map('{:,.2f}'.format))
df['Total Interaction Time'] = (pd.to_timedelta(df['Total Interaction Time'])
                                  .dt.total_seconds()
                                  .map('{:,.2f}'.format))
print (df)
   Placement ID Average Interaction Time Total Interaction Time
0       2041083                    64.12              88,167.50
1       2041083                    54.75             189,108.89
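For reference, a self-contained version of the total_seconds approach (sample data copied from the question). The key point is that pd.to_timedelta happily parses hour fields beyond 24, which is exactly what strptime with %H refuses to do:
import pandas as pd

flashtalking_df = pd.DataFrame({
    'Placement ID': [2041083, 2041083],
    'Average Interaction Time': ['00:01:04.12182', '00:00:54.75043'],
    'Total Interaction Time': ['24:29:27.500', '52:31:48.89108'],
})

# to_timedelta accepts "HH:MM:SS.F" strings with HH >= 24,
# and total_seconds() flattens the result to plain float seconds
for col in ['Average Interaction Time', 'Total Interaction Time']:
    flashtalking_df[col] = pd.to_timedelta(flashtalking_df[col]).dt.total_seconds()

print(flashtalking_df)
# Average Interaction Time becomes 64.12182 / 54.75043,
# Total Interaction Time becomes 88167.5 / 189108.89108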

Related

Twitter Rate Limit for search_all for full archive data

I am using tweepy to get replies to a given user name, filtered by since_id and until_id. I hit a rate limit after every three requests. My code is like below:
q = "to:"+screen_name
reply_ls = []
tweets = client.search_all_tweets(query = q ,
since_id = since_id,
until_id = until_id,
tweet_fields = ['in_reply_to_user_id',\
'author_id','created_at','conversation_id',"referenced_tweets"],
expansions = "referenced_tweets.id" )
Here is the rate limit I got
10%|█████▌ | 10/100 [00:10<01:28, 1.02it/s]Rate limit exceeded. Sleeping for 607 seconds.
13%|███████ | 13/100 [10:20<2:07:02, 87.62s/it]Rate limit exceeded. Sleeping for 898 seconds.
17%|█████████ | 17/100 [25:22<2:37:38, 113.95s/it]Rate limit exceeded. Sleeping for 897 seconds.
23%|████████████▍ | 23/100 [40:26<1:16:23, 59.52s/it]Rate limit exceeded. Sleeping for 894 seconds.
I thought the API allowed 300 requests every 15 minutes, but now I can only make three requests every 15 minutes. Is this reasonable?
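One thing worth checking, though it is not confirmed by this thread: besides the 300 requests per 15 minutes, the full-archive search endpoint is also limited to 1 request per second per app. A minimal sketch that spaces out the calls and lets tweepy sleep on the windowed limit (the bearer_token is a placeholder; screen_name, since_id and until_id are as in the question):
import time
import tweepy

# wait_on_rate_limit tells tweepy to sleep instead of raising
# when the 15-minute window is exhausted
client = tweepy.Client(bearer_token="...", wait_on_rate_limit=True)

q = "to:" + screen_name
tweets = client.search_all_tweets(query=q,
                                  since_id=since_id,
                                  until_id=until_id,
                                  tweet_fields=['in_reply_to_user_id', 'author_id',
                                                'created_at', 'conversation_id',
                                                'referenced_tweets'],
                                  expansions="referenced_tweets.id")
time.sleep(1)  # stay under the per-second app limit between successive calls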

Strange 64-bit time format from a game, can you recognize it?

I captured this 64-bit time format from a game and am trying to understand it.
I cannot use a date delta because every now and then the value changes completely and even becomes negative, as seen below.
v1:int64=-5990085973098618987; //2021-01-25 13:30:00
v2:int64=-5990085973321147595; //4 mins later
v3:int64=6140958949625363349; //7 mins later
v4:int64=6140958948894898101; //11 mins later
v5:int64=-174740204032730139; //16 mins later
v6:int64=-174740204054383467; //18 mins later
v7:int64=-6490439358095090795; //23 mins later
I tried to split the 64-bit value into two 32-bit containers to get the low and high parts: still strange values.
I also tried using pdouble(#value)^ to get the float value of the 64-bit data: still strange values.
So I'm kind of running out of ideas; maybe it's some kind of bitfield data or something else is going on.
hipart: -1394675573 | lopart: 1466441621 | hex: acdef08b|57681f95 | swap: -7701322112560996692
hipart: -1394675573 | lopart: 1243913013 | hex: acdef08b|4a249b35 | swap: 3862721007994330796
hipart: 1429803424 | lopart: -458425451 | hex: 553911a0|e4acfb95 | swap: -7639322244965910187
hipart: 1429803424 | lopart: -1188890699 | hex: 553911a0|b922f7b5 | swap: -5334757052947285675
hipart: -40684875 | lopart: -760849435 | hex: fd9332b5|d2a65be5 | swap: -1919757392230050819
hipart: -40684875 | lopart: -782502763 | hex: fd9332b5|d15bf495 | swap: -7641381711494605827
hipart: -1511173174 | lopart: -1467540587 | hex: a5ed53ca|a8871b95 | swap: -7702413578668347995
Any ideas welcomed, thanks in advance
//mbs
--EDIT: So far, thanks to Martin Rosenau we are able to encode like this:
func mulproc_nfsw(i:int64;key:uint32):int64;
begin
  if (blnk i) or (blnk key) then exit;
  p:pointer=#i;
  hi:uint32=uint32(p+4)^; //30864159 (hex: 01d6f31f)
  lo:uint32=uint32(p)^;   //748455936 (hex: 2c9c8800)
  hi64:int64=hi*key;      //0135b55a acdef08b <-- keep
  lo64:int64=lo*key;      //1d566e0b a65f2800 <-- keep
  q:pointer=#result;      //-5990085971773806592
  uint32(q+4)^:=hi64;     //acdef08b
  uint32(q)^:=lo64;       //a65f2800
end;
func encode_time_nfsw(j:juncture):int64;
begin
  if blnk j then exit;    //input: '2021-01-25 13:37:07'
  key:uint32=$A85A2115;   //encode key
  ft:int64=j.filetime;    //hex: 01d6f31f 2c9c8800
  result:=mulproc_nfsw(ft,key);
end;
--EDIT2: Finally, thanks to fpiette we are able to decode also:
func decode_time_nfsw(i:int64):juncture;
begin
  if blnk i then exit;    //input: -5990085971773806592
  key:uint32=$3069263D;   //decode key
  ft:int64=mulproc_nfsw(i,key);
  result.setfiletime(ft);
end;
I checked my suspicion that the high and the low 32 bits are simply multiplied by A85A2115 (hex):
We get a FILETIME structure. Then we perform a 32x32->32 bit multiplication (meaning we throw away the high 32 bits of the 64-bit result) on the high and the low dword independently.
Example:
25 Jan 2021 13:37:07 (and some milliseconds)
Unencrypted FILETIME:
High dword = 1D6F31F (hex)
Low dword = 2C9CA481 (hex)
Multiplication
High dword: 1D6F31F * A85A2115 = 135B55AACDEF08B (hex)
Low dword: 2C9CA481 * A85A2115 = 1D5680CA57681F95 (hex)
Now only take the low 32 bits of the results:
High dword: ACDEF08B (hex)
Low dword: 57681F95 (hex)
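The 32x32->32 multiplication is easy to reproduce, for example in Python, by masking the product to 32 bits (values from the example above):
hex((0x01D6F31F * 0xA85A2115) & 0xFFFFFFFF)  # '0xacdef08b' (high dword)
hex((0x2C9CA481 * 0xA85A2115) & 0xFFFFFFFF)  # '0x57681f95' (low dword)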
Unfortunately, I don't know how to do the "reverse operation"; I did it by searching for the result in a loop with the following pseudo-code:
encryptedValue = 57681F95 (hex)
originalValue = 0
product = 0
while product not equal to encryptedValue
    // 32-bit addition discarding carry:
    product = product + A85A2115 (hex)
    originalValue = originalValue + 1
end_of_while_loop
We get the following results:
25 Jan 2021 13:37:07 => acdef08b|57681f95
25 Jan 2021 13:40:51 => acdef08b|4a249b35
25 Jan 2021 13:45:07 => 553911a0|e4acfb95
25 Jan 2021 13:49:03 => 553911a0|b922f7b5
25 Jan 2021 13:53:53 => fd9332b5|d2a65be5
25 Jan 2021 13:55:50 => fd9332b5|d15bf495
25 Jan 2021 14:00:39 => a5ed53ca|a8871b95
Addendum
The reverse operation seems to be done by multiplying with 3069263D (hex) (and only using the low 32 bits).
Encrypting:
2C9CA481 * A85A2115 = 1D5680CA57681F95
=> Result: 57681F95
Decrypting:
57681F95 * 3069263D = 10876CAF2C9CA481
=> Result: 2C9CA481
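Why the decode key works: A85A2115 is odd, so it has a multiplicative inverse modulo 2^32, and 3069263D is exactly that inverse, so no search loop is needed. A Python sketch of the decoding (the 1601-01-01 epoch and 100 ns tick of FILETIME are standard; pow(key, -1, 2**32) needs Python 3.8+):
from datetime import datetime, timedelta, timezone

MASK32 = 0xFFFFFFFF
ENC_KEY = 0xA85A2115
DEC_KEY = pow(ENC_KEY, -1, 2**32)  # multiplicative inverse mod 2^32 = 0x3069263D
EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)  # FILETIME epoch

def mul32_pair(value, key):
    # multiply the high and low dwords by key independently,
    # keeping only the low 32 bits of each product
    value &= (1 << 64) - 1  # reinterpret a signed int64 as unsigned
    hi = ((value >> 32) * key) & MASK32
    lo = ((value & MASK32) * key) & MASK32
    return (hi << 32) | lo

def decode_time(v):
    filetime = mul32_pair(v, DEC_KEY)  # 100 ns ticks since 1601-01-01
    return EPOCH + timedelta(microseconds=filetime // 10)

print(hex(DEC_KEY))                       # 0x3069263d
print(decode_time(-5990085973098618987))  # 2021-01-25 13:37:07(+ms), as in the table above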

Marathon backoff - is it really exponential?

I'm trying to figure out Marathon's exponential backoff configuration. Here's the documentation:
The backoffSeconds and backoffFactor values are multiplied until they reach the maxLaunchDelaySeconds value. After they reach that value, Marathon waits maxLaunchDelaySeconds before repeating this cycle exponentially. For example, if backoffSeconds: 3, backoffFactor: 2, and maxLaunchDelaySeconds: 3600, there will be ten attempts to launch a failed task, each three seconds apart. After these ten attempts, Marathon will wait 3600 seconds before repeating this cycle.
The way I think of exponential backoff is that the wait periods should be:
3*2^0 = 3
3*2^1 = 6
3*2^2 = 12
3*2^3 = 24 and so on
so every time the app crashes, Marathon will wait a longer period of time before retrying. However, given the description above, Marathon's logic for waiting looks something like this:
int retryCount = 0;
while (backoffSeconds * (backoffFactor ^ retryCount) < maxLaunchDelaySeconds)
{
    wait(backoffSeconds);
    retryCount++;
}
wait(maxLaunchDelaySeconds);
This matches the explanation in the documentation, since 3*2^x < 3600 for values of x less than or equal to 10. However, I really don't see how this can be called an exponential backoff, since the wait time is constant.
Is there a way to make Marathon wait progressively longer with every restart of the app? Am I misunderstanding the doc? Any help would be appreciated!
As far as I understand the code in RateLimiter.scala, it is like you described, but then capped to the maxLaunchDelay waiting period. Let's say maxLaunchDelay is one hour (3600s):
3*2^0 = 3
3*2^1 = 6
3*2^2 = 12
3*2^3 = 24
3*2^4 = 48
3*2^5 = 96
3*2^6 = 192
3*2^7 = 384
3*2^8 = 768
3*2^9 = 1536
3*2^10 = 3072
3*2^11 = 3600 (6144)
3*2^12 = 3600 (12288)
3*2^13 = 3600 (24576)
This gives the typical 2^n growth curve. You would get a steeper increase with other backoffFactors, for example a backoff factor of 10 or 20.
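For illustration, the capped schedule described above can be generated directly; a short Python sketch (parameter names mirror the Marathon settings):
def launch_delays(backoff_seconds=3, backoff_factor=2,
                  max_launch_delay=3600, retries=14):
    # delay before the n-th relaunch: exponential, capped at max_launch_delay
    return [min(backoff_seconds * backoff_factor ** n, max_launch_delay)
            for n in range(retries)]

print(launch_delays())
# [3, 6, 12, 24, 48, 96, 192, 384, 768, 1536, 3072, 3600, 3600, 3600]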
Additionally, I saw a re-work of this topic; a code review is currently open here: https://phabricator.mesosphere.com/D1007
What do you think?
Thanks
Johannes

Spark: All RDD data not getting saved to Cassandra table

Hi, I am trying to load RDD data into a Cassandra column family using Scala. Out of a total of 50 rows, only 28 are getting stored into the Cassandra table.
Below is the code snippet:
val states = sc.textFile("state.txt") // list of all the 50 states of the USA
var n = 0 // corrected to var
val statesRDD = states.map { a =>
  n = n + 1
  (n, a)
}
scala> statesRDD.count
res2: Long = 50
cqlsh:brs> CREATE TABLE BRS.state(state_id int PRIMARY KEY, state_name text);
statesRDD.saveToCassandra("brs", "state", SomeColumns("state_id", "state_name"))
// this statement saves only 28 rows out of 50, not sure why!!!!
cqlsh:brs> select * from state;
state_id | state_name
----------+-------------
23 | Minnesota
5 | California
28 | Nevada
10 | Georgia
16 | Kansas
13 | Illinois
11 | Hawaii
1 | Alabama
19 | Maine
8 | Oklahoma
2 | Alaska
4 | New York
18 | Virginia
15 | Iowa
22 | Wyoming
27 | Nebraska
20 | Maryland
7 | Ohio
6 | Colorado
9 | Florida
14 | Indiana
26 | Montana
21 | Wisconsin
17 | Vermont
24 | Mississippi
25 | Missouri
12 | Idaho
3 | Arizona
(28 rows)
Can anyone please help me in finding where the issue is?
Edit:
I understood why only 28 rows are getting stored in Cassandra: it's because I made the first column a PRIMARY KEY, and it looks like in my code n is incremented up to a maximum of 28 and then starts again from 1 up to 22 (50 in total).
val states = sc.textFile("states.txt")
var n = 0
var statesRDD = states.map { a =>
  n += 1
  (n, a)
}
I tried making n an accumulator variable as well (viz. val n = sc.accumulator(0, "Counter")), but I don't see any difference in the output.
scala> statesRDD.foreach(println)
[Stage 2:> (0 + 0) / 2]
(1,New Hampshire)
(2,New Jersey)
(3,New Mexico)
(4,New York)
(5,North Carolina)
(6,North Dakota)
(7,Ohio)
(8,Oklahoma)
(9,Oregon)
(10,Pennsylvania)
(11,Rhode Island)
(12,South Carolina)
(13,South Dakota)
(14,Tennessee)
(15,Texas)
(16,Utah)
(17,Vermont)
(18,Virginia)
(19,Washington)
(20,West Virginia)
(21,Wisconsin)
(22,Wyoming)
(1,Alabama)
(2,Alaska)
(3,Arizona)
(4,Arkansas)
(5,California)
(6,Colorado)
(7,Connecticut)
(8,Delaware)
(9,Florida)
(10,Georgia)
(11,Hawaii)
(12,Idaho)
(13,Illinois)
(14,Indiana)
(15,Iowa)
(16,Kansas)
(17,Kentucky)
(18,Louisiana)
(19,Maine)
(20,Maryland)
(21,Massachusetts)
(22,Michigan)
(23,Minnesota)
(24,Mississippi)
(25,Missouri)
(26,Montana)
(27,Nebraska)
(28,Nevada)
I am curious to know what is causing n not to get updated past the value 28. Also, what are the ways in which I can create a counter that I can use for building the RDD?
There are some misconceptions about distributed systems embedded inside your question. The real heart of this is "How do I have a counter in a distributed system?"
The short answer is that you don't. For example, what you've done in your original code example is something like this:
Task One {
    var x = 0
    record 1: x = 1
    record 2: x = 2
}
Task Two {
    var x = 0
    record 20: x = 1
    record 21: x = 2
}
Each machine independently creates a new x variable set to 0, which gets incremented within its own context, independently of the other nodes.
For most use cases the "counter" question can be replaced with "How can I get a unique identifier per record in a distributed system?"
For this, most users end up using a UUID, which can be generated on independent machines with infinitesimal chances of conflict.
If the question is instead "How can I get a monotonically increasing unique identifier?", then you can use zipWithUniqueId, which will not count but will generate monotonically increasing ids.
If you just want the records numbered, it's best to do it on the local system.
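A sketch of that route, shown in PySpark for brevity (the question's code is Scala, but the RDD API is the same): zipWithIndex produces consecutive 0-based indices at the cost of an extra Spark job, while zipWithUniqueId is cheaper but its ids are unique rather than consecutive.
from pyspark import SparkContext

sc = SparkContext(appName="states")  # assumes a local Spark installation

states = sc.textFile("states.txt")

# (state_id, state_name) pairs with consecutive 1-based ids,
# assigned consistently across all partitions
statesRDD = states.zipWithIndex().map(lambda t: (t[1] + 1, t[0]))

statesRDD.count()  # 50 distinct keys, so all 50 rows survive as primary keys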
Edit: Why can't I use an accumulator?
Accumulators store their state (surprise) per task. You can see this with a little example:
val x = sc.accumulator(0, "x")
sc.parallelize(1 to 50).foreachPartition { it => it.foreach(y => x += 1); println(x) }
/*
6
7
6
6
6
6
6
7
*/
x.value
// res38: Int = 50
The accumulators combine their state after finishing their tasks, which means you can't use them as a global distributed counter.

How can I convert a 5-digit int date and 7-digit int time to a real date?

I've come across some data where the date for today's value is 77026 and the time (as of a few minutes ago) is 4766011. FYI: today is Fri, 18 Nov 2011 12:54:46 -0600
I can't figure out how these represent a date/time, and there is no supporting documentation.
How can I convert these numbers to a date value?
Some other dates from today are:
77026 | 4765509
77026 | 4765003
77026 | 4714129
77026 | 4617107
And some dates from what is probably yesterday:
77025 | 6292509
77025 | 6238790
77025 | 4009544
Ok, with your expanded examples, it would appear the first number is a day count. That'd put this time system's epoch at
to_days(today) = 734824
734824 - 77025 = 657799
from_days(657799) = Dec 29, 1800
The time values are problematic: it looks like they're decreasing (unless you listed most recent first?), but if they are some "# of intervals since midnight", then centiseconds would be likely. That'd give us a range of 0 - 8,640,000.
4765509 = 47655.09 seconds -> sec_to_time(47655) = 13:14:15
sec_to_time(47650.03) -> 13:14:10
sec_to_time(47141.29) -> 13:05:41
sec_to_time(46171.07) -> 12:49:31
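Putting the two guesses together in a small Python sketch (anchoring day 77026 to 2011-11-18 comes straight from the question; centiseconds since midnight is the hypothesis above):
from datetime import datetime, timedelta

# day 77026 is 2011-11-18 according to the question, which pins the epoch
# (this lands on 1800-12-28, one day off the from_days estimate above,
# depending on which sample is used to anchor day 0)
EPOCH = datetime(2011, 11, 18) - timedelta(days=77026)

def decode(day_count, centiseconds):
    return EPOCH + timedelta(days=day_count, seconds=centiseconds / 100)

print(decode(77026, 4765509))  # 2011-11-18 13:14:15.090000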