Mathematical operation with PySpark

I have a structure in an RDD that contains a record time like this: 02:00:30.
I want to convert this format to seconds, i.e. compute 02 * 3600 + 00 * 60 + 30.
Could someone please help me do this in PySpark? Thank you in advance.

Map it:
rdd = rdd.map(lambda row: 3600 * int(row[0].split(':')[0])
                          + 60 * int(row[0].split(':')[1])
                          + int(row[0].split(':')[2]))
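A slightly tidier variant (just a sketch, assuming each row's first field holds the HH:MM:SS string) splits only once:

def to_seconds(t):
    h, m, s = (int(x) for x in t.split(':'))  # '02:00:30' -> 2, 0, 30
    return 3600 * h + 60 * m + s              # 2*3600 + 0*60 + 30 = 7230

rdd = rdd.map(lambda row: to_seconds(row[0]))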

Related

PySpark dataframe filter based on in-between values

I have a PySpark dataframe with the values below:
[Row(id='ABCD123', score='28.095238095238095'), Row(id='EDFG456', score='36.2962962962963'), Row(id='HIJK789', score='37.56218905472637'), Row(id='LMNO1011', score='36.82352941176471')]
I want only the rows of the DF whose score falls between an input score value and that value + 1. Say the input score value is 36; then I want an output DF with only two ids, EDFG456 and LMNO1011, since their scores fall between 36 and 37. I achieved this as follows:
from pyspark.sql.functions import substring

input_score_value = 36
input_df = my_df.withColumn("score_num", substring(my_df.score, 1, 2))
output_matched = input_df.filter(input_df.score_num == input_score_value)
print(output_matched.take(5))
The above code gives the output below, but it takes too long to process 2 million rows. Is there a better way to do this that would reduce the response time?
[Row(id='EDFG456', score='36.2962962962963'), Row(id='LMNO1011',score='36.82352941176471')]
You can use the function floor.
from pyspark.sql.functions import floor
output_matched = my_df.filter(floor(my_df.score) == input_score_value)
print(output_matched.take(5))
It should be much faster than substring. Let me know.
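For reference, a minimal end-to-end sketch (assuming an existing SparkSession named spark, and that score is stored as a string as in the sample rows):

from pyspark.sql.functions import col, floor

df = spark.createDataFrame(
    [("ABCD123", "28.095238095238095"),
     ("EDFG456", "36.2962962962963"),
     ("HIJK789", "37.56218905472637"),
     ("LMNO1011", "36.82352941176471")],
    ["id", "score"])

input_score_value = 36
# cast the string score to double so floor operates on a numeric value
output_matched = df.filter(floor(col("score").cast("double")) == input_score_value)
print(output_matched.take(5))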

Filtering two different-sized datasets by timestamp using MATLAB

I have two very large MATLAB datasets. The datasets contain different parameters, and the only common parameter is the timestamp (every parameter is measured at 10-minute intervals). As an example:
In dataset 1, I have a timestamp (YYYY-MM-DD HH:MM:SS format) and power.
In dataset 2, I again have a timestamp (same format) and speed.
I want a new dataset that has power and speed synchronized on the timestamp. For example:
TimeStamp           P     S
2014-01-01 00:10    100   5
2014-01-01 00:20          7
2014-01-01 00:30    150   10
2014-01-01 00:40    200
2014-01-01 00:50    145   12
2014-01-01 01:00    50    7
2014-01-01 01:10          6
etc.
So in short, the final dataset must look like:
TimeStamp           P     S
2014-01-01 00:10    100   5
2014-01-01 00:30    150   10
2014-01-01 00:50    145   12
So basically, if both power and speed exist for the same timestamp, the row should be kept; otherwise it should be filtered out.
And will it work if the two datasets have different numbers of observations? Even though they might have different sizes, I want only the rows whose P and S match on the timestamp; any timestamp missing either value should be excluded from the final database.
Can anyone help me with this in MATLAB? Thanks in advance.
You could try something like this:
%type "help ismember" in command window to see what the function does
%finds index of timestamp in dataset1 that exists in dataset 2
indexPinS = ismember(dataset1(:,1),dataset2(:,1));
%finds index of timestamp in dataset2 that exists in dataset 1
indexSinP = ismember(dataset2(:,1),dataset1(:,1));
%combines data in final database
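% (this lines up correctly provided both datasets are sorted by timestamp)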
finalDatabase = [dataset1(indexPinS,1), dataset1(indexPinS,2), dataset2(indexSinP,2)];
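For comparison, the same inner-join idea in Python/pandas (just a conceptual sketch with made-up values; the question itself asks for MATLAB):

import pandas as pd

ds1 = pd.DataFrame({"timestamp": ["00:10", "00:30", "00:40"], "P": [100, 150, 200]})
ds2 = pd.DataFrame({"timestamp": ["00:10", "00:30", "00:50"], "S": [5, 10, 12]})

# an inner merge keeps only the timestamps present in both datasets,
# regardless of how many rows each dataset has
final = ds1.merge(ds2, on="timestamp", how="inner")
print(final)  # rows for 00:10 and 00:30 only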

Apache Spark: Exponential Moving Average

I am writing an application in Spark/Scala in which I need to calculate the exponential moving average of a column.
EMA_t = (price_t * 0.4) + (EMA_t-1 * 0.6)
The problem I am facing is that I need the previously calculated value (EMA_t-1) of the same column. In MySQL this would be possible by using MODEL or by creating an EMA column which you then update row by row, but I've tried this and neither works with the Spark SQL or Hive context... Is there any way I can access this EMA_t-1?
My data looks like this :
timestamp price
15:31 132.3
15:32 132.48
15:33 132.76
15:34 132.66
15:35 132.71
15:36 132.52
15:37 132.63
15:38 132.575
15:39 132.57
So I would need to add a new column whose first value is just the price of the first row, and then apply EMA_t = (price_t * 0.4) + (EMA_t-1 * 0.6), using the previous value, to fill in the following rows of that column.
My EMA column would have to be:
EMA
132.3
132.372
132.5272
132.58032
132.632192
132.5873152
132.6043891
132.5926335
132.5835801
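(To check the recursion: the second value is 132.48 * 0.4 + 132.3 * 0.6 = 132.372, and each subsequent row feeds the previous EMA back in the same way.)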
I am currently trying to do it using Spark SQL and Hive but if it is possible to do it in another way, this would be just as welcome! I was also wondering how I could do this with Spark Streaming. My data is in a dataframe and I am using Spark 1.4.1.
Thanks a lot for any help provided!
To answer your question:
The problem I am facing is that I need the previously calculated value (EMA_t-1) of the same column
I think you need two functions: Window and lag. (I also turn null values into zero for convenience when calculating the EMA.)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, lit, when}

val my_window = Window.orderBy("timestamp")
df.withColumn("price_lag_1",
  when(lag(col("price"), 1).over(my_window).isNull, lit(0))
    .otherwise(lag(col("price"), 1).over(my_window)))
I am also new to Spark/Scala, and am trying to see if I can define a UDF to do the exponential average. But for now an obvious workaround is to manually add up all the lag columns (0.4*lag0 + 0.4*0.6*lag1 + 0.4*0.6^2*lag2 ...), something like this:
df.withColumn("ema_price",
price * lit(0.4) * Math.pow(0.6,0) +
lag(col("price"),1).over(my_window) * 0.4 * Math.pow(0.6,1) +
lag(col("price"),2).over(my_window) * 0.4 * Math.pow(0.6,2) + .... )
I omitted the when/otherwise null handling above to keep it clearer. This method works for me for now.
----Update----
def emaFunc(y: org.apache.spark.sql.Column, group: org.apache.spark.sql.Column,
            order: org.apache.spark.sql.Column, beta: Double, lookBack: Int): org.apache.spark.sql.Column = {
  val ema_window = Window.partitionBy(group).orderBy(order)
  var result = y  // the current row starts with full weight
  for (i <- 1 until lookBack) {
    val lagged = lag(y, i).over(ema_window)
    val weight = beta * Math.pow(1 - beta, i)
    // where lag i exists, shift weight from the current row onto the lagged value
    result = result + when(lagged.isNull, lit(0)).otherwise(lagged * weight - y * weight)
  }
  result
}
By using this function you should be able to get the EMA of the price like:
df.withColumn("one", lit(1))
  .withColumn("ema_price", emaFunc('price, 'one, 'timestamp, 0.1, 10))
This looks back 10 rows and calculates an approximate EMA with beta = 0.1. The column "one" is just a placeholder, since you don't have a grouping column.
You should be able to do this with Spark Window Functions, which were introduced in 1.4: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
from pyspark.sql.functions import col, lag
from pyspark.sql.window import Window

w = Window.partitionBy().orderBy(col("timestamp"))
df.select("*", lag("price").over(w).alias("ema"))
This selects the previous price for each row, so you can do your calculation on it.
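If you need the exact recursive EMA rather than a truncated approximation, a minimal sketch (assuming the series is small enough to collect to the driver once sorted by timestamp) is:

alpha = 0.4
rows = df.orderBy("timestamp").select("price").collect()

ema_values = []
for row in rows:
    prev = ema_values[-1] if ema_values else row["price"]  # first EMA = first price
    ema_values.append(alpha * row["price"] + (1 - alpha) * prev)

This reproduces the expected column above (132.3, 132.372, 132.5272, ...) at the cost of giving up distributed execution for this step.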

Converting a time into a Double in Swift

I am new to Swift and stuck on a little problem.
I have the current hour and minute(s) as Ints, and would like to know how to put them together to create one whole number, like military time.
WITHOUT CONVERTING TO STRING, as I need to be able to compare (using < and >) later on.
EXAMPLE:
Current Hour 02
Current Minute 24
I would like 0224
Multiply the current hour by 100, then add the number of minutes. From your example:
02 * 100 = 200
200 + 24 = 224
Here is a simple example:
let hour = 2
let minute = 24
let military = hour * 100 + minute  // 224; stays an Int, so < and > comparisons still work
print(military)

Converting time to milliseconds?

I have some times as strings in my data. Can anyone help me convert these to milliseconds in MATLAB?
Here is an example of how the time looks: '00:26:16:926', i.e. 0 hours, 26 minutes, 16 seconds and 926 milliseconds. After converting, I need only the milliseconds, e.g. 1576926 milliseconds for the time given above. Thank you in advance.
Why don't you try using datevec instead? datevec is designed to take in various time and date strings and it parses the string and spits out useful information for you. There's no need to use regexp or split up your string in any way. Here's a quick example:
[~,~,~,hours,minutes,seconds] = datevec('00:26:16:926', 'HH:MM:SS:FFF');
out = 1000*(3600*hours + 60*minutes + seconds);
out =
1576926
In this format, the output of datevec will be a 6 element vector which outputs the year, month, day, hours, minutes and seconds respectively. The millisecond resolution will be added on to the sixth element of datevec's output, so all you have to do is convert the fourth to sixth elements into milliseconds and add them all up, which is what is done above. If you don't specify the actual day, it just defaults to January 1st of the current year... but we're not using the date anyway... we just want the time!
The beauty with datevec is that it can accept multiple strings so you're not just limited to a single input. Simply put all of your strings into a single cell array, then use datevec in the following way:
times = {'00:26:16:926','00:27:16:926', '00:28:16:926'};
[~,~,~,hours,minutes,seconds] = datevec(times, 'HH:MM:SS:FFF');
out = 1000*(3600*hours + 60*minutes + seconds);
out =
1576926
1636926
1696926
One solution could be:
timeString = '00:26:16:926';
cellfun(@str2num, regexp(timeString, ':', 'split')) * [3600000; 60000; 1000; 1]
Result:
1576926
Assuming that your date string comes in that format consistently, you could use something as simple as this:
test = '00:26:16:926';
H = str2num(test(1:2)); % hours
M = str2num(test(4:5)); % minutes
S = str2num(test(7:8)); % seconds
MS = str2num(test(10:12)); % milliseconds
totalMS = MS + 1000*S + 1000*60*M + 1000*60*60*H;
Output:
1576926.00
You can convert a single date string, or even a vector of them, by using datevec for the conversion and a dot product:
a = ['00:26:16:926' ; '08:42:12:936']
datevec(a,'HH:MM:SS:FFF') * [0 0 0 3600e3 60e3 1e3]'
ans =
1576926
31332936
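As a side note, the same conversion is a couple of lines of plain Python (the question itself asks for MATLAB, so this is just for comparison):

h, m, s, ms = (int(x) for x in '00:26:16:926'.split(':'))
total_ms = ((h * 60 + m) * 60 + s) * 1000 + ms  # 1576926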