Pandas edit dataframe - mongodb

I am querying a MongoDB collection with two queries and appending the results to get a single data frame (the keys are: status, date, uniqueid).
import pandas as pd

for record in results:
    query1 = record["sensordata"]["user"]
    df1 = pd.DataFrame(query1.items())
    query2 = record["created_date"]
    df2 = pd.DataFrame(query2.items())
    index = "status"
    result = df1.append(df2, index)
    b = result.transpose()
    print b
    b.to_csv(q)
The output is:
0 1 2
0 status uniqueid date
1 0 191b117fcf5c 2017-03-01 17:51:08.263000
0 1 2
0 status uniqueid date
1 1 191b117fcf5c 2017-03-01 17:51:17.216000
0 1 2
0 status uniqueid date
1 1 191b117fcf5c 2017-03-01 17:51:23.269000
0 1 2
0 status uniqueid date
1 1 191b117fcf5c 2017-03-01 18:26:17.216000
0 1 2
0 status uniqueid date
1 1 191b117fcf5c 2017-03-01 18:26:21.130000
0 1 2
0 status uniqueid date
1 0 191b117fcf5c 2017-03-01 18:26:28.217000
How do I remove these extra 0, 1, 2 labels in the rows and columns?
Also, I don't want status, uniqueid and date to repeat every time.
My desired output should be like this:
status uniqueid date
0 191b117fcf5c 2017-03-01 18:26:28.217000
1 191b117fcf5c 2017-03-01 19:26:28.192000
1 191b117fcf5c 2017-04-01 11:16:28.222000
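A minimal sketch of one way to get that shape (assuming each record["sensordata"]["user"] is a dict holding status and uniqueid, and that results and the output path q are the same objects as in the code above): build one dict per record, create a single DataFrame, and write the header only once with the integer index dropped.

import pandas as pd

rows = []
for record in results:
    row = dict(record["sensordata"]["user"])   # assumed to contain status and uniqueid
    row["date"] = record["created_date"]
    rows.append(row)

df = pd.DataFrame(rows, columns=["status", "uniqueid", "date"])
print df
df.to_csv(q, index=False)   # index=False drops the 0, 1 row labels; the header appears once

Writing the CSV once, outside the loop, is also what stops the header row from repeating for every record.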

Related

MySQL query to fill out the time gaps between the records

I want to write an optimized query to fill out the time gaps between the records with the stock value that is most recent to each date.
The requirement is to have the latest stock value for every group of id_warehouse, id_stock, and date. The table is quite large (2 million records) and keeps growing, so I would like to optimize the query I have added below.
daily_stock_levels:

date         id_stock  id_warehouse  new_stock  is_stock_available
2022-01-01   1         1             24         1
2022-01-01   1         1             25         1
2022-01-01   1         1             29         1
2022-01-02   1         1             30         1
2022-01-06   1         1             27         1
2022-01-09   1         1             26         1
Result:

date         id_stock  id_warehouse  closest_date_with_stock_value  most_recent_stock_value
2022-01-01   1         1             29                             1
2022-01-02   1         1             30                             1
2022-01-03   1         1             30                             1
2022-01-04   1         1             30                             1
2022-01-05   1         1             30                             1
2022-01-06   1         1             27                             1
2022-01-07   1         1             27                             1
2022-01-07   1         1             27                             1
2022-01-09   1         1             26                             1
2022-01-10   1         1             26                             1
...
2022-08-08   1         1             26                             1
SELECT
    sl.date,
    sl.id_warehouse,
    sl.id_item,
    (SELECT s.date
     FROM daily_stock_levels s
     WHERE s.is_stock_available = 1
       AND sl.id_warehouse = s.id_warehouse
       AND sl.id_item = s.id_item
       AND sl.date >= s.date
     ORDER BY s.date DESC
     LIMIT 1) AS closest_date_with_stock_value,
    (SELECT s.new_stock
     FROM daily_stock_levels s
     WHERE s.is_stock_available = 1
       AND sl.id_warehouse = s.id_warehouse
       AND sl.id_item = s.id_item
       AND sl.date >= s.date
     ORDER BY s.date DESC
     LIMIT 1) AS most_recent_stock_value
FROM daily_stock_levels sl
GROUP BY sl.id_warehouse,
         sl.id_item,
         sl.date
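One possible direction, sketched rather than tested (it assumes MySQL 8.0.14+ for LATERAL and keeps the id_item naming from the query above): do the correlated lookup once per (warehouse, item, date) key so that both output columns come from a single LIMIT 1 scan.

SELECT k.date,
       k.id_warehouse,
       k.id_item,
       latest.date      AS closest_date_with_stock_value,
       latest.new_stock AS most_recent_stock_value
FROM (SELECT DISTINCT date, id_warehouse, id_item
      FROM daily_stock_levels) AS k
LEFT JOIN LATERAL (
    -- latest available stock row on or before k.date
    SELECT d.date, d.new_stock
    FROM daily_stock_levels AS d
    WHERE d.is_stock_available = 1
      AND d.id_warehouse = k.id_warehouse
      AND d.id_item = k.id_item
      AND d.date <= k.date
    ORDER BY d.date DESC
    LIMIT 1
) AS latest ON TRUE;

An index on (id_warehouse, id_item, is_stock_available, date) should turn each lateral lookup into a single index seek. Note that, like the original query, this only produces rows for dates that already exist in daily_stock_levels; the missing calendar days (2022-01-03, 2022-01-04, ...) still need a calendar table or a recursive date CTE to join against.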

PySpark UDF does not return expected result

I have a Databricks dataframe with multiple columns and a UDF that generates the contents of a new column, based on values from other columns.
A sample of the original dataset is:
interval_group_id control_value pulse_value device_timestamp
2797895314 5 5 2020-09-12 09:08:44
0 5 5 2020-09-12 09:08:45
0 6 5 2020-09-12 09:08:46
0 0 5 2020-09-12 09:08:47
Now I am trying to add a new column, called group_id, based on some logic with the columns above. My UDF code is:
from pyspark.sql.functions import udf

@udf('integer')
def udf_calculate_group_id_new(interval_group_id, prev_interval_group_id, control_val, pulse_val):
    if interval_group_id != 0:
        return interval_group_id
    elif control_val >= pulse_val and prev_interval_group_id != 0:
        return prev_interval_group_id
    else:
        return -1
And the new column is added to my dataframe with:

from pyspark.sql.functions import lag, col
from pyspark.sql.window import Window

df = df.withColumn('group_id',
                   udf_calculate_group_id_new(
                       df.interval_group_id,
                       lag(col('interval_group_id')).over(Window.orderBy('device_timestamp')),
                       df.control_value,
                       df.pulse_value))
My expected results are:
interval_group_id control_value pulse_value device_timestamp group_id
2797895314 5 5 2020-09-12 09:08:44 2797895314
0 5 5 2020-09-12 09:08:45 2797895314
0 6 5 2020-09-12 09:08:46 2797895314
0 0 5 2020-09-12 09:08:47 -1
However, the results of adding the new group_id column are:
interval_group_id control_value pulse_value device_timestamp group_id
2797895314 5 5 2020-09-12 09:08:44 null
0 5 5 2020-09-12 09:08:45 null
0 6 5 2020-09-12 09:08:46 -1
0 0 5 2020-09-12 09:08:47 -1
My goal is to propagate the value 2797895314 down the group_id column, based on the conditions mentioned above, but somehow this doesn't happen and the results are populated with null and -1 incorrectly.
Is this a bug with UDFs, or is my expectation of how the UDF works incorrect? Or am I just bad at coding?
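Not an authoritative diagnosis, but two details are worth checking here. First, the UDF is declared with the 'integer' return type, which is 32-bit; 2797895314 does not fit in 32 bits, so the rows that should return it likely come back as null (declaring 'long' would avoid that). Second, lag() only looks one row back, so a 0 in the previous row stops the propagation. A hedged sketch of the same propagation with built-in window functions instead of a UDF, assuming the column names shown above:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy('device_timestamp')

df = (df
      # carry the most recent non-zero interval_group_id down the ordered rows
      .withColumn('last_grp',
                  F.last(F.when(F.col('interval_group_id') != 0,
                                F.col('interval_group_id')),
                         ignorenulls=True).over(w))
      .withColumn('group_id',
                  F.when(F.col('interval_group_id') != 0, F.col('interval_group_id'))
                   .when((F.col('control_value') >= F.col('pulse_value'))
                         & F.col('last_grp').isNotNull(), F.col('last_grp'))
                   .otherwise(F.lit(-1)))
      .drop('last_grp'))

last(..., ignorenulls=True) keeps carrying 2797895314 forward past intermediate zeros, which is what lets it reach the 09:08:46 row in the expected output.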

Postgres: for each row evaluate all successive rows under conditions

I have this table:
id | datetime | row_number
1 2018-04-09 06:27:00 1
1 2018-04-09 14:15:00 2
1 2018-04-09 15:25:00 3
1 2018-04-09 15:35:00 4
1 2018-04-09 15:51:00 5
1 2018-04-09 17:05:00 6
1 2018-04-10 06:42:00 7
1 2018-04-10 16:39:00 8
1 2018-04-10 18:58:00 9
1 2018-04-10 19:41:00 10
1 2018-04-14 17:05:00 11
1 2018-04-14 17:48:00 12
1 2018-04-14 18:57:00 13
I'd like to count, for each row, the successive rows with time <= '01:30:00', and start the successive evaluation again from the first row that doesn't meet the condition.
I'll try to explain the question better.
Using the window function lag():
SELECT id, datetime,
       CASE WHEN datetime - lag(datetime, 1) OVER (PARTITION BY id ORDER BY datetime)
                 < '01:30:00' THEN 1 ELSE 0 END AS count
FROM table
result is:
id | datetime | count
1 2018-04-09 06:27:00 0
1 2018-04-09 14:15:00 0
1 2018-04-09 15:25:00 1
1 2018-04-09 15:35:00 1
1 2018-04-09 15:51:00 1
1 2018-04-09 17:05:00 1
1 2018-04-10 06:42:00 0
1 2018-04-10 16:39:00 0
1 2018-04-10 18:58:00 0
1 2018-04-10 19:41:00 1
1 2018-04-14 17:05:00 0
1 2018-04-14 17:48:00 1
1 2018-04-14 18:57:00 1
But it's not ok for me, because I want to exclude row_number 5: the interval between row_number 5 and row_number 2 is > '01:30:00', so the new evaluation should start from row_number 5.
The same goes for row_number 13.
The right output could be:
id | datetime | count
1 2018-04-09 06:27:00 0
1 2018-04-09 14:15:00 0
1 2018-04-09 15:25:00 1
1 2018-04-09 15:35:00 1
1 2018-04-09 15:51:00 0
1 2018-04-09 17:05:00 1
1 2018-04-10 06:42:00 0
1 2018-04-10 16:39:00 0
1 2018-04-10 18:58:00 0
1 2018-04-10 19:41:00 1
1 2018-04-14 17:05:00 0
1 2018-04-14 17:48:00 1
1 2018-04-14 18:57:00 0
So the right total count is 5.
I'd use a recursive query for this:
WITH RECURSIVE tmp AS (
    SELECT
        id,
        datetime,
        row_number,
        0 AS counting,
        datetime AS last_start
    FROM mytable
    WHERE row_number = 1

    UNION ALL

    SELECT
        t1.id,
        t1.datetime,
        t1.row_number,
        CASE WHEN lateral_1.counting THEN 1 ELSE 0 END AS counting,
        CASE WHEN lateral_1.counting THEN tmp.last_start ELSE t1.datetime END AS last_start
    FROM mytable AS t1
    INNER JOIN tmp
        ON (t1.id = tmp.id AND t1.row_number - 1 = tmp.row_number),
    LATERAL (SELECT (t1.datetime - tmp.last_start) < '1h 30m'::interval AS counting) AS lateral_1
)
SELECT id, datetime, counting
FROM tmp
ORDER BY id, datetime;
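If what's ultimately needed is only the per-id total (the 5 mentioned above), the final SELECT after the recursive CTE can be swapped for an aggregate, e.g.:

-- replaces the final SELECT of the recursive query above
SELECT id, SUM(counting) AS total_count
FROM tmp
GROUP BY id;   -- 5 for id = 1 with the sample data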

KDB: String comparison with a table

I have a table bb:
bb:([]key1: 0 1 2 1 7; col1: 1 2 3 4 5; col2: 5 4 3 2 1; col3:("11";"22" ;"33" ;"44"; "55"))
How do I do a relational comparison of string? Say I want to get records with col3 less than or equal to "33"
select from bb where col3 <= "33"
Expected result:
key1 col1 col2 col3
0 1 5 11
1 2 4 22
2 3 3 33
If you want col3 to remain of string type, then you can just cast temporarily within the qsql query:
q)select from bb where ("J"$col3) <= 33
key1 col1 col2 col3
-------------------
0 1 5 "11"
1 2 4 "22"
2 3 3 "33"
If you are looking for classical string comparison, regardless of whether the string is a number or not, I would propose the following approach:
a. Create methods which behave similarly to common Java Comparators: they return 0 when the strings are equal, -1 when the first string is less than the second one, and 1 when the first is greater than the second.
.utils.compare: {$[x~y;0;$[x~first asc (x;y);-1;1]]};
.utils.less: {-1=.utils.compare[x;y]};
.utils.lessOrEq: {0>=.utils.compare[x;y]};
.utils.greater: {1=.utils.compare[x;y]};
.utils.greaterOrEq: {0<=.utils.compare[x;y]};
b. Use them in where clause
bb:([]key1: 0 1 2 1 7;
col1: 1 2 3 4 5;
col2: 5 4 3 2 1;
col3:("11";"22" ;"33" ;"44"; "55"));
select from bb where .utils.greaterOrEq["33"]'[col3]
c. As you see below, this works for arbitrary strings
cc:([]key1: 0 1 2 1 7;
col1: 1 2 3 4 5;
col2: 5 4 3 2 1;
col3:("abc" ;"def" ;"tyu"; "55poi"; "gab"));
select from cc where .utils.greaterOrEq["ffff"]'[col3]
.utils.compare could also be written in vector form, though I'm not sure if it will be more time/memory efficient:
.utils.compareVector: {
?[x~'y;0;?[x~'first each asc each(enlist each x),'enlist each y;-1;1]]
};
One way would be to evaluate the strings before comparison:
q)bb:([]key1: 0 1 2 1 7; col1: 1 2 3 4 5; col2: 5 4 3 2 1; col3:("11";"22" ;"33" ;"44"; "55"))
q)bb
key1 col1 col2 col3
-------------------
0 1 5 "11"
1 2 4 "22"
2 3 3 "33"
1 4 2 "44"
7 5 1 "55"
q)
q)
q)select from bb where 33>=value each col3
key1 col1 col2 col3
-------------------
0 1 5 "11"
1 2 4 "22"
2 3 3 "33"
In this case value each returns the string values as integers, and then the comparison is performed.

Spark: restart counting on specific value

I have a DataFrame with Boolean records and I want to restart the counting when Goals = False/Null.
How can I get the Score column?
The Score column is a count of True values, with a reset on False/Null values.
My df:
Goals
Null
True
False
True
True
True
True
False
False
True
True
Expected Result:
Goals Score
Null 0
True 1
False 0
True 1
True 2
True 3
True 4
False 0
False 0
True 1
True 2
EDIT: adding more info
Actually my full dataset is:
Player Goals Date Score
1 Null 2017-08-18 10:30:00 0
1 True 2017-08-18 11:30:00 1
1 False 2017-08-18 12:30:00 0
1 True 2017-08-18 13:30:00 1
1 True 2017-08-18 14:30:00 2
1 True 2017-08-18 15:30:00 3
1 True 2017-08-18 16:30:00 4
1 False 2017-08-18 17:30:00 0
1 False 2017-08-18 18:30:00 0
1 True 2017-08-18 19:30:00 1
1 True 2017-08-18 20:30:00 2
2 False 2017-08-18 10:30:00 0
2 False 2017-08-18 11:30:00 0
2 True 2017-08-18 12:30:00 1
2 True 2017-08-18 13:30:00 2
2 False 2017-08-18 15:30:00 0
I've created a window to calculate the score by player on a certain date
val w = Window.partitionBy("Player","Goals").orderBy("date")
I've tried with the lag function and comparing the values, but I can't reset the count.
EDIT 2: added a unique Date per player
Thank you.
I finally solved the problem by grouping the goals that occur together.
I used a count over a partition containing the difference between the row index of the "table" and the row_number related to the partitioned window.
First, declare the window with the future columns to use:
val w = Window.partitionBy("player","goals","countPartition").orderBy("date")
Then populate the columns "countPartition" and "goals" with 1 to keep the rowNumber neutral:
val list1 = dataList.withColumn("countPartition", lit(1)).withColumn("goals", lit(1)).withColumn("index", rowNumber over w)
The UDF:
def div = udf((countInit: Int, countP: Int) => countInit - countP)
And finally calculate the score:
val score = list1
  .withColumn("goals", goals)
  .withColumn("countPartition", div(col("index"), rowNumber over w))
  .withColumn("Score",
    when(col("goals") === true, count("goals") over w)
      .otherwise(when(col("goals") isNull, "null").otherwise(0)))
  .orderBy("date")
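For comparison, the more common window-only pattern for this kind of reset (a sketch in PySpark, assuming the Player/Goals/Date columns from the edited dataset above, and not the exact solution used here): derive a block id from a running count of False/Null rows, then count the True values inside each block.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w_player = Window.partitionBy("Player").orderBy("Date")

# every False/Null goal closes the current block and starts a new one
blocks = df.withColumn(
    "block",
    F.sum(F.when((F.col("Goals") == F.lit(False)) | F.col("Goals").isNull(), 1)
           .otherwise(0)).over(w_player))

w_block = Window.partitionBy("Player", "block").orderBy("Date")

# running count of True goals inside the current block gives the Score
scored = blocks.withColumn(
    "Score",
    F.sum(F.when(F.col("Goals") == F.lit(True), 1).otherwise(0)).over(w_block)
).drop("block")

With the sample data above this should give 0 1 0 1 2 3 4 0 0 1 2 for player 1 and 0 0 1 2 0 for player 2, matching the Score column in the edit.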