I am following the documentation, and thanks to | line_format and regexReplaceAll I was able to extract a substring from a line.
Let's say I now have this column:
line
123
7
123
54
14
Given that, I want to perform some transform operation, e.g. a sum, or a transform with grouping and taking the total.
It is not working; I suspect those values are not numbers but only strings.
Is it possible to convert them to numbers?
I tried using unwrap, but it didn't work:
sum_over_time(
{service="some"}
|="text expression"
| json
| line_format `{{ regexReplaceAll "text expression to remove from (\\d+)" .label_id "${1}" | trim }}`
| unwrap label_id [1m]
)
It ends up with
pipeline error: 'SampleExtractionErr' for series:
and when I filter out errors, there are no results.
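One possible approach, sketched here without being verified against the data above and assuming the json stage creates the label_id label as in the query, is to rewrite the label itself with label_format instead of rewriting the log line, so that label_id holds only the digits before unwrap runs:
sum_over_time(
{service="some"}
|="text expression"
| json
| label_format label_id=`{{ regexReplaceAll "text expression to remove from (\\d+)" .label_id "${1}" | trim }}`
| unwrap label_id
| __error__="" [1m]
)
unwrap can then parse the remaining digits as a number; keeping the __error__ filter helps surface any lines where the regex leaves non-numeric text behind.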
I want to convert the prefix from 222.. to 999.. in PySpark.
I expect a new column new_id with the prefix changed to 999...
I will be using this column for an inner merge between 2 PySpark dataframes:
+-------------+-------------+
|id           |new_id       |
+-------------+-------------+
|2222238308750|9999938308750|
|222222579844 |999999579844 |
|222225701296 |999995701296 |
|2222250087899|9999950087899|
|2222250087899|9999950087899|
|2222237274658|9999937274658|
|22222955099  |99999955099  |
|22222955099  |99999955099  |
|22222955099  |99999955099  |
|222285678    |999985678    |
+-------------+-------------+
You can achieve it with something like this:
# First calculate the number of "2"s from the start until some other value is found; e.g. '2223' should give you 3 as the length
# Use that calculated value to repeat the "9" that many times
# Replace the starting "2"s with the calculated "9" string
# Finally drop all the intermediate columns
df.withColumn("len_2", F.length(F.regexp_extract(F.col("value"), r"^2*(?!2)", 0)).cast('int'))\
.withColumn("to_replace_with", F.expr("repeat('9', len_2)"))\
.withColumn("new_value", F.expr("regexp_replace(value, '^2*(?!2)', to_replace_with)")) \
.drop("len_2", "to_replace_with")\
.show(truncate=False)
Output:
+-------------+-------------+
|value |new_value |
+-------------+-------------+
|2222238308750|9999938308750|
|222222579844 |999999579844 |
|222225701296 |999995701296 |
|2222250087899|9999950087899|
|2222250087899|9999950087899|
|2222237274658|9999937274658|
|22222955099 |99999955099 |
|22222955099 |99999955099 |
|22222955099 |99999955099 |
|222285678 |999985678 |
+-------------+-------------+
I have used value as the column name; you would have to substitute it with id.
You can try the following:
from pyspark.sql.functions import *
df = df.withColumn("tempcol1", regexp_extract("id", "^2*", 0)).withColumn("tempcol2", split(regexp_replace("id", "^2*", "_"), "_")[1]).withColumn("new_id", concat((regexp_replace("tempcol1", "2", "9")), "tempcol2")).drop("tempcol1", "tempcol2")
The id column is split into two temp columns, one having the prefix and the other the rest of the string. The prefix column values are replaced and concatenated back with the second temp column.
I have got a string that represents a query. It begins with a function, and the argument is a dictionary.
"runQuery `syms`columns`fastQuery`exchange!((`AAPL`MSFT`GOOG`AMD);(`sym`price`date);1b;`nasdaq)"
How can I extract the dictionary from the string, and save it in kdb as a dictionary type?
Parse the string to get the parse tree, and then take the parameter (the dict):
q)eval last parse "runQuery `syms`columns`fastQuery`exchange!((`AAPL`MSFT`GOOG`AMD);(`sym`price`date);1b;`nasdaq)"
syms | `AAPL`MSFT`GOOG`AMD
columns | `sym`price`date
fastQuery| 1b
exchange | `nasdaq
For this example, value can have the desired effect:
q)myDict:value {(first where x=" ")_x}"runQuery `syms`columns`fastQuery`exchange!((`AAPL`MSFT`GOOG`AMD);(`sym`price`date);1b;`nasdaq)"
q)myDict
syms | `AAPL`MSFT`GOOG`AMD
columns | `sym`price`date
fastQuery| 1b
exchange | `nasdaq
If you have free rein to (re)define the function then you could just do:
q)runQuery:{x}
q)value"runQuery `syms`columns`fastQuery`exchange!((`AAPL`MSFT`GOOG`AMD);(`sym`price`date);1b;`nasdaq)"
syms | `AAPL`MSFT`GOOG`AMD
columns | `sym`price`date
fastQuery| 1b
exchange | `nasdaq
This could be quite useful if you're replaying a tickerplant-style log using -11!
I have log lines that contain a few timestamp fields. Here is an example of a log line I am filtering in order to process it:
{
"time": "2022-06-22T10:33:08.710037238Z",
"#version": "1",
"message": "Duration at response processing ends",
"logger_name": "com.mks.cloud.filters.MainFilter",
"thread_name": "reactor-http-epoll-1",
"level": "INFO",
"level_value": 20000,
"rqst_id_ts": "b65c37d9284584e71b1dcd84b6a74075",
"rqst_end_ts": "1655893988698",
"rqst_start_ts": "1655893988698",
"rsp_start_ts": "1655893988709",
"rsp_end_ts": "1655893988709"
}
What I would like to do is calculate a value representing the duration between 2 timestamps in the log line, so that I can then feed the result into quantile_over_time() or any other aggregate function over a range.
For instance, the following works:
quantile_over_time(0.99, {app="myapp"} |~ "rsp_end_ts"
| json
| __error__ = ""
| unwrap rsp_end_ts | __error__="" [5m]) by (tsNs)
However this is not what I want to do since calculating the quantile of epoch timestamps makes no sense. What I want to calculate is the p99 of (rsp_end_ts - rqst_start_ts).
I tried the following which of course doesn't work, but gives an idea as to what I am attempting to do:
quantile_over_time(0.99, {app="myapp"} |~ "rsp_end_ts"
| json
| __error__ = ""
| unwrap (rsp_end_ts - rqst_start_ts) | __error__="" [5m]) by (tsNs)
If there were somehow a way to create a new label like rqst_duration=(rsp_end_ts - rqst_start_ts),
then the following would be what I am looking for:
quantile_over_time(0.99, {app="myapp"} |~ "rsp_end_ts"
| json
| __error__ = ""
| unwrap rqst_duration | __error__="" [5m]) by (tsNs)
I couldn't find any documentation about this, which is very surprising; I would have thought (but it seems I might be wrong) that this is a common use case. Any help would be greatly appreciated :).
You can use template functions for that. Here's a sample query on the Grafana playground that you can take inspiration from.
So your query will look something like:
{app="myapp"} |~ "rsp_end_ts"
| json
| label_format result=`{{ sub .rsp_end_ts .rqst_start_ts }}` | line_format "{{ .result }}"
Then you can use the result.
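Building on that, here is a sketch of how the computed value could feed quantile_over_time, assuming the numeric template functions such as sub are available in your Loki version and reusing the label names from the question: write the difference into a label and unwrap it.
quantile_over_time(0.99, {app="myapp"} |~ "rsp_end_ts"
| json
| __error__ = ""
| label_format rqst_duration=`{{ sub .rsp_end_ts .rqst_start_ts }}`
| unwrap rqst_duration | __error__="" [5m])
A by (...) grouping can be added back as in the original queries if needed.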
I have a column that contains a Month-Year string that I would like to convert to an actual date representing the first day of the Month and Year combination. For example
+----------+------------+
| Original | Desired |
+----------+------------+
| Aug-19 | 08/01/2019 |
+----------+------------+
| Sep-20 | 09/01/2020 |
+----------+------------+
| May-22 | 05/01/2022 |
+----------+------------+
I have tried breaking apart the Month-Year string using split_part, but when I try to pass the month as a parameter into date_parse it throws an error (INVALID_FUNCTION_ARGUMENT). I could break the Month-Year apart into strings and then recombine them, hard-coding the 01; however, the problem seems to be that the three-letter month cannot be parsed into an actual month by Presto. I also want to avoid a 12-line CASE WHEN statement to parse the month if possible.
The two-digit suffix is the year, so parse it with %y; the day of the month then defaults to the first, and the query will be like this:
select date_format(date_parse('May-22', '%b-%y'), '%m/%d/%Y')
https://trino.io/docs/current/functions/datetime.html?mysql-date-functions
I'm getting the following error from Redshift.
Decimal: Integral number too large
This is happening when inserting the following csv line
2015-03-20,A_M300X250CONTENT_INT_ADSENSE,3443,3443,1.4,13,,
The error is being thrown by 1.4.
The definition of that column is this:
schemaName | tablename | column          | type         | encoding | distkey | sortkey | notnull
-----------|-----------|-----------------|--------------|----------|---------|---------|--------
public     | partners  | revenue_partner | numeric(7,7) | none     | false   | 0       | false
This copy worked fine when the type was numeric(7,2), but I need to change it to fix a rounding error.
numeric(7,7) means the total number of digits allowed is 7 and all 7 are allocated to decimal places, so only values smaller than 1 fit. If you want 7 digits before the decimal point and 7 after, you need numeric(14,7).
Reading the docs http://docs.aws.amazon.com/redshift/latest/dg/r_Numeric_types201.html
It looks like a numeric(7,7) data type can only store values between 0 and 1 with 7 significant figures. The second number (the scale) is the number of digits allowed after the decimal point, and the first number minus the second (precision minus scale) is the number of digits allowed before it; with numeric(7,7) that leaves 0 digits before the decimal point, which is why 1.4 is rejected.