JSONPath in Peewee - postgresql

Is there a way to use the Postgres JSONPath features in Peewee? I've searched the documentation and the Playhouse extension, but haven't found anything. Any resources or tips would be greatly appreciated!

You don't have to do anything special.
from peewee import Model, fn
from playhouse.postgres_ext import PostgresqlExtDatabase, BinaryJSONField
import json

db = PostgresqlExtDatabase('my_db')  # assumed database name

class Reg(Model):
    data = BinaryJSONField()
    class Meta:
        database = db

Reg.create(data={'a': [1, 2, 3, 4, 5]})
query = Reg.select(fn.jsonb_path_query(Reg.data, '$.a[*] ? (@ >= 2)'))
for row in query.tuples():
    print(row)
# (2,)
# (3,)
# (4,)
# (5,)
query = Reg.select(fn.jsonb_path_query(Reg.data, '$.a[*] ? (@ <= $m)', json.dumps({'m': 2})))
for row in query.tuples():
    print(row)
# (1,)
# (2,)
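If you need to filter rows on a path predicate rather than project the matching values, Postgres 12+ also provides jsonb_path_exists, which you can call the same way through fn. A minimal sketch against the same model:
query = Reg.select().where(fn.jsonb_path_exists(Reg.data, '$.a[*] ? (@ > 4)'))
for reg in query:
    print(reg.data)
# {'a': [1, 2, 3, 4, 5]}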

Related

Is it bad to use `GroupBy` multiple times in pyspark?

This is an educational question.
I have a text file containing records of the power consumption of factories, each identified by a unique id. The file contains the following columns:
factory_id, city, country, date, consumption
where date is in the format mm/YYYY. I want to compute which countries have fewer than 20 cities (including those with 0) that experienced a decrease in consumption across two consecutive years, where a city's consumption is the total yearly consumption of the factories located in that city.
To do this, I used a groupBy + agg several times, as follows:
import pyspark.sql.functions as F
import pyspark.sql.types as T

df = df.withColumn("year", F.split("date", "/")[1])

# compute the yearly consumption of each city
df_consump = df.groupBy("country", "city", "year").agg(
    F.sum("consumption").alias("consumption")
)

@F.udf(returnType=T.IntegerType())
def had_a_decrease(structs):
    structs = sorted(structs, key=lambda s: s.year)
    # return 0 if the list is monotonically growing, 1 otherwise
    cur_cons = structs[0].consumption
    for struct in structs[1:]:
        cons = struct.consumption
        if cons <= cur_cons:
            return 1
        cur_cons = cons
    return 0
df_cons_decrease = df_consump.groupBy("country", "city").agg(
    # here I collect a list of structs containing (year, consumption),
    # which is needed because collect_list doesn't guarantee the order
    # is respected, so I keep the year in order to sort this (small)
    # list first in the udf "had_a_decrease" defined above.
    # Eventually this yields a column with a 1 if we had a decrease, 0 otherwise,
    # which I sum afterwards.
    had_a_decrease(F.collect_list(F.struct("year", "consumption"))).alias("had_decrease")
)

df_cons_decrease.groupBy("country").agg(
    F.sum("had_decrease").alias("num_cities_with_decrease")
).filter("num_cities_with_decrease < 20") \
    .write.csv(outputFolder)
However, I was wondering:
is this bad practice (e.g. inefficient)?
are DataFrames better suited than RDDs for this?
would you recommend a better approach than grouping this many times?
Compare the consumption with the consumption one and two years earlier using Window and lag, with no UDF, and then group by. (lag returns NULL for the first rows of each partition, and the NULL comparisons make is_decreased false there, so cities with fewer than three years of data are never flagged.)
from pyspark.sql import functions as f
from pyspark.sql.window import Window

data = [
    [1, 1, 1, '01/2022', 100],
    [1, 1, 1, '01/2021', 90],
    [1, 1, 1, '01/2020', 80],
    [1, 1, 2, '01/2022', 100],
    [1, 1, 2, '01/2021', 110],
    [1, 1, 2, '01/2020', 120]
]
cols = ['factory_id', 'city', 'country', 'date', 'consumption']

df = spark.createDataFrame(data, cols) \
    .withColumn('year', f.split('date', '/')[1])

w = Window.partitionBy('country', 'city').orderBy('year')

df.groupBy('country', 'city', 'year') \
    .agg(f.sum('consumption').alias('consumption')) \
    .withColumn('consumption-1', f.lag('consumption', 1).over(w)) \
    .withColumn('consumption-2', f.lag('consumption', 2).over(w)) \
    .withColumn('is_decreased', f.expr('if(`consumption` < `consumption-1` and `consumption-1` < `consumption-2`, true, false)')) \
    .filter('is_decreased = true') \
    .select('country', 'city').distinct() \
    .groupBy('country').count() \
    .filter('count < 20') \
    .select('country') \
    .show()
+-------+
|country|
+-------+
| 2|
+-------+
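One caveat relative to the question: the filter + distinct step drops countries in which no city ever had a decrease, while the question asks for those to be counted as well (with 0 cities). A sketch of one way to keep them, reusing df and w from above and aggregating a per-city 0/1 flag instead of filtering:
yearly = df.groupBy('country', 'city', 'year') \
    .agg(f.sum('consumption').alias('consumption')) \
    .withColumn('consumption-1', f.lag('consumption', 1).over(w)) \
    .withColumn('consumption-2', f.lag('consumption', 2).over(w)) \
    .withColumn('is_decreased',
                f.expr('`consumption` < `consumption-1` and `consumption-1` < `consumption-2`'))

# a city is flagged if any year shows a two-year decrease; NULLs count as false
yearly.groupBy('country', 'city') \
    .agg(f.max(f.coalesce(f.col('is_decreased'), f.lit(False)).cast('int')).alias('city_decreased')) \
    .groupBy('country') \
    .agg(f.sum('city_decreased').alias('num_cities_with_decrease')) \
    .filter('num_cities_with_decrease < 20') \
    .select('country') \
    .show()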

Create words and their positions in PySpark

Hi, I am trying to create a string that contains each word and the positions at which it appears in the input string. I am able to do it in Python using the code below:
from collections import defaultdict
import re

s = 'Create a string with position from a string a'
wp = defaultdict(list)
for n, k in enumerate(s.split()):
    wp[k].append(n + 1)
raw_output = re.search('{(.*)}', str(wp)).group(1).replace('[', '').replace(']', '')
final_output = re.sub(r"(\d), '", r"\1 '", raw_output)
And the output is
"'Create': 1 'a': 2, 7, 9 'string': 3, 8 'with': 4 'position': 5 'from': 6"
How can I do the same in pyspark?
PySpark has a few additional concepts you might need to revisit; for your problem statement, the RDD API is a good fit.
Here is a code snippet that should work for you:
def positional_encoder(sentence):
    words = sentence.split(" ")
    # use 1-based positions to match the expected output above
    indexes = list(range(1, len(words) + 1))
    return list(zip(words, indexes))

data_rdd = sc.parallelize(["Create a string with position from a string a"])
words_index = data_rdd.map(lambda sentence: positional_encoder(sentence))
## Just for debugging:
words_index.collect()  ## Remove this after debugging
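If you would rather use the DataFrame API, a sketch with posexplode (my own construction, assuming a column named sentence) produces each word together with its 1-based positions:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Create a string with position from a string a",)], ["sentence"])

# posexplode yields (position, word) pairs; positions are 0-based, so add 1
df.select(F.posexplode(F.split("sentence", " ")).alias("pos", "word")) \
  .groupBy("word") \
  .agg(F.sort_array(F.collect_list(F.col("pos") + 1)).alias("positions")) \
  .show(truncate=False)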

Find a text and replace it with a value in Matlab

I have some data (shown as an image in the original post) whose entries are the responses 'Mostly false', 'Mostly true', and 'Definitely true'.
I would like to pre-process the data by replacing every 'Mostly false' with 1, every 'Mostly true' with 2, and every 'Definitely true' with 3. Is there a find-and-replace command, or what is the best way of doing this?
You can use a containers.Map object to do the mapping:
m = containers.Map( {'Mostly false', 'Mostly true', 'Definitely true'}, ...
                    {1, 2, 3} );
Then for some example data
data = {'Mostly false', 'Mostly false', 'Mostly true', 'Mostly false', 'Definitely true'};
You can perform the conversion with
data = m.values( data );
% >> data = {1, 1, 2, 1, 3}
This assumes there will always be a match in your map.
Alternatively, you could do the operation manually (for the same example data); this will leave non-matches unaltered, and you could use strcmpi for case-insensitivity:
c = {'Mostly false', 'Mostly true', 'Definitely true'
     1,              2,             3};
for ii = 1:size(c,2)
    % Make the replacement for each column in 'c'
    data( strcmp( data, c{1,ii} ) ) = c(2,ii);
end

Convert elements in a list to nested lists in a list

I have a list that needs to be converted into nested lists within a list.
my_list = [2,5,6,7,8,15,34,56,78]
I need the list as
final_list = [[2,5,6],[7,8,15],[34,56,78]]
I wrote code using a for loop with the len and range functions. I know there is an error in how I use range, but I couldn't figure it out.
my_list = [2,5,6,7,8,15,34,56,78]
max_split = 3
final_list = [[len(my_list) for _ in range(max_split)] for _ in range(max_split)]
print(final_list)
But the output I get is [[9, 9, 9], [9, 9, 9], [9, 9, 9]].
Your comprehension fills every slot with len(my_list), which is 9, and never indexes into my_list. You can try the following code, which slices the list in steps of max_split:
my_list = [2,5,6,7,8,15,34,56,78]
max_split = 3
final_list = [my_list[i:i + max_split] for i in range(0, len(my_list), max_split)]
print(final_list)
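Note that max_split is used here as the chunk size, which happens to coincide with the number of sublists for 9 elements. If you instead want max_split to mean the number of sublists, a small variant (my naming) computes the chunk size first with ceiling division:
# interpret max_split as the number of sublists rather than the chunk size
chunk = -(-len(my_list) // max_split)  # ceiling division
final_list = [my_list[i:i + chunk] for i in range(0, len(my_list), chunk)]
print(final_list)  # [[2, 5, 6], [7, 8, 15], [34, 56, 78]]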
If you use the indexes produced by the two range calls, you can combine them to index into your list like this:
my_list = [2,5,6,7,8,15,34,56,78]
max_split = 3
final_list = [[my_list[i + max_split * j] for i in range(max_split)] for j in range(max_split)]
print(final_list)
Output:
[[2, 5, 6], [7, 8, 15], [34, 56, 78]]

Aggregate in Julia like R or pandas

I want to aggregate a monthly series at the quarterly frequency, for which R has ts and aggregate() (see the first answer on this thread) and pandas has df.resample("Q").sum() (see this question). Does Julia offer something similar?
Appendix: my current solution uses a function to convert a date to the first month of its quarter, then split-apply-combine:
"""
month_to_quarter(date)
Returns the date corresponding to the first day of the quarter enclosing date
# Examples
```jldoctest
julia> Date(1990, 1, 1) == RED.month_to_quarter(Date(1990, 2, 1))
true
julia> Date(1990, 1, 1) == RED.month_to_quarter(Date(1990, 1, 1))
true
julia> Date(1990, 1, 1) == RED.month_to_quarter(Date(1990, 2, 25))
true
```
"""
function month_to_quarter(date::Date)
new_month = 1 + 3 * floor((Dates.month(date) - 1) / 3)
return Date(Dates.year(date), new_month, 1)
end
"""
monthly_to_quarterly(monthly_df)
Aggregates a monthly data frame to the quarterly frequency. The data frame should have a :DATE column.
# Examples
```jldoctest
julia> monthly = convert(DataFrame, hcat(collect([Dates.Date(1990, m, 1) for m in 1:3]), [1; 2; 3]));
julia> rename!(monthly, :x1 => :DATE);
julia> rename!(monthly, :x2 => :value);
julia> quarterly = RED.monthly_to_quarterly(monthly);
julia> quarterly[:value][1]
2.0
julia> length(quarterly[:value])
1
```
"""
function monthly_to_quarterly(monthly::DataFrame)
    # quarter months: 1, 4, 7, 10
    quarter_months = collect(1:3:10)
    # Deep copy the data frame
    monthly_copy = deepcopy(monthly)
    # Drop initial rows until the data starts on a quarter month
    while !in(Dates.month(monthly_copy[:DATE][1]), quarter_months)
        # Verify that something is left to pop
        @assert 1 <= length(monthly_copy[:DATE])
        monthly_copy = monthly_copy[2:end, :]
    end
    # Drop trailing rows until the data ends on the last month of a quarter
    while !in(Dates.month(monthly_copy[:DATE][end]), 2 .+ quarter_months)
        monthly_copy = monthly_copy[1:end-1, :]
    end
    # Change the month of each date to the first month of its quarter
    monthly_copy[:DATE] = month_to_quarter.(monthly_copy[:DATE])
    # Split-apply-combine
    quarterly = by(monthly_copy, :DATE, df -> mean(df[:value]))
    # Rename
    rename!(quarterly, :x1 => :value)
    return quarterly
end
I couldn't find such a function in the docs. Here's a more DataFrames.jl-ish and more succinct version of your own answer:
using DataFrames

# copy-pasted your own function
function month_to_quarter(date::Date)
    new_month = 1 + 3 * fld(Dates.month(date) - 1, 3)
    return Date(Dates.year(date), new_month, 1)
end

# the data
r = collect(1:6)
monthly = DataFrame(date=[Dates.Date(1990, m, 1) for m in r],
                    val=r);

# the functionality
monthly[:quarters] = month_to_quarter.(monthly[:date])
_aggregated = by(monthly, :quarters, df -> DataFrame(S = sum(df[:val])))
@show monthly
@show _aggregated