Making only some columns editable on streamlit-aggrid? - ag-grid

I'm new to streamlit-aggrid.
I have a CSV file I want to load into a dynamic table and allow edits to only some of the columns.
I saw this example:
import streamlit as st
import pandas as pd
from st_aggrid import AgGrid
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
grid_return = AgGrid(df, editable=True)
new_df = grid_return['data']
So I've followed it, but instead of editable=True, which allows both col1 and col2 values to be modified, I want to allow modifications on only one of them (it doesn't matter which).
How can I do that please?
Thanks!
I tried to pass a subset of columns into the editable argument, but it only accepts boolean values.
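One approach is to pass per-column grid options built with st_aggrid's GridOptionsBuilder instead of the top-level editable flag. A rough, untested sketch (assuming the GridOptionsBuilder helper shipped with streamlit-aggrid):
import streamlit as st
import pandas as pd
from st_aggrid import AgGrid, GridOptionsBuilder
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
gb = GridOptionsBuilder.from_dataframe(df)
gb.configure_default_column(editable=False)  # everything read-only by default
gb.configure_column('col2', editable=True)   # allow edits only on col2
grid_return = AgGrid(df, gridOptions=gb.build())
new_df = grid_return['data']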

Related

How to create this function in PySpark?

I have a large data frame, consisting of 400+ columns and 14000+ records, that I need to clean.
I have written Python code to do this, but due to the size of my dataset, I need to use PySpark to clean it. However, I am very unfamiliar with PySpark and don't know how I would create the Python function in PySpark.
This is the function in python:
unwanted_characters = ['[', ',', '-', '#', ' ']
cols = df.columns.to_list()

def clean_col(item):
    # truncate the value at the first unwanted character
    column = str(item.loc[col])
    for character in unwanted_characters:
        if character in column:
            character_index = column.find(character)
            column = column[:character_index]
    return column

for col in cols:
    df[col] = df.apply(clean_col, axis=1)
This function works in python but I cannot apply it to 400+ columns.
I have tried to convert this function to pyspark:
from pyspark.sql.functions import udf, col

clean_colUDF = udf(lambda z: clean_col(z))
df.select(col("Name"), \
    clean_colUDF(col("Name")).alias("Name") ) \
    .show(truncate=False)
But when I run it I get the error:
AttributeError: 'str' object has no attribute 'loc'
Does anyone know how I would modify this so that it works in pyspark?
My columns datatypes are both integers and strings so I need it to work on both.
Use the built-in pyspark.sql.functions wherever possible: they provide a ready-made, performant toolkit that should cover 95% of any data transformation requirement without having to implement your own custom UDFs.
pyspark.sql.functions docs: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html
For what you want to do, I would start with regexp_replace():
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.regexp_replace.html#pyspark.sql.functions.regexp_replace
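For example, here is a minimal, untested sketch of how the whole pandas function could be expressed without a UDF, assuming the goal is (as in the original code) to truncate each value at the first unwanted character:
from pyspark.sql import functions as F

# one unwanted character followed by everything after it
pattern = r'[\[,\-# ].*'

# cast to string so it works on both integer and string columns,
# then remove the match, i.e. truncate at the first unwanted character
df_clean = df.select([
    F.regexp_replace(F.col(c).cast("string"), pattern, "").alias(c)
    for c in df.columns
])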

PySpark Code Modification to Remove Nulls

I received help with following PySpark to prevent errors when doing a Merge in Databricks, see here
Databricks Error: Cannot perform Merge as multiple source rows matched and attempted to modify the same target row in the Delta table conflicting way
I was wondering if I could get help to modify the code to drop NULLs.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
df2 = partdf.withColumn("rn", row_number().over(Window.partitionBy("P_key").orderBy("Id")))
df3 = df2.filter("rn = 1").drop("rn")
Thanks
The code that you are using does not completely delete the rows where P_key is null: row_number() is also applied within the null partition, so the null row that gets rn = 1 is not deleted.
You can use df.na.drop instead to get the required result.
df.na.drop(subset=["P_key"]).show(truncate=False)
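For example, combined with the window dedup from your question (a sketch using the same partdf, P_key and Id names):
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# drop rows with a null P_key first, then keep one row per P_key as before
df_clean = partdf.na.drop(subset=["P_key"])
df2 = df_clean.withColumn("rn", row_number().over(Window.partitionBy("P_key").orderBy("Id")))
df3 = df2.filter("rn = 1").drop("rn")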
Alternatively, to make your original approach work without na.drop, you can do the following: add a row with the least possible unique id value, store that id in a variable, use the same code, and add an additional condition in the filter as shown below.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number,when,col
df = spark.read.option("header",True).csv("dbfs:/FileStore/sample1.csv")
#adding row with least possible id value.
dup_id = '0'
new_row = spark.createDataFrame([[dup_id,'','x','x']], schema = ['id','P_key','c1','c2'])
#replacing empty string with null for P_Key
new_row = new_row.withColumn('P_key',when(col('P_key')=='',None).otherwise(col('P_key')))
df = df.union(new_row) #row added
#code to remove duplicates
df2 = df.withColumn("rn", row_number().over(Window.partitionBy("P_key").orderBy("id")))
df2.show(truncate=False)
#additional condition to remove added id row.
df3 = df2.filter((df2.rn == 1) & (df2.P_key!=dup_id)).drop("rn")
df3.show()

How do I create new columns containing lead values generated from another column in Pyspark?

The following code pulls down daily oil prices (dcoilwtico), resamples the daily figures to monthly, calculates the 12-month (i.e. year over year percent) change and finally contains a loop to shift the YoY percent change ahead 1 month (dcoilwtico_1), 2 months (dcoilwtico_2) all the way out to 12 months (dcoilwtico_12) as new columns:
import datetime
import pandas as pd
import pandas_datareader as pdr

start = datetime.datetime(2016, 1, 1)
end = datetime.datetime(2022, 12, 1)

#1. Get historic data
df_fred_daily = pdr.DataReader(['DCOILWTICO'], 'fred', start, end).dropna().resample('M').mean() # Pull daily, remove NaN and collapse from daily to monthly
df_fred_daily.columns = df_fred_daily.columns.str.lower()

#2. Expand df range: index, column names
index_fred = pd.date_range('2022-12-31', periods=13, freq='M')
columns_fred_daily = df_fred_daily.columns.to_list()

#3. Append history + empty df
df_fred_daily_forecast = pd.DataFrame(index=index_fred, columns=columns_fred_daily)
df_fred_test_daily = pd.concat([df_fred_daily, df_fred_daily_forecast])

#4. New df, calculate yoy percent change for each commodity
df_fred_test_daily_yoy = ((df_fred_test_daily - df_fred_test_daily.shift(12)) / df_fred_test_daily.shift(12)) * 100

#5. Extend each variable as a series from 1 to 12 months
for col in df_fred_test_daily_yoy.columns:
    for i in range(1, 13):
        df_fred_test_daily_yoy["%s_%s" % (col, i)] = df_fred_test_daily_yoy[col].shift(i)

df_fred_test_daily_yoy.tail(18)
And produces the following df:
Question: My real world example contains hundreds of columns and I would like to generate these same results using Pyspark.
How would this be coded using Pyspark?
Since your pandas code is already ready, I would use Koalas, "a pandas version of Spark". You just need to install it: https://pypi.org/project/koalas/
See this simple example:
import databricks.koalas as ks
import pandas as pd
pdf = pd.DataFrame({'x':range(3), 'y':['a','b','b'], 'z':['a','b','b']})
# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)
# Rename the columns
df.columns = ['x', 'y', 'z1']
# Do some operations in place:
df['x2'] = df.x * df.x
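As a rough, untested sketch applied to your case (assuming the df_fred_test_daily_yoy frame as it exists after step 4 of your code), the shift loop carries over almost unchanged once the pandas frame is converted to Koalas:
import databricks.koalas as ks

# convert the pandas yoy frame and re-run step 5 on Spark
kdf_yoy = ks.from_pandas(df_fred_test_daily_yoy)
for c in kdf_yoy.columns:
    for i in range(1, 13):
        kdf_yoy["%s_%s" % (c, i)] = kdf_yoy[c].shift(i)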

Using select statement in pyspark changes values in column

I'm experiencing a very weird behavior in pyspark (databricks).
In my initial dataframe (df_original) I have multiple columns (id, text and some others) and I add a new column 'detected_language'. The new column is added using a join with another dataframe df_detections (with columns id and detected_language). The ids in the two dataframes correspond to each other.
df_detections is created like this:
ids = [125, ...] # length x
detections = ['ko', ...] # length x
detections_with_id = list(zip(ids, detections))
df_detections = spark.createDataFrame(detections_with_id, ["id", "detected_language"])
df = df_original.join(df_detections, on='id', how='left')
Here is the weird part. Whenever I display the dataframe using a select statement I get the correct detected_language value. However, using only display I get a totally different value (e.g. 'fr' or any other language code) for the same entry (see the statements and their corresponding results below).
How is that possible? Can anybody think of a reason why this is? And how would I solve something like this?
Displaying correct value with select:
display(df.select(['id', 'text', 'detected_language']))
id  | text             | detected_language
125 | 내 한국어 텍스트 | ko
... | ...              | ...
Displaying wrong value without select:
display(df)
id  | text             | other_columns... | detected_language
125 | 내 한국어 텍스트 | ...              | fr
... | ...              | ...              | ...
I appreciate any hints or ideas! Thank you!

How to sum values of an entire column in pyspark

I have a data frame with 900 columns and I need the sum of each column in pyspark, so it will be 900 values in a list. The data has around 280 million rows, all binary data. Please let me know how to do this.
Assuming you already have the data in a Spark DataFrame, you can use the sum SQL function, together with DataFrame.agg.
For example:
sdf = spark.createDataFrame([[1, 3], [2, 4]], schema=['a','b'])
from pyspark.sql import functions as F
sdf.agg(F.sum(sdf.a), F.sum(sdf.b)).collect()
# Out: [Row(sum(a)=3, sum(b)=7)]
Since in your case you have quite a few columns, you can use a list comprehension to avoid naming columns explicitly.
sums = sdf.agg(*[F.sum(sdf[c_name]) for c_name in sdf.columns]).collect()
Notice how you need to unpack the arguments from the list using the * operator.
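If you want the 900 sums as a plain Python list (as asked in the question), you can unpack the single Row that collect() returns, for example:
# sums is a list with one Row containing all the column sums
totals = list(sums[0])
# Out: [3, 7]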