How to get a scalar from polars to python without to_series.to_list()[0] - python-polars

Let's say I want to get the max of a column back into a regular Python variable:
df = pl.DataFrame({'a': [1, 2, 3]})
Right now I'm doing df.select(pl.col('a').max()).to_series().to_list()[0], but that seems a bit clunky. Is there a more direct way?

You can index a DataFrame in row, column order.
>>> df.select(pl.max('a'))[0, 0]
3
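For completeness, a minimal sketch of the whole round trip (the DataFrame is the one from the question; the .item() note depends on your polars version):

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})

# Row/column indexing on the 1x1 result returns a plain Python scalar
value = df.select(pl.col("a").max())[0, 0]
print(value)  # 3

# Depending on your polars version, DataFrame.item() may also be available:
# value = df.select(pl.col("a").max()).item()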

Related

How to get the numeric value of missing values in a PySpark column?

I am working with the OpenFoodFacts dataset using PySpark. There are quite a lot of columns that are entirely made up of missing values, and I want to drop those columns. I have been looking up ways to retrieve the number of missing values in each column, but the results are displayed in a table format instead of giving me the numeric value of the total null values.
The following code shows the number of missing values in a column but displays it in a table format:
from pyspark.sql.functions import col, isnan, when, count
data.select([count(when(isnan("column") | col("column").isNull(), "column"))]).show()
I have tried the following codes:
This one does not work as intended as it doesn't drop any columns (as expected):
for c in data.columns:
    if data.select([count(when(isnan(c) | col(c).isNull(), c))]) == data.count():
        data = data.drop(c)
data.show()
This one I am currently trying but takes ages to execute
for c in data.columns:
    if data.filter(data[c].isNull()).count() == data.count():
        data = data.drop(c)
data.show()
Is there a way to get ONLY the number? Thanks
If you need the number instead of showing it in table format, you need to use .collect():
list_of_values = data.select([count(when(isnan("column") | col("column").isNull(), "column"))]).collect()
What you get is a list of Row objects, which contain all the information from the table.
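To pull the plain integer out of the collected result, you can index into the first Row; a minimal sketch, assuming data is your DataFrame and "column" is the column name:

from pyspark.sql.functions import col, count, isnan, when

# Alias the aggregate so it is easy to read back from the Row
row = data.select(
    count(when(isnan("column") | col("column").isNull(), "column")).alias("n_missing")
).collect()[0]

n_missing = row["n_missing"]  # a plain Python int
print(n_missing)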

Use the bin binary on a per-ticker basis

For two tables, each with datetime and ticker symbol columns, how can we achieve the functionality of the binary function bin within each ticker group? That is, instead of returning the latest index from the entire left table prior to the time of each right-table row, for a given right-table row it should return the latest index from the left table amongst only the rows with the same ticker symbol as the right-table row.
My first thought would be to add a per-group index in the left table, apply bin on each ticker group for its group index, and then use the unique (ticker, group-index) pair to find the index in the full left table. However, I am not sure how to implement this or whether this is the best way to achieve the desired functionality.
Could you give some sample inputs and desired output?
This sounds like something you can solve with aj
Check https://code.kx.com/q/ref/aj/ for details

How to run Logical Test in Tableau when one column has multiple rows

The first column only has one row, while the second column has three rows that correspond to the first row of the first column. For example, something like this.
Is there a way to run a logical test where, if any of the values in the second column pass the test, I get a 1, and if none of the values in the second column pass the test, I get a 0?
Thank you for your help!
Yes, you can do this using LODs and a simple boolean formula. Please give a more specific example of what you want to do, and you can have a formula that'll do it.

How do you divide a single column in a table by a constant in Tableau?

Sorry if this seems trivial, but I am fairly new to Tableau. I have a simple table that has one dimension for columns and one dimension for rows. My Marks are the Count of a third dimension. I'd like to divide only one of the columns in the table by a constant, but not the others. When I have tried conditional statements, I receive the error about mixing non-aggregate and aggregate statements.
What is the best way to divide a single column's values based upon a condition?
Thanks in advance.
Typically the error regarding non-aggregate and aggregate statements can be resolved using the ATTR() function.
SUM([Sales]) / [Constant]
Turns to:
SUM([Sales]) / ATTR([Constant])
Or conversely, which might or might not fit your data:
[Sales] / [Constant]
You just can't mix the two, as in the first example.
Edit
This is probably a more accurate place for the ATTR() function given what I'm guessing is your use case:
IF ATTR([Segment]) = 'Corporate'
THEN COUNT([Sales]) / SUM([Constant])
END
Try turning the constant into a discrete measure and see if that works (right-click on the measure and select 'Discrete').
Also, without seeing the conditional code you are using, you probably need to wrap the entire condition with count() in order to not get the Aggregate/Non-Aggregate error, like this:
Count(If [MyDimension] = "XX" then [MyOtherDimension] else Null End)
NOT like this:
If [MyDimension] = "XX" then Count([MyOtherDimension]) else Null End

How to handle NaNs in pandas dataframe integer column to postgresql database

I have a pandas dataframe with a "year" column. However, some rows have an np.NaN value due to an outer merge. The data type of the column in pandas is therefore converted to float64 instead of integer (integers cannot store NaNs?). Next, I want to store the dataframe in a PostgreSQL database. For this I use:
df.to_sql()
Everything works fine, but my PostgreSQL column is now of type "double precision" and the np.NaN values are now [null]. This all makes sense, since the input column type was float64 and not an integer type.
I was wondering if there is a way to store the results in an integer-type column with [nans].
(integer cannot store NaNs?)
No, they cannot. If you look at the PostgreSQL numeric types documentation, you can see that the number of bytes and the ranges are completely specified, and integer types cannot represent NaN.
A common solution in this case is to decide, by convention, that some number is logically a NaN. In your case, since it is a year, you might choose a negative value (or just -1) for that. Before writing, you could use
df.year = df.year.fillna(-1).astype(int)
Alternatively, you can define another column as year_is_none.
Alternatively, you can store them as floats.
These solutions range from most efficient to least efficient in terms of memory.
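A minimal sketch of the fill-with-a-sentinel approach, including forcing the column type on the PostgreSQL side; the engine URL, table name, and column name here are placeholders:

import numpy as np
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.types import Integer

# Placeholder connection string
engine = create_engine("postgresql://user:password@localhost:5432/mydb")

df = pd.DataFrame({"year": [1999.0, np.nan, 2021.0]})

# Convention: -1 means "year unknown"
df["year"] = df["year"].fillna(-1).astype(int)

# Passing dtype makes the PostgreSQL column an integer rather than double precision
df.to_sql("my_table", engine, if_exists="replace", index=False,
          dtype={"year": Integer()})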
You could also simply fill the missing values before writing:
df.year = df.year.fillna(-1)  # or fillna(0)