Overwrite a slice of a timeseries with a value - python-polars

I have some timeseries data in the form of a pl.DataFrame object with a datetime col and a data col. I would like to correct an error in the data that occurs during a distinct time range by overwriting it with a value.
Now in pandas, one would use the datetimes as the index, slice that time range, and assign to it, like so:
df.loc[start_dt_string:end_dt_string, column_name] = some_val
Being completely new to polars, I have a hard time figuring out how to express this. I tried selecting rows with .filter and .is_between, but of course that doesn't support assignment. How would one go about doing this with polars?

Apparently I missed this in the docs, so RTFM to the rescue. In the corresponding section of the Coming from Pandas guide, this case is covered almost verbatim:
df.with_column(
    pl.when(pl.col("c") == 2)
    .then(pl.col("b"))
    .otherwise(pl.col("a"))
    .alias("a")
)
The above pandas example uses time-range slicing, so for the sake of completeness I'm going to add polars code that does exactly the same, with the start and end of the range given as datetime objects:
df.with_column(
    pl.when(
        pl.col(dt_column_name).is_between(
            start_dt,  # datetime object marking the start of the faulty range
            end_dt,    # datetime object marking the end of the faulty range
            include_bounds=True,
        )
    )
    .then(pl.lit(some_val))
    .otherwise(pl.col(column_name))
    .alias(column_name)
)
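For anyone who wants to run it end to end, here is a self-contained sketch of the same idea on a made-up frame, written against a recent polars release (where with_column has become with_columns and include_bounds has been replaced by the closed parameter); the column names, dates and replacement value are all invented:
from datetime import datetime
import polars as pl

df = pl.DataFrame({
    "ts": [datetime(2023, 1, d) for d in range(1, 6)],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Overwrite "value" with 0.0 for rows whose "ts" falls in the faulty range
df = df.with_columns(
    pl.when(
        pl.col("ts").is_between(datetime(2023, 1, 2), datetime(2023, 1, 4), closed="both")
    )
    .then(pl.lit(0.0))
    .otherwise(pl.col("value"))
    .alias("value")
)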

Related

`set_sorted` when a dataframe is sorted on multiple columns

I have some panel data in polars. The dataframe is sorted by its id column and then its date column (basically it's a bunch of time series concatenated together).
I've seen that polars has a .set_sorted method for working with expressions. I can of course use pl.col("id").set_sorted(), but I want polars to be aware that the frame is actually sorted by both the id and date columns. In pandas I know the Index has an .is_monotonic_increasing property that is aware of whether all the columns of the Index are sorted, but is there a way to do something similar with polars?
Have you tried
df.get_column('id').is_sorted()
and
df.get_column('date').is_sorted()
to see if they're each already known to be sorted?
For instance if I do:
df = pl.DataFrame({'a': [1, 1, 2, 2], 'b': [1, 2, 3, 4]})
df.get_column('a').is_sorted()
df.get_column('b').is_sorted()
Then I get 2 Trues even though I haven't ever told it that the columns are sorted.
In general, I don't think you want to be manually setting columns as sorted. Just sort them and it'll keep track of the fact that they're sorted.
If you do:
df = pl.DataFrame({'a': [1, 2, 1, 2], 'b': [1, 3, 2, 4]})
df.get_column('a').is_sorted()
df.get_column('b').is_sorted()
then you get False twice, as you'd hope. If you then do df = df.sort(['a', 'b']) and check the sortedness of a and b again, you'll see that it knows they're sorted.
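A minimal sketch of that sequence, re-using the unsorted frame from just above:
import polars as pl

df = pl.DataFrame({'a': [1, 2, 1, 2], 'b': [1, 3, 2, 4]})

print(df.get_column('a').is_sorted())  # False
print(df.get_column('b').is_sorted())  # False

# Sort through polars rather than flagging columns manually
df = df.sort(['a', 'b'])

print(df.get_column('a').is_sorted())  # True
print(df.get_column('b').is_sorted())  # True (b happens to be sorted too in this data)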

why pyspark pandas udf grouped map serialization is designed this way?

I am trying to get a concrete understanding of how the pandas UDF grouped map works. Looking at the code here [1], I see that first the Arrow object is converted into pandas Series, and then pd.concat is applied to create the full data frame.
What is confusing for me is: since Arrow has support for a tabular format [2] and pyarrow has an API for converting a Table to pandas [3], why is that not being used?
I am pretty sure I am missing something very basic; any pointers would be useful.
[1] https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/python/pyspark/sql/pandas/serializers.py#L234-L276
[2] https://arrow.apache.org/docs/cpp/tables.html
[3] https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas
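To make the two paths concrete, here is a small standalone sketch in pyarrow (this is not Spark's code, just an illustration with made-up data) of the per-column conversion the question describes versus the Table.to_pandas route it asks about:
import pyarrow as pa
import pandas as pd

batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3]), pa.array(["x", "y", "z"])],
    names=["a", "b"],
)

# Roughly the per-column route: convert each Arrow column to a pandas Series,
# then stitch the Series back into a DataFrame with pd.concat.
series = [col.to_pandas() for col in batch.columns]
df_from_series = pd.concat(series, axis=1, keys=batch.schema.names)

# The alternative being asked about: wrap the batch in a Table and use the
# built-in Table.to_pandas conversion.
df_from_table = pa.Table.from_batches([batch]).to_pandas()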

Comparing columns in two data frame in spark

I have two dataframes, and they contain different numbers of columns.
I need to compare three fields between them to check whether they are equal.
I tried the following approach, but it's not working.
if(df_table_stats("rec_cnt").equals(df_aud("REC_CNT")) || df_table_stats("hashcount").equals(df_aud("HASH_CNT")) || round(df_table_stats("hashsum"),0).equals(round(df_aud("HASH_TTL"),0)))
{
println("Job executed succefully")
}
df_table_stats("rec_cnt"), this returns Column rather than actual value hence condition becoming false.
Also, please explain difference between df_table_stats.select("rec_cnt") and df_table_stats("rec_cnt").
Thanks.
Use SQL and inner join both DataFrames with your conditions.
Per my comment, the syntax you're using consists of simple column references; they don't actually return data. Assuming you MUST use Spark for this, you want a method that actually returns the data, known in Spark as an action. For this case you can use take to return the first Row of data and extract the desired columns:
val tableStatsRow: Row = df_table_stats.take(1).head
val audRow: Row = df_aud.take(1).head
val tableStatsRecCount = tableStatsRow.getAs[Int]("rec_cnt")
val audRecCount = audRow.getAs[Int]("REC_CNT")
//repeat for the other values you need to capture
However, Spark definitely is overkill if this is all you're using it for. You could use a simple JDBC library for Scala like ScalikeJDBC to do these queries and capture the primitives in the results.

Scala: wrapper for Breeze DenseMatrix for column and row referencing

I am new to Scala. Looking at it as an alternative to MATLAB for some applications.
I would like to write a wrapper class in Scala that lets me assign column names ("QuantityQ" && "QuantityP" -> Range) and row names (dates -> Range) to Breeze DenseMatrices (http://www.scalanlp.org/), so that columns and rows can be referenced by name.
The usage should resemble Python Pandas or Scala Saddle (http://saddle.github.io).
Saddle is very interesting, but its usage is limited to 2D matrices, which is a huge limitation.
My Ideas:
Columns:
I thought a Map would do the job for columns, but that may not be the best implementation.
Rows:
For rows, I could maintain a separate Breeze vector with timestamps and provide methods that convert dates into timestamps, doing the number crunching through Breeze. This comes with a loss of generality, as a user may want to give arbitrary string names to rows.
Concerning dates, should I use nscala-time (a Scala wrapper for Joda-Time)?
What are the drawbacks of my implementation?
Would you design the data structure differently?
Thank you for your help.

How to do pandas groupby([multiple columns]) so its result can be looked up

I have two dataframes: tr is a training-set, ts is a test-set.
They contain columns uid (a user_id), categ (a categorical), and response.
response is the dependent variable I'm trying to predict in ts.
I am trying to compute the mean of response in tr, broken out by columns uid and categ:
avg_response_uid_categ = tr.groupby(['uid','categ']).response.mean()
This gives the result, but (unwantedly) the dataframe index is a MultiIndex (this is the default groupby(..., as_index=True) behavior):
MultiIndex[--5hzxWLz5ozIg6OMo6tpQ SomeValueOfCateg, --65q1FpAL_UQtVZ2PTGew AnotherValueofCateg, ...
But instead I want the result to keep 'uid' and 'categ' as two separate columns.
Should I use aggregate() instead of groupby()?
Trying groupby(as_index=False) is useless.
The result seems to differ depending on whether you do:
tr.groupby(['uid','categ']).response.mean()
or:
tr.groupby(['uid','categ'])['response'].mean() # RIGHT
i.e. whether you slice a single Series, or a DataFrame containing a single Series. Related: Pandas selecting by label sometimes return Series, sometimes returns DataFrame
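For reference, here is a minimal sketch with made-up data showing one way to get uid and categ back as ordinary columns, by resetting the MultiIndex after the aggregation:
import pandas as pd

# Toy stand-in for tr
tr = pd.DataFrame({
    "uid":      ["u1", "u1", "u1", "u2"],
    "categ":    ["a",  "a",  "b",  "a"],
    "response": [1.0,  0.0,  1.0,  1.0],
})

# Series with a (uid, categ) MultiIndex, as described in the question
avg = tr.groupby(["uid", "categ"])["response"].mean()

# Turn the MultiIndex levels back into ordinary columns
avg_flat = avg.reset_index()
print(avg_flat)
#   uid categ  response
# 0  u1     a       0.5
# 1  u1     b       1.0
# 2  u2     a       1.0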