Polars equivalent to SQL `COUNT(DISTINCT expr,[expr...])`, or other method of checking uniqueness - python-polars

When processing data, I often add a check after each step to validate that the data still has the unique key I think it does. For example, I might check that my data is still unique on (a, b). To accomplish this, I would typically check that the number of distinct combinations of columns a and b equals the total number of rows.
In polars, to get a COUNT(DISTINCT ...) I can do
(
    df
    .select(['a', 'b'])
    .unique()
    .height
)
But height does not work on LazyFrames, so with this method I think (?) I have to materialize all of the data. Is there a better way?
For reference, in R's data.table library I would do
mtc_dt <- data.table::as.data.table(mtcars)
stopifnot(data.table::uniqueN(mtc_dt[, .(mpg, disp)]) == nrow(mtc_dt))
To any contributors reading:
Thanks for the great package! It has cut many of my workflows down to a fraction of the time they used to take.

You can use a map function that asserts on the unique count.
This allows you to get an eager DataFrame in the middle of a query plan.
Note that we turn off the projection_pushdown optimization, because the optimizer cannot know which subset of columns the mapped function needs.
import polars as pl

df = pl.DataFrame({
    "foo": [1, 2, 3],
    "bar": [None, "hello", None]
})

def unique_check(df: pl.DataFrame, subset: list[str]) -> pl.DataFrame:
    assert df.select(pl.struct(subset).unique().count()).item() == df.height
    return df

out = (
    df.lazy()
    .map(lambda df: unique_check(df, ["foo", "bar"]), projection_pushdown=False)
    .select("bar")
    .collect()
)
print(out)
shape: (3, 1)
┌───────┐
│ bar │
│ --- │
│ str │
╞═══════╡
│ null │
│ hello │
│ null │
└───────┘
Not turning off projection_pushdown is better, but then we must ensure the subset columns are selected before the map.

The answer here provides a technique that can answer this question: gather the columns together in a struct column, and then apply .n_unique() to that struct. That answer uses groupby, but the technique works without groupby as well.
(
    df
    .with_column(pl.struct(['a', 'b']).alias('ident'))
    ['ident']
    .n_unique()
)
I was able to run code more or less identical to this on a dataset I am working with, and got a sensible answer.
Note that I am not sure if this materializes the entire table before aggregating, nor if this works specifically on lazy data frames. If not, please let me know, and I will retract this answer.
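For completeness, a hedged sketch of how the same idea might look on a LazyFrame (the column names and the final assertion are only illustrative): the struct's n_unique and the row count can be computed in one lazy select, so only the two counts are materialized.

import polars as pl

lf = pl.DataFrame({"a": [1, 1, 2], "b": [1, 2, 2]}).lazy()

counts = (
    lf
    .select([
        pl.struct(["a", "b"]).n_unique().alias("n_distinct"),
        pl.count().alias("n_rows"),
    ])
    .collect()
)
# the key (a, b) is unique when both counts agree
assert counts["n_distinct"][0] == counts["n_rows"][0]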

If you have
df = pl.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [3, 4, 5]}).lazy()
and you want to see if [a, b] is unique without returning all the data, you can lazily group by those columns and count the groups. With that, you can add a filter such that only rows with a count greater than 1 are returned. Only after those expressions are chained onto the LazyFrame do you collect; if your pair of columns is still unique, as you intend, the result will have 0 rows.
(
    df
    .groupby(['a', 'b'])
    .agg(pl.count())
    .filter(pl.col('count') > 1)
    .select('count')
    .collect()
    .height
)
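If you want that as a reusable validation step, here is a hedged sketch (the helper name is made up) that wraps the same group-by check and works on LazyFrames:

import polars as pl

def is_unique_key(lf: pl.LazyFrame, cols: list[str]) -> bool:
    # collect only the duplicated groups; an empty result means the key is still unique
    dupes = (
        lf
        .groupby(cols)
        .agg(pl.count())
        .filter(pl.col('count') > 1)
        .collect()
    )
    return dupes.height == 0

df = pl.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [3, 4, 5]}).lazy()
assert is_unique_key(df, ['a', 'b'])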

Related

Formatting hstore column Postgres

I'm trying to find the best way to format an hstore column (see screenshot). My goal is to have the same format as the "updated_column" shown in the screenshot. I was thinking about a case statement like:
Case when json_column -> id then 'id:'
Any suggestion would be appreciated.
Migration approach:
1. Add a new column with type text, formatted like you want it.
2. Make sure new data directly enters the new column as the string you want (pre-formatted at the backend).
3. Create a migration function that converts the json column data batchwise into your new string column (see the sketch after this list). You can use Postgres replace/... operations to reformat it. You can also use an external Python script/...
4. Remove the json column after the migration is done.
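A hedged sketch of what step 3 could look like (table and column names are placeholders; the regexp mirrors the json-to-text approach shown further down):

-- hypothetical names: table my_table, old column json_column, new column formatted_text
-- converts one batch of not-yet-migrated rows; run repeatedly until 0 rows are updated
UPDATE my_table
SET    formatted_text = regexp_replace(json_column::text, '["{}]+', '', 'g')
WHERE  ctid IN (
    SELECT ctid FROM my_table
    WHERE formatted_text IS NULL
    LIMIT 1000
);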
Let me know what you have tried and how, and then we can see how to improve it and solve your issues.
So I think I found a temporary solution that will work, but, as #Bergi mentioned, a view might be more appropriate.
For now I will just use something like:
concat(concat(concat(concat('id',':',column -> 'id')
,' ','auth_id',':',column -> 'auth_id')
,' ','type',':',column -> 'type')
,' ','transaction',':',column -> 'transaction')
You can use a function to make it generic.
Let's start with an example:
select '{"a":1,"b":2}'::json;
┌───────────────┐
│ json │
├───────────────┤
│ {"a":1,"b":2} │
└───────────────┘
(1 row)
Back to text:
select '{"a":1,"b":2}'::json::text;
┌───────────────┐
│ text │
├───────────────┤
│ {"a":1,"b":2} │
└───────────────┘
(1 row)
Now, remove the undesired tokens {}" with a regex:
select regexp_replace('{"a":1,"b":2}'::json::varchar, '["{}]+', '', 'g');
┌────────────────┐
│ regexp_replace │
├────────────────┤
│ a:1,b:2 │
└────────────────┘
(1 row)
and you can wrap it into a function:
create function text_from_json(json) returns text as $$select regexp_replace($1::text, '["{}]+', '', 'g')$$ language sql;
CREATE FUNCTION
Testing the function now:
tsdb=> select text_from_json('{"a":1,"b":2}'::json);
┌────────────────┐
│ text_from_json │
├────────────────┤
│ a:1,b:2 │
└────────────────┘
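Since the question title mentions hstore: if the column is actually of type hstore rather than json, a similar trick should work on its text representation (hedged sketch, requires the hstore extension; hstore renders as "key"=>"value" pairs, so the quotes are stripped and => is rewritten to :):

-- '"a"=>"1", "b"=>"2"' becomes 'a:1, b:2'
SELECT replace(regexp_replace('a=>1,b=>2'::hstore::text, '"', '', 'g'), '=>', ':');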

What's the meaning of select attributeName(tableName) from tableName in PostgreSQL

Using PostgreSQL, I see an apparently strange behavior that I don't understand.
Assume we have a simple table:
create table employee (
    number int primary key,
    surname varchar(20) not null,
    name varchar(20) not null);
The meaning of the following is clear to me:
select name from employee
However, I also obtain all the names with
select name(employee) from employee
and I do not understand this last statement.
I'm using PostgreSQL 13 and pgAdmin 4.
I'd like to expand on Abelisto's answer with this quotation from the PostgreSQL docs:
Another special syntactical behavior associated with composite values is that we can use functional notation for extracting a field of a composite value. The simple way to explain this is that the notations field(table) and table.field are interchangeable. For example, these queries are equivalent:
SELECT c.name FROM inventory_item c WHERE c.price > 1000;
SELECT name(c) FROM inventory_item c WHERE price(c) > 1000;
...
This equivalence between functional notation and field notation makes it possible to use functions on composite types to implement “computed fields”. An application using the last query above wouldn't need to be directly aware that somefunc isn't a real column of the table.
Just an assumption.
There are two syntactic ways in PostgreSQL to call a function that receives a row as its argument. For example:
create table t(x int, y int); insert into t values(1, 2);
create function f(a t) returns int language sql as 'select a.x+a.y';
select f(t), t.f from t;
┌───┬───┐
│ f │ f │
├───┼───┤
│ 3 │ 3 │
└───┴───┘
Probably it is implemented so that the same syntax also works for plain columns:
select f(t), t.f, x(t), t.x from t;
┌───┬───┬───┬───┐
│ f │ f │ x │ x │
├───┼───┼───┼───┤
│ 3 │ 3 │ 1 │ 1 │
└───┴───┴───┴───┘

Array manipulation in Spark, Scala

I'm new to Scala and Spark, and I have a problem while trying to learn from some toy dataframes.
I have a dataframe having the following two columns:
Name_Description Grade
Name_Description is an array, and Grade is just a letter. It's Name_Description that I'm having a problem with; I'm trying to change this column using Scala on Spark.
Name_Description is not an array of fixed size. It could be something like
['asdf_ Brandon', 'Ca%abc%rd']
['fthhhhChris', 'Rock', 'is the %abc%man']
The only problems are the following:
1. The first element of the array ALWAYS has 6 garbage characters, so the real content starts at the 7th character.
2. %abc% randomly pops up in elements, so I want to erase it.
Is there any way to achieve those two things in Scala? For instance, I just want
['asdf_ Brandon', 'Ca%abc%rd'], ['fthhhhChris', 'Rock', 'is the %abc%man']
to change to
['Brandon', 'Card'], ['Chris', 'Rock', 'is the man']
What you're trying to do might be hard to achieve using standard spark functions, but you could define UDF for that:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.udf

val removeGarbage = udf { arr: WrappedArray[String] =>
  // in case the array is empty we need to map over Option
  arr.headOption
    // drop the first 6 characters from the first element, then remove %abc% from the rest
    .map(head => head.drop(6) +: arr.tail.map(_.replace("%abc%", "")))
    .getOrElse(arr)
}
Then you just need to use this UDF on your Name_Description column:
// assumes spark.implicits._ is in scope (e.g. in spark-shell) for toDF and $
val df = List(
  (1, Array("asdf_ Brandon", "Ca%abc%rd")),
  (2, Array("fthhhhChris", "Rock", "is the %abc%man"))
).toDF("Grade", "Name_Description")

df.withColumn("Name_Description", removeGarbage($"Name_Description")).show(false)
Show prints:
+-----+-------------------------+
|Grade|Name_Description |
+-----+-------------------------+
|1 |[Brandon, Card] |
|2 |[Chris, Rock, is the man]|
+-----+-------------------------+
We are always encouraged to use Spark SQL functions and to avoid UDFs whenever we can. I have a simplified solution which makes use of Spark SQL functions.
Please find my approach below. Hope it helps.
val d = Array((1,Array("asdf_ Brandon","Ca%abc%rd")),(2,Array("fthhhhChris", "Rock", "is the %abc%man")))
val df = spark.sparkContext.parallelize(d).toDF("Grade","Name_Description")
This is how I created the input dataframe.
df.select('Grade,posexplode('Name_Description)).registerTempTable("data")
We explode the array along with the position of each element in the array. I register the dataframe in order to use a query to generate the required output.
spark.sql("""select Grade, collect_list(Names) from (select Grade,case when pos=0 then substring(col,7) else replace(col,"%abc%","") end as Names from data) a group by Grade""").show
This query will give out the required output. Hope this helps.
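If avoiding UDFs is the goal, another option (a hedged sketch, assuming Spark 2.4+ where higher-order functions are available) is the transform function, which rewrites each array element in place and avoids the explode/collect_list round trip:

import org.apache.spark.sql.functions.expr

// index-aware transform: strip the first 6 characters from element 0,
// remove the literal %abc% marker from every other element
val cleaned = df.withColumn(
  "Name_Description",
  expr("""transform(Name_Description, (x, i) ->
            if(i = 0, substring(x, 7), regexp_replace(x, '%abc%', '')))""")
)
cleaned.show(false)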

pg_column_size reports vastly different sizes for table.* than for specific columns

I have a simple example where pg_column_size is reporting vastly different values. I think it has to do with whether or not it's considering TOASTed values, but I'm not sure. Here's the setup:
CREATE TABLE foo (bar TEXT);
INSERT INTO foo (bar) VALUES (repeat('foo', 100000));
SELECT pg_column_size(bar) as col, pg_column_size(foo.*) as table FROM foo;
What I'm seeing in Postgres 9.6 is,
 col  | table
------+--------
 3442 | 300028
There's an order of magnitude difference here. Thoughts? What's the right way for me to calculate the size of the row? One idea I have is,
SELECT pg_column_size(bar), pg_column_size(foo.*) - octet_length(bar) + pg_column_size(bar) FROM foo;
Which should subtract out the post-TOAST size and add in the TOAST size.
Edit: my proposed workaround only works on character columns, e.g. it won't work on JSONB.
The first value is the compressed size of the TOASTed value, while the second value is the uncompressed size of the whole row.
SELECT 'foo'::regclass::oid;
┌───────┐
│ oid │
├───────┤
│ 36344 │
└───────┘
(1 row)
SELECT sum(length(chunk_data)) FROM pg_toast.pg_toast_36344;
┌──────┐
│ sum │
├──────┤
│ 3442 │
└──────┘
(1 row)
foo.* (or foo for that matter) is a “wholerow reference” in PostgreSQL, its data type is foo (which is created when the table is created).
PostgreSQL knows that foo.bar is stored externally, so it returns its size as it is in the TOAST table, but foo (a composite type) isn't, so you get the total size.
See the relevant piece of code from src/backend/access/heap/tuptoaster.c:
Size
toast_datum_size(Datum value)
{
    struct varlena *attr = (struct varlena *) DatumGetPointer(value);
    Size        result;

    if (VARATT_IS_EXTERNAL_ONDISK(attr))
    {
        /*
         * Attribute is stored externally - return the extsize whether
         * compressed or not. We do not count the size of the toast pointer
         * ... should we?
         */
        struct varatt_external toast_pointer;

        VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
        result = toast_pointer.va_extsize;
    }
    [...]
    else
    {
        /*
         * Attribute is stored inline either compressed or not, just calculate
         * the size of the datum in either case.
         */
        result = VARSIZE(attr);
    }
    return result;
}
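As for the "right way to calculate the size of the row" part of the question: a common approximation (a hedged sketch, not taken from the code above) is to sum pg_column_size() over the individual columns, which reflects the TOASTed/compressed sizes, and add the heap tuple header of roughly 24 bytes; alignment padding is ignored, so treat it as an estimate.

-- estimated on-disk row size for the example table (a single column bar);
-- with more columns, add their pg_column_size() values as well
SELECT 24 + pg_column_size(bar) AS approx_stored_row_bytes
FROM foo;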

Scan HTable rows for specific column value using HBase shell

I want to scan rows in an HTable from the HBase shell where a column in a column family (i.e., tweet:user_id) has a particular value.
I want to find all rows where tweet:user_id has the value test1, i.e. cells like this one:
column=tweet:user_id, timestamp=1339581201187, value=test1
I can scan the table for a particular column using
scan 'tweetsTable',{COLUMNS => 'tweet:user_id'}
but I did not find any way to filter the scan by a cell value.
Is it possible to do this via HBase Shell?
I checked this question as well.
It is possible without Hive:
scan 'filemetadata',
{ COLUMNS => 'colFam:colQualifier',
LIMIT => 10,
FILTER => "ValueFilter( =, 'binaryprefix:<someValue.e.g. test1 AsDefinedInQuestion>' )"
}
Note: in order to find all rows that contain test1 as the value, as specified in the question, use binaryprefix:test1 in the filter (see this answer for more examples).
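Applied to the table from the question, that would look something like this (a sketch built from the filter above):

scan 'tweetsTable', { COLUMNS => 'tweet:user_id',
                      FILTER => "ValueFilter( =, 'binaryprefix:test1' )" }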
Nishu,
here is a solution I periodically use. It is actually much more powerful than you need right now, but I think you will use its power some day. Yes, it is for the HBase shell.
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.SubstringComparator
import org.apache.hadoop.hbase.util.Bytes
scan 'yourTable', {LIMIT => 10, FILTER => SingleColumnValueFilter.new(Bytes.toBytes('family'), Bytes.toBytes('field'), CompareFilter::CompareOp.valueOf('EQUAL'), Bytes.toBytes('AAA')), COLUMNS => 'family:field' }
Only family:field column is returned with filter applied. This filter could be improved to perform more complicated comparisons.
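One refinement worth knowing about (a hedged sketch using the Java API method setFilterIfMissing exposed by SingleColumnValueFilter): by default the filter still emits rows that lack the column entirely, and this switches that behavior off:

f = SingleColumnValueFilter.new(Bytes.toBytes('family'), Bytes.toBytes('field'),
      CompareFilter::CompareOp.valueOf('EQUAL'), Bytes.toBytes('AAA'))
f.setFilterIfMissing(true)   # skip rows that do not have family:field at all
scan 'yourTable', { LIMIT => 10, FILTER => f, COLUMNS => 'family:field' }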
Here are also hints for you that I consider most useful:
http://hadoop-hbase.blogspot.com/2012/01/hbase-intra-row-scanning.html - Intra-row scanning explanation (Java API).
https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/filter/FilterBase.html - JavaDoc for FilterBase class with links to descendants which actually can be used the same style. OK, shell syntax will be slightly different but having example above you can use this.
As there were multiple requests to explain this answer this additional answer has been posted.
Example 1
If
scan '<table>', { COLUMNS => '<column>', LIMIT => 3 }
would return:
ROW COLUMN+CELL
ROW1 column=<column>, timestamp=<timestamp>, value=hello_value
ROW2 column=<column>, timestamp=<timestamp>, value=hello_value2
ROW3 column=<column>, timestamp=<timestamp>, value=hello_value3
then this filter:
scan '<table>', { COLUMNS => '<column>', LIMIT => 3, FILTER => "ValueFilter( =, 'binaryprefix:hello_value2') AND ValueFilter( =, 'binaryprefix:hello_value3')" }
would return:
ROW COLUMN+CELL
ROW2 column=<column>, timestamp=<timestamp>, value=hello_value2
ROW3 column=<column>, timestamp=<timestamp>, value=hello_value3
Example 2
The negated form (!=) is supported as well:
scan '<table>', { COLUMNS => '<column>', LIMIT => 3, FILTER => "ValueFilter( !=, 'binaryprefix:hello_value2' )" }
would return:
ROW COLUMN+CELL
ROW1 column=<column>, timestamp=<timestamp>, value=hello_value
ROW3 column=<column>, timestamp=<timestamp>, value=hello_value3
Here is an example of a text search for the value BIGBLUE in table t1 with column family:qualifier d:a_content. A scan of the table will show all the available values:
scan 't1'
...
column=d:a_content, timestamp=1404399246216, value=BIGBLUE
...
To search just for the value BIGBLUE with a limit of 1, try the command below:
scan 't1',{ COLUMNS => 'd:a_content', LIMIT => 1, FILTER => "ValueFilter( =, 'regexstring:BIGBLUE' )" }
COLUMN+CELL
column=d:a_content, timestamp=1404399246216, value=BIGBLUE
Obviously removing the limit will show all occurrences in that table/cf.
To scan a table in HBase on the basis of a column value, SingleColumnValueFilter can be used as:
scan 'tablename' ,
{
FILTER => "SingleColumnValueFilter('column_family','col_name',>, 'binary:1')"
}
From the HBase shell I think it is not possible, because this is essentially a query with which we want to find specific data. As we all know, HBase is NoSQL, so when we want to apply a query, or in a case like yours, I think you should use Hive or Pig, where Hive is the better approach, because with Pig we need to mess with scripts.
Anyway, you can get good guidance about Hive from HIVE integration with HBase and from here.
If your only purpose is to view data, not to access it from code (of any client), then you can use HBase Explorer, or a new and very good product that is still in its beta release, "HBase Manager". You can get it from HBase Manager.
It is simple and, more importantly, it helps to insert and delete data and to apply filters on column qualifiers from a UI, like other DB clients. Have a try.
I hope it is helpful for you :)
Slightly different question, but if you want to query a specific column which is not present in all rows, DependentColumnFilter is your best friend:
import org.apache.hadoop.hbase.filter.DependentColumnFilter
scan 'orgtable2', {FILTER => "DependentColumnFilter('cf1','lan',false,=,'binary:fre')"}
The previous scan will return all columns for the rows in which the lan column is present and for which its associated value is equal to fre. The third argument is dropDependentColumn and would prevent the lan column itself from being displayed in the results if set to true.