I am writing an ingestion spec to collect data from Kafka.
I know that Druid will automatically convert null values of type double and long to 0.0 and 0.
But I want to store null instead of 0.
So what should I do when setting the type for each dimension column?
Based on my interpretation of the docs, it seems druid.generic.useDefaultValueForNull=false could help here, although it is all or nothing (it affects all columns) and may also have some performance implications: see the docs and this blog. In the ingestion spec docs I could not find anything to suggest that you can have a per-column null handling setting. Just a thought, and I am not sure whether the order of operations during ingestion would let this do anything, but it may be worth a POC: combine setting the above property to false with transformations, where you first use nvl to transform the fields for which you want to keep the default value, as sketched below. I am not sure if that would work or what other implications it may have.
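To make the POC concrete, here is a rough sketch of the kind of transformSpec fragment I have in mind, written as a Python dict purely for illustration (the real ingestion spec is plain JSON, and the column names metric_keep_default and metric_allow_null are made up). With useDefaultValueForNull=false set cluster-wide, every column would ingest nulls as nulls, and the nvl expression would put the 0 default back only on the columns where you still want it. I have not verified this behaviour, so treat it strictly as a starting point:

# Hypothetical fragment of the Kafka ingestion spec, shown as a Python dict;
# the actual spec is JSON and the column names are placeholders.
transform_spec_fragment = {
    "transformSpec": {
        "transforms": [
            {
                # Re-apply the old 0 default for this one column only.
                "type": "expression",
                "name": "metric_keep_default",
                "expression": "nvl(\"metric_keep_default\", 0)",
            }
            # Columns not listed here (e.g. metric_allow_null) are left
            # untouched and, with useDefaultValueForNull=false, keep nulls.
        ]
    }
}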
Related
I have a colleague who tells me that the reason we add default values instead of null values to our tables is that PostgreSQL allocates a number of bytes in a file when a new row is stored, and if such a column gets updated later on, the row might end up being split into two rows in the file, so multiple IO operations will have to occur when reading and writing.
I'm not a PostgreSQL expert at all, and I have a hard time finding any documentation suggesting this.
Can someone clarify this for me?
Is this a good reason for not having null values in a column and using some default instead? Will there be any huge performance issues in such cases?
I'm not sure I'd say the documentation is hard to find:
https://www.postgresql.org/docs/10/storage-file-layout.html
https://www.postgresql.org/docs/current/storage-page-layout.html
It's fair to say there is a lot to absorb though.
So, the reason you SHOULD have defaults rather than NULLs is that you don't want an "unknown" in your column. Start with the requirements before worrying about efficiency tweaks.
Whether a particular value is null is stored in a bitmap. This bitmap is optional - so if there are no nulls in a row then the bitmap is not created. So - that suggests nulls would make a row bigger. But wait, if a bit is set to show null then you don't need the overhead of the value structure, and (IIRC - you'll need to check the docs) that can end up saving you space. There is a good chance that general per-row overheads and type alignment issues are far more important to you though.
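If you want to see that effect for yourself, comparing pg_column_size() on row values with and without a NULL is a quick check. Here is a minimal sketch assuming psycopg2 and a reachable database (the driver choice and connection string are just placeholders; the SELECT can equally be run straight in psql):

import psycopg2  # assumed driver; the SELECT below also works as-is in psql

# Placeholder connection string - adjust for your environment.
conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()

# A row value whose second column is NULL stores no data for that column,
# so it is typically reported as smaller despite carrying a null bitmap.
cur.execute("SELECT pg_column_size(ROW(1::int, NULL::int)) AS with_null, "
            "pg_column_size(ROW(1::int, 2::int)) AS without_null;")
print(cur.fetchone())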
However - all of this is ignoring the elephant* in the room which is that if you update a row then PostgreSQL marks the current version of the row as expired and creates a whole new row. So the whole description of how updates work is just confused in that first paragraph you wrote.
So - don't worry about the efficiency of nulls in 99.9% of cases. Worry about using them properly and about the general structure of your database, its indexes and queries.
* no I'm not apologising for that pun.
The following thread says Snowflake stores metadata about all rows stored in a micro-partition, including the range of values for each of the columns in the micro-partition: https://community.snowflake.com/s/question/0D53r00009kz6HpCAI/are-min-max-values-stored-in-a-micro-partitions-metadata-. What function can I use to retrieve this information? I tried running SYSTEM$CLUSTERING_INFORMATION and it returns total_partition_count, depth, and overlap-related information, but no information about the column values in the micro-partitions. Thanks!
Snowflake stores this meta-data about each partition internally to optimize queries, but it does not publish it.
Part of the reason is security, as knowing metadata about each partition can reveal data that should be masked to some users, or hidden through row level security.
But if there's an interesting business use case for which you need this data, Snowflake is listening.
I have a PySpark dataframe named df. I want to know whether its columns contain any nulls (NA's); I don't care if it is just one row or all of them. The problem is that my current way of checking for nulls is this one:
from pyspark.sql import functions as F

# This counts every null in the column before deciding, so it scans the whole column.
if df.where(F.isnull('column_name')).count() >= 1:
    print("There are nulls")
else:
    print("Yey! No nulls")
The issue I see here is that I need to compute the number of nulls in the whole column, which is a huge amount of wasted time, because I want the process to stop as soon as it finds the first null.
I thought about the solution below, but I am not sure it works (I work on a cluster shared with a lot of other people, so execution time depends on the other jobs running at the same time and I can't compare the two approaches under equal conditions):
(df.where(F.isnull('column_name')).limit(1).count() == 1)
Does adding the limit help? Are there more efficient ways to achieve this?
There is no non-exhaustive search for something that isn't there.
We can probably squeeze a lot more performance out of your query for the case where a record with a null value exists (see below), but what about when it doesn't? If you're planning on running this query multiple times, with the answer changing each time, be aware (I don't mean to imply that you aren't) that if the answer is "there are no null values in the entire dataframe", you will have to scan the entire dataframe to know it, and there is no fast way to do that. If you need this information frequently and the answer can frequently be "no", you'll almost certainly want to persist it somewhere, and update it whenever you insert a record that might contain nulls by checking just that record.
Don't use count().
count() is probably making things worse.
In the count case Spark used a wide transformation: it applied LocalLimit on each partition and shuffled the partial results to perform a GlobalLimit.
In the take case Spark used a narrow transformation and evaluated LocalLimit only on the first partition.
In other words, .limit(1).count() is likely to select one example from each partition of your dataset, before selecting one example from that list of examples. Your intent is to abort as soon as a single example is found, but unfortunately, count() doesn't seem smart enough to achieve that on its own.
As alluded to by the same example, though, you can use take(), first(), or head() to achieve the use case you want. This will more effectively limit the number of partitions that are examined:
If no shuffle is required (no aggregations, joins, or sorts), these operations will be optimized to inspect enough partitions to satisfy the operation - likely a much smaller subset of the overall partitions of the dataset.
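For example, a minimal sketch reusing your column name, which stops scanning as soon as one offending row turns up:

from pyspark.sql import functions as F

# take(1) returns a list containing at most one Row; an empty list means
# no null was found. Spark only reads as many partitions as it needs to.
if len(df.where(F.isnull('column_name')).take(1)) > 0:
    print("There are nulls")
else:
    print("Yey! No nulls")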
Please note, count() can be more performant in other cases. As the other SO question rightly pointed out,
neither guarantees better performance in general.
There may be more you can do.
Depending on your storage method and schema, you might be able to squeeze more performance out of your query.
Since you aren't even interested in the value of the row that was chosen in this case, you can throw a select(F.lit(True)) between your isnull and your take. This should in theory reduce the amount of information the workers in the cluster need to transfer. This is unlikely to matter if you have only a few columns of simple types, but if you have complex data structures, this can help and is very unlikely to hurt.
If you know how your data is partitioned and you know which partition(s) you're interested in or have a very good guess about which partition(s) (if any) are likely to contain null values, you should definitely filter your dataframe by that partition to speed up your query.
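Putting the last two suggestions together, a sketch might look like this (partition_date and the literal date are made up for illustration; column_name is reused from your example):

from pyspark.sql import functions as F

has_nulls = bool(
    df
    # Illustrative partition pruning: only useful if you know which
    # partition(s) to check; drop this line otherwise.
    .where(F.col('partition_date') == '2023-01-01')
    .where(F.isnull('column_name'))
    # We only need to know that a matching row exists, not its contents,
    # so replace each surviving row with a constant to minimise transfer.
    .select(F.lit(True))
    .take(1)
)
print("There are nulls" if has_nulls else "No nulls found in the partitions checked")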
I'm using the PostgreSQL timestamp to determine the end date of a row and would like to populate high values such as "9999-12-31 00.00.00.000000".
How can I do that using a query?
The greatest possible value for a timestamp in PostgreSQL is the special value infinity (+infinity or positive infinity).
The greatest non-infinite value depends on the date/time representation and the data types on the platform. That's part of why it's often best to use infinity or, if more appropriate, null.
If you use infinity, note that many programming languages don't define an infinite value for their date/time types. So the database driver must pick a sentinel value on the client side to represent infinity. It then has the problem that it doesn't know, when writing data back to the server, whether the client application meant that value literally or as a placeholder for infinity. So even though infinity is the correct choice when you want a "higher than everything" value, it's not necessarily a practical choice.
If you really must have a high sentinel value, choose one that's the lowest common maximum among all the client applications and languages you want to use, for both their timestamp and date types. Then add a check constraint that prevents any higher values from being inserted.
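To illustrate both points - that infinity sorts above any finite timestamp, and that the driver has to fall back to a client-side sentinel - here is a small sketch using psycopg2 (connection details, table and column names are placeholders; mapping infinity to datetime.max is psycopg2's documented default, but confirm it for your own driver and version), followed by the kind of check constraint suggested above:

import psycopg2  # assumed driver; the connection string is a placeholder

conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()

# infinity is greater than any finite timestamp, including year 9999.
cur.execute("SELECT 'infinity'::timestamp > TIMESTAMP '9999-12-31 00:00:00';")
print(cur.fetchone()[0])  # True

# On the client side psycopg2 returns a sentinel (datetime.max by default),
# which is exactly the round-tripping ambiguity described above.
cur.execute("SELECT 'infinity'::timestamp;")
print(cur.fetchone()[0])

# If you go the sentinel route instead, a check constraint keeps anything
# higher from sneaking in. Table and column names here are made up.
cur.execute("""
    ALTER TABLE my_table
    ADD CONSTRAINT end_date_at_most_sentinel
    CHECK (end_date <= TIMESTAMP '9999-12-31 00:00:00');
""")
conn.commit()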
My application uses a DB2 database. I created a sequence for my table to generate the primary key. It was working fine until today, but now it seems to be generating existing values and I am getting a DuplicateKeyException while inserting rows. After a bit of googling I found that the sequence can be reset.
Could someone please help me with the best possible option, as I have not worked with sequences much, and with the things to consider while going with that approach?
If I have to reset the sequence, what is the best way to do it, and what should I consider before doing so? It would also be great to know what could be the reason behind the issue I am facing so that I can take care of it in the future.
Just for information, the max value assigned while creating the sequence has not yet been reached.
Thanks a lot in advance.
ALTER SEQUENCE SCHEMA.SEQ_NAME RESTART WITH NUMERIC_VALUE;
This was required in my case, i.e. restarting the sequence with a value higher than the current maximum value of the id field for which the sequence was being used.
NUMERIC_VALUE denotes a value higher than the current maximum of my sequence-generated field.
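If it helps anyone later, here is a rough sketch of the two-step fix in Python with the ibm_db driver (DB2 does not accept a subquery in ALTER SEQUENCE, so you look up the maximum first; the connection string, schema, table, column and sequence names below are all placeholders):

import ibm_db  # assumed driver; the connection string below is a placeholder

conn = ibm_db.connect(
    "DATABASE=MYDB;HOSTNAME=localhost;PORT=50000;PROTOCOL=TCPIP;UID=user;PWD=pwd",
    "", "")

# Find the current maximum of the key column that the sequence feeds.
stmt = ibm_db.exec_immediate(conn, "SELECT MAX(ID) FROM MYSCHEMA.MYTABLE")
current_max = ibm_db.fetch_tuple(stmt)[0] or 0

# RESTART WITH needs a literal value, so build the statement with max + 1.
ibm_db.exec_immediate(
    conn, "ALTER SEQUENCE MYSCHEMA.SEQ_NAME RESTART WITH %d" % (int(current_max) + 1))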
Hope it will be helpful for others.
The probable cause of this issue was manual insertion of records into the db.