Is it possible to enumerate a table in-memory - kdb

I want to conduct some analysis on a large table in-memory.
Firstly, will it be faster to go through the data if enumerated even in-memory?
Secondly, is there a simple way to enumerate the entire table? .Q.en exists, but it saves the table as a splayed table. I could always use get afterwards, but is there a better way if I don't want to save the table down?
In my table, I have one column of type dictionary where the keys are symbols, the other types are either symbols, symbol lists or integers.
Thanks

There is no benefit to enumerating symbols in memory.
kdb+ automatically interns each unique symbol for the lifetime of the process which gives you the speed advantage without you needing to take action.
You can view their memory usage in .Q.w[]
syms: number of interned symbols
symw: bytes used by interned symbols
https://code.kx.com/q/ref/dotq/#qw-memory-stats
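As a rough analogy only (this is Python, not q, and says nothing about how kdb+ implements symbols internally), string interning can be sketched like this: equal strings are mapped to one shared object, so later comparisons can be done by identity rather than character by character.

```python
import sys

# Build two equal strings at runtime from different pieces.
a = "".join(["GO", "OG"])
b = "".join(["GOO", "G"])

# Interning returns one canonical object for each distinct string value.
ia = sys.intern(a)
ib = sys.intern(b)

values_equal = (a == b)    # compares contents
same_object = (ia is ib)   # compares identity only, no character scan needed
```

The point of the sketch is only that the interning happens once per unique value, which is why an extra enumeration step adds nothing in memory.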

Redshift: How to see columns that are analyzed in a table?

Is there a way to see the "Analyzed" values in a Redshift table? In Teradata, the help stats command gives you this. I'm wondering if there is an equivalent in Redshift. Can't find anything on that.
Yes you can, with some work. First off, remember that Redshift stores data in 1MB blocks and these blocks are distributed across the cluster (slices). So the metadata that is created by ANALYZE gives information about these blocks and what they contain.
The table stv_blocklist contains this information about block contents. It contains a number of pieces of information about the blocks but the pieces that will likely be of most interest are minvalue and maxvalue (along with table id, slice, and column so you know what part of the database any block is a part of).
minvalue and maxvalue are BIGINTs, so these values make perfect sense if the column they describe is also a BIGINT. However, if your column is text or a timestamp, you will need to do some reverse engineering to understand the hash used to store values for these column types as BIGINTs. I've done it for just about all data types in the past and it isn't too hard to work out. If memory serves, there is some endian trick with the text values, but again it is not too hard to work out since you can see the input values and output values. With this decoder ring you can evaluate how effective your sort keys are at speeding up queries.

POSTGRESQL JSONB column for storing follower-following relationship [duplicate]

Imagine a web form with a set of check boxes (any or all of them can be selected). I chose to save them in a comma separated list of values stored in one column of the database table.
Now, I know that the correct solution would be to create a second table and properly normalize the database. It was quicker to implement the easy solution, and I wanted to have a proof-of-concept of that application quickly and without having to spend too much time on it.
I thought the saved time and simpler code was worth it in my situation, is this a defensible design choice, or should I have normalized it from the start?
Some more context, this is a small internal application that essentially replaces an Excel file that was stored on a shared folder. I'm also asking because I'm thinking about cleaning up the program and make it more maintainable. There are some things in there I'm not entirely happy with, one of them is the topic of this question.
In addition to violating First Normal Form because of the repeating group of values stored in a single column, comma-separated lists have a lot of other more practical problems:
Can’t ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
Can’t use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
Can’t enforce uniqueness: no way to prevent 1,2,3,3,3,5
Can’t delete a value from the list without fetching the whole list.
Can't store a list longer than what fits in the string column.
Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan. May have to resort to regular expressions, for example in MySQL:
idlist REGEXP '[[:<:]]2[[:>:]]' or in MySQL 8.0: idlist REGEXP '\\b2\\b'
Hard to count elements in the list, or do other aggregate queries.
Hard to join the values to the lookup table they reference.
Hard to fetch the list in sorted order.
Hard to choose a separator that is guaranteed not to appear in the values
To solve these problems, you have to write tons of application code, reinventing functionality that the RDBMS already provides much more efficiently.
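For comparison, here is a minimal sketch of the normalized alternative, using Python's built-in sqlite3 module. The table and column names (options, submissions, submission_options) are hypothetical and not taken from the question; the sketch only shows how several of the problems listed above are handled by the database itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled
conn.executescript("""
    CREATE TABLE options (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
    CREATE TABLE submissions (id INTEGER PRIMARY KEY);
    CREATE TABLE submission_options (
        submission_id INTEGER NOT NULL REFERENCES submissions(id),
        option_id     INTEGER NOT NULL REFERENCES options(id),
        PRIMARY KEY (submission_id, option_id)  -- an option appears at most once per submission
    );
""")
conn.executemany("INSERT INTO options VALUES (?, ?)", [(1, "email"), (2, "sms"), (3, "phone")])
conn.executemany("INSERT INTO submissions VALUES (?)", [(10,), (11,)])
conn.executemany("INSERT INTO submission_options VALUES (?, ?)", [(10, 1), (10, 2), (11, 2)])

# Search: all submissions that selected option 2 (can use the primary-key index).
with_option_2 = [row[0] for row in conn.execute(
    "SELECT submission_id FROM submission_options WHERE option_id = ? ORDER BY submission_id",
    (2,))]

# Aggregate: number of selected options per submission.
counts = dict(conn.execute(
    "SELECT submission_id, COUNT(*) FROM submission_options GROUP BY submission_id"))

def try_insert(submission_id, option_id):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute("INSERT INTO submission_options VALUES (?, ?)", (submission_id, option_id))
        return True
    except sqlite3.IntegrityError:
        return False

duplicate_accepted = try_insert(11, 2)   # violates the composite primary key
unknown_accepted = try_insert(11, 99)    # violates the foreign key to options

# Delete one value without fetching or rewriting a list.
conn.execute("DELETE FROM submission_options WHERE submission_id = ? AND option_id = ?", (10, 1))
```

Both rejected inserts fail inside the database, without any application-side validation code.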
Comma-separated lists are wrong enough that I made this the first chapter in my book: SQL Antipatterns, Volume 1: Avoiding the Pitfalls of Database Programming.
There are times when you need to employ denormalization, but as @OMG Ponies mentions, these are exception cases. Any non-relational “optimization” benefits one type of query at the expense of other uses of the data, so be sure you know which of your queries need to be treated so specially that they deserve denormalization.
"One reason was laziness".
This rings alarm bells. The only reason you should do something like this is that you know how to do it "the right way" but you have come to the conclusion that there is a tangible reason not to do it that way.
Having said this: if the data you are choosing to store this way is data that you will never need to query by, then there may be a case for storing it in the way you have chosen.
(Some users would dispute the statement in my previous paragraph, saying that "you can never know what requirements will be added in the future". These users are either misguided or stating a religious conviction. Sometimes it is advantageous to work to the requirements you have before you.)
There are numerous questions on SO asking:
how to get a count of specific values from the comma separated list
how to get records that have only the same 2/3/etc specific value from that comma separated list
Another problem with the comma separated list is ensuring the values are consistent - storing text means the possibility of typos...
These are all symptoms of denormalized data, and highlight why you should always model for normalized data. Denormalization can be a query optimization, to be applied when the need actually presents itself.
In general anything can be defensible if it meets the requirements of your project. This doesn't mean that people will agree with or want to defend your decision...
In general, storing data in this way is suboptimal (e.g. harder to do efficient queries) and may cause maintenance issues if you modify the items in your form. Perhaps you could have found a middle ground and used an integer representing a set of bit flags instead?
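The bit-flag idea could look something like the following sketch, using Python's enum.IntFlag. The checkbox names are hypothetical; the single integer in `stored` is what would go into the database column.

```python
from enum import IntFlag

class Choice(IntFlag):
    # Hypothetical checkbox names; each one occupies its own bit.
    EMAIL = 1
    SMS = 2
    PHONE = 4

# Combine the ticked boxes into one value and store it as a plain integer.
selected = Choice.EMAIL | Choice.PHONE
stored = int(selected)

# Reading it back: rebuild the flag set and test individual boxes.
restored = Choice(stored)
has_sms = Choice.SMS in restored
has_phone = Choice.PHONE in restored
```

This fixes the data type and uniqueness problems, but adding or removing checkboxes still means reassigning bits, and the values still cannot reference a lookup table.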
Yes, I would say that it really is that bad. It's a defensible choice, but that doesn't make it correct or good.
It breaks first normal form.
A second criticism is that putting raw input results directly into a database, without any validation or binding at all, leaves you open to SQL injection attacks.
What you're calling laziness and lack of SQL knowledge is the stuff that neophytes are made of. I'd recommend taking the time to do it properly and view it as an opportunity to learn.
Or leave it as it is and learn the painful lesson of a SQL injection attack.
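Whatever the storage layout, the binding mentioned above can be sketched with a parameterized query. This uses Python's sqlite3 with a hypothetical answers table; the hostile-looking input is stored as ordinary text because it is passed as a bound parameter and never spliced into the SQL string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE answers (id INTEGER PRIMARY KEY, choice TEXT)")  # hypothetical table

# Input that would be dangerous if concatenated into the statement text.
user_input = "x'); DROP TABLE answers; --"

# The ? placeholder binds the value; the driver never parses it as SQL.
conn.execute("INSERT INTO answers (choice) VALUES (?)", (user_input,))

stored = conn.execute("SELECT choice FROM answers").fetchone()[0]
table_count = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table' AND name = 'answers'"
).fetchone()[0]
```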
I needed a multi-value column, it could be implemented as an xml field
It could be converted to a comma delimited as necessary
querying an XML list in sql server using Xquery.
By being an xml field, some of the concerns can be addressed.
With CSV: Can't ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
With XML: values in a tag can be forced to be the correct type
With CSV: Can't use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
With XML: still an issue
With CSV: Can't enforce uniqueness: no way to prevent 1,2,3,3,3,5
With XML: still an issue
With CSV: Can't delete a value from the list without fetching the whole list.
With XML: single items can be removed
With CSV: Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan.
With XML: xml field can be indexed
With CSV: Hard to count elements in the list, or do other aggregate queries.
With XML: not particularly hard
With CSV: Hard to join the values to the lookup table they reference.
With XML: not particularly hard
With CSV: Hard to fetch the list in sorted order.
With XML: not particularly hard
With CSV: Storing integers as strings takes about twice as much space as storing binary integers.
With XML: storage is even worse than a csv
With CSV: Plus a lot of comma characters.
With XML: tags are used instead of commas
In short, using XML gets around some of the issues with delimited list AND can be converted to a delimited list as needed
Yes, it is that bad. My view is that if you don't like using relational databases then look for an alternative that suits you better, there are lots of interesting "NOSQL" projects out there with some really advanced features.
Well, I've been using a tab-separated list of key/value pairs in an NTEXT column in SQL Server for more than 4 years now and it works. You do lose the flexibility of making queries, but on the other hand, if you have a library that persists/depersists the key/value pairs, then it's not that bad an idea.
I would probably take the middle ground: make each field in the CSV into a separate column in the database, but not worry much about normalization (at least for now). At some point, normalization might become interesting, but with all the data shoved into a single column you're gaining virtually no benefit from using a database at all. You need to separate the data into logical fields/columns/whatever you want to call them before you can manipulate it meaningfully at all.
If you have a fixed number of boolean fields, you could use an INT(1) NOT NULL (or BIT NOT NULL if it exists) or CHAR(0) (nullable) for each. You could also use a SET (I forget the exact syntax).

Why can't keyed table be splayed in kdb?

Keyed tables are nothing but dictionary mapping of two tables like:
q)kts:([] sym:`GOOG`AMZN`FB)!([] px:3?10.; size:3?100000)
q).Q.dpft[`:/path/db;.z.d;`id;`kts]
'nyi
[0] .Q.dpft[`:/path/db;.z.d;`id;`kts]
Why is there a limitation that keyed tables cannot be splayed or partitioned?
I think the simplest answer comes from both the technical and the logical.
Technical: there is no way in the on-disk format to indicate this currently. The .d file indicates the order of columns on disk but not any further metadata. This could be changed at a later date in theory.
The logical answer comes from the size of the data in question. Splayed tables are typically used when you want to hold a few columns in memory. A decade ago this meant that splayed tables were useful for holding up to 100M rows but with 3.x and modern memory that upper limit can be well north of 250M. I don't think there's a good way to make that kind of join performant in ad-hoc calculation. The grouped attribute index supported to make that work is around the same size as the column on disk and would need to be constantly re-written as data is appended.
I think the use of 'nyi in this case, to mean, "we probably need to think about this one for a bit", is appropriate.
The obvious solution is to look at explicit row relationships via linking columns, where the lookup calculation is done ahead of time.

What are the differences between Tables and Categorical Arrays, and cell and struct arrays?

In the newest version of MATLAB there are two new data types: Tables and Categorical Arrays.
Table is a new data type suitable for holding data and metadata, and can be used with mixed-type tabular data that are often stored as columns in a text file or in a spreadsheet. It consists of rows and column-oriented variables.
Categorical arrays are useful for holding categorical data - which have values from a finite list of discrete categories.
In previous versions I would have handled these use cases using cell and struct arrays. What are the differences between these and the new data types?
I haven't upgraded yet so I can't play around but based on this video and this article I can already see some advantages. They're not necessarily adding functionality that you couldn't do before, but rather just taking the hassle out of it. Using readtable over xlsread is immediately appealing to me. Being able to access columns by name rather than just by index is great, I do it in other languages often. In a table where column order doesn't really matter (unlike a matrix) it's really convenient to be able to address a column by its name instead of having to know the column order. Also you can merge tables using the join function, which wasn't that easy to do with cell arrays before. I see that you can name the rows too. I didn't see what advantage that gives you and I can't play around, but I know in some languages (like pandas in Python and I think in R as well) naming rows means you can work with time series data with different series that are not completely overlapping and not have to worry about alignment. I hope this is the case in Matlab too! Categorical arrays also look like just an extra layer of convenience, kind of like an enum. You never actually need an enum but it just makes development more pleasant.
Anyway that's just my two cents, I probably won't get an opportunity to play around with them any time soon but I look forward to using them when I do need them.
I use the table format to organize different input/output cases in my data, where the result may come from different tables. Main advantages compared to struct or cell array:
convenient table functions such as join, innerjoin, outerjoin
the use of fields <> more robust programming than arrays
data format is easy to export/import (e.g. delimited .txt file) <> no fprintf()
the data file can be opened in excel/Calc (libreoffice) <> no .mat

Optimizing word count

(This is rather hypothetical in nature as of right now, so I don't have too many details to offer.)
I have a flat file of random (English) words, one on each line. I need to write an efficient program to count the number of occurrences of each word. The file is big (perhaps about 1GB), but I have plenty of RAM for everything. They're stored on permanent media, so read speeds are slow, so I need to just read through it once linearly.
My two off-the-top-of-my-head ideas were to use a hash with words => no. of occurrences, or a trie with the no. of occurrences at the end node. I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups.
What approach would be best?
I think a trie with the count as the leaves could be faster.
Any decent hash table implementation will require reading the word fully, processing it using a hash function, and finally, a look-up in the table.
A trie can be implemented such that the search occurs as you are reading the word. This way, rather than doing a full look-up of the word, you could often find yourself skipping characters once you've established the unique word prefix.
For example, if you've read the characters: "torto", a trie would know that the only possible word that starts this way is tortoise.
If you can perform this inline search on a word faster than the hashing algorithm can hash it, the trie should come out ahead.
However, this is total overkill. I rambled on since you said it was purely hypothetical, I figured you'd like a hypothetical-type of answer. Go with the most maintainable solution that performs the task in a reasonable amount of time. Micro-optimizations typically waste more time in man-hours than they save in CPU-hours.
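For illustration, a trie with a count stored at the node where each word ends could be sketched as below. The names TrieNode, count_words_trie and trie_count are made up for this example, and the sketch does not attempt the prefix-skipping optimization described above; it walks every character of every word.

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # maps a character to the next TrieNode
        self.count = 0      # number of words that end exactly at this node

def count_words_trie(words):
    """Insert every word into a trie, incrementing the count at its final node."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            nxt = node.children.get(ch)
            if nxt is None:
                nxt = node.children[ch] = TrieNode()
            node = nxt
        node.count += 1
    return root

def trie_count(root, word):
    """Return how many times word was inserted (0 if it never was)."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count

root = count_words_trie(["tortoise", "tort", "tortoise"])
```

Note that counts live on the end node of a word, which is not always a leaf: "tort" ends partway along the path to "tortoise".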
I'd use a Dictionary object where the key is word converted to lower case and the value is the count. If the dictionary doesn't contain the word, add it with a value of 1. If it does contain the word, increment the value.
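In Python, which is only one possible choice of language here, that description translates almost line for line into the following sketch. The function name count_words is hypothetical.

```python
def count_words(lines):
    """Count occurrences of each word, treating case variants as the same word."""
    counts = {}
    for line in lines:
        word = line.strip().lower()
        if not word:
            continue  # skip blank lines
        if word in counts:
            counts[word] += 1  # seen before: increment
        else:
            counts[word] = 1   # first occurrence: add with a value of 1
    return counts

example = count_words(["Apple\n", "apple\n", "Pear\n"])
```

Passing an open file object as `lines` would read it once, linearly, which matches the constraint in the question.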
Given slow reading, it's probably not going to make any noticeable difference. The overall time will be completely dominated by the time to read the data anyway, so that's what you should work at optimizing. For the algorithm (mostly data structure, really) in memory, just use whatever happens to be most convenient in the language you find most comfortable.
A hash table is (if done right, and you said you had lots of RAM) O(1) to count a particular word, while a trie is going to be O(n) where n is the length of the word.
With a sufficiently large hash space, you'll get much better performance from a hash table than from a trie.
I think that a trie is overkill for your use case. A hash of word => # of occurrences is exactly what I would use. Even using a slow interpreted language like Perl, you can munge a 1GB file this way in just a few minutes. (I've done this before.)
I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups.
How many times will this code be run? If you're just doing it once, I'd say optimize for your time rather than your CPU's time, and just do whatever's fastest to implement (within reason). If you have a standard library function that implements a key-value interface, just use that.
If you're doing it many times, then grab a subset (or several subsets) of the data file, and benchmark your options. Without knowing more about your data set, it'd be dubious to recommend one over another.
Use Python!
Add each word to a set as you go line by line, and check the set before asking whether the word is in the hash table. If the word is already in the set, add it to the dictionary with a value of 2, since you already added it to the set once before; after that, increment its dictionary value.
This takes some of the memory and computation away from the dictionary, which no longer has to hold words that occur only once. At the end of the run, dump every word that is in the set but not in the dictionary with a value of 1 (that is, take the difference of the two collections).
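One way to read the set-plus-dictionary idea is sketched below. The function name count_with_seen_set is hypothetical, and whether this actually beats a single dictionary is not something the sketch measures; it would need benchmarking on real data.

```python
def count_with_seen_set(words):
    """Track first sightings in a set; only repeated words get a dictionary entry."""
    seen_once = set()
    repeated = {}
    for word in words:
        if word in repeated:
            repeated[word] += 1      # third or later occurrence
        elif word in seen_once:
            repeated[word] = 2       # second occurrence: promote to the dictionary
        else:
            seen_once.add(word)      # first occurrence: remember it in the set only
    # Words in the set that never reached the dictionary occurred exactly once.
    counts = {word: 1 for word in seen_once if word not in repeated}
    counts.update(repeated)
    return counts

result = count_with_seen_set(["a", "b", "a", "c", "a"])
```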
To a large extent, it depends on what you want to do with the data once you've captured it. See Why Use a Hash Table over a Trie (Prefix Tree)?
A simple Python script:
import collections

# Count each word (one per line) in a single pass over the file.
counts = collections.defaultdict(int)
with open('words.txt') as f:
    for line in f:
        counts[line.strip()] += 1

print("\n".join("%s: %d" % (word, count) for word, count in counts.items()))