Column type for ZipCode in PostgreSQL database?

What is the correct column type for holding ZipCode values in PostgreSQL database?

I strongly disagree with the advice presented here.
The accepted answer accepts things that aren't digits.
The question is about Zip Codes, not postal codes.
If we assume the post is wrong and means international postal codes, there are characters that appear in international postal codes that don't appear in that list, and many international postal codes - and also US domestic ones - can be over ten characters.
If we actually answer the question they asked, about zip codes, then there should be no accommodation for anything but digits (and arguably the hyphen).
US zip codes can be up to 11 digits long (13 characters counting the two dashes) - there is a zip, a zip+4, and a zip+6 (which programmers would call zip+4+2) notation; the last is used by skyscrapers, universities, et cetera.
US zip codes are always non-negative integers, and therefore should not be stored as text data, which is subject to non-canonical representation problems (ask anyone who has built such a system about the time they found out that their zip 00203 didn't match the zip 203 they accidentally got by constantly and unnecessarily re-parsing string representations).
If you're actually tracking international postal codes, the short, length-limited text fields suggested here don't even begin to do the job. The word "China" comes to mind.
My opinion:
Decide whether you're actually handling US postal codes or international
If you're handling US postal codes, track them as unsigned integers, and left-pad them with zeros when representing them as text. (Think Unix timestamps and local TZ representations if you need to understand why this will be simpler in the long run.)
If you're handling international post codes, store them in an unbounded Unicode string, tie them to the country they represent, and validate country by country with check constraints. This problem is far more difficult than it sounds up front. International addresses are some of the least standardized things on Earth. Wait'll you find out how Japanese house numbers work, or why the British postal 6-code has the gaps it has. (A rough sketch of both approaches follows below.)
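For illustration, here is a minimal sketch of both approaches in PostgreSQL. The table and column names are made up, only the common zip and zip+4 parts are shown, and the per-country patterns are just examples, not a complete rule set.

-- US-only: store the zip as a non-negative integer and format on output.
CREATE TABLE us_address (
    id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    zip5  integer NOT NULL CHECK (zip5 BETWEEN 0 AND 99999),
    plus4 integer          CHECK (plus4 BETWEEN 0 AND 9999)
);

-- Left-pad with zeros only when rendering as text.
SELECT lpad(zip5::text, 5, '0')
       || COALESCE('-' || lpad(plus4::text, 4, '0'), '') AS zip
  FROM us_address;

-- International: unbounded text tied to a country, validated per country.
CREATE TABLE intl_address (
    id           bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    country_code char(2) NOT NULL,
    postal_code  text    NOT NULL,
    CHECK (country_code <> 'US' OR postal_code ~ '^[0-9]{5}(-[0-9]{4})?$'),
    CHECK (country_code <> 'CA' OR postal_code ~ '^[A-Z][0-9][A-Z] ?[0-9][A-Z][0-9]$')
);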

A US zip code is something like xxxxx-xxxx, so varchar(10) is recommended.
If you want to check the syntax of the values in the database, you could create a domain type for zip codes.
CREATE DOMAIN zipcode varchar(10)
    CONSTRAINT valid_zipcode
        CHECK (VALUE ~ '^[A-Z0-9-]+$'); -- anchored so the whole value must match; or a better regular expression
You could have a look at this site, which proposes this regex:
(^\d{5}(-\d{4})?$)|(^[ABCEGHJKLMNPRSTVXY]{1}\d{1}[A-Z]{1} *\d{1}[A-Z]{1}\d{1}$)
But you should check that it works with PostgreSQL's regex syntax.
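PostgreSQL's regexes do understand \d and the alternation used above, so a quick way to sanity-check the expression before putting it in a domain or CHECK constraint is to run it against a few sample values (the values below are made up):

-- The first three should come back true, the last one false.
SELECT code,
       code ~ '(^\d{5}(-\d{4})?$)|(^[ABCEGHJKLMNPRSTVXY]{1}\d{1}[A-Z]{1} *\d{1}[A-Z]{1}\d{1}$)' AS looks_valid
  FROM (VALUES ('12345'), ('12345-6789'), ('K1A 0B1'), ('1234-56')) AS t(code);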

It depends on what kind of zip you want. If you're sure you will only ever need to store the standard 5 digits, then an int will be the most space-saving.
However, if you need the 5+4 extended format, then a 10-character field is best. I personally suggest that, as it makes things easier in the future if you end up needing to store international postal codes; 10 characters covers just about every postal code format I've come across.

Related

Incorrect data population issue in numeric field in PF in AS400 [duplicate]

This question already has an answer here: What do hyphens signify in Db2 for i query results?
I have a PF (physical file) in DB2 which is showing a ++++ sign; the column is defined as numeric with length 3.
I have tried the ABS, ABSVAL, ROUND, TRUNCATE, REPLACE, and CHAR BIFs on this column, but none of them shows me what this ++++ actually is. Because of this ++++ sign, I cannot insert any data on this row, thereby stopping anything from being inserted after this row.
If possible, I am looking to remove this incorrect ++++ data from the file.
I shall be grateful for any help/guidance.
Thank you @jmarkmurphy and @charles for your thoughtful inputs; I found my solution in a combination of both your suggestions.
I have tried to summarize the issue for future readers below.
So apparently the ++++ sign (or - sign) is DB2's way of showing corrupt data. Such data can appear during data transmission between two systems, or when a decimal data error is handled improperly by the operator.
However, I have researched this a lot and there is no way to see what this ++++ or - sign actually holds.
Still, just to satisfy our curiosity, as recommended by @jmarkmurphy, we can use the hex() function on the column.
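For reference, this is roughly what that query looks like (MYLIB, MYFILE and MYCOL below are placeholders for the actual library, file and column names):

-- Show the raw bytes behind the column that displays as ++++.
SELECT MYCOL, HEX(MYCOL) AS raw_bytes
  FROM MYLIB.MYFILE;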
Notice that the hex value for ++++ is 404040; x'40' is the blank character in EBCDIC (the same code point is '@' in the ASCII table), so the field actually holds blanks rather than valid decimal digits.
Reference for hex-to-ASCII conversion:
https://www.freecodecamp.org/news/ascii-table-hex-to-ascii-value-character-code-chart-2/
The only possible way to deal with such corrupt data is to isolate it and remove it. Thanks to @charles for this URL:
What do hyphens signify in Db2 for i query results?
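For completeness, a sketch of isolating and removing the bad rows by matching on their hex representation rather than on the unreadable value itself; again MYLIB, MYFILE and MYCOL are placeholders:

-- Keep a copy of the corrupt rows somewhere safe first.
CREATE TABLE MYLIB.BADROWS AS
    (SELECT * FROM MYLIB.MYFILE WHERE HEX(MYCOL) = '404040') WITH DATA;

-- Then remove them from the original file.
DELETE FROM MYLIB.MYFILE
 WHERE HEX(MYCOL) = '404040';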

How to do name similarity using clustering

I have a very big -super big- database of names.
The task is to find all the similar names (of the same person, per se) despite some differences like:
first name and last name swapped --> John Doe & Doe John
two or more names (the same ones) with slight changes, maybe some letters misplaced or something else --> Jonh Doe & John Deo
two names with some letters added --> Johhn Doe & Johnn Doee & John Doe
names where another middle name was inserted --> John Blair Campbell Doe & John Blair Doe
And so on..
I tried using the classical methods like Soundex and Levenshtein, but the results were not very good; I had results like Amine depi and Amina dope ending up in the same group even though they're different.
It would also take very long to perform the task on just a fraction of the data, and on my full database it would simply crash after a long time.
I also thought of using another approach like cosine similarity, which works on numerical values, so I thought of finding a way to represent the names numerically or to convert them (something like word2vec). I actually tried using word2vec directly with the whole database of names as the text, but as expected it didn't work. I also tried encoding the names at a low level, ASCII codes for example, but the results weren't good either.
So I thought of Clustering.
So I tried using DBSCAN. I found a way to use DBSCAN clustering with a custom distance metric and used Levenshtein distance. (If you ask me why DBSCAN? It is because I don't know the number of groups of similar names in the database to begin with.)
I did get some results, but very poor performance overall. It would either put only exact duplicates, John Doe and John Doe, in the same cluster, or nothing at all, and it would even skip some exact matches.
Do you have a suggestion for performing this task? Preferably using clustering or another smart way, since the database is very big (more than 500,000 lines and up to millions), so I cannot iterate a lot.
I am open to suggestions or propositions!
Especially if you have worked on something like this or similar before. Thank you in advance.
Try AgglomerativeClustering.
Sample code:
from sklearn.cluster import AgglomerativeClustering

clustering = AgglomerativeClustering(
    n_clusters=None,        # let distance_threshold decide how many clusters there are
    distance_threshold=0.3  # smaller threshold means stricter similarity and more clusters
).fit(your_vectorized_name_list)  # expects a numeric matrix, i.e. the names already vectorized

print(f'total clusters: {clustering.n_clusters_}')

Money type: remove currency symbol without type casts

I'm using MONEY type for currency data in my Postgres table. When I select data, postgres formats values according to system's lc_monetary setting.
I would like to get rid of the currency symbol in the query result without using explicit type casts (I'm using Laravel's query builder currently; type casts would require raw queries).
Is there a way to set up the lc_monetary config setting so that currency values in query results are formatted exactly like simple floats, with 2-digit precision and without a thousands separator (so that I can use them as a string/float in my PHP code)?
Most people I have talked to recommend not using the money type. Typically MONEY values get output as strings by your local implementation because of the LC_MONETARY formatting. Most people (myself included) recommend using NUMERIC for your monetary values.
Also you mentioned placing your money values in a float. Floats on computers have rounding errors naturally and can cause issues with monetary amounts, so be careful.
In Python we use the decimal class when we need to do math on money, I assume that PHP has something similar.
Is select money_column/1::money safe...
According to the docs, the result of dividing money by money is double precision (the currency cancels out).
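To make that concrete, here is a sketch (the orders table and price column are made up). Both expressions drop the currency symbol; the division returns double precision, while the cast returns numeric, which is generally the safer choice, and changing the column type avoids the problem for good. Note that Laravel would have to send these as raw select expressions.

SELECT price / 1::money AS price_float,    -- double precision, no currency symbol
       price::numeric   AS price_numeric   -- numeric, no currency symbol
  FROM orders;

-- Long term: move off the money type entirely.
ALTER TABLE orders
    ALTER COLUMN price TYPE numeric(12,2) USING price::numeric;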

Solr date field tdate vs date?

So I have a question about Solr's date field types which is pretty straightforward: what's the difference between a 'date' field and a 'tdate' one?
The schema.xml claims 'For faster range queries, consider the tdate type' and 'A Trie based date field for faster date range queries and date faceting.'
Fair enough... but what's the precisionStep="6" all about? Should I change it? Does it change the way I would create the query if I use tdate? What's the real advantage, or what does Solr do that makes it better?
P.S. I went through Google, the Solr manual, the Solr wiki and the Javadocs without any luck, so I'd appreciate a kind and explanatory answer :)...
Also checked:
http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/
http://web.archiveorange.com/archive/v/AAfXfqRYyLnDFtskmLRi
Trie fields make range queries faster by precomputing certain range results and storing them as a single record in the index. For clarity, my example will use integers in base ten. The same concept applies to all trie types. This includes dates, since a date can be represented as the number of seconds since, say, 1970.
Let's say we index the number 12345678. We can tokenize this into the following tokens.
12345678
123456xx
1234xxxx
12xxxxxx
The 12345678 token represents the actual integer value. The tokens with the x digits represent ranges. 123456xx represents the range 12345600 to 12345699, and matches all the documents that contain a token in that range.
Notice how each token in the list has successively more x digits. This is controlled by the precision step. In my example, you could say that I was using a precision step of 2, since I trim 2 digits to create each extra token. If I were to use a precision step of 3, I would get these tokens.
12345678
12345xxx
12xxxxxx
A precision step of 4:
12345678
1234xxxx
A precision step of 1:
12345678
1234567x
123456xx
12345xxx
1234xxxx
123xxxxx
12xxxxxx
1xxxxxxx
It's easy to see how a smaller precision step results in more tokens and increases the size of the index. However, it also speeds up range queries.
Without the trie field, if I wanted to query a range from 1250 to 1275, Lucene would have to fetch 25 entries (1250, 1251, 1252, ..., 1275) and combine search results. With a trie field (and precision step of 1), we could get away with fetching 8 entries (125x, 126x, 1270, 1271, 1272, 1273, 1274, 1275), because 125x is a precomputed aggregation of 1250 - 1259. If I were to use a precision step larger than 1, the query would go back to fetching all 25 individual entries.
Note: In reality, the precision step refers to the number of bits trimmed for each token. If you were to write your numbers in hexadecimal, a precision step of 4 would trim one hex digit for each token. A precision step of 8 would trim two hex digits.
Basically, trie ranges are faster. Here is one explanation. With precisionStep you configure how much your index can grow in order to get the performance benefits. To quote from the link you are referring to:
More importantly, it is not dependent on the index size, but instead the precision chosen.
and
the only drawbacks of TrieRange are a little bit larger index sizes, because of the additional terms indexed
Your best bet is to just look at the source code. Some of the things for Solr aren't well documented and the fastest way to get a trustworthy answer is to simply look at the code. If you haven't been in the code yet, that too is to your benefit. At least in the long run.
Here's a link to the TrieTokenizerFactory.
http://www.jarvana.com/jarvana/view/org/apache/solr/solr-core/1.4.1/solr-core-1.4.1-sources.jar!/org/apache/solr/analysis/TrieTokenizerFactory.java?format=ok
The javadoc in the class at least hints at the purpose of the precisionStep. You could dig further.
EDIT: I dug a bit further for you. It's passed off directly to Lucene's NumericTokenStream class, which uses the value when parsing the token stream. Probably worth closer examination. It seems to deal with granularity and is probably a tradeoff between index size and speed.

Filemaker: making queries of large data more efficient

OK I have a Master Table of shipments, and a separate Charges table. There are millions of records in each, and it's come into Filemaker from a legacy system, so all the fields are defined as Text even though they may be Date, Number, etc.
There's a date field in the charges table. I want to create a number field to represent just the year. I can use the Middle function to parse the field and get just the year in a Calculation field. But wouldn't it be faster to have the year as a literal number field, especially since I'm going to be filtering and sorting? So how do I turn this calculation into its value? I've tried just changing the Calculation field to Number, but it just renders blanks.
There's something wrong with your calculation; it should not turn blank just because the field type is different. E.g.:
Middle("10-12-2010", 7, 4)
should suffice, provided the calc result is set to Number. You may also wrap it into GetAsNumber(...), but, really, there's no difference as long as field type is right.
If you have FM Advanced, try to set up your calc in the Data Viewer (Tools -> Data Viewer) rather than in Define Fields, this would be faster and, once you like the result, you can transfer it into a field or make a replace. But, from the searching/sorting standpoint there's no difference between a (stored) calculation and a regular field, so replacing is pointless and, actually, more dangerous, as there's no way to undo a wrong replace.
Here's what I was looking for, from
http://help.filemaker.com/app/answers/detail/a_id/3366/~/converting-unstored-calculation-fields-to-store-data :
Basically, instead of using a Calculation field, you create an EMPTY Number, Date or Text field, use Replace Field Contents from the Records menu, and put your calculation (or reference, or both) there.
Not dissing FileMaker at all, but millions of records means FileMaker is probably the wrong choice here. Your system will be slow, slow, slow. FileMaker is great for workgroups and there is no way to develop a database app faster. But one thing FileMaker is not good at is handling huge numbers of records.
BTW, Mikhail Edoshin is exactly right.