How to import space-formatted tables (PostgreSQL 9.0)?

These are the first three lines of my text file:
Dist Mv CL Typ LTef logg Age Mass B-V U-B V-I V-K V [Fe/H] l b Av Mbol
0.033 14.40 5 7.90 3.481 5.10 1 0.15 1.723 1.512 3.153 5.850 17.008 0.13 0.50000 0.50000 0.014 12.616
0.033 7.40 5 6.50 3.637 4.62 7 0.71 1.178 0.984 1.302 2.835 10.047 -0.56 0.50000 0.50000 0.014 6.125
0.052 11.70 5 7.40 3.529 4.94 2 0.31 1.541 1.167 2.394 4.565 15.393 -0.10 0.50000 0.50000 0.028 10.075
Assuming I have the right columns, how do I import this?
Bonus: is it possible/are there tools to create the schema from these kinds of files automatically?

At the lowest level you can just use the COPY command (or \copy in psql if you don't have access to a superuser account, which COPY requires to load data from a file). Unfortunately you have to create the table structure first (there is no built-in guess-by-header feature), but it is straightforward to write by hand, or even to generate with a small script (see the sketch at the end of this answer).
Choose whatever datatype suits your data, e.g. real for single-precision floating point (IEEE 754), double precision, or numeric if you need arbitrary-precision numbers:
CREATE TABLE measurement
(
"Dist" double precision,
"Mv" double precision,
"CL" double precision,
"Typ" double precision,
"LTef" double precision,
"logg" double precision,
"Age" double precision,
"Mass" double precision,
"B-V" double precision,
"U-B" double precision,
"V-I" double precision,
"V-K" double precision,
"V" double precision,
"[Fe/H]" double precision,
"l" double precision,
"b" double precision,
"Av" double precision,
"Mbol" double precision
);
Another thing is that your file contains multiple spaces between values, so it's better to transform it into single-tab-delimited entries first (there are plenty of tools for this), e.g. with GNU sed:
$ sed -E 's/ +/\t/g' import.csv
Dist Mv CL Typ LTef logg Age Mass B-V U-B V-I V-K V [Fe/H] l b Av Mbol
0.033 14.40 5 7.90 3.481 5.10 1 0.15 1.723 1.512 3.153 5.850 17.008 0.13 0.50000 0.50000 0.014 12.616
0.033 7.40 5 6.50 3.637 4.62 7 0.71 1.178 0.984 1.302 2.835 10.047 -0.56 0.50000 0.50000 0.014 6.125
0.052 11.70 5 7.40 3.529 4.94 2 0.31 1.541 1.167 2.394 4.565 15.393 -0.10 0.50000 0.50000 0.028 10.075
Finally you can import the file straight into your Postgres database, for example:
=> \copy measurement FROM '/path/import.csv' (FORMAT csv, DELIMITER E'\t', HEADER 'true')
=> TABLE measurement;
Dist | Mv | CL | Typ | LTef | logg | Age | Mass | B-V | U-B | V-I | V-K | V | [Fe/H] | l | b | Av | Mbol
-------+------+----+-----+-------+------+-----+------+-------+-------+-------+-------+--------+--------+-----+-----+-------+--------
0.033 | 14.4 | 5 | 7.9 | 3.481 | 5.1 | 1 | 0.15 | 1.723 | 1.512 | 3.153 | 5.85 | 17.008 | 0.13 | 0.5 | 0.5 | 0.014 | 12.616
0.033 | 7.4 | 5 | 6.5 | 3.637 | 4.62 | 7 | 0.71 | 1.178 | 0.984 | 1.302 | 2.835 | 10.047 | -0.56 | 0.5 | 0.5 | 0.014 | 6.125
0.052 | 11.7 | 5 | 7.4 | 3.529 | 4.94 | 2 | 0.31 | 1.541 | 1.167 | 2.394 | 4.565 | 15.393 | -0.1 | 0.5 | 0.5 | 0.028 | 10.075
(3 rows)
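As for the bonus question: I don't know of a built-in tool in 9.0 that infers a schema, but generating one from the header line takes only a few lines of scripting. A minimal Python sketch, assuming every column is numeric (the double precision type and the table name measurement match the example above):

# Guess a CREATE TABLE statement from the header of a
# whitespace-delimited file; every column is assumed numeric.
with open('/path/import.csv') as f:
    header = f.readline().split()

columns = ',\n'.join('    "%s" double precision' % name for name in header)
print('CREATE TABLE measurement\n(\n%s\n);' % columns)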

Related

Different type of base 16?

Base 16 should go from 0 to F, with F being equal to 15 in base 10. Yet when I use a base 16 converter found on Google (https://www.hexator.com/), it says that F is equal to 46.
Expected results:
0 | 0
1 | 1
2 | 2
3 | 3
4 | 4
5 | 5
6 | 6
7 | 7
8 | 8
9 | 9
a | 10
b | 11
c | 12
d | 13
e | 14
f | 15
Am I misinterpreting something here?
That encoder is converting the ASCII value of the letter 'F' into its hexadecimal representation. The ASCII value of 'F' is 70, which is 46 when converted into hexadecimal. See this ASCII table.
That converter turns text into its hex representation; it does not parse hex strings into decimal numbers.
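The two readings are easy to compare in any language; for instance, in Python:

print(int('F', 16))   # 15   -- parse 'F' as a base-16 digit
print(hex(ord('F')))  # 0x46 -- the ASCII code of 'F' (70), written in hex

The first line is what you expected; the second is what that site does.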

Repeat a specific value per firm to all years

I have panel data ranging from 1917 to 1922 with various variables (for example, Leverage) for 200 firms.
It looks something like this:
Year ID Leverage
1917 1 0.1
1918 1 0.2
1919 1 0.3
1917 2 0.4
1918 2 0.5
1917 3 0.6
1918 3 0.7
1919 3 0.8
1920 3 0.9
....
I want to copy the values for 1917 to all other years (1918, 1919, ...) for the variables per firm ID. As my example shows, not all years are present (so I cannot assume the value appears every X rows). The result should be something like:
Year ID Leverage
1917 1 0.1
1918 1 0.1
1919 1 0.1
1917 2 0.4
1918 2 0.4
1917 3 0.6
1918 3 0.6
1919 3 0.6
1920 3 0.6
....
The following works for me:
clear
input year id leverage
1917 1 0.1
1918 1 0.2
1919 1 0.3
1917 2 0.4
1918 2 0.5
1917 3 0.6
1918 3 0.7
1919 3 0.8
1920 3 0.9
end
gen leverage1917 = leverage if year == 1917
bysort id: egen min = min(leverage1917)
replace leverage = min
drop min leverage1917
. list, sepby(id)
+----------------------+
| year id leverage |
|----------------------|
1. | 1917 1 .1 |
2. | 1918 1 .1 |
3. | 1919 1 .1 |
|----------------------|
4. | 1917 2 .4 |
5. | 1918 2 .4 |
|----------------------|
6. | 1917 3 .6 |
7. | 1918 3 .6 |
8. | 1919 3 .6 |
9. | 1920 3 .6 |
+----------------------+
EDIT NJC
This could be simplified to
generate leverage1917 = leverage if year == 1917
bysort id (leverage1917) : replace leverage1917 = leverage1917[1]
thus cutting out the egen call and the generation of another variable that you then need to drop. Sorting leverage1917 within id puts the non-missing 1917 value first (missing values sort to the end in Stata), so leverage1917[1] is the value to spread. This works properly even if some values of id have no value for 1917 at all.
Borrowing @Cybernike's helpful data example, here are two ways to do it in one line each:
clear
input year id leverage
1917 1 0.1
1918 1 0.2
1919 1 0.3
1917 2 0.4
1918 2 0.5
1917 3 0.6
1918 3 0.7
1919 3 0.8
1920 3 0.9
end
egen wanted1 = mean(cond(year == 1917, leverage, .)), by(id)
egen wanted2 = mean(leverage / (year == 1917)), by(id)
list, sepby(id)
+------------------------------------------+
| year id leverage wanted1 wanted2 |
|------------------------------------------|
1. | 1917 1 .1 .1 .1 |
2. | 1918 1 .2 .1 .1 |
3. | 1919 1 .3 .1 .1 |
|------------------------------------------|
4. | 1917 2 .4 .4 .4 |
5. | 1918 2 .5 .4 .4 |
|------------------------------------------|
6. | 1917 3 .6 .6 .6 |
7. | 1918 3 .7 .6 .6 |
8. | 1919 3 .8 .6 .6 |
9. | 1920 3 .9 .6 .6 |
+------------------------------------------+
For detailed discussion of both methods, see Sections 9 and 10 of this paper.
I don't overwrite the original data, contrary to your request. Often you decide later that you need the original values after all, or someone asks to see them.
This isn't necessarily better than @Cybernike's solution. The division method behind wanted2 has struck some experienced users as too tricksy, and I tend to recommend the cond() device behind wanted1.
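Not part of the original answers, but for anyone meeting the same problem in Python, a pandas sketch of the same idea (look up each firm's base-year value, then broadcast it), assuming the toy data above:

import pandas as pd

# Same toy panel as the Stata examples
df = pd.DataFrame({
    'year':     [1917, 1918, 1919, 1917, 1918, 1917, 1918, 1919, 1920],
    'id':       [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'leverage': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
})

# Each firm's 1917 value, indexed by id, broadcast to all of that firm's rows
base = df.loc[df['year'] == 1917].set_index('id')['leverage']
df['wanted'] = df['id'].map(base)

As with the bysort approach, firms with no 1917 row simply get a missing value.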

How to read date/time fields in Brunel visualization

I'm importing a CSV file as follows:
import pandas as pd
from numpy import log, abs, sign, sqrt
import brunel
# Read data
DISKAVGRIO = pd.read_csv("../DISKAVGRIO_nmon.csv")
DISKAVGRIO.head(6)
That gives the following table:
Hostname | Date-Time | hdisk1342 | hdisk1340 | hdisk1343 | ...
------------ | ----------------------- | ----------- | ------------- | ----------- | ------
host1 | 12-08-2015 00:56:12 | 0.0 | 0.0 | 0.0 | ...
host1 | 12-08-2015 01:11:13 | 0.0 | 0.0 | 0.0 | ...
host1 | 12-08-2015 01:26:14 | 0.0 | 0.0 | 0.0 | ...
host1 | 12-08-2015 01:41:14 | 0.0 | 0.0 | 0.0 | ...
host1 | 12-08-2015 01:56:14 | 0.0 | 0.4 | 4.2 | ...
host1 | 12-08-2015 02:11:14 | 0.0 | 0.0 | 0.0 | ...
Then I try to plot a line graph:
# Line plot
%brunel data('DISKAVGRIO') x(Date-Time) y(hdisk1342) color(#series) line
And get the following error message:
--------------------------------------------------------------------------- java.lang.RuntimeExceptionPyRaisable Traceback (most recent call last) <ipython-input-4-1c7cb7700929> in <module>()
1 # Line plot
----> 2 get_ipython().magic("brunel data('DISKAVGRIO') x(Date-Time) y(hdisk1342) color(#series) line")
/home/anobre/anaconda3/lib/python3.5/site-packages/IPython/core/interactiveshell.py in magic(self, arg_s)
2161 magic_name, _, magic_arg_s = arg_s.partition(' ')
2162 magic_name = magic_name.lstrip(prefilter.ESC_MAGIC)
-> 2163 return self.run_line_magic(magic_name, magic_arg_s)
2164
2165 #-------------------------------------------------------------------------
/home/anobre/anaconda3/lib/python3.5/site-packages/IPython/core/interactiveshell.py in run_line_magic(self, magic_name, line)
2082 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
2083 with self.builtin_trap:
-> 2084 result = fn(*args,**kwargs)
2085 return result
2086
<decorator-gen-124> in brunel(self, line, cell)
/home/anobre/anaconda3/lib/python3.5/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
191 # but it's overkill for just that one bit of state.
192 def magic_deco(arg):
--> 193 call = lambda f, *a, **k: f(*a, **k)
194
195 if callable(arg):
/home/anobre/anaconda3/lib/python3.5/site-packages/brunel/magics.py in brunel(self, line, cell)
42 parts = line.split('::')
43 action = parts[0].strip()
---> 44 datasets_in_brunel = brunel.get_dataset_names(action)
45 self.cache_data(datasets_in_brunel,datas)
46 if len(parts) > 2:
/home/anobre/anaconda3/lib/python3.5/site-packages/brunel/brunel.py in get_dataset_names(brunel_src)
92
93 def get_dataset_names(brunel_src):
---> 94 return brunel_util_java.D3Integration.getDatasetNames(brunel_src)
95
96 def cacheData(data_key, data):
java.lang.RuntimeExceptionPyRaisable: org.brunel.model.VisException: Illegal field name: Date-Time while parsing action text: data('DISKAVGRIO') x(Date-Time) y(hdisk1342) color(#series) line
I'm not sure, but I think the problem is the date/time format. Does anyone know how to read date/time fields?
Try using:
%brunel data('DISKAVGRIO') x(Date_Time) y(hdisk1342) color(#series) line
That is, use an underscore "_" instead of a dash "-" within the field name. Brunel converts characters in field names that would interfere with its syntax into underscores so the fields can be referenced, but the original field name still appears as-is on the displayed axis.
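If you also want the values treated as real timestamps rather than strings, one option (my suggestion, not something the Brunel syntax requires) is to rename and parse the column on the pandas side before handing the frame over:

import pandas as pd

DISKAVGRIO = pd.read_csv("../DISKAVGRIO_nmon.csv")
# Rename the awkward column, then parse it as a datetime.
# dayfirst=True assumes 12-08-2015 means 12 August; drop it if
# the data is month-first.
DISKAVGRIO = DISKAVGRIO.rename(columns={"Date-Time": "Date_Time"})
DISKAVGRIO["Date_Time"] = pd.to_datetime(DISKAVGRIO["Date_Time"], dayfirst=True)

After that, x(Date_Time) refers to a proper datetime column.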

3D or 4D interpolation to find the corresponding values based on 4 columns of variables

I'm trying to find out whether it's possible to interpolate the corresponding values from this set of variables:
+-------------+-------------+------+------+
| x | y | z | g |
+-------------+-------------+------+------+
| 150.8385804 | 183.7613678 | 0.58 | 2 |
| 171.0745381 | 231.7033081 | 2 | 0.58 |
| 179.1394672 | 244.5019837 | 0.8 | 0.8 |
| 149.1849453 | 180.7103271 | 0.8 | 2 |
| 162.5648017 | 212.8121033 | 2 | 0.8 |
| 141.1687115 | 163.4759979 | 0.8 | 3 |
| 140.7505385 | 162.7905884 | 0.9 | 3 |
| 148.1461022 | 180.5486908 | 1.8 | 1.6 |
| 147.1552106 | 178.7599182 | 2 | 1.6 |
+-------------+-------------+------+------+
What would be the corresponding z and g for x=143 and y=179? I do have access to MATLAB if anyone can suggest code for it.
Here is the MATLAB syntax to load the above data into your workspace:
X = [150.8385804 171.0745381 179.1394672 149.1849453 162.5648017 141.1687115 140.7505385 148.1461022 147.1552106].';
Y = [183.7613678 231.7033081 244.5019837 180.7103271 212.8121033 163.4759979 162.7905884 180.5486908 178.7599182].';
Z = [0.58 2 0.8 0.8 2 0.8 0.9 1.8 2].';
G = [2 0.58 0.8 2 0.8 3 3 1.6 1.6].';
You can use scatteredInterpolant to do this. scatteredInterpolant performs interpolation on a scattered dataset, which is exactly what you have, and you can use it twice: once for z and once for g. You specify x and y as the control points with z (or g) as the corresponding output values; scatteredInterpolant creates an object that you can then query at custom x and y coordinates to get interpolated answers. The default interpolation method is linear. So you'd build one interpolant per output variable and evaluate both at x=143 and y=179.
In other words:
X = [150.8385804 171.0745381 179.1394672 149.1849453 162.5648017 141.1687115 140.7505385 148.1461022 147.1552106].';
Y = [183.7613678 231.7033081 244.5019837 180.7103271 212.8121033 163.4759979 162.7905884 180.5486908 178.7599182].';
Z = [0.58 2 0.8 0.8 2 0.8 0.9 1.8 2].';
G = [2 0.58 0.8 2 0.8 3 3 1.6 1.6].';
%// Create scatteredInterpolant
Zq = scatteredInterpolant(X, Y, Z);
Gq = scatteredInterpolant(X, Y, G);
%// Figure out interpolated values
zInterp = Zq(143, 179);
gInterp = Gq(143, 179);
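For what it's worth, here is a rough equivalent in Python using scipy.interpolate.griddata (my addition, not part of the MATLAB answer). One caveat: griddata returns NaN for query points outside the convex hull of the data, whereas scatteredInterpolant extrapolates linearly by default, so results near the edge of the point cloud can differ:

import numpy as np
from scipy.interpolate import griddata

# Same scattered data as the MATLAB snippet above
x = np.array([150.8385804, 171.0745381, 179.1394672, 149.1849453,
              162.5648017, 141.1687115, 140.7505385, 148.1461022, 147.1552106])
y = np.array([183.7613678, 231.7033081, 244.5019837, 180.7103271,
              212.8121033, 163.4759979, 162.7905884, 180.5486908, 178.7599182])
z = np.array([0.58, 2, 0.8, 0.8, 2, 0.8, 0.9, 1.8, 2])
g = np.array([2, 0.58, 0.8, 2, 0.8, 3, 3, 1.6, 1.6])

pts = np.column_stack((x, y))
q = np.array([[143.0, 179.0]])  # query point

z_interp = griddata(pts, z, q, method='linear')  # NaN if q is outside the hull
g_interp = griddata(pts, g, q, method='linear')
# Fall back to the nearest data point where linear interpolation is undefined:
z_near = griddata(pts, z, q, method='nearest')
g_near = griddata(pts, g, q, method='nearest')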

Difference between correctly / incorrectly classified instances in decision tree and confusion matrix in Weka

I have been using Weka's J48 decision tree to classify frequencies of keywords in RSS feeds into target categories, and I think I may have a problem reconciling the generated decision tree with the reported number of correctly classified instances and with the confusion matrix.
For example, one of my .arff files contains the following data extracts:
@attribute Keyword_1_nasa_Frequency numeric
@attribute Keyword_2_fish_Frequency numeric
@attribute Keyword_3_kill_Frequency numeric
@attribute Keyword_4_show_Frequency numeric
...
@attribute Keyword_64_fear_Frequency numeric
@attribute RSSFeedCategoryDescription {BFE,FCL,F,M,NCA,SNT,S}
@data
0,0,0,34,0,0,0,0,0,40,0,0,0,0,0,0,0,0,0,0,24,0,0,0,0,13,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,10,0,0,0,0,0,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
...
20,0,64,19,0,162,0,0,36,72,179,24,24,47,24,40,0,48,0,0,0,97,24,0,48,205,143,62,78,
0,0,216,0,36,24,24,0,0,24,0,0,0,0,140,24,0,0,0,0,72,176,0,0,144,48,0,38,0,284,
221,72,0,72,0,SNT
...
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,S
And so on: there are a total of 64 keywords (columns) and 570 rows, where each row contains the keyword frequencies for one feed on one day. In this case there are 57 feeds over 10 days, giving a total of 570 records to be classified. Each keyword is prefixed with a surrogate number and suffixed with 'Frequency'.
I run the decision tree with default parameters, using 10-fold cross-validation.
Weka reports the following:
Correctly Classified Instances 210 36.8421 %
Incorrectly Classified Instances 360 63.1579 %
With the following confusion matrix:
=== Confusion Matrix ===
a b c d e f g <-- classified as
11 0 0 0 39 0 0 | a = BFE
0 0 0 0 60 0 0 | b = FCL
1 0 5 0 72 0 2 | c = F
0 0 1 0 69 0 0 | d = M
3 0 0 0 153 0 4 | e = NCA
0 0 0 0 90 10 0 | f = SNT
0 0 0 0 19 0 31 | g = S
The tree is as follows:
Keyword_22_health_Frequency <= 0
| Keyword_7_open_Frequency <= 0
| | Keyword_52_libya_Frequency <= 0
| | | Keyword_21_job_Frequency <= 0
| | | | Keyword_48_pic_Frequency <= 0
| | | | | Keyword_63_world_Frequency <= 0
| | | | | | Keyword_26_day_Frequency <= 0: NCA (461.0/343.0)
| | | | | | Keyword_26_day_Frequency > 0: BFE (8.0/3.0)
| | | | | Keyword_63_world_Frequency > 0
| | | | | | Keyword_31_gaddafi_Frequency <= 0: S (4.0/1.0)
| | | | | | Keyword_31_gaddafi_Frequency > 0: NCA (3.0)
| | | | Keyword_48_pic_Frequency > 0: F (7.0)
| | | Keyword_21_job_Frequency > 0: BFE (10.0/1.0)
| | Keyword_52_libya_Frequency > 0: NCA (31.0)
| Keyword_7_open_Frequency > 0
| | Keyword_31_gaddafi_Frequency <= 0: S (32.0/1.0)
| | Keyword_31_gaddafi_Frequency > 0: NCA (4.0)
Keyword_22_health_Frequency > 0: SNT (10.0)
My question concerns reconciling the matrix with the tree, or vice versa. As far as I understand the results, a rating like (461.0/343.0) indicates that 461 instances have been classified as NCA. But how can that be, when the matrix shows only 153? I am not sure how to interpret this, so any help is welcome.
Thanks in advance.
The number in parentheses at each leaf should be read as (number of total instances of this classification at this leaf / number of incorrect classifications at this leaf).
In your example for the first NCA leaf, it says there are 461 test instances that were classified as NCA, and of those 461, there were 343 incorrect classifications. So there are 461-343 = 118 correctly classified instances at that leaf.
Looking through your decision tree, note that NCA is also at other leaves. I count 118 + 3 + 31 + 4 = 156 correctly classified instances out of 461 + 3 + 31 + 4 = 499 total classifications of NCA.
Your confusion matrix shows 153 correct classifications of NCA out of 39 + 60 + 72 + 69 + 153 + 90 + 19 = 502 total classifications of NCA.
So there is a slight difference between the tree (156/499) and your confusion matrix (153/502).
Note that if you are running Weka from the command line, it outputs a tree and a confusion matrix for testing on all the training data, and also a pair for testing with cross-validation; be careful that you are looking at the right matrix for the right tree, since you may have mixed up the two pairs.
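If you want to check this bookkeeping mechanically, here is a quick Python sketch over the confusion matrix posted above (rows are actual classes, columns are predicted classes):

import numpy as np

labels = ['BFE', 'FCL', 'F', 'M', 'NCA', 'SNT', 'S']
cm = np.array([
    [11, 0, 0, 0,  39,  0,  0],   # a = BFE
    [ 0, 0, 0, 0,  60,  0,  0],   # b = FCL
    [ 1, 0, 5, 0,  72,  0,  2],   # c = F
    [ 0, 0, 1, 0,  69,  0,  0],   # d = M
    [ 3, 0, 0, 0, 153,  0,  4],   # e = NCA
    [ 0, 0, 0, 0,  90, 10,  0],   # f = SNT
    [ 0, 0, 0, 0,  19,  0, 31],   # g = S
])

nca = labels.index('NCA')
print(cm[:, nca].sum())  # 502: instances classified as NCA in total
print(cm[nca, nca])      # 153: of those, actually NCA
print(np.trace(cm))      # 210: correctly classified instances overall

The diagonal sums to the reported 210 correctly classified instances, so the matrix is internally consistent with the summary lines; the remaining gap against the tree's leaf counts is explained by the training/cross-validation pairing above.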