How to read date/time fields in Brunel visualization

I'm importing a CSV file as follows:
import pandas as pd
from numpy import log, abs, sign, sqrt
import brunel
# Read data
DISKAVGRIO = pd.read_csv("../DISKAVGRIO_nmon.csv")
DISKAVGRIO.head(6)
which produces the following table:
Hostname | Date-Time | hdisk1342 | hdisk1340 | hdisk1343 | ...
------------ | ----------------------- | ----------- | ------------- | ----------- | ------
host1 | 12-08-2015 00:56:12 | 0.0 | 0.0 | 0.0 | ...
host1 | 12-08-2015 01:11:13 | 0.0 | 0.0 | 0.0 | ...
host1 | 12-08-2015 01:26:14 | 0.0 | 0.0 | 0.0 | ...
host1 | 12-08-2015 01:41:14 | 0.0 | 0.0 | 0.0 | ...
host1 | 12-08-2015 01:56:14 | 0.0 | 0.4 | 4.2 | ...
host1 | 12-08-2015 02:11:14 | 0.0 | 0.0 | 0.0 | ...
Then I try to plot a line chart:
# Line plot
%brunel data('DISKAVGRIO') x(Date-Time) y(hdisk1342) color(#series) line
And get the following error message:
--------------------------------------------------------------------------- java.lang.RuntimeExceptionPyRaisable Traceback (most recent call last) <ipython-input-4-1c7cb7700929> in <module>()
1 # Line plot
----> 2 get_ipython().magic("brunel data('DISKAVGRIO') x(Date-Time) y(hdisk1342) color(#series) line")
/home/anobre/anaconda3/lib/python3.5/site-packages/IPython/core/interactiveshell.py in magic(self, arg_s)
2161 magic_name, _, magic_arg_s = arg_s.partition(' ')
2162 magic_name = magic_name.lstrip(prefilter.ESC_MAGIC)
-> 2163 return self.run_line_magic(magic_name, magic_arg_s)
2164
2165 #-------------------------------------------------------------------------
/home/anobre/anaconda3/lib/python3.5/site-packages/IPython/core/interactiveshell.py in run_line_magic(self, magic_name, line)
2082 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
2083 with self.builtin_trap:
-> 2084 result = fn(*args,**kwargs)
2085 return result
2086
<decorator-gen-124> in brunel(self, line, cell)
/home/anobre/anaconda3/lib/python3.5/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
191 # but it's overkill for just that one bit of state.
192 def magic_deco(arg):
--> 193 call = lambda f, *a, **k: f(*a, **k)
194
195 if callable(arg):
/home/anobre/anaconda3/lib/python3.5/site-packages/brunel/magics.py in brunel(self, line, cell)
42 parts = line.split('::')
43 action = parts[0].strip()
---> 44 datasets_in_brunel = brunel.get_dataset_names(action)
45 self.cache_data(datasets_in_brunel,datas)
46 if len(parts) > 2:
/home/anobre/anaconda3/lib/python3.5/site-packages/brunel/brunel.py in get_dataset_names(brunel_src)
92
93 def get_dataset_names(brunel_src):
---> 94 return brunel_util_java.D3Integration.getDatasetNames(brunel_src)
95
96 def cacheData(data_key, data):
java.lang.RuntimeExceptionPyRaisable: org.brunel.model.VisException: Illegal field name: Date-Time while parsing action text: data('DISKAVGRIO') x(Date-Time) y(hdisk1342) color(#series) line
I'm not sure, but I think the problem is the date/time format. Does anyone know how to read date/time fields?

Try using:
%brunel data('DISKAVGRIO') x(Date_Time) y(hdisk1342) color(#series) line
That is, use an underscore "_" instead of a dash "-" within the field name. Brunel converts characters in field names that would interfere with its syntax into underscores for reference purposes, but the original field name still appears as is on the displayed axis.
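If Brunel still does not treat the column as a date, another option is to parse the timestamps in pandas before plotting (and, purely for convenience, rename the column). A minimal sketch, assuming the timestamps are day-first and using the file and column names from the question:
import pandas as pd

# Read the CSV and rename the column so it can be referenced without the dash
DISKAVGRIO = pd.read_csv("../DISKAVGRIO_nmon.csv")
DISKAVGRIO = DISKAVGRIO.rename(columns={"Date-Time": "Date_Time"})

# Parse the timestamps into datetime objects; adjust the format if the dates are month-first
DISKAVGRIO["Date_Time"] = pd.to_datetime(DISKAVGRIO["Date_Time"], format="%d-%m-%Y %H:%M:%S")
With the column parsed as a datetime, reference Date_Time in the Brunel action exactly as shown above.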

Related

Different type of base 16?

Base 16 should go from 0 to F, with F being equal to 15 in base 10. Yet when I use a base-16 converter found on Google (https://www.hexator.com/), it says that F is equal to 46.
Expected results:
0 | 0
1 | 1
2 | 2
3 | 3
4 | 4
5 | 5
6 | 6
7 | 7
8 | 8
9 | 9
a | 10
b | 11
c | 12
d | 13
e | 14
f | 15
Am I misinterpreting something here?
That encoder is converting the ASCII value of the letter 'F' into its hexadecimal representation. The ASCII value of 'F' is 70, which is 46 in hexadecimal. See an ASCII table.
That converter turns text into its hex representation; it does not convert hex strings into decimal numbers.
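A quick way to see the difference, sketched in Python:
# Interpreting the string "F" as a base-16 number gives 15
print(int("F", 16))            # 15

# Taking the ASCII code of the character 'F' and writing it in hex gives 46
print(ord("F"))                # 70
print(format(ord("F"), "x"))   # 46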

Get subexpression strings from output of pretty() in MATLAB

Is there a good way to get all the subexpressions in the output of a pretty() call in single-line strings? subexpr() returns a single subexpression, but I'd like to get all of them. Here's what pretty() returns:
syms x
s = solve(x^4 + 2*x + 1, x,'MaxDegree',3);
pretty(s)
/ -1 \
| |
| 2 1 |
| #2 - ---- + - |
| 9 #2 3 |
| |
| 1 #2 1 |
| ---- - #1 - -- + - |
| 9 #2 2 3 |
| |
| 1 #2 1 |
| #1 + ---- - -- + - |
\ 9 #2 2 3 /
where
/ 2 \
sqrt(3) | ---- + #2 | 1i
\ 9 #2 /
#1 == ------------------------
2
/ sqrt(11) sqrt(27) 17 \1/3
#2 == | ----------------- - -- |
\ 27 27 /
Here's what I'd like:
#1 == sqrt(3) ((2/(9 #2)) + #2) 1i) / 2
#2 == (sqrt(11) sqrt(27) / 27 - 17 / 27) ^ (1/3)
That way the output is easily cut-and-pasted into an editor for rapid conversion to code.
The MATLAB functions ccode (or matlabFunction) do the trick beautifully.
syms x
s = solve(x^4 + 2*x + 1, x,'MaxDegree',3);
ccode(s, 'file', 'outfile.c');
MATLAB generates outfile.c with sparse matrix notation and substitution-simplified computation:
t2 = sqrt(1.1E1);
t3 = sqrt(2.7E1);
t4 = t2*t3*(1.0/2.7E1);
t5 = t4-1.7E1/2.7E1;
t6 = 1.0/pow(t5,1.0/3.0);
t7 = pow(t5,1.0/3.0);
t8 = sqrt(3.0);
t9 = t6*(2.0/9.0);
t10 = t7+t9;
t11 = t6*(1.0/9.0);
A0[0][0] = -1.0;
A0[1][0] = t6*(-2.0/9.0)+t7+1.0/3.0;
A0[2][0] = t7*(-1.0/2.0)+t11-t8*t10*5.0E-1*sqrt(-1.0)+1.0/3.0;
A0[3][0] = t7*(-1.0/2.0)+t11+t8*t10*5.0E-1*sqrt(-1.0)+1.0/3.0;

Difference between correctly / incorrectly classified instances in decision tree and confusion matrix in Weka

I have been using Weka's J48 decision tree to classify frequencies of keywords in RSS feeds into target categories, and I think I may have a problem reconciling the generated decision tree with the number of correctly classified instances reported and with the confusion matrix.
For example, one of my .arff files contains the following data extracts:
@attribute Keyword_1_nasa_Frequency numeric
@attribute Keyword_2_fish_Frequency numeric
@attribute Keyword_3_kill_Frequency numeric
@attribute Keyword_4_show_Frequency numeric
...
@attribute Keyword_64_fear_Frequency numeric
@attribute RSSFeedCategoryDescription {BFE,FCL,F,M,NCA,SNT,S}
@data
0,0,0,34,0,0,0,0,0,40,0,0,0,0,0,0,0,0,0,0,24,0,0,0,0,13,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,10,0,0,0,0,0,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
...
20,0,64,19,0,162,0,0,36,72,179,24,24,47,24,40,0,48,0,0,0,97,24,0,48,205,143,62,78,
0,0,216,0,36,24,24,0,0,24,0,0,0,0,140,24,0,0,0,0,72,176,0,0,144,48,0,38,0,284,
221,72,0,72,0,SNT
...
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,S
And so on: there is a total of 64 keywords (columns) and 570 rows, where each row contains the frequency of a keyword in a feed for a day. In this case, there are 57 feeds over 10 days, giving a total of 570 records to be classified. Each keyword is prefixed with a surrogate number and suffixed with 'Frequency'.
I used the decision tree with default parameters and 10-fold cross-validation.
Weka reports the following:
Correctly Classified Instances 210 36.8421 %
Incorrectly Classified Instances 360 63.1579 %
With the following confusion matrix:
=== Confusion Matrix ===
a b c d e f g <-- classified as
11 0 0 0 39 0 0 | a = BFE
0 0 0 0 60 0 0 | b = FCL
1 0 5 0 72 0 2 | c = F
0 0 1 0 69 0 0 | d = M
3 0 0 0 153 0 4 | e = NCA
0 0 0 0 90 10 0 | f = SNT
0 0 0 0 19 0 31 | g = S
The tree is as follows:
Keyword_22_health_Frequency <= 0
| Keyword_7_open_Frequency <= 0
| | Keyword_52_libya_Frequency <= 0
| | | Keyword_21_job_Frequency <= 0
| | | | Keyword_48_pic_Frequency <= 0
| | | | | Keyword_63_world_Frequency <= 0
| | | | | | Keyword_26_day_Frequency <= 0: NCA (461.0/343.0)
| | | | | | Keyword_26_day_Frequency > 0: BFE (8.0/3.0)
| | | | | Keyword_63_world_Frequency > 0
| | | | | | Keyword_31_gaddafi_Frequency <= 0: S (4.0/1.0)
| | | | | | Keyword_31_gaddafi_Frequency > 0: NCA (3.0)
| | | | Keyword_48_pic_Frequency > 0: F (7.0)
| | | Keyword_21_job_Frequency > 0: BFE (10.0/1.0)
| | Keyword_52_libya_Frequency > 0: NCA (31.0)
| Keyword_7_open_Frequency > 0
| | Keyword_31_gaddafi_Frequency <= 0: S (32.0/1.0)
| | Keyword_31_gaddafi_Frequency > 0: NCA (4.0)
Keyword_22_health_Frequency > 0: SNT (10.0)
My question concerns reconciling the matrix to the tree or vice versa. As far as
I understand the results, a rating like (461.0/343.0) indicates that 461 instances have been classified as NCA. But how can that be when the matrix reveals only 153? I am
not sure how to interpret this so any help is welcome.
Thanks in advance.
The number in parentheses at each leaf should be read as (number of total instances of this classification at this leaf / number of incorrect classifications at this leaf).
In your example for the first NCA leaf, it says there are 461 test instances that were classified as NCA, and of those 461, there were 343 incorrect classifications. So there are 461-343 = 118 correctly classified instances at that leaf.
Looking through your decision tree, note that NCA is also at other leaves. I count 118 + 3 + 31 + 4 = 156 correctly classified instances out of 461 + 3 + 31 + 4 = 499 total classifications of NCA.
Your confusion matrix shows 153 correct classifications of NCA out of 39 + 60 + 72 + 69 + 153 + 90 + 19 = 502 total classifications of NCA.
So there is a slight difference between the tree (156/499) and your confusion matrix (153/502).
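To make the bookkeeping explicit, here is a short Python sketch of the same arithmetic, with the numbers simply copied from the tree and the confusion matrix above:
# NCA leaves from the tree as (total instances, incorrect instances)
nca_leaves = [(461, 343), (3, 0), (31, 0), (4, 0)]
tree_total = sum(total for total, _ in nca_leaves)                 # 499
tree_correct = sum(total - wrong for total, wrong in nca_leaves)   # 156

# Column "e = NCA" of the confusion matrix: everything classified as NCA
nca_column = [39, 60, 72, 69, 153, 90, 19]
matrix_total = sum(nca_column)   # 502
matrix_correct = 153             # the diagonal entry for NCA

print(tree_correct, tree_total)      # 156 499
print(matrix_correct, matrix_total)  # 153 502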
Note that if you are running Weka from the command line, it outputs a tree and a confusion matrix for testing on all the training data, and another pair for testing with cross-validation. Be careful that you are looking at the right matrix for the right tree; you may have mixed them up.

How to import space-formatted tables (PostgreSQL 9.0)?

These are the first three lines of my text file:
Dist Mv CL Typ LTef logg Age Mass B-V U-B V-I V-K V [Fe/H] l b Av Mbol
0.033 14.40 5 7.90 3.481 5.10 1 0.15 1.723 1.512 3.153 5.850 17.008 0.13 0.50000 0.50000 0.014 12.616
0.033 7.40 5 6.50 3.637 4.62 7 0.71 1.178 0.984 1.302 2.835 10.047 -0.56 0.50000 0.50000 0.014 6.125
0.052 11.70 5 7.40 3.529 4.94 2 0.31 1.541 1.167 2.394 4.565 15.393 -0.10 0.50000 0.50000 0.028 10.075
Assuming I have the right columns, how do I import this?
Bonus: is it possible/are there tools to create the schema from these kinds of files automatically?
At the lowest level you could just use the COPY command (or \copy in psql if you don't have access to a superuser account, which COPY requires to load data from a file). Unfortunately you have to create the table structure first (there is no built-in guess-by-header feature), but it looks straightforward to write one; a sketch of such a script is given at the end of this answer.
Choose whatever datatype suits your data, e.g. real for single-precision floating point (IEEE 754), double precision, or numeric if you need arbitrary-precision numbers:
CREATE TABLE measurement
(
"Dist" double precision,
"Mv" double precision,
"CL" double precision,
"Typ" double precision,
"LTef" double precision,
"logg" double precision,
"Age" double precision,
"Mass" double precision,
"B-V" double precision,
"U-B" double precision,
"V-I" double precision,
"V-K" double precision,
"V" double precision,
"[Fe/H]" double precision,
"l" double precision,
"b" double precision,
"Av" double precision,
"Mbol" double precision
);
Another thing is that your file contains multiple spaces between values, so it's better to transform it into single-tab-delimited entries (there are plenty of tools to do this):
$ sed 's/  */\t/g' import.csv
Dist Mv CL Typ LTef logg Age Mass B-V U-B V-I V-K V [Fe/H] l b Av Mbol
0.033 14.40 5 7.90 3.481 5.10 1 0.15 1.723 1.512 3.153 5.850 17.008 0.13 0.50000 0.50000 0.014 12.616
0.033 7.40 5 6.50 3.637 4.62 7 0.71 1.178 0.984 1.302 2.835 10.047 -0.56 0.50000 0.50000 0.014 6.125
0.052 11.70 5 7.40 3.529 4.94 2 0.31 1.541 1.167 2.394 4.565 15.393 -0.10 0.50000 0.50000 0.028 10.075
Finally you can import your file straight into Postgres database, for example:
=> \copy measurement FROM '/path/import.csv' (FORMAT csv, DELIMITER E'\t', HEADER 'true')
=> TABLE measurement;
Dist | Mv | CL | Typ | LTef | logg | Age | Mass | B-V | U-B | V-I | V-K | V | [Fe/H] | l | b | Av | Mbol
-------+------+----+-----+-------+------+-----+------+-------+-------+-------+-------+--------+--------+-----+-----+-------+--------
0.033 | 14.4 | 5 | 7.9 | 3.481 | 5.1 | 1 | 0.15 | 1.723 | 1.512 | 3.153 | 5.85 | 17.008 | 0.13 | 0.5 | 0.5 | 0.014 | 12.616
0.033 | 7.4 | 5 | 6.5 | 3.637 | 4.62 | 7 | 0.71 | 1.178 | 0.984 | 1.302 | 2.835 | 10.047 | -0.56 | 0.5 | 0.5 | 0.014 | 6.125
0.052 | 11.7 | 5 | 7.4 | 3.529 | 4.94 | 2 | 0.31 | 1.541 | 1.167 | 2.394 | 4.565 | 15.393 | -0.1 | 0.5 | 0.5 | 0.028 | 10.075
(3 rows)
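Regarding the bonus question: there is no built-in guess-by-header feature, but generating the CREATE TABLE statement from the header line is easy to script yourself. A rough Python sketch, assuming every column should become double precision (the file and table names are just the ones used above):
# generate_ddl.py - build a CREATE TABLE statement from the header line of the data file
with open("import.csv") as f:
    header = f.readline().split()

columns = ",\n".join('    "{}" double precision'.format(name) for name in header)
print("CREATE TABLE measurement\n(\n{}\n);".format(columns))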

How to delete all characters but the last

I want to parse a file and delete all leading 0's of a number using sed (of course, if I have something like 0000, the result should be 0). How do I do that?
I think you may be searching for this answer, which you will of course need to modify:
How to remove first/last character from a string using SED
This is probably overcomplicated, but it catches all the corner cases I tested:
sed 's/^\([^0-9]*\)0/\1\n0/;s/$/}/;s/\([^0-9\n]\)0/\1\n/g;s/\n0\+/\n/g;s/\n\([^0-9]\)/0\1/g;s/\n//g;s/}$//' inputfile
Explanation:
This uses the divide-and-conquer technique of inserting newlines to delimit segments of a line so they can be manipulated individually.
s/^\([^0-9]*\)0/\1\n0/ - insert a newline before the first zero
s/$/}/ - add a buffer character at the end
s/\([^0-9\n]\)0/\1\n/g - insert newlines before each leading zero (and remove the first)
s/\n0\+/\n/g - remove the remaining leading zeros
s/\n\([^0-9]\)/0\1/g - replace bare zeros
s/\n//g - remove the newlines
s/}$// - remove the end-of-line buffer
This file:
0 foo 1 bar 01 10 001 baz 010 100 qux 000 00 0001 0100 0010
100 | 00100
010 | 010
001 | 001
100 | 100
0 | 0
00 | 0
000 | 0
00 | 00
00 | 00
00 | 00 z
Becomes:
0 foo 1 bar 1 10 1 baz 10 100 qux 0 0 1 100 10
100 | 100
10 | 10
1 | 1
100 | 100
0 | 0
0 | 0
0 | 0
0 | 0
0 | 0
0 | 0 z
If you have leading zeroes followed by a string of digits, all you have to do is convert the value to an integer. Something like this:
$ echo "000123 test " | awk '{$1=$1+0}1'
123 test
This does not require any significant amount of regex, simple or overly complicated.
Similarly (Ruby 1.9+):
$ echo "000123 test " | ruby -lane '$F[0]=$F[0].to_i; print $F.join(" ")'
123 test
For cases of all 0000's
$ echo "0000 test " | ruby -lane '$F[0]=$F[0].to_i; print $F.join(" ")'
0 test
$ echo "000 test " | awk '{$1=$1+0}1'
0 test