FITS_rec and selection of data: masking instead of "true" filtering?

Probably a duplicate of Ashley's post (but I can't comment yet ;) ).
I have the same issue when trying to add a column to a sub-selection of my initial FITS_rec (based on numpy's recarray): all the rows reappear, AND the values I fill into the new column are not respected. "hdu_sliced._get_raw_data()", proposed by Vlas Sokolov, is a solution that works fine for me, but I was wondering:
1) What are "the better ways" suggested by Iguananaut? I probably just need someone to google it for me; the newbie me is feeling stuck :$ (Staying in a FITS_rec would be required).
2) Is that the expected behaviour? Meaning, do we really want to work on a "masked array" which would be a copy of our original array? What worries me the most is the "collapse" of the values in the newly computed column. See below:
# A nice FITS_rec
a1 = np.array(['NGC1001', 'NGC1002', 'NGC1003'])
a2 = np.array([11.1, 12.3, 15.2])
col1 = fits.Column(name='target', format='20A', array=a1)
col2 = fits.Column(name='V_mag', format='E', array=a2)
cols = fits.ColDefs([col1, col2])
hdu = fits.BinTableHDU.from_columns(cols)
ori_rec=hdu.data
ori_rec
FITS_rec([('NGC1001', 11.1), ('NGC1002', 12.3), ('NGC1003', 15.2)],
dtype=(numpy.record, [('target', 'S20'), ('V_mag', '<f4')]))
# Sub-selection
bug=ori_rec[ori_rec["V_mag"]>12.]
bug
FITS_rec([('NGC1002', 12.3), ('NGC1003', 15.2)],
dtype=(numpy.record, [('target', 'S20'), ('V_mag', '<f4')]))
So far so good...
# Let's add a new column
col0=bug.columns
col1 =fits.ColDefs([fits.Column(name='new',format='D',array=bug.field("V_mag")+1.)])
newbug = fits.BinTableHDU.from_columns(col0 + col1).data
newbug
FITS_rec([('NGC1001', 11.1, 13.30000019), ('NGC1002', 12.3, 16.20000076),
('NGC1003', 15.2, 0. )],
dtype=(numpy.record, [('target', 'S20'), ('V_mag', '<f4'), ('new', '<f8')]))
...AND... all three rows are back, and the values of the new column for NGC1002 and NGC1003 are correct but sit in the rows of NGC1001 and NGC1002 respectively... :|
Any enlightenment will be welcomed :)

This is a confusing problem, and it stems from the fact that there are many layers of legacy classes and data structures in astropy.io.fits (stemming back from earlier versions of PyFITS). For example, you can see in your example that hdu.data is a FITS_rec object, which is like a Numpy recarray (itself a soft-deprecated legacy class), but it also has a .columns attribute (as you've noted):
>>> bug.columns
ColDefs(
name = 'target'; format = '20A'
name = 'V_mag'; format = 'E'
)
This in turn actually holds references back to the original arrays from which you described the columns. For example:
>>> bug.columns['target'].array
chararray(['NGC1001', 'NGC1002', 'NGC1003'],
dtype='|S20')
You can see here that even though bug is a "slice" of your original table, the arrays referenced through bug.columns still contain the original, unsliced array data. So when you do something like in your original post
>>> col0 = bug.columns
>>> col1 = fits.ColDefs([fits.Column(name='new',format='D',array=bug.field("V_mag")+1.)])
it's doing its best to figure out the intent, but col0 no longer has any idea that bug is a slice of the original table; it only has the original "coldefs", with the full columns, to rely on.
Most of these classes, including FITS_rec, Column, and especially ColDefs almost never need to be used directly anymore. Unfortunately not all of the documentation has been updated to reflect this fact, and there are a lot of older tutorials and example code that show usage of these classes. Nobody with the requisite expertise has been able to take the time to update the docs and clarify this point.
On occasion Column is useful if you already have columnar data with each column in a separate array, and you want to build a table from it and give some specific FITS attributes to the table columns. But I have redesigned much of the API so that you can take native Python data structures like Numpy arrays and save them to FITS files without worrying about the details of how FITS is implemented or annoying things like FITS data format codes in many cases.
This work is slightly incomplete, because it seems if you want to define a FITS table from some columnar arrays, you still need to use the Column class and specify a FITS format at a minimum (but you never need to use ColDefs directly):
>>> hdu = fits.BinTableHDU.from_columns([fits.Column(name='target', format='20A', array=a1), fits.Column(name='V_mag', format='E', array=a2)])
>>> hdu.data
FITS_rec([('NGC1001', 11.1), ('NGC1002', 12.3), ('NGC1003', 15.2)],
dtype=(numpy.record, [('target', 'S20'), ('V_mag', '<f4')]))
However, you can also work with Numpy structured arrays directly, and I usually find that simpler personally, as it allows you to ignore most FITS-isms and just focus on your data, for those cases where it's not important to finely tweak the FITS-specific stuff. For example, to define a structured array for your data, there are several ways to go about that, but you might try:
>>> nrows = 3
>>> data = np.empty(nrows, dtype=[('target', 'S20'), ('V_mag', np.float32)])
>>> data['target'] = a1
>>> data['V_mag'] = a2
>>> data
array([('NGC1001', 11.100000381469727), ('NGC1002', 12.300000190734863),
('NGC1003', 15.199999809265137)],
dtype=[('target', 'S20'), ('V_mag', '<f4')])
and then you can instantiate a BinTableHDU directly from this array:
>>> hdu = fits.BinTableHDU(data)
>>> hdu.data
FITS_rec([('NGC1001', 11.1), ('NGC1002', 12.3), ('NGC1003', 15.2)],
dtype=(numpy.record, [('target', 'S20'), ('V_mag', '<f4')]))
>>> hdu.header
XTENSION= 'BINTABLE' / binary table extension
BITPIX = 8 / array data type
NAXIS = 2 / number of array dimensions
NAXIS1 = 24 / length of dimension 1
NAXIS2 = 3 / length of dimension 2
PCOUNT = 0 / number of group parameters
GCOUNT = 1 / number of groups
TFIELDS = 2 / number of table fields
TTYPE1 = 'target '
TFORM1 = '20A '
TTYPE2 = 'V_mag '
TFORM2 = 'E '
Likewise when it comes to things like masking and slicing and adding new columns, working directly with the native Numpy data structures is best.
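As a rough sketch of what that can look like for the example in the question: select the rows with a boolean mask on the structured array, then build a new structured array that has the extra field. The names selected and extended are made up for this illustration, and the manual copy loop is just one way to add a field; the result could then be passed to fits.BinTableHDU(...) as shown above.

```python
import numpy as np

# Same data as in the question, stored as bytes to match the 'S20' field.
a1 = np.array([b'NGC1001', b'NGC1002', b'NGC1003'])
a2 = np.array([11.1, 12.3, 15.2])

data = np.empty(3, dtype=[('target', 'S20'), ('V_mag', np.float32)])
data['target'] = a1
data['V_mag'] = a2

# Boolean indexing returns a new array holding only the matching rows.
selected = data[data['V_mag'] > 12.0]

# Build a structured array with one extra float64 field called 'new'.
new_dtype = selected.dtype.descr + [('new', np.float64)]
extended = np.empty(selected.shape, dtype=new_dtype)
for name in selected.dtype.names:
    extended[name] = selected[name]
extended['new'] = selected['V_mag'].astype(np.float64) + 1.0

print(extended)
```

Because the selection is a plain copy, there are no hidden references back to the unsliced columns, so the new values stay in the rows they were computed for.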
Or, as suggested in the answers to the other question, use the Astropy Table API and don't mess with low-level FITS stuff at all if you can help it. As I discussed, astropy.io.fits contains several layers of legacy interfaces that make things confusing (and that long term should probably be cleaned up, but that is hard to do because code that uses them in some way or another is pervasive). The Table API was designed from the ground up to make table manipulations, including things like masking rows and adding columns, relatively easy, whereas the old PyFITS APIs never quite worked for many simple cases.
I hope this answer was edifying--I know it's maybe a bit long and confusing. If there is anything specific I can clear up let me know.

Related

How to create a variable in Matlab where certain subjects are coded as 1?

I want to create a variable called 'flag_artifact' where certain subjects from my dataset (whom I know to have bad quality images) are coded as e.g. 1. My dataset is stored in a table T with a certain number of rows, and 'subject' is the 1st column in the table.
I managed to do it with a for loop. Surely there is a more efficient way to do this, perhaps by creating the variable directly and using fewer lines of code? Could anyone give some advice?
Thank you very much! Here is what I have:
flag_artifact = {'T_300'}; %flagging subject number 300 for example
for i = 1:size(T,1)
if isequal(table2cell(T(i, 1)), flag_artifact)
T(i, 1) = {'1'};
end
end
However, when creating the variable flag_artifact = {'T_300'}, I would like it to include more than one subject. I tried using flag_artifact = {'T_300'}; {'T_301'}, as well as flag_artifact = {'T_300', 'T_301'} but it doesn't work because these subject identifiers do not get replaced with 1s.

Transforming dates in tensorflow or tensorflow extended

I am working with Tensorflow Extended, preprocessing data and among this data are date values (e.g. values of the form 16-04-2019). I need to apply some preprocessing to this, like the difference between two dates and extracting the day, month and year from it.
For example, I could need to have the difference in days between 01-04-2019 and 16-04-2019, but this difference could also span days, months or years.
Now, just using Python scripts this is easy to do, but I am wondering if it is also possible to do this with Tensorflow? It's important for my use case to do this within Tensorflow, because the transform needs to be done in the graph format so that I can serve the model with the transformations inside the pipeline.
I am using Tensorflow 1.13.1, Tensorflow Extended and Python 2.7 for this.
Posting from similar issue on tft github.
Here's a way to do it:
import tensorflow_addons as tfa
import tensorflow as tf
from typing import TYPE_CHECKING
@tf.function(experimental_follow_type_hints=True)
def fn_seconds_since_1970(date_time: tf.string, date_format: str = "%Y-%m-%d %H:%M:%S %Z"):
    seconds_since_1970 = tfa.text.parse_time(date_time, date_format, output_unit='SECOND')
    seconds_since_1970 = tf.cast(seconds_since_1970, dtype=tf.int64)
    return seconds_since_1970
string_date_tensor = tf.constant("2022-04-01 11:12:13 UTC")
seconds_since_1970 = fn_seconds_since_1970(string_date_tensor)
seconds_in_hour, hours_in_day = tf.constant(3600, dtype=tf.int64), tf.constant(24, dtype=tf.int64)
hours_since_1970 = seconds_since_1970 / seconds_in_hour
hours_since_1970 = tf.cast(hours_since_1970, tf.int64)
hour_of_day = hours_since_1970 % hours_in_day
days_since_1970 = seconds_since_1970 / (seconds_in_hour * hours_in_day)
days_since_1970 = tf.cast(days_since_1970, tf.int64)
day_of_week = (days_since_1970 + 4) % 7 #Jan 1st 1970 was a Thursday, a 4, Sunday is a 0
print(f"On {string_date_tensor.numpy().decode('utf-8')}, {seconds_since_1970} seconds had elapsed since 1970.")
My two cents on the broader underlying issue: the question here is computing time differences, and we want to do these computations on tensors. Then the question becomes "What are the units of these tensors?" This is a question of granularity. The next question is "What are the data types involved?" Most likely we start with a string and end with a numeric. Then the next question becomes: is there a "native" tensorflow function that can do this? Enter tensorflow addons!
Just like we are trying to optimize training by doing everything as tensor operations within the graph, we similarly need to optimize "getting to the graph". I have seen the way datetime would work with python functions here, and I would do everything I could to avoid going into python function land, as the code becomes so complex and the performance suffers as well. It's a lose-lose in my opinion.
PS - This op is not yet implemented on windows as per this, maybe because it only returns unix timestamps :)
I had a similar problem. The issue is caused by an if-check within TFX that doesn't take date types into account. As far as I've been able to figure out, there are two options:
1) Preprocess the date column and cast it to an int field (e.g. by calling toordinal() on each element) before reading it into TFX.
2) Edit the TFX function that checks types so that it accounts for date-like types and casts them to ordinals on the fly.
You can navigate to venv/lib/python3.7/site-packages/tfx/components/example_gen/utils.py and look for the function dict_to_example. You can add a datetime check there like so
def dict_to_example(instance: Dict[Text, Any]) -> tf.train.Example:
    """Converts dict to tf example."""
    feature = {}
    for key, value in instance.items():
        # TODO(jyzhao): support more types.
        if isinstance(value, datetime.datetime):  # <---- Check here
            value = value.toordinal()
        if value is None:
            feature[key] = tf.train.Feature()
        ...
value will become an int, and the int will be handled and cast to a Tensorflow type later on in the function.
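For the first option (preprocessing before the data reaches TFX), a small plain-Python sketch could look like the following. It assumes the dates arrive as strings in the day-month-year form from the question (e.g. 16-04-2019); the helper names to_ordinal and date_parts are made up for this example.

```python
from datetime import datetime

DATE_FORMAT = "%d-%m-%Y"  # assumed input format, e.g. "16-04-2019"


def to_ordinal(text, fmt=DATE_FORMAT):
    """Convert a date string to its proleptic Gregorian ordinal (an int)."""
    return datetime.strptime(text, fmt).date().toordinal()


def date_parts(text, fmt=DATE_FORMAT):
    """Return (day, month, year) as ints."""
    parsed = datetime.strptime(text, fmt)
    return parsed.day, parsed.month, parsed.year


# Ordinals are plain ints, so the difference in days is a subtraction.
days_between = to_ordinal("16-04-2019") - to_ordinal("01-04-2019")
print(days_between)
print(date_parts("16-04-2019"))
```

Once the column holds ints, differences between dates can be computed inside the graph with ordinary integer ops, though day/month/year extraction would still have to happen in this preprocessing step.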

Matlab: Understanding a piece of code

I have a matlab code which is for printing a cell array to excel. The size of matrix is 50x13.
The row 1 is the column names.
Column 1 is dates and rest columns are numbers.
The dateformat being defined in the code is:
dFormat = struct;
dFormat.Style = struct( 'NumberFormat', '_(* #,##0.00_);_(* (#,##0.00);_(* "-"??_);_(#_)' );
dFormat.Font = struct( 'Size', 8 );
Can someone please explain me what the dFormat.Style code means ?
Thanks
The first line creates an empty struct (struct with no fields) called dFormat. A structure can contain pretty much anything in one of its fields, including another structure. The second line adds a field called 'Style' to the dFormat struct and sets it equal to another struct with a field called 'NumberFormat'. The 'NumberFormat' field is set equal to that long string of characters. You now have a structure of structures. The third line is similar to the second.
Note that the first line isn't really necessary unless dFormat already exists and needs to be "zeroed out", as assigning to dFormat.Style will create it implicitly. However, using the struct function can make code more readable in some cases, as objects use a similar notation for accessing methods and properties. In other words, all of your code could be replaced with:
dFormat.Style.NumberFormat = '_(* #,##0.00_);_(* (#,##0.00);_(* "-"??_);_(#_)';
dFormat.Font.Size = 8;
See this video from the MathWorks for more details and this list of helpful structure functions and examples.
@horchler already elaborated on structs, but I imagine you may actually be more interested in the content of this struct's Style field.
In case you are solely interested in _(* #,##0.00_);_(* (#,##0.00);_(* "-"??_);_(#_), that does not really look like something MATLAB related to me.
My best guess is that this string is later passed on to some other program, for example to build an Excel file; it has the shape of an Excel custom number format, with its sections separated by semicolons.
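If it is indeed an Excel custom number format, the semicolons separate up to four sections, applied in order to positive numbers, negative numbers, zeros and text. A small sketch (in Python rather than MATLAB, purely to take the string apart) makes that visible. It assumes no semicolon occurs inside a quoted part, which holds for this string.

```python
number_format = '_(* #,##0.00_);_(* (#,##0.00);_(* "-"??_);_(#_)'

# Excel applies the sections in this order.
section_names = ["positive", "negative", "zero", "text"]
sections = dict(zip(section_names, number_format.split(";")))

for name, code in sections.items():
    print(f"{name:>8}: {code}")
```

Read this way, positive values get thousands separators and two decimals, negative values are wrapped in parentheses, and zeros are shown as a dash.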

Matlab: dynamic name for structure

I want to create a structure with a variable name in a matlab script. The idea is to extract a part of an input string filled by the user and to create a structure with this name. For example:
CompleteCaseName = input('s');
USER WRITES '2013-06-12_test001_blabla';
CompleteCaseName = '2013-06-12_test001_blabla'
casename(12:18) = struct('x','y','z');
In this example, casename(12:18) gives me the result test001.
I would like to do this to allow me to compare easily two cases by importing the results of each case successively. So I could write, for instance :
plot(test001.x,test001.y,test002.x,test002.y);
The problem is that the line casename(12:18) = struct('x','y','z'); is invalid for Matlab because it makes me change a string to a struct. All the examples I find with struct are based on a definition like
S = struct('x','y','z');
And I can't find a way to make a dynamical name for S based on a string.
I hope someone understood what I write :) I checked on the FAQ and with Google but I wasn't able to find the same problem.
Use a structure with a dynamic field name.
For example,
mydata.(casename(12:18)) = struct;
will give you a struct mydata with a field test001.
You can then later add your x, y, z fields to this.
You can use the fields later either by mydata.test001.x, or by mydata.(casename(12:18)).x.
If at all possible, try to stay away from using eval, as another answer suggests. It makes things very difficult to debug, and the example given there, which directly evals user input:
eval('%s = struct(''x'',''y'',''z'');',casename(12:18));
is even a security risk - what happens if the user types in a string where the selected characters are system(''rm -r /''); a? Something bad, that's what.
As I already commented, the best case scenario is when all your x and y vectors have same length. In this case you can store all data from the different files into 2 matrices and call plot(x,y) to plot each column as a series.
Alternatively, you can use a cell array such that:
c = cell(2,numfiles);
for ii = 1:numfiles
    c{1,ii} = import x data from file ii
    c{2,ii} = import y data from file ii
end
plot(c{:})
A structure, on the other hand
s.('test001').x = ...
s.('test001').y = ...
Use eval:
eval(sprintf('%s = struct(''x'',''y'',''z'');',casename(12:18)));
Edit: apologies, forgot the sprintf.

Generate unique 3 letter/number code and compare to existing ones in PHP/MySQL

I'm making a code generation script for UN/LOCODE system and the database has unique 3 letter/number codes in every country. So for example the database contains "EE TLL", EE being the country (Estonia) and TLL the unique code inside Estonia, "AR TLL" can also exist (the country code and the 3 letter/number code are stored separately). Codes are in capital letters.
The database is fairly big and already contains a huge number of locations, the user has also the possibility of entering the 3 letter/number him/herself (which will be checked against the database before submission automatically).
Finally neither 0 or 1 may be used (possible confusion with O and I).
What I'm searching for is the most efficient way to pick the next available code when none is provided.
What I've come up with:
1) I'd check from AAA till 999, but then each code would require a new query (slow?).
2) I could store all the 40000 possibilities in an array and subtract all the used codes that are already in the database... but that uses too much memory IMO (not sure what I'm talking about here actually, maybe 40000 isn't such a big number).
3) Generate a random code, check whether it exists already, and start over if it does. That's just risk taking.
Is there some magic MySQL query/PHP script that can get me the next available code?
I will go with number 2, it is simple and 40000 is not a big number.
To make it more efficient, you can store a number representing each 3-letter code. The conversion should be trivial because you have a total of 34 (A-Z, 2-9) letters.
I would go for option 1 (i.e. do a sequential search), adding a table that gives the last assigned code per country (i.e. such that all codes from AAA up to that code are assigned already). When a new code is assigned through the sequential scan, that table gets updated; for user-assigned codes, it remains unmodified.
If you don't want to issue repeated queries, you can also write this scan as a stored routine.
To simplify iteration, it might be better to treat the three-letter codes as numbers (as Shawn Hsiao suggests), i.e. give a meaning to A-Z = 0..25, and 2..9 = 26..33. Then, XYZ is the number X*34^2+Y*34+Z == 23*1156+24*34+25 == 27429. This should be doable using standard MySQL functions, in particular using CONV.
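To make the numbering concrete, here is a sketch in Python (not PHP or MySQL) of the conversion in both directions, plus a sequential scan for the first free code. The function names are made up for this example, and the set of used codes stands in for whatever the database returns for one country.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ23456789"  # A-Z = 0..25, 2-9 = 26..33
BASE = len(ALPHABET)  # 34


def code_to_number(code):
    """Interpret a code as a base-34 number, most significant symbol first."""
    number = 0
    for symbol in code:
        number = number * BASE + ALPHABET.index(symbol)
    return number


def number_to_code(number, length=3):
    """Inverse of code_to_number for fixed-length codes."""
    symbols = []
    for _ in range(length):
        number, remainder = divmod(number, BASE)
        symbols.append(ALPHABET[remainder])
    return "".join(reversed(symbols))


def next_free_code(used_codes):
    """Return the lowest code not in used_codes, or None if all are taken."""
    used_numbers = {code_to_number(code) for code in used_codes}
    for number in range(BASE ** 3):
        if number not in used_numbers:
            return number_to_code(number)
    return None


print(code_to_number("XYZ"))
print(next_free_code({"AAA", "AAB", "AAD"}))
```

With a per-country "last assigned" number stored as suggested above, the scan could start from that number instead of from 0.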
I went with the 2nd option. I was also able to make a script that will try to match as close as possible the country name, for example for Tartu it will try to match T** then TA* and if possible TAR, if not it will try TAT as T is the next letter after R in Tartu.
The code is quite extensive, I'll just post the part that takes the first possible code:
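$allowed = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ23456789';
$length = strlen($allowed);
$codes = array();
// store all possibilities in a huge array
for($i=0;$i<$length;$i++)
    for($j=0;$j<$length;$j++)
        for($k=0;$k<$length;$k++)
            $codes[] = substr($allowed, $i, 1).substr($allowed, $j, 1).substr($allowed, $k, 1);
// collect the codes already used in this country
// ($country must be escaped or validated before it goes into the query)
$used = array();
$query = mysql_query("SELECT code FROM location WHERE country = '$country'");
while ($result = mysql_fetch_array($query))
    $used[] = $result['code'];
// array_diff() preserves keys, so index 0 is missing whenever 'AAA' is taken;
// reset() returns the first remaining element (false if nothing is left)
$remaining = array_diff($codes, $used);
$code = reset($remaining);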
Thanks for your opinion, this will be the key to transport codes all over the world :)