Selecting a range of columns in sklearn ColumnTransformer - encoding

I am encoding categorical data and many columns need to be selected. I have typed them in individually and it works OK, but there is obviously a more elegant way.
import numpy as np
import pandas as pd
dataset = pd.read_csv('train.csv')
x = dataset.iloc[:,:-1].values
y = dataset.iloc[:, -1].values
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(),[2,5,6,7,8,9,10,11,12,13,14,15,16,21,22,23,24,25,27,28,29,30,31,32,33,34,35,39,40,41,42,53,54,55,56,57,58,60,63,64,65,72,73,74,78,79])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
I have tried writing the range as (23:34) and I have tried using slice, but neither works as it is not the right data type.
Which method should I use for selecting a range of columns?
Also, what datatype is it at this point where I am selecting the columns?
I searched, but I was not able to find a solution for this exact question.
Finally, is this an efficient way to encode categorical data or should I be looking at an alternative method?
Thanks!

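You can use the following workaround. Note that selecting columns by label only works if you pass the DataFrame itself (here data), not its .values, to fit_transform:
from sklearn.preprocessing import OrdinalEncoder
ct = ColumnTransformer(
    transformers=[
        ("ordinal_enc", OrdinalEncoder(), data.loc[:, "col1":"col100"].columns)
    ])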
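If you would rather keep selecting by position, as in the question, a rough sketch is to build the index list from ranges instead of typing every number. The ranges below are read off the list in the question; the variable names are only illustrative.

```python
# Build the list of column positions from ranges instead of typing each one.
# range(a, b) stops before b, so range(5, 17) covers columns 5..16.
categorical_cols = [
    2,
    *range(5, 17),
    *range(21, 26),
    *range(27, 36),
    *range(39, 43),
    *range(53, 59),
    60,
    *range(63, 66),
    *range(72, 75),
    *range(78, 80),
]

# The hand-typed list from the question, kept only to check the ranges match it.
typed_by_hand = [2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 21, 22, 23, 24,
                 25, 27, 28, 29, 30, 31, 32, 33, 34, 35, 39, 40, 41, 42, 53, 54,
                 55, 56, 57, 58, 60, 63, 64, 65, 72, 73, 74, 78, 79]

print(categorical_cols == typed_by_hand)
```

The resulting list can go where the hand-typed one was, i.e. ('encoder', OneHotEncoder(), categorical_cols). As for the data type: after .values, x is a plain NumPy array, so at that point columns can only be given by position.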


K-means in pyspark running infinitely in Jupyter notebook, works fine in Zeppelin notebook

I am running a k-means algorithm in pyspark:
from pyspark.ml.clustering import KMeans
from pyspark.ml.clustering import KMeansModel
import numpy as np
kmeans_modeling = KMeans(k = 3, seed = 0)
model = kmeans_modeling.fit(data.select("parameters"))
The data is a pyspark sql dataframe: pyspark.sql.dataframe.DataFrame
However, the algorithm seems to run forever (it is taking much, much longer than expected for the amount of data in the dataframe).
Does anyone know what could be causing the algorithm to behave like this? I ran this exact code for a different dataframe of the same type, and everything worked fine.
The dataset I used before (that worked) had 72020 rows and 35 columns, and the present dataset has 60297 rows and 31 columns, so it is not a size-related problem. The data was normalized in both cases, but I assume the problem has to be in the data treatment. Can anyone help me with this? If any other information is needed let me know in the comments and I will answer or edit the question.
EDIT:
This is what I can show about creating the data:
aux1 = temp.filter("valflag = 0")
sample = spark.read.option("header", "true").option("delimiter", ",").csv("gs://LOCATION.csv").select("id")
data_pre = aux1.join(sample, sample["sample"] == aux1["id"], "leftanti").drop("sample")
data_pre.createOrReplaceTempView("data_pre")
data_pre = spark.table("data_pre")
data_pre = data.withColumn(col, functions.col(col).cast("double"))
data_pre = data_pre.na.fill(0)
data = vectorization_function(df = data_pre, inputCols = inputCols, outputCol = "parameters")
EDIT 2: I cannot provide additional information about the data, but I have now realized that the algorithm runs without problems in a Zeppelin notebook, while it does not work in a Jupyter notebook; I have edited the tags and title accordingly. Does anyone know why this could be happening?
Here is some documentation about running clustering jobs in Spark.
https://spark.apache.org/docs/latest/ml-clustering.html
Here is another, very similar, idea.
https://spark.apache.org/docs/latest/mllib-clustering.html

FITS_rec and selection of data: masking instead of "true" filtering?

Probably a duplicate of Ashley's post (but I can't comment yet ;) ).
I have the same issue when trying to add a column to a sub-selection/sample of my initial FITS_rec (based on numpy's recarray): all the rows reappear (AND the filling of this new column doesn't seem to be respected...). "hdu_sliced._get_raw_data()", proposed by Vlas Sokolov, is a solution that works very well for me, but I was wondering:
1) What are "the better ways" suggested by Iguananaut? I certainly need someone to just google it for me; the newbie me is feeling stuck :$ (Staying in a FITS_rec would be required).
2) Is that expected behaviour? Meaning, do we really want to work on a "masked array" which would be a copy of our original array? What worries me the most is the "collapse" of the values in the newly computed column. See below:
# A nice FITS_rec
a1 = np.array(['NGC1001', 'NGC1002', 'NGC1003'])
a2 = np.array([11.1, 12.3, 15.2])
col1 = fits.Column(name='target', format='20A', array=a1)
col2 = fits.Column(name='V_mag', format='E', array=a2)
cols = fits.ColDefs([col1, col2])
hdu = fits.BinTableHDU.from_columns(cols)
ori_rec=hdu.data
ori_rec
FITS_rec([('NGC1001', 11.1), ('NGC1002', 12.3), ('NGC1003', 15.2)],
dtype=(numpy.record, [('target', 'S20'), ('V_mag', '<f4')]))
# Sub-selection
bug=ori_rec[ori_rec["V_mag"]>12.]
bug
FITS_rec([('NGC1002', 12.3), ('NGC1003', 15.2)],
dtype=(numpy.record, [('target', 'S20'), ('V_mag', '<f4')]))
So far so good...
# Let's add a new column
col0=bug.columns
col1 =fits.ColDefs([fits.Column(name='new',format='D',array=bug.field("V_mag")+1.)])
newbug = fits.BinTableHDU.from_columns(col0 + col1).data
FITS_rec([('NGC1001', 11.1, 13.30000019), ('NGC1002', 12.3, 16.20000076),
('NGC1003', 15.2, 0. )],
dtype=(numpy.record, [('target', 'S20'), ('V_mag', '<f4'), ('new', '<f8')]))
...AND... the values of the new column for NGC1002 and NGC1003 are correct, but they end up in the rows of NGC1001 and NGC1002 respectively... :|
Any enlightenment will be welcomed :)
This is a confusing problem, and it stems from the fact that there are many layers of legacy classes and data structures in astropy.io.fits (stemming back from earlier versions of PyFITS). For example, you can see in your example that hdu.data is a FITS_rec object, which is like a Numpy recarray (itself a soft-deprecated legacy class), but it also has a .columns attribute (as you've noted):
>>> bug.columns
ColDefs(
name = 'target'; format = '20A'
name = 'V_mag'; format = 'E'
)
This in turn actually holds references back to the original arrays from which you described the columns. For example:
>>> bug.columns['target'].array
chararray(['NGC1001', 'NGC1002', 'NGC1003'],
dtype='|S20')
You can see here that even though bug is a "slice" of your original table, the arrays referenced through bug.columns still contain the original, unsliced array data. So when you do something like in your original post
>>> col0 = bug.columns
>>> col1 = fits.ColDefs([fits.Column(name='new',format='D',array=bug.field("V_mag")+1.)])
it's doing its best to figure out the intent, but col0 no longer has any idea that bug is a slice of the original table; it only has the original "coldefs" with the full columns to rely on.
Most of these classes, including FITS_rec, Column, and especially ColDefs almost never need to be used directly anymore. Unfortunately not all of the documentation has been updated to reflect this fact, and there are a lot of older tutorials and example code that show usage of these classes. Nobody with the requisite expertise has been able to take the time to update the docs and clarify this point.
On occasion Column is useful if you already have columnar data with each column in a separate array, and you want to build a table from it and give some specific FITS attributes to the table columns. But I have redesigned much of the API so that you can take native Python data structures like Numpy arrays and save them to FITS files without worrying about the details of how FITS is implemented or annoying things like FITS data format codes in many cases.
This work is slightly incomplete, because it seems if you want to define a FITS table from some columnar arrays, you still need to use the Column class and specify a FITS format at a minimum (but you never need to use ColDefs directly):
>>> hdu = fits.BinTableHDU.from_columns([fits.Column(name='target', format='20A', array=a1), fits.Column(name='V_mag', format='E', array=a2)])
>>> hdu.data
FITS_rec([('NGC1001', 11.1), ('NGC1002', 12.3), ('NGC1003', 15.2)],
dtype=(numpy.record, [('target', 'S20'), ('V_mag', '<f4')]))
However, you can also work with Numpy structured arrays directly, and I usually find that simpler personally, as it allows you to ignore most FITS-isms and just focus on your data, for those cases where it's not important to finely tweak the FITS-specific stuff. For example, to define a structured array for your data, there are several ways to go about that, but you might try:
>>> nrows = 3
>>> data = np.empty(nrows, dtype=[('target', 'S20'), ('V_mag', np.float32)])
>>> data['target'] = a1
>>> data['V_mag'] = a2
>>> data
array([('NGC1001', 11.100000381469727), ('NGC1002', 12.300000190734863),
('NGC1003', 15.199999809265137)],
dtype=[('target', 'S20'), ('V_mag', '<f4')])
and then you can instantiate a BinTableHDU directly from this array:
>>> hdu = fits.BinTableHDU(data)
>>> hdu.data
FITS_rec([('NGC1001', 11.1), ('NGC1002', 12.3), ('NGC1003', 15.2)],
dtype=(numpy.record, [('target', 'S20'), ('V_mag', '<f4')]))
>>> hdu.header
XTENSION= 'BINTABLE' / binary table extension
BITPIX = 8 / array data type
NAXIS = 2 / number of array dimensions
NAXIS1 = 24 / length of dimension 1
NAXIS2 = 3 / length of dimension 2
PCOUNT = 0 / number of group parameters
GCOUNT = 1 / number of groups
TFIELDS = 2 / number of table fields
TTYPE1 = 'target '
TFORM1 = '20A '
TTYPE2 = 'V_mag '
TFORM2 = 'E '
Likewise when it comes to things like masking and slicing and adding new columns, working directly with the native Numpy data structures is best.
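As a rough illustration of that last point, here is a minimal sketch of the row selection and the added column from the question, done on a plain NumPy structured array. The variable names (subset, with_new) and the copy-into-a-wider-dtype approach are just choices made for this example, not something astropy requires.

```python
import numpy as np

# Same data as in the question, held in a plain structured array.
data = np.empty(3, dtype=[('target', 'S20'), ('V_mag', np.float32)])
data['target'] = np.array([b'NGC1001', b'NGC1002', b'NGC1003'])
data['V_mag'] = np.array([11.1, 12.3, 15.2])

# Boolean indexing returns a real copy holding only the selected rows.
subset = data[data['V_mag'] > 12.0]

# Add a column by allocating an array with a wider dtype and copying fields over.
new_dtype = subset.dtype.descr + [('new', np.float64)]
with_new = np.empty(subset.shape, dtype=new_dtype)
for name in subset.dtype.names:
    with_new[name] = subset[name]
with_new['new'] = subset['V_mag'].astype(np.float64) + 1.0

print(with_new)
```

Because subset only ever contains the two selected rows, the new values stay aligned with NGC1002 and NGC1003. The result should be usable with fits.BinTableHDU(with_new) in the same way as the structured array above.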
Or, as suggested in the answers to the other question, use the Astropy Table API and don't mess with low-level FITS stuff at all if you can help it. Because, as I discussed, it contains several layers of legacy interfaces that make things confusing (and that long term should probably be cleaned up, but it's hard to do because code that uses them in some way or another is pervasive). The Table API was designed from the ground up to make table manipulations, including things like masking rows and adding columns, relatively easy, whereas the old PyFITS APIs never quite worked for many simple cases.
I hope this answer was edifying--I know it's maybe a bit long and confusing. If there is anything specific I can clear up let me know.

How to give column names after one hot encoding with sklearn?

Here is my question, I hope someone can help me to figure it out..
To explain, there are more than 10 categorical columns in my data set and each of them has 200-300 categories. I want to convert them into binary values. For that I used first label encoder to convert string categories into numbers. The Label Encoder code and the output is shown below.
After Label Encoder, I used One Hot Encoder from scikit-learn and it worked. BUT THE PROBLEM IS, I need column names after the one hot encoder. For example, column A with categorical values before encoding:
A = [1,2,3,4,..]
It should be like that after encoding,
A-1, A-2, A-3
Does anyone know how to assign column names (old column name - value name or number) after one hot encoding? Here is my one hot encoding and its output:
I need named columns because I trained an ANN, and every time new data comes in I cannot convert all the past data again and again. So I want to add just the new ones each time. Thanks anyway.
As @Vivek Kumar mentioned, you can use the pandas function get_dummies() instead of OneHotEncoder. I wanted to preserve a version of my initial DataFrame, so I did the following:
import pandas as pd
DataFrame2 = pd.get_dummies(DataFrame)
I used the following code to rename each one-hot encoded column to "original name_one-hot encoded name". So for your example it would give A_1, A_2, A_3. Feel free to change the "_" below to "-".
import numpy as np
# Assumed from context: df_pro is the input DataFrame, ohenc the fitted
# OneHotEncoder, and cat_arr its dense transformed output.
# Create list of columns with "object" dtype (np.object was removed from NumPy)
cat_cols = [col for col in df_pro.columns if df_pro[col].dtype == object]
# Find the array of new columns from one-hot encoding
cat_labels = ohenc.categories_
# Convert array of columns into list
cat_labels = np.concatenate(cat_labels).ravel().tolist()
# Use list comprehension to generate new list with labels needed
cat_labels_new = [(col + "_" + label) for label in cat_labels for col in cat_cols if
                  label in df_pro[col].values.tolist()]
# Create new DataFrame of transformed columns using the new list of labels
cat_ohc = pd.DataFrame(cat_arr, columns=cat_labels_new)
#Concat with original DataFrame and drop original columns (only columns with "object" dtype)
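For comparison, a minimal sketch of building the names by pairing each column with its own categories, which avoids having to search for the column a label belongs to. The categories list here is a hand-written stand-in for what a fitted OneHotEncoder exposes as categories_ (one sequence per input column, in column order); the column names and values are made up.

```python
# Hypothetical column names and per-column categories (stand-in for
# the categories_ attribute of a fitted OneHotEncoder).
cat_cols = ["A", "B"]
categories = [[1, 2, 3], ["x", "y"]]

sep = "-"  # the question asked for names like A-1; use "_" if preferred

# One name per (column, category) pair, in the encoder's column order.
new_names = [
    f"{col}{sep}{cat}"
    for col, cats in zip(cat_cols, categories)
    for cat in cats
]

print(new_names)  # ['A-1', 'A-2', 'A-3', 'B-x', 'B-y']
```

Assuming the encoder was fitted on exactly these columns in this order (and without dropping any category), the names should line up one-to-one with the columns of its output, and this also works when two columns share a label.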

Matlab: get numeric values from table

I import a sheet from Excel into matlab using the command "readtable":
TABLE = readtable(Excel.FN, 'sheet', Excel.Sheet);
The table contains both numeric values and strings.
If I try to access the numeric values, I can't get them as double.
TABLE{j,i} = '0.00069807'
is still a cell.
cell2num(TABLE{j,i}) = NaN
cell2mat(TABLE{j,i}) = 0.00069807,
but this is a char. So I use
str2num(cell2mat(TABLE{j,i}))
to obtain the numeric value. There must be a simpler way. Could you please tell me the command.
If you don't insist on readtable, xlsread would be better for you. Loaded data are more "matlab-friendly" with this function.
I am not sure whether there is a simpler solution with readtable. I think that's just the price you need to pay for not working with the "rawer" data such as CSV or simple text files.

Matlab: dynamic name for structure

I want to create a structure with a variable name in a matlab script. The idea is to extract a part of an input string filled by the user and to create a structure with this name. For example:
CompleteCaseName = input('s');
USER WRITES '2013-06-12_test001_blabla';
CompleteCaseName = '2013-06-12_test001_blabla'
casename(12:18) = struct('x','y','z');
In this example, casename(12:18) gives me the result test001.
I would like to do this to allow me to compare easily two cases by importing the results of each case successively. So I could write, for instance :
plot(test001.x,test001.y,test002.x,test002.y);
The problem is that the line casename(12:18) = struct('x','y','z'); is invalid for Matlab because it makes me change a string to a struct. All the examples I find with struct are based on a definition like
S = struct('x','y','z');
And I can't find a way to make a dynamical name for S based on a string.
I hope someone understood what I write :) I checked on the FAQ and with Google but I wasn't able to find the same problem.
Use a structure with a dynamic field name.
For example,
mydata.(casename(12:18)) = struct;
will give you a struct mydata with a field test001.
You can then later add your x, y, z fields to this.
You can use the fields later either by mydata.test001.x, or by mydata.(casename(12:18)).x.
If at all possible, try to stay away from using eval, as another answer suggests. It makes things very difficult to debug, and the example given there, which directly evals user input:
eval('%s = struct(''x'',''y'',''z'');',casename(12:18));
is even a security risk - what happens if the user types in a string where the selected characters are system(''rm -r /''); a? Something bad, that's what.
As I already commented, the best case scenario is when all your x and y vectors have same length. In this case you can store all data from the different files into 2 matrices and call plot(x,y) to plot each column as a series.
Alternatively, you can use a cell array such that:
c = cell(2,numfiles);
for ii = 1:numfiles
c{1,ii} = import x data from file ii
c{2,ii} = import y data from file ii
end
plot(c{:})
A structure, on the other hand
s.('test001').x = ...
s.('test001').y = ...
Use eval:
eval(sprintf('%s = struct(''x'',''y'',''z'');',casename(12:18)));
Edit: apologies, forgot the sprintf.