Encoding strings to numbers for use in scikit-learn

My data consists of 50 columns, most of them strings. I have a single multi-class variable which I have to predict. I tried using LabelEncoder in scikit-learn to convert the features (not the classes) into whole numbers and feed them as input to the RandomForest model I am using for classification.
Now, when new test data comes in (a stream of new data), how will I know, for each column, what the label for each string should be? Using LabelEncoder on the new data will give me new labels, independent of the labels I generated before. Am I doing this wrong? Is there anything else I should use for consistent encoding?

The LabelEncoder class has two methods that handle this distinction: fit and transform. Typically you call fit first to map some data to a set of integers:
>>> le = LabelEncoder()
>>> le.fit(['a', 'e', 'b', 'z'])
>>> le.classes_
array(['a', 'b', 'e', 'z'], dtype='U1')
Once you've fit your encoder, you can transform any data to the label space, without changing the existing mapping:
>>> le.transform(['a', 'e', 'a', 'z', 'a', 'b'])
array([0, 2, 0, 3, 0, 1])
>>> le.transform(['e', 'e', 'e'])
array([2, 2, 2])
The use of this encoder basically assumes that you know beforehand what all the labels are in all of your data. If you have labels that might show up later (e.g., in an online learning scenario), you'll need to decide how to handle those outside the encoder.
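For example, a minimal sketch of such a policy (the -1 sentinel here is just an illustrative choice, not part of the LabelEncoder API): check new values against le.classes_ before encoding:
>>> known = set(le.classes_)
>>> new_data = ['a', 'q', 'z']  # 'q' was never seen during fit
>>> [int(le.transform([s])[0]) if s in known else -1 for s in new_data]
[0, -1, 3]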

You could save the string -> label mapping for each column when you fit on the training data.
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> col_1 = ["paris", "paris", "tokyo", "amsterdam"]
>>> set_col_1 = list(set(col_1))
>>> le.fit(col_1)
>>> dict(zip(set_col_1, le.transform(set_col_1)))
{'amsterdam': 0, 'paris': 1, 'tokyo': 2}
When the test data arrive, you can use those mappings to encode the corresponding columns of the test data. You do not have to fit the encoder again on the test data.
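For instance, applying the saved mapping to a test column might look like this (a sketch; the -1 default for unseen strings is an arbitrary choice):
>>> mapping = dict(zip(set_col_1, le.transform(set_col_1)))
>>> test_col_1 = ["tokyo", "paris", "london"]
>>> [int(mapping.get(city, -1)) for city in test_col_1]  # -1 marks strings unseen in training
[2, 1, -1]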

Related

Polars searchsorted with a Series

searchsorted is an incredibly useful utility in numpy and pandas for performing a binary search on every element in a list, especially for time-series data.
import numpy as np
np.searchsorted(['a', 'a', 'b', 'c'], ['a', 'b', 'c']) # Returns [0, 2, 3]
np.searchsorted(['a', 'a', 'b', 'c'], ['a', 'b', 'c'], side='right') # Returns [2, 3, 4]
I have a few questions about Polars:
Is there any way to apply search_sorted on a list in Polars in a vectorized manner?
Is there any way to specify side='right' for search_sorted?
Can we use non-numeric data in search_sorted?
If the answer to these is no, what would be the recommended approach/workaround to achieve these functionalities?
(The ideal approach is if search_sorted can be used as part of an expression, e.g. pl.col('A').search_sorted(pl.col('B')))
Here's what I have tried:
import polars as pl
pl.Series(['a', 'a', 'b', 'c']).search_sorted(['a', 'b', 'c']) # PanicException: not implemented for Utf8
pl.Series([0, 0, 1, 2]).search_sorted([0, 1, 2]) # PanicException: dtype List not implemented
list(map(pl.Series([0, 0, 1, 2]).search_sorted, [0, 1, 2])) # Returns [1, 2, 3], different from numpy results
pl.DataFrame({
    'a': [0, 0, 1, 2],
    'b': [0, 1, 2, 3],
}).with_columns([
    pl.col('a').search_sorted(pl.col('b')).alias('c')
])  # Column 'c' is [1, 1, 1, 1], which is incorrect
I understand Polars is still a work in progress and some functionalities are missing, so any help is greatly appreciated!
To extend on @ritchie46's answer: you need a rolling (asof) join so that missing values can be joined to their nearest neighbor. Unfortunately, rolling joins don't work on letters, or more accurately on Utf8 dtypes, so for your example you have to do an extra step.
Starting from:
df1 = (pl.Series("a", ["a", "a", "b", "c"])
       .set_sorted()
       .to_frame()
       .with_row_count("idx"))
df2 = pl.Series("a", ["a", "b", "c"]).set_sorted().to_frame()
Then we make a df to house all the possible values of a and map each of them to a numeric index.
dfindx = (pl.DataFrame(pl.concat([df1.get_column('a'), df2.get_column('a')]).unique())
          .sort('a')
          .with_row_count('valindx'))
Now we add that valindx to each of df1 and df2:
df1 = df1.join(dfindx, on='a')
df2 = df2.join(dfindx, on='a')
To get almost to the finish line you'd do:
df2.join_asof(df1, on='valindx', strategy='forward')
This will leave the last value missing (the 4 from the numpy case). Essentially what's happening is that a value without an exact match takes the idx of its nearest forward neighbor, and so on, but for the last value there is nothing in df1 forward of it, so we need a minor hack of just filling in that null with the max idx + 1.
(df2
 .join_asof(df1, on='valindx', strategy='forward')
 .with_column(pl.col('idx').fill_null(df1.select(pl.col('idx').max() + 1)[0, 0]))
 .get_column('idx'))
Of course, if you're using times or numerics then you can skip the first step. Additionally, I suspect that fetching this index value is just an intermediate step, and that the overall process could be done more efficiently without extracting the index values at all, but still through a join_asof.
If you change the strategy of join_asof then that should be largely the same as switching the side, but you'd have to change the hack bit at the end too.
EDIT: I added the requested functionality and it will be available in next release: https://github.com/pola-rs/polars/pull/6083
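With that in place, something like the following should work on a recent Polars release (a sketch; check the docs of your installed version for the exact signature):
import polars as pl

s = pl.Series(['a', 'a', 'b', 'c'])
# vectorized search over a whole Series, with an explicit side
s.search_sorted(pl.Series(['a', 'b', 'c']), side='left')   # [0, 2, 3]
s.search_sorted(pl.Series(['a', 'b', 'c']), side='right')  # [2, 3, 4]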
Old answer (wrong)
For a "normal" search sorted we can use a join.
# convert to DataFrame
# provide polars with the information the data is sorted (this speeds up many algorithms)
# set a row count
df1 = (pl.Series("a", ["a", "a", "b", "c"])
       .set_sorted()
       .to_frame()
       .with_row_count("idx"))
df2 = pl.Series("a", ["a", "b", "c"]).set_sorted().to_frame()
# join
# drop duplicates
# and only show the indices that were joined
df1.join(df2, on="a", how="semi").unique(subset=["a"])["idx"]
Series: 'idx' [u32]
[
0
2
3
]

Inverse of pyspark.sql.functions.greatest

Is there an inverse to the function greatest?
Something to get the min of multiple columns?
If there is not, do you know any other way to find it without using UDFs?
Thank you!
The inverse is:
pyspark.sql.functions.least(*cols)
Returns the least value of the list of column names, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.
>>> df = spark.createDataFrame([(1, 4, 3)], ['a', 'b', 'c'])
>>> df.select(least(df.a, df.b, df.c).alias("least")).collect()
[Row(least=1)]
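As a quick illustration of the null-skipping behavior described above (a sketch on a toy DataFrame; the extra row is there so Spark can infer the type of column a):
>>> df = spark.createDataFrame([(None, 4, 3), (2, 5, 9)], ['a', 'b', 'c'])
>>> df.select(least(df.a, df.b, df.c).alias("least")).collect()
[Row(least=3), Row(least=2)]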

How to set proper arguments to build keras Convolution2D NN model [Text Classification]?

I am trying to use a 2D CNN to do text classification on Chinese articles and have trouble setting the arguments of keras Convolution2D. I know the basic flow of Convolution2D for images, but I am stuck using my own dataset with keras.
Input data
My data is 9800 Chinese articles; the max sentence length is 6810, with a word2vec size of 200.
So the input shape is `(9800, 1, 6810, 200)`.
Code for building model
from keras.models import Sequential
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.layers.core import Dense, Dropout, Activation, Flatten

MAX_FEATURES = 6810
# I just randomly picked one filter count, seems this is the problem?
nb_filter = 128
input_shape = (1, 6810, 200)
# each word is 200 (word2vec size)
embedding_size = 200
# 3 word length
n_gram = 3
# so stride here is embedding_size*n_gram
hidden_dims = 128  # was undefined in the original snippet; any hidden layer size
model = Sequential()
model.add(Convolution2D(nb_filter, n_gram, embedding_size, border_mode='valid', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(100, 1), border_mode='valid'))
model.add(Dropout(0.5))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(hidden_dims))
model.add(Dropout(0.5))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
# X is (9800, 1, 6810, 200)
model.fit(X, y, batch_size=32,
          nb_epoch=5,
          validation_split=0.1)
Question 1. I have trouble setting the Convolution2D arguments. My research is below:
The official docs do not contain an example of 2D CNN text classification (though they do have a 1D CNN example).
The Convolution2D definition is here: https://keras.io/layers/convolutional/
keras.layers.convolutional.Convolution2D(nb_filter, nb_row, nb_col, init='glorot_uniform', activation=None, weights=None, border_mode='valid', subsample=(1, 1), dim_ordering='default', W_regularizer=None, b_regularizer=None, activity_regularizer=None, W_constraint=None, b_constraint=None, bias=True)
nb_filter: Number of convolution filters to use.
nb_row: Number of rows in the convolution kernel.
nb_col: Number of columns in the convolution kernel.
border_mode: 'valid', 'same' or 'full'. ('full' requires the Theano backend.)
Some research about the arguments:
This issue https://github.com/fchollet/keras/issues/233 is about 2D CNNs for text classification; I read all the comments and picked out these:
(1) https://github.com/fchollet/keras/issues/233#issuecomment-117427013
model.add(Convolution2D(nb_filter=N_FILTERS, stack_size=1, nb_row=FIELD_SIZE,
                        nb_col=1, subsample=(STRIDE, 1)))
(2) https://github.com/fchollet/keras/issues/233#issuecomment-117700913
sequential.add(Convolution2D(nb_feature_maps, 1, n_gram, embedding_size))
But it seems to have some differences from the current keras version, and the argument naming used by different people is a mess (I hope keras gets an easily understandable argument explanation).
Another comment I saw about the current API:
https://github.com/fchollet/keras/issues/1665#issuecomment-181181000
The current API is as below:
keras.layers.convolutional.Convolution2D(nb_filter, nb_row, nb_col, init='glorot_uniform', activation='linear', weights=None, border_mode='valid', subsample=(1, 1), dim_ordering='th', W_regularizer=None, b_regularizer=None, activity_regularizer=None, W_constraint=None, b_constraint=None)
So (36,1,7,7) seems to be the reason; the correct arguments would be (36,7,7,...).
From the above research, on my understanding of convolution, Convolution2D creates a (nb_filter, nb_row, nb_col) filter, slides it by a stride to get one filter result, repeats the sliding, and finally combines the results into an array with shape (1, one_sample_article_length[6810] / nb_filter), then passes it to the next layer. Is that right? Does my code above set nb_row and nb_col correctly?
Question 2. What are the proper MaxPooling2D arguments? (for my dataset or in general, either is OK)
I referred to this issue https://github.com/fchollet/keras/issues/233#issuecomment-117427013 to set the argument; there are two variants:
MaxPooling2D(poolsize=(((nb_features - FIELD_SIZE) / STRIDE) + 1, 1))
MaxPooling2D(poolsize=(maxlen - n_gram + 1, 1))
I have no idea why they calculate the MaxPooling2D argument like that.
Question 3. Any recommendation for batch_size and nb_epoch for this kind of text classification? I have no idea at all.

Pyspark: Passing full dictionary to each task

PySpark: I want to pass my custom dictionary, which contains the distances to several locations, to each task, because for each row in my RDD I need to calculate the distance between the row's value and every location in the dictionary and take the minimum distance. broadcast didn't solve my problem.
Example:
dict = {'a': 3, 'b': 6, 'c': 2}
RDD:
(location1, 5)
(location2, 9)
(location3, 8)
Output:
(location1, 1)
(location2, 3)
(location3, 2)
Please help, and thanks!
A broadcast variable will definitely solve your problem in this case, though you could also just pass the dictionary (or a list; see below) in your map function. Whether using a broadcast variable is worthwhile depends on the size of the object.
First of all, since all you want is the minimum distance, it looks like you don't care about the keys of the dictionary, just the values. If that list of values is sorted, it is possible to find the minimum distance efficiently with a binary search.
>>> d = {'a': 3, 'b': 6, 'c': 2}
>>> locations = sorted(d.values())
>>> rdd = sc.parallelize([('location1', 5), ('location2', 9), ('location3', 8)])
Now define a function to find the minimum distance using bisect.bisect. We turn this general function into a function of a single element by using functools.partial to fix the second argument.
>>> from functools import partial
>>> from bisect import bisect
>>> def find_min_distance(loc, locations):
... ind = bisect(locations, loc)
... if ind == len(locations):
... return loc - locations[-1]
... elif ind == 0:
... return locations[0] - loc
... else:
... left_dist = loc - locations[ind - 1]
... right_dist = locations[ind] - loc
... return min(left_dist, right_dist)
>>> mapper = partial(find_min_distance, locations=locations)
>>> rdd.mapValues(mapper).collect()
[('location1', 1), ('location2', 3), ('location3', 2)]
To do this instead with a broadcast variable:
>>> locations_bv = sc.broadcast(locations)
>>> def mapper(loc):
... return find_min_distance(loc, locations_bv.value)
...
>>> rdd.mapValues(mapper).collect()
[('location1', 1), ('location2', 3), ('location3', 2)]

Is there an issue with using NetworkX from multiple processes on different graphs?

If each process creates its own nx.Graph() and adds/removes nodes/edges to it, is there any reason for them to collide? I am noticing some weird phenomena and am trying to debug them.
The general problem is that I am dumping a single graph as an edge list and recreating it from a subset in each process into a new graph. For some reason those new graphs are missing edges.
EDIT:
I think I found the part of the code that causes the problems for me; the question is whether the following is the intended behaviour of NetworkX or not:
>>> import networkx as nx
>>> g = nx.Graph()
>>> g.add_path([0,1,2,3])
>>> g.nodes()
[0, 1, 2, 3]
>>> g.edges()
[(0, 1), (1, 2), (2, 3)]
>>> g[1][0]
{}
>>> g[0][1] = {"test":1}
>>> g.edges(data=True)
[(0, 1, {'test': 1}), (1, 2, {}), (2, 3, {})]
>>> g[1][0]
{}
>>> g[0][1]
{'test': 1}
>>>
Since the graph is an undirected one, I would expect the edge data to appear regardless of the order of the node ids in the lookup. Is that an incorrect assumption?
In general there should be no issue with multiple processes as long as they each have their own Graph() object.
EDIT:
In your case you are explicitly assigning data to the internal NetworkX Graph structure with the line
>>> g[0][1] = {"test":1}
While that is technically allowed, it breaks the API and the data structure. You should instead use
>>> g.add_edge(0, 1, test=1)
which won't add a new edge here, only a new attribute, since the edge already exists. Doing it that way assigns the data to g[0][1] and g[1][0] correctly.
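A quick sketch of the difference, on a fresh graph in the same session:
>>> h = nx.Graph()
>>> h.add_path([0, 1, 2, 3])
>>> h.add_edge(0, 1, test=1)  # sets the attribute through the API
>>> h[0][1]
{'test': 1}
>>> h[1][0]  # the same attribute dict is visible from both directions
{'test': 1}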