How to remove the 'u' (unicode) prefix from all the values in a column in a DataFrame?

How does one remove the 'u' (unicode) prefix from all the values in a column in a DataFrame? For example:
table.place.unique()
array([u'Newyork', u'Chicago', u'San Francisco'], dtype=object)

>>> df = pd.DataFrame([u'c%s'%i for i in range(11,21)], columns=["c"])
>>> df
c
0 c11
1 c12
2 c13
3 c14
4 c15
5 c16
6 c17
7 c18
8 c19
9 c20
>>> df['c'].values
array([u'c11', u'c12', u'c13', u'c14', u'c15', u'c16', u'c17', u'c18',
u'c19', u'c20'], dtype=object)
>>> df['c'].astype(str).values
array(['c11', 'c12', 'c13', 'c14', 'c15', 'c16', 'c17', 'c18', 'c19', 'c20'], dtype=object)
>>>
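A side note on what the `u` actually is: it is only how Python 2's repr displays unicode objects, not part of the stored value. Under Python 3, where every string is unicode, the prefix never appears, but the same `astype(str)` cast still works. A minimal sketch using the place names from the question:

```python
import pandas as pd

# hypothetical column of unicode place names, as in the question
table = pd.DataFrame({"place": [u"Newyork", u"Chicago", u"San Francisco"]})

# cast to str; on Python 3 this is effectively a no-op for text data,
# on Python 2 it converts unicode objects to byte strings
cleaned = table["place"].astype(str)
print(cleaned.tolist())
```

On Python 3 the printed list shows no `u` prefix because there is none to remove.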

Related

kdb union join (with plus join)

I have been stuck on this for a while now but cannot come up with a solution; any help would be appreciated.
I have 2 tables like
q)x
a b c d
--------
1 x 10 1
2 y 20 1
3 z 30 1
q)y
a b| c d
---| ----
1 x| 1 10
3 h| 2 20
I would like to sum the common columns and append the new ones. The expected result should be
a b c d
--------
1 x 11 11
2 y 20 1
3 z 30 1
3 h 2 20
pj appears to only update the (1,x) row but doesn't insert the new (3,h) row. I am assuming there has to be a way to do some sort of union + plus join in kdb.
You can take advantage of the plus (+) operator here by simply keying x and adding the table y to get the desired table:
q)(2!x)+y
a b| c d
---| -----
1 x| 11 11
2 y| 20 1
3 z| 30 1
3 h| 2 20
The same "plus if there's a matching key, insert if not" behaviour works for dictionaries too:
q)(`a`b!1 2)+`a`c!10 30
a| 11
b| 2
c| 30
got it :)
q) (x pj y), 0!select from y where not ([]a;b) in key 2!x
a b c d
--------
1 x 11 11
2 y 20 1
3 z 30 1
3 h 2 20
Always open to a better implementation :D I am sure there is one.
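For readers more at home in pandas, the same "sum on matching keys, append otherwise" behaviour of the keyed `+` can be sketched with `DataFrame.add` and `fill_value` (a rough analogue under the assumption that both tables share the key columns, not kdb itself):

```python
import pandas as pd

# the two tables from the question
x = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"],
                  "c": [10, 20, 30], "d": [1, 1, 1]})
y = pd.DataFrame({"a": [1, 3], "b": ["x", "h"],
                  "c": [1, 2], "d": [10, 20]})

# key both tables on (a, b); add() sums values on matching keys and keeps
# unmatched rows from either side, filling the missing addend with 0
result = x.set_index(["a", "b"]).add(y.set_index(["a", "b"]), fill_value=0)
print(result)
```

As in the q answer, (1,x) is summed while (2,y), (3,z), and (3,h) pass through unchanged.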

Pivot table and onehot in pyspark

I have a pyspark data frame which looks like -
id age cost gender
1 38 230 M
2 40 832 M
3 53 987 F
1 38 764 M
4 63 872 F
5 21 763 F
I want my data frame to look like -
id age cost gender M F
1 38 230 M 1 0
2 40 832 M 1 0
3 53 987 F 0 1
1 38 764 M 1 0
4 63 872 F 0 1
5 21 763 F 0 1
4 63 1872 F 0 1
Using python I can manage in following way -
final_df = pd.concat([df.drop(['gender'], axis=1), pd.get_dummies(df['gender'])], axis=1)
How do I manage this in pyspark?
You just need to add 2 columns:
from pyspark.sql import functions as F

final_df = df.select(
    "id",
    "age",
    "cost",
    "gender",
    F.when(F.col("gender") == F.lit("M"), 1).otherwise(0).alias("M"),
    F.when(F.col("gender") == F.lit("F"), 1).otherwise(0).alias("F"),
)

Pivot table in Pyspark

I am very very new to pyspark. My data frame looks like -
id value subject
1 75 eng
1 80 his
2 83 math
2 73 science
3 88 eng
I want my data frame to look like -
id eng his math science
1 .49 .51 0 0
2 0 0 .53 .47
3 1 0 0 0
That means taking a row-wise sum and then dividing each cell by it; I want to calculate the percentage contribution of each cell.
I have tried the following code but it's not working -
from pyspark.sql import functions as F
from pyspark.sql import Window
df = df.withColumn('rank',F.dense_rank().over(Window.orderBy("id","value","subject")))
df.withColumn('combcol',F.concat(F.lit('col_'),df['rank'])).groupby('id').pivot('combcol').agg(F.first('value')).show()
Check if the following code works for you.
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 75, 'eng'), (1, 80, 'his'), (2, 83, 'math'), (2, 73, 'science'), (3, 88, 'eng')],
    ['id', 'value', 'subject']
)
# create the pivot table
df1 = df.groupby('id').pivot('subject').agg(F.first('value')).fillna(0)
# column names used to sum up for total
cols = df1.columns[1:]
# calculate the total and then percentage accordingly for each cols
df1.withColumn('total', sum([F.col(c) for c in cols])) \
   .select('id', *[F.format_number(F.col(c)/F.col('total'), 2).alias(c) for c in cols]) \
   .show()
#+---+----+----+----+-------+
#| id| eng| his|math|science|
#+---+----+----+----+-------+
#| 1|0.48|0.52|0.00| 0.00|
#| 3|1.00|0.00|0.00| 0.00|
#| 2|0.00|0.00|0.53| 0.47|
#+---+----+----+----+-------+
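For comparison, the same pivot-then-normalize computation can be sketched in pandas for a quick local check (note that, as with format_number above, rounding makes 75/155 display as 0.48 rather than the 0.49 in the question):

```python
import pandas as pd

# the sample frame from the question
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3],
    "value": [75, 80, 83, 73, 88],
    "subject": ["eng", "his", "math", "science", "eng"],
})

# pivot to wide, fill missing subjects with 0, then divide each row by its sum
wide = df.pivot_table(index="id", columns="subject", values="value", fill_value=0)
pct = wide.div(wide.sum(axis=1), axis=0).round(2)
print(pct)
```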

Merging two data sets based on intervals

I was wondering how I can merge these two data sets by a and b. The a column in the f data set is the lower bound of the intervals, so I need to merge 1.5 from the g data set with 1 from f, 4.4 from g with 4 from f, 9.8 from g with 9 from f, and so on.
a<-seq(1:10)
b<-c("a","b","a","b","a","a","a","b","b","a")
f<-data.frame(a,b)
a<-c(1.5,1.4,2.3,2.2,4.4,4,5,6.6,9.8,4.1,4.6,5.5)
b<-c("a","b","b","b","a","b","a","b","a","b","a","b")
m<-seq(1:12)
g<-data.frame(a,b,m)
Not sure exactly what you are looking for here, but the floor() function should give you what you need. You might also look into the tidyverse, in general, and dplyr, in particular, for data manipulation.
It is not entirely clear what you expect as output - the b column differs a bit after merging - did you only want the records that match? Remove the all.x and all.y parameters if you do not care about unmatched records. I also presume that renaming your columns may be in order:
a <- seq(1:10)
b <- c("a", "b", "a", "b", "a", "a", "a", "b", "b", "a")
f <- data.frame(a, b)
a <- c(1.5, 1.4, 2.3, 2.2, 4.4, 4, 5, 6.6, 9.8, 4.1, 4.6, 5.5)
b <- c("a", "b", "b", "b", "a", "b", "a", "b", "a", "b", "a", "b")
m <- seq(1:12)
g <- data.frame(a, b, m)
## floor function takes care of rounding down
g$c <- floor(g$a)
merge(f, g, by.x = "a", by.y = "c", all.x = TRUE, all.y = TRUE)
#> Warning in merge.data.frame(f, g, by.x = "a", by.y = "c", all.x = TRUE, :
#> column name 'a' is duplicated in the result
#> a b.x a b.y m
#> 1 1 a 1.5 a 1
#> 2 1 a 1.4 b 2
#> 3 2 b 2.3 b 3
#> 4 2 b 2.2 b 4
#> 5 3 a NA <NA> NA
#> 6 4 b 4.4 a 5
#> 7 4 b 4.0 b 6
#> 8 4 b 4.6 a 11
#> 9 4 b 4.1 b 10
#> 10 5 a 5.5 b 12
#> 11 5 a 5.0 a 7
#> 12 6 a 6.6 b 8
#> 13 7 a NA <NA> NA
#> 14 8 b NA <NA> NA
#> 15 9 b 9.8 a 9
#> 16 10 a NA <NA> NA
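The floor-then-merge idea translates directly to pandas as well (a sketch of the same approach with the question's data, not the R answer itself):

```python
import pandas as pd
import numpy as np

f = pd.DataFrame({"a": range(1, 11),
                  "b": ["a", "b", "a", "b", "a", "a", "a", "b", "b", "a"]})
g = pd.DataFrame({"a": [1.5, 1.4, 2.3, 2.2, 4.4, 4, 5, 6.6, 9.8, 4.1, 4.6, 5.5],
                  "b": ["a", "b", "b", "b", "a", "b", "a", "b", "a", "b", "a", "b"],
                  "m": range(1, 13)})

# floor gives the lower bound of each interval, matching f's integer key
g["c"] = np.floor(g["a"]).astype(int)

# outer merge keeps unmatched rows from both sides, like all.x/all.y in R
merged = f.merge(g, left_on="a", right_on="c", how="outer", suffixes=("_f", "_g"))
print(merged)
```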

How can I efficiently pivot a table with replications in MATLAB from tall to wide format?

I have a tall MATLAB table like this:
spam =
Data cat1 cat2 time
__________ ___________ __________ ______
-0.41763 1 0 0
0.11719 1 0 0
... ... ... ...
-0.16546 1 0 1
... ... ... ...
-0.21763 1 0 2
0.31719 2 0 0
... ... ... ...
0.58116 3 1 0
... ... ... ...
Data is of double format, cat1 (8 levels) and cat2 (3 levels) are categorical, and time (3 levels) is ordinal (but could be double). Each time point of each cat1 and cat2 level includes 30 (technical) replicates (denoted by ... above).
I wish to use these data in fitrm, which requires them in the wide format. Hence I need to transform the Data column to three separate variables.
Using unstack I get something like this:
spam = unstack(spam, 'Data', 'time')
Warning: Variable names were modified to make them valid MATLAB identifiers.
spam =
cat1 cat2 x0 x1 x2
______ _________ ___________ _______ ________
1 0 -7.6605e-15 2.3168 0.45234
2 0 6.2172e-15 5.1661 24.89
3 1 8.8818e-16 56.697 40.441
4 1 -7.9936e-15 -22.741 -17.191
5 1 -1.4433e-15 -7.7803 -20.817
6 2 5.5511e-16 7.8535 -0.21172
7 2 5.3291e-15 13.658 5.8402
8 2 2.2204e-15 9.1739 13.814
Obviously this result does not include all the information in the tall table.
Specifically, the replicates have not been carried to the result.
Using accumarray, in a similar fashion to that shown in another Stack Overflow answer, could be promising, but in my case it seemed easier to perform the one-time transformation by hand.
Is anyone aware of a more efficient approach?
I realize now that perhaps the easiest way is to just add an extra replication variable to the tall table and then use unstack as above.
Example (with different data, taken from my answer here):
name = ['A' 'A' 'A' 'B' 'B' 'C' 'C' 'C' 'C' 'D' 'D' 'E' 'E' 'E']';
value = randn(14, 1);
rep = [1, 2, 3, 1, 2, 1, 2, 3, 4, 1, 2, 1, 2, 3]'; % transposed so all table variables are column vectors
T = table(name, value, rep);
T =
name value rep
____ _________ ___
A 0.53767 1
A 1.8339 2
A -2.2588 3
B 0.86217 1
B 0.31877 2
C -1.3077 1
C -0.43359 2
C 0.34262 3
C 3.5784 4
D 2.7694 1
D -1.3499 2
E 3.0349 1
E 0.7254 2
E -0.063055 3
pivotTable = unstack(T, 'value','name')
pivotTable =
rep A B C D E
___ _______ _______ ________ _______ _________
1 0.53767 0.86217 -1.3077 2.7694 3.0349
2 1.8339 0.31877 -0.43359 -1.3499 0.7254
3 -2.2588 NaN 0.34262 NaN -0.063055
4 NaN NaN 3.5784 NaN NaN
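The same replicate-indexed unstack can be sketched in pandas with pivot (using the fixed values shown in the transcript above rather than fresh random data, so the result is reproducible):

```python
import pandas as pd

# the example table, with an explicit replicate counter per group
T = pd.DataFrame({
    "name": list("AAABBCCCCDDEEE"),
    "value": [0.53767, 1.8339, -2.2588, 0.86217, 0.31877, -1.3077,
              -0.43359, 0.34262, 3.5784, 2.7694, -1.3499, 3.0349,
              0.7254, -0.063055],
    "rep": [1, 2, 3, 1, 2, 1, 2, 3, 4, 1, 2, 1, 2, 3],
})

# one row per replicate, one column per group; missing combinations become NaN
pivot = T.pivot(index="rep", columns="name", values="value")
print(pivot)
```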