I am replicating a CSV file in R (originally created in excel), that is compiled using various data sources. When i made this CSV in excel i used a vlookup to populate certain columns based on another data source/other spreadsheet.
How can i populate a column in R using something similar to a VLOOKUP? i.e. looking for a variable in an external source and matching it with another column in the df?
For example, the formula in the excel version is =
=VLOOKUP('[Spreadsheet1]Tab1'!A22,'[Spreadsheet2]Tab2'!$A$6:$B$500,2,FALSE)
How can i make this same formula in R code?
dplyr::left_join() should do the trick:
myData <- data.frame(
x = c('a', 'b', 'b', 'c')
)
lookUpData <- data.frame(
key = c('a', 'b', 'c'),
value = c(1, 2, 3)
)
library(dplyr)
myData %>%
left_join(lookUpData, by = c(x = 'key')) %>%
rename(newCol = value)
x newCol
1 a 1
2 b 2
3 b 2
4 c 3
Related
I have a Scala Spark Dataset, ds and two functions isTypeA() and isTypeB(), which take rows in that Dataset and return whether or not that row should be classified as A or B respectively. They can both return true for the same row, in which case I want to classify that row as A. Finally, I want C to be the rows that are neither A or B. I would like to save this as 3 separate Datasets.
I can do this by using filter and calling the functions multiple times
val a = ds.filter(isTypeA(_))
val b = ds.filter(row => !isTypeA(row) && isTypeB(row))
val c = ds.filter(row => !isTypeA(row) && !isTypeB(row))
but is there a more efficient way to do it?
So, what I'm doing below is I drop a column A from a DataFrame because I want to apply a transformation (here I just json.loads a JSON string) and replace the old column with the transformed one. After the transformation I just join the two resulting data frames.
df = df_data.drop('A').join(
df_data[['ID', 'A']].rdd\
.map(lambda x: (x.ID, json.loads(x.A))
if x.A is not None else (x.ID, None))\
.toDF()\
.withColumnRenamed('_1', 'ID')\
.withColumnRenamed('_2', 'A'),
['ID']
)
The thing I dislike about this is of course the overhead I'm faced because I had to do the withColumnRenamed operations.
With pandas All I'd do something like this:
pdf = pd.DataFrame([json.dumps([0]*np.random.randint(5,10)) for i in range(10)], columns=['A'])
pdf.A = pdf.A.map(lambda x: json.loads(x))
pdf
but the following does not work in pyspark:
df.A = df[['A']].rdd.map(lambda x: json.loads(x.A))
So is there an easier way than what I'm doing in my first code snipped?
I do not think you need to drop the column and do the join. The following code should* be equivalent to what you posted:
cols = df_data.columns
df = df_data.rdd\
.map(
lambda row: tuple(
[row[c] if c != 'A' else (json.loads(row[c]) if row[c] is not None else None)
for c in cols]
)
)\
.toDF(cols)
*I haven't actually tested this code, but I think this should work.
But to answer your general question, you can transform a column in-place using withColumn().
df = df_data.withColumn("A", my_transformation_function("A").alias("A"))
Where my_transformation_function() can be a udf or a pyspark sql function.
From what i could understand, is it something like this you are trying to achieve?
import pyspark.sql.functions as F
import json
json_convert = F.udf(lambda x: json.loads(x) if x is not None else None)
cols = df_data.columns
df = df_data.select([json_convert(F.col('A')).alias('A')] + \
[col for col in cols if col != 'A'])
I want to select all factor columns having two levels ("Yes", "No").
I want to use dpylr for this but, could not fix the problem.
AB %>%
select_if(.predicate = function(x) length(levels(x))==2 & unique(x) %in% c("No", "Yes"))
unique(x) %in% c('No','Yes') returns a vector the same length as unique(x), rather than a scalar. I think your better off using setequal(x,c('No','Yes')) as shown below:
library(dplyr)
# generate the dataframe with different factor levels
n<-100
no_yes <- sample(c('No','Yes'), n, replace = T)
no_yes_maybe <- sample(c('No','Yes','Maybe'), n, replace = T)
no <- sample(c('No'), n, replace = T)
no_maybe <- sample(c('No','Maybe'), n, replace = T)
AB<-data.frame(
no_yes, # only this column should get returned
no_yes_maybe,
no,
no_maybe,
stringsAsFactors = T
)%>%as.tbl
# function to return TRUE if column has only No/Yes factors.
desired_levels <- c('No','Yes')
predicate_function <- function(x) setequal(levels(x),desired_levels)
# use dplyr to select columns with desired factor levels
AB%>%select_if(predicate_function)
I have List[N] like below
val check = List ("a","b","c","d")
where N can be any number of elements.
I have a dataframe with only column called "value". Based on the contents of value i need to create N columns with column names as elements in the list and column contents as substring(x,y)
I have tried all possible ways, like withColumn, selectExpr, nothing works.
Please consider substring(X,Y) where X and Y as some numbers based on some metadata
Below are my different codes which I tried, but none worked,
val df = sqlContext.read.text("xxxxx")
val coder: (String => String) = (arg: String) => {
val param = "NULL"
if (arg.length() > Y )
arg.substring(X,Y)
else
val sqlfunc = udf(coder)
val check = List ("a","b","c","d")
for (name <- check){val testDF2 = df.withColumn(name, sqlfunc(df("value")))}
testDF2 has only last column d and other columns such as a,b,c are not added in table
var z:Array[String] = new Array[String](check.size)
var i=0
for ( x <- check ) {
if ( (i+1) == check.size) {
z(i) = s""""substring(a.value,X,Y) as $x""""
i = i+1}
else{
z(i) = s""""substring(a.value,X,Y) as $x","""
i = i+1}}
val zz = z.mkString(" ")
df.alias("a").selectExpr(s"$zz").show()
This throws error
Please help how to add columns in DF dynamically with column names as elements in List
I am expecting an Df like below
-----------------------------
Value| a | b | c | d | .... N
-----------------------------
|xxx|xxx|xxx|xxx|xxx|xxxxxx-
|xxx|xxx|xxx|xxx|xxx|xxxxxx-
|xxx|xxx|xxx|xxx|xxx|xxxxxx-
-----------------------------
you can dynamically add columns from your list using for instance this answer by user6910411 to a similar question (see her/his full answer for more possibilities):
val newDF = check.foldLeft(<yourdf>)((df, name) => df.withColumn(name,<yourUDF>$"value"))
I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new to Spark, so I have been struggling to implement this logic. This is what I have managed to do so far:
a) To do this for a single column (let's say Col A), this line of code seems to work:
df.withColumn("new_Col", when($"ColA".isNull, df.select(mean("ColA"))
.first()(0).asInstanceOf[Double])
.otherwise($"ColA"))
b) However, I have not been able to figure out, how to do this for all the columns in my dataframe. I was trying out the Map function, but I believe it loops through each row of a dataframe
c) There is a similar question on SO - here. And while I liked the solution (using Aggregated tables and coalesce), I was very keen to know if there is a way to do this by looping through each column (I come from R, so looping through each column using a higher order functional like lapply seems more natural to me).
Thanks!
Spark >= 2.2
You can use org.apache.spark.ml.feature.Imputer (which supports both mean and median strategy).
Scala :
import org.apache.spark.ml.feature.Imputer
val imputer = new Imputer()
.setInputCols(df.columns)
.setOutputCols(df.columns.map(c => s"${c}_imputed"))
.setStrategy("mean")
imputer.fit(df).transform(df)
Python:
from pyspark.ml.feature import Imputer
imputer = Imputer(
inputCols=df.columns,
outputCols=["{}_imputed".format(c) for c in df.columns]
)
imputer.fit(df).transform(df)
Spark < 2.2
Here you are:
import org.apache.spark.sql.functions.mean
df.na.fill(df.columns.zip(
df.select(df.columns.map(mean(_)): _*).first.toSeq
).toMap)
where
df.columns.map(mean(_)): Array[Column]
computes an average for each column,
df.select(_: *).first.toSeq: Seq[Any]
collects aggregated values and converts row to Seq[Any] (I know it is suboptimal but this is the API we have to work with),
df.columns.zip(_).toMap: Map[String,Any]
creates aMap: Map[String, Any] which maps from the column name to its average, and finally:
df.na.fill(_): DataFrame
fills the missing values using:
fill: Map[String, Any] => DataFrame
from DataFrameNaFunctions.
To ingore NaN entries you can replace:
df.select(df.columns.map(mean(_)): _*).first.toSeq
with:
import org.apache.spark.sql.functions.{col, isnan, when}
df.select(df.columns.map(
c => mean(when(!isnan(col(c)), col(c)))
): _*).first.toSeq
For imputing the median (instead of the mean) in PySpark < 2.2
## filter numeric cols
num_cols = [col_type[0] for col_type in filter(lambda dtype: dtype[1] in {"bigint", "double", "int"}, df.dtypes)]
### Compute a dict with <col_name, median_value>
median_dict = dict()
for c in num_cols:
median_dict[c] = df.stat.approxQuantile(c, [0.5], 0.001)[0]
Then, apply na.fill
df_imputed = df.na.fill(median_dict)
For PySpark, this is the code I used:
mean_dict = { col: 'mean' for col in df.columns }
col_avgs = df.agg( mean_dict ).collect()[0].asDict()
col_avgs = { k[4:-1]: v for k,v in col_avgs.iteritems() }
df.fillna( col_avgs ).show()
The four steps are:
Create the dictionary mean_dict mapping column names to the aggregate operation (mean)
Calculate the mean for each column, and save it as the dictionary col_avgs
The column names in col_avgs start with avg( and end with ), e.g. avg(col1). Strip the parentheses out.
Fill the columns of the dataframe with the averages using col_avgs