How to perform a VLOOKUP in R?

I am replicating a CSV file in R (originally created in Excel) that is compiled from various data sources. When I made this CSV in Excel, I used a VLOOKUP to populate certain columns based on another data source or spreadsheet.
How can I populate a column in R using something similar to a VLOOKUP, i.e. looking up a value in an external source and matching it against another column in the df?
For example, the formula in the Excel version is:
=VLOOKUP('[Spreadsheet1]Tab1'!A22,'[Spreadsheet2]Tab2'!$A$6:$B$500,2,FALSE)
How can I write the same formula in R code?

dplyr::left_join() should do the trick:
myData <- data.frame(
  x = c('a', 'b', 'b', 'c')
)
lookUpData <- data.frame(
  key = c('a', 'b', 'c'),
  value = c(1, 2, 3)
)
library(dplyr)
myData %>%
  left_join(lookUpData, by = c(x = 'key')) %>%
  rename(newCol = value)
  x newCol
1 a      1
2 b      2
3 b      2
4 c      3

Related

Filter dataframe into 3 buckets

I have a Scala Spark Dataset, ds, and two functions, isTypeA() and isTypeB(), which take a row of that Dataset and return whether that row should be classified as A or B respectively. Both can return true for the same row, in which case I want to classify that row as A. Finally, I want C to be the rows that are neither A nor B. I would like to save the result as 3 separate Datasets.
I can do this by using filter and calling the functions multiple times
val a = ds.filter(isTypeA(_))
val b = ds.filter(row => !isTypeA(row) && isTypeB(row))
val c = ds.filter(row => !isTypeA(row) && !isTypeB(row))
but is there a more efficient way to do it?

Transforming a column and update the DataFrame

Below, I drop column A from a DataFrame because I want to apply a transformation (here I just json.loads a JSON string) and replace the old column with the transformed one. After the transformation I join the two resulting data frames.
df = df_data.drop('A').join(
    df_data[['ID', 'A']].rdd\
        .map(lambda x: (x.ID, json.loads(x.A))
             if x.A is not None else (x.ID, None))\
        .toDF()\
        .withColumnRenamed('_1', 'ID')\
        .withColumnRenamed('_2', 'A'),
    ['ID']
)
What I dislike about this is the overhead of the extra withColumnRenamed operations.
With pandas, I would do something like this:
pdf = pd.DataFrame([json.dumps([0]*np.random.randint(5,10)) for i in range(10)], columns=['A'])
pdf.A = pdf.A.map(lambda x: json.loads(x))
pdf
but the following does not work in pyspark:
df.A = df[['A']].rdd.map(lambda x: json.loads(x.A))
So is there an easier way than what I'm doing in my first code snippet?

I do not think you need to drop the column and do the join. The following code should* be equivalent to what you posted:
cols = df_data.columns
df = df_data.rdd\
    .map(
        lambda row: tuple(
            [row[c] if c != 'A' else (json.loads(row[c]) if row[c] is not None else None)
             for c in cols]
        )
    )\
    .toDF(cols)
*I haven't actually tested this code, but I think this should work.
But to answer your general question, you can transform a column in-place using withColumn().
df = df_data.withColumn("A", my_transformation_function("A").alias("A"))
Where my_transformation_function() can be a udf or a pyspark sql function.
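As a rough illustration of the logic such a transformation function might wrap, here is a minimal sketch of the null-safe JSON parsing in plain Python, without Spark. The name parse_json_or_none and the sample rows are hypothetical; in PySpark the function would typically be wrapped with pyspark.sql.functions.udf together with an explicit return type.

```python
import json


def parse_json_or_none(value):
    """Parse a JSON string, passing None through unchanged."""
    if value is None:
        return None
    return json.loads(value)


# Hypothetical (ID, A) rows standing in for df_data
rows = [(1, "[0, 0, 0]"), (2, None), (3, '{"k": 1}')]

# Apply the transformation to column A only, keeping ID as-is
transformed = [(row_id, parse_json_or_none(a)) for row_id, a in rows]
print(transformed)
```

Keeping the None check inside the function means the same callable can be reused whether it is applied through a udf or through an RDD map.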

From what I understand, is this what you are trying to achieve?
import json
import pyspark.sql.functions as F
# Note: without an explicit returnType, F.udf returns StringType
json_convert = F.udf(lambda x: json.loads(x) if x is not None else None)
cols = df_data.columns
df = df_data.select([json_convert(F.col('A')).alias('A')] +
                    [col for col in cols if col != 'A'])

select multiple columns with dplyr having factor "No", "Yes" levels

I want to select all factor columns having two levels ("Yes", "No").
I want to use dplyr for this, but could not get it to work.
AB %>%
  select_if(.predicate = function(x) length(levels(x))==2 & unique(x) %in% c("No", "Yes"))

unique(x) %in% c('No','Yes') returns a vector the same length as unique(x), rather than a single TRUE or FALSE, so it cannot serve as a predicate. I think you're better off using setequal(levels(x), c('No','Yes')) as shown below:
library(dplyr)
# generate the dataframe with different factor levels
n<-100
no_yes <- sample(c('No','Yes'), n, replace = T)
no_yes_maybe <- sample(c('No','Yes','Maybe'), n, replace = T)
no <- sample(c('No'), n, replace = T)
no_maybe <- sample(c('No','Maybe'), n, replace = T)
AB<-data.frame(
no_yes, # only this column should get returned
no_yes_maybe,
no,
no_maybe,
stringsAsFactors = T
)%>%as.tbl
# function to return TRUE if column has only No/Yes factors.
desired_levels <- c('No','Yes')
predicate_function <- function(x) setequal(levels(x),desired_levels)
# use dplyr to select columns with desired factor levels
AB%>%select_if(predicate_function)

add columns in dataframes dynamically with column names as elements in List

I have a List of N elements like the one below,
val check = List("a", "b", "c", "d")
where N can be any number of elements.
I have a DataFrame with only one column, called "value". Based on the contents of value, I need to create N columns whose names are the elements of the list and whose contents are substring(X,Y).
I have tried everything I could think of, such as withColumn and selectExpr, but nothing works.
Please treat X and Y in substring(X,Y) as numbers derived from some metadata.
Below are the different versions I tried, none of which worked:
val df = sqlContext.read.text("xxxxx")
val coder: (String => String) = (arg: String) => {
  val param = "NULL"
  if (arg.length() > Y)
    arg.substring(X, Y)
  else
    param // assumed: the else branch is missing in the original post
}
val sqlfunc = udf(coder)
val check = List ("a","b","c","d")
for (name <- check){val testDF2 = df.withColumn(name, sqlfunc(df("value")))}
testDF2 contains only the last column, d; the other columns a, b and c are not added to the table.
var z:Array[String] = new Array[String](check.size)
var i=0
for ( x <- check ) {
if ( (i+1) == check.size) {
z(i) = s""""substring(a.value,X,Y) as $x""""
i = i+1}
else{
z(i) = s""""substring(a.value,X,Y) as $x","""
i = i+1}}
val zz = z.mkString(" ")
df.alias("a").selectExpr(s"$zz").show()
This throws an error.
Please help me add columns to a DataFrame dynamically, using the elements of a List as column names.
I am expecting a DataFrame like the one below:
-----------------------------
Value| a | b | c | d | .... N
-----------------------------
|xxx|xxx|xxx|xxx|xxx|xxxxxx-
|xxx|xxx|xxx|xxx|xxx|xxxxxx-
|xxx|xxx|xxx|xxx|xxx|xxxxxx-
-----------------------------

You can dynamically add columns from your list with foldLeft, for instance as in this answer by user6910411 to a similar question (see their full answer for more possibilities):
val newDF = check.foldLeft(<yourdf>)((df, name) => df.withColumn(name, <yourUDF>($"value")))
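The for loop in the question keeps only the last column because every iteration starts again from the original df. A fold instead threads the accumulated result through each step. Here is a minimal sketch of that idea in plain Python using functools.reduce, with a dict standing in for a single row; add_column, the values of X and Y, and the sample row are all hypothetical.

```python
from functools import reduce

check = ["a", "b", "c", "d"]
X, Y = 0, 3  # hypothetical substring bounds


def add_column(row, name):
    """Return a new row with one extra column derived from row['value']."""
    value = row["value"]
    new_row = dict(row)  # copy, so the input row is left untouched
    new_row[name] = value[X:Y] if len(value) > Y else "NULL"
    return new_row


row = {"value": "abcdefgh"}

# Each step receives the row produced by the previous step,
# so all four columns accumulate instead of only the last one.
result = reduce(add_column, check, row)
print(result)
```

The accumulator plays the role that df plays in the foldLeft call: it is replaced by the output of each withColumn-like step rather than being reset on every iteration.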

Replace missing values with mean - Spark Dataframe

I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new to Spark, so I have been struggling to implement this logic. This is what I have managed to do so far:
a) To do this for a single column (let's say Col A), this line of code seems to work:
df.withColumn("new_Col", when($"ColA".isNull, df.select(mean("ColA"))
    .first()(0).asInstanceOf[Double])
  .otherwise($"ColA"))
b) However, I have not been able to figure out how to do this for all the columns in my dataframe. I was trying out the map function, but I believe it loops through each row of a dataframe.
c) There is a similar question on SO - here. And while I liked the solution (using Aggregated tables and coalesce), I was very keen to know if there is a way to do this by looping through each column (I come from R, so looping through each column using a higher order functional like lapply seems more natural to me).
Thanks!

Spark >= 2.2
You can use org.apache.spark.ml.feature.Imputer (which supports both mean and median strategy).
Scala:
import org.apache.spark.ml.feature.Imputer
val imputer = new Imputer()
  .setInputCols(df.columns)
  .setOutputCols(df.columns.map(c => s"${c}_imputed"))
  .setStrategy("mean")
imputer.fit(df).transform(df)
Python:
from pyspark.ml.feature import Imputer
imputer = Imputer(
    inputCols=df.columns,
    outputCols=["{}_imputed".format(c) for c in df.columns]
)
imputer.fit(df).transform(df)
Spark < 2.2
Here you are:
import org.apache.spark.sql.functions.mean
df.na.fill(df.columns.zip(
df.select(df.columns.map(mean(_)): _*).first.toSeq
).toMap)
where
df.columns.map(mean(_)): Array[Column]
computes an average for each column,
df.select(_: *).first.toSeq: Seq[Any]
collects aggregated values and converts row to Seq[Any] (I know it is suboptimal but this is the API we have to work with),
df.columns.zip(_).toMap: Map[String,Any]
creates a map, aMap: Map[String, Any], from each column name to its average, and finally:
df.na.fill(_): DataFrame
fills the missing values using:
fill: Map[String, Any] => DataFrame
from DataFrameNaFunctions.
To ignore NaN entries you can replace:
df.select(df.columns.map(mean(_)): _*).first.toSeq
with:
import org.apache.spark.sql.functions.{col, isnan, when}
df.select(df.columns.map(
c => mean(when(!isnan(col(c)), col(c)))
): _*).first.toSeq
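To make the data flow of the steps above concrete, here is a minimal sketch in plain Python, without Spark: compute one mean per column while skipping missing entries, zip the column names with the means into a dictionary, then fill. The sample columns and rows and the helper names is_missing and column_mean are hypothetical, and treating both None and NaN as missing when filling is an assumption of this sketch.

```python
import math

columns = ["a", "b"]
rows = [(1.0, None), (None, 4.0), (3.0, float("nan"))]


def is_missing(value):
    """Treat both None and NaN as missing."""
    return value is None or math.isnan(value)


def column_mean(values):
    """Mean of the non-missing values, or None if there are none."""
    valid = [v for v in values if not is_missing(v)]
    return sum(valid) / len(valid) if valid else None


# One mean per column, analogous to df.select(columns.map(mean)).first
means = [column_mean(col) for col in zip(*rows)]

# Column name -> mean, analogous to df.columns.zip(...).toMap
fill_values = dict(zip(columns, means))

# Replace each missing cell with the mean of its column
filled = [
    tuple(fill_values[c] if is_missing(v) else v for c, v in zip(columns, row))
    for row in rows
]
print(fill_values)
print(filled)
```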

For imputing the median (instead of the mean) in PySpark < 2.2:
# Filter numeric columns
num_cols = [col_type[0] for col_type in filter(lambda dtype: dtype[1] in {"bigint", "double", "int"}, df.dtypes)]
# Compute a dict with <col_name, median_value>
median_dict = dict()
for c in num_cols:
    median_dict[c] = df.stat.approxQuantile(c, [0.5], 0.001)[0]
Then, apply na.fill:
df_imputed = df.na.fill(median_dict)

For PySpark, this is the code I used:
mean_dict = {col: 'mean' for col in df.columns}
col_avgs = df.agg(mean_dict).collect()[0].asDict()
col_avgs = {k[4:-1]: v for k, v in col_avgs.items()}
df.fillna(col_avgs).show()
The four steps are:
1. Create the dictionary mean_dict, mapping each column name to the aggregate operation (mean).
2. Calculate the mean for each column and save the result as the dictionary col_avgs.
3. The keys in col_avgs start with avg( and end with ), e.g. avg(col1). Strip that wrapper so that only the column name remains.
4. Fill the missing values in the dataframe with the averages in col_avgs.
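The third step relies on the slice k[4:-1], which drops the four characters of "avg(" and the trailing ")". Here is a minimal sketch of that renaming in plain Python; the sample dictionary is a hypothetical stand-in for the aggregated row, and strip_avg_wrapper is a hypothetical helper that leaves keys without the wrapper untouched.

```python
# Hypothetical stand-in for df.agg(mean_dict).collect()[0].asDict()
col_avgs = {"avg(col1)": 2.5, "avg(col2)": 10.0}


def strip_avg_wrapper(key):
    """Turn 'avg(col1)' into 'col1'; leave any other key unchanged."""
    if key.startswith("avg(") and key.endswith(")"):
        return key[4:-1]
    return key


cleaned = {strip_avg_wrapper(k): v for k, v in col_avgs.items()}
print(cleaned)
```

Checking for the prefix before slicing is a defensive choice: a bare k[4:-1] would silently mangle any key that does not follow the avg(...) pattern.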
