I have code that parses a log file with a map transformation and then converts the resulting RDD to a DataFrame.
val logData = sc.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/syslog.txt")
val logDataDF = logData.map(rec => (rec.split(" ")(0), rec.split(" ")(2), rec.split(" ")(5))).toDF("month", "date", "process")
I would like to know whether I can use mapPartitions in this case instead of map.
I don't know what your use case is, but you can definitely use mapPartitions instead of map. The code below returns the same logDataDF.
val logDataDF = logData.mapPartitions(x => {
  // Collect the parsed records of this partition
  val lst = scala.collection.mutable.ListBuffer[(String, String, String)]()
  while (x.hasNext) {
    val rec = x.next().split(" ")
    lst += ((rec(0), rec(2), rec(5)))
  }
  // mapPartitions must return an iterator
  lst.iterator
}).toDF("month", "date", "process")
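Note that buffering every record in a ListBuffer keeps the whole partition in memory. A minimal sketch of a lazier variant, assuming the same space-separated layout with month, date and process in fields 0, 2 and 5, is to transform the partition's iterator directly. The names LogParsing and parsePartition are made up for illustration.

```scala
object LogParsing {
  // Split each record once and keep fields 0, 2 and 5 (month, date, process).
  // Iterator.map is lazy, so records are parsed one at a time as they are consumed.
  def parsePartition(records: Iterator[String]): Iterator[(String, String, String)] =
    records.map { rec =>
      val fields = rec.split(" ")
      (fields(0), fields(2), fields(5))
    }
}
```

With Spark this would be used as logData.mapPartitions(LogParsing.parsePartition).toDF("month", "date", "process"). For purely per-record work like this, map is usually just as good; mapPartitions tends to pay off when there is a per-partition setup cost, such as opening a connection.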
I'm basically trying to do something like this, but Spark doesn't recognize it.
val colsToLower: Array[String] = Array("col0", "col1", "col2")
val selectQry: String = colsToLower.map((x: String) => s"""lower(col(\"${x}\")).as(\"${x}\"), """).mkString.dropRight(2)
Is there a way to do something like this in spark/scala?
If you need to lowercase the name of your columns there is a simple way of doing it. Here is one example:
df.columns.foreach(c => {
  val newColumnName = c.toLowerCase
  // df must be declared as a var for this reassignment to compile
  df = df.withColumnRenamed(c, newColumnName)
})
This lowercases the column names and updates them in the Spark DataFrame.
I believe I found a way to build it:
def lowerTextColumns(cols: Array[String])(df: DataFrame): DataFrame = {
  val remainingCols: String = (df.columns diff cols).mkString(", ")
  val lowerCols: String = cols.map((x: String) => s"""lower(${x}) as ${x}, """).mkString.dropRight(2)
  val selectQry: String =
    if (remainingCols.nonEmpty) lowerCols + ", " + remainingCols
    else lowerCols
  // Assumed final step (the original snippet is cut off here): run the select
  df.selectExpr(selectQry.split(", "): _*)
}
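The same select list can also be built as a sequence of expressions rather than one concatenated string, which avoids the trailing-comma bookkeeping. A minimal sketch, assuming the result is handed to df.selectExpr; the names SelectExprs and lowerExprs are hypothetical.

```scala
object SelectExprs {
  // Build one SQL expression per column: lower(...) for the chosen columns,
  // the bare column name for the rest. The original column order is preserved.
  def lowerExprs(allCols: Seq[String], colsToLower: Set[String]): Seq[String] =
    allCols.map { c =>
      if (colsToLower.contains(c)) s"lower($c) as $c" else c
    }
}
```

In Spark this could be applied as df.selectExpr(SelectExprs.lowerExprs(df.columns.toSeq, Set("col0", "col1")): _*). Column names containing spaces or other special characters would need backtick quoting, which this sketch does not handle.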
I have a list of HBase row keys in the form of an Array[Row] and want to create a Spark DataFrame out of the rows that are fetched from HBase using these row keys.
I am thinking of something like:
def getDataFrameFromList(spark: SparkSession, rList : Array[Row]): DataFrame = {
  val conf = HBaseConfiguration.create()
  val mlRows : List[RDD[String]] = new ArrayList[RDD[String]]
  conf.set("hbase.zookeeper.quorum", "dev.server")
  conf.set("hbase.zookeeper.property.clientPort", "2181")
  conf.set(TableInputFormat.INPUT_TABLE, "hbase_tbl1")
  rList.foreach( r => {
    var rStr = r.toString()
    conf.set(TableInputFormat.SCAN_ROW_START, rStr)
    conf.set(TableInputFormat.SCAN_ROW_STOP, rStr + "_")
    // read one row
    val recsRdd = readHBaseRdd(spark, conf)
    // This works, but it is only one row
    //val resourcesDf = spark.read.json(recsRdd)
  })
  var resourcesDf = <Code here to convert List[RDD[String]] to DataFrame>
}
I can call recsRdd.collect() in the loop, convert the result to a string and append that JSON to an ArrayList[String], but I am not sure it is efficient to call collect() in a loop like this.
readHBaseRdd uses newAPIHadoopRDD to get data from HBase:
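def readHBaseRdd(spark: SparkSession, conf: Configuration) = {
  val hBaseRDD = spark.sparkContext.newAPIHadoopRDD(conf, classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result])
  hBaseRDD.map {
    case (_: ImmutableBytesWritable, value: Result) =>
      // ... (conversion of each Result to a JSON string is not shown in the original)
  }
}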
Use spark.sparkContext.union(Seq(mainRdd, recsRdd)) instead of a list of RDDs (mlRows).
And why read only one row from HBase at a time? Try to scan the largest key range possible in a single read.
Avoid calling collect(); use it only for debugging and tests.
I want to turn the following RDD into a single RDD of the form (id, (all names with that id)).
>val test = rddByKey.map{case(k,v)=> (k,v.collect())}
test: Array[(String, Array[String])] =
(45000,Array(Amit, Pavan, Ratan)),
(10000,Array(Kumar, Venkat, Sheela)),
(50000,Array(Tejas, Dinesh, Lokesh, Bhupesh))
I want to print it like this:
(45000,(Amit, Pavan, Ratan))
(10000,(Kumar, Venkat, Sheela))
This is what I have tried:
val data = sc.textFile("/user/cloudera/data.csv")
val rdd = data.map(r=>(r.split(",")(0),r.split(",")(1)))
val groupByKey = rdd.groupByKey().collect()
val rddByKey = groupByKey.map{case(k,v) => k->sc.makeRDD(v.toSeq)}
val test = rddByKey.map{case(k,v)=> (k,v.collect())}
You don't have to go through the complexity of using collect. You can simply do:
val data = sc.textFile("/user/cloudera/data.csv")
val rdd = data.map(r => {
  // Split each line once and keep (id, name)
  val x = r.split(",")
  (x(0), x(1))
})
val groupByKey = rdd.groupByKey().map{case (x, y) => (x :: y.toList)}
groupByKey is
List(45000, Amit, Pavan, Ratan)
List(10000, Kumar, Venkat, Sheela)
List(50000, Tejas, Dinesh, Lokesh, Bhupesh)
I hope the answer is helpful
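If the goal is to keep the (id, names) pair shape from the question rather than a flat list, the grouping step can transform only the values. The sketch below shows the idea on plain Scala collections, assuming each CSV line has the form id,name as implied by the split above; GroupNames and groupNames are illustrative names. On the RDD, rdd.groupByKey().mapValues(_.toList) should give the analogous result.

```scala
object GroupNames {
  // Parse "id,name" lines and collect all names under their id.
  // Names keep the order in which they appear in the input.
  def groupNames(lines: Seq[String]): Map[String, List[String]] =
    lines
      .map { line =>
        val fields = line.split(",")
        (fields(0), fields(1))
      }
      .groupBy(_._1)
      .map { case (id, pairs) => (id, pairs.map(_._2).toList) }
}
```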
I have two RDD's of the form:
data_wo_header: RDD[String], data_test_wo_header: RDD[String]
scala> data_wo_header.first
res2: String = 1,2,3.5,1112486027
scala> data_test_wo_header.first
res2: String = 1,2
RDD2 is smaller than RDD1. I am trying to filter RDD1 by removing the elements that match an entry in RDD2 (by regex).
The 1,2 in the above example represents UserID,MovID. Since that pair is present in the test set, I want a new RDD in which it has been removed from RDD1.
I have asked a similar question, but the answer required an unnecessary split of the RDD.
I am trying to do something of this sort but it's not working:
def create_training(data_wo_header: RDD[String], data_test_wo_header: RDD[String]): List[String] = {
  var ratings_train = new ListBuffer[String]()
  data_wo_header.foreach(x => {
    data_test_wo_header.foreach(y => {
      if (x.indexOf(y) == 0) {
        ratings_train += x
      }
    })
  })
  val ratings_train_list = ratings_train.toList
  return ratings_train_list
}
How should I do a regex match and filter based on it?
You can use a broadcast variable to share the contents of RDD2 and then filter RDD1 against the broadcast value. I replicated your code and this works for me:
def create_training(data_wo_header: RDD[String], data_test_wo_header: RDD[String]): List[String] = {
  val rdd2array = sparkSession.sparkContext.broadcast(data_test_wo_header.collect())
  val training_set = data_wo_header.filter {
    // String.matches needs a full-string match, so test for a "UserID,MovID," prefix instead
    case (x) => !rdd2array.value.exists(y => x.startsWith(y + ","))
  }
  training_set.collect().toList
}
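Also, with Scala and Spark I recommend avoiding foreach where possible and using a more functional style with the map, flatMap and filter functions. In the code from the question, the foreach closure runs on the executors, so additions to the driver-side ListBuffer are lost, and an RDD cannot be used inside another RDD's closure.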
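Since the test records are UserID,MovID pairs, an exact key lookup is an alternative to regex or prefix matching, and it avoids 1,2 accidentally matching 1,20. A minimal sketch on plain Scala collections follows; TrainingFilter, keyOf and filterTraining are hypothetical names. With Spark, the key set would be the broadcast value and the same predicate would go inside data_wo_header.filter.

```scala
object TrainingFilter {
  // Key of a record: its first two comma-separated fields (UserID,MovID).
  def keyOf(line: String): String = line.split(",").take(2).mkString(",")

  // Keep only the ratings whose (UserID,MovID) key does not occur in the test set.
  def filterTraining(ratings: Seq[String], test: Seq[String]): Seq[String] = {
    val testKeys = test.map(keyOf).toSet
    ratings.filterNot(r => testKeys.contains(keyOf(r)))
  }
}
```

A Set lookup is constant time per record, whereas scanning the whole test array for every rating grows with the size of the test set.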
I'm trying to transform a dataframe via a function that takes an array as a parameter. My code looks something like this:
def getCategory(categories:Array[String], input:String): String = {
  ??? // body not shown in the original
}
val myArray = Array("a", "b", "c")
val myCategories = udf(getCategory _ )
val df = sqlContext.parquetFile("myfile.parquet")
val df1 = df.withColumn("newCategory", myCategories(lit(myArray), col("myInput")))
However, lit doesn't like arrays and this script errors. I tried defining a new partially applied function and then the udf after that:
val newFunc = getCategory(myArray, _:String)
val myCategories = udf(newFunc)
val df1 = df.withColumn("newCategory", myCategories(col("myInput")))
This doesn't work either: I get a NullPointerException, and it appears myArray is not being recognized. Any ideas on how to pass an array as a parameter to a function used with a DataFrame?
On a separate note, any explanation as to why doing something simple like using a function on a dataframe is so complicated (define function, redefine it as UDF, etc, etc)?
Most likely not the prettiest solution but you can try something like this:
def getCategory(categories: Array[String]) = {
  // The returned UDF closes over the categories array
  udf((input:String) => categories(input.toInt))
}
df.withColumn("newCategory", getCategory(myArray)(col("myInput")))
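The lookup above throws if the input is not a number or falls outside the array. A minimal sketch of a more defensive closure, in plain Scala so it can be tried without Spark; CategoryLookup, lookup and the default parameter are assumptions added for illustration, not part of the original answer.

```scala
import scala.util.Try

object CategoryLookup {
  // Returns a function that closes over `categories`; non-numeric or
  // out-of-range input falls back to `default` instead of throwing.
  def lookup(categories: Array[String], default: String): String => String =
    input =>
      Try(input.toInt).toOption
        .filter(i => i >= 0 && i < categories.length)
        .map(i => categories(i))
        .getOrElse(default)
}
```

Wrapping it as udf(CategoryLookup.lookup(myArray, "unknown")) should then work the same way as getCategory(myArray) above.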
You could also try an array of literals:
val getCategory = udf(
  // Spark passes array columns to Scala UDFs as Seq, not Array
  (input:String, categories: Seq[String]) => categories(input.toInt))
df.withColumn(
  "newCategory", getCategory($"myInput", array(myArray.map(lit(_)): _*)))
On a side note using Map instead of Array is probably a better idea:
def mapCategory(categories: Map[String, String], default: String) = {
  udf((input:String) => categories.getOrElse(input, default))
}
val myMap = Map[String, String]("1" -> "a", "2" -> "b", "3" -> "c")
df.withColumn("newCategory", mapCategory(myMap, "foo")(col("myInput")))
Since Spark 1.5.0 you can also use an array function:
import org.apache.spark.sql.functions.array
val colArray = array(myArray map(lit _): _*)
myCategories(lit(colArray), col("myInput"))
See also Spark UDF with varargs