Splitting a string with values to desired dataframe in scala and spark - scala

I have a string with values say
"column1=1;column2=2;column3=3;column4=4" like that for say 10 columns
I want to find for example the value corresponding to say column2,column4,column6
How do I do that using Scala and Spark?
What I did till now is:
var df = Seq(("column1=a1;column2=a2;column3=a3;column4=a4;column5=a5;column6=a6;column7=a7;column8=a8;column9=a9;column10=a10")).toDF("custom")
val df2 = df.withColumn("temp_new", split(col("custom"), "\\;")).select(
(0 until 10).map(i => col("temp_new").getItem(i).as(s"col$i")): _*
)
df2.show()
and I get an output as:
But what I really want is :
How do I do this? Thanks!

Just split it once again?
.map(i => split(col("temp_new").getItem(i), "=").getItem(1).as(s"col$i"))

Related

How can I delete columns of RDD using an if statement in spark (using scala)

Suppose I have a text file where each entry has a couple hundred data points. I want to get rid of any column which has a question mark- using the drop function and picking it out seems tedious; is there a faster way?
Something like dataframe.map( x => ifcontainsquestionmarkdropcolumn(x)) ?
Number of question marks in each column can be calculated with "sum" function, and columns where number is non-zero can be dropped:
val df = Seq(("nomark", "question?mark"))
.toDF("expected", "dropped")
val questionCountColumns = df
.columns
.map(c => sum(when(col(c).contains("?"), 1).otherwise(0)).alias(c))
val questionCountRow = df.select(questionCountColumns: _*).first()
val columnsToDrop = df
.columns
.filter(c => questionCountRow.getAs[Long](c) > 0)
val result = df.drop(columnsToDrop: _*)
Result is:
+--------+
|expected|
+--------+
|nomark |
+--------+

spark scala reducekey dataframe operation

I'm trying to do a count in scala with dataframe. My data has 3 columns and I've already loaded the data and split by tab. So I want to do something like this:
val file = file.map(line=>line.split("\t"))
val x = file1.map(line=>(line(0), line(2).toInt)).reduceByKey(_+_,1)
I want to put the data in dataframe, and having some trouble on the syntax
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
spark needs to know the schema of the df
there are many ways to specify the schema, here is one option:
val df = file
.map(line=>line.split("\t"))
.map(l => (l(0), l(1).toInt)) //at this point spark knows the number of columns and their types
.toDF("a", "b") //give the columns names for ease of use
df
.groupby('a)
.count()

Spark Dataframe select based on column index

How do I select all the columns of a dataframe that has certain indexes in Scala?
For example if a dataframe has 100 columns and i want to extract only columns (10,12,13,14,15), how to do the same?
Below selects all columns from dataframe df which has the column name mentioned in the Array colNames:
df = df.select(colNames.head,colNames.tail: _*)
If there is similar, colNos array which has
colNos = Array(10,20,25,45)
How do I transform the above df.select to fetch only those columns at the specific indexes.
You can map over columns:
import org.apache.spark.sql.functions.col
df.select(colNos map df.columns map col: _*)
or:
df.select(colNos map (df.columns andThen col): _*)
or:
df.select(colNos map (col _ compose df.columns): _*)
All the methods shown above are equivalent and don't impose performance penalty. Following mapping:
colNos map df.columns
is just a local Array access (constant time access for each index) and choosing between String or Column based variant of select doesn't affect the execution plan:
val df = Seq((1, 2, 3 ,4, 5, 6)).toDF
val colNos = Seq(0, 3, 5)
df.select(colNos map df.columns map col: _*).explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
df.select("_1", "_4", "_6").explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
#user6910411's answer above works like a charm and the number of tasks/logical plan is similar to my approach below. BUT my approach is a bit faster.
So,
I would suggest you to go with the column names rather than column numbers. Column names are much safer and much ligher than using numbers. You can use the following solution :
val colNames = Seq("col1", "col2" ...... "col99", "col100")
val selectColNames = Seq("col1", "col3", .... selected column names ... )
val selectCols = selectColNames.map(name => df.col(name))
df = df.select(selectCols:_*)
If you are hesitant to write all the 100 column names then there is a shortcut method too
val colNames = df.schema.fieldNames
Example: Grab first 14 columns of Spark Dataframe by Index using Scala.
import org.apache.spark.sql.functions.col
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols.map(name=>col(name)):_*)
You cannot simply do this (as I tried and failed):
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols)
The reason is that you have to convert your datatype of Array[String] to Array[org.apache.spark.sql.Column] in order for the slicing to work.
OR Wrap it in a function using Currying (high five to my colleague for this):
// Subsets Dataframe to using beg_val & end_val index.
def subset_frame(beg_val:Int=0, end_val:Int)(df: DataFrame): DataFrame = {
val sliceCols = df.columns.slice(beg_val, end_val)
return df.select(sliceCols.map(name => col(name)):_*)
}
// Get first 25 columns as subsetted dataframe
val subset_df:DataFrame = df_.transform(subset_frame(0, 25))

Scala Spark DataFrame : dataFrame.select multiple columns given a Sequence of column names

val columnName=Seq("col1","col2",....."coln");
Is there a way to do dataframe.select operation to get dataframe containing only the column names specified .
I know I can do dataframe.select("col1","col2"...)
but the columnNameis generated at runtime.
I could do dataframe.select() repeatedly for each column name in a loop.Will it have any performance overheads?. Is there any other simpler way to accomplish this?
val columnNames = Seq("col1","col2",....."coln")
// using the string column names:
val result = dataframe.select(columnNames.head, columnNames.tail: _*)
// or, equivalently, using Column objects:
val result = dataframe.select(columnNames.map(c => col(c)): _*)
Since dataFrame.select() expects a sequence of columns and we have a sequence of strings, we need to convert our sequence to a List of cols and convert that list to the sequence. columnName.map(name => col(name)): _* gives a sequence of columns from a sequence of strings, and this can be passed as a parameter to select():
val columnName = Seq("col1", "col2")
val DFFiltered = DF.select(columnName.map(name => col(name)): _*)
Alternatively, you can also write like this
val columnName = Seq("col1", "col2")
val DFFiltered = DF.select(columnName.map(DF(_): _*)

Get a range of columns of Spark RDD

Now I have 300+ columns in my RDD, but I found there is a need to dynamically select a range of columns and put them into LabledPoints data type. As a newbie to Spark, I am wondering if there is any index way to select a range of columns in RDD. Something like temp_data = data[, 101:211] in R. Is there something like val temp_data = data.filter(_.column_index in range(101:211)...?
Any thought is welcomed and appreciated.
If it is a DataFrame, then something like this should work:
val df = rdd.toDF
df.select(df.columns.slice(101,211) : _*)
Assuming you have an RDD of Array or any other scala collection (e.g., List). You can do something like this:
val data: RDD[Array[Int]] = sc.parallelize(Array(Array(1,2,3), Array(4,5,6)))
val sliced: RDD[Array[Int]] = data.map(_.slice(0,2))
sliced.collect()
> Array[Array[Int]] = Array(Array(1, 2), Array(4, 5))
Kind of old thread, but I recently had to do something similar and search around. I needed to select all but the last column where I had 200+ columns.
Spark 1.4.1
Scala 2.10.4
val df = hiveContext.sql("SELECT * FROM foobar")
val cols = df.columns.slice(0, df.columns.length - 1)
val new_df = df.select(cols.head, cols.tail:_*)