How to drop multiple columns in Spark with Scala using a regex?

Say I have some column names which end with '_undo', and I need to remove these columns. Instead of dropping them one by one or storing them in a list before dropping, can I drop them all in one go?
df.drop(//drop at one go for those columns ending with _undo)

val customDF = df.select(df.columns.filter(colName => checkname(colName)).map(colName => new Column(colName)): _*)
where checkname is a simple user-defined predicate that returns true for column names you want to keep and false for those matching your pattern. Put the pattern check inside that function:
def checkname(colName: String): Boolean = {
  [expr] // return true to keep the column, false to drop it
}
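For the '_undo' case from the question, a minimal sketch of that predicate and the select call might look like this (assuming df is your DataFrame; col comes from org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.col

// Keep every column that does NOT end with "_undo"
def checkname(colName: String): Boolean = !colName.endsWith("_undo")

val customDF = df.select(df.columns.filter(checkname).map(name => col(name)): _*)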

What about this?
df.drop(df.columns.filter(c => "regexp".r.pattern.matcher(c).matches): _*)
or
df.drop(df.columns.filter(_.endsWith("_undo")): _*)
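As a quick illustration, a minimal sketch on a toy DataFrame (assuming a SparkSession named spark):

import spark.implicits._

val df = Seq((1, "a", "x"), (2, "b", "y")).toDF("id", "name_undo", "value")
val cleaned = df.drop(df.columns.filter(_.endsWith("_undo")): _*)
cleaned.printSchema() // only "id" and "value" remain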

Related

How to change many fields at the same time in Slick update query?

Let's say I have a table for movies which has many fields such as id, the name of the movie, the year it was created, etc. I know that the following works:
val updateMovieNameQuery = SlickTables.movieTable.filter(_.id === movieId).map(_.name).update("newName")
Is there any way to update two or more fields in this way? I tried the following, but it doesn't work.
val updateMovieNameQuery = SlickTables.movieTable.filter(_.id === movieId).map(_.name,_.year).update("newName",1997)
Your attempt is pretty close, but you need to extract the fields as a tuple and then pass a new tuple to update:
val updateMovieNameQuery = SlickTables
.movieTable
.filter(_.id === movieId)
.map(m => (m.name, m.year)) // Create tuple of field values
.update(("newName",1997)) // Pass new tuple of values
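To actually run the update, a minimal hedged sketch (assuming Slick 3 and an in-scope db: Database; the result is the number of affected rows):

import scala.concurrent.Future

val rowsAffected: Future[Int] = db.run(updateMovieNameQuery)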

Matching column names from a CSV file in Spark Scala

I want to take the headers (column names) from my CSV file and then match them against my existing header.
I am using the code below:
val cc = sparksession.read.csv(filepath).take(1)
It's giving me a value like:
Array([id,name,salary])
and I have created one more static schema, which gives me a value like this:
val ss=Array("id","name","salary")
and then I'm trying to compare the column names using an if condition:
if (cc == ss) {
  println("matched")
} else {
  println("not matched")
}
I guess that due to the [] and () mismatch it always goes to the else part. Is there any other way to compare these values without considering [] and ()?
First, for convenience, set the header option to true when reading the file:
val df = sparksession.read.option("header", true).csv(filepath)
Get the column names and define the expected column names:
val cc = df.columns
val ss = Array("id", "name", "salary")
To check if the two match (not considering the ordering):
if (cc.toSet == ss.toSet) {
  println("matched")
} else {
  println("not matched")
}
If the order is relevant, then the comparison can be done as follows (comparing Arrays directly with == doesn't work here, but Seq does):
cc.toSeq == ss.toSeq
or use a deep array comparison:
cc.deep == ss.deep
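As a side note (not part of the original answer), Array.sameElements is another way to do the ordered comparison:

cc.sameElements(ss) // true when both arrays hold the same values in the same order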
First of all, I think you are trying to compare an Array[org.apache.spark.sql.Row] with an Array[String]. I believe you should change how you load the headers to something like: val cc = spark.read.format("csv").option("header", "true").load(fileName).columns.toArray.
Then you could compare using cc.deep == ss.deep.
The code below worked for me:
val cc = spark.read.csv("filepath").take(1)(0).toString
The above code gave the output as a String: "[id,name,salary]".
Then I created one static schema as:
val ss = "[id,name,salary]"
and then wrote the if/else conditions.

Scala - how to filter a StructType with a list of StructField names?

I'm writing a method to parse a schema and want to filter the resulting StructType with a list of column names, which is a subset of the StructField names of the original schema.
As a result, if a flag isFilteringReq = true, I want to return a StructType containing only the StructFields with the names from specialColumnNames, in the same order. If the flag is false, then return the original StructType.
val specialColumnNames = Seq("metric_1", "metric_2", "metric_3")
First I'm getting the original schema with pattern matching.
val customSchema: StructType = schemaType match {
case "type_1" => getType1chema()
case "type_2" => getType2chema()
}
There are two problems:
1 - I wasn't able to apply .filter() directly to customSchema right after the closing curly brace; I get a Cannot resolve symbol filter error. So I wrote a separate method, makeCustomSchema, but I don't need a separate object. Is there a more elegant way to apply the filtering in this case?
2 - I could filter the originalStruct but only with a single hardcoded column name. How should I pass the specialColumnNames to contains()?
def makeCustomSchema(originalStruct: StructType, isFilteringReq: Boolean, updColumns: Seq[String]) =
  if (isFilteringReq) {
    originalStruct.filter(s => s.name.contains("metric_1"))
  } else {
    originalStruct
  }
val newSchema = makeCustomSchema(customSchema, isFilteringReq, specialColumnNames)
Instead of passing a Seq, pass a Set, and you can filter on whether the field is in the set or not.
Also, I wouldn't use a flag; instead, you could pass an empty Set when there's no filtering, or use Option[Set[String]].
Anyway, you could also use the copy method that comes for free with case classes.
Something like this should work.
def makeCustomSchema(originalStruct: StructType, updColumns: Set[String]): StructType = {
  updColumns match {
    case s if s.isEmpty => originalStruct
    case _ => originalStruct.copy(
      fields = originalStruct.fields.filter(f => updColumns.contains(f.name)))
  }
}
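A quick usage sketch (the field names here are placeholders, assuming the Spark SQL types are imported):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val original = StructType(Seq(
  StructField("metric_1", StringType),
  StructField("metric_2", StringType),
  StructField("other", StringType)))

makeCustomSchema(original, Set("metric_1", "metric_2")) // keeps only the two metric fields
makeCustomSchema(original, Set.empty)                   // returns the original schema unchanged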
Usually you don't need to build structs like this. Have you tried using the drop() method on DataFrame/Dataset?
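For instance, if the end goal is just a DataFrame restricted to those columns, a hedged sketch (assuming a DataFrame df with the original schema) can skip the schema surgery and select the wanted columns directly:

import org.apache.spark.sql.functions.col

val keep = Set("metric_1", "metric_2", "metric_3")
val trimmed = df.select(df.columns.filter(c => keep.contains(c)).map(c => col(c)): _*)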

Pass columnNames dynamically to cassandraTable().select()

I'm reading a query off of a file at run-time and executing it in the Spark + Cassandra environment.
I'm executing:
sparkContext.cassandraTable("keyspaceName", "colFamilyName").select("col1", "col2", "col3").where("some condition = true")
Query in file:
select col1, col2, col3
from keyspaceName.colFamilyName
where somecondition = true
Here col1, col2, col3 can vary depending on the query parsed from the file.
Question:
How do I pick the column names from the query and pass them to select() at runtime?
I have tried many ways to do it:
1. The dumbest thing I did (which obviously threw an error):
var str = "col1,col2,col3"
var selectStmt = str.split("\\,").map { x => "\"" + x.trim() + "\"" }.mkString(",")
var queryRDD = sc.cassandraTable().select(selectStmt)
Any ideas are welcome.
Side notes:
1. I do not want to use cassandraContext because it will be deprecated/removed in the next release (https://docs.datastax.com/en/datastax_enterprise/4.5/datastax_enterprise/spark/sparkCCcontext.html)
2. I'm on
- a. Scala 2.11
- b. spark-cassandra-connector_2.11:1.6.0-M1
- c. Spark 1.6
Use Cassandra Connector
Your use case sounds like you actually want to use CassandraConnector objects. These give you direct access to a per-executor-JVM session pool and are ideal for just executing arbitrary queries. This will end up being much more efficient than creating an RDD for each query.
This would look something like:
import com.datastax.spark.connector.cql.CassandraConnector

val connector = CassandraConnector(sc.getConf)
rddOfStatements.mapPartitions { it =>
  connector.withSessionDo { session =>
    it.map(statement => session.execute(statement))
  }
}
But you most likely would want to use executeAsync and handle the futures separately for better performance.
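A hedged sketch of that idea, reusing the connector defined above (this assumes the DataStax Java driver 3.x API, where executeAsync returns a ResultSetFuture; toList forces all statements to be submitted before blocking on any result):

rddOfStatements.mapPartitions { it =>
  connector.withSessionDo { session =>
    val futures = it.map(statement => session.executeAsync(statement)).toList // submit everything first
    futures.map(_.getUninterruptibly()).iterator // then block for the results
  }
}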
Programmatically specifying columns in cassandraTable
The select method takes ColumnRef*, which means you need to pass in some number of ColumnRefs. Normally there is an implicit conversion from String to ColumnRef, which is why you can pass in plain varargs of strings.
Here it's a little more complicated because we want to pass varargs of another type, so we end up with double implicits, and Scala doesn't like that.
So instead we pass in ColumnName objects as varargs (: _*):
========================================
Keyspace: test
========================================
Table: dummy
----------------------------------------
- id : java.util.UUID (partition key column)
- txt : String
val columns = Seq("id", "txt")
columns: Seq[String] = List(id, txt)
//Convert the strings to ColumnNames (a subclass of ColumnRef) and treat as var args
sc.cassandraTable("test","dummy")
.select(columns.map(ColumnName(_)):_*)
.collect
Array(CassandraRow{id: 74f25101-75a0-48cd-87d6-64cb381c8693, txt: hello world})
//Only use the first column
sc.cassandraTable("test","dummy")
.select(columns.map(ColumnName(_)).take(1):_*)
.collect
Array(CassandraRow{id: 74f25101-75a0-48cd-87d6-64cb381c8693})
//Only use the last column
sc.cassandraTable("test","dummy")
.select(columns.map(ColumnName(_)).takeRight(1):_*)
.collect
Array(CassandraRow{txt: hello world})

Scala - Expanding an argument list in a pattern matching expression

I'm very new to Scala and trying to use it as an interface to Spark. I'm running into a problem making a generic CSV to DataFrame function. For example, I've got a CSV with about 50 fields, the first of which are task, name, and id. I can get the following to work:
val reader = new CSVReader(new StringReader(txt))
reader.readAll().map(_ match {
  case Array(task, name, id, _*) => Row(task, name, id)
  case unexpectedArrayForm =>
    throw new RuntimeException("Record did not have correct number of fields: " + unexpectedArrayForm.mkString(","))
})
However, I'd rather not have to hard-code the number of fields needed to create a Spark Row. I tried this:
val reader = new CSVReader(new StringReader(txt))
reader.readAll().map(_ match {
  case Array(args @ _*) => Row(args)
  case unexpectedArrayForm =>
    throw new RuntimeException("Record did not have correct number of fields: " + unexpectedArrayForm.mkString(","))
})
But it just creates a Row object with a single element. How can I make it expand the args in Row(args) so that if I have an array of N elements I'll get a Row with N elements?
Change your input to be variable length by adding _*:
Row(args:_*)
This is what Row accepts per its apply signature.
In fact, you don't even need to do anything other than pass this in to Row, as it is already of the right sequence type:
reader.readAll().map(Row(_:_*))
This should do the trick:
val reader = new CSVReader(new StringReader(txt))
reader.readAll().map(_ match {
  case a: Array[String] => Row(a: _*)
  case unexpectedArrayForm =>
    throw new RuntimeException("Record did not have correct number of fields: " + unexpectedArrayForm.mkString(","))
})
Edited to correct omission of Array type
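As a follow-up, since the goal was a generic CSV-to-DataFrame function, here is one possible sketch of the final step. It assumes spark is a SparkSession, rows is the mapped result collected into a Scala Seq[Row] (an assumption, since readAll() returns a Java list that the snippets above already treat as a Scala collection), and every field is kept as a nullable string; the header names are placeholders:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val header = Seq("task", "name", "id") // ...plus the remaining field names
val schema = StructType(header.map(name => StructField(name, StringType, nullable = true)))

val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)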