Set all values to None for multiple columns - scala

I'm setting up a Spark batch job that filters out some fields that need cleaning up. How do I set the values of the columns in question to None for all rows? (I already have a dataframe containing only the rows I want to alter.)
I am far from an expert in Spark, and I searched around a lot before asking here, but I am still at a loss for a simple enough answer.
There are around 50 such columns, and I cannot hard-code their column indexes to access them, as these may change in future batches.
Example Input dataframe (target columns contain data):
id        TARGET 1      TARGET 2      TARGET 3      Col6  ...
someid1   Some(String)  Some(String)  Some(String)  val1  ...
someid2   Some(String)  Some(String)  None          val4  ...
someid5   Some(String)  Some(String)  Some(String)  val3  ...
someid6   Some(String)  Some(String)  Some(String)  val7  ...
Expected Output dataframe (all target columns set to None):
id        TARGET 1  TARGET 2  TARGET 3  Col6  ...
someid1   None      None      None      val1  ...
someid2   None      None      None      val4  ...
someid5   None      None      None      val3  ...
someid6   None      None      None      val7  ...

AFAIK, Spark DataFrames don't hold None values; missing values are represented as null. A possible solution is therefore to set the target columns to null values cast to String:
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

ds
  .withColumn("target1", lit(null).cast(StringType))
  .withColumn("target2", lit(null).cast(StringType))
It produces the following output:
+--------------------+-------+-------+-----+
| id|target1|target2| col6|
+--------------------+-------+-------+-----+
| 4201735573065099460| null| null|疦薠紀趣餅|
|-6432819446886055080| null| null|┵િ塇駢뱪|
|-7700306868339925800| null| null|鵎썢鳝踽嬌|
|-4913818084582557950| null| null|ꢵ痩찢쑖|
| 6731176796531697018| null| null|少⽬ᩢゖ謹|
+--------------------+-------+-------+-----+
only showing top 5 rows
root
|-- id: long (nullable = false)
|-- target1: string (nullable = true)
|-- target2: string (nullable = true)
|-- col6: string (nullable = true)
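Since there are around 50 target columns whose positions may change, the same pattern can be applied to a whole list of column names with foldLeft instead of chaining withColumn by hand. A minimal sketch, assuming the targets are all String columns; targetCols and cleaned are hypothetical names, and you would build the list from your own naming rule:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

// Hypothetical list of target column names; derive it from ds.columns
// with whatever rule identifies the columns to clean up.
val targetCols: Seq[String] = Seq("target1", "target2", "target3")

// Set every target column to null without hard-coding any column index.
val cleaned: DataFrame = targetCols.foldLeft(ds.toDF()) { (df, c) =>
  df.withColumn(c, lit(null).cast(StringType))
}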
That's also what you get when you set a value to None in a Dataset.
case class TestData(id: Long, target1: Option[String], target2: Option[String], col6: String)

import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell; needed for toDS()

val res = Seq(
  TestData(1, Some("a"), Some("b"), "c"),
  TestData(2, Some("a"), Some("b"), "c"),
  TestData(3, Some("a"), Some("b"), "c"),
  TestData(4, Some("a"), Some("b"), "c")
).toDS()

res.show(5)
res.map(_.copy(target1 = None, target2 = None)).show(5)
res.printSchema()
This returns:
+---+-------+-------+----+
| id|target1|target2|col6|
+---+-------+-------+----+
| 1| a| b| c|
| 2| a| b| c|
| 3| a| b| c|
| 4| a| b| c|
+---+-------+-------+----+
+---+-------+-------+----+
| id|target1|target2|col6|
+---+-------+-------+----+
| 1| null| null| c|
| 2| null| null| c|
| 3| null| null| c|
| 4| null| null| c|
+---+-------+-------+----+
root
|-- id: long (nullable = false)
|-- target1: string (nullable = true)
|-- target2: string (nullable = true)
|-- col6: string (nullable = true)

Related

Spark Scala replace Dataframe blank records to "0"

I need to replace my Dataframe field's blank records with "0".
Here is my code:
import sqlContext.implicits._
case class CInspections(business_id: Int, score: String, date: String, type1: String)
val baseDir = "/FileStore/tables/484qrxx21488929011080/"
val raw_inspections = sc.textFile(s"$baseDir/inspections_plus.txt")
val raw_inspectionsmap = raw_inspections.map(line => line.split("\t"))
val raw_inspectionsRDD = raw_inspectionsmap.map(raw_inspections =>
  CInspections(raw_inspections(0).toInt, raw_inspections(1), raw_inspections(2), raw_inspections(3)))
val raw_inspectionsDF = raw_inspectionsRDD.toDF
raw_inspectionsDF.createOrReplaceTempView("Inspections")
raw_inspectionsDF.printSchema
raw_inspectionsDF.show()
I am using a case class and then converting to a Dataframe. But I need "score" as an Int because I have to perform some operations on it and sort it.
If I declare it as score: Int, I get an error for blank values:
java.lang.NumberFormatException: For input string: ""
+-----------+-----+--------+--------------------+
|business_id|score| date| type1|
+-----------+-----+--------+--------------------+
| 10| |20140807|Reinspection/Foll...|
| 10| 94|20140729|Routine - Unsched...|
| 10| |20140124|Reinspection/Foll...|
| 10| 92|20140114|Routine - Unsched...|
| 10| 98|20121114|Routine - Unsched...|
| 10| |20120920|Reinspection/Foll...|
| 17| |20140425|Reinspection/Foll...|
+-----------+-----+--------+--------------------+
I need the score field as an Int because for the query below, it sorts as a String, not an Int, and gives the wrong result:
sqlContext.sql("""select raw_inspectionsDF.score from raw_inspectionsDF where score <>"" order by score""").show()
+-----+
|score|
+-----+
| 100|
| 100|
| 100|
+-----+
An empty string can't be converted to an Integer; you need to make score nullable so that, when the field is missing, it is represented as null. You can try the following:
import scala.util.{Try, Success, Failure}
1) Define a custom parse function that returns None if the string can't be converted to an Int (in your case, an empty string):
def parseScore(s: String): Option[Int] = {
  Try(s.toInt) match {
    case Success(x) => Some(x)
    case Failure(_) => None
  }
}
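Equivalently, Try(...).toOption collapses the pattern match into a one-liner with the same behaviour:
import scala.util.Try

def parseScore(s: String): Option[Int] = Try(s.toInt).toOption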
2) Define the score field in your case class as an Option[Int]:
case class CInspections (business_id:Int, score: Option[Int], date:String, type1:String)
val raw_inspections = sc.textFile("test.csv")
val raw_inspectionsmap = raw_inspections.map(line => line.split("\t"))
3) Use the custom parseScore function to parse the score field:
val raw_inspectionsRDD = raw_inspectionsmap.map(raw_inspections =>
CInspections(raw_inspections(0).toInt, parseScore(raw_inspections(1)),
raw_inspections(2),raw_inspections(3)))
val raw_inspectionsDF = raw_inspectionsRDD.toDF
raw_inspectionsDF.createOrReplaceTempView ("Inspections")
raw_inspectionsDF.printSchema
//root
// |-- business_id: integer (nullable = false)
// |-- score: integer (nullable = true)
// |-- date: string (nullable = true)
// |-- type1: string (nullable = true)
raw_inspectionsDF.show()
+-----------+-----+----+-----+
|business_id|score|date|type1|
+-----------+-----+----+-----+
| 1| null| a| b|
| 2| 3| s| k|
+-----------+-----+----+-----+
4) After parsing the file correctly, you can easily replace null values with 0 using the na.fill function:
raw_inspectionsDF.na.fill(0).show
+-----------+-----+----+-----+
|business_id|score|date|type1|
+-----------+-----+----+-----+
| 1| 0| a| b|
| 2| 3| s| k|
+-----------+-----+----+-----+

Spark Graphx: Loading a graph from adjacency matrix

I have been experimenting with the Graphx APIs of Spark, primarily to learn and have a feel of how to use them. In the process, I have to load an adjacency matrix into a graph. The matrix dataset is here.
From the site, the matrix is described as
A number of employees in a factory was interviewed on a question: “Do you like to work with your co-worker?”. Possible answers are 1 for yes and 0 for no. Each employee gave an answer for each other employee, thus creating an adjacency matrix.
So, I have decided to name the employees with English letters ("A" onwards). Employees form the nodes of the graph, and their preferences for their co-workers form the edges. I haven't found any straightforward way in Spark to achieve this; my R-programmer friends tell me that it is quite easy to do in their world. So, I set about writing a naive implementation. Here's the code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.graphx.{Edge, Graph}

val conf = new SparkConf().setMaster("local[*]").setAppName("GraphExploration App")
val spark = SparkSession
.builder()
.appName("Spark SQL: beginners exercise")
.getOrCreate()
val sc = SparkContext.getOrCreate(conf)
val df = spark.read.csv("./BlogInputs/sociogram-employees-un.csv").cache
val allRows = df.toLocalIterator.toIndexedSeq
type EmployeeVertex = (Long,String)
val employeesWithNames = (0 until allRows.length).map(i => (i.toLong,((i + 'A').toChar.toString())))
val columnNames = (0 until allRows.length).map(i => ("_c" + i)).toIndexedSeq // It is a square matrix; rows == columns
val edgesAsCollected = (for {
rowIndex <- 0 until df.count.toInt
colIndex <- 0 until df.count.toInt
if (rowIndex != colIndex)
} yield {
if (allRows(rowIndex).getAs[String](columnNames(colIndex)).trim.toInt == 1) // compare the cell value, not its field index
Some(Edge(employeesWithNames(rowIndex)._1,employeesWithNames(colIndex)._1,"Likes"))
else
None
}).flatten
val employeeNodes = sc.parallelize(employeesWithNames)
val edges = sc.parallelize(edgesAsCollected)
val employeeGraph = Graph(employeeNodes, edges, "Nobody")
Here is the schema:
scala>df.printSchema
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)
|-- _c4: string (nullable = true)
|-- _c5: string (nullable = true)
|-- _c6: string (nullable = true)
|-- _c7: string (nullable = true)
|-- _c8: string (nullable = true)
|-- _c9: string (nullable = true)
|-- _c10: string (nullable = true)
|-- _c11: string (nullable = true)
|-- _c12: string (nullable = true)
|-- _c13: string (nullable = true)
|-- _c14: string (nullable = true)
|-- _c15: string (nullable = true)
|-- _c16: string (nullable = true)
|-- _c17: string (nullable = true)
|-- _c18: string (nullable = true)
|-- _c19: string (nullable = true)
|-- _c20: string (nullable = true)
|-- _c21: string (nullable = true)
|-- _c22: string (nullable = true)
|-- _c23: string (nullable = true)
|-- _c24: string (nullable = true)
... and the first few rows here:
scala> df.show
16/12/21 07:12:00 WARN Executor: 1 block locks were not released by TID = 1:
[rdd_8_0]
+---+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|_c0|_c1|_c2|_c3|_c4|_c5|_c6|_c7|_c8|_c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|_c18|_c19|_c20|_c21|_c22|_c23|_c24|
+---+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| 0| 1| 0| 1| 1| 0| 1| 1| 1| 0| 0| 1| 0| 1| 1| 0| 1| 1| 0| 1| 0| 1| 0| 1| 1|
| 1| 0| 0| 1| 0| 0| 1| 0| 1| 0| 0| 1| 0| 0| 1| 0| 1| 0| 1| 0| 0| 1| 0| 1| 0|
| 0| 1| 0| 1| 1| 0| 0| 0| 1| 0| 0| 0| 0| 1| 1| 0| 0| 1| 0| 0| 0| 1| 1| 0| 1|
| 0| 1| 1| 0| 0| 0| 1| 0| 0| 0| 1| 1| 0| 1| 0| 0| 1| 1| 0| 0| 1| 0| 1| 1| 0|
This serves my purpose, but I feel there may be a different way. My very limited knowledge of Spark's MLlib APIs is perhaps a barrier. Could someone please comment on this? Even better, could someone show me a better yet simple way (by editing my code, if necessary)?
I find @DanieldePaula's suggestion acceptable as an answer for the case at hand:
As the matrix is square, a very large number of rows would imply a very large number of columns, in which case using SparkSQL wouldn't seem optimal in my opinion. I think you can use Spark for this problem if the matrix is converted into a Sparse format, e.g. RDD[(row, col, value)], then it would be very easy to create your vertices and edges.
Thanks, Daniel!
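For reference, a minimal sketch of that suggestion, reusing allRows, columnNames, employeesWithNames and sc from the code above; the names introduced below (sparseEntries, vertexRDD, likeEdges, sparseEmployeeGraph) are purely illustrative:
import org.apache.spark.graphx.{Edge, Graph}

// Flatten the dense adjacency matrix into sparse (row, col, value) entries,
// keeping only the off-diagonal cells that contain a 1.
val sparseEntries = sc.parallelize(
  allRows.zipWithIndex.flatMap { case (row, rowIndex) =>
    columnNames.zipWithIndex.collect {
      case (colName, colIndex)
        if rowIndex != colIndex && row.getAs[String](colName).trim.toInt == 1 =>
          (rowIndex.toLong, colIndex.toLong, 1)
    }
  }
)

// Build the graph directly from the sparse entries.
val vertexRDD = sc.parallelize(employeesWithNames)
val likeEdges = sparseEntries.map { case (src, dst, _) => Edge(src, dst, "Likes") }
val sparseEmployeeGraph = Graph(vertexRDD, likeEdges, "Nobody")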

Convert string to timestamp for Spark using Scala

I have a dataframe called train, which has the following schema:
root
|-- date_time: string (nullable = true)
|-- site_name: integer (nullable = true)
|-- posa_continent: integer (nullable = true)
I want to cast the date_time column to timestamp and create a new column with the year value extracted from the date_time column.
To be clear, I have the following dataframe :
+-------------------+---------+--------------+
| date_time|site_name|posa_continent|
+-------------------+---------+--------------+
|2014-08-11 07:46:59| 2| 3|
|2014-08-11 08:22:12| 2| 3|
|2015-08-11 08:24:33| 2| 3|
|2016-08-09 18:05:16| 2| 3|
|2011-08-09 18:08:18| 2| 3|
|2009-08-09 18:13:12| 2| 3|
|2014-07-16 09:42:23| 2| 3|
+-------------------+---------+--------------+
I want to get the following dataframe :
+-------------------+---------+--------------+--------+
| date_time|site_name|posa_continent|year |
+-------------------+---------+--------------+--------+
|2014-08-11 07:46:59| 2| 3|2014 |
|2014-08-11 08:22:12| 2| 3|2014 |
|2015-08-11 08:24:33| 2| 3|2015 |
|2016-08-09 18:05:16| 2| 3|2016 |
|2011-08-09 18:08:18| 2| 3|2011 |
|2009-08-09 18:13:12| 2| 3|2009 |
|2014-07-16 09:42:23| 2| 3|2014 |
+-------------------+---------+--------------+--------+
Well, if you want to cast the date_time column to timestamp and create a new column with the year value, then do exactly that:
import org.apache.spark.sql.functions.year
df
.withColumn("date_time", $"date_time".cast("timestamp")) // cast to timestamp
.withColumn("year", year($"date_time")) // add year column
You could map the dataframe to add the year at the end of each row:
import org.apache.spark.sql.Row
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat

df.map {
  case Row(col1: String, col2: Int, col3: Int) =>
    (col1, col2, col3, DateTime.parse(col1, DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss")).getYear)
}.toDF("date_time", "site_name", "posa_continent", "year").show()

Spark SQL: Select with arithmetic on column values and type casting?

I'm using Spark SQL with DataFrames. Is there a way to do a select statement with some arithmetic, just as you can in SQL?
For example, I have the following table:
var data = Array((1, "foo", 30, 5), (2, "bar", 35, 3), (3, "foo", 25, 4))
var dataDf = sc.parallelize(data).toDF("id", "name", "value", "years")
dataDf.printSchema
// root
// |-- id: integer (nullable = false)
// |-- name: string (nullable = true)
// |-- value: integer (nullable = false)
// |-- years: integer (nullable = false)
dataDf.show()
// +---+----+-----+-----+
// | id|name|value|years|
// +---+----+-----+-----+
// | 1| foo| 30| 5|
// | 2| bar| 35| 3|
// | 3| foo| 25| 4|
//+---+----+-----+-----+
Now, I would like to do a SELECT statement that creates a new column with some arithmetic performed on the existing columns. For example, I would like to compute the ratio value/years. I need to convert value (or years) to a double first. I tried this statement, but it doesn't compile:
dataDf.
select(dataDf("name"), (dataDf("value").toDouble/dataDf("years")).as("ratio")).
show()
<console>:35: error: value toDouble is not a member of org.apache.spark.sql.Column
select(dataDf("name"), (dataDf("value").toDouble/dataDf("years")).as("ratio")).
I saw a similar question in "How to change column types in Spark SQL's DataFrame?", but that's not quite what I want.
The proper way to change the type of a Column is to use the cast method. It can take either a type description string:
dataDf("value").cast("double") / dataDf("years")
or a DataType:
import org.apache.spark.sql.types.DoubleType
dataDf("value").cast(DoubleType) / dataDf("years")
Well, if it's not a requirement to use the select method, you can just use withColumn:
import org.apache.spark.sql.functions.col

val resDF = dataDf.withColumn("result", col("value").cast("double") / col("years"))
resDF.show
//+---+----+-----+-----+------------------+
//| id|name|value|years| result|
//+---+----+-----+-----+------------------+
//| 1| foo| 30| 5| 6.0|
//| 2| bar| 35| 3|11.666666666666666|
//| 3| foo| 25| 4| 6.25|
//+---+----+-----+-----+------------------+
If it's a requirement to use a select, one option could be:
val exprs = dataDf.columns.map(col(_)) ++ List((col("value").cast("double") / col("years")).as("result"))
dataDf.select(exprs: _*).show
//+---+----+-----+-----+------------------+
//| id|name|value|years| result|
//+---+----+-----+-----+------------------+
//| 1| foo| 30| 5| 6.0|
//| 2| bar| 35| 3|11.666666666666666|
//| 3| foo| 25| 4| 6.25|
//+---+----+-----+-----+------------------+

Aggregations in JDBCRDD or RDD

I'm brand new to Scala and Spark, and I'm trying to run a SQL query against SQL Server with Spark using JdbcRDD, then do some transformations on it with mappings and aggregations.
This is what I have: a table with n String columns and m numeric columns, like
"A", "A1",1,2
"A", "A1",4,3
"A", "A2",3,4
"B", "B1",6,7
...
...
What I'm looking for is to create a hierarchical structure, grouping the strings and aggregating the numeric columns, like
A
|->A1
|->(5,5)
|->A2
|->(3,4)
B
|->B1
|->(6,7)
I was able to create the hierarchy, but I'm not able to perform the aggregation on the list of numeric values.
If you're loading your data over JDBC I would simply use DataFrames:
import sqlContext.implicits._
import org.apache.spark.sql.functions.sum
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
val options: Map[String, String] = ???
val df: DataFrame = sqlContext.read
.format("jdbc")
.options(options)
.load()
.toDF("k1", "k2", "v1", "v2")
df.printSchema
// root
// |-- k1: string (nullable = true)
// |-- k2: string (nullable = true)
// |-- v1: integer (nullable = true)
// |-- v2: integer (nullable = true)
df.show
// +---+---+---+---+
// | k1| k2| v1| v2|
// +---+---+---+---+
// | A| A1| 1| 2|
// | A| A1| 4| 3|
// | A| A2| 3| 4|
// | B| B1| 6| 7|
// +---+---+---+---+
With input like the above, all you need is a basic aggregation:
df
.groupBy($"k1", $"k2")
.agg(sum($"v1").alias("v1"), sum($"v2").alias("v2")).show
// +---+---+---+---+
// | k1| k2| v1| v2|
// +---+---+---+---+
// | A| A1| 5| 5|
// | A| A2| 3| 4|
// | B| B1| 6| 7|
// +---+---+---+---+
If you have an RDD like this:
val rdd: RDD[(String, String, Int, Int)] = ???
rdd.first
// (String, String, Int, Int) = (A,A1,1,2)
There is no reason to build a complex hierarchy. A simple PairRDD should be enough:
val aggregated: RDD[((String, String), breeze.linalg.Vector[Int])] = rdd
.map{case (k1, k2, v1, v2) => ((k1, k2), breeze.linalg.Vector(v1, v2))}
.reduceByKey(_ + _)
aggregated.first
// ((String, String), breeze.linalg.Vector[Int]) = ((A,A2),DenseVector(3, 4))
Keeping a hierarchical structure is inefficient, but you can group the above RDD like this:
aggregated.map{case ((k1, k2), v) => (k1, (k2, v))}.groupByKey
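If the grouped result is small enough to collect, it can then be materialized as a nested map on the driver; a minimal sketch, reusing aggregated from above (hierarchy is just an illustrative name):
// RDD[((k1, k2), vector)] -> Map[k1, Map[k2, vector]] on the driver
val hierarchy: Map[String, Map[String, breeze.linalg.Vector[Int]]] =
  aggregated
    .map { case ((k1, k2), v) => (k1, (k2, v)) }
    .groupByKey()
    .mapValues(_.toMap)
    .collect()
    .toMap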