Spark GraphX: Loading a graph from an adjacency matrix - Scala

I have been experimenting with Spark's GraphX APIs, primarily to learn and get a feel for how to use them. In the process, I need to load an adjacency matrix into a graph. The matrix dataset is here.
The site describes the matrix as follows:
A number of employees in a factory were interviewed on a question: “Do you like to work with your co-worker?”. Possible answers are 1 for yes and 0 for no. Each employee gave an answer for each other employee, thus creating an adjacency matrix.
So, I decided to name the employees with English letters ("A" onwards). The employees form the nodes of the graph, and their preferences for their co-workers form the edges. I haven't found any straightforward way in Spark to achieve this; my R-programmer friends tell me it is quite easy to do in their world. So, I set about writing a naive implementation. Here's the code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.graphx.{Edge, Graph}

val conf = new SparkConf().setMaster("local[*]").setAppName("GraphExploration App")
val spark = SparkSession
  .builder()
  .appName("Spark SQL: beginners exercise")
  .getOrCreate()
val sc = SparkContext.getOrCreate(conf)

val df = spark.read.csv("./BlogInputs/sociogram-employees-un.csv").cache
val allRows = df.toLocalIterator.toIndexedSeq

type EmployeeVertex = (Long, String)

// Name the employees "A", "B", "C", ... by row index
val employeesWithNames = (0 until allRows.length).map(i => (i.toLong, (i + 'A').toChar.toString))
val columnNames = (0 until allRows.length).map(i => "_c" + i).toIndexedSeq // square matrix; rows == columns

// Emit an edge i -> j labelled "Likes" wherever cell (i, j) holds "1" (all columns are strings)
val edgesAsCollected = (for {
  rowIndex <- allRows.indices
  colIndex <- allRows.indices
  if rowIndex != colIndex
} yield {
  if (allRows(rowIndex).getString(allRows(rowIndex).fieldIndex(columnNames(colIndex))) == "1")
    Some(Edge(employeesWithNames(rowIndex)._1, employeesWithNames(colIndex)._1, "Likes"))
  else
    None
}).flatten

val employeeNodes = sc.parallelize(employeesWithNames)
val edges = sc.parallelize(edgesAsCollected)
val employeeGraph = Graph(employeeNodes, edges, "Nobody")
Here is the schema:
scala> df.printSchema
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)
|-- _c4: string (nullable = true)
|-- _c5: string (nullable = true)
|-- _c6: string (nullable = true)
|-- _c7: string (nullable = true)
|-- _c8: string (nullable = true)
|-- _c9: string (nullable = true)
|-- _c10: string (nullable = true)
|-- _c11: string (nullable = true)
|-- _c12: string (nullable = true)
|-- _c13: string (nullable = true)
|-- _c14: string (nullable = true)
|-- _c15: string (nullable = true)
|-- _c16: string (nullable = true)
|-- _c17: string (nullable = true)
|-- _c18: string (nullable = true)
|-- _c19: string (nullable = true)
|-- _c20: string (nullable = true)
|-- _c21: string (nullable = true)
|-- _c22: string (nullable = true)
|-- _c23: string (nullable = true)
|-- _c24: string (nullable = true)
... and here are the first few rows:
scala> df.show
16/12/21 07:12:00 WARN Executor: 1 block locks were not released by TID = 1:
[rdd_8_0]
+---+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|_c0|_c1|_c2|_c3|_c4|_c5|_c6|_c7|_c8|_c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|_c18|_c19|_c20|_c21|_c22|_c23|_c24|
+---+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| 0| 1| 0| 1| 1| 0| 1| 1| 1| 0| 0| 1| 0| 1| 1| 0| 1| 1| 0| 1| 0| 1| 0| 1| 1|
| 1| 0| 0| 1| 0| 0| 1| 0| 1| 0| 0| 1| 0| 0| 1| 0| 1| 0| 1| 0| 0| 1| 0| 1| 0|
| 0| 1| 0| 1| 1| 0| 0| 0| 1| 0| 0| 0| 0| 1| 1| 0| 0| 1| 0| 0| 0| 1| 1| 0| 1|
| 0| 1| 1| 0| 0| 0| 1| 0| 0| 0| 1| 1| 0| 1| 0| 0| 1| 1| 0| 0| 1| 0| 1| 1| 0|
This serves my purpose, but I feel there may be a better way. My very limited knowledge of Spark's MLlib APIs is perhaps a barrier. Could someone please comment on this? Better yet, could someone show me a simpler way (by editing my code, if necessary)?

I find @DanieldePaula's suggestion acceptable as an answer for the case at hand:
As the matrix is square, a very large number of rows would imply a very large number of columns, in which case using SparkSQL wouldn't seem optimal in my opinion. I think you can use Spark for this problem if the matrix is converted into a Sparse format, e.g. RDD[(row, col, value)], then it would be very easy to create your vertices and edges.
Thanks, Daniel!
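For completeness, here is a rough sketch of what that sparse-format idea could look like against the same data, reusing sc, allRows and employeesWithNames from my code above; it is illustrative only, not a drop-in replacement:

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Sparse representation of the matrix: one record per non-zero cell (row, col, value)
val sparseEntries: RDD[(Long, Long, Int)] = sc.parallelize(
  for {
    rowIndex <- allRows.indices
    colIndex <- allRows.indices
    if rowIndex != colIndex
    value = allRows(rowIndex).getString(colIndex).trim.toInt
    if value == 1
  } yield (rowIndex.toLong, colIndex.toLong, value)
)

// Vertices and edges fall out of the sparse triples directly
val vertices = sc.parallelize(employeesWithNames)
val likeEdges = sparseEntries.map { case (src, dst, _) => Edge(src, dst, "Likes") }
val employeeGraphFromSparse = Graph(vertices, likeEdges, "Nobody")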

Related

Set all values to None for multiple columns

I'm setting up a Spark batch that aims to filter out some fields that need cleaning up. How do I set the values of the columns in question to None for all rows? (I already have a dataframe containing only the rows I want to alter.)
I am far from an expert in Spark, and I searched around a lot before asking here, but I am still at a loss for a simple enough answer.
There are around 50 columns, and I cannot hard-code the column indexes to access them, as they may change in future batches.
Example Input dataframe (target columns contain data):
id       TARGET 1      TARGET 2      TARGET 3      Col6  ...
someid1  Some(String)  Some(String)  Some(String)  val1  ...
someid2  Some(String)  Some(String)  None          val4  ...
someid5  Some(String)  Some(String)  Some(String)  val3  ...
someid6  Some(String)  Some(String)  Some(String)  val7  ...
Expected Output dataframe (all target columns set to None):
id       TARGET 1  TARGET 2  TARGET 3  Col6  ...
someid1  None      None      None      val1  ...
someid2  None      None      None      val4  ...
someid5  None      None      None      val3  ...
someid6  None      None      None      val7  ...
AFAIK, Spark doesn't accept None values. A possible solution would be to replace them with null values cast to String:
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

ds.withColumn("target1", lit(null).cast(StringType))
  .withColumn("target2", lit(null).cast(StringType))
It produces the following output:
+--------------------+-------+-------+-----+
| id|target1|target2| col6|
+--------------------+-------+-------+-----+
| 4201735573065099460| null| null|疦薠紀趣餅|
|-6432819446886055080| null| null|┵િ塇駢뱪|
|-7700306868339925800| null| null|鵎썢鳝踽嬌|
|-4913818084582557950| null| null|ꢵ痩찢쑖|
| 6731176796531697018| null| null|少⽬ᩢゖ謹|
+--------------------+-------+-------+-----+
only showing top 5 rows
root
|-- id: long (nullable = false)
|-- target1: string (nullable = true)
|-- target2: string (nullable = true)
|-- col6: string (nullable = true)
That's also what you get when you set a value to None in a Dataset.
case class TestData(id: Long, target1: Option[String], target2: Option[String], col6: String)

val res = Seq(
  TestData(1, Some("a"), Some("b"), "c"),
  TestData(2, Some("a"), Some("b"), "c"),
  TestData(3, Some("a"), Some("b"), "c"),
  TestData(4, Some("a"), Some("b"), "c")
).toDS()

res.show(5)
res.map(_.copy(target1 = None, target2 = None)).show(5)
res.printSchema()
This returns:
+---+-------+-------+----+
| id|target1|target2|col6|
+---+-------+-------+----+
| 1| a| b| c|
| 2| a| b| c|
| 3| a| b| c|
| 4| a| b| c|
+---+-------+-------+----+
+---+-------+-------+----+
| id|target1|target2|col6|
+---+-------+-------+----+
| 1| null| null| c|
| 2| null| null| c|
| 3| null| null| c|
| 4| null| null| c|
+---+-------+-------+----+
root
|-- id: long (nullable = false)
|-- target1: string (nullable = true)
|-- target2: string (nullable = true)
|-- col6: string (nullable = true)
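Since the question mentions around 50 target columns whose positions may change between batches, here is a hedged sketch of how the same null-out can be applied to a dynamic list of column names with foldLeft; the targetColumns list is hypothetical and would come from configuration or a naming convention in practice:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

// Hypothetical list of the columns to clear
val targetColumns = Seq("target1", "target2", "target3")

// Apply the same withColumn(..., lit(null)) trick to every name in the list
def clearColumns(df: DataFrame, columns: Seq[String]): DataFrame =
  columns.foldLeft(df)((acc, name) => acc.withColumn(name, lit(null).cast(StringType)))

val cleared = clearColumns(ds.toDF, targetColumns)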

Spark scala Casting Unix time to timestamp fails

I am having a problem converting Unix time to a timestamp.
I have a DataFrame with a column PosTime. I would like to convert it to a Timestamp, but it is only half working. Can you help?
scala> adsb.printSchema()
root
|-- Icao: string (nullable = true)
|-- Alt: long (nullable = true)
|-- Lat: double (nullable = true)
|-- Long: double (nullable = true)
|-- PosTime: long (nullable = true)
|-- Spd: double (nullable = true)
|-- Trak: double (nullable = true)
|-- Type: string (nullable = true)
|-- Op: string (nullable = true)
|-- Cou: string (nullable = true)
scala> adsb.show(50)
+------+------+---------+----------+-------------+-----+-----+----+--------------------+--------------------+
| Icao| Alt| Lat| Long| PosTime| Spd| Trak|Type| Op| Cou|
+------+------+---------+----------+-------------+-----+-----+----+--------------------+--------------------+
|ABECE7| 4825|40.814442| -111.9776|1506875131778|197.0|356.0|B739| Delta Air Lines| United States|
|4787B0| 38000| null| null| null| null| null|B738| Norwegian| Norway|
|D3B18A| 4222| null| null| null| null| null|null| null|Unknown or unassi...|
|3C3F78|118400| null| null| null| null| null|null| null| Germany|
|AA1C45| -75|40.695969|-74.166321|1506875131747|157.4| 25.6|null| null| United States|
scala> val adsb1 = adsb.withColumn("PosTime", $"PosTime".cast(TimestampType))
scala> adsb_sort.show(100)
+------+-------+---------+---------+--------------------+-------+-------+----+----+--------------------+
| Icao| Alt| Lat| Long| PosTime| Spd| Trak|Type| Op| Cou|
+------+-------+---------+---------+--------------------+-------+-------+----+----+--------------------+
|FFFFFF| null| null| null| null| null| null|null|null|Unknown or unassi...|
|FFFFFF|1049093| 0.0| 0.0|49800-05-04 14:39...|28672.0| 1768.7|null|null|Unknown or unassi...|
|FFFFFF| 12458| 0.0| 0.0|49800-12-11 06:39...| 0.0| 2334.4|null|null|Unknown or unassi...|
Spark interprets a Long as a timestamp in seconds, but it looks like the data is in milliseconds:
scala> spark.sql("SELECT CAST(1506875131778 / 1000 AS timestamp)").show
+-------------------------------------------------------------------------+
|CAST((CAST(1506875131778 AS DOUBLE) / CAST(1000 AS DOUBLE)) AS TIMESTAMP)|
+-------------------------------------------------------------------------+
| 2017-10-01 18:25:...|
+-------------------------------------------------------------------------+
If I am right, just divide by 1000:
adsb.withColumn("PosTime", ($"PosTime" / 1000).cast(TimestampType))

How to combine 2 different dataframes together?

I have 2 DataFrames:
Users (~29.000.000 entries)
|-- userId: string (nullable = true)
Impressions (~1000 entries)
|-- modules: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- content: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- id: string (nullable = true)
I want to walk through all the Users and attach one Impression from these ~1000 entries to each User. So roughly every ~1000th User would get the same Impression: the loop over the Impressions starts from the beginning again and assigns the same ~1000 Impressions to the next ~1000 Users.
At the end I want to have a DataFrame with the combined data. The Users dataframe could be reused by adding the Impressions columns, or a newly created one would also work as a result.
Do you have any ideas on what a good solution would be here?
What I would do is use the old trick of adding a monotonically increasing ID to both dataframes, then create a new column on your LARGER dataframe (Users) which contains the modulo of each row's ID and the size of the smaller dataframe.
This new column then provides a rolling matching key against the items in the Impressions dataframe.
This is a minimal example (tested) to give you the idea. Obviously this will also work if you have 1000 impressions to join against:
import org.apache.spark.sql.functions.monotonically_increasing_id

var users = Seq("user1", "user2", "user3", "user4", "user5", "user6", "user7", "user8", "user9").toDF("users")
var impressions = Seq("a", "b", "c").toDF("impressions").withColumn("id", monotonically_increasing_id())
var cnt = impressions.count
users = users.withColumn("id", monotonically_increasing_id())
  .withColumn("mod", $"id" mod cnt)
  .join(impressions, $"mod" === impressions("id"))
  .drop("mod")
users.show
+-----+---+-----------+---+
|users| id|impressions| id|
+-----+---+-----------+---+
|user1| 0| a| 0|
|user2| 1| b| 1|
|user3| 2| c| 2|
|user4| 3| a| 0|
|user5| 4| b| 1|
|user6| 5| c| 2|
|user7| 6| a| 0|
|user8| 7| b| 1|
|user9| 8| c| 2|
+-----+---+-----------+---+
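The joined result above still carries both id columns; if only the user and impression values are needed, a simple select cleans it up (a hedged follow-up to the snippet above):

users.select("users", "impressions").show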
Sketch of idea:
Add a monotonically increasing id to both DataFrames, Users and Impressions, via
val indexedUsersDF = usersDf.withColumn("index", monotonicallyIncreasingId)
val indexedImpressionsDF = impressionsDf.withColumn("index", monotonicallyIncreasingId)
(see "Spark DataFrame: how to add an index column")
Determine the number of rows in Impressions via count and store it as an Int, e.g.
val numberOfImpressions = ...
Apply a UDF to the index column in indexedUsersDF that computes the modulo in a separate column (e.g. moduloIndex):
val moduloIndexedUsersDF = indexedUsersDF.select(...)
Join moduloIndexedUsersDF and indexedImpressionsDF on
moduloIndexedUsersDF("moduloIndex") === indexedImpressionsDF("index")
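A hedged end-to-end sketch of the steps above, reusing the usersDf and impressionsDf names from the sketch and replacing the UDF with an equivalent column expression (pmod); treat it as illustrative rather than tested against the real data:

import org.apache.spark.sql.functions.{col, lit, monotonically_increasing_id, pmod}

val indexedUsersDF = usersDf.withColumn("index", monotonically_increasing_id())
val indexedImpressionsDF = impressionsDf.withColumn("index", monotonically_increasing_id())

// Number of impressions to cycle through
val numberOfImpressions = indexedImpressionsDF.count()

// Modulo as a plain column expression instead of a UDF
val moduloIndexedUsersDF = indexedUsersDF
  .withColumn("moduloIndex", pmod(col("index"), lit(numberOfImpressions)))

val combined = moduloIndexedUsersDF
  .join(indexedImpressionsDF, moduloIndexedUsersDF("moduloIndex") === indexedImpressionsDF("index"))
  .drop(indexedImpressionsDF("index"))
  .drop("moduloIndex")

// Caveat: monotonically_increasing_id is unique and increasing but not consecutive across
// partitions, so this assumes the Impressions ids come out as 0..N-1 (true when that small
// DataFrame fits in one partition) and the pairing is only approximately round-robin; for a
// strict 1..N cycle, derive the index with zipWithIndex or a row_number window instead.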

Convert string to timestamp for Spark using Scala

I have a dataframe called train, which has the following schema:
root
|-- date_time: string (nullable = true)
|-- site_name: integer (nullable = true)
|-- posa_continent: integer (nullable = true)
I want to cast the date_time column to timestamp and create a new column with the year value extracted from the date_time column.
To be clear, I have the following dataframe :
+-------------------+---------+--------------+
| date_time|site_name|posa_continent|
+-------------------+---------+--------------+
|2014-08-11 07:46:59| 2| 3|
|2014-08-11 08:22:12| 2| 3|
|2015-08-11 08:24:33| 2| 3|
|2016-08-09 18:05:16| 2| 3|
|2011-08-09 18:08:18| 2| 3|
|2009-08-09 18:13:12| 2| 3|
|2014-07-16 09:42:23| 2| 3|
+-------------------+---------+--------------+
I want to get the following dataframe :
+-------------------+---------+--------------+--------+
| date_time|site_name|posa_continent|year |
+-------------------+---------+--------------+--------+
|2014-08-11 07:46:59| 2| 3|2014 |
|2014-08-11 08:22:12| 2| 3|2014 |
|2015-08-11 08:24:33| 2| 3|2015 |
|2016-08-09 18:05:16| 2| 3|2016 |
|2011-08-09 18:08:18| 2| 3|2011 |
|2009-08-09 18:13:12| 2| 3|2009 |
|2014-07-16 09:42:23| 2| 3|2014 |
+-------------------+---------+--------------+--------+
Well, if you want to cast the date_time column to timestamp and create a new column with the year value, then do exactly that:
import org.apache.spark.sql.functions.year
df
.withColumn("date_time", $"date_time".cast("timestamp")) // cast to timestamp
.withColumn("year", year($"date_time")) // add year column
You could map the dataframe to add the year at the end of each row:
import org.apache.spark.sql.Row
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat

df.map {
  case Row(col1: String, col2: Int, col3: Int) =>
    (col1, col2, col3, DateTime.parse(col1, DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss")).getYear)
}.toDF("date_time", "site_name", "posa_continent", "year").show()

Spark SQL: Select with arithmetic on column values and type casting?

I'm using Spark SQL with DataFrames. Is there a way to do a select statement with some arithmetic, just as you can in SQL?
For example, I have the following table:
var data = Array((1, "foo", 30, 5), (2, "bar", 35, 3), (3, "foo", 25, 4))
var dataDf = sc.parallelize(data).toDF("id", "name", "value", "years")
dataDf.printSchema
// root
// |-- id: integer (nullable = false)
// |-- name: string (nullable = true)
// |-- value: integer (nullable = false)
// |-- years: integer (nullable = false)
dataDf.show()
// +---+----+-----+-----+
// | id|name|value|years|
// +---+----+-----+-----+
// | 1| foo| 30| 5|
// | 2| bar| 35| 3|
// | 3| foo| 25| 4|
//+---+----+-----+-----+
Now, I would like to do a SELECT statement that creates a new column with some arithmetic performed on the existing columns. For example, I would like to compute the ratio value/years. I need to convert value (or years) to a double first. I tried this statement, but it wouldn't parse:
dataDf.
select(dataDf("name"), (dataDf("value").toDouble/dataDf("years")).as("ratio")).
show()
<console>:35: error: value toDouble is not a member of org.apache.spark.sql.Column
select(dataDf("name"), (dataDf("value").toDouble/dataDf("years")).as("ratio")).
I saw a similar question in "How to change column types in Spark SQL's DataFrame?", but that's not quite what I want.
The proper way to change the type of a Column is to use the cast method. It can take either a description string:
dataDf("value").cast("double") / dataDf("years")
or a DataType:
import org.apache.spark.sql.types.DoubleType
dataDf("value").cast(DoubleType) / dataDf("years")
Well, if it's not a requirement to use the select method, you can just use withColumn:
import org.apache.spark.sql.functions.col
val resDF = dataDf.withColumn("result", col("value").cast("double") / col("years"))
resDF.show
//+---+----+-----+-----+------------------+
//| id|name|value|years| result|
//+---+----+-----+-----+------------------+
//| 1| foo| 30| 5| 6.0|
//| 2| bar| 35| 3|11.666666666666666|
//| 3| foo| 25| 4| 6.25|
//+---+----+-----+-----+------------------+
If it's a requirement to use a select, one option could be:
val exprs = dataDf.columns.map(col(_)) ++ List((col("value").cast("double") / col("years")).as("result"))
dataDf.select(exprs: _*).show
//+---+----+-----+-----+------------------+
//| id|name|value|years| result|
//+---+----+-----+-----+------------------+
//| 1| foo| 30| 5| 6.0|
//| 2| bar| 35| 3|11.666666666666666|
//| 3| foo| 25| 4| 6.25|
//+---+----+-----+-----+------------------+