How to use double pipe as delimiter in CSV? - scala

Spark 1.5 and Scala 2.10.6
I have a data file that is using "¦¦" as the delimiter. I am having a hard time parsing through this to create a data frame. Can multiple delimiters be used to create a data frame? The code works with a single broken pipe but not with multiple delimiters.
My Code:
val customSchema_1 = StructType(Array(
StructField("ID", StringType, true),
StructField("FILLER", StringType, true),
StructField("CODE", StringType, true)));
val df_1 = sqlContext.read
.format("com.databricks.spark.csv")
.schema(customSchema_1)
.option("delimiter", "¦¦")
.load("example.txt")
Sample file:
12345¦¦ ¦¦10

I ran into this and found a good solution, I am using spark 2.3, I have a feeling it should work all of spark 2.2+ but have not tested it. The way it works is I replace the || with a tab and then the built in csv can take a Dataset[String] . I used tab because I have commas in my data.
var df = spark.sqlContext.read
.option("header", "true")
.option("inferSchema", "true")
.option("delimiter", "\t")
.csv(spark.sqlContext.read.textFile("filename")
.map(line => line.split("\\|\\|").mkString("\t")))
Hope this helps some else.
EDIT:
As of spark 3.0.1 this works out of the box.
example:
val ds = List("name||id", "foo||12", "brian||34", """"cray||name"||123""", "cray||name||123").toDS
ds: org.apache.spark.sql.Dataset[String] = [value: string]
val csv = spark.read.option("header", "true").option("inferSchema", "true").option("delimiter", "||").csv(ds)
csv: org.apache.spark.sql.DataFrame = [name: string, id: string]
csv.show
+----------+----+
| name| id|
+----------+----+
| foo| 12|
| brian| 34|
|cray||name| 123|
| cray|name|
+----------+----+

So the actual error being emitted here is:
java.lang.IllegalArgumentException: Delimiter cannot be more than one character: ¦¦
The docs corroborate this limitation and I checked the Spark 2.0 csv reader and it has the same requirement.
Given all of this, if your data is simple enough where you won't have entries containing ¦¦, I would load your data like so:
scala> :pa
// Entering paste mode (ctrl-D to finish)
val customSchema_1 = StructType(Array(
StructField("ID", StringType, true),
StructField("FILLER", StringType, true),
StructField("CODE", StringType, true)));
// Exiting paste mode, now interpreting.
customSchema_1: org.apache.spark.sql.types.StructType = StructType(StructField(ID,StringType,true), StructField(FILLER,StringType,true), StructField(CODE,StringType,true))
scala> val rawData = sc.textFile("example.txt")
rawData: org.apache.spark.rdd.RDD[String] = example.txt MapPartitionsRDD[1] at textFile at <console>:31
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val rowRDD = rawData.map(line => Row.fromSeq(line.split("¦¦")))
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[3] at map at <console>:34
scala> val df = sqlContext.createDataFrame(rowRDD, customSchema_1)
df: org.apache.spark.sql.DataFrame = [ID: string, FILLER: string, CODE: string]
scala> df.show
+-----+------+----+
| ID|FILLER|CODE|
+-----+------+----+
|12345| | 10|
+-----+------+----+

We tried to read data having custom delimiters and customizing column names for data frame in following way,
# Hold new column names saparately
headers ="JC_^!~_*>Year_^!~_*>Date_^!~_*>Service_Type^!~_*>KMs_Run^!~_*>
# '^!~_*>' This is field delimiter, so split string
head = headers.split("^!~_*>")
## Below command splits the S3 file with custom delimiter and converts into Dataframe
df = sc.textFile("s3://S3_Path/sample.txt").map(lambda x: x.split("^!~_*>")).toDF(head)
Passing head as parameter in toDF() assign new column names to dataframe created from text file having custom delimiters.
Hope this helps.

Starting from Spark2.8 and above support of multiple character delimiter has been added.
https://issues.apache.org/jira/browse/SPARK-24540
The above solution proposed by #lockwobr works in scala. Whoever working below Spark 2.8 and looking out for solution in PySpark you can refer to the below
ratings_schema = StructType([
StructField("user_id", StringType(), False)
, StructField("movie_id", StringType(), False)
, StructField("rating", StringType(), False)
, StructField("rating_timestamp", StringType(), True)
])
#movies_df = spark.read.csv("ratings.dat", header=False, sep="::", schema=ratings_schema)
movies_df = spark.createDataFrame(
spark.read.text("ratings.dat").rdd.map(lambda line: line[0].split("::")),
ratings_schema)
i have provided an example but you can modify it for your logic.

Related

what is the order guarantee when joining two columns of a spark dataframe which are processed separately?

I have dataframe with 3 columns
date
jsonString1
jsonString2
I want to expand attributes inside json into columns. so i did something like this.
val json1 = spark.read.json(dataframe.select(col("jsonString1")).rdd.map(_.getString(0)))
val json2 = spark.read.json(dataframe.select(col("jsonString2")).rdd.map(_.getString(0)))
val json1Table = json1.selectExpr("id", "status")
val json2Table = json2.selectExpr("name", "address")
now i want to put these table together. so i did the following
val json1TableWithIndex = addColumnIndex(json1Table)
val json2TableWithIndex = addColumnIndex(json2Table)
var finalResult = json1Table
.join(json2Table, Seq("columnindex"))
.drop("columnindex")
def addColumnIndex(df: DataFrame) = spark.createDataFrame(
df.rdd.zipWithIndex.map { case (row, columnindex) => Row.fromSeq(row.toSeq :+ columnindex) },
StructType(df.schema.fields :+ StructField("columnindex", LongType, false))
)
After sampling few rows I observe that rows match exactly as in the source dataframe
I did not find any information on the order guarantee when joining two columns of a dataframe which are processed separately. Is this the right way to solve my problem. Any help is appreciated.
It is always risky to rely on undocumented behaviours, as your code might not work as you intended, because you only have a partial understanding of it.
You can do the same thing in a much more efficient way without using any split and join approach. Use a from_json function to create two nested columns and then flatten out the nested columns and finally drop out the intermediate JSON string columns and nested columns.
Here is an example fo the whole process.
import org.apache.spark.sql.types.{StringType, StructType, StructField}
val df = (Seq(
("09-02-2020","{\"id\":\"01\", \"status\":\"Active\"}","{\"name\":\"Abdullah\", \"address\":\"Jumeirah\"}"),
("10-02-2020","{\"id\":\"02\", \"status\":\"Dormant\"}","{\"name\":\"Ali\", \"address\":\"Jebel Ali\"}")
).toDF("date","jsonString1","jsonString2"))
scala> df.show()
+----------+--------------------+--------------------+
| date| jsonString1| jsonString2|
+----------+--------------------+--------------------+
|09-02-2020|{"id":"01", "stat...|{"name":"Abdullah...|
|10-02-2020|{"id":"02", "stat...|{"name":"Ali", "a...|
+----------+--------------------+--------------------+
val schema1 = (StructType(Seq(
StructField("id", StringType, true),
StructField("status", StringType, true)
)))
val schema2 = (StructType(Seq(
StructField("name", StringType, true),
StructField("address", StringType, true)
)))
val dfFlattened = (df.withColumn("jsonData1", from_json(col("jsonString1"), schema1))
.withColumn("jsonData2", from_json(col("jsonString2"), schema2))
.withColumn("id", col("jsonData1.id"))
.withColumn("status", col("jsonData1.status"))
.withColumn("name", col("jsonData2.name"))
.withColumn("address", col("jsonData2.address"))
.drop("jsonString1")
.drop("jsonString2")
.drop("jsonData1")
.drop("jsonData2"))
scala> dfFlattened.show()
+----------+---+-------+--------+---------+
| date| id| status| name| address|
+----------+---+-------+--------+---------+
|09-02-2020| 01| Active|Abdullah| Jumeirah|
|10-02-2020| 02|Dormant| Ali|Jebel Ali|
+----------+---+-------+--------+---------+

Transform structured data to JSON format using Spark Scala

I've my "Structured data" as shown below, I need to transform it to the below shown "Expected results" type. My "Output schema" is shown as well. Appreciate if you can provide some help on how I can achieve this using Spark Scala code.
Note: Grouping on the structured data to be done the columns SN and VIN.
There should be one row for the same SN and VIN, if either SN or VIN changes, then data to be present in the next row.
Structured data:
+-----------------+-------------+--------------------+---+
|VIN |ST |SV |SN |
|FU74HZ501740XXXXX|1566799999225|44.0 |APP|
|FU74HZ501740XXXXX|1566800002758|61.0 |APP|
|FU74HZ501740XXXXX|1566800009446|23.39 |ASP|
Expected results:
Output schema:
val outputSchema = StructType(
List(
StructField("VIN", StringType, true),
StructField("EVENTS", ArrayType(
StructType(Array(
StructField("SN", StringType, true),
StructField("ST", IntegerType, true),
StructField("SV", DoubleType, true)
))))
)
)
From Spark 2.1 you can achieve this using struct and collect_list.
val df_2 = Seq(
("FU74HZ501740XXXX",1566799999225.0,44.0,"APP"),
("FU74HZ501740XXXX",1566800002758.0,61.0,"APP"),
("FU74HZ501740XXXX",1566800009446.0,23.39,"ASP")
).toDF("vin","st","sv","sn")
df_2.show(false)
+----------------+-----------------+-----+---+
|vin |st |sv |sn |
+----------------+-----------------+-----+---+
|FU74HZ501740XXXX|1.566799999225E12|44.0 |APP|
|FU74HZ501740XXXX|1.566800002758E12|61.0 |APP|
|FU74HZ501740XXXX|1.566800009446E12|23.39|ASP|
+----------------+-----------------+-----+---+
Use collect_list with struct:
df_2.groupBy("vin","sn")
.agg(collect_list(struct($"st", $"sv",$"sn")).as("events"))
.withColumn("events",to_json($"events"))
.drop(col("sn"))
This will give the wwanted output:
+----------------+---------------------------------------------------------------------------------------------+
|vin |events |
+----------------+---------------------------------------------------------------------------------------------+
|FU74HZ501740XXXX|[{"st":1.566800009446E12,"sv":23.39,"sn":"ASP"}] |
|FU74HZ501740XXXX|[{"st":1.566799999225E12,"sv":44.0,"sn":"APP"},{"st":1.566800002758E12,"sv":61.0,"sn":"APP"}]|
+----------------+---------------------------------------------------------------------------------------------+
You can get it via SparkSession.
val df = spark.read.json("/path/to/json/file/test.json")
here spark is the SparkSession object

Reading comma separated text file in spark 1.6

I have a text file which is similar to below
20190920
123456789,6325,NN5555,123,4635,890,C,9
985632465,6467,KK6666,654,9780,636,B,8
258063464,6754,MM777,789,9461,895,N,5
And I am using spark 1.6 with scala to read this text file
val df = sqlcontext.read.option("com.databricks.spark.csv")
.option("header","false").option("inferSchema","false").load(path)
df.show()
When I used above command to read it is reading only first column. Is there anything to add to read that file with all column values.
Output I got :
20190920
123456789
985632465
258063464
3
In this case you should provide schema,So your code will look like this
val mySchema = StructType(
List(
StructField("col1", StringType, true),
StructField("col2", StringType, true),
// and other columns ...
)
)
val df = sqlcontext.read
.schema(mySchema)
.option("com.databricks.spark.csv")
.option("header","false")
.option("inferSchema","false")
.load(path)

columnSimilarities() back to Spark Data Frame

I have a second question around CosineSimilarity / ColumnSimilarities in Spark 2.1. I'm kinda new to scala and all the Spark environment and this is not really clear to me:
How can I get back the ColumnSimilarities for each combination of columns from the rowMatrix in spark. Here is what I tried:
Data:
import org.apache.spark.sql.{SQLContext, Row, DataFrame}
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DoubleType}
import org.apache.spark.sql.functions._
// rdd
val rowsRdd: RDD[Row] = sc.parallelize(
Seq(
Row(2.0, 7.0, 1.0),
Row(3.5, 2.5, 0.0),
Row(7.0, 5.9, 0.0)
)
)
// Schema
val schema = new StructType()
.add(StructField("item_1", DoubleType, true))
.add(StructField("item_2", DoubleType, true))
.add(StructField("item_3", DoubleType, true))
// Data frame
val df = spark.createDataFrame(rowsRdd, schema)
Code:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix, RowMatrix}
val rows = new VectorAssembler().setInputCols(df.columns).setOutputCol("vs")
.transform(df)
.select("vs")
.rdd
val items_mllib_vector = rows.map(_.getAs[org.apache.spark.ml.linalg.Vector](0))
.map(org.apache.spark.mllib.linalg.Vectors.fromML)
val mat = new RowMatrix(items_mllib_vector)
val simsPerfect = mat.columnSimilarities()
println("Pairwise similarities are: " + simsPerfect.entries.collect.mkString(", "))
Output:
Pairwise similarities are: MatrixEntry(0,2,0.24759378423606918), MatrixEntry(1,2,0.7376189553526812), MatrixEntry(0,1,0.8355316482961213)
So What I get is simsPerfect org.apache.spark.mllib.linalg.distributed.CoordinateMatrix of my Columns and similarities. How would I transform this back to a dataframe and get the right columns names with it?
My preferred output:
item_from | item_to | similarity
1 | 2 | 0.83 |
1 | 3 | 0.24 |
2 | 3 | 0.73 |
Thanks in advance
This approach also works without converting the row to String:
val transformedRDD = simsPerfect.entries.map{case MatrixEntry(row: Long, col:Long, sim:Double) => (row,col,sim)}
val dff = sqlContext.createDataFrame(transformedRDD).toDF("item_from", "item_to", "sim")
where, I assume val sqlContext = new org.apache.spark.sql.SQLContext(sc) is defined already and sc is the SparkContext.
I found a solution for my problem:
//Transform result to rdd
val transformedRDD = simsPerfect.entries.map{case MatrixEntry(row: Long, col:Long, sim:Double) => Array(row,col,sim).mkString(",")}
//Transform rdd[String] to rdd[Row]
val rdd2 = transformedRDD.map(a => Row(a))
// to DF
val dfschema = StructType(Array(StructField("value",StringType)))
val rddToDF = spark.createDataFrame(rdd2,dfschema)
//create new DF with schema
val newdf = rddToDF.select(expr("(split(value, ','))[0]").cast("string").as("item_from")
,expr("(split(value, ','))[1]").cast("string").as("item_to")
,expr("(split(value, ','))[2]").cast("string").as("sim"))
I'm sure there is another easier way to do this, but I'm happy that it works.

Provide schema while reading csv file as a dataframe in Scala Spark

I am trying to read a csv file into a dataframe. I know what the schema of my dataframe should be since I know my csv file. Also I am using spark csv package to read the file. I trying to specify the schema like below.
val pagecount = sqlContext.read.format("csv")
.option("delimiter"," ").option("quote","")
.option("schema","project: string ,article: string ,requests: integer ,bytes_served: long")
.load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
But when I check the schema of the data frame I created, it seems to have taken its own schema. Am I doing anything wrong ? how to make spark to pick up the schema I mentioned ?
> pagecount.printSchema
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)
Try the below code, you need not specify the schema. When you give inferSchema as true it should take it from your csv file.
val pagecount = sqlContext.read.format("csv")
.option("delimiter"," ").option("quote","")
.option("header", "true")
.option("inferSchema", "true")
.load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
If you want to manually specify the schema, you can do it as below:
import org.apache.spark.sql.types._
val customSchema = StructType(Array(
StructField("project", StringType, true),
StructField("article", StringType, true),
StructField("requests", IntegerType, true),
StructField("bytes_served", DoubleType, true))
)
val pagecount = sqlContext.read.format("csv")
.option("delimiter"," ").option("quote","")
.option("header", "true")
.schema(customSchema)
.load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
For those interested in doing this in Python here is a working version.
customSchema = StructType([
StructField("IDGC", StringType(), True),
StructField("SEARCHNAME", StringType(), True),
StructField("PRICE", DoubleType(), True)
])
productDF = spark.read.load('/home/ForTesting/testProduct.csv', format="csv", header="true", sep='|', schema=customSchema)
testProduct.csv
ID|SEARCHNAME|PRICE
6607|EFKTON75LIN|890.88
6612|EFKTON100HEN|55.66
Hope this helps.
I'm using the solution provided by Arunakiran Nulu in my analysis (see the code). Despite it is able to assign the correct types to the columns, all the values returned are null. Previously, I've tried to the option .option("inferSchema", "true") and it returns the correct values in the dataframe (although different type).
val customSchema = StructType(Array(
StructField("numicu", StringType, true),
StructField("fecha_solicitud", TimestampType, true),
StructField("codtecnica", StringType, true),
StructField("tecnica", StringType, true),
StructField("finexploracion", TimestampType, true),
StructField("ultimavalidacioninforme", TimestampType, true),
StructField("validador", StringType, true)))
val df_explo = spark.read
.format("csv")
.option("header", "true")
.option("delimiter", "\t")
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss")
.schema(customSchema)
.load(filename)
Result
root
|-- numicu: string (nullable = true)
|-- fecha_solicitud: timestamp (nullable = true)
|-- codtecnica: string (nullable = true)
|-- tecnica: string (nullable = true)
|-- finexploracion: timestamp (nullable = true)
|-- ultimavalidacioninforme: timestamp (nullable = true)
|-- validador: string (nullable = true)
and the table is:
|numicu|fecha_solicitud|codtecnica|tecnica|finexploracion|ultimavalidacioninforme|validador|
+------+---------------+----------+-------+--------------+-----------------------+---------+
| null| null| null| null| null| null| null|
| null| null| null| null| null| null| null|
| null| null| null| null| null| null| null|
| null| null| null| null| null| null| null|
The previous solutions have used the custom StructType.
With spark-sql 2.4.5 (scala version 2.12.10) it is now possible to specify the schema as a string using the schema function
import org.apache.spark.sql.SparkSession;
val sparkSession = SparkSession.builder()
.appName("sample-app")
.master("local[2]")
.getOrCreate();
val pageCount = sparkSession.read
.format("csv")
.option("delimiter","|")
.option("quote","")
.schema("project string ,article string ,requests integer ,bytes_served long")
.load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
Thanks to the answer by #Nulu, it works for pyspark with minimal tweaking
from pyspark.sql.types import LongType, StringType, StructField, StructType, BooleanType, ArrayType, IntegerType
customSchema = StructType(Array(
StructField("project", StringType, true),
StructField("article", StringType, true),
StructField("requests", IntegerType, true),
StructField("bytes_served", DoubleType, true)))
pagecount = sc.read.format("com.databricks.spark.csv")
.option("delimiter"," ")
.option("quote","")
.option("header", "false")
.schema(customSchema)
.load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
schema definition as simple string
Just in case if some one is interested in schema definition as simple string with date and time stamp
data file creation from Terminal or shell
echo "
2019-07-02 22:11:11.000999, 01/01/2019, Suresh, abc
2019-01-02 22:11:11.000001, 01/01/2020, Aadi, xyz
" > data.csv
Defining the schema as String
user_schema = 'timesta TIMESTAMP,date DATE,first_name STRING , last_name STRING'
reading the data
df = spark.read.csv(path='data.csv', schema = user_schema, sep=',', dateFormat='MM/dd/yyyy',timestampFormat='yyyy-MM-dd HH:mm:ss.SSSSSS')
df.show(10, False)
+-----------------------+----------+----------+---------+
|timesta |date |first_name|last_name|
+-----------------------+----------+----------+---------+
|2019-07-02 22:11:11.999|2019-01-01| Suresh | abc |
|2019-01-02 22:11:11.001|2020-01-01| Aadi | xyz |
+-----------------------+----------+----------+---------+
Please note defining the schema explicitly instead of letting spark infer the schema also improves the spark read performance.
Here's how you can work with a custom schema, a complete demo:
$> shell code,
echo "
Slingo, iOS
Slingo, Android
" > game.csv
Scala code:
import org.apache.spark.sql.types._
val customSchema = StructType(Array(
StructField("game_id", StringType, true),
StructField("os_id", StringType, true)
))
val csv_df = spark.read.format("csv").schema(customSchema).load("game.csv")
csv_df.show
csv_df.orderBy(asc("game_id"), desc("os_id")).show
csv_df.createOrReplaceTempView("game_view")
val sort_df = sql("select * from game_view order by game_id, os_id desc")
sort_df.show
if your spark version is 3.0.1, you can use following Scala scripts:
val df = spark.read.format("csv").option("delimiter",",").option("header",true).load("file:///LOCAL_CSV_FILE_PATH")
but in this way, all datatypes will be set as String.
// import Library
import java.io.StringReader ;
import au.com.bytecode.opencsv.CSVReader
//filename
var train_csv = "/Path/train.csv";
//read as text file
val train_rdd = sc.textFile(train_csv)
//use string reader to convert in proper format
var full_train_data = train_rdd.map{line => var csvReader = new CSVReader(new StringReader(line)) ; csvReader.readNext(); }
//declares types
type s = String
// declare case class for schema
case class trainSchema (Loan_ID :s ,Gender :s, Married :s, Dependents :s,Education :s,Self_Employed :s,ApplicantIncome :s,CoapplicantIncome :s,
LoanAmount :s,Loan_Amount_Term :s, Credit_History :s, Property_Area :s,Loan_Status :s)
//create DF RDD with custom schema
var full_train_data_with_schema = full_train_data.mapPartitionsWithIndex{(idx,itr)=> if (idx==0) itr.drop(1);
itr.toList.map(x=> trainSchema(x(0),x(1),x(2),x(3),x(4),x(5),x(6),x(7),x(8),x(9),x(10),x(11),x(12))).iterator }.toDF
In pyspark 2.4 onwards, you can simply use header parameter to set the correct header:
data = spark.read.csv('data.csv', header=True)
Similarly, if using scala you can use header parameter as well.
You can also do like this by using sparkSession and implicit
import sparkSession.implicits._
val pagecount:DataFrame = sparkSession.read
.option("delimiter"," ")
.option("quote","")
.option("inferSchema","true")
.csv("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
.toDF("project","article","requests","bytes_served")
This is one of option where we can pass the column names to the dataframe while loading CSV.
import pandas
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv("C:/Users/NS00606317/Downloads/Iris.csv", names=names, header=0)
print(dataset.head(10))
Output
sepal-length sepal-width petal-length petal-width class
1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3.0 1.4 0.2 Iris-setosa
3 4.7 3.2 1.3 0.2 Iris-setosa
4 4.6 3.1 1.5 0.2 Iris-setosa
5 5.0 3.6 1.4 0.2 Iris-setosa
6 5.4 3.9 1.7 0.4 Iris-setosa
7 4.6 3.4 1.4 0.3 Iris-setosa
8 5.0 3.4 1.5 0.2 Iris-setosa
9 4.4 2.9 1.4 0.2 Iris-setosa
10 4.9 3.1 1.5 0.1 Iris-setosa
here my solution is:
import org.apache.spark.sql.types._
val spark = org.apache.spark.sql.SparkSession.builder.
master("local[*]").
appName("Spark CSV Reader").
getOrCreate()
val movie_rating_schema = StructType(Array(
StructField("UserID", IntegerType, true),
StructField("MovieID", IntegerType, true),
StructField("Rating", DoubleType, true),
StructField("Timestamp", TimestampType, true)))
val df_ratings: DataFrame = spark.read.format("csv").
option("header", "true").
option("mode", "DROPMALFORMED").
option("delimiter", ",").
//option("inferSchema", "true").
option("nullValue", "null").
schema(movie_rating_schema).
load(args(0)) //"file:///home/hadoop/spark-workspace/data/ml-20m/ratings.csv"
val movie_avg_scores = df_ratings.rdd.map(_.toString()).
map(line => {
// drop "[", "]" and then split the str
val fileds = line.substring(1, line.length() - 1).split(",")
//extract (movie id, average rating)
(fileds(1).toInt, fileds(2).toDouble)
}).
groupByKey().
map(data => {
val avg: Double = data._2.sum / data._2.size
(data._1, avg)
})