flatten a spark data frame's column values and put it into a variable - scala

Spark version 1.6.0, Scala version 2.10.5.
I have a spark-sql dataframe df like this,
+---------------+-------------------------------+
|address        | attributes                    |
+---------------+-------------------------------+
|1314 44 Avenue | Tours, Mechanics, Shopping    |
|115 25th Ave   | Restaurant, Mechanics, Brewery|
+---------------+-------------------------------+
From this dataframe, I would like the flattened values, as below:
Tours, Mechanics, Shopping, Brewery
If I do this,
df.select(df("attributes")).collect().foreach(println)
I get,
[Tours, Mechanics, Shopping]
[Restaurant, Mechanics, Brewery]
I thought I could use flatMap instead, so I tried to put this into a variable using:
val allValues = df.withColumn(df("attributes"), explode("attributes"))
but I am getting an error:
error: type mismatch;
 found   : org.apache.spark.sql.Column
 required: String
I was thinking if I can get an output using explode I can use distinct to get the unique values after flattening them.
How can I get the desired output?

I strongly recommend you use a Spark 2.x version. In Cloudera, when you issue "spark-shell", it launches the 1.6.x version; however, if you issue "spark2-shell", you get the 2.x shell. Check with your admin.
But if you need a Spark 1.6 RDD solution, try this.
import sqlContext.implicits._  // in Spark 1.6 the entry point is sqlContext, not spark
import scala.collection.mutable.WrappedArray
val df = Seq(("1314 44 Avenue", Array("Tours", "Mechanics", "Shopping")),
("115 25th Ave", Array("Restaurant", "Mechanics", "Brewery"))).toDF("address", "attributes")
df.rdd.flatMap(x => x.getAs[WrappedArray[String]]("attributes")).distinct().collect.foreach(println)
Results:
Brewery
Shopping
Mechanics
Restaurant
Tours
If the "attribute" column is not an array, but comma separated string, then use the below one which gives you same results
val df = Seq(("1314 44 Avenue","Tours,Mechanics,Shopping"),
("115 25th Ave","Restaurant,Mechanics,Brewery")).toDF("address","attributes")
df.rdd.flatMap( x => x.getAs[String]("attributes").split(",") ).distinct().collect.foreach(println)
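To put the flattened values into a variable, as the question asks, collect them to the driver (a minimal sketch along the same lines; collect() is only sensible when the distinct set is small):
// Collect the distinct values into a local array instead of printing them.
val allValues: Array[String] =
  df.rdd.flatMap(x => x.getAs[String]("attributes").split(",")).distinct().collect()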

The problem is that withColumn expects a String as its first argument (which is the name of the added column), but you're passing it a Column: df.withColumn(df("attributes"), ...).
You only need to pass "attributes" as a String.
Additionally, you need to pass a Column to the explode function, but you're passing a String - to make it a Column you can use df("columnName") or the Scala shorthand $ syntax, $"columnName".
Hope this example can help you.
import org.apache.spark.sql.functions._
val allValues = df.select(explode($"attributes").as("attributes")).distinct
Note that this will only preserve the attributes Column, since you want the distinct elements on that one.
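If you then want the values in a local variable, as the question asks, you can collect this DataFrame (a minimal sketch, assuming attributes is an array column as above; collect() brings everything to the driver):
val allValues: Array[String] = df
  .select(explode($"attributes").as("attributes"))
  .distinct
  .collect()
  .map(_.getString(0))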

Related

Adding Column In Spark dataframe

Hi, I am trying to add one column to my Spark dataframe, calculating its value based on an existing dataframe column. I am writing the below code.
val df1=spark.sql("select id,dt1,salary frm dbdt1.tabledt1")
val df2=df1.withColumn("new_date",WHEN (month(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy')))
IN (01,02,03)) THEN
CONCAT(CONCAT(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy')))-1,'-'),
substr(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM-yyyy'))),3,4))
.otherwise(CONCAT(CONCAT(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy'))),'-')
,SUBSTR(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM-yyyy')))+1,3,4))))
But it always shows the error: unclosed character literal. Can someone please guide me on how I should add this new column or modify the existing code?
Incorrect syntax in many places. First, I suggest you look at a few Spark SQL examples online and also at the org.apache.spark.sql.functions API documentation, because your use of WHEN, CONCAT, and IN is all incorrect.
Scala strings are enclosed in double quotes; you appear to be using SQL string syntax.
'dd-MM-yyyy' should be "dd-MM-yyyy"
To reference a column dt1 on DataFrame df1 you can use one of the following:
df1("dt1")
col("dt1") // if you import org.apache.spark.sql.functions.col
$"dt1" // if you import spark.implicits._ locally
For example:
from_unixtime(unix_timestamp(col("dt1")), "dd-MM-yyyy")
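Putting those fixes together, here is a hedged sketch of the question's fiscal-year expression (assuming dt1 holds a unix timestamp, and keeping the question's names and logic; a sketch, not a tested rewrite):
import org.apache.spark.sql.functions._

// Year and month of dt1, following the question's parsing approach.
val yr  = year(to_date(from_unixtime(unix_timestamp(col("dt1")), "dd-MM-yyyy")))
val mon = month(to_date(from_unixtime(unix_timestamp(col("dt1")), "dd-MM-yyyy")))

// Jan-Mar: (year-1)-YY; otherwise: year-(YY+1), mirroring the question's logic.
val df2 = df1.withColumn("new_date",
  when(mon.isin(1, 2, 3),
    concat((yr - 1).cast("string"), lit("-"), substring(yr.cast("string"), 3, 2)))
  .otherwise(
    concat(yr.cast("string"), lit("-"), substring((yr + 1).cast("string"), 3, 2))))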

How can I Count the Number of Rows of the JSON File?

My JSON file below contains six rows:
[
{"events":[[{"v":"INPUT","n":"type"},{"v":"2016-08-24 14:23:12 EST","n":"est"}]],
"apps":[],
"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},
"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.1.18","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"12","n":"cpu"},{"v":"154665","n":"seq"},{"v":"2016-08-24 14:23:17 EST","n":"est"}]
},
{"events":[[{"v":"INPUT","n":"type"},{"v":"2016-08-24 14:23:14 EST","n":"est"}]],"apps":[],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.1.18","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"5","n":"cpu"},{"v":"154666","n":"seq"},{"v":"2016-08-24 14:23:23 EST","n":"est"}]},
{"events":[[{"v":"LOGOFF","n":"type"},{"v":"2016-08-24 14:24:04 EST","n":"est"}]],"apps":[],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.1.18","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"0","n":"cpu"},{"v":"154667","n":"seq"},{"v":"2016-08-24 14:24:05 EST","n":"est"}]},
{"events":[],"apps":[[{"v":"ccSvcHst","n":"pname"},{"v":"7704","n":"pid"},{"v":"Old Virus Definition File","n":"title"},{"v":"O","n":"state"},{"v":"5376","n":"mem"},{"v":"0","n":"cpu"}]],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.0.5","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"29","n":"cpu"},{"v":"154668","n":"seq"},{"v":"2016-09-25 16:57:24 EST","n":"est"}]},
{"events":[],"apps":[[{"v":"ccSvcHst","n":"pname"},{"v":"7704","n":"pid"},{"v":"Old Virus Definition File","n":"title"},{"v":"F","n":"state"},{"v":"5588","n":"mem"},{"v":"0","n":"cpu"}]],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.0.5","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"16","n":"cpu"},{"v":"154669","n":"seq"},{"v":"2016-09-25 16:57:30 EST","n":"est"}]},
{"events":[],"apps":[[{"v":"ccSvcHst","n":"pname"},{"v":"7704","n":"pid"},{"v":"Old Virus Definition File","n":"title"},{"v":"F","n":"state"},{"v":"5588","n":"mem"},{"v":"0","n":"cpu"}]],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.0.5","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"17","n":"cpu"},{"v":"154670","n":"seq"},{"v":"2016-09-25 16:57:36 EST","n":"est"}]}
]
Read in, the file shows up as six records, indexed 0 through 5.
Required Output:
Count
6
OK, you are in Spark, and you need to turn your Json into a dataset and use the appropriate operation on it. So here I wrote the general workflow to go from Json to a dataset, and the required steps, with examples. I think this way of answering is more beneficial, because you can see the steps and then decide what to do with the information.
Input Data: You have the Json; that is the data you should start working on. Then you need to decide which fields are important. Counting on its own is the small part of most cases, and you don't want to load all the fields, which may not be necessary.
Create a Case Class: you can use case classes so that you can deserialize your input data. To keep it simple, I have a doctor who belongs to a department, and I get the data as Json. I could have the following case classes:
case class Department(name: String, address: String)
case class Doctor(name: String, department: Department)
As you can see from the above code, I go bottom-up to create the data I want to work on. In your Json there are loads of fields (e.g., v) whose meaning I can't infer, so be careful not to mix them up.
Have a dataset: OK, the below code deserializes the Json into the case class we defined:
spark.read.json("doctorsData.json").as[Doctor]
A couple of points: spark here is a SparkSession, which you need to create; its instance is called spark, but it could be anything. You also need to import spark.implicits._.
In Business!: OK, now you are in business, in the Spark world. It is just a matter of using count() to count your dataset. The following method shows how to count it:
def recordsCount(myDataset: Dataset[Doctor]): Long = myDataset.count()
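Putting the steps together, a minimal end-to-end sketch might look like this (doctorsData.json and the field names are the assumptions from the example above):
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().appName("json-count").getOrCreate()
import spark.implicits._

// The hypothetical domain from the example above.
case class Department(name: String, address: String)
case class Doctor(name: String, department: Department)

def recordsCount(myDataset: Dataset[Doctor]): Long = myDataset.count()

// Deserialize the Json into a typed Dataset, then count its rows.
val doctors = spark.read.json("doctorsData.json").as[Doctor]
println(recordsCount(doctors))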
Here I have a file of three records with correct formatting, on Spark 2.x, read into a dataframe / dataset:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
val df = spark.read
.option("multiLine", true)
.option("mode", "PERMISSIVE")
.option("inferSchema", true)
.json("/FileStore/tables/json_01.txt")
df.select("*").show(false)
df.printSchema()
df.count()
If you just need a total tally, the last line (df.count()) suffices:
res15: Long = 3

how to define features column in spark ml

I am trying to run the spark logistic regression function (ml not mllib). I have a dataframe which looks like (just the first row shown)
+-----+--------+
|label|features|
+-----+--------+
| 0.0| [60.0]|
(Right now just trying to keep it simple with only one dimension in the feature, but will expand later on.)
I run the following code - taken from the Spark ML documentation
import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
val lrModel = lr.fit(df)
This gives me the error -
org.apache.spark.SparkException: Values to assemble cannot be null.
I'm not sure how to fix this error. I looked at sample_libsvm_data.txt which is in the spark github repo and used in some of the examples in the spark ml documentation. That dataframe looks like
+-----+--------------------+
|label| features|
+-----+--------------------+
| 0.0|(692,[127,128,129...|
| 1.0|(692,[158,159,160...|
| 1.0|(692,[124,125,126...|
Based on this example, my data looks like it should be in the right format, with one issue. Is 692 the number of features? Seems rather dumb if so - spark should just be able to look at the length of the feature vector to see how many features there are. If I do need to add the number of features, how would I do that? (Pretty new to Scala/Java)
Cheers
This error is thrown by VectorAssembler when any of the features are null. Please verify that your rows don't contain null values. If there are null values, you must convert them to a default numeric feature before VectorAssembling.
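For illustration (rawDf and its column "x" are hypothetical names), you could fill in a default before assembling:
import org.apache.spark.ml.feature.VectorAssembler

// Replace nulls in the raw numeric column with 0.0, then assemble it
// into the "features" vector column expected by LogisticRegression.
val cleaned = rawDf.na.fill(0.0, Seq("x"))
val assembled = new VectorAssembler()
  .setInputCols(Array("x"))
  .setOutputCol("features")
  .transform(cleaned)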
Regarding the format of sample_libsvm_data.txt: it is stored in a sparse array/matrix form, where data is represented as:
0 128:51 129:159 130:253
(where 0 is the label and the subsequent columns are in index:numeric_feature format). The 692 in the question is the size of the sparse feature vector, i.e. the total number of features.
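For reference, Spark can read this libsvm format directly (a one-line sketch, using the 1.6-era sqlContext from this answer):
val libsvmDf = sqlContext.read.format("libsvm").load("sample_libsvm_data.txt")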
You can form your single-feature dataframe in the following way, using the Vectors class (I ran it on a 1.6.1 shell):
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.ml.classification.LogisticRegression
val training1 = sqlContext.createDataFrame(Seq(
(1.0, Vectors.dense(3.0)),
(0.0, Vectors.dense(3.0)))
).toDF("label", "features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
val model1 = lr.fit(training1)
For more, you can check examples at: https://spark.apache.org/docs/1.6.1/ml-guide.html#dataframe (Refer to section Code examples)

I need help parsing a file in scala for running a spark job

I'm running a Spark job in Scala and I'm stuck on parsing the input file.
The input file (TAB-separated) is something like:
date=20160701 name=mike age=26
date=20160402 name=john age=33
I want to parse it and extract only values and not the keys, such as,
20160701 mike 26
20160402 john 33
How can this be achieved in SCALA?
I'm using,
SCALA VERSION: 2.11
You can use CSVParser(), and since you know the location of each key, it will be easy and clean.
Test data
val data = "date=20160701\tname=mike\tage=26\ndate=20160402 name=john\tage=33\n"
One statement to do what you asked
val rdd = sc.parallelize(data.split('\n'))
.map(_.split('\t') // split into key=value
.map(_.split('=')(1))) // split those at "=" and select only the value
Display what we got
rdd.collect().foreach(r=>println(r.mkString(",")))
// 20160701,mike,26
// 20160402,john,33
But don't do this for real code. It's very fragile in the face of data format errors, etc. Use CSVParser or something instead as Narendra Parmar suggests.
val rdd = sc.textFile("")
rdd.map(_.split("\t").map(_.split("=")(1)).mkString("\t")).saveAsTextFile("")
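If the data may contain malformed tokens, a slightly more defensive sketch (the path is hypothetical, and the format is assumed to be tab-separated key=value pairs) drops anything that doesn't parse:
// Keep only the value side of well-formed key=value tokens; drop the rest.
val parsed = sc.textFile("input.tsv")
  .map(_.split('\t')
    .flatMap { token =>
      token.split('=') match {
        case Array(_, value) => Some(value)
        case _               => None
      }
    }
    .mkString("\t"))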

dataframe: how to groupBy/count then filter on count in Scala

Spark 1.4.1
I encounter a situation where grouping a dataframe, then counting and filtering on the 'count' column, raises the exception below.
import sqlContext.implicits._
import org.apache.spark.sql._
case class Paf(x:Int)
val myData = Seq(Paf(2), Paf(1), Paf(2))
val df = sc.parallelize(myData, 2).toDF()
Then grouping and filtering:
df.groupBy("x").count()
.filter("count >= 2")
.show()
Throws an exception:
java.lang.RuntimeException: [1.7] failure: ``('' expected but `>=' found
count >= 2
Solution:
Renaming the column makes the problem vanish (as I suspect there is then no conflict with the built-in count function):
df.groupBy("x").count()
.withColumnRenamed("count", "n")
.filter("n >= 2")
.show()
So, is that a behavior to expect, a bug, or is there a canonical way to work around it?
thanks, alex
When you pass a string to the filter function, the string is interpreted as SQL. Count is a SQL keyword and using count as a variable confuses the parser. This is a small bug (you can file a JIRA ticket if you want to).
You can easily avoid this by using a column expression instead of a String:
df.groupBy("x").count()
.filter($"count" >= 2)
.show()
So, is that a behavior to expect, or a bug?
Truth be told, I am not sure. It looks like the parser is interpreting count not as a column name but as a function, and expects parentheses to follow. It looks like a bug, or at least a serious limitation of the parser.
Is there a canonical way to work around it?
Some options have already been mentioned by Herman and mattinbits, so here is a more SQL-ish approach from me:
import org.apache.spark.sql.functions.count
df.groupBy("x").agg(count("*").alias("cnt")).where($"cnt" > 2)
I think a solution is to put count in back ticks
.filter("`count` >= 2")
http://mail-archives.us.apache.org/mod_mbox/spark-user/201507.mbox/%3C8E43A71610EAA94A9171F8AFCC44E351B48EDF#fmsmsx124.amr.corp.intel.com%3E
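With the sample data above (x = 2, 1, 2), any of these variants keeps only the group x = 2 (a quick check; output sketched from the expected show() layout):
df.groupBy("x").count().filter("`count` >= 2").show()
// +---+-----+
// |  x|count|
// +---+-----+
// |  2|    2|
// +---+-----+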