How can I count the number of rows of a JSON file? - scala

My JSON file below contains six rows:
[
{"events":[[{"v":"INPUT","n":"type"},{"v":"2016-08-24 14:23:12 EST","n":"est"}]],
"apps":[],
"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},
"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.1.18","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"12","n":"cpu"},{"v":"154665","n":"seq"},{"v":"2016-08-24 14:23:17 EST","n":"est"}]
},
{"events":[[{"v":"INPUT","n":"type"},{"v":"2016-08-24 14:23:14 EST","n":"est"}]],"apps":[],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.1.18","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"5","n":"cpu"},{"v":"154666","n":"seq"},{"v":"2016-08-24 14:23:23 EST","n":"est"}]},
{"events":[[{"v":"LOGOFF","n":"type"},{"v":"2016-08-24 14:24:04 EST","n":"est"}]],"apps":[],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.1.18","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"0","n":"cpu"},{"v":"154667","n":"seq"},{"v":"2016-08-24 14:24:05 EST","n":"est"}]},
{"events":[],"apps":[[{"v":"ccSvcHst","n":"pname"},{"v":"7704","n":"pid"},{"v":"Old Virus Definition File","n":"title"},{"v":"O","n":"state"},{"v":"5376","n":"mem"},{"v":"0","n":"cpu"}]],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.0.5","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"29","n":"cpu"},{"v":"154668","n":"seq"},{"v":"2016-09-25 16:57:24 EST","n":"est"}]},
{"events":[],"apps":[[{"v":"ccSvcHst","n":"pname"},{"v":"7704","n":"pid"},{"v":"Old Virus Definition File","n":"title"},{"v":"F","n":"state"},{"v":"5588","n":"mem"},{"v":"0","n":"cpu"}]],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.0.5","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"16","n":"cpu"},{"v":"154669","n":"seq"},{"v":"2016-09-25 16:57:30 EST","n":"est"}]},
{"events":[],"apps":[[{"v":"ccSvcHst","n":"pname"},{"v":"7704","n":"pid"},{"v":"Old Virus Definition File","n":"title"},{"v":"F","n":"state"},{"v":"5588","n":"mem"},{"v":"0","n":"cpu"}]],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.0.5","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"17","n":"cpu"},{"v":"154670","n":"seq"},{"v":"2016-09-25 16:57:36 EST","n":"est"}]}
]
The JSON corresponds to six records, indexed like this:

JSON
0
1
2
3
4
5

Required output:

Count
6

OK, you are in Spark, so you need to turn your JSON into a Dataset and then use the appropriate operation on it. Below is the general workflow from JSON to Dataset, with the required steps and examples. I think this way of answering is more useful because you can see the steps and then decide what to do with the information.
Input data: You have the JSON; that is the data you start working with. Then you need to decide which fields are important. Counting on its own is a small part of most cases, and you don't want to load fields that may not be necessary.
Create a case class: case classes let you deserialize your input data into typed objects. To keep it simple, suppose I have a doctor who belongs to a department, and I receive the data as JSON. I could have the following case classes:
case class Department(name: String, address: String)
case class Doctor(name: String, department: Department)
As you can see from the code above, I go bottom up to build the types I want to work with. In your JSON there are lots of fields (e.g., v) whose meaning I can't tell, so be careful not to mix them up.
Have a Dataset: the code below deserializes the JSON into the case class we defined:
spark.read.json("doctorsData.json").as[Doctor]
A couple of points: spark is a SparkSession, which you need to create; here the instance is named spark, but it could be called anything. You also need to import spark.implicits._.
In business!: Now you are in the Spark world, and it is just a matter of calling count() on your Dataset. The following method shows how to count it:
def recordsCount(myDataset: Dataset[Doctor]): Long = myDataset.count()
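Putting this together for the JSON in the question, here is a minimal sketch (assuming Spark 2.x; the field names come from the sample, but the file path and the element type of the empty calls array are assumptions):
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

// Field names (v, n, events, apps, agent, calls, info, header) follow the sample JSON.
case class NV(v: String, n: String)
case class Agent(calls: Seq[Seq[NV]], info: Seq[NV])   // "calls" is empty in the sample; element type assumed
case class Record(events: Seq[Seq[NV]], apps: Seq[Seq[NV]], agent: Agent, header: Seq[NV])

val spark = SparkSession.builder().appName("count-json").getOrCreate()   // in spark-shell, spark already exists
import spark.implicits._

val records: Dataset[Record] = spark.read
  .schema(Encoders.product[Record].schema)   // impose the case-class schema instead of inferring it
  .option("multiLine", true)                 // the sample is one JSON array spanning many lines
  .json("/path/to/events.json")              // assumed path
  .as[Record]

println(records.count())                     // expected: 6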

Here is a file of three records I have, with correct formatting, read with Spark 2.x into a DataFrame / Dataset:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
val df = spark.read
.option("multiLine", true)
.option("mode", "PERMISSIVE")
.option("inferSchema", true)
.json("/FileStore/tables/json_01.txt")
df.select("*").show(false)
df.printSchema()
df.count()
If you just need a total tally, the last line (df.count()) suffices:
res15: Long = 3

Related

Saving pyspark dataframe with complicated schema in plain text for testing

How do I make clean test data for pyspark? I have figured something out that seems pretty good, but parts seem a little awkward, so I'm posting.
Let's say I have a dataframe df with a complicated schema and a small number of rows. I want test data checked into my repo, and I don't want a binary file. At this point, I'm not sure of the best way to proceed, but I'm thinking I have a file like
test_fn.py
and it has this in it
schema_str='struct<eventTimestamp:timestamp,list_data:array<struct<valueA:string,valueB:string,valueC:boolean>>>'
I get the schema in text format using the df.schema.simpleString() function (the code here assumes import pyspark.sql.functions as F and import pyspark.sql.types as T). Then to get the rows, I do
lns = [row.json_txt for row in df.select((F.to_json(F.struct('*'))).alias('json_txt')).collect()]
Now I put those lines in my test_fn.py file, or I could have a .json file in the repo.
Now to run the test, I have to make a dataframe with the correct schema and data from this text. It seems the only way Spark will parse the simple string is if I create a dataframe with it; that is, I can't pass that simple string to the from_json function. So this is a little awkward, which is why I thought I'd post:
schema2 = spark.createDataFrame(data=[], schema=schema_str).schema
lns = ...  # say I read the lns back from above
df_txt = spark.createDataFrame(data=lns, schema=T.StringType())
I see df_txt just has one column called 'value'
df_json = df_txt.select(F.from_json('value', schema=schema2).alias('xx'))
sel = ['xx.%s' % nm for nm in df_json.select('xx').schema.fields[0].dataType.fieldNames()]
df2 = df_json.select(*sel)
Now df2 should be the same as the original df - which I see is the case using the deepdiff module.

flatten a spark data frame's column values and put it into a variable

Spark version 1.6.0, Scala version 2.10.5.
I have a spark-sql dataframe df like this:
+--------------+------------------------------+
|address       |attributes                    |
+--------------+------------------------------+
|1314 44 Avenue|Tours, Mechanics, Shopping    |
|115 25th Ave  |Restaurant, Mechanics, Brewery|
+--------------+------------------------------+
From this dataframe, I would like to get the values below:
Tours, Mechanics, Shopping, Brewery
If I do this,
df.select(df("attributes")).collect().foreach(println)
I get,
[Tours, Mechanics, Shopping]
[Restaurant, Mechanics, Brewery]
I thought I could use flatMap instead. I found this, so I tried to put this into a variable using
val allValues = df.withColumn(df("attributes"), explode("attributes"))
but I am getting an error:
error: type mismatch;
 found   : org.apache.spark.sql.Column
 required: String
I was thinking if I can get an output using explode I can use distinct to get the unique values after flattening them.
How can I get the desired output?
I strongly recommend you use a Spark 2.x version. In Cloudera, when you issue "spark-shell", it launches the 1.6.x version; however, if you issue "spark2-shell", you get the 2.x shell. Check with your admin.
But if you need a Spark 1.6 / RDD solution, try this:
import spark.implicits._
import scala.collection.mutable   // so that mutable.WrappedArray resolves below
val df = Seq(("1314 44 Avenue",Array("Tours", "Mechanics", "Shopping")),
("115 25th Ave",Array("Restaurant", "Mechanics", "Brewery"))).toDF("address","attributes")
df.rdd.flatMap( x => x.getAs[mutable.WrappedArray[String]]("attributes") ).distinct().collect.foreach(println)
Results:
Brewery
Shopping
Mechanics
Restaurant
Tours
If the "attribute" column is not an array, but comma separated string, then use the below one which gives you same results
val df = Seq(("1314 44 Avenue","Tours,Mechanics,Shopping"),
("115 25th Ave","Restaurant,Mechanics,Brewery")).toDF("address","attributes")
df.rdd.flatMap( x => x.getAs[String]("attributes").split(",") ).distinct().collect.foreach(println)
The problem is that withColumn expects a String as its first argument (which is the name of the added column), but you're passing it a Column here: df.withColumn(df("attributes"), ...).
You only need to pass "attributes" as a String.
Additionally, you need to pass a Column to the explode function, but you're passing a String - to make it a Column you can use df("columnName") or the Scala shorthand $ syntax, $"columnName".
Hope this example can help you.
import org.apache.spark.sql.functions._
val allValues = df.select(explode($"attributes").as("attributes")).distinct
Note that this will only preserve the attributes Column, since you want the distinct elements on that one.
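If you then want the distinct values in a Scala variable, as the question asks, here is a minimal follow-up sketch (assuming Spark 2.x and that spark.implicits._ is in scope):
import org.apache.spark.sql.functions._

// Collect the distinct exploded values back to the driver as a local collection.
val allValues: Array[String] = df
  .select(explode($"attributes").as("attributes"))
  .distinct()
  .as[String]      // single string column -> Dataset[String]
  .collect()

println(allValues.mkString(", "))   // e.g. Tours, Mechanics, Shopping, Restaurant, Brewery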

What is the fastest way to transform a very large JSON file with Spark?

I have a rather large JSON file (Amazon product data) containing many individual JSON objects. Those objects contain text that I want to preprocess for a specific training task, and it is the preprocessing that I need to speed up here. One JSON object looks like this:
{
"reviewerID": "A2SUAM1J3GNN3B",
"asin": "0000013714",
"reviewerName": "J. McDonald",
"helpful": [2, 3],
"reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
"overall": 5.0,
"summary": "Heavenly Highway Hymns",
"unixReviewTime": 1252800000,
"reviewTime": "09 13, 2009"
}
The task would be to extract reviewText from each JSON object and perform some preprocessing like lemmatizing etc.
My problem is that I don't know how I could use Spark to speed this task up on a cluster. I am actually not even sure whether I can read that JSON file as a stream, object by object, and parallelize the main task.
What would be the best way to get started with this?
As you have a single JSON object per line, you can use SparkContext's textFile to get an RDD[String] of lines. Then use map to parse each JSON object with something like json4s and extract the necessary field.
Your whole code will look as simple as this (assuming you have a SparkContext named sc):
import org.json4s._
import org.json4s.jackson.JsonMethods._
implicit def formats = DefaultFormats
val r = sc.textFile("input_path").map(l => (parse(l) \ "reviewText").extract[String])
You can use a JSON dataset and then execute a simple SQL query to retrieve the reviewText column's values:
// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files.
val path = "path/reviews.json"
val people = sqlContext.read.json(path)
// Register this DataFrame as a table.
people.registerTempTable("reviews")
val reviewTexts = sqlContext.sql("SELECT reviewText FROM reviews")
Built from examples at the SparkSQL docs (http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets)
I would load the JSON data into a DataFrame and then select the field I need; you can also use map to apply preprocessing such as lemmatising.
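To make that concrete, here is a minimal sketch, assuming Spark 2.x, one JSON object per line, and a hypothetical toLemmas placeholder standing in for the real lemmatisation step:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("review-preprocessing").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for real lemmatisation: lowercase and split on non-word characters.
def toLemmas(text: String): String =
  text.toLowerCase.split("\\W+").filter(_.nonEmpty).mkString(" ")

val reviews = spark.read.json("reviews.json")    // assumed path; one JSON object per line
val lemmas = reviews
  .select($"reviewText".as[String])              // keep only the column we need
  .map(toLemmas)                                 // apply the preprocessing to each review

lemmas.show(3, truncate = false)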

Is it possible manipulating timestamp/date in SparkSQL(1.3.0) ?

Since I'm new to Spark (1.3.0), I'm trying to figure out what is possible to do with it, especially Spark SQL.
I'm stuck on timestamp/date formats and I can't get past this obstacle when it comes to operating on these datatypes.
Are there any available operations for these datatypes?
All I can do at the moment is a simple cast from string to timestamp:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Log(visitor: String, visit_date: String, page: String)
val log = (triple.map(p => Log(p._1,p._2,p._3))).toDF()
log.registerTempTable("logs")
val logSessions= sqlContext.sql("SELECT visitor" +
" ,cast(visit_date as timestamp)" +
" ,page" +
" FROM logs"
)
logSessions.foreach(println)
I'm trying to use different "custom SQL" operations on this timestamp (cast from string) but I get nothing but errors.
For example: can I add 30 minutes to my timestamps? How?
Maybe I'm missing something but I can't find any documentation on this topic.
Thanks in advance!
FF
I am looking for the same thing. Spark 1.5 has added some built-in functions:
https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html
But for previous versions, or for something more specific, it looks like you need to implement a UDF.
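For example, to add 30 minutes to a timestamp on a pre-1.5 version, you could register a UDF. A minimal sketch, assuming the sqlContext and the "logs" temp table from the question (the add_minutes name is just an illustration):
import java.sql.Timestamp

// Adds a fixed number of minutes to a timestamp value.
sqlContext.udf.register("add_minutes", (ts: Timestamp, minutes: Int) =>
  new Timestamp(ts.getTime + minutes * 60L * 1000L))

val shifted = sqlContext.sql(
  "SELECT visitor, add_minutes(cast(visit_date as timestamp), 30) AS visit_date_plus_30 FROM logs")
shifted.show()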

Spark: RDD.saveAsTextFile when using a pair of (K,Collection[V])

I have a dataset of employees and their leave records. Every record (of type EmployeeRecord) contains an EmpID (of type String) and other fields. I read the records from a file and then group them by EmpID:
val empRecords = sc.textFile(args(0))
....
val empsGroupedByEmpID = this.groupRecordsByEmpID(empRecords)
At this point, 'empsGroupedByEmpID' is of type RDD[(String, Iterable[EmployeeRecord])]. I transform this into PairRDDFunctions:
val empsAsPairRDD = new PairRDDFunctions[String,Iterable[EmployeeRecord]](empsGroupedByEmpID)
Then I process the records as per the logic of the application. Finally, I get an RDD of type RDD[Iterable[EmployeeRecord]]:
val finalRecords: RDD[Iterable[EmployeeRecord]] = <result of a few computations and transformation>
When I try to write the contents of this RDD to a text file using the available API thus:
finalRecords.saveAsTextFile("./path/to/save")
I find that in the file every record begins with ArrayBuffer(...). What I need is a file with one EmployeeRecord in each line. Is that not possible? Am I missing something?
I have spotted the missing API. It is well...flatMap! :-)
By using flatMap with identity, I can get rid of the Iterable and 'unpack' the contents, like so:
finalRecords.flatMap(identity).saveAsTextFile("./path/to/file")
That solves the problem I have been having.
I also found this post suggesting the same thing. I wish I had seen it a bit earlier.
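If you also want control over how each record is rendered on its line (saveAsTextFile just calls toString on each element), you can map to a formatted string after the flatMap. A minimal sketch with a hypothetical, simplified EmployeeRecord:
case class EmployeeRecord(empId: String, leaveDays: Int)   // hypothetical; the real record has more fields

finalRecords
  .flatMap(identity)                          // unpack each Iterable[EmployeeRecord]
  .map(r => s"${r.empId},${r.leaveDays}")     // one comma-separated EmployeeRecord per line
  .saveAsTextFile("./path/to/file")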