read from XML source with custom schema errors - pyspark

Seems like a very simple issue, but a very annoying one...
I have an XML file with the following structure:
<A attr1="Str1" attr2="Long1">
<B attr3="Str1" attr4="Str2" attr5="Long1"/>
<B attr3="Str1" attr4="Str2" attr5="Long1"/>
....
<B attr3="Str1" attr4="Str1" attr5="Integer1"/>
My goal is to read it into a Spark (Pyspark) DataFrame to process it later.
I am using the Databricks spark-xml package. When I run the following code:
df = sqlContext.read.format('com.databricks.spark.xml') \
    .option('rowTag', 'A') \
    .option('attributePrefix', 'att_') \
    .load('s3a://path.to.my.xml')
The resulting df's schema (auto-inferred) is the following:
root
 |-- A: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- att_attr3: string (nullable = true)
 |    |    |-- att_attr4: long (nullable = true)
 |    |    |-- att_attr5: long (nullable = true)
 |-- att_attr1: string (nullable = true)
 |-- att_attr2: long (nullable = true)
The problem is attr4 in this case, which I expect to be of type string but which is treated as long.
Every custom schema I tried to set resulted in some internal error or in 0 records in the DataFrame.
Please help :)
(Spark v. 2.0.0)

OK... I found the appropriate way to set the schema so that the XML is parsed properly; it was just some minor syntax issues. If you are interested or have a similar problem, please comment and I will write it here.
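Since the final schema was never posted, here is a minimal sketch of how a custom schema can be passed to spark-xml from PySpark. The field names and nesting below simply mirror the inferred schema shown above and are assumptions; adjust them to your actual data:

from pyspark.sql.types import StructType, StructField, ArrayType, StringType, LongType

# Mirror of the inferred schema, but with att_attr4 declared as string instead of long.
custom_schema = StructType([
    StructField('A', ArrayType(StructType([
        StructField('_VALUE', StringType(), True),
        StructField('att_attr3', StringType(), True),
        StructField('att_attr4', StringType(), True),  # force string here
        StructField('att_attr5', LongType(), True),
    ])), True),
    StructField('att_attr1', StringType(), True),
    StructField('att_attr2', LongType(), True),
])

df = sqlContext.read.format('com.databricks.spark.xml') \
    .option('rowTag', 'A') \
    .option('attributePrefix', 'att_') \
    .schema(custom_schema) \
    .load('s3a://path.to.my.xml')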

Related

Scala/Spark - Find total number of value in row based on a key

I have a large text file which contains the page views of some Wikimedia projects. (You can find it here if you're really interested) Each line, delimited by a space, contains the statistics for one Wikimedia page. The schema looks as follows:
<project code> <page title> <num hits> <page size>
In Scala, using Spark RDDs or Dataframes, I wish to compute the total number of hits for each project, based on the project code.
So for example for projects with the code "zw", I would like to find all the rows that begin with project code "zw", and add up their hits. Obviously this should be done for all project codes simultaneously.
I have looked at functions like aggregateByKey, etc., but the examples I found don't go into enough detail, especially for a file with 4 fields. I imagine it's some kind of MapReduce job, but exactly how to implement it is beyond me.
Any help would be greatly appreciated.
First, you have to read the file in as a Dataset[String]. Then parse each line into a tuple so that it can easily be converted to a DataFrame. Once you have a DataFrame, a simple .groupBy().agg() is enough to finish the computation.
import org.apache.spark.sql.functions.sum
import spark.implicits._  // needed for .toDF and the $ column syntax
val df = spark.read.textFile("/tmp/pagecounts.gz").map(l => {
  val a = l.split(" ")
  (a(0), a(2).toLong)
}).toDF("project_code", "num_hits")
val agg_df = df.groupBy("project_code")
  .agg(sum("num_hits").as("total_hits"))
  .orderBy($"total_hits".desc)
agg_df.show(10)
The above snippet shows the top 10 project codes by total hits.
+------------+----------+
|project_code|total_hits|
+------------+----------+
| en.mw| 5466346|
| en| 5310694|
| es.mw| 695531|
| ja.mw| 611443|
| de.mw| 572119|
| fr.mw| 536978|
| ru.mw| 466742|
| ru| 463437|
| es| 400632|
| it.mw| 400297|
+------------+----------+
It is certainly also possible to do this with the older RDD API as a map/reduce job, but you lose many of the optimizations that the Dataset/DataFrame API brings.

Saving pyspark dataframe with complicated schema in plain text for testing

How do I make clean test data for pyspark? I have figured something out that seems pretty good, but parts seem a little awkward, so I'm posting.
Let's say I have a DataFrame df with a complicated schema and a small number of rows. I want test data checked into my repo, and I don't want a binary file. At this point, I'm not sure of the best way to proceed, but I'm thinking I have a file like
test_fn.py
and it has this in it
schema_str='struct<eventTimestamp:timestamp,list_data:array<struct<valueA:string,valueB:string,valueC:boolean>>>'
which gets the schema in text format via the df.schema.simpleString() function. Then, to get the rows, I do
lns = [row.json_txt for row in df.select((F.to_json(F.struct('*'))).alias('json_txt')).collect()]
Now I put those lines in my test_fn.py file, or I could keep a .json file in the repo.
To run the test, I have to build a DataFrame with the correct schema from this text data. It seems the only way Spark will parse the simple string is if I create a DataFrame with it; that is, I can't pass that simple string to the from_json function. So this is a little awkward, which is why I thought I'd post:
schema2 = spark.createDataFrame(data=[], schema=schema_str).schema
lns = # say I read the lns back from above
df_txt = spark.createDataFrame(data=lns, schema=T.StringType())
I see df_txt just has one column called 'value'
df_json = df_txt.select(F.from_json('value', schema=schema2).alias('xx'))
sel = ['xx.%s' % nm for nm in df_json.select('xx').schema.fields[0].dataType.fieldNames()]
df2 = df_json.select(*sel)
Now df2 should be the same as the original df, which I see is the case using the deepdiff module.
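Putting the pieces together, here is a self-contained sketch of the round trip described above. The sample JSON line is made up for illustration, and an active SparkSession named spark is assumed:

from pyspark.sql import functions as F, types as T

schema_str = 'struct<eventTimestamp:timestamp,list_data:array<struct<valueA:string,valueB:string,valueC:boolean>>>'
# Pretend this line was read back from the checked-in test file.
lns = ['{"eventTimestamp":"2021-01-01 00:00:00","list_data":[{"valueA":"a","valueB":"b","valueC":true}]}']
# Recover the StructType by building an empty DataFrame from the simple string.
schema2 = spark.createDataFrame(data=[], schema=schema_str).schema
# Wrap the JSON lines in a single-column DataFrame ('value') and parse them back.
df_txt = spark.createDataFrame(data=lns, schema=T.StringType())
df_json = df_txt.select(F.from_json('value', schema=schema2).alias('xx'))
df2 = df_json.select('xx.*')
df2.show(truncate=False)

Selecting 'xx.*' is just a shortcut for building the field-name list by hand as above.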

How can I Count the Number of Rows of the JSON File?

My JSON file below contains six rows:
[
{"events":[[{"v":"INPUT","n":"type"},{"v":"2016-08-24 14:23:12 EST","n":"est"}]],
"apps":[],
"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},
"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.1.18","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"12","n":"cpu"},{"v":"154665","n":"seq"},{"v":"2016-08-24 14:23:17 EST","n":"est"}]
},
{"events":[[{"v":"INPUT","n":"type"},{"v":"2016-08-24 14:23:14 EST","n":"est"}]],"apps":[],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.1.18","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"5","n":"cpu"},{"v":"154666","n":"seq"},{"v":"2016-08-24 14:23:23 EST","n":"est"}]},
{"events":[[{"v":"LOGOFF","n":"type"},{"v":"2016-08-24 14:24:04 EST","n":"est"}]],"apps":[],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.1.18","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"0","n":"cpu"},{"v":"154667","n":"seq"},{"v":"2016-08-24 14:24:05 EST","n":"est"}]},
{"events":[],"apps":[[{"v":"ccSvcHst","n":"pname"},{"v":"7704","n":"pid"},{"v":"Old Virus Definition File","n":"title"},{"v":"O","n":"state"},{"v":"5376","n":"mem"},{"v":"0","n":"cpu"}]],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.0.5","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"29","n":"cpu"},{"v":"154668","n":"seq"},{"v":"2016-09-25 16:57:24 EST","n":"est"}]},
{"events":[],"apps":[[{"v":"ccSvcHst","n":"pname"},{"v":"7704","n":"pid"},{"v":"Old Virus Definition File","n":"title"},{"v":"F","n":"state"},{"v":"5588","n":"mem"},{"v":"0","n":"cpu"}]],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.0.5","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"16","n":"cpu"},{"v":"154669","n":"seq"},{"v":"2016-09-25 16:57:30 EST","n":"est"}]},
{"events":[],"apps":[[{"v":"ccSvcHst","n":"pname"},{"v":"7704","n":"pid"},{"v":"Old Virus Definition File","n":"title"},{"v":"F","n":"state"},{"v":"5588","n":"mem"},{"v":"0","n":"cpu"}]],"agent":{"calls":[],"info":[{"v":"7990994","n":"agentid"},{"v":"7999994","n":"stationid"}]},"header":[{"v":"TUSTX002LKVT1JN","n":"host"},{"v":"192.168.0.5","n":"ip"},{"v":"V740723","n":"vzid"},{"v":"16.3.16.0","n":"version"},{"v":"17","n":"cpu"},{"v":"154670","n":"seq"},{"v":"2016-09-25 16:57:36 EST","n":"est"}]}
]
The JSON looks like the below records:
JSON
0
1
2
3
4
5
Required Output:
Count
6
OK, you are in Spark, so you need to turn your JSON into a Dataset and use the appropriate operation on it. Here I describe the general workflow for going from JSON to a Dataset, with the required steps and examples. I think this way of answering is more beneficial, because you can see the steps and then decide what to do with the information.
Input data: the JSON is the data you start working on, and you need to decide which fields are important. Counting on its own is a small part of most cases, and you don't want to load fields that may not be necessary.
Create a case class: case classes let you deserialize your input data. To keep it simple, say I have doctors that belong to departments, and I get the data as JSON. I could have the following case classes:
case class Department(name: String, address: String)
case class Doctor(name: String, department: Department)
As you can see from the above code, I go bottom-up to build the types I want to work with. In your JSON there are lots of fields (e.g., v) whose meaning I can't tell, so be careful not to mix them up.
Get a Dataset: OK, the code below deserializes the JSON into the case class we defined:
spark.read.json("doctorsData.json").as[Doctor]
A couple of points: spark is a SparkSession, which you need to create; here its instance is called spark, but it could be named anything. You also need to import spark.implicits._.
In business!: Now you are in business in the Spark world, and it is just a matter of calling count() on your Dataset. The following method shows how to count it:
def recordsCount(myDataset: Dataset[Doctor]): Long = myDataset.count()
I have a file of three records with correct formatting; on Spark 2.x it reads into a DataFrame / Dataset like this:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
val df = spark.read
  .option("multiLine", true)
  .option("mode", "PERMISSIVE")
  .option("inferSchema", true)
  .json("/FileStore/tables/json_01.txt")
df.select("*").show(false)
df.printSchema()
df.count()
If you just want a total tally, the last line suffices:
res15: Long = 3
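For completeness, the same count can be done from PySpark in a couple of lines. A minimal sketch, assuming the six-record JSON array above is saved at a hypothetical path:

# multiLine is needed because the file is one JSON array spanning many lines
df = spark.read.option("multiLine", True).json("/path/to/records.json")
df.count()  # returns 6 for the six-record array above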

Spark: Empty JSON Files reading from Directory

I'm reading from a path say /json//myfiles_.json
I'm then flattening the JSON using explode. This causes an error since I have some empty files. How do I tell it to ignore empty files or somehow filter them out?
I can detect individual files by checking whether the head is empty, but I need to handle the whole collection of files matched by the wildcard path and read into the dataframe.
So the answer seems to be that I need to provide a schema explicitly, because Spark can't infer one from an empty file, as you would expect!
e.g.
val schemadf = sqlContext.read.json(schemapath) //infer schema from file with data or do manually
val schema = schemadf.schema
val raw = sqlContext.read.schema(schema).json(monthfile)
val prep = raw.withColumn("MyArray", explode($"MyArray"))
  .select($"ID", $"name", $"CreatedAt")
display(prep)

Spark/Scala read hadoop file

In a Pig script I saved a table using PigStorage('|').
In the corresponding Hadoop folder I have files like
part-r-00000
etc.
What is the best way to load it in Spark/Scala? In this table I have 3 fields: Int, String, Float.
I tried:
val text = sc.hadoopFile("file", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitions)
But then I would somehow need to split each line. Is there a better way to do it?
If I were coding in Python, I would create a DataFrame indexed by the first field, whose columns are the values found in the string field and whose coefficients are the float values. But I need to use Scala to use the PCA module, and Scala's DataFrames don't seem that close to Python's.
Thanks for the insight
PigStorage creates a text file without schema information, so you need to do that work yourself, something like this:
val csv = sc.textFile("file") // or the directory where the part files are
val data = csv.map(line => {
  val vals = line.split('|') // split on the literal pipe character (split("|") would treat | as a regex)
  (vals(0).toInt, vals(1), vals(2).toDouble)
})