I want to play around with the 1987 Reuters dataset using Scala and possibly Spark. I can see that the files I've downloaded are in the .sgm format. I've never seen this format before, but paging through one of them with more:
$ more reut2-003.sgm
<!DOCTYPE lewis SYSTEM "lewis.dtd">
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="19419" NEWID="3001">
<DATE> 9-MAR-1987 04:58:41.12</DATE>
<TOPICS><D>money-fx</D></TOPICS>
<PLACES><D>uk</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN>
RM
f0416reute
b f BC-U.K.-MONEY-MARKET-SHO 03-09 0095</UNKNOWN>
<TEXT>
<TITLE>U.K. MONEY MARKET SHORTAGE FORECAST AT 250 MLN STG</TITLE>
<DATELINE> LONDON, March 9 - </DATELINE><BODY>The Bank of England said it forecast a
shortage of around 250 mln stg in the money market today.
Among the factors affecting liquidity, it said bills
maturing in official hands and the treasury bill take-up would
drain around 1.02 billion stg while below target bankers'
balances would take out a further 140 mln.
Against this, a fall in the note circulation would add 345
mln stg and the net effect of exchequer transactions would be
an inflow of some 545 mln stg, the Bank added.
REUTER
</BODY></TEXT>
</REUTERS>
we can see that it looks like pretty simple markup.
Since I don't want to write my own parser, my question is, is there some simple way of parsing this in Scala/Spark using some library?
Q: Since I don't want to write my own parser, my question is, is there
some simple way of parsing this in Scala/Spark using some library?
AFAIK there is no such API. You have to map over the records and parse them yourself (cleaning any special characters in them), then transform the result into multiple columns.
I tried it the way shown below, but your XML shows up as a corrupt record in the DataFrame.
Further pointer: https://github.com/databricks/spark-xml
import java.io.File
import org.apache.commons.io.FileUtils
import org.apache.spark.sql.SparkSession

/**
 * Created by Ram Ghadiyaram
 */
object SparkXmlWithDtd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local")
      .appName(this.getClass.getName)
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    // Sample record copied from the question, including the DOCTYPE line.
    val str =
"""
|<!DOCTYPE lewis SYSTEM "lewis.dtd">
|
|<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="19419" NEWID="3001">
|<DATE> 9-MAR-1987 04:58:41.12</DATE>
|<TOPICS><D>money-fx</D></TOPICS>
|<PLACES><D>uk</D></PLACES>
|<PEOPLE></PEOPLE>
|<ORGS></ORGS>
|<EXCHANGES></EXCHANGES>
|<COMPANIES></COMPANIES>
|<UNKNOWN>
|RM
|f0416reute
|b f BC-U.K.-MONEY-MARKET-SHO 03-09 0095</UNKNOWN>
|<TEXT>
|<TITLE>U.K. MONEY MARKET SHORTAGE FORECAST AT 250 MLN STG</TITLE>
|<DATELINE> LONDON, March 9 - </DATELINE><BODY>The Bank of England said it forecast a
|shortage of around 250 mln stg in the money market today.
| Among the factors affecting liquidity, it said bills
|maturing in official hands and the treasury bill take-up would
|drain around 1.02 billion stg while below target bankers'
|balances would take out a further 140 mln.
| Against this, a fall in the note circulation would add 345
|mln stg and the net effect of exchequer transactions would be
|an inflow of some 545 mln stg, the Bank added.
| REUTER
|</BODY></TEXT>
|</REUTERS>
""".stripMargin
    // Write the sample to disk so the spark-xml reader can load it by path.
    val f = new File("sgmtest.sgm")
    FileUtils.writeStringToFile(f, str, "UTF-8")
    val xml_df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "REUTERS")
      .load(f.getAbsolutePath)
    xml_df.printSchema()
    xml_df.createOrReplaceTempView("XML_DATA")
    spark.sql("SELECT * FROM XML_DATA").show(false)
    xml_df.show(false)
  }
}
Result :
root
|-- _corrupt_record: string (nullable = true)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|_corrupt_record |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
9-MAR-1987 04:58:41.12
money-fx
uk
RM
f0416reute
b f BC-U.K.-MONEY-MARKET-SHO 03-09 0095
U.K. MONEY MARKET SHORTAGE FORECAST AT 250 MLN STG
LONDON, March 9 - The Bank of England said it forecast a
shortage of around 250 mln stg in the money market today.
Among the factors affecting liquidity, it said bills
maturing in official hands and the treasury bill take-up would
drain around 1.02 billion stg while below target bankers'
balances would take out a further 140 mln.
Against this, a fall in the note circulation would add 345
mln stg and the net effect of exchequer transactions would be
an inflow of some 545 mln stg, the Bank added.
REUTER
|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|_corrupt_record |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
9-MAR-1987 04:58:41.12
money-fx
uk
RM
f0416reute
b f BC-U.K.-MONEY-MARKET-SHO 03-09 0095
U.K. MONEY MARKET SHORTAGE FORECAST AT 250 MLN STG
LONDON, March 9 - The Bank of England said it forecast a
shortage of around 250 mln stg in the money market today.
Among the factors affecting liquidity, it said bills
maturing in official hands and the treasury bill take-up would
drain around 1.02 billion stg while below target bankers'
balances would take out a further 140 mln.
Against this, a fall in the note circulation would add 345
mln stg and the net effect of exchequer transactions would be
an inflow of some 545 mln stg, the Bank added.
REUTER
|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
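Since the XML reader only yields a corrupt record here, the "map and parse" route mentioned above can be sketched in plain Scala. This is a minimal sketch, not a full SGML parser: it assumes every article sits between <REUTERS ...> and </REUTERS>, pulls out only a few fields with regular expressions, and does not decode character entities. The names ReutersDoc and ReutersSgm are made up for this example.

```scala
import scala.util.matching.Regex

// Hypothetical container for the fields pulled out of one <REUTERS> record.
final case class ReutersDoc(newId: String, topics: List[String], title: String, body: String)

object ReutersSgm {
  // (?s) lets '.' match newlines, so one record can span many lines.
  private val Record: Regex = "(?s)<REUTERS([^>]*)>(.*?)</REUTERS>".r
  private val NewId: Regex  = "NEWID=\"(\\d+)\"".r
  private val D: Regex      = "<D>(.*?)</D>".r

  // Returns the trimmed text between <name> and </name>, or "" if the tag is absent.
  private def tag(name: String, xml: String): String =
    s"(?s)<$name>(.*?)</$name>".r.findFirstMatchIn(xml).map(_.group(1).trim).getOrElse("")

  // Splits the file content into records and extracts a few fields from each.
  def parse(sgm: String): List[ReutersDoc] =
    Record.findAllMatchIn(sgm).map { m =>
      val attrs = m.group(1) // attribute list of the opening <REUTERS ...> tag
      val inner = m.group(2) // everything between the opening and closing tag
      ReutersDoc(
        newId  = NewId.findFirstMatchIn(attrs).map(_.group(1)).getOrElse(""),
        topics = D.findAllMatchIn(tag("TOPICS", inner)).map(_.group(1)).toList,
        title  = tag("TITLE", inner),
        body   = tag("BODY", inner)
      )
    }.toList
}
```

In Spark, one way to apply this would be to read each file whole with sc.wholeTextFiles(path) and flatMap the file contents through ReutersSgm.parse, then convert the resulting case classes to a DataFrame.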
I have weather data coming from various sources (api, iot measurements), which are in different granularities (daily, hourly, minutes).
You can imagine my data to look like:
2022-01-01 | source:api | data..
2022-01-01 01:00 | source:api | data..
2022-01-01 02:00 | source:api | data..
2022-01-01 00:30 | source:iot | data..
2022-01-01 01:00 | source:iot | data..
2022-01-01 01:30 | source:iot | data..
2022-01-02 | source:api | data..
Depending on the service, I sometimes need my data in a daily resolution, sometimes hourly.
My initial ideas were to store them in either:
time buckets grouped by day, e.g.:
2022-01-01
[d]
[h1, h2,..]
[m1, m2, m3 ...]
2022-01-02
[d]
[h1, h2,..]
[m1, m2, m3 ...]
Save a resolution (daily, hourly, minute data) variable for every document.
I wonder what the best data design strategy would be that would also work long term.
Some additional things to consider:
The data is also used by user-facing services (e.g. an API), and there can be many requests a day. However, these requests target a specific resolution and source. Calls that need a combination of data happen only a few times a day.
Sometimes there is a precedence of which source we will choose based on presence. E.g. use minute iot data, otherwise daily api data.
It is important to let the data in the database "explain itself" clearly without special hidden knowledge. I suggest you keep it clear and simple by storing each incoming item with a true datetime type (not an ISO-8601 string) and a "resolution enum" to indicate what the timestamp actually means. 2020-06-22T00:00:00Z without the enum field is ambiguous.
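As a rough illustration of the "true datetime plus resolution enum" idea, here is a minimal Scala sketch. The type and field names (Resolution, Reading, Readings.pick) are assumptions made for the example and do not come from any particular database or driver. The pick helper shows how a precedence such as "minute IoT data, otherwise daily API data" could be expressed once each reading carries its own resolution.

```scala
import java.time.Instant

// Hypothetical model: every stored item says what its timestamp means.
sealed trait Resolution
object Resolution {
  case object Daily  extends Resolution
  case object Hourly extends Resolution
  case object Minute extends Resolution
}

final case class Reading(timestamp: Instant, resolution: Resolution, source: String, value: Double)

object Readings {
  // Returns the readings of the first (source, resolution) pair in
  // `precedence` that has any data, or an empty Seq if none match.
  def pick(readings: Seq[Reading], precedence: Seq[(String, Resolution)]): Seq[Reading] =
    precedence.iterator
      .map { case (src, res) => readings.filter(r => r.source == src && r.resolution == res) }
      .find(_.nonEmpty)
      .getOrElse(Seq.empty)
}
```

With this shape, a midnight timestamp is no longer ambiguous: the resolution field says whether it is a daily value or a reading taken at 00:00.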
Here's a sample DF:
Date        Party name             Symbol  Buy/Sell indicator  # of shares  trade price
2011-01-03  American Funds EuPc;A  AAPL    BUY                 2400         332.87
2011-02-14  American Funds CWGI;A  SLB     BUY                 6700         94.08
2011-01-06  Tudor Investment Corp  ALL     BUY                 11800        31.92
2011-01-20  American Funds Inc;A   AMZN    SELL                3600         180.14
And here is what I wish to achieve:
Date        Party name                 Symbol  Buy/Sell  # of shares  trade price  trading volume
2011-04-21  Federated Prime Obl;Inst   MMM     BUY       2600         96.17        250042
2011-01-05  Fortress Investment Group  CMCSA   SELL      29700        21.96        644193
2011-02-28  Dodge & Cox Intl Stock     DELL    SELL      57400        15.67        899458
2011-05-02  American Funds Inc;A       S       BUY       137300       5.19         712587
The new trading volume column is the # of shares column * trade price column. Anyone know how to achieve this automatically since there are a lot more lines? What I would like to do after is take the trading volume values and show them as an output in descending order. The exact instruction is
The biggest dollar trading volume counter parties, top twenty list.
I have this so far:
val dataframe = spark.read.option("header", "true").csv("c:\\data")
val newdf = dataframe.select("# of shares","trade price")
Any help would be much appreciated. Thank you.
Here you go:
import org.apache.spark.sql.functions._
val newdf = dataframe.withColumn("trading volume",col("# of shares")*col("trade price"))
.select("# of shares","trade price","trading volume")
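For the "top twenty" part of the question, the logic can be sketched with plain Scala collections. This is only a sketch under an assumption: "biggest dollar trading volume counter parties" is read here as the summed trading volume per Party name. The Trade case class and TopVolume object are hypothetical names for this example. On a DataFrame, the same steps would map to groupBy, agg(sum(...)), orderBy(desc(...)) and limit(20).

```scala
// Hypothetical row type mirroring the relevant sample columns.
final case class Trade(party: String, symbol: String, shares: Long, price: Double)

object TopVolume {
  // Dollar volume of one trade: # of shares * trade price.
  def volume(t: Trade): Double = t.shares * t.price

  // Sums the dollar volume per party and returns the n largest, in descending order.
  def topParties(trades: Seq[Trade], n: Int = 20): Seq[(String, Double)] =
    trades
      .groupBy(_.party)
      .map { case (party, ts) => party -> ts.map(volume).sum }
      .toSeq
      .sortBy { case (_, total) => -total }
      .take(n)
}
```

If the ranking should instead be per individual trade, sort the trades by TopVolume.volume directly and skip the grouping step.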
I have a semi-structured text file which I want to convert to a DataFrame in Spark. I have a schema in mind, which is shown below. However, I am finding it challenging to parse my text file and assign the schema.
Following is my sample text file:
"good service"
Tom Martin (USA) 17th October 2015
4
Long review..
Type Of Traveller Couple Leisure
Cabin Flown Economy
Route Miami to Chicago
Date Flown September 2015
Seat Comfort 12345
Cabin Staff Service 12345
Ground Service 12345
Value For Money 12345
Recommended no
"not bad"
M Muller (Canada) 22nd September 2015
6
Yet another long review..
Aircraft TXT-101
Type Of Customer Couple Leisure
Cabin Flown FirstClass
Route IND to CHI
Date Flown September 2015
Seat Comfort 12345
Cabin Staff Service 12345
Food & Beverages 12345
Inflight Entertainment 12345
Ground Service 12345
Value For Money 12345
Recommended yes
.
.
The schema and result that I expect are as follows:
+----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+
| Review_Header | User_Name | User_Country | User_Review_Date | Overall Score | Review | Aircraft | Type of Traveler | Cabin Flown | Route_Source | Route_Destination | Date Flown | Seat Comfort | Cabin Staff Service | Food & Beverage | Inflight Entertainment | Ground Service | Wifi & Connectivity | Value for Money |
+----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+
| "good service" | Tom Martin | USA | 17th October 2015 | 4 | Long review.. | | Couple Leisure | Economy | Miami | Chicago | September 2015 | 12345 | 12345 | | | 12345 | | 12345 |
| "not bad" | M Muller | Canada | 22nd September 2015 | 6 | Yet another long review.. | TXT-101 | Couple Leisure | FirstClass | IND | CHI | September 2015 | 12345 | 12345 | 12345 | 12345 | 12345 | | 12345 |
+----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+
As you may notice, for each block of data in the text file, the first four lines are mapped to user-defined columns such as Review_Header, User_Name, User_Country and User_Review_Date, whereas each of the remaining lines maps to a column named in the line itself.
What would be the best way to use schema inference in such a scenario, rather than writing verbose code?
UPDATE: I would like to make this problem a little trickier. What if "Long review.." and "Yet another long review.." could themselves span multiple lines? How can I parse a review that spans multiple lines in each block?
If you can guarantee that the records in the semi-structured text file are separated by two newlines, and that two consecutive newlines never appear in the "Long review..." section, you may be able to use textFile with a modified record delimiter ("\n\n") and then process each record without writing a custom file format.
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\n\n")
val blocks = sc.textFile("sample-file.txt") // one element per review block
Then you can do further splitting on "\n" and "\t" to create your fields and columns.
Seeing your update, it's kind of a difficult problem. You have to ask yourself what identifying info is in the attributes that's not in the review. Or what is guaranteed to be in a specific format. E.g.
Can you guarantee there are never two consecutive newlines in the long review? This is important if we're splitting on "\n\n" to generate the blocks.
Can you guarantee there are no tabs in the long review?
Is Aircraft, Cabin Flown, Cabin Staff Service, Date Flown, Food & Beverages, Ground Service, ... the full list of attributes? Do you have a full list of possible attributes?
As well as some meta questions:
Where is this data coming from?
Can we request it in a better format?
Can we find this data, or the aspects we're looking for from a better source?
With those known, you'll have a better idea on how to proceed. E.g. if there are no tabs in the review text, (or they're escaped as "\t" or something):
Extract lines[0] - first line "good service"
Extract lines[1] - split to user name, country, review date
Filter lines[2:] containing tabs, get lowest index i - split into attributes
Join lines[2:i] with "\n" - this is the review
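The steps above can be sketched in plain Scala roughly as follows. This is a sketch under several assumptions: blocks are separated by a blank line, attribute lines are "name<TAB>value", the review text contains no tabs, and the third line of each block is the overall score (taken from the expected schema in the question rather than from the steps above). Review and ReviewParser are names invented for the example.

```scala
// Hypothetical record type; attribute names are kept exactly as they appear in the file.
final case class Review(header: String, user: String, country: String, date: String,
                        score: String, text: String, attributes: Map[String, String])

object ReviewParser {
  // Matches e.g. "Tom Martin (USA) 17th October 2015" -> name, country, date.
  private val UserLine = """(.*?)\s*\((.*?)\)\s*(.*)""".r

  // Parses one block; returns None when the block does not have the expected shape.
  def parseBlock(block: String): Option[Review] = {
    val lines = block.trim.split("\n").map(_.stripSuffix("\r")).toList
    lines match {
      case header :: UserLine(user, country, date) :: score :: rest =>
        // Everything before the first tab-containing line is review text.
        val (reviewLines, attrLines) = rest.span(!_.contains("\t"))
        val attrs = attrLines.filter(_.contains("\t")).map { l =>
          val parts = l.split("\t", 2) // limit 2 keeps tabs inside the value intact
          parts(0).trim -> parts(1).trim
        }.toMap
        Some(Review(header, user, country, date, score.trim, reviewLines.mkString("\n"), attrs))
      case _ => None
    }
  }

  // Splits the whole file content on blank lines and parses each block.
  def parseFile(content: String): List[Review] =
    content.split("\n\n").toList.flatMap(parseBlock)
}
```

Because the attributes end up in a Map, columns that are missing from a block (such as Aircraft in the first sample) simply have no entry and can be filled with nulls when building the DataFrame.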
What could be the best possible way to use schema inference technique in such scenario rather than writing verbose code?
You don't have much choice: you have to write verbose code or a custom FileFormat (which would hide the complexity of loading such files into a DataFrame).
Use DataFrameReader.textFile to load the file and transform it accordingly.
textFile(path: String): Dataset[String] Loads text files and returns a Dataset of String. See the documentation on the other overloaded textFile() method for more details.
I tried to perform two requests to Reporting API:
startDate=2016-01-01, endDate=2016-08-26, ga:users, ga:yearMonth
startDate=2016-01-01, endDate=2016-08-26, ga:users, ga:yearMonth, ga:year
The metric results do not match. Why?
Example on https://ga-dev-tools.appspot.com/query-explorer/
Result for request one:
ga:yearMonth ga:users
201601 1372
201602 1701
201603 1980
201604 1779
201605 1465
201606 1336
201607 1402
201608 1595
Result for request two:
ga:year ga:yearMonth ga:users
2016 201601 1372
2016 201602 1525
2016 201603 1761
2016 201604 1531
2016 201605 1239
2016 201606 1084
2016 201607 1157
2016 201608 1365
This answer may be useful to someone having the same problem. Whenever the data from the API does not match the data on the dashboard, do the following:
Make sure you are using the same parameters for both (the same metrics and dimensions).
If there is still a mismatch after step one, it is probably because Google has applied sampling internally, since even a small query can require heavy computation. When sampling has been applied, the response contains a samplingSpaceSizes field.
To avoid sampling, loop over the dates and query each day independently.
In your case sampling is most probably the issue, because of the large date range (and because your GA account has a lot of data), so instead of querying the whole range at once, loop over the date range.
Also remember that it may take up to 48 hours for fresh data to be processed. To check whether your data has been processed, look for the isDataGolden field in the response: if it is present, the data has been processed and the results will match. If it is absent, some of your data has not been processed yet.
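The "loop over the date range" step could look roughly like this in Scala. The sketch only generates one startDate/endDate pair per day in the yyyy-MM-dd form used in the question; the actual Reporting API call is left out, and DateRanges is a name made up for this example.

```scala
import java.time.LocalDate

object DateRanges {
  // One (startDate, endDate) pair per day, both ends inclusive.
  // LocalDate.toString yields the ISO form yyyy-MM-dd.
  def daily(start: LocalDate, end: LocalDate): List[(String, String)] =
    Iterator.iterate(start)(_.plusDays(1))
      .takeWhile(!_.isAfter(end))
      .map(d => (d.toString, d.toString))
      .toList
}
```

For the range in the question, DateRanges.daily(LocalDate.parse("2016-01-01"), LocalDate.parse("2016-08-26")) produces one pair per day, and each pair would then be sent as its own request.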
Have you checked for sampling? The date range you're working with is on the large side, so you might consider testing with a smaller range to see if the totals become more consistent.
Another thing to consider is that the Users metric can be pre-calculated or calculated on the fly. More information on the Users metric here
What is the best practice for joining 'shift' data and other time series data in Tableau? I am working with data from multiple regions (from LA to India, UK, NY, Malaysia, Australia, China etc.), and a lot of employees work past midnight.
For example, an employee has a shift from 9 PM to 6 AM starting on 2016-07-31. The 'report date' is 2016-07-31, but no time zone information is provided.
This employee does work and there are events (time stamps in UTC) between 2016-07-31 21:00 to 2016-08-01 06:00. When I look at the events though, 7/31 will only have the events between 21:00 and 23:59. If I filter for just July, my calculations will be skewed (the event data will be cut off at midnight even though the shift extended to 6 AM).
I need to make calculations based upon the total time an employee was actually engaged with work (productive) and the total time they were paid. The request is for this to be daily/weekly/monthly.
If anyone can help me out here or give me some talking points to explain this to my superiors, it would be appreciated. This seems like it must be a common scenario. Do I need to request a new raw data format, or is there something I can do on my end?
the shift data only looks like this:
id date regular_hours overtime_hours total_hours
abc 2016-06-17 8 0.52 8.52
abc 2016-06-18 7.64 0.83 8.47
abc 2016-06-19 7.87 0.23 8.1
the event data is more detailed (30 minute interval data on events handled and the time it took to complete those events in seconds):
id date interval events event_duration
abc 2016-06-17 01:30:00 4 688
abc 2016-06-17 02:00:00 6 924
abc 2016-06-17 02:30:00 10 1320
So, you sum up the event_duration for an entire day and you get a number of seconds which was actually spent doing work. You can then compare this to amount of time that the employee was paid to see how efficient the staffing is.
My concern is that the event data has the date and the time (UTC). The payroll data only has a date without any time zone information. This causes inaccuracies when blending data in Tableau because some shifts cross midnight. Is there a way around this or do I need to propose new data requirements?
(FYI - people have been calculating it just based on the date for years most likely without considering time zones before. My assumption is that they just did not realize that this could cause inaccurate results)