How to load DataFrame from semi-structured textfile? - scala

I have a semi-structured text file which I want to convert to a DataFrame in Spark. I have a schema in mind, which is shown below. However, I am finding it challenging to parse my text file and assign the schema.
Following is my sample text file:
"good service"
Tom Martin (USA) 17th October 2015
4
Long review..
Type Of Traveller Couple Leisure
Cabin Flown Economy
Route Miami to Chicago
Date Flown September 2015
Seat Comfort 12345
Cabin Staff Service 12345
Ground Service 12345
Value For Money 12345
Recommended no
"not bad"
M Muller (Canada) 22nd September 2015
6
Yet another long review..
Aircraft TXT-101
Type Of Customer Couple Leisure
Cabin Flown FirstClass
Route IND to CHI
Date Flown September 2015
Seat Comfort 12345
Cabin Staff Service 12345
Food & Beverages 12345
Inflight Entertainment 12345
Ground Service 12345
Value For Money 12345
Recommended yes
.
.
The schema and the result that I expect are as follows:
+----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+
| Review_Header | User_Name | User_Country | User_Review_Date | Overall Score | Review | Aircraft | Type of Traveler | Cabin Flown | Route_Source | Route_Destination | Date Flown | Seat Comfort | Cabin Staff Service | Food & Beverage | Inflight Entertainment | Ground Service | Wifi & Connectivity | Value for Money |
+----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+
| "good service" | Tom Martin | USA | 17th October 2015 | 4 | Long review.. | | Couple Leisure | Economy | Miami | Chicago | September 2015 | 12345 | 12345 | | | 12345 | | 12345 |
| "not bad" | M Muller | Canada | 22nd September 2015 | 6 | Yet another long review.. | TXT-101 | Couple Leisure | FirstClass | IND | CHI | September 2015 | 12345 | 12345 | 12345 | 12345 | 12345 | | 12345 |
+----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+
As you may notice, for each block of data in the text file, the first four lines are mapped to user-defined columns such as Review_Header, User_Name, User_Country and User_Review_Date, whereas each of the remaining lines maps to its own named column.
What would be the best way to use schema inference in such a scenario, rather than writing verbose code?
UPDATE: I would like to make this problem a little more tricky. What if "Long review.." and "Yet another long review.." could themselves span multiple lines? How can I parse a review that spans multiple lines within each block?

If you can guarantee that the records in the semi-structured text file are separated by two newlines, and that two consecutive newlines never appear in the "Long review..." section, you may be able to use textFile with a modified record delimiter ("\n\n") and then process each record without writing a custom file format.
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\n\n")
val records = sc.textFile("sample-file.txt") // RDD[String]: one element per record block
Then you can do further splitting on "\n" and "\t" to create your fields and columns.
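The splitting step can be sketched in plain Python, without Spark, to show the shape of the result. This is only a sketch under assumptions: the sample text, the blank line between records, the tab between attribute name and value, and the `ATTRIBUTE_COLUMNS` list are all hypothetical and not confirmed by the question.

```python
# Hypothetical sample: records separated by a blank line,
# attributes written as "name<TAB>value" (both are assumptions).
SAMPLE = (
    '"good service"\n'
    "Tom Martin (USA) 17th October 2015\n"
    "4\n"
    "Long review..\n"
    "Cabin Flown\tEconomy\n"
    "Route\tMiami to Chicago\n"
    "Recommended\tno\n"
    "\n"
    '"not bad"\n'
    "M Muller (Canada) 22nd September 2015\n"
    "6\n"
    "Yet another long review..\n"
    "Aircraft\tTXT-101\n"
    "Cabin Flown\tFirstClass\n"
    "Route\tIND to CHI\n"
    "Recommended\tyes\n"
)

# Assumed (shortened) column list; attributes missing from a record become None.
ATTRIBUTE_COLUMNS = ["Aircraft", "Cabin Flown", "Route", "Recommended"]


def split_records(text):
    """Split the file content into record blocks on blank lines ("\\n\\n")."""
    return [block for block in text.strip().split("\n\n") if block.strip()]


def attributes_of(block):
    """Collect the tab-separated attribute lines of one block."""
    found = {}
    for line in block.split("\n"):
        if "\t" in line:
            name, value = line.split("\t", 1)
            found[name.strip()] = value.strip()
    # Align every record to the same columns so all rows share one schema.
    return {column: found.get(column) for column in ATTRIBUTE_COLUMNS}


rows = [attributes_of(block) for block in split_records(SAMPLE)]
```

Because every row is aligned to the same column list, optional attributes such as Aircraft simply come out empty for records that lack them, which matches the expected table in the question.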
Seeing your update, it's kind of a difficult problem. You have to ask yourself what identifying info is in the attributes that's not in the review, or what is guaranteed to be in a specific format. E.g.
Can you guarantee there are never two consecutive newlines in the long review? This is important if we're splitting on "\n\n" to generate the blocks.
Can you guarantee there are no tabs in the long review?
Is Aircraft, Cabin Flown, Cabin Staff Service, Date Flown, Food & Beverages, Ground Service, ... the full list of attributes? Do you have a full list of possible attributes?
As well as some meta questions:
Where is this data coming from?
Can we request it in a better format?
Can we find this data, or the aspects we're looking for from a better source?
With those known, you'll have a better idea of how to proceed. E.g. if there are no tabs in the review text (or they're escaped as "\t" or something):
Extract lines[0] - the first line, "good service"
Extract lines[1] - split into user name, country, review date
Extract lines[2] - the overall score
Filter lines[3:] for lines containing tabs and take the lowest index i - lines[i:] split into attributes
Join lines[3:i] with "\n" - this is the review
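The steps above can be sketched as a single function. It is a rough illustration rather than a definitive parser, and it assumes that lines[2] holds the overall score (as in the sample), that attribute lines are tab-separated, and that the user line looks like "name (country) date". The `USER_LINE` pattern and `parse_block` name are made up for this example.

```python
import re

# Assumed shape of the second line: "<name> (<country>) <review date>".
USER_LINE = re.compile(r"^(?P<name>.+?) \((?P<country>[^)]+)\) (?P<date>.+)$")


def parse_block(block):
    """Parse one record block whose review may span several lines."""
    lines = block.split("\n")
    user = USER_LINE.match(lines[1])
    if user is None:
        raise ValueError("unexpected user line: " + lines[1])
    # First attribute line = first line after the score that contains a tab.
    first_attr = next(
        (i for i, line in enumerate(lines) if i >= 3 and "\t" in line),
        len(lines),
    )
    record = {
        "Review_Header": lines[0],
        "User_Name": user.group("name"),
        "User_Country": user.group("country"),
        "User_Review_Date": user.group("date"),
        "Overall Score": int(lines[2]),
        # Everything between the score and the first attribute is review text.
        "Review": "\n".join(lines[3:first_attr]),
    }
    for line in lines[first_attr:]:
        if "\t" in line:
            name, value = line.split("\t", 1)
            record[name] = value
    # Split "Miami to Chicago" into the two route columns of the target schema.
    if "Route" in record:
        source, _, destination = record.pop("Route").partition(" to ")
        record["Route_Source"] = source
        record["Route_Destination"] = destination
    return record
```

In Spark the same function body would run inside a map over the record blocks; if tabs can occur inside the review, the tab test would have to be replaced by a match against the known attribute names.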

What could be the best possible way to use schema inference technique in such scenario rather than writing verbose code?
You don't have much choice: you have to write verbose code or a custom FileFormat (which would hide the complexity of loading such files into a DataFrame).
Use DataFrameReader.textFile to load the file and transform it accordingly.
textFile(path: String): Dataset[String] Loads text files and returns a Dataset of String. See the documentation on the other overloaded textFile() method for more details.

Related

MongoDB Storing varying interval timeseries data in one collection

I have weather data coming from various sources (api, iot measurements), which are in different granularities (daily, hourly, minutes).
You can imagine my data to look like:
2022-01-01 | source:api | data..
2022-01-01 01:00 | source:api | data..
2022-01-01 02:00 | source:api | data..
2022-01-01 00:30 | source:iot | data..
2022-01-01 01:00 | source:iot | data..
2022-01-01 01:30 | source:iot | data..
2022-01-02 | source:api | data..
Depending on the service, I sometimes need my data in a daily resolution, sometimes hourly.
My initial ideas were to store them in either:
time buckets grouped by day, e.g.:
2022-01-01
[d]
[h1, h2,..]
[m1, m2, m3 ...]
2022-01-02
[d]
[h1, h2,..]
[m1, m2, m3 ...]
Save a resolution (daily, hourly, minute data) variable for every document.
I wonder what the best data design strategy would be that would also work long term.
Some additional things to consider:
The data is also used by user-facing services (e.g. an API), which can make many requests a day. However, these requests target a specific resolution and source. Calls that need a combination of data happen only a few times a day.
Sometimes there is a precedence of which source we will choose based on presence. E.g. use minute iot data, otherwise daily api data.
It is important to let the data in the database "explain itself" clearly without special hidden knowledge. I suggest you keep it clear and simple by storing each incoming item with a true datetime type (not an ISO-8601 string) and a "resolution enum" to indicate what the timestamp actually means. 2020-06-22T00:00:00Z without the enum field is ambiguous.
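A minimal sketch of that suggestion, using plain Python dictionaries in place of MongoDB documents. The field names (`ts`, `resolution`, `source`, `data`), the sample temperature values and the `PRECEDENCE` list are assumptions made for illustration, not part of the original question or answer.

```python
from datetime import datetime, timezone
from enum import Enum


class Resolution(str, Enum):
    """Says what a timestamp actually covers."""
    DAILY = "daily"
    HOURLY = "hourly"
    MINUTE = "minute"


def make_document(ts, resolution, source, data):
    """One document per incoming item: a true datetime plus an explicit resolution."""
    return {"ts": ts, "resolution": resolution.value, "source": source, "data": data}


# Hypothetical readings; the "temp" values are invented.
documents = [
    make_document(datetime(2022, 1, 1, tzinfo=timezone.utc), Resolution.DAILY, "api", {"temp": 3.0}),
    make_document(datetime(2022, 1, 1, 1, 0, tzinfo=timezone.utc), Resolution.HOURLY, "api", {"temp": 2.5}),
    make_document(datetime(2022, 1, 1, 0, 30, tzinfo=timezone.utc), Resolution.MINUTE, "iot", {"temp": 2.8}),
]

# Assumed precedence: prefer minute-level IoT data, otherwise fall back to daily API data.
PRECEDENCE = [("iot", "minute"), ("api", "daily")]


def pick_by_precedence(docs, day):
    """Return the documents of the first (source, resolution) pair that has data on `day`."""
    for source, resolution in PRECEDENCE:
        matches = [
            d for d in docs
            if d["source"] == source
            and d["resolution"] == resolution
            and d["ts"].date() == day
        ]
        if matches:
            return matches
    return []
```

With the resolution stored on every document, a request for one resolution and source is a simple equality filter, and the precedence rule from the question becomes an ordered list of such filters.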

How to Display Latest Weeks Volume by Default in Qlik Sense

I am looking for some help in displaying a set of numbers on my dashboard but I need to display the latest week whenever the dashboard is open but also allow the user to change the week that they are looking at through the filters.
My data is the following:
latest_week_rank | week_date | completed_orders
1 | 31/01/2020 | 3500
2 | 24/01/2020 | 6450
3 | 17/01/2020 | 6050
4 | 10/01/2020 | 6110
5 | 03/01/2020 | 4000
6 | 27/12/2019 | 3500
7 | 20/12/2019 | 7500
8 | 13/12/2019 | 7450
9 | 06/12/2019 | 7540
10 | 29/11/2019 | 6900
11 | 22/11/2019 | 7100
12 | 15/11/2019 | 7400
13 | 08/11/2019 | 7550
I am going to be using a Multi KPI Extension where I will display the volume of 3500 for the latest week in my data, and then have a second measure that displays a % value to show whether the volume is higher or lower than the previous week.
So the formula (3500 / 6450) - 1 gives me -45.74%, i.e. 45.74% down.
The tricky bit is how to write the expression/variable so that it shows the latest week by default, but still lets the user filter and pick another week, which would then also change the previous week used for the comparison.
I would really appreciate it if somebody could advise on how I could tackle this issue to display my data on my dashboard as I am fairly new to Qlik so just trying to get my head around how everything works.
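To make the intended behaviour concrete, here is a small Python sketch of the logic only; it is not Qlik syntax. It uses three of the weeks from the table above, and the function name `week_kpis` and the `selected_week` parameter are invented for illustration.

```python
from datetime import date

# Three of the weekly volumes from the question, keyed by week_date.
orders = {
    date(2020, 1, 31): 3500,
    date(2020, 1, 24): 6450,
    date(2020, 1, 17): 6050,
}


def week_kpis(orders, selected_week=None):
    """Return (volume, change vs. previous week) for the selected or latest week."""
    weeks = sorted(orders)
    # No selection: default to the latest week in the data.
    current = selected_week if selected_week is not None else weeks[-1]
    position = weeks.index(current)
    if position == 0:
        return orders[current], None  # no earlier week to compare against
    previous = weeks[position - 1]
    change = orders[current] / orders[previous] - 1
    return orders[current], change


volume, change = week_kpis(orders)
print(volume, f"{change:.2%}")
```

Selecting another week, for example `week_kpis(orders, date(2020, 1, 24))`, moves both the displayed volume and the week it is compared against.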
I have managed to write an expression which gives me the latest week's volume and also allows me to filter and view previous weeks' data.
Sum({<week_date={">=$(=Weekstart(max(week_date)))<=$(=Weekend(max(week_date)))"}>}completed_orders)
For the percentage I used the same expression, dividing the latest week's volume by the previous week's. To get the previous week, all I did was add a -1 to shift the range back one week, and then I changed the number format to show the result as a %.
Code in the Data Tab:
set vvWeekOrders = Sum({<week_date={">=$(=Weekstart(max(week_date)))<=$(=Weekend(max(week_date)))"}>}completed_orders);
But this changes my values to 0. Do I need to change the code if I am using set?

Tableau - Filter on maximum date within dimension

I have data that are structured as so:
| Node | Update Datetime | Measure Values |
|------|-----------------|----------------|
| A | 2018-01-01 | 1 |
| A | 2018-01-05 | 3 |
| A | 2018-01-06 | 4 |
| B | 2018-01-02 | 2 |
| B | 2018-01-03 | 4 |
The nodes are updated over time, with the measure values showing the node's value at the time of data entry, meaning that just filtering on a date range will overstate the node's value. The report needs to be responsive, or else I would just do this in SQL, but as it stands I need to be able to keep only the rows that contain the maximum datetime value within each node, after the entire dataset is filtered on a general date window.
I think creating a level-of-detail (LOD) calculated field and using it as a filter is the cleanest and quickest way to achieve the desired filter.
See Option 2 here -- http://kb.tableau.com/articles/howto/setting-default-date-to-most-recent-date-on-a-quick-filter?lang=en
Let me know if that does not work.
(FYI Tableau Community Forums is another resource for Tableau questions. I use both sites)
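For reference, the filter the question asks for can be sketched in Python. This only illustrates the intended result (date window first, then the latest row per node); it is not Tableau syntax, and the function name `latest_per_node` is made up.

```python
from datetime import date

# Rows from the question: (node, update datetime, measure value).
rows = [
    ("A", date(2018, 1, 1), 1),
    ("A", date(2018, 1, 5), 3),
    ("A", date(2018, 1, 6), 4),
    ("B", date(2018, 1, 2), 2),
    ("B", date(2018, 1, 3), 4),
]


def latest_per_node(rows, start, end):
    """Apply the date window first, then keep each node's most recent row."""
    in_window = [r for r in rows if start <= r[1] <= end]
    latest = {}
    for node, updated, _ in in_window:
        if node not in latest or updated > latest[node]:
            latest[node] = updated
    return [r for r in in_window if r[1] == latest[r[0]]]


print(latest_per_node(rows, date(2018, 1, 1), date(2018, 1, 5)))
```

The order matters: the maximum has to be computed after the window is applied, otherwise node A's 2018-01-06 row would be picked as the maximum and then filtered out, leaving no row for A.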

How can I aggregate metrics per day in a Grafana table?

I am charting data with a Grafana table, and I want to aggregate all data points from a single day into one row in the table. My current setup displays the values on a per-minute basis.
Question: How can I make a Grafana table that displays values aggregated by day?
| Day | ReCaptcha | T & C |
|-------------------|------------|-------|
| February 21, 2017 | 9,001 | 8,999 |
| February 20, 2017 | 42 | 17 |
| February 19, 2017 | ... | ... |
You can use the summarize function on the metrics panel.
Edit the query by pressing the + and then selecting transform.
summarize(24h, sum, false) aggregates the data points in each 24-hour interval into a single point by summing them.
http://graphite.readthedocs.io/en/latest/functions.html#graphite.render.functions.summarize
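What this bucketing does can be sketched in Python. This is an illustration of daily summing, not Graphite's implementation: it assumes Unix timestamps in seconds and buckets aligned to UTC day boundaries, and the function name `summarize_daily` is invented.

```python
from collections import defaultdict
from datetime import datetime, timezone

SECONDS_PER_DAY = 24 * 60 * 60


def summarize_daily(points):
    """Sum (unix_timestamp, value) points into one total per UTC day."""
    totals = defaultdict(float)
    for timestamp, value in points:
        # Round the timestamp down to the start of its 24-hour bucket.
        bucket_start = timestamp - timestamp % SECONDS_PER_DAY
        totals[bucket_start] += value
    return {
        datetime.fromtimestamp(start, tz=timezone.utc).date().isoformat(): total
        for start, total in sorted(totals.items())
    }
```

Each key of the returned dictionary corresponds to one row of the desired table, with the per-minute values of that day added together.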

Merge periodes in java

I am working with Spring Batch. The batch reads several records at a time from the database, which look like this:
personId | fromDate | toDate | someCode
*100 | 05-05-2011 | 31-12-2011 | A
*100 | 01-01-2012 | 31-12-2012 | A
100 | 01-01-2013 | 03-03-2013 | B
101 | 05-05-2011 | 31-12-2011 | A
* periods to be merged.
What I want to do is merge the periods that have the same code and the same personId, but not periods with a different code or personId.
The first question is: can I chunk this step? The problem is that commit intervals are static, so I might not get all the periods for a person in one chunk. Is it possible to have dynamic chunks based on how many records a person has in the table?
The next question is: what is the best way to merge the periods? Periods should be merged if the toDate is 31-12 and the next period starts on 01-01 of the next year.
I solved the problem by using two pointers in each object: one that points to the previous period and one that points to the next period.
For chunking I needed to read all rows with the same person id and aggregate them.
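The merge rule itself can be sketched in Python. Note that this is a different technique from the linked previous/next pointers described above: it sorts the rows and merges them in a single pass per (personId, code) group. The tuple layout and the function names are assumptions made for this example.

```python
from datetime import date
from itertools import groupby


def continues_across_year(previous_to, next_from):
    """True if a period ends on 31-12 and the next one starts on 01-01 of the next year."""
    return (
        (previous_to.month, previous_to.day) == (12, 31)
        and next_from == date(previous_to.year + 1, 1, 1)
    )


def merge_periods(records):
    """Merge periods given as (person_id, from_date, to_date, code) tuples."""
    merged = []
    # Sort so that rows for the same person and code are adjacent and in date order.
    ordered = sorted(records, key=lambda r: (r[0], r[3], r[1]))
    for (person_id, code), group in groupby(ordered, key=lambda r: (r[0], r[3])):
        current = None
        for _, from_date, to_date, _ in group:
            if current is not None and continues_across_year(current[2], from_date):
                # Extend the running period instead of starting a new one.
                current = (person_id, current[1], to_date, code)
            else:
                if current is not None:
                    merged.append(current)
                current = (person_id, from_date, to_date, code)
        merged.append(current)
    return merged
```

Grouping by person before merging mirrors the chunking approach above: all rows for one person have to be available together, because a merge can never be decided from a single row.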