Merge periods in Java - spring-batch

I am working with Spring Batch. The batch reads several records at a time from the database, which look like this:
personId |fromDate| toDate | someCode
*100 | 05-05-2011 | 31-12-2011 | A
*100 | 01-01-2012 | 31-12-2012 | A
100 | 01-01-2013 | 03-03-2013 | B
101 | 05-05-2011 | 31-12-2011 | A
*periods to be merged.
What I want to do is merge the periods that have the same code and the same personId, but not those with a different code or personId.
The first question is: can I chunk this step? The problem is that commit intervals are static, so I might not get all the periods for a person in one chunk. Is it possible to have dynamic chunks based on how many records a person has in the table?
The next question is: what is the best way to merge the periods? Periods should be merged if the toDate is 31-12 and the next period starts on 01-01 of the following year.

I solved the problem by using two pointers in each object: one pointing to the previous period and one to the next period.
For chunking, I needed to read all rows with the same person id and aggregate them.
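The merge rule from the question can be illustrated with a small sketch. This is not the two-pointer solution described above; it swaps in a plain sort-and-sweep over rows grouped by person and code, and it is written in Python purely to show the logic (a Spring Batch job would implement the same idea in Java). The Period class and the function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from itertools import groupby


@dataclass
class Period:  # hypothetical record type mirroring the table columns
    person_id: int
    from_date: date
    to_date: date
    code: str


def is_year_boundary(prev: Period, nxt: Period) -> bool:
    """True when prev ends on 31-12 and nxt starts on 01-01 of the next year."""
    return (
        (prev.to_date.month, prev.to_date.day) == (12, 31)
        and (nxt.from_date.month, nxt.from_date.day) == (1, 1)
        and nxt.from_date.year == prev.to_date.year + 1
    )


def merge_periods(periods):
    """Merge year-adjacent periods that share person_id and code."""
    ordered = sorted(periods, key=lambda p: (p.person_id, p.code, p.from_date))
    merged = []
    # Each group holds all rows for one (person, code) pair, in date order.
    for _, group in groupby(ordered, key=lambda p: (p.person_id, p.code)):
        current = None
        for p in group:
            if current is not None and is_year_boundary(current, p):
                # Extend the running period instead of starting a new one.
                current = Period(current.person_id, current.from_date, p.to_date, current.code)
            else:
                if current is not None:
                    merged.append(current)
                current = p
        if current is not None:
            merged.append(current)
    return merged


rows = [
    Period(100, date(2011, 5, 5), date(2011, 12, 31), "A"),
    Period(100, date(2012, 1, 1), date(2012, 12, 31), "A"),
    Period(100, date(2013, 1, 1), date(2013, 3, 3), "B"),
    Period(101, date(2011, 5, 5), date(2011, 12, 31), "A"),
]
result = merge_periods(rows)
```

With the sample rows from the question, the two starred rows for person 100 and code A collapse into a single period, while the B row and the row for person 101 stay as they are.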

Related

How do you query between date ranges to look for valid DATERANGE?

Say my table is this
validity table
id | validity_dates
-------------+-------------------------
1 | [2018-01-01,2019-01-01)
2 | [2017-05-01,2017-06-01)
3 | [2016-05-01,2016-07-01)
4 | [2022-01-01,2025-01-01)
5 | [2022-01-01,2025-12-10)
How do I query for the ids whose validity_dates range still covers today or extends into the future?
For example, I want to get the ids that are still valid from today onwards. I would run a query and get back ids 4 and 5, since their validity dates end in 2025.
My current implementation doesn't work, in the sense that it doesn't return the right output.
SELECT id
FROM validity
WHERE validity.validity_dates = DATERANGE('2023-01-09','2025-01-09', '[]')
How best do I approach this?

SSRS line chart count when date not null breaks only if there are future dates

I have a line chart created that displays the counts of rowIDs grouped by the month of an associated date. However, not every row has the date. If there is no date, I don't want it counted at all. My current semi-fix for this is adding a filter in the category group properties that excluded dates earlier than 1/1/1960. I tried "IsNothing(Date) <> true" type filters but they didn't work. I am guessing in the background the NULL dates are getting translated to 1/1/1900 or something.
With the Date > 1960 filter it works as I want, doesn't count any rows w/ NULL dates, almost. If any of the dates are in the future, specifically the next month, then all the NULL dates get counted in that month. If the date is more than 1 month in the future then no NULLs are counted, although the line skips months with no dates. I tried forcing it to 0 for those months but it doesn't work. I would think that's because of the way I'm grouping on the associated date. Such a grouping results in
| Month | Count |
|-------|-------|
| 10/20 | 2 |
| 1/21 | 4 |
when what it needs is something like
| Month | Count |
|-------|-------|
| 10/20 | 2 |
| 11/20 | |
| 12/20 | |
| 1/21 | 4 |
Does anyone have any idea what's causing the NULLs to get counted in the next month? Is there a better way to do what I'm trying to do so that the 0 months will show up?

How to load DataFrame from semi-structured textfile?

I have a semi-structured text file that I want to convert to a DataFrame in Spark. I have a schema in mind, which is shown below. However, I am finding it challenging to parse my text file and assign the schema.
Following is my sample text file:
"good service"
Tom Martin (USA) 17th October 2015
4
Long review..
Type Of Traveller Couple Leisure
Cabin Flown Economy
Route Miami to Chicago
Date Flown September 2015
Seat Comfort 12345
Cabin Staff Service 12345
Ground Service 12345
Value For Money 12345
Recommended no
"not bad"
M Muller (Canada) 22nd September 2015
6
Yet another long review..
Aircraft TXT-101
Type Of Customer Couple Leisure
Cabin Flown FirstClass
Route IND to CHI
Date Flown September 2015
Seat Comfort 12345
Cabin Staff Service 12345
Food & Beverages 12345
Inflight Entertainment 12345
Ground Service 12345
Value For Money 12345
Recommended yes
.
.
The schema and result that I expect are as follows:
+----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+
| Review_Header | User_Name | User_Country | User_Review_Date | Overall Score | Review | Aircraft | Type of Traveler | Cabin Flown | Route_Source | Route_Destination | Date Flown | Seat Comfort | Cabin Staff Service | Food & Beverage | Inflight Entertainment | Ground Service | Wifi & Connectivity | Value for Money |
+----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+
| "good service" | Tom Martin | USA | 17th October 2015 | 4 | Long review.. | | Couple Leisure | Economy | Miami | Chicago | September 2015 | 12345 | 12345 | | | 12345 | | 12345 |
| "not bad" | M Muller | Canada | 22nd September 2015 | 6 | Yet another long review.. | TXT-101 | Couple Leisure | FirstClass | IND | CHI | September 2015 | 12345 | 12345 | 12345 | 12345 | 12345 | | 12345 |
+----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+
As you may notice, for each block of data in the text file, the first four lines are mapped to user-defined columns such as Review_Header, User_Name, User_Country and User_Review_Date, whereas each of the remaining lines has its own defined column.
What would be the best way to use a schema inference technique in such a scenario, rather than writing verbose code?
UPDATE: I would like to make this problem a little trickier. What if the "Long review.." and "Yet another long review" texts could themselves span multiple lines? How can I parse a review that spans multiple lines in each block?
If you can guarantee that the semi-structured text file has records separated by two newlines, and that those two newlines will never appear in the "Long review..." section, you may be able to use textFile with a modified delimiter ("\n\n") and then process the lines without writing a custom file format.
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\n\n")
val records = sc.textFile("sample-file.txt") // RDD[String], one element per record
Then you can do further splitting on "\n" and "\t" to create your fields and columns.
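The splitting described above can be tried outside Spark first. The following is a minimal plain-Python sketch, assuming records are separated by a blank line and attribute lines use a tab between name and value; the function name and the shortened sample text are made up for illustration.

```python
def split_records(text):
    """Split raw text into records on blank lines, then each record into lines."""
    blocks = [b for b in text.split("\n\n") if b.strip()]
    return [block.strip("\n").split("\n") for block in blocks]


# Shortened, hypothetical sample: two records separated by a blank line.
sample = (
    '"good service"\n'
    "Tom Martin (USA) 17th October 2015\n"
    "4\n"
    "Long review..\n"
    "Recommended\tno\n"
    "\n"
    '"not bad"\n'
    "M Muller (Canada) 22nd September 2015\n"
    "6\n"
    "Yet another long review..\n"
    "Recommended\tyes\n"
)
records = split_records(sample)
# Attribute lines can then be split once more on the tab.
last_attribute = records[1][-1].split("\t")
```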
Seeing your update, it's a fairly difficult problem. You have to ask yourself what identifying information is in the attributes that is not in the review, or what is guaranteed to be in a specific format. E.g.
Can you guarantee there are never two consecutive newlines in the long review? This is important if we're splitting on "\n\n" to generate the blocks.
Can you guarantee there are no tabs in the long review?
Is Aircraft, Cabin Flown, Cabin Staff Service, Date Flown, Food & Beverages, Ground Service, ... the full list of attributes? Do you have a full list of possible attributes?
As well as some meta questions:
Where is this data coming from?
Can we request it in a better format?
Can we find this data, or the aspects we're looking for from a better source?
With those known, you'll have a better idea on how to proceed. E.g. if there are no tabs in the review text, (or they're escaped as "\t" or something):
Extract lines[0] - first line "good service"
Extract lines[1] - split to user name, country, review date
Filter lines[2:] containing tabs, get lowest index i - split into attributes
Join lines[2:i] with "\n" - this is the review
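The steps above can be sketched for a single block as follows. This is an illustration under stated assumptions, not a Spark reader: attribute lines are assumed to contain a tab, the review is assumed to contain none, and the user line is assumed to look like "Name (Country) Date". Because the sample file shows the overall score on the third line, the sketch keeps it separate from the review text. The function name, regular expression and dictionary keys are hypothetical.

```python
import re

# Assumed shape of the second line: "Name (Country) Date".
HEADER_RE = re.compile(r"^(?P<name>.+?) \((?P<country>[^)]+)\) (?P<date>.+)$")


def parse_block(lines):
    """Turn the lines of one review block into a flat dictionary."""
    match = HEADER_RE.match(lines[1])
    if match is None:
        raise ValueError("unexpected user line: " + lines[1])
    # Index of the first attribute line (the first line containing a tab).
    first_attr = next(
        (i for i, line in enumerate(lines) if i >= 3 and "\t" in line),
        len(lines),
    )
    record = {
        "Review_Header": lines[0],
        "User_Name": match.group("name"),
        "User_Country": match.group("country"),
        "User_Review_Date": match.group("date"),
        "Overall Score": lines[2],
        # Everything between the score and the first attribute is the review.
        "Review": "\n".join(lines[3:first_attr]),
    }
    for line in lines[first_attr:]:
        if "\t" in line:
            name, value = line.split("\t", 1)
            record[name] = value
    return record


block = [
    '"good service"',
    "Tom Martin (USA) 17th October 2015",
    "4",
    "Long review..",
    "spanning two lines",
    "Cabin Flown\tEconomy",
    "Route\tMiami to Chicago",
    "Recommended\tno",
]
parsed = parse_block(block)
```

Attributes missing from a block (for example Aircraft in the first sample record) are simply absent from the dictionary, so they can be filled with nulls when the rows are later aligned to a fixed schema.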
What would be the best way to use a schema inference technique in such a scenario, rather than writing verbose code?
You don't have much choice: you have to write verbose code or a custom FileFormat (that would hide the complexity of loading such files into a DataFrame).
Use DataFrameReader.textFile to load the file and transform it accordingly.
textFile(path: String): Dataset[String] Loads text files and returns a Dataset of String. See the documentation on the other overloaded textFile() method for more details.

Tableau - Filter on maximum date within dimension

I have data that are structured as so:
| Node | Update Datetime | Measure Values |
|------|-----------------|----------------|
| A | 2018-01-01 | 1 |
| A | 2018-01-05 | 3 |
| A | 2018-01-06 | 4 |
| B | 2018-01-02 | 2 |
| B | 2018-01-03 | 4 |
The nodes are updated over time, with the measure values showing the node's value at the time of data entry, meaning that just filtering on a date range will overstate the node's value. The report needs to be responsive, or else I would just do this in SQL; as it stands, I need to keep only the rows that contain the maximum datetime value within each node, after the entire dataset is filtered on a general date window.
I think creating a level-of-detail (LOD) calculated field and using it as a filter is the cleanest and quickest way to achieve the desired filter.
See Option 2 here -- http://kb.tableau.com/articles/howto/setting-default-date-to-most-recent-date-on-a-quick-filter?lang=en
Let me know if that does not work.
(FYI Tableau Community Forums is another resource for Tableau questions. I use both sites)
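The order of operations the question asks for (apply the date window first, then keep only the latest row per node) can be made concrete with a small sketch. This is plain Python rather than a Tableau calculation, the function name is hypothetical, and the date window used here is an arbitrary example.

```python
from datetime import date


def latest_per_node(rows, start, end):
    """Filter rows to [start, end], then keep the latest row for each node."""
    latest = {}
    for node, updated, value in rows:
        if not (start <= updated <= end):
            continue  # outside the general date window
        if node not in latest or updated > latest[node][1]:
            latest[node] = (node, updated, value)
    return sorted(latest.values())


rows = [
    ("A", date(2018, 1, 1), 1),
    ("A", date(2018, 1, 5), 3),
    ("A", date(2018, 1, 6), 4),
    ("B", date(2018, 1, 2), 2),
    ("B", date(2018, 1, 3), 4),
]
kept = latest_per_node(rows, date(2018, 1, 1), date(2018, 1, 5))
```

With this window, node A keeps its 2018-01-05 row (value 3) rather than the 2018-01-06 row, which shows why the maximum has to be computed after the window filter and not before it.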

Generating Running Sum of Ratings in SQL

I have a rating table. It boils down to:
rating_value created
+2 april 3rd
-5 april 20th
So, every time someone gets rated, I track that rating event in the database.
I want to generate a rating history/time graph where the rating is the sum of all ratings up to that point in time on a graph.
I.e., a person's rating on April 5th might be select sum(rating_value) from ratings where created <= april 5th
The only problem with this approach is I have to run this day by day across the interval I'm interested in. Is there some trick to generating a running total using this sort of data?
Otherwise, I'm thinking the best approach is to create a denormalized "rating history" table alongside the individual ratings.
If you have PostgreSQL 8.4 or later, you can use a window-aggregate function to calculate a running sum:
select rating_value, created,
       sum(rating_value) over (order by created)
from rating;
 rating_value |  created   | sum
--------------+------------+-----
            2 | 2010-04-03 |   2
           -5 | 2010-04-20 |  -3
(2 rows)
See http://www.postgresql.org/docs/current/static/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS
Try adding a GROUP BY clause. That gives you the rating value for each day (in e.g. an array). As you output the rating value over time, you can just add the previous array elements together.
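The group-then-accumulate idea can be sketched in application code as follows. This is a minimal Python illustration, assuming the ratings have already been fetched as (rating_value, created) pairs; the function name is hypothetical and the dates reuse the sample rows from the query output above.

```python
from collections import defaultdict
from datetime import date
from itertools import accumulate


def running_totals(ratings):
    """Sum ratings per day, then accumulate the daily sums in date order."""
    per_day = defaultdict(int)
    for value, created in ratings:
        per_day[created] += value  # the "group by day" step
    days = sorted(per_day)
    totals = accumulate(per_day[d] for d in days)  # running sum over days
    return list(zip(days, totals))


ratings = [(2, date(2010, 4, 3)), (-5, date(2010, 4, 20))]
history = running_totals(ratings)
```

This needs a single pass over the grouped data instead of one sum query per day in the interval.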