I'm using Pig to read a huge CSV file (+29000 lines) that looks like this
What I'm interested in is begin and end, which are dates
I'm trying to find items that were active in 1930. So first I loaded the file using this statement :
stations = LOAD '/mytp/isd-history.csv'
USING PigStorage(',')
AS
(
id:int,
wban:long,
name:chararray,
country:chararray,
state:chararray,
icao:chararray,
lat:double,
lon:double,
ele:double,
begin:chararray,
end:chararray
);
Then I used this query to FILTER by date
items_active_1930 = FILTER stations
BY ToDate(begin,'yyyy-MM-dd') >= ToDate('1930-01-01')
AND ToDate(end,'yyyy-MM-dd') <= ToDate('1930-12-31');
When I try to dump, the job fails with the following result :
Unable to open iterator for alias items_active_1930. Backend error : Exception while executing [POUserFunc (Name: POUserFunc(org.apache.pig.builtin.ToDate2ARGS)[datetime] - scope-172 Operator Key: scope-172) children: null at []]: java.lang.IllegalArgumentException: Invalid format: "begin"
I would like to know whether it's possible, in FILTER, to first check that both begin and end are valid dates matching the specified date format, so that no errors occur in ToDate().
Specify the format for the '1930-01-01' and '1930-12-31' literals as well:
items_active_1930 = FILTER stations
BY (datetime)ToDate(begin,'yyyy-MM-dd') >= (datetime)ToDate('1930-01-01','yyyy-MM-dd')
AND (datetime)ToDate(end,'yyyy-MM-dd') <= (datetime)ToDate('1930-12-31','yyyy-MM-dd');
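The Invalid format: "begin" part of the error suggests that the CSV header row itself reached ToDate(). The "check first, then convert" idea from the question can be sketched as below. This is Python rather than Pig, purely to illustrate the logic; the sample rows and the helper name parse_or_none are made up for the example and are not part of the original data.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"  # Python equivalent of the 'yyyy-MM-dd' pattern

def parse_or_none(text):
    """Return a datetime if text matches DATE_FORMAT, otherwise None."""
    try:
        return datetime.strptime(text, DATE_FORMAT)
    except (TypeError, ValueError):
        return None

# Hypothetical rows: (name, begin, end). The first one mimics a header line.
rows = [
    ("name", "begin", "end"),
    ("STATION A", "1930-03-01", "1930-10-15"),
    ("STATION B", "1929-01-01", "1950-12-31"),
    ("STATION C", "1930-05-01", ""),
]

lo = datetime(1930, 1, 1)
hi = datetime(1930, 12, 31)

items_1930 = []
for name, begin, end in rows:
    b, e = parse_or_none(begin), parse_or_none(end)
    if b is None or e is None:
        continue  # skip header rows and malformed or empty dates
    # Same comparison as the FILTER in the question
    if b >= lo and e <= hi:
        items_1930.append(name)

print(items_1930)
```

Only rows whose begin and end both parse are compared, so the header line and the row with an empty end date are dropped before any date comparison happens.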
I am trying to update a JSONB field in one table with data from another table. For example,
update ms
set data = data || '{"COMMERCIAL": 3.4, "PCT" : medi_percent}'
from mix
where mix.id = mss.data_id
and data_id = 6000
and set_id = 20
This is giving me the following error -
Invalid input syntax for type json
DETAIL: Token "medi_percent" is invalid.
When I change medi_percent to a number, I don't get this error.
{"COMMERCIAL": 3.4, "PCT" : medi_percent} is not a valid JSON text. Notice there is no string interpolation happening here. You might be looking for
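jsonb_build_object('COMMERCIAL', 3.4, 'PCT', medi_percent)
instead, where medi_percent is now an expression (that will presumably refer to your mix column). The jsonb_ variant is used because data is a JSONB field and the || operator concatenates jsonb with jsonb.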
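The difference between a JSON literal and an object built from values can be illustrated outside the database. The following is a small Python sketch of the same idea, not PostgreSQL code; medi_percent is given a hypothetical value here to stand in for the column.

```python
import json

medi_percent = 12.5  # hypothetical value standing in for the mix column

# A literal containing a bare identifier is not valid JSON, so parsing fails.
literal = '{"COMMERCIAL": 3.4, "PCT" : medi_percent}'
try:
    json.loads(literal)
    literal_is_valid = True
except json.JSONDecodeError:
    literal_is_valid = False

# Building the object from values lets the variable be evaluated first.
built = json.dumps({"COMMERCIAL": 3.4, "PCT": medi_percent})

print(literal_is_valid)
print(built)
```

The literal is rejected for the same reason the UPDATE fails: nothing substitutes the name inside a quoted string, whereas a builder function receives the evaluated value.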
I tried the following code to subset my data so that it only gives me a date range from 6/1 to yesterday:
day_1 = '2018-06-01'
df = df.where((F.col('report_date') >= day_1) & (F.col('report_date') < F.current_date()))
I get the following error: AnalysisException: u"cannot resolve '2018-06-01' given input columns
You can use the lit function from pyspark.sql.functions to wrap the value in a literal column:
from pyspark.sql import functions as F
df = df.where((F.col('report_date') >= F.lit(day_1)) & (F.col('report_date') < F.current_date()))
I have a day column and a month column and would like to concatenate the year to it and store it in CHARARRAY format with the hyphens.
so I have: month:CHARARRAY, day:CHARARRAY
Meaning, for example, if the day column contains '03' and the month column contains '04', I would like to create a date column that contains: '2014-04-03'
This is my code:
CONCAT('2014-',month,'-',day) as date;
It doesn't work and I'm not quite sure how to concatenate additional text onto the column.
I would like to note that I'm not sure converting to date format is an option for me. I would prefer to keep it in CHARARRAY format since I would like to join with another file that has date stored in CHARARRAY format.
Assuming this is the data file called dateExample.csv:
Surender,02,03,1988
Raja,12,09,1998
Raj,05,10,1986
This is the script for pig:
A = LOAD 'dateExample.csv' USING PigStorage(',') AS(name:chararray,day:chararray,month:long,year:chararray);
X = FOREACH A GENERATE CONCAT((chararray)day,CONCAT('-',CONCAT((chararray)month,CONCAT('-',(chararray)year))));
dump X;
You will get the desired output:
(02-3-1988)
(12-9-1998)
(05-10-1986)
Explanation:
When we try to concat like this:
X = FOREACH A GENERATE CONCAT(day,CONCAT('-',CONCAT(month,CONCAT('-',year))));
We get following exception :
ERROR 1045:
<line 2, column 45> Could not infer the matching function for org.apache.pig.builtin.CONCAT as multiple or none of them fit. Please use an explicit cast.
So we need to explicitly cast the day, month and year values to chararray, and then it works.
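Note that casting a month loaded as long drops its leading zero (3 instead of 03 in the output above). For the '2014-04-03' shape asked for in the question, the pieces can be cast to text and padded before joining. A minimal sketch of that logic in Python (not Pig); build_date is a hypothetical helper name used only for illustration:

```python
def build_date(year, month, day):
    """Join year, month and day into a 'yyyy-MM-dd' string."""
    # str() plays the role of the explicit (chararray) cast;
    # zfill(2) restores a leading zero lost by a numeric type.
    return "-".join([str(year), str(month).zfill(2), str(day).zfill(2)])

print(build_date("2014", "04", "03"))
print(build_date(1988, 3, "02"))
```

If month and day are already loaded as chararray, as in the question, the padding step is a no-op and only the joining with hyphens matters.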
Using XPath, how do I figure out if a date or datetime field is null or blank?
I am using the concat method as a stand-in for the XPath if statement
concat(
substring(../preceding-sibling::my:PerDiem[1]/my:perDiemEnd, 1, ../preceding-sibling::my:PerDiem[1]/my:perDiemEnd = "" * string-length(../preceding-sibling::my:PerDiem[1]/my:perDiemEnd)),
substring(/my:ExpenseForm/my:ExpenseHeader/my:departureDateTime, 1, not(../preceding-sibling::my:PerDiem[1]/my:perDiemEnd = "") * string-length(/my:ExpenseForm/my:ExpenseHeader/my:departureDateTime))
)
More info:
In InfoPath 2010, a repeating table has two date/time fields called perDiemStart and perDiemEnd. In the repeating table, each perDiemStart should equal the previous row's perDiemEnd. This is easily done by setting the default value of perDiemStart to ../preceding-sibling::my:PerDiem[1]/my:perDiemEnd
For the first perDiemStart, however, no previous perDiemEnd exists, so I suppose the value would be null/blank. I want that first (blank) value to be something different: the value of the departureDateTime node.
Node locations:
/my:ExpenseForm/my:ExpenseHeader/my:departureDateTime
/my:ExpenseForm/my:PerDiemDetails/my:PerDiems/my:PerDiem/my:perDiemStart
/my:ExpenseForm/my:PerDiemDetails/my:PerDiems/my:PerDiem/my:perDiemEnd
To check if it is filled:
perDiemStart[text()]
To check if it is empty/null:
perDiemStart[not(text())]
Does this help? http://blogs.msdn.com/b/syamp/archive/2011/03/13/fim-2010-xpath-how-to-check-null-value-on-a-datetime-attribute.aspx
Basically they detect null dates by getting the set of dates after an old date (e.g. 1900-01-01) and then using 'not' to see which nodes would be excluded.
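The fallback described in the question (use the previous perDiemEnd when it has text, otherwise departureDateTime) can be sketched outside InfoPath as follows. This is Python with the standard library's ElementTree, not an XPath expression; the namespace URI and the sample document are invented for the example, and only the element names come from the question.

```python
import xml.etree.ElementTree as ET

NS = {"my": "urn:example:my"}  # hypothetical namespace URI for the my: prefix

doc = """
<my:ExpenseForm xmlns:my="urn:example:my">
  <my:ExpenseHeader>
    <my:departureDateTime>2014-04-03T08:00:00</my:departureDateTime>
  </my:ExpenseHeader>
  <my:PerDiemDetails><my:PerDiems>
    <my:PerDiem><my:perDiemStart/><my:perDiemEnd>2014-04-04T08:00:00</my:perDiemEnd></my:PerDiem>
    <my:PerDiem><my:perDiemStart/><my:perDiemEnd></my:perDiemEnd></my:PerDiem>
  </my:PerDiems></my:PerDiemDetails>
</my:ExpenseForm>
"""

root = ET.fromstring(doc.strip())
departure = root.findtext("my:ExpenseHeader/my:departureDateTime", "", NS)

starts = []
previous_end = ""
for per_diem in root.findall(".//my:PerDiem", NS):
    # Blank or missing previous end -> fall back to the departure time
    starts.append(previous_end.strip() or departure)
    previous_end = per_diem.findtext("my:perDiemEnd", "", NS) or ""

print(starts)
```

Treating "missing" and "present but empty" the same way is a design choice made here for simplicity; it corresponds to testing not(text()) on the node.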
I have a query with the below WHERE clauses
WHERE
I.new_outstandingamount = 70
AND ISNUMERIC(SUBSTRING(RA.new_stampernumber,7, 4)) = 1
AND (DATEDIFF(M,T.new_commencementdate, SUBSTRING(RA.new_stampernumber,7, 10)) >= 1)
AND RA.new_applicationstatusname = 'Complete'
AND I.new_feereceived > 0
AND RA.new_stampernumber IS NOT NULL
AND T.new_commencementdate IS NOT NULL
RA.new_stampernumber is a string value which contains three concatenated pieces of information of uniform length. The middle piece of info in this string is a date in the format yyyy-MM-dd.
In order to filter out any rows where the date in this string is not formatted as expected, I check whether its first 4 characters are numeric using the ISNUMERIC function.
When I run the query I get an error message saying
The conversion of a nvarchar data type to a datetime data type resulted in an out-of-range value.
The line that is causing this error to occur is
AND (DATEDIFF(M,T.new_commencementdate, SUBSTRING(RA.new_stampernumber,7, 10)) >= 1)
When I comment out this line I don't get an error.
What is strange is that if I replace
AND ISNUMERIC(SUBSTRING(RA.new_stampernumber,7, 4)) = 1
with
AND SUBSTRING(RA.new_stampernumber,7, 4) IN ('2003','2004','2005','2006','2007','2008','2009','2010', '2011', '2012','2013','2014','2015')
the query runs successfully.
What's even more strange is that if I replace the above working line with this
AND SUBSTRING(RA.new_stampernumber,11, 1) = '-'
I get the error message again. But if I replace the equals sign with a LIKE comparison it works:
AND SUBSTRING(RA.new_stampernumber,11, 1) LIKE '-'
When I remove the DATEDIFF function and compare the results of each of these queries they all return the same resultset so it is not being caused by different data being returned by the different clauses.
Can anyone explain to me what could be causing the out-of-range error to be thrown for some clauses and not for others if the data being returned is in fact the same for each clause?
Thanks,
Neil
Different execution plans.
There is no guarantee that the WHERE clauses are processed in any particular order. Presumably, when it works, the plan happens to filter out the erroring rows before attempting the cast to date.
Also, ISNUMERIC itself isn't very reliable for what you want: it returns 1 for strings such as '1e4' or '$', which are not years. I'd change the DATEDIFF predicate to something like the below, so that the check and the conversion sit in the same expression:
AND (DATEDIFF(M, T.new_commencementdate,
CASE
WHEN RA.new_stampernumber LIKE
'______[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]%'
THEN SUBSTRING(RA.new_stampernumber, 7, 10)
END) >= 1)
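The point of the CASE expression is that the format check and the conversion live in one expression, so no execution plan can run one without the other. A rough Python sketch of that logic follows (not T-SQL; months_since and the sample stamper numbers are hypothetical). One caveat worth noting: a shape check alone still lets impossible dates such as month 13 through, so the sketch also catches the conversion failure.

```python
import re
from datetime import date

# Same shape as the LIKE pattern: six characters, then yyyy-MM-dd
STAMP_PATTERN = re.compile(r"^.{6}(\d{4})-(\d{2})-(\d{2})")

def months_since(commencement, stamper_number):
    """Month boundaries crossed between commencement and the embedded date.

    Returns None when the embedded text is not a usable date, which mirrors
    a CASE expression with no ELSE branch yielding NULL.
    """
    match = STAMP_PATTERN.match(stamper_number or "")
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        stamped = date(year, month, day)
    except ValueError:
        # Right shape but impossible date, e.g. month 13
        return None
    return (stamped.year - commencement.year) * 12 + (stamped.month - commencement.month)

start = date(2010, 1, 15)
print(months_since(start, "ABC1232010-03-01XYZ"))
print(months_since(start, "ABC12320101-3-01XYZ"))
print(months_since(start, "ABC1232010-13-01XYZ"))
```

Rows that yield None would fail a ">= 1" comparison in SQL as well, since a comparison with NULL is not true, so they are filtered out instead of raising a conversion error.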