I have a day column and a month column, and I would like to concatenate a year onto them and store the result in CHARARRAY format, with hyphens.
so I have: month:CHARARRAY, day:CHARARRAY
Meaning, for example, if the day column contains '03' and the month column contains '04', I would like to create a date column that contains: '2014-04-03'
This is my code:
CONCAT('2014-',month,'-',day) as date;
It doesn't work and I'm not quite sure how to concatenate additional text onto the column.
I would like to note that I'm not sure converting to date format is an option for me. I would prefer to keep it in CHARARRAY format since I would like to join with another file that has date stored in CHARARRAY format.
Assuming this is the data file called dateExample.csv:
Surender,02,03,1988
Raja,12,09,1998
Raj,05,10,1986
This is the Pig script:
A = LOAD 'dateExample.csv' USING PigStorage(',') AS (name:chararray, day:chararray, month:long, year:chararray);
X = FOREACH A GENERATE CONCAT((chararray)day, CONCAT('-', CONCAT((chararray)month, CONCAT('-', (chararray)year))));
dump X;
You will get the following output (note that month, having been loaded as long, loses its leading zero):
(02-3-1988)
(12-9-1998)
(05-10-1986)
Explanation:
When we try to concat like this:
X = FOREACH A GENERATE CONCAT(day,CONCAT('-',CONCAT(month,CONCAT('-',year))));
We get the following exception:
ERROR 1045:
<line 2, column 45> Could not infer the matching function for org.apache.pig.builtin.CONCAT as multiple or none of them fit. Please use an explicit cast.
So we need to explicitly cast the day, month, and year values to chararray, and then it works.
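For the record, the format the question actually asked for ('2014-04-03') needs month loaded as chararray so its leading zero survives, with the year supplied as a literal. Here is a minimal sketch using only two-argument CONCATs (multi-argument CONCAT is only available in newer Pig versions, which may be why the four-argument call in the question failed):
-- assumes the same dateExample.csv layout as above
A = LOAD 'dateExample.csv' USING PigStorage(',') AS (name:chararray, day:chararray, month:chararray, year:chararray);
X = FOREACH A GENERATE CONCAT('2014-', CONCAT(month, CONCAT('-', day)));
DUMP X;
-- expected output for the sample rows: (2014-03-02), (2014-09-12), (2014-10-05)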
Source data
I am working on an ELT project to load data from CSV files into PostgreSQL, where I will transform it. The CSV files have many columns that are consistent across files, but they also contain activity columns with inconsistent names like Date (05/19/2020), Type (05/19/2020), etc.
In the loading script I am merging all of the columns with dates in the column name into one jsonb column so I don't have to constantly add new columns to the raw data table.
The resulting jsonb column in the raw data table looks like this:
id       | activity
---------|---------
12345678 | {"Date (05/19/2020)": null, "Type (05/19/2020)": null, "Date (06/03/2020)": "06/01/2020", "Type (06/03/2020)": "E"}
98765432 | {"Date (05/19/2020)": "05/18/2020", "Type (05/19/2020)": "B", "Date (10/23/2020)": "10/26/2020", "Type (10/23/2020)": "T"}
JSON to columns
Using the amazing create_jsonb_flat_view function from this post I can convert the jsonb to columns like this:
id       | Date (05/19/2020) | Type (05/19/2020) | Date (06/03/2020) | Type (06/03/2020) | Date (10/23/2020) | Type (10/23/2020)
---------|-------------------|-------------------|-------------------|-------------------|-------------------|------------------
12345678 | null              | null              | 06/01/2020        | E                 | null              | null
98765432 | 05/18/2020        | B                 | null              | null              | 10/26/2020        | T
Need to move part of column name to row
Now, this is where I'm stuck. I need to remove the portion of the column name that is the Activity Date (e.g. (05/19/2020)) and create a row for each id and ActivityDate with additional columns for Date and Type like this:
id       | ActivityDate | Date       | Type
---------|--------------|------------|-----
12345678 | 05/19/2020   | null       | null
12345678 | 06/03/2020   | 06/01/2020 | E
98765432 | 05/19/2020   | 05/18/2020 | B
98765432 | 10/23/2020   | 10/26/2020 | T
I followed your link to the create_jsonb_flat_view article yesterday and then forgot this question. While I thank you for pointing me there, I think that mentioning it worked against you.
A more conventional approach using regexp_replace() works here. I left the date values as strings, but you can convert them with to_date() if needed:
with parse as (
    select id, e.k, e.v,
           regexp_replace(e.k, '\s+\([0-9/]{10}\)', '') as k_no_date,
           regexp_replace(e.k, '^.+([0-9/]{10}).+', '\1') as k_date_only
    from rawinput
    cross join lateral jsonb_each_text(activity) as e(k, v)
)
select id,
       k_date_only as activity_date,
       min(v) filter (where k_no_date = 'Date') as date,
       min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;
db<>fiddle here
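As a side note, the to_date() conversion mentioned above could be sketched like this, assuming the MM/DD/YYYY format seen in the sample values (this replaces only the final select; the parse CTE is unchanged):
select id,
       to_date(k_date_only, 'MM/DD/YYYY') as activity_date,
       to_date(min(v) filter (where k_no_date = 'Date'), 'MM/DD/YYYY') as date,  -- null stays null
       min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;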
@Mike-Organek's answer works beautifully!
However, I was curious whether the regexp_replace() calls might be slowing the query down a bit, and it seemed I could get the same results using a simpler function.
Since Mike gave me a great example to start with, I modified it to split on the space between Date and (05/19/2020).
For 20,000 rows, it went from taking an average of 7 sec on my local machine to an average of 0.9 sec.
Here is the resulting query:
with parse as (
    select id, e.k, e.v,
           split_part(e.k, ' ', 1) as k_no_date,
           trim(split_part(e.k, ' ', 2), '()') as k_date_only
    from rawinput
    cross join lateral jsonb_each_text(activity) as e(k, v)
)
select id,
       k_date_only as activity_date,
       min(v) filter (where k_no_date = 'Date') as date,
       min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;
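To make the assumption behind the simpler parsing explicit: it relies on every key having exactly the form Word (MM/DD/YYYY), with one space and the date wrapped in parentheses. A quick sanity check of the two helpers on a sample key from the question:
select split_part('Date (05/19/2020)', ' ', 1)              as k_no_date,   -- 'Date'
       trim(split_part('Date (05/19/2020)', ' ', 2), '()')  as k_date_only; -- '05/19/2020'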
I have a filename column whose values contain a data sequence and a datetime, as shown below:
DATA01_0003_20210126135705.zip
DATA01_0002_20210127135030.zip
DATA01_0004_20210126142913.zip
I want to ORDER BY the datetime first, then the sequence string. Is that possible?
DATA01_0002_20210127135030.zip
DATA01_0004_20210126142913.zip
DATA01_0003_20210126135705.zip
I tried statements like the ones below, but they sort by sequence before datetime:
SELECT filename FROM tblData ORDER BY filename
DATA01_0002_20210127135030.zip
DATA01_0003_20210126135705.zip
DATA01_0004_20210126142913.zip
SELECT filename FROM tblData ORDER BY filename DESC
DATA01_0004_20210126142913.zip
DATA01_0003_20210126135705.zip
DATA01_0002_20210127135030.zip
In PostgreSQL you can sort by an expression.
To sort by the file's timestamp, it is necessary to extract it from the string. Based on your examples, I am going to guess that the text after the last _ holds the date. If that assumption holds, then the following will sort by date:
select filename
from tbldata
order by reverse(split_part(reverse(filename), '_', 1));
If the number of _ is fixed, then the trick with the two reverses is unnecessary; you can instead order by split_part(filename, '_', 3).
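Putting the pieces together for the ordering the question asked for (timestamp first, descending as in the desired output, then the sequence string), a sketch assuming the fixed DATA01_seq_timestamp.zip layout:
select filename
from tbldata
order by split_part(filename, '_', 3) desc,  -- the fixed-width digit timestamps sort correctly as text; the shared .zip suffix doesn't affect order
         split_part(filename, '_', 2);       -- then the sequence string, e.g. 0002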
I have a table t that contains a column Year. The following command:
class(t.Year)
returns:
ans =
char
This is not just for the column year but for all columns in the table.
I need to convert the table to a cellstr so that I can run str2num on it, but I am unable to convert all of the columns and rows to a string type. I also need to remove the quotes from the column names when I do table2cell. After table2cell I need to convert to cellstr, and I am unable to do so since all the values in the table's columns are char.
I have the table "dates" below; it has a sym column with symbols and a d column with lists of strings, and I would like to save it to a regular CSV file. I couldn't find a good way to do it. Any suggestions?
q)dates
sym d
----------------------------------------------------------------------------
6AH0 "1970.03.16" "1980.03.17" "1990.03.19" "2010.03.15"
6AH6 "1976.03.15" "1986.03.17" "1996.03.18" "2016.03.14"
6AH7 "1977.03.14" "1987.03.16" "1997.03.17" "2017.03.13"
6AH8 "1978.03.13" "1988.03.14" "1998.03.16" "2018.03.19"
6AH9 "1979.03.19" "1989.03.13" "1999.03.15" "2019.03.18"
When I try to do a regular save, the error below happens:
q)save `:dates.csv
k){$[t&77>t:#y;$y;x;-14!'y;y]}
'type
q))
The internal table-to-CSV conversion function within kdb+ is not able to handle nested lists in columns. The d column in your table is a list of lists of chars. However, the conversion function is able to handle a simply-nested column (depth of 1).
Therefore, you can convert the d column to a list of chars and then save to CSV using the internal function:
/ generate a table of dummy data
q)show dates:flip `sym`d!(`6AH0`6AH6`6AH7;string (3;0N)#12?.z.d)
sym d
--------------------------------------------------------
6AH0 "2008.02.04" "2015.01.02" "2003.07.05" "2005.02.25"
6AH6 "2012.10.25" "2008.08.28" "2017.01.25" "2007.12.27"
6AH7 "2004.02.01" "2005.06.06" "2013.02.11" "2010.12.20"
/ convert 'd' column to simple list - the (" " sv') is the conversion func here
q)#[`dates;`d;" " sv']
`dates
/ review what was done
q)show dates
sym d
--------------------------------------------------
6AH0 "2008.02.04 2015.01.02 2003.07.05 2005.02.25"
6AH6 "2012.10.25 2008.08.28 2017.01.25 2007.12.27"
6AH7 "2004.02.01 2005.06.06 2013.02.11 2010.12.20"
/ save to csv
q)save `:dates.csv
`:dates.csv
/ review saved csv
q)\cat dates.csv
"sym,d"
"6AH0,2008.02.04 2015.01.02 2003.07.05 2005.02.25"
"6AH6,2012.10.25 2008.08.28 2017.01.25 2007.12.27"
"6AH7,2004.02.01 2005.06.06 2013.02.11 2010.12.20"
As per the CSV specification, you'll want to flatten the list out, separate the items with commas, and double-quote the whole list.
'save' is limited in that the file must be named the same as the global variable you are saving.
If I were tasked with your question, I'd do it like so:
`:myFileNamedWhatever.csv 0: csv 0: select sym,csv sv'd from dates
Explanation:
csv 0: table    / csv is a variable, literally defined as "," - it's good for readability. csv 0: table converts the table to a comma-separated list of strings
`:file 0: listOfStrings    / this takes a LIST of strings and pushes them to the file handle. Each element of the list becomes a new line in the file
I'd prefer this approach as it is general and allows the saving of various types. You can use it within a function, etc.
If at a later date I decided that I wanted it saved as a pipe-separated (or any other delimiter) file:
`:myNewFile.psv 0: "|" 0: select sym,"|"sv'd from table
I want a query which will return a combination of characters and a number.
Example:
Table name - emp
Columns required - fname,lname,code
If fname = abc and lname = pqr and the row is the very first in the table, then the result should be code = ap001.
For the next row it should be like this:
fname = efg, lname = rst
code = er002, and so on.
I know that we can use substr to retrieve the first letter of a column, but I don't know how to do that with two columns and how to concatenate the results.
OK. You know you can use the substr function. Now, to concatenate you will need the concatenation operator ||. To get the number of the row retrieved by your query, you need the rownum pseudocolumn. Perhaps you will also need the to_char function to format the number. You can read about all of those functions and operators in the SQL reference. Anyway, I think you need something like this (I didn't check it):
select substr(fname, 1, 1) || substr(lname, 1, 1) || to_char(rownum, 'fm009') code
from emp
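With the two example rows from the question, that query would produce ap001 and er002. One caveat: rownum reflects the order in which Oracle happens to return the rows, so "the very first row of the table" is only well-defined if you impose an ORDER BY in a subquery first. A hedged sketch (ordering by fname is my assumption; substitute whatever defines "first" for you):
select substr(fname, 1, 1) || substr(lname, 1, 1) || to_char(rownum, 'fm009') code
from (select fname, lname from emp order by fname);  -- rownum is assigned after the subquery's ordering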