HiveContext.sql("insert into") - scala

I'm trying to insert data with HiveContext like this:
/* table filedata
CREATE TABLE `filedata`(
`host_id` string,
`reportbatch` string,
`url` string,
`datatype` string,
`data` string,
`created_at` string,
`if_del` boolean)
*/
hiveContext.sql("insert into filedata (host_id, data) values (\"a1e1\", \"welcome\")")
That throws an error, so I tried using "select" instead:
hiveContext.sql("select \"a1e1\" as host_id, \"welcome\"as data").write.mode("append").saveAsTable("filedata")
/*
stack trace
java.lang.ArrayIndexOutOfBoundsException: 2
*/
It only works if I supply all columns, like this:
hc.sql("select \"a1e1\" as host_id,
\"xx\" as reportbatch,
\"xx\" as url,
\"xx\" as datatype,
\"welcome\" as data,
\"2017\" as created_at,
1 as if_del").write.mode("append").saveAsTable("filedata")
Is there a way to insert only specified columns, for example just the "host_id" and "data" columns?

As far as I know, Hive does not support inserting values into only some of the columns.
From the documentation:
Each row listed in the VALUES clause is inserted into table tablename.
Values must be provided for every column in the table. The standard
SQL syntax that allows the user to insert values into only some
columns is not yet supported. To mimic the standard SQL, nulls can be
provided for columns the user does not wish to assign a value to.
So you should try this:
val data = sqlc.sql("select 'a1e1', null, null, null, 'welcome', null, null")
data.write.mode("append").insertInto("filedata")
Reference here
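If the untyped NULLs in that select cause a schema mismatch when writing, a variant with explicit casts keeps the columns aligned with the table. This is only a sketch, reusing the question's hiveContext and the column order of the filedata DDL:
// columns in the order of the filedata DDL:
// host_id, reportbatch, url, datatype, data, created_at, if_del
val row = hiveContext.sql(
  """select 'a1e1'                as host_id,
    |       cast(null as string)  as reportbatch,
    |       cast(null as string)  as url,
    |       cast(null as string)  as datatype,
    |       'welcome'             as data,
    |       cast(null as string)  as created_at,
    |       cast(null as boolean) as if_del""".stripMargin)

// insertInto matches columns by position, so the order above must follow the table
row.write.mode("append").insertInto("filedata")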

You can do it if you are using a row-columnar file format such as ORC. Please see the working example below; it is run in Hive, but it will work just as well with HiveContext.
hive> use default;
OK
Time taken: 1.735 seconds
hive> create table test_insert (a string, b string, c string, d int) stored as orc;
OK
Time taken: 0.132 seconds
hive> insert into test_insert (a,c) values('x','y');
Query ID = user_20171219190337_b293c372-5225-4084-94a1-dec1df9e930d
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1507021764560_1375895)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 4.06 s
--------------------------------------------------------------------------------
Loading data to table default.test_insert
Table default.test_insert stats: [numFiles=1, numRows=1, totalSize=417, rawDataSize=254]
OK
Time taken: 6.828 seconds
hive> select * from test_insert;
OK
x NULL y NULL
Time taken: 0.142 seconds, Fetched: 1 row(s)
hive>
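For reference, the same statements should work through HiveContext as well; a minimal, untested sketch, assuming a hiveContext is already in scope:
// create the ORC-backed table and insert into only two of its columns,
// mirroring the Hive session above
hiveContext.sql("create table test_insert (a string, b string, c string, d int) stored as orc")
hiveContext.sql("insert into test_insert (a, c) values ('x', 'y')")

// the columns that were left out come back as NULL
hiveContext.sql("select * from test_insert").show()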

Related

Divide a value in JSON using postgreSQL

I'm relatively new and would like to redenominate some values in my current database. This means going into my jsonb column, selecting a key's value and dividing it by 1000. I know how to select values, but updating them after performing the calculation has failed me. My table is named property_calculation and has two columns as follows (dynamic_fields is my jsonb column):
ID | dynamic_fields
1  | {"totalBaseValue": 4198571.230720645844841865113039602874778211116790771484375,"surfaceAreaValue": 18.108285497586717127660449477843940258026123046875,"assessedAnnualValue": 1801819.534798908603936834409607936611000776616631213755681528709828853607177734375}
2  | {"totalBaseValue": 7406547.28939837918763178237213651300407946109771728515625,"surfaceAreaValue": 31.94416993248973568597648409195244312286376953125,"assessedAnnualValue": 9121964.022681592442116216621222042691512210677018401838722638785839080810546875}
I would like to update the dynamic_fields.totalBaseValue by dividing it by 1000 and committing it back as the new value. I have tried the following with no success:
update property_calculation
set dynamic_fields = (
select jsonb_agg(case
when jsonb_typeof(elem -> 'totalBaseValue') = 'number'
then jsonb_set(elem, array['totalBaseValue'], to_jsonb((elem ->> 'totalBaseValue')::numeric / 1000))
else elem
end)
from jsonb_array_elements(dynamic_fields::jsonb) elem)::json;
I get the following error:
ERROR: cannot extract elements from an object
SQL state: 22023
My json column has no zero string or null values.
Move the jsonb_typeof() check into the where clause:
update property_calculation
set dynamic_fields =
jsonb_set(
dynamic_fields,
'{totalBaseValue}',
to_jsonb((dynamic_fields->>'totalBaseValue')::numeric / 1000)
)
where jsonb_typeof(dynamic_fields->'totalBaseValue') = 'number';
db<>fiddle here

Convert rows to columns in SQL Server table

My table has the below sample data:
DECLARE @FHTable table (PK_ID int, FK_ID int, P_ID int, T_ID int, A_day int, A_hour TIME, D_day int, D_hour time)
INSERT INTO @FHTable VALUES (129,194,252,1005322,NULL,NULL,1,'02:30:00.0000000')
INSERT INTO @FHTable VALUES (130,194,311,1000891,3,'04:30:00.0000000',NULL,NULL)
INSERT INTO @FHTable VALUES (131,194,311,1000129,NULL,NULL,4,'03:30:00.0000000')
INSERT INTO @FHTable VALUES (132,194,252,1000025,6,'03:00:00.0000000',NULL,NULL)
SELECT * FROM @FHTable
My final result should be as in the table below:
DECLARE @FinalResultTable TABLE (FK_ID int, P_IDFrom int, P_IDTO int, T_IDFrom int, T_IDTo int, A_day int, A_hour TIME, D_day int, D_hour time)
INSERT INTO @FinalResultTable VALUES (194,252,311,1005322,1000891,3,'04:30:00.0000000',1,'02:30:00.0000000')
INSERT INTO @FinalResultTable VALUES (194,311,252,1000129,1000025,6,'03:00:00.0000000',4,'03:30:00.0000000')
SELECT * FROM @FinalResultTable
The logic is that there are 4 rows for each FK_ID: the first and second rows form one source-to-destination pair, and the 3rd and 4th rows form another.
Can you please help?
As I pointed out in the comments, it's not very clear what logic joins each arrival with the corresponding departure.
What is clear to me is that your title is wrong: you are not talking about converting rows to columns. Instead, you are just grouping the rows of the table two by two. So you have two approaches, a GROUP BY or just a JOIN; it depends on what exactly your logic is.
Here is an example:
select A.FK_ID,
       A.PK_ID as P_IDFrom, B.PK_ID as P_IDTO,
       A.T_ID as T_IDFrom, B.T_ID as T_IDTo,
       B.A_day, B.A_hour,
       A.D_day, A.D_hour
from @FHTable A
join @FHTable B on B.PK_ID + 1 = A.PK_ID
where A.A_day is null
(this assumes that an arrival's PK is always +1 w.r.t. its departure).

Specified partition columns do not match the partition columns of the table.Please use () as the partition columns

Here I'm trying to persist a data frame into a partitioned Hive table and I'm getting this silly exception. I have looked into it many times but am not able to find the fault.
org.apache.spark.sql.AnalysisException: Specified partition columns
(timestamp value) do not match the partition columns of the table.
Please use () as the partition columns.;
Here is the script by which the external table is created with,
CREATE EXTERNAL TABLE IF NOT EXISTS events2 (
action string
,device_os_ver string
,device_type string
,event_name string
,item_name string
,lat DOUBLE
,lon DOUBLE
,memberid BIGINT
,productupccd BIGINT
,tenantid BIGINT
) partitioned BY (timestamp_val DATE)
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
stored AS inputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
location 'maprfs:///location/of/events2'
tblproperties ('serialization.null.format' = '');
Here is the result of describe formatted of table "events2"
hive> describe formatted events2;
OK
# col_name data_type comment
action string
device_os_ver string
device_type string
event_name string
item_name string
lat double
lon double
memberid bigint
productupccd bigint
tenantid bigint
# Partition Information
# col_name data_type comment
timestamp_val date
# Detailed Table Information
Database: default
CreateTime: Wed Jan 11 16:58:55 IST 2017
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: maprfs:/location/of/events2
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
serialization.null.format
transient_lastDdlTime 1484134135
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.078 seconds, Fetched: 42 row(s)
Here is the line of code where the data is partitioned and stored into the table:
val tablepath = Map("path" -> "maprfs:///location/of/events2")
AppendDF.write.format("parquet").partitionBy("Timestamp_val").options(tablepath).mode(org.apache.spark.sql.SaveMode.Append).saveAsTable("events2")
While running the application, I'm getting the error below:
Specified partition columns (timestamp_val) do not match the partition
columns of the table.Please use () as the partition columns.
I might be committing an obvious error, any help is much appreciated with an upvote :)
Please print the schema of the DataFrame:
AppendDF.printSchema()
and make sure there is no type mismatch on the partition column.
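A minimal sketch of that check, reusing AppendDF and tablepath from the question; the cast and the lower-case partition column name are assumptions meant to line up with the table's DDL, not a verified fix:
// inspect the DataFrame schema first
AppendDF.printSchema()

// make sure the partition column carries the name and type the table expects
// (timestamp_val DATE in the DDL above)
import org.apache.spark.sql.SaveMode
val fixedDF = AppendDF.withColumn("timestamp_val", AppendDF("timestamp_val").cast("date"))

fixedDF.write
  .format("parquet")
  .partitionBy("timestamp_val")
  .options(tablepath)
  .mode(SaveMode.Append)
  .saveAsTable("events2")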

Distributed processing of a PostgreSQL table

I've got a PostgreSQL table with several millions of rows that need to be processed with the same algorithm.
I am using Python and SQLAlchemy Core for this task.
This algorithm accepts one or several rows as input and returns the same amount of rows with some updated values.
id1, id2, NULL, NULL, NULL -> id1, id2, value1, value2, value3
id1, id3, NULL, NULL, NULL -> id1, id3, value4, value5, value6
id2, id3, NULL, NULL, NULL -> id2, id3, value7, value8, value9
...
id_n, id_m, NULL, NULL, NULL -> id_n, id_m, value_xxx, value_yyy, value_zzz
I am using a PC cluster to perform this task. This cluster runs dask.distributed scheduler and workers.
I think this task can be implemented effectively with the map function. My idea is that each worker queries the database, selects some rows with NULL values for processing, and then updates them with the results.
My question is: how do I write the SQL query that would allow the pieces of the table to be distributed among the workers?
I've tried to define a subset of rows for each worker with OFFSET and LIMIT in the SQL queries that each worker emits:
SQL:
select * from table where value1 is NULL offset N limit 100;
...
update table where id1 = ... and id2 = ...
set value1 = value...;
Python:
from sqlalchemy import create_engine, bindparam, select, func
from distributed import Executor, progress
def process(offset, limit=100):
    engine = create_engine(...)
    # get the next piece of work
    query = select(...).where(...).limit(limit).offset(offset)
    rows = engine.execute(query).fetchall()
    # process rows
    # submit the updated values to the table
    update_stmt = table.update().where(...).where(...).values(...)
    up_values = ...
    engine.execute(update_stmt, up_values)

if __name__ == '__main__':
    e = Executor('{address}:{port}'.format(address=config('SERVER_ADDR'),
                                           port=config('SERVER_PORT')))
    n_rows = count_rows_to_process()
    chunk_size = 100
    progress(e.map(process, range(0, n_rows, chunk_size)))
However, this didn't work.
The range function returned the list of offsets before the calculations started, and the map function distributed them among the workers before starting the process function.
Some workers then successfully finished processing their chunks of work, submitted their results to the table, and updated the values.
Then a new iteration begins: a new SELECT ... WHERE value1 IS NULL LIMIT 100 OFFSET ... query is sent to the database, but the offset is now invalid because it was calculated before the previous workers updated the table. The number of NULL values has been reduced, so a worker can receive an empty set from the database.
I cannot use one SELECT query before starting the calculations, because it would return a huge table that doesn't fit in RAM.
The SQLAlchemy manual also says that for distributed processing the engine instance should be created locally for each Python process. Therefore, I cannot query the database once and send the returned cursor to the process function.
Therefore, the solution lies in the correct construction of the SQL queries.
One option to consider is randomization:
SELECT *
FROM table
WHERE value1 IS NULL
ORDER BY random()
LIMIT 100;
In the worst-case scenario you will have several workers calculating the same thing in parallel. If that does not bother you, this is one of the simplest approaches.
The other option is dedicating individual rows to the particular worker:
UPDATE table
SET value1 = -9999
WHERE id IN (
SELECT id
FROM table
WHERE value1 IS NULL
ORDER BY random()
LIMIT 100
) RETURNING * ;
This way you "mark" the rows your particular worker has "taken" with -9999. All other workers will skip these rows as value1 IS NOT NULL any more. The risk here is that if the worker fails you will not have a simple way to get back to these rows - you would have to manually update them back to NULL.

Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

I have a local file movies.dat formatted as movie_id:movie_title:genre. For example:
1:movie1:Comedy
2:movie2:Drama
3:movie3:Horror
...
I create an external table using the following command.
CREATE EXTERNAL TABLE movies(movie_id INT, movie_title String, genre String)
ROW FORMAT
DELIMITED FIELDS TERMINATED BY '\:' -- need backslash!!
LOCATION '/exc103320/movies_copy'; -- name of the directory to copy the original file
Then, I load the data to the table by
LOAD DATA LOCAL INPATH 'movies.dat' OVERWRITE INTO TABLE movies;
When I run SELECT * FROM movies LIMIT 3;
I see the first 3 rows.
When I run SELECT movie_id FROM movies LIMIT 3; I get the following error
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1420729875693_6595, Tracking URL = http://cshadoop1.utdallas.edu:8088/proxy/application_1420729875693_6595/
Kill Command = /usr/local/hadoop-2.4.1/bin/hadoop job -kill job_1420729875693_6595
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2015-03-29 17:14:54,820 Stage-1 map = 0%, reduce = 0%
Ended Job = job_1420729875693_6595 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://cshadoop1.utdallas.edu:8088/cluster/app/application_1420729875693_6595
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Job 0: HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
Any idea why this happens?
I believe you don't need the backslash in the "ROW FORMAT DELIMITED FIELDS TERMINATED BY" clause.
Try the DDL statement like this and see if it works:
CREATE EXTERNAL TABLE movies(movie_id INT, movie_title String, genre String)
ROW FORMAT
DELIMITED FIELDS TERMINATED BY ':'
LOCATION '/exc103320/movies_copy';