Export CSV file to Hive - mongodb

I have a CSV file which I downloaded from mongo db and would like to export it to hive so that I can query it and analyze it. However, i suppose I need to first export it to HDFS. I have Hive installed on my system. I used the following command:
CREATE EXTERNAL TABLE reg_log (path STRING, ip STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> LOCATION '/home/nazneen/Desktop/mongodb-linux/bin/reqlog_new_mod.csv'
> STORED AS CSVFILE;
This is throwing error. Any pointers would be appreciated.

I believe the 'STORED AS' clause doesn't support keyword 'CSVFILE', see here. In your case 'STORED AS TEXTFILE' should be fine.

Related

Synapse Spark exception handling - Can't write to log file

I have written PySpark code to hit a REST API and extract the contents in an XML format and later wrote to Parquet in a data lake container.
I am trying to add logging functionality where I not only write out errors but updates of actions/process we execute.
I am comparatively new to Spark I have been relying on online articles and samples. All explain the error handling and logging through "1/0" examples and saving logs in the default folder structure (not in ADLS account/container/folder) which do not help at all. Most of the code written in Pure Python doesn't run as-is.
Could I get some assistance with setting up the following:
Push errors to a log file under a designated folder sitting under a data lake storage account/container/folder hierarchy".
Catching REST specific exceptions.
This is a sample of what I have written:
''''
LogFilepath = "abfss://raw#.dfs.core.windows.net/Data/logging/data.log"
#LogFilepath2 = "adl://.azuredatalakestore.net/raw/Data/logging/data.log"
print(LogFilepath)
try:
1/0
except Exception as e:
print('My Error...' + str(e))
with open(LogFilepath, "a") as f:
f.write("An error occured: {}\n".format(e))
''''
I have tried it both ABFSS and ADL file paths with no luck. The log file is already available in the storage account/container/folder.
I have reproduced the above using abfss path in with open() function but it gave me the below error.
FileNotFoundError: [Errno 2] No such file or directory: 'abfss://synapsedata#rakeshgen2.dfs.core.windows.net/datalogs.logs'
As per this Documentation
we can use open() on ADLS file with a path like /synfs/{jobId}/mountpoint/{filename}.
For that, first we need to mount the ADLS.
Here I have mounted it using ADLS linked service. you can mount either by Storage account access key or SAS as per your requirement.
mssparkutils.fs.mount(
"abfss://<container_name>#<storage_account_name>.dfs.core.windows.net",
"/mountpoint",
{"linkedService":"<ADLS linked service name>"}
)
Now use the below code to achieve your requirement.
from datetime import datetime
currentDateAndTime = datetime.now()
jobid=mssparkutils.env.getJobId()
LogFilepath='/synfs/'+jobid+'/synapsedata/datalogs.log'
print(LogFilepath)
try:
1/0
except Exception as e:
print('My Error...' + str(e))
with open(LogFilepath, "a") as f:
f.write("Time : {}- Error : {}\n".format(currentDateAndTime,e))
Here I am writing date time along with the error and there is no need to create the log file first. The above code will create and append the error.
If you want to generate the logs daily, you can generate date file names log files as per your requirement.
My Execution:
Here I have executed 2 times.

db2 how to configure external tables using extbl_location, extbl_strict_io

db2 how to configure external tables using extbl_location, extbl_strict_io. Could you please give insert example for system table how to set up this parameters. I need to create external table and upload data to external table.
I need to know how to configure parameters extbl_location, extbl_strict_io.
I created table like this.
CREATE EXTERNAL TABLE textteacher(ID int, Name char(50), email varchar(255)) USING ( DATAOBJECT 'teacher.csv' FORMAT TEXT CCSID 1208 DELIMITER '|' REMOTESOURCE 'LOCAL' SOCKETBUFSIZE 30000 LOGDIR '/tmp/logs' );
and tried to upload data to it.
insert into textteacher (ID,Name,email) select id,name,email from teacher;
and get exception [428IB][-20569] The external table operation failed due to a problem with the corresponding data file or diagnostic files. File name: "teacher.csv". Reason code: "1".. SQLCODE=-20569, SQLSTATE=428IB, DRIVER=4.26.14
If I correct understand documentation parameter extbl_location should pointed directory where data will save. I suppose full directory will showed like
$extbl_location+'/'+teacher.csv
I found some documentation about error
https://www.ibm.com/support/pages/how-resolve-sql20569n-error-external-table-operation
I tried to run command in docker command line.
/opt/ibm/db2/V11.5/bin/db2 get db cfg | grep -i external
but does not information about external any tables.
CREATE EXTERNAL TABLE statement:
file-name
...
When both the REMOTESOURCE option is set to LOCAL (this is its default value) and the extbl_strict_io configuration parameter is set
to NO, the path to the external table file is an absolute path and
must be one of the paths specified by the extbl_location configuration
parameter. Otherwise, the path to the external table file is relative
to the path that is specified by the extbl_location configuration
parameter followed by the authorization ID of the table definer. For
example, if extbl_location is set to /home/xyz and the authorization
ID of the table definer is user1, the path to the external table file
is relative to /home/xyz/user1/.
So, If you use relative path to a file as teacher.csv, you must set extbl_strict_io to YES.
For an unload operation, the following conditions apply:
If the file exists, it is overwritten.
Required permissions:
If the external table is a named external table, the owner must have read and write permission for the directory of this file.
If the external table is transient, the authorization ID of the statement must have read and write permission for the directory of this file.
Moreover you must create a sub-directory equal to your username (in lowercase) which is owner of this table in the directory specified in extbl_location and ensure, that this user (not the instance owner) has rw permission to this sub-directory.
Update:
To setup presuming, that user1 runs this INSERT statement.
sudo mkdir -p /home/xyz/user1
# user1 must have an ability to cd to this directory
sudo chown user1:$(id -gn user1) /home/xyz/user1
db2 connect to mydb
db2 update db cfg using extbl_location /home/xyz extbl_strict_io YES

Try to import a database with sql script file

I try to import the small database from this website:
https://postgrespro.com/docs/postgrespro/12/demodb-bookings-installation
in Postgresql. I tried the restore button and choosed the sql file but it fails with the message:
"pg_restore: Error: Input file is apparently a dump in text format. Please use psql".
Then I tried to run the sql command from the file in the sql query, but it fails with the message:
"invalid byte sequence for encoding "UTF8".
Does anybody know how I can import it? Thx in advance!

Error while loading parquet format file into Amazon Redshift using copy command and manifest file

I'm trying to load parquet file using manifest file and getting below error.
query: 124138ailed due to an internal error. File 'https://s3.amazonaws.com/sbredshift-east/data/000002_0 has an invalid version number: )
Here is my copy command
copy testtable from 's3://sbredshift-east/manifest/supplier.manifest'
IAM_ROLE 'arn:aws:iam::123456789:role/MyRedshiftRole123'
FORMAT AS PARQUET
manifest;
here is my manifest file
**{
"entries":[
{
"url":"s3://sbredshift-east/data/000002_0",
"mandatory":true,
"meta":{
"content_length":1000
}
}
]
}**
I'm able to load the same file using copy command by specifying the file name.
copy testtable from 's3://sbredshift-east/data/000002_0' IAM_ROLE 'arn:aws:iam::123456789:role/MyRedshiftRole123' FORMAT AS PARQUET;
INFO: Load into table 'supplier' completed, 800000 record(s) loaded successfully.
COPY
What could be wrong in my copy statement?
This error happens when the content_length value is wrong. You have to specify the correct content_length. You could check it executing an s3 ls command.
aws s3 ls s3://sbredshift-east/data/
2019-12-27 11:15:19 539 sbredshift-east/data/000002_0
The 539 (file size) should be the same than the content_lenght value in your manifest file.
I don't know why they are using this meta value when you don't need it in the direct copy command.
¯\_(ツ)_/¯
The only way I've gotten parquet copy to work with manifest file is to add the meta key with the content_length.
From what I can gather in my error logs, the COPY command for parquet (w/ manifest) might first be reading the files using Redshift Spectrum as an external table. If that's the case, this hidden step does require the content_step which contradicts their initial statement about COPY commands.
https://docs.amazonaws.cn/en_us/redshift/latest/dg/loading-data-files-using-manifest.html

How to access hive data using spark

I have table stored as text file e.g employee in hive and I want to access it using spark.
First i have set sql context object using
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
Then i have created table
scala>sqlContext.sql("CREATE TABLE IF NOT EXISTS employee(
id INT, name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY
',' LINES TERMINATED BY '\n'")
Further i was trying to load the contents of text file by using
scala> sqlContext.sql("LOAD DATA LOCAL INPATH 'employee.txt' INTO TABLE employee")
I am getting error as
SET hive.support.sql11.reserved.keywords=false
FAILED: SemanticException Line 1:23 Invalid path ''employee.txt'': No files
matching path file:/home/username/employee.txt
If i have to place the textfile in current directory where the spark-shell is running how to do that ?
Do you run hive on hadoop?
try to use absolute path... if this doesn't work, try to load your file to hdfs and then give absolute path to your file (hdfs location) .
Try doing the below steps
Start spark-shell in local mode a Eg:spark-shell --master local[*]
Give the file full path for loading file
Eg:file:///home/username/employee.txt