How to access Hive data using Spark (Scala)

I have a table (e.g. employee) stored as a text file in Hive, and I want to access it using Spark.
First I set up the SQL context object using:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
Then I created the table:
scala>sqlContext.sql("CREATE TABLE IF NOT EXISTS employee(
id INT, name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY
',' LINES TERMINATED BY '\n'")
Then I tried to load the contents of the text file using:
scala> sqlContext.sql("LOAD DATA LOCAL INPATH 'employee.txt' INTO TABLE employee")
I am getting this error:
SET hive.support.sql11.reserved.keywords=false
FAILED: SemanticException Line 1:23 Invalid path ''employee.txt'': No files
matching path file:/home/username/employee.txt
If I have to place the text file in the current directory where the spark-shell is running, how do I do that?

Do you run Hive on Hadoop?
Try using an absolute path. If that doesn't work, try loading your file into HDFS and then give the absolute path to your file (the HDFS location).

Try the steps below:
Start spark-shell in local mode, e.g. spark-shell --master local[*]
Give the full path when loading the file,
e.g. file:///home/username/employee.txt
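
Putting both answers together, a spark-shell session along these lines should work (the path /home/username/employee.txt is taken from the error message in the question; untested sketch):

```scala
// Start the shell in local mode first: spark-shell --master local[*]
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("""CREATE TABLE IF NOT EXISTS employee(id INT, name STRING, age INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'""")

// Use a full file:// URI so the local path is resolved unambiguously
sqlContext.sql("LOAD DATA LOCAL INPATH 'file:///home/username/employee.txt' INTO TABLE employee")

// Sanity check that the rows arrived
sqlContext.sql("SELECT * FROM employee").show()
```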

Related

Count files in HDFS directory with Scala

In Scala, I am trying to count the files in an HDFS directory.
I tried to get a list of the files with val files = fs.listFiles(path, false) and count it or get its size, but that doesn't work because files is of type RemoteIterator[LocatedFileStatus].
Any idea how I should proceed?
Thanks for helping
This has been done before, but generally people use the FSImage (a copy of the NameNode file).
They then load that into a Hive table, and you can query it for information about your HDFS file system.
Here's a really good tutorial that explains how to export the fsimage and load it into a Hive table.
Here's another approach that I think I prefer:
Fetch and copy the fsimage file into HDFS
#connect to any hadoop cluster node as hdfs user
#downloads the fsimage file from namenode
hdfs dfsadmin -fetchImage /tmp
#converts the fsimage file into tab delimited file
hdfs oiv -i /tmp/fsimage_0000000000450297390 -o /tmp/fsimage.csv -p Delimited
#remove the header and copy to HDFS
sed -i -e "1d" /tmp/fsimage.csv
hdfs dfs -mkdir /tmp/fsimage
hdfs dfs -copyFromLocal /tmp/fsimage.csv /tmp/fsimage
#create the intermediate external table in impala
CREATE EXTERNAL TABLE HDFS_META_D (
PATH STRING ,
REPL INT ,
MODIFICATION_TIME STRING ,
ACCESSTIME STRING ,
PREFERREDBLOCKSIZE INT ,
BLOCKCOUNT DOUBLE,
FILESIZE DOUBLE ,
NSQUOTA INT ,
DSQUOTA INT ,
PERMISSION STRING ,
USERNAME STRING ,
GROUPNAME STRING)
row format delimited
fields terminated by '\t'
LOCATION '/tmp/fsimage';
Once it's in a table, you really can do the rest in Scala/Spark.
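For instance, a file count for a particular directory could then be pulled from Spark like this (an untested sketch; the directory path is hypothetical, and the table name matches the DDL above):

```scala
// Count files under a directory using the fsimage-backed table
val fileCount = sqlContext.sql(
  """SELECT COUNT(*) AS cnt FROM HDFS_META_D
    |WHERE PATH LIKE '/user/me/some/dir/%'""".stripMargin
).first().getLong(0)
```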
I ended up using:
var count: Int = 0
while (files.hasNext) {
files.next
count += 1
}
As a Scala beginner, I didn't know how to write count++ (the answer is count += 1). This actually works quite well.
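
The same counting loop can be wrapped in a small helper that works for any hasNext/next-style iterator, including Hadoop's RemoteIterator (a sketch; the function name countEntries is mine):

```scala
// Count the remaining entries of any hasNext/next-style iterator,
// e.g. the RemoteIterator[LocatedFileStatus] returned by fs.listFiles.
def countEntries[T](hasNext: () => Boolean, next: () => T): Int = {
  var count = 0
  while (hasNext()) {
    next() // consume the entry; only the count matters here
    count += 1
  }
  count
}
```

For a RemoteIterator named files you would call countEntries(() => files.hasNext, () => files.next).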

My Spark job is deleting the target folder inside hdfs

I have a script that writes Hive table content into a CSV file in HDFS.
The target folder name is given in a JSON parameter file. When I launch the script, I notice that the folder I already created is deleted automatically, and then an error is thrown saying that the target file does not exist. This is my script:
sigma.cache // sigma is the df that contains the hive table. Tested OK
sigma.repartition(1).write.mode(SaveMode.Overwrite).format("csv").option("header", true).option("delimiter", "|").save(Parametre_vigiliste.cible)
val conf = new Configuration()
val fs = FileSystem.get(conf)
//Parametre_vigiliste.cible is the variable inside the JSON file that contains the target folder name
val file = fs.globStatus(new Path(Parametre_vigiliste.cible + "/part*"))(0).getPath().getName();
fs.rename(new Path(Parametre_vigiliste.cible + "/" + file), new Path(Parametre_vigiliste.cible + "/" + "FIC_PER_DATALAKE_.txt"));
sigma.unpersist()
ERROR THROWN:
exception caught: java.lang.UnsupportedOperationException: CSV data
source does not support null data type.
Could this code be deleting the folder for some reason? Thank you.
So, as Prateek suggested, I tried sigma.printSchema and discovered some null-typed columns. I fixed those and it worked.
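For anyone hitting the same message: the CSV writer rejects NullType columns, so one way to guard against it is to drop them before writing (a hedged sketch; sigma is the DataFrame from the question):

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.NullType

// Keep only columns the CSV data source can serialize (NullType is rejected)
val writableCols = sigma.schema.fields
  .filter(_.dataType != NullType)
  .map(f => col(f.name))
val writable = sigma.select(writableCols: _*)
```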

Error while loading parquet format file into Amazon Redshift using copy command and manifest file

I'm trying to load a parquet file using a manifest file and am getting the error below.
query: 124138 failed due to an internal error. File 'https://s3.amazonaws.com/sbredshift-east/data/000002_0 has an invalid version number: )
Here is my copy command
copy testtable from 's3://sbredshift-east/manifest/supplier.manifest'
IAM_ROLE 'arn:aws:iam::123456789:role/MyRedshiftRole123'
FORMAT AS PARQUET
manifest;
Here is my manifest file:
{
  "entries": [
    {
      "url": "s3://sbredshift-east/data/000002_0",
      "mandatory": true,
      "meta": {
        "content_length": 1000
      }
    }
  ]
}
I'm able to load the same file using the COPY command by specifying the file name directly:
copy testtable from 's3://sbredshift-east/data/000002_0' IAM_ROLE 'arn:aws:iam::123456789:role/MyRedshiftRole123' FORMAT AS PARQUET;
INFO: Load into table 'supplier' completed, 800000 record(s) loaded successfully.
COPY
What could be wrong in my copy statement?
This error happens when the content_length value is wrong. You have to specify the correct content_length. You can check it by running an s3 ls command:
aws s3 ls s3://sbredshift-east/data/
2019-12-27 11:15:19 539 sbredshift-east/data/000002_0
The 539 (the file size in bytes) should be the same as the content_length value in your manifest file.
I don't know why they require this meta value when you don't need it in the direct copy command.
¯\_(ツ)_/¯
The only way I've gotten a parquet COPY to work with a manifest file is to add the meta key with the content_length.
From what I can gather from my error logs, the COPY command for parquet (with a manifest) might first be reading the files using Redshift Spectrum as an external table. If that's the case, this hidden step does require the content_length, which contradicts the initial statement in the documentation about COPY commands.
https://docs.amazonaws.cn/en_us/redshift/latest/dg/loading-data-files-using-manifest.html
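
A small helper along these lines (the function names are mine) builds a manifest entry whose content_length is taken from the actual file size, measured here on a local file for illustration; for S3 objects you would use the size reported by aws s3 ls or the S3 API:

```scala
import java.nio.file.{Files, Path}

// Render a single-entry Redshift COPY manifest with the given byte size
def manifestFor(url: String, sizeInBytes: Long): String =
  s"""{"entries":[{"url":"$url","mandatory":true,"meta":{"content_length":$sizeInBytes}}]}"""

// Measure a local file and build its manifest entry
def manifestForLocalFile(path: Path, url: String): String =
  manifestFor(url, Files.size(path))
```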

Spark-SQL: access file in current worker node directory

I need to read a file using spark-sql, and the file is in the current directory.
I use this command to decompress a list of files I have stored on HDFS:
val decompressCommand = Seq(laszippath, "-i", inputFileName, "-o", "out.las").!!
The file is output in the current worker node directory, and I know this because running "ls -a".!! through Scala shows that the file is there. I then try to access it with the following command:
val dataFrame = sqlContext.read.las("out.las")
I assumed that the SQL context would try to find the file in the current directory, but it doesn't. Also, it doesn't throw an error but a warning stating that the file could not be found (so Spark continues to run).
I attempted to add the file using: sparkContext.addFile("out.las") and then access the location using: val location = SparkFiles.get("out.las") but this didn't work either.
I even ran val locationPt = "pwd".!! and then did val fullLocation = locationPt + "/out.las" and tried to use that value, but it didn't work either (note that .!! returns the command output with a trailing newline, so it would need a .trim).
The actual exception that gets thrown is the following:
User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: [];
org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: []
And this happens when I try to access column 'x' from a dataframe. I know that column 'x' exists because I've downloaded some of the files from HDFS, decompressed them locally and ran some tests.
I need to decompress files one by one because I have 1.6TB of data and so I cannot decompress it at one go and access them later.
Can anyone tell me what I can do to access files which are being outputted to the worker node directory? Or maybe should I be doing it some other way?
So I managed to do it now. What I'm doing is saving the file to HDFS and then reading it back through the SQL context from HDFS. I overwrite out.las in HDFS each time so that I don't take up too much space.
I have used the Hadoop API before to get at files; I don't know if it will help you here.
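For reference, the save-to-HDFS step can be done directly from the driver with the Hadoop API (an untested sketch; the destination path is hypothetical):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Push the locally decompressed file to HDFS, overwriting the previous copy,
// so it can be read back through the SQL context from any node
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.copyFromLocalFile(false /* delSrc */, true /* overwrite */,
  new Path("out.las"), new Path("/user/me/dataForHDFS/out.las"))
```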
val filePath = "/user/me/dataForHDFS/"
val fs:FileSystem = FileSystem.get(new java.net.URI(filePath + "out.las"), sc.hadoopConfiguration)
And I've not tested the below, but something like this should read the whole file into a byte array:
val status = fs.getFileStatus(new Path(filePath + "out.las"))
val readIn = new Array[Byte](status.getLen.toInt)
val fileIn: FSDataInputStream = fs.open(status.getPath)
fileIn.readFully(0, readIn)
fileIn.close()

Export CSV file to Hive

I have a CSV file which I downloaded from MongoDB and would like to load into Hive so that I can query and analyze it. However, I suppose I need to export it to HDFS first. I have Hive installed on my system. I used the following command:
CREATE EXTERNAL TABLE reg_log (path STRING, ip STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> LOCATION '/home/nazneen/Desktop/mongodb-linux/bin/reqlog_new_mod.csv'
> STORED AS CSVFILE;
This throws an error. Any pointers would be appreciated.
I believe the 'STORED AS' clause doesn't support the keyword 'CSVFILE', see here. In your case 'STORED AS TEXTFILE' should be fine. Note also that LOCATION expects an HDFS directory, not a path to a local file.
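
Putting that together, the corrected DDL might look like this (the HDFS directory /user/nazneen/reg_log is hypothetical; copy the CSV into it first with hdfs dfs -copyFromLocal):

```scala
// STORED AS TEXTFILE instead of CSVFILE; LOCATION must be an HDFS directory
sqlContext.sql(
  """CREATE EXTERNAL TABLE reg_log (path STRING, ip STRING)
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    |STORED AS TEXTFILE
    |LOCATION '/user/nazneen/reg_log'""".stripMargin)
```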