How can we parse logs like the one below using Scala?
I want to read this kind of data and load it into a Hive table.
log timestamp="2018-04-06T22:43:19.565Z" eventCategory="Application" eventType="Error"
The log contents are actually wrapped in an HTML-style tag, i.e. < ... />.
Why can't you just load the log data into Hive as-is, though? Use a RegexSerDe in Hive.
Make a directory
hdfs dfs -mkdir -p /some/hdfs/path
Make a table
DROP TABLE IF EXISTS logdata;
CREATE EXTERNAL TABLE logdata (
`timestamp` STRING,
eventCategory STRING,
eventType STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "log timestamp=\“([^ ]*)\” eventCategory=\“([^ ]*)\” eventType=\“([^ ]*)\”",
"output.format.string" = "%1$s %2$s %3$s"
)
STORED AS TEXTFILE
LOCATION '/some/hdfs/path/';
Upload your logs
hdfs dfs -copyFromLocal data.log /some/hdfs/path/
Query the table
SELECT * FROM logdata;
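If you would rather do the parsing in Scala, as the question asks, a minimal Spark sketch could look like the following; it reuses the HDFS path from the steps above, and since regexp_extract does not need to match the whole line, the surrounding < /> tag is not a problem. The target table name logdata_parsed is a hypothetical example:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.regexp_extract

val spark = SparkSession.builder()
  .appName("LogParser")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// spark.read.text yields one string column named "value" per log line
val raw = spark.read.text("/some/hdfs/path/data.log")

// Same capture groups as the RegexSerDe pattern above, written as a plain Java regex
val pattern = "timestamp=\"([^ ]*)\" eventCategory=\"([^ ]*)\" eventType=\"([^ ]*)\""

val parsed = raw.select(
  regexp_extract($"value", pattern, 1).as("timestamp"),
  regexp_extract($"value", pattern, 2).as("eventCategory"),
  regexp_extract($"value", pattern, 3).as("eventType")
)

// Write the parsed rows into a Hive table of their own
parsed.write.mode("overwrite").saveAsTable("logdata_parsed")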
Related
I have an ETL process that uses an Athena source. I cannot figure out how to create a data frame if there is no data yet in the source. I was using the GlueContext:
trans_ddf = glueContext.create_dynamic_frame.from_catalog(
database=my_db, table_name=my_table, transformation_ctx="trans_ddf")
This fails if there is no data in the source db, because it can't infer the schema.
I also tried using the sql function on the spark session:
has_rows_df = spark.sql("select cast(count(*) as boolean) as hasRows from my_table limit 1")
has_rows = has_rows_df.collect()[0].hasRows
This also fails because it can't infer the schema.
How can I create a data frame so I can determine if the source has any data?
len(has_rows_df.head(1)) == 0
should do the job, robustly (in Scala, the equivalent check is df.head(1).isEmpty).
See How to check if spark dataframe is empty?
We are writing files from Spark and reading them from Athena/Hive.
We hit an issue with timestamps when using Hive.
scala> val someDF = Seq((8, "2018-06-06 11:42:43")).toDF("number", "word")
someDF: org.apache.spark.sql.DataFrame = [number: int, word: string]
scala> someDF.coalesce(1).write.mode("overwrite").option("delimiter", "\u0001").save("s3://test/")
This creates a Parquet file, and I created the following table:
CREATE EXTERNAL TABLE `test5`(
`number` int,
`word` timestamp)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://test/'
The SELECT query failed with:
HIVE_BAD_DATA: Field word's type BINARY in parquet is incompatible with type timestamp defined in table schema
The same thing works when testing with a plain CSV file:
scala> someDF.coalesce(1).write.format("com.databricks.spark.csv").mode("overwrite").option("delimiter", "\u0001").save("s3://test")
Table:
CREATE EXTERNAL TABLE `test7`(
`number` int,
`word` timestamp)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://test/'
Can you please help with what is going wrong when we write it as a Parquet file?
I think this is a well-known bug with Hive storing Parquet timestamps in a way that is incompatible with other tools. I faced a similar problem while using Impala to retrieve Hive data that I had written with Spark. I believe this was resolved in Spark 2.3.
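It may also be worth noting that in the example above word is created as a plain string column, so Parquet stores it as BINARY while the table declares it as timestamp. If the goal is a real timestamp column in the Parquet file, one option is to cast before writing; a minimal sketch, assuming Spark 2.2+ where to_timestamp is available:
import org.apache.spark.sql.functions.{col, to_timestamp}

// Cast the string column to TimestampType so the Parquet schema carries a timestamp rather than a string
val withTs = someDF.withColumn("word", to_timestamp(col("word"), "yyyy-MM-dd HH:mm:ss"))
withTs.coalesce(1).write.mode("overwrite").parquet("s3://test/")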
I'm very new to Hadoop and Hive.
I'm trying to load data into a Hive table and I'm getting the error below.
On the other hand, when I insert a single record into the Hive table using the statement stmt.execute("INSERT INTO employee VALUES(1201,'Gopal',45000,'Technical manager')"),
it inserts the record successfully; it is only the bulk load that fails.
val filePath = "C:\\AllProjects\\xxxxxxx\\src\\main\\resources\\input\\sample.txt"
val con =
DriverManager.getConnection("jdbc:hive2://xxxxxhive.xxxx.com:10000/dehl_dop;principal=hive/xxxxxhive.com.com#internal.xxxxx.com;" +
"mapred.job.queue.name=usa;AuthMech=3;SSL=1;user=zzzz;password=vvvv;" +
"SSLTrustStore=C:\\Program Files\\Java\\jre1.8.0_144\\lib\\security\\hjsecacerts;UseNativeQuery=0")
val stmt = con.createStatement()
print("\n" + "executing the query" +"\n")
stmt.execute(s"load data inpath $filePath into table Employee")
Error
errorMessage:Error while compiling statement: FAILED: ParseException line 1:17 mismatched input 'C' expecting StringLiteral near 'inpath' in load statement), Query: load data inpath C:\xxxxx\xxxxx\xxxxx\xxxxx\xxxxx\xxxxx\sample.txt into table Employee.
Any help will be appreciated
LOAD DATA INPATH takes a string literal.
$filePath needs single quotes around it
stmt.execute(s"load data inpath '$filePath' into table Employee")
However, that command requires the file to be located on HDFS, and you're reading from your C: drive.
LOAD DATA LOCAL INPATH will read the local filesystem, but I'm not sure how that works over JDBC because it depends on where the query is actually executed (your local machine, or the HiveServer)
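For reference, the LOCAL variant would look like the line below; whether the path resolves on your machine or on the HiveServer2 host depends on where the statement actually runs, as noted above:
// LOCAL tells Hive to read from the local filesystem instead of HDFS
stmt.execute(s"load data local inpath '$filePath' into table Employee")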
I suggest you create an external Hive table at a specific HDFS location, with the necessary schema, then simply copy the text file directly to HDFS.
Programmatically copying the file to HDFS is an option, but hadoop fs -put would be simpler.
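If you do want to do the copy programmatically, a minimal sketch using the Hadoop FileSystem API could look like this (both paths are placeholders):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Picks up core-site.xml / hdfs-site.xml from the classpath
val hadoopConf = new Configuration()
val fs = FileSystem.get(hadoopConf)

// Copy the local file into the HDFS directory backing the external table (placeholder paths)
fs.copyFromLocalFile(new Path("C:\\path\\to\\sample.txt"), new Path("/some/hdfs/target/dir/"))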
If all you want to do is load a local file into HDFS/Hive, Spark would make more sense than JDBC:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
.appName("Sample App").enableHiveSupport().getOrCreate()
val df = spark.read.option("header", "false").csv(filePath)
df.createOrReplaceTempView("emp")
spark.sql("INSERT INTO dehl_dop.Employee SELECT * from emp")
When I run the following:
val df1 = sqlContext.read.format("orc").load(myPath)
df1.columns.map(m => println(m))
The columns are printed as '_col0', '_col1', '_col2' etc., as opposed to their real names such as 'empno', 'name', 'deptno'.
When I 'describe mytable' in Hive it prints the column name correctly, but when I run 'orcfiledump' it shows _col0, _col1, _col2 as well. Do I have to specify 'schema on read' or something? If yes, how do I do that in Spark/Scala?
hive --orcfiledump /apps/hive/warehouse/mydb.db/mytable1
.....
fieldNames: "_col0"
fieldNames: "_col1"
fieldNames: "_col2"
Note: I created the table as follows:
create table mydb.mytable1 (empno int, name VARCHAR(20), deptno int) stored as orc;
Note: This is not a duplicate of this question (Hadoop ORC file - How it works - How to fetch metadata), because that answer tells me to use 'Hive' and I am already using HiveContext as follows:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
By the way, I am using my own hive-site.xml, which contains following:
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://sandbox.hortonworks.com:9083</value>
</property>
</configuration>
I figured out what the problem was. It was the way I was creating the test data. I was under the impression that if I run the following commands:
create table mydb.mytable1 (empno int, name VARCHAR(20), deptno int) stored as orc;
INSERT INTO mydb.mytable1(empno, name, deptno) VALUES (1, 'EMP1', 100);
INSERT INTO mydb.mytable1(empno, name, deptno) VALUES (2, 'EMP2', 50);
INSERT INTO mydb.mytable1(empno, name, deptno) VALUES (3, 'EMP3', 200);
Data would be created in the ORC format at: /apps/hive/warehouse/mydb.db/mytable1
Turns out that's not the case. Even though I indicated 'stored as orc', the INSERT statements didn't save the column information. Not sure if that's expected behavior. In any case, it all works now. Apologies for the confusion, but hopefully this will help someone in the future :-)
@DilTeam
This is the problem: when you write data with Hive (version 1.x), it does not store the columns' metadata in ORC-formatted files (this is not the case for Parquet, etc.). This issue is fixed in newer Hive (2.x), which stores the column info in the file metadata, allowing Spark to read the metadata from the file itself.
Here is another option to load tables written with Hive 1.x in Spark:
val table = spark.table("<db.tablename>")
Here spark is the default SparkSession, which fetches the table's information from the Hive metastore.
One more option comes with a bit more code and a prerequisite:
Create a DataFrame with a defined schema over the fetched RDD. This gives you the flexibility to change data types; you can read more at this link:
https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#programmatically-specifying-the-schema
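A rough sketch of that option, reusing the column names from the table in the question (the details of the API are in the linked guide):
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// The ORC files written by Hive 1.x only carry _col0, _col1, ..., so define the real schema by hand
val schema = StructType(Seq(
  StructField("empno", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("deptno", IntegerType, nullable = true)
))

// Re-apply the schema on top of the rows read from the ORC files
val raw = sqlContext.read.format("orc").load(myPath)
val withNames = sqlContext.createDataFrame(raw.rdd, schema)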
I hope this will help
I am reading a Hive table using Spark SQL and assigning it to a Scala val:
val x = sqlContext.sql("select * from some_table")
Then I am doing some processing with the DataFrame x and finally coming up with a DataFrame y, which has the exact same schema as the table some_table.
Finally I am trying to insert overwrite the y DataFrame into the same Hive table some_table:
y.write.mode(SaveMode.Overwrite).insertInto("some_table")
Then I am getting the error
org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from
I tried creating an insert sql statement and firing it using sqlContext.sql() but it too gave me the same error.
Is there any way I can bypass this error? I need to insert the records back to the same table.
Hi, I tried doing as suggested, but I am still getting the same error.
val x = sqlContext.sql("select * from incremental.test2")
val y = x.limit(5)
y.registerTempTable("temp_table")
val dy = sqlContext.table("temp_table")
dy.write.mode("overwrite").insertInto("incremental.test2")
scala> dy.write.mode("overwrite").insertInto("incremental.test2")
org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from.;
Actually you can also use checkpointing to achieve this. Since it breaks data lineage, Spark is not able to detect that you are reading and overwriting in the same table:
sqlContext.sparkContext.setCheckpointDir(checkpointDir)
val ds = sqlContext.sql("select * from some_table").checkpoint()
ds.write.mode("overwrite").saveAsTable("some_table")
You should first save your DataFrame y in a temporary table
y.write.mode("overwrite").saveAsTable("temp_table")
Then you can overwrite rows in your target table
val dy = sqlContext.table("temp_table")
dy.write.mode("overwrite").insertInto("some_table")
You should first save your DataFrame y as a Parquet file:
y.write.parquet("temp_table")
Then load it back:
val parquetFile = sqlContext.read.parquet("temp_table")
And finally insert your data into your table:
parquetFile.write.insertInto("some_table")
In the context of Spark 2.2:
This error means that our process is reading from and writing to the same table.
Normally, this should work, as the process writes to a .hive-staging... directory first.
The error occurs with the saveAsTable method, because it overwrites the entire table instead of individual partitions.
It should not occur with the insertInto method, as that overwrites partitions rather than the whole table.
One reason this happens is that the Hive table has the following Spark TBLPROPERTIES in its definition. The problem goes away for the insertInto method if you remove these TBLPROPERTIES (see the sketch after the list):
'spark.sql.partitionProvider'
'spark.sql.sources.provider'
'spark.sql.sources.schema.numPartCols'
'spark.sql.sources.schema.numParts'
'spark.sql.sources.schema.part.0'
'spark.sql.sources.schema.part.1'
'spark.sql.sources.schema.part.2'
'spark.sql.sources.schema.partCol.0'
'spark.sql.sources.schema.partCol.1'
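For reference, one way to drop those properties is Hive's ALTER TABLE ... UNSET TBLPROPERTIES, run either in Hive/beeline or via spark.sql; a rough sketch with a hypothetical table name:
// mydb.mytable is a placeholder; the keys mirror the list above
spark.sql("""
  ALTER TABLE mydb.mytable UNSET TBLPROPERTIES IF EXISTS (
    'spark.sql.partitionProvider',
    'spark.sql.sources.provider',
    'spark.sql.sources.schema.numPartCols',
    'spark.sql.sources.schema.numParts',
    'spark.sql.sources.schema.part.0',
    'spark.sql.sources.schema.part.1',
    'spark.sql.sources.schema.part.2',
    'spark.sql.sources.schema.partCol.0',
    'spark.sql.sources.schema.partCol.1'
  )
""")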
https://querydb.blogspot.com/2019/07/read-from-hive-table-and-write-back-to.html
When we upgraded our HDP to 2.6.3, Spark was updated from 2.2 to 2.3, which resulted in the error below:
Caused by: org.apache.spark.sql.AnalysisException: Cannot overwrite a path that is also being read from.;
at org.apache.spark.sql.execution.command.DDLUtils$.verifyNotReadPath(ddl.scala:906)
This error occurs for jobs where we are reading from and writing to the same path, like jobs with SCD logic.
Solution -
Set --conf "spark.sql.hive.convertMetastoreOrc=false"
Or, update the job so that it writes data to a temporary table, then reads from the temporary table and inserts it into the final table.
https://querydb.blogspot.com/2020/09/orgapachesparksqlanalysisexception.html
Read the data from the Hive table in Spark:
import org.apache.hadoop.io.WritableComparable
import org.apache.hadoop.mapreduce.InputFormat
import org.apache.hive.hcatalog.data.HCatRecord
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat

val hconfig = new org.apache.hadoop.conf.Configuration()
HCatInputFormat.setInput(hconfig, "dbname", "tablename")
val inputFormat = (new HCatInputFormat).asInstanceOf[InputFormat[WritableComparable[_], HCatRecord]].getClass
val data = sc.newAPIHadoopRDD(hconfig, inputFormat, classOf[WritableComparable[_]], classOf[HCatRecord])
You'll also get the error "Cannot overwrite a path that is also being read from" in a case where you are doing this:
You are doing an "insert overwrite" into a Hive TABLE "A" from a VIEW "V" (that executes your logic),
and that VIEW also references the same TABLE "A". I found this out the hard way, as the VIEW was deeply nested code that was querying "A" as well. Bummer.
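In other words, the failing shape looks roughly like this (A and V stand for the table and view described above):
// V is a view whose definition also selects from table A, so this statement both reads from and overwrites A
spark.sql("INSERT OVERWRITE TABLE A SELECT * FROM V")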
It is like cutting the very branch on which you are sitting :-(
What you need to keep in mind before doing the below is that the Hive table you are overwriting should have been created by Hive DDL, not by
Spark (df.write.saveAsTable("<table_name>")).
If the above is not true, this won't work.
I tested this in Spark 2.3.0:
val tableReadDf=spark.sql("select * from <dbName>.<tableName>")
val updatedDf=tableReadDf.<transformation> //any update/delete/addition
updatedDf.createOrReplaceTempView("myUpdatedTable")
spark.sql("""with tempView as(select * from myUpdatedTable) insert overwrite table
<dbName>.<tableName> <partition><partition_columns> select * from tempView""")
This is a good solution for me:
Extract the RDD and schema from the DataFrame.
Create a new clone DataFrame.
Overwrite the table.
import org.apache.spark.sql.{DataFrame, SaveMode}

private def overWrite(df: DataFrame): Unit = {
  // Rebuild the DataFrame from its RDD and schema; this breaks the lineage back to the source table
  val schema = df.schema
  val rdd = df.rdd
  val dfForSave = spark.createDataFrame(rdd, schema)
  dfForSave.write
    .mode(SaveMode.Overwrite)
    .insertInto(s"${tableSource.schema}.${tableSource.table}")
}