I am trying to create a table (CTAS) in Hive and want to write the output in BSON format, in order to import it into MongoDB.
Here is my query:
create table if not exists rank_locn
ROW FORMAT SERDE "com.mongodb.hadoop.hive.BSONSerde"
STORED AS INPUTFORMAT "com.mongodb.hadoop.BSONFileInputFormat"
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
as
select RGN_OVRHD_NBR,DM_OVRHD_NBR,LOCN_NBR,Derived,
rank() OVER (ORDER BY DERIVED DESC) as NationalRnk,
rank() OVER (PARTITION BY RGN_OVRHD_NBR ORDER BY DERIVED DESC) as RegionRnk,
rank() OVER (PARTITION BY DM_OVRHD_NBR ORDER BY DERIVED DESC) as DistrictRnk
from Locn_Dim_Values
where Derived between -999999 and 999999;
Three jobs are launched, and the last reduce job fails. The error log is as follows:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":78133,"reducesinkkey1":143.82632293080053},"value":{"_col0":1,"_col1":12,"_col2":79233,"_col3":78133,"_col4":1634,"_col5":143.82632293080053},"alias":0}
at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:274)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:522)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":78133,"reducesinkkey1":143.82632293080053},"value":{"_col0":1,"_col1":12,"_col2":79233,"_col3":78133,"_col4":1634,"_col5":143.82632293080053},"alias":0}
at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:262)
... 7 more
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat$1.write(HiveIgnoreKeyTextOutputFormat.java:91)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:637)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:502)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:832)
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:502)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:832)
at org.apache.hadoop.hive.ql.exec.PTFOperator.executeWindowExprs(PTFOperator.java:341)
at org.apache.hadoop.hive.ql.exec.PTFOperator.processInputPartition(PTFOperator.java:198)
at org.apache.hadoop.hive.ql.exec.PTFOperator.processOp(PTFOperator.java:130)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:502)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:832)
at org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:502)
at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:253)
... 7 more
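Looking at the NullPointerException inside HiveIgnoreKeyTextOutputFormat.write, I suspect the mismatch between the BSON SerDe and the plain-text output format is what breaks the file sink. The variant I plan to try next is below; the HiveBSONFileOutputFormat class name is what I found in the mongo-hadoop Hive connector documentation, so it may not be exact:
-- Same CTAS as above, but with the connector's BSON output format instead of the text output format
create table if not exists rank_locn
ROW FORMAT SERDE "com.mongodb.hadoop.hive.BSONSerde"
STORED AS INPUTFORMAT "com.mongodb.hadoop.BSONFileInputFormat"
OUTPUTFORMAT "com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat"
as
select RGN_OVRHD_NBR,DM_OVRHD_NBR,LOCN_NBR,Derived,
rank() OVER (ORDER BY DERIVED DESC) as NationalRnk,
rank() OVER (PARTITION BY RGN_OVRHD_NBR ORDER BY DERIVED DESC) as RegionRnk,
rank() OVER (PARTITION BY DM_OVRHD_NBR ORDER BY DERIVED DESC) as DistrictRnk
from Locn_Dim_Values
where Derived between -999999 and 999999;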
Please help me resolve the issue.
I am trying to read Parquet files via the Flink Table API, and it throws an error when I select one of the timestamp columns.
My Parquet table looks something like the following; I create it with this SQL:
CREATE TABLE MyDummyTable (
`id` INT,
ts BIGINT,
ts_ltz AS TO_TIMESTAMP_LTZ(ts, 3),
ts2 TIMESTAMP,
ts3 TIMESTAMP,
ts4 TIMESTAMP,
ts5 TIMESTAMP
)
It throws an error when I select any of ts2, ts3, ts4, or ts5. The stack trace is:
Caused by: java.lang.IllegalArgumentException: Unexpected type: BINARY
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:77)
at org.apache.flink.formats.parquet.vector.ParquetSplitReaderUtil.createWritableColumnVector(ParquetSplitReaderUtil.java:369)
at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createWritableVectors(ParquetVectorizedInputFormat.java:264)
at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createReaderBatch(ParquetVectorizedInputFormat.java:254)
at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createPoolOfBatches(ParquetVectorizedInputFormat.java:244)
at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createReader(ParquetVectorizedInputFormat.java:137)
at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createReader(ParquetVectorizedInputFormat.java:73)
at org.apache.flink.connector.file.src.impl.FileSourceSplitReader.checkSplitOrStartNext(FileSourceSplitReader.java:112)
at org.apache.flink.connector.file.src.impl.FileSourceSplitReader.fetch(FileSourceSplitReader.java:65)
at org.apache.flink.connector.base.source.reader.fetcher.FetchTask.run(FetchTask.java:56)
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:138)
... 7 more
My current approach
I am currently using the following workaround, but it does not seem like a proper solution, since I have to create two columns in the table.
CREATE TABLE MyDummyTable (
`id` INT,
ts2 STRING,
ts2_ts AS TO_TIMESTAMP(ts2)
)
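Spelled out for all of the affected columns, the workaround would look roughly like this (just my sketch, reusing the names above), which is why it feels clunky:
CREATE TABLE MyDummyTable (
  `id` INT,
  ts BIGINT,
  ts_ltz AS TO_TIMESTAMP_LTZ(ts, 3),
  -- each physical TIMESTAMP column is read as STRING and converted via a computed column
  ts2 STRING,
  ts2_ts AS TO_TIMESTAMP(ts2),
  ts3 STRING,
  ts3_ts AS TO_TIMESTAMP(ts3),
  ts4 STRING,
  ts4_ts AS TO_TIMESTAMP(ts4),
  ts5 STRING,
  ts5_ts AS TO_TIMESTAMP(ts5)
)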
Flink Version: 1.13.2
Scala Version: 2.11.12
I am writing a DataFrame to S3 as shown below. Target location: s3://test/folder
val targetDf = spark.read.schema(schema).parquet(targetLocation)
val df1=spark.sql("select * from sourceDf")
val df2=spark.sql("select * from targetDf")
/*
for loop over a date range to dedup and write the data to s3
union dfs and run a dedup logic, have omitted dedup code and for loop
*/
val df3=spark.sql("select * from df1 union all select * from df2")
df3.write.partitionBy("data_id", "schedule_dt").parquet(targetLocation)
Spark is creating an extra partition column on write, as shown below:
Exception in thread "main" java.lang.AssertionError: assertion failed: Conflicting partition column names detected:
Partition column name list #0: data_id, schedule_dt
Partition column name list #1: data_id, schedule_dt, schedule_dt
The EMR optimizer class is enabled while writing, and I am using Spark 2.4.3.
Please let me know what could be causing this error.
Thanks
Abhineet
You need to select at least one column in addition to the partition columns. Can you please try
val df3=df1.union(df2)
instead of
val df3=spark.sql("select data_id,schedule_dt from df1 union all select data_id,schedule_dt from df2")
While writing data into a Hive partitioned table, I am getting the error below.
org.apache.spark.SparkException: Requested partitioning does not match the tablename table:
I have converted my RDD to a DataFrame using a case class, and I am trying to write the data into an existing Hive partitioned table. But I am getting this error, and as per the printed logs, "Requested partitions:" comes up blank, while the partition columns appear as expected in the Hive table.
spark-shell error :-
scala> data1.write.format("hive").partitionBy("category", "state").mode("append").saveAsTable("sampleb.sparkhive6")
org.apache.spark.SparkException: Requested partitioning does not match the sparkhive6 table:
Requested partitions:
Table partitions: category,state
Hive table format :-
hive> describe formatted sparkhive6;
OK
col_name data_type comment
txnno int
txndate string
custno int
amount double
product string
city string
spendby string
Partition Information
col_name data_type comment
category string
state string
Try the insertInto() function instead of saveAsTable().
scala> data1.write.format("hive")
.partitionBy("category", "state")
.mode("append")
.insertInto("sampleb.sparkhive6")
(or)
Register a temp view on top of the DataFrame, then use a SQL statement to insert the data into the Hive table.
scala> data1.createOrReplaceTempView("temp_vw")
scala> spark.sql("insert into sampleb.sparkhive6 partition(category,state) select txnno,txndate,custno,amount,product,city,spendby,category,state from temp_vw")
Below is my code for loading CSV data into a DataFrame, computing the difference of two columns, and appending the result as a new column using withColumn. The two columns I am taking the difference of are of type Double. Please help me figure out the exception that follows.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
/**
* Created by Guest1 on 5/10/2017.
*/
object arith extends App {
Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("akka").setLevel(Level.ERROR)
val spark = SparkSession.builder().appName("Arithmetics").
config("spark.master", "local").getOrCreate()
val df =spark.read.option("header","true")
.option("inferSchema",true")
.csv("./Input/Arith.csv").persist()
// df.printSchema()
val sim =df("Average Total Payments") -df("Average Medicare Payments").show(5)
}
I am getting the following exception:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "Average Total Payments" among (DRG Definition, Provider Id, Provider Name, Provider Street Address, Provider City, Provider State, Provider Zip Code, Hospital Referral Region Description, Total Discharges , Average Covered Charges , Average Total Payments , Average Medicare Payments);
at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:219)
at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:219)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.Dataset.resolve(Dataset.scala:218)
at org.apache.spark.sql.Dataset.col(Dataset.scala:1073)
at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059)
at arith$.delayedEndpoint$arith$1(arith.scala:19)
at arith$delayedInit$body.apply(arith.scala:7)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at arith$.main(arith.scala:7)
at arith.main(arith.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
There are multiple issues here.
First, if you look at the exception, it tells you that there is no "Average Total Payments" column in the DataFrame (it also helpfully lists the columns it does see). It seems the column names read from the CSV have an extra space at the end.
Second df("Average Total Payments") and df("Average Medicare Payments") are columns.
You are trying to call show on df("Average medicate payments"). Show is not a member of column (and on dataframe it returns unit so you couldn't do df("Average Total Payments") -df("Average Medicare Payments").show(5) anyway because that would be Column - Unit).
What you want to do is define a new column that is the difference between the two, add it to the DataFrame, then select just that column and show it. For example (once the column names are fixed to match what is actually in the DataFrame):
val sim = df.withColumn("diff",df("Average Total Payments") -df("Average Medicare Payments"))
sim.select("diff").show(5)
Here is my Table:
CREATE TABLE mytable
(
id uuid,
day text,
mytime timestamp,
value text,
status int,
PRIMARY KEY ((id, day), mytime )
)
WITH CLUSTERING ORDER BY (mytime desc)
;
Here is the Index:
CREATE INDEX IF NOT EXISTS idx_status ON mytable (status);
When I run this select statement, I get the expected results:
select * from mytable
where id = 38403e1e-44b0-11e4-bd3d-005056a93afd
AND day = '2014-10-29'
;
62 rows are returned as a result of this query.
If I extend this query to include the indexed column:
select * from mytable
where id = 38403e1e-44b0-11e4-bd3d-005056a93afd
AND day = '2014-10-29'
AND status = 5
;
Zero rows are returned (there are several records with status = 5).
If I query the table looking ONLY for a specific index value:
select * from mytable
where status = 5
;
Zero rows are also returned.
I'm at a loss. I don't understand what exactly is taking place.
I am on a 3-node cluster with replication factor 3, running Cassandra 2.1.3.
Could this be a configuration issue in cassandra.yaml?
Or is there an issue with my select statement?
Appreciate the assistance, thanks.
UPDATE:
I am seeing this in the system.log file. Any ideas?
ERROR [CompactionExecutor:1266] 2015-03-24 15:20:26,596 CassandraDaemon.java:167 - Exception in thread Thread[CompactionExecutor:1266,1,main]
java.lang.AssertionError: /cdata/cassandra/data/my_table-c5f756b5318532afb494483fa1828675/my_table.idx_status-ka-32-Data.db
at org.apache.cassandra.io.sstable.SSTableReader.getApproximateKeyCount(SSTableReader.java:235) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:153) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:76) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:240) ~[apache-cassandra-2.1.3.jar:2.1.3]
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[na:1.7.0_51]
at java.util.concurrent.FutureTask.run(Unknown Source) ~[na:1.7.0_51]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [na:1.7.0_51]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [na:1.7.0_51]
at java.lang.Thread.run(Unknown Source) [na:1.7.0_51]
I ran your steps above and was able to query rows by status = 5 just fine. One thing I can suggest is to try rebuilding your index. Try this from a command prompt:
nodetool rebuild_index mykeyspace mytable idx_status
Otherwise, IMO the best way to solve this is not with a secondary index. If you know that you're going to have to support queries by status (especially with a large dataset), then I would seriously consider building a specific, additional "query table" for it.
CREATE TABLE mytablebystatus
(
id uuid,
day text,
mytime timestamp,
value text,
status int,
PRIMARY KEY ((status), day, mytime, id)
);
This would support queries by status alone, or by status and day, sorted by mytime. In summary, I would experiment with a few different PRIMARY KEY definitions and see which best suits your query patterns. That way, you can avoid having to use ill-performing secondary indexes altogether.
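For example, with that table in place, queries along these lines are served straight from the primary key (the literal values are just placeholders):
-- all rows with a given status
SELECT * FROM mytablebystatus WHERE status = 5;
-- all rows with a given status on a given day, sorted by mytime within the day
SELECT * FROM mytablebystatus WHERE status = 5 AND day = '2014-10-29';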