Error while creating external Hive table in IBM Analytics Engine - ibm-cloud

I am creating an external Hive table from a CSV file located on IBM Cloud Object Storage. I am using the beeline client while ssh'd into the cluster as the clsadmin user, and I was able to make the JDBC connection. I get the error below while creating the table.
The CSV file is located in the bucket bucket-name-masked, and I have named the fs.cos parameter set 'hivetest'.
0: jdbc:hive2://***hostname-masked***> CREATE EXTERNAL TABLE NYC311Complaints (
  UniqueKey string, CreatedDate string, ClosedDate string, Agency string, AgencyName string,
  ComplaintType string, Descriptor string, LocationType string, IncidentZip string,
  IncidentAddress string, StreetName string, CrossStreet1 string, CrossStreet2 string,
  IntersectionStreet1 string, IntersectionStreet2 string, AddressType string, City string,
  Landmark string, FacilityType string, Status string, DueDate string,
  ResolutionDescription string, ResolutionActionUpdatedDate string, CommunityBoard string,
  Borough string, XCoordinateStatePlane string, YCoordinateStatePlane string,
  ParkFacilityName string, ParkBorough string, SchoolName string, SchoolNumber string,
  SchoolRegion string, SchoolCode string, SchoolPhoneNumber string, SchoolAddress string,
  SchoolCity string, SchoolState string, SchoolZip string, SchoolNotFound string,
  SchoolorCitywideComplaint string, VehicleType string, TaxiCompanyBorough string,
  TaxiPickUpLocation string, BridgeHighwayName string, BridgeHighwayDirection string,
  RoadRamp string, BridgeHighwaySegment string, GarageLotName string, FerryDirection string,
  FerryTerminalName string, Latitude string, Longitude string, Location string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'cos://*bucket-name-masked*.hivetest/IAE_examples_data_311NYC.csv';
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:cos://bucket-name-masked.hivetest/IAE_examples_data_311NYC.csv is not a directory or unable to create one) (state=08S01,code=1)
0: jdbc:hive2://***hostname-masked***>
This looks like a permissions issue, but I have provided all credentials for the relevant user IDs in HDFS as well as COS.

The issue was with the COS URL: the filename should not be included. Only the bucket should be named in LOCATION, and the objects inside it will be read. With the filename included, the whole path gets treated as the bucket name, and Hive then looks for objects in there.
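A minimal sketch of the corrected statement (the column list is unchanged from the question and elided here; bucket and service names are masked as in the question):

CREATE EXTERNAL TABLE NYC311Complaints (UniqueKey string, CreatedDate string, ..., Location string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'cos://*bucket-name-masked*.hivetest/';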

Row Filter for Table is invalid pyspark

I have a DataFrame in PySpark coming from a view in BigQuery that I load after configuring the Spark session:
import pyspark
from pyspark.sql import SparkSession, functions as F

# Build the SparkConf once and reuse it for both the SparkContext and the SparkSession
config = pyspark.SparkConf().setAll([('spark.executor.memory', '10g'), ('spark.driver.memory', '30g'),
    ('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.0')])
sc = pyspark.SparkContext(conf=config)
spark = SparkSession.builder.master('yarn').appName('base_analitica_entidades').config(conf=config).getOrCreate()
I read this dataset through:
recomendaveis = spark.read.format("bigquery").option("viewsEnabled", "true").load("resource_group:some_group.someView")
Then I filter a specific column with isNotNull:
recomendaveis_mid = recomendaveis.filter(recomendaveis["entities_mid"].isNotNull())
The schema of recomendaveis_mid is:
DataFrame[uid: string, revision: bigint, title: string, subtitle: string, access: string, branded_content: boolean, image: string, published_in: date, changed_in: date, entities_extracted_in: string, translation_extracted_in: string, categories_extracted_in: string, bigquery_inserted_in: string, public_url: string, private_url: string, text: string, translation_en: string, authors_name: string, categories_name: string, categories_confidence: double, entities_name: string, entities_type: string, entities_salience: double, entities_mid: string, entities_wikipedia_url: string, named_entities: string, publications: string, body: string, Editoria: string, idmateria: string]
When I try to get minimum date of column published_in with:
recomendaveis_mid.select(F.min("published_in")).collect()
It throws this error:
Caused by: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INVALID_ARGUMENT: request failed: Row filter for table resource_group:some_group.table is invalid. Filter is '(`entities_mid` IS NOT NULL)'
    at com.google.cloud.spark.bigquery.repackaged.io.grpc.Status.asRuntimeException(Status.java:533)
    ... 14 more
The field published_in has nothing to do with my filter on entities_mid, and when I run the date filter without the entities_mid isNotNull filter, my code works fine. Any suggestions?
P.S.: there is a similar error here, but I couldn't get any other ideas from it. Thanks in advance.
We faced a similar issue in Scala Spark while reading from a view.
Upon analysis, we observed that when we do
df.printSchema()
df.show(1, false)
it prints all fields even before the join operation takes place. But while loading/writing the DataFrame to external storage or a table, it throws the error:
INVALID_ARGUMENT: request failed: Row filter for table
After some experimenting, we observed that if we persist the DataFrame
df.persist()
it works fine.
It looks like after joining we would also need to keep the column used in the filter in the select; since we didn't want that column in our final DataFrame, we persisted it in the cluster instead.
You can either unpersist
df.unpersist()
once the data operation completes, or leave it as is if you are using an ephemeral cluster, since the cached data will disappear when the cluster is deleted.
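A minimal sketch of the workaround applied to the question's code (all names come from the question; the persist()/unpersist() calls are the only additions):

# Persist right after the filter so Spark materializes the filtered rows
# rather than pushing the predicate down to the BigQuery view as a row filter.
recomendaveis_mid = recomendaveis.filter(recomendaveis["entities_mid"].isNotNull())
recomendaveis_mid.persist()

# Aggregations now run against the cached data.
print(recomendaveis_mid.select(F.min("published_in")).collect())

recomendaveis_mid.unpersist()  # optional on an ephemeral cluster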

value split is not a member of (String, String, String, String, String)

Given below is data produced by joining two tables.
joinDataRdd.take(5).foreach(println)
(41234,((102921,249,2,109.94,54.97),(2014-04-04 00:00:00.0,3182,PENDING_PAYMENT)))
(65722,((164249,365,2,119.98,59.99),(2014-05-23 00:00:00.0,4077,COMPLETE)))
(65722,((164250,730,5,400.0,80.0),(2014-05-23 00:00:00.0,4077,COMPLETE)))
(65722,((164251,1004,1,399.98,399.98),(2014-05-23 00:00:00.0,4077,COMPLETE)))
(65722,((164252,627,5,199.95,39.99),(2014-05-23 00:00:00.0,4077,COMPLETE)))
When I try to get the following
val data = joinDataRdd.map(x=>(x._1,x._2._1.split(",")(3)))
it throws an error:
value split is not a member of (String, String, String, String, String)
I also tried:
val data = joinDataRdd.map(x=>(x._1,x._2._1._1.split(",")(3)))
You are trying to call split on a tuple, and that is why you get the error message. At the given position, x._2._1 refers to the inner tuple (102921,249,2,109.94,54.97) in
(41234,((102921,249,2,109.94,54.97),(2014-04-04 00:00:00.0,3182,PENDING_PAYMENT)))
so if you are looking to dig inside that tuple, you need to advance one more position.
It looks like the values are already in a tuple, so you don't need to split the string. Is
val data = joinDataRdd.map(x=>(x._1,x._2._1._4))
what you are looking for?
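A self-contained sketch with one of the printed rows (the error message confirms the inner values are a tuple of five Strings; the key point is that ._4 selects the fourth tuple field directly):

object TupleDemo extends App {
  // The values are already tuple fields, so no string splitting is needed;
  // ._4 selects the fourth element of the inner tuple.
  val row = ("41234", (("102921", "249", "2", "109.94", "54.97"),
                       ("2014-04-04 00:00:00.0", "3182", "PENDING_PAYMENT")))
  val data = (row._1, row._2._1._4)
  println(data) // prints (41234,109.94)
}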

no viable alternative at input : Siddhi Query

I am trying to write a simple Siddhi query that imports a custom mapped stream, but as soon as I import the stream and validate the query, it gives an error.
My complete query is:
@Import('bro.in.ssh.log:1.0.0')
define stream inStream (ts string, uid string, id.orig_h string, id.orig_p int, id.resp_h string, id.resp_p int, version int, client string, server string, cipher_alg string, mac_alg string, compression_alg string, kex_alg string, host_key_alg string, host_key string);
@Export('bro.out.ssh.log:1.0.0')
define stream outStream (ts string, ssh_logins int);
from inStream
select dateFormat (ts,'yyyy-MM-dd HH:mm') as formatedTs, count
group by formatedTs
insert into outStream;
All I want is to count the number of records in the log for each minute and export the time and count to an output stream, but I am getting errors even at the very first line.
My input is a log file from the Bro IDS, ssh.log. A sample record looks like this:
{"ts":"2016-05-08T08:59:47.363764Z","uid":"CLuCgz3HHzG7LpLwH9","id.orig_h":"172.30.26.119","id.orig_p":51976,"id.resp_h":"172.30.26.160","id.resp_p":22,"version":2,"client":"SSH-2.0-OpenSSH_5.0","server":"SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6","cipher_alg":"arcfour256","mac_alg":"hmac-md5","compression_alg":"none","kex_alg":"diffie-hellman-group-exchange-sha1","host_key_alg":"ssh rsa","host_key":"8d:df:71:ac:29:1f:67:6f:f3:dd:c3:e5:2e:5f:3e:b4"}
Siddhi does not allow an attribute name to contain the dot ('.') character, so please edit the event stream so that the attribute names (such as id.orig_h) do not contain dots.
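A sketch of the input stream definition with the dots replaced by underscores (the underscore naming is just one possible choice; any dot-free names work, as long as the stream mapping is updated to match):

@Import('bro.in.ssh.log:1.0.0')
define stream inStream (ts string, uid string, id_orig_h string, id_orig_p int, id_resp_h string, id_resp_p int, version int, client string, server string, cipher_alg string, mac_alg string, compression_alg string, kex_alg string, host_key_alg string, host_key string);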

Assertion on retrieving data from cassandra

I defined a case class to map rows of a Cassandra table:
case class Log(
  time: Long,
  date: String,
  appId: String,
  instanceId: String,
  appName: String,
  channel: String,
  originCode: String,
  message: String)
I created an RDD to hold all my tuples:
val logEntries = sc.cassandraTable[Log]("keyspace", "log")
To see if everything works, I printed the following:
println(logEntries.count())   // works, prints the number of tuples retrieved
println(logEntries.first())   // throws an exception on this line
java.lang.AssertionError: assertion failed: Missing columns needed by
com.model.Log: app_name, app_id, origin_code, instance_id
The columns of the log table in Cassandra are:
time bigint, date text, appid text, instanceid text, appname text, channel text, origincode text, message text
What's wrong?
As mentioned in the spark-cassandra-connector docs, the column name mapper has its own logic for converting case class parameters to column names:
For multi-word column identifiers, separate each word by an underscore in Cassandra, and use the camel case convention on the Scala side.
So if you use case class Log(appId: String, instanceId: String) with camel-cased parameters, they will automatically be mapped to the underscore-separated notation: app_id text, instance_id text. They cannot be automatically mapped to appid text, instanceid text: those column names are missing the underscores.
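A sketch of a case class that lines up with the existing all-lowercase columns (the alternative is to rename the Cassandra columns to app_id, instance_id, and so on, and keep the camel-cased class):

import com.datastax.spark.connector._

// All-lowercase parameter names map one-to-one onto the existing
// all-lowercase Cassandra columns, so the mapper expects no underscores.
case class Log(
  time: Long,
  date: String,
  appid: String,
  instanceid: String,
  appname: String,
  channel: String,
  origincode: String,
  message: String)

val logEntries = sc.cassandraTable[Log]("keyspace", "log")
println(logEntries.first())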

Scala Play passing variable to view not working

This code works fine:
In the controller:
Ok(views.html.payment(message,test,x_card_num,x_exp_date,exp_year,exp_month,x_card_code,x_first_name,x_last_name,x_address,x_city,x_state,x_zip,save_account,product_array,x_amount,products_json,auth_net_customer_profile_id,auth_net_payment_profile_id,customer_id))
In the view:
@(message: String, test: String, x_card_num: String, x_exp_date: String, exp_year: String, exp_month: String, x_card_code: String, x_first_name: String, x_last_name: String, x_address: String, x_city: String, x_state: String, x_zip: String, save_account: String, product_array: Map[String,Map[String,Any]], x_amount: String, products_json: String, auth_net_customer_profile_id: String, auth_net_payment_profile_id: String, customer_id: String)
But when I try to add one more variable to the controller and view like this:
Ok(views.html.payment(message,test,x_card_num,x_exp_date,exp_year,exp_month,x_card_code,x_first_name,x_last_name,x_address,x_city,x_state,x_zip,save_account,product_array,x_amount,products_json,auth_net_customer_profile_id,auth_net_payment_profile_id,customer_id,saved_payments_xml))
@(message: String, test: String, x_card_num: String, x_exp_date: String, exp_year: String, exp_month: String, x_card_code: String, x_first_name: String, x_last_name: String, x_address: String, x_city: String, x_state: String, x_zip: String, save_account: String, product_array: Map[String,Map[String,Any]], x_amount: String, products_json: String, auth_net_customer_profile_id: String, auth_net_payment_profile_id: String, customer_id: String, saved_payments_xml: String)
It gives me this error:
missing parameter type
What am I doing wrong?
There's a limit to the number of parameters you can pass to a template, and you exceed it when you add another parameter.
It's an undocumented and fairly arbitrary limit that results from how the code generation for templates works. It is arguably a bug, but not one I would fix: nobody needs this many parameters, and having this many makes the code much less readable.
Your best resolution here is to refactor, for example by creating case classes to represent Card and Address in your model and passing those in instead.
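A sketch of that refactoring (the exact field groupings are an assumption based on the parameter names):

// Group the related x_* parameters into small model classes.
case class Card(num: String, expDate: String, expYear: String, expMonth: String, code: String)
case class Address(firstName: String, lastName: String, street: String, city: String, state: String, zip: String)

In the controller:
Ok(views.html.payment(message, test, card, address, save_account, product_array, x_amount, products_json, auth_net_customer_profile_id, auth_net_payment_profile_id, customer_id, saved_payments_xml))

In the view:
@(message: String, test: String, card: Card, address: Address, save_account: String, product_array: Map[String,Map[String,Any]], x_amount: String, products_json: String, auth_net_customer_profile_id: String, auth_net_payment_profile_id: String, customer_id: String, saved_payments_xml: String)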