create table in phoenix from spark - scala

Hi, I need to create a table in Phoenix from a Spark job. I have tried the two approaches below, but neither works; it seems this is still not supported.
1) DataFrame.write still requires that the table already exists:
df.write.format("org.apache.phoenix.spark").mode("overwrite").option("table", schemaName.toUpperCase + "." + tableName.toUpperCase ).option("zkUrl", hbaseQuorum).save()
2) If we connect to Phoenix through JDBC and try to execute the CREATE statement, we get a parsing error (the same CREATE works in Phoenix directly):
var ddlCode="create table test (mykey integer not null primary key, mycolumn varchar) "
val driver = "org.apache.phoenix.jdbc.PhoenixDriver"
val jdbcConnProps = new Properties()
jdbcConnProps.setProperty("driver", driver);
val jdbcConnString = "jdbc:phoenix:hostname:2181/hbase-unsecure"
sqlContext.read.jdbc(jdbcConnString, ddlCode, jdbcConnProps)
error:
org.apache.phoenix.exception.PhoenixParserException: ERROR 601 (42P00): Syntax error. Encountered "create" at line 1, column 15.
Has anyone with similar challenges managed to do it differently?

I have finally worked out a solution for this. Basically, I think I was wrong to try to use the SQLContext read method for this; that method is designed just to "read" data sources. The workaround has been to open a standard JDBC connection against Phoenix:
import java.sql.{Connection, DriverManager}

var ddlCode = "create table test (mykey integer not null primary key, mycolumn varchar)"
val driver = "org.apache.phoenix.jdbc.PhoenixDriver"
val jdbcConnString = "jdbc:phoenix:hostname:2181/hbase-unsecure"
val user = "USER"
val pass = "PASS"
var connection: Connection = null
Class.forName(driver) // register the Phoenix JDBC driver
connection = DriverManager.getConnection(jdbcConnString, user, pass)
val statement = connection.createStatement()
statement.executeUpdate(ddlCode) // run the DDL directly over JDBC

Related

unable to loop hive sql queries in spark.sql

I am facing challenges with the code below, which loops Hive SQL queries through spark.sql.
def missing_pks(query: String) = {
  //println(f"spark.sql( $query )")
  spark.sql(query)
}
var hql_query_list_df = spark.sql("select distinct hql_qry from table where msr_nm='orders' and rgn_src='europe'")
var hql = hql_query_list_df.select('hql_qry).as[String].collect()
var hql_f = hql_query_list_df.map( "\"" + _ + "\"" )
hql_f.foreach(missing_pks)
Here I am reading the Hive SQL statements from a table, loading them as a list, and then trying to execute them; unfortunately it is not working, and I am not sure what is missing in my code. The interesting part is that if the list is created manually within the spark shell, the code works perfectly. It would be great if someone could help me here.
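One likely issue is that spark.sql cannot be called from inside a Dataset map/foreach, which runs on the executors rather than the driver; also, mapping over the DataFrame and wrapping each Row's string representation in extra quotes no longer yields valid SQL. A minimal sketch of a driver-side loop (an assumption-laden sketch, not the original code; it assumes the hql_qry column holds complete, runnable HiveQL strings):
import spark.implicits._

// Collect the query strings to the driver, then run them one at a time on the driver,
// where the SparkSession is available.
val hqlQueries: Array[String] = spark
  .sql("select distinct hql_qry from table where msr_nm='orders' and rgn_src='europe'")
  .select('hql_qry).as[String]
  .collect()

hqlQueries.foreach { query =>
  spark.sql(query) // executes each Hive SQL statement sequentially on the driver
}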

H2 database content is not persisting on insert and update

I am using an H2 database to test my Postgres Slick functionality.
I created the H2DBComponent below:
import org.slf4j.LoggerFactory

trait H2DBComponent extends DbComponent {
  val driver = slick.jdbc.H2Profile
  import driver.api._

  val h2Url = "jdbc:h2:mem:test;MODE=PostgreSQL;DB_CLOSE_DELAY=-1;DATABASE_TO_UPPER=false;INIT=runscript from './test/resources/schema.sql'\\;runscript from './test/resources/schemadata.sql'"
  val logger = LoggerFactory.getLogger(this.getClass)

  val db: Database = {
    logger.info("Creating test connection ..................................")
    Database.forURL(url = h2Url, driver = "org.h2.Driver")
  }
}
In the above snippet I am creating my tables using schema.sql and inserting a single row with schemadata.sql.
Then I am trying to insert a record into the table using my test case, as below:
class RequestRepoTest extends FunSuite with RequestRepo with H2DBComponent {
  test("Add new Request") {
    val response = insertRequest(Request("XYZ", "tk", "DM", "RUNNING", "0.1", "l1", "file1",
      Timestamp.valueOf("2016-06-22 19:10:25"), Some(Timestamp.valueOf("2016-06-22 19:10:25")), Some("scienceType")))
    val actualResult = Await.result(response, 10 seconds)
    assert(actualResult === 1)
    val response2 = getAllRequest()
    assert(Await.result(response2, 5 seconds).size === 2)
  }
}
The insert assert above passes, indicating that the record was inserted. But the getAllRequest() assert fails, as the output still contains only the single row inserted by schemadata.sql, which means the insertRequest change is not persisted. However, the statements below report that the record was inserted, since the insert returned 1 (one record inserted).
val response = insertRequest(Request("CMP_XYZ","tesco_uk", "DM", "RUNNING", "0.1", "l1", "file1",
Timestamp.valueOf("2016-06-22 19:10:25"), Some(Timestamp.valueOf("2016-06-22 19:10:25")),
Some("scienceType")))
val actualResult=Await.result(response,10 seconds)
Below is my definition of insertRequest:
def insertRequest(request: Request): Future[Int] = {
  db.run { requestTableQuery += request }
}
I am unable to figure out how I can see the inserted record. Is there any property or config that I need to add?
But the getAllRequest() assert fails as the output still contains the single row(as inserted by schemadata.sql) => which means the insertRequest change is not persisted
I would double-check that the assert(Await.result(response2, 5 seconds).size === 2) line is really failing because of a size difference. Could it be failing for some other, more general reason?
For example, as INIT is run on each connection it could be that you are re-creating the database for each connection. Unless you're careful with the SQL, that could produce an error such as "table already exists". Adding TRACE_LEVEL_SYSTEM_OUT=2; to your H2 URL can be helpful in tracking what H2 is doing.
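For example, based on the URL from the question (a sketch; the flag is just another semicolon-separated H2 setting and can sit anywhere before INIT):
val h2Url = "jdbc:h2:mem:test;MODE=PostgreSQL;DB_CLOSE_DELAY=-1;DATABASE_TO_UPPER=false;TRACE_LEVEL_SYSTEM_OUT=2;INIT=runscript from './test/resources/schema.sql'\\;runscript from './test/resources/schemadata.sql'"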
A couple of suggestions.
First, you could ensure your SQL only runs as needed. For example, your schema.sql could add checks to avoid trying to create the table twice:
CREATE TABLE IF NOT EXISTS my_table( my_column VARCHAR NULL );
And likewise for your schemadata.sql:
MERGE INTO my_table KEY(my_column) VALUES ('a') ;
Alternatively, you could establish the schema and test data around your tests (e.g., in Scala code, using Slick). Your test framework probably has a way to ensure something is run before and after a test or test suite.
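If you go that route, a rough sketch of what it could look like, reworking the RequestRepoTest from the question (assumptions: ScalaTest's BeforeAndAfterEach is available, and requestTableQuery is reachable through RequestRepo / DbComponent):
import scala.concurrent.Await
import scala.concurrent.duration._
import slick.jdbc.H2Profile.api._
import org.scalatest.{BeforeAndAfterEach, FunSuite}

class RequestRepoTest extends FunSuite with BeforeAndAfterEach with RequestRepo with H2DBComponent {

  override def beforeEach(): Unit = {
    // Create the schema in code instead of via INIT=runscript, so each test starts clean;
    // the seed row from schemadata.sql could be inserted here the same way (requestTableQuery += ...).
    Await.result(db.run(requestTableQuery.schema.create), 10.seconds)
  }

  override def afterEach(): Unit = {
    // Drop the schema so the next test (and the next connection) starts from scratch.
    Await.result(db.run(requestTableQuery.schema.drop), 10.seconds)
  }

  // tests go here, as in the original RequestRepoTest
}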

AWS Glue PySpark replace NULLs

I am running an AWS Glue job to load a pipe delimited file on S3 into an RDS Postgres instance, using the auto-generated PySpark script from Glue.
Initially, it complained about NULL values in some columns:
pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null"
After some googling and reading on SO, I tried to replace the NULLs in my file by converting my AWS Glue DynamicFrame to a Spark DataFrame, executing fillna(), and converting back to a DynamicFrame.
datasource0 = glueContext.create_dynamic_frame.from_catalog(database =
"xyz_catalog", table_name = "xyz_staging_files", transformation_ctx =
"datasource0")
custom_df = datasource0.toDF()
custom_df2 = custom_df.fillna(-1)
custom_df3 = custom_df2.fromDF()
applymapping1 = ApplyMapping.apply(frame = custom_df3, mappings = [("id",
"string", "id", "int"),........more code
References:
https://github.com/awslabs/aws-glue-samples/blob/master/FAQ_and_How_to.md#3-there-are-some-transforms-that-i-cannot-figure-out
How to replace all Null values of a dataframe in Pyspark
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.fillna
Now, when I run my job, it throws the following error:
Log Contents:
Traceback (most recent call last):
File "script_2017-12-20-22-02-13.py", line 23, in <module>
custom_df3 = custom_df2.fromDF()
AttributeError: 'DataFrame' object has no attribute 'fromDF'
End of LogType:stdout
I am new to Python and Spark and have tried a lot, but can't make sense of this. Appreciate some expert help on this.
I tried changing my reconvert command to this:
custom_df3 = glueContext.create_dynamic_frame.fromDF(frame = custom_df2)
But still got the error:
AttributeError: 'DynamicFrameReader' object has no attribute 'fromDF'
UPDATE:
I suspect this is not about NULL values. The message "Can't get JDBC type for null" seems not to refer to a NULL value, but to some data/type that JDBC is unable to decipher.
I created a file with only 1 record and no NULL values, changed all Boolean types to INT (and replaced the values with 0 and 1), but still get the same error:
pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null"
UPDATE:
Make sure DynamicFrame is imported (from awsglue.dynamicframe import DynamicFrame), since fromDF / toDF are part of DynamicFrame.
Refer to https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html
You are calling .fromDF on the wrong class. It should look like this:
from awsglue.dynamicframe import DynamicFrame
DynamicFrame.fromDF(custom_df2, glueContext, 'label')
For this error, pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null",
you should drop the null fields.
I was getting similar errors while loading into Redshift tables. After using the command below, the issue was resolved:
Loading= DropNullFields.apply(frame = resolvechoice3, transformation_ctx = "Loading")
fillna() fills null values with other specified values, whereas DropNullFields drops all null fields in a DynamicFrame whose type is NullType, i.e. fields whose value is missing or null in every record of the data set.
In your specific situation, you need to make sure you are using the right class for the appropriate dataset.
Here is the edited version of your code:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database =
"xyz_catalog", table_name = "xyz_staging_files", transformation_ctx =
"datasource0")
custom_df = datasource0.toDF()
custom_df2 = custom_df.fillna(-1)
custom_df3 = DynamicFrame.fromDF(custom_df2, glueContext, 'your_label')
applymapping1 = ApplyMapping.apply(frame = custom_df3, mappings = [("id",
"string", "id", "int"),........more code
This is what you are doing: 1. Read the file into a DynamicFrame, 2. Convert it to a DataFrame, 3. Drop null values, 4. Convert back to a DynamicFrame, and 5. ApplyMapping. You were getting the error because your step 4 was wrong: you were feeding a DataFrame to ApplyMapping, which does not work. ApplyMapping is designed for DynamicFrames.
I would suggest reading your data into a DynamicFrame and sticking to the same data type. It would look like this (one way to do it):
from awsglue.dynamicframe import DynamicFrame
datasource0 = glueContext.create_dynamic_frame.from_catalog(database =
"xyz_catalog", table_name = "xyz_staging_files", transformation_ctx =
"datasource0")
custom_df = DropNullFields.apply(frame=datasource0)
applymapping1 = ApplyMapping.apply(frame = custom_df, mappings = [("id",
"string", "id", "int"),........more code

How do I create a Spark RDD from Accumulo 1.6 in spark-notebook?

I have a Vagrant image with Spark Notebook, Spark, Accumulo 1.6, and Hadoop all running. From notebook, I can manually create a Scanner and pull test data from a table I created using one of the Accumulo examples:
val instanceNameS = "accumulo"
val zooServersS = "localhost:2181"
val instance: Instance = new ZooKeeperInstance(instanceNameS, zooServersS)
val connector: Connector = instance.getConnector( "root", new PasswordToken("password"))
val auths = new Authorizations("exampleVis")
val scanner = connector.createScanner("batchtest1", auths)
scanner.setRange(new Range("row_0000000000", "row_0000000010"))
for(entry: Entry[Key, Value] <- scanner) {
println(entry.getKey + " is " + entry.getValue)
}
will give the first ten rows of table data.
When I try to create the RDD thusly:
val rdd2 =
sparkContext.newAPIHadoopRDD (
new Configuration(),
classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
classOf[org.apache.accumulo.core.data.Key],
classOf[org.apache.accumulo.core.data.Value]
)
I get an RDD returned to me that I can't do much with due to the following error:
java.io.IOException: Input info has not been set.
  at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
  at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
  at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1367)
  at org.apache.spark.rdd.RDD.count(RDD.scala:927)
This totally makes sense in light of the fact that I haven't specified any parameters as to which table to connect with, what the auths are, etc.
So my question is: What do I need to do from here to get those first ten rows of table data into my RDD?
update one
Still doesn't work, but I did discover a few things. It turns out there are two nearly identical packages,
org.apache.accumulo.core.client.mapreduce
&
org.apache.accumulo.core.client.mapred
Both have nearly identical members, except that some of the method signatures are different. I'm not sure why both exist, as there's no deprecation notice that I could see. I attempted to implement Sietse's answer with no joy. Below is what I did, and the responses:
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.conf.Configuration
val jobConf = new JobConf(new Configuration)
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.conf.Configuration
jobConf: org.apache.hadoop.mapred.JobConf = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml
Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml
AbstractInputFormat.setConnectorInfo(jobConf,
"root",
new PasswordToken("password")
AbstractInputFormat.setScanAuthorizations(jobConf, auths)
AbstractInputFormat.setZooKeeperInstance(jobConf, new ClientConfiguration)
val rdd2 =
sparkContext.hadoopRDD (
jobConf,
classOf[org.apache.accumulo.core.client.mapred.AccumuloInputFormat],
classOf[org.apache.accumulo.core.data.Key],
classOf[org.apache.accumulo.core.data.Value],
1
)
rdd2: org.apache.spark.rdd.RDD[(org.apache.accumulo.core.data.Key, org.apache.accumulo.core.data.Value)] = HadoopRDD[1] at hadoopRDD at <console>:62
rdd2.first
java.io.IOException: Input info has not been set.
  at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
  at org.apache.accumulo.core.client.mapred.AbstractInputFormat.validateOptions(AbstractInputFormat.java:308)
  at org.apache.accumulo.core.client.mapred.AbstractInputFormat.getSplits(AbstractInputFormat.java:505)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1077)
  at org.apache.spark.rdd.RDD.first(RDD.scala:1110)
  at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:64)
  at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:69)
  at ...
edit 2
re: Holden's answer - still no joy:
AbstractInputFormat.setConnectorInfo(jobConf,
"root",
new PasswordToken("password")
AbstractInputFormat.setScanAuthorizations(jobConf, auths)
AbstractInputFormat.setZooKeeperInstance(jobConf, new ClientConfiguration)
InputFormatBase.setInputTableName(jobConf, "batchtest1")
val rddX = sparkContext.newAPIHadoopRDD(
jobConf,
classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
classOf[org.apache.accumulo.core.data.Key],
classOf[org.apache.accumulo.core.data.Value]
)
rddX: org.apache.spark.rdd.RDD[(org.apache.accumulo.core.data.Key, org.apache.accumulo.core.data.Value)] = NewHadoopRDD[0] at newAPIHadoopRDD at <console>:58
Out[15]: NewHadoopRDD[0] at newAPIHadoopRDD at <console>:58
rddX.first
java.io.IOException: Input info has not been set.
  at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
  at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
  at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1077)
  at org.apache.spark.rdd.RDD.first(RDD.scala:1110)
  at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:61)
  at ...
edit 3 -- progress!
I was able to figure out why the 'Input info has not been set' error was occurring. The eagle-eyed among you will no doubt see that the following code is missing a closing ')':
AbstractInputFormat.setConnectorInfo(jobConf, "root", new PasswordToken("password")
As I'm doing this in spark-notebook, I'd been clicking the execute button and moving on because I wasn't seeing an error. What I forgot was that the notebook does what spark-shell does when you leave off a closing ')': it waits forever for you to add it. So the error was the result of the 'setConnectorInfo' method never getting executed.
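For reference, the corrected call is just the same line with the parenthesis closed:
AbstractInputFormat.setConnectorInfo(jobConf, "root", new PasswordToken("password"))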
Unfortunately, I'm still unable to shove the Accumulo table data into an RDD that's usable to me. When I execute
rddX.count
I get back
res15: Long = 10000
which is the correct response - there are 10,000 rows of data in the table I pointed to. However, when I try to grab the first element of data thusly:
rddX.first
I get the following error:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0.0 in stage 0.0 (TID 0) had a not serializable result:
org.apache.accumulo.core.data.Key
Any thoughts on where to go from here?
edit 4 -- success!
The accepted answer plus the comments got me 90% of the way there, except for the fact that the Accumulo Key/Value need to be converted into something serializable. I got this working by invoking the .toString() method on both. I'll try to post complete working code soon in case anyone else runs into the same issue.
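In code, that workaround looks roughly like this (converting the non-serializable pairs to Strings before they are shipped back to the driver):
// Map the Accumulo Key/Value pairs, which are not serializable, to plain Strings
val stringRdd = rddX.map { case (key, value) => (key.toString, value.toString) }
stringRdd.first // now returns a (String, String) pair instead of failing on serialization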
Generally with custom Hadoop InputFormats, the information is specified using a JobConf. As Sietse pointed out, there are some static methods on AccumuloInputFormat that you can use to configure the JobConf. In this case I think what you would want to do is:
val jobConf = new JobConf() // Create a job conf
// Configure the job conf with our accumulo properties
AccumuloInputFormat.setConnectorInfo(jobConf, principal, token)
AccumuloInputFormat.setScanAuthorizations(jobConf, authorizations)
val clientConfig = new ClientConfiguration().withInstance(instanceName).withZkHosts(zooKeepers)
AccumuloInputFormat.setZooKeeperInstance(jobConf, clientConfig)
AccumuloInputFormat.setInputTableName(jobConf, tableName)
// Create an RDD using the jobConf
val rdd2 = sc.newAPIHadoopRDD(jobConf,
classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
classOf[org.apache.accumulo.core.data.Key],
classOf[org.apache.accumulo.core.data.Value]
)
Note: after digging into the code, it seems the "is configured" flag is set based in part on the class it is called on (which makes sense, to avoid conflicts with other packages), so when the concrete class later tries to read it back, it fails to find the flag. The solution is to not use the abstract classes (see https://github.com/apache/accumulo/blob/bf102d0711103e903afa0589500f5796ad51c366/core/src/main/java/org/apache/accumulo/core/client/mapreduce/lib/impl/ConfiguratorBase.java#L127 for the implementation details). If you can't call these methods on the concrete implementation with spark-notebook, then using spark-shell or a regularly built application is probably the easiest solution.
It looks like those parameters have to be set through static methods: http://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/client/mapred/AccumuloInputFormat.html. So try setting the non-optional parameters and run again. It should work.
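A rough sketch of that suggestion, tied to the question's setup (assumptions: the mapred flavour of the API as in the first update, and the instance, table, and credentials from the question; like the accepted answer, it calls the static setters via the concrete AccumuloInputFormat rather than AbstractInputFormat):
import org.apache.hadoop.mapred.JobConf
import org.apache.accumulo.core.client.ClientConfiguration
import org.apache.accumulo.core.client.mapred.AccumuloInputFormat
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import org.apache.accumulo.core.security.Authorizations

val jobConf = new JobConf()
// Set the non-optional parameters: connector info, auths, ZooKeeper instance, and table name
AccumuloInputFormat.setConnectorInfo(jobConf, "root", new PasswordToken("password"))
AccumuloInputFormat.setScanAuthorizations(jobConf, new Authorizations("exampleVis"))
AccumuloInputFormat.setZooKeeperInstance(jobConf,
  new ClientConfiguration().withInstance("accumulo").withZkHosts("localhost:2181"))
AccumuloInputFormat.setInputTableName(jobConf, "batchtest1")

// The mapred input format pairs with hadoopRDD (newAPIHadoopRDD is for the mapreduce package)
val rdd = sparkContext.hadoopRDD(
  jobConf,
  classOf[AccumuloInputFormat],
  classOf[org.apache.accumulo.core.data.Key],
  classOf[org.apache.accumulo.core.data.Value]
)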

Cassandra-Hector-Scala:How can I get all row Composite key in column family?

My data storage format is:
Family name :Test
Rowkey: comkey1:comkey2
=>(name=name,value='xyz',timestamp=1554515485)
-------------------------------------------------------
Rowkey: comkey1:comkey3
=>(name=name,value='abc',timestamp=1554515485)
-------------------------------------------------------
Rowkey: comkey1:comkey4
=>(name=name,value='pqr',timestamp=1554515485)
-------------------------------------------------------
Now I want to fetch all composite keys from the "test" family, and I am trying:
def test = Action {
  val cluster = HFactory.getOrCreateCluster("Test Cluster", "127.0.0.1:9160")
  val keyspace = HFactory.createKeyspace("winoriatest", cluster)
  var startKey = new Composite()
  var endKey = new Composite()
  startKey.addComponent("comkey1", StringSerializer.get())
  startKey.addComponent("comkey2", StringSerializer.get())
  endKey.addComponent("comkey1", StringSerializer.get())
  endKey.addComponent("comkey4", StringSerializer.get())
  val rangeSlicesQuery = HFactory.createRangeSlicesQuery(keyspace, CompositeSerializer.get(), StringSerializer.get(), StringSerializer.get())
  rangeSlicesQuery.setColumnFamily("test")
  // CompositeSerializer.get() is not working.
  rangeSlicesQuery.setKeys(startKey, endKey)
  rangeSlicesQuery.setRange(null, null, false, Integer.MAX_VALUE)
  rangeSlicesQuery.setReturnKeysOnly()
  val result = rangeSlicesQuery.execute()
  val orderedRows = result.get()
  import scala.collection.JavaConversions._
  for (sc <- orderedRows) {
    println(sc.getKey())
  }
  Ok(views.html.index("Your new application is ready."))
}
Error: [NullPointerException: null] on the line
val result = rangeSlicesQuery.execute()
Cassandra 2.0, Scala 2.10.2
Thank you in advance for your help in resolving this.
It is giving me a NullPointerException, and the same code works in Java.
My Java code is:
Cluster cluster = HFactory.getOrCreateCluster("Test Cluster","127.0.0.1:9160");
Keyspace keyspace = HFactory.createKeyspace("winoriatest", cluster);
Serializer<String> se= StringSerializer.get() ;
Serializer<Long> le= LongSerializer.get() ;
Serializer<Integer> ie= IntegerSerializer.get() ;
CompositeSerializer ce = new CompositeSerializer();
RangeSlicesQuery<Composite,String,byte[]> rangeSliceQuery=HFactory.createRangeSlicesQuery(keyspace,ce,se, BytesArraySerializer.get());
rangeSliceQuery.setColumnFamily("test");
rangeSliceQuery.setRange(null,null, false, Integer.MAX_VALUE);
QueryResult<OrderedRows<Composite,String,byte[]>>result=rangeSliceQuery.execute();
OrderedRows<Composite,String,byte[]> orderedRows=result.get();
for (Row<Composite,String,byte[]> r:orderedRows)
{
System.out.println("Compositekey="+r.getKey().get(0,se)+":"+r.getKey().get(1, se));
}
I'm not quite sure what "i want to fetch all composite key in test family" means. If you mean, you want to get just the partition [row] key components, then you can do this in CQL as simply as:
SELECT DISTINCT a, b FROM test
(Assigning a and b to be the column names.)
This is a good example of how much simpler CQL makes Cassandra development, which is why we're pushing people to use the native CQL driver over legacy clients like Hector.
For more on how CQL makes sense of a Thrift data model like this, see http://www.datastax.com/dev/blog/cql3-for-cassandra-experts.
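To illustrate that suggestion (not part of the original answer): a hedged Scala sketch that runs the same DISTINCT query through the DataStax Java driver, assuming the "winoriatest" keyspace and "test" table from the question, with partition-key columns named a and b as above.
import com.datastax.driver.core.Cluster
import scala.collection.JavaConverters._

// Connect with the native CQL driver instead of Hector
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("winoriatest")

// Fetch only the distinct partition-key components
val rows = session.execute("SELECT DISTINCT a, b FROM test").all().asScala
rows.foreach(row => println(row.getString("a") + ":" + row.getString("b")))

cluster.close()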