I am currently working with AWS and PySpark. My tables are stored in S3 and queryable from Athena.
In my Glue jobs, I'm used to loading my tables like this:
my_table_df = sparkSession.table("myTable")
However, this time, I want to access a table from another database, in the same data source (AwsDataCatalog). So I do something that works well:
my_other_table_df = sparkSession.sql("SELECT * FROM anotherDatabase.myOtherTable")
I am just looking for a better way to write the same thing, without a SQL query, in one line, just by specifying the database for this operation. Something that would look like:
sparkSession.database("anotherDatabase").table("myOtherTable")
Any suggestion would be welcome
You can use the DynamicFrameReader for that. It returns a DynamicFrame, and you can call .toDF() on that DynamicFrame to turn it into a native Spark DataFrame.
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)

# Read the table from the Glue Data Catalog (any database) and convert it to a Spark DataFrame
data_source = glue_context.create_dynamic_frame.from_catalog(
    database="database",
    table_name="table_name"
).toDF()
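If all you need is a plain Spark DataFrame, note that SparkSession.table also accepts a database-qualified name, so a one-liner is possible without going through SQL. A minimal sketch, assuming the Glue Data Catalog is configured as the session's metastore (the default in Glue jobs):
# Assumes the Glue Data Catalog is the session's metastore (the default in Glue jobs)
my_other_table_df = spark.table("anotherDatabase.myOtherTable")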
Related
In spark-shell, how do I load an existing Hive table, but only one of its partitions?
val df = spark.read.format("orc").load("mytable")
I was looking for a way so it only loads one particular partition of this table.
Thanks!
There is no direct way to do this with spark.read.format, but you can use a where condition:
val df = spark.read.format("orc").load("mytable").where("<partition_col> = '<value>'")
Nothing is loaded until you perform an action: load (pointing to your ORC file location) is just a function on DataFrameReader, like below, and it doesn't read any data until an action is triggered.
see here DataFrameReader
def load(paths: String*): DataFrame = {
...
}
In the above code (spark.read...), where is just a condition; when you specify it, the data still isn't loaded immediately :-)
Only when you call an action such as df.count will the partition filter be applied to the ORC data path, so only that partition's directory is read.
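A quick way to confirm the filter is actually pushed down to the partition directories is to look at the physical plan before running any action. A minimal PySpark sketch (the partition column dt and the path are placeholders):
# Hypothetical example: 'dt' is assumed to be the table's partition column
df = spark.read.format("orc").load("/path/to/mytable").where("dt = '2019-01-30'")
# The physical plan should list the condition under PartitionFilters, meaning only
# that partition's directory will be scanned once an action runs
df.explain()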
There is no function in the Spark API to load only a partition directory, but another way around this is that the partition directory is nothing but a column in the where clause. You can write a simple SQL query with the partition column in the where clause, and it will read data only from the partition directory. See if that works for you.
val df = spark.sql("SELECT * FROM mytable WHERE <partition_col_name> = <expected_value>")
Hi, I have 90 GB of data in a CSV file. I'm loading this data into one temp table and then from the temp table into an ORC table using an insert-select, but converting and loading the data into ORC format takes 4 hours in Spark SQL. Is there any kind of optimization technique I can use to reduce this time? As of now I'm not using any optimization technique; I'm just using Spark SQL to load data from the CSV file into a table (text format) and then from this temp table into the ORC table (using insert-select).
using spark submit as:
spark-submit \
  --class class-name \
  application-jar
Or can I add any extra parameters to spark-submit to improve performance?
Scala code (sample):
import org.apache.spark.sql.SparkSession

object sample_1 {
  def main(args: Array[String]): Unit = {
    // SparkSession with Hive support enabled
    val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()

    val a1 = sparkSession.sql("load data inpath 'filepath' overwrite into table table_name")
    val b1 = sparkSession.sql("insert into tablename (all_column) select 'ALL_COLUMNS' from source_table")
  }
}
First of all, you don't need to store the data in a temp table in order to write it into a Hive table later. You can read the file and write the output straight away using the DataFrameWriter API. This removes one step from your code.
You can write as follows:
val spark = SparkSession.builder.enableHiveSupport().getOrCreate()
val inputDF = spark.read.csv(filePath) // Add header or delimiter options if needed
inputDF.write.mode("append").format(outputFormat).saveAsTable(outputDB + "." + outputTableName)
Here, the outputFormat will be orc, the outputDB will be your hive database and outputTableName will be your Hive table name.
I think using the above technique, your write time will reduce significantly. Also, please mention the resources your job is using and I may be able to optimize it further.
Another optimization you can use is to partition your dataframe while writing. This will make the write operation faster. However, you need to decide the columns on which to partition carefully so that you don't end up creating a lot of partitions.
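As a minimal PySpark sketch (the column event_date and the table name are placeholders, not from the original job):
# Minimal sketch: 'event_date' is a hypothetical low-cardinality column, names are placeholders
(inputDF.write
    .mode("append")
    .format("orc")
    .partitionBy("event_date")   # one sub-directory per distinct value of the column
    .saveAsTable("my_hive_db.my_orc_table"))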
I have a BQ table and it's partitioned by the default _PARTITIONTIME. I want to read one of its partitions to Spark dataframe (PySpark). However, the spark.read API doesn't seem to recognize the partition column. Below is the code (which doesn't work):
table = 'myProject.myDataset.table'
df = spark.read.format('bigquery').option('table', table).load()
df_pt = df.filter("_PARTITIONTIME = TIMESTAMP('2019-01-30')")
The partition is quite large so I'm not able to read as a pandas dataframe.
Thank you very much.
Good question
I filed https://github.com/GoogleCloudPlatform/spark-bigquery-connector/issues/50 to track this.
A workaround today is the filter option on read, which should work:
df = spark.read.format('bigquery').option('table', table) \
    .option('filter', "_PARTITIONTIME = '2019-01-30'").load()
Try using the "$" operator: https://cloud.google.com/bigquery/docs/creating-partitioned-tables
So, the table you'd be pulling from is "myProject.myDataset.table$20190130"
table = 'myProject.myDataset.table'
partition = '20190130'
df = spark.read.format('bigquery').option('table', f'{table}${partition}').load()
I am quite new to PySpark. Therefore this question may appear as quite elementary to others.
I am trying to export a data frame created via createOrReplaceTempView() to Hive. The steps are as follows
sqlcntx = SQLContext(sc)
df = sqlcntx.read.format("jdbc").options(url="sqlserver://.....details of MS Sql server",dbtable = "table_name").load()
df_cv_temp = df.createOrReplaceTempView("df")
When I use df_cv_temp.show(5) it is giving an error as follows
NoneType Object has no attribute 'show'
Interestingly when I try to see df.show(5) I am getting proper output.
Naturally when I see the above error I am not able to proceed further.
Now I have two questions.
How to fix the above issue?
Assuming the 1st issue is taken care of, what is the best way to export df_cv_temp to HIVE tables?
P.S. I am using PySpark 2.0
Update: Incorporating Jim's Answer
Post answer received from Jim, I have updated the code. Please see below the revised code.
from pyspark.sql import HiveContext,SQLContext
sql_cntx = SQLContext(sc)
df = sql_cntx.read.format("jdbc").options(url="sqlserver://.....details of MS Sql server",dbtable = "table_name").load()
df.createOrReplaceTempView("df_cv_temp")
df_cv_filt = sql_cntx.sql("select * from df_cv_temp where DeviceTimeStamp between date_add(current_date(),-1) and current_date()") # Retrieving just a day's record
hc = HiveContext(sc)
Now the problem begins. Please refer to my question 2.
df_cv_tbl = hc.sql("create table if not exists df_cv_raw as select * from df_cv_filt")
df_cv_tbl.write.format("orc").saveAsTable("df_cv_raw")
The above two lines produce the error shown below.
pyspark.sql.utils.AnalysisException: u'Table or view not found: df_cv_filt; line 1 pos 14'
So what is the right way of approaching this?
Instead of
df_cv_temp = df.createOrReplaceTempView("df")
you have to use,
df.createOrReplaceTempView("table1")
This is because df.createOrReplaceTempView(<name_of_the_view>) creates (or replaces, if that view name already exists) a lazily evaluated "view" that you can then use like a Hive table in Spark SQL. The call itself returns nothing (None), which is why the assigned variable is a NoneType object rather than a DataFrame.
Further, the temp view can be queried as below:
spark.sql("SELECT field1 AS f1, field2 as f2 from table1").show()
In case you are sure you have the storage space, you can persist it as a Hive table directly, like below. This will create a managed Hive table physically, which you can then query even from your Hive CLI.
df.write.saveAsTable("table1")
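Regarding the follow-up error in the update (Table or view not found: df_cv_filt): df_cv_filt is a Python DataFrame variable, not a view registered in the catalog, so hc.sql cannot see it. A rough sketch of two ways around this, assuming the contexts share the same underlying SparkSession and it was created with Hive support:
# Option 1: register the filtered DataFrame as a temp view so the CTAS can see it
df_cv_filt.createOrReplaceTempView("df_cv_filt")
hc.sql("create table if not exists df_cv_raw as select * from df_cv_filt")

# Option 2: skip the CTAS and let the DataFrameWriter create the Hive table directly
df_cv_filt.write.mode("overwrite").format("orc").saveAsTable("df_cv_raw")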
I need to use Athena in Spark, but Spark uses prepared statements when using JDBC drivers, and it gives me an exception:
"com.amazonaws.athena.jdbc.NotImplementedException: Method Connection.prepareStatement is not yet implemented"
Can you please let me know how I can connect to Athena from Spark?
I don't know how you'd connect to Athena from Spark, but you don't need to - you can very easily query the data that Athena contains (or, more correctly, "registers") from Spark.
There are two parts to Athena:
the Hive Metastore (now called the Glue Data Catalog), which contains mappings between database and table names and all underlying files
the Presto query engine, which translates your SQL into data operations against those files
When you start an EMR cluster (v5.8.0 and later) you can instruct it to connect to your Glue Data Catalog. This is a checkbox in the 'create cluster' dialog. When you check this option your Spark SqlContext will connect to the Glue Data Catalog, and you'll be able to see the tables in Athena.
You can then query these tables as normal.
See https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html for more.
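For example, once the cluster is using the Glue Data Catalog as its metastore, querying a table that Athena sees is just a normal Spark SQL call (the database and table names below are placeholders):
# 'my_database.my_table' is a placeholder for any table registered in the Glue Data Catalog
df = spark.sql("SELECT * FROM my_database.my_table LIMIT 10")
df.show()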
You can use this JDBC driver: SimbaAthenaJDBC
<dependency>
<groupId>com.syncron.amazonaws</groupId>
<artifactId>simba-athena-jdbc-driver</artifactId>
<version>2.0.2</version>
</dependency>
To use it:
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
  .builder()
  .appName("My Spark Example")
  .getOrCreate();
Class.forName("com.simba.athena.jdbc.Driver");
Properties connectionProperties = new Properties();
connectionProperties.put("User", "AWSAccessKey");
connectionProperties.put("Password", "AWSSecretAccessKey");
connectionProperties.put("S3OutputLocation", "s3://my-bucket/tmp/");
connectionProperties.put("AwsCredentialsProviderClass",
"com.simba.athena.amazonaws.auth.PropertiesFileCredentialsProvider");
connectionProperties.put("AwsCredentialsProviderArguments", "/my-folder/.athenaCredentials");
connectionProperties.put("driver", "com.simba.athena.jdbc.Driver");
List<String> predicateList =
Stream
.of("id = 'foo' and date >= DATE'2018-01-01' and date < DATE'2019-01-01'")
.collect(Collectors.toList());
String[] predicates = new String[predicateList.size()];
predicates = predicateList.toArray(predicates);
Dataset<Row> data =
spark.read()
.jdbc("jdbc:awsathena://AwsRegion=us-east-1;",
"my_env.my_table", predicates, connectionProperties);
You can also use this driver in a Flink application:
TypeInformation[] fieldTypes = new TypeInformation[] {
BasicTypeInfo.STRING_TYPE_INFO,
BasicTypeInfo.STRING_TYPE_INFO
};
RowTypeInfo rowTypeInfo = new RowTypeInfo(fieldTypes);
JDBCInputFormat jdbcInputFormat = JDBCInputFormat.buildJDBCInputFormat()
.setDrivername("com.simba.athena.jdbc.Driver")
.setDBUrl("jdbc:awsathena://AwsRegion=us-east-1;UID=my_access_key;PWD=my_secret_key;S3OutputLocation=s3://my-bucket/tmp/;")
.setQuery("select id, val_col from my_env.my_table WHERE id = 'foo' and date >= DATE'2018-01-01' and date < DATE'2019-01-01'")
.setRowTypeInfo(rowTypeInfo)
.finish();
DataSet<Row> dbData = env.createInput(jdbcInputFormat, rowTypeInfo);
You can't directly connect Spark to Athena. Athena is simply an implementation of Prestodb targeting s3. Unlike Presto, Athena cannot target data on HDFS.
However, if you want to use Spark to query data in s3, then you are in luck with HUE, which will let you query data in s3 from Spark on Elastic Map Reduce (EMR).
See Also:
Developer Guide for Hadoop User Experience (HUE) on EMR.
The answer from Kirk Broadhurst is correct if you want to use the data behind Athena.
If you want to use the Athena engine itself, there is a lib on GitHub that overcomes the prepareStatement problem.
Note that I didn't succeed in using the lib, due to my lack of experience with Maven, etc.
Actually you can use B2W's Spark Athena Driver.
https://github.com/B2W-BIT/athena-spark-driver