How to read a partitioned table from BigQuery into a Spark DataFrame (in PySpark)

I have a BQ table that is partitioned by the default _PARTITIONTIME. I want to read one of its partitions into a Spark DataFrame (PySpark). However, the spark.read API doesn't seem to recognize the partition column. Below is the code (which doesn't work):
table = 'myProject.myDataset.table'
df = spark.read.format('bigquery').option('table', table).load()
df_pt = df.filter("_PARTITIONTIME = TIMESTAMP('2019-01-30')")
The partition is quite large, so I'm not able to read it as a pandas DataFrame.
Thank you very much.

Good question
I filed https://github.com/GoogleCloudPlatform/spark-bigquery-connector/issues/50 to track this.
A workaround today is to pass the filter option to the reader:
df = spark.read.format('bigquery').option('table', table) \
    .option('filter', "_PARTITIONTIME = '2019-01-30'").load()
This should work today.
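If more than one day is needed, the same filter option could probably carry a range predicate. The helper below is only a sketch in plain Python: partition_filter is a hypothetical name, and it assumes the connector accepts a half-open range on _PARTITIONTIME written with the same string literals used in the workaround above.

```python
from datetime import date


def partition_filter(start: date, end: date) -> str:
    """Build a predicate selecting partitions in the half-open range [start, end).

    Hypothetical helper: it only formats a string, it does not talk to BigQuery.
    """
    if end <= start:
        raise ValueError("end must be after start")
    return (
        f"_PARTITIONTIME >= '{start.isoformat()}' "
        f"AND _PARTITIONTIME < '{end.isoformat()}'"
    )


# One day of data: 2019-01-30 inclusive up to 2019-01-31 exclusive.
flt = partition_filter(date(2019, 1, 30), date(2019, 1, 31))
print(flt)
```

The resulting string would then be passed as .option('filter', flt) in the reader call shown above.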

Try using the "$" partition decorator: https://cloud.google.com/bigquery/docs/creating-partitioned-tables
So the table you'd be pulling from is "myProject.myDataset.table$20190130":
table = 'myProject.myDataset.table'
partition = '20190130'
df = spark.read.format('bigquery').option('table', f'{table}${partition}').load()
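When the partition comes from a date object rather than a hand-typed string, the decorated table name can be built as in the sketch below. partition_table_id is a hypothetical helper name, and the sketch assumes a daily-partitioned table whose decorator uses the YYYYMMDD form shown above.

```python
from datetime import date


def partition_table_id(table: str, day: date) -> str:
    """Return 'table$YYYYMMDD' for a daily partition (hypothetical helper)."""
    # The '$' is a literal character; the date is formatted as YYYYMMDD.
    return f"{table}${day:%Y%m%d}"


table_id = partition_table_id("myProject.myDataset.table", date(2019, 1, 30))
print(table_id)  # myProject.myDataset.table$20190130
```

The result can be passed to .option('table', table_id) in the same way as the f-string above.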

Related

Load a table from another database in pyspark

I am currently working with AWS and PySpark. My tables are stored in S3 and queryable from Athena.
In my Glue jobs, I usually load my tables like this:
my_table_df = sparkSession.table("myTable")
However, this time, I want to access a table from another database, in the same data source (AwsDataCatalog). So I do something that works well:
my_other_table_df = sparkSession.sql("SELECT * FROM anotherDatabase.myOtherTable")
I am just looking for a better way to write the same thing in one line, without using a SQL query, just by specifying the database for this operation. Something that would look like:
sparkSession.database("anotherDatabase").table("myOtherTable")
Any suggestion would be welcome.
You can use the DynamicFrameReader for that. It returns a DynamicFrame, and you can call .toDF() on that DynamicFrame to turn it into a native Spark DataFrame.
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
# "database" and "table_name" are placeholders for the catalog database and table
data_source = glue_context.create_dynamic_frame.from_catalog(
    database="database",
    table_name="table_name"
).toDF()

spark-shell load existing hive table by partition?

In spark-shell, how do I load an existing Hive table, but only one of its partitions?
val df = spark.read.format("orc").load("mytable")
I was looking for a way so it only loads one particular partition of this table.
Thanks!
There is no direct way to do this in spark.read.format, but you can use a where condition:
val df = spark.read.format("orc").load("mytable").where("<partition_col_name> = <expected_value>")
Nothing is loaded until you perform an action. load (pointing to your ORC file location) is just a method on DataFrameReader, as shown below, and it doesn't read any data until an action runs.
See DataFrameReader:
def load(paths: String*): DataFrame = {
...
}
In the code above (spark.read....where), where is just a condition. When you specify it, the data still won't be loaded immediately :-)
When you call df.count, the partition column filter is applied to the data path of the ORC files.
There is no function in the Spark API that loads only a partition directory, but a partition directory is nothing more than a column in the where clause. So you can write a simple SQL query with the partition column in the where clause, and it will read data only from that partition directory. See if that works for you.
val df = spark.sql("SELECT * FROM mytable WHERE <partition_col_name> = <expected_value>")

Applying transformations with filter or map: which one is faster (Scala Spark)?

I am trying to do some transformations on a dataset with Spark using Scala. I am currently using Spark SQL but want to move the code to native Scala. I want to know whether to use filter or map for operations such as matching the values in a column and getting a single column into a different dataset after the transformation.
SELECT * FROM TABLE WHERE COLUMN = ''
I used to write something like this in Spark SQL. Can someone tell me an alternative way to write the same thing using map or filter on the dataset, and which one is faster?
You can read documentation from Apache Spark website. This is the link to API documentation at https://spark.apache.org/docs/2.3.1/api/scala/index.html#package.
Here is a little example -
val df = sc.parallelize(Seq((1,"ABC"), (2,"DEF"), (3,"GHI"))).toDF("col1","col2")
val df1 = df.filter("col1 > 1")
df1.show()
val df2 = df1.map(x => x.getInt(0) + 3)
df2.show()
If I understand your question correctly, you need to rewrite your SQL query with the DataFrame API. Your query reads all columns from table TABLE and keeps the rows where COLUMN is empty. You can do this with a DataFrame in the following way:
spark.read.table("TABLE")
.where($"COLUMN".eqNullSafe(""))
.show(10)
Performance will be the same as in your SQL. Use dataFrame.explain(true) method to understand what Spark will do.

Scripts for generating CSV files for Spark Cassandra data

I want to generate CSV files for a table in Cassandra according to the logic below.
val df = sc.parallelize(Seq(("a",1,"abc#gmail.com"), ("b",2,"def#gmail.com"),("a",1,"xyz#gmail.com"),("a",2,"abc#gmail.com"))).toDF("col1","col2","emailId")
Since there are 3 distinct emailId values, I need to generate 3 distinct CSV files, one for each of the 3 queries below.
select * from table where emailId='abc#gmail.com'
select * from table where emailId='def#gmail.com'
select * from table where emailId='xyz#gmail.com'
How can I do this? Can anyone please help me with this?
Version:
Spark 1.6.2
Scala 2.10
Create a distinct list of the emails, then iterate over it. On each iteration, keep only the rows that match the current email and save that dataframe to Cassandra.
// sqlContext is the SQLContext that the Spark 1.6 spark-shell provides
import sqlContext.implicits._
val emailData = sc.parallelize(Seq(("a",1,"abc#gmail.com"), ("b",2,"def#gmail.com"),("a",1,"xyz#gmail.com"),("a",2,"abc#gmail.com"))).toDF("col1","col2","emailId")
val distinctEmails = emailData.select("emailId").distinct().as[String].collect
for (email <- distinctEmails) {
  // Keep only the rows for this email, in a single partition
  val subsetEmailsDF = emailData.filter($"emailId" === email).coalesce(1)
  //... Save the subset dataframe to cassandra
}
Note: coalesce(1) sends all the data to one node. This can create memory issues if the dataframe is too large.
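The same "distinct values, then filter per value" logic can be sketched without Spark. The block below is plain Python using the csv module, not the Spark or Cassandra API. write_csv_per_email and the "_at_" file naming are made up for illustration, and the sample rows are copied from the question.

```python
import csv
import tempfile
from pathlib import Path

# Sample rows copied from the question: (col1, col2, emailId)
rows = [
    ("a", 1, "abc#gmail.com"),
    ("b", 2, "def#gmail.com"),
    ("a", 1, "xyz#gmail.com"),
    ("a", 2, "abc#gmail.com"),
]


def write_csv_per_email(rows, out_dir):
    """Write one CSV file per distinct emailId and return {email: path}."""
    out_dir = Path(out_dir)
    written = {}
    # Step 1: distinct list of emails. Step 2: filter the rows for each one.
    for email in sorted({row[2] for row in rows}):
        subset = [row for row in rows if row[2] == email]
        file_name = email.replace("#", "_at_") + ".csv"  # hypothetical naming
        path = out_dir / file_name
        with path.open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["col1", "col2", "emailId"])
            writer.writerows(subset)
        written[email] = path
    return written


out = write_csv_per_email(rows, tempfile.mkdtemp())
print(sorted(out))
```

Like coalesce(1), this holds each subset in one place, so it only suits data that fits in memory.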

Merge multiple small files into a few larger files in Spark

I am using Hive through Spark. I have an INSERT INTO partitioned table query in my Spark code. The input data is 200+ GB. When Spark writes to the partitioned table, it produces very small files (a few KB each), so the output partitioned table folder now has 5000+ small files. I want to merge these into a few larger files of about 200 MB each. I tried using the Hive merge settings, but they don't seem to work.
val result7A = hiveContext.sql("set hive.exec.dynamic.partition=true")
val result7B = hiveContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")
val result7C = hiveContext.sql("SET hive.merge.size.per.task=256000000")
val result7D = hiveContext.sql("SET hive.merge.mapfiles=true")
val result7E = hiveContext.sql("SET hive.merge.mapredfiles=true")
val result7F = hiveContext.sql("SET hive.merge.sparkfiles = true")
val result7G = hiveContext.sql("set hive.aux.jars.path=c:\\Applications\\json-serde-1.1.9.3-SNAPSHOT-jar-with-dependencies.jar")
val result8 = hiveContext.sql("INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table")
The above Hive settings work in a MapReduce Hive execution and produce files of the specified size. Is there any option to do this in Spark or Scala?
I had the same issue. The solution was to add a DISTRIBUTE BY clause with the partition columns. This ensures that the data for one partition goes to a single reducer. Example in your case:
INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table DISTRIBUTE BY date
You may want to try using the DataFrame.coalesce method; it returns a DataFrame with the specified number of partitions (each of which becomes a file on insertion). So using the number of records you are inserting and the typical size of each record, you can estimate how many partitions to coalesce to if you want files of ~200MB.
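The estimate described above is simple arithmetic, sketched here in plain Python. estimate_partitions is a hypothetical helper, and the record count and average record size in the example are invented numbers. The size on disk also depends on the file format and compression, so treat the result as a starting point.

```python
import math


def estimate_partitions(num_records, avg_record_bytes,
                        target_file_bytes=200 * 1024 * 1024):
    """Estimate how many partitions give files of roughly target_file_bytes."""
    total_bytes = num_records * avg_record_bytes
    # Round up so no file exceeds the target, and never return fewer than 1.
    return max(1, math.ceil(total_bytes / target_file_bytes))


# Made-up example: 1 billion records of about 200 bytes each (roughly 200 GB).
n = estimate_partitions(1_000_000_000, 200)
print(n)  # 954
```

The value would then be passed to coalesce (or repartition) before the insert.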
The dataframe repartition(1) method works in this case.