I'm trying to insert a Spark DataFrame into a Teradata table using a Spark SQL JDBC connection.
Code:
properties = {
    "TMODE": "TERA",
    "TYPE": "FASTLOAD"
}
jdbcUrl = f"jdbc:teradata://{jdbcHostname}:{jdbcPort}/{jdbcDatabase}?user={jdbcUsername}&password={jdbcPassword}"
df.write.jdbc(url=jdbcUrl, table="someTable", mode='append', properties=properties)
But when I execute the job, it runs only one executor and opens only a single connection to Teradata.
How can I make parallel connections to Teradata? Which property needs to be included to open multiple parallel connections?
Update:
I was going through this Databricks guide; it says that, based on the number of partitions in the DataFrame, Spark will create multiple connections.
https://docs.databricks.com/spark/latest/data-sources/sql-databases.html
You can achieve this as shown below: Spark opens one parallel connection per partition (numPartitions), and lowerBound and upperBound define the range of the partitioning column (columnName) that is split evenly across those partitions.
val df = (spark.read.jdbc(url=jdbcUrl,
table="employees",
columnName="emp_no",
lowerBound=1L,
upperBound=100000L,
numPartitions=100,
connectionProperties=connectionProperties))
display(df)
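For the write side in the original question, the number of parallel connections is driven by the number of partitions of the DataFrame being written: each partition opens its own JDBC connection. A minimal Scala sketch, assuming df and jdbcUrl as in the question and an arbitrary partition count of 10:
import java.util.Properties

// Hedged sketch: each DataFrame partition writes over its own JDBC
// connection, so repartitioning before the write controls how many
// parallel connections are opened to Teradata.
val props = new Properties()
props.setProperty("TMODE", "TERA")
props.setProperty("TYPE", "FASTLOAD")

df.repartition(10)
  .write
  .mode("append")
  .jdbc(jdbcUrl, "someTable", props)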
Related
I'm trying to check the size of the different tables we're generating in our data warehouse, so we can have an automatic way to calculate partition sizes in future runs.
To get the table size, I'm reading the stats from the DataFrame in the following way:
val db = "database"
val table_name = "table_name"
val table_size_bytes = spark.read.table(s"$db.$table_name").queryExecution.analyzed.stats.sizeInBytes
This was working fine until I started running the same code on partitioned tables. Each time I ran it on a partitioned table I got the same value for sizeInBytes, which is the max allowed value for BigInt: 9223372036854775807.
Is this a bug in Spark or should I be running this in a different way for partitioned tables?
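For reference, 9223372036854775807 is Long.MaxValue, which is the default Spark uses for relations without statistics (spark.sql.defaultSizeInBytes). A hedged sketch, assuming a Spark version that supports ANALYZE TABLE, of how to check whether missing statistics are the cause:
// Hedged sketch: compute catalog statistics first so the analyzed plan
// reports a real size instead of the spark.sql.defaultSizeInBytes fallback.
spark.sql(s"ANALYZE TABLE $db.$table_name COMPUTE STATISTICS")
val sizeWithStats = spark.read.table(s"$db.$table_name")
  .queryExecution.analyzed.stats.sizeInBytes
println(sizeWithStats)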
I'm having some concerns regarding the behaviour of dataframes after writing them to Hive tables.
Context:
I run a Spark Scala (version 2.2.0.2.6.4.105-1) job through spark-submit in my production environment, which has Hadoop 2.
I do multiple computations and store some intermediate data to Hive ORC tables; after storing a table, I need to re-use the dataframe to compute a new dataframe to be stored in another Hive ORC table.
E.g.:
// dataframe with ~10 million records
val df = prev_df.filter(some_filters)
val df_temp_table_name = "temp_table"
val df_table_name = "table"
sql("SET hive.exec.dynamic.partition = true")
sql("SET hive.exec.dynamic.partition.mode = nonstrict")
df.createOrReplaceTempView(df_temp_table_name)
sql(s"""INSERT OVERWRITE TABLE $df_table_name PARTITION(partition_timestamp)
SELECT * FROM $df_temp_table_name """)
These steps always work and the table is properly populated with the correct data and partitions.
After this, I need to use the just computed dataframe (df) to update another table. So I query the table to be updated into dataframe df2, then I join df with df2, and the result of the join needs to overwrite the table of df2 (a plain, non-partitioned table).
val table_name_to_be_updated = "table2"
// Query the table to be updated
val df2 = sql(s"SELECT * FROM $table_name_to_be_updated")
val df3 = df.join(df2).filter(some_filters).withColumn(something)
val temp = "temp_table2"
df3.createOrReplaceTempView(temp)
sql(s"""INSERT OVERWRITE TABLE $table_name_to_be_updated
SELECT * FROM $temp """)
At this point, df3 is always found empty, so the resulting Hive table is always empty as well. This also happens when I .persist() it to keep it in memory.
When testing with spark-shell, I have never encountered the issue. This happens only when the flow is scheduled in cluster-mode under Oozie.
What do you think might be the issue? Do you have any advice on approaching a problem like this with efficient memory usage?
I don't understand whether it's the first df that turns empty after writing to a table, or whether the issue arises because I first query and then try to overwrite the same table.
Thank you very much in advance and have a great day!
Edit:
Previously, df was computed in an individual script and then inserted into its respective table. In a second script, that table was queried into a new variable df; then table_to_be_updated was also queried and stored into a variable, say old_df2. The two were then joined and transformed into a new variable df3, which was then inserted with overwrite into table_to_be_updated.
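One hedged way to rule out the read-then-overwrite cycle as the cause is to materialize the join result before overwriting the table it was read from; a minimal sketch reusing df3 and table_name_to_be_updated from above (the staging table name is hypothetical):
// Hedged sketch: write the join result to a staging table first, so the
// INSERT OVERWRITE does not consume and truncate its own input in one plan.
// "temp_staging_table2" is a hypothetical name.
df3.write.mode("overwrite").saveAsTable("temp_staging_table2")
sql(s"""INSERT OVERWRITE TABLE $table_name_to_be_updated
        SELECT * FROM temp_staging_table2""")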
How can PySpark remove rows in PostgreSQL by executing a query such as DELETE FROM my_table WHERE day = 3 ?
Spark SQL provides an API only for inserting/overwriting records. A library like psycopg2 could do the job, but it needs to be explicitly compiled on the remote machine, which is not doable for me. Any other suggestions?
Dataframes in Apache Spark are immutable. You can filter out the rows you don't want.
See the documentation.
A simple example could be:
df = spark.read.jdbc("conn-url", "mytable")
df.createOrReplaceTempView("mytable")
df2 = spark.sql("SELECT * FROM mytable WHERE day != 3")
df2.collect()
The only solution that works so far is to install psycopg2 on the Spark master node and issue the queries the way regular Python would. Adding that library via --py-files didn't work out for me.
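If staying on the JVM side is an option, the DELETE can also be issued from the driver through the PostgreSQL JDBC driver that Spark already uses for spark.read.jdbc, with no extra Python dependency. A minimal Scala sketch; the URL and credentials are placeholders:
import java.sql.DriverManager

// Hedged sketch: run the DELETE through the JDBC driver already on the
// Spark classpath; url, user and password below are placeholders.
val conn = DriverManager.getConnection(
  "jdbc:postgresql://host:5432/mydb", "user", "password")
try {
  val stmt = conn.createStatement()
  stmt.executeUpdate("DELETE FROM my_table WHERE day = 3")
  stmt.close()
} finally {
  conn.close()
}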
I'm using Spark with Scala to read a specific Hive partition. The partition columns are year, month, day, a and b:
scala> spark.sql("select * from db.table where year=2019 and month=2 and day=28 and a='y' and b='z'").show
But I get this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 236 in stage 0.0 failed 4 times, most recent failure: Lost task 236.3 in stage 0.0 (TID 287, server, executor 17): org.apache.hadoop.security.AccessControlException: Permission denied: user=user, access=READ, inode="/path-to-table/table/year=2019/month=2/day=27/a=w/b=x/part-00002":user:group:-rw-rw----
As you can see, Spark is trying to read a different partition, and I don't have permissions there.
It shouldn't, because I created a filter and that filter matches my partition.
I tried the same query with Hive and it works perfectly (no access problems):
Hive> select * from db.table where year=2019 and month=2 and day=28 and a='y' and b='z';
Why is Spark trying to read this partition while Hive doesn't?
Is there a Spark configuration that I am missing?
Edit: More information
Some files were created with Hive; others were copied from another server to ours with different permissions (we cannot change the permissions), and the data should then have been refreshed.
We are using:
cloudera 5.13.2.1
hive 1.1.0
spark 2.3.0
hadoop 2.6.0
scala 2.11.8
java 1.8.0_144
Show create table output (column list elided):
CREATE EXTERNAL TABLE ... (columns and types)
PARTITIONED BY (`year` int COMMENT '*', `month` int COMMENT '*', `day` int COMMENT '*', `a` string COMMENT '*', `b` string COMMENT '*')
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://path'
TBLPROPERTIES (
'transient_lastDdlTime' = '1559029332'
)
A Parquet Hive table in Spark can be read through one of the following two flows.
Hive flow:
This flow is used when spark.sql.hive.convertMetastoreParquet is set to false. For partition pruning to work in this case, you also have to set spark.sql.hive.metastorePartitionPruning=true.
spark.sql.hive.metastorePartitionPruning: When true, some predicates will be pushed down into the Hive metastore so that unmatching partitions can be eliminated earlier. This only affects Hive tables not converted to filesource relations (see HiveUtils.CONVERT_METASTORE_PARQUET and HiveUtils.CONVERT_METASTORE_ORC for more information).
Datasource flow: this flow has partition pruning turned on by default.
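A hedged sketch of how these settings can be applied before running the query (the config names come from the Spark docs; depending on the Spark version they may need to be passed via spark-submit --conf instead of being set at runtime):
// Hedged sketch: use the Hive read flow and push partition predicates down
// to the metastore, so only the matching partitions are listed and read.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")

spark.sql(
  "select * from db.table where year=2019 and month=2 and day=28 and a='y' and b='z'"
).show()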
The error can also happen when the metastore does not have the partition values for the partition columns. Can you run the following from Spark and then rerun the same query?
ALTER TABLE db.table RECOVER PARTITIONS
You will not be able to read a specific partition of a table through the Spark-Hive API unless you have access to all of its partitions: Spark uses the Hive table access permissions, and in Hive you need full access to the table.
The reason is that Spark-Hive access cannot be treated like Unix file-level access. If you need to read only what you can reach, use spark.read with the file format directly (CSV, Parquet, or whatever the table uses) and read the data as files.
For example: spark.read.csv("/path-to-table/table/year=2019/month=2/day=27/a=w/b=x/part-")
If you want to verify this, ignore Spark and try to run the same query in the Hive shell; it will not work either, as part of the Hive configuration.
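Since the DDL above shows the table is stored as Parquet, a hedged sketch of the file-based read for the partition the question actually needs might look like this (the path is assembled from the error message and the query's partition values, so treat it as an assumption):
// Hedged sketch: bypass the Hive metastore and read only the partition
// directory the user can reach on HDFS; the table is stored as Parquet.
val partitionDf = spark.read.parquet(
  "/path-to-table/table/year=2019/month=2/day=28/a=y/b=z")
partitionDf.show()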
When we use Spark to read data from CSV (or a DB) as follows, it automatically splits the data into multiple partitions and sends them to the executors:
spark
.read
.option("delimiter", ",")
.option("header", "true")
.option("mergeSchema", "true")
.option("codec", properties.getProperty("sparkCodeC"))
.format(properties.getProperty("fileFormat"))
.load(inputFile)
Currently, I have an id list:
[1,2,3,4,5,6,7,8,9,...1000]
What I want to do is split this list into multiple partitions and send them to the executors, and in each executor run SQL like:
ids.foreach(id => {
  select * from table where k = $id
})
When we load data from Cassandra, the connector will generate the query SQL as:
select columns from table where Token(k) >= ? and Token(k) <= ?
This means the connector will scan the whole table. But I don't need to scan the whole table; I just want to get the data from the table where k (the partition key) is in the id list.
The table schema is:
CREATE TABLE IF NOT EXISTS tab.events (
  k int,
  o text,
  event text,
  PRIMARY KEY (k, o)
);
Or: how can I use Spark to load data from Cassandra using a predefined query without scanning the whole table?
You simply need to use the joinWithCassandraTable function to select only the data required for your operation. But be aware that this function is only available via the RDD API.
Something like this:
val joinWithRDD = your_df.rdd.joinWithCassandraTable("tab","events")
You need to make sure that the column name in your DataFrame matches the partition key name in Cassandra - see the documentation for more information.
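A slightly fuller hedged sketch of the RDD-based approach, assuming the Spark Cassandra Connector is on the classpath and the keyspace/table are tab/events as in the schema above (the range 1 to 1000 stands in for the id list):
import com.datastax.spark.connector._

// Hedged sketch: turn the id list into an RDD keyed by the partition key `k`;
// joinWithCassandraTable then issues one targeted read per id instead of a
// full token-range scan of the table.
val ids = spark.sparkContext.parallelize(1 to 1000).map(id => Tuple1(id))
val joined = ids
  .joinWithCassandraTable("tab", "events")
  .on(SomeColumns("k"))
joined.collect().foreach(println)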
The DataFrame implementation is only available in the DSE version of the Spark Cassandra Connector, as described in the following blog post.
Update (September 2020): support for joins with Cassandra was added in Spark Cassandra Connector 2.5.0.