Spark SQL using Scala

val scc = spark.read.jdbc(url, table, properties)
scc.createOrReplaceTempView("k")
spark.sql("select * from k").show()
If you observe, on the first line we read the complete table, and only on the third line do we fetch the results of the desired query. Reading the complete table and then querying takes a lot of time. Can't we execute our query while establishing the connection? Please help if you have any prior knowledge about this.

Check this out. You can pass a subquery, wrapped in parentheses and given an alias, in place of a table name; the database then executes the query and Spark reads only its result:
var dbTable = "(select emp_no, concat_ws(' ', first_name, last_name) as full_name from employees) as employees_name";
Dataset<Row> jdbcDF = sparkSession.read().jdbc(CONNECTION_URL, dbTable, connectionProperties);
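Since the question is in Scala, here is a minimal sketch of the same approach (the name pushdownQuery is mine; spark, url and properties are assumed to be the session and JDBC settings from the question):
// The parenthesized subquery plus an alias stands in for a table name, so the
// database executes the query and Spark reads only its result set.
val pushdownQuery = "(select emp_no, concat_ws(' ', first_name, last_name) as full_name from employees) as employees_name"
val jdbcDF = spark.read.jdbc(url, pushdownQuery, properties)
jdbcDF.show()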

Related

Update PgSQL Self JOIN With Custom Values

I'm trying to do an UPDATE with a self join and can't seem to get the correct SQL query.
Before the query, I execute this SQL query to get the values:
SELECT DISTINCT ON (purpose) purpose FROM user_assigned_customer
which returns:
sales_manager
main_contact
representative
administrator
But when I run this query, it overwrites the purpose column on every row:
UPDATE user_assigned_customer SET purpose = (
SELECT 'main_supervisor' AS purpose FROM user_assigned_customer AS assigned_user
LEFT JOIN app_user ON app_user.id = assigned_user.app_user_id
WHERE app_user.role = 'supervisor'
AND user_assigned_customer.purpose IS NULL
AND assigned_user.id = user_assigned_customer.id
)
After running it, the first query now returns only:
main_supervisor
Is there a way to do an UPDATE self join with a custom value?
I think I got it with the help of a friend:
UPDATE user_assigned_customer SET purpose = 'main_supervisor'
FROM user_assigned_customer AS assigned_user
LEFT JOIN app_user ON app_user.id = assigned_user.app_user_id
WHERE app_user.role = 'supervisor'
AND user_assigned_customer.purpose IS NULL
AND assigned_user.id = user_assigned_customer.id

AnalysisException: cannot resolve given input columns:

I am running into this error when trying to select a couple of columns from a temporary table.
pd_df = pd.read_sql('select * from abc.cars limit 10', conn)
df = spark.createDataFrame(pd_df)
df.createOrReplaceTempView("cars_tmp")
df.show()
print('***************')
print("Reading from tmp table")
data = spark.sql('select location_id from cars_tmp')
data.show()
AnalysisException: cannot resolve '`location_id`' given input columns: [cars_tmp.abc.product_id, cars_tmp.abc.location_id ...]
When I select all the columns I get the results. So this is successful:
data = spark.sql('select * from cars_tmp')
data.show()
I tried the queries below, but they fail with the same error:
data = spark.sql('select cars_tmp.abc.location_id from cars_tmp')
data.show()
data = spark.sql('select cars_tmp.location_id from cars_tmp')
data.show()
data = spark.sql('select abc.location_id from cars_tmp')
data.show()
I am running these in Databricks.
Databricks runtime version: 7.0 ("spark_version": "7.0.x-scala2.12")
Apache Spark version: 3.0
Scala: 2.12
Any help will be highly appreciated.
Thanks
The column name as written does not exist in the table; select * from cars_tmp works because you do not reference any column by name.
Please see this answer https://stackoverflow.com/a/64042756/8913402 with the same error handling.
I resolved the issue by listing each column explicitly in the pandas select query, something like this:
pd_df = pd.read_sql('select id, location_id, product_id from abc.cars limit 10', conn)
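A possible alternative, offered as an assumption rather than something verified on this setup: the error message shows the registered column names themselves contain dots (abc.location_id), and Spark SQL can usually resolve such names when they are escaped with backticks. A minimal sketch, in Scala like the other Spark examples here:
// Backticks make Spark treat `abc.location_id` as one literal column name
// rather than as a table or struct qualifier followed by a field.
val data = spark.sql("select `abc.location_id` from cars_tmp")
data.show()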

Spark: strange nullPointerException when extracting data from PostgreSQL

I'm working with PostgreSQL 9.6 and Spark 2.0.0.
I want to create a DataFrame from a PostgreSQL table, as follows:
val query =
"""(
SELECT events.event_facebook_id,
places.placeid, places.likes as placelikes,
artists.facebookId, artists.likes as artistlikes
FROM events
LEFT JOIN eventsplaces on eventsplaces.event_id = events.event_facebook_id
LEFT JOIN places on eventsplaces.event_id = places.facebookid
LEFT JOIN eventsartists on eventsartists.event_id = events.event_facebook_id
LEFT JOIN artists on eventsartists.artistid = artists.facebookid) df"""
The query is valid (if I run it in psql, I get no error), but in Spark, executing the following code throws a NullPointerException:
sqlContext
.read
.format("jdbc")
.options(
Map(
"url" -> claudeDatabaseUrl,
"dbtable" -> query))
.load()
.show()
If, in the query, I replace artists.facebookId with another column such as artists.description (which, unlike facebookId, can be null), the exception disappears.
I find this very strange. Any idea?
You spell facebookId two different ways in your query: artists.facebook[I]d and artists.facebook[i]d.
Try using one spelling consistently, matching the actual column name.
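For illustration, a sketch of the query with one consistent spelling; this assumes the actual column is the lowercase facebookid, since PostgreSQL folds unquoted identifiers to lowercase:
// Only change: artists.facebookId -> artists.facebookid, matching the
// spelling already used in the join conditions.
val query =
"""(
SELECT events.event_facebook_id,
places.placeid, places.likes as placelikes,
artists.facebookid, artists.likes as artistlikes
FROM events
LEFT JOIN eventsplaces on eventsplaces.event_id = events.event_facebook_id
LEFT JOIN places on eventsplaces.event_id = places.facebookid
LEFT JOIN eventsartists on eventsartists.event_id = events.event_facebook_id
LEFT JOIN artists on eventsartists.artistid = artists.facebookid) df"""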

Column name cannot be resolved in SparkSQL join

I'm not sure why this is happening. In PySpark, I read in two dataframes and print out their column names, and they are as expected, but when I do a SQL join I get an error that the column name cannot be resolved given the input columns. I have simplified the join just to get it working, but I will need to add more join conditions, which is why I'm using SQL (I will be adding: "and b.mnvr_bgn < a.idx_trip_id and b.mnvr_end > a.idx_trip_data"). It appears that the column 'device_id' is being renamed to '_col7' in the dataframe mnvr_temp_idx_prev_temp.
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end')
print mnvr_temp_idx_prev.columns
['device_id', 'mnvr_bgn', 'mnvr_end']
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
print raw_data_filtered.columns
['device_id', 'trip_id', 'idx_trip_end']
raw_data_filtered.registerTempTable('raw_data_filtered_temp')
mnvr_temp_idx_prev.registerTempTable('mnvr_temp_idx_prev_temp')
test = sqlContext.sql('SELECT a.device_id, a.idx_trip_end, b.mnvr_bgn, b.mnvr_end \
FROM raw_data_filtered_temp as a \
INNER JOIN mnvr_temp_idx_prev_temp as b \
ON a.device_id = b.device_id')
Traceback (most recent call last): AnalysisException: u"cannot resolve 'b.device_id' given input columns: [_col7, trip_id, device_id, mnvr_end, mnvr_bgn, idx_trip_end]; line 1 pos 237"
Any help is appreciated!
I would recommend renaming the field 'device_id' in at least one of the dataframes. I modified your query a bit and tested it (in Scala). The query below works:
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device_id")
which produces the schema:
[device_id: string, mnvr_bgn: string, mnvr_end: string, device_id: string, trip_id: string, idx_trip_end: string]
Now, 'select *' in the above statement works. But if you try to select 'device_id', you will get the error "Reference 'device_id' is ambiguous". As you can see in the 'test' dataframe definition above, it has two fields with the same name (device_id). To avoid this, rename the field in one of the dataframes:
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end') \
    .withColumnRenamed("device_id", "device")
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
Now use either the dataframe API or sqlContext:
//using dataframes with multiple conditions
val test = mnvr_temp_idx_prev.join(raw_data_filtered,$"device" === $"device_id"
&& $"mnvr_bgn" < $"idx_trip_id","inner")
//in SQL Context
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device and a.idx_trip_id < b.mnvr_bgn")
The queries above will work for your problem. If your data set is large, I would recommend not using '>' or '<' operators in the join condition, as that can degenerate into a cross join, which is a costly operation on large data sets. Use them in a WHERE condition instead.
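A minimal sketch of that last point in Scala, using the renamed device column from above (the column names are assumed from the question):
import spark.implicits._ // for the $"..." column syntax

// Equi-join on the key only, then apply the range predicate as a filter
// (WHERE) instead of folding it into the join condition.
val test = mnvr_temp_idx_prev
  .join(raw_data_filtered, $"device" === $"device_id", "inner")
  .where($"mnvr_bgn" < $"idx_trip_id")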

PGSQL Error Code 42703 column does not exist

I have a PostgreSQL database. I want to read some data from it, but I get an error (column anganridref does not exist) when I execute my command.
Here is my NpgsqlCommand:
cmd.CommandText = "select * from angebot,angebotstatus,anrede where anrid=anganridref and anstaid=anganstaidref";
The column names are right across my three tables, so I don't understand why this error occurs. Can someone explain why it crashes? It's not an upper/lowercase problem.
You are not prefixing your column names in the where clause:
select *
from angebot,
angebotstatus,
anrede
where anrid = anganridref <-- missing table names for the columns
and anstaid = anganstaidref
It's also recommended to use an explicit JOIN instead of the old SQL-89 implicit join syntax:
select *
from angebot
join angebotstatus on angebot.aaaa = angebotstatus.bbbb
join anrede on angebot.aaaa = anrede.bbbb