How to get strings separated by commas from a list to a query in PySpark?

I want to generate a SQL query using a list in PySpark:
list = ["hi#gmail.com", "goodbye#gmail.com"]
query = "SELECT * FROM table WHERE email IN (" + list + ")"
This is my desired output:
query
SELECT * FROM table WHERE email IN ("hi#gmail.com", "goodbye#gmail.com")
Instead I'm getting: TypeError: cannot concatenate 'str' and 'list' objects
Can anyone help me achieve this? Thanks

If someone's having the same issue, I found that you can use the following (with the list renamed to emails, since list shadows the Python built-in):
"'" + "','".join(map(str, emails)) + "'"
and you will have the following output:
SELECT * FROM table WHERE email IN ('hi#gmail.com','goodbye#gmail.com')
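A minimal runnable sketch of that trick, assuming the values from the question:
# Sample values from the question; renamed from `list` so the
# built-in type isn't shadowed.
emails = ["hi#gmail.com", "goodbye#gmail.com"]

# Quote each address, join with commas, and splice into the query.
in_clause = "'" + "','".join(map(str, emails)) + "'"
query = "SELECT * FROM table WHERE email IN ({})".format(in_clause)
print(query)
# SELECT * FROM table WHERE email IN ('hi#gmail.com','goodbye#gmail.com')
Note that this builds the SQL string by hand, so it is only safe for trusted values; the isin() approach below avoids the manual quoting entirely.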

Try this:
Dataframe based approach -
from pyspark.sql.functions import col

df = spark.createDataFrame([(1, "hi#gmail.com"), (2, "goodbye#gmail.com"), (3, "abc#gmail.com"), (4, "xyz#gmail.com")], ['id', 'email_id'])
email_filter_list = ["hi#gmail.com", "goodbye#gmail.com"]
df.where(col('email_id').isin(email_filter_list)).show()
Spark SQL based approach -
df.createOrReplaceTempView('t1')
sql_filter = ','.join(["'" + i + "'" for i in email_filter_list])
spark.sql("SELECT * FROM t1 WHERE email_id IN ({})".format(sql_filter)).show()

Pyspark join on multiple aliased table columns

Python doesn't like the ampersand below.
I get the error: & is not a supported operation for types str and str. Please review your code.
Any idea how to get this right? I've never tried to join on more than one column for aliased tables. Thx!!
df_initial_sample = df_crm.alias('crm').join(df_cngpt.alias('cng'), on= (("crm.id=cng.id") & ("crm.cpid = cng.cpid")), how = "inner")
Try joining on a list of column names instead -
df_initial_sample = df_crm.alias('crm').join(df_cngpt.alias('cng'), on=["id", "cpid"], how="inner")
Your join condition is overcomplicated. It can be as simple as this:
df_initial_sample = df_crm.join(df_cngpt, on=['id', 'cpid'], how = 'inner')
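If you do need the aliases (say, to disambiguate other columns later), the join condition must be built from Column expressions rather than strings: & is defined for Columns, not for Python strings. A sketch, assuming the same df_crm and df_cngpt frames:
from pyspark.sql.functions import col

# Combine two Column equality expressions with &; each comparison must
# be parenthesized because & binds tighter than ==.
df_initial_sample = df_crm.alias('crm').join(
    df_cngpt.alias('cng'),
    on=(col('crm.id') == col('cng.id')) & (col('crm.cpid') == col('cng.cpid')),
    how='inner'
)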

Springboot JPA Cast numeric substring to string

I have a JPA/Spring Boot application backed by a Postgres database. I need to get records where part of a column's value equals a substring passed back to the server.
For example:
Select * from dp1_attachments where TRIM(RIGHT(dp1_submit_date_dp1_number::text, 5)) ='00007'
This query works in pgAdmin, but not in the JPA @Query statement.
@Query("SELECT a.attachmentsFolder as attachmentsFolder, a.attachmentNumber as attachmentNumber, a.attachmentName as attachmentName, a.dp1SubmitDateDp1Number as dp1SubmitDateDp1Number, a.attachmentType as attachmentType, a.attachmentDate as attachmentDate, a.attachmentBy as attachmentBy "
+ "FROM DP1Attachments a WHERE TRIM(SUBSTRING(a.dp1SubmitDateDp1Number::text, 5)) = :dp1Number")
I've also tried CASTing the column like this:
@Query("SELECT a.attachmentsFolder as attachmentsFolder, a.attachmentNumber as attachmentNumber, a.attachmentName as attachmentName, a.dp1SubmitDateDp1Number as dp1SubmitDateDp1Number, a.attachmentType as attachmentType, a.attachmentDate as attachmentDate, a.attachmentBy as attachmentBy "
+ "FROM DP1Attachments a WHERE TRIM(SUBSTRING(CAST(a.dp1SubmitDateDp1Number as string, 5 ))) = :dp1Number")
but the application won't even run, and returns an error that the query isn't valid.
If I make no attempt to cast it, I get an error that function pg_catalog.substring(numeric, integer) does not exist
UPDATE
I've also tried creating a native query instead but that also doesn't seem to work.
List<DP1AttachmentsProjection> results = em.createNativeQuery("Select * FROM dp1_attachments WHERE TRIM(RIGHT(CAST(dp1_submit_date_dp1_number as varchar),5)) =" + dp1Number).getResultList();
In place of varchar I have also tried string and text.
Errors come back similar to ERROR: operator does not exist: text = integer. It's like the CAST is being ignored, and I'm not sure why.
I also tried the following as a native query:
em.createNativeQuery("Select * FROM dp1_attachments WHERE TRIM(RIGHT(dp1_submit_date_dp1_number::varchar),5)) =" + dp1Number).getResultList();
and get ERROR: syntax error at or near ":"
FINAL SOLUTION
Thanks to @Nenad J I altered the query to get the final working solution:
@Query(value = "SELECT a.attachments_Folder as attachmentsFolder, a.attachment_Number as attachmentNumber, a.attachment_Name as attachmentName, a.dp1_Submit_Date_Dp1_Number as dp1SubmitDateDp1Number, a.attachment_Type as attachmentType, a.attachment_Date as attachmentDate, a.attachment_By as attachmentBy FROM DP1_Attachments a WHERE TRIM(RIGHT(CAST(a.dp1_Submit_Date_Dp1_Number as varchar), 5)) = :dp1Number", nativeQuery = true)
SUBSTRING already returns a string, so substring(a.dp1SubmitDateDp1Number, 5) yields a string and no cast is needed:
@Query("SELECT * FROM DP1Attachments a WHERE TRIM(SUBSTRING(a.dp1SubmitDateDp1Number, 5)) = :dp1Number")
But in this case I recommend using a native query, like this:
Put this code in your attachment repository.
@Query(value="SELECT * FROM DP1Attachments AS a WHERE TRIM(SUBSTRING(a.dp1SubmitDateDp1Number, 5)) = :dp1Number", nativeQuery=true)
Be careful with the column names.

Question mark in Laravel toSql()

I get this output when trying to use toSql() to debug my queries.
Laravel code:
$services = Service::latest()->where('status', '=', '0');
Output SQL:
"select * from `services` where `status` = ? order by `created_at` desc"
How can I get the full query without the ? placeholders? Thanks!
To view the data that will be substituted into the query string, you can call the getBindings() function on the query, as below.
$query = User::first()->jobs();
dd($query->toSql(), $query->getBindings());
The array of bindings gets substituted in the same order the ? placeholders appear in the SQL statement.

Column name cannot be resolved in SparkSQL join

I'm not sure why this is happening. In PySpark, I read in two dataframes and print out their column names, and they are as expected, but when I do a SQL join I get an error that the column name cannot be resolved given the inputs. I have simplified the join just to get it to work, but I will need to add more join conditions, which is why I'm using SQL (I will be adding: "and b.mnvr_bgn < a.idx_trip_id and b.mnvr_end > a.idx_trip_data"). It appears that the column 'device_id' is being renamed to '_col7' in the df mnvr_temp_idx_prev_temp.
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end')
print mnvr_temp_idx_prev.columns
['device_id', 'mnvr_bgn', 'mnvr_end']
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
print raw_data_filtered.columns
['device_id', 'trip_id', 'idx_trip_end']
raw_data_filtered.registerTempTable('raw_data_filtered_temp')
mnvr_temp_idx_prev.registerTempTable('mnvr_temp_idx_prev_temp')
test = sqlContext.sql('SELECT a.device_id, a.idx_trip_end, b.mnvr_bgn, b.mnvr_end \
FROM raw_data_filtered_temp as a \
INNER JOIN mnvr_temp_idx_prev_temp as b \
ON a.device_id = b.device_id')
Traceback (most recent call last): AnalysisException: u"cannot resolve 'b.device_id' given input columns: [_col7, trip_id, device_id, mnvr_end, mnvr_bgn, idx_trip_end]; line 1 pos 237"
Any help is appreciated!
I would recommend renaming the field 'device_id' in at least one of the data frames. I modified your query a bit and tested it (in Scala). The query below works:
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device_id")
[device_id: string, mnvr_bgn: string, mnvr_end: string, device_id: string, trip_id: string, idx_trip_end: string]
Now if you do a 'select *' in the above statement, it will work. But if you try to select 'device_id', you will get the error "Reference 'device_id' is ambiguous". As you can see in the 'test' data frame definition above, it has two fields with the same name (device_id). To avoid this, I recommend changing the field name in one of the dataframes.
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end') \
    .withColumnRenamed("device_id", "device")
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
Now use either the DataFrame API or the sqlContext:
//using dataframes with multiple conditions
val test = mnvr_temp_idx_prev.join(raw_data_filtered, $"device" === $"device_id"
    && $"mnvr_bgn" < $"idx_trip_id", "inner")
//in SQL Context
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device and a. idx_trip_id < b.mnvr_bgn")
The above queries will work for your problem. But if your data set is large, I would recommend not using '>' or '<' operators as the only join condition: without an equality predicate, Spark has to fall back to a cross join, which is a costly operation on large data. Keep the equality in the ON clause and put the range comparisons in the WHERE condition.
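A PySpark sketch of that advice, assuming the renamed frames from above (idx_trip_id comes from the question's extra condition and is a placeholder here, since it is not in the simplified column list):
from pyspark.sql.functions import col

# Equi-join on the renamed key, then apply the range predicate as a
# filter; `idx_trip_id` is hypothetical, per the question's condition.
test = mnvr_temp_idx_prev \
    .join(raw_data_filtered, col('device') == col('device_id'), 'inner') \
    .where(col('mnvr_bgn') < col('idx_trip_id'))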

Birt Report Multiple Input Parameter

My problem is the same as in this question. I tried the solution given there, but it only works if all the parameters have values; when a parameter has no value, the BIRT report outputs this error:
The following items have errors:
Table (id = 4):
+ Can not load the report query: 4. Errors occurred when generating the report document for the report element with ID 4. (Element ID:4)
Can you guys help me?
Thanks
In that example, when the parameter has no value, the query is not modified from what you put in the query text box. You could also do something like:
1 - Put in a query like select * from mytable
2 - Then add a beforeOpen script like:
if( params["myparameterval"].value != null ){
    // quote the value if col1 is a string column
    this.queryText = this.queryText + " where col1 = '" + params["myparameterval"].value + "'";
} else {
    this.queryText = this.queryText + " where col1 = 'hardcodedvalue'";
}