PySpark: how to get the partition name in query results?

I would like to retrieve the partition name along with my query results.
Suppose I have a partitioned dataset like:
dataset/foo/
├─ key=value1
├─ key=value2
└─ key=value3
I can do this query
results = session.read.parquet('dataset/foo/key=value[12]') \
.select(['BAR']) \
.where('BAZ < 10')
Once I have done this, how can I tell which partition each result came from?
As it stands, I only get values from the BAR column.
Thanks for your help.

Include the key column in your select statement!
# read the dataset from its root directory: it is partitioned, so key becomes a column we can filter on
from pyspark.sql.functions import col
results = session.read.parquet('dataset/foo/') \
.filter((col("key") == "value1") & (col("BAZ") < 10)) \
.select(['BAR','key'])
If you also want to add the originating file name to every record, use input_file_name():
from pyspark.sql.functions import col, input_file_name
results = session.read.parquet('dataset/foo/') \
.filter((col("key") == "value1") & (col("BAZ") < 10)) \
.select(['BAR','key']) \
.withColumn("input_file", input_file_name())
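If only the partition value is needed rather than the full file name, it can be parsed out of the path that input_file_name() returns. The following is a minimal sketch in plain Python, assuming Hive-style key=value directory names as in the layout from the question. partition_value is a hypothetical helper, not part of PySpark, and the sample path is made up for illustration.

```python
import re


def partition_value(path, key):
    """Return the value of a Hive-style `key=value` path segment, or None.

    Hypothetical helper for illustration; `path` is assumed to look like
    the strings produced by input_file_name().
    """
    # Match "<key>=<value>" at the start of the path or right after a slash.
    match = re.search(rf"(?:^|/){re.escape(key)}=([^/]+)", path)
    return match.group(1) if match else None


# Made-up example path following the dataset/foo/key=... layout.
sample = "dataset/foo/key=value1/part-00000.parquet"
origin = partition_value(sample, "key")
```

The same regular expression could probably be passed to pyspark.sql.functions.regexp_extract to do the extraction inside Spark, although selecting the key column directly, as shown in the answer, is simpler when the dataset is read from its root.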

Related

Incorrect value from count in Postgres on enum rows

I have the following query written in SQLAlchemy:
_stats = db.session.query(
func.count(a.status==s.STATUS_IN_PROGRESS).label("in_progress_"),
func.count(a.status==s.STATUS_COMPLETED).label("completed_"),
) \
.filter(a.uuid==some_uuid) \
.first()
This returns (7, 7), which is incorrect; it should return (7, 0), i.e. in_progress_ = 7, completed_ = 0.
When I do this in two queries I get the correct values:
_stats_in_progress = db.session.query(
func.count(a.status==s.STATUS_IN_PROGRESS).label("in_progress_"),
) \
.filter(a.uuid==some_uuid) \
.first()
_stats_in_complete = db.session.query(
func.count(a.status==s.STATUS_COMPLETED).label("completed_"),
) \
.filter(a.uuid==some_uuid) \
.first()
The corresponding SQL also does not work when using the two counts:
SELECT count(a.status = 'IN_PROGRESS') AS in_progress_,
count(a.status = 'STATUS_COMPLETED') AS completed_
FROM a
WHERE a.uuid = '9a353554a6874ebcbf0fe88eb8223d33'
This returns 7, 7 too, while if I run the query with just one count I get the correct value.
Does anyone know what I'm doing wrong?
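A likely explanation is that count(expression) counts every row where the expression is not NULL, and a comparison that evaluates to false is still not NULL. The sketch below reproduces the (7, 7) result and shows one common fix, a CASE expression that yields NULL for non-matching rows. It uses Python's built-in sqlite3 as a stand-in for Postgres, with a made-up table, uuid, and a 'COMPLETED' status value, so treat it as an illustration of the counting rule rather than the exact schema from the question.

```python
import sqlite3

# In-memory stand-in for the real table; names and values are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (uuid TEXT, status TEXT)")
conn.executemany("INSERT INTO a VALUES (?, ?)", [("u1", "IN_PROGRESS")] * 7)

# count(<comparison>) counts rows where the comparison is non-NULL,
# so rows where it is false are counted as well.
naive = conn.execute(
    "SELECT count(status = 'IN_PROGRESS'), count(status = 'COMPLETED') "
    "FROM a WHERE uuid = 'u1'"
).fetchone()

# CASE without ELSE yields NULL for non-matching rows, which count() skips.
fixed = conn.execute(
    "SELECT count(CASE WHEN status = 'IN_PROGRESS' THEN 1 END), "
    "count(CASE WHEN status = 'COMPLETED' THEN 1 END) "
    "FROM a WHERE uuid = 'u1'"
).fetchone()

print(naive, fixed)
```

In PostgreSQL the same result can also be written with a FILTER (WHERE ...) clause on the aggregate.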

Delete in Apache Hudi - Glue Job

I have to build a Glue Job for updating and deleting old rows in an Athena table.
When I run my job for deleting, it returns an error:
AnalysisException: 'Unable to infer schema for Parquet. It must be specified manually.;'
My Glue Job:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table", transformation_ctx = "datasource0")
datasource1 = glueContext.create_dynamic_frame.from_catalog(database = "test-database", table_name = "test_table_output", transformation_ctx = "datasource1")
datasource0.toDF().createOrReplaceTempView("view_dyf")
datasource1.toDF().createOrReplaceTempView("view_dyf_output")
ds = spark.sql("SELECT * FROM view_dyf_output where id in (select id from view_dyf where op like 'D')")
hudi_delete_options = {
'hoodie.table.name': 'test_table_output',
'hoodie.datasource.write.recordkey.field': 'id',
'hoodie.datasource.write.table.name': 'test_table_output',
'hoodie.datasource.write.operation': 'delete',
'hoodie.datasource.write.precombine.field': 'name',
'hoodie.upsert.shuffle.parallelism': 1,
'hoodie.insert.shuffle.parallelism': 1
}
from pyspark.sql.functions import lit
deletes = list(map(lambda row: (row[0], row[1]), ds.collect()))
df = spark.sparkContext.parallelize(deletes).toDF(['id']).withColumn('name', lit(0.0))
df.write.format("hudi"). \
options(**hudi_delete_options). \
mode("append"). \
save('s3://data/test-output/')
roAfterDeleteViewDF = spark. \
read. \
format("hudi"). \
load("s3://data/test-output/")
roAfterDeleteViewDF.registerTempTable("test_table_output")
spark.sql("SELECT * FROM view_dyf_output where id in (select distinct id from view_dyf where op like 'D')").count()
I have two data sources: the first is the old Athena table, where data has to be updated or deleted, and the second is a table that receives the new updated or deleted data.
In ds I have selected all rows that have to be deleted from the old table.
op stands for operation: 'D' for delete, 'U' for update.
Does anyone know what I am missing here?
The value for hoodie.datasource.write.operation is invalid in your code; the supported write operations are upsert, insert, and bulk_insert. Check the Hudi docs for your version.
Also, what is your intention when deleting records: a hard delete or a soft delete?
For a hard delete, you have to provide
{'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.EmptyHoodieRecordPayload'}
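Putting the two points of this answer together, the write options for a hard delete might look like the sketch below: an upsert operation combined with the empty payload class. hard_delete_options is a hypothetical helper introduced only for illustration; the option keys and values are the ones that appear in the question and in this answer, and the result has not been checked against a specific Hudi or Glue version.

```python
def hard_delete_options(table_name, record_key, precombine_field):
    """Build Hudi write options for a hard delete (illustrative helper only)."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Use one of the operations listed as supported in the answer ...
        "hoodie.datasource.write.operation": "upsert",
        # ... and let the empty payload class mark matching records as deleted.
        "hoodie.datasource.write.payload.class":
            "org.apache.hudi.common.model.EmptyHoodieRecordPayload",
    }


# Values taken from the question's configuration.
options = hard_delete_options("test_table_output", "id", "name")
```

The dictionary would then be passed to the writer in the same way as in the question, via options(**options) on df.write.format("hudi").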

Rebuild sphinx index fail

We have 4 Sphinx indexes built using data from one table. All indexes have the same source settings except that they take different documents. We use checks like mod(id, 4) = <index number> to distribute documents and document attributes between the indexes.
Question: One of the four indexes (always the same one) fails to rebuild almost every time we rebuild the indexes. The other indexes never have this issue and are rebuilt correctly.
We have partitioned the documents and attribute tables. For example, this is how the documents table is partitioned:
PARTITION BY HASH(mod(id, 4))(
PARTITION `p0` COMMENT '',
PARTITION `p1` COMMENT '',
PARTITION `p2` COMMENT '',
PARTITION `p3` COMMENT ''
);
We think that the indexer hangs after it has received all documents but before it starts receiving attributes. We can see this when we check the sessions on the MySQL server.
The index which fails to rebuild uses the mod(id, 4) = 0 condition.
We use Sphinx 2.0.4-release on Ubuntu 12.04.02 LTS 64-bit.
Data source config
source ble_job_2 : ble_job
{
sql_query = select job_notice.id as id, \
body, title, source, company, \
UNIX_TIMESTAMP(insertDate) as date, \
substring(company, 1, 1) as companyletter, \
job_notice.locationCountry as country, \
location_us_state.stateName as state, \
0 as expired, \
clusterId, \
groupCity, \
groupCityAttr, \
job_notice.cityLat as citylat, \
job_notice.cityLng as citylng, \
job_notice.zipLat as ziplat, \
job_notice.zipLng as ziplng, \
feedId, job_notice.rating as rating, \
job_notice.cityId as cityid \
from job_notice \
left join location_us_state on job_notice.locationState = location_us_state.stateCode \
where job_notice.status != 'expired' \
and mod(job_notice.id, 4) = 1
sql_attr_multi = uint attr from query; \
select noticeId, attributeId as attr from job_notice_attribute where mod(noticeId, 4) = 1
} # source ble_job_2
Index config
index ble_job_2
{
type = plain
source = ble_job_2
path = /var/lib/sphinxsearch/data/ble_job_2
docinfo = extern
mlock = 0
morphology = none
stopwords = /etc/sphinxsearch/stopwords/blockwords.txt
min_word_len = 1
charset_type = utf-8
enable_star = 0
html_strip = 0
} # index_ble_job_2
Any help would be greatly appreciated.
Warm regards.
Luckily, we have fixed the issue.
We applied the ranged query setup, and this made the index rebuild stable. I think this is because Sphinx now runs several queries, each returning a limited, relatively small set of results. This allows MySQL to complete each query normally and send all the results back to Sphinx.
The same issue is described on the Sphinx forum in the thread "Indexer Hangs & MySQL Query Sleeps".
The changes in the data source config are:
sql_query_range = SELECT MIN(id),MAX(id) FROM job_notice where mod(job_notice.id, 4) = 1
sql_range_step = 200000
sql_query = select job_notice.id as id, \
...
and mod(job_notice.id, 4) = 1 and job_notice.id >= $start AND job_notice.id <= $end
Please note that no ranges should be applied to the sql_attr_multi query; see "Bad query in Sphinx MVA".
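To make the effect of sql_query_range and sql_range_step concrete, here is a small Python sketch of how the ID space gets split into $start/$end windows. It is a simplified model based on the behaviour described in the Sphinx documentation for ranged queries, not the indexer's actual code, and range_windows is a hypothetical name.

```python
def range_windows(min_id, max_id, step):
    """Yield inclusive (start, end) ID windows of at most `step` IDs each."""
    start = min_id
    while start <= max_id:
        # The last window is clipped to max_id.
        end = min(start + step - 1, max_id)
        yield start, end
        start = end + 1


# IDs 1..2345 with a step of 1000 are fetched in three separate queries.
windows = list(range_windows(1, 2345, 1000))
print(windows)  # [(1, 1000), (1001, 2000), (2001, 2345)]
```

With sql_range_step = 200000, each execution of sql_query therefore covers at most 200000 consecutive IDs, of which only those matching the mod(job_notice.id, 4) = 1 condition are returned.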

Sphinx weird behavior

I am having a strange problem creating an index on Sphinx 2.0.5-id64-release (r3308).
/etc/sphinx/sphinx.conf
source keywords
{
// ..
sql_query = \
SELECT keywords.lid, keywords.keyword FROM keywords_sites \
LEFT JOIN keywords ON keywords_sites.kid = keywords.kid \
GROUP BY keywords_sites.kid \
sql_attr_uint = lid
sql_field_string = keyword
// ...
}
I get this warning:
WARNING: attribute 'lid' not found - IGNORING
But when I change the query to:
sql_query = \
SELECT 1, keywords.lid, keywords.keyword FROM keywords_sites \
LEFT JOIN keywords ON keywords_sites.kid = keywords.kid \
GROUP BY keywords_sites.kid \
I don't get any warnings. Why does this happen?
The first column from the sql_query is ALWAYS used as the document_id.
The document_id cannot be defined as an attribute.
If you want to store the primary key in an attribute as well, you need to include it twice in the query.

Sphinx + Postgres + uuid issues

I have a sql_query for a source defined like so:
sql_query = SELECT \
criteria.item_uuid, \
criteria.user_id, \
criteria.color, \
criteria.selection, \
criteria.item_id, \
home.state, \
item.* \
FROM criteria \
INNER JOIN item USING (item_uuid) \
INNER JOIN user_info home USING (user_id) \
WHERE criteria.item_uuid IS NOT NULL
And then an index:
index csearch {
source = csearch
path = /usr/local/sphinx/var/data/csearch
docinfo = extern
enable_star = 1
min_prefix_len = 0
min_infix_len = 0
morphology = stem_en
}
But when I run indexer --rotate csearch I get:
indexing index 'csearch'...
WARNING: zero/NULL document_id, skipping
The idea is that the item_uuid column is the identifier I want, based on some combination of the other columns. The item_uuid column is a uuid type in postgres: perhaps sphinx does not support this? Anyway, any ideas here would be greatly appreciated.
Read the docs: document IDs must be unique, unsigned, non-zero integers.
http://www.sphx.org/docs/manual-1.10.html#data-restrictions
You could try using SELECT row_number() OVER (), uuid, etc. (in Postgres, row_number() is a window function and needs an OVER clause).
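As a rough illustration of that suggestion, the sketch below generates an integer document ID as the first column and keeps the uuid as an ordinary second column. It runs against Python's built-in sqlite3 (which needs SQLite 3.25 or newer for window functions) as a stand-in for Postgres, with a simplified, made-up criteria table, so it shows the shape of the query rather than a drop-in sql_query.

```python
import sqlite3
import uuid

# Simplified stand-in for the criteria table; the columns are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE criteria (item_uuid TEXT, color TEXT)")
rows = [(str(uuid.uuid4()), color) for color in ("red", "green", "blue")]
conn.executemany("INSERT INTO criteria VALUES (?, ?)", rows)

# First column: a unique, non-zero integer usable as the document ID.
# The uuid follows as a regular column.
result = conn.execute(
    "SELECT row_number() OVER (ORDER BY item_uuid) AS doc_id, item_uuid, color "
    "FROM criteria ORDER BY doc_id"
).fetchall()

doc_ids = [row[0] for row in result]
```

Note that IDs generated this way depend on the row order at indexing time, so they are not guaranteed to stay attached to the same uuid across rebuilds.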