Exclude files based on name when calling from_catalog - pyspark

I am reading data via
glueContext.create_data_frame.from_catalog(database = "db", table_name = "ta")
from Parquet files in an S3 bucket.
Unfortunately, it seems the bucket contains a non-Parquet file (last_ingest_partition), which causes the following error:
An error occurred while calling o92.getDataFrame. s3://cdh/measurements/ta/last_ingest_partition is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [49, 45, 49, 50]
Is there a possibility to exclude this file from being read?
I have tried something like
glueContext.create_data_frame.from_catalog(database = "db", table_name = "ta", additional_options={"exclusions" : "[\"**last_ingest_partition\"]"})
but this does not work for me.

Here is what I found out and what solved my problem:
When I switched my code to create_dynamic_frame.from_catalog instead of create_data_frame.from_catalog and added a .toDF() afterwards, everything worked fine for me.
For create_dynamic_frame I could also use exclusions as additional options: .create_dynamic_frame.from_catalog(database = "testdb1", table_name = "cxexclude", additional_options={"exclusions": "[\"**{json,parquet}**\"]"})
For the create_data_frame class there are limitations: Spark DataFrame partition filtering doesn't work.
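For reference, a minimal sketch of that dynamic-frame route (the database/table names and exclusion pattern follow the question; the GlueContext setup is the usual job boilerplate):

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read through a DynamicFrame so the "exclusions" option is honored,
# then convert to a plain Spark DataFrame with .toDF()
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="db",
    table_name="ta",
    additional_options={"exclusions": "[\"**last_ingest_partition\"]"})
df = dyf.toDF()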

Take a particular string from a TEXT in postgres

I have one question:
Below is an error message which is stored in a Postgres table column. From this string I would like to extract only a part of the text. Is that possible to do in Postgres?
I would like to see only odoo.exceptions.ValidationError: ('No MRP template was found for MO/10881!', None)'.
In general, all text starting with odoo.exceptions.ValidationError: until the end.
How can I do it? Any ideas or suggestions?
'Traceback (most recent call last):
  File "/opt/src/addons_OCA/queue/queue_job/controllers/main.py", line 101, in runjob
    self._try_perform_job(env, job)
  File "/opt/src/addons_OCA/queue/queue_job/controllers/main.py", line 61, in _try_perform_job
    job.perform()
  File "/opt/src/addons_OCA/queue/queue_job/job.py", line 466, in perform
    self.result = self.func(*tuple(self.args), **self.kwargs)
  File "/opt/src/addons/costs/models/mrp_production.py", line 163, in trigger_calculate_all_costs
    self.calculate_all_costs()
  File "/opt/src/addons/sucosts/models/costline_mixin.py", line 284, in calculate_all_costs
    rec.generate_service_products()
  File "/opt/src/addons/mrp_product_templates/models/mrp_production.py", line 660, in generate_service_products
    MO=self.name)))
odoo.exceptions.ValidationError: ('No MRP template was found for MO/10881!', None)'
You can use the regexp_replace function to search for the particular text and then select the text following it.
The regexp_replace function provides substitution of new text for substrings that match POSIX regular expression patterns. It has the syntax regexp_replace(source, pattern, replacement [, flags ]). The source string is returned unchanged if there is no match to the pattern. ...
In this case it seems you want the text after ValidationError:. Something like:
select regexp_replace (message, '.*ValidationError:(.*)','\1')
from test;
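Should the same extraction ever need to happen client-side instead of in the query, the pattern carries over directly to Python's re module; a minimal sketch (the sample message is abbreviated from the question):

import re

message = ("Traceback (most recent call last):\n"
           "  ...\n"
           "odoo.exceptions.ValidationError: "
           "('No MRP template was found for MO/10881!', None)")

# keep everything from the marker to the end of the string
match = re.search(r"odoo\.exceptions\.ValidationError:.*", message, re.DOTALL)
if match:
    print(match.group(0))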

AWS Glue 3: NameError: name 'date_trunc' is not defined

I built a job in AWS Glue Studio with the version set to Glue 3, i.e. Spark 3 is supported.
The goal is to truncate the date in column "date" to the minute, i.e. with all seconds set to 00.
I found the function date_trunc to be used for that, but I get the error "NameError: name 'date_trunc' is not defined".
The code runs in a custom transform and looks as follows:
def MyTransform (glueContext, dfc) -> DynamicFrameCollection:
    df = dfc.select(list(dfc.keys())[0]).toDF()
    df_rounded = df.withColumn("date_truncated", date_trunc("minute", col("date")))
    dyf_rounded = DynamicFrame.fromDF(df_rounded, glueContext, "rounded")
    return(DynamicFrameCollection({"CustomTransform0": dyf_rounded}, glueContext))
How can I make that function work? I assume I have to import that function, but I don't see a way to do that in the Studio designer.
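The custom transform box accepts ordinary Python, so the missing piece is an import of date_trunc (and col) from pyspark.sql.functions. A minimal sketch of the transform with the import placed inside the function, so the snippet stays self-contained in the Studio designer:

from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection

def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    # make the Spark SQL functions available inside the transform
    from pyspark.sql.functions import date_trunc, col
    df = dfc.select(list(dfc.keys())[0]).toDF()
    # truncate "date" to the minute, i.e. zero out the seconds
    df_rounded = df.withColumn("date_truncated", date_trunc("minute", col("date")))
    dyf_rounded = DynamicFrame.fromDF(df_rounded, glueContext, "rounded")
    return DynamicFrameCollection({"CustomTransform0": dyf_rounded}, glueContext)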

Adding default to binary type in Ecto for Postgres [Elixir]

I'm having a frustrating issue trying to set a default during an Ecto migration.
In the migration the code looks like the following:
def encode(binary) do
  "\\x" <> Base.encode16(binary, case: :lower)
end

Logger.debug("admin.id = #{inspect admin.id}")
Logger.debug("admin.id = #{inspect UUID.string_to_binary!(admin.id)}")
Logger.debug("admin.id = #{inspect encode(admin.id)}")

alter table(:questions) do
  add :owner_id, references(:users, on_delete: :nothing, type: :binary_id), null: false, default: admin.id
end
You can see the attempts I tried above in the logger.
I get the error:
default values are interpolated as UTF-8 strings and cannot contain null bytes. `<<209, 241,
149, 133, 44, 81, 70, 164, 181, 120, 214, 0, 253, 191, 198, 214>>` is invalid. If you want
to write it as a binary, use "\xd1f195852c5146a4b578d600fdbfc6d6", otherwise refer to
PostgreSQL documentation for instructions on how to escape this SQL type
Any help would be great, thanks.
When using :binary_id with Postgres, Ecto expects you to pass UUIDs as strings. Your error message implies you tried to pass it as a binary, so you should first convert it to a string:
add :owner_id, references(:users, on_delete: :nothing, type: :binary_id), null: false, default: UUID.binary_to_string!(admin.id)

how to link python pandas dataframe to mysqlconnector '%s' value

I am trying to pipe a webscraped pandas dataframe into a MySQL table with mysql.connector, but I can't seem to link the df values to the %s variables. The connection is good (I can add individual rows), but it just returns errors when I replace the values with the %s.
cnx = mysql.connector.connect(host = 'ip', user = 'user', passwd = 'pass', database = 'db')
cursor = cnx.cursor()
insert_df = ("INSERT INTO table "
             "(page_1, date_1, record_1, task_1) "
             "VALUES ('%s','%s','%s','%s')")
cursor.executemany(insert_df, df)
cnx.commit()
cnx.close()
This returns "ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
If I add any additional operations it returns "ProgrammingError: Parameters for query must be an Iterable."
I am very new to this, so any help is appreciated.
The workaround for me was to redo my whole process. I used SQLAlchemy instead; the documentation makes this very easy. Message me if you want the code I used.
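For reference, a minimal sketch of that SQLAlchemy route using pandas' own to_sql (the connection details and table name are placeholders; mysql.connector serves as the driver):

import pandas as pd
from sqlalchemy import create_engine

# placeholder credentials mirroring the question
engine = create_engine("mysql+mysqlconnector://user:pass@host/db")

df = pd.DataFrame([
    {"page_1": "p1", "date_1": "2023-01-01", "record_1": "r1", "task_1": "t1"},
])

# pandas generates and executes the INSERT statements itself
df.to_sql("my_table", engine, if_exists="append", index=False)

If you would rather stay with mysql.connector, note that executemany expects an iterable of tuples rather than a DataFrame, so passing list(df.itertuples(index=False, name=None)) as the second argument should work once the single quotes around each %s are removed.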

SQLAlchemy: Problems Migrating to PostgreSQL from SQLite (e.g. sqlalchemy.exc.ProgrammingError:)

I am having difficulties migrating a working script from SQLite to PostgreSQL. I am using SQLAlchemy. When I run the script, it raises the following errors:
raise exc.DBAPIError.instance(statement, parameters, e, connection_invalidated=is_disconnect)
sqlalchemy.exc.ProgrammingError: (ProgrammingError) can't adapt 'INSERT INTO cnn_hot_stocks (datetime, list, ticker, price, change, "pctChange") VALUES (%(datetime)s, %(list)s, %(ticker)s, %(price)s, %(change)s, %(pctChange)s)' {'price': Decimal('7.94'), 'list': 'active', 'datetime': datetime.datetime(2012, 6, 23, 11, 45, 1, 544361), 'pctChange': u'+1.53%', 'ticker': u'BAC', 'change': Decimal('0.12')}
The insert call works well when using the sqlite engine, but I want to use pgsql to utilize the native Decimal type for keeping financial data correct. I copied the script and just changed the db engine to my PostgreSQL server. Any advice on how to troubleshoot this error would be greatly appreciated by this SQLAlchemy newbie... I think I am up a creek on this one! Thanks in advance!
Here are my relevant code segments and table descriptions:
dbstring = "postgresql://postgres:postgres@localhost:5432/algo"
db = create_engine(dbstring)
db.echo = True # Try changing this to True and see what happens
metadata = MetaData(db)
cnn_hot_stocks = Table('cnn_hot_stocks', metadata, autoload=True)
i = cnn_hot_stocks.insert() # running log from cnn hot stocks web-site
def scrape_data():
    try:
        html = urllib2.urlopen('http://money.cnn.com/data/hotstocks/').read()
        markup, errors = tidy_document(html)
        soup = BeautifulSoup(markup,)
    except Exception as e:
        pass
    list_map = { 2 : 'active',
                 3 : 'gainer',
                 4 : 'loser'
               }
    # Iterate over 3 tables on CNN hot stock web-site
    for x in range(2, 5):
        table = soup('table')[x]
        for row in table.findAll('tr')[1:]:
            timestamp = datetime.now()
            col = row.findAll('td')
            ticker = col[0].a.string
            price = Decimal(col[1].span.string)
            change = Decimal(col[2].span.span.string)
            pctChange = col[3].span.span.string
            log_data = {'datetime' : timestamp,
                        'list' : list_map[x],
                        'ticker' : ticker,
                        'price' : price,
                        'change' : change,
                        'pctChange' : pctChange
                       }
            print log_data
            # Commit to DB
            i.execute(log_data)
TABLE:
cnn_hot_stocks = Table('cnn_hot_stocks', metadata, # log of stocks data on cnn hot stocks lists
    Column('datetime', DateTime, primary_key=True),
    Column('list', String), # loser/gainer/active
    Column('ticker', String),
    Column('price', Numeric),
    Column('change', Numeric),
    Column('pctChange', String),
)
My reading of the documentation is that you have to use numeric instead of decimal.
PostgreSQL has no type named decimal (it's an alias for numeric, but not a very full-featured one), and SQLAlchemy seems to expect numeric as the type it can use for abstraction purposes.
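For illustration, a minimal sketch of declaring the money columns with SQLAlchemy's Numeric type (the precision and scale values are illustrative; Numeric is rendered as PostgreSQL's numeric and round-trips Python Decimal values):

from sqlalchemy import Table, Column, MetaData, DateTime, String, Numeric

metadata = MetaData()

cnn_hot_stocks = Table(
    'cnn_hot_stocks', metadata,
    Column('datetime', DateTime, primary_key=True),
    Column('list', String),            # loser/gainer/active
    Column('ticker', String),
    Column('price', Numeric(12, 2)),   # rendered as numeric(12, 2) in PostgreSQL
    Column('change', Numeric(12, 2)),
    Column('pctChange', String),
)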