Exclude files based on name when calling from_catalog - pyspark

I am reading data via
glueContext.create_data_frame.from_catalog(database = "db", table_name = "ta")
from Parquet files in an S3 bucket.
Unfortunately, it seems the bucket contains a non-Parquet file (last_ingest_partition), which causes the following error:
An error occurred while calling o92.getDataFrame. s3://cdh/measurements/ta/last_ingest_partition is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [49, 45, 49, 50]
Is there a possibility to exclude this file from being read?
I have tried something like
glueContext.create_data_frame.from_catalog(database = "db", table_name = "ta", additional_options={"exclusions" : "[\"**last_ingest_partition\"]"})
but this does not work for me.

Here is what I found out and what solved my problem:
When I switched my code to create_dynamic_frame.from_catalog instead of create_data_frame.from_catalog and added a .toDF() afterwards, everything worked fine for me.
For create_dynamic_frame I could also use exclusions as additional options: .create_dynamic_frame.from_catalog(database = "testdb1", table_name = "cxexclude", additional_options={"exclusions": "[\"**{json,parquet}**\"]"})
For the create_data_frame class there are limitations: Spark DataFrame partition filtering doesn't work.
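For reference, a minimal sketch of that dynamic-frame route (the database/table names and exclusion pattern follow the question; the GlueContext setup is the usual job boilerplate):

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read through a DynamicFrame so the "exclusions" option is honored,
# then convert to a plain Spark DataFrame with .toDF()
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="db",
    table_name="ta",
    additional_options={"exclusions": "[\"**last_ingest_partition\"]"})
df = dyf.toDF()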

Take a particular string from a TEXT in postgres

I have one question:
Below is an error message which is stored in a Postgres table column. From this string I would like to extract only a part of the text. Is that possible to do in Postgres?
I would like to see only odoo.exceptions.ValidationError: ('No MRP template was found for MO/10881!', None)'.
In general, all text starting with odoo.exceptions.ValidationError: until the end.
How can I do it? Any ideas or suggestions?
'Traceback (most recent call last):
  File "/opt/src/addons_OCA/queue/queue_job/controllers/main.py", line 101, in runjob
    self._try_perform_job(env, job)
  File "/opt/src/addons_OCA/queue/queue_job/controllers/main.py", line 61, in _try_perform_job
    job.perform()
  File "/opt/src/addons_OCA/queue/queue_job/job.py", line 466, in perform
    self.result = self.func(*tuple(self.args), **self.kwargs)
  File "/opt/src/addons/costs/models/mrp_production.py", line 163, in trigger_calculate_all_costs
    self.calculate_all_costs()
  File "/opt/src/addons/sucosts/models/costline_mixin.py", line 284, in calculate_all_costs
    rec.generate_service_products()
  File "/opt/src/addons/mrp_product_templates/models/mrp_production.py", line 660, in generate_service_products
    MO=self.name)))
odoo.exceptions.ValidationError: ('No MRP template was found for MO/10881!', None)'
You can use the regexp_replace function to search for the particular text and then select the text following it.
The regexp_replace function provides substitution of new text for substrings that match POSIX regular expression patterns. It has the syntax regexp_replace(source, pattern, replacement [, flags ]). The source string is returned unchanged if there is no match to the pattern. ...
In this case it seems you want the text after ValidationError:. Something like:
select regexp_replace (message, '.*ValidationError:(.*)','\1')
from test;
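Should the same extraction ever need to happen client-side instead of in the query, the pattern carries over directly to Python's re module; a minimal sketch (the sample message is abbreviated from the question):

import re

message = ("Traceback (most recent call last):\n"
           "  ...\n"
           "odoo.exceptions.ValidationError: "
           "('No MRP template was found for MO/10881!', None)")

# keep everything from the marker to the end of the string
match = re.search(r"odoo\.exceptions\.ValidationError:.*", message, re.DOTALL)
if match:
    print(match.group(0))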

AWS Glue 3: NameError: name 'date_trunc' is not defined

I built a job in AWS Glue Studio with the version set to Glue 3, i.e. Spark 3 is supported.
The goal is to truncate the date in column "date" to the minute, i.e. with all seconds set to 00.
I found the function date_trunc to be used for that, but I get the error "NameError: name 'date_trunc' is not defined".
The code runs in a custom transform and looks as follows:
def MyTransform (glueContext, dfc) -> DynamicFrameCollection:
    df = dfc.select(list(dfc.keys())[0]).toDF()
    df_rounded = df.withColumn("date_truncated", date_trunc("minute", col("date")))
    dyf_rounded = DynamicFrame.fromDF(df_rounded, glueContext, "rounded")
    return(DynamicFrameCollection({"CustomTransform0": dyf_rounded}, glueContext))
How can I make that function work? I assume I have to import that function, but I don't see a way to do that in the Studio designer.
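The custom transform box accepts ordinary Python, so the missing piece is an import of date_trunc (and col) from pyspark.sql.functions. A minimal sketch of the transform with the import placed inside the function, so the snippet stays self-contained in the Studio designer:

from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection

def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    # make the Spark SQL functions available inside the transform
    from pyspark.sql.functions import date_trunc, col
    df = dfc.select(list(dfc.keys())[0]).toDF()
    # truncate "date" to the minute, i.e. zero out the seconds
    df_rounded = df.withColumn("date_truncated", date_trunc("minute", col("date")))
    dyf_rounded = DynamicFrame.fromDF(df_rounded, glueContext, "rounded")
    return DynamicFrameCollection({"CustomTransform0": dyf_rounded}, glueContext)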

Adding default to binary type in Ecto for Postgres [Elixir]

I'm having a frustrating issue trying to set a default during an Ecto migration.
In the migration the code looks like the following:
def encode(binary) do
  "\\x" <> Base.encode16(binary, case: :lower)
end

Logger.debug("admin.id = #{inspect admin.id}")
Logger.debug("admin.id = #{inspect UUID.string_to_binary!(admin.id)}")
Logger.debug("admin.id = #{inspect encode(admin.id)}")

alter table(:questions) do
  add :owner_id, references(:users, on_delete: :nothing, type: :binary_id), null: false, default: admin.id
end
You can see the attempts I tried above in the logger.
I get the error:
default values are interpolated as UTF-8 strings and cannot contain null bytes. `<<209, 241,
149, 133, 44, 81, 70, 164, 181, 120, 214, 0, 253, 191, 198, 214>>` is invalid. If you want
to write it as a binary, use "\xd1f195852c5146a4b578d600fdbfc6d6", otherwise refer to
PostgreSQL documentation for instructions on how to escape this SQL type
Any help would be great, thanks.
When using :binary_id with Postgres, Ecto expects you to pass UUIDs as strings. Your error message implies you tried to pass it as a binary, so you should first convert it to a string:
add :owner_id, references(:users, on_delete: :nothing, type: :binary_id), null: false, default: UUID.binary_to_string!(admin.id)

how to link python pandas dataframe to mysqlconnector '%s' value

I am trying to pipe a webscraped pandas dataframe into a MySQL table with mysql.connector, but I can't seem to link the df values to the %s variables. The connection is good (I can add individual rows), but it just returns errors when I replace the values with the %s.
cnx = mysql.connector.connect(host = 'ip', user = 'user', passwd = 'pass', database = 'db')
cursor = cnx.cursor()
insert_df = ("INSERT INTO table "
             "(page_1, date_1, record_1, task_1) "
             "VALUES ('%s','%s','%s','%s')")
cursor.executemany(insert_df, df)
cnx.commit()
cnx.close()
This returns "ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
If I add any additional operations it returns "ProgrammingError: Parameters for query must be an Iterable."
I am very new to this, so any help is appreciated.
The workaround for me was to redo my whole process. I used SQLAlchemy instead; the documentation makes this very easy. Message me if you want the code I used.
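For reference, a minimal sketch of that SQLAlchemy route using pandas' own to_sql (the connection details and table name are placeholders; mysql.connector serves as the driver):

import pandas as pd
from sqlalchemy import create_engine

# placeholder credentials mirroring the question
engine = create_engine("mysql+mysqlconnector://user:pass@host/db")

df = pd.DataFrame([
    {"page_1": "p1", "date_1": "2023-01-01", "record_1": "r1", "task_1": "t1"},
])

# pandas generates and executes the INSERT statements itself
df.to_sql("my_table", engine, if_exists="append", index=False)

If you would rather stay with mysql.connector, note that executemany expects an iterable of tuples rather than a DataFrame, so passing list(df.itertuples(index=False, name=None)) as the second argument should work once the single quotes around each %s are removed.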

SQLAlchemy: Problems Migrating to PostgreSQL from SQLite (e.g. sqlalchemy.exc.ProgrammingError:)

I am having difficulties migrating a working script from SQLite to PostgreSQL. I am using SQLAlchemy. When I run the script, it raises the following errors:
raise exc.DBAPIError.instance(statement, parameters, e, connection_invalidated=is_disconnect)
sqlalchemy.exc.ProgrammingError: (ProgrammingError) can't adapt 'INSERT INTO cnn_hot_stocks (datetime, list, ticker, price, change, "pctChange") VALUES (%(datetime)s, %(list)s, %(ticker)s, %(price)s, %(change)s, %(pctChange)s)' {'price': Decimal('7.94'), 'list': 'active', 'datetime': datetime.datetime(2012, 6, 23, 11, 45, 1, 544361), 'pctChange': u'+1.53%', 'ticker': u'BAC', 'change': Decimal('0.12')}
The insert call works well when using the sqlite engine, but I want to use pgsql to utilize the native Decimal type for keeping financial data correct. I copied the script and just changed the db engine to my PostgreSQL server. Any advice on how to troubleshoot this error would be greatly appreciated by this SQLAlchemy newbie... I think I am up a creek on this one! Thanks in advance!
Here are my relevant code segments and table descriptions:
dbstring = "postgresql://postgres:postgres@localhost:5432/algo"
db = create_engine(dbstring)
db.echo = True # Try changing this to True and see what happens
metadata = MetaData(db)
cnn_hot_stocks = Table('cnn_hot_stocks', metadata, autoload=True)
i = cnn_hot_stocks.insert() # running log from cnn hot stocks web-site
def scrape_data():
    try:
        html = urllib2.urlopen('http://money.cnn.com/data/hotstocks/').read()
        markup, errors = tidy_document(html)
        soup = BeautifulSoup(markup,)
    except Exception as e:
        pass
    list_map = { 2 : 'active',
                 3 : 'gainer',
                 4 : 'loser'
               }
    # Iterate over 3 tables on CNN hot stock web-site
    for x in range(2, 5):
        table = soup('table')[x]
        for row in table.findAll('tr')[1:]:
            timestamp = datetime.now()
            col = row.findAll('td')
            ticker = col[0].a.string
            price = Decimal(col[1].span.string)
            change = Decimal(col[2].span.span.string)
            pctChange = col[3].span.span.string
            log_data = {'datetime' : timestamp,
                        'list' : list_map[x],
                        'ticker' : ticker,
                        'price' : price,
                        'change' : change,
                        'pctChange' : pctChange
                       }
            print log_data
            # Commit to DB
            i.execute(log_data)
TABLE:
cnn_hot_stocks = Table('cnn_hot_stocks', metadata, # log of stocks data on cnn hot stocks lists
    Column('datetime', DateTime, primary_key=True),
    Column('list', String), # loser/gainer/active
    Column('ticker', String),
    Column('price', Numeric),
    Column('change', Numeric),
    Column('pctChange', String),
)
My reading of the documentation is that you have to use numeric instead of decimal.
PostgreSQL has no type named decimal (it's an alias for numeric, but not a very full-featured one), and SQLAlchemy seems to expect numeric as the type it can use for abstraction purposes.
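For illustration, a minimal sketch of declaring the money columns with SQLAlchemy's Numeric type (the precision and scale values are illustrative; Numeric is rendered as PostgreSQL's numeric and round-trips Python Decimal values):

from sqlalchemy import Table, Column, MetaData, DateTime, String, Numeric

metadata = MetaData()

cnn_hot_stocks = Table(
    'cnn_hot_stocks', metadata,
    Column('datetime', DateTime, primary_key=True),
    Column('list', String),            # loser/gainer/active
    Column('ticker', String),
    Column('price', Numeric(12, 2)),   # rendered as numeric(12, 2) in PostgreSQL
    Column('change', Numeric(12, 2)),
    Column('pctChange', String),
)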