I am new to PySpark and would like some advice. I am using https://pypi.org/project/user-agents/ with a PySpark DataFrame.
I would like to apply the user_agents library to a PySpark DataFrame column (user_agent).
Example user_agent record: user_agent='Mozilla/5.0 (Linux; Android 8.1.0; SM-T580) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.93 Safari/537.36'
My code:
from user_agents import parse
df = df.withColumn("browser", parse((df.user_agent.cast("string"))))
The error:
TypeError: unhashable type: 'Column'
I did it this way instead, but it takes too much time (how can I optimize it?):
from user_agents import parse

vals_init = [("", "", "", "", "")]
columns = ['browser', 'os', 'device-brand', 'device-family', 'device-model']
df1 = spark.createDataFrame(vals_init, columns)

for row in df.rdd.collect():
    user_agent = parse(str(row.user_agent))
    print(user_agent.browser[0])     # browser family
    print(user_agent.os[0])          # OS family
    print(user_agent.device.family)  # returns 'iPhone'
    print(user_agent.device.brand)   # returns 'Apple'
    print(user_agent.device.model)   # returns 'iPhone'
    newRow = spark.createDataFrame([(user_agent.browser[0],
                                     user_agent.os[0],
                                     user_agent.device.brand,
                                     user_agent.device.family,
                                     user_agent.device.model)], columns)
    df1 = df1.union(newRow)
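One common way to avoid the collect loop entirely is to wrap the parser in a Spark UDF, so the parsing runs row by row on the executors and comes back as ordinary columns. A minimal sketch, assuming the user_agents package is installed on every worker (column names here are illustrative):

from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, StringType
from user_agents import parse

ua_schema = StructType([
    StructField("browser", StringType()),
    StructField("os", StringType()),
    StructField("device_brand", StringType()),
    StructField("device_family", StringType()),
    StructField("device_model", StringType()),
])

@udf(returnType=ua_schema)
def parse_ua(ua_string):
    # parse() is a plain Python function, so it has to run inside a UDF;
    # calling it on a Column directly is what caused the TypeError above.
    ua = parse(ua_string or "")
    return (ua.browser.family, ua.os.family,
            ua.device.brand, ua.device.family, ua.device.model)

df = (df.withColumn("ua", parse_ua(df.user_agent.cast("string")))
        .select("*", "ua.*")
        .drop("ua"))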
While not an open-source tool, when it comes to analyzing User-Agent strings (or, more generally, HTTP requests), WURFL is the de facto standard tool used by the big players (Akamai, AWS, Google, ...) who require more properties and finer-grained analysis of their HTTP traffic.
Here is a blog post that explains how WURFL and PySpark can be integrated. It would be awesome if the author could reference this in the article.
Disclosure: I'm the WURFL inventor
I built a job in AWS Glue Studio with the version set to Glue 3, which means Spark 3 is supported.
The goal is to truncate the timestamps in the column "date" to the minute, i.e. set all seconds to 00.
I found the function date_trunc for that, but I get the error "NameError: name 'date_trunc' is not defined".
The code runs in a custom transform and looks as follows:
def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    df = dfc.select(list(dfc.keys())[0]).toDF()
    df_rounded = df.withColumn("date_truncated", date_trunc("minute", col("date")))
    dyf_rounded = DynamicFrame.fromDF(df_rounded, glueContext, "rounded")
    return DynamicFrameCollection({"CustomTransform0": dyf_rounded}, glueContext)
How can I make that function work? I assume I have to import the function, but I don't see a way to do that in the Studio designer.
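In plain PySpark, date_trunc and col live in pyspark.sql.functions, so adding the imports at the top of the custom transform script should be enough; in Glue Studio the awsglue classes are usually available already, so the pyspark.sql.functions import is the key line. A sketch of the transform with the imports added (untested in the Studio designer):

from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection
from pyspark.sql.functions import col, date_trunc

def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    df = dfc.select(list(dfc.keys())[0]).toDF()
    # Truncate the timestamp to the minute, i.e. zero out the seconds.
    df_rounded = df.withColumn("date_truncated", date_trunc("minute", col("date")))
    dyf_rounded = DynamicFrame.fromDF(df_rounded, glueContext, "rounded")
    return DynamicFrameCollection({"CustomTransform0": dyf_rounded}, glueContext)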
I'm trying to update a table in a PostgreSQL database, passing a dynamic value, using Doobie (functional JDBC). When executing the SQL statement I get the error below. Any help would be appreciated.
Code
Working code
sql"""UPDATE layout_lll
|SET runtime_params = 'testing string'
|WHERE run_id = '123-ksdjf-oreiwlds-9dadssls-kolb'
|""".stripMargin.update.quick.unsafeRunSync
Not working code
val abcRunTimeParams="testing string"
val runID="123-ksdjf-oreiwlds-9dadssls-kolb"
sql"""UPDATE layout_lll
|SET runtime_params = '${abcRunTimeParams}'
|WHERE run_id = '$runID'
|""".stripMargin.update.quick.unsafeRunSync
Error
Exception in thread "main" org.postgresql.util.PSQLException: The column index is out of range: 3, number of columns: 2.
Remove the ' quotes - Doobie makes sure they aren't needed. Doobie (and virtually any other DB library) uses parameterized queries, like:
UPDATE layout_lll
SET runtime_params = ?
WHERE run_id = ?
where ? will be replaced by parameters passed later on. This:
makes SQL injection impossible
helps spotting errors in SQL syntax
When you want to pass a parameter, the ' is part of the value passed, not part of the parameterized query, and Doobie (or the JDBC driver) will "add" it for you. The variables you pass there are processed by Doobie; they aren't just pasted in like in normal string interpolation.
TL;DR Try running
val abcRunTimeParams="testing string"
val runID="123-ksdjf-oreiwlds-9dadssls-kolb"
sql"""UPDATE layout_lll
|SET runtime_params = ${abcRunTimeParams}
|WHERE run_id = $runID
|""".stripMargin.update.quick.unsafeRunSync
I am trying to pipe a web-scraped pandas DataFrame into a MySQL table with mysql.connector, but I can't seem to link the DataFrame values to the %s placeholders. The connection is good (I can add individual rows), but it just returns errors when I replace the values with the %s.
cnx = mysql.connector.connect(host = 'ip', user = 'user', passwd = 'pass', database = 'db')
cursor = cnx.cursor()
insert_df = ("INSERT INTO table "
             "(page_1, date_1, record_1, task_1) "
             "VALUES ('%s','%s','%s','%s')")
cursor.executemany(insert_df, df)
cnx.commit()
cnx.close()
This returns "ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
If I add any additional operations, it returns "ProgrammingError: Parameters for query must be an Iterable."
I am very new to this, so any help is appreciated.
The workaround for me was to redo my whole process: I used SQLAlchemy instead, and the documentation makes this very easy. Message me if you want the code I used.
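For reference, a minimal sketch of that SQLAlchemy route, assuming the pymysql driver is installed, that df is the scraped DataFrame with columns matching the table, and with placeholder credentials and table name:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; swap in the real host, user, password and database.
engine = create_engine("mysql+pymysql://user:pass@ip/db")

# Appends every row of the DataFrame to the table; column names must match.
df.to_sql("my_table", engine, if_exists="append", index=False)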
I am new to PySpark and have purchased a book to enhance my PySpark skills. I am stuck while using a function.
Function
def filterDuplicates( (userID, ratings) ):
    (movie1, rating1) = ratings[0]
    (movie2, rating2) = ratings[1]
    return movie1 < movie2
I am getting an error because of the two consecutive parentheses. The step basically gets an RDD, which is a list of tuples as shown below:
[(196, ((242, 3.0), (242, 3.0))), (196, ((242, 3.0), (393, 4.0)))]
The final result should contain only distinct (movie ID, rating) pairs for each viewer.
So in the example above, 196 is the viewer ID, 242 is the movie ID, and 3.0 is the rating given by the viewer.
Kindly advise if I need to download a different version of Python to use the double parentheses. Presently I have Python 3.7 installed on my machine.
Thanks,
AJ
Tuple unpacking in a function signature was a Python 2 feature that was removed in Python 3 (PEP 3113), so no Python 3.x release will accept the double parentheses; the variable names inside the parameter tuple are of no use there. If you really want the tuple to be the parameter of the function, name the whole tuple and unpack it inside, like:
def filterDuplicates(userData):
    userId = userData[0]
    ratings = userData[1]
    movie1 = ratings[0][0]
    rating1 = ratings[0][1]
    movie2 = ratings[1][0]
    rating2 = ratings[1][1]
    return movie1 < movie2
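For a quick check, the function can be applied to the sample data from the question with RDD.filter (assuming an existing SparkContext sc):

data = [(196, ((242, 3.0), (242, 3.0))), (196, ((242, 3.0), (393, 4.0)))]
rdd = sc.parallelize(data)

# Keeps only pairs whose first movie ID is strictly smaller than the second,
# which drops self-pairs and one ordering of every duplicate pair.
print(rdd.filter(filterDuplicates).collect())
# [(196, ((242, 3.0), (393, 4.0)))]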
I am having difficulties migrating a working script from SQLite to PostgreSQL. I am using SQLAlchemy. When I run the script, it raises the following error:
raise exc.DBAPIError.instance(statement, parameters, e, connection_invalidated=is_disconnect)
sqlalchemy.exc.ProgrammingError: (ProgrammingError) can't adapt 'INSERT INTO cnn_hot_stocks (datetime, list, ticker, price, change, "pctChange") VALUES (%(datetime)s, %(list)s, %(ticker)s, %(price)s, %(change)s, %(pctChange)s)' {'price': Decimal('7.94'), 'list': 'active', 'datetime': datetime.datetime(2012, 6, 23, 11, 45, 1, 544361), 'pctChange': u'+1.53%', 'ticker': u'BAC', 'change': Decimal('0.12')}
The insert call works well when using the SQLite engine, but I want to use PostgreSQL to take advantage of its native decimal/numeric type for keeping the financial data correct. I copied the script and just changed the DB engine to my PostgreSQL server. Any advice on how to troubleshoot this error would be greatly appreciated for this SQLAlchemy newbie... I think I am up a creek on this one! Thanks in advance!
Here are my relevant code segments and table descriptions:
dbstring = "postgresql://postgres:postgres@localhost:5432/algo"
db = create_engine(dbstring)
db.echo = True # Try changing this to True and see what happens
metadata = MetaData(db)
cnn_hot_stocks = Table('cnn_hot_stocks', metadata, autoload=True)
i = cnn_hot_stocks.insert() # running log from cnn hot stocks web-site
def scrape_data():
    try:
        html = urllib2.urlopen('http://money.cnn.com/data/hotstocks/').read()
        markup, errors = tidy_document(html)
        soup = BeautifulSoup(markup,)
    except Exception as e:
        pass
    list_map = { 2 : 'active',
                 3 : 'gainer',
                 4 : 'loser' }
    # Iterate over 3 tables on CNN hot stock web-site
    for x in range(2, 5):
        table = soup('table')[x]
        for row in table.findAll('tr')[1:]:
            timestamp = datetime.now()
            col = row.findAll('td')
            ticker = col[0].a.string
            price = Decimal(col[1].span.string)
            change = Decimal(col[2].span.span.string)
            pctChange = col[3].span.span.string
            log_data = {'datetime' : timestamp,
                        'list' : list_map[x],
                        'ticker' : ticker,
                        'price' : price,
                        'change' : change,
                        'pctChange' : pctChange }
            print log_data
            # Commit to DB
            i.execute(log_data)
TABLE:
cnn_hot_stocks = Table('cnn_hot_stocks', metadata,  # log of stocks data on cnn hot stocks lists
    Column('datetime', DateTime, primary_key=True),
    Column('list', String),  # loser/gainer/active
    Column('ticker', String),
    Column('price', Numeric),
    Column('change', Numeric),
    Column('pctChange', String),
)
My reading of the documentation is that you have to use numeric instead of decimal.
PostgreSQL has no type named decimal (it's an alias for numeric, but not a very full-featured one), and SQLAlchemy seems to expect numeric as the type it can use for abstraction purposes.
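A minimal sketch of what that looks like on the SQLAlchemy side (the table and column names here are illustrative, not taken from the question): declare the money-like columns with Numeric, which maps to PostgreSQL's numeric type and accepts Python Decimal values directly.

from decimal import Decimal
from sqlalchemy import Column, MetaData, Numeric, String, Table

metadata = MetaData()
quotes = Table(
    'quotes', metadata,
    Column('ticker', String, primary_key=True),
    # Numeric(precision, scale) is rendered as numeric(12, 2) in PostgreSQL.
    Column('price', Numeric(12, 2)),
)
insert_stmt = quotes.insert().values(ticker='BAC', price=Decimal('7.94'))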