PySpark - How to get capitalized names? [closed]

How can I get the capitalized names out of the message column below?
from pyspark.sql import types as T
import pyspark.sql.functions as F

test = spark.createDataFrame(
    [
        (1, '2021-10-04 09:05:14', "For the 2nd copy of the ticket, access the link: wa.me/11223332211 (Whats) use ID and our number(1112222333344455). Duvidas, www.abtech.com . AB Tech"),
        (2, '2021-10-04 09:10:05', ". MARCIOG, let's try again? Get in touch to rectify your situation. For WhatsApp Link: ab-ab.ab.com/n/12345467. AB Tech"),
        (3, '2021-10-04 09:27:27', ", we do not identify the payment of the installment of your agreement, if paid disregard. You doubt, link: wa.me/99998-88822 (Whats) ou 0800-999-9999. AB Tech"),
        (4, '2021-10-04 14:55:26', "Mr, SUELI. enjoy the holiday with money in your account. AB has great conditions for you. Call now and hire 0800899-9999 (Mon to Fri from 12pm to 6pm)"),
        (5, '2021-10-06 09:15:11', ". DEPREZC, let's try again? Get in touch to rectify your situation. For whatsapp Link: csi-csi.abtech.com/n/12345467. AB Tech"),
        (6, '2022-02-03 08:00:12', "Mr. SARA. We have great discount options. Regularize your situation with AB! Link: wa.me/25544-8855 (Whats) ou 0800-999-9999. AB."),
        (7, '2021-10-04 09:26:00', ", we do not identify the payment of the installment of your agreement, if paid disregard. You doubt, link: wa.me/999999999 (Whats) or 0800-999-9999. AB Tech"),
        (8, '2018-10-09 12:31:33', "Mr.(a) ANTONI, regularize your situation with the Ammmm Bhhhh. Ligue 0800-729-2406 or access the CHAT www.abtech.com. AB Tech."),
        (9, '2018-10-09 15:14:51', "Follow code of bars of your updated deal for today (11111.111111 1111.11111 11111.111111 1 11111111111). Doubts call 0800-999-9999. AB Tech.")
    ],
    T.StructType(
        [
            # the id values are Python ints, so the field needs an integer type
            # (StringType here would make createDataFrame raise a TypeError)
            T.StructField("id_mt", T.IntegerType(), True),
            T.StructField("date_send", T.StringType(), True),
            T.StructField("message", T.StringType(), True),
        ]
    ),
)
Could you tell me what the logic would be to extract the uppercase names? The expected result is a new column 'names' containing the extracted name for each message (null when there is none).

We made the Fugue project to port native Python or Pandas code to Spark or Dask. This lets you keep the logic very readable by expressing it in native Python; Fugue can then port it to Spark for you with one function call.
I think this specific case is hard in Spark but easy in Python, so I'll walk through the solution.
First we make a Pandas DataFrame for quick testing:
import pandas as pd

df = pd.DataFrame(
    [
        (1, '2021-10-04 09:05:14', "For the 2nd copy of the ticket, access the link: wa.me/11223332211 (Whats) use ID and our number(1112222333344455). Duvidas, www.abtech.com . AB Tech"),
        (2, '2021-10-04 09:10:05', ". MARCIOG, let's try again? Get in touch to rectify your situation. For WhatsApp Link: ab-ab.ab.com/n/12345467. AB Tech"),
        (3, '2021-10-04 09:27:27', ", we do not identify the payment of the installment of your agreement, if paid disregard. You doubt, link: wa.me/99998-88822 (Whats) ou 0800-999-9999. AB Tech"),
        (4, '2021-10-04 14:55:26', "Mr, SUELI. enjoy the holiday with money in your account. AB has great conditions for you. Call now and hire 0800899-9999 (Mon to Fri from 12pm to 6pm)"),
        (5, '2021-10-06 09:15:11', ". DEPREZC, let's try again? Get in touch to rectify your situation. For whatsapp Link: csi-csi.abtech.com/n/12345467. AB Tech"),
        (6, '2022-02-03 08:00:12', "Mr. SARA. We have great discount options. Regularize your situation with AB! Link: wa.me/25544-8855 (Whats) ou 0800-999-9999. AB."),
        (7, '2021-10-04 09:26:00', ", we do not identify the payment of the installment of your agreement, if paid disregard. You doubt, link: wa.me/999999999 (Whats) or 0800-999-9999. AB Tech"),
        (8, '2018-10-09 12:31:33', "Mr.(a) ANTONI, regularize your situation with the Ammmm Bhhhh. Ligue 0800-729-2406 or access the CHAT www.abtech.com. AB Tech."),
        (9, '2018-10-09 15:14:51', "Follow code of bars of your updated deal for today (11111.111111 1111.11111 11111.111111 1 11111111111). Doubts call 0800-999-9999. AB Tech.")
    ],
    columns=["id_mt", "date_send", "message"],
)
Now we create a native Python function to extract the name. get_name_for_one_string operates on one string, while get_names takes in the whole DataFrame.
from typing import Any, Dict, List, Optional
import re

def get_name_for_one_string(message: str) -> Optional[str]:
    # replace each run of non-letter characters with a single space
    message = re.sub(r"\s*[^A-Za-z]+\s*", " ", message)
    # split into words
    items = message.split(" ")
    # keep words that are all caps and longer than 2 characters
    item = [x for x in items if (x.upper() == x and len(x) > 2)]
    if len(item) > 0:
        return item[0]
    else:
        return None

def get_names(df: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    for row in df:
        row["names"] = get_name_for_one_string(row["message"])
    return df
Now we can use this on a Pandas DataFrame using the Fugue transform function, and Fugue will handle the conversions.
from fugue import transform
transform(df, get_names, schema="*,names:str")
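On Pandas input this returns a regular Pandas DataFrame with the new column, so a quick sanity check is possible before going to Spark (a sketch; the expected values match the Spark output shown below):
result = transform(df, get_names, schema="*,names:str")
print(result["names"].tolist())
# expect MARCIOG, SUELI, DEPREZC, SARA and ANTONI for ids 2, 4, 5, 6 and 8,
# and nulls for the rest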
This works, so now we can bring it to Spark just by specifying the engine.
import fugue_spark
transform(df, get_names, schema="*,names:str", engine="spark").show()
+-----+-------------------+--------------------+-------+
|id_mt| date_send| message| names|
+-----+-------------------+--------------------+-------+
| 1|2021-10-04 09:05:14|For the 2nd copy ...| null|
| 2|2021-10-04 09:10:05|. MARCIOG, let's ...|MARCIOG|
| 3|2021-10-04 09:27:27|, we do not ident...| null|
| 4|2021-10-04 14:55:26|Mr, SUELI. enjoy ...| SUELI|
| 5|2021-10-06 09:15:11|. DEPREZC, let's ...|DEPREZC|
| 6|2022-02-03 08:00:12|Mr. SARA. We have...| SARA|
| 7|2021-10-04 09:26:00|, we do not ident...| null|
| 8|2018-10-09 12:31:33|Mr.(a) ANTONI, re...| ANTONI|
| 9|2018-10-09 15:14:51|Follow code of ba...| null|
+-----+-------------------+--------------------+-------+
Note you need .show() because Spark evaluates lazily. The transform function can take in both Pandas and Spark DataFrames, and if you use the Spark engine, the output will be a Spark DataFrame as well.
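For reference, the same first-all-caps-word heuristic can be roughly approximated in pure PySpark with regexp_extract on the test DataFrame from the question. This is only a sketch of the idea, not an exact equivalent: it grabs the first run of 3 or more consecutive uppercase letters, and regexp_extract returns an empty string rather than null when nothing matches.
from pyspark.sql import functions as F

with_names = test.withColumn(
    "names",
    F.regexp_extract(F.col("message"), r"\b([A-Z]{3,})\b", 1),
)
with_names.select("id_mt", "names").show()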

Related

MYSQL WORKBENCH: Separating last 2 characters from a column

I am trying to remove the last two characters from a column. The current column that I am targeting has already been created by separating a string, but as you'll see below, it wasn't successful for the 'City' column.
The original data is a single OwnerAddress column (shown as a screenshot in the original post, not reproduced here).
This is what I've been able to output from my code:
| StreetNumber | Street       | **City**              | State |
|--------------|--------------|-----------------------|-------|
| 1808         | FOX CHASE DR | **GOODLETTSVILLE TN** | TN    |
| 1832         | FOX CHASE DR | **GOODLETTSVILLE TN** | TN    |
| 2005         | SADIE LN     | **GOODLETTSVILLE TN** | TN    |
This is my code:
select substring_index(substring_index(OwnerAddress, ' ', 1), ' ', -1) as StreetNumber,
substring(OwnerAddress, locate(' ', OwnerAddress),
(length(OwnerAddress) - (length(substring_index(OwnerAddress, ' ', 1))
+ length(substring_index(OwnerAddress, ' ', -2))))) as Street,
substring(substring_index(OwnerAddress, ' ', -2) from 1 for length(OwnerAddress)-2) as City,
substring_index(OwnerAddress, ' ', -1) as State
from nashhousing;
The goal is to remove the state abbreviations from the 'City' column, because there's already a State column. I thought I could simply take off the last two characters, but obviously that didn't work. I hope I've explained my situation clearly, but if not, please let me know. I don't want to give up on this, but I've been at it for 5 hours already and can't find a solution. Please help, and thank you in advance!
To directly answer your question, you are using the wrong field and value for the length portion of the SUBSTRING on the city field.
This should correct your issue:
substring(substring_index(OwnerAddress, ' ', -2), 1, length(substring_index(OwnerAddress, ' ', -2)) - 3) as City,
(the -3 trims the trailing space plus the two-letter state abbreviation)
Or even
substring_index(substring_index(OwnerAddress, ' ', -2), ' ', 1) as City
Please Note:
The major issue with doing things this way is that you are assuming every entry is formatted the same, and you're not taking into account cities with more than one word, e.g. New York City, Los Angeles, San Francisco, etc.
This is something that you likely need to parse outside of MySQL. Since you only need it to be parsed, you could likely write a decent enough RegEx to handle the majority of the cases. However, if accuracy is your top priority, I would recommend geocoding the data.
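For example, here is a minimal Python sketch of that regex idea. It assumes every address looks like "<number> <street and city> <two-letter state>"; as noted above, splitting street from city reliably still needs a city list or geocoding.
import re

def parse_address(addr):
    # state = final two-letter token, number = leading digits;
    # the middle still mixes street and city, which is the hard part
    m = re.match(r"^(\d+)\s+(.+)\s+([A-Z]{2})$", addr.strip())
    if not m:
        return None
    number, middle, state = m.groups()
    return {"StreetNumber": number, "StreetAndCity": middle, "State": state}

print(parse_address("1808 FOX CHASE DR GOODLETTSVILLE TN"))
# {'StreetNumber': '1808', 'StreetAndCity': 'FOX CHASE DR GOODLETTSVILLE', 'State': 'TN'}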

Fetching data from a List

I am new to PySpark and have purchased a book to enhance my PySpark skills. I am stuck while using a function.
Function
def filterDuplicates( (userID, ratings) ):
    (movie1, rating1) = ratings[0]
    (movie2, rating2) = ratings[1]
    return movie1 < movie2
I am getting an error due to the two consecutive parentheses. This step takes an RDD which is basically a list of tuples, as shown below:
[(196, ((242, 3.0), (242, 3.0))), (196, ((242, 3.0), (393, 4.0)))]
The final result should contain only distinct (movie ID, rating) pairs per viewer.
So in the example above, 196 is the viewer ID, 242 is a movie ID, and 3.0 is the rating given by the viewer.
Kindly advise if I need to download a different version of Python to use the double parentheses. Presently I have Python 3.7 installed on my machine.
Thanks,
AJ
Tuple parameter unpacking in function signatures was removed in Python 3 (PEP 3113), so no Python 3 version will accept that syntax; the book's example was presumably written for Python 2. Instead, name the whole tuple as a single parameter and unpack it inside the function:
def filterDuplicates(userData):
    userId = userData[0]
    ratings = userData[1]
    movie1 = ratings[0][0]
    rating1 = ratings[0][1]
    movie2 = ratings[1][0]
    rating2 = ratings[1][1]
    return movie1 < movie2
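A quick sketch of how the fixed function can be used on the RDD from the question (assuming sc is an active SparkContext):
pairs = [(196, ((242, 3.0), (242, 3.0))), (196, ((242, 3.0), (393, 4.0)))]
rdd = sc.parallelize(pairs)
print(rdd.filter(filterDuplicates).collect())
# [(196, ((242, 3.0), (393, 4.0)))] -- the self-pair (242, 242) is dropped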

Retrieve maximum date from sql using SparkSQL to export in JSON [duplicate]

This question already has answers here:
How to select the first row of each group?
(9 answers)
I have a table in a remote SQL database.
CUSTOMERID ACCOUNTNO VEHICLENUMBER TAGSTARTEFFDATE
20000000 10003014 MH43AJ411 2013-06-07 13:07:13.210
20000001 10003014 MH43AJ411 2014-08-08 19:10:11.519
20029961 10003019 GJ15CD7387 2016-07-28 19:21:54.173
20009020 10003019 GJ15CF7747 2016-05-25 18:46:55.947
20001866 10003019 GJ15CD7657 2015-07-11 15:17:14.503
20001557 10003019 GJ15CB9601 2016-05-05 16:45:58.247
20001223 10003019 GJ15CA7837 2014-06-06 14:57:42.583
20000933 10003019 MH02DG7774 2014-02-12 13:49:31.427
20001690 10003019 GJ15CD7477 2015-01-03 16:12:59.000
20000008 10003019 GJ15CB727 2013-06-17 12:36:01.190
20001865 10003019 GJ15CA7387 2015-06-24 15:01:14.000
20000005 10003019 MH02BY7774 2013-06-15 12:29:10.000
I want to export it as JSON, and this is the code snippet:
val jdbcSqlConnStr = s"jdbc:sqlserver://192.168.70.15;databaseName=$db;user=bhaskar;password=welcome123;"
val jdbcDbTable = table1
val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> jdbcSqlConnStr,"dbtable" -> jdbcDbTable)).load()
//jdbcDF.show(10)
//jdbcDF.printSchema
val query = "SELECT ACCOUNTNO, collect_set(struct(`VEHICLENUMBER`, `CUSTOMERID`, `TAGSTARTEFFDATE`)) as VEHICLE FROM tp_customer_account GROUP BY ACCOUNTNO ORDER BY ACCOUNTNO"
jdbcDF.registerTempTable("tp_customer_account")
val res00 = sqlContext.sql(query.toString)
// res00.show(10)
res00.coalesce(1).write.json("D:/res15")
Issue:
The problem is that I am getting multiple rows per VEHICLENUMBER, because more than one TAGSTARTEFFDATE is present in the table for the same VEHICLENUMBER.
Want to achieve:
I want to retrieve only the maximum TAGSTARTEFFDATE for each VEHICLENUMBER. I want to use a SparkSQL query via SQLContext, as in the code snippet above.
You can use window functions with dense_rank(); it goes something like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank}

val windowSpec = Window.partitionBy(col("VEHICLENUMBER")).orderBy(col("TAGSTARTEFFDATE").desc)
jdbcDF.withColumn("rank", dense_rank().over(windowSpec)).filter(col("rank") === 1).drop(col("rank"))
At the moment I'm not really sure how to express this logic with pure SQL syntax, but if you are not restricted to using just SQL you can utilize this snippet.
Edit
Took the help of a friend to get a SQL equivalent of the above. Note that a window function cannot be used directly in a WHERE clause, so it needs a subquery, and the ordering has to be descending to keep the latest date:
SELECT * FROM (
    SELECT *, dense_rank() OVER (PARTITION BY VEHICLENUMBER ORDER BY TAGSTARTEFFDATE DESC) AS rnk
    FROM tp_customer_account
) ranked
WHERE rnk = 1
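If a DataFrame-based PySpark version is an option, here is a hedged sketch of the same keep-the-latest-row logic (assuming jdbcDF were loaded through the PySpark JDBC reader rather than the Scala one):
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("VEHICLENUMBER").orderBy(F.col("TAGSTARTEFFDATE").desc())
latest = (jdbcDF.withColumn("rnk", F.dense_rank().over(w))
          .filter(F.col("rnk") == 1)
          .drop("rnk"))
latest.show()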

Add record to table, and edit records in 2 separate tables

Hi there, I'm making an Access database and I can't figure out how to do one particular thing.
I've got a form with two text boxes: MovieID and CustomerID. I also have three separate tables: MovieList, CustomerInfo and HireHistory. What I need is this: when I enter a MovieID and a CustomerID into the given boxes and press my button HireButton, it sets that specific MovieID's LastHireDate to Today(), sets that specific CustomerID's LastHireDate to Today(), and then in my HireForm (which has the CustomerIDs in the first row) adds a new record below the CustomerID in the form: MovieID " on " Today().
Also, I need it to check that MovieID's genre: if it's R16 or R18, it should check whether the customer is older than 16 or 18 today, and if not, come up with an error box. I know how to check whether they are older than 16 or 18, but not how to do the error box.
I know that's a lot of text, so I'll just write what's in my brain (how I see the code should be) so it will be easier to see what I want to do.
IF MovieID.Rating = 'R16' OR 'R18' THEN
    IF CustomerID.[Date(Year(DOB)+16, Month(DOB), Day(DOB))] > Today() THEN
        DISPLAY Msgbox = "Sorry, too young"
    ELSE
        SET CustomerID.LastHireDate = Today()
        SET MovieID.LastHireDate = Today()
        ADDRECORD in HireHistory for that CustomerID to (MovieID & " on " & Today())
ELSE
    SET CustomerID.LastHireDate = Today()
    SET MovieID.LastHireDate = Today()
    ADDRECORD in HireHistory for that CustomerID to (MovieID & " on " & Today())
Does that explain it a bit better? Thanks in advance for your help! :)
Here is how I would do this. You first have to create a recordset for each of those tables.
For the age check I would use this function: http://www.fmsinc.com/MicrosoftAccess/modules/examples/AgeCalculation.asp
customerBirth = ...  ' your code to get the customer's date of birth goes here
If MovieID.Rating = "R16" Or MovieID.Rating = "R18" Then
    ' compare against 16 or 18 depending on the rating
    If (MovieID.Rating = "R16" And AgeYears(customerBirth) < 16) Or _
       (MovieID.Rating = "R18" And AgeYears(customerBirth) < 18) Then
        MsgBox "Sorry, too young"
    Else
        MyCustomerRecordSet("LastHireDate") = Now
        MyMovieRecordSet("LastHireDate") = Now
        MyHireRecordset.AddNew
        MyHireRecordset("...") = MovieID & " on " & Now  ' fill in your HireHistory field name
        MyHireRecordset.Update
    End If
Else
    MyCustomerRecordSet("LastHireDate") = Now
    MyMovieRecordSet("LastHireDate") = Now
    MyHireRecordset.AddNew
    MyHireRecordset("...") = MovieID & " on " & Now  ' fill in your HireHistory field name
    MyHireRecordset.Update
End If
If you have any questions, just ask.

lex|flex rule action ignored

All,
I have the following pattern in my lex file:
"#"[ \\t]*"ifdef".* { action_ifdef_manager(yytext); }
When text like #ifdef GLOBALVAR is encountered, the action action_ifdef_manager is not called.
Thanks for any help.
The likely problem is ambiguity of patterns: another rule probably also matches input beginning with '#ifdef', and flex always picks the rule with the longest match (and, among rules matching the same length, the one listed first). To see how ambiguous rules interact, consider:
a     |
ab    |
abc   |
abcd  ECHO; REJECT;
With REJECT, the lexer matches all four validated patterns a, ab, abc and abcd against the input "abcd".
Take a look at the Flex manual