How to delete all rows with lowercase values in a column - pyspark

I am working on a dataset that has a description column where items for sale are represented in uppercase and expenses are represented in lowercase. I am trying to drop all rows whose description is lowercase.
Sample data:
invoiceid   description(string datatype)
100         WHITE METAL LANTERN
200         post expenses
300         BLACK WIRE
What I want to achieve:
invoiceid   description
100         WHITE METAL LANTERN
300         BLACK WIRE
I tried the following code but kept getting errors:
from pyspark.sql.functions import upper, lower col
df4 = df3.where(col('Description').islower())

One way: extract the capital letters into a helper column with regexp_extract. Lowercase descriptions return an empty string, so filter out the blanks and drop the helper column to clean up the DataFrame:
from pyspark.sql.functions import regexp_extract
df.withColumn('filt', regexp_extract('description(string datatype)', '[A-Z]+', 0)).filter("filt != ''").drop('filt').show()
+---------+----------------------------+
|invoiceid|description(string datatype)|
+---------+----------------------------+
| 100| WHITE METAL LANTERN|
| 300| BLACK WIRE|
+---------+----------------------------+
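An alternative sketch, assuming the column keeps the literal name shown above: keep only the rows whose description already equals its uppercased form, so lowercase (and mixed-case) expenses fall away.
from pyspark.sql.functions import col, upper

# Keep rows where upper-casing changes nothing, i.e. the description is already all caps
df.filter(col('description(string datatype)') == upper(col('description(string datatype)'))).show()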

Count the number of records that match exploded column in pyspark.sql?

I have an assignment that uses Spark 2.4 and part of the Yelp dataset. The part of the schema we are to use from the business data is below; all of these fields live in the same DataFrame:
"business_id": string
"categories": comma delimited list of strings
"stars": double
We are supposed to create a new DataFrame which groups the businesses by category, with the following columns:
"category": string exploded from "categories"
"businessCount": integer; number of businesses in that category
"averageStarRating": double; average rating of businesses in the category
"minStarRating": double; lowest rating of any restaurant in that category
"maxStarRating": double; highest rating of any restaurant in that category
So far I have been able to figure out how to use the explode command to break up the "categories" column into individual records and show the "business_id", "category", and "stars":
from pyspark.sql import functions as F
businessdf.select("business_id", F.explode(F.split("categories", ",")).alias("category"), "stars").show(5)
The above command gives me this as a result:
+--------------------+--------------+-----+
| business_id| category|stars|
+--------------------+--------------+-----+
|1SWheh84yJXfytovI...| Golf| 3.0|
|1SWheh84yJXfytovI...| Active Life| 3.0|
|QXAEGFB4oINsVuTFx...|Specialty Food| 2.5|
|QXAEGFB4oINsVuTFx...| Restaurants| 2.5|
|QXAEGFB4oINsVuTFx...| Dim Sum| 2.5|
+--------------------+--------------+-----+
only showing top 5 rows
What I can't figure out how to do is use aggregate functions to create the other columns. My professor says it all must be done in one statement. All of my attempts so far have led to errors.
My assignment says I will also need to remove any leading/trailing spaces on the newly created "category" column before doing any aggregations, but my attempts have all led to errors.
I feel this is the closest I've come but don't have any idea what to try next:
businessdf.select(F.explode(F.split("categories", ",")).alias("category")).groupBy("category").agg(F.count("category").alias("businessCount"), F.avg("stars").alias("averageStarRating"), F.min("stars").alias("minStarRating"), F.max("stars").alias("maxStarRating"))
Here is the error that comes along with that command:
pyspark.sql.utils.AnalysisException: "cannot resolve '`stars`' given input columns: [category];;\n'Aggregate [category#337], [category#337, count(category#337) AS businessCount#342L, avg('stars) AS averageStarRating#344, min('stars) AS minStarRating#346, max('stars) AS maxStarRating#348]\n+- Project [category#337]\n +- Generate explode(split(categories#33, ,)), false, [category#337]\n +- Relation[address#30,attributes#31,business_id#32,categories#33,city#34,hours#35,is_open#36L,latitude#37,longitude#38,name#39,postal_code#40,review_count#41L,stars#42,state#43] json\n"
Never mind, posting must have helped me work through it on my own. The command I posted above is very close, but I forgot to add the "stars" column to the select statement. The correct command is here:
(businessdf
    .select(F.explode(F.split("categories", ",")).alias("category"), "stars")
    .groupBy("category")
    .agg(F.count("category").alias("businessCount"),
         F.avg("stars").alias("averageStarRating"),
         F.min("stars").alias("minStarRating"),
         F.max("stars").alias("maxStarRating"))
    .show())
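The assignment also asks for leading/trailing spaces to be removed from "category" before aggregating, and splitting on "," leaves a leading space on every category after the first. A sketch of the same single statement with a trim step added (untested against the full Yelp data):
(businessdf
    .select(F.explode(F.split("categories", ",")).alias("category"), "stars")
    .withColumn("category", F.trim(F.col("category")))  # strip spaces left by the comma split
    .groupBy("category")
    .agg(F.count("category").alias("businessCount"),
         F.avg("stars").alias("averageStarRating"),
         F.min("stars").alias("minStarRating"),
         F.max("stars").alias("maxStarRating"))
    .show())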

Calculating time weighted return in Power BI

I'm trying to calculate the time weighted return (TWR) for a portfolio of stocks. The formula is TWR = (1 + r1) × (1 + r2) × ... × (1 + rn) − 1, where each r_i is the return of one sub-period.
I have the following data:
I calculate the TWR (time weighted return) in Power BI as:
TWR = PRODUCTX(tabel1; [TWR denom/yield+1])
The grey and blue marked/selected fields are individual single stocks. Here you can see that the TWR for the grey stock is 0,030561631 and for the blue stock 0,012208719, which is correct for the period from 09.03.19 to 13.03.19.
My problem is that when I try to calculate the TWR for a portfolio of the two stocks, it takes the product of every row. In the orange field I have calculated the correct result in Excel, but Power BI takes the product of the grey and blue stocks' TWR factors (0,030561631 and 0,012208719 combined), which is incorrect.
I want sum(yield for both stocks) / sum(TWRDenominator for both stocks) for every single date, so that I do not end up with two rows per date (one for each stock) but instead one number per date for the portfolio.
I have calculated the column TWR denom/yield-1 in a measure like this:
twr denom/yield-1 = CALCULATE(1 + SUMX(tabel1; tabel1[yield] / SUMX(tabel1; tabel1[TwrDenominator])))
How can I solve this problem?
Thank you in advance!
This is one solution to your question but it assumes the data is in the following format:
[Date] | [Stock] | [TWR] | [Yield]
-----------------------------------
[d1] | X | 12355 | 236
[d1] | y | 23541 | 36
[d2] ... etc.
I.e. date is not a unique value in the table, though the date-stock combination will be.
Then you can create a new calculated table using the following code:
Portfolio_101 =
CALCULATETABLE(
    SUMMARIZE(
        DataTable;
        DataTable[Date];
        "Yield_over_TWR"; SUM(DataTable[Yield]) / SUM(DataTable[TWR_den]) + 1
    );
    DataTable[Stock] IN {"Stock_Name_1"; "Stock_Name_2"}
)
Then in the new Portfolio_101 create a measure:
Return_101 =
PRODUCTX(
    Portfolio_101;
    Portfolio_101[Yield_over_TWR]
) - 1
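This works because SUMMARIZE collapses the table to one row per date before PRODUCTX runs, so the product chains the per-date portfolio factors across time instead of multiplying the two stocks' whole-period factors together.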
Using your data I end up with the following table. I have created three calculated tables, one for each stock and a third (Portfolio_103) with the two combined. In addition I have a calendar table which has a 1:1 relationship with all Portfolio tables.
Hope this helps, otherwise let me know where I've misunderstood you.
Cheers,
Oscar

How to conditionally remove the first two characters from a column

I have the below data of some phone records, and I want to remove the first two digits from each record where they are a country code. How can I do this using Scala, Spark, or Hive?
phone
|917799423934|
|019331224595|
| 8981251522|
|917271767899|
I'd like the result to be:
phone
|7799423934|
|9331224595|
|8981251522|
|7271767899|
How can we remove the prefixes 91 and 01 from each row of this column?
Since the phone length can differ, a construction like this can be used (Scala):
df.withColumn("phone", expr("substring(phone,3,length(phone)-2)"))
Using regular expressions
Use regexp_replace (add more extension codes if necessary):
select regexp_replace(trim(phone),'^(91|01)','') as phone --removes leading 91, 01 and all leading and trailing spaces
from table;
The same using regexp_extract:
select regexp_extract(trim(phone),'^(91|01)?(\\d+)',2) as phone --removes leading and trailing spaces, extract numbers except first (91 or 01)
from table;
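For completeness, the same regexp_replace idea as a Spark (Scala) sketch, assuming a DataFrame df with a string column phone:
import org.apache.spark.sql.functions.{regexp_replace, trim, col}

// Trim spaces, then strip a leading 91 or 01 country code
val cleaned = df.withColumn("phone", regexp_replace(trim(col("phone")), "^(91|01)", ""))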
This could be improved, I believe; I would prefer a list with contains or the equivalent (see the sketch after the output below), but here goes:
import org.apache.spark.sql.functions._

case class Tel(telnum: String)

val ds = Seq(
  Tel("917799423934"),
  Tel("019331224595"),
  Tel("8981251522"),
  Tel("+4553")).toDS()

val ds2 = ds.withColumn("new_telnum",
  when(expr("substring(telnum,1,2)") === "91" || expr("substring(telnum,1,2)") === "01",
    expr("substring(telnum,3,length(telnum)-2)"))
    .otherwise(col("telnum")))
ds2.show
returns:
+------------+----------+
| telnum|new_telnum|
+------------+----------+
|917799423934|7799423934|
|019331224595|9331224595|
| 8981251522|8981251522|
| +4553| +4553|
+------------+----------+
We may need to think about the leading +, but nothing was stated about it.
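A sketch of the list-based variant mentioned above; prefixes is a name introduced here for illustration, and isin replaces the chained || comparisons:
val prefixes = Seq("91", "01")

// isin lets the prefix list grow without adding more || clauses
val ds3 = ds.withColumn("new_telnum",
  when(expr("substring(telnum,1,2)").isin(prefixes: _*), expr("substring(telnum,3)"))
    .otherwise(col("telnum")))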
If they are strings then for a Hive query:
sql("select substring(phone,3) from table").show

SSRS 2012 Mean & Median

I have a report that uses Mean and Median measures calculated in SSAS 2012 Tabular. The report is as follows:
When I use the Mean and Median directly, the values in the green box are correct but the column and row totals are incorrect.
After using Aggregate instead of Sum, the following happens:
1. The blank row and blank column are now gone, along with their values.
2. The total of the Mean is correct in every cell except the grand total (2nd cell from the bottom-right corner); it appears to take the value given for the Mean at the cross of blank and blank in the previous picture.
3. The totals for the Median are now either blank or 0, except for the grand total (bottom-right corner cell); it seems to have the value of the Median from the cross of blank and blank in the previous picture.
I am stuck here; I don't know what to do, and I want to avoid using another dataset if possible.
A screenshot of my query designer:
After including all empty spaces (=Aggregate is used in every cell):
Please note that after taking away the extra dimension columns, the Total row at the bottom changed; the correct Mean totals from the previous picture are now gone.
The MDX for the dataset:
SELECT
  { [Measures].[INCOME AVERAGE], [Measures].[INCOME MEDIAN] } ON COLUMNS,
  { [DIM_Type of Education for Household نوع التعليم لرب الأسرة].[EDU_TYPE_ENAME].[EDU_TYPE_ENAME].ALLMEMBERS
      * [DIM_Nationality of household الجنسية لرئيس الأسرة].[NATIONALITY_L1_ENAME].[NATIONALITY_L1_ENAME].ALLMEMBERS,
    [DIM_Type of Education for Household نوع التعليم لرب الأسرة].[EDU_TYPE_ENAME].[EDU_TYPE_ENAME].ALLMEMBERS
      * { [DIM_Nationality of household الجنسية لرئيس الأسرة].[NATIONALITY_L1_ENAME].[All] },
    { [DIM_Type of Education for Household نوع التعليم لرب الأسرة].[EDU_TYPE_ENAME].[All] }
      * [DIM_Nationality of household الجنسية لرئيس الأسرة].[NATIONALITY_L1_ENAME].[NATIONALITY_L1_ENAME].ALLMEMBERS,
    { [DIM_Type of Education for Household نوع التعليم لرب الأسرة].[EDU_TYPE_ENAME].[All] }
      * { [DIM_Nationality of household الجنسية لرئيس الأسرة].[NATIONALITY_L1_ENAME].[All] },
    ( [DIM_Type of Education for Household نوع التعليم لرب الأسرة].[EDU_TYPE_ENAME].[EDU_TYPE_ENAME].ALLMEMBERS
      * [DIM_Nationality of household الجنسية لرئيس الأسرة].[NATIONALITY_L1_ENAME].[NATIONALITY_L1_ENAME].ALLMEMBERS )
  }
  DIMENSION PROPERTIES MEMBER_CAPTION, MEMBER_UNIQUE_NAME ON ROWS
FROM [Model]
CELL PROPERTIES VALUE, BACK_COLOR, FORE_COLOR, FORMATTED_VALUE, FORMAT_STRING, FONT_NAME, FONT_SIZE, FONT_FLAGS
Thanks in advance.
Your MDX query includes a ton of fields. Can you simplify it so that it returns something like the following, to see if that solves the problem:
+-----------+-------------+--------+---------+
| Education | Nationality | Median | Average |
+-----------+-------------+--------+---------+
| null | null | 100 | 110 |
| PhD | null | 100 | 110 |
| null | Saudi | 100 | 110 |
| PhD | Saudi | 100 | 110 |
+-----------+-------------+--------+---------+
Don't return any extra dimension columns.
I am using null to represent the All member, which is how the MDX query designer does it. Hopefully just dragging and dropping Education and Nationality onto the grid in the graphical MDX query designer will get you what you want. There is a button that turns on showing the All member, if I recall.
Row 1 is for the grand total in the very bottom right.
Row 2 is for the subtotal in the far right column.
Row 3 is for the subtotal on the bottom row.
Row 4 fills up the body of the Report.
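A minimal sketch of what that simplified query might look like, assuming the same measure and dimension names as in the original query (the graphical designer may generate slightly different text):
SELECT
  { [Measures].[INCOME MEDIAN], [Measures].[INCOME AVERAGE] } ON COLUMNS,
  [DIM_Type of Education for Household نوع التعليم لرب الأسرة].[EDU_TYPE_ENAME].ALLMEMBERS
    * [DIM_Nationality of household الجنسية لرئيس الأسرة].[NATIONALITY_L1_ENAME].ALLMEMBERS
  ON ROWS
FROM [Model]
Hierarchy.ALLMEMBERS includes the All member, which is what shows up as the null rows above.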
Then change the layout of the Tablix so that the row group is on the Education column and nothing else. Change the column group to be on Nationality and nothing else. Make sure all the textboxes in your Tablix have =Aggregate. I am hopeful it should work now.

Search inside full search column using certain letters

I want to search inside a fulltext column using only the first letters of a word. I mean:
select "Name","Country","_score" from datatable where match("Country", 'China');
returns many rows, which is fine. My question is: how can I search like this, for example:
select "Name","Country","_score" from datatable where match("Country", 'Ch');
I want to see China, Chile, etc.
I think the match_type phrase_prefix may be the answer, but I don't know the correct syntax to use it.
The match predicate supports different match types via using match_type [with (match_parameter = [value])].
So in your example using the phrase_prefix match type:
select "Name","Country","_score" from datatable where match("Country", 'Ch') using phrase_prefix;
gives you your desired results.
See the match predicate documentation: https://crate.io/docs/en/latest/sql/fulltext.html?#match-predicate
If you just need to match the beginning of a string column, you don't need a fulltext analyzed column. You can use the LIKE operator instead, e.g.:
cr> create table names_table (name string, country string);
CREATE OK (0.840 sec)
cr> insert into names_table (name, country) values ('foo', 'China'), ('bar','Chile'), ('foobar', 'Austria');
INSERT OK, 3 rows affected (0.049 sec)
cr> select * from names_table where country like 'Ch%';
+---------+------+
| country | name |
+---------+------+
| Chile | bar |
| China | foo |
+---------+------+
SELECT 2 rows in set (0.037 sec)