I have the following dataframe:
Date Adj Close
Ticker
ZTS 2014-12-22 43.41
ZTS 2014-12-19 43.51
ZTS 2014-12-18 43.15
ZTS 2014-12-17 41.13
There are many more tickers than just ZTS, it continues on for many more rows.
I would like to select using both the Ticker and the Date but I cannot figure out how. I would like to select as if I were saying in SQL:
Select 'Adj Close' from prices where Ticker = 'ZTS' and 'Date' = '2014-12-22'
Thanks!
The following should work:
df[(df['Date'] == '2014-12-22') & (df.index == 'ZTS')]['Adj Close']
Here we have to use the array operator & rather than the keyword and, and you must use parentheses because of operator precedence.
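If you want to avoid the chained indexing, the same selection can be written as a single .loc call; a minimal sketch using the column names from the question:
# Single .loc call: boolean row mask plus the 'Adj Close' column label.
adj_close = df.loc[(df['Date'] == '2014-12-22') & (df.index == 'ZTS'), 'Adj Close']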
>>> import pandas as pd
>>> L = [['2014-12-22', 43.41], ['2014-12-19', 43.51], ['2014-12-18', 43.15], ['2014-12-17', 41.13]]
>>> C = ['ZTS', 'ZTS', 'ZTS', 'ZTS']
>>> df = pd.DataFrame(L, columns=['Date', 'Adj Close'], index=C)
>>> df
Date Adj Close
ZTS 2014-12-22 43.41
ZTS 2014-12-19 43.51
ZTS 2014-12-18 43.15
ZTS 2014-12-17 41.13
>>> D1 = df.loc['ZTS'][df['Date'] == '2014-12-22']['Adj Close']
>>> D1
ZTS 43.41
I've figured out how to separate out the Ticker into a subset dataframe, then index by Date, and then select by date. But I'm still wondering if there's a more efficient way.
cur_df = df.loc['A']
cur_df = cur_df.set_index(['Date'])
print(cur_df['Adj Close']['2014-11-20'])
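One arguably cleaner option is to put both Ticker and Date into a MultiIndex and select with a single .loc call; a minimal sketch, assuming Ticker is the current index and Date is a regular column:
# Append Date to the existing Ticker index, giving a (Ticker, Date) MultiIndex.
idx_df = df.set_index('Date', append=True)
# Select the 'Adj Close' value for the (Ticker, Date) pair in one lookup.
price = idx_df.loc[('ZTS', '2014-12-22'), 'Adj Close']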
Related
I have a DataFrame A_DF which has, among others, two columns, say COND_B and COND_C. Then I have two different DataFrames: B_DF with a COND_B column and C_DF with a COND_C column.
Now I would like to filter A_DF where the value matches in one OR the other. Something like:
df = A_DF.filter((A_DF.COND_B == B_DF.COND_B) | (A_DF.COND_C == C_DF.COND_C))
But I found out that it is not possible to do it like this.
EDIT
error: Attribute CON_B#264,COND_C#6 is missing from the schema: [... COND_B#532, COND_C#541 ]. Attribute(s) with the same name appear in the operation: COND_B,COND_C. Please check if the right attribute(s) are used.; it looks like I can only filter within the same DF because of the #numbers added on the fly.
So I first tried to build a list from B_DF and C_DF and filter based on that, but it was too expensive to use collect() on 100m records.
So I tried:
AB_DF = A_DF.join(B_DF, 'COND_B', 'left_semi')
AC_DF = A_DF.join(C_DF, 'COND_C', 'left_semi')
df = AB_DF.unionAll(AC_DF).dropDuplicates()
I used dropDuplicates() to remove duplicate records where both conditions were true. But even with that I got some unexpected results.
Is there some other, smoother way to do it simply? Something like an EXISTS statement in SQL?
EDIT2
I tried SQL based on @mck's response:
e.createOrReplaceTempView('E')
b.createOrReplaceTempView('B')
p.createOrReplaceTempView('P')
df = spark.sql("""select * from E where exists (select 1 from B where E.BUSIPKEY = B.BUSIPKEY) or exists (select 1 from P where E.PCKEY = P.PCKEY)""")
my_output.write_dataframe(df)
with error:
Traceback (most recent call last):
File "/myproject/abc.py", line 45, in my_compute_function
df = spark.sql("""select * from E where exists (select 1 from B where E.BUSIPKEY = B.BUSIPKEY) or exists (select 1 from P where E.PCKEY = P.PCKEY)""")
TypeError: sql() missing 1 required positional argument: 'sqlQuery'
Thanks a lot!
Your idea of using exists should work. You can do:
A_DF.createOrReplaceTempView('A')
B_DF.createOrReplaceTempView('B')
C_DF.createOrReplaceTempView('C')
df = spark.sql("""
select * from A
where exists (select 1 from B where A.COND_B = B.COND_B)
or exists (select 1 from C where A.COND_C = C.COND_C)
""")
I have written a SQL query which has a subquery in it. It is a correct MySQL query, but it does not run on PySpark.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import HiveContext
from pyspark.sql.types import *
from pyspark.sql.window import Window
from pyspark.sql.functions import *
sc = spark.sparkContext
sqlcontext = HiveContext(sc)
select location, postal, max(spend), max(revenue)
from (select a.*,
(select sum(r.revenue)
from revenue r
where r.user = a.user and
r.dte >= a.dt - interval 10 minute and
r.dte <= a.dte + interval 10 minute
) as revenue
from auction a
where a.event in ('Mid', 'End', 'Show') and
a.cat_id in (3) and
a.cat = 'B'
) a
group by location, postal;
The error I am getting every time is:
AnalysisException: u"Correlated column is not allowed in a non-equality predicate:\nAggregate [sum(cast(revenue#17 as double)) AS sum(CAST(revenue AS DOUBLE))#498]\n+- Filter (((user#2 = outer(user#85)) && (dt#0 >= cast(cast(outer(dt#67) - interval 10 minutes as timestamp) as string))) && ((dt#0 <= cast(cast(outer(dt#67) + interval 10 minutes as timestamp) as string))
Any insights on this will be helpful.
A correlated subquery using SQL syntax in PySpark is not an option, so in this case I ran the queries separately, with some tweaks to the SQL, and left joined them using df.join to get the desired output through PySpark. This is how the issue was resolved.
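For reference, a rough sketch of that "split and join" approach in the DataFrame API, assuming hypothetical DataFrames auction_df and revenue_df, that the timestamp column is consistently named dte, and that it is already a timestamp type; this is an illustration, not the exact code used:
from pyspark.sql import functions as F

# Filter the auction rows and tag each one so its revenue sum can be recovered later.
a = (auction_df
     .filter(F.col("event").isin("Mid", "End", "Show")
             & F.col("cat_id").isin(3)
             & (F.col("cat") == "B"))
     .withColumn("row_id", F.monotonically_increasing_id()))

r = revenue_df

# A left range join replaces the correlated subquery: all revenue rows for the
# same user within +/- 10 minutes of the auction timestamp.
per_row = (a.join(r,
                  (r["user"] == a["user"])
                  & (r["dte"] >= a["dte"] - F.expr("INTERVAL 10 MINUTES"))
                  & (r["dte"] <= a["dte"] + F.expr("INTERVAL 10 MINUTES")),
                  "left")
           .groupBy("row_id", a["location"], a["postal"], a["spend"])
           .agg(F.sum(r["revenue"]).alias("revenue")))

result = per_row.groupBy("location", "postal").agg(F.max("spend"), F.max("revenue"))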
I have a code similar to this:
import re
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
def regex_filter(x):
regexs = ['.*123.*']
if x and x.strip():
for r in regexs:
if re.match(r, x, re.IGNORECASE):
return True
return False
filter_udf = udf(regex_filter, BooleanType())
df_filtered = df.filter(filter_udf(df.fieldXX))
I want to use the "regexs" variable to verify whether the digits "123" appear in "fieldXX".
I don't know what I did wrong!
Could anyone help me with this?
The regexp is incorrect.
I think it should be something like:
regexs = ['.*[123].*']
You can use a SQL expression to achieve this:
df.createOrReplaceTempView("df_temp")
df_1 = spark.sql("select *, case when col1 like '%123%' then 'TRUE' else 'FALSE' end col2 from df_temp")
A disadvantage of using a UDF is that you cannot save the data frame back or do any further manipulations on that data frame.
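If the goal is just to keep rows whose fieldXX contains the sequence 123, the built-in Column methods avoid both the UDF and the temp view; a minimal sketch, assuming the column is named fieldXX:
from pyspark.sql import functions as F

# rlike takes a Java regex; contains() would do a plain substring match instead.
df_filtered = df.filter(F.col("fieldXX").rlike("123"))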
I have the following text file that has CRLF at the end of each line and that has a relatively small number of bad rows (b'Skipping line 55000: expected 14 fields, saw 15\n').
0.0;0.7;John;0.29
1.0;0.23;Mike;0.55
0.0;0.72;;Jane;;-3.4
0.0;0.98;Gil;0.68
0.0;0.48;;;0
1.0;0.34;Karl;0.73
0.0;0.44;James;0.06
1.0;0.4;Kiki;0.74
0.0;0.18;Albert;0.18
1.0;0.53;Mark;0.53
I import the file with pandas (Python 3.5.2, Windows 10) as follows:
with open(r'E:\DATA\my_file.txt', 'rb') as f:
    df = pd.read_csv(f, sep=';', encoding='CP1252', error_bad_lines=False)  # skipping bad rows
df looks like this (the bad rows seem to be empty now):
0.0;0.7;John;0.29
1.0;0.23;Mike;0.55
0.0;0.98;Gil;0.68
1.0;0.34;Karl;0.73
0.0;0.44;James;0.06
1.0;0.4;Kiki;0.74
0.0;0.18;Albert;0.18
1.0;0.53;Mark;0.53
Then I export the table into csv as follows:
with open(r'E:\DATA\csv_file.csv', 'w', newline='\n') as outfile:
    df.to_csv(outfile, sep=';', index=False, line_terminator='\r')
csv_file.csv looks like this (the empty rows seem to be removed):
0.0;0.7;John;0.29
1.0;0.23;Mike;0.55
0.0;0.98;Gil;0.68
1.0;0.34;Karl;0.73
0.0;0.44;James;0.06
1.0;0.4;Kiki;0.74
0.0;0.18;Albert;0.18
1.0;0.53;Mark;0.53
Unfortunately when I import the file into postgres with the following code:
set client_encoding to 'WIN1252';
COPY my_table FROM 'E:\DATA\csv_file.csv' (DELIMITER(';'));
I get the following error:
ERROR: literal newline found in data
HINT: Use "\n" to represent newline.
CONTEXT: COPY my_table, line 25408: ""
When I open the csv_file in Notepad++, I see that it has "CR" at the end of each row up to line 25407; line 25408 and a few others have "CRLF" at the end of the line.
I tried a few things I read on this site, like opening the file in binary mode, but nothing helped.
Can anyone explain to me what is going on here and how I can solve this? Thanks
UPDATE2: it just works fine:
In [194]: pd.read_csv(r'D:\download\onetest.txt', sep=';')
Out[194]:
COL1 COL2 COL2.1 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10
0 23 21.0 UP 15/08/1986 BOBO NaN 1071001 268-Z DON 1620 NaN
1 012R 65.0 UP 15/10/1986 ESTO NaN 15065108 066-B DON 8415 NaN
2 234 8.0 EIJFTERF 17/12/1989 KING NaN 15571508 0776-V UP 6329 NaN
UPDATE: if your file(s) are small enough to fit into memory you can try this:
import io
import pandas as pd

data = []
with open(r'E:\DATA\my_file.txt', encoding='CP1252') as f:
    for line in f:
        data.append(line.rstrip())

# StringIO needs a single string, so join the cleaned lines first
df = pd.read_csv(io.StringIO('\n'.join(data)), sep=';', error_bad_lines=False)
df.to_csv(r'E:\DATA\csv_file.csv', sep=';', index=False)
OLD answer:
You are using '\r' as a line-break and PostgreSQL's COPY command expects '\n', so try the following:
df = pd.read_csv(r'E:\DATA\my_file.txt', sep=';', encoding='CP1252', error_bad_lines=False)
df.to_csv(r'E:\DATA\csv_file.csv', sep=';', index = False)
in PostgreSQL:
set client_encoding to 'WIN1252';
COPY my_table FROM 'E:\DATA\csv_file.csv' (DELIMITER(';'));
I am aware of how to implement a simple CASE-WHEN-THEN clause in Spark SQL using Scala. I am using version 1.6.2. But I need to specify an AND condition on multiple columns inside the CASE-WHEN clause. How can I achieve this in Spark using Scala?
Thanks in advance for your time and help!
Here's the SQL query that I have:
select sd.standardizationId,
case when sd.numberOfShares = 0 and
isnull(sd.derivatives,0) = 0 and
sd.holdingTypeId not in (3,10)
then
8
else
holdingTypeId
end
as holdingTypeId
from sd;
First read the table as a DataFrame:
val table = sqlContext.table("sd")
Then select with an expression, adjusting the syntax to your database:
val result = table.selectExpr("standardizationId","case when numberOfShares = 0 and isnull(derivatives,0) = 0 and holdingTypeId not in (3,10) then 8 else holdingTypeId end as holdingTypeId")
And show the result:
result.show
An alternative option, if you want to avoid the full string expression, is the following:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
val sd = sqlContext.table("sd")
val conditionedColumn: Column = when(
(sd("numberOfShares") === 0) and
(coalesce(sd("derivatives"), lit(0)) === 0) and
(!sd("holdingTypeId").isin(Seq(3,10): _*)), 8
).otherwise(sd("holdingTypeId")).as("holdingTypeId")
val result = sd.select(sd("standardizationId"), conditionedColumn)