I have the following dataframe:
Date Adj Close
Ticker
ZTS 2014-12-22 43.41
ZTS 2014-12-19 43.51
ZTS 2014-12-18 43.15
ZTS 2014-12-17 41.13
There are many more tickers than just ZTS, it continues on for many more rows.
I would like to select using both the Ticker and the Date but I cannot figure out how. I would like to select as if I were saying in SQL:
Select 'Adj Close' from prices where Ticker = 'ZTS' and 'Date' = '2014-12-22'
Thanks!
The following should work:
df[(df['Date'] == '2014-12-22') & (df.index == 'ZTS')]['Adj Close']
Here we have to use the array operator & rather than the keyword and, and you must use parentheses because of operator precedence.
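If you want to avoid the chained indexing, the same selection can be written as a single .loc call; a minimal sketch using the column names from the question:
# Single .loc call: boolean row mask plus the 'Adj Close' column label.
adj_close = df.loc[(df['Date'] == '2014-12-22') & (df.index == 'ZTS'), 'Adj Close']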
>>> import pandas as pd
>>> L = [['2014-12-22', 43.41], ['2014-12-19', 43.51], ['2014-12-18', 43.15], ['2014-12-17', 41.13]]
>>> C = ['ZTS', 'ZTS', 'ZTS', 'ZTS']
>>> df = pd.DataFrame(L, columns=['Date', 'Adj Close'], index=C)
>>> df
Date Adj Close
ZTS 2014-12-22 43.41
ZTS 2014-12-19 43.51
ZTS 2014-12-18 43.15
ZTS 2014-12-17 41.13
>>> D1 = df.loc['ZTS'][df['Date'] == '2014-12-22']['Adj Close']
>>> D1
ZTS 43.41
I've figured out how to separate out the Ticker into a subset dataframe, then index by Date, and then select by date. But I'm still wondering if there's a more efficient way.
cur_df = df.loc['A']
cur_df = cur_df.set_index(['Date'])
print(cur_df['Adj Close']['2014-11-20'])
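One arguably cleaner option is to put both Ticker and Date into a MultiIndex and select with a single .loc call; a minimal sketch, assuming Ticker is the current index and Date is a regular column:
# Append Date to the existing Ticker index, giving a (Ticker, Date) MultiIndex.
idx_df = df.set_index('Date', append=True)
# Select the 'Adj Close' value for the (Ticker, Date) pair in one lookup.
price = idx_df.loc[('ZTS', '2014-12-22'), 'Adj Close']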
Related
I have a DataFrame A_DF which has, among others, two columns, say COND_B and COND_C. Then I have two different DataFrames: B_DF with a COND_B column and C_DF with a COND_C column.
Now I would like to filter A_DF where the value matches in one OR the other. Something like:
df = A_DF.filter((A_DF.COND_B == B_DF.COND_B) | (A_DF.COND_C == C_DF.COND_C))
But I found out that it is not possible to do it like this.
EDIT
error: Attribute CON_B#264,COND_C#6 is missing from the schema: [... COND_B#532, COND_C#541 ]. Attribute(s) with the same name appear in the operation: COND_B,COND_C. Please check if the right attribute(s) are used.; it looks like I can only filter within the same DF because of the #numbers added on the fly.
So I first tried to build a list from B_DF and C_DF and filter based on that, but it was too expensive to use collect() on 100m records.
So I tried:
AB_DF = A_DF.join(B_DF, 'COND_B', 'left_semi')
AC_DF = A_DF.join(C_DF, 'COND_C', 'left_semi')
df = AB_DF.unionAll(AC_DF).dropDuplicates()
I used dropDuplicates() to remove duplicate records where both conditions were true. But even with that I got some unexpected results.
Is there some other, smoother way to do it simply? Something like an EXISTS statement in SQL?
EDIT2
I tried SQL based on @mck's response:
e.createOrReplaceTempView('E')
b.createOrReplaceTempView('B')
p.createOrReplaceTempView('P')
df = spark.sql("""select * from E where exists (select 1 from B where E.BUSIPKEY = B.BUSIPKEY) or exists (select 1 from P where E.PCKEY = P.PCKEY)""")
my_output.write_dataframe(df)
with error:
Traceback (most recent call last):
File "/myproject/abc.py", line 45, in my_compute_function
df = spark.sql("""select * from E where exists (select 1 from B where E.BUSIPKEY = B.BUSIPKEY) or exists (select 1 from P where E.PCKEY = P.PCKEY)""")
TypeError: sql() missing 1 required positional argument: 'sqlQuery'
Thanks a lot!
Your idea of using exists should work. You can do:
A_DF.createOrReplaceTempView('A')
B_DF.createOrReplaceTempView('B')
C_DF.createOrReplaceTempView('C')
df = spark.sql("""
select * from A
where exists (select 1 from B where A.COND_B = B.COND_B)
or exists (select 1 from C where A.COND_C = C.COND_C)
""")
I have written a SQL query which has a subquery in it. It is a correct MySQL query, but it does not run on PySpark.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import HiveContext
from pyspark.sql.types import *
from pyspark.sql.window import Window
from pyspark.sql.functions import *
sc = spark.sparkContext
sqlcontext = HiveContext(sc)
select location, postal, max(spend), max(revenue)
from (select a.*,
(select sum(r.revenue)
from revenue r
where r.user = a.user and
r.dte >= a.dt - interval 10 minute and
r.dte <= a.dte + interval 10 minute
) as revenue
from auction a
where a.event in ('Mid', 'End', 'Show') and
a.cat_id in (3) and
a.cat = 'B'
) a
group by location, postal;
The error I am getting every time is:
AnalysisException: u"Correlated column is not allowed in a non-equality predicate:\nAggregate [sum(cast(revenue#17 as double)) AS sum(CAST(revenue AS DOUBLE))#498]\n+- Filter (((user#2 = outer(user#85)) && (dt#0 >= cast(cast(outer(dt#67) - interval 10 minutes as timestamp) as string))) && ((dt#0 <= cast(cast(outer(dt#67) + interval 10 minutes as timestamp) as string))
Any insights on this will be helpful.
A correlated subquery using SQL syntax in PySpark is not an option, so in this case I ran the queries separately, with some tweaks to the SQL, and left joined them using df.join to get the desired output through PySpark. This is how the issue was resolved.
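For reference, a rough sketch of that "split and join" approach in the DataFrame API, assuming hypothetical DataFrames auction_df and revenue_df, that the timestamp column is consistently named dte, and that it is already a timestamp type; this is an illustration, not the exact code used:
from pyspark.sql import functions as F

# Filter the auction rows and tag each one so its revenue sum can be recovered later.
a = (auction_df
     .filter(F.col("event").isin("Mid", "End", "Show")
             & F.col("cat_id").isin(3)
             & (F.col("cat") == "B"))
     .withColumn("row_id", F.monotonically_increasing_id()))

r = revenue_df

# A left range join replaces the correlated subquery: all revenue rows for the
# same user within +/- 10 minutes of the auction timestamp.
per_row = (a.join(r,
                  (r["user"] == a["user"])
                  & (r["dte"] >= a["dte"] - F.expr("INTERVAL 10 MINUTES"))
                  & (r["dte"] <= a["dte"] + F.expr("INTERVAL 10 MINUTES")),
                  "left")
           .groupBy("row_id", a["location"], a["postal"], a["spend"])
           .agg(F.sum(r["revenue"]).alias("revenue")))

result = per_row.groupBy("location", "postal").agg(F.max("spend"), F.max("revenue"))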
I have a code similar to this:
import re
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
def regex_filter(x):
regexs = ['.*123.*']
if x and x.strip():
for r in regexs:
if re.match(r, x, re.IGNORECASE):
return True
return False
filter_udf = udf(regex_filter, BooleanType())
df_filtered = df.filter(filter_udf(df.fieldXX))
I want to use the "regexs" variable to verify whether the digits "123" appear in "fieldXX".
I don't know what I did wrong!
Could anyone help me with this?
The regexp is incorrect.
I think it should be something like:
regexs = ['.*[123].*']
You can use a SQL expression to achieve this:
df.createOrReplaceTempView("df_temp")
df_1 = spark.sql("select *, case when col1 like '%123%' then 'TRUE' else 'FALSE' end col2 from df_temp")
A disadvantage of using a UDF is that you cannot save the data frame back or do any further manipulations on that data frame.
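If the goal is just to keep rows whose fieldXX contains the sequence 123, the built-in Column methods avoid both the UDF and the temp view; a minimal sketch, assuming the column is named fieldXX:
from pyspark.sql import functions as F

# rlike takes a Java regex; contains() would do a plain substring match instead.
df_filtered = df.filter(F.col("fieldXX").rlike("123"))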
I have the following text file that has CRLF at the end of each line and that has a relatively small number of bad rows (b'Skipping line 55000: expected 14 fields, saw 15\n').
0.0;0.7;John;0.29
1.0;0.23;Mike;0.55
0.0;0.72;;Jane;;-3.4
0.0;0.98;Gil;0.68
0.0;0.48;;;0
1.0;0.34;Karl;0.73
0.0;0.44;James;0.06
1.0;0.4;Kiki;0.74
0.0;0.18;Albert;0.18
1.0;0.53;Mark;0.53
I import the file with pandas (Python 3.5.2, Windows 10) as follows:
with open(r'E:\DATA\my_file.txt', 'rb') as f:
    df = pd.read_csv(f, sep=';', encoding='CP1252', error_bad_lines=False)  # skipping bad rows
df looks like this (the bad rows seem to be empty now):
0.0;0.7;John;0.29
1.0;0.23;Mike;0.55
0.0;0.98;Gil;0.68
1.0;0.34;Karl;0.73
0.0;0.44;James;0.06
1.0;0.4;Kiki;0.74
0.0;0.18;Albert;0.18
1.0;0.53;Mark;0.53
Then I export the table into csv as follows:
with open(r'E:\DATA\csv_file.csv', 'w', newline='\n') as outfile:
    df.to_csv(outfile, sep=';', index=False, line_terminator='\r')
csv_file.csv looks like this (the empty rows seem to be removed):
0.0;0.7;John;0.29
1.0;0.23;Mike;0.55
0.0;0.98;Gil;0.68
1.0;0.34;Karl;0.73
0.0;0.44;James;0.06
1.0;0.4;Kiki;0.74
0.0;0.18;Albert;0.18
1.0;0.53;Mark;0.53
Unfortunately when I import the file into postgres with the following code:
set client_encoding to 'WIN1252';
COPY my_table FROM 'E:\DATA\csv_file.csv' (DELIMITER(';'));
I get the following error:
ERROR: literal newline found in data
HINT: Use "\n" to represent newline.
CONTEXT: COPY my_table, line 25408: ""
When I open the csv_file in Notepad++, I see that it has "CR" at the end of each row up to line 25407; line 25408 and a few others have "CRLF" at the end of the line.
I tried a few things I read on this site, like opening the file in binary mode, but nothing helped.
Can anyone explain to me what is going on here and how I can solve this? Thanks
UPDATE2: it just works fine:
In [194]: pd.read_csv(r'D:\download\onetest.txt', sep=';')
Out[194]:
COL1 COL2 COL2.1 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10
0 23 21.0 UP 15/08/1986 BOBO NaN 1071001 268-Z DON 1620 NaN
1 012R 65.0 UP 15/10/1986 ESTO NaN 15065108 066-B DON 8415 NaN
2 234 8.0 EIJFTERF 17/12/1989 KING NaN 15571508 0776-V UP 6329 NaN
UPDATE: if your file(s) are small enough to fit into memory you can try this:
import io
import pandas as pd

data = []
with open(r'E:\DATA\my_file.txt', encoding='CP1252') as f:
    for line in f:
        data.append(line.rstrip())

# StringIO needs a single string, so join the cleaned lines first
df = pd.read_csv(io.StringIO('\n'.join(data)), sep=';', error_bad_lines=False)
df.to_csv(r'E:\DATA\csv_file.csv', sep=';', index=False)
OLD answer:
You are using '\r' as a line-break and PostgreSQL's COPY command expects '\n', so try the following:
df = pd.read_csv(r'E:\DATA\my_file.txt', sep=';', encoding='CP1252', error_bad_lines=False)
df.to_csv(r'E:\DATA\csv_file.csv', sep=';', index = False)
in PostgreSQL:
set client_encoding to 'WIN1252';
COPY my_table FROM 'E:\DATA\csv_file.csv' (DELIMITER(';'));
I am aware of how to implement a simple CASE-WHEN-THEN clause in Spark SQL using Scala. I am using version 1.6.2. But I need to specify an AND condition on multiple columns inside the CASE-WHEN clause. How can I achieve this in Spark using Scala?
Thanks in advance for your time and help!
Here's the SQL query that I have:
select sd.standardizationId,
case when sd.numberOfShares = 0 and
isnull(sd.derivatives,0) = 0 and
sd.holdingTypeId not in (3,10)
then
8
else
holdingTypeId
end
as holdingTypeId
from sd;
First read the table as a DataFrame:
val table = sqlContext.table("sd")
Then select with an expression, adjusting the syntax to your database:
val result = table.selectExpr("standardizationId","case when numberOfShares = 0 and isnull(derivatives,0) = 0 and holdingTypeId not in (3,10) then 8 else holdingTypeId end as holdingTypeId")
And show the result:
result.show
An alternative option, if you want to avoid the full string expression, is the following:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
val sd = sqlContext.table("sd")
val conditionedColumn: Column = when(
(sd("numberOfShares") === 0) and
(coalesce(sd("derivatives"), lit(0)) === 0) and
(!sd("holdingTypeId").isin(Seq(3,10): _*)), 8
).otherwise(sd("holdingTypeId")).as("holdingTypeId")
val result = sd.select(sd("standardizationId"), conditionedColumn)