I'm using Talend to issue API calls to Pardot and retrieve records from the Prospect table.
This gives me just 200 records.
Can anyone suggest a way to retrieve all the records available in this table?
Or how to loop and retrieve records in chunks of 200, terminating when zero records are returned?
You can only retrieve 200 records at a time. If you want to retrieve all the records, you have to loop using the offset parameter, increasing the offset by 200 each time: offset=0 retrieves the first 200 records, offset=200 retrieves the next 200, and so on. Here is how I retrieved all the records into a CSV file in Python.
import pandas as pd
import requests

base_url = ("https://pi.pardot.com/api/Prospect/version/4/do/query?"
            "user_key=&api_key=&output=bulk&format=json&sort_by=&src=&offset=")

i = 0
url = base_url + str(i)
final_data = pd.DataFrame()  # initialize an empty dataframe
while requests.get(url).json()['result'] is not None:
    # fetch the current page of (at most) 200 prospects and prepend it
    data = pd.DataFrame.from_dict(requests.get(url).json()['result']['prospect'])
    final_data = data.append(final_data)
    i = i + 200
    url = base_url + str(i)  # move to the next chunk of 200 records
final_data.to_csv('complete_data.csv', index=False)
I used the condition requests.get(url).json()['result'] is not None because I had no idea how many offsets there would be, so I check each offset for records. This might take too long if you have several thousand offsets. Hope this helps.
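If you drop output=bulk (which, as far as I know, omits the total count), the query response also reports the total number of matches, so you can work out the number of offsets up front instead of probing until an empty result. A minimal sketch, assuming the response exposes result.total_results in your API version:

import pandas as pd
import requests

# Assumption: without output=bulk the response includes result.total_results;
# verify the field name against your Pardot API version.
base_url = ("https://pi.pardot.com/api/Prospect/version/4/do/query?"
            "user_key=&api_key=&format=json&sort_by=&src=&offset=")
total = requests.get(base_url + "0").json()['result']['total_results']
final_data = pd.DataFrame()
for offset in range(0, total, 200):  # one request per page of 200 records
    page = requests.get(base_url + str(offset)).json()['result']['prospect']
    final_data = final_data.append(pd.DataFrame.from_dict(page))
final_data.to_csv('complete_data.csv', index=False)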
Providing a modified and working solution.
I have avoided the offset parameter, as it is recommended not to use it for bulk data pulls.
"""defining a function for getting an api key using credentials for Pardot user"""
def api_key_gen():
import requests
import json
url = "https://pi.pardot.com/api/login/version/3%20HTTP/1.1"
querystring = {"email":"","password":"","user_key":"","format":"json"}
headers = {
'Connection': "keep-alive",
'cache-control': "no-cache"
}
response = requests.request("POST", url, headers=headers, params=querystring)
# print(json.loads(response.text)['api_key'])
return (json.loads(response.text)['api_key'])
These two functions are used to fetch the data.
The first function fetches data created between two dates.
The second function is used when a large number of records share the same creation second.
def fetchFromDate(api_key, max_date, target_date):
    url = ("https://pi.pardot.com/api/prospect/version/3/do/query?user_key=&api_key=" + str(api_key) +
           "&output=bulk&created_after=" + str(max_date) + "&created_before=" + str(target_date) + "&format=json")
    result = json.loads((requests.request("GET", url)).text)['result']['prospect']
    data = pd.DataFrame(result)
    return data

def fetchFromId(api_key, max_id):
    url = ("https://pi.pardot.com/api/prospect/version/3/do/query?user_key=&api_key=" + str(api_key) +
           "&output=bulk&id_greater_than=" + str(max_id) + "&format=json")
    result = json.loads((requests.request("GET", url)).text)['result']['prospect']
    data = pd.DataFrame(result)
    return data
The code below fetches data from the Pardot API one month at a time, to keep the data size small. Whenever the API key expires, a new API key is fetched and used in the URL. Dates are compared with each other so that only the desired period is fetched. I have tried to keep the whole process dynamic, except for the date parameters.
import pandas as pd
import requests
import json
from datetime import datetime, timedelta

"""Use a start date and a target date to fetch data for a particular time span."""
max_date = '2014-02-03 08:02:57'
target_date = datetime.strptime('2014-06-30 23:59:59', '%Y-%m-%d %H:%M:%S')
final_data = pd.DataFrame()  # initialize an empty dataframe
api_key = api_key_gen()
last_maxDate = max_date
last_maxId = ''  # get the id of the first record for the desired year and fill it in here
url = ("https://pi.pardot.com/api/prospect/version/3/do/query?user_key=&api_key=" + str(api_key) +
       "&output=bulk&created_after=" + str(max_date) + "&created_before=" + str(target_date) + "&format=json")
print("Start Time : ", datetime.now())
i = 1
while json.loads((requests.request("GET", url)).text)['result'] is not None:
    last_maxDate = datetime.strptime(str(last_maxDate), '%Y-%m-%d %H:%M:%S')
    api_key = api_key_gen()  # refresh the API key in case the previous one has expired
    data = fetchFromDate(api_key, max_date, target_date)
    if len(data) < 200:
        # last page for this period
        final_data = data.append(final_data, ignore_index=True)
        break
    else:
        max_id = max(data['id'])
        max_date = max(data['created_at'])
        max_date = datetime.strptime(str(max_date), '%Y-%m-%d %H:%M:%S') - timedelta(seconds=1)
        if max_date == last_maxDate and int(max_id) == int(last_maxId):
            # more than 200 records share the same creation second, so page by id instead
            print("Running through Id's")
            api_key = api_key_gen()
            data = fetchFromId(api_key, max_id)
            max_id = max(data['id'])
            max_date = max(data['created_at'])
        final_data = data.append(final_data, ignore_index=True)
        last_maxDate = max_date
        last_maxId = max_id
        print("Running Loop :", i, max_date, max_id)
        i += 1
        print(max(data['created_at']))
        print(max(data['id']))
final_data.to_csv('file.csv', index=False)
print("End Time : ", datetime.now())
Also, the Pardot API key expires every 60 minutes, so it is better to use PyPardot4 in Python, which fetches a new API key whenever the current one expires.
You can use the following code.
from pypardot.client import PardotAPI
import pandas as pd

p = PardotAPI(
    email='',
    password='',
    user_key='')
p.authenticate()

i = 0
final_data = pd.DataFrame()
while i <= p.prospects.query()['total_results'] - 1:
    print(i)
    data = pd.DataFrame.from_dict(p.prospects.query(format='json', sort_by='id', offset=i)['prospect'])
    final_data = data.append(final_data, sort=True)
    i = i + 200
final_data.to_csv('complete_data.csv', index=False)
The above answers are good for looping. If you only need a limited set of fields, look into the mobile response format; it doesn't have the 200-record limit. However, it only supports a few predefined fields.
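A minimal sketch of that, assuming the mobile format is requested on the same query endpoint via output=mobile (check which fields your API version actually returns in this format):

import pandas as pd
import requests

# Assumption: same query endpoint as above, with the output format switched to 'mobile'
url = ("https://pi.pardot.com/api/prospect/version/4/do/query?"
       "user_key=&api_key=&output=mobile&format=json")
prospects = requests.get(url).json()['result']['prospect']
pd.DataFrame.from_dict(prospects).to_csv('mobile_prospects.csv', index=False)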
You can use the Export API, which works for the Prospect table. A single export can cover a year's worth of data, so logically you create one export query per year.
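A rough sketch of that year-by-year split (create_year_export here is a hypothetical placeholder; the actual Export API endpoints, request body, and polling/download flow are not shown and should be taken from the Pardot Export API documentation):

from datetime import datetime

def create_year_export(created_after, created_before):
    # Placeholder: submit an Export API job for this date range and
    # download its result files once the job completes.
    raise NotImplementedError

# One export job per calendar year
for year in range(2014, datetime.now().year + 1):
    create_year_export(f"{year}-01-01 00:00:00", f"{year}-12-31 23:59:59")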
Related
I want to translate the Postgres query below into SQLAlchemy asyncio format, but so far I could only retrieve the first column, or the whole row at once, while I need to retrieve only two columns per record:
SELECT
table.xml_uri,
max(table.created_at) AS max_1
FROM
table
GROUP BY
table.xml_uri
ORDER BY
max_1 DESC;
I arrived at the translation below, but it only returns the first column, xml_uri, while I need both columns. I have left the order_by clause commented out for now, as it also generates the error below when commented in:
Sqlalchemy query:
from sqlalchemy import func, select
from sqlalchemy.ext.asyncio import AsyncSession

query = "%{}%".format(query)
records = await session.execute(
    select(BaseModel.xml_uri, func.max(BaseModel.created_at))
    .order_by(BaseModel.created_at.desc())
    .group_by(BaseModel.xml_uri)
    .filter(BaseModel.xml_uri.like(query))
)
# Get all the records
result = records.scalars().all()
Error generated when commenting in order_by clause:
column "table.created_at" must appear in the GROUP BY clause or be used in an aggregate function
The query is returning a result set consisting of two-element tuples, and .scalars() takes only the first element of each tuple. Using the result of session.execute() directly will provide the desired behaviour.
It's not permissible to order by the date field directly, as it isn't part of the projection, but you can give the max column a label and use that to order.
Here's an example script:
import sqlalchemy as sa
from sqlalchemy import orm

Base = orm.declarative_base()

class MyModel(Base):
    __tablename__ = 't73018397'
    id = sa.Column(sa.Integer, primary_key=True)
    code = sa.Column(sa.String)
    value = sa.Column(sa.Integer)

engine = sa.create_engine('postgresql:///test', echo=True, future=True)
Base.metadata.drop_all(engine)
Base.metadata.create_all(engine)
Session = orm.sessionmaker(engine, future=True)

with Session.begin() as s:
    for i in range(10):
        # Split values based on odd or even
        code = 'AB'[i % 2 == 0]
        s.add(MyModel(code=code, value=i))

with Session() as s:
    q = (
        sa.select(MyModel.code, sa.func.max(MyModel.value).label('mv'))
        .group_by(MyModel.code)
        .order_by(sa.text('mv desc'))
    )
    res = s.execute(q)
    for row in res:
        print(row)
which generates this query:
SELECT
t73018397.code,
max(t73018397.value) AS mv
FROM t73018397
GROUP BY t73018397.code
ORDER BY mv desc
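Applied back to the original async query, the same idea looks roughly like this (a sketch reusing the question's BaseModel, session, and filter; the label name mv is arbitrary):

from sqlalchemy import func, select, text

records = await session.execute(
    select(BaseModel.xml_uri, func.max(BaseModel.created_at).label('mv'))
    .group_by(BaseModel.xml_uri)
    .order_by(text('mv DESC'))
    .filter(BaseModel.xml_uri.like(query))
)
result = records.all()  # list of (xml_uri, max_created_at) tuples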
I have used the COPY command as below:
copy into test2 from #%test2 file_format = (format_name = 'CSV') on_error = 'CONTINUE';
My file contains some character data in a number field (for all records), so the COPY result is LOAD_FAILED, and I can get the failed records using the query below (in this case all records are failed records):
select * from table(validate("TEST2", job_id=>'Corresponding JOB ID'));
I also tried giving invalid dates and still got all the bad records from the above query.
Then I tried the COPY command as below:
copy into test2(test1,test2) from (select $1,to_date($2,'YYYYDDD') from #%test2) file_format = (format_name = 'CSV') on_error = 'CONTINUE';
The COPY result was again LOAD_FAILED, but now I do not get any failed records from the query below:
select * from table(validate("TEST2", job_id=>'Corresponding JOB ID'));
Does this work only for a regular COPY without any conversion function in the COPY, or is there another reason?
Adding one more example after seeing Mike's response below:
File data:
1,2018-1-34
2,2/3/2016
3,2020124
table-> create table test2(test1 number,test2 date)
copy into test2(test1,test2) from (select $1,to_date($2,'YYYYDD') from #%test2) file_format = (format_name = 'CSV') on_error = 'CONTINUE';
The first and third records are available from the VALIDATE query; only the second record is not present in this case. That's weird (all three records failed in the COPY).
As Mike said below in the comments, VALIDATE does not work with transformed data in a COPY, but why does it return two records in that case? It should either return nothing at all or return all of them.
Per the documentation, since you are transforming the data during the COPY INTO, the VALIDATE function will no longer work:
This function does not support COPY INTO statements that
transform data during a load.
https://docs.snowflake.com/en/sql-reference/functions/validate.html#usage-notes
When querying the repository using criteria, it returns an object containing multiple results, where each result is an object mapped to the model. So one can get the preferred result using ->offsetGet(). How can I get the preferred result(s) using a parameter value instead?
Example:
A table has three fields: uid, guide_option and fact. The table contains multiple records and is mapped to its model. The query fetches data using the guide_option field and returns several rows:
$query = $this->createQuery();
$constraints = [$query->equals('guide_option', $guideOption)];
$query->matching($query->logicalAnd($constraints));
$resultsets = $query->execute();
#filter by offset
$offset = $resultsets->offsetGet(0);
#filter by fact
$set = ???
How does one filter by the fact field?
I am new to ATG, and I have this question: how can I write an RQLQuery that gives me the same data as this SQL query?
select avg(rating) from rating WHERE album_id = ?;
I'm trying this way:
RqlStatement statement;
Object rqlparam[] = new Object[1];
rqlparam[0] = album_Id;
statement= RqlStatement.parseRqlStatement("album_id= ? 0");
MutableRepository repository = (MutableRepository) getrMember();
RepositoryView albumView = repository.getView(ALBUM);
This query returns an item for a specific album_id. How can I improve my RQL query so that it returns the average field value, like the SQL query above?
There is no RQL syntax that will allow for the calculation of an average value for items in the query. As such you have two options. You can either execute your current statement:
album_id= ? 0
And then loop through the resulting RepositoryItem[] and calculate the average yourself (this could be time-consuming on large datasets and means you'll have to load all the results into memory, so it is perhaps not the best solution), or you can implement a SqlPassthroughQuery that you execute:
Object params[] = new Object[1];
params[0] = albumId;
Builder builder = (Builder)view.getQueryBuilder();
String str = "select avg(rating) from rating WHERE album_id = ? group by album_id";
RepositoryItem[] items =
    view.executeQuery(builder.createSqlPassthroughQuery(str, params));
This will execute the average calculation on the database (something it is quite good at doing) and save you CPU cycles and memory in the application.
That said, don't make a habit of using SqlPassthroughQuery, as it means you don't get to use the repository cache as much, which could be detrimental to your application.
I have a question on how I can extract data from Moodle based on a parameter that's "greater than" or "less than" a given value.
For instance, I'd like to do something like:
$record = $DB->get_record_sql('SELECT * FROM {question_attempts} WHERE questionid > ?', array(1));
How can I achieve this? Each time I try this, I get a single record in return instead of all the rows that meet the criteria.
Also, how can I get a query like this to work perfectly?
$sql = ('SELECT * FROM {question_attempts} qa join {question_attempt_steps} qas on qas.questionattemptid = qa.id');
In the end, I want to get all the quiz question marks for each user on the system, in each quiz.
Use $DB->get_records_sql() instead of $DB->get_record_sql() if you want more than one record to be returned.
Thanks Davo for the response back then (2016, wow!). I did manage to learn this over time.
Well, here is an example of a proper query for getting results from the Moodle DB, using the > or < operators:
$quizid = 100; // just an example param here
$cutoffmark = 40; // anyone above 40% gets a Moodle badge!!
$sql = "SELECT q.name, qg.userid, qg.grade FROM {quiz} q JOIN {quiz_grades} qg ON qg.quiz = q.id WHERE q.id = ? AND qg.grade > ?";
$records = $DB->get_records_sql($sql, [$quizid, $cutoffmark]);
The query will return the quiz results, with the user IDs and grades of all students whose grade is over 40.