how do i track downloads for files in my google cloud storage - google-cloud-storage

I need a way to track downloads of MP3 files in my Cloud Storage bucket by users of my site. Aside from storage logs, are there any other solutions?

There's a storage logging feature: https://cloud.google.com/storage/docs/access-logs

This question is pretty old, but I thought I would share my solution in case anyone is still looking for one. You need to enable access logs and then write a script to download and parse them.
I used Ruby, and here's the meat of my script:
#!/usr/bin/env ruby
require 'fileutils'
temp_dir = "/tmp/access-logs"
output_file = "/tmp/download-count.csv"
# Clean up the existing access logs
FileUtils.rm_rf(temp_dir)
FileUtils.mkdir(temp_dir)
# The log object prefix here must match the one used in the regex below
`/usr/bin/gsutil -m cp "gs://my_access_logs/MyAccesssLog_usage_*" #{temp_dir} > /dev/null 2>&1`
# Collect the counts
counts = Hash.new(0)
Dir.foreach(temp_dir) do |file|
  if File.file?("#{temp_dir}/#{file}")
    # The date is embedded in the log file name as YYYY_MM_DD
    date = file.gsub(/MyAccesssLog_usage_([0-9]{4})_([0-9]{2})_([0-9]{2}).*/, '\1\2\3')
    File.readlines("#{temp_dir}/#{file}").each do |l|
      if l =~ /my-file\.zip"/
        counts[date] = counts[date] + 1
      end
    end
  end
end
File.open(output_file, "w", :encoding => "UTF-8") do |f|
  # Write the header
  f.puts("Date,Download count\n")
  # Write the counts
  counts.sort.each do |date,count|
    f.puts("#{date},#{count}\n")
  end
end
I wrote a blog post on this that goes over the script in detail as well. Here's the link: https://fusionauth.io/blog/2018/09/20/download-counts-from-google-cloud-storage
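The same counting idea can be sketched in Python by reading the usage logs as CSV rather than matching raw lines. This is a minimal sketch, not the blog post's method: it assumes the logs have already been downloaded, that each log starts with a header row naming the time_micros, cs_method, sc_status and cs_object columns, and that only full 200 responses count as downloads (ranged 206 responses, which audio players often trigger, are ignored here). The function name count_downloads is made up for illustration.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timezone


def count_downloads(log_text, object_name):
    """Count successful GET requests per UTC day for one object.

    log_text is the content of one usage log, including its header row.
    """
    counts = Counter()
    for row in csv.DictReader(io.StringIO(log_text)):
        # Skip uploads, deletes, and failed requests
        if row["cs_method"] != "GET" or row["sc_status"] != "200":
            continue
        if row["cs_object"] != object_name:
            continue
        # time_micros is microseconds since the Unix epoch
        seconds = int(row["time_micros"]) / 1_000_000
        day = datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y%m%d")
        counts[day] += 1
    return counts
```

Calling count_downloads once per log file and adding the returned counters together gives the same per-day totals that the Ruby script writes to its CSV.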

Related

gsutil multiprocessing and multithreaded does not sustain cpu usage & copy rate on GCP instance

I am running a script to copy millions of images (2.4 million, to be exact) from several GCS buckets into one central bucket, with all buckets in the same region. I was originally working from one CSV file but broke it into 64 smaller ones so that each process can iterate through its own file without waiting for the others. When the script launches on a 64 vCPU, 240 GB memory instance on GCP, it runs fine for about an hour and a half: in 75 minutes, 155 thousand files were copied over, and CPU usage registered a sustained 99%. After this, CPU usage drops drastically to 2% and the transfer rate falls significantly. I am really unsure why this happens. I keep track of files that fail by creating blank files in an errors directory, so there is no write lock on a central error file. The code is below. It does not have a spacing or syntax error; some spacing got messed up when I copied it into the post. Any help is greatly appreciated.
Thanks,
Zach
import os
import subprocess
import csv
from multiprocessing.dummy import Pool as ThreadPool
from multiprocessing import Pool as ProcessPool
import multiprocessing
gcs_destination = 'gs://dest-bucket/'
source_1 = 'gs://source-1/'
source_2 = 'gs://source-2/'
source_3 = 'gs://source-3/'
source_4 = 'gs://source-4/'
def copy(img):
    try:
        imgID = img[0] # extract name
        imgLocation = img[9] # extract its location on gcs
        print img[0] + " " + imgLocation
        source = ""
        if imgLocation == '1':
            source = source_1
        elif imgLocation == '2':
            source = source_2
        elif imgLocation == '3':
            source = source_3
        elif imgLocation == '4':
            source = source_4
        print str(os.getpid())
        command = "gsutil -o GSUtil:state_dir=.{} cp {}{}.tar.gz {}".format(os.getpid(), source, imgID , g
        prog = subprocess.call(command, shell=True)
        if prog != 0:
            command = "touch errors/{}_{}".format(imgID, imgLocation)
            os.system(command)
    except:
        print "Doing nothing with the error"
def split_into_threads(csv_file):
    with open(csv_file) as f:
        csv_f = csv.reader(f)
        pool = ThreadPool(15)
        pool.map(copy, csv_f)
if __name__ == "__main__":
    file_names = [None] * 64
    # Read in CSV file of all records
    for i in range(0,64):
        file_names[i] = 'split_origin/origin_{}.csv'.format(i)
    process_pool = ProcessPool(multiprocessing.cpu_count())
    process_pool.map(split_into_threads, file_names)
For gsutil, I strongly agree with the suggestion to enable multithreading by adding -m. Further, composite uploads (set with -o) may be unnecessary and undesirable, as the images are not gigabytes in size and need not be split into shards. They're likely in the X-XX MB range.
Within your Python function, you are calling gsutil commands, which in turn call further Python functions. It should be cleaner and more performant to use Google's client library for Python, linked below. gsutil is built for interactive CLI use rather than for being called programmatically.
https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-python
Also, for gsutil, see your ~/.boto file and look at the multi-processing and multi-threading values. Beefier machines can handle higher thread and process counts. For reference, I work from my MacBook Pro with 1 process and 24 threads. I use an Ethernet adapter hardwired into my office connection and get incredible performance off the internal SSD (>450 Mbps). That's megabits, not megabytes. The transfer rates are impressive nonetheless.
I strongly recommend using the -m flag on gsutil to enable multithreaded copying.
As an alternative, you can use the Storage Transfer Service [1] to move data between buckets.
[1] https://cloud.google.com/storage/transfer/
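The client-library suggestion above could look roughly like the following. This is a sketch under assumptions, not a tuned implementation: it assumes the google-cloud-storage package is installed with default credentials available, that sharing one client across threads is acceptable for this workload, and that copied objects keep their names. The helper names make_gcs_copier and copy_all are made up for illustration. The library import is deferred so the thread-pool part can be exercised without it.

```python
from concurrent.futures import ThreadPoolExecutor


def make_gcs_copier(destination_bucket_name):
    """Return a function that copies one object into the destination bucket.

    Assumes google-cloud-storage is installed and credentials are configured.
    """
    from google.cloud import storage  # imported lazily on purpose

    client = storage.Client()
    destination = client.bucket(destination_bucket_name)

    def copy_one(source_bucket_name, object_name):
        source_bucket = client.bucket(source_bucket_name)
        # Server-side copy: the object data does not pass through this machine
        source_bucket.copy_blob(
            source_bucket.blob(object_name), destination, object_name
        )

    return copy_one


def copy_all(items, copy_one, max_workers=16):
    """Copy every (bucket, object) pair and return the pairs that failed."""
    def attempt(item):
        try:
            copy_one(*item)
            return None
        except Exception:
            # Report the failure instead of silently swallowing it
            return item

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return [item for item in pool.map(attempt, items) if item is not None]
```

A call such as copy_all(rows, make_gcs_copier("dest-bucket")) avoids starting a new gsutil process per file, and the returned list plays the role of the errors directory in the question.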

How to extract the list of all repositories in Stash or Bitbucket?

I need to extract the list of all repos under all projects in Bitbucket. Is there a REST API for the same? I couldn't find one.
I have both on-premise and cloud Bitbucket.
Clone ALL Projects & Repositories for a given stash url
#!/usr/bin/python
#
# @author Jason LeMonier
#
# Clone ALL Projects & Repositories for a given stash url
#
# Loop through all projects: [P1, P2, ...]
# P1 > for each project make a directory with the key "P1"
# Then clone every repository inside of directory P1
# Backup a directory, create P2, ...
#
# Added ACTION_FLAG bit so the same logic can run fetch --all on every repository and/or clone.
import sys
import os
import stashy
ACTION_FLAG = 1 # Bit: +1=Clone, +2=fetch --all
url = os.environ["STASH_URL"] # "https://mystash.com/stash"
user = os.environ["STASH_USER"] # "joedoe"
pwd = os.environ["STASH_PWD"] # Yay123
stash = stashy.connect(url, user, pwd)
def mkdir(xdir):
    if not os.path.exists(xdir):
        os.makedirs(xdir)
def run_cmd(cmd):
    print ("Directory cwd: %s "%(os.getcwd() ))
    print ("Running Command: \n %s " %(cmd))
    os.system(cmd)
start_dir = os.getcwd()
for project in stash.projects:
    pk = project_key = project["key"]
    mkdir(pk)
    os.chdir(pk)
    for repo in stash.projects[project_key].repos.list():
        for url in repo["links"]["clone"]:
            href = url["href"]
            repo_dir = href.split("/")[-1].split(".")[0]
            if (url["name"] == "http"):
                print (" url.href: %s"% href) # https://joedoe@mystash.com/stash/scm/app/ae.git
                print ("Directory cwd: %s Project: %s"%(os.getcwd(), pk))
                if ACTION_FLAG & 1 > 0:
                    if not os.path.exists(repo_dir):
                        run_cmd("git clone %s" % url["href"])
                    else:
                        print ("Directory: %s/%s exists already. Skipping clone. "%(os.getcwd(), repo_dir))
                if ACTION_FLAG & 2 > 0:
                    # chdir into directory "ae" based on url of this repo, fetch, chdir back
                    cur_dir = os.getcwd()
                    os.chdir(repo_dir)
                    run_cmd("git fetch --all ")
                    os.chdir(cur_dir)
                break
    os.chdir(start_dir) # avoiding ".." in case of incorrect git directories
Once logged in: at the top right, click on your profile picture and then 'View profile'.
Take note of your username (in the example below 'YourEmail@domain.com'; keep in mind it's case sensitive).
Click on profile picture > Manage account > Personal access token > Create a token (the 'Read' access type is enough for this functionality).
For all repos in all projects:
Open a CLI and use the command below (remember to fill in your server domain!):
curl -u "YourEmail@domain.com" -X GET https://<my_server_domain>/rest/api/1.0/projects/?limit=1000
It will ask you for your personal access token; enter it and you get a JSON response with all the repos requested.
For all repos in a given project:
Pick the project you want to get repos from. In my case, the project URL is <your_server_domain>/projects/TECH/ and therefore my {projectKey} is 'TECH', which you'll need for the command below.
Open a CLI and use this command (remember to fill in your server domain and projectKey!):
curl -u "YourEmail@domain.com" -X GET https://<my_server_domain>/rest/api/1.0/projects/{projectKey}/repos?limit=50
Final touches
(optional) If you want just the titles of the repos requested and you have jq installed (for Windows, downloading the exe and adding it to PATH should be enough, but you need to restart your CLI for that new addition to be detected), you can use the command below:
curl -u $BBUSER -X GET <my_server_domain>/rest/api/1.0/projects/TECH/repos?limit=50 | jq '.values|.[]|.name'
(tested with Data Center/Atlassian Bitbucket v7.9.0 and powershell CLI)
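Instead of guessing a limit large enough to cover everything, the paged responses can be walked one page at a time. The sketch below is an assumption-laden illustration: it targets the /rest/api/1.0/repos endpoint that the Ruby script elsewhere in this thread calls, and it assumes each page carries values, isLastPage and nextPageStart fields and accepts start and limit query parameters. The names fetch_json and list_repos are made up, and authentication is left to whatever headers you pass in.

```python
import json
import urllib.parse
import urllib.request


def fetch_json(url, headers=None):
    """GET a URL and decode the JSON body (auth headers are the caller's job)."""
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def list_repos(base_url, fetch=fetch_json, limit=100):
    """Collect every repository by following the paging fields of each page."""
    repos, start = [], 0
    while True:
        query = urllib.parse.urlencode({"limit": limit, "start": start})
        page = fetch(f"{base_url}/rest/api/1.0/repos?{query}")
        repos.extend(page["values"])
        # Stop when the server says this was the final page
        if page.get("isLastPage", True):
            return repos
        start = page["nextPageStart"]
```

Passing fetch as a parameter keeps the paging loop separate from the HTTP details, so the same loop works whether you authenticate with a password or a token.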
For Bitbucket Cloud
You can use their REST API to access and perform queries on your server.
Specifically, you can use this documentation page, provided by Atlassian, to learn how to list your repositories.
For Bitbucket Server
Edit: As of receiving this tweet from Dan Bennett, I've learnt there is an API/plugin system for Bitbucket Server that could possibly cater for your needs. For docs: See here.
Edit2: Found this reference to listing personal repositories that may serve as a solution.
AFAIK there isn't a solution for you unless you built a little API for yourself that interacted with your Bitbucket Server instance.
Atlassian Documentation does indicate that to list all currently configured repositories you can do git remote -v. However I'm dubious of this as this isn't normally how git remote -v is used; I think it's more likely that Atlassian's documentation is being unclear rather than Atlassian building in this functionality to Bitbucket Server.
I ended up having to do this myself with an on-prem install of Bitbucket which didn't seem to have the REST APIs discussed above accessible, so I came up with a short script to scrape it out of the web page. This workaround has the advantage that there's nothing you need to install, and you don't need to worry about dependencies, certs or logins other than just logging into your Bitbucket server. You can also set this up as a bookmark if you urlencode the script and prefix it with javascript:.
To use this:
Open your bitbucket server project page, where you should see a list of repos.
Open your browser's devtools console. This is usually F12 or ctrl-shift-i.
Paste the following into the command prompt there.
JSON.stringify(Array.from(document.querySelectorAll('[data-repository-id]')).map(aTag => {
  const href = aTag.getAttribute('href');
  let projName = href.match(/\/projects\/(.+)\/repos/)[1].toLowerCase();
  let repoName = href.match(/\/repos\/(.+)\/browse/)[1];
  repoName = repoName.replace(' ', '-');
  const templ = `https://${location.host}/scm/${projName}/${repoName}.git`;
  return {
    href,
    name: aTag.innerText,
    clone: templ
  }
}));
The result is a JSON string containing an array with the repo's URL, name, and clone URL.
[{
"href": "/projects/FOO/repos/some-repo-here/browse",
"name": "some-repo-here",
"clone": "https://mybitbucket.company.com/scm/foo/some-repo-here.git"
}]
This ruby script isn't the greatest code, which makes sense, because I'm not the greatest coder. But it is clear, tested, and it works.
The script filters the output of a Bitbucket API call to create a complete report of all repos on a Bitbucket server. Report is arranged by project, and includes totals and subtotals, a link to each repo, and whether the repos are public or personal. I could have simplified it for general use, but it's pretty useful as it is.
There are no command line arguments. Just run it.
#!/usr/bin/ruby
#
# @author Bill Cernansky
#
# List and count all repos on a Bitbucket server, arranged by project, to STDOUT.
#
require 'json'
bbserver = 'http(s)://server.domain.com'
bbuser = 'username'
bbpassword = 'password'
bbmaxrepos = 2000 # Increase if you have more than 2000 repos
reposRaw = JSON.parse(`curl -s -u '#{bbuser}':'#{bbpassword}' -X GET #{bbserver}/rest/api/1.0/repos?limit=#{bbmaxrepos}`)
projects = {}
repoCount = reposRaw['values'].count
reposRaw['values'].each do |r|
  projID = r['project']['key']
  if projects[projID].nil?
    projects[projID] = {}
    projects[projID]['name'] = r['project']['name']
    projects[projID]['repos'] = {}
  end
  repoName = r['name']
  projects[projID]['repos'][repoName] = r['links']['clone'][0]['href']
end
privateProjCount = projects.keys.grep(/^\~/).count
publicProjCount = projects.keys.count - privateProjCount
reportText = ''
privateRepoCount = 0
projects.keys.sort.each do |p|
  # Personal project slugs always start with tilde
  isPrivate = p[0] == '~'
  projRepoCount = projects[p]['repos'].keys.count
  privateRepoCount += projRepoCount if isPrivate
  reportText += "\nProject: #{p} : #{projects[p]['name']}\n #{projRepoCount} #{isPrivate ? 'PERSONAL' : 'Public'} repositories\n"
  projects[p]['repos'].keys.each do |r|
    reportText += sprintf(" %-30s : %s\n", r, projects[p]['repos'][r])
  end
end
puts "BITBUCKET REPO REPORT\n\n"
puts sprintf(" Total Projects: %5d Public: %5d Personal: %5d", projects.keys.count, publicProjCount, privateProjCount)
puts sprintf(" Total Repos: %5d Public: %5d Personal: %5d", repoCount, repoCount - privateRepoCount, privateRepoCount)
puts reportText
The way I solved this was to fetch the HTML page with a ridiculously high limit, like this (in Python):
import re
import subprocess
from bs4 import BeautifulSoup
cmd = "curl -s -k --user " + username + " https://URL/projects/<KEY_PROJECT_NAME>/?limit=10000"
Then I parsed it with BeautifulSoup:
html = subprocess.check_output(cmd, shell=True).rstrip().decode("utf-8")
parsed_html = BeautifulSoup(html, 'html.parser')
list1 = []
for a in parsed_html.find_all("a", href=re.compile("/<projects>/<KEY_PROJECT_NAME>/repos/")):
    list1.append(a.string)
print(list1)
To use this, make sure you change URL and <KEY_PROJECT_NAME>; the latter should be the Bitbucket project you are targeting. All I am doing is parsing an HTML file.
Here's how I pulled the list of repos from Bitbucket Cloud.
Set up an OAuth consumer
Go to your workspace settings and set up an OAuth consumer. You should be able to go there directly using this link: https://bitbucket.org/{your_workspace}/workspace/settings/api
The only setting that matters is the callback URL, which can be anything; I chose http://localhost
Once set up, this will display a key and secret pair for your OAuth consumer. I will refer to these as {oauth_key} and {oauth_secret} below.
Authenticate with the API
Go to the following URL, making sure you replace {oauth_key}: https://bitbucket.org/site/oauth2/authorize?client_id={oauth_key}&response_type=code
This will redirect you to something like http://localhost/?code=xxxxxxxxxxxxxxxxxx. Make a note of that code; I'll refer to it as {oauth_code} below.
In your terminal, run the following, replacing the placeholders: curl -X POST -u "{oauth_key}:{oauth_secret}" https://bitbucket.org/site/oauth2/access_token -d grant_type=authorization_code -d code={oauth_code}
This should return JSON including the access_token. I'll refer to that access token as {oauth_token}.
Get the list of repos
You can now run the following to get the list of repos. Bear in mind that your {oauth_token} lasts 2hrs by default.
curl --request GET \
--url 'https://api.bitbucket.org/2.0/repositories/pageant?page=1' \
--header 'Authorization: Bearer {oauth_token}' \
--header 'Accept: application/json'
This response is paginated so you'll need to page through the responses, 10 repositories at a time.
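Paging through the responses can be scripted. The sketch below assumes each page is a JSON object with a values list and, when more pages exist, a next field holding the URL of the following page; the function names make_fetcher and iter_repositories are made up for illustration, and the token is the {oauth_token} obtained above.

```python
import json
import urllib.request


def make_fetcher(oauth_token):
    """Build a fetch(url) function that sends the bearer token with each request."""
    def fetch(url):
        request = urllib.request.Request(
            url,
            headers={
                "Authorization": f"Bearer {oauth_token}",
                "Accept": "application/json",
            },
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    return fetch


def iter_repositories(first_url, fetch):
    """Yield every repository, following each page's "next" link until it is absent."""
    url = first_url
    while url:
        page = fetch(url)
        yield from page.get("values", [])
        url = page.get("next")
```

For example, list(iter_repositories(first_page_url, make_fetcher(token))) collects all repositories, where first_page_url is the same URL as in the curl command above.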

Using __END__ and DATA in Chef recipes (to run legacy shell scripts)

I'm migrating some shell scripts to Chef recipes. Some of these scripts are fairly involved, so just to make life easier in the short term and to avoid introducing bugs in rewriting everything in Chef/Ruby, I'd like to just run some of them as-is. They're all well-written and idempotent, so honestly there's no rush, but of course, the eventual goal is to rewrite them.
One cool feature of Ruby is its __END__ keyword/method: Lines below __END__ will not be executed. Those lines will be available via the special filehandle DATA.
It would be cool to ship the shell scripts as-is inside the recipe after __END__, maybe something like the following, which I placed in chef-repo/cookbooks/ruby-data-test/recipes/default.rb:
file = Tempfile.new(File.basename(__FILE__))
file << DATA.read
bash file.path
file.unlink
__END__
echo "Hello, world"
However when I run this (with chef-solo -c solo.rb --override-runlist 'recipe[ruby-data-test]'), I get the following error:
[2014-10-03T17:14:56+00:00] ERROR: uninitialized constant Chef::Recipe::DATA
I'm pretty new to Chef, but I'm guessing the above is something about Chef wrapping my recipe in a class, and there's something simple preventing me from accessing DATA. Since it's "global" (?) I tried putting a dollar sign ($DATA) in front of it but that failed with:
NoMethodError
-------------
undefined method `read' for nil:NilClass
So the question is: How do I access DATA in my Chef recipe? Thanks!
It appears you don't have access to DATA, but you can fake it by reading in the current file yourself and splitting on __END__, like Sinatra does.
I ended up making a Chef LWRP for reuse. I don't know if I'll actually end up using this, but I wanted to figure it out. Like I said, I'm a Chef/Ruby noob, so any better ideas or suggestions welcome!
ruby_data_test/recipes/default.rb:
ruby_data_test_execute_ruby_data __FILE__
__END__
#!/bin/bash
set -o errexit
date
echo "Hello, world"
ruby_data_test/resources/execute_ruby_data.rb:
actions :execute_ruby_data
default_action :execute_ruby_data
attribute :source, :name_attribute => true, :required => true
attribute :args, :kind_of => Array
attribute :ignore_errors, :kind_of => [TrueClass, FalseClass], :default => false
ruby_data_test/providers/execute_ruby_data.rb:
require 'open3'
require 'tempfile'
def whyrun_supported?
  true
end
use_inline_resources
action :execute_ruby_data do
  converge_by("Executing #{@new_resource}") do
    Chef::Log.info("Executing #{@new_resource}")
    file_who_called_me = @new_resource.source
    io = ::IO.respond_to?(:binread) ? ::IO.binread(file_who_called_me) : ::IO.read(file_who_called_me)
    # Everything after the __END__ line is the embedded shell script
    app, data = io.gsub("\r\n", "\n").split(/^__END__$/, 2)
    data.lstrip!
    file = Tempfile.new('execute_ruby_data')
    file << data
    file.chmod(0755)
    file.close
    exit_status = ::Open3.popen2e(file.path, *@new_resource.args) do |stdin, stdout_and_stderr, wait_thr|
      stdout_and_stderr.each { |line| puts line }
      wait_thr.value # exit status
    end
    if exit_status != 0 && !@new_resource.ignore_errors
      raise RuntimeError
    end
  end
end
Here's the output:
$ chef-solo -c solo.rb --override-runlist 'recipe[ruby_data_test]'
Starting Chef Client, version 11.12.4
[2014-10-03T21:50:29+00:00] WARN: Run List override has been provided.
[2014-10-03T21:50:29+00:00] WARN: Original Run List: []
[2014-10-03T21:50:29+00:00] WARN: Overridden Run List: [recipe[ruby_data_test]]
Compiling Cookbooks...
Converging 1 resources
Recipe: ruby_data_test::default
* ruby_data_test_execute_ruby_data[/root/chef/chef-repo/cookbooks/ruby_data_test/recipes/default.rb] action execute_ruby_dataFri Oct 3 21:50:29 UTC 2014
Hello, world
- Executing ruby_data_test_execute_ruby_data[/root/chef/chef-repo/cookbooks/ruby_data_test/recipes/default.rb]
Running handlers:
Running handlers complete
Chef Client finished, 1/1 resources updated in 1.387608 seconds

Want to call a Progress 4GL 9.1D procedure through an Ajax call

I want to create a web service for my PhoneGap Android application that will in turn call a Progress 4GL 9.1D procedure.
Does anyone have an idea how to create a web service for this?
That will be a struggle. You CAN create a server that listens to a socket but you will have to handle everything yourself!
Look at this example.
However, you are likely better off writing the web service in a language with better support and then finding another way of getting the data out of the DB. If you're really stuck with a 10+ year old version, you really should consider migrating to something else.
You don't have to upgrade everything -- you could just obtain a license for a version 10 client. V10 clients can connect to v9 databases (the rule is that the client can be up to one major release higher) so you could use that to build a SOAP service. Or you could get a v10 "webspeed" license.
Or you could write a simple enough CGI wrapper to some 4GL code if you have those sorts of skills. I occasionally toss together something like this:
#!/bin/bash
#
LOGFILE=/tmp/myservice.log
SVC=sample
# if a FIFO does not exist for the specified service then create it in /tmp
#
# $1 = direction -- in or out
# $2 = unique service name
#
pj_fifo() {
  if [ ! -p /tmp/$2.$1 ]
  then
    echo `date` "Creating FIFO $2.$1" >> ${LOGFILE}
    rm -f /tmp/$2.$1 >> ${LOGFILE} 2>&1
    /bin/mknod -m 666 /tmp/$2.$1 p >> ${LOGFILE} 2>&1
  fi
}
if [ "${REQUEST_METHOD}" = "POST" ]
then
  read QUERY_STRING
fi
# header must include a blank line
#
# we're returning XML
#
echo "Content-type: text/xml" # or text/html or text/plain
echo
# debugging echo...
#
# echo $QUERY_STRING
#
# echo "<html><head><title>Sample CGI Interface</title></head><body><pre>QUERY STRING = ${QUERY_STRING}</pre></body></html>"
# ensure that the FIFOs exist
#
pj_fifo in $SVC
pj_fifo out $SVC
# make the request
#
echo "$QUERY_STRING" > /tmp/${SVC}.in
# send the response back to the requestor
#
cat /tmp/${SVC}.out
# all done!
#
echo `date` "complete" >> ${LOGFILE}
Then you just arrange for a background session to be reading /tmp/sample.in:
/* sample.p
*
* mbpro dbname -p sample.p > /tmp/sample.log 2>&1 &
*
*/
define variable request as character no-undo.
define variable result as character no-undo.
input from value( "/tmp/sample.in" ).
output to value( "/tmp/sample.out" ).
do while true:
import unformatted request.
/* parse it and do something with it... */
result = '<?xml version="1.0"?>~n<status>~n'.
result = result + "ok". /* or whatever turns your crank... */
result = result + "</status>~n".
end.
When input arrives parse the line and do whatever. Spit the answer back out to /tmp/sample.out and loop. It's not very fancy but if your needs are modest it is easy to do. If you need more scalability, robustness or security then you might ultimately need something more sophisticated but this will at least let you get started prototyping.

Is there a way to tell django compressor to create source maps

I want to be able to debug minified, compressed JavaScript code on my production site. Our site uses django compressor to create minified and compressed JS files. I recently read that Chrome can use source maps to help debug such JavaScript. However, I don't know how, or whether it is possible, to tell django compressor to create source maps when compressing the JS files.
I don't have a good answer regarding outputting separate source map files, however I was able to get inline working.
Prior to adding source maps my settings.py file used the following precompilers
COMPRESS_PRECOMPILERS = (
('text/coffeescript', 'coffee --compile --stdio'),
('text/less', 'lessc {infile} {outfile}'),
('text/x-sass', 'sass {infile} {outfile}'),
('text/x-scss', 'sass --scss {infile} {outfile}'),
('text/stylus', 'stylus < {infile} > {outfile}'),
)
After a quick
$ lessc --help
You find out you can put the less and map files in to the output css file. So my new text/less precompiler entry looks like
('text/less', 'lessc --source-map-less-inline --source-map-map-inline {infile} {outfile}'),
Hope this helps.
Edit: Forgot to add, lessc >= 1.5.0 required for this, to upgrade use
$ [sudo] npm update -g less
While I couldn't get this to work with django-compressor (though it should be possible, I think I just had issues getting the app set up correctly), I was able to get it working with django-assets.
You'll need to add the appropriate command-line argument to the less filter source code as follows:
diff --git a/src/webassets/filter/less.py b/src/webassets/filter/less.py
index eb40658..a75f191 100644
--- a/src/webassets/filter/less.py
+++ b/src/webassets/filter/less.py
@@ -80,4 +80,4 @@ class Less(ExternalTool):
     def input(self, in_, out, source_path, **kw):
         # Set working directory to the source file so that includes are found
         with working_directory(filename=source_path):
-            self.subprocess([self.less or 'lessc', '-'], out, in_)
+            self.subprocess([self.less or 'lessc', '--line-numbers=mediaquery', '-'], out, in_)
Aside from that tiny addition:
make sure you've got the node -- not the ruby gem -- less compiler (>=1.3.2 IIRC) available in your path.
turn on the sass source-maps option buried away in chrome's web inspector config pages. (yes, 'sass' not less: less tweaked their debug-info format to match sass's, since sass had already implemented a chrome-compatible mapping and their formats weren't that different to begin with anyway...)
Not out of the box, but you can write a custom filter:
from compressor.filters import CompilerFilter
class UglifyJSFilter(CompilerFilter):
    # Adjacent string literals are joined; each part ends with a space
    command = (
        "uglifyjs -c -m "
        "--source-map-root={relroot}/ "
        "--source-map-url={name}.map.js "
        "--source-map={relpath}/{name}.map.js -o {output}"
    )