I'm scraping Amazon using perl, but the images are not being captured, do I have the correct parameters?

I'm trying to scrape products from Amazon (shoes, to be precise), such as this one: http://www.amazon.com/DC-Mens-Skate-Black-Plaid/dp/B005BWAQVU/ref=sr_1_1?ie=UTF8&qid=1333376200&sr=8-1
For some reason the images are no longer downloaded and saved. I am afraid I may have incorrect parameters for the images.
Here is the excerpt from my code in which that part takes place:
sub get_data
{
    my($product_content,$gender,$product_category,$prod_tag,$sub_category)=@_;
    my($product_name,$product_code,$brand,$product_price,$image_file,$image_name,$prod_size,$size_name,$color_name,$prod_color);
    if($product_content=~m/<div\s*id\=\"atfResults\"[^>]*>([\w\W]+?)<div\s*id\=\"centerBelowStatic\">/is)
    {
        my $block=$1;
        while($block=~m/<div\s*class\=\"image\">\s*<a[^>]*?href\=\"([^>]+?)\"[^>]*>\s*<img[^>]*>/igs)
        {
            my $source_url=$1;
            $source_url=URI::URL->new_abs($source_url,$home_url);
            my $final_content=&get_cont($source_url,$home_url,'GET');
            if($final_content=~m/<h1[^>]*>\s*<[^>]*>\s*([^>]+?)\s*</is)
            {
                $product_name=decode_entities($1);
                print "\n\n$count :: Product Name :: $product_name\n";
                $product_name=~s/\'/\'\'/igs;
            }
            if($source_url=~m/\/dp\/([^>]+?)\//is)
            {
                $product_code=$1;
                $product_code=~s/\'/\'\'/igs;
            }
            if($final_content=~m/<span\s*class\=\"brandLink\">\s*<[^>]*>\s*([^>]+?)\s*</is)
            {
                $brand=decode_entities($1);
                print "Product Brand :: $brand\n";
                $brand=~s/\'/\'\'/igs;
            }
            if($final_content=~m/<td\s*class\=\"priceBlockLabelPrice\">\s*Price\s*\:\s*<[^>]*>\s*<[^>]*>\s*<[^>]*>\s*([^>]+?)\s*</is)
            {
                $product_price=$1;
                $product_price=~s/\'/\'\'/igs;
            }
            if($final_content=~m/<script[^>]*>\s*var\s*colorImages\s*\=\s*\{([\w\W]+?)\]\};/is)
            {
                my $color_block=$1;
                my $col=1;
                $image_file="";
                $image_name="";
                while($color_block=~m/\"large\"\:\[\"([^>]+?)\"/igs)
                {
                    my $img_src=$1;
                    if($img_src=~m/(?:.+\/)([^>]*?\.[a-z]+)/is)
                    {
                        my $img_fname=$1;
                        getstore($img_src,"Images/$img_fname");
                        $img_fname=$dir."/Images/$img_fname";
                        $image_name=$image_name."Product_Image_filename_".$col.",";
                        $img_fname=~s/\'/\'\'/igs;
                        $image_file=$image_file."\'$img_fname\',";
                        $col++;
                    }
                    undef($img_src);
                    last if($col>10);
                }
                undef($color_block);
            }
Everything else seems to save fine, but the images, nada. I'm not really a Perl expert either, so if it's something obvious, forgive me.

Why would you scrape their site when Amazon provides an API for getting hold of their product details?

You should use the WWW::Scripter module for that. A new version of this module was released this morning, and one of its new features is image fetching. The module fetches images with the proper referer and cookies (if applicable), so you should have no problem capturing images.

Use Firefox and install the add-on "Disable HTTP referer at Startup". Then restart Firefox and try again. You will get the images.
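
Both answers above point at the Referer header and cookies sent with the image request. As a rough illustration only, here is a minimal sketch in JavaScript (Node.js) rather than Perl. The helper names buildImageRequest and imageFileName and the User-Agent string are hypothetical, and whether Amazon actually checks the Referer is the answerers' assumption, not something verified here.

```javascript
// Hypothetical helper: build request options for an image download that
// presents the product page as the referring document.
function buildImageRequest(imageUrl, pageUrl) {
  const target = new URL(imageUrl);
  return {
    protocol: target.protocol,
    hostname: target.hostname,
    path: target.pathname + target.search,
    method: 'GET',
    headers: {
      Referer: pageUrl,
      // Placeholder value, not taken from the question.
      'User-Agent': 'Mozilla/5.0 (compatible; example-scraper)'
    }
  };
}

// Hypothetical helper: take the local file name from the last path segment,
// similar to what the regex in the question's code extracts.
function imageFileName(imageUrl) {
  const parts = new URL(imageUrl).pathname.split('/');
  return parts[parts.length - 1];
}
```

The returned object has the shape that Node's https.get accepts as its options argument, so the response could then be piped into a file named by imageFileName. Checking the HTTP status of each download would show whether the request is rejected or whether the image URL was never matched in the first place.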

Related

Sentry: All console logs show breadcrumb rather than file in browser

I have an Ionic app and after adding sentry-cordova, I noticed that my console logs (in browser) now show the following:
Previously, it would name the file and line number rather than "breadcrumbs" and I have no idea how to change this behavior.
It's worth noting that when I hover over breadcrumbs.js in the logs it references: @sentry/browser

This is not so much a solution as a potential workaround but it did the trick for me and should work for anyone using environment variables.
Sentry.init({
    dsn: "___DSN___",
    integrations: function(integrations) {
        return integrations.filter(function(integration) {
            // Disables breadcrumbs unless in production mode;
            // every other integration is kept in all modes
            return environment.production || integration.name !== "Breadcrumbs";
        });
    }
});

How to customize addContentItemDialog to restrict files over 10mb upload in IBM Content Navigator

I am customizing ICN (IBM Content Navigator) 2.0.3, and my requirement is to prevent users from uploading files over 10 MB and to allow only .pdf or .docx files.
I know I have to extend or customize the AddContentItemDialog, but there is very little detail on exactly how to do it, and no video on it either. I'd appreciate it if someone could guide me.
Thanks
I installed the development environment but I am not sure how to extend the AddContentItemDialog.
public void applicationInit(HttpServletRequest request,
        PluginServiceCallbacks callbacks) throws Exception {
}
I also want to know how to roll out the changes to ICN.

This can be easily extended. I would suggest reading the ICN Redbook for the details on how to do it, but it is pretty standard code.
Regarding rolling out the code to ICN, there are two ways:
- If you are using a plugin: just replace the JAR file in the server location and restart WAS.
- If you are using EDS: you need to redeploy the web service and restart WAS.
Hope this helps.
thanks

Although there are many ways to do this, one way is indeed to extend, or augment, the AddContentItemDialog as you mentioned. After looking at the (rather poor) IBM documentation, I figured you could probably use the onAdd event/method.
Dojo/Aspect#around allows you to do exactly that. Example:
require(["dojo/aspect", "ecm/widget/dialog/AddContentItemDialog"], function(aspect, AddContentItemDialog) {
    aspect.around(AddContentItemDialog.prototype, "onAdd", function advisor(original) {
        return function around() {
            var files = this.addContentItemGeneralPane.getFileInputFiles();
            var containsInvalidFiles = dojo.some(files, function isInvalid(file) {
                var fileName = file.name.toLowerCase();
                var extensionOK = fileName.endsWith(".pdf") || fileName.endsWith(".docx");
                var fileSizeOK = file.size <= 10 * 1024 * 1024;
                return !(extensionOK && fileSizeOK);
            });
            if (containsInvalidFiles) {
                alert("You can't add that :)");
            } else {
                original.apply(this, arguments);
            }
        }
    });
});
Just make sure this code gets executed before the actual dialog is opened. The best way to achieve this is by wrapping this code in a new plugin.
As for creating and deploying plugins: the easiest way is this wizard for Eclipse (see also a repackaged version for newer Eclipse versions). Just create a new arbitrary plugin and paste this JavaScript code into the generated .js file.
Additionally it might be good to note that you're only limiting "this specific dialog" to upload specific files. It would probably be a good idea to also create a requestFilter to limit all possible uses of the addContent api...
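
The size and extension check inside the advisor above can also be pulled out into a standalone function, which makes it easier to test outside ICN. This is only a sketch: the names isAllowedUpload, hasSuffix, MAX_UPLOAD_BYTES and ALLOWED_EXTENSIONS are made up for illustration and are not part of the ICN API. It also avoids String.prototype.endsWith, which older browsers such as Internet Explorer do not provide.

```javascript
// Limits taken from the question: 10 MB, .pdf or .docx only.
var MAX_UPLOAD_BYTES = 10 * 1024 * 1024;
var ALLOWED_EXTENSIONS = [".pdf", ".docx"];

// True when `name` ends with `suffix`, without relying on endsWith.
function hasSuffix(name, suffix) {
  return name.length >= suffix.length &&
    name.lastIndexOf(suffix) === name.length - suffix.length;
}

// Hypothetical helper: accepts any object with `name` and `size`,
// such as a browser File object.
function isAllowedUpload(file) {
  var name = String(file.name).toLowerCase();
  var extensionOK = ALLOWED_EXTENSIONS.some(function (ext) {
    return hasSuffix(name, ext);
  });
  return extensionOK && file.size <= MAX_UPLOAD_BYTES;
}
```

Inside the advisor, the isInvalid callback would then reduce to returning !isAllowedUpload(file).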

How can I use Chrome App fileSystem API

I want to use the chrome.fileSystem API to create a new file in C:/ or anywhere else, but I cannot figure out how to use this API.
I cannot find any argument for a file path; the only thing available is fileEntry, but how can I generate a fileEntry from something like C://a/b/c?

Chrome apps have limitations - for security reasons - on what files can be accessed. Basically, the user needs to approve access from your app to the files and directories that are accessed.
The only way to get access to files outside of the app's sandbox is through a user gesture - that is, you need to ask the user for a file. You do this with chrome.fileSystem.chooseEntry.
If this isn't clear or obvious, maybe you could explain more about what you are trying to do with the files and we can give advice on the best way to do this. Usually chrome.fileSystem is not the best choice for data storage; there are other more convenient and sandboxed alternatives like chrome.storage.
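
To make the user-gesture flow above concrete, here is a minimal sketch that asks the user where to save and then writes some text to the chosen file. The function name saveTextWithPicker is made up for illustration, and the chrome object is passed in as a parameter only so the function can be tried with a stub; inside a Chrome App you would pass the global chrome. It assumes the app's manifest requests the fileSystem permission with write access.

```javascript
// Sketch: let the user pick a save location, then write `text` to it.
// `done` is called as done(error) on failure or done(null, entry) on success.
function saveTextWithPicker(chromeApi, suggestedName, text, done) {
  chromeApi.fileSystem.chooseEntry(
    { type: 'saveFile', suggestedName: suggestedName },
    function (entry) {
      // If the user cancels, no entry is passed and runtime.lastError is set.
      if (!entry) {
        done(chromeApi.runtime.lastError || new Error('No file chosen'));
        return;
      }
      entry.createWriter(function (writer) {
        var failed = false;
        writer.onerror = function (err) { failed = true; done(err); };
        // onwriteend fires after both success and failure, hence the flag.
        writer.onwriteend = function () { if (!failed) { done(null, entry); } };
        writer.write(new Blob([text], { type: 'text/plain' }));
      }, done);
    }
  );
}
```

Note that the path is never supplied by the app: the user picks the location in the dialog, and the app only receives the resulting entry.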

It's a bit tricky to work with. It follows the same model other languages use, except it's even trickier because of all the callbacks. This is a function I wrote to get a nested file entry, creating directories as it goes along. Maybe this can help you get started.
For this function, you'd pass in a FileSystem that you'd get from something like chrome.fileSystem.chooseEntry with a directory type option, and path would be, in your example, ['a','b','c'].
function recursiveGetEntry(filesystem, path, callback) {
    function recurse(e) {
        if (path.length == 0) {
            if (e.isFile) {
                callback(e)
            } else {
                callback({error:'file exists'})
            }
        } else if (e.isDirectory) {
            if (path.length > 1) {
                e.getDirectory(path.shift(), {create:true}, recurse, recurse)
            } else {
                e.getFile(path.shift(), {create:true}, recurse, recurse)
            }
        } else {
            callback({error:'file exists'})
        }
    }
    recurse(filesystem)
}
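
Since recursiveGetEntry expects the path as an array of segments, a small helper can turn a path string into that form. This is a sketch with a made-up name, splitPath, and it assumes a path relative to the directory the user picked; a drive prefix such as C: has no meaning there and would simply become a segment of its own.

```javascript
// Hypothetical helper: split "a/b/c" or "a\b\c" into ['a', 'b', 'c'],
// ignoring leading, trailing and repeated separators.
function splitPath(path) {
  return path.split(/[\\/]+/).filter(function (part) {
    return part.length > 0;
  });
}

// Possible use with the function above (directoryEntry comes from the user):
// recursiveGetEntry(directoryEntry, splitPath('a/b/c'), callback)
```

Because recursiveGetEntry consumes its path argument with shift(), passing a freshly built array like this also avoids emptying an array the caller still needs.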

Adding defer attribute in .js files included in moodle

I am optimizing the performance of my Moodle site, and the page shows a high loading time for its .js files. That is why I want to add the defer='defer' attribute where the page includes the JavaScript files, which happens in the following code.
if (!empty($CFG->cachejs) and !empty($CFG->jsrev) and $CFG->jsrev > 0 and substr($url, -3) === '.js') {
    if (empty($CFG->slasharguments)) {
        return new moodle_url($CFG->httpswwwroot.'/lib/javascript.php', array('rev'=>$CFG->jsrev, 'jsfile'=>$url));
    } else {
        $returnurl = new moodle_url($CFG->httpswwwroot.'/lib/javascript.php');
        $returnurl->set_slashargument('/'.$CFG->jsrev.$url);
        return $returnurl;
    }
} else {
    return new moodle_url($CFG->httpswwwroot.$url);
}
So how can I add the defer='defer' attribute here? Any suggestions are welcome.

There are multiple locations where JavaScript is embedded in the Moodle page; the biggest one is for the YUI3 library.
To add the defer attribute, look in the file /lib/outputrequirementslib.php. The exact line number depends on your Moodle version. The trickiest one is adding it to static.js, as this is handled in the html_writer class.
Please note that the attribute should be "defer" and not "defer='defer'", as Moodle uses the HTML5 doctype.
Also, the loading order is important in Moodle due to the way it has been built. Adding the defer attribute will probably break your Moodle.

Joomla: Is there a module render plugin event?

Due to some caching issues, I need to explicitly bypass the cache, for a specific module, if certain URL parameters are present. The workaround I've arrived at is to hack the render() function in libraries/joomla/document/html/renderer/module.php, along the lines of:
function render( $module, $params = array(), $content = null )
{
    // Existing code:
    $mod_params = new JParameter( $module->params );
    // My hack:
    if ($module->module == 'mod_foo')
    {
        if (certain URL parameters are present)
        {
            $mod_params->set('cache', 0);
        }
    }
    ...
}
Of course, hacking the core Joomla code is a terrible idea, one which I'd like to avoid if at all possible. So, is there an appropriate hook I can plug in to in order to achieve the same? I don't think I can do anything at the module level, since the module won't even be inspected if the renderer has already decided to fetch it from cache.

To answer the first question: no, there isn't a module render event. Here are the plugin docs and the list of events in Joomla!
Turn off caching for your module.
See this article on The Art Of Joomla, additional articles you could look at:
Using Cache to Speed Up your code
JCache API
