Jsoup post, selecting different options - scala

http://www.myprotein.com/sports-nutrition/impact-whey-protein/10530943.html
I would like to retrieve the prices for the different amounts, i.e. 1 kg and 2.5 kg.
The first option is quite straightforward.
val page = Jsoup.connect("http://www.myprotein.com/sports-nutrition/impact-whey-protein/10530943.html").get()
println(page.select("div.media").select("h2.price").text.replaceAll("[a-zA-Z\\s]", ""))
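As an aside on the cleanup step, here is a minimal sketch in plain Scala (no Jsoup needed) of stripping letters and whitespace from a price string. The object name `PriceText` and the sample strings are made up for illustration. The character class is written with `A-Z`; a range written as `A-z` also covers the punctuation that sits between `Z` and `a` in ASCII (`[`, `\`, `]`, `^`, `_` and the backtick), which is probably not intended.

```scala
object PriceText {
  // Remove ASCII letters and whitespace; digits, currency symbols and punctuation stay.
  def stripLettersAndSpaces(raw: String): String =
    raw.replaceAll("[a-zA-Z\\s]", "")

  // Same idea with the wider A-z range, kept only to show the difference.
  def stripWithWideRange(raw: String): String =
    raw.replaceAll("[a-zA-z\\s]", "")
}
```

With the narrow range an underscore survives (`"a_b"` becomes `"_"`), while the wide range removes it as well.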
Getting the different options:
for (value <- page.getElementById("opts-7").getElementsByAttribute("value").asScala ) {
println(value.attr("value"))
}
However, I have no idea how to proceed with the POST: setting the cookies, headers, etc. I prefer the speed and parsing abilities of Jsoup, but I am also considering taking the easy way out and using Selenium to select the other options, then parsing the source with Jsoup.
So I decided to change to Selenium. It selects the right option with an emulated click, but the price isn't updating. If I avoid closing the WebDriver and select the 2.5 kg option manually, the price updates accordingly.
var driver: WebDriver = null
def main(args: Array[String]) {
setupWebDriver()
driver.navigate().to("http://www.myprotein.com/sports-nutrition/impact-whey-protein/10530943.html")
driver.findElement(By.xpath("//*[@id=\"opts-7\"]/option[2]")).click()
val temp = Jsoup.parse(driver.getPageSource)
println(temp.select("div.product-price").text())}
There's the main method and the WebDriver setup
def setupWebDriver() {
val binaryExe = "\\phantomjs.exe"
val BinaryChrome = "\\chromedriver.exe"
val caps: DesiredCapabilities = new DesiredCapabilities()
caps.setCapability(PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY, System.getProperty("user.dir") + binaryExe)
System.setProperty("phantomjs.binary.path", System.getProperty("user.dir") + binaryExe)
System.setProperty("webdriver.chrome.driver", System.getProperty("user.dir") + BinaryChrome)
caps.setJavascriptEnabled(true)
caps.setBrowserName(BrowserType.GOOGLECHROME)
driver = new ChromeDriver(caps)
driver.manage.timeouts.implicitlyWait(10, TimeUnit.SECONDS)}
Can anyone enlighten me on how I would get the price to update this way?
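One possible cause (a guess, not verified against the page) is that the price is updated by JavaScript a moment after the click, so `driver.getPageSource` is read before the new price reaches the DOM. Below is a rough sketch, in plain Scala, of polling until a value changes. `Poll`, `waitForChange` and its parameters are hypothetical names, and `read` stands for whatever re-reads the price text from the driver.

```scala
object Poll {
  /** Re-evaluates `read` until it differs from `previous` or `timeoutMs` has elapsed.
    * Returns Some(newValue) on a change, None on timeout. */
  def waitForChange(previous: String, timeoutMs: Long, intervalMs: Long = 100)(read: () => String): Option[String] = {
    val deadline = System.currentTimeMillis() + timeoutMs
    var current = read()
    while (current == previous && System.currentTimeMillis() < deadline) {
      Thread.sleep(intervalMs)
      current = read()
    }
    if (current != previous) Some(current) else None
  }
}
```

After the click it could be called along the lines of `Poll.waitForChange(oldPrice, 10000)(() => Jsoup.parse(driver.getPageSource).select("div.product-price").text())`. Selenium's own `WebDriverWait` is meant for the same job and may be the better fit.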

Related

How to process multiple parquet files individually in a for loop?

I have multiple parquet files (around 1000). I need to load each one of them, process it and save the result to a Hive table. I have a for loop, but it only seems to work with 2 or 5 files and not with 1000, as it seems Spark tries to load them all at the same time, and I need it to process them individually in the same Spark session.
I tried using a for loop, then a foreach, and I used unpersist(), but it fails anyway.
val ids = get_files_IDs()
ids.foreach(id => {
println("Starting file " + id)
var df = load_file(id)
var values_df = calculate_values(df)
values_df.write.mode(SaveMode.Overwrite).saveAsTable("table.values_" + id)
df.unpersist()
})
def get_files_IDs(): List[String] = {
var ids = sqlContext.sql("SELECT CAST(id AS varchar(10)) FROM table.ids WHERE id IS NOT NULL")
var ids_list = ids.select("id").map(r => r.getString(0)).collect().toList
return ids_list
}
def calculate_values(df:org.apache.spark.sql.DataFrame): org.apache.spark.sql.DataFrame ={
val values_id = df.groupBy($"id", $"date", $"hr_time").agg(avg($"value_a") as "avg_val_a", avg($"value_b") as "avg_value_b")
return values_id
}
def load_file(id:String): org.apache.spark.sql.DataFrame = {
val df = sqlContext.read.parquet("/user/hive/wh/table.db/parquet/values_for_" + id + ".parquet")
return df
}
What I would expect is for Spark to load file ID 1, process the data, save it to the Hive table, then discard that data and continue with the second ID, and so on until it finishes the 1000 files, instead of trying to load everything at the same time.
Any help would be much appreciated! I've been stuck on this for days. I'm using Spark 1.6 with Scala. Thank you!!
EDIT: Added the definitions. Hope it can help to get a better view. Thank you!
OK, so after a lot of inspection I realised that the process was working fine. It processed each file individually and saved the results. The issue was that in some very specific cases the process was taking way, way too long.
So I can say that with a for loop or foreach you can process multiple files and save the results without problems. Unpersisting and clearing the cache do help with performance.
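To spot which files are the slow ones, timing each iteration is probably enough. Here is a minimal sketch in plain Scala; `FileTimer`, `timed` and `processAll` are hypothetical helpers, and `process` stands in for the load, calculate and save steps of the loop in the question.

```scala
object FileTimer {
  // Runs `body` and returns its result together with the elapsed time in milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  // Processes the ids one at a time, in order, and returns (id, elapsedMs) pairs, slowest first.
  def processAll(ids: List[String])(process: String => Unit): List[(String, Long)] =
    ids.map { id =>
      val (_, elapsedMs) = timed(process(id))
      println(s"Finished file $id in $elapsedMs ms")
      (id, elapsedMs)
    }.sortBy { case (_, ms) => -ms }
}
```

The head of the returned list is the slowest file, which is the one worth inspecting first.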

Problems joining 2 kafka streams (using custom timestampextractor)

I'm having problems joining 2 kafka streams extracting the date from the fields of my event. The join is working fine when I do not define a custom TimeStampExtractor but when I do the join does not work anymore. My topology is quite simple:
val builder = new StreamsBuilder()
val couponConsumedWith = Consumed.`with`(Serdes.String(),
getAvroCouponSerde(schemaRegistryHost, schemaRegistryPort))
val couponStream: KStream[String, Coupon] = builder.stream(couponInputTopic, couponConsumedWith)
val purchaseConsumedWith = Consumed.`with`(Serdes.String(),
getAvroPurchaseSerde(schemaRegistryHost, schemaRegistryPort))
val purchaseStream: KStream[String, Purchase] = builder.stream(purchaseInputTopic, purchaseConsumedWith)
val couponStreamKeyedByProductId: KStream[String, Coupon] = couponStream.selectKey(couponProductIdValueMapper)
val purchaseStreamKeyedByProductId: KStream[String, Purchase] = purchaseStream.selectKey(purchaseProductIdValueMapper)
val couponPurchaseValueJoiner = new ValueJoiner[Coupon, Purchase, Purchase]() {
@Override
def apply(coupon: Coupon, purchase: Purchase): Purchase = {
val discount = (purchase.getAmount * coupon.getDiscount) / 100
new Purchase(purchase.getTimestamp, purchase.getProductid, purchase.getProductdescription, purchase.getAmount - discount)
}
}
val fiveMinuteWindow = JoinWindows.of(TimeUnit.MINUTES.toMillis(10))
val outputStream: KStream[String, Purchase] = couponStreamKeyedByProductId.join(purchaseStreamKeyedByProductId,
couponPurchaseValueJoiner,
fiveMinuteWindow
)
outputStream.to(outputTopic)
builder.build()
As I said this code works like a charm when I do not use a custom TimeStampExtractor but when I do by setting the StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG to my custom extractor class (I've double checked that the class is extracting the date properly) the join does not work anymore.
I'm testing the topology by running a unit test and passing the following events to it:
val coupon1 = new Coupon("Dec 05 2018 09:10:00.000 UTC", "1234", 10F)
// Purchase within the five minutes after the coupon - The discount should be applied
val purchase1 = new Purchase("Dec 05 2018 09:12:00.000 UTC", "1234", "Green Glass", 25.00F)
val purchase1WithDiscount = new Purchase("Dec 05 2018 09:12:00.000 UTC", "1234", "Green Glass", 22.50F)
val couponRecordFactory1 = couponRecordFactory.create(couponInputTopic, "c1", coupon1)
val purchaseRecordFactory1 = purchaseRecordFactory.create(purchaseInputTopic, "p1", purchase1)
testDriver.pipeInput(couponRecordFactory1)
testDriver.pipeInput(purchaseRecordFactory1)
val outputRecord1 = testDriver.readOutput(outputTopic,
new StringDeserializer(),
JoinTopologyBuilder.getAvroPurchaseSerde(
schemaRegistryHost,
schemaRegistryPort).deserializer())
OutputVerifier.compareKeyValue(outputRecord1, "1234", purchase1WithDiscount)
Not sure if the step of selecting a new key is getting rid of the proper date. I have tested a lot of combinations with no luck :(
Any help would be really appreciated!
I can't be sure without knowing how you tested your code, but my guess is the following:
1) Your code works with the default timestamp extractor because it uses the time at which you send each record into the pipe as the record timestamp. So it works simply because your test sends the records one after another without a pause.
2) You are using the TopologyTestDriver for your tests!
Note that it is very useful for testing your business code and the topology as a unit (given these inputs, what are the correct outputs?), but there is no Kafka Streams app actually running in those tests.
In your case you can play with the advanceWallClockTime(long) method of the TopologyTestDriver class to simulate the system time advancing.
If you want to actually start the topology, you will have to write an integration test with an embedded Kafka cluster (there is one in the Kafka libraries that works just fine!).
Let me know if that helps :-)
Thank you for replying. I was working on this yesterday and I think I found the problem. As you said, I am using the TopologyTestDriver to run my tests. When you initialize the TopologyTestDriver class it uses an initialWallClockTime; if you do not provide a value, the TopologyTestDriver picks up System.currentTimeMillis():
public TopologyTestDriver(Topology topology, Properties config) {
this(topology, config, System.currentTimeMillis());
}
There is another constructor that allows you to pass-in an initialWallClockTime. I've been testing this method but for some reason it does not work for me.
So to sum up my solution has been to create the Purchase and Coupon objects with the current timestamp. I'm still using my custom timestamp extractor but instead of hardcoding a date I am always getting the current timestamp and this way the join works fine.
Not fully happy with my end solution because I don't know why the initialWallClockTime does not work for me, but at least the tests are working fine now.
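For reference, a windowed stream-stream join only pairs two records whose timestamps lie within the window of each other, so it can help to check by hand what the extractor should return for the test events. Here is a small sketch in plain Scala (no Kafka dependency). `EventTime`, `toEpochMillis` and `withinWindow` are made-up names, and the date pattern is an assumption about the format of the strings used in the test.

```scala
import java.time.ZonedDateTime
import java.time.format.DateTimeFormatter
import java.util.Locale

object EventTime {
  // Assumed pattern for strings such as "Dec 05 2018 09:10:00.000 UTC".
  private val format =
    DateTimeFormatter.ofPattern("MMM dd yyyy HH:mm:ss.SSS VV", Locale.ENGLISH)

  // Converts the event's date string to epoch milliseconds.
  def toEpochMillis(text: String): Long =
    ZonedDateTime.parse(text, format).toInstant.toEpochMilli

  // True when the two timestamps are no more than windowMs apart.
  def withinWindow(leftMs: Long, rightMs: Long, windowMs: Long): Boolean =
    math.abs(leftMs - rightMs) <= windowMs
}
```

For the coupon and purchase in the test the gap is two minutes, so with correctly extracted timestamps the pair falls inside the join window.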

How do I work with a Scala process interactively?

I'm writing a bot in Scala for a game that uses text input and output. So I want to work with a process interactively - that is, my code receives output from the process, works with it, and only then sends its next input to the process. So I want to give a function access to the inputStreams and the outputStream simultaneously.
This doesn't seem to fit into any of the factories in scala.sys.process.BasicIO or the constructor for scala.sys.process.ProcessIO (three functions, each of which has access to only one stream).
Here's how I'm doing it at the moment.
private var rogue_input: OutputStream = _
private var rogue_output: InputStream = _
private var rogue_error: InputStream = _
Process("python3 /home/robin/IdeaProjects/Rogomatic/python/rogue.py --rogomatic").run(
new ProcessIO(rogue_input = _, rogue_output = _, rogue_error = _)
)
try {
val rogue_scanner = new Scanner(rogue_output)
val rogue_writer = new PrintWriter(rogue_input, true)
// Play the game
} finally {
rogue_input.close()
rogue_output.close()
rogue_error.close()
}
This works, but it doesn't feel very Scala-like. Is there a more idiomatic way to do this?
So I want to work with a process interactively - that is, my code receives output from the process, works with it, and only then sends its next input to the process.
In general, this is traditionally solved by expect. There exist libraries and tools inspired by expect for various languages, including for Scala: https://github.com/Lasering/scala-expect.
The README of the project gives various examples. While I don't know exactly what your rogue.py expects in terms of stdin/stdout interactions, here's a quick "hello world" example showing how you could interact with a Python interpreter (using the Ammonite REPL, which has convenient library-importing capabilities):
import $ivy.`work.martins.simon::scala-expect:6.0.0`
import work.martins.simon.expect.core._
import work.martins.simon.expect.core.actions._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
val timeout = 5 seconds
val e = new Expect("python3 -i -", defaultValue = "?")(
new ExpectBlock(
new StringWhen(">>> ")(
Sendln("""print("hello, world")""")
)
),
new ExpectBlock(
new RegexWhen("""(.*)\n>>> """.r)(
ReturningWithRegex(_.group(1).toString)
)
)
)
e.run(timeout).onComplete(println)
What the code above does is it "expects" >>> to be sent to stdout, and when it finds that, it will send print("hello, world"), followed by a newline. From then, it reads and returns everything until the next prompt (>>>) using a regex.
Amongst other debug information, the above should result in Success(hello, world) being printed to your console.
The library has various other styles, and there may also exist other similar libraries out there. My main point is that an expect-inspired library is likely what you're looking for.
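If pulling in a library feels like too much, another option (not what the example above uses) is `java.lang.ProcessBuilder`, which hands back the process's stdin and stdout as ordinary streams on one object. Here is a minimal sketch, assuming a Unix-like system where `cat` is available as a stand-in for the game; `Interactive` and `echoThroughCat` are hypothetical names.

```scala
import java.io.{BufferedReader, InputStreamReader, PrintWriter}

object Interactive {
  // Starts `cat`, sends one line to its stdin, reads one line back from its stdout,
  // then closes the streams so the process can exit.
  def echoThroughCat(line: String): String = {
    val process = new ProcessBuilder("cat").start()
    val toProcess = new PrintWriter(process.getOutputStream, true) // autoflush on println
    val fromProcess = new BufferedReader(new InputStreamReader(process.getInputStream))
    try {
      toProcess.println(line) // our input to the process
      fromProcess.readLine()  // blocks until the process answers with a full line
    } finally {
      toProcess.close()
      fromProcess.close()
      process.waitFor()
    }
  }
}
```

Because both streams are in scope at once, the write-then-read step can be repeated in a loop to drive a longer dialogue.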

Passing arguments between Gatling scenarios and simulation

I'm currently creating some Gatling simulations to test a REST API. I don't really understand Scala.
I've created a scenario with several exec and pause steps:
object MyScenario {
val ccData = ssv("cardcode_fr.csv").random
val nameData = ssv("name.csv").random
val mobileData = ssv("mobile.csv").random
val emailData = ssv("email.csv").random
val itemData = ssv("item_fr.csv").random
val scn = scenario("My use case")
.feed(ccData)
.feed(nameData)
.feed(mobileData)
.feed(emailData)
.feed(itemData)
.exec(
http("GetCustomer")
.get("/rest/customers/${CardCode}")
.headers(Headers.headers)
.check(
status.is(200)
)
)
.pause(3, 5)
.exec(
http("GetOffers")
.get("/rest/offers")
.queryParam("customercode", "${CardCode}")
.headers(Headers.headers)
.check(
status.is(200)
)
)
}
And I've a simple Simulation :
class MySimulation extends Simulation {
setUp(MyScenario.scn
.inject(
constantUsersPerSec (1 ) during (1)))
.protocols(EsbHttpProtocol.httpProtocol)
.assertions(
global.successfulRequests.percent.is(100))
}
The application I'm trying to simulate is a multi-location mobile app, so I've prepared a set of sample data for each locale (US, FR, IT...).
My REST API handles all the locales, therefore I want to make the simulation concurrently execute several instances of MyScenario, each with a different locale sample, to simulate the global load.
Is it possible to execute my simulation without having to create/duplicate the scenario and change the val ccData = ssv("cardcode_fr.csv").random for each one?
Also, each locale has its own load, how can I create a simulation that takes a single scenario and executes it several times concurrently with a different load and feeders?
Thanks in advance.
From what you've said, I think this may be a good approach:
Start by grouping your data in such a way that you can look up each item you want to send based on the current locale. For this, I would recommend using a Map that matches a locale string (such as "FR") to the item that matches that locale for the field you're looking to fill in. Then, at the start of each iteration of the scenario, you just pick which locale you want to use for the current iteration from a list. It would look something like this:
val locales = List("US", "FR", "IT")
val names = Map("US" -> "John", "FR" -> "Pierre", "IT" -> "Guillame")
object MyScenario {
// Random number generator used to pick a locale
val rand = new scala.util.Random
// These two lines pick a random locale from your list
val random_index = rand.nextInt(locales.length)
val currentLocale = locales(random_index)
// This line gets the name
val name = names(currentLocale)
// Do the rest of your logic here
}
This is a very simplified example - you'll have to figure out how you actually want to retrieve the data from files and put it into a Map structure, as I assume you don't want to hard code every item for every field into your code.
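Along the same lines, one way to avoid hard-coding anything per locale might be to derive the feeder file names and the load from a small table. This is only a sketch in plain Scala with no Gatling calls. `Locales`, `LocaleProfile`, `profiles` and the `usersPerSec` figures are invented for illustration, and the file naming simply follows the `cardcode_fr.csv` pattern from the question.

```scala
import java.util.Locale

object Locales {
  // One entry per locale: the feeder files to use and the load to inject.
  final case class LocaleProfile(locale: String, usersPerSec: Double) {
    // Locale.ROOT keeps the lower-casing independent of the JVM's default locale.
    private val suffix = locale.toLowerCase(Locale.ROOT)
    val cardCodeFile: String = s"cardcode_$suffix.csv"
    val itemFile: String = s"item_$suffix.csv"
  }

  // Hypothetical load figures; replace them with the real per-locale numbers.
  val profiles: List[LocaleProfile] = List(
    LocaleProfile("US", 5.0),
    LocaleProfile("FR", 2.0),
    LocaleProfile("IT", 1.0)
  )
}
```

Each profile could then be handed to a function that builds the scenario and passes `profile.cardCodeFile` to `ssv(...)`, rather than copying the scenario by hand for every locale.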

DataGen in soapui or soapui pro?

I want to test a RESTful web service in SoapUI. For that, I need to read values from Excel and pass them to the request.
I searched online and found that this is possible with the DataGen TestStep. I have SoapUI, but I couldn't find that option.
Can someone please tell me whether the DataGen TestStep is available in SoapUI 4.5.1 or in SoapUI Pro?
I am 99% sure that the data sources and such are only in SoapUI pro. You can accomplish the same thing in groovy scripts, though, but you would probably be better off reading from a text file as opposed to a spreadsheet.
The step is available only in SoapUI Pro (ReadyAPI).
In the free version, SoapUI, you can read an Excel file with Apache POI from a Groovy script and pass those values into your request.
import org.apache.poi.xssf.usermodel.*
import org.apache.poi.ss.usermodel.DataFormatter;
def fs = new FileInputStream("F:\\Gaurav\\soapui\\readFile.xlsx")
def wb = new XSSFWorkbook(fs)
def ws = wb.getSheet("Sheet1")
def r = ws.getPhysicalNumberOfRows()
for(def i =0 ; i < r ; i++)
{
def row = ws.getRow(i);
def c=row.getPhysicalNumberOfCells()
for(def j=0; j <c ; j++)
{
def cell= row.getCell(j)
// to convert everything to a String format
DataFormatter formatter = new DataFormatter()
def cellValue=formatter.formatCellValue(cell)
log.info cellValue
}
}
// The code above reads from Excel. Once you have read a value, you can store it
// in a test case property (do this inside the loop, where cellValue is in scope):
testRunner.testCase.setPropertyValue("tcprop", cellValue)
then in your request you can expand it like below
${#TestCase#tcprop}
This way you can achieve the same thing as DataGen in the free version of SoapUI 4.5.
There is also the Setup Script option in SoapUI, which can be run in advance.
You can convert your Excel file to a CSV or text file and handle the data from there.
I've done some testing with REST services, using only the read-from-text-file feature.
The code looks like this:
//Load the text file
def inputFile = new File("C://Temp//whatever");
//Create an empty list...
def mega_List = [];
//...and then populate it with the contents
// of the text file.
addSomeThingToList = {mega_List.add(it)};
inputFile.eachLine(addSomeThingToList);
//Get the test case so its properties can be read and updated
def tc = testRunner.testCase;
//Pick the next item from the list, using the index stored in the test case property...
def index = context.expand( '${#TestCase#index}' ).toInteger()
if ( index < mega_List.size() ) {
def id = mega_List.get(index);
index++
tc.setPropertyValue("id", id);
tc.setPropertyValue("index", index.toString());
}
else {
tc.setPropertyValue("index", "0");
tc.setPropertyValue("id", "0");
testRunner.cancel( "time to go home" )
}