Select Topic Words from Clusters - cluster-analysis

I am following this solution for clustering: https://towardsdatascience.com/clustering-contextual-embeddings-for-topic-model-1fb15c45b1bd
For step four, "Select Topic Words from Clusters", I ran the code below but got an error.
# k * vocab
X_per_cluster = self.vectorizer_model.transform(concatenated_documents)
# D * vocab
X_origin = self.vectorizer_model.transform(origin_documents)
if self.word_select_method == 'tfidf_idfi':
    socres = TFIDF_IDFi(X_per_cluster, X_origin, all_documents).socre()
elif self.word_select_method == 'tfidf_tfi':
    socres = TFIDF_TFi(X_per_cluster, X_origin, all_documents).socre()
elif self.word_select_method == 'tfi':
    socres = TFi(X_per_cluster).socre()
elif self.word_select_method == 'tfidfi':
    socres = TFIDFi(X_per_cluster).socre()
How can I fix this error?


Is it possible to use an alternative to if/else statements in repl python 3.7?

Is there a better way in Python 3.7 to handle long chains of if/else statements that branch on a random value (generated with the random module)? It feels counterproductive to write the same if/else statements for each input. If not, are there any ways to make my code more efficient?
if G.casefold() == "a":
    Boss_Health = Boss_Health - A
    print("user dealt", A, "damage with A")
    print("Boss health:", Boss_Health)
    print("Health:", Health)
    print("Bosses turn")
    print("")
    G = random.randint(1, 20)
if G == 1:
    Boss_L = Boss_L + Boss_A
    print("Boss_L has been upgraded by", Boss_A, "and now deals",
          Boss_L, "damage")
    print("Boss Health:", Boss_Health)
    print("Health:", Health)
    print("users turn")
    str(G)
    G = input()
if G == 2:
    Boss_Health = Boss_Health + Boss_B
    print("Boss healed", Boss_B, "hp to Boss")
    print("Boss Health:", Boss_Health)
    print("Health:", Health)
    print("users turn")
    str(G)
    G = input()
if G == 3:
    Boss_M = Boss_M + Boss_B
    print("Boss_M has been upgraded by", Boss_B, "and now deals",
          Boss_M, "damage")
    print("Boss Health:", Boss_Health)
    print("Health:", Health)
    print("users turn")
    str(G)
    G = input()
Usually, the first thing to look for is duplicated code. In your program, the following lines appear repeatedly:
print("Boss Health:", Boss_Health)
print("Health:", Health)
print("users turn")
str(G)
G = input()
Those are a good candidate to group into a function. Since the value of "G" comes either from manual entry or from a random draw, the following is a version of your code with the repeated statements pulled out into a separate function.
import random
def Vitals(b_h, h, rndm=None):  # Separate function to handle repeated statements
    print("Boss health:", b_h)
    print("Health:", h)
    print("Bosses turn")
    print("")
    z = " "
    if rndm == "gen":
        z = random.randint(1, 20)  # boss move: random integer
    else:
        z = input("Enter your choice: ")  # user move: string
    return z
def game_loop():
    A = 1
    G = " "
    Boss_Health = 10
    Boss_A = 2
    Boss_B = 2
    Boss_L = 1
    Boss_M = 1  # not set in the original snippet; starting value assumed so the G == 3 branch can run
    Health = 20
    while True:
        if G.casefold() == "a":
            Boss_Health = Boss_Health - A
            print("user dealt", A, "damage with A")
            G = Vitals(Boss_Health, Health, "gen")
        if G == 1:
            Boss_L = Boss_L + Boss_A
            print("Boss_L has been upgraded by", Boss_A, "and now deals", Boss_L, "damage")
            G = Vitals(Boss_Health, Health)
        if G == 2:
            Boss_Health = Boss_Health + Boss_B
            print("Boss healed", Boss_B, "hp to Boss")
            G = Vitals(Boss_Health, Health)
        if G == 3:
            Boss_M = Boss_M + Boss_B
            print("Boss_M has been upgraded by", Boss_B, "and now deals", Boss_M, "damage")
            G = Vitals(Boss_Health, Health)
        if G == "q" or G == "Q":
            break
        G = Vitals(Boss_Health, Health)
    return
game_loop()
Not having your full program, I improvised a user input loop to test this out. The net effect is a shorter program with less repeated code, which also reduces the chance of inconsistencies creeping into the various blocks that use the function.
Give that a try.
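Another way to cut down the if chain itself is a dispatch table: a dictionary that maps each roll to a small function. The sketch below is only an illustration under assumptions. The function names, the `state` dictionary, and the "Boss does nothing" fallback are made up for the example, and the starting values (including `Boss_M`) are guesses, since the full program was not posted.

```python
import random

def upgrade_light(state):
    """Roll 1: raise Boss_L by Boss_A."""
    state["Boss_L"] += state["Boss_A"]
    return "Boss_L has been upgraded by {} and now deals {} damage".format(
        state["Boss_A"], state["Boss_L"])

def heal_boss(state):
    """Roll 2: heal the boss by Boss_B."""
    state["Boss_Health"] += state["Boss_B"]
    return "Boss healed {} hp to Boss".format(state["Boss_B"])

def upgrade_medium(state):
    """Roll 3: raise Boss_M by Boss_B."""
    state["Boss_M"] += state["Boss_B"]
    return "Boss_M has been upgraded by {} and now deals {} damage".format(
        state["Boss_B"], state["Boss_M"])

# One entry per roll that has an effect; any other roll falls through to the default.
BOSS_MOVES = {1: upgrade_light, 2: heal_boss, 3: upgrade_medium}

def boss_turn(state, roll=None):
    """Apply the boss move for `roll` (random 1..20 when not given)."""
    if roll is None:
        roll = random.randint(1, 20)
    move = BOSS_MOVES.get(roll)
    if move is None:
        return "Boss does nothing"  # assumed behaviour for unlisted rolls
    return move(state)

# Assumed starting values, for illustration only.
state = {"Boss_Health": 10, "Boss_A": 2, "Boss_B": 2, "Boss_L": 1, "Boss_M": 1}
print(boss_turn(state, roll=2))
```

Adding a new boss move then means writing one function and adding one dictionary entry, rather than another copy of the if block.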

OR-tools VRP solver returning one job per vehicle

I am using Google OR-Tools in python to solve a capacitated VRP with pickup/delivery. In many cases the solver works well and returns reasonable solutions, but we have found that for some data sets the solver will always return one job per truck regardless of the time involved for the route.
I have the model set up as follows:
My initial vehicle count is equal to the number of jobs in the data-set, and we allow OR-Tools to automatically minimize the truck count.
Each job's pickup location has a demand of 1 and each job's dropoff location has a demand of -1, to enforce delivery immediately after pickup.
We set the maximum drive time per vehicle to 8 hours.
Then, each job has an associated quantity attached for pickup, and we separate this job into multiple deliveries based on a truck's capacity. For instance, if a job requires 60 tons delivered, we represent that as three jobs at 20 tons each (the maximum a vehicle is allowed to carry on an interstate in the U.S.).
Now, we have a simple data set with a pickup location at 698 Longtown Rd, Columbia, SC and a dropoff location at 121 Chappell Creek Rd, Hopkins, SC. This is a drive time of 32 minutes, or a total trip time of 64 minutes. This job has an associated quantity of 60 tons, which will require 3 truck loads.
The results we receive from OR-Tools show one load per truck, and this result does not change regardless of how long we allow the solver to run. The optimal solution would allow one truck to complete all loads, as this is still drastically under the 8-hour drive time limit.
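As a quick check of that expectation, the arithmetic can be sketched as follows. The helper names are hypothetical, and the estimate counts a full 64-minute round trip per load while ignoring the legs to and from the depot, whose location is not given here.

```python
import math

TRUCK_CAPACITY_TONS = 20      # per-load limit stated above
MAX_DRIVE_MINUTES = 8 * 60    # 8-hour drive-time limit per vehicle

def loads_needed(total_tons, capacity=TRUCK_CAPACITY_TONS):
    """Number of truck loads needed to move total_tons."""
    return math.ceil(total_tons / capacity)

def single_truck_minutes(total_tons, round_trip_minutes):
    """Rough drive time if one truck hauls every load (depot legs ignored)."""
    return loads_needed(total_tons) * round_trip_minutes

loads = loads_needed(60)
minutes = single_truck_minutes(60, 64)
print(loads, minutes, minutes <= MAX_DRIVE_MINUTES)
```

Under these assumptions one truck needs 3 loads and about 192 minutes of driving, well inside the 480-minute limit.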
Here is my code:
import json
import math
import traceback
import urllib.request  # send_request calls urllib.request.urlopen, so import the submodule explicitly
import redis
import requests
import boto3
from signal import signal, SIGINT, SIGTERM
from ortools.constraint_solver import pywrapcp, routing_enums_pb2
url = 'https://test-api.truckit.com/api/2/signin'
api_data = {"password": "", "username": ""}
response = requests.post(url, json=api_data)
api_data = response.json()
def build_auth_header(token):
    header = {'Authorization': f'Token {token}'}
    return header
class SignalHandler:
    def __init__(self):
        self.received_signal = False
        signal(SIGINT, self._signal_handler)
        signal(SIGTERM, self._signal_handler)
    def _signal_handler(self, signal, frame):
        print(f"handling signal {signal}, exiting gracefully")
        self.received_signal = True
sqs = boto3.resource("sqs")
queue = sqs.get_queue_by_name(QueueName="")
redisClient = redis.Redis(host='', port=6379,
                          password='')
def create_distance_matrix(data):
    addresses = data["addresses"]
    API_key = data["API_key"]
    origin_addresses = []
    dest_addresses = addresses
    distance_matrix = []
    responses = {}
    responses['destination_addresses'] = []
    responses['origin_addresses'] = []
    responses['rows'] = []
    # Send q requests, returning max_rows rows per request.
    for i in range(0, len(addresses)):
        origin_addresses.clear()
        origin_addresses.append(addresses[i])
        for j in range(0, len(addresses), 25):
            dest_addresses_request = addresses[j:j + 25]
            response = send_request(origin_addresses, dest_addresses_request, API_key)
            responses['origin_addresses'] = response['origin_addresses']
            for destination_address in response['destination_addresses']:
                responses['destination_addresses'].append(destination_address)
            for row in response['rows']:
                if len(responses['rows']) == 0:
                    responses['rows'].append(row)
                else:
                    for element in row['elements']:
                        responses['rows'][0]['elements'].append(element)
        distance_matrix += build_distance_matrix(responses)
        responses['origin_addresses'].clear()
        responses['destination_addresses'].clear()
        responses['rows'].clear()
    return distance_matrix
def send_request(origin_addresses, dest_addresses, API_key):
    """ Build and send request for the given origin and destination addresses."""
    def build_address_str(addresses):
        # Build a pipe-separated string of addresses
        address_str = ''
        for i in range(len(addresses) - 1):
            address_str += addresses[i] + '|'
        address_str += addresses[-1]
        return address_str
    request = 'https://maps.googleapis.com/maps/api/distancematrix/json?units=imperial'
    origin_address_str = build_address_str(origin_addresses)
    dest_address_str = build_address_str(dest_addresses)
    request = request + '&origins=' + origin_address_str + '&destinations=' + \
        dest_address_str + '&key=' + API_key
    jsonResult = urllib.request.urlopen(request).read()
    response = json.loads(jsonResult)
    return response
def build_distance_matrix(response):
    distance_matrix = []
    for row in response['rows']:
        row_list = [row['elements'][j]['duration']['value'] for j in range(len(row['elements']))]
        distance_matrix.append(row_list)
    return distance_matrix
def process_message(message_body):
    print(f"processing message: {message_body}")
    data = json.loads(message_body)
    data_matrix = {}
    data_matrix['problem_id'] = data['problemId']
    data_matrix["addresses"] = []
    data_matrix["pickups_deliveries"] = []
    data_matrix["demands"] = []
    data_matrix["jobOrderIDs"] = []
    depot_address = str(data["depot"]["latitude"]) + "," + str(data["depot"]["longitude"])
    data_matrix["jobOrderIDs"].append(0)
    data_matrix["addresses"].append(depot_address)
    hash_key = data["hashKey"]
    for location in data["locationList"]:
        pick_lat = location["PickupLatitude"]
        pick_long = location["PickupLongitude"]
        drop_lat = location["DropoffLatitude"]
        drop_long = location["DropoffLongitude"]
        jobOrderId = location["jobOrderID"]
        demand = math.ceil(float(int(location["totalQuantity"]) / 20))
        for i in range(0, demand):
            data_matrix["addresses"].append(str(pick_lat) + ',' + str(pick_long))
            data_matrix["addresses"].append(str(drop_lat) + ',' + str(drop_long))
            data_matrix["jobOrderIDs"].append(str(jobOrderId))
            data_matrix["jobOrderIDs"].append(str(jobOrderId))
    data_matrix["demands"].append(0)
    for i in range(1, len(data_matrix["addresses"]) - 1, 2):
        data_matrix["pickups_deliveries"].append([i, i + 1])
        data_matrix["demands"].append(1)
        data_matrix["demands"].append(-1)
    data_matrix["num_vehicles"] = int(len(data_matrix["addresses"]) / 2)
    data_matrix["vehicle_capacities"] = []
    for i in range(0, data_matrix["num_vehicles"]):
        data_matrix["vehicle_capacities"].append(1)
    data_matrix["depot"] = 0
    data_matrix["API_key"] = ''
    data_matrix["distance_matrix"] = create_distance_matrix(data_matrix)
    # Create the routing index manager.
    manager = pywrapcp.RoutingIndexManager(len(data_matrix['distance_matrix']),
                                           data_matrix['num_vehicles'], data_matrix['depot'])
    # Create Routing Model.
    routing = pywrapcp.RoutingModel(manager)
    # Define cost of each arc.
    def distance_callback(from_index, to_index):
        """Returns the manhattan distance between the two nodes."""
        # Convert from routing variable Index to distance matrix NodeIndex.
        from_node = manager.IndexToNode(from_index)
        to_node = manager.IndexToNode(to_index)
        return data_matrix['distance_matrix'][from_node][to_node]*1000
    transit_callback_index = routing.RegisterTransitCallback(distance_callback)
    routing.SetArcCostEvaluatorOfAllVehicles(transit_callback_index)
    # Add Distance constraint.
    dimension_name = 'Duration'
    routing.AddDimension(
        transit_callback_index,
        0,  # no slack
        28800*1000,  # vehicle maximum travel hours
        True,  # start cumul to zero
        dimension_name)
    distance_dimension = routing.GetDimensionOrDie(dimension_name)
    distance_dimension.SetGlobalSpanCostCoefficient(100)
    def demand_callback(from_index):
        """Returns the demand of the node."""
        # Convert from routing variable Index to demands NodeIndex.
        from_node = manager.IndexToNode(from_index)
        return data_matrix['demands'][from_node]
    demand_callback_index = routing.RegisterUnaryTransitCallback(
        demand_callback)
    routing.AddDimensionWithVehicleCapacity(
        demand_callback_index,
        0,  # null capacity slack
        data_matrix['vehicle_capacities'],  # vehicle maximum capacities
        True,  # start cumul to zero
        'Capacity')
    # Define Transportation Requests.
    for request in data_matrix['pickups_deliveries']:
        pickup_index = manager.NodeToIndex(request[0])
        delivery_index = manager.NodeToIndex(request[1])
        routing.AddPickupAndDelivery(pickup_index, delivery_index)
        routing.solver().Add(
            routing.VehicleVar(pickup_index) == routing.VehicleVar(
                delivery_index))
        routing.solver().Add(
            distance_dimension.CumulVar(pickup_index) <=
            distance_dimension.CumulVar(delivery_index))
    # Setting first solution heuristic.
    search_parameters = pywrapcp.DefaultRoutingSearchParameters()
    search_parameters.local_search_metaheuristic = (
        routing_enums_pb2.LocalSearchMetaheuristic.GUIDED_LOCAL_SEARCH)
    search_parameters.time_limit.seconds = 1200
    search_parameters.log_search = True
    search_parameters.first_solution_strategy = (
        routing_enums_pb2.FirstSolutionStrategy.AUTOMATIC)
    search_parameters.use_full_propagation = True
    # Solve the problem.
    solution = routing.SolveWithParameters(search_parameters)
    if solution:
        solution_dict = {}
        for vehicle_id in range(data_matrix['num_vehicles']):
            index = routing.Start(vehicle_id)
            plan_output = ''
            route_distance = 0
            route_load = 0
            while not routing.IsEnd(index):
                node_index = manager.IndexToNode(index)
                plan_output += '{0},'.format(data_matrix['jobOrderIDs'][node_index])
                previous_index = index
                index = solution.Value(routing.NextVar(index))
            plan_output += '{0},'.format(data_matrix['jobOrderIDs'][manager.IndexToNode(index)])
            plan_output = plan_output[:-1]
            plan_words = plan_output.split(",")
            plan_output = ''
            for i in range(len(plan_words)):
                if (i % 2 == 0):
                    plan_output += plan_words[i] + ","
            plan_output = plan_output[:-1]
            plan_output += ",0"
            if plan_output != 0 and plan_output != str(0) and plan_output != str('0,0'):
                print(plan_output)
                solution_dict[vehicle_id] = plan_output
        # trucks_url = 'https://test-api.truckit.com/api/2/trucks'
        trucks_url = 'https://test-api.truckit.com/api/2/job-orders/smart-dispatch/' + str(data_matrix['problem_id'])
        head = build_auth_header(api_data["authToken"])
        status = {}
        ride_list = []
        dummy_location_dict = {}
        dummy_id_dict = {}
        dummy_id_dict["id"] = 0
        dummy_id_dict["name"] = ""
        dummy_location_dict["location"] = dummy_id_dict
        dummy_location_dict["timestamp"] = 0
        ride_list.append(dummy_location_dict)
        redisClient.hset(hash_key, "solution", json.dumps(solution_dict))
        redisClient.hset(hash_key, "ride_list", json.dumps(ride_list))
        json_data = {"status": "completed"}
        api_response = requests.post(trucks_url, headers=head, json=json_data)
        print_solution(data_matrix, manager, routing, solution)
def print_solution(data, manager, routing, solution):
    """Prints solution on console."""
    print(f'Objective: {solution.ObjectiveValue()}')
    total_distance = 0
    total_load = 0
    for vehicle_id in range(data['num_vehicles']):
        index = routing.Start(vehicle_id)
        plan_output = 'Route for vehicle {}:\n'.format(vehicle_id)
        route_distance = 0
        route_load = 0
        while not routing.IsEnd(index):
            node_index = manager.IndexToNode(index)
            plan_output += ' {0} -> '.format(node_index)
            previous_index = index
            index = solution.Value(routing.NextVar(index))
            try:
                distance = data['distance_matrix'][previous_index][index]
                route_distance += distance
            except:
                distance = distance
        plan_output += ' {0}\n'.format(manager.IndexToNode(index))
        plan_output += 'Time of the route: {} hours\n'.format(str(float(route_distance / (60 * 60))))
        print(plan_output)
        total_distance += route_distance
    print('Total distance of all routes: {}m'.format(total_distance))
if __name__ == "__main__":
    signal_handler = SignalHandler()
    while not signal_handler.received_signal:
        messages = queue.receive_messages(
            MaxNumberOfMessages=1,
            WaitTimeSeconds=1
        )
        for message in messages:
            try:
                process_message(message.body)
                message.delete()
            except Exception as e:
                print(f"exception while processing message: {repr(e)}")
                traceback.print_exc()
                continue
            message.delete()
If anyone has any suggestions as to what the problem may be, your help is greatly appreciated.

Using zero_grad() after loss.backward(), but still receives RuntimeError: "Trying to backward through the graph a second time..."

Below is my implementation of A2C using PyTorch. From what I have learned about backpropagation in PyTorch, I know to call zero_grad() on the optimizer after each update iteration. However, I still get a RuntimeError on the second backward pass.
def torchworker(number, model):
    worker_env = gym.make("Taxi-v3").env
    max_steps_per_episode = 2000
    worker_opt = optim.Adam(lr=5e-4, params=model.parameters())
    p_history = []
    val_history = []
    r_history = []
    running_reward = 0
    episode_count = 0
    under = 0
    start = time.time()
    for i in range(2):
        state = worker_env.reset()
        episode_reward = 0
        penalties = 0
        drop = 0
        print("Episode {} begins ({})".format(episode_count, number))
        worker_env.render()
        criterion = nn.SmoothL1Loss()
        time_solve = 0
        for _ in range(1, max_steps_per_episode):
            #worker_env.render()
            state = torch.tensor(state, dtype=torch.long)
            action_probs = model.forward(state)[0]
            critic_value = model.forward(state)[1]
            val_history.append((state, critic_value[0]))
            # Choose action
            action = np.random.choice(6, p=action_probs.detach().numpy())
            p_history.append(torch.log(action_probs[action]))
            # Apply chosen action
            state, reward, done, _ = worker_env.step(action)
            r_history.append(reward)
            episode_reward += reward
            time_solve += 1
            if reward == -10:
                penalties += 1
            elif reward == 20:
                drop += 1
            if done:
                break
        # Update running reward to check condition for solving
        running_reward = (running_reward * (episode_count) + episode_reward) / (episode_count + 1)
        # Calculate discounted returns
        returns = deque(maxlen=3500)
        discounted_sum = 0
        for r in r_history[::-1]:
            discounted_sum = r + gamma * discounted_sum
            returns.appendleft(discounted_sum)
        # Calculate actor losses and critic losses
        loss_actor_value = 0
        loss_critic_value = 0
        history = zip(p_history, val_history, returns)
        for log_prob, value, ret in history:
            diff = ret - value[1]
            loss_actor_value += -log_prob * diff
            ret_tensor = torch.tensor(ret, dtype=torch.float32)
            loss_critic_value += criterion(value[1], ret_tensor)
        loss = loss_actor_value + 0.1 * loss_critic_value
        print(loss)
        # Update params
        loss.backward()
        worker_opt.step()
        worker_opt.zero_grad()
        # Log details
        end = time.time()
        episode_count += 1
        if episode_count % 1 == 0:
            worker_env.render()
        if running_reward > -50:  # Condition to consider the task solved
            under += 1
        if under > 5:
            print("Solved at episode {} !".format(episode_count))
            break
I believe there may be something to do with the architecture of my AC model, so I also include it here for reference.
class ActorCriticNetwork(nn.Module):
    def __init__(self, num_inputs, num_hidden, num_actions):
        super(ActorCriticNetwork, self).__init__()
        self.embed = nn.Embedding(500, 10)
        self.fc1 = nn.Linear(10, num_hidden * 2)
        self.fc2 = nn.Linear(num_hidden * 2, num_hidden)
        self.c = nn.Linear(num_hidden, 1)
        self.fc3 = nn.Linear(num_hidden, num_hidden)
        self.a = nn.Linear(num_hidden, num_actions)
    def forward(self, x):
        out = F.relu(self.embed(x))
        out = F.relu(self.fc1(out))
        out = F.relu(self.fc2(out))
        critic = self.c(out)
        out = F.relu(self.fc3(out.detach()))
        actor = F.softmax(self.a(out), dim=-1)
        return actor, critic
Would you please tell me what the mistake here is? Thank you in advance.
SOLVED: I forgot to clear the histories of probabilities, action-values and rewards after each iteration. It is clear why that would cause the issue: the older elements would make backward() propagate through the old computation graphs (DCGs).
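A minimal sketch of the fix described in the SOLVED note, using plain Python lists in place of tensors so it runs without PyTorch or gym. The helper names `discounted_returns` and `run_episodes` and the `None` placeholders are illustrative assumptions, not part of the code above. The point is that the three history buffers are created afresh inside the episode loop, so nothing recorded in an earlier episode ends up in a later loss.

```python
from collections import deque

def discounted_returns(rewards, gamma):
    """Discounted return for every step, computed backwards as in the post."""
    returns = deque()
    discounted_sum = 0.0
    for r in reversed(rewards):
        discounted_sum = r + gamma * discounted_sum
        returns.appendleft(discounted_sum)
    return list(returns)

def run_episodes(episode_rewards, gamma=0.99):
    """episode_rewards: one reward list per episode (stand-in for env rollouts).

    Returns how many history entries each episode's loss would be built from.
    """
    used = []
    for rewards in episode_rewards:
        # Fresh buffers per episode. In the real training loop these would hold
        # log-prob and critic-value tensors tied to this episode's graph only.
        p_history, val_history, r_history = [], [], []
        for r in rewards:
            p_history.append(None)    # placeholder for torch.log(action_probs[action])
            val_history.append(None)  # placeholder for the critic value
            r_history.append(r)
        returns = discounted_returns(r_history, gamma)
        used.append(len(list(zip(p_history, val_history, returns))))
    return used

print(discounted_returns([1, 1, 1], 0.5))
print(run_episodes([[-1, -1], [-1, -1, 20]]))
```

In the posted function, the equivalent change would be to move the three list initialisations inside the `for i in range(2)` loop, or to call `.clear()` on them after `worker_opt.step()`.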

ValueError: List argument 'values' to 'ConcatV2' Op with length 0 shorter than minimum length 2 3Dball

Running the "3Dball" example in Unity ml-agents produces an error.
When I execute PPO.ipynb, there is no error up to "Load the environment".
Executing "Train the Agents" raises the following error:
ValueError: List argument 'values' to 'ConcatV2' Op with length 0 shorter than minimum length 2.
This is the code I executed:
https://github.com/Unity-Technologies/ml-agents/blob/master/python/PPO.ipynb
tf.reset_default_graph()
if curriculum_file == "None":
    curriculum_file = None
def get_progress():
    if curriculum_file is not None:
        if env._curriculum.measure_type == "progress":
            return steps / max_steps
        elif env._curriculum.measure_type == "reward":
            return last_reward
        else:
            return None
    else:
        return None
# Create the Tensorflow model graph
ppo_model = create_agent_model(env, lr=learning_rate,
                               h_size=hidden_units, epsilon=epsilon,
                               beta=beta, max_step=max_steps,
                               normalize=normalize, num_layers=num_layers)
is_continuous = (env.brains[brain_name].action_space_type == "continuous")
use_observations = (env.brains[brain_name].number_observations > 0)
use_states = (env.brains[brain_name].state_space_size > 0)
model_path = './models/{}'.format(run_path)
summary_path = './summaries/{}'.format(run_path)
if not os.path.exists(model_path):
    os.makedirs(model_path)
if not os.path.exists(summary_path):
    os.makedirs(summary_path)
init = tf.global_variables_initializer()
saver = tf.train.Saver()
with tf.Session() as sess:
    # Instantiate model parameters
    if load_model:
        print('Loading Model...')
        ckpt = tf.train.get_checkpoint_state(model_path)
        saver.restore(sess, ckpt.model_checkpoint_path)
    else:
        sess.run(init)
    steps, last_reward = sess.run([ppo_model.global_step, ppo_model.last_reward])
    summary_writer = tf.summary.FileWriter(summary_path)
    info = env.reset(train_mode=train_model, progress=get_progress())[brain_name]
    trainer = Trainer(ppo_model, sess, info, is_continuous, use_observations, use_states, train_model)
    if train_model:
        trainer.write_text(summary_writer, 'Hyperparameters', hyperparameter_dict, steps)
    while steps <= max_steps:
        if env.global_done:
            info = env.reset(train_mode=train_model, progress=get_progress())[brain_name]
        # Decide and take an action
        new_info = trainer.take_action(info, env, brain_name, steps, normalize)
        info = new_info
        trainer.process_experiences(info, time_horizon, gamma, lambd)
        if len(trainer.training_buffer['actions']) > buffer_size and train_model:
            # Perform gradient descent with experience buffer
            trainer.update_model(batch_size, num_epoch)
        if steps % summary_freq == 0 and steps != 0 and train_model:
            # Write training statistics to tensorboard.
            trainer.write_summary(summary_writer, steps, env._curriculum.lesson_number)
        if steps % save_freq == 0 and steps != 0 and train_model:
            # Save Tensorflow model
            save_model(sess, model_path=model_path, steps=steps, saver=saver)
        steps += 1
        sess.run(ppo_model.increment_step)
        if len(trainer.stats['cumulative_reward']) > 0:
            mean_reward = np.mean(trainer.stats['cumulative_reward'])
            sess.run(ppo_model.update_reward, feed_dict={ppo_model.new_reward: mean_reward})
            last_reward = sess.run(ppo_model.last_reward)
    # Final save Tensorflow model
    if steps != 0 and train_model:
        save_model(sess, model_path=model_path, steps=steps, saver=saver)
env.close()
export_graph(model_path, env_name)
I had the same error. I fixed it by replacing line 222 of the file "ml-agents/python/ppo/models.py".
Replace line 222:
hidden_visual = tf.concat(encoders, axis=2)
with:
if encoders:
    hidden_visual = tf.concat(encoders, axis=2)
I hope that helps.

How can I remove <math></math> multiline sections with Perl?

How can I remove multiline <math></math> sections with Perl?
I have wiki text like this:
{|
|-
| colspan="2"|
: <math>
[\underbrace{\color{Red}4,2}_{4 > 2},5,1,7] \rightarrow
[2,\underbrace{\color{OliveGreen}4,5}_{4 < 5},1,7] \rightarrow
[2,4,\underbrace{\color{Red}5,1}_{5 > 1},7] \rightarrow
[2,4,1,\underbrace{\color{OliveGreen}5,7}_{5 < 7}]
</math>
|-
|
: <math>
[\underbrace{\color{OliveGreen}2,4}_{2 < 4},1,5,{\color{Blue}7}] \rightarrow
[2,\underbrace{\color{Red}4,1}_{4 > 1},5,{\color{Blue}7}] \rightarrow
[2,1,\underbrace{\color{OliveGreen}4,5}_{4 < 5},{\color{Blue}7}]
</math>
: <math>
[\underbrace{\color{Red}2,1}_{2 > 1},4,{\color{Blue}5},{\color{Blue}7}] \rightarrow
[1,\underbrace{\color{OliveGreen}2,4}_{2 < 4},{\color{Blue}5},{\color{Blue}7}]
</math>
: <math>
[\underbrace{\color{OliveGreen}1,2}_{1 < 2},{\color{Blue}4},{\color{Blue}5},{\color{Blue}7}]
</math>
|}
I want to remove all <math>...</math> sections from this text. How can I do it? I have written this code:
cat math-text.txt | perl -e 'while(<>) { s/<math>.+?<\/math>//gs; print $_; }'
It does not work, but it should, since the documentation explains that with the /s modifier . will match newlines. How can I do it?
The following is a Python script that I use to extract all the mathematical formulas from Wikipedia dumps. Rather than using a multi-line regexp, it scans each line for occurrences of <math> and </math>, uses their positions on the line to work out where the equation text starts and ends, and uses a finite state machine to assemble the actual equations, basically with two states determined by inEqn. It also does a few other things, such as finding the title, the namespace, and the attributes of the math tags.
As dumps are on the order of 100MB, a line-by-line approach may well end up being more efficient than multi-line regexps.
import sys
import re
titleRE = re.compile('<title>(.*)</title>')
nsRE = re.compile('<ns>(.*)</ns>')
mathRE = re.compile('</?math(.*?)>')
pageEndRE = re.compile('</page>')
title = ""
attr = ""
ns = -1
inEqn = 0  # 1 while inside an unclosed <math> section
for line in sys.stdin:
    m = titleRE.search(line)
    if m:
        # A new page starts: reset the equation state
        title = m.group(1)
        expression = ""
        inEqn = 0
    m = nsRE.search(line)
    if m:
        ns = m.group(1)
    start = 0
    pos = 0
    m = mathRE.search(line, pos)
    while m:
        if m.group().startswith('<math'):
            # Opening tag: remember where the equation text begins
            attr = m.group(1)
            start = m.end()
            pos = start
            expression = ""
            inEqn = 1
        if m.group() == '</math>':
            # Closing tag: emit the collected equation, unescaping HTML entities
            end = m.start()
            expression = ' '.join([expression, line[start:end]])
            print(title, '\t', attr, '\t', expression.lstrip().replace('&lt;', '<').replace('&gt;', '>').replace('&amp;', '&'))
            pos = m.end()
            expression = ""
            start = 0
            inEqn = 0
        m = mathRE.search(line, pos)
    if start > 0:
        # Equation opened on this line but not closed: keep the tail
        expression = line[start:].rstrip()
    elif inEqn:
        # Continuation line of a multi-line equation
        expression = ' '.join([expression, line.rstrip()])
Another option might be to consider an xml parser. A SAX or DOM based parser would be able to find the equations. This might be worth considering if you want to do more sophisticated analysis of the wiki-text.
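For input small enough to read in one piece, the multi-line regex from the question also works once the whole text is in a single string. The Perl one-liner fails because `while(<>)` hands the substitution one line at a time, so the pattern never sees `<math>` and `</math>` together. Below is a minimal Python sketch of the whole-text approach. The function name `strip_math` and the sample string are made up for illustration.

```python
import re

# re.DOTALL lets '.' match newlines; '.*?' is non-greedy so each
# <math>...</math> section is removed separately.
MATH_BLOCK = re.compile(r"<math\b[^>]*>.*?</math>", re.DOTALL)

def strip_math(text):
    """Return text with every <math>...</math> section removed."""
    return MATH_BLOCK.sub("", text)

sample = ": <math>\n[2,4] \\rightarrow\n[4,2]\n</math>\n|-\n: <math>x</math>\n"
print(strip_math(sample))
```

To process a file, read it completely first (for example with `open(path).read()`) and pass the result to `strip_math`. For very large dumps, the line-by-line approach above avoids holding everything in memory.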