K-means exploration
With the 2020 MineRL competition, we introduced vectorized obfuscated environments, which abstract the agent's non-visual state information and its action space into continuous vector spaces. See MineRLObtainDiamondVectorObf-v0 for documentation on the evaluation environment for that competition.
To use techniques in the MineRL competition that require discrete actions, we can use k-means to quantize the human demonstrations and give our agent n discrete actions representative of actions taken by humans when solving the environment.
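The idea of quantizing continuous actions can be sketched with plain NumPy. This is only an illustration under stated assumptions: the helper name `nearest_center` and the toy 2-D centers are hypothetical stand-ins for the 64-dimensional obfuscated action vectors and the cluster centers that k-means would produce.

```python
import numpy as np


def nearest_center(action, centers):
    """Return the index of the center closest to `action` (Euclidean distance)."""
    distances = np.linalg.norm(centers - action, axis=1)
    return int(np.argmin(distances))


# Toy 2-D stand-ins (assumption) for cluster centers found by k-means.
centers = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])

# A continuous action taken from a (made-up) human demonstration.
human_action = np.array([0.9, 1.2])

idx = nearest_center(human_action, centers)  # discrete label for this action
quantized = centers[idx]                     # continuous vector the agent would send
print(idx, quantized)
```

Each discrete action is just an index into the table of centers, so a discrete-action method picks an index and the environment receives the corresponding center vector.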
To get started, let’s download the demonstration data for the MineRLTreechopVectorObf-v0 environment.
python -m minerl.data.download 'MineRLTreechopVectorObf-v0'
Note
If you are unable to download the data, ensure you have the MINERL_DATA_ROOT environment variable set as demonstrated in data sampling.
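A quick way to check the variable from Python before downloading is sketched below. The helper name `minerl_data_root` is hypothetical; only the standard library is used, and the check does not verify that the directory actually contains data.

```python
import os


def minerl_data_root(environ=os.environ):
    """Return MINERL_DATA_ROOT if it is set and non-empty, else None."""
    root = environ.get("MINERL_DATA_ROOT")
    return root if root else None


root = minerl_data_root()
if root is None:
    print("MINERL_DATA_ROOT is not set; export it before downloading.")
else:
    print("Data will be read from:", root)
```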
Now we load the dataset for MineRLTreechopVectorObf-v0 and find 32 clusters using scikit-learn:
import minerl
import numpy as np
import tqdm
from sklearn.cluster import KMeans

dat = minerl.data.make('MineRLTreechopVectorObf-v0')

# Load the dataset, storing 1000 batches of actions
act_vectors = []
for _, act, _, _, _ in tqdm.tqdm(dat.batch_iter(16, 32, 2, preload_buffer_size=20)):
    act_vectors.append(act['vector'])
    if len(act_vectors) >= 1000:
        break

# Reshape the action batches into a flat array of 64-dimensional vectors
acts = np.concatenate(act_vectors).reshape(-1, 64)
kmeans_acts = acts[:100000]

# Use sklearn to cluster the demonstrated actions
kmeans = KMeans(n_clusters=32, random_state=0).fit(kmeans_acts)
Now we have 32 actions that represent reasonable actions for our agent to take. Let’s take these and improve our random hello world agent from before.
import gym

i, net_reward, done, env = 0, 0, False, gym.make('MineRLTreechopVectorObf-v0')
obs = env.reset()

while not done:
    # Let's use a frame skip of 4 (could you do better than a hard-coded frame skip?)
    if i % 4 == 0:
        # Pick one of the cluster centers uniformly at random
        action = {
            'vector': kmeans.cluster_centers_[np.random.choice(kmeans.n_clusters)]
        }

    obs, reward, done, info = env.step(action)
    env.render()

    if reward > 0:
        print("+{} reward!".format(reward))
    net_reward += reward
    i += 1

print("Total reward: ", net_reward)
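For algorithms that expect an integer action, the lookup from index to action dictionary can be wrapped in a small helper. The sketch below is one possible shape for that, not part of the MineRL API: the class name `DiscreteActionMap` and its methods are hypothetical, and randomly generated centers stand in for `kmeans.cluster_centers_` so the block runs on its own.

```python
import numpy as np


class DiscreteActionMap:
    """Maps an integer in [0, n) to an action dict with a 'vector' entry."""

    def __init__(self, centers):
        self.centers = np.asarray(centers)
        self.n = len(self.centers)

    def to_env_action(self, index):
        # The vectorized obfuscated environments take a dict with a 'vector' key.
        return {"vector": self.centers[index]}

    def sample(self, rng):
        # Uniformly pick one of the n discrete actions.
        return self.to_env_action(rng.integers(self.n))


rng = np.random.default_rng(0)

# Stand-in (assumption) for kmeans.cluster_centers_: 32 centers of dimension 64.
fake_centers = rng.normal(size=(32, 64))

action_map = DiscreteActionMap(fake_centers)
action = action_map.sample(rng)
print(action["vector"].shape)
```

With the real clustering result, `DiscreteActionMap(kmeans.cluster_centers_)` would give the agent one discrete action per cluster.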
Putting this all together we get:
Full snippet
import gym
import tqdm
import minerl
import numpy as np
from sklearn.cluster import KMeans

dat = minerl.data.make('MineRLTreechopVectorObf-v0')

act_vectors = []
NUM_CLUSTERS = 32

# Load the dataset, storing 1000 batches of actions
for _, act, _, _, _ in tqdm.tqdm(dat.batch_iter(16, 32, 2, preload_buffer_size=20)):
    act_vectors.append(act['vector'])
    if len(act_vectors) >= 1000:
        break

# Reshape the action batches into a flat array of 64-dimensional vectors
acts = np.concatenate(act_vectors).reshape(-1, 64)
kmeans_acts = acts[:100000]

# Use sklearn to cluster the demonstrated actions
kmeans = KMeans(n_clusters=NUM_CLUSTERS, random_state=0).fit(kmeans_acts)

i, net_reward, done, env = 0, 0, False, gym.make('MineRLTreechopVectorObf-v0')
obs = env.reset()

while not done:
    # Let's use a frame skip of 4 (could you do better than a hard-coded frame skip?)
    if i % 4 == 0:
        action = {
            'vector': kmeans.cluster_centers_[np.random.choice(NUM_CLUSTERS)]
        }

    obs, reward, done, info = env.step(action)
    env.render()

    if reward > 0:
        print("+{} reward!".format(reward))
    net_reward += reward
    i += 1

print("Total reward: ", net_reward)
Try comparing this k-means random agent with a random agent using env.action_space.sample()! You should see that the human actions are a much more reasonable way to explore the environment!
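One way to make that comparison is to run both agents through the same episode loop and compare total reward. The helper below is a minimal sketch: `run_episode`, `policy`, and `max_steps` are hypothetical names, and the loop assumes the four-value `env.step` return used in the snippets above.

```python
def run_episode(env, policy, frame_skip=4, max_steps=None):
    """Run one episode, asking `policy` for a new action every `frame_skip` steps.

    `policy` is any callable that maps an observation to an action.
    Returns the total reward collected in the episode.
    """
    obs = env.reset()
    done, step, net_reward, action = False, 0, 0.0, None
    while not done and (max_steps is None or step < max_steps):
        if step % frame_skip == 0:
            action = policy(obs)
        obs, reward, done, info = env.step(action)
        net_reward += reward
        step += 1
    return net_reward


# Usage sketch (assumes `env`, `kmeans`, and `np` from the full snippet above):
#
# kmeans_policy = lambda obs: {
#     'vector': kmeans.cluster_centers_[np.random.choice(len(kmeans.cluster_centers_))]
# }
# uniform_policy = lambda obs: env.action_space.sample()
#
# print("k-means agent:", run_episode(env, kmeans_policy))
# print("uniform agent:", run_episode(env, uniform_policy))
```

A single episode is noisy, so averaging the returned reward over several episodes per agent gives a fairer comparison.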