Tutorial: DIAL‑trained Neural Models¶

Comparing DIAL‑trained Neural Models, Majority Baselines and Information-Theoretic Limits¶

This notebook provides a step‑by‑step tutorial for reproducing the experiments that compare:

  1. MajorityPlayers — analytical baseline players with no learning.
  2. NeuralNetPlayers trained with DIAL + DRU — communication‑learning agents following the approach introduced in Foerster et al., 2016.
  3. Information‑theoretic upper limits on achievable performance.

We explain the motivation for each component and annotate each section so the workflow is clear and reproducible.

Summary of Results¶

The tables below (generated later in the notebook) report:

  • The Majority baseline performance.
  • The NeuralNet DIAL+DRU training performance.
  • The information‑theoretic bound, computed using mutual information.

These metrics let us evaluate how closely the learned communication strategies approach the theoretical optimum; a sketch of how this bound can be derived follows below.
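For reference, the noiseless info_limit values appear to follow a rate–distortion argument: Player A compresses the n² i.i.d. fair field bits into an m‑bit message, so on average the message carries at most m/n² bits about the bit at the gun position, and the best achievable guessing accuracy a then satisfies 1 − H(1 − a) = m/n², where H is the binary entropy. The sketch below inverts this relation numerically; the helper names binary_entropy and info_limit_sketch are illustrative, and the library routine limit_from_mutual_information may compute the bound differently.

import math

def binary_entropy(p: float) -> float:
    """Binary entropy H(p) in bits (hypothetical helper, not from the library)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def info_limit_sketch(field_size: int, comms_size: int) -> float:
    """Best guessing accuracy when an m-bit message describes n^2 i.i.d. fair bits."""
    bits_per_cell = comms_size / field_size ** 2
    target = max(0.0, 1.0 - bits_per_cell)   # required entropy of the error probability
    lo, hi = 0.0, 0.5                         # bisection over the error probability
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if binary_entropy(mid) < target:
            lo = mid
        else:
            hi = mid
    return 1.0 - 0.5 * (lo + hi)

print(info_limit_sketch(4, 1))   # ~0.6461, matching the comms_size = 1 table below

For field_size = 4 and comms_size = 1 this gives approximately 0.6461, in line with the tables that follow.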

Results for comms_size = 1¶

field_size  comms_size  maj_mean  nn_dial_mean  info_limit
4           1           0.5988    0.6037        0.6461
8           1           0.5526    0.5475        0.5735
16          1           0.5270    0.5284        0.5368
32          1           0.5169    0.5093        0.5184

Results for comms_size = 2¶

field_size  comms_size  maj_mean  nn_dial_mean  info_limit
4           2           0.6360    0.6122        0.7051
8           2           0.5770    0.5607        0.6037
16          2           0.5354    0.5401        0.5520
32          2           0.5008    0.5157        0.5260

Results for comms_size = 4¶

field_size  comms_size  maj_mean  nn_dial_mean  info_limit
4           4           0.6885    0.6222        0.7855
8           4           0.5994    0.5560        0.6461
16          4           0.5574    0.5202        0.5735
32          4           0.5236    0.5147        0.5368

Full table¶

field_size  comms_size  maj_mean  nn_dial_mean  info_limit
4           1           0.5988    0.6037        0.6461
4           2           0.6442    0.6231        0.7051
4           4           0.6823    0.6784        0.7855
4           8           0.7422    0.6498        0.8900
8           1           0.5541    0.5540        0.5735
8           2           0.5755    0.5628        0.6037
8           4           0.5955    0.5801        0.6461
8           8           0.6388    0.5580        0.7051
16          1           0.5287    0.5238        0.5368
16          2           0.5311    0.5284        0.5520
16          4           0.5596    0.5337        0.5735
16          8           0.5662    0.5424        0.6037
32          1           0.5129    0.5182        0.5184
32          2           0.5194    0.5131        0.5260
32          4           0.5161    0.5137        0.5368
32          8           0.5301    0.5253        0.5520

Overview of the Experimental Procedure¶

This notebook builds and evaluates a full communication-learning pipeline:

  1. Define a game layout — field size, communication bandwidth, and number of games.
  2. Train NeuralNetPlayers under DIAL + DRU — sending continuous communication during training, with noise added via DRU to encourage discretisation.
  3. Extract and freeze the trained models — enabling deterministic communication at evaluation time.
  4. Run a Tournament — evaluating trained players, majority baselines, and computing theoretical limits.

These components follow the structure proposed in referential-communication literature such as:

  • Foerster et al. Learning to Communicate with Deep Multi‑Agent Reinforcement Learning (2016).
  • Lowe et al. Multi‑Agent Actor‑Critic for Mixed Cooperative‑Competitive Environments (2017).

The goal is to assess how communication bandwidth and noise influence the emergence of useful signalling.

In [1]:
import os
import sys
from pathlib import Path

def change_to_repo_root(marker: str = "src") -> None:
    """Change CWD to the repository root (parent of `src`)."""
    here = Path.cwd()
    for parent in [here] + list(here.parents):
        if (parent / marker).is_dir():
            os.chdir(parent)
            break

change_to_repo_root()
print("CWD set to:", os.getcwd())

src_path = Path("src").resolve()
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))
print("sys.path[0]:", sys.path[0])
CWD set to: c:\Users\nly99857\OneDrive - Philips\SW Projects\QSeaBattle
sys.path[0]: C:\Users\nly99857\OneDrive - Philips\SW Projects\QSeaBattle\src
In [2]:
import random
import math
import numpy as np
import pandas as pd
import tensorflow as tf

import Q_Sea_Battle as qsb
from Q_Sea_Battle.dru_utils import dru_train
from Q_Sea_Battle.reference_performance_utilities import limit_from_mutual_information

SEED = 1232
np.random.seed(SEED)
random.seed(SEED * 2)
tf.random.set_seed(SEED * 4)

tf.config.run_functions_eagerly(True)

print("NumPy random seed:", SEED)
print("Python random seed:", SEED * 2)
print("TF random seed:", SEED * 4)
print("TensorFlow version:", tf.__version__)
NumPy random seed: 1232
Python random seed: 2464
TF random seed: 4928
TensorFlow version: 2.20.0
In [3]:
FIELD_SIZES = [4, 8, 16, 32]
COMMS_SIZES = [1,2,4,8]

# RL hyperparameters (can be tuned)
NUM_EPOCHS = 128
BATCHES_PER_EPOCH = 40
BATCH_SIZE = 2048*16
SIGMA_TRAIN = 2.0
CLIP_RANGE = (-10.0, 10.0)
LEARNING_RATE = 1e-3

# Sigma annealing for DRU: high noise -> low noise over epochs
SIGMA_START = 2.0
SIGMA_END = 0.3

N_GAMES_TOURNAMENT = 10000
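For orientation (this small calculation is ours, not part of the original configuration cell), the settings above imply the following training volume per (field_size, comms_size) combination; the variable names below are illustrative:

updates_per_setting = NUM_EPOCHS * BATCHES_PER_EPOCH    # 128 * 40 = 5,120 gradient updates
samples_per_setting = updates_per_setting * BATCH_SIZE  # 5,120 * 32,768 = 167,772,160 sampled games
print(f"{updates_per_setting} updates and {samples_per_setting:,} sampled games per setting")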

DIAL + DRU Training Helpers¶

This section defines the helper functions used for training the neural agents via DIAL (Differentiable Inter‑Agent Learning).

DIAL allows gradients to flow through communication channels, enabling agents to learn what messages to send. However, communication must ultimately be discrete during evaluation. To bridge this, we use DRU (Discretise / Regularise Unit):

  • During training, DRU adds continuous noise, encouraging messages to become separable.
  • During evaluation, DRU reduces to a hard threshold, producing discrete bits.
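The training-mode behaviour is provided by the library's dru_train (imported above). As a rough, self-contained sketch of the two modes described here, the function names and the exact noise model below are our assumptions, not the library implementation:

import tensorflow as tf

def dru_train_sketch(logits: tf.Tensor, sigma: float, clip_range=(-10.0, 10.0)) -> tf.Tensor:
    """Training mode: clip the logits, add Gaussian channel noise, squash to (0, 1)."""
    clipped = tf.clip_by_value(logits, clip_range[0], clip_range[1])
    noisy = clipped + tf.random.normal(tf.shape(clipped), stddev=sigma)
    return tf.nn.sigmoid(noisy)

def dru_eval_sketch(logits: tf.Tensor) -> tf.Tensor:
    """Evaluation mode: hard threshold at zero, producing discrete bits in {0, 1}."""
    return tf.cast(logits > 0.0, tf.float32)

During training the noise forces the sender to push its logits far from zero so the message survives the noise, which is what makes the hard threshold at evaluation time nearly lossless.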

The helper functions here:

  • simulate batches of games,
  • run model A to generate communication logits,
  • pass DRU‑processed messages to model B,
  • compute loss from observed rewards,
  • update the parameters via policy gradients.

This mirrors the training structure of the original DIAL paper but adapted for the QSeaBattle game setup.
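In symbols, the update implemented in the next two cells is a standard REINFORCE estimator with a mean-reward baseline and an entropy bonus (our notation; β corresponds to entropy_coeff in the code):

$$
\nabla_\theta J \;\approx\; \frac{1}{B}\sum_{i=1}^{B}\Big[\big(r_i - \bar r\big)\,\nabla_\theta \log \pi_\theta(a_i \mid s_i) \;+\; \beta\,\nabla_\theta \mathcal{H}\big[\pi_\theta(\cdot \mid s_i)\big]\Big]
$$

The code minimizes the corresponding negative objective and optionally rescales the advantages r_i − r̄ by their standard deviation to reduce gradient variance; gradients reach model A through the continuous DRU output, which is the DIAL part of the scheme.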

In [4]:
from Q_Sea_Battle.tournament import Tournament
from Q_Sea_Battle.majority_players import MajorityPlayers
from Q_Sea_Battle.neural_net_players import NeuralNetPlayers


def sample_batch(layout: qsb.GameLayout, batch_size: int):
    """Sample (fields, guns, cell_values) for the given layout.

    - fields: shape (B, n²), Bernoulli(enemy_probability)
    - guns:   shape (B, n²), one-hot
    - cell_values: shape (B, 1), field value at the gun index
    """
    n2 = layout.field_size ** 2
    p = layout.enemy_probability

    fields = np.random.binomial(1, p, size=(batch_size, n2)).astype("float32")

    guns = np.zeros((batch_size, n2), dtype="float32")
    gun_indices = np.random.randint(0, n2, size=(batch_size,))
    guns[np.arange(batch_size), gun_indices] = 1.0

    cell_values = (fields * guns).sum(axis=1, keepdims=True).astype("float32")
    return fields, guns, cell_values


def evaluate_players_in_tournament(
    layout: qsb.GameLayout,
    players_factory,
    label: str = "",
) -> float:
    """Run a tournament and return mean reward.

    `players_factory` is any Players subclass instance, e.g.
    MajorityPlayers(layout) or NeuralNetPlayers(layout, ...).
    """
    game_env = qsb.GameEnv(game_layout=layout)
    tournament = Tournament(game_env=game_env, players=players_factory, game_layout=layout)
    log = tournament.tournament()
    mean_reward, std_err = log.outcome()

    if label:
        print(
            f"{label}: mean reward = {mean_reward:.4f} ± {std_err:.4f} "
        )
    else:
        print(
            f"mean reward = {mean_reward:.4f} ± {std_err:.4f} "
        )

    return mean_reward
In [5]:
def dial_pg_update(
    model_a: tf.keras.Model,
    model_b: tf.keras.Model,
    optimizer: tf.keras.optimizers.Optimizer,
    layout: qsb.GameLayout,
    batch_size: int = 512,
    sigma: float = 2.0,
    clip_range=(-10.0, 10.0),
    entropy_coeff: float = 0.01,
    normalize_adv: bool = True,
) -> tuple[float, float]:
    """Improved DIAL-style policy-gradient update step with DRU.

    Compared to a plain REINFORCE update, this adds:
    - optional advantage normalization (variance reduction),
    - an entropy bonus on the shoot policy.
    """
    fields_np, guns_np, cell_values_np = sample_batch(layout, batch_size)

    fields_tf = tf.convert_to_tensor(fields_np, dtype=tf.float32)
    fields_scaled = fields_tf - 0.5  # map {0, 1} -> {-0.5, 0.5} to match NeuralNetPlayerA
    guns_tf = tf.convert_to_tensor(guns_np, dtype=tf.float32)
    cell_values_tf = tf.convert_to_tensor(cell_values_np, dtype=tf.float32)

    n2 = layout.field_size ** 2
    denom = float(max(1, n2 - 1))
    gun_indices = tf.argmax(guns_tf, axis=1, output_type=tf.int32)
    gun_idx_norm = tf.cast(gun_indices, tf.float32) / denom
    gun_idx_norm = tf.reshape(gun_idx_norm, (-1, 1))


    eps = 1e-8

    with tf.GradientTape() as tape:
        # A produces communication logits
        comm_logits = model_a(fields_scaled, training=True)          # (B, m)

        # DRU (train mode): logits + noise -> logistic
        comm_cont = dru_train(comm_logits, sigma=sigma, clip_range=clip_range)
        comm_cont = tf.cast(comm_cont, tf.float32)               # (B, m) in (0,1)

        # B receives gun + continuous comm
        x_b = tf.concat([gun_idx_norm, comm_cont], axis=1)          # (B, 1 + m)
        shoot_logits = model_b(x_b, training=True)               # (B, 1)

        probs = tf.nn.sigmoid(shoot_logits)
        rnd = tf.random.uniform(tf.shape(probs))
        actions = tf.cast(rnd < probs, tf.float32)               # (B, 1) in {0,1}

        # Team reward: 1 if correct guess of the field bit at the gun index
        rewards = tf.cast(tf.equal(actions, cell_values_tf), tf.float32)

        # Baseline and advantage
        baseline = tf.reduce_mean(rewards)
        advantages = rewards - baseline

        if normalize_adv:
            adv_std = tf.math.reduce_std(advantages) + 1e-8
            advantages = advantages / adv_std

        advantages = tf.stop_gradient(advantages)

        # Log-prob of the sampled action under Bernoulli(probs)
        log_probs = (
            actions * tf.math.log(probs + eps)
            + (1.0 - actions) * tf.math.log(1.0 - probs + eps)
        )

        # Policy entropy for Bernoulli(probs)
        entropy = -(
            probs * tf.math.log(probs + eps)
            + (1.0 - probs) * tf.math.log(1.0 - probs + eps)
        )

        # REINFORCE loss with entropy regularization
        loss_pg = -tf.reduce_mean(log_probs * advantages)
        loss_ent = -tf.reduce_mean(entropy)   # negative mean entropy: minimizing it raises policy entropy
        loss = loss_pg + entropy_coeff * loss_ent

    params = model_a.trainable_variables + model_b.trainable_variables
    grads = tape.gradient(loss, params)
    optimizer.apply_gradients(zip(grads, params))

    mean_reward = float(tf.reduce_mean(rewards).numpy())
    loss_value = float(loss.numpy())
    return mean_reward, loss_value
In [6]:
def sigma_for_epoch(epoch: int, num_epochs: int) -> float:
    """Linearly interpolate sigma from SIGMA_START to SIGMA_END over epochs."""
    t = epoch / max(1, num_epochs)
    return SIGMA_START * (1.0 - t) + SIGMA_END * t


print("\nRunning updated sweep with advantage normalization, entropy bonus, and sigma annealing...")

results_updated = []

for n in FIELD_SIZES:
    for m in COMMS_SIZES:
        print(f"Field_size = {n}, comms_size = {m}")

        layout = qsb.GameLayout(
            field_size=n,
            comms_size=m,
            enemy_probability=0.5,
            channel_noise=0.0,
            number_of_games_in_tournament=N_GAMES_TOURNAMENT,
        )

        # Majority players as before
        majority_players = MajorityPlayers(layout)
        maj_mean = evaluate_players_in_tournament(
            layout, majority_players,
            label="\tMajorityPlayers"
        )

        # Fresh NeuralNetPlayers and models
        nn_players = NeuralNetPlayers(game_layout=layout, explore=True)
        player_a, player_b = nn_players.players()
        model_a = nn_players.model_a
        model_b = nn_players.model_b
        assert model_a is not None and model_b is not None

        optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)

        for epoch in range(1, NUM_EPOCHS + 1):
            sigma_now = sigma_for_epoch(epoch, NUM_EPOCHS)
            epoch_rewards = []
            epoch_losses = []

            for _ in range(BATCHES_PER_EPOCH):
                r, l = dial_pg_update(
                    model_a=model_a,
                    model_b=model_b,
                    optimizer=optimizer,
                    layout=layout,
                    batch_size=BATCH_SIZE,
                    sigma=sigma_now,
                    clip_range=CLIP_RANGE,
                    entropy_coeff=0.01,
                    normalize_adv=True,
                )
                epoch_rewards.append(r)
                epoch_losses.append(l)

        # Evaluate trained neural nets (greedy play)
        nn_players_eval = NeuralNetPlayers(
            game_layout=layout,
            model_a=model_a,
            model_b=model_b,
            explore=False,
        )
        nn_mean = evaluate_players_in_tournament(
            layout, nn_players_eval,
            label="\tNeuralNetPlayers (DIAL+DRU)"
        )

        info_limit = limit_from_mutual_information(
            field_size=n,
            comms_size=m,
            channel_noise=0.0,
            accuracy_in_digits=10,
        )

        print(
            f"\tInfo-theoretic upper bound (noiseless channel): {info_limit:.4f}"
        )

        results_updated.append(
            {
                "field_size": n,
                "comms_size": m,
                "maj_mean": maj_mean,
                "nn_dial_mean": nn_mean,
                "info_limit": info_limit,
            }
        )

        # Store trained models for this (field_size, comms_size) setting
        filename_a = f"notebooks/models/neural_net_model_a_f{n}_c{m}.keras"
        filename_b = f"notebooks/models/neural_net_model_b_f{n}_c{m}.keras"
        nn_players.store_models(filename_a, filename_b)
        print(f"Saved models to {filename_a} and {filename_b}")
Running updated sweep with advantage normalization, entropy bonus, and sigma annealing...
Field_size = 4, comms_size = 1
	MajorityPlayers: mean reward = 0.5988 ± 0.0049 
	NeuralNetPlayers (DIAL+DRU): mean reward = 0.6037 ± 0.0049 
	Info-theoretic upper bound (noiseless channel): 0.6461
Saved models to notebooks/models/neural_net_model_a_f4_c1.keras and notebooks/models/neural_net_model_b_f4_c1.keras
Field_size = 4, comms_size = 2
	MajorityPlayers: mean reward = 0.6442 ± 0.0048 
	NeuralNetPlayers (DIAL+DRU): mean reward = 0.6231 ± 0.0048 
	Info-theoretic upper bound (noiseless channel): 0.7051
Saved models to notebooks/models/neural_net_model_a_f4_c2.keras and notebooks/models/neural_net_model_b_f4_c2.keras
Field_size = 4, comms_size = 4
	MajorityPlayers: mean reward = 0.6823 ± 0.0047 
	NeuralNetPlayers (DIAL+DRU): mean reward = 0.6784 ± 0.0047 
	Info-theoretic upper bound (noiseless channel): 0.7855
Saved models to notebooks/models/neural_net_model_a_f4_c4.keras and notebooks/models/neural_net_model_b_f4_c4.keras
Field_size = 4, comms_size = 8
	MajorityPlayers: mean reward = 0.7422 ± 0.0044 
	NeuralNetPlayers (DIAL+DRU): mean reward = 0.6498 ± 0.0048 
	Info-theoretic upper bound (noiseless channel): 0.8900
Saved models to notebooks/models/neural_net_model_a_f4_c8.keras and notebooks/models/neural_net_model_b_f4_c8.keras
Field_size = 8, comms_size = 1
	MajorityPlayers: mean reward = 0.5541 ± 0.0050 
	NeuralNetPlayers (DIAL+DRU): mean reward = 0.5540 ± 0.0050 
	Info-theoretic upper bound (noiseless channel): 0.5735
Saved models to notebooks/models/neural_net_model_a_f8_c1.keras and notebooks/models/neural_net_model_b_f8_c1.keras
Field_size = 8, comms_size = 2
	MajorityPlayers: mean reward = 0.5755 ± 0.0049 
	NeuralNetPlayers (DIAL+DRU): mean reward = 0.5628 ± 0.0050 
	Info-theoretic upper bound (noiseless channel): 0.6037
Saved models to notebooks/models/neural_net_model_a_f8_c2.keras and notebooks/models/neural_net_model_b_f8_c2.keras
Field_size = 8, comms_size = 4
	MajorityPlayers: mean reward = 0.5955 ± 0.0049 
	NeuralNetPlayers (DIAL+DRU): mean reward = 0.5801 ± 0.0049 
	Info-theoretic upper bound (noiseless channel): 0.6461
Saved models to notebooks/models/neural_net_model_a_f8_c4.keras and notebooks/models/neural_net_model_b_f8_c4.keras
Field_size = 8, comms_size = 8
	MajorityPlayers: mean reward = 0.6388 ± 0.0048 
	NeuralNetPlayers (DIAL+DRU): mean reward = 0.5580 ± 0.0050 
	Info-theoretic upper bound (noiseless channel): 0.7051
Saved models to notebooks/models/neural_net_model_a_f8_c8.keras and notebooks/models/neural_net_model_b_f8_c8.keras
Field_size = 16, comms_size = 1
	MajorityPlayers: mean reward = 0.5287 ± 0.0050 
	NeuralNetPlayers (DIAL+DRU): mean reward = 0.5238 ± 0.0050 
	Info-theoretic upper bound (noiseless channel): 0.5368
Saved models to notebooks/models/neural_net_model_a_f16_c1.keras and notebooks/models/neural_net_model_b_f16_c1.keras
Field_size = 16, comms_size = 2
	MajorityPlayers: mean reward = 0.5311 ± 0.0050 
	NeuralNetPlayers (DIAL+DRU): mean reward = 0.5284 ± 0.0050 
	Info-theoretic upper bound (noiseless channel): 0.5520
Saved models to notebooks/models/neural_net_model_a_f16_c2.keras and notebooks/models/neural_net_model_b_f16_c2.keras
Field_size = 16, comms_size = 4
	MajorityPlayers: mean reward = 0.5596 ± 0.0050 
	NeuralNetPlayers (DIAL+DRU): mean reward = 0.5337 ± 0.0050 
	Info-theoretic upper bound (noiseless channel): 0.5735
Saved models to notebooks/models/neural_net_model_a_f16_c4.keras and notebooks/models/neural_net_model_b_f16_c4.keras
Field_size = 16, comms_size = 8
	MajorityPlayers: mean reward = 0.5662 ± 0.0050 
	NeuralNetPlayers (DIAL+DRU): mean reward = 0.5424 ± 0.0050 
	Info-theoretic upper bound (noiseless channel): 0.6037
Saved models to notebooks/models/neural_net_model_a_f16_c8.keras and notebooks/models/neural_net_model_b_f16_c8.keras
Field_size = 32, comms_size = 1
	MajorityPlayers: mean reward = 0.5129 ± 0.0050 
	NeuralNetPlayers (DIAL+DRU): mean reward = 0.5182 ± 0.0050 
	Info-theoretic upper bound (noiseless channel): 0.5184
Saved models to notebooks/models/neural_net_model_a_f32_c1.keras and notebooks/models/neural_net_model_b_f32_c1.keras
Field_size = 32, comms_size = 2
	MajorityPlayers: mean reward = 0.5194 ± 0.0050 
	NeuralNetPlayers (DIAL+DRU): mean reward = 0.5131 ± 0.0050 
	Info-theoretic upper bound (noiseless channel): 0.5260
Saved models to notebooks/models/neural_net_model_a_f32_c2.keras and notebooks/models/neural_net_model_b_f32_c2.keras
Field_size = 32, comms_size = 4
	MajorityPlayers: mean reward = 0.5161 ± 0.0050 
	NeuralNetPlayers (DIAL+DRU): mean reward = 0.5137 ± 0.0050 
	Info-theoretic upper bound (noiseless channel): 0.5368
Saved models to notebooks/models/neural_net_model_a_f32_c4.keras and notebooks/models/neural_net_model_b_f32_c4.keras
Field_size = 32, comms_size = 8
	MajorityPlayers: mean reward = 0.5301 ± 0.0050 
	NeuralNetPlayers (DIAL+DRU): mean reward = 0.5253 ± 0.0050 
	Info-theoretic upper bound (noiseless channel): 0.5520
Saved models to notebooks/models/neural_net_model_a_f32_c8.keras and notebooks/models/neural_net_model_b_f32_c8.keras
In [7]:
df_results_updated = pd.DataFrame(results_updated)
df_results_updated
Out[7]:
field_size comms_size maj_mean nn_dial_mean info_limit
0 4 1 0.5988 0.6037 0.646103
1 4 2 0.6442 0.6231 0.705074
2 4 4 0.6823 0.6784 0.785498
3 4 8 0.7422 0.6498 0.889972
4 8 1 0.5541 0.5540 0.573455
5 8 2 0.5755 0.5628 0.603692
6 8 4 0.5955 0.5801 0.646103
7 8 8 0.6388 0.5580 0.705074
8 16 1 0.5287 0.5238 0.536777
9 16 2 0.5311 0.5284 0.551988
10 16 4 0.5596 0.5337 0.573455
11 16 8 0.5662 0.5424 0.603692
12 32 1 0.5129 0.5182 0.518395
13 32 2 0.5194 0.5131 0.526011
14 32 4 0.5161 0.5137 0.536777
15 32 8 0.5301 0.5253 0.551988

Conclusion¶

This notebook demonstrates how communication‑learning agents can be trained and compared against both analytical baselines and information‑theoretic limits.

With the updated model interface, only the input preprocessing changed; the rest of the pipeline is untouched, which keeps this tutorial stable for future experiments.

You can now adjust:

  • communication bandwidth,
  • field size,
  • DRU noise parameters,

to explore the emergence of communication under different constraints.
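As a starting point for such experiments, the stored models can be reloaded and re-evaluated without retraining. A minimal sketch, assuming the .keras files written above reload cleanly with tf.keras.models.load_model and reusing the names defined earlier in this notebook:

model_a = tf.keras.models.load_model("notebooks/models/neural_net_model_a_f4_c1.keras")
model_b = tf.keras.models.load_model("notebooks/models/neural_net_model_b_f4_c1.keras")

layout = qsb.GameLayout(
    field_size=4,
    comms_size=1,
    enemy_probability=0.5,
    channel_noise=0.0,
    number_of_games_in_tournament=N_GAMES_TOURNAMENT,
)
players = NeuralNetPlayers(game_layout=layout, model_a=model_a, model_b=model_b, explore=False)
evaluate_players_in_tournament(layout, players, label="Reloaded NeuralNetPlayers (f=4, c=1)")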

Results from an 840‑minute run¶

Settings¶

FIELD_SIZES = [4, 8, 16, 32]
COMMS_SIZES = [1, 2, 4, 8]

# RL hyperparameters (can be tuned)
NUM_EPOCHS = 128
BATCHES_PER_EPOCH = 40
BATCH_SIZE = 2048*16
SIGMA_TRAIN = 2.0
CLIP_RANGE = (-10.0, 10.0)
LEARNING_RATE = 1e-3

# Sigma annealing for DRU: high noise -> low noise over epochs
SIGMA_START = 2.0
SIGMA_END = 0.3

N_GAMES_TOURNAMENT = 10000

The ± range reported with each mean reward is the standard deviation divided by the square root of the number of games, i.e. the standard error over the 10,000 tournament games. The full sweep log and the resulting table from this run are identical to the cell In [6] output and the Out[7] table shown above.
