## Overview
This document explains the training data structure for the 2D Point-Robot Navigator project, where a robot learns to navigate using only a 3×3 perception window in 10×10 environments.
Robot environment:

(10×10 grid diagram: robot `R` at the top-left corner, random obstacles scattered across the map)
Robot local view:

(3×3 local view diagram: robot `R` at the center, deciding whether to go RIGHT or DOWN => DOWN)
What is the training data exactly? A 3×3 perception patch plus an action tag. How does it work?
(state) = (local_view, goal_delta)
(label) = expert action (from A*)
- Features: flattened 3×3 perception grid (9 values) + goal_delta
- ==4 actions: UP(0), DOWN(1), LEFT(2), RIGHT(3)==
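As a concrete illustration, here is one hypothetical (state, label) pair written out as plain Python; the field names `local_view` and `goal_delta` and the specific values are illustrative, not the project's actual schema:

```python
# One hypothetical training sample: the flattened 3x3 local view (1 = obstacle,
# 0 = free), the normalized relative goal vector, and the expert action from A*.
sample_state = {
    "local_view": [0, 1, 0,
                   0, 0, 1,
                   0, 1, 0],
    "goal_delta": [0.2, 0.1],   # (dx, dy) from robot to goal, scaled by the map size
}
sample_label = 1                # expert action: DOWN (0=UP, 1=DOWN, 2=LEFT, 3=RIGHT)
```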
Training data input features
Imitation learning
Training an agent to mimic an expert's behavior instead of discovering it purely by trial and error. You give the agent examples of state → expert action, and it learns to copy those actions based on its local view.
A* explicitly uses the goal to guide search (through the heuristic h) and to decide when to stop (when the goal node is reached).
Example with A*:
- Generate a 10×10 grid with obstacles.
- Use A* (expert planner) to find the shortest path from start → goal.
- At each step, record:
  - State (3×3 local patch + relative goal vector)
  - Expert action (the next move chosen by A*)
- Train a neural network on these pairs.

Result: the network learns to navigate like A*, but only using its local 3×3 view and goal info.
Is the final goal or destination considered by the A* algorithm when finding the optimal path?
A* explicitly uses the goal to guide search (through the heuristic h) and to decide when to stop (when the goal node is reached).
Heuristic function:
- To evaluate each candidate node n, A* uses a heuristic that depends on the goal position.
- Example (Manhattan distance in a grid): h(n) = |x_n - x_goal| + |y_n - y_goal|.
- Without the goal, you cannot compute h(n).
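A minimal sketch of such a grid heuristic in Python (the function name is illustrative):

```python
def manhattan_h(node, goal):
    """Admissible A* heuristic for a 4-connected grid: |dx| + |dy| to the goal."""
    (x, y), (gx, gy) = node, goal
    return abs(x - gx) + abs(y - gy)
```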
What are other alternatives to the A* algorithm?
| Aspect | A* | Imitation Learning (NN) | Reinforcement Learning (RL) |
|---|---|---|---|
| Needs full map? | Yes | No (local perception + goal) | No (just rewards + perception) |
| Optimality | Guaranteed | Approximate | Approximate |
| Data need | None | Needs expert demos (A*) | Needs reward signals & lots of episodes |
| Computation | Slow search | Fast (forward pass) | Slow to train, fast at test |
| Generalization | Must re-run | Generalizes across maps | Can adapt to new maps via training |
| Interpretability | Clear | Black-box | Black-box |
| Works in partial observability | No | Yes | Yes |
| Applications | Routing, games, static maps | Robots, cars, drones | Complex robotics, unknown envs |
Summary:
- A* = optimal, but unrealistic under partial observability.
- Imitation learning = fast, practical, learns an A*-like policy from local views.
- RL = flexible, works without an expert, but training is costly.
- In practice, real robots often combine A* (global planning) + NN (local control).
Q&A on training data
Is it required to include the initial position, current position, and final goal in the training data?
- Initial position: Not required as an explicit feature.
  - At each training step the agent's local perception is always centered on itself, so the absolute start doesn't matter.
What does a corner position, or a robot along a wall, look like in the 3×3 perception?
Robot in the top-left corner: `[1, 1, 1], [1, 0, 0], [1, 0, 0]`
- Corners → two sides of the patch (a row and a column) padded with 1's.
- Walls → one side padded with 1's.
- Current position: Not needed explicitly either.
  - The 3×3 patch (with the agent at the center) already encodes where it is relative to local obstacles.
- Final position (goal coordinates): Required.
  - You need to encode the goal information (relative vector dx, dy) at every step.
  - Without this, the network wouldn't know which direction to head.
- goal_delta: relative vector (dx, dy) from the robot to the goal
- action: expert's next move from A*
Is training and validation accuracy measured against reaching the final goal, or against the predicted action? If it is measured against the predicted action, how do we make sure the robot reaches its goal?
Step accuracy vs goal success
In Imitation Learning (supervised)
- Training/validation accuracy is measured against the predicted action vs. expert action (from A*).
- Example: if A* says "move East" and your network predicts "East", that's a correct classification.
- So accuracy = % of steps where predicted action = expert action. This is like image classification: input (3×3 patch + goal delta) → label (expert's next move).
- Here, you don't directly measure "did the robot reach the final goal" during training. That comes later, in evaluation: roll out the trained policy in test environments and count how often it actually reaches the goal.
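A small sketch of the two metrics in Python; the helper names are hypothetical, and the rollout outcomes are assumed to come from a separate evaluation loop:

```python
import numpy as np

def step_accuracy(pred_actions, expert_actions):
    """Supervised metric: fraction of steps where the predicted action matches A*."""
    return float(np.mean(np.asarray(pred_actions) == np.asarray(expert_actions)))

def goal_success_rate(episode_reached_goal):
    """Evaluation metric: fraction of rollout episodes that actually reached the goal."""
    return float(np.mean(np.asarray(episode_reached_goal, dtype=bool)))

# High step accuracy does not automatically guarantee 100% goal success:
print(step_accuracy([1, 1, 3, 0], [1, 1, 3, 2]))      # 0.75
print(goal_success_rate([True, True, False, True]))   # 0.75
```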
To find the optimal path, is the final goal or destination included in the A* algorithm?
Yes: A* explicitly uses the goal to guide search (through the heuristic h) and to decide when to stop (when the goal node is reached).
If the A*-generated path used for NN training already includes goal information, do we still need to add an explicit `goal_delta` (e.g. [0.2, 0.1]) to the NN training data?
Even though the A* path "knows" the goal, the neural network doesn't see the full map; it only gets:
- Local 3×3 perception
- Relative direction to the goal (goal_delta)

Without explicitly giving the network the goal information, the same 3×3 local view could mean different correct actions depending on which goal you're heading toward.
(state) = (local_view, goal_delta)
(label) = expert action (from A*)
Will the robot know the final destination during testing?
Yes: you must tell it the final goal at the start of each episode.
- During testing, you don't reveal the path, but you provide the goal coordinates.
- The robot then computes goal_delta at every step.
- Without this, the robot would only wander locally because it wouldn't know which direction to go.
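A test-time rollout sketch under these assumptions: a trained `model`, a known `goal`, and a hypothetical `extract_patch` helper (a padding version is sketched a little further below); only the goal coordinates are given, never the path:

```python
import numpy as np

def rollout(model, grid, start, goal, max_steps=100):
    """Step the learned policy: build (3x3 patch + goal_delta), predict, move."""
    moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}    # UP, DOWN, LEFT, RIGHT (row, col)
    pos = start
    for _ in range(max_steps):
        if pos == goal:
            return True                                       # goal reached
        patch = extract_patch(grid, pos)                      # hypothetical 3x3 extractor
        goal_delta = [(goal[0] - pos[0]) / 10.0,              # normalized relative goal vector
                      (goal[1] - pos[1]) / 10.0]
        state = np.concatenate([patch.ravel(), goal_delta]).astype(np.float32)
        action = int(model.predict(state[None, :]).argmax())  # assumed model API
        dr, dc = moves[action]
        pos = (pos[0] + dr, pos[1] + dc)
    return False                                              # ran out of steps
```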
Does the robot need wall information in the environment?
Yes: the robot must perceive walls (map boundaries) inside its 3×3 patch, otherwise it won't know it's at the edge and might try to walk out of bounds.
- Include wall information in the 3×3 patch by padding out-of-bounds cells with "wall = 1" in the obstacle channel.
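A minimal sketch of that padding, assuming a NumPy grid where 1 = obstacle; the function name `extract_patch` is illustrative:

```python
import numpy as np

def extract_patch(grid, pos, pad_value=1.0):
    """Return the 3x3 neighborhood around pos; cells outside the map count as walls."""
    rows, cols = grid.shape
    patch = np.full((3, 3), pad_value, dtype=np.float32)      # start as all-wall
    r0, c0 = pos
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = r0 + dr, c0 + dc
            if 0 <= r < rows and 0 <= c < cols:
                patch[dr + 1, dc + 1] = grid[r, c]            # copy in-bounds cells
    return patch
```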
## Data Structure Summary
| | INPUT (X_train) | OUTPUT (y_train) |
|---|---|---|
| Type | float32 (optimized) | int8 |
| Shape | (841, 9) | (841,) |
| Values | 0.0 to 1.0 | 0-3 |
| Meaning | 3×3 obstacle patterns | Actions |

For each sample in X_train there is one corresponding action in y_train.
X_train shape: (841, 9). Each row is one flattened 3×3 view recorded along an A* path (9 features per sample):
- Sample 0: [0,1,0,0,0,1,0,1,0]
- Sample 1: [0,1,0,0,0,1,0,0,0]
- Sample 2: [1,0,0,1,0,0,0,0,1]
- ...
- Sample 840: [0,0,1,0,1,0,0,0,0]
### How to Visualize the (841, 9) Shape
Think of it as a table (in_N = input neuron):
| Sample | Feature1 (in_N1) | Feature2 (in_N2) | Feature3 (in_N3) | … | Feature9 (in_N9) | Action |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | … | 0.0 | 1 |
| 1 | 0.0 | 1.0 | 0.0 | … | 0.0 | 1 |
| 2 | 1.0 | 0.0 | 0.0 | … | 1.0 | 0 |
| … | … | … | … | … | … | … |
| 840 | 0.0 | 0.0 | 1.0 | … | 0.0 | 2 |
Key Numbers:
- 841 samples: Total training examples
- 9 features: Flattened 3×3 perception grid
- 4 actions: UP(0), DOWN(1), LEFT(2), RIGHT(3)
- 100 environments: Diverse training scenarios
## How Training Data is Generated
Example input/output pair for a variant whose state also includes the last three actions (one-hot encoded) alongside the 3×3 perception:
Input: [0, 1, 0, 1, 1, 0, 0, 1, 0, # 3x3 perception (9 features)
        1, 0, 0, 0, # Last action: UP (one-hot)
        0, 1, 0, 0, # 2nd last action: DOWN (one-hot)
        0, 0, 1, 0] # 3rd last action: LEFT (one-hot)
Output: [2] # Action (LEFT)
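A sketch of assembling such a 9 + 3×4 = 21-feature vector (helper names are illustrative; this action-history variant is an alternative to the plain 9-feature input used elsewhere in this document):

```python
import numpy as np

def one_hot(action, n_actions=4):
    """Encode an action index (0=UP, 1=DOWN, 2=LEFT, 3=RIGHT) as a one-hot vector."""
    v = np.zeros(n_actions, dtype=np.float32)
    v[action] = 1.0
    return v

def build_features(patch_flat, last_actions):
    """Concatenate the flattened 3x3 perception with one-hot encodings of the
    last three actions -> 9 + 3*4 = 21 features."""
    history = [one_hot(a) for a in last_actions]              # most recent first
    return np.concatenate([np.asarray(patch_flat, dtype=np.float32), *history])

x = build_features([0, 1, 0, 1, 1, 0, 0, 1, 0], last_actions=[0, 1, 2])  # UP, DOWN, LEFT
print(x.shape)  # (21,)
```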
(Grid diagram: robot `R` at the top-left edge of the map, next to obstacles)
Current 3×3 view:
- Row 1: 1.0, 1.0, 1.0 ← all treated as obstacles
- Row 2: 0.0, 0.0, 1.0 ← missing wall information
- Row 3: 0.0, 1.0, 0.0
### Step 1: Environment Creation
Generate a 10×10 grid with random obstacles:

(10×10 grid diagram: robot `R` at the top-left, random obstacles scattered across the map)
Start: (0,0) Goal: (9,9)
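A simple way such environments could be generated (the obstacle density and the clearing of start/goal cells are assumptions):

```python
import numpy as np

def make_environment(size=10, obstacle_prob=0.2, seed=None):
    """Create a size x size grid: 1.0 = obstacle, 0.0 = free; start and goal kept free."""
    rng = np.random.default_rng(seed)
    grid = (rng.random((size, size)) < obstacle_prob).astype(np.float32)
    grid[0, 0] = 0.0                  # start cell
    grid[size - 1, size - 1] = 0.0    # goal cell
    return grid

grid = make_environment(seed=0)
print(grid.shape)  # (10, 10)
```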
### Step 2: A* Pathfinding
The A* algorithm finds the optimal path:

(10×10 grid diagram: R = robot start at the top-left, arrows mark the path direction, G = goal at the bottom-right)
A* Algorithm Result:
Complete Path: [(0,0) → (0,1) → (1,1) → (2,1) → (3,1) → (4,1) → (5,1) → (6,1) → (7,1) → (8,1) → (8,2) → (8,3) → (8,4) → (8,5) → (8,6) → (8,7) → (8,8) → (9,8) → (9,9)]
Path Length: 18 steps
Path Validity: Yes (reaches goal)
Optimality: Yes (shortest possible path)
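A compact A* sketch for this grid setting (4-connected moves, Manhattan heuristic); it is a generic textbook implementation, not necessarily the project's exact planner:

```python
import heapq

def astar(grid, start, goal):
    """Shortest obstacle-free path from start to goal as a list of cells, or None."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])    # Manhattan heuristic
    open_heap = [(h(start), 0, start)]                         # (f, g, node)
    came_from, g_score = {}, {start: 0}
    while open_heap:
        _, g, current = heapq.heappop(open_heap)
        if current == goal:                                    # stop once the goal is reached
            path = [current]
            while current in came_from:
                current = came_from[current]
                path.append(current)
            return path[::-1]
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):      # UP, DOWN, LEFT, RIGHT
            nb = (current[0] + dr, current[1] + dc)
            if not (0 <= nb[0] < rows and 0 <= nb[1] < cols) or grid[nb[0]][nb[1]] == 1:
                continue                                       # out of bounds or obstacle
            tentative = g + 1
            if tentative < g_score.get(nb, float("inf")):
                g_score[nb] = tentative
                came_from[nb] = current
                heapq.heappush(open_heap, (tentative + h(nb), tentative, nb))
    return None
```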
### Step 3: Extract 3×3 Perceptions
At each step along the A* path:
(Grid diagrams: the robot shown at map positions (0,0) and (1,0) along the path)

Position (0,0):
3×3 view:
- Row 1: 0.0, 1.0, 0.0 ← out of bounds
- Row 2: 0.0, 0.0, 1.0
- Row 3: 0.0, 1.0, 0.0
Flattened: [0,1,0,0,0,1,0,1,0]
Action: 1 (DOWN)

Position (1,0):
3×3 view:
- Row 1: 0.0, 1.0, 0.0 ← out of bounds
- Row 2: 0.0, 0.0, 1.0
- Row 3: 0.0, 0.0, 0.0
Flattened: [0,1,0,0,0,1,0,0,0]
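Putting steps 1-3 together, a sketch of how the (state, action) pairs could be collected along each A* path, reusing the hypothetical `make_environment`, `astar`, and `extract_patch` helpers sketched earlier:

```python
import numpy as np

ACTION_OF_DELTA = {(-1, 0): 0, (1, 0): 1, (0, -1): 2, (0, 1): 3}  # UP, DOWN, LEFT, RIGHT

def collect_pairs(n_envs=100, size=10):
    """Run A* in many random environments and record (3x3 patch + goal_delta, action)."""
    X, y = [], []
    for seed in range(n_envs):
        grid = make_environment(size=size, seed=seed)
        start, goal = (0, 0), (size - 1, size - 1)
        path = astar(grid, start, goal)
        if path is None:                                       # unsolvable map: skip it
            continue
        for pos, nxt in zip(path[:-1], path[1:]):
            patch = extract_patch(grid, pos).ravel()
            goal_delta = [(goal[0] - pos[0]) / size, (goal[1] - pos[1]) / size]
            X.append(np.concatenate([patch, goal_delta]))
            y.append(ACTION_OF_DELTA[(nxt[0] - pos[0], nxt[1] - pos[1])])
    return np.asarray(X, dtype=np.float32), np.asarray(y, dtype=np.int8)
```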
## Data Structure Details
### Input Data (X_train)
- **Shape**: (841, 9)
- **Type**: float64
- **Values**: 0.0 (empty) to 1.0 (obstacle)
- **Structure**: Each row is a flattened 3×3 perception
### Output Data (y_train)
- **Shape**: (841,)
- **Type**: int64
- **Values**: 0-3 (action indices)
- **Mapping**: 0=UP, 1=DOWN, 2=LEFT, 3=RIGHT
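A quick NumPy sanity check of these shapes and dtypes (the arrays here are placeholders; in the project they come from the data-generation step):

```python
import numpy as np

X_train = np.zeros((841, 9), dtype=np.float64)   # placeholder with the documented shape
y_train = np.zeros((841,), dtype=np.int64)

assert X_train.shape == (841, 9) and y_train.shape == (841,)
assert set(np.unique(y_train)) <= {0, 1, 2, 3}   # only the four action indices

# Optional memory optimization discussed below: cast to smaller types.
X_small = X_train.astype(np.float32)
y_small = y_train.astype(np.int8)
print(X_small.nbytes, y_small.nbytes)            # 30276 841
```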
### Why These Data Types?
**Input (float32):**
- **Neural network compatibility**: Requires floating-point inputs
- **Obstacle probability**: Can represent partial visibility (0.0-1.0)
- **Standard practice**: ML libraries expect floating-point inputs
**Why a float type for binary obstacle values?**
- Neural Network Requirement: ML libraries expect floating-point inputs
- NumPy Default: np.array() creates float64 by default
- Future Flexibility: Could add partial visibility later (0.5 = partially visible obstacle)
- Library Compatibility: PyTorch/TensorFlow expect float inputs
**Output (int8):**
- **Discrete actions**: Only 4 possible values (0,1,2,3)
- **Memory efficient**: Integers use less memory than floats
- **Direct indexing**: Can be used as array indices
> **Why not int8 for 4 discrete actions?**
>
> You're absolutely right! For only 4 values, we could use `int8`:
> ```python
> # Memory comparison for 841 samples:
> int64: 841 × 8 bytes = 6,728 bytes
> int8:  841 × 1 byte  =   841 bytes  # 8x less memory!
> ```
>
> **Why we use int64 anyway:**
> - NumPy default behavior
> - ML library compatibility
> - Negligible impact for 841 samples (6KB vs 1KB)
> - Easy to optimize later: `y_train.astype(np.int8)`
## Biological Inspiration
**Local Perception**: Like animals using limited peripheral vision to navigate
**Expert Demonstrations**: A* provides optimal "expert" decisions
**Pattern Learning**: Robot learns obstacle-action relationships through repetition
The robot learns to map 3×3 obstacle patterns to navigation actions, mimicking how animals use local sensory input to make movement decisions!
## Neural Network Architecture
Input Layer: 9 neurons (3×3 flattened perception)
Hidden Layer 1: 64 neurons + ReLU + Dropout(0.2)
Hidden Layer 2: 32 neurons + ReLU + Dropout(0.2)
Output Layer: 4 neurons + Softmax
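A PyTorch sketch of this architecture, assuming the 9-feature input described above; the softmax is left to the loss/inference step, since cross-entropy in PyTorch expects raw logits:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(9, 64),   # input: 9 features (flattened 3x3 perception)
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(32, 4),   # output: logits for the 4 actions (softmax at inference time)
)
```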
**Training Strategy:**
- Split: 80% train, 20% validation
- Batch size: 32-64
- Learning rate: 0.001 with decay
- Epochs: 50-100 with early stopping
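A sketch of that training setup with PyTorch and scikit-learn, reusing the `model` above; `X_all` and `y_all` are assumed to hold the generated dataset, and the early-stopping patience value is an assumption:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split

# 80/20 train/validation split of the generated dataset (X_all, y_all assumed)
X_tr, X_val, y_tr, y_val = train_test_split(X_all, y_all, test_size=0.2, random_state=0)
train_loader = DataLoader(
    TensorDataset(torch.tensor(X_tr, dtype=torch.float32),
                  torch.tensor(y_tr, dtype=torch.long)),
    batch_size=32, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)  # LR decay
criterion = torch.nn.CrossEntropyLoss()

best_val, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(100):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        criterion(model(xb), yb).backward()
        optimizer.step()
    scheduler.step()
    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(torch.tensor(X_val, dtype=torch.float32)),
                             torch.tensor(y_val, dtype=torch.long)).item()
    bad_epochs = 0 if val_loss < best_val else bad_epochs + 1
    best_val = min(best_val, val_loss)
    if bad_epochs >= patience:    # early stopping on validation loss
        break
```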
## Key Insights
1. **Data Balance**: Action distribution is well-balanced (imbalance ratio: 1.25)
2. **Perception Complexity**: Average 2.05 obstacles per 3×3 view
3. **Environment Diversity**: 100 different environments with varying complexity
4. **Optimal Labels**: A* guarantees shortest path decisions
5. **Biological Connection**: Mimics animal navigation with limited vision
**Bottom Line:** The robot learns to navigate by observing optimal A* decisions at each position, building a comprehensive map of obstacle patterns to actions through supervised learning.