Hey,
thanks for providing a base implementation for the method presented in "Towards Deep Symbolic Reinforcement Learning".
With respect to your implementation, I have the following comments/questions:
state_builder.py:
(1) line 177: We should check the absolute difference between x.position and entity.position; otherwise x might not be within the radius but still end up in the list whenever the difference is negative.
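To make the point concrete, here is a minimal sketch of the check I mean. The function name and the assumption that entities expose a scalar `position` attribute are mine, not taken from your code:

```python
class Entity:
    """Stand-in for the entities in state_builder.py (illustrative)."""
    def __init__(self, position):
        self.position = position

def entities_in_radius(entity, entities, radius):
    # Use abs() so entities on either side of `entity` are treated
    # symmetrically; without it, a negative difference always passes
    # a `difference < radius` check, even for far-away entities.
    return [x for x in entities
            if abs(x.position - entity.position) < radius]
```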
(2) lines 143-145: the deletion of no-longer-tracked entities. This only becomes a problem when the object tracking fails and the number of detected objects changes. Since the entities to be deleted stem from the last state, their indices in the self.tracked_entities list may have shifted in the meantime. Let's say you track 15 entities and your current do_not_exist contains index 10, which you delete; in the current timestep you add index 15 to the newlynonexistent list. If no new entities are added in the next timestep, this loop throws an error.
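One way to sidestep the stale-index problem is to filter by a stable entity id instead of deleting by list position. This is only a sketch of the idea; the names and the assumption that entities carry a persistent `id` are mine:

```python
class TrackedEntity:
    """Stand-in for a tracked entity with a stable identifier (illustrative)."""
    def __init__(self, entity_id):
        self.id = entity_id

def prune_entities(tracked_entities, newly_nonexistent_ids):
    # Filter by stable id rather than by list index: indices recorded
    # in the previous timestep become stale once earlier deletions (or
    # a changed object count) shift the list, which can raise an
    # IndexError or delete the wrong entity.
    return [e for e in tracked_entities
            if e.id not in newly_nonexistent_ids]
```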
agent.py:
(1) The update function of the tabular agent. If I am not completely mistaken, there is an error in the paper's update equation for the tabular Q-function: I see no reason why the action value of (s_t, a_t) should also be discounted. Additionally, in your calculation of the current and next-step action values you always sum over all interactions present at the current timestep, while the paper describes updating the Q-table only for the specific type of interaction. Is there a reason behind this?
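For reference, this is the standard tabular Q-learning update I would expect, applied per interaction type. The table layout, function signature, and default hyperparameters here are my own illustration, not the paper's or your implementation's:

```python
def q_update(q_tables, interaction, s, a, r, s_next, actions,
             alpha=0.1, gamma=0.9):
    # One Q-table per interaction type: only the table belonging to
    # the interaction that actually occurred is updated, and
    # Q(s_t, a_t) enters the TD error undiscounted.
    q = q_tables.setdefault(interaction, {})
    q_sa = q.get((s, a), 0.0)
    best_next = max(q.get((s_next, b), 0.0) for b in actions)
    q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)
    return q[(s, a)]
```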
autoencoder.py:
(1) type consistency: In the current get_entities function of the autoencoder, the representative activations are determined anew at every timestep, which can lead to inconsistencies across timesteps. For example, an entity of a specific type in the top-left corner gets assigned type 0. If the agent then collects this entity after a certain number of timesteps, the agent itself gets assigned type 0, which I think is undesired behaviour because it makes the matching of tracked_entities and new entities in the build representation function harder.
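What I had in mind is a persistent mapping from representative activations to type ids, so that the same activation pattern keeps the same type across timesteps. A minimal sketch, assuming activations can be compared elementwise within a tolerance (class name and tolerance are my own choices):

```python
class TypeRegistry:
    """Assigns persistent type ids: an activation close to a previously
    seen prototype reuses that prototype's id; a genuinely new
    activation gets the next free id (illustrative sketch)."""

    def __init__(self, tol=1e-3):
        self.prototypes = []  # one representative activation per type
        self.tol = tol

    def type_of(self, activation):
        for type_id, proto in enumerate(self.prototypes):
            if len(proto) == len(activation) and all(
                    abs(a - p) <= self.tol
                    for a, p in zip(activation, proto)):
                return type_id
        self.prototypes.append(list(activation))
        return len(self.prototypes) - 1
```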
With the points above changed, the implementation runs without problems for me. I hope these comments help fix your implementation; if I misunderstood some part, let me know.
Since the paper itself provides very little information on how the method was actually implemented, I was wondering whether you have already contacted the authors and received additional details about the implementation that are not part of the paper.
All the best,
N