Dear Mr. Yuhang Song,
In the paper, it is mentioned that the rewards for action v are given by
And the parameters θv are optimized according to the rule:
In the code,

```python
'''get direction reward and ground-truth v from data_base in last state'''
last_prob, distance_per_data = suppor_lib.get_prob(
    lon=self.last_lon,
    lat=self.last_lat,
    theta=action * 45.0,
    subjects=self.subjects,
    subjects_total=self.subjects_total,
    cur_data=self.last_data,
)

'''rescale'''
distance_per_step = distance_per_data * self.data_per_step

'''convert v to degree'''
degree_per_step = distance_per_step / math.pi * 180.0

'''set v_lable'''
v_lable = degree_per_step
```
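If I read the scaling correctly, the weighted angular distance is simply rescaled and converted from radians to degrees before being used as the target. A toy check of my understanding (all numbers below are made up):

```python
import math

# hypothetical values, only to illustrate the rescale-and-convert step above
distance_per_data = 0.30   # weighted angular distance (radians) per data sample
data_per_step = 2.0        # stand-in for self.data_per_step
distance_per_step = distance_per_data * data_per_step   # 0.6 rad per environment step
degree_per_step = distance_per_step / math.pi * 180.0   # ~34.4 degrees
v_lable = degree_per_step  # used directly as the regression target for v
print(v_lable)
```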
there seems to be no reward for v being calculated; instead, v_lable is estimated as a "weighted" target value (a sum of subject_i's v weighted by its similarity):
```python
distance_per_data = 0.0
for i in range(subjects_total):
    if config.if_normalize_v_lable:
        distance_per_data += prob_dic_normalized[i] * subjects[i].data_frame[cur_data].v
    else:
        distance_per_data += prob_dic[i] * subjects[i].data_frame[cur_data].v
```
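In other words, as far as I can tell, v_lable is a similarity-weighted average of the subjects' velocities. A small self-contained restatement of the loop, with invented numbers (the similarities need not sum to 1, which is presumably what the if_normalize_v_lable flag addresses):

```python
import numpy as np

# toy restatement of the loop above (all numbers invented for illustration)
prob = np.array([0.9, 0.5, 0.1])            # similarity of each subject to the chosen direction
subject_v = np.array([10.0, 30.0, 50.0])    # each subject's v at cur_data
v_lable_raw = float(np.dot(prob, subject_v))                 # the prob_dic branch
v_lable_norm = float(np.dot(prob / prob.sum(), subject_v))   # the prob_dic_normalized branch
print(v_lable_raw, v_lable_norm)
```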
This v_lable then contributes an additional term, (v - v_lable)^2, to the loss function:
```python
# v loss
v_loss = 0.5 * tf.reduce_sum(tf.square(pi.v - self.v_lable))
```
Is there a particular reason why the direct sum of rewards is not calculated, and the above approach is used instead?
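To make the question concrete: the code minimizes a squared regression error between pi.v and v_lable, whereas from the paper I had expected a reward-weighted (policy-gradient style) term for v, roughly of the form below (my own paraphrase, not the paper's exact notation):

$$
L_v^{\text{code}} = \tfrac{1}{2}\sum_t \bigl(v_t - v^{\text{lable}}_t\bigr)^2
\qquad\text{vs.}\qquad
L_v^{\text{expected}} \approx -\sum_t r^{v}_t \,\log \pi_{\theta_v}(v_t \mid s_t)
$$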