Abstract:
Wastewater treatment is indispensable to the functioning of urban society, and its optimal control has enormous social benefits. However, precise modelling of the unstable and complex treatment process is challenging, yet crucial to the adaptive dynamic programming method. In this article, an adaptive critic algorithm with variational inference is designed to address the optimal control problem of nonlinear discrete-time systems, together with a convergence analysis. Based on the recorded system trajectory, a variational autoencoder is utilized to approximate the behavior policy of the offline dataset, without system modelling or online interaction. Through policy-iteration learning, the actor-critic structure amends the policy generated by the variational autoencoder to achieve the optimal control objective. Simulations on a nonlinear system and on the wastewater treatment process verify that the proposed approach outperforms the behavior policy. Driven by wastewater treatment process data derived from an incremental proportional-integral-derivative controller, the proposed approach produces an optimal control policy with smaller tracking error and lower cost.

Note to Practitioners—When dealing with an unknown system with complex dynamics, it is often more feasible to improve the acceptable performance of an existing control policy based on the system's trajectory than to obtain a superior policy from scratch. Motivated by batch reinforcement learning, learning from offline data avoids online interaction between the system and the adaptive dynamic programming algorithm, which could otherwise introduce exploratory errors during online learning. Specifically, with a model-free adaptive dynamic programming algorithm, the controller parameters are updated instantly from an experience replay buffer sampled from the online trajectory data. However, the update then depends on online exploration, and there is no guarantee that the system will converge every time.
As a specific type of adaptive dynamic programming algorithm, adaptive critic design uses a critic network to approximate the expected future cost and an actor network to generate a control input that minimizes that cost. In this article, using the converged trajectory as the offline dataset, a revised variational autoencoder is used to approximate the behavior policy of the offline dataset. As a generative model, the variational autoencoder conditions its outputs on a random variable that adheres to a prior distribution. Through offline learning, the actor network can amend the approximated policy based on the evaluation from the critic network while remaining within the limited variation of the generative model. Finally, the objective of the optimal control task is achieved by following the designated cost design. However, a dataset containing disturbances could impede offline learning, which remains to be addressed.
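The generative sampling step described above can be sketched as follows. This is a minimal illustration only: the dimensions, the random weights, and the `behavior_policy` function are hypothetical stand-ins for a trained conditional variational autoencoder decoder, not the paper's actual network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 3-dim state, 2-dim latent, 1-dim control input.
STATE_DIM, LATENT_DIM, ACTION_DIM = 3, 2, 1

# Random weights stand in for the trained decoder of a conditional VAE.
W_dec = rng.normal(size=(ACTION_DIM, STATE_DIM + LATENT_DIM))

def behavior_policy(state):
    """Sample an action imitating the offline dataset's behavior policy.

    At generation time the latent variable is drawn from its N(0, I)
    prior and decoded together with the current state into an action.
    """
    z = rng.normal(size=LATENT_DIM)            # latent sample from the prior
    return W_dec @ np.concatenate([state, z])  # decode (state, z) -> action

action = behavior_policy(np.array([0.5, -0.2, 1.0]))
print(action.shape)  # (1,)
```

Because the latent variable is random, repeated calls with the same state yield different actions, which is how the generative model captures the variability of the recorded behavior policy.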
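A minimal sketch of the constrained policy-iteration idea — critic evaluation followed by an actor improvement held close to the behavior policy — can be written for a toy scalar linear system with quadratic cost. The dynamics, gains, and `EPS` bound below are illustrative assumptions; the paper's method itself is model-free and data-driven.

```python
import numpy as np

# Toy scalar system x_{k+1} = a*x + b*u with stage cost x^2 + u^2.
a, b, gamma = 0.9, 1.0, 0.95

# Suppose the VAE-approximated behavior policy is linear feedback u = -0.3*x.
K_BEHAVIOR = -0.3
EPS = 0.2  # the actor may deviate at most this much from the generative model

p = 0.0          # critic parameter: V(x) = p * x^2
k = K_BEHAVIOR   # actor parameter:  u = k * x
for _ in range(50):
    # Policy evaluation: one sweep toward the fixed point of
    # V(x) = x^2 + u^2 + gamma * V(a*x + b*u) under u = k*x.
    closed = a + b * k
    p = 1.0 + k**2 + gamma * p * closed**2
    # Policy improvement: gain minimising u^2 + gamma*p*(a*x + b*u)^2 + x^2,
    # clipped so the amended policy stays near the behavior policy.
    k_star = -gamma * p * a * b / (1.0 + gamma * p * b**2)
    k = np.clip(k_star, K_BEHAVIOR - EPS, K_BEHAVIOR + EPS)

print(round(float(k), 3))  # -0.5
```

In this toy instance the constraint binds: the unconstrained optimal gain is about -0.525, but the actor is held within `EPS` of the behavior gain -0.3, mirroring how the generative model limits how far the learned policy may amend the behavior policy.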
Source: IEEE Transactions on Automation Science and Engineering
ISSN: 1545-5955
Year: 2023
Issue: 4
Volume: 21
Page: 1-12
Impact Factor: 5.600 (JCR@2022)
ESI Discipline: ENGINEERING
ESI HC Threshold: 19
Cited Count:
WoS CC Cited Count: 0
SCOPUS Cited Count: 10
ESI Highly Cited Papers on the List: 0