Research on Transformer-Based Action Sequence Modeling of Intangible Cultural Heritage Shadow Play Using Attention Mechanisms
Yuxiao Liu1, Shuolei Feng2, Mengyu Liu1
1 Art and Design, Beijing City University, Shunyi District, Beijing, China; 2 Department of Information Science, Beijing City University, Shunyi District, Beijing, China
International Scientific Technical and Economic Research, 2026, Vol. 4, No. 2, pp. 51-77
DOI: 10.71451/ISTAER2615
Received: 15 January 2026; Revised: 24 February 2026; Accepted: 28 March 2026; Published: 8 April 2026
Abstract
Shadow puppet movements are characterized by long-range spatiotemporal dependencies, pronounced stylization, and complex control and transmission relationships. These characteristics pose two major challenges to digital modeling: capturing long-range dependencies and preserving artistic style. This paper proposes an improved Transformer model incorporating a multi-level attention mechanism for modeling and generating action sequences of intangible cultural heritage shadow play. The model designs three collaborative attention modules: spatial attention introduces bone-adjacency priors to enhance structural plausibility; temporal attention captures cross-frame long-range dependencies; and style-aware attention modulates local computations via global feature statistics to preserve genre-specific performance styles. Furthermore, an enhanced architecture that alternately stacks graph convolution and Transformer layers is adopted, and sparse and hierarchical modeling strategies reduce computational complexity from quadratic to approximately linear in sequence length. Experimental results show that the proposed method achieves an average joint position error of 31.4 in motion prediction, 11.8 lower than that of the standard Transformer; style loss decreases by 24.6%; and under the extreme condition of 50% missing keypoints, the error ratio is 1.31, significantly better than the comparison methods. The proposed model provides effective technical support for the digital preservation and intelligent inheritance of intangible cultural heritage.
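The spatial-attention idea summarized above (a bone-adjacency prior biasing attention between skeleton joints) can be sketched as follows. This is a minimal single-head illustration under our own assumptions about how the prior enters the computation: the adjacency matrix `adj`, the additive logit `bias`, and the function name are illustrative, not the paper's actual implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def spatial_attention(feats, adj, bias=2.0):
    """Single-head attention over J joints.

    feats: J x d list of per-joint feature vectors (queries = keys = values).
    adj:   J x J skeleton adjacency matrix (1 = bone-connected, incl. self).
    bias:  additive logit bonus for adjacent joints (the structural prior).
    """
    J = len(feats)
    d = len(feats[0])
    out = []
    for i in range(J):
        # Scaled dot-product logits, plus the adjacency prior.
        logits = []
        for j in range(J):
            dot = sum(feats[i][k] * feats[j][k] for k in range(d)) / math.sqrt(d)
            logits.append(dot + (bias if adj[i][j] else 0.0))
        w = softmax(logits)
        # Weighted sum of value vectors.
        out.append([sum(w[j] * feats[j][k] for j in range(J)) for k in range(d)])
    return out
```

Because the prior is added to the logits rather than hard-masking them, non-adjacent joints can still attend to each other when their feature similarity is strong, while bone-connected joints are favored by default.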
Keywords
Intangible cultural heritage digitization; Shadow play; Action sequence modeling; Transformer; Multi-level attention mechanism
Funding
This work was supported by the China Adult Education Association (Grant No.: 2025-0588Y).