Research on Transformer-Based Action Sequence Modeling of Intangible Cultural Heritage Shadow Play Using Attention Mechanisms

Authors

Liu, Y., Feng, S., & Liu, M.

DOI:

https://doi.org/10.71451/ISTAER2615

Keywords:

Intangible cultural heritage digitization; Shadow play; Action sequence modeling; Transformer; Multi-level attention mechanism

Abstract

Shadow puppet movements are characterized by long-range spatiotemporal dependencies, pronounced stylization, and complex control-and-transmission relationships. These characteristics pose two major challenges for digital modeling: capturing long-range dependencies and preserving artistic style. This paper proposes an improved Transformer model incorporating a multi-level attention mechanism for modeling and generating action sequences of intangible cultural heritage shadow play. The model comprises three collaborative attention modules: spatial attention introduces bone-adjacency priors to enhance structural plausibility; temporal attention captures long-range cross-frame dependencies; and style-aware attention adjusts local computations using global feature statistics to preserve genre-specific performance styles. Furthermore, an enhanced architecture that alternately stacks graph convolution and Transformer layers is adopted, and sparse and hierarchical modeling strategies reduce computational complexity from quadratic to approximately linear in sequence length. Experimental results show that the proposed method achieves an average joint position error of 31.4 in motion prediction, 11.8 lower than the standard Transformer; style loss decreases by 24.6%; and under the extreme condition of 50% missing keypoints, the error ratio is 1.31, significantly better than the comparison methods. The proposed model provides effective technical support for the digital preservation and intelligent inheritance of intangible cultural heritage.
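The three attention modules described above can be illustrated in a few dozen lines. The following is a minimal NumPy sketch, not the paper's implementation: `spatial_attention` adds a bone-adjacency bias to the attention scores, `temporal_attention` uses a banded (local-window) mask as a stand-in for the sparse strategy that keeps cost roughly linear in sequence length, and `style_modulate` rescales local features with global sequence statistics. All function names, the toy chain skeleton, and the parameter values are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, bias=None, mask=None):
    # Scaled dot-product attention with an optional additive bias and boolean mask.
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(Q.shape[-1])
    if bias is not None:
        scores = scores + bias
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # disallowed positions get ~zero weight
    return softmax(scores) @ V

def spatial_attention(X, A, alpha=1.0):
    # X: (J, d) joint features for one frame; A: (J, J) bone-adjacency prior
    # that biases each joint toward its skeletal neighbors.
    return attention(X, X, X, bias=alpha * A)

def temporal_attention(X, window=8):
    # X: (T, d) per-frame features; a banded mask limits each frame to a
    # local window, so cost grows with T * window rather than T^2.
    T = X.shape[0]
    idx = np.arange(T)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    return attention(X, X, X, mask=mask)

def style_modulate(X, gamma=1.0, beta=0.0):
    # Adjust local features using global sequence statistics (AdaIN-style):
    # normalize per feature over time, then rescale and shift.
    mu = X.mean(axis=0, keepdims=True)
    sigma = X.std(axis=0, keepdims=True) + 1e-6
    return gamma * (X - mu) / sigma + beta

rng = np.random.default_rng(0)
J, T, d = 5, 16, 8                       # joints, frames, feature dim
frame = rng.normal(size=(J, d))
A = np.eye(J, k=1) + np.eye(J, k=-1)     # toy chain skeleton: joint i bonded to i±1
seq = rng.normal(size=(T, d))

s_out = spatial_attention(frame, A)
t_out = temporal_attention(seq, window=4)
styled = style_modulate(t_out)
print(s_out.shape, t_out.shape, styled.shape)  # (5, 8) (16, 8) (16, 8)
```

In a full model these modules would be interleaved with graph-convolution layers and stacked, as the abstract describes; the sketch only shows how each attention variant modifies the score matrix or the features.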

References

[1] Lin, C., Xia, G., Nickpour, F., & Chen, Y. (2025, June). Bridging Tradition and Innovation: Using Artistic Genes to Assess Cultural Authenticity in Digital Shadow Play. In International Conference on Human-Computer Interaction (pp. 213-232). Cham: Springer Nature Switzerland. DOI: https://doi.org/10.1007/978-3-032-13164-5_14

[2] Li, T., & Cao, W. (2021). Research on a method of creating digital shadow puppets based on parameterized templates. Multimedia Tools and Applications, 80(13), 20403-20422. DOI: https://doi.org/10.1007/s11042-021-10726-1

[3] Hou, Y., Kenderdine, S., Picca, D., Egloff, M., & Adamou, A. (2022). Digitizing intangible cultural heritage embodied: State of the art. Journal on Computing and Cultural Heritage (JOCCH), 15(3), 1-20. DOI: https://doi.org/10.1145/3494837

[4] Rallis, I., Voulodimos, A., Bakalos, N., Protopapadakis, E., Doulamis, N., & Doulamis, A. (2020). Machine learning for intangible cultural heritage: a review of techniques on dance analysis. Visual Computing for Cultural Heritage, 103-119. DOI: https://doi.org/10.1007/978-3-030-37191-3_6

[5] Zhou, Y., Wang, R., Li, H., & Kung, S. Y. (2020). Temporal action localization using long short-term dependency. IEEE Transactions on Multimedia, 23, 4363-4375. DOI: https://doi.org/10.1109/TMM.2020.3042077

[6] Jin, Y., Long, Y., Chen, C., Zhao, Z., Dou, Q., & Heng, P. A. (2021). Temporal memory relation network for workflow recognition from surgical video. IEEE Transactions on Medical Imaging, 40(7), 1911-1923. DOI: https://doi.org/10.1109/TMI.2021.3069471

[7] Hermans, C. (2025). Of rhythm and movement: physical play and dance as (participatory) sense-making practices. Research in Dance Education, 26(3), 313-328. DOI: https://doi.org/10.1080/14647893.2023.2211524

[8] Romat, H., Fender, A., Meier, M., & Holz, C. (2021, March). Flashpen: A high-fidelity and high-precision multi-surface pen for virtual reality. In 2021 IEEE Virtual Reality and 3D User Interfaces (VR) (pp. 306-315). IEEE. DOI: https://doi.org/10.1109/VR50410.2021.00053

[9] Sun, J., Zheng, C., Xie, E., Liu, Z., Chu, R., Qiu, J., ... & Li, Z. (2025). A survey of reasoning with foundation models: Concepts, methodologies, and outlook. ACM Computing Surveys, 57(11), 1-43. DOI: https://doi.org/10.1145/3729218

[10] Luo, Q., Zeng, W., Chen, M., Peng, G., Yuan, X., & Yin, Q. (2023, July). Self-attention and transformers: Driving the evolution of large language models. In 2023 IEEE 6th International Conference on Electronic Information and Communication Technology (ICEICT) (pp. 401-405). IEEE. DOI: https://doi.org/10.1109/ICEICT57916.2023.10245906

[11] Hassija, V., Palanisamy, B., Chatterjee, A., Mandal, A., Chakraborty, D., Pandey, A., ... & Kumar, D. (2025). Transformers for vision: A survey on innovative methods for computer vision. IEEE Access. DOI: https://doi.org/10.1109/ACCESS.2025.3571735

[12] Mazzia, V., Angarano, S., Salvetti, F., Angelini, F., & Chiaberge, M. (2022). Action transformer: A self-attention model for short-time pose-based human action recognition. Pattern Recognition, 124, 108487. DOI: https://doi.org/10.1016/j.patcog.2021.108487

[13] Zhang, E. Y., Cheok, A. D., Pan, Z., Cai, J., & Yan, Y. (2023). From turing to transformers: A comprehensive review and tutorial on the evolution and applications of generative transformer models. Sci, 5(4), 46. DOI: https://doi.org/10.3390/sci5040046

[14] Moutik, O., Sekkat, H., Tigani, S., Chehri, A., Saadane, R., Tchakoucht, T. A., & Paul, A. (2023). Convolutional neural networks or vision transformers: Who will win the race for action recognitions in visual data?. Sensors, 23(2), 734. DOI: https://doi.org/10.3390/s23020734

[15] Ren, Q., Li, M., Li, H., & Shen, Y. (2021). A novel deep learning prediction model for concrete dam displacements using interpretable mixed attention mechanism. Advanced Engineering Informatics, 50, 101407. DOI: https://doi.org/10.1016/j.aei.2021.101407

Tutek, M., & Šnajder, J. (2022). Toward practical usage of the attention mechanism as a tool for interpretability. IEEE Access, 10, 47011-47030. DOI: https://doi.org/10.1109/ACCESS.2022.3169772

[16] Yang, Z. B., Zhang, J. P., Zhao, Z. B., Zhai, Z., & Chen, X. F. (2020). Interpreting network knowledge with attention mechanism for bearing fault diagnosis. Applied Soft Computing, 97, 106829. DOI: https://doi.org/10.1016/j.asoc.2020.106829

[17] Mienye, I. D., Swart, T. G., & Obaido, G. (2024). Recurrent neural networks: A comprehensive review of architectures, variants, and applications. Information, 15(9), 517. DOI: https://doi.org/10.3390/info15090517

[18] Ahmad, T., Wu, J., Alwageed, H. S., Khan, F., Khan, J., & Lee, Y. (2023). Human activity recognition based on deep-temporal learning using convolution neural networks features and bidirectional gated recurrent unit with features selection. IEEE Access, 11, 33148-33159. DOI: https://doi.org/10.1109/ACCESS.2023.3263155

[19] Zan, T., Jia, X., Guo, X., Wang, M., Gao, X., & Gao, P. (2025). Research on variable-length control chart pattern recognition based on sliding window method and SECNN-BiLSTM. Scientific Reports, 15(1), 5921. DOI: https://doi.org/10.1038/s41598-025-86849-4

[20] Bouktif, S., Fiaz, A., Ouni, A., & Serhani, M. A. (2020). Multi-sequence LSTM-RNN deep learning and metaheuristics for electric load forecasting. Energies, 13(2), 391. DOI: https://doi.org/10.3390/en13020391

[21] Ahmad, T., Jin, L., Zhang, X., Lai, S., Tang, G., & Lin, L. (2021). Graph convolutional neural network for human action recognition: A comprehensive survey. IEEE Transactions on Artificial Intelligence, 2(2), 128-145. DOI: https://doi.org/10.1109/TAI.2021.3076974

[22] Yang, X., Li, S., Niu, S., & Yue, X. (2026). Graph network learning for human skeleton modeling: a survey. Artificial Intelligence Review, 59(1), 31. DOI: https://doi.org/10.1007/s10462-025-11442-0

[23] Feng, L., Zhao, Y., Zhao, W., & Tang, J. (2022). A comparative review of graph convolutional networks for human skeleton-based action recognition. Artificial Intelligence Review, 55(5), 4275-4305. DOI: https://doi.org/10.1007/s10462-021-10107-y

[24] Yu, H., Fan, X., Hou, Y., Pei, W., Ge, H., Yang, X., ... & Zhang, M. (2023). Toward realistic 3d human motion prediction with a spatio-temporal cross-transformer approach. IEEE Transactions on Circuits and Systems for Video Technology, 33(10), 5707-5720. DOI: https://doi.org/10.1109/TCSVT.2023.3255186

[25] Jiao, L., Zhang, X., Liu, X., Liu, F., Yang, S., Ma, W., ... & Zhang, J. (2023). Transformer meets remote sensing video detection and tracking: A comprehensive survey. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 16, 1-45. DOI: https://doi.org/10.1109/JSTARS.2023.3289293

[26] Soydaner, D. (2022). Attention mechanism in neural networks: where it comes and where it goes. Neural Computing and Applications, 34(16), 13371-13385. DOI: https://doi.org/10.1007/s00521-022-07366-3

[27] Yuan, C., Liu, J., Wang, H., & Yang, Q. (2025). Object Detection in Complex Traffic Scenes Based on Environmental Perception Attention and Three-Scale Feature Fusion. Applied Sciences, 15(6), 3163. DOI: https://doi.org/10.3390/app15063163

[28] Hou, Y., Kenderdine, S., Picca, D., Egloff, M., & Adamou, A. (2022). Digitizing intangible cultural heritage embodied: State of the art. Journal on Computing and Cultural Heritage (JOCCH), 15(3), 1-20. DOI: https://doi.org/10.1145/3494837

[29] Kim, M., Hwang, T., & So, J. (2025). Real-Time Live Streaming Framework for Cultural Heritage Using Multi-Camera 3D Motion Capture and Virtual Avatars. Applied Sciences, 15(22), 12208. DOI: https://doi.org/10.3390/app152212208

[30] Shen, J., Chen, L., He, X., Zuo, C., Li, X., & Dong, L. (2025). An Interactive Human-in-the-Loop Framework for Skeleton-Based Posture Recognition in Model Education. Biomimetics, 10(7), 431. DOI: https://doi.org/10.3390/biomimetics10070431

[31] Wen, B. (2025). A multimodal transformer framework with biomechanical constraints for injury prediction and human motion analysis. Journal of Computational Methods in Sciences and Engineering, 14727978251348632. DOI: https://doi.org/10.1177/14727978251348632

Published

2026-04-08

Data Availability Statement

The data that support the findings of this study are available upon request from the corresponding author, M.L.

Funding information

This work was supported by the China Adult Education Association (Grant No.: 2025-0588Y).

How to Cite

Liu, Y., Feng, S., & Liu, M. (2026). Research on Transformer-Based Action Sequence Modeling of Intangible Cultural Heritage Shadow Play Using Attention Mechanisms. International Scientific Technical and Economic Research, 4(2), 51-77. https://doi.org/10.71451/ISTAER2615
