I’m a Ph.D. candidate in Computer Science and Technology at Tsinghua University, advised by Prof. Carlo Cannistraci. I obtained my Master’s degree in Data Science jointly from Tsinghua University and the University of Washington, advised by Prof. Jie Tang, and my B.E. in Computer Science and Technology from Tsinghua University.
My research focuses on efficient AI, natural language processing, and graph learning.
Experience
- Research Intern, Post-training Team, Meta Superintelligence Labs, Menlo Park, California, US, 2026.5-2026.8 (expected)
- Senior Research Engineer, Personalization, Disney+ Hotstar, Beijing, China, 2021.7-2023.7
- Trading Intern, Jane Street, Hong Kong, China and New York, NY, US, 2019.7-2019.9
- Research Intern, Microsoft Research Asia, Beijing, China, 2019.1-2019.4
Email: jialin [dot] zhao97 [at] gmail [dot] com
Vitæ
Full Resume in PDF.
(*: co-first author; ^: corresponding author)
- Accelerating Attention with Basis Decomposition
Jialin Zhao^
Preprint: Under review, 2025
Attention is a core operation in large language models (LLMs) and vision-language models (VLMs). We present BD Attention (BDA), the first lossless algorithmic reformulation of attention. BDA is enabled by a simple matrix identity from Basis Decomposition (BD), which restructures multi-head projections into a compact form while preserving exact outputs. Unlike I/O-aware system optimizations such as FlashAttention, BDA provides a mathematically guaranteed acceleration that is architecture-agnostic. On DeepSeek-V2-Lite (16B, FP16), BDA requires only 4s of offline preparation and no retraining, and on modern GPUs achieves 32% faster key/value projections and 25% smaller weights, while increasing end-to-end perplexity (PPL) by just 0.02% (FP16) or 0.0004% (FP32)—a negligible effect on model performance. These results position BDA as the first theoretically exact method for lossless attention acceleration that is complementary to existing engineering-level optimizations. Our code is available at https://github.com/abcbdf/basis-decomposition-official.
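The "lossless reformulation" idea can be illustrated with a toy matrix identity: if a projection factors as a product of two skinny matrices, applying the two factors in sequence reproduces the full matmul exactly, with fewer parameters. This is only a hypothetical sketch of the general principle, not the paper's actual Basis Decomposition identity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy low-rank projection (illustration only, not BDA itself):
# W = C @ B, so x @ W == (x @ C) @ B by associativity.
d, r = 8, 3
B = rng.standard_normal((r, d))   # basis rows
C = rng.standard_normal((d, r))   # combination coefficients
W = C @ B                         # full d x d projection, rank <= r

x = rng.standard_normal((5, d))

full = x @ W                      # original projection: d*d weights
compact = (x @ C) @ B             # two skinny matmuls: 2*d*r weights

assert np.allclose(full, compact)  # exact up to floating-point rounding
```

The storage and compute savings come purely from the factored form; no approximation is involved, which mirrors the "lossless" claim above.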
- Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models
Jialin Zhao^, Yingtao Zhang, and Carlo Vittorio Cannistraci^
ICML’25: Forty-second International Conference on Machine Learning, 2025
The rapid growth of Large Language Models has driven demand for effective model compression techniques to reduce memory and computation costs. Low-rank pruning has gained attention for its GPU compatibility across all densities. However, low-rank pruning struggles to match the performance of semi-structured pruning, often doubling perplexity at similar densities. In this paper, we propose Pivoting Factorization (PIFA), a novel lossless meta low-rank representation that unsupervisedly learns a compact form of any low-rank representation, effectively eliminating redundant information. PIFA identifies pivot rows (linearly independent rows) and expresses non-pivot rows as linear combinations, achieving 24.2% additional memory savings and 24.6% faster inference over low-rank layers at rank = 50% of dimension. To mitigate the performance degradation caused by low-rank pruning, we introduce a novel, retraining-free reconstruction method that minimizes error accumulation (M). MPIFA, combining M and PIFA into an end-to-end framework, significantly outperforms existing low-rank pruning methods, and achieves performance comparable to semi-structured pruning, while surpassing it in GPU efficiency and compatibility. Our code is available at https://github.com/biomedical-cybernetics/pivoting-factorization.
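The core pivot-row idea can be sketched in a few lines of NumPy: select a maximal linearly independent set of rows, then store only those rows plus the coefficients that rebuild the rest. This is a minimal sketch of the concept, not the PIFA algorithm itself (which operates on pruned LLM layers; see the linked repository).

```python
import numpy as np

rng = np.random.default_rng(1)

# Rank-2 matrix with 6 rows: every row is a combination of 2 basis rows.
W = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))

# Greedily collect pivot rows (a maximal linearly independent subset).
pivot_rows = []
for i in range(W.shape[0]):
    if np.linalg.matrix_rank(W[pivot_rows + [i]]) > len(pivot_rows):
        pivot_rows.append(i)

P = W[pivot_rows]                              # pivot rows, kept verbatim
# Express every row of W as a linear combination of the pivot rows.
M, *_ = np.linalg.lstsq(P.T, W.T, rcond=None)  # solves P.T @ M = W.T
W_rec = M.T @ P                                # lossless reconstruction

assert np.allclose(W, W_rec)
```

Storing `P` and `M` instead of `W` is where the memory saving comes from: for rank r much smaller than the row count, the two factors are far smaller than the original matrix.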
- Sparse Spectral Training and Inference on Euclidean and Hyperbolic Neural Networks
Jialin Zhao^, Yingtao Zhang, Xinghang Li, Huaping Liu, and Carlo Vittorio Cannistraci^
ICML’25: Forty-second International Conference on Machine Learning, 2025
The growing demands on GPU memory posed by the increasing number of neural network parameters call for training approaches that are more memory-efficient. Previous memory reduction training techniques, such as Low-Rank Adaptation (LoRA) and ReLoRA, face challenges, with LoRA being constrained by its low-rank structure, particularly during intensive tasks like pre-training, and ReLoRA suffering from saddle point issues. In this paper, we propose Sparse Spectral Training (SST) to optimize memory usage for pre-training. SST updates all singular values and selectively updates singular vectors through a multinomial sampling method weighted by the magnitude of the singular values. Furthermore, SST employs singular value decomposition to initialize and periodically reinitialize low-rank parameters, reducing distortion relative to full-rank training compared to other low-rank methods. Through comprehensive testing on both Euclidean and hyperbolic neural networks across various tasks, SST demonstrates its ability to outperform existing memory reduction training methods and is comparable to full-rank training in various cases. On LLaMA-1.3B, with only 18.7% of the parameters trainable compared to full-rank training (using a rank equivalent to 6% of the embedding dimension), SST reduces the perplexity gap between other low-rank methods and full-rank training by 97.4%. This result highlights SST as an effective parameter-efficient technique for model pre-training.
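The sampling step described above can be sketched as follows: decompose a weight matrix, then draw a subset of singular directions with probability proportional to singular-value magnitude. This is a hedged illustration of that one step only; the full SST update rule and reinitialization schedule are in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

W = rng.standard_normal((16, 16))
U, S, Vt = np.linalg.svd(W, full_matrices=False)  # S is sorted, non-negative

# Multinomial sampling of k singular directions, weighted by the
# magnitude of the singular values (sketch of SST's sampling step).
k = 4
idx = rng.choice(len(S), size=k, replace=False, p=S / S.sum())

# In SST, all of S stays trainable every round, while only the sampled
# singular vectors receive gradient updates.
sampled_U, sampled_Vt = U[:, idx], Vt[idx]
```

Because larger singular values are sampled more often, the directions that contribute most to the layer's output are refreshed most frequently, while the memory footprint per step stays proportional to k rather than to the full rank.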
- Adaptive Cannistraci-Hebb Network Automata Modelling of Complex Networks for Path-based Link Prediction
Jialin Zhao, Alessandro Muscoloni, Umberto Michieli, Yingtao Zhang, and Carlo Vittorio Cannistraci^
NeurIPS’25: Advances in Neural Information Processing Systems, 2025
- Adaptive Diffusion in Graph Neural Networks
Jialin Zhao, Yuxiao Dong, Ming Ding, Evgeny Kharlamov, and Jie Tang^
NeurIPS’21: Advances in Neural Information Processing Systems, 2021
The success of graph neural networks (GNNs) largely relies on the process of aggregating information from neighbors defined by the input graph structures. Notably, message-passing-based GNNs, e.g., graph convolutional networks, leverage the immediate neighbors of each node during the aggregation process, and recently, graph diffusion convolution (GDC) was proposed to expand the propagation neighborhood by leveraging generalized graph diffusion. However, the neighborhood size in GDC is manually tuned for each graph by conducting grid search over the validation set, making its generalization practically limited. To address this issue, we propose the adaptive diffusion convolution (ADC) strategy to automatically learn the optimal neighborhood size from the data. Furthermore, we break the conventional assumption that all GNN layers and feature channels (dimensions) should use the same neighborhood for propagation. We design strategies to enable ADC to learn a dedicated propagation neighborhood for each GNN layer and each feature channel, making the GNN architecture fully coupled with graph structures—the unique property that distinguishes GNNs from traditional neural networks. By directly plugging ADC into existing GNNs, we observe consistent and significant outperformance over both GDC and their vanilla versions across various datasets, demonstrating the improved model capacity brought by automatically learning a unique neighborhood size per layer and per channel in GNNs.
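The relationship between a diffusion parameter and the propagation neighborhood can be sketched with a heat-kernel diffusion on a toy graph: a larger scale t spreads propagation weight over a wider neighborhood. This is a generic illustration of generalized graph diffusion under an assumed heat kernel, not the ADC training procedure, which learns t per layer and per channel.

```python
import numpy as np

# Toy 4-node adjacency (last node has a self-loop).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)

T = A / A.sum(axis=1, keepdims=True)   # row-normalized transition matrix

def heat_diffusion(T, t, K=16):
    """Truncated heat kernel: sum_k e^{-t} t^k / k! * T^k."""
    out = np.zeros_like(T)
    Tk = np.eye(len(T))                # T^0
    coef = np.exp(-t)                  # e^{-t} t^0 / 0!
    for k in range(K):
        out += coef * Tk
        Tk = Tk @ T
        coef *= t / (k + 1)            # advance to e^{-t} t^{k+1} / (k+1)!
    return out

H_small = heat_diffusion(T, t=0.5)     # weight concentrated near each node
H_large = heat_diffusion(T, t=4.0)     # weight spread over a wider neighborhood
```

With a small t, most diffusion weight stays on the node itself and its immediate neighbors; as t grows, the effective neighborhood widens, which is exactly the quantity ADC learns instead of grid-searching.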
Services
Conference reviewer: ICML (2025), NeurIPS (2025), ICLR (2026)
Journal reviewer: IEEE Transactions on Big Data, Applied Network Science, Scientific Reports
Education
Ph.D. in Computer Science and Technology, Tsinghua University, Beijing, China, 2023-2027 (expected)
Advised by Prof. Carlo Cannistraci.
Master of Data Science, Tsinghua University and University of Washington, Beijing, China and Seattle, WA, US, 2019-2021
Advised by Prof. Jie Tang.
B.E. in Computer Science and Technology, Tsinghua University, Beijing, China, 2015-2019