Efficiency and Safety in Machine Learning

Saturday, Aug. 26, 2023


Session: Efficiency and Safety in Machine Learning
Time: 10:30 a.m.–12:00 p.m.
Location: Room 203, Wenshi Building, Putuo Campus, East China Normal University
Session Chair: Pengkun Yang, Tsinghua University

Qing Ling
Sun Yat-Sen University
Title: Byzantine-Robust Distributed Online Learning: Taming Adversarial Participants in an Adversarial Environment
Abstract: In this talk, we consider distributed online learning under Byzantine attacks. The performance of an online learning algorithm is often characterized by its (adversarial) regret, which evaluates the quality of one-step-ahead decision-making when the environment provides adversarial losses; a sublinear bound is preferred. However, we prove that, even with a class of state-of-the-art robust aggregation rules, in an adversarial environment and in the presence of Byzantine participants, distributed online gradient descent can only achieve a linear adversarial regret bound, and this bound is tight. This is an inevitable consequence of Byzantine attacks, although the constant of the linear adversarial regret can be controlled at a reasonable level. Interestingly, when the environment is not fully adversarial, so that the losses of the honest participants are i.i.d. (independent and identically distributed), we show that sublinear stochastic regret, in contrast to the aforementioned adversarial regret, is possible. We develop a Byzantine-robust distributed online momentum algorithm that attains such a sublinear stochastic regret bound. Extensive numerical experiments corroborate our theoretical analysis.
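The contrast between a robust aggregation rule and plain averaging can be seen in a toy simulation. The sketch below is illustrative only and is not the algorithm from the talk: it uses the coordinate-wise median as one classic example of a Byzantine-robust rule, a fixed quadratic loss rather than adversarial losses, and hypothetical names (`run_distributed_ogd`, worker counts, the bogus gradient value 100) chosen for the example.

```python
import numpy as np

def aggregate_mean(grads):
    """Plain averaging: a single Byzantine participant can shift it arbitrarily."""
    return grads.mean(axis=0)

def aggregate_median(grads):
    """Coordinate-wise median, a classic Byzantine-robust aggregation rule."""
    return np.median(grads, axis=0)

def run_distributed_ogd(aggregate, rounds=200, lr=0.1, n_honest=7, n_byz=2):
    """Distributed online gradient descent on the toy loss ||x - 1||^2,
    where Byzantine workers always report a huge bogus gradient.
    Returns the final distance to the honest minimizer."""
    x = np.zeros(2)
    target = np.ones(2)
    for _ in range(rounds):
        honest = np.tile(2.0 * (x - target), (n_honest, 1))  # true gradients
        byz = np.full((n_byz, 2), 100.0)                     # adversarial reports
        x = x - lr * aggregate(np.vstack([honest, byz]))
    return float(np.linalg.norm(x - target))
```

With the median, the two bogus reports are outvoted by the seven honest ones and the iterate converges to the target; with the mean, the iterate is dragged far away.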
Lei Wu
Peking University
Title: Theoretical Analysis of Inductive Biases in Deep Convolutional Networks
Abstract: In this talk, we'll discuss the inductive biases of deep convolutional neural networks (CNNs), which are believed to be vital drivers behind their exceptional performance on vision-like tasks. Specifically, we'll analyze the universality of CNNs and show that achieving it requires only a depth of $O(\log d)$, where $d$ is the input dimension. Additionally, we'll demonstrate that CNNs can efficiently capture long-range sparse correlations with only $O(\log^2 d)$ samples. Both results are achieved through a novel combination of increased network depth and the use of multi-channeling and down-sampling. We'll also explore the inductive biases of weight sharing and locality through the lens of symmetry by introducing locally-connected networks (LCNs), which can be viewed as CNNs without weight sharing. We'll compare the performance of CNNs, LCNs, and fully-connected networks (FCNs) on a simple regression task, highlighting the crucial roles of weight sharing and locality. Our findings demonstrate that weight sharing and locality break different symmetries in the learning process, leading to provable separations between the two biases.
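The CNN/LCN/FCN hierarchy the abstract contrasts can be made concrete in one dimension: an LCN has a separate filter at each position, a CNN ties all of those filters to one shared weight vector, and an FCN drops locality entirely. This is a minimal single-channel sketch with hypothetical helper names (`lcn_forward`, `cnn_forward`, `n_params`), not code from the talk.

```python
import numpy as np

def lcn_forward(x, filters):
    """Locally-connected layer: a separate length-k filter at each position."""
    k = filters.shape[1]
    return np.array([filters[i] @ x[i:i + k] for i in range(filters.shape[0])])

def cnn_forward(x, w):
    """A convolutional layer is an LCN whose filters are all tied to one w
    (weight sharing); locality is the restriction to length-k windows."""
    n_pos = len(x) - len(w) + 1
    return lcn_forward(x, np.tile(w, (n_pos, 1)))

def n_params(d, k):
    """Per-layer parameter counts on a length-d input (single channel)."""
    return {"cnn": k,                 # one shared filter
            "lcn": k * (d - k + 1),   # one filter per output position
            "fcn": d * d}             # dense weight matrix
```

For example, `cnn_forward(np.array([1., 2., 3., 4.]), np.array([1., 1.]))` sums each adjacent pair, and the parameter counts satisfy CNN < LCN < FCN for any reasonable `d` and `k`, which is the structural gap behind the separations discussed in the talk.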
Zhi-Qin John Xu
Shanghai Jiao Tong University
Title: Dropout Facilitates Condensation in Deep Learning
Abstract: It is important to understand how dropout, a popular regularization method, helps neural network training find a solution that generalizes well. In this work, we theoretically derive the implicit regularization of dropout and find that, when trained with dropout, the input weights of hidden neurons tend to condense onto isolated orientations. This work points out the distinct characteristics of dropout compared with stochastic gradient descent.
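For readers unfamiliar with the mechanism being analyzed, here is a minimal sketch of the dropout operation itself (the standard "inverted" variant); the condensation phenomenon is the talk's theoretical result and is not reproduced here. The function name and parameters are illustrative.

```python
import numpy as np

def dropout(x, p, rng, train=True):
    """Inverted dropout: during training, zero each unit independently with
    probability p and rescale survivors by 1/(1-p), so the expected
    activation is unchanged; at test time the layer is the identity."""
    if not train or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p   # keep a unit with probability 1 - p
    return x * mask / (1.0 - p)
```

Averaging many training-mode outputs recovers the input, while a single draw randomly zeroes roughly a fraction `p` of the units — it is this per-step randomness whose implicit regularization the talk characterizes.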
Jingzhao Zhang
Tsinghua University
Title: Two Phases of Scaling Laws for Nearest Neighbor Classifiers
Abstract: A scaling law refers to the observation that the test performance of a model improves as the number of training data increases. A fast scaling law implies that one can solve machine learning problems simply by boosting the data and model sizes. Yet, in many cases, the benefit of adding more data can be negligible. In this work, we study the rates of scaling laws for nearest neighbor classifiers. We show that a scaling law can have two phases: in the first phase, the generalization error depends polynomially on the data dimension and decreases fast; in the second phase, the error depends exponentially on the data dimension and decreases slowly. Our analysis highlights the role of the complexity of the data distribution in determining the generalization error. When the data distribution is benign, our result suggests that the nearest neighbor classifier can achieve a generalization error that depends polynomially, instead of exponentially, on the data dimension.
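The basic object of study — test error of the 1-nearest-neighbor classifier as a function of training set size — is easy to simulate. The sketch below uses an assumed benign task (uniform inputs, label given by the sign of the first coordinate) and hypothetical function names; it illustrates only that the error decreases with more data, not the two-phase rates proved in the talk.

```python
import numpy as np

def one_nn_error(n_train, n_test=200, d=5, seed=0):
    """Test error of the 1-nearest-neighbor classifier on a benign task:
    inputs uniform on [-1, 1]^d, label = sign of the first coordinate."""
    rng = np.random.default_rng(seed)
    Xtr = rng.uniform(-1, 1, (n_train, d))
    ytr = np.sign(Xtr[:, 0])
    Xte = rng.uniform(-1, 1, (n_test, d))
    yte = np.sign(Xte[:, 0])
    # brute-force nearest-neighbor search: squared distances, then argmin
    d2 = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=-1)
    return float((ytr[d2.argmin(axis=1)] != yte).mean())

def avg_error(n_train, seeds=(0, 1, 2)):
    """Average over a few seeds to smooth sampling noise."""
    return float(np.mean([one_nn_error(n_train, seed=s) for s in seeds]))
```

Sweeping `n_train` over several orders of magnitude traces an empirical scaling curve for a fixed dimension `d`; the talk's analysis concerns how the rate of that curve depends on `d` and on the data distribution.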