This is part 1 of 3 of my M1 internship report, required by ENS, for work carried out at the Institute of Science and Technology Austria under the supervision of Simone Bombari and Prof. Marco Mondelli, and is to be published.
\((\varepsilon, \delta)\)-differential privacy has been the gold standard for gauging the protection of training data. It was commonly believed that privacy comes at some cost in performance, since more parameters must be kept private. Recently, however, it was shown that one need not sacrifice performance for privacy when working with sufficiently over-parameterised linear regression or random feature models. Nevertheless, neither of these two models is capable of feature learning, and it has remained open whether one can achieve feature learning whilst guaranteeing privacy.
In this work, we show that sufficiently wide shallow neural networks trained with quadratic loss can learn single-index models, i.e. achieve \(\mathcal{R} = o(1)\), whilst maintaining strong privacy, i.e. \(\varepsilon, \delta = o(1)\).
I thank Simone Bombari and Marco Mondelli for suggesting this problem, for countless helpful discussions, for their warmest hospitality at the Institute of Science and Technology Austria, and for their support throughout this project. I acknowledge the financial help from the Department of Computer Science, École Normale Supérieure - PSL, which made my trip to Vienna possible.
Last but not least, I owe my utmost gratitude to my dear friend Anna-Maria Edlinger for her unwavering support through difficulties during my stay in Vienna, without which I would not have been able to finish this project.
As such, this work is dedicated to her.
As machine learning finds more applications in privacy-sensitive settings, such as medicine (Shehab et al., 2022), communication (Dada et al., 2019), finance (Hernandez Aros et al., 2024), and personalised services (Weiner et al., 2024), the question of protecting data arises. Indeed, from a publicly available model, one might infer unexpected properties (Ganju et al., 2018), extract (Carlini et al., 2021) or reconstruct (Fredrikson et al., 2015) training data, or test whether a sample is in the dataset (Shokri et al., 2017).
One possible approach to mitigate these attacks is to ensure \((\varepsilon, \delta)\)-differential privacy (Dwork & Roth, 2013), for example by using differentially private variants of stochastic gradient descent (missing reference), possibly incurring extra computational cost (Kurakin et al., 2022). Deferring the precise definition to later sections, \(\varepsilon\) and \(\delta\) represent the privacy budget: the smaller they are, the stronger the privacy guarantee. As no lunch should come free, one might expect some trade-off between performance and privacy (Chaudhuri & Monteleoni, 2009; Chaudhuri et al., 2011), similar to that between bias and variance in classical theory: the more parameters one trains, the harder it is to have both privacy and performance, due to the level of noise that must be added (Jayaraman & Evans, 2019).
Unexpectedly, it was shown that in the overparameterised regime, privacy can be achieved with \(\varepsilon, \delta = o(1)\) and no loss of performance, for linear regression (Brown et al., 2024) and random feature models (Bombari & Mondelli, 2025), thus arguing for the benefits of overparameterisation.
That being said, one cannot expect linear regression to compete with the large neural networks used in practice. It serves as a good proof of concept suggesting that privacy and performance might be attainable at the same time, but showing this for more complicated models remains an open question.
Likewise, random feature models have been shown to be equivalent to noisy linear models with Gaussian features and Gaussian noise, a fact known as the Gaussian equivalence property (Hu & Lu, 2023; Bosch et al., 2023). This means that a random feature model can only learn the components of the true model up to some degree \(q\), where \(q\) depends on the model's size. In other words, such a model owes its performance purely to its size; it does not learn any features of the data, and thus lacks feature learning.
This prompts the central question of this work:
Can one simultaneously achieve both feature learning and differential privacy?
In this work, we consider a two-layer neural network trained with a noisy, clipped gradient-based procedure, detailed in Section 2.
We show that, as with linear regression and random feature models, under some technical conditions and for all sufficiently over-parameterised models, it is indeed possible to achieve both \((\varepsilon, \delta)\)-differential privacy and \(o(1)\) generalisation error as \(p\), \(n\), and \(d\) suitably tend to infinity, thus giving an affirmative answer to the above question.
Section 2 rigorously sets up the problem and the main result, introduces the necessary notations and definitions, and provides a review of related works, before outlining the proof. Sections 3 and 4 analyse the first and second layers, respectively. Section 5 concludes with some final remarks. \(\newcommand{\P}{\mathbb{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \newcommand{\N}{\mathcal{N}} \newcommand{\H}{\mathcal{H}} \newcommand{\L}{\mathcal{L}} \newcommand{\RR}{\text{RR}} \newcommand{\he}{\text{He}} \newcommand{\h}{\text{H}} \newcommand{\sech}{\text{sech}} \newcommand{\csch}{\text{csch}} \newcommand{\id}{\text{Id}}\)
Dwork and Roth (Dwork & Roth, 2013) defined \((\varepsilon, \delta)\)-differential privacy (DP) based on the following idea: if, regardless of the dataset, changing one sample does not (with high probability) change an algorithm's output, then an attacker has a hard time inferring any information about a single sample from the final output alone. This is stated as follows.
Definition 1. A randomised algorithm \(\mathcal{A}\) is said to be \((\varepsilon, \delta)\)-differentially private if for any subset \(S\) of the output space \(\text{Range}(\mathcal{A})\) and any pair of adjacent datasets \(D, D' \subseteq \text{Dom}(\mathcal{A})\), i.e. datasets differing in a single sample, we have
\[\begin{equation} \P[\mathcal{A}(D') \in S] \leq \exp(\varepsilon) \P[\mathcal{A}(D) \in S] + \delta. \end{equation}\]One caveat in machine learning is that the trained parameters are hard to characterise and are, in practice, the result of a random process owing to the use of stochastic gradient descent. As such, it is a priori non-trivial to enforce \((\varepsilon, \delta)\)-DP; for numerical algorithms, a common approach, the so-called Gaussian mechanism, is to add Gaussian noise to the final output (Dwork & Roth, 2013), thus obscuring its relationship with the input.
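To fix ideas, here is a minimal NumPy sketch of the Gaussian mechanism applied to a generic numerical query. The function and its name are ours, purely for illustration; we use the classical calibration \(\sigma = \Delta\sqrt{2\log(1.25/\delta)}/\varepsilon\) from Dwork & Roth (2013), valid for \(\varepsilon < 1\).

```python
import numpy as np

def gaussian_mechanism(value, sensitivity, eps, delta, rng=None):
    """Release `value` with (eps, delta)-differential privacy.

    `sensitivity` must bound the l2 distance between the query's outputs
    on any pair of adjacent datasets. Noise is calibrated via the classical
    bound sigma = sensitivity * sqrt(2 log(1.25/delta)) / eps (eps < 1).
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return np.asarray(value) + rng.normal(0.0, sigma, size=np.shape(value))

# Privately release the mean of n samples lying in [0, 1]: replacing one
# sample moves the mean by at most 1/n, hence the l2-sensitivity is 1/n.
x = np.random.default_rng(0).uniform(size=1000)
print(gaussian_mechanism(x.mean(), sensitivity=1 / len(x), eps=0.5, delta=1e-5))
```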
Since the composition of two differentially private algorithms is again differentially private (Dwork & Roth, 2013), one can apply this mechanism at every gradient step. Unfortunately, with a naive accounting, the final result carries too much noise to guarantee any performance; fortunately, Abadi et al. (Abadi et al., 2016) showed that the noise level need not be too large, which makes it possible to recover good performance. Bombari and Mondelli (Bombari & Mondelli, 2025) showed that this is indeed the case for random feature models. Their setting essentially amounts to training only the second layer whilst leaving the first layer fixed, which is precisely how we will train our second layer. Details of the training procedure are given in later sections.
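To quantify the cost of naive per-step noising, it is useful to recall the standard composition bounds, both from (Dwork & Roth, 2013); we state them here only for intuition, as the moments accountant of Abadi et al. gives a tighter, noise-specific accounting. Composing \(T\) mechanisms, each \((\varepsilon_0, \delta_0)\)-DP, yields
\[\begin{equation} (T\varepsilon_0,\ T\delta_0)\text{-DP (basic composition)}, \qquad \left(\varepsilon_0\sqrt{2T\log(1/\delta')} + T\varepsilon_0(e^{\varepsilon_0} - 1),\ T\delta_0 + \delta'\right)\text{-DP for any } \delta' > 0 \text{ (advanced composition)}. \end{equation}\]
The \(\sqrt{T}\) dependence of the advanced bound is what makes per-step noise affordable over many iterations.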
In what follows, we consider inputs in \(\R^d\) and outputs in \(\R\). We study the two-layer neural network
\[\begin{equation} f(x) = a^\intercal \sigma(W^\intercal x + b), \label{eq:model} \end{equation}\]
where \(\sigma = \tanh\) is the non-linear activation function, \(W \in \R^{d \times p}\) holds the coefficients of the first layer, \(b \in \R^p\) is the vector of bias units, and \(a \in \R^p\) holds the coefficients of the second layer,
and we train with respect to the empirical loss with ridge regularisation on the second layer, meaning we minimise
\[\begin{equation} \L(f) = \frac{1}{n} \sum_{i = 1}^n \left[f(x_i) - y_i\right]^2 + \lambda \|a \|^2. \end{equation}\]The training procedure comprises the two stages given below, denoted Algorithm 1. Note that we separate the training of the two layers, as is commonly done (Dandi et al., 2024; Ba et al., 2023; Berthier et al., 2024; Abbe et al., 2023; Damian et al., 2022), for the same reason we use two datasets: it merely helps the analysis later on, and we expect the general picture to hold without resampling, barring a worse sample complexity and/or a slower rate of convergence of the test loss.
Require. Number of iterations \(T\), step sizes \(\eta_W, \eta_a\), clipping constant \(C_a\), noise levels \(\sigma_W, \sigma_a\).
\(\rhd\) Initialisation. \((W_0)_{i, j} \sim \N\left(0, \frac{1}{d}\right), b \sim \N(0, \id), a_0 = \frac{1}{\sqrt{p}}(1, \ldots, 1)^\intercal\).
\(\rhd\) First layer training.
for \(i = 1, \ldots, n\) do
Compute the gradient \(g_W(x^W_i, y^W_i, f) \gets \nabla_W \L = \nabla_W (a_0^\intercal \sigma(W_0^\intercal x^W_i + b) - y^W_i)^2\).
endfor
Update the first layer \(W_1 \gets W_0 - \eta_W \sum_{i = 1}^n g_W(x^W_i, y^W_i, f)\).
for \(i = 1, \ldots, p\) do
Normalise the neurons and add noise \(w_i^1 \gets \frac{w_i^1}{\|w_i^1\|} + \sigma_W \N(0, \id).\)
endfor
\(\rhd\) Second layer training.
for \(t = 1, \ldots, T\) do
for \(i = 1, \ldots, n\) do
Compute the gradient
\(g_a(x^a_i, y^a_i, f) \gets \nabla_a \L = \nabla_a \left[(a_{t-1}^\intercal \sigma(W_1^\intercal x^a_i + b) - y^a_i)^2 + \lambda \|a_{t-1}\|^2\right]\).
Clip the gradient
\[g^C_a(x^a_i, y^a_i, f) \gets g_a(x^a_i, y^a_i, f) \Big/ \max\left(1, \frac{\left\|g_a(x^a_i, y^a_i, f)\right\|}{C_a}\right).\]endfor
Aggregate the gradients \(g_a \gets \frac{1}{n} \sum_{i = 1}^n g^C_a(x^a_i, y^a_i, f)\).
Update the second layer \(a_t \gets a_{t-1} - \eta_a g_a + \sqrt{\eta_a} \frac{C_a}{n} \sigma_a \N(0, \id)\).
endfor
Return \(W_1, a_T\).
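For concreteness, the following NumPy sketch transcribes Algorithm 1 under our notation. It is a sketch, not a tuned implementation: the hyper-parameters are passed in as arguments and should be set as in Assumption 4 below, and \(\sigma = \tanh\) throughout.

```python
import numpy as np

def train(XW, yW, Xa, ya, p, T, eta_W, eta_a, lam, C_a, sigma_W, sigma_a,
          seed=0):
    """Sketch of Algorithm 1: one noisy first-layer step, then T clipped,
    noised gradient steps on the second layer."""
    rng = np.random.default_rng(seed)
    n, d = XW.shape

    # Initialisation: W0 ~ N(0, 1/d) entrywise, b ~ N(0, Id), a0 = 1/sqrt(p).
    W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, p))
    b = rng.normal(0.0, 1.0, size=p)
    a = np.ones(p) / np.sqrt(p)

    # --- First layer: one full-batch gradient step. ---
    # For f(x) = a^T tanh(W^T x + b), the gradient of (f(x) - y)^2 in W is
    # 2 (f(x) - y) * x (a * tanh'(W^T x + b))^T, with tanh' = sech^2.
    grad_W = np.zeros_like(W)
    for x, y in zip(XW, yW):
        pre = W.T @ x + b
        residual = a @ np.tanh(pre) - y
        grad_W += 2.0 * residual * np.outer(x, a / np.cosh(pre) ** 2)
    W = W - eta_W * grad_W

    # Normalise each neuron (column of W) and add Gaussian noise.
    W = W / np.linalg.norm(W, axis=0, keepdims=True)
    W = W + sigma_W * rng.normal(size=W.shape)

    # --- Second layer: T steps of clipped, noised gradient descent. ---
    for _ in range(T):
        grad_a = np.zeros(p)
        for x, y in zip(Xa, ya):
            feat = np.tanh(W.T @ x + b)
            g = 2.0 * (a @ feat - y) * feat + 2.0 * lam * a
            # Per-sample clipping of the l2 norm to at most C_a.
            g = g / max(1.0, np.linalg.norm(g) / C_a)
            grad_a += g
        grad_a /= len(Xa)
        noise = rng.normal(size=p)
        a = a - eta_a * grad_a + np.sqrt(eta_a) * (C_a / len(Xa)) * sigma_a * noise

    return W, a
```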
Finally, we make some technical assumptions.
Assumption 1. (Data distribution) The inputs \(x^W_i\) and \(x^a_i\) are sampled i.i.d. from the standard Gaussian distribution \(\N(0, \id)\). The labels are generated from a single-index model: for some fixed \(\mu \in \mathbb{S}^{d-1}\) and some degree-\(q\) polynomial \(\sigma_*\), which together define \(f_*(x) = \sigma_*(\langle x, \mu \rangle)\), we have \(y^W_i = f_*(x^W_i)\) and \(y^a_i = f_*(x^a_i)\).
Moreover, we assume that \(\sigma_*\) has information exponent (Arous et al., 2021) equal to \(1\), meaning that \(\alpha^*_0 = 0\) and \(\alpha^*_1 \neq 0\), where \(\alpha^*_k\) denotes the \(k\)th Hermite coefficient of \(f_*\). This is standard (Ba et al., 2022), and in our regime and without the bias unit, it would have been necessary (Bietti et al., 2022; Dandi et al., 2024).
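As a concrete example (ours, purely illustrative): the link \(\sigma_*(z) = z + z^2 - 1\) satisfies Assumption 1 with \(q = 2\), since
\[\begin{equation} \alpha^*_0 = \E_{z \sim \N(0,1)}\left[\sigma_*(z)\right] = 0, \qquad \alpha^*_1 = \E_{z \sim \N(0,1)}\left[z \, \sigma_*(z)\right] = 1 \neq 0, \end{equation}\]
whereas \(\sigma_*(z) = z^2 - 1\) has \(\alpha^*_1 = 0\), hence information exponent \(2\), and falls outside our assumption.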
Assumption 2. (Regime) There exist \(\varepsilon_n, \varepsilon_p > 0\) such that \(\frac{3}{16}\varepsilon_p < \varepsilon_n\), \(p \geq d^{\varepsilon_p}\) and \(n \geq d^{1 + 3\varepsilon_n}\).
Assumption 3. (Privacy budget)
\[\begin{equation} \delta \in (0, 1), \quad \varepsilon \in \left(0, 8 \log(1/\delta)\right), \quad \frac{\varepsilon}{\sqrt{\log \frac{1}{\delta}}} = \omega\left(\frac{1}{\log d}\right). \end{equation}\]Assumption 4. (Hyper-parameter choice)
\[\begin{equation} \begin{gathered} \lambda = \frac{p}{d^{2\varepsilon_n}}, \quad \eta_W = d^{\frac{3\varepsilon_n}{2}}\sqrt{p}, \quad \eta_a = \frac{d^{\varepsilon_n}\log^2 d}{p}, \quad C_a = d^{\varepsilon_n}\sqrt{p} \log^{q+1} d, \quad T = d^{\varepsilon_n}, \\ \sigma_W = \sqrt{\frac{d^{1 + \varepsilon_B}}{n}} \log^{q+3} d, \quad \sigma_a = \sqrt{\eta_a T} \frac{\sqrt{8\log(1/\delta)}}{\varepsilon}. \end{gathered} \end{equation}\]Now we can formally state the main result.
Theorem 1. Consider the two-layer neural network in \(\eqref{eq:model}\) with input dimension \(d\) and number of features \(p\), trained over two datasets as described in Algorithm 1. Let \(f_{\text{NN}}(x) = a_T^\intercal \sigma(W_1^\intercal x + b)\) be the resulting neural network after training. Then \(f_\text{NN}\) is \((\varepsilon, \delta)\)-differentially private, and moreover, we have
\[\begin{equation} \begin{gathered} \mathcal{R}(f_{\text{NN}}) = \E_x[(f_{\text{NN}}(x) - f_*(x))^2] \\ = O\left(d^{-\frac{\varepsilon_p}{8q}} + \frac{d^{1 + \varepsilon_n}}{n} \frac{\log(1.25/\delta)}{\varepsilon^2}\right) = o(1), \end{gathered} \end{equation}\]with probability at least \(1 - \exp(-c \log^2 d)\), where \(c\) is an absolute constant.
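To unpack the rate (a one-line check, added for convenience): the first term is \(d^{-\varepsilon_p/(8q)} = o(1)\) since \(\varepsilon_p > 0\), while for the second term,
\[\begin{equation} \frac{d^{1 + \varepsilon_n}}{n} \frac{\log(1.25/\delta)}{\varepsilon^2} \leq d^{-2\varepsilon_n} \frac{\log(1.25/\delta)}{\varepsilon^2} = d^{-2\varepsilon_n} \cdot o(\log^2 d) = o(1), \end{equation}\]
where the inequality uses \(n \geq d^{1 + 3\varepsilon_n}\) from Assumption 2, and the middle step uses \(\varepsilon / \sqrt{\log(1/\delta)} = \omega(1/\log d)\) from Assumption 3.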
Once the behaviour of random feature models was characterised (Hu & Lu, 2023; Bosch et al., 2023) and they were shown to be incapable of feature learning (Yehudai & Shamir, 2019; Refinetti et al., 2021; Yang et al., 2020), two lines of research emerged. One performs perturbative corrections to the large-width lazy regime (Hanin & Nica, 2020; Dyer & Gur-Ari, 2020; Naveh & Ringel, 2021; Naveh et al., 2020).
This work belongs to the other, where feature learning comes from one non-perturbative step of gradient descent. In particular, Ba et al. (Ba et al., 2022) showed that for a slightly different model, \(f(x) = \frac{1}{\sqrt{p}} a^\intercal \sigma(W^\intercal x + b)\), the first gradient step on the first layer approximates the linear term of the target function. Moniri et al. (Moniri et al., 2024) extended this analysis to show that with an appropriately chosen step size \(\eta_W\), one obtains \(\ell\) spikes in the spectrum of the feature matrix, corresponding to terms of degree at most \(\ell\) of the target. Unlike in this work, in their setting the second layer was trained via ridge regression. In another direction, Cui et al. (Cui et al., 2024) characterised shallow neural networks after one step of gradient descent on the first layer, showing that they are equivalent to spiked random feature models, similar to the Gaussian equivalence property of random feature models themselves.
Due to their analytical tractability, single- and multi-index models have been the subject of extensive research, for which Bruna and Hsu provided a survey (Bruna & Hsu, 2025). In particular, Ba et al. (Ba et al., 2023) considered one step of gradient descent for anisotropic single-index models. Based on this method, Oko et al. extended the analysis to sums-of-indices models (Oko et al., 2024), with multiple steps of gradient descent on the first layer. More generally, Damian et al. carried out a similar analysis for multi-index models (Damian et al., 2022). Dandi et al. (Dandi et al., 2024) extended it further, derived the precise sample complexity, and hinted at a hierarchy of functions demanding increasingly large batch sizes to learn.
Beyond one step, similar analyses were carried out with gradient flow (Mousavi-Hosseini et al., 2023; Bietti et al., 2022; Bietti et al., 2023). As a culmination of much prior effort, for shallow neural networks without a bias term, Dandi et al. characterised the classes of targets learnable in finite time by gradient-based methods, as well as their sample complexity (Dandi et al., 2024).
Whilst much has been discovered about feature learning and indexed models, much less is known about learning with differential privacy. For a long time, it was open whether gradient descent could be made differentially private without hurting performance, until the breakthrough by Abadi et al. (Abadi et al., 2016). For example, some earlier attempts required too large a privacy budget (Shokri & Shmatikov, 2015) and were proven to be ineffective (Hitaj et al., 2017).
Likewise, for a long time, learning with differential privacy remained open. There were hints of a privacy-performance trade-off (Chaudhuri & Monteleoni, 2009; Chaudhuri et al., 2011), which prompted a line of research focusing on low-dimensional subspaces via ad-hoc techniques (Kairouz et al., 2020; Zhou et al., 2021; Yu et al., 2021). Nevertheless, in practice, even with only standard differentially private variants of gradient descent, one still achieves good performance (Mehta et al., 2022; De et al., 2022), seemingly avoiding the trade-off altogether.
Recently and independently, Bombari and Mondelli (Bombari & Mondelli, 2025) and Brown et al. (Brown et al., 2024) showed that it is possible to achieve differential privacy at no loss of performance by using overparameterised models, for random feature models and linear regression respectively. The methodology is similar: one sets appropriate clipping constants and noise levels, then shows that, with high probability, clipping does not occur and the added noise is small enough not to affect performance.
Interestingly, the two teams used different strategies: Bombari and Mondelli took \(\eta_a\) small enough to approximate the training trajectory by a differential equation, stochastic due to the added Gaussian noise, then bounded the solution by a leave-one-out argument. On the other hand, Brown et al. assumed that the target is linear and used the target's hidden parameters to bound the trajectory directly via an induction argument. We shall borrow from both of these techniques to analyse the second layer.
Our analysis of the first layer closely follows that of Ba et al. (Ba et al., 2023), modified for isotropic data, and comprises three steps.
Section 3 is devoted to the first two steps of the first layer's analysis; Section 4 finishes the rest of the argument.
In what follows, all algorithms are expressed in the natural basis. We denote by \(\| \cdot \|\) the \(\ell_2\) norm for vectors, and the induced operator norm for matrices. For sub-Gaussian random variables, we denote by \(\|\cdot\|_{\psi_2}\) the sub-Gaussian norm (Vershynin, 2018).
Given two quantities \(a, b\), we write \(a \lesssim b\) or \(b \gtrsim a\) to mean there exists a hidden absolute constant \(C > 0\) such that \(a \leq C b\), and \(a \asymp b\) to mean both \(a \lesssim b\) and \(b \lesssim a\). All complexity notations are understood with the size of the dataset \(n\), the input dimension \(d\), and the number of hidden neurons \(p\) large enough.
The notations \(O_\P(\cdot), o_\P(\cdot), \lesssim_\P, \asymp_\P\) are to be understood as \(O(\cdot), o(\cdot), \lesssim, \asymp\) holding with probability at least \(1 - \exp(-c\log^2 d)\) as \(d\) (and thus \(n\) and \(p\)) tends to infinity.
We write \(\he_n\) and \(\h_n\) for the probabilist's and physicist's \(n\)th Hermite polynomials, given by
\[\begin{equation} \he_n = (-1)^n e^{\frac{x^2}{2}} \frac{d^n}{dx^n} e^{-\frac{x^2}{2}}, \quad \h_n = (-1)^n e^{x^2} \frac{d^n}{dx^n} e^{-x^2}. \end{equation}\]Let \(\alpha^b_n\) be the \(n\)th (probabilist's) Hermite coefficient of \(\sigma(\cdot + b)\), and \(\alpha_n = \alpha^0_n\). Likewise, let \(\alpha^*_n\) be the \(n\)th (probabilist's) Hermite coefficient of \(f_*\). Finally, we denote by \(w_i \in \R^d\) (resp. \(w^0_i, w^1_i\)) the \(i\)th column of \(W\) (resp. \(W_0, W_1\)).
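Numerically, these coefficients can be computed by Gauss–Hermite quadrature; below is a small illustrative sketch (ours, not part of the analysis). We adopt the normalisation \(\alpha_n = \E[f(Z)\he_n(Z)]/n!\), so that \(f = \sum_n \alpha_n \he_n\), and rescale the quadrature nodes since NumPy's `hermgauss` targets the physicist's weight \(e^{-x^2}\).

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite import hermgauss
from numpy.polynomial.hermite_e import hermeval

def hermite_coeff(f, n, quad_points=200):
    """n-th probabilist's Hermite coefficient of f:
    alpha_n = E[f(Z) He_n(Z)] / n! for Z ~ N(0, 1).
    Gauss-Hermite quadrature targets the weight e^{-x^2}, hence the
    change of variables z = sqrt(2) x and the 1/sqrt(pi) factor."""
    x, w = hermgauss(quad_points)
    z = np.sqrt(2.0) * x
    he_n = hermeval(z, [0.0] * n + [1.0])  # He_n evaluated at the nodes
    return (w @ (f(z) * he_n)) / (np.sqrt(np.pi) * factorial(n))

# Sanity checks: tanh is odd, so alpha_0 = 0 and alpha_1 = E[Z tanh(Z)] > 0;
# a non-zero bias b shifts mass into the even coefficients.
print(hermite_coeff(np.tanh, 0))                    # ~ 0
print(hermite_coeff(np.tanh, 1))                    # ~ 0.61
print(hermite_coeff(lambda z: np.tanh(z + 0.5), 0)) # non-zero for b = 0.5
```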