Let $(X, Y)$ be a random couple in $S\times T$ with unknown distribution $P$. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be i.i.d. copies of $(X, Y)$, with $P_n$ being their empirical distribution. Let $h_1, \ldots, h_N : S\mapsto [-1, 1]$ be a dictionary consisting of $N$ functions. For $\lambda\in\mathbb{R}^N$, denote $f_{\lambda} := \sum_{j=1}^N \lambda_j h_j$. Let $\ell : T\times\mathbb{R}\mapsto\mathbb{R}$ be a given loss function, convex with respect to the second variable. Denote $(\ell\bullet f)(x, y) := \ell(y; f(x))$. We study the following penalized empirical risk minimization problem
\[\hat{\lambda}^{\varepsilon }:=\mathop{\operatorname {argmin}}_{\lambda\in {\mathbb{R}}^{N}}\bigl[P_{n}(\ell\bullet f_{\lambda})+\varepsilon \|\lambda\|_{\ell_{p}}^{p}\bigr],\]
which is an empirical version of the problem
\[\lambda^{\varepsilon }:=\mathop{\operatorname{argmin}}_{\lambda\in {\mathbb{R}}^{N}}\bigl[P(\ell \bullet f_{\lambda})+\varepsilon \|\lambda\|_{\ell_{p}}^{p}\bigr]\]
(here $\varepsilon\geq 0$ is a regularization parameter; $\lambda^{0}$ corresponds to $\varepsilon = 0$). A number of regression and classification problems fit this general framework. We are interested in the case when $p\geq 1$ is close enough to $1$ (so that $p-1$ is of the order $\frac{1}{\log N}$, or smaller). We show that the “sparsity” of $\lambda^{\varepsilon}$ implies the “sparsity” of $\hat{\lambda}^{\varepsilon}$, and we study the impact of “sparsity” on bounding the excess risk $P(\ell\bullet f_{\hat{\lambda}^{\varepsilon}}) - P(\ell\bullet f_{\lambda^{0}})$ of solutions of empirical risk minimization problems.
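To make the setup concrete, the following is a minimal numerical sketch, not the paper's method, of the empirical problem above under illustrative assumptions: quadratic loss $\ell(y; u) = (y-u)^2$, a cosine dictionary on $S = [0, 1]$, a sparse oracle vector, and $p = 1 + 1/\log N$. The data-generating model, dictionary, and solver are all hypothetical choices made for this example.

```python
# Sketch of penalized empirical risk minimization with an l_p^p penalty,
# p = 1 + 1/log N.  All modeling choices below are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, N = 200, 50                                  # sample size, dictionary size

# Dictionary h_1, ..., h_N : S -> [-1, 1]; here h_j(x) = cos(pi * j * x).
X = rng.uniform(0.0, 1.0, size=n)
H = np.cos(np.pi * np.outer(X, np.arange(1, N + 1)))   # n-by-N design matrix

# Sparse oracle vector lambda* with three nonzero coordinates; noisy responses.
lam_star = np.zeros(N)
lam_star[[2, 7, 15]] = [1.0, -0.5, 0.75]
Y = H @ lam_star + 0.1 * rng.standard_normal(n)

eps = 0.05                                      # regularization parameter
p = 1.0 + 1.0 / np.log(N)                       # p - 1 of order 1/log N

def penalized_risk(lam):
    # P_n(ell . f_lambda) + eps * ||lambda||_{l_p}^p for quadratic loss
    return np.mean((Y - H @ lam) ** 2) + eps * np.sum(np.abs(lam) ** p)

def penalized_risk_grad(lam):
    # For p > 1 the penalty is differentiable, with zero gradient at 0.
    data_term = -2.0 / n * (H.T @ (Y - H @ lam))
    penalty_term = eps * p * np.abs(lam) ** (p - 1) * np.sign(lam)
    return data_term + penalty_term

res = minimize(penalized_risk, np.zeros(N), jac=penalized_risk_grad,
               method="L-BFGS-B")
lam_hat = res.x                                 # approximation of lambda-hat^eps
print("coordinates with |lam_hat_j| > 1e-2:",
      np.flatnonzero(np.abs(lam_hat) > 1e-2))
```

Since $p > 1$, the objective is convex and continuously differentiable, so a generic smooth optimizer suffices in this sketch; at $p = 1$ (the lasso case) the penalty is nonsmooth and a dedicated method would be needed.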