Sublinear algorithms
Notes from Sepehr Assadi's course
Sublinear algorithms
These algorithms compute an answer while examining only part of the input, or while streaming over it. We study both sublinear time algorithms and sublinear space (streaming) algorithms. Sublinear time algorithms tend to rely heavily on randomization.
An example of a sublinear time algorithm
Given an array \(A[1:n]\) consisting of a permutation of the first \(n\) numbers, output an index \(i\) s.t. \(A[i]\) is an even number.
In sublinear time, this can only be done with a randomized algorithm. Pick an index \(p\) between \(1\) and \(n\) uniformly at random and check \(A[p]\); if it is even, output \(p\), else repeat. Since any probed element is even with probability about \(\frac{1}{2}\), the expected number of probes is \(\mathcal{O}(1)\).
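A minimal sketch of this sampling loop in Python (the function name and 0-based indexing are my own; assumes \(n \geq 2\) so an even entry exists):

```python
import random

def find_even_index(A):
    """Repeatedly probe a uniformly random position until an even
    entry is found. For a permutation of 1..n (n >= 2), each probe
    succeeds with probability about 1/2, so the expected number of
    probes is O(1)."""
    n = len(A)
    while True:
        p = random.randrange(n)  # uniform in {0, ..., n-1}
        if A[p] % 2 == 0:
            return p

print(find_even_index([3, 1, 4, 2]))  # prints 2 or 3
```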
Sublinear space example
Given an array \(A[1:n - 1]\) consisting of a permutation of \(n-1\) of the first \(n\) numbers, output the number \(i\) that is missing.
A very simple deterministic solution is to store the running sum as the data streams by; the missing number is \(\frac{n(n+1)}{2}\) minus the final sum.
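A sketch of this in Python, assuming the input arrives as a stream of the \(n-1\) present numbers:

```python
def missing_number(stream, n):
    """Recover the single missing number among 1..n while storing
    only one counter (O(log n) bits)."""
    total = n * (n + 1) // 2   # sum of 1..n
    for x in stream:
        total -= x             # subtract each element as it streams by
    return total               # what remains is the missing number

print(missing_number(iter([5, 3, 1, 2]), 5))  # prints 4
```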
Any streaming solution needs \(\Omega(\log(n))\) bits
This is fairly straightforward. The algorithm must be able to produce any of \(n\) possible answers, so it needs \(n\) distinguishable memory states, which requires at least \(\log_2(n)\) bits.
For the general case of \(k\) missing numbers, the bound is \(\Theta(k\log(\frac{n}{k}))\)
We have \(\binom{n}{k}\) possible answers because any \(k\) elements can be missing, so we need \(\log\binom{n}{k}\) bits.
Now, \(\binom{n}{k} = \frac{n}{k}\cdot\frac{n-1}{k-1}\cdots\frac{n-k+1}{1}\), and since each factor is at least \(\frac{n}{k}\), we get \(\binom{n}{k} \geq (\frac{n}{k})^k\) (a matching upper bound of \((\frac{en}{k})^k\) also holds). Taking logarithms gives the lower bound on the number of bits.
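A quick numeric sanity check of \(\log\binom{n}{k} = \Theta(k\log\frac{n}{k})\) (parameter values chosen arbitrarily):

```python
import math

for n, k in [(10**5, 10), (10**5, 100), (10**5, 1000)]:
    exact = math.log2(math.comb(n, k))  # true number of bits needed
    lower = k * math.log2(n / k)        # the k*log(n/k) lower bound
    print(f"n={n}, k={k}: log2 C(n,k) = {exact:.0f} >= k*log2(n/k) = {lower:.0f}")
```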
Probabilistic analysis
Sufficiently high probability
We define sufficiently high probability as a probability greater than \(1 - \frac{1}{poly(n)}\), where \(poly(n)\) is an arbitrarily large polynomial in the input size. So, practically, larger than \(1 - \frac{1}{n}\).
In any probabilistic analysis of an algorithm, we do two things:
- bounding the expectation
- bounding the concentration/variance
Bounding the expectation
This usually uses only one thing: \(\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]\). This follows directly from the definition of expectation, and the random variables need not be independent. We'll refer to this as the linearity of expectation.
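A tiny simulation illustrating the point about independence: below, \(Y\) is completely determined by \(X\), yet \(\mathbb{E}[X+Y] = \mathbb{E}[X] + \mathbb{E}[Y]\) still holds (the setup is mine, for illustration):

```python
import random

trials = 100_000
sx = sy = sxy = 0.0
for _ in range(trials):
    x = random.random()  # X uniform on [0, 1], E[X] = 1/2
    y = 1.0 - x          # Y fully dependent on X, E[Y] = 1/2
    sx, sy, sxy = sx + x, sy + y, sxy + (x + y)

# Empirical E[X] and E[Y] are ~0.5; E[X + Y] equals their sum.
print(sx / trials, sy / trials, sxy / trials)
```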
Bounding the deviation - Markov bound
For a nonnegative random variable \(X\) and any \(t \geq 1\):
\(\text{Pr}(X \geq t\cdot\mathbb{E}[X]) \leq \frac{1}{t}\)
This follows trivially from the following decomposition (let \(\mu = \mathbb{E}[X]\)):
\(\mu = \mathbb{E}[X \mid X \geq t\cdot\mu]\cdot\text{Pr}(X \geq t\cdot\mu) + \mathbb{E}[X \mid X < t\cdot\mu]\cdot\text{Pr}(X < t\cdot\mu)\)
The first conditional expectation is at least \(t\cdot\mu\), and the second term is at least \(0\) (this is where nonnegativity of \(X\) is used), so \(\mu \geq t\cdot\mu\cdot\text{Pr}(X \geq t\cdot\mu)\).
Another form of the same inequality (substituting \(b = t\cdot\mathbb{E}[X]\)) is
\(\text{Pr}(X\geq b) \leq \frac{\mathbb{E}[X]}{b}\)
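An empirical check of this second form for a nonnegative variable (an exponential with mean \(1\), chosen arbitrarily):

```python
import random

trials, b = 100_000, 5.0
hits = sum(random.expovariate(1.0) >= b for _ in range(trials))
# Markov guarantees Pr(X >= b) <= E[X]/b = 1/5; the true value
# here is e^{-5} ~ 0.0067, so the bound is loose but valid.
print(hits / trials, "<=", 1.0 / b)
```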
Bounding the deviation - Chebyshev bound
For a random variable \(X\)
\(\text{Pr}(|X - \mathbb{E}[X]| \geq b) \leq \frac{\text{Var}[X]}{b^2}\)
This follows trivially by applying the Markov bound to the nonnegative random variable \((X - \mathbb{E}[X])^2\) with threshold \(b^2\).
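A sketch of the standard way this bound gets used: averaging \(k\) independent samples divides the variance by \(k\), and Chebyshev turns that into a concentration statement (parameters below are arbitrary):

```python
import random

def mean_of_k(k):
    """Average of k i.i.d. uniform[0,1] samples; Var = (1/12)/k."""
    return sum(random.random() for _ in range(k)) / k

trials, k, b = 50_000, 50, 0.05
var = (1 / 12) / k  # variance of the averaged estimator
hits = sum(abs(mean_of_k(k) - 0.5) >= b for _ in range(trials))
print(hits / trials, "<=", var / b ** 2)  # Chebyshev: Var/b^2
```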
Bounding the deviation - Chernoff bound
For independent random variables \(X_i, i\in \{1,\dots,n\}\), each taking a value in \([0,1]\), and \(X = \sum_i X_i\):
\[ \text{Pr}(|X - \mathbb{E}[X]| \geq t\cdot \mathbb{E}[X]) \leq 2\cdot\text{exp}\left(-\frac{t\cdot\mathbb{E}[X]}{3}\right) \]
\[ \text{Pr}(|X - \mathbb{E}[X]| \geq \epsilon\cdot \mathbb{E}[X]) \leq 2\cdot\text{exp}\left(-\frac{\epsilon^2\cdot\mathbb{E}[X]}{3}\right) \]
Here the first bound holds for \(t \geq 1\) and the second for \(0 < \epsilon \leq 1\).
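An empirical comparison against the second form, using fair coin flips (which take values in \(\{0,1\} \subseteq [0,1]\); parameters arbitrary):

```python
import math
import random

n, eps, trials = 200, 0.2, 20_000
mu = n / 2                                  # E[X] for n fair coin flips
bound = 2 * math.exp(-(eps ** 2) * mu / 3)  # Chernoff upper bound
hits = sum(
    abs(sum(random.random() < 0.5 for _ in range(n)) - mu) >= eps * mu
    for _ in range(trials)
)
print(hits / trials, "<=", bound)  # e.g. ~0.005 <= ~0.53
```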
Bounding probabilities - Union bound
For two events \(\mathcal{E}_1\) and \(\mathcal{E}_2\) (not necessarily independent), the following inequality holds, since \(\text{Pr}(\mathcal{E}_1 \cup \mathcal{E}_2) = \text{Pr}(\mathcal{E}_1) + \text{Pr}(\mathcal{E}_2) - \text{Pr}(\mathcal{E}_1 \cap \mathcal{E}_2)\):
\(\text{Pr}(\mathcal{E}_1 \cup \mathcal{E}_2) \leq \text{Pr}(\mathcal{E}_1) + \text{Pr}(\mathcal{E}_2)\)
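A typical use (my example, in the spirit of the "sufficiently high probability" definition above): if an algorithm errs on each of \(n\) queries with probability at most \(\frac{1}{n^2}\), then by the union bound over the \(n\) error events
\[ \text{Pr}(\text{some query fails}) \leq n \cdot \frac{1}{n^2} = \frac{1}{n}, \]
so all \(n\) answers are simultaneously correct with probability at least \(1 - \frac{1}{n}\). The bound extends to any finite collection of events by induction.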
Sublinear algorithms for graphs
First of all, you decide the query model for the algorithm. Usually, the model is adjacency lists, adjacency matrices, or both.
An adjacency matrix allows us to query in \(\mathcal{O}(1)\) time whether an edge exists between two vertices. Similarly, an adjacency list allows us to query the number of neighbours of a vertex (its degree) and its i'th neighbour, each in \(\mathcal{O}(1)\) time.
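A sketch of the two query interfaces in Python (the class and method names are mine, for illustration):

```python
class MatrixQueryGraph:
    """Adjacency-matrix model: O(1) edge-existence queries."""
    def __init__(self, n, edges):
        self.n = n
        self._edges = {frozenset(e) for e in edges}

    def has_edge(self, u, v):
        return frozenset((u, v)) in self._edges

class ListQueryGraph:
    """Adjacency-list model: O(1) degree and i'th-neighbour queries."""
    def __init__(self, n, edges):
        self.adj = [[] for _ in range(n)]
        for u, v in edges:
            self.adj[u].append(v)
            self.adj[v].append(u)

    def degree(self, v):
        return len(self.adj[v])

    def neighbour(self, v, i):
        return self.adj[v][i]
```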
Estimating the number of connected components
Here's the problem premise: given a graph \(G\) with \(n\) vertices and \(C\) connected components, output an estimate \(\widetilde{C}\) s.t.
\[ \text{Pr}(|\widetilde{C} - C| \leq \epsilon n) \geq 1 - \delta \]
where \(\epsilon\) is the approximation parameter and \(\delta\) is the failure probability.
For this, we first prove something else. Let \(s_v\) be the size of the connected component containing \(v\). The following holds:
\[ \sum_{v \in V} \frac{1}{s_v} = C \]
The proof is simple. Partition the set of vertices into the components and sum within each component individually: a component of size \(s\) contributes \(s\) terms of \(\frac{1}{s}\) each, i.e. exactly \(1\).
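A quick check of this identity on a small example graph, computing each \(s_v\) by BFS (the graph and helper are mine, for illustration):

```python
from collections import deque

def component_sizes(n, adj):
    """Return s_v for every vertex v via BFS over each component."""
    size, seen = [0] * n, [False] * n
    for s in range(n):
        if seen[s]:
            continue
        comp, queue = [], deque([s])
        seen[s] = True
        while queue:
            v = queue.popleft()
            comp.append(v)
            for u in adj[v]:
                if not seen[u]:
                    seen[u] = True
                    queue.append(u)
        for v in comp:
            size[v] = len(comp)
    return size

# Two components: {0, 1, 2} and {3, 4}; the sum should be C = 2.
adj = [[1, 2], [0], [0], [4], [3]]
print(sum(1 / s for s in component_sizes(5, adj)))  # -> 2.0
```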