Jason's Blog

October 19, 2024 · Last Updated: October 25, 2024

Notes on HyperINR: A Fast and Predictive Hypernetwork for Implicit Neural Representations via Knowledge Distillation


Mind Map

(mind map image)

Structure

tiny-cuda-nn

neurcomp

siren

Some algorithms noted in the paper

Shepard's algorithm

Shepard's interpolation algorithm is a numerical interpolation method proposed by Donald Shepard in 1968. It is based on inverse distance weighting: for a point to be predicted, nearby sample points have more influence than distant ones, so during interpolation the weight of each sample point is determined by its distance to the prediction point.

The general form of Shepard's interpolation algorithm is as follows:

$$ f(x) = \frac {\sum_{j=1}^{N} w_j E_j} {\sum_{k=1}^{N} w_k} $$

And

$$ w_j = \left(\frac {d} {|x - x_j|}\right)^p $$

where $E_j$ is the known value at sample point $x_j$, $w_j$ is its weight, $p > 0$ is the power parameter, and $d$ is a constant reference distance (it cancels out after normalization).
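The formula above can be implemented directly. Below is a minimal NumPy sketch; the function name, the $p = 2$ default, and the example data are my own illustrative choices, not taken from the paper.

```python
import numpy as np

def shepard_interpolate(x, sample_points, sample_values, p=2.0):
    """Shepard (inverse distance weighted) interpolation at query point x.

    sample_points: (N, D) array of known locations x_j
    sample_values: (N,) array of known values E_j
    p: power parameter controlling how quickly influence decays with distance
    """
    dists = np.linalg.norm(sample_points - x, axis=1)
    # If the query coincides with a sample, return that sample's value directly.
    hit = np.isclose(dists, 0.0)
    if np.any(hit):
        return sample_values[hit][0]
    weights = 1.0 / dists**p  # the constant factor d**p cancels after normalization
    return np.sum(weights * sample_values) / np.sum(weights)

# Example: interpolate a value at (0.5, 0.5) from three known samples.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
vals = np.array([1.0, 2.0, 3.0])
print(shepard_interpolate(np.array([0.5, 0.5]), pts, vals))
```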

fast Poisson Disk sampling algorithm

Bridson's fast Poisson disk sampling algorithm is widely used in computer graphics to generate uniform but irregular point sets in two- or three-dimensional space. The algorithm is fast and efficient, and the generated points have a "blue noise" characteristic: the minimum distance between any two points is greater than a given value, which prevents points from clustering, while the distribution remains uniform at larger spatial scales.

The main steps of the algorithm are as follows (a minimal Python sketch follows below):

  1. Build a background grid with cell size $r/\sqrt{2}$ (in 2D), so that each cell can contain at most one sample and neighbor checks take constant time.
  2. Pick a random initial sample and add it to the grid and to an active list.
  3. While the active list is not empty, pick a random active sample and generate up to $k$ candidates in the annulus between $r$ and $2r$ around it; accept a candidate if no existing sample lies within distance $r$ (checked via the grid), adding it to the grid and the active list. If none of the $k$ candidates is valid, remove the sample from the active list.

Poisson disk sampling appears in many domains, such as graphics rendering, Geographic Information Systems (GIS), machine learning, etc.
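A hedged Python sketch of the steps above; the 2-D rectangular domain, parameter names, and defaults are illustrative assumptions.

```python
import math
import random

def poisson_disk_2d(width, height, r, k=30):
    """Bridson's fast Poisson disk sampling in a width x height rectangle.

    r: minimum allowed distance between any two samples
    k: number of candidate attempts per active sample
    """
    cell = r / math.sqrt(2)                      # cell size: at most one sample per cell
    cols, rows = int(width / cell) + 1, int(height / cell) + 1
    grid = [[None] * cols for _ in range(rows)]  # background grid for fast neighbor lookups

    def grid_coords(p):
        return int(p[1] / cell), int(p[0] / cell)

    def far_enough(p):
        gr, gc = grid_coords(p)
        for i in range(max(gr - 2, 0), min(gr + 3, rows)):
            for j in range(max(gc - 2, 0), min(gc + 3, cols)):
                q = grid[i][j]
                if q is not None and math.dist(p, q) < r:
                    return False
        return True

    # Step 1 and 2: start from one random initial sample.
    first = (random.uniform(0, width), random.uniform(0, height))
    gr, gc = grid_coords(first)
    grid[gr][gc] = first
    samples, active = [first], [first]

    # Step 3: grow the sample set from the active list.
    while active:
        center = random.choice(active)
        for _ in range(k):
            # Candidate drawn from the annulus [r, 2r] around the chosen sample.
            ang = random.uniform(0, 2 * math.pi)
            rad = random.uniform(r, 2 * r)
            cand = (center[0] + rad * math.cos(ang), center[1] + rad * math.sin(ang))
            if 0 <= cand[0] < width and 0 <= cand[1] < height and far_enough(cand):
                gr, gc = grid_coords(cand)
                grid[gr][gc] = cand
                samples.append(cand)
                active.append(cand)
                break
        else:
            # No valid candidate after k tries: retire this sample.
            active.remove(center)
    return samples

print(len(poisson_disk_2d(1.0, 1.0, 0.05)))
```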

Gaussian kernel sampling

Gaussian Kernel Sampling is a sampling method used in various fields like machine learning, signal processing and statistics. In this method, samples are generated based on the Gaussian distribution (also known as Normal distribution).

The Gaussian Kernel is a function in the form of a bell curve, characterized by its mean (μ) and standard deviation (σ). It has the property that values closer to the mean are sampled more frequently than values further away.

In machine learning, the Gaussian kernel is often used in kernel methods (such as Support Vector Machines and Gaussian Processes) to map linearly inseparable data into a higher-dimensional space where it becomes linearly separable.

In signal processing, a Gaussian kernel can be used in Gaussian filters for blurring images or reducing noise.

In general, Gaussian Kernel Sampling is a powerful method for data generation and transformation, given its properties of continuity, differentiability and its characteristic bell shape.
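A small NumPy illustration of the three uses mentioned above (drawing Gaussian samples, the RBF kernel from kernel methods, and a blur kernel); the parameter values are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sampling: values near the mean mu are drawn more frequently than distant values.
mu, sigma = 0.0, 1.0
samples = rng.normal(mu, sigma, size=1000)

# Kernel methods: the Gaussian (RBF) kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

# Signal processing: a 1-D Gaussian filter kernel, normalized to sum to 1.
xs = np.arange(-3, 4)
blur = np.exp(-xs ** 2 / (2.0 * sigma ** 2))
blur /= blur.sum()

print(samples.mean(), gaussian_kernel(np.array([0.0]), np.array([1.0])), blur)
```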

Explanation of HyperINR

$$ \Phi : \vec{x} \mapsto \Phi (\vec{x}) = \vec{v}, \quad \vec{x} \in \mathbb{R}^2 \text{ or } \mathbb{R}^3 $$

After decomposition,

$$ \Phi (\vec {x} \mid \theta) = S \circ E(\theta)(\vec{x}) $$

In simpler terms, the formula indicates that the input $\vec{x}$ is first processed by $E(\theta)$, and the result is then passed to $S$ to produce the final output $\Phi(\vec{x} | \theta)$.
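A toy sketch of this composition in PyTorch, written only to make the formula concrete: a small hypernetwork turns the conditioning parameter $\theta$ into the weights of an encoding layer $E(\theta)$, and a shared decoder $S$ maps the encoded coordinate to the output. The layer sizes, the plain-linear encoding, and the MLP hypernetwork are my own assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ToyHyperINR(nn.Module):
    """Illustrative Phi(x | theta) = S(E(theta)(x)); not the paper's actual design."""
    def __init__(self, coord_dim=3, theta_dim=1, feat_dim=32, out_dim=1):
        super().__init__()
        self.coord_dim, self.feat_dim = coord_dim, feat_dim
        # Hypernetwork: theta -> weights and biases of the encoding layer E(theta).
        self.hyper = nn.Linear(theta_dim, feat_dim * coord_dim + feat_dim)
        # Shared decoder S, independent of theta.
        self.S = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

    def forward(self, x, theta):
        params = self.hyper(theta).flatten()                     # parameters of E(theta)
        W = params[: self.feat_dim * self.coord_dim].view(self.feat_dim, self.coord_dim)
        b = params[self.feat_dim * self.coord_dim :]
        h = torch.sin(x @ W.t() + b)                             # E(theta)(x)
        return self.S(h)                                         # S(E(theta)(x))

model = ToyHyperINR()
x = torch.rand(8, 3)               # coordinates in R^3
theta = torch.tensor([[0.5]])      # conditioning parameter, e.g. a simulation parameter
print(model(x, theta).shape)       # torch.Size([8, 1])
```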

Adam Optimizer

Adam optimizer is a popular gradient-based optimization algorithm for training machine learning models, especially deep neural networks. It was introduced by Diederik P. Kingma and Jimmy Ba in a paper titled "Adam: A Method for Stochastic Optimization" in 2014.

Adam combines ideas from two other widely employed optimization methods: Momentum and RMSprop (Root Mean Square Propagation). The key advantages of the Adam optimizer include:

  1. Adaptive Learning Rates: Adam maintains a per-parameter learning rate that is adjusted based on the average of recent gradients for the parameter. This helps in dealing with sparse gradients or noisy data.
  2. Momentum: It calculates the exponential moving average of gradients and uses this momentum to update the weights, which helps in accelerating optimization in the direction of consistent gradient reduction.
  3. Variance Correction: Along with maintaining gradient momentum, Adam also keeps track of the squared gradients to normalize the parameter updates. This normalization counteracts the problem of varying learning speeds for different parameters.

The typical update rule using Adam optimizer is as follows:

  1. Compute the gradient of the loss function with respect to the parameters (weights).
  2. Update biased first moment estimate with the gradient.
  3. Update biased second raw moment estimate with the square of the gradient.
  4. Compute bias-corrected first moment estimate.
  5. Compute bias-corrected second raw moment estimate.
  6. Update the parameters with the computed moment estimates.

This is represented by the following equations, where $g_t$ is the gradient at time step $t$, $m_t$ is the first moment (mean), $v_t$ is the second raw moment (uncentered variance), $\hat m _t$ and $\hat v _t$ are bias-corrected versions of $m_t$ and $v_t$, $\alpha$ is the step size (learning rate), and $\theta_t$ represents the parameters:

$$ m_t = \beta_1 * m_{t-1} + (1-\beta_1)*g_t $$

$$ v_t = \beta_2 * v_{t-1} + (1-\beta_2)*(g_t)^2 $$

$$ \hat m _t = \frac {m_t} {1-(\beta_1)^t} $$

$$ \hat v _t = \frac {v_t} {1-(\beta_2)^t} $$

$$ \theta _{t+1} = \theta_t - \frac {\alpha * \hat m _t } {\sqrt {\hat v _t } + \epsilon} $$

In the equations above, $\beta_1$ and $\beta_2$ are hyperparameters representing the decay rates for the moment estimates (common defaults are 0.9 and 0.999, respectively), and $\epsilon$ is a small constant (e.g., $10^{-8}$) for numerical stability.
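A minimal NumPy version of these update equations; the toy quadratic objective and the learning rate used below are just for demonstration.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters `theta`, following the equations above."""
    m = beta1 * m + (1 - beta1) * grad        # biased first moment estimate
    v = beta2 * v + (1 - beta2) * grad**2     # biased second raw moment estimate
    m_hat = m / (1 - beta1**t)                # bias correction (t starts at 1)
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
m = v = np.zeros_like(theta)
for t in range(1, 501):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, alpha=0.1)
print(theta)  # converges toward [0, 0]
```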

Implementation

SIREN (Sinusoidal Representation Networks)

SIREN is a type of neural network that utilizes periodic (sinusoidal) activation functions. This architecture excels at representing complex signals such as images, audio, and videos, and is particularly useful for applications like solving differential equations (e.g., Poisson, Helmholtz) and implicit neural representations for 3D shapes. SIREN stands out for its ability to model high-frequency details and gradients, which makes it superior to traditional architectures that use activations like ReLU or Tanh.

The primary characteristic of SIREN is its smooth, periodic activation function, whose derivatives are themselves sinusoids; this helps mitigate problems such as vanishing and exploding gradients that affect traditional activations. SIREN's activation function is defined as

$$ f(x) = \sin(w \cdot x) $$

where $w$ denotes the layer's learnable weights; in practice a SIREN layer computes $\sin(\omega_0 (W\vec{x} + b))$ with a fixed frequency scale $\omega_0$ (often 30). By utilizing the sine function, SIREN captures periodic variation and remains capable of approximating arbitrary functions. A minimal layer sketch follows below.
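A minimal PyTorch sketch of a single SIREN layer using the initialization scheme from the SIREN paper; the network sizes in the usage example are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """A single SIREN layer: sin(omega_0 * (W x + b))."""
    def __init__(self, in_features, out_features, omega_0=30.0, is_first=False):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_features, out_features)
        # SIREN's initialization keeps activations well distributed across layers.
        with torch.no_grad():
            if is_first:
                bound = 1.0 / in_features
            else:
                bound = (6.0 / in_features) ** 0.5 / omega_0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

# A small SIREN mapping 2-D coordinates to one output value.
siren = nn.Sequential(
    SineLayer(2, 64, is_first=True),
    SineLayer(64, 64),
    nn.Linear(64, 1),
)
print(siren(torch.rand(16, 2)).shape)  # torch.Size([16, 1])
```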

CoordNet

CoordNet is a coordinate-based framework designed for tasks like visualizing and processing time-varying volumetric data. It simplifies handling such data by transforming tasks into a unified coordinate-based representation, making it easier to apply deep learning techniques across diverse tasks without altering the network architecture. CoordNet offers high generalization capability, particularly for scientific visualization and large-scale datasets.

NeurComp

NeurComp is a framework that uses neural networks, particularly coordinate-based multi-layer perceptrons (MLPs), to compress large volumetric scalar field datasets. It leverages architectures like SIREN with sinusoidal activations to compress time-varying volumetric data efficiently. NeurComp achieves high compression ratios while maintaining data fidelity, and it is used in applications like scientific simulations and data visualization.

Ablation Study

An ablation study is an experimental method used in fields such as machine learning, deep learning, and computer vision to understand the contribution of different components within a model to its overall performance. By systematically removing (or "ablating") parts of the model, such as certain features, layers, or modules, and then retraining and evaluating the model, researchers can observe the impact on performance and understand how various parts influence the model as a whole.

This approach helps to:

  1. Identify which components contribute most to the model's performance.
  2. Justify design choices and the added complexity of each module.
  3. Simplify the model by removing parts that add little value.

In summary, an ablation study is a powerful tool for gaining a deeper understanding of the behaviors and properties of complex models, and for providing insights into further optimizations and improvements of models.

Application

Inverse Distance Weighting (IDW)

Inverse Distance Weighting (IDW) is a type of interpolation method used for estimating unknown values in spatial data based on the known values surrounding them. The core idea of IDW is that the influence of a known point decreases as the distance from the unknown point increases. This is achieved by assigning weights to nearby points, with closer points having more influence (larger weight) than those farther away. In mathematical terms, the IDW function can be expressed as:

$$ \hat Z (x) = \frac {\sum_{i=1}^N Z(x_i) \cdot w_i} {\sum_{i=1}^N w_i } $$
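As in Shepard's method above, the weights are typically $w_i = 1/d(x, x_i)^p$, where $d(x, x_i)$ is the distance to known point $x_i$. A small worked example in Python; the station locations, values, and $p = 2$ are chosen purely for illustration.

```python
import numpy as np

# Three known stations and one query location.
known_x = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
known_z = np.array([10.0, 20.0, 30.0])
query = np.array([0.5, 0.5])

d = np.linalg.norm(known_x - query, axis=1)  # distances to the query point
w = 1.0 / d**2                               # inverse-distance weights, p = 2
z_hat = np.sum(known_z * w) / np.sum(w)
print(z_hat)  # ~14.3, pulled toward the nearest station's value (10)
```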