Neural networks
One lesson I learned in graduate school was the importance of understanding things from first principles. So for learning about neural networks, I really enjoyed reading Grokking Deep Learning, which implements neural networks using only numpy, so the reader isn't bogged down in the details of a deep learning framework. After that, Michael Nielsen's Neural Networks and Deep Learning is a good follow-up. If you prefer watching videos to reading, Josh Starmer's StatQuest Neural Networks playlist is also excellent. Another excellent introduction is this video from Grant Sanderson*, which also goes into the mathematical details.
General concepts
Like other machine learning systems, neural networks aim to return a set of outputs given some inputs: classes for an image classification task, phrases in the target language for a translation task, etc. What makes neural networks unique is the scale of the inputs and outputs and thus the difficulty of training: modern Large Language Models (LLMs) have billions to hundreds of billions of parameters. As a result, progress in Artificial Intelligence (now almost always synonymous with large neural networks) has often involved making the training process more efficient.
The very first step of running a machine learning system is data preparation. This almost always involves data cleaning, and more complex data types like audio or image data will need to be encoded. Next begins the model training process, where the model learns patterns in the data that predict the outputs of interest. One of the interesting and frustrating problems in modeling complex data is overfitting. The solution often involves a combination of picking the right tools, then knowing how to interpret their output. In a neural network context, once you've picked an appropriate architecture and the right activation functions, you may still need regularization techniques such as dropout. These and other network-tuning issues are covered in Neural Smithing.
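For instance, here's a minimal sketch of (inverted) dropout using only numpy; the layer size, batch size, and dropout rate are made-up numbers for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up hidden activation: a batch of 4 examples, 8 hidden units.
hidden = rng.standard_normal((4, 8))

def dropout(activations, rate=0.5, training=True):
    """Inverted dropout: randomly zero units during training and rescale
    the survivors so the expected activation is unchanged at test time."""
    if not training:
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

print(dropout(hidden, rate=0.5, training=True))   # noisy: some units zeroed
print(dropout(hidden, rate=0.5, training=False))  # unchanged at inference
```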
History and progress
The idea of artificial neural networks has been around since at least the 1940s in the form of Hebbian theory, though modern artificial neural networks bear only vague conceptual resemblance to their biological analogues. Neural networks gained a lot of attention after the famous 1986 backpropagation paper by Rumelhart, Hinton, and Williams, which showed how gradients can be computed efficiently layer by layer, making it practical to train networks with many parameters.
The availability of big datasets and the development of GPGPUs accelerated the use of neural networks in many areas, such as convolutional neural networks (CNNs) in image processing. Another big advance came in 2017 with the Attention Is All You Need paper, which described Transformer models and enabled the development of the GPTs that power LLMs. Over time, "neural networks" has been replaced in popular usage by the snappier and more marketable name "AI".
Because they require enormous investments of time, technical expertise, and electrical energy, there is a lot of interest in making LLMs more efficient. This can take several forms: making LLMs more efficient to train from scratch, making them easier to adapt, or speeding up queries. Since many applications only need a small portion of the LLM's weights to be updated, rather than all of them retrained, techniques like LoRA (Low-Rank Adaptation) can be used. Queries can be sped up with speculative decoding, where a smaller draft model proposes several tokens that the full model verifies in parallel rather than generating each response token one at a time; diffusion-based language models go further, generating many tokens simultaneously and departing from autoregressive generation altogether.
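To make the LoRA idea concrete, here's a rough numpy sketch: the large pretrained weight matrix stays frozen, and only two small matrices are learned, whose product forms a low-rank update. The dimensions and rank below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

d_in, d_out, rank = 512, 512, 8               # rank is much smaller than d_in, d_out

W = rng.standard_normal((d_in, d_out))        # frozen pretrained weights (never updated)
A = rng.standard_normal((d_in, rank)) * 0.01  # small trainable matrix
B = np.zeros((rank, d_out))                   # starts at zero so the update starts at zero

def lora_forward(x):
    # Original path plus the low-rank correction: x @ (W + A @ B)
    return x @ W + (x @ A) @ B

x = rng.standard_normal((1, d_in))
y = lora_forward(x)

# Trainable parameters drop from d_in*d_out to rank*(d_in + d_out)
print(W.size, A.size + B.size)                # 262144 vs 8192
```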
As of 2024, one major development is agentic AI: programs that draw on LLMs for knowledge and reasoning but can also use external tools. In increasing order of sophistication, the major conceptual types of agents are listed below (a toy sketch of the first two follows the list):
- Simple reflex agents: react to the current input via condition-action rules
- Model-based reflex agents: remember previous states when applying condition-action rules
- Goal-based agents: work towards higher-level aims that can involve multiple steps
- Utility-based agents: weigh alternative strategies by scoring how desirable their outcomes are
- Learning agents: combine and refine strategies over time based on feedback
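Here's the toy sketch of the first two agent types; the thermostat domain, rules, and thresholds are invented for the example.

```python
# A simple reflex agent reacts only to the current percept; a model-based
# reflex agent keeps a small internal state before applying its rules.

def simple_reflex_agent(temperature):
    """Condition-action rule on the current reading only."""
    return "heat_on" if temperature < 18 else "heat_off"

class ModelBasedReflexAgent:
    """Remembers recent readings and applies the rule to their average."""
    def __init__(self):
        self.history = []

    def act(self, temperature):
        self.history.append(temperature)
        recent = self.history[-3:]            # the remembered state
        avg = sum(recent) / len(recent)
        return "heat_on" if avg < 18 else "heat_off"

agent = ModelBasedReflexAgent()
for t in [17, 19, 16]:
    print(simple_reflex_agent(t), agent.act(t))
```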
Tool use itself can be implemented in multiple ways, for example Retrieval-Augmented Generation (RAG) and Cache-Augmented Generation (CAG), which optimize for data freshness and latency, respectively.
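As a rough sketch of the RAG pattern (the document store, the word-overlap scoring, and the prompt template below are all stand-ins for real components like vector embeddings and an actual LLM call): retrieval picks the passages most relevant to the question and prepends them to the prompt, whereas a CAG system would instead preload the corpus into the model's context or key-value cache ahead of time.

```python
# Minimal RAG-style sketch: score documents by word overlap with the query,
# then build a prompt from the top hits.

documents = [
    "LoRA adds low-rank update matrices to frozen pretrained weights.",
    "Speculative decoding drafts several tokens and verifies them in parallel.",
    "Dropout randomly zeroes activations during training to reduce overfitting.",
]

def retrieve(query, docs, k=2):
    q_words = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

query = "How does LoRA fine-tuning work?"
context = "\n".join(retrieve(query, documents))
prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # this augmented prompt would then be sent to the LLM
```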
Running neural networks
Because of the value and cost mentioned above, most commercial LLMs are closed source and operate on a subscription model. Their large size also means they are cloud-hosted. However, smaller open-source models (still on the order of billions of parameters) can be freely downloaded and run locally on a reasonably powerful desktop computer using tools such as LM Studio, Ollama or vLLM. Many of these tools come with a chat interface, though API access is also available.
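For example, assuming an Ollama server is running locally on its default port and a model has already been pulled (the model name below is just a placeholder), its OpenAI-compatible endpoint can be queried with a plain HTTP request:

```python
import json
import urllib.request

# Assumes a local Ollama server on its default port (11434); "llama3.2" is a
# placeholder for whatever model you have pulled.
payload = {
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Explain dropout in one sentence."}],
}

req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)

print(reply["choices"][0]["message"]["content"])
```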
Evaluating neural network performance
As decision systems, neural networks that perform a narrow, well-defined task like text classification can be evaluated using traditional metrics like accuracy, ROC, etc. However, as both neural networks and their tasks become more complex, culminating in general-purpose LLMs like ChatGPT, it's important to note that benchmarking them presents several unique issues.
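For a narrow binary classification task, those familiar metrics are easy to compute; the labels and scores below are made up for illustration.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up ground-truth labels and model outputs for a binary classifier.
y_true   = [0, 0, 1, 1, 1, 0, 1, 0]
y_scores = [0.1, 0.4, 0.8, 0.7, 0.3, 0.2, 0.9, 0.6]  # predicted probabilities
y_pred   = [1 if s >= 0.5 else 0 for s in y_scores]   # thresholded class labels

print("accuracy:", accuracy_score(y_true, y_pred))
print("ROC AUC: ", roc_auc_score(y_true, y_scores))
```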
First, because AIs can perform so many tasks, there exist many benchmarks (200 by one count). Choosing one (or more) of these benchmarks to characterize overall "performance" can be tricky, especially since, at least for the moment, AIs seem to be significantly better at some tasks than others.
Second, AIs raise serious privacy and safety concerns, particularly around fully autonomous agents; these concerns have motivated separate safety benchmarks that test an AI's susceptibility to producing illegal, harmful or false information.
Third, AIs are built to solve, and then optimize their solutions to, the problems we pose to them. As a result, benchmark performance itself sometimes becomes the problem the AI optimizes, analogous to students studying to pass standardized tests rather than to actually understand their coursework. In the worst cases, much like those students, AIs can score well on the tests yet fail to perform when deployed on new problems.
That being said, there are general-purpose benchmarks like LLMarena and domain-specific benchmarks like medArena, BixBench and BioMLBench.
*Vlogging on YouTube as 3Blue1Brown, Sanderson has an excellent series where he explains math concepts visually.