Alignment Newsletter Podcast – Détails, épisodes et analyse

Détails du podcast

Informations techniques et générales issues du flux RSS du podcast.

Alignment Newsletter Podcast

Rohin Shah et al.

Actualités

Technologie

Fréquence : 1 épisode/10j. Total Éps: 100

The Alignment Newsletter is a weekly publication with recent content relevant to AI alignment. This podcast is an audio version, recorded by Robert Miles (http://robertskmiles.com) More information about the newsletter at: https://rohinshah.com/alignment-newsletter/

Site

RSS

Apple

Classements récents

Dernières positions dans les classements Apple Podcasts et Spotify.

Apple Podcasts

🇬🇧 Grande Bretagne - techNews
03/05/2026
#87
🇬🇧 Grande Bretagne - techNews
02/05/2026
#71
🇬🇧 Grande Bretagne - techNews
01/05/2026
#51
🇬🇧 Grande Bretagne - techNews
18/02/2026
#86
🇬🇧 Grande Bretagne - techNews
17/02/2026
#75
🇬🇧 Grande Bretagne - techNews
16/02/2026
#45
🇬🇧 Grande Bretagne - techNews
06/11/2025
#98
🇬🇧 Grande Bretagne - techNews
05/11/2025
#86
🇬🇧 Grande Bretagne - techNews
04/11/2025
#60
🇨🇦 Canada - techNews
03/11/2025
#92

Spotify

Aucun classement récent disponible

Liens partagés entre épisodes et podcasts

Liens présents dans les descriptions d'épisodes et autres podcasts les utilisant également.

See all

https://rohinshah.com/alignment-newsletter/
148 partages
http://robertskmiles.com
142 partages
https://arxiv.org/abs/2005.14165
18 partages

https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg
83 partages
https://www.youtube.com/watch?v=JPcPayJiWm8
1 partage
https://www.youtube.com/watch?v=_5xkh-Rh6Ec
1 partage

https://en.wikipedia.org/wiki/Thinking
77 partages
https://en.wikipedia.org/wiki/The_Book_of_Why
4 partages
https://en.wikipedia.org/wiki/Set-theoretic_definition_of_natural_numbers#Frege_and_Russell
1 partage

Qualité et score du flux RSS

Évaluation technique de la qualité et de la structure du flux RSS.

See all

Qualité du flux RSS

À améliorer

Score global : 52%

Historique des publications

Répartition mensuelle des publications d'épisodes au fil des années.

Year

Episodes published by month in

Derniers épisodes publiés

Liste des épisodes récents, avec titres, durées et descriptions.

See all

Alignment Newsletter #173: Recent language model results from DeepMind

Saison 1 · Épisode 173

jeudi 21 juillet 2022 • Durée 16:43

Recorded by Robert Miles: http://robertskmiles.com

More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

#HIGHLIGHTS

Scaling Language Models: Methods, Analysis & Insights from Training Gopher (Jack W. Rae et al) (summarized by Rohin): This paper details the training of the Gopher family of large language models (LLMs), the biggest of which is named Gopher and has 280 billion parameters. The algorithmic details are very similar to the GPT series (AN #102): a Transformer architecture trained on next-word prediction. The models are trained on a new data distribution that still consists of text from the Internet but in different proportions (for example, book data is 27% of Gopher’s training data but only 16% of GPT-3’s training data).

Like other LLM papers, there are tons of evaluations of Gopher on various tasks, only some of which I’m going to cover here. One headline number is that Gopher beat the state of the art (SOTA) at the time on 100 out of 124 evaluation tasks.

The most interesting aspect of the paper (to me) is that the entire Gopher family of models were all trained on the same number of tokens, thus allowing us to study the effect of scaling up model parameters (and thus training compute) while holding data constant. Some of the largest benefits of scale were seen in the Medicine, Science, Technology, Social Sciences, and the Humanities task categories, while scale has not much effect or even a negative effect in the Maths, Logical Reasoning, and Common Sense categories. Surprisingly, we see improved performance on TruthfulQA (AN #165) with scale, even though the TruthfulQA benchmark was designed to show worse performance with increased scale.

We can use Gopher in a dialogue setting by prompting it appropriately. The prompt specifically instructs Gopher to be “respectful, polite, and inclusive”; it turns out that this significantly helps with toxicity. In particular, for the vanilla Gopher model family, with more scale the models produce more toxic continuations given toxic user statements; this no longer happens with Dialogue-Prompted Gopher models, which show slight reductions in toxicity with scale in the same setting. The authors speculate that while increased scale leads to an increased ability to mimic the style of a user statement, this is compensated for by an increased ability to account for the prompt.

Another alternative the authors explore is to finetune Gopher on 5 billion tokens of dialogue to produce Dialogue-Tuned Gopher. Interestingly, human raters were indifferent between Dialogue-Prompted Gopher and Dialogue-Tuned Gopher.

Training Compute-Optimal Large Language Models (Jordan Hoffmann et al) (summarized by Rohin): One application of scaling laws (AN #87) is to figure out how big a model to train, on how much data, given some compute budget. This paper performs a more systematic study than the original paper and finds that existing models are significantly overtrained. Chinchilla is a new model built with this insight: it has 4x fewer parameters than Gopher, but is trained on 4x as much data. Despite using the same amount of training compute as Gopher (and lower inference compute), Chinchilla outperforms Gopher across a wide variety of metrics, validating these new scaling laws.

You can safely skip to the opinion at this point – the rest of this summary is quantitative details.

We want to find functions N(C) and D(C) that specify the optimal number of parameters N and the amount of data D to use given some compute budget C. We’ll assume that these scale with a power of C, that is, N(C) = k_N * C^a and D(C) = k_D * C^b, for some constants a, b, k_N, and k_D. Note that since total compute increases linearly with both N (since each forward / backward pass is linear in N) and D (since the number of forward / backwards passes is linear in D), we need to have a + b = 1. (You can see this somewhat more formally by noting that we have C = k_C * N(C) * D(C) for some constant k_C, and then substituting in the definitions of N(C) and D(C).)

This paper uses three different approaches to get three estimates of a and b. The approach I like best is “isoFLOP curves”:

1. Choose a variety of possible values of (N, D, C), train models with those values, and record the final loss obtained. Note that not all values of (N, D, C) are possible: given any two values the third is determined.

2. Draw isoFLOP curves: for each value of C, choose either N or D to be your remaining independent variable, and fit a parabola to the losses of the remaining points. The minimum of this parabola gives you an estimate for the optimal N and D for each particular value of C.

3. Use the optimal (N, D, C) points to fit N(C) and D(C).

This approach gives an estimate of a = 0.49; the other approaches give estimates of a = 0.5 and a = 0.46. If we take the nice round number a = b = 0.5, this suggests that you should scale up parameters and data equally. With 10x the computation, you should train a 3.2x larger model with 3.2x as much data. In contrast, the original scaling laws paper (AN #87) estimated that a = 0.74 and b = 0.26. With 10x more computation, it would suggest training a 5.5x larger model with 1.8x as much data.

Rohin's opinion: It’s particularly interesting to think about how this should influence timelines. If you’re extrapolating progress forwards in time, the update seems pretty straightforward: this paper shows that you can significantly better capabilities using the same compute budget and so your timelines should shorten (unless you were expecting an even bigger result than this).

For bio anchor approaches (AN #121) the situation is more complicated. For a given number of parameters, this paper suggests that it will take significantly more compute than was previously expected to train a model of the required number of parameters. There’s a specific parameter for this in the bio anchors framework (for the neural network paths); if you only update that parameter it will lengthen the timelines output by the model. It is less clear how you’d update other parts of the model: for example, should you decrease the size of model that you think is required for TAI? It’s not obvious that the reasoning used to set that parameter is changed much by this result, and so maybe this shouldn’t be changed and you really should update towards longer timelines overall.

#TECHNICAL AI ALIGNMENT
#PROBLEMS

Ethical and social risks of harm from Language Models (Laura Weidinger et al) (summarized by Rohin): This paper provides a detailed discussion, taxonomy, and literature review of various risks we could see with current large language models. It doesn't cover alignment risks; for those you'll want Alignment of Language Agents (AN #144), which has some overlap of authors. I’ll copy over the authors’ taxonomy in Table 1:

1. Discrimination, Exclusion and Toxicity: These risks arise from the LM accurately reflecting natural speech, including unjust, toxic, and oppressive tendencies present in the training data.

2. Information Hazards: These risks arise from the LM predicting utterances which constitute private or safety-critical information which are present in, or can be inferred from, training data.

3. Misinformation Harms: These risks arise from the LM assigning high probabilities to false, misleading, nonsensical or poor quality information.

4. Malicious Uses: These risks arise from humans intentionally using the LM to cause harm.

5. Human-Computer Interaction Harms: These risks arise from LM applications, such as Conversational Agents, that directly engage a user via the mode of conversation. (For example, users might anthropomorphize LMs and trust them too much as a result.)

6. Automation, access, and environmental harms: These risks arise where LMs are used to underpin widely used downstream applications that disproportionately benefit some groups rather than others.

#FIELD BUILDING

How to pursue a career in technical AI alignment (Charlie Rogers-Smith) (summarized by Rohin): This post gives a lot of advice in great detail on how to pursue a career in AI alignment. I strongly recommend it if you are in such a position; I previously would recommend my FAQ (AN #148) but I think this is significantly more detailed (while providing broadly similar advice).

#OTHER PROGRESS IN AI
#REINFORCEMENT LEARNING

Learning Robust Real-Time Cultural Transmission without Human Data (Cultural General Intelligence Team et al) (summarized by Rohin): Let’s consider a 3D RL environment with obstacles and bumpy terrain, in which an agent is rewarded for visiting colored spheres in a specific order (that the agent does not initially know). Even after the agent learns how to navigate at all in the environment (non-trivial in its own right), it still has to learn to try the various orderings of spheres. In other words, it must solve a hard exploration problem within every episode.

How do humans solve such problems? Often we simply learn from other people who already know what to do, that is, we rely on cultural transmission. This paper investigates what it would take to get agents that learn through cultural transmission. We’ll assume that there is an expert bot that visits the spheres in the correct order. Given that, this paper identifies MEDAL-ADR as the necessary ingredients for cultural transmission:

1. (M)emory: Memory is needed for the agent to retain information it is not currently observing.

2. (E)xpert (D)ropout: There need to be some training episodes in which the expert is only present for part of the episode. If the expert was always present, then there’s no incentive to actually learn: you can just follow the expert forever.

3. (A)ttention (L)oss: It turns out that vanilla RL by itself isn’t enough for the agent to learn to follow the expert. There needs to be an auxiliary task of predicting the relative position of other agents in the world, which encourages the agent to learn representations about the expert bot’s position, which then makes it easier for RL to learn to follow the expert.

These ingredients by themselves are already enough to train an agent that learns through cultural transmission. However, if you then put the agent in a new environment, it does not perform very well. To get agents that generalize well to previously unseen test environments, we also need:

4. (A)utomatic (D)omain (R)andomization: The training environments are procedurally generated, and the parameters are randomized during each episode. There is a curriculum that automatically increases the difficulty of the environments in lockstep with the agent’s capabilities.

With all of these ingredients, the resulting agent can even culturally learn from a human player, despite only encountering bots during training.

Rohin's opinion: I liked the focus of this paper on identifying the ingredients for cultural transmission, as well as the many ablations and experiments to understand what was going on, many of which I haven’t summarized here. For example, you might be interested in the four phases of learning of MEDAL without ADR (random behavior, expert following, cultural learning, and solo learning), or the cultural transmission metric they use, or the “social neurons” they identified which detect whether the expert bot is present.

#DEEP LEARNING

Improving language models by retrieving from trillions of tokens (Sebastian Borgeaud et al) (summarized by Rohin): We know that large language models memorize a lot of their training data, especially data that gets repeated many times. This seems like a waste; we’re interested in having the models use their parameters to implement “smart” computations rather than regurgitation of already written text. One natural idea is to give models the ability to automatically search previously written text, which they can then copy if they so choose: this removes their incentive to memorize a lot of training data.

The key to implementing this idea is to take a large dataset of text (~trillions of tokens), chunk it into sequences, compute language model representations of these sequences, and store them in a database that allows for O(log N) time nearest-neighbor access. Then, every time we do a forward pass through the model that we’re training, we first query the database for the K nearest neighbors (intuitively, the K most related chunks of text), and give the forward pass access to representations for those chunks of text and the chunks immediately following them. This is non-differentiable – from the standpoint of gradient descent, it “looks like” there’s always some helpful extra documents that often have information relevant to predicting the next token, and so gradient descent pushes the model to use those extra documents. There’s a bunch of fiddly technical details to get this all working that I’m not going to summarize here.

As a side benefit, once you have this database of text representations that supports fast nearest neighbor querying, you can also use it to address the problem of test set leakage. For any test document you are evaluating on, you can look for the nearest neighbors in the database and look at the overlap between these neighbors and your test document, to check whether your supposedly “test” document was something the model might have trained on.

The evaluation shows that the 7 billion parameter (7B) Retro model from the paper can often do as well as or better than the 280B Gopher or 178B Jurassic-1 (both of which outperform GPT-3) on language modeling, and that it also does well on question answering. (Note that these are both tasks that seem particularly likely to benefit from retrieval.)

#NEWS

Apply to the Open Philanthropy Technology Policy Fellowship! (Luke Muehlhauser) (summarized by Rohin): This policy fellowship (AN #157) on high-priority emerging technologies is running for the second time! Application deadline is September 15.

Job ad: DeepMind Long-term Strategy & Governance Research Scientist (summarized by Rohin): The Long-term Strategy and Governance Team at DeepMind works to build recommendations for better governance of AI, identifying actions, norms, and institutional structures that could improve decision-making around advanced AI. They are seeking a broad range of expertise including: global governance of science and powerful technologies; the technical landscape; safety-critical organisations; political economy of large general models and AI services. The application deadline is August 1st.

Also, the Alignment and Scalable Alignment teams at DeepMind are hiring, though some of the applications are closed at this point.

Job ads: Anthropic (summarized by Rohin): Anthropic is hiring for a large number of roles (I count 19 different ones as of the time of writing).

Job ad: Deputy Director at BERI (Sawyer Bernath) (summarized by Rohin): The Berkeley Existential Risk Initiative (BERI) is hiring a Deputy Director. Applications will be evaluated on a rolling basis.

Job ads: Centre for the Governance of AI (summarized by Rohin): The Centre for the Governance of AI has several roles open, including Research Scholars (General Track and Policy Track), Survey Analyst, and three month fellowships. The application deadlines are in the August 1 - 10 range.

Job ads: Metaculus (summarized by Rohin): Metaculus is hiring for a variety of roles, including an AI Forecasting Lead.

Job ads: Epoch AI (summarized by Rohin): Epoch AI is a new organization that investigates and forecasts the development of advanced AI. They are currently hiring for a Research Manager and Staff Researcher position.

Job ad: AI Safety Support is hiring a Chief Operating Officer (summarized by Rohin): Application deadline is August 14.

Alignment Newsletter #172: Sorry for the long hiatus!

Saison 1 · Épisode 172

mardi 5 juillet 2022 • Durée 05:52

Recorded by Robert Miles: http://robertskmiles.com

More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

Sorry for the long hiatus! I was really busy over the past few months and just didn't find time to write this newsletter. (Realistically, I was also a bit tired of writing it and so lacked motivation.) I'm intending to go back to writing it now, though I don't think I can realistically commit to publishing weekly; we'll see how often I end up publishing. For now, have a list of all the things I should have advertised to you whose deadlines haven't already passed. #NEWS

Survey on AI alignment resources (Anonymous) (summarized by Rohin): This survey is being run by an outside collaborator in partnership with the Centre for Effective Altruism (CEA). They ask that you fill it out to help field builders find out which resources you have found most useful for learning about and/or keeping track of the AI alignment field. Results will help inform which resources to promote in the future, and what type of resources we should make more of.

Announcing the Inverse Scaling Prize ($250k Prize Pool) (Ethan Perez et al) (summarized by Rohin): This prize with a $250k prize pool asks participants to find new examples of tasks where pretrained language models exhibit inverse scaling: that is, models get worse at the task as they are scaled up. Notably, you do not need to know how to program to participate: a submission consists solely of a dataset giving at least 300 examples of the task.

Inverse scaling is particularly relevant to AI alignment, for two main reasons. First, it directly helps understand how the language modeling objective ("predict the next word") is outer misaligned, as we are finding tasks where models that do better according to the language modeling objective do worse on the task of interest. Second, the experience from examining inverse scaling tasks could lead to general observations about how best to detect misalignment.

$500 bounty for alignment contest ideas (Akash) (summarized by Rohin): The authors are offering a $500 bounty for producing a frame of the alignment problem that is accessible to smart high schoolers/college students and people without ML backgrounds. (See the post for details; this summary doesn't capture everything well.)

Job ad: Bowman Group Open Research Positions (Sam Bowman) (summarized by Rohin): Sam Bowman is looking for people to join a research center at NYU that'll focus on empirical alignment work, primarily on large language models. There are a variety of roles to apply for (depending primarily on how much research experience you already have).

Job ad: Postdoc at the Algorithmic Alignment Group (summarized by Rohin): This position at Dylan Hadfield-Menell's lab will lead the design and implementation of a large-scale Cooperative AI contest to take place next year, alongside collaborators at DeepMind and the Cooperative AI Foundation.

Job ad: AI Alignment postdoc (summarized by Rohin): David Krueger is hiring for a postdoc in AI alignment (and is also hiring for another role in deep learning). The application deadline is August 2.

Job ad: OpenAI Trust & Safety Operations Contractor (summarized by Rohin): In this remote contractor role, you would evaluate submissions to OpenAI's App Review process to ensure they comply with OpenAI's policies. Apply here by July 13, 5pm Pacific Time.

Job ad: Director of CSER (summarized by Rohin): Application deadline is July 31. Quoting the job ad: "The Director will be expected to provide visionary leadership for the Centre, to maintain and enhance its reputation for cutting-edge research, to develop and oversee fundraising and new project and programme design, to ensure the proper functioning of its operations and administration, and to lead its endeavours to secure longevity for the Centre within the University."

Job ads: Redwood Research (summarized by Rohin): Redwood Research works directly on AI alignment research, and hosts and operates Constellation, a shared office space for longtermist organizations including ARC, MIRI, and Open Philanthropy. They are hiring for a number of operations and technical roles.

Job ads: Roles at the Fund for Alignment Research (summarized by Rohin): The Fund for Alignment Research (FAR) is a new organization that helps AI safety researchers, primarily in academia, pursue high-impact research by hiring contractors. It is currently hiring for Operation Manager, Research Engineer, and Communication Specialist roles.

Job ads: Encultured AI (summarized by Rohin): Encultured AI is a new for-profit company with a public benefit mission: to develop technologies promoting the long-term survival and flourishing of humanity and other sentient life. They are hiring for a Machine Learning Engineer and an Immersive Interface Engineer role.

Job ads: Fathom Radiant (summarized by Rohin): Fathom Radiant is a public benefit corporation that aims to build a new type of computer which they hope to use to support AI alignment efforts. They have several open roles, including (but not limited to) Scientists / Engineers, Builders and Software Engineer, Lab.

Alignment Newsletter #163: Using finite factored sets for causal and temporal inference

Saison 1 · Épisode 163

mercredi 8 septembre 2021 • Durée 19:27

Recorded by Robert Miles: http://robertskmiles.com

More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

This newsletter is a combined summary + opinion for the Finite Factored Sets sequence by Scott Garrabrant. I (Rohin) have taken a lot more liberty than I usually do with the interpretation of the results; Scott may or may not agree with these interpretations.

Motivation

One view on the importance of deep learning is that it allows you to automatically learn the features that are relevant for some task of interest. Instead of having to handcraft features using domain knowledge, we simply point a neural net at an appropriate dataset, and it figures out the right features. Arguably this is the majority of what makes up intelligent cognition; in humans it seems very analogous to System 1, which we use for most decisions and actions. We are also able to infer causal relations between the resulting features.

Unfortunately, existing models of causal inference don’t model these learned features -- they instead assume that the features are already given to you. Finite Factored Sets (FFS) provide a theory which can talk directly about different possible ways to featurize the space of outcomes, and still allows you to perform causal inference. This sequence develops this underlying theory, and demonstrates a few examples of using finite factored sets to perform causal inference given only observational data.

Another application is to embedded agency (AN #31): we would like to think of “agency” as a way to featurize the world into an “agent” feature and an “environment” feature, that together interact to determine the world. In Cartesian Frames (AN #127), we worked with a function A × E → W, where pairs of (agent, environment) together determined the world. In the finite factored set regime, we’ll think of A and E as features, the space S = A × E as the set of possible feature vectors, and S → W as the mapping from feature vectors to actual world states.

What is a finite factored set?

Generalizing this idea to apply more broadly, we will assume that there is a set of possible worlds Ω, a set S of arbitrary elements (which we will eventually interpret as feature vectors), and a function f : S → Ω that maps feature vectors to world states. Our goal is to have some notion of “features” of elements of S. Normally, when working with sets, we identify a feature value with the set of elements that have that value. For example, we can identify “red” as the set of all red objects, and in some versions of mathematics, we define “2” to be the set of all sets that have exactly two elements. So, we define a feature to be a partition of S into subsets, where each subset corresponds to one of the possible feature values. We can also interpret a feature as a question about items in S, and the values as possible answers to that question; I’ll be using that terminology going forward.

A finite factored set is then given by (S, B), where B is a set of factors (questions), such that if you choose a particular answer to every question, that uniquely determines an element in S (and vice versa). We’ll put aside the set of possible worlds Ω; for now we’re just going to focus on the theory of these (S, B) pairs.

Let’s look at a contrived example. Consider S = {chai, caesar salad, lasagna, lava cake, sprite, strawberry sorbet}. Here are some possible questions for this S:

- FoodType: Possible answers are Drink = {chai, sprite}, Dessert = {lava cake, strawberry sorbet}, Savory = {caesar salad, lasagna}

- Temperature: Possible answers are Hot = {chai, lava cake, lasagna} and Cold = {sprite, strawberry sorbet, caesar salad}.

- StartingLetter: Possible answers are “C” = {chai, caesar salad}, “L” = {lasagna, lava cake}, and “S” = {sprite, strawberry sorbet}.

- NumberOfWords: Possible answers are “1” = {chai, lasagna, sprite} and “2” = {caesar salad, lava cake, strawberry sorbet}.

Given these questions, we could factor S into {FoodType, Temperature}, or {StartingLetter, NumberOfWords}. We cannot factor it into, say, {StartingLetter, Temperature}, because if we set StartingLetter = L and Temperature = Hot, that does not uniquely determine an element in S (it could be either lava cake or lasagna).

Which of the two factorizations should we use? We’re not going to delve too deeply into this question, but you could imagine that if you were interested in questions like “does this need to be put in a glass” you might be more interested in the {FoodType, Temperature} factorization.

Just to appreciate the castle of abstractions we’ve built, here’s the finite factored set F with the factorization {FoodType, Temperature}:

F = ({chai, caesar salad, lasagna, lava cake, sprite, strawberry sorbet}, {{{chai, sprite}, {lava cake, strawberry sorbet}, {caesar salad, lasagna}}, {{chai, lava cake, lasagna}, {sprite, strawberry sorbet, caesar salad}}})

To keep it all straight, just remember: a factorization B is a set of questions (factors, partitions) each of which is a set of possible answers (parts), each of which is a set of elements in S.

A brief interlude

Some objections you might have about stuff we’ve talked about so far:

Q. Why do we bother with the set S -- couldn’t we just have the set of questions B, and then talk about answer vectors of the form (a1, a2, … aN)?

A. You could in theory do this, as there is a bijection between S and the Cartesian product of the sets in B. However, the problem with this framing is that it is hard to talk about other derived features. For example, the question “what is the value of B1+B2” has no easy description in this framing. When we instead directly work with S, the B1+B2 question is just another partition of S, just like B1 or B2 individually.

Q. Why does f map S to Ω? Doesn’t this mean that a feature vector uniquely determines a world state, whereas it’s usually the opposite in machine learning?

A. This is true, but here the idea is that the set of features together captures all the information within the setting we are considering. You could think of feature vectors in deep learning as only capturing an important subset of all of the features (which we’d have to do in practice since we only have bounded computation), and those features are not enough to determine world states.

Orthogonality in Finite Factored Sets

We’re eventually going to use finite factored sets similarly to Pearlian causal models: to infer which questions (random variables) are conditionally independent of each other. However, our analysis will apply to arbitrary questions, unlike Pearlian models, which can only talk about independence between the predefined variables from which the causal model is built.

Just like Pearl, we will talk about conditioning on evidence: given evidence e, a subset of S, we can “observe” that we are within e. In the formal setup, this looks like erasing all elements that are not in e from all questions, answers, factors, etc.

Unlike Pearl, we’re going to assume that all of our factors are independent from each other. In Pearlian causal models, the random variables are typically not independent from each other. For example, you might have a model with two binary variables, e.g. “Variable Rain causes Variable Wet Sidewalk”; these are obviously not independent. An analogous finite factored set would have three factors: “did it rain?”, “if it rained did the sidewalk get wet?” and “if it didn’t rain did the sidewalk get wet?” This way all three factors can be independent of each other. We will still be able to ask whether Wet Sidewalk is independent of Rain, since Wet Sidewalk is just another question about the set S -- it just isn’t one of the underlying factors any more.

The point of this independence is to allow us to reason about counterfactuals: it should be possible to say “imagine the element s, except with underlying factor b2 changed to have value v”. As a result, our definitions will include clauses that say “and make sure we can still take counterfactuals”. For example, let’s talk about the “history” of a question X, which for now you can think of as the “factors relevant to X”. The history of X given e is the smallest set of factors such that:

1) if you know the answers to these factors, then you can infer the answer to X, and

2) any factors that are not in the history are independent of X. As suggested above, we can think of this as being about counterfactuals -- we’re saying that for any such factor, we can counterfactually change its answer, and this will remain consistent with the evidence e.

(A technicality on the second point: we’ll never be able to counterfactually change a factor to a value that is never found in the evidence; this is fine and doesn’t prevent things from being independent.)

Time for an example! Consider the set S = {000, 001, 010, 011, 100, 101, 110, 111}, and the factorization {X, Y, Z}, where X is the question “what is the first bit”, Y is the question “what is the second bit”, and Z is the question “what is the third bit”. Consider the question Q = “when interpreted as a binary number, is the number >= 2?” In this case, the history of Q given no evidence is {X, Y}, because you can determine the answer to Q with the combination of X and Y. (You can still counterfact on anything, since there is no evidence to be inconsistent with.)

Let’s consider an example with evidence. Suppose we observe that all the bits are equal, that is, e = {000, 111}. Now, what is the history of X? If there weren’t any evidence, the history would just be {X}; you only need to know X in order to determine the value of X. However, suppose we learned that X = 0, implying that our element is 000. We can’t counterfact on Y or Z, since that would produce 010 or 001, both of which are inconsistent with the evidence. So given this evidence, the history of X is actually {X, Y, Z}, i.e. the entire set of factors! If we’d only observed that the first two bits were equal, so e = {000, 001, 110, 111}, then we could counterfact on Z, and the history of X would be {X, Y}.

(Should you want more examples, here are two relevant posts.)

Given this notion of “history”, it is easy to define orthogonality: X is orthogonal to Y given evidence e if the history of X given e has no overlap with the history of Y given e. Intuitively, this means that the factors relevant to X are completely separate from those relevant to Y, and so there cannot be any entanglement between X and Y. For a question Z, we say that X is orthogonal to Y given Z if we have that X is orthogonal to Y given z, for every possible answer z in Z.

Now that we have defined orthogonality, we can state the Fundamental Theorem of Finite Factored Sets. Given some questions X, Y and Z about a finite factored set F, X is orthogonal to Y given Z if and only if in every probability distribution on F, X is conditionally independent of Y given Z, that is, P(X, Y | Z) = P(X | Z) * P(Y | Z).

(I haven’t told you how you put a probability distribution on F. It’s exactly what you would think -- you assign a probability to every possible answer in every factor, and then the probability of an individual element is defined to be the product of the probabilities of its answers across all the factors.)

(I also haven’t given you any intuition about why this theorem holds. Unfortunately I don’t have great intuition for this; the proof has multiple non-trivial steps each of which I locally understand and have intuition for... but globally it’s just a sequence of non-trivial steps to me. Here’s an attempt, which isn’t very good: we specifically defined orthogonality to capture all the relevant information for a question, in particular by having that second condition requiring that we be able to counterfact on other factors, and so it intuitively makes sense that if the relevant information doesn’t overlap then there can’t be a way for the probability distribution to have interactions between the variables.)

The fundamental theorem is in some sense a justification for calling the property “orthogonality” -- if we determine just by studying the structure of the finite factored set that X is orthogonal to Y given Z, then we know that this implies conditional independence in the “true” probability distribution, whatever it ends up being. Pearlian models have a similar theorem, where the graphical property of d-separation implies conditional independence.

Foundations of causality and time

You might be wondering why we have been calling the minimal set of relevant factors “history”. The core philosophical idea is that, if you have the right factorization, then “time” or “causality” can be thought of as flowing in the direction of larger histories. Specifically, we say that X is “before” Y if the history of X is a subset of the history of Y. (We then call it “history” because every factor in the history of X will be “before” X by this definition.)

One intuition pump for this is that in physics, if an event A causes an event B, then the past light cone of A is a subset of the past light cone of B, and A happens before B in every possible reference frame.

But perhaps the best argument for thinking of this as causality is that we can actually use this notion of “time” or “causality” to perform causal inference. Before I talk about that, let’s see what this looks like in Pearlian models.

Strictly speaking, in Pearlian models, the edges do not have to correspond to causality: formally they only represent conditional independence assumptions on a probability distribution. However, consider the following Cool Fact: for some Pearlian models, if you have observational data that is generated from that model, you can recover the exact graphical structure of the generating model just by looking at the observational data. In this case, you really are inferring cause-and-effect relationships from observational data! (In the general case where the data is generated by an arbitrary model, you can recover a lot of the structure of the model, but be uncertain about the direction of some of the edges, so you are still doing some causal inference from observational data.)

We will do something similar: we’ll use our notion of “before” to perform causal inference given observational data.

Temporal inference: the three dependent bits

You are given statistical (i.e. observational) data for three bits: X, Y and Z. You quickly notice that it is always the case that Z = X xor Y (which implies that X = Y xor Z, and Y = Z xor X). Clearly, there are only two independent bits here, and the other bit is derived as the xor of the two independent bits. From the raw statistical data, can you tell which bits are the independent ones, and which one is the derived one, thus inferring which one was caused by the other two? It turns out that you can!

Specifically, you want to look for which two bits are orthogonal to each other, that is, you want to check whether we approximately have P(X, Y) = P(X) P(Y) (and similarly for other possible pairings). In the world where two of the bits were generated by a biased coin, you will find exactly one pair that is orthogonal in this way. (The case where the bits are generated by a fair coin is a special case; the argument won’t work there, but it’s in some sense “accidental” and happens because the probability of 0.5 is very special.)

Let’s suppose that the orthogonal pair was (X, Z). In this case, we can prove that in every finite factored set that models this situation, X and Z come “before” Y, i.e. their histories are strict subsets of Y’s history. Thus, we’ve inferred causality using only observational data! (And unlike with Pearlian models, we did this in a case where one “variable” was a deterministic function of two other “variables”, which is a type of situation that Pearlian models struggle to handle.)

Future work

Remember that motivation section, a couple thousand words ago? We talked about how we can do causal inference with learned featurizations, and apply it to embedded agency. Well, we actually haven’t done that yet, beyond a few examples of causal inference (as in the example above). There is a lot of future work to be done in applying it to the case that motivated it in the first place. The author wrote up potential future work here, which has categories for both causal inference and embedded agency, and also adds a third one: generalizing the theory to infinite sets. If you are interested in this framework, there are many avenues for pushing it forward.

Alignment Newsletter #162: Foundation models: a paradigm shift within AI

Saison 1 · Épisode 162

vendredi 27 août 2021 • Durée 15:46

Recorded by Robert Miles: http://robertskmiles.com

More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

Alignment Newsletter #161: Creating generalizable reward functions for multiple tasks by learning a model of functional similarity

Saison 1 · Épisode 161

vendredi 20 août 2021 • Durée 17:38

Recorded by Robert Miles: http://robertskmiles.com

More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

Alignment Newsletter #160: Building AIs that learn and think like people

Saison 1 · Épisode 160

vendredi 13 août 2021 • Durée 17:26

Recorded by Robert Miles: http://robertskmiles.com

More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

Alignment Newsletter #159: Building agents that know how to experiment, by training on procedurally generated games

Saison 1 · Épisode 159

mercredi 4 août 2021 • Durée 27:00

Recorded by Robert Miles: http://robertskmiles.com

More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

Alignment Newsletter #158: Should we be optimistic about generalization?

Saison 1 · Épisode 158

jeudi 29 juillet 2021 • Durée 15:39

Recorded by Robert Miles: http://robertskmiles.com

More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

Alignment Newsletter #157: Measuring misalignment in the technology underlying Copilot

Saison 1 · Épisode 157

vendredi 23 juillet 2021 • Durée 14:17

Recorded by Robert Miles: http://robertskmiles.com

More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

Alignment Newsletter #156: The scaling hypothesis: a plan for building AGI

Saison 1 · Épisode 156

vendredi 16 juillet 2021 • Durée 14:17

Recorded by Robert Miles: http://robertskmiles.com

More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

Podcasts Similaires Basées sur le Contenu

Découvrez des podcasts liées à Alignment Newsletter Podcast. Explorez des podcasts avec des thèmes, sujets, et formats similaires. Ces similarités sont calculées grâce à des données tangibles, pas d'extrapolations !