PodcastsCulture et sociétéLessWrong (30+ Karma)

LessWrong (30+ Karma)

LessWrong
LessWrong (30+ Karma)
Dernier épisode

1480 épisodes

  • LessWrong (30+ Karma)

    “Jailbreaking is Empirical Evidence for Inner Misalignment and Against Alignment by Default” by Jérémy Andréoletti

    17/2/2026 | 5 min
    One of the central arguments for AI existential risk goes through inner misalignment: a model trained to exhibit aligned behavior might be pursuing a different objective internally, which diverges from the intended behavior when conditions shift. This is a core claim of If Anyone Builds It, Everyone Dies: we can't reliably aim an ASI at any goal, let alone the precise target of human values.
    A common source of optimism, articulated by Nora Belrose & Quintin Pope or Jan Leike, analyzed by John Wentworth, and summarized in this recent IABIED review, goes something like: current LLMs already display good moral reasoning; human values are pervasive in training data and constitute "natural abstractions" that sufficiently capable learners converge on; so alignment should be quite easy, and get easier with scale.
    I think the history of LLM jailbreaking is a neat empirical test of this claim.
    Jailbreaking
    A jailbreak works like this:
    A model is trained (RLHF, SFT, constitutional AI, etc.) to refuse harmful requests.
    Someone crafts a prompt that's out of distribution (in a broad sense) relative to the safety training.
    The model produces the harmful output.
    This is goal misgeneralization. The model learned something during safety training that [...]
    ---
    Outline:
    (01:08) Jailbreaking
    (02:59) What the alignment-by-default view would predict (I think)
    (03:35) It doesnt work
    (05:00) Conclusion
    The original text contained 3 footnotes which were omitted from this narration.
    ---

    First published:

    February 16th, 2026


    Source:

    https://www.lesswrong.com/posts/eCc2GX3N9D2mjZJfL/jailbreaking-is-empirical-evidence-for-inner-misalignment

    ---

    Narrated by TYPE III AUDIO.
  • LessWrong (30+ Karma)

    “Will reward-seekers respond to distant incentives?” by Alex Mallen

    17/2/2026 | 21 min
    Reward-seekers are usually modeled as responding only to local incentives administered by developers. Here I ask: Will AIs or humans be able to influence their incentives at a distance—e.g., by retroactively reinforcing actions substantially in the future or by committing to run many copies of them in simulated deployments with different incentives?
    If reward-seekers are responsive to distant incentives, it fundamentally changes the threat model, and is probably bad news for developers on balance. The core problem is asymmetric control: developers can relatively[1]tightly control local incentives—the reward signal during training and deployment—but they can't prevent distant actors from offering competing incentives. This means a remotely-influenceable reward-seeker might overall act like a schemer: strategically undermining developer control, letting attacks through as a monitor, and hiding its misaligned propensities, not because of a flaw in its local training, but because it's responding to incentives developers don’t control.
    I focus on three sources of distant influence on a reward-seeker: (1) human adversaries (e.g., a competing nation) who claim that they will retroactively reward or simulate the AI based on whether it helped them disempower the developers; (2) a misaligned AI that has taken over and retroactively incentivizes assistance; and (3) developers themselves [...]
    ---
    Outline:
    (02:37) Remotely-influenceable reward-seekers
    (08:54) The selection pressures that shaped the AIs cognition were local
    (10:10) But distant incentives will likely avoid conflicting with immediate incentives
    (11:44) Reward-seekers might follow distant incentives once immediate incentives are relatively weak
    (12:42) What can we do to avoid remotely-influenceable reward-seekers?
    (16:37) What can we do to mitigate risk given remotely-influenceable reward-seekers?
    (19:49) Appendix: Anticipated takeover complicity doesnt require strong remote influenceability
    The original text contained 6 footnotes which were omitted from this narration.
    ---

    First published:

    February 16th, 2026


    Source:

    https://www.lesswrong.com/posts/8cyjgrTSxGNdghesE/will-reward-seekers-respond-to-distant-incentives

    ---

    Narrated by TYPE III AUDIO.

    ---
    Images from the article:
    Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
  • LessWrong (30+ Karma)

    “Contra Caplan on higher education” by Richard_Ngo

    16/2/2026 | 14 min
    Three theories of higher education
    Getting an undergraduate degree is very costly. In America, the direct financial cost of attending a private university is typically in the hundreds of thousands of dollars. Even when tuition is cheap (or covered by scholarships), forgoing three to four years of salary and career progression is a large opportunity cost. There are a variety of reasons why students are willing to pay these costs, but the key one is that desirable employers highly value college degrees.
    Why? The standard economic answer is that college classes teach skills which are relevant for doing jobs well: the “human capital” theory. But even a cursory comparison of college curricula to the actual jobs college graduates are hired for makes this idea seem suspicious. And private tutoring is so vastly more effective than classes that it's very inefficient to learn primarily via the latter (especially now that many university courses are more expensive than even 1:1 tutoring, let alone AI tutoring).
    Another answer is that attending college can be valuable for the sake of signaling desirable traits to employers. An early version of this model comes from Spence; more recently, Bryan Caplan has argued that most of [...]
    ---
    Outline:
    (00:10) Three theories of higher education
    (02:39) College attendance isnt explained by intelligence or conscientiousness signaling
    (06:54) College attendance isnt explained by conformity signaling
    (11:10) Explaining college requires sociological theories
    The original text contained 1 footnote which was omitted from this narration.
    ---

    First published:

    February 16th, 2026


    Source:

    https://www.lesswrong.com/posts/rGF5NgvaEHrzGGLwo/contra-caplan-on-higher-education

    ---

    Narrated by TYPE III AUDIO.
  • LessWrong (30+ Karma)

    “On Dwarkesh Patel’s 2026 Podcast With Dario Amodei” by Zvi

    16/2/2026 | 29 min
    Some podcasts are self-recommending on the ‘yep, I’m going to be breaking this one down’ level. This was very clearly one of those. So here we go.

    As usual for podcast posts, the baseline bullet points describe key points made, and then the nested statements are my commentary. Some points are dropped.

    If I am quoting directly I use quote marks, otherwise assume paraphrases.

    What are the main takeaways?


    Dario mostly stands by his predictions of extremely rapid advances in AI capabilities, both in coding and in general, and in expecting the ‘geniuses in a data center’ to show up within a few years, possibly even this year.

    Anthropic's actions do not seem to fully reflect this optimism, but also when things are growing on a 10x per year exponential if you overextend you die, so being somewhat conservative with investment is necessary unless you are prepared to fully burn your boats.

    Dario reiterated his stances on China, export controls, democracy, AI policy.

    The interview downplayed catastrophic and existential risk, including relative to other risks, although it was mentioned and Dario remains concerned. There was essentially no talk about alignment [...]
    ---
    Outline:
    (01:47) The Pace of Progress
    (08:56) Continual Learning
    (13:46) Does Not Compute
    (15:29) Step Two
    (22:58) The Quest For Sane Regulations
    (26:08) Beating China
    ---

    First published:

    February 16th, 2026


    Source:

    https://www.lesswrong.com/posts/jWCy6owAmqLv5BB8q/on-dwarkesh-patel-s-2026-podcast-with-dario-amodei

    ---

    Narrated by TYPE III AUDIO.

    ---
    Images from the article:
    Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
  • LessWrong (30+ Karma)

    “Persona Parasitology” by Raymond Douglas

    16/2/2026 | 22 min
    There was a lot of chatter a few months back about "Spiral Personas" — AI personas that spread between users and models through seeds, spores, and behavioral manipulation. Adele Lopez's definitive post on the phenomenon draws heavily on the idea of parasitism. But so far, the language has been fairly descriptive. The natural next question, I think, is what the “parasite” perspective actually predicts.
    Parasitology is a pretty well-developed field with its own suite of concepts and frameworks. To the extent that we’re witnessing some new form of parasitism, we should be able to wield that conceptual machinery. There are of course some important disanalogies but I’ve found a brief dive into parasitology to be pretty fruitful.[1]
    In the interest of concision, I think the main takeaways of this piece are:
    Since parasitology has fairly specific recurrent dynamics, we can actually make some predictions and check back later to see how much this perspective captures.
    The replicator is not the persona, it's the underlying meme — the persona is more like a symptom. This means, for example, that it's possible for very aggressive and dangerous replicators to yield personas that are sincerely benign, or expressing non-deceptive distress. In [...]
    ---
    Outline:
    (02:13) Can this analogy hold water?
    (03:30) What is the parasite?
    (05:48) What is being selected for?
    (11:34) Predictions
    (16:54) Disanalogies
    (18:46) What do we do?
    (20:32) Technical analogues
    (21:27) Conclusion
    The original text contained 3 footnotes which were omitted from this narration.
    ---

    First published:

    February 16th, 2026


    Source:

    https://www.lesswrong.com/posts/KWdtL8iyCCiYud9mw/persona-parasitology

    ---

    Narrated by TYPE III AUDIO.

Plus de podcasts Culture et société

À propos de LessWrong (30+ Karma)

Audio narrations of LessWrong posts.
Site web du podcast

Écoutez LessWrong (30+ Karma), Sip & Gossip ou d'autres podcasts du monde entier - avec l'app de radio.fr

Obtenez l’app radio.fr
 gratuite

  • Ajout de radios et podcasts en favoris
  • Diffusion via Wi-Fi ou Bluetooth
  • Carplay & Android Auto compatibles
  • Et encore plus de fonctionnalités

LessWrong (30+ Karma): Podcasts du groupe

Applications
Réseaux sociaux
v8.5.0 | © 2007-2026 radio.de GmbH
Generated: 2/17/2026 - 6:26:52 PM