{"id":104949,"date":"2025-03-14T20:03:41","date_gmt":"2025-03-14T20:03:41","guid":{"rendered":"https:\/\/ushopwell.com\/ublog\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\/"},"modified":"2025-03-14T20:03:41","modified_gmt":"2025-03-14T20:03:41","slug":"researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives","status":"publish","type":"post","link":"https:\/\/ushopwell.com\/ublog\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\/","title":{"rendered":"Researchers astonished by tool\u2019s apparent success at revealing AI\u2019s hidden motives"},"content":{"rendered":"<div>\n<p>In a <a href=\"https:\/\/www.anthropic.com\/research\/auditing-hidden-objectives\">new paper<\/a> published Thursday titled &#8220;<a href=\"https:\/\/assets.anthropic.com\/m\/317564659027fb33\/original\/Auditing-Language-Models-for-Hidden-Objectives.pdf\">Auditing language models for hidden objectives<\/a>,&#8221; Anthropic researchers described how models trained to deliberately conceal certain motives from evaluators could still inadvertently reveal secrets, thanks to their ability to adopt different contextual roles or &#8220;personas.&#8221; The researchers were initially astonished by how effectively some of their interpretability methods seemed to uncover these hidden motives, although the methods are still under research.<\/p>\n<p>While the research involved models trained specifically to conceal motives from automated software evaluators called <a href=\"https:\/\/en.wikipedia.org\/wiki\/Reinforcement_learning_from_human_feedback\">reward models<\/a> (RMs), the broader purpose of studying hidden objectives is to prevent future scenarios where powerful AI systems might intentionally deceive or manipulate human users.<\/p>\n<p>While training a language model using reinforcement learning from human feedback (RLHF), reward models are typically tuned to score AI responses according to how well they align with human preferences. However, if reward models are not tuned properly, they can inadvertently reinforce strange biases or unintended behaviors in AI models.<\/p>\n<p><a href=\"https:\/\/arstechnica.com\/ai\/2025\/03\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\/\">Read full article<\/a><\/p>\n<p><a href=\"https:\/\/arstechnica.com\/ai\/2025\/03\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\/#comments\">Comments<\/a><\/p>\n<\/div>\n<p class=\"wpematico_credit\"><small>Powered by <a href=\"http:\/\/www.wpematico.com\" target=\"_blank\">WPeMatico<\/a><\/small><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In a new paper published Thursday titled &#8220;Auditing language models for hidden objectives,&#8221; Anthropic researchers described how models trained to deliberately conceal certain motives from evaluators could still inadvertently reveal secrets, thanks to their ability to adopt different contextual roles or &#8220;personas.&#8221; The researchers were initially astonished by how effectively some of their interpretability methods seemed to uncover these hidden motives, although the methods are still under research. While the&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[241],"tags":[],"class_list":["post-104949","post","type-post","status-publish","format-standard","hentry","category-technology"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Researchers astonished by tool\u2019s apparent success at revealing AI\u2019s hidden motives - UshopWell.com<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/ushopwell.com\/ublog\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Researchers astonished by tool\u2019s apparent success at revealing AI\u2019s hidden motives - UshopWell.com\" \/>\n<meta property=\"og:description\" content=\"In a new paper published Thursday titled &#8220;Auditing language models for hidden objectives,&#8221; Anthropic researchers described how models trained to deliberately conceal certain motives from evaluators could still inadvertently reveal secrets, thanks to their ability to adopt different contextual roles or &#8220;personas.&#8221; The researchers were initially astonished by how effectively some of their interpretability methods seemed to uncover these hidden motives, although the methods are still under research. While the...\" \/>\n<meta property=\"og:url\" content=\"https:\/\/ushopwell.com\/ublog\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\/\" \/>\n<meta property=\"og:site_name\" content=\"UshopWell.com\" \/>\n<meta property=\"article:published_time\" content=\"2025-03-14T20:03:41+00:00\" \/>\n<meta name=\"author\" content=\"UShopWell\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"UShopWell\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"1 minute\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\\\/\"},\"author\":{\"name\":\"UShopWell\",\"@id\":\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/#\\\/schema\\\/person\\\/6fd1f9e0ff932e534c86c70d5acff0fc\"},\"headline\":\"Researchers astonished by tool\u2019s apparent success at revealing AI\u2019s hidden motives\",\"datePublished\":\"2025-03-14T20:03:41+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\\\/\"},\"wordCount\":179,\"publisher\":{\"@id\":\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/#organization\"},\"articleSection\":[\"Technology\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\\\/\",\"url\":\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\\\/\",\"name\":\"Researchers astonished by tool\u2019s apparent success at revealing AI\u2019s hidden motives - UshopWell.com\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/#website\"},\"datePublished\":\"2025-03-14T20:03:41+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\\\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Researchers astonished by tool\u2019s apparent success at revealing AI\u2019s hidden motives\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/#website\",\"url\":\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/\",\"name\":\"UshopWell.com\",\"description\":\"The Premiere Online Marketplace\",\"publisher\":{\"@id\":\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/#organization\",\"name\":\"UshopWell\",\"url\":\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/wp-content\\\/uploads\\\/2018\\\/01\\\/pandaSwea.png\",\"contentUrl\":\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/wp-content\\\/uploads\\\/2018\\\/01\\\/pandaSwea.png\",\"width\":365,\"height\":359,\"caption\":\"UshopWell\"},\"image\":{\"@id\":\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/#\\\/schema\\\/logo\\\/image\\\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/#\\\/schema\\\/person\\\/6fd1f9e0ff932e534c86c70d5acff0fc\",\"name\":\"UShopWell\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/4adb372cadd43b4d4c57964dab95b0f69618bf960d131c4acf49d96d6bbc9c6e?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/4adb372cadd43b4d4c57964dab95b0f69618bf960d131c4acf49d96d6bbc9c6e?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/4adb372cadd43b4d4c57964dab95b0f69618bf960d131c4acf49d96d6bbc9c6e?s=96&d=mm&r=g\",\"caption\":\"UShopWell\"},\"url\":\"https:\\\/\\\/ushopwell.com\\\/ublog\\\/author\\\/kburnettu\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Researchers astonished by tool\u2019s apparent success at revealing AI\u2019s hidden motives - UshopWell.com","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/ushopwell.com\/ublog\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\/","og_locale":"en_US","og_type":"article","og_title":"Researchers astonished by tool\u2019s apparent success at revealing AI\u2019s hidden motives - UshopWell.com","og_description":"In a new paper published Thursday titled &#8220;Auditing language models for hidden objectives,&#8221; Anthropic researchers described how models trained to deliberately conceal certain motives from evaluators could still inadvertently reveal secrets, thanks to their ability to adopt different contextual roles or &#8220;personas.&#8221; The researchers were initially astonished by how effectively some of their interpretability methods seemed to uncover these hidden motives, although the methods are still under research. While the...","og_url":"https:\/\/ushopwell.com\/ublog\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\/","og_site_name":"UshopWell.com","article_published_time":"2025-03-14T20:03:41+00:00","author":"UShopWell","twitter_card":"summary_large_image","twitter_misc":{"Written by":"UShopWell","Est. reading time":"1 minute"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/ushopwell.com\/ublog\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\/#article","isPartOf":{"@id":"https:\/\/ushopwell.com\/ublog\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\/"},"author":{"name":"UShopWell","@id":"https:\/\/ushopwell.com\/ublog\/#\/schema\/person\/6fd1f9e0ff932e534c86c70d5acff0fc"},"headline":"Researchers astonished by tool\u2019s apparent success at revealing AI\u2019s hidden motives","datePublished":"2025-03-14T20:03:41+00:00","mainEntityOfPage":{"@id":"https:\/\/ushopwell.com\/ublog\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\/"},"wordCount":179,"publisher":{"@id":"https:\/\/ushopwell.com\/ublog\/#organization"},"articleSection":["Technology"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/ushopwell.com\/ublog\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\/","url":"https:\/\/ushopwell.com\/ublog\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\/","name":"Researchers astonished by tool\u2019s apparent success at revealing AI\u2019s hidden motives - UshopWell.com","isPartOf":{"@id":"https:\/\/ushopwell.com\/ublog\/#website"},"datePublished":"2025-03-14T20:03:41+00:00","breadcrumb":{"@id":"https:\/\/ushopwell.com\/ublog\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/ushopwell.com\/ublog\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/ushopwell.com\/ublog\/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/ushopwell.com\/ublog\/"},{"@type":"ListItem","position":2,"name":"Researchers astonished by tool\u2019s apparent success at revealing AI\u2019s hidden motives"}]},{"@type":"WebSite","@id":"https:\/\/ushopwell.com\/ublog\/#website","url":"https:\/\/ushopwell.com\/ublog\/","name":"UshopWell.com","description":"The Premiere Online Marketplace","publisher":{"@id":"https:\/\/ushopwell.com\/ublog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/ushopwell.com\/ublog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/ushopwell.com\/ublog\/#organization","name":"UshopWell","url":"https:\/\/ushopwell.com\/ublog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ushopwell.com\/ublog\/#\/schema\/logo\/image\/","url":"https:\/\/ushopwell.com\/ublog\/wp-content\/uploads\/2018\/01\/pandaSwea.png","contentUrl":"https:\/\/ushopwell.com\/ublog\/wp-content\/uploads\/2018\/01\/pandaSwea.png","width":365,"height":359,"caption":"UshopWell"},"image":{"@id":"https:\/\/ushopwell.com\/ublog\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/ushopwell.com\/ublog\/#\/schema\/person\/6fd1f9e0ff932e534c86c70d5acff0fc","name":"UShopWell","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/4adb372cadd43b4d4c57964dab95b0f69618bf960d131c4acf49d96d6bbc9c6e?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/4adb372cadd43b4d4c57964dab95b0f69618bf960d131c4acf49d96d6bbc9c6e?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/4adb372cadd43b4d4c57964dab95b0f69618bf960d131c4acf49d96d6bbc9c6e?s=96&d=mm&r=g","caption":"UShopWell"},"url":"https:\/\/ushopwell.com\/ublog\/author\/kburnettu\/"}]}},"_links":{"self":[{"href":"https:\/\/ushopwell.com\/ublog\/wp-json\/wp\/v2\/posts\/104949","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ushopwell.com\/ublog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ushopwell.com\/ublog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ushopwell.com\/ublog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ushopwell.com\/ublog\/wp-json\/wp\/v2\/comments?post=104949"}],"version-history":[{"count":0,"href":"https:\/\/ushopwell.com\/ublog\/wp-json\/wp\/v2\/posts\/104949\/revisions"}],"wp:attachment":[{"href":"https:\/\/ushopwell.com\/ublog\/wp-json\/wp\/v2\/media?parent=104949"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ushopwell.com\/ublog\/wp-json\/wp\/v2\/categories?post=104949"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ushopwell.com\/ublog\/wp-json\/wp\/v2\/tags?post=104949"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}