I am hiring a Research Scientist to work on a DARPA-funded project in automated scientific feasibility assessment. In plain language, feasibility assessment tries to determine either (1) whether a scientific claim that someone else made is likely to be true/feasible, or (2) given a scientific goal (like creating a specific technology), what a viable/feasible pathway to creating that technology would be. The latter is the application that I’m particularly interested in, as it has the potential to accelerate the pace of scientific discovery.
There are many ways of attempting feasibility assessment, such as searching through the scientific literature, coming up with new experiments to run, and running those experiments to see if they suggest something is feasible or infeasible. A particular focus here is on automated experimentation: automatically generating, running, and analyzing code-based experiments. Currently it’s very easy to get a language model to generate code for experiments, but most of that experiment code is bad, incorrect, or otherwise has problems that make it untrustworthy. For example, the CodeScientist paper (linked below) found that only ~30% of the results of an automated experiment system turned out to be true. Significant challenges in this subfield right now therefore include figuring out how to design and run trustworthy automated experiments, and how to do all this at scale (one feasibility assessment might require dozens or more experiments). More broadly, feasibility assessment is a new and challenging task, and there are lots of exciting questions in how to frame these feasibility assessment problems and their component parts, evaluate them, make substantial progress, and release full systems that people can use with clear utility/impact.
Potentially relevant papers:
- Matter-of-Fact: A Benchmark for Verifying the Feasibility of Literature-Supported Claims in Materials Science (EMNLP 2025; https://aclanthology.org/2025.emnlp-main.203/ )
- CodeScientist: End-to-End Semi-Automated Scientific Discovery with Code-based Experimentation (ACL Findings 2025; https://aclanthology.org/2025.findings-acl.692/ )
The position details are below.
Application: https://arizona.csod.com/ux/ats/careersite/4/home/requisition/25414?c=arizona
Remote work: Possible
Start date: We’re hoping to find someone as soon as possible.
Application domains:
- Currently we’re working on claims in the AI domain.
- Previously (in Phase 1) we worked in the materials science domain.
- This might expand to the quantum computing domain in the future, but you do not need to have knowledge in that domain right now (we have some funding for domain expert assistance).
—
Helpful information to include:
Question 1): Availability
The position has a near-term start requirement. If selected for hire, what is the earliest date that you could start?
Question 2): Code Samples/Demonstrated Ability
Please provide links (not file attachments; the UA system often strips them out) to code samples that best address the following criteria. Note that your code samples must be *authored by you* and a product of your own intellectual labor, not that of a collaborator or an AI system. When possible, please point to code samples that are position-relevant.
2A) General: A code sample that best illustrates your general computer science background (data structures, classes, software architecture, symbolic/discrete CS background, etc.)
2B) AI/NLP: A code sample that best illustrates your ability specifically in NLP or AI.
2C) (Optional) If there is a code sample or project that you are particularly proud of that partially used AI in its construction, and that you believe is relevant in assessing your abilities for this position, you can link to it here (with a brief 1-2 sentence explanation of specifically what was authored by you, and what was not authored by you).
Question 3): Past Experience on Position-Relevant Criteria
Please provide very brief (no more than 2-3 sentences) descriptions of any past experience you have in the following research topics. Your responses must be authored by you, and not a language model. Please be candid here; we expect many candidates will write ‘no experience’ for most of these. When appropriate, please point to a specific paper and/or code repository.
3A) Literature-based discovery (defined as writing AI systems that automatically read scientific papers at scale and extract information from them, typically for use in a downstream scientific task).
3B) Automated code generation (defined as writing an AI system that automatically writes and iteratively debugs software to solve some task. This is not using Claude Code; this is you writing something like Claude Code).
3C) Claim Verification (defined as being provided with a claim, or extracting claims from text automatically with AI systems, and using an AI system to attempt to determine whether that claim is likely to be true or false).
3D) LLM Agents (defined broadly as a system that examines a current state using a language model, infers an action to take, and takes that action. This process continues iteratively until some task is completed. Note that this is different from a hard-coded workflow: an agent has an explicit notion of an action space that it selects from at each iteration, based on an observation of the current environment state).
3E) Evaluating difficult problems without gold labels (when an automated system produces a code-based experiment that you run to get results, it’s currently difficult to ascertain whether those results are correct, as there is often no gold evaluation signal. If you have solved a problem without gold labels before, particularly one related to this SciFy evaluation challenge, please mention it here: what was the input, what was the output, and how was it evaluated?)
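
To make the agent-versus-workflow distinction in 3D concrete, here is a minimal sketch of the observe/select/act loop. Everything in it is a hypothetical illustration and not code from the project: the toy `CounterEnv` environment, the `ACTIONS` action space, and `stub_policy`, which is a deterministic stand-in for where a language model call would go.

```python
from dataclasses import dataclass, field
from typing import Callable

# Explicit action space the agent selects from at every iteration (hypothetical).
ACTIONS = ("increment", "double", "stop")


@dataclass
class CounterEnv:
    """Toy environment (an assumption for illustration): reach `target` from 1."""
    target: int
    value: int = 1
    history: list = field(default_factory=list)

    def observe(self) -> dict:
        # The observation is the current environment state shown to the policy.
        return {"value": self.value, "target": self.target}

    def step(self, action: str) -> None:
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")
        if action == "increment":
            self.value += 1
        elif action == "double":
            self.value *= 2
        # "stop" leaves the state unchanged; every action is logged.
        self.history.append(action)


def stub_policy(observation: dict, actions: tuple) -> str:
    """Stand-in for a language model call: maps an observation to one action."""
    value, target = observation["value"], observation["target"]
    if value >= target:
        return "stop"
    if value * 2 <= target:
        return "double"
    return "increment"


def run_agent(env: CounterEnv, policy: Callable[[dict, tuple], str],
              max_steps: int = 20) -> list:
    """Observe, select an action, act; repeat until 'stop' or the step budget."""
    for _ in range(max_steps):
        action = policy(env.observe(), ACTIONS)
        env.step(action)
        if action == "stop":
            break
    return env.history


if __name__ == "__main__":
    env = CounterEnv(target=10)
    print(run_agent(env, stub_policy), env.value)
```

A hard-coded workflow would instead fix the sequence of steps in advance. Here the sequence is only determined at run time, because the policy chooses from the action space after each new observation.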