Resources

Curated research papers, frameworks, and tools for understanding and building AI agents.

Research Papers

Foundational

ReAct: Synergizing Reasoning and Acting in Language Models

Yao et al. (2023) · ICLR 2023

Introduces the ReAct paradigm, which interleaves reasoning traces with task-specific actions.

Toolformer: Language Models Can Teach Themselves to Use Tools

Schick et al. (2023) · NeurIPS 2023

Shows that LLMs can learn, in a self-supervised way, when and how to call external tools.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Wei et al. (2022) · NeurIPS 2022

Foundational work on prompting LLMs for step-by-step reasoning.

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Yao et al. (2023) · NeurIPS 2023

Extends CoT with exploration of multiple reasoning paths.

Memory & Learning

Generative Agents: Interactive Simulacra of Human Behavior

Park et al. (2023) · UIST 2023

Agents with memory for believable social simulation.

MemGPT: Towards LLMs as Operating Systems

Packer et al. (2023) · arXiv

Hierarchical, OS-inspired memory management that extends context beyond the model's fixed window.

Reflexion: Language Agents with Verbal Reinforcement Learning

Shinn et al. (2023) · NeurIPS 2023

Agents that improve by storing verbal self-reflections in memory rather than updating model weights.

Multi-Agent Systems

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Wu et al. (2023) · arXiv

Framework for multi-agent conversation and collaboration.

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

Hong et al. (2023) · arXiv

Role-based multi-agent system for software development.

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

Liang et al. (2023) · arXiv

Multiple agents debate to improve reasoning quality.

RAG & Retrieval

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Lewis et al. (2020) · NeurIPS 2020

Original RAG paper combining retrieval with generation.

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Asai et al. (2023) · arXiv

Trains a model to decide when to retrieve and to critique its own output using reflection tokens.

Corrective Retrieval Augmented Generation

Yan et al. (2024) · arXiv

Self-correcting retrieval with web search fallback.

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Edge et al. (2024) · arXiv

Knowledge-graph-based RAG for global, corpus-wide questions.

Benchmarks & Evaluation

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Jimenez et al. (2024) · ICLR 2024

Benchmark for evaluating code agents on real issues.

WebArena: A Realistic Web Environment for Building Autonomous Agents

Zhou et al. (2024) · ICLR 2024

Benchmark for web navigation and interaction.

GAIA: A Benchmark for General AI Assistants

Mialon et al. (2023) · arXiv

Multi-step reasoning benchmark for AI assistants.

AgentBench: Evaluating LLMs as Agents

Liu et al. (2023) · arXiv

Multi-environment evaluation of agent capabilities.

Safety & Alignment

Constitutional AI: Harmlessness from AI Feedback

Bai et al. (2022) · arXiv

Trains harmless assistants using AI-generated feedback guided by a written set of principles.

Red Teaming Language Models with Language Models

Perez et al. (2022) · EMNLP 2022

Automated red teaming for safety evaluation.

Frameworks & Libraries

Agent Orchestration

LangGraph

Python

Library for building stateful, multi-actor applications with LLMs. Graph-based control flow.

AutoGen

Python

Microsoft framework for multi-agent conversations. Supports diverse agent types.

CrewAI

Python

Framework for orchestrating role-playing AI agents. Focus on collaboration.

Semantic Kernel

Python/C#/Java

Microsoft SDK for building AI agents. Multi-language support.

Microsoft Agent Framework

Python/C#

Open-source framework unifying AutoGen and Semantic Kernel for multi-agent workflows.

LLM Tooling

LangChain

Python/JS

Framework for developing LLM-powered applications. Large ecosystem.

LlamaIndex

Python

Data framework for LLM applications. Focus on RAG and data ingestion.

Haystack

Python

End-to-end NLP framework. Production-ready pipelines.

Evaluation & Testing

DeepEval

Python

Open-source evaluation framework for LLMs. Agent-specific metrics.

RAGAS

Python

Evaluation framework for RAG applications. Component-level metrics.

Promptfoo

Node.js

CLI tool for testing and evaluating prompts. CI/CD integration.

Protocols & Standards

Model Context Protocol (MCP)

Multi-language

Anthropic protocol for connecting AI with tools and data sources.

MCP Apps

Multi-language

MCP extension for interactive UI components rendered directly in AI conversations.

Agent2Agent Protocol (A2A)

Multi-language

Google protocol for agent-to-agent communication and discovery.

Agent Payments Protocol (AP2)

Multi-language

Google protocol for secure agent authentication and payment transactions.

Universal Commerce Protocol (UCP)

Multi-language

Open standard by Google and Shopify for agentic commerce from discovery to purchase.

Tools & Products

AI-Assisted Development

Claude Code

Anthropic CLI for AI-assisted software development with full codebase context.

Cursor

AI-first code editor with integrated agent capabilities and multi-file editing.

GitHub Copilot

AI pair programmer for code suggestions, chat, and workspace understanding.

Cody

Sourcegraph AI coding assistant with codebase-aware context.

Aider

Open-source AI pair programming in your terminal. Git-aware.

Autonomous Agents

Devin

Autonomous AI software engineer. End-to-end development.

OpenHands

Open-source AI software development platform. Research-focused.

SWE-agent

Princeton agent for SWE-bench. Open-source reference implementation.

Observability & Debugging

LangSmith

Platform for debugging, testing, and monitoring LLM applications.

Weights & Biases

ML experiment tracking with LLM evaluation features.

Braintrust

Enterprise platform for AI product development. Evals and logging.
