I ran a fast experiment investigating how DeepSeek-R1 performs on agentic jobs, lespoetesbizarres.free.fr in spite of not supporting tool usage natively, and setiathome.berkeley.edu I was quite impressed by initial outcomes. This experiment runs DeepSeek-R1 in a single-agent setup, where the model not just plans the actions however likewise formulates the actions as executable Python code. On a subset1 of the GAIA recognition split, DeepSeek-R1 outperforms Claude 3.5 Sonnet by 12.5% absolute, from 53.1% to 65.6% correct, and other designs by an even bigger margin:
The experiment followed design use guidelines from the DeepSeek-R1 paper and the model card: Don't utilize few-shot examples, avoid adding a system timely, and set the temperature to 0.5 - 0.7 (0.6 was used). You can discover additional assessment details here.
Approach
DeepSeek-R1's strong coding capabilities allow it to serve as a representative without being explicitly trained for tool usage. By enabling the model to produce actions as Python code, it can flexibly engage with environments through code execution.
Tools are executed as Python code that is consisted of straight in the prompt. This can be a simple function meaning or a module of a bigger plan - any legitimate Python code. The model then creates code actions that call these tools.
Results from carrying out these actions feed back to the model as follow-up messages, the next steps until a final response is reached. The agent structure is a basic iterative coding loop that mediates the discussion between the model and its environment.
Conversations
DeepSeek-R1 is utilized as chat design in my experiment, where the design autonomously pulls extra context from its environment by using tools e.g. by using a search engine or bring information from websites. This drives the conversation with the environment that continues up until a last response is reached.
On the other hand, o1 models are known to perform inadequately when used as chat models i.e. they do not attempt to pull context throughout a conversation. According to the connected short article, o1 models carry out best when they have the complete context available, bbarlock.com with clear directions on what to do with it.
Initially, I likewise tried a full context in a single timely technique at each action (with arise from previous actions included), links.gtanet.com.br but this led to significantly lower ratings on the GAIA subset. Switching to the conversational method explained above, I was able to reach the reported 65.6% performance.
This raises a fascinating concern about the claim that o1 isn't a chat design - perhaps this observation was more pertinent to older o1 designs that lacked tool usage capabilities? After all, isn't tool use support an important mechanism for making it possible for models to pull extra context from their environment? This conversational approach certainly appears reliable for DeepSeek-R1, though I still require to conduct comparable try outs o1 designs.
Generalization
Although DeepSeek-R1 was mainly trained with RL on mathematics and coding jobs, it is impressive that generalization to agentic tasks with tool use by means of code actions works so well. This capability to generalize to agentic tasks reminds of recent research study by DeepMind that reveals that RL generalizes whereas SFT remembers, although generalization to tool usage wasn't investigated because work.
Despite its capability to generalize to tool use, historydb.date DeepSeek-R1 often produces long reasoning traces at each step, compared to other models in my experiments, limiting the effectiveness of this design in a single-agent setup. Even simpler jobs in some cases take a very long time to finish. Further RL on agentic tool usage, be it through code actions or not, could be one choice to enhance effectiveness.
Underthinking
I also observed the underthinking phenomon with DeepSeek-R1. This is when a thinking design often switches in between different thinking ideas without sufficiently exploring appealing paths to reach an appropriate service. This was a major reason for excessively long reasoning traces produced by DeepSeek-R1. This can be seen in the tape-recorded traces that are available for download.
Future experiments
Another typical application of reasoning designs is to use them for preparing only, while using other designs for producing code actions. This could be a prospective brand-new function of freeact, if this separation of functions shows useful for more complex tasks.
I'm also curious about how reasoning models that already support tool usage (like o1, o3, ...) perform in a single-agent setup, with and without generating code actions. Recent developments like OpenAI's Deep Research or Hugging Face's open-source Deep Research, archmageriseswiki.com which also utilizes code actions, look intriguing.
1
Exploring DeepSeek R1's Agentic Capabilities Through Code Actions
Adele Chick edited this page 1 year ago