Peer-reviewed | Open Access | Multidisciplinary
The rapid advancement of large language models (LLMs) has catalyzed a transition from conventional, manually orchestrated data science workflows toward autonomous analytical systems capable of iterative reasoning, tool invocation, and adaptive decision-making. Recent deployments of transformer-based architectures trained on large-scale corpora such as \textit{Common Crawl}, \textit{The Pile}, and domain-specific repositories have demonstrated the feasibility of integrating natural language understanding with computational pipelines for tasks including feature engineering, model selection, and result interpretation. Despite these promising developments, the design and evaluation of LLM-driven data science agents remain fragmented across heterogeneous frameworks, lacking standardized architectural abstractions and performance validation methodologies. This survey addresses this gap by systematically organizing existing research into a unified taxonomy that categorizes agent systems according to architectural paradigms (single-agent, multi-agent, and tool-augmented), workflow automation strategies, and levels of decision autonomy. From a formal perspective, the operational behavior of a data science agent can be expressed as a sequential decision process governed by a policy $\pi$ whose quality is measured by an expected task cost $J(\pi)$; at each step, the optimal action $a^{*}$ minimizes the expected loss \[ a^{*} = \arg\min_{a \in \mathcal{A}} \; \mathbb{E}_{s \sim \mathcal{S}} \left[ L\big(f_{\theta}(s,a), y\big) \right], \] where $f_{\theta}$ denotes a parameterized predictive model, $y$ the ground-truth target, and $L(\cdot)$ a task-specific loss such as cross-entropy or mean squared error.
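The decision rule above can be sketched as a minimal empirical approximation, in which the expectation over states is replaced by an average over samples. The predictor `f_theta`, the discrete action set, and the squared loss below are illustrative stand-ins, not components of any specific surveyed system:

```python
import numpy as np

def f_theta(state, action):
    # Toy parameterized predictor: output depends on both state and action.
    return state * (1.0 + 0.1 * action)

def squared_loss(prediction, target):
    # L(f_theta(s, a), y): task-specific loss, here mean squared error per sample.
    return (prediction - target) ** 2

def select_action(actions, states, targets):
    # a* = argmin_a (1/N) * sum_i L(f_theta(s_i, a), y_i),
    # an empirical version of argmin_a E_{s~S}[ L(f_theta(s, a), y) ].
    def expected_loss(action):
        return np.mean([squared_loss(f_theta(s, action), y)
                        for s, y in zip(states, targets)])
    return min(actions, key=expected_loss)

rng = np.random.default_rng(0)
states = rng.normal(1.0, 0.2, size=100)
targets = states  # with y = s, the action a = 0 leaves f_theta(s, 0) = s
best = select_action(actions=[-1, 0, 1, 2], states=states, targets=targets)
```

Here `best` resolves to the action whose predictions incur the lowest average loss over the sampled states (action `0` under this toy construction).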
Building upon this formulation, the paper introduces a comparative evaluation framework grounded in reproducible experimental settings using benchmark sources including the \textit{UCI Machine Learning Repository}, \textit{OpenML}, and real-world tabular analytics tasks, enabling systematic assessment of accuracy, latency, robustness, and resource utilization. Furthermore, the survey critically examines unresolved challenges related to reliability, interpretability, security vulnerabilities, and computational scalability, while outlining emerging research directions in self-reflective agents, federated data science automation, and human-in-the-loop validation mechanisms. The principal contribution of this work lies in consolidating dispersed literature into a mathematically grounded and experimentally informed reference framework that supports the rigorous design, evaluation, and deployment of next-generation LLM-based data science agents.
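A comparative evaluation of the kind described above can be illustrated with a toy harness that records two of the named metrics, accuracy and latency, per task. The `agent_fn` callable and the task dictionary format are assumptions made for this sketch, not an API defined by the surveyed systems:

```python
import time

def evaluate_agent(agent_fn, tasks):
    # Run the agent on each task, recording correctness and wall-clock latency.
    records = []
    for task in tasks:
        start = time.perf_counter()
        prediction = agent_fn(task["input"])
        latency = time.perf_counter() - start
        records.append({"correct": prediction == task["label"],
                        "latency_s": latency})
    # Aggregate into the summary metrics named in the evaluation framework.
    accuracy = sum(r["correct"] for r in records) / len(records)
    mean_latency = sum(r["latency_s"] for r in records) / len(records)
    return {"accuracy": accuracy, "mean_latency_s": mean_latency}

# Usage with a trivial stand-in "agent" that squares its input.
tasks = [{"input": 2, "label": 4}, {"input": 3, "label": 9}]
report = evaluate_agent(lambda x: x * x, tasks)
```

Robustness and resource utilization would require additional instrumentation (e.g. perturbed inputs, memory profiling), which this sketch omits.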
Keywords: Large Language Models, Data Science Agents, Autonomous Analytics, Workflow Automation, Multi-Agent Systems, AI Automation, Explainable AI, AutoML