AI Trained on Smartphone Signals Forecasts Anxiety, Depression in College Students, Study Finds
On a Tuesday night in midterm season, a Dartmouth College sophomore lingers in the library past 2 a.m., walks directly back to a dorm room and barely leaves for the next two days. The student’s smartphone quietly logs the late hours, the shrinking radius of movement, the restless sleep and the surge in screen time.
In a new study, researchers report that an artificial intelligence system trained on years of such data can flag that student as likely to experience moderate or severe anxiety and depression the following week—days before the student fills out a mental health survey.
A team led by Kaidong Feng of the Singapore University of Technology and Design has posted a preprint describing the work on the online repository arXiv. The paper, first submitted Jan. 7 and updated Jan. 13, is titled “A Comparative Study of Traditional Machine Learning, Deep Learning, and Large Language Models for Mental Health Forecasting using Smartphone Sensing Data.” It has not yet been peer-reviewed.
Using five years of smartphone data from Dartmouth undergraduates, the researchers compared several kinds of AI systems and found that transformer-style deep learning models—the same general architecture behind some large language models—best predicted short-term changes in students’ self-reported anxiety and depression.
“Our results show that DL models, particularly Transformer (Macro-F1 = 0.58), achieve the best overall performance, while LLMs show strength in contextual reasoning but weaker temporal modeling,” the authors wrote.
A campus life turned into data
The study relies on the College Experience Study dataset, or CES, a large repository built from a longitudinal project at Dartmouth that tracked 215 undergraduates between 2017 and 2022. Students who joined the study installed a research app on their personal smartphones based on Dartmouth’s StudentLife platform.
The app collected passive signals such as location, physical activity, sleep estimates and phone use. Location was grouped into categories like “own dorm,” “other dorms,” “classroom,” “gym” or “social spaces.” Motion sensors inferred whether students were walking, running, biking or still. The system estimated sleep duration and timing and logged how often and how long phones were unlocked, sometimes broken down by place.
Once a week, students also completed short surveys on their phones, including the Patient Health Questionnaire-4, or PHQ-4, a standard four-item screening tool for anxiety and depression. PHQ-4 scores range from 0 to 12 and are typically categorized as normal, mild, moderate or severe symptoms.
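For readers who want the mechanics, the conventional PHQ-4 cut points (0 to 2 normal, 3 to 5 mild, 6 to 8 moderate, 9 to 12 severe) can be written as a small function. The preprint's exact binning is not quoted in this article, so treat the thresholds below as the standard published ones rather than the paper's:

```python
def phq4_category(score: int) -> str:
    """Map a PHQ-4 total score (0-12) to the standard severity bands."""
    if not 0 <= score <= 12:
        raise ValueError("PHQ-4 total score must be between 0 and 12")
    if score <= 2:
        return "normal"
    if score <= 5:
        return "mild"
    if score <= 8:
        return "moderate"
    return "severe"
```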
In total, the researchers assembled 24,778 examples of two-week periods in which the first week contained only passive sensing data and the second week ended with a PHQ-4 assessment. The task for the algorithms was to predict the severity category at the end of week two from behavior in week one.
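A rough sketch of how such windows might be assembled, assuming a hypothetical daily-feature table and weekly survey table (the file names and columns here are illustrative, not the actual CES schema), and reusing the phq4_category helper from above:

```python
import pandas as pd

# Hypothetical tables: one row per student-day of sensing features,
# plus a table of weekly PHQ-4 survey results.
sensing = pd.read_csv("daily_features.csv", parse_dates=["date"])  # user_id, date, sleep_hours, ...
surveys = pd.read_csv("phq4_weekly.csv", parse_dates=["date"])     # user_id, date, phq4_score

examples = []
for _, survey in surveys.iterrows():
    # Week one is the seven sensing days that end a full week before the survey.
    start = survey["date"] - pd.Timedelta(days=14)
    end = survey["date"] - pd.Timedelta(days=7)
    week1 = sensing[
        (sensing["user_id"] == survey["user_id"])
        & (sensing["date"] >= start)
        & (sensing["date"] < end)
    ].sort_values("date")
    if len(week1) == 7:  # keep only complete one-week windows
        features = week1.drop(columns=["user_id", "date"]).to_numpy()
        label = phq4_category(survey["phq4_score"])  # helper from the earlier sketch
        examples.append((features, label))
```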
Transformers beat large language models
The team set up what amounts to an AI “bake-off.” They trained and tested three broad types of models on the same forecasting task.
Traditional machine learning models, such as logistic regression, support vector machines and gradient-boosted trees, used hand-engineered behavioral features averaged over days or weeks. Deep learning models—including a multilayer perceptron, a recurrent neural network and a temporal convolutional network—processed the sequence of daily features. A transformer model designed for time series emerged as the best performer, with a macro-F1 score of about 0.58 across the four PHQ-4 categories.
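The preprint's exact architecture is not reproduced in this article, but as a shape-level illustration, a minimal PyTorch transformer encoder over a week of daily feature vectors, mean-pooled into a four-way severity head, might look like the sketch below. All dimensions and layer counts are assumptions:

```python
import torch
import torch.nn as nn

class WeeklyTransformer(nn.Module):
    """Four-way PHQ-4 severity classifier over one week of daily features.

    Illustrative dimensions only; the preprint's hyperparameters may differ.
    """
    def __init__(self, n_features: int, d_model: int = 64, n_classes: int = 4):
        super().__init__()
        self.project = nn.Linear(n_features, d_model)        # daily features -> model width
        self.pos = nn.Parameter(torch.zeros(1, 7, d_model))  # learned positions, one per day
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 7, n_features), one row of features per day
        h = self.encoder(self.project(x) + self.pos)
        return self.head(h.mean(dim=1))  # mean-pool over days, then classify

model = WeeklyTransformer(n_features=16)
logits = model(torch.randn(32, 7, 16))  # 32 one-week windows -> (32, 4) logits
```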
In classification problems with imbalanced classes, macro-F1—the average of F1 scores computed for each class separately—gives equal weight to common and rare outcomes. That makes it particularly relevant for this task, where most weeks were classified as normal but a small fraction fell into the moderate or severe range.
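Concretely, macro averaging is what scikit-learn computes with average="macro". A toy example with made-up labels shows how a completely missed rare class drags the score down:

```python
from sklearn.metrics import f1_score

y_true = ["normal", "normal", "normal", "mild", "moderate", "severe"]
y_pred = ["normal", "normal", "mild",   "mild", "moderate", "normal"]

# Per-class F1 is averaged without weighting by class frequency, so the
# single missed "severe" case costs a full quarter of the score.
print(f1_score(y_true, y_pred, average="macro", zero_division=0))
```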
The researchers also evaluated several large language models, including open-weight Qwen 4-billion, 8-billion and 14-billion parameter models and OpenAI’s GPT-4.1. Because those models are built to process text rather than raw numeric time series, the team converted weekly or daily behavioral summaries into natural-language prompts, such as: “In the past week, the student spent X hours in their dorm, Y hours in classrooms, went to the gym Z times and slept an average of N hours per night.”
They tested zero-shot prompts, in which the model sees the description and a question but no examples; in-context learning, where the model is given a few labeled examples to imitate; and parameter-efficient fine-tuning techniques that lightly adapt the model to the task.
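As an illustration of the difference between the first two strategies (the paper's exact prompt wording is not reproduced here, so the phrasing below is an assumption modeled on the article's example), zero-shot and in-context prompts could be built like this:

```python
def describe_week(stats: dict) -> str:
    """Render one week of behavioral summaries as text (wording is illustrative)."""
    return (
        f"In the past week, the student spent {stats['dorm_h']} hours in their dorm, "
        f"{stats['class_h']} hours in classrooms, went to the gym {stats['gym_visits']} "
        f"times and slept an average of {stats['sleep_h']} hours per night."
    )

QUESTION = ("Based on this behavior, will the student's PHQ-4 severity next week "
            "be normal, mild, moderate, or severe? Answer with one word.")

def zero_shot_prompt(week: dict) -> str:
    # Zero-shot: the description and the question, no worked examples.
    return f"{describe_week(week)}\n{QUESTION}"

def in_context_prompt(shots: list[tuple[dict, str]], week: dict) -> str:
    # In-context learning: a few labeled examples precede the real query.
    demos = "\n\n".join(
        f"{describe_week(w)}\n{QUESTION}\nAnswer: {label}" for w, label in shots
    )
    return f"{demos}\n\n{zero_shot_prompt(week)}"
```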
Despite their strong performance on many language and reasoning benchmarks, the large language models did not match the specialized transformer sequence model on this numeric forecasting problem. In many settings, they also trailed some of the simpler machine learning baselines.
The authors attribute the gap partly to the difficulty large language models have with fine-grained temporal patterns in numerical data and partly to the information lost when compressing detailed sensor streams into short, text-friendly summaries.
Personal patterns matter
One of the study’s most striking findings is how much performance improves when models are allowed to personalize to each student.
When the researchers gave deep learning models a way to encode individual users—for example, by adding a learned embedding that represents each participant—accuracy on rarer, more severe mental health states jumped. The authors report that “DL models augmented with user embeddings achieve Macro-F1 improvements greater than 0.3, particularly for severe cases.”
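One common way to implement this, sketched below under the same illustrative assumptions as the earlier model (the paper's integration may differ), is to concatenate a learned per-user vector with the encoder's pooled output before classification:

```python
import torch
import torch.nn as nn

class PersonalizedHead(nn.Module):
    """Concatenate a learned per-student vector with a pooled sequence encoding.

    Sketch only: the stable per-user embedding is what lets the model learn
    each student's behavioral baseline and deviations from it.
    """
    def __init__(self, n_users: int, d_model: int = 64, n_classes: int = 4):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, d_model)
        self.classify = nn.Linear(2 * d_model, n_classes)

    def forward(self, pooled: torch.Tensor, user_ids: torch.Tensor) -> torch.Tensor:
        # pooled: (batch, d_model) from a sequence encoder; user_ids: (batch,)
        return self.classify(torch.cat([pooled, self.user_emb(user_ids)], dim=-1))
```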
That result aligns with a broader body of work in digital mental health showing that behavioral indicators of distress are highly individual. For one student, a sharp reduction in time spent on foot may signal a downturn; for another, late-night socializing away from the dorm might be a more sensitive sign.
Personalization, however, implies ongoing profiling. To learn an individual’s baseline and deviations from it, models must track that person’s behavior over time and associate it with a consistent identifier.
From research to real-world use
The new work is a methodological study: a benchmark of what existing AI techniques can do on a rich dataset. There is no app in app stores, no campus deployment and no clinical trial showing that forecasts improve well-being or prevent crises.
Still, the technical feasibility of forecasting short-term mental health changes from everyday digital traces is now clearer. In simulations where the models had access to only a few days rather than a full week of data, deep learning systems degraded more gracefully than traditional models, suggesting they might someday power “early warning” tools that act before problems fully unfold.
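One simple way to run that kind of simulation, an assumption on this article's part rather than the paper's stated protocol, is to zero out all but the most recent days of each input window and re-score the trained model:

```python
import torch

def truncate_window(x: torch.Tensor, days_available: int) -> torch.Tensor:
    """Zero out all but the most recent `days_available` days of a week window.

    Assumed protocol for simulating partial data; x is (batch, 7, n_features).
    """
    masked = x.clone()
    masked[:, : 7 - days_available, :] = 0.0
    return masked

# e.g. evaluate a trained model with only the last three days of sensing:
# logits = model(truncate_window(batch, days_available=3))
```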
Researchers and clinicians have long been interested in so-called just-in-time adaptive interventions, which would deliver prompts, coping strategies or outreach precisely when someone is entering a period of elevated risk. A randomized trial published in recent years found that smartwatch-based stress detection, combined with self-management tools, reduced stress “events” among college students over 12 weeks.
The Dartmouth data also capture a turbulent period for students. Earlier analyses of the College Experience Study reported that PHQ-4 scores rose significantly during the COVID-19 pandemic and remained elevated even after in-person instruction resumed, reflecting lingering psychological effects.
Surveillance, consent and the law
The same technologies that make just-in-time support possible also raise questions about surveillance and privacy, especially on college campuses where students live, study and socialize in close quarters.
Digital phenotyping—a term coined a decade ago for “moment-by-moment quantification of the individual-level human phenotype in situ using data from personal digital devices”—involves continuous monitoring of location, motion, phone use and sometimes communication patterns. Ethicists have warned that such monitoring “creates inherent tensions with privacy, autonomy, and developmental needs that cannot be resolved through technical refinements alone.”
In the United States, health privacy law offers limited protection for much of this data. The Department of Health and Human Services has emphasized that the Health Insurance Portability and Accountability Act, or HIPAA, generally does not apply to health information stored in personal apps if the app is not offered by a covered health care provider or its business associate. That means location histories, app usage logs and other behavioral data used to infer mental health may fall outside HIPAA’s protections unless the system is tightly integrated into clinical care.
Concerns about overreliance on monitoring technology are not hypothetical. In the United Kingdom, the use of camera-based systems such as Oxevision in psychiatric units has prompted debate among clinicians and advocates, some of whom argue that automated observation risks replacing in-person checks and undermining trust between staff and patients.
What happens next
For now, the Dartmouth forecasting study sits in the realm of research: a demonstration that, given detailed behavioral data and sophisticated models, it is possible to predict week-ahead mental health states better than chance and better than several widely used AI architectures.
Whether such systems will move into routine use—and under what rules—is an open question that goes beyond computer science. Decisions about consent, data retention, access to risk scores and avenues for contesting automated assessments will shape how, and whether, students and patients experience these tools as support or scrutiny.
The new findings suggest that smartphones in students’ pockets can, in a statistical sense, see next week’s distress coming. How institutions and regulators choose to handle that capability may determine whether it becomes another layer of campus surveillance or a way to reach people before they hit their lowest point.