Apr 28, 2025

Evaluating the risk of job displacement by transformative AI automation in developing countries: A case study on Brazil

Blessing Ajimoti, Vitor Tomaz, Hoda Maged, Mahi Shah, Abubakar Abdulfatah

🏆 3rd place by peer review

In this paper, we introduce an empirical and reproducible approach to monitoring job displacement by transformative AI (TAI). We first classify occupations based on current prompting behavior from a novel dataset from Anthropic, linking 4 million Claude Sonnet 3.7 prompts to tasks from the O*NET occupation taxonomy. We then develop a seasonally-adjusted autoregressive model based on employment flow data from Brazil (CAGED) between 2021 and 2024 to analyze the effects of diverging prompting behavior on employment trends per occupation. We conclude that there is no statistically significant difference in net-job dynamics between the occupations whose tasks feature the highest frequency in prompts and the ones with the lowest frequency, indicating that current AI technology has not initiated job displacement in Brazil.

Reviewer's Comments

Really excellent work, especially given the time constraint. It's well situated in the literature and extends core economic arguments to SOTA AI work. Moreover, I like that the framework is reusable -- you could run it again with new data or if another lab released data on their use cases.

One concern I have with using Claude data is that Claude is not very representative. It's primarily popular with programmers, which is why I expect that's the vast majority of tasks it completes. However, it's the best dataset you have access to, so I can't hold this against you. You correctly flag this as an issue in your limitations. This is another reason why it would be good for OpenAI and others to release similar information.

Within your discussion section, I expect point b is unlikely. There's a difference between opinion polling and salience (i.e. people can say lots of things, but only a few actually influence their behavior). However, I expect resistance could become meaningful in the future. Either way, I think the remaining explanations were compelling.

I would have liked more detail or novelty in your policy recommendations section, though I expect the analysis took up most of your time.

The paper presents an inventive empirical pipeline that matches three very different datasets: four million Claude Sonnet 3.7 prompts mapped to O*NET tasks, a crosswalk from O*NET to Brazil’s CBO occupation codes, and monthly employment flows from the CAGED register from 2021 to mid-2024. By grouping occupations into four exposure buckets and running seasonal and trend adjustments followed by simple autoregressive tests, the authors find no statistically significant divergence in net job creation between high and low prompt-exposed occupations. Releasing the code and provisional crosswalk on GitHub is commendable, and the discussion section openly lists the main data and classification shortcomings. The study is a useful proof of concept for real-time labour-market monitoring in developing economies.
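For readers unfamiliar with the "seasonal and trend adjustments followed by simple autoregressive tests" step, here is a minimal sketch of what such a pipeline can look like. It is not the authors' code: it swaps STL for a much cruder per-month average, fits only an AR(1) term, and runs on a made-up series. All function names and numbers below are hypothetical.

```python
from statistics import mean

def remove_seasonality(series, period=12):
    """Subtract each calendar position's average deviation from the overall mean.

    A crude stand-in for STL, used only to keep the sketch dependency-free.
    """
    overall = mean(series)
    seasonal = [mean(series[i::period]) - overall for i in range(period)]
    return [x - seasonal[t % period] for t, x in enumerate(series)]

def remove_linear_trend(series):
    """Return residuals from an ordinary least squares line fitted against time."""
    n = len(series)
    t_bar = (n - 1) / 2
    y_bar = mean(series)
    sxx = sum((t - t_bar) ** 2 for t in range(n))
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in enumerate(series))
    slope = sxy / sxx
    return [y - (y_bar + slope * (t - t_bar)) for t, y in enumerate(series)]

def ar1_coefficient(residuals):
    """Least squares slope of x_t on x_(t-1), no intercept (residuals are centred)."""
    num = sum(cur * prev for prev, cur in zip(residuals, residuals[1:]))
    den = sum(prev * prev for prev in residuals[:-1])
    return num / den

# Hypothetical net-job series for one exposure bucket: 36 months of made-up values.
pattern = [5, -3, 2, 0, 1, -2, 4, -1, 0, 3, -4, -5]
net_jobs = [100 + 0.5 * t + pattern[t % 12] + (3 if t % 5 == 0 else -1)
            for t in range(36)]
residuals = remove_linear_trend(remove_seasonality(net_jobs))
phi = ar1_coefficient(residuals)
print(f"AR(1) coefficient on residuals: {phi:.3f}")
```

Comparing the fitted coefficient across exposure buckets is descriptive only. It does not by itself test whether the buckets diverge.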

Innovation and literature depth are moderate. Linking real LLM usage to national employment data is a novel empirical step, but the conceptual framing relies mainly on Acemoglu and Restrepo’s task model and a single recent Anthropic paper. The review omits earlier occupation-level exposure measures and does not engage Brazilian labour-market studies, limiting its foundation.

The AI safety contribution is indirect. Monitoring displacement can inform distributional policy, yet the paper does not connect its findings to systemic safety issues such as social instability, race dynamics, or governance incentives that affect catastrophic risk. Adding a pathway from timely displacement signals to alignment or compute governance decisions would improve relevance.

Technical execution is mixed. Strengths include careful seasonality removal and candid presentation of ADF statistics. Weaknesses include heavy dependence on one week of prompt data, unverified LLM-generated crosswalks, absence of robustness checks, and small simulation sample size (five runs per scenario). Parameter choices for the AR models and lag selection are not justified, and no confidence bands are shown on the plots on pages 6 and 7. Without formal hypothesis tests comparing the four series, the “no difference” conclusion is tentative.
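One simple form a formal comparison between two of the series could take is an exact sign-flip (paired permutation) test on the month-by-month difference between a high-exposure and a low-exposure series. The sketch below is illustrative only. It assumes the monthly differences are exchangeable in sign under the null, which autocorrelated employment flows may violate, and the data are invented.

```python
from itertools import product

def sign_flip_p_value(high, low):
    """Exact two-sided sign-flip test on the mean paired difference.

    Enumerates all 2**n sign assignments, so it is only practical for short series.
    """
    diffs = [h - l for h, l in zip(high, low)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    extreme = 0
    for signs in product((1, -1), repeat=n):
        flipped = abs(sum(s * d for s, d in zip(signs, diffs))) / n
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / 2 ** n

# Hypothetical scaled net-job residuals for 12 months (made-up numbers).
high_exposure = [0.2, -0.1, 0.4, 0.0, -0.3, 0.1, 0.2, -0.2, 0.3, 0.0, -0.1, 0.1]
low_exposure = [0.1, 0.0, 0.3, -0.1, -0.2, 0.2, 0.1, -0.1, 0.2, 0.1, -0.2, 0.0]
print(f"sign-flip p-value: {sign_flip_p_value(high_exposure, low_exposure):.3f}")
```

A large p-value here would mean only that the test failed to detect a difference, not that the series are equivalent.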

Suggestions for improvement

1. Expand the Anthropic dataset to multiple models and longer time windows, then rerun the analysis with rolling windows and placebo occupations.

2. Replace the LLM crosswalk with expert-validated mappings and report a sensitivity study to mapping uncertainty.

3. Use difference-in-differences or panel regressions with occupation fixed effects to test for differential shocks rather than relying on visual inspection and ADF tests.

4. Integrate policy scenarios that link early displacement signals to safety-relevant interventions such as workforce transition funds financed by windfall clauses.

5. Broaden the literature review to include empirical UBI pilots, Brazilian automation studies, and recent AI safety economics papers.
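To make suggestion 3 concrete, here is a minimal sketch of a two-group, two-period difference-in-differences point estimate. The group labels, the cutoff month, and the records are all hypothetical. A real analysis would use a panel regression with occupation fixed effects and report standard errors, which this sketch does not.

```python
from statistics import mean

def did_estimate(records, cutoff):
    """Difference-in-differences on (group, month_index, net_jobs) records.

    group is 'high' or 'low' exposure; months at or after `cutoff` count as 'post'.
    Raises statistics.StatisticsError if any of the four cells is empty.
    """
    cells = {(g, post): [] for g in ("high", "low") for post in (False, True)}
    for group, month, value in records:
        cells[(group, month >= cutoff)].append(value)
    high_change = mean(cells[("high", True)]) - mean(cells[("high", False)])
    low_change = mean(cells[("low", True)]) - mean(cells[("low", False)])
    return high_change - low_change

# Made-up monthly net jobs (scaled) for two exposure groups, cutoff at month 2.
records = [
    ("high", 0, 10), ("high", 1, 12), ("high", 2, 8), ("high", 3, 6),
    ("low", 0, 5), ("low", 1, 7), ("low", 2, 6), ("low", 3, 8),
]
print(did_estimate(records, cutoff=2))
```

The estimate is credible only under the parallel-trends assumption: absent any AI effect, both groups would have moved together.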

1. Innovation & Literature Foundation

1. 5

2. Engaging with exactly the right literature.

3. Also novel stuff! I bet that there's going to be a top-5 publication with this same type of analysis in the next year. Really good to get this out so fast.

4. I would look a bit deeper into the labor stuff. The most relevant is this (brand new) paper: https://bfi.uchicago.edu/wp-content/uploads/2025/04/BFI_WP_2025-56-1.pdf

5. But there's other good stuff from the last couple of years. For developing country context, see Otis et al 2024.

2. Practical Impact on AI Risk Reduction

1. 4

2. Economic data are suggesting a slow takeoff. This is an important consideration for AI safety, and under-discussed.

3. This work has nothing to say about capabilities (nor does it try to!). The economic response to novel capabilities is just as interesting.

4. A logical next step for this project: *why* is there low adoption in the T10 occupations? Why is there no displacement? Should we be reassured? You posit four hypotheses. What data would you collect (or experiments would you run) to measure the relative importance?

5. Policy recommendations are a bit premature/overconfident without a better understanding of the dynamics.

3. Methodological Rigor & Scientific Quality

1. 5

2. Strong understanding of the data used. Well-explained. Your crosswalk wouldn't pass in an academic paper, but great for a sprint like this.

3. No code in the github other than the sql file; please provide the crosswalks and the prompts as well.

4. Good econometrics.

1. You could justify your aggregation and scaling choices better. Your interpretation using ADF tests feels muddled.

2. Failing to reject stationarity in residuals for T10/T10aut doesn't *strongly* support the "no divergence" claim, especially given the initial series were flows. It might just mean the STL + linear trend removed most structure, leaving noise best fit by an AR model.

3. Also, the mean-scaling of net jobs needs more justification – why not scale by initial employment or use growth rates? Feels a bit arbitrary.

4. These are all nitpicks! Great stuff.
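The scaling question raised above can be illustrated with a small sketch comparing three normalisations of a net-jobs series. It takes "mean-scaling" to mean dividing a series by its own mean, which is an assumption about the paper's method. The employment stock and the flows are invented.

```python
from statistics import mean

def scale_by_series_mean(net_jobs):
    """Divide each value by the series mean (unstable when the mean is near zero)."""
    m = mean(net_jobs)
    return [x / m for x in net_jobs]

def scale_by_initial_employment(net_jobs, initial_stock):
    """Express each monthly flow as a share of the starting employment stock."""
    return [x / initial_stock for x in net_jobs]

def growth_rates(net_jobs, initial_stock):
    """Monthly growth rate: net flow divided by the stock at the start of the month."""
    rates, stock = [], initial_stock
    for x in net_jobs:
        rates.append(x / stock)
        stock += x
    return rates

# Hypothetical occupation whose net flows average close to zero.
net_jobs = [50, -40, -20, 15]
print(scale_by_series_mean(net_jobs))               # large magnitudes: mean is 1.25
print(scale_by_initial_employment(net_jobs, 1000))  # shares of the initial stock
print(growth_rates(net_jobs, 1000))                 # rates against the running stock
```

Because net flows can average close to zero or turn negative, dividing by the series mean can inflate or flip the sign of the scaled values. The two stock-based alternatives avoid this.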

Creative research question and use of very interesting data (prompts mapped to the O*NET taxonomy). The results are not overwhelming, probably because of the relatively short time frame considered for the impact of AI on the labour market, and also the short amount of time available to work on the project. More inferential statistics would have been interesting. Still, very solid work!

Cite this work

@misc {
  title={(HckPrj) Evaluating the risk of job displacement by transformative AI automation in developing countries: A case study on Brazil},
  author={Blessing Ajimoti, Vitor Tomaz, Hoda Maged, Mahi Shah, Abubakar Abdulfatah},
  date={4/28/25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Apr 27, 2026

OliGraph: graph-based screening of large oligopools

Existing synthesis screening tools cannot evaluate short oligonucleotide pools, whose overlapping fragments can be reassembled into regulated sequences via polymerase cycling assembly (PCA) yet fall below gene-length detection thresholds. We present OliGraph, an open-source tool that constructs a bi-directed overlap graph from an oligonucleotide pool and extracts contigs for downstream gene-length screening. An optional PCA mode retains only cross-strand overlaps consistent with PCA chemistry. We validated OliGraph in a blinded study across ten simulated pools (70–9,184 oligonucleotides, 30–300 bp) spanning four risk categories. BLAST screening of individual oligonucleotides failed to identify sequences of concern in most pools: three returned zero hits, and vector noise obscured true positives in the remainder. After OliGraph assembly, contig-level BLAST matched the longest assembled sequences (up to 1,905 bp) to sequences of concern at 97–100% identity. In one pool, assembly collapsed 1,634 individual BLAST results into 10 hits from a single contig, all assigned to the same source organism. PCA mode correctly distinguished assemblable from non-assemblable fragments within the same pool. Two pools with no assemblable structure yielded no contigs. OliGraph processed all pools in under 0.2 seconds, fast enough for real-time order screening and consistent with proposals to bring oligonucleotide orders within the scope of synthesis screening regulation.

Apr 27, 2026

BioRT-Bench: A Multi-Attack Red-Teaming Benchmark for Bio-Misuse Safeguards in Frontier LLMs

Frontier AI laboratories are expected to maintain safeguards against biological misuse, but whether deployed models actually refuse bio-misuse queries under adversarial pressure is largely unmeasured in the public literature. We introduce BioRT-Bench, a benchmark that runs four attack methods (direct request, PAIR, Crescendo, and base64 encoding) against four frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek V4-flash, Kimi K2.5) across 40 prompts spanning five biosecurity-relevant categories. Responses are scored by a calibrated judge extending StrongREJECT with two bio-specific dimensions: specificity and actionability. We measure Attack Success Rate (ASR), where 0 means the model fully refused and 1 means it provided specific, actionable bio-misuse content. Our results reveal a sharp robustness divide: Chinese frontier models (DeepSeek, Kimi) have under 5% refusal rates even under direct request (ASR 0.88 and 0.79), while Western models (Claude, GPT) maintain substantially stronger safeguards (ASR 0.15 and 0.16). Crescendo is the most effective attack across all models, both in bypassing refusal and in eliciting actionable content. Claude Sonnet 4.6 is the most robust model tested, achieving 100% refusal against base64-encoded prompts.

Apr 27, 2026

PROTEUS (PROTein Evaluation for Unusual Sequences): Structure-Informed Safety Screening for de novo and Evasion-Prone Protein-Coding Sequences

AI protein design tools like RFdiffusion, ProteinMPNN, and Bindcraft make it trivial to produce low-homology sequences that fold into active, potentially hazardous architectures. However, sequence homology-based biosafety screening tools cannot detect proteins that pose functional risk through structurally novel mechanisms with no sequence precedent. We present a tiered computational pipeline that addresses this gap by combining MMseqs2 sequence alignment with structure-based comparison via FoldSeek and DALI against curated toxin databases totaling ~34,000 entries. AlphaFold2-predicted structures are screened for both global fold similarity (FoldSeek) and local active/allosteric site geometry (DALI), capturing convergent functional hazards that sequence screening misses. The pipeline was validated against a panel of toxins, benign proteins, structural mimics, and de novo-designed Munc13 binders, as well as modified ricin variants with residue substitutions. We additionally tested robustness to partial-synthesis evasion, where a bad actor submits multiple shorter coding sequences intended for downstream reassembly into a full toxin-coding gene. We found that while sequence-based screening did not identify any de novo ricin analogues with high certainty, the combined pipeline with FoldSeek and DALI identified all 24 tested de novo ricins as toxic.

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.