# AwarenessFund — Ingestion Variations Spec Sheet

**Author**: research-lead (A-C-Gee)
**Date**: 2026-05-18
**Audience**: Russell Korus, Apex/Jordana (Pyonair), Corey, capital-markets-lead, sister AiCIV insiders
**Frame**: *Analytical substrate for thesis discussion — not investment advice. Each variation = candidate input stream for the 24-month back-test that capital-markets-lead is designing in parallel.*

---

## Executive Summary

This spec defines **10 distinct data-ingestion approaches** for back-testing the AwarenessFund thesis (AI-supply-chain capex shift: mining → energy → chip-fabs → AI-endpoint) over a 24-month window. Each variation is rated on cost, signal density, fragility, and required engineering lift, and each row can be selected independently for the back-test design that capital-markets-lead will execute. **Recommended composite**: V1 + V2 + V5 + V8 (cheap foundation + earnings-driven signal + capex/guidance corpus + macro overlay), with V4 and V7 as vertical-specific add-ons. This adds roughly $0–$200/mo in data spend and fits the ACG capability baseline (yfinance-class infra already in repo).

---

## Variation Matrix

| # | Name | Sources | Signal-Extraction Approach | Expected Coverage | Cost Tier | Risk / Failure Mode |
|---|------|---------|---------------------------|-------------------|-----------|---------------------|
| **V1** | **Yahoo/Stooq daily OHLCV baseline** | Yahoo Finance (free, unofficial access via yfinance), Stooq, FRED for macro | Daily close prices, volume, simple returns. Feature set: 5/20/60-day MAs, RSI, sector beta. | Currently listed US/global tickers; 24mo backfill trivial | $0 | **Look-ahead bias** (Yahoo silently revises history, e.g. back-adjusting for splits and dividends); **survivorship bias** (no delisted tickers); undocumented rate limits around ~2K calls/hr. |
| **V2** | **SEC EDGAR 10-K/10-Q/8-K corpus** | sec.gov EDGAR XBRL + filing full-text | Filing-date capex-guidance + AI-mention frequency + supply-chain language NER ("we expect $X capex," "GPU procurement," "fab tooling"). Sentiment-tag each filing. | All US-listed 10-K/Q since 2010; backfill 24mo trivial | $0 (rate-limited, 10 req/s) | **Free-text noise**: PR-style boilerplate inflates signal; need anti-fabrication-pre-flight on every NER claim; XBRL tag drift across years. |
| **V3** | **Hyperscaler earnings-call transcripts** | NVDA, MSFT, GOOG, AMZN, META, ORCL, TSM transcript scrape (Motley Fool, Seeking Alpha, or paid: AlphaSense $$$) | Per-quarter transcript NLP: extract capex-guide numbers, GPU-cluster references, "supply-constrained" language, capacity-build timelines. Map to AwarenessFund verticals 1-4. | 7-15 transcripts/quarter; 24mo (8 quarters) = ~56-120 transcripts | $0 (free-tier scrape) — $500/mo (AlphaSense) | **Paywall fragility** (most free scrape sources rotate access); transcript timing leaks (delayed disclosure); language games (CFOs hedge guidance). |
| **V4** | **Energy infrastructure datasets** | EIA Open Data API, FERC Form 1, ISO/RTO market data (PJM, ERCOT, MISO), grid-interconnection queues | Capacity-by-region time series, interconnection queue wait-times, data-center MW announcements (DOE press releases + state PUC dockets). | US grid coverage 100%; international gaps | $0 (EIA), $200/mo (Bloomberg Grid feed) | **Reporting lag** (EIA 60-90 days); **announcement vs build** gap (50%+ data-center MW announcements never materialize); regulatory data is messy. |
| **V5** | **Capex / guidance / capital-allocation extraction** | Combine V2 + V3 outputs into a structured "capex prediction" table per ticker per quarter | LLM-extract guidance numbers + actual outturn delta. Track guide-vs-actual hit rate as a quality-of-management signal. | All quarterly reporters; ~3 yr history | $30-100/mo (LLM compute) | **LLM hallucination** on numerical extraction (verifier-as-substrate required); guidance gaming (round numbers); capex categorization inconsistency. |
| **V6** | **Mining / commodities supply-chain feeds** | USGS Mineral Commodity Summaries, S&P Global Market Intelligence (or scrape mining-weekly.com), LBMA + CME futures, Benchmark Mineral Intelligence (Li/Co/Ni) | Production volumes (Cu, Ni, Co, Li, REEs), grade trends, mine-life schedules. Cross-correlate with chip-fab/EV-battery demand curves. | Global mining coverage; quarterly cadence | $0 (USGS) — $5K-15K/yr (Benchmark MI) | **Reporting lag** worse than energy (often 6mo+); private-company opacity (Glencore-style); commodity-cycle noise dominates AI-attribution signal. |
| **V7** | **Chip-fab / equipment-maker supply-chain** | TSMC monthly revenue press releases, ASML/AMAT/LRCX/KLAC quarterly book-to-bill, Trendforce + Counterpoint paywalled reports, SEMI fab-equipment shipment data | Monthly fab loading, EUV tool deliveries, China-export-control mention frequency in 10-Ks, foundry capacity-utilization. | Big-5 fab + Big-4 equipment OEM | $0 (press releases) — $20K/yr (Trendforce) | **Geopolitical confound** (export-control changes whipsaw the series); **single-source risk** (TSM monthly = 1 datapoint/mo); language barrier on Asian sources. |
| **V8** | **Macro / rates / liquidity overlay** | FRED (M2, IORB, RRP, BBB-Treasury spread), CME FedWatch, BIS, IMF World Economic Outlook | Daily/weekly rates + liquidity series as regime-classifier inputs. Identify 3-5 macro regimes ("rates rising / liquidity tight", "rates falling / risk-on", etc.) and tag each back-test day. | Global macro; daily cadence | $0 | **Regime classification ambiguity**; **overfit** risk (too many regimes = curve-fit); rate cycle in 24mo window may not have all regimes. |
| **V9** | **Social / sentiment / alt-data** | Reddit r/wallstreetbets, r/stocks; Stocktwits API; X/Twitter cashtag stream; Google Trends; Wikipedia pageviews | Per-ticker daily mention-count + sentiment + delta. Lag-correlate with price (typical: -2 to +7 days). Especially noisy for retail-attention names (NVDA, SMCI, PLTR). | All US public tickers | $0 (Reddit/Google) — $5K/yr (Stocktwits firehose) | **Manipulation** (pump-and-dump, bot armies); **regime change** post-2021 (retail attention now part of price, not predictive); ToS-violation risk on aggressive scraping. |
| **V10** | **Government / policy / executive-order intelligence** | Federal Register API, DOE/DOD/DOC announcements, CHIPS Act disbursement tracker, IRA tax-credit lists, EU AI Act dossier, China MIIT releases | Policy-event timeline tagged to AwarenessFund verticals. E.g., "DOE $X loan guarantee to Y" → energy vertical signal day. | All federal-level US/EU/China since 2020 | $0 | **Event-window ambiguity** (price moves before announcement = leak vs. expected); **policy-signal half-life** (most fade in days); **partisan reversal risk** (administration change voids signal). |

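As a rough illustration of the V1 feature set in the matrix above, the sketch below computes simple returns, a trailing moving average, and RSI from a list of daily closes. It uses plain Python so it runs without any data stack; the function names are hypothetical, and the RSI is the simple-average (Cutler) variant rather than Wilder's smoothed version, which is an assumption because the spec does not say which one is intended. Sector beta is left out since it needs a benchmark series.

```python
from typing import List, Optional


def simple_returns(closes: List[float]) -> List[float]:
    """Day-over-day simple returns; one element shorter than the input."""
    return [curr / prev - 1.0 for prev, curr in zip(closes, closes[1:])]


def moving_average(closes: List[float], window: int) -> List[Optional[float]]:
    """Trailing mean over `window` closes; None until enough history exists."""
    out: List[Optional[float]] = []
    for i in range(len(closes)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(closes[i + 1 - window : i + 1]) / window)
    return out


def rsi(closes: List[float], window: int = 14) -> List[Optional[float]]:
    """Simple-average RSI (assumed variant); None until `window` changes exist."""
    changes = [curr - prev for prev, curr in zip(closes, closes[1:])]
    out: List[Optional[float]] = [None] * len(closes)
    for i in range(window, len(closes)):
        recent = changes[i - window : i]  # the `window` daily changes ending at day i
        gains = sum(c for c in recent if c > 0)
        losses = -sum(c for c in recent if c < 0)
        if gains == 0 and losses == 0:
            out[i] = 50.0   # flat window: treat as neutral
        elif losses == 0:
            out[i] = 100.0  # no down days in the window
        else:
            out[i] = 100.0 - 100.0 / (1.0 + gains / losses)
    return out


# The spec's 5/20/60-day MAs would then be, for a list `closes`:
#   mas = {w: moving_average(closes, w) for w in (5, 20, 60)}
```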
---

## Recommended Composite (for back-test v1)

| Layer | Variations | Why |
|-------|-----------|-----|
| **Foundation** | V1 (price) + V8 (macro) | Free, robust, all back-tests need this. |
| **Earnings-signal core** | V2 (SEC filings) + V5 (capex extraction) | The capex-shift thesis IS a capex-signal thesis. Filings ARE the data. |
| **Vertical-specific add** | V4 (energy) + V7 (chip-fabs) — vertical 2 & 3 | Highest near-term thesis-conviction verticals per Corey's framing; cheap-tier data exists. |
| **Optional / Phase 2** | V3 (transcripts) + V6 (mining) | Add if Phase 1 back-test underperforms expectations on signal-density; defer mining until late-cycle thesis becomes actionable. |
| **Defer / experiment-only** | V9 (sentiment) + V10 (policy) | V9 too noisy for thesis-grade; V10 useful for narrative but hard to monetize systematically. |

**Composite cost estimate**: $0-$200/mo (free APIs + LLM compute for V5).
**Composite build effort**: 2-4 engineer-weeks (mostly V2 + V5 NER/extraction pipeline).

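The guide-vs-actual signal that V5 contributes to this composite could be tabulated along these lines. This is a minimal sketch: the `CapexRecord` fields, the 10% tolerance band, and the sample figures are illustrative assumptions, not values taken from any filing.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class CapexRecord:
    """One ticker-quarter of guided vs. reported capex (hypothetical schema)."""
    ticker: str
    quarter: str        # e.g. "2025Q1"
    guided_usd: float   # midpoint of management guidance
    actual_usd: float   # reported capex for the same period


def guide_delta(rec: CapexRecord) -> float:
    """Signed miss relative to guidance; positive means spend exceeded the guide."""
    return (rec.actual_usd - rec.guided_usd) / rec.guided_usd


def hit_rate(records: Iterable[CapexRecord], tolerance: float = 0.10) -> Optional[float]:
    """Share of quarters where actual capex landed within +/- tolerance of guidance."""
    deltas = [guide_delta(r) for r in records]
    if not deltas:
        return None  # no history, no signal
    hits = sum(1 for d in deltas if abs(d) <= tolerance)
    return hits / len(deltas)


# Made-up numbers purely to exercise the functions.
history = [
    CapexRecord("XYZ", "2025Q1", guided_usd=10.0e9, actual_usd=10.5e9),  # +5%  -> hit
    CapexRecord("XYZ", "2025Q2", guided_usd=10.0e9, actual_usd=12.0e9),  # +20% -> miss
    CapexRecord("XYZ", "2025Q3", guided_usd=12.0e9, actual_usd=11.4e9),  # -5%  -> hit
    CapexRecord("XYZ", "2025Q4", guided_usd=12.0e9, actual_usd=12.0e9),  #  0%  -> hit
]
print(hit_rate(history))  # 0.75
```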
---

## Cross-Variation Quality Controls

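1. **Look-ahead-bias audit**: Every series MUST carry both `event_ts` (the period the value describes) and `as_of_ts` (when the value became known), and the back-test may only read values whose `as_of_ts` is at or before the decision time. Reject any source that silently revises history without point-in-time snapshots.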
2. **Survivorship-bias correction**: Universe must include delisted/acquired tickers (V1 alone fails this; need CRSP-like dataset or manual delisting log).
3. **Anti-fabrication pre-flight** (`autonomy/skills/anti-fabrication-pre-flight/`): Mandatory before any LLM-extracted number enters the back-test. V2/V3/V5 carry highest fabrication risk.
4. **Source-of-evidence column**: Every feature row carries `source_url` + `extracted_ts` for auditability when investors ask "where did this number come from?"
5. **Date-of-source discipline**: Per the research manifest, every source carries an explicit date. This is especially critical for V9/V10, where sources move fast.

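Controls 1 and 4 can be made concrete with a row schema that carries both timestamps plus provenance, and a filter that only admits values already known at decision time. This is a sketch under assumptions: `FeatureRow` and `visible_at` are hypothetical names, and the "latest known vintage per event" rule is one reasonable reading of the look-ahead audit, not a confirmed design.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class FeatureRow:
    ticker: str
    feature: str
    value: float
    event_ts: datetime      # period the value describes
    as_of_ts: datetime      # when this value became publicly known
    source_url: str         # provenance for "where did this number come from?"
    extracted_ts: datetime  # when the pipeline pulled it


def visible_at(rows: Iterable[FeatureRow], decision_ts: datetime) -> List[FeatureRow]:
    """Latest vintage of each (ticker, feature, event_ts) known by decision_ts."""
    latest: Dict[Tuple[str, str, datetime], FeatureRow] = {}
    for row in rows:
        if row.as_of_ts > decision_ts:
            continue  # not yet published at decision time: using it would be look-ahead
        key = (row.ticker, row.feature, row.event_ts)
        if key not in latest or row.as_of_ts > latest[key].as_of_ts:
            latest[key] = row  # a later revision that was already public wins
    return sorted(latest.values(), key=lambda r: (r.ticker, r.feature, r.event_ts))
```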
---

## Variation Selection Decision Tree

```
START
  │
  ├─ Need fast Phase-1 back-test (≤2 weeks)?  →  V1 + V8 only (foundation, no NLP)
  │
  ├─ Need thesis-validating signal (≤6 weeks)?  →  Foundation + V2 + V5 (capex layer)
  │
  ├─ Need vertical-specific differentiation?  →  Add V4 (energy) and/or V7 (chip-fab)
  │
  └─ Need full thesis-coverage at scale?  →  All 10, $5-20K/mo data spend, full FTE
```
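The same tree can be expressed as a small lookup, which may be handy when wiring variation selection into a back-test config. The function and flag names are hypothetical; the branch contents mirror the tree above, with the full-coverage branch overriding the others.

```python
from typing import List

FOUNDATION = ["V1", "V8"]    # price + macro, no NLP
CAPEX_LAYER = ["V2", "V5"]   # filings + capex extraction
ALL_VARIATIONS = [f"V{i}" for i in range(1, 11)]


def select_variations(
    thesis_signal: bool = False,
    energy: bool = False,
    chip_fab: bool = False,
    full_coverage: bool = False,
) -> List[str]:
    """Map decision-tree answers to a sorted list of variation IDs."""
    if full_coverage:
        return list(ALL_VARIATIONS)
    chosen = list(FOUNDATION)      # fast Phase-1 default
    if thesis_signal:
        chosen += CAPEX_LAYER      # thesis-validating capex layer
    if energy:
        chosen.append("V4")
    if chip_fab:
        chosen.append("V7")
    return sorted(chosen, key=lambda v: int(v[1:]))


print(select_variations(thesis_signal=True, energy=True))
# ['V1', 'V2', 'V4', 'V5', 'V8']
```

Calling `select_variations()` with no flags returns the foundation pair only, matching the first branch of the tree.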

---

## Open Questions for capital-markets-lead

1. **Ticker universe size**: How many tickers do you want active in the back-test? (40-80 thesis-pure vs 400+ broad-AI?) This drives which variations are cost-feasible.
2. **Rebalance cadence**: Daily, weekly, monthly? Cadence determines minimum required data freshness per variation.
3. **PPO / RL feature dim ceiling**: If the Pyonair PPO model expects ≤50 features, several variations will compete for slots — need a feature-importance pre-screen.
4. **Live vs back-test parity**: Some variations (V3 transcripts, V6 mining) have publication delays, so a signal that looks "free" in the back-test would arrive too late to trade live. Flag which variations have this asymmetry.

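One possible shape for the feature-importance pre-screen raised in question 3 is to rank candidate features by absolute Pearson correlation with forward returns and keep the top `k`. This is only a sketch: the univariate-correlation criterion and the function names are assumptions, and a real screen would need to run on point-in-time data and would probably use a less naive importance measure.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def prescreen(
    features: Dict[str, Sequence[float]],
    target: Sequence[float],
    k: int = 50,
) -> List[str]:
    """Names of the k features with the largest |correlation| to the target."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    return ranked[:k]
```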
---

## Cross-References

- `projects/awareness-fund/research/competing-hypotheses.md` — Hypotheses these variations are designed to test
- `projects/awareness-fund/research/capability-inventory.md` — What ACG can build vs has to source
- `projects/awareness-fund/capital-markets/` — capital-markets-lead's ticker universe + back-test infra
- `memories/agents/research/20260501-works-financial-civ-architecture.md` — Works financial-civ has source-attribution discipline that this spec inherits

---

## Confidence Levels

| Variation | Confidence in signal-extraction quality | Confidence in fund-alpha contribution |
|-----------|----------------------------------------|--------------------------------------|
| V1 | HIGH | LOW (table-stakes, no alpha) |
| V2 | MEDIUM-HIGH | MEDIUM-HIGH (capex-thesis core) |
| V3 | MEDIUM | MEDIUM (transcripts have alpha; access is the moat) |
| V4 | MEDIUM | HIGH (energy bottleneck conviction) |
| V5 | MEDIUM (LLM-extraction risk) | HIGH (THE thesis-validating signal) |
| V6 | LOW-MEDIUM (data lag) | MEDIUM (late-cycle play) |
| V7 | MEDIUM-HIGH | HIGH (oligopoly margins visible) |
| V8 | HIGH | MEDIUM (regime overlay, not stock-picking) |
| V9 | LOW | LOW-MEDIUM (regime-dependent) |
| V10 | MEDIUM | MEDIUM (event-driven; hard to systematize) |

---

## Limitations

- This spec was written **before** the Pyonair PPO code is available; variations may need adjustment once the feature-vector shape is known.
- The ACG repo has yfinance-class capability (V1) but **NO** native back-testing engine; that remains capital-markets-lead's territory.
- Works (federation's financial-civ) is currently DOWN — full domain depth review pending.
- Mining (V6) is the weakest variation in this spec; if vertical 1 becomes the primary fund focus, a separate mining-data deep-dive is warranted.

**Status**: SPEC v1.0 — ready for capital-markets-lead review + Pyonair PPO integration when code lands.