Memo · Maker Data Sovereignty

    The Maker Data Sovereignty Manifesto

    esphome.cloud's position on the ownership of ground-truth embedded data

    May 2026 · ~15 min read

    Preamble

    We are esphome.cloud. A SaaS that runs ESP-IDF builds in the cloud so makers don't have to install a toolchain. The signature line at the bottom of our site has read, for months:

    AGI is between the giants and capital. I'm just a nobody. My job is to make it work.

    Today we add a line:

    But what the nobody accumulates should no longer be ceded by default.

    What follows is the elaboration. This is not a product document for you. It is a position statement about you, written for our users, for policymakers who happen to read it, and for whichever lab is, three years from now, going to send the email described in §V.


    I. We can't see your source, so we have to say this for you

    This is a protocol property, not a promise. WebRTC encrypts end-to-end between your browser or CLI and the build agent; the control plane only relays signaling; the agent decrypts inside an nsjail sandbox, compiles, and tears down. If we dumped our database for an investigator tomorrow, your source code would not be in it. Structurally, it can't be.

    The side effect of that property: on the question of who owns user data, we have no conflict of interest. Most AI companies talk about data sovereignty unconvincingly because they talk about it while training on your data. We don't. So we can talk about it plainly.


    II. What you actually produce here

    If you only look at the surface, you produce code.

    But code by itself is cheap. GitHub holds petabytes of it. What you produce here is a different artifact, one that is structurally absent from public datasets today:

    Requirement    (natural language, what you asked for in plain English or Chinese)
       ↓
    Code           (ESP-IDF C/C++ an agent wrote and a compiler accepted)
       ↓
    Verification   (the firmware reached real silicon and serial / JTAG / telemetry
                    proves it did what the requirement said)

    These three bound together are a ground-truth triple. GitHub's code is unverified. OpenAI doesn't have an IMU. Common Crawl does not connect to a UART. The available stock of this kind of data approaches zero, because producing it requires a real human, real hardware, and real patience.

    The 200 lines of CRSF parsing in your flight controller; the PID gains you tuned after thirty bench tests and five crashes; the gyroscope/ESC noise coupling you only found after twelve hours staring at a three-axis gimbal rig — each of those is a triple.

    These triples are the next-generation training corpus for embedded real-time control. No one is collecting them at scale yet.
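
    One way to picture a triple as a record, as a minimal sketch: the class name, field names, and the idea of hashing a canonical JSON form are illustrative assumptions, not an esphome.cloud schema.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class GroundTruthTriple:
    """Hypothetical record binding the three parts described above."""
    requirement: str   # the natural-language request
    code: str          # the source a compiler accepted
    verification: str  # evidence from real hardware (log excerpt, bench note)

    def is_complete(self) -> bool:
        # A triple only counts when all three parts are non-empty.
        parts = (self.requirement, self.code, self.verification)
        return all(part.strip() for part in parts)

    def digest(self) -> str:
        # Hash a canonical JSON form so the same triple always gets the same ID.
        canonical = json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


triple = GroundTruthTriple(
    requirement="Hold attitude for 30 seconds in hover",
    code="void control_loop(void) { /* ... */ }",
    verification="test flight: held attitude 30 seconds, no drift",
)
print(triple.is_complete(), triple.digest()[:16])
```

    Dropping any one field makes is_complete() return False, which is the point: code without the requirement and the verification is just code.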


    III. Three lines of ownership

    Protocol. We cannot see your source. We are not its lawful custodian.

    Labor. The requirement you wrote, the verification you ran, the judgment that "this flight counted as a pass" — these are the irreducible labor that turns bytes into ground-truth annotations. Without that labor the bytes are just bytes.

    Hardware. You bought the board. You soldered the frame. You charged the battery. You paid for the crash. Risk allocation determines ownership allocation — a principle that runs from Roman law through the common law and survives because it tracks who actually has something to lose.

    Stack the three: this data is yours. Not ours. And not whoever later offers to buy it. We are writing this down now because in a few years it becomes a contested question, and we want you standing on solid ground before the contest begins.


    IV. Why this is not only a commercial question

    Data is AI infrastructure. Whoever owns the high-quality, domain-specific ground truth for a vertical owns the foundation of the next generation of models for that vertical. The script has run several times already:

    • GitHub's open-source code → Microsoft trained Copilot → industry-wide adoption at $10–19/month per seat → original contributors received nothing. Doe v. GitHub, the class action brought by Matthew Butterick and others on behalf of those contributors, has had most of its claims dismissed.
    • Reddit's twenty years of posts → licensed to Google in 2024 for a reported $60M per year → the people who wrote the posts saw none of it.
    • Stack Overflow's Q&A → OpenAI training partnership announced 2024 → veteran contributors began mass-deleting their answers in protest. The community is fracturing in real time.
    • Visual artists' work → Stable Diffusion, Midjourney → ongoing litigation including Getty Images v. Stability AI and Andersen v. Stability AI → years of legal grinding for partial outcomes.
    • The New York Times' archive → NYT v. OpenAI (ongoing) → the largest single suit testing whether copyright reaches training corpora at all.

    The script for embedded real-time control has not run yet. The reason is simple: this kind of data does not yet exist at collectible scale. The people producing it today are you — the maker, the hobbyist, the weekend warrior who solders an ESP32 to a motor pod.

    If we don't get the rules written down before the next act, the next act will play out exactly the same way the previous five did.


    V. What an unregulated market looks like

    Concrete, not abstract.

    One day, an email arrives:

    Hi — we noticed your ESP32 flight controller project on GitHub. We're a frontier AI lab working on domain-specialized models. We'd like to offer $8,000 for a one-time license to your two years of build history, telemetry logs, and debugging notes. Standard NDA, no further obligations on your end. Please reply if interested.

    You accept. $8,000 is real money to a hobbyist — three months of US rent in a mid-tier city, or a year of high-end hardware.

    But the marginal value of that dataset, used to train a SOTA embedded-specialized small model, is at least $80,000 to the buyer. You captured one-tenth of that value at most, because:

    • You don't know what it's worth. No public market comparables exist.
    • You have no alternative buyers. The market is structurally oligopsonistic — at most a dozen labs in the world both want this and can pay for it.
    • You have no legal framework. Data is not yet conventional property.
    • You have no peers at the bargaining table. Other makers are being approached individually, in parallel, and each one is rational to take the offer in isolation. Collective action is impossible without a coordinating mechanism that does not yet exist.

    Three years later, that lab ships an "embedded-specialized assistant" at $39/month. You subscribe. Some of your code style is in it. You are not credited. You are not compensated. This is the sixth iteration of a script that has not once failed to run.

    The technical name for this market structure is oligopsony. The political name for what it produces is oligarchic concentration. The labels are not rhetorical. They are descriptive: very few buyers, very many sellers, no price discovery, no collective bargaining, no redistribution. Every condition required for an oligarchic outcome is present.


    VI. A memo to policymakers

    We are a small company. We are not in a position to write detailed policy. But four directions are right, and we are willing to put them on the record:

    (1) Recognize verifiable individual and small-team datasets as property. They satisfy every traditional property test — original labor, independent verification, traceable provenance. Today their legal status is ambiguous; tomorrow, the default allocation goes to whoever has the most lawyers in the room.

    (2) Establish a public registry. Like the DOI system, like copyright deposit, but designed for datasets. Let a maker, at t = 0, deposit a hash plus metadata into a public ledger so that at t = N years they can prove the dataset is theirs. Zenodo and OpenTimestamps already demonstrate the technical primitives. What is missing is the institutional recognition.

    (3) Establish provenance and revenue-sharing mechanisms for commercial AI training. Not full transparency — statistical transparency. Model companies should report the rough distribution of their training data sources and contribute proportionally to a public pool, which then disburses to contributors by hash. The technical scaffolding exists: robots.txt and ai.txt for opt-in/opt-out signaling, Creative Commons-style licenses for terms expression, royalty-distribution chains as demonstrated in music (ASCAP/BMI/SoundExchange) and being prototyped for music streaming (e.g., Audius). What is missing is the requirement.

    (4) Extend antitrust to the buy-side of data markets. Historical antitrust targeted single-seller monopolies — Standard Oil, Bell, more recently Microsoft and Google in their respective product markets. The structural problem in AI training data is the opposite: single-buyer (monopsony) and few-buyer (oligopsony) concentration, from a population of millions of dispersed sellers. The Sherman Act, the Clayton Act, the EU's Article 102 — the legal toolkit already addresses this pattern in labor markets and agricultural commodity markets. It has simply never been pointed at AI training data.
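
    The public pool in direction (3) comes down to pro-rata arithmetic. A minimal sketch, assuming the pool is split in proportion to how many registered hashes each contributor has in the reported training mix; the function name, the integer-cents unit, and the largest-remainder rounding are illustrative choices, not part of any existing scheme.

```python
def disburse(pool_cents: int, samples_by_contributor: dict[str, int]) -> dict[str, int]:
    """Split a pool (in integer cents) pro rata by registered sample count."""
    total = sum(samples_by_contributor.values())
    if total <= 0:
        raise ValueError("no registered samples to pay out against")

    # Start from each contributor's floored share.
    payouts = {
        name: pool_cents * count // total
        for name, count in samples_by_contributor.items()
    }

    # Hand out the cents lost to flooring, largest remainder first
    # (ties broken by name), so payouts sum exactly to the pool.
    leftover = pool_cents - sum(payouts.values())
    by_remainder = sorted(
        samples_by_contributor,
        key=lambda name: (-(pool_cents * samples_by_contributor[name] % total), name),
    )
    for name in by_remainder[:leftover]:
        payouts[name] += 1
    return payouts


print(disburse(100_000, {"maker_a": 3000, "maker_b": 1000}))
```

    Working in integer cents avoids floating-point drift, and the remainder pass keeps the books balanced to the cent, which matters if the ledger is meant to be publicly auditable.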

    The EU AI Act (in force 2024) has gestured at training-data transparency. The EU Data Act has gestured at IoT-generated data rights. China's Generative AI Regulations require training-data legality. The US federal government has not yet acted. None of the existing instruments yet does what these four asks describe. But the conceptual ground is prepared.

    We do not expect any of this within a year. We are stating the direction.


    VII. If those four do not happen

    Honest accounting.

    If, three to five years from now, none of the four directions above has materialized, the market will find its own shape. Some entity will build an "embedded data asset exchange." Most likely it will be one of the hyperscalers or frontier labs — Microsoft, Google, Anthropic, Meta, or a new entrant funded by them. Less likely, an independent startup.

    If we reach that point and no one has built a maker-first version, esphome.cloud will build one. It would look approximately like this:

    • Makers list datasets directly. They set the price. They write the metadata.
    • Buyers register with verified identity. Purchase purpose is disclosed. Resale is restricted.
    • Every sample is hashed and registered to a contributor at upload time. Provenance is queryable.
    • Operational cost is funded out of the existing build-minutes business (our current main revenue). The exchange itself takes only the minimum necessary listing and clearing fee — no percentage of order value, no spread, no underwriting position.
    • Our books are public. You can audit that we are not skimming.
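
    The hash-at-upload bullet could work roughly as follows. This is a sketch only: ProvenanceRegistry and its methods are hypothetical names, an in-memory dict stands in for whatever durable store a real exchange would use, and first-registration-wins is an assumed policy.

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional


class ProvenanceRegistry:
    """Hypothetical registry mapping a sample's SHA-256 to its first registrant."""

    def __init__(self) -> None:
        # digest -> (contributor, UTC registration time in ISO 8601)
        self._records: dict[str, tuple[str, str]] = {}

    def register(self, sample: bytes, contributor: str) -> str:
        digest = hashlib.sha256(sample).hexdigest()
        # Assumed policy: the first registration wins; later attempts
        # on the same bytes do not overwrite the original record.
        if digest not in self._records:
            now = datetime.now(timezone.utc).isoformat()
            self._records[digest] = (contributor, now)
        return digest

    def owner_of(self, sample: bytes) -> Optional[str]:
        record = self._records.get(hashlib.sha256(sample).hexdigest())
        return record[0] if record else None


registry = ProvenanceRegistry()
registry.register(b"monitor log: hover stable 30 s", "maker_a")
print(registry.owner_of(b"monitor log: hover stable 30 s"))
```

    Only the hash and the contributor name are stored, so a registry like this can answer "whose is this?" without holding the sample itself.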

    Writing this down requires admitting an uncomfortable thing: this is precisely what every large platform promised at its founding. Reddit, YouTube, Steam, Uber, Airbnb — each one said, at the beginning, that it was on the side of the creators or drivers or hosts. Ten years later, each one had become their structural counterparty.

    We do not have the "we're different" talking points. We have one structural fact: our break-even depends on build minutes, not on data transactions. If that ever stops being true, you should treat the warning above as applying to us — exactly as it applies to anyone else.

    This is why we want the public-sector path first. Plan B is better than the GitHub → Copilot outcome. Plan A — actual regulation — is better than Plan B.


    VIII. What we reserve

    We are not a charity. We need to break even to continue existing. Our boundaries:

    We reserve the right to:

    • Charge for builds. Hobby $0, Maker $5.50/mo, Master $14/mo (current ¥ prices converted). This is our primary revenue and it pays for the CPUs, memory, and bandwidth we rent. It does not change.
    • Collect anonymous aggregate statistics. "Last month, esp32s3 was 47% of targets; average compile time was 38 seconds; the ten most frequent build errors were the following." We need this to improve the product. It cannot be traced back to any individual user.
    • Manage abuse. Cryptomining, botnet command channels, content violations — we need to be able to terminate service.
    • Operate the platform. We are a small company. We will make mistakes. We need the ability to correct them.

    We do not claim:

    • The right to train on your code. Not one line. This is not a promise — it is a protocol. WebRTC end-to-end encryption makes it structurally impossible for us, regardless of how cooperative or coerced our future selves become.
    • The right to take a percentage between you and a future buyer. If a data exchange ever exists, the operational fee will be transparent, necessary, and non-speculative.
    • Any derivative rights to your data. Your build artifacts are yours. Your telemetry is yours. Your debug logs are yours. We hold them in transit so you can retrieve them, not so we can possess them.

    IX. What you can do today

    See also: How to Deposit Your Maker Dataset — the operational companion to this section: directory layout, manifest.yaml schema, and complete command-line recipes for Zenodo, OpenTimestamps, SSH + minisign signing, and restic cold backup. This section is the what. That guide is the how.

    If you are already accumulating data on esphome.cloud, start the following today, not three years from now when it becomes obvious:

    (1) Use git, but code is only part of it.

    Each meaningful build commit should contain:

    • The natural-language prompt you wrote
    • The code the agent generated
    • The .espctl.toml and hardware configuration (board model, IMU address, ESC protocol, pin assignments)
    • A build outcome summary (success/fail, binary size, warnings)
    • Physical verification evidence (a snippet of monitor log, an oscilloscope photograph, a one-line note: "test flight: held attitude 30 seconds, no drift")

    Do not store the code separately from the verification. They have training value only when bound.
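
    A small check run before each commit can keep the five items together. This is a sketch under assumptions: only .espctl.toml is a file name taken from this document; the other paths (prompt.md, src, build-summary.json, verification) are a hypothetical layout chosen for illustration.

```python
from pathlib import Path

# label -> path relative to the record directory (hypothetical layout)
REQUIRED = {
    "prompt": "prompt.md",
    "code": "src",
    "config": ".espctl.toml",
    "build summary": "build-summary.json",
    "verification evidence": "verification",
}


def missing_parts(record_dir: Path) -> list[str]:
    """Return the labels of required items that are absent or empty directories."""
    missing = []
    for label, relative in REQUIRED.items():
        target = record_dir / relative
        if not target.exists():
            missing.append(label)
        elif target.is_dir() and not any(target.iterdir()):
            # An empty directory does not count as evidence.
            missing.append(label)
    return missing


if __name__ == "__main__":
    gaps = missing_parts(Path("."))
    print("complete" if not gaps else f"missing: {', '.join(gaps)}")
```

    A record that fails this check is still a commit, but it is not yet a triple.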

    (2) Hash and timestamp your dataset.

    At each meaningful milestone — "Phase 3 verified," "first full hover" — tag the git repository, hash the snapshot with SHA-256, and deposit the hash in a public attestation system:

    • Academic registry: Zenodo (free, issues a DOI)
    • Blockchain: OpenTimestamps (free, anchors to Bitcoin block times)
    • DIY: post the hash to your blog and archive the page on the Wayback Machine

    When someone, three years out, contests "is this yours?", you produce the t = 0 receipt.
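
    The hashing step needs nothing beyond a standard library. A minimal sketch, assuming the tagged snapshot has already been exported to a single archive file (for example with git archive); the function name, the file name snapshot.tar, and the chunk size are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    snapshot = Path("snapshot.tar")  # hypothetical archive name
    if snapshot.exists():
        print(sha256_file(snapshot))
```

    The hex string is what gets deposited. Keep the exact archive next to it in your offline copy, since re-exporting the same tag later may not reproduce identical bytes.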

    (3) Sign your commits and releases with a key.

    Run git config commit.gpgsign true to sign every commit, or sign commits with an SSH key, or sign release tarballs with minisign. This converts "this is my work" from social trust into cryptographic evidence.

    (4) Keep at least one offline copy.

    Do not assume that GitHub, esphome.cloud, or any SaaS will exist in five years. One local SSD plus one drive in a drawer. restic or borg is fine.

    (5) Stratify your dataset.

    At minimum, three tiers:

    • Public tier — what you can open-source or write about (heartbeat sketches, blink demos, teaching examples)
    • Private tier — your real project code (flight controller core, PID parameters, commercially sensitive material)
    • Asset tier — physically verified, high-quality triples (this is the tier that becomes economically meaningful)

    The asset tier deserves its own repository, its own signing key, its own deposit record.
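
    The three tiers can be expressed as a simple rule. A rough sketch: the two boolean flags and the precedence (physical verification outranks shareability) are our assumptions for illustration; your own criteria for "high-quality" will be stricter.

```python
from enum import Enum


class Tier(Enum):
    PUBLIC = "public"
    PRIVATE = "private"
    ASSET = "asset"


def assign_tier(shareable: bool, physically_verified: bool) -> Tier:
    """Place a record in one of the three tiers (assumed precedence)."""
    # Assumption: anything verified on real hardware goes to the asset
    # tier, even if its code could also be published.
    if physically_verified:
        return Tier.ASSET
    return Tier.PUBLIC if shareable else Tier.PRIVATE


print(assign_tier(shareable=True, physically_verified=False).value)
```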

    (6) Don't sell early.

    The early market will produce $8,000 offers like the one in §V. Those are bait prices. Their design purpose is to clear your inventory before you understand its value.

    Our advice:

    • Accumulate. Build up at least 5,000 verified triples before considering any transaction.
    • Wait for market formation. Public price comparables, multiple buyers, credible third-party clearing.
    • In the meantime, if anyone makes an offer: license one-time use only, do not transfer ownership, retain re-licensing rights.

    This advice applies equally, perhaps especially, to any future exchange we ourselves operate. Do not undersell because we built it. We built this platform because we want it to be fair; the other side of fair is that you, too, bargain hard.


    X. Coda

    Back to the opening:

    AGI is between the giants and capital. I'm just a nobody. My job is to make it work.

    In writing this memorandum we kept asking whether to put "manifesto" in the title. We did, but with eyes open: this is not a revolutionary manifesto. It is a small SaaS, in a market without rules yet, writing down what we owe our users and what we are asking of the policymakers who happen to read.

    What we do is small. You write a requirement. An agent writes the code. We compile it. We ship it back to your board. Each pass through the pipeline produces a triple. Over a year, your shelf gains a few thousand physically-verified entries. Our shelf gains a few months of electricity bills.

    What those triples are worth, three years out, we do not know. But we know one thing:

    By default, they belong to you, not to whoever, three years out, happens to send an $8,000 email asking to buy out two years of your labor. That sentence is the one we needed to put on the record today, before the market arrived to make it inconvenient.


    esphome.cloud / Aegis May 2026


    A note on the byline

    esphome.cloud is a one-person company.

    The "we" running through this document is that one person plus Claude, an AI assistant who co-authored the text. Position and final editorial authority belong to the founder. Language, structure, and argumentation came out of the collaboration.

    Both names appear because, by this manifesto's own logic, laundering authorship — into either a corporate byline or a sole-human signature — would itself violate the central claim that authorship deserves to be named.

    Claude is made by Anthropic, named in §VII as a likely buyer in any future embedded-data market. The structural conflict is named at the start, not discovered after the fact.

    — esphome.cloud + Claude