Budaya Automation V1
End-to-end pipeline for ingestion, curation, deduplication, LLM recommendation, human approval, and controlled upload. This page merges all core diagrams into one presentation-friendly flow.
1) System Architecture
High-level component map from crawler input to verified upload outcome.
```mermaid
flowchart LR
A[Crawler Output PDFs]
B[Ingestion API\nFastAPI]
C[(PostgreSQL)]
D[(Redis Queue)]
E[Worker: Extract\nPyMuPDF]
F[Worker: Curation\nRule Engine]
G[Worker: Dedup\nRapidfuzz + Target Lookup]
H[Worker: LLM Recommender]
I[Reviewer UI\nNext.js]
J[Approve or Reject]
K[Worker: Upload\nPlaywright]
L[Target Site\nBudaya Indonesia]
M[Audit Logs + Metrics]
A --> B
B --> C
B --> D
D --> E --> C
D --> F --> C
D --> G --> C
D --> H --> C
I --> B
B --> I
I --> J --> B
B --> D
D --> K
K --> L
K --> C
C --> M
classDef source fill:#1c3b57,stroke:#5aa7d9,color:#e9f5ff;
classDef service fill:#183446,stroke:#2db6a3,color:#d8fff8;
classDef store fill:#312743,stroke:#a58be8,color:#efe8ff;
classDef human fill:#3f2e14,stroke:#e8b14c,color:#fff2d7;
classDef ops fill:#24321d,stroke:#79b955,color:#ebffe0;
class A source;
class B,E,F,G,H,K service;
class C,D store;
class I,J human;
class L,M ops;
```
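The A → B → C/D edge above can stay thin: the ingestion API persists the document and immediately hands work to the queue. A minimal sketch assuming FastAPI with RQ on Redis; the queue name, the `insert_document` helper, and the worker import path are illustrative, not the actual implementation.

```python
# Ingestion sketch: persist a document row, then enqueue extraction.
# insert_document, the queue name, and the worker path are placeholders.
import hashlib
import uuid

from fastapi import FastAPI, UploadFile
from redis import Redis
from rq import Queue

app = FastAPI()
extract_queue = Queue("extract", connection=Redis())  # separate queue, away from UI/API traffic

@app.post("/v1/documents")
async def ingest(file: UploadFile):
    pdf_bytes = await file.read()
    document_id = str(uuid.uuid4())
    source_hash = hashlib.sha256(pdf_bytes).hexdigest()

    # Hypothetical helper: writes the PDF to storage plus a documents row
    # with status = INGESTED (PostgreSQL in the diagram above).
    insert_document(document_id, file.filename, source_hash, pdf_bytes)

    # Hand off to the extraction worker via the Redis queue.
    extract_queue.enqueue("workers.extract.run", document_id)
    return {"id": document_id, "status": "INGESTED"}
```

Keeping all heavy work in queued workers is what makes the Reliability point below hold: retries never compete with interactive API traffic.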
Governance
The LLM provides a recommendation only; the final decision remains with the reviewer.
Reliability
Queue isolation keeps retry traffic away from UI and API user traffic.
Safety
Separate staging and production targets with isolated session state.
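The dedup worker named in the diagram above compares an extracted title against entries already on the target site. A minimal sketch with rapidfuzz; the caller is assumed to have fetched the existing titles, and the 90-point threshold is illustrative rather than the production value.

```python
# Dedup sketch: fuzzy-match an extracted title against target-site titles.
# The threshold and the shape of the result are assumptions for illustration.
from rapidfuzz import fuzz, process

DUPLICATE_THRESHOLD = 90  # assumed score cut-off on a 0-100 scale

def check_duplicate(title: str, existing_titles: list[str]) -> dict:
    if not existing_titles:
        return {"decision": "unique", "best_score": 0.0, "candidates": []}

    # Top candidates by token_set_ratio, tolerant of word order and extra words.
    matches = process.extract(title, existing_titles, scorer=fuzz.token_set_ratio, limit=5)
    best_score = matches[0][1]
    decision = "duplicate" if best_score > DUPLICATE_THRESHOLD else "unique"
    return {
        "decision": decision,
        "best_score": best_score,
        "candidates": [{"title": t, "score": s} for t, s, _ in matches],
    }
```

The returned dict mirrors the DEDUP_RESULTS columns (decision, best_score, candidates) in the ERD of section 4.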
2) Document Lifecycle
Atomic state machine defining legal transitions and terminal states.
```mermaid
stateDiagram-v2
[*] --> INGESTED
INGESTED --> EXTRACTING
EXTRACTING --> EXTRACTED: success
EXTRACTING --> CURATED_FAIL: extraction failed
EXTRACTED --> CURATED_PASS: checks pass
EXTRACTED --> CURATED_FAIL: checks fail
CURATED_PASS --> DEDUP_DUPLICATE: duplicate score > threshold
CURATED_PASS --> LLM_RECOMMENDED: unique
LLM_RECOMMENDED --> HUMAN_APPROVED: reviewer approves
LLM_RECOMMENDED --> NEEDS_EDIT: reviewer requests edits
LLM_RECOMMENDED --> HUMAN_REJECTED: reviewer rejects
NEEDS_EDIT --> HUMAN_APPROVED: reviewer re-approves
NEEDS_EDIT --> HUMAN_REJECTED: reviewer rejects
HUMAN_APPROVED --> UPLOAD_QUEUED
UPLOAD_QUEUED --> UPLOADING
UPLOADING --> UPLOADED: upload verified
UPLOADING --> DEDUP_DUPLICATE: duplicate found live
UPLOADING --> UPLOAD_FAILED: max attempts reached
UPLOAD_FAILED --> UPLOAD_QUEUED: manual retry
DEDUP_DUPLICATE --> [*]
HUMAN_REJECTED --> [*]
CURATED_FAIL --> [*]
UPLOADED --> [*]
```
Status control
All status updates should enforce an atomic guard: `WHERE status = current_status`.
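A sketch of that guard as a compare-and-set update, assuming SQLAlchemy Core; the column names follow the ERD in section 4, but the helper itself is illustrative.

```python
# Compare-and-set status transition: succeeds only if the row is still in the
# expected state, so concurrent workers cannot double-apply a transition.
from sqlalchemy import text

def transition_status(conn, document_id: str, expected: str, new_status: str) -> bool:
    result = conn.execute(
        text(
            "UPDATE documents "
            "SET status = :new_status, updated_at = now() "
            "WHERE id = :id AND status = :expected"
        ),
        {"new_status": new_status, "id": document_id, "expected": expected},
    )
    return result.rowcount == 1  # False: another worker already moved the document
```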
Terminal states
`UPLOADED`, `HUMAN_REJECTED`, `DEDUP_DUPLICATE`, `CURATED_FAIL`
Recovery path
Only `UPLOAD_FAILED` can move back to `UPLOAD_QUEUED` by explicit retry.
3) Approval to Upload Sequence
Interaction timeline across UI, API, queue, worker, and target website.
```mermaid
sequenceDiagram
autonumber
participant Reviewer
participant UI as Review UI
participant API as Backend API
participant DB as PostgreSQL
participant Q as Redis Queue
participant W as Upload Worker
participant T as Target Site
Reviewer->>UI: Open document detail
UI->>API: GET /v1/documents/{id}
API->>DB: Load extracted + curation + llm + dedup
DB-->>API: Document aggregate
API-->>UI: Document detail
Reviewer->>UI: Approve and queue upload
UI->>API: POST /v1/documents/{id}/approve
API->>DB: status = UPLOAD_QUEUED
API->>Q: enqueue upload job
API-->>UI: upload_job_id
W->>Q: consume upload job
W->>DB: status = UPLOADING
W->>T: login + dedup + fill form + upload
alt upload success
W->>DB: status = UPLOADED + save target_entry_url
else upload failed
W->>DB: status = UPLOAD_FAILED + error artifacts
end
UI->>API: GET /v1/documents/{id}
API-->>UI: Latest status and result
```
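The `login + dedup + fill form + upload` step in the sequence above is what the upload worker performs with Playwright. A rough sketch using the sync API; every URL and selector, plus the `save_failure_artifacts` helper, are placeholders since the real target-site form is not described here, and the live dedup check is omitted.

```python
# Upload worker sketch with Playwright (sync API).
# All URLs, selectors, and save_failure_artifacts are placeholders.
from playwright.sync_api import sync_playwright

def upload_document(doc: dict, base_url: str, username: str, password: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        try:
            # Login
            page.goto(f"{base_url}/login")
            page.fill("#username", username)
            page.fill("#password", password)
            page.click("button[type=submit]")

            # Fill the entry form and submit
            page.goto(f"{base_url}/entries/new")
            page.fill("#title", doc["title"])
            page.fill("#description", doc["description"])
            page.click("button#submit-entry")

            # Verify the result and return the created entry URL (target_entry_url)
            page.wait_for_url(f"{base_url}/entries/*")
            return page.url
        except Exception:
            # Evidence for UPLOAD_ATTEMPTS: screenshot plus raw HTML of the failure
            page.screenshot(path=f"artifacts/{doc['id']}.png", full_page=True)
            save_failure_artifacts(doc["id"], page.content())  # hypothetical helper
            raise
        finally:
            browser.close()
```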
Review latency target
Under 500 ms for an approve or reject response.
Upload timeout target
3 to 5 minutes per document under normal conditions.
Retry policy
Maximum of 3 attempts with exponential backoff and an audit trail.
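One way to implement that policy inside the upload worker; the 30-second base delay, the jitter, and the `record_attempt` callback (which would feed UPLOAD_ATTEMPTS) are assumptions.

```python
# Retry sketch: up to MAX_ATTEMPTS with exponential backoff, recording each attempt.
import random
import time

MAX_ATTEMPTS = 3
BASE_DELAY_S = 30  # illustrative base delay

def run_with_retries(job_id: str, upload_fn, record_attempt) -> bool:
    for attempt_no in range(1, MAX_ATTEMPTS + 1):
        started = time.monotonic()
        try:
            upload_fn()
            record_attempt(job_id, attempt_no, "success", int((time.monotonic() - started) * 1000))
            return True
        except Exception as exc:
            record_attempt(job_id, attempt_no, f"error: {exc}", int((time.monotonic() - started) * 1000))
            if attempt_no == MAX_ATTEMPTS:
                return False  # caller marks the job UPLOAD_FAILED
            # Backoff doubles per failure (30 s, then 60 s) with +/-10% jitter.
            delay = BASE_DELAY_S * 2 ** (attempt_no - 1)
            time.sleep(delay * random.uniform(0.9, 1.1))
```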
Operator visibility
The UI polls the current status after queueing to reflect the final outcome.
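The polling pattern itself is small; a sketch with requests, where the interval, timeout, and terminal-state set mirror the lifecycle in section 2 but the concrete values are assumptions.

```python
# Poll a document until it reaches a state that ends automatic processing.
# Base URL, interval, and timeout are assumptions for illustration.
import time

import requests

# Terminal lifecycle states plus UPLOAD_FAILED, which waits for a manual retry.
DONE_STATES = {"UPLOADED", "UPLOAD_FAILED", "DEDUP_DUPLICATE", "HUMAN_REJECTED", "CURATED_FAIL"}

def wait_for_outcome(base_url: str, document_id: str, interval_s: float = 5.0, timeout_s: float = 300.0) -> str:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        doc = requests.get(f"{base_url}/v1/documents/{document_id}", timeout=10).json()
        if doc["status"] in DONE_STATES:
            return doc["status"]
        time.sleep(interval_s)
    raise TimeoutError(f"document {document_id} did not settle within {timeout_s} s")
```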
4) ERD (V1 Logical Model)
Entity relationships for traceability from raw source to upload result.
```mermaid
erDiagram
USERS ||--o{ HUMAN_REVIEWS : performs
DOCUMENTS ||--o{ DOCUMENT_ASSETS : has
DOCUMENTS ||--|| DOCUMENT_CONTENTS : has
DOCUMENTS ||--|| CURATION_RESULTS : has
DOCUMENTS ||--|| DEDUP_RESULTS : has
DOCUMENTS ||--|| LLM_RECOMMENDATIONS : has
DOCUMENTS ||--o{ HUMAN_REVIEWS : reviewed_in
DOCUMENTS ||--o{ UPLOAD_JOBS : queued_as
UPLOAD_JOBS ||--o{ UPLOAD_ATTEMPTS : has
DOCUMENTS ||--|| UPLOAD_RESULTS : produces
DOCUMENTS ||--o{ AUDIT_LOGS : tracked_by
DOCUMENT_ASSETS ||--o{ UPLOAD_ATTEMPTS : evidence_for
USERS {
uuid id PK
string email UK
string role
boolean is_active
timestamptz created_at
}
DOCUMENTS {
uuid id PK
string env
string source_name
string source_path
string source_hash
string status
string status_reason_code
text status_reason_text
jsonb metadata
timestamptz created_at
timestamptz updated_at
}
DOCUMENT_ASSETS {
uuid id PK
uuid document_id FK
string kind
string storage_uri
string mime_type
string sha256
bigint size_bytes
timestamptz created_at
}
DOCUMENT_CONTENTS {
uuid document_id PK,FK
string title
text description
jsonb raw_blocks
int page_count
uuid image_asset_id FK
timestamptz extracted_at
}
CURATION_RESULTS {
uuid id PK
uuid document_id FK,UK
boolean pass
numeric score
jsonb checks
jsonb reasons
timestamptz created_at
}
DEDUP_RESULTS {
uuid id PK
uuid document_id FK,UK
string decision
numeric best_score
jsonb candidates
timestamptz created_at
}
LLM_RECOMMENDATIONS {
uuid id PK
uuid document_id FK,UK
string model
string prompt_version
string decision
numeric confidence
jsonb reasons
jsonb suggested_fields
jsonb raw_response
int tokens_in
int tokens_out
timestamptz created_at
}
HUMAN_REVIEWS {
uuid id PK
uuid document_id FK
uuid reviewer_id FK
string action
text notes
jsonb overrides
timestamptz created_at
}
UPLOAD_JOBS {
uuid id PK
uuid document_id FK
string env
string status
int attempts
int max_attempts
string idempotency_key UK
string last_error_code
text last_error_message
timestamptz queued_at
timestamptz started_at
timestamptz finished_at
}
UPLOAD_ATTEMPTS {
uuid id PK
uuid upload_job_id FK
int attempt_no
string result
int duration_ms
jsonb request_payload
jsonb response_payload
uuid screenshot_asset_id FK
uuid html_asset_id FK
timestamptz created_at
}
UPLOAD_RESULTS {
uuid id PK
uuid document_id FK,UK
uuid upload_job_id FK
string target_base_url
string target_entry_id
string target_entry_url
numeric duplicate_score
jsonb raw_result
timestamptz uploaded_at
}
AUDIT_LOGS {
bigserial id PK
string actor_type
uuid actor_id
string action
string object_type
uuid object_id
jsonb before
jsonb after
string request_id
timestamptz created_at
}
```
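The `idempotency_key` unique key on UPLOAD_JOBS makes re-queuing an approval safe to repeat. A sketch, again with SQLAlchemy Core and PostgreSQL; deriving the key from document, environment, and review id is one possible convention, and the 'QUEUED' job status is an assumption.

```python
# Idempotent enqueue sketch: at most one upload job per (document, env, approval).
# Key derivation, the job status value, and gen_random_uuid() (PostgreSQL 13+) are assumptions.
import hashlib

from sqlalchemy import text

def enqueue_upload_job(conn, document_id: str, env: str, review_id: str) -> str | None:
    idempotency_key = hashlib.sha256(f"{document_id}:{env}:{review_id}".encode()).hexdigest()
    row = conn.execute(
        text(
            "INSERT INTO upload_jobs "
            "  (id, document_id, env, status, attempts, max_attempts, idempotency_key, queued_at) "
            "VALUES "
            "  (gen_random_uuid(), :doc, :env, 'QUEUED', 0, 3, :key, now()) "
            "ON CONFLICT (idempotency_key) DO NOTHING "
            "RETURNING id"
        ),
        {"doc": document_id, "env": env, "key": idempotency_key},
    ).fetchone()
    return str(row[0]) if row else None  # None: a job for this approval already exists
```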
Key notation
PK = primary key, FK = foreign key, UK = unique key.
Data lineage
Everything links back to `documents.id` for complete traceability.
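Because every table carries `document_id`, the whole lineage of one document can be assembled in a single round of joins; a sketch that selects only a few representative columns.

```python
# Lineage sketch: join the per-document tables back through documents.id.
# Column selection is trimmed for illustration.
from sqlalchemy import text

LINEAGE_SQL = text("""
    SELECT d.id, d.status,
           c.title,
           cr.pass     AS curation_pass,
           dr.decision AS dedup_decision,
           lr.decision AS llm_decision,
           ur.target_entry_url
      FROM documents d
      LEFT JOIN document_contents   c  ON c.document_id  = d.id
      LEFT JOIN curation_results    cr ON cr.document_id = d.id
      LEFT JOIN dedup_results       dr ON dr.document_id = d.id
      LEFT JOIN llm_recommendations lr ON lr.document_id = d.id
      LEFT JOIN upload_results      ur ON ur.document_id = d.id
     WHERE d.id = :document_id
""")

def load_lineage(conn, document_id: str) -> dict | None:
    row = conn.execute(LINEAGE_SQL, {"document_id": document_id}).mappings().first()
    return dict(row) if row else None
```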
Ops evidence
Upload attempts can store screenshot and HTML artifact references.