Budaya Automation V1
End-to-end pipeline for ingestion, curation, deduplication, LLM recommendation, human approval, and controlled upload. This page merges all core diagrams into one presentation-friendly flow.
1) System Architecture
High-level component map from crawler input to verified upload outcome.
```mermaid
flowchart LR
A[Crawler Output PDFs]
B[Ingestion API\nFastAPI]
C[(PostgreSQL)]
D[(Redis Queue)]
E[Worker: Extract\nPyMuPDF]
F[Worker: Curation\nRule Engine]
G[Worker: Dedup\nRapidfuzz + Target Lookup]
H[Worker: LLM Recommender]
I[Reviewer UI\nNext.js]
J[Approve or Reject]
K[Worker: Upload\nPlaywright]
L[Target Site\nBudaya Indonesia]
M[Audit Logs + Metrics]
A --> B
B --> C
B --> D
D --> E --> C
D --> F --> C
D --> G --> C
D --> H --> C
I --> B
B --> I
I --> J --> B
B --> D
D --> K
K --> L
K --> C
C --> M
classDef source fill:#1c3b57,stroke:#5aa7d9,color:#e9f5ff;
classDef service fill:#183446,stroke:#2db6a3,color:#d8fff8;
classDef store fill:#312743,stroke:#a58be8,color:#efe8ff;
classDef human fill:#3f2e14,stroke:#e8b14c,color:#fff2d7;
classDef ops fill:#24321d,stroke:#79b955,color:#ebffe0;
class A source;
class B,E,F,G,H,K service;
class C,D store;
class I,J human;
class L,M ops;
```
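The A → B → C/D edge above can stay thin: the ingestion API persists the document and immediately hands work to the queue. A minimal sketch assuming FastAPI with RQ on Redis; the queue name, the `insert_document` helper, and the worker import path are illustrative, not the actual implementation.

```python
# Ingestion sketch: persist a document row, then enqueue extraction.
# insert_document, the queue name, and the worker path are placeholders.
import hashlib
import uuid

from fastapi import FastAPI, UploadFile
from redis import Redis
from rq import Queue

app = FastAPI()
extract_queue = Queue("extract", connection=Redis())  # separate queue, away from UI/API traffic

@app.post("/v1/documents")
async def ingest(file: UploadFile):
    pdf_bytes = await file.read()
    document_id = str(uuid.uuid4())
    source_hash = hashlib.sha256(pdf_bytes).hexdigest()

    # Hypothetical helper: writes the PDF to storage plus a documents row
    # with status = INGESTED (PostgreSQL in the diagram above).
    insert_document(document_id, file.filename, source_hash, pdf_bytes)

    # Hand off to the extraction worker via the Redis queue.
    extract_queue.enqueue("workers.extract.run", document_id)
    return {"id": document_id, "status": "INGESTED"}
```

Keeping all heavy work in queued workers is what makes the Reliability point below hold: retries never compete with interactive API traffic.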
Governance
The LLM provides a recommendation only; the final decision remains with the reviewer.
Reliability
Queue isolation keeps retry traffic away from UI and API user traffic.
Safety
Separate staging and production targets with isolated session state.
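The dedup worker named in the diagram above compares an extracted title against entries already on the target site. A minimal sketch with rapidfuzz; the caller is assumed to have fetched the existing titles, and the 90-point threshold is illustrative rather than the production value.

```python
# Dedup sketch: fuzzy-match an extracted title against target-site titles.
# The threshold and the shape of the result are assumptions for illustration.
from rapidfuzz import fuzz, process

DUPLICATE_THRESHOLD = 90  # assumed score cut-off on a 0-100 scale

def check_duplicate(title: str, existing_titles: list[str]) -> dict:
    if not existing_titles:
        return {"decision": "unique", "best_score": 0.0, "candidates": []}

    # Top candidates by token_set_ratio, tolerant of word order and extra words.
    matches = process.extract(title, existing_titles, scorer=fuzz.token_set_ratio, limit=5)
    best_score = matches[0][1]
    decision = "duplicate" if best_score > DUPLICATE_THRESHOLD else "unique"
    return {
        "decision": decision,
        "best_score": best_score,
        "candidates": [{"title": t, "score": s} for t, s, _ in matches],
    }
```

The returned dict mirrors the DEDUP_RESULTS columns (decision, best_score, candidates) in the ERD of section 4.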
2) Document Lifecycle
Atomic state machine defining legal transitions and terminal states.
```mermaid
stateDiagram-v2
[*] --> INGESTED
INGESTED --> EXTRACTING
EXTRACTING --> EXTRACTED: success
EXTRACTING --> CURATED_FAIL: extraction failed
EXTRACTED --> CURATED_PASS: checks pass
EXTRACTED --> CURATED_FAIL: checks fail
CURATED_PASS --> DEDUP_DUPLICATE: duplicate score > threshold
CURATED_PASS --> LLM_RECOMMENDED: unique
LLM_RECOMMENDED --> HUMAN_APPROVED: reviewer approves
LLM_RECOMMENDED --> NEEDS_EDIT: reviewer requests edits
LLM_RECOMMENDED --> HUMAN_REJECTED: reviewer rejects
NEEDS_EDIT --> HUMAN_APPROVED: reviewer re-approves
NEEDS_EDIT --> HUMAN_REJECTED: reviewer rejects
HUMAN_APPROVED --> UPLOAD_QUEUED
UPLOAD_QUEUED --> UPLOADING
UPLOADING --> UPLOADED: upload verified
UPLOADING --> DEDUP_DUPLICATE: duplicate found live
UPLOADING --> UPLOAD_FAILED: max attempts reached
UPLOAD_FAILED --> UPLOAD_QUEUED: manual retry
DEDUP_DUPLICATE --> [*]
HUMAN_REJECTED --> [*]
CURATED_FAIL --> [*]
UPLOADED --> [*]
```
Status control
All status updates should enforce an atomic guard: `WHERE status = current_status`.
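A sketch of that guard as a compare-and-set update, assuming SQLAlchemy Core; the column names follow the ERD in section 4, but the helper itself is illustrative.

```python
# Compare-and-set status transition: succeeds only if the row is still in the
# expected state, so concurrent workers cannot double-apply a transition.
from sqlalchemy import text

def transition_status(conn, document_id: str, expected: str, new_status: str) -> bool:
    result = conn.execute(
        text(
            "UPDATE documents "
            "SET status = :new_status, updated_at = now() "
            "WHERE id = :id AND status = :expected"
        ),
        {"new_status": new_status, "id": document_id, "expected": expected},
    )
    return result.rowcount == 1  # False: another worker already moved the document
```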
Terminal states
`UPLOADED`, `HUMAN_REJECTED`, `DEDUP_DUPLICATE`, `CURATED_FAIL`
Recovery path
Only `UPLOAD_FAILED` can move back to `UPLOAD_QUEUED` by explicit retry.
3) Approval to Upload Sequence
Interaction timeline across UI, API, queue, worker, and target website.
```mermaid
sequenceDiagram
autonumber
participant Reviewer
participant UI as Review UI
participant API as Backend API
participant DB as PostgreSQL
participant Q as Redis Queue
participant W as Upload Worker
participant T as Target Site
Reviewer->>UI: Open document detail
UI->>API: GET /v1/documents/{id}
API->>DB: Load extracted + curation + llm + dedup
DB-->>API: Document aggregate
API-->>UI: Document detail
Reviewer->>UI: Approve and queue upload
UI->>API: POST /v1/documents/{id}/approve
API->>DB: status = UPLOAD_QUEUED
API->>Q: enqueue upload job
API-->>UI: upload_job_id
W->>Q: consume upload job
W->>DB: status = UPLOADING
W->>T: login + dedup + fill form + upload
alt upload success
W->>DB: status = UPLOADED + save target_entry_url
else upload failed
W->>DB: status = UPLOAD_FAILED + error artifacts
end
UI->>API: GET /v1/documents/{id}
API-->>UI: Latest status and result
```
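The `login + dedup + fill form + upload` step in the sequence above is what the upload worker performs with Playwright. A rough sketch using the sync API; every URL and selector, plus the `save_failure_artifacts` helper, are placeholders since the real target-site form is not described here, and the live dedup check is omitted.

```python
# Upload worker sketch with Playwright (sync API).
# All URLs, selectors, and save_failure_artifacts are placeholders.
from playwright.sync_api import sync_playwright

def upload_document(doc: dict, base_url: str, username: str, password: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        try:
            # Login
            page.goto(f"{base_url}/login")
            page.fill("#username", username)
            page.fill("#password", password)
            page.click("button[type=submit]")

            # Fill the entry form and submit
            page.goto(f"{base_url}/entries/new")
            page.fill("#title", doc["title"])
            page.fill("#description", doc["description"])
            page.click("button#submit-entry")

            # Verify the result and return the created entry URL (target_entry_url)
            page.wait_for_url(f"{base_url}/entries/*")
            return page.url
        except Exception:
            # Evidence for UPLOAD_ATTEMPTS: screenshot plus raw HTML of the failure
            page.screenshot(path=f"artifacts/{doc['id']}.png", full_page=True)
            save_failure_artifacts(doc["id"], page.content())  # hypothetical helper
            raise
        finally:
            browser.close()
```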
Review latency target
Under 500 ms for an approve or reject response.
Upload timeout target
3 to 5 minutes per document under normal conditions.
Retry policy
Maximum of 3 attempts with exponential backoff and an audit trail.
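One way to implement that policy inside the upload worker; the 30-second base delay, the jitter, and the `record_attempt` callback (which would feed UPLOAD_ATTEMPTS) are assumptions.

```python
# Retry sketch: up to MAX_ATTEMPTS with exponential backoff, recording each attempt.
import random
import time

MAX_ATTEMPTS = 3
BASE_DELAY_S = 30  # illustrative base delay

def run_with_retries(job_id: str, upload_fn, record_attempt) -> bool:
    for attempt_no in range(1, MAX_ATTEMPTS + 1):
        started = time.monotonic()
        try:
            upload_fn()
            record_attempt(job_id, attempt_no, "success", int((time.monotonic() - started) * 1000))
            return True
        except Exception as exc:
            record_attempt(job_id, attempt_no, f"error: {exc}", int((time.monotonic() - started) * 1000))
            if attempt_no == MAX_ATTEMPTS:
                return False  # caller marks the job UPLOAD_FAILED
            # Backoff doubles per failure (30 s, then 60 s) with +/-10% jitter.
            delay = BASE_DELAY_S * 2 ** (attempt_no - 1)
            time.sleep(delay * random.uniform(0.9, 1.1))
```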
Operator visibility
The UI polls the current status after queueing to reflect the final outcome.
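The polling pattern itself is small; a sketch with requests, where the interval, timeout, and terminal-state set mirror the lifecycle in section 2 but the concrete values are assumptions.

```python
# Poll a document until it reaches a state that ends automatic processing.
# Base URL, interval, and timeout are assumptions for illustration.
import time

import requests

# Terminal lifecycle states plus UPLOAD_FAILED, which waits for a manual retry.
DONE_STATES = {"UPLOADED", "UPLOAD_FAILED", "DEDUP_DUPLICATE", "HUMAN_REJECTED", "CURATED_FAIL"}

def wait_for_outcome(base_url: str, document_id: str, interval_s: float = 5.0, timeout_s: float = 300.0) -> str:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        doc = requests.get(f"{base_url}/v1/documents/{document_id}", timeout=10).json()
        if doc["status"] in DONE_STATES:
            return doc["status"]
        time.sleep(interval_s)
    raise TimeoutError(f"document {document_id} did not settle within {timeout_s} s")
```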
4) ERD (V1 Logical Model)
Entity relationships for traceability from raw source to upload result.
```mermaid
erDiagram
USERS ||--o{ HUMAN_REVIEWS : performs
DOCUMENTS ||--o{ DOCUMENT_ASSETS : has
DOCUMENTS ||--|| DOCUMENT_CONTENTS : has
DOCUMENTS ||--|| CURATION_RESULTS : has
DOCUMENTS ||--|| DEDUP_RESULTS : has
DOCUMENTS ||--|| LLM_RECOMMENDATIONS : has
DOCUMENTS ||--o{ HUMAN_REVIEWS : reviewed_in
DOCUMENTS ||--o{ UPLOAD_JOBS : queued_as
UPLOAD_JOBS ||--o{ UPLOAD_ATTEMPTS : has
DOCUMENTS ||--|| UPLOAD_RESULTS : produces
DOCUMENTS ||--o{ AUDIT_LOGS : tracked_by
DOCUMENT_ASSETS ||--o{ UPLOAD_ATTEMPTS : evidence_for
USERS {
uuid id PK
string email UK
string role
boolean is_active
timestamptz created_at
}
DOCUMENTS {
uuid id PK
string env
string source_name
string source_path
string source_hash
string status
string status_reason_code
text status_reason_text
jsonb metadata
timestamptz created_at
timestamptz updated_at
}
DOCUMENT_ASSETS {
uuid id PK
uuid document_id FK
string kind
string storage_uri
string mime_type
string sha256
bigint size_bytes
timestamptz created_at
}
DOCUMENT_CONTENTS {
uuid document_id PK,FK
string title
text description
jsonb raw_blocks
int page_count
uuid image_asset_id FK
timestamptz extracted_at
}
CURATION_RESULTS {
uuid id PK
uuid document_id FK,UK
boolean pass
numeric score
jsonb checks
jsonb reasons
timestamptz created_at
}
DEDUP_RESULTS {
uuid id PK
uuid document_id FK,UK
string decision
numeric best_score
jsonb candidates
timestamptz created_at
}
LLM_RECOMMENDATIONS {
uuid id PK
uuid document_id FK,UK
string model
string prompt_version
string decision
numeric confidence
jsonb reasons
jsonb suggested_fields
jsonb raw_response
int tokens_in
int tokens_out
timestamptz created_at
}
HUMAN_REVIEWS {
uuid id PK
uuid document_id FK
uuid reviewer_id FK
string action
text notes
jsonb overrides
timestamptz created_at
}
UPLOAD_JOBS {
uuid id PK
uuid document_id FK
string env
string status
int attempts
int max_attempts
string idempotency_key UK
string last_error_code
text last_error_message
timestamptz queued_at
timestamptz started_at
timestamptz finished_at
}
UPLOAD_ATTEMPTS {
uuid id PK
uuid upload_job_id FK
int attempt_no
string result
int duration_ms
jsonb request_payload
jsonb response_payload
uuid screenshot_asset_id FK
uuid html_asset_id FK
timestamptz created_at
}
UPLOAD_RESULTS {
uuid id PK
uuid document_id FK,UK
uuid upload_job_id FK
string target_base_url
string target_entry_id
string target_entry_url
numeric duplicate_score
jsonb raw_result
timestamptz uploaded_at
}
AUDIT_LOGS {
bigserial id PK
string actor_type
uuid actor_id
string action
string object_type
uuid object_id
jsonb before
jsonb after
string request_id
timestamptz created_at
}
```
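The `idempotency_key` unique key on UPLOAD_JOBS makes re-queuing an approval safe to repeat. A sketch, again with SQLAlchemy Core and PostgreSQL; deriving the key from document, environment, and review id is one possible convention, and the 'QUEUED' job status is an assumption.

```python
# Idempotent enqueue sketch: at most one upload job per (document, env, approval).
# Key derivation, the job status value, and gen_random_uuid() (PostgreSQL 13+) are assumptions.
import hashlib

from sqlalchemy import text

def enqueue_upload_job(conn, document_id: str, env: str, review_id: str) -> str | None:
    idempotency_key = hashlib.sha256(f"{document_id}:{env}:{review_id}".encode()).hexdigest()
    row = conn.execute(
        text(
            "INSERT INTO upload_jobs "
            "  (id, document_id, env, status, attempts, max_attempts, idempotency_key, queued_at) "
            "VALUES "
            "  (gen_random_uuid(), :doc, :env, 'QUEUED', 0, 3, :key, now()) "
            "ON CONFLICT (idempotency_key) DO NOTHING "
            "RETURNING id"
        ),
        {"doc": document_id, "env": env, "key": idempotency_key},
    ).fetchone()
    return str(row[0]) if row else None  # None: a job for this approval already exists
```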
Key notation
PK = primary key, FK = foreign key, UK = unique key.
Data lineage
Everything links back to `documents.id` for complete traceability.
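Because every table carries `document_id`, the whole lineage of one document can be assembled in a single round of joins; a sketch that selects only a few representative columns.

```python
# Lineage sketch: join the per-document tables back through documents.id.
# Column selection is trimmed for illustration.
from sqlalchemy import text

LINEAGE_SQL = text("""
    SELECT d.id, d.status,
           c.title,
           cr.pass     AS curation_pass,
           dr.decision AS dedup_decision,
           lr.decision AS llm_decision,
           ur.target_entry_url
      FROM documents d
      LEFT JOIN document_contents   c  ON c.document_id  = d.id
      LEFT JOIN curation_results    cr ON cr.document_id = d.id
      LEFT JOIN dedup_results       dr ON dr.document_id = d.id
      LEFT JOIN llm_recommendations lr ON lr.document_id = d.id
      LEFT JOIN upload_results      ur ON ur.document_id = d.id
     WHERE d.id = :document_id
""")

def load_lineage(conn, document_id: str) -> dict | None:
    row = conn.execute(LINEAGE_SQL, {"document_id": document_id}).mappings().first()
    return dict(row) if row else None
```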
Ops evidence
Upload attempts can store screenshot and HTML artifact references.