Protocol Upload Feature Architecture Design
Executive Summary
The goal of this project is to streamline the process of uploading and processing client PDF protocols by integrating a protocol upload feature directly into our app, eliminating the need for third-party tools like StorageLink. This feature will allow users to upload protocols, trigger automatic PDF parsing, and synchronize the parsing output back into the app after upload. The system will segregate protocols by tenant to ensure data isolation and security.
Motivation and Goals
Key Objectives
- Reduce manual work: Automate the upload and processing of protocols, removing the need for manual intervention.
- Tenant segregation: Ensure protocols are securely separated by tenant for compliance and security.
- Unified experience: Integrate the feature within app.trially.ai, avoiding reliance on external tools.
Benefits
- Faster protocol processing times
- Improved user experience
- Greater control over data flow and processing
Proposal
Implement a new "Protocol Upload" page in the app where users can:
- Upload protocol files, which will be processed automatically.
- Track the processing status of the uploaded files.
- Edit some of the protocol fields (e.g., title, associated sites) once the processing is complete.
- Delete protocols if necessary.
This feature will integrate with the existing cloud storage and PDF parsing systems, always ensuring that protocols are segregated by tenant.
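One way to enforce tenant segregation at the storage layer is to scope every object key by tenant. The helper below is a minimal sketch of that idea; the `tenants/<id>/protocols/` layout and the function name are assumptions, not an existing convention in the codebase.

```python
def protocol_object_path(tenant_id: str, file_name: str) -> str:
    """Build a tenant-scoped storage key so uploads never cross tenant boundaries.

    NOTE: the "tenants/<id>/protocols/" layout is hypothetical; adjust to the
    bucket structure actually used by the cloud storage integration.
    """
    safe_name = file_name.replace("/", "_")  # avoid path separators in object keys
    return f"tenants/{tenant_id}/protocols/{safe_name}"
```

With a layout like this, tenant isolation can also be enforced with per-prefix IAM conditions rather than application code alone.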
Pros and Cons
Pros
- Full control over the protocol upload process
- Custom UI tailored to our needs
- Direct integration with app.trially.ai
- Ability to prioritize features, optimizations, security, and bug fixes
Cons
- Requires development and ongoing maintenance
- Modification to multiple services:
  - web-recruitment-app (frontend and backend)
  - llm-nlp-pipelines (script automation)
  - elt-ingest-pipelines (integration to trigger the protocol parsing automatically)
- Increased complexity of the system
Design and Implementation Details
Frontend (web-recruitment-app)
- New UI Component: A "Protocol Upload" page where users can upload protocols and track their status.
- Status Tracking: The UI will show status updates (uploading, processing, success, error).
- Editable Fields: Once processed, fields such as protocol title and associated sites will be editable.
- API Interaction: Two possible approaches for file uploads:
  - Option 1: Upload through the backend API (POST /protocols/upload).
  - Option 2: Upload directly to tenant-specific cloud buckets, with the backend API providing the bucket URLs.
- Status Handling: The frontend will display:
- Uploading: Client-side status during file upload.
- Processing: Backend-driven status while the PDF is being parsed.
- Error/Success: Final status once the protocol has been fully processed.
Backend
API (web-recruitment-app)
- Configure Alembic to support multi-tenant migrations.
- Option 1: Handle file uploads directly in the backend, which will then transfer the files to cloud storage and trigger processing.
- Option 2: Use a pre-signed bucket URL approach, allowing direct uploads to cloud storage, reducing latency and network costs, but increasing overall complexity.
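For Option 2, the real implementation would use the cloud provider's SDK to generate pre-signed URLs (e.g. V4 signed URLs for Google Cloud Storage). As a stdlib-only sketch of the underlying idea, the snippet below signs an object path plus an expiry with an HMAC; the secret, parameter names, and token format are all assumptions for illustration, not the provider's actual signing scheme.

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"  # hypothetical; must never be shipped to the client

def make_upload_token(object_path: str, ttl_seconds: int = 900) -> str:
    """Sign an object path with an expiry, mimicking a minimal pre-signed URL query."""
    expires = int(time.time()) + ttl_seconds
    payload = f"{object_path}:{expires}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"{object_path}?expires={expires}&sig={sig}"

def verify_upload_token(token: str) -> bool:
    """Check the signature and expiry before accepting the upload."""
    path, query = token.split("?", 1)
    params = dict(p.split("=", 1) for p in query.split("&"))
    payload = f"{path}:{params['expires']}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, params["sig"]) and int(params["expires"]) > time.time()
```

In production this logic lives entirely in the provider's `generate_signed_url`-style API; the sketch only shows why the backend stays in the loop even for direct-to-bucket uploads.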
- ProtocolParsingStatus Model:
from datetime import datetime
from typing import Literal, Optional

from sqlmodel import Field, Relationship

# BaseModel and TimeStampMixin come from the existing codebase.

class ProtocolParsingStatus(BaseModel, table=True):
    __tablename__ = "protocol_parsing_status"

    status: Literal["processing", "error", "success"]
    status_message: Optional[str] = None  # Error message or any additional info
    job_id: str
    file_url: str
    last_updated_at: datetime
    protocol_id: int = Field(foreign_key="protocols.id")
    protocol: "Protocol" = Relationship(back_populates="parsing_status")

# Update the Protocol model to include the parsing status relationship
class Protocol(ProtocolBase, TimeStampMixin, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    parsing_status: ProtocolParsingStatus = Relationship(back_populates="protocol")

# Update the protocol base to support the recommended title
class ProtocolBase(BaseModel):
    title: str  # Original title from the file name, or user-defined
    status: Literal["active", "inactive"] = "active"  # Active by default
    is_deleted: bool = False  # Soft-delete flag; could live in a model mixin
This model will track the protocol's processing status, updated after every major processing step.
Processing System (llm-agent-service)
- Protocols will be processed using the LLM Agent System.
- The backend (web-recruitment-app) will trigger processing either via a Pub/Sub message or via a task-handling endpoint on the llm-agent-service.
- The llm-agent-service subscribes to Pub/Sub messages and reports the protocol parsing results back to the main system (web-recruitment-app).
- We can use either Pub/Sub or a REST API to communicate between the services. (TBD)
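If Pub/Sub is chosen, both services need to agree on a message schema. The builder below is a sketch of what the llm-agent-service could publish; the field names mirror the API response schema (`parsingStatus`, etc.), but this contract is an assumption and would need to be agreed between the two teams.

```python
import json
from typing import Optional

VALID_STATUSES = ("processing", "success", "error")

def parsing_result_message(protocol_id: int, job_id: str, status: str,
                           status_message: Optional[str] = None) -> bytes:
    """Serialize a parsing-result message for the web-recruitment-app subscriber.

    Field names are a hypothetical contract mirroring the REST schema.
    """
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status}")
    body = {
        "protocolId": protocol_id,
        "jobId": job_id,
        "parsingStatus": status,
        "parsingStatusMessage": status_message,
    }
    return json.dumps(body).encode("utf-8")
```

Whatever transport is picked (Pub/Sub or REST), keeping the payload identical to the PATCH /protocols/{protocol_id} body would let the subscriber reuse the same update path.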
Backend API contract (option 1)
POST /protocols/upload
This endpoint will receive the file via multipart/form-data, save it to cloud storage, extract the protocol title from the file name, and return 201 Created. This endpoint supports uploading a single file only.
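Since the title is derived from the file name, a small helper makes the rule explicit. This is a sketch; the function name is an assumption, and `unquote` is included because uploaded names may arrive URL-encoded, as in the example response below.

```python
from pathlib import Path
from urllib.parse import unquote

def title_from_filename(file_name: str) -> str:
    """Derive the default protocol title: decode any URL-encoding, drop the extension."""
    return Path(unquote(file_name)).stem
```

For example, `Protocol_v2.32_22%20March%2005_2021.pdf` would yield the title `Protocol_v2.32_22 March 05_2021`.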
Request
Response: Same schema as the items in GET /protocols/ below, but a single JSON object, not an array.
{
"title": "Protocol_v2.32_22 March 05_2021",
"externalProtocolId": "",
"id": 0,
"parsingStatus": "processing",
"status": "active",
"parsingStatusMessage": "string",
"lastUpdatedAt": "2024-01-01T01:00:00Z",
"fileUrl": "https://storage.googleapis.com/<path_to_bucket_folder>/Protocol_v2.32_22%20March%2005_2021.pdf"
}
GET /protocols/
Response
Use the existing endpoint to fetch the protocols data; it should return the new fields parsingStatus, parsingStatusMessage, lastUpdatedAt, status, and fileUrl.
[
{
"title": "string",
"externalProtocolId": "string",
"id": 0,
"parsingStatus": "success",
"status": "active",
"parsingStatusMessage": "string",
"lastUpdatedAt": "2024-01-01T01:00:00Z",
"fileUrl": "https://storage.googleapis.com/<path_to_bucket_folder>/Protocol_v2.32_22%20March%2005_2021.pdf"
}
]
parsingStatus: This field will have 3 possible values:
- processing: The PDF is being processed; not a final state. (Protocol has default/empty values.)
- success: The PDF processing succeeded; final state. (No action needed, protocol is already updated.)
- error: The PDF processing failed; final state. (Check logs or retry the upload.)
status: The protocol status, either active or inactive:
- active: The protocol is active and will receive AI updates.
- inactive: The protocol is inactive and will not receive AI updates.
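Because success and error are final states, the backend can guard against out-of-order status reports (e.g. a late Pub/Sub message arriving after completion). A minimal sketch of that guard, assuming the three states described above:

```python
# Allowed transitions per the status semantics: "processing" is the only
# non-final state; "success" and "error" accept no further transitions.
ALLOWED_TRANSITIONS = {
    "processing": {"processing", "success", "error"},  # repeated progress reports are fine
    "success": set(),  # final state
    "error": set(),    # final state (a retry would create a new parsing job)
}

def can_transition(current: str, new: str) -> bool:
    """Return True if moving from `current` to `new` is a legal status update."""
    return new in ALLOWED_TRANSITIONS.get(current, set())
```

The PATCH handler could reject (or log and drop) updates where `can_transition` returns False, keeping the stored status consistent with the state machine.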
DELETE /protocols/{protocol_id}
- Soft delete the protocol, mark it as deleted in the database.
- Change the is_deleted field to True.
POST /protocols/{protocol_id}/sites/{site_id}
Create relationship to enable protocol for this site.
Response Status: 204 No Content
DELETE /protocols/{protocol_id}/sites/{site_id}
Remove relationship to disable protocol for this site.
Response Status: 204 No Content
GET /protocols/{protocol_id}/sites/
Get all sites associated with the protocol.
Response
LLM Agent System Related Endpoints
These endpoints are available to the frontend, but are mainly used by the llm-agent-service to report the processing status back to the main system.
PATCH /protocols/{protocol_id}
The backend will use this endpoint to update the parsing status. The frontend will use this endpoint
to update the protocol title and status.
Example Request 1:
llm-agent-service request to update the parsing status to success and add the external protocol
id:
Example Request 2:
llm-agent-service request to update the parsing status to error and add the error message:
Example Request 3:
web-recruitment-app request to update the protocol title:
Example Request 4:
web-recruitment-app request to update the protocol status:
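The request bodies for the four examples above are not shown in this draft. The sketches below assume the PATCH body uses the same field names as the GET /protocols/ response; the concrete values (external id, error message, title) are purely illustrative.

Example 1 (llm-agent-service, parsing succeeded):

```json
{ "parsingStatus": "success", "externalProtocolId": "EXT-12345" }
```

Example 2 (llm-agent-service, parsing failed):

```json
{ "parsingStatus": "error", "parsingStatusMessage": "Failed to extract eligibility criteria" }
```

Example 3 (web-recruitment-app, title update):

```json
{ "title": "Protocol_v2.32" }
```

Example 4 (web-recruitment-app, status update):

```json
{ "status": "inactive" }
```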
Endpoint without changes that could be useful for LLM Agent System
POST /criteria/
GET /protocol/
GET /protocol/{protocol_id}/criteria/
POST /appointments/bulk/
GET /patients_by_external_id/{external_id}
POST /patients/
POST /criteria_instances/
Overall Diagram
Option 1 (Upload through the backend API)
- The user triggers a POST /protocols/upload request with the protocol file to the backend.
- The backend uploads the protocol file to cloud storage.
- The backend triggers the processing of the protocol.
- The LLM Agent System returns a job_id.
- The backend responds to the frontend with the protocol schema, similar to that returned by GET /protocols/<protocol_id>.
Option 2 (Upload directly to cloud buckets)
- The frontend triggers a GET /protocols/bucket_url request when the modal opens.
- The backend returns a pre-signed URL for uploading the protocol file to cloud storage.
- The frontend uploads the protocol file to cloud storage.
- The frontend triggers a POST /protocols/ request with the protocol file URL and the original file name.
- The backend creates a new protocol and triggers its processing.
- The LLM Agent System returns a job_id.
- The backend responds to the frontend with the protocol schema, similar to that returned by GET /protocols/<protocol_id>.
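The client side of the Option 2 flow can be sketched as a small orchestration function. The HTTP calls are injected as callables so the sequencing is explicit; all names here are illustrative, not an existing client API.

```python
from typing import Callable, Dict

def upload_protocol_via_bucket(
    get_bucket_url: Callable[[], str],             # wraps GET /protocols/bucket_url
    put_file: Callable[[str, bytes], None],        # HTTP PUT to the pre-signed URL
    create_protocol: Callable[[str, str], Dict],   # wraps POST /protocols/
    file_name: str,
    data: bytes,
) -> Dict:
    """Client-side sketch of the direct-to-bucket upload flow (Option 2)."""
    signed_url = get_bucket_url()                  # 1. ask the backend for a pre-signed URL
    put_file(signed_url, data)                     # 2. upload the file straight to storage
    file_url = signed_url.split("?", 1)[0]         # 3. strip the signature query params
    return create_protocol(file_url, file_name)    # 4. register the protocol, trigger parsing
```

The returned dict would be the protocol schema from the backend, after which the frontend polls GET /protocols/ for the parsing status.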
Black Box Details
For more information about the black box details, please refer to the Backend -> Processing System section.
Conclusion
This architecture will bring significant improvements in protocol management efficiency by automating the protocol PDF parsing workflow. By integrating directly into the app and using scalable backend processing, we can deliver results with an enhanced user experience and stronger data security.
Endpoints Summary
Frontend
POST /protocols/upload # Upload a protocol file for processing.
GET /protocols/ # Fetch all protocols with parsing status.
DELETE /protocols/{protocol_id} # Soft Delete a protocol.
POST /protocols/{protocol_id}/sites/{site_id} # Associate a protocol with a site.
DELETE /protocols/{protocol_id}/sites/{site_id} # Disassociate a protocol from a site.
GET /protocols/{protocol_id}/sites/ # Fetch all sites associated with a protocol.
PATCH /protocols/{protocol_id} # Update the protocol title and status.
LLM Agent System
PATCH /protocols/{protocol_id} # Update the protocol parsing status and external id.
POST /criteria/
GET /protocol/
GET /protocol/{protocol_id}/criteria/
POST /appointments/bulk/
GET /patients_by_external_id/{external_id}
POST /patients/
POST /criteria_instances/
Recommendation
I recommend using Option 1 if uploaded files are smaller than 32 MiB and we are currently running an HTTP/1 server. Otherwise, we should use Option 2, as it provides a more scalable solution for large file uploads.
Future Improvements/Ideas
- Implement a recommendation system for protocol titles based on the content of the PDF.
  - Easy to implement with current projects like llm-nlp-pipelines.
  - Share the recommended title via the PATCH /protocols/{protocol_id} endpoint.
  - UI/UX improvements to show the recommended title to the user.