Protocol Upload Feature Architecture Design
Executive Summary
The goal of this project is to streamline the process of uploading and processing client PDF protocols by integrating a protocol upload feature directly into our app, eliminating the need for third-party tools like StorageLink. This feature will allow users to upload protocols, trigger automatic PDF parsing, and synchronize the parsing output back into the app after upload. The system will segregate protocols by tenant to ensure data isolation and security.
Motivation and Goals
Key Objectives
- Reduce manual work: Automate the upload and processing of protocols, removing the need for manual intervention.
- Tenant segregation: Ensure protocols are securely separated by tenant for compliance and security.
- Unified experience: Integrate the feature within app.trially.ai, avoiding reliance on external tools.
Benefits
- Faster protocol processing times
- Improved user experience
- Greater control over data flow and processing
Proposal
Implement a new "Protocol Upload" page in the app where users can:
- Upload protocol files, which will be processed automatically.
- Track the processing status of the uploaded files.
- Edit some of the protocol fields (e.g., title, associated sites) once the processing is complete.
- Delete protocols if necessary.
This feature will integrate with the existing cloud storage and PDF parsing systems, always ensuring that protocols are segregated by tenant.
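One way to enforce tenant segregation at the storage layer is to scope every object key by tenant. The helper below is a minimal sketch of that idea; the `tenants/<id>/protocols/` layout and the function name are assumptions, not an existing convention in the codebase.

```python
def protocol_object_path(tenant_id: str, file_name: str) -> str:
    """Build a tenant-scoped storage key so uploads never cross tenant boundaries.

    NOTE: the "tenants/<id>/protocols/" layout is hypothetical; adjust to the
    bucket structure actually used by the cloud storage integration.
    """
    safe_name = file_name.replace("/", "_")  # avoid path separators in object keys
    return f"tenants/{tenant_id}/protocols/{safe_name}"
```

With a layout like this, tenant isolation can also be enforced with per-prefix IAM conditions rather than application code alone.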
Pros and Cons
Pros
- Full control over the protocol upload process
- Custom UI tailored to our needs
- Direct integration with app.trially.ai
- Ability to prioritize features, optimizations, security, and bug fixes
Cons
- Requires development and ongoing maintenance
- Modification to multiple services:
  - web-recruitment-app (frontend and backend)
  - llm-nlp-pipelines (script automation)
  - elt-ingest-pipelines (integration to trigger the protocol parsing automatically)
- Increased complexity of the system
Design and Implementation Details
Frontend (web-recruitment-app)
- New UI Component: A "Protocol Upload" page where users can upload protocols and track their status.
- Status Tracking: The UI will show status updates (uploading, processing, success, error).
- Editable Fields: Once processed, fields such as protocol title and associated sites will be editable.
- API Interaction: Two possible approaches for file uploads:
  - Option 1: Upload through the backend API (POST /protocols/upload).
  - Option 2: Upload directly to tenant-specific cloud buckets, with the backend API providing the bucket URLs.
- Status Handling: The frontend will display:
- Uploading: Client-side status during file upload.
- Processing: Backend-driven status while the PDF is being parsed.
- Error/Success: Final status once the protocol has been fully processed.
Backend
API (web-recruitment-app)
- Configure Alembic to support multi-tenant migrations.
- Option 1: Handle file uploads directly in the backend, which will then transfer the files to cloud storage and trigger processing.
- Option 2: Use a pre-signed bucket URL approach, allowing direct uploads to cloud storage, reducing latency and network costs, but increasing overall complexity.
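For Option 2, the real implementation would use the cloud provider's SDK to generate pre-signed URLs (e.g. V4 signed URLs for Google Cloud Storage). As a stdlib-only sketch of the underlying idea, the snippet below signs an object path plus an expiry with an HMAC; the secret, parameter names, and token format are all assumptions for illustration, not the provider's actual signing scheme.

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"  # hypothetical; must never be shipped to the client

def make_upload_token(object_path: str, ttl_seconds: int = 900) -> str:
    """Sign an object path with an expiry, mimicking a minimal pre-signed URL query."""
    expires = int(time.time()) + ttl_seconds
    payload = f"{object_path}:{expires}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"{object_path}?expires={expires}&sig={sig}"

def verify_upload_token(token: str) -> bool:
    """Check the signature and expiry before accepting the upload."""
    path, query = token.split("?", 1)
    params = dict(p.split("=", 1) for p in query.split("&"))
    payload = f"{path}:{params['expires']}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, params["sig"]) and int(params["expires"]) > time.time()
```

In production this logic lives entirely in the provider's `generate_signed_url`-style API; the sketch only shows why the backend stays in the loop even for direct-to-bucket uploads.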
- ProtocolParsingStatus Model:
from datetime import datetime
from typing import Literal, Optional

from sqlmodel import Field, Relationship

# BaseModel and TimeStampMixin come from the existing codebase.

class ProtocolParsingStatus(BaseModel, table=True):
    __tablename__ = "protocol_parsing_status"

    status: Literal["processing", "error", "success"]
    status_message: Optional[str] = None  # Error message or any additional info
    job_id: str
    file_url: str
    last_updated_at: datetime
    protocol_id: int = Field(foreign_key="protocols.id")
    protocol: "Protocol" = Relationship(back_populates="parsing_status")

# Update the Protocol model to include the parsing status relationship
class Protocol(ProtocolBase, TimeStampMixin, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    parsing_status: ProtocolParsingStatus = Relationship(back_populates="protocol")

# Update the protocol base to support the recommended title
class ProtocolBase(BaseModel):
    title: str  # Original title from the file name, or user-defined
    status: Literal["active", "inactive"] = "active"  # Active by default
    is_deleted: bool = False  # Soft-delete flag; could live in a model mixin
This model will track the protocol's processing status, updated after every major processing step.
Processing System (llm-agent-service)
- Protocols will be processed using the LLM Agent System.
- The backend (web-recruitment-app) will trigger processing either via a Pub/Sub message or via a task-handling endpoint on the llm-agent-service.
- The llm-agent-service subscribes to Pub/Sub messages and reports the protocol parsing results back to the main system (web-recruitment-app).
- We can use either Pub/Sub or a REST API to communicate between the services. (TBD)
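If Pub/Sub is chosen, both services need to agree on a message schema. The builder below is a sketch of what the llm-agent-service could publish; the field names mirror the API response schema (`parsingStatus`, etc.), but this contract is an assumption and would need to be agreed between the two teams.

```python
import json
from typing import Optional

VALID_STATUSES = ("processing", "success", "error")

def parsing_result_message(protocol_id: int, job_id: str, status: str,
                           status_message: Optional[str] = None) -> bytes:
    """Serialize a parsing-result message for the web-recruitment-app subscriber.

    Field names are a hypothetical contract mirroring the REST schema.
    """
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status}")
    body = {
        "protocolId": protocol_id,
        "jobId": job_id,
        "parsingStatus": status,
        "parsingStatusMessage": status_message,
    }
    return json.dumps(body).encode("utf-8")
```

Whatever transport is picked (Pub/Sub or REST), keeping the payload identical to the PATCH /protocols/{protocol_id} body would let the subscriber reuse the same update path.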
Backend API contract (option 1)
POST /protocols/upload
This endpoint will receive the file via multipart/form-data, save it to cloud storage, extract the protocol title from the file name, and return 201 Created. This endpoint supports uploading a single file only.
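Since the title is derived from the file name, a small helper makes the rule explicit. This is a sketch; the function name is an assumption, and `unquote` is included because uploaded names may arrive URL-encoded, as in the example response below.

```python
from pathlib import Path
from urllib.parse import unquote

def title_from_filename(file_name: str) -> str:
    """Derive the default protocol title: decode any URL-encoding, drop the extension."""
    return Path(unquote(file_name)).stem
```

For example, `Protocol_v2.32_22%20March%2005_2021.pdf` would yield the title `Protocol_v2.32_22 March 05_2021`.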
Request
Response: Same schema as the items in GET /protocols/ below, but a single JSON object, not an array.
{
"title": "Protocol_v2.32_22 March 05_2021",
"externalProtocolId": "",
"id": 0,
"parsingStatus": "processing",
"status": "active",
"parsingStatusMessage": "string",
"lastUpdatedAt": "2024-01-01T01:00:00Z",
"fileUrl": "https://storage.googleapis.com/<path_to_bucket_folder>/Protocol_v2.32_22%20March%2005_2021.pdf"
}
GET /protocols/
Response
Use the existing endpoint to fetch the protocols data; it should return the new fields parsingStatus, parsingStatusMessage, lastUpdatedAt, status, and fileUrl.
[
{
"title": "string",
"externalProtocolId": "string",
"id": 0,
"parsingStatus": "success",
"status": "active",
"parsingStatusMessage": "string",
"lastUpdatedAt": "2024-01-01T01:00:00Z",
"fileUrl": "https://storage.googleapis.com/<path_to_bucket_folder>/Protocol_v2.32_22%20March%2005_2021.pdf"
}
]
parsingStatus: This field will have 3 possible values:
- processing: The PDF is being processed; not a final state. (Protocol has default/empty values.)
- success: The PDF processing succeeded; final state. (No action needed, protocol is already updated.)
- error: The PDF processing failed; final state. (Check logs or retry the upload.)
status: The protocol status, either active or inactive:
- active: The protocol is active and will receive AI updates.
- inactive: The protocol is inactive and will not receive AI updates.
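Because success and error are final states, the backend can guard against out-of-order status reports (e.g. a late Pub/Sub message arriving after completion). A minimal sketch of that guard, assuming the three states described above:

```python
# Allowed transitions per the status semantics: "processing" is the only
# non-final state; "success" and "error" accept no further transitions.
ALLOWED_TRANSITIONS = {
    "processing": {"processing", "success", "error"},  # repeated progress reports are fine
    "success": set(),  # final state
    "error": set(),    # final state (a retry would create a new parsing job)
}

def can_transition(current: str, new: str) -> bool:
    """Return True if moving from `current` to `new` is a legal status update."""
    return new in ALLOWED_TRANSITIONS.get(current, set())
```

The PATCH handler could reject (or log and drop) updates where `can_transition` returns False, keeping the stored status consistent with the state machine.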
DELETE /protocols/{protocol_id}
- Soft delete the protocol, mark it as deleted in the database.
- Change the is_deleted field to True.
POST /protocols/{protocol_id}/sites/{site_id}
Create relationship to enable protocol for this site.
Response Status: 204 No Content
DELETE /protocols/{protocol_id}/sites/{site_id}
Remove relationship to disable protocol for this site.
Response Status: 204 No Content
GET /protocols/{protocol_id}/sites/
Get all sites associated with the protocol.
Response
LLM Agent System Related Endpoints
These endpoints are available to the frontend, but are mainly used by the llm-agent-service to report the processing status back to the main system.
PATCH /protocols/{protocol_id}
The backend will use this endpoint to update the parsing status. The frontend will use this endpoint
to update the protocol title and status.
Example Request 1:
llm-agent-service request to update the parsing status to success and add the external protocol
id:
Example Request 2:
llm-agent-service request to update the parsing status to error and add the error message:
Example Request 3:
web-recruitment-app request to update the protocol title:
Example Request 4:
web-recruitment-app request to update the protocol status:
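The request bodies for the four examples above are not shown in this draft. The sketches below assume the PATCH body uses the same field names as the GET /protocols/ response; the concrete values (external id, error message, title) are purely illustrative.

Example 1 (llm-agent-service, parsing succeeded):

```json
{ "parsingStatus": "success", "externalProtocolId": "EXT-12345" }
```

Example 2 (llm-agent-service, parsing failed):

```json
{ "parsingStatus": "error", "parsingStatusMessage": "Failed to extract eligibility criteria" }
```

Example 3 (web-recruitment-app, title update):

```json
{ "title": "Protocol_v2.32" }
```

Example 4 (web-recruitment-app, status update):

```json
{ "status": "inactive" }
```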
Endpoint without changes that could be useful for LLM Agent System
POST /criteria/
GET /protocol/
GET /protocol/{protocol_id}/criteria/
POST /appointments/bulk/
GET /patients_by_external_id/{external_id}
POST /patients/
POST /criteria_instances/
Overall Diagram
Option 1 (Upload through the backend API)
- The user triggers a POST /protocols/upload request with the protocol file to the backend.
- The backend uploads the protocol file to cloud storage.
- The backend triggers the processing of the protocol.
- The LLM Agent System returns a job_id.
- The backend responds to the frontend with the protocol schema, similar to that returned by GET /protocols/<protocol_id>.
Option 2 (Upload directly to cloud buckets)
- The frontend triggers a GET /protocols/bucket_url request when the modal opens.
- The backend returns a pre-signed URL for uploading the protocol file to cloud storage.
- The frontend uploads the protocol file to cloud storage.
- The frontend triggers a POST /protocols/ request with the protocol file URL and the original file name.
- The backend creates a new protocol and triggers its processing.
- The LLM Agent System returns a job_id.
- The backend responds to the frontend with the protocol schema, similar to that returned by GET /protocols/<protocol_id>.
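The client side of the Option 2 flow can be sketched as a small orchestration function. The HTTP calls are injected as callables so the sequencing is explicit; all names here are illustrative, not an existing client API.

```python
from typing import Callable, Dict

def upload_protocol_via_bucket(
    get_bucket_url: Callable[[], str],             # wraps GET /protocols/bucket_url
    put_file: Callable[[str, bytes], None],        # HTTP PUT to the pre-signed URL
    create_protocol: Callable[[str, str], Dict],   # wraps POST /protocols/
    file_name: str,
    data: bytes,
) -> Dict:
    """Client-side sketch of the direct-to-bucket upload flow (Option 2)."""
    signed_url = get_bucket_url()                  # 1. ask the backend for a pre-signed URL
    put_file(signed_url, data)                     # 2. upload the file straight to storage
    file_url = signed_url.split("?", 1)[0]         # 3. strip the signature query params
    return create_protocol(file_url, file_name)    # 4. register the protocol, trigger parsing
```

The returned dict would be the protocol schema from the backend, after which the frontend polls GET /protocols/ for the parsing status.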
Black Box Details
For more information about the black box details, please refer to the Backend -> Processing System section.
Conclusion
This architecture will bring significant improvements in protocol management efficiency by automating the protocol PDF parsing workflow. By integrating directly into the app and using scalable backend processing, we can deliver results with an enhanced user experience and stronger data security.
Endpoints Summary
Frontend
POST /protocols/upload # Upload a protocol file for processing.
GET /protocols/ # Fetch all protocols with parsing status.
DELETE /protocols/{protocol_id} # Soft Delete a protocol.
POST /protocols/{protocol_id}/sites/{site_id} # Associate a protocol with a site.
DELETE /protocols/{protocol_id}/sites/{site_id} # Disassociate a protocol from a site.
GET /protocols/{protocol_id}/sites/ # Fetch all sites associated with a protocol.
PATCH /protocols/{protocol_id} # Update the protocol title and status.
LLM Agent System
PATCH /protocols/{protocol_id} # Update the protocol parsing status and external id.
POST /criteria/
GET /protocol/
GET /protocol/{protocol_id}/criteria/
POST /appointments/bulk/
GET /patients_by_external_id/{external_id}
POST /patients/
POST /criteria_instances/
Recommendation
I recommend using Option 1 if uploaded files are smaller than 32 MiB and we are currently running an HTTP/1 server. Otherwise, we should use Option 2, as it provides a more scalable solution for large file uploads.
Future Improvements/Ideas
- Implement a recommendation system for protocol titles based on the content of the PDF.
  - Easy to implement with current projects like llm-nlp-pipelines.
  - Share the recommended title via the PATCH /protocols/{protocol_id} endpoint.
  - UI/UX improvements to show the recommended title to the user.