# Duplicate Handling

#### Source Code

`ocbc-ecoffice/enrich/duplicate_handling/duplicate_handling.py`

#### Description

* <b>NOTE:</b> This pipelet is executed twice: once before (PRE) and once after (POST) "<i>PDF OCR</i>".
* The file is downloaded from the `content_url` property of each Squirro Item: `item["files"][0]["content_url"]`.
* A binary hash is computed for the downloaded binary content.
    * If the `skip_binary_hash_compute` key is present in `keywords`, the binary hash computation is skipped.
    * In that case, duplicate handling is based on the <b style="color: red;">EXISTING</b> `binary_hash` key in `keywords`.
    * Afterwards, the `skip_binary_hash_compute` key is removed from `keywords`.
* Duplicates are detected by querying for an existing Squirro Item with the same `binary_hash` key in `keywords`.
    * Query syntax: `binary_hash:{digest} -is_deleted:true`
* When a duplicate is detected, the Squirro Item `id` is pushed into Redis as a field in the hashes `duplicated_item_wfi_hash` and `duplicated_item_hash`, as applicable, with the following value:
    * <i>"The document you uploaded is a duplicate of {duplicated_id} and has been rejected. This is the link to the original document: ..."</i>
    * The `id` field and its corresponding message are consumed by the [Activity Tracking Scheduled Job](/scheduled-jobs/feature-jobs/activity-tracking.html)
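
The flow described above can be sketched roughly as follows. This is an illustrative sketch only, not the pipelet's actual code: the function names (`resolve_binary_hash`, `build_duplicate_query`, `record_duplicate`) are hypothetical, SHA-256 is an assumed hash algorithm, keyword values are assumed to be lists, and `redis_client` is assumed to be any object exposing an `hset(name, key, value)` method, as redis-py clients do.

```python
import hashlib

SKIP_KEY = "skip_binary_hash_compute"


def resolve_binary_hash(item, content):
    """Return the digest used for duplicate detection (hypothetical helper).

    Updates item["keywords"] in place.
    """
    keywords = item.setdefault("keywords", {})
    if SKIP_KEY in keywords:
        # Skip computation, rely on the EXISTING binary_hash, then drop the flag.
        keywords.pop(SKIP_KEY)
        existing = keywords.get("binary_hash") or []
        return existing[0] if existing else None
    # Assumption: SHA-256; the real pipelet may use a different algorithm.
    digest = hashlib.sha256(content).hexdigest()
    keywords["binary_hash"] = [digest]
    return digest


def build_duplicate_query(digest):
    """Build the query used to look for an existing, non-deleted item."""
    return f"binary_hash:{digest} -is_deleted:true"


def record_duplicate(redis_client, hash_name, item_id, duplicated_id):
    """Store the rejection message under the item id in the given Redis hash."""
    message = (
        f"The document you uploaded is a duplicate of {duplicated_id} "
        "and has been rejected. This is the link to the original document: ..."
    )
    # hash_name would be "duplicated_item_wfi_hash" or "duplicated_item_hash".
    redis_client.hset(hash_name, item_id, message)
    return message
```

Passing the Redis client and hash name as arguments keeps the sketch independent of how the real pipelet connects to Redis or decides which hash to write to.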
