Microsoft SharePoint Unstructured Limitations and Troubleshooting
The Microsoft SharePoint unstructured data connector enforces specific constraints concerning file format compatibility, size limits, and content age during data ingestion. Files that don’t meet these criteria are automatically skipped, and their status is logged in the unstructured data lake object (UDLO) for auditing and review.
The connector is configured to process the following explicit MIME types and matching file patterns. This includes a wide range of documents, images, and text formats.
| Category | Supported MIME Types |
|---|---|
| Documents & Spreadsheets | application/msword, application/rtf, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/vnd.oasis.opendocument.text, application/vnd.sun.xml.writer, application/vnd.ms-excel, libre/doc, text/csv, application/json |
| PDF | application/vnd.pdf, application/acrobat, application/pdf, application/x-pdf, text/pdf, text/x-pdf |
| Google Workspace | application/vnd.google-apps.document, application/vnd.google-apps.presentation, application/vnd.google-apps.spreadsheet |
| Images | image/bmp, image/gif, image/jpeg, image/png, image/svg+xml, image/tiff, image/webp |
| Web & Text | html, htm, text/html, text/markdown, text/x-markdown, text/plain |
The following regular expressions are used to capture supported file types that match broad patterns in their MIME type or file extension:
| Regex Pattern | Description | Example Matches |
|---|---|---|
| `/^(?!.*google).*doc.*\|.*-doc.*/i` | Matches MIME types containing .doc or -doc anywhere, excluding Google Docs formats. | |
| `/.*html.*\|.*-html.*/i` | Matches MIME types containing .html or -html anywhere. | |
| `/.*image.*\|.*-image.*/i` | Matches MIME types containing image or -image anywhere. | |
| `/.*md.*\|.*markdown.*/i` | Matches MIME types containing .md or markdown anywhere. | |
| `/.*pdf.*/i` | Matches MIME types containing pdf anywhere. | application/pdf, application/x-pdf, text/pdf |
| `/^application\/vnd\.google-apps\..*/i` | Matches MIME types starting with application/vnd.google-apps. | application/vnd.google-apps.document, application/vnd.google-apps.spreadsheet, application/vnd.google-apps.presentation |
The following Google Workspace MIME types are explicitly excluded from ingestion and are skipped:
application/vnd.google-apps.shortcut, application/vnd.google-apps.script, application/vnd.google-apps.site, application/vnd.google-apps.form, application/vnd.google-apps.vid
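The format rules above can be sketched as a small classifier. This is an illustration only, not the connector's implementation: the function name `classify_format`, the return labels, and the order of the checks (exclusions first, then explicit types, then patterns) are assumptions, and only a subset of the listed MIME types and patterns is included.

```python
import re

# Illustrative subset of the explicitly supported MIME types.
SUPPORTED_MIME_TYPES = {
    "application/msword", "application/pdf", "text/csv", "application/json",
    "image/png", "text/html", "text/markdown", "text/plain",
}

# Subset of the explicitly excluded Google Workspace MIME types.
EXCLUDED_MIME_TYPES = {
    "application/vnd.google-apps.shortcut",
    "application/vnd.google-apps.script",
    "application/vnd.google-apps.site",
    "application/vnd.google-apps.form",
}

# Two of the broad patterns, compiled case-insensitively (the /i flag).
SUPPORTED_PATTERNS = [
    re.compile(r".*pdf.*", re.IGNORECASE),
    re.compile(r"^application/vnd\.google-apps\..*", re.IGNORECASE),
]


def classify_format(mime_type: str) -> str:
    """Return a hypothetical label describing how a MIME type is treated."""
    # Assumption: exclusions are checked first, because the Google Workspace
    # pattern would otherwise also match the excluded types.
    if mime_type in EXCLUDED_MIME_TYPES:
        return "explicitly_unsupported"
    if mime_type in SUPPORTED_MIME_TYPES:
        return "supported"
    if any(p.match(mime_type) for p in SUPPORTED_PATTERNS):
        return "supported"
    return "implicitly_unsupported"


print(classify_format("application/vnd.google-apps.shortcut"))
print(classify_format("application/x-pdf"))
print(classify_format("video/mp4"))
```

Under these assumptions, the three calls print `explicitly_unsupported`, `supported`, and `implicitly_unsupported`.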
To maintain processing stability and efficiency, the connector enforces maximum file size limits. Files are categorized, and a maximum size check is performed based on the content type. Any file exceeding its category limit will be skipped.
| File Type | Size Limit | MIME Types & Regex Patterns |
|---|---|---|
| PDF, CSV & similar formats | 100 MB | MIME types: application/pdf, application/x-pdf, application/acrobat, application/vnd.pdf, text/pdf, text/x-pdf, text/csv, application/vnd.openxmlformats-officedocument.wordprocessingml.document. Regex patterns: `^application\/vnd\.google-apps\..*`, `.*pdf.*` |
| Image & similar formats | 20 MB | MIME types: image/jpeg, image/png, image/gif, image/bmp, image/tiff, image/webp, image/svg+xml. Regex patterns: `.*image.*\|.*-image.*` |
| All other supported formats | 4 MB | Any supported file format that doesn’t fall into the PDF or image categories is checked against the 4 MB maximum size limit. |
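As a rough sketch of the size check, the category lookup and comparison might look like the following. The names `size_limit_mb` and `exceeds_limit` are hypothetical, the 1 MB = 1,048,576 bytes conversion is an assumption, and the strict "greater than" comparison is one reading of "exceeding its category limit"; the connector's actual boundary behavior may differ.

```python
import re

MB = 1024 * 1024  # assumption: binary megabytes; the limits do not specify

PDF_LIKE_TYPES = {
    "application/pdf", "application/x-pdf", "application/acrobat",
    "application/vnd.pdf", "text/pdf", "text/x-pdf", "text/csv",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}
PDF_LIKE_PATTERNS = [
    re.compile(r"^application/vnd\.google-apps\..*", re.IGNORECASE),
    re.compile(r".*pdf.*", re.IGNORECASE),
]
IMAGE_TYPES = {
    "image/jpeg", "image/png", "image/gif", "image/bmp",
    "image/tiff", "image/webp", "image/svg+xml",
}


def size_limit_mb(mime_type: str) -> int:
    """Return the size limit in MB for the category a MIME type falls into."""
    if mime_type in PDF_LIKE_TYPES or any(p.match(mime_type) for p in PDF_LIKE_PATTERNS):
        return 100
    # Approximates the image pattern: any MIME type containing "image".
    if mime_type in IMAGE_TYPES or "image" in mime_type.lower():
        return 20
    return 4  # all other supported formats


def exceeds_limit(mime_type: str, size_bytes: int) -> bool:
    """True when the file is larger than its category limit (assumed strict)."""
    return size_bytes > size_limit_mb(mime_type) * MB


print(size_limit_mb("application/pdf"))       # 100
print(size_limit_mb("image/png"))             # 20
print(exceeds_limit("text/plain", 5 * MB))    # True
```

This sketch does not check whether the format is supported in the first place; it only covers the size comparison.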
Content deemed stale based on its age is skipped to maintain relevance.
Rule: Any content whose last modified time is more than 2 years before the current sync time is skipped.
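The age rule can be illustrated with a short check. This is a sketch under stated assumptions: the function name `is_stale` is hypothetical, "2 years" is interpreted as two calendar years before the sync time, and both timestamps are assumed to be timezone-aware UTC values.

```python
from datetime import datetime, timezone

MAX_AGE_YEARS = 2


def is_stale(last_modified: datetime, sync_time: datetime,
             max_age_years: int = MAX_AGE_YEARS) -> bool:
    """True when last_modified is more than max_age_years before sync_time."""
    try:
        cutoff = sync_time.replace(year=sync_time.year - max_age_years)
    except ValueError:
        # sync_time is Feb 29 and the target year is not a leap year.
        cutoff = sync_time.replace(year=sync_time.year - max_age_years, day=28)
    return last_modified < cutoff


sync = datetime(2025, 6, 1, tzinfo=timezone.utc)
print(is_stale(datetime(2022, 1, 15, 10, 30, tzinfo=timezone.utc), sync))  # True
print(is_stale(datetime(2024, 1, 1, tzinfo=timezone.utc), sync))           # False
```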
When a file fails any constraint check (format, size, or last modified time), it’s skipped and the outcome is recorded in the UDLO. The UDLO contains the following fields to detail the failure.
| Field | Value | Description |
|---|---|---|
| Sync Status | SUCCESS, FAILURE or SKIPPED | Indicates the file ingestion status. |
| Sync Status Detail | Detailed Error Message | Provides the explicit reason why the file was skipped. |
| File Path | NULL or Empty String | The file path field is explicitly cleared for all skipped records to mark them as non-ingested. |
Files are processed through a series of checks. If a check fails, the file is immediately skipped and the appropriate reason is logged in the UDLO.
| Failure Condition | Status Details Message |
|---|---|
| Unsupported Format | Explicitly Unsupported: If the file's MIME type matches an unsupported pattern (for example, application/vnd.google-apps.shortcut). |
| Default Format Skip | Implicitly Unsupported: If the file's MIME type doesn’t match any supported format or regex pattern, it is skipped. |
| Content Age | Last modified time %s is older than %d years (for example, Last modified time 2022-01-15T10:30:00Z is older than 2 years) |
| Size Exceeded (PDF, CSV & Similar) | %s must be less than %d MB (found %.2f MB) (for example, PDF must be less than 100 MB (found 204.20 MB)) |
| Size Exceeded (Image & Similar) | %s must be less than %d MB (found %.2f MB) (for example, IMAGE must be less than 20 MB (found 25.00 MB)) |
| Size Exceeded (All Other) | %s must be less than %d MB (found %.2f MB) (for example, text/other files must be less than 4 MB (found 30.00 MB)) |
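To show how the message templates above expand, here is a minimal sketch that fills them in with Python's printf-style formatting and builds a skipped record. The dictionary keys (`sync_status`, `sync_status_detail`, `file_path`) and the helper `skipped_record` are hypothetical names chosen for illustration; they are not the UDLO's actual field API names.

```python
SIZE_TEMPLATE = "%s must be less than %d MB (found %.2f MB)"
AGE_TEMPLATE = "Last modified time %s is older than %d years"


def skipped_record(detail: str) -> dict:
    """Build an illustrative record for a skipped file."""
    return {
        "sync_status": "SKIPPED",
        "sync_status_detail": detail,
        "file_path": None,  # cleared for skipped records
    }


size_detail = SIZE_TEMPLATE % ("PDF", 100, 204.2)
age_detail = AGE_TEMPLATE % ("2022-01-15T10:30:00Z", 2)

print(size_detail)  # PDF must be less than 100 MB (found 204.20 MB)
print(age_detail)   # Last modified time 2022-01-15T10:30:00Z is older than 2 years
print(skipped_record(size_detail)["sync_status"])  # SKIPPED
```

Because `%.2f` always renders two decimal places, a found size of 25 MB appears as `25.00 MB` in the logged message.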
The Microsoft SharePoint connector supports ingestion of sites or libraries up to 300 GB per ingestion job.
Ingestion jobs that run for longer than 24 hours may time out depending on system execution limits, so plan accordingly.