Microsoft SharePoint Unstructured Limitations and Troubleshooting

The Microsoft SharePoint unstructured data connector enforces specific constraints concerning file format compatibility, size limits, and content age during data ingestion. Files that don’t meet these criteria are automatically skipped, and their status is logged in the unstructured data lake object (UDLO) for auditing and review.

The connector is configured to process the following explicit MIME types and matching file patterns. This includes a wide range of documents, images, and text formats.

| Category | Supported MIME Types |
| --- | --- |
| Documents & Spreadsheets | application/msword, application/rtf, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/vnd.oasis.opendocument.text, application/vnd.sun.xml.writer, application/vnd.ms-excel, libre/doc, text/csv, application/json |
| PDF | application/vnd.pdf, application/acrobat, application/pdf, application/x-pdf, text/pdf, text/x-pdf |
| Google Workspace | application/vnd.google-apps.document, application/vnd.google-apps.presentation, application/vnd.google-apps.spreadsheet |
| Images | image/bmp, image/gif, image/jpeg, image/png, image/svg+xml, image/tiff, image/webp |
| Web & Text | html, htm, text/html, text/markdown, text/x-markdown, text/plain |

The following regular expressions are used to capture supported file types that match broad patterns in their MIME type or file extension:

  • `"/^(?!.*google).*doc.*|.*-doc.*/i"`: Matches MIME types containing .doc or -doc anywhere, excluding Google Docs formats.
  • `"/.*html.*|.*-html.*/i"`: Matches MIME types containing .html or -html anywhere.
  • `"/.*image.*|.*-image.*/i"`: Matches MIME types containing image or -image anywhere.
  • `"/.*md.*|.*markdown.*/i"`: Matches MIME types containing .md or markdown anywhere.
  • `"/.*pdf.*/i"`: Matches MIME types containing pdf anywhere. Examples: application/pdf; application/x-pdf; text/pdf
  • `"/^application\\/vnd\\.google-apps\\..*/i"`: Matches MIME types starting with application/vnd.google-apps. Examples: application/vnd.google-apps.document; application/vnd.google-apps.spreadsheet; application/vnd.google-apps.presentation

The following Google Workspace MIME types are explicitly excluded from ingestion and are skipped:

  • application/vnd.google-apps.shortcut
  • application/vnd.google-apps.script
  • application/vnd.google-apps.site
  • application/vnd.google-apps.form
  • application/vnd.google-apps.vid
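
The interaction between the explicit MIME types, the regex patterns, and the exclusion list can be sketched roughly as follows. This is an illustration only, not the connector's implementation: the `classify` function and the constant names are hypothetical, `SUPPORTED` holds only a subset of the documented types, and only two of the documented patterns are translated to Python (the `/i` flag becomes `re.IGNORECASE`). The sketch assumes that exclusions are evaluated before the broad `application/vnd.google-apps.` pattern, because otherwise a type such as application/vnd.google-apps.shortcut would match it.

```python
import re

# Explicitly excluded Google Workspace types, copied from the list above.
EXCLUDED = {
    "application/vnd.google-apps.shortcut",
    "application/vnd.google-apps.script",
    "application/vnd.google-apps.site",
    "application/vnd.google-apps.form",
    "application/vnd.google-apps.vid",
}

# A subset of the explicit MIME types from the supported-types table.
SUPPORTED = {
    "application/msword",
    "application/pdf",
    "text/csv",
    "application/json",
    "image/png",
    "image/jpeg",
    "text/html",
    "text/markdown",
    "text/plain",
}

# Two of the documented patterns, rewritten in Python regex syntax.
PATTERNS = [
    re.compile(r".*pdf.*", re.IGNORECASE),
    re.compile(r"^application/vnd\.google-apps\..*", re.IGNORECASE),
]


def classify(mime_type: str) -> str:
    """Return 'excluded', 'supported', or 'unsupported' for a MIME type."""
    # Assumption: exclusions take priority over the broad google-apps pattern.
    if mime_type in EXCLUDED:
        return "excluded"
    if mime_type in SUPPORTED:
        return "supported"
    if any(p.search(mime_type) for p in PATTERNS):
        return "supported"
    return "unsupported"
```

For example, `classify("application/vnd.google-apps.document")` returns `"supported"` through the pattern match, while `classify("application/vnd.google-apps.form")` returns `"excluded"`.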

To maintain processing stability and efficiency, the connector enforces maximum file size limits. Each file is assigned to a category based on its content type and checked against that category's maximum size. Any file that exceeds its category limit is skipped.

| File Type | Size Limit | MIME Types & Regex Patterns |
| --- | --- | --- |
| PDF, CSV & Similar Formats | 100 MB | MIME Types: application/pdf, application/x-pdf, application/acrobat, application/vnd.pdf, text/pdf, text/x-pdf, text/csv, application/vnd.openxmlformats-officedocument.wordprocessingml.document. Regex Patterns: `^application\/vnd\.google-apps\..*`, `.*pdf.*` |
| Image & Similar Formats | 20 MB | MIME Types: image/jpeg, image/png, image/gif, image/bmp, image/tiff, image/webp, image/svg+xml. Regex Patterns: `.*image.*` |
| All Other Supported Formats | 4 MB | Any supported file format that doesn’t fall into the PDF or Image categories is checked against the 4 MB maximum size limit. |
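
A minimal sketch of the category lookup described above. The function names are hypothetical, and three points are assumptions rather than documented behavior: that 1 MB is 1,048,576 bytes, that a file of exactly the limit size passes, and that the image pattern amounts to "the MIME type contains image".

```python
import re

MB = 1024 * 1024  # assumption: the documented "MB" means 2**20 bytes

# MIME types and patterns for the 100 MB category, from the size table.
PDF_LIKE_TYPES = {
    "application/pdf",
    "application/x-pdf",
    "application/acrobat",
    "application/vnd.pdf",
    "text/pdf",
    "text/x-pdf",
    "text/csv",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}
PDF_LIKE_PATTERNS = [
    re.compile(r"^application/vnd\.google-apps\..*", re.IGNORECASE),
    re.compile(r".*pdf.*", re.IGNORECASE),
]

# MIME types for the 20 MB category, from the size table.
IMAGE_TYPES = {
    "image/jpeg", "image/png", "image/gif", "image/bmp",
    "image/tiff", "image/webp", "image/svg+xml",
}


def size_limit_mb(mime_type: str) -> int:
    """Return the size limit in MB for a supported MIME type."""
    m = mime_type.lower()
    if m in PDF_LIKE_TYPES or any(p.search(m) for p in PDF_LIKE_PATTERNS):
        return 100
    # Assumption: the image pattern matches any type containing "image".
    if m in IMAGE_TYPES or "image" in m:
        return 20
    return 4  # all other supported formats


def exceeds_limit(mime_type: str, size_bytes: int) -> bool:
    """True when the file is larger than its category limit."""
    return size_bytes > size_limit_mb(mime_type) * MB
```

Under these assumptions, a 5 MB text/plain file is skipped, while a 5 MB text/csv file is accepted because CSV falls into the 100 MB category.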

Content deemed stale based on its age is skipped to maintain relevance.

Rule: Any content where the last modified time is more than 2 years before the current sync time will be skipped.
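
The age rule can be approximated as shown below. The documentation doesn’t say how "2 years" is measured, so this sketch assumes calendar years counted back from the sync time. `is_stale` and `MAX_AGE_YEARS` are hypothetical names.

```python
from datetime import datetime, timezone

MAX_AGE_YEARS = 2  # from the rule above


def is_stale(last_modified: datetime, sync_time: datetime) -> bool:
    """True when last_modified is more than MAX_AGE_YEARS before sync_time."""
    try:
        cutoff = sync_time.replace(year=sync_time.year - MAX_AGE_YEARS)
    except ValueError:
        # sync_time is February 29 and the target year is not a leap year.
        cutoff = sync_time.replace(year=sync_time.year - MAX_AGE_YEARS, day=28)
    return last_modified < cutoff


sync = datetime(2025, 6, 1, tzinfo=timezone.utc)
old_file = datetime(2022, 1, 15, 10, 30, tzinfo=timezone.utc)
recent_file = datetime(2024, 1, 1, tzinfo=timezone.utc)
```

With these sample values, `is_stale(old_file, sync)` is `True` and `is_stale(recent_file, sync)` is `False`.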

When a file fails any constraint check (format, size, or last modified time), it’s skipped and the outcome is recorded in the UDLO. The UDLO contains the following fields, which describe the failure.

| Field | Value | Description |
| --- | --- | --- |
| Sync Status | SUCCESS, FAILURE, or SKIPPED | Indicates the file ingestion status. |
| Sync Status Detail | Detailed Error Message | Provides the explicit reason why the file was skipped. |
| File Path | NULL or Empty String | The file path field is explicitly cleared for all skipped records to mark them as non-ingested. |
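
When auditing skipped files, it can help to model these three fields in a small structure. The sketch below is hypothetical: the class and attribute names (`UdloRecord`, `sync_status`, `sync_status_detail`, `file_path`) are illustrative stand-ins, not the actual UDLO field API names, and how the records are fetched isn’t shown.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UdloRecord:
    """Illustrative stand-in for the three UDLO fields described above."""
    sync_status: str                  # "SUCCESS", "FAILURE", or "SKIPPED"
    sync_status_detail: Optional[str]
    file_path: Optional[str]


def skipped_record(reason: str) -> UdloRecord:
    """Build a record for a skipped file; the file path is cleared."""
    return UdloRecord("SKIPPED", reason, None)


def skip_reasons(records: List[UdloRecord]) -> List[str]:
    """Collect the detail messages of all skipped records."""
    return [
        r.sync_status_detail or ""
        for r in records
        if r.sync_status == "SKIPPED"
    ]


records = [
    UdloRecord("SUCCESS", None, "/sites/example/report.pdf"),
    skipped_record("PDF must be less than 100 MB (found 204.20 MB)"),
]
```

Here `skip_reasons(records)` yields only the size message from the second record, and that record's `file_path` is `None`.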

Files are processed through a series of checks. If a check fails, the file is immediately skipped and the appropriate reason is logged in the UDLO.

| Failure Condition | Status Details Message |
| --- | --- |
| Unsupported Format | Explicitly Unsupported: If the file's MIME type matches an unsupported pattern (for example, application/vnd.google-apps.shortcut). |
| Default Format Skip | Implicitly Unsupported: If the file's type doesn’t match any supported format or regex pattern, it’s skipped. |
| Content Age | Last modified time %s is older than %d years (for example, Last modified time 2022-01-15T10:30:00Z is older than 2 years) |
| Size Exceeded (PDF, CSV & Similar) | %s must be less than %d MB (found %.2f MB) (for example, PDF must be less than 100 MB (found 204.20 MB)) |
| Size Exceeded (Image & Similar) | %s must be less than %d MB (found %.2f MB) (for example, IMAGE must be less than 20 MB (found 25.00 MB)) |
| Size Exceeded (All Other) | %s must be less than %d MB (found %.2f MB) (for example, text/other files must be less than 4 MB (found 30.00 MB)) |
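
The message templates use printf-style placeholders, which Python's `%` operator also accepts. The sketch below shows how the documented templates expand and how a size message could be parsed back during troubleshooting. The helper names are hypothetical, and the conversion of 1 MB to 1,048,576 bytes is an assumption.

```python
import re

# Templates copied from the table above.
SIZE_TEMPLATE = "%s must be less than %d MB (found %.2f MB)"
AGE_TEMPLATE = "Last modified time %s is older than %d years"

# Pattern that reverses SIZE_TEMPLATE for troubleshooting.
SIZE_MESSAGE_RE = re.compile(
    r"^(?P<label>.+) must be less than (?P<limit>\d+) MB "
    r"\(found (?P<found>\d+(?:\.\d+)?) MB\)$"
)


def size_message(label: str, limit_mb: int, size_bytes: int) -> str:
    """Expand the size template; assumes 1 MB = 1024 * 1024 bytes."""
    return SIZE_TEMPLATE % (label, limit_mb, size_bytes / (1024 * 1024))


def age_message(last_modified: str, max_years: int) -> str:
    """Expand the content-age template."""
    return AGE_TEMPLATE % (last_modified, max_years)


def parse_size_message(message: str):
    """Return (label, limit_mb, found_mb), or None if the message differs."""
    match = SIZE_MESSAGE_RE.match(message)
    if match is None:
        return None
    return match["label"], int(match["limit"]), float(match["found"])
```

Because the template uses `%.2f`, the reported size always carries two decimal places, so a 25 MB image produces `IMAGE must be less than 20 MB (found 25.00 MB)`.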

The Microsoft SharePoint connector supports ingestion of sites or libraries up to 300 GB per ingestion job.

Ingestion jobs that run for longer than 24 hours may time out depending on system execution limits, so plan accordingly.