Type hints for Azure AI Search indexes ⚙️ using schema metadata 📋 from Microsoft Purview (Data Governance) 🔍

kimtth/purview-driven-azure-ai-search-index


Purview-Driven Azure AI Search Type Hints

🧪 Proof-of-concept (PoC) script that evaluates the feasibility of using the Purview API to generate type hints and connection details for an Azure AI Search index from schema data retrieved from the Purview Data Catalog.

Environment configuration

🧭 Microsoft Purview

  1. Create a Microsoft Purview account in Azure or via the Purview portal.

  2. Grant the Reader permission to the managed identity associated with the Purview account.

  3. Create data sources and sample data.

  4. Open the Microsoft Purview governance portal and scan the data sources in Data Map.

  5. Create a service principal for API access to Purview.

  6. Use the Microsoft Purview Python SDK.

  7. References: Purview
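
As a rough sketch of steps 5 and 6, the snippet below builds a catalog search request and summarizes a response without calling the service. The request and response shapes (keywords, limit, filter.entityType, @search.count, value) reflect my understanding of the Purview discovery query API and should be checked against the SDK documentation. The helpers build_search_request and summarize_assets, and the sample response, are hypothetical.

```python
from typing import Any


def build_search_request(entity_type: str, keywords: str = "*", limit: int = 50) -> dict[str, Any]:
    """Build a search body that filters assets by entity type (assumed shape)."""
    return {
        "keywords": keywords,
        "limit": limit,
        "filter": {"entityType": entity_type},
    }


def summarize_assets(response: dict[str, Any]) -> list[str]:
    """Turn a search response into a count line plus one line per asset."""
    lines = [f"Search count: {response.get('@search.count', 0)}"]
    for asset in response.get("value", []):
        lines.append(
            f"Name: {asset.get('name')}, Type: {asset.get('entityType')}, "
            f"GUID: {asset.get('id')}, qualifiedName: {asset.get('qualifiedName')}"
        )
    return lines


# Made-up response for illustration only; a real one would come from the SDK client.
sample_response = {
    "@search.count": 1,
    "value": [
        {
            "name": "Customer",
            "entityType": "azure_sql_table",
            "id": "00000000-0000-0000-0000-000000000000",
            "qualifiedName": "mssql://example.database.windows.net/exampledb/SalesLT/Customer",
        }
    ],
}

request_body = build_search_request("azure_sql_table")
summary = summarize_assets(sample_response)
for line in summary:
    print(line)
```

In a real run, the request body would be passed to the SDK client authenticated with the service principal from step 5, and the returned dict would replace sample_response.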

🔎 Azure AI Search

  1. Create an index with the Python SDK.

  2. Understand supported data types and how to define an index.

  3. References: Azure AI Search
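
To make step 2 more concrete, here is a minimal sketch that turns a field-to-type mapping into an index definition in the JSON shape used by the Azure AI Search REST API (an index name plus a list of fields with a type and a key flag). The function build_index_definition, the index name, and the rule that only string fields are searchable are assumptions made for illustration; the Python SDK offers model classes for the same purpose.

```python
import json


def build_index_definition(index_name: str, field_types: dict[str, str], key_field: str) -> dict:
    """Build an index definition dict from {field name: EDM type}."""
    if key_field not in field_types:
        raise ValueError(f"key field {key_field!r} is not in the field mapping")
    if field_types[key_field] != "Edm.String":
        # Azure AI Search requires the document key to be an Edm.String field.
        raise ValueError("the key field must be of type Edm.String")

    fields = []
    for name, edm_type in field_types.items():
        fields.append({
            "name": name,
            "type": edm_type,
            "key": name == key_field,
            # Assumption: only string fields take part in full-text search.
            "searchable": edm_type == "Edm.String",
        })
    return {"name": index_name, "fields": fields}


index_definition = build_index_definition(
    "products-index",  # hypothetical index name
    {"rowguid": "Edm.String", "Name": "Edm.String", "ListPrice": "Edm.Double"},
    key_field="rowguid",
)
print(json.dumps(index_definition, indent=2))
```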

Azure AI Search Skillset Mapping: Python Analogy

  • Skill inputs:
      name: expected parameter name (like a function argument)
      source: value taken from the document (e.g. /document/field)
  • Skill outputs:
      name: predefined output label
      targetName: key under which the output is stored in the document (dict key)
  • Azure AI Search skillset JSON vs. Python analogy
"inputs": [
  { "name": "text", "source": "/document/extracted_content" }
],
"outputs": [
  { "name": "textItems", "targetName": "pages" }
]

# Python analogy of the JSON mapping above
document = {
    "extracted_content": "Some extracted text."
}

def split_skill(text):
    return {
        "pages": text.split("\n\n")
    }

# Run skill
document.update(split_skill(document["extracted_content"]))
  • Chaining explanation

Each skill's outputs (their targetName values) become keys in the document dict. Subsequent skills reference these keys in their source paths, which lets data flow from one step to the next:

'''
{
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
      "name": "extraction-skill",
      "context": "/document",
      "inputs": [
        { "name": "file_data", "source": "/document/file_data" }
      ],
      "outputs": [
        { "name": "content", "targetName": "extracted_content" },
        { "name": "normalized_images", "targetName": "normalized_images" }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "split-skill",
      "context": "/document",
      "inputs": [
        { "name": "text", "source": "/document/extracted_content" }
      ],
      "outputs": [
        { "name": "textItems", "targetName": "pages" }
      ]
    }
  ]
}
'''

# Initial document with binary file data
document = {
    "file_data": "binary file content"
}

# Extraction Skill
def extract(file_data):
    return {
        "extracted_content": "Text",
        "normalized_images": ["img1", "img2"]
    }

# Split Skill
def split(text):
    return {
        "pages": text.split("\n\n")
    }

# Run skills in sequence, aligned with Azure JSON mappings
document.update(extract(document["file_data"]))
document.update(split(document["extracted_content"]))

# Final document structure
print(document)
  • Summary
Azure term	Python analogy
source	document["key"]
name	function parameter name
targetName	dict key in the output; the next skill's source references that key

✅ What name means depends on the context: skill logic vs. index mapping.

Inside a skill: name = skill-specific internal label (fixed by Microsoft's skill design)

In an index mapping: name = output field name in your Azure AI Search index
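
The difference between name and targetName is easier to see when the skill functions return the fixed output names and a small runner applies the renaming. The following is a minimal sketch of that idea, not a description of how Azure AI Search is implemented: resolve, run_skill, and the two stand-in skill functions are hypothetical, and source paths are assumed to be simple /document/<key> references.

```python
def resolve(document: dict, source: str):
    """Look up a '/document/<key>' path in the document dict."""
    prefix = "/document/"
    if not source.startswith(prefix):
        raise ValueError(f"unsupported source path: {source}")
    return document[source[len(prefix):]]


def run_skill(document: dict, skill: dict, func) -> None:
    """Call func with the mapped inputs, then store outputs under targetName."""
    # inputs: 'name' is the keyword argument, 'source' is where the value lives
    kwargs = {i["name"]: resolve(document, i["source"]) for i in skill["inputs"]}
    result = func(**kwargs)
    # outputs: 'name' is the skill's own label, 'targetName' is the document key
    for o in skill["outputs"]:
        document[o.get("targetName", o["name"])] = result[o["name"]]


def extract(file_data):
    # Stand-in for DocumentExtractionSkill; returns its fixed output names.
    return {"content": "Page one.\n\nPage two.", "normalized_images": []}


def split(text):
    # Stand-in for SplitSkill; 'textItems' is the fixed output name.
    return {"textItems": text.split("\n\n")}


skills = [
    ({"inputs": [{"name": "file_data", "source": "/document/file_data"}],
      "outputs": [{"name": "content", "targetName": "extracted_content"}]}, extract),
    ({"inputs": [{"name": "text", "source": "/document/extracted_content"}],
      "outputs": [{"name": "textItems", "targetName": "pages"}]}, split),
]

document = {"file_data": "binary file content"}
for skill, func in skills:
    run_skill(document, skill, func)

print(document["pages"])
```

Here the split function never mentions "pages": it only knows its own output label textItems, and the mapping decides where that value lands in the document.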

Type hints sample output

2025-MM-DD 00:01:34.006 | INFO     | __main__:search_data_assets:125 - Search count: **REDACTED**
2025-MM-DD 00:01:34.008 | INFO     | __main__:search_data_assets:130 - Name: Customer, Type: azure_sql_table, GUID: adb80e06-3283-4907-b8f8-7ef6f6f60000, qualifiedName: mssql://**REDACTED**.database.windows.net/**REDACTED**/SalesLT/Customer
2025-MM-DD 00:01:34.009 | INFO     | __main__:search_data_assets:130 - Name: Address, Type: azure_sql_table, GUID: 09d0c95f-4238-44c8-9c7d-5af6f6f60000, qualifiedName: mssql://**REDACTED**.database.windows.net/**REDACTED**/SalesLT/Address
...
2025-MM-DD 00:01:35.393 | INFO     | __main__:main:200 - Table azure_blob_path (**REDACTED**): {}
2025-MM-DD 00:01:35.881 | INFO     | __main__:main:200 - Table azure_sql_table (**REDACTED**): {'ProductCategoryID': <AzureSearchDataType.EDM_INT32: 'Edm.Int32'>, 'Weight': <AzureSearchDataType.EDM_DOUBLE: 'Edm.Double'>, 'ProductID': <AzureSearchDataType.EDM_INT32: 'Edm.Int32'>, 'SellStartDate': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'DiscontinuedDate': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'ListPrice': <AzureSearchDataType.EDM_DOUBLE: 'Edm.Double'>, 'ThumbNailPhoto': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'Name': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'SellEndDate': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'rowguid': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'Size': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'Color': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'ProductModelID': <AzureSearchDataType.EDM_INT32: 'Edm.Int32'>, 'StandardCost': <AzureSearchDataType.EDM_DOUBLE: 'Edm.Double'>, 'ThumbnailPhotoFileName': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'ProductNumber': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'ModifiedDate': <AzureSearchDataType.EDM_STRING: 'Edm.String'>}
2025-MM-DD 00:01:36.349 | INFO     | __main__:main:200 - Table azure_sql_view (**REDACTED**): {'Copyright': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'Material': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'Saddle': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'Style': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'NoOfYears': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'RiderExperience': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'MaintenanceDescription': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'ProductLine': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'ProductURL': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'PictureSize': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'WarrantyPeriod': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'ProductModelID': <AzureSearchDataType.EDM_INT32: 'Edm.Int32'>, 'Pedal': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'BikeFrame': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'Crankset': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'rowguid': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'Color': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'PictureAngle': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'Manufacturer': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'Name': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'WarrantyDescription': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'Wheel': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'ModifiedDate': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'Summary': <AzureSearchDataType.EDM_STRING: 'Edm.String'>, 'ProductPhotoID': <AzureSearchDataType.EDM_STRING: 'Edm.String'>}
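
The output above suggests a mapping step from source column types to Azure AI Search data types. The sketch below shows one way such a step could look; it is not the script's actual code. Only the AzureSearchDataType member names and values are taken from the log. The SQL type names, the SQL_TO_SEARCH_TYPE table, and to_type_hints are assumptions, chosen so that integers map to Edm.Int32, decimal-like types map to Edm.Double, and everything else (including dates) falls back to Edm.String, as the sample output appears to do.

```python
from enum import Enum


class AzureSearchDataType(Enum):
    EDM_STRING = "Edm.String"
    EDM_INT32 = "Edm.Int32"
    EDM_DOUBLE = "Edm.Double"


# Assumed mapping from SQL column types to search types.
SQL_TO_SEARCH_TYPE = {
    "int": AzureSearchDataType.EDM_INT32,
    "decimal": AzureSearchDataType.EDM_DOUBLE,
    "money": AzureSearchDataType.EDM_DOUBLE,
    "float": AzureSearchDataType.EDM_DOUBLE,
}


def to_type_hints(columns: dict[str, str]) -> dict[str, AzureSearchDataType]:
    """Map {column name: SQL type} to search types, defaulting to Edm.String."""
    return {
        name: SQL_TO_SEARCH_TYPE.get(sql_type.lower(), AzureSearchDataType.EDM_STRING)
        for name, sql_type in columns.items()
    }


# Hypothetical column metadata, as it might be read from a scanned table asset.
hints = to_type_hints({
    "ProductID": "int",
    "ListPrice": "money",
    "Name": "nvarchar",
    "ModifiedDate": "datetime",
})
print(hints)
```

Falling back to Edm.String keeps unknown types indexable, at the cost of losing range filtering and sorting semantics for dates and other typed values.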
