Skip to content

Conversation

@devin-ai-integration
Copy link
Contributor

@devin-ai-integration devin-ai-integration bot commented Nov 4, 2025

Fix dropdown option spacing in markdown output

Summary

This PR fixes a bug where HTML dropdown (<select>/<option>) elements were being converted to markdown with no spacing between options, resulting in concatenated text like "Option1Option2Option3" instead of "Option1, Option2, Option3".

Changes:

  • Added a preprocessSelectElements() function that runs before HTML-to-markdown conversion
  • Select elements are replaced with comma-separated lists of their option values
  • Filters out placeholder options (empty value="" or text starting with "Select")
  • Skips disabled options
  • Normalizes whitespace in option text
  • Added 6 comprehensive unit tests covering various scenarios

The fix works uniformly across both the Go converter and TurndownService fallback by preprocessing the HTML before conversion.

Review & Testing Checklist for Human

Risk Level: Yellow (Medium risk - modifies core parsing logic with limited end-to-end testing)

  • Test with the actual problematic URL - Scrape https://renovahospitals.com/ through the Firecrawl API and verify the specialty dropdown options are properly formatted with commas/spaces
  • Check for performance impact - Test with large HTML documents (especially those with many or large select elements) to ensure the preprocessing doesn't significantly impact scraping speed
  • Verify no regressions - Test scraping various websites that use different HTML structures to ensure the preprocessing doesn't break other conversions or cause unexpected side effects
  • Test edge cases - Try dropdowns with: (1) options containing commas in their text, (2) very long option lists (100+ items), (3) nested or multiple selects on the same page

Test Plan

# Test the problematic URL
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{"url": "https://renovahospitals.com/", "formats": ["markdown"]}' \
  | jq '.data.markdown' | grep -A 5 "Speciality"

Expected: Should see "Anaesthesiology & Pain Management, Arthroplasty And Sports Medicine, Cardiology, ..." with proper comma-space separation.

Notes

  • All 11 unit tests pass locally (including 6 new tests for select/option handling)
  • Fixed an unrelated test expectation that was using the wrong markdown list format
  • The placeholder detection regex /^select\b/i might need refinement based on real-world usage patterns
  • Performance impact should be minimal since cheerio is already a dependency and the preprocessing is lightweight, but worth monitoring

Session: https://app.devin.ai/sessions/b5d46cd2ec9d4285ad2cf6c9d1bc2027
Requested by: Micah Stairs (@micahstairs)


Summary by cubic

Fixes markdown conversion for HTML dropdowns by preprocessing select/option elements and outputting comma-separated lists. Prevents concatenated text like "Option1Option2Option3" across both the Go converter and Turndown fallback.

  • Bug Fixes
    • Replace select elements with comma-separated option text.
    • Ignore placeholder (empty value or text starting with “Select”) and disabled options; normalize whitespace.
    • Add 6 unit tests for select/option handling; adjust one list test expectation.

Written for commit 6daee32. Summary will update automatically on new commits.

- Add pre-processing step to handle select/option elements before markdown conversion
- Replace select elements with comma-separated lists of options
- Filter out placeholder options (empty value or text starting with 'Select')
- Skip disabled options
- Normalize whitespace in option text
- Add comprehensive unit tests for select/option conversion

Fixes issue where dropdown options were concatenated without spaces
(e.g., 'Option1Option2Option3' instead of 'Option1, Option2, Option3')

Co-Authored-By: Micah Stairs <micah.stairs@gmail.com>
@devin-ai-integration
Copy link
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 2 files

Prompt for AI agents (all 1 issues)

Understand the root cause of the following 1 issues and fix them.


<file name="apps/api/src/lib/html-to-markdown.ts">

<violation number="1" location="apps/api/src/lib/html-to-markdown.ts:33">
This check treats any &lt;option&gt; without an explicit value attribute as a placeholder, so selects that rely on the default text value end up with all of their real options dropped from the markdown output.</violation>
</file>

React with 👍 or 👎 to teach cubic. Mention @cubic-dev-ai to give feedback, ask questions, or re-run the review.

const text = $option.text().trim();

if (
value === "" ||
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check treats any without an explicit value attribute as a placeholder, so selects that rely on the default text value end up with all of their real options dropped from the markdown output.

Prompt for AI agents
Address the following comment on apps/api/src/lib/html-to-markdown.ts at line 33:

<comment>This check treats any &lt;option&gt; without an explicit value attribute as a placeholder, so selects that rely on the default text value end up with all of their real options dropped from the markdown output.</comment>

<file context>
@@ -10,6 +11,49 @@ dotenv.config();
+        const text = $option.text().trim();
+
+        if (
+          value === &quot;&quot; ||
+          /^select\b/i.test(text) ||
+          $option.attr(&quot;disabled&quot;)
</file context>
Fix with Cubic

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant