akash-pandey1/data-scrapper-akash-p1

Data Scraper

A web scraping application built with Angular and Node.js that lets users extract structured data from a website by URL.

Features

  • 🎯 Smart Field Selection: Choose from 20 pre-defined fields to scrape
  • 🔍 Auto Field Discovery: Automatically discovers additional fields from the website
  • 🎨 Beautiful UI: Modern, responsive design with Angular Material
  • 📊 Real-time Results: View scraped data immediately
  • 💾 CSV Export: Export scraped data to CSV format
  • ⚡ Fast & Efficient: Built with performance in mind

Pre-configured Fields

The app comes with 20 default fields (10 auto-selected):

  • Address, City, Name, Full Name, Location
  • Description, Note, About, Contact, Email
  • Phone, Title, Company, Website, Street
  • State, Zip Code, Country, Price, Date

Tech Stack

Frontend

  • Angular 18 (SSR enabled)
  • Angular Material
  • TypeScript
  • SCSS

Backend

  • Node.js
  • Express
  • Cheerio (HTML parsing)
  • Axios (HTTP requests)
  • json2csv (CSV generation)

Installation

Quick Setup (All Dependencies)

npm run install:all

Backend Setup

cd backend
npm install
npm run dev

The backend server will start on http://localhost:3000

Frontend Setup

cd frontend
npm install
npm start

The frontend will start on http://localhost:4200

Usage

  1. Enter URL: Paste the website URL you want to scrape
  2. Select Fields: Choose which data fields to extract (10 are pre-selected)
  3. Start Scraping: Click the "Start Scraping" button
  4. Review Results: View extracted data and discovered fields
  5. Select Extra Fields: Choose any additional fields found
  6. Export: Download the results as a CSV file

API Endpoints

POST /api/scrape/url

Scrape a website URL and extract data

Request:

{
  "url": "https://example.com",
  "selectedFields": ["name", "email", "phone"]
}

Response:

{
  "success": true,
  "url": "https://example.com",
  "pageInfo": {
    "title": "Example Page",
    "description": "...",
    "keywords": "..."
  },
  "scrapedData": {
    "name": ["John Doe"],
    "email": ["john@example.com"]
  },
  "extraFields": ["username", "bio"],
  "timestamp": "2025-11-11T..."
}
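
The request and response shapes above can be written down as TypeScript types, which helps a client catch malformed responses early. The following is a minimal sketch: the interface names and the `isScrapeResponse` and `isStringArray` helpers are hypothetical, and the field types are inferred only from the example payloads, not from the backend source.

```typescript
// Shapes inferred from the example request and response above (assumptions).
interface ScrapeRequest {
  url: string;
  selectedFields: string[];
}

interface ScrapeResponse {
  success: boolean;
  url: string;
  pageInfo: { title: string; description: string; keywords: string };
  scrapedData: Record<string, string[]>;
  extraFields: string[];
  timestamp: string;
}

// Example request body matching the documented format.
const exampleRequest: ScrapeRequest = {
  url: "https://example.com",
  selectedFields: ["name", "email", "phone"],
};

// Hypothetical helper: true when the value is an array containing only strings.
function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Hypothetical runtime check that a parsed JSON value has the documented shape.
function isScrapeResponse(value: unknown): value is ScrapeResponse {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const data = v.scrapedData;
  if (typeof data !== "object" || data === null || Array.isArray(data)) return false;
  return (
    typeof v.success === "boolean" &&
    typeof v.url === "string" &&
    typeof v.pageInfo === "object" &&
    v.pageInfo !== null &&
    Object.values(data as Record<string, unknown>).every((item) => isStringArray(item)) &&
    isStringArray(v.extraFields) &&
    typeof v.timestamp === "string"
  );
}
```

A client could send `exampleRequest` as the JSON body of the POST and pass the parsed reply through `isScrapeResponse` before reading `scrapedData`.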

POST /api/scrape/export

Export scraped data to CSV

Request:

{
  "data": {
    "name": ["John Doe"],
    "email": ["john@example.com"]
  },
  "fields": ["name", "email"]
}

Response: CSV file download

Project Structure

data-scraper/
├── backend/
│   ├── controllers/
│   │   └── scrapeController.js
│   ├── routes/
│   │   └── scrape.js
│   ├── server.js
│   ├── package.json
│   └── README.md
├── frontend/
│   ├── src/
│   │   ├── app/
│   │   │   ├── services/
│   │   │   │   └── scraper.service.ts
│   │   │   ├── app.component.ts
│   │   │   ├── app.component.html
│   │   │   ├── app.component.scss
│   │   │   └── app.config.ts
│   │   ├── styles.scss
│   │   └── index.html
│   └── package.json
└── README.md

Development

Backend Development

cd backend
npm run dev  # Runs with nodemon for auto-restart

Frontend Development

cd frontend
npm start  # Runs Angular dev server

Building for Production

Backend

cd backend
npm start

Frontend

cd frontend
npm run build

The built files will be in the frontend/dist/ directory.

Deployment

Hosting Options

This project can be deployed to various platforms:

Backend Deployment

  • Heroku: Deploy the backend/ folder as a Node.js app
  • Railway: Connect your GitHub repo and set the root to backend/
  • Render: Deploy as a Web Service with Node.js
  • Vercel: Use serverless functions or deploy as a Node.js app

Frontend Deployment

  • Vercel: Connect the repo and set build command to cd frontend && npm run build
  • Netlify: Set build command to cd frontend && npm run build and publish directory to frontend/dist
  • GitHub Pages: Build and deploy the frontend/dist/ folder
  • Firebase Hosting: Deploy the frontend/dist/ folder

Environment Variables

For production, make sure to set appropriate environment variables:

  • PORT: Backend server port (default: 3000)
  • NODE_ENV: Set to production for production builds
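
As an illustration of the PORT default described above, the port lookup might look like the sketch below. The `resolvePort` helper is hypothetical and not taken from backend/server.js; it only shows one way to fall back to 3000 when PORT is missing or invalid.

```typescript
// Hypothetical helper: read PORT from an environment-like object.
// Falls back to 3000 (the documented default) when PORT is unset or not a valid port.
function resolvePort(env: Record<string, string | undefined>): number {
  const parsed = Number.parseInt(env.PORT ?? "", 10);
  const valid = Number.isInteger(parsed) && parsed > 0 && parsed <= 65535;
  return valid ? parsed : 3000;
}
```

In a Node.js process this would typically be called as `resolvePort(process.env)` before starting the server.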

CORS Configuration

If deploying frontend and backend separately, update CORS settings in backend/server.js to allow requests from your frontend domain.
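
One common way to restrict CORS is an explicit allow-list of frontend origins. The sketch below shows only the decision logic; the `isAllowedOrigin` helper and the example domain are hypothetical, and how the check is wired into backend/server.js depends on the CORS setup the project actually uses.

```typescript
// Hypothetical allow-list check for the Origin header of an incoming request.
function isAllowedOrigin(origin: string | undefined, allowed: string[]): boolean {
  // Requests without an Origin header (for example curl or server-to-server calls)
  // are let through in this sketch; tighten this if that is not desired.
  if (origin === undefined) return true;
  return allowed.includes(origin);
}

// Example allow-list: the local Angular dev server plus a placeholder production domain.
const allowedOrigins = ["http://localhost:4200", "https://your-frontend.example.com"];
```

Matching on the full origin string (scheme, host, and port) avoids accidentally allowing look-alike domains that a substring match would accept.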

Features in Detail

Smart Field Extraction

The scraper uses multiple strategies to find data:

  • Searches for matching IDs, classes, and names
  • Looks for label-value pairs
  • Extracts meta tag content
  • Identifies semantic patterns
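
To make one of these strategies concrete, here is a minimal sketch of label-value pair matching on plain text. It is an assumption-based illustration, not the logic in scrapeController.js: the `extractLabelValues` helper is hypothetical, and it works on text lines rather than on a parsed DOM as the Cheerio-based backend presumably does.

```typescript
// Find "Label: value" or "Label - value" pairs, one per line, case-insensitive on the label.
function extractLabelValues(text: string, field: string): string[] {
  // Escape regex metacharacters so the field name is matched literally.
  const escaped = field.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(`^\\s*${escaped}\\s*[:\\-]\\s*(.+?)\\s*$`, "i");
  const values: string[] = [];
  for (const line of text.split(/\r?\n/)) {
    const match = pattern.exec(line);
    if (match && match[1] !== undefined) values.push(match[1]);
  }
  return values;
}
```

For example, calling it with the field `email` on text containing the lines `Email: john@example.com` and `email - jane@example.com` collects both addresses, which matches the array-per-field shape of `scrapedData` in the API response.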

Field Discovery

Automatically discovers potential fields by analyzing:

  • HTML attributes (id, class, name)
  • Meta tags
  • Form labels
  • Table headers
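
The attribute-based part of discovery can be sketched as follows. This is a simplified, hypothetical illustration: `discoverFields` scans raw HTML with a regular expression for double-quoted `id`, `class`, and `name` values, whereas the actual backend uses Cheerio and may apply different rules. The regex also matches attributes such as `data-id`, which a real implementation would likely filter.

```typescript
// Collect candidate field names from id, class, and name attribute values,
// skipping any names the user has already selected.
function discoverFields(html: string, knownFields: string[]): string[] {
  const known = new Set(knownFields.map((f) => f.toLowerCase()));
  const found = new Set<string>();
  const attrPattern = /\b(?:id|class|name)\s*=\s*"([^"]*)"/gi;
  let match: RegExpExecArray | null;
  while ((match = attrPattern.exec(html)) !== null) {
    // A class attribute can hold several space-separated tokens.
    for (const token of (match[1] ?? "").split(/\s+/)) {
      const candidate = token.toLowerCase();
      if (candidate && !known.has(candidate)) found.add(candidate);
    }
  }
  return Array.from(found).sort();
}
```

The returned names correspond to what the API reports as `extraFields`; sorting and de-duplicating keeps the list stable between runs.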

CSV Export

Transforms scraped data into CSV format with:

  • Proper column headers
  • Multiple values per field
  • Clean formatting
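
The backend relies on json2csv for this step. Purely to illustrate the transformation, the hand-rolled sketch below turns the `data` and `fields` of an export request into CSV text. The `toCsv` and `escapeCsvCell` helpers are hypothetical, and placing the i-th value of each field on the i-th row is an assumption about how multiple values are laid out.

```typescript
// Quote a cell when it contains a comma, double quote, or line break.
function escapeCsvCell(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// One column per field; row i holds the i-th value of each field (empty if absent).
function toCsv(data: Record<string, string[]>, fields: string[]): string {
  const rowCount = Math.max(0, ...fields.map((f) => (data[f] ?? []).length));
  const lines = [fields.map(escapeCsvCell).join(",")];
  for (let i = 0; i < rowCount; i++) {
    lines.push(fields.map((f) => escapeCsvCell((data[f] ?? [])[i] ?? "")).join(","));
  }
  return lines.join("\n");
}
```

Fields with fewer values than the longest column are padded with empty cells, so every row has the same number of columns as the header.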

Browser Support

  • Chrome (recommended)
  • Firefox
  • Safari
  • Edge

License

ISC

Author

Akash Pandey

📧 Hire Me - akashdeep9226@gmail.com

Contributing

Feel free to submit issues and enhancement requests!
