Servo & SQLGitHub
A JOURNEY OF TRANSFORMING FAILURES INTO A GOOD IDEA
Servo
 A new browser engine written from scratch in Rust by Mozilla
 Original Project: Write an Android frontend for servo
 Issues, Issues and issues
 Zero knowledge about mobile development and Rust
 Outdated documentation
 Very poorly supported on Android
 Tried approaching in various ways, but to no avail
 Made build and packaging guide and reported issues
SQLGitHub – Motivation (I)
 Who is stealing all my easy bugs/issues?
 There are very limited tools for managing GitHub organizations
 Open-source organizations are usually understaffed
 The Servo project on GitHub for example,
contains 129 repositories
managed by practically 1 person.
SQLGitHub – Background
 Organization: Organizations are shared accounts where businesses
and open-source projects can collaborate across many projects at
once.
 Repository: A repository contains all of the project files, and stores
each file's revision history.
 Commit: An individual change to a set of files.
 Issue: Suggested improvements, tasks or questions
 Pull Request: Proposed change
to a repository.
Repository
Organization
Repository Repository
• Commits
• Issues
• Pull Requests
• …
• Commits
• Issues
• Pull Requests
• …
• Commits
• Issues
• Pull Requests
• …https://help.github.com/articles/github-glossary/
SQLGitHub – Motivation (II)
 Common questions/problems an organization admin include:
 Obtain certain metrics of the organization in machine-friendly format for
post-processing (eg. KPI report)
 Get the current list of projects hosted on GitHub
 List of the most popular repositories in the organization
 Get the list of issues closed (resolved) for the past 7 days
 What are the critical issues that are still left open?
 Who are the top contributors of the past month?
 … endless possible questions
Abstract
 SQLGitHub features a SQL-like syntax that allows you to:
Query information about an organization as a whole.
 You may also think of it as a better, enhanced frontend layer built
on top of GitHub’s RESTful API
Introduction – Supported Schema
 SELECT
select_expr [, select_expr ...]
FROM {org_name | org_name.{repos | issues | pulls | commits}}
[WHERE where_condition]
[GROUP BY {col_name | expr}
[ASC | DESC], ...]
[HAVING where_condition]
[ORDER BY {col_name | expr}
[ASC | DESC], ...]
[LIMIT row_count]
Introduction – Use Case (I)
 Get name and description from all the repos in apple.
 select name, description from apple.repos
Introduction – Use Case (II)
 Get last-updated time and title of the issues closed in the past week (7
days) in servo listed in descending order of last-updated time.
 select updated_at, title from servo.issues.closed.7 order by updated_at desc
Introduction – Use Case (III)
 Get top 10 most-starred repositories in servo.
 select concat(concat("(", stargazers_count, ") ", name), ": ", description) from
servo.repos order by stargazers_count desc, name limit 10
Introduction – Use Case (IV)
 Get top 10 contributors in servo for the past month (30 days) based
on number of commits.
 select login, count(login) from servo.commits.30 group by login order by
count(login) desc, login limit 10
Introduction – Technology Stack
 Python
 re & regex, regular expression libraries
 PyGithub (patched), an unofficial client library for GitHub API
 prompt_toolkit, a library for building prompts
 pygments, a library for syntax highlighting
Introduction – (Simplified) Flow
Fetch data (from)
Filter by where conditions
Evaluate partial exprs
Group by group exprs
Order by order exprs
Evaluate select exprs
Fetch data with required fields from GitHub API
Evaluate where conditions and filter fetched data
Evaluate group exprs and other “field” exprs
Generate table groups by values of group exprs
Sort within and between tables
Evaluate select exprs
Filter by having conditions Evaluate having conditions and filter tables
Introduction – Architecture
top_level
parser
tokenizersession
grouping ordering
table_fetcher expression
definitionutilitiestable PyGithub
process user input
process external data
fetches external data
base classes, functions
and SQL definitions
Introduction – Challenges (I)
 Algorithm of parsing is almost identical to that of expression
evaluation  waste of time
 Lazy Parsing: Only parse clauses (eg. select, from, where) and comma-
separated fields
 Comma-separated fields, strings and escape characters
Evaluate this: concat("[)"Stars"(: ", stargazers_count)
 concat("[)"Stars"(: ", stargazers_count)
concat("[)"Stars"(: ", stargazers_count)
 concat("[)"Stars"(: ", stargazers_count)
Introduction – Challenges (II)
 Extracting all relevant fields from expressions to fetch at once
 select concat("[)"-> avg(stargazers_count)"(: ", stargazers_count -
avg(stargazers_count), "] ", name) from apple.repos where description
like "%library%" order by id
 Algorithm: for each expression,
 Remove all literal strings. Use r""(?:[^"]|.)*"" to match.
 Find all possible tokens with r"([a-zA-Z_]+)(?:[^(a-zA-Z_]|$)".
 For each token, check if it’s a predefined token (ie. part of SQL).
Introduction – Challenges (III)
 Expression Evaluation is really complicated
 Regular (eg. concat, floor) and Aggregate functions (eg. max, min)
 Have to evaluate an entire table at once
 Nested functions (eg. sum(avg(field_a) + avg(field_b)))
 Use recursive regex patterns to extract tokens – r”((?:(?>[^()]+|(?R))*))”
 Assign special precedence and insert extra logic in place
 Operator Precedence
 Modified 2-stack evaluation approach +
 Finite State Machine + One-token Lookahead
Introduction – Challenges (IV)
 Python’s built-in sort is not customizable:
sorted(iterable, *, key=None, reverse=False)
 order by requires sorting with multiple keys each with potentially
different reverse:
order by field_a desc, field_b asc, field_c, desc
 Wrote custom sort that integrates better with the workflow
Future Directions
 Improve SQL, MySQL compatibility
 Extend to end users not just organizations
 Migrate to the new GraphQL backend (GitHub API v4)
 Integrate SQLGitHub directly on the server end (better efficiency
and perhaps better security!)
Acknowledgements
 We would like to thank:
 Shing Lyu, former software engineer at Mozilla Taiwan for the mentorship
 Irvin Chen, Liaison of MozTW (Mozilla Taiwan Community) for
coordinating the program
 Prof. Cheng-Chung Lin for organizing the program

SQLGitHub - Access GitHub API with SQL-like syntaxes

  • 1.
    Servo & SQLGitHub AJOURNEY OF TRANSFORMING FAILURES INTO A GOOD IDEA
  • 2.
    Servo  A newbrowser engine written from scratch in Rust by Mozilla  Original Project: Write an Android frontend for servo  Issues, Issues and issues  Zero knowledge about mobile development and Rust  Outdated documentation  Very poorly supported on Android  Tried approaching in various ways, but to no avail  Made build and packaging guide and reported issues
  • 3.
    SQLGitHub – Motivation(I)  Who is stealing all my easy bugs/issues?  There are very limited tools for managing GitHub organizations  Open-source organizations are usually understaffed  The Servo project on GitHub for example, contains 129 repositories managed by practically 1 person.
  • 4.
    SQLGitHub – Background Organization: Organizations are shared accounts where businesses and open-source projects can collaborate across many projects at once.  Repository: A repository contains all of the project files, and stores each file's revision history.  Commit: An individual change to a set of files.  Issue: Suggested improvements, tasks or questions  Pull Request: Proposed change to a repository. Repository Organization Repository Repository • Commits • Issues • Pull Requests • … • Commits • Issues • Pull Requests • … • Commits • Issues • Pull Requests • …https://help.github.com/articles/github-glossary/
  • 5.
    SQLGitHub – Motivation(II)  Common questions/problems an organization admin include:  Obtain certain metrics of the organization in machine-friendly format for post-processing (eg. KPI report)  Get the current list of projects hosted on GitHub  List of the most popular repositories in the organization  Get the list of issues closed (resolved) for the past 7 days  What are the critical issues that are still left open?  Who are the top contributors of the past month?  … endless possible questions
  • 6.
    Abstract  SQLGitHub featuresa SQL-like syntax that allows you to: Query information about an organization as a whole.  You may also think of it as a better, enhanced frontend layer built on top of GitHub’s RESTful API
  • 7.
    Introduction – SupportedSchema  SELECT select_expr [, select_expr ...] FROM {org_name | org_name.{repos | issues | pulls | commits}} [WHERE where_condition] [GROUP BY {col_name | expr} [ASC | DESC], ...] [HAVING where_condition] [ORDER BY {col_name | expr} [ASC | DESC], ...] [LIMIT row_count]
  • 8.
    Introduction – UseCase (I)  Get name and description from all the repos in apple.  select name, description from apple.repos
  • 9.
    Introduction – UseCase (II)  Get last-updated time and title of the issues closed in the past week (7 days) in servo listed in descending order of last-updated time.  select updated_at, title from servo.issues.closed.7 order by updated_at desc
  • 10.
    Introduction – UseCase (III)  Get top 10 most-starred repositories in servo.  select concat(concat("(", stargazers_count, ") ", name), ": ", description) from servo.repos order by stargazers_count desc, name limit 10
  • 11.
    Introduction – UseCase (IV)  Get top 10 contributors in servo for the past month (30 days) based on number of commits.  select login, count(login) from servo.commits.30 group by login order by count(login) desc, login limit 10
  • 12.
    Introduction – TechnologyStack  Python  re & regex, regular expression libraries  PyGithub (patched), an unofficial client library for GitHub API  prompt_toolkit, a library for building prompts  pygments, a library for syntax highlighting
  • 13.
    Introduction – (Simplified)Flow Fetch data (from) Filter by where conditions Evaluate partial exprs Group by group exprs Order by order exprs Evaluate select exprs Fetch data with required fields from GitHub API Evaluate where conditions and filter fetched data Evaluate group exprs and other “field” exprs Generate table groups by values of group exprs Sort within and between tables Evaluate select exprs Filter by having conditions Evaluate having conditions and filter tables
  • 14.
    Introduction – Architecture top_level parser tokenizersession groupingordering table_fetcher expression definitionutilitiestable PyGithub process user input process external data fetches external data base classes, functions and SQL definitions
  • 15.
    Introduction – Challenges(I)  Algorithm of parsing is almost identical to that of expression evaluation  waste of time  Lazy Parsing: Only parse clauses (eg. select, from, where) and comma- separated fields  Comma-separated fields, strings and escape characters Evaluate this: concat("[)"Stars"(: ", stargazers_count)  concat("[)"Stars"(: ", stargazers_count) concat("[)"Stars"(: ", stargazers_count)  concat("[)"Stars"(: ", stargazers_count)
  • 16.
    Introduction – Challenges(II)  Extracting all relevant fields from expressions to fetch at once  select concat("[)"-> avg(stargazers_count)"(: ", stargazers_count - avg(stargazers_count), "] ", name) from apple.repos where description like "%library%" order by id  Algorithm: for each expression,  Remove all literal strings. Use r""(?:[^"]|.)*"" to match.  Find all possible tokens with r"([a-zA-Z_]+)(?:[^(a-zA-Z_]|$)".  For each token, check if it’s a predefined token (ie. part of SQL).
  • 17.
    Introduction – Challenges(III)  Expression Evaluation is really complicated  Regular (eg. concat, floor) and Aggregate functions (eg. max, min)  Have to evaluate an entire table at once  Nested functions (eg. sum(avg(field_a) + avg(field_b)))  Use recursive regex patterns to extract tokens – r”((?:(?>[^()]+|(?R))*))”  Assign special precedence and insert extra logic in place  Operator Precedence  Modified 2-stack evaluation approach +  Finite State Machine + One-token Lookahead
  • 18.
    Introduction – Challenges(IV)  Python’s built-in sort is not customizable: sorted(iterable, *, key=None, reverse=False)  order by requires sorting with multiple keys each with potentially different reverse: order by field_a desc, field_b asc, field_c, desc  Wrote custom sort that integrates better with the workflow
  • 19.
    Future Directions  ImproveSQL, MySQL compatibility  Extend to end users not just organizations  Migrate to the new GraphQL backend (GitHub API v4)  Integrate SQLGitHub directly on the server end (better efficiency and perhaps better security!)
  • 20.
    Acknowledgements  We wouldlike to thank:  Shing Lyu, former software engineer at Mozilla Taiwan for the mentorship  Irvin Chen, Liaison of MozTW (Mozilla Taiwan Community) for coordinating the program  Prof. Cheng-Chung Lin for organizing the program