DIALLED crawler

In June 2014, CWRC created a list of 4,789 Canadian libraries and archives, including their homepages.

The CWRC data is available in JSON. We can use this as a starting point for polling Canadian cultural institutions and collect linked open data that they might be expressing on their homepages, such as hours of operation, branch locations, or contact information.

More information is available from https://dialled.ca

Basic usage

./crawl --html-refresh --libraries libraries_test.json

This will:

crawl the libraries listed in the test file based on their URL
download the HTML for their listed URL into the html subdirectory
extract any linked data from the HTML it can find (whether RDFa, microdata, or JSON-LD) into an N3 file in a corresponding ttl subdirectory
serialize all of the data into N3 format in a datadumps subdirectory

Once you have gathered the data, you can run a set of queries against it to determine what kinds of linked open data the homepages of the libraries contain:

./queries

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
collect_data		collect_data
html		html
website		website
.gitignore		.gitignore
LICENSE		LICENSE
README.adoc		README.adoc
batch_redirects		batch_redirects
crawl		crawl
libraries_all.json		libraries_all.json
libraries_test.json		libraries_test.json
queries		queries
talk.adoc		talk.adoc

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

DIALLED crawler

Basic usage

About

Uh oh!

Releases

Packages

Languages

License

dbs/dialled-ca

Folders and files

Latest commit

History

Repository files navigation

DIALLED crawler

Basic usage

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages