Skip to content

Conversation

@apergos
Copy link

@apergos apergos commented Jan 17, 2017

This makes the existing code work with the not-so-new-now api.

Copy link
Contributor

@d00rman d00rman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments and thoughts in-lined.

bin/dump_wiki Outdated
process.exit(1);
}

if (!argv.dataBase && !argv.saveDir) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I would argue that this actually makes sense because it allows one to go over the whole dataset without having to have a lot of storage taken for a run. For example, we use the htmldumper script in production when we want to force a refresh of part of the content.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Running a dump without storing the results is an important use case that we should continue to support.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you say 'refresh part of the content', refresh it where?

Copy link
Member

@gwicke gwicke Jan 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@apergos, in RESTBase. The htmldumper script only spiders the contents in that case, and throws away the result. Another use case is load testing in the staging environment.

this.articles = [];
this.waiters = [];
this.done = false;
this.contentHost = 'http://' + options.prefix + '/api/rest_v1'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might make more sense to have contentHost as a command-line argument that replaces both host and prefix. That way one can specify exactly what they need. For example, outside of prod, that would be https://en.wikipedia.org/api/rest_v1/ while if run inside, we can set it to http://restbase.svc.eqiad.wmnet/en.wikipedia.org/v1/

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefix is used by itself though for the Host: header. Or are you suggesting parsing the value of the new argument to get the hostname out, for that purpose?

Maybe there needs to be a host arg and a contentUrl arg?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, we still need to set the Host header. Perhaps we could then change the meaning of prefix to be what is now contentHost and keep host as a command-line arg? So we'd have both prefix and host, but prefix would be the URI prefix to use and host would be used for the header and the domain in the requests. The latter could be optional, so that if it's not specified, we can simply url.parse() it out of prefix.

@monperrus
Copy link

Cross-linking with phabricator: this PR seems to be a blocker for https://phabricator.wikimedia.org/T17017

@apergos
Copy link
Author

apergos commented Apr 7, 2017

I've made the changes I think you want, untested though. Tell me if I've understood you right?

Copy link
Contributor

@d00rman d00rman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR partially tries to address the issues addressed in #9 . I think we should work on that one first, as it allows users to specify the params on the command line, which makes it more versatile (the aim is to be able to use htmldumper from both inside and outside of production)

headers: {
'user-agent': options.userAgent,
host: options.prefix
host: options.host
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be the domain. When sending requests to the appservers, the Host header indicates the actual domain to fetch info for, not the originating host.

}
var url = options.host + '/' + options.prefix
+ '/v1/page/html/' + encodeURIComponent(title) + '/' + oldid;
var url = options.prefix + '/page/html/'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This problem is addressed in #9

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

4 participants