-
Notifications
You must be signed in to change notification settings - Fork 6
convert to use rest_v1 api plus a bit more cli arg checking #8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
d00rman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments and thoughts in-lined.
bin/dump_wiki
Outdated
| process.exit(1); | ||
| } | ||
|
|
||
| if (!argv.dataBase && !argv.saveDir) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I would argue that this actually makes sense because it allows one to go over the whole dataset without having to have a lot of storage taken for a run. For example, we use the htmldumper script in production when we want to force a refresh of part of the content.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Running a dump without storing the results is an important use case that we should continue to support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you say 'refresh part of the content', refresh it where?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@apergos, in RESTBase. The htmldumper script only spiders the contents in that case, and throws away the result. Another use case is load testing in the staging environment.
| this.articles = []; | ||
| this.waiters = []; | ||
| this.done = false; | ||
| this.contentHost = 'http://' + options.prefix + '/api/rest_v1' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might make more sense to have contentHost as a command-line argument that replaces both host and prefix. That way one can specify exactly what they need. For example, outside of prod, that would be https://en.wikipedia.org/api/rest_v1/ while if run inside, we can set it to http://restbase.svc.eqiad.wmnet/en.wikipedia.org/v1/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prefix is used by itself though for the Host: header. Or are you suggesting parsing the value of the new argument to get the hostname out, for that purpose?
Maybe there needs to be a host arg and a contentUrl arg?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, we still need to set the Host header. Perhaps we could then change the meaning of prefix to be what is now contentHost and keep host as a command-line arg? So we'd have both prefix and host, but prefix would be the URI prefix to use and host would be used for the header and the domain in the requests. The latter could be optional, so that if it's not specified, we can simply url.parse() it out of prefix.
|
Cross-linking with phabricator: this PR seems to be a blocker for https://phabricator.wikimedia.org/T17017 |
|
I've made the changes I think you want, untested though. Tell me if I've understood you right? |
d00rman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR partially tries to address the issues addressed in #9 . I think we should work on that one first, as it allows users to specify the params on the command line, which makes it more versatile (the aim is to be able to use htmldumper from both inside and outside of production)
| headers: { | ||
| 'user-agent': options.userAgent, | ||
| host: options.prefix | ||
| host: options.host |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be the domain. When sending requests to the appservers, the Host header indicates the actual domain to fetch info for, not the originating host.
| } | ||
| var url = options.host + '/' + options.prefix | ||
| + '/v1/page/html/' + encodeURIComponent(title) + '/' + oldid; | ||
| var url = options.prefix + '/page/html/' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This problem is addressed in #9
This makes the existing code work with the not-so-new-now api.