Skip to content

Conversation

@mdholloway
Copy link
Contributor

No description provided.

@mdholloway

This comment has been minimized.

@mdholloway mdholloway requested a review from linehan December 22, 2020 14:34
@mdholloway mdholloway marked this pull request as draft December 22, 2020 14:35
@tonisevener

This comment has been minimized.

@tonisevener

This comment has been minimized.

@mdholloway

This comment has been minimized.

@tonisevener

This comment has been minimized.

@mdholloway

This comment has been minimized.

@mdholloway
Copy link
Contributor Author

By the way, why is the User-Agent string stored with all events in the old client? I added it here, too, just to match the old client, but can it be removed?

@tonisevener
Copy link
Collaborator

tonisevener commented Dec 22, 2020

By the way, why is the User-Agent string stored with all events in the old client? I added it here, too, just to match the old client, but can it be removed?

Hm good catch - it looks like the old system sets the request header value for "User-Agent". The only reason I can think is we save the user agent as the request is generated and persisted. If they happen to update the app or OS before the event is posted, this way the user agent sent is the old one when the event occured, and not the new one when the event is posted. In httpPost, when you're creating the request from session, you can pass in the header from the record to override the default header:

let request = session.request(with: url, method: .post, bodyData: body, bodyEncoding: .json, headers: ["User-Agent": userAgent])

@mdholloway mdholloway force-pushed the rebasedstorage branch 2 times, most recently from 156ab60 to 14cb920 Compare December 23, 2020 15:42
@mdholloway mdholloway changed the title WIP: EPC: Add StorageManager EPC: Add StorageManager Dec 23, 2020
@mdholloway
Copy link
Contributor Author

By the way, why is the User-Agent string stored with all events in the old client? I added it here, too, just to match the old client, but can it be removed?

Hm good catch - it looks like the old system sets the request header value for "User-Agent". The only reason I can think is we save the user agent as the request is generated and persisted. If they happen to update the app or OS before the event is posted, this way the user agent sent is the old one when the event occured, and not the new one when the event is posted. In httpPost, when you're creating the request from session, you can pass in the header from the record to override the default header:

let request = session.request(with: url, method: .post, bodyData: body, bodyEncoding: .json, headers: ["User-Agent": userAgent])

If I recall correctly, the old EventLogging system supplements the event data that ultimately gets stored in the DB with the producer's user-agent. I don't think that currently happens (yet) for the Event Platform, but I'd have to double-check on that (though maybe @linehan knows off the top of his head). In any event, is this just being used as a proxy for the app version? Can we update the relevant schema(s) to explicitly require an app version string instead, rather than continuing to spoof older user-agent headers?

@tonisevener
Copy link
Collaborator

tonisevener commented Dec 23, 2020

If I recall correctly, the old EventLogging system supplements the event data that ultimately gets stored in the DB with the producer's user-agent. I don't think that currently happens (yet) for the Event Platform, but I'd have to double-check on that.

The let request = session.request(with: url, method: .post, bodyData: body, bodyEncoding: .json) line in the new service populates the request header with the same value that the legacy service uses to populate the userAgent value on it's records. So I think for the most part the functionality is the same, we just aren't persisting it, and there's a chance the value of user agent has changed in the time since the event was first generated and persisted.

@mdholloway
Copy link
Contributor Author

mdholloway commented Dec 23, 2020

If I recall correctly, the old EventLogging system supplements the event data that ultimately gets stored in the DB with the producer's user-agent. I don't think that currently happens (yet) for the Event Platform, but I'd have to double-check on that.

The let request = session.request(with: url, method: .post, bodyData: body, bodyEncoding: .json) line in the new service populates the request header with the same value that the legacy service uses to populate the userAgent value on it's records. So I think for the most part the functionality is the same, we just aren't persisting it, and there's a chance the value of user agent has changed in the time since the event was first generated and persisted.

Ah, sorry, I was talking about the server-side behavior of the legacy eventlogging service upon receiving an event via the beacon endpoint (see https://github.com/wikimedia/eventlogging#eventcapsule-mediawiki-extension), not the EventLoggingService class in the app.

@linehan
Copy link

linehan commented Dec 23, 2020

By the way, why is the User-Agent string stored with all events in the old client? I added it here, too, just to match the old client, but can it be removed?

Hm good catch - it looks like the old system sets the request header value for "User-Agent". The only reason I can think is we save the user agent as the request is generated and persisted. If they happen to update the app or OS before the event is posted, this way the user agent sent is the old one when the event occured, and not the new one when the event is posted. In httpPost, when you're creating the request from session, you can pass in the header from the record to override the default header:

let request = session.request(with: url, method: .post, bodyData: body, bodyEncoding: .json, headers: ["User-Agent": userAgent])

If I recall correctly, the old EventLogging system supplements the event data that ultimately gets stored in the DB with the producer's user-agent. I don't think that currently happens (yet) for the Event Platform, but I'd have to double-check on that

The fields in the http fragment, specifically http.request_headers.user_agent, should be filled out by EventGate based on the HTTP request it receives. So whatever we put in the HTTP header as User-Agent will make it into that field.

In any event, is this just being used as a proxy for the app version? Can we update the relevant schema(s) to explicitly require an app version string instead, rather than continuing to spoof older user-agent headers?

I was thinking along these lines, except I was thinking to let the client set a user_agent string directly rather than needing to fish
it out of the header. This would also simplify our awkward userAgent state we are needing to tag along with the event the whole way, since it would just be part of the event's data to begin with.

I think the intention of a user agent as, roughly, "the software and version of the originator of this HTTP request" makes sense and has a cross-platform interpretation (it's tied to HTTP as a protocol, rather than to any specific notion of how that user agent string is structured), so it's probably not necessary for us to distinguish app version versus browser version versus anything else as long as user agent strings are distinct and canonical, it makes sense to me.

@mdholloway
Copy link
Contributor Author

I was thinking along these lines, except I was thinking to let the client set a user_agent string directly rather than needing to fish it out of the header. This would also simplify our awkward userAgent state we are needing to tag along with the event the whole way, since it would just be part of the event's data to begin with.

I think the intention of a user agent as, roughly, "the software and version of the originator of this HTTP request" makes sense and has a cross-platform interpretation (it's tied to HTTP as a protocol, rather than to any specific notion of how that user agent string is structured), so it's probably not necessary for us to distinguish app version versus browser version versus anything else as long as user agent strings are distinct and canonical, it makes sense to me.

I guess I would want to know how it's being used by PMs and data analysts (or whomever is querying this data). I would think in general that breaking out the specific data of interest would make queries easier to write and more performant to execute than doing something like WHERE user_agent LIKE '%v1.3.37%' or what have you. But I guess I wouldn't object in principle to having a user_agent field in the event data object. I don't mean to be pedantic but it just seems wrong for the app to represent itself to the intake server as something it isn't anymore in order to set this data with the intended value(s).

@linehan
Copy link

linehan commented Dec 23, 2020

I guess I would want to know how it's being used by PMs and data analysts (or whomever is querying this data). I would think in general that breaking out the specific data of interest would make queries easier to write and more performant to execute than doing something like WHERE user_agent LIKE '%v1.3.37%' or what have you.

Aah yeah, that is a good point. I think the practice is for a back-end processor to eat these strings and spit out something categorical like "firefox", "wikipedia-ios", "mobile-safari", etc. This is most often used as a dimension to break down events by browser or (possibly) version.

I don't mean to be pedantic but it just seems wrong for the app to represent itself to the intake server as something it isn't anymore in order to set this data with the intended value(s).

Being pedantic is totally cool. Is user agent a bad place/way for the app to represent itself, though? I always thought about it as the piece of software (including version) which is the user's agent in the HTTP request. So, curl, or quake, could be a user agent. That's how I think about it at least, which I think is aligned with RFC 7231. I don't think it's wrong to separate the user agent from a user agent version, or something, but probably both pieces of data are important for analytics.

IIRC Chrome and some other browser vendors want to move away from having UA strings in their HTTP requests. I don't know if this means that navigator.userAgent would also be deprecated.

@mdholloway
Copy link
Contributor Author

OK, I zapped the User-Agent from the DB but we can revert that if need be, of course.

I also noticed that the earlier Core Data model for EPC storage was still floating around in the repo, so I added a commit deleting that, too.

I'll remove the Draft tag since I think this is beyond that point now.

@mdholloway mdholloway marked this pull request as ready for review December 23, 2020 22:43
Copy link
Collaborator

@tonisevener tonisevener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last code review comment from my end. I did some lightweight testing and it's working well!

Copy link
Collaborator

@tonisevener tonisevener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been testing this morning and everything looks great. I believe I have marked all the issues I brought up as resolved, leaving the others for @linehan to confirm. Thanks @mdholloway!

@linehan
Copy link

linehan commented Jan 5, 2021

Looks good to me, many thanks @mdholloway and @tonisevener, excellent work.

@tonisevener tonisevener removed the WIP label Jan 5, 2021
@tonisevener tonisevener merged commit 1e471e2 into wikimedia:main Jan 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants