#4190 closed Defect (fixed)
usernames allow accented characters but URLs not working
| Reported by: | cash | Owned by: | |
|---|---|---|---|
| Priority: | normal | Milestone: | Elgg 1.8.2 |
| Component: | Core | Version: | 1.8 |
| Severity: | minor | Keywords: | |
| Cc: | brett@… | Difficulty: |
Description (last modified by cash)
Create a user with an umlaut in the username to reproduce this. I'm not sure how we can fix this without sprinkling the code with urlencode()'s. Other option is not allowing these characters in usernames.
Change History (14)
comment:1 Changed 18 months ago by cash
- Description modified (diff)
comment:2 Changed 18 months ago by fgreinus
comment:3 Changed 18 months ago by fgreinus
I commited a Quick&Dirty patch at github, that fixed the problem (momentary) for me:
https://github.com/fgreinus/Elgg/commit/7c7d801393cbe6952032416a668621e58ad5dea9
comment:4 Changed 18 months ago by cash
The core problem is that we're producing invalid URLs because of the non-ascii characters. This causes the URL to fail this: filter_var($url, FILTER_VALIDATE_URL)
To include non-ascii characters in a URL, they have to be encoded. As an example, in views/default/page/layouts/content/filter.php, we would need to change
$username = elgg_get_logged_in_user_entity()->username;
to
$username = urlencode(elgg_get_logged_in_user_entity()->username);
comment:5 Changed 18 months ago by cash
- Milestone changed from Needs Review to Elgg 1.8.2
Long term solution is probably to add a method to ElggUser:
public function getUsername($encode = false) {
if ($encode) {
return $this->username;
} else {
return urlencode($this->username);
}
}
comment:6 Changed 18 months ago by ewinslow
Why not convert the accented characters to non-accented? umlauts/accents in *names* makes sense, but I don't think it's necessary for usernames. Should we be allowing cjk characters in usernames as well?
comment:7 Changed 18 months ago by Purus
The problem is not just with english accented characters. The problem comes when using non english languages too.. For example my Tamil language.
For example try to have the values "தமிழ்" in both username and display name.
comment:8 Changed 18 months ago by cash
Evan, we would need to either require usernames to be only ascii characters or allow all national characters. There's no good middle ground. I'll check through the history of what we have allowed. I don't know if we suddenly started allowing more character types with Elgg 1.8 or something else changed to cause these issues.
comment:9 Changed 17 months ago by ewinslow
tl;dr: I vote [a-z0-9\.]+ usernames only.
I think we should take the conservative approach right now and only allow alphanumeric usernames. Even Facebook doesn't allow "non-romanized" usernames. I'm also thinking of things like federation which could be affected by this.
I don't think this is ideal, but it closes off a lot of potential problems and we can always open things up later when we have done more research and come up with a viable solution. On the other hand, if we allow all national characters outright, we may end up regretting that down the line with no good way to recover.
comment:10 Changed 17 months ago by brettp
What about [_-@]? Those are fairly common and arguably useful for usernames.
comment:11 Changed 17 months ago by cash
Elgg has supported non-ASCII characters in usernames since 1.0 as far as I can tell. What has changed is that we added elgg_normalize_url() which uses PHP's URL validator. That validator works with URLs that conform to RFC 3986 (Uniform Resource Identifier). That allows for alphanumeric, hyphen, period, underscore, and tilde.
Before 1.8, we were producing invalid URLs according to RFC 3986, but they worked on probably every browser/server.
There is also RFC 3987 (Internationalized Resource Identifiers). Here is a good article that describes the relationship between URIs and IRIs: http://www.w3.org/International/articles/idn-and-iri/
Here is a list of what IRIs allow as characters in the path:
iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
/ %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
/ %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
/ %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
/ %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
/ %xD0000-DFFFD / %xE1000-EFFFD
Anyway, it seems reasonable to modify elgg_normalize_url() to work with IRIs since that is what we've been producing since 1.0. If we want to discussing limiting username character sets, that should be a separate ticket.
comment:12 Changed 17 months ago by ewinslow
Ok, good distinction to make.
comment:13 follow-up: ↓ 14 Changed 17 months ago by Cash Costello
- Resolution set to fixed
- Status changed from new to closed
Fixes #4190 accepting full urls with non-ascii characters
Changeset: c529671a522dea0dcfc280815092ee1f5127b92b
comment:14 in reply to: ↑ 13 Changed 17 months ago by fgreinus
Replying to Cash Costello:
Fixes #4190 accepting full urls with non-ascii characters
Changeset: c529671a522dea0dcfc280815092ee1f5127b92b
Runs perfectly for me, big thanks !

I think not allowing those characters in Usernames is the worst option.
There must be a way to fix this.
Where would you suggest to put urlencode? Maybe I could try it live on my site if you suggest some changes ;).