improve character encoding handling for friendly titles (Trac #2276) #2276
Comments
Milestone changed to |
cash wrote on 40489073-08-08 I am not able to reproduce. Could you post a set of steps to reproduce this? |
trac user bman wrote on 40489097-10-12 I started looking at this and I don't see a blog-specific issue. The issue seems to lie in PHP's lack of Unicode support. I have gotten a few functions working that will help, but the actual problem lies in URL creation: the characters are stripped out during tag stripping and in the friendly_title function. The way I was looking to solve this was to convert the UTF-8 string to an array of Unicode code points, convert that array to Unicode entities (preserving ASCII characters), strip tags, then convert the remaining entities back to UTF-8. I am still working on the conversion back to UTF-8; it is not complete yet. Anyway, I am just letting you all know my progress and thought process, and that I am actually looking at this issue. |
trac user bman wrote on 40489100-03-11 Yeah, the actual bug should be titled "Add unicode support in friendly_title" |
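The pipeline bman describes (non-ASCII characters to entities, strip tags, entities back to characters) can be sketched as follows. This is only an illustration in Python, not Elgg code; the function name `strip_tags_preserving_unicode` is hypothetical, and the regex-based tag stripping is a naive stand-in for a real tag stripper.

```python
import html
import re


def strip_tags_preserving_unicode(text: str) -> str:
    """Hypothetical helper illustrating the entity round trip."""
    # 1. Replace every non-ASCII character with a numeric entity
    #    (for example, é becomes &#233;), leaving ASCII untouched.
    as_entities = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
    # 2. Strip tags while the string is pure ASCII.
    #    Assumption: a naive regex is good enough for this illustration.
    without_tags = re.sub(r"<[^>]*>", "", as_entities)
    # 3. Convert the entities back into the original characters.
    return html.unescape(without_tags)


print(strip_tags_preserving_unicode("<b>é-test</b>"))
```

One caveat of this sketch: step 3 also decodes entities that were already present in the input (such as `&amp;`), so a real implementation would need to handle those separately.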
cash wrote on 40489432-07-28 Okay - I see the issue. I believe this is related to #2027. I started to look into using iconv just before the release of 1.7.1. If that works, it would be a clean solution. Since Wikipedia seems to do a good job with URLs, MediaWiki would be a good place to look for how they handle character encodings and URLs. |
Title changed from |
trac user bman wrote on 40489955-12-12 Very nice. Yeah, Unicode characters are stripped out entirely, so posting something like "é-test" results in "-test" as the title used in the URL. |
cash wrote on 40489999-10-20 iconv should fall back to the closest character in the ASCII character set if it can figure it out. This may be dependent on the server. In some tests I've seen something like "é-test" output as "e-test". I didn't do enough testing to turn it on in svn. |
trac user bman wrote on 40490008-05-22 Ah, well, that makes it readable, but I was actually trying to preserve Unicode, seeing as it's supported on the web now, and the internationalization would probably be a good thing for others. |
cash wrote on 40490078-02-18 Yeah, it was a temporary solution while PHP plays catch-up in the Unicode world. I think 5.3 is supposed to start having better i18n support. Have you checked out the intl extension: http://www.php.net/manual/en/book.intl.php |
trac user bman wrote on 40490114-08-22 Very good, I will read up and look again. |
cash wrote on 40514881-07-29 Pushing this back to 1.8 |
Milestone changed to |
Milestone changed to |
cash wrote on 41880155-04-16 Valid characters: http://www.ietf.org/rfc/rfc3986.txt. Germanazo's pull request: #83. I believe Wikipedia encodes the friendly title with URL encoding. It works for most modern browsers, but causes issues when people copy/paste the URL, like so: http://ta.wikipedia.org/wiki/%E0%AE%A4%E0%AE%AE%E0%AE%BF%E0%AE%B4%E0%AF%8D |
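To illustrate why a percent-encoded title looks ugly when pasted, here is a minimal Python sketch (not Elgg code) that reproduces the path segment from the Wikipedia URL above. The title is written with Unicode escapes so the exact code points are unambiguous.

```python
from urllib.parse import quote, unquote

# The Tamil word from the Wikipedia URL, spelled out as code points.
title = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

# Percent-encode the UTF-8 bytes of the title, as a URL path segment would be.
encoded = quote(title)
# Decoding restores the readable title that a browser shows in the address bar.
decoded = unquote(encoded)

print(encoded)
print(decoded == title)
```

Five characters become 45 bytes of ASCII, which is the copy/paste problem cash points out.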
Milestone changed to |
cash wrote on 41891104-06-12 This may require more time than we have for 1.8.2 so pushing back to 1.8.3. If it is done earlier, that's great. |
Milestone changed to |
cash wrote on 42003719-12-24 Options:
|
Milestone changed to |
brettp wrote on 42039239-12-01 I vote for option 3. |
trac user sembrestels wrote on 42093554-11-22 Drupal uses that module for this: |
trac user sembrestels wrote on 42132645-03-03 Let's use this function: http://www.php.net/manual/en/function.iconv.php Do you agree? |
cash wrote on 42140132-04-19 My concern with iconv is that it may be dependent on the server's configuration. |
brettp wrote on 42154993-03-16 Drupal's Transliteration plugin is basically a collection of maps of character transliterations. I don't want to maintain something like that. Since we're not ready to use iconv, I'm going to push this out to 1.8.5. |
Milestone changed to |
trac user mrclay wrote on 42388964-04-13 This looks, to me anyway, like a unique method of transliteration based on decomposing diacritic marks out of the characters and then stripping them. I believe each sequence of preg_replace calls could be replaced by str_replace with array arguments for a big speed improvement. It seems to require only vanilla PHP 5.3. |
trac user ManUtopiK wrote on 42425749-06-16 I suggest another way to solve this issue. |
trac user mrclay wrote on 42433022-09-21
I don't think the [http://us.php.net/manual/en/normalizer.normalize.php#92592 Normalizer-based translit] would translate chars like 語 (that may not neatly separate into base Latin + diacritics). The given code just strips all non-ASCII afterward, but we could strike a balance and Will 1.9 require PHP 5.3? |
ewinslow wrote on 42433148-08-14 1.9 will likely not require PHP 5.3. I hope 1.10 does... |
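The limitation mrclay describes can be shown with a small Python sketch of the decompose-then-strip idea. This is an illustration using Python's `unicodedata` module rather than PHP's Normalizer, and the helper names `strip_diacritics` and `to_ascii` are hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    # Decompose characters into base character + combining marks (NFD),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_ascii(text: str) -> str:
    # Strip diacritics, then discard whatever is still not ASCII.
    return strip_diacritics(text).encode("ascii", "ignore").decode("ascii")


print(strip_diacritics("é-test"))  # accented Latin decomposes cleanly
print(strip_diacritics("語"))       # no decomposition exists, so it is unchanged
print(to_ascii("語-test"))          # the character is lost entirely
```

Characters with no base-plus-diacritic decomposition survive the first step untouched and are then dropped by the ASCII filter, which is the same "-test" outcome reported earlier in the thread.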
Milestone changed to |
cash wrote on 42483022-06-02 Long term - something like Normalizer is the way to go. Until then, how about pulling in this? We could modify it along the lines of what Steve is suggesting: European languages get transliterated and others get encoded as IRIs (which means those users would have to deal with ugly links if pasted somewhere). |
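A rough Python sketch of the hybrid idea (transliterate what decomposes to ASCII, percent-encode the rest) might look like this. It is not the code cash links to and not Elgg code; `hybrid_friendly_title` and `_transliterate_char` are hypothetical names, and the lowercase and hyphen rules are assumptions made for the example.

```python
import re
import unicodedata
from urllib.parse import quote


def _transliterate_char(ch: str) -> str:
    # Keep the ASCII base of a character if stripping its combining marks
    # leaves one (é -> e); otherwise leave the character as it is.
    base = "".join(
        c for c in unicodedata.normalize("NFD", ch)
        if not unicodedata.combining(c)
    )
    return base if base and base.isascii() else ch


def hybrid_friendly_title(title: str) -> str:
    # Compose first so that "e" + combining accent is treated as one character.
    composed = unicodedata.normalize("NFC", title)
    text = "".join(_transliterate_char(ch) for ch in composed)
    # Assumption: lowercase the title and turn whitespace runs into hyphens.
    slug = re.sub(r"\s+", "-", text.strip().lower())
    # Percent-encode anything that is still not URL-safe ASCII.
    return quote(slug, safe="")


print(hybrid_friendly_title("É test 語"))
```

Letters such as ß or ø have no decomposition, so this sketch would percent-encode them; covering them needs an explicit character map, which is the maintenance cost raised earlier in the thread.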
trac user mrclay wrote on 42483964-04-08 #281 adds an ElggTranslit class based on Doctrine1 Inflector, and more unit tests. Is there a better way to include the Doctrine license in the PHPDoc? |
trac user mrclay wrote on 42484839-01-02 Should we keep a special case for leaving periods in what look like file names? E.g. Someone uploads an image without setting title. The friendly title becomes "photo1jpg". I could easily leave periods that are immediately followed by an alphanumeric. Is there a guide/spec for this kind of function considering UX and SEO? |
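The rule mrclay proposes (keep a period only when an alphanumeric follows it immediately) fits in one regular expression. A minimal Python sketch, with the hypothetical name `drop_periods_except_extensions` and the assumption that "alphanumeric" means ASCII letters and digits:

```python
import re


def drop_periods_except_extensions(title: str) -> str:
    # Remove a period unless the very next character is an ASCII letter or
    # digit, so "photo1.jpg" keeps its dot but a sentence-ending dot is dropped.
    return re.sub(r"\.(?![A-Za-z0-9])", "", title)


print(drop_periods_except_extensions("photo1.jpg"))
print(drop_periods_except_extensions("End of story."))
```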
trac user mrclay wrote on 42484883-04-21 I know we're trying to take only MIT licensed code, but WordPress's sanitize_title_with_dashes is exactly what we need here. http://core.trac.wordpress.org/browser/tags/3.4/wp-includes/formatting.php#L941 |
cash wrote on 42484970-01-14 For the license, I think we could use license with a link to the license. I don't think it is worth the effort to have special code to check for periods in titles (or if we do, it should probably be in the file plugin). As long as we have a goal of producing an MIT-only core, we cannot grab WP code. By the way, most of the plugins will remain GPL-only (a decision by the primary copyright holder, the old Curverider company). |
trac user mrclay wrote on 42487745-11-04 Squashed this work into 1 commit. The license in the Doctrine file looks MIT-ish, but the link is to GPL, so I suggest we keep the PHPDoc license. Regardless, almost the whole thing has been rewritten. Used this to format the translit array. |
Fixes Elgg#2276: Better friendly titles
Original ticket http://trac.elgg.org/ticket/2276 on 40445254-04-27 by trac user belhkaci, assigned to trac user mrclay.
Elgg version: 1.7
I cannot see the content of the blogs RSS page when a blog has a title that contains accented characters such as "é".