Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve character encoding handling for friendly titles (Trac #2276) #2276

Closed
elgg-gitbot opened this issue Feb 16, 2013 · 42 comments
Closed

Comments

@elgg-gitbot
Copy link

Original ticket http://trac.elgg.org/ticket/2276 on 40445254-04-27 by trac user belhkaci, assigned to trac user mrclay.

Elgg version: 1.7

I can not see the page content rss blogs when there is a blog that has a title that contains accented characters such as "é"

@elgg-gitbot
Copy link
Author

Milestone changed to Elgg 1.7.2 by brettp on 40445929-01-14

@elgg-gitbot
Copy link
Author

cash wrote on 40489073-08-08

I am not able to reproduce. Could you post a set of steps to reproduce this?

@elgg-gitbot
Copy link
Author

trac user bman wrote on 40489097-10-12

I started looking at this and i don't see a specific blog related issue. The issue seems to lie in php's lack of unicode support. I have gotten a few functions working that will help, but the actual issue lies in the url creation and the characters being stripped out in strip tags and friendly_title function. The way I was looking to solve this was to convert utf8 to unicode array, convert unicode array to unicode entities (preserving ascii chars), striptags, then convert the remaining entities back to utf8. I am working on going back to utf8 still but not complete yet. anyway, just letting you all know my progress and that i am actually looking at this issue and my thought process.

@elgg-gitbot
Copy link
Author

trac user bman wrote on 40489100-03-11

yeah the actual bug should be titled "Add unicode support in friendly_title"
sorry cash i should have posted before i started working on this.

@elgg-gitbot
Copy link
Author

cash wrote on 40489432-07-28

Okay - I see the issue. I believe this is related to #2027.

I started to look into using iconv just before the release of 1.7.1. If that works, it would be a clean solution. Since Wikipedia seems to do a good job with URLs, MediaWiki would be a good place to look for how they handle character encodings and URLs.

@elgg-gitbot
Copy link
Author

Title changed from Problem of rss in blogs to improve character encoding handling for friendly titles by cash on 40489432-07-28

@elgg-gitbot
Copy link
Author

trac user bman wrote on 40489955-12-12

Very nice, yeah unicode characters are just stripped out totally so posting something like: "é-test" would result in "-test" as the title used in the url

@elgg-gitbot
Copy link
Author

cash wrote on 40489999-10-20

iconv should fallback to closest character in the ASCII character set if it can figure it out. This may be dependent on the server. In some tests I've seen something like "é-test" output as "e-test". I didn't do enough testing to turn it on in svn.

@elgg-gitbot
Copy link
Author

trac user bman wrote on 40490008-05-22

Ahhh well that makes it readable, but I was actually trying to preserve unicode seeing as its supported on the web now, and the internalization would probably be a good thing for others.

@elgg-gitbot
Copy link
Author

cash wrote on 40490078-02-18

Yeah, it was a temporary solution while PHP plays catch-up in the Unicode world. I think 5.3 is suppose to start having better i18n support. Have you checked out the intl extension: http://www.php.net/manual/en/book.intl.php

@elgg-gitbot
Copy link
Author

trac user bman wrote on 40490114-08-22

Very good, I will read up and look again.

@elgg-gitbot
Copy link
Author

cash wrote on 40496896-05-29

(In [svn:6586]) Refs #2117 #2276 #2027 - added elgg_get_friendly_time and elgg_get_friendly_title so they can be used in non-html views

@elgg-gitbot
Copy link
Author

cash wrote on 40514881-07-29

Pushing this back to 1.8

@elgg-gitbot
Copy link
Author

Milestone changed to Elgg 1.8 by cash on 40514881-07-29

@elgg-gitbot
Copy link
Author

Milestone changed to Elgg 1.8.1 by ewinslow on 41120363-04-12

@elgg-gitbot
Copy link
Author

cash wrote on 41880155-04-16

Valid characters: http://www.ietf.org/rfc/rfc3986.txt

Germanazo's pull request: #83

I believe Wikipedia encodes the friendly title with url encode. It works for most modern browsers, but causes issues when people copy/paste the url like so: http://ta.wikipedia.org/wiki/%E0%AE%A4%E0%AE%AE%E0%AE%BF%E0%AE%B4%E0%AF%8D

@elgg-gitbot
Copy link
Author

Milestone changed to Elgg 1.8.2 by cash on 41880155-04-16

@elgg-gitbot
Copy link
Author

cash wrote on 41891104-06-12

This may require more time than we have for 1.8.2 so pushing back to 1.8.3. If it is done earlier, that's great.

@elgg-gitbot
Copy link
Author

Milestone changed to Elgg 1.8.3 by cash on 41891104-06-12

@elgg-gitbot
Copy link
Author

cash wrote on 42003719-12-24

Options:

  • Use IRIs like we do for profile URLs
  • Encode the text like Wikipedia
  • Improve the convert technique of removing non-URI allowed characters (like Wordpress)

@elgg-gitbot
Copy link
Author

Milestone changed to Elgg 1.8.4 by cash on 42003719-12-24

@elgg-gitbot
Copy link
Author

brettp wrote on 42039239-12-01

I vote for option 3.

@elgg-gitbot
Copy link
Author

trac user sembrestels wrote on 42093554-11-22

Drupal uses that module for this:

http://drupal.org/project/transliteration

@elgg-gitbot
Copy link
Author

trac user sembrestels wrote on 42132645-03-03

Let's use this function:

http://www.php.net/manual/en/function.iconv.php

Do you agree?

@elgg-gitbot
Copy link
Author

cash wrote on 42140132-04-19

My concern with iconv is that it may be dependent on the server's configuration.

@elgg-gitbot
Copy link
Author

brettp wrote on 42154993-03-16

Drupal's Transliteration plugin is basically a collection of maps of character transliterations. I don't want to maintain something like that.

Since we're not ready to use iconv, I'm going to push this out to 1.8.5.

@elgg-gitbot
Copy link
Author

Milestone changed to Elgg 1.8.5 by brettp on 42154993-03-16

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 42388964-04-13

This looks, to me anyway, a unique method of transliteration based on decomposing diacritic marks out of the characters then stripping them. I believe each sequences of preg_replace calls could be replaced by str_replace w/ array args for a big speed improvement. Seems to only require vanilla PHP5.3.

@elgg-gitbot
Copy link
Author

trac user ManUtopiK wrote on 42425749-06-16

I suggest an other way to solve this issue.
I just do a pull request her on github :
#264

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 42433022-09-21

iconv looks awfully dependent on server config. It also apparently chokes on invalid UTF-8 data. Since its input would practically always be coming from the DB (already UTF-8), I'm not sure the latter is a worry.

I don't think the [http://us.php.net/manual/en/normalizer.normalize.php#92592 Normalizer-based translit] would translate chars like 語 (that may not neatly separate into base latin + diacritics). The given code just strips all non-ASCII afterwords, but we could strike a balance and urlencode after. I.e. convert Español 日本語 to espanol-%e6%97%a5%e6%9c%ac%e8%aa%9e. Western langs would get ASCII translit, non Westerns get urlencoded.

Will 1.9 require PHP5.3?

@elgg-gitbot
Copy link
Author

ewinslow wrote on 42433148-08-14

1.9 Will likely not require PHP 5.3. I hope 1.10 does...

@elgg-gitbot
Copy link
Author

Milestone changed to Elgg 1.8.7 by cash on 42460625-09-26

@elgg-gitbot
Copy link
Author

cash wrote on 42483022-06-02

Long term - something like Normalizer is the way to go. Until then, how about pulling in this?

We could modify it along the lines of what Steve is suggesting: European languages get transliterated and others get encoded as IRIs (which means those users would have to deal with ugly links if pasted somewhere).

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 42483964-04-08

#281 adds an ElggTranslit class based on Doctrine1 Inflector, and more unit tests. Is there a better way to include the Doctrine license in the PHPDoc?

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 42484839-01-02

Should we keep a special case for leaving periods in what look like file names? E.g. Someone uploads an image without setting title. The friendly title becomes "photo1jpg". I could easily leave periods that are immediately followed by an alphanumeric. Is there a guide/spec for this kind of function considering UX and SEO?

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 42484883-04-21

I know we're trying to take only MIT licensed code, but WordPress's sanitize_title_with_dashes is exactly what we need here. http://core.trac.wordpress.org/browser/tags/3.4/wp-includes/formatting.php#L941

@elgg-gitbot
Copy link
Author

cash wrote on 42484970-01-14

For the license, I think we could use license with a link to the license.

I don't think it is worth the effort to have special code to check for periods in titles (or if we do, it should probably be in the file plugin).

As long as we have a goal of producing a MIT only core, we cannot grab WP code. By the way, most of the plugins will remain GPL only (decision by the primary copyright holder - the old Curverider company).

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 42487745-11-04

Squashed this work into 1 commit. The license in the Doctrine file looks MIT-ish, but the link is to GPL, so I suggest we keep the PHPDoc license. Regardless, almost the whole thing has been rewritten. Used this to format the translit array.

@elgg-gitbot
Copy link
Author

trac user Steve Clay wrote on 42513475-07-31

Fixes #2276: Better friendly titles, portable ElggTranslit class, better units
Changeset: 4fff20a

@elgg-gitbot
Copy link
Author

trac user Cash Costello wrote on 42513487-09-07

Refs #2276 made the new class private
Changeset: 2755f1c

@elgg-gitbot
Copy link
Author

trac user Steve Clay wrote on 42524624-05-28

Fixes #2276: Better friendly titles, portable ElggTranslit class, better units
Changeset: 4fff20a

@elgg-gitbot
Copy link
Author

trac user Cash Costello wrote on 42524624-09-10

Merge pull request #281 from mrclay/2276-friendly-title

Fixes #2276: Better friendly titles
Changeset: 35bd23e

adayth pushed a commit to adayth/Elgg that referenced this issue Feb 22, 2013
adayth pushed a commit to adayth/Elgg that referenced this issue Feb 22, 2013
adayth pushed a commit to adayth/Elgg that referenced this issue Feb 22, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

1 participant