Opened 3 years ago

Closed 11 months ago

Last modified 11 months ago

#2276 closed Defect (fixed)

improve character encoding handling for friendly titles

Reported by: belhkaci Owned by: mrclay
Priority: normal Milestone: Elgg 1.8.7
Component: Core Version: 1.7
Severity: minor Keywords:
Cc: brettp, sembrestels Difficulty:

Description (last modified by ewinslow)

I cannot see the content of the blogs RSS page when a blog has a title that contains accented characters such as "é".

Change History (37)

comment:1 Changed 3 years ago by brettp

  • Milestone changed from Elgg 1.8 to Elgg 1.7.2

comment:2 Changed 3 years ago by cash

  • Type changed from confirmed defect to unconfirmed defect

I am not able to reproduce. Could you post a set of steps to reproduce this?

comment:3 Changed 3 years ago by bman

I started looking at this and I don't see a specific blog-related issue. The issue seems to lie in PHP's lack of Unicode support. I have gotten a few functions working that will help, but the actual issue lies in the URL creation and in the characters being stripped out by strip_tags and the friendly_title function. The way I was looking to solve this was to convert the UTF-8 to a Unicode array, convert the Unicode array to Unicode entities (preserving ASCII chars), strip tags, then convert the remaining entities back to UTF-8. I am still working on the conversion back to UTF-8; it is not complete yet. Anyway, just letting you all know my progress, that I am actually looking at this issue, and my thought process.
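The pipeline described in this comment can be sketched as follows. This is only an illustration in Python, not the code bman wrote: the function names are hypothetical, the regex tag stripper is a crude stand-in for PHP's strip_tags(), and Python handles Unicode natively, so the entity round trip is shown purely to make the steps concrete.

```python
import html
import re


def to_entities(text: str) -> str:
    """Replace every non-ASCII character with a numeric entity, keeping ASCII as-is."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


def strip_tags(text: str) -> str:
    """Crude tag stripper (assumption: a stand-in for PHP's strip_tags())."""
    return re.sub(r"<[^>]*>", "", text)


def strip_tags_preserving_unicode(text: str) -> str:
    """Entities -> strip tags -> back to real characters."""
    # Caveat: html.unescape() also decodes entities that were already in the
    # input (e.g. "&amp;"), which a real implementation would have to handle.
    return html.unescape(strip_tags(to_entities(text)))


print(strip_tags_preserving_unicode("<b>é-test</b>"))
```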

comment:4 Changed 3 years ago by bman

Yeah, the actual bug should be titled "Add unicode support in friendly_title".
Sorry cash, I should have posted before I started working on this.

comment:5 Changed 3 years ago by cash

  • Component changed from Blog to Core
  • Summary changed from Problem of rss in blogs to improve character encoding handling for friendly titles
  • Type changed from unconfirmed defect to confirmed defect

Okay - I see the issue. I believe this is related to #2027.

I started to look into using iconv just before the release of 1.7.1. If that works, it would be a clean solution. Since Wikipedia seems to do a good job with URLs, MediaWiki would be a good place to look for how they handle character encodings and URLs.

comment:6 Changed 3 years ago by bman

Very nice. Yeah, Unicode characters are just stripped out totally, so posting something like "é-test" would result in "-test" as the title used in the URL.
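The behaviour reported here can be reproduced with a naive ASCII-only filter. The sketch below is an assumption about the general shape of the problem, not Elgg's actual friendly_title code; the function name is hypothetical.

```python
import re


def naive_friendly_title(title: str) -> str:
    """Keep only ASCII letters, digits and hyphens; everything else is dropped."""
    return re.sub(r"[^a-z0-9-]", "", title.lower())


print(naive_friendly_title("é-test"))  # the accented letter vanishes: "-test"
```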

comment:7 Changed 3 years ago by cash

iconv should fall back to the closest character in the ASCII character set if it can figure it out. This may be dependent on the server. In some tests I've seen something like "é-test" output as "e-test". I didn't do enough testing to turn it on in svn.

comment:8 Changed 3 years ago by bman

Ahhh, well, that makes it readable, but I was actually trying to preserve Unicode, seeing as it's supported on the web now, and the internationalization would probably be a good thing for others.

comment:9 Changed 3 years ago by cash

Yeah, it was a temporary solution while PHP plays catch-up in the Unicode world. I think 5.3 is supposed to start having better i18n support. Have you checked out the intl extension? http://www.php.net/manual/en/book.intl.php

comment:10 Changed 3 years ago by bman

Very good, I will read up and look again.

comment:11 Changed 3 years ago by cash

(In [svn:6586]) Refs #2117 #2276 #2027 - added elgg_get_friendly_time and elgg_get_friendly_title so they can be used in non-html views

comment:12 Changed 3 years ago by cash

  • Milestone changed from Elgg 1.7.2 to Elgg 1.8

Pushing this back to 1.8

comment:13 Changed 3 years ago by brettp

  • Difficulty set to easy
  • Priority changed from normal to high

comment:14 Changed 2 years ago by ewinslow

  • Milestone changed from Elgg 1.8 to Elgg 1.8.1

comment:15 Changed 18 months ago by cash

  • Difficulty easy deleted
  • Milestone changed from Elgg 1.8.x to Elgg 1.8.2
  • Priority changed from high to normal

Valid characters: http://www.ietf.org/rfc/rfc3986.txt

Germanazo's pull request: https://github.com/Elgg/Elgg/pull/83

I believe Wikipedia URL-encodes the friendly title. It works in most modern browsers, but it causes issues when people copy/paste the URL, like so: http://ta.wikipedia.org/wiki/%E0%AE%A4%E0%AE%AE%E0%AE%BF%E0%AE%B4%E0%AF%8D
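To make the copy/paste problem concrete, here is a small Python sketch using the standard urllib.parse module. It decodes the path segment from the URL above and re-encodes it; it illustrates percent-encoding in general, not MediaWiki's own code.

```python
from urllib.parse import quote, unquote

# Path segment copied from the Wikipedia URL in the comment above.
encoded = "%E0%AE%A4%E0%AE%AE%E0%AE%BF%E0%AE%B4%E0%AF%8D"

title = unquote(encoded)      # percent-escapes -> UTF-8 bytes -> text
round_trip = quote(title)     # text -> UTF-8 bytes -> percent-escapes

print(len(title), "code points become", len(encoded), "characters in the URL")
print(round_trip == encoded)
```

Each Tamil code point takes three UTF-8 bytes and each byte becomes a three-character escape, which is why a short title turns into a long, unreadable link once it is pasted as plain text.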

comment:16 Changed 18 months ago by cash

  • Milestone changed from Elgg 1.8.2 to Elgg 1.8.3

This may require more time than we have for 1.8.2 so pushing back to 1.8.3. If it is done earlier, that's great.

comment:17 Changed 17 months ago by cash

  • Milestone changed from Elgg 1.8.3 to Elgg 1.8.4

Options:

  • Use IRIs like we do for profile URLs
  • Encode the text like Wikipedia
  • Improve the conversion technique of removing characters not allowed in URIs (like WordPress)

comment:18 Changed 17 months ago by brettp

I vote for option 3.

comment:19 Changed 16 months ago by sembrestels

Drupal uses this module for this:

http://drupal.org/project/transliteration

comment:20 Changed 15 months ago by sembrestels

  • Cc sembrestels added

Let's use this function:

http://www.php.net/manual/en/function.iconv.php

Do you agree?

comment:21 Changed 15 months ago by cash

My concern with iconv is that it may be dependent on the server's configuration.

comment:22 Changed 15 months ago by brettp

  • Milestone changed from Elgg 1.8.4 to Elgg 1.8.5

Drupal's Transliteration plugin is basically a collection of maps of character transliterations. I don't want to maintain something like that.

Since we're not ready to use iconv, I'm going to push this out to 1.8.5.

comment:23 Changed 12 months ago by mrclay

This looks, to me anyway, like a unique method of transliteration based on decomposing diacritic marks out of the characters and then stripping them. I believe each sequence of preg_replace calls could be replaced by str_replace with array args for a big speed improvement. It seems to require only vanilla PHP 5.3.
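The decompose-then-strip idea can be sketched in Python, where the standard unicodedata module stands in for PHP's Normalizer class. This is an illustration of the technique, not the code being discussed; the function names and the small lookup table are hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Letters with no canonical decomposition need an explicit map. A single
# table lookup pass plays the role of str_replace with array arguments.
EXTRA_MAP = str.maketrans({"ß": "ss", "æ": "ae", "ø": "o"})


def transliterate(text: str) -> str:
    return strip_diacritics(text).translate(EXTRA_MAP)


print(transliterate("é-test"))
print(transliterate("Straße"))
```

Characters that do not decompose and are not in the table, such as CJK ideographs, pass through unchanged, so a later step still has to decide what to do with them.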

comment:24 Changed 12 months ago by ManUtopiK

I suggest another way to solve this issue.
I just made a pull request here on GitHub:
https://github.com/Elgg/Elgg/pull/264

comment:25 Changed 12 months ago by mrclay

iconv looks awfully dependent on server config. It also apparently chokes on invalid UTF-8 data. Since its input would practically always be coming from the DB (already UTF-8), I'm not sure the latter is a worry.

I don't think the Normalizer-based translit would translate chars like 語 (which may not neatly separate into base Latin + diacritics). The given code just strips all non-ASCII afterwards, but we could strike a balance and urlencode after. I.e. convert Español 日本語 to espanol-%e6%97%a5%e6%9c%ac%e8%aa%9e. Western langs would get ASCII translit, non-Western ones get urlencoded.
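A minimal sketch of this hybrid, assuming NFD decomposition for the transliteration step and standard percent-encoding for whatever remains. The function name is hypothetical and this is not the code that was eventually merged.

```python
import re
import unicodedata
from urllib.parse import quote


def hybrid_friendly_title(title: str) -> str:
    # 1. Transliterate what can be decomposed: "ñ" -> "n" + combining tilde -> "n".
    decomposed = unicodedata.normalize("NFD", title)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # 2. Lowercase and turn runs of whitespace into single hyphens.
    slug = re.sub(r"\s+", "-", stripped.strip().lower())
    # 3. Percent-encode the rest; lowercase the hex digits to match the example.
    return quote(slug, safe="-").lower()


print(hybrid_friendly_title("Español 日本語"))
# espanol-%e6%97%a5%e6%9c%ac%e8%aa%9e
```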

Will 1.9 require PHP5.3?

comment:26 Changed 12 months ago by ewinslow

  • Description modified (diff)

1.9 will likely not require PHP 5.3. I hope 1.10 does...

comment:27 Changed 11 months ago by cash

  • Milestone changed from Elgg 1.8.6 to Elgg 1.8.7

comment:28 Changed 11 months ago by cash

Long term - something like Normalizer is the way to go. Until then, how about pulling in this?

We could modify it along the lines of what Steve is suggesting: European languages get transliterated and others get encoded as IRIs (which means those users would have to deal with ugly links if pasted somewhere).

comment:29 Changed 11 months ago by mrclay

  • Owner set to mrclay
  • Status changed from new to assigned

https://github.com/Elgg/Elgg/pull/281 adds an ElggTranslit class based on Doctrine1 Inflector, and more unit tests. Is there a better way to include the Doctrine license in the PHPDoc?

comment:30 Changed 11 months ago by mrclay

Should we keep a special case for leaving periods in what look like file names? E.g. someone uploads an image without setting a title. The friendly title becomes "photo1jpg". I could easily leave periods that are immediately followed by an alphanumeric. Is there a guide/spec for this kind of function considering UX and SEO?
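The special case proposed here could look like the sketch below: a negative lookahead removes a period only when no ASCII letter or digit follows it. The function name is hypothetical, and the thread does not say this rule was adopted.

```python
import re


def strip_periods_except_extensions(title: str) -> str:
    """Drop a period unless an ASCII alphanumeric follows it immediately."""
    return re.sub(r"\.(?![A-Za-z0-9])", "", title)


print(strip_periods_except_extensions("photo1.jpg"))
print(strip_periods_except_extensions("Hello. World."))
```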

comment:31 Changed 11 months ago by mrclay

I know we're trying to take only MIT licensed code, but WordPress's sanitize_title_with_dashes is exactly what we need here. http://core.trac.wordpress.org/browser/tags/3.4/wp-includes/formatting.php#L941

comment:32 Changed 11 months ago by cash

For the license, I think we could use @license with a link to the license.

I don't think it is worth the effort to have special code to check for periods in titles (or if we do, it should probably be in the file plugin).

As long as we have a goal of producing an MIT-only core, we cannot grab WP code. By the way, most of the plugins will remain GPL only (decision by the primary copyright holder, the old Curverider company).

comment:33 Changed 11 months ago by mrclay

Squashed this work into 1 commit. The license in the Doctrine file looks MIT-ish, but the link is to GPL, so I suggest we keep the PHPDoc license. Regardless, almost the whole thing has been rewritten. Used this to format the translit array.

comment:34 Changed 11 months ago by Steve Clay

  • Resolution set to fixed
  • Status changed from assigned to closed

Fixes #2276: Better friendly titles, portable ElggTranslit class, better units

Changeset: 4fff20a33467a7318956412d4dabfcab1ce6daba

comment:35 Changed 11 months ago by Cash Costello

Refs #2276 made the new class private

Changeset: 2755f1cd44de70d32833d95aae9a761c80666687

comment:36 Changed 11 months ago by Steve Clay

Fixes #2276: Better friendly titles, portable ElggTranslit class, better units

Changeset: 4fff20a33467a7318956412d4dabfcab1ce6daba

comment:37 Changed 11 months ago by Cash Costello

Merge pull request #281 from mrclay/2276-friendly-title

Fixes #2276: Better friendly titles

Changeset: 35bd23ec8deb6c1f576780169bd0808caae4bdd1