We have moved to Github. Please open tickets there.

Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#1501 closed Defect (fixed)

Need to work around bug in php when dealing with character encodings

Reported by: FxNION Owned by:
Priority: normal Milestone: Elgg 1.7
Component: Core Version: 1.7
Severity: major Keywords: accents, utf8, metadata, tags
Cc: brettp, fx.nion@… Difficulty:

Description

Tested on the profile field "Interest" with the value ùéaèàç.
The metadata isn't saved.

Change History (14)

comment:1 Changed 3 years ago by cash

This works for me on a clean install

comment:2 Changed 3 years ago by FxNION

Very strange. For me it works for 'Text' typed fiels, but not 'Tags' typed fields...
I tested it on a clean install.

comment:3 Changed 3 years ago by FxNION

  • Cc fx.nion@… added
  • Summary changed from Metadata accents skipped to Metadata accents skipped - An explanation
  • Type changed from unconfirmed defect to confirmed defect
  • Version changed from SVN Trunk to 1.7

The problem is on the string_to_tag_array function called from profile/action/edit.php

The function is found in metadata.php and the problem is certainly linked with the call strtolower, which doesn't work well with utf8...

Fix suggested: call an mb_strtolower directly to the string at the beginning of the function.

By the way, why tags are to be set to lower case ?

comment:4 Changed 3 years ago by cash

  • Priority changed from high to normal
  • Severity changed from critical to major
  • Type changed from confirmed defect to unconfirmed defect

FxNION - if your server has the mbstring extension installed, it should use mb_strtolower(). Take a look at /engine/lib/mb_wrapper.php

comment:5 Changed 3 years ago by FxNION

Thanks Cash to pay attention to this ticket :)

I have the mbstring extension installed, and i understood the wrapper strategy.

I tried this naive test and had success in converting tags with accents:

function string_to_tag_array($string) {

if (is_string($string)) {

$stringLower = mb_strtolower($string, 'UTF8');
$ar = explode(",",$stringLower);
$ar = array_map('trim', $ar); trim blank spaces
$ar = array_map('elgg_strtolower', $ar); make lower case : [Marcus Povey 20090605 - Using mb wrapper function using UTF8 safe function where available]
$ar = array_filter($ar, 'is_not_null');
Remove null values
return $ar;

}
return false;

}
I noticed than if id didn't specify the 'UT8' param, it didn't do the job well. When i specify the 'UTF8' param it worked as expected.

May be the call so elgg_fun() have to specify this parameter, but what about the array_map case ?

comment:6 Changed 3 years ago by cash

the encoding is set in engine/mb_wrapper to UTF-8 so we shouldn't need to set the second parameter of each mb_* function. Here are some tests that I'd like you to run:

  1. add a log statement in the string_to_tag_array() function:
    error_log('encoding: ' . mb_internal_encoding());
    

This should output "encoding: UTF-8"

  1. switch the mb_strtolower() call that you added to elgg_strtolower() and see if it still works for you

comment:7 Changed 3 years ago by FxNION

Hi Cash,

Very usefuls tests :
1- In th log i get the unexpected log : encoding: ISO-8859-1
2- a
Naturally, if i use $stringLower = elgg_strtolower($string, 'UTF8'); instead of $stringLower = mb_strtolower($string, 'UTF8'); it works well
2- b
if i use $stringLower = elgg_strtolower($string);, no surprise, it fails.

3- I went further and put
error_log('mb_internal_encoding: ' . mb_internal_encoding());
inside the beginning lines of engine/mb_wrapper, and the log showed the expected result mb_internal_encoding: UTF-8

So, what do you think about this ?

comment:8 Changed 3 years ago by cash

I think this means you are running a third party plugin that is setting the encoding to ISO-8859-1. Do a search through your source code for mb_internal_encoding(). It is getting set correctly when the Elgg engine is loaded but sometime after that it is being set by something else.

comment:9 Changed 3 years ago by FxNION

Hmmm, i have already done this search in the whole elgg, and guess what ? In fact i have just two third party plugin installed beyond the fresh beta, they are mine and don't use mb_internal_encoding. I did the search again, and no more results than Elgg's one. Php bug ?

But if you suppose this kind of possible side effect, wouldn't it be safer to hard code the mb_wrappers and force them to use the UTF8 parameter instead of building dynamically the code. Although, as not every mb functions take the mb_encoding parameter, forcing the use in the dynamic code is not the best idea i had. But i tried :))).

comment:10 Changed 3 years ago by cash

I looked this up and there are bugs in different versions of php 5.2.* with this. Try putting this line in the mb_wrapper and see if the problem goes away:

ini_set("mbstring.internal_encoding", 'UTF-8');

comment:11 Changed 3 years ago by FxNION

Ahhhhhhhhhhhhhhhhhhhhhhh, great. It works like a charm. I came back to the initial code in the function string_to_tag_array, to ensure minimal changes.

I then tested 3 ways which are ok in mb_wrapper:

test 1:

if (is_callable('mb_internal_encoding')) {
	mb_internal_encoding("UTF-8");
	ini_set("mbstring.internal_encoding", 'UTF-8');
}

test 2:

if (is_callable('mb_internal_encoding')) {
        ini_set("mbstring.internal_encoding", 'UTF-8');
	mb_internal_encoding("UTF-8");
	
}

test 3:

ini_set("mbstring.internal_encoding", 'UTF-8');

May be it would help, i use WampServer 2.0 on Windows Xp for development. Php v5.3.0, Apache v2.2.11 and Mysql v5.1.36.

In conjunction with the correction i suggested concerning the search by tags, this is a great improvment.

Thanks Cash.

comment:12 Changed 3 years ago by cash

  • Summary changed from Metadata accents skipped - An explanation to Need to work around bug in php when dealing with character encodings
  • Type changed from unconfirmed defect to confirmed defect

Brett, I recommend we include the ini_set in mb_wrapper to work around this PHP bug. It seems that with some versions, if you call certain non-mb string functions, it resets the encoding to what is in php.ini. We call parse_str() in a few places which is probably the trigger. We should look at those cases to see if they should be replaced by elgg_parse_str().

Here is the code that should be at the top of mb_wrapper.php:

if (is_callable('mb_internal_encoding')) {
	mb_internal_encoding("UTF-8");
	ini_set("mbstring.internal_encoding", 'UTF-8');
}

comment:13 Changed 3 years ago by brettp

  • Resolution set to fixed
  • Status changed from new to closed

(In [svn:3932]) Fixes #1501: Setting ini for mbstring.internal_encoding to utf8 to work around a PHP bug. Replaced calls to parse_str() with elgg_parse_str().

comment:14 Changed 3 years ago by brettp

Thanks to both of you for all the investigating!

Note: See TracTickets for help on using tickets.