#1501 closed Defect (fixed)
Need to work around bug in php when dealing with character encodings
| Reported by: | FxNION | Owned by: | |
|---|---|---|---|
| Priority: | normal | Milestone: | Elgg 1.7 |
| Component: | Core | Version: | 1.7 |
| Severity: | major | Keywords: | accents, utf8, metadata, tags |
| Cc: | brettp, fx.nion@… | Difficulty: |
Description
Tested on the profile field "Interest" with the value ùéaèàç.
The metadata isn't saved.
Change History (14)
comment:1 Changed 3 years ago by cash
comment:2 Changed 3 years ago by FxNION
Very strange. For me it works for 'Text' typed fiels, but not 'Tags' typed fields...
I tested it on a clean install.
comment:3 Changed 3 years ago by FxNION
- Cc fx.nion@… added
- Summary changed from Metadata accents skipped to Metadata accents skipped - An explanation
- Type changed from unconfirmed defect to confirmed defect
- Version changed from SVN Trunk to 1.7
The problem is on the string_to_tag_array function called from profile/action/edit.php
The function is found in metadata.php and the problem is certainly linked with the call strtolower, which doesn't work well with utf8...
Fix suggested: call an mb_strtolower directly to the string at the beginning of the function.
By the way, why tags are to be set to lower case ?
comment:4 Changed 3 years ago by cash
- Priority changed from high to normal
- Severity changed from critical to major
- Type changed from confirmed defect to unconfirmed defect
FxNION - if your server has the mbstring extension installed, it should use mb_strtolower(). Take a look at /engine/lib/mb_wrapper.php
comment:5 Changed 3 years ago by FxNION
Thanks Cash to pay attention to this ticket :)
I have the mbstring extension installed, and i understood the wrapper strategy.
I tried this naive test and had success in converting tags with accents:
function string_to_tag_array($string) {
if (is_string($string)) {
$stringLower = mb_strtolower($string, 'UTF8');
$ar = explode(",",$stringLower);
$ar = array_map('trim', $ar); trim blank spaces
$ar = array_map('elgg_strtolower', $ar); make lower case : [Marcus Povey 20090605 - Using mb wrapper function using UTF8 safe function where available]
$ar = array_filter($ar, 'is_not_null'); Remove null values
return $ar;
}
return false;
}
I noticed than if id didn't specify the 'UT8' param, it didn't do the job well. When i specify the 'UTF8' param it worked as expected.
May be the call so elgg_fun() have to specify this parameter, but what about the array_map case ?
comment:6 Changed 3 years ago by cash
the encoding is set in engine/mb_wrapper to UTF-8 so we shouldn't need to set the second parameter of each mb_* function. Here are some tests that I'd like you to run:
- add a log statement in the string_to_tag_array() function:
error_log('encoding: ' . mb_internal_encoding());
This should output "encoding: UTF-8"
- switch the mb_strtolower() call that you added to elgg_strtolower() and see if it still works for you
comment:7 Changed 3 years ago by FxNION
Hi Cash,
Very usefuls tests :
1- In th log i get the unexpected log : encoding: ISO-8859-1
2- a
Naturally, if i use $stringLower = elgg_strtolower($string, 'UTF8'); instead of $stringLower = mb_strtolower($string, 'UTF8'); it works well
2- b
if i use $stringLower = elgg_strtolower($string);, no surprise, it fails.
3- I went further and put
error_log('mb_internal_encoding: ' . mb_internal_encoding());
inside the beginning lines of engine/mb_wrapper, and the log showed the expected result mb_internal_encoding: UTF-8
So, what do you think about this ?
comment:8 Changed 3 years ago by cash
I think this means you are running a third party plugin that is setting the encoding to ISO-8859-1. Do a search through your source code for mb_internal_encoding(). It is getting set correctly when the Elgg engine is loaded but sometime after that it is being set by something else.
comment:9 Changed 3 years ago by FxNION
Hmmm, i have already done this search in the whole elgg, and guess what ? In fact i have just two third party plugin installed beyond the fresh beta, they are mine and don't use mb_internal_encoding. I did the search again, and no more results than Elgg's one. Php bug ?
But if you suppose this kind of possible side effect, wouldn't it be safer to hard code the mb_wrappers and force them to use the UTF8 parameter instead of building dynamically the code. Although, as not every mb functions take the mb_encoding parameter, forcing the use in the dynamic code is not the best idea i had. But i tried :))).
comment:10 Changed 3 years ago by cash
I looked this up and there are bugs in different versions of php 5.2.* with this. Try putting this line in the mb_wrapper and see if the problem goes away:
ini_set("mbstring.internal_encoding", 'UTF-8');
comment:11 Changed 3 years ago by FxNION
Ahhhhhhhhhhhhhhhhhhhhhhh, great. It works like a charm. I came back to the initial code in the function string_to_tag_array, to ensure minimal changes.
I then tested 3 ways which are ok in mb_wrapper:
test 1:
if (is_callable('mb_internal_encoding')) {
mb_internal_encoding("UTF-8");
ini_set("mbstring.internal_encoding", 'UTF-8');
}
test 2:
if (is_callable('mb_internal_encoding')) {
ini_set("mbstring.internal_encoding", 'UTF-8');
mb_internal_encoding("UTF-8");
}
test 3:
ini_set("mbstring.internal_encoding", 'UTF-8');
May be it would help, i use WampServer 2.0 on Windows Xp for development. Php v5.3.0, Apache v2.2.11 and Mysql v5.1.36.
In conjunction with the correction i suggested concerning the search by tags, this is a great improvment.
Thanks Cash.
comment:12 Changed 3 years ago by cash
- Summary changed from Metadata accents skipped - An explanation to Need to work around bug in php when dealing with character encodings
- Type changed from unconfirmed defect to confirmed defect
Brett, I recommend we include the ini_set in mb_wrapper to work around this PHP bug. It seems that with some versions, if you call certain non-mb string functions, it resets the encoding to what is in php.ini. We call parse_str() in a few places which is probably the trigger. We should look at those cases to see if they should be replaced by elgg_parse_str().
Here is the code that should be at the top of mb_wrapper.php:
if (is_callable('mb_internal_encoding')) {
mb_internal_encoding("UTF-8");
ini_set("mbstring.internal_encoding", 'UTF-8');
}
comment:13 Changed 3 years ago by brettp
- Resolution set to fixed
- Status changed from new to closed
(In [svn:3932]) Fixes #1501: Setting ini for mbstring.internal_encoding to utf8 to work around a PHP bug. Replaced calls to parse_str() with elgg_parse_str().
comment:14 Changed 3 years ago by brettp
Thanks to both of you for all the investigating!

This works for me on a clean install