Export doesn't work in Excel with diacritic chars
PROPOSAL - prepend the BOM for UTF8 "\xEF\xBB\xBF" so csv's exported from excel
Discussion (note this is the same writeup on the PR - creating a gitlab too as I suspect there might be discussion).
There is a long-standing issue whereby files exported from CiviCRM with diacritic characters - eg. ę are mangled when opened in MS Excel.
The underlying issue is that the characters are UTF-8 encoded & MS Excel by default does not assume UTF8.
In order to do so it needs a BOM - Byte Order Marker - which is a few hex characters at the start of
the csv.
The BOM indicator is 'not recommended' .... unless you want your csv to work with
MS Excel. Since MS Excel is a major use case for csvs I think it's pretty clear that all
things being equal we want to support it... I note that there are extensions to export to excel
natively but I don't think that replaces this. The number one reason to want to export
a csv is to work with it in a spreadsheeting programme & unless we can't safely make it
work in core we shouldn't require an extension for that.
There are various recommendations over time but it seems things have improved in the MS
Excel world and what works on Windows now appears to work on Mac too - at least on a recent version.
This link https://donatstudios.com/CSV-An-Encoding-Nightmare is a pretty good discussion.
While the above link and others say that you need a different BOM for MAC than Windows my
testing shows that the one recommended for Windows works fine on Mac Excel (while you would need
to convert to UTF 16 to follow the Mac recommendation. (https://csv.thephpleague.com/9.0/interoperability/encoding/)
Other links
https://stackoverflow.com/questions/35294443/does-excel-for-mac-2016-properly-import-unicode-in-csv-files
https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-without-bom/2223926#2223926
Notably this summary :
" When should you encode with a BOM?
If you're unable to record the metadata in any other way (through a charset tag or file system meta), and the programs being used like BOMs, you should encode with a BOM. This is especially true on Windows where anything without a BOM is generally assumed to be using a legacy code page. The BOM tells programs like Office that, yes, the text in this file is Unicode; here's the encoding used.
When it comes down to it, the only files I ever really have problems with are CSV. Depending on the program, it either must, or must not have a BOM. For example, if you're using Excel 2007+ on Windows, it must be encoded with a BOM if you want to open it smoothly and not have to resort to importing the data."
So the argument for a BOM is - if you want it to be compatible with MS Excel use a BOM. This seems to apply.
The risk is that perhaps there is some variant of csv viewing programme that can't cope - the risk of this is rather mitigited by
1) the fact that MS Excel adds the BOM to the start of any files it saves as UTF-8 encoded csv so any programme that expects to open MS Excel generated
csvs would need to be able to cope with it. In addition it is a standard, if not required, file indicator so it feels like the programmes
that were legacy in 2016 writeups might be of no concern now.
2) There really is no programatic way to export these csvs so the risk that this is being parsed by code is close to zero
There are 2 other things we could do to mitigate
1) test on more platforms - I've tested with MS Excel for Mac (Office 365 v 16.31) and Libre Office and MS Numbers
& MS word, cot editor & notes
2) We could add a setting. I'm generally a bit loath on settings as they are a bit of a maintenance nightmare.
However, perhaps it would be a setting like 'export csvs in legacy format & there could be a link
to the gitlab to explain your use-case if you feel the need to set it. If no-one does then we could later remove.
Final note - csv tables are created with the CRM_Core_Report_Excel (spot the irony) from 3 places
- the main export, the export for custom fields & a third place which I believe to be unreachable. This is one path
only but I can look at centralising for the custom fields export path.