haskell / text-icu

This package provides the Haskell Data.Text.ICU library, for performing complex manipulation of Unicode text.

License: BSD 2-Clause "Simplified" License

text-icu's Introduction

Text-ICU: Comprehensive support for string manipulation

This package provides the Data.Text.ICU library, for performing complex manipulation of Unicode text. It provides features such as the following:

Unicode normalization
Conversion to and from many common and obscure encodings
Date and number formatting
Comparison and collation
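
As a rough illustration of two of these features, the sketch below normalizes text to NFC and sorts with the root collator. The helper names `canonical` and `sortRoot` are made up for this example; `normalize`, `NFC`, `collate`, `collator` and `Root` come from `Data.Text.ICU`.

```haskell
import Data.List (sortBy)
import qualified Data.Text as T
import qualified Data.Text.ICU as ICU

-- | NFC-normalize, so that precomposed and decomposed spellings of the
-- same text become identical.
canonical :: T.Text -> T.Text
canonical = ICU.normalize ICU.NFC

-- | Sort with the root locale's collation rules instead of code point order.
sortRoot :: [T.Text] -> [T.Text]
sortRoot = sortBy (ICU.collate (ICU.collator ICU.Root))
```

For instance, `canonical` maps "e" followed by U+0301 (combining acute accent) to the single code point U+00E9.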

Prerequisites

This library is implemented as bindings to the well-respected ICU library (which is not bundled, and must be installed separately).

macOS

brew install icu4c
brew link icu4c --force

You might need:

export PKG_CONFIG_PATH="$(brew --prefix)/opt/icu4c/lib/pkgconfig"

Debian/Ubuntu

sudo apt-get update
sudo apt-get install libicu-dev

Fedora/CentOS

sudo dnf install unzip libicu-devel

Nix/NixOS

nix-shell --packages icu

Windows/MSYS2

Under MSYS2, ICU can be installed via pacman.

pacman --noconfirm -S mingw-w64-x86_64-icu

Depending on the age of the MSYS2 installation, the keyring might need to be updated to avoid certification issues, and pkg-config might need to be added. In this case, do this first:

pacman --noconfirm -Sy msys2-keyring
pacman --noconfirm -S mingw-w64-x86_64-pkgconf

Windows/stack

With stack on Windows, which comes with its own bundled MSYS2, the following commands give up-to-date system dependencies for text-icu-0.8.0 (tested 2023-09-30):

stack exec -- pacman --noconfirm -Sy msys2-keyring
stack exec -- pacman --noconfirm -S mingw-w64-x86_64-pkgconf
stack exec -- pacman --noconfirm -S mingw-w64-x86_64-icu

Compatibility

Upstream ICU occasionally introduces backwards-incompatible API breaks. This package tries to stay up to date with upstream, and is currently more or less in sync with ICU 72.

Minimum required version is ICU 62.

Get involved!

Please report bugs via the GitHub issue tracker.

Authors

This library was written by Bryan O'Sullivan.

text-icu's Issues

Still Recommended?

I'm pretty new to Haskell and wondering if this package is still recommended. Is it still being actively developed/tested?

v0.8.0: Build failure with GHC < 8.10

The -Wcompat is a problem for GHC 7.x
```
Building library for text-icu-0.8.0..
ghc: unrecognised flag: -Wcompat
```
This option needs to be guarded by if impl(ghc >= 8.0).

The ImportQualifiedPost language extension is only available from GHC 8.10, breaking build with almost all GHC 8 versions.

Data/Text/ICU/Calendar.hsc:1:14: error:
  Unsupported extension: ImportQualifiedPost
  |
1 | {-# LANGUAGE ImportQualifiedPost, RankNTypes, BangPatterns, ForeignFunctionInterface, RecordWildCards #-}
  |              ^^^^^^^^^^^^^^^^^^^

In my role as hackage trustee, I made a revision exposing the need for GHC >= 8.10 in the cabal file. But I think we should restore buildability with GHC 7 (latest major versions) and 8 (all major versions).
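
One way to keep a module buildable on older compilers, sketched here on a stand-in module rather than the package's own sources, is to use the traditional import form, which needs no extension. The helper `sortedLengths` is hypothetical and only exists so that the import is used.

```haskell
-- Traditional qualified import: accepted by every GHC release and
-- needs no language extension.
import qualified Data.List as List

-- With ImportQualifiedPost (GHC >= 8.10 only) the same import reads:
--   import Data.List qualified as List

-- | Hypothetical helper, only here so that the import is used.
sortedLengths :: [String] -> [Int]
sortedLengths = List.sort . map length
```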

Is this still being maintained?

I see open PRs with no comments, the last commit is from 2015, and the version on Hackage isn't even the latest version in this repo.

I think this is a fairly important library, so if @bos doesn't want to maintain it anymore, perhaps we should try to find someone to take over?

Add interface for BiDi levels

We are using text-icu in Balkón, an inline layout engine intended for a web browser.

In order to handle bidirectional text, Balkón needs to be able to run the BiDi algorithm for a given input text and retrieve the calculated levels, so that it can break this text into directional runs and pass each of them to HarfBuzz for shaping. For this, Balkón needs the output provided by the ubidi_getLevels() function in the ICU C API. Because Balkón allows associating formatting options and metadata with portions of the input text, and because the output of HarfBuzz has the form of glyphs, we cannot use the high-level reorderParagraph function, which only works on plain text and reorders it in such a way that preserving metadata would be very difficult. Fortunately, the reordering step is very simple to implement, so Balkón can take responsibility for it. We just need the output of the BiDi algorithm after rule I2, before reordering (https://www.unicode.org/reports/tr9/#Reordering_Resolved_Levels).
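
To illustrate why the reordering step can live outside the bindings, here is a minimal sketch of rule L2 from UAX #9, operating on items that are already tagged with their resolved levels. It is not Balkón's code: `reorderL2`, `reverseAtOrAbove` and the list-based representation are assumptions made for illustration, and the sketch ignores mirroring and the whitespace handling of rule L1.

```haskell
import Data.Function (on)
import Data.List (groupBy)
import Data.Word (Word8)

-- | Reorder items for display according to rule L2 of UAX #9.
-- Each item carries its resolved embedding level (the kind of value
-- ubidi_getLevels() reports); the item itself can be a character, a
-- glyph, or any record carrying metadata.
reorderL2 :: [(Word8, a)] -> [a]
reorderL2 items = map snd (foldl reverseAtOrAbove items thresholds)
  where
    levels = map fst items
    oddLevels = filter odd levels
    -- From the highest level down to the lowest odd level; empty when
    -- the text contains no right-to-left run.
    thresholds
      | null oddLevels = []
      | otherwise = reverse [minimum oddLevels .. maximum levels]

-- | Reverse every maximal contiguous run whose levels are all >= k.
reverseAtOrAbove :: [(Word8, a)] -> Word8 -> [(Word8, a)]
reverseAtOrAbove items k =
  concatMap flipRun (groupBy ((==) `on` atOrAbove) items)
  where
    atOrAbove (level, _) = level >= k
    flipRun run@(first : _)
      | atOrAbove first = reverse run
    flipRun run = run
```

For example, `reorderL2 (zip [0,0,1,1,0,0] "abCDef")` yields `"abDCef"`. Because the items are polymorphic, metadata attached to each character or glyph travels with it through the reordering.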

It would also be useful if Balkón could supply the initial embedding levels to the algorithm, so that direction changes can be dictated by higher level protocols (typically HTML, since this is intended for a web browser) without having to insert explicit formatting characters into the input string (which would complicate working with text offsets for selections). This is permitted by BiDi rule HL3 (https://www.unicode.org/reports/tr9/#HL3). Initial embedding levels can be passed as the embeddingLevels parameter of the ubidi_setPara() function in the ICU C API, which is currently hardcoded as NULL in the Haskell bindings (https://hackage.haskell.org/package/text-icu-0.8.0.2/src/cbits/text_icu.c).

It should also be possible to control the paraLevel parameter of the ubidi_setPara() function, which would typically reflect the direction set on HTML block elements. This is permitted by BiDi rule HL1 (https://www.unicode.org/reports/tr9/#HL1).

A high-level, pure Haskell function providing the required functionality might look something like this:

textLevels :: Word8 -> ByteString -> Text -> ByteString
textLevels paraLevel inputLevels inputText =
   unsafePerformIO $ do
     bidi <- open
     setParaWithLevels bidi inputText paraLevel inputLevels
     getLevels bidi

where setParaWithLevels is a foreign call to ubidi_setPara() including the embeddingLevels parameter, and getLevels is a foreign call to ubidi_getLevels(). Additional code may be necessary to handle memory allocation and deallocation. The levels may be stored in a data type other than ByteString if necessary.

Problem with pure boundary analysis

The pure boundary analysis API seems to have a problem in that it can produce different results when given identical arguments.

The following example illustrates the problem:

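{-# LANGUAGE OverloadedStrings #-}

module Main where

import Control.Monad (forM_)
import Data.Monoid ((<>))
import Data.Text (Text)
import Data.Text.ICU

main :: IO ()
main = do
  let x  = "foobar" :: Text
      -- Double the string 15 times, giving 6 * 2^15 characters.
      x' = foldr (\_ acc -> acc <> acc) x [1 .. 15 :: Int]
  -- Every iteration should print the same count.
  forM_ [1 .. 10 :: Int] $ \_ -> print $ length (charBreaks x')

-- One Break per character boundary segment in the current locale.
charBreaks :: Text -> [Break ()]
charBreaks = breaks (breakCharacter Current)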

When I run this, I get (in one instance):

Any idea what might be going wrong?

`collate` gives different results than applying `compare` on `sortKey`

ghci> import qualified Data.Text.ICU as ICU
ghci> let testCompare c a b = (ICU.collate c a b, compare (ICU.sortKey c a) (ICU.sortKey c b))

According to the docs, testCompare c a b should always return a pair of two equal values (i.e. (EQ, EQ), (LT, LT) or (GT, GT)). But this isn't the case, for example:

ghci> let c = ICU.collator ICU.Root
ghci> testCompare c "" "\EOT"
(EQ,LT)
ghci> testCompare c "" "\ETX"
(EQ,LT)
ghci> testCompare c "" "\NUL"
(EQ,LT)
ghci> testCompare c "" "\2205"
(EQ,LT)
ghci> testCompare c "" "\2250"
(EQ,LT)
ghci> testCompare c "" "\2250\ETX\2205"
(EQ,LT)

As far as I can tell, there are a handful of characters (including all of those above) such that Data.ByteString.unpack $ ICU.sortKey c "(char)" gives [1, 1, 0]. The problem manifests when we compare a string of any number of these characters (such a string also has sort key [1, 1, 0]) to the empty string (sort key []). I haven't seen this in any other situation.

(\2250 is U+08CA "arabic small high farsi yeh" and \2205 is U+089D "arabic superscript alef mokhassas". I found these essentially at random. A few others in the vicinity have the same property, like \2251 but not \2206. I haven't looked to see if there's any pattern here.)

I tried a few other collators. collatorWith _ [Strength Secondary] makes the sort key of the non-empty strings [1, 0] instead of [1, 1, 0], but testCompare gives the same results. Changing the base to Locale "en" or adding Numeric True doesn't obviously make a difference.

This is with text-icu-0.8.0.2. I can't rule out that this is a bug in ICU itself. I'm not familiar enough with C to be able to test that easily, though I expect I could figure it out. I'm using a version provided by nix. Based on the output of lsof, it seems to be version 72.1: my running GHC has these files open:

/nix/store/x6cq3940a5krcwj0p28y3b6lckxmcfqw-icu4c-72.1/lib/libicudata.so.72.1
/nix/store/x6cq3940a5krcwj0p28y3b6lckxmcfqw-icu4c-72.1/lib/libicui18n.so.72.1
/nix/store/x6cq3940a5krcwj0p28y3b6lckxmcfqw-icu4c-72.1/lib/libicuuc.so.72.1

macOS: ld: warning: directory not found for option '-L/opt/homebrew/opt/icu4c/lib'

cabal issues the following warning every time I build a project that uses text-icu-0.8 on macOS:

ld: warning: directory not found for option '-L/opt/homebrew/opt/icu4c/lib'

It stems from here:

text-icu/text-icu.cabal

Line 120 in c73d7fe

/opt/homebrew/opt/icu4c/lib

This isn't severe, but may irritate some users.

t_blockCode test failure again with ICU 74

This is a reprise of

Happens on macOS with ICU 74:
https://github.com/haskell/text-icu/actions/runs/8943327460/job/24567789387#step:12:636

t_blockCode: [Failed]
*** Failed! (after 25 tests):
Exception:
  toEnum{BlockCode}: tag (328) is outside of enumeration's range (0,327)
  CallStack (from HasCallStack):
    error, called at Data/Text/ICU/Char.hsc:495:17 in text-icu-0.8.0.5-FtrCGOmY2PD7vvbWpyQUDD:Data.Text.ICU.Char
'\191897'

Update: also fails on Windows with ICU 74.2:
https://github.com/haskell/text-icu/actions/runs/8943757763/job/24569171067#step:11:26

*** Failed! (after 13 tests):
Exception:
  toEnum{BlockCode}: tag (328) is outside of enumeration's range (0,327)
  CallStack (from HasCallStack):
    error, called at Data\Text\ICU\Char.hsc:495:17 in text-icu-0.8.0.5-inplace:Data.Text.ICU.Char
'\191528'

Deprecation warning for `memcpy` with `bytestring >= 0.11.5`

Data/Text/ICU/Spoof.hsc:433:7: warning: [-Wdeprecations]
    In the use of ‘memcpy’
    (imported from Data.ByteString.Internal, but defined in bytestring-0.11.5.0:Data.ByteString.Internal.Type):
    Deprecated: "Use Foreign.Marshal.Utils.copyBytes instead"
    |
433 |       memcpy dptr bptr (fromIntegral dlen))
    |       ^^^^^^
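
The replacement named in the warning, `Foreign.Marshal.Utils.copyBytes`, takes the destination first, then the source, then an `Int` byte count. The following standalone sketch shows the call shape on temporary buffers; `copyVia` is a hypothetical helper, not code from `Spoof.hsc`.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Marshal.Array (peekArray, pokeArray)
import Foreign.Marshal.Utils (copyBytes)

-- | Round-trip a list of bytes through two temporary buffers.
copyVia :: [Word8] -> IO [Word8]
copyVia bytes =
  allocaBytes n $ \src ->
    allocaBytes n $ \dst -> do
      pokeArray src bytes
      -- Destination first, then source, then the number of bytes.
      copyBytes dst src n
      peekArray n dst
  where
    n = length bytes
```

Under that assumption, the flagged line would presumably become `copyBytes dptr bptr (fromIntegral dlen)`, with `fromIntegral` now producing an `Int`.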

Segfault on Windows 64 bits

Even a simple use of text-icu - this code basically just calls collate - causes a segfault on Windows with a 64-bit build of GHC.

I tested the procedure below on Windows 7 64 bits with GHC 7.6.2 32 bits, 7.8.3 32 bits, and 7.8.3 64 bits. It works with both 32 bit versions, and segfaults with the 64 bit version. In all cases I used cabal-install and Cabal compiled from current HEAD on the 1.20 branch, rev. caf257cd96e766b293943bbac07d766ec2f552dd.

Steps to reproduce:

Clone this repo: http://github.com/ygale/test-text-icu

Download ICU4C version 54.1 binary for Windows, 32 bits or 64 bits depending on which GHC version you are using. Extract the zip into a folder near the repo.

Inside the repo:

cabal sandbox delete
cabal sandbox init
cabal install --extra-lib-dirs=C:\absolute\path\to\icu\lib64 --extra-include-dirs=C:\absolute\path\to\icu\include text-icu
cabal install

Now, place the resulting executable .cabal-sandbox\bin\test-text-icu.exe together with the 3 DLLs
icudt54.dll, icuin54.dll, and icuuc54.dll from the bin or bin64 subfolder of icu all together in the same folder. Run test-text-icu.

For 32-bit builds, this runs successfully and prints the text:

Q: Why did the multi-threaded chicken cross the road?
A: get other side. the to To

For 64-bit builds, this prints the first line, then segfaults.

This issue was posted to the ghc-devs list, in a thread about GHC linker changes initiated by @bos and cross-posted to the cabal-dev list.

Can't load updated libicu

Not sure if this is the right place for this but when attempting to build haskell-ide-engine which relies on text-icu as a dependency, it complains about the following:

Library not loaded: /usr/local/opt/icu4c/lib/libicuuc.58.dylib

However when running brew install icu4c, it installs the latest version which is 59.1. Does support for ICU 59.1 need to be added to text-icu or is this issue just unrelated?

More info regarding the error message can be found here haskell/haskell-ide-engine#303.

Getting the size of a grapheme cluster

I'd like to get the size of a grapheme cluster (from a value of type Text). Is there a function in the library that can help me with it? If not, is it in the scope of the library to provide one?
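
One possible approach with the existing pure API, assuming that "size" means the number of code points in each cluster, is to split the text at character boundaries and measure each piece. `graphemeSizes` is a hypothetical helper; `breaks`, `breakCharacter`, `brkBreak` and `Current` are exported by `Data.Text.ICU`.

```haskell
import qualified Data.Text as T
import Data.Text.ICU (LocaleName (Current), breakCharacter, breaks, brkBreak)

-- | Number of code points in each grapheme cluster, in order.
graphemeSizes :: T.Text -> [Int]
graphemeSizes = map (T.length . brkBreak) . breaks (breakCharacter Current)
```

The length of the resulting list is the number of grapheme clusters in the input.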

Installation instructions for text-icu-0.8.0 ?

What is necessary to install v0.8.0?

$ cabal install text-icu-0.8.0
Resolving dependencies...
Error: cabal: Could not resolve dependencies:
[__0] next goal: text-icu (user goal)
[__0] rejecting: text-icu-0.8.0 (conflict: pkg-config package icu-i18n-any,
not found in the pkg-config database)

The change log says:

text-icu/changelog.md

Line 7 in 7830071

* Declare pkg-config dependencies (#43)

But the README does not have updated installation instructions.
The linked PR is silent on the why and how: #43
Same for the original issue: #42

See also the build failure in the Agda CI: https://github.com/agda/agda/runs/5042362762?check_suite_focus=true#step:7:408

Declare pkg-config dependencies

The cabal file should explicitly state the package's pkg-config dependencies using the pkgconfig-depends.

Build breaks with icu 68

Building on my system fails with the following errors:

text-icu> cbits/text_icu.c: In function '__hs_unorm_quickCheck':
text-icu> 
text-icu> /tmp/stack-56efa0379724b03c/text-icu-0.7.0.1/cbits/text_icu.c:265:5: error:
text-icu>      warning: 'unorm_quickCheck_68' is deprecated [-Wdeprecated-declarations]
text-icu>       265 |     return unorm_quickCheck(source, sourcelength, mode, status);
text-icu>           |     ^~~~~~
text-icu>     |
text-icu> 265 |     return unorm_quickCheck(source, sourcelength, mode, status);
text-icu>     |     ^
text-icu> In file included from /usr/include/unicode/platform.h:25,
text-icu>                  from /usr/include/unicode/ptypes.h:52,
text-icu>                  from /usr/include/unicode/umachine.h:46,
text-icu>                  from /usr/include/unicode/utypes.h:38,
text-icu>                  from include/hs_text_icu.h:5,
text-icu> 
text-icu> /tmp/stack-56efa0379724b03c/text-icu-0.7.0.1/                 from cbits/text_icu.c:1:0: error: 
text-icu> 
text-icu> /tmp/stack-56efa0379724b03c/text-icu-0.7.0.1//usr/include/unicode/unorm.h:244:1: error:
text-icu>      note: declared here
text-icu>       244 | unorm_quickCheck(const UChar *source, int32_t sourcelength,
text-icu>           | ^~~~~~~~~~~~~~~~
text-icu>     |
text-icu> 244 | unorm_quickCheck(const UChar *source, int32_t sourcelength,
text-icu>     | ^
text-icu> cbits/text_icu.c: In function '__hs_unorm_isNormalized':
text-icu> 
text-icu> /tmp/stack-56efa0379724b03c/text-icu-0.7.0.1/cbits/text_icu.c:272:5: error:
text-icu>      warning: 'unorm_isNormalized_68' is deprecated [-Wdeprecated-declarations]
text-icu>       272 |     return unorm_isNormalized(src, srcLength, mode, pErrorCode);
text-icu>           |     ^~~~~~
text-icu>     |
text-icu> 272 |     return unorm_isNormalized(src, srcLength, mode, pErrorCode);
text-icu>     |     ^
text-icu> In file included from /usr/include/unicode/platform.h:25,
text-icu>                  from /usr/include/unicode/ptypes.h:52,
text-icu>                  from /usr/include/unicode/umachine.h:46,
text-icu>                  from /usr/include/unicode/utypes.h:38,
text-icu>                  from include/hs_text_icu.h:5,
text-icu> 
text-icu> /tmp/stack-56efa0379724b03c/text-icu-0.7.0.1/                 from cbits/text_icu.c:1:0: error: 
text-icu> 
text-icu> /tmp/stack-56efa0379724b03c/text-icu-0.7.0.1//usr/include/unicode/unorm.h:291:1: error:
text-icu>      note: declared here
text-icu>       291 | unorm_isNormalized(const UChar *src, int32_t srcLength,
text-icu>           | ^~~~~~~~~~~~~~~~~~
text-icu>     |
text-icu> 291 | unorm_isNormalized(const UChar *src, int32_t srcLength,
text-icu>     | ^
text-icu> cbits/text_icu.c: In function '__hs_unorm_normalize':
text-icu> 
text-icu> /tmp/stack-56efa0379724b03c/text-icu-0.7.0.1/cbits/text_icu.c:280:5: error:
text-icu>      warning: 'unorm_normalize_68' is deprecated [-Wdeprecated-declarations]
text-icu>       280 |     return unorm_normalize(source, sourceLength, mode, options, result,
text-icu>           |     ^~~~~~
text-icu>     |
text-icu> 280 |     return unorm_normalize(source, sourceLength, mode, options, result,
text-icu>     |     ^
text-icu> In file included from /usr/include/unicode/platform.h:25,
text-icu>                  from /usr/include/unicode/ptypes.h:52,
text-icu>                  from /usr/include/unicode/umachine.h:46,
text-icu>                  from /usr/include/unicode/utypes.h:38,
text-icu>                  from include/hs_text_icu.h:5,
text-icu> 
text-icu> /tmp/stack-56efa0379724b03c/text-icu-0.7.0.1/                 from cbits/text_icu.c:1:0: error: 
text-icu> 
text-icu> /tmp/stack-56efa0379724b03c/text-icu-0.7.0.1//usr/include/unicode/unorm.h:218:1: error:
text-icu>      note: declared here
text-icu>       218 | unorm_normalize(const UChar *source, int32_t sourceLength,
text-icu>           | ^~~~~~~~~~~~~~~
text-icu>     |
text-icu> 218 | unorm_normalize(const UChar *source, int32_t sourceLength,
text-icu>     | ^
text-icu> cbits/text_icu.c: In function '__hs_u_strCompareIter':
text-icu> 
text-icu> /tmp/stack-56efa0379724b03c/text-icu-0.7.0.1/cbits/text_icu.c:308:43: error:
text-icu>      error: 'TRUE' undeclared (first use in this function)
text-icu>       308 |     return u_strCompareIter(iter1, iter2, TRUE);
text-icu>           |                                           ^~~~
text-icu>     |
text-icu> 308 |     return u_strCompareIter(iter1, iter2, TRUE);
text-icu>     |                                           ^
text-icu> 
text-icu> /tmp/stack-56efa0379724b03c/text-icu-0.7.0.1/cbits/text_icu.c:308:43: error:
text-icu>      note: each undeclared identifier is reported only once for each function it appears in
text-icu>     |
text-icu> 308 |     return u_strCompareIter(iter1, iter2, TRUE);
text-icu>     |                                           ^
text-icu> `gcc' failed in phase `C Compiler'. (Exit code: 1)

While the first few errors are mere deprecation warnings, the last error is due to ICU not declaring TRUE and FALSE unless U_DEFINE_FALSE_AND_TRUE is set. See also the associated ICU bug.

Unknown block code causes run-time error

The function blockCode crashes on unknown block codes. It would be nice if that function could become total, for instance by returning NoBlock for unknown block codes. I understand that the Enum is derived, so taking care of such a special case might be a bit tricky (or result in a massive hand-written instance).
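
A generic way to get a total conversion without hand-writing an instance is to range-check before calling the derived `toEnum`. The sketch below uses a tiny stand-in type; `Block` and `toEnumOr` are hypothetical, and whether the real `BlockCode` also derives `Bounded` is an assumption.

```haskell
-- | Stand-in for a large derived enumeration such as BlockCode.
data Block = NoBlock | BasicLatin | Latin1Supplement
  deriving (Bounded, Enum, Eq, Show)

-- | Like 'toEnum', but returns the fallback for out-of-range tags
-- instead of raising an error.
toEnumOr :: (Bounded a, Enum a) => a -> Int -> a
toEnumOr fallback n
  | n >= fromEnum lo && n <= fromEnum hi = toEnum n
  | otherwise = fallback
  where
    lo = minBound `asTypeOf` fallback
    hi = maxBound `asTypeOf` fallback
```

With this, `toEnumOr NoBlock 1` gives `BasicLatin`, while an unknown tag such as `toEnumOr NoBlock 328` gives `NoBlock`.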

`pkgconfig-depends` in cabal file lacks lower bounds

The field pkgconfig-depends so far declares the needed libraries, but not their version:

text-icu/text-icu.cabal

Line 61 in 7830071

pkgconfig-depends: icu-uc, icu-i18n

The pkgconfig-depends field accepts a version range, see https://cabal.readthedocs.io/en/3.6/cabal-package.html#pkg-field-pkgconfig-depends .

I think a version lower bound like >= 69.1 (or whatever is correct, I don't know) should be added.
I know for sure that text-icu-0.8.0 does not build with some old versions of the ICU library!

Changelog for 0.8.0.5: mention new flag `homebrew`

text-icu/changelog.md

Line 3 in be8f0e8

* Make homebrew optional #(99)

Would be good to mention the name of the new flag here and its purpose.

Installation on osx/darwin is a bit tricky. Here are instructions that could perhaps be extracted to a wiki.

So, I failed while trying to use text-icu via cabal on osx/darwin.

After some searching, I found the following instructions to be quite helpful. I'm not sure where to document them, so I'm putting them here in case they help people who run into the same issue.

Maybe this can eventually be pulled out to the project wiki once set up.

# Make sure the external lib (icu4c) is installed and available via homebrew
brew install icu4c

# If installing in a cabal sandbox which is mostly what i do
# Just add the package to the <project>.cabal file and then run cabal install with additional parameters
cabal install --only-dependencies  --extra-lib-dirs=/usr/local/opt/icu4c/lib/ --extra-include-dirs=/usr/local/opt/icu4c/include

# If not using sandbox, just cabal install with additional parameters to the homebrew install
cabal install text-icu --extra-lib-dirs=/usr/local/opt/icu4c/lib/ --extra-include-dirs=/usr/local/opt/icu4c/include

Export `split` for regexes

v0.8.0.1: relax bound on `deepseq` to the one shipped with GHC 8.0

The current lower bound deepseq >= 1.4.3 for v0.8.0.1 is needlessly restrictive and excludes building with GHC 8.0 in some cases, e.g.: https://github.com/agda/agda/runs/5495421153?check_suite_focus=true

Agda-2.6.3$ cabal build -w ghc-8.0.2 -f +enable-cluster-counting
Resolving dependencies...
Error: cabal: Could not resolve dependencies:
...
[__1] trying: Agda:+enable-cluster-counting
[__2] trying: text-icu-0.8.0.1 (dependency of Agda +enable-cluster-counting)
[__3] trying: template-haskell-2.11.1.0/installed-2.11.1.0 (dependency of
Agda)
[__4] next goal: pretty (dependency of Agda)
[__4] rejecting: pretty-1.1.3.3/installed-1.1.3.3 (conflict: text-icu =>
deepseq>=1.4.3.0, pretty => deepseq==1.4.2.0/installed-1.4.2.0)
[__4] rejecting: pretty-1.1.3.6, ..., pretty-1.0.0.0 (conflict: template-haskell =>
pretty==1.1.3.3/installed-1.1.3.3)
[__4] fail (backjumping, conflict set: Agda, pretty, template-haskell, text-icu)

text-icu-0.8.0.1 builds fine with deepseq-1.4.2 (shipped with GHC 8.0):

$ cabal install -w ghc-8.0.2 text-icu-0.8.0.1 --constraint='deepseq==1.4.2.*' --allow-older=text-icu:deepseq
...
[38 of 38] Compiling Data.Text.ICU    ( Data/Text/ICU.hs, dist/build/Data/Text/ICU.o )
ld: warning: directory not found for option '-L/opt/homebrew/opt/icu4c/lib'
Installing library in ...

So the lower bound should be relaxed...

Unexpected exception and results with unmatched prefix (or suffix)

The pure versions of the regex match extraction functions, Data.Text.ICU.prefix, suffix, and (possibly) group, do not correctly handle the case where a group is in a regex but does not participate in a match. For example "a(b)?c" against "ac" or "(a)|b" against "b". They assume that start_ and end_ return -1 only when the group index is out of range, but in fact they also return -1 when a group does not fire.

> prefix 1 =<< find (regex [] "abc(def)?ghi") "xabcghiy"
*** Exception: Data.Text.Array.new: size overflow
CallStack (from HasCallStack):
  error, called at ./Data/Text/Array.hs:129:20 in text-1.2.2.1-FeA6fTH3E2n883cNXIS2Li:Data.Text.Array
> suffix 1 =<< find (regex [] "abc(def)?ghi") "xabcghiy"
Just "\NULxabcghiy"

An out of bounds range gives the expected results:

> prefix 2 =<< find (regex [] "abc(def)?ghi") "xabcghiy"
Nothing
> suffix 2 =<< find (regex [] "abc(def)?ghi") "xabcghiy"
Nothing

group possibly does the right thing, but not for the right reason (it extracts -1 to -1), and perhaps should return Nothing instead:

> group 1 =<< find (regex [] "abc(def)?ghi") "xabcghiy"
Just ""

One solution would be to use the safe underlying start and end functions instead, returning Nothing for any underlying Nothing. Happy to submit a PR for this approach.
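
The proposed behaviour can be sketched independently of the bindings. The code below is an illustration only: it works on `String` indices rather than the library's internal offsets, and `spanOf`, `prefixOf`, `groupOf` and `suffixOf` are hypothetical names. A raw offset pair containing -1 is treated as "group did not participate" and yields `Nothing`.

```haskell
-- | Validate a raw (start, end) offset pair, where -1 marks a group
-- that is out of range or did not participate in the match.
spanOf :: (Int, Int) -> Maybe (Int, Int)
spanOf (s, e)
  | s < 0 || e < s = Nothing
  | otherwise = Just (s, e)

-- | Text before the group, the group itself, and text after the group.
prefixOf, groupOf, suffixOf :: String -> (Int, Int) -> Maybe String
prefixOf str raw = do
  (s, _) <- spanOf raw
  pure (take s str)
groupOf str raw = do
  (s, e) <- spanOf raw
  pure (take (e - s) (drop s str))
suffixOf str raw = do
  (_, e) <- spanOf raw
  pure (drop e str)
```

For the input "xabcghiy", a whole-match span of (1, 7) gives prefix "x", group "abcghi" and suffix "y", while the unmatched group's (-1, -1) gives `Nothing` for all three.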

Surprising collation sort when using UpperFirst

I would expect that specifying UpperFirst would sort the capital before the lower, regardless of the following character. Do you have any insight?

:set -XOverloadedStrings
import qualified Data.Text.ICU as ICU
import qualified Data.Text.ICU.Collate as I
ICU.collate (ICU.collatorWith (ICU.Locale "de_DE") [I.CaseFirst (Just I.UpperFirst)]) "muller" "Müller"
-- LT
ICU.collate (ICU.collatorWith (ICU.Locale "de_DE") [I.CaseFirst (Just I.UpperFirst)]) "muller" "Muller"
-- GT

Set up CI

@vshabanov : Thanks for your work on this package!
Would be great to see some GitHub Actions here that perform continuous integration.

v0.7.{0.1,1.0} fails to build with GHC < 8.4

v0.7.1.0 fails with ghc 8.0 and 8.2 without further constraints.

$ cabal install -w ghc-8.0.2 text-icu-0.7.1.0
...
[ 3 of 25] Compiling Data.Text.ICU.Internal ( dist/build/Data/Text/ICU/Internal.hs, dist/build/Data/Text/ICU/Internal.o )

Data/Text/ICU/Internal.hsc:58:23: error:
    • Couldn't match type ‘GHC.Word.Word8’ with ‘Word16’
      Expected type: Ptr UChar
        Actual type: Ptr GHC.Word.Word8
    • In the second argument of ‘uiter_setString’, namely ‘p’
      In the first argument of ‘(>>)’, namely
        ‘uiter_setString i p (fromIntegral l)’
      In the expression: uiter_setString i p (fromIntegral l) >> act i
Error: cabal: Failed to build text-icu-0.7.1.0.

Same for v0.7.0.1.

In the wild: https://github.com/agda/agda/runs/5494474576?check_suite_focus=true

Not sure which bound is lacking:

base >= 4.11 would rule out building with GHC 8.0 and 8.2, but it seems unlikely that these versions of text-icu were only intended for GHC >= 8.4. (Note that text-icu-0.7.0.0 is for GHC <= 7.8 only)
maybe bytestring needs to be constrained?

regex' leaks exception on empty regular expression

My understanding is that regex' was supposed to catch errors and wrap them in Left, however for an empty string we see:

Prelude Data.Text.ICU> regex' [] ""
*** Exception: ICUError U_ILLEGAL_ARGUMENT_ERROR
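
Until the function itself catches this case, a caller can work around it in IO by forcing the result and catching whatever escapes. This is a hedged sketch, not the library's fix: `safeRegex` is a hypothetical wrapper, and it flattens both kinds of failure to a `String` for simplicity.

```haskell
import Control.Exception (SomeException, evaluate, try)
import qualified Data.Text as T
import Data.Text.ICU (MatchOption, Regex, regex')

-- | Compile a pattern, turning both reported parse errors and leaked
-- exceptions into a Left.
safeRegex :: [MatchOption] -> T.Text -> IO (Either String Regex)
safeRegex opts pat = do
  -- 'evaluate' forces the result here, so an exception thrown while
  -- compiling is raised inside 'try' rather than at a later use site.
  r <- try (evaluate (regex' opts pat))
  pure $ case r of
    Left ex -> Left (show (ex :: SomeException))
    Right parsed -> either (Left . show) Right parsed
```

Catching `SomeException` is deliberately broad here; narrowing it to the ICU error type would be the more careful choice in real code.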

Fell out of stackage

It seems that text-icu fell out of stackage nightly: https://github.com/fpco/stackage/blob/master/build-constraints.yaml It would be nice to have it back in.

macOS 12: Dynamic linking problem: `symbol not found in flat namespace (_u_charDigitValue_72)`

Since I upgraded macOS from 10.14 to 12.6.3 (Monterey), I cannot build anything using text-icu. Linking fails with dlopen(..): symbol not found in flat namespace (...). However, inspecting the dylib with dyld_info -symbolic_fixups produces an entry for the "missing" symbol:

$ cabal test
....
<no location info>: error:
    dlopen($ROOT/dist-newstyle/build/x86_64-osx/ghc-9.6.0.20230210/text-icu-0.8.0.2/build/libHStext-icu-0.8.0.2-inplace-ghc9.6.0.20230210.dylib, 0x0005): 
  symbol not found in flat namespace (_u_charDigitValue_72)

$  dyld_info -symbolic_fixups $ROOT/dist-newstyle/build/x86_64-osx/ghc-9.6.0.20230210/text-icu-0.8.0.2/build/libHStext-icu-0.8.0.2-inplace-ghc9.6.0.20230210.dylib \
  | grep charDigit
           +0xC87A0      bind pointer   flat-namespace/_u_charDigitValue_72

I sometimes see at the end of a cabal build:

ld: warning: -undefined dynamic_lookup may not work with chained fixups

But this seems harmless: https://gitlab.haskell.org/ghc/ghc/-/issues/22429

Make `pkg-config` dependency more robust

Lifted from haskell/cabal#8496 (comment). @gbaz writes:

I note btw that text-icu with its current cabal file will always fail if there is no pkg-config present on the system, or if icu-uc, or icu-i18n are not registered in it.

It would be better to have an auto flag that switches between a build using pkgconfig-depends and a build explicitly enumerating extra-libraries. This is what the pr that started this all was intended to enable :-) haskell/cabal#7621

In such a case if the CI busted on pkgconfig-depends, then it will fallback to the explicit extra-libraries etc.

This also makes the package somewhat more resilient for end-users.

My 2 cents on this: one would have to investigate whether such an auto flag also works with stack, which has different flag semantics.

Problem with static linking

I'm trying to create a static executable which depends on text-icu.

Here is a minimal example: https://github.com/4e6/text-icu-static-example

To enable static linking, I build ICU with the --enable-static flag here:

  # http://userguide.icu-project.org/packaging#TOC-Link-to-ICU-statically
  icu-static = pkgs.icu.overrideAttrs (attrs: {
    dontDisableStatic = true;
    configureFlags = (attrs.configureFlags or "") + " --enable-static";
    outputs = attrs.outputs ++ [ "static" ];
    postInstall = ''
      mkdir -p $static/lib
      mv -v lib/*.a $static/lib
    '' + (attrs.postInstall or "");
  });

And I add the following GHC options here:

        configureFlags = [
          "--ghc-option=-optl=-static"
          "--ghc-option=-optl=-pthread"
          "--ghc-option=-optl=-L${pkgs.glibc.static}/lib"
          "--ghc-option=-optl=-L${pkgs.gmp6.override { withStatic = true; }}/lib"
          "--ghc-option=-optl=-L${icu-static.static}/lib"
        ];

As a result of nix-build ., I'm getting a lot of errors about undefined references, see nix-build.log:

/nix/store/2h1il2pyfh20kc5rh7vnp5a564alxr21-icu4c-59.1-static/lib/libicui18n.a(regexcmp.ao):(.text+0x7805): more undefined references to `icu_59::UVector64::setElementAt(long, int)' follow
/nix/store/2h1il2pyfh20kc5rh7vnp5a564alxr21-icu4c-59.1-static/lib/libicui18n.a(regexcmp.ao): In function `icu_59::RegexCompile::compile(UText*, UParseError&, UErrorCode&)':
(.text+0x8355): undefined reference to `__cxa_throw_bad_array_new_length'
/nix/store/2h1il2pyfh20kc5rh7vnp5a564alxr21-icu4c-59.1-static/lib/libicui18n.a(regexcmp.ao):(.data.rel.ro._ZTIN6icu_5912RegexCompileE[_ZTIN6icu_5912RegexCompileE]+0x0): undefined reference to `vtable for __cxxabiv1::__si_class_type_info'
collect2: error: ld returned 1 exit status
`cc' failed in phase `Linker'. (Exit code: 1)

This setup works with other libraries but fails with text-icu. Any ideas on what I'm doing wrong?

Pacman workaround for Windows not working anymore

Any package with text-icu as a dependency builds fine on Windows with stack build, but stack ghci fails with:

ghc.EXE:  | C:\sr\snapshots\7501d5b7\lib\x86_64-windows-ghc-8.2.2\text-icu-0.7.0.1-1gkLXKbDLhmLamnawnVRsL\HStext-icu-0.7.0.1-1gkLXKbDLhmLamnawnVRsL.o: unknown symbol `ucnv_getMaxCharSize_61'                                                  
ghc.EXE: unable to load package `text-icu-0.7.0.1'

At some point in the past, this command fixed it:
stack exec -- pacman -Sy mingw64/mingw-w64-x86_64-icu
but this is not sufficient anymore.

Collation customization: How is it exposed in text-icu?

It appears that ICU provides mechanisms for customizing collation (described at http://userguide.icu-project.org/collation/customization).

I’d be grateful for a pointer as to whether, and if so how, this is exposed in the Haskell text-icu package.

Upload the latest commit to Hackage

Or at least the one solving #20, please.

This is actually a very important change for Mac users, as we will encounter this problem whenever a package depends on text-icu.

`text-icu-0.8.0.*` fails to configure with `cabal 3.8.1.0` on `macOS-{11,12}` virtual environments

text-icu-0.8.0.* fails to configure with cabal 3.8.1.0 on macOS-{11,12} virtual environments. The issue is described here:

haskell/actions#119

This may be fixed by:

(This isn't text-icu's fault, blame goes upstream, but I raised the issue here to alert the community to the problem.)

How to statically link on Windows?

Hi,
I'm aware that this problem is not specific to this library, but I think this information is generally hard to find in Haskell, and this is the first library I'm using that's supposed to use statically linked files. How do I include it in my project?

I did stack install text-icu --extra-include-dirs=C:\msys64\icu4c-59_1-Win64-MSVC2015\include --extra-lib-dirs=C:\msys64\icu4c-59_1-Win64-MSVC2015\lib64

and I tried putting the dll into my .stack-work\install\28cbf0ed\bin folder but when I try stack build I get


ghc.EXE: unable to load package `text-icu-0.7.0.1'
ghc.EXE: addDLL: icuuc59.dll (Win32 error 126): The specified module could not be found.
ghc.EXE: Could not load `icuuc59.dll'. Reason: addDLL: could not load DLL

Test suite build failure with GHC 8.2

Building test suite 'tests' for text-icu-0.7.0.1..
[1 of 3] Compiling QuickCheckUtils  ( tests/QuickCheckUtils.hs, dist/build/tests/tests-tmp/QuickCheckUtils.o )

tests/QuickCheckUtils.hs:14:10: error:
    Duplicate instance declarations:
      instance NFData Ordering
        -- Defined at tests/QuickCheckUtils.hs:14:10
      instance [safe] NFData Ordering -- Defined in ‘Control.DeepSeq’
   |
14 | instance NFData Ordering where
   |          ^^^^^^^^^^^^^^^

Release 0.8.0.1

Once the dust settles, please release text-icu again, so that #60 is available.
I suppose 0.8.0.1 would be compatible with the PVP:

if change consist only of corrected documentation, non-visible change to allow different dependency range etc. A.B.C MAY remain the same

Tests failing on darwin

See the build log at http://hydra.nixos.org/build/9699818/log/raw - it would appear these tests have never successfully run on darwin.

Test suite on Hackage missing Properties module

Preprocessing test suite 'tests' for text-icu-0.6.3.6...

tests/Tests.hs:5:18:
    Could not find module `Properties'
    Use -v to see a list of the files searched for.
Failed to install text-icu-0.6.3.6
cabal: Error: some packages failed to install:
text-icu-0.6.3.6 failed during the building phase. The exception was:
ExitFailure 1

Expose encoding-guessing functions

http://userguide.icu-project.org/conversion/detection seems to say that ICU is able to make some guesses regarding the encoding used, but text-icu does not seem to expose any such functionality, which IMHO greatly impairs its usefulness.

hsc2hs #const U_NO_NUMERIC does not work with --cross-compile

This seems to be because the way --cross-compile checks if something is valid is by using it as an array size, but this constant is a double, so that's not allowed.

In general, #const says it is for longs, and this constant could be a long; it's just cast to double for some reason.

I'm working around it for now by doing the hsc2hs call without --cross-compile (since the constants are the same on my host system) and building against the result.

Corrupted word breaking with fairly large text

Hi,

First of all: thank you for the library.

I ran into a strange problem with word breaking on a large amount of text.

With test.txt (just copied and pasted from the Wikipedia article on Haskell) and this snippet:

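{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text.IO as T
import qualified Data.Text.ICU as ICU

-- Read the file, split it at word boundaries, and print the resulting pieces.
main :: IO ()
main = print . fmap ICU.brkBreak . ICU.breaks (ICU.breakWord "en-US") =<< T.readFile "test.txt"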

Somewhere in the middle, ICU starts breaking on character boundaries. Here is the critical transition:

[...,"properties"," ","of"," ","programs","\n","Cayenne",","," ","with"," ","dependent"," ","types","\n","\937mega",","," ","strict"," ","and"," ","more","\n","Elm",","," ","a"," ","functional"," ","language"," ","to"," ","create"," ","web"," ","front","-","end"," ","apps",","," ","no"," ","s","u","p","p","o","r","t"," ","f","o","r"," ","h","i","gh","e","r","-","k","i","n","d","e","d"," ","t","y","p","e","s","\n","J","V","M","-","b","a","s","e","d",":","\n","\n","F","r","eg","e",","," ","a"," ","H","a","s","k",...]

After this point, nearly every character is isolated, but not always: sometimes characters are bundled pairwise.

Note: I first experienced this bug with German text extracted from epub chapters. The behavior seems a bit chaotic: mostly the characters are separated, but sometimes words or parts of words survive.

I'm using icu4c/56.1 on OS X installed via brew install icu4c.

Hackage readme refers to Bitbucket issue queue, which does not exist

On https://hackage.haskell.org/package/text-icu, we see:

Please report bugs via the bitbucket issue tracker.

Following that link, we get:

This repository does not have issue tracking enabled.
Use the links at the top to get back.

I take it the GitHub issue queue is used nowadays?

Missing Directions.

The following direction codes were added later to ICU.

U_FIRST_STRONG_ISOLATE = 19
U_LEFT_TO_RIGHT_ISOLATE = 20
U_RIGHT_TO_LEFT_ISOLATE = 21
U_POP_DIRECTIONAL_ISOLATE = 22

The Direction enum doesn't include them.
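
For illustration, here is a minimal base-only sketch of how the four isolate codes listed above could be mapped to a Haskell enumeration. The type name `IsolateDirection`, its constructors, and the helpers `toCode` and `fromCode` are hypothetical and are not part of text-icu; only the numeric codes come from the list above.

```haskell
import Data.Maybe (mapMaybe)

-- Hypothetical type: these constructors are NOT part of text-icu's Direction.
data IsolateDirection
  = FirstStrongIsolate     -- U_FIRST_STRONG_ISOLATE    = 19
  | LeftToRightIsolate     -- U_LEFT_TO_RIGHT_ISOLATE   = 20
  | RightToLeftIsolate     -- U_RIGHT_TO_LEFT_ISOLATE   = 21
  | PopDirectionalIsolate  -- U_POP_DIRECTIONAL_ISOLATE = 22
  deriving (Eq, Show, Enum, Bounded)

-- Numeric code for each constructor, as listed above (19..22).
toCode :: IsolateDirection -> Int
toCode d = 19 + fromEnum d

-- Total decoder: codes outside 19..22 yield Nothing instead of a runtime error.
fromCode :: Int -> Maybe IsolateDirection
fromCode n
  | n >= 19 && n <= 22 = Just (toEnum (n - 19))
  | otherwise          = Nothing

main :: IO ()
main = print (mapMaybe fromCode [18 .. 23])
```

Returning `Maybe` from the decoder is a design choice of this sketch: a code added by a later ICU release would be reported as `Nothing` rather than crash.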

This came up when binding the bidirectional algorithm bits of ICU in a separate package and trying to share types with text-icu.

GHC 9.4 build error: Couldn't match type ‘GHC.Word.Word8’ with ‘Word16’

text-icu                     >
text-icu                     > /private/var/folders/0c/zmpjp7l568xcvnt49pkkn5p00000gn/T/stack-7d50abf0944b2f36/text-icu-0.7.1.0/Data/Text/ICU/Internal.hsc:58:23: error:
text-icu                     > [14 of 25] Compiling Data.Text.ICU.Normalize.Internal
text-icu                     >     • Couldn't match type ‘GHC.Word.Word8’ with ‘Word16’
text-icu                     >       Expected: Ptr UChar
text-icu                     >         Actual: Ptr GHC.Word.Word8
text-icu                     >     • In the second argument of ‘uiter_setString’, namely ‘p’
text-icu                     >       In the first argument of ‘(>>)’, namely
text-icu                     >         ‘uiter_setString i p (fromIntegral l)’
text-icu                     >       In the expression: uiter_setString i p (fromIntegral l) >> act i
text-icu                     >    |
text-icu                     > 58 |     uiter_setString i p (fromIntegral l) >> act i
text-icu                     >    |                       ^
text-icu                     > [20 of 25] Compiling Data.Text.ICU.Spoof.Internal
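
The type error above says that a `Ptr Word8` was supplied where a `Ptr UChar` (16-bit code units) was expected. One plausible explanation, stated here as an assumption rather than a confirmed diagnosis, is that the build picked up text 2.0 or later, which stores UTF-8 internally, while this text-icu version expected UTF-16. The base-only sketch below (the helper names `utf16Units` and `utf8Units` are hypothetical) shows that the same string yields different code-unit sequences in the two encodings, which is why the pointers are not interchangeable.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word16, Word8)

-- Encode code points as UTF-16 code units (the unit size of ICU's UChar).
-- Sketch only: lone surrogates in the input are not rejected.
utf16Units :: String -> [Word16]
utf16Units = concatMap enc
  where
    enc c
      | n < 0x10000 = [fromIntegral n]
      | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))
                      , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]
      where
        n = ord c
        m = n - 0x10000

-- Encode code points as UTF-8 bytes.
utf8Units :: String -> [Word8]
utf8Units = concatMap enc
  where
    enc c
      | n < 0x80    = [fromIntegral n]
      | n < 0x800   = [lead 0xC0 6, cont 0]
      | n < 0x10000 = [lead 0xE0 12, cont 6, cont 0]
      | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]
      where
        n = ord c
        -- Leading byte: length tag plus the top bits of the code point.
        lead tag s = fromIntegral (tag .|. (n `shiftR` s))
        -- Continuation byte: 10xxxxxx carrying six bits.
        cont s = fromIntegral (0x80 .|. ((n `shiftR` s) .&. 0x3F))

main :: IO ()
main = do
  print (utf16Units "\937mega")
  print (utf8Units "\937mega")
```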

Test failure: missing `BlockCode`s ?

The CI on macOS for GHC 9.6 reports a failed test case (ICU_VER=73.2):
https://github.com/haskell/text-icu/actions/runs/6341389638/job/17224925448#step:11:26

t_blockCode: [Failed]
*** Failed! (after 76 tests):
Exception:
  toEnum{BlockCode}: tag (322) is outside of enumeration's range (0,320)
  CallStack (from HasCallStack):
    error, called at Data/Text/ICU/Char.hsc:485:17 in text-icu-0.8.0.2-inplace:Data.Text.ICU.Char
'\203257'

The Windows CI has a similar test failure (ICU_VER=73.2):
https://github.com/haskell/text-icu/actions/runs/6341389638/job/17224926997#step:11:25

  t_blockCode: [Failed]
*** Failed! (after 47 tests):
Exception:
  toEnum{BlockCode}: tag (322) is outside of enumeration's range (0,320)
  CallStack (from HasCallStack):
    error, called at Data\Text\ICU\Char.hsc:485:17 in text-icu-0.8.0.2-inplace:Data.Text.ICU.Char
'\201586'

These could be a symptom of text-icu having fallen behind the ICU library.
ICU 72 has Unicode 15, but currently only ICU 70 is supported.

I suppose this package needs to be brought up to date with current ICU.
ATTN: @vshabanov
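
The failure mode in the logs above, a `toEnum` call erroring on a tag beyond the known range, can be illustrated with a small base-only sketch. The `Block` type and the `safeToEnum` helper are hypothetical stand-ins; they are not text-icu's `BlockCode` and not necessarily how the package should fix this.

```haskell
-- Hypothetical stand-in for a generated enumeration; NOT text-icu's BlockCode.
data Block = NoBlock | BasicLatin | Latin1Supplement
  deriving (Eq, Show, Enum, Bounded)

-- Bounds-checked alternative to toEnum: a tag outside the known range
-- (for example, one introduced by a newer ICU release) yields Nothing
-- instead of throwing.
safeToEnum :: (Enum a, Bounded a) => Int -> Maybe a
safeToEnum n = check minBound maxBound
  where
    check :: Enum b => b -> b -> Maybe b
    check lo hi
      | fromEnum lo <= n && n <= fromEnum hi = Just (toEnum n)
      | otherwise                            = Nothing

main :: IO ()
main = print (map safeToEnum [0, 2, 322] :: [Maybe Block])
```

With a decoder of this shape, an unknown block tag surfaces as `Nothing` that the caller can handle, at the cost of changing the function's return type.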

Doesn't seem to work with icu-55.1-1?

I'm on Arch Linux, and a recent upgrade of icu breaks this package. Is there any plan to support the new version in the future?

Release a new version to hackage

a4bdf45 is the change I'd like to see released.