Ismael Olea — web personal

Un repaso esquemático en beneficio de Recursos Lingüìsticos Abiertos del Español (RLA-ES) y su corrector ortográfico.

Hace tiempo me interesé por cuál sería realmente el funcionamiento de los sistemas de localización de software, en particular los usados en el mundo Linux y software libre. En particular por su relación con el corrector ortográfico para el español que mantiene el proyecto RLA-Es desde hace años y que es el más usado universalmente en nuestro mundillo en toda clase de sistema operativo. Empecé a recopilar datos pero realmente no fueron más que una madeja inconexa que ahora me propongo organizar.

Espero ser tan sintético como claro.

Cómo se identifican los códigos regionales de lenguas
IETF BCP 47 Language Tags
Language Subtag Registry - IANA
M49 Standard
Unicode Common Locale Data Repository (CLDR)
Complitud del corrector RLA-ES
Siguientes pasos

Cómo se identifican los códigos regionales de lenguas

Si bien han sido de uso normas como ISO 639 e ISO 3166 mi conclusión es que la referencia mundial hoy día es el Unicode Common Locale Data Repository (CLDR), mantenida por el Unicode Consortium y de uso generalizado.

The Unicode CLDR provides key building blocks for software to support the world’s languages, with the largest and most extensive standard repository of locale data available. This data is used by a wide spectrum of companies for their software internationalization and localization, adapting software to the conventions of different languages for such common software tasks.

El CLDR, que usa como herramienta de trabajo un repositorio en Github, publica revisiones del mismo, la última es la CLDR 37 que están disponibles para descarga pública.

datos de publicación de CLDR v7

Gran parte del mérito del impacto del CLDR en el software actual es porque está implementado en las bibliotecas de software ICU, disponiles para C, C++ y Java, publicadas con licencia libre y usadas por todo el mundo y por toda clase de plataformas software. ICU se mantiene desde http://icu-project.org/. Es muy probable que en este mismo momento estés usando aplicaciones que implementan CLDR porque usan las bibliotecas ICU.

Los datos especificados son enormes y exhaustivos para cubrir todas las lenguas registradas. En nuestro caso sólo atenderemos referencias al español y sus variantes.

A su vez el CLDR implementa normas de las que veremos algún breve detalle:

IETF BCP 47 Language Tags

BCP 47 es la especificación de códigos de lenguas usada por CLDR. Está compuesta por dos RFC concatenados:

RFC 5646
y RFC 4647

y aplicando el IANA Language Subtag Registry.

Al parecer se considera más exacta y práctica que las definiciones de ISO 639 e ISO 3166.

Language Subtag Registry - IANA

Es una lista que establece un código (que denominan subtag) por cada lengua registrada.

Puede examinarse el contenido completo del registro de subetiquetas de lenguas: IANA Language Subtag Registry

Seleccionamos las relacionadas con el español:

%%
Type: language
Subtag: es
Description: Spanish
Description: Castilian
Added: 2005-10-16
Suppress-Script: Latn
%%
Type: language
Subtag: osp
Description: Old Spanish
Added: 2009-07-29
%%
%%
Type: language
Subtag: spq
Description: Loreto-Ucayali Spanish
Added: 2009-07-29
%%
Type: variant
Subtag: spanglis
Description: Spanglish
Added: 2017-02-23
Prefix: en
Prefix: es
Comments: A variety of contact dialects of English and Spanish
%%
Type: redundant
Tag: es-419
Description: Latin American Spanish
Added: 2005-07-15

En el corrector ortográfico RLA-ES estamos usando:

%%
Type: language
Subtag: es
Description: Spanish
Description: Castilian
Added: 2005-10-16
Suppress-Script: Latn

Y descubro otra subtag que podría ser de interés:

%%
Type: redundant
Tag: es-419
Description: Latin American Spanish
Added: 2005-07-15

Consideraciones:

Hay más registros relacionados con diferentes lenguas de signos hispanas que están fuera del alcance del corrector.
También hay otros registros de lenguas de países hispánicos diferentes al español que también están fuera del alcance.

M49 Standard

¿Cuál es el origen del código 419? Está definido en el Standard country or area codes for statistical use (M49) y asignado a la región geográfica Latinoamérica. Exactamente:

América Latina y el Caribe

419

No sé cuál es la necesidad práctica de una variante sencillamente porque no la he investigado, pero teniendo en cuenta que está definida como Type: redundant no parece que sea extremadamente prioritaria.

Unicode Common Locale Data Repository (CLDR)

Para nuestro caso examinaremos el contenido del paquete cldr-common-37.0.zip.

Nuestro objetivo es identificar exactamente todos los códigos l10n de lengua española reconocidos para diferentes territorios:

$ ls core/common/main/es*

common/main/es_419.xml	common/main/es_HN.xml
common/main/es_AR.xml	common/main/es_IC.xml
common/main/es_BO.xml	common/main/es_MX.xml
common/main/es_BR.xml	common/main/es_NI.xml
common/main/es_BZ.xml	common/main/es_PA.xml
common/main/es_CL.xml	common/main/es_PE.xml
common/main/es_CO.xml	common/main/es_PH.xml
common/main/es_CR.xml	common/main/es_PR.xml
common/main/es_CU.xml	common/main/es_PY.xml
common/main/es_DO.xml	common/main/es_SV.xml
common/main/es_EA.xml	common/main/es_US.xml
common/main/es_EC.xml	common/main/es_UY.xml
common/main/es_ES.xml	common/main/es_VE.xml
common/main/es_GQ.xml	common/main/es.xml
common/main/es_GT.xml

No vamos a examinar los contenidos, pero cualquiera interesado puede descargarlos y examinarlos.

La lista exacta de códigos se puede obtener fácilmente:

$ ls  common/main/es* | sed "s/.*\///g"|sed "s/.xml//g"|sort

es
es_419
es_AR
es_BO
es_BR
es_BZ
es_CL
es_CO
es_CR
es_CU
es_DO
es_EA
es_EC
es_ES
es_GQ
es_GT
es_HN
es_IC
es_MX
es_NI
es_PA
es_PE
es_PH
es_PR
es_PY
es_SV
es_US
es_UY
es_VE

Para obtener exactamente la lista, en español, de etiquetas de regiones podemos usar:

$ for a in `ls  common/main/es* | sed "s/.*es_*//g"|sed "s/.xml//g"|sort` ; \
     do grep "<territory type=\"$a\">" common/main/es.xml ;done

subetiqueta	región
419	Latinoamérica
AR	Argentina
BO	Bolivia
BR	Brasil
BZ	Belice
CL	Chile
CO	Colombia
CR	Costa Rica
CU	Cuba
DO	República Dominicana
EA	Ceuta y Melilla
EC	Ecuador
ES	España
GQ	Guinea Ecuatorial
GT	Guatemala
HN	Honduras
IC	Canarias
MX	México
NI	Nicaragua
PA	Panamá
PE	Perú
PH	Filipinas
PR	Puerto Rico
PY	Paraguay
SV	El Salvador
US	Estados Unidos
UY	Uruguay
VE	Venezuela

Complitud del corrector RLA-ES

Cuál es, en comparación, el estado de complitud del corrector ortográfico RLA-ES:

subetiqueta	región	hunspell-es
es_419	Latinoamérica	✗
es_AR	Argentina	✔
es_BO	Bolivia	✔
es_BR	Brasil	✗
es_BZ	Belice	✗
es_CL	Chile	✔
es_CO	Colombia	✔
es_CR	Costa Rica	✔
es_CU	Cuba	✔
es_DO	República Dominicana	✔
es_EA	Ceuta y Melilla	✗
es_EC	Ecuador	✔
es_ES	España	✔
es_GQ	Guinea Ecuatorial	✗
es_GT	Guatemala	✔
es_HN	Honduras	✔
es_IC	Canarias	✗
es_MX	México	✔
es_NI	Nicaragua	✔
es_PA	Panamá	✔
es_PE	Perú	✔
es_PH	Filipinas	✔
es_PR	Puerto Rico	✔
es_PY	Paraguay	✔
es_SV	El Salvador	✔
es_US	Estados Unidos	✔
es_UY	Uruguay	✔
es_VE	Venezuela	✔
es	genérico	✔ codificado como es-ANY

Vista la cantidad de regiones cubiertas podemos decir que el estado del proyecto es bastante satisfactorio. Algunos comentarios:

Sobre la cobertura de regiones parte de España no apreciamos ninguna necesidad de preparar diccioanrios para «Ceuta y Melilla» y «Canarias», puesto que en el último caso deberían estar cubiertas por es_ES.
En el caso de «Brasil» y «Belice»… aquí no somos capaces de formular una opinión. El español no es lengua oficial en ambos países y no tenemos ni idea de qué cantidad de población significativa podría necesitar de una versión localizada del corrector.
En cambio en «Guinea Ecuatorial» el español sí es lengua oficial. Sería la única nación-estado con lengua oficial para la que no disponemos de corrector localizado.
El caso de la región «419» podría ser interesante de ser discutida con los miembros del proyecto para saber si sería conveniente algún cambio o acción particular o si los productos actuales son suficientes.

Siguientes pasos

Una vez que están claras cuáles son las etiquetas de lenguas normalizadas parece apropiado verificar la compatibilidad de al menos las aplicaciones más populares. Veremos a ver.

Past days have been very flatpaky to me and productive enough for creating some new material.

A Flatpak getting started video tutorial

Joining my mates at HackLab Almería, who took the initiative of a set of video talks supporting the #YoMeQuedoEnCasa initiative fighting against the boredom COVID-19 confinement in Spain (thanks Víctor), I felt ready enough to give a talk about guerrilla Flatpak packaging using as examples my work on recent bundles:

Talk is in Spanish. If interested you can ask or comment at the HLA forum entry. The recording quality is not good enough: it’s the first time I record a talk with the amazing OBS application. It’s not a great work but it does the job.

JClic published at Flathub

I’m very happy to announce the JClic educative opensource application it’s now published at Flathub! This is relevant because the Clic project has been developed for more than two decades, it’s being used by teachers of all the world and has more than 2500 educative activities published.

desktop screenshot mobile screenshot

Other Flatpak bundles in the works

As I’m getting experience with Flatpak and java apps I’m preparing other ones I think they deserve to be really widespread:

Freeplane, a mindmaping fork of Freemind, waiting to be approved by Flathub;
OmegaT, the best opensource professional CAT tool, waiting some feedback from upstream before nominate it to Flathub;
Archi, the leading opensource Archimate enterprise architecture modeling tool, waiting for upstream feedback too;
and providing some technical support to my friend Jesús Marín who’s packaging TLA+ Toolbox.

Hope all of these will be published at Flathub at some moment.

Enjoy!

OmegaT logo

It took me a while since I announced I retired OmegaT from Fedora and my first try to deploy it with Flatpak and, I hope, Flathub. There were some reasons:

I was completely new to Flatpak,
the big amount of work to write the manifest of a non trivial application compiled completely from scratch,
the complexity at that time of working with Java and Flatpak.

Well, it has been time to resume the task. I’m not really an OmegaT user (I’ve used it just a few times in all these years) but I feel committed helping it to be better known and used in the Linux Desktop. In the past I made a lot of work with technical translations and I fully understand the power of the tool for profesional users and it’s opensource: as far as I know OmegaT is the best CAT (Computer Assisted Translation) opensource tool in the world.

So I’m here again and did some work for a final product. Today the work is a lot of easier because the existence of the OpenJDK Flatpak extension (see this update from Mat Booth). Thanks for this extension! It’s so nice I’m finishing other two java programs to be published at Flathub: Freeplane and JClic. For this release series I’m taking the easy way of packaging the binary portable bundle instead of compiling from sources because… it’s tedious. I am concerned it’s not the best practice. And this time I’ve started from a more recent version beta 5.2.0.

I invite you to give it a try and provide some feedback if any. Some details:

install it:
- $ wget https://dl.bintray.com/olea/org.omegat.App.flatpak (about 170Mb);
- $ flatpak install --user org.omegat.App.flatpak
a merge request at upstream for moving the XDG metadata resources;
it will be using org.freedesktop.Sdk.Extension.openjdk11, the LTS OpenJDK extension at Flathub.

Remember when you install a Flatpak package it is fully self-contained, you don’t need to install anything more (in this case the JRE is included in the bundle) and it’s isolated from the rest of the desktop so you’ll be able to keep an install indefinitely without worrying it will broke on any operating system update.

Remember too you could install this bundle in any modern Linux desktop. The only requirement is to have Flatpak installed, as more recents versions of Linux are.

Next steps:

finishing a final manifest to propose to Flathub in a couple days or so, which will be, as far as I am concerned, the main publishing site of the Flatpak bundle; when in Flathub any user could install it pointing and clicking as in any other app store;
getting accepted at upstream my minor contributions, basicaly metadata;
extending the bundle with some of the most popular OmegaT extensions and dictionaries;
and eventually rewrite the manifest for compiling from sources.

Hope you’ll find it useful.

PD: fixed how to install the package from CLI and changed a couple times the downloading URL :-/

Cuáles son todas las localizaciones (l10n) de la lengua española