MyHTML C/C++ HTML 5 Parser. Using threads.
MyHTML is a fast, multithreaded HTML parser implemented as a pure C99 library with no outside dependencies.
This is one of the modules of the Modest project.
Now
The current version is 1.0.4. Last version is 1.0.3
See [Releases](https://github.com/lexborisov/myhtml/releases)
Changes
Please see the CHANGELOG.md file.
Features
- Asynchronous parsing, tree building and indexing
- Fully conformant with the HTML5 specification
- Two APIs: high-level and low-level
- Manipulation of elements: add, change, delete and more
- Manipulation of element attributes: add, change, delete and more
- Supports the 39 character encodings defined by the encoding.spec.whatwg.org specification
- Supports character encoding detection
- Supports Single Mode parsing
- Supports building without POSIX Threads
- Supports fragment parsing
- Supports parsing by chunks
- No outside dependencies
- C99 support
- Passes all tree construction tests from html5lib-tests
- Tested on 1 billion HTML pages (from commoncrawl.org, see test file)
Further developments
- MyCSS: Fast C/C++ CSS Parser (Cascading Style Sheets Parser)
Supported encodings for InputStream
X_USER_DEFINED, UTF_8, UTF_16LE, UTF_16BE, BIG5, EUC_KR, GB18030,
IBM866, ISO_8859_10, ISO_8859_13, ISO_8859_14, ISO_8859_15, ISO_8859_16, ISO_8859_2, ISO_8859_3,
ISO_8859_4, ISO_8859_5, ISO_8859_6, ISO_8859_7, ISO_8859_8, KOI8_R, KOI8_U, MACINTOSH,
WINDOWS_1250, WINDOWS_1251, WINDOWS_1252, WINDOWS_1253, WINDOWS_1254, WINDOWS_1255, WINDOWS_1256,
WINDOWS_1257, WINDOWS_1258, WINDOWS_874, X_MAC_CYRILLIC, ISO_2022_JP, GBK, SHIFT_JIS, EUC_JP, ISO_8859_8_I
Supported encodings for output
The program works in UTF-8 internally and returns all output in UTF-8.
Detecting character encodings
Currently UTF-8, UTF-16LE, UTF-16BE and the Russian encodings windows-1251, koi8-r, iso-8859-5, x-mac-cyrillic and ibm866 are detected.
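To make the idea of encoding detection concrete, here is a minimal sketch of the simplest part of the problem: recognizing the three Unicode encodings by their byte order mark (BOM). This is a generic technique and not MyHTML's own detection code. The type `sketch_encoding_t` and the function `sketch_detect_bom` are hypothetical names invented for this illustration. Detecting the Russian single-byte encodings needs statistical analysis of the text and is not covered here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch; not part of the MyHTML API. */
typedef enum {
    SKETCH_ENC_UNKNOWN = 0,
    SKETCH_ENC_UTF_8,
    SKETCH_ENC_UTF_16LE,
    SKETCH_ENC_UTF_16BE
} sketch_encoding_t;

/* Inspect the first bytes of a buffer for a Unicode byte order mark.
 * Returns SKETCH_ENC_UNKNOWN when no BOM is present, which says nothing
 * about the actual encoding: most UTF-8 documents carry no BOM at all. */
sketch_encoding_t sketch_detect_bom(const unsigned char *data, size_t size)
{
    assert(data != NULL || size == 0);

    /* UTF-8 BOM: EF BB BF */
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return SKETCH_ENC_UTF_8;

    /* UTF-16 little endian BOM: FF FE */
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return SKETCH_ENC_UTF_16LE;

    /* UTF-16 big endian BOM: FE FF */
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return SKETCH_ENC_UTF_16BE;

    return SKETCH_ENC_UNKNOWN;
}
```

A caller would run such a check on the first bytes of the input and fall back to a default or to a heuristic detector when the result is unknown.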
The Story of Creation
Benchmark
Build and Installation
Make
make
If the build succeeds, copy lib/* and include/* to a location of your choice.
Flags that can be passed to make:
- MyHTML_OPTIMIZATION_LEVEL=-O2: set compiler optimization level. Default: -O2
- MyHTML_BUILD_WITHOUT_THREADS=YES: build without POSIX Threads. Default: NO
for example
make MyHTML_BUILD_WITHOUT_THREADS=NO
cp lib/* /usr/local/lib
cp -r include/* /usr/local/include
CMake
In myhtml/project directory:
cmake .
make
sudo make install
Flags that can be passed to CMake:
- MyHTML_OPTIMIZATION_LEVEL=-O2: set compiler optimization level. Default: -O2
- CMAKE_INSTALL_LIBDIR=lib: set path to install created library. Default: lib
- MyHTML_BUILD_SHARED=ON: build shared library. Default: ON
- MyHTML_BUILD_STATIC=ON: build static library. Default: ON
- MyHTML_INSTALL_HEADER=OFF: install header files. Default: OFF
- MyHTML_BUILD_WITHOUT_THREADS=YES: build without POSIX Threads. Default: NO
- MyHTML_EXTERN_MALLOC=my_malloc_func: set extern malloc function. Default: UNDEFINED
- MyHTML_EXTERN_REALLOC=my_realloc_func: set extern realloc function. Default: UNDEFINED
- MyHTML_EXTERN_CALLOC=my_calloc_func: set extern calloc function. Default: UNDEFINED
- MyHTML_EXTERN_FREE=my_free_func: set extern free function. Default: UNDEFINED
for example
cmake . -DCMAKE_INSTALL_LIBDIR=lib64 -DMyHTML_INSTALL_HEADER=ON
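The MyHTML_EXTERN_* flags name allocator functions that your program supplies. As a rough sketch of what such functions might look like, the wrappers below forward to the C standard library and count live blocks, which can help spot leaks. This assumes the library expects the same signatures as the standard malloc, realloc, calloc and free; that is an assumption, so check the library headers before relying on it. The counter and the helper `my_live_blocks` are hypothetical additions for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Number of blocks currently allocated through the wrappers below.
 * Plain counter, NOT thread-safe: a threaded build would need a mutex
 * or an atomic counter here. */
static size_t live_blocks = 0;

void *my_malloc_func(size_t size)
{
    void *ptr = malloc(size);
    if (ptr != NULL)
        live_blocks++;
    return ptr;
}

void *my_calloc_func(size_t num, size_t size)
{
    void *ptr = calloc(num, size);
    if (ptr != NULL)
        live_blocks++;
    return ptr;
}

void *my_realloc_func(void *ptr, size_t size)
{
    /* Resizing to 0 is implementation-defined and not handled by this sketch. */
    void *new_ptr = realloc(ptr, size);

    /* realloc(NULL, n) behaves like malloc(n), so it creates a new block.
     * Resizing an existing block leaves the count unchanged. */
    if (ptr == NULL && new_ptr != NULL)
        live_blocks++;
    return new_ptr;
}

void my_free_func(void *ptr)
{
    if (ptr != NULL) {
        assert(live_blocks > 0);
        live_blocks--;
    }
    free(ptr);
}

/* Hypothetical helper: report how many blocks are still allocated. */
size_t my_live_blocks(void)
{
    return live_blocks;
}
```

With functions like these compiled into your program, their names would be passed through the flags listed above, for instance -DMyHTML_EXTERN_MALLOC=my_malloc_func. All four are best replaced together, so that memory obtained from one allocator is never released by another.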
Build with your program
I recommend building with clang, but the examples below use gcc.
for example
build with shared library
gcc -Wall -Werror -O2 your_program.c -lmyhtml -o your_program
build with static library
gcc -Wall -Werror -O2 your_program.c /path/to/libmyhtml_static.a -o your_program
Future
Work is in full swing
- make test && make install
- css (MyCSS)
- html render
- js
- ...
Dependencies
None
In other languages, external bindings
Examples
See the examples directory.
Simple example
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <myhtml/api.h>

int main(int argc, const char * argv[])
{
    char html[] = "<div><span>HTML</span></div>";

    // basic init
    myhtml_t* myhtml = myhtml_create();
    myhtml_init(myhtml, MyHTML_OPTIONS_DEFAULT, 1, 0);

    // first tree init
    myhtml_tree_t* tree = myhtml_tree_create();
    myhtml_tree_init(tree, myhtml);

    // parse html
    myhtml_parse(tree, MyHTML_ENCODING_UTF_8, html, strlen(html));

    // release resources
    myhtml_tree_destroy(tree);
    myhtml_destroy(myhtml);

    return 0;
}
AUTHOR
Alexander Borisov
COPYRIGHT AND LICENSE
Copyright (C) 2015-2016 Alexander Borisov
This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.
This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
See the LICENSE file.