serengil / chefboost

A lightweight decision tree framework for Python supporting regular algorithms (ID3, C4.5, CART, CHAID and regression trees) and some advanced techniques (gradient boosting, random forest and AdaBoost), with support for categorical features

Home Page: https://www.youtube.com/watch?v=Z93qE5eb6eg&list=PLsS_1RYmYQQHp_xZObt76dpacY543GrJD&index=3

License: MIT License

Python 99.66% Makefile 0.10% Shell 0.23%

decision-trees gradient-boosting gradient-boosting-machine random-forest adaboost id3 c45-trees cart regression-tree gbm


chefboost's Issues

The CART tree is not a binary tree

The CART tree is not a binary tree. Why?

Python 3.12 issue (no imp module)

When trying chefboost with Python 3.12, it fails because there is no imp module.


..../lib/python3.12/site-packages/chefboost/Chefboost.py", line 5, in <module>
    import imp
ModuleNotFoundError: No module named 'imp'

I made a C4.5 model with chefboost and another model using XGBoost. How can I ensemble these two models?

Any Tree Traversal API or Example?

I am interested in plotting chef trees, particularly decision path for a sample.

A generic traversal iterator would allow users to dump rules in different formats or create various plots with networkx/pygraphviz/matplotlib/dtreeviz/treeinterpreter, e.g. https://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree/39772170

Is there an example of DFS/BFS generator for traversing the nodes?

Example: sklearn DFS via tree structure and decision path
https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html

Example: sklearn BFS generator
https://stackoverflow.com/questions/61203080/traversal-of-sklearn-decision-tree

Does chefboost have anything like decision_path() in scikit-learn?

decision_path()
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.decision_path

I don't mind adding this. I'm looking for a guide to the internals of chefboost; reconstructRules might be the closest thing to a traversal?

Reference
#20
#2
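A minimal sketch of one possible traversal, assuming the generated rules file is a plain Python function made of nested if/elif/else blocks with return statements, as in the rules.py files quoted in other issues on this page. It uses only the standard library's ast module (ast.unparse needs Python 3.9+) and is not a chefboost API; RULES_SOURCE and iter_rules are hypothetical names used for illustration.

```python
import ast

# Hypothetical rules file content, shaped like chefboost's generated rules.py.
RULES_SOURCE = """
def findDecision(obj):
    if obj[0] == 'Sunny':
        if obj[1] == 'High':
            return 'No'
        elif obj[1] == 'Normal':
            return 'Yes'
        else:
            return 'Yes'
    elif obj[0] == 'Overcast':
        return 'Yes'
    else:
        return 'Yes'
"""


def iter_rules(statements, path=()):
    """Depth-first walk yielding (conditions, label) for every leaf."""
    for stmt in statements:
        if isinstance(stmt, ast.If):
            condition = ast.unparse(stmt.test)
            # Branch taken when the condition holds.
            yield from iter_rules(stmt.body, path + (condition,))
            # elif/else live in orelse; elif is a nested ast.If node.
            yield from iter_rules(stmt.orelse, path + (f"not ({condition})",))
        elif isinstance(stmt, ast.Return):
            yield path, ast.literal_eval(stmt.value)


function_def = ast.parse(RULES_SOURCE).body[0]
rules = list(iter_rules(function_def.body))

for conditions, label in rules:
    print(" and ".join(conditions), "->", label)
```

Each yielded tuple is one root-to-leaf path, so the same generator could feed a plotting library or a rule dump in another format.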

Need settings for text classification

Can you improve your library so that it can do text classification?

findDecision incorrect?

I have a CSV with pre-calculated cosine distance between face embeddings of people images in my dataset like this:

       Person1     Person2  Idx1  Idx2  Distance Decision
0   Aaron Paul  Aaron Paul     0     1    0.3245      Yes
1   Aaron Paul  Aaron Paul     0     2    0.2281      Yes
2   Aaron Paul  Aaron Paul     0     3    0.4737      Yes
3   Aaron Paul  Aaron Paul     0     4    0.4103      Yes
4   Aaron Paul  Aaron Paul     0     5    0.3236      Yes
5   Aaron Paul  Aaron Paul     0     6    0.3270      Yes
6   Aaron Paul  Aaron Paul     0     7    0.4873      Yes
7   Aaron Paul  Aaron Paul     0     8    0.3988      Yes
8   Aaron Paul  Aaron Paul     1     2    0.2357      Yes
9   Aaron Paul  Aaron Paul     1     3    0.2613      Yes
10  Aaron Paul  Aaron Paul     1     4    0.3827      Yes
11  Aaron Paul  Aaron Paul     1     5    0.2221      Yes
12  Aaron Paul  Aaron Paul     1     6    0.2183      Yes
13  Aaron Paul  Aaron Paul     1     7    0.4568      Yes
14  Aaron Paul  Aaron Paul     1     8    0.2391      Yes
15  Aaron Paul  Aaron Paul     2     3    0.4439      Yes
16  Aaron Paul  Aaron Paul     2     4    0.4086      Yes
17  Aaron Paul  Aaron Paul     2     5    0.2592      Yes
18  Aaron Paul  Aaron Paul     2     6    0.2863      Yes
19  Aaron Paul  Aaron Paul     2     7    0.4588      Yes

And I use this script to calculate findDecision tree:

import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
tqdm.pandas()

if __name__ == '__main__':
	##############################################################################
	# Read the CSV to determine the best threshold...
	df = pd.read_csv(R"\\10.15.20.109\e$\MODELS\ProtecFR\Model\faces2.csv", encoding='UTF8')
	print(df.head(20))

	df1 = df[df['Decision'] == "Yes"]['Distance'].copy()
	df2 = df[df['Decision'] == "No"]['Distance'].copy()
	print(f"Count Yes: {df1.count()}")
	print(f"Average Yes: {round(df1.mean(), 4)}")
	print(f"Std. deviation Yes: {round(df1.std(), 4)}")
	print(f"Min Yes: {round(df1.min(), 4)}")
	print(f"Max Yes: {round(df1.max(), 4)}")
	print(f"Mode Yes: {round(df1.mode()[0], 4)}")

	print(f"Count No: {df2.count()}")
	print(f"Average No: {round(df2.mean(), 4)}")
	print(f"Std. deviation No: {round(df2.std(), 4)}")
	print(f"Min No: {round(df2.min(), 4)}")
	print(f"Max No: {round(df2.max(), 4)}")
	print(f"Mode No: {round(df2.mode()[0], 4)}")

	df1.plot.kde()
	df2.plot.kde()
	plt.legend(["Yes", "No"])
	plt.grid()
	plt.axhline(0,color='red')
	plt.axvline(0,color='red')
	plt.show()

	from chefboost import Chefboost as chef
	config = {'algorithm': 'C4.5'}

	tmp_df = df[['Distance', 'Decision']].copy()
	model = chef.fit(tmp_df, config)
	print (model)

The results I get are:

Count Yes: 108285
Average Yes: 0.4496
Std. deviation Yes: 0.1557
Min Yes: 0.0
Max Yes: 1.0644
Mode Yes: 0.3465

Count No: 59793700
Average No: 0.7976
Std. deviation No: 0.1112
Min No: 0.0
Max No: 1.2973
Mode No: 0.8114

[INFO]:  8 CPU cores will be allocated in parallel running
C4.5  tree is going to be built...
-------------------------
finished in  135.35767483711243  seconds
-------------------------
Evaluate  train set
-------------------------
Accuracy:  99.81929981118321 % on  59901985  instances
Labels:  ['Yes' 'No']
Confusion matrix:  [[43, 1], [108242, 59793699]]
Precision:  97.7273 %, Recall:  0.0397 %, F1:  0.0794 %
{'trees': [<module 'outputs/rules/rules' from 'c:\\DESARROLLOS\\Python\\VID\\outputs/rules/rules.py'>], 'alphas': [], 'config': {'algorithm': 'C4.5', 'enableRandomForest': False, 'num_of_trees': 5, 'enableMultitasking': False, 'enableGBM': False, 'epochs': 10, 'learning_rate': 1, 'max_depth': 3, 'enableAdaboost': False, 'num_of_weak_classifier': 4, 'enableParallelism': True, 'num_cores': 8}, 'nan_values': [['Distance', None]]}

The plot is:

and outputs/rules/rules.py:

def findDecision(obj): #obj[0]: Distance
	# {"feature": "Distance", "instances": 59901985, "metric_value": 0.0191, "depth": 1}
	if obj[0]>0.0:
		return 'No'
	elif obj[0]<=0.0:
		return 'Yes'
	else: return 'Yes'

As you can see, it gives me a 0.0 threshold when it should be around 0.68.

Am I doing something wrong?

Regards
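One way to sanity-check which cut-off to expect, independently of chefboost, is to scan candidate thresholds on a small sample and score each by information gain. The sketch below does that in plain Python. It uses plain information gain (ID3 style), not C4.5's gain ratio, and it is not chefboost's implementation; the eight sample rows are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the binary split `value <= threshold` with the highest gain."""
    base = entropy(labels)
    best = (None, -1.0)
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - weighted
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Made-up, balanced sample: small distances are "Yes", large ones are "No".
distances = [0.22, 0.32, 0.41, 0.47, 0.71, 0.80, 0.85, 0.93]
decisions = ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"]

threshold, gain = best_threshold(distances, decisions)
print(threshold, round(gain, 4))
```

Running the same scan on a balanced subsample of the real data (for example, equal numbers of Yes and No rows) may help show whether the 0.0 split comes from the extreme class imbalance or from the split search itself.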

Parallelism does not seem to be working

Hi,

I've used the following code:

if __name__ == '__main__':
    config = {'algorithm': 'Regression', 'enableParallelism' : True, 'enableGBM': True, 'epochs': 10, 'learning_rate': 0.01}
    model = chef.fit(df_tree_train, config)

and when I check my CPU usage I see that only one core is being used. Why aren't all my cores being used?

Unreasonable training time when I make a simple change

So I am training a CHAID decision tree for multiclass classification, and the target variable is a string. Other than the target, I have 4 other features, two of which I want to be string type. When I train the model with only one feature as string, training takes about 15 minutes. But when I convert the other feature I wish to be treated as categorical to string, training takes forever (entire day and no result).

What could be causing this?

What method is used to estimate the accuracy and precision rates?

Did you use k-fold cross-validation, or what method did you use to calculate these rates?

The last available PyPI version is from Feb 15, 2022, so it is missing fixes made for other issues.

As the title says. I ran into #34 myself before tracking down the difference between the master branch of the repo and my installed files.

For anyone else with this issue, you can use pip install git+https://github.com/serengil/chefboost.git to install directly from the git repo.
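To check which release is actually installed before comparing it with the repository, the standard library's importlib.metadata (Python 3.8+) can report the installed distribution version. A small sketch; installed_version is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


print("chefboost:", installed_version("chefboost"))
```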

Built-in library 'imp' was removed in python 3.12, breaking chefboost

As the title states, the imp library used by chefboost was removed in Python 3.12 (released last month), meaning chefboost is currently broken after 3.11. It has been replaced by importlib.
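For reference, the usual importlib replacement for loading a module from a file path looks like the sketch below. It is a generic standard-library pattern, not necessarily the exact change made in chefboost; load_module_from_path and the temporary rules.py file are hypothetical and only there to make the example runnable.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_module_from_path(module_name, file_path):
    """Load a Python source file as a module without using the removed imp module."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Module '{module_name}' not found at {file_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a generated rules file.
    rules_path = Path(tmp) / "rules.py"
    rules_path.write_text(
        "def findDecision(obj):\n"
        "    return 'Yes' if obj[0] <= 0.5 else 'No'\n"
    )
    rules = load_module_from_path("rules", str(rules_path))
    decision = rules.findDecision([0.3])

print(decision)
```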

Q: Are feature engineering tools mixed in for BT, RF, and GB?

From this scikit-learn tutorial it is easy to see that features that are not orthogonal to one another often produce subpar results.
There are tools that help with this by mixing different columns in the dataset through "feature engineering".
Some notable ones include the following libraries:

Q: can this feature builder be used to create classifiers in tests/outputs/rules with engineered features?

'Series' object has no attribute 'Decision'

When running the golf example:

df = pd.read_csv("data/golf.txt")
config = {'algorithm': 'C4.5'}
model = chef.fit(df, config = config, target_label = 'Decision')

I get the following error:

[INFO]:  10 CPU cores will be allocated in parallel running
C4.5  tree is going to be built...

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_28440\547795482.py in ?()
     10 import pandas as pd
     11 
     12 df = pd.read_csv("data/golf.txt")
     13 config = {'algorithm': 'C4.5'}
---> 14 model = chef.fit(df, config = config, target_label = 'Decision')

C:\Lib\site-packages\chefboost\Chefboost.py in ?(df, config, target_label, validation_df)
    209                 if enableParallelism == True:
    210                         json_file = "outputs/rules/rules.json"
    211                         functions.createFile(json_file, "[\n")
    212 
--> 213 		trees = Training.buildDecisionTree(df, root = root, file = file, config = config
    214                                 , dataset_features = dataset_features
    215 				, parent_level = 0, leaf_id = 0, parents = 'root', validation_df = validation_df, main_process_id = process_id)
    216 

C:\Lib\site-packages\chefboost\\chefboost\training\Training.py in ?(df, root, file, config, dataset_features, parent_level, leaf_id, parents, tree_id, validation_df, main_process_id)
    432                 pivot = pd.DataFrame(subdataset.Decision.value_counts()).reset_index()
    433                 pivot = pivot.rename(columns = {"Decision": "Instances","index": "Decision"})
    434                 pivot = pivot.sort_values(by = ["Instances"], ascending = False).reset_index()
    435 
--> 436                 else_decision = "return '%s'" % (pivot.iloc[0].Decision)
    437 
    438                 if enableParallelism != True:
    439                         functions.storeRule(file,(functions.formatRule(root), "else:"))

C:\Lib\site-packages\chefboost\Lib\site-packages\pandas\core\generic.py in ?(self, name)
   5985             and name not in self._accessors
   5986             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5987         ):
   5988             return self[name]
-> 5989         return object.__getattribute__(self, name)

AttributeError: 'Series' object has no attribute 'Decision'

I think someone ran into the same issue on Stack Overflow.
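A likely cause, judging from lines 432 to 436 of the traceback, is a change in pandas 2.0: value_counts() now names its result "count" and puts the original column name on the index, so the rename on line 433 no longer produces a "Decision" column. The sketch below shows a way to build the same pivot that should behave the same on pandas 1.x and 2.x. It is an illustration of the idea, not the patch applied in chefboost, and the four-row subdataset is made up.

```python
import pandas as pd

# Made-up stand-in for the `subdataset` frame seen in the traceback.
subdataset = pd.DataFrame({"Decision": ["Yes", "No", "Yes", "Yes"]})

pivot = (
    subdataset["Decision"]
    .value_counts()
    .rename_axis("Decision")          # name the index explicitly
    .reset_index(name="Instances")    # name the counts column explicitly
    .sort_values(by="Instances", ascending=False)
    .reset_index(drop=True)
)

# Most frequent class, formatted the way line 436 of the traceback does it.
else_decision = "return '%s'" % pivot.iloc[0]["Decision"]
print(pivot)
print(else_decision)
```

Naming both the index and the counts column explicitly avoids depending on the default names, which are what changed between pandas versions.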

As a developer, I want to see type hints on chefboost functions

Does GridSearchCV work?

There is a problem with how chefboost is supported by GridSearchCV.

CHAID model always returns 0 accuracy

I am using chefboost for the CHAID algorithm. The dataset contains 10000 rows and 7 columns, and fit always reports 0 accuracy. What could be causing this? Can you help me?
I also want to visualize the tree graph. How can I do this?

config={"algorithm":"CHAID",'enableParallelism': False}
model=cb.fit(df.loc[:10000,independent_variable_columns],config,target_label='Decision')

CHAID tree is going to be built...

finished in 6.883694887161255 seconds

Evaluate train set

Accuracy: 0.0 % on 10001 instances
Labels: [0 1]
Confusion matrix: [[0, 0], [0, 0]]
Precision: 0.0 %, Recall: 0.0 %, F1: 0.0 %

Visualize tree structure

Hi, does the package provide a visualization of the decision tree result, as sklearn does?

How to visualize

Thanks for chefboost, it helps me a lot, but I want to know how to visualize the decision tree built by chefboost, or how to find the number of leaves.

feature_importance incorrect?

When I checked the function feature_importance(rules) in Chefboost.py, I found that the child_score of a node is calculated through "if child_rule["depth"] == current_depth + 1:". I don't know if I misunderstand the meaning of child_score, but I think not every node at that depth is a child of the current node.

here is some of my data: (Only in this section does WBC appear)

{"feature": "WBC", "instances": 21, "metric_value": 0.7025, "depth": 5}
if obj[24]<=13.55:
    {"feature": "HGB", "instances": 20, "metric_value": 0.6098, "depth": 6}
    if obj[26]<=65.0:
        {"feature": "Infusion volume", "instances": 19, "metric_value": 0.4855, "depth": 7}
        if obj[58]>500.0:
            {"feature": "MV_A", "instances": 18, "metric_value": 0.3095, "depth": 8}
            ................
            ...............
        elif obj[58]<=500.0:
            return 'yes'
        else: return 'yes'
    elif obj[26]>65.0:
        return 'yes'
    else: return 'yes'
elif obj[24]>13.55:
    return 'yes'
else: return 'yes'

I think the feature_importance of WBC (before normalization) should be calculated like this:
WBC = 21 * 0.7025 - 20 * 0.6098

but in fact Chefboost calculates it like this:
{"feature": "WBC", "instances": 21, "metric_value": 0.7025, "depth": 5}
child: {'feature': 'HGB', 'instances': 667, 'metric_value': 0.0295, 'depth': 6}
child: {'feature': 'Infusion volume', 'instances': 16, 'metric_value': 0.3373, 'depth': 6}
child: {'feature': 'HGB', 'instances': 20, 'metric_value': 0.6098, 'depth': 6}
score: -22.516799999999996

WBC = 21 * 0.7025 - 667 * 0.0295 - 16 * 0.3373 - 20 * 0.6098 = -22.51679 (even a negative value)

As you can see, HGB (667 instances) and Infusion volume are also counted as children in the calculation, so I wonder
which one is right?
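The two calculations can be reproduced with a short sketch. It assumes the rule metadata is listed in depth-first order, as it appears in the generated rules file, so that a node's direct children are the depth + 1 entries that follow it before the walk leaves its subtree. The position of the two unrelated depth-6 nodes in the list is illustrative, and direct_children, naive_score and subtree_score are hypothetical names, not chefboost functions.

```python
def direct_children(rules, parent_index):
    """Children of rules[parent_index], assuming depth-first ordering."""
    parent_depth = rules[parent_index]["depth"]
    children = []
    for rule in rules[parent_index + 1:]:
        if rule["depth"] <= parent_depth:
            break  # left the parent's subtree
        if rule["depth"] == parent_depth + 1:
            children.append(rule)
    return children


def weight(rule):
    return rule["instances"] * rule["metric_value"]


def naive_score(rules, parent_index):
    """Subtract every node at depth + 1, wherever it sits in the tree."""
    parent = rules[parent_index]
    others = [r for r in rules if r["depth"] == parent["depth"] + 1]
    return weight(parent) - sum(weight(r) for r in others)


def subtree_score(rules, parent_index):
    """Subtract only the direct children of the node."""
    parent = rules[parent_index]
    return weight(parent) - sum(weight(r) for r in direct_children(rules, parent_index))


rules = [
    # Two depth-6 nodes from other branches (placement is illustrative).
    {"feature": "HGB", "instances": 667, "metric_value": 0.0295, "depth": 6},
    {"feature": "Infusion volume", "instances": 16, "metric_value": 0.3373, "depth": 6},
    # The WBC subtree quoted in the issue.
    {"feature": "WBC", "instances": 21, "metric_value": 0.7025, "depth": 5},
    {"feature": "HGB", "instances": 20, "metric_value": 0.6098, "depth": 6},
    {"feature": "Infusion volume", "instances": 19, "metric_value": 0.4855, "depth": 7},
    {"feature": "MV_A", "instances": 18, "metric_value": 0.3095, "depth": 8},
]

wbc_index = 2
naive = naive_score(rules, wbc_index)
scoped = subtree_score(rules, wbc_index)
print(round(naive, 4), round(scoped, 4))
```

With these numbers the depth-only filter gives the negative score reported above, while restricting the subtraction to WBC's own child gives 21 * 0.7025 - 20 * 0.6098.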

Chefboost result is not returned

I used this library as described in the documentation, with the same dataset, but no result was returned.
It just shows the text below and the process never completes.

from chefboost import Chefboost as chef
config = {'algorithm': 'CHAID'}
model = chef.fit(df, config)

[INFO]: 40 CPU cores will be allocated in parallel running
CHAID tree is going to be built...

Python version 3.8

What about plotting?

That.

What about plotting, dear friend? You have done awesome work, but this great functionality is still missing.

I'm new to Python. Are the chefboost outputs ready to be plotted, for example to show the tree as a picture?

Thanks in advance!

Getting None as predicted values

I am getting None as a predicted output, what would be the reason for it?

environment:
pandas==0.25.1
numpy==1.17.2
tqdm==4.36.1
Python 3.7.4

train data
test data

code:
chefboost_c45.txt
(unable to attach .py as Github doesn't allow, hence added .txt)

output:
C4.5 tree is going to be built...
Accuracy: 79.16666666666667 % on 24 instances
finished in 0.41808056831359863 seconds
Win
Win
Win
None
Win
Win
Win
Win
Win
Lose
Win
Lose

Also, does the chefboost have support to get precision, recall, and f1 score?

Can't ensemble with other models: "estimator should be a classifier" error

split unit tests into many files

classification returns irrelevant results in else conditions

For the configuration

config = {
        "algorithm": "ID3",
        # "enableRandomForest": True,
        # "num_of_trees": 3,
    }

I am getting the following tree for car.data dataset.

def findDecision(obj): #obj[0]: buying, obj[1]: maint, obj[2]: doors, obj[3]: persons, obj[4]: lug_boot, obj[5]: safety
	# {"feature": "safety", "instances": 1728, "metric_value": 1.2057, "depth": 1}
	if obj[5] == 'low':
		return 'unacc'
	elif obj[5] == 'med':
		# {"feature": "persons", "instances": 576, "metric_value": 1.2152, "depth": 2}
		if obj[3] == '2':
			return 'unacc'
		elif obj[3] == '4':
			# {"feature": "buying", "instances": 192, "metric_value": 1.3543, "depth": 3}
			if obj[0] == 'vhigh':
				# {"feature": "maint", "instances": 48, "metric_value": 0.8113, "depth": 4}
				if obj[1] == 'vhigh':
					return 'unacc'
				elif obj[1] == 'high':
					return 'unacc'
				elif obj[1] == 'med':
					# {"feature": "lug_boot", "instances": 12, "metric_value": 1.0, "depth": 5}
					if obj[4] == 'small':
						return 'unacc'
					elif obj[4] == 'med':
						return 'unacc'
					elif obj[4] == 'big':
						return 'acc'
					else: return '4'
				elif obj[1] == 'low':
					# {"feature": "lug_boot", "instances": 12, "metric_value": 1.0, "depth": 5}
					if obj[4] == 'small':
						return 'unacc'
					elif obj[4] == 'med':
						return 'unacc'
					elif obj[4] == 'big':
						return 'acc'
					else: return '4'
				else: return '6'

As seen, results should be nominal but in else conditions it is returning numbers somehow.

spawn makes it unable to run on Linux

I guess it is because of this line in Chefboost.py: set_start_method("spawn", force=True)

I'm on Linux, and I'm unable to run chef.fit both in Jupyter and in a main (if __name__ == '__main__':) unless I disable the parallelism (enableParallelism: False)
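For context, the standard library lets code request a start method through a local context instead of forcing a global one with set_start_method(..., force=True). The sketch below only shows that mechanism. It is not chefboost's code, pick_context is a hypothetical helper, and whether a local context would resolve this particular issue is untested.

```python
import multiprocessing as mp


def pick_context(preferred="spawn"):
    """Return a multiprocessing context without changing the global start method."""
    available = mp.get_all_start_methods()
    method = preferred if preferred in available else available[0]
    return mp.get_context(method)


ctx = pick_context()
print("available:", mp.get_all_start_methods())
print("using:", ctx.get_start_method())
# Pools created from the context use its start method, for example:
# with ctx.Pool(processes=2) as pool: ...
```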

Error for model code

I followed the instructions in the README, but I encounter an error when running the model code. Why might this be happening?
My Python version is 3.11.4.

24-04-12 13:57:40 - [INFO]: 16 CPU cores will be allocated in parallel running
24-04-12 13:57:40 - C4.5 tree is going to be built...

ImportError Traceback (most recent call last)
Cell In[4], line 1
----> 1 model = chef.fit(df, config = config, target_label = 'Decision')

File c:\Users\heaop\Documents\GitHub\chefboost\Chefboost.py:275, in fit(df, config, target_label, validation_df, silent)
272 json_file = "outputs/rules/rules.json"
273 functions.createFile(json_file, "[\n")
--> 275 trees = Training.buildDecisionTree(
276 df,
277 root=root,
278 file=file,
279 config=config,
280 dataset_features=dataset_features,
281 parent_level=0,
282 leaf_id=0,
283 parents="root",
284 validation_df=validation_df,
285 main_process_id=process_id,
286 )
288 if silent is False:
289 logger.info("-------------------------")

File c:\Users\heaop\Documents\GitHub\chefboost\training\Training.py:712, in buildDecisionTree(df, root, file, config, dataset_features, parent_level, leaf_id, parents, tree_id, validation_df, main_process_id)
703 if (
...
---> 20 raise ImportError(f"Module '{module_name}' not found")
22 module = importlib.util.module_from_spec(spec)
23 spec.loader.exec_module(module)

ImportError: Module 'outputs/rules/rules' not found

Target label type

Is it true that the Decision column of the input training dataset should be of string type?
I tried to feed an integer array at first and got 0 accuracy, but converting it to a string array works.

add github actions

'numpy.float32' object has no attribute 'is_integer'

Tried to do the following on a dataset with float samples. (Running on Python 3.7)

configGBM = {'algorithm': 'C4.5', 'enableGBM': True, 'epochs': 7, 'learning_rate': 1, 'max_depth': 5}
modelGBM = chef.fit(train, config = configGBM)

Error Log:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/vk/rw3fbc110n3fsf_xhz6_r4m00000gn/T/ipykernel_67628/3037199772.py in <module>
      1 configGBM = {'algorithm': 'C4.5', 'enableGBM': True, 'epochs': 7, 'learning_rate': 1, 'max_depth': 5}
----> 2 modelGBM = chef.fit(train, config = configGBM)
    
/usr/local/lib/python3.7/site-packages/chefboost/Chefboost.py in fit(df, config, target_label, validation_df)
    190 
    191                 if df['Decision'].dtypes == 'object': #transform classification problem to regression
--> 192                         trees, alphas = gbm.classifier(df, config, header, dataset_features, validation_df = validation_df, process_id = process_id)
    193                         classification = True
    194 

/usr/local/lib/python3.7/site-packages/chefboost/tuning/gbm.py in classifier(df, config, header, dataset_features, validation_df, process_id)
    270                                 instance['P_'+str(j)] = probabilities[j]
    271 
--> 272                         worksheet.loc[row] = instance
    273 
    274                 for i in range(0, len(classes)):

/usr/local/lib/python3.7/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
    721 
    722         iloc = self if self.name == "iloc" else self.obj.iloc
--> 723         iloc._setitem_with_indexer(indexer, value, self.name)
    724 
    725     def _validate_key(self, key, axis: int):

/usr/local/lib/python3.7/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value, name)
   1728         if take_split_path:
   1729             # We have to operate column-wise
-> 1730             self._setitem_with_indexer_split_path(indexer, value, name)
   1731         else:
   1732             self._setitem_single_block(indexer, value, name)

/usr/local/lib/python3.7/site-packages/pandas/core/indexing.py in _setitem_with_indexer_split_path(self, indexer, value, name)
   1795                 # We are setting multiple columns in a single row.
   1796                 for loc, v in zip(ilocs, value):
-> 1797                     self._setitem_single_column(loc, v, pi)
   1798 
   1799             elif len(ilocs) == 1 and com.is_null_slice(pi) and len(self.obj) == 0:

/usr/local/lib/python3.7/site-packages/pandas/core/indexing.py in _setitem_single_column(self, loc, value, plane_indexer)
   1918             # set the item, possibly having a dtype change
   1919             ser = ser.copy()
-> 1920             ser._mgr = ser._mgr.setitem(indexer=(pi,), value=value)
   1921             ser._maybe_update_cacher(clear=True)
   1922 

/usr/local/lib/python3.7/site-packages/pandas/core/internals/managers.py in setitem(self, indexer, value)
    353 
    354     def setitem(self: T, indexer, value) -> T:
--> 355         return self.apply("setitem", indexer=indexer, value=value)
    356 
    357     def putmask(self, mask, new, align: bool = True):

/usr/local/lib/python3.7/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
    325                     applied = b.apply(f, **kwargs)
    326                 else:
--> 327                     applied = getattr(b, f)(**kwargs)
    328             except (TypeError, NotImplementedError):
    329                 if not ignore_failures:

/usr/local/lib/python3.7/site-packages/pandas/core/internals/blocks.py in setitem(self, indexer, value)
    924         # coerce if block dtype can store value
    925         values = self.values
--> 926         if not self._can_hold_element(value):
    927             # current dtype cannot store value, coerce to common dtype
    928             return self.coerce_to_target_dtype(value).setitem(indexer, value)

/usr/local/lib/python3.7/site-packages/pandas/core/internals/blocks.py in _can_hold_element(self, element)
    620         """require the same dtype as ourselves"""
    621         element = extract_array(element, extract_numpy=True)
--> 622         return can_hold_element(self.values, element)
    623 
    624     @final

/usr/local/lib/python3.7/site-packages/pandas/core/dtypes/cast.py in can_hold_element(arr, element)
   2181         if tipo is not None:
   2182             if tipo.kind not in ["i", "u"]:
-> 2183                 if is_float(element) and element.is_integer():
   2184                     return True
   2185                 # Anything other than integer we cannot hold

AttributeError: 'numpy.float32' object has no attribute 'is_integer'

Indentation error

While using C4.5 I get an indentation error

config = {'algorithm': 'C4.5'}
model = chef.fit(df, config = config, target_label = 'party')

[INFO]: 1 CPU cores will be allocated in parallel running
C4.5 tree is going to be built...
File "outputs/rules/rules.py", line 37
else: return 'no'
^
IndentationError: expected an indented block

Cannot install it

Hello!

I am trying to install chefboost in Windows without any success...

max_depth parameter not working

The max_depth parameter seems to not be working. The fit function fits a tree with maximal possible depth regardless of setting.

Parallelism does not work properly

Hi,

I'm testing the library using the following code snippet:

config = {
'algorithm': 'C4.5',
#'enableParallelism': True, 'num_cores': 32,
}
model = chef.fit(df, config = config)

Which prints: "finished in 9.606534719467163 seconds"

Then, when I enable parallelism by uncommenting the line in the config, it never finishes, but it uses 100% of the CPU for a really long time, much more than 10 seconds.

I had just 3000 samples, but it is taking forever (num_cores = 2). Any suggestions?

Parallelism default value is True

In the documentation it says that the default value to allow parallelism is False, but checking the code and commit history the default value is True.

It is necessary to adjust the documentation to be consistent with the code.

Code:

chefboost/chefboost/commons/functions.py

Line 85 in 5e1477a

enableParallelism = True

README: https://github.com/serengil/chefboost/tree/5e1477af150d1dd16f9bc5e4a46acb3dc5f63ea8#paralellism
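Until the documentation and the code agree, setting the flag explicitly avoids relying on either default. The sketch below illustrates how a user config overrides defaults. ASSUMED_DEFAULTS copies a few values from the config printed in an earlier issue on this page and may differ between versions; effective_config is a hypothetical helper, not a chefboost function.

```python
# Assumed defaults, taken from a config dump quoted elsewhere on this page.
ASSUMED_DEFAULTS = {
    "enableParallelism": True,
    "enableGBM": False,
    "enableRandomForest": False,
    "enableAdaboost": False,
}


def effective_config(user_config, defaults=ASSUMED_DEFAULTS):
    """User-supplied keys win over defaults."""
    return {**defaults, **user_config}


implicit = effective_config({"algorithm": "C4.5"})
explicit = effective_config({"algorithm": "C4.5", "enableParallelism": False})
print(implicit["enableParallelism"], explicit["enableParallelism"])
```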

findDecision(obj) and accuracy giving weird results

I try to run this code:

import Chefboost as chef
import pandas as pd

if __name__ == "__main__":
    df = pd.read_csv("dataset/golf.txt")
    
    config = {'algorithm': 'C4.5'}
    model = chef.fit(df, config)

and then when I check outputs/rules/rules.py this is what I get:

def findDecision(obj): #obj[0]: Outlook, obj[1]: Temp., obj[2]: Humidity, obj[3]: Wind
   if obj[2] == 'Rain':
      if obj[0] == 'Weak':
         return 'Yes'
      elif obj[0] == 'Strong':
         return 'No'
   elif obj[2] == 'Sunny':
      if obj[1] == 'High':
         return 'No'
      elif obj[1] == 'Normal':
         return 'Yes'
   elif obj[2] == 'Overcast':
      return 'Yes'

obj[0] isn't Outlook but Wind. Also, sometimes I get 0% accuracy after running the code 2 or 3 times.

Getting KeyError: 'Decision'

Trying to find the gain for SepalLengthCm in the IRIS dataset, following this.

config = {'algorithm': 'ID3'}
sorted(df['SepalLengthCm'].unique())

threshold = 6.0
idx = df[df['SepalLengthCm'] <= threshold].index
tmp_df = df.copy()
tmp_df['SepalLengthCm'] = '>'+str(threshold)
tmp_df.loc[idx, 'SepalLengthCm'] = '<='+str(threshold)

gain = Training.findGains(tmp_df, config)
print(threshold, ': ', gain)

Also

df = iris[['SepalLengthCm', 'y']]

When running this I get the following error

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File D:\Files\dev\misc\py_venvs\cvip\lib\site-packages\pandas\core\indexes\base.py:3621, in Index.get_loc(self, key, method, tolerance)
   3620 try:
-> 3621     return self._engine.get_loc(casted_key)
   3622 except KeyError as err:

File D:\Files\dev\misc\py_venvs\cvip\lib\site-packages\pandas\_libs\index.pyx:136, in pandas._libs.index.IndexEngine.get_loc()

File D:\Files\dev\misc\py_venvs\cvip\lib\site-packages\pandas\_libs\index.pyx:163, in pandas._libs.index.IndexEngine.get_loc()

File pandas\_libs\hashtable_class_helper.pxi:5198, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas\_libs\hashtable_class_helper.pxi:5206, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Decision'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Input In [10], in <cell line: 7>()
      4 tmp_df['SepalLengthCm'] = '>'+str(threshold)
      5 tmp_df.loc[idx, 'SepalLengthCm'] = '<='+str(threshold)
----> 7 gain = Training.findGains(tmp_df, config)
      8 print(threshold, ': ', gain)

File D:\Files\dev\misc\py_venvs\cvip\lib\site-packages\chefboost\training\Training.py:107, in findGains(df, config)
    104 def findGains(df, config):
    106 	algorithm = config['algorithm']
--> 107 	decision_classes = df["Decision"].unique()
    109 	#-----------------------------
    111 	entropy = 0

File D:\Files\dev\misc\py_venvs\cvip\lib\site-packages\pandas\core\frame.py:3505, in DataFrame.__getitem__(self, key)
   3503 if self.columns.nlevels > 1:
   3504     return self._getitem_multilevel(key)
-> 3505 indexer = self.columns.get_loc(key)
   3506 if is_integer(indexer):
   3507     indexer = [indexer]

File D:\Files\dev\misc\py_venvs\cvip\lib\site-packages\pandas\core\indexes\base.py:3623, in Index.get_loc(self, key, method, tolerance)
   3621     return self._engine.get_loc(casted_key)
   3622 except KeyError as err:
-> 3623     raise KeyError(key) from err
   3624 except TypeError:
   3625     # If we have a listlike key, _check_indexing_error will raise
   3626     #  InvalidIndexError. Otherwise we fall through and re-raise
   3627     #  the TypeError.
   3628     self._check_indexing_error(key)

KeyError: 'Decision'

Is it possible to obtain detailed metrics using another Python library with the decision tree of chefboost?

I saw in another issue that the detailed metrics are not yet implemented in chefboost, so I tried to use scikit learn metrics, but it was not compatible. Do you know any external alternatives at the moment?
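If external metric functions reject the outputs, one fallback is to compute the metrics directly from two plain lists of labels. A minimal sketch in pure Python, assuming the predictions have already been collected into a list (a prediction of None simply counts as wrong); the sample labels are made up and classification_metrics is a hypothetical helper.

```python
def classification_metrics(actuals, predictions, positive):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(actuals, predictions))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels; the None mimics a row for which no rule matched.
actuals = ["Win", "Win", "Lose", "Win", "Lose", "Win"]
predictions = ["Win", "Win", "Win", None, "Lose", "Win"]

metrics = classification_metrics(actuals, predictions, positive="Win")
print(metrics)
```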

Error while fitting the model

Following is the dataframe:

and following is the additional code:

df.rename(columns={'result': 'Decision'}, inplace=True)

Output:

Index(['Date', 'Country', 'League', 'Season', 'HomeTeam', 'AwayTeam',
       'home_goal', 'away_goal', 'Decision'],
      dtype='object')

config = {"algorithm" : "C4.5"}
model = chef.fit(df, config,  target_label = "Decision")

I am getting error:

[INFO]:  4 CPU cores will be allocated in parallel running
C4.5  tree is going to be built...
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_22413/130574452.py in ?()
----> 1 model = chef.fit(df, config,  target_label = "Decision")

~/anaconda3/envs/rover/lib/python3.10/site-packages/chefboost/Chefboost.py in ?(df, config, target_label, validation_df)
    209                 if enableParallelism == True:
    210                         json_file = "outputs/rules/rules.json"
    211                         functions.createFile(json_file, "[\n")
    212 
--> 213 		trees = Training.buildDecisionTree(df, root = root, file = file, config = config
    214                                 , dataset_features = dataset_features
    215 				, parent_level = 0, leaf_id = 0, parents = 'root', validation_df = validation_df, main_process_id = process_id)
    216 

~/anaconda3/envs/rover/lib/python3.10/site-packages/chefboost/training/Training.py in ?(df, root, file, config, dataset_features, parent_level, leaf_id, parents, tree_id, validation_df, main_process_id)
    432                 pivot = pd.DataFrame(subdataset.Decision.value_counts()).reset_index()
    433                 pivot = pivot.rename(columns = {"Decision": "Instances","index": "Decision"})
    434                 pivot = pivot.sort_values(by = ["Instances"], ascending = False).reset_index()
    435 
--> 436                 else_decision = "return '%s'" % (pivot.iloc[0].Decision)
    437 
    438                 if enableParallelism != True:
    439                         functions.storeRule(file,(functions.formatRule(root), "else:"))

~/anaconda3/envs/rover/lib/python3.10/site-packages/pandas/core/generic.py in ?(self, name)
   6200             and name not in self._accessors
   6201             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   6202         ):
   6203             return self[name]
-> 6204         return object.__getattribute__(self, name)

AttributeError: 'Series' object has no attribute 'Decision'

Even if I do not rename the column, it makes no difference. It always throws this error.
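A likely cause, stated as an assumption because the pandas version is not given in the report: from pandas 2.0 onwards, `value_counts()` names the counts `count` and keeps the original column name on the index, so the `rename` at line 433 of the traceback no longer leaves a `Decision` column behind. The sketch below uses a made-up `subdataset` and shows one way to build the same pivot with explicit names, which should not depend on the pandas version. It is not the library's actual fix.

```python
import pandas as pd

# Made-up leaf data standing in for the `subdataset` seen in the traceback.
subdataset = pd.DataFrame({"Decision": ["H", "D", "H", "A", "H"]})

# Name the index and the counts explicitly instead of relying on the
# default column names, which differ between pandas 1.x and 2.x.
pivot = (
    subdataset["Decision"]
    .value_counts()              # sorted by frequency, most common first
    .rename_axis("Decision")
    .reset_index(name="Instances")
)

# Same rule text the traceback was trying to build at line 436.
else_decision = "return '%s'" % pivot.iloc[0]["Decision"]
print(pivot)
print(else_decision)
```

If the installed chefboost cannot be changed, checking the installed pandas version against the one chefboost was released for would be the first thing to try.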

Only Regression Tree is Built

No matter what configuration I give, it always seems to default to building a Regression Tree.
I also tried passing a non-existent algorithm name, and that case behaves as expected: it just throws an error.

Code:

from chefboost import Chefboost as cb
import pandas as pd

df = pd.read_csv("~/Downloads/kmodes_fillna_cluster.csv")

config = {'algorithm': 'C4.5'}
model = cb.fit(df, config)

Output:

Regression  tree is going to be built...
MAE:  0.10815602836879432
RMSE:  0.2325467999874373
Mean:  0.2872340425531915
MAE / Mean:  37.654320987654316 %
RMSE / Mean:  80.96073777340409 %
finished in  1.4060499668121338  seconds
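A possible explanation, offered as an assumption rather than something confirmed by the code in this report: chefboost appears to choose between classification and regression from the data type of the `Decision` column, so a numeric target (for example class ids such as 0 and 1) would lead to a regression tree whatever `algorithm` says. The sketch below uses a made-up dataframe in place of the CSV and shows how to check the target's type and turn it into text labels before fitting.

```python
import pandas as pd

# Made-up stand-in for the CSV in the report: a numeric, label-like target.
df = pd.DataFrame({
    "feature": ["a", "b", "a", "b"],
    "Decision": [0, 1, 0, 1],
})

# A numeric target is the situation assumed to trigger the regression tree.
was_numeric = pd.api.types.is_numeric_dtype(df["Decision"])

# Turn the class ids into text labels so the target is no longer numeric.
df["Decision"] = df["Decision"].astype(str)

is_numeric_now = pd.api.types.is_numeric_dtype(df["Decision"])
print(was_numeric, is_numeric_now)
```

After the cast, `cb.fit(df, config)` would be called exactly as in the report.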

add logger

UnboundLocalError: local variable 'subdataset' referenced before assignment

Hi,

I am getting the error below when running CHAID. With ID3, however, it runs successfully.

[INFO]: 4 CPU cores will be allocated in parallel running
CHAID tree is going to be built...

RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\Users\Rahul.Chandel\Anaconda3\lib\multiprocessing\pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 209, in createBranchWrapper
return func(*args)
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 330, in createBranch
results = buildDecisionTree(subdataset, root, file, config, dataset_features
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 533, in buildDecisionTree
sub_results = createBranchWrapper(createBranch, input_param)
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 209, in createBranchWrapper
return func(*args)
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 330, in createBranch
results = buildDecisionTree(subdataset, root, file, config, dataset_features
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 533, in buildDecisionTree
sub_results = createBranchWrapper(createBranch, input_param)
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 209, in createBranchWrapper
return func(*args)
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 330, in createBranch
results = buildDecisionTree(subdataset, root, file, config, dataset_features
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 432, in buildDecisionTree
pivot = pd.DataFrame(subdataset.Decision.value_counts()).reset_index()
UnboundLocalError: local variable 'subdataset' referenced before assignment
"""

The above exception was the direct cause of the following exception:

UnboundLocalError Traceback (most recent call last)
in
7
8 df = df.drop('fill_ratio', axis =1)
----> 9 model = cb.fit(df, config = config)

~\Anaconda3\lib\site-packages\chefboost\Chefboost.py in fit(df, config, target_label, validation_df)
211 functions.createFile(json_file, "[\n")
212
--> 213 trees = Training.buildDecisionTree(df, root = root, file = file, config = config
214 , dataset_features = dataset_features
215 , parent_level = 0, leaf_id = 0, parents = 'root', validation_df = validation_df, main_process_id = process_id)

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in buildDecisionTree(df, root, file, config, dataset_features, parent_level, leaf_id, parents, tree_id, validation_df, main_process_id)
531 else: #serial
532 for input_param in input_params:
--> 533 sub_results = createBranchWrapper(createBranch, input_param)
534 for sub_result in sub_results:
535 results.append(sub_result)

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in createBranchWrapper(func, args)
207
208 def createBranchWrapper(func, args):
--> 209 return func(*args)
210
211 def createBranch(config, current_class, subdataset, numericColumn, branch_index, winner_name, winner_index, root, parents, file, dataset_features, num_of_instances, metric, tree_id = 0, main_process_id = None):

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in createBranch(config, current_class, subdataset, numericColumn, branch_index, winner_name, winner_index, root, parents, file, dataset_features, num_of_instances, metric, tree_id, main_process_id)
328 parents = copy.copy(leaf_id)
329
--> 330 results = buildDecisionTree(subdataset, root, file, config, dataset_features
331 , root-1, leaf_id, parents, tree_id = tree_id, main_process_id = main_process_id)
332

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in buildDecisionTree(df, root, file, config, dataset_features, parent_level, leaf_id, parents, tree_id, validation_df, main_process_id)
519
520 for f in funclist:
--> 521 branch_results = f.get(timeout = 100000)
522
523 for branch_result in branch_results:

~\Anaconda3\lib\multiprocessing\pool.py in get(self, timeout)
769 return self._value
770 else:
--> 771 raise self._value
772
773 def _set(self, i, obj):

UnboundLocalError: local variable 'subdataset' referenced before assignment
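For readers unfamiliar with this error, the sketch below reproduces the general Python pattern behind an `UnboundLocalError`: a local variable that is only assigned inside a loop, which then runs zero times. It is an illustration with made-up function names and data, not chefboost's actual code, and whether this is what happens in the CHAID case above is an assumption.

```python
from collections import Counter


def majority_decision_buggy(rows, branch_values):
    for value in branch_values:
        subset = [row for row in rows if row["feature"] == value]
    # `subset` is only bound inside the loop; with no branch values it never exists.
    return Counter(row["Decision"] for row in subset).most_common(1)[0][0]


def majority_decision_safe(rows, branch_values):
    subset = rows  # fall back to the parent data when no branch was created
    for value in branch_values:
        subset = [row for row in rows if row["feature"] == value]
    return Counter(row["Decision"] for row in subset).most_common(1)[0][0]


rows = [
    {"feature": "x", "Decision": "Yes"},
    {"feature": "x", "Decision": "Yes"},
    {"feature": "y", "Decision": "No"},
]

try:
    majority_decision_buggy(rows, [])
    raised = False
except UnboundLocalError:
    raised = True

fallback = majority_decision_safe(rows, [])
print(raised, fallback)
```

Giving the variable a default before the loop is one simple way to keep the later lookup safe; which default is correct depends on what the tree builder should do with an empty split.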
