
serengil / chefboost


A Lightweight Decision Tree Framework supporting regular algorithms: ID3, C4.5, CART, CHAID and Regression Trees; some advanced techniques: Gradient Boosting, Random Forest and Adaboost w/categorical features support for Python

Home Page: https://www.youtube.com/watch?v=Z93qE5eb6eg&list=PLsS_1RYmYQQHp_xZObt76dpacY543GrJD&index=3

License: MIT License

Python 99.66% Makefile 0.10% Shell 0.23%
decision-trees gradient-boosting gradient-boosting-machine random-forest adaboost id3 c45-trees cart regression-tree gbm

chefboost's People

Contributors

anapaulamendes, jannisbush, nurettin, serengil


chefboost's Issues

Python 3.12 issue (no imp module)

When I try chefboost with Python 3.12, the import fails because there is no imp module.


..../lib/python3.12/site-packages/chefboost/Chefboost.py", line 5, in <module>
    import imp
ModuleNotFoundError: No module named 'imp'
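
The imp module was removed from the standard library in Python 3.12, and importlib is the documented replacement. A minimal sketch of loading a generated rules file by path with importlib follows. The helper name load_module_from_path and the temporary rules file are assumptions for illustration, not chefboost's actual code.

```python
import importlib.util
import os
import tempfile


def load_module_from_path(module_name, file_path):
    """Load a Python source file as a module without the removed imp module."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load module {module_name!r} from {file_path!r}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Hypothetical stand-in for a generated rules file.
source = "def findDecision(obj):\n    return 'Yes' if obj[0] == 'Overcast' else 'No'\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    rules_path = os.path.join(tmp_dir, "rules.py")
    with open(rules_path, "w") as handle:
        handle.write(source)
    rules = load_module_from_path("rules", rules_path)

print(rules.findDecision(["Overcast"]))
```

Until the library itself stops importing imp, running it on Python 3.11 or earlier avoids the error.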

Any Tree Traversal API or Example?

I am interested in plotting chef trees, particularly the decision path for a sample.

A generic traversal iterator would allow users to dump rules in different formats or create various plots with networkx/pygraphviz/matplotlib/dtreeviz/treeinterpreter, e.g. https://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree/39772170

  1. Is there an example of a DFS/BFS generator for traversing the nodes?

e.g. sklearn DFS via structure & decision path
https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html

e.g. sklearn BFS generator
https://stackoverflow.com/questions/61203080/traversal-of-sklearn-decision-tree

  2. Does chef have anything like decision_path() in scikit-learn?

decision_path()
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.decision_path

I don't mind adding this; I am looking for a guide to the internals of chefboost. Is reconstructRules the closest thing to a traversal?

Reference
#20
#2
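
To make the request concrete, here is a rough sketch of the kind of traversal API being asked for. It works on a hypothetical nested-dict tree; the structure, the sample tree, and the names iter_dfs, iter_bfs and decision_path are all assumptions for illustration and do not reflect chefboost's internal format.

```python
from collections import deque

# Hypothetical nested-dict tree; this is NOT chefboost's internal format.
tree = {
    "feature": "Outlook",
    "children": {
        "Sunny": {
            "feature": "Humidity",
            "children": {"High": {"label": "No"}, "Normal": {"label": "Yes"}},
        },
        "Overcast": {"label": "Yes"},
        "Rain": {
            "feature": "Wind",
            "children": {"Weak": {"label": "Yes"}, "Strong": {"label": "No"}},
        },
    },
}


def iter_dfs(node, path=()):
    """Yield (path, node) pairs in depth-first pre-order."""
    yield path, node
    for value, child in node.get("children", {}).items():
        yield from iter_dfs(child, path + ((node["feature"], value),))


def iter_bfs(root):
    """Yield (path, node) pairs level by level."""
    queue = deque([((), root)])
    while queue:
        path, node = queue.popleft()
        yield path, node
        for value, child in node.get("children", {}).items():
            queue.append((path + ((node["feature"], value),), child))


def decision_path(root, sample):
    """Return the (feature, value) tests a sample passes and the leaf label."""
    node, path = root, []
    while "label" not in node:
        value = sample[node["feature"]]
        path.append((node["feature"], value))
        node = node["children"][value]
    return path, node["label"]


print(decision_path(tree, {"Outlook": "Sunny", "Humidity": "High"}))
```

Generators like these would let a caller feed nodes and edges into networkx or graphviz without the library committing to one plotting backend.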

findDecision incorrect?

I have a CSV with pre-calculated cosine distance between face embeddings of people images in my dataset like this:

       Person1     Person2  Idx1  Idx2  Distance Decision
0   Aaron Paul  Aaron Paul     0     1    0.3245      Yes
1   Aaron Paul  Aaron Paul     0     2    0.2281      Yes
2   Aaron Paul  Aaron Paul     0     3    0.4737      Yes
3   Aaron Paul  Aaron Paul     0     4    0.4103      Yes
4   Aaron Paul  Aaron Paul     0     5    0.3236      Yes
5   Aaron Paul  Aaron Paul     0     6    0.3270      Yes
6   Aaron Paul  Aaron Paul     0     7    0.4873      Yes
7   Aaron Paul  Aaron Paul     0     8    0.3988      Yes
8   Aaron Paul  Aaron Paul     1     2    0.2357      Yes
9   Aaron Paul  Aaron Paul     1     3    0.2613      Yes
10  Aaron Paul  Aaron Paul     1     4    0.3827      Yes
11  Aaron Paul  Aaron Paul     1     5    0.2221      Yes
12  Aaron Paul  Aaron Paul     1     6    0.2183      Yes
13  Aaron Paul  Aaron Paul     1     7    0.4568      Yes
14  Aaron Paul  Aaron Paul     1     8    0.2391      Yes
15  Aaron Paul  Aaron Paul     2     3    0.4439      Yes
16  Aaron Paul  Aaron Paul     2     4    0.4086      Yes
17  Aaron Paul  Aaron Paul     2     5    0.2592      Yes
18  Aaron Paul  Aaron Paul     2     6    0.2863      Yes
19  Aaron Paul  Aaron Paul     2     7    0.4588      Yes

I use this script to build the decision tree (findDecision):

import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
tqdm.pandas()

if __name__ == '__main__':
	##############################################################################
	# Read the CSV to determine the best threshold...
	df = pd.read_csv(R"\\10.15.20.109\e$\MODELS\ProtecFR\Model\faces2.csv", encoding='UTF8')
	print(df.head(20))

	df1 = df[df['Decision'] == "Yes"]['Distance'].copy()
	df2 = df[df['Decision'] == "No"]['Distance'].copy()
	print(f"Count Yes: {df1.count()}")
	print(f"Average Yes: {round(df1.mean(), 4)}")
	print(f"Std. deviation Yes: {round(df1.std(), 4)}")
	print(f"Min Yes: {round(df1.min(), 4)}")
	print(f"Max Yes: {round(df1.max(), 4)}")
	print(f"Mode Yes: {round(df1.mode()[0], 4)}")

	print(f"Count No: {df2.count()}")
	print(f"Average No: {round(df2.mean(), 4)}")
	print(f"Std. deviation No: {round(df2.std(), 4)}")
	print(f"Min No: {round(df2.min(), 4)}")
	print(f"Max No: {round(df2.max(), 4)}")
	print(f"Mode No: {round(df2.mode()[0], 4)}")

	df1.plot.kde()
	df2.plot.kde()
	plt.legend(["Yes", "No"])
	plt.grid()
	plt.axhline(0,color='red')
	plt.axvline(0,color='red')
	plt.show()

	from chefboost import Chefboost as chef
	config = {'algorithm': 'C4.5'}

	tmp_df = df[['Distance', 'Decision']].copy()
	model = chef.fit(tmp_df, config)
	print (model)

The results I get are:

Count Yes: 108285
Average Yes: 0.4496
Std. deviation Yes: 0.1557
Min Yes: 0.0
Max Yes: 1.0644
Mode Yes: 0.3465

Count No: 59793700
Average No: 0.7976
Std. deviation No: 0.1112
Min No: 0.0
Max No: 1.2973
Mode No: 0.8114

[INFO]:  8 CPU cores will be allocated in parallel running
C4.5  tree is going to be built...
-------------------------
finished in  135.35767483711243  seconds
-------------------------
Evaluate  train set
-------------------------
Accuracy:  99.81929981118321 % on  59901985  instances
Labels:  ['Yes' 'No']
Confusion matrix:  [[43, 1], [108242, 59793699]]
Precision:  97.7273 %, Recall:  0.0397 %, F1:  0.0794 %
{'trees': [<module 'outputs/rules/rules' from 'c:\\DESARROLLOS\\Python\\VID\\outputs/rules/rules.py'>], 'alphas': [], 'config': {'algorithm': 'C4.5', 'enableRandomForest': False, 'num_of_trees': 5, 'enableMultitasking': False, 'enableGBM': False, 'epochs': 10, 'learning_rate': 1, 'max_depth': 3, 'enableAdaboost': False, 'num_of_weak_classifier': 4, 'enableParallelism': True, 'num_cores': 8}, 'nan_values': [['Distance', None]]}

The plot is:

ArcFace-cosine

and outputs/rules/rules.py:

def findDecision(obj): #obj[0]: Distance
	# {"feature": "Distance", "instances": 59901985, "metric_value": 0.0191, "depth": 1}
	if obj[0]>0.0:
		return 'No'
	elif obj[0]<=0.0:
		return 'Yes'
	else: return 'Yes'

As you can see, it gives me a 0.0 threshold when it should be around 0.68.

Am I doing something wrong?

Regards
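
For comparison, a numeric split point can be checked independently of the library by scanning candidate thresholds and scoring each one by information gain. The following is a brute-force sketch on made-up data, not chefboost's implementation; the function names are hypothetical. C4.5 ranks splits by gain ratio rather than plain information gain, so its choice can differ from this one.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) maximizing information gain for value <= threshold."""
    pairs = list(zip(values, labels))
    total = len(pairs)
    base = entropy(labels)
    best_t, best_gain = None, -1.0
    # The largest value is skipped because it would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [label for value, label in pairs if value <= t]
        right = [label for value, label in pairs if value > t]
        gain = base - (len(left) / total) * entropy(left) - (len(right) / total) * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain


# Made-up distances; small enough for the quadratic scan above.
distances = [0.2, 0.3, 0.4, 0.7, 0.8, 0.9]
decisions = ["Yes", "Yes", "Yes", "No", "No", "No"]
threshold, gain = best_threshold(distances, decisions)
print(threshold, gain)
```

On a dataset of this size (about 60 million rows with a heavy class imbalance) the scan would need to run on a sample, since it re-partitions the data for every candidate.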

Parallelism does not seem to be working

Hi,

I've used the following code:

if __name__ == '__main__':
    config = {'algorithm': 'Regression', 'enableParallelism' : True, 'enableGBM': True, 'epochs': 10, 'learning_rate': 0.01}
    model = chef.fit(df_tree_train, config)

and when I check my CPU usage I see that only one core is being used. Why aren't all my cores being used?

Unreasonable training time when I make a simple change

So I am training a CHAID decision tree for multiclass classification, and the target variable is a string. Other than the target, I have 4 other features, two of which I want to be string type. When I train the model with only one feature as string, training takes about 15 minutes. But when I convert the other feature I wish to be treated as categorical to string, training takes forever (entire day and no result).

What could be causing this?

Q: Are feature engineering tools mixed in for BT, RF, and GB?

This scikit-learn tutorial shows that tree-based models often produce subpar results on data whose features are not orthogonal to one another.
Some tools address this by combining different columns of the dataset through "feature engineering".
Notable ones include the following libraries:

Q: can this feature builder be used to create classifiers in tests/outputs/rules with engineered features?

'Series' object has no attribute 'Decision'

When running the golf example:

df = pd.read_csv("data/golf.txt")
config = {'algorithm': 'C4.5'}
model = chef.fit(df, config = config, target_label = 'Decision')

I get the following error:

[INFO]:  10 CPU cores will be allocated in parallel running
C4.5  tree is going to be built...

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_28440\547795482.py in ?()
     10 import pandas as pd
     11 
     12 df = pd.read_csv("data/golf.txt")
     13 config = {'algorithm': 'C4.5'}
---> 14 model = chef.fit(df, config = config, target_label = 'Decision')

C:\Lib\site-packages\chefboost\Chefboost.py in ?(df, config, target_label, validation_df)
    209                 if enableParallelism == True:
    210                         json_file = "outputs/rules/rules.json"
    211                         functions.createFile(json_file, "[\n")
    212 
--> 213 		trees = Training.buildDecisionTree(df, root = root, file = file, config = config
    214                                 , dataset_features = dataset_features
    215 				, parent_level = 0, leaf_id = 0, parents = 'root', validation_df = validation_df, main_process_id = process_id)
    216 

C:\Lib\site-packages\chefboost\\chefboost\training\Training.py in ?(df, root, file, config, dataset_features, parent_level, leaf_id, parents, tree_id, validation_df, main_process_id)
    432                 pivot = pd.DataFrame(subdataset.Decision.value_counts()).reset_index()
    433                 pivot = pivot.rename(columns = {"Decision": "Instances","index": "Decision"})
    434                 pivot = pivot.sort_values(by = ["Instances"], ascending = False).reset_index()
    435 
--> 436                 else_decision = "return '%s'" % (pivot.iloc[0].Decision)
    437 
    438                 if enableParallelism != True:
    439                         functions.storeRule(file,(functions.formatRule(root), "else:"))

C:\Lib\site-packages\chefboost\Lib\site-packages\pandas\core\generic.py in ?(self, name)
   5985             and name not in self._accessors
   5986             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5987         ):
   5988             return self[name]
-> 5989         return object.__getattribute__(self, name)

AttributeError: 'Series' object has no attribute 'Decision'

I think someone ran into the same issue on Stack Overflow.
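
A likely cause, stated as an assumption rather than a confirmed diagnosis: since pandas 2.0, value_counts() names its result "count" and keeps the column name on the index, so reset_index() yields the columns Decision and count instead of index and Decision. The rename shown at line 433 of the traceback then turns Decision into Instances and leaves no Decision column. The sketch below builds the same pivot table without relying on those default names; the sample data is made up.

```python
import pandas as pd

# Made-up stand-in for the subdataset seen in the traceback.
subdataset = pd.DataFrame({"Decision": ["Yes", "No", "Yes", "Yes", "No"]})

# Name both columns explicitly instead of relying on reset_index() defaults,
# which differ between pandas 1.x and 2.x.
counts = subdataset["Decision"].value_counts()
pivot = pd.DataFrame({"Decision": counts.index.tolist(), "Instances": counts.tolist()})
pivot = pivot.sort_values(by="Instances", ascending=False).reset_index(drop=True)

else_decision = "return '%s'" % pivot.iloc[0]["Decision"]
print(else_decision)
```

If the assumption holds, pinning pandas below 2.0 or upgrading chefboost to a release that accounts for the change would be the practical workarounds.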

CHAID model always returns 0 accuracy

I am using chefboost for the CHAID algorithm. The dataset contains 10000 rows and 7 columns, and fit always reports 0 accuracy. What could be causing this? Can you help me?
I also want to visualize the tree as a graph. How can I do this?

config={"algorithm":"CHAID",'enableParallelism': False}
model=cb.fit(df.loc[:10000,independent_variable_columns],config,target_label='Decision')

CHAID tree is going to be built...

finished in 6.883694887161255 seconds

Evaluate train set

Accuracy: 0.0 % on 10001 instances
Labels: [0 1]
Confusion matrix: [[0, 0], [0, 0]]
Precision: 0.0 %, Recall: 0.0 %, F1: 0.0 %

How to visualize

Thanks for chefboost, it helps me a lot, but I want to know how to visualize the decision tree built by chefboost, or how to find the number of leaves.

feature_importance incorrect?

When I checked the function feature_importance(rules) in Chefboost.py, I found that the child_score of a node is calculated through "if child_rule["depth"] == current_depth + 1:". I don't know if I misunderstand the meaning of child_score, but I think the children of a node are not simply all nodes at the next depth.

Here is some of my data (WBC appears only in this section):

{"feature": "WBC", "instances": 21, "metric_value": 0.7025, "depth": 5}
if obj[24]<=13.55:
	{"feature": "HGB", "instances": 20, "metric_value": 0.6098, "depth": 6}
	if obj[26]<=65.0:
		{"feature": "Infusion volume", "instances": 19, "metric_value": 0.4855, "depth": 7}
		if obj[58]>500.0:
			{"feature": "MV_A", "instances": 18, "metric_value": 0.3095, "depth": 8}
			................
			...............
		elif obj[58]<=500.0:
			return 'yes'
		else: return 'yes'
	elif obj[26]>65.0:
		return 'yes'
	else: return 'yes'
elif obj[24]>13.55:
	return 'yes'
else: return 'yes'

I think the feature_importance of WBC (before normalization) should be calculated like this:
WBC = 21 * 0.7025 - 20 * 0.6098

But in fact Chefboost calculates it like this:
{"feature": "WBC", "instances": 21, "metric_value": 0.7025, "depth": 5}
child: {'feature': 'HGB', 'instances': 667, 'metric_value': 0.0295, 'depth': 6}
child: {'feature': 'Infusion volume', 'instances': 16, 'metric_value': 0.3373, 'depth': 6}
child: {'feature': 'HGB', 'instances': 20, 'metric_value': 0.6098, 'depth': 6}
score: -22.516799999999996

WBC = 21 * 0.7025 - 667 * 0.0295 - 16 * 0.3373 - 20 * 0.6098 = -22.51679 (even a negative value)

As you can see, HGB (667 instances) and Infusion volume (16 instances) are counted as children in the calculation, so I wonder which one is right.
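
The calculation proposed above, counting only the direct children of a node, can be sketched as follows. It assumes the rule metadata is listed in depth-first order, as it appears in rules.py; the position of the 667-instance HGB entry in the list and the function names are assumptions for illustration, not chefboost's code.

```python
def direct_children(rules, index):
    """Return the rules that are direct children of rules[index].

    Assumes rules are in depth-first order: a node's subtree ends at the
    next entry whose depth is less than or equal to the node's own depth.
    """
    parent_depth = rules[index]["depth"]
    children = []
    for rule in rules[index + 1:]:
        if rule["depth"] <= parent_depth:
            break  # left the parent's subtree
        if rule["depth"] == parent_depth + 1:
            children.append(rule)
    return children


def raw_importance(rules, index):
    """Weighted metric of a node minus the weighted metrics of its direct children."""
    node = rules[index]
    score = node["instances"] * node["metric_value"]
    for child in direct_children(rules, index):
        score -= child["instances"] * child["metric_value"]
    return score


rules = [
    # Same depth as WBC's child, but assumed to belong to a different branch.
    {"feature": "HGB", "instances": 667, "metric_value": 0.0295, "depth": 6},
    {"feature": "WBC", "instances": 21, "metric_value": 0.7025, "depth": 5},
    {"feature": "HGB", "instances": 20, "metric_value": 0.6098, "depth": 6},
    {"feature": "Infusion volume", "instances": 19, "metric_value": 0.4855, "depth": 7},
    {"feature": "MV_A", "instances": 18, "metric_value": 0.3095, "depth": 8},
]

wbc_score = raw_importance(rules, 1)
print(round(wbc_score, 4))
```

With this grouping the WBC score equals 21 * 0.7025 - 20 * 0.6098, and the unrelated depth-6 nodes no longer contribute.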

Chefboost result is not returned

I used this library as shown in the documentation, with the same dataset, but no result was returned.
It only shows the text below and the process never completes.

from chefboost import Chefboost as chef
config = {'algorithm': 'CHAID'}
model = chef.fit(df, config)

[INFO]: 40 CPU cores will be allocated in parallel running
CHAID tree is going to be built...

Python version 3.8

What about plotting?

That.

What about plotting, dear friend? You have done awesome work, but this great functionality is still missing.

I'm new to Python. Are the chefboost outputs ready to be plotted, for example to show the tree as a picture?

Thanks in advance!

Getting None as predicted values

I am getting None as a predicted output, what would be the reason for it?

environment:
pandas==0.25.1
numpy==1.17.2
tqdm==4.36.1
Python 3.7.4

train data
test data

code:
chefboost_c45.txt
(unable to attach a .py file because GitHub doesn't allow it, hence the .txt)

output:
C4.5 tree is going to be built...
Accuracy: 79.16666666666667 % on 24 instances
finished in 0.41808056831359863 seconds
Win
Win
Win
None
Win
Win
Win
Win
Win
Lose
Win
Lose

Also, does the chefboost have support to get precision, recall, and f1 score?
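
Training logs quoted in other issues on this page do print precision, recall and F1. Independently of the library, the three scores can be computed from lists of actual and predicted labels. The sketch below is a hand-rolled version on made-up labels; the function name is hypothetical, and a None prediction is simply counted as "not the positive class".

```python
def precision_recall_f1(actual, predicted, positive):
    """Compute precision, recall and F1 for one class treated as positive."""
    tp = sum(1 for a, p in zip(actual, predicted) if p == positive and a == positive)
    fp = sum(1 for a, p in zip(actual, predicted) if p == positive and a != positive)
    fn = sum(1 for a, p in zip(actual, predicted) if p != positive and a == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels; the None mimics a prediction that fell through every rule.
actual = ["Win", "Win", "Lose", "Win", "Lose"]
predicted = ["Win", None, "Lose", "Win", "Win"]

precision, recall, f1 = precision_recall_f1(actual, predicted, positive="Win")
print(precision, recall, f1)
```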

classification returns irrelevant results in else conditions

For the configuration

config = {
        "algorithm": "ID3",
        # "enableRandomForest": True,
        # "num_of_trees": 3,
    }

I am getting the following tree for car.data dataset.

def findDecision(obj): #obj[0]: buying, obj[1]: maint, obj[2]: doors, obj[3]: persons, obj[4]: lug_boot, obj[5]: safety
	# {"feature": "safety", "instances": 1728, "metric_value": 1.2057, "depth": 1}
	if obj[5] == 'low':
		return 'unacc'
	elif obj[5] == 'med':
		# {"feature": "persons", "instances": 576, "metric_value": 1.2152, "depth": 2}
		if obj[3] == '2':
			return 'unacc'
		elif obj[3] == '4':
			# {"feature": "buying", "instances": 192, "metric_value": 1.3543, "depth": 3}
			if obj[0] == 'vhigh':
				# {"feature": "maint", "instances": 48, "metric_value": 0.8113, "depth": 4}
				if obj[1] == 'vhigh':
					return 'unacc'
				elif obj[1] == 'high':
					return 'unacc'
				elif obj[1] == 'med':
					# {"feature": "lug_boot", "instances": 12, "metric_value": 1.0, "depth": 5}
					if obj[4] == 'small':
						return 'unacc'
					elif obj[4] == 'med':
						return 'unacc'
					elif obj[4] == 'big':
						return 'acc'
					else: return '4'
				elif obj[1] == 'low':
					# {"feature": "lug_boot", "instances": 12, "metric_value": 1.0, "depth": 5}
					if obj[4] == 'small':
						return 'unacc'
					elif obj[4] == 'med':
						return 'unacc'
					elif obj[4] == 'big':
						return 'acc'
					else: return '4'
				else: return '6'

As seen, results should be nominal but in else conditions it is returning numbers somehow.

spawn makes it unable to run on Linux

I guess it is because of this line in Chefboost.py: set_start_method("spawn", force=True)

I'm on Linux, and I'm unable to run chef.fit both in Jupyter and in a main (if __name__ == '__main__':) unless I disable the parallelism (enableParallelism: False)

Error for model code

I followed the instructions in the README, but I encounter an error when running the model code. Why might this be happening?
My Python version is 3.11.4.

24-04-12 13:57:40 - [INFO]: 16 CPU cores will be allocated in parallel running
24-04-12 13:57:40 - C4.5 tree is going to be built...

ImportError Traceback (most recent call last)
Cell In[4], line 1
----> 1 model = chef.fit(df, config = config, target_label = 'Decision')

File c:\Users\heaop\Documents\GitHub\chefboost\Chefboost.py:275, in fit(df, config, target_label, validation_df, silent)
272 json_file = "outputs/rules/rules.json"
273 functions.createFile(json_file, "[\n")
--> 275 trees = Training.buildDecisionTree(
276 df,
277 root=root,
278 file=file,
279 config=config,
280 dataset_features=dataset_features,
281 parent_level=0,
282 leaf_id=0,
283 parents="root",
284 validation_df=validation_df,
285 main_process_id=process_id,
286 )
288 if silent is False:
289 logger.info("-------------------------")

File c:\Users\heaop\Documents\GitHub\chefboost\training\Training.py:712, in buildDecisionTree(df, root, file, config, dataset_features, parent_level, leaf_id, parents, tree_id, validation_df, main_process_id)
703 if (
...
---> 20 raise ImportError(f"Module '{module_name}' not found")
22 module = importlib.util.module_from_spec(spec)
23 spec.loader.exec_module(module)

ImportError: Module 'outputs/rules/rules' not found

Target label type

Is it true that the Decision column of the input training dataset should be of string type?
I tried feeding an integer array at first and got 0 accuracy, but converting it to a string array works.
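
The workaround described above, converting the target column to strings before fitting, can be sketched with pandas as follows. The column values are made up, and the fit call is left as a comment because it needs chefboost installed; its arguments follow the calls quoted in other issues on this page.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up data with integer class labels in the target column.
df = pd.DataFrame(
    {"Outlook": ["Sunny", "Rain", "Overcast"], "Decision": [0, 1, 1]}
)
was_numeric = is_numeric_dtype(df["Decision"])

# Convert the labels to strings so they are treated as nominal classes.
df["Decision"] = df["Decision"].astype(str)
is_numeric_now = is_numeric_dtype(df["Decision"])

print(was_numeric, is_numeric_now)

# from chefboost import Chefboost as chef
# model = chef.fit(df, config={"algorithm": "C4.5"}, target_label="Decision")
```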

'numpy.float32' object has no attribute 'is_integer'

Tried to do the following on a dataset with float samples. (Running on Python 3.7)

configGBM = {'algorithm': 'C4.5', 'enableGBM': True, 'epochs': 7, 'learning_rate': 1, 'max_depth': 5}
modelGBM = chef.fit(train, config = configGBM)

Error Log:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/vk/rw3fbc110n3fsf_xhz6_r4m00000gn/T/ipykernel_67628/3037199772.py in <module>
      1 configGBM = {'algorithm': 'C4.5', 'enableGBM': True, 'epochs': 7, 'learning_rate': 1, 'max_depth': 5}
----> 2 modelGBM = chef.fit(train, config = configGBM)
    
/usr/local/lib/python3.7/site-packages/chefboost/Chefboost.py in fit(df, config, target_label, validation_df)
    190 
    191                 if df['Decision'].dtypes == 'object': #transform classification problem to regression
--> 192                         trees, alphas = gbm.classifier(df, config, header, dataset_features, validation_df = validation_df, process_id = process_id)
    193                         classification = True
    194 

/usr/local/lib/python3.7/site-packages/chefboost/tuning/gbm.py in classifier(df, config, header, dataset_features, validation_df, process_id)
    270                                 instance['P_'+str(j)] = probabilities[j]
    271 
--> 272                         worksheet.loc[row] = instance
    273 
    274                 for i in range(0, len(classes)):

/usr/local/lib/python3.7/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
    721 
    722         iloc = self if self.name == "iloc" else self.obj.iloc
--> 723         iloc._setitem_with_indexer(indexer, value, self.name)
    724 
    725     def _validate_key(self, key, axis: int):

/usr/local/lib/python3.7/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value, name)
   1728         if take_split_path:
   1729             # We have to operate column-wise
-> 1730             self._setitem_with_indexer_split_path(indexer, value, name)
   1731         else:
   1732             self._setitem_single_block(indexer, value, name)

/usr/local/lib/python3.7/site-packages/pandas/core/indexing.py in _setitem_with_indexer_split_path(self, indexer, value, name)
   1795                 # We are setting multiple columns in a single row.
   1796                 for loc, v in zip(ilocs, value):
-> 1797                     self._setitem_single_column(loc, v, pi)
   1798 
   1799             elif len(ilocs) == 1 and com.is_null_slice(pi) and len(self.obj) == 0:

/usr/local/lib/python3.7/site-packages/pandas/core/indexing.py in _setitem_single_column(self, loc, value, plane_indexer)
   1918             # set the item, possibly having a dtype change
   1919             ser = ser.copy()
-> 1920             ser._mgr = ser._mgr.setitem(indexer=(pi,), value=value)
   1921             ser._maybe_update_cacher(clear=True)
   1922 

/usr/local/lib/python3.7/site-packages/pandas/core/internals/managers.py in setitem(self, indexer, value)
    353 
    354     def setitem(self: T, indexer, value) -> T:
--> 355         return self.apply("setitem", indexer=indexer, value=value)
    356 
    357     def putmask(self, mask, new, align: bool = True):

/usr/local/lib/python3.7/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
    325                     applied = b.apply(f, **kwargs)
    326                 else:
--> 327                     applied = getattr(b, f)(**kwargs)
    328             except (TypeError, NotImplementedError):
    329                 if not ignore_failures:

/usr/local/lib/python3.7/site-packages/pandas/core/internals/blocks.py in setitem(self, indexer, value)
    924         # coerce if block dtype can store value
    925         values = self.values
--> 926         if not self._can_hold_element(value):
    927             # current dtype cannot store value, coerce to common dtype
    928             return self.coerce_to_target_dtype(value).setitem(indexer, value)

/usr/local/lib/python3.7/site-packages/pandas/core/internals/blocks.py in _can_hold_element(self, element)
    620         """require the same dtype as ourselves"""
    621         element = extract_array(element, extract_numpy=True)
--> 622         return can_hold_element(self.values, element)
    623 
    624     @final

/usr/local/lib/python3.7/site-packages/pandas/core/dtypes/cast.py in can_hold_element(arr, element)
   2181         if tipo is not None:
   2182             if tipo.kind not in ["i", "u"]:
-> 2183                 if is_float(element) and element.is_integer():
   2184                     return True
   2185                 # Anything other than integer we cannot hold

AttributeError: 'numpy.float32' object has no attribute 'is_integer'

Indentation error

While using C4.5, I get an indentation error:

config = {'algorithm': 'C4.5'}
model = chef.fit(df, config = config, target_label = 'party')

[INFO]: 1 CPU cores will be allocated in parallel running
C4.5 tree is going to be built...
File "outputs/rules/rules.py", line 37
else: return 'no'
^
IndentationError: expected an indented block

Cannot install it

Hello!

I am trying to install chefboost in Windows without any success...

image

max_depth parameter not working

The max_depth parameter seems to not be working. The fit function fits a tree with maximal possible depth regardless of setting.

Parallelism does not work properly

Hi,

I'm testing the library using the following code snippet:

config = {
'algorithm': 'C4.5',
#'enableParallelism': True, 'num_cores': 32,
}
model = chef.fit(df, config = config)

Which prints: "finished in 9.606534719467163 seconds"

Then, when I enable parallelism by uncommenting the line in the config, it never finishes, yet it uses 100% of the CPU for a really long time, much more than 10 seconds.

findDecision(obj) and accuracy giving weird results

I try to run this code:

import Chefboost as chef
import pandas as pd

if __name__ == "__main__":
    df = pd.read_csv("dataset/golf.txt")
    
    config = {'algorithm': 'C4.5'}
    model = chef.fit(df, config)

Then, when I check outputs/rules/rules.py, this is what I get:

def findDecision(obj): #obj[0]: Outlook, obj[1]: Temp., obj[2]: Humidity, obj[3]: Wind
   if obj[2] == 'Rain':
      if obj[0] == 'Weak':
         return 'Yes'
      elif obj[0] == 'Strong':
         return 'No'
   elif obj[2] == 'Sunny':
      if obj[1] == 'High':
         return 'No'
      elif obj[1] == 'Normal':
         return 'Yes'
   elif obj[2] == 'Overcast':
      return 'Yes'

Here obj[0] isn't Outlook but Wind. Also, sometimes I get 0% accuracy after running the code 2 or 3 times.

Getting KeyError: 'Decision'

I am trying to find the gain for SepalLengthCm in the Iris dataset, following this.

config = {'algorithm': 'ID3'}
sorted(df['SepalLengthCm'].unique())

threshold = 6.0
idx = df[df['SepalLengthCm'] <= threshold].index
tmp_df = df.copy()
tmp_df['SepalLengthCm'] = '>'+str(threshold)
tmp_df.loc[idx, 'SepalLengthCm'] = '<='+str(threshold)

gain = Training.findGains(tmp_df, config)
print(threshold, ': ', gain)

Also

df = iris[['SepalLengthCm', 'y']]

When running this I get the following error

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File D:\Files\dev\misc\py_venvs\cvip\lib\site-packages\pandas\core\indexes\base.py:3621, in Index.get_loc(self, key, method, tolerance)
   3620 try:
-> 3621     return self._engine.get_loc(casted_key)
   3622 except KeyError as err:

File D:\Files\dev\misc\py_venvs\cvip\lib\site-packages\pandas\_libs\index.pyx:136, in pandas._libs.index.IndexEngine.get_loc()

File D:\Files\dev\misc\py_venvs\cvip\lib\site-packages\pandas\_libs\index.pyx:163, in pandas._libs.index.IndexEngine.get_loc()

File pandas\_libs\hashtable_class_helper.pxi:5198, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas\_libs\hashtable_class_helper.pxi:5206, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Decision'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Input In [10], in <cell line: 7>()
      4 tmp_df['SepalLengthCm'] = '>'+str(threshold)
      5 tmp_df.loc[idx, 'SepalLengthCm'] = '<='+str(threshold)
----> 7 gain = Training.findGains(tmp_df, config)
      8 print(threshold, ': ', gain)

File D:\Files\dev\misc\py_venvs\cvip\lib\site-packages\chefboost\training\Training.py:107, in findGains(df, config)
    104 def findGains(df, config):
    106 	algorithm = config['algorithm']
--> 107 	decision_classes = df["Decision"].unique()
    109 	#-----------------------------
    111 	entropy = 0

File D:\Files\dev\misc\py_venvs\cvip\lib\site-packages\pandas\core\frame.py:3505, in DataFrame.__getitem__(self, key)
   3503 if self.columns.nlevels > 1:
   3504     return self._getitem_multilevel(key)
-> 3505 indexer = self.columns.get_loc(key)
   3506 if is_integer(indexer):
   3507     indexer = [indexer]

File D:\Files\dev\misc\py_venvs\cvip\lib\site-packages\pandas\core\indexes\base.py:3623, in Index.get_loc(self, key, method, tolerance)
   3621     return self._engine.get_loc(casted_key)
   3622 except KeyError as err:
-> 3623     raise KeyError(key) from err
   3624 except TypeError:
   3625     # If we have a listlike key, _check_indexing_error will raise
   3626     #  InvalidIndexError. Otherwise we fall through and re-raise
   3627     #  the TypeError.
   3628     self._check_indexing_error(key)

KeyError: 'Decision'

Error while fitting the model

Following is the dataframe:
image
and following is the additional code:

df.rename(columns={'result': 'Decision'}, inplace=True)

Output:

Index(['Date', 'Country', 'League', 'Season', 'HomeTeam', 'AwayTeam',
       'home_goal', 'away_goal', 'Decision'],
      dtype='object')
config = {"algorithm" : "C4.5"}
model = chef.fit(df, config,  target_label = "Decision")

I am getting error:

[INFO]:  4 CPU cores will be allocated in parallel running
C4.5  tree is going to be built...
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_22413/130574452.py in ?()
----> 1 model = chef.fit(df, config,  target_label = "Decision")

~/anaconda3/envs/rover/lib/python3.10/site-packages/chefboost/Chefboost.py in ?(df, config, target_label, validation_df)
    209                 if enableParallelism == True:
    210                         json_file = "outputs/rules/rules.json"
    211                         functions.createFile(json_file, "[\n")
    212 
--> 213 		trees = Training.buildDecisionTree(df, root = root, file = file, config = config
    214                                 , dataset_features = dataset_features
    215 				, parent_level = 0, leaf_id = 0, parents = 'root', validation_df = validation_df, main_process_id = process_id)
    216 

~/anaconda3/envs/rover/lib/python3.10/site-packages/chefboost/training/Training.py in ?(df, root, file, config, dataset_features, parent_level, leaf_id, parents, tree_id, validation_df, main_process_id)
    432                 pivot = pd.DataFrame(subdataset.Decision.value_counts()).reset_index()
    433                 pivot = pivot.rename(columns = {"Decision": "Instances","index": "Decision"})
    434                 pivot = pivot.sort_values(by = ["Instances"], ascending = False).reset_index()
    435 
--> 436                 else_decision = "return '%s'" % (pivot.iloc[0].Decision)
    437 
    438                 if enableParallelism != True:
    439                         functions.storeRule(file,(functions.formatRule(root), "else:"))

~/anaconda3/envs/rover/lib/python3.10/site-packages/pandas/core/generic.py in ?(self, name)
   6200             and name not in self._accessors
   6201             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   6202         ):
   6203             return self[name]
-> 6204         return object.__getattribute__(self, name)

AttributeError: 'Series' object has no attribute 'Decision'

It doesn't matter whether I rename the column or not; it always throws this error.

Only Regression Tree is Built

No matter what configuration I give, it seems to always default to building a Regression Tree.
I've tried putting in a non-existent value as well, but there are no issues there as it just throws an error.

Code:

from chefboost import Chefboost as cb
import pandas as pd

df = pd.read_csv("~/Downloads/kmodes_fillna_cluster.csv")

config = {'algorithm': 'C4.5'}
model = cb.fit(df, config)

Output:

Regression  tree is going to be built...
MAE:  0.10815602836879432
RMSE:  0.2325467999874373
Mean:  0.2872340425531915
MAE / Mean:  37.654320987654316 %
RMSE / Mean:  80.96073777340409 %
finished in  1.4060499668121338  seconds

UnboundLocalError: local variable 'subdataset' referenced before assignment

Hi,

I am facing the error below while running CHAID. With ID3, however, it runs successfully.

[INFO]: 4 CPU cores will be allocated in parallel running
CHAID tree is going to be built...

RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\Users\Rahul.Chandel\Anaconda3\lib\multiprocessing\pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 209, in createBranchWrapper
return func(*args)
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 330, in createBranch
results = buildDecisionTree(subdataset, root, file, config, dataset_features
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 533, in buildDecisionTree
sub_results = createBranchWrapper(createBranch, input_param)
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 209, in createBranchWrapper
return func(*args)
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 330, in createBranch
results = buildDecisionTree(subdataset, root, file, config, dataset_features
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 533, in buildDecisionTree
sub_results = createBranchWrapper(createBranch, input_param)
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 209, in createBranchWrapper
return func(*args)
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 330, in createBranch
results = buildDecisionTree(subdataset, root, file, config, dataset_features
File "C:\Users\Rahul.Chandel\Anaconda3\lib\site-packages\chefboost\training\Training.py", line 432, in buildDecisionTree
pivot = pd.DataFrame(subdataset.Decision.value_counts()).reset_index()
UnboundLocalError: local variable 'subdataset' referenced before assignment
"""

The above exception was the direct cause of the following exception:

UnboundLocalError Traceback (most recent call last)
in
7
8 df = df.drop('fill_ratio', axis =1)
----> 9 model = cb.fit(df, config = config)

~\Anaconda3\lib\site-packages\chefboost\Chefboost.py in fit(df, config, target_label, validation_df)
211 functions.createFile(json_file, "[\n")
212
--> 213 trees = Training.buildDecisionTree(df, root = root, file = file, config = config
214 , dataset_features = dataset_features
215 , parent_level = 0, leaf_id = 0, parents = 'root', validation_df = validation_df, main_process_id = process_id)

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in buildDecisionTree(df, root, file, config, dataset_features, parent_level, leaf_id, parents, tree_id, validation_df, main_process_id)
531 else: #serial
532 for input_param in input_params:
--> 533 sub_results = createBranchWrapper(createBranch, input_param)
534 for sub_result in sub_results:
535 results.append(sub_result)

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in createBranchWrapper(func, args)
207
208 def createBranchWrapper(func, args):
--> 209 return func(*args)
210
211 def createBranch(config, current_class, subdataset, numericColumn, branch_index, winner_name, winner_index, root, parents, file, dataset_features, num_of_instances, metric, tree_id = 0, main_process_id = None):

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in createBranch(config, current_class, subdataset, numericColumn, branch_index, winner_name, winner_index, root, parents, file, dataset_features, num_of_instances, metric, tree_id, main_process_id)
328 parents = copy.copy(leaf_id)
329
--> 330 results = buildDecisionTree(subdataset, root, file, config, dataset_features
331 , root-1, leaf_id, parents, tree_id = tree_id, main_process_id = main_process_id)
332

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in buildDecisionTree(df, root, file, config, dataset_features, parent_level, leaf_id, parents, tree_id, validation_df, main_process_id)
531 else: #serial
532 for input_param in input_params:
--> 533 sub_results = createBranchWrapper(createBranch, input_param)
534 for sub_result in sub_results:
535 results.append(sub_result)

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in createBranchWrapper(func, args)
207
208 def createBranchWrapper(func, args):
--> 209 return func(*args)
210
211 def createBranch(config, current_class, subdataset, numericColumn, branch_index, winner_name, winner_index, root, parents, file, dataset_features, num_of_instances, metric, tree_id = 0, main_process_id = None):

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in createBranch(config, current_class, subdataset, numericColumn, branch_index, winner_name, winner_index, root, parents, file, dataset_features, num_of_instances, metric, tree_id, main_process_id)
328 parents = copy.copy(leaf_id)
329
--> 330 results = buildDecisionTree(subdataset, root, file, config, dataset_features
331 , root-1, leaf_id, parents, tree_id = tree_id, main_process_id = main_process_id)
332

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in buildDecisionTree(df, root, file, config, dataset_features, parent_level, leaf_id, parents, tree_id, validation_df, main_process_id)
531 else: #serial
532 for input_param in input_params:
--> 533 sub_results = createBranchWrapper(createBranch, input_param)
534 for sub_result in sub_results:
535 results.append(sub_result)

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in createBranchWrapper(func, args)
207
208 def createBranchWrapper(func, args):
--> 209 return func(*args)
210
211 def createBranch(config, current_class, subdataset, numericColumn, branch_index, winner_name, winner_index, root, parents, file, dataset_features, num_of_instances, metric, tree_id = 0, main_process_id = None):

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in createBranch(config, current_class, subdataset, numericColumn, branch_index, winner_name, winner_index, root, parents, file, dataset_features, num_of_instances, metric, tree_id, main_process_id)
328 parents = copy.copy(leaf_id)
329
--> 330 results = buildDecisionTree(subdataset, root, file, config, dataset_features
331 , root-1, leaf_id, parents, tree_id = tree_id, main_process_id = main_process_id)
332

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in buildDecisionTree(df, root, file, config, dataset_features, parent_level, leaf_id, parents, tree_id, validation_df, main_process_id)
531 else: #serial
532 for input_param in input_params:
--> 533 sub_results = createBranchWrapper(createBranch, input_param)
534 for sub_result in sub_results:
535 results.append(sub_result)

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in createBranchWrapper(func, args)
207
208 def createBranchWrapper(func, args):
--> 209 return func(*args)
210
211 def createBranch(config, current_class, subdataset, numericColumn, branch_index, winner_name, winner_index, root, parents, file, dataset_features, num_of_instances, metric, tree_id = 0, main_process_id = None):

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in createBranch(config, current_class, subdataset, numericColumn, branch_index, winner_name, winner_index, root, parents, file, dataset_features, num_of_instances, metric, tree_id, main_process_id)
328 parents = copy.copy(leaf_id)
329
--> 330 results = buildDecisionTree(subdataset, root, file, config, dataset_features
331 , root-1, leaf_id, parents, tree_id = tree_id, main_process_id = main_process_id)
332

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in buildDecisionTree(df, root, file, config, dataset_features, parent_level, leaf_id, parents, tree_id, validation_df, main_process_id)
531 else: #serial
532 for input_param in input_params:
--> 533 sub_results = createBranchWrapper(createBranch, input_param)
534 for sub_result in sub_results:
535 results.append(sub_result)

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in createBranchWrapper(func, args)
207
208 def createBranchWrapper(func, args):
--> 209 return func(*args)
210
211 def createBranch(config, current_class, subdataset, numericColumn, branch_index, winner_name, winner_index, root, parents, file, dataset_features, num_of_instances, metric, tree_id = 0, main_process_id = None):

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in createBranch(config, current_class, subdataset, numericColumn, branch_index, winner_name, winner_index, root, parents, file, dataset_features, num_of_instances, metric, tree_id, main_process_id)
328 parents = copy.copy(leaf_id)
329
--> 330 results = buildDecisionTree(subdataset, root, file, config, dataset_features
331 , root-1, leaf_id, parents, tree_id = tree_id, main_process_id = main_process_id)
332

~\Anaconda3\lib\site-packages\chefboost\training\Training.py in buildDecisionTree(df, root, file, config, dataset_features, parent_level, leaf_id, parents, tree_id, validation_df, main_process_id)
519
520 for f in funclist:
--> 521 branch_results = f.get(timeout = 100000)
522
523 for branch_result in branch_results:

~\Anaconda3\lib\multiprocessing\pool.py in get(self, timeout)
769 return self._value
770 else:
--> 771 raise self._value
772
773 def _set(self, i, obj):

UnboundLocalError: local variable 'subdataset' referenced before assignment
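
For readers unfamiliar with this error class, the generic sketch below shows how Python raises `UnboundLocalError` when a local variable is assigned only inside conditional branches and none of them runs. It is not chefboost's code: the function `pick_subset` and its `mode` argument are hypothetical and only illustrate the pattern the traceback points to.

```python
def pick_subset(rows, mode):
    # 'subset' is bound only inside the branches below.
    if mode == "equal":
        subset = [r for r in rows if r == 0]
    elif mode == "greater":
        subset = [r for r in rows if r > 0]
    # There is no else branch, so an unexpected mode leaves 'subset' unbound.
    return subset


try:
    pick_subset([0, 1, 2], "other")
except UnboundLocalError as exc:
    caught = type(exc).__name__
    print(caught, "-", exc)
```

A final `else` branch that either assigns a default or raises a descriptive error is the usual way to avoid this.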
