python - Combining multiple parameters for creating an SVM vector
I'm new to scikit-learn and am working with data like the following:
    data[0] = {"string": "some arbitrary text", "label1": "orange", "value1": false}
    data[1] = {"string": "some other arbitrary text", "label1": "red", "value1": true}
For single lines of text there is CountVectorizer, and a DictVectorizer can go in a pipeline before a TfidfTransformer. I'm hoping the output of these can be concatenated, with the following caveat: I don't want the arbitrary text to be equal in importance to the specific, limited and well-defined parameters.
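Roughly what I have in mind, as a sketch (the data is just my two example rows, and I'm not sure a plain hstack is the right way to combine the outputs):

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from scipy.sparse import hstack

    data = [
        {"string": "some arbitrary text", "label1": "orange", "value1": False},
        {"string": "some other arbitrary text", "label1": "red", "value1": True},
    ]

    # vectorize the free text, then tf-idf weight it
    counts = CountVectorizer().fit_transform([d["string"] for d in data])
    text_features = TfidfTransformer().fit_transform(counts)

    # vectorize the remaining, well-defined parameters
    side_features = DictVectorizer().fit_transform(
        [{k: v for k, v in d.items() if k != "string"} for d in data])

    # concatenate the two blocks column-wise into one feature matrix
    X = hstack([text_features, side_features])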
Finally, some other questions, possibly related:
- Does this data structure suggest which SVM kernel would be best?
- Or would a random forest/decision tree, DBN, or Bayes classifier possibly be better in this case? Or an ensemble method? (The output is multi-class.)
- I see there is an upcoming feature, FeatureUnion, that can run different methods on the same data and combine them.
- Should I be using feature selection?
See also:
All classifiers in scikit-learn(*) expect a flat feature representation for the samples, so you'll want to turn the string into a feature vector. First, let's get some incorrect assumptions out of the way:
- DictVectorizer is not for handling "lines of text", but arbitrary symbolic features.
- CountVectorizer does not handle single lines, but entire text documents (see the short example after this list).
- Whether features are "equal in importance" is up to the learning algorithm, though with a kernelized SVM you can assign artificially small weights to features to make the dot products come out differently. I'm not saying that's a good idea, though.
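A minimal sketch of the intended inputs for each vectorizer (the example samples are made up):

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.feature_extraction.text import CountVectorizer

    # DictVectorizer: arbitrary symbolic/numeric features, one dict per sample
    X_dict = DictVectorizer().fit_transform([
        {"label1": "orange", "value1": False},
        {"label1": "red", "value1": True},
    ])

    # CountVectorizer: whole text documents, one string per sample
    X_text = CountVectorizer().fit_transform([
        "some arbitrary text",
        "some other arbitrary text",
    ])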
There are two ways of handling this kind of data:
- Build a FeatureUnion consisting of a CountVectorizer (or TfidfVectorizer) for the textual data and a DictVectorizer for the additional features (see the sketch after this list).
- Manually split the textual data into words and use each word as a feature in the DictVectorizer, e.g.

    {"string:some": true, "string:arbitrary": true, "string:text": true, "label1": "orange", "value1": false}
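A rough sketch of the first option, using the FeatureUnion API mentioned in the question. The ItemSelector helper is hypothetical, not part of scikit-learn, and the transformer weights are just one way to down-weight the free text relative to the well-defined parameters:

    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_extraction import DictVectorizer

    class ItemSelector(BaseEstimator, TransformerMixin):
        """Hypothetical helper: pull one part out of each sample dict."""
        def __init__(self, key, drop=False):
            self.key = key
            self.drop = drop

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            if self.drop:
                # keep everything except `key` as a dict of side features
                return [{k: v for k, v in d.items() if k != self.key} for d in X]
            # keep only the text stored under `key`
            return [d[self.key] for d in X]

    union = FeatureUnion(
        transformer_list=[
            ("text", Pipeline([
                ("select", ItemSelector("string")),
                ("tfidf", TfidfVectorizer()),
            ])),
            ("side", Pipeline([
                ("select", ItemSelector("string", drop=True)),
                ("dict", DictVectorizer()),
            ])),
        ],
        # down-weight the free text relative to the well-defined parameters
        transformer_weights={"text": 0.5, "side": 1.0},
    )

    data = [
        {"string": "some arbitrary text", "label1": "orange", "value1": False},
        {"string": "some other arbitrary text", "label1": "red", "value1": True},
    ]
    X = union.fit_transform(data)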
Then the related questions:

- Does this data structure suggest which SVM kernel would be best?
Since you're handling textual data, try LinearSVC first, and a polynomial kernel of degree 2 if that doesn't work. RBF kernels are a bad match for textual data, and cubic or higher-order poly kernels tend to overfit badly. As an alternative to kernels, you can manually construct products of individual features and train a LinearSVC on that; sometimes that works better than a kernel. It also gets rid of the feature-importances issue, since LinearSVC learns per-feature weights.
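A sketch of the manual feature-products idea on a toy dense matrix (the data and feature pairs here are arbitrary; with sparse text features you would restrict which products you build):

    import numpy as np
    from itertools import combinations
    from sklearn.svm import LinearSVC

    # toy dense feature matrix: 4 samples, 3 original features, 2 classes
    X = np.array([[1., 0., 2.],
                  [0., 1., 1.],
                  [2., 1., 0.],
                  [1., 1., 1.]])
    y = np.array([0, 1, 0, 1])

    # append products of each pair of features (a poor man's degree-2 kernel)
    products = np.column_stack([X[:, i] * X[:, j]
                                for i, j in combinations(range(X.shape[1]), 2)])
    X_aug = np.hstack([X, products])

    clf = LinearSVC().fit(X_aug, y)
    print(clf.coef_)  # explicit per-feature weights, product features included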
- Or would a random forest/decision tree, DBN, or Bayes classifier possibly be better in this case?
That's impossible to tell without trying. scikit-learn's random forests and decision trees unfortunately don't handle sparse matrices, so they're rather hard to apply here. DBNs are not implemented.
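If you do want to try a random forest anyway, one (memory-hungry) workaround is densifying the matrix first; a toy sketch:

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.ensemble import RandomForestClassifier

    # stand-in for the sparse output of the vectorizers above
    X_sparse = csr_matrix(np.array([[1., 0., 2.],
                                    [0., 1., 1.],
                                    [2., 1., 0.],
                                    [1., 1., 1.]]))
    y = np.array([0, 1, 0, 1])

    # densify before fitting; only feasible for modestly sized feature matrices
    rf = RandomForestClassifier(n_estimators=10).fit(X_sparse.toarray(), y)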
- Should I be using feature selection?
Impossible to tell without seeing the data.
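If you do experiment with it, univariate selection is a common starting point for count-like text features; a minimal sketch:

    import numpy as np
    from sklearn.feature_selection import SelectKBest, chi2

    # toy non-negative feature matrix (e.g. term counts) and labels
    X = np.array([[1, 0, 2, 0],
                  [0, 1, 1, 3],
                  [2, 1, 0, 0],
                  [1, 1, 1, 2]])
    y = np.array([0, 1, 0, 1])

    # keep only the 2 features most associated with the class labels
    X_reduced = SelectKBest(chi2, k=2).fit_transform(X, y)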
(*) Except SVMs if you implement custom kernels, but that's such an advanced topic that I won't discuss it now.