python - Combining multiple parameters for creating an SVM vector
I'm new to scikit-learn and am working with data like the following:
    data[0] = {"string": "some arbitrary text", "label1": "orange", "value1": false}
    data[1] = {"string": "some other arbitrary text", "label1": "red", "value1": true}
For single lines of text there is CountVectorizer, and a DictVectorizer can go in a pipeline before a TfidfTransformer. I'm hoping the output of these can be concatenated, with the following caveat: I don't want the arbitrary text to be equal in importance to the specific, limited and well-defined parameters.
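Roughly what I have in mind, as a sketch (the data is just my two example rows, and I'm not sure a plain hstack is the right way to combine the outputs):

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from scipy.sparse import hstack

    data = [
        {"string": "some arbitrary text", "label1": "orange", "value1": False},
        {"string": "some other arbitrary text", "label1": "red", "value1": True},
    ]

    # vectorize the free text, then tf-idf weight it
    counts = CountVectorizer().fit_transform([d["string"] for d in data])
    text_features = TfidfTransformer().fit_transform(counts)

    # vectorize the remaining, well-defined parameters
    side_features = DictVectorizer().fit_transform(
        [{k: v for k, v in d.items() if k != "string"} for d in data])

    # concatenate the two blocks column-wise into one feature matrix
    X = hstack([text_features, side_features])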
Finally, some other questions, possibly related:
- Does this data structure suggest which SVM kernel would be best?
- Or would a random forest/decision tree, DBN, or Bayes classifier possibly be better in this case? Or an ensemble method? (The output is multi-class.)
- I see there is an upcoming feature, FeatureUnion, that can run different methods on the same data and combine them.
- Should I be using feature selection?
See also:
All classifiers in scikit-learn(*) expect a flat feature representation for the samples, so you'll want to turn the string into a feature vector. First, let's get some incorrect assumptions out of the way:
- DictVectorizer is not for handling "lines of text", but arbitrary symbolic features.
- CountVectorizer does not handle single lines, but entire text documents (see the short example after this list).
- Whether features are "equal in importance" is up to the learning algorithm, though with a kernelized SVM you can assign artificially small weights to features to make the dot products come out differently. I'm not saying that's a good idea, though.
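A minimal sketch of the intended inputs for each vectorizer (the example samples are made up):

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.feature_extraction.text import CountVectorizer

    # DictVectorizer: arbitrary symbolic/numeric features, one dict per sample
    X_dict = DictVectorizer().fit_transform([
        {"label1": "orange", "value1": False},
        {"label1": "red", "value1": True},
    ])

    # CountVectorizer: whole text documents, one string per sample
    X_text = CountVectorizer().fit_transform([
        "some arbitrary text",
        "some other arbitrary text",
    ])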
There are two ways of handling this kind of data:
- Build a FeatureUnion consisting of a CountVectorizer (or TfidfVectorizer) for the textual data and a DictVectorizer for the additional features (see the sketch after this list).
- Manually split the textual data into words and use each word as a feature in the DictVectorizer, e.g.

    {"string:some": true, "string:arbitrary": true, "string:text": true, "label1": "orange", "value1": false}
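A rough sketch of the first option, using the FeatureUnion API mentioned in the question. The ItemSelector helper is hypothetical, not part of scikit-learn, and the transformer weights are just one way to down-weight the free text relative to the well-defined parameters:

    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_extraction import DictVectorizer

    class ItemSelector(BaseEstimator, TransformerMixin):
        """Hypothetical helper: pull one part out of each sample dict."""
        def __init__(self, key, drop=False):
            self.key = key
            self.drop = drop

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            if self.drop:
                # keep everything except `key` as a dict of side features
                return [{k: v for k, v in d.items() if k != self.key} for d in X]
            # keep only the text stored under `key`
            return [d[self.key] for d in X]

    union = FeatureUnion(
        transformer_list=[
            ("text", Pipeline([
                ("select", ItemSelector("string")),
                ("tfidf", TfidfVectorizer()),
            ])),
            ("side", Pipeline([
                ("select", ItemSelector("string", drop=True)),
                ("dict", DictVectorizer()),
            ])),
        ],
        # down-weight the free text relative to the well-defined parameters
        transformer_weights={"text": 0.5, "side": 1.0},
    )

    data = [
        {"string": "some arbitrary text", "label1": "orange", "value1": False},
        {"string": "some other arbitrary text", "label1": "red", "value1": True},
    ]
    X = union.fit_transform(data)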
Then the related questions:

- Does this data structure suggest which SVM kernel would be best?
Since you're handling textual data, try LinearSVC first, and a polynomial kernel of degree 2 if that doesn't work. RBF kernels are a bad match for textual data, and cubic or higher-order poly kernels tend to overfit badly. As an alternative to kernels, you can manually construct products of individual features and train a LinearSVC on that; sometimes that works better than a kernel. It also gets rid of the feature-importances issue, since LinearSVC learns per-feature weights.
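A sketch of the manual feature-products idea on a toy dense matrix (the data and feature pairs here are arbitrary; with sparse text features you would restrict which products you build):

    import numpy as np
    from itertools import combinations
    from sklearn.svm import LinearSVC

    # toy dense feature matrix: 4 samples, 3 original features, 2 classes
    X = np.array([[1., 0., 2.],
                  [0., 1., 1.],
                  [2., 1., 0.],
                  [1., 1., 1.]])
    y = np.array([0, 1, 0, 1])

    # append products of each pair of features (a poor man's degree-2 kernel)
    products = np.column_stack([X[:, i] * X[:, j]
                                for i, j in combinations(range(X.shape[1]), 2)])
    X_aug = np.hstack([X, products])

    clf = LinearSVC().fit(X_aug, y)
    print(clf.coef_)  # explicit per-feature weights, product features included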
- Or would a random forest/decision tree, DBN, or Bayes classifier possibly be better in this case?
That's impossible to tell without trying. scikit-learn's random forests and decision trees unfortunately don't handle sparse matrices, so they're rather hard to apply here. DBNs are not implemented.
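If you do want to try a random forest anyway, one (memory-hungry) workaround is densifying the matrix first; a toy sketch:

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.ensemble import RandomForestClassifier

    # stand-in for the sparse output of the vectorizers above
    X_sparse = csr_matrix(np.array([[1., 0., 2.],
                                    [0., 1., 1.],
                                    [2., 1., 0.],
                                    [1., 1., 1.]]))
    y = np.array([0, 1, 0, 1])

    # densify before fitting; only feasible for modestly sized feature matrices
    rf = RandomForestClassifier(n_estimators=10).fit(X_sparse.toarray(), y)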
- Should I be using feature selection?
Impossible to tell without seeing the data.
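If you do experiment with it, univariate selection is a common starting point for count-like text features; a minimal sketch:

    import numpy as np
    from sklearn.feature_selection import SelectKBest, chi2

    # toy non-negative feature matrix (e.g. term counts) and labels
    X = np.array([[1, 0, 2, 0],
                  [0, 1, 1, 3],
                  [2, 1, 0, 0],
                  [1, 1, 1, 2]])
    y = np.array([0, 1, 0, 1])

    # keep only the 2 features most associated with the class labels
    X_reduced = SelectKBest(chi2, k=2).fit_transform(X, y)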
(*) Except SVMs if you implement custom kernels, but that's such an advanced topic that I won't discuss it now.